Architecture and Design
What sayt2 Does
sayt2 is a search-as-you-type library for Python. It lets you build a
full-text search index from structured data (a list of dicts) and query it
with substring matching, fuzzy tolerance, and multi-field sorting — all from a
single DataSet object.
Typical use cases:
Interactive search boxes (autocomplete) in CLI tools, desktop apps, or web back-ends.
Filtering large reference tables (URLs, documentation pages, product catalogues) with instant feedback as the user types.
Design Goals
Single-file deployment: the index, cache, and lock state all live under one dir_root. Copy or delete the directory and you’re done.
Zero-config search: define your fields, feed your data, call search(). Tokenisers, caching, and locking are handled automatically.
Tantivy isolation: import tantivy appears in exactly one module (sayt2.dataset). Every other module is engine-agnostic, so a future back-end swap touches one file.
Pydantic configuration: field definitions are pydantic models. Validation, serialisation, and IDE autocompletion come for free.
Module Dependency Layers
The library is organised into strict layers. Each layer may only import from the layers below it.
Layer 4   api        ─ Public re-export surface
             │
Layer 3   dataset    ─ Core search engine (only tantivy consumer)
             │
Layer 2   cache      ─ Two-layer disk cache (diskcache)
             │
Layer 1   fields     ─ Field type definitions (pydantic)
          tracker    ─ Cross-process SQLite lock
             │
Layer 0   constants  ─ Enum constants
          exc        ─ Custom exceptions
Arrows point downward: dataset depends on cache, fields, and
tracker; fields depends on constants; tracker depends on
exc. There are no upward or circular imports.
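The layering rule can be checked mechanically. The sketch below is not part of sayt2; it encodes the layer numbers from the diagram and the imports stated in the text, then reports any import that does not point to a strictly lower layer. The `api` to `dataset` edge is assumed from the diagram, and the imports of `cache` are not stated in this section, so they are left empty here.

```python
# Layer numbers taken from the diagram above.
LAYER = {
    "api": 4,
    "dataset": 3,
    "cache": 2,
    "fields": 1,
    "tracker": 1,
    "constants": 0,
    "exc": 0,
}

# Imports as described in the text. "api" -> "dataset" is an assumption
# based on the diagram; the imports of "cache" are not stated, so the
# set is left empty rather than guessed.
IMPORTS = {
    "api": {"dataset"},
    "dataset": {"cache", "fields", "tracker"},
    "cache": set(),
    "fields": {"constants"},
    "tracker": {"exc"},
    "constants": set(),
    "exc": set(),
}


def upward_imports(layer, imports):
    """Return (importer, imported) pairs that break the layering rule.

    A pair is a violation when the imported module is not in a strictly
    lower layer than the importer.
    """
    return sorted(
        (src, dst)
        for src, deps in imports.items()
        for dst in deps
        if layer[dst] >= layer[src]
    )


print(upward_imports(LAYER, IMPORTS))
```

A check like this could run in a test suite so that an accidental upward import (for example `fields` importing `dataset`) fails fast.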
Key Design Decisions
Tantivy as the search back-end
tantivy-py provides Rust-speed indexing and querying through Python bindings. Index builds that previously took seconds now complete in milliseconds, which in turn lets us lower the default lock expiry from 300 s to 60 s.
SQLite atomic locking
Cross-process coordination uses a single SQL UPSERT statement — see
Tracker. Because the lock check and the lock
acquisition happen in one atomic SQL statement, there is no race window between
“is it free?” and “I’ll take it”.
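The idea can be illustrated with the standard-library `sqlite3` module. This is a minimal sketch and not the actual Tracker implementation: the table name, column names, and the `try_acquire` helper are all hypothetical. It relies on SQLite's `INSERT ... ON CONFLICT DO UPDATE ... WHERE` form (SQLite 3.24 or newer), where the `WHERE` clause makes the update a no-op while an unexpired lock is held by someone else.

```python
import sqlite3
import time

# Hypothetical schema; the real Tracker table layout may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS lock (
    name      TEXT PRIMARY KEY,
    owner     TEXT NOT NULL,
    expire_at REAL NOT NULL
)
"""

# One statement both checks and takes the lock:
# - no row yet          -> INSERT succeeds, caller owns the lock
# - row exists, expired -> DO UPDATE overwrites owner and expiry
# - row exists, live    -> WHERE is false, the statement changes nothing
UPSERT = """
INSERT INTO lock (name, owner, expire_at)
VALUES (:name, :owner, :expire_at)
ON CONFLICT(name) DO UPDATE SET
    owner = excluded.owner,
    expire_at = excluded.expire_at
WHERE lock.expire_at <= :now
"""


def try_acquire(conn, name, owner, expire=60.0, now=None):
    """Try to take the named lock; return True if `owner` holds it afterwards."""
    now = time.time() if now is None else now
    with conn:  # commits on success, rolls back on error
        conn.execute(
            UPSERT,
            {"name": name, "owner": owner, "expire_at": now + expire, "now": now},
        )
    # Read back who holds the lock. Acquisition itself already happened
    # atomically in the UPSERT; this only reports the outcome.
    row = conn.execute(
        "SELECT owner FROM lock WHERE name = ?", (name,)
    ).fetchone()
    return row is not None and row[0] == owner
```

In this sketch `owner` should be unique per process (for example a UUID), and the `expire` default of 60 mirrors the 60 s lock expiry mentioned above. The `now` parameter exists only to make the behaviour easy to exercise deterministically.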
Two-layer caching
DataSetCache manages two concerns independently:
L1 (data freshness): has the index been rebuilt recently enough?
L2 (query results): has this exact (query, limit) been answered before?
Both layers are keyed with a schema hash
(fields_schema_hash()), so changing the field definitions
automatically invalidates stale caches.
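To make the keying scheme concrete, here is a simplified sketch that uses a plain dict in place of diskcache. The `schema_hash` function and the `TwoLayerCache` class are hypothetical stand-ins, not the real `fields_schema_hash()` or `DataSetCache`; the only point is to show how embedding a schema hash in every key means old entries are simply never looked up after the field definitions change.

```python
import hashlib
import json


def schema_hash(fields):
    """Hash a list of field-definition dicts into a short, stable string."""
    payload = json.dumps(fields, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


class TwoLayerCache:
    """Illustrative two-layer cache backed by a dict (assumption: the real
    implementation stores entries on disk via diskcache)."""

    def __init__(self, fields, store=None):
        self.schema = schema_hash(fields)
        self.store = {} if store is None else store

    # L1: data freshness -------------------------------------------------
    def mark_fresh(self, now, expire):
        self.store[("fresh", self.schema)] = now + expire

    def is_fresh(self, now):
        expires_at = self.store.get(("fresh", self.schema))
        return expires_at is not None and now < expires_at

    # L2: query results --------------------------------------------------
    def set_result(self, query, limit, result):
        self.store[("query", self.schema, query, limit)] = result

    def get_result(self, query, limit):
        return self.store.get(("query", self.schema, query, limit))


# Two schemas sharing one backing store: entries written under the old
# schema are invisible under the new one.
store = {}
old = TwoLayerCache([{"name": "title", "type": "text"}], store)
old.mark_fresh(now=0.0, expire=60.0)
old.set_result("pyth", 10, ["Python docs"])

new = TwoLayerCache([{"name": "title", "type": "ngram"}], store)
print(old.get_result("pyth", 10), new.get_result("pyth", 10), new.is_fresh(1.0))
```

No explicit invalidation step is needed in this scheme; stale entries become unreachable and can be left to the cache's normal eviction.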
Discriminated-union fields
The seven field types share a common BaseField base and
are assembled into a single discriminated union (T_Field)
via pydantic’s Field(discriminator="type"). This allows polymorphic
deserialisation from plain dicts — useful when field definitions are stored in
config files.
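The pattern looks roughly like the following sketch, written against pydantic v2. Only `BaseField`, `T_Field`, and the `type` discriminator come from the text above; the two concrete field classes and their attributes are illustrative stand-ins, not the library's actual seven types.

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter


class BaseField(BaseModel):
    """Common base; the real BaseField may carry more attributes."""

    name: str


# Hypothetical field types, used here only to demonstrate the pattern.
class TextField(BaseField):
    type: Literal["text"] = "text"


class NgramField(BaseField):
    type: Literal["ngram"] = "ngram"
    min_gram: int = 2
    max_gram: int = 6


# The "type" value in each dict selects which model validates it.
T_Field = Annotated[
    Union[TextField, NgramField],
    Field(discriminator="type"),
]

# Plain dicts, as they might appear after loading a JSON or YAML config.
raw = [
    {"type": "text", "name": "title"},
    {"type": "ngram", "name": "url", "min_gram": 3},
]

fields = TypeAdapter(list[T_Field]).validate_python(raw)
print([type(f).__name__ for f in fields])
```

Because the discriminator picks the model up front, a dict with an unknown `type` fails with a single clear validation error rather than one error per union member.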