Architecture and Design

What sayt2 Does

sayt2 is a search-as-you-type library for Python. It lets you build a full-text search index from structured data (a list of dicts) and query it with substring matching, fuzzy tolerance, and multi-field sorting — all from a single DataSet object.

Typical use cases:

  • Interactive search boxes (autocomplete) in CLI tools, desktop apps, or web back-ends.

  • Filtering large reference tables (URLs, documentation pages, product catalogues) with instant feedback as the user types.

Design Goals

  1. Single-file deployment — the index, cache, and lock state all live under one dir_root. Copy or delete the directory and you’re done.

  2. Zero-config search — define your fields, feed your data, call search(). Tokenisers, caching, and locking are handled automatically.

  3. Tantivy isolationimport tantivy appears in exactly one module (sayt2.dataset). Every other module is engine-agnostic, so a future back-end swap touches one file.

  4. Pydantic configuration — field definitions are pydantic models. Validation, serialisation, and IDE autocompletion come for free.

Module Dependency Layers

The library is organised into strict layers. Each layer may only import from the layers below it.

Layer 4  api            ─  Public re-export surface
           │
Layer 3  dataset        ─  Core search engine (only tantivy consumer)
           │
Layer 2  cache          ─  Two-layer disk cache (diskcache)
           │
Layer 1  fields         ─  Field type definitions (pydantic)
         tracker        ─  Cross-process SQLite lock
           │
Layer 0  constants      ─  Enum constants
         exc            ─  Custom exceptions

Arrows point downward: dataset depends on cache, fields, and tracker; fields depends on constants; tracker depends on exc. There are no upward or circular imports.

Key Design Decisions

Tantivy as the search back-end

tantivy-py provides Rust-speed indexing and querying through Python bindings. Index builds that previously took seconds now complete in milliseconds, which in turn lets us lower the default lock expiry from 300 s to 60 s.

SQLite atomic locking

Cross-process coordination uses a single SQL UPSERT statement — see Tracker. Because the lock check and the lock acquisition happen in one atomic SQL statement, there is no race window between “is it free?” and “I’ll take it”.

Two-layer caching

DataSetCache manages two concerns independently:

  • L1 — data freshness: has the index been rebuilt recently enough?

  • L2 — query results: has this exact (query, limit) been answered before?

Both layers are keyed with a schema hash (fields_schema_hash()), so changing the field definitions automatically invalidates stale caches.

Discriminated-union fields

The seven field types share a common BaseField base and are assembled into a single discriminated union (T_Field) via pydantic’s Field(discriminator="type"). This allows polymorphic deserialisation from plain dicts — useful when field definitions are stored in config files.