dataset

Core search engine — the only module that imports tantivy.

Responsibilities:

- Build a tantivy schema from field definitions
- Register custom tokenizers (ngram)
- Write documents into the index
- Query the index with automatic field_boosts
- Fuzzy query for typo tolerance
- Multi-field sorting with over-fetch

class sayt2.dataset.SortKey(*, name: str, descending: bool = True)

One element of a multi-field sort specification.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class sayt2.dataset.Hit(source: dict[str, Any], score: float)

A single search hit with source document and relevance score.

class sayt2.dataset.SearchResult(hits: list[Hit], size: int, took_ms: int, fresh: bool, cache: bool)

Immutable search result returned by DataSet.search().

sayt2.dataset.build_schema(fields: list[BaseField]) → tuple[Schema, dict[str, TextAnalyzer]]

Convert a list of field definitions into a tantivy Schema and a dict of custom tokenizers that must be registered on the Index.

Returns (schema, analyzers) where analyzers maps tokenizer name → TextAnalyzer.

sayt2.dataset.open_index(dir_index: Path, fields: list[BaseField]) → Index

Open (or create) a tantivy Index at dir_index and register all required custom tokenizers.

Tantivy does not persist tokenizer configuration — only the inverted index data. So every Index.open() / Index(schema, path=...) must be followed by register_tokenizer calls.

sayt2.dataset.write_documents(index: Index, data: Iterable[dict[str, Any]], memory_budget_bytes: int = 128000000, num_threads: int | None = None) → int

Write data into index.

Parameters:

data – Iterable of dicts, each dict is one document whose keys match the field names in the schema.
memory_budget_bytes – Heap budget for the index writer.
num_threads – Number of indexing threads (None = tantivy default).

Returns:

Number of documents written.

sayt2.dataset.search_index(index: Index, fields: list[BaseField], query_str: str, limit: int = 20) → list[Hit]

Parse query_str against the searchable fields in fields, execute the search, and return up to limit hits as a list of dicts.

Each hit dict contains every stored field from the document plus _score (BM25 relevance score).

Field boosts declared on field definitions are applied automatically.

sayt2.dataset.fuzzy_search_index(index: Index, fields: list[BaseField], query_str: str, limit: int = 20, distance: int = 1, transposition_cost_one: bool = True, prefix: bool = False) → list[Hit]

Fuzzy search using Query.fuzzy_term_query on each TextField.

Fuzzy matching works on word-level tokenized fields only (TextField), not on NgramField or KeywordField. One fuzzy query per TextField is built and combined with Occur.Should (boolean OR).

Parameters:

distance – Maximum Levenshtein edit distance (1 or 2).
transposition_cost_one – Count adjacent-char swaps as 1 edit.
prefix – Enable prefix Levenshtein mode.
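The three parameters can be illustrated without tantivy. The sketch below is a plain-Python stand-in, not the automaton tantivy builds internally: `edit_distance` and `fuzzy_match` are hypothetical helper names, and the prefix behaviour (accepting a token when any of its prefixes is within the distance) is an assumption about what "prefix Levenshtein mode" means.

```python
def edit_distance(a: str, b: str, transposition_cost_one: bool = True) -> int:
    """Edit distance between a and b (optimal string alignment variant).

    With transposition_cost_one=True, swapping two adjacent characters
    counts as one edit; otherwise it costs two substitutions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (
                transposition_cost_one
                and i > 1
                and j > 1
                and a[i - 1] == b[j - 2]
                and a[i - 2] == b[j - 1]
            ):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


def fuzzy_match(
    term: str,
    token: str,
    distance: int = 1,
    transposition_cost_one: bool = True,
    prefix: bool = False,
) -> bool:
    """Return True if token is accepted for term under the given settings."""
    if prefix:
        # Assumed prefix semantics: some prefix of the token is close enough.
        return any(
            edit_distance(term, token[:k], transposition_cost_one) <= distance
            for k in range(len(token) + 1)
        )
    return edit_distance(term, token, transposition_cost_one) <= distance
```

Under these assumptions, "teh" matches "the" at distance 1 only when transpositions cost one edit, and "pyth" matches "python" only in prefix mode.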

sayt2.dataset.search_index_sorted(index: Index, fields: list[BaseField], query_str: str, sort_keys: list[SortKey], limit: int = 20, over_fetch_factor: int = 10) → list[Hit]

Search then sort by multiple fields.

tantivy-py only supports order_by_field on a single field, so multi-field sorting is done in Python after over-fetching.

Parameters:

sort_keys – List of SortKey specifying sort order.
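over_fetch_factor – Fetch limit * over_fetch_factor candidates before sorting. A larger candidate pool makes it more likely that the final top-limit hits are the true top-limit under the sort order; the result is exact only when the pool covers every matching document.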
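The Python-side sort step can be sketched as follows. This is a minimal illustration rather than sayt2's implementation: hits are plain dicts, `sort_hits` is a hypothetical helper, and each sort key is a `(name, descending)` tuple standing in for a SortKey.

```python
from typing import Any


def sort_hits(
    candidates: list[dict[str, Any]],
    sort_keys: list[tuple[str, bool]],
    limit: int,
) -> list[dict[str, Any]]:
    """Sort over-fetched candidates by several fields, then keep the top limit."""
    ordered = list(candidates)
    # list.sort is stable (also with reverse=True), so sorting by the least
    # significant key first and the most significant key last produces a
    # correct multi-key ordering with a per-key direction.
    for name, descending in reversed(sort_keys):
        ordered.sort(key=lambda hit: hit[name], reverse=descending)
    return ordered[:limit]


# Candidates as they might come back from an over-fetched search
# (limit * over_fetch_factor documents, in relevance order).
candidates = [
    {"title": "a", "year": 2020, "rating": 4.5},
    {"title": "b", "year": 2021, "rating": 4.0},
    {"title": "c", "year": 2021, "rating": 4.8},
    {"title": "d", "year": 2019, "rating": 5.0},
]

# Newest year first; within a year, lowest rating first.
top = sort_hits(candidates, [("year", True), ("rating", False)], limit=3)
```

Sorting once per key keeps mixed ascending and descending directions simple, at the cost of one pass per key over the candidate pool.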

class sayt2.dataset.DataSet(*, dir_root: Path, name: str, fields: list[Annotated[StoredField | KeywordField | TextField | NgramField | NumericField | DatetimeField | BooleanField, FieldInfo(annotation=NoneType, required=True, discriminator='type')]], downloader: Callable[[], Iterable[dict[str, Any]]] | None = None, cache_expire: int | None = None, sort: list[SortKey] | None = None, memory_budget_bytes: int = 128000000, num_threads: int | None = None, lock_expire: int = 60)

High-level search dataset that integrates index building, caching, cross-process locking, and query execution into a single object.

Parameters:

dir_root – Root directory; index, cache, and tracker DB are stored inside sub-directories of this path.
name – Logical name (e.g. "books"). Used as the tracker lock key and cache namespace.
fields – Field definitions that determine the tantivy schema.
downloader – Optional callable that returns an iterable of document dicts. Called when the data is stale or on first search.
cache_expire – Seconds before L1 cache expires (None = never).
sort – Optional multi-field sort specification.
memory_budget_bytes – Heap budget for the tantivy index writer.
num_threads – Number of indexing threads (None = tantivy default).
lock_expire – Seconds before the tracker lock expires.

close() → None

Close the underlying cache (sqlite3 connection).

Safe to call multiple times. After closing, the next search() or build_index() call will lazily re-open the cache.

build_index(data: Iterable[dict[str, Any]] | None = None) → int

Build (or rebuild) the index with tracker lock protection.

If data is None, the downloader is called to supply the documents. Raises ValueError if both data and the downloader are None.

Returns:

Number of documents indexed.

search(query: str, limit: int = 20, refresh: bool = False) → SearchResult

Full search flow:

1. Check L1 freshness (or refresh=True forces rebuild).
2. If stale, call build_index() with downloader.
3. Check L2 query cache.
4. On cache miss, execute the query, apply sorting, cache the result.

Parameters:

query – Query string.
limit – Maximum number of hits.
refresh – Force a data refresh even if the cache is fresh.
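The search flow above can be sketched with an in-memory stand-in. Everything here is hypothetical: `TinyDataSet` is not the real class, the index is a plain list, matching is a substring test, and the result is a dict rather than a SearchResult. The meaning given to `fresh` (data was rebuilt during this call) and `cache` (result came from the query cache), and the choice to clear the query cache on rebuild, are assumptions.

```python
import time
from typing import Any, Callable, Iterable, Optional


class TinyDataSet:
    """In-memory illustration of the freshness check and query cache."""

    def __init__(
        self,
        downloader: Optional[Callable[[], Iterable[dict[str, Any]]]] = None,
        cache_expire: Optional[int] = None,
    ) -> None:
        self.downloader = downloader
        self.cache_expire = cache_expire
        self.docs: list[dict[str, Any]] = []
        self.built_at: Optional[float] = None  # L1 freshness marker
        self.query_cache: dict[tuple[str, int], list[dict[str, Any]]] = {}  # L2

    def _is_fresh(self) -> bool:
        if self.built_at is None:
            return False  # never built
        if self.cache_expire is None:
            return True  # None = never expires
        return time.monotonic() - self.built_at < self.cache_expire

    def build_index(self, data: Optional[Iterable[dict[str, Any]]] = None) -> int:
        if data is None:
            if self.downloader is None:
                raise ValueError("no data given and no downloader configured")
            data = self.downloader()
        self.docs = list(data)
        self.built_at = time.monotonic()
        self.query_cache.clear()  # assumed: cached results refer to old data
        return len(self.docs)

    def search(self, query: str, limit: int = 20, refresh: bool = False) -> dict[str, Any]:
        rebuilt = False
        if refresh or not self._is_fresh():  # steps 1 and 2
            self.build_index()
            rebuilt = True
        key = (query, limit)
        if key in self.query_cache:  # step 3
            return {"hits": self.query_cache[key], "fresh": rebuilt, "cache": True}
        needle = query.lower()  # step 4: stand-in for the real query
        hits = [
            doc for doc in self.docs
            if any(needle in str(value).lower() for value in doc.values())
        ][:limit]
        self.query_cache[key] = hits
        return {"hits": hits, "fresh": rebuilt, "cache": False}
```

In this sketch the first search triggers the downloader, a repeated search with the same query and limit is served from the query cache, and refresh=True forces another download.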

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

model_post_init(context: Any, /) → None

This function is meant to behave like a BaseModel method to initialize private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:

self – The BaseModel instance.
context – The context.