dataset
Core search engine — the only module that imports tantivy.
Responsibilities:
- Build a tantivy schema from field definitions
- Register custom tokenizers (ngram)
- Write documents into the index
- Query the index with automatic field_boosts
- Fuzzy query for typo tolerance
- Multi-field sorting with over-fetch
- class sayt2.dataset.SortKey(*, name: str, descending: bool = True)
One element of a multi-field sort specification.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- class sayt2.dataset.Hit(source: dict[str, Any], score: float)
A single search hit with source document and relevance score.
- class sayt2.dataset.SearchResult(hits: list[Hit], size: int, took_ms: int, fresh: bool, cache: bool)
Immutable search result returned by DataSet.search().
- sayt2.dataset.build_schema(fields: list[BaseField]) -> tuple[Schema, dict[str, TextAnalyzer]]
Convert a list of field definitions into a tantivy Schema and a dict of custom tokenizers that must be registered on the Index.
Returns (schema, analyzers), where analyzers maps tokenizer name → TextAnalyzer.
- sayt2.dataset.open_index(dir_index: Path, fields: list[BaseField]) -> Index
Open (or create) a tantivy Index at dir_index and register all required custom tokenizers.
Tantivy does not persist tokenizer configuration, only the inverted index data, so every Index.open() / Index(schema, path=...) call must be followed by register_tokenizer calls.
- sayt2.dataset.write_documents(index: Index, data: Iterable[dict[str, Any]], memory_budget_bytes: int = 128000000, num_threads: int | None = None) -> int
Write data into index.
- Parameters:
data – Iterable of dicts, each dict is one document whose keys match the field names in the schema.
memory_budget_bytes – Heap budget for the index writer.
num_threads – Number of indexing threads (None = tantivy default).
- Returns:
Number of documents written.
- sayt2.dataset.search_index(index: Index, fields: list[BaseField], query_str: str, limit: int = 20) -> list[Hit]
Parse query_str against the searchable fields in fields, execute the search, and return up to limit hits as a list of dicts.
Each hit dict contains every stored field from the document plus _score (BM25 relevance score).
Field boosts declared on field definitions are applied automatically.
- sayt2.dataset.fuzzy_search_index(index: Index, fields: list[BaseField], query_str: str, limit: int = 20, distance: int = 1, transposition_cost_one: bool = True, prefix: bool = False) -> list[Hit]
Fuzzy search using Query.fuzzy_term_query on each TextField.
Fuzzy matching works on word-level tokenized fields only (TextField), not on NgramField or KeywordField. One fuzzy query per TextField is built, and the queries are combined with Occur.Should (boolean OR).
- Parameters:
distance – Maximum Levenshtein edit distance (1 or 2).
transposition_cost_one – Count adjacent-char swaps as 1 edit.
prefix – Enable prefix Levenshtein mode.
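To make the distance and transposition_cost_one parameters concrete, here is a minimal pure-Python sketch of an edit distance in which an adjacent-character swap can optionally count as a single edit. The helper names edit_distance and fuzzy_matches are hypothetical and are not part of sayt2 or tantivy; tantivy evaluates fuzzy term queries internally and differently, so this only illustrates which terms fall within a given distance.

```python
def edit_distance(a: str, b: str, transposition_cost_one: bool = True) -> int:
    """Levenshtein distance; optionally count an adjacent swap as one edit.

    Hypothetical helper for illustration only (not part of sayt2 or tantivy).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = edits needed to turn a[:i] into b[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            swapped = (
                i > 1 and j > 1
                and a[i - 1] == b[j - 2]
                and a[i - 2] == b[j - 1]
            )
            if transposition_cost_one and swapped:
                # adjacent-character swap counted as a single edit
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def fuzzy_matches(term, vocabulary, distance=1, transposition_cost_one=True):
    """Return the vocabulary words within `distance` edits of `term`."""
    return [
        word for word in vocabulary
        if edit_distance(term, word, transposition_cost_one) <= distance
    ]
```

Under this sketch, a typo such as "pyhton" is one edit away from "python" when swaps cost one edit, and two edits away otherwise, which is why the swap setting matters most at distance=1.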
- sayt2.dataset.search_index_sorted(index: Index, fields: list[BaseField], query_str: str, sort_keys: list[SortKey], limit: int = 20, over_fetch_factor: int = 10) -> list[Hit]
Search then sort by multiple fields.
tantivy-py only supports order_by_field on a single field, so multi-field sorting is done in Python after over-fetching.
- Parameters:
sort_keys – List of SortKey specifying sort order.
over_fetch_factor – Fetch limit * over_fetch_factor candidates before sorting, so that the final top-limit is accurate as long as the candidate set covers all relevant matches.
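The sort step after over-fetching can be sketched in plain Python. This is an illustration under assumptions, not the sayt2 implementation: SortSpec is a hypothetical stand-in for SortKey, hits are represented as plain dicts, and the candidate list is assumed to already hold limit * over_fetch_factor results from the search.

```python
from typing import Any, NamedTuple


class SortSpec(NamedTuple):
    """Hypothetical stand-in for SortKey (name plus direction)."""
    name: str
    descending: bool = True


def sort_hits(
    candidates: list[dict[str, Any]],
    sort_keys: list[SortSpec],
    limit: int,
) -> list[dict[str, Any]]:
    """Order over-fetched candidates by several keys, then keep the top `limit`."""
    ordered = list(candidates)
    # Python's sort is stable, so sorting by the least significant key first
    # and the most significant key last yields a correct multi-key ordering,
    # even when the keys have different directions.
    for key in reversed(sort_keys):
        ordered.sort(key=lambda hit: hit[key.name], reverse=key.descending)
    return ordered[:limit]
```

Sorting once per key keeps mixed ascending and descending directions simple, at the cost of several passes over the candidate list. Since that list is bounded by limit * over_fetch_factor, the extra passes should be cheap in practice.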
- class sayt2.dataset.DataSet(*, dir_root: Path, name: str, fields: list[Annotated[StoredField | KeywordField | TextField | NgramField | NumericField | DatetimeField | BooleanField, FieldInfo(annotation=NoneType, required=True, discriminator='type')]], downloader: Callable[[], Iterable[dict[str, Any]]] | None = None, cache_expire: int | None = None, sort: list[SortKey] | None = None, memory_budget_bytes: int = 128000000, num_threads: int | None = None, lock_expire: int = 60)
High-level search dataset that integrates index building, caching, cross-process locking, and query execution into a single object.
- Parameters:
dir_root – Root directory; index, cache, and tracker DB are stored inside sub-directories of this path.
name – Logical name (e.g. "books"). Used as the tracker lock key and cache namespace.
fields – Field definitions that determine the tantivy schema.
downloader – Optional callable that returns an iterable of document dicts. Called when the data is stale or on first search.
cache_expire – Seconds before L1 cache expires (None = never).
sort – Optional multi-field sort specification.
memory_budget_bytes – Heap budget for the tantivy index writer.
num_threads – Number of indexing threads (None = tantivy default).
lock_expire – Seconds before the tracker lock expires.
- close() -> None
Close the underlying cache (sqlite3 connection).
Safe to call multiple times. After closing, the next search() or build_index() call will lazily re-open the cache.
- build_index(data: Iterable[dict[str, Any]] | None = None) -> int
Build (or rebuild) the index with tracker lock protection.
If data is None, the downloader is called. Raises ValueError if both are None.
- Returns:
Number of documents indexed.
- search(query: str, limit: int = 20, refresh: bool = False) -> SearchResult
Full search flow:
1. Check L1 freshness (or refresh=True forces rebuild).
2. If stale, call build_index() with downloader.
3. Check L2 query cache.
4. On cache miss, execute the query, apply sorting, cache the result.
- Parameters:
query – Query string.
limit – Maximum number of hits.
refresh – Force a data refresh even if the cache is fresh.
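The search flow described for search() can be sketched with an in-memory stand-in. Everything here is an assumption made for illustration: TinyDataSet, its clock parameter, the substring match on a "title" field, and the (hits, rebuilt, from_cache) return tuple are hypothetical, and the real DataSet uses a tantivy index, an on-disk cache, and a tracker lock instead. Clearing the query cache on rebuild is also a choice made in this sketch, not documented behaviour.

```python
import time
from typing import Any, Callable, Iterable, Optional


class TinyDataSet:
    """Hypothetical in-memory stand-in for the flow; not the sayt2 implementation."""

    def __init__(
        self,
        downloader: Optional[Callable[[], Iterable[dict[str, Any]]]] = None,
        cache_expire: Optional[float] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.downloader = downloader
        self.cache_expire = cache_expire
        self.clock = clock
        self.docs: list[dict[str, Any]] = []
        self.built_at: Optional[float] = None  # L1: when the data was last built
        self.query_cache: dict[tuple[str, int], list[dict[str, Any]]] = {}  # L2

    def _is_fresh(self) -> bool:
        if self.built_at is None:
            return False  # never built, so the first search triggers a build
        if self.cache_expire is None:
            return True   # None means the data never expires
        return self.clock() - self.built_at < self.cache_expire

    def build_index(self, data: Optional[Iterable[dict[str, Any]]] = None) -> int:
        if data is None:
            if self.downloader is None:
                raise ValueError("no data given and no downloader configured")
            data = self.downloader()
        self.docs = list(data)
        self.built_at = self.clock()
        self.query_cache.clear()  # cached results refer to the old data
        return len(self.docs)

    def search(self, query: str, limit: int = 20, refresh: bool = False):
        rebuilt = False
        if refresh or not self._is_fresh():       # steps 1 and 2
            self.build_index()
            rebuilt = True
        key = (query, limit)
        if key in self.query_cache:               # step 3
            return self.query_cache[key], rebuilt, True
        # Step 4: a naive substring match stands in for the real query engine.
        hits = [d for d in self.docs if query.lower() in d["title"].lower()][:limit]
        self.query_cache[key] = hits
        return hits, rebuilt, False
```

Injecting the clock keeps expiry testable without sleeping. Calling search() a second time with the same query and limit inside the expiry window should be served from the query cache without invoking the downloader again.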
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.