dataset

Core search engine — the only module that imports tantivy.

Responsibilities:
  • Build a tantivy schema from field definitions

  • Register custom tokenizers (ngram)

  • Write documents into the index

  • Query the index with automatic field_boosts

  • Fuzzy query for typo tolerance

  • Multi-field sorting with over-fetch

class sayt2.dataset.SortKey(*, name: str, descending: bool = True)

One element of a multi-field sort specification.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class sayt2.dataset.Hit(source: dict[str, Any], score: float)

A single search hit with source document and relevance score.

class sayt2.dataset.SearchResult(hits: list[Hit], size: int, took_ms: int, fresh: bool, cache: bool)

Immutable search result returned by DataSet.search().

sayt2.dataset.build_schema(fields: list[BaseField]) → tuple[Schema, dict[str, TextAnalyzer]]

Convert a list of field definitions into a tantivy Schema and a dict of custom tokenizers that must be registered on the Index.

Returns (schema, analyzers) where analyzers maps tokenizer name → TextAnalyzer.

sayt2.dataset.open_index(dir_index: Path, fields: list[BaseField]) → Index

Open (or create) a tantivy Index at dir_index and register all required custom tokenizers.

Tantivy persists only the inverted index data, not the tokenizer configuration, so every Index.open() / Index(schema, path=...) must be followed by register_tokenizer calls.

sayt2.dataset.write_documents(index: Index, data: Iterable[dict[str, Any]], memory_budget_bytes: int = 128000000, num_threads: int | None = None) → int

Write data into index.

Parameters:
  • data – Iterable of dicts, each dict is one document whose keys match the field names in the schema.

  • memory_budget_bytes – Heap budget for the index writer.

  • num_threads – Number of indexing threads (None = tantivy default).

Returns:

Number of documents written.

sayt2.dataset.search_index(index: Index, fields: list[BaseField], query_str: str, limit: int = 20) → list[Hit]

Parse query_str against the searchable fields in fields, execute the search, and return up to limit hits as a list of Hit objects.

Each Hit carries the stored fields of the matching document in source and its BM25 relevance score in score.

Field boosts declared on field definitions are applied automatically.

sayt2.dataset.fuzzy_search_index(index: Index, fields: list[BaseField], query_str: str, limit: int = 20, distance: int = 1, transposition_cost_one: bool = True, prefix: bool = False) → list[Hit]

Fuzzy search using Query.fuzzy_term_query on each TextField.

Fuzzy matching works on word-level tokenized fields only (TextField), not on NgramField or KeywordField. One fuzzy query per TextField is built and combined with Occur.Should (boolean OR).

Parameters:
  • distance – Maximum Levenshtein edit distance (1 or 2).

  • transposition_cost_one – Count adjacent-char swaps as 1 edit.

  • prefix – Enable prefix Levenshtein mode.
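The effect of distance, transposition_cost_one, and prefix can be illustrated with a small pure-Python sketch. It does not call tantivy; edit_distance and matches are hypothetical helper names, and the prefix rule shown (a term matches when any prefix of it is within the allowed distance) is an assumption about how prefix Levenshtein mode behaves.

```python
def edit_distance(a: str, b: str, transposition_cost_one: bool = True) -> int:
    """Levenshtein distance between a and b.

    With transposition_cost_one=True an adjacent swap counts as one edit
    (the optimal-string-alignment variant of Damerau-Levenshtein).
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (
                transposition_cost_one
                and i > 1
                and j > 1
                and a[i - 1] == b[j - 2]
                and a[i - 2] == b[j - 1]
            ):
                # adjacent characters swapped: one edit instead of two
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def matches(
    query: str,
    term: str,
    distance: int = 1,
    transposition_cost_one: bool = True,
    prefix: bool = False,
) -> bool:
    """Return True if an indexed term would be accepted for the query."""
    if prefix:
        # Assumed prefix semantics: some prefix of the term is close enough.
        return any(
            edit_distance(query, term[:k], transposition_cost_one) <= distance
            for k in range(len(term) + 1)
        )
    return edit_distance(query, term, transposition_cost_one) <= distance
```

Under these assumptions, matches("pyhton", "python") is accepted at distance 1 only because the swapped "ht" counts as a single edit, and matches("pyth", "python", prefix=True) is accepted because the prefix "pyth" matches exactly.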

sayt2.dataset.search_index_sorted(index: Index, fields: list[BaseField], query_str: str, sort_keys: list[SortKey], limit: int = 20, over_fetch_factor: int = 10) → list[Hit]

Search then sort by multiple fields.

tantivy-py only supports order_by_field on a single field, so multi-field sorting is done in Python after over-fetching.

Parameters:
  • sort_keys – List of SortKey specifying sort order.

  • over_fetch_factor – Fetch limit * over_fetch_factor candidates before sorting. Only fetched candidates take part in the sort, so a larger factor makes it more likely that the final top-limit is accurate.
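The over-fetch-then-sort approach can be sketched in plain Python as follows. This is not the sayt2 implementation: SortKeySketch is a stand-in dataclass for SortKey, and fetch is a hypothetical callable that returns up to n candidate documents in relevance order.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class SortKeySketch:
    """Stand-in for SortKey: a field name plus a direction."""
    name: str
    descending: bool = True


def search_sorted(
    fetch: Callable[[int], list[dict[str, Any]]],
    sort_keys: list[SortKeySketch],
    limit: int = 20,
    over_fetch_factor: int = 10,
) -> list[dict[str, Any]]:
    # Over-fetch: pull more candidates than needed so the sort sees
    # documents that relevance ranking alone would have cut off.
    candidates = list(fetch(limit * over_fetch_factor))
    # list.sort is stable, so sorting by the least significant key first
    # and the most significant key last yields a multi-field ordering.
    for key in reversed(sort_keys):
        candidates.sort(key=lambda doc: doc[key.name], reverse=key.descending)
    return candidates[:limit]


# Example data in "relevance" order (hypothetical documents).
docs = [
    {"title": "A", "year": 2020, "rating": 4.5},
    {"title": "B", "year": 2021, "rating": 4.0},
    {"title": "C", "year": 2021, "rating": 4.8},
    {"title": "D", "year": 2019, "rating": 5.0},
]
keys = [SortKeySketch("year"), SortKeySketch("rating")]
top = search_sorted(lambda n: docs[:n], keys, limit=2)
```

With the default factor all four documents are candidates and top holds C then B. With over_fetch_factor=1 only A and B are fetched, so C never reaches the sort; this is the inaccuracy that over-fetching is meant to reduce.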

class sayt2.dataset.DataSet(*, dir_root: Path, name: str, fields: list[Annotated[StoredField | KeywordField | TextField | NgramField | NumericField | DatetimeField | BooleanField, FieldInfo(annotation=NoneType, required=True, discriminator='type')]], downloader: Callable[[], Iterable[dict[str, Any]]] | None = None, cache_expire: int | None = None, sort: list[SortKey] | None = None, memory_budget_bytes: int = 128000000, num_threads: int | None = None, lock_expire: int = 60)

High-level search dataset that integrates index building, caching, cross-process locking, and query execution into a single object.

Parameters:
  • dir_root – Root directory; index, cache, and tracker DB are stored inside sub-directories of this path.

  • name – Logical name (e.g. "books"). Used as the tracker lock key and cache namespace.

  • fields – Field definitions that determine the tantivy schema.

  • downloader – Optional callable that returns an iterable of document dicts. Called when the data is stale or on first search.

  • cache_expire – Seconds before L1 cache expires (None = never).

  • sort – Optional multi-field sort specification.

  • memory_budget_bytes – Heap budget for the tantivy index writer.

  • num_threads – Number of indexing threads (None = tantivy default).

  • lock_expire – Seconds before the tracker lock expires.

close() → None

Close the underlying cache (sqlite3 connection).

Safe to call multiple times. After closing, the next search() or build_index() call will lazily re-open the cache.
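The close-then-lazily-reopen behaviour described above follows a common pattern, sketched here with the standard-library sqlite3 module. LazyCache and its attributes are hypothetical names used for illustration; the actual cache object inside DataSet may be organised differently.

```python
import sqlite3
from typing import Optional


class LazyCache:
    """Holds a sqlite3 connection that is opened on first use."""

    def __init__(self, path: str = ":memory:") -> None:
        self._path = path
        self._conn: Optional[sqlite3.Connection] = None

    @property
    def conn(self) -> sqlite3.Connection:
        # Open (or re-open) the connection only when it is needed.
        if self._conn is None:
            self._conn = sqlite3.connect(self._path)
        return self._conn

    def close(self) -> None:
        # Idempotent: a second call finds no connection and does nothing.
        if self._conn is not None:
            self._conn.close()
            self._conn = None
```

Resetting the attribute to None in close() is what makes both guarantees hold: repeated close() calls are no-ops, and the next access to conn opens a new connection.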

build_index(data: Iterable[dict[str, Any]] | None = None) → int

Build (or rebuild) the index with tracker lock protection.

If data is None, the downloader is called. Raises ValueError if both are None.

Returns:

Number of documents indexed.
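The rule for choosing the data source can be written as a short sketch. resolve_data is a hypothetical helper, not part of sayt2, and the error message is illustrative; locking and the actual index write are left out.

```python
from typing import Any, Callable, Iterable, Optional

Doc = dict[str, Any]


def resolve_data(
    data: Optional[Iterable[Doc]],
    downloader: Optional[Callable[[], Iterable[Doc]]],
) -> Iterable[Doc]:
    # Explicit data wins; the downloader is only a fallback.
    if data is not None:
        return data
    if downloader is not None:
        return downloader()
    raise ValueError("either data or a downloader must be provided")
```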

search(query: str, limit: int = 20, refresh: bool = False) → SearchResult

Full search flow:

  1. Check L1 freshness (or refresh=True forces rebuild).

  2. If stale, call build_index() with downloader.

  3. Check L2 query cache.

  4. On cache miss, execute the query, apply sorting, cache the result.

Parameters:
  • query – Query string.

  • limit – Maximum number of hits.

  • refresh – Force a data refresh even if the cache is fresh.
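The four steps of the search flow can be sketched with in-memory stand-ins. SearchFlowSketch is hypothetical: the real DataSet uses a tantivy index, an on-disk cache, and a cross-process lock, none of which appear here. The substring match on a "title" field, the meaning given to the fresh and cache flags, and clearing the query cache on rebuild are all assumptions made for illustration.

```python
import time
from typing import Any, Callable, Iterable, Optional


class SearchFlowSketch:
    def __init__(
        self,
        downloader: Callable[[], Iterable[dict[str, Any]]],
        cache_expire: Optional[int] = None,
    ) -> None:
        self.downloader = downloader
        self.cache_expire = cache_expire
        self._docs: list[dict[str, Any]] = []
        self._built_at: Optional[float] = None  # L1 freshness marker
        self._query_cache: dict[tuple[str, int], list[dict[str, Any]]] = {}  # L2

    def _is_fresh(self) -> bool:
        if self._built_at is None:
            return False  # never built
        if self.cache_expire is None:
            return True  # never expires
        return time.monotonic() - self._built_at < self.cache_expire

    def build_index(self) -> int:
        self._docs = list(self.downloader())
        self._built_at = time.monotonic()
        self._query_cache.clear()  # assumption: a rebuild invalidates L2
        return len(self._docs)

    def search(self, query: str, limit: int = 20, refresh: bool = False) -> dict[str, Any]:
        rebuilt = False
        # Steps 1 and 2: rebuild when the data is stale or a refresh is forced.
        if refresh or not self._is_fresh():
            self.build_index()
            rebuilt = True
        # Step 3: look in the query cache.
        key = (query, limit)
        if key in self._query_cache:
            return {"hits": self._query_cache[key], "fresh": rebuilt, "cache": True}
        # Step 4: run the query (a toy substring match) and cache the result.
        hits = [d for d in self._docs if query.lower() in d["title"].lower()][:limit]
        self._query_cache[key] = hits
        return {"hits": hits, "fresh": rebuilt, "cache": False}
```

In this sketch the first search() call triggers the downloader, a repeated call with the same query and limit is served from the query cache, and refresh=True forces another download.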

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

model_post_init(context: Any, /) → None

This function is meant to behave like a BaseModel method to initialize private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:
  • self – The BaseModel instance.

  • context – The context.