Dataset — Core Search Engine ============================================================================== :mod:`sayt2.dataset` is the **only module that imports tantivy**. It integrates field definitions, caching, and cross-process locking into a single high-level :class:`~sayt2.dataset.DataSet` object that handles the full index-build-search lifecycle. Module-level functions ------------------------------------------------------------------------------ The module exposes a set of stateless functions that operate on a tantivy ``Index`` directly. :class:`~sayt2.dataset.DataSet` composes them internally, but they are also usable standalone for advanced use cases. Schema construction ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. literalinclude:: ../../../../sayt2/dataset.py :language: python :pyobject: build_schema :func:`~sayt2.dataset.build_schema` walks the field list and maps each :class:`~sayt2.fields.BaseField` subclass to the corresponding tantivy ``SchemaBuilder`` call. For :class:`~sayt2.fields.NgramField`, it also creates a custom ``TextAnalyzer`` via the helper below: .. literalinclude:: ../../../../sayt2/dataset.py :language: python :pyobject: _ngram_tokenizer_name .. literalinclude:: ../../../../sayt2/dataset.py :language: python :pyobject: _build_ngram_analyzer Each :class:`~sayt2.fields.NgramField` gets a deterministic tokenizer name derived from its parameters. This ensures that two fields with different gram ranges get separate tokenizers while identical configurations share one. Index opening ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. literalinclude:: ../../../../sayt2/dataset.py :language: python :pyobject: open_index tantivy does **not** persist tokenizer configuration — only the inverted index data. Every ``Index`` open must be followed by ``register_tokenizer`` calls, which :func:`~sayt2.dataset.open_index` handles automatically. Document writing ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. literalinclude:: ../../../../sayt2/dataset.py :language: python :pyobject: write_documents Documents are written through tantivy's ``IndexWriter``. After committing, the writer waits for background merge threads to finish, then reloads the index so that subsequent searches see the new data. Query execution ------------------------------------------------------------------------------ Basic search ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. literalinclude:: ../../../../sayt2/dataset.py :language: python :pyobject: search_index :func:`~sayt2.dataset.search_index` uses ``index.parse_query`` with automatic field boosts. Only fields that have a ``boost`` attribute (:class:`~sayt2.fields.KeywordField`, :class:`~sayt2.fields.TextField`, :class:`~sayt2.fields.NgramField`) are included in the query. The helper :func:`~sayt2.dataset._collect_search_config` extracts searchable field names and non-default boosts: .. literalinclude:: ../../../../sayt2/dataset.py :language: python :pyobject: _collect_search_config Fuzzy search ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. literalinclude:: ../../../../sayt2/dataset.py :language: python :pyobject: fuzzy_search_index Fuzzy matching uses ``Query.fuzzy_term_query`` (not ``parse_query``'s ``~N`` syntax, which does not work in tantivy-py). It operates only on :class:`~sayt2.fields.TextField` — ngram and keyword fields are excluded. Multiple query terms are split by whitespace and combined with ``Occur.Should`` (boolean OR). Sorted search ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ tantivy-py only exposes single-field ``order_by_field``, so multi-field sorting is done in Python after over-fetching. .. literalinclude:: ../../../../sayt2/dataset.py :language: python :pyobject: search_index_sorted .. literalinclude:: ../../../../sayt2/dataset.py :language: python :pyobject: _sort_hits The strategy: fetch ``limit * over_fetch_factor`` candidates using BM25 scoring, then re-sort in Python using successive stable sorts (least-significant key first). The default ``over_fetch_factor=10`` balances accuracy and performance. Data model ------------------------------------------------------------------------------ SortKey ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. literalinclude:: ../../../../sayt2/dataset.py :language: python :pyobject: SortKey :class:`~sayt2.dataset.SortKey` is a simple pydantic model specifying a field ``name`` and sort ``direction``. Pass a list of these to :attr:`~sayt2.dataset.DataSet.sort` for multi-field sorting. Hit ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. literalinclude:: ../../../../sayt2/dataset.py :language: python :pyobject: Hit :class:`~sayt2.dataset.Hit` is a **frozen dataclass** representing a single search result. Key fields: - ``source`` — dict of stored document fields (modelled after Elasticsearch's ``_source``). - ``score`` — BM25 relevance score. SearchResult ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. literalinclude:: ../../../../sayt2/dataset.py :language: python :pyobject: SearchResult :class:`~sayt2.dataset.SearchResult` is a **frozen dataclass** — immutable after creation. Key fields: - ``hits`` — list of :class:`~sayt2.dataset.Hit` objects. - ``size`` — number of hits returned. - ``took_ms`` — wall-clock time for the full search flow. - ``fresh`` — ``True`` if this search triggered a data refresh. - ``cache`` — ``True`` if the result was served from L2 cache. DataSet class ------------------------------------------------------------------------------ .. literalinclude:: ../../../../sayt2/dataset.py :language: python :pyobject: DataSet :class:`~sayt2.dataset.DataSet` is the primary user-facing class. It orchestrates three subsystems: - :class:`~sayt2.tracker.Tracker` — ensures only one process rebuilds the index at a time. - :class:`~sayt2.cache.DataSetCache` — avoids redundant rebuilds and repeated queries. - tantivy ``Index`` — the actual search engine. All state (index files, cache, tracker DB) lives under ``dir_root``: .. code-block:: text dir_root/ ├── tracker.db ← shared across datasets └── {name}/ ├── index-{schema_hash}/ ← tantivy index files └── cache/ ← diskcache files Resource lifecycle ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ :class:`~sayt2.dataset.DataSet` lazily opens a :class:`~sayt2.cache.DataSetCache` (backed by ``diskcache.Cache``, which holds a ``sqlite3`` connection). The connection is **reused** across calls to :meth:`~sayt2.dataset.DataSet.search` and :meth:`~sayt2.dataset.DataSet.build_index` — it is **not** closed automatically after each call. Three ways to manage the lifecycle: .. code-block:: python # 1. Context manager (recommended) with DataSet(dir_root=..., name="books", fields=..., downloader=dl) as ds: r1 = ds.search("python") r2 = ds.search("rust") # cache closed automatically on __exit__ # 2. Explicit close ds = DataSet(...) ds.search("python") ds.close() # safe to call multiple times # 3. One-off script (GC will reclaim eventually) ds = DataSet(...) ds.search("python") After :meth:`~sayt2.dataset.DataSet.close`, the ``DataSet`` can still be used — the next call lazily re-opens the cache. :class:`~sayt2.tracker.Tracker` connections are **not** held open; each ``lock_it`` / ``unlock_it`` call opens and closes its own ``sqlite3`` connection, so the tracker needs no explicit lifecycle management. build_index ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ :meth:`~sayt2.dataset.DataSet.build_index` acquires a tracker lock, evicts all caches, writes documents, and marks the data as fresh: .. code-block:: text lock(name) → evict_all() → open_index() → write_documents() → mark_fresh() → unlock If ``data`` is ``None``, the :attr:`~sayt2.dataset.DataSet.downloader` callable is invoked to fetch data. search ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ :meth:`~sayt2.dataset.DataSet.search` implements the full search flow: .. code-block:: text 1. is_fresh()? ──no──→ build_index(downloader) │ yes │ 2. get_query_result(query, limit)? ──hit──→ return cached │ miss │ 3. Execute query → apply sort → set_query_result() → return Setting ``refresh=True`` forces step 1 to always rebuild, regardless of L1 freshness.