Cache: Two-Layer Disk Cache
sayt2.cache manages a two-layer disk cache backed by
diskcache. Both layers live in a
single diskcache.Cache instance, distinguished by key prefixes and linked
by a shared tag for bulk eviction.
Why two layers?
A search dataset has two independent freshness concerns:
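| Layer | Concern | Key pattern | Expiry |
|---|---|---|---|
| L1 | Data freshness: is the index up to date? | fresh:{dataset_name}:{schema_hash} | After expire seconds (never, if expire is None) |
| L2 | Query results: has this query already been answered? | q:{dataset_name}:{schema_hash}:{query}:{limit} | Never (bulk-evicted on rebuild) |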
When L1 expires, search() triggers a
downloader() -> build_index() cycle, which calls
evict_all() to wipe both layers. This
guarantees that stale query results are never served after a data refresh.
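The interplay of the two layers can be modelled without diskcache. The sketch below is an in-memory stand-in (TinyTaggedCache is a hypothetical name, not part of sayt2 or diskcache, and "abc123" is a made-up schema hash). It mimics only the three behaviours the design relies on: per-entry expiry, a shared tag, and tag-based bulk eviction. The clock is injected so that expiry can be shown without sleeping.

```python
from __future__ import annotations

from typing import Any, Callable


class TinyTaggedCache:
    """In-memory stand-in for the small subset of cache behaviour used here."""

    def __init__(self, clock: Callable[[], float]) -> None:
        self._clock = clock
        # key -> (value, absolute expiry time or None, tag)
        self._items: dict[str, tuple[Any, float | None, str | None]] = {}

    def set(self, key, value, expire=None, tag=None) -> None:
        deadline = None if expire is None else self._clock() + expire
        self._items[key] = (value, deadline, tag)

    def get(self, key, default=None):
        item = self._items.get(key)
        if item is None:
            return default
        value, deadline, _tag = item
        if deadline is not None and self._clock() >= deadline:
            return default  # expired entries behave like misses
        return value

    def __contains__(self, key) -> bool:
        sentinel = object()
        return self.get(key, sentinel) is not sentinel

    def evict(self, tag) -> int:
        """Delete every entry carrying ``tag``; return how many were removed."""
        doomed = [k for k, (_v, _d, t) in self._items.items() if t == tag]
        for k in doomed:
            del self._items[k]
        return len(doomed)


now = [0.0]  # fake clock, advanced by hand
cache = TinyTaggedCache(clock=lambda: now[0])
tag = "dataset:books"

cache.set("fresh:books:abc123", True, expire=60, tag=tag)  # L1, expires
cache.set("q:books:abc123:python:10", ["hit"], tag=tag)    # L2, no expiry

now[0] = 61.0  # L1 has expired; the L2 entry is still stored
l1_alive = "fresh:books:abc123" in cache
l2_alive = "q:books:abc123:python:10" in cache

removed = cache.evict(tag)  # what a rebuild does: wipe both layers at once
```

In this model the L2 entry outlives L1, which is exactly why the rebuild has to evict by tag rather than wait for query results to expire on their own.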
Schema hash in cache keys
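Every cache key embeds the schema hash produced by
fields_schema_hash(). If you add, remove, or modify a
field definition, the hash changes, so lookups under the new hash miss every
entry written under the old one. No explicit invalidation is required. The old
entries stay on disk until the next evict_all(), which removes them because
the shared tag depends only on the dataset name, not on the schema hash.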
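How fields_schema_hash() derives its value is not shown in this section. As an illustration only, the sketch below assumes a digest over a canonical JSON dump of the field definitions. The function toy_schema_hash and the field dictionaries are hypothetical; only the key layout is taken from DataSetCache.

```python
from __future__ import annotations

import hashlib
import json


def toy_schema_hash(fields: list[dict]) -> str:
    """Hypothetical stand-in for fields_schema_hash(): a short, stable digest."""
    # sort_keys makes the digest independent of dict key order
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def freshness_key(dataset_name: str, schema_hash: str) -> str:
    # Same layout as DataSetCache._freshness_key
    return f"fresh:{dataset_name}:{schema_hash}"


fields_v1 = [{"name": "title", "type": "text"}]
fields_v2 = fields_v1 + [{"name": "author", "type": "keyword"}]  # field added

key_v1 = freshness_key("books", toy_schema_hash(fields_v1))
key_v2 = freshness_key("books", toy_schema_hash(fields_v2))
# key_v1 != key_v2, so a cache populated under v1 looks empty to v2
```

Because both L1 and L2 keys carry the hash, a single changed field makes the freshness check miss, which in turn forces a rebuild.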
DataSetCache class
from __future__ import annotations  # lets annotations name SearchResult lazily

from pathlib import Path

import diskcache

# SearchResult comes from elsewhere in the package; it appears only in
# annotations here, so this listing loads without importing it.


class DataSetCache:
    """
    Manages a two-layer cache backed by `diskcache <https://pypi.org/project/diskcache/>`__.

    :param dir_cache: Directory for the ``diskcache.Cache`` files.
    :param dataset_name: Logical name of the dataset (e.g. ``"books"``).
    :param schema_hash: Short hash of the field definitions. Ensures that
        a schema change automatically invalidates all cached data.
    :param expire: Seconds before L1 (data freshness) expires.
        ``None`` means "never expire automatically".
    """

    def __init__(
        self,
        dir_cache: Path,
        dataset_name: str,
        schema_hash: str,
        expire: int | None = None,
    ):
        self._cache = diskcache.Cache(str(dir_cache))
        self._dataset_name = dataset_name
        self._schema_hash = schema_hash
        self._expire = expire
        # One tag per dataset, shared by L1 and L2 entries for bulk eviction.
        self._tag = f"dataset:{dataset_name}"

    # -- keys -----------------------------------------------------------------

    @property
    def _freshness_key(self) -> str:
        """L1 key. Includes schema_hash so a schema change = auto miss."""
        return f"fresh:{self._dataset_name}:{self._schema_hash}"

    def _query_key(self, query: str, limit: int) -> str:
        """L2 key. Deterministic, based on query string and limit."""
        return f"q:{self._dataset_name}:{self._schema_hash}:{query}:{limit}"

    # -- Layer 1: data freshness ----------------------------------------------

    def is_fresh(self) -> bool:
        """Return ``True`` if the dataset index is still considered fresh."""
        return self._freshness_key in self._cache

    def mark_fresh(self) -> None:
        """
        Mark the dataset as fresh. Starts the L1 expiry countdown.

        Called after a successful ``downloader() → build_index()`` cycle.
        """
        self._cache.set(
            self._freshness_key,
            True,
            expire=self._expire,
            tag=self._tag,
        )

    # -- Layer 2: query result cache ------------------------------------------

    def get_query_result(self, query: str, limit: int) -> "SearchResult | None":
        """
        Return the cached result for *(query, limit)*, or ``None`` on miss.

        Query results are always ``SearchResult`` objects (never ``None``),
        so a ``None`` return unambiguously means cache miss.
        """
        return self._cache.get(self._query_key(query, limit))

    def set_query_result(self, query: str, limit: int, result: SearchResult) -> None:
        """
        Cache a query result. L2 entries never expire on their own; they
        are bulk-evicted when L1 triggers a rebuild via :meth:`evict_all`.
        """
        self._cache.set(
            self._query_key(query, limit),
            result,
            tag=self._tag,
        )

    # -- eviction -------------------------------------------------------------

    def evict_all(self) -> None:
        """
        Remove **all** entries (L1 + L2) belonging to this dataset.

        Called before a rebuild so that stale query results are not served.
        """
        self._cache.evict(tag=self._tag)

    # -- lifecycle ------------------------------------------------------------

    def close(self) -> None:
        """Close the underlying ``diskcache.Cache``."""
        self._cache.close()
Key methods:
- is_fresh() / mark_fresh(): L1 interface. mark_fresh() is called after a successful index build and starts the expiry countdown.
- get_query_result() / set_query_result(): L2 interface. Results are always SearchResult objects, so a None return unambiguously means cache miss.
- evict_all(): Removes all entries (L1 + L2) for this dataset using diskcache's tag-based eviction. Called at the start of every rebuild.
- close(): Closes the underlying diskcache.Cache. Always call this when done (or rely on DataSet, which handles it automatically).
Lifecycle within DataSet
You rarely need to instantiate DataSetCache directly.
DataSet creates and manages it internally:
- build_index(): evict_all(), then mark_fresh()
- search(): is_fresh() to decide whether to rebuild, then get_query_result() / set_query_result() for L2
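The call order described above can be sketched with a recording fake. Everything in this block is hypothetical: RecordingCache only logs which DataSetCache-style methods are invoked, and toy_build_index / toy_search stand in for the real DataSet methods, whose bodies are not shown in this section.

```python
from __future__ import annotations


class RecordingCache:
    """Fake exposing the DataSetCache method names; logs each call."""

    def __init__(self) -> None:
        self.calls: list[str] = []
        self._fresh = False
        self._results: dict[tuple[str, int], list[str]] = {}

    def is_fresh(self) -> bool:
        self.calls.append("is_fresh")
        return self._fresh

    def mark_fresh(self) -> None:
        self.calls.append("mark_fresh")
        self._fresh = True

    def evict_all(self) -> None:
        self.calls.append("evict_all")
        self._fresh = False
        self._results.clear()

    def get_query_result(self, query: str, limit: int):
        self.calls.append("get_query_result")
        return self._results.get((query, limit))

    def set_query_result(self, query: str, limit: int, result) -> None:
        self.calls.append("set_query_result")
        self._results[(query, limit)] = result


def toy_build_index(cache: RecordingCache) -> None:
    cache.evict_all()   # drop stale L1 + L2 entries first
    # ... download and index build would happen here ...
    cache.mark_fresh()  # start the L1 countdown


def toy_search(cache: RecordingCache, query: str, limit: int = 10):
    if not cache.is_fresh():
        toy_build_index(cache)
    cached = cache.get_query_result(query, limit)
    if cached is not None:
        return cached
    result = [f"match for {query!r}"][:limit]  # placeholder for the real search
    cache.set_query_result(query, limit, result)
    return result


cache = RecordingCache()
first = toy_search(cache, "python")    # cold: rebuild, miss, store
first_calls, cache.calls = cache.calls, []
second = toy_search(cache, "python")   # warm: freshness check, then L2 hit
second_calls = cache.calls
```

The first call goes through all five methods, while the second touches only is_fresh() and get_query_result(), which is the saving the L2 layer is meant to provide.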