Field Types¶
sayt2.fields defines seven field types that describe how each column
of your data is stored, indexed, and searched. All types are pydantic
BaseModel subclasses, so validation, serialisation, and IDE autocompletion
work out of the box.
Important
This module has no dependency on tantivy. The mapping from field
definitions to tantivy schema objects lives entirely in
sayt2.dataset.
Base class¶
class BaseField(BaseModel):
    """
    Common base for all field types.

    Every subclass must override ``type`` with a ``T.Literal["..."]`` so that
    pydantic's discriminated union can reconstruct the correct class from a
    plain dict.
    """

    type: str  # overridden by each subclass as a Literal
    name: str = Field(min_length=1)
    stored: bool = True
Every field has a type discriminator (overridden by each subclass as a
Literal), a non-empty name, and a stored flag that defaults to True. The
type value is drawn from FieldTypeEnum.
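The validation rules on the base class can be checked with a short sketch. It assumes pydantic v2 is installed and re-declares a stand-in copy of BaseField so that it runs without sayt2; in real code you would import the class from sayt2.fields instead.

```python
from pydantic import BaseModel, Field, ValidationError


class BaseField(BaseModel):
    """Stand-in copy of the base class shown above (not imported from sayt2)."""

    type: str
    name: str = Field(min_length=1)
    stored: bool = True


# A valid definition: ``stored`` falls back to its default of True.
ok = BaseField(type="keyword", name="id")
print(ok.stored)

# An empty name violates ``min_length=1`` and is rejected at construction time.
try:
    BaseField(type="keyword", name="")
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

Both prints show True: the default for stored applies, and the empty name raises a ValidationError.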
Text family¶
StoredField¶
class StoredField(BaseField):
    """Store-only field. Not indexed, not searchable, not sortable."""

    type: T.Literal["stored"] = FieldTypeEnum.STORED.value
StoredField keeps the value in the index but does not
make it searchable or sortable. Use it for payload data you want returned with
search results (e.g. URLs, descriptions).
KeywordField¶
class KeywordField(BaseField):
    """
    Exact-match field (id, tag, enum). Uses the ``raw`` tokenizer under the
    hood: the entire field value is treated as one token.
    """

    type: T.Literal["keyword"] = FieldTypeEnum.KEYWORD.value
    boost: float = Field(default=1.0, gt=0)
KeywordField performs exact-match search. The
entire field value is treated as a single token (raw tokeniser), making it
ideal for IDs, tags, and enum values. The boost parameter controls how
much weight this field carries in relevance scoring.
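To make "the entire value is one token" concrete, the sketch below contrasts it with naive word splitting, the kind of tokenization a full-text field performs. Both helper functions are hypothetical illustrations written for this page; they are not the tokenizers the library uses, and a real tokenizer may normalise tokens further.

```python
import re


def raw_tokenize(value: str) -> list[str]:
    """Whole value as a single token, as described for KeywordField."""
    return [value]


def word_tokenize(value: str) -> list[str]:
    """Rough stand-in for word-boundary splitting: runs of word characters."""
    return re.findall(r"\w+", value)


title = "Star Wars: Episode IV"
keyword_tokens = raw_tokenize(title)
text_tokens = word_tokenize(title)

print(keyword_tokens)  # one token holding the full string
print(text_tokens)     # one token per word

# A single word only matches the word-split tokens, never the raw token.
print("Wars" in keyword_tokens, "Wars" in text_tokens)
```

Under this model, an exact-match field answers only queries for the complete value, which is why it suits IDs, tags, and enum values.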
TextField¶
class TextField(BaseField):
    """
    Full-text search field. Uses the ``default`` (Unicode-aware word
    boundary) or ``en_stem`` (English stemming) tokenizer.
    """

    type: T.Literal["text"] = FieldTypeEnum.TEXT.value
    tokenizer: T.Literal["default", "en_stem"] = TokenizerEnum.DEFAULT.value
    boost: float = Field(default=1.0, gt=0)
TextField is for full-text search. Choose a
tokenizer from TokenizerEnum:
- "default": Unicode-aware word boundary splitting.
- "en_stem": English stemming (e.g. “running” matches “run”).
NgramField¶
class NgramField(BaseField):
    """
    Search-as-you-type field. Builds an ngram inverted index so that any
    substring of length ``[min_gram, max_gram]`` is a valid query token.
    """

    type: T.Literal["ngram"] = FieldTypeEnum.NGRAM.value
    min_gram: int = Field(default=2, ge=1)
    max_gram: int = Field(default=6, ge=1)
    prefix_only: bool = False
    lowercase: bool = True
    boost: float = Field(default=1.0, gt=0)

    @model_validator(mode="after")
    def _max_gte_min(self) -> NgramField:
        if self.max_gram < self.min_gram:
            raise ValueError(
                f"max_gram ({self.max_gram}) must be >= min_gram ({self.min_gram})"
            )
        return self
NgramField is the search-as-you-type workhorse.
It builds an ngram inverted index so that any substring of length
[min_gram, max_gram] becomes a searchable token.
Key parameters:
- min_gram / max_gram: control the ngram window. A @model_validator ensures max_gram >= min_gram.
- prefix_only: when True, only prefixes of each word are indexed (faster, but no mid-word matching).
- lowercase: normalise tokens to lowercase before indexing.
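The effect of these parameters on the set of indexed tokens can be sketched in plain Python. The ngrams function below is a hypothetical illustration, not the tokenizer the library uses; in particular, splitting on whitespace before generating ngrams is an assumption based on the "prefixes of each word" wording above.

```python
def ngrams(text, min_gram=2, max_gram=6, prefix_only=False, lowercase=True):
    """Illustrative ngram generator mirroring the NgramField parameters."""
    if max_gram < min_gram:
        raise ValueError("max_gram must be >= min_gram")
    if lowercase:
        text = text.lower()
    tokens = []
    for word in text.split():  # assumption: ngrams are built per word
        # With prefix_only, every ngram starts at the first character.
        starts = [0] if prefix_only else range(len(word))
        for start in starts:
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(word):
                    tokens.append(word[start:start + size])
    return tokens


all_grams = ngrams("Python", min_gram=2, max_gram=3)
prefix_grams = ngrams("Python", min_gram=2, max_gram=3, prefix_only=True)

print(all_grams)     # ['py', 'pyt', 'yt', 'yth', 'th', 'tho', 'ho', 'hon', 'on']
print(prefix_grams)  # ['py', 'pyt']
```

In this model the mid-word query "tho" is only answerable without prefix_only, while the prefix-only variant produces far fewer tokens for the same word.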
Numeric family¶
NumericField¶
class NumericField(BaseField):
    """
    Numeric field. Defaults to sort-only (``indexed=False, fast=True``) which
    is the typical use case for rating/year columns.
    """

    type: T.Literal["numeric"] = FieldTypeEnum.NUMERIC.value
    kind: T.Literal["i64", "u64", "f64"] = NumericKindEnum.I64.value
    indexed: bool = False
    fast: bool = True
NumericField stores numbers. The kind parameter
selects the underlying type from NumericKindEnum ("i64", "u64", or "f64")
and defaults to "i64". The field defaults to sort-only (indexed=False,
fast=True), which is the typical configuration for rating or year columns.
DatetimeField¶
class DatetimeField(BaseField):
    """Datetime field backed by tantivy's date type."""

    type: T.Literal["datetime"] = FieldTypeEnum.DATETIME.value
    indexed: bool = True
    fast: bool = True
DatetimeField stores date/time values. Both
indexed and fast default to True.
BooleanField¶
class BooleanField(BaseField):
    """Boolean field."""

    type: T.Literal["boolean"] = FieldTypeEnum.BOOLEAN.value
    indexed: bool = True
BooleanField stores boolean values. indexed defaults to True.
Discriminated union¶
All seven types are assembled into a single type alias:
# --- union & helpers ---------------------------------------------------------
T_Field = T.Annotated[
    T.Union[
        StoredField,
        KeywordField,
        TextField,
        NgramField,
        NumericField,
        DatetimeField,
        BooleanField,
    ],
    Field(discriminator="type"),
]
"""Discriminated union of all field types. Use with ``TypeAdapter`` for
polymorphic deserialization."""
T_Field uses pydantic’s Field(discriminator="type")
so that a plain dict like {"type": "ngram", "name": "title"} is
deserialised into the correct subclass when it is validated, for example
through a TypeAdapter. This is especially useful when field definitions are
loaded from configuration files.
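A minimal sketch of this polymorphic deserialisation follows, assuming pydantic v2. To stay runnable without sayt2, it defines two trimmed stand-in classes and a two-member union named DemoField (a hypothetical name); with the library installed, the same TypeAdapter pattern would presumably be applied to T_Field itself.

```python
import typing as T

from pydantic import BaseModel, Field, TypeAdapter


class KeywordField(BaseModel):
    """Trimmed stand-in, not the sayt2 class."""

    type: T.Literal["keyword"] = "keyword"
    name: str = Field(min_length=1)


class NgramField(BaseModel):
    """Trimmed stand-in, not the sayt2 class."""

    type: T.Literal["ngram"] = "ngram"
    name: str = Field(min_length=1)
    min_gram: int = Field(default=2, ge=1)
    max_gram: int = Field(default=6, ge=1)


# Two-member stand-in for the seven-member T_Field union.
DemoField = T.Annotated[
    T.Union[KeywordField, NgramField],
    Field(discriminator="type"),
]

# Plain dicts, e.g. as loaded from a JSON or YAML configuration file.
raw_fields = [
    {"type": "keyword", "name": "id"},
    {"type": "ngram", "name": "title", "max_gram": 4},
]

adapter = TypeAdapter(list[DemoField])
fields = adapter.validate_python(raw_fields)
print([type(f).__name__ for f in fields])
```

The "type" key alone selects the class, so each dict is validated only against the matching model and unspecified options such as min_gram fall back to their defaults.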
Schema hashing¶
def fields_schema_hash(fields: list[T_Field]) -> str:  # type: ignore[type-arg]
    """
    Deterministic hash of a list of field definitions.

    Used as part of cache keys so that changing the schema automatically
    invalidates stale caches.
    """
    payload = "|".join(
        json.dumps(f.model_dump(exclude_none=True), sort_keys=True) for f in fields
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]
fields_schema_hash() computes a deterministic SHA-256
digest (truncated to 16 hex characters) of a list of field definitions. The
hash is used as part of cache keys in DataSetCache, so
that changing the schema automatically invalidates stale cached results.
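The hashing recipe can be tried out with the standard library alone. The sketch below applies the same steps to plain dicts standing in for the model_dump() output; schema_hash is a hypothetical helper name used here so the snippet runs without sayt2.

```python
import hashlib
import json


def schema_hash(field_dicts):
    """Same recipe as fields_schema_hash(), applied to already-dumped dicts."""
    payload = "|".join(json.dumps(d, sort_keys=True) for d in field_dicts)
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


base = [{"type": "ngram", "name": "title", "min_gram": 2, "max_gram": 6}]
reordered_keys = [{"name": "title", "max_gram": 6, "min_gram": 2, "type": "ngram"}]
changed = [{"type": "ngram", "name": "title", "min_gram": 2, "max_gram": 8}]

# Key order inside a definition does not matter because of sort_keys=True.
print(schema_hash(base) == schema_hash(reordered_keys))

# Changing any option yields a different digest, hence a different cache key.
print(schema_hash(base) == schema_hash(changed))
```

Because the payload joins the definitions in list order, reordering the fields themselves also changes the hash, even though reordering keys within one definition does not.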