Field Types

sayt2.fields defines seven field types that describe how each column of your data is stored, indexed, and searched. All types are pydantic BaseModel subclasses, so validation, serialisation, and IDE autocompletion work out of the box.

Important

This module has no dependency on tantivy. The mapping from field definitions to tantivy schema objects lives entirely in sayt2.dataset.

Base class

from pydantic import BaseModel, Field


class BaseField(BaseModel):
    """
    Common base for all field types.

    Every subclass must override ``type`` with a ``T.Literal["..."]`` so that
    pydantic's discriminated union can reconstruct the correct class from a
    plain dict.
    """

    type: str  # overridden by each subclass as a Literal
    name: str = Field(min_length=1)
    stored: bool = True

Every field has three attributes: a type discriminator, which each subclass overrides with a Literal whose value is drawn from FieldTypeEnum; a non-empty name; and a stored flag that defaults to True.
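As a minimal sketch of what the base class gives you, the snippet below re-declares BaseField locally (a stand-in copied from the definition above, assuming pydantic v2) rather than importing it from sayt2. It shows the stored default being filled in and an empty name being rejected.

```python
from pydantic import BaseModel, Field, ValidationError


class BaseField(BaseModel):
    """Local stand-in copied from the definition above, for illustration only."""

    type: str
    name: str = Field(min_length=1)
    stored: bool = True


# A valid definition: ``stored`` falls back to its default of True.
field = BaseField(type="stored", name="url")
print(field.model_dump())

# An empty name violates ``min_length=1`` and is rejected at construction time.
try:
    BaseField(type="stored", name="")
except ValidationError as exc:
    rejected = True
    print(exc.errors()[0]["loc"])
else:
    rejected = False
```

The same behaviour should carry over to every subclass, since they inherit name and stored unchanged.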

Text family

StoredField

class StoredField(BaseField):
    """Store-only field.  Not indexed, not searchable, not sortable."""

    type: T.Literal["stored"] = FieldTypeEnum.STORED.value

StoredField keeps the value in the index but does not make it searchable or sortable. Use it for payload data you want returned with search results (e.g. URLs, descriptions).

KeywordField

class KeywordField(BaseField):
    """
    Exact-match field (id, tag, enum).  Uses the ``raw`` tokenizer under the
    hood — the entire field value is treated as one token.
    """

    type: T.Literal["keyword"] = FieldTypeEnum.KEYWORD.value
    boost: float = Field(default=1.0, gt=0)

KeywordField performs exact-match search. The entire field value is treated as a single token (raw tokenizer), making it ideal for IDs, tags, and enum values. The boost parameter (default 1.0, must be greater than 0) controls how much weight this field carries in relevance scoring.
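To make "one token" concrete, here is a rough pure-Python illustration that contrasts keeping a value whole with splitting it at word boundaries. The helper names raw_tokens and word_tokens are hypothetical, and the regular expression (with lowercasing) is only an approximation chosen for this sketch; it is not the tokenizer sayt2 or tantivy actually runs.

```python
import re


def raw_tokens(value: str) -> list[str]:
    """Keyword-style: the whole value is kept as a single token."""
    return [value]


def word_tokens(value: str) -> list[str]:
    """Approximate word-boundary splitting, lowercased (an assumption of this sketch)."""
    return re.findall(r"\w+", value.lower())


tag = "machine-learning"
print(raw_tokens(tag))   # the tag survives intact, so only an exact match hits it
print(word_tokens(tag))  # the tag is split, so either word alone would match
```

With the value kept whole, a query for "machine" alone would not match the tag, which is usually what you want for identifiers.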

TextField

class TextField(BaseField):
    """
    Full-text search field.  Uses the ``default`` (Unicode-aware word
    boundary) or ``en_stem`` (English stemming) tokenizer.
    """

    type: T.Literal["text"] = FieldTypeEnum.TEXT.value
    tokenizer: T.Literal["default", "en_stem"] = TokenizerEnum.DEFAULT.value
    boost: float = Field(default=1.0, gt=0)

TextField is for full-text search. Choose a tokenizer from TokenizerEnum:

"default" — Unicode-aware word boundary splitting.
"en_stem" — English stemming (e.g. “running” matches “run”).

NgramField

class NgramField(BaseField):
    """
    Search-as-you-type field.  Builds an ngram inverted index so that any
    substring of length ``[min_gram, max_gram]`` is a valid query token.
    """

    type: T.Literal["ngram"] = FieldTypeEnum.NGRAM.value
    min_gram: int = Field(default=2, ge=1)
    max_gram: int = Field(default=6, ge=1)
    prefix_only: bool = False
    lowercase: bool = True
    boost: float = Field(default=1.0, gt=0)

    @model_validator(mode="after")
    def _max_gte_min(self) -> "NgramField":
        if self.max_gram < self.min_gram:
            raise ValueError(
                f"max_gram ({self.max_gram}) must be >= min_gram ({self.min_gram})"
            )
        return self

NgramField is the search-as-you-type workhorse. It builds an ngram inverted index so that any substring whose length lies between min_gram and max_gram (inclusive) becomes a searchable token.

Key parameters:

min_gram / max_gram — control the ngram window. A @model_validator ensures max_gram >= min_gram.
prefix_only — when True, only prefixes of each word are indexed (faster, but no mid-word matching).
lowercase — normalise tokens to lowercase before indexing.
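The effect of these parameters is easier to see on a single word. The function below is a hypothetical pure-Python helper written for this page, not the tokenizer sayt2 uses; it only enumerates the substrings that the description above says would become tokens.

```python
def ngrams(
    token: str,
    min_gram: int = 2,
    max_gram: int = 6,
    prefix_only: bool = False,
    lowercase: bool = True,
) -> list[str]:
    """Enumerate the ngrams of one token under the given window settings."""
    if max_gram < min_gram:
        raise ValueError(f"max_gram ({max_gram}) must be >= min_gram ({min_gram})")
    if lowercase:
        token = token.lower()
    # With prefix_only, ngrams may only start at the first character.
    starts = [0] if prefix_only else range(len(token))
    result = []
    for start in starts:
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(token):
                result.append(token[start:start + size])
    return result


print(ngrams("Data", min_gram=2, max_gram=3))
# ['da', 'dat', 'at', 'ata', 'ta']
print(ngrams("Data", min_gram=2, max_gram=3, prefix_only=True))
# ['da', 'dat']
```

The token count grows quickly as the window widens, which is why prefix_only trades mid-word matching for a smaller index.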

Numeric family

NumericField

class NumericField(BaseField):
    """
    Numeric field.  Defaults to sort-only (``indexed=False, fast=True``) which
    is the typical use case for rating/year columns.
    """

    type: T.Literal["numeric"] = FieldTypeEnum.NUMERIC.value
    kind: T.Literal["i64", "u64", "f64"] = NumericKindEnum.I64.value
    indexed: bool = False
    fast: bool = True

NumericField stores numbers. The kind parameter selects the underlying type from NumericKindEnum. Defaults to sort-only (indexed=False, fast=True), which is the typical configuration for rating or year columns.

DatetimeField

class DatetimeField(BaseField):
    """Datetime field backed by tantivy's date type."""

    type: T.Literal["datetime"] = FieldTypeEnum.DATETIME.value
    indexed: bool = True
    fast: bool = True

DatetimeField stores date/time values. Both indexed and fast default to True.

BooleanField

class BooleanField(BaseField):
    """Boolean field."""

    type: T.Literal["boolean"] = FieldTypeEnum.BOOLEAN.value
    indexed: bool = True

BooleanField stores boolean values. Its indexed flag defaults to True.

Discriminated union

All seven types are assembled into a single type alias:

# --- union & helpers ---------------------------------------------------------

T_Field = T.Annotated[
    T.Union[
        StoredField,
        KeywordField,
        TextField,
        NgramField,
        NumericField,
        DatetimeField,
        BooleanField,
    ],
    Field(discriminator="type"),
]
"""Discriminated union of all field types.  Use with ``TypeAdapter`` for
polymorphic deserialization."""

T_Field uses pydantic’s Field(discriminator="type") so that a plain dict like {"type": "ngram", "name": "title"} is automatically deserialised into the correct subclass. This is especially useful when field definitions are loaded from configuration files.
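A minimal sketch of that round trip, assuming pydantic v2: the two classes and the DemoField alias below are cut-down local stand-ins with literal strings in place of the enums, so the snippet runs without sayt2 installed. With the real module you would presumably pass T_Field to TypeAdapter instead.

```python
import typing as T

from pydantic import BaseModel, Field, TypeAdapter


class KeywordField(BaseModel):
    """Cut-down stand-in for illustration only."""

    type: T.Literal["keyword"] = "keyword"
    name: str = Field(min_length=1)
    stored: bool = True
    boost: float = Field(default=1.0, gt=0)


class NgramField(BaseModel):
    """Cut-down stand-in for illustration only."""

    type: T.Literal["ngram"] = "ngram"
    name: str = Field(min_length=1)
    stored: bool = True
    min_gram: int = Field(default=2, ge=1)
    max_gram: int = Field(default=6, ge=1)


# Hypothetical two-member union; the real T_Field covers all seven types.
DemoField = T.Annotated[
    T.Union[KeywordField, NgramField],
    Field(discriminator="type"),
]

adapter = TypeAdapter(list[DemoField])

# Plain dicts, e.g. as loaded from a JSON or YAML configuration file.
fields = adapter.validate_python(
    [
        {"type": "ngram", "name": "title"},
        {"type": "keyword", "name": "id", "stored": False},
    ]
)
print([type(f).__name__ for f in fields])
```

Because the union is discriminated, pydantic looks only at the type key to pick the class, so validation errors point at the chosen subclass rather than listing failures for every member.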

Schema hashing

import hashlib
import json


def fields_schema_hash(fields: list[T_Field]) -> str:  # type: ignore[type-arg]
    """
    Deterministic hash of a list of field definitions.

    Used as part of cache keys so that changing the schema automatically
    invalidates stale caches.
    """
    payload = "|".join(
        json.dumps(f.model_dump(exclude_none=True), sort_keys=True) for f in fields
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

fields_schema_hash() computes a deterministic SHA-256 digest (truncated to 16 hex characters) of a list of field definitions. The hash is used as part of cache keys in DataSetCache, so that changing the schema automatically invalidates stale cached results.
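The properties that matter for cache invalidation can be checked with a standard-library sketch. The schema_hash function below is a hypothetical stand-in that applies the same join-and-digest steps to plain dicts, which play the role of the model_dump() output here.

```python
import hashlib
import json


def schema_hash(field_dicts: list[dict]) -> str:
    """Same recipe as above, applied to plain dicts instead of models."""
    payload = "|".join(json.dumps(d, sort_keys=True) for d in field_dicts)
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


title = {"type": "ngram", "name": "title", "min_gram": 2, "max_gram": 6}
year = {"type": "numeric", "name": "year", "kind": "i64"}

base = schema_hash([title, year])

# Key order inside a definition is irrelevant because of sort_keys=True.
same = schema_hash(
    [{"name": "title", "max_gram": 6, "min_gram": 2, "type": "ngram"}, year]
)

# Changing any parameter, or reordering the fields, yields a different hash.
changed = schema_hash([{**title, "max_gram": 8}, year])
reordered = schema_hash([year, title])

print(base == same, base == changed, base == reordered)
```

Note that the list order is part of the payload, so listing the same fields in a different order counts as a different schema.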