sayt2 Quick Start Guide

What is sayt2?

sayt2 is a search-as-you-type full-text search library for Python. It lets you build a search index from a list of dictionaries and query it with substring matching (ngram), full-text search (BM25), fuzzy search, range queries, sorting, and more — all through a single DataSet object.

Under the hood it uses tantivy (a Rust-based search engine) for fast indexing and querying, pydantic for configuration validation, and diskcache for a two-layer disk cache.

Install

pip install sayt2

Example 1: People Directory (Ngram + Full-Text Search)

In this example we build a search index for a people directory. Each person has a name, job title, and bio. We will demonstrate:

Ngram search: substring matching on names (search-as-you-type)
Full-text search: word-level BM25 search on title and bio
Caching: automatic L2 query result cache

Step 1: Prepare the data

Our data source is a list of dictionaries. In real projects this could come from a database, an API, or a file. Here we define it inline.

[1]:

records = [
    {
        "id": "1",
        "name": "Alice Johnson",
        "title": "Senior Data Scientist",
        "bio": "Alice specializes in machine learning and natural language processing. She has published several papers on transformer architectures.",
    },
    {
        "id": "2",
        "name": "Bob Martinez",
        "title": "Backend Engineer",
        "bio": "Bob builds scalable microservices with Python and Go. He is passionate about distributed systems and database optimization.",
    },
    {
        "id": "3",
        "name": "Charlie Wang",
        "title": "Frontend Developer",
        "bio": "Charlie creates beautiful user interfaces with React and TypeScript. He advocates for accessibility and performance.",
    },
    {
        "id": "4",
        "name": "Diana Patel",
        "title": "DevOps Engineer",
        "bio": "Diana manages cloud infrastructure on AWS and Kubernetes. She automates everything from CI pipelines to monitoring dashboards.",
    },
    {
        "id": "5",
        "name": "Edward Kim",
        "title": "Machine Learning Engineer",
        "bio": "Edward trains and deploys deep learning models for computer vision. He works extensively with PyTorch and TensorFlow.",
    },
]

Step 2: Define the schema

Each field tells sayt2 how to index a column. Different field types enable different search modes.

[2]:

from sayt2.api import (
    DataSet,
    NgramField,
    TextField,
    KeywordField,
    Hit,
    SearchResult,
)

fields = [
    # KeywordField — exact match, good for IDs and tags
    KeywordField(name="id"),
    # NgramField — substring matching (search-as-you-type)
    # boost=3.0 makes name matches rank higher
    NgramField(name="name", min_gram=2, max_gram=6, boost=3.0),
    # TextField — full-text BM25 search (word-level)
    TextField(name="title", boost=2.0),
    TextField(name="bio"),
]

Step 3: Create the DataSet and search

The DataSet is the main entry point. Pass a downloader callable that returns your data, and sayt2 will automatically build the index on the first search.

[3]:

import shutil
from pathlib import Path

dir_index = Path("./quick_start_index")

# clean up from previous runs
if dir_index.exists():
    shutil.rmtree(dir_index)

def downloader() -> list[dict]:
    """Return the raw records. In real use this could hit a DB or API."""
    return records
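
As the type hint above suggests, the downloader is a zero-argument callable that returns a list of dictionaries. The following sketch shows one way to back it with a JSON file instead of inline data, using only the standard library. The helper name make_json_downloader and the file name people.json are hypothetical and not part of sayt2.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def make_json_downloader(path: Path) -> Callable[[], list[dict]]:
    """Build a zero-argument downloader that loads records from a JSON file.

    Hypothetical helper for illustration; it is not part of sayt2.
    """

    def downloader() -> list[dict]:
        # Re-read the file on every call so a rebuild picks up new data.
        with path.open(encoding="utf-8") as f:
            return json.load(f)

    return downloader


# Demo with a throwaway file standing in for a real data export.
tmp_dir = Path(tempfile.mkdtemp())
path_people = tmp_dir / "people.json"
path_people.write_text(
    json.dumps([{"id": "1", "name": "Alice Johnson"}]), encoding="utf-8"
)

json_downloader = make_json_downloader(path_people)
print(json_downloader())
```

The returned function could then be passed wherever the inline downloader is used in this guide.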

[4]:

ds = DataSet(
    dir_root=dir_index,
    name="people",
    fields=fields,
    downloader=downloader,
    cache_expire=None,  # no auto-expiry for this demo
)

Step 4: Ngram search (substring matching)

Type a few characters and get instant results. The NgramField on name makes this possible.

[5]:

# "ali" matches "Alice Johnson"
result = ds.search("ali")
print(f"Found {result.size} results, took {result.took_ms:.1f} ms")
for hit in result.hits:
    print(f"  {hit.source['name']} (score: {hit.score:.2f})")

Found 1 results, took 263.0 ms
  Alice Johnson (score: 10.26)

[6]:

# "wan" matches "Charlie Wang"
result = ds.search("wan")
for hit in result.hits:
    print(f"  {hit.source['name']}: {hit.source['title']}")

  Charlie Wang: Frontend Developer
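
To see why a fragment such as "wan" can match "Charlie Wang", it helps to list the character n-grams that a 2-to-6 gram tokenizer would emit for the name. The sketch below is a plain-Python illustration of the idea, assuming lowercasing and per-word tokenization. It is not the actual tokenizer used by sayt2 or tantivy, and char_ngrams is a hypothetical helper.

```python
def char_ngrams(text: str, min_gram: int = 2, max_gram: int = 6) -> set[str]:
    """Return all character n-grams of each word, lowercased."""
    grams: set[str] = set()
    for word in text.lower().split():
        for n in range(min_gram, max_gram + 1):
            for start in range(len(word) - n + 1):
                grams.add(word[start:start + n])
    return grams


name_grams = char_ngrams("Charlie Wang")

# A query of 2 to 6 characters matches if it is one of the indexed n-grams.
print("wan" in name_grams)   # True: "wan" is a 3-gram of "wang"
print("ali" in name_grams)   # False: no word in this name contains "ali"
print("w" in name_grams)     # False: shorter than min_gram
```

Under these assumptions, a query shorter than min_gram cannot match, which is one reason to keep min_gram small for search-as-you-type.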

Step 5: Full-text search (word-level BM25)

TextField fields support natural language queries. Words are tokenized and ranked by relevance.

[7]:

# Search across title and bio fields
result = ds.search("machine learning")
print(f"Found {result.size} results:")
for hit in result.hits:
    print(f"  {hit.source['name']} — {hit.source['title']} (score: {hit.score:.2f})")

Found 2 results:
  Edward Kim — Machine Learning Engineer (score: 5.90)
  Alice Johnson — Senior Data Scientist (score: 2.24)

[8]:

# "kubernetes" appears only in Diana's bio
result = ds.search("kubernetes")
for hit in result.hits:
    print(f"  {hit.source['name']}: {hit.source['bio'][:80]}...")

  Diana Patel: Diana manages cloud infrastructure on AWS and Kubernetes. She automates everythi...
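
For intuition about how BM25 ranks documents, here is a minimal plain-Python sketch of the standard formula with the common defaults k1=1.2 and b=0.75. It scores only the title strings and ignores field boosts, so the numbers will not match the scores printed above. The tokenizer and function names are assumptions for illustration, not sayt2 or tantivy internals.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(
    query: str, docs: list[str], k1: float = 1.2, b: float = 0.75
) -> list[float]:
    """Score every document against the query with the BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            # Number of documents that contain the term
            df = sum(1 for t in tokenized if term in t)
            if df == 0 or tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


titles = [
    "Senior Data Scientist",
    "Backend Engineer",
    "Machine Learning Engineer",
]
scores = bm25_scores("machine learning", titles)
for title, score in zip(titles, scores):
    print(f"{title}: {score:.2f}")
```

Only the title that contains both query words receives a non-zero score here, which mirrors why Edward Kim ranks first in the result above.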

Step 6: Automatic caching

sayt2 has a two-layer cache. The second query for the same search term returns instantly from the L2 cache.

[9]:

# "kubernetes" was already searched in Step 5 — this time it's a cache hit
r1 = ds.search("kubernetes")
print(f"cache hit: {r1.cache}")  # True

# A different query — cache miss
r2 = ds.search("distributed systems")
print(f"cache hit: {r2.cache}")  # False

# Search "distributed systems" again — cache hit
r3 = ds.search("distributed systems")
print(f"cache hit: {r3.cache}")  # True

cache hit: True
cache hit: False
cache hit: True
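
The hit and miss pattern above can be reproduced with a small in-memory sketch. It assumes the cache is keyed by the query string and stores the whole result, with an optional expiry in seconds. sayt2 itself keeps its cache on disk through diskcache, so the QueryCache class and its methods here are hypothetical stand-ins rather than the library's implementation.

```python
import time
from typing import Any, Callable, Optional


class QueryCache:
    """In-memory stand-in for a query result cache with optional expiry."""

    def __init__(self, expire: Optional[float] = None):
        self.expire = expire  # seconds; None means entries never expire
        self._store: dict[str, tuple[float, Any]] = {}

    def get_or_compute(
        self, query: str, compute: Callable[[str], Any]
    ) -> tuple[Any, bool]:
        """Return (result, cache_hit) for the query."""
        now = time.monotonic()
        entry = self._store.get(query)
        if entry is not None:
            stored_at, value = entry
            if self.expire is None or now - stored_at < self.expire:
                return value, True
        # Miss or expired entry: compute and store a fresh result
        value = compute(query)
        self._store[query] = (now, value)
        return value, False

    def clear(self) -> None:
        """Drop every cached entry."""
        self._store.clear()


def slow_search(query: str) -> list[str]:
    # Placeholder for a real index lookup
    return [f"result for {query!r}"]


cache = QueryCache(expire=None)
_, hit1 = cache.get_or_compute("kubernetes", slow_search)
_, hit2 = cache.get_or_compute("kubernetes", slow_search)
print(hit1, hit2)  # False True
```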

Step 7: Force refresh

Use refresh=True to force a rebuild of the index. This also invalidates the cache.

[10]:

r = ds.search("kubernetes", refresh=True)
print(f"fresh rebuild: {r.fresh}, cache: {r.cache}")  # True, False

fresh rebuild: True, cache: False

Step 8: Clean up

Always close the DataSet when you are done. A with statement is the recommended way to do this; here we call close() manually because the DataSet was created without one.

[11]:

ds.close()
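
The reason a with statement is safer than a manual close() is that the cleanup also runs when the body raises. The sketch below shows this with a hypothetical Resource class standing in for any closable object; it does not use sayt2.

```python
class Resource:
    """Hypothetical stand-in for an object that must be closed."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True

    def __enter__(self) -> "Resource":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Runs on normal exit and when an exception propagates
        self.close()


res = Resource()
try:
    with res:
        raise RuntimeError("search failed")
except RuntimeError:
    pass

print(res.closed)  # True
```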

Example 2: Book Catalog (Sort + Query Language)

This example shows advanced features: sorting, range queries, field-specific search, and boolean operators.

Step 1: Define schema with NumericFields

Setting indexed=True on a NumericField enables range queries in the query language, and setting fast=True enables sorting.

[12]:

from sayt2.api import NumericField, SortKey

book_fields = [
    KeywordField(name="id"),
    NgramField(name="title", min_gram=2, max_gram=6, boost=3.0),
    TextField(name="author", boost=2.0),
    TextField(name="description"),
    NumericField(name="year", kind="i64", indexed=True, fast=True),
    NumericField(name="price", kind="f64", indexed=True, fast=True),
    NumericField(name="rating", kind="f64", indexed=True, fast=True),
    NumericField(name="pages", kind="i64", indexed=True, fast=True),
]

Step 2: Prepare book data

[13]:

books = [
    {"id": "1", "title": "Fluent Python", "author": "Luciano Ramalho", "description": "A hands-on guide to writing effective Python code.", "year": 2022, "price": 49.99, "rating": 4.7, "pages": 1012},
    {"id": "2", "title": "Python Crash Course", "author": "Eric Matthes", "description": "A fast-paced introduction to programming with Python.", "year": 2023, "price": 35.99, "rating": 4.6, "pages": 544},
    {"id": "3", "title": "Programming Rust", "author": "Jim Blandy", "description": "Fast, safe systems development with Rust.", "year": 2021, "price": 45.99, "rating": 4.5, "pages": 738},
]

def book_downloader() -> list[dict]:
    return books

Step 3: Sort by rating (descending)

Pass sort to DataSet to order results by a numeric field.

[14]:

if dir_index.exists():
    shutil.rmtree(dir_index)

with DataSet(
    dir_root=dir_index,
    name="books_by_rating",
    fields=book_fields,
    downloader=book_downloader,
    sort=[SortKey(name="rating", descending=True)],
) as ds:
    result = ds.search("python")
    print("Books about Python, sorted by rating (highest first):")
    for hit in result.hits:
        print(f"  {hit.source['title']} — rating: {hit.source['rating']}, year: {hit.source['year']}")

Books about Python, sorted by rating (highest first):
  Fluent Python — rating: 4.7, year: 2022
  Python Crash Course — rating: 4.6, year: 2023
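
Conceptually, a descending sort key orders the matching hits by a stored field value rather than by relevance score. The plain-Python sketch below reproduces that ordering over two matching records. It sorts in Python purely for illustration, whereas the library relies on fields declared with fast=True.

```python
from operator import itemgetter

matching_books = [
    {"title": "Python Crash Course", "rating": 4.6, "year": 2023},
    {"title": "Fluent Python", "rating": 4.7, "year": 2022},
]

# Equivalent in spirit to SortKey(name="rating", descending=True)
by_rating = sorted(matching_books, key=itemgetter("rating"), reverse=True)

# Plain-Python variant with two keys: newest year first, then highest rating
by_year_then_rating = sorted(
    matching_books, key=lambda b: (b["year"], b["rating"]), reverse=True
)

print([b["title"] for b in by_rating])
# ['Fluent Python', 'Python Crash Course']
```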

Step 4: Range queries

tantivy supports Lucene-style range syntax directly in the query string.

[15]:

with DataSet(
    dir_root=dir_index,
    name="books_plain",
    fields=book_fields,
    downloader=book_downloader,
) as ds:
    # Books published between 2020 and 2023
    result = ds.search("year:[2020 TO 2023]")
    print("Books from 2020–2023:")
    for hit in result.hits:
        print(f"  {hit.source['title']} ({hit.source['year']})")

Books from 2020–2023:
  Python Crash Course (2023)
  Fluent Python (2022)
  Programming Rust (2021)

[16]:

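# Reopen the dataset, since the earlier `with` block has closed ds
with DataSet(dir_root=dir_index, name="books_plain",
             fields=book_fields, downloader=book_downloader) as ds:
    # Books priced above $40
    result = ds.search("price:>40")
    print("\nBooks priced above $40:")
    for hit in result.hits:
        print(f"  {hit.source['title']} — ${hit.source['price']}")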


Books priced above $40:
  Fluent Python — $49.99
  Programming Rust — $45.99
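
The two range forms used above have simple semantics: square brackets give a range that includes both bounds, and a comparison prefix such as > gives an open-ended one. The sketch below parses just these forms with regular expressions and filters plain dictionaries. The parse_range helper and its small grammar are illustrative assumptions and cover only a fraction of what a real query parser accepts.

```python
import re
from typing import Callable

RANGE_RE = re.compile(r"^(\w+):\[(\S+) TO (\S+)\]$")
COMPARE_RE = re.compile(r"^(\w+):(>=|<=|>|<)(\S+)$")


def parse_range(query: str) -> Callable[[dict], bool]:
    """Turn 'field:[lo TO hi]' or 'field:>value' into a record predicate."""
    m = RANGE_RE.match(query)
    if m:
        field, lo, hi = m.group(1), float(m.group(2)), float(m.group(3))
        return lambda rec: lo <= rec[field] <= hi  # both ends inclusive
    m = COMPARE_RE.match(query)
    if m:
        field, op, value = m.group(1), m.group(2), float(m.group(3))
        checks = {
            ">": lambda x: x > value,
            ">=": lambda x: x >= value,
            "<": lambda x: x < value,
            "<=": lambda x: x <= value,
        }
        return lambda rec: checks[op](rec[field])
    raise ValueError(f"unsupported range query: {query!r}")


sample_books = [
    {"title": "Fluent Python", "year": 2022, "price": 49.99},
    {"title": "Python Crash Course", "year": 2023, "price": 35.99},
    {"title": "Programming Rust", "year": 2021, "price": 45.99},
]

in_2022_2023 = [
    b["title"] for b in sample_books if parse_range("year:[2022 TO 2023]")(b)
]
above_40 = [b["title"] for b in sample_books if parse_range("price:>40")(b)]
print(in_2022_2023)  # ['Fluent Python', 'Python Crash Course']
print(above_40)      # ['Fluent Python', 'Programming Rust']
```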

Step 5: Field-specific search and boolean operators

[17]:

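# Reopen the dataset, since the earlier `with` block has closed ds
with DataSet(dir_root=dir_index, name="books_plain",
             fields=book_fields, downloader=book_downloader) as ds:
    # Search by author name
    result = ds.search("author:blandy")
    print("Books by Blandy:")
    for hit in result.hits:
        print(f"  {hit.source['title']} by {hit.source['author']}")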

Books by Blandy:
  Programming Rust by Jim Blandy

[18]:

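# Reopen the dataset, since the earlier `with` block has closed ds
with DataSet(dir_root=dir_index, name="books_plain",
             fields=book_fields, downloader=book_downloader) as ds:
    # Combine text search with range query using AND
    result = ds.search("python AND year:[2022 TO 2025]")
    print("\nPython books from 2022+:")
    for hit in result.hits:
        print(f"  {hit.source['title']} ({hit.source['year']})")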


Python books from 2022+:
  Fluent Python (2022)
  Python Crash Course (2023)
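
A boolean AND can be read as a set intersection: each clause selects a set of document IDs, and only IDs present in every set survive. The sketch below shows that idea over a tiny inverted index built in plain Python. The helper names are hypothetical, and this is a conceptual model rather than a description of how sayt2 or tantivy evaluates queries internally.

```python
from collections import defaultdict

docs = {
    "1": {"title": "Fluent Python", "year": 2022},
    "2": {"title": "Python Crash Course", "year": 2023},
    "3": {"title": "Programming Rust", "year": 2021},
}

# Inverted index: lowercase title word -> set of document IDs
inverted: dict[str, set[str]] = defaultdict(set)
for doc_id, doc in docs.items():
    for word in doc["title"].lower().split():
        inverted[word].add(doc_id)


def ids_for_term(term: str) -> set[str]:
    """IDs of documents whose title contains the word."""
    return set(inverted.get(term.lower(), set()))


def ids_for_year_range(lo: int, hi: int) -> set[str]:
    """IDs of documents whose year lies in the inclusive range."""
    return {doc_id for doc_id, doc in docs.items() if lo <= doc["year"] <= hi}


# "python AND year:[2022 TO 2025]" as an intersection of two ID sets
and_ids = ids_for_term("python") & ids_for_year_range(2022, 2025)
# A union of two ID sets models OR instead
or_ids = ids_for_term("rust") | ids_for_term("crash")

print(sorted(and_ids))  # ['1', '2']
print(sorted(or_ids))   # ['2', '3']
```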

Summary

| Feature | How to use |
| --- | --- |
| Substring search (search-as-you-type) | `NgramField` + `ds.search("ali")` |
| Full-text search | `TextField` + `ds.search("machine learning")` |
| Exact match | `KeywordField` |
| Sort results | `SortKey(name="rating", descending=True)` |
| Range query | `ds.search("year:[2020 TO 2025]")` or `ds.search("price:>40")` |
| Field-specific search | `ds.search("author:blandy")` |
| Boolean operators | `ds.search("python AND year:[2020 TO 2025]")` |
| Force refresh | `ds.search("query", refresh=True)` |
| Automatic caching | Built-in, no config needed |

For more details, see the full documentation.
