sayt2 Quick Start Guide

What is sayt2?

sayt2 is a search-as-you-type full-text search library for Python. It lets you build a search index from a list of dictionaries and query it with substring matching (ngram), full-text search (BM25), fuzzy search, range queries, sorting, and more — all through a single DataSet object.

Under the hood it uses tantivy (a Rust-based search engine) for fast indexing and querying, pydantic for configuration validation, and diskcache for a two-layer disk cache.

Install

pip install sayt2

Example 1 — People Directory (Ngram + Full-Text Search)

In this example we build a search index for a people directory. Each person has a name, job title, and bio. We will demonstrate:

  • Ngram search: substring matching on names (search-as-you-type)

  • Full-text search: word-level BM25 search on title and bio

  • Caching: automatic L2 query result cache

Step 1: Prepare the data

Our data source is a list of dictionaries. In real projects this could come from a database, an API, or a file. Here we define it inline.

[1]:
records = [
    {
        "id": "1",
        "name": "Alice Johnson",
        "title": "Senior Data Scientist",
        "bio": "Alice specializes in machine learning and natural language processing. She has published several papers on transformer architectures.",
    },
    {
        "id": "2",
        "name": "Bob Martinez",
        "title": "Backend Engineer",
        "bio": "Bob builds scalable microservices with Python and Go. He is passionate about distributed systems and database optimization.",
    },
    {
        "id": "3",
        "name": "Charlie Wang",
        "title": "Frontend Developer",
        "bio": "Charlie creates beautiful user interfaces with React and TypeScript. He advocates for accessibility and performance.",
    },
    {
        "id": "4",
        "name": "Diana Patel",
        "title": "DevOps Engineer",
        "bio": "Diana manages cloud infrastructure on AWS and Kubernetes. She automates everything from CI pipelines to monitoring dashboards.",
    },
    {
        "id": "5",
        "name": "Edward Kim",
        "title": "Machine Learning Engineer",
        "bio": "Edward trains and deploys deep learning models for computer vision. He works extensively with PyTorch and TensorFlow.",
    },
]

Step 2: Define the schema

Each field tells sayt2 how to index a column. Different field types enable different search modes.

[2]:
from sayt2.api import (
    DataSet,
    NgramField,
    TextField,
    KeywordField,
    Hit,
    SearchResult,
)

fields = [
    # KeywordField — exact match, good for IDs and tags
    KeywordField(name="id"),
    # NgramField — substring matching (search-as-you-type)
    # boost=3.0 makes name matches rank higher
    NgramField(name="name", min_gram=2, max_gram=6, boost=3.0),
    # TextField — full-text BM25 search (word-level)
    TextField(name="title", boost=2.0),
    TextField(name="bio"),
]
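
Step 3: Create the DataSet

The search cells that follow call ds.search(...), so a DataSet bound to ds has to exist first. Below is a minimal sketch of that step, modeled on the DataSet(dir_root=..., name=..., fields=..., downloader=...) call used in Example 2. The helper name open_dataset, the directory name, and the dataset name "people" are assumptions made for illustration. The DataSet class is passed in as an argument so the helper has no hard dependency of its own.

```python
import shutil
from pathlib import Path

# Hypothetical location for the on-disk index; any writable directory works.
dir_index = Path("sayt2_quickstart_index")


def open_dataset(dataset_cls, name, fields, records, dir_root=dir_index):
    """Construct dataset_cls with the keyword arguments Example 2 passes to DataSet.

    dataset_cls is expected to be sayt2.api.DataSet. open_dataset itself is a
    hypothetical helper and not part of sayt2.
    """
    # Start from a clean directory so stale index files are not reused.
    if Path(dir_root).exists():
        shutil.rmtree(dir_root)

    def downloader() -> list[dict]:
        # The downloader hands the raw records to the indexer.
        return records

    return dataset_cls(
        dir_root=dir_root,
        name=name,
        fields=fields,
        downloader=downloader,
    )
```

With the imports from Step 2 in place, ds = open_dataset(DataSet, "people", fields, records) would give the ds object used in the remaining steps of this example.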

Step 4: Ngram search — substring matching

Type a few characters and get instant results. The NgramField on name makes this possible.

[5]:
# "ali" matches "Alice Johnson"
result = ds.search("ali")
print(f"Found {result.size} results, took {result.took_ms:.1f} ms")
for hit in result.hits:
    print(f"  {hit.source['name']} (score: {hit.score:.2f})")
Found 1 results, took 263.0 ms
  Alice Johnson (score: 10.26)
[6]:
# "wan" matches "Charlie Wang"
result = ds.search("wan")
for hit in result.hits:
    print(f"  {hit.source['name']}: {hit.source['title']}")
  Charlie Wang: Frontend Developer
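
To see why a three-character query can match a name, it helps to look at what n-gram indexing stores. The sketch below is plain Python and only illustrates the idea: it lists every lowercase substring between min_gram and max_gram characters long. The function name char_ngrams is an assumption, and the real tokenizer inside tantivy may differ in details such as how it treats whitespace.

```python
def char_ngrams(text: str, min_gram: int = 2, max_gram: int = 6) -> set[str]:
    """Return every lowercase substring of text with length in [min_gram, max_gram]."""
    lowered = text.lower()
    grams = set()
    for size in range(min_gram, max_gram + 1):
        for start in range(len(lowered) - size + 1):
            grams.add(lowered[start:start + size])
    return grams


names = ["Alice Johnson", "Bob Martinez", "Charlie Wang"]
for query in ("ali", "wan"):
    # A name matches when the query is one of its stored n-grams.
    matches = [name for name in names if query in char_ngrams(name)]
    print(query, "->", matches)
```

In this sketch a one-character query matches nothing, because no gram shorter than min_gram is ever stored.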

Step 5: Full-text search — word-level BM25

TextField fields support natural language queries. Words are tokenized and ranked by relevance.

[7]:
# Search across title and bio fields
result = ds.search("machine learning")
print(f"Found {result.size} results:")
for hit in result.hits:
    print(f"  {hit.source['name']} — {hit.source['title']} (score: {hit.score:.2f})")
Found 2 results:
  Edward Kim — Machine Learning Engineer (score: 5.90)
  Alice Johnson — Senior Data Scientist (score: 2.24)
[8]:
# "kubernetes" appears only in Diana's bio
result = ds.search("kubernetes")
for hit in result.hits:
    print(f"  {hit.source['name']}: {hit.source['bio'][:80]}...")
  Diana Patel: Diana manages cloud infrastructure on AWS and Kubernetes. She automates everythi...
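
For intuition about how BM25 ranks documents, here is a small sketch of the textbook formula in plain Python. It is not sayt2's or tantivy's code, and it will not reproduce the scores printed above: the tokenizer, the function names, and the parameter values k1=1.2 and b=0.75 are assumptions chosen for illustration, and field boosts are ignored.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each document against the query with the textbook BM25 formula."""
    if not docs:
        return []
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Longer-than-average documents are penalized through b.
            norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + norm)
        scores.append(score)
    return scores


titles = [
    "Senior Data Scientist",
    "Backend Engineer",
    "Frontend Developer",
    "DevOps Engineer",
    "Machine Learning Engineer",
]
for title, score in zip(titles, bm25_scores("machine learning", titles)):
    print(f"{score:.2f}  {title}")
```

Only the title containing both query words gets a non-zero score here. In the real index the bio field is searched as well, which is why Alice also appears in the results above.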

Step 6: Automatic caching

sayt2 has a two-layer cache. The second query for the same search term returns instantly from the L2 cache.

[9]:
# "kubernetes" was already searched in Step 5 — this time it's a cache hit
r1 = ds.search("kubernetes")
print(f"cache hit: {r1.cache}")  # True

# A different query — cache miss
r2 = ds.search("distributed systems")
print(f"cache hit: {r2.cache}")  # False

# Search "distributed systems" again — cache hit
r3 = ds.search("distributed systems")
print(f"cache hit: {r3.cache}")  # True
cache hit: True
cache hit: False
cache hit: True

Step 7: Force refresh

Use refresh=True to force a rebuild of the index. This also invalidates the cache.

[10]:
r = ds.search("kubernetes", refresh=True)
print(f"fresh rebuild: {r.fresh}, cache: {r.cache}")  # True, False
fresh rebuild: True, cache: False
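
The hit, miss, and refresh behavior shown in Steps 6 and 7 can be pictured with a small query-result cache. The class below is a hypothetical in-memory sketch, not sayt2's implementation (which, per the introduction, is disk-based and has two layers). The names CachedSearch and run_query are assumptions, and so is the choice to store the fresh result again after a refresh.

```python
from typing import Callable


class CachedSearch:
    """Hypothetical query-result cache; illustrates the idea, not sayt2's internals."""

    def __init__(self, run_query: Callable[[str], list]):
        self._run_query = run_query
        self._results: dict[str, list] = {}

    def search(self, query: str, refresh: bool = False) -> tuple[list, bool]:
        """Return (hits, cache_hit) for the query."""
        if refresh:
            # A refresh discards every cached result, not only this query's.
            self._results.clear()
        if query in self._results:
            return self._results[query], True
        hits = self._run_query(query)
        self._results[query] = hits
        return hits, False


searcher = CachedSearch(lambda query: [query.upper()])
print(searcher.search("kubernetes"))                # first call: computed
print(searcher.search("kubernetes"))                # second call: served from the cache
print(searcher.search("kubernetes", refresh=True))  # cache cleared, computed again
```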

Step 8: Clean up

Always close the DataSet when you are done. A with statement is the recommended way to do this; here we close it manually because ds was created without one.

[11]:
ds.close()

Example 2 — Book Catalog (Sort + Query Language)

This example shows advanced features: sorting, range queries, field-specific search, and boolean operators.

Step 1: Define schema with NumericFields

A NumericField with indexed=True can be used in range queries in the query language; fast=True additionally enables sorting on that field.

[12]:
from sayt2.api import NumericField, SortKey

book_fields = [
    KeywordField(name="id"),
    NgramField(name="title", min_gram=2, max_gram=6, boost=3.0),
    TextField(name="author", boost=2.0),
    TextField(name="description"),
    NumericField(name="year", kind="i64", indexed=True, fast=True),
    NumericField(name="price", kind="f64", indexed=True, fast=True),
    NumericField(name="rating", kind="f64", indexed=True, fast=True),
    NumericField(name="pages", kind="i64", indexed=True, fast=True),
]

Step 2: Prepare book data

[13]:
books = [
    {"id": "1", "title": "Fluent Python", "author": "Luciano Ramalho", "description": "A hands-on guide to writing effective Python code.", "year": 2022, "price": 49.99, "rating": 4.7, "pages": 1012},
    {"id": "2", "title": "Python Crash Course", "author": "Eric Matthes", "description": "A fast-paced introduction to programming with Python.", "year": 2023, "price": 35.99, "rating": 4.6, "pages": 544},
    {"id": "3", "title": "Programming Rust", "author": "Jim Blandy", "description": "Fast, safe systems development with Rust.", "year": 2021, "price": 45.99, "rating": 4.5, "pages": 738},
]

def book_downloader() -> list[dict]:
    return books

Step 3: Sort by rating (descending)

Pass sort to DataSet to order results by a numeric field.

[14]:
if dir_index.exists():
    shutil.rmtree(dir_index)

with DataSet(
    dir_root=dir_index,
    name="books_by_rating",
    fields=book_fields,
    downloader=book_downloader,
    sort=[SortKey(name="rating", descending=True)],
) as ds:
    result = ds.search("python")
    print("Books about Python, sorted by rating (highest first):")
    for hit in result.hits:
        print(f"  {hit.source['title']} — rating: {hit.source['rating']}, year: {hit.source['year']}")
Books about Python, sorted by rating (highest first):
  Fluent Python — rating: 4.7, year: 2022
  Python Crash Course — rating: 4.6, year: 2023

Step 4: Range queries

tantivy supports Lucene-style range syntax directly in the query string.

[15]:
with DataSet(
    dir_root=dir_index,
    name="books_plain",
    fields=book_fields,
    downloader=book_downloader,
) as ds:
    # Books published between 2020 and 2023
    result = ds.search("year:[2020 TO 2023]")
    print("Books from 2020–2023:")
    for hit in result.hits:
        print(f"  {hit.source['title']} ({hit.source['year']})")
Books from 2020–2023:
  Python Crash Course (2023)
  Fluent Python (2022)
  Programming Rust (2021)
[16]:
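with DataSet(dir_root=dir_index, name="books_plain", fields=book_fields, downloader=book_downloader) as ds:
    # Books priced above $40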
    result = ds.search("price:>40")
    print("\nBooks priced above $40:")
    for hit in result.hits:
        print(f"  {hit.source['title']} — ${hit.source['price']}")

Books priced above $40:
  Fluent Python — $49.99
  Programming Rust — $45.99
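
The two range forms used above have simple semantics: square brackets include both bounds, and > is a strict lower bound. The sketch below spells that out in plain Python over a trimmed copy of the book data. It is an illustration only: filter_by_range and the two regular expressions are assumptions and handle just these two query shapes, whereas the real parsing is done by tantivy.

```python
import re

# Patterns for the two query shapes used in this step (illustrative only).
INCLUSIVE_RANGE = re.compile(r"^(\w+):\[(\S+) TO (\S+)\]$")
GREATER_THAN = re.compile(r"^(\w+):>(\S+)$")


def filter_by_range(query: str, rows: list[dict]) -> list[dict]:
    """Keep the rows whose numeric field satisfies the range query."""
    match = INCLUSIVE_RANGE.match(query)
    if match:
        field, low, high = match.group(1), float(match.group(2)), float(match.group(3))
        # Square brackets mean both bounds are included.
        return [row for row in rows if low <= row[field] <= high]
    match = GREATER_THAN.match(query)
    if match:
        field, bound = match.group(1), float(match.group(2))
        return [row for row in rows if row[field] > bound]
    raise ValueError(f"unsupported range query: {query!r}")


sample_books = [
    {"title": "Fluent Python", "year": 2022, "price": 49.99},
    {"title": "Python Crash Course", "year": 2023, "price": 35.99},
    {"title": "Programming Rust", "year": 2021, "price": 45.99},
]
for query in ("year:[2020 TO 2023]", "price:>40"):
    print(query, "->", [row["title"] for row in filter_by_range(query, sample_books)])
```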

Step 5: Field-specific search and boolean operators

[17]:
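with DataSet(dir_root=dir_index, name="books_plain", fields=book_fields, downloader=book_downloader) as ds:
    # Search by author name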
    result = ds.search("author:blandy")
    print("Books by Blandy:")
    for hit in result.hits:
        print(f"  {hit.source['title']} by {hit.source['author']}")
Books by Blandy:
  Programming Rust by Jim Blandy
[18]:
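with DataSet(dir_root=dir_index, name="books_plain", fields=book_fields, downloader=book_downloader) as ds:
    # Combine text search with range query using AND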
    result = ds.search("python AND year:[2022 TO 2025]")
    print("\nPython books from 2022+:")
    for hit in result.hits:
        print(f"  {hit.source['title']} ({hit.source['year']})")

Python books from 2022+:
  Fluent Python (2022)
  Python Crash Course (2023)
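
Conceptually, AND keeps only the documents that satisfy both clauses, which is a set intersection. The plain-Python sketch below shows this on a trimmed copy of the book data. The helper ids_where is an assumption, and the text clause is simplified to a substring test on the title, whereas the real query searches the indexed text fields.

```python
sample_books = [
    {"id": "1", "title": "Fluent Python", "year": 2022},
    {"id": "2", "title": "Python Crash Course", "year": 2023},
    {"id": "3", "title": "Programming Rust", "year": 2021},
]


def ids_where(rows, predicate) -> set[str]:
    """Return the ids of the rows for which predicate(row) is true."""
    return {row["id"] for row in rows if predicate(row)}


# Clause 1: the text clause, simplified to a title substring test.
text_ids = ids_where(sample_books, lambda row: "python" in row["title"].lower())
# Clause 2: the inclusive year range [2022 TO 2025].
year_ids = ids_where(sample_books, lambda row: 2022 <= row["year"] <= 2025)

# AND keeps only the ids present in both sets.
both = text_ids & year_ids
print(sorted(both))
```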

Summary

Feature                                How to use
-------------------------------------  -----------------------------------------------------------
Substring search (search-as-you-type)  NgramField + ds.search("ali")
Full-text search                       TextField + ds.search("machine learning")
Exact match                            KeywordField
Sort results                           SortKey(name="rating", descending=True)
Range query                            ds.search("year:[2020 TO 2025]") or ds.search("price:>40")
Field-specific search                  ds.search("author:blandy")
Boolean operators                      ds.search("python AND year:[2020 TO 2025]")
Force refresh                          ds.search("query", refresh=True)
Automatic caching                      Built-in, no config needed

For more details, see the full documentation.
