sayt2 Quick Start Guide
What is sayt2?
sayt2 is a search-as-you-type full-text search library for Python. It lets you build a search index from a list of dictionaries and query it with substring matching (ngram), full-text search (BM25), fuzzy search, range queries, sorting, and more — all through a single DataSet object.
Under the hood it uses tantivy (a Rust-based search engine) for fast indexing and querying, pydantic for configuration validation, and diskcache for a two-layer disk cache.
Install
pip install sayt2
Example 1: People Directory (Ngram + Full-Text Search)
In this example we build a search index for a people directory. Each person has a name, job title, and bio. We will demonstrate:
Ngram search: substring matching on names (search-as-you-type)
Full-text search: word-level BM25 search on title and bio
Caching: automatic L2 query result cache
Step 1: Prepare the data
Our data source is a list of dictionaries. In real projects this could come from a database, an API, or a file. Here we define it inline.
[1]:
records = [
    {
        "id": "1",
        "name": "Alice Johnson",
        "title": "Senior Data Scientist",
        "bio": "Alice specializes in machine learning and natural language processing. She has published several papers on transformer architectures.",
    },
    {
        "id": "2",
        "name": "Bob Martinez",
        "title": "Backend Engineer",
        "bio": "Bob builds scalable microservices with Python and Go. He is passionate about distributed systems and database optimization.",
    },
    {
        "id": "3",
        "name": "Charlie Wang",
        "title": "Frontend Developer",
        "bio": "Charlie creates beautiful user interfaces with React and TypeScript. He advocates for accessibility and performance.",
    },
    {
        "id": "4",
        "name": "Diana Patel",
        "title": "DevOps Engineer",
        "bio": "Diana manages cloud infrastructure on AWS and Kubernetes. She automates everything from CI pipelines to monitoring dashboards.",
    },
    {
        "id": "5",
        "name": "Edward Kim",
        "title": "Machine Learning Engineer",
        "bio": "Edward trains and deploys deep learning models for computer vision. He works extensively with PyTorch and TensorFlow.",
    },
]
Step 2: Define the schema
Each field tells sayt2 how to index a column. Different field types enable different search modes.
[2]:
from sayt2.api import (
    DataSet,
    NgramField,
    TextField,
    KeywordField,
    Hit,
    SearchResult,
)

fields = [
    # KeywordField: exact match, good for IDs and tags
    KeywordField(name="id"),
    # NgramField: substring matching (search-as-you-type)
    # boost=3.0 makes name matches rank higher
    NgramField(name="name", min_gram=2, max_gram=6, boost=3.0),
    # TextField: full-text BM25 search (word-level)
    TextField(name="title", boost=2.0),
    TextField(name="bio"),
]
Step 3: Create the DataSet and search
The DataSet is the main entry point. Pass a downloader callable that returns your data, and sayt2 will automatically build the index on the first search.
[3]:
import shutil
from pathlib import Path

dir_index = Path("./quick_start_index")

# clean up from previous runs
if dir_index.exists():
    shutil.rmtree(dir_index)


def downloader() -> list[dict]:
    """Return the raw records. In real use this could hit a DB or API."""
    return records
[4]:
ds = DataSet(
    dir_root=dir_index,
    name="people",
    fields=fields,
    downloader=downloader,
    cache_expire=None,  # no auto-expiry for this demo
)
Step 4: Ngram search (substring matching)
Type a few characters and get instant results. The NgramField on name makes this possible.
[5]:
# "ali" matches "Alice Johnson"
result = ds.search("ali")
print(f"Found {result.size} results, took {result.took_ms:.1f} ms")
for hit in result.hits:
    print(f" {hit.source['name']} (score: {hit.score:.2f})")
Found 1 results, took 263.0 ms
Alice Johnson (score: 10.26)
[6]:
# "wan" matches "Charlie Wang"
result = ds.search("wan")
for hit in result.hits:
    print(f" {hit.source['name']}: {hit.source['title']}")
Charlie Wang: Frontend Developer
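To see why a three-character query can match a full name, here is a minimal pure-Python sketch of character n-gram indexing. It is only an illustration of the idea behind min_gram=2 and max_gram=6, not how sayt2 or tantivy implement it; the helper names char_ngrams and ngram_match are hypothetical, and splitting on whitespace before generating n-grams is an assumption made for simplicity.

```python
def char_ngrams(text: str, min_gram: int = 2, max_gram: int = 6) -> set[str]:
    """Return every lowercase substring of each word with length in [min_gram, max_gram]."""
    grams = set()
    for word in text.lower().split():
        for size in range(min_gram, max_gram + 1):
            for start in range(len(word) - size + 1):
                grams.add(word[start:start + size])
    return grams


names = ["Alice Johnson", "Bob Martinez", "Charlie Wang", "Diana Patel", "Edward Kim"]

# Toy "index": each name mapped to the set of n-grams it produces.
index = {name: char_ngrams(name) for name in names}


def ngram_match(query: str) -> list[str]:
    """Names whose n-gram set contains the whole lowercased query."""
    q = query.lower()
    return [name for name, grams in index.items() if q in grams]


print(ngram_match("ali"))  # ['Alice Johnson']
print(ngram_match("wan"))  # ['Charlie Wang']
print(ngram_match("a"))    # [] because the query is shorter than min_gram
```

In this sketch a query shorter than min_gram cannot match anything, which is one reason to keep min_gram small for search-as-you-type.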
Step 5: Full-text search (word-level BM25)
TextField fields support natural language queries. Words are tokenized and ranked by relevance.
[7]:
# Search across title and bio fields
result = ds.search("machine learning")
print(f"Found {result.size} results:")
for hit in result.hits:
    print(f" {hit.source['name']} — {hit.source['title']} (score: {hit.score:.2f})")
Found 2 results:
Edward Kim — Machine Learning Engineer (score: 5.90)
Alice Johnson — Senior Data Scientist (score: 2.24)
[8]:
# "kubernetes" appears only in Diana's bio
result = ds.search("kubernetes")
for hit in result.hits:
    print(f" {hit.source['name']}: {hit.source['bio'][:80]}...")
Diana Patel: Diana manages cloud infrastructure on AWS and Kubernetes. She automates everythi...
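For intuition about how word-level ranking works, the sketch below scores three shortened records with a textbook BM25 formula. It is an illustration under stated assumptions, not sayt2's scoring code: the tokenizer, the defaults k1=1.2 and b=0.75, the merged title-plus-bio text, and the function names are all choices made for this example, so the scores will not match the ones printed above.

```python
import math
import re


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Shortened title + bio text for three of the people above.
docs = {
    "Alice Johnson": "Alice specializes in machine learning and natural language processing.",
    "Diana Patel": "Diana manages cloud infrastructure on AWS and Kubernetes.",
    "Edward Kim": "Machine Learning Engineer. Edward trains and deploys deep learning models.",
}
tokens = {name: tokenize(text) for name, text in docs.items()}
avg_len = sum(len(toks) for toks in tokens.values()) / len(tokens)


def bm25(query: str, k1: float = 1.2, b: float = 0.75) -> list[tuple[str, float]]:
    """Rank documents by the sum of per-term BM25 scores, best first."""
    scores: dict[str, float] = {}
    n_docs = len(tokens)
    for term in tokenize(query):
        df = sum(1 for toks in tokens.values() if term in toks)  # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for name, toks in tokens.items():
            tf = toks.count(term)  # term frequency in this document
            if tf == 0:
                continue
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            scores[name] = scores.get(name, 0.0) + idf * tf * (k1 + 1) / (tf + length_norm)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


for name, score in bm25("machine learning"):
    print(f"{name}: {score:.3f}")
```

Edward ranks first here because "learning" occurs twice in his text, which mirrors the ordering in the real output above.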
Step 6: Automatic caching
sayt2 has a two-layer cache. A repeated query for the same search term is served from the L2 query-result cache instead of running the search again, and result.cache reports whether that happened.
[9]:
# "kubernetes" was already searched in Step 5 — this time it's a cache hit
r1 = ds.search("kubernetes")
print(f"cache hit: {r1.cache}") # True
# A different query — cache miss
r2 = ds.search("distributed systems")
print(f"cache hit: {r2.cache}") # False
# Search "distributed systems" again — cache hit
r3 = ds.search("distributed systems")
print(f"cache hit: {r3.cache}") # True
cache hit: True
cache hit: False
cache hit: True
Step 7: Force refresh
Use refresh=True to force a rebuild of the index. This also invalidates the cache.
[10]:
r = ds.search("kubernetes", refresh=True)
print(f"fresh rebuild: {r.fresh}, cache: {r.cache}") # True, False
fresh rebuild: True, cache: False
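The hit, miss, and refresh behaviour from Steps 6 and 7 can be pictured with a small in-memory stand-in. This is only a sketch of the pattern: sayt2's real cache lives on disk via diskcache, and the Result and CachedSearcher classes and the slow_search function below are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Result:
    hits: list[str]
    cache: bool  # True when the hits came from the cache


class CachedSearcher:
    """Wrap a search function with a per-query result cache."""

    def __init__(self, search_fn):
        self._search_fn = search_fn
        self._cache: dict[str, list[str]] = {}

    def search(self, query: str, refresh: bool = False) -> Result:
        if refresh:
            # A forced refresh throws away every cached result.
            self._cache.clear()
        if query in self._cache:
            return Result(hits=self._cache[query], cache=True)
        hits = self._search_fn(query)
        self._cache[query] = hits
        return Result(hits=hits, cache=False)


bios = {
    "Bob Martinez": "He is passionate about distributed systems and database optimization.",
    "Diana Patel": "Diana manages cloud infrastructure on AWS and Kubernetes.",
}


def slow_search(query: str) -> list[str]:
    """Stand-in for the real index lookup: plain substring match on the bio."""
    return [name for name, bio in bios.items() if query.lower() in bio.lower()]


searcher = CachedSearcher(slow_search)
print(searcher.search("kubernetes").cache)                # False: first time, cache miss
print(searcher.search("kubernetes").cache)                # True: served from the cache
print(searcher.search("kubernetes", refresh=True).cache)  # False: cache was cleared
```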
Step 8: Clean up
Always close the DataSet when you are done. Using a with statement is recommended; here we close it manually because we created the DataSet without one.
[11]:
ds.close()
Example 2: Book Catalog (Sort + Query Language)
This example shows advanced features: sorting, range queries, field-specific search, and boolean operators.
Step 1: Define schema with NumericFields
NumericField with indexed=True enables range queries in the query language. fast=True enables sorting.
[12]:
from sayt2.api import NumericField, SortKey

book_fields = [
    KeywordField(name="id"),
    NgramField(name="title", min_gram=2, max_gram=6, boost=3.0),
    TextField(name="author", boost=2.0),
    TextField(name="description"),
    NumericField(name="year", kind="i64", indexed=True, fast=True),
    NumericField(name="price", kind="f64", indexed=True, fast=True),
    NumericField(name="rating", kind="f64", indexed=True, fast=True),
    NumericField(name="pages", kind="i64", indexed=True, fast=True),
]
Step 2: Prepare book data
[13]:
books = [
    {"id": "1", "title": "Fluent Python", "author": "Luciano Ramalho", "description": "A hands-on guide to writing effective Python code.", "year": 2022, "price": 49.99, "rating": 4.7, "pages": 1012},
    {"id": "2", "title": "Python Crash Course", "author": "Eric Matthes", "description": "A fast-paced introduction to programming with Python.", "year": 2023, "price": 35.99, "rating": 4.6, "pages": 544},
    {"id": "3", "title": "Programming Rust", "author": "Jim Blandy", "description": "Fast, safe systems development with Rust.", "year": 2021, "price": 45.99, "rating": 4.5, "pages": 738},
]


def book_downloader() -> list[dict]:
    return books
Step 3: Sort by rating (descending)
Pass sort to DataSet to order results by a numeric field.
[14]:
if dir_index.exists():
    shutil.rmtree(dir_index)

with DataSet(
    dir_root=dir_index,
    name="books_by_rating",
    fields=book_fields,
    downloader=book_downloader,
    sort=[SortKey(name="rating", descending=True)],
) as ds:
    result = ds.search("python")
    print("Books about Python, sorted by rating (highest first):")
    for hit in result.hits:
        print(f" {hit.source['title']} — rating: {hit.source['rating']}, year: {hit.source['year']}")
Books about Python, sorted by rating (highest first):
Fluent Python — rating: 4.7, year: 2022
Python Crash Course — rating: 4.6, year: 2023
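As a point of comparison, the same ordering can be reproduced on plain dictionaries with sorted(). This sketch only mirrors the visible result of the sort key (matches ordered by rating, highest first); it says nothing about how sayt2 uses fast fields internally, and the substring filter is a stand-in for the real search.

```python
from operator import itemgetter

# Trimmed copies of the book records, keeping only the fields used here.
books = [
    {"title": "Fluent Python", "year": 2022, "rating": 4.7},
    {"title": "Python Crash Course", "year": 2023, "rating": 4.6},
    {"title": "Programming Rust", "year": 2021, "rating": 4.5},
]

# Stand-in for the "python" search: a case-insensitive substring test on the title.
matches = [book for book in books if "python" in book["title"].lower()]

# Equivalent in spirit to a descending sort key on "rating".
by_rating = sorted(matches, key=itemgetter("rating"), reverse=True)

# Flipping to ascending order only requires dropping reverse=True.
by_year = sorted(matches, key=itemgetter("year"))

print([book["title"] for book in by_rating])  # ['Fluent Python', 'Python Crash Course']
print([book["title"] for book in by_year])    # ['Fluent Python', 'Python Crash Course']
```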
Step 4: Range queries
tantivy supports Lucene-style range syntax directly in the query string.
[15]:
with DataSet(
    dir_root=dir_index,
    name="books_plain",
    fields=book_fields,
    downloader=book_downloader,
) as ds:
    # Books published between 2020 and 2023
    result = ds.search("year:[2020 TO 2023]")
    print("Books from 2020–2023:")
    for hit in result.hits:
        print(f" {hit.source['title']} ({hit.source['year']})")
Books from 2020–2023:
Python Crash Course (2023)
Fluent Python (2022)
Programming Rust (2021)
[16]:
# Books priced above $40
result = ds.search("price:>40")
print("\nBooks priced above $40:")
for hit in result.hits:
    print(f" {hit.source['title']} — ${hit.source['price']}")
Books priced above $40:
Fluent Python — $49.99
Programming Rust — $45.99
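To make the two range forms concrete, the sketch below parses them with regular expressions and filters plain dictionaries. It handles only the two shapes used above (an inclusive field:[low TO high] and a strict field:>value) and is not tantivy's query parser; range_filter and both patterns are hypothetical names for this illustration.

```python
import re

# field:[low TO high], where square brackets mean both ends are included
RANGE_RE = re.compile(r"^(\w+):\[(\S+) TO (\S+)\]$")
# field:>value, a strict lower bound
GREATER_RE = re.compile(r"^(\w+):>(\S+)$")

books = [
    {"title": "Fluent Python", "year": 2022, "price": 49.99},
    {"title": "Python Crash Course", "year": 2023, "price": 35.99},
    {"title": "Programming Rust", "year": 2021, "price": 45.99},
]


def range_filter(query: str, records: list[dict]) -> list[dict]:
    """Return the records that satisfy a single range clause."""
    match = RANGE_RE.match(query)
    if match:
        field, low, high = match.group(1), float(match.group(2)), float(match.group(3))
        return [r for r in records if low <= r[field] <= high]
    match = GREATER_RE.match(query)
    if match:
        field, low = match.group(1), float(match.group(2))
        return [r for r in records if r[field] > low]
    raise ValueError(f"unsupported range clause: {query!r}")


print([b["title"] for b in range_filter("year:[2022 TO 2023]", books)])
# ['Fluent Python', 'Python Crash Course']
print([b["title"] for b in range_filter("price:>40", books)])
# ['Fluent Python', 'Programming Rust']
```

Unlike the real search, this sketch returns matches in input order and does no scoring.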
Step 5: Field-specific search and boolean operators
[17]:
# Search by author name
result = ds.search("author:blandy")
print("Books by Blandy:")
for hit in result.hits:
    print(f" {hit.source['title']} by {hit.source['author']}")
Books by Blandy:
Programming Rust by Jim Blandy
[18]:
# Combine text search with range query using AND
result = ds.search("python AND year:[2022 TO 2025]")
print("\nPython books from 2022+:")
for hit in result.hits:
    print(f" {hit.source['title']} ({hit.source['year']})")
Python books from 2022+:
Fluent Python (2022)
Python Crash Course (2023)
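One way to think about boolean operators is as set operations on the document IDs matched by each clause: AND is an intersection and OR is a union. The sketch below applies that idea to the book records. It illustrates the semantics only, under the assumption that the query language also accepts OR; the real engine evaluates queries differently, and the variable names are made up for this example.

```python
books = [
    {"id": "1", "title": "Fluent Python", "description": "A hands-on guide to writing effective Python code.", "year": 2022},
    {"id": "2", "title": "Python Crash Course", "description": "A fast-paced introduction to programming with Python.", "year": 2023},
    {"id": "3", "title": "Programming Rust", "description": "Fast, safe systems development with Rust.", "year": 2021},
]


def text_ids(word: str) -> set[str]:
    """IDs of books whose title or description contains the word (case-insensitive)."""
    return {
        book["id"]
        for book in books
        if word.lower() in f"{book['title']} {book['description']}".lower()
    }


def year_ids(low: int, high: int) -> set[str]:
    """IDs of books published in the inclusive range [low, high]."""
    return {book["id"] for book in books if low <= book["year"] <= high}


# python AND year:[2022 TO 2025]  ->  intersection of the two clause results
python_and_recent = text_ids("python") & year_ids(2022, 2025)

# rust OR year:[2023 TO 2025]  ->  union of the two clause results
rust_or_newest = text_ids("rust") | year_ids(2023, 2025)

print(sorted(python_and_recent))  # ['1', '2']
print(sorted(rust_or_newest))     # ['2', '3']
```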
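Summary
| Feature | How to use |
|---|---|
| Substring search (search-as-you-type) | NgramField |
| Full-text search | TextField |
| Exact match | KeywordField |
| Sort results | sort=[SortKey(name="rating", descending=True)] passed to DataSet |
| Range query | year:[2020 TO 2023] or price:>40 in the query string (NumericField with indexed=True) |
| Field-specific search | author:blandy |
| Boolean operators | python AND year:[2022 TO 2025] |
| Force refresh | ds.search("kubernetes", refresh=True) |
| Automatic caching | Built-in, no config needed |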
For more details, see the full documentation.