{ "cells": [ { "cell_type": "markdown", "id": "cc2b57e5-9ba9-4146-bd85-963b561d3b87", "metadata": {}, "source": [ "# sayt2 Quick Start Guide\n", "\n", "## What is sayt2?\n", "\n", "**sayt2** is a search-as-you-type full-text search library for Python. It lets you build a search index from a list of dictionaries and query it with substring matching (ngram), full-text search (BM25), fuzzy search, range queries, sorting, and more — all through a single `DataSet` object.\n", "\n", "Under the hood it uses [tantivy](https://github.com/quickwit-oss/tantivy-py) (a Rust-based search engine) for fast indexing and querying, [pydantic](https://docs.pydantic.dev/) for configuration validation, and [diskcache](https://grantjenks.com/docs/diskcache/) for a two-layer disk cache.\n", "\n", "## Install\n", "\n", "```bash\n", "pip install sayt2\n", "```\n", "\n", "## Example 1 — People Directory (Ngram + Full-Text Search)\n", "\n", "In this example we build a search index for a people directory. Each person has a name, job title, and bio. We will demonstrate:\n", "\n", "- **Ngram search**: substring matching on names (search-as-you-type)\n", "- **Full-text search**: word-level BM25 search on title and bio\n", "- **Caching**: automatic L2 query result cache\n", "\n", "### Step 1: Prepare the data\n", "\n", "Our data source is a list of dictionaries. In real projects this could come from a database, an API, or a file. Here we define it inline." ] }, { "cell_type": "code", "execution_count": 1, "id": "cf1fdf52-a193-4539-a246-13d54a77f8af", "metadata": { "ExecuteTime": { "end_time": "2023-09-25T15:36:14.929877Z", "start_time": "2023-09-25T15:36:14.925278Z" } }, "outputs": [], "source": [ "records = [\n", "    {\n", "        \"id\": \"1\",\n", "        \"name\": \"Alice Johnson\",\n", "        \"title\": \"Senior Data Scientist\",\n", "        \"bio\": \"Alice specializes in machine learning and natural language processing. 
She has published several papers on transformer architectures.\",\n", " },\n", " {\n", " \"id\": \"2\",\n", " \"name\": \"Bob Martinez\",\n", " \"title\": \"Backend Engineer\",\n", " \"bio\": \"Bob builds scalable microservices with Python and Go. He is passionate about distributed systems and database optimization.\",\n", " },\n", " {\n", " \"id\": \"3\",\n", " \"name\": \"Charlie Wang\",\n", " \"title\": \"Frontend Developer\",\n", " \"bio\": \"Charlie creates beautiful user interfaces with React and TypeScript. He advocates for accessibility and performance.\",\n", " },\n", " {\n", " \"id\": \"4\",\n", " \"name\": \"Diana Patel\",\n", " \"title\": \"DevOps Engineer\",\n", " \"bio\": \"Diana manages cloud infrastructure on AWS and Kubernetes. She automates everything from CI pipelines to monitoring dashboards.\",\n", " },\n", " {\n", " \"id\": \"5\",\n", " \"name\": \"Edward Kim\",\n", " \"title\": \"Machine Learning Engineer\",\n", " \"bio\": \"Edward trains and deploys deep learning models for computer vision. He works extensively with PyTorch and TensorFlow.\",\n", " },\n", "]" ] }, { "cell_type": "markdown", "id": "952172e5", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "### Step 2: Define the schema\n", "\n", "Each field tells sayt2 how to index a column. Different field types enable different search modes." 
] }, { "cell_type": "code", "execution_count": 2, "id": "b2ba8e94-1dbf-4dd3-bb98-a6cbf351060d", "metadata": {}, "outputs": [], "source": [ "from sayt2.api import (\n", " DataSet,\n", " NgramField,\n", " TextField,\n", " KeywordField,\n", " Hit,\n", " SearchResult,\n", ")\n", "\n", "fields = [\n", " # KeywordField — exact match, good for IDs and tags\n", " KeywordField(name=\"id\"),\n", " # NgramField — substring matching (search-as-you-type)\n", " # boost=3.0 makes name matches rank higher\n", " NgramField(name=\"name\", min_gram=2, max_gram=6, boost=3.0),\n", " # TextField — full-text BM25 search (word-level)\n", " TextField(name=\"title\", boost=2.0),\n", " TextField(name=\"bio\"),\n", "]" ] }, { "cell_type": "markdown", "id": "7c1a2dcb-5017-4253-8d41-6dd0329dc69e", "metadata": {}, "source": [ "### Step 3: Create the DataSet and search\n", "\n", "The `DataSet` is the main entry point. Pass a `downloader` callable that returns your data, and sayt2 will automatically build the index on the first search." ] }, { "cell_type": "code", "execution_count": 3, "id": "98afef78-36a9-4ea2-96e1-5b228d1434b8", "metadata": {}, "outputs": [], "source": [ "import shutil\n", "from pathlib import Path\n", "\n", "dir_index = Path(\"./quick_start_index\")\n", "\n", "# clean up from previous runs\n", "if dir_index.exists():\n", " shutil.rmtree(dir_index)\n", "\n", "def downloader() -> list[dict]:\n", " \"\"\"Return the raw records. 
In real use this could hit a DB or API.\"\"\"\n", " return records" ] }, { "cell_type": "code", "execution_count": 4, "id": "1b1161fe-7c2f-4a5a-8651-86b386d71f1a", "metadata": {}, "outputs": [], "source": [ "ds = DataSet(\n", " dir_root=dir_index,\n", " name=\"people\",\n", " fields=fields,\n", " downloader=downloader,\n", " cache_expire=None, # no auto-expiry for this demo\n", ")" ] }, { "cell_type": "markdown", "id": "b10e9d45-e3fe-459b-8490-0afa92c9c733", "metadata": {}, "source": [ "### Step 4: Ngram search — substring matching\n", "\n", "Type a few characters and get instant results. The `NgramField` on `name` makes this possible." ] }, { "cell_type": "code", "execution_count": 5, "id": "3b318b40-43cb-47fa-a007-eb2547962a1d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found 1 results, took 263.0 ms\n", " Alice Johnson (score: 10.26)\n" ] } ], "source": [ "# \"ali\" matches \"Alice Johnson\"\n", "result = ds.search(\"ali\")\n", "print(f\"Found {result.size} results, took {result.took_ms:.1f} ms\")\n", "for hit in result.hits:\n", " print(f\" {hit.source['name']} (score: {hit.score:.2f})\")" ] }, { "cell_type": "code", "execution_count": 6, "id": "9d110e42-d0d5-4d59-afca-3853aa772795", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Charlie Wang: Frontend Developer\n" ] } ], "source": [ "# \"wan\" matches \"Charlie Wang\"\n", "result = ds.search(\"wan\")\n", "for hit in result.hits:\n", " print(f\" {hit.source['name']}: {hit.source['title']}\")" ] }, { "cell_type": "markdown", "id": "ebe89a9c-f568-42a7-944d-ebab24b832c3", "metadata": {}, "source": [ "### Step 5: Full-text search — word-level BM25\n", "\n", "`TextField` fields support natural language queries. Words are tokenized and ranked by relevance." 
] }, { "cell_type": "code", "execution_count": 7, "id": "a182a074-04a0-4e23-9d36-b9607a7bd593", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found 2 results:\n", " Edward Kim — Machine Learning Engineer (score: 5.90)\n", " Alice Johnson — Senior Data Scientist (score: 2.24)\n" ] } ], "source": [ "# Search across title and bio fields\n", "result = ds.search(\"machine learning\")\n", "print(f\"Found {result.size} results:\")\n", "for hit in result.hits:\n", " print(f\" {hit.source['name']} — {hit.source['title']} (score: {hit.score:.2f})\")" ] }, { "cell_type": "code", "execution_count": 8, "id": "c8f5629e-90dc-47d1-9dfa-69ffa3ade33a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Diana Patel: Diana manages cloud infrastructure on AWS and Kubernetes. She automates everythi...\n" ] } ], "source": [ "# \"kubernetes\" appears only in Diana's bio\n", "result = ds.search(\"kubernetes\")\n", "for hit in result.hits:\n", " print(f\" {hit.source['name']}: {hit.source['bio'][:80]}...\")" ] }, { "cell_type": "markdown", "id": "a2ff76b8-8ee4-4d5f-aa27-0b8930f55d7b", "metadata": {}, "source": [ "### Step 6: Automatic caching\n", "\n", "sayt2 has a two-layer cache. The second query for the same search term returns instantly from the L2 cache." 
] }, { "cell_type": "code", "execution_count": 9, "id": "68eec4e7-c53b-4c2e-a6c1-ade9edfccb6e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "cache hit: True\n", "cache hit: False\n", "cache hit: True\n" ] } ], "source": [ "# \"kubernetes\" was already searched in Step 5 — this time it's a cache hit\n", "r1 = ds.search(\"kubernetes\")\n", "print(f\"cache hit: {r1.cache}\")  # True\n", "\n", "# A different query — cache miss\n", "r2 = ds.search(\"distributed systems\")\n", "print(f\"cache hit: {r2.cache}\")  # False\n", "\n", "# Search \"distributed systems\" again — cache hit\n", "r3 = ds.search(\"distributed systems\")\n", "print(f\"cache hit: {r3.cache}\")  # True" ] }, { "cell_type": "markdown", "id": "46f83ff1-7055-46c2-9955-129b7794f6f4", "metadata": {}, "source": [ "### Step 7: Force refresh\n", "\n", "Use `refresh=True` to force a rebuild of the index. This also invalidates the cache." ] }, { "cell_type": "code", "execution_count": 10, "id": "766dc089-de2f-41f5-a35a-ab8115727a56", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fresh rebuild: True, cache: False\n" ] } ], "source": [ "r = ds.search(\"kubernetes\", refresh=True)\n", "print(f\"fresh rebuild: {r.fresh}, cache: {r.cache}\")  # True, False" ] }, { "cell_type": "markdown", "id": "9d95d635-ef0f-4ba7-ae55-ef5f4d73a6d2", "metadata": {}, "source": [ "### Step 8: Clean up\n", "\n", "Always close the `DataSet` when you are done. A `with` statement is the recommended way to do this; here we call `close()` manually because the `DataSet` was created without `with`."
] }, { "cell_type": "code", "execution_count": 11, "id": "ed5ce561-8510-4031-8cd0-9a5ec5ff46ce", "metadata": {}, "outputs": [], "source": [ "ds.close()" ] }, { "cell_type": "markdown", "id": "6c100b21-d58b-4c63-825d-077381ee2d03", "metadata": {}, "source": [ "## Example 2 — Book Catalog (Sort + Query Language)\n", "\n", "This example shows advanced features: sorting, range queries, field-specific search, and boolean operators.\n", "\n", "### Step 1: Define schema with NumericFields\n", "\n", "`NumericField` with `indexed=True` enables range queries in the query language. `fast=True` enables sorting." ] }, { "cell_type": "code", "execution_count": 12, "id": "a8a4d0d8-1c11-437f-8757-47decc793f43", "metadata": {}, "outputs": [], "source": [ "from sayt2.api import NumericField, SortKey\n", "\n", "book_fields = [\n", " KeywordField(name=\"id\"),\n", " NgramField(name=\"title\", min_gram=2, max_gram=6, boost=3.0),\n", " TextField(name=\"author\", boost=2.0),\n", " TextField(name=\"description\"),\n", " NumericField(name=\"year\", kind=\"i64\", indexed=True, fast=True),\n", " NumericField(name=\"price\", kind=\"f64\", indexed=True, fast=True),\n", " NumericField(name=\"rating\", kind=\"f64\", indexed=True, fast=True),\n", " NumericField(name=\"pages\", kind=\"i64\", indexed=True, fast=True),\n", "]" ] }, { "cell_type": "markdown", "id": "c66a64fc-30e5-4f71-a4a5-388e48db43df", "metadata": {}, "source": [ "### Step 2: Prepare book data" ] }, { "cell_type": "code", "execution_count": 13, "id": "f52beabb-65cc-4820-80cc-374094b48a18", "metadata": {}, "outputs": [], "source": [ "books = [\n", " {\"id\": \"1\", \"title\": \"Fluent Python\", \"author\": \"Luciano Ramalho\", \"description\": \"A hands-on guide to writing effective Python code.\", \"year\": 2022, \"price\": 49.99, \"rating\": 4.7, \"pages\": 1012},\n", " {\"id\": \"2\", \"title\": \"Python Crash Course\", \"author\": \"Eric Matthes\", \"description\": \"A fast-paced introduction to programming with Python.\", 
\"year\": 2023, \"price\": 35.99, \"rating\": 4.6, \"pages\": 544},\n", " {\"id\": \"3\", \"title\": \"Programming Rust\", \"author\": \"Jim Blandy\", \"description\": \"Fast, safe systems development with Rust.\", \"year\": 2021, \"price\": 45.99, \"rating\": 4.5, \"pages\": 738},\n", "]\n", "\n", "def book_downloader() -> list[dict]:\n", " return books" ] }, { "cell_type": "markdown", "id": "f76315fe-1378-4cd6-b722-9170f3e7f36b", "metadata": {}, "source": [ "### Step 3: Sort by rating (descending)\n", "\n", "Pass `sort` to `DataSet` to order results by a numeric field." ] }, { "cell_type": "code", "execution_count": 14, "id": "0e09d813-dea7-427c-8202-91841337bb6a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Books about Python, sorted by rating (highest first):\n", " Fluent Python — rating: 4.7, year: 2022\n", " Python Crash Course — rating: 4.6, year: 2023\n" ] } ], "source": [ "if dir_index.exists():\n", " shutil.rmtree(dir_index)\n", "\n", "with DataSet(\n", " dir_root=dir_index,\n", " name=\"books_by_rating\",\n", " fields=book_fields,\n", " downloader=book_downloader,\n", " sort=[SortKey(name=\"rating\", descending=True)],\n", ") as ds:\n", " result = ds.search(\"python\")\n", " print(\"Books about Python, sorted by rating (highest first):\")\n", " for hit in result.hits:\n", " print(f\" {hit.source['title']} — rating: {hit.source['rating']}, year: {hit.source['year']}\")" ] }, { "cell_type": "markdown", "id": "6581613e-60e3-4963-837b-b964e3a6869a", "metadata": {}, "source": [ "### Step 4: Range queries\n", "\n", "tantivy supports Lucene-style range syntax directly in the query string." 
] }, { "cell_type": "code", "execution_count": 15, "id": "ba84002f-6b43-4b7b-aa82-4f73ca7f14ab", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Books from 2020–2023:\n", " Python Crash Course (2023)\n", " Fluent Python (2022)\n", " Programming Rust (2021)\n" ] } ], "source": [ "with DataSet(\n", "    dir_root=dir_index,\n", "    name=\"books_plain\",\n", "    fields=book_fields,\n", "    downloader=book_downloader,\n", ") as ds:\n", "    # Books published between 2020 and 2023\n", "    result = ds.search(\"year:[2020 TO 2023]\")\n", "    print(\"Books from 2020–2023:\")\n", "    for hit in result.hits:\n", "        print(f\" {hit.source['title']} ({hit.source['year']})\")" ] }, { "cell_type": "code", "execution_count": 16, "id": "74fd4174-f113-4b12-a05a-a61cd5f9c212", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Books priced above $40:\n", " Fluent Python — $49.99\n", " Programming Rust — $45.99\n" ] } ], "source": [ "# Reopen the index: the `with` block in the previous cell closed `ds`\n", "with DataSet(dir_root=dir_index, name=\"books_plain\", fields=book_fields, downloader=book_downloader) as ds:\n", "    # Books priced above $40\n", "    result = ds.search(\"price:>40\")\n", "    print(\"\\nBooks priced above $40:\")\n", "    for hit in result.hits:\n", "        print(f\" {hit.source['title']} — ${hit.source['price']}\")" ] }, { "cell_type": "markdown", "id": "709d8a3e-d344-44a7-8431-0a33b5fec9bf", "metadata": {}, "source": [ "### Step 5: Field-specific search and boolean operators" ] }, { "cell_type": "code", "execution_count": 17, "id": "c53a728f-51fd-4d48-9237-4ddf64930c0d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Books by Blandy:\n", " Programming Rust by Jim Blandy\n" ] } ], "source": [ "# Reopen the same index for a field-specific query\n", "with DataSet(dir_root=dir_index, name=\"books_plain\", fields=book_fields, downloader=book_downloader) as ds:\n", "    # Search by author name\n", "    result = ds.search(\"author:blandy\")\n", "    print(\"Books by Blandy:\")\n", "    for hit in result.hits:\n", "        print(f\" {hit.source['title']} by {hit.source['author']}\")" ] }, { "cell_type": "code", "execution_count": 18, "id": "4c314597-1737-45b0-ba76-7c59b7668933", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"\n", "Python books from 2022+:\n", " Fluent Python (2022)\n", " Python Crash Course (2023)\n" ] } ], "source": [ "# Reopen the same index for a combined query\n", "with DataSet(dir_root=dir_index, name=\"books_plain\", fields=book_fields, downloader=book_downloader) as ds:\n", "    # Combine text search with range query using AND\n", "    result = ds.search(\"python AND year:[2022 TO 2025]\")\n", "    print(\"\\nPython books from 2022+:\")\n", "    for hit in result.hits:\n", "        print(f\" {hit.source['title']} ({hit.source['year']})\")" ] }, { "cell_type": "markdown", "id": "62fcca88-f534-4955-ac57-46de6793ae7f", "metadata": {}, "source": [ "## Summary\n", "\n", "| Feature | How to use |\n", "|---------|-----------|\n", "| Substring search (search-as-you-type) | `NgramField` + `ds.search(\"ali\")` |\n", "| Full-text search | `TextField` + `ds.search(\"machine learning\")` |\n", "| Exact match | `KeywordField` |\n", "| Sort results | `SortKey(name=\"rating\", descending=True)` |\n", "| Range query | `ds.search(\"year:[2020 TO 2025]\")` or `ds.search(\"price:>40\")` |\n", "| Field-specific search | `ds.search(\"author:blandy\")` |\n", "| Boolean operators | `ds.search(\"python AND year:[2020 TO 2025]\")` |\n", "| Force refresh | `ds.search(\"query\", refresh=True)` |\n", "| Automatic caching | Built-in, no config needed |\n", "\n", "For more details, see the [full documentation](https://sayt2.readthedocs.io/).\n" ] }, { "cell_type": "code", "execution_count": null, "id": "fe8ab7c4-b074-4e84-8ffd-672f60f80ca3", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.4" } }, "nbformat": 4, "nbformat_minor": 5 }