{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cc2b57e5-9ba9-4146-bd85-963b561d3b87",
   "metadata": {},
   "source": [
    "# sayt2 Quick Start Guide\n",
    "\n",
    "## What is sayt2?\n",
    "\n",
    "**sayt2** is a search-as-you-type full-text search library for Python. It lets you build a search index from a list of dictionaries and query it with substring matching (ngram), full-text search (BM25), fuzzy search, range queries, sorting, and more — all through a single `DataSet` object.\n",
    "\n",
    "Under the hood it uses [tantivy](https://github.com/quickwit-oss/tantivy-py) (a Rust-based search engine) for fast indexing and querying, [pydantic](https://docs.pydantic.dev/) for configuration validation, and [diskcache](https://grantjenks.com/docs/diskcache/) for a two-layer disk cache.\n",
    "\n",
    "## Install\n",
    "\n",
    "```python\n",
    "pip install sayt2\n",
    "```\n",
    "\n",
    "## Example 1 — People Directory (Ngram + Full-Text Search)\n",
    "\n",
    "In this example we build a search index for a people directory. Each person has a name, job title, and bio. We will demonstrate:\n",
    "\n",
    "- **Ngram search**: substring matching on names (search-as-you-type)\n",
    "- **Full-text search**: word-level BM25 search on title and bio\n",
    "- **Caching**: automatic L2 query result cache\n",
    "\n",
    "### Step 1: Prepare the data\n",
    "\n",
    "Our data source is a list of dictionaries. In real projects this could come from a database, an API, or a file. Here we define it inline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "cf1fdf52-a193-4539-a246-13d54a77f8af",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-09-25T15:36:14.929877Z",
     "start_time": "2023-09-25T15:36:14.925278Z"
    }
   },
   "outputs": [],
   "source": [
    "records = [\n",
    "    {\n",
    "        \"id\": \"1\",\n",
    "        \"name\": \"Alice Johnson\",\n",
    "        \"title\": \"Senior Data Scientist\",\n",
    "        \"bio\": \"Alice specializes in machine learning and natural language processing. She has published several papers on transformer architectures.\",\n",
    "    },\n",
    "    {\n",
    "        \"id\": \"2\",\n",
    "        \"name\": \"Bob Martinez\",\n",
    "        \"title\": \"Backend Engineer\",\n",
    "        \"bio\": \"Bob builds scalable microservices with Python and Go. He is passionate about distributed systems and database optimization.\",\n",
    "    },\n",
    "    {\n",
    "        \"id\": \"3\",\n",
    "        \"name\": \"Charlie Wang\",\n",
    "        \"title\": \"Frontend Developer\",\n",
    "        \"bio\": \"Charlie creates beautiful user interfaces with React and TypeScript. He advocates for accessibility and performance.\",\n",
    "    },\n",
    "    {\n",
    "        \"id\": \"4\",\n",
    "        \"name\": \"Diana Patel\",\n",
    "        \"title\": \"DevOps Engineer\",\n",
    "        \"bio\": \"Diana manages cloud infrastructure on AWS and Kubernetes. She automates everything from CI pipelines to monitoring dashboards.\",\n",
    "    },\n",
    "    {\n",
    "        \"id\": \"5\",\n",
    "        \"name\": \"Edward Kim\",\n",
    "        \"title\": \"Machine Learning Engineer\",\n",
    "        \"bio\": \"Edward trains and deploys deep learning models for computer vision. He works extensively with PyTorch and TensorFlow.\",\n",
    "    },\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "952172e5",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "### Step 2: Define the schema\n",
    "\n",
    "Each field tells sayt2 how to index a column. Different field types enable different search modes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "b2ba8e94-1dbf-4dd3-bb98-a6cbf351060d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sayt2.api import (\n",
    "    DataSet,\n",
    "    NgramField,\n",
    "    TextField,\n",
    "    KeywordField,\n",
    "    Hit,\n",
    "    SearchResult,\n",
    ")\n",
    "\n",
    "fields = [\n",
    "    # KeywordField — exact match, good for IDs and tags\n",
    "    KeywordField(name=\"id\"),\n",
    "    # NgramField — substring matching (search-as-you-type)\n",
    "    # boost=3.0 makes name matches rank higher\n",
    "    NgramField(name=\"name\", min_gram=2, max_gram=6, boost=3.0),\n",
    "    # TextField — full-text BM25 search (word-level)\n",
    "    TextField(name=\"title\", boost=2.0),\n",
    "    TextField(name=\"bio\"),\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7c1a2dcb-5017-4253-8d41-6dd0329dc69e",
   "metadata": {},
   "source": [
    "### Step 3: Create the DataSet and search\n",
    "\n",
    "The `DataSet` is the main entry point. Pass a `downloader` callable that returns your data, and sayt2 will automatically build the index on the first search."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "98afef78-36a9-4ea2-96e1-5b228d1434b8",
   "metadata": {},
   "outputs": [],
   "source": [
    "import shutil\n",
    "from pathlib import Path\n",
    "\n",
    "dir_index = Path(\"./quick_start_index\")\n",
    "\n",
    "# clean up from previous runs\n",
    "if dir_index.exists():\n",
    "    shutil.rmtree(dir_index)\n",
    "\n",
    "def downloader() -> list[dict]:\n",
    "    \"\"\"Return the raw records. In real use this could hit a DB or API.\"\"\"\n",
    "    return records"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "1b1161fe-7c2f-4a5a-8651-86b386d71f1a",
   "metadata": {},
   "outputs": [],
   "source": [
    "ds = DataSet(\n",
    "    dir_root=dir_index,\n",
    "    name=\"people\",\n",
    "    fields=fields,\n",
    "    downloader=downloader,\n",
    "    cache_expire=None,  # no auto-expiry for this demo\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b10e9d45-e3fe-459b-8490-0afa92c9c733",
   "metadata": {},
   "source": [
    "### Step 4: Ngram search — substring matching\n",
    "\n",
    "Type a few characters and get instant results. The `NgramField` on `name` makes this possible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "3b318b40-43cb-47fa-a007-eb2547962a1d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found 1 results, took 263.0 ms\n",
      "  Alice Johnson (score: 10.26)\n"
     ]
    }
   ],
   "source": [
    "# \"ali\" matches \"Alice Johnson\"\n",
    "result = ds.search(\"ali\")\n",
    "print(f\"Found {result.size} results, took {result.took_ms:.1f} ms\")\n",
    "for hit in result.hits:\n",
    "    print(f\"  {hit.source['name']} (score: {hit.score:.2f})\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "9d110e42-d0d5-4d59-afca-3853aa772795",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  Charlie Wang: Frontend Developer\n"
     ]
    }
   ],
   "source": [
    "# \"wan\" matches \"Charlie Wang\"\n",
    "result = ds.search(\"wan\")\n",
    "for hit in result.hits:\n",
    "    print(f\"  {hit.source['name']}: {hit.source['title']}\")"
   ]
  },
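  {
   "cell_type": "markdown",
   "id": "ngram-sketch-0001",
   "metadata": {},
   "source": [
    "To see why a three-character query can match a full name, here is a rough pure-Python sketch of character n-gram indexing. It is an illustration only, not sayt2's or tantivy's actual tokenizer: `char_ngrams` is a hypothetical helper, and lowercasing the text is an assumption made for the demo.\n",
    "\n",
    "```python\n",
    "def char_ngrams(text: str, min_gram: int = 2, max_gram: int = 6) -> set[str]:\n",
    "    # Hypothetical helper: collect every lowercase substring whose length\n",
    "    # is between min_gram and max_gram (inclusive).\n",
    "    text = text.lower()\n",
    "    grams = set()\n",
    "    for size in range(min_gram, max_gram + 1):\n",
    "        for start in range(len(text) - size + 1):\n",
    "            grams.add(text[start:start + size])\n",
    "    return grams\n",
    "\n",
    "\n",
    "# Stand-in for an index entry: the n-grams of one name.\n",
    "indexed = char_ngrams(\"Alice Johnson\")\n",
    "\n",
    "print(\"ali\" in indexed)   # True\n",
    "print(\"john\" in indexed)  # True\n",
    "print(\"a\" in indexed)     # False: shorter than min_gram\n",
    "```\n",
    "\n",
    "In this sketch, a fragment shorter than `min_gram` is never stored, and a larger `max_gram` stores more fragments per value, so the two settings trade index size against how short or long a matching fragment can be."
   ]
  },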
  {
   "cell_type": "markdown",
   "id": "ebe89a9c-f568-42a7-944d-ebab24b832c3",
   "metadata": {},
   "source": [
    "### Step 5: Full-text search — word-level BM25\n",
    "\n",
    "`TextField` fields support natural language queries. Words are tokenized and ranked by relevance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "a182a074-04a0-4e23-9d36-b9607a7bd593",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found 2 results:\n",
      "  Edward Kim — Machine Learning Engineer (score: 5.90)\n",
      "  Alice Johnson — Senior Data Scientist (score: 2.24)\n"
     ]
    }
   ],
   "source": [
    "# Search across title and bio fields\n",
    "result = ds.search(\"machine learning\")\n",
    "print(f\"Found {result.size} results:\")\n",
    "for hit in result.hits:\n",
    "    print(f\"  {hit.source['name']} — {hit.source['title']} (score: {hit.score:.2f})\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "c8f5629e-90dc-47d1-9dfa-69ffa3ade33a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  Diana Patel: Diana manages cloud infrastructure on AWS and Kubernetes. She automates everythi...\n"
     ]
    }
   ],
   "source": [
    "# \"kubernetes\" appears only in Diana's bio\n",
    "result = ds.search(\"kubernetes\")\n",
    "for hit in result.hits:\n",
    "    print(f\"  {hit.source['name']}: {hit.source['bio'][:80]}...\")"
   ]
  },
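  {
   "cell_type": "markdown",
   "id": "bm25-sketch-0001",
   "metadata": {},
   "source": [
    "BM25 ranks a document higher when a query word occurs often in it, is rare across the collection, and the document is short. The sketch below is a textbook-style version of that formula under stated assumptions: whitespace tokenization, `k1=1.2` and `b=0.75` as commonly used defaults, and a hypothetical `bm25_scores` helper. tantivy's real tokenization and scoring may differ in detail.\n",
    "\n",
    "```python\n",
    "import math\n",
    "from collections import Counter\n",
    "\n",
    "\n",
    "def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:\n",
    "    # Hypothetical helper: score every document against the query.\n",
    "    tokenized = [doc.lower().split() for doc in docs]\n",
    "    avg_len = sum(len(tokens) for tokens in tokenized) / len(tokenized)\n",
    "    scores = []\n",
    "    for tokens in tokenized:\n",
    "        counts = Counter(tokens)\n",
    "        score = 0.0\n",
    "        for term in query.lower().split():\n",
    "            # n = number of documents that contain the term\n",
    "            n = sum(1 for other in tokenized if term in other)\n",
    "            idf = math.log(1 + (len(docs) - n + 0.5) / (n + 0.5))\n",
    "            tf = counts[term]\n",
    "            # Length normalization: long documents are penalized\n",
    "            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)\n",
    "            score += idf * tf * (k1 + 1) / norm\n",
    "        scores.append(score)\n",
    "    return scores\n",
    "\n",
    "\n",
    "titles = [\"Senior Data Scientist\", \"Backend Engineer\", \"Machine Learning Engineer\"]\n",
    "scores = bm25_scores(\"machine learning\", titles)\n",
    "best = max(range(len(titles)), key=scores.__getitem__)\n",
    "print(titles[best])  # Machine Learning Engineer\n",
    "```\n",
    "\n",
    "Only the third title contains either query word, so it is the only one with a non-zero score here. In the real index, the `boost` values on each field would further weight where a match occurs."
   ]
  },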
  {
   "cell_type": "markdown",
   "id": "a2ff76b8-8ee4-4d5f-aa27-0b8930f55d7b",
   "metadata": {},
   "source": [
    "### Step 6: Automatic caching\n",
    "\n",
    "sayt2 has a two-layer cache. The second query for the same search term returns instantly from the L2 cache."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "68eec4e7-c53b-4c2e-a6c1-ade9edfccb6e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "cache hit: True\n",
      "cache hit: False\n",
      "cache hit: True\n"
     ]
    }
   ],
   "source": [
    "# \"kubernetes\" was already searched in Step 5 — this time it's a cache hit\n",
    "r1 = ds.search(\"kubernetes\")\n",
    "print(f\"cache hit: {r1.cache}\")  # True\n",
    "\n",
    "# A different query — cache miss\n",
    "r2 = ds.search(\"distributed systems\")\n",
    "print(f\"cache hit: {r2.cache}\")  # False\n",
    "\n",
    "# Search \"distributed systems\" again — cache hit\n",
    "r3 = ds.search(\"distributed systems\")\n",
    "print(f\"cache hit: {r3.cache}\")  # True"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "46f83ff1-7055-46c2-9955-129b7794f6f4",
   "metadata": {},
   "source": [
    "### Step 7: Force refresh\n",
    "\n",
    "Use `refresh=True` to force a rebuild of the index. This also invalidates the cache."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "766dc089-de2f-41f5-a35a-ab8115727a56",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "fresh rebuild: True, cache: False\n"
     ]
    }
   ],
   "source": [
    "r = ds.search(\"kubernetes\", refresh=True)\n",
    "print(f\"fresh rebuild: {r.fresh}, cache: {r.cache}\")  # True, False"
   ]
  },
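  {
   "cell_type": "markdown",
   "id": "cache-sketch-0001",
   "metadata": {},
   "source": [
    "The behavior seen in Steps 6 and 7 can be pictured as a lookup table keyed by the query string. The sketch below uses an in-memory dict, whereas sayt2 keeps its cache on disk via diskcache; `CachedSearch` and `slow_search` are hypothetical names, not part of the sayt2 API.\n",
    "\n",
    "```python\n",
    "class CachedSearch:\n",
    "    # Hypothetical wrapper: remember the result of each query string.\n",
    "    def __init__(self, search_fn):\n",
    "        self._search_fn = search_fn\n",
    "        self._cache = {}\n",
    "\n",
    "    def search(self, query, refresh=False):\n",
    "        if refresh:\n",
    "            # A refresh drops every cached result before searching again\n",
    "            self._cache.clear()\n",
    "        if query in self._cache:\n",
    "            return self._cache[query], True  # (result, cache hit)\n",
    "        result = self._search_fn(query)\n",
    "        self._cache[query] = result\n",
    "        return result, False  # (result, cache miss)\n",
    "\n",
    "\n",
    "def slow_search(query):\n",
    "    # Stand-in for a real index lookup\n",
    "    names = [\"Alice Johnson\", \"Bob Martinez\"]\n",
    "    return [name for name in names if query in name.lower()]\n",
    "\n",
    "\n",
    "cs = CachedSearch(slow_search)\n",
    "print(cs.search(\"ali\"))                # (['Alice Johnson'], False)\n",
    "print(cs.search(\"ali\"))                # (['Alice Johnson'], True)\n",
    "print(cs.search(\"ali\", refresh=True))  # (['Alice Johnson'], False)\n",
    "```\n",
    "\n",
    "The second call returns the stored result without calling `slow_search`, and `refresh=True` empties the table first, which is why a refreshed query reports a miss."
   ]
  },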
  {
   "cell_type": "markdown",
   "id": "9d95d635-ef0f-4ba7-ae55-ef5f4d73a6d2",
   "metadata": {},
   "source": [
    "### Step 8: Clean up\n",
    "\n",
    "Always close the DataSet when done. Using `with` statement is recommended — here we close it manually since we created it without `with`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "ed5ce561-8510-4031-8cd0-9a5ec5ff46ce",
   "metadata": {},
   "outputs": [],
   "source": [
    "ds.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6c100b21-d58b-4c63-825d-077381ee2d03",
   "metadata": {},
   "source": [
    "## Example 2 — Book Catalog (Sort + Query Language)\n",
    "\n",
    "This example shows advanced features: sorting, range queries, field-specific search, and boolean operators.\n",
    "\n",
    "### Step 1: Define schema with NumericFields\n",
    "\n",
    "`NumericField` with `indexed=True` enables range queries in the query language. `fast=True` enables sorting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "a8a4d0d8-1c11-437f-8757-47decc793f43",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sayt2.api import NumericField, SortKey\n",
    "\n",
    "book_fields = [\n",
    "    KeywordField(name=\"id\"),\n",
    "    NgramField(name=\"title\", min_gram=2, max_gram=6, boost=3.0),\n",
    "    TextField(name=\"author\", boost=2.0),\n",
    "    TextField(name=\"description\"),\n",
    "    NumericField(name=\"year\", kind=\"i64\", indexed=True, fast=True),\n",
    "    NumericField(name=\"price\", kind=\"f64\", indexed=True, fast=True),\n",
    "    NumericField(name=\"rating\", kind=\"f64\", indexed=True, fast=True),\n",
    "    NumericField(name=\"pages\", kind=\"i64\", indexed=True, fast=True),\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c66a64fc-30e5-4f71-a4a5-388e48db43df",
   "metadata": {},
   "source": [
    "### Step 2: Prepare book data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "f52beabb-65cc-4820-80cc-374094b48a18",
   "metadata": {},
   "outputs": [],
   "source": [
    "books = [\n",
    "    {\"id\": \"1\", \"title\": \"Fluent Python\", \"author\": \"Luciano Ramalho\", \"description\": \"A hands-on guide to writing effective Python code.\", \"year\": 2022, \"price\": 49.99, \"rating\": 4.7, \"pages\": 1012},\n",
    "    {\"id\": \"2\", \"title\": \"Python Crash Course\", \"author\": \"Eric Matthes\", \"description\": \"A fast-paced introduction to programming with Python.\", \"year\": 2023, \"price\": 35.99, \"rating\": 4.6, \"pages\": 544},\n",
    "    {\"id\": \"3\", \"title\": \"Programming Rust\", \"author\": \"Jim Blandy\", \"description\": \"Fast, safe systems development with Rust.\", \"year\": 2021, \"price\": 45.99, \"rating\": 4.5, \"pages\": 738},\n",
    "]\n",
    "\n",
    "def book_downloader() -> list[dict]:\n",
    "    return books"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f76315fe-1378-4cd6-b722-9170f3e7f36b",
   "metadata": {},
   "source": [
    "### Step 3: Sort by rating (descending)\n",
    "\n",
    "Pass `sort` to `DataSet` to order results by a numeric field."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "0e09d813-dea7-427c-8202-91841337bb6a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Books about Python, sorted by rating (highest first):\n",
      "  Fluent Python — rating: 4.7, year: 2022\n",
      "  Python Crash Course — rating: 4.6, year: 2023\n"
     ]
    }
   ],
   "source": [
    "if dir_index.exists():\n",
    "    shutil.rmtree(dir_index)\n",
    "\n",
    "with DataSet(\n",
    "    dir_root=dir_index,\n",
    "    name=\"books_by_rating\",\n",
    "    fields=book_fields,\n",
    "    downloader=book_downloader,\n",
    "    sort=[SortKey(name=\"rating\", descending=True)],\n",
    ") as ds:\n",
    "    result = ds.search(\"python\")\n",
    "    print(\"Books about Python, sorted by rating (highest first):\")\n",
    "    for hit in result.hits:\n",
    "        print(f\"  {hit.source['title']} — rating: {hit.source['rating']}, year: {hit.source['year']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6581613e-60e3-4963-837b-b964e3a6869a",
   "metadata": {},
   "source": [
    "### Step 4: Range queries\n",
    "\n",
    "tantivy supports Lucene-style range syntax directly in the query string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "ba84002f-6b43-4b7b-aa82-4f73ca7f14ab",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Books from 2020–2023:\n",
      "  Python Crash Course (2023)\n",
      "  Fluent Python (2022)\n",
      "  Programming Rust (2021)\n"
     ]
    }
   ],
   "source": [
    "with DataSet(\n",
    "    dir_root=dir_index,\n",
    "    name=\"books_plain\",\n",
    "    fields=book_fields,\n",
    "    downloader=book_downloader,\n",
    ") as ds:\n",
    "    # Books published between 2020 and 2023\n",
    "    result = ds.search(\"year:[2020 TO 2023]\")\n",
    "    print(\"Books from 2020–2023:\")\n",
    "    for hit in result.hits:\n",
    "        print(f\"  {hit.source['title']} ({hit.source['year']})\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "74fd4174-f113-4b12-a05a-a61cd5f9c212",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Books priced above $40:\n",
      "  Fluent Python — $49.99\n",
      "  Programming Rust — $45.99\n"
     ]
    }
   ],
   "source": [
    "    # Books priced above $40\n",
    "    result = ds.search(\"price:>40\")\n",
    "    print(\"\\nBooks priced above $40:\")\n",
    "    for hit in result.hits:\n",
    "        print(f\"  {hit.source['title']} — ${hit.source['price']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "709d8a3e-d344-44a7-8431-0a33b5fec9bf",
   "metadata": {},
   "source": [
    "### Step 5: Field-specific search and boolean operators"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "c53a728f-51fd-4d48-9237-4ddf64930c0d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Books by Blandy:\n",
      "  Programming Rust by Jim Blandy\n"
     ]
    }
   ],
   "source": [
    "    # Search by author name\n",
    "    result = ds.search(\"author:blandy\")\n",
    "    print(\"Books by Blandy:\")\n",
    "    for hit in result.hits:\n",
    "        print(f\"  {hit.source['title']} by {hit.source['author']}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "4c314597-1737-45b0-ba76-7c59b7668933",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Python books from 2022+:\n",
      "  Fluent Python (2022)\n",
      "  Python Crash Course (2023)\n"
     ]
    }
   ],
   "source": [
    "    # Combine text search with range query using AND\n",
    "    result = ds.search(\"python AND year:[2022 TO 2025]\")\n",
    "    print(\"\\nPython books from 2022+:\")\n",
    "    for hit in result.hits:\n",
    "        print(f\"  {hit.source['title']} ({hit.source['year']})\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62fcca88-f534-4955-ac57-46de6793ae7f",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "| Feature | How to use |\n",
    "|---------|-----------|\n",
    "| Substring search (search-as-you-type) | `NgramField` + `ds.search(\"ali\")` |\n",
    "| Full-text search | `TextField` + `ds.search(\"machine learning\")` |\n",
    "| Exact match | `KeywordField` |\n",
    "| Sort results | `SortKey(name=\"rating\", descending=True)` |\n",
    "| Range query | `ds.search(\"year:[2020 TO 2025]\")` or `ds.search(\"price:>40\")` |\n",
    "| Field-specific search | `ds.search(\"author:blandy\")` |\n",
    "| Boolean operators | `ds.search(\"python AND year:[2020 TO 2025]\")` |\n",
    "| Force refresh | `ds.search(\"query\", refresh=True)` |\n",
    "| Automatic caching | Built-in, no config needed |\n",
    "\n",
    "For more details, see the [full documentation](https://sayt2.readthedocs.io/).\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}