Full-Text Search with DuckDB
4 hours ago
- #Full-Text Search
- #DuckDB
- #Database
- DuckDB's Full-Text Search (FTS) extension ranks documents with the Okapi BM25 scoring function and supports stemming, stop word removal, and accent stripping.
- Setting up FTS in DuckDB starts with `INSTALL fts;` to install the extension and `LOAD fts;` to load it. The data is then preprocessed (e.g., converting emails to JSON), and the text columns are indexed with `PRAGMA create_fts_index`.
- Queries can be refined with the `conjunctive` parameter, which requires every query term to be present, and with the Okapi BM25 parameters `k₁` (how quickly repeated occurrences of a term stop adding to the score) and `b` (how strongly scores are normalized by document length), which tune scoring to the use case.
- Limitations include the lack of built-in match highlighting (an equivalent of Postgres' `ts_headline`) and a smaller feature set than Elasticsearch or Postgres, but the extension is well suited to exploratory analysis.
- The workflow demonstrates processing a corpus of emails, creating an FTS index, and executing queries with customizable scoring, emphasizing DuckDB's ease of use for quick, ad-hoc text search tasks.
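
To make the roles of `k₁`, `b`, and `conjunctive` concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is not DuckDB's implementation: the function names `tokenize` and `bm25_scores`, the whitespace tokenizer, the sample documents, and the particular IDF variant (`log(1 + (N - df + 0.5) / (df + 0.5))`) are all assumptions chosen for illustration, and it skips stemming, stop word removal, and accent stripping.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split; a real index would also
    # stem, drop stop words, and strip accents.
    return text.lower().split()


def bm25_scores(docs, query, k1=1.2, b=0.75, conjunctive=False):
    """Return one BM25 score per document, or None where nothing matches."""
    tokenized = [tokenize(d) for d in docs]
    if not tokenized:
        return []
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    terms = tokenize(query)

    # Document frequency: how many documents contain each query term.
    df = {t: sum(1 for toks in tokenized if t in toks) for t in terms}

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        # conjunctive=True: the document must contain every query term.
        if conjunctive and not all(tf[t] > 0 for t in terms):
            scores.append(None)
            continue
        # b scales the length penalty: b=0 ignores document length entirely.
        length_norm = 1 - b + b * len(toks) / avg_len
        score = 0.0
        for t in terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            # k1 controls how fast repeated occurrences saturate.
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * length_norm)
        scores.append(score if score > 0 else None)
    return scores


# Hypothetical mini-corpus standing in for the email bodies.
docs = [
    "meeting schedule for the energy trading desk",
    "energy prices",
    "lunch schedule",
]
any_term = bm25_scores(docs, "energy schedule")
all_terms = bm25_scores(docs, "energy schedule", conjunctive=True)
```

With the default settings every document matches at least one term and receives a score, while `conjunctive=True` leaves only the first document, which contains both `energy` and `schedule`. Setting `b=0` removes the advantage that short documents get from length normalization. In DuckDB the same knobs are passed as the `k`, `b`, and `conjunctive` arguments of the `match_bm25` function that the FTS index creates.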