Modernizing my "150-line" Python search engine
3 months ago
- #Python
- #Search Engine
- #Modern Tooling
- The author updated a Python full-text search engine project to load its data from Hugging Face datasets instead of the discontinued Wikipedia abstract XML dumps.
- The original project used requests to download Wikipedia abstracts and lxml to parse them, but that data source has since been discontinued.
- Hugging Face's Wikipedia dataset provided a suitable replacement with full article texts, simplifying the data pipeline.
- Modern Python tooling was introduced, including pyproject.toml, uv for dependency management, ruff for linting, and pytest for testing.
- A GitHub Actions workflow was set up for CI, running the tests across multiple Python versions.
- The core search logic (inverted index, TF-IDF scoring) remained unchanged, focusing updates on surrounding tooling and data handling.
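For readers unfamiliar with the core logic mentioned in the last point, here is a minimal sketch of an inverted index with TF-IDF ranking. It is not the author's code: the names `tokenize` and `Index`, the regex tokenizer, the base-10 logarithm, and the "all query terms must match" rule are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase and keep runs of ASCII letters/digits only.
    return re.findall(r"[a-z0-9]+", text.lower())


class Index:
    """Hypothetical minimal inverted index with TF-IDF scoring."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> ids of documents containing it
        self.term_counts = {}             # doc id -> Counter of token frequencies

    def add(self, doc_id, text):
        counts = Counter(tokenize(text))
        self.term_counts[doc_id] = counts
        for token in counts:
            self.postings[token].add(doc_id)

    def idf(self, token):
        # Inverse document frequency: rarer tokens get a higher weight.
        n_docs = len(self.term_counts)
        doc_freq = len(self.postings.get(token, ()))
        return math.log10(n_docs / doc_freq) if doc_freq else 0.0

    def search(self, query):
        tokens = tokenize(query)
        if not tokens:
            return []
        # Assumption: AND semantics, a document must contain every query token.
        candidate_sets = [self.postings.get(token, set()) for token in tokens]
        matches = set.intersection(*candidate_sets)
        scored = [
            (doc_id, sum(self.term_counts[doc_id][t] * self.idf(t) for t in tokens))
            for doc_id in matches
        ]
        # Highest score first.
        return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    index = Index()
    index.add(1, "python search engine in python")
    index.add(2, "a search engine written in rust")
    index.add(3, "cooking with cast iron")
    for doc_id, score in index.search("python search"):
        print(doc_id, round(score, 3))
```

Because the index only needs a document id and its text, a structure like this is indifferent to where the documents come from, which is consistent with the data source being swapped while the search logic stayed the same.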