LangExtract: Python library for extracting structured data from language models

9 months ago

LangExtract is a Python library for extracting structured information from unstructured text using LLMs.
Key features include precise source grounding, reliable structured outputs, optimized long document processing, interactive visualization, flexible LLM support, and adaptability to any domain.
It supports cloud-hosted models such as Google Gemini, which require an API key, as well as local models run through Ollama.
Quick start involves defining a prompt, providing examples, and running extraction with a few lines of code.
Installation is straightforward via pip, with options for development mode and Docker.
API key setup can be done via environment variables, .env files, or directly in code (not recommended for production).
Examples include processing full texts like Romeo and Juliet and extracting medical information from clinical notes.
Contributions are welcome, with guidelines provided in CONTRIBUTING.md.
Testing can be done locally with pytest or tox, with instructions for handling dependencies.
A disclaimer notes that LangExtract is not an officially supported Google product; it is licensed under the Apache 2.0 License.
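
The quick start summarized above (define a prompt, provide examples, run the extraction) might look roughly like the sketch below. It follows the call names as recalled from the project README (lx.extract, lx.data.ExampleData, lx.data.Extraction), so check them against the installed version before relying on it. The model ID, the sample text, and the wrapper function run_quickstart are illustrative assumptions, not part of the library.

```python
import textwrap

# Step 1: describe the extraction task in plain language.
PROMPT = textwrap.dedent("""\
    Extract characters and emotions in order of appearance.
    Use exact text from the input for each extraction; do not paraphrase.""")

# Illustrative input; any unstructured text could be passed instead.
SAMPLE_TEXT = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"


def run_quickstart(text: str = SAMPLE_TEXT, model_id: str = "gemini-2.5-flash"):
    """Hypothetical wrapper: runs one extraction.

    Requires `pip install langextract` and, for cloud models, an API key.
    """
    # Imported lazily so this module can be loaded without the library installed.
    import langextract as lx

    # Step 2: provide at least one worked example to guide the model.
    examples = [
        lx.data.ExampleData(
            text="ROMEO. But soft! What light through yonder window breaks?",
            extractions=[
                lx.data.Extraction(
                    extraction_class="character",
                    extraction_text="ROMEO",
                    attributes={"emotional_state": "wonder"},
                ),
            ],
        )
    ]

    # Step 3: run the extraction (argument names assumed from the README).
    return lx.extract(
        text_or_documents=text,
        prompt_description=PROMPT,
        examples=examples,
        model_id=model_id,
    )
```

Calling run_quickstart() should return a result object whose extractions can be inspected; the exact shape of that object depends on the library version.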
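
The three API key options mentioned above (environment variable, .env file, or a key passed directly in code) can be combined in a small lookup helper. This is a standard-library sketch, not LangExtract's own loading logic. The variable name LANGEXTRACT_API_KEY is assumed from the project README, and the helper names and the precedence order (explicit key, then environment, then .env file) are choices made here for illustration.

```python
import os
from pathlib import Path
from typing import Dict, Optional

# Assumed variable name; confirm against the LangExtract documentation.
ENV_VAR = "LANGEXTRACT_API_KEY"


def read_dotenv(path: Path) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines from a .env file; returns {} if the file is missing."""
    values: Dict[str, str] = {}
    if not path.is_file():
        return values
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_api_key(
    explicit: Optional[str] = None,
    dotenv_path: Path = Path(".env"),
) -> Optional[str]:
    """Hypothetical helper: return the first key found, or None."""
    if explicit:  # passed directly in code (not recommended for production)
        return explicit
    from_env = os.environ.get(ENV_VAR)
    if from_env:  # exported in the shell environment
        return from_env
    return read_dotenv(dotenv_path).get(ENV_VAR)  # fall back to the .env file
```

Keeping the key out of source code, and the .env file out of version control, is the reason the in-code option is discouraged for production use.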
