All of human cooking compressed into 2 megabytes
7 hours ago
- #multilingual corpus
- #food embeddings
- #ingredient analysis
- Epicure is a family of three sibling skip-gram ingredient embeddings trained from scratch on a multilingual recipe corpus.
- The dataset aggregates 4.14 million recipes from 11 sources across seven languages, normalized to 1,790 canonical ingredient entries using an LLM-augmented pipeline.
- Three Metapath2Vec variants are developed: Cooc (co-occurrence graph only), Chem (typed compound metapaths only), and Core (blends both with controlled mixing).
- The embeddings are seeded using two graphs: a 203,508-edge ingredient-ingredient NPMI (normalized pointwise mutual information) graph and an 80,019-edge typed FlavorDB ingredient-compound graph with 2,247 compound nodes across 15 categories.
- Each model represents a distinct point on the spectrum between chemistry and recipe context.
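
To make the ingredient-ingredient NPMI graph concrete, here is a minimal sketch of how such edge weights could be computed from recipe-level co-occurrence counts. It uses the standard NPMI definition, PMI divided by the negative log of the joint probability. The function name `npmi_edges`, the `min_count` threshold, and the toy recipes are illustrative assumptions, not details of the Epicure pipeline, which may filter or weight edges differently.

```python
import math
from collections import Counter
from itertools import combinations


def npmi_edges(recipes, min_count=1):
    """Map each co-occurring ingredient pair to its NPMI score in [-1, 1]."""
    n = len(recipes)
    ingredient_count = Counter()
    pair_count = Counter()
    for recipe in recipes:
        items = sorted(set(recipe))  # dedupe; sorting gives a stable pair key
        ingredient_count.update(items)
        pair_count.update(combinations(items, 2))

    edges = {}
    for (a, b), c in pair_count.items():
        if c < min_count or c == n:
            # Skip rare pairs; NPMI is undefined when a pair is in every recipe.
            continue
        p_ab = c / n
        p_a = ingredient_count[a] / n
        p_b = ingredient_count[b] / n
        pmi = math.log(p_ab / (p_a * p_b))
        edges[(a, b)] = pmi / -math.log(p_ab)  # normalize PMI by -log p(a, b)
    return edges


# Hypothetical toy corpus, for illustration only.
toy_recipes = [
    ["tomato", "basil"],
    ["tomato", "basil"],
    ["egg", "flour"],
    ["egg", "flour"],
]
for pair, score in sorted(npmi_edges(toy_recipes).items()):
    print(pair, round(score, 3))
```

Pairs that only ever appear together score 1, statistically independent pairs score 0, and pairs that never co-occur get no edge at all, which is one reason NPMI is a convenient weight for a sparse ingredient graph.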