Hasty Briefsbeta

Bilingual

All of human cooking compressed into 2 megabytes

7 hours ago
  • #multilingual corpus
  • #food embeddings
  • #ingredient analysis
  • Epicure is a family of three sibling skip-gram ingredient embeddings trained from scratch on a multilingual recipe corpus.
  • The dataset aggregates 4.14 million recipes from 11 sources across seven languages, normalized to 1,790 canonical ingredient entries using an LLM-augmented pipeline.
  • Three Metapath2Vec variants are developed: Cooc (co-occurrence graph only), Chem (typed compound metapaths only), and Core (blends both with controlled mixing).
  • The embeddings are seeded using a 203,508-edge ingredient-ingredient NPMI graph and an 80,019-edge typed FlavorDB ingredient-compound graph with 2,247 compound nodes across 15 categories.
  • Each model represents a distinct point on the spectrum between chemistry and recipe context.