Can LLMs model real-world systems in TLA+?

11 hours ago

LLMs can produce syntactically correct TLA+ specifications but often fail to model specific system implementations accurately, instead reciting textbook examples.
SysMoBench is an automated benchmark that evaluates LLM-generated TLA+ specs through four phases: syntax, runtime, conformance, and invariance, revealing gaps between generic and system-specific modeling.
Common failure patterns include specs entering unreachable states or missing reachable states due to mismatched data structures and fused atomic actions, highlighting the need for conformance checks.
Transition Validation in SysMoBench provides per-action diagnostics by comparing spec transitions against real system execution traces, pinpointing misalignments.
Evaluation results show LLMs perform well on syntax but struggle with conformance and invariants, especially for complex distributed systems, indicating challenges in abstracting real-world logic.
Open challenges include trace coverage limitations, state abstraction losses, and generalizability across systems, with ongoing efforts to improve automation and collaboration.

Hasty Briefsbeta