Human Judgment as a Specification

3 days ago

The rise of GenAI in programming necessitates formal methods to ensure AI systems produce desired solutions, requiring mathematical specifications that many programmers lack expertise in.
LLMs can translate prose into formal specifications, but this approach risks generating incorrect or ambiguous specs due to human error, misconceptions, or lack of ground truth, which LLMs alone cannot fix.
Human involvement is crucial for formalizing specifications, targeting responsible programmers who care about quality but are limited by laziness or busyness; solutions must be both meaningful and moderate in effort.
PICK is a tool that presents multiple plausible outputs (e.g., regexes) and asks users to vote on concrete examples to distinguish candidates, enabling spec validation without domain-specific redesigns.
PICK works across domains like regular expressions, linear temporal logic, and attribute-based access control by leveraging closure under negation/intersection and sampling from differences to show user disagreements.
Synthesis (e.g., via LLMs) can produce incorrect outputs if the specification is flawed, as verification relies on independence; PICK provides an independent witness through user judgments to catch such errors.
Human judgments in PICK serve as a specification by having users commit to concrete behaviors, elucidating implicit details from prompts and sharpening vague intents, even when models improve, shifting effort toward validation.

Hasty Briefsbeta