LLM Output Drift in Financial Workflows: Validation and Mitigation (arXiv)
- #AI Compliance
- #Machine Learning
- #Financial Technology
- Financial institutions use Large Language Models (LLMs) for tasks like reconciliations, regulatory reporting, and client communications, but output drift (non-identical responses to identical inputs) undermines auditability and trust.
- Smaller models (e.g., Granite-3-8B, Qwen2.5-7B) achieve 100% output consistency at temperature T=0.0, while larger models such as GPT-OSS-120B show only 12.5% consistency, challenging the assumption that larger models are always better (see the first sketch after this list).
- The study introduces a finance-calibrated deterministic test harness, task-specific invariant checking, a three-tier model classification system, and an audit-ready attestation system with dual-provider validation (illustrated in the second sketch below).
- Evaluation of five models across 480 runs shows that structured tasks (SQL generation) remain stable even at T=0.2, while retrieval-augmented generation (RAG) tasks exhibit significant drift (25-75%), indicating that drift sensitivity is task-dependent (third sketch below).
- The framework aligns with regulatory requirements from the Financial Stability Board (FSB), the Bank for International Settlements (BIS), and the Commodity Futures Trading Commission (CFTC), offering compliance-ready AI deployment pathways.
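
A minimal sketch of the consistency measurement behind the first finding, assuming a caller-supplied `generate(prompt, temperature=...)` function; the paper's actual harness, model bindings, normalization rules, and run counts are not shown here, and byte-identical comparison is an assumption.

```python
from collections import Counter

def consistency_rate(generate, prompt: str, runs: int = 16,
                     temperature: float = 0.0) -> float:
    """Fraction of repeated completions that match the modal output exactly.

    A score of 1.0 corresponds to the "100% output consistency" reported
    for the smaller models; a score like 0.125 would correspond to the
    12.5% figure, under this (assumed) modal-match definition.
    """
    # Repeat the same prompt at a fixed temperature and collect outputs.
    outputs = [generate(prompt, temperature=temperature) for _ in range(runs)]
    # Count how often the single most common completion recurs.
    _, count = Counter(outputs).most_common(1)[0]
    return count / runs
```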
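A hedged illustration of task-specific invariant checking plus an audit-ready attestation record, as the third bullet describes. The specific invariants, the `attest` schema, and the hash-chaining choice here are assumptions made for the example, not the paper's exact definitions.

```python
import hashlib
import time

def sql_invariants_hold(sql: str) -> bool:
    """Example invariant for a structured SQL task: output must be read-only."""
    banned = ("insert", "update", "delete", "drop", "alter")
    lowered = sql.lower()
    return lowered.lstrip().startswith("select") and not any(b in lowered for b in banned)

def attest(prompt: str, output: str, model: str, passed: bool) -> dict:
    """Build a tamper-evident record that a second provider could re-verify,
    a stand-in for the paper's dual-provider validation."""
    # Hash prompt and output together so any change to either is detectable.
    digest = hashlib.sha256((prompt + "\x00" + output).encode()).hexdigest()
    return {"model": model, "sha256": digest,
            "invariants_passed": passed, "timestamp": time.time()}
```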
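A sketch of the task-by-temperature drift measurement summarized in the fourth bullet, reusing `consistency_rate` from the first snippet; the temperature grid mirrors the bullet, while the task names and prompts are placeholders.

```python
def drift_table(generate, prompts_by_task: dict[str, str],
                temperatures=(0.0, 0.2)) -> dict[tuple[str, float], float]:
    """Drift = 1 - consistency per (task, temperature) cell; e.g. the
    reported 25-75% RAG drift would appear here as values 0.25-0.75."""
    return {(task, t): 1.0 - consistency_rate(generate, prompt, temperature=t)
            for task, prompt in prompts_by_task.items()
            for t in temperatures}

# Hypothetical usage: drift_table(generate, {"sql": "...", "rag": "..."})
```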