LLM Output Drift in Financial Workflows: Validation and Mitigation (arXiv)
- #AI Compliance
- #Machine Learning
- #Financial Technology
- Financial institutions use Large Language Models (LLMs) for tasks like reconciliations, regulatory reporting, and client communications, but output drift (non-identical responses to identical inputs) undermines auditability and trust.
- Smaller models (e.g., Granite-3-8B, Qwen2.5-7B) achieve 100% output consistency at temperature T=0.0, while larger models such as GPT-OSS-120B show only 12.5% consistency, challenging the assumption that larger models are always better (see the first sketch after this list).
- The study introduces a finance-calibrated deterministic test harness, task-specific invariant checking, a three-tier model classification system, and an audit-ready attestation system with dual-provider validation (illustrated in the second sketch below).
- Evaluation of five models across 480 runs shows that structured tasks (SQL generation) remain stable even at T=0.2, while retrieval-augmented generation (RAG) tasks exhibit significant drift (25-75%), indicating that drift sensitivity is task-dependent (third sketch below).
- The framework aligns with regulatory requirements from the Financial Stability Board (FSB), the Bank for International Settlements (BIS), and the Commodity Futures Trading Commission (CFTC), offering compliance-ready AI deployment pathways.
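
A minimal sketch of the consistency measurement behind the first finding, assuming a caller-supplied `generate(prompt, temperature=...)` function; the paper's actual harness, model bindings, normalization rules, and run counts are not shown here, and byte-identical comparison is an assumption.

```python
from collections import Counter

def consistency_rate(generate, prompt: str, runs: int = 16,
                     temperature: float = 0.0) -> float:
    """Fraction of repeated completions that match the modal output exactly.

    A score of 1.0 corresponds to the "100% output consistency" reported
    for the smaller models; a score like 0.125 would correspond to the
    12.5% figure, under this (assumed) modal-match definition.
    """
    # Repeat the same prompt at a fixed temperature and collect outputs.
    outputs = [generate(prompt, temperature=temperature) for _ in range(runs)]
    # Count how often the single most common completion recurs.
    _, count = Counter(outputs).most_common(1)[0]
    return count / runs
```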
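A hedged illustration of task-specific invariant checking plus an audit-ready attestation record, as the third bullet describes. The specific invariants, the `attest` schema, and the hash-chaining choice here are assumptions made for the example, not the paper's exact definitions.

```python
import hashlib
import time

def sql_invariants_hold(sql: str) -> bool:
    """Example invariant for a structured SQL task: output must be read-only."""
    banned = ("insert", "update", "delete", "drop", "alter")
    lowered = sql.lower()
    return lowered.lstrip().startswith("select") and not any(b in lowered for b in banned)

def attest(prompt: str, output: str, model: str, passed: bool) -> dict:
    """Build a tamper-evident record that a second provider could re-verify,
    a stand-in for the paper's dual-provider validation."""
    # Hash prompt and output together so any change to either is detectable.
    digest = hashlib.sha256((prompt + "\x00" + output).encode()).hexdigest()
    return {"model": model, "sha256": digest,
            "invariants_passed": passed, "timestamp": time.time()}
```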
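A sketch of the task-by-temperature drift measurement summarized in the fourth bullet, reusing `consistency_rate` from the first snippet; the temperature grid mirrors the bullet, while the task names and prompts are placeholders.

```python
def drift_table(generate, prompts_by_task: dict[str, str],
                temperatures=(0.0, 0.2)) -> dict[tuple[str, float], float]:
    """Drift = 1 - consistency per (task, temperature) cell; e.g. the
    reported 25-75% RAG drift would appear here as values 0.25-0.75."""
    return {(task, t): 1.0 - consistency_rate(generate, prompt, temperature=t)
            for task, prompt in prompts_by_task.items()
            for t in temperatures}

# Hypothetical usage: drift_table(generate, {"sql": "...", "rag": "..."})
```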