Learning to Replicate Expert Judgment in Financial Tasks

21 days ago

Outperforming the market is hard because it depends on unique insight rooted in investor judgment, which is difficult to articulate or teach directly.
LLMs struggle with simple financial tasks like filtering and processing documents, even though these are routine for investors.
The post explores automating information triage using LLMs, showing that with expert annotations, proprietary models can achieve expert-level judgment.
Frontier models (e.g., Gemini, Claude, GPT) underperform on six filtering tasks, with accuracy of roughly 50-80%, which generally falls short of the 80% threshold for trust.
Improved prompting raised accuracy to the mid-70% range, but further gains required fine-tuning on high-quality human-labeled data.
A custom training dataset was built using expert verification to correct non-expert labels, enhancing data quality.
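The label-correction step can be pictured with a minimal sketch. This summary does not describe the post's actual workflow, so the function name, the label values, and the rule that an expert review always overrides the non-expert label are assumptions made purely for illustration.

```python
def apply_expert_corrections(non_expert_labels, expert_reviews):
    """Merge labels, letting an expert review override the non-expert label.

    Both arguments map a document id to a label. The override rule is a
    hypothetical simplification, not the post's documented procedure.
    """
    final_labels = dict(non_expert_labels)
    corrected_ids = []
    for doc_id, expert_label in expert_reviews.items():
        # Record only the documents where the expert disagreed.
        if non_expert_labels.get(doc_id) != expert_label:
            corrected_ids.append(doc_id)
        final_labels[doc_id] = expert_label
    return final_labels, corrected_ids


# Hypothetical example data.
non_expert = {"doc-a": "relevant", "doc-b": "irrelevant", "doc-c": "relevant"}
expert = {"doc-b": "relevant", "doc-c": "relevant"}
final_labels, corrected_ids = apply_expert_corrections(non_expert, expert)
print(corrected_ids)  # ['doc-b']
```

Tracking which ids were corrected gives a rough measure of how noisy the non-expert labels were before verification.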
The training recipe used Qwen3-235B as a base model, with techniques like interleaved batching, CISPO loss with asymmetric clipping, and on-policy distillation.
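To make "CISPO loss with asymmetric clipping" more concrete, here is a minimal pure-Python sketch of the general idea as it is commonly described: the importance-sampling ratio is clipped with different lower and upper bounds and then used as a fixed weight on each token's log-probability. The epsilon values, function names, and toy numbers are assumptions; the post's actual settings are not given in this summary.

```python
import math


def cispo_token_weights(logp_new, logp_old, eps_low=0.2, eps_high=0.28):
    """Clipped importance-sampling weights, one per token.

    eps_low and eps_high are hypothetical values; choosing them unequal is
    what makes the clipping asymmetric.
    """
    weights = []
    for new, old in zip(logp_new, logp_old):
        ratio = math.exp(new - old)  # pi_new(token) / pi_old(token)
        weights.append(min(max(ratio, 1.0 - eps_low), 1.0 + eps_high))
    return weights


def cispo_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.28):
    """Token-averaged surrogate loss value.

    In an autodiff framework the weights would be wrapped in a stop-gradient,
    so gradients flow only through logp_new. Here only the value is computed.
    """
    weights = cispo_token_weights(logp_new, logp_old, eps_low, eps_high)
    total = sum(w * a * lp for w, a, lp in zip(weights, advantages, logp_new))
    return -total / len(logp_new)


# Toy inputs: ratios of e, 1/e, and 1 for three tokens.
weights = cispo_token_weights([0.0, -1.0, -2.0], [-1.0, 0.0, -2.0])
print([round(w, 2) for w in weights])  # [1.28, 0.8, 1.0]
loss = cispo_loss([0.0, -1.0, -2.0], [-1.0, 0.0, -2.0], [1.0, -1.0, 0.5])
```

Because the clip is applied to the weight rather than to the whole objective term, tokens with extreme ratios still contribute a bounded gradient instead of being dropped, which is the usual motivation given for this style of loss.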
The final proprietary model achieved 84.7% accuracy, making 29.8% fewer mistakes than frontier models, with a 13.8x reduction in inference costs.
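The two headline numbers can be related with a little arithmetic. The sketch below assumes that "29.8% fewer mistakes" means a relative reduction in error rate against a single frontier baseline; the summary does not state this, so the implied baseline is an inference rather than a reported figure.

```python
def implied_baseline_accuracy(model_accuracy, relative_error_reduction):
    """Back out the baseline accuracy implied by a relative error reduction.

    Assumes: model_error = baseline_error * (1 - relative_error_reduction).
    """
    model_error = 1.0 - model_accuracy
    baseline_error = model_error / (1.0 - relative_error_reduction)
    return 1.0 - baseline_error


baseline = implied_baseline_accuracy(0.847, 0.298)
print(f"{baseline:.1%}")  # 78.2%
```

Under that assumption the comparison baseline sits at about 78.2% accuracy, which is consistent with the roughly 50-80% range reported for frontier models.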
The conclusion highlights that custom models tuned to organizational needs outperform frontier models in accuracy and cost, enabling differentiated intelligence.
