A Year of Fast Apply – Our Path to 10k Tokens per Second
6 months ago
- #AI
- #Machine Learning
- #Software Development
- Released the Fast Apply model a year ago, focusing on fine-tuning small, specialized models for code-specific tasks.
- Open-sourced training insights leading to Relace Apply 3, capable of 10k+ tokens per second with state-of-the-art accuracy.
- Highlighted the inefficiency of regenerating unchanged code with expensive LLMs, proposing a lightweight diff application solution.
- Introduced the concept of using an LLM as a merge algorithm to handle pathological diffs and infer intent, improving accuracy.
- Detailed dataset production for training, emphasizing quality and diversity over size, with a focus on real production data.
- Explained the evaluation process for merges, categorizing outcomes into six types to ensure high-quality training data.
- Utilized LLM-as-a-judge to scale up dataset filtering, achieving a low false positive rate for reliable training examples.
- Adopted LoRA for efficient model training, allowing specialization without catastrophic forgetting of general coding knowledge.
- Achieved 10k+ tok/s with speculative decoding, exploiting the strong prior that most of a merged file copies the original verbatim, so many draft tokens can be verified in parallel.
- Showcased Relace Apply 3's improvements in merge accuracy, context length, and speed, positioning it as a market leader.
- Reflected on Fast Apply's impact over the past year, highlighting its role in making structured code edits reliable.
- Announced hiring for researchers and engineers to continue developing specialized models for coding tasks.
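The "LLM as a merge algorithm" idea above can be sketched as a prompt that hands the model the original file plus an abbreviated edit snippet and asks for the fully merged file. The tag names, elision marker, and instruction wording below are illustrative assumptions, not Relace's actual prompt format:

```python
def build_merge_prompt(original: str, edit_snippet: str) -> str:
    """Assemble a prompt asking a model to output the full merged file.

    The edit snippet elides unchanged regions (here with a
    '// ... existing code ...' marker); the model's job is to infer
    where the snippet applies and reproduce everything else intact.
    """
    return (
        "Apply the edit snippet to the original file and output the complete merged file.\n"
        "Unchanged regions in the snippet are elided with '// ... existing code ...'.\n\n"
        f"<original>\n{original}\n</original>\n\n"
        f"<edit>\n{edit_snippet}\n</edit>"
    )

# Example: the snippet only shows the changed function body.
prompt = build_merge_prompt(
    original="def add(a, b):\n    return a + b\n\ndef sub(a, b):\n    return a - b\n",
    edit_snippet="// ... existing code ...\ndef sub(a, b):\n    return b - a\n",
)
```

Because the model regenerates the whole file rather than applying a textual patch, pathological diffs (ambiguous anchors, moved code) become an inference problem instead of a string-matching failure.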
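The LLM-as-a-judge filtering step might look like the following minimal sketch: a judge model sees (original, edit, merged) and emits a verdict, and only accepted examples enter the training set. The JSON verdict schema and prompt wording are assumptions for illustration; `llm_call` stands in for any prompt-to-string model client:

```python
import json

def judge_merge(original: str, edit: str, merged: str, llm_call) -> bool:
    """Ask a judge model whether `merged` correctly applies `edit` to `original`.

    `llm_call` is any callable mapping a prompt string to a response string;
    the {"verdict": "pass"|"fail"} schema here is an illustrative assumption.
    """
    prompt = (
        'Review this code merge. Reply with JSON {"verdict": "pass"} or {"verdict": "fail"}.\n'
        f"<original>{original}</original>\n<edit>{edit}</edit>\n<merged>{merged}</merged>"
    )
    return json.loads(llm_call(prompt)).get("verdict") == "pass"

def filter_dataset(examples, llm_call):
    """Keep only examples the judge accepts.

    A low false-positive rate here is what keeps bad merges from
    contaminating the training set.
    """
    return [
        ex for ex in examples
        if judge_merge(ex["original"], ex["edit"], ex["merged"], llm_call)
    ]
```

In practice the judge prompt would encode the outcome taxonomy from the evaluation step; this sketch collapses it to a binary keep/drop decision.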
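The LoRA bullet above rests on a simple construction: the pretrained weight stays frozen and only a low-rank additive update is trained, so fine-tuning cannot overwrite the base model's general coding knowledge. A dependency-free sketch (shapes and placement are illustrative; real adapters typically sit on attention projection matrices):

```python
def matmul(X, Y):
    """Plain-Python matrix product: rows of X times columns of Y."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_forward(x, W, A, B, alpha, r):
    """LoRA forward pass: y = x(W + (alpha/r) * A B).

    W (d_in x d_out) is the frozen pretrained weight; only the low-rank
    factors A (d_in x r) and B (r x d_out) are trained. B is initialized
    to zeros, so training starts exactly at the base model's behavior.
    """
    base = matmul(x, W)                     # frozen path
    update = matmul(matmul(x, A), B)        # trainable low-rank path
    scale = alpha / r
    return [[b + scale * u for b, u in zip(br, ur)] for br, ur in zip(base, update)]
```

Since the update has rank at most `r`, the number of trainable parameters is a small fraction of the full matrix, which is what makes specialization cheap.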
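The speculative-decoding speedup comes from the fact that most of a merged file copies the original, so the original file itself is a nearly free draft. The toy below shows one greedy speculation round; a real implementation verifies all k draft tokens in a single batched forward pass, whereas this sketch calls the target model per token as a sequential stand-in:

```python
def speculative_round(target_next_token, draft_tokens, prefix, k=8):
    """One round of greedy speculative decoding.

    Propose the next k tokens straight from the draft (here, the original
    file), then check them against the target model's greedy choices.
    Accept the longest agreeing prefix; on the first disagreement, take
    the target's token instead and stop. If all k drafts agree, the
    verification pass yields one extra target token for free.
    """
    accepted = []
    for t in draft_tokens[:k]:
        choice = target_next_token(prefix + accepted)
        if choice == t:
            accepted.append(t)
        else:
            accepted.append(choice)  # first mismatch: keep target's token
            break
    else:
        accepted.append(target_next_token(prefix + accepted))  # bonus token
    return accepted
```

When the merge leaves long spans untouched, almost every draft token is accepted, so each verification pass emits many tokens, which is how a per-pass cost turns into 10k+ tokens per second.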