Open Weights Isn't Open Training
4 days ago
- #machine-learning
- #model-training
- #open-source
- Open-source ML infrastructure often has hidden bugs and inefficiencies, especially for large models.
- Attempting to post-train a 1T+ parameter model (Kimi-K2-Thinking) surfaced multiple issues in existing tooling such as the Hugging Face stack and LLaMA-Factory.
- Key problems included slow weight compression, uneven GPU memory distribution across devices, and quantized weights that broke LoRA training.
- The workarounds were manual: skipping unnecessary compression, rebalancing GPU memory allocation, and patching forward passes to dequantize weights on the fly.
- Even once the model was training, performance remained suboptimal, highlighting how fragile open-source ML infrastructure still is for large-scale models.
- The experience underscored the need for better, more reliable tools in the open-source ML ecosystem.
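The LoRA-on-quantized-weights problem from the bullets above can be sketched in miniature. This is a hedged toy illustration, not the author's actual patch: it assumes symmetric per-channel int8 quantization and a standard LoRA update, and shows why the forward pass must dequantize the frozen base weight before the matmul while the low-rank adapters stay in full precision.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_per_channel(w, bits=8):
    """Symmetric per-output-channel quantization: int weights plus fp32 scales."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax   # one scale per output row
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def lora_forward(x, w_q, scale, A, B, alpha=16.0):
    """Dequantize the frozen base weight on the fly, then add the LoRA delta.

    Mirrors the idea of patching the forward pass so full-precision LoRA
    adapters can train on top of a quantized base model.
    """
    w = w_q.astype(np.float32) * scale            # dequantize for the matmul
    base = x @ w.T                                # frozen base projection
    r = A.shape[0]
    delta = (x @ A.T) @ B.T * (alpha / r)         # low-rank trainable update
    return base + delta

# Toy shapes (hypothetical, for illustration): 8-dim input, 4-dim output, rank-2 adapter.
w = rng.normal(size=(4, 8)).astype(np.float32)
x = rng.normal(size=(3, 8)).astype(np.float32)
A = rng.normal(scale=0.01, size=(2, 8)).astype(np.float32)  # LoRA down-projection
B = np.zeros((4, 2), dtype=np.float32)                      # LoRA up-projection, zero-init

w_q, scale = quantize_per_channel(w)
y = lora_forward(x, w_q, scale, A, B)

# With B zero-initialized (the usual LoRA convention) the delta is zero, so the
# output matches the full-precision base projection up to int8 rounding error.
print(np.abs(y - x @ w.T).max())
```

The design point is that the quantized weights are never trained directly; they are expanded to float only transiently inside the forward pass, which is what makes LoRA compatible with a weight format that gradient updates cannot touch.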