Weight Transfer for RL Post-Training in under 2 seconds
21 days ago
- #RDMA
- #RL Fine-Tuning
- #Weight Transfer
- Achieved 1.3-second cross-machine parameter updates for Kimi-K2 (1T parameters).
- Used RDMA WRITE for low-latency, high-throughput, zero-copy transfers.
- Implemented a static weight transfer schedule computed once at initialization.
- Pipelined execution so that stages using different hardware resources overlap.
- Kept the weight update steps cleanly separated, making them easier to maintain and optimize.
- Avoided bottlenecks by using point-to-point communication instead of funneling through rank-0 GPUs.
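
To make the static schedule and point-to-point ideas above concrete, here is a minimal sketch in Python. It is not the implementation described in the post: it only plans transfers and does no networking, and every name in it (`Transfer`, `build_schedule`, `param_sizes`, `replicas`, `dst_ranks`) is hypothetical. It assumes each parameter is replicated on several training ranks and must reach every inference rank, and it picks the least-loaded replica as the sender so that no single rank (such as rank 0) carries all the traffic.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    """One planned point-to-point copy (hypothetical structure)."""
    param: str
    src_rank: int  # training rank that sends the parameter
    dst_rank: int  # inference rank that receives it
    nbytes: int


def build_schedule(param_sizes, replicas, dst_ranks):
    """Plan all transfers once, at initialization.

    param_sizes: parameter name -> size in bytes
    replicas:    parameter name -> training ranks holding a copy
    dst_ranks:   inference ranks that need every parameter

    Returns (plan, load): plan maps each sender rank to its ordered
    list of transfers, load maps each sender rank to total bytes sent.
    """
    load = {rank: 0 for ranks in replicas.values() for rank in ranks}
    plan = defaultdict(list)
    # Place the largest parameters first so greedy balancing works better;
    # the name is a tie-breaker that keeps the plan deterministic.
    for name in sorted(param_sizes, key=lambda n: (-param_sizes[n], n)):
        for dst in dst_ranks:
            # Greedy choice: the replica holder with the least bytes so far.
            src = min(replicas[name], key=lambda r: (load[r], r))
            load[src] += param_sizes[name]
            plan[src].append(Transfer(name, src, dst, param_sizes[name]))
    return dict(plan), load
```

Because the plan depends only on the model layout and the rank topology, it can be computed once and replayed at every weight update; each sender then walks its own list without coordinating through a central rank. In a real system each `Transfer` would be carried out by an RDMA WRITE into a pre-registered buffer on the destination, which this sketch deliberately leaves out.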