Hasty Briefs (beta)

Weight Transfer for RL Post-Training in under 2 seconds

21 days ago
  • #RDMA
  • #RL Fine-Tuning
  • #Weight Transfer
  • Achieved 1.3-second cross-machine parameter updates for Kimi-K2 (1T parameters).
  • Used RDMA WRITE, a one-sided operation that places data directly into the receiver's registered memory, for low-latency, high-throughput, zero-copy transfers.
  • Implemented a static weight transfer schedule computed once at initialization.
  • Pipelined the transfer so that steps using different hardware resources run concurrently instead of one after another.
  • Kept each weight update step cleanly separated, making the steps easier to maintain and optimize individually.
  • Avoided a single-sender bottleneck by sending weights point-to-point between GPUs instead of funneling them through rank-0 GPUs.
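A back-of-envelope check shows why the point-to-point design matters for the 1.3-second figure. The summary gives only the parameter count (1T) and the time; the bytes per parameter (bf16 or fp8) and the 400 Gb/s per-link rate below are assumptions chosen for illustration, and protocol overhead is ignored, so the results are lower bounds.

```python
import math


def required_gbps(n_params: float, bytes_per_param: float, seconds: float) -> float:
    """Aggregate payload rate in gigabits/s needed to move every weight in `seconds`.

    Ignores protocol overhead, so this is a lower bound.
    """
    return n_params * bytes_per_param * 8 / seconds / 1e9


def min_parallel_links(n_params: float, bytes_per_param: float,
                       seconds: float, link_gbps: float) -> int:
    """Fewest links that must all send at full line rate to meet the deadline."""
    return math.ceil(required_gbps(n_params, bytes_per_param, seconds) / link_gbps)


# Assumptions: 1T parameters, 1.3 s, 400 Gb/s per link (hypothetical link rate).
for label, bpp in [("bf16 (2 B/param)", 2), ("fp8 (1 B/param)", 1)]:
    rate = required_gbps(1e12, bpp, 1.3)
    links = min_parallel_links(1e12, bpp, 1.3, 400)
    print(f"{label}: ~{rate:,.0f} Gb/s aggregate, >= {links} links at 400 Gb/s")
```

Under the bf16 assumption the update needs roughly 12.3 Tb/s in aggregate, i.e. at least 31 links sending in parallel at full rate. A design that routes everything through one sender's link cannot get close; spreading the writes across many GPU-to-GPU pairs is what makes the aggregate bandwidth available.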
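The static, point-to-point schedule can be illustrated with a minimal sketch. It assumes a hypothetical layout in which `owners` maps each parameter to the training ranks that hold a full copy of it and `needs` lists what each inference rank must receive; the greedy least-loaded assignment is an illustrative choice, not the method described in the post. Because the plan depends only on the sharding layout, it can be computed once at initialization and replayed on every update.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    src: int      # training rank that issues the write
    dst: int      # inference rank that receives the bytes
    param: str    # parameter (or shard) name
    nbytes: int


def build_schedule(owners, needs, sizes):
    """Build a fixed point-to-point transfer plan once, at initialization.

    owners: param -> training ranks holding a complete copy of it
    needs:  inference rank -> params that rank must receive
    sizes:  param -> size in bytes
    Each (dst, param) pair is assigned to the currently least-loaded owner
    (ties broken by rank), so no single sender carries all of the traffic.
    """
    load = defaultdict(int)
    plan = []
    for dst in sorted(needs):
        for param in needs[dst]:
            src = min(owners[param], key=lambda r: (load[r], r))
            load[src] += sizes[param]
            plan.append(Transfer(src, dst, param, sizes[param]))
    return plan, dict(load)


# Hypothetical layout: two training ranks each hold both tensors,
# and two inference ranks each need both.
owners = {"w1": [0, 1], "w2": [0, 1]}
sizes = {"w1": 100, "w2": 100}
needs = {0: ["w1", "w2"], 1: ["w1", "w2"]}
plan, load = build_schedule(owners, needs, sizes)
for t in plan:
    print(t)
print(load)
```

In this sketch the 400 bytes of traffic split evenly, 200 per training rank, rather than landing on rank 0. At each step every rank would simply walk its own entries in `plan` and issue the writes, with no per-step negotiation over who sends what.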
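The benefit of pipelining steps that use different hardware resources can be seen with a simple timing model. This sketch assumes the weights are processed in chunks, that each stage occupies its own resource (for example a copy, a GPU-side transform, and a network write; these stage names and the durations are hypothetical), and that each resource handles one chunk at a time.

```python
def sequential_time(durations):
    """Total time if each chunk finishes all stages before the next chunk starts."""
    return sum(sum(chunk) for chunk in durations)


def pipelined_time(durations):
    """Completion time when each stage runs on its own resource.

    durations[i][s] is how long chunk i occupies stage s. A stage handles
    one chunk at a time, chunks keep their order, and chunk i can enter
    stage s only after it leaves stage s - 1.
    """
    stage_free = [0.0] * len(durations[0])  # when each resource next becomes idle
    for chunk in durations:
        ready = 0.0  # when this chunk leaves its previous stage
        for s, d in enumerate(chunk):
            start = max(ready, stage_free[s])
            ready = start + d
            stage_free[s] = ready
    return stage_free[-1]


# Hypothetical: 4 chunks, 3 stages, 1 time unit per stage.
chunks = [[1, 1, 1]] * 4
print(sequential_time(chunks), pipelined_time(chunks))
```

With n equal chunks and S one-unit stages, the pipelined time is n + S - 1 (here 6) instead of n × S (here 12). The overlap only exists because the stages contend for different resources; two stages sharing one resource would serialize again, which is why the pipeline is organized around hardware resource usage.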