Rotary GPU: Exploring Local Execution for Large MoE Models Under Limited VRAM


Rotary GPU explores local execution paths for large Mixture-of-Experts models under limited GPU memory to improve accessibility in constrained environments.
In a validation run on a consumer laptop with an RTX 4060 GPU (8 GB VRAM), a Qwen3.6-35B-A3B-class model generated 2048 output tokens while using ~6.3 GB of VRAM, at a decode throughput of 21.06 tokens/sec.
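To put the reported figures in perspective, the back-of-the-envelope arithmetic can be sketched as below. This is an illustration only, not the project's method. The function names are hypothetical, and the 4-bit weight width and the reading of "35B-A3B" as roughly 35B total and 3B active parameters are assumptions; the brief does not state the quantization used. The decode-time estimate also assumes 21.06 tokens/sec is a steady average and ignores prompt processing.

```python
def decode_seconds(tokens: int, tokens_per_sec: float) -> float:
    """Time to decode `tokens` at a steady average throughput."""
    return tokens / tokens_per_sec


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring overhead."""
    return params_billion * bits_per_weight / 8


# Figures reported in the brief.
OUTPUT_TOKENS = 2048
DECODE_TPS = 21.06
VRAM_TOTAL_GB = 8.0
VRAM_USED_GB = 6.3

# Assumptions (not stated in the brief): 4-bit weights,
# ~35B total parameters, ~3B active per token.
ASSUMED_BITS = 4
ASSUMED_TOTAL_B = 35
ASSUMED_ACTIVE_B = 3

if __name__ == "__main__":
    print(f"decode time:    {decode_seconds(OUTPUT_TOKENS, DECODE_TPS):.1f} s")
    print(f"VRAM headroom:  {VRAM_TOTAL_GB - VRAM_USED_GB:.1f} GB")
    print(f"all weights:    {weight_gb(ASSUMED_TOTAL_B, ASSUMED_BITS):.1f} GB")
    print(f"active weights: {weight_gb(ASSUMED_ACTIVE_B, ASSUMED_BITS):.1f} GB")
```

Under these assumptions the full weight set would be more than twice the card's 8 GB, while the weights active for any single token would fit comfortably. That gap is the reason a limited-VRAM execution path for MoE models is worth exploring at all.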
The goal is not to replace data-center infrastructure but to bring some large-model capabilities to settings where such infrastructure is unavailable, prioritizing deployment accessibility over capability scaling.
