A Visual Guide to Gemma 4 12B

22 days ago

Gemma 4 12B is a new encoder-free multimodal model that sits between the E4B and 26B A4B models, designed for systems with 12GB to 16GB of VRAM.
The model removes separate vision and audio encoders, instead using lightweight embedding modules that directly project raw inputs (images and audio) into the LLM's token embedding space.
For vision, raw 48x48 image patches are projected through a linear layer (≈26M parameters) and combined with positional embeddings drawn from separate x and y matrices, for a total of ≈35M parameters, versus 550M in the larger models.
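To make the vision path concrete, here is a minimal pure-Python sketch of a patch embedding of this kind. It is an illustration, not Gemma's actual code. The RGB channel count, the embedding width of 3840 (picked because it makes the projection land near the quoted ≈26M), the absence of a bias term, and the choice to add the x and y positional vectors to the projection are all assumptions. The function names are hypothetical.

```python
# Illustrative sketch of an encoder-free patch embedding (not Gemma's real code).
PATCH = 48      # patch side length in pixels (from the text)
CHANNELS = 3    # assumption: RGB input
HIDDEN = 3840   # assumption: LLM embedding width, chosen to roughly match ≈26M


def patch_grid(height, width, patch=PATCH):
    """Return (row, col) indices of the non-overlapping patches covering an image."""
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return [(r, c) for r in range(height // patch) for c in range(width // patch)]


def flatten_patch(image, row, col, patch=PATCH):
    """Flatten one patch of a nested-list image[y][x][channel] into a flat vector."""
    channels = len(image[0][0])
    return [
        image[y][x][ch]
        for y in range(row * patch, (row + 1) * patch)
        for x in range(col * patch, (col + 1) * patch)
        for ch in range(channels)
    ]


def embed_patch(pixels, weight, pos_x, pos_y, row, col):
    """Project raw pixels linearly, then add x and y positional vectors.

    weight is a (len(pixels) x hidden) matrix; pos_x and pos_y are lookup
    tables indexed by patch column and patch row. Summing them is an assumption.
    """
    hidden = len(weight[0])
    projected = [
        sum(p * weight[i][j] for i, p in enumerate(pixels)) for j in range(hidden)
    ]
    return [projected[j] + pos_x[col][j] + pos_y[row][j] for j in range(hidden)]


# Weights of the projection alone under the assumptions above (no bias).
projection_params = PATCH * PATCH * CHANNELS * HIDDEN
print(projection_params)  # 26542080
```

Under these assumptions the projection accounts for about 26.5M weights, which is consistent with the ≈26M figure. The remaining ≈9M of the ≈35M total would then sit in the two positional tables, whose sizes the text does not specify.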
For audio, the raw amplitude samples of each 40 ms sequence are projected directly, with no positional embeddings, which simplifies processing compared to encoder-based models.
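The audio path can be sketched the same way. This is a rough illustration only: the 16 kHz sample rate, the non-overlapping framing, and the zero-padding of the final frame are assumptions that the text does not state, and the function names are hypothetical.

```python
# Illustrative sketch of encoder-free audio embedding (not Gemma's real code).
SAMPLE_RATE_HZ = 16_000  # assumption: the text does not give a sample rate
FRAME_MS = 40            # frame length in milliseconds (from the text)


def samples_per_frame(sample_rate_hz=SAMPLE_RATE_HZ, frame_ms=FRAME_MS):
    """Number of raw amplitude samples in one frame."""
    return sample_rate_hz * frame_ms // 1000


def frame_waveform(samples, frame_len):
    """Split a waveform into consecutive frames, zero-padding the last one.

    Non-overlapping frames and zero-padding are assumptions.
    """
    frames = []
    for start in range(0, len(samples), frame_len):
        frame = list(samples[start:start + frame_len])
        frame += [0.0] * (frame_len - len(frame))
        frames.append(frame)
    return frames


def project_frame(frame, weight):
    """Apply one linear layer to raw amplitudes; no positional term is added."""
    hidden = len(weight[0])
    return [
        sum(a * weight[i][j] for i, a in enumerate(frame)) for j in range(hidden)
    ]
```

At the assumed 16 kHz, each 40 ms frame holds 640 samples, so one second of audio would become 25 embedding vectors for the LLM.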
Removing the encoders reduces latency, since the LLM can begin processing inputs sooner, and simplifies fine-tuning, since only the LLM needs to be adjusted. The trade-off is that the LLM must now do more of the understanding work itself.
