A Visual Guide to Gemma 4 12B
7 hours ago
- #multimodal AI
- #encoder-free models
- #Gemma 4
- Gemma 4 12B is a new encoder-free multimodal model that sits between the E4B and 26B A4B models, designed for systems with 12GB to 16GB of VRAM.
- The model removes separate vision and audio encoders, instead using lightweight embedding modules that directly project raw inputs (images and audio) into the LLM's token embedding space.
- For vision, raw 48x48 image patches are projected by a linear layer (≈26M parameters) and combined with positional embeddings drawn from separate x and y matrices. The vision module totals ≈35M parameters, versus 550M in larger models.
- For audio, raw amplitude samples are split into 40 ms segments and projected directly, with no positional embeddings, which simplifies processing compared with encoder-based models.
- Removing the encoders reduces latency, because the LLM can start processing inputs earlier, and simplifies fine-tuning, because only the LLM needs to be adjusted. The trade-off is that the LLM itself now takes on more of the understanding work that the encoders used to do.
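
The embedding modules described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the model's actual implementation. Several details are assumptions that the summary does not state: RGB input (3 channels), a hidden width of 3840 (chosen only because 48 × 48 × 3 × 3840 ≈ 26.5M lines up with the quoted ≈26M), positional embeddings that are added to the patch projection, and a 16 kHz audio sample rate. All function and variable names are hypothetical.

```python
import numpy as np

PATCH = 48            # patch side length in pixels (from the text)
CHANNELS = 3          # assumption: RGB input
D_MODEL_FULL = 3840   # assumption: hidden width that reproduces the ~26M figure


def linear_param_count(patch=PATCH, channels=CHANNELS, d_model=D_MODEL_FULL):
    """Weights in a bias-free linear map from one flattened patch to d_model."""
    return patch * patch * channels * d_model


def patchify(image, patch=PATCH):
    """Split an (H, W, C) image into flattened patches plus grid coordinates."""
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    image = image[: rows * patch, : cols * patch]  # drop any ragged border
    patches = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(rows * cols, patch * patch * c)
    ys, xs = np.divmod(np.arange(rows * cols), cols)  # row and column per patch
    return patches, xs, ys


def embed_image(image, w_proj, pos_x, pos_y):
    """Project raw patches and add x/y positional embeddings (assumed additive)."""
    patches, xs, ys = patchify(image)
    return patches @ w_proj + pos_x[xs] + pos_y[ys]


def embed_audio(samples, w_audio, sample_rate=16_000, frame_ms=40):
    """Project 40 ms frames of raw samples; no positional embedding is added."""
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    return frames @ w_audio


# Small demo with a toy hidden width so the weights stay tiny.
rng = np.random.default_rng(0)
d_model = 64
w_proj = rng.normal(size=(PATCH * PATCH * CHANNELS, d_model)).astype(np.float32) * 0.02
pos_x = rng.normal(size=(16, d_model)).astype(np.float32)
pos_y = rng.normal(size=(16, d_model)).astype(np.float32)
image = rng.random((96, 144, 3), dtype=np.float32)      # 2 x 3 grid of patches
image_tokens = embed_image(image, w_proj, pos_x, pos_y)

w_audio = rng.normal(size=(640, d_model)).astype(np.float32) * 0.02
audio = rng.standard_normal(16_000).astype(np.float32)  # 1 s at the assumed 16 kHz
audio_tokens = embed_audio(audio, w_audio)

print(linear_param_count())   # 26542080
print(image_tokens.shape)     # (6, 64)
print(audio_tokens.shape)     # (25, 64)
```

Both functions return arrays of shape (tokens, d_model), so under these assumptions the results could be placed in the same sequence as text token embeddings without passing through a separate encoder.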