A Visual Guide to Gemma 4 12B

6 hours ago
  • #multimodal AI
  • #encoder-free models
  • #Gemma 4
  • Gemma 4 12B is a new encoder-free multimodal model that sits between the E4B and 26B A4B models in size and targets systems with 12 GB to 16 GB of VRAM.
  • The model removes separate vision and audio encoders, instead using lightweight embedding modules that directly project raw inputs (images and audio) into the LLM's token embedding space.
  • For vision, raw 48×48 image patches are projected through a single linear layer (≈26M parameters) and combined with positional embeddings drawn from separate x and y matrices. The vision module totals ≈35M parameters, versus 550M for the vision encoder in larger models.
  • For audio, 40 ms segments of raw amplitude samples are projected directly into the embedding space with no positional embeddings, which simplifies processing compared to encoder-based models.
  • Removing the encoders reduces latency because the LLM can start processing inputs earlier, and it simplifies fine-tuning because only the LLM needs to be adjusted. The trade-off is that the LLM now has to handle more of the understanding work itself.
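
The vision and audio embedding steps summarized above can be sketched in a few lines of plain Python. This is a toy illustration, not the model's actual code: the function names, the tiny dimensions, the RGB input, and the choice to add (rather than concatenate) the x and y positional vectors are all assumptions made for the example. The last helper checks the parameter arithmetic: with a hypothetical embedding width of 3840, a linear layer over a 48×48×3 patch has about 26.5M weights, which is consistent with the ≈26M figure quoted.

```python
import random

# Toy sizes so the sketch runs instantly (the summary describes 48x48 patches).
PATCH = 4      # patch side length in pixels
CHANNELS = 3   # assumption: RGB input
D_MODEL = 8    # assumption: toy embedding width
FRAME = 16     # toy audio frame length in samples (stands in for a 40 ms segment)


def rand_matrix(rows, cols, rng):
    """Random matrix standing in for learned weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(vec, weight):
    """Project vec with a (d_model x len(vec)) weight matrix, no bias."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def patchify(image, patch):
    """Split an H x W x C nested-list image into (row, col, flat_pixels) patches."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    patches = []
    for py in range(h // patch):
        for px in range(w // patch):
            flat = []
            for y in range(py * patch, (py + 1) * patch):
                for x in range(px * patch, (px + 1) * patch):
                    flat.extend(image[y][x])
            patches.append((py, px, flat))
    return patches


def embed_image(image, patch, weight, pos_x, pos_y):
    """Raw patch -> linear projection -> add x and y positional vectors (assumed)."""
    tokens = []
    for py, px, flat in patchify(image, patch):
        proj = linear(flat, weight)
        tokens.append([p + a + b for p, a, b in zip(proj, pos_x[px], pos_y[py])])
    return tokens


def embed_audio(samples, frame_len, weight):
    """Raw amplitude frames -> linear projection, with no positional embedding."""
    starts = range(0, len(samples) - frame_len + 1, frame_len)
    return [linear(samples[s:s + frame_len], weight) for s in starts]


def vision_projection_params(patch=48, channels=3, d_model=3840):
    """Weight count of a bias-free linear layer over one flattened patch."""
    return patch * patch * channels * d_model


rng = random.Random(0)
image = [[[rng.random() for _ in range(CHANNELS)] for _ in range(8)] for _ in range(8)]
w_img = rand_matrix(D_MODEL, PATCH * PATCH * CHANNELS, rng)
pos_x = rand_matrix(2, D_MODEL, rng)  # one row per patch column
pos_y = rand_matrix(2, D_MODEL, rng)  # one row per patch row
image_tokens = embed_image(image, PATCH, w_img, pos_x, pos_y)

w_audio = rand_matrix(D_MODEL, FRAME, rng)
samples = [rng.uniform(-1.0, 1.0) for _ in range(40)]
audio_tokens = embed_audio(samples, FRAME, w_audio)  # trailing partial frame is dropped

print(len(image_tokens), len(audio_tokens))  # 4 2
print(vision_projection_params())            # 26542080
```

Both paths produce vectors of width `D_MODEL`, so in this sketch the resulting image and audio tokens could be placed in the same sequence as text token embeddings, which is the point of projecting into the LLM's embedding space.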