Hasty Briefs (beta)

A Visual Guide to Gemma 4

14 hours ago
  • #multimodal AI
  • #model architecture
  • #Gemma 4
  • The Gemma 4 family includes four models with varying sizes and architectures: E2B (2B effective parameters), E4B (4B effective parameters), 31B (dense), and 26B A4B (Mixture of Experts, 4B active parameters).
  • All models are multimodal, supporting image inputs with variable aspect ratios and resolutions via a Vision Encoder, and smaller models (E2B/E4B) also support audio inputs through an audio encoder.
  • Architectural improvements over Gemma 3 include interleaved local and global attention layers (with a global layer always placed last), setting K = V in global attention, p-RoPE for positional encoding, and efficiency features such as Grouped Query Attention.
  • The Mixture of Experts (MoE) model (26B A4B) uses 128 experts with 8 activated per token during inference, plus a shared expert, giving inference speed comparable to a 4B-parameter dense model.
  • Smaller models (E2B/E4B) use Per-Layer Embeddings (PLE) stored in flash memory to reduce RAM usage, making them suitable for on-device deployment on hardware like phones.
  • The Vision Encoder uses adaptive resizing, 2D RoPE for positional encoding, and pooling to handle variable aspect ratios and resolutions, with a soft token budget that caps the number of patch embeddings.
  • Audio processing in the small models involves feature extraction via mel-spectrograms, chunking, downsampling with convolutions, and a Conformer encoder whose output embeddings are projected into Gemma 4's input space.
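The interleaved local/global layer arrangement can be sketched as a simple layer schedule. The 5-local-to-1-global ratio below is an assumption for illustration; the summary only states that local and global layers interleave and that a global layer always comes last.

```python
def attention_schedule(num_layers, local_per_global=5):
    """Build an illustrative local/global attention schedule.

    The local:global ratio is an assumed parameter, not a confirmed
    Gemma 4 value; the only hard constraint applied is that the final
    layer uses global attention.
    """
    sched = []
    for i in range(num_layers):
        # every (local_per_global + 1)-th layer is global, the rest are local
        sched.append("global" if (i + 1) % (local_per_global + 1) == 0 else "local")
    sched[-1] = "global"  # a global attention layer is always last
    return sched

print(attention_schedule(12))
```

For a 12-layer stack this yields five local layers before each global layer, with the schedule ending on a global layer as described.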
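The MoE routing described for the 26B A4B model (128 experts, 8 active, plus a shared expert) can be sketched in a few lines of NumPy. All function names and shapes here are illustrative, not Gemma 4's actual implementation.

```python
import numpy as np

def moe_forward(x, gate_W, experts, shared_expert, k=8):
    """Hypothetical top-k MoE layer: route a token through k of the
    experts plus an always-on shared expert (a sketch, not Gemma 4's API)."""
    logits = x @ gate_W                       # router scores, one per expert
    topk = np.argsort(logits)[-k:]            # indices of the k best experts
    weights = np.exp(logits[topk])
    weights /= weights.sum()                  # softmax over selected experts only
    out = sum(w * experts[i](x) for w, i in zip(weights, topk))
    return out + shared_expert(x)             # shared expert always contributes

# toy demo: d_model=16, 128 linear "experts"
rng = np.random.default_rng(0)
d = 16
gate_W = rng.normal(size=(d, 128))
experts = [lambda v, W=rng.normal(size=(d, d)) / d: v @ W for _ in range(128)]
shared = lambda v: v
y = moe_forward(rng.normal(size=d), gate_W, experts, shared)
print(y.shape)  # (16,)
```

Because only 8 of 128 experts run per token, the compute cost per forward pass tracks the active 4B parameters rather than the full 26B, which is the speedup the bullet describes.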
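A soft token budget for the Vision Encoder can be sketched as choosing a resize scale so the patch grid stays near a target token count while preserving aspect ratio. The patch size and budget values below are assumptions, not published Gemma 4 numbers.

```python
import math

def plan_patches(h, w, patch=16, token_budget=256):
    """Pick a downscale factor so (h*w) / patch^2 stays near token_budget.

    A minimal sketch of a soft token budget: patch size and budget are
    illustrative; real encoders add pooling and rounding rules on top.
    """
    # scale so that the scaled image yields roughly token_budget patches
    scale = math.sqrt(token_budget * patch * patch / (h * w))
    scale = min(scale, 1.0)                   # never upscale small images
    nh = max(1, round(h * scale / patch))     # patch rows
    nw = max(1, round(w * scale / patch))     # patch columns
    return nh, nw, nh * nw                    # grid shape and token count

print(plan_patches(1024, 768))
```

The budget is "soft" because rounding the patch grid can land slightly above or below the target; aspect ratio is preserved by applying one uniform scale to both dimensions.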
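The audio path in the last bullet can be followed as a shape walkthrough: mel-spectrogram frames, fixed-size chunks, convolutional downsampling, then one Conformer embedding per remaining frame. Every concrete number below (hop length, mel bins, chunk size, 4x downsampling, projection width) is an illustrative guess, not a documented Gemma 4 setting.

```python
import numpy as np

sample_rate = 16_000
audio = np.zeros(sample_rate * 2)            # 2 s of silent audio for the demo

# 1. mel-spectrogram: one frame every 10 ms, 128 mel bins (assumed values)
hop, n_mels = 160, 128
frames = len(audio) // hop                   # 200 frames for 2 s
mel = np.zeros((frames, n_mels))

# 2. chunking: split frames into fixed-length chunks for the encoder
chunk = 100
chunks = mel.reshape(frames // chunk, chunk, n_mels)

# 3. conv downsampling: e.g. two stride-2 convolutions -> 4x fewer frames
downsampled_len = chunk // 4

# 4. Conformer encoder + projection: one embedding per remaining frame,
#    projected to an assumed model width before entering Gemma 4
d_model = 1536
embeddings_per_chunk = (downsampled_len, d_model)
print(chunks.shape, embeddings_per_chunk)
```

The net effect is a large reduction in sequence length before the language model sees the audio: 200 raw spectrogram frames become 50 embeddings under these assumed factors.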