A Visual Guide to Gemma 4
- #multimodal AI
- #model architecture
- #Gemma 4
- The Gemma 4 family includes four models of varying size and architecture: E2B (2B effective parameters), E4B (4B effective parameters), a 31B dense model, and 26B A4B (a Mixture-of-Experts model with 4B active parameters).
- All models are multimodal, accepting image inputs at variable aspect ratios and resolutions via a Vision Encoder; the smaller models (E2B/E4B) additionally support audio inputs through an audio encoder.
- Architectural changes relative to Gemma 3 include interleaved local and global attention layers (with a global layer always last), setting K = V in global attention, p-RoPE positional encoding, and efficiency features such as Grouped Query Attention.
- The Mixture-of-Experts model (26B A4B) activates 8 of its 128 experts per token, plus an always-on shared expert, giving it inference speed comparable to a 4B-parameter model.
- The smaller models (E2B/E4B) use Per-Layer Embeddings (PLE) stored in flash memory rather than RAM, reducing memory usage and making them suitable for on-device applications such as phones.
- The Vision Encoder handles variable aspect ratios and resolutions through adaptive resizing, 2D RoPE positional encoding, and pooling, with a soft token budget that caps the number of patch embeddings per image.
- Audio processing in the smaller models extracts mel-spectrogram features, chunks and downsamples them with convolutions, and passes them through a Conformer encoder; the resulting embeddings are projected into Gemma 4's embedding space.
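
The interleaved local/global attention layout described above can be sketched as a simple layer schedule. The 5:1 local-to-global ratio below is an assumption for illustration (the summary only states that layers are interleaved and that a global layer comes last):

```python
# Hypothetical sketch of an interleaved attention schedule: every
# (local_per_global + 1)-th layer is a full global-attention layer, the
# rest use local (sliding-window) attention. The 5:1 ratio is an assumption.
def attention_schedule(n_layers: int, local_per_global: int = 5) -> list[str]:
    """Return a layer-type list where each block of local layers is closed
    by a global layer, so the final layer of the stack is global."""
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local"
            for i in range(n_layers)]
```

For a 12-layer stack this yields two repeating blocks of five local layers each followed by one global layer, with the global layer last.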
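
The MoE routing in the bullet above (8 of 128 experts plus a shared expert) can be sketched as top-k routing. This is a minimal illustration assuming a softmax router; the actual routing function and normalization in Gemma 4 are not specified in the summary:

```python
import numpy as np

# Minimal sketch of top-k Mixture-of-Experts routing with a shared expert.
# The softmax router and the expert/shared-expert interfaces are assumptions.
def moe_layer(x, router_w, experts, shared_expert, k=8):
    logits = x @ router_w                      # one router score per expert
    top = np.argsort(logits)[-k:]              # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                   # softmax over selected experts only
    out = sum(w * experts[i](x) for w, i in zip(weights, top))
    return out + shared_expert(x)              # shared expert always contributes
```

With 128 experts and k=8, only 8 expert networks (plus the shared expert) run per token, which is why the active-parameter count, and hence inference cost, is far below the total parameter count.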
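
The memory-saving idea behind Per-Layer Embeddings can be illustrated with a memory-mapped table: the embedding weights live on storage and only the rows for the current tokens are paged into RAM. The file layout, vocabulary size, and dimensions below are illustrative assumptions, not Gemma 4's actual format:

```python
import os
import tempfile
import numpy as np

# Sketch of the PLE idea: a per-layer embedding table lives in a
# memory-mapped file (standing in here for on-device flash storage),
# and only the rows for the current token ids are materialized in RAM.
# Vocab size, dimension, and dtype are illustrative assumptions.
vocab, dim = 1024, 64
path = os.path.join(tempfile.mkdtemp(), "ple_layer0.bin")

# Write a dummy table to "flash" once (in practice it ships with the model).
np.memmap(path, dtype=np.float16, mode="w+", shape=(vocab, dim)).flush()

table = np.memmap(path, dtype=np.float16, mode="r", shape=(vocab, dim))
token_ids = np.array([3, 17, 42])
rows = np.asarray(table[token_ids])  # only these rows are copied into RAM
```

Because lookups touch only a handful of rows per token, the full table never needs to reside in RAM, which is the property that makes these models phone-friendly.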
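
The Vision Encoder's soft token budget can be sketched as an aspect-ratio-preserving resize that keeps the patch count under a cap. The patch size (16) and budget (256) here are assumptions for illustration:

```python
import math

# Sketch of a soft token-budget resize: scale (h, w) down, preserving the
# aspect ratio, until (h/patch) * (w/patch) fits the budget. Patch size
# and budget values are illustrative assumptions.
def fit_to_budget(h: int, w: int, patch: int = 16, budget: int = 256):
    """Return dims, rounded down to patch multiples, within the budget."""
    scale = min(1.0, math.sqrt(budget * patch * patch / (h * w)))
    new_h = max(patch, int(h * scale) // patch * patch)
    new_w = max(patch, int(w * scale) // patch * patch)
    return new_h, new_w
```

A small square image passes through unchanged, while a large or very wide image is scaled down just enough that its patch-embedding count stays within the budget.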
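
The audio pipeline's effect on sequence length can be sketched with back-of-the-envelope arithmetic. The 10 ms mel-spectrogram hop and the overall 4x convolutional downsampling below are assumptions for illustration; the summary only states that chunking, convolutional downsampling, and a Conformer encoder are involved:

```python
# Rough sketch of audio sequence lengths through the front end, assuming a
# 10 ms mel-spectrogram hop and an overall 4x downsample from the strided
# convolutions; both constants are illustrative assumptions.
def audio_token_count(seconds: float, hop_ms: float = 10.0,
                      downsample: int = 4) -> int:
    frames = int(seconds * 1000 / hop_ms)  # mel-spectrogram frames
    return frames // downsample            # frames left after conv downsampling
```

Under these assumptions, two seconds of audio become 200 mel frames and 50 embeddings, which are then encoded by the Conformer and projected into the language model.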