Hasty Briefs (beta)

A Visual Guide to Gemma 4

14 hours ago
  • #multimodal AI
  • #model architecture
  • #Gemma 4
  • The Gemma 4 family includes four models with varying sizes and architectures: E2B (2B effective parameters), E4B (4B effective parameters), 31B (dense), and 26B A4B (Mixture of Experts, 4B active parameters).
  • All models are multimodal, supporting image inputs with variable aspect ratios and resolutions via a Vision Encoder, and smaller models (E2B/E4B) also support audio inputs through an audio encoder.
  • Architectural improvements over Gemma 3 include interleaved local and global attention layers (with a global layer always placed last), setting K = V in global attention, p-RoPE for positional encoding, and efficiency features such as Grouped Query Attention.
  • The Mixture of Experts (MoE) model (26B A4B) uses 128 experts with 8 activated per token during inference, plus a shared expert, giving inference speed comparable to a 4B-parameter dense model.
  • Smaller models (E2B/E4B) use Per-Layer Embeddings (PLE) stored in flash memory to reduce RAM usage, making them suitable for on-device deployment on hardware like phones.
  • The Vision Encoder uses adaptive resizing, 2D RoPE for positional encoding, and pooling to handle variable aspect ratios and resolutions, with a soft token budget that caps the number of patch embeddings.
  • Audio processing in the small models involves feature extraction via mel-spectrograms, chunking, downsampling with convolutions, and a Conformer encoder whose output embeddings are projected into Gemma 4's input space.
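The interleaved local/global layer arrangement can be sketched as a simple layer schedule. The 5-local-to-1-global ratio below is an assumption for illustration; the summary only states that local and global layers interleave and that a global layer always comes last.

```python
def attention_schedule(num_layers, local_per_global=5):
    """Build an illustrative local/global attention schedule.

    The local:global ratio is an assumed parameter, not a confirmed
    Gemma 4 value; the only hard constraint applied is that the final
    layer uses global attention.
    """
    sched = []
    for i in range(num_layers):
        # every (local_per_global + 1)-th layer is global, the rest are local
        sched.append("global" if (i + 1) % (local_per_global + 1) == 0 else "local")
    sched[-1] = "global"  # a global attention layer is always last
    return sched

print(attention_schedule(12))
```

For a 12-layer stack this yields five local layers before each global layer, with the schedule ending on a global layer as described.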
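The MoE routing described for the 26B A4B model (128 experts, 8 active, plus a shared expert) can be sketched in a few lines of NumPy. All function names and shapes here are illustrative, not Gemma 4's actual implementation.

```python
import numpy as np

def moe_forward(x, gate_W, experts, shared_expert, k=8):
    """Hypothetical top-k MoE layer: route a token through k of the
    experts plus an always-on shared expert (a sketch, not Gemma 4's API)."""
    logits = x @ gate_W                       # router scores, one per expert
    topk = np.argsort(logits)[-k:]            # indices of the k best experts
    weights = np.exp(logits[topk])
    weights /= weights.sum()                  # softmax over selected experts only
    out = sum(w * experts[i](x) for w, i in zip(weights, topk))
    return out + shared_expert(x)             # shared expert always contributes

# toy demo: d_model=16, 128 linear "experts"
rng = np.random.default_rng(0)
d = 16
gate_W = rng.normal(size=(d, 128))
experts = [lambda v, W=rng.normal(size=(d, d)) / d: v @ W for _ in range(128)]
shared = lambda v: v
y = moe_forward(rng.normal(size=d), gate_W, experts, shared)
print(y.shape)  # (16,)
```

Because only 8 of 128 experts run per token, the compute cost per forward pass tracks the active 4B parameters rather than the full 26B, which is the speedup the bullet describes.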
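A soft token budget for the Vision Encoder can be sketched as choosing a resize scale so the patch grid stays near a target token count while preserving aspect ratio. The patch size and budget values below are assumptions, not published Gemma 4 numbers.

```python
import math

def plan_patches(h, w, patch=16, token_budget=256):
    """Pick a downscale factor so (h*w) / patch^2 stays near token_budget.

    A minimal sketch of a soft token budget: patch size and budget are
    illustrative; real encoders add pooling and rounding rules on top.
    """
    # scale so that the scaled image yields roughly token_budget patches
    scale = math.sqrt(token_budget * patch * patch / (h * w))
    scale = min(scale, 1.0)                   # never upscale small images
    nh = max(1, round(h * scale / patch))     # patch rows
    nw = max(1, round(w * scale / patch))     # patch columns
    return nh, nw, nh * nw                    # grid shape and token count

print(plan_patches(1024, 768))
```

The budget is "soft" because rounding the patch grid can land slightly above or below the target; aspect ratio is preserved by applying one uniform scale to both dimensions.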
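The audio path in the last bullet can be followed as a shape walkthrough: mel-spectrogram frames, fixed-size chunks, convolutional downsampling, then one Conformer embedding per remaining frame. Every concrete number below (hop length, mel bins, chunk size, 4x downsampling, projection width) is an illustrative guess, not a documented Gemma 4 setting.

```python
import numpy as np

sample_rate = 16_000
audio = np.zeros(sample_rate * 2)            # 2 s of silent audio for the demo

# 1. mel-spectrogram: one frame every 10 ms, 128 mel bins (assumed values)
hop, n_mels = 160, 128
frames = len(audio) // hop                   # 200 frames for 2 s
mel = np.zeros((frames, n_mels))

# 2. chunking: split frames into fixed-length chunks for the encoder
chunk = 100
chunks = mel.reshape(frames // chunk, chunk, n_mels)

# 3. conv downsampling: e.g. two stride-2 convolutions -> 4x fewer frames
downsampled_len = chunk // 4

# 4. Conformer encoder + projection: one embedding per remaining frame,
#    projected to an assumed model width before entering Gemma 4
d_model = 1536
embeddings_per_chunk = (downsampled_len, d_model)
print(chunks.shape, embeddings_per_chunk)
```

The net effect is a large reduction in sequence length before the language model sees the audio: 200 raw spectrogram frames become 50 embeddings under these assumed factors.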