Gemma 4 for Telephony: From Two AI Models to One – Until I Switched to Chinese

4 hours ago

The author replaced a two-model phone agent cascade (speech-to-text plus a reasoning model) with a single multimodal Gemma 4 model and evaluated both across English, French, and Mandarin.
English: The single model achieved 100% reply accuracy with lower latency (0.66 s) than the cascade (93%, 0.81 s).
French: Single model performed well (93% accuracy, 0.71s latency) but had a language slip in one answer.
Mandarin: Single model failed catastrophically (~0% accuracy) due to poor audio transcription, unlike cascade (92%).
The metric was reply correctness rather than transcription word error rate (WER), because reply correctness reflects what the caller actually experiences.
Audio encoder quality varies by language: in this model, English and French work but Mandarin does not.
Integration simplifies telephony stack by collapsing speech-to-text and reasoning into one call.
Recommendation: Use single model for English/French, keep cascade for languages like Mandarin.
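
The recommendation above amounts to a per-language routing rule. A minimal sketch in Python of what that could look like, assuming the call's language is already known as a tag such as "en-US" or "zh-CN"; the names `SINGLE_MODEL_LANGUAGES`, `choose_pipeline`, and `handle_turn` are hypothetical and not taken from the original article, and the two pipelines are passed in as plain callables rather than tied to any real API:

```python
from typing import Callable, Dict

# Assumption: languages where the single-model path held up in the
# evaluation summarized above (English and French). Mandarin and any
# untested language fall back to the cascade.
SINGLE_MODEL_LANGUAGES = {"en", "fr"}

# A pipeline takes raw caller audio and returns the agent's reply text.
Pipeline = Callable[[bytes], str]


def choose_pipeline(language_tag: str) -> str:
    """Return "single" or "cascade" for a tag like "en-US" or "zh_CN"."""
    # Keep only the primary language subtag, lowercased.
    primary = language_tag.strip().lower().replace("_", "-").split("-")[0]
    return "single" if primary in SINGLE_MODEL_LANGUAGES else "cascade"


def handle_turn(audio: bytes, language_tag: str,
                pipelines: Dict[str, Pipeline]) -> str:
    """Send one caller turn to whichever pipeline the language maps to."""
    return pipelines[choose_pipeline(language_tag)](audio)
```

Defaulting unknown languages to the cascade is a deliberately conservative choice: a language would only be added to the single-model set after its reply accuracy has been measured.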
