Gemma 4 for Telephony: From Two AI Models to One – Until I Switched to Chinese
- #telephony
- #benchmark
- #multimodal-LLM
- Replaced a two-model phone agent cascade with a single multimodal Gemma 4 model and evaluated both on English, French, and Mandarin.
- English: Single model achieved 100% reply accuracy at 0.66s latency, vs 93% and 0.81s for the cascade.
- French: Single model performed well (93% accuracy, 0.71s latency) but had a language slip in one answer.
- Mandarin: Single model failed catastrophically (~0% accuracy) due to poor audio transcription; the cascade reached 92%.
- The metric was reply correctness rather than transcription word error rate (WER), because reply correctness is what the caller actually experiences.
- Audio encoder quality varies by language: in this model, English and French work and Mandarin does not.
- Integration simplifies telephony stack by collapsing speech-to-text and reasoning into one call.
- Recommendation: Use single model for English/French, keep cascade for languages like Mandarin.
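
The recommendation and the reply-level metric above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the language codes (`en`, `fr`), the `Pipeline` type, and the helper names `make_cascade`, `choose_pipeline`, and `reply_accuracy` are hypothetical and do not come from the original post, and the model calls are passed in as plain functions rather than tied to any real API.

```python
from typing import Callable, Sequence

# Hypothetical type: a pipeline takes raw call audio and returns the agent's text reply.
Pipeline = Callable[[bytes], str]

# Assumption: languages where the summary reports the single model working well,
# written as ISO 639-1 codes. Everything else falls back to the cascade.
SINGLE_MODEL_LANGUAGES = {"en", "fr"}


def make_cascade(transcribe: Callable[[bytes], str],
                 reason: Callable[[str], str]) -> Pipeline:
    """Build the two-model cascade: speech-to-text first, then a text-only model."""
    def pipeline(audio: bytes) -> str:
        return reason(transcribe(audio))
    return pipeline


def choose_pipeline(language: str, single_model: Pipeline, cascade: Pipeline) -> Pipeline:
    """Use the single multimodal model only for validated languages."""
    if language.lower() in SINGLE_MODEL_LANGUAGES:
        return single_model
    return cascade


def reply_accuracy(judgements: Sequence[bool]) -> float:
    """Fraction of replies judged correct; scores the reply, not the transcript."""
    if not judgements:
        raise ValueError("need at least one judged reply")
    return sum(judgements) / len(judgements)
```

A caller would pick the pipeline once per call, for example `choose_pipeline("zh", single_model, cascade)(audio)`, and any language not in the validated set goes to the cascade by default. Defaulting to the cascade keeps an untested language from silently hitting the failure mode seen with Mandarin.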