Show HN: I trained a 9M speech model to fix my Mandarin tones
- #Mandarin
- #DeepLearning
- #Pronunciation
- The author struggled with Mandarin pronunciation, especially tones, and built a deep-learning system to grade it.
- The initial approach was a pitch visualizer; the author switched to a Conformer encoder trained with CTC loss for better accuracy.
- Conformer was chosen because it captures both local speech patterns (via convolutions) and global context (via self-attention).
- CTC was used to avoid auto-correction, so the model's output reflects what was actually said, errors included.
- Forced alignment via the Viterbi algorithm determines when each syllable was spoken.
- Tokens are full Pinyin syllables with tone numbers, so tone errors surface as explicit token mismatches.
- Trained on AISHELL-1 and Primewords datasets (~300 hours) with SpecAugment for robustness.
- The model was shrunk from 75M to 9M parameters with minimal accuracy loss, optimized for on-device use.
- Fixed alignment bug by filtering silent frames to improve scoring accuracy.
- A live demo runs in the browser (~13MB model), with noted limitations for casual speech and children's voices.
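The forced-alignment step above can be sketched with a plain Viterbi pass over the CTC label lattice. This is a minimal illustration, not the author's code: it assumes the encoder emits per-frame log-probabilities over a vocabulary where index 0 is the CTC blank, and it returns the single best per-frame labeling of a known token sequence.

```python
import numpy as np

BLANK = 0  # assumed CTC blank token id

def ctc_forced_align(log_probs, tokens):
    """Viterbi forced alignment of a known token sequence to frames.

    log_probs: (T, V) per-frame log-probabilities from the acoustic model.
    tokens:    target token ids (e.g. Pinyin syllables with tones).
    Returns one label per frame (blank or a token id).
    """
    # Expand targets with blanks: [blank, t1, blank, t2, ..., blank]
    ext = [BLANK]
    for tok in tokens:
        ext += [tok, BLANK]
    S, T = len(ext), len(log_probs)

    NEG = -1e30
    dp = np.full((T, S), NEG)           # best log-score ending in state s at frame t
    back = np.zeros((T, S), dtype=np.int64)

    dp[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        dp[0, 1] = log_probs[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            cands = [dp[t - 1, s]]                  # stay in the same state
            if s >= 1:
                cands.append(dp[t - 1, s - 1])      # advance one state
            # skip the blank between two *different* non-blank labels
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                cands.append(dp[t - 1, s - 2])
            best = int(np.argmax(cands))
            dp[t, s] = cands[best] + log_probs[t, ext[s]]
            back[t, s] = s - best                   # remember the predecessor state

    # Backtrace from the better of the two valid final states
    s = S - 1 if S < 2 or dp[-1, S - 1] >= dp[-1, S - 2] else S - 2
    path = []
    for t in range(T - 1, -1, -1):
        path.append(ext[s])
        s = back[t, s]
    return path[::-1]
```

The frame spans where a syllable's token wins are exactly the "when was this syllable spoken" intervals the post describes; per-syllable confidence can then be read off the log-probabilities inside each span.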
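The syllable-plus-tone tokenization can be sketched as follows. The function name and the numeric-tone input format ("ni3 hao3", with 5 for the neutral tone) are assumptions for illustration; the point is that the syllable and its tone form one vocabulary unit, so a wrong tone is a wholly different token rather than a near-miss.

```python
import re

def tokenize_pinyin(text):
    """Split space-separated toned Pinyin into syllable-level tokens.

    Each token is a full syllable plus a tone digit (1-4, or 5 for the
    neutral tone), so a tone error becomes a different token.
    """
    tokens = []
    for part in text.lower().split():
        m = re.fullmatch(r"([a-zü]+)([1-5])", part)
        if not m:
            raise ValueError(f"not toned Pinyin: {part!r}")
        tokens.append(m.group(0))  # keep syllable+tone as one unit
    return tokens

tokenize_pinyin("ni3 hao3")  # → ['ni3', 'hao3']
```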
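SpecAugment, mentioned as the robustness trick during training, masks random frequency bands and time spans of the spectrogram. A minimal NumPy sketch (mask counts and widths are illustrative defaults, not the post's settings):

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_width=10,
                 num_time_masks=2, time_width=40, rng=None):
    """Zero out random frequency bands and time spans of a spectrogram.

    spec: (num_mel_bins, num_frames) log-mel spectrogram.
    Returns a masked copy; the input is left untouched.
    """
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_mels, n_frames = out.shape
    for _ in range(num_freq_masks):
        w = int(rng.integers(0, freq_width + 1))      # band height
        f0 = int(rng.integers(0, max(1, n_mels - w))) # band start
        out[f0:f0 + w, :] = 0.0
    for _ in range(num_time_masks):
        w = int(rng.integers(0, time_width + 1))       # span length
        t0 = int(rng.integers(0, max(1, n_frames - w)))
        out[:, t0:t0 + w] = 0.0
    return out
```

Masking forces the model to rely on the surviving context instead of any single band or span, which is what makes the scorer robust to microphone and noise variation.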
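The alignment-bug fix, filtering silent frames before scoring, can be sketched with a simple energy threshold. The threshold value and function name are assumptions; the idea is that near-silent frames must not be attributed to a syllable, or they dilute its score.

```python
import numpy as np

def drop_silent_frames(frames, labels, db_floor=-40.0):
    """Filter out near-silent frames before per-syllable scoring.

    frames: (T, D) feature frames; labels: per-frame alignment labels.
    Keeps only frames whose energy is within `db_floor` dB of the
    loudest frame, so silence can't be scored as speech.
    """
    energy = (frames ** 2).mean(axis=1)
    db = 10.0 * np.log10(np.maximum(energy, 1e-12))
    keep = db > (db.max() + db_floor)
    return frames[keep], [lab for lab, k in zip(labels, keep) if k]
```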