Show HN: I trained a 9M speech model to fix my Mandarin tones
- #Mandarin
- #DeepLearning
- #Pronunciation
- The author struggled with Mandarin pronunciation, especially tones, and built a deep-learning system to grade it.
- The initial approach was a pitch visualizer; the author switched to a Conformer encoder trained with CTC loss for better accuracy.
- Conformer was chosen because it captures both local speech patterns (via convolutions) and global context (via self-attention).
- CTC was used to avoid auto-correction, so the model's output reflects what was actually said, errors included.
- Forced alignment via the Viterbi algorithm determines when each syllable was spoken.
- Tokens are full Pinyin syllables with tone numbers, so tone errors surface as explicit token mismatches.
- Trained on AISHELL-1 and Primewords datasets (~300 hours) with SpecAugment for robustness.
- The model was shrunk from 75M to 9M parameters with minimal accuracy loss, optimized for on-device use.
- Fixed alignment bug by filtering silent frames to improve scoring accuracy.
- A live demo runs in the browser (~13MB model), with noted limitations for casual speech and children's voices.
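The forced-alignment step above can be sketched with a plain Viterbi pass over the CTC label lattice. This is a minimal illustration, not the author's code: it assumes the encoder emits per-frame log-probabilities over a vocabulary where index 0 is the CTC blank, and it returns the single best per-frame labeling of a known token sequence.

```python
import numpy as np

BLANK = 0  # assumed CTC blank token id

def ctc_forced_align(log_probs, tokens):
    """Viterbi forced alignment of a known token sequence to frames.

    log_probs: (T, V) per-frame log-probabilities from the acoustic model.
    tokens:    target token ids (e.g. Pinyin syllables with tones).
    Returns one label per frame (blank or a token id).
    """
    # Expand targets with blanks: [blank, t1, blank, t2, ..., blank]
    ext = [BLANK]
    for tok in tokens:
        ext += [tok, BLANK]
    S, T = len(ext), len(log_probs)

    NEG = -1e30
    dp = np.full((T, S), NEG)           # best log-score ending in state s at frame t
    back = np.zeros((T, S), dtype=np.int64)

    dp[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        dp[0, 1] = log_probs[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            cands = [dp[t - 1, s]]                  # stay in the same state
            if s >= 1:
                cands.append(dp[t - 1, s - 1])      # advance one state
            # skip the blank between two *different* non-blank labels
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                cands.append(dp[t - 1, s - 2])
            best = int(np.argmax(cands))
            dp[t, s] = cands[best] + log_probs[t, ext[s]]
            back[t, s] = s - best                   # remember the predecessor state

    # Backtrace from the better of the two valid final states
    s = S - 1 if S < 2 or dp[-1, S - 1] >= dp[-1, S - 2] else S - 2
    path = []
    for t in range(T - 1, -1, -1):
        path.append(ext[s])
        s = back[t, s]
    return path[::-1]
```

The frame spans where a syllable's token wins are exactly the "when was this syllable spoken" intervals the post describes; per-syllable confidence can then be read off the log-probabilities inside each span.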
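The syllable-plus-tone tokenization can be sketched as follows. The function name and the numeric-tone input format ("ni3 hao3", with 5 for the neutral tone) are assumptions for illustration; the point is that the syllable and its tone form one vocabulary unit, so a wrong tone is a wholly different token rather than a near-miss.

```python
import re

def tokenize_pinyin(text):
    """Split space-separated toned Pinyin into syllable-level tokens.

    Each token is a full syllable plus a tone digit (1-4, or 5 for the
    neutral tone), so a tone error becomes a different token.
    """
    tokens = []
    for part in text.lower().split():
        m = re.fullmatch(r"([a-zü]+)([1-5])", part)
        if not m:
            raise ValueError(f"not toned Pinyin: {part!r}")
        tokens.append(m.group(0))  # keep syllable+tone as one unit
    return tokens

tokenize_pinyin("ni3 hao3")  # → ['ni3', 'hao3']
```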
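SpecAugment, mentioned as the robustness trick during training, masks random frequency bands and time spans of the spectrogram. A minimal NumPy sketch (mask counts and widths are illustrative defaults, not the post's settings):

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_width=10,
                 num_time_masks=2, time_width=40, rng=None):
    """Zero out random frequency bands and time spans of a spectrogram.

    spec: (num_mel_bins, num_frames) log-mel spectrogram.
    Returns a masked copy; the input is left untouched.
    """
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_mels, n_frames = out.shape
    for _ in range(num_freq_masks):
        w = int(rng.integers(0, freq_width + 1))      # band height
        f0 = int(rng.integers(0, max(1, n_mels - w))) # band start
        out[f0:f0 + w, :] = 0.0
    for _ in range(num_time_masks):
        w = int(rng.integers(0, time_width + 1))       # span length
        t0 = int(rng.integers(0, max(1, n_frames - w)))
        out[:, t0:t0 + w] = 0.0
    return out
```

Masking forces the model to rely on the surviving context instead of any single band or span, which is what makes the scorer robust to microphone and noise variation.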
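The alignment-bug fix, filtering silent frames before scoring, can be sketched with a simple energy threshold. The threshold value and function name are assumptions; the idea is that near-silent frames must not be attributed to a syllable, or they dilute its score.

```python
import numpy as np

def drop_silent_frames(frames, labels, db_floor=-40.0):
    """Filter out near-silent frames before per-syllable scoring.

    frames: (T, D) feature frames; labels: per-frame alignment labels.
    Keeps only frames whose energy is within `db_floor` dB of the
    loudest frame, so silence can't be scored as speech.
    """
    energy = (frames ** 2).mean(axis=1)
    db = 10.0 * np.log10(np.maximum(energy, 1e-12))
    keep = db > (db.max() + db_floor)
    return frames[keep], [lab for lab, k in zip(labels, keep) if k]
```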