Hasty Briefs (beta)

Show HN: I trained a 9M speech model to fix my Mandarin tones

6 days ago
  • #Mandarin
  • #DeepLearning
  • #Pronunciation
  • The author struggled with Mandarin pronunciation, especially tones, and built a deep-learning system to grade it.
  • Initial approach was a pitch visualizer; the author switched to a Conformer encoder trained with CTC loss for better accuracy.
  • Conformer chosen because its convolution modules capture local speech patterns while self-attention captures global context.
  • CTC used to avoid auto-correction, so the output reflects what was actually said rather than the most plausible sentence (see the Conformer + CTC sketch after this list).
  • Forced alignment via the Viterbi algorithm determines when each syllable was spoken (alignment sketch below).
  • Tokens are Pinyin syllables with tone numbers, so tone errors surface as distinct tokens (tokenization sketch below).
  • Trained on the AISHELL-1 and Primewords corpora (~300 hours) with SpecAugment for robustness (augmentation sketch below).
  • Model shrunk from 75M to 9M parameters with minimal accuracy loss, optimized for on-device use.
  • Fixed an alignment bug by filtering out silent frames, improving scoring accuracy (scoring sketch below).
  • Live demo runs in the browser (~13MB download), with noted limitations for casual speech and children's voices.
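
The original post doesn't include code, so the sketches below are reconstructions under stated assumptions rather than the author's implementation. First, a minimal Conformer encoder with a CTC head built on torchaudio's stock Conformer; the vocabulary size, feature dimension, and layer sizes are placeholders, since the post only states that the final model is about 9M parameters.

```python
import torch
import torch.nn as nn
from torchaudio.models import Conformer

# Hypothetical sizes: the post only says the final model is ~9M parameters.
NUM_TOKENS = 1700   # pinyin syllable+tone vocabulary plus the CTC blank (assumed)
INPUT_DIM = 80      # log-mel feature dimension (assumed)

class ToneCTCModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Conformer blocks pair depthwise convolutions (local patterns)
        # with self-attention (global context).
        self.encoder = Conformer(
            input_dim=INPUT_DIM,
            num_heads=4,
            ffn_dim=512,
            num_layers=8,
            depthwise_conv_kernel_size=31,
        )
        self.head = nn.Linear(INPUT_DIM, NUM_TOKENS)  # per-frame token logits

    def forward(self, feats, lengths):
        # feats: (batch, time, INPUT_DIM); lengths: (batch,)
        enc, enc_lengths = self.encoder(feats, lengths)
        return self.head(enc).log_softmax(dim=-1), enc_lengths

model = ToneCTCModel()
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(2, 300, INPUT_DIM)                # fake log-mel batch
lengths = torch.tensor([300, 250])
targets = torch.randint(1, NUM_TOKENS, (2, 20))       # syllable+tone token ids
target_lengths = torch.tensor([20, 18])

log_probs, out_lengths = model(feats, lengths)
# nn.CTCLoss expects (time, batch, vocab)
loss = ctc_loss(log_probs.transpose(0, 1), targets, out_lengths, target_lengths)
loss.backward()
```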
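
Forced alignment can be done with a Viterbi pass over the CTC label topology (blanks interleaved between target tokens). Below is a minimal NumPy version that takes per-frame log posteriors from a model like the one above and returns per-syllable frame spans; recent torchaudio releases also ship a built-in torchaudio.functional.forced_align for the same job.

```python
import numpy as np

def ctc_forced_align(log_probs, target, blank=0):
    """Viterbi forced alignment over the CTC label topology.

    log_probs: (T, V) per-frame log posteriors from the acoustic model
    target:    non-empty list of token ids (pinyin syllable+tone tokens)
    Returns a list of (token_id, start_frame, end_frame) spans.
    """
    T = log_probs.shape[0]
    ext = [blank]                       # interleave blanks: [_, y1, _, y2, ..., _]
    for y in target:
        ext += [y, blank]
    S = len(ext)

    NEG = -1e30
    dp = np.full((T, S), NEG)
    back = np.zeros((T, S), dtype=np.int64)
    dp[0, 0] = log_probs[0, ext[0]]
    dp[0, 1] = log_probs[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            # Allowed predecessors: stay, step from s-1, or skip a blank from s-2.
            cands = [(dp[t - 1, s], s)]
            if s >= 1:
                cands.append((dp[t - 1, s - 1], s - 1))
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append((dp[t - 1, s - 2], s - 2))
            best, prev = max(cands)
            dp[t, s] = best + log_probs[t, ext[s]]
            back[t, s] = prev

    # The best path must end in the final blank or the final label.
    s = S - 1 if dp[T - 1, S - 1] >= dp[T - 1, S - 2] else S - 2
    path = [s]
    for t in range(T - 1, 0, -1):
        s = back[t, s]
        path.append(s)
    path.reverse()

    # Collapse the frame-level state path into per-token frame spans.
    spans, prev_s, start = [], None, None
    for t, s in enumerate(path):
        if s != prev_s:
            if prev_s is not None and ext[prev_s] != blank:
                spans.append((ext[prev_s], start, t - 1))
            prev_s, start = s, t
    if prev_s is not None and ext[prev_s] != blank:
        spans.append((ext[prev_s], start, len(path) - 1))
    return spans
```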
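
A toy illustration of tone-aware tokenization: each token is a whole syllable plus its tone number, so a third-tone/fourth-tone slip maps to a different token instead of being normalized away. The vocabulary here is a tiny stand-in for the full Mandarin syllable inventory; in practice a library such as pypinyin can convert reference transcripts into toned Pinyin targets.

```python
# Toy vocabulary; the real model covers every legal Mandarin syllable+tone combination.
PINYIN_TOKENS = ["<blank>", "ni3", "hao3", "hao4", "ma1", "ma3"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(PINYIN_TOKENS)}

def encode(pinyin: str) -> list[int]:
    """Map a toned-pinyin transcript like 'ni3 hao3' to token ids."""
    return [TOKEN_TO_ID[syl] for syl in pinyin.split()]

# A third-tone vs. fourth-tone confusion yields a different token entirely,
# so the CTC output exposes the tone error instead of folding it away.
print(encode("ni3 hao3"))  # [1, 2]
print(encode("ni3 hao4"))  # [1, 3]
```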
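
SpecAugment-style augmentation can be approximated with torchaudio's frequency and time masking transforms. The mask widths below are assumptions, and the time-warping step from the original SpecAugment paper is omitted.

```python
import torch
import torchaudio.transforms as T

# SpecAugment-style masking on log-mel features; mask widths are assumptions.
augment = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=27),  # mask up to 27 mel bins
    T.TimeMasking(time_mask_param=100),      # mask up to 100 frames
    T.TimeMasking(time_mask_param=100),      # SpecAugment typically applies >1 time mask
)

mel = torch.randn(1, 80, 300)    # (batch, mel_bins, frames)
augmented = augment(mel)         # bands are randomly zeroed; used during training only
```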
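
The silent-frame fix isn't described in detail, so this is only one plausible interpretation: when scoring a syllable, skip aligned frames whose energy sits below a silence threshold so pauses don't drag the score down. The threshold and the scoring formula are assumptions, not the author's method.

```python
import numpy as np

def score_syllable(log_probs, span, energies, token_id, silence_db=-40.0):
    """Average a token's posterior over its aligned frames, skipping silence.

    log_probs: (T, V) per-frame log posteriors
    span:      (start_frame, end_frame) from forced alignment
    energies:  (T,) per-frame log energy in dB
    The -40 dB threshold and mean-posterior score are assumptions.
    """
    start, end = span
    voiced = [t for t in range(start, end + 1) if energies[t] > silence_db]
    if not voiced:                 # the syllable landed entirely in a pause
        return 0.0
    return float(np.exp(log_probs[voiced, token_id]).mean())
```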