Hasty Briefs (beta)

Why removing 'um' from a recording is harder than it sounds

2 days ago
  • #audio-editing
  • #speech-processing
  • #open-source-tool
  • Linguists call sounds like 'um', 'uh', and 'er' disfluencies; erm is a tool designed to edit them out of audio recordings automatically.
  • erm uses Whisper (via faster-whisper) to transcribe the audio with word-level timestamps and detect fillers, then runs multiple audio analysis passes to catch the fillers Whisper misses.
  • The tool refines cut points to avoid audible clicks by sliding endpoints to quiet spots and aligning with zero-crossings, and uses dynamic crossfade lengths for smoother splices.
  • To mask background hiss mismatches, erm loops a quiet room tone under the entire output, ensuring consistent ambient sound.
  • erm uses a hybrid denoising strategy: it detects fillers in the original audio but splices from a denoised copy, which preserves both detection accuracy and output quality.
  • A validation subcommand checks the output's integrity by verifying that the file opens, that it is shorter than the input, and that a re-transcription contains no fillers.
  • erm deliberately leaves words such as 'like' and 'you know' untouched and does not handle repeated words or false starts; it targets only non-linguistic fillers.
  • The tool can be run with uvx for quick use or installed with pip. It requires ffmpeg and ffprobe, and it processes audio locally without transmitting any data.
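
The timestamp-based detection step described above can be sketched as follows. This is a minimal illustration, not erm's actual code: it assumes the transcriber has already produced words as (text, start, end) tuples in seconds, and the FILLERS set and both function names are hypothetical.

```python
import string

# Hypothetical filler list for illustration; the summary does not give erm's real list.
FILLERS = {"um", "uh", "er", "erm"}


def find_filler_spans(words):
    """Return (start, end) spans, in seconds, for words that are fillers.

    `words` is an iterable of (text, start, end) tuples, an assumed shape
    for word-level timestamps.
    """
    spans = []
    for text, start, end in words:
        # Normalize: drop surrounding whitespace and punctuation, ignore case.
        token = text.strip().strip(string.punctuation).lower()
        if token in FILLERS:
            spans.append((start, end))
    return spans


def keep_ranges(spans, duration):
    """Invert filler spans into the ranges of audio that should be kept."""
    ranges = []
    cursor = 0.0
    for start, end in sorted(spans):
        if start > cursor:
            ranges.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < duration:
        ranges.append((cursor, duration))
    return ranges


words = [
    (" So", 0.0, 0.3),
    (" um,", 0.3, 0.8),
    (" the", 0.8, 1.0),
    (" Uh", 1.0, 1.4),
    (" plan", 1.4, 2.0),
]
spans = find_filler_spans(words)
print(spans)
print(keep_ranges(spans, 2.0))
```

Matching on normalized text only finds fillers the transcriber actually wrote down, which is why the summary mentions extra audio analysis passes for the ones Whisper misses.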
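
The cut-point refinement mentioned above (slide to a quiet spot, then align with a zero-crossing) might look roughly like this. It is a sketch under stated assumptions: audio is a plain list of mono float samples, and the function names, search radii, and energy window are all hypothetical choices rather than erm's real parameters.

```python
def quietest_index(samples, center, radius, window=4):
    """Return the index near `center` whose surrounding window has the least mean energy."""
    lo = max(0, center - radius)
    hi = min(len(samples) - 1, center + radius)
    best, best_energy = center, float("inf")
    for i in range(lo, hi + 1):
        a = max(0, i - window)
        b = min(len(samples), i + window + 1)
        energy = sum(s * s for s in samples[a:b]) / (b - a)
        if energy < best_energy:
            best, best_energy = i, energy
    return best


def nearest_zero_crossing(samples, index, radius):
    """Return the index closest to `index` where the signal is zero or changes sign."""
    for offset in range(radius + 1):
        for i in (index - offset, index + offset):
            if 0 < i < len(samples):
                sign_change = (samples[i - 1] < 0) != (samples[i] < 0)
                if samples[i] == 0 or sign_change:
                    return i
    return index  # no crossing within reach; leave the cut where it is


def refine_cut(samples, index, search_radius=200, snap_radius=50):
    """Slide a cut to a quiet spot, then snap it to a nearby zero-crossing."""
    quiet = quietest_index(samples, index, search_radius)
    return nearest_zero_crossing(samples, quiet, snap_radius)
```

Cutting where the waveform crosses zero avoids a sudden jump in amplitude at the splice, which is what is heard as a click; searching for a quiet spot first keeps the cut away from audible speech.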