Why removing 'um' from a recording is harder than it sounds
2 days ago
- #audio-editing
- #speech-processing
- #open-source-tool
- Linguists call sounds like 'um', 'uh', and 'er' disfluencies; erm is a tool designed to edit them out of audio recordings automatically.
- erm uses Whisper (via faster-whisper) to transcribe the audio with word-level timestamps and detect fillers, then runs several audio analysis passes to catch the fillers Whisper misses.
- The tool refines cut points to avoid audible clicks by sliding endpoints to quiet spots and aligning with zero-crossings, and uses dynamic crossfade lengths for smoother splices.
- To mask background hiss mismatches, erm loops a quiet room tone under the entire output, ensuring consistent ambient sound.
- erm employs a hybrid denoising strategy, detecting fillers in original audio while splicing from a denoised copy to maintain accuracy and audio quality.
- A validation subcommand checks the output's integrity by verifying that the file opens, that it is shorter than the input, and that a re-transcription contains no fillers.
- erm deliberately leaves actual words such as 'like' and 'you know' untouched, and does not handle repeated words or false starts; it focuses solely on non-linguistic fillers.
- The tool can be run with uvx for quick use or installed with pip. It requires ffmpeg and ffprobe, and it processes audio locally without sending data anywhere.
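
The cut-point refinement mentioned in the list above (aligning endpoints with zero-crossings and crossfading the splice) can be illustrated with a minimal sketch. This is not erm's code: the function names `nearest_zero_crossing` and `splice_with_crossfade`, the search window, the fixed fade length, and the equal-power fade curve are all assumptions made for illustration, and the "slide to a quiet spot" step is left out. Audio is modeled as a plain list of float samples.

```python
import math


def nearest_zero_crossing(samples, index, max_search=200):
    """Return the index closest to `index` where the signal changes sign.

    Hypothetical helper: searches outward up to `max_search` samples and
    falls back to the original index if no crossing is found.
    """
    n = len(samples)
    for offset in range(max_search + 1):
        for i in (index - offset, index + offset):
            if 1 <= i < n:
                prev, cur = samples[i - 1], samples[i]
                if prev <= 0.0 <= cur or prev >= 0.0 >= cur:
                    return i
    return index


def splice_with_crossfade(samples, cut_start, cut_end, fade_len):
    """Remove roughly samples[cut_start:cut_end] and blend the two sides.

    Both cut points are first snapped to zero-crossings. The last
    `fade_len` samples before the cut fade out while the first `fade_len`
    samples after it fade in (equal-power curve, an assumption here).
    """
    start = nearest_zero_crossing(samples, cut_start)
    end = nearest_zero_crossing(samples, cut_end)
    if end <= start:
        return list(samples)  # nothing left to cut after snapping

    # Never fade over more audio than exists on either side of the cut.
    fade_len = max(0, min(fade_len, start, len(samples) - end))

    head = samples[:start - fade_len]
    fading_out = samples[start - fade_len:start]
    fading_in = samples[end:end + fade_len]
    rest = samples[end + fade_len:]

    blended = []
    for k in range(fade_len):
        t = (k + 0.5) / fade_len              # position in the fade, 0..1
        gain_out = math.cos(t * math.pi / 2)  # 1 -> 0
        gain_in = math.sin(t * math.pi / 2)   # 0 -> 1
        blended.append(fading_out[k] * gain_out + fading_in[k] * gain_in)

    return head + blended + rest


if __name__ == "__main__":
    # A 1000-sample sine wave stands in for real audio.
    tone = [math.sin(2 * math.pi * i / 50) for i in range(1000)]
    edited = splice_with_crossfade(tone, 300, 600, fade_len=20)
    print(len(tone), "->", len(edited))
```

Snapping to a zero-crossing keeps the waveform from jumping abruptly at the splice, which is what produces an audible click; the crossfade then smooths over whatever mismatch remains. A tool with dynamic crossfade lengths, as the summary describes, would choose `fade_len` per cut rather than take a fixed value.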