Why removing 'um' from a recording is harder than it sounds

2 days ago

Linguists call sounds like 'um', 'uh', and 'er' disfluencies; erm is a tool designed to edit them out of audio recordings automatically.
erm uses Whisper (via faster-whisper) to transcribe with word-level timestamps and detect fillers, then runs multiple audio analysis passes to catch the fillers Whisper misses.
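
A minimal sketch of the transcription-based detection step, not erm's actual code. The filler list, the function names, and the "base" model size are assumptions made for illustration; only the `WhisperModel(...).transcribe(..., word_timestamps=True)` call comes from faster-whisper itself. The audio-analysis passes that catch what Whisper misses are not shown.

```python
import re

# Assumed filler list for illustration; the tool's real list may differ.
FILLERS = {"um", "uh", "er"}


def find_fillers(words):
    """Return (start, end) spans for words that are fillers.

    `words` is an iterable of (text, start_seconds, end_seconds) tuples.
    """
    spans = []
    for text, start, end in words:
        # Whisper words often carry leading spaces and punctuation.
        token = re.sub(r"[^a-z]", "", text.lower())
        if token in FILLERS:
            spans.append((start, end))
    return spans


def transcribe_words(path, model_size="base"):
    """Transcribe `path` and return (text, start, end) for every word."""
    # Imported lazily so find_fillers works without faster-whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size)
    segments, _info = model.transcribe(path, word_timestamps=True)
    return [(w.word, w.start, w.end) for seg in segments for w in seg.words]
```

With a real recording, `find_fillers(transcribe_words("talk.wav"))` would yield the candidate cut spans in seconds; the file name here is hypothetical.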
The tool refines cut points to avoid audible clicks by sliding endpoints to quiet spots and aligning with zero-crossings, and uses dynamic crossfade lengths for smoother splices.
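
The zero-crossing alignment and crossfade described above can be sketched as follows. This is an illustration under stated assumptions, not erm's implementation: samples are plain Python floats, the crossfade is linear, and the fade length is simply clamped to the audio available on each side of the cut. The function names are hypothetical, and the step that slides endpoints to quiet spots is omitted.

```python
def snap_to_zero_crossing(samples, index, radius):
    """Return the sign-change position closest to `index` within `radius`.

    Falls back to `index` itself when no crossing is found in the window.
    """
    lo = max(1, index - radius)
    hi = min(len(samples) - 1, index + radius)
    best, best_dist = index, None
    for i in range(lo, hi + 1):
        prev, cur = samples[i - 1], samples[i]
        if (prev <= 0.0 < cur) or (prev >= 0.0 > cur):
            dist = abs(i - index)
            if best_dist is None or dist < best_dist:
                best, best_dist = i, dist
    return best


def splice_with_crossfade(samples, cut_start, cut_end, fade):
    """Remove samples[cut_start:cut_end], overlapping `fade` samples at the join."""
    # Clamp the fade so it never reaches past either end of the audio.
    fade = max(0, min(fade, cut_start, len(samples) - cut_end))
    head = samples[:cut_start - fade]
    tail = samples[cut_end + fade:]
    blended = []
    for k in range(fade):
        w = (k + 1) / (fade + 1)  # ramps from outgoing audio toward incoming audio
        outgoing = samples[cut_start - fade + k]
        incoming = samples[cut_end + k]
        blended.append((1.0 - w) * outgoing + w * incoming)
    return head + blended + tail
```

Because the two sides overlap, the result is shorter than the input by the cut length plus the fade length.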
To mask background hiss mismatches, erm loops a quiet room tone under the entire output, ensuring consistent ambient sound.
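
A rough sketch of the room-tone idea, assuming the output and a short quiet excerpt are both available as lists of float samples. The function name and the `gain` parameter are hypothetical; a real implementation would likely also smooth the seam where the loop repeats.

```python
def lay_room_tone(output, room_tone, gain=0.5):
    """Mix a looping room-tone excerpt under every sample of `output`."""
    if not room_tone:
        return list(output)
    n = len(room_tone)
    # i % n wraps around, so the excerpt repeats for the whole output length.
    return [s + gain * room_tone[i % n] for i, s in enumerate(output)]
```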
erm employs a hybrid denoising strategy: it detects fillers in the original audio but splices from a denoised copy, which preserves both detection accuracy and output quality.
A validation subcommand checks the output's integrity by verifying that the file opens, that it is shorter than the input, and that a re-transcription contains no fillers.
erm deliberately avoids editing linguistic elements such as 'like' or 'you know' and does not handle repeated words or false starts; it focuses solely on non-linguistic fillers.
The tool can be run via uvx for quick use or installed with pip; it requires ffmpeg and ffprobe and processes audio locally, without sending data anywhere.
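
The three checks could look roughly like this. It is a sketch, not the tool's validation subcommand: the function names and the filler list are assumptions, and the duration probe relies only on the standard `ffprobe -show_entries format=duration` query. A failing `ffprobe` call raises an exception, which stands in for the "file opens" check.

```python
import json
import subprocess


def probe_duration(path):
    """Return the duration of `path` in seconds; raises if ffprobe cannot open it."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "json", path],
        capture_output=True, text=True, check=True,
    )
    return float(json.loads(result.stdout)["format"]["duration"])


def validate(input_duration, output_duration, retranscribed_words,
             fillers=("um", "uh", "er")):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if output_duration <= 0:
        problems.append("output has no audio")
    if output_duration >= input_duration:
        problems.append("output is not shorter than input")
    leftover = [w for w in retranscribed_words
                if w.strip(" ,.!?").lower() in fillers]
    if leftover:
        problems.append("fillers remain: " + ", ".join(leftover))
    return problems
```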
