Free Interview Transcription — Speaker Labels, No Upload

Drop an audio file and get a speaker-labeled transcript in seconds. Runs entirely in your browser — your audio never leaves your device. No account. No cost.

Drop audio file here or click to browse
MP3, M4A, WAV, WebM · Recommended: under 30 min
On-device AI — uses Whisper-base via transformers.js. First run downloads ~150 MB of model weights (cached in browser after that). Chrome / Edge with WebGPU is fastest; Safari falls back to WebAssembly.

How it works

All processing happens locally in your browser using WebAssembly and (where available) WebGPU. Your audio is never sent to a server; the only network traffic is the one-time download of the model weights.

1. Decode audio: Your browser's built-in Web Audio API decodes the file into a 16 kHz mono PCM waveform, the format Whisper expects.
2. Whisper transcription: Whisper-base (74 M parameters, ONNX) runs chunk-by-chunk, returning text with word-level timestamps via the transformers.js library.
3. Speaker segmentation: Short-time spectral energy and zero-crossing rate features are extracted per 20 ms frame. A k-means pass groups frames into N speaker clusters, then speaker labels are snapped to Whisper segments.
4. Export: The labeled transcript is rendered inline and made available as a plain-text TXT or a subtitle-compatible SRT file you can drop into video editors.
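
The last two steps can be illustrated with a short sketch. It assumes per-frame cluster labels (one per 20 ms frame) and a list of timed segments are already available. The `Segment` and `LabeledSegment` shapes, the function names, and the majority-vote rule are illustrative assumptions, not the tool's actual code.

```typescript
// Hypothetical data shapes; the tool's real data model is not shown in the text.
interface Segment { start: number; end: number; text: string } // times in seconds
interface LabeledSegment extends Segment { speaker: string }

// Give each segment the most common cluster label among the frames it overlaps.
// A segment that overlaps no frames falls back to label 0 ("Speaker 1").
function snapLabels(segments: Segment[], frameLabels: number[], frameMs: number = 20): LabeledSegment[] {
  return segments.map((seg) => {
    const first = Math.max(0, Math.floor((seg.start * 1000) / frameMs));
    const last = Math.min(frameLabels.length, Math.ceil((seg.end * 1000) / frameMs));
    const counts: { [label: number]: number } = {};
    let best = 0;
    let bestCount = 0;
    for (let i = first; i < last; i++) {
      const label = frameLabels[i];
      counts[label] = (counts[label] || 0) + 1;
      if (counts[label] > bestCount) {
        bestCount = counts[label];
        best = label;
      }
    }
    return { start: seg.start, end: seg.end, text: seg.text, speaker: "Speaker " + (best + 1) };
  });
}

// Left-pad a non-negative integer with zeros.
function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTime(seconds: number): string {
  const totalMs = Math.max(0, Math.round(seconds * 1000));
  const totalSec = Math.floor(totalMs / 1000);
  const h = Math.floor(totalSec / 3600);
  const m = Math.floor(totalSec / 60) % 60;
  const s = totalSec % 60;
  return pad(h, 2) + ":" + pad(m, 2) + ":" + pad(s, 2) + "," + pad(totalMs % 1000, 3);
}

// Render numbered SRT cues separated by blank lines, with the speaker as a text prefix.
function toSrt(segments: LabeledSegment[]): string {
  return segments
    .map((seg, i) =>
      (i + 1) + "\n" + toSrtTime(seg.start) + " --> " + toSrtTime(seg.end) + "\n" +
      seg.speaker + ": " + seg.text.trim() + "\n")
    .join("\n");
}
```

Putting the speaker name in the cue text keeps the file valid for video editors, which generally treat the cue body as free text.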

Frequently asked questions

Does my audio get uploaded anywhere?
No. The entire pipeline — decoding, speech recognition, and speaker labeling — runs inside your browser tab. The audio bytes never leave your computer. This makes the tool safe for confidential interviews, legal recordings, medical conversations, and journalistic sources.
How accurate is the speaker labeling?
Speaker separation works best when speakers have distinct voices and there are clear pauses between turns. It uses energy and spectral-feature clustering rather than a dedicated speaker-embedding neural network, so it may occasionally swap labels on similar-sounding speakers. For high-stakes use, verify the labels manually. Accuracy improves when you set the correct speaker count before transcribing.
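
To make the clustering approach concrete, here is a minimal sketch of frame-level features followed by k-means. It is a simplification under stated assumptions: it uses time-domain RMS energy in place of spectral energy (to avoid an FFT), seeds centroids deterministically, and does not normalize features. The function names are hypothetical and this is not the tool's actual implementation.

```typescript
// Per-frame features: [RMS energy, zero-crossing rate].
// Assumption: time-domain RMS stands in for the spectral energy the text mentions.
function frameFeatures(samples: Float32Array, sampleRate: number, frameMs: number = 20): number[][] {
  const frameLen = Math.floor((sampleRate * frameMs) / 1000); // 320 samples at 16 kHz
  if (frameLen < 2) throw new Error("frame too short");
  const features: number[][] = [];
  for (let start = 0; start + frameLen <= samples.length; start += frameLen) {
    let sumSq = 0;
    let crossings = 0;
    for (let i = start; i < start + frameLen; i++) {
      sumSq += samples[i] * samples[i];
      // Count a crossing whenever consecutive samples differ in sign.
      if (i > start && (samples[i] >= 0) !== (samples[i - 1] >= 0)) crossings++;
    }
    features.push([Math.sqrt(sumSq / frameLen), crossings / (frameLen - 1)]);
  }
  return features;
}

// Plain k-means (Lloyd's algorithm). Returns one cluster index per point.
function kMeans(points: number[][], k: number, iterations: number = 20): number[] {
  if (k < 1 || points.length < k) throw new Error("need at least k points");
  const dim = points[0].length;
  // Deterministic init: seed centroids with k points spread evenly through the input.
  const centroids: number[][] = [];
  for (let c = 0; c < k; c++) {
    centroids.push(points[Math.floor((c * points.length) / k)].slice());
  }
  const labels: number[] = [];
  for (let i = 0; i < points.length; i++) labels.push(0);

  for (let iter = 0; iter < iterations; iter++) {
    // Assignment step: attach each point to its nearest centroid.
    for (let i = 0; i < points.length; i++) {
      let bestDist = Infinity;
      for (let c = 0; c < k; c++) {
        let dist = 0;
        for (let j = 0; j < dim; j++) {
          const diff = points[i][j] - centroids[c][j];
          dist += diff * diff;
        }
        if (dist < bestDist) {
          bestDist = dist;
          labels[i] = c;
        }
      }
    }
    // Update step: move each centroid to the mean of its assigned points.
    for (let c = 0; c < k; c++) {
      const sum: number[] = [];
      for (let j = 0; j < dim; j++) sum.push(0);
      let count = 0;
      for (let i = 0; i < points.length; i++) {
        if (labels[i] !== c) continue;
        for (let j = 0; j < dim; j++) sum[j] += points[i][j];
        count++;
      }
      if (count > 0) {
        for (let j = 0; j < dim; j++) centroids[c][j] = sum[j] / count;
      }
    }
  }
  return labels;
}
```

Because the clusters are built from loudness and pitch-like cues rather than learned voice embeddings, two speakers with similar volume and timbre produce overlapping feature points, which is where label swaps come from.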
How long can the audio be?
There is no hard cap, but browser memory limits apply. Files up to 30 minutes work well on most laptops, typically using 1–2 GB of RAM for the decoded audio plus the model. Very long recordings (60+ min) may hit memory limits on lower-end devices. If the tab crashes, split the file first with a free tool like Audacity.
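
The decoded-audio share of that memory can be estimated with simple arithmetic, since the Web Audio API holds decoded samples as 32-bit floats. The helper below is a hypothetical illustration. The 48 kHz stereo case is an assumed example of what a browser might hold before downmixing, not a measurement of this tool.

```typescript
// Bytes needed to hold decoded audio as 32-bit float PCM.
function decodedPcmBytes(durationSeconds: number, sampleRate: number = 16000, channels: number = 1): number {
  const BYTES_PER_SAMPLE = 4; // one 32-bit float per sample
  return durationSeconds * sampleRate * channels * BYTES_PER_SAMPLE;
}

// Convert bytes to decimal megabytes.
function toMegabytes(bytes: number): number {
  return bytes / 1e6;
}

const thirtyMinutes = 30 * 60;
const whisperInputMb = toMegabytes(decodedPcmBytes(thirtyMinutes));          // 16 kHz mono: 115.2 MB
const nativeDecodeMb = toMegabytes(decodedPcmBytes(thirtyMinutes, 48000, 2)); // 48 kHz stereo: 691.2 MB
```

The 16 kHz mono waveform itself is modest. Under these assumptions, most of the footprint would come from the full-rate decode, the model weights, and the inference runtime's working buffers, and it grows linearly with recording length.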
Why does the first run take so long to start?
Whisper-base is roughly 145 MB of ONNX model weights. On the first run these are downloaded from the Hugging Face CDN and cached in your browser's IndexedDB. Every subsequent run on the same device starts in a few seconds because the model is loaded from that local cache.
What audio formats are supported?
Any format your browser can decode: MP3, M4A (AAC), WAV, WebM/Opus, OGG, FLAC. The file is decoded via the Web Audio API, so browser codec support determines compatibility. Chrome and Edge support the widest range; Safari handles M4A and WAV natively. If your file fails to load, convert it to WAV with a free tool first.
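
Whatever the source format, the decoded samples still have to be reduced to the 16 kHz mono waveform Whisper expects. The sketch below shows one simple way to do that on raw channel data. It is an assumption for illustration only: the tool may instead let the browser resample, and linear interpolation without a low-pass filter can introduce aliasing when downsampling. Function names are hypothetical.

```typescript
// Average all channels into a single mono channel.
function downmixToMono(channels: Float32Array[]): Float32Array {
  if (channels.length === 0) throw new Error("no channels");
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (let c = 0; c < channels.length; c++) {
    for (let i = 0; i < length; i++) {
      mono[i] += channels[c][i] / channels.length;
    }
  }
  return mono;
}

// Resample by linear interpolation between neighbouring input samples.
// Note: no anti-aliasing filter is applied in this sketch.
function resampleLinear(input: Float32Array, fromRate: number, toRate: number = 16000): Float32Array {
  if (fromRate === toRate) return new Float32Array(input); // copy, no change
  const outLength = Math.floor((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate; // input samples per output sample
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    out[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return out;
}
```

For example, one second of 48 kHz stereo audio becomes a single array of 16,000 samples after `resampleLinear(downmixToMono([left, right]), 48000)`.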
Can I use this for languages other than English?
Yes. Whisper-base is multilingual and supports over 90 languages. Select your language (or "Auto-detect") from the Language dropdown before transcribing. Accuracy is highest for English, Spanish, French, German, and Japanese, and tends to be lower for lower-resource languages.