- Does my audio get uploaded anywhere?
- No. The entire pipeline (decoding, speech recognition, and speaker labeling) runs inside your browser tab, so the audio bytes never leave your computer. This makes the tool suitable for confidential interviews, legal recordings, medical conversations, and journalistic sources.
- How accurate is the speaker labeling?
- Speaker separation works best when speakers have distinct voices and there are clear pauses between turns. It uses energy and spectral-feature clustering rather than a dedicated speaker-embedding neural network, so it may occasionally swap labels on similar-sounding speakers. For high-stakes use, verify the labels manually. Accuracy improves when you set the correct speaker count before transcribing.
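
To illustrate why the speaker count matters, here is a minimal sketch of clustering per-segment features into a fixed number of speakers. It assumes plain k-means with a deterministic farthest-point initialisation. The tool's actual clustering algorithm is not specified in this FAQ, and the `Features` type, the `clusterSpeakers` function, and the two-element `[energy, spectralCentroid]` vectors are hypothetical names chosen for illustration.

```typescript
// Hypothetical per-segment feature vector, e.g. [energy, spectralCentroid].
type Features = number[];

function squaredDistance(a: Features, b: Features): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

// Index of the closest centroid; ties go to the lower index.
function nearestCentroid(point: Features, centroids: Features[]): number {
  let best = 0;
  for (let c = 1; c < centroids.length; c++) {
    if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
      best = c;
    }
  }
  return best;
}

// Assumption: plain k-means, with k taken from the user-supplied speaker count.
// Returns one speaker label (0..k-1) per input segment.
function clusterSpeakers(segments: Features[], speakerCount: number, maxIterations = 50): number[] {
  if (segments.length === 0) return [];
  const k = Math.max(1, Math.min(speakerCount, segments.length));

  // Seed with the first segment, then repeatedly add the segment that is
  // farthest from all seeds chosen so far (deterministic initialisation).
  const centroids: Features[] = [[...segments[0]]];
  while (centroids.length < k) {
    let farthestIndex = 0;
    let farthestDistance = -1;
    segments.forEach((segment, i) => {
      const d = Math.min(...centroids.map((c) => squaredDistance(segment, c)));
      if (d > farthestDistance) {
        farthestDistance = d;
        farthestIndex = i;
      }
    });
    centroids.push([...segments[farthestIndex]]);
  }

  let labels = segments.map((s) => nearestCentroid(s, centroids));
  for (let iter = 0; iter < maxIterations; iter++) {
    // Move each centroid to the mean of its members (empty clusters stay put).
    for (let c = 0; c < k; c++) {
      const members = segments.filter((_, i) => labels[i] === c);
      if (members.length === 0) continue;
      centroids[c] = members[0].map(
        (_, dim) => members.reduce((acc, m) => acc + m[dim], 0) / members.length
      );
    }
    // Reassign segments; stop once no label changes.
    const next = segments.map((s) => nearestCentroid(s, centroids));
    if (next.every((label, i) => label === labels[i])) break;
    labels = next;
  }
  return labels;
}
```

In a scheme like this, a `speakerCount` that is too low forces two voices to share one label, and one that is too high can split a single voice across several labels, which is consistent with the advice to set the count before transcribing.
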
- How long can the audio be?
- There is no hard cap, but browser memory limits apply. Files up to 30 minutes work well on most laptops and typically need 1–2 GB of RAM for the decoded audio plus the model. Very long recordings (60 minutes or more) may hit memory limits on lower-end devices. If the tab crashes, split the file into shorter segments with a free tool such as Audacity first.
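
As a rough sketch of where the memory goes, the size of the decoded audio alone can be estimated from duration, sample rate, and channel count. The Web Audio API holds decoded audio as 32-bit float samples. The 48 kHz stereo and 16 kHz mono figures below are assumed example values rather than settings taken from this tool, and `estimateDecodedBytes` and `toMegabytes` are illustrative names. The estimate excludes the model weights and any working copies the pipeline makes.

```typescript
// Decoded audio in a Web Audio AudioBuffer is stored as 32-bit floats.
const BYTES_PER_FLOAT32_SAMPLE = 4;

// Bytes needed to hold the decoded PCM for one recording.
function estimateDecodedBytes(
  durationSeconds: number,
  sampleRate: number,
  channels: number
): number {
  return durationSeconds * sampleRate * channels * BYTES_PER_FLOAT32_SAMPLE;
}

function toMegabytes(bytes: number): number {
  return bytes / 1000000;
}

// Assumed example: 30 minutes of stereo audio decoded at 48 kHz.
const thirtyMinStereo = estimateDecodedBytes(30 * 60, 48000, 2);
console.log(toMegabytes(thirtyMinStereo).toFixed(1) + " MB"); // 691.2 MB

// Assumed example: the same recording held as 16 kHz mono.
const thirtyMinMono16k = estimateDecodedBytes(30 * 60, 16000, 1);
console.log(toMegabytes(thirtyMinMono16k).toFixed(1) + " MB"); // 115.2 MB
```

Doubling the duration doubles the estimate, which is why a 60-minute file is more likely to exhaust memory on a lower-end device.
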
- Why does the first run take so long to start?
- Whisper-base is roughly 145 MB of ONNX model weights. On the first run these are downloaded from the Hugging Face CDN and cached in your browser's IndexedDB. Every subsequent run on the same device starts in a few seconds because the model is served from the local cache.
- What audio formats are supported?
- Any format your browser can decode: MP3, M4A (AAC), WAV, WebM/Opus, OGG, FLAC. The file is decoded via the Web Audio API, so browser codec support determines compatibility. Chrome and Edge support the widest range; Safari handles M4A and WAV natively. If your file fails to load, convert it to WAV with a free tool first.
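
One way to check in advance which of these formats a given browser is likely to handle is to ask a media element via `canPlayType`, which returns `""`, `"maybe"`, or `"probably"`. The sketch below is an illustration, not this tool's code: `CodecProbe`, `CANDIDATE_FORMATS`, and `likelySupportedFormats` are hypothetical names, and `canPlayType` describes media-element playback, so treat the result as a hint about what the Web Audio decoder will accept rather than a guarantee.

```typescript
// Minimal shape of the one method used. In a browser, an <audio> element
// created with document.createElement("audio") satisfies this interface.
interface CodecProbe {
  canPlayType(mimeType: string): string;
}

// MIME types for the formats listed in the answer above.
const CANDIDATE_FORMATS: Record<string, string> = {
  "MP3": "audio/mpeg",
  "M4A (AAC)": 'audio/mp4; codecs="mp4a.40.2"',
  "WAV": "audio/wav",
  "WebM/Opus": 'audio/webm; codecs="opus"',
  "OGG": 'audio/ogg; codecs="vorbis"',
  "FLAC": "audio/flac",
};

// Returns the labels for which the probe reports "maybe" or "probably".
function likelySupportedFormats(probe: CodecProbe): string[] {
  return Object.keys(CANDIDATE_FORMATS).filter(
    (label) => probe.canPlayType(CANDIDATE_FORMATS[label]) !== ""
  );
}
```

In a browser this could be called as `likelySupportedFormats(document.createElement("audio"))`. A format missing from the result is a good candidate for converting to WAV first.
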
- Can I use this for languages other than English?
- Yes. Whisper-base is multilingual and supports over 90 languages. Select your language (or "Auto-detect") from the Language dropdown before transcribing. Accuracy is highest for English, Spanish, French, German, and Japanese, and may be lower for lower-resource languages.