- Is this really free with no sign-up?
- Yes. The Whisper model runs directly in your browser via WebAssembly and ONNX Runtime, so there is no backend, no account system, and no server receiving your audio. The only network activity is a one-time download of the model weights from Hugging Face's CDN; the weights are then cached locally. After that first download, the tool also works offline.
- How accurate is it compared to services like Otter.ai?
- Whisper is one of the most capable open-source speech recognition models available. On clear audio, the "Small" model achieves word error rates comparable to many paid cloud services. Accuracy drops with heavy accents, low-quality microphones, or overlapping speakers. For best results, use a quiet recording at a reasonable bit rate and provide a glossary of unusual names or jargon. The browser models are slightly less accurate than the full Whisper large-v3 model offered by cloud APIs, but they cost nothing and your audio never leaves your device.
- What file formats and lengths are supported?
- The tool accepts any format the browser's Web Audio API can decode: MP3, WAV, M4A/AAC, WebM (Opus or Vorbis), OGG, FLAC, and more. File length is limited only by available RAM — a typical 30-minute meeting recording at 128 kbps (about 30 MB) transcribes comfortably. Very long files (2+ hours) may be slow on low-end devices; use the Tiny model and ensure you have at least 2 GB of free RAM.
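To make the RAM figures above more concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the decoded audio is held as 16 kHz mono 32-bit float samples (the input rate Whisper models expect); the function names are hypothetical, and the tool's real memory use will be higher because the browser may first decode at the file's native sample rate and the model itself also occupies RAM.

```python
def compressed_size_mb(duration_s: float, bitrate_kbps: float) -> float:
    """Approximate size of a constant-bitrate file in megabytes (10^6 bytes)."""
    return duration_s * bitrate_kbps * 1000 / 8 / 1_000_000


def decoded_pcm_mb(duration_s: float, sample_rate_hz: int = 16_000,
                   channels: int = 1, bytes_per_sample: int = 4) -> float:
    """Approximate size of the decoded samples in memory, in megabytes.

    Defaults are an assumption: 16 kHz mono float32 audio.
    """
    return duration_s * sample_rate_hz * channels * bytes_per_sample / 1_000_000


meeting_s = 30 * 60  # the 30-minute meeting from the answer above
print(round(compressed_size_mb(meeting_s, 128), 1))  # 28.8
print(round(decoded_pcm_mb(meeting_s), 1))           # 115.2
```

Under the same assumptions, a 2-hour file decodes to roughly 460 MB of samples, which is why long recordings benefit from extra free RAM.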
- What does the "Custom glossary" field do?
- Whisper accepts an optional "initial prompt" — a short text it reads before processing your audio to prime its vocabulary expectations. This tool collects your glossary terms, joins them into a prompt such as "Words to use: Kubernetes, CRISPR, Knackpad, Nguyen.", and passes it to the model. This significantly reduces mis-spellings of product names, medical terms, technical jargon, and unusual proper nouns. Enter one term per line, up to around 50 terms.
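The glossary-to-prompt step described above could look roughly like the following Python sketch. The "Words to use:" wording comes from the answer above; the function name, the case-insensitive de-duplication, and the hard cap of 50 terms are illustrative assumptions, not the tool's actual code.

```python
def build_initial_prompt(glossary_text: str, max_terms: int = 50) -> str:
    """Turn a one-term-per-line glossary into a short initial prompt."""
    terms = []
    seen = set()
    for line in glossary_text.splitlines():
        term = line.strip()
        key = term.casefold()
        # Skip blank lines and repeated terms (assumed behaviour).
        if not term or key in seen:
            continue
        seen.add(key)
        terms.append(term)
    terms = terms[:max_terms]  # keep the prompt short
    if not terms:
        return ""
    return "Words to use: " + ", ".join(terms) + "."


print(build_initial_prompt("Kubernetes\nCRISPR\nKnackpad\nNguyen"))
# Words to use: Kubernetes, CRISPR, Knackpad, Nguyen.
```

Keeping the prompt short matters because the initial prompt shares the model's limited context with the transcript itself.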
- Which model size should I choose?
- Tiny (~40 MB): best for quick notes, single speaker, clear audio — loads fastest. Base (~75 MB): a good all-rounder for most voice recordings. Small (~245 MB): recommended for technical content, non-native speakers, noisy recordings, or multiple speakers — slowest to download but most accurate. Models are cached after the first download, so you only pay the bandwidth cost once.
- Can I transcribe video files?
- Yes, if the browser can decode the audio track. MP4, WebM, and MOV files containing AAC or Opus audio will usually work — the tool reads only the audio track and ignores video. If your file does not load, strip the audio first with a free tool like HandBrake or ffmpeg and save as MP3 or WAV.
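If you go the ffmpeg route, the audio-stripping step can be scripted. The sketch below is one possible invocation, assuming ffmpeg is installed and on your PATH; the helper names are hypothetical, and the choice to downmix to 16 kHz mono WAV is an assumption that keeps the output small while matching the sample rate Whisper models work with.

```python
import subprocess
from pathlib import Path


def ffmpeg_extract_audio_args(video_path, out_path=None):
    """Build an ffmpeg command that drops the video track and writes WAV audio."""
    src = Path(video_path)
    # Default output: same name as the input, with a .wav extension.
    dst = Path(out_path) if out_path else src.with_suffix(".wav")
    return [
        "ffmpeg",
        "-i", str(src),   # input file
        "-vn",            # ignore the video stream
        "-ac", "1",       # downmix to mono (assumed preference)
        "-ar", "16000",   # resample to 16 kHz (assumed preference)
        str(dst),
    ]


def extract_audio(video_path, out_path=None):
    """Run ffmpeg; raises CalledProcessError if ffmpeg reports a failure."""
    subprocess.run(ffmpeg_extract_audio_args(video_path, out_path), check=True)


print(" ".join(ffmpeg_extract_audio_args("meeting.mp4")))
# ffmpeg -i meeting.mp4 -vn -ac 1 -ar 16000 meeting.wav
```

Calling `extract_audio("meeting.mp4")` would then produce `meeting.wav` next to the original, ready to load into the tool.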