- Is my audio file uploaded to any server?
- No. The entire analysis runs inside your browser using the Web Audio API and Meyda.js. Your audio file never leaves your device. This is by design — deepfake detection often involves sensitive recordings (voicemails, phone calls, interviews) that should not be shared with third-party servers.
- How accurate is this detector?
- This tool uses classical digital signal processing heuristics, not a trained neural network. It can identify common statistical signatures of text-to-speech (TTS) and voice conversion (VC) systems, but it is not definitive. Heavily compressed audio (e.g., MP3 at 64 kbps), phone codec artifacts (G.711, AMR), and reverberant recordings can all skew the score. A high score suggests AI-like statistical properties; it does not prove a clip is fake. Treat the result as one data point, not a verdict. For forensic or legal purposes, consult a specialist.
- What audio formats and lengths work best?
- MP3, WAV, and M4A files are all supported. For best results, use an uncompressed or lightly compressed clip of at least 5 seconds and ideally 10–30 seconds — very short clips give fewer frames to analyze, reducing confidence. Files with only music or background noise (no speech) will produce unreliable scores since the heuristics are calibrated for human speech patterns.
- What kinds of AI voice clones does this detect?
- The spectral heuristics are most sensitive to neural TTS systems (e.g., those built on WaveNet-style vocoders or end-to-end models such as VITS) and real-time voice conversion tools (e.g., RVC, SVC). They are less effective against high-quality diffusion-based models trained on large multi-speaker datasets, and they may miss clones that have been post-processed with added noise or room simulation to mimic natural recording conditions.
- Why does MFCC variance matter for deepfake detection?
- MFCCs (Mel-frequency cepstral coefficients) are a compact representation of the spectral shape of a sound, computed on the mel scale, which approximates how human hearing resolves frequency. When you speak naturally, your voice changes constantly: pitch, breathiness, speed, and resonance all fluctuate from frame to frame, producing high MFCC variance. Neural vocoders generate audio from a compact latent space; unless deliberately perturbed, they tend to produce a voice that is "too smooth," with lower inter-frame MFCC variance than is typical of a real human speaker. This is one of the most stable signals across different TTS architectures.
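
As a rough illustration of the MFCC-variance idea, the TypeScript sketch below averages the per-coefficient variance across analysis frames. The function name `meanMfccVariance` and the input layout (one MFCC vector per frame) are hypothetical and are not taken from this tool's code; in a browser, the per-frame vectors might come from Meyda's `mfcc` feature. No decision threshold is implied here.

```typescript
// Hypothetical helper (an illustration, not this tool's implementation).
// `frames` holds one MFCC vector per analysis frame; all vectors are
// assumed to have the same length.
function meanMfccVariance(frames: number[][]): number {
  if (frames.length < 2 || frames[0].length === 0) {
    throw new Error("need at least two non-empty frames to measure variance");
  }
  const numCoeffs = frames[0].length;
  let total = 0;
  for (let c = 0; c < numCoeffs; c++) {
    // Mean of coefficient c across all frames.
    let mean = 0;
    for (const frame of frames) mean += frame[c];
    mean /= frames.length;
    // Population variance of coefficient c across all frames.
    let sumSq = 0;
    for (const frame of frames) sumSq += (frame[c] - mean) ** 2;
    total += sumSq / frames.length;
  }
  // Average the per-coefficient variances into one number.
  return total / numCoeffs;
}

// Toy data: identical frames ("too smooth") versus frames that fluctuate.
const smoothFrames = [[1, 2], [1, 2], [1, 2]];
const livelyFrames = [[1, 2], [3, 0], [-1, 4]];
console.log(meanMfccVariance(smoothFrames));
console.log(meanMfccVariance(livelyFrames));
```

On this toy input the identical frames yield zero variance and the fluctuating frames yield a larger value, which is the contrast the heuristic relies on. Real speech and real synthetic audio would differ far less cleanly.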