- What kind of audio gives the best results?
- Clear single-note melodies work best: humming, whistling, singing a melody on "la" or "na", or playing a flute, violin, or other monophonic instrument. Chords and harmonically dense sounds make transcription harder. For polyphonic recordings, try the High onset sensitivity setting and expect that some notes may merge or split. Background noise and reverb reduce accuracy significantly, so record in a quiet room with the microphone close to the source.
- Does my audio get sent to a server?
- No. The entire transcription pipeline runs inside your browser using WebAssembly and TensorFlow.js. The only network request is a one-time download of the ~10 MB model weights (served from a CDN, cached in your browser after the first use). Your microphone audio or uploaded file never leaves your device.
- What do the MIDI, MusicXML, and ABC downloads give me?
- The MIDI file (.mid) can be imported into most DAWs and notation programs, such as GarageBand, Logic Pro, Ableton Live, FL Studio, Cubase, or MuseScore. It stores the pitch, start time, and duration of each detected note as transcribed. The MusicXML file (.xml) is the standard interchange format for notation software: Sibelius, Finale, Dorico, MuseScore, and Flat.io all import it, giving you an editable score with proper staves, clefs, and note values. The ABC file (.abc) is a lightweight text format supported by MuseScore, abcjs, and various online tools.
- Why does the first run take longer?
- The neural network model (~10 MB of weights) is downloaded from a CDN on your first visit and stored in your browser's cache. Later visits load it from the cache and skip the download, so startup is much faster. Transcription itself takes up to about one second of processing per second of audio on a modern device, and often less.
- What is the maximum recording length?
- There is no hard limit, but transcription time scales linearly with audio length. A 30-second recording typically processes in 10–30 seconds depending on your device. Very long recordings (several minutes) may cause the browser tab to use significant memory during inference. For best results, keep recordings under 60 seconds and trim silence from the start and end.
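
To illustrate what the .mid download from the MIDI, MusicXML, and ABC answer above holds, here is a minimal sketch that writes detected notes into a format-0 Standard MIDI File. It is not the app's actual encoder: the `DetectedNote` shape (MIDI pitch plus onset and duration in seconds), the fixed 120 BPM tempo, 480 ticks per quarter note, and default velocity of 80 are all assumptions for illustration.

```typescript
// Hypothetical note shape for illustration; not the app's actual data model.
interface DetectedNote {
  pitch: number;       // MIDI note number, 0-127 (60 = middle C)
  startSec: number;    // onset in seconds
  durationSec: number; // length in seconds
  velocity?: number;   // 1-127; defaults to 80 below (an assumption)
}

// Encode a non-negative integer as a MIDI variable-length quantity:
// 7 bits per byte, high bit set on every byte except the last.
function encodeVarLen(value: number): number[] {
  const bytes = [value & 0x7f];
  value >>>= 7;
  while (value > 0) {
    bytes.unshift((value & 0x7f) | 0x80);
    value >>>= 7;
  }
  return bytes;
}

function uint32(n: number): number[] {
  return [(n >>> 24) & 0xff, (n >>> 16) & 0xff, (n >>> 8) & 0xff, n & 0xff];
}

function uint16(n: number): number[] {
  return [(n >>> 8) & 0xff, n & 0xff];
}

// Build a single-track (format 0) Standard MIDI File at a fixed tempo.
function notesToMidi(notes: DetectedNote[], ticksPerQuarter = 480, bpm = 120): Uint8Array {
  const ticksPerSec = (ticksPerQuarter * bpm) / 60;

  // Flatten notes into absolute-time note-on / note-off events on channel 0.
  const events: { tick: number; data: number[] }[] = [];
  for (const n of notes) {
    const on = Math.round(n.startSec * ticksPerSec);
    // Keep every note at least one tick long so its note-off never precedes its note-on.
    const off = Math.max(on + 1, Math.round((n.startSec + n.durationSec) * ticksPerSec));
    events.push({ tick: on, data: [0x90, n.pitch, n.velocity ?? 80] });
    events.push({ tick: off, data: [0x80, n.pitch, 0] });
  }
  // Sort by time; at equal ticks, note-offs (0x80) come before note-ons (0x90).
  events.sort((a, b) => a.tick - b.tick || a.data[0] - b.data[0]);

  const track: number[] = [];
  // Tempo meta event (FF 51 03): microseconds per quarter note.
  const usPerQuarter = Math.round(60000000 / bpm);
  track.push(0x00, 0xff, 0x51, 0x03,
    (usPerQuarter >>> 16) & 0xff, (usPerQuarter >>> 8) & 0xff, usPerQuarter & 0xff);

  // Each event is preceded by its delta time since the previous event.
  let lastTick = 0;
  for (const e of events) {
    track.push(...encodeVarLen(e.tick - lastTick), ...e.data);
    lastTick = e.tick;
  }
  track.push(0x00, 0xff, 0x2f, 0x00); // End-of-track meta event

  // "MThd" header: length 6, format 0, one track, ticks per quarter note.
  const header = [0x4d, 0x54, 0x68, 0x64, ...uint32(6), ...uint16(0), ...uint16(1), ...uint16(ticksPerQuarter)];
  // "MTrk" chunk header followed by the track bytes.
  const trackHeader = [0x4d, 0x54, 0x72, 0x6b, ...uint32(track.length)];
  return new Uint8Array([...header, ...trackHeader, ...track]);
}
```

Because the tempo is fixed, onsets and durations in seconds map directly to ticks (960 ticks per second at these defaults), so timing is preserved up to tick rounding. In a browser, the returned bytes could be wrapped in a `Blob` with type `audio/midi` for download.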
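
The first-run delay described under "Why does the first run take longer?" follows a cache-first loading pattern: check local storage, download only on a miss, then store the result. The FAQ does not say which browser storage the app uses, so this sketch hides it behind a hypothetical `ByteCache` interface (in a browser it might be backed by the Cache API or IndexedDB) and takes the download function as a parameter. `loadWeights`, `ByteCache`, and `MemoryCache` are illustrative names, not part of the app.

```typescript
// Hypothetical storage abstraction; a browser version might wrap the Cache API or IndexedDB.
interface ByteCache {
  get(key: string): Promise<Uint8Array | undefined>;
  set(key: string, value: Uint8Array): Promise<void>;
}

// Return cached model weights if present; otherwise download once and store them.
async function loadWeights(
  url: string,
  cache: ByteCache,
  download: (url: string) => Promise<Uint8Array>,
): Promise<{ bytes: Uint8Array; fromCache: boolean }> {
  const cached = await cache.get(url);
  if (cached) return { bytes: cached, fromCache: true };
  const bytes = await download(url); // only reached on the first visit (cache miss)
  await cache.set(url, bytes);
  return { bytes, fromCache: false };
}

// In-memory implementation, useful for tests or non-browser environments.
class MemoryCache implements ByteCache {
  private store = new Map<string, Uint8Array>();
  async get(key: string): Promise<Uint8Array | undefined> {
    return this.store.get(key);
  }
  async set(key: string, value: Uint8Array): Promise<void> {
    this.store.set(key, value);
  }
}
```

Injecting the download function keeps the caching logic independent of the network layer, so the same code path can be exercised with a fake downloader.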
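
The advice above to trim silence from the start and end of a recording can be applied before transcription. A minimal sketch, assuming mono samples in a `Float32Array` in the range -1 to 1 (the form Web Audio's `AudioBuffer.getChannelData()` returns); the function name, the 1024-sample frame size, and the 0.01 RMS threshold are illustrative choices, not settings from the app.

```typescript
// Trim leading and trailing silence from mono PCM samples (range -1..1).
// A frame counts as silent when its RMS level is below `threshold`.
// Defaults are illustrative, not the app's actual settings.
function trimSilence(samples: Float32Array, threshold = 0.01, frameSize = 1024): Float32Array {
  const frameCount = Math.ceil(samples.length / frameSize);

  // Root-mean-square level of frame f (the last frame may be shorter).
  const rms = (f: number): number => {
    const start = f * frameSize;
    const end = Math.min(start + frameSize, samples.length);
    let sum = 0;
    for (let i = start; i < end; i++) sum += samples[i] * samples[i];
    return Math.sqrt(sum / (end - start));
  };

  // Skip silent frames from the front.
  let first = 0;
  while (first < frameCount && rms(first) < threshold) first++;
  if (first === frameCount) return new Float32Array(0); // the whole clip is silent

  // Skip silent frames from the back.
  let last = frameCount - 1;
  while (last > first && rms(last) < threshold) last--;

  // Return a view over the original buffer; no samples are copied.
  return samples.subarray(first * frameSize, Math.min((last + 1) * frameSize, samples.length));
}
```

Because transcription time scales linearly with audio length, trimming, say, 6 seconds of silence from a 30-second clip should cut processing time by roughly a fifth.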