ByteDance High-Resolution AMT System Achieves SOTA in Piano Note and Pedal Transcription

Original article was published by Synced on Artificial Intelligence on Medium


Automatic music transcription (AMT) is the task of transcribing raw audio recordings into symbolic representations such as the Musical Instrument Digital Interface (MIDI) standard. The field presents a variety of research challenges in signal processing and AI, as music signals often contain multiple sound sources that are correlated over time and frequency. In recent years, neural network-based approaches that simultaneously detect musical information such as note onsets, offsets and pitches have become increasingly common and have delivered SOTA results on AMT tasks.
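
To make the symbolic target concrete, here is a minimal sketch of what a transcribed output looks like as MIDI events, written with the pretty_midi library; the notes, times and velocities below are made up for illustration and are not taken from the paper.

```python
import pretty_midi

# A piano transcription is essentially a list of note events
# (pitch, onset, offset, velocity) plus pedal events.
pm = pretty_midi.PrettyMIDI()
piano = pretty_midi.Instrument(program=0)  # Acoustic Grand Piano

# Middle C held from 0.50 s to 1.20 s at moderate velocity.
piano.notes.append(pretty_midi.Note(velocity=80, pitch=60, start=0.50, end=1.20))
# E above middle C, overlapping the first note (piano music is polyphonic).
piano.notes.append(pretty_midi.Note(velocity=72, pitch=64, start=0.75, end=1.40))

# The sustain pedal is MIDI controller 64: press at 0.50 s, release at 1.50 s.
piano.control_changes.append(pretty_midi.ControlChange(number=64, value=127, time=0.50))
piano.control_changes.append(pretty_midi.ControlChange(number=64, value=0, time=1.50))

pm.instruments.append(piano)
pm.write('transcription.mid')
```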

AMT for piano music remains notoriously tricky because of the instrument's highly polyphonic nature. In the recent paper High-Resolution Piano Transcription with Pedals by Regressing Onset and Offset Times, researchers from TikTok developer ByteDance introduce a high-resolution piano transcription system trained by regressing the precise onset and offset times of piano notes and pedals. The approach outperforms Google's Onsets and Frames system to set a new SOTA for piano note transcription.
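
The core idea is to replace hard per-frame labels with continuous regression targets that encode how far each frame lies from the true onset or offset, so that event times can be localized at finer than frame resolution. Below is a minimal, hypothetical sketch of that idea in Python; the frame rate, the window half-width J, the triangular target shape and the parabolic peak refinement are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

# Illustrative settings (assumptions, not the paper's exact values).
FRAMES_PER_SECOND = 100   # 10 ms hop between analysis frames
J = 5                     # half-width of the regression window, in frames


def onset_regression_targets(onset_times, num_frames):
    """Continuous per-frame targets: 1.0 at the exact onset time, decaying
    linearly to 0.0 over J frames, instead of a hard 0/1 frame label."""
    targets = np.zeros(num_frames)
    frame_times = np.arange(num_frames) / FRAMES_PER_SECOND
    for onset in onset_times:
        centre = int(round(onset * FRAMES_PER_SECOND))
        for i in range(max(0, centre - J), min(num_frames, centre + J + 1)):
            distance_in_frames = abs(frame_times[i] - onset) * FRAMES_PER_SECOND
            targets[i] = max(targets[i], 1.0 - distance_in_frames / J)
    return np.clip(targets, 0.0, 1.0)


def refine_onset_time(predicted, peak_frame):
    """Recover a sub-frame onset estimate by parabolic interpolation around
    a local peak of the predicted regression values."""
    a, b, c = predicted[peak_frame - 1], predicted[peak_frame], predicted[peak_frame + 1]
    denom = a - 2.0 * b + c
    shift = 0.0 if denom == 0 else 0.5 * (a - c) / denom
    return (peak_frame + shift) / FRAMES_PER_SECOND


targets = onset_regression_targets([0.503], num_frames=100)
# Prints roughly 0.502 s: closer to the true 0.503 s onset than the 10 ms frame grid.
print(refine_onset_time(targets, int(np.argmax(targets))))
```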

Previous piano transcription systems typically split audio recordings into frames and used discriminative models to predict the presence or absence of onsets and offsets in each frame. This framewise formulation limits transcription resolution to the frame hop size, and any misalignment between the onset or offset labels and the audio makes it difficult to detect onset and offset times precisely.
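
For intuition on that resolution limit, here is a small, hypothetical numerical sketch; the 16 kHz sample rate and 512-sample hop are example values for illustration, not figures from the paper.

```python
import numpy as np

# Example analysis settings (assumptions for illustration only).
SAMPLE_RATE = 16000
HOP_SIZE = 512
FRAME_PERIOD = HOP_SIZE / SAMPLE_RATE   # 0.032 s: one label every 32 ms


def framewise_onset_labels(onset_times, num_frames):
    """Binary frame labels as used by frame-based systems: each onset is
    rounded to its nearest frame, so any timing finer than FRAME_PERIOD is lost."""
    labels = np.zeros(num_frames, dtype=np.int8)
    for onset in onset_times:
        frame = int(round(onset / FRAME_PERIOD))
        if 0 <= frame < num_frames:
            labels[frame] = 1
    return labels


# Two onsets only 10 ms apart land on the same 32 ms frame,
# so the frame-wise labels can no longer tell them apart.
labels = framewise_onset_labels([0.990, 1.000], num_frames=64)
print(FRAME_PERIOD, labels.sum())   # 0.032 1
```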

The researchers also note that even though the sustain pedal plays an essential role in the piano's musical expression, current AMT systems do not typically perform pedal transcription.