New paper: PESTO, a novel method for Pitch Estimation with a Self-supervised Transposition-equivariant Objective

New paper from the Music Team at Sony Computer Science Laboratories – Paris (Sony CSL Paris)

We introduce PESTO, a novel method for Pitch Estimation with a Self-supervised Transposition-equivariant Objective.


This method automatically retrieves the melody of a monophonic audio track, with a primary focus on maximizing speed without sacrificing accuracy. In contrast to large AI models, ours achieves competitive results on standard benchmarks while running 12 times faster than real time on a consumer CPU. We will present this new method at ISMIR 2023, the leading academic conference on music information retrieval.

To accomplish this, we leverage self-supervised learning (SSL), an emerging paradigm that allows AIs to be trained without any labeled data and aims to mimic the way human beings learn.

More precisely, we rely on the concept of equivariance. Our model is given no explicit pitch information but learns to infer pitch by comparing pairs of transposed versions of the same audio excerpt.

Intuitively, our AI functions like a musician who uses relative pitch: it first learns to recognize musical intervals and then exploits this knowledge to accurately transcribe musical pieces. In addition, we carefully design our neural architecture to be specifically sensitive to slight pitch differences.
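To make the equivariance idea above concrete, here is a minimal NumPy sketch. It is purely illustrative, not the actual PESTO architecture or training objective: the toy "model" is a circulant linear map (circular convolution), which is translation-equivariant by construction, and the number of pitch bins is an arbitrary assumption. The key point it demonstrates is that, on a log-frequency representation (such as a CQT), transposing the input by k semitones shifts the representation by k bins, and an equivariant model's output distribution shifts by the same k bins. The self-supervised objective penalizes any deviation from this behavior.

```python
import numpy as np

rng = np.random.default_rng(0)
N_BINS = 48  # toy number of log-frequency / pitch bins (assumption)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# A circulant weight matrix implements circular convolution, which is
# translation-equivariant by construction (a stand-in for a real network).
kernel = rng.normal(size=N_BINS)
W = np.stack([np.roll(kernel, i) for i in range(N_BINS)])

def model(x):
    """Toy 'pitch estimator': returns a distribution over pitch bins."""
    return softmax(W @ x)

def transpose(x, k):
    """On a log-frequency axis (e.g. a CQT), transposing by k semitones
    amounts to shifting the representation by k bins."""
    return np.roll(x, k)

x = rng.normal(size=N_BINS)  # fake log-frequency frame
k = 5                        # transposition, in bins

y = model(x)
y_transposed = model(transpose(x, k))

# Equivariance: the output distribution shifts by the same k bins.
# A self-supervised loss would penalize any mismatch between the two.
deviation = np.abs(np.roll(y, k) - y_transposed).max()
print(deviation)
```

For this toy equivariant model the deviation is numerically zero; training a real network with such an objective pushes it toward the same behavior, which is what lets pitch be inferred without labels.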

In practice, this simple method makes our AI state-of-the-art in self-supervised pitch estimation, even though its entire knowledge is compressed into just 120 kB. This extremely lightweight design enables fast inference, making it well-suited for musical applications where real-time operation and low resource usage are essential.

As an example, some recent AI-augmented instruments rely on a technology called Differentiable Digital Signal Processing (DDSP), which in turn depends on fast and reliable pitch estimation. We therefore believe that our method can greatly assist both music production and research.

To make it easy to use, we have open-sourced all our code and models and provide a simple pip-installable Python API.


We encourage you to give it a try!

Authors: Alain Riou (Télécom Paris, Sony Computer Science Laboratories – Paris); Stefan Lattner (Sony Computer Science Laboratories – Paris); Gaëtan Hadjeres (Sony AI); Geoffroy Peeters (Télécom Paris)




Code & pre-trained model:


Pip-installable Python API: