The goal of this project was to implement an algorithm that identifies chords in music. I used a Deep Neural Network (DNN), and later a Convolutional Neural Network (CNN), to extract chroma vectors from a spectrogram. The resulting chromagrams were then passed to a variety of classifiers.
Shown below are two chromagrams for the Beatles song I Need You. The top chromagram was created using a DNN, while the bottom was created using a constant Q-transform with a median filter.
- Given a batch of audio frames, extract a chromagram that contains only harmonically relevant information.
- Given a segment of a chromagram, detect the chord played and its duration.
Representations of an Audio Signal
There are quite a few ways to represent an audio signal. Below is the waveform of a .wav file.
We need to extract information about the frequencies present in the signal. To do this, an algorithm called the Fourier Transform is used. The Fourier Transform takes a signal in the time domain and maps it into the frequency domain.
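As a minimal sketch of this idea, the snippet below synthesizes a pure 440 Hz tone (the pitch A4; sample rate and duration are chosen arbitrarily) and uses NumPy's FFT to recover its frequency:

```python
import numpy as np

# One second of a 440 Hz sine wave (A4) at a 22,050 Hz sample rate.
sr = 22050
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t)

# The FFT maps the time-domain signal into the frequency domain.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)

# The strongest bin sits at the tone's frequency.
peak_freq = freqs[np.argmax(spectrum)]
print(peak_freq)  # 440.0
```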
Some recognition algorithms such as Stacked Denoising Autoencoders take spectrograms as input. A common alternative is a chromagram. A chromagram is created using a Constant Q-Transform. The Constant Q-Transform is similar to the Fourier Transform; for this explanation, all that matters is that it outputs frequencies into logarithmically spaced bins. This is convenient because each bin corresponds to a musical pitch. Notice the y-axis in the image below.
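To see why logarithmic spacing lines up with pitch, the sketch below builds the geometrically spaced center frequencies of a constant-Q analysis (12 bins per octave, starting from C1; the bin count and reference pitch are illustrative) and folds them into the 12 pitch classes that form a chromagram's rows:

```python
import numpy as np

# With 12 bins per octave, bin k has center frequency
# f_min * 2**(k / 12), i.e. one bin per equal-tempered semitone.
f_min = 32.70  # C1 in Hz (assumed reference)
n_bins = 84    # 7 octaves of semitones
centers = f_min * 2.0 ** (np.arange(n_bins) / 12.0)

# Bins 12 apart are exactly one octave (a factor of 2) apart,
# so folding bin index mod 12 collapses the spectrum into the
# 12 chroma classes (C, C#, D, ..., B).
pitch_classes = np.arange(n_bins) % 12
```

This folding step is what turns a pitch-aligned spectrum into the compact 12-row chromagram the classifiers consume.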
An audio signal contains sound unrelated to the harmony, such as percussion and ambiance. The Q-Transform bundles this in with the harmonically relevant information. For the sample above this isn't a problem, because the signal is a guitar playing a single note; polyphonic music, such as a recording of a full band, would not look as clean. The majority of chord recognition algorithms use a filter. These filters are essentially a set of hand-crafted rules. A filter tries to do things like remove percussion from a track by suppressing certain frequencies, on the assumption that particular frequency ranges contain the harmony. The problem is that music is unpredictable: a filter may succeed in extracting chroma from classical music but fail on rap. Rule-based filters fail to remove all the noise and have the side effect of discarding harmonic information. A more general, adaptive algorithm is needed.
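One well-known rule-based filter of this kind is median-filter harmonic/percussive separation: harmonic energy forms horizontal ridges in a spectrogram (steady in time), while percussive energy forms vertical ones (broadband, brief). The toy sketch below, on a synthetic spectrogram, shows both the rule and why it is brittle; the filter sizes are arbitrary choices, which is exactly the kind of hand tuning the text criticizes:

```python
import numpy as np
from scipy.ndimage import median_filter

# Toy magnitude spectrogram: rows = frequency, columns = time.
S = np.zeros((64, 64))
S[20, :] = 1.0   # a sustained tone: a horizontal line
S[:, 40] = 1.0   # a drum hit: a vertical line

# Rule: median-filter along time to keep horizontal (harmonic)
# structure; median-filter along frequency to keep vertical
# (percussive) structure. Window sizes are hand-picked.
harmonic = median_filter(S, size=(1, 17))
percussive = median_filter(S, size=(17, 1))
```

The harmonic estimate keeps the sustained tone and suppresses the drum hit, and vice versa; on real polyphonic audio the two kinds of energy overlap, so a fixed rule like this inevitably leaks noise or erases harmony.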
A data-driven approach is a better solution. Deep learning is capable of extracting hierarchical, discriminative features across many domains. I use a Deep Neural Net, and later a Convolutional Neural Net, to extract chroma vectors. This removes the need to preprocess the audio with noise-reduction algorithms. The network learns a function that maps a batch of spectrogram frames to a single frame (a chroma vector) of a chromagram.
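The mapping the network learns can be sketched as a small feed-forward net in plain NumPy. All shapes here are illustrative assumptions (a 15-frame context window, 178 frequency bins, one hidden layer), and the weights are random stand-ins for learned parameters; the point is only the input/output contract: a batch of spectrogram frames in, one 12-dimensional chroma vector out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative shapes: a context window of 15 spectrogram frames,
# 178 frequency bins per frame, one hidden layer, 12 chroma outputs.
rng = np.random.default_rng(0)
n_frames, n_bins, hidden, n_chroma = 15, 178, 256, 12

# Random weights stand in for parameters learned from labeled data.
W1 = rng.normal(0.0, 0.01, (n_frames * n_bins, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(0.0, 0.01, (hidden, n_chroma))
b2 = np.zeros(n_chroma)

def chroma_from_frames(frames):
    """Map a (n_frames, n_bins) batch of spectrogram frames
    to a single chroma vector, one activation per pitch class."""
    x = frames.reshape(-1)                # flatten the context window
    h = np.maximum(0.0, x @ W1 + b1)      # ReLU hidden layer
    return sigmoid(h @ W2 + b2)           # each pitch class in [0, 1]

chroma = chroma_from_frames(rng.random((n_frames, n_bins)))
```

Sliding this window across the spectrogram, one hop at a time, produces the columns of the learned chromagram shown at the top of the post.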