In this article, we will be looking at audio comparison using MFCC (Mel-Frequency Cepstral Coefficients) and DTW (Dynamic Time Warping). Unlike many other algorithms, this one does not require you to use machine learning and relies purely on mathematical calculations. We will go through each significant step of the process and document what is being done, starting from feeding in two samples and ending up with a value ranging from 0 to 1 that shows how similar these samples are. At the end, we will evaluate some pros and cons when using this algorithm.
Overview
The audio comparison process happens in two distinct parts:
- The MFCC (Mel-Frequency Cepstral Coefficient) calculation part extracts features of the audio sample which contain information about the rate of change in spectral bands. This is the most important part since it helps us determine what the audio sample contains and enables us to, in a sense, uniquely identify it.
- DTW (Dynamic Time Warping) compares two MFCC feature matrices to determine their similarity. It lets us determine the similarity of two samples even when they are not perfectly aligned or their speed differs.
Some of the reasons why this algorithm might be used include:
- Does not require machine learning (which makes it easier to implement).
- Gives high accuracy for clean speech.
- Samples do not need to be aligned.
- Sample speeds can differ.
Pre-Emphasis
The first step of the process is to apply pre-emphasis to the audio sample, which is done using a first-order high-pass filter. This filter amplifies high frequencies, which are the parts of the audio that change rapidly. Lower frequencies are less important because they contribute less to detecting audio similarities.
The formula used to apply pre-emphasis is:

y_n = x_n − a · x_{n−1}

Here x is the input signal, y is the output, and a is a coefficient that typically has a value of 0.97 but can range from 0.95 to 0.97. The first element stays unchanged (y_0 = x_0). As an example, here is an image showing the original sample before pre-emphasis and the sample after pre-emphasis (blue: before, red: after). Also note that some of the samples shown are normalized for demonstration purposes:
As expected, the rapidly changing parts are emphasized but those that change slower are almost gone. When listening to the transformed sample by ear you can confirm that it is indeed the case.
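For illustration, here is a minimal Python/NumPy sketch of this step (the function name and structure are illustrative, not the exact code from the example linked at the end of the article):

```python
import numpy as np

def pre_emphasis(samples: np.ndarray, a: float = 0.97) -> np.ndarray:
    """Apply a first-order high-pass filter: y_n = x_n - a * x_{n-1}."""
    emphasized = samples.astype(float).copy()
    # The first element stays unchanged (y_0 = x_0); every other element
    # subtracts a scaled copy of the previous sample.
    emphasized[1:] = samples[1:] - a * samples[:-1]
    return emphasized
```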
Framing
In this step we take the pre-emphasized sample and split it into multiple frames of equal length. The reason for this is that the signal changes over time, and if we split it into short frames, the frequency content within each frame stays roughly the same. In further steps we will apply a Fourier transform to these frames, which will give us more accurate comparison results in the end. If we applied the Fourier transform to the whole sample at once, rapid frequency changes would be smoothed out, which is undesirable.
The frame size can vary from 25 milliseconds up to 500 milliseconds. There is no single recommended value, so it takes a bit of experimenting to find the right one. Additionally, when doing framing you have to choose a frame jump. If you choose a frame jump equal to the frame size there will be no overlapping frames, which can be desirable if the audio sample you are analyzing does not change very rapidly. Sometimes the last frame does not have enough samples available to fill the whole frame, in which case we use zero-padding. It is worth noting that smaller frames and/or frame jumps lead to more frames being generated, which in turn means more calculations.
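A rough framing sketch, continuing the NumPy examples above (the frame size and frame jump defaults here are only placeholders):

```python
import numpy as np

def split_into_frames(samples: np.ndarray, sample_rate: int,
                      frame_size_s: float = 0.025,
                      frame_jump_s: float = 0.010) -> np.ndarray:
    """Split a signal into (possibly overlapping) frames, zero-padding the last one."""
    frame_length = int(round(frame_size_s * sample_rate))  # samples per frame
    frame_jump = int(round(frame_jump_s * sample_rate))    # samples between frame starts
    num_frames = 1 + int(np.ceil(max(len(samples) - frame_length, 0) / frame_jump))

    # Pad the signal with zeros so the last frame is completely filled.
    padded_length = (num_frames - 1) * frame_jump + frame_length
    padded = np.append(samples, np.zeros(padded_length - len(samples)))

    return np.stack([padded[i * frame_jump : i * frame_jump + frame_length]
                     for i in range(num_frames)])
```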
Windowing
The third step of the process is to apply windowing to each frame generated in the previous step. Windowing prevents sudden jumps at either end of the frame, so that the Fourier transform applied in further steps can focus on the middle part of the frame. The most popular windowing function used in this case is the Hamming window. The Hamming window formula is as follows:

y_n = x_n · (0.54 − 0.46 · cos(2πn / (N − 1)))

Here x_n is the original sample, n is the index of the sample within the frame, and N is the total sample count in a frame. For the purposes of demonstration, applying a Hamming window to an audio sample produces the following result:
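A minimal sketch of the windowing step, writing out the formula above explicitly (NumPy's built-in np.hamming would work just as well):

```python
import numpy as np

def apply_hamming_window(frame: np.ndarray) -> np.ndarray:
    """Multiply every sample in a frame by its Hamming window coefficient."""
    n = np.arange(len(frame))
    window = 0.54 - 0.46 * np.cos(2 * np.pi * n / (len(frame) - 1))
    return frame * window
```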
Discrete Fourier transform (DFT)
The fourth step is to apply a DFT to each generated frame. The result is an array of complex numbers, from which we take the absolute values. These values are used to determine the constituent pitches of the sample waveform. Because the input frames are real-valued, their spectrum is symmetric and only the non-redundant part needs to be kept, so the resulting array is roughly half the length of the original frame. As an example, let’s take a 500 millisecond frame and apply the transform. This is the result:
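A sketch of this step using NumPy's real-input FFT (the implementation in the linked example may compute the DFT differently):

```python
import numpy as np

def magnitude_spectrum(frame: np.ndarray) -> np.ndarray:
    """Return the absolute DFT values of one frame.

    np.fft.rfft keeps only the non-redundant half of the spectrum
    (len(frame) // 2 + 1 complex values) because the input is real-valued.
    """
    return np.abs(np.fft.rfft(frame))
```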
Power spectrum
In this step we will calculate the power spectrum for each frame, which will then be used at the end of the next step for matrix multiplication. To calculate the power spectrum we use this formula:

P_k = (1 / N) · | Σ_{n=0}^{N−1} x_n · e^{−2πikn/N} |²

Here, x_n is the original sample and the sum inside the absolute value is the DFT calculated in the previous step, so in practice we square the absolute DFT values and divide them by the frame length N. T, the number of power spectrum values kept per frame, is calculated using these formulas:

N = S · f   and   T = N / 2 + 1

Here, S is the sample rate and f is the frame size (in seconds), so N is the frame length in samples.
Keep in mind that the T calculation in the code example, which is linked at the end of this article, uses a while loop instead of the mathematical formula. Here is an example of the power spectrum calculated for one of the frames (note that from here on we will no longer use the Audacity audio software to visualize the results):
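Continuing the sketches above, the power spectrum of a frame could be computed like this (the scaling by the frame length follows the formula as reconstructed above and is an assumption about the exact implementation):

```python
import numpy as np

def power_spectrum(frame: np.ndarray) -> np.ndarray:
    """Square the magnitude spectrum of a frame and scale it by the frame length."""
    magnitudes = np.abs(np.fft.rfft(frame))  # T = len(frame) // 2 + 1 values
    return (magnitudes ** 2) / len(frame)
```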
Mel filter bank
The sixth step is to calculate the Mel filter bank matrix. When calculating the matrix itself, we will not use sample data from the previous steps because it is solely dependent on other parameters such as sample rate and number of filters we want to calculate. For the full steps of calculation please refer to the code example linked at the end of this article.
The Mel filter bank aims to mimic the human ear’s perception of sound, which resolves lower frequencies more accurately than higher ones. The filters therefore concentrate on the most informative parts of the frequency range. Here is an example of a Mel filter bank generated for 26 filters at a sample rate of 44100 Hz:
At the end of this step we are going to do matrix multiplication between the power spectrum calculated in the previous step and the Mel filter bank calculated in this step. Each inverted-V (triangular) filter corresponds to its own row of the filter bank matrix, so since we chose 26 filters there will be 26 rows, and the number of columns is equal to T, which we calculated in the previous step. For the power spectrum matrix, the row count is equal to the number of frames we created, while the number of columns is again equal to T. To multiply these matrices together we transpose the filter bank matrix and then multiply the power spectrum matrix by it. The resulting matrix size is {frame count} x 26 (the filter count).
This is how the result looks when put on a graph. Note that the y-axis values can differ significantly depending on the sample:
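Assuming the filter bank matrix has already been built (26 x T, as described above; see the linked code example for its construction), the multiplication itself is a single matrix product, for example:

```python
import numpy as np

def apply_filter_bank(power_frames: np.ndarray, mel_filter_bank: np.ndarray) -> np.ndarray:
    """Project every frame's power spectrum onto the Mel filters.

    power_frames:    {frame count} x T matrix of power spectra
    mel_filter_bank: 26 x T matrix (one row per triangular filter)
    Returns:         {frame count} x 26 matrix
    """
    # Transposing the filter bank lines up the T-sized dimensions for multiplication.
    return power_frames @ mel_filter_bank.T
```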
Discrete Cosine Transform (DCT)
The last step needed to calculate the Mel-Frequency Cepstral Coefficients is to apply a Type-2 Discrete Cosine Transform to the matrix multiplication result from the previous step. The transform is applied to each row of the matrix separately. The result of this operation looks like this:
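A minimal sketch of this step using SciPy (whether the linked implementation normalizes the transform, e.g. with norm='ortho', is an assumption here):

```python
import numpy as np
from scipy.fftpack import dct

def apply_dct(filter_bank_result: np.ndarray) -> np.ndarray:
    """Apply a Type-2 Discrete Cosine Transform to each row of the matrix."""
    # axis=1 transforms every frame (row) independently.
    return dct(filter_bank_result, type=2, axis=1, norm='ortho')
```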
Dynamic Time Warping (DTW)
In this part of the audio comparison process, we take the two MFCC matrices calculated using the previous steps and compare them using DTW. These matrices do not have to be the same size, which allows us to compare samples of different lengths. That being said, it is not recommended to compare samples whose lengths differ greatly. For this small example we compare the same sample against itself; this is how the cost matrix looks when graphed with the Python library Matplotlib:
On its own, the cost matrix probably would not make much sense, but when we add the sample next to both axes it paints a very clear picture of what is happening. Where the sample changes very frequently we see distinct marks in the matrix, otherwise it almost blends in. Keep in mind that the graph is relative to the highest point in the sample and would look very different for other samples. It is also worth noting that if the comparison is done between samples that differ in size, the resulting cost matrix would no longer be square.
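A compact DTW sketch (it assumes the Euclidean distance between frame vectors and omits the backtracking that recovers the warping path):

```python
import numpy as np

def dtw(mfcc_a: np.ndarray, mfcc_b: np.ndarray):
    """Build the accumulated-cost matrix between two MFCC matrices.

    Each row of mfcc_a / mfcc_b is one frame's coefficient vector; the two
    matrices may contain a different number of rows (frames).
    """
    n, m = len(mfcc_a), len(mfcc_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Distance between two frames' coefficient vectors.
            distance = np.linalg.norm(mfcc_a[i - 1] - mfcc_b[j - 1])
            cost[i, j] = distance + min(cost[i - 1, j],      # step in sample A
                                        cost[i, j - 1],      # step in sample B
                                        cost[i - 1, j - 1])  # step in both
    # cost[1:, 1:] is the matrix that gets graphed; cost[n, m] is the total distance.
    return cost[1:, 1:], cost[n, m]
```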
Normalization
This is the last step of the comparison process. The result we get from the previous step tells us the “distance” between the two samples, which is useful when judging the result subjectively, but not as useful when we want to automate the comparison and act on the outcome.
There are different ways to achieve normalization for DTW. Some involve doing z-normalization before calculating it, while others, such as min-max normalization, are simpler. In theory, min-max normalization should be sufficient, although outliers can cause issues with it.
To do min-max normalization you simply calculate the minimum and maximum values in the cost matrix, then, using the path (which we also retrieved when calculating DTW), calculate the average cost of a “step”. With DTW, the more similar the samples are, the lower the distance value. Since our goal is to have the result be 0 when the samples are completely different and 1 when they are identical, we need to invert the result. Using this formula we can do both the scaling and the inversion:

result = 1 − (averageStepCost − min) / (max − min)
Now we have our normalized value. Since the example above compared the same sample against itself, the resulting value is 1. Additionally, it is worth noting that when using min-max normalization, a result of 0.8 or higher should indicate that the samples are similar.
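Put together, the normalization could look like this (the (i, j) pair representation of the path is an assumption; it depends on how the DTW step returns it):

```python
import numpy as np

def normalized_similarity(cost_matrix: np.ndarray, path: list) -> float:
    """Turn a DTW cost matrix and its warping path into a 0..1 similarity score."""
    minimum = cost_matrix.min()
    maximum = cost_matrix.max()
    # Average cost of one "step" along the warping path (path holds (i, j) index pairs).
    average_step = np.mean([cost_matrix[i, j] for i, j in path])
    # Invert so that identical samples give 1 and very different samples give 0.
    return 1.0 - (average_step - minimum) / (maximum - minimum)
```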
Conclusion
Audio comparison using Mel-Frequency Cepstral Coefficients and Dynamic Time Warping consists of ten distinct steps which transform the input samples into data that can be compared mathematically. Some of these steps are more complex than others, but overall it is a relatively simple algorithm to implement. It is based purely on mathematical calculations and does not depend on machine learning, which makes it a better fit for some use cases than others. For example, if the samples are clear, of high quality, and not muffled, the comparison produces highly accurate results; if not, it is recommended to look for an alternative algorithm. Additionally, it is worth noting that this algorithm does not search for the best match location if the sample sizes differ significantly, but it is possible to add extra logic to make that happen.
Check out the implementation of this algorithm on my GitHub page: MFFC DTW.
Do you have an application with audio and video capabilities? We can help you test it using a variety of video quality metrics and algorithms. Contact us and let’s discuss your project.