What Is Audio Fingerprinting And How Does It Work – 2020 Guide

Have you ever wondered how a program like Shazam works? How can an app find a song from a snippet of it recorded in almost any setting? Even more impressive, how does that happen in just a matter of seconds? It clearly isn’t comparing just the lyrics, or you could simply recite them and get a match, and as you’ve probably tested ages ago, that doesn’t work. Some songs are instrumentals, so there are no lyrics to compare at all. This has to be something far more complex, right? That’s right. That complex process is called audio fingerprinting, and we’re going to talk about it. Let’s get started.

What Is Audio Fingerprinting?

Audio fingerprinting is the process that makes all of that, and more, possible. It represents an audio signal as compactly as possible by selecting the most relevant features of the audio content. The basic principle is quite similar to human fingerprinting: instead of looking at the whole thing, we create a unique fingerprint that can later be used to find an exact match, whether on a smartphone, TV or any other device.

This method allows the audio to be analysed regardless of its format and regardless of any existing metadata. It also allows for precise identification even after severe compression or deterioration of the sound quality. Identifying unknown tracks and artists from your phone isn’t the only use. Audio retrieval is certainly one of the most common applications, but audio fingerprinting is also used in broadcast monitoring, for example.

Why Do We Need It And Where Do We Use It?

There are several reasons why this exists in the first place, from satisfying one’s curiosity about the title and performer of the song that’s playing on the radio to voice identification and broadcast monitoring. We’re going to talk a little bit about each of them before we dig deeper into the fingerprinting ‘techniques’ and how they work.

For an audio fingerprinting algorithm (AFA) to be considered effective, it has to be discriminative, compact and computationally simple. If it checks all the boxes, it can be applied in various cases, with the first one being…

Music Retrieval

This is arguably the most popular and common application of the AFA. Specific features of an audio track are captured and made into a fingerprint, which is then used as metadata to compare and identify the track against the millions of other tracks in numerous databases. Ideally, the algorithm will be effective in both noisy and ideal recording environments. Background noise should not play much of a factor, and we will see why in the following paragraphs. If you’re interested in this application in particular, you can read more about how Shazam, Intrasonics and Philips approach this problem.


Another application is watermarking, and the process is pretty much exactly what it sounds like. Audio fingerprinting techniques allow a digital signature to be embedded in a track. This signature is later used to identify the track or verify its authenticity by looking for the watermark. This is also very useful because any attempt to change the track’s embedded signature would result in changes in audio quality.

Broadcast Monitoring

Another use of this technique is to monitor the content broadcast on TV or radio. This is particularly useful for keeping count of a song’s playtime so that royalties are paid to the artists accordingly. It’s also commonly used by advertisers who pay based on how much air time their commercial has received.

Voice Identification

Most smartphones have some kind of voice assistant integrated nowadays. To prevent these from going off every time someone says something, it’s important to accurately identify when the owner of the phone is speaking. That’s why you have to repeat a few phrases when you set up helpers like Siri or Cortana. This creates an imprint of your voice, so it can be recognized later on.

Now, let’s take a look at how all of this is possible.

How Does It Work?

Several steps are used to extract the relevant features of an audio track and create a unique fingerprint.

1. Frame Splitting

This step splits an input audio track into frames of equal length, with each frame representing one time point in the output metadata.
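To make the idea concrete, here is a minimal sketch of frame splitting in plain Python. The function name `split_into_frames` and the `frame_size` and `hop_size` parameters are our own illustrative choices, and letting frames overlap (a hop smaller than the frame) is an assumption rather than something every fingerprinting system does.

```python
def split_into_frames(samples, frame_size, hop_size):
    """Split a sequence of samples into frames of equal length.

    Consecutive frames start hop_size samples apart, so they overlap
    whenever hop_size < frame_size. Trailing samples that do not fill
    a whole frame are dropped.
    """
    if frame_size <= 0 or hop_size <= 0:
        raise ValueError("frame_size and hop_size must be positive")
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop_size):
        frames.append(samples[start:start + frame_size])
    return frames


# Stand-in for real audio samples (hypothetical data).
samples = list(range(10))
frames = split_into_frames(samples, frame_size=4, hop_size=2)
print(len(frames), frames[0], frames[-1])
```

Each frame then becomes one time point in the fingerprint, so a smaller hop gives finer time resolution at the cost of more frames to process.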

2. Windowing

Windowing is most commonly used in spectral analysis to isolate a brief segment of a longer signal and analyse its frequency content. This technique is also used to create sound segments called grains, which are only a few milliseconds in duration. Grains can later be combined into granular sound clouds. To put it somewhat more simply, you can think of any sound with a starting and a finishing point as a windowed segment or a grain.
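As an illustration, the sketch below tapers a frame with a Hann window, one widely used window shape. The choice of Hann is our assumption, since no particular window is specified here, and the helper names `hann_window` and `apply_window` are hypothetical.

```python
import math


def hann_window(size):
    """Return a Hann window of the given length (periodic form)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / size) for n in range(size)]


def apply_window(frame, window):
    """Multiply each sample by the matching window coefficient."""
    if len(frame) != len(window):
        raise ValueError("frame and window must have the same length")
    return [s * w for s, w in zip(frame, window)]


# A constant frame makes the window's shape easy to see.
frame = [1.0] * 8
windowed = apply_window(frame, hann_window(len(frame)))
print([round(v, 3) for v in windowed])
```

The window fades the frame in and out, so the segment starts at zero and peaks in the middle, which reduces the artificial edges that an abrupt cut would add to the spectrum.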

3. Equal Loudness Filter

This filter alters the audio in a particular way: it boosts the frequencies we’re generally more sensitive to and weakens those we’re less sensitive to. In practice, it emphasises the frequency range in which the audio content is usually found and attenuates the low-frequency range, where we can usually find the melody or, in the case of noisy recordings, the background noise.
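Real equal-loudness filters are usually time-domain filters with carefully tuned coefficients, which we won’t try to reproduce here. As a rough stand-in, the sketch below uses the standard A-weighting curve, which also follows the ear’s sensitivity, and applies it to per-bin magnitudes instead of to the raw samples. That substitution, the function names and the example numbers are all assumptions made for illustration.

```python
import math


def a_weighting_db(freq_hz):
    """Gain in dB of the standard A-weighting curve at freq_hz (must be > 0)."""
    f2 = freq_hz ** 2
    ra = (12194.0 ** 2 * f2 ** 2) / (
        (f2 + 20.6 ** 2)
        * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    return 20.0 * math.log10(ra) + 2.0


def weight_magnitudes(magnitudes, sample_rate, fft_size):
    """Scale each frequency bin's magnitude by the A-weighting gain."""
    weighted = []
    for k, mag in enumerate(magnitudes):
        freq = k * sample_rate / fft_size
        # The curve is undefined at 0 Hz, so the DC bin is simply zeroed.
        gain = 0.0 if freq == 0 else 10.0 ** (a_weighting_db(freq) / 20.0)
        weighted.append(mag * gain)
    return weighted


# Flat spectrum with bins at 0, 1000, 2000, 3000 and 4000 Hz (hypothetical).
weighted = weight_magnitudes([1.0] * 5, sample_rate=8000, fft_size=8)
print([round(v, 3) for v in weighted])
print(round(a_weighting_db(50.0), 1), round(a_weighting_db(1000.0), 1))
```

The curve leaves 1000 Hz essentially untouched and cuts low frequencies heavily, which is the behaviour described above.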

4. Fast Fourier Transform

The FFT is applied to each pre-processed audio frame to identify its frequency patterns. The patterns of all the audio frames are then put together, post-processed and analysed.
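To show what the FFT gives us for a single frame, here is a small dependency-free sketch of a radix-2 FFT. In practice you would most likely use an optimised library routine instead. The test tone, sample rate and function names are assumptions chosen so that the result is easy to check.

```python
import cmath
import math


def fft(values):
    """Recursive radix-2 Cooley-Tukey FFT; the length must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def magnitude_spectrum(frame):
    """Magnitudes of the non-negative frequency bins of a real-valued frame."""
    spectrum = fft(frame)
    return [abs(c) for c in spectrum[: len(frame) // 2 + 1]]


sample_rate = 8000
frame_size = 64
tone_hz = 1000.0  # lands exactly on a bin: 1000 / (8000 / 64) = 8
frame = [math.sin(2 * math.pi * tone_hz * n / sample_rate) for n in range(frame_size)]

mags = magnitude_spectrum(frame)
peak_bin = max(range(len(mags)), key=mags.__getitem__)
print(peak_bin, peak_bin * sample_rate / frame_size)
```

The strongest bin maps back to the frequency of the tone, and it is this kind of per-frame frequency pattern that the later steps build on.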

5. Band Splitting

Different audio sources may have different cut-off frequencies, and the predominant patterns of the audio content may be similar only in certain frequency bands. As a result, the spectrum pattern derived from the short-time Fourier transform (STFT) analysis, that is, the per-frame FFTs stitched together, is divided into multiple bands along the frequency axis. The peaks (frequency bins with maximum amplitude) in each band are then calculated for all frames. Once that is done, the peaks in each band are joined together across all frames to form a contour. In the end, each audio file has as many metadata contours as there are frequency bands.
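The band-and-peak idea can be sketched in a few lines. We assume here that the spectrogram is a list of per-frame magnitude lists and that the bands are equal-width ranges of bins. Real systems may space their bands differently, and the function names are hypothetical.

```python
def band_edges(num_bins, num_bands):
    """Split bin indices 0..num_bins-1 into num_bands contiguous ranges."""
    if not 0 < num_bands <= num_bins:
        raise ValueError("need between 1 and num_bins bands")
    return [
        (i * num_bins // num_bands, (i + 1) * num_bins // num_bands)
        for i in range(num_bands)
    ]


def peak_contours(spectrogram, num_bands):
    """For each band, collect the index of the loudest bin in every frame."""
    edges = band_edges(len(spectrogram[0]), num_bands)
    contours = [[] for _ in edges]
    for frame_mags in spectrogram:
        for band, (lo, hi) in enumerate(edges):
            peak = max(range(lo, hi), key=frame_mags.__getitem__)
            contours[band].append(peak)
    return contours


# Three frames of eight magnitude bins each (made-up numbers).
spectrogram = [
    [0, 5, 1, 0, 0, 0, 9, 0],
    [0, 1, 7, 0, 0, 8, 0, 0],
    [3, 0, 0, 1, 0, 0, 0, 4],
]
contours = peak_contours(spectrogram, num_bands=2)
print(contours)
```

With two bands we get two contours, one per band, each holding one peak position per frame.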

Once all of this data is collected for the files in the database, the same process is applied to the query file, whose contours are then compared against those of the files in the database.
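How the comparison itself is done varies from system to system, and large-scale services rely on much faster indexing than anything shown here. Purely as an illustration, the sketch below slides the query’s contours along each stored track and scores the fraction of peak positions that agree. The scoring rule, the function names and the tiny database are all assumptions.

```python
def contour_similarity(query, reference):
    """Best fraction of (band, frame) peak positions that agree,
    over every alignment of the query inside the reference."""
    q_len = len(query[0])
    r_len = len(reference[0])
    if q_len == 0 or q_len > r_len:
        return 0.0
    best = 0.0
    for offset in range(r_len - q_len + 1):
        hits = 0
        for band_q, band_r in zip(query, reference):
            hits += sum(
                1 for i in range(q_len) if band_q[i] == band_r[offset + i]
            )
        best = max(best, hits / (q_len * len(query)))
    return best


def best_match(query, database):
    """Return (track_name, score) for the highest-scoring database entry."""
    return max(
        ((name, contour_similarity(query, ref)) for name, ref in database.items()),
        key=lambda item: item[1],
    )


# Two stored tracks, each with two band contours (made-up data).
database = {
    "track_a": [[1, 2, 0, 3, 1], [6, 5, 7, 4, 6]],
    "track_b": [[0, 0, 1, 1, 2], [4, 4, 5, 5, 6]],
}
# The query is a three-frame excerpt of track_a, starting at frame 1.
query = [[2, 0, 3], [5, 7, 4]]

name, score = best_match(query, database)
print(name, score)
```

Because the query can start anywhere in the song, the comparison has to try different alignments, which is why the sketch loops over offsets.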

There you have it. Understanding the uses of audio fingerprinting is far easier than understanding the science behind it, but we can all agree that this is one fascinating process with many uses and purposes.
