Truly free fingerprinting
libFooID brings audio fingerprinting functionality to a broader audience. Audio fingerprints allow robust identification of a song by simply letting the computer 'listen' to it.
Audio fingerprinting is not a new technology and several systems already exist, unfortunately most of them are of rather low performance, or restricted or proprietary in some way. libFooID is the first system that truly can be considered to be both free and high-performance.
Foosic.org services
For developers: libFooID library, including full source code, a static library and a DLL version. Freely available under either GPL or a modified BSD license.
Download: libfooid-1.0.zipFor developers: Example code in Java showing how to compare 2 fingerprints.
Download: matching.zip
Technical information
General overview
The libFooID fingerprinting procedure works by first normalizing the data to a standard, lowest-common-denominator form, followed by a time-to-frequency transformation. The frequency data then consists of blocks corresponding roughly to a second in time. This data is analyzed with an algorithm which mimicks the human hearing system. The results are quantized and packed into a compact representation.
Normalization step
As a first step, all incoming music data is converted to 32-bit float format and downmixed to mono. We skip any silence in front. We then feed the next 100 seconds worth of audio data into a resampler (currently libresample is used) to get 90 seconds of 8000Hz sampled output.
Time-to-Frequency and partitioning
The 8000Hz sampled output is converted into a frequency spectrum via a Hann-windowed DFT (using a custom Split-Radix FFT routine) applied to 8192 sample blocks. This will turn the 90 seconds of audio into 87 frames with frequency spectra. The frequency spectrum is partitioned in Bark bands following the formula by Traunmüller, and converted to power on a dB scale. We ignore the first Bark band (it contains near-DC frequencies) and extend the last one a bit so it extends up to our upper frequency limit (4000Hz), leaving 16, now fudged, Bark bands.
Spectral fitting
On each of the 16 Bark bands, for each of the 87 frames, we attempt a least square regression line fit through the spectral power values. The correlation coefficient, which expresses the goodness-of-fit, is quantized to a 2 bits "fit". (The quantization coefficients were determined by running the algorithm over a large training set and cutting the results into 25-50-75th percentiles). We also quantize the dominant spectral line per frame ("dom") in 6 bits. The latter is particularly helpful in case the song has large parts which are entirely tonal or entirely noisy.
Data packing
The resulting data, being 87 * 16 * 2 bits of "fit" data and 87 * 6 bits of "dom" data, is packed together with some metadata describing the version of the current fingerprint format, as well as the original song length (we might only have seen 100 seconds, so this must be provided by the user of the library), together with the average "fit" and "dom", into a 424 byte fingerprint.
Searching
There are a large number of ways to efficiently search the fingerdata, so the following should be considered a recommendation. Searching in the fingerprint data consists of taking a reference fingerprint and counting the number of errors that occur when comparing the "fit" and the "dominant" data with another fingerprint under consideration. It should be noted that the probability of a small error is high even with the exact same song (simple roundoff error could cause it), but the probability of a large error is very unlikely, because it would mean a large change in spectral characteristics of a nature that would cause the song to be perceived differently by a human. So we apply a larger weighting (currently quadratic) to large errors. The error calculation only needs to use integer arithmetic (since it can operate on the quantized values) and consequently runs very quickly. Example code in Java is available.
The number of fingerprints to consider can be very efficiently narrowed by considering that a hit cannot have a) a very different length b) a very different average fit c) a very different average dominant line. Considering all three will usually slim the search space by about 90-98%, depending on much risk one is willing to take. More efficient searching in time can be accomplished at the expense of memory by putting the data in a kd-tree, possibly after applying a Karhunen-Loève transformation to reduce the dimensionality, and using nearest neighbour or approximate nearest neighbour matching.
Why it works
Normalizing the data to 8000Hz mono means that differences in original sampling rate and number of channels are largely eliminated. It also makes us immune to new parametric audio coding systems such as SBR or PS (as used in HE-AAC), because we will never see the synthesized parts of the spectra.
Likewise, skipping silence ensures that we are approximately timed-aligned as well. The use of large analysis windows ensures that any remaining time-misalignment has only a minimal effect.
Small, smooth disturbances in the spectrum will not cause a large difference from the point of view of the least-squares line fitting routine, while large differences or completely different spectral compositions will cause a large difference in the ability to fit a line through the power spectrum.
The main advantage of the line fit method is that it is aware of relations between neighbouring spectral values, which is not the case with existing methods like spectral flatness measure or related methods. This can be easily seen in the case of smoothly decreasing spectral power values, such as those produced by a lowpass filter, which will hardly affect the line fit system but cause a large difference in calculated spectral flatness.


