

A Novel Audio Fingerprinting Method Robust to Time Scale Modification and Pitch Shifting

Bilei Zhu, Wei Li∗, Zhurong Wang, Xiangyang Xue∗

School of Computer Science, Fudan University
bileizhu@gmail.com, weili-fudan@fudan.edu.cn
wangzhurong@gmail.com, xyxue@fudan.edu.cn

∗Corresponding author: {weili-fudan, xyxue}@fudan.edu.cn

ABSTRACT

A novel audio fingerprinting method that is highly robust to Time Scale Modification (TSM) and pitch shifting is proposed. Instead of simply employing spectral or tempo-related features, our system is based on computer-vision techniques. We transform each 1-D audio signal into a 2-D image and treat TSM and pitch shifting of the audio signal as stretch and translation of the corresponding image. Robust local descriptors are extracted from the image and matched against those of the reference audio signals. Experimental results show that our system is highly robust to various audio distortions, including the challenging TSM and pitch shifting.

Categories and Subject Descriptors

H.5.5 [Information Systems]: Information Interfaces and Presentation—Sound and Music Computing

General Terms

Algorithms, Experimentation

Keywords

Audio Fingerprinting, Robustness, Time Scale Modification, Pitch Shifting

1. INTRODUCTION

An audio fingerprint is a compact content-based digest that uniquely represents an audio signal. With its ability to link a short, unlabeled excerpt to its corresponding metadata, audio fingerprinting is drawing more and more attention [3]. A typical audio fingerprinting system extracts features from a query clip, matches them against those of the reference audio signals, and identifies the query as an excerpt of a certain reference.

Though many systems have been proposed in the recent literature, audio fingerprinting is still a challenging task, mainly because audio signals often suffer from various distortions in the real world [1]: an audio piece may be compressed to save storage, or it may be polluted with noise during transmission. Among all these distortions, TSM (also known as time stretching), which refers to the process of changing the speed or duration of an audio signal without affecting its pitch, and pitch shifting, which refers to the process of changing the pitch of an audio signal without affecting its speed, are considered the two most challenging.

One of the most famous audio fingerprinting systems is proposed in [4]. By segmenting an audio signal into overlapping frames and extracting a 32-bit sub-fingerprint from 33 non-overlapping frequency bands of each frame, [4] achieves promising identification precision when audio lengths are stretched from 96% to 104%. However, as concluded in that paper, the robustness against TSM is mainly due to the large overlap of adjacent frames, which we argue is inefficient. What's more, pitch shifts can have a large impact on the system proposed in [4]: when even a small amount of pitch shifting is introduced, performance drops significantly. Baluja and Covell also notice this problem in [1] and provide two simple remedies: (1) insert pitch-shifted versions of the songs into the database, in addition to the original versions; (2) attempt multiple queries for each song, varying the pitch shift for each query. However, as stated in [1], both methods are inefficient: the first increases memory usage while the second increases computation.

Since [4], large numbers of new audio fingerprinting methods have been introduced, some of which focus on time-scaled audio signals. A typical example of this kind of system is proposed in [2]. [2] follows the approach of first segmenting an audio signal into time intervals and then extracting a sub-fingerprint from each interval, allowing identification of audio signals scaled by factors from 85% to 115%. However, the system lacks robustness against other signal distortions: the recognition rates decrease significantly when audio signals are compressed or polluted with noise. Generally speaking, the audio features used for audio fingerprinting before [6] are mainly spectral features capturing local spectral or harmonic behavior of a signal [3][6].

To develop a time-scale invariant audio identification system from another point of view, Kurth et al. propose a set of tempo-related features that capture local tempo, rhythm and meter characteristics of a music piece [6]. By reducing tempo estimations to certain modular tempo classes that are invariant with respect to tempo doubling, a so-called Cyclic Beat Spectrum (CBS) is obtained.

[Figure 1: Spectrogram of (a) the original 10s audio excerpt, (b) the 80%-time-stretched version of the original excerpt, (c) the 120%-time-stretched version, (d) the -50%-pitch-shifted version, and (e) the +50%-pitch-shifted version. Axes: time (s) vs. frequency.]

With the help of CBS, the system achieves high identification rates even for scaling factors of 79% and 126%, as well as under other signal distortions. [6] may be the most time-scaling-robust audio fingerprinting method we have seen.

In this paper, we propose a novel audio fingerprinting method based on computer-vision techniques. Instead of simply using spectral or tempo-related features as common methods do, we transform each audio signal into a 2-D image and treat TSM and pitch shifting of the audio signal as stretch and translation of the corresponding image. By extracting robust local descriptors from the image as audio features, our system is highly robust to various audio distortions, including the challenging TSM and pitch shifting. What's more, the robustness to pitch shifting in our system is obtained directly from the proposed fingerprinting approach, without expanding the database or multiplying queries.

The insight that computer-vision techniques can be a powerful tool for analyzing audio data comes from [5]. In [5], Ke et al. develop a system that automatically learns local descriptors of the spectrogram as audio features using pairwise boosting. [1] further improves [5] by replacing the learning approach with a wavelet-based method, achieving higher identification accuracy and better robustness. Unfortunately, neither [1] nor [5] has made any breakthrough in resisting TSM and pitch shifting: [5] does not list any experimental results on robustness to TSM and pitch shifting, while [1] is only shown to be robust for time-stretching factors from 90% to 110%.

The remainder of the paper is organized as follows. Section 2 proposes the method of audio-image transformation and describes the extraction and matching of robust audio features. Section 3 carries out experiments on the proposed features to show their robustness. Section 4 concludes the paper.

2. ALGORITHM DESCRIPTION

2.1 Auditory Image Construction

In audio fingerprinting systems, audio signals often suffer from various distortions, and creating a feature representation that is robust to these distortions is a challenging task. In the proposed method, we transform each 1-D audio signal into a 2-D image and extract computer-vision features as robust audio fingerprints.

To construct the auditory image, we convert the audio signal into a time-frequency representation (spectrogram) using the Short-Time Fourier Transform (STFT). The spectrogram represents the power contained in 97 logarithmically-spaced frequency bands, measured over 2048-sample windows with 50% overlap (see Figure 1).
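As a concrete illustration, the following Python sketch builds such a log-frequency spectrogram from a mono signal under the stated parameters (2048-sample windows, 50% overlap, 97 log-spaced bands). The band edges F_MIN and F_MAX are assumptions made for illustration; the paper does not specify them.

```python
# A minimal sketch of the auditory-image construction described above.
import numpy as np
from scipy.signal import stft

SR = 44100          # sampling rate used in the experiments (Section 3)
N_FFT = 2048        # 2048-sample analysis window (Section 2.1)
HOP = N_FFT // 2    # 50% overlap (Section 2.1)
N_BANDS = 97        # logarithmically-spaced bands (Section 2.1)
F_MIN, F_MAX = 100.0, 8000.0   # assumed band edges, not given in the paper

def log_spectrogram(x: np.ndarray) -> np.ndarray:
    """Map a mono signal to a (97 x n_frames) log-frequency power image."""
    freqs, _, Z = stft(x, fs=SR, nperseg=N_FFT, noverlap=N_FFT - HOP)
    power = np.abs(Z) ** 2                            # linear-frequency power
    edges = np.geomspace(F_MIN, F_MAX, N_BANDS + 1)   # log-spaced band edges
    image = np.empty((N_BANDS, power.shape[1]))
    for k in range(N_BANDS):
        band = (freqs >= edges[k]) & (freqs < edges[k + 1])
        # Sum the STFT power falling into band k (0 if the band is narrower
        # than one FFT bin at this resolution).
        image[k] = power[band].sum(axis=0) if band.any() else 0.0
    return image
```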
[5], Ke et al develop a system that learns local descriptors of
As mentioned in Section 1, TSM and pitch shifting are the two most challenging distortions. Figure 1 shows the spectrogram of a 10s audio excerpt together with the spectrograms of its time-stretched and pitch-shifted versions. Note that (b) and (c) share a similar time-frequency representation with (a), except that (b) is 20% shorter than (a) along the time axis while (c) is 20% longer. That is to say, TSM applied to an audio signal leads to a stretch of the corresponding spectrogram along the time axis. As for pitch shifting, the time-frequency representation is stable along the time axis but moves along the frequency axis. For example, A, D and E, located in (a), (d) and (e) respectively, are time-frequency components with similar shape and energy.

D has the same time value as A but a different frequency value, and so does E. Thus both D and E can be obtained by moving A along the frequency axis. In other words, pitch shifting applied to an audio signal leads to a translation of the corresponding spectrogram along the frequency axis.

Since TSM and pitch shifting of an audio signal can be treated as stretch and translation of the corresponding spectrogram, we claim that image features robust to stretch and translation of the spectrogram are also robust to TSM and pitch shifting of the original audio signal.
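The claim can be made concrete on the log-frequency axis of Section 2.1. The following worked equation uses our own notation (K bands over an assumed range [f_min, f_max]; these symbols do not appear in the paper): with the band index k(f) defined by the log-spaced band construction, a pitch shift by factor r adds the same constant offset to every band index, i.e. a pure vertical translation of the image.

```latex
% Band index of frequency f with K log-spaced bands over [f_min, f_max]:
\[
k(f) = K\,\frac{\log(f/f_{\min})}{\log(f_{\max}/f_{\min})},
\qquad
k(rf) = k(f) + K\,\frac{\log r}{\log(f_{\max}/f_{\min})}.
\]
% The offset K log(r) / log(f_max/f_min) is independent of f, so a pitch
% shift moves all components by the same amount along the frequency axis.
```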
2.2 Image Features as Audio Features

Robust image features are extracted as audio features in this section. Compared to global descriptors, we argue that local descriptors are more suitable for the task of audio fingerprinting, especially in the proposed method, for two reasons: (1) since queries are cut randomly from the reference audio signals with unknown time offsets, the features should be local and robust to small shifts in time; (2) local descriptors are more efficient in resisting TSM, since a large overlap of adjacent frames is not needed.

Scale Invariant Feature Transform (SIFT) [7] descriptors have been observed to perform best among local descriptors under affine transformation, scale changes, rotation, blur, JPEG compression, and illumination changes [8]. Thus we employ SIFT features of the spectrogram as features of the original audio signal. The SIFT features are calculated as follows: (1) select feature candidates by searching for peaks in the scale space of the difference-of-Gaussians (DoG) function; (2) localize each feature using measures of its stability; and (3) assign orientations based on local image gradient directions.

[Figure 2: Local descriptors extracted using the SIFT method.]
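A minimal sketch of this step, using OpenCV's SIFT implementation as one possible realization (the paper does not name a library; the dB conversion and 8-bit quantization are our assumptions for turning the power spectrogram into a grayscale image):

```python
# Extract SIFT keypoints and 128-D descriptors from the spectrogram image.
import cv2
import numpy as np

def spectrogram_sift(image: np.ndarray):
    """Return SIFT keypoints and descriptors of a log-power spectrogram."""
    # Compress dynamic range and quantize to an 8-bit grayscale image.
    db = 10.0 * np.log10(image + 1e-10)
    gray = cv2.normalize(db, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    return keypoints, descriptors  # descriptors: (n_features, 128) float32
```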
2.3 Audio Matching

The output of the SIFT feature extractor is hundreds of features together with their 128-dimensional descriptors (see Figure 2). To match two audio pieces, we first compare every feature pair between the two audio signals and then count the number of matched pairs. The matching of features is carried out by calculating the Euclidean distance between their 128-dimensional descriptors. Suppose a is a feature in audio A, b is the closest feature to a in B (the Euclidean distance between the descriptors of a and b is smaller than that between the descriptors of a and any other feature in B), and b′ is the second closest (see Figure 3). If

    D(a, b) < Th · D(a, b′)    (1)

where D(x, y) is the Euclidean distance between the descriptors of features x and y and Th is a threshold (Th = 0.6 in our experiments), then a and b are considered matched.

[Figure 3: The matching of audio features. A feature a in A, with descriptor des(d1, d2, …, d128), is compared against its nearest and second-nearest neighbors b and b′ in B.]
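A minimal sketch of the matching rule of Eq. (1), with the exhaustive nearest-neighbor search the paper describes and Th = 0.6 (the helper name count_matches is ours):

```python
# Count descriptor pairs passing the distance-ratio test of Eq. (1).
import numpy as np

TH = 0.6  # ratio threshold used in the experiments (Section 2.3)

def count_matches(desc_a: np.ndarray, desc_b: np.ndarray) -> int:
    """Exhaustive ratio-test matching; assumes B has at least two features."""
    matched = 0
    for a in desc_a:
        # Euclidean distances from a to every descriptor in B.
        dists = np.linalg.norm(desc_b - a, axis=1)
        nearest, second = np.partition(dists, 1)[:2]
        if nearest < TH * second:
            matched += 1
    return matched
```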

Figure 4 illustrates the matching results of two pairs of audio signals. A is a 5s audio excerpt; B and C are its 80%-time-stretched version and its -30%-pitch-shifted version, respectively. Note that all the black lines connecting matched features between A and B are almost horizontal; that is to say, the time-frequency representation of an audio signal is stable along the frequency axis under TSM. Meanwhile, all the black lines in the lower panel of Figure 4 are oblique, as pitch shifting leads to the movement of time-frequency components along the frequency axis.

[Figure 4: The matching of audio clips (upper panel: A vs. B; lower panel: A vs. C).]

To ensure discriminability of the proposed features, we match each query against an audio database. The audio piece which has the most matched features with the query in the database is chosen as the identification result.
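For completeness, a sketch of this identification step: the query is compared against every reference and the piece with the most matched features wins. The dictionary-based database layout is an assumption for illustration; count_matches is the helper from the previous sketch.

```python
# Identify a query by exhaustive comparison against the whole database,
# mirroring the paper's (admittedly slow) matching stage.
def identify(query_desc, database):
    """Return the ID of the database piece with the most matched features."""
    best_id, best_score = None, -1
    for piece_id, ref_desc in database.items():
        score = count_matches(query_desc, ref_desc)
        if score > best_score:
            best_id, best_score = piece_id, score
    return best_id
```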
3. EXPERIMENTAL RESULTS

To evaluate the performance of the proposed method, we first set up an audio database composed of 1241 music pieces of various genres, and a corresponding fingerprint database built with the procedures of Section 2. Each piece is mono, 60 seconds long and originally sampled at 44.1 kHz. To achieve a good tradeoff among fingerprint granularity, robustness and discriminability, we experimentally use 10s excerpts cut randomly from selected database pieces as queries. In total there are 100 randomly chosen queries, and each of them is distorted by various audio signal operations to test robustness.

Table 1 shows the robustness with regard to time stretching, where the audio pieces are stretched from 65% to 150% of their original lengths. The ID rate indicates the percentage of queries that are identified by the reference piece sharing the most matched features with them. Note that our system preserves a high ratio of correct identifications even for stretching factors as extreme as 65% and 150%, outperforming current state-of-the-art audio fingerprinting algorithms.

Table 1: Identification rates for differently time-stretched excerpts

    Time stretched [%]   65   70   80   90   100  110  120  130  140  150
    ID rate [%]          80   95   100  100  100  100  100  100  95   90

The identification results for differently pitch-shifted excerpts are presented in Table 2. Thanks to the fact that SIFT features are highly robust to image translation, our system still works well when test excerpts are shifted one octave down (50% of the original pitch) or one octave up (200% of the original pitch). Observing these results, we argue that the problem of pitch shifting can be solved directly by the audio fingerprinting method itself, without expanding the database or multiplying queries. Note that the identification rate for +100% pitch-shifted excerpts is 2% higher than that for +50% pitch-shifted ones; this may be because the number of matched features between wrongly matched audio pairs decreases faster than that of other pairs as the pitch-shifting factor increases.

Table 2: Identification rates for differently pitch-shifted excerpts

    Pitch shifted [%]    -50  -25  0    +50  +100
    ID rate [%]          92   97   100  98   100

In addition to TSM and pitch shifting, audio signals in the real world may suffer from other signal operations. Table 3 illustrates the robustness of the proposed method with regard to several typical signal distortions, including noise contamination, MPEG compression, equalization and echo addition. As shown in the table, our system works well in these cases, providing high identification rates.

Table 3: Identification rates for different audio distortions

    Type of Distortion                 ID rate [%]
    Background Noise (SNR = 18 dB)     94
    MPEG @ 32 kbit/s                   98
    Equalization (Bass Boost)          100
    Echo (-6 dB, 500 ms delay)         99

One thing to mention is that all the 10s test excerpts used in the experiment are cut randomly from the original 60s audio pieces with unknown time offsets. Thus all the distortions applied to the excerpts are combinations of cropping and the above-mentioned operations.

4. CONCLUSIONS

In this paper, we propose a novel audio fingerprinting algorithm based on computer vision. By employing robust local image descriptors, our system achieves promising identification rates under various audio distortions, including the challenging TSM and pitch shifting. Nevertheless, the proposed method still has some shortcomings. We extract hundreds of features from each audio piece, of which only a part are successfully matched. What's more, we directly employ exhaustive matching in the audio matching stage without any speed-up method; this proved to be the biggest bottleneck of our system in the experiments.

5. ACKNOWLEDGMENTS

This work is jointly supported by NSFC (60873255), the 973 Program (2010CB327906), and the National High Technology Research and Development Program of China (2009AA01A346).

6. REFERENCES

[1] S. Baluja and M. Covell. Waveprint: efficient wavelet-based audio fingerprinting. Pattern Recognition, 41(11):3467–3480, 2008.
[2] R. Bardeli and F. Kurth. Robust identification of time-scaled audio. In AES 25th International Conference on Metadata for Audio, 2004.
[3] P. Cano, E. Batlle, T. Kalker, and J. Haitsma. A review of audio fingerprinting. The Journal of VLSI Signal Processing, 41(3):271–284, 2005.
[4] J. Haitsma and T. Kalker. A highly robust audio fingerprinting system. In International Symposium on Music Information Retrieval, pages 107–115, 2002.
[5] Y. Ke, D. Hoiem, and R. Sukthankar. Computer vision for music identification. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 597–604, 2005.
[6] F. Kurth, T. Gehrmann, and M. Müller. The cyclic beat spectrum: tempo related audio features for time-scale invariant audio identification. In International Symposium on Music Information Retrieval, pages 35–40, 2006.
[7] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[8] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10):1615–1630, 2005.


