Audio Features
5.1 INTRODUCTION
• represented as a sequence of samples (except for structured
representations such as MIDI)
• normally stored in a compressed form
• Human beings have an amazing ability to distinguish different types
of audio; we can instantly tell
o the type of audio (e.g., human voice, music, or noise),
o the speed (fast or slow),
o the mood (happy, sad, relaxing, etc.),
o and can determine its similarity to another piece of audio.
• a computer sees a piece of audio as a sequence of sample values
o the most common method of accessing audio pieces is therefore by
their titles or file names
o this makes it hard to find audio pieces satisfying the particular
requirements of an application
o and it cannot support query by example.
• content-based audio retrieval techniques
o simplest:
sample-by-sample comparison
does not work well (pieces may differ in sampling rate and
in the number of bits per sample)
o more robust: comparison based on a set of extracted audio features:
average amplitude,
frequency distribution, etc.
• general approach to content-based audio indexing and retrieval:
o Audio is classified into common types of audio (speech,
music, and noise)
o Different audio types processed and indexed in different
ways
For example, if the audio type is speech, speech
recognition is applied and the speech is indexed based
on recognized words
o The query audio is classified, processed, and indexed in the
same way
o Audio pieces are retrieved based on the similarity between the
query index and the audio indexes in the database (a runnable
sketch of this pipeline follows).
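The general pipeline above can be summarized in code. The sketch below is a minimal illustration of the control flow only; the classifier and feature extractor are trivial assumed stand-ins so the example runs end to end, not the chapter's actual algorithms.

```python
import numpy as np

def classify_type(clip):
    # Stand-in classifier (assumption): treat a high zero-crossing rate
    # as speech; a real system would use the features of Section 5.2.
    signs = np.sign(clip)
    zcr = np.mean(signs[:-1] != signs[1:])
    return "speech" if zcr > 0.1 else "music"

def extract_features(clip):
    # Stand-in feature vector: average energy and average amplitude.
    return np.array([np.mean(clip ** 2), np.mean(np.abs(clip))])

def build_index(database):
    # Classify each clip, then store (clip, type, feature vector).
    return [(clip, classify_type(clip), extract_features(clip)) for clip in database]

def retrieve(query, index, top_k=5):
    # Process the query the same way, then rank same-type clips by
    # feature-vector distance.
    q_type, q_feat = classify_type(query), extract_features(query)
    candidates = [(float(np.linalg.norm(q_feat - f)), i)
                  for i, (clip, t, f) in enumerate(index) if t == q_type]
    candidates.sort()
    return [index[i][0] for _, i in candidates[:top_k]]
```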
5.2 MAIN AUDIO PROPERTIES AND FEATURES
• audio is represented in
o the time domain (time-amplitude representation)
o the frequency domain (frequency-magnitude representation)
• features are derived or extracted from these two representations
• there are also other subjective features such as timbre
5.2.1 Features Derived in the Time Domain
• most basic signal representation technique
• a signal is represented as amplitude varying with time
• silence is represented as 0
• the signal value can be positive or negative, depending on whether
the sound pressure is above or below the equilibrium atmospheric
pressure (the pressure level during silence)
• Three features (sketched in code below):
o Average energy (E): indicates the loudness of the audio
signal

E = (1/N) Σ_{n=0}^{N−1} x(n)²

where N is the number of samples and x(n) is the value of
sample n.
o Zero-crossing rate (ZCR): how frequently the signal value
changes sign, an indicator of the frequency content
o Silence ratio: the proportion of the piece whose amplitude
values fall below a silence threshold
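A minimal sketch of these three time-domain features, assuming NumPy and a mono signal array; the silence threshold (1% of peak amplitude) is an arbitrary illustrative choice, not a value from the chapter.

```python
import numpy as np

def average_energy(x):
    # E = (1/N) * sum_{n=0}^{N-1} x(n)^2: indicates loudness
    return float(np.mean(np.asarray(x, dtype=float) ** 2))

def zero_crossing_rate(x):
    # fraction of adjacent sample pairs whose signs differ
    signs = np.sign(x)
    return float(np.mean(signs[:-1] != signs[1:]))

def silence_ratio(x, threshold=0.01):
    # fraction of samples below a silence threshold; the 1%-of-peak
    # threshold here is an assumed illustrative value
    x = np.asarray(x, dtype=float)
    return float(np.mean(np.abs(x) < threshold * np.max(np.abs(x))))
```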
5.2.2 Sound Spectrum
• the spectrum is the frequency domain representation of a signal
• derived from the time domain representation via the Fourier
transform
• indicates the amount of energy present at different frequencies
• for a signal of N samples, the discrete Fourier transform and its
inverse are

X(k) = Σ_{n=0}^{N−1} x(n) e^{−jnω_k}

x(n) = (1/N) Σ_{k=0}^{N−1} X(k) e^{jnω_k}

where ω_k = 2πk/N and X(k) is the spectral value at frequency bin k.
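In practice the spectrum is computed with the fast Fourier transform (FFT). A minimal NumPy sketch, which also computes the spectral centroid ("brightness") used later in this chapter for speech/music discrimination:

```python
import numpy as np

def magnitude_spectrum(x, sample_rate):
    # |X(k)| for a real-valued signal, via the fast Fourier transform
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
    return freqs, np.abs(X)

def spectral_centroid(x, sample_rate):
    # magnitude-weighted mean frequency ("brightness")
    freqs, mags = magnitude_spectrum(x, sample_rate)
    return float(np.sum(freqs * mags) / np.sum(mags))
```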
5.2.3 Spectrogram
• shows the relation between three variables: frequency content,
time, and intensity.
• frequency content is shown along the vertical axis,
• time along the horizontal one.
• intensity, or power, of the signal is indicated by a gray scale
(darkest shows greatest amplitude/power)
• the spectrogram clearly illustrates the relationships among time,
frequency, and amplitude (see the plotting sketch below)
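A minimal sketch of how such a spectrogram might be produced, assuming SciPy and Matplotlib are available; the input here is a synthetic test tone whose frequency rises over time, and the reversed gray colormap makes darker cells correspond to greater power, as in the description above.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import spectrogram

# synthetic test input: a tone whose frequency rises over two seconds
fs = 8000
t = np.arange(0, 2.0, 1.0 / fs)
x = np.sin(2 * np.pi * (200 + 300 * t) * t)

f, times, Sxx = spectrogram(x, fs)            # power per (frequency, time) cell
plt.pcolormesh(times, f, 10 * np.log10(Sxx + 1e-12), cmap="gray_r")
plt.xlabel("Time (s)")                        # time along the horizontal axis
plt.ylabel("Frequency (Hz)")                  # frequency along the vertical axis
plt.show()                                    # darkest cells = greatest power
```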
Speech
• low bandwidth compared to music (100 to 7,000 Hz).
• spectral centroids (also called brightness) of speech signals are
usually lower than those of music.
• because of frequent pauses, speech signals normally have a higher
silence ratio than music.
• speech is a succession of syllables: short periods of friction
(caused by consonants) followed by longer periods for vowels
• as a result, speech has higher variability in ZCR (zero-crossing
rate).
Music
• has a high frequency range (bandwidth), from 16 to 20,000 Hz.
• spectral centroid is higher than that of speech.
• music has a lower silence ratio (except perhaps music produced by
a solo instrument, or unaccompanied singing).
• music has lower variability in ZCR
• has regular beats that can be extracted to differentiate it from
speech
Step-by-Step Classification
• See figure 5.4
• Examples:
o one reported system discriminated broadcast speech from music
and achieved a classification rate of 90%
o another used the silence ratio to classify audio into music and
speech, with an average success rate of 82%.
[Figure 5.4 Step-by-step audio classification: the input audio is
first tested for a high spectral centroid (yes → music); otherwise for
a high silence ratio (no → music); otherwise for high ZCR variability
(no → solo music, yes → speech).]
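A minimal sketch of this decision logic, reusing the feature helpers sketched earlier in the chapter (spectral_centroid, silence_ratio, zero_crossing_rate). All thresholds below are assumed illustrative values; a real system tunes them on training data.

```python
import numpy as np

CENTROID_HZ = 2000.0   # assumed boundary for a "high" spectral centroid
SILENCE = 0.2          # assumed boundary for a "high" silence ratio
ZCR_VAR = 0.01         # assumed boundary for "high" ZCR variability

def zcr_variability(x, frame_len=400):
    # variance of the zero-crossing rate across short frames
    frames = [x[i:i + frame_len]
              for i in range(0, len(x) - frame_len + 1, frame_len)]
    return float(np.var([zero_crossing_rate(f) for f in frames]))

def classify(x, sample_rate):
    # Figure 5.4, step by step: centroid, then silence ratio, then ZCR
    if spectral_centroid(x, sample_rate) > CENTROID_HZ:
        return "music"
    if silence_ratio(x) <= SILENCE:
        return "music"
    if zcr_variability(x) <= ZCR_VAR:
        return "solo music"
    return "speech"
```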
[Figure 5.6 A dynamic time warping example: (a) before time warping,
(b) after time warping. Each panel plots feature amplitude against time
for a reference speech pattern and a test speech pattern.]
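Dynamic time warping aligns a test pattern with a reference pattern that may be spoken at a different speed. Below is a minimal sketch of the classic dynamic-programming formulation over 1-D feature sequences, an illustration rather than the chapter's exact formulation.

```python
import numpy as np

def dtw_distance(ref, test):
    # DTW between two 1-D feature sequences of possibly different lengths
    n, m = len(ref), len(test)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(ref[i - 1] - test[j - 1])
            # allow stretching, compression, or one-to-one alignment
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

# e.g. dtw_distance([1, 2, 3, 3, 2], [1, 1, 2, 3, 2]) stays small even
# though the two patterns are misaligned in time
```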
Table 5.2
Error Rates for High-Performance Speech Recognition (Based on HMM)

Subject Matter          Type                       Vocabulary (No. of Words)   Word Error Rate (%)
Connected digits        Read                       10                          <0.3
Airline travel system   Spontaneous                2,500                       2
Wall Street Journal     Read                       64,000                      7
Broadcast news          Read/spontaneous (mixed)   64,000                      30
General phone call      Conversation               10,000                      50
5.7 SUMMARY
This chapter described some common techniques and related issues for
content-based audio indexing and retrieval. The general approach is to
classify audio into some common types such as speech and music, and
then use different techniques to process and retrieve the different types
of audio. Speech indexing and retrieval is relatively easy: IR
(information retrieval) techniques are applied to words identified
using speech recognition. But speech recognition performance on general
topics without any vocabulary restriction still needs improvement. For
music retrieval, some useful work
has been done based on audio feature vector matching and approximate
pitch matching. However, more work is needed on how music and audio
in general is perceived and on similarity comparison between musical
pieces. It will also be very useful if we can further automatically classify
music into different types such as pop and classical.
The classification and retrieval capability described in this chapter is
potentially important and useful in many areas, such as the press and
music industry, where audio information is used. For example, a user
can hum or play a song and ask the system to find songs similar to what
was hummed or played. A radio presenter can specify the requirements
of a particular occasion and ask the system to provide a selection of
audio pieces meeting these requirements. When a reporter wants to find
a recorded speech, he or she can type in part of the speech to locate the
actual recorded speech. Audio and video are often used together in
situations such as movies and television programs, so audio retrieval
techniques may help locate some specific video clips, and video retrieval
techniques may help locate some audio segments. These relationships
should be exploited to develop integrated multimedia database
management systems.