
Chapter 5

Indexing and Retrieval of Audio

5.1 INTRODUCTION
• Audio is represented as a sequence of samples (except for structured
representations such as MIDI)
• Audio is normally stored in a compressed form
• Human beings have an amazing ability to distinguish different types
of audio; they can instantly tell the
o type of audio (e.g., human voice, music, or noise),
o speed (fast or slow),
o mood (happy, sad, relaxing, etc.),
o and determine its similarity to another piece of audio.
• computer sees a piece of audio as a sequence of sample values
o most common method of accessing audio pieces is based on
their titles or file names
o may be hard to find audio pieces satisfying the particular
requirements of applications
o cannot support query by example.
• content-based audio retrieval techniques
o simplest:
 sample-to-sample comparison
 does not work well (depends on sampling rates and the number
of bits per sample)
o better: based on a set of extracted audio features:
 average amplitude,
 frequency distribution, etc.
• general approach to content-based audio indexing and retrieval:
o Audio is classified into common types of audio (speech,
music, and noise)
o Different audio types processed and indexed in different
ways
 For example, if the audio type is speech, speech
recognition is applied and the speech is indexed based
on recognized words
o Query audio is classified, processed, and indexed
o Audio pieces are retrieved based on similarity between the
query index and the audio index in the database.

• Reasons for classification:
o different audio types require different processing, indexing
and retrieval techniques
o different audio types have different significance to different
applications
o speech recognition techniques/systems are readily available
o the audio type or class information is itself very useful to
some applications
o search space after classification is reduced to a particular
audio class during the retrieval process.
• Audio classification is based on some objective or subjective audio
features
• general approach to speech indexing and retrieval
o apply speech recognition to convert speech to spoken words
o apply traditional IR on the recognized words
• two forms of musical representation: structured and sample-based
• temporal and content relationships between different media types
may help indexing and retrieval of multimedia objects

5.2 MAIN AUDIO PROPERTIES AND FEATURES

• represented in
o time domain (time-amplitude representation)
o frequency domain (frequency-magnitude representation)
• features are derived or extracted from these two representations
• there are other subjective features such as timbre
5.2.1 Features Derived in the Time Domain
• most basic signal representation technique
• a signal is represented as amplitude varying with time
• silence is represented as 0
• signal value can be positive or negative depending on whether the
sound pressure is above or below the equilibrium atmospheric
pressure when there is silence

Figure 5.1 Amplitude-time representation of an audio signal.

• Three features:
o Average energy (E): indicates the loudness of the audio
signal

$E = \frac{1}{N}\sum_{n=0}^{N-1} x(n)^2$

where N is the number of samples and x(n) is the value of sample n.

o Zero-crossing rate: indicates how often the signal amplitude
changes sign; to some extent, it indicates the average signal
frequency.

$ZC = \frac{1}{2N}\sum_{n=1}^{N} |\operatorname{sgn} x(n) - \operatorname{sgn} x(n-1)|$

where sgn x(n) is the sign of x(n) (1 if x(n) is positive and −1
if x(n) is negative).

o Silence ratio: indicates the proportion of the sound piece that
is silent.
 Silence is defined as a period within which the absolute
amplitude values of a certain number of samples are
below a certain threshold.
 calculated as the ratio of the total length of the silent
periods to the total length of the audio piece (a code sketch
for these three features follows).
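
A minimal sketch of these three time-domain features in Python with NumPy,
assuming the audio has already been decoded into an array of samples x; the
frame length and silence threshold are illustrative choices, not values
prescribed by the text.

```python
import numpy as np

def average_energy(x):
    # E = (1/N) * sum of squared sample values
    return np.mean(x.astype(float) ** 2)

def zero_crossing_rate(x):
    # ZC = (1/2N) * sum of |sgn x(n) - sgn x(n-1)|
    signs = np.sign(x)
    return np.sum(np.abs(np.diff(signs))) / (2 * len(x))

def silence_ratio(x, frame_len=441, threshold=0.01):
    # A frame is "silent" if all its absolute amplitudes fall below the
    # threshold; the ratio is total silent duration over total duration.
    n_frames = len(x) // frame_len
    if n_frames == 0:
        return 0.0
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    silent = np.all(np.abs(frames) < threshold, axis=1)
    return float(np.mean(silent))
```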

5.2.2 Features Derived From the Frequency Domain

• Sound spectrum: the frequency domain representation of a signal
• derived from the time domain representation via the Fourier
transform
• indicates the amount of energy at different frequencies

Figure 5.2 The spectrum of the sound signal in Figure 5.1.

• The DFT is given by:

$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-jn\omega_k}$

where $\omega_k = 2\pi k/N$, x(n) is a discrete signal with N samples,
and k is the DFT bin number.
• If the sampling rate of the signal is $f_s$ Hz, then the frequency $f_k$ of
bin k in hertz is given by:

$f_k = \frac{\omega_k}{2\pi} f_s = \frac{k}{N} f_s$
• If x(n) is time-limited to length N, then it can be recovered
completely by taking the inverse discrete Fourier transform (IDFT)
of the N frequency samples as follows:

$x(n) = \frac{1}{N}\sum_{k=0}^{N-1} X(k)\, e^{jn\omega_k}$

• The DFT and IDFT are calculated efficiently using an algorithm
called the fast Fourier transform (FFT).
• In practice, a long signal is broken into blocks called frames (typically 10
to 20 ms) and the DFT is applied to each frame (see the sketch below).
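
A minimal sketch of frame-by-frame spectrum computation using NumPy's FFT,
following the DFT and bin-frequency definitions above; the frame length is
an illustrative assumption.

```python
import numpy as np

def frame_spectra(x, fs, frame_ms=20):
    frame_len = int(fs * frame_ms / 1000)       # e.g. 20-ms frames
    n_frames = len(x) // frame_len
    spectra = []
    for i in range(n_frames):
        frame = x[i * frame_len:(i + 1) * frame_len]
        X = np.fft.fft(frame)                   # X(k) = sum_n x(n) e^{-j n w_k}
        spectra.append(np.abs(X))
    freqs = np.arange(frame_len) * fs / frame_len   # f_k = (k/N) * fs
    return np.array(spectra), freqs
```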

• Features extracted from the frequency domain:
o Bandwidth: indicates the frequency range of a sound.
 Music normally has a higher bandwidth than speech
signals.
 bandwidth = highest frequency - lowest frequency of
the nonzero spectrum components.
 In some cases “nonzero” is defined as at least 3 dB
above the silence level.

o Energy distribution: describes how the signal energy is distributed
across the frequency components.
 useful for audio classification (music has more high-frequency
components than speech).
 Example:
• frequencies below 7 kHz belong to the low band
and the others belong to the high band.
• the total energy of each band is calculated as the sum
of the power of the samples within the band.
 the spectral centroid (brightness) is the midpoint of the
spectral energy distribution of a sound; speech has a lower
centroid than music.
o Harmonicity:
 In a harmonic sound, the spectral components are mostly
whole-number multiples of the lowest, and usually loudest,
frequency (the fundamental frequency).
 Music is normally more harmonic than other sounds.
 harmonicity is determined by checking whether the
frequencies of the dominant components are multiples
of the fundamental frequency.
 Example: the sound spectrum of a flute playing the note
G4 has components at 400 Hz, 800 Hz, 1,200 Hz, 1,600 Hz, and so on
(f = 400 Hz is the fundamental frequency).
o Pitch:
 Pitch is a subjective feature (related to but not
equivalent to the fundamental frequency)
in practice, we use the fundamental frequency as an
approximation of the pitch (a sketch computing some of these
features follows).
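
A rough sketch of some of the frequency-domain features above, assuming a
magnitude spectrum mag and its bin frequencies freqs are available (for
example, from the framing sketch earlier); the peak-relative threshold and
the 7-kHz cutoff follow the examples in the text and are assumptions.

```python
import numpy as np

def bandwidth(mag, freqs, threshold_ratio=0.001):
    # Treat components below a small fraction of the peak as "zero";
    # the 3-dB-above-silence rule in the text is one possible alternative.
    nonzero = freqs[mag > threshold_ratio * mag.max()]
    return float(nonzero.max() - nonzero.min()) if nonzero.size else 0.0

def spectral_centroid(mag, freqs):
    # "Brightness": power-weighted midpoint of the spectral energy distribution.
    power = mag ** 2
    return float(np.sum(freqs * power) / np.sum(power))

def band_energy(mag, freqs, cutoff_hz=7000.0):
    # Energy distribution: total power below vs. above the cutoff (7 kHz example).
    power = mag ** 2
    return float(power[freqs < cutoff_hz].sum()), float(power[freqs >= cutoff_hz].sum())
```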

5.2.3 Spectrogram
• shows the relation between three variables: frequency content,
time, and intensity.
• frequency content is shown along the vertical axis,
• time along the horizontal one.
• intensity, or power, of the signal is indicated by a gray scale
(darkest shows greatest amplitude/power)
• spectrogram clearly illustrates the relationships among time,
frequency, and amplitude

Figure 5.3 Spectrogram of the sound signal of Figure 5.1.
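
A minimal sketch of computing and displaying a spectrogram with SciPy and
Matplotlib; the window length, overlap, and gray-scale mapping (darker =
more power) are illustrative choices.

```python
import numpy as np
from scipy.signal import spectrogram
import matplotlib.pyplot as plt

def plot_spectrogram(x, fs):
    # Frequency content (vertical), time (horizontal), power as gray level.
    f, t, Sxx = spectrogram(x, fs=fs, nperseg=256, noverlap=128)
    plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12), cmap="gray_r")
    plt.xlabel("Time (s)")
    plt.ylabel("Frequency (Hz)")
    plt.show()
```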


5.2.4 Subjective Features
• Timbre is an example of a subjective feature
• it relates to the quality of a sound
• It encompasses all the distinctive qualities of a sound other than its
pitch, loudness, and duration.
• Salient components of timbre include the amplitude envelope,
harmonicity, and spectral envelope.

5.3 AUDIO CLASSIFICATION

5.3.1 Main Characteristics of Different Types of Sound

Speech
• low bandwidth compared to music (100 to 7,000 Hz).
• spectral centroids (also called brightness) of speech signals are
usually lower than those of music.
• because of the frequent pauses in speech, speech signals normally have a
higher silence ratio than music.
• speech is a succession of syllables composed of short periods of friction
(caused by consonants) followed by longer periods for vowels,
• as a result, speech has higher variability in ZCR (zero-crossing rate).

Music
• has a high frequency range (bandwidth), from 16 to 20,000 Hz.
• spectral centroid is higher than that of speech.
• music has a lower silence ratio (except perhaps music produced by
a solo instrument, or singing without accompanying music).
• music has lower variability in ZCR
• has regular beats that can be extracted to differentiate it from
speech

Table 5.1 Main Characteristics of Speech and Music

Features             Speech          Music
Bandwidth            0-7 kHz         0-20 kHz
Spectral centroid    Low             High
Silence ratio        High            Low
Zero-crossing rate   More variable   Less variable
Regular beat         None            Yes

5.3.2 Audio Classification Frameworks


• based on calculated feature values
• first group of methods: each feature is used individually in
different classification steps
• second group: a set of features is used together as a vector to
calculate the closeness of the input to the training sets

Step-by-Step Classification
• See figure 5.4
• Examples:
o one reported method discriminated broadcast speech and music and
achieved a classification rate of 90%
o another used the silence ratio to classify audio into music and speech
with an average success rate of 82%.
[Figure 5.4 is a decision flowchart: the audio input is first tested for a
high spectral centroid (yes → music; no → speech plus music); the remainder
is tested for a high silence ratio (no → music; yes → speech plus solo
music); that remainder is tested for high ZCR variability (no → solo music;
yes → speech).]

Figure 5.4 A possible audio classification process.
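
A minimal sketch of the step-by-step cascade in Figure 5.4; the threshold
values are hypothetical and would normally be tuned on training data.

```python
def classify_step_by_step(centroid, silence_ratio, zcr_variability,
                          centroid_thresh=2500.0, silence_thresh=0.2,
                          zcr_var_thresh=0.05):
    if centroid > centroid_thresh:
        return "music"
    # remaining input: speech plus music with a low centroid
    if silence_ratio <= silence_thresh:
        return "music"
    # remaining input: speech plus solo music
    if zcr_variability <= zcr_var_thresh:
        return "solo music"
    return "speech"
```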

Feature-Vector-Based Audio Classification


• a set of features is calculated and used as a feature vector
• Training stage: average feature vector (reference vector) is found
for each class of audio.
• Classification: feature vector of an input is calculated and the
vector distances between the input feature vector and each of the
reference vectors are calculated
• the input is classified into the class whose reference vector has the
smallest distance to the input feature vector (e.g., Euclidean distance);
see the sketch below.
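
A minimal sketch of feature-vector-based classification, assuming the
training feature vectors are available as a NumPy array with one row per
example (the names training_vectors and labels are hypothetical); reference
vectors are the per-class means, and the input is assigned to the nearest
reference by Euclidean distance.

```python
import numpy as np

def train_reference_vectors(training_vectors, labels):
    # Training stage: reference vector = mean feature vector of each class.
    refs = {}
    for c in set(labels):
        rows = training_vectors[np.array(labels) == c]
        refs[c] = rows.mean(axis=0)
    return refs

def classify(feature_vector, refs):
    # Classification: class whose reference vector is closest to the input.
    return min(refs, key=lambda c: np.linalg.norm(feature_vector - refs[c]))
```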

5.4 SPEECH RECOGNITION AND RETRIEVAL


• basic approach to speech indexing and retrieval:
o convert speech signals into text and
o apply IR techniques for indexing and retrieval
• Other information to enhance speech indexing and retrieval
o speaker’s identity
o mood of speaker

5.4.1 Speech Recognition

• automatic speech recognition (ASR) problem is a pattern matching


problem
• ASR system is trained to collect models or feature vectors for all
possible speech units
• speech units can be phonemes, words, or phrases; the smallest unit is the phoneme
• recognition process:
o extract feature vectors of input speech unit
o compare them with the feature vectors collected during the training process
o speech unit whose feature vector is closest to that of the input
speech unit is deemed to be the unit spoken.
• Three classes of practical ASR techniques are described below:
o dynamic time warping,
o hidden Markov models (HMMs),
o artificial neural network (ANN) models.
• Techniques based on HMMs are most popular and produce the
highest speech recognition performance.

5.4.1.1 Basic Concepts of ASR

• ASR system operates in two stages: training and pattern matching


• Training stage: the features of each speech unit are extracted and stored
in the system.
• Recognition process:
o features of an input speech unit are extracted and compared
with each of the stored features
o speech unit with the best matching features is taken as the
recognized unit (may use a phoneme as a speech unit).
• recognition is complicated by the following factors:
o A phoneme spoken by different speakers or by the same
speaker at different times produces different features in terms
of duration, amplitude, and frequency components.
o background or environmental noise.
o continuous speech is difficult to separate into individual phonemes
because different phonemes have different durations.
o Phonemes vary with their location in a word. The frequency
components of a vowel’s pronunciation are heavily
influenced by the surrounding consonants.

• See Figure 5.5 for a general model of ASR systems


o typical frame size is 10 ms.
o mel-frequency cepstral coefficients (MFCCs) are the most commonly used
feature vectors. MFCCs are obtained by the following process (a code
sketch follows Figure 5.5):
 The spectrum of the speech signal is warped to a scale,
called the mel-scale, that represents how a human ear
hears sound.
 The logarithm of the warped spectrum is taken.
 An inverse Fourier transform of the result of step 2 is
taken to produce what is called the cepstrum.
[Figure 5.5 is a block diagram: training speech passes through preprocessing
and feature extraction to produce feature vectors, which, together with the
corresponding words of the training speech, are used for phonetic modeling;
the resulting phoneme models, dictionary, and grammar are then used in the
recognition process, in which the speech input is preprocessed, feature
vectors are extracted, and a search-and-matching stage produces the output
word sequence.]

Figure 5.5 A general ASR system (after [13]).
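
A rough sketch of MFCC extraction as listed above, using the librosa library
as one possible choice (the text does not prescribe a library); the number of
coefficients and the frame length are illustrative.

```python
import librosa

def mfcc_features(path, n_mfcc=13, frame_ms=10):
    y, sr = librosa.load(path, sr=None)
    hop = int(sr * frame_ms / 1000)   # roughly 10-ms frames, as in the text
    # librosa internally warps the spectrum to the mel scale, takes the log,
    # and applies an inverse (DCT-based) transform to produce the cepstrum.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop)
```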

5.4.1.2 Techniques Based on Dynamic Time Warping

• Each speech frame is represented by a feature vector.


• Recognition process:
o distances between the input feature vectors and those in the
recognition database are computed as the sum of frame-to-frame
differences between feature vectors.
o The best match is the one with the smallest distance.
o Weakness: this may not work well because of nonlinear variations in the
timing of speech:
 made by different speakers
 made at different times by the same speaker
o For example, the same word spoken by different people will
take a different amount of time. Therefore, we cannot directly
calculate frame to frame differences.
• Dynamic time warping normalizes or scales speech duration to
minimize the sum of distances between feature vectors that most
likely match best, see Figure 5.6
• Figure 5.6(a): before time warping
o reference speech and the test speech are the same, but the two
speeches have different time durations
• Figure 5.6(b): After time warping
o reference speech and test speech have the same length and
their distance can be calculated by summing the frame to
frame or sample to sample differences

Feature
Amplitude Reference speech

Test speech

Time

Feature
Amplitude

Time

Figure 5.6 A Dynamic time warping example: (a) before time warping
(b) after time warping
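
A minimal sketch of classic dynamic time warping between two sequences of
frame feature vectors; this unconstrained version is for illustration only
and omits the path constraints and weights used in practical recognizers.

```python
import numpy as np

def dtw_distance(ref, test):
    # ref: (n, d) reference frames; test: (m, d) test frames
    n, m = len(ref), len(test)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(ref[i - 1] - test[j - 1])  # frame-to-frame distance
            D[i, j] = cost + min(D[i - 1, j],      # stretch test
                                 D[i, j - 1],      # stretch reference
                                 D[i - 1, j - 1])  # match
    return D[n, m]
```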

5.4.1.3 Techniques Based on Hidden Markov Models

o HMM-based techniques are the most widely used and produce the best recognition performance


o Phonemes are fundamental units of meaningful sound in speech
o Each phoneme is distinct from all the others, but its realization is not
fixed.
o When one phoneme is voiced, it can be identified as similar
to its previous occurrences, although not exactly the same.
o a phoneme’s sound is modified by its neighbors’ sounds.
o The challenge of speech recognition is how to model these
variations mathematically.
o HMM consists of a number of states, linked by a number of
possible transitions (Figure 5.7).
o Associated with each state is a set of symbols, each with a certain
occurrence probability.
o When a state is entered, a symbol is generated. Which symbol to
be generated at each state is determined by the occurrence
probabilities.
o In Figure 5.7,
o the HMM has three states
o at each state, one of four possible symbols, x1, x2, x3, and x4,
is generated with different probabilities, as shown by b1(x),
b2(x), b3(x), and b4(x).
o the transition probabilities are shown as a11, a12, and so forth.

[Figure 5.7 shows a three-state HMM with its possible transitions.]

Figure 5.7 An example of an HMM.


o not possible to identify a unique sequence of states given a
sequence of output symbols.
o Every sequence of states that has the same length as the output
symbol sequence is possible, each with a different probability.
o The sequence of states is “hidden” from the observer who sees
only the output symbol sequence (Thus called the hidden
Markov model)
o Although it is not possible to identify the unique sequence of
state for a given sequence of output symbols, it is possible to
determine which sequence of state is most likely to generate the
sequence of symbols, based on state transition and symbol
generating probabilities.
o HMMs in speech recognition
o Each phoneme is divided into three audible states:
 an introductory state,
 a middle state,
 an exiting state.
o Each state can last for more than one frame (normally each
frame is 10 ms)
o During the training stage,
 training speech data is used to construct HMMs for
each of the possible phonemes.
 Each HMM has the above three states and is defined
by state transition probabilities and symbol
generating probabilities. Symbols are feature vectors
calculated for each frame.
 Some transitions are not allowed as time flows
forward only.
• For example, transitions from 2 to 1, 3 to 2 and
3 to 1 are not allowed if the HMM in Figure
5.7 is used as a phoneme model.
• Transitions from a state to itself are allowed
and serve to model time variability of speech.
 At the end of the training stage, each phoneme is
represented by one HMM capturing the variations of
feature vectors in different frames.
 These variations are caused by different speakers,
time variations, and surrounding sounds.
o During speech recognition
 feature vectors for each input phoneme are
calculated frame by frame.
 Obtain which phoneme HMM is most likely to
generate the sequence of feature vectors of the input
phoneme.
 The corresponding phoneme of the HMM is deemed
as the input phoneme.
 As a word has a number of phonemes, a sequence of
phonemes are normally recognized together.
 There are a number of algorithms to compute the
probability that an HMM generates a given sequence
of feature vectors.
• forward algorithm: used for recognizing isolated words
• Viterbi algorithm: used for recognizing continuous speech
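
A minimal sketch of Viterbi scoring for one discrete-observation HMM; the
names A, B, pi, obs, and models are hypothetical, and a real recognizer would
use continuous observation densities over MFCC vectors rather than discrete
symbols.

```python
import numpy as np

def viterbi_score(A, B, pi, obs):
    # A[i, j]: transition probability i -> j (zero for backward transitions)
    # B[i, k]: probability of symbol k in state i; pi[i]: initial probability
    delta = pi * B[:, obs[0]]
    for o in obs[1:]:
        # best predecessor for each state, then emit the next symbol
        delta = np.max(delta[:, None] * A, axis=0) * B[:, o]
    return delta.max()   # likelihood of the best state sequence

# The phoneme whose HMM scores highest would be taken as the one spoken, e.g.:
# best = max(models, key=lambda m: viterbi_score(m.A, m.B, m.pi, obs))
```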

5.4.1.4 Techniques Based on Artificial Neural Networks

• ANNs widely used for pattern recognition


• ANN consists of many neurons interconnected by links with
weights
• Speech recognition with ANNs consists of two stages:
o training stage:
 feature vectors of training speech data are used to train
the ANN (adjust weights on different links).
o recognition stage:
 the ANN identifies the most likely phoneme based on the
input feature vectors
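
A minimal sketch of the two ANN stages, using scikit-learn's MLPClassifier as
one possible network (an assumption of this example); the hidden-layer size
and iteration count are illustrative.

```python
from sklearn.neural_network import MLPClassifier

def train_ann(train_vectors, train_labels):
    # Training stage: adjust the link weights from labeled feature vectors.
    ann = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
    ann.fit(train_vectors, train_labels)
    return ann

def recognize(ann, feature_vector):
    # Recognition stage: most likely phoneme for the input feature vector.
    return ann.predict([feature_vector])[0]
```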

5.4.1.5 Speech Recognition Performance

• measured by recognition error rate


• performance is affected by the following factors:
1. Subject matter: varies from a set of digits to a newspaper article to
general news.
2. Type of speech: read speech or spontaneous conversation.
3. Size of vocabulary: from dozens of words to tens of thousands of words.

Table 5.2
Error Rates for High-Performance Speech Recognition (Based on HMMs)

Subject Matter          Type                        Vocabulary       Word Error
                                                    (No. of Words)   Rate (%)
Connected digits        Read                        10               <0.3
Airline travel system   Spontaneous                 2,500            2
Wall Street Journal     Read                        64,000           7
Broadcast news          Read/spontaneous (mixed)    64,000           30
General phone call      Conversation                10,000           50

5.4.2 Speaker Identification

• speaker identification or voice recognition attempts to find the


identity of the speaker or to extract information about an individual
from his/her speech
• very useful to multimedia information retrieval
• can determine speaker’s information
o the number of speakers in a particular setting
o gender of speaker,
o adult or child,
o speaker’s mood,
o emotional state and attitude, etc.
• speaker’s information, together with the speech content
significantly improves information retrieval performance.
• Speech recognition, if it is to be speaker-independent, must
purposefully ignore any idiosyncratic speech characteristics of the
speaker and focus on those parts of the speech signal richest in
linguistic information. In contrast, voice recognition must amplify
those idiosyncratic speech characteristics that individualize a
person and suppress linguistic characteristics that have no bearing
on the recognition of the individual speaker.

5.5 MUSIC INDEXING AND RETRIEVAL

• research and development of effective techniques for music


indexing and retrieval is still at an early stage.
• there are two types of music:
o structured or synthetic
o sample based music

5.5.1 Indexing and Retrieval of Structured Music and Sound Effects

• represented by a set of commands or algorithms


• MIDI: represents music as a number of notes and control commands
• MPEG-4 Structured Audio: represents sound in algorithms and
control languages
• structured sound standards and formats are developed for sound
transmission, synthesis, and production.
• the structure and note descriptions in these formats make the retrieval
process easier (no need for feature extraction)
• an exact match of the sequence of notes does not guarantee the same
sound, because different devices render the notes differently
• finding similar music or sound effects is complicated even with
structured music and sound effects
• the main problem is that it is hard to define similarity between two
sequences of notes
• One possibility is to retrieve music based on the pitch changes of a
sequence of notes
o each note (except for the first one) in the query and in the
database sound files is converted into pitch change relative to
its previous note.
o The three possible values for the pitch change are U(up),
D(down), and S(same or similar).
o a sequence of notes is thus characterized as a sequence of
symbols.
o the retrieval task becomes a string-matching process (a sketch of the
conversion follows).
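
A minimal sketch of the note-to-symbol conversion described above, assuming
notes are given as numeric pitches (e.g., MIDI note numbers) and that
"similar" means within a small tolerance.

```python
def pitch_change_string(pitches, tolerance=0.5):
    # Convert each note (except the first) into U (up), D (down), or S (similar)
    # relative to the previous note.
    symbols = []
    for prev, cur in zip(pitches, pitches[1:]):
        if abs(cur - prev) <= tolerance:
            symbols.append("S")
        elif cur > prev:
            symbols.append("U")
        else:
            symbols.append("D")
    return "".join(symbols)

# Example: pitch_change_string([60, 62, 62, 59]) -> "USD"
```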

5.5.2 Indexing and Retrieval of Sample-Based Music


• two approaches to indexing and retrieval of sample-based music
• first approach is based on a set of extracted sound features
• second approach is based on pitches of music notes

Music retrieval based on a set of features

• a set of acoustic features is extracted for each sound


• similarity between the query and each of the stored music pieces is
calculated based on the corresponding feature vectors
• This approach can be applied to general sound including music,
speech, and sound effects.
• five features are used
o loudness
o pitch
o brightness
o bandwidth
o harmonicity.
• Each feature is represented statistically by three parameters:
o Mean
o Variance
o autocorrelation.
• The Euclidean distance or the Manhattan distance may be used to compare
the resulting vectors (see the sketch below).
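
A minimal sketch of the statistical representation and the two distance
measures; the lag-1 normalized autocorrelation used here is one reasonable
reading of the "autocorrelation" parameter, not a value fixed by the text.

```python
import numpy as np

def feature_stats(frame_values):
    # Represent one per-frame feature by its mean, variance, and
    # lag-1 normalized autocorrelation.
    x = np.asarray(frame_values, dtype=float)
    mean, var = x.mean(), x.var()
    xc = x - mean
    autocorr = np.dot(xc[:-1], xc[1:]) / (np.dot(xc, xc) + 1e-12)
    return np.array([mean, var, autocorr])

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def manhattan(a, b):
    return float(np.abs(a - b).sum())
```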

Music retrieval based on pitch

• similar to pitch-based retrieval of structured music


• need to extract pitch for each note (called pitch tracking)
• Pitch tracking is a simple form of automatic music transcription
that converts musical sound into a symbolic representation
• basic idea
o Each note of music (including the query) is represented by its
pitch.
o a musical piece or segment is represented as a sequence or
string of pitches.
o retrieval is based on similarity between query and candidate
strings.
• Pitch is normally defined as the fundamental frequency of a sound
• To find the pitch for each note, the input music must first be
segmented into individual notes
• Segmentation of continuous music, especially humming and
singing, is very difficult.
• Therefore, it is normally assumed that music is stored as scores in
the database. The pitch of each note is known.
• The common query input form is humming
• There are two pitch representations
o first method
 each pitch except the first one is represented as pitch
direction (or change) relative to the previous note
 pitch direction is either U(up), D(down), or S (similar)
 each musical piece is represented as a string of three
symbols or characters.
o second method
 represents each note as a value based on a chosen
reference note
 value is assigned from a set of standard pitch values
that is closest to the estimated pitch.
 If we represent each allowed value as a character, each
musical piece or segment is represented as a string of
characters.
• retrieval then amounts to finding a match or similarity between the strings
• humming is not exact, so approximate matching is needed
• one approach is string matching with k mismatches, where k is determined
by the user
• the problem consists of finding all instances of a query string
Q = q1 q2 q3 … qm in a reference string R = r1 r2 r3 … rn such that there are
at most k mismatches (characters that are not the same), as sketched below.
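
A minimal sketch of naive string matching with at most k mismatches; faster
algorithms exist, but this shows the idea.

```python
def k_mismatch_matches(query, reference, k):
    # Return the start positions in the reference where the query matches
    # with at most k mismatching characters (Hamming distance per alignment).
    m = len(query)
    hits = []
    for start in range(len(reference) - m + 1):
        window = reference[start:start + m]
        mismatches = sum(1 for a, b in zip(query, window) if a != b)
        if mismatches <= k:
            hits.append(start)
    return hits

# Example: k_mismatch_matches("UUS", "UDSUUSD", k=1) -> [0, 3]
```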

5.6 MULTIMEDIA INFORMATION INDEXING AND RETRIEVAL USING RELATIONSHIPS
BETWEEN AUDIO AND OTHER MEDIA

So far, we have treated sound independently of other media. In some


applications, sound appears as part of a multimedia document or object.
For example, a movie consists of a sound track and a video track with
fixed temporal relationships between them. Different media in a
multimedia object are interrelated in their contents as well as by time.
We use this interrelation to improve multimedia information indexing
and retrieval in the following two ways.
First, we can use knowledge or understanding about one medium to
understand the contents of other media. We have used text to index and
retrieve speech through speech recognition. We can in turn use audio
classification and speech understanding to help with the indexing and
retrieval of video. Figure 5.8 shows a multimedia object consisting of a
video track and a sound track. The video track has 26 frames. Now we
assume the sound track has been segmented into different sound types.
The first segment is speech and corresponds to video frames 1 through 7.
The second segment is loud music and corresponds to video frames 7
through 18. The final segment is speech again and corresponds to video
frames 19 through 26. We then use the knowledge of the sound track to
do the following on the video track. First, we segment the video track
according to the sound track segment boundaries. In this case, the video
track is likely to have three segments with the boundaries aligned with the
sound track segment boundaries. Second, we apply speech recognition to
sound segments 1 and 3 to understand what was talked about. The
corresponding video track may very likely have similar content. Video
frames may be indexed and retrieved based on the speech content
without any other processing. This is very important because in general
it is difficult to extract video content even with complicated image
processing techniques.

[Figure 5.8 shows a video track of 26 frames aligned with a sound track
divided into three segments: segment 1 (speech) spans frames 1 through 7,
segment 2 (loud music) spans frames 7 through 18, and segment 3 (speech)
spans frames 19 through 26.]

Figure 5.8 An example multimedia object with a video track and a sound track.

The second way to make use of relationships between media for


multimedia retrieval is during the retrieval process. The user can use the
most expressive and simple media to formulate a query, and the system
will retrieve and present relevant information to the user regardless of
media types. For example, a user can issue a query using speech to
describe what information is required and the system may retrieve and
present relevant information in text, audio, video, or their combinations.
Alternatively, the user can use an example image as query and retrieve
information in images, text, audio, and their combinations. This is useful
because there are different levels of difficulty in formulating queries in
different media.
We discuss the indexing and retrieval of composite multimedia
objects further in Chapter 9.

5.7 SUMMARY

This chapter described some common techniques and related issues for
content-based audio indexing and retrieval. The general approach is to
classify audio into some common types such as speech and music, and
then use different techniques to process and retrieve the different types
of audio. Speech indexing and retrieval is relatively easy, by applying IR
techniques on words identified using speech recognition. But speech
recognition performance on general topics without any vocabulary
restriction is still to be improved. For music retrieval, some useful work
has been done based on audio feature vector matching and approximate
pitch matching. However, more work is needed on how music and audio
in general is perceived and on similarity comparison between musical
pieces. It will also be very useful if we can further automatically classify
music into different types such as pop and classical.
The classification and retrieval capability described in this chapter is
potentially important and useful in many areas, such as the press and
music industry, where audio information is used. For example, a user
can hum or play a song and ask the system to find songs similar to what
was hummed or played. A radio presenter can specify the requirements
of a particular occasion and ask the system to provide a selection of
audio pieces meeting these requirements. When a reporter wants to find
a recorded speech, he or she can type in part of the speech to locate the
actual recorded speech. Audio and video are often used together in
situations such as movie and television programs, so audio retrieval
techniques may help locate some specific video clips, and video retrieval
techniques may help locate some audio segments. These relationships
should be exploited to develop integrated multimedia database
management systems.
