Signal Processing For Auditory Systems
ELL788, 21 Sept. 2016
How is sound produced?
• Loudspeakers produce sound as follows:
– The diaphragm of the speaker moves out and in,
pushing air molecules together and apart
– The cycle of this process creates ‘sound waves’:
alternating high- and low-pressure regions that travel
through the air
How can we characterize a sound wave?
Characterizing Sound Waves
Figure: the audibility curve, spanning levels from ‘no sound heard’ below threshold up to the level that produces pain.
Are pitch and loudness all there is for
describing sound perception?
Sound Quality: Timbre
• All other properties of sound except for
loudness and pitch constitute timbre
• Timbre is created partially by the multiple
frequencies that make up complex tones
– Fundamental frequency is the first
harmonic
– Musical tones have additional harmonics
that are multiples of the fundamental
frequency
Sound Quality: Timbre
• Additive synthesis - process of adding harmonics to create
complex sounds
Additive synthesis. (a) Pressure changes for a pure tone with frequency = 440 Hz. (b) The second
harmonic of this tone, with a frequency of 880 Hz. (c) The third harmonic, with a frequency of 1,320 Hz. (d)
The sum of the three harmonics creates the waveform for a complex tone.
Frequency spectrum - display of harmonics of a complex sound
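The additive synthesis in the figure can be sketched numerically. A minimal example; the 1/n harmonic amplitudes are an assumption for illustration, since the figure does not specify them:

```python
import numpy as np

SR = 44100                  # sample rate in samples per second
t = np.arange(441) / SR     # 10 ms of time samples

def harmonic(f, amp=1.0):
    """One pure tone (sine wave) at frequency f."""
    return amp * np.sin(2 * np.pi * f * t)

# Additive synthesis: sum the 440 Hz fundamental and its 2nd and 3rd
# harmonics (880 Hz and 1,320 Hz) to build a complex tone.
fundamental = 440.0
complex_tone = sum(harmonic(n * fundamental, 1.0 / n) for n in (1, 2, 3))
```

Plotting `complex_tone` against `t` reproduces the complex waveform of panel (d) from the sum of panels (a), (b), and (c).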
The Ear
Outer ear - pinna and auditory canal
Pinna helps with sound location
Auditory canal - a tube-like structure about 3 cm long
Protects the tympanic membrane at the end of the canal
Resonant frequency of the canal amplifies frequencies between 2,000 and
5,000 Hz
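The 2,000–5,000 Hz amplification can be checked with a back-of-the-envelope model: treat the canal as a tube closed at the tympanic membrane, i.e. a quarter-wavelength resonator. The speed of sound below is an assumed textbook value:

```python
SPEED_OF_SOUND = 343.0   # m/s in air at roughly 20 °C (assumed)
CANAL_LENGTH = 0.03      # m, the ~3 cm auditory canal from the text

def quarter_wave_resonance(length_m, c=SPEED_OF_SOUND):
    """Fundamental resonance of a tube open at one end, closed at the other."""
    return c / (4.0 * length_m)

print(round(quarter_wave_resonance(CANAL_LENGTH)))  # 2858 Hz, inside 2,000-5,000 Hz
```

The fundamental lands near the low end of the quoted band; higher modes and the pinna contribute the rest of the boost.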
The middle ear. The three bones of the middle ear
(‘ossicles’) transmit the vibrations of the tympanic
membrane to the inner ear.
Figure labels: oval window, round window.
Why have a middle ear at all? Why doesn’t the tympanic membrane connect
directly to the cochlea?
Function of Ossicles
• Outer ear is filled with air
• Inner ear is filled with fluid that is much denser than
air
• Pressure changes in air transmit poorly into the
denser medium
• Ossicles act to amplify the vibration for better
transmission to the fluid
The Inner Ear
• Main structure is the cochlea
– Fluid-filled snail-like
structure set into vibration by
the stapes
– Divided into the scala
vestibuli and scala tympani
by the cochlear partition
– Cochlear partition extends
from the base (stapes end)
to the apex (far end)
– Organ of Corti contained by
the cochlear partition
Why?
Cross-section of the Organ of Corti
Békésy's Place Theory of Hearing
• Frequency of sound is indicated by the place on the organ of
Corti that has the highest firing rate
• Békésy determined this in two ways
– Direct observation of the basilar membrane in a cadaver
– Building a model of the cochlea using the physical properties
of the basilar membrane
Békésy's Place Theory of Hearing - continued
• Physical properties of the basilar membrane
– Base of the membrane (by stapes) is
• 3 to 4 times narrower than at the apex
• 100 times stiffer than at the apex
• Both the model and the direct observation showed that the
vibrating motion of the membrane is a traveling wave
Békésy's Place Theory of Hearing - continued
• Envelope of the traveling wave
– Indicates the point of maximum displacement of the
basilar membrane
– Hair cells at this point are stimulated the most strongly
leading to the nerve fibers firing the most strongly at this
location
– Position of the peak is a function of frequency
Evidence for the Place Theory
• Tonotopic map
– Cochlea shows an orderly map of
frequencies along its length
• Apex responds best to low
frequencies
• Base responds best to high
frequencies
http://www.youtube.com/watch?v=1JE8WduJKV4&feature=related
How does the place theory encode
complex tones?
Response of Basilar Membrane to Complex
Tones
• Fourier analysis -
mathematical process
that separates
complex waveforms
into a number of sine
waves
Research on the response of the basilar
membrane shows the highest response in
auditory nerve fibers with characteristic
frequencies that correspond to the sine-wave
components of complex tones
Thus the cochlea is called a frequency
analyzer
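The cochlea's role as a frequency analyzer can be mimicked with a discrete Fourier transform. A sketch, using an assumed 8 kHz sample rate and the 440/880/1,320 Hz complex tone from earlier:

```python
import numpy as np

SR = 8000                 # sample rate in Hz (assumed for the example)
N = 8000                  # one second of signal -> 1 Hz bin spacing
t = np.arange(N) / SR

# A complex tone: 440 Hz fundamental plus its 2nd and 3rd harmonics.
signal = (np.sin(2 * np.pi * 440 * t)
          + 0.5 * np.sin(2 * np.pi * 880 * t)
          + 0.3 * np.sin(2 * np.pi * 1320 * t))

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(N, d=1.0 / SR)

# The three strongest bins recover the sine-wave components of the tone.
peaks = sorted(float(f) for f in freqs[np.argsort(spectrum)[-3:]])
print(peaks)  # ~ [440.0, 880.0, 1320.0]
```

This is exactly the separation the basilar membrane performs mechanically: each sine-wave component excites its own place along the membrane.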
Pathway from the Cochlea to the Cortex
Auditory Areas in the Cortex
• Hierarchical processing occurs in the cortex
– Neural signals travel through the core, then belt, followed by the
parabelt area
– Simple sounds cause activation in the core area
– Belt and parabelt areas are activated in response to more complex
stimuli made up of many frequencies
Auditory Areas in the Human Cortex
• Brain scans in humans
– Tasks that require pitch recognition activate core area. A
tonotopic map is evident in A1.
http://www.youtube.com/watch?v=WDDfGMuofuw&feature=related
UNIT V – Time Elements in Music
Sound Vibrations, Pure Tones and the Perception of Pitch
Acoustic Vibrations and Pure Tone Sensations
Auditory Coding in the Peripheral Nervous System
Sound Vibrations, Pure Tones and the Perception of Pitch
Introduction
oWe hear a sound when the eardrum is set into a characteristic
type of motion called vibration. This vibration is caused by
small pressure oscillations of the air in the auditory canal
associated with an incoming sound wave.
oThe fundamentals of periodic vibratory motion
oEffects of eardrum vibrations on our sensation of hearing.
Sound Vibrations, Pure tones
Motion and Vibrations:
oChange of position of a given body with respect to some reference body.
o If the moving body is very small with respect to the reference or the
dimensions of the spatial domain covered in its motion, the problem is reduced
to the description of the motion of a point in space.
oSuch a small body is often called a material point or a particle.
o If the body is not small, but all points of the body are confined to move along
straight lines parallel to each other, it again suffices to specify the motion of
just one given point of the body – rectilinear translation.
oThis is a “one-dimensional” case of motion, and the position of the given point
of the body is specified by the distance to a fixed reference point.
Material point moving along a vertical line:
•The reference point on the line is designated with the
letter O.
•The position of a material point P with respect to the
reference point O is indicated by the distance y –
displacement of P with respect to O or the coordinate of P.
•The material point P is in motion with respect to O when
its position y changes with time t.
•Motion can be represented mathematically in two ways:
functional relationships, and a graphic representation.
Representation of a one-dimensional motion graphically:
•Two axes perpendicular to each other, one representing the time t, the
other the coordinate y.
•A motion can be represented by plotting for each instant of time t, the
distance y at which the particle is momentarily located.
•Each point of the ensuing curve, such as S1, tells us that at time t = t1
the particle P is at a distance y1 from O, that is, at position P1.
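The y(t) curve described above can be tabulated directly. A minimal sketch using simple harmonic motion as the example vibration; the amplitude and frequency chosen below are arbitrary:

```python
import math

A = 1.0   # amplitude: maximum displacement of P from the reference point O
f = 2.0   # frequency of the vibration in cycles per second

def y(t):
    """Displacement of the material point P from O at time t."""
    return A * math.sin(2 * math.pi * f * t)

# Discrete version of the (t, y) curve: each pair is a point like S1 = (t1, y1).
curve = [(n / 100.0, y(n / 100.0)) for n in range(101)]
```

Plotting the `curve` pairs with t on one axis and y on the other gives exactly the graphic representation of one-dimensional motion described in the text.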
Fig. 2.8
• The maximum firing rate is associated with the maximum velocity of
the basilar membrane when it is moving toward the scala tympani;
inhibition of the firing rate occurs during motion in the opposite
direction, toward the scala vestibuli.
• The instantaneous position of the basilar membrane has an excitatory or
inhibitory effect, depending on whether the membrane is momentarily
distorted toward the scala tympani, or away from it, respectively.
• Both effects add up to determine the total response.
Figure 2.22 shows a hypothetical time distribution of neural impulses in a nerve fiber of the inner
ear connected to the appropriate resonance region of the basilar membrane, when it is excited
by a low-frequency tone with a trapezoidal vibration pattern.
• Binaural hearing (the natural human ability to hear with two ears)
should involve a process called temporal cross-correlation of the neural signals from both cochleas.
• In this model, an ascending neuron can fire only if it is excited simultaneously by both incoming
fibers. Because a neural signal propagates with a finite velocity along a fiber, simultaneous arrival at
a given ascending neuron ending requires a certain time difference between the original signals in
both cochleas.
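The temporal cross-correlation idea can be sketched numerically: delay one copy of a signal, then find the lag that maximizes the correlation. The sample rate, delay, and noise signal below are all invented for illustration:

```python
import numpy as np

SR = 40000    # samples per second (assumed)
DELAY = 20    # samples of interaural delay, i.e. 0.5 ms (assumed)

rng = np.random.default_rng(0)
left = rng.standard_normal(2000)   # signal reaching the "left cochlea"
right = np.roll(left, DELAY)       # same signal arriving DELAY samples later

# Temporal cross-correlation: the lag with the highest correlation is the
# time difference between the two signals.
corr = np.correlate(right, left, mode="full")
lag = int(np.argmax(corr)) - (len(left) - 1)
print(lag / SR * 1000.0)  # estimated interaural delay in ms (~0.5)
```

This is the computation the coincidence-detector model attributes to ascending neurons: the neuron tuned to the delay that best re-aligns the two cochlear signals fires most strongly.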
Ascending pathway: A nerve pathway that goes upward from the spinal
cord toward the brain carrying sensory information from the body to the
brain. In contrast, descending pathways are nerve pathways that go
down the spinal cord and allow the brain to control movement of the body
below the head.
2. DYNAMICS
All musical aspects relating to the relative loudness (or quietness) of music fall under the general element
of DYNAMICS.
Other basic terms relating to Dynamics are:
• Crescendo: gradually getting LOUDER
• Diminuendo (or decrescendo) : gradually getting QUIETER
• Accent: "punching" or "leaning into" a note harder to temporarily emphasize it.
3. MELODY
Melody is the LINEAR/HORIZONTAL presentation of pitch (the word used to describe the highness or
lowness of a musical sound). Melodies can be described as:
• CONJUNCT (smooth; easy to sing or play)
• DISJUNCT (disjointedly ragged or jumpy; difficult to sing or play).
4. HARMONY
Harmony is the VERTICALIZATION of pitch. Often, harmony is thought of as the art of combining
pitches into chords (several notes played simultaneously as a "block").
Harmony is often described in terms of its relative HARSHNESS:
• DISSONANCE: a harsh-sounding harmonic combination
• CONSONANCE: a smooth-sounding harmonic combination
Other basic terms relating to Harmony are:
• Modality: harmony created out of the ancient Medieval/Renaissance modes.
• Tonality: harmony that focuses on a "home" key center.
• Atonality: modern harmony that AVOIDS any sense of a "home" key center.
5. TONE COLOR (or TIMBRE)
Each musical instrument or voice produces its own characteristic pattern of “overtones,” which
gives it a unique "tone color" or timbre.
Ex. If you play a "C" on the piano and then sing that "C", you and the piano have obviously
produced the same pitch; however, your voice has a different sound quality than the piano.
6. TEXTURE
Texture refers to the number of individual musical lines (melodies) and the relationship these
lines have to one another.
• Monophonic (single-note) texture: Music with only one note sounding at a time
• Homophonic texture: Music with two or more notes sounding at the same time, but generally
featuring a prominent melody in the upper part, supported by a less intricate harmonic
accompaniment underneath.
• Polyphonic texture: Music with two or more independent melodies sounding at the same time.
• Imitative texture: Imitation is a special type of polyphonic texture produced whenever a
musical idea is ECHOED from "voice" to "voice".
Perception of timbre of musical tones
• Sounds may be generally characterized by pitch, loudness, and quality.
• Sound "quality" or "timbre" describes those characteristics of sound which
allow the ear to distinguish sounds which have the same pitch and loudness.
• Timbre is then a general term for the distinguishable characteristics of a
tone.
• Timbre is mainly determined by the harmonic content of a sound and its
dynamic characteristics.
• Psychoacoustic experiments with electronically generated steady complex
tones, of equal pitch and loudness but different spectra and phase
relationships among the harmonics, show that the timbre sensation is
controlled primarily by the power spectrum.
• By dividing the audible frequency range into bands of about one-third octave each and by
measuring the intensity or sound energy flow that for a given complex tone is contained in each
band, it was possible to define quantitative “dissimilarity indices” for the (steady) sounds of
various musical instruments, which correlate well with psychophysically determined timbre
similarity and dissimilarity judgments.
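A sketch of such a dissimilarity index under stated assumptions: energies are summed in one-third-octave bands starting at a hypothetical 100 Hz, and the index is a plain Euclidean distance between band-energy vectors (the exact band edges and distance measure are not given in the text):

```python
import numpy as np

def third_octave_energies(signal, sr, f_low=100.0, n_bands=18):
    """Sound-energy content of a signal in one-third-octave bands."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    energies, f1 = [], f_low
    for _ in range(n_bands):
        f2 = f1 * 2.0 ** (1.0 / 3.0)      # upper edge: one-third octave up
        band = (freqs >= f1) & (freqs < f2)
        energies.append(float(spectrum[band].sum()))
        f1 = f2
    return np.array(energies)

def dissimilarity(sig_a, sig_b, sr):
    """Euclidean distance between the two band-energy vectors."""
    return float(np.linalg.norm(third_octave_energies(sig_a, sr)
                                - third_octave_energies(sig_b, sr)))
```

Two steady tones with identical spectra give an index of zero; moving sound energy from one band to another increases it, matching the observation that timbre tracks the distribution of energy across critical bands.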
• The timbre sensation is controlled by the absolute distribution of sound energy in fixed critical
bands, not by the intensity values relative to that of the fundamental.
• This is easily verified by listening to a record or magnetic tape played at the wrong speed. This
procedure leaves relative power spectra unchanged, merely shifting all frequencies up or down;
yet a clear change in timbre of all instruments is perceived.
• The static timbre sensation is a “multidimensional” psychological magnitude related not to one
but to a whole set of physical parameters of the original acoustical stimulus—the set of
intensities in all critical bands.
• Tone quality perception is the mechanism by means of which information is extracted from the
auditory signal in such a way as to make it suitable for:
• (1) Storage in the memory with an adequate label of identification, and
• (2) Comparison with previously stored and identified information.
• The first operation involves learning or conditioning.
• A child who learns to recognize a given musical instrument is presented repeatedly with a
melody played on that instrument and told: “This is a clarinet.” His brain extracts suitable
information from the succession of auditory stimuli, labels this information with the
qualification “clarinet,” and stores it in the memory.
• The second operation represents the conditioned response to a learned pattern:
• When the child hears a clarinet play after the learning experience, his brain compares the
information extracted from the incoming signal (i.e., the timbre) with stored cues, and, if a
successful match is found, conveys the response: “a clarinet.”
Pitch Contour Musical Structure
• In music, the pitch contour focuses on the relative change in pitch over
time of a primary sequence of played notes.
• The same contour can be transposed without losing its essential relative
qualities, such as sudden changes in pitch or a pitch that rises or falls
over time.
• Pure tones have a clear pitch, but complex sounds such as speech and
music typically have intense peaks at many different frequencies
• Nevertheless, by establishing a fixed reference point in the frequency
function of a complex sound, and then observing the movement of this
reference point as the function translates, one can generate a meaningful
pitch contour consistent with human experience.
Pitch Contour Musical Structure
• In linguistics, speech synthesis, and music, the pitch contour of a sound is a function or
curve that tracks the perceived pitch of the sound over time.
• Pitch contour may include multiple sounds utilizing many pitches, and can relate the
frequency function at one point in time to the frequency function at a later point.
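The transposition-invariance of a pitch contour can be illustrated by measuring successive pitch changes in semitones. The note frequencies below are standard equal-temperament values, chosen purely for illustration:

```python
import math

def contour(freqs_hz):
    """Relative pitch change, in semitones, between successive notes."""
    return [12.0 * math.log2(f2 / f1) for f1, f2 in zip(freqs_hz, freqs_hz[1:])]

melody = [261.63, 293.66, 329.63]        # C4, D4, E4
transposed = [2.0 * f for f in melody]   # the same melody an octave higher
```

`contour(melody)` and `contour(transposed)` are identical, because multiplying every frequency by the same constant leaves each ratio f2/f1, and hence each interval, unchanged.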
Auditory Streaming
• When human subjects hear a sequence of two alternating pure tones, they
often perceive it in one of two ways: as one integrated sequence (a single
"stream" consisting of the two tones), or as two segregated sequences, one
sequence of low tones perceived separately from another sequence of high
tones (two "streams").
• Perception of this stimulus is thus bistable.
• Auditory streaming is believed to be a manifestation of human ability to
analyze an auditory scene, i.e. to attribute portions of the incoming sound
sequence to distinct sound generating entities.
Auditory Streaming- example
• For example, a bell can be heard as a 'single' sound (integrated), or some people are
able to hear the individual components – they are able to segregate the sound. In
many circumstances the segregated elements can be linked together in time, producing
an auditory stream.
Tonality and Context
• For each new block we calculate the new power average using the recursive formula
P(n) = 𝛼 · P(n−1) + (1 − 𝛼) · x(n)
where P(n) and P(n−1) represent the current and previous recursive average power
calculations respectively, and x(n) is the average power of the newest block.
• 𝛼 is a constant to control the effectiveness of the algorithm, 0 < 𝛼 < 1.
• A greater 𝛼 value will make P(n) less sensitive to new input average values.
• P(n) is a signal that provides information about the power of the input sequence.
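A minimal sketch of this recursive averaging, assuming the standard exponential-moving-average form P(n) = α·P(n−1) + (1−α)·x(n), with x(n) the newest block's average power:

```python
def recursive_power(block_powers, alpha=0.9):
    """Recursive average power: P(n) = alpha * P(n-1) + (1 - alpha) * x(n).

    A larger alpha makes P(n) smoother, i.e. less sensitive to any single
    new block's average power x(n).
    """
    p, out = 0.0, []
    for x in block_powers:
        p = alpha * p + (1.0 - alpha) * x
        out.append(p)
    return out
```

Feeding a constant block power makes P(n) converge to that constant, while short spikes are smoothed away; that is exactly the behavior the α parameter controls.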
• By performing an FFT to this signal, we will be able to find the frequency at which the power is
maximum.
• The size N chosen for the FFT should be a power of 2,
• Example, N=1024 computes 1024 frequency samples for P(n).
• P(n) is not re-calculated every time, but a sliding window is implemented.
• Sliding Window Method: In the sliding window method, a window of specified length, Len,
moves over the data, sample by sample, and the statistic is computed over the data in the
window. The output for each input sample is the statistic over the window of the current
sample and the Len - 1 previous samples.
• When the vector containing P(n) is filled in with values, the next value will be added to this
same vector and the first value in the vector will be discarded.
• Without implementing a sliding window, the calculation of the FFT would be performed
every time 2048 samples are acquired, which would approximately correspond to:
• T_FFT = 2048 / f_P, where f_P is the rate at which new P(n) values are produced
• T_FFT ≈ 16 sec
• This would mean that we would only be able to detect variations in the BPM every 16
seconds.
• By using the sliding window approach, the BPM calculated is continuously following the
BPM of the music.
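The sliding-window pipeline above can be sketched end to end. Every numeric choice here is an assumption (a 2048-point FIFO of P(n) values arriving at 128 per second, inferred from the 16-second figure); the FFT is refreshed on demand rather than once per full window:

```python
from collections import deque

import numpy as np

class SlidingBPM:
    """Sliding-window FFT beat-tracking sketch over the P(n) sequence."""

    def __init__(self, n_fft=2048, rate=128.0):
        self.n_fft = n_fft
        self.rate = rate                    # P(n) values per second (assumed)
        self.window = deque(maxlen=n_fft)   # FIFO: the oldest value drops out

    def push(self, p):
        self.window.append(p)

    def bpm(self):
        if len(self.window) < self.n_fft:
            return None                     # window not yet filled
        spectrum = np.abs(np.fft.rfft(np.array(self.window)))
        spectrum[0] = 0.0                   # ignore the DC component
        k = int(np.argmax(spectrum))        # bin with the highest power
        return 60.0 * k * self.rate / self.n_fft
```

Because each `push` only displaces the oldest FIFO entry, the BPM estimate can be refreshed continuously instead of once every 16 seconds; a P(n) sequence pulsing at 2 Hz yields 120 BPM.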
Frequency domain algorithm
• The P(n) data stored in the FIFO is used to perform an FFT calculation. At the
beginning, the FIFO is empty.
• After more data is stored, the FFT spectrum increases in power.
• Based on the theory, we know that the frequency which represents the beat, has the
highest power.
• Therefore, by analyzing the power spectrum of the P(n) sequence and finding the
bin that corresponds to the highest power, we are able to find the BPM of the sound,
which is proportional to the index of that bin.
• A frequency bin is a segment [fl, fh] of the frequency axis that collects the amplitude, magnitude, or
energy from a small range of frequencies, often resulting from a Fourier analysis. The frequency bin
can be derived, for instance, from the sampling frequency and the resolution of the Fourier transform.
• Frequency bins are intervals between samples in the frequency domain. For example, if your sample rate is
100 Hz and your FFT size is 100, then you have 100 points between [0, 100) Hz, and each such small
interval, say 0–1 Hz, is a frequency bin.
• However, we have limited the searching algorithm to a certain bin range, between the
lowest BPM (60) and the highest BPM (200): from bin #16 to bin #52. The correspondence
between the bin number k and the BPM is given by BPM = 60 · k · f_P / N_FFT; with
N_FFT = 2048 and f_P = 128 values per second, bin #16 maps to 60 BPM and bin #52 to 195 BPM.
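A hypothetical sketch of the bin-to-BPM conversion, assuming an N = 2048-point FFT over P(n) values produced at 128 per second (constants inferred from the 16-second window figure, not stated explicitly in the text):

```python
def bin_to_bpm(k, rate=128.0, n_fft=2048):
    """Bin k corresponds to frequency k * rate / n_fft in Hz; 1 Hz = 60 BPM."""
    return 60.0 * k * rate / n_fft

print(bin_to_bpm(16), bin_to_bpm(52))  # 60.0 195.0, the 60-200 BPM search range
```

Under these assumptions the quoted search range of bin #16 to bin #52 falls out directly: those bins bracket 60 to roughly 200 BPM.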
• W is the work delivered between the times t1 and t2. P = W / (t2 − t1) is called the
mechanical power (measured in W, i.e. J/s).
• The concept of power is most important for physics of music. Indeed,
our ear is not interested at all in the total acoustic energy which
reaches the eardrum— rather, it is sensitive to the rate at which this
energy arrives, that is, the acoustic power. This rate determines the
sensation of loudness.
Perception of Loudness
• Primary pitch perception has a logarithmic relation between position x on the
basilar membrane and frequency. This compression is achieved by the
mechanical properties of the cochlea.
• In the loudness detection mechanism, the compression is partly neural,
partly mechanical.
• When a pure sound is present, the primary neurons connected to sensor cells
located in the center of the region of maximum resonance amplitude
increase their firing rate above the spontaneous level.
• Another element contributing to loudness compression is related to the
following: At higher SPL’s (Sound pressure levels) a primary neuron’s
firing rate saturates at a level that is only a few times that of spontaneous
firing.
• Individual auditory nerve fibers with similar characteristic frequencies
have widely differing firing thresholds (the SPL needed to increase the
firing rate above spontaneous).
• The stimulus level increase required to achieve saturation, too, may
vary from 20 dB to 40 dB or more for different fibers.
• The study of the individual characteristics of auditory neurons
identified three groups:
1. Fibers with high spontaneous firing rates (greater than 20 impulses
per second) and having the lowest thresholds.
2. Fibers with very low spontaneous rates (less than 0.5/s) and high
threshold values spread over a wide range (up to 50–60 dB).
3. A group with intermediate spontaneous rates and thresholds.
• Each inner hair cell receives fibers from all three groups.
• This allows an individual sensor cell to deliver its output over a
wide dynamic range.
• Thus, sound intensity is encoded both in firing rate and the type of the
acoustic nerve fiber carrying the information.
• An increase in intensity leads to an increase of the total number of
transmitted impulses—either because the firing rate of each neuron
has increased or because the total number of activated neurons has
increased.