
MULTIRATE DSP

Sample rate conversion


Sample rate conversion is the process of converting a (usually digital) signal from one sampling
rate to another, while changing the information carried by the signal as little as possible. When
applied to an image, this process is sometimes called image scaling.

Sample rate conversion is needed because different systems use different sampling rates, for engineering, economic, or historical reasons. The physics of sampling merely sets a minimum sampling rate (an analog signal can be sampled at any rate above twice the highest frequency contained in the signal, see Nyquist frequency), and so other factors determine the actual rates used. For example, different audio systems use rates of 44.1, 48, and 96 kHz. As another example, American television, European television, and movies all use different numbers of frames per second. Users would like to transfer source material between these systems. Just replaying the existing data at the new rate will not normally work: it introduces large changes in pitch (for audio) and movement (for video), and it cannot be done in real time. Hence sample rate conversion is required.

Two basic approaches are:

Convert to analog, then re-sample at the new rate.
Digital signal processing: compute the values of the new samples from the old samples.

Modern systems almost all use the latter, since this method introduces less noise and distortion. Though the calculations needed can be quite complex, they are entirely practical given today's processing power.
A famous example of analog rate conversion was converting the slow-scan TV signals from the Apollo moon missions to the conventional TV rates for the viewers at home. Another historical example, part analog and part digital, is the conversion of movies (shot at 24 frames per second)
to television (roughly 50 or 60 fields[nb 1] per second). To convert a 24 frame/sec movie to 60
field/sec television, for example, alternate movie frames are shown 2 and 3 times, respectively.
For 50 Hz systems such as PAL each frame is shown twice. Since 50 is not exactly 2x24, the
movie will run 50/48 = 4% faster, and the audio pitch will be 4% higher, an effect known as PAL
speed-up. This is often accepted for simplicity, but more complex methods are possible that
preserve the running time and pitch. Every twelfth frame can be repeated 3 times rather than
twice, or digital interpolation (see below) can be used in a video scaler.

Digital sample rate conversion
There are at least two ways to perform digital sample rate conversion:

(a) If the two frequencies are in a fixed ratio, the conversion can be done as follows: let F = the least common multiple of the two sample rates. Generate a signal sampled at F by inserting zeros between the original samples. This also creates spectral images at multiples of the original sample rate. Remove these with a digital low-pass filter, so that only content below half of the output sample rate remains. Then reduce the sample rate by discarding the appropriate samples.

(b) Another approach is to treat the samples as a time series, and create any needed new
points by interpolation. In theory any interpolation method can be used, though linear (for
simplicity) and a truncated sinc function (from theory) are most common.

Although the two approaches seem very different, they are mathematically identical. Picking an interpolation function in the second scheme is equivalent to picking the impulse response of the digital filter in the first scheme. Linear interpolation is equivalent to a triangular impulse response; sinc() is an approximation to a brick-wall filter (it approaches the ideal "brick-wall" filter as the number of points increases).

If the sample rate ratios are known, fixed, and rational, method (a) is better, in theory. The length of the impulse response of the filter in (a) corresponds to the number of points used in the interpolation in (b). In approach (a), a slow precomputation such as the Remez algorithm can be used to compute the "best" response possible given the number of points (best in terms of peak error in various frequency bands, and so on). Note that a truncated sinc() function, though correct in the limit of an infinite number of points, is not the most accurate filter for a finite number of points.

However, method (b) will work in more general cases, where the sample rate ratios are not
rational, or two real time streams must be accommodated, or the sample rates are time varying.

Normally, due to the mathematical operations employed, the output samples of sample rate
conversion are almost always computed to more precision than the output format can hold.
Conversion to the output bit size can be done by simple rounding, or more sophisticated methods
such as dither or noise shaping can be employed.

Example
CDs are sampled at 44.1 kHz, but a Digital Audio Tape (DAT) is usually sampled at 48 kHz. How can material be converted from one sample rate to the other? First, note that 44.1 and 48 are in the ratio 147/160. Therefore, to convert from 44.1 kHz to 48 kHz, the process is (conceptually):

1. Increase the sample rate to 7.056 MHz by inserting 159 zeros between the input samples.
2. Remove the resulting spectral images with a digital low-pass filter.
3. Reduce the sample rate by keeping only every 147th sample.

These steps are described in more detail below.

Less technical explanation

If the original audio signal had been recorded at 7.056 MHz sampling rate, the process would be
simple. Since 7.056 MHz is 160 x 44.1 kHz, and also 147 x 48 kHz, all we would need to do is
take every 160th sample to get a 44.1 kHz sampling rate, and every 147th sample to get a 48 kHz
sampling rate. Taking every Nth sample like this preserves the content provided the information (the audio signal) does not have any content above half the lowest sampling rate used (22.05 kHz in this case).

So now the problem is how to generate the 7.056 MHz sampled signal, given that the original
has only 1/160 of the samples needed. A first thought might be to interpolate between the
existing points, but that turns out to have two problems. First, the frequency response will not be
flat, and second, this will create some higher frequency content. The high frequency content can
(and must) be removed with a digital filter (basically a complicated average over many points)
but the frequency response problem remains.

The somewhat surprising answer is to replace the missing samples with zeros. So if the original audio samples were ..,a,b,c,.., then the 7.056 MHz sequence is ..,a,0,0,...,0,0,b,0,0,...,0,0,c,.., with 159 zeros between each original sample. This too will create extra high frequency content (in fact it is worse in this respect than linear interpolation) but at least the frequency response is flat. The digital filter then removes the unwanted high frequency content. The work of this digital filter is also much easier if zeros are inserted, since the filter is basically an average and almost all of the samples are known to be zero.

So inserting the zeros, then running the digital filter, gives the needed signal - sampled at 7.056 MHz, but with no content above 24 kHz. Then just taking every 147th sample gives the desired output. Which sample to start with does not matter - any set will work as long as they are 147 samples apart.
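
To make the zero-stuffing idea concrete, here is a minimal Python/NumPy sketch that converts a short 44.1 kHz test tone to 48 kHz by inserting 159 zeros between samples, low-pass filtering, and keeping every 147th sample. The test tone, the 513-tap filter length and the amplitude scaling are illustrative choices only, not the 120 dB production design discussed further below.

import numpy as np
from scipy import signal

# Toy input: a 1 kHz tone "sampled" at 44.1 kHz (a real signal would be longer).
fs_in = 44100
t = np.arange(2048) / fs_in
x = np.sin(2 * np.pi * 1000 * t)

up, down = 160, 147                       # 48000 / 44100 = 160 / 147

# 1. Insert (up - 1) zeros between samples: the rate becomes 7.056 MHz.
upsampled = np.zeros(len(x) * up)
upsampled[::up] = x * up                  # scale by 'up' to keep the amplitude

# 2. Low-pass filter at the original Nyquist frequency (22.05 kHz) to remove
#    the spectral images. 513 taps is far short of a production 120 dB design.
h = signal.firwin(513, 0.5 * fs_in, fs=fs_in * up)
filtered = signal.lfilter(h, 1.0, upsampled)

# 3. Keep every 147th sample: the output rate is 48 kHz.
y = filtered[::down]
print(len(x), "samples in ->", len(y), "samples out")
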
Technical explanation
Insert 159 zeros between every input sample. This raises the data rate to 7.056 MHz, the least common multiple of 44.1 and 48 kHz. Since this operation is equivalent to reconstructing with Dirac delta functions, it also creates images of a frequency f at 44.1−f, 44.1+f, 88.2−f, 88.2+f, ... (all in kHz).
Remove the images with a digital filter, leaving a signal containing only 0–20 kHz information, but still sampled at a rate of 7.056 MHz.
Discard 146 of every 147 output samples. It does not hurt to do so since the signal now has no significant content above 24 kHz.

(In practice, of course, there is no reason to compute the values of the samples that will be
discarded, and for the samples you still need to compute, you can take advantage of the fact that
most of the inputs are 0. This is called polyphase decomposition[1], and drastically reduces the
computation effort, without affecting the conversion quality.)
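
For illustration, SciPy's resample_poly performs this kind of polyphase FIR resampling, evaluating the filter only for the samples that are actually kept. It is shown here as one possible realisation of the idea, not as the specific implementation the text refers to.

import numpy as np
from scipy import signal

fs_in = 44100
x = np.random.randn(fs_in)                # one second of arbitrary test audio

# Insert-zeros / filter / discard, implemented in polyphase form internally.
y = signal.resample_poly(x, up=160, down=147)
print(len(x), "->", len(y))               # roughly 48000 samples out
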

This process requires a digital filter (almost always an FIR filter since these can be designed to
have no phase distortion) that is flat to 20 kHz, and down at least x dB at 24 kHz. How big does x
need to be? A first impression might be about 100 dB, since the maximum signal size is roughly ±32767 and the input quantization ±1/2, so the input had a signal-to-broadband-noise ratio of 98
dB at most. However, the noise in the stopband (20 kHz to 3.5 MHz) is all folded into the
passband by the decimation in the third step, so another 22 dB (that's a ratio of 160:1 expressed
in dB) of stopband rejection is required to account for the noise folding. Thus 120 dB rejection
yields a broadband noise roughly equal to the original quantizing noise.

There is no requirement that the resampling in the ratio 160:147 all be done in one step. Using
the same example, we could re-sample the original at a ratio of 10:7, then 8:7, then 2:3 (or do
these in any order that does not reduce the sample rate below the initial or final rates, or use any
other factorization of the ratios). There may be various technical reasons for using a single-step or multi-step process; typically the single-step process involves less total computation but requires more coefficient storage.
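
A brief sketch of the multi-step idea, again using resample_poly: the same 160:147 ratio is applied either in one step or factored into 10:7, 8:7 and 2:3 stages. The default filter designed by resample_poly differs per stage, so this only illustrates the structure, not an optimised multi-stage design.

import numpy as np
from scipy import signal

x = np.random.randn(44100)                # 1 s at 44.1 kHz

# Single-step conversion by 160/147.
y_single = signal.resample_poly(x, 160, 147)

# The same overall ratio factored into smaller stages:
# (10/7) * (8/7) * (2/3) = 160/147, so each stage needs a shorter filter.
y_multi = signal.resample_poly(
    signal.resample_poly(
        signal.resample_poly(x, 10, 7), 8, 7), 2, 3)

print(len(y_single), len(y_multi))        # both close to 48000 samples
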

SPEECH PROCESSING
Speech processing is the study of speech signals and the processing methods of these signals. The signals are usually processed in a digital representation, so speech processing can be regarded as a special case of digital signal processing, applied to speech signals.

It is also closely tied to natural language processing (NLP), as its input can come from, and its output can go to, NLP applications. For example, text-to-speech synthesis may use a syntactic parser on its input text, and speech recognition's output may be used by information extraction techniques.

Speech processing can be divided into the following categories:

Speech recognition, which deals with analysis of the linguistic content of a speech signal.
Speaker recognition, where the aim is to recognize the identity of the speaker.
Enhancement of speech signals, e.g. audio noise reduction.
Speech coding, a specialized form of data compression, which is important in the telecommunication area.
Voice analysis for medical purposes, such as analysis of vocal loading and dysfunction of the vocal cords.
Speech synthesis: the artificial synthesis of speech, which usually means computer-generated speech.
Speech enhancement: enhancing the perceptual quality of a speech signal by removing the destructive effects of noise, limited capacity recording equipment, impairments, etc.

Speech signal processing refers to the acquisition, manipulation, storage, transfer and output of
vocal utterances by a computer. The main applications are the recognition, synthesis and
compression of human speech:

Speech recognition (also called voice recognition) focuses on capturing the human voice
as a digital sound wave and converting it into a computer-readable format.

Speech synthesis is the reverse process of speech recognition. Advances in this area
improve the computer's usability for the visually impaired.

Speech compression is important in the telecommunications area for increasing the amount of information which can be transferred, stored, or heard, for a given set of time and space constraints.

Speech recognition (also known as automatic speech recognition or computer speech recognition) converts spoken words to text. The term "voice recognition" is sometimes used to refer to recognition systems that must be trained to a particular speaker, as is the case for most desktop recognition software. Recognizing the speaker can simplify the task of translating speech.

Speech recognition is a broader solution which refers to technology that can recognize speech without being targeted at a single speaker, such as a call center system that can recognize arbitrary voices.

Speech recognition applications include voice dialing (e.g., "Call home"), call routing (e.g., "I would like to make a collect call"), domotic appliance control, search (e.g., find a podcast where particular words were spoken), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g., a radiology report), speech-to-text processing (e.g., word processors or emails), and aircraft (usually termed Direct Voice Input).

History
The first speech recognizer appeared in 1952 and consisted of a device for the recognition of single spoken digits.[1] Another early device was the IBM Shoebox, exhibited at the 1964 New York World's Fair.

One of the most notable domains for the commercial application of speech recognition in the
United States has been health care and in particular the work of the medical transcriptionist
(MT)[citation needed]. According to industry experts, at its inception, speech recognition (SR) was
sold as a way to completely eliminate transcription rather than make the transcription process
more efficient, hence it was not accepted. It was also the case that SR at that time was often
technically deficient. Additionally, to be used effectively, it required changes to the ways
physicians worked and documented clinical encounters, which many if not all were reluctant to
do. The biggest limitation to speech recognition automating transcription, however, is seen as the
software. The nature of narrative dictation is highly interpretive and often requires judgment that
may be provided by a real human but not yet by an automated system. Another limitation has
been the extensive amount of time required by the user and/or system provider to train the
software.

A distinction in ASR is often made between "artificial syntax systems" which are usually
domain-specific and "natural language processing" which is usually language-specific. Each of
these types of application presents its own particular goals and challenges.

Applications
Health care

In the health care domain, even in the wake of improving speech recognition technologies,
medical transcriptionists (MTs) have not yet become obsolete. The services provided may be
redistributed rather than replaced. Speech recognition is used to enable deaf people to understand
the spoken word via speech to text conversion, which is very helpful.

Speech recognition can be implemented in the front end or the back end of the medical documentation process.

Front-End SR is where the provider dictates into a speech-recognition engine, the recognized words are displayed right after they are spoken, and the dictator is responsible for editing and signing off on the document. It never goes through an MT/editor.

Back-End SR, or Deferred SR, is where the provider dictates into a digital dictation system, the voice is routed through a speech-recognition machine, and the recognized draft document is routed along with the original voice file to the MT/editor, who edits the draft and finalizes the report. Deferred SR is widely used in the industry currently.

Many Electronic Medical Records (EMR) applications can be more effective and may be performed more easily when deployed in conjunction with a speech-recognition engine. Searches, queries, and form filling may all be faster to perform by voice than by using a keyboard.

Military
High-performance fighter aircraft

Substantial efforts have been devoted in the last decade to the test and evaluation of speech
recognition in fighter aircraft. Of particular note are the U.S. program in speech recognition for
the Advanced Fighter Technology Integration (AFTI)/F-16 aircraft (F-16 VISTA), the program
in France on installing speech recognition systems on Mirage aircraft, and programs in the UK
dealing with a variety of aircraft platforms. In these programs, speech recognizers have been
operated successfully in fighter aircraft with applications including: setting radio frequencies,
commanding an autopilot system, setting steer-point coordinates and weapons release
parameters, and controlling flight displays. Generally, only very limited, constrained
vocabularies have been used successfully, and a major effort has been devoted to integration of
the speech recognizer with the avionics system.

Some important conclusions from the work were as follows:

1. Speech recognition has definite potential for reducing pilot workload, but this potential was not
realized consistently.
2. Achievement of very high recognition accuracy (95% or more) was the most critical factor for making the speech recognition system useful; with lower recognition rates, pilots would not use the system.
3. More natural vocabulary and grammar, and shorter training times would be useful, but only if
very high recognition rates could be maintained.

Laboratory research in robust speech recognition for military environments has produced
promising results which, if extendable to the cockpit, should improve the utility of speech
recognition in high-performance aircraft.

Working with Swedish pilots flying in the JAS-39 Gripen cockpit, Englund (2004) found that recognition deteriorated with increasing G-loads. It was also concluded that adaptation greatly improved the results in all cases, and introducing models for breathing was shown to improve recognition scores significantly. Contrary to what might be expected, no effects of the broken English of the speakers were found. It was evident that spontaneous speech caused problems for the recognizer, as could be expected. A restricted vocabulary, and above all, a proper syntax, could thus be expected to improve recognition accuracy substantially.[2]

The Eurofighter Typhoon currently in service with the UK RAF employs a speaker-dependent system, i.e. it requires each pilot to create a template. The system is not used for any safety-critical or weapon-critical tasks, such as weapon release or lowering of the undercarriage, but is used for a wide range of other cockpit functions. Voice commands are confirmed by visual and/or aural feedback. The system is seen as a major design feature in the reduction of pilot workload, and even allows the pilot to assign targets to himself with two simple voice commands or to any of his wingmen with only five commands.[3]

Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech.[1]

Synthesized speech can be created by concatenating pieces of recorded speech that are stored in
a database. Systems differ in the size of the stored speech units; a system that stores phones or
diphones provides the largest output range, but may lack clarity. For specific usage domains, the
storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer
can incorporate a model of the vocal tract and other human voice characteristics to create a
completely "synthetic" voice output.[2]

The quality of a speech synthesizer is judged by its similarity to the human voice and by its
ability to be understood. An intelligible text-to-speech program allows people with visual
impairments or reading disabilities to listen to written works on a home computer. Many
computer operating systems have included speech synthesizers since the early 1980s.


Overview of text processing

[Figure: Overview of a typical TTS system]

A text-to-speech system (or "engine") is composed of two parts: a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called text normalization, pre-processing, or tokenization. The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion.[3] Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end, often referred to as the synthesizer, then converts the symbolic linguistic representation into sound.
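
As a purely illustrative sketch of the front-end stages just described, the toy Python code below normalizes a text string into tokens and maps them to phoneme strings using a tiny hard-coded grapheme-to-phoneme table. The table, the token handling and the phoneme symbols are invented for this example; real front-ends use large lexica, trained models and proper prosody marking.

import re

# Hypothetical, hard-coded grapheme-to-phoneme entries (illustrative only).
G2P = {"dr.": "D AA K T ER", "smith": "S M IH TH", "lives": "L IH V Z",
       "at": "AE T", "42": "F AO R T IY T UW", "main": "M EY N",
       "st.": "S T R IY T"}

def normalize(text):
    # Minimal text normalization: lowercase and split into word-like tokens.
    # Real systems expand numbers, dates, currencies, abbreviations, etc.
    return re.findall(r"[a-z0-9.]+", text.lower())

def to_phonemes(tokens):
    # Dictionary lookup with a crude fallback (spell out unknown tokens).
    return [G2P.get(tok, " ".join(tok.upper())) for tok in tokens]

tokens = normalize("Dr. Smith lives at 42 Main St.")
print(to_phonemes(tokens))    # symbolic representation handed to the back-end
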

History
Long before electronic signal processing was invented, there were those who tried to build machines to create human speech. Some early legends of the existence of "speaking heads" involved Gerbert of Aurillac (d. 1003 AD), Albertus Magnus (1198–1280), and Roger Bacon (1214–1294).

In 1779, the Danish scientist Christian Kratzenstein, working at the Russian Academy of
Sciences, built models of the human vocal tract that could produce the five long vowel sounds
(in International Phonetic Alphabet notation, they are [a], [e], [i], [o] and [u]).[4] This was
followed by the bellows-operated "acoustic-mechanical speech machine" by Wolfgang von
Kempelen of Vienna, Austria, described in a 1791 paper.[5] This machine added models of the
tongue and lips, enabling it to produce consonants as well as vowels. In 1837, Charles
Wheatstone produced a "speaking machine" based on von Kempelen's design, and in 1857, M.
Faber built the "Euphonia". Wheatstone's design was resurrected in 1923 by Paget.[6]

In the 1930s, Bell Labs developed the VOCODER, a keyboard-operated electronic speech
analyzer and synthesizer that was said to be clearly intelligible. Homer Dudley refined this
device into the VODER, which he exhibited at the 1939 New York World's Fair.

The Pattern playback was built by Dr. Franklin S. Cooper and his colleagues at Haskins
Laboratories in the late 1940s and completed in 1950. There were several different versions of
this hardware device but only one currently survives. The machine converts pictures of the
acoustic patterns of speech in the form of a spectrogram back into sound. Using this device,
Alvin Liberman and colleagues were able to discover acoustic cues for the perception of
phonetic segments (consonants and vowels).

Early electronic speech synthesizers sounded robotic and were often barely intelligible. The
quality of synthesized speech has steadily improved, but output from contemporary speech
synthesis systems is still clearly distinguishable from actual human speech.

As the cost-performance ratio causes speech synthesizers to become cheaper and more accessible
to the people, more people will benefit from the use of text-to-speech programs.[7]

Electronic devices
The first computer-based speech synthesis systems were created in the late 1950s, and the first complete text-to-speech system was completed in 1968. In 1961, physicist John Larry Kelly, Jr and colleague Louis Gerstman[8] used an IBM 704 computer to synthesize speech, an event among the most prominent in the history of Bell Labs. Kelly's voice recorder synthesizer (vocoder) recreated the song "Daisy Bell", with musical accompaniment from Max Mathews.

s e
Coincidentally, Arthur C. Clarke was visiting his friend and colleague John Pierce at the Bell Labs Murray Hill facility. Clarke was so impressed by the demonstration that he used it in the climactic scene of his screenplay for his novel 2001: A Space Odyssey,[9] where the HAL 9000 computer sings the same song as it is being put to sleep by astronaut Dave Bowman.[10] Despite the success of purely electronic speech synthesis, research is still being conducted into mechanical speech synthesizers.[11] Text to speech has also been used in a variety of YouTube videos (such as "the secret missing episode of..." videos), which usually feature cartoon characters such as Drew Pickles, Barney the Dinosaur, and Ronald McDonald.

Synthesizer technologies
The most important qualities of a speech synthesis system are naturalness and intelligibility.
Naturalness describes how closely the output sounds like human speech, while intelligibility is
the ease with which the output is understood. The ideal speech synthesizer is both natural and
intelligible. Speech synthesis systems usually try to maximize both characteristics.

The two primary technologies for generating synthetic speech waveforms are concatenative
synthesis and formant synthesis. Each technology has strengths and weaknesses, and the intended
uses of a synthesis system will typically determine which approach is used.

Concatenative synthesis

Concatenative synthesis is based on the concatenation (or stringing together) of segments of recorded speech. Generally, concatenative synthesis produces the most natural-sounding synthesized speech. However, differences between natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes result in audible glitches in the output. There are three main sub-types of concatenative synthesis.

Unit selection synthesis

Unit selection synthesis uses large databases of recorded speech. During database creation, each
recorded utterance is segmented into some or all of the following: individual phones, diphones,
half-phones, syllables, morphemes, words, phrases, and sentences. Typically, the division into
segments is done using a specially modified speech recognizer set to a "forced alignment" mode
with some manual correction afterward, using visual representations such as the waveform and
spectrogram.[12] An index of the units in the speech database is then created based on the
segmentation and acoustic parameters like the fundamental frequency (pitch), duration, position
in the syllable, and neighboring phones. At runtime, the desired target utterance is created by
determining the best chain of candidate units from the database (unit selection). This process is
typically achieved using a specially weighted decision tree.
Unit selection provides the greatest naturalness, because it applies only a small amount of digital signal processing (DSP) to the recorded speech. DSP often makes recorded speech sound less natural, although some systems use a small amount of signal processing at the point of concatenation to smooth the waveform. The output from the best unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system has been tuned. However, maximum naturalness typically requires unit-selection speech databases to be very large, in some systems ranging into the gigabytes of recorded data, representing dozens of hours of speech.[13] Also, unit selection algorithms have been known to select segments from a place that results in less than ideal synthesis (e.g. minor words become unclear) even when a better choice exists in the database.[14]
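
The unit-selection search itself can be pictured as a cheapest-path problem over candidate units. The toy Python sketch below (with invented placeholder cost functions) picks one candidate per target position so that the sum of target costs and concatenation (join) costs is minimised by dynamic programming; production systems use weighted combinations of pitch, duration, spectral distance and context instead of these placeholders.

def select_units(candidates, target_cost, join_cost):
    # candidates: one list of candidate units per target position.
    # best[t][unit] = (lowest total cost ending in 'unit', previous unit).
    best = [{c: (target_cost(0, c), None) for c in candidates[0]}]
    for t in range(1, len(candidates)):
        layer = {}
        for c in candidates[t]:
            prev, score = min(
                ((p, best[t - 1][p][0] + join_cost(p, c)) for p in candidates[t - 1]),
                key=lambda pc: pc[1])
            layer[c] = (score + target_cost(t, c), prev)
        best.append(layer)
    # Trace the cheapest chain of units back from the last position.
    last = min(best[-1], key=lambda c: best[-1][c][0])
    chain = [last]
    for t in range(len(candidates) - 1, 0, -1):
        chain.append(best[t][chain[-1]][1])
    return list(reversed(chain))

# Example with made-up units and placeholder costs:
cands = [["a1", "a2"], ["b1", "b2"], ["c1"]]
tcost = lambda t, u: 0.1 * len(u)                     # placeholder target cost
jcost = lambda u, v: 0.0 if u[-1] == v[-1] else 1.0   # placeholder join cost
print(select_units(cands, tcost, jcost))
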
Diphone synthesis
Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-sound transitions) occurring in a language. The number of diphones depends on the phonotactics of the language: for example, Spanish has about 800 diphones, and German about 2500. In diphone synthesis, only one example of each diphone is contained in the speech database. At runtime, the target prosody of a sentence is superimposed on these minimal units by means of digital signal processing techniques such as linear predictive coding, PSOLA[15] or MBROLA.[16] The quality of the resulting speech is generally worse than that of unit-selection systems, but more natural-sounding than the output of formant synthesizers. Diphone synthesis suffers from the sonic glitches of concatenative synthesis and the robotic-sounding nature of formant synthesis, and has few of the advantages of either approach other than small size. As such, its use in commercial applications is declining, although it continues to be used in research because there are a number of freely available software implementations.

Domain-specific synthesis

Domain-specific synthesis concatenates prerecorded words and phrases to create complete utterances. It is used in applications where the variety of texts the system will output is limited to a particular domain, like transit schedule announcements or weather reports.[17] The technology is very simple to implement, and has been in commercial use for a long time, in devices like talking clocks and calculators. The level of naturalness of these systems can be very high because the variety of sentence types is limited, and they closely match the prosody and intonation of the original recordings.[citation needed]

Because these systems are limited by the words and phrases in their databases, they are not
general-purpose and can only synthesize the combinations of words and phrases with which they
have been preprogrammed. The blending of words within naturally spoken language, however, can still cause problems unless the many variations are taken into account. For example, in non-rhotic dialects of English the "r" in words like "clear" is usually only pronounced when the following word has a vowel as its first letter (e.g. "clear out"). Likewise in French, many final consonants become no longer silent if followed by a word that begins with a vowel, an effect called liaison. This alternation cannot be reproduced by a simple word-concatenation system, which would require additional complexity to be context-sensitive.

Formant synthesis

Formant synthesis does not use human speech samples at runtime. Instead, the synthesized speech output is created using an acoustic model. Parameters such as fundamental frequency, voicing, and noise levels are varied over time to create a waveform of artificial speech. This method is sometimes called rules-based synthesis; however, many concatenative systems also have rules-based components.
Many systems based on formant synthesis technology generate artificial, robotic-sounding speech that would never be mistaken for human speech. However, maximum naturalness is not always the goal of a speech synthesis system, and formant synthesis systems have advantages over concatenative systems. Formant-synthesized speech can be reliably intelligible, even at very high speeds, avoiding the acoustic glitches that commonly plague concatenative systems. High-speed synthesized speech is used by the visually impaired to quickly navigate computers using a screen reader. Formant synthesizers are usually smaller programs than concatenative systems because they do not have a database of speech samples. They can therefore be used in embedded systems, where memory and microprocessor power are especially limited. Because formant-based systems have complete control of all aspects of the output speech, a wide variety of prosodies and intonations can be output, conveying not just questions and statements, but a variety of emotions and tones of voice.

Examples of non-real-time but highly accurate intonation control in formant synthesis include
the work done in the late 1970s for the Texas Instruments toy Speak & Spell, and in the early
1980s Sega arcade machines.[18] Creating proper intonation for these projects was painstaking,
and the results have yet to be matched by real-time text-to-speech interfaces.[19]

Articulatory synthesis

Articulatory synthesis refers to computational techniques for synthesizing speech based on
models of the human vocal tract and the articulation processes occurring there. The first
articulatory synthesizer regularly used for laboratory experiments was developed at Haskins
Laboratories in the mid-1970s by Philip Rubin, Tom Baer, and Paul Mermelstein. This
synthesizer, known as ASY, was based on vocal tract models developed at Bell Laboratories in
the 1960s and 1970s by Paul Mermelstein, Cecil Coker, and colleagues.

Until recently, articulatory synthesis models have not been incorporated into commercial speech
synthesis systems. A notable exception is the NeXT-based system originally developed and
marketed by Trillium Sound Research, a spin-off company of the University of Calgary, where
much of the original research was conducted. Following the demise of the various incarnations
of NeXT (started by Steve Jobs in the late 1980s and merged with Apple Computer in 1997), the
Trillium software was published under the GNU General Public License, with work continuing
as gnuspeech. The system, first marketed in 1994, provides full articulatory-based text-to-speech conversion using a waveguide or transmission-line analog of the human oral and nasal tracts, controlled by Carré's "distinctive region model".


HMM-based synthesis
HMM-based synthesis is a synthesis method based on hidden Markov models, also called Statistical Parametric Synthesis. In this system, the frequency spectrum (vocal tract), fundamental frequency (vocal source), and duration (prosody) of speech are modeled simultaneously by HMMs. Speech waveforms are generated from the HMMs themselves based on the maximum likelihood criterion.[20]

Sinewave synthesis
Sinewave synthesis is a technique for synthesizing speech by replacing the formants (main bands of energy) with pure tone whistles.[21]
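
A minimal sketch of the idea: three pure tones follow hypothetical formant tracks and are summed. The formant values, amplitudes and sample rate are arbitrary illustrative choices, not measurements from real speech.

import numpy as np

fs = 16000
t = np.arange(int(0.5 * fs)) / fs                   # half a second
# Hypothetical formant trajectories (Hz) gliding between two vowel-like states.
f1 = np.linspace(300, 700, t.size)
f2 = np.linspace(2300, 1100, t.size)
f3 = np.full(t.size, 2500.0)

def tone(freqs):
    # Integrate the instantaneous frequency to get a smoothly varying phase.
    phase = 2 * np.pi * np.cumsum(freqs) / fs
    return np.sin(phase)

speech_like = tone(f1) + 0.5 * tone(f2) + 0.25 * tone(f3)
speech_like /= np.max(np.abs(speech_like))          # normalize to [-1, 1]
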
Speech compression may mean different things:


Speech encoding refers to compression for transmission or storage, possibly to an unintelligible state, with decompression used prior to playback.

Time-compressed speech refers to voice compression for immediate playback, without any decompression (so that the final speech sounds faster to the listener).

Speech coding
Speech coding is the application of data compression of digital audio signals containing speech.
Speech coding uses speech-specific parameter estimation using audio signal processing
techniques to model the speech signal, combined with generic data compression algorithms to
represent the resulting modeled parameters in a compact bitstream.

The two most important applications of speech coding are mobile telephony and Voice over IP.

The techniques used in speech coding are similar to those in audio data compression and audio coding, where knowledge of psychoacoustics is used to transmit only data that is relevant to the
human auditory system. For example, in narrowband speech coding, only information in the
frequency band 400 Hz to 3500 Hz is transmitted but the reconstructed signal is still adequate for
intelligibility.

Speech coding differs from other forms of audio coding in that speech is a much simpler signal
than most other audio signals, and that there is a lot more statistical information available about
the properties of speech. As a result, some auditory information which is relevant in audio
coding can be unnecessary in the speech coding context. In speech coding, the most important
criterion is preservation of intelligibility and "pleasantness" of speech, with a constrained amount
of transmitted data.

It should be emphasised that the intelligibility of speech includes, besides the actual literal content, also speaker identity, emotions, intonation, timbre etc. that are all important for perfect intelligibility. The more abstract concept of pleasantness of degraded speech is a different property than intelligibility, since it is possible that degraded speech is completely intelligible, but subjectively annoying to the listener.

In addition, most speech applications require low coding delay, as long coding delays interfere with speech interaction.

Sample companding viewed as a form of speech coding
From this viewpoint, the A-law and μ-law algorithms (G.711) used in traditional PCM digital telephony can be seen as a very early precursor of speech encoding, requiring only 8 bits per sample but giving effectively 12 bits of resolution. Although this would generate unacceptable distortion in a music signal, the peaky nature of speech waveforms, combined with the simple frequency structure of speech as a periodic waveform with a single fundamental frequency and occasional added noise bursts, makes these very simple instantaneous compression algorithms acceptable for speech.
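
As an illustration of the companding idea, the sketch below applies the continuous μ-law formula followed by uniform 8-bit quantization, showing how small amplitudes are coded more finely than large ones. It is not the exact segmented G.711 encoder.

import numpy as np

MU = 255.0

def mu_law_encode(x):                     # x is assumed to lie in [-1, 1]
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.round((y + 1) / 2 * 255).astype(np.uint8)     # 8-bit code word

def mu_law_decode(code):
    y = code.astype(np.float64) / 255 * 2 - 1
    return np.sign(y) * ((1 + MU) ** np.abs(y) - 1) / MU

x = np.linspace(-1, 1, 5)
print(mu_law_decode(mu_law_encode(x)))    # quiet samples come back most accurately
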

A wide variety of other algorithms were tried at the time, mostly variants on delta modulation, but after careful consideration, the A-law/μ-law algorithms were chosen by the designers of the early digital telephony systems. At the time of their design, their 33% bandwidth reduction for a very low complexity made them an excellent engineering compromise. Their audio performance remains acceptable, and there has been no need to replace them in the stationary phone network.

In 2008, the G.711.1 codec, which has a scalable structure, was standardized by ITU-T. Its input sampling rate is 16 kHz.

Modern speech compression
Much of the later work in speech compression was motivated by military research into digital
communications for secure military radios, where very low data rates were required to allow
effective operation in a hostile radio environment. At the same time, far more processing power
was available, in the form of VLSI integrated circuits, than was available for earlier compression
techniques. As a result, modern speech compression algorithms could use far more complex
techniques than were available in the 1960s to achieve far higher compression ratios.

These techniques were available through the open research literature to be used for civilian
applications, allowing the creation of digital mobile phone networks with substantially higher
channel capacities than the analog systems that preceded them.

The most common speech coding scheme is Code Excited Linear Prediction (CELP) coding, which is used for example in the GSM standard. In CELP, the modelling is divided into two stages: a linear predictive stage that models the spectral envelope, and a code-book-based model of the residual of the linear predictive model.

In addition to the actual speech coding of the signal, it is often necessary to use channel coding for transmission, to avoid losses due to transmission errors. Usually, speech coding and channel coding methods have to be chosen in pairs, with the more important bits in the speech data stream protected by more robust channel coding, in order to get the best overall coding results.

The Speex project is an attempt to create a free software speech coder, unencumbered by patent
restrictions.
Major subfields:

Wide-band speech coding
   o AMR-WB for WCDMA networks
   o VMR-WB for CDMA2000 networks
   o G.722, G.722.1, Speex and others for VoIP and videoconferencing
Narrow-band speech coding
   o FNBDT for military applications
   o SMV for CDMA networks
   o Full Rate, Half Rate, EFR, AMR for GSM networks
   o G.723.1, G.726, G.728, G.729, iLBC and others for VoIP or videoconferencing

Methods
Removal of silences. There are normally silences between words and sentences, and
even small silences within certain words. These can be reduced considerably and still
leave an understandable result.

Increasing speed. The speed can be increased on the entire audio track, but this has the
undesirable effect of increasing the frequency, so the voice sounds high-pitched (like
someone who has inhaled helium). This can be compensated for, however, by bringing
the pitch back down to the proper frequency.
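
A toy sketch of the silence-removal method, assuming a simple short-term energy threshold; the frame size and threshold are arbitrary. Pitch-preserving speed-up would additionally require a time-scale modification algorithm, which is not shown here.

import numpy as np

def remove_silences(x, fs, frame_ms=20, threshold=1e-4):
    frame = int(fs * frame_ms / 1000)
    n = len(x) // frame * frame
    frames = x[:n].reshape(-1, frame)
    energy = np.mean(frames ** 2, axis=1)     # short-term energy per frame
    return frames[energy > threshold].reshape(-1)

fs = 8000
speech = 0.3 * np.sin(2 * np.pi * 200 * np.arange(8000) / fs)
x = np.concatenate([np.zeros(4000), speech])  # half a second of silence, then "speech"
print(len(x), "->", len(remove_silences(x, fs)))
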

Advantages
The same number of words can be compressed into a smaller time, and thus reduce advertising
costs, or more information can be included in a given radio or TV ad. Another advantage is that
this method seems to make the ad louder (by increasing the average volume), and thus more
likely to be noticed, without exceeding the maximum volume allowed by law.

Disadvantages

The effect of removing the silences and increasing the speed is to make it sound much more
insistent, possibly to the point of unpleasantness.
Other uses

Teaching and studying.
Aids for the blind and disabled.
Human-computer interfaces (such as voice-mail systems or lists of movies playing at a theatre).
Speech recognition (speeds up or slows down human speech to a speed which can be recognized by the computer).

Other terms
Unfortunately, there are a variety of confusing terms used for this and related technologies:

Time-compressed/accelerated speech (often used in psychological literature)

Compressed speech

Time-scale modified speech (used in signal processing literature)

time-scale modification (TSM)[1]

Sped-up speech

Rate-converted speech

Time-altered speech

Voice compression/speech compression/voice encoding/speech encoding/audio compression (data) (these often refer to compression for transmission or storage, possibly to an unintelligible state, with decompression used during playback)

ADAPTIVE FILTER

An adaptive filter is a filter that self-adjusts its transfer function according to an optimizing
algorithm. Because of the complexity of the optimizing algorithms, most adaptive filters are
digital filters that perform digital signal processing and adapt their performance based on the
input signal. By way of contrast, a non-adaptive filter has static filter coefficients (which
collectively form the transfer function).

For some applications, adaptive coefficients are required since some parameters of the desired processing operation (for instance, the properties of some noise signal) are not known in advance. In these situations it is common to employ an adaptive filter, which uses feedback to refine the values of the filter coefficients and hence its frequency response.

Generally speaking, the adapting process involves the use of a cost function, which is a criterion for optimum performance of the filter (for example, minimizing the noise component of the input), to feed an algorithm, which determines how to modify the filter coefficients to minimize the cost on the next iteration.

As the power of digital signal processors has increased, adaptive filters have become much more common and are now routinely used in devices such as mobile phones and other communication devices, camcorders and digital cameras, and medical monitoring equipment.

Example
Suppose a hospital is recording a heart beat (an ECG), which is being corrupted by a 50 Hz noise
(the frequency coming from the power supply in many countries).

One way to remove the noise is to filter the signal with a notch filter at 50 Hz. However, due to
slight variations in the power supply to the hospital, the exact frequency of the power supply
might (hypothetically) wander between 47 Hz and 53 Hz. A static filter would need to remove all
the frequencies between 47 and 53 Hz, which could excessively degrade the quality of the ECG
since the heart beat would also likely have frequency components in the rejected range.

To circumvent this potential loss of information, an adaptive filter could be used. The adaptive
filter would take input both from the patient and from the power supply directly and would thus
be able to track the actual frequency of the noise as it fluctuates. Such an adaptive technique

generally allows for a filter with a smaller rejection range, which means, in our case, that the quality of the output signal is more accurate for medical diagnoses.

Block diagram
The block diagram, shown in the following figure, serves as a foundation for particular adaptive
filter realisations, such as Least Mean Squares (LMS) and Recursive Least Squares (RLS). The
idea behind the block diagram is that a variable filter extracts an estimate of the desired signal.

[Figure: adaptive filter block diagram]

To start the discussion of the block diagram we make the following assumptions:

The input signal is the sum of a desired signal d(n) and interfering noise v(n):

   x(n) = d(n) + v(n)

The variable filter has a Finite Impulse Response (FIR) structure. For such structures the impulse response is equal to the filter coefficients. The coefficients for a filter of order p are defined as

   w_n = [w_n(0), w_n(1), ..., w_n(p)]^T.

The error signal or cost function is the difference between the desired and the estimated signal:

   e(n) = d(n) − d̂(n).

The variable filter estimates the desired signal by convolving the input signal with the impulse response. In vector notation this is expressed as

   d̂(n) = w_n^T x(n),

where

   x(n) = [x(n), x(n−1), ..., x(n−p)]^T

is an input signal vector. Moreover, the variable filter updates the filter coefficients at every time instant:

   w_{n+1} = w_n + Δw_n,

where Δw_n is a correction factor for the filter coefficients. The adaptive algorithm generates this correction factor based on the input and error signals. LMS and RLS define two different coefficient update algorithms.
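
As a concrete (and deliberately minimal) realisation of this block diagram, the Python sketch below implements the LMS coefficient update; RLS would keep the same structure and change only the update rule. The filter order and step size are illustrative. In the hospital example above, x would be the mains reference, d the corrupted ECG, and the error e(n) the cleaned ECG.

import numpy as np

def lms_filter(x, d, order=8, mu=0.01):
    """x: input signal, d: desired signal; returns (estimate, error, weights)."""
    w = np.zeros(order + 1)                   # filter coefficients w_n
    y = np.zeros(len(x))                      # estimate d_hat(n)
    e = np.zeros(len(x))                      # error signal e(n)
    for n in range(order, len(x)):
        x_vec = x[n - order:n + 1][::-1]      # [x(n), x(n-1), ..., x(n-p)]
        y[n] = w @ x_vec                      # d_hat(n) = w_n^T x(n)
        e[n] = d[n] - y[n]                    # e(n) = d(n) - d_hat(n)
        w = w + 2 * mu * e[n] * x_vec         # coefficient update (delta w_n)
    return y, e, w
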

Applications of adaptive filters

Noise cancellation
Signal prediction
Adaptive feedback cancellation
Echo cancellation

Active noise control

Active noise control (ANC) (also known as noise cancellation, active noise reduction (ANR)
or antinoise) is a method for reducing unwanted sound.

Explanation
Sound is a pressure wave, which consists of a compression phase and a rarefaction phase. A
noise-cancellation speaker emits a sound wave with the same amplitude but with inverted phase
(also known as antiphase) to the original sound. The waves combine to form a new wave, in a
process called interference, and effectively cancel each other out - an effect which is called phase
cancellation. Depending on the circumstances and the method used, the resulting soundwave
may be so faint as to be inaudible to human ears.
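
A tiny numerical illustration of this phase-cancellation idea, with an arbitrary 120 Hz tone: a perfectly inverted copy cancels the noise completely, while a slightly delayed (imperfect) anti-noise signal leaves a residual.

import numpy as np

fs = 48000
t = np.arange(fs // 10) / fs                          # 100 ms
noise = np.sin(2 * np.pi * 120 * t)                   # 120 Hz hum

anti_ideal = -noise                                   # perfect antiphase copy
anti_late = -np.sin(2 * np.pi * 120 * (t - 0.4e-3))   # anti-noise arriving 0.4 ms late

print(np.max(np.abs(noise + anti_ideal)))             # ~0: complete cancellation
print(np.max(np.abs(noise + anti_late)))              # residual noise remains
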

A noise-cancellation speaker may be co-located with the sound source to be attenuated. In this
case it must have the same audio power level as the source of the unwanted sound. Alternatively,
the transducer emitting the cancellation signal may be located at the location where sound
attenuation is wanted (e.g. the user's ear). This requires a much lower power level for
cancellation but is effective only for a single user. Noise cancellation at other locations is more
difficult as the three dimensional wavefronts of the unwanted sound and the cancellation signal
could match and create alternating zones of constructive and destructive interference. In small
enclosed spaces (e.g. the passenger compartment of a car) such global cancellation can be
achieved via multiple speakers and feedback microphones, and measurement of the modal
responses of the enclosure.

Modern active noise control is achieved through the use of a computer, which analyzes the waveform of the background aural or nonaural noise, then generates a signal reversed waveform to cancel it out by interference. This waveform has identical or directly proportional amplitude to the waveform of the original noise, but its signal is inverted. This creates the destructive interference that reduces the amplitude of the perceived noise.


Active noise control differs from passive noise control methods (soundproofing) in that a powered system is involved, rather than unpowered methods such as insulation, sound-absorbing ceiling tiles or mufflers.

The advantages of active noise control methods compared to passive ones are that they are generally:

More effective at low frequencies.
Less bulky.
Able to block noise selectively.

The first patent for a noise control system was granted to inventor Paul Lueg in 1934 (U.S. Patent 2,043,416), describing how to cancel sinusoidal tones in ducts by phase-advancing the wave and canceling arbitrary sounds in the region around a loudspeaker by inverting the polarity. By the 1950s, systems were created to cancel the noise in helicopter and airplane cockpits, including those patented by Lawrence J. Fogel in the 1950s and 1960s, such as U.S. Patent 2,866,848, U.S. Patent 2,920,138, U.S. Patent 2,966,549 and Canadian patent 631,136. In 1986, Dick Rutan and Jeana Yeager used prototype headsets built by Bose in their around-the-world flight.[1][2]

Applications
Applications can be "1-dimensional" or 3-dimensional, depending on the type of zone to protect.
Periodic sounds, even complex ones, are easier to cancel than random sounds due to the
repetition in the wave form.

Protection of a "1-dimension zone" is easier and requires only one or two microphones and
speakers to be effective. Several commercial applications have been successful: noise-cancelling
headphones, active mufflers, and the control of noise in air conditioning ducts. The term "1-dimension" refers to a simple pistonic relationship between the noise and the active speaker (mechanical noise reduction) or between the active speaker and the listener (headphones).

Protection of a 3-dimension zone requires many microphones and speakers, making it less cost-
effective. Each of the speakers tends to interfere with nearby speakers, reducing the system's
overall performance. Noise reduction is more easily achieved with a single listener remaining
stationary in a three-dimensional space but if there are multiple listeners or if the single listener
moves throughout the space then the noise reduction challenge is made much more difficult.
High frequency waves are difficult to reduce in three dimensions due to their relatively short
audio wavelength in air. The wavelength of sinusoidal noise at approximately 1000 Hz is double the distance from the average person's left ear to the right ear; such a noise coming directly from the front will be
easily reduced by an active system but coming from the side will tend to cancel at one ear while
being reinforced at the other, making the noise louder, not softer. High frequency sounds above
1000 Hz tend to cancel and reinforce unpredictably from many directions. In sum, the most

effective noise reduction in three dimensions involves low frequency sounds. Commercial applications of 3-D noise reduction include the protection of aircraft cabins and car interiors, but in these situations, protection is mainly limited to the cancellation of repetitive (or periodic) noise such as engine-, propeller- or rotor-induced noise.

Antinoise is used to reduce noise at the working environment with ear plugs. Bigger noise
cancellation systems are used for ship engines or tunnels. An engine's cyclic nature makes FFT
analysis and the noise canceling easier to apply.
s e
The application of active noise reduction to noise produced by engines has various benefits:

The operation of the engines is more convenient for personnel.
Noise reduction eliminates vibrations that cause material wear-out and increased fuel consumption.
Quieting of submarines.

Linear prediction

Linear prediction is a mathematical operation where future values of a discrete-time signal are
estimated as a linear function of previous samples.

In digital signal processing, linear prediction is often called linear predictive coding (LPC) and
can thus be viewed as a subset of filter theory. In system analysis (a subfield of mathematics),
linear prediction can be viewed as a part of mathematical modelling or optimization.

The prediction model
The most common representation is

   x̂(n) = Σ_{i=1}^{p} a_i x(n − i),

where x̂(n) is the predicted signal value, x(n − i) the previous observed values, and a_i the predictor coefficients. The error generated by this estimate is

   e(n) = x(n) − x̂(n),

where x(n) is the true signal value.


These equations are valid for all types of (one-dimensional) linear prediction. The differences are found in the way the parameters a_i are chosen.

For multi-dimensional signals the error metric is often defined as

   e(n) = ‖x(n) − x̂(n)‖,

where ‖·‖ is a suitably chosen vector norm.

Estimating the parameters
/
The most common choice in the optimization of the parameters a_i is the root mean square criterion, which is also called the autocorrelation criterion. In this method we minimize the expected value of the squared error E[e²(n)], which yields the equations

   Σ_{i=1}^{p} a_i R(j − i) = R(j)

for 1 ≤ j ≤ p, where R is the autocorrelation of the signal x(n), defined as

   R(i) = E{x(n) x(n − i)},

and E is the expected value. In the multi-dimensional case this corresponds to minimizing the L2 norm.

The above equations are called the normal equations or Yule-Walker equations. In matrix form the equations can be equivalently written as

   R a = r,

where the autocorrelation matrix R is a symmetric Toeplitz matrix with elements r_{i,j} = R(i − j), vector r is the autocorrelation vector r_j = R(j), and vector a is the parameter vector.

Another, more general, approach is to minimize the expected value of the squared error

   e(n) = Σ_{i=0}^{p} a_i x(n − i),

where we usually constrain the parameters a_i with a_0 = 1 to avoid the trivial solution. This constraint yields the same predictor as above, but the normal equations are then

   Σ_{i=0}^{p} a_i R(i − j) = σ² δ(j),

where the index i ranges from 0 to p, δ(j) is 1 for j = 0 and 0 otherwise, σ² is the minimum prediction error power, and in the equivalent matrix form R is a (p + 1) × (p + 1) matrix.
Optimization of the parameters is a wide topic and a large number of other approaches have been proposed.

Still, the autocorrelation method is the most common and it is used, for example, for speech coding in the GSM standard.

Solution of the matrix equation Ra = r is computationally a relatively expensive process. The Gauss algorithm for matrix inversion is probably the oldest solution, but this approach does not efficiently use the symmetry of R and r. A faster algorithm is the Levinson recursion proposed by Norman Levinson in 1947, which recursively calculates the solution. Later, Delsarte et al. proposed an improvement to this algorithm called the split Levinson recursion, which requires about half the number of multiplications and divisions. It uses a special symmetrical property of parameter vectors on subsequent recursion levels.
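
For illustration, the sketch below estimates the autocorrelation sequence of a test signal and solves the Yule-Walker system R a = r with scipy.linalg.solve_toeplitz, which exploits the Toeplitz structure in the spirit of the Levinson recursion. The test signal and the model order are arbitrary.

import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(x, p):
    # Biased autocorrelation estimates R(0)..R(p).
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(p + 1)]) / len(x)
    # Solve the p x p Toeplitz system R a = r for the predictor coefficients.
    return solve_toeplitz((r[:p], r[:p]), r[1:p + 1])

x = np.random.randn(4096)
x = np.convolve(x, [1.0, 0.6, 0.2])       # colour the spectrum a little
print(lpc(x, p=4))
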
Echo cancellation

The term echo cancellation is used in telephony to describe the process of removing echo from a
voice communication in order to improve voice quality on a telephone call. In addition to
improving subjective quality, this process increases the capacity achieved through silence
suppression by preventing echo from traveling across a network.

Two sources of echo have primary relevance in telephony: acoustic echo and hybrid echo.

Echo cancellation involves first recognizing the originally transmitted signal that re-appears,
with some delay, in the transmitted or received signal. Once the echo is recognized, it can be
removed by 'subtracting' it from the transmitted or received signal. This technique is generally
implemented using a digital signal processor (DSP), but can also be implemented in software.
Echo cancellation is done using either echo suppressors or echo cancellers, or in some cases
both.

History
In telephony, "echo" is very much like what one would experience yelling in a canyon: the reflected copy of the voice heard some time later, a delayed version of the original. On a telephone, if the delay is fairly significant (more than a few hundred milliseconds), it is considered annoying. If the delay is very small (tens of milliseconds or less), the phenomenon is called sidetone and, while not objectionable to humans, can interfere with the communication between data modems.[citation needed]

In the earlier days of telecommunications, echo suppression was used to reduce the objectionable nature of echoes to human users. In essence these devices rely upon the fact that most telephone conversations are half-duplex: one person speaks while the other listens. An echo suppressor attempts to determine which is the primary direction and allows that channel to go forward. In the reverse channel, it places attenuation to block or "suppress" any signal on the assumption that the signal is echo. Naturally, such a device is not perfect. There are cases where both ends are active, and other cases where one end replies faster than an echo suppressor can switch directions to keep the echo attenuated but allow the remote talker to reply without attenuation.
Echo cancellers are the replacement for earlier echo suppressors that were initially developed in
the 1950s to control echo caused by the long delay on satellite telecommunications circuits.

Initial echo canceller theory was developed at AT&T Bell Labs in the 1960s,[1] but the first
commercial echo cancellers were not deployed until the late 1970s owing to the limited
capability of the electronics of the era. The concept of an echo canceller is to synthesize an
estimate of the echo from the talker's signal, and subtract that synthesis from the return path
instead of switching attenuation into/out of the path. This technique requires adaptive signal
processing to generate a signal accurate enough to effectively cancel the echo, where the echo
can differ from the original due to various kinds of degradation along the way.

Rapid advances in the implementation of digital signal processing allowed echo cancellers to be made smaller and more cost-effective. In the 1990s, echo cancellers were implemented within voice switches for the first time (in the Northern Telecom DMS-250) rather than as standalone devices. The integration of echo cancellation directly into the switch meant that echo cancellers could be reliably turned on or off on a call-by-call basis, removing the need for separate trunk groups for voice and data calls. Today's telephony technology often employs echo cancellers in small or handheld communications devices via a software voice engine, which provides cancellation of either acoustic echo or the residual echo introduced by a far-end PSTN gateway system; such systems typically cancel echo reflections with up to 64 milliseconds delay.

Voice messaging and voice response systems which accept speech for caller input use echo cancellation while speech prompts are played, to prevent the system's own speech recognition from falsely recognizing the echoed prompts.

Acoustic echo
Acoustic echo arises when sound from a loudspeaker (for example, the earpiece of a telephone handset) is picked up by the microphone in the same room (for example, the mic in the very same handset). The problem exists in any communications scenario where there is a speaker and a microphone. Examples of acoustic echo are found in everyday surroundings such as:


Hands-free car phone systems
A standard telephone or cellphone in speakerphone or hands-free mode
Dedicated standalone "conference phones"
Installed room systems which use ceiling speakers and microphones on the table
Physical coupling (vibrations of the loudspeaker transfer to the microphone via the handset casing)

In most of these cases, direct sound from the loudspeaker (not the person at the far end, otherwise referred to as the Talker) enters the microphone almost unaltered. This is called direct acoustic path echo. The difficulties in cancelling acoustic echo stem from the alteration of the original sound by the ambient space. This colours the sound that re-enters the microphone. These changes can include certain frequencies being absorbed by soft furnishings, and reflection of different frequencies at varying strength. These secondary reflections are not strictly referred to as echo, but rather are "reverb".

Acoustic echo is heard by the far-end talkers in a conversation. So if a person in Room A talks, they will hear their voice bounce around in Room B. This sound needs to be cancelled, or it will get sent back to its origin. Due to the slight round-trip transmission delay, this acoustic echo is very distracting.

Acoustic Echo Cancellation


Since its invention at AT&T Bell Labs,[1] echo cancellation algorithms have been improved and honed. Like all echo cancelling processes, these first algorithms were designed to anticipate the signal which would inevitably re-enter the transmission path, and cancel it out.

The Acoustic Echo Cancellation (AEC) process works as follows:

1. A far-end signal is delivered to the system.
2. The far-end signal is reproduced by the speaker in the room.
3. A microphone also in the room picks up the resulting direct path sound, and consequent reverberant sound, as a near-end signal.
4. The far-end signal is filtered and delayed to resemble the near-end signal.
5. The filtered far-end signal is subtracted from the near-end signal.
6. The resultant signal represents sounds present in the room, excluding any direct or reverberated sound produced by the speaker.
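The steps above can be sketched in a few lines of Python. The text does not name a specific adaptation algorithm, so the normalized LMS (NLMS) update, the function name and the parameter values below are illustrative assumptions rather than the method of any particular product.

import numpy as np

def aec_nlms(far_end, mic, taps=128, mu=0.5, eps=1e-6):
    # far_end : signal sent to the loudspeaker (steps 1-2)
    # mic     : near-end microphone signal = local speech + echo (step 3)
    # returns the echo-reduced near-end signal (steps 4-6)
    w = np.zeros(taps)                 # adaptive FIR model of speaker + room + mic
    x_buf = np.zeros(taps)             # most recent far-end samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = far_end[n]
        echo_est = np.dot(w, x_buf)    # step 4: filtered/delayed far-end signal
        e = mic[n] - echo_est          # step 5: subtract the estimate
        out[n] = e                     # step 6: residual = sounds from the room only
        # NLMS update: adapt the filter toward the actual echo path
        w += (mu / (np.dot(x_buf, x_buf) + eps)) * e * x_buf
    return out

In practice the filter length, step size and double-talk handling all need care; this sketch only shows the estimate-and-subtract structure.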

Challenges for AEC (Acoustic Echo Cancellation)

The primary challenge for an echo canceler is determining the nature of the filtering to be
applied to the far-end signal such that it resembles the resultant near-end signal. The filter is
essentially a model of the speaker, microphone and the room's acoustical attributes.

To configure the filter, early echo cancellation systems required training with impulse or pink noise, and some used this as the only model of the acoustic space. Later systems used this training only as a basis to start from, and the canceller then adapted from that point on. By using the far-end signal as the stimulus, modern systems can 'converge' from nothing to 55 dB of cancellation in around 200 ms.

Full Bandwidth Cancellation b e


Until recently, echo cancellation only needed to apply to the voice bandwidth of telephone circuits. PSTN calls transmit frequencies between 300 Hz and 3 kHz, the range required for human speech intelligibility.
w
Videoconferencing is one area where full bandwidth audio is transceived. In this case, specialised products are employed to perform echo cancellation.

Hybrid echo
Hybrid echo is generated by the public switched telephone network (PSTN) through the reflection of electrical energy by a device called a hybrid (hence the term hybrid echo). Most telephone local loops are two-wire circuits, while transmission facilities are four-wire circuits. Each hybrid produces echoes in both directions, though the far-end echo is usually a greater problem for voiceband.

Retaining echo suppressors


Echo suppression may have the side-effect of removing valid signals from the transmission. This can cause audible signal loss that is called "clipping" in telephony, but the effect is more like a "squelch" than amplitude clipping. In an ideal situation, then, echo cancellation alone will be used. However, this is insufficient in many applications, notably software phones on networks with long delay and meager throughput. Here, echo cancellation and suppression can work in conjunction to achieve acceptable performance.

Modems
Echo control on voice-frequency data calls that use dial-up modems may cause data corruption.
Some telephone devices disable echo suppression or echo cancellation when they detect the 2100
or 2225 Hz "answer" tones associated with such calls, in accordance with ITU-T
recommendation G.164 or G.165.

In the 1990s, most echo cancellation was done inside modems of type V.32 and later. In
voiceband modems this allowed using the same frequencies in both directions simultaneously,
greatly increasing the data rate. As part of connection negotiation, each modem sent line probe
signals, measured the echoes, and set up its delay lines. Echoes in this case did not include long
echoes caused by acoustic coupling, but did include short echoes caused by impedance
mismatches in the 2-wire local loop to the telephone exchange.

After the turn of the century, DSL modems also made extensive use of automated echo cancellation. Though they used separate incoming and outgoing frequencies, these frequencies were beyond the voiceband for which the cables were designed, and often suffered attenuation distortion due to bridge taps and incomplete impedance matching. Deep, narrow frequency
b
IMAGE ENHANCEMENT
Edge enhancement is an image processing filter that enhances the edge contrast of an image or video in an attempt to improve its acutance (apparent sharpness).

The filter works by identifying sharp edge boundaries in the image, such as the edge between a subject and a background of a contrasting color, and increasing the image contrast in the area immediately around the edge. This has the effect of creating subtle bright and dark highlights on either side of any edges in the image, leading the edge to look more defined when viewed from a typical viewing distance.
t
The process is prevalent in the video field, appearing to some degree in the majority of TV broadcasts and DVDs. A modern television set's "sharpness" control is an example of edge enhancement. It is also widely used in computer printers, especially for fonts and/or graphics, to obtain better print quality. Most digital cameras also perform some edge enhancement, which in some cases cannot be adjusted.
which in some cases cannot be adjusted.

Edge enhancement can be either an analog or a digital process. Analog edge enhancement may
be used, for example, in all-analog video equipment such as modern CRT televisions.

http://www.csetube.in/
Properties
Edge enhancement applied to an image can vary according to a number of properties.

Amount. This controls the extent to which contrast in the edge detected area is enhanced.
Radius or aperture. This affects the size of the edges to be detected or enhanced, and the
size of the area surrounding the edge that will be altered by the enhancement. A smaller
radius will result in enhancement being applied only to sharper, finer edges, and the
enhancement being confined to a smaller area around the edge.
Threshold. Where available, this adjusts the sensitivity of the edge detection mechanism.
A lower threshold results in more subtle boundaries of colour being identified as edges. A
threshold that is too low may result in some small parts of surface textures, film grain or
noise being incorrectly identified as being an edge.
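As a rough illustration of how these three properties could be wired together, here is a short Python sketch in the style of an unsharp mask; it is my own example (function name and default values included), not the algorithm used by any particular device.

import numpy as np
from scipy.ndimage import gaussian_filter

def enhance_edges(img, amount=1.0, radius=1.5, threshold=0.02):
    # img: 2-D float array with values in [0, 1]
    blurred = gaussian_filter(img, sigma=radius)   # radius: size of the edges affected
    detail = img - blurred                         # high-pass "edge" signal
    mask = np.abs(detail) >= threshold             # threshold: ignore faint texture/noise
    out = img + amount * detail * mask             # amount: strength of the overshoot
    return np.clip(out, 0.0, 1.0)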

/
In some cases, edge enhancement can be applied in the horizontal or vertical direction only, or to both directions in different amounts. This may be useful, for example, when applying edge enhancement to images that were originally sourced from analog video.

Effects of edge enhancement b e


t u
s e
Unlike some forms of image sharpening, edge enhancement does not enhance subtle detail which may appear in more uniform areas of the image, such as texture or grain which appears in flat or smooth areas of the image. The benefit to this is that imperfections in the image reproduction, such as grain or noise, or imperfections in the subject, such as natural imperfections on a person's skin, are not made more obvious by the process. A drawback to this is that the image may begin to look less natural, because the apparent sharpness of the overall image has increased but the level of detail in flat, smooth areas has not.

: /
As with other forms of image sharpening, edge enhancement is only capable of improving the perceived sharpness or acutance of an image. The enhancement is not completely reversible, and as such some detail in the image is lost as a result of filtering. Further sharpening operations on the resulting image compound the loss of detail, leading to artifacts such as ringing. An example of this can be seen when an image that has already had edge enhancement applied, such as the picture on a DVD video, has further edge enhancement applied by the DVD player it is played on, and possibly also by the television it is displayed on. Essentially, the first edge enhancement filter creates new edges on either side of the existing edges, which are then further enhanced.

Viewing conditions
The ideal amount of edge enhancement that is required to produce a pleasant and sharp-looking
image, without losing too much detail, varies according to several factors. An image that is to be
viewed from a nearer distance, at a larger display size, on a medium that is inherently more
"sharp" or by a person with excellent eyesight will typically demand a finer or lesser amount of
edge enhancement than an image that is to be shown at a smaller display size, further viewing
distance, on a medium that is inherently softer or by a person with poorer eyesight.

For this reason, home theatre enthusiasts who invest in larger, higher quality screens often
complain about the amount of edge enhancement present in commercially produced
DVD videos, claiming that such edge enhancement is optimized for playback on smaller, poorer
quality television screens, but the loss of detail as a result of the edge enhancement is much more
noticeable in their viewing conditions.

SPEECH PROCESSING
Speech processing is the study of speech signals and the processing methods of these signals. The signals are usually processed in a digital representation, so speech processing can be regarded as a special case of digital signal processing, applied to speech signals. It is also closely tied to natural language processing (NLP), as its input can come from, and its output can go to, NLP applications: for example, text-to-speech synthesis may use a syntactic parser on its input text, and speech recognition's output may be used by information extraction techniques. Speech processing can be divided into the following categories:

Speech recognition, which deals with analysis of the linguistic content of a speech signal.
Speaker recognition, where the aim is to recognize the identity of the speaker.
Enhancement of speech signals, e.g. audio noise reduction.
Speech coding, a specialized form of data compression, which is important in the telecommunication area.
Voice analysis for medical purposes, such as analysis of vocal loading and dysfunction of the vocal cords.
Speech synthesis: the artificial synthesis of speech, which usually means computer-generated speech.
Speech enhancement: enhancing the perceptual quality of a speech signal by removing the destructive effects of noise, limited capacity recording equipment, impairments, etc.

Speech signal processing refers to the acquisition, manipulation, storage, transfer and output of vocal utterances by a computer. The main applications are the recognition, synthesis and compression of human speech:

Speech recognition (also called voice recognition) focuses on capturing the human voice as a digital sound wave and converting it into a computer-readable format.

Speech synthesis is the reverse process of speech recognition. Advances in this area improve the computer's usability for the visually impaired.

Speech compression is important in the telecommunications area for increasing the amount of information which can be transferred, stored, or heard, for a given set of time and space constraints.

Speech recognition (also known as automatic speech recognition or computer speech recognition) converts spoken words to text. The term "voice recognition" is sometimes used to refer to recognition systems that must be trained to a particular speaker, as is the case for most desktop recognition software. Recognizing the speaker can simplify the task of translating speech. Speech recognition is a broader solution which refers to technology that can recognize speech without being targeted at a single speaker, such as a call center system that can recognize arbitrary voices. Speech recognition applications include voice dialing, call routing, domotic appliance control, search, simple data entry, preparation of structured documents, speech-to-text processing, and aircraft.

Health care
Speech recognition can be implemented in the front-end or back-end of the medical documentation process.
Front-End SR is where the provider dictates into a speech-recognition engine, the recognized words are
displayed right after they are spoken, and the dictator is responsible for editing and signing off on the
document. It never goes through an MT/editor. Back-End SR or Deferred SR is where the provider
dictates into a digital dictation system, and the voice is routed through a speech-recognition machine and
the recognized draft document is routed along with the original voice file to the MT/editor, who edits the
draft and finalizes the report. Deferred SR is being widely used in the industry currently.

Military

High-performance fighter aircraft: in these programs, speech recognizers have been operated successfully in fighter aircraft, with applications including setting radio frequencies, commanding an autopilot system, setting steer-point coordinates and weapons release parameters, and controlling flight displays.

Some important conclusions from the work were as follows: b e


1. Speech recognition has definite potential for reducing pilot workload, but this potential was not realized consistently.

2. Achievement of very high recognition accuracy (95% or more) was the most critical factor for making the speech recognition system useful; with lower recognition rates, pilots would not use the system.

3. More natural vocabulary and grammar, and shorter training times, would be useful, but only if very high recognition rates could be maintained.

/
Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely "synthetic" voice output. The quality of a speech synthesizer is judged by its similarity to the human voice and by its ability to be understood. An intelligible text-to-speech program allows people with visual impairments or reading disabilities to listen to written works on a home computer. Many computer operating systems have included speech synthesizers since the early 1980s.


---------------------------------------------------------------------------------------------------------------------------

Some 2 marks
n /
(a) Differences:

1) The input is bit-reversed while the output is in natural order for DIT, whereas for DIF the output is bit-reversed while the input is in natural order.

2) The DIF butterfly is slightly different from the DIT butterfly, the difference being that the complex multiplication takes place after the add-subtract operation in DIF.

Similarities:

Both algorithms require the same number of operations to compute the DFT. Both algorithms can be done in place, and both need to perform bit reversal at some place during the computation.
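A small Python sketch of the bit-reversed ordering referred to above (the helper name is my own):

def bit_reverse_indices(n):
    # bit-reversed index permutation for an n-point FFT, n a power of two
    bits = n.bit_length() - 1
    return [int(format(i, '0{}b'.format(bits))[::-1], 2) for i in range(n)]

# For an 8-point DIT FFT the input is consumed in this order:
print(bit_reverse_indices(8))   # [0, 4, 2, 6, 1, 5, 3, 7]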

w w
(b) FIR and IIR: The IIR filters are of recursive type, whereby the present output sample depends on the present input, past input samples and past output samples. The FIR filters are of non-recursive type, whereby the present output sample depends only on the present input sample and previous input samples.
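A short Python sketch of the two kinds of difference equation (the coefficients are arbitrary example values, not taken from the text):

def fir_filter(x, b):
    # non-recursive: y(n) = sum_k b[k] * x(n-k)   (present and past inputs only)
    y = []
    for n in range(len(x)):
        y.append(sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0))
    return y

def iir_filter(x, b, a):
    # recursive: y(n) = sum_k b[k]*x(n-k) - sum_m a[m]*y(n-m), assuming a[0] = 1
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[m] * y[n - m] for m in range(1, len(a)) if n - m >= 0)
        y.append(acc)
    return y

x = [1.0, 0.0, 0.0, 0.0, 0.0]                       # unit impulse
print(fir_filter(x, b=[0.5, 0.3, 0.2]))             # response dies out after 3 samples
print(iir_filter(x, b=[0.5], a=[1.0, -0.6]))        # response keeps decaying (feedback)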

(c) Gibbs phenomenon?

One possible way of finding an FIR filter that approximates H(e^jw) would be to truncate the infinite Fourier series at n = +/-(N-1)/2. Direct truncation of the series will lead to fixed percentage overshoots and undershoots before and after an approximated discontinuity in the frequency response.
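A small numerical Python sketch of this effect (the filter length and cutoff are arbitrary example values): truncating the ideal lowpass impulse response with a rectangular window leaves an overshoot of roughly 9% near the discontinuity, and increasing N does not reduce it.

import numpy as np

N = 61                                    # odd filter length (example)
wc = 0.4 * np.pi                          # ideal cutoff (example)
n = np.arange(N) - (N - 1) // 2           # symmetric index range
h = np.where(n == 0, wc / np.pi,
             np.sin(wc * n) / (np.pi * np.where(n == 0, 1, n)))   # truncated ideal lowpass

w = np.linspace(0, np.pi, 2048)
H = np.array([np.sum(h * np.exp(-1j * wk * n)) for wk in w])      # frequency response
print("peak passband gain:", np.max(np.abs(H)))                   # about 1.09 (Gibbs overshoot)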

(d) The desirable characteristics of the window are:

1. The central lobe of the frequency response of the window should contain most of the energy and should be narrow.
2. The highest side lobe level of the frequency response should be small.
3. The side lobes of the frequency response should decrease in energy rapidly as ω tends to π.

(e) The necessary and sufficient condition for the linear phase characteristic in an FIR filter is that the impulse response h(n) of the system should have the symmetry property, i.e., h(n) = h(N-1-n), where N is the duration of the sequence.

http://www.csetube.in/
(f)What are the advantages and disadvantages of FIR filters?
Advantages:
1. FIR filters have exact linear phase.
2. FIR filters are always stable.
3. FIR filters can be realized in both recursive and non recursive structure.
4. Filters with any arbitrary magnitude response can be tackled using FIR sequence.
Disadvantages:
1. For the same filter specifications the order of FIR filter design can be as high as 5 to 10 times that in an
IIR design.
2. Large storage requirements.
3. Powerful computational facilities required for the implementation.
(g) Draw the direct form realization of an FIR system.

(h) Draw the direct form realization of a linear phase FIR system for N odd.

(i) Draw the direct form realization of a linear phase FIR system for N even.

(j) Draw the M-stage lattice filter.

(k) Pre-warping: The effect of the non-linear compression at high frequencies can be compensated. When the desired magnitude response is piece-wise constant over frequency, this compression can be compensated by introducing a suitable pre-scaling, or pre-warping, of the critical frequencies using the formula

Ω = (2/T) tan(ω/2),

where ω is the digital critical frequency in radians per sample and Ω is the corresponding pre-warped analog frequency.
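For example, in Python (the sampling rate and cutoff below are arbitrary values), a digital critical frequency is pre-warped to the analog design frequency as follows:

import math

fs = 8000.0                                  # sampling rate in Hz (example)
fc = 1000.0                                  # desired digital cutoff in Hz (example)
T = 1.0 / fs

w = 2 * math.pi * fc / fs                    # digital critical frequency, rad/sample
omega = (2.0 / T) * math.tan(w / 2.0)        # pre-warped analog frequency, rad/s
print(omega / (2 * math.pi))                 # analog design frequency in Hz (slightly above 1000)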

(l) Advantages of the bilinear transformation:

The bilinear transformation provides one-to-one mapping.
Stable continuous systems can be mapped into realizable, stable digital systems.
There is no aliasing.

Disadvantages:

The mapping is highly non-linear, producing frequency compression at high frequencies.
Neither the impulse response nor the phase response of the analog filter is preserved in a digital filter obtained by bilinear transformation.
t u
(m) Advantages of floating point representation? 1. Large dynamic range. 2. Overflow is unlikely.

s e
(n) Truncation? Truncation is the process of discarding all bits less significant than the LSB that is retained.

. c
(o) Rounding? Rounding a number to b bits is accomplished by choosing as the rounded result the b-bit number closest to the original unrounded number.
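A tiny Python sketch contrasting (n) and (o) for b fractional bits (the word length and sample value are arbitrary examples):

import math

def truncate(x, b):
    # discard every bit below the retained LSB (2**-b); floor mimics two's-complement truncation
    q = 2.0 ** (-b)
    return math.floor(x / q) * q

def round_to_b(x, b):
    # choose the b-bit value closest to the unrounded number
    q = 2.0 ** (-b)
    return round(x / q) * q

x = 0.8671875
print(truncate(x, 3))     # 0.75  -- error always in one direction
print(round_to_b(x, 3))   # 0.875 -- smaller error, varying sign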

w w
(p) Two types of limit cycle behavior in DSP? 1. Zero-input limit cycle behavior. 2. Overflow limit cycle behavior.
/ w
(q) Methods to prevent overflow? 1. Saturation arithmetic, and 2. Scaling.

: /
(r)advantages of Kaiser window?

t p
o It provides flexibility for the designer to select the side lobe level and N

h t
o It has the attractive property that the side lobe level can be varied continuously from the low value in
the Blackman window to the high value in the rectangular window
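A quick numerical sketch of this property using NumPy's Kaiser window (the window length and beta values are arbitrary examples):

import numpy as np

N = 51
for beta in (0.0, 5.0, 8.6):                    # 0 ~ rectangular, 8.6 ~ Blackman-like
    w = np.kaiser(N, beta)
    W = np.abs(np.fft.rfft(w, 8192))
    W /= W.max()
    i = 1                                       # walk down the main lobe to its first null
    while i < len(W) - 1 and W[i + 1] < W[i]:
        i += 1
    print("beta = {}: highest side lobe about {:.1f} dB".format(beta, 20 * np.log10(W[i:].max())))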

(s)FIR filter
These filters can be easily designed to have perfectly linear phase.
FIR filters can be realized recursively and non-recursively.
Greater flexibility to control the shape of their magnitude response.
Errors due to round off noise are less severe in FIR filters, mainly because feedback is not used.
IIR filter
These filters do not have linear phase.
IIR filters are easily realized recursively.
Less flexibility, usually limited to specific kinds of filters.
The round-off noise is higher in IIR filters.

