Speech Recognition

• The fundamental aspect of speech recognition is the translation of sound into text and
• Speech recognition is the process by which a computer maps an acoustic speech signal to some
form of abstract meaning of the speech.

Automatic Speech Recognition System

• Pre-processing/Digital Processing:
• The recorded acoustic signal is an analog signal.
• An analog signal cannot be passed directly to an ASR system.
• The speech signal must therefore first be converted into a digital signal; only then can it
be processed.
• The digital signal is passed through a first-order filter to spectrally flatten it.
• This procedure increases the energy of the signal at higher frequencies.
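The first-order filtering step above can be sketched as a pre-emphasis filter, y[n] = x[n] - alpha * x[n-1]. This is a minimal illustration rather than the method of any particular ASR system; the function name pre_emphasis and the coefficient alpha = 0.97 are assumptions chosen for the example.

```python
def pre_emphasis(samples, alpha=0.97):
    """Apply the first-order filter y[n] = x[n] - alpha * x[n-1].

    alpha = 0.97 is an assumed, commonly quoted value, not one given in the notes.
    """
    if not samples:
        return []
    # The first sample has no predecessor, so it is passed through unchanged.
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(samples[n] - alpha * samples[n - 1])
    return out


# A constant (low-frequency) signal is almost cancelled out...
flat = pre_emphasis([1.0, 1.0, 1.0, 1.0])
# ...while a rapidly alternating (high-frequency) signal is amplified.
alternating = pre_emphasis([1.0, -1.0, 1.0, -1.0])
```

After the first sample, the constant signal shrinks to about 0.03 in magnitude while the alternating one grows to about 1.97, which is the high-frequency boost described above.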
• Feature Extraction
• The feature extraction step finds a set of parameters of the utterances that are acoustically
correlated with the speech signal; these parameters are computed by processing the acoustic signal.
• These parameters are known as features.
• The main focus of the feature extractor is to keep the relevant information and discard the
irrelevant information.
• To do this, the feature extractor divides the acoustic signal into frames of 10-25 ms.
• The data in each frame is multiplied by a window function.
• Many window functions can be used, such as Hamming, rectangular, Blackman, Welch, or
Gaussian. In this way, features are extracted from every frame.
• There are several methods for feature extraction, such as Mel-Frequency Cepstral Coefficients
(MFCC), Linear Predictive Cepstral Coefficients (LPCC), Perceptual Linear Prediction (PLP),
wavelets, and RASTA-PLP (Relative Spectral Transform) processing.
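The framing and windowing steps above can be sketched as follows. This is an illustrative sketch only: the 25 ms frame length falls inside the 10-25 ms range given above, but the 10 ms step between frames (so that frames overlap), the 16 kHz sample rate, and the function names are assumptions made for the example. It stops before any actual feature computation such as MFCC.

```python
import math


def hamming(n_points):
    """Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if n_points == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def frame_signal(samples, sample_rate, frame_ms=25, step_ms=10):
    """Split samples into frames and multiply each frame by a Hamming window."""
    frame_len = int(sample_rate * frame_ms / 1000)
    step = int(sample_rate * step_ms / 1000)  # assumed step between frame starts
    window = hamming(frame_len)
    frames = []
    # Only full frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, step):
        frame = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames


# Hypothetical input: one second of a constant signal sampled at 16 kHz.
signal = [1.0] * 16000
frames = frame_signal(signal, sample_rate=16000)
```

Each windowed frame tapers toward its edges, which reduces the discontinuities that would otherwise appear at frame boundaries when the spectrum of the frame is computed.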
• Acoustic Modeling
• Acoustic modeling establishes the connection between the acoustic information and phonetics.
• The acoustic model plays an important role in the performance of the system and is responsible
for the computational load.
• Training establishes a correlation between the basic speech units and the acoustic observations.
• Training the system requires creating a representative pattern for the features of a class, using
one or more patterns that correspond to speech sounds of the same class.
• Many models are available for acoustic modeling; of these, the Hidden Markov Model (HMM)
is widely used and accepted, as it has efficient algorithms for training and recognition.
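To give a feel for why HMMs are efficient at recognition time, here is a small sketch of the forward algorithm, which computes the likelihood of an observation sequence under a model. It is a toy: the two states, the discrete symbols "a" and "b", and all probabilities are hypothetical, whereas real acoustic models score continuous feature vectors.

```python
def forward(observations, states, start_p, trans_p, emit_p):
    """Return P(observations | model) using the forward algorithm."""
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs] * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical two-state model with made-up probabilities.
states = ["s1", "s2"]
start_p = {"s1": 0.6, "s2": 0.4}
trans_p = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
emit_p = {"s1": {"a": 0.5, "b": 0.5}, "s2": {"a": 0.1, "b": 0.9}}

likelihood = forward(["a", "b"], states, start_p, trans_p, emit_p)
```

The loop does work proportional to the sequence length times the square of the number of states, instead of enumerating every possible state path. A recognizer can compare this likelihood across the models of competing speech units.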
• Language Modeling
• A language model captures the structural constraints of the language in order to generate the
probabilities of occurrence.
• It gives the probability of a word occurring after a given word sequence.
• The language model distinguishes between words and phrases that sound similar.
• For example, in American English, the phrases "recognize speech" and "wreck a nice
beach" are pronounced almost identically but mean very different things.
• These ambiguities are easier to resolve when evidence from the language model is combined
with the pronunciation model and the acoustic model.
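The idea of scoring a word given the words before it can be sketched with a bigram model, which conditions only on the previous word. This is a simplified illustration: the three-sentence corpus is hypothetical, add-one smoothing is an assumed choice, and the function names are invented for the example.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """P(word | prev) with add-one smoothing so unseen pairs are not zero."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def phrase_prob(phrase, unigrams, bigrams):
    """Multiply the bigram probabilities of a phrase, starting from <s>."""
    tokens = ["<s>"] + phrase.lower().split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word, unigrams, bigrams)
    return prob


# Hypothetical toy corpus; a real model is trained on far more text.
corpus = [
    "we recognize speech with a computer",
    "systems recognize speech from audio",
    "it is hard to recognize speech",
]
unigrams, bigrams = train_bigram_model(corpus)
p_speech = phrase_prob("recognize speech", unigrams, bigrams)
p_beach = phrase_prob("wreck a nice beach", unigrams, bigrams)
```

With this corpus, "recognize speech" scores higher than "wreck a nice beach", which is the kind of evidence a recognizer can use when the acoustics alone are ambiguous. Note that a longer phrase multiplies more probabilities, so comparing phrases of different lengths this way is only indicative.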
• Pattern Classification
• Pattern classification (or recognition) is the process of comparing an unknown test pattern with
each sound class reference pattern and computing a measure of similarity between them.
• Once the system has been trained, patterns are classified at test time to recognize
the speech.
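The comparison step above can be sketched as nearest-reference classification. This is only one possible similarity measure: Euclidean distance is an assumption here, and the class labels and 3-dimensional reference patterns are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(test_pattern, reference_patterns):
    """Return the class whose reference pattern is closest to the test pattern."""
    return min(reference_patterns,
               key=lambda label: euclidean(test_pattern, reference_patterns[label]))


# Hypothetical 3-dimensional reference patterns, one per sound class.
references = {
    "aa": [1.0, 0.2, 0.1],
    "iy": [0.1, 0.9, 0.3],
    "s": [0.0, 0.1, 1.0],
}
predicted = classify([0.2, 0.8, 0.4], references)
```

Here the test pattern lies closest to the "iy" reference, so that class is returned. A smaller distance stands in for a higher similarity.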
• Part of Speech Tagging
• Part-of-speech (POS) tagging is a process in natural language processing (NLP) where each
word in a text is labeled with its corresponding part of speech.
• This can include nouns, verbs, adjectives, and other grammatical categories.
• POS tagging is useful for a variety of NLP tasks, such as information extraction, named entity
recognition, and machine translation.
• It can also be used to identify the grammatical structure of a sentence and to disambiguate
words that have multiple meanings.
• POS tagging is typically performed using machine learning algorithms, which are trained on a
large annotated corpus of text.
• The algorithm learns to predict the correct POS tag for a given word based on the context in
which it appears.
• Let’s take an example,
• Text: “The cat sat on the mat.”
• POS tags:
• The: determiner
• cat: noun
• sat: verb
• on: preposition
• the: determiner
• mat: noun
• Use of Parts of Speech Tagging in NLP
• To understand the grammatical structure of a sentence
• To disambiguate words with multiple meanings
• To improve the accuracy of NLP tasks
• To facilitate research in linguistics
• Steps Involved in the POS tagging
• Collect a dataset of annotated text
• Preprocess the text
• Divide the dataset into training and testing sets
• Train the POS tagger
• Test the POS tagger
• Fine-tune the POS tagger
• Use the POS tagger
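The train and test steps listed above can be sketched with a very simple baseline that assigns each word its most frequent tag from the training data. This is not how modern taggers work (they use the context of the word, as described earlier); the data, the tag names, and the function names are all hypothetical.

```python
from collections import Counter, defaultdict


def train_tagger(tagged_sentences):
    """Learn each word's most frequent tag, plus a fallback tag for unseen words."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
            all_tags[tag] += 1
    word_to_tag = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    default_tag = all_tags.most_common(1)[0][0]
    return word_to_tag, default_tag


def tag_words(words, word_to_tag, default_tag):
    """Tag each word with its learned tag, or the fallback tag if unseen."""
    return [(w, word_to_tag.get(w.lower(), default_tag)) for w in words]


def accuracy(tagged_sentences, word_to_tag, default_tag):
    """Fraction of tokens whose predicted tag matches the annotated tag."""
    correct = total = 0
    for sentence in tagged_sentences:
        words = [w for w, _ in sentence]
        predicted = tag_words(words, word_to_tag, default_tag)
        for (_, gold), (_, guess) in zip(sentence, predicted):
            correct += gold == guess
            total += 1
    return correct / total


# Hypothetical annotated data, already split into training and test sets.
train_set = [
    [("The", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
     ("on", "ADP"), ("the", "DET"), ("mat", "NOUN")],
    [("A", "DET"), ("dog", "NOUN"), ("sat", "VERB"),
     ("on", "ADP"), ("the", "DET"), ("rug", "NOUN")],
]
test_set = [
    [("The", "DET"), ("dog", "NOUN"), ("sat", "VERB"),
     ("on", "ADP"), ("the", "DET"), ("mat", "NOUN")],
]

word_to_tag, default_tag = train_tagger(train_set)
test_accuracy = accuracy(test_set, word_to_tag, default_tag)
```

Every word in this tiny test set also appears in the training set, so the baseline tags it perfectly. On real data the interesting cases are unseen and ambiguous words, which is where fine-tuning and context-aware models matter.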
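• Implement Parts-of-Speech tags using spaCy in Python
# Run these two commands in a shell first:
pip install spacy
python -m spacy download en_core_web_sm

import spacy

# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is planning to buy Indian startup for $1 billion")
for token in doc:
    # Print the token, its coarse and fine-grained tags, and an explanation of each tag
    print(token, "|", token.pos_, "|", spacy.explain(token.pos_), "|", token.tag_, spacy.explain(token.tag_))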
• token.pos_ gives the coarse-grained POS tag of the token; token.tag_ gives the fine-grained tag
• Output
• Apple | PROPN | proper noun | NNP noun, proper singular
• is | AUX | auxiliary | VBZ verb, 3rd person singular present
• planning | VERB | verb | VBG verb, gerund or present participle
• to | PART | particle | TO infinitival "to"
• buy | VERB | verb | VB verb, base form
• Indian | ADJ | adjective | JJ adjective (English), other noun-modifier (Chinese)
• startup | NOUN | noun | NN noun, singular or mass
• for | ADP | adposition | IN conjunction, subordinating or preposition
• $ | SYM | symbol | $ symbol, currency
• 1 | NUM | numeral | CD cardinal number
• billion | NUM | numeral | CD cardinal number
• Defining a tag set
• We have to define an inventory of labels for the word classes (i.e. the tag set)
-Most taggers rely on models that have to be trained on annotated (tagged) corpora. Evaluation
also requires annotated corpora.
-Since human annotation is expensive/time-consuming, the tag sets used in a few existing
labeled corpora become the de facto standard.
-Tag sets need to capture semantically or syntactically important distinctions that can easily be
made by trained human annotators.
Word classes
Open classes:
Nouns, Verbs, Adjectives, Adverbs
Closed classes:
Auxiliaries and modal verbs, Prepositions, Conjunctions, Pronouns, Determiners, Particles
• Defining a tag set
• Tag sets have different granularities:
• Brown corpus (Francis and Kucera 1982): 87 tags
• Penn Treebank (Marcus et al. 1993): 45 tags
Simplified version of Brown tag set (de facto standard for English now)
NN: common noun (singular or mass): water, book
NNS: common noun (plural): books
• Prague Dependency Treebank (Czech): 4452 tags
Complete morphological analysis: AAFP3----3N----: nejnezajímavějším
Adjective Regular Feminine Plural Dative….Superlative
• How much ambiguity is there?
Most word types are unambiguous:
Number of tags per word type:
• NB: These numbers are based on word/tag combinations in the corpus. Many combinations
that don’t occur in the corpus are equally correct.
• But a large fraction of word tokens are ambiguous. Original Brown corpus: 40% of tokens
are ambiguous.
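The difference between type-level and token-level ambiguity can be sketched by counting tags per word type in a tagged corpus. The mini-corpus and the function name ambiguity_stats are hypothetical, and, as noted above, the result only reflects word/tag combinations that actually occur in the data.

```python
from collections import defaultdict


def ambiguity_stats(tagged_tokens):
    """Return (fraction of ambiguous word types, fraction of ambiguous tokens)."""
    tags_per_type = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_per_type[word.lower()].add(tag)
    # A word type is ambiguous if it occurs with more than one tag.
    ambiguous_types = {w for w, tags in tags_per_type.items() if len(tags) > 1}
    ambiguous_tokens = sum(1 for word, _ in tagged_tokens
                           if word.lower() in ambiguous_types)
    return (len(ambiguous_types) / len(tags_per_type),
            ambiguous_tokens / len(tagged_tokens))


# Hypothetical mini-corpus of (word, tag) pairs.
corpus = [
    ("they", "PRON"), ("can", "AUX"), ("fish", "VERB"),
    ("the", "DET"), ("can", "NOUN"), ("holds", "VERB"), ("fish", "NOUN"),
    ("the", "DET"), ("can", "NOUN"), ("is", "AUX"), ("full", "ADJ"),
]
type_fraction, token_fraction = ambiguity_stats(corpus)
```

In this toy data only 2 of 7 word types are ambiguous, yet they account for 5 of 11 tokens, because the ambiguous words are also frequent ones. The same effect explains how most types can be unambiguous while a large share of tokens is not.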
• Qualitative evaluation
• Generate a confusion matrix (for development data): how often was a word with tag i
mistagged as tag j?
• See what errors are causing problems:
-Noun (NN) vs ProperNoun (NNP) vs Adj (JJ)
-Preterite (VBD) vs Participle (VBN) vs Adjective (JJ)
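A confusion matrix of this kind can be sketched by counting (gold tag, predicted tag) pairs. The tag sequences below are hypothetical development data, and the function names are invented for the example.

```python
from collections import Counter


def confusion_matrix(gold_tags, predicted_tags):
    """Count how often a token with gold tag i was predicted as tag j."""
    return Counter(zip(gold_tags, predicted_tags))


def top_errors(matrix, n=3):
    """Return the n most frequent (gold, predicted) pairs where the tags differ."""
    errors = Counter({pair: count for pair, count in matrix.items()
                      if pair[0] != pair[1]})
    return errors.most_common(n)


# Hypothetical gold and predicted tags for the same development tokens.
gold = ["NN", "NNP", "JJ", "VBD", "VBN", "NN", "VBN", "NNP"]
pred = ["NN", "NN", "JJ", "VBD", "VBD", "NN", "VBD", "NN"]

matrix = confusion_matrix(gold, pred)
errors = top_errors(matrix)
```

In this made-up example the errors are NNP mistagged as NN and VBN mistagged as VBD, each occurring twice, mirroring the confusions listed above.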

