
CESI Research Project 2019

Automated Speech Recognition Systems Applications in Industry


Yarol MORETTI, Antoine ORFILA, Clément CIRANNA, Miriam TINOCO ALVAREZ, Tony BRIET
CESI, 1240 route des dolines, Valbonne 06560, France

Abstract

The constant development of technology creates possibilities that could hardly be imagined before, and it is now used to enrich human life and enhance productivity wherever it operates. Speech signal processing is one such technology that keeps growing and has earned its place in the industrial world, whether through automatic speech recognition, speech synthesis or natural language processing. In the real world, a single utterance can be enough to recognize a speaker, and making machines do the same is the central challenge of speech signal processing research. Advances in speech recognition over the past five decades have enabled a wide range of industrial applications, such as easier operation of personal computers, yet the research carried out so far and the features already available to the public are only a small preview of what lies ahead: microphones may soon replace keyboards, and report writing may soon be automated. This document presents the development of speech recognition technology, assembles the information gathered during this research project, and outlines the current uses of these technologies across their different fields of application.
© 2019 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/3.0/)
Peer-review under responsibility of the scientific committee of the 52nd CIRP Conference on Manufacturing Systems.

Keywords: speech recognition; industry; automation; technology; signal processing

1. Introduction

The human voice can be used in many ways in industry, but it mainly serves as an input for two applications. The first is speech recognition, where we analyze the user's speech and try to understand its content. The second is voice biometrics, where we try to recognize the voice itself and identify the user. The two applications are used in different fields, and this paper covers both.

1.1. Automated Speech Recognition (ASR)

Speech is the primary mode of communication among humans, and it has the potential to become a very important mode of communication between humans and computers. Speech has its own specific characteristics: it is the preferred way to communicate from a distance and the most practical one when hands or eyes are busy. When we analyze speech, we are technically analyzing a signal, which is the domain of signal processing. This technology has seen great advances in the past few years, driven by applications in industrial automation, and these developments have greatly benefited the emerging area of speech signal processing, bringing significant successes in industrial and biomedical applications.

Voice user interfaces (VUIs) use speech technology to give the user access to information, perform transactions, and support easy communication. These VUIs rely on automated speech recognition systems as their input. In this way we bypass the more common manual input methods (keyboard, mouse, etc.) and make interfaces available to people with severe disabilities. VUIs support human-computer dialogue, but they only cover a fraction of human conversation, focusing on certain aspects and on keyword recognition, depending on the application.
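As a minimal illustration of how a VUI can take ASR output as its input, the Python sketch below captures one utterance from a microphone and returns its transcript. It relies on the third-party SpeechRecognition package and a cloud recognizer purely as an example of an ASR front end; it is a sketch under those assumptions, not part of the systems discussed in this paper.

import speech_recognition as sr  # third-party SpeechRecognition package (microphone input also needs PyAudio)

recognizer = sr.Recognizer()

def listen_once():
    # Capture a single utterance from the default microphone and return its transcript (or None).
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source, duration=0.5)  # calibrate against background noise
        audio = recognizer.listen(source)
    try:
        # Any recognition back end could be used here; the Google Web Speech API is only an example.
        return recognizer.recognize_google(audio, language="en-US")
    except sr.UnknownValueError:
        return None  # speech was captured but could not be understood
    except sr.RequestError:
        return None  # the recognition service could not be reached

A VUI would then map the returned text onto the actions or keywords it supports.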
1.2. Voice biometrics

Voice biometrics belong to two industries: speech processing, described above, and biometric security. Biometrics-based technologies are applied most often in security, monitoring and fraud prevention, as they help to identify individuals and to distinguish one person from another. This is the real difference between biometrics and other security systems, automated or not. For instance, a card system can only check whether the card has expired or whether the password associated with the card is correct; on its own, it cannot verify that the person presenting the card is its legitimate holder. Biometric systems determine whether a biometric sample, in our case the voice, comes from a specific individual. To do so, the given sample is compared with a reference provided by the same individual beforehand, called a "reference voiceprint". Voice biometrics are also, at present, the only biometrics that process acoustic information, which allows them to work with standard telephone equipment; they therefore do not require any proprietary hardware such as a fingerprint sensor or an iris scanner.

2. State of the Art

2.1. Methodology of ASR systems

Speech recognition aims at deriving the sequence of speech sounds that best matches the input speech by using pattern recognition technology. Word recognition systems have been classified according to whether they recognize isolated or connected words, whether they are speaker-dependent or speaker-independent, and whether they operate over telephone lines or over a local wide-band channel.

Isolated-word systems require a pause between words and do not allow any pause within individual words. Typically, these systems require pauses of at least 100 ms so that they are not confused with stop consonants or other silences occurring inside words.

Connected-word systems are capable of recognizing individual words within a string of words. These strings can be made of naturally spoken words without any pause between them. Such systems tend to be more complicated than isolated-word recognizers but, on the other hand, they allow faster data entry: since no pause is required between words, they are easier to use.
Speaker-independent or 'universal' systems are designed to recognize any voice. They are generally less accurate than other systems because of their smaller vocabulary, and their recognition logic is more complex because of the wide range of voice spectra they have to handle. Such systems are the most desirable goal of speech recognition, but they come with strong constraints that make them very hard to integrate in every work area.

Speaker-dependent systems succeed in recognizing individual variants in pronunciation and thus solve the problem of the wide range of variations that affects speaker-independent systems. To make this possible, the user has to train the system by repeating all the words of the vocabulary one or more times before using it. This can take a long time to set up, but in the end the recognition is very accurate.

A basic ASR system operates in two distinct modes: training and recognition. During the training phase, a template generation module computes the possible variations in the utterances of the same word; these possibilities are represented as feature vectors and stored in a database. During the recognition phase, the input voice is turned into a new template by the same template generation module, and this new template is compared with the ones stored in the database to find the best match and determine which word was spoken.

Fig. 1. ASR operation modes.
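The sketch below illustrates this template-based training/recognition cycle in Python. It is only an illustration of the principle: MFCC feature matrices stand in for the stored templates, a plain dynamic-time-warping distance serves as the comparison, and the vocabulary and file names are placeholders rather than anything taken from the systems described here.

import numpy as np
import librosa

def mfcc_features(path, sr=16000, n_mfcc=13):
    # Turn one recorded utterance into its "template": an MFCC matrix of shape (frames, coefficients).
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def dtw_distance(a, b):
    # Plain dynamic-time-warping distance between two feature sequences.
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# Training phase: store one reference template per vocabulary word (file names are placeholders).
templates = {word: mfcc_features(word + ".wav") for word in ["start", "stop", "left", "right"]}

# Recognition phase: compare the unknown utterance with every stored template and keep the best match.
def recognize(path):
    query = mfcc_features(path)
    return min(templates, key=lambda word: dtw_distance(query, templates[word]))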


Embedded ASR systems have also been classified by their implementation: software, hardware, or combined hardware-software co-design. ASR systems have some key features that make them preferable in some application areas:
• Users do not need any kind of training to use ASR, because for most people speaking is an inherent skill that comes naturally.
• Speech is by far the fastest way of communicating; speaking is about 10 times faster than writing on paper.
• ASR systems allow users to multitask, because they leave the user's hands and/or eyes available for other tasks.
• The input devices, such as microphones or telephones, are very affordable.

2.2. Types of Voice Biometrics

There are two types of voice biometrics, the first one being speaker verification. With this method, we try to authenticate that a person is who he or she claims to be. Most of the time the user is asked for an ID code, a credit card number or some other identifier that helps recognize the user; this information is collected from each user in advance and stored.
Then, following the same procedure as in ASR systems, reference voiceprints are also stored in the database together with the matching ID code. When users identify themselves with their ID, the system retrieves the reference voiceprint for that person and performs the comparison.

Fig. 2. Speaker verification.
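A minimal sketch of that verification step is shown below, assuming the enrolled reference voiceprints and the test sample have already been turned into fixed-length embedding vectors by some front end; the cosine-similarity measure and the 0.75 threshold are illustrative assumptions, not values taken from this work.

import numpy as np

def cosine_similarity(a, b):
    # Similarity between two voiceprint embeddings, in [-1, 1].
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(claimed_id, test_voiceprint, enrolled, threshold=0.75):
    # Speaker verification: compare the sample only with the reference stored under the claimed ID.
    reference = enrolled[claimed_id]
    return cosine_similarity(reference, test_voiceprint) >= threshold

# enrolled = {"user42": reference_voiceprint_vector, ...}  # built during enrolment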
The second method is speaker identification, which assigns an identity to the voice of an unknown speaker. This time the system has no information other than the user's voice, and it faces the hard task of assigning the proper identity from the voice alone. In most cases this method is more difficult than the previous one.
The challenge with speaker identification is that the system relies only on the different voiceprint samples, from which it tries to build a reference, and there is nothing else to compare them to. Most of the time this technique is used when users are not even aware that they are being recorded.

Fig. 3. Speaker identification.

2.3. Speech recognition algorithms (SRA)

To provide new and useful software, technologies such as Alexa use different types of algorithms to recognize voices.

2.3.1. Hidden Markov Model

The Hidden Markov Model (HMM) algorithm is used to convert speech into text. One of the most common approaches combines the HMM with a detection algorithm used as a pre-processing step to remove unwanted noise. The audio is then divided, using Mel-Frequency Cepstral Coefficients (MFCC), into small segments that are interpreted and used as input by the HMM. The HMM compares each segment with a dictionary of word pronunciations, with one HMM per pronunciation in the dictionary, and determines the words in the segment with the help of a probabilistic language model. The signal can then be converted into text.
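As a rough illustration of this idea, the sketch below trains one Gaussian-emission HMM per vocabulary word on MFCC features and recognizes an utterance by picking the word whose model scores it best. It relies on the third-party hmmlearn and librosa packages, the file names and vocabulary are placeholders, and a real assistant such as Alexa works at the phoneme level with a full language model rather than with whole-word models.

import numpy as np
import librosa
from hmmlearn import hmm

def mfcc(path, sr=16000):
    # Pre-processing: MFCC feature frames for one utterance, shape (frames, 13).
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T

# Training: one HMM per vocabulary word, fitted on the MFCC frames of a few example recordings.
training_files = {"yes": ["yes_01.wav", "yes_02.wav"], "no": ["no_01.wav", "no_02.wav"]}
models = {}
for word, files in training_files.items():
    feats = [mfcc(f) for f in files]
    X, lengths = np.vstack(feats), [len(f) for f in feats]
    model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=50)
    model.fit(X, lengths)
    models[word] = model

# Decoding: the word whose HMM gives the highest log-likelihood for the utterance wins.
def recognize(path):
    X = mfcc(path)
    return max(models, key=lambda word: models[word].score(X))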
2.3.2. Gaussian Mixture Model

The Gaussian Mixture Model (GMM) is used in combination with the HMM: the HMM captures the temporal variations of the speech while the GMM captures the spectral variations, which makes time sequences easier to handle. "A sequence of GMMs are used to analyze the input data for the HMM, which gives it the sensitivity to temporal changes." A first drawback is that, with the GMM, the HMM assumes phonetic independence between the phonetic segments of a sentence; moreover, when the input cannot be recognized, the system simply selects the most common choice. The GMM is especially useful for speech recognition directed toward recognizing emotion in voices, and it has the further advantage of being very easy to use in a computing system. More recently, cluster analysis has also been adopted.
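To make the GMM's role concrete, the hedged sketch below fits one diagonal-covariance GMM per class, for instance per emotion as mentioned above, on MFCC frames and labels a new utterance with the class whose model gives the highest average log-likelihood. It uses scikit-learn, and the labelled feature matrices are assumed to exist already.

from sklearn.mixture import GaussianMixture

# Training: fit one GMM per class on the pooled MFCC frames of the labelled recordings.
# `features_by_class` maps a label (e.g. "neutral", "angry") to an array of shape (n_frames, n_mfcc);
# how it is built is left out, so it is an assumption of this sketch.
def train(features_by_class, n_components=8):
    return {label: GaussianMixture(n_components=n_components, covariance_type="diag").fit(X)
            for label, X in features_by_class.items()}

# Classification: score the frames of an unknown utterance under every class model
# and return the label with the highest average log-likelihood.
def classify(models, X):
    return max(models, key=lambda label: models[label].score(X))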
2.4. Applications of ASR Systems

2.4.1. Medical assistance

Medicine is a domain where speech recognition systems are really useful because so much of the work is done by hand. First, dictation systems can be used by physicians to reduce the need for transcription services and thus save time. Surgeons can also use this type of system to control medical machines while operating on patients, and this requires no more than about ten commands.
Patients suffering from a speech disorder (dysglossia) or a voice disorder (dysphonia) can also be helped by automatic recognition techniques, which support voice-evaluation tests and are a great help in detecting head and neck cancers.
Using HMMs with ASR systems can help doctors identify the influence of therapies by detecting changes in patients' speech. ASR systems are also used as virtual therapists; for instance, the VITHEA ASR system developed at the Spoken Language Systems Lab in Portugal has been used to treat aphasia.
2.4.2. Industrial robotics

Nowadays we have access to inexpensive yet powerful microprocessors and improved algorithms, which drives a lot of applications in computer command, data entry, speech-to-text, voice verification, etc. Voice-input computers are used in many industrial inspection applications, especially to feed data directly into a computer without keying or transcription.
Speech recognition technology is being successfully implemented on industrial robots and is becoming more and more accepted and adopted across industries. As the technology becomes more stable, more affordable and generally better over time, manufacturers buy robots more readily and perceive fewer risks. The autonomy of these systems has also improved considerably, so less operator intervention is needed, which reinforces the trend even further.

Fig. 4. Annual supply of industrial robots (worldwide).

This technology can be used for any type of robotics. We discussed industrial robots above, but it can also be applied to wheelchairs with special human-machine interfaces (HMI): voice input can improve wheelchair navigation and movement precision. An example is the Robchair, which can navigate in dynamic environments and in the presence of humans.
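As a purely illustrative sketch of how such a voice interface can sit on top of a machine, the fragment below maps a handful of recognized command words onto motion commands. The vocabulary, the velocity values and the (linear, angular) command format are assumptions made for the example, not details of any system mentioned above.

# A minimal command-dispatch layer for a voice-driven machine. The transcript comes from
# any ASR engine; only a small fixed vocabulary is acted on, everything else is ignored.
COMMANDS = {
    "forward": (0.5, 0.0),   # (linear velocity in m/s, angular velocity in rad/s) - illustrative values
    "back": (-0.5, 0.0),
    "left": (0.0, 0.5),
    "right": (0.0, -0.5),
    "stop": (0.0, 0.0),
}

def dispatch(transcript):
    # Return the motion command for the first known word in the transcript, or None to stay idle.
    for word in transcript.lower().split():
        if word in COMMANDS:
            return COMMANDS[word]
    return None  # unknown phrase: safer to ignore it than to guess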

2.4.3. Physical/site access

Speech recognition technologies also have many applications in controlling access to areas and in creating security checkpoints that protect personal data or prevent intrusion, and many companies now use them as one of their main sources of revenue. For example, the Girl Tech Inc. company uses speech recognition to improve privacy for young teenage girls. The goal is to make the technology more accessible to them, the main constraints being cost, size and ease of use. The company has built chip-based, text-dependent speaker verification into a door pass and a journal lock: the device recognizes the voice of the main user as well as a password, and if the voice is not recognized and the password is wrong, it starts an alarm to signal an intrusion.
There are other examples of voice-controlled door access, such as the US Immigration and Naturalization Service, which uses this technology during off hours at a gate between the US and Canada, or the city of Baltimore, where in the evenings and at weekends such a system controls access to the city's five main buildings.
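The access logic described for such a voice door lock can be summarized in a few lines. The sketch below is an illustration under stated assumptions (the similarity score, the threshold and the returned strings are invented for the example): the door opens for a matching voice or a correct password, and the alarm is raised only when both checks fail.

def check_access(voice_score, entered_password, stored_password, threshold=0.75):
    # Door-controller sketch: the voice factor passes when the speaker-verification score
    # reaches the threshold, otherwise the password is checked; the alarm is raised only
    # when both factors fail, matching the behaviour described above.
    voice_ok = voice_score >= threshold
    password_ok = entered_password == stored_password
    if voice_ok or password_ok:
        return "open"
    return "alarm"  # both checks failed: signal a possible intrusion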
