
Speech Recognition on Handheld Devices

Date Created: Thursday, 7 December 2006

Last Modified: 7/12/2006 10:35

Author(s):
Aman Dayal

Document Status: Draft


Table of Contents

Introduction
Current Input Methods in Handheld Devices
Speech Recognition
Today vs. Tomorrow
Speech Recognition on Handheld Devices
NLU – Natural Language Understanding
Interacting with the Device
Hindrances
How is Speech Recognition being achieved?
SALT – Speech Application Language Tags
Speech Recognition Software
Speech Recognition – The Future
Mobile Speech Recognition: Development
Recent Developments
Conclusion
Introduction

As handheld devices become ubiquitous and the tasks performed on them grow increasingly varied, efficient data entry techniques become necessary.

Current Input Methods in Handheld Devices

Handheld devices such as mobile phones and personal digital assistants (PDAs) have
become a common part of everyday life. Mobile phones and PDAs are now used for a
diversity of tasks that require the entry of varying amounts of data. Existing data entry
techniques for mobile devices include:

• Soft Keyboard
• Gesture Recognition – Letter Recognizer & Block Recognizer
• Small Keyboards – e.g. i-mate JASJAM, Nokia Communicator, BlackBerry

These input techniques are often slow and cumbersome, and become especially tedious when entering large amounts of data. As handheld devices grow more popular, there will be a need for much more efficient methods of data entry.

Speech Recognition

Speech recognition (in many contexts also known as automatic speech recognition, computer speech recognition, or, erroneously, voice recognition) is the process of converting a speech signal into a sequence of words by means of an algorithm implemented as a computer program.
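Although the report does not spell it out, most recognizers of this era are built on the statistical formulation below: the recognizer searches for the word sequence that is most probable given the observed acoustic signal. This is standard background, added here for context rather than taken from the original text.

$$ W^{*} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(A \mid W)\,P(W) $$

where $A$ is the acoustic signal, $P(A \mid W)$ is the acoustic model, and $P(W)$ is the language model.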

Most users of speech recognition would agree that dictation machines can achieve very high performance in controlled conditions. Part of the confusion around such claims comes from the mixed usage of the terms "speech recognition" and "dictation".

Speaker-dependent dictation systems requiring a short period of training can capture continuous speech with a large vocabulary at a normal pace with very high accuracy. Most commercial companies claim that recognition software can achieve between 98% and 99% accuracy (getting one to two words out of one hundred wrong) if operated under optimal conditions. These optimal conditions usually mean that the test subjects have:

1. Speaker characteristics that match the training data
2. Proper speaker adaptation
3. A clean environment (e.g. office space)

Other systems, which have limited vocabularies but require no training, can recognize a small number of words (for instance, the ten digits) from most speakers. Such systems are popular for routing incoming phone calls to their destinations in large organizations.

Today vs. Tomorrow

Current Voice Recognition Technology Features & Limitations

Future Speech Recognition Technology Features & Limitations

Speech Recognition on Handheld Devices


(http://findarticles.com/p/articles/mi_m0CMN/is_2_38/ai_70870344)

Some new hand-held devices already use speech-recognition technology to integrate e-mail, telephone, pagers, and fax machines with a speech-user interface, rather than a graphical or keypad-based interface. Next-generation "smart-phone" applications and related wireless services are now appearing that provide access to e-mail, Internet, and corporate intranet information from a cellular handset.

NLU – Natural Language Understanding

Developers are enhancing automatic speech-recognition and text-to-speech technology with natural language understanding (NLU). NLU is a next-generation speech technology that enables applications not only to recognize words but also to understand them contextually, and to read them back in a pleasant voice. This advance also represents the first step toward true natural-language dialogue systems that enable two-way conversations with computers, which will be essential for future Web searching with speech-enabled hand-held devices.

Interacting with the Device

New applications combine improved text-to-speech (TTS) and automatic speech-recognition (ASR) engines with natural-language processing. ASR converts the user's speech to a text sentence of distinct words. TTS converts a text sentence into computer-generated speech. Mediating between them, NLU enables the computer to understand what the user is saying. The combination of these technologies makes it possible for applications to interact with humans through spoken text, eliminating the need for pre-recorded voice files or manual input devices.
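To make the division of labour concrete, here is a minimal Python sketch of the ASR → NLU → TTS loop described above. All three "engines" are toy stand-ins invented for illustration, not any vendor's API; real engines (ViaVoice, Conversay, etc.) expose analogous calls.

```python
# A minimal sketch of the ASR -> NLU -> TTS loop described above.
# All three "engines" are toy stand-ins, not any vendor's API.

def asr_transcribe(audio: bytes) -> str:
    """ASR: convert a speech signal into a text sentence (stubbed here)."""
    return "when is my next appointment"   # pretend recognition result

def nlu_interpret(text: str) -> dict:
    """NLU: map recognized words to an intent the application can act on."""
    if "appointment" in text:
        return {"action": "query_calendar"}
    return {"action": "unknown"}

def tts_speak(text: str) -> None:
    """TTS: render a text sentence as speech (printed here for illustration)."""
    print(f"[synthesized voice] {text}")

def handle_utterance(audio: bytes) -> None:
    text = asr_transcribe(audio)            # speech  -> text
    intent = nlu_interpret(text)            # text    -> meaning
    if intent["action"] == "query_calendar":
        tts_speak("Your next appointment is at 3 pm.")   # meaning -> speech
    else:
        tts_speak("Sorry, I did not understand that.")

handle_utterance(b"\x00\x01\x02")  # raw audio would come from the microphone
```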

Speech-recognition and text-to-speech engines residing remotely on servers enable users to ask questions and hear instant answers in a pleasant, human-sounding voice.

With a speech-equipped wireless Web device, rather than stopping to use a stylus or hunting through the letters and numbers on a mobile phone's keypad, users can gain instant access to specific content simply by speaking into the phone and asking for it.

Hindrances

Sophisticated natural language applications such as handheld speech-to-speech translation require fast and lightweight speech recognition. Several technical challenges have hindered the deployment of such applications on embedded devices:

1. The small physical size of the devices, together with the need to minimize power consumption, leads to compromises in the hardware.
2. Operating system software further restricts their capabilities below what one might assume from their raw CPU speed. For example, embedded CPUs typically lack hardware support for floating-point arithmetic (a common workaround is sketched below).
3. Memory, storage capacity, and bandwidth on embedded devices are also very limited.
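Since the list above flags the lack of floating-point hardware, here is a minimal Python sketch of Q15 fixed-point arithmetic, the kind of integer-only workaround commonly used in embedded signal processing. It illustrates the general technique and is not taken from any of the products discussed in this report.

```python
# Q15 fixed-point arithmetic: real values in [-1, 1) stored as 16-bit
# integers scaled by 2**15, so all math stays integer-only (no FPU needed).

Q = 15  # number of fractional bits

def to_q15(x: float) -> int:
    """Encode a float in [-1, 1) as a Q15 integer (illustrative helper)."""
    return max(-(1 << Q), min((1 << Q) - 1, round(x * (1 << Q))))

def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 values; the raw product has 30 fractional bits,
    so shift right by 15 to renormalize back to Q15."""
    return (a * b) >> Q

def from_q15(x: int) -> float:
    return x / (1 << Q)

a, b = to_q15(0.5), to_q15(-0.25)
print(from_q15(q15_mul(a, b)))   # -> -0.125, computed with integers only
```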

How is Speech Recognition being achieved?

Speech recognition is already being used on handheld devices for short tasks such as voice-activated dialling or directory assistance.

Speech recognition on handheld devices is being achieved in several different ways:

• On the Device

The speech engine can run directly on the handheld device, but this method has several shortcomings:

1. The need for system training
2. Limited vocabularies
3. Reliance on well-defined grammars to maintain adequate recognition accuracy
4. Latency

For example, in voice-activated dialling, recognition accuracy is improved through system training by the user and the use of a limited vocabulary. On-device recognition is mostly used for speed dialling and directory look-up, as the grammar-matching sketch below illustrates.
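As an illustration of how a small, well-defined grammar keeps on-device recognition tractable, here is a hypothetical Python sketch that matches a recognizer's (stubbed) output against a fixed command list. The command set and the string-similarity scoring are invented for illustration; a real engine scores acoustic models, not strings.

```python
import difflib
from typing import Optional

# A limited vocabulary / well-defined grammar: the recognizer only has to
# choose among a handful of commands, which keeps accuracy manageable on a
# resource-constrained device. The command list below is invented.
COMMANDS = [
    "call home",
    "call office",
    "call voicemail",
    "redial",
    "look up contact",
]

def match_command(hypothesis: str, cutoff: float = 0.6) -> Optional[str]:
    """Return the closest in-grammar command, or None if nothing is close.
    (A real engine scores acoustic models; string similarity stands in here.)"""
    matches = difflib.get_close_matches(hypothesis.lower(), COMMANDS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(match_command("call hom"))      # -> "call home"
print(match_command("open browser"))  # -> None (out of grammar)
```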

• Server-Based Speech Recognition

In this method, the speech engine resides on a server. The recognition engine receives speech data across a network from the mobile device and returns the resulting text for display and correction on the handheld device (a minimal sketch of this exchange follows the lists below).

Advantages

1. Maintains recognition accuracy rates
2. Reduces processing delays
3. Is promising for large-vocabulary, free-form tasks such as completing calendar entries or composing short replies to e-mail

Disadvantages

1. Network delays
2. Offers little benefit if vocabularies are small or grammars are well-defined
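The following hypothetical Python sketch shows the shape of the exchange: the handheld client posts captured audio to a recognition server and displays the text that comes back. The URL, endpoint, and content type are invented placeholders; real deployments of the period used proprietary protocols.

```python
import urllib.request

# Hypothetical endpoint; real server-based recognizers used proprietary
# protocols, but the request/response shape is the same.
SERVER_URL = "http://speech.example.com/recognize"

def recognize_remotely(audio: bytes) -> str:
    """Send captured audio to the recognition server; get text back."""
    request = urllib.request.Request(
        SERVER_URL,
        data=audio,                                   # network delay happens here
        headers={"Content-Type": "application/octet-stream"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")        # text for display/correction

# audio = capture_from_microphone()   # device-specific capture, omitted
# print(recognize_remotely(audio))
```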

• Speaker Dependent
(http://www.pocketpcmag.com)

Speaker-dependent programs tend to be smaller, faster, and more accurate than speaker-independent programs. However, they also take more time to set up, because the user has to train the program to recognize his or her voice patterns.

Speaker-dependent systems are trained for a single voice. The system learns that speaker's pronunciations, inflections, and accent, and can run much more efficiently and accurately because it is tailored to the speaker.

• Speaker Independent

Speaker-independent programs are easier to learn and use, but they tend to be larger and require more processing power.

Speaker-independent systems are designed to deal with anyone, as long as the speaker is using English. To achieve this, scientists had to work out which parts of speech are generic and which vary from person to person.

SALT – Speech Application Language Tags
(http://www.pocketpcmag.com)

The SALT Forum brings together a diverse group of companies sharing a common interest in
developing and promoting speech technologies for multimodal and telephony applications.
Founded in 2001 and representing over 70 technology leaders, the SALT Forum seeks to
establish and promote a royalty-free standard that provides spoken access to many forms of
content through a wide variety of devices.
In pursuit of these goals, Version 1.0 of the SALT specification was developed by Forum
members and contributed to the World Wide Web Consortium (W3C).

The Speech Application Language Tags (SALT) 1.0 specification enables multimodal and telephony-enabled access to information, applications, and Web services from PCs, telephones, tablet PCs, and wireless personal digital assistants (PDAs). SALT extends existing markup languages such as HTML, XHTML, and XML. Multimodal access will enable users to interact with an application in a variety of ways: they will be able to input data using speech, a keyboard, keypad, mouse, and/or stylus, and produce data as synthesized speech, audio, plain text, motion video, and/or graphics. Each of these modes can be used independently or concurrently.

Speech Recognition Software

Several products are now available on the market that can transcribe recordings made earlier. In this report we review some of the most popular speech recognition software packages on the market today, and their capabilities and features on handheld devices.

1. Dragon NaturallySpeaking Preferred 9

Dragon NaturallySpeaking Version 9 claims an accuracy of 99% and can convert speech into text on a PC at up to 120 words per minute. The software also lets you dictate into any Nuance-certified handheld device for automatic transcription when you synch with your PC.

• Features

Users can create a profile of their own voice for a mobile recording device, such as a Pocket PC handheld. After recording their thoughts on the go, they can later feed the sound file to Dragon for transcription.

2. IBM Embedded ViaVoice, Multiplatform Edition
(http://www-306.ibm.com)

IBM Embedded ViaVoice, Multiplatform Edition delivers IBM speech technology to mobile devices such as smart phones, handheld personal digital assistants (PDAs), and automobile components, giving you the power to develop solutions with voice access to information from work, home, school, or while travelling.

• Features

Embedded ViaVoice supports a variety of real-time operating systems (RTOS) and microprocessors. Embedded device applications can use IBM speech technology in its two basic forms:

1. Command and control (C&C) is a form of automatic speech recognition (ASR) that uses human speech to input commands into a mobile device. For example, a PDA's commands might be "What are my appointments for today?" or "When is my next appointment?"

2. Text-to-speech (TTS) uses synthesized human speech to output text and other information from a mobile device. IBM TTS can output most words in its supported languages (a brief usage sketch follows below).
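To show what the TTS form looks like from application code, here is a short Python sketch using the open-source pyttsx3 library as a stand-in. It is not IBM's embedded API, just an analogous off-the-shelf engine used to illustrate the idea.

```python
# TTS from application code, using the open-source pyttsx3 engine as a
# stand-in for an embedded TTS component (this is not IBM's API).
import pyttsx3

engine = pyttsx3.init()                 # pick the platform's default voice
engine.setProperty("rate", 150)         # speaking rate in words per minute
engine.say("You have two appointments today.")
engine.runAndWait()                     # block until speech has been rendered
```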

A PC Magazine article touts ViaVoice 4.4, released in January 2006, as a major advance and a landmark on IBM's five-year path toward "super-human" speech recognition; that is, computer speech recognition that equals or even surpasses human capabilities.

3. Microsoft Voice Command for Pocket PC

Microsoft has released an update to its speech recognition software for Windows
Mobile devices. Voice Command version 1.6 lets users control virtually any aspect of
a Pocket PC or smartphone using simple, "intuitive" vocal commands, provides voice
prompts back to the user, and now supports smartphones, according to the
company.

• Features

1. A new capability in Version 1.6 of Voice Command is support for Bluetooth headsets. This feature can be configured to accept commands from a headset, and to announce incoming phone calls and email headers through the headset, Microsoft says. It reportedly works with some, but not all, hands-free car kits.

2. The other major new feature is spoken email notification. Microsoft says
that this capability can be set to state the subject and sender of incoming
email messages, and read the content of SMS messages as they arrive.

4. Fonix VoiceIn Standard Edition 4.1

Fonix Speech, Inc., a wholly-owned subsidiary of Fonix Corporation specializing in embedded speech interfaces for mobile devices, handheld electronic products, systems, and processors, has released version 4.1 of its Fonix VoiceIn Standard Edition automatic speech recognition technology.

Fonix VoiceIn® SE 4.0.1 speech recognition technology gives developers an easy-to-use tool for developing voice interfaces for embedded products and applications, as well as for PC and server-based systems.

• Features

1. "Speech-in" capabilities on products and devices give users an easy, safe way to access information and operate device functions without pressing buttons, scrolling through menus, looking at screens, or typing.

2. The release includes improved recognition rates, particularly in noise-saturated environments, and the new Speech Analysis Module, which enables developers to create applications that give end users feedback to improve their pronunciation of foreign words and phrases.

3. The Speech Analysis Module is available in US and UK English, Canadian and European French, Castilian and Latin American Spanish, German, Japanese, Swedish, Italian, and Korean. Fonix VoiceIn 4.1 targets a variety of speech recognition applications available from OEMs, system integrators, and application developers.

5. VITO Voice2Go

VITO Voice2Go is a voice recognition application for Pocket PC that lets users control their Pocket PC with their own voice.

• Features

1. Starting and quitting applications
2. Calling your contacts and hanging up the phone
3. Modifying system settings
4. Voice2Go also features macro recording. Users can record any actions they perform with the stylus (button presses, screen and menu taps, etc.) and perform them later by voice. Even if a setting or program is missing from Voice2Go's list of actions, one can always create a new entry.

A sound file can also be recorded externally and sent over the Internet or a WAN to the Philips SpeechMagic system, where it is converted to text and can be edited.

Dragon NaturallySpeaking Preferred 9 takes the approach of saving the voice dictation as a file on the handheld device; when the device is synchronized with the computer, the file is transcribed by the software on the computer.

Speech Recognition – The Future


(http://www.leaonline.com/doi/pdfplus/10.1207/s15327590ijhc1903_1)

Current speech-to-text programs are too large and require too much CPU power to be
practical for the Pocket PC. Speech recognition technology has improved to the point that
we now have reasonable command recognition programs available for the Pocket PC.

As mobile technology advances and speech recognition improves, it may become possible to support large-vocabulary speech recognition directly on handheld devices and still obtain an acceptable level of recognition accuracy.

However, even when this becomes an option, one would expect server-based solutions to continue to provide superior computing capabilities. As a result, a trade-off will remain for tasks involving large vocabularies: server-based solutions are expected to maintain recognition accuracy at an acceptable level and reduce computational delays, at the expense of introducing network delays.

Developing more effective text entry solutions, with the goal of supporting tasks involving larger quantities of text, therefore remains important. Although such longer tasks may not be particularly common with current technologies, ineffective text entry mechanisms are at least partially responsible for that, and the development of new, more effective mechanisms for entering text on mobile devices will significantly expand the possibilities for these devices.

Mobile Speech Recognition: Development

A software development kit (SDK or "devkit") is typically a set of development tools that
allows a software engineer to create applications for a certain software package, software
framework, hardware platform, computer system, video game console, operating system or
similar.
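None of the period SDKs listed below are easy to try today. Purely as a modern, open-source analogue of what speech-enabling an application with an SDK looks like, here is a short Python sketch using the CMU pocketsphinx package; it is not one of the SDKs reviewed in this report, and the import shown follows the older 0.1.x-style Python bindings.

```python
# A modern open-source analogue of an embedded speech SDK, shown only to
# illustrate the developer experience; this is NOT one of the SDKs below.
# (API as exposed by the older pocketsphinx 0.1.x Python bindings.)
from pocketsphinx import LiveSpeech

# Continuously decode speech from the default microphone using the
# bundled US-English acoustic and language models.
for phrase in LiveSpeech():
    print(phrase)   # each recognized utterance, as text
```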

• Fluent Speech Software Development Kit (SDK)
(http://www.embeddedstar.com/press/content/2002/9/embedded5235.html)

The Fluent Speech SDK allows developers to easily implement and integrate speech
recognition into a wide range of consumer electronics products including convergence
devices like PDAs and smart phones, MP3 players, and navigation systems.

The SDK is available for devices using StrongARM processors running Microsoft's
Windows CE operating system or one of its variants such as Pocket PC or Auto PC, as
well as for Windows PCs. The SDK also includes libraries for the Windows Pocket PC
Emulator.

Not tested

• Mobile Conversay™ SDK
(http://www.conversay.com/Products/Embedded/MCSDK.asp)

The Mobile Conversay Software Development Kit (SDK) is a tool used to speech-enable
applications for Linux, eLinux and Pocket PC platforms. With the Mobile Conversay SDK,
developers can rapidly create speech-powered mobile applications. The Conversay
speech platform delivers a speaker-independent, continuous speech recognition engine.
Users don't have to "train" the system and may speak in a natural voice. Clear, robust,
text-to-speech capabilities allow the end-user to reliably access information.

Not tested

• Speech SDK for Windows CE and Windows Mobile – Speech Server Solution

VoiceIn Standard Edition (SE) 4.0.1 is based on Fonix's proprietary neural network-based technology. It provides accurate speaker-independent ASR in noisy environments, according to Fonix. The new version supports multiple channels on a wider array of operating systems and development platforms, and also adds Italian language support.

The multi-channel feature allows developers to create applications that run "more
concurrent channels on a single platform than competitive offerings", according to
Fonix, thereby reducing the number of speech servers required for a given application.
Additionally, the new release adds support for programming in VB.NET.

• VoiceLib™ SDK 2.0.0 for PalmOS

VoiceLib™ SDK 2.0 for PalmOS™ is a simple-to-use voice recognition SDK for the Palm Operating System.

• Embedded ViaVoice Multi-Application SDK

Embedded ViaVoice Multi-Application SDK is a software development kit (SDK) for embedded speech that allows multiple applications simultaneous access to Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) resources on an embedded device.

Linux and Windows®-based handheld devices are not supported at present.

Recent Developments
(http://www.geek.com)

• Researchers in Hong Kong have created new speech recognition technology that is capable of running on less processing power. The new technology, ASSF (Auditory Spectrum-based Speech Feature), uses less processing power than the widely used MFCC (Mel-Frequency Cepstral Coefficient) approach by "using more sophisticated decision rules for dealing with the data it gathers about the wave forms." The decision rules can then be run in memory, instead of being crunched by a powerful processor.

The future of this technology could lead to voice-controlled Web surfing on mobile
phones or PDAs, better speech recognition in noisy settings, and even voice-controlled
computer games.
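For context on what the conventional MFCC front end actually computes, here is a compact NumPy sketch of the standard steps (power spectrum, mel filterbank, log, DCT-II). The frame length, filter count, and sample rate are illustrative defaults, not values taken from the ASSF research described above.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sample_rate=8000, n_fft=512, n_filters=26, n_coeffs=13):
    """MFCCs for one windowed frame: power spectrum -> mel filterbank ->
    log -> DCT. Parameter values are common defaults, chosen for illustration."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2          # power spectrum
    # Triangular filters spaced evenly on the mel scale
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    log_energies = np.log(fbank @ power + 1e-10)            # log filterbank energies
    # DCT-II decorrelates the log energies; keep the lowest n_coeffs
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_filters))
    return dct @ log_energies

frame = np.hamming(200) * np.random.randn(200)   # stand-in for 25 ms at 8 kHz
print(mfcc(frame).shape)                         # -> (13,)
```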

• As processors and memory have continued to grow in capacity and drop in price,
developers have used larger voice segments that make it easier to develop more
natural-sounding speech. At the same time, developers have broken new ground in the
ability to join these voice segments effectively to create a smoother, more natural-
sounding synthetic voice.

• The newest synthesizers, combined with new ASR (automatic speech-recognition) technology, enable the computer to generate any question necessary to clarify spoken input.

Conclusion

Existing input methods on handheld devices are slow and cumbersome, and speech recognition offers a promising alternative for tasks ranging from voice-activated dialling to free-form dictation. On-device engines remain constrained by limited hardware, while server-based recognition trades network delays for accuracy on large-vocabulary tasks. With standards such as SALT, maturing commercial products, and SDKs that let developers speech-enable their own applications, more effective speech-driven text entry is likely to significantly expand what handheld devices can do.
