CDD 4.1 - SpeechRecognition
Handheld Devices
Author(s):
Aman Dayal
Introduction
As handheld devices become ubiquitous and the tasks performed become multipurpose in
nature, efficient data entry techniques are necessary.
Handheld devices such as mobile phones and personal digital assistants (PDAs) have
become a common part of everyday life. Mobile phones and PDAs are now used for a
diversity of tasks that require the entry of varying amounts of data. Existing data entry
techniques for mobile devices include:
Soft Keyboard
Gesture Recognition – Letter Recognizer & Block Recognizer
Small Keyboards – e.g. i-mate JASJAM, Nokia Communicator, BlackBerry
These input techniques are often slow and cumbersome, and they are especially tedious when
entering large amounts of data. As handheld devices become more popular, there will be a
need for much more efficient methods of data entry.
Speech Recognition
Speech recognition (in many contexts also known as 'automatic speech recognition',
'computer speech recognition' or, erroneously, 'voice recognition') is the process of
converting a speech signal to a sequence of words by means of an algorithm implemented
as a computer program.
Most speech recognition users would tend to agree that dictation machines can achieve
very high performance in controlled conditions. Part of the confusion comes mainly from
the mixed usage of the terms speech recognition and dictation.
Other limited-vocabulary systems that require no training can recognize a small number of
words (for instance, the ten digits) from most speakers. Such systems are popular for
routing incoming phone calls to their destinations in large organizations.
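The call-routing use of a limited vocabulary can be sketched as follows. This is a minimal illustration, not a description of any real product: the recognizer itself is assumed to have already produced a list of words, and the names DIGIT_WORDS, EXTENSIONS, words_to_extension and route_call, along with the extension numbers and departments, are hypothetical.

```python
# Hypothetical ten-word vocabulary: the spoken digits.
DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

# Hypothetical routing table mapping extensions to departments.
EXTENSIONS = {"100": "Sales", "200": "Support", "300": "Accounts"}


def words_to_extension(words):
    """Convert recognized digit words to an extension string.

    Returns None if any word falls outside the ten-digit vocabulary.
    """
    digits = []
    for word in words:
        digit = DIGIT_WORDS.get(word.lower())
        if digit is None:
            return None
        digits.append(digit)
    return "".join(digits)


def route_call(words):
    """Return the destination for a spoken extension, or fall back to the operator."""
    extension = words_to_extension(words)
    return EXTENSIONS.get(extension, "Operator")


print(route_call(["two", "zero", "zero"]))  # Support
print(route_call(["two", "oh", "zero"]))    # Operator (out-of-vocabulary word)
```

Because the vocabulary is so small, anything outside it can simply be rejected and handed to a fallback, which is part of why such systems work across most speakers without training.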
Today vs. Tomorrow
Some new hand-held devices already use speech-recognition technology to integrate e-mail,
telephone, pagers and fax machines with a speech-user interface, rather than a graphical or
keypad-based interface. Next-generation "smart-phone" applications and related wireless
services are now appearing that provide access to e-mail, Internet and corporate intranet
information from a cellular handset.
With a speech-equipped wireless Web device, rather than stopping to use a stylus, or hunt
through the letters/numbers on a mobile phone's keypad, users can gain instant access to
specific content by simply speaking into the phone and asking for it.
Hindrances
Speech recognition is already being used on handheld devices for short tasks such as
voice-activated dialling or directory assistance.
Speech recognition on handheld devices is being achieved in several different ways:
On the Device
The speech engine can run directly on the handheld device, but this method has several
shortcomings:
1. System training
2. Limited Vocabularies
3. Need for well-defined grammars to maintain adequate recognition accuracy
4. Latency
In the server-based method, the speech engine resides on the server. The recognition server
engine receives speech data across a network from the mobile device and returns
the resulting text for display and correction on the handheld device.
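The round trip described above can be sketched in a few lines. This is only an in-process stand-in under stated assumptions: RecognitionServer and HandheldClient are hypothetical names, the "engine" is a lookup table rather than a real recognizer, and the method call takes the place of a network request.

```python
class RecognitionServer:
    """Stand-in for the server-side speech engine (hypothetical)."""

    def __init__(self, known_utterances):
        # Maps raw audio payloads to text; a real engine would decode the audio.
        self._known = known_utterances

    def recognize(self, audio: bytes) -> str:
        return self._known.get(audio, "")


class HandheldClient:
    """Stand-in for the device: sends audio, displays text, accepts corrections."""

    def __init__(self, server):
        self._server = server
        self.display_text = ""

    def dictate(self, audio: bytes) -> str:
        # In a real deployment this call would cross the network.
        self.display_text = self._server.recognize(audio)
        return self.display_text

    def correct(self, wrong: str, right: str) -> str:
        # The user fixes a misrecognized word on the handheld display.
        self.display_text = self.display_text.replace(wrong, right)
        return self.display_text


server = RecognitionServer({b"\x01\x02": "meet at ate"})
client = HandheldClient(server)
print(client.dictate(b"\x01\x02"))     # meet at ate
print(client.correct("ate", "eight"))  # meet at eight
```

The device only captures audio, displays text and handles corrections, while the recognition work stays on the server.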
Advantages
Disadvantages
1. Network delays
2. Does not prove beneficial if vocabularies are small or grammars are well-defined.
Speaker Dependent
(http://www.pocketpcmag.com)
Speaker-dependent systems are trained for a single voice. The system is trained to
understand that speaker's pronunciations, inflections, and accents, and can run much more
efficiently and accurately because it is tailored to the speaker.
Speaker Independent
Speaker-independent programs are easier to learn and use, but they tend to be
larger and require more power.
The SALT Forum brings together a diverse group of companies sharing a common interest in
developing and promoting speech technologies for multimodal and telephony applications.
Founded in 2001 and representing over 70 technology leaders, the SALT Forum seeks to
establish and promote a royalty-free standard that provides spoken access to many forms of
content through a wide variety of devices.
In pursuit of these goals, Version 1.0 of the SALT specification was developed by Forum
members and contributed to the World Wide Web Consortium (W3C).
The Speech Application Language Tags (SALT) 1.0 specification enables multimodal and
telephony-enabled access to information, applications, and Web services from PCs,
telephones, tablet PCs, and wireless personal digital assistants (PDAs). The Speech
Application Language Tags extend existing mark-up languages such as HTML, XHTML, and
XML. Multimodal access will enable users to interact with an application in a variety of ways:
they will be able to input data using speech, a keyboard, keypad, mouse and/or stylus, and
produce data as synthesized speech, audio, plain text, motion video, and/or graphics. Each
of these modes will be able to be used independently or concurrently.
Several products now on the market will transcribe recordings that were made earlier.
In this report we review some of the most popular speech recognition software packages
on the market today, along with their capabilities and features (on handheld devices).
Features
Users can create a profile of their own voice for a mobile recording device, such as
a Pocket PC handheld. After recording their thoughts on the go, they can feed that
sound file to Dragon later for transcription.
2. IBM Embedded ViaVoice, Multiplatform Edition
(http://www-306.ibm.com)
Features
The PC Magazine article touts ViaVoice 4.4, released in January 2006, as a major
advance, and a landmark on IBM's five-year goal of achieving "super-human" speech
recognition; that is, computer speech recognition that equals and even surpasses
human capabilities.
Microsoft has released an update to its speech recognition software for Windows
Mobile devices. Voice Command version 1.6 lets users control virtually any aspect of
a Pocket PC or smartphone using simple, "intuitive" vocal commands, provides voice
prompts back to the user, and now supports smartphones, according to the
company.
Features
2. The other major new feature is spoken email notification. Microsoft says
that this capability can be set to state the subject and sender of incoming
email messages, and read the content of SMS messages as they arrive.
4. Fonix VoiceIn Standard Edition 4.1
Features
5. VITO Voice2Go
VITO Voice2Go is a voice recognition application for Pocket PC that allows managing the
Pocket PC with one’s own voice.
Features
A sound file can also be recorded externally and sent over the Internet or a WAN to the PHILIPS SpeechMagic system,
where it’s converted to text and can be edited.
Dragon NaturallySpeaking Preferred 9 saves the voice dictation as a file on the handheld device; when the device is
synchronized with the computer, the software on the computer transcribes the file.
Current speech-to-text programs are too large and require too much CPU power to be
practical for the Pocket PC. Speech recognition technology has improved to the point that
we now have reasonable command recognition programs available for the Pocket PC.
However, even when on-device speech-to-text becomes an option, one would expect server-based solutions to
continue to provide superior computing capabilities. As a result, there will continue to be a
trade-off for tasks involving large vocabularies where server-based solutions are expected to
allow recognition accuracy to be maintained at an acceptable level while reducing
computational delays at the expense of introducing network delays.
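The trade-off above comes down to simple arithmetic: the server wins only when its compute time plus the network delay is lower than the device's compute time alone. The sketch below illustrates this under stated assumptions; the function names and all timings are made up for illustration and are not measurements.

```python
def total_latency(compute_s, network_s=0.0):
    """Total response time in seconds: recognition compute plus any network delay."""
    return compute_s + network_s


def faster_option(device_compute_s, server_compute_s, network_s):
    """Return 'device' or 'server', whichever responds sooner (ties go to the device)."""
    device = total_latency(device_compute_s)
    server = total_latency(server_compute_s, network_s)
    return "server" if server < device else "device"


# Made-up timings, purely illustrative.
# Large vocabulary: slow on the device, so the network delay is worth paying.
print(faster_option(device_compute_s=6.0, server_compute_s=0.5, network_s=1.0))   # server
# Small vocabulary: the device is already fast, so the network delay dominates.
print(faster_option(device_compute_s=0.3, server_compute_s=0.05, network_s=1.0))  # device
```

The second case matches the earlier observation that server-based recognition does not prove beneficial when vocabularies are small or grammars are well-defined.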
Although longer tasks involving larger quantities of text may not be particularly common
using current technologies, ineffective text entry mechanisms are at least partially
responsible, and the development of new, more effective mechanisms for entering text on
mobile devices will significantly expand the possibilities for these devices.
A software development kit (SDK or "devkit") is typically a set of development tools that
allows a software engineer to create applications for a certain software package, software
framework, hardware platform, computer system, video game console, operating system or
similar.
The Fluent Speech SDK allows developers to easily implement and integrate speech
recognition into a wide range of consumer electronics products including convergence
devices like PDAs and smart phones, MP3 players, and navigation systems.
The SDK is available for devices using StrongARM processors running Microsoft's
Windows CE operating system or one of its variants such as Pocket PC or Auto PC, as
well as for Windows PCs. The SDK also includes libraries for the Windows Pocket PC
Emulator.
Not tested
The Mobile Conversay Software Development Kit (SDK) is a tool used to speech-enable
applications for Linux, eLinux and Pocket PC platforms. With the Mobile Conversay SDK,
developers can rapidly create speech-powered mobile applications. The Conversay
speech platform delivers a speaker-independent, continuous speech recognition engine.
Users don't have to "train" the system and may speak in a natural voice. Clear, robust,
text-to-speech capabilities allow the end-user to reliably access information.
Not Tested
Speech SDK for Windows CE and Windows Mobile – Speech Server Solution
VoiceIn Standard Edition (SE) 4.0.1 is based on Fonix's proprietary neural-network-based
technology. It provides accurate speaker-independent ASR in noisy environments,
according to Fonix. The new version now supports multiple channels in a wider array of
operating systems and development platforms, and also adds Italian language support.
The multi-channel feature allows developers to create applications that run "more
concurrent channels on a single platform than competitive offerings", according to
Fonix, thereby reducing the number of speech servers required for a given application.
Additionally, the new release adds support for programming in VB.NET.
VoiceLib™ SDK 2.0.0 for PalmOS
VoiceLib™ SDK 2.0 for PalmOS™ is a simple-to-use voice recognition SDK for the Palm
Operating System.
Linux and Windows®-based handheld devices are not supported right now.
Recent Developments
(http://www.geek.com)
Researchers in Hong Kong have created new speech recognition technology that is
capable of running on less processing power. The new technology, ASSF (Auditory
Spectrum-based Speech Feature), uses less processing power than the widely used
MFCC (Mel-Frequency Cepstral Coefficients) technology by "using more sophisticated
decision rules for dealing with the data it gathers about the wave forms." The decision
rules can then be run in memory, instead of being crunched by a powerful processor.
The future of this technology could lead to voice-controlled Web surfing on mobile
phones or PDAs, better speech recognition in noisy settings, and even voice-controlled
computer games.
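For background on the MFCC features mentioned above: they are computed on the mel scale, which spaces frequencies roughly the way listeners perceive pitch. The source gives no detail on either feature set, so the sketch below only shows one commonly used form of the mel conversion; the function names are illustrative, and nothing here describes how ASSF works.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in hertz to mels (a common form of the formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel: convert mels back to hertz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Equal steps in hertz shrink in mels as frequency rises.
for hz in (250, 1000, 4000):
    print(hz, round(hz_to_mel(hz), 1))
```

An MFCC front end applies a bank of filters spaced evenly on this scale before further processing, which is one source of the processing cost the researchers aim to reduce.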
As processors and memory have continued to grow in capacity and drop in price,
developers have used larger voice segments that make it easier to develop more
natural-sounding speech. At the same time, developers have broken new ground in the
ability to join these voice segments effectively to create a smoother, more natural-
sounding synthetic voice.
Conclusion