
WEB-SPEAK: A CUSTOMIZABLE SPEECH-BASED WEB NAVIGATION INTERFACE FOR PEOPLE WITH DISABILITIES USING ARTIFICIAL INTELLIGENCE

Foad Hamidi, CanAssist, University of Victoria, Victoria, BC, Canada (foad@canassist.ca)
Leo Spalteholz, Department of Electrical and Computer Engineering, University of Victoria, BC, Canada (leos@ece.uvic.ca)
Nigel Livingston, CanAssist, University of Victoria, BC, Canada (njl@uvic.ca)

1. Introduction

Before the era of the Web, people used specific software for each task, such as an e-mail client to read e-mail and an FTP client to download files. The Web, however, has become an important medium for delivering information. More and more people rely on it for work and entertainment, such as checking e-mail, reading news, watching videos, listening to music and shopping. Users can even run a free operating system on the Web. [14,15]

In our original proposal, only "speech" was considered as our input modality. However, new ideas came out when I implemented the system and tried to provide an interface for the speech recognizer. Two more input modalities, (mouse) gesture and keyboard input, were implemented and integrated into the Webnnel System. The input modalities of our final project therefore include speech commands, (mouse) gesture commands and keyboard input. Figure 1 illustrates the idea.

In the following sections of this report, I describe related work in section 2. The system design and the system implementation are explained in sections 3 and 4 respectively. The user study is presented in section 5, followed by a short discussion in section 6. Finally, collaboration information and references are given in sections 7 and 8.

2. Related Work

With the success of Web technologies, people have become familiar with using the Web and have started to transfer similar experiences to other domains. In our final project, we envision a future application that uses a Web browser on a big-screen TV in the home environment. We design a Web navigation system that represents web sites as TV channels and allows users to use multiple modalities to navigate them, and even to control and change the content of the web sites. In short, the user can use the input modalities to send a request to the system, and the system responds correspondingly.

Information display in a TV channel format can be seen in some applications. YouTube uses a frame list and Flash animation to display video clips. Joost [6] and Mogulus [8] use a grid-like arrangement to display live TV clips with multiple small screens. On mobile devices, Avot mV [1] uses a similar display to provide video search. The idea of presenting web sites in a TV channel format is inspired by these applications, because we think it could save users the time of typing in URL addresses and provide more natural interaction and a better Web browsing experience.
However, to the best of our knowledge, we have not seen a similar system that proposes displaying web sites as TV channels for users to access in the home environment.

Web automation and customization is a research topic in web content access. Chickenfoot [3] and Greasemonkey [5] are two web scripting frameworks that allow users to write their own scripts to customize web pages. Programmers can write JavaScript programs, which are applied to HTML or XHTML web pages, to access web content dynamically. In Chickenfoot, the user can even record a series of actions, set scripts as triggers for specific conditions, or package a script as a Firefox extension. However, neither Chickenfoot nor Greasemonkey provides an easy-to-use interface for third-party applications to utilize user-written or existing functions. Furthermore, they do not provide a standalone library for users to include in application development. Accessmonkey [2] is another scripting framework that allows multiple users, including web users, web developers and web researchers, to collaboratively write scripts to enhance web page accessibility. Unfortunately, it does not provide an interface for the user to customize a web page using natural interaction, such as speech or gesture.

Speech recognition has a long history in research. Using speech to invoke web content access is helpful to people with disabilities. The Microsoft Windows Vista Speech Recognition system [7] provides a platform for users to control Windows applications, such as using voice commands to start a new program, switch between applications and control the operating system. However, it does not have much flexibility in web browsing and web content manipulation. The CMU Sphinx-4 Java-based speech recognition system [4] is another well-known speech recognition engine, but it has too many configuration settings to consider, and its recognition accuracy depends heavily on the language model. Hand gesture recognition is another interesting research topic. Intel released the Open Computer Vision (OpenCV) libraries [11] for application development in vision recognition, and the WATSON project works on real-time head tracking and gesture recognition [16]. Instead of using hand gesture recognition, the Mouse Gestures [10] extension provides an interesting way to recognize a mouse gesture.

3. System Design

The goal of the Webnnel System is to have a flexible architecture that allows multiple input modalities to manipulate web content in the home environment. Based on this idea, the system design is separated into "web content access" and "input modalities." The "web content access" part is further divided into content manipulation and content aggregation and presentation. In this section, I explain the details of my design ideas.

3.1 System Architecture

Conceptually, the Webnnel system architecture contains two main parts: the Webnnel Command System and the Input Modalities, which include Speech Command Input, Keyboard Command Input and (Mouse) Gesture Command Input. Not only does the Webnnel Command System manipulate the web content, it also provides an easy-to-use interface to the input modalities. There are two benefits to this design. First, the Webnnel system is flexible enough to integrate more input modalities. Second, the system designer can focus on solving the recognition issues of the engine in each input modality, such as speech recognition or hand gesture recognition, and easily integrate the engine to manipulate the web content. In the current design, the Webnnel system consists of:

1. Webnnel Command System
2. Speech Command Extraction (SCE) System
3. Keyboard Command System
4. (Mouse) Gesture Command System

3.2 Webnnel Command System
There are four components in the Webnnel Command System: (1) the Webnnel Command Interface (WCI); (2) the Command Abstraction Interface (CAI); (3) Channel Aggregation and Presentation (CAP); and (4) the Content Manipulation Module (CMM). Because I use the Firefox web browser as my platform, the Webnnel Command System is implemented as a Firefox extension. The output of the Webnnel Command System is a customized channel presentation of the web content. The relationship between the input modalities, the Webnnel Command System, the Firefox browser and the customized output presentation is illustrated in Figure 2.

Figure 2: The system architecture of the Webnnel System

• Webnnel Command Interface (WCI): defines an interface for the input modalities to send request commands.
• Command Abstraction Interface (CAI): defines high-level APIs for the WCI to access the internal functions, such as myEmail() for the "my email" command.
• Channel Aggregation and Presentation (CAP): provides different templates to render web site snapshots, and renders customized web content with appropriate UI support.
• Content Manipulation Module (CMM): defines functions for specific purposes for the CAI, such as image detection and going to a web site automatically; defines functions for the CAP to render content, such as taking snapshots. Inside the CMM, I also design "Webnnel Utilities," a set of functional tools for the CMM to use.

3.3 Speech Command Extraction (SCE) System

Using speech to access or control web content is useful to people with disabilities or impaired hand motor abilities. Moreover, for general users, natural language is the most natural way to interact with people, and we assume it is also applicable to interacting with the system. We surveyed existing speech recognition systems and trained them on our defined commands. The basic design idea is to take the output of the speech recognition engine and enter that output into the Webnnel Command System.

3.4 (Mouse) Gesture Recognition System

The design idea of the (Mouse) Gesture Recognition System is that the user can use a mouse or a stylus to gesture a command to the Webnnel Command System. An extension of this design is to use a remote controller to gesture the command on the screen or on the wall to send the request. Because I did not have enough time to design my own (mouse) gesture recognition engine, I surveyed and hacked an existing system to extract the functionalities the Webnnel Command System might need. The detailed implementation is explained in section 4.3.

4. System Implementation

In this section, I explain the details of the implementation, including the technologies and hackings involved, and show the results of using the different input modalities.

4.1 Webnnel Command System

As mentioned in section 3, I developed the Webnnel Command System as a Firefox extension. The technologies I use to implement it are HTML/XHTML, JavaScript, the XML User Interface Language (XUL), Cascading Style Sheets (CSS) and Firefox extension development knowledge. The extension structure and its explanation are illustrated in Table 1.
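The command routing described in section 3.2, where the CAI maps a command string such as "my email" to a high-level function such as myEmail(), can be sketched as follows. This is a minimal illustration, not the extension's actual code: the handler table, the normalization step and every name except myEmail() and cmdParser() are hypothetical.

```javascript
// Sketch of CAI-style command routing. myEmail() is the example named in
// section 3.2, and cmdParser(cmd) is the parser referenced in Figure 3;
// the handler table and the return strings are illustrative assumptions.
var handlers = {
  "my email": function myEmail() { return "open e-mail channel"; },
  "show tag": function showTag() { return "reveal number tags"; }
};

function cmdParser(cmd) {
  var key = cmd.trim().toLowerCase(); // normalize the raw textbox input
  var handler = handlers[key];
  if (handler) {
    return handler();                 // dispatch to the matching CAI function
  }
  return "unknown command: " + key;   // unmatched input falls through
}
```

In the real extension, cmdParser() is invoked from the input textbox's "enter" listener (Figure 3), so every modality that can type into the textbox shares this single routing path.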
One key idea of the Webnnel Command System is to put an input textbox on the user interface and always set the focus on that textbox when the browser reloads any web page. This allows any third-party application to rely on the focused textbox to enter commands. An event listener is then attached to catch any "enter" keypress on that textbox. All entered commands are parsed inside the CAI module and redirected to the corresponding functions. Figure 3 shows the snippet of code I use to implement this idea. (The user can press "F2" to show or hide the input textbox.)

Another interesting idea is how to identify a link and click it without using the mouse. To identify the link, I allow the user to enter a substring of the link, such as "trick" in "A Cool Trick for Solar Cells." However, for some input modalities, like speech, a substring is not always precise enough. To solve this problem, when a new page is loaded, the Webnnel Command System uses the existing Document Object Model (DOM) get-element functions to extract all the <A> nodes. For each <A> node, I create a new <SPAN> node with the class name "numTag", embed the number information inside the <SPAN> node, and append this node as a child of the <A> node. After embedding a <SPAN> node, its display style is set to hidden. (Figure 4) Later, when the user enters the "show tag" command, all the embedded <SPAN> nodes are shown. The idea of attaching a number tag to a link is inspired by "Mouseless Browsing" [9]. After the number tags are shown, the user can command "click X" (where X is the number) to select the link.

As for clicking a link without using the mouse, I implement it with a JavaScript regular expression that checks whether any <A> link contains the given substring. Because I embed a <SPAN> node inside each <A> node to allow the user to identify the link by number, <SPAN> nodes with that class name also need to be checked during the process. (Figure 5) One important technical problem to highlight is that an HTML web page might contain multiple frames, and each frame is another complete HTML document. To identify the correct link, you need to get all the documents first and traverse all of them to check all the links. In the Webnnel Command System, I define more than 20 commands, which fall into 3 categories: Navigation, Content Access and Macro. All the commands and their corresponding purposes are organized in Table 2.

4.2 Speech Command Extraction (SCE) System

When we submitted the final project proposal, we planned to use the CMU Sphinx-4 Java-based speech recognition system as our speech recognition engine. However, after more than one week's work, we failed to train the system to recognize our predefined speech commands. There are at least two possible reasons: (1) there are too many configuration parameters to consider, and we couldn't figure out which settings were best for our hardware; (2) even after tuning a custom language model and grammar, we still had poor recognition accuracy. It was too complicated to utilize Sphinx.

After getting our TA's (Chih-Yu's) suggestion, we switched to testing the Mac Speech Recognizer. In our testing, we found that the Mac Speech Recognizer not only let us customize (add/delete) the speech commands, it also provided a better recognition rate, without advance training, on our defined commands.
However, the Mac Speech Recognizer only works in the Mac environment, and we do not have any hacking solution at this moment to make it cross-platform, such as making it work on Windows.

In general, we create speech commands corresponding to the commands of the Webnnel Command System listed in Table 2. However, natural language is better than a command language when the user uses speech as an input modality, so in the implementation we extend the commands into natural-language-like commands. For example, the "web channel" command can be spoken as "go to web channel", "please go to web channel" or "switch to web channel", and the speech recognition should work in these cases as well. (Figure 11)

After the user speaks a natural language command, the Mac Speech Recognizer (NSSpeechRecognizer) matches it in real time against the words and sentences given in the corpus of this language model, and the corresponding AppleScript is called. The AppleScript lists the detailed actions we want to execute, such as typing the "channel 5" command into the Webnnel Command System and pressing the "enter" button. (Figure 12)

Figure 3: The code snippet that keeps the focus on the input textbox and parses entered commands.

    var textbox = document.getElementById("webnnel-toolbar-command");
    textbox.addEventListener('keydown', function (evt) {
        if (evt.keyCode == 13) { // keyCode 13 is the enter key
            var command = document.getElementById("webnnel-toolbar-command");
            cmdParser(command.value); // cmdParser(cmd) is a command parser
            command.value = "";
        }
    }, true);

    var command = document.getElementById("webnnel-toolbar-command");
    command.focus();

Because the Mac Speech Recognizer is not able to access Firefox's internal chrome window resources, the input textbox design of the Webnnel Command System provides a good solution for it to access and control the web content. We did not notice this problem until we were in the implementation phase. After adding the input textbox to the Webnnel Command System, we were surprised to find that it also works for other kinds of third-party applications. The executed result of the Speech Command Extraction (SCE) System is illustrated in Figure 13.

4.3 (Mouse) Gesture Command System

Gesture recognition was not listed in our original proposal. However, after implementing the Webnnel Command System, I noticed the flexibility of integrating new input modalities. I use the Mouse Gestures [10] Firefox extension as the (mouse) gesture recognition engine, and wrap its output with different JavaScript snippets to send commands to the Webnnel Command System.

Because gesture recognition is not the same as speech recognition, I do not try to create a new gesture for every command. Based on the concept of using (mouse) gestures in the home environment, I designed part of the commands as gestures in the (Mouse) Gesture Recognition System. (Table 3) Parts of the executed results are illustrated in Figures 13 and 14.
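Since the Mouse Gestures engine reports each recognized stroke as a direction-encoded gesture code, hooking it into the Webnnel Command System amounts to a small lookup from codes to commands. The sketch below assumes the single letters U, D, L and R for the stroke directions, and the specific code-to-command pairs are hypothetical examples, not the actual bindings of Table 3.

```javascript
// Sketch of binding direction-encoded gesture codes to Webnnel commands.
// The code strings and the command names are illustrative assumptions.
var gestureBindings = {
  "R":  "next channel",     // stroke right
  "L":  "previous channel", // stroke left
  "DR": "close channel"     // stroke down, then right
};

// Resolve a recognized gesture code to the command string that the
// (Mouse) Gesture Command System would hand to the Webnnel Command System.
function gestureToCommand(code) {
  return gestureBindings[code] || null; // null when the code is unbound
}
```

Each bound code would then be delivered through the same input textbox path as the speech and keyboard commands.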
Because the (Mouse) Gesture Recognition engine is based on the direction of the stroke to recognize the gesture, the accuracy rate depends on the design of the gesture. (Figure 17) However, there are two drawbacks to using direction to recognize gestures. First, the same gesture with a little variation might produce a different gesture code. Second, from a human point of view, gestures with the same meaning can have completely different gesture codes. Because the gesture recognition engine relies on the gesture code to recognize the gesture, different gesture codes will output different results. Figure 18 is an example of the second drawback.

5. User Study

Four users participated in the user study of the Webnnel Command System + Speech Command Extraction (SCE) System. Two tasks were designed to understand the users' feedback:

1. Task 1: Go to a certain website
2. Task 2: Go to their web-based email system

The total success rate is good: over 70% for task 1 and 100% for task 2. (Figure 19) The recognition accuracy of the speech recognition was pretty good for the first three users, but the fourth user had difficulty getting the Speech Command Extraction (SCE) System to recognize his speech commands. The possible reason is that the fourth user is a nonnative English speaker and has an accent in his spoken language. (Figure 20)

In the user study, we got very good user feedback about using the speech input modality to control the navigation of the Web browser. Some of the comments are:

1. Commands are natural and easy to remember
2. Liked the tag system (showing the tag number next to the link)
3. The shorter the command, the better
4. There should be ways to enter the URL directly into the address bar as well

6. Discussion

In this project, we envision a future application that uses a Web browser on a big-screen TV in the home environment. We design a Web navigation system that represents Web sites as TV channels and allows users to use multiple input modalities to navigate, control and change their content. In the system demonstration, we show a concrete example of using Speech, Keyboard and (Mouse) Gesture to navigate Web channels and to automate and customize tasks for the users, such as going to a personal e-mail account to check e-mail and removing unnecessary content to enhance the user's reading experience.

However, the current implementation still has some limitations. First, the Speech Command Extraction (SCE) System uses the Mac Speech Recognizer as its recognition engine, which is platform dependent and at this moment only works on the Mac.
Second, in the (Mouse) Gesture Recognition System, we haven't figured out how to package our hackings so that users can download and deploy them. Currently, the user needs to manually change the hacked parts to embed the gestures we defined and invoke the corresponding JavaScript snippets.

In the user study, we got positive feedback from the users. Most of them like the idea of using speech to navigate the Web. Because we didn't finish the (Mouse) Gesture Recognition System when we conducted the user study, we are also interested to know the results of using (mouse) gestures in a real environment.

In short, the Webnnel System currently has a few benefits. First, it allows the user to use multiple input modalities to navigate, control, automate and customize the Web channels. Second, it provides a way for third-party applications to control web content. Third, to use the Webnnel System, the effort required of an application developer is the design of the newly added recognition engine, and the integration effort is low. Last but not least, the Webnnel System is a feasible and attractive approach to navigating web sites as Web channels on a big-screen TV in the home environment.

7. Collaboration

The Webnnel project is a collaboration between Chen-Hsiang Yu and Oshani Seneviratne. We divided our project into several tasks, and each of us focused on the specific tasks mentioned below.

Chen-Hsiang Yu:
1. Webnnel Command System
2. Development of the Firefox Extension of the Webnnel Command System
3. (Mouse) Gesture Extraction (MGE) System

Oshani Seneviratne:
1. Speech Command Extraction (SCE) System
2. User Study

We use Eclipse with SVN and Google Code online version control (http://code.google.com/p/webnnel/) to manage and synchronize our documents, references, images and project source code. In the near future, I will release the Webnnel Command System at http://people.csail.mit.edu/chyu/projects/webnnel for the public to use.

8. References

1. Avot mV, http://www.avotmedia.com/
2. Bigham, J. P., and Ladner, R. E. Accessmonkey: a collaborative scripting framework for web users and developers. In W4A '07, ACM Press, pp. 25-34, 2007.
3. Bolin, M., Webber, M., Rha, P., Wilson, T. and Miller, R.C. Automation and customization of rendered web pages. Proceedings of the 18th annual ACM symposium on User interface software and technology, October 23-26, 2005.
4. CMU-Sphinx Speech Recognition Engine, http://cmusphinx.sourceforge.net/html/cmusphinx.php
5. Greasemonkey, https://addons.mozilla.org/en-US/firefox/addon/748
6. Joost, http://www.joost.com/
7. Microsoft Windows Vista Speech Recognition system, http://www.microsoft.com/enable/products/windowsvista/speech.aspx
8. Mogulus, http://www.mogulus.com/
9. Mouseless Browsing, https://addons.mozilla.org/en-US/firefox/addon/879
10. Mouse Gestures, https://addons.mozilla.org/en-US/firefox/addon/39
11. Open Computer Vision libraries, http://sourceforge.net/projects/opencvlibrary/
12. Petrie, H., Hamilton, F. and King, N. Tension, what tension? Website accessibility and visual design. Proceedings of the 2004 international cross-disciplinary workshop on Web accessibility (W4A), pp. 13-18, 2004.
13. Richards, J. and Hanson, V. Web accessibility: a broader view. Proceedings of the 13th international conference on World Wide Web, pp. 72-79, 2004.
14. StartForce, http://www.startforce.com/OS/
15. YouOS, https://www.youos.com/
16. WATSON: Real-time Head Tracking and Gesture Recognition, http://projects.ict.usc.edu/vision/watson/
