Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 216 (2023) 587–596
www.elsevier.com/locate/procedia
doi: 10.1016/j.procs.2022.12.173

7th International Conference on Computer Science and Computational Intelligence 2022

Improving the performance of speech-gesture multimodal interface in non-ideal environments

Fiolisya Faustine Ambadar a, Jude Joseph Lamug Martinez a,*

a Computer Science Department, Faculty of Computing and Media, Bina Nusantara University, Jakarta 11480, Indonesia
* Corresponding author. Tel.: +6287855807533. E-mail address: jmartinez@binus.edu
Abstract

Multimodal interfaces have enhanced human-computer interaction by enabling users to interact with computers using a combination of multiple input modes, providing increased accessibility to a wider range of users in various situations. The multimodal system's ability to process multiple input modes allows it to rely on one input modal given that the second modal is unable to function due to exposure to extreme environments. This study will analyse a speech-gesture multimodal interface framework and the prototype that was initially developed by Sindy Dewanti and improved upon by Regita Isada. To further improve the framework and prototype's performance, this study will evaluate and resolve the issues encountered in the previous study regarding the configuration of each modal's confidence levels, environment detection, weight calculation, and how the unification process selects a final semantic. Upon implementing the changes, the prototype was tested under three environmental conditions: normal, moderate, and extreme, in both unimodal and multimodal mode. The test results show that the prototype was able to deliver the expected results with improved accuracy in multimodal mode as compared to the previous study. Nonetheless, the way that the modals perform and the unification process can still be further improved.

© 2023 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the scientific committee of the 7th International Conference on Computer Science and Computational Intelligence 2022

Keywords: Multimodal interfaces, speech, hand gesture, unification-based signal fusion, speech-gesture multimodal framework, human computer interaction

1. Introduction

Technology advancements have allowed Human Computer Interaction (HCI) to more closely resemble human-human interactions. Where traditional HCI allows users to give input commands by typing or clicking a button, users
can now interact with computers the same way they interact with other people: audio-visual signals. The ability to
interact with computers using multiple input modes is called a multimodal interaction. Oviatt [1] described
multimodal interfaces as a system that processes a combination of two or more input modes. These input modes
include, but are not limited to, touch, speech, gesture, and gaze. Multimodal interfaces have the potential to increase
usability for a wider range of people and function in less than ideal conditions. For instance, an interface that supports
both speech and gesture recognition would still be able to understand the user in noisy environments through the
gesture input. Additionally, the interface would be more accessible to people with disabilities as the users are able to
rely on either mode of interaction depending on their needs. Dewanti [2] developed a framework to study the performance of multimodal speech-gesture interface systems under extreme environmental conditions. This framework was used to develop a prototype that utilizes the Leap Motion Controller (LEAP) for the gesture modal and the Windows Speech Recognition API (VOICE) for the speech modal. The prototype was then tested under different environmental
conditions (normal and extreme) to see whether the multimodal system allows for better input interpretation under
extreme conditions as compared to a unimodal system. The reliability of the interpretations produced by each input
modal is measured using a metric called confidence [2]. This study was later continued by Isada [3], who added more input values, a moderate environmental condition for testing, verification for complete gestures, and a modification of the metric formula, and attempted to improve the success rates of the system's input interpretation. Although the test
results in Isada’s iteration gave better success rates, some concerns with the prototype were identified. They include
the configuration of confidence levels, the performance of the environmental volume reader, and adjustments for the
unification of input signals [3].
This study will cover an in-depth analysis of the framework to identify possible causes of the issues encountered
in the previous iteration. A deeper look into the framework can also help to identify potential flaws and areas for
improvement, after which an appropriate solution can be implemented to improve its performance. The formula for
calculating the result may need to be slightly modified depending on how many inputs are being used for testing.
The purpose of this study is to continue the research done by Dewanti [2] and Isada [3], implement the
recommendations they have given, attempt to resolve known issues and improve the functionality of the prototype.
Prior to the start of this study, the source code and Unity project used in Isada's iteration, as well as the test results, were shared. As such, the process of this study includes an observation of the framework and the prototype system design; an analysis of the implementation, covering the confidence of the modals, the formulas, the unification process, the testing methods used, and the results of the previous study; and the implementation of possible adjustments and new features for improving input interpretation. The expected outcome of this study is to have the issues and concerns
from the previous iterations resolved to improve the overall accuracy of the prototype in terms of environment
detection and input interpretation. This study can help to increase the understanding of multimodal interfaces by
exploring more of its potential for HCI as well as identifying areas for further research.

2. Problem Analysis

2.1 Environmental Conditions


Dewanti’s prototype was tested using two environmental conditions: normal environment and extreme
environment. The environment is classified as extreme when the environment surrounding the input stream prevents
the system from interpreting the user’s command or prevents the system from picking up an input signal altogether.
The environment is classified as normal when it does not hinder the system’s performance, allowing it to pick up a
signal and interpret the user's command. Isada's prototype was tested using three environmental conditions, adding a moderate environment alongside the two conditions tested in [1]. This takes into account environmental conditions that are not 100% ideal but are also not too disruptive.

2.2 Prototype Observation


An observation of the prototype was done in order to have a better understanding of how it works. The
observations will cover how both input modals work individually, how they work together, and how the unification
process works. The observation of the speech modal was done using VOICE and the built-in microphone, and the
observation of the gesture modal was done using LEAP. Dewanti explained that based on the CASE model, this
system is classified under “Concurrent” multimodal combination. Each modal is managed independently, and their
activation is synergistic. Based on the CARE model, the system is classified as “Redundancy”. As seen in the list of
registered commands, the prototype should be able to come to the same conclusion whether it is used as a
multimodal or a unimodal system [1].

2.3 Observation of Unimodal Gesture


During the observation of a unimodal gesture signal in the normal environment, LEAP was able to detect the
correct finger states and palm direction in most cases. Other times, the virtual hand displayed in Unity struggled to
mimic the user’s hand. For example, when the user does a simple “palm facing inwards” gesture with all fingers
extended, the virtual hand shows the palm facing inwards and only an extended index. A similar issue occurred
during Isada's observation as described in [2]. Isada mentioned that this may be an issue with LEAP's sensor rather than with the prototype itself [2]. After testing various gestures, including those that are not pre-mapped, it is possible that
breaking down each semantic into two sub-semantics is not sufficient for obtaining the correct guess. For example,
the “Left” semantic is determined by the sub-semantics “left palm outward” and “index and thumb extended”. If the
user does an “L” gesture as shown in Figure 6, with the index and thumb extended (upwards) and palm facing
outwards, it will be detected as “Left” even though in reality the user is pointing up. This may cause problems in
future studies if more gestures were to be added. One of the recommendations given by Dewanti is to add finger
direction as part of the gesture sub-semantics. The observation of the gesture modal was also done with the other
environmental conditions. Following the testing method, for the moderate environment, LEAP was partially covered
with a glass bowl on the first test run and powder on the second test run. These foreign objects interfere slightly with
LEAP’s gesture reading, resulting in incorrect detection of the extended finger states. For the extreme environment,
LEAP was fully covered with a glass bowl on the first test run and powder on the second test run. Similar to the test
results shown, LEAP was unable to pick up a signal. This is the expected result, as the glass bowl and powder covering LEAP interfere with the infrared tracking.

2.4 Observation of Unimodal Speech


During the observation of a unimodal speech signal in the normal and moderate environment, VOICE was able to
correctly identify the input signals after several guesses. In the extreme environment, the prototype was unable to
pick up a signal from VOICE. This is the expected outcome and was also shown in the test results [1][2]. As VOICE
waits to receive a speech input from the user, it continuously measures the loudness of the surrounding environment
through the environmental volume reader feature. The environmental volume reader displays the environmental
condition every four seconds in the output console. However, when tested with different environmental conditions,
the environmental volume reader will always display normal. This was not the intended behavior of the
environmental volume reader and was an issue that occurred during Isada’s testing.

2.5 Observation of the Multimodal Signal


The observation of the prototype in multimodal mode shows that it is able to process and unify multimodal
signals. As previously mentioned [1][2], each modal produces a unified structure that contains the semantic, source,
confidence, timestamp, and valid time. The component known as the weight resolver receives the unified structures
of each modal and calculates their weights. This information can be viewed in the output console. The weighted
structures of the modals will then be passed to the unification controller.

2.6 Observation of the Weight Formula and Unification Confidence


The framework uses confidence as a metric to identify the environmental condition of the input modals. The
three levels of confidence are “low”, “medium”, and “high”. Low confidence means that the modal is unable to
interpret the user's command, as if the modal is in its extreme environment, whereas high confidence indicates that
the modal is in its ideal environment. The confidence level affects the weight of the modal wherein higher
confidence results in more weight. One of the features of the framework is the ability to rely on one modal to give
the correct interpretation given that the other modal is in its extreme environment. Therefore, the framework relies
on the modal that has more weight [1][2]. The weight of each modal considers their confidence level as well as a
modifier constant K. The weight formula, as described in [2], is as follows, where V is the confidence for VOICE and G is
the confidence for LEAP.

K = 1 / (V + G)    (1)

In order to obtain the weighted confidence of each modal, their initial confidence is multiplied with the value of
K. The formula for calculating the weight is shown in (2) and (3):

WV = K × V    (2)

WG = K × G    (3)

The final confidence of the unified semantic, given that the ideal confidence is 1, is calculated by adding together
the weights of both modals. The formula for calculating the final confidence C is shown in (4):

C = WV + WG    (4)
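As an illustration of formulas (1)-(4) as reconstructed above, the short sketch below (written in Python purely for illustration; the prototype itself is a Unity project, and the function name is an assumption) computes the final confidence for a pair of initial confidences. It makes the issue discussed next visible: whenever both modals report any non-zero confidence for the same semantic, the final confidence collapses to 1.

def original_final_confidence(v: float, g: float) -> float:
    """Formulas (1)-(4): K = 1/(V+G), weights WV = K*V and WG = K*G, C = WV + WG."""
    k = 1.0 / (v + g)   # (1) modifier constant
    w_v = k * v         # (2) weight of the speech modal (VOICE)
    w_g = k * g         # (3) weight of the gesture modal (LEAP)
    return w_v + w_g    # (4) final confidence

# Any pair of non-zero initial confidences collapses to C = 1:
print(round(original_final_confidence(1.0, 1.0), 2))  # 1.0
print(round(original_final_confidence(0.3, 0.3), 2))  # 1.0 (K is about 1.67, both weights 0.5)
print(round(original_final_confidence(0.6, 0.9), 2))  # 1.0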

To have a better understanding of the formula, if both modals are in their ideal environments and they each have
a confidence of 1, the value of K calculated with (1) would be 0.5. Following formulas (2) and (3), multiplying the
confidence of each modal with K results in both their weights being 0.5. By adding both weights together, the final
confidence is 1, following (4). A final confidence of 1 would mean that the final guess should be 100% trustworthy.
However, it is interesting to note that in the test cases in [2] where both modals guessed the same semantic, the final
confidence is always 1 despite the initial confidence of VOICE occasionally being 0.6 or 0.9. To test the formula
once again, assume both modals have an initial confidence of 0.3. The value of K calculated using (1) would be
1.67. The weights of each modal calculated with (2) and (3) would be 0.5, resulting with a final confidence of 1.
Isada mentioned that in cases where both modals guess the same semantic, it is acceptable to have the final
confidence as 1. Hence, the final confidence calculation was intentionally made this way. Based on the weight
formula, the weight can be said to represent the percentage of how much each modal’s initial confidence contributes
to the final confidence. If both modals have the same confidence, they each account for 50% of the final confidence,
making both their weights 0.5. Formula (4) simply adds these percentages together, resulting in a final confidence of
1 regardless of what each modal’s initial confidence was. Although having both modals guess the same semantic
means that they support each other’s guesses, having the final confidence boosted to 1 regardless of their initial
confidence does not accurately represent the environment of the multimodal interface. Additionally, it is not an
accurate representation of how reliable the final guess is, especially if the two modals individually have low
confidence. Having two unreliable sources produce the same semantic does not make the guess 100% reliable. If the
framework were to be implemented in a real system that has a minimum confidence requirement, the final output would have an
ideal confidence while the confidence of the individual modals might have been too low to pass the minimum
requirements. Additionally, while the purpose of the weight formula is to determine which modal is more reliable
based on their environmental conditions, currently it may not work as intended. The formula is under the assumption
that the confidence of each modal is affected by its environmental condition, hence the modal with higher
confidence also has more weight. However, as seen in the test results [2], the modals can have high confidence despite
not being in their optimal environment. In the case where the speech modal is in its normal environment for a
unimodal signal test, some of the results have a confidence of 0.6 (medium). In the moderate environment, some of
the results have a confidence of 0.9 (high). In a multimodal test case where the speech modal is in its normal
environment and the gesture modal is in its moderate environment, the speech modal produced results with a
confidence of 0.9, and the gesture modal returned results with a confidence of 1.0. The gesture modal returned a
result with 0.3 confidence in only one of the test cases [2]. The gesture modal will have more weight despite it being
in the less optimal condition. In these test cases, both modals guessed the same semantic, so the prototype was able
to obtain the correct final guess [2]. However, this may cause problems if the speech and gesture modals produce
different results. Modals in the less ideal environment are more prone to error and having high confidence does not
guarantee that it is more correct than the other modal. This is like how people can be confident about something and
still be mistaken due to misreading, mishearing, or other factors. The environmental condition is crucial, as ideal
environments would support the confidence level while non-ideal environments should set back the confidence. The
weight formula can be improved by adding another variable to represent the environment. For the speech modal, the
environmental volume reader can determine the environmental condition. For the gesture modal, there needs to be a
new method to measure the surrounding environment. Additionally, the formula for calculating the final confidence
needs to be re-examined so that the final confidence can provide a better representation of the multimodal
interface’s environment and the reliability of the final guess.

2.7 Gesture Modal Semantics


Isada mentioned [2] that in the condition where the input signals from each modal have different semantics, the
semantic with the higher confidence level will be accepted as the correct interpretation. However, sometimes the
incorrect semantic has higher confidence, and as a result, it is passed as the correct guess. This issue occurred in the
test case where both the speech and gesture modals were in their moderate environment, and the correct semantic
was “Left”. VOICE was able to identify the correct semantic in both test runs, with 0.6 confidence on the first test
run and 0.9 confidence on the second. However, LEAP detected “Rejection” with 1.0 confidence on the first test run
and "Approval" with 1.0 confidence on the second [2]. This could be caused by the clashing of the newly added sub-semantics with those used during the first iteration. Dewanti's iteration [1] only used two semantics, "Approval" and "Rejection". The pre-mapped gestures were broken down into three sub-semantics: left palm facing inwards, left palm facing outwards, and thumb extended with all other fingers collapsed. As there were only two
semantics, the confidence levels of “left palm facing inwards” and “left palm facing outwards” were high, both at
0.7 for “Approval” and “Rejection” respectively. The “extended thumb” state adds a confidence of 0.3 to both
semantics [1]. During Isada’s iteration [2], the semantics “Left” and “Right” also use palm orientation. However, the
confidence levels for “Approval” and “Rejection” were not modified. The sub-semantics for “Right” include “left
palm facing inward in right direction”, and “Left” includes “left palm facing outward in left direction”. Both sub-
semantics have a confidence level of 0.4 [2]. Additionally, palm orientation is given a level of deviation tolerance
for usability. Hence, by having palm directions that are very similar such as “palm facing inwards” and “left palm
facing inward in right direction”, one could easily be mistaken for the other. As Isada described [2], the “Left”
gesture is also recognised by two extended fingers. However, as shown during testing, LEAP sometimes fails to
detect the correct extended finger states. In this case, LEAP only detected one extended finger, which was the thumb
[2]. It was suggested that for the next iteration, the calculation for the confidence of the gesture modal can be
modified.

2.8 Speech Modal


Another issue described by Isada is concerning the environmental volume reader [2]. One of the features of this
framework is the ability to detect whether the surrounding environment is in its normal or extreme condition.
However, testing shows that even in the extreme speech environment, the environmental volume reader will always
display normal. The current setting for the extreme speech environment is above 85dB. Isada mentioned that sounds
above 85dB are considered harmful [3]. However, noise does not necessarily have to be harmful for it to be
disruptive. Isada’s test cases used an audio of a coffee shop ambiance for the moderate environment and a live TV
broadcast for the extreme environment. As shown in Isada’s test results, the environmental volume reader displayed
volumes between 60dB – 72dB in the extreme environment. This volume range was disruptive enough for the
speech modal to not be able to pick up a signal. It was suggested that for the next iteration, the decibel setting on the environmental volume reader could be adjusted.

2.9 Unification
When VOICE and LEAP guess different semantics, the guess with higher confidence will be taken as the correct guess. As Isada mentioned, the sole reliance on confidence levels to choose the correct
semantic could boost the wrong semantic, as shown in the test case previously discussed. It was suggested that
another variable could be used to determine the more reliable modal, or the unification process could be
recalculated.

3. Solution Design

This study will implement the recommendations given by Isada [2] as well as attempt to resolve new issues that were identified. In addition, this study will focus on modifying the internal components of the framework and the way that the prototype was implemented in an attempt to resolve the issues that occurred during testing. The framework
architecture does not require major changes, so it follows the structure used in the previous studies [1][2].

3.1 Application Information Database


The application information database is where all the registered inputs are stored. These include speech
commands and pre-mapped gestures that will be used for determining guesses. Modifying existing commands or adding new commands is done by editing or creating a ScriptableObject in the Unity asset folder and adding details through the Unity Inspector. Modifying the gesture mapping includes selecting the type of data, such as palm direction and extended finger state, and assigning their confidence [2].

3.2 Input Analyzer


The input analyser is the component that reads user inputs and converts them into unified structures. As previously mentioned, the unified structure holds information for the fusion process, which includes the semantic,
source, confidence, timestamp, and valid time (how long before the signal expires). The valid time for the gesture
modal (LEAP) is the timestamp + 4 seconds, and for the speech modal (VOICE) is the timestamp + 1 second [1][2].
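For clarity, a minimal sketch of such a unified structure is shown below (Python used purely for illustration; the field names mirror the description above, but the prototype itself is a Unity project, so this is not its actual code).

from dataclasses import dataclass

@dataclass
class UnifiedStructure:
    semantic: str      # e.g. "Left" or "Approval"
    source: str        # "VOICE" (speech) or "LEAP" (gesture)
    confidence: float  # between 0.0 and 1.0
    timestamp: float   # time the signal was produced, in seconds
    valid_time: float  # time at which the signal expires

def make_unified(semantic: str, source: str, confidence: float, timestamp: float) -> UnifiedStructure:
    # Per the framework, gesture (LEAP) signals stay valid for 4 seconds and
    # speech (VOICE) signals for 1 second [1][2].
    lifetime = 4.0 if source == "LEAP" else 1.0
    return UnifiedStructure(semantic, source, confidence, timestamp, timestamp + lifetime)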

3.3 Speech Input


The speech input component includes the environmental volume reader which will measure the loudness of the
speech environment in decibels. This feature is called every four seconds and will measure the environment
continuously [2]. As part of the recommendations given by Isada, the decibel settings on the environmental volume reader need to be adjusted as it was unable to detect extreme environments. As described, the average decibel level for
normal conversation or background music is 60dB, office noise or the inside of a car going 60mph is 70dB, and a
vacuum cleaner or the average radio is 75dB [3]. Some examples of noise between 80dB - 89dB are heavy traffic,
window air conditioner, a noisy restaurant, and a power lawn mower. Additionally, the environmental volume
reader currently only distinguishes between normal and extreme. As the framework will be tested against three
different environmental conditions, it needs to be able to detect moderate environments as well. In order for the
settings to more accurately detect the environments in the testing conditions, it will use the environment reading
from Isada’s study as a benchmark. Isada’s test results show that under the moderate testing conditions, the
environmental volume reader displays volumes above 50dB. Under extreme testing conditions, the environmental
volume reader displays volumes above 60dB.
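A minimal sketch of the adjusted classification is given below, assuming the thresholds are taken directly from the benchmark readings described above (above 50dB moderate, above 60dB extreme); the function name and exact cut-off values are illustrative rather than the prototype's actual settings.

def classify_speech_environment(volume_db: float) -> str:
    # Cut-offs based on the benchmark readings from the previous study:
    # moderate test conditions read above 50dB and extreme conditions above 60dB.
    if volume_db > 60.0:
        return "extreme"
    if volume_db > 50.0:
        return "moderate"
    return "normal"

print(classify_speech_environment(65.0))  # "extreme", e.g. the 60dB-72dB live TV broadcast
print(classify_speech_environment(40.0))  # "normal"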

3.4 Gesture Input


The gesture resolver is the component that matches the user’s gesture to the pre-mapped gestures in the signal
database. It checks the extended finger state to ensure that the gesture is complete and that the user is providing a
gesture that has been registered. The results from Isada’s study shows that the gesture modal can have confidence
levels that are too high for incorrect guesses. The highest confidence for the speech modal is 0.9, and yet the gesture
modal can have a confidence of 1.0 even though the guess is incorrect. In order to make the confidence levels of
both speech and gesture more balanced, each gesture semantic will be broken down into three sub semantics, each
contributing a confidence of 0.3. In addition to palm orientation and extended finger states, finger direction will also
be used. The observation shows that LEAP can occasionally misread gestures, especially in moderate and extreme
environments. The hope is that by breaking down the gestures into three sub-semantics (palm orientation, extended
finger state, and finger direction), if LEAP misreads one of them, it can rely on the other two sub-semantics to
obtain the correct result. Furthermore, a feature that measures the environment of the gesture modal will be added.
There is a set of diagnostic tests that can be run for troubleshooting LEAP, covering hardware tests, software tests, and environmental tests. Some examples [4] are a check sensor test, which verifies the sensor's signal quality; a smudge test, which detects foreign substances such as oil and grease on the LEAP sensor; and a lighting check, which checks for external infrared light sources in the field of view of the LEAP sensor.
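To illustrate the reconfigured gesture scoring, the sketch below sums the three sub-semantics at 0.3 each, so that a gesture matching all three reaches at most 0.9, the same ceiling as the speech modal. The pre-mapped values and names are hypothetical and only stand in for entries of the signal database (illustrative Python, not the prototype's actual code).

# Hypothetical signal-database entry: each semantic maps to three sub-semantics.
GESTURE_MAP = {
    "Left": {"palm": "left palm outward", "fingers": "index and thumb extended", "direction": "pointing left"},
}

def gesture_confidence(semantic: str, observed: dict) -> float:
    # Each matching sub-semantic (palm orientation, extended finger state,
    # finger direction) contributes 0.3, capping the gesture modal at 0.9.
    mapped = GESTURE_MAP[semantic]
    return round(sum(0.3 for key in mapped if observed.get(key) == mapped[key]), 1)

observed = {"palm": "left palm outward", "fingers": "index and thumb extended", "direction": "pointing up"}
print(gesture_confidence("Left", observed))  # 0.6: the finger direction rules out a full-confidence "Left"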

3.5 Unified Structures


As previously mentioned, the speech and gesture resolvers produce a unified structure that contains the
semantic, confidence, timestamp, and valid time. As the framework will include features that measure the
environmental conditions of both the speech and gesture modals, the results from these features can be used to alter
the initial confidence of each modal to better reflect their environmental condition. This issue was discussed in the previous section, where the initial confidence does not accurately represent the environmental
conditions. To resolve this issue, the input analyzer will calculate the modified confidence for each modal that will
be included in their respective unified structure. Each environmental condition will be assigned a value that will then
be multiplied with the initial confidence. The assigned values are shown in Table 1 below.

Table 1. Assigned Values for each Environmental Condition


Environment Value
Normal 1.0
Moderate 0.6
Extreme 0.3

The formulas for calculating the modified confidence MV and MG are shown in (5) and (6), where V is the initial
confidence of the speech modal, G is the initial confidence of the gesture modal, and E is the environmental
condition.

MV = V × E    (5)

MG = G × E    (6)
The modified confidence will take the place of the initial confidence in the unified structure. Further processes
involving the confidence of each modal will use their modified confidence.
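A sketch of the modified-confidence calculation, combining Table 1 with formulas (5) and (6) (illustrative Python, not the prototype's actual code):

# Assigned environment values from Table 1.
ENVIRONMENT_VALUE = {"normal": 1.0, "moderate": 0.6, "extreme": 0.3}

def modified_confidence(initial_confidence: float, environment: str) -> float:
    # Formulas (5) and (6): scale the initial confidence by the value assigned
    # to the modal's measured environmental condition.
    return initial_confidence * ENVIRONMENT_VALUE[environment]

# A 1.0-confidence gesture read in a moderate environment is no longer trusted
# more than a 0.9-confidence speech read in a normal environment:
print(modified_confidence(1.0, "moderate"))  # 0.6
print(modified_confidence(0.9, "normal"))    # 0.9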

3.6 Multimodal Manager


The multimodal manager consists of three components, namely the threshold controller, weight resolver, and
unification controller [1][2].

• Threshold Controller
As the signals from the speech and gesture modals are asynchronous, the threshold controller is the component
that holds the signal from each modal for a set amount of time before they are passed on for unification. When a
signal from one modal is received, the threshold controller will wait to receive a signal from the other modal. If the
second modal’s signal is received within the set period, both signals will be sent for unification, making it a
multimodal signal. If it does not receive a signal from the second modal before the held signal expires, only the one
signal will be sent as a unimodal signal [1][2]. In this framework, the threshold controller runs an observation routine over new signals and held signals every 0.1 seconds. Report routines are triggered based on four cases.
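The hold-and-unify behaviour described above can be sketched as follows, reusing the unified-structure fields from Section 3.2 (again, illustrative Python rather than the prototype's actual implementation; the handling of a repeated signal from the same modal is simplified).

class ThresholdController:
    def __init__(self):
        self.held = None  # at most one signal is held while waiting for the other modal

    def observe(self, now, new_signal=None):
        """Runs every 0.1 seconds; returns a list of signals to unify, or None."""
        if new_signal is not None:
            if self.held is not None and self.held.source != new_signal.source:
                pair, self.held = [self.held, new_signal], None
                return pair               # multimodal: the second modal arrived in time
            self.held = new_signal        # hold the signal and wait for the other modal
            return None
        if self.held is not None and now > self.held.valid_time:
            expired, self.held = self.held, None
            return [expired]              # unimodal: the held signal expired unpaired
        return None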

• Weight Resolver and Weight Formula


The weight resolver is the component that calculates the weight of each modal. Through the weight calculation,
it determines which modal is more reliable based on their environmental condition. It accepts an array of unified
structures of size 1 for unimodal signals and size n for multimodal signals with n modals [1][2]. The weight
calculation utilises the confidence levels of each modal. Both LEAP and VOICE have Confidence as a built-in property. For LEAP, confidence indicates how well the internal hand model fits the observed data, alongside a Boolean property that indicates whether the tracked hand is valid [2][5]. For VOICE, it indicates the confidence of the speech recogniser in the recognised result, although it is important to note that the confidence score does not indicate the absolute likelihood that the phrase was recognised correctly [6][7].
As mentioned in the previous section, the weight formula was under the assumption that the confidence of each
modal reflects their environmental conditions. As such, simply multiplying the initial confidence with a weight
constant K that affects both modals uniformly is not sufficient to determine which modal is more reliable. As the
initial confidence of each modal fails to detect the environmental condition, the weight of each modal will also fail
to determine which modal is in the more optimal environment. As previously mentioned [1][2], the weight is used as
a degree of trust. Even though the modals have high confidence, their environmental conditions could influence their
interpretation, and consequently, should lower their degree of trust. As previously mentioned, the unified structure
of each modal now contains their modified confidence instead of their initial confidence. The new confidence value
would be able to represent their environmental conditions better. For the weight calculation, the weight formula will
make use of each modal’s modified confidence. The constant K is used as a modifier given that the ideal final
confidence C is 1. A final confidence of 1 means that the multimodal interface is 100% confident of the final guess.
The formula for calculating K is shown in (7), now using the modified confidence of the modals instead of their
initial confidence.
K = 1 / (MV + MG)    (7)

Calculating the weight of each modal is shown in (8) and (9). These are similar to the original formulas (2) and (3), only instead of using the initial confidence, they use the modified confidence.
WV = K × MV    (8)
WG = K × MG    (9)
As mentioned in the previous section, the weight of each modal represents the percentage of how much their
confidence contributes to the final confidence. The weighted confidence of each modal will first need to be
calculated by multiplying their weights with their modified confidence, as shown in (10) and (11).

CV = WV × MV    (10)
CG = WG × MG    (11)
The final confidence C is calculated by adding together the weighted confidence of each modal, as shown in
(12). The final confidence is therefore a weighted average of each modal’s confidence.

C = CV + CG    (12)

The weight resolver modifies the unified structure of each modal into a weighted structure, which will then be
passed on for unification [2].
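Putting formulas (7)-(12) together, the revised calculation can be sketched as below (illustrative Python; the function and variable names are assumptions). Unlike the original formula, two low modified confidences no longer produce a final confidence of 1.

def final_confidence(mv: float, mg: float) -> float:
    """Formulas (7)-(12) using the modified confidences MV and MG."""
    k = 1.0 / (mv + mg)              # (7) modifier constant
    w_v, w_g = k * mv, k * mg        # (8), (9) weights: each modal's share of trust
    c_v, c_g = w_v * mv, w_g * mg    # (10), (11) weighted confidences
    return c_v + c_g                 # (12) final confidence, a weighted average of MV and MG

print(round(final_confidence(0.3, 0.3), 2))  # 0.3: two low-confidence modals no longer yield 1
print(round(final_confidence(0.9, 0.6), 2))  # 0.78: pulled towards the more reliable modal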

• Unification Controller
The unification controller is the component where the weighted structures of each modal are unified. This
component produces a unified output of a multimodal semantic in which the final confidence is calculated using (12). The timestamp is taken from the earliest signal, and the valid time is taken from the signal with the latest valid
time [1][2]. For unimodal signals, the initial semantic and confidence are taken as the final semantic and confidence.
As there is only one signal to process, there is no need for weight calculation or fusion with another signal. They are
passed without modification [1][2]. In cases where the two modals guess different semantics, the calculation of the
modified confidence as previously described can help to determine which modal is more trustworthy. When both
modals have the same initial confidence but different environmental conditions, the modal in the more ideal
environment will have higher modified confidence and therefore can be taken as the correct guess. However, it is
another issue when both modals are exposed to the same environmental condition. Isada mentioned that when the
two modals have different semantics and the same confidence, the signal with the earlier timestamp will be taken as
the correct guess.
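The selection rule described in this section can be summarised in a short sketch (illustrative Python; the field names follow the unified-structure sketch in Section 3.2, with the confidence already modified and weighted):

def choose_final_semantic(speech, gesture):
    """Pick the final semantic from two weighted structures when a multimodal signal arrives."""
    if speech.semantic == gesture.semantic:
        return speech.semantic                # both modals agree
    if speech.confidence != gesture.confidence:
        # The modal with the higher (modified) confidence is taken as more trustworthy.
        return max(speech, gesture, key=lambda s: s.confidence).semantic
    # Same confidence but different semantics: fall back to the earlier timestamp [2].
    return min(speech, gesture, key=lambda s: s.timestamp).semantic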

4. Conclusion and Recommendation

This study was a continuation of Dewanti's [1] and Isada's [2] work regarding multimodal interfaces and how they
perform under various environmental conditions. The framework developed by Dewanti focuses on Speech-Gesture
multimodal interfaces. The prototype that was developed utilizes LEAP Motion Controller (LEAP) for the gesture
modal and Windows Speech Recognition API (VOICE) for the speech modal. Confidence was used as a variable to
determine the best guess from each modal. Weight calculation was derived from the confidence level of each modal
and is used to calculate the final confidence as well as determining the correct semantic in situations where the two
modals guess different semantics. Most of the issues encountered during Dewanti’s study have been resolved during
Isada’s study, although there were new issues that needed to be addressed. The proposed changes include
reconfiguring the gesture mapping and confidence levels for LEAP, resolving the issue with the environmental
volume reader, adding a feature that measures the environment for LEAP, re-examining the weight calculation and
unification process, as well as adding more registered semantics for testing. The gesture mapping was reconfigured
to utilise 3 sub-semantics, namely palm direction, extended finger states, and finger direction. The confidence level
for LEAP was reconfigured to have the highest confidence of 0.9, similar to VOICE. This lowers the chances of
LEAP having higher confidence than VOICE when it produces an incorrect guess. The confidence level of each
modal was also reconfigured to take into account their environmental conditions. To measure the environment of the
modals, the settings on the environmental volume reader were modified based on the test results from the previous
study as well as an observation of VOICE’s ability to pick up signals. For the gesture environment, the prototype
utilises the environmental diagnostics provided by LEAP. The results from these environmental tests were then used
to modify the modal’s initial confidence to obtain a modified confidence. Similar to the previous study, the
prototype was tested in three environmental conditions: normal, moderate, and extreme. It can be concluded from
the results that the prototype was able to perform as expected in most of the test cases. There were improvements in
the prototype’s ability to distinguish which modal is more trustworthy based on their environmental conditions and
choosing the correct semantic. However, the results also pointed out some flaws regarding how the modals perform
individually. For the gesture modal, the deviation tolerance for palm direction and finger direction can still cause
ambiguous guesses, and there is still a bias with choosing a semantic in the event where multiple guesses have the
same confidence. Out of 210 tests, 168 were expected to have results. From these 168 tests, 165 produced the
correct results. The overall accuracy remains similar to Isada’s study at 98% [2]. In unimodal mode, the prototype
performed worse compared to the previous study. More specifically, it performed worse in the unimodal gesture
mode in the moderate environment where it guessed the incorrect semantic in two of the test runs. In unimodal
mode, the prototype guessed the correct semantic in 54/56 test runs, resulting in an accuracy of 96%. However, it
was shown that the prototype was able to perform better in multimodal mode. In situations where the two modals
guessed different semantics, it was able to choose the correct final semantic based on the weighted confidence. With
the incorporation of environment weights to modify each modal’s initial confidence, the weighted confidence levels
provide a better representation of their environmental conditions. Therefore, the prototype was able to determine
which modal is more reliable. The prototype was able to guess the correct semantic in 111/112 test runs, resulting in
an accuracy of 99% in multimodal mode.
Based on the test results, there are several improvements that can be made on the framework and prototype. One
of the issues is regarding the environment detection for the gesture modal. The results from the diagnostic tests were
not very accurate in detecting the environment even though LEAP’s performance in detecting the user’s input was
affected. For future implementation, the method of determining the environment for the gesture modal could be re-
examined. It was also suggested [1][2] that the framework could be used to develop a prototype with different input
devices as the way they perceive input and calculate confidence levels could be different. Furthermore, there were
several occurrences where LEAP produced multiple guesses with the same confidence and chose one semantic
based on hierarchy in the signal database. These ambiguous gestures could be caused by the deviation tolerance, and
similar gestures could be mistaken for one another. The deviation angle has been reduced from the previous study
and reducing it even further may affect its usability. Additionally, as more input commands are registered, having
more gestures that are closely similar is inevitable. A suggestion given by Dewanti is to improve the pipeline that
currently only supports one unified structure per signal, and only one semantic per unified structure. When each
modal has the ability to pass multiple semantics, it allows for mutual disambiguation between modals. The
Multimodal Integration Agent in QuickSet merges identical items in each of the modal’s feature structures to
compensate for errors in either modal. It was found that the gesture modal will have multiple interpretations in most
cases, and ambiguous gestures were resolved through integration with speech [3][4][5]. Further research with this
framework could be done focusing on mutual disambiguation and how it can improve the unification accuracy. In
the current test results, there was no occurrence where LEAP and VOICE produced different semantics with the same confidence; when this does occur, the unification will take the modal with the earlier timestamp. Flippo, Krebs, and
Marsic mentioned that the best approach in cases of unresolvable ambiguity is to ask the user for clarification [8].
As shown in their framework, the dialog manager will check whether the final frame is complete, and if not, it will
ask the user to provide the missing information [8]. Despite this fact, people may have different preferences
regarding this issue. Some people may find it inconvenient for the system to constantly ask them to repeat a
command, while some may prefer this over the system giving the incorrect output. As the goal of human-computer
interaction is to design a system that is comfortable and effective for human use, it would be beneficial to do more
research on users’ expectations.

References

[1] S. Oviatt, "Multimodal Interfaces," in The Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies and Emerging Applications, vol. 14, pp. 286-304, 2003. [Online]. Available: http://pages.cs.wisc.edu/~bilge/private/Oviatt2003-MultimodalInterfaces.pdf. [Accessed 18 January 2021]
[2] S. S. Dewanti, "Multimodal Interfaces: A Study on Speech-Hand Gesture Recognition," Dept. Computer Science, Binus Univ. at Jakarta, Indonesia, 2019. [Online]. Available: http://library.binus.ac.id/eColls/eThesisdoc/Lain-lain/Technical%20Report-bi-cs-2019-0055.pdf. [Accessed 18 January 2021]
[3] I. Regita, "A Deeper Look in Multimodal Interfaces and Its Use in Extreme Conditions," Dept. Computer Science, Binus Univ. at Jakarta, Indonesia, 2020. [Online]. Available: http://library.binus.ac.id/eColls/eThesisdoc/Lain-lain/Technical%20Report%20new-is1-sn-cs-2020-0074.pdf. [Accessed 18 January 2021]
[4] B. Dumas, L. Denis and S. Oviatt, "Multimodal Interfaces: A Survey of Principles, Models and Frameworks," in Human Machine Interaction, Springer, Berlin, Heidelberg, pp. 3-26, 2009. [Online]. Available: https://diuf.unifr.ch/people/lalanned/Articles/mmi_chapter_final.pdf. [Accessed 19 February 2021]
[5] F. Flippo, A. Krebs and I. Marsic, "A Framework for Rapid Development of Multimodal Interfaces," in Proceedings of the 5th International Conference on Multimodal Interfaces, pp. 109-116, 2003. [Online]. Available: https://www.ece.rutgers.edu/~marsic/Publications/icmi2003.pdf. [Accessed 28 February 2021]
[6] M. Johnston, P. R. Cohen, D. McGee, S. Oviatt, J. S. Pittman and I. Smith, "Unification-based Multimodal Integration," Dept. Computer Science and Engineering, Oregon Graduate Institute at Portland, OR 97291, USA, July 1997. [Online]. Available: https://www.aclweb.org/anthology/P97-1036.pdf. [Accessed 28 February 2021]
[7] Unity, "Unity Platform," [Online]. Available: https://unity.com/products/unity-platform. [Accessed 28 February 2021].
[8] LEAP, "How Hand Tracking Works," [Online]. Available: https://www.ultraleap.com/company/news/blog/how-hand-tracking-works/. [Accessed 28 February 2021].
[9] LEAP, "Unity Plugin Overview — Leap Motion Unity SDK v2.3 Documentation," [Online]. Available: https://developer-archive.leapmotion.com/documentation/v2/unity/unity/Unity_Overview.html. [Accessed 28 February 2021].
[10] HealthLink BC, "Harmful Noise Levels," 19 July 2019. [Online]. Available: https://www.healthlinkbc.ca/health-topics/tf4173. [Accessed 31 March 2021].
[11] S. Oviatt, "Taming Recognition Errors with a Multimodal Interface," Communications of the ACM, vol. 43, no. 9, pp. 45-51, 2000. [Online]. Available: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.4.359&rep=rep1&type=pdf. [Accessed 12 March 2021]
[12] S. Oviatt, "Mutual Disambiguation of Recognition Errors in a Multimodal Architecture," Center for Human-Computer Interaction, Oregon Graduate Institute of Science and Technology at Oregon, USA, May 1999. [Online]. Available: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.219.7348&rep=rep1&type=pdf. [Accessed 12 March 2021]
[13] D. Kirby, "Running the Leap Motion Diagnostics," 7 September 2020. [Online]. Available: https://support.leapmotion.com/hc/en-us/articles/360004363657-Running-the-Leap-Motion-Diagnostics. [Accessed 5 April 2021]
