Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 216 (2023) 587–596

www.elsevier.com/locate/procedia
Abstract

Multimodal interfaces have enhanced human-computer interaction by enabling users to interact with computers using a combination of multiple input modes, providing increased accessibility to a wider range of users in various situations. The multimodal system’s ability to process multiple input modes allows it to rely on one input modal given that the second modal is unable to function due to exposure to extreme environments. This study will analyse a speech-gesture multimodal interface framework and the prototype that was initially developed by Sindy Dewanti and later improved upon by Regita Isada. To further improve the framework and prototype’s performance, this study will evaluate and resolve the issues encountered in the previous study regarding the configuration of each modal’s confidence levels, environment detection, weight calculation, and how the unification process selects a final semantic. Upon implementing the changes, the prototype was tested under three environmental conditions: normal, moderate, and extreme, in both unimodal and multimodal mode. The test results show that the prototype was able to deliver the expected results with improved accuracy in multimodal mode as compared to the previous study. Nonetheless, the way that the modals perform and the unification process can still be further improved.

© 2023 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the scientific committee of the 7th International Conference on Computer Science and Computational Intelligence 2022

Keywords: Multimodal interfaces, speech, hand gesture, unification-based signal fusion, speech-gesture multimodal framework, human computer interaction
1. Introduction
Technology advancements have allowed Human Computer Interaction (HCI) to more closely resemble human-human interactions. Where traditional HCI allows users to give input commands by typing or clicking a button, users
can now interact with computers the same way they interact with other people: audio-visual signals. The ability to
interact with computers using multiple input modes is called multimodal interaction. Oviatt [1] described multimodal interfaces as systems that process a combination of two or more input modes. These input modes
include, but are not limited to, touch, speech, gesture, and gaze. Multimodal interfaces have the potential to increase
usability for a wider range of people and function in less than ideal conditions. For instance, an interface that supports
both speech and gesture recognition would still be able to understand the user in noisy environments through the
gesture input. Additionally, the interface would be more accessible to people with disabilities as the users are able to
rely on either mode of interaction depending on their needs. Dewanti [2] has developed a framework to study the performance of multimodal speech-gesture interface systems under extreme environmental conditions. This
framework was used to develop a prototype that utilizes Leap Motion Controller for the Gesture modal and Windows
Speech Recognition API for the Speech modal. The prototype was then tested under different environmental
conditions (normal and extreme) to see whether the multimodal system allows for better input interpretation under
extreme conditions as compared to a unimodal system. The reliability of the interpretations produced by each input
modal is measured using a metric called confidence[2]. This study was later continued by Isada[3] who added more
input values, a moderate environmental condition for testing, verification for complete gestures, modification of the
metric formula, and attempted to improve the success rates of the system’s input interpretation. Although the test
results in Isada’s iteration gave better success rates, some concerns with the prototype were identified. They include
the configuration of confidence levels, the performance of the environmental volume reader, and adjustments for the
unification of input signals[3].
This study will cover an in-depth analysis of the framework to identify possible causes of the issues encountered
in the previous iteration. A deeper look into the framework can also help to identify potential flaws and areas for
improvement, after which an appropriate solution can be implemented to improve its performance. The formula for
calculating the result may need to be slightly modified depending on how many inputs are being used for testing.
The purpose of this study is to continue the research done by Dewanti[2] and Isada[3], implement the
recommendations they have given, attempt to resolve known issues and improve the functionality of the prototype.
Prior to the start of this study, the source code and Unity project used in Isada’s iteration as well as the test results
have been shared. As such, the process of this study will include an observation of the framework and the prototype system design; an analysis of the implementation, including the confidence of modals, the formulas, and the unification process; a review of the testing methods and results of the previous study; and the implementation of possible adjustments and new features for improving input interpretation. The expected outcome of this study is to have the issues and concerns
from the previous iterations resolved to improve the overall accuracy of the prototype in terms of environment
detection and input interpretation. This study can help to increase the understanding of multimodal interfaces by
exploring more of its potential for HCI as well as identifying areas for further research.
2. Problem Analysis
activation is synergistic. Based on the CARE model, the system is classified as “Redundancy”. As seen in the list of
registered commands, the prototype should be able to come to the same conclusion whether it is used as a
multimodal or a unimodal system.
K = 1 / (V + G)    (1)
In order to obtain the weighted confidence of each modal, their initial confidence is multiplied with the value of
K. The formula for calculating the weight is shown in (2) and (3):
WV = V × K    (2)
WG = G × K    (3)
The final confidence of the unified semantic, given that the ideal confidence is 1, is calculated by adding together
the weights of both modals. The formula for calculating the final confidence C is shown in (4):
C = WV + WG    (4)
To have a better understanding of the formula, if both modals are in their ideal environments and they each have
a confidence of 1, the value of K calculated with (1) would be 0.5. Following formulas (2) and (3), multiplying the
confidence of each modal with K results in both their weights being 0.5. By adding both weights together, the final
confidence is 1, following (4). A final confidence of 1 would mean that the final guess should be 100% trustworthy.
However, it is interesting to note that in the test cases where both modals guessed the same semantic, the final
confidence is always 1 despite the initial confidence of VOICE occasionally being 0.6 or 0.9. To test the formula
once again, assume both modals have an initial confidence of 0.3. The value of K calculated using (1) would be
1.67. The weights of each modal calculated with (2) and (3) would be 0.5, resulting with a final confidence of 1.
Isada mentioned that in cases where both modals guess the same semantic, it is acceptable to have the final
confidence as 1. Hence, the final confidence calculation was intentionally made this way. Based on the weight
formula, the weight can be said to represent the percentage of how much each modal’s initial confidence contributes
to the final confidence. If both modals have the same confidence, they each account for 50% of the final confidence,
making both their weights 0.5. Formula (4) simply adds these percentages together, resulting in a final confidence of
1 regardless of what each modal’s initial confidence was. Although having both modals guess the same semantic
means that they support each other’s guesses, having the final confidence boosted to 1 regardless of their initial
confidence does not accurately represent the environment of the multimodal interface. Additionally, it is not an
accurate representation of how reliable the final guess is, especially if the two modals individually have low
confidence. Having two unreliable sources produce the same semantic does not make the guess 100% reliable. If the
framework were to be implemented in a real system that has a minimum confidence requirement, the final output would have an
ideal confidence while the confidence of the individual modals might have been too low to pass the minimum
requirements. Additionally, while the purpose of the weight formula is to determine which modal is more reliable
based on their environmental conditions, currently it may not work as intended. The formula is under the assumption
that the confidence of each modal is affected by its environmental condition, hence the modal with higher
confidence also has more weight. However, as seen in the test results, the modals can have high confidence despite
not being in their optimal environment. In the case where the speech modal is in its normal environment for a
unimodal signal test, some of the results have a confidence of 0.6 (medium). In the moderate environment, some of
the results have a confidence of 0.9 (high). In a multimodal test case where the speech modal is in its normal
environment and the gesture modal is in its moderate environment, the speech modal produced results with a
confidence of 0.9, and the gesture modal returned results with a confidence of 1.0. The gesture modal returned a
result with 0.3 confidence in only one of the test cases [3]. The gesture modal will have more weight despite it being
in the less optimal condition. In these test cases, both modals guessed the same semantic, so the prototype was able
to obtain the correct final guess [3]. However, this may cause problems if the speech and gesture modals produce
different results. Modals in the less ideal environment are more prone to error and having high confidence does not
guarantee that it is more correct than the other modal. This is like how people can be confident about something and
still be mistaken due to misreading, mishearing, or other factors. The environmental condition is crucial, as ideal
environments would support the confidence level while non-ideal environments should set back the confidence. The
Fiolisya Faustine Ambadar et al. / Procedia Computer Science 216 (2023) 587–596 591
weight formula can be improved by adding another variable to represent the environment. For the speech modal, the
environmental volume reader can determine the environmental condition. For the gesture modal, there needs to be a
new method to measure the surrounding environment. Additionally, the formula for calculating the final confidence
needs to be re-examined so that the final confidence can provide a better representation of the multimodal
interface’s environment and the reliability of the final guess.
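The boosting behaviour criticised above can be reproduced with a short sketch. The formula bodies are reconstructed from the worked example in the text (K = 1/(V + G), each weight is the modal's confidence multiplied by K, and the final confidence is the sum of the weights); the function and variable names are illustrative only.

```python
def final_confidence(v, g):
    """Original fusion: final confidence from the initial confidences
    of the speech (v) and gesture (g) modals, per formulas (1)-(4)."""
    k = 1.0 / (v + g)        # (1) shared constant K
    w_v = v * k              # (2) weight of the speech modal
    w_g = g * k              # (3) weight of the gesture modal
    return w_v + w_g         # (4) final confidence C

# Both modals fully confident: C = 1, as intended.
assert final_confidence(1.0, 1.0) == 1.0
# Both modals weakly confident (0.3 each): C is still boosted to 1,
# which is the flaw discussed above.
assert abs(final_confidence(0.3, 0.3) - 1.0) < 1e-9
```

Because the weights always sum to 1, C equals 1 for any pair of initial confidences when both modals agree, which is exactly why agreement between two unreliable modals is overstated.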
2.9 Unification
When VOICE and LEAP guess different semantics, the guess with higher confidence
will be taken as the correct guess. As Isada mentioned, the sole reliance on confidence levels to choose the correct
semantic could boost the wrong semantic, as shown in the test case previously discussed. It was suggested that
another variable could be used to determine the more reliable modal, or the unification process could be
recalculated.
3. Solution Design
This study will implement the recommendations given by Isada as well as attempt to resolve new issues that were identified. In addition, this study will focus on modifying the internal components of the framework and the way that the
prototype was implemented in attempts to resolve the issues that occurred during testing. The framework
architecture does not require major changes, so it follows the structure used in the previous studies [2][3].
The formulas for calculating the modified confidence MV and MG are shown in (5) and (6), where V is the initial
confidence of the speech modal, G is the initial confidence of the gesture modal, and E is the environmental
condition.
(5)
(6)
The modified confidence will take the place of the initial confidence in the unified structure. Further processes
involving the confidence of each modal will use their modified confidence.
• Threshold Controller
As the signals from the speech and gesture modals are asynchronous, the threshold controller is the component
that holds the signal from each modal for a set amount of time before they are passed on for unification. When a
signal from one modal is received, the threshold controller will wait to receive a signal from the other modal. If the
second modal’s signal is received within the set period, both signals will be sent for unification, making it a
multimodal signal. If it does not receive a signal from the second modal before the held signal expires, only the one
signal will be sent as a unimodal signal [2][3]. In this framework, the threshold controller runs an observation
routine of new signals and held signals every 0.1 seconds. Report routines are triggered based on the 4 cases.
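The holding behaviour can be sketched as follows. The hold window length, the class, and the field names are assumptions; the text specifies only the 0.1-second observation interval.

```python
HOLD_SECONDS = 0.5   # assumed hold window; the text does not state its length
POLL_SECONDS = 0.1   # observation routine interval, as described in the text

class ThresholdController:
    """Holds the first modal's signal while waiting for the second.
    Two signals within the window form a multimodal pair; an expired
    held signal is released alone as a unimodal signal."""

    def __init__(self):
        self.held = None  # (modal_name, signal, arrival_time) or None

    def receive(self, modal, signal, now):
        if self.held is None:
            self.held = (modal, signal, now)
            return None
        held_modal, held_signal, held_at = self.held
        if modal != held_modal and now - held_at <= HOLD_SECONDS:
            self.held = None
            return ("multimodal", held_signal, signal)
        # Same modal repeated, or window missed: release the old
        # signal as unimodal and hold the new one.
        self.held = (modal, signal, now)
        return ("unimodal", held_signal)

    def observe(self, now):
        """Runs every POLL_SECONDS; expires a stale held signal."""
        if self.held is not None and now - self.held[2] > HOLD_SECONDS:
            _, held_signal, _ = self.held
            self.held = None
            return ("unimodal", held_signal)
        return None
```

A VOICE signal received at time t pairs with a LEAP signal arriving within the window; otherwise the periodic observation routine eventually emits the VOICE signal alone.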
constant K that affects both modals uniformly are not sufficient to determine which modal is more reliable. As the
initial confidence of each modal fails to detect the environmental condition, the weight of each modal will also fail
to determine which modal is in the more optimal environment. As previously mentioned [2][3], the weight is used as
a degree of trust. Even though the modals have high confidence, their environmental conditions could influence their
interpretation, and consequently, should lower their degree of trust. As previously mentioned, the unified structure
of each modal now contains their modified confidence instead of their initial confidence. The new confidence value
would be able to represent their environmental conditions better. For the weight calculation, the weight formula will
make use of each modal’s modified confidence. The constant K is used as a modifier given that the ideal final
confidence C is 1. A final confidence of 1 means that the multimodal interface is 100% confident of the final guess.
The formula for calculating K is shown in (7), now using the modified confidence of the modals instead of their
initial confidence.
K = 1 / (MV + MG)    (7)
Calculating the weight of each modal is shown in (8) and (9). It is similar to the original formulas in (2) and (3), only
instead of using the initial confidence, it uses the modified confidence.
WV = MV × K    (8)
WG = MG × K    (9)
As mentioned in the previous section, the weight of each modal represents the percentage of how much their
confidence contributes to the final confidence. The weighted confidence of each modal will first need to be
calculated by multiplying their weights with their modified confidence, as shown in (10) and (11).
CV = WV × MV    (10)
CG = WG × MG    (11)
The final confidence C is calculated by adding together the weighted confidence of each modal, as shown in
(12). The final confidence is therefore a weighted average of each modal’s confidence.
C = CV + CG    (12)
The weight resolver modifies the unified structure of each modal into a weighted structure, which will then be
passed on for unification.
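The revised pipeline described in (7) through (12) can be sketched end to end. Because this excerpt does not print the bodies of (5) and (6), the sketch assumes the modified confidence is the initial confidence scaled by an environment factor between 0 and 1; all names are illustrative.

```python
def fuse(v, g, e_v, e_g):
    """Revised fusion: v, g are initial confidences; e_v, e_g are
    environment factors in [0, 1], with 1 meaning an ideal environment."""
    m_v = v * e_v                      # (5) modified confidence, speech  (assumed form)
    m_g = g * e_g                      # (6) modified confidence, gesture (assumed form)
    k = 1.0 / (m_v + m_g)              # (7) K from the modified confidences
    w_v, w_g = m_v * k, m_g * k        # (8), (9) weights
    c_v, c_g = w_v * m_v, w_g * m_g    # (10), (11) weighted confidences
    return c_v + c_g                   # (12) final confidence C

# Agreement no longer forces C = 1: two weak modals (0.3 each) in
# ideal environments yield C = 0.3 rather than a boosted 1.0.
assert abs(fuse(0.3, 0.3, 1.0, 1.0) - 0.3) < 1e-9
```

Since the weights sum to 1, formula (12) is a weighted average of the modified confidences, so C now lies between MV and MG instead of being pinned at 1.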
• Unification Controller
The unification controller is the component where the weighted structures of each modal are unified. This
component produces a unified output of a multimodal semantic in which the final confidence is calculated using
(12). The timestamp is taken from the earliest signal and the valid time is taken from the signal with the latest valid
time [2][3]. For unimodal signals, the initial semantic and confidence are taken as the final semantic and confidence.
As there is only one signal to process, there is no need for weight calculation or fusion with another signal. They are
passed without modification [2][3]. In cases where the two modals guess different semantics, the calculation of the
modified confidence as previously described can help to determine which modal is more trustworthy. When both
modals have the same initial confidence but different environmental conditions, the modal in the more ideal
environment will have higher modified confidence and therefore can be taken as the correct guess. However, it is
another issue when both modals are exposed to the same environmental condition. Isada mentioned that when the
two modals have different semantics and the same confidence, the signal with the earlier timestamp will be taken as
the correct guess.
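The selection rule just described can be sketched as follows; the signal structure and field names are hypothetical.

```python
def unify(sig_a, sig_b):
    """Chooses the final semantic from two weighted structures."""
    # Agreement: both modals support the same semantic.
    if sig_a["semantic"] == sig_b["semantic"]:
        return sig_a["semantic"]
    # Disagreement: trust the higher modified confidence, which now
    # reflects each modal's environmental condition.
    if sig_a["modified_confidence"] != sig_b["modified_confidence"]:
        better = max(sig_a, sig_b, key=lambda s: s["modified_confidence"])
        return better["semantic"]
    # Tie on confidence: fall back to the earlier timestamp.
    earlier = min(sig_a, sig_b, key=lambda s: s["timestamp"])
    return earlier["semantic"]
```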
This study was a continuation of Dewanti’s [2] and Isada’s [3] work regarding multimodal interfaces and how they
perform under various environmental conditions. The framework developed by Dewanti focuses on Speech-Gesture
multimodal interfaces. The prototype that was developed utilizes LEAP Motion Controller (LEAP) for the gesture
modal and Windows Speech Recognition API (VOICE) for the speech modal. Confidence was used as a variable to
determine the best guess from each modal. Weight calculation was derived from the confidence level of each modal
and is used to calculate the final confidence as well as determining the correct semantic in situations where the two
modals guess different semantics. Most of the issues encountered during Dewanti’s study have been resolved during
Isada’s study, although there were new issues that needed to be addressed. The proposed changes include
reconfiguring the gesture mapping and confidence levels for LEAP, resolving the issue with the environmental
volume reader, adding a feature that measures the environment for LEAP, re-examining the weight calculation and
unification process, as well as adding more registered semantics for testing. The gesture mapping was reconfigured
to utilise 3 sub-semantics, namely palm direction, extended finger states, and finger direction. The confidence level
for LEAP was reconfigured to have the highest confidence of 0.9, similar to VOICE. This lowers the chances of
LEAP having higher confidence than VOICE when it produces an incorrect guess. The confidence level of each
modal was also reconfigured to take into account their environmental conditions. To measure the environment of the
modals, the settings on the environmental volume reader were modified based on the test results from the previous
study as well as an observation of VOICE’s ability to pick up signals. For the gesture environment, the prototype
utilises the environmental diagnostics provided by LEAP. The results from these environmental tests were then used
to modify the modal’s initial confidence to obtain a modified confidence. Similar to the previous study, the
prototype was tested in three environmental conditions: normal, moderate, and extreme. It can be concluded from
the results that the prototype was able to perform as expected in most of the test cases. There were improvements in
the prototype’s ability to distinguish which modal is more trustworthy based on their environmental conditions and
choosing the correct semantic. However, the results also pointed out some flaws regarding how the modals perform
individually. For the gesture modal, the deviation tolerance for palm direction and finger direction can still cause
ambiguous guesses, and there is still a bias with choosing a semantic in the event where multiple guesses have the
same confidence. Out of 210 tests, 168 were expected to have results. From these 168 tests, 165 produced the
correct results. The overall accuracy remains similar to Isada’s study at 98% [3]. In unimodal mode, the prototype
performed worse compared to the previous study. More specifically, it performed worse in the unimodal gesture
mode in the moderate environment where it guessed the incorrect semantic in two of the test runs. In unimodal
mode, the prototype guessed the correct semantic in 54/56 test runs, resulting in an accuracy of 96%. However, it
was shown that the prototype was able to perform better in multimodal mode. In situations where the two modals
guessed different semantics, it was able to choose the correct final semantic based on the weighted confidence. With
the incorporation of environment weights to modify each modal’s initial confidence, the weighted confidence levels
provide a better representation of their environmental conditions. Therefore, the prototype was able to determine
which modal is more reliable. The prototype was able to guess the correct semantic in 111/112 test runs, resulting in
an accuracy of 99% in multimodal mode.
Based on the test results, there are several improvements that can be made on the framework and prototype. One
of the issues is regarding the environment detection for the gesture modal. The results from the diagnostic tests were
not very accurate in detecting the environment even though LEAP’s performance in detecting the user’s input was
affected. For future implementation, the method of determining the environment for the gesture modal could be re-
examined. It was also suggested [2][3] that the framework could be used to develop a prototype with different input
devices as the way they perceive input and calculate confidence levels could be different. Furthermore, there were
several occurrences where LEAP produced multiple guesses with the same confidence and chose one semantic
based on hierarchy in the signal database. These ambiguous gestures could be caused by the deviation tolerance, and
similar gestures could be mistaken for one another. The deviation angle has been reduced from the previous study
and reducing it even further may affect its usability. Additionally, as more input commands are registered, having
more gestures that are closely similar is inevitable. A suggestion given by Dewanti is to improve the pipeline that
currently only supports one unified structure per signal, and only one semantic per unified structure. When each
modal has the ability to pass multiple semantics, it allows for mutual disambiguation between modals. The
Multimodal Integration Agent in QuickSet merges identical items in each of the modal’s feature structures to
compensate for errors in either modal. It was found that the gesture modal will have multiple interpretations in most
cases, and ambiguous gestures were resolved through integration with speech [3][4][5]. Further research with this
framework could be done focusing on mutual disambiguation and how it can improve the unification accuracy. In
the current test results, there was no occurrence where LEAP and VOICE produced different semantics with the same confidence; when this does occur, the unification will take the modal with the earlier timestamp. Flippo, Krebs, and Marsic mentioned that the best approach in cases of unresolvable ambiguity is to ask the user for clarification [5].
As shown in their framework, the dialog manager will check whether the final frame is complete, and if not, it will
ask the user to provide the missing information [5]. However, people may have different preferences
regarding this issue. Some people may find it inconvenient for the system to constantly ask them to repeat a
command, while some may prefer this over the system giving the incorrect output. As the goal of human-computer
interaction is to design a system that is comfortable and effective for human use, it would be beneficial to do more
research on users’ expectations.
References
[1] S. Oviatt, “Multimodal Interfaces,” In The human-computer interaction handbook: Fundamentals, evolving technologies and emerging
applications, vol. 14, pp. 286-304, 2003. [Online]. Available: http://pages.cs.wisc.edu/~bilge/private/Oviatt2003-MultimodalInterfaces.pdf.
[Accessed 18 January 2021]
[2] S. S. Dewanti, “Multimodal Interfaces: A Study on Speech-Hand Gesture Recognition,” Dept. Computer Science, Binus Univ. at Jakarta,
Indonesia, 2019. [Online]. Available: http://library.binus.ac.id/eColls/eThesisdoc/Lain-lain/Technical%20Report-bi-cs-2019-0055.pdf.
[Accessed 18 January 2021]
[3] I. Regita, “A Deeper Look in Multimodal Interfaces and Its Use in Extreme Conditions,” Dept. Computer Science, Binus Univ. at Jakarta,
Indonesia, 2020. [Online]. Available: http://library.binus.ac.id/eColls/eThesisdoc/Lain-lain/Technical%20Report%20new-is1-sn-cs-2020-
0074.pdf. [Accessed 18 January 2021]
[4] B. Dumas, L. Denis and S. Oviatt, "Multimodal interfaces: A survey of principles, models and frameworks." In Human machine interaction.
Springer, Berlin, Heidelberg, pp. 3-26, 2009. [Online]. Available: https://diuf.unifr.ch/people/lalanned/Articles/mmi_chapter_final.pdf.
[Accessed 19 February 2021]
[5] F. Flippo, A. Krebs and I. Marsic, “A Framework for Rapid Development of Multimodal Interfaces,” In Proceedings of the 5th international
conference on Multimodal interfaces, pp. 109-116, 2003. [Online]. Available:
https://www.ece.rutgers.edu/~marsic/Publications/icmi2003.pdf. [Accessed 28 February 2021]
[6] M. Johnston, P. R. Cohen, D. McGee, S. Oviatt, J. S. Pittman and I. Smith, “Unification-based multimodal integration,” Dept. Computer
Science and Engineering, Oregon Graduate Institute at Portland, OR 97291, USA, July 1997. [Online]. Available:
https://www.aclweb.org/anthology/P97-1036.pdf. [Accessed 28 February 2021]
[7] Unity, “Unity Platform,” [Online]. Available: https://unity.com/products/unity-platform. [Accessed 28 February 2021].
[8] LEAP, “How Hand Tracking Works,” [Online]. Available: https://www.ultraleap.com/company/news/blog/how-hand-tracking-works/.
[Accessed 28 February 2021].
[9] LEAP, “Unity Plugin Overview — Leap Motion Unity SDK v2.3 documentation,” [Online]. Available: https://developer-
archive.leapmotion.com/documentation/v2/unity/unity/Unity_Overview.html. [Accessed 28 February 2021].
[10] HealthLink BC, “Harmful Noise Levels,” 19 July 2019 [Online]. Available: https://www.healthlinkbc.ca/health-topics/tf4173. [Accessed
31 March 2021].
[11] S. Oviatt, “Taming recognition errors with a multimodal interface,” In Communications of the ACM, vol. 43, no. 9, pp. 45-51, 2000.
[Online]. Available: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.4.359&rep=rep1&type=pdf. [Accessed 12 March 2021]
[12] S. Oviatt, “Mutual Disambiguation of Recognition Errors in a Multimodal Architecture,” Center for Human-Computer Interaction, Oregon
Graduate Institute of Science and Technology at Oregon, USA, May 1999. [Online]. Available:
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.219.7348&rep=rep1&type=pdf. [Accessed 12 March 2021]
[13] D. Kirby, “Running the Leap Motion Diagnostics,” 7 September 2020 [Online]. Available: https://support.leapmotion.com/hc/en-
us/articles/360004363657-Running-the-Leap-Motion-Diagnostics. [Accessed 5 April 2021]