
Acknowledgments

I would very much like to express my deepest gratitude to the people who made this work
possible. I would like to thank my father who, ever since I was a child, sparked a passion for
computing in me, a passion that has never left me since.
I would like to express my sincere thanks to Professor Michael Schyns who, for the last four
years, nurtured that spark and turned it into a blazing flame, encouraging creativity and hard
work in his class, which led me to the place I am today. Thank you for allowing your students
to always work on innovative projects, with brand-new approaches and hardware every time.
Working on a project as innovative as the one you allowed me to, using the HoloLens 2,
taught me a lot about myself on the one hand, and about the possibilities of using technology in
my future career on the other.
Finally, I would like to address my sincere gratitude to the members of my family and my friends,
who never failed to offer support, counsel and, most importantly, smiles when I needed them.

Table of Contents

ACKNOWLEDGMENTS .....................................................................................................................1

ABBREVIATIONS .............................................................................................................................4

INTRO .............................................................................................................................................6

STATE OF THE ART ..........................................................................................................9

EXTENDED REALITY ..............................................................................................................................9


MIXED REALITY .......................................................................................................................................... 9
VIRTUAL REALITY ...................................................................................................................................... 12
AUGMENTED REALITY ................................................................................................................................ 16
MIXED REALITY ILLUSION ........................................................................................................................... 21
HOLOLENS 2 .................................................................................................................................... 22
USE CASES ............................................................................................................................................... 22
INDUSTRY 4.0 .................................................................................................................................. 23
FACTORS OF ADOPTION .............................................................................................................................. 25
MIXED REALITY AND AUGMENTED REALITY IN INDUSTRY 4.0............................................................................ 25
COMPUTER VISION ............................................................................................................................ 26

SOLUTION DESIGN ........................................................................................................................ 30

VISION............................................................................................................................................ 30
EXECUTION ...................................................................................................................................... 31
3D MODEL BUILDING ................................................................................................................................ 32
PRE-PROCESSING OF THE PICTURES .............................................................................................................. 34
MICROSOFT AZURE ................................................................................................................................... 42
HOLOLENS 2 APPLICATION INTERFACE .......................................................................................................... 49

SOLUTION ANALYSIS ..................................................................................................................... 52

BENEFITS ......................................................................................................................................... 52
RECOMMENDATIONS & FUTURE WORK .................................................................................................. 53
3D MODEL BUILDING ................................................................................................................................ 54
PRE-PROCESSING OF THE PICTURES .............................................................................................................. 55
MICROSOFT AZURE ................................................................................................................................... 56
HOLOLENS 2 APPLICATION INTERFACE .......................................................................................................... 57

CONCLUSION ................................................................................................................................ 58

APPENDICES.................................................................................................................................. 59

APPENDIX 1 : AGISOFT WORKFLOW....................................................................................................... 59


APPENDIX 2 : U-NET MODEL TRAINER................................................................................................... 61
APPENDIX 3 : JSON TO MASK ............................................................................................................. 63

APPENDIX 4 : MASK GENERATOR ......................................................................................................... 64
APPENDIX 5 : DOCKER BUILD ............................................................................................................... 66
APPENDIX 6 : ENTRY.BAT .................................................................................................................... 68
APPENDIX 7 : AZURE FUNCTIONS TRIGGERS ............................................................................................ 69

BIBLIOGRAPHY .............................................................................................................................. 71

Abbreviations

IoT : Internet of Things


AR : Augmented Reality
HCI : Human-Computer Interaction
MMI : Man-Machine Interaction
xR : Extended Reality
LSP : Logistics-service Providers
IM : Immersion
VR : Virtual Reality
EWK : Extent of the World Knowledge
CO : Coherence
MRI : Mixed Reality Illusion
VE : Virtual Environment
IVR : Immersive Virtual Reality
3D : Three Dimensions
CGI : Computer Generated Imagery
HMD : Head Mounted Display

Number of Words : 24 483

Intro
In the current global economic landscape, there is a marked acceleration in technological
advancements, incessant changes, and increased global interactions. These profound shifts
have ushered in what is often referred to as the "Fourth Industrial Revolution" (Hirt & Willmott,
2014). Developments in artificial intelligence, data analytics, cloud computing, and the Internet
of Things (IoT) are fundamentally altering business paradigms, paving the way for innovation,
enhancing efficiency, and fostering competitiveness for stakeholders encompassing
consumers, entrepreneurs, and investors. Contemporary business models are more resource-
efficient, and barriers to entry have been considerably lowered, enhancing overall accessibility.

Technological advancements have revolutionized the daily operational methodologies of
businesses, most notably epitomized by the migration towards digitalization and Industry 4.0.
Through technological interventions, businesses can automate mundane tasks, refine processes,
and enhance operational efficiency, leading to significant cost reductions. Another paradigm
shift has been the transition from traditional on-premises work to remote collaborations. Cloud-
based applications and virtual collaboration platforms have decentralized workspaces,
providing employees unparalleled flexibility. The recent pandemic underscored the feasibility
of remote work, thereby expediting the pace of digital transformation and necessitating
adaptive operational strategies.

Specifically, within the logistics sector, technological innovations have enhanced supply chain
management efficacy. IoT sensors, for instance, enable real-time tracking of inventory and
shipments, facilitating logistics personnel in optimizing transportation routes and schedules.
Augmented reality (AR) can be instrumental in warehouse management, either by offering
virtual training environments or by delivering real-time task-specific information, aiding
decision-making.

However, digital transformation transcends mere infrastructural upgrades. It represents a
comprehensive shift, mandating organizations to perpetually re-evaluate their operational
paradigms. While technology underpins this transition, strategic leadership remains pivotal.
Human resources emerge as a potential challenge, necessitating adept management strategies
to navigate this complexity (Cichosz, Wallenburg, & Michael, 2020). Companies that remain
complacent, neglecting proactive digital strategizing, face an elevated risk of obsolescence,
whereas forward-looking entities are better positioned for success (Sullivan & Kern, 2021).

Human-Computer Interaction (HCI), alternatively termed Man-Machine Interaction (MMI)
in scholarly circles, is a specialized domain within computer science. It is dedicated to the
"design, evaluation, and implementation of interactive computing systems for human
utilization and the exploration of associated phenomena" (Sinha, Shahi, & Shankar, 2010). HCI
operates on the premise that sophisticated machinery necessitates adept users. Usability and
functionality are intertwined; a system's efficacy hinges on its seamless utilization by users.
Striking a balance ensures that technology serves its intended purpose, considering both its
capabilities and the expertise of the user.

A promising frontier within HCI, with potential transformative implications, is Extended
Reality (xR). As xR technologies evolve in sophistication and accessibility, they present novel
avenues for businesses to engage stakeholders, enhance training modules, visualize complex
data sets, and streamline processes. By delivering real-time data, bolstering situational
awareness, and providing task-specific assistance, xR stands to enhance decision-making,

minimize errors, and elevate productivity. Moreover, its role in ensuring worker safety through
real-time alerts offers an insightful glimpse into the manifold applications of this emerging
technology.

Within the discourse of HCI, the user's environment and surroundings emerge as quintessential
elements. The physical and digital realms, while distinct, are intrinsically interconnected,
particularly as we navigate the cusp of the Fourth Industrial Revolution. The HCI paradigm
emphasizes the significance of a user-centric approach, considering the environmental context
in which interactions transpire. Extended Reality (xR) perfectly encapsulates this
interconnection, blurring the lines between the tangible and virtual domains. By superimposing
digital information onto the real world, or by immersing the user in a completely virtual
environment, xR harmoniously integrates the user's surroundings into the interaction process.
This immersive fusion ensures that HCI isn't just about the interface or the device but also
encompasses a broader ecological perspective, ensuring that technology is holistically
intertwined with the user's spatial and contextual reality.

Figure 1 - Mixed Reality components (Microsoft, 2023)

xR is also bringing a breath of fresh air to the interaction between companies and customers.
Immersive and interactive experiences such as virtual product demonstrations, 3D
visualizations, interactive training modules or even immersive entertainment are brand new
opportunities brought by xR.

One promising device that has recently gained attention in this field is the Microsoft HoloLens
2, a headset that provides virtually generated assets in augmented reality, interacting with the
surroundings of the user, to create an immersive and interactive world based on overlapping
realities – the mixed reality.

The HoloLens 2 is equipped with a variety of sensors and cameras: depth cameras (LiDAR –
Light Detection And Ranging), RGB cameras, and inertial sensors, which enable it to capture
and process information about the user's environment in real-time. This device can be used in
a range of fields, from education and healthcare to entertainment, sales, or manufacturing. The
possibilities are endless, which makes the HoloLens 2 a highly polyvalent and versatile tool.
The device's limitations stem more from its hardware than from its concept or software.

In this thesis, we will explore the potential of the HoloLens 2 in the context of logistics
applications. Logistics is a complex and rapidly evolving industry that involves the movement,
storage, and processing of goods and materials. A core struggle in the field is the lack of
resources, human and financial, which eventually inhibits the digital transformation as defined
earlier (Cichosz, Wallenburg, & Michael, 2020). Therefore, we will investigate how technology
can be used to take a step forward in tackling this issue.

Specifically, we will focus on developing a proof-of-concept app using the HoloLens 2 that
could eventually lead to several logistics use cases, such as tracking inventory, identifying
goods and facilitating the packing of parcels. By doing so, we aim to reduce the time wasted
by each worker during a given process, relieving them from needless and time-consuming
tasks – improving efficiency as a direct consequence. From a managerial point of view, this
should affect the cost structure as well. Indeed, apart from the initial investment that the
HoloLens 2 represents, significant savings could be made by logistics-service providers (LSPs).
That would eventually allow more workforce to be enrolled or, in any case, have a positive
financial impact.

The findings of this research will contribute to the growing body of literature and possibilities
using mixed-reality devices and provide insights into the potential applications of this
technology in logistics. Furthermore, this thesis will provide a foundation for further research
in this field, including the development of more advanced algorithms and the integration of the
HoloLens 2 with other technologies such as drones and robots.

In this piece of work, we delve into the distinctive capabilities of the HoloLens 2, exploring
how integrating Azure-driven computational methods with advanced photogrammetry
techniques can optimize and fully harness the device's inherent potential through innovative
solution design. Photogrammetry has been a crucial tool in various sectors for digitally
capturing the geometry of physical objects using photographs. As technological advances pave
the way for more efficient methods, cloud solutions like Azure provide not only a backbone to
handle and process the computational power required in these tasks, but also a highly portable
solution that can be flexible and work under different conditions.

Given its hardware constraints, the headset can't perform intensive computational tasks like
those required by photogrammetry in real-time. However, by leveraging Azure's computational
prowess, it is possible to offload these tasks to the cloud. This amalgamation of Azure and
photogrammetry creates a synergy that optimizes the capabilities of the HoloLens 2.
Moreover, the device's Mixed Reality features, combined with on-the-go 3D model generation,
could create endless possibilities to tackle modern issues as well as those yet to arise.
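To make this offloading idea more concrete, below is a minimal, hypothetical sketch – not the implementation detailed later in this work – of how pictures captured during a session could be pushed to Azure Blob Storage so that a cloud-side job (for instance, an Azure Function watching the container) can run the heavy photogrammetry step. The connection string variable, container name and local folder are assumptions made for the example.

```python
# Hypothetical sketch: upload a folder of captured pictures to Azure Blob Storage
# so the computationally heavy photogrammetry can run in the cloud instead of on
# the headset. Names below (env var, container, folder) are placeholders.
import os
from azure.storage.blob import BlobServiceClient

CONNECTION_STRING = os.environ["AZURE_STORAGE_CONNECTION_STRING"]  # assumed env var
CONTAINER_NAME = "hololens-captures"                               # hypothetical container


def upload_capture_session(folder: str) -> None:
    """Upload every picture of a capture session to the blob container."""
    service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
    container = service.get_container_client(CONTAINER_NAME)
    for file_name in sorted(os.listdir(folder)):
        if not file_name.lower().endswith((".jpg", ".png")):
            continue  # skip anything that is not a picture
        with open(os.path.join(folder, file_name), "rb") as data:
            container.upload_blob(name=file_name, data=data, overwrite=True)


if __name__ == "__main__":
    upload_capture_session("./session_001")  # hypothetical local folder
```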

Therefore, we can state the starting point of this research clearly:

How can we harness the inherent capabilities of the HoloLens 2
to devise sophisticated applications pertinent to logistics?

State of the Art
This chapter aims at giving a framework for the present work within the existing literature for
all fields of study related to the subject and the proposed solution to the aforementioned problem.
The objective is not to give an exhaustive definition of every aspect of this piece, but rather to
provide common knowledge to every reader and ensure a proper understanding of the designed
solution.

Extended Reality

As defined by Nvidia, a computer hardware company, “Extended reality, or xR, is an umbrella
category that covers a spectrum of newer, immersive technologies, including virtual reality,
augmented reality and mixed reality.” As the definition states, and contrary to popular belief,
extended reality is a spectrum, a continuum, rather than a scale. Some elements are characteristic
of and specific to one classification, while others may belong to multiple categories.

In this section, we will define the taxonomy1 that will be used in this work to allow a clear
understanding of each term and its limitations. In 1994, Paul Milgram and Fumio Kishino
published “A Taxonomy of Mixed Reality Visual Displays”, a paper well ahead of its time that
has been referenced as the blueprint of the field. Most of the upcoming section is based on the
groundwork they laid.
Unfortunately, the definitions and exact boundaries of each term may differ, and sometimes
conflict, from one author to another. To stay consistent with the context of this paper (i.e., the
HoloLens 2, developed by Microsoft), we will try to stick to their definitions while taking into
consideration criticism from other sources.

Mixed Reality

Figure 2 - Mixed Reality Spectrum according to Microsoft (Microsoft, 2023)

Bill Gates’ company opposes two extrema on the mixed reality spectrum: the physical and
digital worlds. The former is the one we live in every day, while the latter is based on a computer-
generated experience that replaces the actual world the user perceives.

1 Taxonomy: [uncountable] the scientific process of classifying things into groups that share similar qualities.

Therefore, the only axis considered when it comes to mixed reality is how much digitally
immersive content is generated around the user. If we look at other suggested models, such as
the XR Framework (Figure 3; X standing for “any”, not “Extended”), there is a complete break
of the continuum, classifying MR as a mere subcategory of AR and leaving no room for nuance
around MR. Concepts such as Local Presence and Telepresence will be defined later on.

Figure 3 - XR Framework (Rauschnabel, Felix, Hinsch, Shahab, & Alt, 2022)

If we now have a look at Milgram & Kishino’s work (Figure 4), their Reality-Virtuality
Continuum opposes Real and Virtual Environments, but it revolves entirely around the realism
of the image and does not take into consideration the user and their perception of the
experience – the presence.

Figure 4 - Reality-Virtuality (RV) Continuum (Milgram & Kishino, 1994)

Therefore, some found this approach reductive of future possibilities and created a more
inclusive taxonomy. In an updated version of Milgram and Kishino’s work, the paradigm of
MR is based on three criteria: Immersion (IM), Extent of the World Knowledge (EWK) and
Coherence (CO) (Skarbez, Smith, & Whitton, 2021). The objective was to find a common
classification, clear enough to be applicable yet precise enough to leave no room for
terminological doubt.

• EWK represents “the extent to which the system is aware of its real-world surroundings
and can respond to changes in those surroundings” (Skarbez, Smith, & Whitton, 2021).
Perfect EWK would be an IoT network that could undoubtedly sense and predict the
behavior of any real-world event, since it would have a complete and flawless
understanding and awareness of every physical being or object, i.e., omniscience.
• CO, sometimes called Fidelity (Alexander, Brunyé, Sidman, & Weil, 2005) or
Authenticity (Gilbert, 2017) in the literature, determines the plausibility of an
experience. As simple as it is, perfect coherence is actually physical reality – where
everything that happens must be possible according to physics and the sole laws of nature.
• IM assesses the levels of engagement and engrossment of the user in a given xR
experience (Brown & Cairns, 2004). This is what is conventionally aimed at when it
comes to VR: a perfect and fully computer-generated world, fully isolating the user
from physical reality. As an example, we can think of Spielberg’s 2018 movie “Ready
Player One”, where the main character switches from a dystopian town to a futuristic
and idealistic virtual world thanks to his Head-Mounted Display (HMD), and where the
former and the latter have absolutely no connection to one another while the user has
a real-life-like experience when using his headset. It must be mentioned that no matter
the amount of EWK or CO, if there is no IM, the MR experience simply doesn't exist.

These criteria give us a three-dimensional space to work with. That being said, given the current
technological state, we can consider EWK, IM, and CO as three asymptotes – unachievable goals.
Indeed, it is impossible for a system to be fully aware and predictive of every single event in
the world in real time, just as it is infeasible to give users a life-like experience in VR (although
we are getting closer to it) or hopeless to code every law of physics, since some are yet to be
discovered and understood. We will now explore the possibilities each combination gives us
and exemplify them with concrete use cases.
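To make this three-criteria space a little more tangible, here is a small illustrative sketch – purely a reading aid, not part of the taxonomy itself – that represents an xR experience as a point on the IM, EWK and CO axes. The 0-to-1 numeric scale and the example values are arbitrary assumptions made for the example.

```python
# Illustrative sketch only: an xR experience as a point in the (IM, EWK, CO) space
# of Skarbez et al. (2021). The 0.0-1.0 scale and the example values are arbitrary
# choices made for this example, not part of the published taxonomy.
from dataclasses import dataclass


@dataclass
class XRExperience:
    name: str
    immersion: float        # IM, from 0.0 (none) to 1.0 (asymptotic, unreachable)
    world_knowledge: float  # EWK
    coherence: float        # CO

    def is_mr_experience(self) -> bool:
        # Per the taxonomy, without any immersion there is no MR experience at all.
        return self.immersion > 0.0


examples = [
    XRExperience("Assisted Reality smart glasses", immersion=0.2, world_knowledge=0.8, coherence=0.9),
    XRExperience("VR flight simulator", immersion=0.9, world_knowledge=0.1, coherence=0.9),
]
for e in examples:
    print(f"{e.name}: MR experience = {e.is_mr_experience()}")
```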

Figure 5 - Objective criterion against perception (Skarbez, Smith, & Whitton, 2021)

Place illusion, world awareness and plausibility illusion have already been discussed when
introducing their respective concepts. We will thus move on to their combination one at a time.

- Replicated World Illusion: The user is fully immersed in a virtual copy of the real
world.
- System Intelligence Illusion: The state of a system that is fully aware of its surroundings
and uses that information in ways that do not violate the principle of coherence. (Not an
MR experience.)
- Presence, or telepresence to be more accurate, is described as the subjective perception
of “being there” – Total Immersion (Brown & Cairns, 2004). In the literature, the
distinction made is based on whether the user’s environment is mediated by technology
– which is the case here (Steuer, 1992). While IM assesses the degree of engagement,
telepresence adds a layer of reality to the user’s experience, which reinforces the illusion
of realness of MR, eventually bringing the user beyond the mirage and providing an
emotional connection between the user and the computer-generated imagery (CGI), i.e.,
empathy (Brown & Cairns, 2004). Since telepresence is based on a subjective perception,
there unfortunately isn’t a universally accepted set of criteria, although we do find
recurring ways of quantifying and analyzing presence more or less formally. These are:
subjective measures through the iGroup Presence Questionnaire or the ITC Sense of
Presence Inventory (Schwind, Knierim, Haas, & Henze, 2019; Brown & Cairns, 2004),
behavioral measures (Slater M., 2009; Boem & Iwara, 2018), physiological measures
(Meehan & Insko, 2002) and performance measures (Huegel, Celik, & Israr, 2009).
- Mixed Reality Illusion (MRI) combines all 3 criteria in a way that transcends CGI.
Indeed, at best, the user cannot perceive the distinction between what is real and what
is not; the system blurs the border between virtuality and reality.

Nevertheless, it shall be mentioned that not all applications of MR require perfect MRI. Indeed,
use-case-specific compromises and concessions must be made in order to ensure the intended
impact on the user. For example, when asked, gamers stated that the most important aspect of
an ideal VR-gaming experience is the atmosphere, which builds up immersion and, in the right
conditions, telepresence (Brown & Cairns, 2004). This means the effort of the application
designer shall not be put on EWK but rather on other vertices of the matrix. On the other hand,
Assisted Reality smart glasses like aRdent require a high CO and a relatively important EWK,
since they give live information to workers, but do not need to provide an immersive experience
(Get Your Way, 2023).

Naturally, we can see that this reconciling framework makes room for the multiple extended
reality technologies that exist on the spectrum of MR, each one sitting at a different place in
the matrix-spectrum defined. We will now proceed to define some of them based on Skarbez
et al.’s 2021 taxonomy.

Virtual Reality

“VR is defined as the use of a computer-generated 3D environment – called a ‘virtual
environment’ (VE) – that one can navigate and possibly interact with, resulting in real-time
simulation of one or more of the user's five senses.” (Guttentag, 2009). When it comes to VR,
we usually agree on four essentials for an experience to be considered part of the spectrum
(Alqahtani, Daghestani, & Ibrahim, 2017):

- Virtual World: CGIs integrated with each other, where the images and objects generated
must be linked by relationships (i.e., there must be rules between objects that respect a
certain order and laws specific to the Virtual Environment depicted)

- IM: cfr. former section
- Sensory feedback: Results obtained by the user based on their input. They can be
perceived through any of the five senses, although sight and hearing are the most common.
When it comes to touch, existing technologies are able to mimic certain sensations, such
as haptic gloves: wearable accessories, to be combined with VR, that can simulate
resistance from the virtually generated material. Although a range of such wearables
exists, none of them is able to reproduce texture yet (Perret, 2018). Smell and taste are
the hardest senses to stimulate since they require the most external interaction with the
user. Some existing technologies use electrical stimulation to fake interaction with the
body, but they require a much heavier HMD as well as other gear and are thus not
marketable yet (Kerruish, 2019).
It is worth mentioning that we classify sensory experiences based on two distinct criteria:
depth and breadth. The quality of the cues is referred to as depth, while the quantity the
user has to process synchronously is called breadth. To exemplify, let us consider a VE
taking place in the rain forest, using an HMD and an audio headset, since this is the most
common gear used when exploring VR. To create an atmosphere, there are trees and rain
sounds, things we would normally find in such an environment. In order to increase
sensory breadth, we would create a visually more diverse fauna and flora, as well as
adding animals’ cries, footsteps, and the sound of tree leaves. When it comes to depth,
the quality of the models would be upgraded, in addition to adapting the footsteps’
sounds to the type of ground we are walking on, using stereo sound, etc. This logic can
be transposed to any of the five senses, with limitations obviously being tied to hardware
possibilities.
- Interactivity: The ability for the user to affect objects in the virtual world. It should also
be possible for the user to change their location and angle (i.e., point of view)
(Wohlgenannt, Simons, & Stieglitz, 2020). Otherwise, the experience would be limited
to a mere video yet still be branded VR. In that sense, so-called VR roller coasters are
an extrapolation and a misleading name, since most of them only deliver 3D videos
with extra sensory stimuli (moving seats, wind, olfactory elements, etc.) – making them
more like simulations in that sense (Biocca & Levy, 1995).

Some researchers base the definition of VR on the feeling of “being there”, which makes
a clear link with the above-mentioned definition of telepresence (Wohlgenannt, Simons, &
Stieglitz, 2020). VEs are to this day the most complex HCI interfaces yet, since complete
freedom of action for the user is virtually (in the technical sense) possible.

In our exploration of contemporary literature, we discovered that Steuer recommended splitting
VR into categories based on experience, specifically presence. However, a clear definition is
subject to change depending on the author – making VR an umbrella term for any sort of
CGI-based experience (Takac, Collett, Conduit, & De Foe, 2021). Some use arbitrary,
study-specific specifications, creating a misuse of the correct terminology and eventually chaos,
due to conflicting definitions in the literature as well as the impossibility of referencing such
work. Furthermore, others instead use immersion criteria, hence the concepts of Non-immersive,
Semi-immersive and Immersive VR. Once again, the issue lies in the very concept of IM. Since it
has been defined in a subjective way, boundaries are blurred and highly difficult to measure
(Takac, Collett, Conduit, & De Foe, 2021).

Although there is a consensus on clear examples of non-immersive and semi-immersive VR,
this inconsistency makes us look for a different approach: a hardware-based categorization of
VR experiences. Indeed, since gear is absolutely necessary to experience IM, we can safely
state that VR is hardware dependent.

Two psychological concepts closely associated with hardware-centric definitions of VR play a
significant role: the cognition associated with the environment, influenced by sensory input
(the perception formed in the mind), and the involvement of the physical body, as an avatar
of the mind (Biocca & Levy, 1995). These two smooth (i.e., synchronized) sensory procedures
are conditional on two intertwined hardware abilities: the coordination of simulation in three-
dimensional space (the VE) and the tracking of movement through sufficient degrees of freedom
(Takac, Collett, Conduit, & De Foe, 2021). To put it simply, once the user moves (body
involvement), the system should be able to follow up on that input, update the VE simultaneously
and accurately, and eventually stimulate the perception of the user (mind impact).

Degrees of Freedom (DoF) refer to the number of distinct types of movements or orientations
that a user can perform or experience, usually 3DoF or 6DoF (Sherman & Craig, 2018). The
ability to rotate around three axes – yaw (turning left or right), pitch (looking up or down), and
roll (tilting the head) – is usually found in simple devices such as smartphones or the Google
Cardboard. In comparison, 6DoF adds strafe (moving left or right), elevation (moving up and
down) and surge (moving forward and backward), which is found in higher-end HMDs, such
as the ones from Oculus or Valve.
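As a simple illustration, the difference between 3DoF and 6DoF tracking can be expressed as data: a 3DoF pose only carries orientation, while a 6DoF pose adds position, so that movement through space can be mirrored in the VE. The sketch below is a minimal, hypothetical representation and is not tied to any particular SDK.

```python
# Minimal sketch: 3DoF vs. 6DoF poses expressed as plain data structures.
from dataclasses import dataclass


@dataclass
class Pose3DoF:
    yaw: float    # degrees, turning left/right
    pitch: float  # degrees, looking up/down
    roll: float   # degrees, tilting the head


@dataclass
class Pose6DoF(Pose3DoF):
    x: float  # metres, strafe (left/right)
    y: float  # metres, elevation (up/down)
    z: float  # metres, surge (forward/backward)


# A 3DoF device can only report where the user is looking...
looking = Pose3DoF(yaw=90.0, pitch=-10.0, roll=0.0)
# ...while a 6DoF device also reports where the user is standing.
standing = Pose6DoF(yaw=90.0, pitch=-10.0, roll=0.0, x=1.2, y=1.7, z=0.4)
```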

Blocking visual stimuli other than the VE is a crucial hardware feature in VR, especially since
it helps manage sensory overload and enhances focus. In immersive VR (IVR), visual
overload can occur due to the richness of stimuli, potentially preventing telepresence
altogether. By selectively blocking or filtering certain visual stimuli, VR systems can create a
more comfortable and immersive experience. This is particularly beneficial in educational or
training VR applications where focus on specific tasks or objects is required (Sherman & Craig,
2018). Unlike AR, which will be discussed later on, VR HMDs have no see-through capability
and are usually worn like an opaque eye mask.

Figure 6 - Hardware-based virtual reality qualification matrix (H-VRQM) (Takac, Collett, Conduit, & De Foe, 2021)

Based on Figure 6, we now have an objective VR classification, easily usable and
understandable, relying on measurable and checkable criteria.

Considering our new classification, let us formulate a more restrictive definition of VR that stays
consistent with the elements brought forward. Virtual Reality (VR) is defined as a fully visually
immersive, computer-generated environment – the VE – that enables users to navigate and
interact through the application of synchronized sensory simulation, supporting varying degrees
of freedom (DoF), which allow users to perform movements that are accurately mirrored within
the VE. The result is a real-time sensory simulation that can engage one or more of the user's
five senses, offering a convincing illusion of being present in the simulated environment.

We find in the literature (cfr. Figure 3) not exactly categories, but rather slight variations of VR
that still fit both our new restrictive definition and the MR matrix defined in the previous
section. These are based on the concept of telepresence and its intensity. The less intense the
experience feels, the closer it is to Atomistic VR, and vice versa for Holistic VR. The former is
usually used when immersion is not really important for the core function of the application –
for example, training workers. Conversely, Holistic VR finds its true use when the user must be
fully immersed for the good of the operation, such as when performing stress tests (Rauschnabel,
Felix, Hinsch, Shahab, & Alt, 2022).

Therefore, we can safely assume VR is a type of MR experience focusing on IM and CO. One
might wonder, then, where EWK has its place in VR. Once again, if we recall our initial
objective – reconciling the existing frameworks around one consistent model – we can check
what a variation of EWK means. From the RV Continuum, this would bring us into Augmented
Virtuality, which is mainly a VE with some physical artifacts integrated. These can either be
actual objects integrated into the field of view, or data from the real world that has an impact
on the VR experience.

Consequently, the less EWK we have, the deeper we dive into virtuality. This means the RV
Continuum also has a place in our model. One might then wonder: what happens when there is
no EWK at all? Would that fall outside the scope of MR? This question is a bit trickier. Indeed,
so far, it has always been a question of exteroceptive2 perception and senses, while
interoceptive senses are untouched – which produces conflicting feelings internally. In other
words, however strong the illusions of place and plausibility may be, one remains connected
to their internal senses and can realize at any moment that they are a physical being interacting
with a digital world through an avatar. This is why the updated version of Milgram &
Kishino’s 1994 paper introduces an extension of the RV Continuum, which would be the only
VE falling outside the scope of MR: the Matrix-like VE. This type of VE would mean the user
is completely ignorant of their existence in the physical world and is solely aware of themselves
in the VE.

Figure 7 - Updated RV Continuum (Skarbez, Smith, & Whitton, 2021)

2 Exteroceptive: [adjective] that responds to external stimuli.

Augmented Reality

Augmented Reality (AR) bridges the gap between digital and physical realities by
superimposing computer-generated objects onto the user's real-world environment, resulting in
a mixed reality (Milgram & Kishino, 1994). This technology, through its interaction with our
sensory perceptions, has transformed how we interact with digital platforms (Carmigniani et
al., 2011). In opposition to VR as defined in the current paper, where the user cannot see the
physical world around them, AR superimposes CGIs upon reality. It is used as an extended
experience of our world, where virtual and real objects coexist in the same space from the point
of view of the user (Azuma, A Survey of Augmented Reality, 1997). Using additional work
from the same author (Azuma, 2001), we can outline the overall qualities a system must respect
to be branded as AR.

The core characteristics of Augmented Reality (AR) can be classified into the following
categories:
- Combination of Real and Virtual Worlds: The fundamental characteristic of AR is
that it merges the physical and digital realms. Unlike Virtual Reality (VR), which
completely immerses users in a fully artificial environment, AR overlays digital content
onto the user's real-world environment. In other words, there is an absolute requirement
for some EWK, the extent of which depends on the expected use of the app.
- Interactive in Real Time: AR systems are designed to be interactive in real time. The
digital elements that AR adds to a user's view of the real world can change or respond
instantaneously based on the user's actions or changes in the environment. Indeed,
system delays can cause dynamic errors and ruin, or at least impair, the experience.
Therefore, we measure the end-to-end system delay as the lag between the moment the
tracking system captures the position and orientation of the viewpoint and the moment
the images created to match that pose are shown on the display (a small illustrative
sketch of such a measurement follows this list). There exist techniques to reduce this
issue, but we will not elaborate on them since they are outside the scope of this work.
- Context-Awareness: Advanced AR systems are context-aware. They not only overlay
digital content onto the physical world but also understand and respond to the user's
environment. For example, an AR system might recognize a specific object in the user's
environment and provide information or additional digital content related to that object.
- Immersive Sensory Experience: While the visual component is the most notable
aspect of AR, the technology can engage other senses too. Sound, haptic feedback, and
even smell can be incorporated to provide a more immersive sensory experience. This
is especially true, and necessary, for some applications, such as surgery, where surgeons
operating remotely had access to force feedback to help with precision and to simulate
organs’ texture and stiffness (Samset, 2009).
- Use of AR Devices: AR experiences are delivered through devices, which can range
from smartphones and tablets (using their camera to capture the real world and screen
to display the added digital elements) to smart glasses or AR headsets, which allow
users to see the real world with digital content overlaid directly onto their field of view.
- 3D Registration: AR systems use 3D registration to align digital overlays with the real
world accurately. This means digital content appears anchored to the physical world as
if it were a part of it, maintaining its position even when the user moves around. This is
a key component when building place illusion (i.e., IM). 3D registration is required in
most AR applications; indeed, this is what gives an impression of stability in AR
experiences. Bad alignment compromises CO as well, since we could then have a digital
cup of coffee going through a physical table. Visual capture, a phenomenon where our
brain prioritizes visual information over other sensory inputs, can mask registration
errors in Virtual Environments (VE). This means users might perceive their physical
hand as being where the virtual hand is displayed, even if there is a misalignment. In
contrast, Augmented Reality (AR) requires more precise registration because any
misalignment between real and virtual objects becomes readily apparent due to the high
resolution and sensitivity of the human visual system. This is usually referred to as local
presence: the smooth blending of virtual objects into the user's perception of their
immediate real-world environment (Slater & Wilbur, 1997).
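As announced above, the following is a minimal, hypothetical sketch of the idea behind measuring end-to-end system delay: timestamp the moment a pose is sampled and the moment the frame rendered from that pose is presented, then take the difference. Real AR runtimes expose such timestamps themselves; the tracker and display callbacks below are placeholders, not an actual device API.

```python
# Illustrative sketch only: motion-to-photon (end-to-end) delay measured by
# timestamping pose capture and frame presentation. Placeholder functions stand
# in for the tracker and the display callback of a real AR runtime.
import time


def sample_pose() -> dict:
    """Placeholder tracker: returns a pose with the time at which it was sampled."""
    return {"timestamp": time.perf_counter(), "position": (0.0, 0.0, 0.0)}


def on_frame_presented(pose: dict) -> float:
    """Placeholder display callback: called once the frame rendered from `pose`
    reaches the display; returns the end-to-end delay in seconds."""
    return time.perf_counter() - pose["timestamp"]


pose = sample_pose()
# ... rendering of the frame that matches this pose would happen here ...
delay_ms = on_frame_presented(pose) * 1000
print(f"End-to-end system delay: {delay_ms:.1f} ms")
```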

To palliate the 3D registration issue, there exists a solution for anchoring the object in a
defined place in space: the use of markers, also known as Recognition-Based AR (Azuma,
1997). This type of AR uses a camera and some type of visual marker, usually a QR code, to
produce a result when the marker is sensed by a reader. The AR device calculates the position
and orientation of a marker to position the content, sometimes allowing the user to interact with
the virtual content. Such an approach does not require calibration from the user and usually
offers faster CGI rendering (Oufqir, 2020; Carmigniani, Furht, Anisetti, & Ceravolo, 2010).
However, it does not fix all issues. Indeed, apart from requiring specific set-ups in a given
space, Marker-Based AR does not provide comprehensive data for all AR tasks. Specifically,
they do not provide absolute depth information, which is crucial for effectively blending virtual
and real objects. This lack of depth information can lead to virtual objects appearing 'flat' or
not properly integrated into the real world, eventually limiting interactions between virtual and
real objects – diminishing the overall AR experience (Feng, Been-Lirn Duh, & Billinghurst,
2008). A proposed solution to these issues involves the use of additional sensors or user input.
Laser rangefinders, for instance, can be used to create an initial depth map of real objects in
the environment, facilitating more precise registration. By aligning these depth maps from the
real and virtual environments, the system can accurately overlay the virtual objects onto the
real world (Feng, Been-Lirn Duh, & Billinghurst, 2008).
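For illustration, the sketch below shows the typical marker-based pipeline – detect the marker in the camera image, then estimate its pose so that virtual content can be anchored to it – using OpenCV's ArUco module. This is a generic example rather than the approach used in this work; the camera intrinsics and marker size are placeholder assumptions, and the ArUco function names may differ slightly between OpenCV versions (the module was reorganized in OpenCV 4.7).

```python
# Hedged sketch of recognition-based (marker) AR with OpenCV's ArUco module.
# Camera intrinsics, distortion and marker size below are placeholders; this uses
# the classic cv2.aruco API, which changed slightly in OpenCV >= 4.7.
import cv2
import numpy as np

camera_matrix = np.array([[800.0, 0.0, 320.0],
                          [0.0, 800.0, 240.0],
                          [0.0, 0.0, 1.0]])  # placeholder intrinsics
dist_coeffs = np.zeros(5)                    # assume no lens distortion
marker_length = 0.05                         # marker side length in metres (assumed)

dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)

frame = cv2.imread("frame.png")              # hypothetical camera frame
if frame is None:
    raise SystemExit("No frame to process in this sketch")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
corners, ids, _rejected = cv2.aruco.detectMarkers(gray, dictionary)

if ids is not None:
    # One rotation/translation vector per detected marker: this pose is what the
    # AR device uses to anchor virtual content onto the physical marker.
    rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(
        corners, marker_length, camera_matrix, dist_coeffs)
    for marker_id, tvec in zip(ids.flatten(), tvecs):
        print(f"Marker {marker_id}: anchor content at {tvec.ravel()} (camera frame)")
```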

Furthermore, the use of manual intervention or user input can help mitigate issues when
automatic processes fail. If a vision algorithm fails to identify a part because the view is
obscured, the system can ask the user for help. This could involve the user manually aligning
virtual objects or providing additional context to help the system understand the environment.

Thanks to the advancements of computer vision, coupled with faster (hence more permissive)
hardware and improved algorithms, huge progress has been made towards Markerless
Augmented Reality (AR), also referred to as Location-Based AR or Position-Based AR
(Carmigniani, Furht, Anisetti, & Ceravolo, 2010). It employs real-world, live information to
augment the user's environment. This is achieved through the GPS, digital compass,
velocity meter, or accelerometer integrated into the device, which feed location and orientation
data to the computer-vision software. Such AR systems do not require any predefined
markers or images to overlay virtual information onto the real world.

Applications of markerless AR range from navigation assistance, where directions are overlaid
onto real-world streets, to showcasing historical sites with additional information, to
commercial applications like location-based marketing (Carmigniani, Furht, Anisetti, &
Ceravolo, 2010). A popular example of markerless AR is the game "Pokémon Go", where
digital creatures are superimposed onto the player's real-world surroundings.
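To illustrate the geometry behind such location-based overlays, the sketch below computes the distance and relative bearing from a device's GPS fix and compass heading to a point of interest, which is essentially what is needed to draw a label in the right direction of the user's view. The coordinates and heading are hypothetical values chosen for the example.

```python
# Minimal sketch of the geometry behind location-based (markerless) AR:
# great-circle distance and bearing from the device to a point of interest.
import math


def distance_and_bearing(lat1, lon1, lat2, lon2):
    """Return (distance in metres, initial bearing in degrees) between two points."""
    R = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    # Haversine formula for the distance.
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    dist = 2 * R * math.asin(math.sqrt(a))
    # Initial bearing from point 1 to point 2.
    y = math.sin(dlmb) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb)
    bearing = (math.degrees(math.atan2(y, x)) + 360) % 360
    return dist, bearing


user = (50.6326, 5.5797)   # hypothetical device GPS fix
poi = (50.6340, 5.5820)    # hypothetical point of interest
heading = 45.0             # hypothetical compass heading of the device, in degrees

dist, bearing = distance_and_bearing(*user, *poi)
relative = (bearing - heading + 360) % 360  # angle at which to draw the overlay
print(f"POI is {dist:.0f} m away, {relative:.0f} degrees clockwise from the view direction")
```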

The primary advantage of markerless AR is that it does not require any pre-knowledge of the
user's environment, making it more flexible, adaptable, and scalable. It is easier to implement
and use 'on the go' and is more user-friendly, since it is more intuitive than its marker-based
counterpart (Feng, Been-Lirn Duh, & Billinghurst, 2008). However, it relies heavily on the
accuracy of the device's location and orientation sensors or any other collected data – in case
of noise, it is harder for the system to provide perfect feedback to the user, making precise
registration a challenge. For all of these reasons, computational complexity, although it has
been reduced in recent years, remains a leading cause of delay, and hence of dynamic errors.

AR devices provide users with the ability to interact with both real-world and virtual elements
simultaneously. These devices typically consist of display technology, sensors for tracking
motion and orientation, and processors for rendering the augmented content.

Head-Mounted Displays (HMDs) are among the most common AR devices. These are
wearable devices that display digital content in the user's field of vision. We usually distinguish
two categories of HMDs: See-Through Optical Displays and Video See-Through Displays
(Azuma, 1997).

- See-through Optical Displays are integrated into devices such as Microsoft’s HoloLens.
In this setup, digital content is projected directly onto the user's eyes or onto semi-
transparent mirrors embedded in the glasses. This allows the user to see both the real
world and the digital content simultaneously. The user’s view of reality is not
obstructed, and the digital content appears to blend seamlessly with the physical world.
However, contrast and visibility can be an issue in bright outdoor environments.
- Video See-Through Displays, on the other hand, use cameras to capture the real-world
view and then combine this with digital content on a screen. The user thus sees a
combination of reality and digital content, but this view is mediated by a video feed,
which can introduce a slight delay or reduce the sense of presence. This approach is
common in handheld AR applications on smartphones and tablets but is also used in
some HMD and smart glasses designs. One of the most recent examples of such an HMD is
Apple's Vision Pro.

Smart Glasses are another common form of AR device. These resemble traditional eyeglasses
but include integrated technology to display information contextually. Smart glasses are indeed
a type of head-mounted device (HMD). However, they are often categorized separately due to
their design and intended usage (Mekni & Lemieux, 2014). Indeed, a smaller local presence is
usually expected from Smart Glasses, their use generally revolving around Assisted Reality
(aR or AsR).

Handheld AR devices, particularly smartphones and tablets, have been instrumental in bringing
AR to the masses. These devices leverage built-in cameras and sensors to provide AR
experiences (Billinghurst & Duenser, 2012). Applications like Pokemon Go and Ikea's AR
shopping tool exemplify the potential of AR on these devices, where the objective is to be as
user-ready and friendly as possible.

Spatial AR (SAR) devices project digital information onto physical objects in the real world, with no
need for the user to wear or hold a specific device. Applications of SAR range from
entertainment to industrial design and manufacturing, highlighting the flexibility of this
approach (Bimber & Raskar, 2005).

However, AR devices also face challenges. Notably, issues with user comfort, display quality,
field of view, and battery life are ongoing areas of research and development. Despite these

hurdles, the continued evolution of AR devices is anticipated, with advances in technology
enabling more immersive and versatile augmented experiences in the future.

If we now reconsider our initial criteria for the MR spectrum, we have a clearer view of how
to position AR on the matrix. Indeed, as mentioned before, EWK is a core requirement for AR
since, by definition, it interacts with the physical environment. Most issues come from
recognizing the environment, and this criterion should always be prioritized when working
with AR. Furthermore, as discussed, CO must be respected in order to ensure a great and
realistic experience for the user, especially since they interact with the real world. Finally, IM,
through the concept of local presence, will vary in intensity depending on the requirements
and expected use of the AR application.

Luckily for us, unlike VR, terminology around AR is not as variable, which makes it even
easier to come to a consensus in the literature. As done before, let us now match the RV
Continuum and the XR Framework with the current model. When it comes to the former,
Augmented Reality naturally fits within the three-criteria matrix as a combination of EWK,
CO and IM, with an emphasis on the first two. As far as the XR Framework is concerned, as
mentioned, aR minimizes the IM and CO required from the user, since the virtual objects are
usually more of a visual aid and do not require much local presence. At the polar opposite, we
have Mixed Reality. In that case, a clear distinction between the MR Spectrum and the MR
Illusion shall be made. The MR Spectrum (MRS) represents the context where most xR lies,
while MRI refers to the perceptual illusion created by MR systems in which the user perceives
virtual and real objects as coexisting and interacting in the same space, requiring a combination
of IM, CO and EWK. This distinction is absolutely vital to the understanding of our newly
defined model, and the two terms shall not be confused. In order to ensure a better overview of
the situation, Figure 8 summarizes each technology that has been defined in this section.
Although MRI can be considered a subcategory of AR, we would rather make it its own
category, due to the perfect combination of real and virtual artifacts and the entire range of
possibilities lying in such a paradigm of extended reality.

Figure 8 - Mixed Reality Spectrum based on the 3-criteria matrix (from the Physical World to the Digital World)

Physical World – Immersion: None; Coherence: Perfect; Extent of World Knowledge: Perfect; Most used device: n.a.; Use-case: n.a.; Example: n.a.
aR – Immersion: Low; Coherence: High; EWK: Very High; Most used device: Smart Glasses; Use-case: Visual assistance based on environment data; Example: aRdent smart glasses by Get Your Way
AR – Immersion: Medium; Coherence: High; EWK: Very High; Most used device: Smartphone / HMD; Use-case: Integrate 3D models into reality; Example: Pokémon Go
Mixed Reality Illusion – Immersion: Very High; Coherence: Very High; EWK: Very High; Most used device: HMD; Use-case: Seamlessly blend reality and virtuality; Example: HoloLens 2 giving instructions on how to build something
Augmented Virtuality – Immersion: High; Coherence: High; EWK: Low; Most used device: Opaque HMD with external sensors; Use-case: Add real-world content (objects or data) to virtuality; Example: Madison Beer's Immersive Reality concert experience
Atomistic VR – Immersion: High; Coherence: High; EWK: Low (limited to internal senses and self-consciousness); Most used device: Opaque HMD; Use-case: Light virtual experience; Example: Worker training for a repetitive task
VR – Immersion: Very High; Coherence: Very High; EWK: Low (limited to internal senses and self-consciousness); Most used device: Opaque HMD; Use-case: Immersive VR experience generating a new environment; Example: Flight simulator
Holistic VR – Immersion: Very High; Coherence: Very High; EWK: Low (limited to internal senses and self-consciousness); Most used device: Opaque HMD; Use-case: Fully immersive VR, emphasizing presence; Example: Stress tests
Matrix-like VR – Immersion: Very High; Coherence: Very High; EWK: None; Most used device: To be discovered; Use-case: Experience where the user is not conscious they are in a VE; Example: The Matrix
Mixed Reality Illusion

It is worth mentioning that, due to the scope of this work, whenever we refer to MR from now
on, the Mixed Reality Illusion is implied; whenever we talk about the spectrum, we will
explicitly mention it. As we can see in the table, every paradigm in xR finds a way to exist
within this newly defined matrix. The embodiment of the smooth blending between the real and
virtual worlds is Mixed Reality. Indeed, it requires a very high CO in order to create valid
interactions between physical and digital assets, a proper EWK so it can thoroughly analyze
the situation the user is in, and a high level of immersion, since the user is ideally not supposed
to be able to discern CGIs from physical objects.

MR constitutes a significant shift in our interaction with and perception of the digital world. It
merges real and virtual environments to create new sensorial environments where physical and
digital objects coexist and interact in real-time (Milgram & Kishino, 1994). The seamless
integration of real and virtual environments creates an immersive experience that maximizes
the strengths of both VR and AR (Kipper & Rampolla, 2012).

The development of MR has been fueled by advancements in technologies such as HMDs,
spatial computing, computer vision, and artificial intelligence (Billinghurst & Duenser, 2012).
Devices such as Microsoft's HoloLens (1 & 2) and Magic Leap's Lightwear offer high-quality
MR experiences by superimposing holographic digital content onto the physical world. These
devices use sensors, cameras, and advanced algorithms to understand their surroundings,
allowing the digital content to interact dynamically with the real world (Microsoft, 2023).

MR has found extensive applications across various sectors. In education, MR enhances
learning experiences by offering interactive, three-dimensional visualizations that facilitate
the teaching of complex topics (Bower, Howe, McCredie, Robinson, & Grover, 2014). In the
healthcare sector, MR is used in preoperative surgical planning, intraoperative navigation, and
postoperative evaluation, providing surgeons with a comprehensive view of the patient's
anatomy (Kamphuis, Barsom, Schijven, & Christoph, 2014). Furthermore, MR is
revolutionizing the business landscape by enabling product visualization, remote collaboration,
training, and maintenance in industries ranging from retail to manufacturing and engineering
(Porter, 2017).

Despite its promise, MR faces significant challenges such as technical limitations in tracking
and registration, the need for realistic graphics rendering, and human factors such as motion
sickness and cognitive overload (Carmigniani, Furht, Anisetti, & Ceravolo, 2010). Addressing
these challenges is crucial for MR to reach its full potential. The future of MR lies in its
convergence with emerging technologies such as 5G, cloud computing, and AI, which would
enable ubiquitous, context-aware, immediate, and intelligent MR experiences (Rauschnabel,
Felix, Hinsch, Shahab, & Alt, 2022).

Mixed Reality represents a paradigm shift in human-computer interaction, providing
unprecedented opportunities for immersive, interactive experiences. As the technology matures
and overcomes existing challenges, MR will increasingly permeate various domains of human
activity, transforming the way we learn, work, and play.

HoloLens 2

The HoloLens 2, developed by Microsoft, is a revolutionary HMD that signifies an innovative
leap in the realm of MR devices, integrating advanced sensors, optics, and computational
capabilities to provide a highly immersive and interactive MR experience (Microsoft, 2023).

Figure 9 - HoloLens 2 (Microsoft, 2023)

The HoloLens 2 is the second generation of MR HMDs commercialized by Microsoft. In
comparison with its predecessor, it boasts superior ergonomics and improvements in
immersion, with a see-through holographic display offering a resolution of 47 pixels per degree
of sight, which allows detailed holographic content visualization. The device also utilizes eye-
tracking sensors, enabling interactions with just a glance, and incorporates advanced hand-
tracking capabilities for more instinctive interactions with holographic content, thus featuring
direct manipulation of holograms with the same gestures one would use with physical objects
in the real world.
Powered by the Qualcomm Snapdragon 850 Compute Platform and a custom-built
Holographic Processing Unit (HPU), the HoloLens 2 achieves significant strides in processing
power and energy efficiency (up to 3 hours of active use). Furthermore, its 3D registration
capabilities are ensured thanks to its four “environment understanding cameras” for spatial
tracking (Microsoft, 2023).

Beyond mere hardware specifications, the HoloLens 2 supports an array of MR applications.
For example, Azure Spatial Anchors enable shared experiences across multiple devices,
fostering collaborative MR applications by allowing a group of users to interact within the same
MR experience. Additionally, Dynamics 365 applications such as Remote Assist and Guides
allow for remote assistance and step-by-step holographic instructional overlays, thus enhancing
productivity in industrial contexts (Microsoft, 2023).

Use Cases

The potential applications of HoloLens 2 span across various industries. In the medical field,
HoloLens 2 has demonstrated its potential in several ways. Pre-surgical planning is one key
application where surgeons use the HoloLens 2 to review detailed holographic visualizations
of the patient's anatomy to precisely plan surgical procedures. The device enables these

visualizations to be manipulated in a three-dimensional space, providing a level of
understanding that 2D imaging techniques cannot match. One case study demonstrated that
these visualizations aided in reducing surgical times and improving surgical outcomes,
suggesting a significant benefit for patient safety and healthcare efficiency (Sielhorst, Obst,
Burgkart, Riener, & Navab, 2019). Moreover, the HoloLens 2 is being used for real-time
surgical navigation, aiding surgeons during the actual surgical procedure. For example, the
HoloLens 2 can project guided pathways onto a patient during a procedure, enabling the
surgeon to see where they need to cut or stitch. This can improve accuracy and potentially
minimize the risk of complications (Sielhorst, Obst, Burgkart, Riener, & Navab, 2019).
The HoloLens 2 also presents transformative opportunities for education. Using AR
technologies, educators can create interactive, 3D visualizations of complex concepts, from
intricate biological structures to large-scale planetary systems. Students can then interact with
these models, manipulating them to gain a deeper understanding of the subject matter. This use
of the HoloLens 2 can support more engaging, effective, and personalized learning experiences
(Radu, 2014).
The aerospace industry, notably represented by companies like Airbus, has been using the
HoloLens 2 for complex assembly operations. By utilizing the HoloLens 2, workers can see
holographic instructions projected onto their field of view, guiding them through the assembly
process. Airbus reported this has led to improvements in quality, efficiency, and the speed of
training new employees (Rogers, Paay, & Brereton, 2020).
Furthermore, the HoloLens 2's remote assistance capabilities could be leveraged in industries
such as manufacturing and maintenance. Through Dynamics 365 Remote Assist, technicians
can share their point-of-view with remote experts, who can then provide guidance, draw
annotations in the technicians' field of view, or even overlay 3D models to help solve complex
problems (Microsoft, 2023).

These use cases represent just the tip of the iceberg in terms of the potential applications of the
HoloLens 2, and the integration of MR in work life in general. As the technology continues to
evolve and more industries recognize its benefits, the possibilities for its application are bound
to expand.

The HoloLens 2, with its advanced features and wide-ranging applications, is a critical driver
of the ongoing mixed reality revolution. By further refining the device's capabilities and
addressing existing challenges, the HoloLens 2, and MR in general, is poised to continually
redefine human-computer interaction and shape the future of various industries. Furthermore, by paving the way for MR applications, it fosters interest in the subject and opens the market to new entrants, hence more innovation.

Industry 4.0

Industry 4.0, often referred to as the Fourth Industrial Revolution, characterizes the current
trend of automation and data exchange in manufacturing technologies (Lasi, Fettke, Kemper,
Feld, & Hoffmann, 2014). Industry 4.0 is an umbrella term that encompasses various cyber-
physical systems, the Internet of Things (IoT), cloud computing, and cognitive computing,
amongst other digital technologies (Lasi, Fettke, Kemper, Feld, & Hoffmann, 2014). It
represents a vision of the future of industry in which all production processes are interlinked to
enable real-time communication, analysis, and decision-making (Schwab, 2016). In other
words, Industry 4.0 is where the physical world of manufacturing meets the digital world of
information technology, creating a fully-integrated ecosystem that is both self-aware and self-
optimized. Digital transformation, on the other hand, is a broader concept that encapsulates the
process of integrating digital technology across all facets of business and industry,
fundamentally altering how organizations operate and deliver value to customers (Fitzgerald,
Kruschwitz, Bonnet, & Welch, 2013). These two phenomena, though distinctive, are
intrinsically linked and represent a seismic shift towards more digitized, automated, and
interconnected processes in various industries.

While Industry 4.0 is a sector-specific manifestation of digital technologies, digital
transformation is a broader, organization-wide integration of digital technology into all areas,
fundamentally changing how businesses operate and deliver value (Berman, 2012). This
transformation transcends traditional roles like sales, marketing, and customer service, and also
includes changes in behind-the-scenes operations such as manufacturing and supply chain
management. With its focus on the integration of digital technologies into manufacturing and
industrial processes, Industry 4.0 can be seen as a facet or subset of digital transformation
(Fitzgerald, Kruschwitz, Bonnet, & Welch, 2013). In other words, while digital transformation
represents a wider change encompassing all industries and sectors, Industry 4.0 specifically
refers to changes within the manufacturing and industrial sectors.

The transition to Industry 4.0 has the potential to significantly enhance operational efficiency
across different industrial sectors. This is due in large part to the integration of digital
technologies, such as the IoT, cloud computing, and advanced analytics, into manufacturing
and other industrial processes (Rüßmann, et al., 2015)
For instance, IoT-enabled devices can collect and transmit real-time data about their operating
conditions and performance. When combined with advanced analytics, this data can yield
insights that can help organizations optimize their processes, reduce waste, and improve
product quality (Manyika, et al., 2015). In addition, predictive analytics can anticipate and
prevent equipment failures, reducing downtime and maintenance costs.
Moreover, the integration of advanced robotics, smart devices and automation into industrial
processes can reduce manual labor and improve the speed, accuracy, and repeatability of those
processes. This can lead to increased productivity, higher quality products, and safer working
conditions (Makridakis, 2017).

Despite the myriad of benefits that Industry 4.0 and digital transformation bring, they also
present significant challenges. These include cybersecurity threats due to the increased
connectivity and dependence on digital technologies, the risk of job displacement due to
automation, and the need for significant investment in new technologies and the training of
staff to use these technologies effectively (Kagermann & Wolfgang, 2022).
There's also the issue of data management. As Industry 4.0 and digital transformation generate vast amounts of data, organizations need effective ways to store, process, analyze, and use this data. Without the right systems in place, data can become more of a liability than an asset. Furthermore, the legal framework governing generated data, and the uses that can be made of it, must be weighed against the cost of harvesting that data.

Transitioning to Industry 4.0 requires a significant investment in new technologies and
infrastructure, as well as in training staff to effectively use these technologies (Kagermann &
Wolfgang, 2022). These upfront costs can be substantial, making it critical for organizations to
carefully plan their transition and ensure they have sufficient resources to fund it.

In the long run, however, the transition to Industry 4.0 can yield significant financial benefits.
The increased operational efficiency and productivity that come with Industry 4.0 can lead to
lower costs and higher revenues. In addition, the use of advanced analytics can help
organizations make more informed and effective business decisions, potentially leading to
increased profitability (Bughin, et al., 2017)
Nevertheless, organizations must also consider potential risks and downsides. For instance, the
increased reliance on digital technologies brings about new cybersecurity risks, which could
lead to significant financial losses in the event of a cyber-attack. In addition, organizations that
fail to successfully manage their transition to Industry 4.0 could end up wasting their
investment and damaging their competitiveness (Chui, et al., 2018)

Factors of adoption

In order to better define objectives for the design of future solutions, several key factors influencing the adoption of Industry 4.0 have been highlighted in the academic literature. These are often grouped into various categories, including technological, organizational, environmental, and individual factors.

1. Technological Readiness: The readiness to adopt and implement new technologies is a critical factor. This includes having the right infrastructure in place, technical skills, and understanding of the technologies that underpin Industry 4.0 (Frank, Roehrig, & Pring, 2017).
2. Data Management Capabilities: The ability to manage and use data effectively is
essential. Industry 4.0 is data-driven, and companies need to have the capacity to handle
large volumes of data and use it to gain insights (Lu, 2017)
3. Organizational Culture: An open and innovative culture that embraces change is also
crucial. The transition to Industry 4.0 involves significant change, and organizations
need to be ready to adapt and evolve (Kagermann & Wolfgang, 2022).
4. Leadership and Strategy: Strong leadership and a clear strategy for adopting and
implementing Industry 4.0 technologies are important. This includes having a vision
for how these technologies will be used and the benefits they will bring (Kagermann &
Wolfgang, 2022).
5. Regulatory Environment: The regulatory environment can also impact the adoption of
Industry 4.0. This includes things like data protection regulations, which can affect how
data is used and shared (Kagermann & Wolfgang, 2022).
6. Financial Resources: Finally, financial resources are also a key factor. Adopting new
technologies can require significant investment, and organizations need to have the
financial capacity to make these investments (Liao, Deschamps, Loures, & Ramos,
2017)

Mixed Reality and Augmented Reality in Industry 4.0

MR and AR are key technologies underpinning Industry 4.0. These technologies overlay digital
information onto the physical world, enhancing human perception and decision-making
capabilities (Milgram & Kishino, 1994). In the context of Industry 4.0, they can enhance the
efficiency and effectiveness of various industrial processes.
For instance, in manufacturing, AR can guide workers through complex assembly processes,
reducing errors and increasing efficiency (Azuma, 2001). In maintenance and repair, AR can
overlay instructions and diagnostics onto machines, helping technicians identify and fix issues
faster. MR takes this a step further by enabling more immersive experiences where physical
and digital objects coexist and interact in real-time. This can be particularly beneficial for
collaborative work, remote assistance, and training scenarios in the industrial context
(Billinghurst & Duenser, 2012).

The Fourth Industrial Revolution, or Industry 4.0, represents a major step forward in the
integration of digital technologies into manufacturing and industrial processes. It is a key facet
of the broader digital transformation affecting all industries and sectors. While these trends
offer significant potential benefits, they also present challenges that organizations must address
to fully leverage their potential. Mixed Reality and Augmented Reality are among the key
technologies driving these trends, offering powerful new ways to improve efficiency,
effectiveness, and worker experience in the industrial sector.

Computer Vision

Computer vision is a branch of artificial intelligence that enables computers to interpret and understand visual information from the physical world, essentially mimicking the capabilities of human vision while deciphering the contents of digital images and videos with far greater speed and consistency. The fundamental processes within this realm are object detection, recognition, localization, segmentation, and image classification. These processes are interconnected but serve different roles and are deployed based on specific requirements. They are built on numerous tasks such as image processing, pattern recognition, geometric modeling, and learning techniques (Szeliski, 2010). We will now formally define each major operation and their respective links to get a better overview of what computer vision can achieve.

Figure 10 - Overview of Object Recognition Computer Vision Tasks (Brownlee, 2021)

- Object Recognition: Term encapsulating a variety of tasks within the field of computer vision, primarily focused on identifying and locating objects within digital images (Krizhevsky, Sutskever, & Hinton, 2012). This process is crucial for numerous
applications and has significantly advanced with the evolution of machine learning and
deep learning algorithms.
- Image Classification: This involves predicting the category or class of a singular object
in an image. The input is typically an image containing a single object, and the output
is a class label, essentially an integer that maps to a specific category. In essence, the
task is to assign a label from a fixed set of categories to an input image (Brownlee,
2021).
- Object Localization: This task involves identifying the presence and location of objects
within an image and marking their location with bounding boxes. Here, the input can
be an image containing one or more objects, and the output is one or more bounding
boxes, usually defined by a point, width, and height (Brownlee, 2021).
- Object Detection: This is an extension of the previous tasks, aiming to locate and
classify one or more objects in an image. Again, the input is an image with one or more
objects, but the output is one or more bounding boxes and a class label for each
bounding box. This means that each detected object is not only localized within the
image but also assigned a category (Brownlee, 2021).

Each of these tasks has distinct evaluation metrics. Image classification models are often
evaluated using mean classification error across the predicted class labels. Object localization
models are usually evaluated based on the distance between the predicted and actual bounding
box for each object. For object detection, performance is often measured by precision and recall
across each of the best matching bounding boxes for the known objects in the image
(Everingham, Van Gool, Williams, Winn, & Zisserman, 2010).

An advanced task within the object recognition suite is object segmentation, also referred to as "semantic segmentation" or, when individual object instances are distinguished, "instance segmentation". This task involves identifying the specific pixels that belong to the object, rather than using a simplified bounding box. This leads to a much more detailed understanding of object shape and spatial distribution within the image (Long, Shelhamer, & Darrell, 2015).

Figure 11 - Computer Vision Tasks (Machine Learning Mastery, 2020)

While these processes may seem similar, they serve distinct roles in the field of computer vision
and underpin various practical applications. Image classification, which assigns a label to an
entire image, is fundamental in applications such as scene recognition, where the context of an
image (such as a beach, city, or forest) is important. Object detection, which identifies multiple
objects and their locations within the image, is critical in autonomous vehicles, where it's vital
to identify and locate other vehicles, pedestrians, and obstacles. Lastly, segmentation offers a
more detailed representation by splitting the image into meaningful regions. This is especially
useful in medical imaging, where it's necessary to delineate the boundaries of different tissues
or anomalies, like tumors, in scans for accurate diagnosis or treatment planning (Litjens, et al.,
2017)

Despite these differences, these processes often work in conjunction to interpret image content
effectively, providing a holistic view of the visual data and enabling diverse real-world
applications.
One noteworthy example is the use of computer vision in Microsoft's HoloLens 2 – it employs
computer vision for spatial mapping, gesture recognition, and gaze tracking, thereby enabling
user interaction with holographic content. Similarly, in AR, computer vision techniques like
SLAM (Simultaneous Localization and Mapping) are widely used. SLAM creates a map of the
unknown environment while keeping track of the user's current location, a fundamental
requirement for AR applications like navigation, gaming, and industrial maintenance (Castle,
Klein, & Murray, 2008)

While computer vision revolutionized technology long ago, it still presents unique challenges. Issues such as computational efficiency, robustness to varying lighting conditions, and accuracy in object detection and tracking remain persistent (LeCun, Bengio, & Hinton, 2015). Nonetheless, the advent of deep learning and neural networks has marked a turning point for computer vision. Advanced techniques like convolutional neural networks (CNNs) are showing great promise in enhancing the performance of object recognition, segmentation, and other computer vision tasks (LeCun, Bengio, & Hinton, 2015). Such advancements hint at the immense potential that remains to be tapped in order to unlock the full power of computer vision algorithms. Let us now have a look at the different learning models used when training a model from scratch using machine learning.

Here are some of the most widely used models in the field of computer vision and their
respective use (Brownlee, 2021).

1. Convolutional Neural Networks (CNNs): Perhaps the most prominent type of model
used in computer vision, CNN is a type of neural network designed specifically for
processing grid-like data, such as images. It uses convolutional layers, pooling layers,
and fully connected layers to learn hierarchical features from the input data, making it
highly efficient in tasks like image classification (LeCun, Bengio, & Hinton, 2015).
CNNs are used in tasks like image and video recognition, recommender systems, image
generation, and natural language processing.
2. Region-based Convolutional Neural Networks (R-CNN): R-CNN is an evolution of
CNNs designed for object detection tasks. It uses a region proposal algorithm to identify
potential bounding boxes in an image, then uses a CNN to classify these regions into
various categories and refine the bounding box coordinates. However, R-CNN is
computationally intensive because it needs to process the CNN for each proposed
region separately (Girshick, Donahue, Darrell, & Malik, 2014).
3. Faster R-CNN: Faster R-CNN is an improvement over the original R-CNN and its
immediate successor, Fast R-CNN. Instead of using a separate algorithm for region
proposal, Faster R-CNN introduces a Region Proposal Network (RPN) that shares full-
image convolutional features with the detection network, thus enabling nearly cost-free
region proposals. This significantly increases speed while maintaining high accuracy in
object detection tasks (Girshick, Donahue, Darrell, & Malik, 2014).
4. ResNet (Residual Networks): ResNets, known for their deep architectures, are pivotal
in image classification tasks, especially in large-scale image recognition. They are also
deployed in detecting objects and locations in images and have found extensive
applications in facial recognition technologies (He, Zhang, Ren, & Sun, 2016). ResNets
are a type of CNN that introduced "skip connections" or "shortcuts" to allow gradients
to flow through the network directly (He, Zhang, Ren & Sun, 2016). This mitigates the
vanishing gradient problem (a challenge in training deep neural networks where the
gradient used to update weights becomes exceedingly small, leading to minimal
changes in the weights, and thus inhibiting the network's learning process.), making it
possible to train very deep networks. ResNets have been used to achieve state-of-the-
art results in various computer vision tasks. (He, Zhang, Ren, & Sun, 2016)
5. YOLO (You Only Look Once): This is a real-time object detection system that, unlike region proposal-based techniques, processes the entire image in a single pass during training and testing, hence the name "You Only Look Once" (Redmon, Divvala, Girshick, & Farhadi, 2016). Specialized in real-time object detection, YOLO excels in scenarios requiring speed. It can process streaming video in real time to detect objects, finding applications in domains like autonomous vehicles or surveillance systems.
6. Mask R-CNN: This extends Faster R-CNN, a popular object detection framework, by
adding a branch for predicting segmentation masks on each Region of Interest (ROI),
in parallel with the existing branch for bounding box recognition. This allows it to
simultaneously detect the objects in an image and generate a high-quality segmentation
mask for each instance (He, Zhang, Ren, & Sun, 2016).
7. U-Net: The network is based on the fully convolutional network, with an architecture modified and extended to work with fewer training images and to yield more precise segmentations. U-Net is simpler and faster to train than models like Mask R-CNN, making it a good choice for tasks with a single object type or limited computational resources (Ronneberger, Fischer, & Brox, 2015).

As we continue to push the boundaries of what is possible with HCI, the role of computer
vision becomes increasingly integral. By understanding and interpreting visual data, computer
vision enables technologies to interact more meaningfully with users and their environments.
Despite current challenges, ongoing advancements promise a future where XR technologies,
powered by sophisticated computer vision algorithms, become an integral part of our everyday
lives. As research in artificial intelligence and machine learning continues to evolve, these processes will become more precise and efficient, further closing the gap between human and machine vision.

Solution Design
Vision

Now that we have a well-defined framework for our research, it seems important to think about
the objective of the solution, its requirements and how its design can be optimized.

As stated in the introduction, our solution consists of an application on the HoloLens 2 that would transform physical assets, such as a parcel, into a 3D model that could be used for several operations, including bin-packing. Creating 3D models out of 2D data is a process called photogrammetry: capturing measurements from photographic images. It is based on the principle of triangulation, a process where the coordinates of points in 3D space are defined by measurements made in two or more photographic images taken from different positions.

In a broader sense, photogrammetry can be defined as the art, science, and technology of
obtaining reliable information about physical objects and the environment through processes
of recording, measuring, and interpreting photographic images and patterns of recorded radiant
electromagnetic energy and other phenomena (Wolf & Dewitt, 2000).

With advances in technology, photogrammetry has expanded beyond traditional aerial and
satellite imagery and now includes other forms of sensing, such as LiDAR and SAR, as well
as terrestrial and close-range applications. Modern photogrammetry also often involves the use
of computer algorithms to automate the process of extracting 3D shapes from multiple images,
and it plays a vital role in fields like surveying, architecture, engineering, manufacturing,
quality control, police investigation, and geology, among others.

First things first, since we are using a HoloLens 2, the solution will be using MR. That implies
a smooth blending of real and virtual worlds, with interactive options. When it comes to the
Industry 4.0 aspect, let’s take a look at the 6 criteria we defined earlier on, influencing the
adoption of new technology, and how we can work on each parameter to ensure our solution
fits the needs of modern industry:

1. Technological Readiness: The solution should be based on technologies that are ready
for deployment and use, preferably technologies that are mature, tested, and optimized
for practical use. Industries have been using more and more IoT-related processes and
are open to broader horizons.
2. Data Management Capabilities: The solution should be able to handle, process, and
analyze large volumes of data effectively and efficiently. It should also provide useful
insights derived from this data, which can be utilized for decision-making processes
and improving operational efficiency.
3. Organizational Culture: The solution should be able to align with and foster an open,
innovative, and change-embracing culture within an organization. It should be designed
in a way that encourages user engagement, adaptation, and continuous learning.
4. Leadership and Strategy: The solution should be in line with the organization's strategic
goals and vision for Industry 4.0 adoption. It should provide clear benefits that match
the organization's strategic direction and should contribute positively to the realization
of the organization's strategic objectives.
5. Regulatory Environment: The solution should comply with relevant regulatory
requirements, particularly those related to data protection and privacy. It should have
mechanisms to ensure data security, privacy, and ethical use of data, in accordance with
local and international regulations.
6. Financial Resources: The solution should provide a good return on investment and
should be affordable for the organization. The cost of adopting, implementing, and
maintaining the solution should be justified by the benefits and value it provides to the
organization, considering both direct and indirect costs.

That being said, organizational culture, leadership and strategy, and the regulatory environment are outside the scope of this work, which only aims at building a technical solution. We will thus focus on technological readiness, data management and, to a lesser extent, financial resources to design our solution. We do have access to the right technology (MR), with the right equipment (HoloLens 2). Our solution must provide quality data that is usable in real-life scenarios, scalable, and efficiently built. We will therefore also tap into Azure cloud services to handle possibly large computations and data storage, and to make the solution as portable as possible. Although it is not our priority to tackle such criteria, we must create a user-friendly experience that does not require long training to master, so that, ultimately, we can stimulate use of and interest in the app. Finally, it should not require far-fetched investments in order to be integrated into a company's pipeline.

Here is a flowchart of the main operations performed during the 'Images to 3D model' process. As we can see, it has several steps that may seem simple on paper but whose implementation turns out to be trickier than expected.

Figure 12 - Flowchart : Images to 3D Model process

Execution

In this section, we will dive into the core of this work and explain how each major step of our
‘Images to 3D model’ works. First, we will explain the mechanics behind photogrammetry and
why Agisoft Metashape has been selected as photogrammetric software for our task. Right
after, the pre-processing operations will be reviewed to better understand how to get the input
that is to be fed to our tool. Then, the automation made on Azure will provide us with the last
pieces of the puzzle while bringing light to some technical issues we faced and challenges we
overcame. Finally, we will review the Graphical User Interface created in the app for our HoloLens 2.

In this section, some tests have been carried out to compare results under different
configurations and parameters. Unless stated clearly, the tests have been made on a local
machine with the following specifications:
- OS : Windows 10 Enterprise, v. 22H2
- CPU : Intel i7-5700HQ Quad-Core 3.5 GHz
- GPU : Nvidia GTX 1080-Ti
- RAM : 2 x 8GB DDR4

This is a pretty average machine; results can thus vary from one computer to another.

3D Model Building

Photogrammetry software has been instrumental in transforming 2D imagery into 3D models, leading to significant advances in fields such as archaeology, culture, architecture, geology, and manufacturing, among others. A plethora of photogrammetry packages is currently available, each with its unique features, strengths, and limitations. When selecting photogrammetry software, it is crucial to consider several aspects. These include the software's accuracy, speed, ease of use, compatibility with different formats, and the ability to handle varying data volumes.

In 2021, a study was conducted on seven packages, of which 4 were commercial and 3 open-source, ranking them according to their efficiency, based on the time spent on processing and the size of the generated point cloud (Lehoczky & Abdurakhmonov, 2021). Unsurprisingly, it turns out that commercial packages generally outperform open-source alternatives in terms of performance, capacity, and complexity.

Open-source packages have several advantages. For one, they are available for almost every platform and operating system. Furthermore, they offer significant flexibility in parameterization, facilitating research activities. Due to the accessibility of the code, the processing method can be tailored to specific tasks, leading to a more efficient workflow in certain cases. However, this typically necessitates a strong background in computer vision.

In contrast, commercial packages are typically optimized for Windows or Linux, rarely for MacOS. These applications are user-friendly and generally stable. However, they offer less room for parameterization. Given their "black box" operation, where the computational method is not revealed, there are limited opportunities for result verification and error interpretation. Over time, commercial software tends to evolve towards more complex user interfaces to meet all demands and cater to a wide range of applications.

In the conducted tests, Agisoft Metashape, a commercial product, emerged as the most
efficient, delivering the highest point density in the shortest computational time. Furthermore,
it requires minimal expertise in surveying or computer science from users and still offers room
for parameterization settings, while ensuring good results most of the time. (Lehoczky &
Abdurakhmonov, 2021)

In addition to its other benefits, Metashape offers a very convenient feature: the possibility to run the software in headless mode (without displaying a Graphical User Interface), using instead the command-line interface of the host machine. Using Python scripting, thanks to the provided module, we do not need human interaction and can automate the process under the right circumstances. Although that brings its share of challenges, these are completely worth the effort. Indeed, since our objective is to make the workflow run on its own on the cloud, from the moment the start is triggered until we obtain the expected 3D model on our HoloLens 2 headset, we do need a way to use Metashape on Azure, either via a Virtual Machine or a Container. The use of either of them will be discussed in the upcoming section concerning Microsoft Azure and its cloud services.

The script automates the process of creating a 3D model from a set of photos using the Agisoft Metashape software. Here's a review of the main steps required for this operation; a simplified sketch of such a script is shown after the list. Note that most photogrammetry packages use the same methodology, and such code could be adapted to work under other circumstances. We will proceed to explain the main operations but, so as not to confuse the reader, won't elaborate on the specific dispositions required by Azure's architecture.

1. File Discovery: The script defines a function find_files() that looks for specific files in a given folder, i.e. the pictures taken. It then checks whether the correct command-line arguments were provided and assigns the image and output folders based on these arguments. It then uses the find_files() function just defined to create a list of photo files in the image folder.
2. Document Creation and Photo Addition: The script creates a new Metashape
Document, adds a new Chunk (unit of work representing a subset of the project) to it,
and adds the photos from the previous step to the Chunk.
3. Mask Importation: The script sets a template for where it expects to find mask files (which can be used to ignore certain parts of images); then, for each photo, it imports its corresponding mask from the folder. These are the masks generated during a pre-processing step using a U-Net model. A later section elaborates on these pre-processing operations.
4. Photo Alignment: The script matches the photos and aligns the cameras. When performing photo alignment, the script estimates the initial position and orientation of each image based on a common coordinate system created by matching points across different pictures. In other words, if 2 points on 2 different pictures are matched as an identical part of the physical object, this creates a relative system that helps the underlying software understand where each picture has been taken from. Based on some parameters related to masks (filter_mask and mask_tiepoints), we can at this point exclude some points from the process in order to have a model focused on the object itself. We will further explain the masking process in the dedicated section.
5. Dense Cloud Building: This step builds depth maps from the aligned photos, which
are 2D images with an estimated distance from the camera to the surface of the object
for each pixel. From this output is built a dense cloud of points which captures the 3D
structure of the scene or object with a high level of detail, thanks to triangulation.
6. Mesh Building: We build a 3D mesh from the dense cloud, a collection of vertices,
edges, and faces that defines the rough structure of the object.
7. Texture Building: The script records the start time, builds (U,V) coordinates (2D coordinates that are used for mapping a 2D texture onto a 3D model), builds a texture from the original photos using these (U,V) coordinates and matches those two components. This is the final step in creating a 3D model: texturing the 3D mesh to make it look like the original object.
8. Model Exportation: If a 3D model was successfully created, the script exports it as a file to the output folder.
9. Completion Message and Timing: This step is optional, but for logging purposes, our
script prints a message indicating the process is finished and where the results were
saved. It then prints the total processing time for each step in the 3D model creation
process.
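To make these steps more concrete, here is a minimal, hedged sketch of what such a headless Metashape script can look like. It assumes the Metashape Professional Python module with a 1.x-style API; exact class and parameter names vary between versions, and the mask template path, export format and absence of Azure-specific plumbing are illustrative simplifications rather than the exact production script.

```python
import glob
import os
import sys

import Metashape  # requires a Metashape Professional licence


def find_files(folder, exts=(".jpg", ".jpeg", ".png")):
    """Return all image files in `folder` with one of the given extensions."""
    return [p for p in glob.glob(os.path.join(folder, "*"))
            if os.path.splitext(p)[1].lower() in exts]


image_folder, output_folder = sys.argv[1], sys.argv[2]
photos = find_files(image_folder)

doc = Metashape.Document()
chunk = doc.addChunk()                      # a chunk groups photos, points and models
chunk.addPhotos(photos)
doc.save(os.path.join(output_folder, "project.psx"))

# Import the U-Net masks produced during pre-processing (one mask per photo).
chunk.generateMasks(path=os.path.join(image_folder, "masks", "{filename}.png"),
                    masking_mode=Metashape.MaskingModeFile)

# Align photos; masked regions are ignored when looking for tie points.
chunk.matchPhotos(filter_mask=True, mask_tiepoints=True)
chunk.alignCameras()

# Depth maps -> dense cloud -> mesh -> UV coordinates -> texture.
chunk.buildDepthMaps()
chunk.buildDenseCloud()
chunk.buildModel()      # default mesh source (depth maps or dense cloud) depends on version
chunk.buildUV()
chunk.buildTexture()

chunk.exportModel(os.path.join(output_folder, "model.obj"))
print("3D model exported to", output_folder)
```

Run from the command line, such a script would be invoked as `metashape.sh -r script.py <image_folder> <output_folder>`, which is what makes the headless, cloud-hosted automation possible.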

While automated tools have made photogrammetry more accessible, understanding the underlying principles and steps involved in the process is still crucial for achieving optimal results. This includes understanding how to properly capture photos for photogrammetry, knowing how to adjust software parameters for different scenarios, and being able to validate the results, since errors do sometimes occur. It is thus essential to understand the back end in order to fix any issue.

Pre-processing of the pictures

To ensure the best results when building the 3D model, pre-processing the pictures is of the highest importance. Indeed, on its own, Agisoft Metashape can hardly differentiate the background from the subject of a picture. Therefore, the number of 'noise' points involved in the making of the model is far too high. In order to prevent such an issue, there exists a way to keep the number of points to a minimum: the use of masks. By importing masks into our software, in association with their respective pictures, we tell the algorithm which points it can drop, reducing computing time and giving us a more accurate and focused model with as little background as possible.

Figure 13 - Comparison of the 2 models, without mask (left) - with mask (right)

As we can see, the model using masks has a much smaller bounding box, although some
artifacts could be cleaned manually afterward. When creating a 3D model, the obvious
objective is to only capture the main subject. In addition to making the virtual object ready to
use, the computation behind its creation also gets far more efficient.

That being said, since we are automating our process, it is impossible to manually mask each picture by selecting the subject on each and every one. Furthermore, such a task would be extremely time-consuming for very large datasets. Therefore, we can also automate this process using a Python script. When it comes to masking, two options are available to us: using the Open Source Computer Vision Library (OpenCV) or using deep learning to build a model. Masking in OpenCV can be an effective and relatively simple way to isolate parts of an image based on color or intensity thresholds, as sketched below. However, it has its limitations, as it often requires manual adjustments and is largely dependent on good lighting conditions. Furthermore, masking in OpenCV can struggle with complex backgrounds or when objects of interest are similar in color to their surroundings.
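As an illustration of this simpler, threshold-based approach (not part of the final pipeline), here is a minimal OpenCV sketch; the HSV bounds are hypothetical values that would have to be tuned for cardboard under specific lighting conditions.

```python
import cv2
import numpy as np


def threshold_mask(image_path, lower_hsv=(5, 40, 40), upper_hsv=(30, 255, 255)):
    """Build a rough binary mask by thresholding in HSV colour space."""
    img = cv2.imread(image_path)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(lower_hsv), np.array(upper_hsv))
    # Morphological opening/closing to remove small speckles and fill holes.
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask


# Example usage: cv2.imwrite("mask.png", threshold_mask("box_001.jpg"))
```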

On the other hand, building a model, particularly using machine learning techniques, offers
several advantages. A well-trained model can handle a wider range of lighting conditions,
complex backgrounds, and subtler color variations. It is capable of learning from a vast array
of features rather than relying solely on color or intensity. Models can be trained to recognize
specific objects, making them more effective in scenarios where the object of interest does not
have a distinct color or intensity. Additionally, machine learning models can continuously
improve over time through learning from new data. While the initial training of the model may
require more resources and expertise compared to a simple masking approach, the potential
accuracy and versatility gains of a model often justify the additional investment (LeCun,
Bengio, & Hinton, 2015).

Our objective is thus to obtain a script that performs object segmentation tasks as efficiently as possible. There are numerous models that could fit such a use, each with its own strengths and weaknesses, the best one depending on a variety of factors, including the complexity of the images, the number and variability of the objects, the level of detail required in the segmentation, the available computational resources, and so forth.
In our case, to simplify the analysis, we will suppose all parcels that will be scanned are cardboard boxes, since it is the most common and simple type of unit.

We decided to go for a U-Net architecture for our model, a type of convolutional neural
network that excels in tasks such as semantic segmentation, where the aim is to assign a class
to each pixel in the image. This type of task is common when there is only one type of object
to detect in the image, and every pixel either belongs to the object or the background. Since the
user will be taking pictures of each unit individually when starting the process of 3D
modelization, we can assume there will be only one subject on each picture. The U-Net
architecture consists of two paths - a contracting (downsampling) path and an expanding
(upsampling) path, which gives it a 'U' shape, hence the name U-Net. The objective is to
highlight and then match intelligently low-level and high-level features on each picture to train
the model. The former are the basic features that can be directly extracted from the source
image, such as edges, colors, texture, intensity, gradients, etc. Low-level features often
correspond to simple patterns in the image, and they can be captured in the initial layers of a
CNN. They don't usually depend on the context and are more about the raw pixel values and
simple, local patterns. On the other hand, high-level features are more abstract and complex
features that usually represent a higher understanding of the image's content. High-level
features might include shapes, objects, faces, or even more abstract concepts like the scene type
(e.g., a beach or a city). These features are typically extracted in the deeper layers of a CNN,
and they are obtained by combining the lower-level features and recognizing larger patterns in
the image (Ronneberger, Fischer, & Brox, 2015). To put it simply, we can more or less think of the high-level features as the 'what' and the low-level features as the 'where'.

Python is widely recognized as the preferred language for machine learning and data science due to several key factors (Oliphant, 2007). From its readability, which is particularly useful when writing and understanding algorithms, to its extensive selection of machine learning-specific libraries and its simplicity, which allows developers to implement machine learning models quickly, increasing their productivity and accelerating the overall development process, Python is the ideal programming language for any machine learning-related task and hence the obvious choice. Let us now explain how the U-Net model works; a compact, illustrative Keras sketch follows the list below.

1. Downsampling - Contracting Path: The downsampling process starts when the input
image enters the network. This path involves successive applications of Convolutional
layers (Conv2D) and MaxPooling layers (MaxPooling2D). Convolutional layers with
a Rectified Linear Unit (ReLU) activation function are applied first, processing the
input image by extracting different features. This is followed by MaxPooling layers,
which reduce the spatial dimensions of the output from the Conv2D layer, effectively
focusing on high-level features. The parameters within the Conv2D function define the
number of filters to be used and their dimensions. This path aims at capturing the
semantic or contextual information of the input. The downsampling process reduces the
spatial dimension (width and height) of the input while increasing the depth (number
of features learned by the model about the input), thereby focusing on high-level
features. Each step in this path involves the application of a series of convolutional
layers followed by a max-pooling layer that halves the dimensions.
2. Bottleneck: In the bottleneck, the high-level features and the context captured in the
contracting path are processed.
3. Upsampling - Expanding Path: This is where the network attempts to "upscale" the
high-level features, with the goal of regaining the spatial dimensions lost during the
downsampling process. This is achieved through the UpSampling2D function, which
effectively doubles the dimensions of its input. The outputs from these upsampling
layers are then concatenated with the outputs from the Conv2D layers in the contracting
path. The concatenated features are passed through additional Conv2D layers to further
refine the upsampled features. This series of operations help the model to accurately
localize and outline the objects in the image. The combination of high-level features
and spatial information helps the network to correctly localize and delineate the objects
in the image.
4. Final Layer: The final layer of the model is a Convolutional layer with a Sigmoid
activation function. This layer outputs a probability map where each pixel has a value
between 0 and 1, representing the likelihood of the pixel belonging to the target class.
5. Compiling the Model: The model is compiled using the Adam optimizer, one of many
loss-minimizing functions, and binary cross-entropy loss function, suitable for a binary
segmentation problem. The binary cross-entropy loss for a prediction model output
y_pred and actual output y_true is given by the equation:

- [y_true * log(y_pred) + (1 - y_true) * log(1 - y_pred)]

The accuracy of the model's predictions is used as the evaluation metric.

6. Preparation of Training and Validation Data: The code then loads the training and validation images and masks from their respective directories, normalizes them to a range between 0 and 1, and creates datasets that pair each image with its corresponding mask. Normalizing data between 0 and 1 (instead of [0-255]) ensures consistent data ranges, improves the stability and efficiency of gradient descent optimization, and accelerates model convergence. It is also at this step that we augment our training dataset; augmenting the validation set would not be a good idea, since it could lead to inconsistencies.
7. Training the Model: The model is then trained using the fit() function, where the
training data and validation data are passed as parameters, along with the number of
epochs, which defines how many times the learning algorithm will work through the
entire training dataset.
8. Saving the Model: Finally, the trained model is saved to a .h5 file (Hierarchical Data
Format Version5, or HDF5) for future use, without having to retrain the model from
scratch.
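The following is a compact, hedged Keras sketch of such a U-Net, shrunk to two resolution levels for readability. It is not the exact architecture used in this work (the filter counts, depth and 128x128 input size are assumptions based on the description above), but it illustrates the contracting path, bottleneck, expanding path with concatenations, sigmoid output layer and compilation described in the steps.

```python
import tensorflow as tf
from tensorflow.keras import Model, layers


def build_unet(img_size=(128, 128), num_channels=3):
    inputs = layers.Input(shape=img_size + (num_channels,))

    # Contracting path: Conv2D + ReLU extracts features, MaxPooling halves the dimensions.
    c1 = layers.Conv2D(16, 3, activation="relu", padding="same")(inputs)
    p1 = layers.MaxPooling2D(2)(c1)
    c2 = layers.Conv2D(32, 3, activation="relu", padding="same")(p1)
    p2 = layers.MaxPooling2D(2)(c2)

    # Bottleneck: highest-level features at the smallest spatial size.
    b = layers.Conv2D(64, 3, activation="relu", padding="same")(p2)

    # Expanding path: upsample and concatenate with the matching contracting layer.
    u2 = layers.UpSampling2D(2)(b)
    u2 = layers.concatenate([u2, c2])
    c3 = layers.Conv2D(32, 3, activation="relu", padding="same")(u2)
    u1 = layers.UpSampling2D(2)(c3)
    u1 = layers.concatenate([u1, c1])
    c4 = layers.Conv2D(16, 3, activation="relu", padding="same")(u1)

    # Final layer: per-pixel probability of belonging to the parcel.
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(c4)

    model = Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


# model = build_unet()
# model.fit(train_ds, validation_data=val_ds, epochs=10)
# model.save("model.h5")
```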

A pre-made dataset of 2967 cardboard box pictures has been used to train the model, with 70%, 20% and 10% for training, validation, and testing respectively. However, when training the model, an operation called augmentation is applied to the training dataset to build a more robust model. Indeed, by digitally editing the pictures (adding noise, partially randomizing lighting conditions, performing rotations, scaling, or cropping of images, etc.), we artificially expand the size of the training dataset in an easy way, making the resulting model less sensitive to variations in the input data. In practice, it means the same picture might look different during 2 different epochs. In our case, we only performed flipping and zooming operations.

As said in step 6, we loaded our training images and their respective masks to train the model. To create those masks, we used the annotations provided with our dataset, a .JSON file containing the bounding boxes of each subject on every image, and processed this data using another script outputting masks from the given JSON file.
There are three main steps in this script, which transforms this raw text into images (a sketch of the mask-creation helper is given after the list).
1. Setting up directory paths: Directory paths for training, testing, and validation images
as well as their corresponding masks are set up.
2. Function definitions: Two functions are defined: create_mask() and get_image_id().
create_mask() is used to create a binary mask for a given image based on bounding
box coordinates provided in the annotations. This mask is of the same size as the
original image and has the value of 1 where the object (defined by the bounding box)
is located and 0 elsewhere. get_image_id() is used to get the id of an image given its
filename from the annotations.
3. Creating masks for all sets: For each set (train, test, valid), the script creates masks
from the annotations provided in a COCO-style JSON file. COCO (Common Objects
in Context) is a popular format for annotated image datasets. For each image, it creates
a mask and resizes it to (128, 128), saving it with the image id in the corresponding
mask directory. It also resizes the original image to (128, 128) and renames the image
file to its id.
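A hedged sketch of what such a create_mask() helper can look like is given below. The bbox layout [x, y, width, height] follows the standard COCO convention, but the surrounding script logic (reading the JSON, matching image ids, saving files) is simplified.

```python
import numpy as np
from PIL import Image


def create_mask(image_size, bboxes):
    """Build a binary mask from COCO-style bounding boxes [x, y, width, height]."""
    width, height = image_size
    mask = np.zeros((height, width), dtype=np.uint8)
    for x, y, w, h in bboxes:
        x, y, w, h = int(x), int(y), int(w), int(h)
        mask[y:y + h, x:x + w] = 1  # every pixel inside the box is marked as "object"
    return mask


# Resize to the training resolution and save, e.g.:
# Image.fromarray(mask * 255).resize((128, 128)).save("masks/42.png")
```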

Essentially, this script reads COCO-style annotations3, creates binary masks based on the
annotations for object segmentation, resizes both the masks and the original images to a
standard size (128,128), and renames the image files to their respective ids. The processed
datasets can then be used for tasks like semantic segmentation or object detection. The size is
chosen as to facilitate computations and prevent overflowing of the memory while keeping
enough information for proper training of the model.

3. A structured file format that provides a standardized way to store annotations of images for object detection and segmentation, following the conventions used in the Common Objects in Context (COCO) dataset.

Once both of these processes are finished, our model is saved and ready to be loaded into our next piece of code, the mask generation script. This script is a TensorFlow-based image segmentation prediction application. It takes new images from a directory, feeds them to our pre-trained segmentation model, and saves the segmentation output as masks. It also applies a noise removal function to the masks to focus on a single object and remove artifacts that would negatively impact the resulting image. Here's a step-by-step breakdown, followed by an illustrative sketch:

1. Imports and Constants Setup: The script starts by setting constants img_height and
img_width to represent the height and width, respectively, of the images to be fed into
the model. num_channels is set to 3, indicating that images are in RGB.
2. Model Loading: The script loads our pre-trained model from the file model.h5 using
TensorFlow's load_model function.
3. Image Directory: The script sets images_dir as the directory from where it will read
new images for prediction. It creates a list called images containing the names of all
files in this directory.
4. Image Iteration and Prediction: The script enters a loop where it processes each
image file. For each picture, it opens the image file with PIL, stores its original
dimensions, and then loads the image using load_img. The image is converted to a
numpy array and normalized (divided by 255), then resized to match the model's
expected input size. The model then predicts the segmentation of the image. This
prediction is resized to match the original image size.
5. Prediction Saving: If the directory to save the masks doesn't exist, it's created. The
predicted mask is then saved in this directory with the same filename as the original
image.
6. Noise Removal Application: It applies a remove_noise function to the predicted mask,
using the OpenCV library to apply morphological operations (dilation, opening) and
distance transformations to remove unwanted noise from the input image, hence
smoothing out the results.
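For illustration, here is a hedged sketch of such a prediction loop. The directory names, model filename, binarization threshold and noise-removal parameters are assumptions; the real script follows the same structure (load model, normalize and resize, predict, resize back, clean up, save) with additional bookkeeping.

```python
import os

import cv2
import numpy as np
import tensorflow as tf
from PIL import Image

IMG_HEIGHT, IMG_WIDTH = 128, 128
model = tf.keras.models.load_model("model.h5")


def remove_noise(mask):
    """Clean the predicted mask: dilate, then open with a small kernel to drop speckles."""
    kernel = np.ones((3, 3), np.uint8)
    mask = cv2.dilate(mask, kernel, iterations=1)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel, iterations=2)
    return mask


images_dir, masks_dir = "new_images", "generated_masks"
os.makedirs(masks_dir, exist_ok=True)

for name in os.listdir(images_dir):
    original = Image.open(os.path.join(images_dir, name))
    w, h = original.size
    # Normalize to [0, 1] and resize to the model's expected input size.
    x = np.array(original.resize((IMG_WIDTH, IMG_HEIGHT)), dtype=np.float32) / 255.0
    pred = model.predict(x[np.newaxis, ...])[0, ..., 0]   # per-pixel probability map
    mask = (pred > 0.5).astype(np.uint8) * 255            # binarize
    mask = cv2.resize(mask, (w, h), interpolation=cv2.INTER_NEAREST)
    mask = remove_noise(mask)
    cv2.imwrite(os.path.join(masks_dir, name), mask)
```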

Figure 14 - Picture going through the masking process

Although the masks are now obtained, it is still interesting, if not crucial, to perform some standardized procedures to measure the accuracy of our model on new data before moving on. In order to test whether the weights are correctly defined in our .h5 files, we used 3 metrics: the loss, the accuracy and the Jaccard Index. The loss function (to be minimized) quantifies the disparity between the predicted value and the actual value for an instance of the dataset. Secondly, the pixel accuracy is a metric that measures the proportion of correct predictions made out of all predictions, i.e. the ratio of how well the model classified the data in a binary way. Finally, the Jaccard Index, also called the IoU (Intersection over Union) ratio, is a standard measure in the object segmentation field. It overlaps two masks and compares the sum of all points of intersection (where both masks are true) to the points of union (where at least one mask is true), giving us the proportion of the overlapped area to the combined area of both masks; since it is a ratio, the closer to 1, the better. To compute it, we ran our mask generating script on the test dataset and computed the mean IoU index of each produced mask in comparison with the one obtained from the given annotations of the file.
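Under the definitions above, the per-image IoU can be computed in a few lines of NumPy; this is a generic sketch rather than the exact evaluation script used here.

```python
import numpy as np


def iou(pred_mask, true_mask):
    """Jaccard index between two binary masks of identical shape."""
    pred = pred_mask.astype(bool)
    true = true_mask.astype(bool)
    intersection = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    return intersection / union if union > 0 else 1.0  # both masks empty -> perfect match


# Mean IoU over a test set:
# np.mean([iou(p, t) for p, t in zip(predicted_masks, ground_truth_masks)])
```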

Please note that, while it is applied when running the photogrammetric operations, the remove_noise function has been disabled when measuring the IoU, in order to normalize results and gauge the exactitude of the model and its unmodified output alone.

Figure 15 - Jaccard Index (Rosebrock, 2016)

In order to have some benchmark, we build 2 other models, one based on the same dataset,
with 30 epochs, and another based on the same dataset as well, but fine-tuned using a new
dataset of 2700 images. We then proceeded to test the 3 models successively on the same set
using the mask generation method described hereabove. The following table covers the results.

Model      | Loss   | Accuracy | Validation loss | Validation accuracy | IoU
Regular    | 0.4568 | 0.7916   | 0.4044          | 0.8184              | 0.6135
30 epochs  | 0.4187 | 0.8161   | 0.3693          | 0.8438              | 0.6122
Fine-Tuned | 0.3613 | 0.8422   | 0.4752          | 0.7919              | 0.6238

Figure 16 – Performance metrics comparison for the object segmentation trained models.

The 'Regular' model displays a training loss of 0.4568 with a training accuracy of 79.16%. For
the validation set, the loss is slightly lower at 0.4044, while the accuracy is higher at 81.84%.
The IoU, as described hereabove, for the 'Regular' model is 0.6135.
The '30 epochs' model presents a training loss of 0.4187 and a training accuracy of 81.61%. Its
validation metrics reveal a loss of 0.3693 and an accuracy of 84.38%. The IoU for this model
is very close to that of the 'Regular' model, coming in at 0.6122. This suggests that while the
model improved in terms of classification accuracy from 'Regular' to '30 epochs', the actual
overlap of the predicted areas with the true areas remained relatively stable.
Lastly, the 'Fine-Tuned' model reports the lowest training loss of 0.3613 and the highest
training accuracy of 84.22%. However, its validation loss of 0.4752 is notably higher than its
training loss, indicating potential overfitting. The validation accuracy drops to 79.19%.
Interestingly, this model has the highest IoU of 0.6238, suggesting that while its accuracy on
the validation set is lower, the overlap between its predictions and the actual areas is slightly
better.
From these results, the '30 epochs' model appears to provide the best balance between training
and validation performance, particularly in terms of classification accuracy and bounding box
prediction as indicated by the IoU. The 'Fine-Tuned' model, despite its impressive training
performance, seems to have issues with generalizing to unseen data.
The results were relatively good on new data but left too much room for improvement, which got us wondering: was there something inherently off with our initial dataset? Since it is the common ground for all three models, the possibility had to be considered. As a matter of fact, it is possible to train a semantic segmentation model using bounding box annotations if these are converted into binary mask annotations.
The bounding boxes essentially give you the coarse locations of the objects in the image. By
making the assumption that all pixels within a bounding box belong to the object (which may
not always be accurate, particularly when objects overlap), you can convert these bounding
boxes into binary masks where all pixels within the box are set to 1 (indicating the presence of
an object) and all other pixels are set to 0 (indicating the background), which is what had been
done here.
While this approach allows for simpler annotations to be used, it may not yield results that are as accurate as using pixel-wise mask annotations, particularly in images where objects overlap or where the bounding box includes a significant amount of background around the object. However, good background removal is crucial in our case, and this issue needed to be addressed. Therefore, using the same U-Net training operations, we trained 2 new models on an entirely new dataset of 5632 items, this time using exact coordinates to generate the training and validation mask sets. With that in mind, the mediocre results on the IoU test make a lot of sense, since we were comparing the wrong inputs (i.e. two different types of masks, bounding boxes against semantic segmentation).

Once again, to be able to compare our results under different parameters, we trained one model going through the pixel-wise annotations for 10 epochs, and the other for 20. We also measured the IoU index on a new test dataset using the Fine-Tuned model, to see what difference in performance a pixel-wise dataset makes when training a U-Net neural network.

Model                     | Loss   | Accuracy | Validation loss | Validation accuracy | IoU (new dataset)
Fine-Tuned                | 0.3613 | 0.8422   | 0.4752          | 0.7919              | 0.7925
Pixel-Wise annotations    | 0.2483 | 0.8945   | 0.2719          | 0.8836              | 0.7913
P-W annotations 20 epochs | 0.2180 | 0.9068   | 0.2520          | 0.8921              | 0.7989

Figure 17 - Performance metrics of the pixel-wise mask trained models

The newly provided results show performance metrics for three models: 'Fine-Tuned', 'Pixel-
Wise annotations', and 'P-W annotations 20 epochs'. Interestingly, the 'Fine-Tuned' model's
metrics are retained from the previous analysis, allowing for a direct comparison to the new
pixel-wise mask trained models.
1. Fine-Tuned Model: This model was not retrained so it has the same performance
metrics as previously, except for a significant increase in the IoU value when validated
on the new dataset. The IoU increased to 0.7925, a considerable leap from the previous
0.6238. This suggests that although the model was trained using bounding boxes, its
predictions align well with the pixel-wise ground truths in the new dataset.
2. Pixel-Wise annotations Model: With a training loss of 0.2483 and accuracy of
89.45%, this model seems to have benefitted from training on pixel-wise masks. Its
validation loss and accuracy, 0.2719 and 88.36% respectively, indicate a strong
generalization performance. The IoU for this model on the new dataset is 0.7913, which
is almost identical to the 'Fine-Tuned' model. This could suggest that while pixel-wise
mask training improves classification accuracy, its contribution to better spatial overlap
(as measured by IoU) is comparable to that of the bounding-box trained model.
3. P-W annotations 20 epochs Model: This model appears to be the best-performing one
in this set. It has the lowest training loss at 0.2180 and highest training accuracy at
90.68%. Its validation performance is also commendable, with a loss of 0.2520 and
accuracy of 89.21%. Additionally, it has the highest IoU of 0.7989 with the new dataset.
This suggests that additional epochs during training have helped to refine the model for
both classification accuracy and spatial overlap.

In summary, training on pixel-wise masks definitely improved the classification accuracy, as
observed from the 'Pixel-Wise annotations' and 'P-W annotations 20 epochs' models.
However, in terms of IoU on the new dataset, the improvement is relatively marginal. It's
notable that the 'Fine-Tuned' model, even though trained on bounding boxes, has an IoU value
in the same ballpark as the pixel-wise mask-trained models. This reinforces the idea that
training approach (bounding box vs. pixel-wise mask) and final validation IoU might not have
a direct linear relationship, and other factors in the model or dataset can contribute significantly
to the final performance.

Since we do have to make a choice about which model we will go on with, we picked the 'P-W
annotations 20 epochs' model, since it shows the best results on average.

Microsoft Azure

Microsoft Azure, often simply referred to as Azure, is a cloud computing service created by
Microsoft for building, testing, deploying, and managing applications and services through
Microsoft-managed data centers (Chou, 2011). Azure provides an array of integrated cloud
services that encompass computing, storage, data analytics, machine learning, networking, and
many more. It is part of Microsoft's broader vision of the "Intelligent Cloud" and is a key
component in Microsoft's business strategy.

As a cloud service, Azure follows the model of Infrastructure as a Service (IaaS), Platform as
a Service (PaaS), and Software as a Service (SaaS) (Chou, 2011). This offers the advantages
of high availability, scalability, and flexibility, where resources can be scaled up or down
depending on the demand. Users can pay for what they use and do not have to worry about
managing the underlying infrastructure.

Azure's IaaS capabilities include virtual machines and virtual networks, block and file-based
storage, and a content delivery network for caching content. For PaaS, Azure provides an
environment for developers to deploy their applications without needing to worry about the
underlying operating system, server hardware, or network infrastructure (Microsoft, 2023).
Azure's SaaS offerings include Office 365, Dynamics 365, and other productivity tools. In
addition to these, Azure also provides various AI and machine learning tools, databases,
DevOps tools, IoT services, and so much more, all integrated to provide a comprehensive
solution for diverse needs.

We will now describe each individual Azure Service that has been solicited and explain their
respective roles in the process. The following section has been entirely built thanks to Microsoft
documentation and Microsoft Learn, Microsoft's platform for providing free, online, self-paced
educational content about Microsoft products, services, and other related technologies. It's
designed to help learners of all backgrounds, from beginners to experienced professionals,
acquire skills, gain knowledge, and achieve certifications in various Microsoft-related topics.

Storage: Azure Blob Storage

Azure Blob Storage is Microsoft's cloud-based object storage solution, designed to cater to the
vast and dynamic requirements of modern-day data. This service is particularly optimized for
storing immense volumes of unstructured data. When we say unstructured, we're talking about
anything from text to binary data, which can range from documents and media files to logs and
application-specific data. In the realm of storage solutions, "Blob" is an acronym for "Binary
Large Object", and true to its name, it's designed to store any kind of data you throw at it.
Diving deeper, the architecture of Azure Blob Storage revolves around some key components.
At its core, we have the "Blob Containers" (not to be confused with regular containers like
those Docker offers). These containers can be likened to directories in a traditional file system,
serving as the primary grouping mechanism for blobs – they encapsulate each unit of storage
existing in this service. Every blob, regardless of its type or purpose, must belong to a container.
A noteworthy mention is that Azure Blob Storage operates on a flat file system structure. What
this means is that, unlike traditional file system managers, it doesn't understand the concept of
folders or hierarchies. However, it mimics this behavior by using naming conventions,
essentially faking a folder-like structure through cleverly structured blob names. This is where
most, if not all, of our data will be stored at some point in the process. The flexibility in terms of
content makes this particularly useful considering the variety of file types we are dealing with,
from images under various extensions to 3D models or text files such as code or sentinels.

Blobs are stored in data containers. Containers in Blob storage serve as top-level structures,
much like folders, where data blobs or objects are organized and stored. Nonetheless, other
storage means are available alongside Blob storage in an Azure Storage account. Queues are
mechanisms in Azure that enable communication between application components and services,
allowing messages to be stored for asynchronous processing. File Share in Azure provides fully
managed file shares in the cloud that can be accessed using standard protocols like SMB, giving
structure to blobs when exchanges with other services are required. Meanwhile, Tables in Azure's Blob storage context
offer a NoSQL store for semi-structured data, allowing for rapid development and fast access
to large quantities of data. Together, these elements enrich the Azure Blob storage ecosystem,
providing versatile tools for a variety of data storage and communication needs.

One of the standout features of Azure Blob Storage is its "Access Tiers" system. It’s a strategic
mechanism that ensures data is stored in the most cost-effective way. The "Hot" tier is designed
for data that sees frequent access, while the "Cool" tier is more for data that's occasionally
touched and sits for a minimum of 30 days. For data that’s almost dormant and is accessed very
rarely over a span of 180 days, the "Archive" tier is the most fitting. Most of our resources use
the Hot access tier since they are not meant to be stored in the long run, but rather keep changing
during the execution of a workflow, which makes the operations accessing them cheaper than
for Cool tier data, according to Microsoft's documentation.
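
For illustration, the tier of an individual blob can also be set programmatically through the Python SDK; in the sketch below, the container and blob names are placeholders and the connection string is assumed to be available as an environment variable.

import os
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"])
blob = service.get_blob_client(container="dataset", blob="image_001.jpg")

# Keep frequently accessed workflow data in the Hot tier
blob.set_standard_blob_tier("Hot")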

Now, considering the process we are involved in, Azure Blob Storage serves a dual purpose.
Initially, it acts as the input data source. The entire cycle begins when a new blob, an image,
lands in the Azure Blob Storage container, directly from the HoloLens 2. This new arrival
serves as a beacon, triggering the Azure Function into action. Post trigger, the processing phase
begins. Here, the Azure Function dives into its task, reading the input image, unleashing the
algorithms and transformations of the Mask Generator upon it, and eventually churning out a
result. Once processed, this result, which should be a mask, finds its resting place back in Azure
Blob Storage, specifically in the dataset/masks directory. This cycle ensures that data is not
only processed efficiently but also stored in an organized manner, ready for retrieval. Right
after, the initial pictures are moved from the unprocessed-dataset to the regular dataset, ready
to be used by Metashape alongside their respective masks. At the end of our ‘Image to 3D
Model’ process, the entire dataset is flushed, making room for the next operation.
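
As an illustration of the blob operations described above, the sketch below uploads a capture to the 'unprocessed-dataset' container and later moves it to 'dataset'. The container names come from our workflow, but the helper functions and the use of a connection string stored in an environment variable are assumptions made for the example, not the exact code of our functions.

import os
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"])

def upload_capture(local_path):
    # Upload one captured picture to the 'unprocessed-dataset' container.
    container = service.get_container_client("unprocessed-dataset")
    with open(local_path, "rb") as data:
        container.upload_blob(name=os.path.basename(local_path), data=data, overwrite=True)

def move_to_dataset(blob_name):
    # Server-side copy of a processed picture to 'dataset', then delete the original.
    source = service.get_blob_client("unprocessed-dataset", blob_name)
    target = service.get_blob_client("dataset", blob_name)
    target.start_copy_from_url(source.url)
    source.delete_blob()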

Utilizing Azure Blob Storage with Azure Functions provides a robust framework for
processing pipelines. This combination is designed to scale effortlessly, matching the ever-
growing demands of data processing. Its seamless integration capabilities with other Azure
offerings make it an invaluable asset for processes that lean on Azure services like Functions.
Lastly, the flexible cost structure ensures that while data grows, the expenses don't necessarily
have to.

Virtualization of Metashape: Azure Container Instances & Registries

In modern software development and deployment, the need for isolated and reproducible
environments has led to the widespread adoption of two principal technologies: containers and
virtual machines (VMs). Both technologies are designed to provide an isolated environment in
which applications can run, but they do so in fundamentally different ways and for different
reasons.

Virtual machines provide an abstraction of physical hardware, allowing multiple operating
systems to run on a single physical machine. Each VM includes a full copy of an operating
system, the application, necessary binaries, and libraries - making VMs relatively large in size.
This comprehensive nature offers a high degree of isolation, as each VM is entirely separate
from others. VMs are managed by a hypervisor, which sits between the hardware and the
operating system, facilitating the creation, management, and isolation of each VM. Containers,
on the other hand, virtualize the operating system rather than the underlying hardware.
Containers encapsulate the application's binaries, libraries, and configuration files, ensuring
that it runs consistently across different environments, but they share the OS kernel, making
them much lighter in weight than VMs (Sharma, Chaufournier, Shenoy, & Tay, 2016). This
results in faster start-up times and better utilization of underlying resources, which is relatively
convenient since Azure has a pay-for-consumption billing model. The more efficient the use
of resources, the smaller the price. Docker is probably the most popular platform designed to
develop, ship, and run applications inside lightweight, portable, and self-sufficient containers.
A Dockerfile is a script containing a set of instructions used by Docker to automate the building
of container images, specifying the base OS, application dependencies, and configuration
details. Figuratively, VMs could be considered as an entire house, full of equipment, that one
would require to perform a range of tasks, from taking a shower to watching a movie. In
comparison, containers are only one room with its set of tools, used to perform a specific task,
such as cooking a meal.

Although using a container over a VM seemed obvious, little did we know about the challenges
it would raise. To give some context, we were provided with an .msi4 file from the University
of Liège – a preactivated version of the software. Since we did not have a license key, using
existing Docker images was impossible, since they need to be paired at some point with said
key. The only solution left was writing our own Dockerfile, installing Metashape and its
dependencies on our image, to later be pulled and run in a container.

During the containerization process, we encountered a particular issue related to the graphics
processing unit (GPU). Metashape, being a photogrammetry software, can leverage GPU
acceleration, i.e. the parallel processing power of modern graphics cards, to perform
computations faster than using only the central processing unit (CPU). In photogrammetry,
intricate calculations are required to process, interpret, and convert large sets of images into
3D models. GPU acceleration makes these processes more efficient, allowing faster rendering
and reduced computational time.

However, due to the underlying differences between the Windows operating system and Linux-
based systems, which were originally the primary platform for container technologies, overall
GPU acceleration is not natively supported when running inside a container – which comes
into conflict with our plans since we only have access to a Windows version of Metashape.
Native Windows containers primarily support DirectX, Microsoft's collection of APIs
(Application Programming Interfaces5) used for multimedia and video tasks – depending on
the Windows Core used as base image on the container. On the other hand, many applications,
4 Microsoft System Installer, a type of file only compatible with Microsoft's OS.
5 Set of rules and protocols that allows different software entities to communicate with each other. (Microsoft, 2023)

including Metashape, rely on OpenGL6 for rendering and graphics-related computations. Given
this discrepancy, a Windows containerized application that requires OpenGL for GPU
acceleration won't function as expected, or in the case of Metashape, might not run at all, since
Windows containers don't natively support OpenGL commands. What seemed like a simple
task at first turned out to be of the highest complexity, when even Metashape documentation
nor contacting the app’s support could bring more information to the table.

Nonetheless, the OpenGL documentation mentioned various compatibility modes, including
ANGLE. ANGLE (Almost Native Graphics Layer Engine) became the most crucial component
in this scenario. It's an open-source project that allows Windows-based applications to run
OpenGL content by translating OpenGL calls to DirectX commands, which Windows systems
natively understand. By doing so, ANGLE bridges the gap between OpenGL applications and
the DirectX-focused nature of Windows containers.

Given the reliance of Metashape on OpenGL and the lack of native support for OpenGL in
Windows containers, a solution had to be devised. Here's how it was tackled (a consolidated sketch follows the list):
1. Using ANGLE for OpenGL Translation: The command --opengl angle was added
to Metashape's startup in the Docker file. This command essentially instructs
Metashape to use ANGLE for translating OpenGL commands, enabling the software to
interact with the GPU within the Windows container environment by converting these
commands into DirectX-understandable ones.
2. Running the Container with Specific Flags: When launching the container, the
following arguments are added to our docker run. They provide specific instructions:
• --isolation process sets the container's isolation mode to process, allowing it to
run in the same user namespace as the host, facilitating better hardware access.
• --device class/5B45201D-F2F2-4F3B-85BB-30FF1F953599 provides the
container access to the host's GPU, enabling DirectX GPU acceleration. This
also added a constraint: the base image used in our container had to match and
be compatible with the host's OS, otherwise access to the hardware would not be
allowed.
3. Inclusion of Necessary DLLs: The Docker file explicitly copies and installs various
dynamic link libraries (DLLs), like glu32.dll, opengl32.dll, and DirectX shader
compiler files. These are required for graphics rendering and GPU acceleration in the
Windows environment and are dependencies for Metashape.
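
Putting these three points together, the container setup looked roughly like the sketch below. The base image tag, installer name, installation path and script location are placeholders, and the exact Dockerfile is not reproduced here; only the --opengl angle, --isolation and --device arguments are taken directly from our configuration.

# Dockerfile (sketch) - the Windows base image must match the host's OS version
FROM mcr.microsoft.com/windows:20H2

# Install the preactivated Metashape .msi provided by the University
COPY metashape.msi C:/install/metashape.msi
RUN msiexec /i C:/install/metashape.msi /qn

# DLLs required for rendering (glu32.dll, opengl32.dll, DirectX shader compiler)
COPY dlls/ C:/Windows/System32/

# Force Metashape to translate its OpenGL calls to DirectX through ANGLE
ENTRYPOINT ["C:/Program Files/Agisoft/Metashape Pro/metashape.exe", "--opengl", "angle", "-r", "C:/scripts/general_workflow.py"]

The container is then launched with process isolation and access to the host's GPU device class:

docker run --isolation process --device class/5B45201D-F2F2-4F3B-85BB-30FF1F953599 <our-metashape-image>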

Containerizing Metashape using a Windows image presented unique challenges, primarily due
to GPU acceleration requirements and the differences in graphics APIs supported by Windows.
By leveraging tools like ANGLE and specific Docker commands, it was possible to bridge the
gap between OpenGL and DirectX, allowing Metashape to run efficiently within a Windows
container while accessing the host machine's GPU capabilities. It is possible to run the
container without hardware acceleration, but the table below reports the processing time of each
photogrammetry step under identical conditions, with and without GPU acceleration. The numbers
speak for themselves; it is thus not necessary to elaborate on their interpretation.

6 A cross-language, cross-platform API for rendering 2D and 3D graphics, used in applications that need graphics rendering, like games or visualization software. (OpenGL, 2023)

Paradigm                   Aligning photos   Building dense cloud   Building mesh   Building texture   Total time
With GPU acceleration      53.3 s            54.5 s                 89.1 s          152.1 s            349 s
Without GPU acceleration   368.0 s           1262.2 s               107.3 s         1516.5 s           3254 s

Figure 18 - Comparison of processing time with and without GPU acceleration

Azure's cloud ecosystem offers a plethora of services to manage and deploy containers. Among
these, Azure Container Instances (ACI) and Azure Container Registry (ACR) stand out as
crucial components.
Azure Container Registry (ACR) functions primarily as Azure's managed Docker registry
service. It's analogous to Docker Hub, but with the distinction of being private and nestled
comfortably within Azure's architecture. With ACR, developers are endowed with the ability
to store, manage, and orchestrate their container images and related content. A useful feature
of ACR is its seamless integration with the Docker CLI. This means developers can push
Docker images directly to ACR without veering away from their familiar development
environments. Adding to its arsenal is its proactive scan of images for vulnerabilities, adding
another layer of assurance for developers.
On the other side of this equation is Azure Container Instances (ACI), designed to quickly
deploy containers without getting bogged down in provisioning or managing VMs or entire
clusters. Every container in ACI is an isolated unit, making it a cost-effective solution billed
by the second. A standout feature is its compatibility with popular industry tools, from the
Docker CLI to Kubernetes, enabling versatile applications. Moreover, with Azure File Share,
containers in ACI can conveniently share data through mounted volumes. And, just like its
counterpart ACR, ACI places a high premium on security.

The dance between ACR and ACI is a synchronized one. Once our Docker image of
Metashape is stored in ACR, it is ready for deployment. It can then be deployed to Azure
services like ACI or even Azure Kubernetes Service, another container service that emphasizes
orchestration between multiple containers.
To add a layer of automation, ACR webhooks can be set up to notify when container image
updates occur. This could trigger various automated processes, from redeployments to audit
logs.
So, let's imagine we're looking to containerize Metashape within this framework. The journey
would begin with crafting a Docker image using a Dockerfile. This image encapsulates not
only the Metashape software but also its dependencies and any scripts or data it might require.
Post local testing, this image is then ushered into ACR, marking it as accessible for Azure
services. When the need arises to execute Metashape, instead of navigating through the
processes of launching a VM or establishing a Kubernetes cluster, the container can be
deployed straight from ACR to ACI. Here, Metashape gets a conducive, isolated environment
for our planned operations.
But the magic doesn't stop there. As Metashape hums away, processing data, it might need to
pull from a resource like Azure Blob Storage, which can serve as a reservoir for images needed
in photogrammetry. Post-processing, the results can be archived back into Azure Blob Storage
or any suitable data store. For those requiring automation like us, Azure Functions is an ally,
monitoring specific events like task completions and catalyzing subsequent processes, which
will be covered in the next section.

In the grand tapestry of Azure's services, ACI and ACR emerge as two pivotal threads. ACR
stands as a beacon for container image management, while ACI offers a streamlined
environment for container deployment. Together, they create a harmonious synergy,
empowering developers to weave scalable and efficient container workflows, especially for
applications like Metashape.

Environment                                 Aligning photos   Building dense cloud   Building mesh   Building texture   Total time
Regular local machine                       23.4 s            53.2 s                 55.1 s          125.9 s            257.6 s
With GPU acceleration                       53.3 s            54.5 s                 89.1 s          152.1 s            349 s
Without GPU acceleration                    368.0 s           1262.2 s               107.3 s         1516.5 s           3254 s
Container using Azure Container Instances   232.2 s           997.6 s                52.4 s          157.03 s           1439.2 s

The table illustrates a comparative analysis of the time required to perform various stages of a
photogrammetry process under different computational environments. These stages include
aligning photos, building a dense cloud, building a mesh, and building a texture. Each of these
stages involves different computational tasks, and their performance is evidently sensitive to
the hardware and software configurations of the environment they are executed in.

On a Regular Local Machine, the total time taken is 257.6 seconds. This environment likely
represents a baseline configuration, perhaps a standard CPU-based processing unit. The most
time-consuming stage here is building the texture, which requires 125.9 seconds. This might
be due to the complex computations involved in texture mapping, which may involve blending
multiple images, handling large image files, and performing intricate calculations to map these
textures onto a 3D model.

When GPU acceleration is enabled, the total time surprisingly increases to 349 seconds. One
might expect the process to be at least as fast with GPU acceleration, as GPUs are known for
their ability to handle parallel tasks efficiently. The fact that the process took longer could
suggest that the algorithms used in this photogrammetry process were not optimized for parallel
execution, or that the GPU's capacity was not effectively utilized. It is also possible that the
overhead of transferring data between the CPU and GPU negated the potential benefits of the
GPU's faster computation. Finally, it highlights how the ANGLE operations, as convenient as
they are, are not perfect and require significant additional time.

Without GPU Acceleration, the process takes significantly longer, totaling 3254 seconds. This
is an expected outcome, as the absence of GPU support means the CPU must handle all the
computational tasks. This is particularly impactful in the 'Building Dense Cloud' stage, which
requires 1262.2 seconds – a stark contrast to other environments. This stage likely involves a
high degree of parallelism (e.g., pairwise comparisons between images), which a GPU could
handle more efficiently than a CPU.

The experiment conducted using Azure Container Instances (ACI) resulted in a total time of
1439.2 seconds. ACI is designed to offer a convenient and scalable solution for running
containers in the cloud. However, in this instance, it did not provide any form of virtual GPU
for Windows-based images, despite the appropriate flags being set when starting the container.
The 'Building Dense Cloud' stage in the ACI environment is significantly slower compared to
the regular local machine and the environment with GPU acceleration. This could be due to the
fact that, while ACI is an excellent solution for scalable and isolated execution of containers,
it might not be optimized for intensive computational tasks, especially when such tasks would
benefit from specialized hardware like GPUs. As a matter of fact, some processes took more
than 3000s to come to an end.

This experiment’s disappointing outcome in the ACI environment points to an important
limitation of this cloud service when it comes to Windows-based images. The inability to
access virtual GPUs—even when they are explicitly requested—can lead to significant
performance degradation. This could be due to a range of factors, including the virtualization
layer in the cloud environment, which may add overhead and latency to the computations, or
potential throttling of resources on the cloud instances to ensure equitable distribution across
multiple tenants.

Furthermore, the difference in the performance between local execution and cloud-based
execution (ACI) hints at the potential network latency, data transfer overhead, and cold start
times associated with cloud instances. When performing intensive computational tasks like
photogrammetry, these factors can accumulate and lead to substantial delays, as observed in
the results.

Automation of the process: Azure Functions

Within the expansive realm of Azure’s cloud computing services, Azure Functions emerges as
the instrumental tool we will use for our operations’ orchestration, an invaluable asset for our
project's success. Azure Functions stands as proof of the evolution of cloud computing,
emphasizing an event-driven, serverless paradigm. It facilitates the execution of discrete units
of code, or "functions," in response to specific events, abstracting away the infrastructure
management. This “serverless” paradigm doesn't imply an absence of servers; rather, it
represents an abstraction that allows developers, like us in this endeavor, to focus solely on the
business logic. This design supports modularity, scalability, and efficient resource usage,
particularly apt for scenarios where actions should occur in response to changes in data or state.

Each function in Azure Functions encapsulates a piece of functionality, and when triggered, it
can execute tasks like processing data, integrating with other systems, or even computing
responses to web requests. This modular design facilitates scalability and maintenance. Azure
Functions supports a variety of programming languages such as C#, Java, JavaScript,
TypeScript, and Python, ensuring a broad applicability to various developer audiences. To stay
consistent with what has been done so far, we will use Python as the main programming language.
It's worth noting the distinction between traditional cloud computing models and serverless
architectures like Azure Functions. In traditional models, even if no tasks are being executed,
costs might still be incurred for the reserved compute resources. In contrast, serverless models only
charge when the function is executing, aligning costs more closely with actual usage.

We shall now give a few words about the three functions used to execute our process and how
Azure Functions helped us choreograph the entire procedure; an illustrative sketch of how their
triggers chain together follows the descriptions below.

A) StartProcess

• How it works: As the name suggests, this function activates when there is a change in
Azure Blob Storage, depending on the condition set. In our case, it is when 40 blobs
are uploaded. Since we are automating the taking of pictures via the HoloLens, we know
the exact number of pictures that will be taken. Therefore, when that exact number is
reached, the function activates.
• Operational Role: This function is relatively simple and acts as the starting point of our
data's Azure journey. Basically, it periodically counts the number of files in the
'unprocessed_dataset' folder until it reaches the number we are expecting; a sentinel is
then created. This triggers the second function.

B) StartMask

• How it works: Initially, this function was activated by messages in an Azure Storage
Queue. Queue messages in Azure are often used to communicate between cloud
services, representing tasks or information. However, due to the inconsistency between
the periods at which Containers and Queues are checked, we preferred to switch to a Blob
Trigger function as well.
• Operational Role: This function actually contains an Azure-fit version of the Mask
Generation script. Once the designated sentinel is detected, it starts running the code.
From there, it moves the pictures from 'unprocessed-dataset' to 'dataset' so they will
not keep triggering the first function, and processes the pictures to create their respective
masks. Finally, a sentinel7, "completed.txt", is created at the end to announce the end of
the process.

C) Start3D

• How it works: This function is also a Blob Triggered function. This time, we look for
the specific “completed.txt” file created at the step before.
• Operational Role: The function waits for the masks to be created (and the
completed.txt file to exist) before running the Azure Container Instance. Once it starts, it
means the groundwork has been correctly laid and we are ready to create the 3D model.
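
As an illustration of how these three triggers chain together, here is a minimal sketch using the Python v2 programming model for Azure Functions. The container names mirror the ones described above, but the function names, the sentinel file name used by StartProcess ('ready.txt') and the bodies are simplified placeholders rather than our exact implementation.

import logging
import azure.functions as func

app = func.FunctionApp()

EXPECTED_PICTURES = 40  # number of captures uploaded by the HoloLens per session

@app.blob_trigger(arg_name="blob", path="unprocessed-dataset/{name}",
                  connection="AzureWebJobsStorage")
def start_process(blob: func.InputStream):
    # Fires as pictures land; count the blobs and, once EXPECTED_PICTURES is
    # reached, write a sentinel blob that wakes up the masking function.
    logging.info("New capture received: %s", blob.name)

@app.blob_trigger(arg_name="sentinel", path="unprocessed-dataset/ready.txt",
                  connection="AzureWebJobsStorage")
def start_mask(sentinel: func.InputStream):
    # Run the Mask Generation script, move the pictures from
    # 'unprocessed-dataset' to 'dataset', then create 'completed.txt'.
    logging.info("Sentinel detected, generating masks")

@app.blob_trigger(arg_name="done", path="dataset/completed.txt",
                  connection="AzureWebJobsStorage")
def start_3d(done: func.InputStream):
    # Start the Azure Container Instance that runs the Metashape workflow.
    logging.info("Masks ready, starting the 3D reconstruction container")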

HoloLens 2 Application Interface

Having constructed a robust backend infrastructure, the next crucial step is to elucidate how
this backend seamlessly integrates with the user's device, in this case, the HoloLens 2. The
objective is not merely to present data but to craft an immersive experience that leverages the
full capabilities of mixed reality and the sophisticated features of the HoloLens 2. To this end,
we utilize the Mixed Reality Toolkit (MRTK), a key asset in the development of mixed reality
applications.

7 Designated file, usually empty, used as a signal or indicator to control or synchronize processes within a system.
Unity, a powerful cross-platform game engine, serves as the development environment in
which this HoloLens application is crafted. Unity’s flexibility and extensive features make it
an ideal choice for developing mixed reality applications. It provides the tools necessary to
design, simulate, and test the user experience, ensuring that the photo capture session runs
smoothly and efficiently. The MRTK provides a vast array of tools, assets, and pre-built
components that significantly ease the development process, allowing developers to focus on
the unique aspects of their applications. The toolkit is tailored to the HoloLens ecosystem,
ensuring that apps built using it are optimized for performance and user experience on the
device.

Figure 19 - GUI for our application

Our primary goal for the HoloLens 2 application is twofold: Firstly, to facilitate the uploading
of pictures to our Azure Blob Storage, and secondly, to retrieve and render the resultant 3D
model generated by our backend. Upon the initialization of the application, the HoloLens 2 is
programmed to commence an automated photography session. Over the span of 15 seconds, it
captures a series of 30 images, resulting in two pictures being taken per second. This carefully
orchestrated sequence is designed to capture the targeted object from various angles and
perspectives, thereby creating a comprehensive visual dataset. The user plays an active role in
this process, as they are instructed to walk around the object of interest during this 15-second
interval. This movement allows the device to capture the object from various angles and
distances, forming a rich and detailed collection of images that are essential for the subsequent
3D model generation.

These images are then processed and sent to our Web API. This Web API, hosted on Azure, acts
as an intermediary. It securely receives the image data, manages its storage in the Azure Blob
Storage container, and triggers the backend processes that generate the 3D model. It's worth
noting that by using a Web API, we abstract away direct interaction between the device and
Azure. This not only simplifies the client-side logic but also enhances security by obviating the
need to store sensitive Azure credentials on the device.
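
The exact framework behind this Web API is not important for the discussion; purely as an illustration of the role it plays, an HTTP-triggered Azure Function could receive the image bytes and forward them to Blob Storage as sketched below (route, parameter names and container name are assumptions for the example).

import os
import azure.functions as func
from azure.storage.blob import BlobServiceClient

app = func.FunctionApp()

@app.route(route="upload", methods=["POST"])
def upload_image(req: func.HttpRequest) -> func.HttpResponse:
    # The HoloLens posts raw image bytes; the target file name travels as a query parameter.
    name = req.params.get("name", "capture.jpg")
    service = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"])
    container = service.get_container_client("unprocessed-dataset")
    container.upload_blob(name=name, data=req.get_body(), overwrite=True)
    return func.HttpResponse(f"Stored {name}", status_code=201)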

Once the backend completes its processing and a 3D model is generated, a reference or link to
this model is returned to our HoloLens application through the same Web API. The application
then fetches this model and, leveraging the rendering capabilities of Unity and the spatial
awareness of the HoloLens 2, displays the model in mixed reality. The user can then interact
with this model, possibly moving, rotating, or scaling it, all within the confines of their physical
space.

This interaction is made possible by the unique spatial recognition capabilities of the HoloLens
2, complemented by the tools provided by the MRTK. Furthermore, the MRTK ensures that
our application adheres to best practices for user interactions in mixed reality, providing
intuitive gestures and visual feedback.

In summary, the end-to-end process from capturing an image on the HoloLens 2 to visualizing
a 3D model in mixed reality is facilitated by a symphony of components. The HoloLens 2
serves as the front-end interface, the Web API bridges the gap between the device and Azure
services, and Azure Blob Storage, along with other backend components, handles data
processing and storage. Together, these elements craft a fluid and immersive user experience,
epitomizing the potentials of modern mixed reality applications.

Initially, the development plan for the application included the integration of a test feature,
envisioned to enhance the user's interaction with the generated 3D models. This feature was
designed to allow users to engage with virtual bounding boxes surrounding the 3D models,
thereby providing a means to manipulate and interact with these models within the mixed
reality environment. Furthermore, it was intended that these bounding boxes would enable
meaningful interactions between different 3D models and between the models and the user’s
physical environment.

However, during the testing phase, it was observed that the 3D models generated via the
photogrammetry process exhibited significant artifacts. These artifacts, which are unintended
and anomalous features introduced during the model generation process, had a profound effect
on the resultant models. They manifested as irregular and inaccurate geometric shapes, which,
in turn, caused the computed bounding boxes to deviate substantially from the actual contours
of the 3D models.

This deviation rendered the bounding boxes essentially nonsensical—meaning that they neither
accurately represented the geometry of the 3D models nor aligned with the physical reality that
the models were intended to replicate. As a consequence of these inaccuracies, any interactions
predicated on the bounding boxes, whether between different 3D models or between a model
and the user’s physical environment, were potentially compromised.

Solution Analysis
Benefits

In this thesis, the scope of inquiry and development extends across multiple domains, reflecting
the inherently cross-platform nature of our project. The work undertaken is a testament to the
intricate choreography of technologies that modern mixed reality applications entail.
Development commenced on a local computer, leveraging the robust tools and libraries that
Unity provides for crafting interactive 3D environments. Simultaneously, Azure, Microsoft's
cloud computing platform, played a pivotal role as the backbone for testing, data storage, and
backend computations. In addition, meticulous attention was devoted to the crafting of a
seamless and intuitive front-end experience, aimed at providing the end-user with a user-
friendly interface for interacting with complex 3D data.

The HoloLens 2 emerged as a central player in this intricate dance of technologies. With its
advanced mixed reality capabilities and notable portability, the HoloLens 2 is posited as an
exemplary device for on-the-go photogrammetry. Its intuitive gesture controls and voice
commands, coupled with a comfortable and lightweight design, render it remarkably user-
friendly. These attributes expedite the learning curve, enabling users, regardless of their
technical proficiency, to engage with the device and the photogrammetry process effectively.
Furthermore, the HoloLens 2’s spatial understanding and tracking capabilities are instrumental
in capturing the necessary image data with precision, which is foundational for the creation of
accurate 3D models.

Despite the considerable advantages of employing the HoloLens 2 in this context, the journey
was not without its challenges. From computational constraints to issues arising from
photogrammetric artifacts, a series of drawbacks surfaced during the development phase. These
issues presented both technical and conceptual hurdles that required innovative problem-
solving and adaptability.

Nevertheless, through perseverance and iterative refinement, this thesis presents a nascent
system that stands as a significant step towards a novel application of mixed reality technology
for photogrammetry. The work herein does not purport to be a final, polished solution, but
rather a foundational effort that charts a path towards more refined and robust implementations.

In this research, a notable feature of our development strategy was the harmonization of
multiple programming languages across diverse software architectures, a complex task that
was central to the realization of our cross-platform vision. The seamless integration of these
languages and technologies was no trivial task; it demanded rigorous architectural planning,
meticulous code organization, and a deep understanding of the interoperability mechanisms
that allow these disparate components to communicate effectively and securely. Despite the
inherent complexities, we managed to establish a fluid and cohesive workflow, wherein each
component, regardless of its underlying language or architecture, functioned as an integral part
of a unified, well-orchestrated system.

In the next part of this thesis, we will delve deeper into these nuances, offering a critical
analysis of the challenges encountered, the lessons learned, and the potential avenues for future
research and development. This forthcoming discussion is aimed at not only critiquing the
current work but also at envisioning the next iterations of this cross-platform photogrammetry
solution, with the HoloLens 2 remaining as a promising and central tool in this endeavor.

Recommendations & Future work

While the intent was to develop a proof-of-concept application for the use of a HoloLens 2
device in logistics, this implementation could hypothetically be used in real-case scenarios
beyond warehouses. This would require training the neural network on new datasets for the
particular task at hand.

Whether it is for a particular company or sold as a service, the major cost of this process resides
in the purchase of Microsoft's device. To reduce that cost, since the backbone computations
rely on Azure, we could partially port the same application to a different device, such as an
Android device that the agent receiving the parcels could use to send the data to the Blob
Storage. That way, only the workers who actually require MR for their work, such as those in
charge of bin packing, would need a HoloLens 2, and they would get ready-to-use 3D models
without having to wait for the processing time once the pictures are sent.
During our endeavor, we always tried to minimize the cost of our actions, especially on Azure,
but a profitability analysis should be carried out to get a better overview of the return on
investment in infrastructure. There even exist so-called photogrammetry stations, usually called
'3D Scanning Studios', that are able to capture ready-to-use pictures for photogrammetry, or
even video for other uses.

Figure 20 - 3D Scanning Studio (Peerspace, 2023)

In the long run, if included in the process and tailored to the facility's needs, it could well be
worth the investment. For example, on commercial flights, it is a real challenge for carriers to
efficiently store luggage in the aircraft. Usually, every unit goes through a conveyor belt. If a
3D scanning studio were installed, every suitcase would become a 3D model that the baggage
handler, equipped with a HoloLens 2 running a bin-packing app, could use to be more efficient,
ultimately having an impact on their workload and possibly their health. This is just an
example of how tailoring the process created here to the requirements of a specific usage can
foster the use of technology and open up new possibilities.

Finally, since the process requires a connection to the cloud, the user must stay close to an internet
hotspot, preferably a strong one, because of the initial upload and final download of files. Unfortunately,
the HoloLens 2 does not possess the required resources to perform accurate photogrammetry
on its own. Photogrammetry involves the processing of numerous high-resolution images to
generate 3D models, a process that is computationally intensive. Consequently, the hardware
requirements for smooth performance in photogrammetry applications are relatively high.
Based on the guidelines from major photogrammetry software providers, the average hardware
requirements generally include a modern multi-core processor, 16GB to 64GB of RAM (or
more for larger projects), a dedicated NVIDIA or AMD graphics card with at least 4GB of
VRAM, and ample storage space on a solid-state drive (SSD) for quick read/write operations
(Agisoft, 2019; Pix4D, 2020). It's also recommended to have a high-speed internet connection
for cloud-based processing and data sharing.

In comparison, the HoloLens 2, Microsoft's mixed-reality headset, is designed more for
displaying and interacting with 3D content rather than processing it. The device comes
equipped with a custom Holographic Processing Unit (HPU) and a Qualcomm Snapdragon 850
Compute Platform. It boasts 4GB of RAM and 64GB of internal storage, along with a 2K 3:2
light engine for display purposes (Microsoft, 2019). While these specifications enable the
HoloLens 2 to provide an immersive mixed-reality experience, they don't match the hardware
demands of photogrammetry processing. Therefore, while the HoloLens 2 can be used to
visualize and interact with 3D models produced through photogrammetry, the actual processing
to create those models would be more efficiently done on a well-equipped desktop or
workstation. The spatial mapping capability of the HoloLens is primarily designed to support
real-time mixed reality experiences, where the device maps its immediate surroundings to
overlay digital content accurately. This real-time environment mapping prioritizes speed over
ultra-high detail, aiming to offer a seamless interaction between the user and the mixed-reality
content. While this serves the purpose of augmented reality well, it does not provide the high
precision, dense sampling, and image overlap that photogrammetry demands.

3D Model Building

While the technique itself cannot be changed when dealing with Metashape, there are a few
details that could help the process. Indeed, there exist some parameters in the script that could
have been fine-tuned, and we could provide a way for the user to edit those via the GUI. Here
is a non-exhaustive list of parameters that could be changed and tweaked to adjust the result to
the requirements of the user; a sketch of how they map onto the Metashape Python API follows
the list.
1. Alignment Accuracy: This setting affects the photo alignment process. Higher
accuracy settings will make the software spend more time and computational resources
to align the photos but could result in a more accurate and detailed model. However, it
may also increase the likelihood of alignment errors in the case of low-quality images.
2. Key Point Limit: This parameter controls the maximum number of key points
(distinctive features) that the software will identify in each image during the alignment
process. Increasing this limit can lead to a more detailed model, but it will also increase
the computational cost.
3. Tie Point Limit: Similar to the key point limit, this parameter controls the maximum
number of tie points (key points that are matched between multiple images) that the
software will use during the alignment process. Increasing this limit can improve the
accuracy of the model, but it will also increase the computational cost and may lead to
more noise in the resulting point cloud.
4. Depth Map Resolution: This setting affects the resolution of the depth maps that are
used to build the dense point cloud. Higher resolution will result in a more detailed
model, but it will increase the computational cost and may make the process more
sensitive to noise in the images.

5. Dense Cloud Quality: This parameter controls the quality of the dense point cloud.
Higher settings will produce more detailed models but will increase the computational
cost.
6. Mesh Generation Settings: These settings control the process of building a 3D mesh
from the dense point cloud. They can affect the smoothness and detail of the resulting
model, as well as the size of the mesh (in terms of the number of vertices and faces).
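
To make the link with the workflow script in Appendix 1 explicit, the sketch below shows roughly how these settings map onto arguments of the Metashape Python API. The parameter names follow the 1.5 API used in the appendix and the values are arbitrary defaults; exact names and options vary between Metashape versions, so this should be read as an indicative mapping rather than a definitive reference.

import Metashape

doc = Metashape.Document()
chunk = doc.addChunk()
chunk.addPhotos(photos)  # 'photos' is the list of image paths built as in Appendix 1

chunk.matchPhotos(
    accuracy=Metashape.HighAccuracy,   # 1. alignment accuracy
    keypoint_limit=40000,              # 2. key point limit
    tiepoint_limit=4000,               # 3. tie point limit
    generic_preselection=True,
)
chunk.alignCameras()

chunk.buildDepthMaps(quality=Metashape.MediumQuality)    # 4. depth map resolution
chunk.buildDenseCloud()                                  # 5. dense cloud quality follows the depth map quality
chunk.buildModel(surface=Metashape.Arbitrary,
                 face_count=Metashape.MediumFaceCount)   # 6. mesh generation settings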

Also, we were provided with an outdated version of Agisoft Metashape. By upgrading the
version, the computational time could potentially have been reduced while accuracy increased,
giving us better results in less time, as updates generally improve performance.

Finally, no error handling is planned in the workflow. Indeed, Agisoft Metashape sometimes
fails to align pictures correctly due to the lack of information in the given photos under specific
conditions. While the logs from the deployed functions output some alarming messages, no
particular error catching is set up. That means computational power is wasted on a process that
cannot output a proper model from the start. Therefore, we should plan to include an
error-catching mechanism in our script to warn the user of the failure and stop the process as
soon as possible when required.

Pre-processing of the pictures

When building the object segmentation algorithm, the model used was the U-Net for a range
of reasons explained in the corresponding section. However, another model, the DeepLabv3+
could have fit the task. While both are renowned for semantic segmentation, they differ in
architecture, the method of capturing context, and the kind of problems they are best suited to
address:
1. U-Net: It employs a symmetric encoder-decoder structure where the encoder captures
context, and the decoder enables precise localization. U-Net's success is partly due to
its efficient use of the contracting path to capture context and a symmetric expanding
path that enables precise localization. It's particularly effective when data is scarce.
2. DeepLabv3+: This model, on the other hand, utilizes an encoder-decoder structure with
atrous convolutions and spatial pyramid pooling in the encoder part to capture multi-
scale context. It further refines the segmentation results with a decoder module,
especially around object boundaries. The DeepLabv3+ model shines in scenarios where
high-quality delineation of object boundaries is required.
In essence, the choice between U-Net and DeepLabv3+ depends on the specific requirements
and resources of the segmentation task. On the one hand, when precision and detailed
boundary segmentation are critical, DeepLabv3+ may be the more suitable option. However,
when it comes to our task, the advantage of having multiple pictures to cross-reference data from
leaves room for less precise masking. Furthermore, when dealing with a small dataset, U-Net
usually delivers better performance. Finally, the implementation of a DeepLabv3+ model is
much less straightforward and actually highly complex, since it is often initialized with weights
from pre-trained models, requires more computational resources and has a higher need for
training data due to its complicated architecture (Chen, Zhu, Schroff, & Adam, 2018).

Also, during the training, we reduced the size of each picture drastically to simplify the
training's computations. In practice, that means a lot of resizing has to happen during the
masking process, since the model can only handle 128x128 images, and we thus obtain glitchy
and overstretched masks. While the remove_noise function is there to smooth the resulting
binary pictures, with more powerful machines we could actually afford a higher image
resolution when training the neural network. That being said, the smoothing works relatively
well, so, however minimal the difference might be, it would still be worth running some tests.
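
For illustration, the resize round-trip looks roughly like the sketch below; the function name is a placeholder and our actual remove_noise implementation is not reproduced here.

import cv2
import numpy as np

def predict_full_resolution_mask(model, image_bgr):
    # Downscale to the 128x128 input the U-Net expects, predict, then upsample back.
    height, width = image_bgr.shape[:2]
    small = cv2.resize(image_bgr, (128, 128)) / 255.0
    probabilities = model.predict(small[np.newaxis, ...])[0, ..., 0]
    mask = (probabilities > 0.5).astype(np.uint8) * 255
    # Nearest-neighbour upsampling keeps the mask binary but produces jagged,
    # overstretched edges, hence the smoothing pass applied afterwards.
    return cv2.resize(mask, (width, height), interpolation=cv2.INTER_NEAREST)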

We built a model that achieves semantic segmentation: it can detect boxes on a picture and
mask them. However, this is under the assumption that only one box is on the image. If we
decided to go for instance segmentation, the neural network would be able to detect and identify
each box individually. With much more work, we could thus derive sub-pictures from the initial
one, match them together and obtain multiple sets from the same initial photo.

Speaking of data, although it would have been more complex to handle multiple mask sources,
we could have used the LiDAR integrated in the HoloLens 2 to generate depth maps, add a new
layer of understanding to the RGB images and obtain a more precise masking of the pictures.
Indeed, all pre-processing has been built upon a neural network trying to give a class to each
pixel of a picture, with its share of mistakes. Directly using depth maps generated by the headset
itself could have either complemented or even replaced the entire machine learning step with
much more accurate measurements.

When it comes to the pictures themselves, because of slight variations in lighting conditions,
some pictures can have a different tint. Although this is usually not alarming, a common filter
could be applied to normalize the color palette across photos and maximize the success of the
alignment operation.
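
One simple candidate for such a filter, given purely as an assumption of what 'normalizing the color palette' could look like, is an equalization of the lightness channel (CLAHE) applied uniformly to every picture of a set:

import cv2

def normalize_tint(image_bgr):
    # Equalize lightness in LAB space so pictures of a set share a similar tint.
    lab = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    lab = cv2.merge((clahe.apply(l), a, b))
    return cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)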

Microsoft Azure

With its steep learning curve, Microsoft Azure brought a lot of challenges. Indeed, with its
200+ services, part of the initial difficulty was to navigate these uncharted waters and decipher
what would potentially be useful. With its brand-new concepts and completely different
architecture, it is hard to optimize the process without being a professional or certified Azure
developer. Let us however discuss some areas that would deserve to be looked at in future
similar work.

First and foremost, our process unfortunately does not support processing multiple capture sets
in parallel. Because of the way we set up the BlobTrigger function (i.e. periodic checks of the
unprocessed-dataset), we would obtain a folder full of parasite images belonging to two
different sets, which would inevitably lead to the failure of the process. By using a stop-start
system and a container lock, we were able to prevent this issue. However, creating environment
variables that would assign an ID to a set and allow each uniquely identified set to be processed
in parallel would make room for smoother workflows and fewer constraints. In that hypothetical
case, we should check all services for their maximum allocated capacity, and possibly consider
upgrading the consumption model of some of them. That would eventually have an inherent
impact on cost, and cost-efficiency tests should be run to confirm it is worth it in view of the
volume of processed data.

A second point to discuss is the containerization of Agisoft Metashape. Since our fundamental
conditions were not great to begin with, we can assert that having access to a Linux-based
version of Agisoft Metashape would have made computation quicker, since the OpenGL API
calls would not have to be translated on the fly into DirectX instructions. Furthermore, as
mentioned, Azure Container Instances does not provide any GPU support for Windows-based
container images. Therefore, although GPU acceleration is achievable locally, a different setup
would be required when using Azure. Indeed, Kubernetes, a different container service that is
highly complex to set up, offers virtual GPUs for Windows-based images. Unfortunately, the
complexity of its configuration prevented us from implementing such a service, since it was
beyond our skills.

To reuse the house allegory brought up when comparing VMs and containers, where the former
was the full house and the latter a mere kitchen, our meal was cooked using a microwave while
an oven was needed – it could have been better under other circumstances, but it still provided
a decent solution considering the available tools.

Finally, at the moment of building the container, Microsoft restricted Free Student Accounts to
a maximum of 6 virtual cores, while we requested 8. Due to configuration requirements, we
could ultimately only use 4, coupled with 16GB of RAM. It makes sense that more cores would
have allowed for faster and better results.

This lack of computing power and GPU resources led to unexpectedly disappointing results,
which could however be improved under the right circumstances.

HoloLens 2 Application Interface

Unfortunately, because of the artifacts, building bounding boxes around the objects to create an
interaction model led to some absurd results. We wish we had the ability to manually edit the
models upon downloading them, but that represents completely different work that would be
outside the scope of this paper.

Apart from that, the Azure SDK has recently stopped being supported on Windows and has been
completely removed from its GitHub repository, which led us to take slightly more complex
paths to get to our solution.

Conclusion
In conclusion, this research endeavor presented itself as a journey marked by significant
challenges, frequent reconsiderations, and intense periods of critical problem-solving.
Throughout the course of this project, we grappled with intricate technical obstacles,
navigated complex software and hardware ecosystems, and faced numerous moments fraught
with doubt and uncertainty. There was a juncture at which the necessity to reconceptualize
our approach led us to start anew from a blank canvas—an experience as daunting as it was
invigorating.

Yet, despite these struggles, the opportunity to work with the HoloLens 2 has been an
immensely rewarding experience. This cutting-edge technology, with its capacity to blend
digital information seamlessly into the physical world, presented us with a unique and potent
tool for exploring innovative applications in the realm of logistics.

Our work, while nascent and ripe for further refinement, has yielded tangible results. We
have successfully developed a proof-of-concept application that illustrates the potential of
utilizing HoloLens 2 in logistics operations, offering a glimpse into a future where mixed
reality technology becomes a cornerstone in the management of goods and materials. This
application, albeit a prototype, stands as a testament to the feasibility of our vision and the
resilience of our approach.

We are proud of the foundation that this work has established. It is a foundation built not on
certainties, but on questions and exploration—an edifice of knowledge that invites further
inquiry, refinement, and innovation. Our work, we hope, serves as a stepping stone for future
researchers and developers, providing both inspiration and a practical basis upon which they
can construct more polished, efficient, and expansive applications.

In reflecting on this journey, we harbor a profound sense of gratitude and satisfaction. The
struggles, while formidable, were surmounted. The doubts, while recurrent, were assuaged by
eventual clarity and progress.

In summary, this thesis represents more than a culmination of code, devices, and data—it is a
narrative of resilience, innovation, and scholarly pursuit. We ventured to demonstrate how
HoloLens 2 could be wielded as a transformative tool in logistics, and in this endeavor, we
have laid groundwork for what promises to be an exciting and impactful frontier in mixed
reality applications.

Appendices
Appendix 1: Agisoft workflow

import Metashape
import os, sys, time

print('Starting the Process')

# Checking compatibility
compatible_major_version = "1.5"
found_major_version = ".".join(Metashape.app.version.split('.')[:2])
if found_major_version != compatible_major_version:
    raise Exception("Incompatible Metashape version: {} != {}".format(found_major_version, compatible_major_version))

def find_files(folder, types):
    return [entry.path for entry in os.scandir(folder)
            if (entry.is_file() and os.path.splitext(entry.name)[1].lower() in types)]

if len(sys.argv) < 3:
    print("Usage: general_workflow.py <image_folder> <output_folder>")
    raise Exception("Invalid script arguments")

image_folder = sys.argv[1]
output_folder = sys.argv[2]
dataset = os.path.join(image_folder, 'dataset')

print('Image Folder:', dataset)

photos = find_files(dataset, [".jpg", ".jpeg", ".tif", ".tiff"])

print('Found photos:', photos)

doc = Metashape.Document()

chunk = doc.addChunk()
chunk.addPhotos(photos)

# Import masks
mask_template = os.path.join(image_folder, 'masks', 'dataset', 'masks', "{filename}.png")  # Change to your mask file type.

for camera in chunk.cameras:
    filename = os.path.splitext(os.path.basename(camera.photo.path))[0]
    path = mask_template.format(filename=filename)
    chunk.importMasks(path=path, source=Metashape.MaskSourceFile,
                      operation=Metashape.MaskOperationReplacement, cameras=[camera])

print(str(len(chunk.cameras)) + " images loaded")
print(mask_template)

# Step 1: Align photos
start_time_align = time.time()  # Start time
chunk.matchPhotos(accuracy=Metashape.HighAccuracy, generic_preselection=True,
                  reference_preselection=False, filter_mask=False, mask_tiepoints=False)
chunk.alignCameras()
end_time_align = time.time()  # End time

# Step 2: Build dense cloud
start_time_cloud = time.time()  # Start time
chunk.buildDepthMaps()
chunk.buildDenseCloud()
end_time_cloud = time.time()  # End time

# Step 3: Build mesh
start_time_mesh = time.time()  # Start time
chunk.buildModel(surface=Metashape.Arbitrary, source=Metashape.DenseCloudData,
                 interpolation=Metashape.EnabledInterpolation, volumetric_masks=True)
end_time_mesh = time.time()  # End time

# Step 4: Build texture
start_time_texture = time.time()  # Start time
chunk.buildUV()
chunk.buildTexture(blending=Metashape.MosaicBlending, size=4096, fill_holes=True)
end_time_texture = time.time()  # End time

# Step 5: Export results
if chunk.model:
    chunk.exportModel(output_folder + '/model.obj')

print('Processing finished, results saved to ' + output_folder + '.')

# Print total processing time for each step
print('Total processing time for aligning photos: ' + str(end_time_align - start_time_align) + ' seconds')
print('Total processing time for building dense cloud: ' + str(end_time_cloud - start_time_cloud) + ' seconds')
print('Total processing time for building mesh: ' + str(end_time_mesh - start_time_mesh) + ' seconds')
print('Total processing time for building texture: ' + str(end_time_texture - start_time_texture) + ' seconds')
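
The workflow above assumes that every photo in <image_folder>/dataset has a matching PNG mask under <image_folder>/masks/dataset/masks, and importMasks will fail if one is missing. The following is a minimal pre-flight sketch, not part of the thesis pipeline, that checks this layout before the workflow is launched; the argument handling mirrors the script above but is otherwise an assumption.

import os, sys

# Hypothetical pre-flight check: verify that each photo has a corresponding mask
# in the layout expected by mask_template above.
image_folder = sys.argv[1]
dataset = os.path.join(image_folder, 'dataset')
mask_dir = os.path.join(image_folder, 'masks', 'dataset', 'masks')

photos = [f for f in os.listdir(dataset)
          if os.path.splitext(f)[1].lower() in (".jpg", ".jpeg", ".tif", ".tiff")]
missing = [f for f in photos
           if not os.path.isfile(os.path.join(mask_dir, os.path.splitext(f)[0] + ".png"))]

if missing:
    print("Missing masks for:", missing)
    sys.exit(1)
print("All", len(photos), "photos have a mask.")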

Appendix 2: U-Net Model Trainer

import os
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D, concatenate

# Define U-Net model
def create_model(img_shape, num_class):
    inputs = Input(img_shape)

    # Contracting Path
    conv1 = Conv2D(64, (3, 3), activation='relu', padding='same')(inputs)
    pool1 = MaxPooling2D(pool_size=(2, 2))(conv1)

    conv2 = Conv2D(128, (3, 3), activation='relu', padding='same')(pool1)
    pool2 = MaxPooling2D(pool_size=(2, 2))(conv2)

    # Bottleneck
    conv3 = Conv2D(256, (3, 3), activation='relu', padding='same')(pool2)

    # Expanding Path
    up1 = concatenate([UpSampling2D(size=(2, 2))(conv3), conv2], axis=-1)
    conv4 = Conv2D(128, (3, 3), activation='relu', padding='same')(up1)

    up2 = concatenate([UpSampling2D(size=(2, 2))(conv4), conv1], axis=-1)
    conv5 = Conv2D(64, (3, 3), activation='relu', padding='same')(up2)

    outputs = Conv2D(num_class, (1, 1), activation='sigmoid')(conv5)

    return Model(inputs=inputs, outputs=outputs)

# Create model
model = create_model((128, 128, 3), 1)  # input image size and number of classes

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Paths to your data directories
train_images_dir = './dataset/train/trainimg'
train_masks_dir = './dataset/train_masks/masks'
valid_images_dir = './dataset/valid/validimg'
valid_masks_dir = './dataset/valid_masks/masks'

# Define image and mask size
image_size = (128, 128)
batch_size = 32

# Load your training images
train_images_dataset = tf.keras.utils.image_dataset_from_directory(
    train_images_dir,
    image_size=image_size,
    batch_size=batch_size,
    label_mode=None,  # No labels, we're only interested in the images themselves
    color_mode='rgb',
    seed=42
)

# Normalize the images to [0, 1] range
train_images_dataset = train_images_dataset.map(lambda x: x / 255.0)

# Load your training masks
train_masks_dataset = tf.keras.utils.image_dataset_from_directory(
    train_masks_dir,
    image_size=image_size,
    batch_size=batch_size,
    label_mode=None,  # No labels, we're only interested in the masks themselves
    color_mode='grayscale',
    seed=42
)

# Normalize the masks to [0, 1] range
train_masks_dataset = train_masks_dataset.map(lambda x: x / 255.0)

# Zip together the images and masks to create a dataset that yields pairs
train_dataset = tf.data.Dataset.zip((train_images_dataset, train_masks_dataset))

# Repeat for validation datasets
valid_images_dataset = tf.keras.utils.image_dataset_from_directory(
    valid_images_dir,
    image_size=image_size,
    batch_size=batch_size,
    label_mode=None,
    color_mode='rgb',
    seed=42
)
valid_images_dataset = valid_images_dataset.map(lambda x: x / 255.0)

valid_masks_dataset = tf.keras.utils.image_dataset_from_directory(
    valid_masks_dir,
    image_size=image_size,
    batch_size=batch_size,
    label_mode=None,
    color_mode='grayscale',
    seed=42
)
valid_masks_dataset = valid_masks_dataset.map(lambda x: x / 255.0)

valid_dataset = tf.data.Dataset.zip((valid_images_dataset, valid_masks_dataset))

# Now you can train your model using these datasets
model.fit(train_dataset, validation_data=valid_dataset, epochs=20)

# Save the model
model.save('./modelfinal20.h5')
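
The trainer above reports pixel accuracy, which can be optimistic for segmentation because most pixels belong to the background. As an optional refinement that is not part of the script above, a Dice coefficient metric could be tracked alongside accuracy; a minimal sketch follows, with the commented-out compile call shown only as hypothetical usage.

import tensorflow as tf

# Minimal Dice coefficient for binary masks in [0, 1]; optional, not used in the trainer above.
def dice_coefficient(y_true, y_pred, smooth=1e-6):
    y_true_f = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred_f = tf.reshape(tf.cast(y_pred, tf.float32), [-1])
    intersection = tf.reduce_sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (tf.reduce_sum(y_true_f) + tf.reduce_sum(y_pred_f) + smooth)

# Hypothetical usage with the model defined above:
# model.compile(optimizer='adam', loss='binary_crossentropy',
#               metrics=['accuracy', dice_coefficient])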

Appendix 3: JSON to Mask

import os
import json
import cv2
import numpy as np

# Directory paths
base_dir = '.'
train_dir = os.path.join(base_dir, 'dataset/train/trainimg')
val_dir = os.path.join(base_dir, 'dataset/valid/validimg')

# Directory paths for masks
train_mask_dir = os.path.join(base_dir, 'dataset/train_masks/masks')
val_mask_dir = os.path.join(base_dir, 'dataset/valid_masks/masks')

# Function to create mask from bounding box coordinates
def create_mask(annotations, image_id, image_dir):
    id_to_filename = {image['id']: image['file_name'] for image in annotations['images']}
    image_path = os.path.join(image_dir, id_to_filename[image_id])
    image = cv2.imread(image_path)
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    for annotation in annotations['annotations']:
        if annotation['image_id'] == image_id:
            x, y, w, h = map(int, annotation['bbox'])
            mask[y:y+h, x:x+w] = 1
    return mask

# Function to get image id from file name
def get_image_id(filename, annotations):
    for image in annotations['images']:
        if image['file_name'] == filename:
            return image['id']

# Create masks for all sets
for set_dir, mask_dir in zip([train_dir, val_dir], [train_mask_dir, val_mask_dir]):
    # Ensure the mask directory exists
    os.makedirs(mask_dir, exist_ok=True)

    # Load JSON annotations for the specific set
    with open(os.path.join(set_dir, '_annotations.coco.json'), 'r') as f:
        annotations = json.load(f)

    for image_name in os.listdir(set_dir):
        image_id = get_image_id(image_name, annotations)
        if image_id is not None:
            mask = create_mask(annotations, image_id, set_dir)
            mask_path = os.path.join(mask_dir, str(image_id) + '.png')
            cv2.imwrite(mask_path, cv2.resize(mask * 255, (128, 128)))

            # Rename and resize images
            original_image_path = os.path.join(set_dir, image_name)
            image = cv2.imread(original_image_path)
            image_resized = cv2.resize(image, (128, 128))

            new_image_path = os.path.join(set_dir, str(image_id) + '.png')
            os.rename(original_image_path, new_image_path)
            cv2.imwrite(new_image_path, image_resized)
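
Since the masks above are derived from COCO bounding boxes rather than pixel-accurate outlines, a quick visual check is useful. The following is a minimal sketch, not part of the conversion script, that overlays one generated mask on its resized training image; the image id and output path are assumptions.

import cv2

image_id = 0  # hypothetical example id
image = cv2.imread('./dataset/train/trainimg/{}.png'.format(image_id))
mask = cv2.imread('./dataset/train_masks/masks/{}.png'.format(image_id), cv2.IMREAD_GRAYSCALE)

overlay = image.copy()
overlay[mask > 0] = (0, 0, 255)  # highlight the box region in red
blended = cv2.addWeighted(image, 0.6, overlay, 0.4, 0)
cv2.imwrite('./overlay_{}.png'.format(image_id), blended)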

Appendix 4: Mask Generator

import tensorflow as tf
import numpy as np
from tensorflow.keras.preprocessing.image import load_img, img_to_array
import os
from PIL import Image, UnidentifiedImageError
import cv2

# Define constants
img_height, img_width = 128, 128
num_channels = 3

def remove_noise(image_path):
    img = cv2.imread(image_path, 0)
    _, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    kernel = np.ones((2, 2), np.uint8)  # smaller structuring element
    opening = cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel, iterations=1)  # fewer iterations
    sure_bg = cv2.dilate(opening, kernel, iterations=1)  # fewer iterations
    dist_transform = cv2.distanceTransform(opening, cv2.DIST_L2, 5)
    _, sure_fg = cv2.threshold(dist_transform, 0.1 * dist_transform.max(), 255, 0)  # lower threshold
    sure_fg = np.uint8(sure_fg)
    cv2.imwrite(image_path, sure_fg)

# Load the trained model
model = tf.keras.models.load_model('./modelft.h5')

# Directory of new images
new_images_dir = './dataset/test/testimg'

# Get list of all files in the directory
new_images = os.listdir(new_images_dir)

# Iterate over all the new images
for img_file in new_images:
    file_path = os.path.join(new_images_dir, img_file)

    # Check if the path is a file
    if os.path.isfile(file_path):
        # Load an image file and get its dimensions
        original_image = Image.open(file_path)
        original_size = original_image.size

        # Load an image file with its original size
        img = load_img(file_path)
    else:
        continue

    # Convert the image to numpy array
    img = img_to_array(img)
    # Scale the image pixels by 255
    img = img / 255.0
    # Resize to the size expected by the model
    img_resized = tf.image.resize(img, [img_height, img_width])
    # Reshape data for the model
    img_resized = np.reshape(img_resized, (1, img_height, img_width, num_channels))

    # Make prediction
    prediction = model.predict(img_resized)

    # Convert the prediction to match the original image's dimensions
    prediction_image = tf.image.resize(prediction[0], original_size[::-1])
    prediction_image = tf.image.convert_image_dtype(prediction_image, dtype=tf.uint8)

    # Check if directory exists
    if not os.path.exists('./dataset/test/masksft/'):
        # If directory doesn't exist, create it
        os.makedirs('./dataset/test/masksft/')

    # Define mask path
    mask_path = os.path.join('./dataset/test/masksft/', img_file)

    # Save the prediction image
    tf.keras.preprocessing.image.save_img(mask_path, prediction_image)

    # Apply noise removal to the saved mask
    # remove_noise(mask_path)
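
The generator above saves the network's soft predictions as greyscale images. If hard binary masks are preferred, for example before feeding them to the Metashape workflow of Appendix 1, a simple thresholding pass such as the following sketch can be applied; it is not part of the generator above, and the 128 threshold is an assumption.

import os
import cv2

mask_dir = './dataset/test/masksft/'  # folder written by the generator above
for name in os.listdir(mask_dir):
    path = os.path.join(mask_dir, name)
    mask = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if mask is None:
        continue  # skip anything that is not an image
    _, binary = cv2.threshold(mask, 128, 255, cv2.THRESH_BINARY)
    cv2.imwrite(path, binary)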

Appendix 5: Docker Build

# Use a Windows Server Core image
FROM mcr.microsoft.com/windows/servercore:2004

# Create temp directory
RUN cmd /C mkdir C:\\temp

# Install VCRedist
RUN powershell -Command \
    $ErrorActionPreference = 'Stop'; \
    [Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12; \
    Invoke-WebRequest -Method Get -Uri https://download.microsoft.com/download/9/3/F/93FCF1E7-E6A4-478B-96E7-D4B285925B00/vc_redist.x64.exe -OutFile C:\\temp\\vc_redist.x64.exe; \
    Start-Process C:\\temp\\vc_redist.x64.exe -ArgumentList '/install', '/quiet', '/norestart' -NoNewWindow -Wait; \
    Remove-Item -Force C:\\temp\\vc_redist.x64.exe

# Create the GatheredDlls directory
RUN powershell -Command New-Item -ItemType Directory -Force -Path C:\GatheredDlls

# Copy the dll files into the Docker image
COPY ./glu32.dll C:/GatheredDlls/glu32.dll
COPY ./opengl32.dll C:/GatheredDlls/opengl32.dll

# Retrieve the DirectX shader compiler files needed for DirectX Raytracing (DXR)
RUN powershell -Command \
    $ErrorActionPreference = 'Stop'; \
    Invoke-WebRequest -Method Get -Uri "https://github.com/microsoft/DirectXShaderCompiler/releases/download/v1.6.2104/dxc_2021_04-20.zip" -OutFile $env:TEMP\dxc.zip; \
    Expand-Archive -Path "$env:TEMP\dxc.zip" -DestinationPath $env:TEMP; \
    Copy-Item -Path $env:TEMP\bin\x64\dxcompiler.dll -Destination "C:\GatheredDlls" -Force; \
    Copy-Item -Path $env:TEMP\bin\x64\dxil.dll -Destination "C:\GatheredDlls" -Force

# Copy DLLs to the Windows system directory
RUN powershell -Command \
    Copy-Item -Path C:\GatheredDlls\* -Destination C:\Windows\System32

# Install VC Redistributable
RUN powershell -Command \
    $ErrorActionPreference = 'Stop'; \
    [Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12; \
    Invoke-WebRequest -Method Get -Uri https://download.microsoft.com/download/9/3/F/93FCF1E7-E6A4-478B-96E7-D4B285925B00/vc_redist.x64.exe -OutFile C:\\temp\\vcredist_x64.exe; \
    Start-Process C:\\temp\\vcredist_x64.exe -ArgumentList '/install /passive /norestart' -Wait; \
    Remove-Item -Force C:\\temp\\vcredist_x64.exe

# Install Python
RUN powershell -Command \
    $ErrorActionPreference = 'Stop'; \
    [Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12; \
    Invoke-WebRequest -Method Get -Uri https://www.python.org/ftp/python/3.11.2/python-3.11.2-amd64.exe -OutFile C:\\temp\\python-installer.exe; \
    Start-Process C:\\temp\\python-installer.exe -ArgumentList '/quiet InstallAllUsers=1 PrependPath=1' -Wait; \
    Remove-Item -Force C:\\temp\\python-installer.exe

# Install Azure CLI
RUN powershell -Command \
    $ErrorActionPreference = 'Stop'; \
    [Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12; \
    Invoke-WebRequest -Method Get -Uri https://aka.ms/installazurecliwindows -OutFile C:\\temp\\AzureCLI.msi; \
    Start-Process C:\\temp\\AzureCLI.msi -ArgumentList '/quiet' -Wait; \
    Remove-Item -Force C:\\temp\\AzureCLI.msi

# Copy the installer file into the Docker image
COPY ./setup.msi C:\\temp\\setup.msi

# Install the application
RUN msiexec /i C:\\temp\\setup.msi /quiet /qn

# Remove the installer file
RUN powershell -Command Remove-Item -Force C:\\temp\\setup.msi

# Copy the .whl package
COPY ./Metashape-2.0.2-cp37.cp38.cp39.cp310.cp311-none-win_amd64.whl C:\\temp\\Metashape-2.0.2-cp37.cp38.cp39.cp310.cp311-none-win_amd64.whl

# Install the .whl package
RUN python -m pip install C:\\temp\\Metashape-2.0.2-cp37.cp38.cp39.cp310.cp311-none-win_amd64.whl

# Remove the .whl package
RUN powershell -Command Remove-Item -Force C:\\temp\\Metashape-2.0.2-cp37.cp38.cp39.cp310.cp311-none-win_amd64.whl

# Copy your Python script into the Docker image
COPY ./workflow.py C:\\app\\workflow.py

# Copy your input data into the Docker image
COPY ./input C:\\app\\input

# Copy your output folder into the Docker image
COPY ./output C:\\app\\output

# Install Azure SDK for communication with Blob Storage
RUN pip install azure-storage-blob

# Make port 80 available to the world outside this container
EXPOSE 80

# Setup Azure access
ENV AZURE_STORAGE_ACCOUNT=3dptv001
ENV AZURE_STORAGE_KEY=fs63cNDBG5qxlio4DoeQrUuPxLE9G491OALt1HuqPWznYhoC3KOxDfjQT9rjGgFVJ3OFL0EyZHou+ASt/xRRSA==

# Copy entry.bat into the Docker image
COPY ./entry.bat C:\\app\\entry.bat

# Set the working directory
WORKDIR /app

# Set the entrypoint to call entry.bat
ENTRYPOINT ["C:\\app\\entry.bat"]

Appendix 6: Entry.bat

REM Download from Azure Blob
call az storage blob download-batch -d C:\app\input\ -s input-container --pattern *.jpg --account-name 3dptv001 --account-key fs63cNDBG5qxlio4DoeQrUuPxLE9G491OALt1HuqPWznYhoC3KOxDfjQT9rjGgFVJ3OFL0EyZHou+ASt/xRRSA==
call az storage blob download-batch -d C:\app\input\masks -s input-container --pattern *.png --account-key fs63cNDBG5qxlio4DoeQrUuPxLE9G491OALt1HuqPWznYhoC3KOxDfjQT9rjGgFVJ3OFL0EyZHou+ASt/xRRSA==

REM Call Metashape processing
"C:\Program Files\Agisoft\Metashape Pro\metashape.exe" --opengl angle -r C:\app\workflow.py C:\app\input\ C:\app\output

REM Delete the jpg and png files from Azure Blob after successful upload
call az storage blob delete-batch -s input-container --pattern *.jpg --account-name 3dptv001 --account-key fs63cNDBG5qxlio4DoeQrUuPxLE9G491OALt1HuqPWznYhoC3KOxDfjQT9rjGgFVJ3OFL0EyZHou+ASt/xRRSA==
call az storage blob delete-batch -s input-container --pattern *.png --account-name 3dptv001 --account-key fs63cNDBG5qxlio4DoeQrUuPxLE9G491OALt1HuqPWznYhoC3KOxDfjQT9rjGgFVJ3OFL0EyZHou+ASt/xRRSA==
call az storage blob delete-batch -s input-container --pattern completed.txt --account-name 3dptv001 --account-key fs63cNDBG5qxlio4DoeQrUuPxLE9G491OALt1HuqPWznYhoC3KOxDfjQT9rjGgFVJ3OFL0EyZHou+ASt/xRRSA==

REM Upload results to Azure Blob
az storage blob upload-batch -s C:\app\output -d output --account-name 3dptv001 --account-key fs63cNDBG5qxlio4DoeQrUuPxLE9G491OALt1HuqPWznYhoC3KOxDfjQT9rjGgFVJ3OFL0EyZHou+ASt/xRRSA== --overwrite true
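
The Dockerfile in Appendix 5 installs azure-storage-blob and exposes the storage account name and key as environment variables, so the upload step above could also be done from Python with the SDK instead of the az CLI. The following is a minimal sketch under those assumptions; the local model path and blob name mirror the workflow output but are otherwise hypothetical.

import os
from azure.storage.blob import BlobServiceClient

# Account name and key come from the ENV variables set in the Dockerfile (Appendix 5).
account = os.environ["AZURE_STORAGE_ACCOUNT"]
key = os.environ["AZURE_STORAGE_KEY"]
service = BlobServiceClient(account_url="https://{}.blob.core.windows.net".format(account),
                            credential=key)

container = service.get_container_client("output")  # same target container as the az upload above
local_model = r"C:\app\output\model.obj"             # produced by workflow.py (Appendix 1)

with open(local_model, "rb") as data:
    container.upload_blob(name="model.obj", data=data, overwrite=True)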

Appendix 7: Azure Functions Triggers

import logging
import azure.functions as func
from azure.storage.blob import BlobServiceClient
import os

def main(myblob: func.InputStream):
    logging.info(f"Python blob trigger function processed blob \n"
                 f"Name: {myblob.name}\n"
                 f"Blob Size: {myblob.length} bytes")

    # Get the connection string from the "AzureWebJobsStorage" setting in local.settings.json
    connection_string = os.getenv("AzureWebJobsStorage")

    # Use the connection string to connect to Blob Storage
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)

    # Get the blob container client
    container_client = blob_service_client.get_container_client("input-container")

    # Count the blobs in the specified directory
    blob_count = len(list(container_client.list_blobs(name_starts_with="unprocessed_dataset")))

    logging.info(f"Count: {blob_count}\n")

    # If blob count is more than 30, create the startmasking.txt file if it doesn't exist
    if blob_count > 30:
        logging.info('Valid Condition')

        # Check if the file startmasking.txt already exists in the container
        blob_client = container_client.get_blob_client("startmasking.txt")

        if not blob_client.exists():
            # If it doesn't exist, create an empty startmasking.txt file
            blob_client.upload_blob(b'', overwrite=True)
            logging.info('startmasking.txt created successfully')
        else:
            logging.info('startmasking.txt already exists')


# Second blob-trigger function: starts the Azure Container Instance.
import logging
import azure.functions as func
from azure.mgmt.containerinstance import ContainerInstanceManagementClient
from azure.identity import DefaultAzureCredential
import os

def main(myblob: func.InputStream):
    logging.info(f"Python blob trigger function processed blob \n"
                 f"Name: {myblob.name}\n"
                 f"Blob Size: {myblob.length} bytes")

    resource_group_name = "3dptv001"  # replace with your actual resource group
    container_group_name = "azureuser"

    # Use Azure Identity's DefaultAzureCredential for authentication
    credential = DefaultAzureCredential()

    # Retrieve Subscription ID from Environment Variables
    subscription_id = os.getenv('SubID')

    # Create container instance management client
    container_client = ContainerInstanceManagementClient(credential, subscription_id)

    print(credential)

    # Start the container instance
    container_client.container_groups.begin_start(resource_group_name, container_group_name)

    logging.info(f"Started container instance {container_group_name} in resource group {resource_group_name}")
