
ISSUE 21 | 04.2024

DEEP DIVE

Enhanced Home Team Capabilities


through Multi-Modal Gen AI
TECH BYTES

Single-modal vs Multi-modal Gen AI

BACK PAGE Sneak Peek: Milipol Asia-Pacific TechX Summit 2024


ISSUE 21 | 04.2024

CONTENTS

03  EDITOR’S NOTE

04  DEEP DIVE
    Enhanced Home Team Capabilities through Multi-Modal Gen AI

09  TECH BYTES
    • Unmasking the Threat of Audio Deepfakes through Detection and Verification
    • HTX-eXaminer Raises the Bar in Parcels Screening
    • Harnessing Gen AI: Single-modal vs Multi-modal Gen AI
    • ChatGPT with Eyes?! The Rise of Multi-modal Large Language Models

17  BACK PAGE

TECHXPLAIN COMMITTEE

Core Editorial Team
Ng Gee Wah (Q Team) (Collaborator-in-Chief)
Lau Yan Ling (Q Team) (Secretary)

Tech Bytes Group
Oh Hue Kian (CBRNE)
Joe-Lin Tan (Forensics)
Joseph Kp Ng (Forensics)
Clement Low (ICPMC)

Deep Dive Magazine Group
Ho Choong Chuin (RAUS)
Hud Syafiq Herman (EG)
Ng Chee Wah (Ops Sys)
Clarice Lee (Forensics)
Alfred See (Q Team)
Sean Lim (RAUS)

Publication Team
Li Xin Hui (Strat Comms)
Jacqueline Woo (Strat Comms)

TECHXPLAIN | ISSUE 21 | 04.2024 | p.2
EDITOR’S NOTE

Dear readers,

Get ready to delve into the fascinating intersection of artificial intelligence (AI) and homeland security in our latest edition. As we gear up for the upcoming event, Milipol Asia Pacific - TechX Summit 2024, where cutting-edge innovations in AI-driven solutions promise to redefine the landscape of security, our magazine offers a glimpse into the groundbreaking advancements shaping the future.

In this special edition of TechXplain, we embark on a journey through the advancement of AI for homeland security. We begin by exploring AI-powered models used against the increasing threat of audio deepfakes and the critical need for detection and verification mechanisms. Through ongoing research and technological advancements, we strive to stay ahead of malicious actors, bolstering our defences against emerging threats.

Next, we look at the remarkable developments in parcel screening technology with the introduction of HTX-eXaminer. Leveraging the power of AI, this innovative model serves as a second pair of eyes during screening, empowering officers to swiftly and accurately detect security threats and contraband items. With the increasing volume of inbound parcels, the HTX-eXaminer ensures border safety remains paramount, enhancing screening efficiency and effectiveness.

Furthermore, we explore the emerging technologies of Single-modal and Multi-modal Generative AI (Gen AI), ranging from specialised capabilities, such as processing text-based content in Large Language Models (LLMs), to more generalised models that integrate multiple modalities such as images, videos, and audio. Read more about HTX’s in-house, on-premise Multi-modal Gen AI solution known as QCaption.

But the spotlight truly shines on our deep dive article, where we unravel the complexities of Multi-modal Gen AI and its landscape. This revolutionary approach paves the way for unprecedented capabilities and possibilities, enhancing our daily tasks and aiding Home Team officers in investigation and analysis work. From NVIDIA’s groundbreaking hardware innovations to the intricate architectures of Multi-modal Gen AI models, we witness the convergence of hardware technology and software intelligence, shaping the future of AI-driven solutions in homeland security and beyond.

As we navigate the ever-evolving landscape of AI, let us remain vigilant, innovative, and collaborative in our pursuit of a safer and more secure future.

Regards,
Gee Wah

TECHXPLAIN | ISSUE 21 | 04.2024 | p.3


DEEP DIVE | MULTI-MODAL GEN AI

DEEP DIVE

Enhanced Home Team Capabilities through Multi-Modal Gen AI

Source: Shutterstock

How Multi-modal Gen AI Amplifies Human Intelligence

On 18 March 2024, NVIDIA CEO Jensen Huang announced the Blackwell B200 GPU, a revolutionary AI
computing chip poised to completely transform the Multi-modal Generative Artificial Intelligence (Gen
AI) landscape. It can run Gen AI models 30x faster than its predecessor and, according to Huang, is
designed to facilitate the widespread use of trillion-parameter multi-modal models (e.g., OpenAI’s GPT-4
with 1.8 trillion parameters), a feat once deemed impossible for most.

But what exactly is Multi-modal Gen AI? Multi-modal Gen AI, which involves multiple modalities for input
and output, is an emerging field in AI. There are two main approaches to designing Multi-modal Gen AI
architectures: the Unified and the Mixture-of-Experts (MoE) architectures, as depicted in Figure 1.

Figure 1: Unified vs MoE architectures (The above diagram is modified from Google Gemini)

TECHXPLAIN | ISSUE 21 | 04.2024 | p.4


DEEP DIVE | MULTI-MODAL GEN AI

The Unified Architecture, currently used in research and development, is trained with datasets in different
modalities without pairing them, making it a truly large and generalized model. In contrast, the MoE
architecture, used in Google’s Gemini series and OpenAI’s GPT-4, is trained with datasets in pairwise
modalities, meaning specialised dataset pairs are trained and then integrated into specific expert models,
resulting in “one large, generalized model”.

It’s important to note that despite their architectural differences, both frameworks aim to deliver the same capability: handling complex tasks effectively. The critical distinction lies in their design architecture, which affects factors such as ease of implementation, robustness, simplicity of data handling, and scalability.

Figure 2 gives an example of a Multi-modal Gen AI MoE architecture in which the various expert models are integrated through a pipeline approach; a minimal code sketch of this idea follows the figure.

Figure 2: Example of a Multi-modal Gen AI MoE Architecture
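To make the pipeline idea concrete, here is a minimal Python sketch of a pipeline-style MoE system. It is purely illustrative: the expert functions are stubs standing in for real speech-to-text, image-captioning, and language models, and it does not reflect any specific product's implementation.

```python
# Minimal sketch of a pipeline-style Mixture-of-Experts (MoE) multi-modal system.
# Each "expert" below is a stand-in stub; a real system would call dedicated
# models (e.g., a speech-to-text model, an image captioner, an LLM).

def transcribe_audio(audio_path: str) -> str:
    """Stub audio expert: would run a speech-to-text model."""
    return f"[transcript of {audio_path}]"

def caption_image(image_path: str) -> str:
    """Stub vision expert: would run an image-captioning model."""
    return f"[caption of {image_path}]"

def reason_over_text(prompt: str) -> str:
    """Stub language expert: would call a large language model."""
    return f"[LLM answer to: {prompt}]"

def multimodal_query(question: str, image_path: str = None, audio_path: str = None) -> str:
    """Route each modality to its expert, then fuse the outputs with the text expert."""
    context_parts = []
    if image_path:
        context_parts.append("Image description: " + caption_image(image_path))
    if audio_path:
        context_parts.append("Audio transcript: " + transcribe_audio(audio_path))
    fused_prompt = "\n".join(context_parts + ["Question: " + question])
    return reason_over_text(fused_prompt)

if __name__ == "__main__":
    print(multimodal_query("What is happening in this scene?",
                           image_path="scene.jpg", audio_path="scene.wav"))
```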

Multi-modal Gen AI Landscape


The competition among industry leaders in Multi-modal Gen AI is fast and furious, as evidenced by the rapid development of cutting-edge technologies in both hardware and software. It began with the launch of OpenAI’s GPT-4 in March 2023, quickly followed by Microsoft’s LLaVA-13B, a large vision-language model released in October 2023, Google’s Gemini series in December 2023, and Anthropic’s Claude 3 in early 2024. In February 2024, OpenAI introduced SORA, specialising in text-to-video generation, marking another milestone in this vibrant landscape. And on 17 March 2024, xAI released Grok-1, a 314-billion-parameter MoE model. With each release, we anticipate more advanced capabilities as the competition among tech giants intensifies. It is therefore imperative for the Home Team to be Gen AI-ready in defence of our
nation’s homeland security.

Figure 3: Example of a Multi-modal Gen AI Landscape

TECHXPLAIN | ISSUE 21 | 04.2024 | p.5


DEEP DIVE | MULTI-MODAL GEN AI

How Can Multi-modal Gen AI Enhance the Home Team?

It is our belief that Multi-modal Gen AI will be a hugely transformative force that will greatly enhance the Home Team. As captured in Figure 4, the inner circle highlights the capabilities that Multi-modal Gen AI can bring to the table, from Robot Control to Mix-Modality Search. Given this extensive possible skill set, we see a future where Multi-modal Gen AI will eventually be integrated into a number of potential Home Team domain applications, as outlined in the outer ring of Figure 4.

Additionally, we foresee Multi-modal Gen AI playing a pivotal role in amplifying human capabilities. Its benefits will span individual, organizational, and societal levels, enhancing effectiveness and productivity and fostering innovation. Table 1 below illustrates in greater detail how Multi-modal Gen AI can impact us across these varying scales.

Figure 4: Multi-modal Gen AI Potential Applications for Home Team

Amplification of Intelligence through Multi-modal Gen AI

Individual
Description: Multi-modal Gen AI serves as a personal assistant, augmenting the individual’s abilities and facilitating tasks.
Examples: AI assistants can significantly streamline how individuals manage their daily communications and workflows. By analyzing previous interactions, these Gen AI systems could potentially send calendar invites or book air tickets based on a single text prompt (a minimal sketch of this idea appears after Table 1). There has also been research into employing Gen AI to create PowerPoint slides, analyse Excel spreadsheets, or create multimedia (i.e. image and text) PDFs from scratch.

Organisation
Description: Multi-modal Gen AI automates routine tasks to improve productivity, freeing up humans for creativity and strategic planning.
Examples: Tools like HTX’s QCaption and Spark are transforming HTX’s and the HTDs’ processes. QCaption (more info in TechByte #73) enhances scene analysis in images and videos, while Spark optimizes resume screening by aligning candidates to job descriptions and competencies. Together, these AI solutions exemplify how technology can transform traditional tasks and optimize efficiency while empowering human innovation and strategic planning.

Society
Description: Multi-modal Gen AI enhances public service delivery and safety, and fosters innovation and societal problem-solving, thereby enhancing citizens’ quality of life.
Examples: Public service chatbots offer 24/7 assistance to citizens, answering queries, processing requests, and providing information with minimal human intervention. One such example is the EPC chatbot developed by HTX, which aims to improve online police reporting by having the chatbot ask for details that were missed out in the initial report and generate a concise summarized report.

Table 1: Scale of amplification of intelligence using Multi-modal Gen AI
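As a loose illustration of how a single text prompt could drive an assistant action such as sending a calendar invite, the sketch below uses an LLM function-calling interface via the OpenAI Python SDK. The send_calendar_invite helper, the tool schema, and the model name are hypothetical placeholders and not an HTX system.

```python
# pip install openai
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def send_calendar_invite(title: str, attendee: str, start_time: str) -> str:
    """Hypothetical helper; a real assistant would call a calendar API here."""
    return f"Invite '{title}' sent to {attendee} for {start_time}"

tools = [{
    "type": "function",
    "function": {
        "name": "send_calendar_invite",
        "description": "Send a calendar invite to one attendee.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "attendee": {"type": "string"},
                "start_time": {"type": "string", "description": "ISO-8601 date-time"},
            },
            "required": ["title", "attendee", "start_time"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4-turbo",  # illustrative model name
    messages=[{"role": "user", "content": "Set up a 30-minute sync with Alice tomorrow at 3pm."}],
    tools=tools,
)

# If the model decides a tool is needed, execute it with the arguments it proposed.
for call in response.choices[0].message.tool_calls or []:
    args = json.loads(call.function.arguments)
    print(send_calendar_invite(**args))
```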

TECHXPLAIN | ISSUE 21 | 04.2024 | p.6


DEEP DIVE | MULTI-MODAL GEN AI

As we navigate the evolving landscape of technology, the groundbreaking capabilities of Multi-modal Gen AI not only promise progress but also raise concerns about potential misuse. Citizens encounter scammers employing this technology to craft convincing deepfake videos for deception, while organisations grapple with malicious actors exploiting vulnerabilities, putting sensitive data at risk. Furthermore, society faces the looming threat of fake news campaigns disseminating misinformation via AI, which undermines public trust.

Interestingly, Gen AI itself could serve as a tool to combat its misuse. We can leverage Multi-modal Gen AI’s
ability to analyse diverse media types by integrating text, image, and audio-visual data analysis. This
approach could potentially detect synthetic media, revealing subtle inconsistencies often overlooked when
examining a single media type. Through adversarial training, where one part of the system generates fakes
and another detects them, detection capabilities continually improve. Additionally, Multi-modal Gen AI
could proactively scan the web to identify potential abuses and educate users to recognise fake content.
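A minimal sketch of that adversarial set-up, assuming PyTorch and toy feature vectors in place of real media, might look like the following; the architectures, shapes, and training schedule are illustrative only.

```python
import torch
import torch.nn as nn

# Minimal sketch of adversarial training: a generator learns to produce synthetic
# feature vectors while a detector learns to tell them apart from real ones.

latent_dim, feature_dim = 16, 64
generator = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, feature_dim))
detector = nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU(), nn.Linear(128, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(detector.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def real_batch(batch_size: int = 32) -> torch.Tensor:
    # Stand-in for features extracted from genuine media samples.
    return torch.randn(batch_size, feature_dim)

for step in range(1000):
    real = real_batch()
    fake = generator(torch.randn(real.size(0), latent_dim))

    # Detector update: label real media 1, generated media 0.
    d_loss = (bce(detector(real), torch.ones(real.size(0), 1))
              + bce(detector(fake.detach()), torch.zeros(real.size(0), 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator update: try to make the detector score fakes as real, which in
    # turn pushes the detector to learn sharper cues in later steps.
    g_loss = bce(detector(fake), torch.ones(real.size(0), 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```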

To ensure the ethical and effective use of AI, a multidisciplinary approach and industry-wide collaboration
are essential. We must prioritize establishing standards for responsible AI use and enhancing the integrity
of digital information.

Challenges of Multi-modal Gen AI

While Multi-modal Gen AI has closed the gap with human intelligence, there are still challenging aspects of intelligence that Multi-modal Gen AI needs to master. These include but are not limited to the abilities to:

1. Use resources efficiently and effectively
2. Perform abstract, spatial, and cognitive reasoning
3. Retain and retrieve memories in general over time
4. Formulate complex plans
5. Understand the physical environment

Optimize Compute Resources
Description: The capacity to use resources dynamically based on the complexity of tasks. For example, the human brain can allocate more brainpower to complex problems and effortlessly handle straightforward ones.
Research Directions: Research is currently focused on overcoming the “linearity” limitation observed in Multi-modal Gen AIs. This limitation hinders the AI’s ability to allocate resources intelligently and prioritize complex problems that need more computing power. Researchers are actively working to improve the efficiency of Multi-modal Gen AI, for example by building smarter software such as quantized tiny models and using hardware acceleration to allocate resources.

Abstract, Spatial and Cognitive Reasoning
Description: The ability to perform abstract, spatial, and cognitive reasoning is an area where current Multi-modal Gen AI can do better.
Research Directions: There are several lines of research trying to enhance Multi-modal Gen AI’s reasoning capability, such as:
1. Chain-of-Thought Processing
2. Dual Loop Learning
3. Self-Reflection
4. Smart Hypothesis Generation
5. Quiet-STaR research

Retain and Retrieve Memories Over Time
Description: The capability to remember past experiences and retrieve context to answer current questions with coherence to past contextual awareness.
Research Directions: Some solutions have been implemented, such as a long-term memory service via embedding and storage in a vector search solution for real-time retrieval of historical context (a minimal sketch of this approach appears after Table 2). Other improvements include long context windows and vector database retrieval.

Table 2: Advancing Multi-modal Gen AI’s Intelligence

TECHXPLAIN | ISSUE 21 | 04.2024 | p.7


DEEP DIVE | MULTI-MODAL GEN AI

Complex Planning
Description: The ability to formulate strategies and process plans to achieve goals is a skill that is still lacking in Multi-modal Gen AI.
Research Directions: Researchers are actively working in this space, and approaches being explored include:
1. Agent-based approaches
2. OpenAI’s Q* (pronounced Q-Star), possibly a form of self-play or akin to “human thought” processes in complex problem solving and higher-level planning
3. Graph network architectures

Understand the Physical Environment
Description: Understanding the physical environment entails the capacity to interpret and respond to real-world surroundings in three-dimensional space.
Research Directions: Researchers are using contextual datasets to enable better interpretation and responses in three-dimensional space. In parallel, there are other lines of research such as:
1. Unified Embedding Models
2. Embodied AI
3. Agent (robotics) based reinforcement learning

Table 2: Advancing Multi-modal Gen AI’s Intelligence (continued)
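Below is a minimal sketch of the embedding-plus-vector-search memory idea mentioned in Table 2, assuming the open-source sentence-transformers library. The stored snippets are invented examples, and a production system would use a dedicated vector database rather than an in-memory list.

```python
# pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

# A small public embedding model; any text-embedding model could be swapped in.
model = SentenceTransformer("all-MiniLM-L6-v2")

class MemoryStore:
    """Toy long-term memory: store embedded snippets, retrieve the most similar ones."""

    def __init__(self):
        self.texts: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, text: str) -> None:
        self.texts.append(text)
        self.vectors.append(model.encode(text, normalize_embeddings=True))

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        q = model.encode(query, normalize_embeddings=True)
        scores = np.array([v @ q for v in self.vectors])  # cosine similarity (unit vectors)
        return [self.texts[i] for i in scores.argsort()[::-1][:k]]

memory = MemoryStore()
memory.add("Case 123: the suspect vehicle is a blue van.")          # invented example
memory.add("The user prefers reports summarised as bullet points.")  # invented example
memory.add("Parcel screening thresholds were discussed on 2 April.")

# The retrieved snippets would be prepended to the Gen AI prompt as historical context.
print(memory.retrieve("Which vehicle was involved in case 123?"))
```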

What’s Next?
Looking ahead, we find ourselves in an era brimming with advancements in Generative AI. With each
week unveiling new Large Language Models (LLMs), Multi-modal models, and platforms, the pace of
innovation is relentless. Yet, amidst this flurry of activity, the question of what the future holds for Multi-
modal Gen AI remains pressing.

At the recent GPU Technology Conference (GTC), OpenAI offered insights that suggest a pivotal shift in
the trajectory of Generative AI development. They proposed that we might be approaching a saturation
point in the training of LLMs, given the extensive use of virtually all available human-written text. This
juncture marks a transition towards the “post-training” phase, emphasizing fine-tuning, Retrieval-
Augmented Generation (RAG), and the integration of specialized domain knowledge as the next frontier
in Generative AI evolution. This shift from broad learning to a focus on refinement and specialization
opens new pathways and potentials for Generative AI.

But with these developments, one can’t help but wonder: What innovative applications will emerge from
this focus on post-training enhancement, and how will they further transform our interaction with
technology? Given the outsized social impact of each iteration in the evolution of Generative AI, it is
important for the Home Team to continue to monitor developments in this space.

Ng Gee Wah, Director, Q Team CoE
Gee Wah enjoys thinking aloud about challenging problems and wrote his first book on Neural Networks in 1996.

Wang Jiale, Engineer, Q Team CoE
Jiale enjoys backpacking, hiking, and watching planes at Changi Beach – anything to get away from the computer!

Deryl Chua, Engineer, Q Team CoE
Deryl is a food enthusiast who also enjoys exploring the latest in technology.

TECHXPLAIN | ISSUE 21 | 04.2024 | p.8


TECH
Unmasking the Threat of
Audio Deepfakes Through
Detection and Verification
Imagine picking up a phone call from an unfamiliar number one day to hear the distressed
voice of your child in tears and pleading for help. The phone is then abruptly snatched away,
replaced by a gruff kidnapper demanding a ransom in Bitcoin. Thankfully, you retain enough
presence of mind, and a quick call later you realise you were almost scammed by a deepfake.
This hypothetical scenario is already possible with existing audio deepfake technology today,
which utilises artificial intelligence (AI) and machine learning (ML) to manipulate or generate
synthetic audio recordings that mimic the voices of real people. Two main techniques are
commonly used: Voice conversion, where a synthetic voice is dubbed over a target’s voice
snippet, and text-to-speech, which creates the target’s voice from text inputs.

The good and the bad


Like any emerging technology, the techniques for
mimicking one’s voice can be used for either good or
bad. In the entertainment industry, it has been employed
to preserve the voices of retired or deceased actors1.
It can also positively impact quality of life, enabling
visually impaired individuals to access written materials
or allow patients suffering from illnesses such as throat
cancer to speak again with their own voice2. Val Kilmer,
who triumphed over throat cancer and experienced the
advantages of audio deepfake technology, expressed his
gratitude for being able to once again lend his voice to
the screen in Top Gun 2 and commented that “the chance
to narrate my story, in a voice that feels authentic and
familiar, is an incredibly special gift”.
Figure 1: Val Kilmer, throat cancer survivor and actor from Top Gun 1 & 2

On the other hand, the use of deepfakes for scam calls has become prevalent in parts of the world3 4. Scammers have exploited audio deepfake technology to increase the believability of scam calls, and victims have fallen prey to realistic voice reproductions of loved ones in distress5. Donna Letto, a victim of a deceptive deepfake scam call, shared with CBC News that even though her son’s voice seemed somewhat different, she wasted no time in taking action. This underscores the remarkable realism and persuasiveness exhibited by deepfake scam calls.

Figure 2: Donna Letto, victim of an audio deepfake scam call (CBC News)

Audio deepfakes can also be used to spread misinformation, orchestrate targeted attacks, and sway public opinion. For instance, a deepfake video of Ukraine President Volodymyr Zelenskyy announcing the country’s surrender was circulated in the opening weeks of the war6.

Detecting audio deepfakes


How can we detect audio deepfakes? Detection might be possible using deep neural network models trained on a large dataset of labelled audio to learn and identify artefacts that may indicate a synthetically generated clip. During inference, the detector transforms the raw audio input into either an image representing a spectrogram or a feature vector.

TECHXPLAIN | ISSUE 21 | 04.2024 | p.9


TECH
Unmasking the Threat of
Audio Deepfakes Through
Detection and Verification (Cont’d)
This is then fed into the audio deepfake detection model, where the model looks for features indicative of spoofed speech. Subsequently, the classifier module uses the feature representation generated by the deepfake detection model to determine whether the input audio is real or spoofed (a minimal sketch of such a pipeline follows Figure 4).

Figure 3: Typical audio deepfake detector

Where technological assistance is not available, tell-tale signs of an audio deepfake include robotic-sounding speech, pitch-perfect pronunciation, and the absence of filler words. However, more advanced audio deepfake generators may not have these indicators. It is therefore vital to use alternative communication channels to confirm the authenticity of calls.

Figure 4: Flowchart for a prototype Audio Deepfake Detector
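For illustration, a bare-bones version of such a spectrogram-based detector could be sketched as follows, assuming librosa for the log-mel transform and an untrained toy CNN in place of a properly trained model; the audio file name is a placeholder.

```python
# pip install librosa torch numpy
import librosa
import numpy as np
import torch
import torch.nn as nn

def audio_to_logmel(path: str, sr: int = 16000, n_mels: int = 64) -> torch.Tensor:
    """Load audio and convert it to a log-mel spectrogram 'image'."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel, ref=np.max)
    return torch.tensor(logmel, dtype=torch.float32).unsqueeze(0).unsqueeze(0)  # (1, 1, mels, frames)

class SpoofClassifier(nn.Module):
    """Tiny CNN: feature extractor followed by a real/spoof classifier head (untrained toy)."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d((8, 8)))
        self.classifier = nn.Linear(16 * 8 * 8, 2)  # logits for [real, spoofed]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

model = SpoofClassifier().eval()
with torch.no_grad():
    logits = model(audio_to_logmel("suspect_call.wav"))  # placeholder file path
    verdict = ["real", "spoofed"][int(logits.argmax())]
print("Prediction:", verdict)
```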
What is HTX doing?
Audio deepfake detection is an ongoing cat-and-mouse game, requiring continual research, dataset enhancement, and staying up to date on the latest deepfake detection techniques. During the TechXplore 5 exhibition in May, the Sense-Making and Surveillance Centre of Expertise (S&S CoE) showcased an Audio Deepfake Detector, which includes a deepfake classifier module and a speaker verification module. The deepfake classifier module uses the Graph Attention Network (GAT) method to determine the authenticity of audio clips, while the speaker verification module compares the voice’s unique characteristics with a real recording.
By employing audio deepfake detection technology, law enforcement agencies could enhance their capacity to carry out thorough post-incident investigations and discourage the occurrence of future deepfake scam calls. It is conceivable that in the coming years, we might even have the capability to implement deepfake detection directly on mobile phones!
Stay tuned for more news on deepfake detection from S&S CoE as we work on developing our video/image deepfake detection capabilities!

Source:
1. James Earl Jones is hanging up his cape as Darth Vader. Cable News Network (CNN), 26 Sep 2022. https://edition.cnn.com/2022/09/26/entertainment/james-earl-jones-darth-vader-retiring-cec/
2. How AI is restoring voices damaged by ALS using voice banking. The Washington Post, 20 Apr 2023. https://www.washingtonpost.com/wellness/interactive/2023/voice-banking-artificial-intelligence/
3. China scammer uses AI to impersonate victim’s friend, steal $823,000. The Straits Times, 24 May 2023. https://www.straitstimes.com/asia/east-asia/china-scammer-uses-ai-to-impersonate-his-victim-s-friend-steal-820000
4. Scammers use AI to enhance their family emergency schemes. Federal Trade Commission, 20 Mar 2023. https://consumer.ftc.gov/consumer-alerts/2023/03/scammers-use-ai-enhance-their-family-emergency-schemes
5. N.L. family warns others not to fall victim to the same deepfake phone scam that cost them $10K. Canadian Broadcasting Corporation, 29 Mar 2023. https://www.cbc.ca/news/canada/newfoundland-labrador/deepfake-phone-scame-1.6793296
6. Deepfake video of Zelenskyy could be ‘tip of the iceberg’ in info war, experts warn. National Public Radio, 16 Mar 2022. https://www.npr.org/2022/03/16/1087062648/deepfake-video-zelenskyy-experts-war-manipulation-ukraine-russia

Tan Jian Hong, Engineer, Sense-making & Surveillance COE


Jian Hong is a feline enthusiast with a heart full of love for cats. He also seeks thrilling adventures around
the globe, exploring new cultures and savouring the beauty of our diverse world.

TECHXPLAIN | ISSUE 21 | 04.2024 | p.10


TECH
HTX-eXaminer Raises the Bar
in Parcels Screening
Most of us will be familiar with e-commerce platforms such as Lazada, Shopee, Taobao and
Amazon, on which we are able to order a wide variety of products from all over the world.
Before products that we’ve ordered are delivered to our doorstep, they must first go through
an important step.
All inbound parcels from overseas have to be screened through an X-ray machine by Immigration & Checkpoints Authority (ICA) officers, who visually inspect the X-ray images for security threats and contrabands. The majority of this screening is done at the Airmail Transit Centre (ATC) located at Changi Airfreight Centre.

How does X-ray security screening work?


X-rays are a form of electromagnetic radiation that
can pass through most materials and are used to
produce an image of the parcel contents without
the need for opening the package. Materials such as
metals will absorb more X-ray energy as compared
to clothes and books, thereby appearing darker on
the black-and-white X-ray image. How dark or light
something appears depends on the material’s X-ray
absorption and scattering properties.
ICA officers use a dual-view, dual-energy X-ray machine. A dual-view machine produces two views of the parcel (usually the top and side). This helps officers better assess the parcel’s contents. Meanwhile, a dual-energy machine produces a colourised image—for example, organic materials appear orange while dense materials appear blue—so officers can better identify potential threats.

Figure 1: An ICA officer reviewing the X-ray images of inbound parcels at ATC1

HTX-eXaminer elevates security screening


With the increasing popularity of online shopping, there is also an increasing volume of inbound parcels that must be screened. Leveraging the potential of Artificial Intelligence (AI), the Chemical, Biological, Radiological, Nuclear, and Explosives (CBRNE) team has developed the HTX-eXaminer, an AI model that can identify both security threats, such as firearms, swords and knives, slingshots and nunchakus, and contrabands, such as cigars and electronic cigarettes.

The HTX-eXaminer is a second pair of eyes during screening, identifying suspicious parcels by marking out the location of potential threats and contrabands on the X-ray image and indicating the type of threat or contraband detected.

Using the HTX-eXaminer, officers performing screening are empowered to detect threats quickly and accurately even with the increasing volume of parcels, thereby maintaining border safety.

Figure 2: Sample illustration of AI detection: Both the firearm and slingshot in the top-view X-ray image (upper panel) are detected by the HTX-eXaminer. There are no detections on the side-view X-ray image (lower panel) as the shapes of the firearm and slingshot are distorted by the other items in the parcel.2 Colour scheme: Blue - Metals; Green - Inorganic; Orange - Organic

TECHXPLAIN | ISSUE 21 | 04.2024 | p.11


TECH
HTX-eXaminer Raises the Bar
in Parcels Screening (Cont’d)
How was the HTX-eXaminer developed?
A deep learning object detection algorithm (specifically the YOLO class of algorithms, known for
being fast and accurate) first picks up the characteristics—like shape, colour and orientation—
of objects of interest using a data set of X-ray images.
The training data used comprises both positive cases, where objects of interest are placed in
various orientations and positions in a variety of different parcels to simulate smuggling cases,
and negative cases in which parcels do not contain illicit items. For positive cases, the positions
of the objects of interest are manually marked out using a specialised software—it’s a process
known as data annotation.
After data annotation, the set of training data is used as input to the deep learning object
detection algorithm; the algorithm then seeks to minimise errors iteratively—think of it as
continuous approximations—to improve detection performance.
The error function is typically calculated by comparing the prediction of the algorithm against
the actual human annotated ground-truth. If the algorithm predicts that a threat is present
in a negative X-ray image, the error will be high; conversely, if the algorithm predicts a threat
accurately, the error will be low.
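As a rough sketch of this training-and-inference loop (not HTX's actual pipeline), the example below uses the open-source Ultralytics YOLO library; the dataset configuration file, class names, and image path are hypothetical placeholders.

```python
# pip install ultralytics
from ultralytics import YOLO

# Start from a small pretrained checkpoint and fine-tune it on annotated X-ray images.
# "xray_parcels.yaml" is a hypothetical dataset config listing image folders and class
# names (e.g., firearm, knife, slingshot, e-cigarette) in YOLO annotation format.
model = YOLO("yolov8n.pt")

model.train(
    data="xray_parcels.yaml",  # paths + class names for annotated positive/negative parcels
    epochs=100,                # the optimiser iteratively reduces the detection error
    imgsz=640,
)

# Inference: returns bounding boxes, class labels and confidence scores that could be
# overlaid on the X-ray image as a "second pair of eyes" for the screening officer.
results = model("sample_parcel_topview.png")  # placeholder image path
for box in results[0].boxes:
    print(results[0].names[int(box.cls)], float(box.conf), box.xyxy.tolist())
```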

Future Plans
Computed tomography (CT) scanners have recently emerged as a potential replacement
for conventional X-ray machines used for security screening. A CT scanner is similar to an
X-ray machine and uses X-ray to image the content of the parcel. However, a CT scanner uses
many beams of X-rays and is able to generate a 3-D image of the parcel using sophisticated
mathematical algorithms, thereby providing more details on the parcel content as compared to
an X-ray machine. The CBRNE team plans to build new capabilities to extend the HTX-eXaminer
for analysing the 3-D image generated from CT scanners to empower officers using this latest
technology to maintain our border safety.

Source:
1. Internal photograph taken at ATC
2. Internal images used for evaluation of HTX-eXaminer during Ops trial at ATC
3. Background visuals designed by Drazen Zigic - Freepik.com

Teo Soo Kng, Senior Data Scientist, CBRNE


SK enjoys reading up on behavioural economics when he is not overwhelmed with catching bugs in his
codes. Nudge is one of his favourite books on this topic.

TECHXPLAIN | ISSUE 21 | 04.2024 | p.12


TECH
Harnessing Gen AI:
Single-modal vs
Multi-modal Gen AI

Video 1: A Lunar New Year celebration video with Chinese Dragon (click image to view video)
Video 2: A 3D animated scene featuring a close-up of a short fluffy monster kneeling beside a melting red candle (click image to view video)

Notice anything off about the videos above? That’s right – both are AI generated! They were
generated by OpenAI’s SORA, the latest Generative Artificial Intelligence (or Gen AI) platform
that has taken the world by storm.
Gen AI has captured widespread attention due to its remarkable ability to generate content
such as text, images, music, and human-like conversations through large scale machine
learning models. However, before you start to harness its potential applications, do you know
the different types of Gen AI?
We can broadly classify Gen AI models into two main types: Single-modal and Multi-modal Gen AI. The well-known models you may have heard of, such as the original ChatGPT (GPT-3), are single-modal Gen AIs. Single-modal models focus on generating content in a single modality, such as text, images, audio, or video exclusively. On the other hand, Multi-modal Gen AIs can generate content that spans multiple modalities, such as converting text into images, or can handle multiple types of input modalities and produce outputs in multiple modalities. Table 1 below provides a comparison of Single-modal and Multi-modal Gen AI.

Complexity
Single-modal Generative AI: Generally simpler in architecture. A single specific expert domain is trained in the architecture.
Multi-modal Generative AI: More complex due to the need to understand and generate across modalities. A mixture of expert domains is trained and integrated in the architecture.

Data Processing
Single-modal Generative AI: Processes data pertaining to a single modality.
Multi-modal Generative AI: Capable of processing and relating information across different modalities.

Training Dataset
Single-modal Generative AI: Trained on large datasets of a single modality.
Multi-modal Generative AI: Trained on datasets that have paired multi-modal information (e.g., image and text pairs).

Table 1: Comparison of Single-modal and Multi-modal Gen AI

TECHXPLAIN | ISSUE 21 | 04.2024 | p.13


TECH
Harnessing Gen AI:
Single-modal vs
Multi-modal Gen AI (Cont’d)
Well-known Models
Single-modal Generative AI:
1. Text-to-text Large Language Models like ChatGPT 3, LLaMA 2, and Google’s PaLM and Gemma.
2. Image-to-image generation models such as DCGANs (Deep Convolutional Generative Adversarial Networks).
3. Audio-to-audio generation models such as SEGAN (Speech Enhancement Generative Adversarial Network).
Multi-modal Generative AI:
1. Multi-modal Large Language Models like ChatGPT 4 and Google’s Gemini Pro.
2. Text-to-image models like Stable Diffusion, Midjourney, and OpenAI’s DALL-E.
3. Text-to-video models like OpenAI’s SORA.
4. Audio (speech)-to-text generation models such as Whisper.

HTX Applications
Single-modal Generative AI:
1. Text Analytics and Report Generation: HTX has developed Gen AI platforms for automatically drafting reports according to predefined formats, summarising key points, and aggregating data from diverse text sources. Other capabilities include AI-driven news monitoring, automated document analysis, and document Q&A chatbots.
2. Image Enhancement: HTX has leveraged Gen AI to enhance low-resolution, noisy, or dark photos and to restore damaged photos (e.g., smudges, tears).
3. Audio Surveillance: Using audio analytics AI, HTX can denoise and enhance noisy audio feeds from public areas to monitor potential security threats or distress signals.
Multi-modal Generative AI:
1. Multi-modal Surveillance and Sensemaking: HTX has employed Multi-modal Gen AI to describe and caption images and videos, or to search for objects and people of interest in an image/video database. An example of such a platform is QCaption, covered under TechByte #73 (published 16th Feb).
2. Speech-to-Text Translation: Using speech-to-text AI models, HTX can transcribe multilingual audio into text and intelligently analyse it for any content of interest.
3. Content Creation: HTX is experimenting with using Multi-modal Gen AI to generate images, audio, and videos. This could help us illustrate a report, visualise a scene from descriptions, or have AI read and generate an assessment report for you.

Table 1: Comparison of Single-modal and Multi-modal Gen AI (continued)
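To make the distinction concrete, the sketch below contrasts a text-only request with a mixed text-and-image request using the OpenAI Python SDK; the model names, image URL, and prompts are illustrative assumptions rather than recommendations of any particular vendor.

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Single-modal: a text-only prompt to a text-only model.
text_only = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Summarise this incident report in three bullet points: ..."}],
)

# Multi-modal: the same request style, but mixing text with an image input.
multimodal = client.chat.completions.create(
    model="gpt-4-vision-preview",  # illustrative multi-modal model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the scene and list any objects of interest."},
            {"type": "image_url", "image_url": {"url": "https://example.com/scene.jpg"}},  # placeholder
        ],
    }],
)

print(text_only.choices[0].message.content)
print(multimodal.choices[0].message.content)
```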

In conclusion, while single-modal Gen AI excels within its specific domain, multi-modal Gen AI
offers the potential for richer interactions through the integration and comprehension of
multiple modalities of datasets. This rapidly growing field represents the future of Gen AI
research and applications. Stay tuned for an in-depth exploration of multi-modal Gen AI and
its application to the Home Team in our upcoming article.

Ng Gee Wah, Director, Q Team CoE
Gee Wah enjoys thinking aloud about challenging problems and wrote his first book on Neural Networks in 1996.

Wang Jiale, Engineer, Q Team CoE
Jiale enjoys backpacking, hiking, and watching planes at Changi Beach – anything to get away from the computer!

Deryl Chua, Engineer, Q Team CoE
Deryl is a food enthusiast who also enjoys exploring the latest in technology.

TECHXPLAIN | ISSUE 21 | 04.2024 | p.14


TECH
ChatGPT with Eyes?!
The Rise of Multi-modal
Large Language Models
Like many people, you probably have
interacted with (or at least heard of)
generative artificial intelligence tools like
ChatGPT and Google Bard. These AI
platforms have gained fame for their
ability to read and write like a human1,
capturing the imagination of the world.

Yet, innovation has only accelerated since their roll-outs. In less than a year, the frontier of AI research has rapidly shifted to Multi-modal Large Language Models (MLLM), such as GPT-4(Vision)² and Gemini3. MLLM can now process and understand not just text, but also images, videos, and audio⁴.

Figure 1a: Asking GPT-4V for help on adjusting the bike seat
Figure 1b: Generating a poem based on an image using Gemini

This marks a significant leap in AI’s journey towards mirroring human intelligence—after all, it can now think, see, and hear, just like us. The possibilities are endless. Imagine an AI that helps newbie cyclists lower their bike seat (Figure 1a) or composes a poem inspired by a photograph you’ve captured (Figure 1b).

How MLLM work

You can think of MLLM as the multi-taskers of the AI world. They can handle different kinds of
data at once, like pairing images with text, to provide coherent responses. These models were
trained on millions of examples in different forms. For instance, they’ll be shown many pictures
with descriptions, so they learn to associate images with certain words.

They were built on years of AI research. Notable breakthroughs included the neural network
CLIP⁵ that enabled effortless connections between words and pictures, and the Attention
mechanism⁶ that helped the network focus on the most important parts of the data.
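As an illustration of the CLIP-style image-text matching described above, the sketch below scores a set of candidate captions against an image using the publicly available CLIP checkpoint on Hugging Face; the image path and captions are placeholders.

```python
# pip install transformers torch pillow
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load the public CLIP checkpoint that maps images and text into a shared embedding space.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")  # placeholder image path
captions = [
    "a person riding a bicycle on a road",
    "a parcel on an X-ray scanner",
    "a bowl of noodles on a table",
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into a
# probability distribution over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for caption, p in zip(captions, probs.tolist()):
    print(f"{p:.2f}  {caption}")
```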

QCaption: The Home Team’s very own MLLM


HTX Q Team has developed QCaption, a
prototype using MLLM to aid investigation and
intelligence officers. It is specially designed for
on-premise deployment for use on sensitive
datasets – and its abilities are well-suited to
improve workflows at the Home Team.

1. Image captioning and report generation

QCaption automatically annotates an image, extracting key details like the subjects’ attire and actions and the setting to expedite the investigation process (Figure 2a). It can also automatically label case exhibits (Figure 2b).

Figure 2a: Detailed scene description using QCaption
Figure 2b: Case exhibit annotation with QCaption

TECHXPLAIN | ISSUE 21 | 04.2024 | p.15


TECH
ChatGPT with Eyes?!
The Rise of Multi-modal
Large Language Models (Cont’d)
QCaption also taps on the MLLM’s advanced language abilities to draft reports, which comes in useful for incident reporting (Figure 2c), and to handle image-based queries using a chatbot that assists with diverse tasks (Figure 2d).

Figure 2c: Generating incident reports (following a specified format) with QCaption
Figure 2d: Asking general questions to QCaption

2. Image search

QCaption can also extract information from a batch of images, giving officers an overview of the database without trawling through multiple images. Likewise, officers can search for certain traits and features in images (Figure 2e).

Figure 2e: Image and feature search with QCaption

3. Describing videos

Current MLLM are still unable to effectively describe videos beyond a few seconds long. Hence, Q Team created a pipeline to analyse and caption videos. First, an algorithm is used to automatically capture screenshots of key moments in the video. Next, QCaption generates descriptions for each screenshot, which are then woven into a cohesive narrative, filtering out redundancies and highlighting main events (Figure 3).

Figure 3: Video description framework
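A rough sketch of such a frame-sampling-and-captioning pipeline (not Q Team's actual implementation) might look like the following, assuming OpenCV for frame extraction and hypothetical caption_image and summarise helpers standing in for MLLM calls.

```python
# pip install opencv-python
import cv2

def caption_image(frame) -> str:
    """Hypothetical stand-in for an MLLM image-captioning call (e.g., a QCaption-like service)."""
    return "[caption for one keyframe]"

def summarise(captions: list[str]) -> str:
    """Hypothetical stand-in for an LLM call that weaves captions into one narrative."""
    return " ".join(dict.fromkeys(captions))  # crude de-duplication for the sketch

def describe_video(path: str, every_n_seconds: float = 2.0) -> str:
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(1, int(fps * every_n_seconds))

    captions, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:          # sample a "key moment" at a fixed interval
            captions.append(caption_image(frame))
        index += 1
    cap.release()
    return summarise(captions)

print(describe_video("incident_clip.mp4"))  # placeholder video path
```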

More to come

MLLM is an exciting field of research, with potential for many more applications such as object
detection and video querying. While Q Team continues to explore new use cases, feel free to
approach us to test out QCaption and explore MLLM together!

Source:
1. Liu, Y. et al. (2023). Summary of ChatGPT-Related Research and Perspective Towards the Future of Large Language Models. https://doi.org/10.1016/j.metrad.2023.100017
2. OpenAI (2023). GPT-4 Technical Report. http://arxiv.org/abs/2303.08774
3. Gemini Team (2023). Gemini: A Family of Highly Capable Multi-modal Models. http://arxiv.org/abs/2312.11805
4. Wu, J. et al. (2023). Multi-modal Large Language Models: A Survey. http://arxiv.org/abs/2311.13165
5. A. Radford et al., “Learning transferable visual models from natural language supervision,” in The 38th International Conference on Machine Learning. PMLR, 2021, pp.
8748–8763.
6. A. Vaswani et al., “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.

Wang Jiale, Engineer, Q Team CoE


Jiale enjoys backpacking, hiking, and watching planes at Changi beach – anything to get away from
the computer!

TECHXPLAIN | ISSUE 21 | 04.2024 | p.16


BACK
PAGE PRESENTS

Sneak Peek:
HTX @ MILIPOL Asia-Pacific
TechX Summit 2024
We hope you enjoyed this issue of TechXplain covering AI and how it impacts the Homeland Security landscape!
In this edition of BACK PAGE, readers will get a sneak peek of some of our HTX exhibits appearing at this
year’s MILIPOL that leverage AI capabilities to augment our Home Team operations.

Eye on the Internet

Showcasing bespoke news-monitoring (NEMO) and anti-scam (Scam Buster) tools, Eye on the Internet takes visitors through NEMO’s customizable keyword search of over 2000 news sources for trend analysis and demonstrates Scam Buster’s hunting and assessment capabilities to combat online scams.

NEMO and Scam Buster’s intuitive web-based UIs

HoloDeX

HoloDeX envisions a future where scenes of interest can be virtually reconstructed, preserving their original state. Visitors can interact with this mixed reality environment and use an AI assistant to analyse scenes.

Officers interact with an AI assistant for scene analysis

AI for Marine Rescue

Making a reappearance after a successful display at MILIPOL Paris 2023, AI for Marine Rescue will give visitors insight into the improvements made to their solution comprising custom-built video analytics, Seaborne Electro-Optic (SEO) sensors, underwater sonar technology, and an array of onboard sensory instruments to deliver accuracy in man-overboard detection.

Video analytics swiftly and reliably detect humans in the water

Excited to know more about HTX-powered projects? Visit the HTX Booth at Level 1 • Halls A-C, Sands Expo & Convention Centre, 3 - 5 April 2024.

For more information, visit https://www.techxsummit.sg/map-txs2024-home

TECHXPLAIN | ISSUE 21 | 04.2024 | p.17


exponentially impacting Singapore’s safety and security_
