TECHXPLAIN 21
2024

CONTENTS
03  TECHXPLAIN COMMITTEE
    Lau Yan Ling (Q Team) (Secretary)
    Li Xin Hui (Strat Comms)
    Jacqueline Woo (Strat Comms)
04  DEEP DIVE
09  TECH BYTES
    • Unmasking the Threat of Audio Deepfakes through Detection and Verification
    • HTX-eXaminer Raises the Bar in Parcels Screening
    • Harnessing Gen AI: Single-modal vs Multi-modal Gen AI
    • ChatGPT with Eyes?! The Rise of Multi-modal Large Language Models
17  BACK PAGE
EDITOR’S NOTE
Dear readers,
Get ready to delve into the fascinating intersection of artificial intelligence (AI) and homeland security in our latest edition. As we gear up for the upcoming Milipol Asia Pacific - TechX Summit 2024, where cutting-edge innovations in AI-driven solutions promise to redefine the landscape of security, our magazine offers a glimpse into the groundbreaking advancements shaping the future.

In this special edition of TechXplain, we embark on a journey through the advancement of AI for homeland security. We begin by exploring AI-powered models used against the increasing threat of audio deepfakes and the critical need for detection and verification mechanisms. Through ongoing research and technological advancements, we strive to stay ahead of malicious actors, bolstering our defences against emerging threats.

Next, we look at the remarkable developments in parcel screening technology with the introduction of HTX-eXaminer. Leveraging the power of AI, this innovative model serves as a second pair of eyes during screening, empowering officers to swiftly and accurately detect security threats and contraband items. With the increasing volume of inbound parcels, HTX-eXaminer enhances screening efficiency and effectiveness, ensuring border safety remains paramount.

Furthermore, we explore the emerging technologies of Single-modal and Multi-modal Generative AI (Gen AI), from the specialised text-processing capabilities of Large Language Models (LLMs) to more generalised models that integrate multiple modalities such as images, videos, and audio. Read more about HTX's in-house, on-premise Multi-modal Gen AI solution, QCaption.

But the spotlight truly shines on our deep dive article, where we unravel the complexities of Multi-modal Gen AI and its landscape. This revolutionary approach paves the way for unprecedented capabilities and possibilities, enhancing our daily tasks and aiding Home Team officers in investigation and analysis work. From NVIDIA's groundbreaking hardware innovations to the intricate architectures of Multi-modal Gen AI models, we witness the convergence of hardware technology and software intelligence, shaping the future of AI-driven solutions in homeland security and beyond.

As we navigate the ever-evolving landscape of AI, let us remain vigilant, innovative, and collaborative in our pursuit of a safer and more secure future.

Regards,
Gee Wah
DEEP DIVE
Enhanced Home Team Capabilities through Multi-Modal Gen AI
On 18 March 2024, NVIDIA CEO Jensen Huang announced the Blackwell B200 GPU, a revolutionary AI
computing chip poised to completely transform the Multi-modal Generative Artificial Intelligence (Gen
AI) landscape. It can run Gen AI models 30x faster than its predecessor and, according to Huang, is
designed to facilitate the widespread use of trillion-parameter multi-modal models (e.g., OpenAI’s GPT-4
with 1.8 trillion parameters), a feat once deemed impossible for most.
But what exactly is Multi-modal Gen AI? Multi-modal Gen AI, which involves multiple modalities for input
and output, is an emerging field in AI. There are two main approaches to designing Multi-modal Gen AI
architectures: the Unified and the Mixture-of-Experts (MoE) architectures, as depicted in Figure 1.
Figure 1: Unified vs MoE architectures (diagram modified from Google Gemini)
The Unified Architecture, currently used in research and development, is trained with datasets in different
modalities without pairing them, making it a truly large and generalized model. In contrast, the MoE
architecture, used in Google’s Gemini series and OpenAI’s GPT-4, is trained with datasets in pairwise
modalities, meaning specialised dataset pairs are trained and then integrated into specific expert models,
resulting in “one large, generalized model”.
It's important to note that despite their architectural differences, both frameworks aim at the same capability: effectively handling complex tasks. The critical distinction lies in their design architecture, which affects ease of implementation, robustness, simplicity of data handling, and scalability.
Figure 2 gives us an example of a Multi-modal Gen AI MoE architecture designed with the various expert
models being integrated by a pipeline approach.
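To make the pipeline idea concrete, here is a minimal Python sketch of how per-modality expert models might be composed behind a single entry point. The expert functions, their interfaces, and the fusion step are hypothetical placeholders for illustration only, not the actual design of QCaption, Gemini, or GPT-4.

```python
from dataclasses import dataclass

# Hypothetical expert models; in a real MoE pipeline these would be
# specialised neural networks (e.g., a speech model, a vision model).
def text_expert(text: str) -> str:
    return f"[text analysis of: {text!r}]"

def image_expert(image_bytes: bytes) -> str:
    return f"[caption for an image of {len(image_bytes)} bytes]"

def audio_expert(audio_bytes: bytes) -> str:
    return f"[transcript of {len(audio_bytes)} bytes of audio]"

@dataclass
class MultiModalInput:
    text: str | None = None
    image: bytes | None = None
    audio: bytes | None = None

def moe_pipeline(sample: MultiModalInput) -> str:
    """Route each modality to its expert, then fuse the outputs.

    A production system would learn the routing and fusion; here both
    are hard-coded purely to illustrate the pipeline structure.
    """
    parts = []
    if sample.text is not None:
        parts.append(text_expert(sample.text))
    if sample.image is not None:
        parts.append(image_expert(sample.image))
    if sample.audio is not None:
        parts.append(audio_expert(sample.audio))
    # Fusion step: combine expert outputs into one coherent answer.
    return " | ".join(parts)

print(moe_pipeline(MultiModalInput(text="suspicious parcel", image=b"\x89PNG...")))
```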
Organisation — automates routine tasks to improve productivity, freeing up humans for creativity and strategic planning.

Multi-modal Gen AI tools like HTX's QCaption and Spark are transforming HTX's and the HTDs' processes. QCaption (more info in TechByte #73) enhances scene analysis in images and videos, while Spark optimizes resume screening by aligning resumes to job descriptions and competencies. Together, these AI solutions exemplify how technology can transform traditional tasks and optimize efficiency while empowering human innovation and strategic planning.
Interestingly, Gen AI itself could serve as a tool to combat its misuse. We can leverage Multi-modal Gen AI’s
ability to analyse diverse media types by integrating text, image, and audio-visual data analysis. This
approach could potentially detect synthetic media, revealing subtle inconsistencies often overlooked when
examining a single media type. Through adversarial training, where one part of the system generates fakes
and another detects them, detection capabilities continually improve. Additionally, Multi-modal Gen AI
could proactively scan the web to identify potential abuses and educate users to recognise fake content.
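As an illustration of the adversarial training dynamic described above, the following PyTorch sketch pits a toy generator against a toy detector over random feature vectors. Real deepfake detectors operate on images or audio with far larger networks; treat this purely as a sketch of the generate-and-detect loop, not anyone's production system.

```python
import torch
import torch.nn as nn

# Toy generator and detector over flat feature vectors.
DIM = 64
generator = nn.Sequential(nn.Linear(16, DIM), nn.ReLU(), nn.Linear(DIM, DIM))
detector = nn.Sequential(nn.Linear(DIM, DIM), nn.ReLU(), nn.Linear(DIM, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(detector.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(200):
    real = torch.randn(32, DIM)   # stand-in for genuine media features
    fake = generator(torch.randn(32, 16))

    # Detector step: label real media 1, generated media 0.
    d_loss = loss_fn(detector(real), torch.ones(32, 1)) + \
             loss_fn(detector(fake.detach()), torch.zeros(32, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator step: try to fool the detector, which in turn
    # forces the detector to keep improving.
    g_loss = loss_fn(detector(fake), torch.ones(32, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```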
To ensure the ethical and effective use of AI, a multidisciplinary approach and industry-wide collaboration
are essential. We must prioritize establishing standards for responsible AI use and enhancing the integrity
of digital information.
Abstract, Spatial and Cognitive Reasoning — the ability to perform abstract, spatial, and cognitive reasoning is an area where current Multi-modal Gen AI can do better.

There are several lines of research trying to enhance Multi-modal Gen AI's reasoning capability, such as:
1. Chain-of-Thought Processing (see the sketch below)
2. Dual Loop Learning
3. Self-Reflection
4. Smart Hypothesis Generation
5. Quiet-STaR research
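As a taste of the first item on that list, here is a minimal sketch of the chain-of-thought prompting pattern. The `query_llm` function is a hypothetical placeholder for whatever model client is available, not a specific API.

```python
def chain_of_thought_prompt(question: str) -> str:
    """Wrap a question so the model reasons step by step before answering."""
    return (
        "Answer the question below. Think through the problem step by step, "
        "writing out each intermediate step, then state the final answer on "
        "a new line prefixed with 'Answer:'.\n\n"
        f"Question: {question}"
    )

def query_llm(prompt: str) -> str:
    # Hypothetical stand-in; replace with a real model client.
    return "Step 1: ...\nStep 2: ...\nAnswer: 42"

def answer_with_cot(question: str) -> str:
    response = query_llm(chain_of_thought_prompt(question))
    # Keep only the final answer line; the reasoning steps are discarded.
    for line in response.splitlines():
        if line.startswith("Answer:"):
            return line.removeprefix("Answer:").strip()
    return response

print(answer_with_cot("What is six times seven?"))
```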
Retain and Retrieve Memories Over Time — the capability to remember past experiences and retrieve context to answer current questions with coherence to past contextual awareness.

Some solutions have been implemented, such as a long-term memory service via embedding and storage in a vector search solution for real-time retrieval of historical context. Other improvements include long context windows and Vector Database Retrieval.
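The embed-and-retrieve pattern behind such a long-term memory service can be sketched in a few lines. The character-hash `embed` function below is a toy stand-in for a real embedding model, and a production system would use a proper vector database rather than a Python list.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy embedding: hash characters into a fixed-size unit vector."""
    vec = np.zeros(dim)
    for i, ch in enumerate(text.lower()):
        vec[(i + ord(ch)) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

class MemoryStore:
    """Toy long-term memory: store embeddings, retrieve by cosine similarity."""
    def __init__(self):
        self.texts: list[str] = []
        self.vectors: list[np.ndarray] = []

    def remember(self, text: str) -> None:
        self.texts.append(text)
        self.vectors.append(embed(text))

    def recall(self, query: str, k: int = 2) -> list[str]:
        # Unit vectors, so the dot product is the cosine similarity.
        scores = np.array([v @ embed(query) for v in self.vectors])
        top = np.argsort(scores)[::-1][:k]
        return [self.texts[i] for i in top]

store = MemoryStore()
store.remember("Case 101: parcel flagged at checkpoint A on Monday")
store.remember("Case 102: routine screening, no anomalies")
print(store.recall("which parcel was flagged at checkpoint A?"))
```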
What’s Next?
Looking ahead, we find ourselves in an era brimming with advancements in Generative AI. With each
week unveiling new Large Language Models (LLMs), Multi-modal models, and platforms, the pace of
innovation is relentless. Yet, amidst this flurry of activity, the question of what the future holds for Multi-
modal Gen AI remains pressing.
At the recent GPU Technology Conference (GTC), OpenAI offered insights that suggest a pivotal shift in
the trajectory of Generative AI development. They proposed that we might be approaching a saturation
point in the training of LLMs, given the extensive use of virtually all available human-written text. This
juncture marks a transition towards the “post-training” phase, emphasizing fine-tuning, Retrieval-
Augmented Generation (RAG), and the integration of specialized domain knowledge as the next frontier
in Generative AI evolution. This shift from broad learning to a focus on refinement and specialization
opens new pathways and potentials for Generative AI.
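To illustrate what Retrieval-Augmented Generation adds on top of a base model, here is a minimal sketch. Both `retrieve` (a crude keyword-overlap ranker standing in for embedding search) and `query_llm` are hypothetical placeholders, not a specific vendor's API.

```python
def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query; a stand-in for
    embedding-based vector search."""
    q = set(query.lower().split())
    scored = sorted(documents, key=lambda d: -len(q & set(d.lower().split())))
    return scored[:k]

def query_llm(prompt: str) -> str:
    # Hypothetical stand-in; replace with a real model client.
    return "(model answer grounded in the provided context)"

def rag_answer(query: str, documents: list[str]) -> str:
    # Ground the model in retrieved domain knowledge instead of
    # relying only on what it absorbed during pre-training.
    context = "\n".join(retrieve(query, documents))
    prompt = (
        "Use ONLY the context below to answer the question.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return query_llm(prompt)

docs = ["SOP 12: parcels over 30 kg need secondary screening",
        "SOP 7: visitor passes expire after 24 hours"]
print(rag_answer("When is secondary screening needed?", docs))
```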
But with these developments, one can’t help but wonder: What innovative applications will emerge from
this focus on post-training enhancement, and how will they further transform our interaction with
technology? Given the outsized social impact of each iteration in the evolution of Generative AI, it is
important for the Home Team to continue to monitor developments in this space.
Future Plans
Computed tomography (CT) scanners have recently emerged as a potential replacement for the conventional X-ray machines used for security screening. Like an X-ray machine, a CT scanner uses X-rays to image the contents of a parcel. However, a CT scanner uses many beams of X-rays and, through sophisticated mathematical algorithms, generates a 3-D image of the parcel, providing far more detail on the parcel's contents than an X-ray machine can. The CBRNE team plans to build new capabilities that extend the HTX-eXaminer to analyse the 3-D images generated by CT scanners, empowering officers using this latest technology to maintain our border safety.
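For readers curious how a 3-D image is mathematically recovered from many X-ray beams, the sketch below simulates the core idea on a toy 2-D slice using scikit-image's Radon transform utilities (filtered back projection). It assumes scikit-image >= 0.19 for the `filter_name` argument and is unrelated to the actual HTX-eXaminer implementation.

```python
import numpy as np
from skimage.transform import radon, iradon

# Toy 2-D "slice" of a parcel: a bright square object on an empty background.
image = np.zeros((128, 128))
image[48:80, 48:80] = 1.0

# Simulate the many beam angles a CT scanner uses: each angle
# yields one projection (one column of the sinogram).
angles = np.linspace(0.0, 180.0, 120, endpoint=False)
sinogram = radon(image, theta=angles)

# Filtered back projection recovers the slice from its projections;
# stacking many such slices yields the 3-D volume of the parcel.
reconstruction = iradon(sinogram, theta=angles, filter_name="ramp")
print("mean reconstruction error:", np.abs(reconstruction - image).mean())
```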
Video 1: A Lunar New Year celebration video with Chinese Dragon
Video 2: A 3D animated scene featuring a close-up of a short fluffy monster kneeling beside a melting red candle
Notice anything off about the videos above? That’s right – both are AI generated! They were
generated by OpenAI’s SORA, the latest Generative Artificial Intelligence (or Gen AI) platform
that has taken the world by storm.
Gen AI has captured widespread attention due to its remarkable ability to generate content
such as text, images, music, and human-like conversations through large scale machine
learning models. However, before you start to harness its potential applications, do you know
the different types of Gen AI?
We can broadly classify Gen AI models into two main types: Single-modal and Multi-modal. The well-known models you may have heard of, such as the original ChatGPT (GPT-3), are single-modal Gen AIs. Single-modal models focus on generating content in a single modality, such as text, images, audio, or video exclusively. On the other hand, Multi-modal Gen AIs can generate content that spans multiple modalities, such as converting text into images, or handle multiple types of input modalities and produce outputs in multiple modalities. Table 1 below provides a comparison of Single-modal and Multi-modal Gen AI.
Table 1: Single-modal vs Multi-modal Gen AI
Training Dataset — Single-modal: trained on large datasets of a single modality. Multi-modal: trained on datasets that have paired multi-modal information (e.g., image and text pairs).
In conclusion, while single-modal Gen AI excels within its specific domain, multi-modal Gen AI
offers the potential for richer interactions through the integration and comprehension of
multiple modalities of datasets. This rapidly growing field represents the future of Gen AI
research and applications. Stay tuned for an in-depth exploration of multi-modal Gen AI and
its application to the Home Team in our upcoming article.
This marks a significant leap in AI’s journey towards mirroring human intelligence—after all,
it can now think, see, and hear, just like us. The possibilities are endless. Imagine an AI that
helps newbie cyclists lower their bike seat (Figure 1a) or composes a poem inspired by a
photograph you’ve captured (Figure 1b).
You can think of MLLMs as the multi-taskers of the AI world. They can handle different kinds of data at once, like pairing images with text, to provide coherent responses. These models were trained on millions of examples in different forms. For instance, they are shown many pictures with descriptions, so they learn to associate images with certain words.
They were built on years of AI research. Notable breakthroughs included the neural network
CLIP⁵ that enabled effortless connections between words and pictures, and the Attention
mechanism⁶ that helped the network focus on the most important parts of the data.
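Since the Attention mechanism⁶ is central to these models, here is its core formula, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, as a short NumPy sketch. The tiny random self-attention example at the end is illustrative only.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, from Vaswani et al.⁶"""
    d_k = Q.shape[-1]
    # Each row: how much one query token attends to every key token.
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V

# Tiny self-attention example: 3 tokens, 4-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(x, x, x).shape)  # (3, 4)
```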
2. Image search
QCaption can also extract information from a batch of images, giving officers an overview of the database without trawling through multiple images. Likewise, officers can run image and feature searches with QCaption (Figure 2e).

Figure 2c: Generate incident reports (following specified format) with QCaption
Figure 2d: Asking general questions to QCaption
Figure 2e: Image and feature search with QCaption

3. Describing videos
Current MLLMs are still unable to effectively describe videos beyond a few seconds long. Hence, Q Team created a pipeline to analyse and caption videos. First, an algorithm is used to automatically capture screenshots of key moments in the video.
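One simple way such a keyframe algorithm could work is frame differencing, sketched below with OpenCV. This is an illustrative heuristic under that assumption, not Q Team's actual algorithm.

```python
import cv2

def extract_keyframes(video_path: str, diff_threshold: float = 30.0) -> list:
    """Keep frames that differ strongly from the previous kept frame.

    A simple mean-absolute-difference heuristic; real pipelines often
    use smarter shot-boundary or saliency detection.
    """
    cap = cv2.VideoCapture(video_path)
    keyframes, prev_gray = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is None or cv2.absdiff(gray, prev_gray).mean() > diff_threshold:
            keyframes.append(frame)   # a "key moment": the scene changed noticeably
            prev_gray = gray
    cap.release()
    return keyframes
```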
More to come
MLLMs are an exciting field of research, with potential for many more applications such as object detection and video querying. While Q Team continues to explore new use cases, feel free to approach us to test out QCaption and explore MLLMs together!
Source:
1. Liu, Y. et al. (2023). Summary of ChatGPT-Related Research and Perspective Towards the Future of Large Language Models. https://doi.org/10.1016/j.metrad.2023.100017
2. OpenAI (2023). GPT-4 Technical Report. http://arxiv.org/abs/2303.08774
3. Gemini Team (2023). Gemini: A Family of Highly Capable Multi-modal Models. http://arxiv.org/abs/2312.11805
4. Wu, J. et al. (2023). Multi-modal Large Language Models: A Survey. http://arxiv.org/abs/2311.13165
5. Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, PMLR, pp. 8748–8763.
6. Vaswani, A. et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, vol. 30.
Sneak Peek:
HTX @ MILIPOL Asia-Pacific
TechX Summit 2024
We hope you enjoyed this issue of TechXplain covering AI and how it impacts the Homeland Security landscape!
In this edition of BACK PAGE, readers will get a sneak peek of some of our HTX exhibits appearing at this
year’s MILIPOL that leverage AI capabilities to augment our Home Team operations.
NEMO and Scam Buster's intuitive web-based UIs
Officers interact with an AI assistant for scene analysis