
PhD title: Efficient Multimodal Vision Transformers for Embedded Systems

Keywords: vision transformer, AI, segmentation, pruning, multimodal, edge

General context:
Vision transformers (ViTs) adapt the transformer architecture, originally designed for natural language
processing tasks, to visual data and have emerged as powerful tools for understanding and processing
visual information. They have been widely applied to various computer vision tasks, including
classification, detection, segmentation, tracking and more, surpassing traditional Convolutional Neural
Networks (CNNs) in accuracy. A ViT processes an image as a sequence of tokens using self-attention
mechanisms, capturing global relationships, whereas CNNs employ convolutional layers with local
receptive fields to extract spatial hierarchies of features directly from the input image. Unlike CNNs,
ViTs attend directly to image patches, enabling efficient parallel processing and improving performance
on diverse visual data without the architectural constraints of grid-like structures. The ViT pipeline is
image-specific: the input image must be split into smaller units, called tokens, that the model can
process. In practice, tokenizing the input and selecting an embedding for each token are vital for
transformers, yet highly flexible, with many alternatives. Given an image, there is no one-size-fits-all
approach to tokenization, and the best method may vary depending on the characteristics of the data
and the task at hand. Depending on the desired granularity, the choice can be made between the use
of 1) ROIs and CNN features; 2) patches and linear projection; or 3) graph nodes and Graph Neural
Network (GNN) features as tokens and token embeddings [8].
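
To make the second option concrete, the sketch below illustrates tokenization by non-overlapping patches followed by a learned linear projection, in the spirit of the standard ViT pipeline. It is a minimal, generic example (the patch size and embedding dimension are arbitrary choices), not an excerpt from the cited works.

```python
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """Split an image into non-overlapping patches and embed each patch linearly."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_tokens = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch
        # and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                        # x: (B, C, H, W)
        x = self.proj(x)                         # (B, embed_dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)      # (B, num_tokens, embed_dim)

tokens = PatchTokenizer()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                              # torch.Size([1, 196, 768])
```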

The world around us encompasses many modalities. Through ongoing research and innovation, the
integration of these modalities into AI systems is poised to revolutionize how we perceive, interpret, and
interact with our environment. For many applications, from natural language understanding to complex
perception and decision-making, the state of the art is achieved by multimodal models that merge 2D
images, 3D point clouds (LiDAR), text, or audio data [11]. Intuitively, multimodal ViTs that combine data
outperform their unimodal counterparts, since more information is aggregated. Fusion can be performed
at three levels: early (input), mid (intermediate representation), and late (prediction, where no
cross-modal information is exchanged inside the model). Various approaches, from simple
operation-based fusion such as element-wise addition or concatenation to more intricate attention-based
or cross-attention-based fusion, are under continuous exploration.
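
As an illustration of the simplest of these options, the sketch below contrasts operation-based early fusion (token concatenation or element-wise addition) with a cross-attention block in which one modality queries the other. It is a generic example under assumed token shapes, not a reproduction of any cited method.

```python
import torch
import torch.nn as nn

def early_fusion_concat(tok_a, tok_b):
    """Early fusion: concatenate the token sequences of two modalities."""
    return torch.cat([tok_a, tok_b], dim=1)       # (B, Na + Nb, D)

def early_fusion_add(tok_a, tok_b):
    """Element-wise addition (requires the same number of tokens per modality)."""
    return tok_a + tok_b                          # (B, N, D)

class CrossAttentionFusion(nn.Module):
    """Mid-level fusion: modality A attends to modality B's tokens."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tok_a, tok_b):
        fused, _ = self.attn(query=tok_a, key=tok_b, value=tok_b)
        return self.norm(tok_a + fused)           # residual connection

a, b = torch.randn(2, 196, 256), torch.randn(2, 196, 256)
print(CrossAttentionFusion()(a, b).shape)         # torch.Size([2, 196, 256])
```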

Researchers are carefully studying these methods to determine their effectiveness in dealing with the
complexity of multimodal data fusion and interaction. Each modality possesses distinct characteristics,
necessitating innovative techniques to ensure seamless integration without information loss. Integrating
multiple modalities requires robust data collection, efficient algorithms, and scalable computing
infrastructure. This integration presents unique challenges for computational researchers due to the
heterogeneity of the data, including [6]: 1) a common representation, i.e. converting raw data into a
format that a machine learning model can understand; 2) consistent translation, i.e. converting data
from one modality to another; 3) accurate alignment, i.e. identifying correspondences between data
from different modalities to build a more complete and accurate representation of the underlying reality;
4) coherent fusion, which combines data from multiple modalities; and 5) seamless co-learning, i.e.
transferring knowledge between modalities and their models.
Robustness is a critical factor in the reliability of multimodal AI systems; however, it remains unclear
and under-researched. Augmenting AI systems with additional data to enhance performance intensifies
the influx of data and the complexity of the model. This escalation can eventually result in computational
challenges, particularly in terms of multiply-accumulate (MACC) operations and memory utilization. This
poses a significant challenge for embedded systems, where computational resources are limited.
Balancing enhanced performance with the constraints of embedded systems becomes crucial in
addressing this growing issue. In light of these challenges, ongoing research is crucial to developing
innovative techniques that optimize AI algorithms for efficiency and reliability without compromising
accuracy. Solutions such as model compression, hardware acceleration, and algorithmic optimization
are actively being explored to make AI systems more viable for embedded applications. In essence,
while the fusion of multimodal data holds tremendous potential, there is still much ground to cover in
refining and optimizing existing techniques and developing novel approaches.

In this research work, multimodal refers specifically to vision-vision fusion, focusing on the integration
of different visual data modalities, such as RGB, infrared, depth, or LiDAR, through transformers. It is
important to keep in mind that the efficiency and applicability of ViTs are often hindered by their
substantial computational demands. Because of the self-attention mechanism, their time and memory
complexity grows quadratically with the input sequence length, which limits their adoption in
computation-constrained applications. In a multimodal context, the computational complexity increases
further due to the jointly high-dimensional representations. This challenge is all the more significant
because this thesis argues for incorporating resource constraints into the fundamental principles of
algorithm design.
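
The quadratic cost is easy to see from a bare-bones implementation of self-attention: the score matrix has one entry per pair of tokens, so time and memory grow with N². The snippet below is a generic, simplified illustration (single head, no learned projections), not a component of any cited model.

```python
import torch

def self_attention(x):
    """Naive single-head self-attention; x has shape (B, N, D)."""
    B, N, D = x.shape
    q, k, v = x, x, x                        # learned projections omitted for brevity
    scores = q @ k.transpose(1, 2) / D**0.5  # (B, N, N): N^2 scores per image
    weights = scores.softmax(dim=-1)
    return weights @ v                       # (B, N, D)

# Doubling the token count quadruples the size of the attention matrix.
for n in (196, 392):
    print(n, "tokens ->", n * n, "attention scores per head")
```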

Project description and objectives:


Starting from a single-modal vision transformer design, this thesis will explore ViT optimization, in
particular token pruning, to accelerate inference for vision tasks while preserving accuracy. Since a
large number of tokens may not be necessary, as not all tokens are equally important, the goal is to
automatically detect redundant tokens and dynamically configure their number and size. The first main
task involves a comprehensive analysis of existing ViT optimization techniques to identify and evaluate
the most effective token pruning methods for object detection and instance segmentation, with the aim
of bridging the gap between classification and dense tasks. Today, the majority of token pruning
techniques tackle the image classification task [1-4]. Previous studies have demonstrated that a
comparable level of accuracy can be achieved by considering only a subset of tokens (through merging
or suppression), applying a hierarchical token structure, or combining local and global attention. These
approaches, while effective, have only been applied to classification and have yet to be extended to
other tasks such as object detection and instance segmentation. Indeed, tokens cannot simply be
discarded for semantic segmentation, since each token represents a specific image region. Moreover,
propagating semantic information back to each original token becomes challenging after applying
existing merging approaches, which allow arbitrary token combinations. To address this issue,
Content-aware Token Sharing (CTS) [5] proposes a CNN-based policy network, trained separately from
the ViT, to predict whether neighboring image patches contain the same semantic class. However, it
adds parameters to the model, which may not be suitable for systems with limited resources.
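
As a minimal illustration of score-based token pruning, in the spirit of [1-4] but not reproducing any specific method, the sketch below keeps only the highest-scoring tokens. The keep ratio and the use of class-token attention as an importance score are illustrative assumptions.

```python
import torch

def prune_tokens(tokens, cls_attention, keep_ratio=0.5):
    """Keep the most informative tokens according to an importance score.

    tokens:        (B, N, D) patch tokens (class token excluded)
    cls_attention: (B, N) attention received from the class token,
                   used here as a proxy for token importance
    """
    B, N, D = tokens.shape
    num_keep = max(1, int(N * keep_ratio))
    # Indices of the highest-scoring tokens for each image in the batch.
    keep_idx = cls_attention.topk(num_keep, dim=1).indices       # (B, num_keep)
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, D)          # (B, num_keep, D)
    return torch.gather(tokens, dim=1, index=keep_idx)           # (B, num_keep, D)

tokens = torch.randn(2, 196, 768)
scores = torch.rand(2, 196)
print(prune_tokens(tokens, scores).shape)    # torch.Size([2, 98, 768])
```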

To increase the performance of segmentation and detection with ViTs, the work will also explore
multimodal token fusion. A number of studies have been carried out that can serve as a starting point.
The tokens from different modalities are usually combined directly as inputs to the transformers. For
instance, the recently released Fusion via Attention Bottlenecks (FSN) [7] improves on early
concatenation. This approach passes messages through a small number of bottleneck latents and
forces the model to distill the most necessary information, i.e. to collect and "condense" the most
relevant inputs from each modality. This strategy not only improves fusion performance but also reduces
computational costs. In [9], a multiview architecture is proposed that outperforms its single-view
counterpart in terms of the accuracy/floating-point operations (FLOPs) trade-off. It creates multiple
representations of the input by tokenizing the video with tubelets of different sizes and feeding them
into cross-view encoders. The baseline framework for this thesis can be TokenFusion [10], which
proposes an adaptive method for fusing multiple single-modal transformers, regardless of whether the
data are homogeneous or heterogeneous. It dynamically detects uninformative tokens and replaces
them with projected and aggregated inter-modal features. In other words, pruning is applied individually
to each single-modal transformer, and each pruned unit is substituted by projected alignment features
from the other modalities. The goal is to analyze and combine the token pruning approaches identified
above within the TokenFusion technique.
Finally, the robustness and effectiveness of such combinations, as well as their portability to embedded
systems, will be verified. The PhD thesis will encompass theoretical and experimental research,
including the implementation of a multimodal ViT-based object segmentation or detection system
designed for resource-constrained environments. The effectiveness of the proposed multimodal ViT
approach will be assessed on well-known public datasets such as nuScenes, Waymo, and
SemanticKITTI. Moreover, both research laboratories possess expertise in developing mobile robotic
platforms and maintain a fleet of fully automated electric vehicles (Twizy, Zoe) and drones geared
towards autonomous driving. The findings of this thesis will directly contribute to enhancing the
embedded perception modules within these autonomous systems.
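
To illustrate the kind of inter-modal substitution that TokenFusion [10] describes, the sketch below scores the tokens of one modality and replaces the uninformative ones with a projection of the spatially aligned tokens from the other modality. It is a generic sketch, not the published implementation; the scoring MLP, threshold, and projection layer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InterModalSubstitution(nn.Module):
    """Replace uninformative tokens of modality A with projected tokens of modality B."""
    def __init__(self, dim=256, threshold=0.1):
        super().__init__()
        # Small scorer rating each token of modality A in [0, 1].
        self.scorer = nn.Sequential(nn.Linear(dim, dim // 4), nn.ReLU(),
                                    nn.Linear(dim // 4, 1), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)   # cross-modal projection (alignment)
        self.threshold = threshold

    def forward(self, tok_a, tok_b):      # both: (B, N, D), assumed spatially aligned
        score = self.scorer(tok_a)                    # (B, N, 1)
        mask = (score > self.threshold).float()       # 1 = keep, 0 = substitute
        return mask * tok_a + (1.0 - mask) * self.proj(tok_b)

a, b = torch.randn(2, 196, 256), torch.randn(2, 196, 256)
print(InterModalSubstitution()(a, b).shape)           # torch.Size([2, 196, 256])
```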

References:

[1] Chen, M., Lin, M., Li, K., Shen, Y., Wu, Y., Chao, F., & Ji, R. (2023). CF-ViT: A general coarse-to-fine
method for vision transformer. In Proceedings of the AAAI Conference on Artificial Intelligence
(Vol. 37, No. 6, pp. 7042-7052).

[2] Liang, Y., Ge, C., Tong, Z., Song, Y., Wang, J., & Xie, P. (2022). Not all patches are what you need:
Expediting vision transformers via token reorganizations. arXiv preprint arXiv:2202.07800.

[3] Wang, Y., Huang, R., Song, S., Huang, Z., & Huang, G. (2021). Not all images are worth 16x16
words: Dynamic transformers for efficient image recognition. In Advances in Neural Information
Processing Systems (Vol. 34, pp. 11960-11973).

[4] Long, S., Zhao, Z., Pi, J., Wang, S., & Wang, J. (2023). Beyond attentive tokens: Incorporating
token importance and diversity for efficient vision transformers. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (pp. 10334-10343).

[5] Lu, C., de Geus, D., & Dubbelman, G. (2023). Content-aware token sharing for efficient semantic
segmentation with vision transformers. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (pp. 23631-23640).

[6] Baltrušaitis, T., Ahuja, C., & Morency, L.-P. (2017). Multimodal machine learning: A survey and
taxonomy. arXiv preprint arXiv:1705.09406.

[7] Nagrani, A., Yang, S., Arnab, A., Jansen, A., Schmid, C., & Sun, C. (2021). Attention bottlenecks for
multimodal fusion. In Advances in Neural Information Processing Systems (NeurIPS).

[8] Xu, P., Zhu, X., & Clifton, D. (2022). Multimodal learning with transformers: A survey. arXiv preprint
arXiv:2206.06488.

[9] Yan, S., Xiong, X., Arnab, A., Lu, Z., Zhang, M., Sun, C., & Schmid, C. (2022). Multiview transformers
for video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (pp. 3333-3343).

[10] Wang, Y., Chen, X., Cao, L., Huang, W., Sun, F., & Wang, Y. (2022). Multimodal token fusion for
vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (pp. 12186-12195).

[11] Hu, S., Bonardi, F., Bouchafa, S., & Sidibé, D. (2023). Multi-modal unsupervised domain adaptation
for semantic image segmentation. Pattern Recognition, 137, 109299.

This thesis is part of a new collaboration between the IBISC laboratory (University of Evry Paris-Saclay)
and the LIAE laboratory (CEA-LIST).

The LIAE (Embedded Artificial Intelligence Laboratory), a laboratory of the LIST (Laboratory for
Systems and Technology Integration) division under the CEA (French Alternative Energies and Atomic
Energy Commission), serves as a central hub for driving innovation and facilitating the transfer of
technology in the field of embedded systems. The team at LIAE is responsible for the complex task of
conceptualizing, designing, and implementing highly optimized solutions that take into account various
critical factors including surface area, power consumption, and computational efficiency. These
solutions are custom-tailored for specific applications, such as object detection, segmentation, tracking,
or visual-inertial navigation systems, primarily designed for resource-constrained targets.
Currently, the lab is actively involved in the development of cutting-edge tools like the Aidge framework,
focused on optimizing and seamlessly integrating neural networks. These efforts encompass a range
of techniques including 8-, 4-, and 2-bit quantization (both quantization-aware and post-training),
network pruning, and the export of models to specific formats such as TensorRT or pNeuro. The ultimate objective of
this research is to deliver hardware-efficient Deep Neural Networks (DNNs) that can seamlessly adapt
to multiple embedded architectures.
The SIAM (Signal, Image, AutoMatique) team of the IBISC laboratory (Informatics, BioInformatics,
Complex Systems) is an interdisciplinary team whose research revolves around autonomous systems
and, more particularly, perception, observation, modeling, and control of “vehicles” (autonomous
vehicles, mobile robots, drones). The “Dynamic Perception” axis of the team is particularly interested in
multimodal perception and fusion. The team is developing new methods for visual odometry, obstacle
detection, or scene reconstruction based on the analysis of images from heterogeneous visual sensors
(RGB-D cameras, infrared and near-infrared cameras, and event cameras).

PROFILE AND SKILLS REQUIRED


The candidate must have a solid background in digital image processing, computer vision, AI, and
robotics. She/he has acquired a broad scientific culture during her/his training and is interested in
artificial perception. She/he should have a broad robotics culture allowing her/him to understand the
problems related to the design of autonomous systems using embedded visual sensors.
Requested profile: Master's degree (BAC+5) in computer science, data science, AI, robotics, or
automation & control
Required technical skills: C/C++, Python, Linux

Contact:
martyna.poreba@cea.fr
Michal.SZCZEPANSKI@cea.fr
samia.bouchafa@ibisc.univ-evry.fr
