
To read more such articles, please visit our blog https://socialviews81.blogspot.


Meta AI’s Chameleon: A Revolutionary Leap in Mixed-Modal



Artificial intelligence is advancing rapidly, and mixed-modal models are
at the forefront of that progress. These models are trained to process
and integrate information from different sources, such as text, images,
and sound, and so extend what AI systems can do. The path to maturity
has not been smooth, however: existing mixed-modal models often struggle
with efficient integration of modalities, data processing, and
generalization.

Enter Chameleon, an exciting development aimed at overcoming these
challenges and moving AI forward. Built by the Chameleon Team at Meta AI
(formerly Facebook AI Research, FAIR), Chameleon is a notable advance in
understanding and generating mixed-modal content. The team's goal was to
make Chameleon handle text and image processing seamlessly. The result
is a more flexible and effective mixed-modal model that can understand
and generate several data types in arbitrary order. This capability sets
Chameleon apart from traditional models that process each modality
separately, which limits their ability to integrate information across
modalities.

What is Chameleon?

Chameleon is a state-of-the-art mixed-modal model built on early fusion
of all input modalities, which lets it mix different types of input data
seamlessly. The design aims for a more cohesive AI system that can
understand and generate images and text in any arbitrary sequence.

Model Variants

The Chameleon model has two main variants: Chameleon 7B and Chameleon
34B. Both have been released under a research-only license. They support
mixed-modal inputs and text-only outputs, and they are safety-tuned for
responsible use in research. The image generation capability is not
being released at this time. These two are the most recent versions of
the Chameleon model family; further development may bring new variants
in the future.

Key Features of Chameleon

Several important features set Chameleon apart:

● Early Fusion: The model integrates text and image processing
from the very beginning, so both modalities share one
representation.
● Token-Based Representation: Both text and images are
represented as tokens, so the model treats image content the same
way it treats text.
● Transformer Architecture: A single transformer handles both text
and image tokens.
● Training Stability: Training remains stable even at the largest
parameter counts.
● High Performance: The model performs well on complex tasks,
including visual question answering, text generation, and image
captioning and generation.
Capabilities and Use Cases of Chameleon

The following real-world use cases reflect the breadth and generality of
Chameleon's capabilities:

● Virtual Assistant Augmentation: The ability to understand and
process multimodal queries greatly extends what virtual assistants
can be applied to.
● Better Content Recommendation: Understanding both textual and
visual cues helps Chameleon make content recommendations more
accurate.
● Image Captioning: Chameleon has shown state-of-the-art
performance in generating image captions.
● Text Generation: Performance on text-only tasks is also strong,
often outperforming Llama-2, though it remains less capable than
Mixtral 8x7B and Gemini-Pro.
● Image Generation: Chameleon can also perform non-trivial image
generation, all within a single model.

How Does Chameleon Work? Architecture and Design

Chameleon is designed from the ground up to handle multiple modalities,
including images, text, and code. Its core design principle is a fully
token-based representation for both images and text. Images are
quantized into discrete tokens, analogous to the way words are
represented in text, so Chameleon can apply the same transformer
architecture to sequences of both image and text tokens.
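To make the token-based idea concrete, here is a minimal Python sketch. It is not Chameleon's actual tokenizer: the tiny vocabulary, the four-entry codebook, the sentinel tokens `<img_start>` and `<img_end>`, and every function name are assumptions made for illustration. Each image patch embedding is mapped to the index of its nearest codebook entry, that index is shifted past the text ids, and the result is interleaved with text tokens in one flat sequence that a single transformer could consume.

```python
# Toy sketch of a shared token space for text and images.
# All sizes and names are illustrative assumptions, not values from Chameleon.

TEXT_VOCAB = {"<bos>": 0, "<img_start>": 1, "<img_end>": 2, "a": 3, "red": 4, "square": 5}
IMAGE_TOKEN_OFFSET = len(TEXT_VOCAB)  # image token ids start after the text ids

# Hypothetical codebook: four 2-D embeddings that image patches are snapped to.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]


def quantize_patch(patch):
    """Return the index of the codebook entry nearest to a patch embedding."""
    def squared_distance(entry):
        return sum((p - e) ** 2 for p, e in zip(patch, entry))

    return min(range(len(CODEBOOK)), key=lambda i: squared_distance(CODEBOOK[i]))


def tokenize_text(text):
    """Map whitespace-separated words to text token ids."""
    return [TEXT_VOCAB[word] for word in text.split()]


def tokenize_image(patches):
    """Quantize each patch and wrap the image tokens in sentinel tokens."""
    ids = [IMAGE_TOKEN_OFFSET + quantize_patch(patch) for patch in patches]
    return [TEXT_VOCAB["<img_start>"], *ids, TEXT_VOCAB["<img_end>"]]


def build_sequence(segments):
    """Flatten interleaved text and image segments into one token sequence."""
    sequence = [TEXT_VOCAB["<bos>"]]
    for kind, payload in segments:
        if kind == "text":
            sequence.extend(tokenize_text(payload))
        elif kind == "image":
            sequence.extend(tokenize_image(payload))
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return sequence


document = [
    ("text", "a red square"),
    ("image", [(0.9, 0.1), (0.2, 0.8), (0.1, 0.1)]),
]
tokens = build_sequence(document)
print(tokens)  # [0, 3, 4, 5, 1, 7, 8, 6, 2]
```

Because text and image ids live in one id space, the same embedding table and the same attention layers can process either kind of token, which is the property the article calls early fusion.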

This design enables an early-fusion approach, in which all modalities
are projected into a shared representational space from the start. The
shared space allows seamless reasoning and generation across modalities.
However, the approach also presents significant technical challenges,
particularly in optimization stability and scaling.

To address these challenges, Chameleon combines architectural
innovations with training techniques. For instance, it modifies the
transformer architecture with query-key normalization and a revised
placement of layer norms. These modifications are crucial for stable
training in the mixed-modal setting.
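The article does not spell out how query-key normalization works, so the following is only a sketch of one common reading: apply layer normalization to the query and key vectors before taking their dot product. The function names are hypothetical, and the learned scale and shift that a real layer norm would carry are omitted. The point it illustrates is that normalized queries and keys produce attention logits that stay bounded no matter how large the raw activations grow, which is one plausible reason the technique helps training stability.

```python
import math


def layer_norm(vector, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]


def attention_logits(queries, keys, qk_norm):
    """Scaled dot-product logits, optionally normalizing queries and keys first."""
    if qk_norm:
        queries = [layer_norm(q) for q in queries]
        keys = [layer_norm(k) for k in keys]
    scale = 1.0 / math.sqrt(len(queries[0]))
    return [[scale * sum(a * b for a, b in zip(q, k)) for k in keys] for q in queries]


queries = [[0.5, -1.0, 2.0, 0.1]]
keys = [[1.0, 0.3, -0.7, 0.2], [-0.4, 0.9, 1.5, -1.1]]

# Simulate activations whose magnitude has drifted upward by a factor of 100.
big_queries = [[100.0 * x for x in q] for q in queries]
big_keys = [[100.0 * x for x in k] for k in keys]

plain_small = attention_logits(queries, keys, qk_norm=False)
plain_big = attention_logits(big_queries, big_keys, qk_norm=False)
normed_small = attention_logits(queries, keys, qk_norm=True)
normed_big = attention_logits(big_queries, big_keys, qk_norm=True)

print("without QK-norm:", plain_small, "->", plain_big)
print("with QK-norm:   ", normed_small, "->", normed_big)
```

Without normalization the logits grow with the square of the activation scale; with it they are essentially unchanged, and their magnitude cannot exceed the square root of the vector dimension.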

Furthermore, Chameleon adapts the supervised fine-tuning approaches used
for text-only large language models (LLMs) to the mixed-modal setting,
which enables strong alignment at scale. Using these techniques,
Chameleon-34B was successfully trained on 5x as many tokens as Llama-2,
enabling new mixed-modal applications while still matching or even
outperforming existing LLMs on unimodal benchmarks.


As shown in the figure above, Chameleon represents all modalities
(images, text, and code) as discrete tokens and uses a uniform
transformer-based architecture that is trained from scratch, end to end,
on approximately 10 trillion tokens of interleaved mixed-modal data. As
a result, Chameleon can both reason over and generate arbitrary
mixed-modal documents.

Performance Evaluation with Other Models

Chameleon has been compared against OpenAI's GPT-4V and Google's Gemini
Pro. On task fulfillment, Chameleon scored best: 55.2% of its responses
fully fulfilled the task, compared with 37.6% for Gemini+ and 44.7% for
GPT-4V+. (Gemini+ and GPT-4V+ are the baseline models with their text
responses augmented by separately generated images.) These results
suggest that Chameleon understands and responds to such prompts
considerably better.
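As a quick check on the size of that gap, the short Python snippet below computes Chameleon's lead over each baseline in percentage points. It uses only the three task-fulfillment rates quoted above; the variable names are illustrative.

```python
# Task-fulfillment rates quoted in the text, in percent.
fulfillment = {"Chameleon": 55.2, "Gemini+": 37.6, "GPT-4V+": 44.7}

chameleon_rate = fulfillment["Chameleon"]

# Lead of Chameleon over each baseline, in percentage points.
margins = {
    name: round(chameleon_rate - rate, 1)
    for name, rate in fulfillment.items()
    if name != "Chameleon"
}
print(margins)  # {'Gemini+': 17.6, 'GPT-4V+': 10.5}
```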


In pairwise comparisons, Chameleon's responses were preferred over those
of Gemini+ in 41.5% of cases and over those of GPT-4V+ in 35.8% of
cases. Compared directly with the original responses from Gemini and
GPT-4V, Chameleon's responses were judged better in 53.5% and 46.0% of
cases, respectively. These results show that Chameleon performs strongly
among current AI models.


How to Access and Use this Model?

The Chameleon models, Chameleon 7B and 34B, are publicly available under
a research-only license. They can be accessed through the Facebook
Research GitHub repository, which includes instructions for running them
locally. For more details about the model, see the sources listed at the
end of this article in the 'source' section.


Limitations

1. Evaluation Limitations: The prompts used to evaluate Chameleon
came from crowdsourcing rather than from real users interacting
with the model. This, along with the size of the dataset, may limit
how well the evaluation generalizes.
2. Task Omissions: Because the prompts focus on mixed-modal output,
some visual understanding tasks, such as Optical Character
Recognition (OCR) or understanding infographics, are naturally
left out.
3. Comparison with Other Models: Currently, the APIs of existing
multimodal large language models (LLMs) return only text
responses. Although the baselines were strengthened by augmenting
their output with separately generated images, it would be
preferable to compare Chameleon with other native mixed-modal
models.


Conclusion

Chameleon's ability to understand and generate both images and text in
any arbitrary sequence sets it apart from traditional models and makes
it a promising tool for advancing AI. Its development not only addresses
current challenges but also opens new avenues for exploration and
application. As AI continues to advance, models like Chameleon are
likely to play a pivotal role in shaping its future.

Source

Research paper:
Research document:
GitHub repo:

Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an
advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are
encouraged to conduct their own research and due diligence.
