
‘MAKE-A-VIDEO’: TEXT-TO-VIDEO GENERATION

Fozan Mohammed Azhar


CS’23
Bachelor of Engineering in Computer Science
fozanazhar28@gmail.com

1 ABSTRACT.

Make-A-Video is a new AI technology that enables individuals to convert text prompts into brief, high-quality video clips. It advances recent developments in Meta AI's research on generative technologies, where multimodal generative AI techniques give users more control over the material they create. With Make-A-Scene, Meta showed how words, lines of text, and freeform sketches can be used to produce lifelike graphics and artwork fit for picture books; Make-A-Video is the follow-up to that work.

2 INTRODUCTION.

The AI lab OpenAI has made its latest text-to-image AI system, DALL-E, available to everyone, and the AI start-up Stability.AI launched Stable Diffusion, an open-source text-to-image system.

But text-to-video AI comes with some even greater challenges. For one, these models need a vast amount of computing power. They are an even bigger computational lift than large text-to-image AI models, which use millions of images to train, because putting together just one short video requires hundreds of images. That means it is really only large tech companies that can afford to build these systems for the foreseeable future. They are also trickier to train, because there are no large-scale data sets of high-quality videos paired with text.

To work around this, Meta combined data from three open-source image and video data sets to train its model. Standard text-image data sets of labelled still images helped the AI learn what objects are called and what they look like, and a database of videos helped it learn how those objects are supposed to move in the world. The combination of the two approaches helped Make-A-Video generate videos from text at scale. Make-A-Video leverages T2I models to learn the correspondence between text and the visual world, and uses unsupervised learning on unlabelled (unpaired) video data to learn realistic motion. Clearly, text describing images does not capture the entirety of phenomena observed in videos. That said, one can often infer actions and events from static images (e.g., a woman drinking coffee, or an elephant kicking a football), as done in image-based action recognition systems. Moreover, even without text descriptions, unsupervised videos are sufficient to learn how different entities in the world move and interact (e.g., the motion of waves at the beach). To enhance the visual quality, spatial super-resolution models and frame interpolation models are also trained. This increases the resolution of the generated videos and enables a higher (controllable) frame rate.

3 CONTRIBUTIONS.

The major contributions of Make-A-Video are:

• Make-A-Video – an effective method that extends a diffusion-based T2I model to T2V through a spatiotemporally factorized diffusion model.

• Leveraging joint text-image priors to bypass the need for paired text-video data, which in turn allows the approach to potentially scale to larger quantities of video data.

• Presenting super-resolution strategies in space and time that, for the first time, generate high-definition, high frame-rate videos given a user-provided textual input (a toy illustration of what these operations do to a video tensor follows this list).

• Evaluating Make-A-Video against existing T2V systems and presenting (a) state-of-the-art results in quantitative as well as qualitative measures, and (b) a more thorough evaluation than the existing T2V literature, together with a collected test set of 300 prompts for zero-shot T2V human evaluation that the authors plan to release.
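To make the super-resolution and frame-interpolation point concrete, the toy PyTorch snippet below shows what raising the frame rate and the resolution does to the shape of a video tensor. It is purely illustrative: trilinear interpolation and an arbitrary 4x frame factor stand in for the learned frame-interpolation and super-resolution networks, and the 16-frame, 64 × 64 starting point follows the pipeline described in Section 5.

```python
import torch
import torch.nn.functional as F

# A dummy generated clip: (batch, channels, frames, height, width).
# 16 frames at 64x64 matches the low-resolution decoder output described in Section 5.
clip = torch.randn(1, 3, 16, 64, 64)

# Stand-in for the frame interpolation network: raise the frame count 4x (arbitrary factor).
# The real model predicts the in-between frames; trilinear interpolation just blends them.
more_frames = F.interpolate(clip, scale_factor=(4, 1, 1), mode="trilinear")

# Stand-in for spatial super-resolution: 64x64 -> 256x256, then 256x256 -> 768x768.
sr_low = F.interpolate(more_frames, scale_factor=(1, 4, 4), mode="trilinear")
sr_high = F.interpolate(sr_low, size=(sr_low.shape[2], 768, 768), mode="trilinear")

print(clip.shape)         # torch.Size([1, 3, 16, 64, 64])
print(more_frames.shape)  # torch.Size([1, 3, 64, 64, 64])
print(sr_high.shape)      # torch.Size([1, 3, 64, 768, 768])
```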
4 METHOD.

Make-A-Video consists of three main components:

(i) a base T2I model trained on text-image pairs,

(ii) spatiotemporal convolution and attention layers that extend the networks' building blocks to the temporal dimension, and

(iii) spatiotemporal networks that consist of both spatiotemporal layers, as well as another crucial element needed for T2V generation: a frame interpolation network for high frame rate generation.

Make-A-Video's final T2V scheme (depicted in the figure below) can be formulated as

ŷt = SRh ◦ SRtl ◦ ↑F ◦ Dt ◦ P ◦ (x̂, Cx(x)),

where ŷt is the generated video, SRh and SRtl are the spatial and spatiotemporal super-resolution networks, ↑F is a frame interpolation network, Dt is the spatiotemporal decoder, P is the prior, x̂ is the BPE-encoded text, Cx is the CLIP text encoder, and x is the input text. The three main components are described in detail in the following sections.
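As a reading aid, the sketch below spells out the composition above with stub components in Python. The function names and tensor shapes are placeholders chosen for illustration, not Make-A-Video's actual code; only the order of operations mirrors the formula.

```python
import torch

# Stub stand-ins for Make-A-Video's trained components. Names and shapes are
# illustrative; the real components are large diffusion/transformer networks.

def bpe_encode(text: str) -> torch.Tensor:          # x_hat: BPE token ids
    return torch.randint(0, 50000, (1, 77))

def clip_text_encode(text: str) -> torch.Tensor:    # Cx(x): CLIP text embedding
    return torch.randn(1, 768)

def prior(x_hat, text_emb) -> torch.Tensor:         # P: text -> image embedding
    return torch.randn(1, 768)

def decoder(image_emb, fps: int) -> torch.Tensor:   # Dt: 16 frames at 64x64
    return torch.randn(1, 3, 16, 64, 64)

def frame_interp(video) -> torch.Tensor:            # ↑F: raise the frame rate
    b, c, t, h, w = video.shape
    return torch.randn(b, c, 4 * t, h, w)

def sr_tl(video) -> torch.Tensor:                   # SRtl: 64x64 -> 256x256
    b, c, t, h, w = video.shape
    return torch.randn(b, c, t, 256, 256)

def sr_h(video) -> torch.Tensor:                    # SRh: 256x256 -> 768x768
    b, c, t, h, w = video.shape
    return torch.randn(b, c, t, 768, 768)

# y_hat = SRh ∘ SRtl ∘ ↑F ∘ Dt ∘ P ∘ (x_hat, Cx(x))
x = "an elephant kicking a football"
y_hat = sr_h(sr_tl(frame_interp(decoder(prior(bpe_encode(x), clip_text_encode(x)), fps=24))))
print(y_hat.shape)  # torch.Size([1, 3, 64, 768, 768])
```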
5 ARCHITECTURE.

Make-A-Video differs from previous works in several aspects. First, the architecture breaks the dependency on text-video pairs for T2V generation. This is a significant advantage compared to prior work, which has to be restricted to narrow domains or requires large-scale paired text-video data. Second, it fine-tunes the T2I model for video generation, gaining the advantage of adapting the model weights effectively. Third, motivated by prior work on efficient architectures for video and 3D vision tasks, the use of pseudo-3D convolution and temporal attention layers not only better leverages the T2I architecture, it also allows for better temporal information fusion compared to VDM.

Given input text x translated by the prior P into an image embedding, and a desired frame rate fps, the decoder Dt generates 16 frames at 64 × 64 resolution, which are then interpolated to a higher frame rate by ↑F and increased in resolution to 256 × 256 by SRtl and to 768 × 768 by SRh, resulting in a high-spatiotemporal-resolution generated video ŷ.
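To ground the pseudo-3D idea, the following is a minimal sketch of a factorized spatiotemporal convolution: a 2D convolution applied to every frame, followed by a 1D convolution across frames. It is a simplification for illustration, not Meta's implementation; temporal attention layers factorize space and time in the same spirit, attending across frames at each spatial position.

```python
import torch
import torch.nn as nn

class PseudoConv3d(nn.Module):
    """Simplified factorized ('pseudo-3D') convolution: a per-frame 2D spatial
    convolution followed by a 1D convolution along the time axis. This keeps
    pretrained 2D image weights usable while adding a temporal pathway."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.spatial = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.temporal = nn.Conv1d(channels, channels, kernel_size, padding=pad)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, channels, frames, height, width)
        b, c, t, h, w = video.shape

        # Spatial conv on each frame independently: fold frames into the batch.
        x = video.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        x = self.spatial(x)
        x = x.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)

        # Temporal conv along the frame axis: fold spatial positions into the batch.
        x = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, t)
        x = self.temporal(x)
        x = x.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)
        return x


layer = PseudoConv3d(channels=8)
out = layer(torch.randn(1, 8, 16, 64, 64))
print(out.shape)  # torch.Size([1, 8, 16, 64, 64])
```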
6 CONCLUSION.

Learning from the world around us is one of the greatest strengths of human intelligence. Just as we quickly learn to recognize people, places, things, and actions through observation, generative systems will be more creative and useful if they can mimic the way humans learn. Learning world dynamics from orders of magnitude more videos using unsupervised learning helps researchers break away from the reliance on labeled data.

As a next step, the plan is to address several of the technical limitations. As discussed earlier, the approach cannot learn associations between text and phenomena that can only be inferred from videos. How to incorporate these (e.g., generating a video of a person waving their hand left-to-right or right-to-left), along with generating longer videos with multiple scenes and events depicting more detailed stories, is left for future work.

AI image generation is already extremely powerful and can create amazing works of art or just funny memes, but AI video generation opens up even more creative possibilities for users. Perhaps one day we will be able to create an entire movie with only text descriptions and advanced artificial intelligence. Until then, Meta's Make-A-Video technology is an important step toward that future.
