Make-A-Video: Text-To-Video Generation

1 ABSTRACT.
Make-A-Video is a new AI technology that enables individuals to convert text prompts into brief, high-quality video clips. Make-A-Video advances recent developments in Meta AI's research on generative technologies. A multimodal generative AI technique allows users more control over the AI-generated material they create. Make-A-Video is the follow-up to that announcement: with Make-A-Scene, Meta showed users how to use words, lines of text, and freeform sketches to produce lifelike graphics and artwork fit for picture books.

From paired text-image data, the model learns what the world looks like and how it is described (e.g., an elephant kicking a football), as done in image-based action recognition systems. Moreover, even without text descriptions, unsupervised videos are sufficient to learn how different entities in the world move and interact (e.g., the motion of waves at the beach). To enhance the visual quality, we train spatial super-resolution models as well as frame interpolation models. This increases the resolution of the generated videos and enables a higher (controllable) frame rate.

2 INTRODUCTION.
Make-A-Video consists of three main components:

(i) a base T2I model trained on text-image pairs,

(ii) spatiotemporal convolution and attention layers that extend the networks’ building blocks to the temporal dimension, and

(iii) spatiotemporal networks that consist of both spatiotemporal layers and another crucial element needed for T2V generation: a frame interpolation network for high-frame-rate generation.

Make-A-Video differs from previous works in several aspects. First, the architecture breaks the dependency on text-video pairs for T2V generation. This is a significant advantage over prior work, which is restricted to narrow domains or requires large-scale paired text-video data. Second, it fine-tunes the T2I model for video generation, gaining the advantage of adapting the model weights effectively. Third, motivated by prior work on efficient architectures for video and 3D vision tasks, the use of pseudo-3D convolution and temporal attention layers not only better leverages a T2I architecture, it also allows for better temporal information fusion compared to VDM.
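The pseudo-3D factorization mentioned above can be sketched in a few lines. The minimal NumPy version below is an illustration, not the paper's implementation: it applies a 2D spatial filter to each frame independently and then a 1D temporal filter at each spatial location, which is the (2+1)D decomposition that makes inflating a 2D image network to video cheap. The kernel values, tensor sizes, and function names are assumptions chosen for clarity.

```python
import numpy as np

def filter2d_same(frame, k):
    """Naive 'same'-padded 2D filtering of one frame with kernel k."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(frame, ((ph, ph), (pw, pw)))
    out = np.zeros_like(frame, dtype=float)
    H, W = frame.shape
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * k)
    return out

def pseudo3d_conv(video, spatial_k, temporal_k):
    """Factorized (2+1)D filtering: a spatial pass per frame,
    then a temporal pass per pixel -- cheaper than a full 3D conv."""
    # Spatial pass: treat each frame independently (pure 2D, as in a T2I model).
    spatial = np.stack([filter2d_same(f, spatial_k) for f in video])
    # Temporal pass: 1D 'same' filtering along the frame axis.
    T = video.shape[0]
    pt = len(temporal_k) // 2
    padded = np.pad(spatial, ((pt, pt), (0, 0), (0, 0)))
    out = np.zeros_like(spatial)
    for t in range(T):
        out[t] = np.tensordot(temporal_k, padded[t:t + len(temporal_k)], axes=(0, 0))
    return out

video = np.random.rand(16, 8, 8)              # (frames, height, width), toy sizes
identity_2d = np.zeros((3, 3)); identity_2d[1, 1] = 1.0
identity_1d = np.array([0.0, 1.0, 0.0])
out = pseudo3d_conv(video, identity_2d, identity_1d)
print(out.shape)                              # (16, 8, 8)
```

With identity kernels the output equals the input, which mirrors why this factorization is attractive for fine-tuning: the temporal layers can start as (near-)identity, so the pretrained spatial T2I weights are preserved while temporal behavior is learned.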
Make-A-Video’s final T2V scheme (depicted in the figure below) can be formulated as:

ŷt = SRh ∘ SRtl ∘ ↑F ∘ Dt ∘ P ∘ (x̂, Cx(x))

where ŷt is the generated video, SRh and SRtl are the spatial and spatiotemporal super-resolution networks, ↑F is a frame interpolation network, Dt is the spatiotemporal decoder, P is the prior, x̂ is the BPE-encoded text, Cx is the CLIP text encoder, and x is the input text. Given input text x translated by the prior P into an image embedding, and a desired frame rate fps, the decoder Dt generates 16 frames at 64 × 64 resolution, which are then interpolated to a higher frame rate by ↑F, and increased in resolution to 256 × 256 by SRtl and to 768 × 768 by SRh, resulting in a high-spatiotemporal-resolution generated video ŷ. The three main components are described in detail in the following sections.
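The composition of stages can be made concrete with a shape-only sketch. Every function below is a stand-in (the interpolation factor and the embedding size are assumptions for illustration; the real networks are learned models), but the tensor shapes follow the scheme described: 16 frames at 64 × 64, frame interpolation by ↑F, then 256 × 256 and 768 × 768 super-resolution.

```python
import numpy as np

def decoder_Dt(image_embedding, n_frames=16, size=64):
    """Stand-in for the spatiotemporal decoder Dt: emits 16 low-res RGB frames."""
    return np.zeros((n_frames, size, size, 3))

def interpolate_F(video, factor=5):
    """Stand-in for the frame interpolation network ↑F.
    (The real network generates in-between frames; here we just repeat.)"""
    return np.repeat(video, factor, axis=0)

def upsample(video, size):
    """Stand-in for SRtl / SRh: nearest-neighbour spatial upsampling."""
    scale = size // video.shape[1]
    return video.repeat(scale, axis=1).repeat(scale, axis=2)

embedding = np.zeros(768)       # P(x̂, Cx(x)): prior output (size assumed)
v = decoder_Dt(embedding)       # Dt:   (16, 64, 64, 3)
v = interpolate_F(v)            # ↑F:   (80, 64, 64, 3) with the assumed factor
v = upsample(v, 256)            # SRtl: (80, 256, 256, 3)
v = upsample(v, 768)            # SRh:  (80, 768, 768, 3)
print(v.shape)
```

Reading the pipeline bottom-up reproduces the composed formula: each stage consumes the previous stage's output, so only the first stage ever sees the text embedding.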
6 CONCLUSION.

Learning from the world around us is one of the greatest strengths of human intelligence. Just as we quickly learn to recognize people, places, things, and actions through observation, generative systems will be more creative and useful if they can mimic the way humans learn. Learning world dynamics from orders of magnitude more videos using unsupervised learning helps researchers break away from the reliance on labeled data.

As a next step, the plan is to address several of the technical limitations. As discussed earlier, the approach cannot learn associations between text and phenomena that can only be inferred from videos. How to incorporate these (e.g., generating a video of a person waving their hand left-to-right or right-to-left), along with generating longer videos with multiple scenes and events depicting more detailed stories, is left for future work.

AI image generation is already extremely powerful and can create amazing works of art or just funny memes, but with AI video generation, more creative possibilities open up for users. Perhaps one day we will be able to create an entire movie with only text descriptions and advanced artificial intelligence. Until then, Meta's Make-A-Video technology is an important step toward that future.