
Mora: Enabling Generalist Video Generation via a Multi-Agent Framework

Zhengqing Yuan1∗  Ruoxi Chen1∗  Zhaoxu Li1∗  Haolong Jia1∗
Lifang He1  Chi Wang2  Lichao Sun1†

1 Lehigh University   2 Microsoft Research

arXiv:2403.13248v2 [cs.CV] 22 Mar 2024

Abstract
Sora is the first large-scale generalist video generation model that garnered signif-
icant attention across society. Since its launch by OpenAI in February 2024, no
other video generation models have paralleled Sora’s performance or its capacity to
support a broad spectrum of video generation tasks. Additionally, there are only a
few fully published video generation models, with the majority being closed-source.
To address this gap, this paper proposes a new multi-agent framework Mora, which
incorporates several advanced visual AI agents to replicate generalist video genera-
tion demonstrated by Sora. In particular, Mora can utilize multiple visual agents
and successfully mimic Sora’s video generation capabilities in various tasks, such
as (1) text-to-video generation, (2) text-conditional image-to-video generation,
(3) extending generated videos, (4) video-to-video editing, (5) connecting videos, and
(6) simulating digital worlds. Our extensive experimental results show that Mora
achieves performance that is proximate to that of Sora in various tasks. However,
there exists an obvious performance gap between our work and Sora when assessed
holistically. In summary, we hope this project can guide the future trajectory of
video generation through collaborative AI agents.

A slender white rocket in the midst of launch, with a fiery exhaust trail, against a clear sky over a desolate desert landscape.

A vibrant coral reef teeming with life under the crystal-clear blue ocean, with colorful fish swimming among the coral, rays of sunlight
filtering through the water, and a gentle current moving the sea plants.

A sleek, electric skateboard zips through an urban park, its rider skillfully navigating between obstacles. The motion blur of the
surrounding trees and park benches accentuates the board's swift movement. The rider's hair flows backward, and their focused expression
captures the exhilaration of gliding effortlessly over the pavement, powered by silent electric energy.

Figure 1: Samples for text-to-video generation of Mora. Our approach can generate high-resolution,
temporally consistent videos from text prompts. The samples shown are at 1024×576 resolution, 12 seconds in duration, with 75 frames in total.
* Zhengqing, Ruoxi, Zhaoxu, and Haolong are visiting students in the LAIR lab at Lehigh University. The
GitHub link is https://github.com/lichao-sun/Mora

† Lichao Sun is co-corresponding author: lis221@lehigh.edu

Preprint. Under review.


1 Introduction
Since the release of ChatGPT in November 2022, the advent of generative AI technologies has
marked a significant transformation, reshaping interactions and integrating deeply into various facets
of daily life and industry [1, 2]. In the realm of visual AI, the focus has predominantly been on
image generation with models such as Midjourney [3], Stable Diffusion [4], and DALL-E 3 [5]
leading the way. In contrast, video generation remains comparatively underexplored, especially when
measured against the progress in language and image synthesis. Although recent notable video
generation models such as Pika [6] and Gen-2 [7] have demonstrated the capacity to produce diverse
and high-quality videos, they remain limited in their ability to create videos exceeding 10 seconds
in duration. This landscape has seen a revolutionary shift with the introduction of OpenAI's
Sora [8], marking a new era in video generation capabilities.
Sora, a pioneering text-to-video generative model introduced by OpenAI in February 2024, distin-
guishes itself by its ability to convert text prompts into detailed videos, demonstrating remarkable
potential in replicating physical world dynamics. This model transcends its predecessors by generat-
ing videos up to a minute long, closely aligning with the provided text descriptions [8]. Beyond mere
text-to-video generation, Sora excels in a range of video tasks as a generalist model, including editing,
connecting, and extending footage in ways previously unachieved. In addition, the generated content
is known for its multi-view perspectives and fidelity to user instructions, positioning it uniquely
amongst video generation models. As we look to the future, the implications of Sora and similar
advanced video generation technologies are poised to make profound contributions across various
sectors, including but not limited to filmmaking [9, 10], robotics [11, 12], and healthcare [13].
Despite its innovative contributions, Sora’s closed-source nature, similar to most video generation
models, presents significant challenges for the academic community. The inaccessibility hinders
researchers’ ability to replicate or extend Sora’s capabilities. There is a growing trend of attempts to
reverse-engineer Sora, with some studies, such as [14], proposing potential techniques that might be
employed within Sora, including diffusion transformers and spatial patch strategies [15–18]. Despite
these efforts, achieving the same level of performance and adaptability as Sora proves to be an
immense challenge. The lack of access to comparable computational power and extensive training
datasets further complicates these endeavors.

Table 1: Task comparison between Sora, Mora and other existing models.

Task                                         Sora  Mora  Others
Text-to-video generation                      ✓     ✓    [19–22]
Text-conditional image-to-video generation    ✓     ✓    [23, 6, 7]
Extend generated videos                       ✓     ✓    -
Video-to-video editing                        ✓     ✓    [24–26]
Connect videos                                ✓     ✓    [27]
Simulate digital worlds                       ✓     ✓    -

To address the limitations of current video generation models, we explore the potential of multi-agent
collaboration [28, 29] in accomplishing generalist video generation tasks. We introduce a multi-agent
framework, referred to as Mora, that leverages various advanced large models to enable text-to-video
capabilities similar to Sora. Specifically, we decompose video generation into several subtasks,
with each subtask assigned to a dedicated agent: (1) enhancing prompts provided by the user, (2)
generating an image from an input text prompt, (3) editing or refining images based on the enhanced
conditioning provided by the text, (4) generating a video from the generated image, and (5) connecting
two videos. By automatically organizing agents to loop and permute through these subtasks, Mora
can complete a wide array of video generation tasks through a flexible pipeline, thereby meeting the
diverse needs of users.
Intuitively, equipping the model with both a starting image and text simplifies the video generation
process, as it primarily needs to extrapolate the future progression of the image. This method
stands in contrast to direct end-to-end text-to-video approaches [30–32, 20, 21]. Our multi-agent
collaboration framework distinctively produces an intermediate image or video during inference,
enabling the preservation of the visual diversity, style, and quality inherent in the text-to-image
model (see examples in Figure 1). This process even facilitates editing capabilities. By effectively
coordinating the efforts of text-to-image, image-to-image, image-to-video and video-to-video agents,
Mora can adeptly conduct a broad spectrum of video generation tasks while offering superior editing
flexibility and visual fidelity, rivaling the performance of established models like Sora. A detailed
comparison of tasks between Mora and Sora is presented in Table 1. It demonstrates that, through
the collaboration of multiple agents, Mora can accomplish the video-related tasks that Sora
undertakes, highlighting Mora's adaptability and proficiency in addressing a wide range of video
generation challenges.
To comprehensively assess the efficacy of Mora, we use basic metrics from the publicly available
video generation benchmark VBench [33] together with self-defined metrics across six tasks: text-to-video
generation, text-conditional image-to-video generation, extending generated videos, video-to-video
editing, connecting videos, and simulating digital worlds. Notably, Mora outperforms existing
open-source models on the text-to-video generation task, ranking second only to Sora. In the other
tasks, Mora also delivers competitive results, underscoring the versatility and
general capabilities of our framework.
We summarize our contributions as follows:
• In this paper, we introduce Mora, a groundbreaking meta-programming framework crafted to
enhance multi-agent collaboration. This framework stands out for its structured yet adaptable
system of agents, paired with an intuitive interface for the configuration of components and task
pipelines. These features position Mora as a prime instrument for pushing forward the boundaries
of generalist video generation tasks.
• Our research reveals that the quality of video generation can be notably improved by leveraging
the automated cooperation of multiple agents, including text-to-image, image-to-image, image-to-
video, and video-to-video agents. This collaborative process starts with generating an image from
text, followed by using both the generated image and the input text to produce a video. The process
concludes with further refinement, extension, connection, and editing of the video.
• Mora stands out for its exceptional performance across six video-related tasks, surpassing ex-
isting open-sourced models. This impressive achievement underlines the effectiveness of Mora,
showcasing its potential as a versatile framework tailored for generalist video generation. The com-
prehensive results not only affirm the capabilities of Mora but also position it as a groundbreaking
tool in the realm of video generation, promising significant advancements in how video content is
created and utilized.

2 Related Work

2.1 Text-to-Video Generation

Generating videos from textual descriptions has long been studied. While early efforts
in the field were primarily rooted in GANs [34, 35] and VQ-VAE [36], recent breakthroughs in
generative video models, driven by foundational work on transformer-based architectures and diffusion
models, have advanced academic research. Auto-regressive transformers were leveraged early on for
video generation [37–39]. These models generate video sequences in a frame-by-frame manner,
predicting each new frame from the previously generated frames. In parallel,
the adaptation of masked language models [40] for visual contexts, as demonstrated by [41–44],
underscores the versatility of transformers in video generation. The recently-proposed VideoPoet [39]
leverages an auto-regressive language model and can multitask on a variety of video-centric inputs
and outputs.
In another line, large-scale diffusion models [15, 16] show competitive performance in video genera-
tion [31, 45–48]. By learning to gradually denoise a sample from a normal distribution, diffusion
models [15, 16] implement an iterative refinement process for video synthesis. Initially developed
for image generation [49, 50], they have been adapted and extended to handle the complexities of
video data. This adaptation began by extending image generation principles to video [51, 31, 45],
using a 3D U-Net structure instead of the conventional image diffusion U-Net. Subsequently,
latent diffusion models (LDMs) [4] were integrated into video generation [52, 32, 20, 21], showcasing
enhanced capabilities to capture the nuanced dynamics of video content. For instance, Stable Video
Diffusion [23] can conduct multi-view synthesis from a single image while Emu Video [19] uses just
two diffusion models to generate higher-resolution videos.
Researchers have delved into the potential of diffusion models for a variety of video manipulation
tasks. Notably, Dreamix [24] and MagicEdit [25] have been introduced for general video editing,
utilizing large-scale video-text datasets. Conversely, other models employ pre-trained models for
video editing tasks in a zero-shot manner [26, 53–55]. SEINE [27] is specifically designed for generative
transitions between scenes and for video prediction. While these above-mentioned end-to-end models
exhibit remarkable proficiency in specific areas, they encounter limitations in broadening their
capabilities to encompass a wider range of video tasks, particularly those requiring varied types of
inputs and the generation of longer-duration videos. The introduction of diffusion transformers [18,
17, 56] further revolutionized video generation, culminating in advanced solutions like Latte [22] and
Sora [8]. Sora’s ability to produce minute-long videos of high visual quality that faithfully follow
human instructions heralds a new era in video generation, promising unprecedented opportunities for
creativity and expression in digital media.

2.2 AI Agents

Large models have enabled agents to excel across a broad spectrum of applications, showcasing their
versatility and effectiveness. They have greatly advanced collaborative multi-agent structures for
multimodal tasks in areas such as scientific research [57], software development [28, 58] and society
simulation [59]. Compared to individual agents, the collaboration of multiple autonomous agents,
each equipped with unique strategies and behaviors and engaged in communication with one another,
can tackle more dynamic and complex tasks [60]. Through a cooperative agent framework known as
role-playing, CAMEL [61] enables agents to collaborate and solve complex tasks effectively. Park
et al. [59] designed a community of 25 generative agents capable of planning, communicating, and
forming connections. Liang et al. [62] have explored the use of multi-agent debates for translation
and arithmetic problems, encouraging divergent thinking in large language models. Hong et al. [28]
introduced MetaGPT, which utilizes an assembly line paradigm to assign diverse roles to various
agents. In this way, complex tasks can be broken down into subtasks that multiple cooperating
agents can complete more easily. Xu et al. [63] used a multi-agent collaboration strategy to
simulate the academic peer review process. AutoGen [29] is a generic programming framework
which can be used to implement diverse multi-agent applications across different domains, using a
variety of agents and conversation patterns.
Motivated by existing works, we extend the principle of collaborative agents to complete vision tasks.
By integrating and coordinating agents of various roles such as text-to-image and image-to-video
agents, we accomplish multiple video-related tasks in a modular and extensible manner.

3 Mora: A Multi-Agent Framework for Video Generation


As illustrated in Figure 2, Mora provides a framework that leverages advanced AI agents to realize
a range of video generation tasks. Sec 3.1 provides the agent definitions and example pipelines
enabled by this framework. Sec 3.2 introduces the implementation details of each agent.

3.1 Agent-based Video Generation

Definition and Specialization of Agents. The definition of agents enables flexibility in the
breakdown of complex work into smaller and more specific tasks. Solving different video generation
tasks often requires the collaboration of agents with diverse abilities, each contributing specialized
outputs. In our framework, we define five basic roles: the prompt selection and generation agent, the
text-to-image generation agent, the image-to-image generation agent, the image-to-video generation
agent, and the video-to-video agent.
• Prompt Selection and Generation Agent: Prior to the commencement of the initial image generation,
textual prompts undergo a rigorous processing and optimization phase. This agent can
employ large language models such as GPT-4 or Llama [64, 65]. It is designed to meticulously analyze the
text, extracting pivotal information and actions delineated within, thereby significantly enhancing
the relevance and quality of the resultant images. This step ensures that the textual descriptions are
thoroughly prepared for an efficient and effective translation into visual representations.

[Figure 2 content: Step 1, prompt enhancement (prompt selection agent, e.g., GPT, Llama, Bard);
Step 2, image generation (text-to-image agent, e.g., DALL-E 2, SD, Imagen); Step 3, image editing
(image-to-image agent, e.g., Emu-edit, SDXL, Imagic); Step 4, video generation (image-to-video
agent, e.g., SVD, Pika, Gen-2); Step 5, video extraction; Step 6, video connection (video transition
agent, e.g., SEINE). Task workflows: text-to-video generation, Step 1→2→4 or 1→2→3→4;
text-guided image-to-video generation, Step 3→4; extend generated videos, Step 5→4;
video-to-video editing, Step 5→3→4; connect videos, Step 6; simulate digital world, Step 1→2→4.]
Figure 2: Illustration of how to use Mora to conduct video-related tasks.

• Text-to-Image Generation Agent: The text-to-image model [49, 50] stands at the forefront of
translating these enriched textual descriptions into high-quality initial images. Its core functionality
revolves around a deep understanding and visualization of complex textual inputs, enabling it to
craft detailed and accurate visual counterparts to the provided textual descriptions.
• Image-to-Image Generation Agent: This agent [66] works to modify a given source image in
response to specific textual instructions. The core of its functionality lies in its ability to interpret
detailed textual prompts with high accuracy and subsequently apply these insights to manipulate
the source image accordingly. This involves a detailed recognition of the text’s intent, translating
these instructions into visual modifications that can range from subtle alterations to transformative
changes. The agent leverages a pre-trained model to bridge the gap between textual description and
visual representation, enabling seamless integration of new elements, adjustment of visual styles,
or alteration of compositional aspects within the image.
• Image-to-Video Generation Agent [23]: Following the creation of the initial image, the Video
Generation Model is responsible for transitioning the static frame into a vibrant video sequence.
This component delves into the analysis of both the content and style of the initial image, serving as
the foundation for generating subsequent frames. These frames are meticulously crafted to ensure
a seamless narrative flow, resulting in a coherent video that upholds temporal stability and visual
consistency throughout. This process highlights the model’s capability to not only understand and
replicate the initial image but also to anticipate and execute logical progressions in the scene.
• Video Connection Agent: Utilizing the Video-to-Video Agent, we create seamless transition videos
based on two input videos provided by users. This advanced agent selectively leverages key frames
from each input video to ensure a smooth and visually consistent transition between them. It is
designed with the capability to accurately identify the common elements and styles across the two
videos, thus ensuring a coherent and visually appealing output. This method not only improves the
seamless flow between different video segments but also retains the distinct styles of each segment.
Every agent is responsible for a specific input and output, and these results can be utilized for the
different designed tasks; a minimal sketch of this agent abstraction is shown below.
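The paper does not publish the concrete agent interface, but the division of labor above can be captured with a very small abstraction. The following is a minimal Python sketch under that assumption; the class names, the run method, and the placeholder llm/model clients are illustrative and not Mora's released code.

```python
# Minimal sketch of the agent abstraction; names and signatures are assumptions,
# not Mora's released API. Each agent maps a specific input to a specific output.
from dataclasses import dataclass
from typing import Any, Optional, Protocol


class Agent(Protocol):
    def run(self, *args: Any, **kwargs: Any) -> Any: ...


@dataclass
class PromptAgent:
    llm: Any  # e.g., a GPT-4 or Llama client (assumed to expose .complete)

    def run(self, prompt: str) -> str:
        # Enhance the user's prompt before image generation (Step 1).
        return self.llm.complete(f"Expand into a detailed visual description: {prompt}")


@dataclass
class TextToImageAgent:
    model: Any  # e.g., an SDXL pipeline (assumed callable)

    def run(self, prompt: str) -> Any:
        return self.model(prompt)  # initial frame (Step 2)


@dataclass
class ImageToVideoAgent:
    model: Any  # e.g., a Stable Video Diffusion pipeline (assumed callable)

    def run(self, image: Any, prompt: Optional[str] = None) -> Any:
        return self.model(image)  # short clip conditioned on the frame (Step 4)
```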
Approaches. By setting the agents’ roles and operational skills, we can define basic workflows
for different tasks. We design six video generation tasks: (1) text-to-video generation, (2)
text-conditional image-to-video generation, (3) extending generated videos, (4) video-to-video editing,
(5) connecting videos, and (6) simulating digital worlds, which are described below; a sketch of how
these workflows compose the agents follows the task list.
• Task 1: Text-to-Video Generation: This task harnesses a detailed textual prompt from the user as the
foundation for video creation. The prompt must meticulously detail the envisioned scene. From
this prompt, the Text-to-Image agent distills themes and visual details to craft
an initial frame. Building upon this foundation, the Image-to-Video component methodically
generates a sequence of images. This sequence dynamically evolves to embody the prompt's
described actions or scenes, and each video is derived from the last frame of the previous video,
thereby achieving a seamless transition throughout the video.
• Task 2: Text-conditional Image-to-Video Generation: Task 2 mirrors the operational pipeline of
Task 1, with a key distinction. Unlike Task 1 with only texts as inputs, Task 2 integrates both a
textual prompt and an initial image into the Text-to-Image agent’s input. This dual-input approach
enriches the content generation process, enabling a more nuanced interpretation of the user’s vision.
• Task 3: Extend Generated Videos: This task focuses on extending the narrative of an existing video
sequence. By taking the last frame of an input video as the starting point, the video generation
agent crafts a series of new, coherent frames that continue the story. This approach allows for the
seamless expansion of video content, creating longer narratives that maintain the consistency and
flow of the original sequence.
• Task 4: Video-to-Video Editing: Task 4 introduces a sophisticated editing capability, leveraging
both the Image-to-Image and Image-to-Video agents. The process begins with the Image-to-Image
agent, which takes the first frame of an input video and applies edits based on the user’s prompt,
achieving the desired modifications. This edited frame then serves as the initial image for the
Image-to-Video agent, which generates a new video sequence that reflects the requested obvious or
subtle changes, offering a powerful tool for dynamic video editing.
• Task 5: Connect Videos: The Image-to-Video agent leverages the final frame of the first input video
and the initial frame of the second input video to create a seamless transition, producing a new
video that smoothly connects the two original videos.
• Task 6: Simulating Digital Worlds: This task specializes in the whole style changing for video
sequences set in digitally styled worlds. By appending the phrase "In digital world style" to the
edit prompt, the user instructs the Image-to-Video agent to craft a sequence that embodies the
aesthetics and dynamics of a digital realm or utilize the Image-to-Image agent to transfer the real
image to digital style. This task pushes the boundaries of video generation, enabling the creation of
immersive digital environments that offer a unique visual experience.
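To make the step sequences in Figure 2 concrete, the sketch below composes agents following the interface of the earlier sketch into three of the task workflows. The helper functions (first_frame, last_frame, concatenate) and the treatment of a video as a list of frames are simplifying assumptions for illustration, not Mora's actual implementation.

```python
# Illustrative composition of the Figure 2 workflows; a "video" is treated as a
# list of frames, and the helpers below are assumptions made for this sketch.
def first_frame(video): return video[0]
def last_frame(video): return video[-1]
def concatenate(clips): return [frame for clip in clips for frame in clip]


def text_to_video(prompt, prompt_agent, t2i_agent, i2v_agent, num_clips=3):
    """Task 1 (Step 1 -> 2 -> 4): chain clips, seeding each from the previous last frame."""
    enhanced = prompt_agent.run(prompt)
    frame = t2i_agent.run(enhanced)
    clips = []
    for _ in range(num_clips):
        clip = i2v_agent.run(frame, enhanced)
        clips.append(clip)
        frame = last_frame(clip)
    return concatenate(clips)


def video_to_video_edit(video, instruction, i2i_agent, i2v_agent):
    """Task 4 (Step 5 -> 3 -> 4): edit the first frame, then re-animate it."""
    edited = i2i_agent.run(first_frame(video), instruction)
    return i2v_agent.run(edited, instruction)


def connect_videos(video_a, video_b, transition_agent):
    """Task 5 (Step 6): transition between the last frame of A and the first frame of B."""
    return transition_agent.run(last_frame(video_a), first_frame(video_b))
```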

3.2 Implementation Details of Agents

Prompt Selection and Generation. Currently, GPT-4 [64] stands as the most advanced generative
model available. By harnessing the capabilities of GPT-4, we are able to generate and meticulously
select high-quality prompts. These prompts are detailed and rich in information, facilitating the
Text-to-Image generation process by providing the agent with comprehensive guidance.
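As a concrete illustration of this step, the snippet below calls GPT-4 through the OpenAI Python client to expand a short user prompt into a richer description; the system instruction is our own wording rather than the exact prompt used by Mora.

```python
# Sketch of prompt enhancement with GPT-4 (openai>=1.0); the system instruction
# is illustrative and not necessarily the prompt used by Mora.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def enhance_prompt(user_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Rewrite the user's idea as a detailed, visually rich "
                        "scene description suitable for a text-to-image model."},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

detailed = enhance_prompt("a rocket launching over a desert")
```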
Text-to-Image Generation. We utilize a pretrained large text-to-image model to generate a
high-quality and representative first image.
The Stable Diffusion XL (SDXL) [50] is utilized for the first implementation. It introduces a
significant evolution in the architecture and methodology of latent diffusion models [49, 67] for
text-to-image synthesis, setting a new benchmark in the field. At the core of its architecture is an
enlarged UNet backbone [68] that is three times larger than those used in previous versions of Stable
Diffusion 2 [49]. This expansion is principally achieved through an increased number of attention
blocks and a broader cross-attention context, facilitated by integrating a dual text encoder system. The
first encoder is based on OpenCLIP [69] ViT-bigG [70–72], while the second utilizes CLIP ViT-L,
allowing for a richer, more nuanced interpretation of textual inputs by concatenating the outputs of
these encoders. This architectural innovation is complemented by the introduction of several novel
conditioning schemes that do not require external supervision, enhancing the model’s flexibility and
capability to generate images across multiple aspect ratios. Moreover, SDXL features a refinement
model that employs a post-hoc image-to-image transformation to elevate the visual quality of the
generated images. This refinement process utilizes a noising-denoising technique, further polishing
the output images without compromising the efficiency or speed of the generation process.
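For reference, generating the initial frame with SDXL can be done through the Hugging Face diffusers library roughly as follows; the checkpoint name and sampling settings are common defaults rather than Mora's exact configuration, and the optional refiner stage is omitted.

```python
# Sketch of the text-to-image step with SDXL via diffusers; settings are
# illustrative defaults, not Mora's exact configuration.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="A slender white rocket in the midst of launch over a desolate desert",
    num_inference_steps=30,
    guidance_scale=7.0,
).images[0]
image.save("first_frame.png")
```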
Image-to-Image Generation. Our initial implementation utilizes InstructPix2Pix as the Image-to-
Image generation agent. InstructPix2Pix [66] is intricately designed to enable effective image
editing from natural language instructions. At its core, the system integrates the expansive knowledge
of two pre-trained models: GPT-3 [73] for generating editing instructions and edited captions from
textual descriptions, and Stable Diffusion [4] for transforming these text-based inputs into visual
outputs. This ingenious approach begins with fine-tuning GPT-3 on a curated dataset of image
captions and corresponding edit instructions, resulting in a model that can creatively suggest plausible
edits and generate modified captions. Following this, the Stable Diffusion model, augmented with
the Prompt-to-Prompt technique, generates pairs of images (before and after the edit) based on the
captions produced by GPT-3. The conditional diffusion model at the heart of InstructPix2Pix is then
trained on this generated dataset. InstructPix2Pix directly utilizes the text instructions and input
image to perform the edit in a single forward pass. This efficiency is further enhanced by employing
classifier-free guidance for both the image and instruction conditionings, allowing the model to
balance fidelity to the original image with adherence to the editing instructions.
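A minimal way to drive this agent with the public InstructPix2Pix checkpoint in diffusers is sketched below; the prompt and guidance values are illustrative and would need tuning for Mora's editing tasks.

```python
# Sketch of instruction-based image editing with InstructPix2Pix via diffusers.
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

source = Image.open("first_frame.png")
edited = pipe(
    prompt="Change the setting to the 1920s, keep the car red",
    image=source,
    image_guidance_scale=1.5,  # fidelity to the source image
    guidance_scale=7.5,        # adherence to the text instruction
).images[0]
edited.save("edited_frame.png")
```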
Image-to-Video Generation. In the text-to-video generation pipeline, the video generation agent plays
an important role in ensuring video quality and consistency.
Our first implementation utilizes the state-of-the-art video generation model Stable Video Diffusion
to generate video. The Stable Video Diffusion (SVD) [23] architecture introduces a cutting-edge
approach to generating high-resolution videos by leveraging the strengths of LDMs Stable Diffusion
v2.1 [4], originally developed for image synthesis, and extending their capabilities to handle the
temporal complexities inherent in video content. At its core, the SVD model follows a three-stage
training regime that begins with text-to-image pretraining, where the model learns robust visual
representations from a diverse set of images. This foundation allows the model to understand and
generate complex visual patterns and textures. In the second stage, video pretraining, the model is
exposed to large amounts of video data, enabling it to learn temporal dynamics and motion patterns
by incorporating temporal convolution and attention layers alongside its spatial counterparts. This
training is conducted on a systematically curated dataset, ensuring the model learns from high-quality
and relevant video content. The final stage, high-quality video finetuning, focuses on refining the
model’s ability to generate videos with increased resolution and fidelity, using a smaller but higher-
quality dataset. This hierarchical training strategy, complemented by a novel data curation process,
allows SVD to excel in producing state-of-the-art text-to-video and image-to-video synthesis with
remarkable detail, realism, and coherence over time.
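In practice, the image-to-video step with SVD can be invoked through diffusers as sketched below; the resolution, motion, and decoding parameters are illustrative and not necessarily the values used in Mora.

```python
# Sketch of image-to-video generation with Stable Video Diffusion via diffusers.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")

image = load_image("first_frame.png").resize((1024, 576))
frames = pipe(image, decode_chunk_size=4, motion_bucket_id=127).frames[0]
export_to_video(frames, "clip.mp4", fps=7)
# The last frame of clip.mp4 can seed the next call to extend the video further.
```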
Connect Videos. For the video connection task, we utilize SEINE [27] to connect videos. SEINE
is constructed upon a pre-trained diffusion-based T2V model, LaVie [20]. SEINE is centered
around a random-mask video diffusion model that generates transitions based on textual descriptions.
By integrating images of different scenes with text-based control, SEINE produces transition videos
that maintain coherence and visual quality. Additionally, the model can be extended for tasks such as
image-to-video animation and autoregressive video prediction.
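The frame-selection part of this step is straightforward; the sketch below extracts the boundary frames of two clips with OpenCV, which are then handed to the transition model (SEINE in our implementation, whose exact calling interface is not shown here).

```python
# Sketch of key-frame extraction for the video connection step (OpenCV).
import cv2

def boundary_frames(path: str):
    """Return the first and last frames of a video file as numpy arrays."""
    cap = cv2.VideoCapture(path)
    ok, first = cap.read()
    cap.set(cv2.CAP_PROP_POS_FRAMES, cap.get(cv2.CAP_PROP_FRAME_COUNT) - 1)
    ok, last = cap.read()
    cap.release()
    return first, last

_, last_of_a = boundary_frames("clip_a.mp4")
first_of_b, _ = boundary_frames("clip_b.mp4")
# last_of_a, first_of_b, and a text description of the transition are then passed
# to the transition model; SEINE generates the intermediate frames between them.
```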

4 Experiments

4.1 Setup

Baseline. For text-to-video generation, existing models that show competitive performance are
employed as baselines, including VideoCrafter1 [74], Show-1 [75],
Pika [6], Gen-2 [7], ModelScope [76], LaVie-Interpolation, LaVie [77], and CogVideo [78]. In the
other five tasks, we compare Mora with Sora.
Basic Metrics. For text-to-video generation, we employed several metrics from Vbench [33] for
evaluation from two aspects: video quality and video condition consistency.
For video quality measurement, we use six metrics. ❶ Object Consistency, computed by the
DINO [79] feature similarity across frames to assess whether object appearance remains consistent
throughout the whole video; ❷ Background Consistency, calculated by CLIP [71] feature similarity
across frames; ❸ Motion Smoothness, which utilizes the motion priors in the video frame interpolation
model AMT [80] to evaluate the smoothness of generated motions; ❹ Aesthetic Score, obtained by
using the LAION aesthetic predictor [81] on each video frame to evaluate the artistic and beauty
value perceived by humans; ❺ Dynamic Degree, computed by employing RAFT [82] to estimate
the degree of dynamics in synthesized videos; ❻ Imaging Quality, calculated using the MUSIQ [83]
image quality predictor trained on the SPAQ [84] dataset.
For measuring video condition consistency, we use two metrics. ❶ Temporal Style, which is
determined by utilizing ViCLIP [85] to compute the similarity between video features and temporal
style description features, thereby reflecting the consistency of the temporal style; ❷ Appearance
Style, by calculating the feature similarity between synthesized frames and the input prompt using
CLIP [71], to gauge the consistency of appearance style.
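To illustrate the flavor of these frame-similarity metrics, the sketch below computes a simplified background-consistency score as the mean CLIP feature similarity between adjacent frames. VBench's official implementation differs in details (for example, it also compares against reference frames), so this is only an approximation, not the benchmark code.

```python
# Simplified CLIP-based frame-consistency score; VBench's official metric
# implementation differs in details, this is only an illustration.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def background_consistency(frames):
    """frames: list of PIL images sampled from one generated video."""
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)   # (T, dim)
    feats = F.normalize(feats, dim=-1)
    # Mean cosine similarity between consecutive frames.
    return F.cosine_similarity(feats[:-1], feats[1:]).mean().item()
```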

Table 2: Comparative analysis of text-to-video generation performance between Mora and various
other models. The Others category scores are derived from the Hugging Face leaderboard. For our
Mora, the evaluation is based on prompts generated by GPT-4, categorized into three types according
to the number of moving objects in the videos: Type I (single object in motion), Type II (two to three
objects in motion), and Type III (more than three objects in motion). Differences in the input prompts
may account for Mora's Type II scores exceeding those of Sora in the relevant evaluations.
Model  Video Quality  Object Consistency  Background Consistency  Motion Smoothness  Aesthetic Quality  Dynamic Degree  Imaging Quality  Temporal Style  Video Length (s)
Others
Sora 0.797 0.95 0.96 1.00 0.60 0.69 0.58 0.35 60
VideoCrafter1 0.778 0.95 0.98 0.95 0.63 0.55 0.61 0.26 2
ModelScope 0.758 0.89 0.95 0.95 0.52 0.66 0.58 0.25 2
Show-1 0.751 0.95 0.98 0.98 0.57 0.44 0.59 0.25 3
Pika 0.741 0.96 0.96 0.99 0.63 0.37 0.54 0.24 3
LaVie-Interpolation 0.741 0.92 0.97 0.97 0.54 0.46 0.59 0.26 10
Gen-2 0.733 0.97 0.97 0.99 0.66 0.18 0.63 0.24 4
LaVie 0.746 0.91 0.97 0.96 0.54 0.49 0.61 0.26 3
CogVideo 0.673 0.92 0.95 0.96 0.38 0.42 0.41 0.07 4
Our Mora
Type I 0.782 0.96 0.97 0.99 0.60 0.60 0.57 0.26 12
Type II 0.810 0.94 0.95 0.99 0.57 0.80 0.61 0.26 12
Type III 0.795 0.94 0.93 0.99 0.55 0.80 0.56 0.26 12
Mora 0.792 0.95 0.95 0.99 0.57 0.70 0.59 0.26 12

Self-defined Metrics. For evaluating other tasks, we also define four metrics.
❶ Video-Text Integration (VideoTI), devised to enhance the quantitative evaluation of the model's
fidelity to textual instructions. It employs LLaVA [86] to translate the input image into a textual
descriptor T_i and Video-LLaMA [87] to translate the video generated by the model into a textual
descriptor T_v. The textual representation of the image is prepended with the original instructional
text, forming an augmented textual input T_mix. Both the newly formed text and the video-derived
text are input to BERT [88], and the resulting embeddings are compared via cosine similarity,
providing a quantitative measurement of the model's adherence to the given instructions and image:
VideoTI = cosine(embed_mix, embed_v)    (1)
where embed_mix denotes the embedding of T_mix and embed_v the embedding of T_v.
❷ Temporal Consistency (TCON), designed to measure the coherence between an original video and
its extended version, provides a vital tool for assessing the integrity of extended video content. For
each input-output video pair, we employ the ViCLIP [85] video encoder to extract feature vectors
and compute their cosine similarity:
TCON = cosine(V_input, V_output)    (2)

❸ Temporal Coherence (Tmean), which quantifies the correlation between the intermediate generated
video and the input videos based on TCON:
Tmean = (TCON_front + TCON_beh) / 2    (3)
where TCON_front measures the correlation between the intermediate video and the preceding video
in the time series, while TCON_beh assesses the correlation with the subsequent video. The average
of these scores provides an aggregate measure of temporal coherence across the video sequence
(a simplified computational sketch of these metrics is given below).
❹ Video Length, which evaluates the models' capacity to produce video content of extended duration,
reported as the maximum length in seconds.
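The first three metrics above reduce to cosine similarities over embeddings. A minimal sketch is given below; describe_image, describe_video, and encode_video are placeholders standing in for LLaVA, Video-LLaMA, and ViCLIP and are assumed to be provided elsewhere, so only the BERT embedding and cosine-similarity steps are concrete.

```python
# Sketch of the self-defined metrics (Eqs. 1-3). describe_image, describe_video,
# and encode_video are placeholders for LLaVA, Video-LLaMA, and ViCLIP; only the
# BERT embedding and cosine-similarity steps are concrete here.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def bert_embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1)                      # mean-pooled (1, dim)

def video_ti(instruction, image, video, describe_image, describe_video):
    """Eq. (1): similarity between the instruction+image text and the video caption."""
    t_mix = instruction + " " + describe_image(image)  # T_mix
    t_v = describe_video(video)                        # T_v
    return F.cosine_similarity(bert_embed(t_mix), bert_embed(t_v)).item()

def tcon(video_in, video_out, encode_video):
    """Eq. (2): similarity between video features of the input and extended video."""
    return F.cosine_similarity(encode_video(video_in), encode_video(video_out)).item()

def tmean(tcon_front, tcon_beh):
    """Eq. (3): average coherence with the preceding and subsequent videos."""
    return (tcon_front + tcon_beh) / 2
```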
Implementation Details. For text-to-video generation, we follow the style of text prompts provided
in the official Sora technical report [89]. Subsequently, we employ GPT-4 [2] to produce more text
under a few-shot setting. GPT-4 is also utilized to generate the same number of texts in a zero-shot
setting. All generated text prompts are then input into text-to-video models for generating videos.
For comparison with Sora, we utilize videos featured on its official website and technical report.
All experiments are conducted on two NVIDIA Tesla A100 GPUs with 2×80 GB of VRAM, 4×AMD
EPYC 7552 48-core processors, and 320 GB of system memory. The software environment is
standardized on PyTorch 2.0.2 with CUDA 12.2 for video generation and PyTorch 1.10.2 with
CUDA 11.6 for video evaluation.

Table 3: Text-conditional image-to-video generation. This task evaluates the model's ability to convert
static images and accompanying textual instructions into coherent video sequences. We used 4 pairs
of image and text inputs sourced from the official Sora technical report [89].
Model  VideoTI  Motion Smoothness  Dynamic Degree  Imaging Quality
Sora 0.90 0.99 0.75 0.63
Mora 0.88 0.97 0.75 0.60

4.2 Results

Text-to-Video Generation. The quantitative results are detailed in Table 2. Mora showcases
commendable performance across all metrics, making it highly comparable to the top-performing
model, Sora, and surpassing the capabilities of other competitors. Specifically, Mora achieved a
Video Quality score of 0.792, which closely follows Sora’s leading score of 0.797 and surpasses the
current best open-source models such as VideoCrafter1. In terms of Object Consistency, Mora scored 0.95,
equaling Sora and demonstrating superior consistency in maintaining object identities throughout the
videos. For Background Consistency and Motion Smoothness, Mora achieved scores of 0.95 and 0.99,
respectively, indicating high fidelity in background stability and fluidity of motion within generated
videos. Although Sora, at 0.96, slightly outperforms Mora in Background Consistency, the
margin is minimal. The Aesthetic Quality metric, which assesses the overall visual appeal of the
videos, saw Mora scoring 0.57. This score, while not the highest, reflects a competitive stance
against other models, with Sora scoring slightly higher at 0.60. Nevertheless, Mora’s performance
in Dynamic Degree and Imaging Quality, with scores of 0.70 and 0.59, showcases its strength in
generating dynamic, visually compelling content that surpasses all other models. As for Temporal
Style, Mora scored 0.26, indicating its robust capability in addressing the temporal aspects of video
generation. Although this performance signifies a commendable proficiency, it also highlights a
considerable gap between our model and Sora, the leader in this category with a score of 0.35.
In Figure 1, the visual fidelity of Mora’s text-to-video generation is compelling, manifesting high-
resolution imagery with acute attention to detail as articulated in the accompanying textual descrip-
tions. The vivid portrayal of scenes, from the liftoff of a rocket to the dynamic coral ecosystem and
the urban skateboarding vignette, underscores the system’s adeptness in capturing and translating
the essence of the described activities and environments into visually coherent sequences. Notably,
the images exude a temporal consistency that speaks to Mora’s nuanced understanding of narrative
progression, an essential quality in video synthesis from textual prompts.
Text-conditional Image-to-Video Generation. Analyzing Table 3, we discern a notable demonstra-
tion of Mora’s capabilities in text-conditional image-to-video generation, closely trailing behind Sora.
Sora leads with a VideoTI score of 0.90 and Motion Smoothness of 0.99, underscoring its refined
alignment of video with the provided text and the fluidity of motion within its generated videos.
Mora’s VideoTI score of 0.88 and Motion Smoothness of 0.97, while marginally lower, reflect its
robust potential in the accurate interpretation of textual prompts and the generation of smooth motion
sequences. Both models exhibit a matched score in Dynamic Degree at 0.75, indicating a comparable
proficiency in animating videos with a sense of activity and movement. The Imaging Quality scores
show Sora with a slight lead at 0.63 against Mora’s 0.60, suggesting Sora’s subtle superiority in
image rendering within video sequences.
In Figure 3, a qualitative comparison between the video outputs from Sora and Mora reveals that both
models adeptly incorporate elements from the input prompt and image. The monster illustration and
the cloud spelling "SORA" are well-preserved and dynamically translated into video by both models.
Despite quantitative differences, the qualitative results of Mora nearly rival those of Sora, with both
models able to animate the static imagery and narrative elements of the text descriptions into
coherent video. This qualitative observation attests to Mora’s capacity to generate videos that closely
parallel Sora’s output, achieving a high level of performance in rendering text-conditional imagery
into video format while maintaining the thematic and aesthetic essence of the original inputs.

Figure 3: Samples for text-conditional image-to-video generation of Mora and Sora. The prompt for
the first-row image is: Monster Illustration in flat design style of a diverse family of monsters. The group
includes a furry brown monster, a sleek black monster with antennas, a spotted green monster, and a
tiny polka-dotted monster, all interacting in a playful environment. The second image's prompt is:
An image of a realistic cloud that spells "SORA".

Table 4: Extend generated videos. This experiment compares the continuity and quality of the
extended video when the same input video is given to Sora and Mora respectively. The input video is
taken from the full video of the official Sora technical report [89], and we take 2 clips as the input
video for this task.
Model  TCON  Imaging Quality  Temporal Style
Sora   0.99  0.43             0.24
Mora   0.94  0.39             0.22

Extend Generated Videos. The quantitative results from Table 4 reveal that while Sora holds a
slight edge over Mora in TCON and Imaging Quality, indicating a higher consistency and fidelity in
extending video sequences, Mora's performance remains close, with a marginal difference of 0.05
in TCON and 0.04 in Imaging Quality. The closeness in Temporal Style scores, 0.24 for Sora and
0.22 for Mora, further signifies that Mora nearly matches Sora in maintaining stylistic continuity
over time. Despite Sora’s lead, Mora’s capabilities, particularly in following the temporal style and
extending existing videos without significant quality loss, demonstrate its effectiveness in the video
extension domain.
From a qualitative standpoint, Figure 4 illustrates the competencies of Mora in extending video
sequences. Both Sora and Mora adeptly maintain the narrative flow and visual continuity from the
original to the extended video. Despite the slight numerical differences highlighted in the quantitative
analysis, the qualitative outputs suggest that Mora’s extended videos preserve the essence of the
original content with high fidelity. The preservation of dynamic elements such as the rider’s motion
and the surrounding environment’s blur effect in the Mora generated sequences showcases its capacity
to produce extended videos that are not only coherent but also retain the original’s motion and energy
characteristics. This visual assessment underscores Mora’s proficiency in generating extended video
content that closely mirrors the original, maintaining the narrative context and visual integrity, thus
providing near parity with Sora’s performance.
Video-to-Video Editing. Table 5 exhibits a comparative analysis of video-to-video editing capabili-
ties between Sora and Mora. Sora secures a higher score in Imaging Quality at 0.52, which suggests a
superior capability in preserving the visual details and overall image fidelity during the video editing
process. Mora demonstrates capability in video-to-video editing with an Imaging Quality score of
0.38. Although this score reveals a noticeable gap compared to Sora, it offers valuable insights
for pinpointing areas for targeted enhancement in future iterations of Mora. In the aspect
of Temporal Style, both models exhibit proximal performance, with Sora marginally leading at 0.24
compared to Mora’s 0.23. This near-parity underscores the capacity of Mora to closely emulate the
stylistic temporal consistency achieved by Sora, ensuring that the edited videos retain a coherent style
over time, a crucial aspect of seamless video-to-video editing.

Figure 4: Samples for extending generated videos with Mora and Sora.

Table 5: Video-to-video editing. This task examines the model's ability to edit and transform existing
video content in accordance with textual instructions while maintaining visual and stylistic coherence.
Data includes 1 input video and 12 instructions from the official Sora technical report.
Model  Imaging Quality  Temporal Style
Sora   0.52             0.24
Mora   0.38             0.23

Table 6: Connect videos. This task assesses the capability of models to seamlessly integrate distinct
video clips into a cohesive sequence. ∗ For Sora we use 5 distinct video-video pairs from the Sora
technical report website; for Mora we use 5 video-video pairs of similar styles.
Model  Imaging Quality  Tmean
Sora∗  0.52             0.64
Mora∗  0.42             0.45

Upon qualitative evaluation, Figure 5 presents samples from video-to-video editing tasks, wherein
both Sora and Mora were instructed to modify the setting to the 1920s style while maintaining the
car’s red color. Visually, Sora’s output exhibits a transformation that convincingly alters the modern-
day setting into one reminiscent of the 1920s, while carefully preserving the red color of the car.
Mora’s transformation, while achieving the task instruction, reveals differences in the execution of
the environmental modification, with the sampled frame from generated video suggesting a potential
for further enhancement to achieve the visual authenticity displayed by Sora. Nevertheless, Mora's
adherence to the specified red color of the car underlines its ability to follow detailed instructions and
enact considerable changes in the video content. This capability, although not as refined as Sora’s,
demonstrates Mora’s potential for significant video editing tasks.
Connect Videos. Quantitative analysis based on the results shown in Table 6 suggests that the model
Sora outperforms Mora in terms of Imaging Quality and Tmean. Sora achieves a score of 0.52
in Imaging Quality, indicating its higher fidelity in visual representation compared to Mora’s 0.42.
Furthermore, Sora’s superiority is evident in the Temporal coherence aspect with a score of 0.64,
which implies that Sora maintains a more consistent visual narrative over time than Mora, which
scores 0.45. This quantitative assessment not only solidifies Sora’s position as a superior model in
creating high-quality, temporally coherent video sequences but also delineates a trajectory for future
enhancements in video connectivity for the Mora framework.

Figure 5: Samples for video-to-video editing. Instruction: Change the setting to the 1920s with an
old school car. Make sure to keep the red color.

Figure 6: Samples for connecting videos.

Table 7: Simulate digital worlds. This task assesses the model's effectiveness in creating videos that
emulate digital or virtual environments, with a focus on preserving the visual details and the distinct
appearance style of such worlds. By referring to the video generated by Sora from the official
technical report [89], we used GPT-4 [2] to build the input prompt for Mora, which contains the
"minecraft scene" and the "player character".
Model  Imaging Quality  Appearance Style
Sora   0.62             0.23
Mora   0.52             0.23

Qualitative analysis based on Figure 6 suggests that, in comparison to Sora's proficiency in synthesizing
intermediate video segments that successfully incorporate background elements from the preceding
footage and distinct objects from the subsequent frames within a single frame, the Mora model produces
intermediate videos with blurred backgrounds in which objects are difficult to recognize. Accordingly,
this highlights the potential for improving the fidelity of images within the generated intermediate
videos as well as their consistency with the entire video sequence, which would contribute to refining
the video connection process and improving the integration quality of Mora's outputs.
Simulate Digital Worlds. In the assessment of digital world simulation capabilities, Table 7 presents
a comparative metric analysis between Sora and Mora. Sora exhibits a lead in Imaging Quality with
a score of 0.62, indicative of its refined capability to render digital worlds with a higher degree of
visual realism and fidelity. Mora, with a score of 0.52, although demonstrating a competent level
of performance, falls behind Sora, suggesting areas for improvement in achieving the same level
of image clarity and detail. However, both models achieve identical scores in Appearance Style, at
0.23, which reflects a shared ability to adhere to the stylistic parameters of the digital worlds being
simulated. This suggests that while there is a difference in the imaging quality, the stylistic translation
of textual descriptions into visual aesthetics is accomplished with equivalent proficiency by both
models.

Figure 7: Samples for Simulate digital worlds

Upon qualitative evaluation, Figure 7 presents samples from Simulate digital worlds tasks, wherein
both Sora and Mora were instructed to generate videos of "Minecraft" scenes. In the top row of
frames generated by Sora, we see that the videos maintain high fidelity to the textures and elements
typical of digital world aesthetics, characterized by crisp edges, vibrant colors, and clear object
definition. The pig and the surrounding environment appear to adhere closely to the style one would
expect from a high-resolution game or a digital simulation. These are crucial aspects of performance
for Sora, indicating a high-quality synthesis that aligns well with user input while preserving visual
consistency and digital authenticity. The bottom row of frames generated by Mora suggests a step
towards achieving the digital simulation quality of Sora but with notable differences. Although
Mora seems to emulate the digital world’s theme effectively, there is a visible gap in visual fidelity.
The images generated by Mora exhibit a slightly muted color palette, less distinct object edges,
and a seemingly lower resolution compared to Sora’s output. This suggests that Mora is still in
a developmental phase, with its generative capabilities requiring further refinement to reach the
performance level of Sora.

5 Discussion
5.1 Strengths of Mora

Innovative Framework and Flexibility. Mora introduces a groundbreaking multi-agent framework
for video generation, significantly advancing the field by enabling a vast array of tasks. This innovative
approach not only facilitates text-to-video conversion but also supports the simulation of digital
worlds, showcasing unparalleled flexibility and efficiency. Unlike its closed-source counterparts such
as Sora, Mora’s open framework design stands out, offering the ability to seamlessly integrate various
models. This adaptability allows for the completion of an expanded range of tasks and projects,
making Mora an indispensable tool for diverse application scenarios.
Open-Source Contribution. Mora’s open-source nature is highlighted as a significant contribution to
the AI community, encouraging further development and refinement by providing a solid foundation
upon which future research can build. This open-source approach not only democratizes access to
advanced video generation technologies but also fosters collaboration and innovation in the field. It
also suggests new avenues for future research, including improving the framework's efficiency,
reducing its computational demands during training, and exploring new agent configurations to
enhance performance.

5.2 Limitations of Mora

Video Dataset is all you need. Collecting high-quality video datasets poses significant challenges,
primarily due to copyright restrictions on many videos. Unlike images, which are often easier to
gather and utilize for training purposes due to a broader range of available resources and more lenient
copyright laws, videos frequently contain copyrighted materials that are not as straightforward to
collect or use legally. This issue is compounded by the fact that high-quality videos often come from
professional sources, such as movies, television shows, and proprietary game footage, which are
rigorously protected by copyright laws. Especially in scenarios involving humans in realistic settings,
Mora struggles to generate lifelike movements, such as walking or riding a bicycle. Capturing
the subtleties of human motion requires not just any video footage, but high-resolution, smoothly
captured sequences that detail every aspect of human kinetics, including the nuances of balance,
posture, and interaction with surroundings. Without access to extensive datasets that accurately
represent this wide range of human movements, it becomes challenging for video generation models
like Mora to replicate these actions convincingly. This limitation underscores the importance of not
only the quantity but also the quality and diversity of video datasets in training models to understand
and recreate complex human behaviors accurately.
Quality and Length Gaps. Despite its innovative approach, Mora faces notable challenges, especially
when compared to its counterpart, Sora, in terms of video generation quality and capabilities. While
Mora is capable of accomplishing tasks similar to those of Sora, the quality of videos generated by
Mora falls significantly short, particularly in scenarios involving substantial object movement. Such
conditions often introduce a considerable amount of noise, with the quality degradation becoming
more pronounced relative to the video length. Moreover, although Mora generally maintains quality
for up to 12 seconds of video, any attempt to extend beyond this duration leads to a marked decline in
video quality. In contrast, Sora has demonstrated the ability to produce high-quality videos exceeding
one minute in length, as showcased in its technical report. This disparity in imaging quality, alongside
the limitations in generating videos longer than 12 seconds at a resolution of 1024×576, underscores
the urgent need for advancements in Mora’s rendering capabilities and an extension of its video
generation parameters.
Instruction Following Capability. Mora has reached a milestone in its development, now capable
of generating videos exceeding 10 seconds from given prompts. However, despite its ability to
include all objects specified in the prompts within the generated videos, Mora encounters limitations
in executing certain functions. Notably, it struggles with interpreting and rendering the dynamics
of motion described in prompts, such as the speed of movement. Additionally, Mora lacks the
capability to control the direction of motion—specific directions like left or right for specified objects
remain unachievable. This shortfall primarily stems from the system’s foundational approach to video
generation, which operates on an Image to Video basis without direct input from textual prompts.
Consequently, meeting these specific requirements poses significant challenges, highlighting areas
for potential enhancement in Mora’s functionality and interpretative capabilities.
Human Visual Preference Alignment. The absence of human labeling information within the
video domain suggests that experimental results may not always align with human visual preferences,
highlighting a significant gap. For example, during the video connection task, the generation of transition
videos that disintegrate a male figure only to form a female figure represents an exceedingly illogical
scenario. This example underscores the necessity for datasets that adhere more closely to physical
laws, thereby refining our work and ensuring more realistic and coherent video generation outcomes.
Such a framework would enhance the appeal of generated videos to viewers and contribute to
establishing comprehensive benchmarks and evaluation criteria for the field.

6 Conclusion

We introduce Mora, a groundbreaking generalist framework for video generation that addresses a
range of video-related tasks. Leveraging the collaborative power of multiple agents, Mora marks a
considerable advancement in generating videos from textual prompts, establishing new benchmarks
for adaptability, efficiency, and output quality in the field of video generation. Our thorough evaluation
reveals that Mora not only competes with but also exceeds the capabilities of current leading models in
certain areas. However, a notable gap remains with OpenAI's Sora model, whose closed-source nature
poses considerable challenges for replication and innovation within the academic and professional
communities. Our work demonstrates the untapped potential of a meta-programming approach that
facilitates intricate collaborations among a variety of agents, each specializing in a segment of the
video generation process. In essence, the achievements of Mora not only represent the current
open-source state of the art in video generation but also illuminate the path forward for the field.
As we continue to explore the vast landscape of generative AI, multi-agent collaboration frameworks
like Mora will undoubtedly play a pivotal role in unlocking new creative possibilities and applications,
from storytelling and content creation to simulation and training. The journey of innovation is far
from over, and Mora represents a significant milestone on this ongoing voyage of discovery.

Looking ahead, several promising directions remain for further research. One is to integrate more sophisticated natural language understanding into the agents, potentially enabling more detailed and context-aware video generation. Another is to extend Mora with real-time feedback loops, offering interactive video creation experiences in which user inputs guide the generation process in more dynamic and responsive ways. Furthermore, accessibility and computational resource requirements remain a critical barrier to wider adoption and innovation; future iterations of Mora could benefit from optimizations that reduce these requirements, making advanced video generation technologies accessible to a broader range of users and developers. In parallel, efforts to create more open and collaborative research environments could accelerate progress in this domain, enabling the community to build upon the foundation laid by the Mora framework and other pioneering works.

References
[1] OpenAI. Chatgpt, 2023. https://openai.com/product/chatgpt.
[2] OpenAI. Gpt-4 technical report. 2023. URL https://arxiv.org/pdf/2303.08774.pdf.
[3] Midjourney. Midjourney, 2023. URL https://www.midjourney.com/home.
[4] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-
resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition, pages 10684–10695, 2022.
[5] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. https://cdn.openai.com/papers/dall-e-3.pdf, 2023.
[6] Pika. URL https://pika.art/home.
[7] Gen-2. URL https://runwayml.com/ai-tools/gen-2/.
[8] OpenAI. Sora: Creating video from text. https://openai.com/sora, 2024.
[9] Junchen Zhu, Huan Yang, Huiguo He, Wenjing Wang, Zixi Tuo, Wen-Huang Cheng, Lianli
Gao, Jingkuan Song, and Jianlong Fu. Moviefactory: Automatic movie creation from text using
large generative models for language and images. In Proceedings of the 31st ACM International
Conference on Multimedia, pages 9313–9319, 2023.
[10] Shaobin Zhuang, Kunchang Li, Xinyuan Chen, Yaohui Wang, Ziwei Liu, Yu Qiao, and Yali
Wang. Vlogger: Make your dream a vlog. arXiv preprint arXiv:2401.09414, 2024.
[11] Ivan Kapelyukh, Vitalis Vosylius, and Edward Johns. Dall-e-bot: Introducing web-scale
diffusion models to robotics. IEEE Robotics and Automation Letters, 2023.
[12] Weiyu Liu, Tucker Hermans, Sonia Chernova, and Chris Paxton. Structdiffusion: Object-centric
diffusion for semantic rearrangement of novel objects. In Workshop on Language and Robotics
at CoRL 2022, 2022.
[13] Akash Awasthi, Jaer Nizam, Samira Zare, Safwan Ahmad, Melisa J Montalvo, Navin Varadara-
jan, Badri Roysam, and Hien Van Nguyen. Video diffusion models for the apoptosis forcasting.
bioRxiv, pages 2023–11, 2023.
[14] Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue
Huang, Hanchi Sun, Jianfeng Gao, et al. Sora: A review on background, technology, limitations,
and opportunities of large vision models. arXiv preprint arXiv:2402.17177, 2024.
[15] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsuper-
vised learning using nonequilibrium thermodynamics. In International conference on machine
learning, pages 2256–2265. PMLR, 2015.
[16] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances
in neural information processing systems, 33:6840–6851, 2020.
[17] Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth
words: A vit backbone for diffusion models. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 22669–22679, 2023.
[18] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings
of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
[19] Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh
Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factorizing
text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709,
2023.

[20] Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan
Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming
Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua Lin, Yu Qiao, and Ziwei Liu. Lavie:
High-quality video generation with cascaded latent diffusion models, 2023.
[21] Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying
Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models.
arXiv preprint arXiv:2401.09047, 2024.
[22] Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen,
and Yu Qiao. Latte: Latent diffusion transformer for video generation, 2024.
[23] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do-
minik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion:
Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
[24] Eyal Molad, Eliahu Horwitz, Dani Valevski, Alex Rav Acha, Yossi Matias, Yael Pritch, Yaniv
Leviathan, and Yedid Hoshen. Dreamix: Video diffusion models are general video editors.
arXiv preprint arXiv:2302.01329, 2023.
[25] Jun Hao Liew, Hanshu Yan, Jianfeng Zhang, Zhongcong Xu, and Jiashi Feng. Magicedit:
High-fidelity and temporally coherent video editing. arXiv preprint arXiv:2308.14749, 2023.
[26] Duygu Ceylan, Chun-Hao P Huang, and Niloy J Mitra. Pix2video: Video editing using image
diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision,
pages 23206–23217, 2023.
[27] Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang,
Dahua Lin, Yu Qiao, and Ziwei Liu. Seine: Short-to-long video diffusion model for generative
transition and prediction. In The Twelfth International Conference on Learning Representations,
2023.
[28] Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili
Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. Metagpt: Meta programming for
multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.
[29] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li,
Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via
multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023.
[30] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry
Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video:
Text-to-video generation without text-video data. In International Conference on Learning
Representations, 2023.
[31] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko,
Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High
definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
[32] Weifeng Chen, Jie Wu, Pan Xie, Hefeng Wu, Jiashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin.
Control-a-video: Controllable text-to-video generation with diffusion models. arXiv preprint
arXiv:2305.13840, 2023.
[33] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang,
Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark
suite for video generative models. arXiv preprint arXiv:2311.17982, 2023.
[34] Yaohui Wang, Piotr Bilinski, Francois Bremond, and Antitza Dantcheva. Imaginator: Con-
ditional spatio-temporal gan for video generation. In Proceedings of the IEEE/CVF Winter
Conference on Applications of Computer Vision, pages 1160–1169, 2020.
[35] Mengyu Chu, You Xie, Jonas Mayer, Laura Leal-Taixé, and Nils Thuerey. Learning temporal
coherence via self-supervision for gan-based video generation. ACM Transactions on Graphics
(TOG), 39(4):75–1, 2020.

[36] Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation
using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021.

[37] Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, and Nan Duan. Nüwa:
Visual synthesis pre-training for neural visual world creation. In European conference on
computer vision, pages 720–736. Springer, 2022.

[38] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale
pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868,
2022.

[39] Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig
Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, et al. Videopoet: A large language model
for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023.

[40] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked
autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377, 2021.

[41] Agrim Gupta, Stephen Tian, Yunzhi Zhang, Jiajun Wu, Roberto Martín-Martín, and Li Fei-Fei.
Maskvit: Masked visual pre-training for video prediction. arXiv preprint arXiv:2206.11894,
2022.

[42] Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang,
Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable
length video generation from open domain textual description. arXiv preprint arXiv:2210.02399,
2022.

[43] Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen,
Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, et al. Language model beats
diffusion–tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023.

[44] Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang,
and José Lezama. Photorealistic video generation with diffusion models. arXiv preprint
arXiv:2312.06662, 2023.

[45] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu,
Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without
text-video data. arXiv preprint arXiv:2209.14792, 2022.

[46] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang
Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion
models are zero-shot video generators. arXiv preprint arXiv:2303.13439, 2023.

[47] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne
Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image
diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, pages 7623–7633, 2023.

[48] Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans,
and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in
Neural Information Processing Systems, 36, 2024.

[49] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer.
High-resolution image synthesis with latent diffusion models, 2022.

[50] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe
Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image
synthesis, 2023.

[51] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and
David J Fleet. Video diffusion models. Advances in Neural Information Processing Systems,
35:8633–8646, 2022.

[52] Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. Magicvideo:
Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022.
[53] Paul Couairon, Clément Rambour, Jean-Emmanuel Haugeard, and Nicolas Thome. Videdit:
Zero-shot and spatially aware text-driven video editing. arXiv preprint arXiv:2306.08707, 2023.
[54] Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. Rerender a video: Zero-shot
text-guided video-to-video translation. arXiv preprint arXiv:2306.07954, 2023.
[55] Wenhao Chai, Xun Guo, Gaoang Wang, and Yan Lu. Stablevideo: Text-driven consistency-
aware diffusion video editing. In Proceedings of the IEEE/CVF International Conference on
Computer Vision, pages 23040–23050, 2023.
[56] Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and
Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant
transformers. arXiv preprint arXiv:2401.08740, 2024.
[57] Xiangru Tang, Qiao Jin, Kunlun Zhu, Tongxin Yuan, Yichi Zhang, Wangchunshu Zhou, Meng
Qu, Yilun Zhao, Jian Tang, Zhuosheng Zhang, et al. Prioritizing safeguarding over autonomy:
Risks of llm agents for science. arXiv preprint arXiv:2402.04247, 2024.
[58] Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu,
and Maosong Sun. Communicative agents for software development. arXiv preprint
arXiv:2307.07924, 2023.
[59] Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and
Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceed-
ings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages
1–22, 2023.
[60] Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf
Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress
and challenges. arXiv preprint arXiv:2402.01680, 2024.
[61] Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems, 36, 2024.
[62] Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang,
Zhaopeng Tu, and Shuming Shi. Encouraging divergent thinking in large language models
through multi-agent debate. arXiv preprint arXiv:2305.19118, 2023.
[63] Zhenran Xu, Senbao Shi, Baotian Hu, Jindi Yu, Dongfang Li, Min Zhang, and Yuxiang Wu.
Towards reasoning in large language models via multi-agent peer review collaboration. arXiv
preprint arXiv:2311.08152, 2023.
[64] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni
Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4
technical report. arXiv preprint arXiv:2303.08774, 2023.
[65] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo-
thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open
and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[66] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow
image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 18392–18402, 2023.
[67] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano
Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations, 2022.
[68] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for
biomedical image segmentation, 2015.

[69] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP, July 2021. URL https://doi.org/10.5281/zenodo.5143773.
[70] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade
Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws
for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 2818–2829, 2023.
[71] Alec Radford, Jong Wook Kim, Chris Hallacy, A. Ramesh, Gabriel Goh, Sandhini Agar-
wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya
Sutskever. Learning transferable visual models from natural language supervision. In ICML,
2021.
[72] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman,
Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick
Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kacz-
marczyk, and Jenia Jitsev. LAION-5b: An open large-scale dataset for training next gen-
eration image-text models. In Thirty-sixth Conference on Neural Information Processing
Systems Datasets and Benchmarks Track, 2022. URL https://openreview.net/forum?id=M3Y74vmsMcY.
[73] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[74] Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo
Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, et al. Videocrafter1: Open diffusion models for
high-quality video generation. arXiv preprint arXiv:2310.19512, 2023.
[75] David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu,
Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for
text-to-video generation. arXiv preprint arXiv:2309.15818, 2023.
[76] ModelScope. URL https://huggingface.co/ali-vilab/modelscope-damo-text-to-video-synthesis.
[77] Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang,
Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded
latent diffusion models. arXiv preprint arXiv:2309.15103, 2023.
[78] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale
pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868,
2022.
[79] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski,
and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings
of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.
[80] Zhen Li, Zuo-Liang Zhu, Ling-Hao Han, Qibin Hou, Chun-Le Guo, and Ming-Ming Cheng.
Amt: All-pairs multi-field transforms for efficient frame interpolation. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9801–9810, 2023.
[81] Aesthetic predictor. URL https://github.com/LAION-AI/aesthetic-predictor.
[82] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In
Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020,
Proceedings, Part II 16, pages 402–419. Springer, 2020.
[83] Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image
quality transformer. In Proceedings of the IEEE/CVF international conference on computer
vision, pages 5148–5157, 2021.

[84] Yuming Fang, Hanwei Zhu, Yan Zeng, Kede Ma, and Zhou Wang. Perceptual quality assessment
of smartphone photography. In Proceedings of the IEEE/CVF conference on computer vision
and pattern recognition, pages 3677–3686, 2020.
[85] Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen,
Xinyuan Chen, Yaohui Wang, et al. Internvid: A large-scale video-text dataset for multimodal
understanding and generation. arXiv preprint arXiv:2307.06942, 2023.
[86] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances
in neural information processing systems, 36, 2024.
[87] Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual
language model for video understanding. arXiv preprint arXiv:2306.02858, 2023. URL
https://arxiv.org/abs/2306.02858.
[88] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of
deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,
2018.
[89] OpenAI. Video generation models as world simulators. https://openai.com/research/video-generation-models-as-world-simulators, 2024.
