SORA


This is not real video!


It's generated by Sora, a text-to-video model by OpenAI.
OpenAI recently announced its latest groundbreaking tech—
Sora, which produces short videos from text prompts. While
Sora is not yet available to the public, the high quality of the
sample outputs published so far has provoked both excited and
concerned reactions. The sample videos published by OpenAI,
which the company says were created directly by Sora without
modification, show outputs from prompts like “photorealistic
closeup video of two pirate ships battling each other as they sail
inside a cup of coffee” and “historical footage of California
during the gold rush”.
At first glance, it is often hard to tell that the videos are AI-generated: the image quality, textures, scene dynamics, camera movements, and overall consistency are remarkably good.
OpenAI chief executive Sam Altman also posted some videos to
X (formerly Twitter) generated in response to user-suggested
prompts, to demonstrate Sora’s capabilities.

How does Sora work?


Sora combines features of text- and image-generating tools in what is called a “diffusion transformer model”. Transformers
are a type of neural network first introduced by Google in 2017.
They are best known for their use in large language models
such as ChatGPT and Google Gemini. Diffusion models, on the
other hand, are the foundation of many AI image generators.
They work by starting with random noise and iterating towards
a “clean” image that fits an input prompt.
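To make the diffusion idea concrete, here is a toy sketch in Python. The denoiser is a placeholder for the large learned network a real system would use, and the flat grey "target" simply stands in for whatever image the prompt describes; nothing here reflects Sora's actual implementation.

```python
import numpy as np

def toy_denoiser(noisy_image: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Stand-in for a learned model that predicts a slightly cleaner image."""
    return noisy_image + 0.1 * (target - noisy_image)

rng = np.random.default_rng(0)
target = np.full((64, 64, 3), 0.5)        # pretend the prompt maps to this "clean" image
image = rng.normal(size=(64, 64, 3))      # start from pure random noise
for _ in range(50):                       # iterate step by step toward the clean image
    image = toy_denoiser(image, target)
```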

A video can be made from a sequence of such images.


However, in a video, coherence and consistency between
frames are essential. Sora uses the transformer architecture to
handle how frames relate to one another. While transformers
were initially designed to find patterns in tokens representing
text, Sora instead uses tokens representing small patches of
space and time.
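To picture what "patches of space and time" might look like in practice, the sketch below cuts a video tensor into small 3-D blocks and flattens each block into a token vector. The patch sizes and video dimensions are illustrative assumptions, not Sora's actual configuration.

```python
import numpy as np

video = np.random.rand(16, 128, 128, 3)       # (frames, height, width, channels)
pt, ph, pw = 4, 16, 16                        # patch size in time, height, width (assumed)

T, H, W, C = video.shape
patches = (video
           .reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
           .transpose(0, 2, 4, 1, 3, 5, 6)    # group the patch-grid dimensions first
           .reshape(-1, pt * ph * pw * C))    # one flattened token per spacetime patch

print(patches.shape)                          # (256, 3072): 256 tokens, like words in a sentence
```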
Solving temporal consistency
One area of innovation in Sora is that it considers several video
frames at once, which solves the problem of keeping objects
consistent when they move in and out of view. In one of the sample videos, a kangaroo's hand moves out of the shot several times, and when it returns, the hand looks the same as before.
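One way to picture this is that attention is computed over patch tokens drawn from every frame at once, so a patch from an early frame can directly influence patches many frames later. The minimal single-head attention sketch below illustrates the idea; the token count and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.random((256, 64))             # spacetime patch tokens from all frames, 64-dim each
q = k = v = tokens                         # no learned projections in this toy version

scores = q @ k.T / np.sqrt(k.shape[-1])    # every patch attends to every other patch,
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
mixed = weights @ v                        # regardless of which frame it came from
```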
Combining diffusion and transformer models
Sora combines the use of a diffusion model with a transformer
architecture, as used by GPT.
When combining these two model types, Jack Qiao noted that
"diffusion models are great at generating low-level texture but
poor at global composition, while transformers have the
opposite problem." That is, you want a GPT-like transformer
model to determine the high-level layout of the video frames
and a diffusion model to create the details.
In a technical article on the implementation of Sora, OpenAI
provides a high-level description of how this combination
works. In diffusion models, images are broken down into
smaller rectangular "patches." For video, these patches are
three-dimensional because they persist through time. Patches
can be thought of as the equivalent of "tokens" in large
language models: rather than being a component of a
sentence, they are a component of a set of images. The
transformer part of the model organizes the patches, and the
diffusion part of the model generates the content for each
patch.
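As a very rough sketch of this division of labour, the snippet below uses an untrained PyTorch TransformerEncoder as a stand-in for the learned denoiser: the transformer attends over all noisy patch tokens at once (the "layout" role), and its output is used to refine each patch a little at every diffusion step (the "content" role). The sizes and update rule are assumptions for illustration only.

```python
import torch
import torch.nn as nn

num_patches, dim, steps = 256, 64, 10
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
denoiser = nn.TransformerEncoder(layer, num_layers=2)   # attends over all patches at once

tokens = torch.randn(1, num_patches, dim)  # every spacetime patch token starts as pure noise
with torch.no_grad():
    for _ in range(steps):                 # diffusion loop: refine all patches together,
        tokens = tokens + 0.1 * (denoiser(tokens) - tokens)   # one small denoising step at a time
```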
Another quirk of this hybrid architecture is that to make video
generation computationally feasible, the process of creating
patches uses a dimensionality reduction step so that
computation does not need to happen on every single pixel for
every single frame.
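As a purely illustrative example of such a dimensionality reduction, the snippet below shrinks each frame with simple average pooling before any patching happens; a real system would use a learned compression network rather than this crude stand-in.

```python
import numpy as np

video = np.random.rand(16, 128, 128, 3)    # raw pixel video: (frames, height, width, channels)
f = 8                                      # spatial downsampling factor (illustrative)
T, H, W, C = video.shape
latent = video.reshape(T, H // f, f, W // f, f, C).mean(axis=(2, 4))   # crude "encoder"

print(video.size, "pixel values ->", latent.size, "latent values")     # 786432 -> 12288
```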

Sora Video AI: Actual Use Cases in Real Life


Sora can be used to create videos from scratch or extend
existing videos to make them longer. It can also fill in missing
frames from videos. In the same way that text-to-image
generative AI tools have made it dramatically easier to create
images without technical image editing expertise, Sora
promises to make it easier to create videos without video editing experience. Here are some key use cases.
Creative Industries
For filmmakers, visual artists, and designers, Sora opens up new
avenues for creativity. Imagine generating storyboard visuals or
short film sequences directly from a script, significantly
reducing the time and resources needed for conceptualization
and pre-production.
Education and Training
Sora can create detailed educational content, such as historical
reenactments or scientific simulations, making learning more
engaging and visually immersive.
Advertising and Marketing
Brands can leverage Sora to produce eye-catching video
content for marketing campaigns based on textual descriptions
alone, enabling faster turnaround times and creative
experimentation.
Gaming and Virtual Reality
Developers can use Sora to generate dynamic backgrounds,
character interactions, or even entire cutscenes, enhancing the
storytelling aspect of video games and VR experiences.
Whether you're a filmmaker looking to visualize your next
screenplay, an educator aiming to bring history to life, or a
marketer seeking innovative content creation tools, Sora
promises to be a game-changer in the way we conceive and
produce video content.
Prototyping and concept visualization
Even if AI video isn't used in a final product, it can be helpful for
demonstrating ideas quickly. Filmmakers can use AI for
mockups of scenes before they shoot them, and designers can
create videos of products before they build them. For example, a toy company could generate an AI mockup of a new pirate ship toy before committing to manufacturing it at scale.
How Can I Access Sora?
Sora is currently only available to "red team" researchers, that is, experts tasked with trying to identify problems with the model. For example, they will try to generate harmful or misleading content so that OpenAI can mitigate the problems before releasing Sora to the
public. OpenAI has not yet specified a public release date for
Sora, though it is likely to be some time in 2024.
What Does OpenAI Sora Mean for the Future?
There can be little doubt that Sora is ground-breaking. It’s also
clear that the potential for this generative model is vast. What are the implications of Sora for the AI industry and the world?
We can, of course, only take educated guesses. However, here
are some of the ways that Sora may change things, for better or
worse.
Short-term implications of OpenAI Sora
Let’s first take a look at the direct, short-term impacts we might
see from Sora in the wake of its (likely phased) launch to the
public.
A wave of quick wins
In the section above, we’ve already explored some of Sora's
potential use cases. Many of these will likely see quick
adoption if and when Sora is released for public use. This
might include:
-The proliferation of short-form videos for social media and
advertising. Expect creators on X (formerly Twitter), TikTok,
LinkedIn, and others to up the quality of their content with Sora
productions.
-The adoption of Sora for prototyping. Whether it’s
demonstrating new products or showcasing proposed
architectural developments, Sora could become commonplace
for pitching ideas.
-Improved data storytelling. Text-to-video generative AI could
give us more vivid data visualization, better simulations of
models, and interactive ways to explore and present data. That
said, it will be important to see how Sora performs on these
types of prompts.
-Better learning resources. With tools like Sora, learning materials could be greatly enhanced. Complicated concepts can be brought to life, and visual learners gain access to richer learning aids.
Long-term implications of OpenAI Sora
As the dust begins to settle after the public launch of
OpenAI’s Sora, we’ll start to see what the longer-term future
holds. As professionals across a host of industries get their
hands on the tool, there’ll inevitably be some game-changing
uses for Sora. Let’s speculate on what some of these could be:
High-value use cases can be unlocked
It’s possible that Sora (or similar tools) could become a
mainstay in several industries:
-Advanced content creation. We could see Sora as a tool to
speed up production across fields such as VR and AR, video
games, and even traditional entertainment such as TV and
movies. Even if it’s not used directly to create such media, it
could help to prototype and storyboard ideas.
-Personalised entertainment. Of course, we could see an
instance where Sora creates and curates content tailored
specifically to the user. Interactive and responsive media that
are tailored to an individual’s tastes and preferences could
emerge.
-Personalised education. Again, this highly individualized
content could find a home in the education sector, helping
students learn in a way that’s best suited to their needs.
-Real-time video editing. Video content could be edited or re-
produced in real-time to suit different audiences, adapting
aspects such as tone, complexity, or even narrative based on
viewer preferences or feedback.
