Professional Documents
Culture Documents
SORA
SORA
SORA
When combining these two model types, Jack Qiao noted that "diffusion models are great at
generating low-level texture but poor at global composition, while transformers have the
opposite problem." That is, you want a GPT-like transformer model to determine the high-level
layout of the video frames and a diffusion model to create the details.
Another quirk of this hybrid architecture is that to make video generation computationally
feasible, the process of creating patches uses a dimensionality reduction step so that
computation does not need to happen on every single pixel for every single frame.