From Images to Video: What Changes?
A video is just a sequence of images — frames — played back fast enough to create the illusion of motion. At 24 frames per second, a 5-second clip is 120 frames. If we already have models that generate excellent single images, why not just generate 120 images one after the other and stitch them together?
Try it and the result is unwatchable. Each frame is generated independently, so the model has no memory of what it produced a frame ago. A person's face drifts in shape, a building's windows rearrange, colours shift randomly between frames. This is the temporal consistency problem: objects should preserve their identity across time, lighting should remain stable unless something changes it, and motion should follow plausible physics. None of this happens when frames are generated in isolation.
The model must understand time. It needs to know that a ball thrown upward should decelerate, that a turning head reveals a profile, that a shadow moves as the sun shifts. The jump from image generation to video generation is analogous to the jump from predicting a single token to generating a coherent paragraph in language models: the model must maintain state, plan ahead, and respect constraints that span the entire sequence.
So what changes architecturally? Three things. First, the VAE must compress not just spatial dimensions but the temporal dimension too. Second, the DiT (or whatever denoiser backbone is used) must process tokens that span space and time, which changes the attention patterns dramatically. Third, the sequence lengths explode, demanding new efficiency techniques. Let's look at each in turn.
Spacetime Patches and the 3D VAE
For image generation, a DiT patchifies a 2D latent into a sequence of flat tokens. For video, we need to extend this idea into three dimensions: height, width, and time.
The 3D VAE. A standard image VAE takes a single frame of shape $H \times W \times 3$ and compresses it to $h \times w \times c$, where $h = H/s$ and $w = W/s$ for some spatial downsampling factor $s$. A 3D VAE extends this to a full video. The input is $F \times H \times W \times 3$ (frames $\times$ height $\times$ width $\times$ RGB channels), and the output is a latent of shape:

$$f \times h \times w \times c, \qquad f = \frac{F}{t_f}, \quad h = \frac{H}{s}, \quad w = \frac{W}{s}$$
Here $t_f$ is the temporal compression factor and $s$ is the spatial compression factor per dimension. Typical values are $t_f = 4$ and $s = 8$. Why compress temporally at all? Because adjacent video frames are extremely redundant — most pixels barely change from one frame to the next. The temporal compression exploits this redundancy.
Let's see what this buys us. For a 5-second clip at 24fps with resolution $512 \times 512$:
# 3D VAE compression for a 5-second, 24fps, 512x512 video
F, H, W, C_in = 120, 512, 512, 3 # 5s * 24fps = 120 frames
t_f, s, c = 4, 8, 16 # temporal 4x, spatial 8x, 16 latent channels
f = F // t_f # compressed frames
h = H // s # compressed height
w = W // s # compressed width
pixel_elements = F * H * W * C_in
latent_elements = f * h * w * c
print(f"Input (pixel space): {F} x {H} x {W} x {C_in} = {pixel_elements:,} values")
print(f"Latent (3D VAE): {f} x {h} x {w} x {c} = {latent_elements:,} values")
print(f"Compression ratio: {pixel_elements / latent_elements:.0f}x")
Spacetime patches. Once we have the 3D latent, we patchify it just as DiT patchifies 2D latents, but now the patches span time as well. Each patch is a small 3D cube of size $p_t \times p_h \times p_w$ (temporal $\times$ height $\times$ width). Flattening a patch of size $p_t \times p_h \times p_w \times c$ gives one token of dimension $p_t \cdot p_h \cdot p_w \cdot c$. The total number of tokens is:

$$T = \frac{f}{p_t} \cdot \frac{h}{p_h} \cdot \frac{w}{p_w}$$
Let's check the boundary cases. If $p_t = p_h = p_w = 1$, every single latent position becomes its own token: $T = f \cdot h \cdot w$, the maximum possible sequence length. If $p_t = f$, $p_h = h$, $p_w = w$, the entire video collapses into a single token — useless but $T = 1$. In practice, patch sizes like $1 \times 2 \times 2$ or $2 \times 2 \times 2$ are common, balancing token count against spatial-temporal resolution.
import json, js
# Token counts for different patch sizes on our 30x64x64 latent
f, h, w = 30, 64, 64
configs = [
("1 x 2 x 2", 1, 2, 2),
("2 x 2 x 2", 2, 2, 2),
("1 x 4 x 4", 1, 4, 4),
("2 x 4 x 4", 2, 4, 4),
]
rows = []
for name, pt, ph, pw in configs:
    T = (f // pt) * (h // ph) * (w // pw)
    rows.append([name, f"{f//pt} x {h//ph} x {w//pw}", f"{T:,}"])
# Compare with a single 1024x1024 image (latent 128x128, patch 2x2)
img_T = (128 // 2) * (128 // 2)
rows.append(["Image: 2 x 2 (1024x1024)", "64 x 64", f"{img_T:,}"])
js.window.py_table_data = json.dumps({
"headers": ["Patch size (t x h x w)", "Token grid", "Token count"],
"rows": rows
})
print("A 5-second video produces 5-30x more tokens than a high-res image.")
print("This is why video generation is so computationally demanding.")
Even with aggressive compression, video token counts easily reach 30,000+. Compare that to roughly 4,000 tokens for a high-resolution image. Since self-attention scales as $O(T^2)$, going from 4,000 to 30,000 tokens increases attention cost by roughly $(30{,}000/4{,}000)^2 \approx 56\times$. This is the core computational challenge of video generation.
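The arithmetic is easy to verify directly:

# Quadratic attention cost: 4,096-token image vs 30,720-token video
img_T = 64 * 64          # 1024x1024 image, latent 128x128, patch 2x2
vid_T = 30 * 32 * 32     # 5-second 512x512 video, patch 1x2x2
print(f"Image pairwise scores: {img_T**2:,}")
print(f"Video pairwise scores: {vid_T**2:,}")
print(f"Cost increase: {(vid_T / img_T) ** 2:.0f}x")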
The OpenAI Sora technical report (Brooks et al., 2024) was the first major publication to describe this spacetime-patch approach in a video DiT, treating video frames as a unified spacetime volume rather than a sequence of independent images.
Temporal Attention: Connecting Frames
Once the video is patchified into spacetime tokens, the DiT must capture relationships across both space and time. How do we design the attention mechanism to handle this?
Full 3D attention. The simplest approach: every spacetime token attends to every other spacetime token. A token at frame 5, position (10, 20) can directly attend to a token at frame 80, position (50, 30). The cost is:

$$O(T^2), \qquad T = f' \cdot h' \cdot w'$$
where $f' = f/p_t$, $h' = h/p_h$, $w' = w/p_w$ are the token grid dimensions and $T$ is the total token count. With $T = 30{,}720$ (patch size $1 \times 2 \times 2$), full attention requires $T^2 \approx 944$ million pairwise scores per head per layer. This is theoretically ideal because every spatial and temporal relationship can be captured directly, but the computational cost makes it impractical for most video lengths and resolutions.
Factored (decomposed) attention. The solution most video models adopt is to factorise attention into two separate operations that alternate within each transformer block:
- Spatial attention: tokens within the same frame attend to each other, but not to tokens in other frames. Each frame is processed as an independent image. Cost: $O(f' \times (h' \cdot w')^2)$.
- Temporal attention: tokens at the same spatial position across all frames attend to each other, creating a "tube" through time. A patch at position (10, 20) in frame 1 attends to position (10, 20) in frames 2, 3, ..., $f'$. Cost: $O(h' \cdot w' \times f'^2)$.
The total cost of factored attention per block is:

$$O\left(f' \cdot (h' \cdot w')^2 + h' \cdot w' \cdot f'^2\right)$$
Let's check the boundary cases to understand the savings. When $f' = 1$ (a single frame), there is no temporal attention, and the cost reduces to $O((h' \cdot w')^2)$, which is just standard image attention — exactly what we would expect. When $h' = w' = 1$ (a single spatial position), there is no spatial attention, and the cost reduces to $O(f'^2)$ — pure temporal modelling. In the typical video case where $f' = 30$, $h' = w' = 32$ (patch $1 \times 2 \times 2$), full 3D costs $O((30 \cdot 1024)^2) = O(9.4 \times 10^8)$, while factored costs $O(30 \cdot 1024^2 + 1024 \cdot 900) = O(32.4 \times 10^6)$ — roughly 29 times cheaper.
# Compare full 3D vs factored attention cost
fp, hp, wp = 30, 32, 32 # token grid after patchifying 30x64x64 with patch 1x2x2
T = fp * hp * wp
full_3d = T ** 2
spatial = fp * (hp * wp) ** 2
temporal = hp * wp * fp ** 2
factored = spatial + temporal
print(f"Token count T = {T:,}")
print(f"Full 3D attention: {full_3d:,.0f} pairwise scores")
print(f"Factored (spatial): {spatial:,.0f}")
print(f"Factored (temporal): {temporal:,.0f}")
print(f"Factored (total): {factored:,.0f}")
print(f"Speedup: {full_3d / factored:.1f}x")
The tradeoff is that factored attention cannot capture diagonal spatiotemporal patterns directly — an object that moves to a different spatial position over time requires information to flow through spatial attention first, then temporal attention, taking at least two layers to connect. Full 3D attention would capture this in one layer. In practice, stacking many factored-attention blocks provides enough layers for information to propagate, and some architectures add occasional full 3D attention blocks at lower resolutions to capture long-range spatiotemporal dependencies.
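To make the factorisation concrete, here is a minimal NumPy sketch of one factored block, using plain single-head attention with no learned projections, residual connections, or normalisation (all of which a real block would add), and a smaller token grid than the $30 \times 32 \times 32$ example to keep the demo cheap. Spatial attention treats each frame as a batch element; temporal attention treats each spatial position as a batch element.

import numpy as np

def attention(x):
    # Plain single-head attention with Q = K = V = x; x has shape (batch, seq, dim)
    d = x.shape[-1]
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

fp, hp, wp, d = 8, 16, 16, 32                           # small token grid (f', h', w') and token dim
x = np.random.randn(fp, hp * wp, d).astype(np.float32)  # (frames, positions per frame, dim)

# Spatial attention: frames are the batch; tokens attend only within their own frame
x = attention(x)
# Temporal attention: spatial positions are the batch; tokens attend across frames ("tubes")
x = attention(x.transpose(1, 0, 2)).transpose(1, 0, 2)

print(x.shape)  # (8, 256, 32): same shape, but information has now mixed over space and time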
Sora and the World Simulator Vision
In February 2024, OpenAI revealed Sora (Brooks et al., 2024), a video generation model that fundamentally shifted what the field considered possible. The generated clips showed coherent multi-second scenes with realistic camera motion, consistent characters, and plausible (though imperfect) physics. More importantly, the accompanying technical report framed video generation not as a content-creation tool but as a path toward building world simulators — models that learn how the physical world works from video data alone.
Architecture. Based on the technical report, Sora uses:
- Spacetime patches on compressed video latents, as described in the previous sections.
- A DiT backbone with spatiotemporal attention (likely factored, given the sequence lengths involved).
- Text conditioning via a language encoder, steering the denoising process with text prompts.
- Variable resolution and duration: rather than training on a single fixed size, Sora handles different video dimensions natively. The spacetime-patch approach makes this natural — a longer video simply produces more tokens, and flexible positional embeddings adapt to the varying grid shapes (a sketch of one such scheme follows below).
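The report does not say how these positional embeddings are constructed, but the idea is easy to illustrate. Below is a minimal sketch assuming factorised sinusoidal embeddings: each axis (frame, row, column) gets its own 1D embedding sized to the current grid, and the three are concatenated per token, so the same code handles any $f' \times h' \times w'$ token grid.

import numpy as np

def sinusoidal(positions, dim):
    # Standard 1D sinusoidal embedding for a vector of integer positions
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    angles = positions[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)   # (len(positions), dim)

def spacetime_pos_emb(fp, hp, wp, dim_per_axis=32):
    # One 1D embedding per axis, broadcast over the full f' x h' x w' token grid
    ef = sinusoidal(np.arange(fp), dim_per_axis)
    eh = sinusoidal(np.arange(hp), dim_per_axis)
    ew = sinusoidal(np.arange(wp), dim_per_axis)
    grid = np.concatenate([
        np.broadcast_to(ef[:, None, None, :], (fp, hp, wp, dim_per_axis)),
        np.broadcast_to(eh[None, :, None, :], (fp, hp, wp, dim_per_axis)),
        np.broadcast_to(ew[None, None, :, :], (fp, hp, wp, dim_per_axis)),
    ], axis=-1)
    return grid.reshape(fp * hp * wp, 3 * dim_per_axis)                # one embedding per token

# The same code handles a short low-resolution clip and a longer high-resolution one
print(spacetime_pos_emb(8, 16, 16).shape)    # (2048, 96)
print(spacetime_pos_emb(30, 32, 32).shape)   # (30720, 96)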
The "world simulator" framing. The technical report highlights emergent capabilities that were not explicitly trained for. Sora can generate 3D-consistent scenes (rotating around an object maintains its geometry), simulate simple interactions (a ball bouncing, water splashing), and even perform basic image editing by conditioning on a starting frame. The argument is that by training on enough video data at sufficient scale, the model is forced to build internal representations of 3D geometry, physics, and object permanence — not because it was given a physics engine, but because predicting the next frame requires understanding how the world works.
Limitations. Despite the impressive demos, Sora regularly produces physically impossible results: objects that phase through each other, liquids that defy gravity, hands with incorrect finger counts, and scenes where cause-and-effect relationships break down. Temporal coherence degrades noticeably for longer videos (beyond 10 seconds), and complex multi-object interactions remain challenging. These failures show that the model has learned statistical regularities of video, not actual physics.
The Open-Source Video Generation Landscape
Sora demonstrated what was possible but remained closed-source. Throughout 2024 and 2025, a wave of open-weight and commercial video generation models appeared, many of them matching or approaching Sora's quality. Here is the current landscape.
Wan 2.1 (Alibaba, 2025). A fully open-source model combining flow matching with a 3D VAE and a DiT backbone. Available in 1.3B and 14B parameter variants, it generates 480p to 720p video at up to 5 seconds. Wan is notable for strong prompt adherence (the generated video closely follows the text description) and for being one of the most capable fully open models (Wan-Video, 2025).
HunyuanVideo (Tencent, 2024). A unified image-and-video generation framework using a dual-stream DiT architecture — separate transformer streams for text and video tokens that interact through cross-attention layers. Generates 720p video at 5+ seconds with open weights. The dual-stream design allows the text and visual pathways to develop specialised representations before merging, which improves text-video alignment (Kong et al., 2024).
CogVideoX (Zhipu AI, 2024). Uses an expert transformer architecture with a 3D VAE and a progressive training strategy: the model is first trained on images, then short video clips, then longer videos. This curriculum approach lets it learn spatial quality first, then temporal coherence, which stabilises training and improves final quality. Available in 2B and 5B parameter variants (Yang et al., 2024).
LTX Video (Lightricks, 2024). Designed specifically for real-time-capable video generation. Uses a video VAE with a transformer backbone, optimised for latency over maximum quality. The target is interactive applications — video editing tools, live previews — where generating a few seconds of video in under a minute matters more than photorealism (Lightricks, 2024).
Veo 2 (Google DeepMind, 2024-2025). A high-quality model producing 1080p, minute-long videos. Architecture details are sparse, but it likely uses a cascaded DiT approach (generate at low resolution, then super-resolve). Available through Google's API. Veo 2 is notable for generating some of the longest coherent clips in the field (Google DeepMind, 2024).
Runway Gen-3 Alpha (Runway, 2024). A commercial model with high visual quality and strong motion coherence. Architecture details are undisclosed. Gen-3 Alpha is widely used in the creative industry for short-form video generation and editing.
Kling 1.6 (Kuaishou, 2024-2025). A commercial model competitive with Sora in quality benchmarks. Architecture details are undisclosed, but the model is accessible through an API and has gained significant adoption, especially in the Chinese market.
The table below compares the key properties of these models.
import json, js
rows = [
["Sora (OpenAI)", "Closed", "DiT + 3D VAE", "Up to 1080p", "Up to 60s", "2024"],
["Wan 2.1 (Alibaba)", "Open", "DiT + Flow Matching + 3D VAE", "480p-720p", "Up to 5s", "2025"],
["HunyuanVideo (Tencent)", "Open", "Dual-stream DiT + 3D VAE", "720p", "5+ s", "2024"],
["CogVideoX (Zhipu AI)", "Open", "Expert Transformer + 3D VAE", "720p", "Up to 6s", "2024"],
["LTX Video (Lightricks)", "Open", "Transformer + Video VAE", "720p", "Up to 5s", "2024"],
["Veo 2 (Google DeepMind)", "Closed", "Likely cascaded DiT", "1080p", "Up to 60s+", "2024-25"],
["Gen-3 Alpha (Runway)", "Closed", "Undisclosed", "1080p", "Up to 10s", "2024"],
["Kling 1.6 (Kuaishou)", "Closed", "Undisclosed", "1080p", "Up to 10s", "2024-25"],
]
js.window.py_table_data = json.dumps({
"headers": ["Model", "Weights", "Architecture", "Resolution", "Duration", "Year"],
"rows": rows
})
print("Open models (Wan, HunyuanVideo, CogVideoX, LTX) now cover the 480p-720p range.")
print("Closed models (Sora, Veo 2) push higher resolution and longer duration.")
print("The gap is closing rapidly — 2024 open models rival 2023 closed ones.")
The Challenges Ahead
Video generation in 2025 is roughly where image generation was in 2022: the results are impressive in short demos, but the technology is far from a solved problem. Several fundamental challenges remain.
Temporal consistency over long durations. Maintaining a character's identity, the style of a scene, and the rules of physics across many frames is hard. Most models produce visually coherent output for 2-5 seconds, but quality degrades noticeably beyond 10 seconds. Characters subtly change appearance, objects appear and disappear, and motion becomes unnatural. The attention mechanism can only connect frames that fall within its effective context window, and with factored attention, long-range temporal dependencies must propagate through many layers.
Compute cost. Video generation is 10-100$\times$ more expensive than image generation, depending on duration and resolution. A 5-second 720p clip can take minutes to generate on high-end hardware. The $O(T^2)$ attention cost (or $O(T)$ with linear attention variants) combined with many denoising steps makes inference expensive. Training costs are proportionally enormous: training a state-of-the-art video model takes thousands of GPUs running for weeks, on the order of millions of GPU-hours.
Length. Generating minute-long coherent video is still largely unsolved. Most open models top out at 5-10 seconds. Some closed models (Sora, Veo 2) claim longer durations, but with noticeable quality degradation. Approaches like autoregressive chunk generation (generate 5 seconds, condition on the last frame, generate the next 5 seconds) can extend duration but introduce visible seams and drift.
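A minimal sketch of that chunked approach, with generate_chunk as a hypothetical stand-in for a real image-conditioned video model (here it just returns frame indices so the control flow runs):

# Hypothetical sketch of autoregressive chunk generation; generate_chunk stands in
# for a real image-conditioned video model and here just returns frame indices.
def generate_chunk(prompt, first_frame, num_frames):
    start = 0 if first_frame is None else first_frame + 1
    return list(range(start, start + num_frames))

def generate_long_video(prompt, n_chunks, chunk_frames=120):
    video, context_frame = [], None            # the first chunk has no conditioning frame
    for _ in range(n_chunks):
        chunk = generate_chunk(prompt, first_frame=context_frame, num_frames=chunk_frames)
        video.extend(chunk)
        context_frame = chunk[-1]              # only one frame of memory carries over: seams and drift
    return video

print(len(generate_long_video("a dog running on a beach", n_chunks=4)))  # 480 frames = 20s at 24fps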
Controllability. Text prompts alone are a coarse control mechanism for video. Users often need to specify camera motion ("slow dolly forward"), character actions ("she turns and walks to the door"), scene transitions ("cut to a close-up"), or timing ("the explosion happens at the 3-second mark"). Current models offer limited support for these fine-grained controls. Some systems are exploring additional conditioning signals: reference images for character consistency, depth maps for camera motion, and pose sequences for character animation.
Real-time generation. Interactive video generation — where a user can steer the output in real time — requires generating frames faster than the playback rate (24+ fps). Current models take minutes to produce seconds of video, placing us orders of magnitude away from real-time. Distillation techniques (reducing the number of denoising steps) and architectural optimisations are active research areas, but real-time high-quality video generation remains a distant goal.
The rapid progress from 2023 to 2025 suggests that many of these challenges will be partially addressed in the near term. But truly controllable, long-duration, real-time video generation at high quality remains one of the hardest open problems in generative AI.
Quiz
Test your understanding of video generation architectures and challenges.
Why does generating each video frame independently with an image model produce poor results?
In factored spatiotemporal attention, what is the key tradeoff compared to full 3D attention?
A 3D VAE with temporal compression $t_f = 4$ and spatial compression $s = 8$ receives a video of 80 frames at $256 \times 256$. What is the latent shape (ignoring channels)?
Which architectural pattern has converged across most state-of-the-art video generation models (Sora, Wan, HunyuanVideo, CogVideoX)?