
ByteDance's flagship. 15s video with native audio in a single generation.
ByteDance's Multi-Modal Diffusion Transformer. Dual-branch architecture generates video and audio simultaneously with millisecond synchronisation. Up to 15 seconds at 720p with native stereo audio. Multi-modal reference system accepts up to 9 reference images for character and scene consistency. Flow Matching framework delivers 30% faster generation than Seedance 1.5.

“Armoured hero landing on a rain-soaked rooftop, shockwave cracking concrete, cape billowing, lightning illuminating a neon cityscape behind, dramatic low angle, volumetric rain, Unreal Engine 5 cinematic”

“High-speed car chase along a sun-drenched Miami coastal highway, matte black supercar drifting sideways through an intersection, tyre smoke, palm trees blurring, helicopter tracking shot, golden hour lens flare”

“Anime warrior standing on a floating crystal platform above clouds, glowing energy sword raised, a colossal dragon emerging from the storm below, cel-shaded rendering, volumetric god rays, epic fantasy”

“Cyberpunk samurai walking through a neon-drenched market street in futuristic Tokyo, holographic ads floating overhead, rain cascading off a translucent umbrella, katana on back, atmospheric fog, chrome reflections”

“Open-world gameplay shot: figure on a motorcycle cresting a hill overlooking a vast coastal city at sunset, ocean to the horizon, winding highway below, photorealistic, cinematic colour grading”

“Colossal mech robot emerging from stormy ocean waves, searchlights cutting through spray and fog, fighter jets banking away, lightning illuminating armour plating, dramatic low angle, IMAX scale, teal and orange”
Type a detailed prompt describing the video you want, or upload a reference image as a starting frame.
Pick your resolution and duration. See the credit cost before you generate.
Your video is ready in 1-3 minutes. Download, iterate, or extend the sequence.
Jump into the Studio and start generating. Plans from £10/month.
Seedance 2.0 is ByteDance's flagship video generation model. It uses a dual-branch Multi-Modal Diffusion Transformer: one branch generates video frames, the other generates audio waveforms, connected by a cross-attention bridge that synchronises them at every step. The result is video and audio created together in a single pass, not audio bolted on after the fact.
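For readers who want a feel for what a cross-attention bridge between two branches means, here is a minimal sketch in plain Python. It is not Seedance's implementation: the function names, the tiny token vectors, the absence of learned projection weights, and the residual update are all assumptions made for illustration. It only shows the general idea that each branch reads from the other at a given step.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys, values):
    """Each query token takes a weighted average of the other branch's values.

    Hypothetical simplification: no learned projections, single head.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    width = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(width)])
    return out

def bridge_step(video_tokens, audio_tokens):
    """One illustrative exchange: video reads audio, audio reads video.

    Both updates are computed from the tokens as they were before the step,
    then added back as residuals (an assumption, not a documented detail).
    """
    video_update = cross_attend(video_tokens, audio_tokens, audio_tokens)
    audio_update = cross_attend(audio_tokens, video_tokens, video_tokens)
    new_video = [[a + b for a, b in zip(tok, upd)]
                 for tok, upd in zip(video_tokens, video_update)]
    new_audio = [[a + b for a, b in zip(tok, upd)]
                 for tok, upd in zip(audio_tokens, audio_update)]
    return new_video, new_audio
```

Calling `bridge_step([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]])` returns updated video and audio token lists of the same shapes as the inputs. In a real model this exchange would sit inside each transformer block with trained weights.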
The multi-modal reference system is the standout capability. Feed Seedance 2.0 up to 9 reference images for character appearance and scene consistency. Use @1, @2, and so on in your prompt to direct specific references: '@1 walks through the market while @2 watches from a balcony.' The model decouples content from motion, letting you combine a character from one reference with camera movement from another. This is directing, not prompting.
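If you build prompts programmatically, it can help to check the @N tags before you spend credits. The sketch below is a hypothetical helper, not part of any Seedance or Studio API. It assumes tags are numbered from 1 in upload order (as the @1, @2 example suggests) and uses the 9-image limit stated above.

```python
import re

MAX_REFERENCES = 9  # limit stated on this page

# Matches "@" followed by digits, e.g. "@1" or "@12".
_TAG = re.compile(r"@(\d+)")

def find_reference_tags(prompt: str) -> list[int]:
    """Return the distinct @N indices used in a prompt, in ascending order."""
    return sorted({int(n) for n in _TAG.findall(prompt)})

def check_prompt(prompt: str, num_images: int) -> list[int]:
    """Return the tags used, or raise ValueError if any tag has no image.

    Assumes 1-based numbering in upload order (an assumption, not a
    documented rule).
    """
    if not 0 <= num_images <= MAX_REFERENCES:
        raise ValueError(f"expected 0 to {MAX_REFERENCES} images, got {num_images}")
    tags = find_reference_tags(prompt)
    missing = [t for t in tags if t < 1 or t > num_images]
    if missing:
        raise ValueError(f"prompt uses tags with no matching image: {missing}")
    return tags
```

For example, `check_prompt("@1 walks through the market while @2 watches from a balcony.", num_images=2)` returns `[1, 2]`, while the same prompt with only one uploaded image raises a `ValueError`.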
Output reaches 15 seconds at up to 1080p and 24 fps, with native stereo audio. Flow Matching replaces traditional Gaussian diffusion with a more direct mathematical path from noise to output, delivering 30% faster generation than Seedance 1.5 while improving quality. 480p and 720p options are available for faster iteration at lower credit cost.
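To make "a more direct path from noise to output" concrete, here is a toy sketch of the flow-matching idea in plain Python. It assumes the straight-line path between a noise sample and a data sample that is common in flow-matching work; this page does not say which path Seedance uses. The `oracle` velocity function in the usage note stands in for a trained network.

```python
def interpolate(noise, data, t):
    """Point on the straight line from noise (t=0) to data (t=1)."""
    return [(1 - t) * n + t * d for n, d in zip(noise, data)]

def target_velocity(noise, data):
    """Velocity along that line. It is constant: data minus noise.

    In training, a network would be regressed onto this target.
    """
    return [d - n for n, d in zip(noise, data)]

def euler_sample(start, velocity_fn, steps=8):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(start)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

With `noise = [0.5, -1.0]`, `data = [2.0, 3.0]` and `oracle = lambda x, t: target_velocity(noise, data)`, calling `euler_sample(noise, oracle)` walks from the noise sample to the data sample in a handful of steps. Because the ideal velocity field is this simple, fewer sampling steps can be enough, which is one plausible reading of the speed-up described above.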
Professional video generation. Plans from £10/month.