How does it differ from Seedance 2.0?

Both generate video and audio jointly. Seedance 2.0 supports omni references (image + video + audio inputs) and goes to 1080p with multi-shot storytelling. Happy Horse leads on multilingual lipsync accuracy and currently ranks first on blind preference tests. Different strengths — pick by what your scene demands.

Can I use my own reference images?

Yes, up to nine reference images on the reference-to-video endpoint. This is useful for locking character appearance, product details, or environment across multi-shot sequences. A single reference image uses the image-to-video endpoint instead, and a request with no references falls back to pure text-to-video.
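As a rough sketch of that routing, the endpoint depends only on how many reference images you supply. The function name and the endpoint labels below are illustrative assumptions, not the platform's actual API identifiers.

```python
def choose_endpoint(num_reference_images: int) -> str:
    """Pick an endpoint label from the number of reference images.

    The returned labels are hypothetical placeholders; the real API
    names are not given on this page.
    """
    if num_reference_images < 0:
        raise ValueError("number of reference images cannot be negative")
    if num_reference_images > 9:
        raise ValueError("reference-to-video accepts at most nine images")
    if num_reference_images == 0:
        return "text-to-video"       # no references: prompt only
    if num_reference_images == 1:
        return "image-to-video"      # a single still to animate
    return "reference-to-video"      # two to nine references
```

For instance, `choose_endpoint(4)` selects the reference-to-video label, while `choose_endpoint(0)` falls back to text-to-video.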

What duration and resolutions are available?

3 to 15 seconds, integer seconds only. Resolutions are 720p and 1080p. Aspect ratios include 16:9, 9:16, 1:1, 4:3, and 3:4 — so it covers landscape, portrait, square, and the photographic in-betweens.
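If you script your generations, a small pre-flight check along these lines can catch invalid settings before you spend credits. It is a minimal sketch: the function name and the string formats for resolution and aspect ratio are assumptions, and only the allowed values come from the answer above.

```python
ALLOWED_RESOLUTIONS = {"720p", "1080p"}
ALLOWED_ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4"}


def validate_settings(duration, resolution, aspect_ratio) -> list[str]:
    """Return a list of problems; an empty list means the settings look valid."""
    problems = []
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(duration, bool) or not isinstance(duration, int):
        problems.append("duration must be a whole number of seconds")
    elif not 3 <= duration <= 15:
        problems.append("duration must be between 3 and 15 seconds")
    if resolution not in ALLOWED_RESOLUTIONS:
        problems.append("resolution must be 720p or 1080p")
    if aspect_ratio not in ALLOWED_ASPECT_RATIOS:
        problems.append("aspect ratio must be one of 16:9, 9:16, 1:1, 4:3, 3:4")
    return problems
```

Returning a list rather than raising on the first failure lets a form or script report every invalid field at once.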

Happy Horse 1.1

Alibaba's flagship. #1-ranked video model with native multilingual lipsync.

Alibaba's #1-ranked video model, voted top by blind human preference. Picture and sound are generated together, with native lip-sync across seven languages: English, Mandarin, Cantonese, Japanese, Korean, German, and French. Start from text, a single image, or up to nine reference images to lock characters and scenes. Version 1.1 adds nine-image references and cheaper 1080p.

Example outputs

“A studio interview close-up of a Japanese chef explaining knife technique in fluent Japanese, soft key light, shallow depth of field, ambient kitchen sounds, 9 seconds”

“Music video shot: a singer performs the chorus to camera in a neon-lit Hong Kong alley, Cantonese vocals, dynamic crane pull-back, light rain on glossy pavement”

“A French perfume commercial: a model whispers the brand tagline in French while light catches the bottle in slow motion, soft glow, intimate audio, 6 seconds”

“Two characters argue across a Berlin café table, German dialogue, handheld two-shot, natural café ambience, golden hour light through window, dramatic close-up at climax”

“A Mandarin-speaking news anchor delivers a financial bulletin at a sleek broadcast desk, locked-off frame, ambient studio audio, professional broadcast aesthetic, 8 seconds”

“Korean K-drama style: a young woman waits at a Seoul subway platform, train arriving, she whispers a single line in Korean as the doors close, melancholy ambient score”

How it works

Describe your scene

Type a detailed prompt describing the video you want, or upload a reference image as a starting frame.

Choose your settings

Pick your resolution and duration. See the credit cost before you generate.

Generate your video

Your video is ready in 1-3 minutes. Download, iterate, or extend the sequence.

Ready to create with Happy Horse 1.1?

Jump into the Studio and start generating. Plans from $11/month.

AI video with audio and lipsync, generated together.

Happy Horse 1.1 is Alibaba's flagship video model, and it currently sits at the top of the Artificial Analysis Video Arena, ranked first by blind human preference across head-to-head comparisons with every other major model. The architecture is unusual: a single 40-layer self-attention Transformer that generates pixels and audio waveforms in the same pass, with no cross-attention modules and no separate audio decoder. The training data and the inference pipeline both treat audio and video as one signal, which is a big part of why its lip-sync and motion feel so natural.

The standout capability is multilingual lipsync. Where most models that handle dialogue are calibrated to English, Happy Horse generates accurate mouth shapes for English, Mandarin, Cantonese, Japanese, Korean, German, and French. This unlocks dialogue-driven content for the world's largest video-consuming markets without leaving the platform. Combined with ambient sound, music, and effects all rendered in the same pass, it produces shot-ready dialogue scenes that previously required separate audio production.

Three endpoints share the same engine: text-to-video for original concepts, image-to-video to animate a still, and reference-to-video, which accepts up to nine reference images so you can lock character appearance, location, and key props across multi-shot sequences. Duration runs 3 to 15 seconds at 720p or 1080p, with five aspect ratios on offer: 16:9, 9:16, 1:1, 4:3, and 3:4. Prompts can run up to 2,500 characters, so you can direct shots in detail.
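When you write long, shot-by-shot prompts, it can help to assemble them from parts and check the 2,500-character cap before submitting. The sketch below is one way to do that. The helper name is hypothetical, and counting with Python's `len()` is an assumption, since this page does not say how the platform counts characters.

```python
MAX_PROMPT_CHARS = 2500  # cap stated on this page


def build_prompt(shots: list[str]) -> str:
    """Join per-shot directions into one prompt and enforce the cap.

    Blank entries are skipped and surrounding whitespace is trimmed.
    len() counts Unicode code points, which is an assumption about how
    the platform measures prompt length.
    """
    prompt = " ".join(s.strip() for s in shots if s.strip())
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(
            f"prompt is {len(prompt)} characters; the cap is {MAX_PROMPT_CHARS}"
        )
    return prompt
```

A call such as `build_prompt(["Shot 1: wide establishing shot.", "Shot 2: close-up on the speaker."])` returns the two directions joined by a single space.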

Native lipsync across seven languages

Most AI video lipsync is calibrated to English phonemes and falls apart on tonal languages or compound German vowels. Happy Horse generates correct mouth shapes for Mandarin tones, Cantonese intonation, Japanese moraic syllables, Korean batchim, German umlauts, and French liaison. For studios serving multilingual audiences, this removes the need for a separate dubbing or re-render pipeline.

One unified model, three creative entry points

Use text-to-video when you're starting from a concept and want maximum creative latitude. Use image-to-video when you have a hero frame — a key still, a product shot, a character portrait — and want it to come alive. Use reference-to-video with two to nine images when you need character or product consistency across a multi-shot scene. The model parameters and quality are identical across all three; only the input shape changes.

Why it tops the rankings

Joint video-audio generation produces fewer alignment artefacts than two-pass pipelines. The 15B-param unified Transformer captures cross-modal patterns — eyelid timing matching speech consonants, footstep impact landing on the correct frame, music swell coinciding with camera pushes — that bolt-on audio cannot. For dialogue-driven content, music videos, and any scene where audio drives visual energy, this is the difference between competent and convincing.

Frequently asked

Questions about Happy Horse 1.1.

Which languages does lipsync support?

Seven languages: English, Mandarin, Cantonese, Japanese, Korean, German, and French. Mouth shapes are accurate per language, not interpolated from English templates.

Built differently

Why Stensyl?

Because creative work doesn't live in one box. A real project spans research, writing, image, video, 3D, motion graphics, editing, audio, and a way to publish it all. Stensyl puts every piece under one roof: dedicated studios for Film, Graphics, Canvas, 3D, 3D Worlds, Motion, Editing, Web, Social, and App, plus Generate for one-shot work, Projects to keep everything tied together, Workflows for repeatable pipelines, Research backed by Perplexity, and Write for proper documents. One login, one credit balance, one bill, one place where your work actually compounds. You stop paying five subscriptions for tools that don't talk to each other.

Also available on Stensyl

Seedance 2.0

ByteDance's flagship. 15s, native audio, multi-ref.

Veo 3.1

Google's flagship. Native 4K. Audio in one pass.

Kling O3

4K at 60fps. Character consistency.

Runway Gen-4.5

Premium cinematic from Runway.

Hailuo 2.3 Pro

Premium camera control, realism.