Happy Horse 1.0

Alibaba's flagship. #1-ranked video model with native multilingual lipsync.

Alibaba's 15-billion-parameter video generation model. Ranked #1 on the Artificial Analysis Video Arena by blind human preference. A unified 40-layer self-attention Transformer generates video and audio jointly in a single forward pass, with no separate audio model and no post-sync. Native lipsync across English, Mandarin, Cantonese, Japanese, Korean, German, and French. Three endpoints behind one engine: text-to-video, image-to-video, and reference-to-video with up to five image references.

Example outputs

Happy Horse 1.0 example 1

A studio interview close-up of a Japanese chef explaining knife technique in fluent Japanese, soft key light, shallow depth of field, ambient kitchen sounds, 9 seconds

Happy Horse 1.0 example 2

Music video shot: a singer performs the chorus to camera in a neon-lit Hong Kong alley, Cantonese vocals, dynamic crane pull-back, light rain on glossy pavement

Happy Horse 1.0 example 3

A French perfume commercial: a model whispers the brand tagline in French while light catches the bottle in slow motion, soft glow, intimate audio, 6 seconds

Happy Horse 1.0 example 4

Two characters argue across a Berlin café table, German dialogue, handheld two-shot, natural cafe ambience, golden hour light through window, dramatic close-up at climax

Happy Horse 1.0 example 5

A Mandarin-speaking news anchor delivers a financial bulletin at a sleek broadcast desk, locked-off frame, ambient studio audio, professional broadcast aesthetic, 8 seconds

Happy Horse 1.0 example 6

Korean K-drama style: a young woman waits at a Seoul subway platform, train arriving, she whispers a single line in Korean as the doors close, melancholy ambient score

How it works

01

Describe your scene

Type a detailed prompt describing the video you want, or upload a reference image as a starting frame.

02

Choose your settings

Pick your resolution and duration. See the credit cost before you generate.

03

Generate your video

Your video is ready in 1-3 minutes. Download, iterate, or extend the sequence.

Ready to create with Happy Horse 1.0?

Jump into the Studio and start generating. Plans from £10/month.

AI video with audio and lipsync, generated together.

Happy Horse 1.0 is Alibaba's flagship video generation model, released to fal.ai's enterprise platform in April 2026. It currently sits at the top of the Artificial Analysis Video Arena, ranked first by blind human preference across head-to-head comparisons with every other major model. The architecture is unusual: a single 40-layer self-attention Transformer that generates pixels and audio waveforms in the same pass, with no cross-attention modules and no separate audio decoder. The training data and the inference pipeline both treat audio and video as one signal.

The standout capability is multilingual lipsync. Where most models that handle dialogue are calibrated to English, Happy Horse generates accurate mouth shapes for English, Mandarin, Cantonese, Japanese, Korean, German, and French. This unlocks dialogue-driven content for the world's largest video-consuming markets without leaving the platform. Combined with ambient sound, music, and effects all rendered in the same pass, it produces shot-ready dialogue scenes that previously required separate audio production.

Three endpoints share the same engine. Text-to-video for original concepts. Image-to-video to animate a still. Reference-to-video accepts up to five reference images so you can lock character appearance, location, and key props across multi-shot sequences. Duration runs 3-15 seconds at 720p or 1080p, with five aspect ratios on offer: landscape, portrait, square, and two variants close to 4:3. The prompt limit of 2,500 characters leaves room to direct shots in detail.
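The limits above can be checked before you spend credits. Here is a minimal sketch in Python, assuming you keep your settings in a small local structure. The names `GenerationSettings` and `validate_settings` are hypothetical and are not part of any Happy Horse or Stensyl API; only the numeric limits come from the text.

```python
from dataclasses import dataclass

# Limits quoted in the text above. The constant names are illustrative only.
MIN_DURATION_S = 3
MAX_DURATION_S = 15
RESOLUTIONS = ("720p", "1080p")
MAX_PROMPT_CHARS = 2500


@dataclass(frozen=True)
class GenerationSettings:
    """Hypothetical local container for one generation request."""
    prompt: str
    duration_s: int
    resolution: str


def validate_settings(settings: GenerationSettings) -> list[str]:
    """Return a list of problems; an empty list means the settings fit the stated limits."""
    problems = []
    if not settings.prompt.strip():
        problems.append("prompt is empty")
    if len(settings.prompt) > MAX_PROMPT_CHARS:
        problems.append(
            f"prompt is {len(settings.prompt)} characters; the limit is {MAX_PROMPT_CHARS}"
        )
    if not MIN_DURATION_S <= settings.duration_s <= MAX_DURATION_S:
        problems.append(
            f"duration must be {MIN_DURATION_S}-{MAX_DURATION_S} seconds, got {settings.duration_s}"
        )
    if settings.resolution not in RESOLUTIONS:
        problems.append(
            f"resolution must be one of {RESOLUTIONS}, got {settings.resolution!r}"
        )
    return problems


if __name__ == "__main__":
    ok = GenerationSettings("A Japanese chef explains knife technique", 9, "1080p")
    print(validate_settings(ok))  # no problems expected for in-range settings
```

Returning every problem at once, rather than raising on the first, lets you fix a prompt and its settings in one pass.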

Native lipsync across seven languages

Most AI video lipsync is calibrated to English phonemes and falls apart on tonal languages or compound German vowels. Happy Horse generates correct mouth shapes for Mandarin tones, Cantonese intonation, Japanese moraic syllables, Korean batchim, German umlauts, and French liaison. For studios serving multilingual audiences, this removes the need for a separate dubbing or re-render pipeline.

One unified model, three creative entry points

Use text-to-video when you're starting from a concept and want maximum creative latitude. Use image-to-video when you have a hero frame — a key still, a product shot, a character portrait — and want it to come alive. Use reference-to-video with two to five images when you need character or product consistency across a multi-shot scene. The model parameters and quality are identical across all three; only the input shape changes.
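The choice between the three entry points follows from how many images you supply. A short Python sketch of that rule of thumb follows. The function name `choose_endpoint` is hypothetical, and mapping exactly one image to image-to-video is an assumption drawn from the guidance above rather than a documented constraint.

```python
MAX_REFERENCE_IMAGES = 5  # "up to five reference images", per the text above


def choose_endpoint(image_count: int) -> str:
    """Suggest an entry point from the number of input images (illustrative only)."""
    if image_count < 0:
        raise ValueError("image_count cannot be negative")
    if image_count == 0:
        return "text-to-video"       # starting from a concept
    if image_count == 1:
        return "image-to-video"      # animating a single hero frame (assumed mapping)
    if image_count <= MAX_REFERENCE_IMAGES:
        return "reference-to-video"  # consistency across a multi-shot scene
    raise ValueError(
        f"at most {MAX_REFERENCE_IMAGES} reference images are supported, got {image_count}"
    )
```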

Why it tops the rankings

Joint video-audio generation produces fewer alignment artefacts than two-pass pipelines. The 15-billion-parameter unified Transformer captures cross-modal patterns that bolt-on audio cannot: eyelid timing matching speech consonants, footstep impact landing on the correct frame, music swell coinciding with camera pushes. For dialogue-driven content, music videos, and any scene where audio drives visual energy, this is the difference between competent and convincing.

Frequently asked

Questions about Happy Horse 1.0.

Seven languages: English, Mandarin, Cantonese, Japanese, Korean, German, and French. Mouth shapes are accurate per language, not interpolated from English templates.

Built differently

Why Stensyl?