
Alibaba's flagship. #1-ranked video model with native multilingual lipsync.
Alibaba's 15-billion-parameter video generation model. Ranked #1 on the Artificial Analysis Video Arena by blind human preference. A unified 40-layer self-attention Transformer generates video and audio jointly in a single forward pass, with no separate audio model and no post-sync. Native lipsync across English, Mandarin, Cantonese, Japanese, Korean, German, and French. Three endpoints behind one engine: text-to-video, image-to-video, and reference-to-video with up to five image references.

“A studio interview close-up of a Japanese chef explaining knife technique in fluent Japanese, soft key light, shallow depth of field, ambient kitchen sounds, 9 seconds”

“Music video shot: a singer performs the chorus to camera in a neon-lit Hong Kong alley, Cantonese vocals, dynamic crane pull-back, light rain on glossy pavement”

“A French perfume commercial: a model whispers the brand tagline in French while light catches the bottle in slow motion, soft glow, intimate audio, 6 seconds”

“Two characters argue across a Berlin café table, German dialogue, handheld two-shot, natural café ambience, golden hour light through window, dramatic close-up at climax”

“A Mandarin-speaking news anchor delivers a financial bulletin at a sleek broadcast desk, locked-off frame, ambient studio audio, professional broadcast aesthetic, 8 seconds”

“Korean K-drama style: a young woman waits at a Seoul subway platform, train arriving, she whispers a single line in Korean as the doors close, melancholy ambient score”
Type a detailed prompt describing the video you want, or upload a reference image as a starting frame.
Pick your resolution and duration. See the credit cost before you generate.
Your video is ready in 1-3 minutes. Download, iterate, or extend the sequence.
Jump into the Studio and start generating. Plans from £10/month.
Happy Horse 1.0 is Alibaba's flagship video generation model, released to fal.ai's enterprise platform in April 2026. It currently sits at the top of the Artificial Analysis Video Arena, ranked first by blind human preference across head-to-head comparisons with every other major model. The architecture is unusual: a single 40-layer self-attention Transformer that generates pixels and audio waveforms in the same pass, with no cross-attention modules and no separate audio decoder. The training data and the inference pipeline both treat audio and video as one signal.
The standout capability is multilingual lipsync. Where most models that handle dialogue are calibrated to English, Happy Horse generates accurate mouth shapes for English, Mandarin, Cantonese, Japanese, Korean, German, and French. This unlocks dialogue-driven content for the world's largest video-consuming markets without leaving the platform. Combined with ambient sound, music, and effects all rendered in the same pass, it produces shot-ready dialogue scenes that previously required separate audio production.
Three endpoints share the same engine. Text-to-video for original concepts. Image-to-video to animate a still. Reference-to-video accepts up to five reference images so you can lock character appearance, location, and key props across multi-shot sequences. Duration runs from 3 to 15 seconds at 720p or 1080p, with five aspect ratios on offer: full landscape, portrait, square, and the two 4:3 variants. A prompt cap of 2,500 characters leaves room to direct shots in detail.
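For anyone scripting against these limits, the constraints above can be checked before a request is sent. The sketch below is a minimal illustration, not the actual API schema: the `VideoRequest` fields, the mode names, and the `validate` helper are all hypothetical, and the rule that image-to-video takes exactly one image is an assumption drawn from "animate a still". Aspect ratios are left out because the exact values are not listed here.

```python
from dataclasses import dataclass, field

# Limits taken from the description above.
MIN_DURATION_S = 3
MAX_DURATION_S = 15
RESOLUTIONS = {"720p", "1080p"}
MAX_REFERENCE_IMAGES = 5
MAX_PROMPT_CHARS = 2500

# Hypothetical mode names; the real endpoint identifiers are not given here.
MODES = {"text-to-video", "image-to-video", "reference-to-video"}


@dataclass
class VideoRequest:
    """Hypothetical request shape, used only for illustration."""
    mode: str
    prompt: str
    duration_s: int
    resolution: str
    images: list[str] = field(default_factory=list)


def validate(req: VideoRequest) -> list[str]:
    """Return a list of problems; an empty list means the request fits the stated limits."""
    problems = []
    if req.mode not in MODES:
        problems.append(f"unknown mode: {req.mode}")
    if not req.prompt:
        problems.append("prompt is empty")
    elif len(req.prompt) > MAX_PROMPT_CHARS:
        problems.append(f"prompt exceeds {MAX_PROMPT_CHARS} characters")
    if not MIN_DURATION_S <= req.duration_s <= MAX_DURATION_S:
        problems.append(f"duration must be {MIN_DURATION_S}-{MAX_DURATION_S} seconds")
    if req.resolution not in RESOLUTIONS:
        problems.append(f"resolution must be one of {sorted(RESOLUTIONS)}")
    # Assumption: image-to-video animates exactly one still.
    if req.mode == "image-to-video" and len(req.images) != 1:
        problems.append("image-to-video expects exactly one image")
    # Stated limit: reference-to-video accepts up to five reference images.
    if req.mode == "reference-to-video" and not 1 <= len(req.images) <= MAX_REFERENCE_IMAGES:
        problems.append(f"reference-to-video expects 1-{MAX_REFERENCE_IMAGES} images")
    return problems


if __name__ == "__main__":
    request = VideoRequest(
        mode="reference-to-video",
        prompt="A Japanese chef explains knife technique, soft key light",
        duration_s=9,
        resolution="1080p",
        images=["chef.png", "kitchen.png"],
    )
    print(validate(request) or "request fits the stated limits")
```

Returning a list of problems instead of raising on the first one lets a caller show every issue at once, which suits a form where resolution, duration, and references are picked together.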
Professional video generation. Plans from £10/month.