HeyGen

The gold standard in AI talking heads and audio lipsync. Three models, one umbrella.

HeyGen pioneered AI talking avatars and remains the benchmark for portrait-driven video and audio lipsync. Stensyl integrates the full HeyGen lineup via fal.ai: Avatar 4 for talking-head video from a single portrait, V3 Lipsync Precision for premium audio dub, and V3 Lipsync Speed for fast bulk dubbing. All three share one credit pool, one studio, one place to learn.

Example outputs

HeyGen example 1

Build a 10-second product launch teaser as your avatar: 'Hey, I'm thrilled to share what we've been building this month. Watch this.'

HeyGen example 2

Dub a tutorial video from English to Spanish with the same on-screen presenter (V3 Lipsync Precision).

HeyGen example 3

Render a 15-second testimonial as your avatar: 'I've been using this tool for six months. Here's what changed for me.'

HeyGen example 4

Iterate on five voiceover variants for a 10-second social hook before picking the final one (V3 Lipsync Speed).

HeyGen example 5

Generate caption-tracked social cuts of a podcast clip for TikTok, Reels, and Shorts (V3 Lipsync Speed).

HeyGen example 6

Render a 5-second avatar sign-off for every video on your channel: 'Thanks for watching. Subscribe for more.'

How it works

01

Describe your scene

Type a detailed prompt describing the video you want, or upload a reference image as a starting frame.

02

Choose your settings

Pick your resolution and duration. See the credit cost before you generate.

03

Generate your video

Your video is ready in 1-3 minutes. Download, iterate, or extend the sequence.

Ready to create with HeyGen?

Jump into the Studio and start generating. Plans from £10/month.

Choose a Plan

Avatars and lipsync from the team that pioneered both.

HeyGen is the most respected name in AI talking-head video. Their Avatar 4 model produces frame-accurate lipsync from a single portrait and an audio clip, inferring head movement, micro-expressions, and mouth shapes from the audio waveform alone. Their V3 Lipsync engine reads the speaker's face on existing video and reproduces matching lip movement for any new audio you supply. Both are the gold standard in their categories.

Stensyl ships all three HeyGen models on Pro and Studio plans (Avatar 4 and Lipsync Speed are also on Starter). Avatar 4 powers Stensyl's My Avatar feature: each user persona bundles a portrait, an ElevenLabs voice clone, and style defaults into a persistent record. That record auto-appears as a Cast member in Storyboards and Film Studio, slots into Canvas as an Avatar Video node, and shows up in Generate's Talking Avatar mode. V3 Lipsync handles audio replacement on existing video for dubbing, voiceover swaps, and language localisation.

Pricing matches HeyGen's own per-second billing structure. Avatar 4 costs 100/200/300 credits for the 5/10/15-second buckets. V3 Lipsync Precision is 21 credits per second of output (5s = 105, 30s = 630). V3 Lipsync Speed is 11 credits per second, roughly half the cost, for drafts and high-volume work. Every V3 Lipsync render returns an SRT subtitle file alongside the MP4, mirrored into Stensyl storage so the URL never expires.
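To make the arithmetic concrete, here is a minimal sketch of the credit maths described above. The function names are illustrative, not part of the Stensyl API; only the rates and buckets come from the pricing text.

```python
# Illustrative credit calculator for the three HeyGen models.
# Function names are hypothetical; the rates come from the pricing text above.

AVATAR4_BUCKETS = {5: 100, 10: 200, 15: 300}  # seconds -> credits
PRECISION_RATE = 21  # credits per second of output
SPEED_RATE = 11      # credits per second of output

def avatar4_credits(seconds: int) -> int:
    """Avatar 4 bills in fixed buckets: pick the smallest bucket that fits.

    The 15-second cap is an assumption; the source only lists buckets to 15s.
    """
    for bucket, credits in sorted(AVATAR4_BUCKETS.items()):
        if seconds <= bucket:
            return credits
    raise ValueError("Avatar 4 clips are capped at 15 seconds in this sketch")

def lipsync_credits(seconds: int, precision: bool = True) -> int:
    """V3 Lipsync bills per second of output, at two flat rates."""
    rate = PRECISION_RATE if precision else SPEED_RATE
    return seconds * rate

print(lipsync_credits(5))                    # 5s Precision dub -> 105 credits
print(lipsync_credits(30))                   # 30s Precision dub -> 630 credits
print(lipsync_credits(30, precision=False))  # same 30s clip on Speed -> 330 credits
print(avatar4_credits(8))                    # 8s avatar clip lands in the 10s bucket -> 200 credits
```

This also makes the Speed/Precision trade-off easy to eyeball: a 30-second dub is 630 credits on Precision but 330 on Speed.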

Avatar 4: a persistent talking head from one portrait

Avatar 4 is HeyGen's flagship inference model for portrait-driven video. Feed it a single image and audio, get back a frame-accurate talking-head MP4. Stensyl wraps this as the My Avatar feature: build a persona once with portrait + voice clone + style defaults, render it ten thousand times across every studio without re-uploading anything. Three avatars on Pro, ten on Studio. The talking head you build today still works on every render six months from now.

V3 Lipsync Precision: premium audio replacement

V3 Precision is HeyGen's flagship audio-replacement engine. Upload any video plus the new audio you want it to speak. HeyGen reads the speaker's face, infers phoneme positions, and reproduces frame-accurate lip movement matching the new audio. Use it for final dub passes, language localisation, voiceover swaps, and single-line revisions on finished video where a reshoot would be too expensive. 21 credits per second, +19% margin floor, broadcast-quality output.

V3 Lipsync Speed: half the cost for drafts and bulk dubbing

V3 Speed runs the same input contract as Precision (video + audio → MP4 + SRT) at 11 credits per second instead of 21. The fidelity loss is marginal, and the output is broadcast-acceptable for social cuts and internal review. Use it to iterate on script variants, localise video libraries into multiple languages, or generate caption-tracked social cuts at scale. Available from the Starter plan up.

Captions included on every lipsync render

Every V3 Lipsync render (Speed or Precision) returns a companion SRT subtitle file alongside the MP4. The file is mirrored into Stensyl storage permanently, so the URL never expires. Download it from the gallery lightbox with one click, or pull it programmatically via the gallery API. Drop it into any NLE or social platform for instant caption tracks. Speed and Precision return identical caption quality.
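Because the companion file is plain SRT, it can be inspected or post-processed with nothing but the standard library. The sketch below parses a standard SRT cue list; the sample cue text is invented for illustration, not actual Stensyl output.

```python
import re

# Minimal SRT cue parser. The sample below is an invented cue list,
# standing in for the SRT file a V3 Lipsync render returns.
SRT_SAMPLE = """1
00:00:00,000 --> 00:00:02,400
Hey, I'm thrilled to share what we've been building.

2
00:00:02,400 --> 00:00:04,100
Watch this.
"""

CUE_RE = re.compile(
    r"(\d+)\s*\n(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s*\n(.*?)(?:\n\n|\Z)",
    re.S,
)

def parse_srt(text: str) -> list[dict]:
    """Return each cue as {index, start, end, text}."""
    return [
        {"index": int(i), "start": start, "end": end, "text": body.strip()}
        for i, start, end, body in CUE_RE.findall(text)
    ]

cues = parse_srt(SRT_SAMPLE)
print(len(cues))        # 2
print(cues[1]["text"])  # Watch this.
```

The same cue structure is what NLEs and social platforms read when you upload the file as a caption track.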

Integrated across every Stensyl studio

My Avatar (powered by Avatar 4) shows up as a Cast member in Storyboards and Film Studio without any extra wiring. Drop the Avatar Video node onto Canvas to wire scripted clips into composable workflows. Right-click any text on a Social Studio slide and pick 'Star me in this' to render the slide as your avatar saying it. Ask Ray to render a clip and she emits a confirm card. V3 Lipsync sits in the Generate page's Utilities tab next to kling-lipsync, ready for any video + audio pairing.

Pair with ElevenLabs for end-to-end automation

Generate the new audio with ElevenLabs TTS or voice cloning, then drive the result through V3 Lipsync. The full pipeline: original clip → new script → ElevenLabs in the cloned voice → HeyGen lipsync → final MP4 + SRT. Localise an entire video library into a new language without studio time. Or pair Avatar 4 with ElevenLabs to render scripted talking-head video in your own voice, end to end.
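The pipeline above is a simple sequential orchestration, sketched below with stub functions. Every name here is a hypothetical placeholder standing in for the real ElevenLabs and HeyGen calls; none of it is actual Stensyl or fal.ai API.

```python
# Hypothetical end-to-end dub pipeline. Each function is a stub standing in
# for the real service call (translation, ElevenLabs TTS, HeyGen V3 Lipsync).

def translate_script(script: str, target_lang: str) -> str:
    # Stand-in for translating the script into the target language.
    return f"[{target_lang}] {script}"

def elevenlabs_tts(script: str, voice_id: str) -> str:
    # Stand-in for ElevenLabs TTS in a cloned voice; returns an audio "path".
    return f"audio/{voice_id}/take.mp3"

def heygen_lipsync(video_path: str, audio_path: str, precision: bool) -> dict:
    # Stand-in for a V3 Lipsync render; returns the MP4 + SRT pair
    # described in the text above.
    suffix = "precision" if precision else "speed"
    return {"mp4": f"{video_path}.{suffix}.mp4", "srt": f"{video_path}.{suffix}.srt"}

def localise(video_path: str, script: str, target_lang: str,
             voice_id: str, precision: bool = True) -> dict:
    """Original clip -> new script -> cloned voice audio -> lipsynced MP4 + SRT."""
    new_script = translate_script(script, target_lang)
    audio = elevenlabs_tts(new_script, voice_id)
    return heygen_lipsync(video_path, audio, precision)

result = localise("clips/tutorial", "Welcome to the course.", "es", "my-voice")
print(result["mp4"])  # clips/tutorial.precision.mp4
print(result["srt"])  # clips/tutorial.precision.srt
```

Swapping `precision=False` routes the same pipeline through V3 Speed for draft passes at half the credit rate.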

Frequently asked

Questions about HeyGen.

No. One portrait image is enough. Avatar 4 infers head motion, expressions, and lipsync from the audio waveform alone.
Built differently

Why Stensyl?

A small indie studio building creative tools the way they should be built. No VC theatre, no funnel games, no faceless support.