HeyGen

The gold standard in AI talking heads and audio lipsync. Three models, one umbrella.

HeyGen pioneered AI talking avatars and remains the benchmark for portrait-driven video and audio lipsync. Stensyl integrates the full HeyGen lineup via fal.ai: Avatar 4 for talking-head video from a single portrait, V3 Lipsync Precision for premium audio dub, and V3 Lipsync Speed for fast bulk dubbing. All three share one credit pool, one studio, one place to learn.

Example outputs

HeyGen example 1

Build a 10-second product launch teaser as your avatar: 'Hey, I'm thrilled to share what we've been building this month. Watch this.'

HeyGen example 2

Dub a tutorial video from English to Spanish with the same on-screen presenter (V3 Lipsync Precision).

HeyGen example 3

Render a 15-second testimonial as your avatar: 'I've been using this tool for six months. Here's what changed for me.'

HeyGen example 4

Iterate on five voiceover variants for a 10-second social hook before picking the final one (V3 Lipsync Speed).

HeyGen example 5

Generate caption-tracked social cuts of a podcast clip for TikTok, Reels, and Shorts (V3 Lipsync Speed).

HeyGen example 6

Render a 5-second avatar sign-off for every video on your channel: 'Thanks for watching. Subscribe for more.'

How it works

01

Describe your scene

Type a detailed prompt describing the video you want, or upload a reference image as a starting frame.

02

Choose your settings

Pick your resolution and duration. See the credit cost before you generate.

03

Generate your video

Your video is ready in 1-3 minutes. Download, iterate, or extend the sequence.

Ready to create with HeyGen?

Jump into the Studio and start generating. Plans from £10/month.

Choose a Plan

Avatars and lipsync from the team that pioneered both.

HeyGen is the most respected name in AI talking-head video. Their Avatar 4 model produces frame-accurate lipsync from a single portrait and an audio clip, inferring head movement, micro-expressions, and mouth shapes from the audio waveform alone. Their V3 Lipsync engine reads the speaker's face on existing video and reproduces matching lip movement for any new audio you supply. Both are the gold standard in their categories.

Stensyl ships all three HeyGen models on Pro and Studio plans (Avatar 4 and Lipsync Speed are also on Starter). Avatar 4 powers Stensyl's My Avatar feature: each user persona bundles a portrait, an ElevenLabs voice clone, and style defaults into a persistent record. That record auto-appears as a Cast member in Storyboards and Film Studio, slots into Canvas as an Avatar Video node, and shows up in Generate's Talking Avatar mode. V3 Lipsync handles audio replacement on existing video for dubbing, voiceover swaps, and language localisation.

Pricing matches HeyGen's own per-second billing structure. Avatar 4 costs 100/200/300 credits for the 5/10/15-second buckets. V3 Lipsync Precision is 21 credits per second of output (5s = 105, 30s = 630). V3 Lipsync Speed is 11 credits per second, roughly half the cost, for drafts and high-volume work. Every V3 Lipsync render returns an SRT subtitle file alongside the MP4, mirrored into Stensyl storage so the URL never expires.
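To make the arithmetic concrete, here is a minimal sketch of the credit maths described above. The function names are illustrative, not part of the Stensyl API; only the rates and buckets come from the pricing text.

```python
# Illustrative credit calculator for the three HeyGen models.
# Function names are hypothetical; the rates come from the pricing text above.

AVATAR4_BUCKETS = {5: 100, 10: 200, 15: 300}  # seconds -> credits
PRECISION_RATE = 21  # credits per second of output
SPEED_RATE = 11      # credits per second of output

def avatar4_credits(seconds: int) -> int:
    """Avatar 4 bills in fixed buckets: pick the smallest bucket that fits.

    The 15-second cap is an assumption; the source only lists buckets to 15s.
    """
    for bucket, credits in sorted(AVATAR4_BUCKETS.items()):
        if seconds <= bucket:
            return credits
    raise ValueError("Avatar 4 clips are capped at 15 seconds in this sketch")

def lipsync_credits(seconds: int, precision: bool = True) -> int:
    """V3 Lipsync bills per second of output, at two flat rates."""
    rate = PRECISION_RATE if precision else SPEED_RATE
    return seconds * rate

print(lipsync_credits(5))                    # 5s Precision dub -> 105 credits
print(lipsync_credits(30))                   # 30s Precision dub -> 630 credits
print(lipsync_credits(30, precision=False))  # same 30s clip on Speed -> 330 credits
print(avatar4_credits(8))                    # 8s avatar clip lands in the 10s bucket -> 200 credits
```

This also makes the Speed/Precision trade-off easy to eyeball: a 30-second dub is 630 credits on Precision but 330 on Speed.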

Avatar 4: a persistent talking head from one portrait

Avatar 4 is HeyGen's flagship inference model for portrait-driven video. Feed it a single image and audio, get back a frame-accurate talking-head MP4. Stensyl wraps this as the My Avatar feature: build a persona once with portrait + voice clone + style defaults, render it ten thousand times across every studio without re-uploading anything. Three avatars on Pro, ten on Studio. The talking head you build today still works on every render six months from now.

V3 Lipsync Precision: premium audio replacement

V3 Precision is HeyGen's flagship audio-replacement engine. Upload any video plus the new audio you want it to speak. HeyGen reads the speaker's face, infers phoneme positions, and reproduces frame-accurate lip movement matching the new audio. Use it for final dub passes, language localisation, voiceover swaps, and single-line revisions on finished video where a reshoot would be too expensive. 21 credits per second, +19% margin floor, broadcast-quality output.

V3 Lipsync Speed: half the cost for drafts and bulk dubbing

V3 Speed runs the same input contract as Precision (video + audio → MP4 + SRT) at 11 credits per second instead of 21. The fidelity loss is marginal, and the output is broadcast-acceptable for social cuts and internal review. Use it to iterate on script variants, localise video libraries into multiple languages, or generate caption-tracked social cuts at scale. Available from the Starter plan up.

Captions included on every lipsync render

Every V3 Lipsync render (Speed or Precision) returns a companion SRT subtitle file alongside the MP4. The file is mirrored into Stensyl storage permanently, so the URL never expires. Download it from the gallery lightbox with one click, or pull it programmatically via the gallery API. Drop it into any NLE or social platform for instant caption tracks. Speed and Precision return identical caption quality.
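Because the companion file is plain SRT, it can be inspected or post-processed with nothing but the standard library. The sketch below parses a standard SRT cue list; the sample cue text is invented for illustration, not actual Stensyl output.

```python
import re

# Minimal SRT cue parser. The sample below is an invented cue list,
# standing in for the SRT file a V3 Lipsync render returns.
SRT_SAMPLE = """1
00:00:00,000 --> 00:00:02,400
Hey, I'm thrilled to share what we've been building.

2
00:00:02,400 --> 00:00:04,100
Watch this.
"""

CUE_RE = re.compile(
    r"(\d+)\s*\n(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s*\n(.*?)(?:\n\n|\Z)",
    re.S,
)

def parse_srt(text: str) -> list[dict]:
    """Return each cue as {index, start, end, text}."""
    return [
        {"index": int(i), "start": start, "end": end, "text": body.strip()}
        for i, start, end, body in CUE_RE.findall(text)
    ]

cues = parse_srt(SRT_SAMPLE)
print(len(cues))        # 2
print(cues[1]["text"])  # Watch this.
```

The same cue structure is what NLEs and social platforms read when you upload the file as a caption track.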

Integrated across every Stensyl studio

My Avatar (powered by Avatar 4) shows up as a Cast member in Storyboards and Film Studio without any extra wiring. Drop the Avatar Video node onto Canvas to wire scripted clips into composable workflows. Right-click any text on a Social Studio slide and pick 'Star me in this' to render the slide as your avatar saying it. Ask Ray to render a clip and she emits a confirm card. V3 Lipsync sits in the Generate page's Utilities tab next to kling-lipsync, ready for any video + audio pairing.

Pair with ElevenLabs for end-to-end automation

Generate the new audio with ElevenLabs TTS or voice cloning, then drive the result through V3 Lipsync. The full pipeline: original clip → new script → ElevenLabs in the cloned voice → HeyGen lipsync → final MP4 + SRT. Localise an entire video library into a new language without studio time. Or pair Avatar 4 with ElevenLabs to render scripted talking-head video in your own voice, end to end.
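The pipeline above is a simple sequential orchestration, sketched below with stub functions. Every name here is a hypothetical placeholder standing in for the real ElevenLabs and HeyGen calls; none of it is actual Stensyl or fal.ai API.

```python
# Hypothetical end-to-end dub pipeline. Each function is a stub standing in
# for the real service call (translation, ElevenLabs TTS, HeyGen V3 Lipsync).

def translate_script(script: str, target_lang: str) -> str:
    # Stand-in for translating the script into the target language.
    return f"[{target_lang}] {script}"

def elevenlabs_tts(script: str, voice_id: str) -> str:
    # Stand-in for ElevenLabs TTS in a cloned voice; returns an audio "path".
    return f"audio/{voice_id}/take.mp3"

def heygen_lipsync(video_path: str, audio_path: str, precision: bool) -> dict:
    # Stand-in for a V3 Lipsync render; returns the MP4 + SRT pair
    # described in the text above.
    suffix = "precision" if precision else "speed"
    return {"mp4": f"{video_path}.{suffix}.mp4", "srt": f"{video_path}.{suffix}.srt"}

def localise(video_path: str, script: str, target_lang: str,
             voice_id: str, precision: bool = True) -> dict:
    """Original clip -> new script -> cloned voice audio -> lipsynced MP4 + SRT."""
    new_script = translate_script(script, target_lang)
    audio = elevenlabs_tts(new_script, voice_id)
    return heygen_lipsync(video_path, audio, precision)

result = localise("clips/tutorial", "Welcome to the course.", "es", "my-voice")
print(result["mp4"])  # clips/tutorial.precision.mp4
print(result["srt"])  # clips/tutorial.precision.srt
```

Swapping `precision=False` routes the same pipeline through V3 Speed for draft passes at half the credit rate.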

Frequently asked

Questions about HeyGen.

No. One portrait image is enough. Avatar 4 infers head motion, expressions, and lipsync from the audio waveform alone.
Built differently

Why Stensyl?

A small indie studio building creative tools the way they should be built. No VC theatre, no funnel games, no faceless support.