★ — TECHNOLOGY

Audio-driven lipsync, explained.

Modern AI agents lip-sync in real time to streaming audio. The technology has two parts: a neural network that converts audio waveforms to facial parameters, and a rig that turns those parameters into mouth movement. Both are now mature.

The neural network half

Audio2Face (NVIDIA) is a neural network that takes streaming audio and outputs ARKit blendshape values at sub-100ms latency. NeuroSync is an open-source alternative that runs on Apple Silicon or CPU.

Both consume streaming TTS (ElevenLabs, OpenAI, Cartesia) and produce frame-accurate facial parameter streams.
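As a rough illustration of the streaming step, the sketch below re-slices incoming TTS audio chunks into one window per video frame and hands each window to a model. The 24 kHz sample rate, the 60 fps target, and the `predict_blendshapes` function are all assumptions for illustration; the stub just maps loudness to `jawOpen` and is not the Audio2Face or NeuroSync API.

```python
from typing import Dict, Iterable, Iterator, List

SAMPLE_RATE = 24_000                     # assumed TTS output rate in Hz
FPS = 60                                 # assumed render rate
SAMPLES_PER_FRAME = SAMPLE_RATE // FPS   # 400 audio samples per video frame


def frame_windows(chunks: Iterable[List[float]]) -> Iterator[List[float]]:
    """Re-slice arbitrarily sized audio chunks into one window per video frame."""
    buffer: List[float] = []
    for chunk in chunks:
        buffer.extend(chunk)
        while len(buffer) >= SAMPLES_PER_FRAME:
            yield buffer[:SAMPLES_PER_FRAME]
            buffer = buffer[SAMPLES_PER_FRAME:]
    # Leftover samples shorter than one frame are dropped in this sketch.


def predict_blendshapes(window: List[float]) -> Dict[str, float]:
    """Hypothetical stand-in for the model: loudness (RMS) drives jawOpen only."""
    rms = (sum(s * s for s in window) / len(window)) ** 0.5
    return {"jawOpen": min(1.0, rms)}


def lipsync_stream(chunks: Iterable[List[float]]) -> Iterator[Dict[str, float]]:
    """Yield one blendshape dict per video frame as audio arrives."""
    for window in frame_windows(chunks):
        yield predict_blendshapes(window)
```

A real model would look at a longer context window than a single frame of audio; the point here is only the chunk-to-frame alignment.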

The rig half

A face rig with the 52 ARKit blendshapes (jaw open, mouth pucker, smile, etc.) deforms the mesh in real time as parameters update. EquipVerse ships every character with full ARKit + jaw + tongue + correctives.
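Blendshape deformation is commonly a weighted sum: each vertex is its neutral position plus the sum of weight times per-shape offset. The sketch below assumes that linear model and a tiny two-vertex mesh; the function name and data layout are hypothetical, and a production rig (with correctives and joints) would do more than this.

```python
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


def apply_blendshapes(
    base: List[Vec3],
    deltas: Dict[str, List[Vec3]],
    weights: Dict[str, float],
) -> List[Vec3]:
    """Return base vertices offset by the weighted sum of blendshape deltas."""
    out = [list(v) for v in base]
    for name, weight in weights.items():
        shape = deltas.get(name)
        if shape is None:
            continue  # ignore parameters this rig does not implement
        w = max(0.0, min(1.0, weight))  # blendshape weights are clamped to [0, 1]
        for i, (dx, dy, dz) in enumerate(shape):
            out[i][0] += w * dx
            out[i][1] += w * dy
            out[i][2] += w * dz
    return [(x, y, z) for x, y, z in out]


# Two vertices: an upper lip that stays put and a chin that jawOpen pulls down.
base = [(0.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
deltas = {"jawOpen": [(0.0, 0.0, 0.0), (0.0, -1.0, 0.0)]}
posed = apply_blendshapes(base, deltas, {"jawOpen": 0.5})
```

With `jawOpen` at 0.5, the chin vertex moves halfway along its delta, which is the per-frame update the rig performs as new parameters arrive.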

End-to-end latency

LLM token generation (50ms first-token) + TTS streaming (100ms) + Audio2Face (sub-100ms) + render (16ms) ≈ 200–300ms total round trip. That is in the same range as the pause between turns in human conversation.
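The budget can be checked with simple arithmetic. The sketch below uses the stage figures quoted above and treats the "sub-100ms" Audio2Face figure as a 100 ms upper bound, which is an assumption; the dictionary keys are illustrative names only.

```python
# Per-stage latency in milliseconds, taken from the figures above.
# audio2face is quoted as "sub-100ms"; 100 is used here as a worst case.
BUDGET_MS = {
    "llm_first_token": 50,
    "tts_streaming": 100,
    "audio2face": 100,
    "render": 16,
}

total_ms = sum(BUDGET_MS.values())
print(f"worst-case total: {total_ms} ms")  # worst-case total: 266 ms
```

The worst-case sum of 266 ms sits inside the 200–300ms range, and any Audio2Face latency below 100 ms moves the total toward the lower end.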

AI Agent Embodiment Bootcamp

AI Agent Embodiment

Voice + Lipsync (per minute)

Pipeline notes,
monthly.
