★ — TECHNOLOGY
Audio-driven lipsync, explained.
Modern AI agents lip-sync in real time to streaming audio. The technology has two parts: a neural network that converts audio waveforms to facial parameters, and a rig that turns those parameters into mouth movement. Both are now mature.
The neural network half
Audio2Face (NVIDIA) is a transformer model that takes streaming audio and outputs ARKit blendshape values at sub-100ms latency. NeuroSync is an open-source alternative that runs on Apple Silicon or CPU.
Both consume streaming TTS (ElevenLabs, OpenAI, Cartesia) and produce frame-accurate facial parameter streams.
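As a rough illustration of that data flow, the sketch below slices incoming PCM chunks into one window per animation frame and emits a timestamped set of blendshape weights for each. It is not the Audio2Face or NeuroSync API: the sample rate, frame rate, and the names stream_frames and rms_to_jaw are assumptions made for this example, and the loudness-to-jaw function is only a stand-in for the neural network.

```python
import math
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

SAMPLE_RATE = 24_000                     # assumed PCM sample rate of the TTS stream (Hz)
FPS = 60                                 # assumed animation frame rate
SAMPLES_PER_FRAME = SAMPLE_RATE // FPS   # 400 samples of audio per animation frame

Weights = Dict[str, float]               # blendshape name -> weight in [0, 1]
Frame = Tuple[float, Weights]            # (timestamp in seconds, weights)


def rms_to_jaw(samples: List[float]) -> Weights:
    """Hypothetical stand-in for the model: louder audio opens the jaw more."""
    if not samples:
        return {"jawOpen": 0.0}
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return {"jawOpen": min(1.0, rms * 2.0)}


def stream_frames(
    chunks: Iterable[List[float]],
    predict: Callable[[List[float]], Weights] = rms_to_jaw,
) -> Iterator[Frame]:
    """Re-chunk arbitrary-sized audio chunks into one window per frame."""
    buffer: List[float] = []
    frame_index = 0
    for chunk in chunks:
        buffer.extend(chunk)
        # Emit a frame as soon as a full window of audio is available.
        while len(buffer) >= SAMPLES_PER_FRAME:
            window = buffer[:SAMPLES_PER_FRAME]
            del buffer[:SAMPLES_PER_FRAME]
            timestamp = frame_index * SAMPLES_PER_FRAME / SAMPLE_RATE
            yield timestamp, predict(window)
            frame_index += 1


# Usage: two unevenly sized chunks (800 samples total) produce two frames.
frames = list(stream_frames([[0.5] * 300, [0.5] * 500]))
```

Stamping each frame with the position of its audio window, rather than with wall-clock time, is what lets a renderer keep the mouth aligned with playback even when chunks arrive in bursts.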
The rig half
A face rig with the 52 ARKit blendshapes (jawOpen, mouthPucker, mouthSmileLeft, and so on) deforms the mesh in real time as parameters update. EquipVerse ships every character with the full ARKit set plus jaw, tongue, and corrective shapes.
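The deformation step can be sketched as a linear blend: each blendshape stores a per-vertex offset from the neutral mesh, and the deformed vertex is the neutral position plus the weighted sum of those offsets. This is a minimal sketch of the general technique, not EquipVerse's rig code; the deform function and the two-vertex mesh are hypothetical, and a production rig would do this on the GPU.

```python
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


def deform(
    base: List[Vec3],
    deltas: Dict[str, List[Vec3]],
    weights: Dict[str, float],
) -> List[Vec3]:
    """Return base vertices plus the weighted sum of blendshape offsets."""
    out = [list(v) for v in base]
    for name, weight in weights.items():
        shape = deltas.get(name)
        if shape is None:
            continue  # the rig has no shape with this name; ignore it
        w = max(0.0, min(1.0, weight))  # clamp to the [0, 1] range used by ARKit
        for i, (dx, dy, dz) in enumerate(shape):
            out[i][0] += w * dx
            out[i][1] += w * dy
            out[i][2] += w * dz
    return [(x, y, z) for x, y, z in out]


# Usage: a toy two-vertex "mesh" where jawOpen pulls the chin vertex down.
neutral = [(0.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
shapes = {"jawOpen": [(0.0, 0.0, 0.0), (0.0, -0.5, 0.0)]}
posed = deform(neutral, shapes, {"jawOpen": 0.5})
```

Corrective shapes fit the same model: they are additional offsets that a rig blends in when certain combinations of weights would otherwise distort the mesh.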
End-to-end latency
LLM token generation (50ms to first token) + TTS streaming (100ms) + Audio2Face (sub-100ms) + render (16ms, one frame at 60fps) ≈ 200–300ms end to end. That is comparable to the pause between turns in human conversation.
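The budget is simple addition, shown here with the stage figures quoted above. The Audio2Face entry uses 100ms as an upper bound because the source only says "sub-100ms"; the dictionary keys and function name are illustrative.

```python
# Per-stage latency figures from the text, in milliseconds.
BUDGET_MS = {
    "llm_first_token": 50,
    "tts_streaming": 100,
    "audio2face": 100,   # upper bound: the stated figure is "sub-100ms"
    "render": 16,        # roughly one frame at 60fps
}


def total_latency_ms(budget: dict) -> int:
    """Sum the stages, assuming they run strictly one after another."""
    return sum(budget.values())


worst_case = total_latency_ms(BUDGET_MS)
print(f"worst-case end-to-end latency: {worst_case}ms")  # 266ms
```

Summing assumes each stage waits for the previous one to produce its first output; any overlap between stages would lower the total.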