★ — PIPELINE

The real-time lipsync frontier.

How NVIDIA Audio2Face and open-source NeuroSync changed the game for embodied AI agents — and how to integrate them with a MetaHuman.

2026-05-06 · 8 min read

The problem: lipsync was always offline

For two decades, character lipsync was an offline craft. Animators hand-keyed phonemes or ran pre-baked audio-driven tools (FaceFX, Mimic) whose output was exported to FBX and shipped. Real-time lipsync was reserved for stylised characters with limited mouth shapes; photoreal lipsync with under 200 ms of latency did not exist.

Audio2Face changed the math

NVIDIA Audio2Face is a transformer-based neural model that converts streaming audio into ARKit blendshape values at sub-100 ms latency on consumer GPUs. Route its output into a MetaHuman face rig and you get cinematic-grade lipsync that follows the voice of a ChatGPT-class agent in real time.
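To make "streaming audio in, ARKit blendshape values out" concrete, here is a minimal Python sketch of the data shape only. It is not the Audio2Face model: `placeholder_model` is a hypothetical loudness-based stand-in, and the 16 kHz sample rate and 30 fps frame rate are assumptions chosen for illustration. The coefficient names `jawOpen` and `mouthClose` are real ARKit blendshape names, and each weight is a value between 0 and 1.

```python
import math
from typing import Dict, Iterator, List

SAMPLE_RATE = 16_000                            # assumed input sample rate (Hz)
FRAME_RATE = 30                                 # assumed animation frame rate (fps)
SAMPLES_PER_FRAME = SAMPLE_RATE // FRAME_RATE   # 533 samples per animation frame


def audio_windows(samples: List[float],
                  size: int = SAMPLES_PER_FRAME) -> Iterator[List[float]]:
    """Yield consecutive, non-overlapping windows; a trailing partial window is dropped."""
    for start in range(0, len(samples) - size + 1, size):
        yield samples[start:start + size]


def placeholder_model(window: List[float]) -> Dict[str, float]:
    """Hypothetical stand-in for the neural model: maps loudness (RMS) to jawOpen only."""
    rms = math.sqrt(sum(s * s for s in window) / len(window))
    jaw = min(1.0, max(0.0, rms * 2.0))         # clamp to the 0..1 blendshape range
    return {"jawOpen": jaw, "mouthClose": 0.0}


def blendshape_stream(samples: List[float]) -> Iterator[Dict[str, float]]:
    """Turn a buffer of mono PCM samples into one blendshape frame per audio window."""
    for window in audio_windows(samples):
        yield placeholder_model(window)


if __name__ == "__main__":
    # One second of a 220 Hz tone at half amplitude as test input.
    tone = [0.5 * math.sin(2 * math.pi * 220 * n / SAMPLE_RATE)
            for n in range(SAMPLE_RATE)]
    frames = list(blendshape_stream(tone))
    print(len(frames), frames[0])
```

A real model would emit the full set of ARKit coefficients per frame rather than two, but the contract a face rig sees is the same: a stream of name-to-weight frames arriving at the animation frame rate.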

NeuroSync for the open-source path

NeuroSync is an open-source alternative that runs on Apple Silicon and on CPUs, with no NVIDIA hardware required. Fidelity is slightly lower, but it is free and self-hostable. EquipVerse rigs are compatible with both pipelines.

Integration with a MetaHuman

In UE5, Audio2Face runs as a service, streams blendshape values via OSC or LiveLink, and drives the MetaHuman face rig in real time. With ElevenLabs or OpenAI TTS upstream, you have an end-to-end agent: text-in → voice-out → photoreal lipsync. Total round-trip latency is roughly 200–300 ms with first-token streaming.

What it unlocks

The stack unlocks embodied customer-service agents, photoreal kiosk concierges, AI news anchors that respond to live tickers, language-learning conversation partners, and healthcare patient avatars that explain pre-op procedures in 30 languages. The EquipVerse AI Agent Embodiment package delivers this stack out of the box for $2,500.
