★ — PIPELINE
The real-time lipsync frontier.
How NVIDIA Audio2Face and open-source NeuroSync changed the game for embodied AI agents — and how to integrate them with a MetaHuman.
The problem: lipsync was always offline
For two decades, character lipsync was an offline craft: animators hand-keyed phonemes or used pre-baked audio-driven tools (FaceFX, Mimic) whose output was rendered to FBX and shipped. Real-time lipsync was reserved for stylised characters with limited mouth shapes. Photoreal lipsync with end-to-end latency under 200 ms did not exist.
Audio2Face changed the math
NVIDIA Audio2Face is a transformer-based neural model that converts streaming audio into ARKit blendshape values at sub-100 ms latency on consumer GPUs. Connect a MetaHuman face rig to its output and you get cinematic-grade lipsync that follows the voice of a ChatGPT-class agent in real time.
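To make "ARKit blendshape values" concrete, here is a minimal sketch of what one output frame might look like once it reaches your code. It is not Audio2Face's API: the `Frame` type, the four-shape subset, and the `clamp_frame` and `smooth_frame` helpers are hypothetical names chosen for illustration. The only things taken from ARKit are the shape names and the nominal 0.0 to 1.0 weight range.

```python
from typing import Dict

# A small subset of the 52 ARKit blendshape names; a real frame carries all of them.
MOUTH_SHAPES = ("jawOpen", "mouthClose", "mouthFunnel", "mouthPucker")

# Hypothetical frame type: blendshape name -> weight.
Frame = Dict[str, float]


def clamp_frame(raw: Frame) -> Frame:
    """Clamp each weight to the nominal 0.0-1.0 range; missing shapes default to 0.0."""
    return {name: min(1.0, max(0.0, raw.get(name, 0.0))) for name in MOUTH_SHAPES}


def smooth_frame(previous: Frame, current: Frame, alpha: float = 0.6) -> Frame:
    """Exponentially smooth between frames; alpha=1.0 passes the new frame through."""
    return {
        name: alpha * current[name] + (1.0 - alpha) * previous[name]
        for name in MOUTH_SHAPES
    }


rest = clamp_frame({})  # neutral face: every weight is 0.0
frame = clamp_frame({"jawOpen": 1.3, "mouthFunnel": 0.25, "mouthPucker": -0.1})
smoothed = smooth_frame(rest, frame, alpha=0.5)
```

Smoothing trades a little responsiveness for less jitter, so on a tight latency budget a high `alpha` (or no smoothing at all) is probably the safer starting point.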
NeuroSync for the open-source path
NeuroSync is an open-source equivalent that runs on Apple Silicon and CPUs without NVIDIA hardware — slightly lower fidelity but free and self-hostable. EquipVerse rigs are compatible with both pipelines.
Integration with a MetaHuman
In UE5, Audio2Face runs as a service and streams blendshape values over OSC or Live Link to drive the MetaHuman face rig in real time. With ElevenLabs or OpenAI TTS upstream, you have an end-to-end agent: text-in → voice-out → photoreal lipsync. Total round-trip latency is roughly 200–300 ms with first-token streaming.
What it unlocks
Use cases include embodied customer-service agents, photoreal kiosk concierges, AI news anchors that respond to live tickers, language-learning conversation partners, and healthcare avatars that explain pre-op procedures to patients in 30 languages. The EquipVerse AI Agent Embodiment package delivers this stack out of the box for $2,500.