Lip-Syncing Virtual AI Characters: Techniques, Integration, and Future Trends

By
Convai Team
May 26, 2024

Realistic lip-syncing is crucial for creating immersive and expressive AI characters. Lip-syncing involves matching the mouth movements of a virtual character to the spoken dialogue, ensuring that the visual and auditory aspects of speech are synchronized. This synchronization improves the character's believability and engagement as viewers perceive the speech through visual and audio cues.

Accurate lip-syncing is important because it conveys the illusion that the character is speaking the words. When mouth movements match dialogue, the performance is more natural and convincing, allowing your audience to focus on the speech's content and emotion.

This article explores the challenges faced in achieving realistic lip-sync, such as quality, latency, and emotional expression, and how Convai is working to overcome these hurdles. You’ll also learn about integrating Convai’s lip sync component into characters with popular game engines like Unreal Engine and Unity.

Overview of Lip-Syncing for Virtual Characters: Matching Mouth Movements to Spoken Dialogue

Realistic lip-sync is essential for capturing the nuances of speech, such as the shape and timing of mouth movements. This process typically uses “visemes,” which are visual representations of phonemes (the distinct units of sound in speech; e.g., "cat" is three phonemes: /k/, /æ/, and /t/). 

A viseme is the specific mouth shape associated with one or more phonemes. While there are many phonemes, visemes group similar-looking mouth shapes, reducing the number of animation targets required. For instance, the phonemes /b/, /p/, and /m/ share a single viseme because all three produce a similar closed-lip shape.
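To make the grouping concrete, here is a minimal Python sketch of a phoneme-to-viseme lookup. The viseme labels and groupings are illustrative assumptions, not a complete inventory.

```python
# Minimal phoneme-to-viseme grouping. The labels and groupings are illustrative
# assumptions; a production system covers the full phoneme inventory.
PHONEME_TO_VISEME = {
    "p": "PP", "b": "PP", "m": "PP",   # bilabial closures share one closed-lip viseme
    "f": "FF", "v": "FF",              # lip-to-teeth shapes
    "k": "kk", "g": "kk",              # back-of-mouth consonants
    "aa": "aa", "ae": "aa",            # open-mouth vowels
}

def phonemes_to_visemes(phonemes):
    """Map a phoneme sequence (e.g., ['k', 'ae', 't'] for 'cat') to visemes,
    falling back to a silent/neutral shape for phonemes not in this toy table."""
    return [PHONEME_TO_VISEME.get(p, "sil") for p in phonemes]

print(phonemes_to_visemes(["k", "ae", "t"]))  # ['kk', 'aa', 'sil'] with this toy table
```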

Animation that shows visemes (Oculus Lipsync)

Lip-syncing techniques range from simple rule-based systems to complex neural networks that produce highly detailed and expressive animations. Advanced techniques use AI and machine learning to analyze audio input and generate corresponding mouth shapes and facial expressions in real time.

When done well, lip-sync brings virtual characters to life and allows them to deliver dialogue naturally and convincingly. It is a key component of creating expressive characters that can engage in realistic conversations and convey emotions through their speech and facial animations.

Lip-sync is especially important for applications like video games, animated films, and virtual assistants, where the character's speech is a primary means of communication and storytelling. 

Watch Convai and Nvidia’s demo from CES 2024, which showcases accurate lip sync, action generation, and scene perception (read more about it on this blog).

Convai's Lip-Syncing Technology

At Convai, we have developed a multi-tiered approach to lip-syncing, addressing various needs and use cases for virtual characters. 

The underlying goal is to create realistic, natural lip movements that match the spoken audio. We have seen this improve the overall immersion and engagement of conversational AI experiences for our users.

See Also: Convai Gallery (A collection of games, apps, demos, and samples built on the Convai Platform).

Convai's Approach:

Our strategy uses existing industry-standard solutions and cutting-edge technology to provide you with flexible options for lip-syncing. We primarily focus on server-side processing to reduce the computational burden on your devices for smoother performance and broader compatibility.

Three levels of lip-syncing:

At Convai, we offer three tiers of lip-sync: basic phoneme-based blend shapes, the OVR Lip Sync component, and NVIDIA Audio2Face. In the sections that follow, we discuss how Convai allows you to use each of them.

Basic Phoneme-Based Blend Shapes

Hardcoded blend shapes involve using predefined facial expressions or shapes (blend shapes) mapped to phonemes. ARKit, Apple's augmented reality framework, provides a set of blend shapes specifically designed for facial animation. 

In this technique:

  • Text-to-Phoneme Conversion: The input text is converted into phonemes.
  • Phoneme-to-Blend Shape Mapping: Each phoneme is mapped to a corresponding blend shape.
  • Animation Application: These blend shapes are applied to the character model to simulate speech.

This method is straightforward but can be limited in flexibility and naturalness as it relies on predefined mappings.
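As a rough illustration of those three steps, the sketch below converts a word to phonemes with a hypothetical lexicon, maps each phoneme to a blend shape pose, and emits evenly spaced keyframes. The lexicon, pose values, and 80 ms-per-phoneme timing are assumptions for illustration only.

```python
# A toy version of the hardcoded pipeline: text -> phonemes -> blend shape keyframes.
# The lexicon, blend shape names, and per-phoneme timing are assumptions.
LEXICON = {"map": ["m", "ae", "p"]}              # hypothetical text-to-phoneme lookup
POSES = {
    "m": {"jawOpen": 0.05, "mouthClose": 1.0},   # closed lips
    "ae": {"jawOpen": 0.7, "mouthClose": 0.0},   # open mouth
    "p": {"jawOpen": 0.05, "mouthClose": 1.0},
}
NEUTRAL = {"jawOpen": 0.0, "mouthClose": 0.0}

def lipsync_keyframes(text, seconds_per_phoneme=0.08):
    """Return (time, blend_shape_weights) keyframes for each phoneme in the text."""
    keyframes = []
    t = 0.0
    for word in text.lower().split():
        for phoneme in LEXICON.get(word, []):
            keyframes.append((t, POSES.get(phoneme, NEUTRAL)))
            t += seconds_per_phoneme
    keyframes.append((t, NEUTRAL))  # relax the mouth at the end
    return keyframes

for time, pose in lipsync_keyframes("map"):
    print(f"{time:.2f}s -> {pose}")
```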

Convai supports ARKit blendshapes for Ready Player Me characters through heuristic mapping.

OVR Lip Sync (Oculus Meta)

Convai's primary lip-sync solution is OVR Lip Sync, which Oculus (Meta) developed to enable realistic mouth animations for virtual characters based on audio input. 

This technique takes raw audio bytes as input and outputs visemes (visual phonemes) at 100 frames per second. The 15 resulting visemes are parsed on the client side (Unity, Unreal Engine, Three.js) and applied to the character's face.

With OVR Lip Sync, the audio processing happens on Convai's servers. The client sends raw audio data to the lip-sync server, which analyzes the audio and extracts the viseme information.

The server outputs the viseme data at 100 frames per second. (We apply smoothing here to avoid jittery animation.) This approach offloads the processing from your device and ensures consistent lip sync results across different platforms.
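The exact filter Convai applies isn't detailed here, but a simple exponential moving average over the per-frame viseme weights, as sketched below, shows the general idea. The viseme names and smoothing factor are assumptions.

```python
# Illustrative smoothing of a 100 fps viseme stream with an exponential moving
# average. The viseme subset and alpha are assumptions; the actual filter may differ.
VISEMES = ["sil", "PP", "FF", "aa", "E", "ih", "oh", "ou"]  # subset of OVR-style visemes

def smooth_stream(frames, alpha=0.35):
    """Blend each incoming frame with the previous smoothed frame to avoid
    jittery jumps between mouth shapes. `frames` is an iterable of
    {viseme: weight} dicts arriving at ~10 ms intervals."""
    smoothed = {v: 0.0 for v in VISEMES}
    for frame in frames:
        for v in VISEMES:
            target = frame.get(v, 0.0)
            smoothed[v] += alpha * (target - smoothed[v])
        yield dict(smoothed)

raw = [{"PP": 1.0}, {"PP": 1.0}, {"aa": 1.0}, {"aa": 1.0}]
for frame in smooth_stream(raw):
    print({k: round(w, 2) for k, w in frame.items() if w > 0.01})
```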

While easy to implement, OVR Lip Sync has limitations:

  • Limited quality and expressiveness with just 15 visemes (for comparison, Reallusion characters can support up to 150 blend shapes). We solve this by extrapolating the 15 viseme values into 150 blend shapes through heuristic mapping and interpolation, which widens the range of mouth shapes and movements (see the sketch after this list).
  • It only animates the lips and lacks facial expressions. We solve this by applying heuristic mapping that synchronizes the lips and face using pre-generated emotional shapes, which we apply while the user and the character speak.
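A minimal sketch of that extrapolation idea: each richer blend shape is expressed as a weighted combination of the 15 incoming viseme values. The target shape names and weights below are invented for illustration and are not Convai's actual mapping tables.

```python
# Heuristic viseme-to-blend-shape expansion: each target blend shape is a
# weighted combination of source viseme weights. Names and weights are
# illustrative assumptions, not the real mapping.
EXPANSION = {
    "Jaw_Open":      [("aa", 0.9), ("oh", 0.5)],
    "Mouth_Pucker":  [("ou", 0.8), ("oh", 0.4)],
    "Mouth_Press_L": [("PP", 0.7)],
    "Mouth_Press_R": [("PP", 0.7)],
    "Tongue_Tip_Up": [("DD", 0.6), ("nn", 0.5)],
    # ...in practice a table like this covers ~150 target shapes
}

def expand_visemes(viseme_weights):
    """Convert a 15-value viseme frame into a richer blend shape frame."""
    out = {}
    for shape, terms in EXPANSION.items():
        value = sum(weight * viseme_weights.get(viseme, 0.0) for viseme, weight in terms)
        out[shape] = min(1.0, value)  # clamp so combined contributions stay valid
    return out

print(expand_visemes({"aa": 0.8, "PP": 0.1}))
```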

NVIDIA Audio2Face

Convai offers a more advanced lip-sync solution for enterprise customers: Audio2Face, powered by NVIDIA. It uses pre-trained models and runs in NVIDIA containers. Unlike OVR Lip Sync, which uses 15 visemes, Audio2Face outputs weights for 52 ARKit blend shapes. These blend shapes cover a wider range of facial motion, including mouth shapes, emotional expressions, and head movement.

The AI models used by Audio2Face are trained on large audio and corresponding facial motion datasets. They analyze the audio signal and map it to the appropriate blend shape weights to generate realistic lip-sync and facial expressions. This AI-driven approach enables Audio2Face to produce high-quality, expressive results superior to OVR Lip Sync.
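As a rough sketch of consuming such output on the client, the snippet below interpolates between two successive ARKit blend shape frames so the animation can be sampled at the renderer's frame rate. The frame rates and frame values are assumptions; only the ARKit blend shape names (e.g., "jawOpen") come from Apple's standard set.

```python
# Interpolating between successive ARKit blend shape frames (as produced by an
# Audio2Face-style service) so a renderer running at a different frame rate can
# sample them smoothly. Everything except the ARKit names is an illustrative assumption.
def sample_blendshapes(frame_a, frame_b, t):
    """Linearly interpolate two {blend_shape: weight} frames at t in [0, 1]."""
    keys = set(frame_a) | set(frame_b)
    return {
        k: (1.0 - t) * frame_a.get(k, 0.0) + t * frame_b.get(k, 0.0)
        for k in keys
    }

prev = {"jawOpen": 0.10, "mouthSmileLeft": 0.30, "browInnerUp": 0.20}
curr = {"jawOpen": 0.55, "mouthSmileLeft": 0.25, "browInnerUp": 0.35}

# Sample four render frames between two incoming animation frames (assumed rates):
for step in range(4):
    pose = sample_blendshapes(prev, curr, t=step / 3)
    print({k: round(v, 2) for k, v in pose.items()})
```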

Some key advantages of Audio2Face include:

  • Generates natural lip-sync for multilingual dialogue, songs, and gibberish.
  • Supports both lip-sync and emotional facial expressions.
  • Provides intuitive controls to adjust the animation style and intensity.

Slower Than OVR but Higher Quality

While Audio2Face offers superior quality in terms of facial animation, it is slower than OVR Lip Sync. This trade-off is because of the more complex processing involved in analyzing audio and generating blend shape weights. 

Key points of comparison include:

  • Quality: Audio2Face provides higher quality animations with more nuanced and natural movements, thanks to its use of AI and a greater number of blend shapes.
  • Performance: Audio2Face's processing time is longer, which can introduce latency in real-time applications. This is a significant consideration for developers who want to maintain low latency in interactive environments.

Plans for Improving the Quality

We are actively working on improving Audio2Face based on your feedback and requests. Some key areas of focus include:

  1. Latency Reduction: We optimize the AI models and container orchestration to reduce processing times. This includes refining the algorithms and leveraging more efficient computational techniques.
  2. Enhanced Blend Shape Mapping: Improving the accuracy of blend shape weights to better capture subtle facial expressions and ensure even more lifelike animations.
  3. Expanded Support for Emotions: Integrating more advanced emotion recognition so that characters lip-sync accurately and convey appropriate emotions through facial expressions.
  4. Developer Tools and Documentation: Providing better tools and comprehensive documentation (including tutorials and sample projects) to help you easily use Audio2Face with the Convai plugin.

Big changes are coming to Audio2Face to improve its quality, speed, and ease of use, so that you can create convincing facial animations for virtual characters across a wide range of applications.

| Lip-Sync Technique | Strengths | Suitability |
| --- | --- | --- |
| Hardcoded Blend Shapes (ARKit) | Best for simple applications where quick and easy implementation is required. | Suitable for projects with limited animation needs and lower complexity. |
| OVR Lip Sync (Oculus/Meta) | Offers a good balance between ease of use and quality. | Ideal for real-time applications and projects requiring moderate to high-quality lip-syncing. |
| NVIDIA Audio2Face | Perfect for enterprise solutions and projects with complex animation requirements. | Suited for high-end applications where the highest quality of lip-syncing is essential. |

Lip-Syncing a MetaHuman Character in Unreal Engine with Convai

The Convai plugin helps you integrate two main lip-sync technologies: OVR Lip Sync and Reallusion CC4 Extended (Reallusion CC4+) blend shapes. It primarily supports three character types: MetaHuman, Ready Player Me, and Reallusion CC4 characters.

In this section, let’s add lip sync to a MetaHuman character. In Unreal Engine (UE), you can add lip sync to a MetaHuman using the Convai plugin from the UE Marketplace.

NOTE: Before you begin, ensure your character model has a compatible facial rig using blend shapes or bones to control the mouth, jaw, and other facial features. Unreal Engine has a tutorial on YouTube on creating facial rigs for your specific MetaHuman character. Also, check out the documentation to learn more about it.

Prerequisites

  1. Set up your project with the Convai plugin. Learn how in the documentation.
  2. Add MetaHuman to your Unreal Engine project. See the documentation on how to do it.

  3. Modify the parent class of your MetaHuman to ConvaiBaseCharacter, as indicated in the documentation.

Steps to add LipSync component to MetaHuman in Unreal Engine:

  1. Open your MetaHuman blueprint.
  2. Navigate to the Components section and select the Add button.
  3. Search for Convai Face Sync and select it.
  4. Compile and save the blueprint, then try it. LipSync is now added to your MetaHuman character.

If you use Unity, you can also install the Convai Unity SDK from the Asset Store. There is also a video tutorial on creating lifelike NPCs in Unity with Reallusion CC4 characters.

To add the lip-sync component to your Ready Player Me character, follow the documentation's instructions.

Performance Considerations

When integrating lip-sync functionality into your Unreal Engine project using Convai, it's essential to consider the performance implications. Convai has designed its lip-sync component to minimize the impact on client-side performance while optimizing server-side latency.

This section will discuss the minimal client-side performance impact and the server-side latency optimization techniques to ensure smooth and efficient lip-syncing.

Minimal Client-Side Performance Impact

One significant advantage of using Convai's lip-sync solution is the minimal performance impact on the client side. We designed the lip-sync component to be lightweight so it does not add significant overhead to the client application. Here’s how:

  • Offloaded Processing: The computationally intensive tasks of processing and converting audio data into viseme data are handled on the server side. This offloads the heavy lifting from the client, so the game or application runs smoothly without significant additional resource demands.
  • Optimized Component Design: The Convai Lip Sync component is designed to integrate seamlessly with Unreal Engine, using the engine's existing capabilities for animation and rendering. The skinned mesh renderers for the face, teeth, and tongue are optimized to handle real-time updates without a noticeable drop in frame rates or slowdowns during interactive sessions.

Based on benchmarks conducted by our engineering team, the client-side portion of the lip-sync pipeline adds negligible processing time. Most of the latency is introduced on the server side, where the audio data is analyzed and the viseme data is generated. Even low-end client devices (e.g., Android phones) can benefit from high-quality lip-sync without sacrificing performance.

You can be confident that enabling lip-sync will not adversely affect the performance of your projects.

Server-Side Latency Optimization

Convai is continuously working on optimizing the server-side latency to provide the best possible lip-sync experience. The goal is to minimize the delay between the audio input and the corresponding lip-sync output. This ensures that the character's mouth movements are synchronized with the speech in real-time.

To achieve low latency, Convai employs various techniques, such as:

  • High-Performance Servers: Convai uses Kubernetes to manage and scale the server resources dynamically. This helps with efficient load balancing and resource allocation. Our servers can handle high volumes of audio data without significant delays.
  • Low-Latency Communication Protocols: Convai uses low-latency communication protocols such as gRPC (gRPC Remote Procedure Calls) to transmit data between the client and server. gRPC is designed for high-performance communication, which reduces the time it takes to send and receive data (see the sketch after this list).
  • Efficient Audio Processing: Our audio processing algorithms (e.g., smoothing) are optimized for speed and accuracy. They quickly convert audio input into viseme data so that the lip-sync animation remains in sync with the audio with minimal delay.
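For a sense of how the streaming pattern works, here is a hedged Python sketch of sending audio in chunks over a gRPC bidirectional stream and receiving viseme frames as they are produced. The service, message, and stub names (lipsync_pb2, LipSyncStub, StreamVisemes, AudioChunk) are hypothetical placeholders, not Convai's actual API; only the grpc calls themselves are real library functions.

```python
# Hypothetical gRPC streaming client: audio chunks go up, viseme frames come back.
# lipsync_pb2 / lipsync_pb2_grpc stand in for generated protobuf modules and are
# assumptions for illustration.
import grpc

import lipsync_pb2        # hypothetical generated message module
import lipsync_pb2_grpc   # hypothetical generated stub module

CHUNK_SIZE = 4096  # bytes of raw PCM audio per message (assumed)

def audio_chunks(path):
    """Yield fixed-size audio chunks so the server can start producing visemes
    before the whole utterance has been uploaded."""
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            yield lipsync_pb2.AudioChunk(data=chunk)

def stream_visemes(address, audio_path):
    """Bidirectional streaming keeps latency low: viseme frames arrive while
    later audio chunks are still being sent."""
    with grpc.insecure_channel(address) as channel:
        stub = lipsync_pb2_grpc.LipSyncStub(channel)
        for frame in stub.StreamVisemes(audio_chunks(audio_path)):
            yield frame.weights  # e.g., 15 viseme weights per 10 ms frame (assumed)

# for weights in stream_visemes("lipsync.example.com:50051", "utterance.raw"):
#     apply_to_character(weights)  # hypothetical client-side hook
```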

It's worth noting that while Convai's lip-sync solution is designed to minimize latency, there may still be some delay depending on factors such as network conditions and the complexity of the audio input. 

However, our team is actively working on further optimizations to reduce latency and improve the overall performance of the lip-sync functionality.

How Are We Working to Improve Lip Sync at Convai?

We are actively working on several exciting developments to enhance our lip-sync and facial animation plugins. As demand grows for realistic and engaging virtual characters, we are building features that make characters more expressive and emotionally responsive so they can better connect with users across applications.

Here’s what we are doing:

Incorporating Emotion Detection

One key area of focus is incorporating emotion detection based on the user's speech. We are improving the realism of our lip-sync component by detecting emotions in real time. Here’s how:

  • Analyzing Speech Patterns: Using advanced AI models to analyze audio input for emotional cues such as happiness, sadness, anger, and surprise. By detecting these emotions in real-time, the system can make characters more responsive and engaging.
  • Multimodal Emotion Recognition: Combining audio analysis with other inputs (personality, facial expressions, body language) to provide a comprehensive understanding of the character’s emotional state. This holistic approach ensures more accurate and nuanced emotion detection.
  • Context-Aware Emotion Detection: This involves considering the context of the conversation or interaction to better interpret and respond to emotions. It involves understanding the situational context, user history, and conversational flow.

Generating Emotional Responses

Building on emotion detection, we want characters to generate emotional responses that improve their expressiveness and realism:

  • Emotion-Driven Lip Sync: Adjusting lip-sync animations to reflect the detected emotions. For example, a character might speak more quickly and with more pronounced mouth movements when angry or more slowly and softly when sad.
  • Dynamic Facial Expressions: Generating facial expressions (eye gaze, head movements, body language) that match the emotional tone of the speech. This includes subtle changes in eye movement, eyebrow positioning, and mouth shape that convey the character’s feelings.
  • Contextual Dialogue Adjustment: Modifying the character's dialogue in real-time based on the detected emotion. For instance, if the user is frustrated, the character might use a more soothing tone and reassuring words.

Improving Facial Expressions for Realism

Improving facial expressions is crucial for achieving greater realism in virtual characters. At Convai, we are focused on several initiatives to improve this aspect:

  • Advanced Blend Shape Integration: Developing more sophisticated blend shapes that capture a wider range of facial expressions. This involves creating blend shapes representing complex emotions and subtle nuances in facial movements.
  • High-Fidelity Animation Techniques: Using animation techniques, such as motion capture and deep learning-based animation, to produce more lifelike and fluid facial movements. These techniques ensure that the characters’ expressions are both realistic and responsive.
  • Real-Time Facial Adaptation: Implementing algorithms that adapt facial expressions in real-time based on the character’s personality, state of mind, user interactions, and environmental factors. This dynamic adaptation ensures that characters remain engaging and believable in multiple cases.
  • User Feedback Integration: Continuously refining facial animation models based on user feedback and interaction data. By integrating these insights, we can fine-tune the models to better meet users' needs and expectations.

Our goal is to improve your experience by creating virtual characters that can truly connect with users on an emotional level.

Conclusion

Lip-syncing is crucial to creating realistic and engaging virtual characters. Convai offers multiple lip-sync solutions, including a basic form using hard-coded blend shapes, OVR Lip Sync from Oculus/Meta, and NVIDIA's more advanced Audio2Face tool.  

While each approach has strengths and weaknesses, they all seek to synchronize the character's mouth movements with the spoken dialogue, improving the interaction's overall believability.

Our team continuously explores ways to improve the quality, expressiveness, and efficiency of lip-sync animations as you demand immersive and interactive experiences. This includes incorporating emotion detection and generation, AI techniques, and real-time performance optimization. 

With ongoing advancements in this field, you can expect virtual characters to become increasingly lifelike, responsive, and engaging, opening up new possibilities for applications in entertainment, education, and beyond. 

Additionally, see our article on Convai’s Safety guardrails for AI Characters to learn more about how Convai ensures safe conversations with intelligent virtual characters.