Creating realistic and expressive facial animations for AI characters is a complex challenge in developing virtual worlds and interactive experiences. Developers and animators often struggle to achieve natural and believable facial movements that accurately synchronize with speech and emotions. This can lead to a disconnect between the character's appearance and behavior, diminishing the overall immersion and engagement of users.
Fortunately, recent advancements in AI (LLMs, diffusion models, and agents) have given rise to new solutions that promise to revolutionize lip sync and facial animation. Convai's plugins for platforms such as Unreal Engine, Unity, and Roblox include components for applying context-aware facial animations and lip sync to talking characters.
In this article, we will:
- Explore the challenges of traditional lip sync and facial animation methods.
- Introduce Convai's AI-powered lip sync and facial animation components.
- Discuss the benefits of using AI for lip sync and facial animation, including increased efficiency, reduced costs, and improved realism.
- Provide an overview of Convai's workflow and integration with popular game engines like Unreal Engine.
By the end of this article, you will know how to create highly realistic characters with precise lip sync and expressive facial animations.
See Also: Lip-Syncing Virtual AI Characters with Convai in Unreal Engine.
Importance of Facial Expressions in Creating Realistic AI Characters
Facial expressions are crucial for creating realistic AI characters, as they convey emotions and enhance the overall immersive experience for users. Realistic facial animations can improve user engagement by making interactions with virtual characters more relatable and lifelike.
Expressive faces help communicate non-verbal cues essential for natural and effective communication. Without accurate facial expressions, AI characters may appear robotic and fail to evoke the intended emotional responses from users.
Use of ARKit Blend Shapes
ARKit blend shapes provide a powerful tool for creating dynamic facial expressions. These blend shapes are based on the Facial Action Coding System (FACS), which breaks down facial movements into individual action units for regions such as the eyebrows, eyes, mouth, and cheeks.
Combining these action units produces a wide range of facial expressions that reflect different emotions. ARKit offers 52 distinct blend shapes, such as mouthSmile_L, eyeBlink_R, and browInnerUp, which can be combined to create nuanced expressions.
ARKit blend shapes are particularly effective because they provide a standardized way to achieve high-quality facial animations across different platforms and devices. These blend shapes are integral to generating realistic facial movements synchronized with audio for lip-syncing.
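To make this concrete, a single facial animation frame can be thought of as a map from ARKit blend shape names to weights between 0 and 1. The sketch below is illustrative Python (not Convai's API), using made-up weight values for a subtle smile:

```python
# Illustrative sketch: one facial animation frame as ARKit blend shape weights
# (0.0 = neutral, 1.0 = fully engaged). The names follow ARKit's standard
# blend shapes; the values are hypothetical.
neutral_face = {name: 0.0 for name in ("mouthSmile_L", "mouthSmile_R",
                                       "eyeBlink_L", "eyeBlink_R",
                                       "browInnerUp", "cheekSquint_L",
                                       "cheekSquint_R")}

subtle_smile = dict(neutral_face)
subtle_smile.update({
    "mouthSmile_L": 0.4,   # raise the left mouth corner
    "mouthSmile_R": 0.4,   # raise the right mouth corner
    "cheekSquint_L": 0.2,  # a slight cheek lift accompanies the smile
    "cheekSquint_R": 0.2,
})

print(subtle_smile)
```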
Predefined Poses for Emotions
You use predefined poses—sets of blend shape values representing specific emotions—to simplify the creation of emotional expressions. For instance, a “happy” pose might involve raised eyebrows, a wide smile, and lifted cheeks, while a “sad” pose might include furrowed brows, a downturned mouth, and drooping eyelids.
Here are some common predefined poses:
- Happy: Raised eyebrows, wide smile, lifted cheeks
- Sad: Furrowed brows, downturned mouth, drooping eyelids
- Angry: Lowered brows, pursed lips, flared nostrils
- Surprised: Wide eyes, raised eyebrows, slightly open mouth
- Fear: Tense facial muscles, wide eyes, slightly open mouth
Using these predefined poses, you can quickly apply realistic facial expressions to AI characters to improve their emotional depth and make them more engaging.
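In practice, each predefined pose can be stored as a named set of blend shape weights. The sketch below is illustrative Python with hypothetical weights (not Convai's internal data) encoding two of the poses listed above:

```python
# Illustrative sketch: predefined emotion poses as ARKit blend shape weights.
# The weights are hypothetical; tune them for your own character.
POSES = {
    "happy": {
        "browInnerUp": 0.3,    # raised eyebrows
        "mouthSmile_L": 0.8,   # wide smile
        "mouthSmile_R": 0.8,
        "cheekSquint_L": 0.5,  # lifted cheeks
        "cheekSquint_R": 0.5,
    },
    "sad": {
        "browDown_L": 0.6,     # furrowed brows
        "browDown_R": 0.6,
        "mouthFrown_L": 0.7,   # downturned mouth
        "mouthFrown_R": 0.7,
        "eyeBlink_L": 0.3,     # drooping eyelids
        "eyeBlink_R": 0.3,
    },
}
```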
Blending Multiple Emotions
In real-world interactions, people often experience and express mixed emotions simultaneously. To capture this complexity, your AI character must seamlessly blend multiple emotions. This involves combining different predefined poses based on the intensity of each emotion.
For example, a character might be primarily happy but also slightly surprised. The system would blend the "happy" and "surprised" poses, resulting in a nuanced expression that reflects both emotions.
The process of blending multiple emotions involves:
- Receiving Emotional Data: The backend system analyzes the conversation's context and the character's emotional state.
- Calculating Blend Shape Values: The system calculates each emotion's appropriate blend shape value based on the emotional data.
- Blending Poses: The system blends predefined poses according to the calculated values for smooth emotional transitions.
- Applying to the Character: The blended facial expression is then applied to the AI character in real-time to create a lifelike and dynamic emotional response.
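As a rough illustration of the calculation and blending steps, the sketch below (Python, with hypothetical pose data and a simple weighted average, not Convai's actual algorithm) blends a "happy" and a "surprised" pose by intensity:

```python
# Illustrative sketch: blending predefined poses by emotion intensity.
# The pose weights and the weighted-average blending rule are assumptions,
# not Convai's actual implementation.
HAPPY = {"mouthSmile_L": 0.8, "mouthSmile_R": 0.8, "browInnerUp": 0.3}
SURPRISED = {"browInnerUp": 0.9, "eyeWide_L": 0.7, "eyeWide_R": 0.7,
             "jawOpen": 0.3}

def blend_poses(poses_with_intensity):
    """Blend several poses, each given as (pose_dict, intensity in [0, 1])."""
    total = sum(intensity for _, intensity in poses_with_intensity) or 1.0
    blended = {}
    for pose, intensity in poses_with_intensity:
        for shape, value in pose.items():
            blended[shape] = blended.get(shape, 0.0) + value * intensity / total
    return blended

# Primarily happy, slightly surprised.
frame = blend_poses([(HAPPY, 0.8), (SURPRISED, 0.2)])
print(frame)  # browInnerUp combines contributions from both poses
```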
When you blend multiple emotions, AI characters can exhibit a wide range of expressive behaviors that make interactions with them more realistic and engaging. This is essential for gaming, virtual reality, and other interactive media applications where emotional authenticity improves the user experience.
Integration and Workflow for AI Character Facial Expressions in Convai
Convai's lip sync and facial animation system is designed to provide seamless integration and an efficient workflow across various platforms. Convai ensures optimal performance and customization options for creating realistic character animations using server-side computation and client-side synchronization.
Server-Side Computation of Lip Sync and Facial Animation
Convai handles the heavy lifting of generating lip sync and facial animation data on the server side. This approach reduces client-side performance impact by offloading the computation to the server.
Client devices, especially those with limited processing power like mobile phones or web browsers, can focus on rendering and other tasks. This ensures a smooth user experience without overburdening the client's hardware.
Client-Side Integration in Unreal Engine, Unity, WebGL
Convai provides integration support for popular game engines and platforms, including Unity, Unreal Engine, Roblox, WebGL, and Discord, making it straightforward to incorporate lip sync and facial animation into your projects.
On the client side, platforms like Unreal Engine, Unity, and WebGL synchronize the audio playback with the lip sync frames received from the server. Convai's plugin handles this synchronization process so that the character's lips move in perfect harmony with the audio.
Each animation frame is carefully timed to correspond with the audio, making the characters' speech appear natural and fluid. Synchronization is crucial for creating a believable and immersive character performance.
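Conceptually, the client keeps the lip sync frames aligned with the audio clock rather than with its own render loop. The sketch below (Python, with a made-up frame rate and frame format, not the plugin's actual code) selects the frame to display from the current audio playback time:

```python
# Illustrative sketch: selecting the lip sync frame that matches the audio
# playback position. The frame rate, frame contents, and function name are
# assumptions for demonstration only.
LIPSYNC_FPS = 30.0  # assumed rate at which the server emits frames

def frame_for_playback_time(frames, playback_time_s):
    """Return the blend shape frame matching the current audio time."""
    index = int(playback_time_s * LIPSYNC_FPS)
    index = min(index, len(frames) - 1)  # clamp at the last frame
    return frames[index]

# Example: 2 seconds of received frames, audio clock currently at 0.50 s.
frames = [{"jawOpen": 0.1 * (i % 10)} for i in range(60)]
current = frame_for_playback_time(frames, 0.50)
print(current)  # frame 15 -> {'jawOpen': 0.5}
```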
Developer Customization Options
Convai offers flexibility and customization options to fine-tune the lip sync and facial animation system to your needs as a developer.
Enabling/Disabling Features in Initial Request to Server
When sending the initial request to the Convai server, you can control which features to enable or disable. For example, you can receive only the audio without lip sync data or request both audio and lip sync frames for comprehensive character animation.
This granular control allows you to optimize performance and tailor the system to the project's requirements.
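The exact request fields depend on the SDK and API version you use; the sketch below is a hypothetical configuration object, not Convai's documented schema, and only illustrates the idea of toggling lip sync data on or off per session:

```python
# Hypothetical session configuration, for illustration only. These field names
# do not come from Convai's documentation; consult the docs for the real schema.
session_config = {
    "audio_response": True,   # always receive the spoken audio
    "lipsync_frames": False,  # skip lip sync data for an audio-only use case
}

def build_request(user_text, config):
    """Assemble an illustrative request payload with per-feature toggles."""
    return {"text": user_text, "features": dict(config)}

request = build_request("Hello, Echo!", session_config)
print(request)
```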
Mapping Visemes to Custom Character Blend Shapes
For custom characters with unique blend shapes, developers can map the visemes (visual representations of speech sounds) to their character's specific blend shapes. Convai provides a tutorial and step-by-step guidance on how to perform this mapping process.
You can ensure the lip sync looks natural and matches the character's unique facial structure by customizing the viseme-to-blend shape mapping.
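For example, a mapping table can translate each viseme the server sends into one or more of your character's blend shapes. The sketch below (Python, with hypothetical viseme labels and blend shape names) shows the idea:

```python
# Illustrative sketch: mapping visemes to a custom character's blend shapes.
# The viseme labels and blend shape names are hypothetical; replace them with
# the ones your character actually exposes.
VISEME_TO_BLEND_SHAPES = {
    "PP": {"MyChar_LipsClosed": 1.0},                       # p, b, m sounds
    "FF": {"MyChar_LowerLipBite": 0.8},                     # f, v sounds
    "AA": {"MyChar_JawOpen": 0.9, "MyChar_LipsWide": 0.3},  # open "ah"
    "OO": {"MyChar_LipsPucker": 0.9},                       # rounded "oo"
}

def apply_viseme(viseme, weight):
    """Scale the mapped blend shapes by the viseme's current weight."""
    targets = VISEME_TO_BLEND_SHAPES.get(viseme, {})
    return {shape: value * weight for shape, value in targets.items()}

print(apply_viseme("AA", 0.5))  # {'MyChar_JawOpen': 0.45, 'MyChar_LipsWide': 0.15}
```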
Customizing Facial Animations and Lip-Sync in Unreal Engine
In Unreal Engine, you can fine-tune lip sync and facial animations to achieve the desired results for your AI characters. The Convai plugin allows developers to adjust and customize how characters' mouths and facial features move during speech.
This customization is crucial for ensuring that animations align with the project's specific visual style and requirements, whether it's a realistic or stylized character.
The video below is a tutorial on integrating custom characters with Convai in Unreal Engine and adding components like facial animations and lip sync.
Here are the steps from that video on how to customize lip-sync and facial animations in Unreal Engine:
- Create a new Convai integration folder in the content folder.
- Create a new blueprint for your character and name it Echo. For the parent class, select Convai Base Character.
- Add your character's skeletal mesh to the blueprint.
Set Up Face Animations
- Download the latest Convai Reallusion animation blueprint using this link.
- You will add animations in the body animation section of the animation graph.
- Connect your idle, walking, and transition animations.
- Select Echo's head bone in the head rotation section so that Echo can look at the player.
- Assign Echo's right and left eye bones in their respective sections for eye animation. If you want Echo to have facial emotions, you can add animations for each emotion in the emotions layer.
Set Up Lip Sync
Convai currently uses viseme-based animations for lip sync rather than ARKit blend shapes; support for ARKit blend shapes is planned for future releases of the plugin. To set up lip sync, map each viseme to Echo's corresponding blend shapes or bones.
Here's how to do it:
- Open Echo's skeletal mesh and navigate to the morph targets section. Identify the relevant blend shapes or bones in Echo's morph targets or skeleton.
- In the lip sync layer, map each viseme to Echo's corresponding blend shapes or bones. For each viseme, replace the existing blend shapes node with a Modify Curve node. Right-click the Modify Curve node and select Add Curve for each relevant blend shape, then adjust the values for Echo's specific facial structure.
- Compile the animation blueprint to ensure there are no errors.
Note that the main challenge in this process is the mapping from visemes to your character's specific blend shapes or bones. This step can vary significantly depending on your character's design and facial structure.
Take your time to carefully identify the relevant blend shapes or bones for each viseme and adjust as needed to achieve the desired lip sync animation.
Recent Improvements to Convai’s Plugin for Face Animations and Lip Syncing
Convai's lip sync and facial animation components have undergone significant advancements over the past year. These improvements have increased the realism and accuracy of the animations, resulting in a more immersive and engaging experience for users.
Improved Viseme-Based Lip Sync Accuracy
One of the key improvements has been the optimization of the viseme-based lip sync system. Previously, the lip movements were often inaccurate and noisy, with fast fluctuations that appeared unnatural.
However, by implementing server-side post-processing techniques, Convai has significantly improved the accuracy and smoothness of the lip sync, resulting in more realistic animations.
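Convai has not published the details of its post-processing, but a common way to suppress fast, noisy fluctuations in per-frame weights is temporal smoothing. The sketch below applies a simple exponential moving average as one illustrative approach:

```python
# Illustrative sketch: smoothing a noisy sequence of viseme weights with an
# exponential moving average. This is a generic technique, not a description
# of Convai's actual post-processing.
def smooth(weights, alpha=0.3):
    """Blend each new weight with the previous smoothed value."""
    smoothed, previous = [], None
    for w in weights:
        previous = w if previous is None else alpha * w + (1 - alpha) * previous
        smoothed.append(round(previous, 3))
    return smoothed

noisy = [0.1, 0.9, 0.2, 0.8, 0.25, 0.75]
print(smooth(noisy))  # fluctuations are damped while the overall shape remains
```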
Introduction of Audio2Face for ARKit Blend Shapes (in Beta)
We have also started rolling out NVIDIA's Audio2Face on the Convai server. Unlike the viseme-based system, Audio2Face outputs ARKit blend shapes directly, providing a more advanced and expressive solution.
It generates blend shapes not only for the lips but also for the eyebrows, eyes, and the entire face, enabling a wider range of emotions and expressions. Although Audio2Face is computationally expensive and in beta, it is gradually being rolled out for high-tier plans and partners.
Transition to AI-Generated Facial Expressions to Match Voice
Looking ahead, Convai is excited about the ongoing work on AI-generated facial expressions. We are training models to output emotional voice, emotional lip sync, and corresponding facial expressions.
Our users want to move beyond preset poses and achieve even greater realism. This transition will ensure that your character's voice, facial expressions, and lip sync are seamlessly synchronized to improve the overall believability of the animations.
Applications in Brand Agents, Gaming, and Education
Advanced lip sync and facial animation technology open up many potential applications. In the gaming industry, these improvements can lead to more immersive and engaging character interactions, enhancing the overall player experience.
Brand agents can benefit from more natural and expressive communication, making the interaction more human-like. Realistic facial animations can also help the education sector create compelling e-learning content.
Conclusion
Throughout this article, we reviewed face animation, lip-syncing, and the complexities of creating realistic and expressive animations. We also covered how to use the Convai plugin in a platform like Unreal Engine to integrate facial animations and apply lip sync to your character.
The integration process involves server-side computation for performance optimization and client-side synchronization for seamless playback. You also have customization options for tailoring animations to specific characters and use cases.
Convai is actively working on integrating facial capture technology and AI-generated facial expressions for even greater realism and expressiveness. Convai empowers developers to create engaging and believable character animations with AI across various platforms.
Check out our other article on lip-syncing AI characters, and sign up for our Discord if you have questions about using Convai.