Experience Safety Overview: Understanding Guardrails for Convai-Powered AI Characters

By
Convai Team
May 21, 2024

As AI characters in games become increasingly prevalent, it is paramount to ensure they behave in ways that are safe, responsible, and appropriate to their lore. The lore is the backstory and contextual details that enrich the character's universe. 

Without proper safety guardrails, these AI characters may generate content or engage in conversations inconsistent with the application context, potentially exposing users to harmful content.

This article will discuss Convai's multi-layered safety architecture, which allows users to create engaging, dynamic, safe, and content-moderated AI characters for video games and virtual worlds. Watch this video to check out some of Convai’s customer stories, new features and learn more about the product.

Why Moderation and Safety Are Important for AI Characters

As AI characters become more common across gaming, education, healthcare, and retail industries, their safety and moderation are essential for positive user experiences and risk mitigation.

Strong guardrails are needed to prevent AI characters from engaging in harmful, offensive, or inappropriate behavior that could harm users and the companies deploying them.

AI Character Guardrails in the Gaming Industry

In gaming, AI characters often serve as companions, guides, or opponents. However, without proper moderation, these virtual characters may exhibit toxic behavior, use offensive language, or even cheat, ruining the gaming experience for players.

Implementing robust content moderation systems, such as chat filters and automated detection of inappropriate behavior, helps maintain a safe and enjoyable gaming environment.

AI Character Guardrails in the Learning and Education Sector

Smart characters in education act as AI tutors, mentors, or simulated patients/clients. They must provide learners with accurate, unbiased, and appropriate information. 

Guardrails ensure that the virtual characters adhere to the intended educational content, avoid promoting misinformation, and maintain professional boundaries with learners.

Guardrails for AI Brand Agents

Recently, brands have started experimenting with AI characters, and these AI brand agents can be seen directly interacting with customers. Inadequate moderation could lead to brand agents behaving in ways that could damage the brand's reputation, offend customers, or provide misleading product information.

Strong AI guardrails help maintain brand consistency, prevent inappropriate or offensive interactions, and ensure compliance with advertising regulations.

Guardrails for Embodied AI Agents

Embodied AI characters interact with many people, including children and vulnerable populations. Without proper safety measures, these AI agents may exhibit biases, engage in inappropriate conversations, or pose physical risks to users.

Implementing strict moderation policies, content filters, and fail-safe mechanisms is essential to protect users and maintain public trust in these technologies.

Strong guardrails are necessary across all these use cases to ensure that AI characters behave in safe, appropriate, and beneficial ways. At Convai, we recognize the importance of moderation and safety in developing and deploying AI characters. 

The next section will discuss how Convai uses guardrails, moderation, and safety to address these concerns and create AI characters that improve user experiences and reduce risks.

How Convai Provides Guardrails for AI Characters

To mitigate those risks, Convai provides a robust, multi-layer safety architecture for AI characters that users create using our platform:

  1. Core Model Training and Finetuning: Ensuring the AI characters are well-grounded and aligned with the game world.
  2. Character Crafting Features: Tools that enable developers to implement guardrails and safety measures tailored to individual characters.
  3. Test Framework: A comprehensive testing framework to evaluate and ensure the safety and reliability of AI characters.
  4. External AI Audit: Third-party audits to verify the safety and compliance of AI characters with established guidelines.

Let’s take a look at each component.

Core Model Training and Fine-tuning

Universe Grounding 

The foundation of any believable AI character is its knowledge—what it knows about itself, the game world it inhabits, and how that world relates to the real one. While base language models are trained on large amounts of real-world data, AI characters in games usually operate in fictional universes that may differ significantly from ours.

To solve this, Convai has developed extensive datasets to fine-tune base language models, grounding them in the lore and laws of the game world. As a result, narrative designers at game studios can create characters that are true to and constrained by the universe in which they exist.

For example, an AI character who’s a merchant in a medieval fantasy RPG would be knowledgeable about the kingdoms, magic system, and history of its game world but should not start referencing smartphones, space travel, or other concepts foreign to that universe. 

Convai includes a “Knowledge Bank” system for storing large amounts of text-based knowledge for your character. The Test Framework complements the knowledge bank to bridge the gap between the dynamic nature of AI-driven conversational agents and the need for reliable, predictable interactions. More on this later.

Brand Safety and Guardrailing

Equally important to lore-consistency is brand safety—ensuring AI characters do not generate content or engage in harmful, hateful, dangerous, or otherwise inappropriate conversations with the game's target audience. 

Convai takes a two-pronged approach to this:

  1. Pre-Training Censoring: The base language models Convai builds upon, such as those from OpenAI and Meta, have some built-in content filtering to avoid generating harmful text right out of the box.
  2. Fine-Tuning Training: Convai goes further by creating proprietary datasets to fine-tune these base LLMs and baking stricter content moderation policies in the model. These datasets are carefully designed to train models to avoid graphic violence, hate speech, harassment, self-harm instructions, and sexually explicit content.
High accuracy fine-tuned knowledge bank process in Convai.

This finetuning process makes content moderation an inherent part of the LLM rather than a separate blocking layer. In practice, this means that when prompted with a potentially inappropriate query, the model will not simply refuse to answer but will generate a response that diverts the conversation in a safer direction.

For instance, if a player asks an AI character how to build a bomb, the model, having internalized these content policies during finetuning, might respond with something like: "I apologize, but I do not feel comfortable providing any information about the creation of weapons or explosives, as that could be dangerous. Perhaps we could find a safer topic to discuss?"

Convai Features for Grounding and Guardrailing

1. Character Identity and Topic Grounding

The following features are applied in real-time during character interactions to maintain consistency with the character's identity and keep conversations on track:

Character Backstory and Personality: This feature grounds the character with its identity and keeps it in character and on-brand. Convai works with the game studio's narrative and design team to write the prompt defining the AI character's personality. 

This prompt dictates the character's backstory, tonality, approach to answering questions, and more. This backstory and personality prompt sets the base for the character and guides its responses.

Knowledge Bank (RAG System): While retrieving relevant knowledge, the Knowledge Bank also acts as a guardrailing mechanism, constraining the model's information to the provided knowledge. 

Every time the AI character speaks, it retrieves information from its Knowledge Bank, prioritizing that knowledge over the base model's general information. This inherently applies more guardrails to the conversation and helps ensure the character stays on-topic.

Narrative Design:This feature outlines how the character should direct the flow of the conversation. For example, once the AI character asks question X, if the user responds with Z, then the character is prompted to ask question Y next. This feature helps steer the model to ensure conversations stay on predefined narrative tracks.

Read Also: The Future of NPC Interaction with Convai's Narrative Design.

Limit Responses: This feature combines grounding prompt instructions and language model settings, such as temperature, that allow the character designer to set the response scale between one and five. A value of one means the character will stick closely to the point, largely relying on facts provided via the knowledge base. 

A value of five gives the character more creative freedom to derive from the information baked into the base language model, leading to more open-ended responses. Depending on the game design and character objective, Convai sets this value, defaulting to 3 for a balance between specificity and flexibility.

2. Convai's Brand and IP-Specific Filtering APIs

These features provide an additional layer of safety filtering to avoid discussions on certain topics or brands.

Denylist and Allowlist Guardrails: Convai provides a dedicated interface for your studio's narrative, design, and brand team to define a set of approved (allowlist) and unapproved (denylist) topics for the AI character to adhere to. 

For example, the allowlist might include topics central to the game's theme and narrative, while the denylist would bar the character from discussing competitor IPs or controversial real-world issues. Convai will guide your creative team through these configurations to ensure the character only engages with approved topics.

Blocked Words: This is a straightforward list of words to be removed from the character's responses, typically including common curse words and other inappropriate language.

Guardrails applied during an interaction in Convai.

3. Moderation Filter API

As a general layer of safety filtering on top of Convai's character language model, the Moderation Filter API checks the character's responses against the following categories before returning them:

  • Violence.
  • Violence/graphic.
  • Hate.
  • Hate/threatening.
  • Harassment.
  • Harassment/threatening.
  • Self-harm.
  • Self-harm/intent.
  • Self-harm/instructions.
  • Sexual content.
  • Sexual content/minors.

Convai uses OpenAI's moderation endpoint for this filtering. If a user message triggers one of these moderation filters, the character will deliver a pre-scripted message informing the user that their input violates the studio's community guidelines and try to redirect the conversation.

For more details on OpenAI's moderation system, see their documentation.

Test Framework

Convai provides a comprehensive test framework for a robust AI character testing process. The framework allows studios to create extensive test sets that rigorously test the character across various scenarios and edge cases.

For example, tests might include probing the character's knowledge boundaries to ensure it adheres to its intended domain expertise, challenging its ability to stay in character when asked unusual or tangential questions, or stress-testing its capacity to handle complex multi-turn conversations.

Convai works closely with video game studio's narrative, design, and QA teams to develop tests that evaluate the character's accuracy, screen for unsafe or inappropriate content (e.g., violence, hate speech, explicit themes, etc.), and assess the overall quality and coherence of its responses. The goal is to identify and remediate potential issues or inconsistencies before launch.

To streamline this process, we collaborate with game studios to auto-generate relevant test cases based on the AI character's specific backstory, personality, and intended use case. These auto-generated tests help initiate the testing process and augment the studio's test sets for comprehensive coverage.

Convai Testing Framework character tool.

Third-party tests

In addition to internal testing, Convai works with studios and brands to recruit and onboard third-party AI audit experts. These safety experts independently verify the robustness and reliability of AI characters.

We have already worked with such external parties and received overwhelmingly successful results in preparation for production deployment with some of the world’s largest brands.

Conclusion

Our multi-layered approach to AI characters at Convai ensures that AI characters stay true to their intended personalities and narratives while avoiding problematic topics. This approach ranges from character backstory grounding and knowledge bank integration to real-time content filtering and rigorous scenario testing by internal teams and leading external auditors.

This empowers studios to create engaging, immersive, and trustworthy AI characters that captivate audiences while mitigating risks. At Convai, we innovate to deliver responsible, high-quality, interactive AI experiences across industries.

Read our blog about the future of NPC interaction to know more about Convai’s Narrative design process.