Build Vision-Based Conversational AI Characters in Unity

By Convai Team
November 19, 2025

Imagine putting on your Quest headset, looking at a tool on your workbench and just asking:

  • “What does this do?”
  • “Is this the right way to hold it?”
  • “How do I turn this on safely?”

…and a calm, expert AI voice answers in real time, based on what it sees.

That’s what you’ll build in this tutorial: a vision-based, disembodied AI character (think “your own Jarvis”) in Unity using the Convai Unity Plugin and the Live Character API. (Download the Unity plugin now)

See the demo for yourself:

What you’re actually building

By the end, you’ll have:

  • A disembodied Convai character living in a Unity scene, talking to you in real time
  • Vision wired through the main camera and a crosshair, so it can answer questions about whatever you’re looking at
  • Answers grounded in your own manuals and docs via the Knowledge Bank
  • Optionally, the whole experience running on a Meta Quest headset

So if you’ve ever wanted a floating expert voice that just… knows what you’re looking at, this is for you.

How Convai’s Unity plugin helps

The new Convai Unity plugin is powered by the Live Character API, built on WebRTC. Practically, that means low-latency, streaming voice in and out of your scene, instead of a laggy request-and-response loop.

Your character can:

  • See what the camera or crosshair is looking at
  • Understand it with the help of your manuals, PDFs, images
  • Remember context across turns and sessions
  • Talk back naturally, like a human expert

Step 1: Give your AI a brain

Before touching Unity, set up your character on the Convai Playground.

  1. Create a character in Playground
    • Give it a name or a role: “Expert woodworking coach”, “Lab supervisor”, “Machine operator trainer”…
    • Add a short backstory in character description and tweak its personality: calm, practical, safety-first, good at explaining things in simple language.
  2. Add knowledge
    • Upload manuals, SOPs, spec sheets, safety docs, or labeled images to Knowledge Bank.
    • This is how your Jarvis knows the difference between a feed selector and a material removal gauge.
  3. Pick language and voice
  4. Test quickly in the browser
    • Ask a few questions:
      • “What does the material removal gauge do?”
      • “How do I safely turn this planer on?”
    • Make sure the answers sound grounded and on-brand before you wire anything into Unity.

Once you like how it thinks and talks, you’re ready to bring it into a scene.

Step 2: Install the Convai Unity Plugin

Now hop into Unity and install the plugin. Here’s how:

  1. Create/open a Unity project
    • Use a recent Unity version (e.g., 2022+).
    • A Universal 3D template is a good starting point.
  2. Import the Convai package
    • Go to Assets → Import Package → Custom Package…
    • Select the Convai .unitypackage you downloaded.
    • Click Import and let Unity bring everything in.
  3. Run the setup tool (if provided)
    • Some plugin versions include a project setup tool (for PC + Meta Quest).
    • Open it and click Fix All so Unity applies recommended settings.
  4. Add your Convai API key
    • Use the top menu (for example: Convai → Account / API Key Setup).
    • Paste your Convai API key from the dashboard and save.
    • Unity will store it so the plugin can talk to Convai.
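
If you’re setting up more than one machine or project, parts of this flow can be scripted. As a minimal sketch (the menu name and package path below are our own placeholders, not part of the Convai plugin), Unity’s AssetDatabase can import a .unitypackage from an editor script placed in any Editor folder:

```csharp
using UnityEditor;

// Hypothetical helper, not shipped with Convai: imports the downloaded
// .unitypackage from code. Must live in an Editor folder.
public static class ConvaiPackageImporter
{
    [MenuItem("Tools/Import Convai Package")] // our own menu entry
    public static void Import()
    {
        // 'true' shows Unity's usual import dialog so you can review files.
        // Replace the path with wherever you saved the package.
        AssetDatabase.ImportPackage("Downloads/Convai.unitypackage", true);
    }
}
```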

If you run into any problems, check out the detailed step-by-step video tutorial below:

Step 3: Drop your “Jarvis” into a Unity scene

  1. Open the sample scene
    • Navigate to something like: Assets/Convai/Demo/Scenes/Convai Sample Scene.
    • Open it so you don’t start from scratch.
  2. Connect your character
    • In the Hierarchy, find the main Convai character object (often named something like ConvaiNPC).
    • Select it and look for fields like Character Name and Character ID in the Inspector.
    • Copy your Character ID from the Convai dashboard and paste it in.
    • Give it a name (e.g., “Steve”, “ShopCoach”, “LabGuide”).
  3. Import TextMesh Pro resources (if prompted)
    • Unity might ask to import TMP essentials for subtitles or debug UI.
    • Click Import TMP Essentials once and you’re done.
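
Those Character Name and Character ID fields are just serialized strings on the character’s component. If you want a guard against forgetting to paste the ID, here’s a tiny hypothetical helper (the field names mirror the Inspector labels but are our own, not the plugin’s actual API):

```csharp
using UnityEngine;

// Hypothetical sanity check, not part of the Convai plugin: warns at
// startup if the Character ID was left empty on this object.
public class CharacterIdSanityCheck : MonoBehaviour
{
    [SerializeField] private string characterName = "Steve";
    [SerializeField] private string characterID;   // paste from the Convai dashboard

    private void Awake()
    {
        if (string.IsNullOrWhiteSpace(characterID))
            Debug.LogWarning($"'{characterName}' has no Character ID set. " +
                             "Copy it from the Convai dashboard before pressing Play.");
    }
}
```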

Hit Play in the editor. You should now be able to talk to your character and have it respond in real time.

Step 4: Give your character “eyes” (vision in Unity)

Talking is nice. Seeing is better.

To let the character talk about what you’re looking at or pointing at, you’ll typically:

  • Use the main camera (desktop or XR) as the “eyes”
  • Add a crosshair / reticle UI so there’s a clear target in the scene

In the Convai Unity package, look for a crosshair/canvas prefab (often something like Convai Crosshair Canvas). Drop it into your scene. This tells the system what the user is currently focused on.
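
Under the hood, crosshair-style focus boils down to a raycast from the center of the camera’s view. The shipped prefab handles this for you, but here’s a minimal sketch of the idea in plain Unity (class and property names are our own, and the actual plugin may stream camera frames rather than raycast):

```csharp
using UnityEngine;

// Minimal focus-detection sketch, not the plugin's internal code:
// casts a ray from the screen center and remembers what it hits.
public class FocusRaycaster : MonoBehaviour
{
    [SerializeField] private float maxDistance = 10f;

    // Name of the object currently under the crosshair (null if none).
    public string CurrentFocus { get; private set; }

    private void Update()
    {
        // Viewport (0.5, 0.5) is the exact center of the screen,
        // i.e., where the reticle sits.
        Ray ray = Camera.main.ViewportPointToRay(new Vector3(0.5f, 0.5f, 0f));

        CurrentFocus = Physics.Raycast(ray, out RaycastHit hit, maxDistance)
            ? hit.collider.name
            : null;
    }
}
```

Note that a raycast only “sees” objects with colliders, so make sure the controls and parts you care about have them.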

Now try:

  • Aim at a control or object in the scene
  • Ask: “What is this?” or “What does this part do?”

The character should answer contextually, using both what it sees and what you’ve put in Knowledge Bank.

This is how you get exchanges like:

“What’s this gauge for?”
“You’re pointing at the material removal gauge. It shows how much material is removed in a single pass.”

or

“Is this the right way to feed the board in?”
“You’re holding the board flat against the table—that’s exactly right. Always keep it flat against the bed to avoid kickback.”

Step 5: Optional – Run it on Meta Quest

If you want this experience inside a Quest headset:

  1. Switch build target
    • Go to File → Build Settings / Build Profiles.
    • Add your active scene (remove the sample scene if it’s still there).
    • Select Android / Meta Quest as the build platform and click Switch Platform.
  2. Apply XR settings
    • Use Meta XR tools or the provided setup helper (again, usually a Fix All button).
    • Let it configure OpenXR, Android player settings, etc.
  3. Build & run
    • Select your headset under Run Device (if supported).
    • Click Build & Run.
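
If you end up rebuilding a lot, the Build & Run steps can be wrapped in a small editor script (it must sit in an Editor folder; the scene and output paths below are placeholders for your project):

```csharp
using UnityEditor;
using UnityEngine;

// Optional one-click build helper; not part of the Convai plugin.
public static class QuestBuildMenu
{
    [MenuItem("Tools/Build Quest APK")] // our own menu entry
    public static void BuildQuestApk()
    {
        var options = new BuildPlayerOptions
        {
            scenes = new[] { "Assets/Scenes/MyConvaiScene.unity" }, // placeholder
            locationPathName = "Builds/ConvaiQuest.apk",            // placeholder
            target = BuildTarget.Android,
            options = BuildOptions.None
        };

        var report = BuildPipeline.BuildPlayer(options);
        Debug.Log($"Quest build finished: {report.summary.result}");
    }
}
```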

Put on your headset, look at your scene, and start talking.

Now you’ve basically turned your headset into a vision-powered AI assistant for your environment.

Design tips: making your Jarvis genuinely helpful

A few tweaks go a long way:

  • Keep answers short by default
    • Users are often standing, moving, or holding tools—no one wants a monologue.
    • Use follow-up questions for deeper detail.
  • Lead with safety
    • For tools, machines, labs: always mention PPE and safe posture early.
    • Example: “Before you turn that on, make sure your safety glasses are on.”
  • Encourage “show, don’t tell” questions
    • “Hold the part up to the camera and ask me what it’s for.”
    • “Point at the control you’re unsure about.”
  • Chunk the knowledge
    • One doc for controls, one for onboarding, one for troubleshooting.
    • Easier to maintain and easier for the model to use effectively.
  • Test with real phrasing
    • Don’t just test “What is the material removal gauge?”
    • Also test “What’s this thing?” / “What does this gauge do?” / “Am I using this right?”

Troubleshooting

If something feels off:

  • Character not responding
    • Double-check your API key and Character ID.
    • Make sure the scene has the Convai scripts/components enabled.
  • Can’t build for Quest
    • Re-run the setup tool and make sure the right scene is added to Build Settings.
    • Check Android / XR settings are applied.
  • Character gives vague answers
    • Tighten the Character Description (role, speaking style).
    • Add or refine Knowledge Bank documents.
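
For the vision side specifically, it helps to see what the crosshair is actually hitting. A small debug component like this (plain Unity, building on the raycast idea from Step 4) draws the focus ray in the Scene view and logs whenever the target changes:

```csharp
using UnityEngine;

// Debug aid, not part of the Convai plugin: visualizes the center-screen
// ray and logs the object under the crosshair whenever it changes.
public class FocusDebug : MonoBehaviour
{
    [SerializeField] private float maxDistance = 10f;
    private string lastTarget;

    private void Update()
    {
        Ray ray = Camera.main.ViewportPointToRay(new Vector3(0.5f, 0.5f, 0f));
        bool hitSomething = Physics.Raycast(ray, out RaycastHit hit, maxDistance);

        Debug.DrawRay(ray.origin, ray.direction * maxDistance,
                      hitSomething ? Color.green : Color.red);

        string current = hitSomething ? hit.collider.name : "(nothing)";
        if (current != lastTarget)
        {
            Debug.Log($"Crosshair target: {current}");
            lastTarget = current;
        }
    }
}
```

If the ray lands on the wrong object (or nothing), check colliders and your camera setup before blaming the character description.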

Where you can use this

Once the pipeline is working, you can reuse the same pattern for:

  • Workshops & fab labs – tool setup, safe first use
  • Manufacturing & maintenance – visual checks, troubleshooting guidance
  • Training simulators – “coach over your shoulder” scenarios
  • Museums & exhibitions – “what am I looking at?” guided tours
  • Retail & showrooms – product explainers and fitting guidance

Anywhere someone can point at something and ask a question, a vision-based conversational character can help.

Wrap-up

You just connected three powerful things:

  1. A Convai character with real domain knowledge
  2. The Convai Unity plugin for low-latency voice
  3. Vision from your scene or headset

Together, they give you a Jarvis-style guide that can see what’s in front of you and talk you through it in real time.

From here, you can:

  • Swap in different characters (safety coach, sales expert, lab instructor)
  • Add more knowledge for deeper domains
  • Move from desktop to XR, or into more complex multi-scene projects