Long Term Memory - A Technical Overview

By
Jon Durbin
April 28, 2024

Mimir

In this article, we'll be discussing Mimir - the new hierarchical memory backend developed for use in Convai's innovative AI NPC platform.

Memory is fundamental in creating engaging and evolving conversational AI systems (and humans, naturally).  Currently, LLMs have no built-in mechanism to continuously update their weights to capture new information and grow/learn with the conversation, meaning all pertinent information must be available either in the model's weights or within the provided context (prompt).

When it comes to "memory" as it relates to LLMs, we have three main options:

  1. Extend the context size to stuff more data into the chat.
  2. Continuously fine-tune the model with new information.
  3. Create a hierarchical memory storage backend from which relevant information can be retrieved, akin to retrieval augmented generation.

Options for adding memory

❌ Option 1. Extended context window

In theory, simply including more information/chat history in the prompt is the simplest approach; just continue increasing the prompt size to include anything and everything.  In practice, however, there are many pitfalls, including, but not limited to:

Lost context

Important information can be lost, depending on where it resides in the prompt.

There are many tricks to increase the model's context length, including YaRN, RoPE scaling, Self-Extend, etc.

The standard benchmarks used to test the extended context length of these models are perplexity (on, say, Wikipedia data) and "needle-in-a-haystack" searches, in which random UUIDs or other "needle" information is placed at random positions within the context and the model is asked to retrieve the value.

It is unclear how well these benchmarks translate to more practical use cases, such as an LLM's ability to remember context from previous conversation messages, RAG, etc.

Time/space complexity

The time and space complexity of standard attention is quadratic in the length of the input.

Sure, there are impressive improvements to inference performance like FlashAttention, RadixAttention (part of SGLang), etc., and new architectures like RWKV and Mamba, but nonetheless, computations on very large inputs are quite expensive.

Cost

If you plan to use a hosted closed-source LLM option such as OpenAI, Anthropic, Mistral, etc., or even a hosted provider of open-weight LLMs, you are generally charged by the token (both input and output).  Even if you self-host, input prompt processing is computationally expensive, which still translates into cost.

Suppose you have a chat history that is 50k tokens and you want to use OpenAI's gpt-4-turbo. At today's (2024-03-14) price of $10.00 / 1M input tokens and $30.00 / 1M output tokens, each interaction, assuming an output length of 100 tokens, will cost (50,000 tokens * $10 / million tokens) + (100 tokens * $30 / million tokens) = $0.503.  Ouch.
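For reference, here's a minimal sketch of that arithmetic in Python (the prices and token counts are just the example values above, not live pricing):

```python
# Rough per-interaction cost for a long prompt, using the example prices above.
INPUT_PRICE_PER_M = 10.00   # USD per 1M input tokens (example value)
OUTPUT_PRICE_PER_M = 30.00  # USD per 1M output tokens (example value)

def interaction_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a single request/response pair."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

print(interaction_cost(50_000, 100))  # 0.503
```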

Even so, at some point you will likely reach the upper limit of the model's context size and this will cease being an option.

❌ Option 2. Continuously fine-tune the model with new information

This is a possible solution to the problem; however, it's impractical depending on the number of unique interactions. Each unique character <-> human pair would need to be fine-tuned separately in order to maintain privacy, so the total number of models (LoRAs or otherwise) would grow combinatorially with the number of characters and users.

The message accumulation rate, depending on the interaction cadence, may exceed the rate at which fine-tuning can take place, and would require an infrastructure that allows swapping LoRA adapters/models on the fly.  Now imagine scaling this up for a plethora of characters, clients, etc., in a multi-tenant environment.

This approach is simply not practical at scale.

✅ Option 3. Hierarchical memory storage + retrieval

Humans have 4 types of memory:

  1. Sensory (e.g. what you just heard) - fades away within seconds.
  2. Short term memory - a few minutes generally, with very high recall/precision/granularity. For example, in conversation, you might vividly remember the last few sentences discussed in fairly high detail.
  3. Long-term memory - the capacity to store and retrieve information over extended periods, ranging from days to decades, with (generally speaking) much less granularity and detail than sensory and short-term memory.
  4. Working memory - roughly speaking, this is your short-term memory combining strategies and knowledge from your long-term memory to assist in making a decision or calculation.

For conversational AI systems, we can roughly approximate human memory:

  1. Scene awareness - providing context about the scene is fairly similar to sensory input, albeit it's treated as text. We can use vision models, metadata within the game engine, etc. to provide details about items, actions, and events that the NPC can make use of.
  2. Short term memory - if we keep the last N turns of the conversation in the context window as-is/unchanged, we have the verbatim chat history encompassing the past few minutes (or longer, depending on size of N and rate of conversation).
  3. Medium term memory - once we exceed our specified N (short-term memory/message retention), we can summarize a window of conversation into a memory, capturing information about the topics discussed, personal details shared, emotions expressed, etc.
  4. Long term memory - when a medium-term memory is created, we first check for highly similar existing memories, and if the similarity is within a defined threshold range, we can consolidate the memories into a single long-term memory, improving performance by enhancing recall and resolving conflicting information.
  5. Working memory - the entire, combined prompt the NPC uses, including the backstory, information from the knowledge bank, and the various layers of "memory" defined previously.
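To make the tiers concrete, here is a minimal sketch of how these memory layers might be represented; the field names are illustrative assumptions, not Mimir's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Memory:
    """A medium- or long-term memory record (illustrative fields only)."""
    summary: str            # LLM-generated summary of a conversation window
    importance: int         # 1-10, assigned by the LLM (see "Importance scaling" below)
    created_at: datetime
    embedding: list[float]  # vector used by the semantic half of hybrid search

@dataclass
class WorkingMemory:
    """Everything that gets composed into the final prompt."""
    backstory: str
    scene_context: str       # "sensory" input from the game engine / vision models
    short_term: list[str]    # last N chat turns, verbatim
    retrieved: list[Memory]  # medium/long-term memories returned by hybrid search
```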

So how would it work in practice?

When the user provides the next input, we can use a hybrid search mechanism to retrieve the most relevant medium- and long-term memories, along with the most recent N messages, and compose a prompt that is more concise (cheaper, faster) while still preserving the history to a large extent.
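Continuing the sketch above, prompt composition might look something like the following; the ordering and formatting are assumptions, not Mimir's actual template:

```python
def compose_prompt(wm: WorkingMemory, user_input: str) -> str:
    """Assemble the working-memory prompt from the layers defined earlier."""
    memories = "\n".join(f"- {m.summary}" for m in wm.retrieved)
    recent = "\n".join(wm.short_term)
    return (
        f"{wm.backstory}\n\n"
        f"Scene: {wm.scene_context}\n\n"
        f"Relevant memories:\n{memories}\n\n"
        f"Recent conversation:\n{recent}\n\n"
        f"User: {user_input}"
    )
```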

See also related open source works including https://github.com/cpacker/MemGPT and https://github.com/Victorwz/LongMem

How do we create a "memory"?

There are many ways you could go about creating memories.  The simplest approach would be to store the entire conversation window as-is.  We've instead taken the approach of using an LLM to generate specific memory attributes so we can condense the information and have more discrete values for ranking/scoring.

We extract many pieces of information in this way, but the two most critical bits are summaries of the conversation segment and importance, which we use for custom scoring/ranking.

P.S.: If using OpenAI APIs, you can force JSON output by adding {"response_format": {"type": "json_object"}} to the request body, or if you are using local inference, there are several tools that can help ensure properly formatted JSON output, such as https://github.com/guidance-ai/guidance, grammars within llama.cpp, outlines, SGLang, etc.
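As an example, a minimal sketch of creating a memory via the OpenAI Chat Completions API might look like the following; the summary/importance schema mirrors the attributes described above, but the exact prompt and model choice are illustrative, not Mimir's:

```python
import json
from openai import OpenAI  # v1+ OpenAI Python SDK

client = OpenAI()

conversation_window = "User: ...\nNPC: ..."  # the chat segment being summarized

response = client.chat.completions.create(
    model="gpt-4-turbo",
    response_format={"type": "json_object"},  # forces syntactically valid JSON
    messages=[
        {
            "role": "system",
            "content": (
                "Summarize the conversation segment as JSON with keys "
                "'summary' (string) and 'importance' (integer, 1-10)."
            ),
        },
        {"role": "user", "content": conversation_window},
    ],
)

memory = json.loads(response.choices[0].message.content)
```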

Storage/retrieval

Requirements for a high-quality storage/retrieval backend

Semantic search is often used for retrieval when interacting with LLMs, e.g. for RAG.  While semantic search generally performs well, we can actually build a more robust system by using a hybrid search with custom scoring/ranking.

This is by no means a comprehensive study of performance, but our benchmarking does indicate that a hybrid search - that is, combining semantic search with keyword search (BM25) - can boost retrieval accuracy quite substantially.

By using a hybrid search mechanism, we improve the retrieval stage quite a bit, but it still lacks two key components of human memory:

  • Ranking by emotional impact/importance.
    • Suppose you recently lost a loved one, and someone asks if you'd like to go out for yogurt. Your experience of this loss may not relate in any way to yogurt, so it would not likely appear in search results (keyword, semantic, or hybrid), but it could certainly change your response to the question.
  • Recency bias.
    • All things being equal, we're generally more likely to make use of recent memories. We don't want to simply sort by date on the results, however, because we want to consider relevance to search terms, emotional importance, etc. Ideally, we want to slightly punish scores for older results.

When performing keyword search as part of the hybrid search mechanism, three text transformations are critical to achieving excellent retrieval accuracy (or at the very least, recall): tokenization, normalization, and stemming.

Suppose we have a sentence: "The house is Red. I found it driving to dallas." (ignoring the oddity of the sentence for a moment).

  • tokenization - breaking the sentence into individual tokens; with whitespace tokenization, for example, the tokens would be ["The", "house", "is", "Red.", "I", "found", "it", "driving", "to", "dallas."]. Whitespace tokenization as shown here can be problematic, however, because "dallas." and "Red." become tokens that include the trailing "."
  • normalization - simple transformations applied to the tokens, for example converting each token to lowercase (without which, "Dallas" vs "dallas" becomes quite a problem)
  • stemming - applying word-stemming algorithms to the tokens, e.g. the Porter stemmer, in which case "driving" becomes "drive"

If I were to then search "What color was the House you saw on the drive to Dallas?", without proper tokenization, normalization, and stemming, we could have serious problems trying to find the correct document.
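A minimal sketch of this pipeline, using a simple regex tokenizer, lowercasing, and NLTK's Porter stemmer (the exact analyzer Mimir uses may differ):

```python
import re
from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()

def analyze(text: str) -> list[str]:
    # Tokenize on alphanumeric runs (drops the trailing "." from "dallas." / "Red."),
    # lowercase each token, then apply Porter stemming.
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [stemmer.stem(token.lower()) for token in tokens]

doc = analyze("The house is Red. I found it driving to dallas.")
query = analyze("What color was the House you saw on the drive to Dallas?")
print(set(doc) & set(query))  # overlap now includes 'hous', 'drive', and 'dalla'
```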

It's also important to consider other languages, because it fundamentally changes how some of these data transforms work (particularly word stemming).

Custom scoring to avoid re-ranking

Re-ranking is one of the best ways to further enhance the accuracy of RAG systems, however it adds a fair amount of latency to the process (or cost, if using a fast third-party API).  One of the most critical components of high-quality interactions with NPCs is low latency, so we've created a custom scoring mechanism applied to the initial search results to reduce the need for this extra step.

Hybrid search

As discussed above, we use a hybrid search approach to get the best possible initial results from the corpus of memories.  In terms of scoring, this means we need to normalize the scores for each part (semantic, keyword) into a single score.

The semantic search portion inherently produces a normalized score in the range 0 to 1; bm25/keyword search, however, has no such bounded range.  The first step is to normalize the bm25 score into a bounded range so it can be properly weighted and averaged with the semantic score.  There are several ways to accomplish this, but the easiest is min-max normalization: (original score - min(scores)) / (max(scores) - min(scores)).

Once we have a normalized bm25 score, we can assign a weight to each of the semantic and bm25 scores and produce a weighted mean, e.g. 0.5 * semantic_score + 0.5 * bm25_score.  The weights needed to produce optimal retrieval will vary based on the embedding model, language, diversity of text, etc., but 0.5/0.5 is a decent starting point.
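A minimal sketch of the normalization and weighting steps (how to handle the degenerate all-equal-scores case is an assumption):

```python
def min_max_normalize(scores: list[float]) -> list[float]:
    """Scale raw BM25 scores into [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]  # assumption: identical scores are equally relevant
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_score(semantic: float, bm25_normalized: float,
                 semantic_weight: float = 0.5) -> float:
    """Weighted mean of the two retrieval signals (0.5/0.5 as a starting point)."""
    return semantic_weight * semantic + (1.0 - semantic_weight) * bm25_normalized
```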

Recency bias

We can apply a recency bias by penalizing document scores based on their age.  In our case, we use a Gaussian distribution over the range [first message date, current date], applying a penalty based on the value of that Gaussian curve with a scaling factor.

To try to simplify this:

  • suppose we have a max "age penalty" of 0.3
  • it's been 10 days since the very first message.
  • the incoming user instruction triggers retrieval of 2 very similar memories, one created at the same time as the first message, one created just prior to the current message.
  • the older memory's score will receive the maximum penalty of 0.3 (i.e., a 30% reduction in score), and the newer memory will receive no/minimal penalty.
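One way to realize such a penalty is sketched below; the curve width and the multiplicative application are assumptions, not necessarily Mimir's exact parameters:

```python
import math
from datetime import datetime

def recency_penalty(created_at: datetime, first_message: datetime,
                    now: datetime, max_penalty: float = 0.3) -> float:
    """Gaussian-shaped penalty: ~max_penalty for memories as old as the first
    message, approaching zero for memories created just now."""
    span = (now - first_message).total_seconds() or 1.0
    age = (now - created_at).total_seconds() / span  # 1.0 = oldest, 0.0 = newest
    return max_penalty * math.exp(-((age - 1.0) ** 2) / (2 * 0.4 ** 2))  # width 0.4 assumed

# Applied multiplicatively: penalized = hybrid * (1 - recency_penalty(...)),
# so the oldest memory loses ~30% of its score and the newest loses almost nothing.
```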

Importance scaling

When creating these memories, we use an LLM to assign an "importance" factor to the memory, which is an integer in the range 1-10.  We can then use this numeric value to slightly boost the most important memories by calculating $\log_{10}(importance)$.

For example, if the importance is considered a one (chit chat greetings, chat about the weather without any information of significance), our "boost" will be $\log_{10}(1) = 0$

On the other hand, if the importance is considered high (you got married, a loved one passed, etc.), the boost will be $\log_{10}(10) = 1$

Now the more impactful memories will experience a slight boost in scoring.
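In code, the boost itself is simple; how it is weighted into the final score is an assumption on our part:

```python
import math

def importance_boost(importance: int) -> float:
    """Importance 1 (idle chit-chat) -> 0.0; importance 10 (major life event) -> 1.0."""
    return math.log10(importance)

# One possible composition (weights are assumptions, not Mimir's actual values):
#   final = hybrid_score * (1 - recency_penalty) + 0.1 * importance_boost(importance)
```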

How do we query?

Since we are utilizing a hybrid search, we need both plain text and embeddings when performing a search against the memory storage backend. Initially, we used only the most recent user message to discover the most appropriate memories, but that approach is highly error-prone.

Suppose, for example, we have the following conversation:

User: I like apples.
NPC: Oh, me too, especially the ones from Wiard's orchard!
User: Where are they located?

If we were to just use the user text, "Where are they located?", our memory search results would be abysmal, because there's no specific person or subject matter referenced at all, just an ambiguous concept of location.

Instead, we take into consideration the portion of recent chat history that is most likely to contain information leading to more relevant search results.  For example, in this scenario, we'd include the entire chat history when performing the search, so it is sure to include apples, the specific orchard (Wiard's), and the idea that the user is looking for a location.
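A minimal sketch of this query construction; the number of turns to include is an assumption, and Mimir's actual heuristic may differ:

```python
def build_query(chat_history: list[str], max_turns: int = 6) -> str:
    """Use the recent turns, not just the last user message, as the search query."""
    return " ".join(chat_history[-max_turns:])

query = build_query([
    "User: I like apples.",
    "NPC: Oh, me too, especially the ones from Wiard's orchard!",
    "User: Where are they located?",
])
# The query now contains "apples", "Wiard's", and "located", so hybrid search
# has something concrete to match against the stored memories.
```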

Conclusion

Our new, innovative conversational memory backend promises to enhance the conversational quality and engagement provided by Convai NPCs. While there remain several potential areas for improvement, the system has resulted in immediate jumps in quality and coherency in our internal testing, and we hope you find the feature interesting and useful!