I had a feeling something was off with my memory system. I ignored it for weeks.
Andy - my AI assistant, the one I’ve been building in public for months - had been extracting memories from every conversation we have. Over sixteen thousand of them. Corrections I’ve made, preferences I’ve expressed, facts about people, decisions about projects.
That sounds like a system that’s working. More knowledge, better performance, right?
But every few weeks, something would slip. Andy would ask about something I’d already explained. Or surface the same fact twice in one conversation, wasting a slot that could’ve gone to something useful. Small stuff. The kind of thing you wave off because the system is mostly working.
Then the small stuff started compounding.
The moments that made me finally dig in were all variations of the same frustration: I already told you this.
How this actually works
Most people think AI memory works like human memory. You tell it something, it remembers, it recalls when relevant.
That’s wrong.
It’s more like a library with a terrible search engine.
Storage is easy. You can dump anything into a database and call it a “memory.” The hard part is retrieval - getting the right piece of knowledge back out at the right moment. Every time I send a message, the system converts it into a mathematical representation and searches the memory database for similar matches. Top results get injected as context.
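Here's a minimal sketch of that loop in Python, just to make it concrete. Everything in it is illustrative: `embed()` stands in for whatever model turns text into vectors, and the two-slot cap and 0.60 floor echo numbers that show up later in this post.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # How close two embedding vectors point; 1.0 means identical direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def recall(message: str, memories: list[dict], embed, k: int = 2,
           threshold: float = 0.60) -> list[dict]:
    # Convert the incoming message to a vector, score every stored
    # memory against it, and inject the top-k matches above the floor.
    query = embed(message)
    scored = sorted(
        ((cosine_similarity(query, m["embedding"]), m) for m in memories),
        key=lambda pair: pair[0], reverse=True,
    )
    return [{**m, "score": s} for s, m in scored[:k] if s >= threshold]
```

Every knob in that function - the slot count, the threshold, the ranking - is a place the system can quietly fail.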
Two separate problems hide inside what feels like one system: getting the right things in, and getting the right things out.
I’d been optimizing the first problem for months. The second one was rotting.
What sixteen thousand memories look like under the hood
I ran a full audit. Not sampling. Every memory, every recall event, every metric I could pull.
The numbers were bad.
Nearly three out of four memories older than 30 days had never been recalled. Not once. They existed in the database doing nothing.
Recall rate collapsed over three months. A 13x decline. More memories meant more noise in the search results, which pushed the useful ones further down the rankings.
I looked at the similarity scores on recalled memories. Over a fifth of them were landing in the 0.60-0.69 range. The noise zone. Scores high enough that the system thought they were relevant. Low enough that they almost never were.
Almost a thousand dead memories were being created every week. The extraction pipeline was humming along, dutifully pulling knowledge from every session, and barely any of it was ever getting used.
I had a whole category called “session state” that stored things like “user is working on the newsletter” or “currently debugging the email system.” Hundreds of memories. Zero recalls. Ever. Facts that were only true for the duration of a single session, sitting in the database forever, competing for recall slots against memories that actually mattered.
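If you want to run a similar audit, the queries are not exotic. Here's a sketch against a hypothetical SQLite schema with `memories` and `recall_events` tables - my actual schema differs, but the shape of the questions is the point:

```python
import sqlite3

conn = sqlite3.connect("memory.db")  # hypothetical database file

# Memories older than 30 days that have never been recalled. Not once.
never_recalled = conn.execute("""
    SELECT COUNT(*) FROM memories m
    WHERE m.created_at < datetime('now', '-30 days')
      AND NOT EXISTS (SELECT 1 FROM recall_events r WHERE r.memory_id = m.id)
""").fetchone()[0]

# Share of recall events landing in the 0.60-0.69 noise zone.
noise_share = conn.execute("""
    SELECT AVG(similarity BETWEEN 0.60 AND 0.69) FROM recall_events
""").fetchone()[0] or 0.0

print(f"never recalled: {never_recalled}, noise-zone share: {noise_share:.0%}")
```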
The system was accumulating noise.
Four ways it breaks
I went back through my transcripts and searched for the moments where I’d corrected Andy. “Already know.” “Told you.” “We talked about.” Every one of those corrections is a data point about a retrieval failure.
They fall into patterns.
“I told you this.” The memory exists. The retrieval system didn’t find it. The way the memory was stored doesn’t match the way the conversation is phrased when it becomes relevant. The search engine can’t connect the two.
“You keep asking the same thing.” Duplicates competing for limited recall slots. If the system stored “Alex prefers dark mode” three times, all three versions might surface in the same conversation. Three slots to say the same thing once.
“You forgot mid-conversation.” This is an attention problem. The AI lost track of something mentioned earlier in the current session. No amount of memory engineering fixes this.
“You mixed up the details.” Stale memories contaminating current context. A decision I reversed three months ago still showing up as current because the old memory outranks the correction.
Most of the frustration moments were the system having the knowledge but failing to surface it. What it surfaced instead was related? Kind of. Relevant? No.
Your own corrections are the most valuable diagnostic data in the system. They’re sitting right there in your conversation history.
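Mining them takes a few lines. This sketch assumes your transcripts are exported as plain-text files in a `transcripts/` folder - adjust the markers to however you actually phrase corrections:

```python
from pathlib import Path

# Lowercased phrases that usually mark a retrieval failure.
MARKERS = ["already told you", "already know", "we talked about",
           "that's not what i meant"]

for path in Path("transcripts").glob("*.txt"):  # hypothetical export location
    for n, line in enumerate(path.read_text().splitlines(), start=1):
        if any(marker in line.lower() for marker in MARKERS):
            print(f"{path.name}:{n}: {line.strip()}")
```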
The Patterson problem
This one crystallized the whole issue.
Patterson is my wife. Andy knows this. It’s stored as a memory with the highest possible confidence.
But when I was having a conversation that mentioned Patterson, the retrieval system scored that memory too low to surface. It didn’t clear the threshold.
Stored perfectly. Couldn’t find it.
I spent days trying to understand embeddings and vector similarity. Felt like I needed to become an ML expert to fix my own system. What actually worked: I asked Claude Code to query the recall history and show me every memory surfaced in past sessions with its similarity score. The prompt that unlocked it was simple. “Show me a way to compare before and after so I can decide.”
The root cause: the system for looking up facts about people was using the same ranking algorithm as general topical search. That works fine for things like “what approach did we use for the newsletter migration?” It works terribly for short factual statements about people.
“Patterson is Alex’s wife” doesn’t match well against a conversational message like “Can you check if Patterson’s flight lands before dinner?” The mathematical representations are too different. Semantic similarity is the wrong strategy for factual recall about people.
The fix was changing the ranking strategy for person lookups. Stop trying to match meaning against meaning. When someone’s name comes up, surface the highest-confidence, most recent facts about them.
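Here's a sketch of the shape of that change - the field names and the `k` are stand-ins for my actual setup, but the point survives: a name match bypasses similarity scoring entirely.

```python
def recall_person_facts(message: str, facts: list[dict],
                        known_names: set[str], k: int = 2) -> list[dict]:
    # If a known person's name appears in the message, return their
    # highest-confidence, most recent facts. No embeddings involved.
    mentioned = {n for n in known_names if n.lower() in message.lower()}
    if not mentioned:
        return []  # fall back to semantic search for everything else
    candidates = [f for f in facts if f["person"] in mentioned]
    candidates.sort(key=lambda f: (f["confidence"], f["updated_at"]),
                    reverse=True)
    return candidates[:k]
```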
After the change: same memory, top result, every time. The fix was boring. The diagnosis was the hard part.
Benchmark before you build
My instinct was to start tuning things immediately. Lower the similarity threshold, increase the number of recall slots, adjust the weights.
I didn’t.
Andy and I ran five simulations before changing anything. This is where the discipline paid off.
One simulation tested what would happen if we lowered the similarity floor. It felt right - we’d catch more memories that were falling through the cracks. The data said the gain was marginal. We’d be pulling in more noise, not more signal.
Another simulation tested raising the number of recall slots. The system had been capping every session at two recalled memories, and most sessions were hitting that ceiling. When we simulated four slots, the quality of memories in the extra positions was strong. Real, relevant knowledge that was being left on the table.
The change that felt right would have been a waste.
The change that felt aggressive turned out to be the clear win.
I would not have predicted either outcome. That’s the whole point of measuring first.
We also found that roughly one in eight recall events was a duplicate - the same memory surfacing twice in one conversation, wasting a slot. Deduplication was a straightforward fix with an outsized impact.
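The dedup fix is as small as it sounds - track what has already surfaced this session and skip it. A sketch:

```python
def dedupe(recalls: list[dict], surfaced_ids: set) -> list[dict]:
    # Drop anything already surfaced this session so a duplicate
    # never burns one of the limited recall slots.
    fresh = [m for m in recalls if m["id"] not in surfaced_ids]
    surfaced_ids.update(m["id"] for m in fresh)
    return fresh
```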
The five benchmarks measured recall hit rate, similarity quality, slot saturation, ghost contamination, and duplicate rate. We ran them before touching anything, after every change, and again a week later. Same queries, same methodology. Apples to apples.
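The harness behind those numbers doesn't need to be sophisticated. A sketch, reusing the `recall()` function from earlier and assuming each recorded session carries the memory ids a human judged relevant - the field names here are invented for illustration:

```python
def run_benchmarks(sessions: list[dict], memories: list[dict], embed) -> dict:
    # Replay recorded sessions against the current configuration.
    # Each session: {"messages": [...], "relevant_ids": set_of_ids}.
    hits = saturated = dupes = ghosts = high_sim = total = messages = 0
    for s in sessions:
        seen: set = set()
        for msg in s["messages"]:
            messages += 1
            results = recall(msg, memories, embed, k=4)
            saturated += len(results) == 4  # every slot filled
            for m in results:
                total += 1
                dupes += m["id"] in seen  # same memory, same conversation
                seen.add(m["id"])
                ghosts += m.get("status") == "stale"
                high_sim += m["score"] >= 0.90
        hits += bool(seen & s["relevant_ids"])  # a relevant memory surfaced
    return {
        "recall_hit_rate": hits / len(sessions),
        "slot_saturation": saturated / max(messages, 1),
        "duplicate_rate": dupes / max(total, 1),
        "ghost_contamination": ghosts / max(total, 1),
        "high_similarity_share": high_sim / max(total, 1),
    }
```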
What changed
We ran the same five benchmarks a week later.
Recall hit rate went from 38% to 47%. That might sound modest, but it means when a relevant memory exists, it now surfaces almost half the time. Before, it was barely more than a third.
Duplicate contamination went from 12% to under 1%. One in eight sessions used to waste a recall slot on a memory the system had already surfaced. That’s gone.
The similarity quality shifted too. Before, only 9% of recalled memories scored above 0.90. After, that jumped to 19%. The system was finding better matches, not just more of them.
One thing got worse. We'd downweighted thousands of stale memories instead of removing them, and with more recall slots, some of those ghosts were sneaking back in. Ghost contamination more than doubled, from 4% to 9%. We caught it in the benchmark, traced the root cause, and hard-filtered them out. That's the value of measuring before and after - the benchmark caught a regression we wouldn't have noticed by feel.
The active memory pool went from over sixteen thousand to about twelve thousand. Nothing was deleted. Every change was a status flag or weight adjustment. Session-state memories got deactivated. Duplicates got consolidated. Stale memories got deprioritized.
The system got quieter. The noise filtered out, and what remained was the stuff that actually mattered.
Every change is reversible. One flag flip brings the old memories back. That was designed in from the start, because memory systems are experiments, and experiments need undo buttons.
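Reversibility here is mostly a schema decision. A sketch of the flag approach, against the same hypothetical SQLite schema as before:

```python
import sqlite3

conn = sqlite3.connect("memory.db")  # same hypothetical database as above

def set_status(memory_ids: list[int], status: str) -> None:
    # Flip memories between "active", "stale", and "deactivated".
    # Nothing is deleted, so every change has an undo.
    conn.executemany("UPDATE memories SET status = ? WHERE id = ?",
                     [(status, mid) for mid in memory_ids])
    conn.commit()

session_state_ids = [101, 102, 103]  # illustrative ids from the audit query
set_status(session_state_ids, "deactivated")  # retire them...
set_status(session_state_ids, "active")       # ...or put them back
```

The hard filter that fixed the ghost regression is then a one-line change in the recall path: skip anything whose status isn't active.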
What I’d tell someone building this
I’m not a programmer. I want to get that out of the way because it matters here. Everything I built, I built with Claude Code doing the actual coding while I made the decisions. The memory system is personal software in the most extreme sense - grafted onto my world, tuned to my patterns.
But the failure modes are universal. And what I learned debugging them applies whether you’re running a vector database or editing a text file by hand.
More memories is not better memories. I spent months optimizing extraction - pulling more knowledge from every session. The system’s actual bottleneck was retrieval quality, and the volume was making it worse. Curation matters more than extraction.
Benchmark before you tune. Simulations saved me from a change that felt right but wasn’t, and confirmed a change I was hesitant about. Your intuition about what’s wrong with a complex system is probably wrong. Measure.
Different kinds of knowledge need different retrieval strategies. Semantic similarity is a good default for topical recall. It’s the wrong tool for factual recall about people. If you’re using one algorithm for everything, some category of knowledge is silently failing.
Your corrections are diagnostic gold. Every time you correct your AI - “I already told you this,” “that’s not what I meant” - that’s a signal about a retrieval failure. Search your conversation history for those phrases. The pattern will tell you what’s broken.
Design for reversibility. We reduced the active pool by a quarter without deleting a single record. Status flags, not deletions. Weight adjustments, not removals. Memory systems need the ability to say “that was wrong, put it back.”
The uncomfortable truth about AI memory: the hard problem is the invisible gap between what the system knows and what it can find when it matters.
Even if your “memory system” is a CLAUDE.md file you update by hand, the same questions apply. What percentage of what you’ve stored ever gets used? When you mention someone by name, does the right context actually surface? Are you creating more noise than signal every time you add to it?
Good learning design is 10% info, 90% designing for retrieval. That applies to AI memory systems the same way it applies to everything else I’ve built for the last twenty years.
The open loop that started this whole audit was a feeling. Something’s off, but I can’t articulate it. I could have kept ignoring it. The system was mostly working. Mostly working is the most dangerous state a system can be in, because it never forces you to look.
If you can’t answer those three questions, you don’t know whether your system is working or just accumulating.