
Three Mistakes From Building Virtual Humans

Julius Stener
January 13, 2026

Summary

Designing a virtual human comes with plenty of challenges. We've been building "virtual humans" for almost two years now, and we've made a lot of mistakes along the way. Now that we're scaling, it's time to tally up the biggest things we got wrong so far.

Mistake 1

We underestimated the power scaling of intelligence.

Moore's law says that transistor density doubles roughly every two years. The new Moore's law says that intelligence doubles every six months. When I tell people that, almost everyone understands the words but can't grasp the implications, just like we couldn't.

Because we underestimated it, we wasted most of the first year building scaffolding around LLMs so they couldn't break our Agent framework. We spent an entire week building an abstraction around the Gmail API so the Agent didn't accidentally delete half of a user's inbox (again), and we bored ourselves with benchmarking chunking techniques to ensure the Agent had the most optimized memory mechanism.
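To make "scaffolding" concrete, here is a minimal sketch of the kind of guard we kept writing. The class name, the delete limit, and the client interface are all invented for illustration; the real Gmail wrapper was considerably more involved.

```python
# A hypothetical sketch of the guardrail scaffolding described above.
# GuardedInbox and the delete limit are invented for illustration.

class GuardedInbox:
    """Wraps a mail client so the agent cannot mass-delete messages."""

    MAX_DELETES_PER_CALL = 5  # arbitrary safety threshold for illustration

    def __init__(self, client):
        self.client = client  # any object exposing a delete(message_id) method

    def delete_messages(self, message_ids):
        if len(message_ids) > self.MAX_DELETES_PER_CALL:
            # Refuse the bulk action instead of trusting the model's judgment.
            raise PermissionError(
                f"Refusing to delete {len(message_ids)} messages at once."
            )
        for message_id in message_ids:
            self.client.delete(message_id)
```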

Today, I spend most of my (admittedly now limited) engineering time tearing down that scaffolding to unlock the power of the latest intelligence. We can now entrust an LLM with far more autonomy within the Agent framework rather than building to guard against that autonomy.

In a year, LLMs will be four times what they are today. In two years, they'll be 16x (quick math). As far as we know, no other resource on earth is improving in quality at anywhere near this rate, and this one happens to be the most precious resource known to us: intelligence.
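For the curious, the "quick math" is just compounding under the doubling-every-six-months assumption:

```python
# Back-of-the-envelope compounding behind "4x in a year, 16x in two",
# assuming capability doubles every six months.
DOUBLING_PERIOD_YEARS = 0.5

def capability_multiple(years: float) -> float:
    return 2 ** (years / DOUBLING_PERIOD_YEARS)

print(capability_multiple(1))  # 4.0
print(capability_multiple(2))  # 16.0
```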

Mistake 2

Memory is an abstract concept, but we treated it like an engineering domain.

Human brains are incredibly good at selective recall. That is, I say "dog" and you instantly remember both the slobbering doberman you met last weekend and your childhood mutt, "Bailey." And if you think more about one of the dogs, I'd bet you remember the way she liked to have her ears scratched or how to get her to roll over for you.

In engineering terms, what your brain just did there is actually incredibly difficult to replicate in code. You have an initial search step, and then you have to search-filter again by one of the products of that search. On top of that, there are almost certainly conflicting memories (like all the unsuccessful rollovers) that you have to decide between based on recency and relevancy.
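As an illustration (not our production code), here is what that two-step, conflict-resolving recall looks like when written down naively. Keyword overlap stands in for real semantic similarity, and every name here is hypothetical.

```python
# A hypothetical sketch of the two-step recall described above: search once
# ("dog" -> candidate memories), then search again seeded by one result of
# the first step ("Bailey" -> the rollover trick), ranking conflicting
# memories by a blend of relevance and recency.
from datetime import datetime, timezone


def relevance(memory: dict, query: str) -> float:
    """Crude stand-in for semantic similarity: share of query words present."""
    words = query.lower().split()
    return sum(word in memory["text"].lower() for word in words) / len(words)


def recency(memory: dict, now: datetime) -> float:
    """Newer memories score closer to 1.0. 'when' is a timezone-aware datetime."""
    age_days = (now - memory["when"]).days
    return 1.0 / (1.0 + age_days)


def recall(memories: list[dict], query: str, top_k: int = 2) -> list[dict]:
    now = datetime.now(timezone.utc)
    ranked = sorted(
        memories,
        key=lambda m: 0.7 * relevance(m, query) + 0.3 * recency(m, now),
        reverse=True,
    )
    return ranked[:top_k]


# The first pass finds the dogs; the second pass, seeded with "Bailey", is the
# part that kept failing for us until we added graph traversal (see below).
```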

Like most builders in the AI wave, we began by throwing a vector database at the problem, and that solved the first search step: "dog" -> "slobbering doberman" + "Bailey." But no matter what we did, we could not reliably get from "Bailey" to the rollover trick. We tried BM25 ranking, different object stores, and different chunking algorithms.

None of it worked, because human memory doesn't work the way we imagine it does. It isn't a simple mapping from input to recalled output. Instead, human memory is semantic search coupled with filtered graph traversal, all at once, backwards and forwards, and so that's what we built.

I won't give away too much secret sauce here, but at a high level, we had to combine semantic search and graph traversal with a very meticulous ingestion engine.
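Since the ingestion engine itself stays off the page, here is only a generic sketch of what "semantic search coupled with filtered graph traversal" can look like. The graph layout, the similarity scores, and the hop limit are all placeholders, not our actual pipeline.

```python
# A generic sketch of semantic search plus graph traversal. A vector search is
# assumed to have already produced similarity scores; traversal then walks
# outward from the best-matching entry node.
from collections import deque


def retrieve(graph: dict, similarity: dict, max_hops: int = 2) -> list[str]:
    """graph maps node -> list of (edge_label, neighbor);
    similarity maps node -> score against the query (vector-search stand-in)."""
    entry = max(similarity, key=similarity.get)  # step 1: semantic search
    seen, results = {entry}, []
    queue = deque([(entry, 0)])
    while queue:  # step 2: breadth-first traversal of connected memories
        node, hops = queue.popleft()
        results.append(node)
        if hops == max_hops:
            continue
        for _edge, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return results


# "dog" scores "Bailey" highest, and traversal surfaces the rollover trick:
graph = {"Bailey": [("knows_trick", "roll over"), ("likes", "ear scratches")]}
print(retrieve(graph, {"Bailey": 0.91, "doberman": 0.88}))
# -> ['Bailey', 'roll over', 'ear scratches']
```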

Mistake 3

Encoding rules as codified workflows doesn't work because workflows don't understand intent.

It's really, really enticing to try to build Zapier-style Zaps to run a virtual human's workflows. Think about it: workflows promise

  • Consistency of outcome
  • Durability of process
  • Observability of failures
  • Low cost

Getting all of that would be incredible. And in theory it shouldn't be that hard: create a few blocks like "if statement," "for loop," and "LLM decider," then use an Agent to translate user requests into Zaps and you're a trillionaire.
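In code, the naive version of that plan is almost embarrassingly small. This is a hypothetical sketch of the block-and-Zap shape we had in mind, not a real engine.

```python
# A hypothetical sketch of the "few blocks" idea: every primitive, including
# the "LLM decider", is just a named step in a linear Zap.
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Block:
    name: str
    run: Callable[[dict], Any]  # each block reads the shared context


@dataclass
class Zap:
    blocks: list[Block] = field(default_factory=list)

    def execute(self, context: dict) -> dict:
        for block in self.blocks:
            # Each block's output is stored under its name for later blocks.
            context[block.name] = block.run(context)
        return context


# An "if statement", a "for loop", and an "LLM decider" are all just Blocks,
# e.g. Block("decider", lambda ctx: llm("Should we escalate? " + str(ctx))),
# where llm() is whatever model call the agent framework already has.
```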

This system breaks the second you take it out of the testing environment, and we know that because we did exactly that. Consider a daily brief, made just for you and sent to you every morning. It seems simple, but the codified workflow ends up so complicated that I can't detail it all here, so I'll give one example instead.

I requested that my secretary at the time "include better descriptions of what the companies do and who the founders are for all Term Sheet emails, filtered for companies that are like mine." This created a fork of the workflow that included an "LLM extractor" node, a "for loop" node, a "web search" node, an "LLM summarizer" node, an "LLM decider" node, and then an "aggregator" node. Ok, that looks ugly but might work.

The next day, when it couldn't find the company, a pre-seed startup lost in the keyword noise, it sent me a company summary about a farm in Montana. It did this because the codified workflow doesn't understand the intent behind the task, and therefore can't check for errors midway through.
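To make the missing piece concrete, here is a hypothetical sketch of the mid-run check a codified workflow has no natural place for: asking whether the draft still matches the original request before it ships. The ask() callable is a stand-in for whatever LLM call the system already makes.

```python
# A hypothetical intent check; ask() stands in for an existing LLM call.
from typing import Callable


def matches_intent(ask: Callable[[str], str], request: str, draft: str) -> bool:
    """Ask the model whether the draft still satisfies the original request."""
    verdict = ask(
        "Does this draft satisfy the request? Answer YES or NO.\n"
        f"Request: {request}\nDraft: {draft}"
    )
    return verdict.strip().upper().startswith("YES")


# A check like this would have flagged the "farm in Montana" summary as a
# mismatch with "companies that are like mine" before it ever reached my inbox.
```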

To fix this, we literally threw the whole thing away and learned that intent awareness is just as important as the task itself for successful execution.

So what now?

So now we build differently. We treat the model as the system, not a component within it. When something breaks, our first question isn't "what guardrail do we add?" It's "what's getting in the way?"

We're still getting things wrong. Ask us again in two years.