Compile-time vs runtime memory: what your agent knows before the first message

There are two ways to give an agent understanding of a user. They differ in when the understanding is built and what it costs at query time.

Query-time memory stores facts from conversations and reasons over them when a query arrives. The agent starts empty. As the user talks, the system extracts preferences and patterns from messages and stores them as memories. When the agent needs context, it retrieves stored memories and runs LLM inference to rank, reason over, or synthesize them into a response. Tools like Mem0, Honcho, Letta, and Supermemory work this way. Every query involves at least one LLM call — for re-ranking, reasoning, or agentic search.
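The shape of that loop can be sketched in a few lines. Everything here is illustrative — `MemoryStore`, `llm_rerank`, and the toy bag-of-letters embedding are stand-ins, not any vendor's actual API:

```python
from math import sqrt

def embed(text):
    # Toy bag-of-letters embedding; real systems use a trained model.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - 97] += 1.0
    norm = sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

def llm_rerank(query, candidates):
    # Stand-in for the per-query LLM re-ranking call -- the step
    # that keeps precision high and the cost floor above zero.
    return candidates

class MemoryStore:
    def __init__(self):
        self.memories = []  # (text, embedding) pairs; starts empty

    def extract_and_store(self, message):
        # Real systems run an LLM extraction pass over the message here.
        self.memories.append((message, embed(message)))

    def retrieve(self, query, k=3):
        q = embed(query)
        scored = sorted(self.memories, key=lambda m: cosine(q, m[1]),
                        reverse=True)
        return llm_rerank(query, [text for text, _ in scored[:k]])
```

The point of the sketch is the last line of retrieve: however the store is organized, an LLM call sits on the query path.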

Compile-time semantics extracts understanding from content that already exists and pre-computes all the expensive relationships upfront. Before the first conversation, the engine reads a corpus, generates thematic questions (catalysts), and computes similarity between every catalyst and every document chunk. At query time, there’s no LLM call — just a database lookup of pre-computed results. Enzyme works this way.

The difference isn’t abstract. It shows up in query cost, latency, and what the system can understand on day one.

What query-time memory does well

These tools have matured fast. Mem0 benchmarks at 66.9% overall accuracy on LoCoMo, with a scoring formula that weights similarity, recency, and importance — and an LLM re-ranking call per retrieval to keep precision high. Honcho’s approach hits 90.4% on LongMemEval S using a “Dialectic Agent” that calls tools and applies formal logic at query time to extract latent information that retrieval misses. Letta introduced sleep-time compute — agents that consolidate memory during idle periods, converting raw context into learned context asynchronously. Supermemory claims 85.4% on LongMemEval with graph-based relationship mapping and parallel search agents that rewrite queries and rerank results per request.

These are real capabilities. For products where the primary interaction is conversational — support agents, AI companions, assistant apps — query-time memory is the right architecture. The user talks. The system stores facts. Each session gets richer.

The shared property: understanding accumulates from conversations, and every query runs LLM inference to retrieve it.

Where query-time memory breaks down

Three structural limitations show up when you move outside the chatbot pattern.

The cold start

A user imports 500 reading highlights into your product. Or uploads years of meeting transcripts. Or connects a collection of research notes. Query-time memory sees none of it — it builds understanding from conversation messages, not from existing documents. The memory layer starts empty. The user needs 10, 20, 50 conversations before the system builds a useful model of their thinking.

Most users won’t. They’ll try one query, get a generic response, and leave. The intelligence layer was supposed to be the value — but it wasn’t ready yet.

Cross-source patterns

Query-time memory stores what was said in conversations. It’s good at facts: “the user prefers Python,” “they mentioned a deadline on March 15.” It doesn’t see conceptual patterns that span a user’s highlights, their notes, their saved articles, and their transcripts — because those aren’t conversation messages. They’re accumulated content that exists before anyone talks to an agent.

A user who’s been saving articles about decentralized governance, highlighting passages about institutional trust, and writing journal entries about organizational autonomy has a clear intellectual thread. Query-time memory can’t see it until the user explicitly surfaces it in conversation. The thread is latent in the corpus, not in the chat history.

Per-query cost at scale

Here’s what each tool does when a query arrives:

  • Mem0: embed query → cosine similarity against stored memories → LLM re-ranking call → return. The re-ranking is an LLM inference per retrieval.
  • Honcho: query → Dialectic Agent (an LLM with tool access) searches representations → formal logic reasoning → return. ~200ms, involving LLM inference.
  • Supermemory: query → LLM rewrites query → parallel search agents (each an LLM) search graph + vector store → rerank → return. Multiple LLM calls per query.
  • Letta: query → agent decides which memory tier to search → tool calls to recall/archival → agent synthesizes → return. The agent itself is the query engine.

Every one of these involves at least one LLM inference at query time. That’s the cost floor — it doesn’t go to zero no matter how many queries you serve. For a consumer product at $5/month serving thousands of users, this arithmetic matters. Supermemory prices at $0.10 per 1,000 queries. Honcho charges $2 per million tokens ingested. These are real costs that scale with usage.

What compile-time semantics means in practice

Enzyme treats understanding as a compile step. Here’s what actually happens when you run enzyme init on a corpus of 1,000 documents:

1. Structure reading (~2s). The engine walks the corpus and extracts entities — tags, wikilinks, folders — with temporal metadata: when each entity first appeared, when it was last active, whether it’s accelerating or going dormant. An entity isn’t a keyword. It’s a handle that the user’s own organizing behavior created.
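A minimal sketch of the record this step might produce — the field names and the trend heuristic are assumptions for illustration, not Enzyme's actual schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Entity:
    name: str             # tag, wikilink, or folder name
    kind: str             # "tag" | "wikilink" | "folder"
    first_seen: date
    last_active: date
    mentions_recent: int  # mentions in the most recent era
    mentions_prior: int   # mentions in the era before that

    @property
    def trend(self):
        # Crude accelerating / dormant signal from the two eras.
        if self.mentions_recent > self.mentions_prior:
            return "accelerating"
        if self.mentions_recent == 0:
            return "dormant"
        return "steady"

caching = Entity("caching", "tag", date(2023, 1, 10), date(2024, 6, 2), 7, 3)
```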

2. Chunking and embedding (~5s). Documents are split into overlapping chunks and embedded into 384-dimensional vectors using a local ONNX model (23MB, int4 quantized, runs on CPU). No API call. No data leaving the machine.
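Step 2's splitting can be sketched as a sliding window. The window and overlap sizes here are illustrative — the text doesn't state Enzyme's actual parameters, and the real embedding runs through the local ONNX model, not anything shown here:

```python
def chunk(text, size=200, overlap=50):
    # Overlapping windows so no sentence is stranded at a boundary.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# Each chunk would then be embedded into a 384-dimensional vector locally.
```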

3. Catalyst generation (~10s). For each entity, the engine samples context excerpts across temporal eras — not just recent content, but the full timeline. An LLM generates catalysts: thematic questions that probe what the content is actually about.

A catalyst for an entity that spans 18 months of meeting notes doesn’t ask “what happened in meetings?” It asks something like: “The team revisited caching three times — once as a performance fix, once as a cost concern, once as a reliability question. What changed between each return?” That question cuts across the timeline. It names a pattern the user hasn’t named yet.

4. Similarity precomputation (~3s). Every catalyst is embedded and compared against every document chunk. The top 50 similar chunks per catalyst are stored. This is the step that makes queries fast — at query time, there’s no vector search. The similarities are already computed.
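Sketched with a toy schema (table and column names are assumptions, not Enzyme's real layout), the precomputation is a nested loop run once at init:

```python
import sqlite3

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def precompute(catalyst_vecs, chunk_vecs, db, top_k=50):
    # Compare every catalyst against every chunk, once, and keep the top_k.
    db.execute("CREATE TABLE IF NOT EXISTS catalyst_chunks "
               "(catalyst_id INTEGER, chunk_id INTEGER, score REAL)")
    for ci, cvec in enumerate(catalyst_vecs):
        scored = sorted(((dot(cvec, v), j) for j, v in enumerate(chunk_vecs)),
                        reverse=True)[:top_k]
        db.executemany("INSERT INTO catalyst_chunks VALUES (?, ?, ?)",
                       [(ci, j, s) for s, j in scored])
    # Index so query time is a lookup, not a scan of the whole table.
    db.execute("CREATE INDEX IF NOT EXISTS idx_cat ON catalyst_chunks(catalyst_id)")

db = sqlite3.connect(":memory:")
precompute([[1.0, 0.0], [0.0, 1.0]],                           # catalyst vectors
           [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]], db, top_k=2)  # chunk vectors
```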

Total: under 20 seconds. Cost: cents (one LLM call for catalyst generation). Everything after that is local.

The query path: why 8ms and $0

Here’s the contrast. Every competitor listed above runs LLM inference per query. Here’s what Enzyme does:

  1. Embed the query (1-2ms — local ONNX model on CPU, no API call)
  2. Find top catalysts by dot product (0.5ms — ~100 pre-embedded vectors)
  3. Look up pre-computed chunk similarities from SQLite (2-3ms — indexed table scan, no vector search)
  4. Aggregate, weight by recency and multi-catalyst coverage, deduplicate (1-2ms)

Zero LLM calls. Zero API calls. Zero network round-trips. The query is a database lookup of relationships that were already computed at init time.
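The four steps can be sketched against such a precomputed table (schema and names are illustrative assumptions; the real engine also folds in recency weighting, elided here):

```python
import sqlite3

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def query(q_vec, catalyst_vecs, db, n_catalysts=3):
    # Step 1 (embedding the query locally) is assumed done: q_vec.
    # Step 2: top catalysts by dot product over ~100 in-memory vectors.
    top = sorted(range(len(catalyst_vecs)),
                 key=lambda i: dot(q_vec, catalyst_vecs[i]),
                 reverse=True)[:n_catalysts]
    # Step 3: indexed lookup of precomputed similarities -- no vector search.
    marks = ",".join("?" * len(top))
    rows = db.execute(
        f"SELECT chunk_id, SUM(score), COUNT(*) FROM catalyst_chunks "
        f"WHERE catalyst_id IN ({marks}) GROUP BY chunk_id", top).fetchall()
    # Step 4: GROUP BY deduplicates; rank by multi-catalyst coverage, then score.
    return sorted(rows, key=lambda r: (r[2], r[1]), reverse=True)
```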

This is why the cost model is structurally different. Query-time memory tools have a cost floor per query because each one involves LLM inference — the cheapest inference is still not free. Enzyme’s queries are zero marginal cost because the expensive work (catalyst generation, similarity computation) happened once during the compile step. The 11MB binary and 23MB embedding model run on any CPU, on device.

For a product serving thousands of users: the per-user cost is the init step (cents), and every query after that is free. A million queries costs the same as one.
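The arithmetic, with loudly illustrative numbers — the only figure taken from above is $0.10 per 1,000 queries; the cents-scale compile cost is an assumption:

```python
compile_cost = 0.05        # one-time init, dollars (assumed)
per_query = 0.10 / 1000    # pay-per-query pricing floor, dollars

def total(queries, fixed, marginal):
    return fixed + marginal * queries

# The compiled model is flat: a million queries cost what one does.
million_compiled = total(1_000_000, compile_cost, 0.0)
# The metered model grows without bound.
million_metered = total(1_000_000, 0.0, per_query)
```

With these assumptions, a million metered queries run $100 against a flat five cents — and that still excludes the LLM inference the other tools run per query.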

The cold-start advantage

This is the structural difference that matters for product builders.

A user signs up for your product and imports their reading highlights. With query-time memory, the intelligence layer is empty. With Enzyme, it’s ready in under 20 seconds. The first conversation is as rich as the hundredth because the concept graph was compiled from what the user already brought in.

For products built on import — reading managers, collection tools, research platforms, curation apps — the compile-time approach means the value is immediate. The user doesn’t have to earn personalization through conversation. It’s extracted from the content they already accumulated.

Apply: projecting understanding across corpora

Query-time memory is scoped to a single user’s conversation history. Enzyme’s concept graph is a portable artifact.

enzyme apply ./new-corpus takes the catalysts from one vault and projects them onto a different set of documents. The user’s intellectual framework becomes the lens for exploring unfamiliar content — papers that resonate with their existing thinking, articles that connect to themes they’ve been tracking, repositories that overlap with their design principles.
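As a sketch (the function name is assumed), apply is the same precomputation with the catalysts held fixed and only the chunk side swapped out:

```python
def apply_catalysts(catalyst_vecs, new_chunk_vecs, top_k=50):
    # The user's existing catalysts become the lens; only the corpus changes.
    results = {}
    for ci, cvec in enumerate(catalyst_vecs):
        scored = sorted(((sum(x * y for x, y in zip(cvec, v)), j)
                         for j, v in enumerate(new_chunk_vecs)), reverse=True)
        results[ci] = [j for _, j in scored[:top_k]]
    return results
```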

This is structurally impossible with query-time memory, because the understanding isn’t a separable artifact. It’s entangled with the conversation history that produced it. Compile-time semantics produces a concept graph that can be applied, shared, and reused.

When to use which

Query-time memory when:

  • The primary interaction is conversational (chatbots, companions, support)
  • The user starts with no existing content
  • Understanding should accumulate incrementally from behavior
  • You need fact-level recall from past conversations
  • Per-query LLM cost is acceptable for your unit economics

Compile-time semantics when:

  • Users bring existing content (imports, collections, knowledge bases)
  • You need intelligence from day one, not after 50 conversations
  • The value is in cross-source patterns, not individual fact recall
  • Per-query cost needs to be zero (on-device, local-first)
  • Understanding needs to transfer across corpora (apply)

Both when your product has an import-and-converse pattern. Compile the user’s imported content with Enzyme so the first conversation is rich. Layer query-time memory on top for ongoing personalization from chat. The concept graph gives the agent a starting point; conversation memory extends it.

The benchmark question

Query-time memory tools measure themselves on recall benchmarks — LoCoMo, LongMemEval, BEAM. These test whether a system can remember and retrieve facts from conversation history. Honcho scores 90.4%. Mem0 scores 66.9%. Supermemory scores 85.4%.

These benchmarks measure what query-time memory is designed for. Enzyme isn’t designed for that. It doesn’t store conversation history. It doesn’t retrieve facts from past messages.

The question Enzyme answers is different: given a corpus of existing content, what conceptual patterns run through it? The SearchReach comparison on the landing page shows what this looks like — keyword search finds 3 files if you guess the right words. Vector search finds 3 files from the same month. Enzyme’s catalyst reaches across a retro, an abandoned proposal from 8 months prior, and a journal entry that connects them.

There isn’t a benchmark for that yet. The closest proxy is token efficiency: compared to brute-force agent search, Enzyme uses half the tokens for broad exploration and a twelfth for pointed questions. But the real measure is whether the user recognizes what surfaces — whether the agent names a pattern they’ve been circling but hadn’t articulated.

That’s harder to put in a table. It’s also the reason people keep using it.


Try it. Enzyme compiles Obsidian vaults, Readwise exports, and any markdown corpus into a searchable concept graph. Under 20 seconds. Free for individuals. If you’re building a product on imported content and evaluating memory infrastructure, let’s talk.