When worse embeddings give better results
Position #1 is the wrong target
Most embedding optimization advice assumes you want the highest-fidelity vector representation possible. BEIR leaderboards rank models by nDCG. RAG tutorials obsess over cosine similarity thresholds. The implicit goal is always the same: put the right document at position #1.
For personal knowledge, that’s exactly the wrong target.
A research vault isn’t a customer support database. You’re not looking for the one correct answer to a factual query. The connections that matter are the ones you didn’t expect — a passage from a philosophy book that reframes a technical problem, a half-finished note from six months ago that turns out to be the seed of what you’re writing now. These cross-domain, serendipitous matches live in the top-10 neighborhood, not at position #1. A precise embedding model actively filters them out, because they don’t score high enough on cosine similarity to a literal query.
When you’re searching your own accumulated thinking, recall in the neighborhood matters more than ranking fidelity. You want the embedding to get you to the right part of the space, not to rank-order the results. That changes what you should optimize for.
Two hops, both lossy
Enzyme’s retrieval architecture was built around this premise. Instead of embedding a query and searching documents directly, there’s a layer of indirection: catalysts.
Catalysts are LLM-generated thematic handles — questions, theses, or claims that probe what your vault is already thinking about. They’re derived from trending entities (tags, links, folders) and grounded in actual passages from your notes. A catalyst like “How does spaced repetition reshape what we consider worth remembering?” isn’t a generic IR query. It’s synthesized from your vault’s own vocabulary and conceptual frame.
Search goes through two hops: query to catalyst, then catalyst to document. Your query finds the most relevant catalysts via embedding similarity. Each catalyst already has pre-computed similarity scores against every document in the vault, calculated offline during indexing. The final ranking aggregates across multiple catalysts — a document that appears in the neighborhood of three different relevant catalysts scores higher than one that’s a perfect match for just one.
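To make the two hops concrete, here is a minimal sketch in Python. It is not Enzyme’s implementation: the function names, the toy vectors, and especially the aggregation rule (a sum of catalyst-to-document similarities weighted by query-to-catalyst similarity) are assumptions chosen for illustration. The only parts taken from the description above are the shape of the pipeline and the precomputed catalyst-to-document table.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def two_hop_search(query_vec, catalyst_vecs, catalyst_doc_sims,
                   top_catalysts=2, top_docs=10):
    """Rank documents via catalysts instead of comparing the query to documents."""
    # Hop 1: query -> catalysts, by embedding similarity.
    query_sims = [cosine(query_vec, c) for c in catalyst_vecs]
    chosen = sorted(range(len(catalyst_vecs)), key=lambda i: -query_sims[i])
    chosen = chosen[:top_catalysts]

    # Hop 2: catalyst -> documents, read from the table built at indexing time.
    # Assumed aggregation: sum of (query-catalyst sim) * (catalyst-doc sim).
    scores = {}
    for i in chosen:
        weight = max(query_sims[i], 0.0)
        for doc_id, sim in catalyst_doc_sims[i].items():
            scores[doc_id] = scores.get(doc_id, 0.0) + weight * sim
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_docs]

# Toy data (hypothetical): three catalysts in a 2-d embedding space,
# each with precomputed similarities to a few notes.
catalyst_vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
catalyst_doc_sims = [
    {"note_a": 0.9, "note_b": 0.5},
    {"note_b": 0.6, "note_c": 0.2},
    {"note_c": 0.9},
]
results = two_hop_search([1.0, 0.2], catalyst_vecs, catalyst_doc_sims)
print([doc_id for doc_id, _ in results])
```

In this toy run, note_b comes out ahead of note_a even though note_a has the single strongest catalyst match, because note_b sits in the neighborhood of both selected catalysts.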
This two-hop design was a deliberate choice, not an optimization shortcut. The first hop is already lossy by design: catalysts are a compression of meaning, not a faithful reproduction. A catalyst about memory techniques will use words like “retention” and “spaced repetition” that naturally overlap with relevant documents, but it won’t capture every nuance of every note it’s meant to surface. It doesn’t need to. It just needs to get the search into the right neighborhood.
The architecture absorbs embedding imprecision at every level. Individual catalyst-to-document similarity errors are localized to specific pairs, not global. They’re pre-computed and baked into the similarity table during indexing, so they don’t compound at search time. And because results aggregate across multiple catalysts, one catalyst’s blind spot gets covered by another’s strength.
The cheapest model that still works
Enzyme originally ran transformer embeddings through CoreML or ONNX. That worked, but it meant runtime model loading, separate model distribution, and more moving parts than a local CLI should need. We wanted embedding to work well on edge devices and commodity hardware: a single-core Colab instance, a cheap Linux VM, anything without a dedicated inference accelerator or model setup step.
Because we knew the embedding layer was a neighborhood finder — not a ranker — we could ask a different optimization question. Not “what’s the best embedding model?” but “what’s the cheapest model that maintains top-10 recall?” The architecture gave us license to be aggressive with the answer.
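One way to phrase that question as a measurement is to compare a cheap model’s top 10 against a reference model’s top 10 while ignoring order. The sketch below assumes that setup; the function name and the example rankings are hypothetical, and the post does not say how Enzyme’s own evaluation was run.

```python
def recall_at_k(reference_ranking, candidate_ranking, k=10):
    """Fraction of the reference top-k that the candidate also puts in its top-k.

    Order inside the top-k is deliberately ignored: this scores
    neighborhood membership, not ranking fidelity.
    """
    reference_top = set(reference_ranking[:k])
    if not reference_top:
        return 0.0
    candidate_top = set(candidate_ranking[:k])
    return len(reference_top & candidate_top) / len(reference_top)

# Hypothetical rankings for one query over a 20-document vault.
reference = [f"doc{i}" for i in range(20)]

# Same ten documents in reverse order: ranking is wrong, neighborhood is intact.
shuffled = list(reversed(reference[:10])) + reference[10:]

# A cheaper model that pushes two reference hits out of the top ten.
degraded = reference[:8] + ["doc15", "doc12"] + reference[8:10]

print(recall_at_k(reference, shuffled))
print(recall_at_k(reference, degraded))
```

A fully reversed top 10 still scores perfect recall here, which is exactly the kind of error a neighborhood finder is allowed to make.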
Current Enzyme uses a static embedding backend compiled into the Rust binary. Embedding becomes a lookup-and-pool operation instead of transformer inference. There is no tokenizer dependency, no ONNX runtime, no daemon to keep warm, and no separate model file to download.
The tradeoff is explicit: static embeddings lose word order. “Alice recommended Bob” and “Bob recommended Alice” collapse toward the same vector. For direct search, that can matter. For catalyst-mediated search, the LLM-generated catalysts carry much of the compositional work, and the embedding layer mostly needs to get queries into the right neighborhood.
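The word-order loss is easy to see in a toy version of lookup-and-pool. The sketch below is in Python rather than Rust, uses a made-up three-word table, splits on whitespace, and pools by averaging; all of those are illustrative assumptions, not a description of Enzyme’s actual backend.

```python
def embed_static(text, table, dim):
    """Look up one fixed vector per known word, then mean-pool them."""
    tokens = text.lower().split()  # assumed: naive whitespace splitting
    vectors = [table[t] for t in tokens if t in table]
    if not vectors:
        return [0.0] * dim
    # Averaging is order-independent, so word order cannot survive pooling.
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Hypothetical 3-d embedding table; a real one has many more rows and dimensions.
table = {
    "alice": [1.0, 0.0, 0.0],
    "recommended": [0.0, 1.0, 0.0],
    "bob": [0.0, 0.0, 1.0],
}

a = embed_static("Alice recommended Bob", table, dim=3)
b = embed_static("Bob recommended Alice", table, dim=3)
print(a == b)  # the two sentences pool to the same vector in this toy table
```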
In the current build, the embedding weights add roughly 16 MB of model data and the shipped CLI is about 30 MB total. Query routing over precomputed catalysts benchmarks around 1.6 ms P50 locally. The old separate model download disappears.
By conventional IR standards, giving up word order would be a measurable regression. In practice, it’s the point: the catalyst indirection layer absorbs far more variance than the cheaper embedding model introduces.
On top of the model itself, a preprocessing layer cheaply discards documents unlikely to match any catalyst before any work is spent on similarity scoring. Because catalysts are derived from the vault’s own vocabulary, lexical matching has unusually high recall, and it’s nearly free compared to neural inference on a single CPU core.
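A lexical pre-filter of this kind can be as simple as a set intersection. The sketch below is an assumption about the general shape, not Enzyme’s actual filter: the function names, the shared-term threshold, and the stopword list are all hypothetical.

```python
import re

def terms(text):
    """Lowercased alphanumeric terms in a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def prefilter(documents, catalysts, min_shared_terms=1, stopwords=frozenset()):
    """Keep only documents that share vocabulary with at least one catalyst."""
    catalyst_terms = set()
    for catalyst in catalysts:
        catalyst_terms |= terms(catalyst)
    catalyst_terms -= stopwords

    kept = []
    for doc_id, text in documents.items():
        # Set intersection only: no embedding or similarity scoring yet.
        if len(terms(text) & catalyst_terms) >= min_shared_terms:
            kept.append(doc_id)
    return kept

catalysts = ["How does spaced repetition reshape what we consider worth remembering?"]
documents = {
    "memory.md": "Notes on spaced repetition and retention",
    "groceries.md": "Milk, eggs, coffee beans",
}
stopwords = frozenset({"how", "does", "what", "we"})

print(prefilter(documents, catalysts, stopwords=stopwords))
```

Only the surviving documents would then go on to similarity scoring.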
But the speed gains aren’t just about making a number smaller. They’re a budget you can spend.
Our chunks were larger than the model’s useful window, so each embedding was built from truncated content. We wanted to fit more of each document into the embedding space — smaller chunks, less truncation, better coverage. But that means more chunks to embed and score. On the old runtime model, that tradeoff was expensive. With static embeddings and pre-filtering buying headroom, it becomes viable even on a single CPU core.
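The arithmetic behind that tradeoff is worth spelling out. The sketch below uses invented numbers (a 2,000-token note and a 250-token useful window) and assumes that anything past the window in each chunk is simply dropped; none of these figures come from Enzyme.

```python
import math

def chunk_coverage(doc_tokens, chunk_size, window):
    """Return (number of chunks, fraction of the document that gets embedded).

    Assumes each chunk is truncated to `window` tokens before embedding.
    """
    if doc_tokens <= 0:
        return 0, 0.0
    chunks = math.ceil(doc_tokens / chunk_size)
    embedded = 0
    remaining = doc_tokens
    for _ in range(chunks):
        chunk_len = min(chunk_size, remaining)
        embedded += min(chunk_len, window)  # tokens beyond the window are lost
        remaining -= chunk_len
    return chunks, embedded / doc_tokens

# Hypothetical 2,000-token note with a 250-token useful window.
print(chunk_coverage(2000, chunk_size=1000, window=250))  # few chunks, heavy truncation
print(chunk_coverage(2000, chunk_size=250, window=250))   # more chunks, full coverage
```

In this example, moving from 1,000-token chunks to 250-token chunks raises coverage from a quarter of the note to all of it, at the cost of four times as many chunks to embed and score.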
That’s what the architecture makes possible. The embedding layer is a commodity — you choose whether to spend your speed budget on a shorter wait or on richer representation of each document.
Looseness as a feature
The conventional optimization path for retrieval systems is: better embeddings, longer context windows, higher-dimensional vectors, more precise ranking. Each improvement is expensive and yields diminishing returns.
Catalyst indirection inverts this. Instead of making the embedding layer more precise, you make it less important. The intelligence lives in the catalyst generation (which uses an LLM with full document context) and the aggregation (which combines signals across multiple thematic handles). The embedding model is just the spatial index that makes this computationally tractable.
The optimization question flips from “how precise can we get?” to “how imprecise can we afford to be?” We found the floor is lower than expected. Not because the embeddings are good enough — because the architecture doesn’t need them to be.
For personal knowledge specifically, the looseness in the embedding layer lets cross-domain matches survive. A precise model would have filtered out the philosophy-meets-engineering connection at the similarity threshold. An approximate one lets it through to the catalyst aggregation layer, where it gets boosted by appearing in the neighborhood of multiple relevant thematic handles.
The interesting connections in a personal knowledge base aren’t the ones a precise embedding model ranks #1. They’re the ones that land in the top 10 because something rhymed.
Try it
This is how Enzyme approaches search in personal knowledge. If you’re building on Obsidian and want retrieval that rewards loose, cross-domain thinking over exact-match precision, take a look. If you’re building a product and want this architecture running on your infra, let’s talk.
Read next: The case against tagging your notes — what happens when Enzyme runs on a vault with zero tags, and why the retrieval system that matters isn’t your metadata.