When worse embeddings give better results
Position #1 is the wrong target
Most embedding optimization advice assumes you want the highest-fidelity vector representation possible. BEIR leaderboards rank models by nDCG. RAG tutorials obsess over cosine similarity thresholds. The implicit goal is always the same: put the right document at position #1.
For personal knowledge, that’s exactly the wrong target.
A research vault isn’t a customer support database. You’re not looking for the one correct answer to a factual query. The connections that matter are the ones you didn’t expect — a passage from a philosophy book that reframes a technical problem, a half-finished note from six months ago that turns out to be the seed of what you’re writing now. These cross-domain, serendipitous matches live in the top-10 neighborhood, not at position #1. A precise embedding model actively filters them out, because they don’t score high enough on cosine similarity to a literal query.
When you’re searching your own accumulated thinking, recall in the neighborhood matters more than ranking fidelity. You want the embedding to get you to the right part of the space, not to rank-order the results. That changes what you should optimize for.
Two hops, both lossy
Enzyme’s retrieval architecture was built around this premise. Instead of embedding a query and searching documents directly, there’s a layer of indirection: catalysts.
Catalysts are LLM-generated thematic handles — questions, theses, or claims that probe what your vault is already thinking about. They’re derived from trending entities (tags, links, folders) and grounded in actual passages from your notes. A catalyst like “How does spaced repetition reshape what we consider worth remembering?” isn’t a generic IR query. It’s synthesized from your vault’s own vocabulary and conceptual frame.
Search goes through two hops: query to catalyst, then catalyst to document. Your query finds the most relevant catalysts via embedding similarity. Each catalyst already has pre-computed similarity scores against every document in the vault, calculated offline during indexing. The final ranking aggregates across multiple catalysts — a document that appears in the neighborhood of three different relevant catalysts scores higher than one that’s a perfect match for just one.
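The two hops can be sketched with toy data. The dictionary schema, the scores, and the weighted-sum aggregation below are illustrative assumptions, not Enzyme's actual implementation:

```python
# Offline: each catalyst carries pre-computed similarity to every document.
catalyst_doc_sims = {
    "spaced-repetition": {"note-a": 0.72, "note-b": 0.41, "note-c": 0.18},
    "memory-and-forgetting": {"note-a": 0.55, "note-b": 0.63, "note-c": 0.20},
    "tooling-habits": {"note-a": 0.10, "note-b": 0.48, "note-c": 0.70},
}

# Online: query-to-catalyst similarity from the embedding model.
query_catalyst_sims = {
    "spaced-repetition": 0.80,
    "memory-and-forgetting": 0.75,
    "tooling-habits": 0.30,
}

def two_hop_search(query_sims, table, top_catalysts=2):
    # Hop 1: keep only the most query-relevant catalysts.
    chosen = sorted(query_sims, key=query_sims.get, reverse=True)[:top_catalysts]
    # Hop 2: aggregate pre-computed doc scores, weighted by catalyst relevance.
    scores = {}
    for cat in chosen:
        for doc, sim in table[cat].items():
            scores[doc] = scores.get(doc, 0.0) + query_sims[cat] * sim
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = two_hop_search(query_catalyst_sims, catalyst_doc_sims)
```

Here "note-a" wins not because it is any single catalyst's best match, but because it sits in the neighborhood of two relevant catalysts at once.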
This two-hop design was a deliberate choice, not an optimization shortcut. The first hop is already lossy by design: catalysts are a compression of meaning, not a faithful reproduction. A catalyst about memory techniques will use words like “retention” and “spaced repetition” that naturally overlap with relevant documents, but it won’t capture every nuance of every note it’s meant to surface. It doesn’t need to. It just needs to get the search into the right neighborhood.
The architecture absorbs embedding imprecision at every level. Individual catalyst-to-document similarity errors are localized to specific pairs, not global. They’re pre-computed and baked into the similarity table during indexing, so they don’t compound at search time. And because results aggregate across multiple catalysts, one catalyst’s blind spot gets covered by another’s strength.
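The blind-spot claim can be demonstrated with toy numbers: perturb one catalyst's pre-computed scores and check whether the aggregated winner survives. Catalyst names, documents, and the size of the perturbation are all hypothetical:

```python
# Pre-computed catalyst-to-document similarity table (toy data).
table = {
    "cat-1": {"doc-a": 0.70, "doc-b": 0.40},
    "cat-2": {"doc-a": 0.60, "doc-b": 0.45},
    "cat-3": {"doc-a": 0.65, "doc-b": 0.90},
}

def aggregate(table):
    # Sum each document's score across all catalysts; return the winner.
    scores = {}
    for sims in table.values():
        for doc, s in sims.items():
            scores[doc] = scores.get(doc, 0.0) + s
    return max(scores, key=scores.get)

baseline = aggregate(table)

# Simulate embedding imprecision localized to one catalyst-document pair:
# cat-2 badly underestimates doc-a.
noisy = {c: dict(sims) for c, sims in table.items()}
noisy["cat-2"]["doc-a"] -= 0.15
perturbed = aggregate(noisy)
```

The error stays local: "doc-a" still wins on aggregate, because the other two catalysts' scores for it are untouched.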
The cheapest model that still works
Enzyme had been running on Apple’s Neural Engine via CoreML — FP16 weights on dedicated silicon, sub-millisecond per sentence. But that’s a luxury not everyone has. We wanted embedding to work well on edge devices and commodity hardware: a single-core Colab instance, a cheap Linux VM, anything without a dedicated inference accelerator.
Because we knew the embedding layer was a neighborhood finder — not a ranker — we could ask a different optimization question. Not “what’s the best embedding model?” but “what’s the cheapest model that maintains top-10 recall?” The architecture gave us license to be aggressive with the answer.
We swapped from all-MiniLM-L6-v2 to MongoDB’s mdbr-leaf-ir, a LEAF-distilled model that ranks #1 on BEIR for models under 100M params. It happens to share the same architecture — 6-layer BERT, 384 dimensions, same vocab size — so it was a drop-in replacement with zero code changes. Then we split the quantization strategy by platform.
On macOS, CoreML FP16 stays as-is. ANE runs float16 natively, and int4 quantization would actually push execution to the GPU — slower, not faster. The conversion is essentially lossless: 0.999985 cosine similarity against the PyTorch reference.
For CPU-bound environments, we quantized the ONNX model to int4 using MatMulNBitsQuantizer with per-block quantization (block_size=32). This targets the 36 MatMul ops across the transformer’s attention and FFN layers — query, key, value, output projection, two dense layers, times six layers. The embedding tables stay FP32 because the word embedding matrix (30K vocab × 384 dims) dominates model size and can’t be meaningfully int4’d without quality collapse. The result: 52 MB model (down from 87 MB), 0.98 cosine fidelity, 2.2x faster single-sentence inference on CPU.
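The fidelity figure is a cosine comparison between the reference model's outputs and the quantized model's outputs on the same sentences. A minimal version of that check, with toy vectors standing in for real embeddings:

```python
import math

def mean_cosine_fidelity(ref_embs, quant_embs):
    # Mean cosine similarity between paired embedding vectors.
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    return sum(cosine(a, b) for a, b in zip(ref_embs, quant_embs)) / len(ref_embs)

# Toy stand-ins for FP32 reference outputs and int4 model outputs.
ref = [[0.12, -0.40, 0.33, 0.85], [0.50, 0.10, -0.20, 0.70]]
quant = [[0.11, -0.42, 0.31, 0.86], [0.52, 0.09, -0.22, 0.68]]

fidelity = mean_cosine_fidelity(ref, quant)
```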
That 0.98 cosine fidelity would be a measurable regression by conventional IR standards. In practice, it’s noise — the catalyst indirection layer absorbs far more variance than two points of cosine similarity.
On top of the model itself, a BM25 pre-filter cheaply discards documents unlikely to match any catalyst before the expensive embedding pass. Because catalysts are derived from the vault’s own vocabulary, lexical matching has unusually high recall — and it’s nearly free compared to neural inference on a single CPU core.
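A minimal Okapi BM25 pre-filter might look like the following. The tokenization, the k1/b parameters, and the zero-score cutoff are illustrative choices, not Enzyme's actual pipeline:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    # Minimal Okapi BM25 over pre-tokenized documents.
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# Catalyst vocabulary as the lexical query; docs are tokenized notes.
catalyst_terms = ["spaced", "repetition", "retention"]
docs = [
    "spaced repetition schedules improve retention".split(),
    "notes on kubernetes networking".split(),
    "retention curves and forgetting".split(),
]

scores = bm25_scores(catalyst_terms, docs)
# Only documents with some lexical overlap survive to the embedding pass.
survivors = [i for i, s in enumerate(scores) if s > 0]
```

The note with zero overlap is discarded before any neural inference runs, which is where the savings come from.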
But the speed gains aren’t just about making a number smaller. They’re a budget you can spend.
Our chunks were larger than the model’s sequence window, so each embedding was built from truncated content. We wanted to fit more of each document into the embedding space — smaller chunks, less truncation, better coverage. But that means 3x as many chunks to embed. On the old model and old hardware, that tradeoff wasn’t available. With a faster model, int4 quantization, and pre-filtering buying headroom, it becomes viable even on a single CPU core.
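The re-chunking can be sketched as an overlapping sliding window over each oversized chunk. The window and overlap sizes below are made up for illustration; they are not Enzyme's actual configuration:

```python
def split_chunk(tokens, window=256, overlap=32):
    # Split an oversized chunk into overlapping sub-chunks that each
    # fit the embedding model's sequence window.
    if len(tokens) <= window:
        return [tokens]
    step = window - overlap
    pieces = []
    for start in range(0, len(tokens), step):
        pieces.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # last window reached the end of the chunk
    return pieces

doc = [f"tok{i}" for i in range(600)]
pieces = split_chunk(doc)
```

A 600-token chunk that previously embedded truncated at 256 tokens now becomes three sub-chunks covering every token, at three times the embedding cost.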
That’s what the architecture makes possible. The embedding layer is a commodity — you choose whether to spend your speed budget on a shorter wait or on richer representation of each document.
Looseness as a feature
The conventional optimization path for retrieval systems is: better embeddings, longer context windows, higher-dimensional vectors, more precise ranking. Each improvement is expensive and yields diminishing returns.
Catalyst indirection inverts this. Instead of making the embedding layer more precise, you make it less important. The intelligence lives in the catalyst generation (which uses an LLM with full document context) and the aggregation (which combines signals across multiple thematic handles). The embedding model is just the spatial index that makes this computationally tractable.
The optimization question flips from “how precise can we get?” to “how imprecise can we afford to be?” We found the floor is lower than expected. Not because the embeddings are good enough — because the architecture doesn’t need them to be.
For personal knowledge specifically, the looseness in the embedding layer lets cross-domain matches survive. A precise model would have filtered out the philosophy-meets-engineering connection at the similarity threshold. An approximate one lets it through to the catalyst aggregation layer, where it gets boosted by appearing in the neighborhood of multiple relevant thematic handles.
The interesting connections in a personal knowledge base aren’t the ones a precise embedding model ranks #1. They’re the ones that land in the top 10 because something rhymed.
Try it
This is how Enzyme approaches search in personal knowledge. If you’re building on Obsidian and want retrieval that rewards loose, cross-domain thinking over exact-match precision, take a look.