Embeddings as a geometric substrate
An embedding maps a span of text to a fixed-length vector in ℝᵈ. Similar spans end up near one another under some similarity or distance measure, usually cosine or L2. Once text is a point, retrieval is just geometry: nearest-neighbor search, k-means clustering, centroid summarization, reranking by reweighted distance.
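To make the geometry concrete, here is a minimal brute-force sketch: a cosine kernel and a nearest-neighbor scan over raw float vectors. Names and types are illustrative, not a library API; the rest of the post is about making this scan fast.
// minimal sketch: cosine as plain geometry, brute-force nearest neighbor
import std.math : sqrt;
float cosine(const float[] a, const float[] b)
{
    float dot = 0, na = 0, nb = 0;
    foreach (i; 0 .. a.length)
    {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (sqrt(na) * sqrt(nb));
}
size_t nearestChunk(const float[] query, const float[][] corpus)
{
    size_t best = 0;
    float bestScore = -2;                      // cosine lives in [-1, 1]
    foreach (i, vec; corpus)
    {
        const s = cosine(query, vec);
        if (s > bestScore) { bestScore = s; best = i; }
    }
    return best;
}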
Coarse centroids, fine-grained local search
Scanning every vector for every query is O(N·d). At millions of chunks that is already too slow on a laptop. The standard fix is a two-stage index: cluster the corpus once, keep a centroid per cluster, and at query time restrict the fine-grained search to the top-k nearest clusters.
// coarse pass: which clusters to inspect
int[] shortlist = argTopK(dot(query, centroids), k: 8);
// fine pass: full-precision search only inside those
Candidate[] hits;
foreach (c; shortlist)
hits ~= nearest(query, members[c], n: 20);
return rerank(hits, query);
With ~√N clusters, a typical query scores on the order of √N vectors (the centroids plus the members of the shortlisted clusters) instead of all N. At N = 10⁶ chunks and ~1,000 clusters, a shortlist of 8 means roughly 1,000 centroid comparisons plus ~8,000 member comparisons, under 1% of the corpus. Recall loss is bounded by how well the centroids separate the data, and is usually recoverable with a slightly wider shortlist.
Data layout beats algorithm
Cosine similarity between two float[d] vectors is the inner-product kernel from any linear algebra library. The wins on laptop hardware are not algorithmic; they come from how the vectors are stored:
- Store vectors as a contiguous float[N][d] matrix in row-major order, so each vector streams through in d/8 AVX2 loads (8 float lanes per load), or d/16 with AVX-512.
- L2-normalize vectors at write time. Cosine similarity then collapses to a single FMA-friendly dot product, with no per-query sqrt.
- Keep the query vector small and hot. It fits in L1 and stays there across a whole cluster scan.
- Batch by cluster, not by query. All the vectors you compare against a given query live in one contiguous run.
Intuit’s intuit.space module exposes cosineSimilarity, dotProduct, euclideanDistance, l2Norm, normalize, and normMean as SIMD-specialized primitives so downstream code can stay algorithmic without losing the low-level speed.
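Put together, the fine pass over one cluster is a straight streaming loop. A sketch under the assumptions above (write-time normalization, row-major contiguous storage); the names are illustrative, and the inner loop is where a primitive like intuit.space's dotProduct would slot in:
// fine pass over one cluster: members stored back to back, d floats per row
struct Scored { size_t row; float score; }
Scored bestInCluster(const float[] query, const float[] cluster, size_t d)
{
    auto best = Scored(0, -2);
    const rows = cluster.length / d;
    foreach (r; 0 .. rows)
    {
        // one contiguous run per member: a cache-friendly streaming read
        const vec = cluster[r * d .. (r + 1) * d];
        float dot = 0;
        foreach (i; 0 .. d)
            dot += query[i] * vec[i];   // vectors are unit-length, so this is cosine
        if (dot > best.score)
            best = Scored(r, dot);
    }
    return best;
}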
Quantization for memory, not accuracy
Full float32 embeddings at d=1024 cost 4 KB per chunk. At a million chunks that is 4 GB just for the index. float16 halves it with no meaningful recall change on most retrieval workloads; int8 scalar quantization with per-vector scale halves it again.
The useful insight on a laptop is that quantization is about staying in cache. Int8 vectors let a much larger working set live in L2/L3, and in practice that bandwidth win outweighs the precision loss.
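A sketch of what per-vector int8 scalar quantization can look like. The scheme here (symmetric, scale = maxAbs/127) and the names are assumptions for illustration, not the post's exact implementation:
import std.math : abs, lrint;
struct QuantVec { byte[] q; float scale; }
QuantVec quantize(const float[] v)
{
    float maxAbs = 0;
    foreach (x; v)
    {
        const ax = abs(x);
        if (ax > maxAbs) maxAbs = ax;
    }
    const scale = maxAbs > 0 ? maxAbs / 127.0f : 1.0f;
    auto q = new byte[v.length];
    foreach (i, x; v)
        q[i] = cast(byte) lrint(x / scale);   // round to the nearest int8 step
    return QuantVec(q, scale);
}
// approximate dot product in the quantized domain: integer MACs,
// one float multiply at the end to undo both scales
float dotQ(const QuantVec a, const QuantVec b)
{
    long acc = 0;
    foreach (i; 0 .. a.q.length)
        acc += cast(int) a.q[i] * cast(int) b.q[i];
    return acc * a.scale * b.scale;
}
Each chunk shrinks from 4·d bytes to d bytes plus one float, which is what keeps whole clusters resident in L2/L3 during a scan.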
Determinism and pure functions make this testable
Every transform in the pipeline (chunking, embedding, normalization, centroid computation, shortlist selection, reranking) is a pure function of its inputs. That makes clusters reproducible across runs, makes it possible to check centroid stability across corpus updates, and makes it possible to golden-test retrieval on a fixed corpus snapshot without spinning up any infrastructure.
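That purity is what a golden test leans on. A sketch, assuming hypothetical loadSnapshot/buildIndex/retrieve stages and placeholder ids:
// golden test sketch: every stage is pure, so a frozen snapshot pins the output
unittest
{
    const snapshot = loadSnapshot("testdata/corpus-2024-01.bin"); // fixed input
    const index = buildIndex(snapshot);       // chunk -> embed -> normalize -> cluster
    const hits = retrieve(index, "how do I rotate the api key?", 5);
    assert(hits == [412, 87, 1203, 9871, 5]); // placeholder golden ids from a known-good run
}
No services, no network, no flaky fixtures: if the shortlist drifts after a corpus or model change, the test says so immediately.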
Models that fit the laptop envelope
The pair that has worked best for me is Qwen 0.6B for embeddings and Qwen 4B for language. The embedding model is small enough to re-embed hundreds of thousands of chunks in an afternoon on CPU; the language model is small enough to stream tokens interactively while still producing useful extractive summaries when given retrieved context.
Takeaways
- Treat retrieval as geometry, not string matching.
- Use coarse centroids to prune, then do full-precision search inside a small shortlist.
- Contiguous, normalized, SIMD-friendly layouts matter more than which inner-product algorithm you pick.
- Quantize to keep the working set in cache, not to save disk.
- Keep the pipeline pure and deterministic so you can actually test it.