Embeddings as a geometric substrate
An embedding maps a span of text to a fixed-length vector in ℝᵈ. Similar spans end up near one another under some similarity or distance measure, usually cosine or L2. Once text is a point, retrieval is just geometry: nearest-neighbor search, k-means clustering, centroid summarization, reranking by reweighted distance.
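To make the geometry concrete, here is a minimal brute-force sketch: a cosine kernel and a nearest-neighbor scan over raw float vectors. Names and types are illustrative, not a library API; the rest of the post is about making this scan fast.
// minimal sketch: cosine as plain geometry, brute-force nearest neighbor
import std.math : sqrt;
float cosine(const float[] a, const float[] b)
{
    float dot = 0, na = 0, nb = 0;
    foreach (i; 0 .. a.length)
    {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (sqrt(na) * sqrt(nb));
}
size_t nearestChunk(const float[] query, const float[][] corpus)
{
    size_t best = 0;
    float bestScore = -2;                      // cosine lives in [-1, 1]
    foreach (i, vec; corpus)
    {
        const s = cosine(query, vec);
        if (s > bestScore) { bestScore = s; best = i; }
    }
    return best;
}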
Coarse centroids, fine-grained local search
Scanning every vector for every query is O(N·d). At millions of chunks that is already too slow on a laptop. The standard fix is a two-stage index: cluster the corpus once, keep a centroid per cluster, and at query time restrict the fine-grained search to the top-k nearest clusters.
// coarse pass: which clusters to inspect
int[] shortlist = argTopK(dot(query, centroids), k: 8);
// fine pass: full-precision search only inside those
Candidate[] hits;
foreach (c; shortlist)
hits ~= nearest(query, members[c], n: 20);
return rerank(hits, query);
With ~√N clusters, a typical query scores on the order of √N vectors (the centroids plus the members of the shortlisted clusters) instead of all N. At N = 10⁶ chunks and ~1,000 clusters, a shortlist of 8 means roughly 1,000 centroid comparisons plus ~8,000 member comparisons, under 1% of the corpus. Recall loss is bounded by how well the centroids separate the data, and is usually recoverable with a slightly wider shortlist.
Data layout beats algorithm
Cosine similarity between two float[d] vectors is the inner-product kernel from any linear algebra library. The wins on laptop hardware are not algorithmic; they come from how the vectors are stored:
- Store vectors as a contiguous float[N][d] matrix in row-major order, so each vector streams through in d/8 AVX2 loads (8 float lanes per load), or d/16 with AVX-512.
- L2-normalize vectors at write time. Cosine similarity then collapses to a single FMA-friendly dot product, with no per-query sqrt.
- Keep the query vector small and hot. It fits in L1 and stays there across a whole cluster scan.
- Batch by cluster, not by query. All the vectors you compare against a given query live in one contiguous run.
Intuit’s intuit.space module exposes cosineSimilarity, dotProduct, euclideanDistance, l2Norm, normalize, and normMean as SIMD-specialized primitives so downstream code can stay algorithmic without losing the low-level speed.
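Put together, the fine pass over one cluster is a straight streaming loop. A sketch under the assumptions above (write-time normalization, row-major contiguous storage); the names are illustrative, and the inner loop is where a primitive like intuit.space's dotProduct would slot in:
// fine pass over one cluster: members stored back to back, d floats per row
struct Scored { size_t row; float score; }
Scored bestInCluster(const float[] query, const float[] cluster, size_t d)
{
    auto best = Scored(0, -2);
    const rows = cluster.length / d;
    foreach (r; 0 .. rows)
    {
        // one contiguous run per member: a cache-friendly streaming read
        const vec = cluster[r * d .. (r + 1) * d];
        float dot = 0;
        foreach (i; 0 .. d)
            dot += query[i] * vec[i];   // vectors are unit-length, so this is cosine
        if (dot > best.score)
            best = Scored(r, dot);
    }
    return best;
}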
Quantization for memory, not accuracy
Full float32 embeddings at d=1024 cost 4 KB per chunk. At a million chunks that is 4 GB just for the index. float16 halves it with no meaningful recall change on most retrieval workloads; int8 scalar quantization with per-vector scale halves it again.
The useful insight on a laptop is that quantization is about staying in cache. Int8 vectors let a much larger working set live in L2/L3, and in practice that bandwidth win outweighs the precision loss.
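A sketch of what per-vector int8 scalar quantization can look like. The scheme here (symmetric, scale = maxAbs/127) and the names are assumptions for illustration, not the post's exact implementation:
import std.math : abs, lrint;
struct QuantVec { byte[] q; float scale; }
QuantVec quantize(const float[] v)
{
    float maxAbs = 0;
    foreach (x; v)
    {
        const ax = abs(x);
        if (ax > maxAbs) maxAbs = ax;
    }
    const scale = maxAbs > 0 ? maxAbs / 127.0f : 1.0f;
    auto q = new byte[v.length];
    foreach (i, x; v)
        q[i] = cast(byte) lrint(x / scale);   // round to the nearest int8 step
    return QuantVec(q, scale);
}
// approximate dot product in the quantized domain: integer MACs,
// one float multiply at the end to undo both scales
float dotQ(const QuantVec a, const QuantVec b)
{
    long acc = 0;
    foreach (i; 0 .. a.q.length)
        acc += cast(int) a.q[i] * cast(int) b.q[i];
    return acc * a.scale * b.scale;
}
Each chunk shrinks from 4·d bytes to d bytes plus one float, which is what keeps whole clusters resident in L2/L3 during a scan.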
Determinism and pure functions make this testable
Every transform in the pipeline (chunking, embedding, normalization, centroid computation, shortlist selection, reranking) is a pure function of its inputs. That makes clusters reproducible across runs, makes it possible to check centroid stability across corpus updates, and makes it possible to golden-test retrieval on a fixed corpus snapshot without spinning up any infrastructure.
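That purity is what a golden test leans on. A sketch, assuming hypothetical loadSnapshot/buildIndex/retrieve stages and placeholder ids:
// golden test sketch: every stage is pure, so a frozen snapshot pins the output
unittest
{
    const snapshot = loadSnapshot("testdata/corpus-2024-01.bin"); // fixed input
    const index = buildIndex(snapshot);       // chunk -> embed -> normalize -> cluster
    const hits = retrieve(index, "how do I rotate the api key?", 5);
    assert(hits == [412, 87, 1203, 9871, 5]); // placeholder golden ids from a known-good run
}
No services, no network, no flaky fixtures: if the shortlist drifts after a corpus or model change, the test says so immediately.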
Models that fit the laptop envelope
The pair that has worked best for me is Qwen 0.6B for embeddings and Qwen 4B for language. The embedding model is small enough to re-embed hundreds of thousands of chunks in an afternoon on CPU; the language model is small enough to stream tokens interactively while still producing useful extractive summaries when given retrieved context.
Takeaways
- Treat retrieval as geometry, not string matching.
- Use coarse centroids to prune, then do full-precision search inside a small shortlist.
- Contiguous, normalized, SIMD-friendly layouts matter more than which inner-product algorithm you pick.
- Quantize to keep the working set in cache, not to save disk.
- Keep the pipeline pure and deterministic so you can actually test it.