
cetio

Semantic retrieval on laptops: embeddings, centroids, and SIMD

· original on LinkedIn

Embeddings as a geometric substrate

An embedding maps a span of text into a fixed-length vector in ℝ^d. Similar spans of text end up near each other under some similarity or distance measure, usually cosine similarity or L2 distance. Once text is a point, retrieval is just geometry: nearest-neighbor search, k-means clustering, centroid summarization, reranking by reweighted distance.
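
For concreteness, this is what the geometric view looks like in code: a minimal D sketch using only Phobos, with illustrative function names rather than anything from the pipeline described below. Normalizing at index time turns cosine similarity into a plain dot product, and brute-force retrieval is just an argmax over the scores.

import std.algorithm : map, maxIndex;
import std.math : sqrt;
import std.numeric : dotProduct;

// Normalize once at index time so cosine similarity reduces to a dot product.
float[] normalizeVector(float[] v)
{
    immutable norm = sqrt(dotProduct(v, v));
    foreach (ref x; v)
        x /= norm;
    return v;
}

// Brute-force nearest neighbor over unit-normalized vectors: O(N·d) per query.
// This is the baseline the centroid index below exists to avoid.
size_t nearestIndex(const(float)[] query, const(float[])[] corpus)
{
    return cast(size_t) corpus.map!(v => dotProduct(query, v)).maxIndex;
}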

On a laptop, the real costs are memory bandwidth and SIMD lane utilization, not raw FLOPs. Good retrieval on local hardware is mostly a data-layout problem.

Coarse centroids, fine-grained local search

Scanning every vector for every query is O(N·d). At millions of chunks that is already too slow on a laptop. The standard fix is a two-stage index: cluster the corpus once, keep a centroid per cluster, and at query time restrict the fine-grained search to the top-k nearest clusters.

// coarse pass: which clusters to inspect
int[] shortlist = argTopK(dot(query, centroids), k: 8);

// fine pass: full-precision search only inside those
Candidate[] hits;
foreach (c; shortlist)
    hits ~= nearest(query, members[c], n: 20);

return rerank(hits, query);

With ~√N balanced clusters, a typical query scans the √N centroids plus the members of the few probed clusters, on the order of √N vectors instead of all N. Recall loss is bounded by how well the centroids separate the data, and is usually recoverable with a slightly wider shortlist.
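
Building the coarse index itself does not need a vector database; a few rounds of plain k-means over the embeddings is enough. A minimal sketch in the same spirit, with illustrative names, unit-normalized float vectors assumed, and empty clusters simply left where they fall:

import std.algorithm : map, maxIndex;
import std.numeric : dotProduct;

// One k-means round over unit-normalized vectors, using the dot product as
// the similarity. A handful of rounds gives centroids that are stable enough
// for a coarse index; exact convergence is not needed.
float[][] kmeansStep(const(float[])[] vectors, float[][] centroids)
{
    immutable d = centroids[0].length;
    auto sums   = new float[][](centroids.length, d);
    auto counts = new size_t[](centroids.length);
    foreach (s; sums)
        s[] = 0;

    // Assignment: every vector goes to its nearest centroid.
    foreach (v; vectors)
    {
        immutable c = cast(size_t) centroids.map!(cv => dotProduct(v, cv)).maxIndex;
        sums[c][] += v[];
        counts[c]++;
    }

    // Update: every non-empty centroid becomes the mean of its members.
    foreach (i, c; centroids)
        if (counts[i] > 0)
        {
            c[] = sums[i][];
            c[] /= cast(float) counts[i];
        }
    return centroids;
}

Seed the centroids with a random sample of the corpus and run three or four rounds; with √N centroids that is a one-off cost paid at index build time.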

Data layout beats algorithm

Cosine similarity between two float[d] vectors is the inner-product kernel from any linear algebra library. The wins on laptop hardware are not algorithmic; they come from how the vectors are stored: contiguous and aligned so SIMD loads are unit-stride, normalized once at index time so cosine collapses to a plain dot product, and narrow enough that the working set stays in cache.

Intuit’s intuit.space module exposes cosineSimilarity, dotProduct, euclideanDistance, l2Norm, normalize, and normMean as SIMD-specialized primitives, so downstream code can stay at the algorithmic level without giving up the low-level speed.
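
One concrete layout that pays off is a single flat buffer with stride d rather than an array of per-chunk arrays, so a scan is one unit-stride pass that the compiler (or SIMD primitives like the ones above) can chew through. The sketch below uses only Phobos; the struct and its methods are illustrative, not the intuit.space API.

import std.numeric : dotProduct;

// All embeddings live in one contiguous buffer; row i is data[i*d .. i*d + d].
// Scanning is then a single unit-stride pass over memory, which is what keeps
// SIMD lanes fed and the prefetcher useful.
struct VectorStore
{
    size_t d;
    float[] data;   // length == count * d

    size_t count() const
    {
        return data.length / d;
    }

    const(float)[] row(size_t i) const
    {
        return data[i * d .. i * d + d];
    }

    // Exhaustive scoring of the query against every stored vector.
    float[] scores(const(float)[] query) const
    {
        auto result = new float[](count);
        foreach (i; 0 .. count)
            result[i] = dotProduct(query, row(i));
        return result;
    }
}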

Quantization for memory, not accuracy

Full float32 embeddings at d=1024 cost 4 KB per chunk. At a million chunks that is 4 GB just for the index. float16 halves it with no meaningful recall change on most retrieval workloads; int8 scalar quantization with per-vector scale halves it again.

The useful insight on a laptop is that quantization is about staying in cache: int8 vectors let a much larger working set live in L2/L3, and in practice that bandwidth win outweighs the small precision loss.
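
A minimal version of the int8 scheme, symmetric with one scale per vector (the type and function names are illustrative): at d = 1024 this is 1 KB of codes plus a 4-byte scale per chunk, and the scoring loop touches only 8-bit data until the final multiply.

import std.algorithm : map, maxElement;
import std.math : abs, round;

struct QuantizedVector
{
    byte[] codes;   // d int8 codes
    float scale;    // per-vector dequantization factor
}

// Symmetric scalar quantization: scale = max|x| / 127, codes in [-127, 127].
QuantizedVector quantize(const(float)[] v)
{
    immutable maxAbs = v.map!(x => abs(x)).maxElement;
    immutable scale  = maxAbs > 0 ? maxAbs / 127.0f : 1.0f;
    auto codes = new byte[](v.length);
    foreach (i, x; v)
        codes[i] = cast(byte) round(x / scale);
    return QuantizedVector(codes, scale);
}

// Approximate dot product: accumulate in int (byte * byte promotes to int),
// apply both scales once at the end, so the hot loop stays in 8-bit data.
float dotQuantized(const QuantizedVector a, const QuantizedVector b)
{
    int acc = 0;
    foreach (i; 0 .. a.codes.length)
        acc += a.codes[i] * b.codes[i];
    return acc * a.scale * b.scale;
}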

Determinism and pure functions make this testable

Every transform in the pipeline (chunking, embedding, normalization, centroid computation, shortlist selection, reranking) is a pure function of its inputs. That makes clusters reproducible across runs, makes it possible to check centroid stability across corpus updates, and makes it possible to golden-test retrieval on a fixed corpus snapshot without spinning up any infrastructure.
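
In D that testability takes the shape of an ordinary unittest. The sketch below reuses the kmeansStep sketch from earlier on a toy input and asserts only the property the paragraph relies on, that reruns on identical inputs produce identical outputs; a real golden test would do the same over a checked-in corpus snapshot and a fixed list of expected chunk IDs.

unittest
{
    import std.algorithm : map;
    import std.array : array;

    // Two obvious clusters: one near the x axis, one near the y axis.
    float[][] vectors = [
        [1.0f, 0.0f], [0.9f, 0.1f],
        [0.0f, 1.0f], [0.1f, 0.9f],
    ];
    float[][] seeds = [[1.0f, 0.0f], [0.0f, 1.0f]];

    // Pure function: the same inputs must give bit-identical centroids.
    auto a = kmeansStep(vectors, seeds.map!(s => s.dup).array);
    auto b = kmeansStep(vectors, seeds.map!(s => s.dup).array);
    assert(a == b);
}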

Models that fit the laptop envelope

The pair that has worked best for me is Qwen 0.6B for embeddings and Qwen 4B for language. The embedding model is small enough to re-embed hundreds of thousands of chunks in an afternoon on CPU; the language model is small enough to stream tokens interactively while still producing useful extractive summaries when given retrieved context.

Takeaways