Show HN: I wrote a GPU-less billion-vector DB for molecule search (live demo)
Input a SMILES string (or pick one of the example molecules) and it returns up to 100k molecules closest in 3-D shape or electrostatic similarity, drawn from 10+ billion-compound databases, typically in 5-10 s.

*Why it might interest HN*

* Entire index lives on disk: no GPU at query time, under 10 GB RAM total.

* Built from scratch (no FAISS / Milvus / Pinecone).

* Index-build cost: one Nvidia T4 (~300 USD) for one 5.5 B-compound database (rough sketch below).

* Open to anyone; predicts ADMET properties and exports results as CSV/SDF.
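
A minimal sketch of what such a stream-to-disk build loop could look like; nothing here is CHEESE's actual pipeline, and `embed_batch` is a hypothetical stand-in for whatever encoder runs on the T4:

    import numpy as np

    DIM = 256          # embedding width (the 256-D floats discussed below)
    BATCH = 100_000    # molecules embedded per GPU batch (made-up number)

    def embed_batch(smiles_batch):
        # Hypothetical encoder stub; the real model is not described here.
        return np.random.rand(len(smiles_batch), DIM).astype(np.float32)

    def build_index(smiles_iter, path="index.f32"):
        # Append each batch of embeddings straight to disk, so RAM usage
        # stays flat no matter how many billions of molecules stream past.
        with open(path, "wb") as f:
            batch = []
            for smi in smiles_iter:
                batch.append(smi)
                if len(batch) == BATCH:
                    embed_batch(batch).tofile(f)
                    batch.clear()
            if batch:
                embed_batch(batch).tofile(f)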

Full write-up & benchmarks (DUD-E, LIT-PCBA, SVS) in the pre-print: https://chemrxiv.org/engage/chemrxiv/article-details/6725091...

Nice project! A regular on HN and the creator of usearch built an embedding search for the same dataset and did a write-up, which is a great read.

https://ashvardanian.com/posts/usearch-molecules/

Thanks — I read Ash’s post (great blog!) and even spun up USEARCH when I first explored this space.

Main differences:

* *Cost-efficiency:* USEARCH / FAISS / HNSW keep most of the index in RAM; at billion scale that often means hundreds of GB. In CHEESE, both build and search stream from disk. For the 5.5 B-compound Enamine set the footprint is ~1.7 TB NVMe plus ~4 GB RAM (only the centroids), so it can run on a laptop and still scale to tens of billions of vectors. This is also a big difference from commercial vector DB providers (Pinecone, Milvus, ...), which would bill many thousands of USD per month for a dataset this size because of the RAM-heavy instances.
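
Back-of-envelope on those numbers (my arithmetic, not the paper's; the bytes-per-vector reading is an inference):

    # 5.5 B vectors at 256 float32 dims, stored raw:
    raw = 5.5e9 * 256 * 4        # ~5.6 TB, well beyond the 1.7 TB cited
    # implied bytes per vector in the actual index:
    per_vec = 1.7e12 / 5.5e9     # ~309 B, i.e. ~1.2 bytes/dim, which hints at
                                 # scalar quantization (not confirmed in the post)
    # 256-D float32 centroids that fit in ~4 GB of RAM:
    n_cent = 4e9 / (256 * 4)     # ~3.9 M centroids, so ~1,400 vectors/cluster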

* *Vector type:* The USEARCH demo uses binary fingerprints with Tanimoto distance. I use 256-D float embeddings trained to approximate 3-D shape and electrostatic overlap, searched with Euclidean distance.
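
To make the contrast concrete, a toy version of both metrics in numpy (the 2048-bit fingerprint length is illustrative; I don't know what the USEARCH demo actually uses):

    import numpy as np

    # Binary fingerprints + Tanimoto (the USEARCH-demo style):
    a = np.random.rand(2048) > 0.9              # two sparse random bit vectors
    b = np.random.rand(2048) > 0.9
    tanimoto = (a & b).sum() / (a | b).sum()    # shared bits / union of bits

    # 256-D float embeddings + Euclidean (the CHEESE style):
    x = np.random.rand(256).astype(np.float32)
    y = np.random.rand(256).astype(np.float32)
    euclidean = np.linalg.norm(x - y)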

* *Latency vs. accuracy:* BigANN-style work optimises for QPS and millisecond latency. Chemists usually submit queries one by one, so they don't mind 1–6 s if the top hits are chemically meaningful. I pull entire clusters from disk and scan them exactly to keep recall high.
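
That is essentially IVF-style two-stage retrieval. A minimal numpy sketch, where the one-contiguous-file-per-cluster layout and the `n_probe` / `k` defaults are my assumptions rather than CHEESE internals:

    import numpy as np

    DIM = 256

    def search(query, centroids, cluster_files, n_probe=4, k=100):
        # Stage 1: find the n_probe nearest centroids (this part lives in RAM).
        d_cent = np.linalg.norm(centroids - query, axis=1)
        probe = np.argsort(d_cent)[:n_probe]

        hits = []
        for c in probe:
            # Stage 2: stream the whole cluster from disk and scan it
            # exactly, so recall within the probed clusters is 100%.
            vecs = np.memmap(cluster_files[c], dtype=np.float32,
                             mode="r").reshape(-1, DIM)
            d = np.linalg.norm(vecs - query, axis=1)
            for i in np.argsort(d)[:k]:
                hits.append((float(d[i]), int(c), int(i)))

        hits.sort()        # merge the per-cluster top-k lists by distance
        return hits[:k]    # (distance, cluster id, offset within cluster)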

So the trade-off is a few extra seconds of latency in exchange for far cheaper hardware and results optimized for accuracy.