Are you using a vector database, some type of semantic search, a knowledge graph, a hypergraph?
https://huggingface.co/MongoDB/mdbr-leaf-ir
It ranks #1 on a bunch of leaderboards for models of its size. It can be used interchangeably with the model it has been distilled from (https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1...).
You can see an example comparing semantic (i.e., embeddings-based) search vs bm25 vs hybrid here: http://search-sensei.s3-website-us-east-1.amazonaws.com (warning! It will download ~50MB of data for the model weights and onnx runtime on first load, but should otherwise run smoothly even on a phone)
This mini app illustrates the advantage of semantic vs bm25 search. For instance, embedding models "know" that j lo refers to jennifer lopez.
We have also published the recipe to train this type of models if you were interested in doing so; we show that it can be done on relatively modest hardware and training data is very easy to obtain: https://arxiv.org/abs/2509.12539
These are very interesting models.
The tradeoff here is that you get even faster inference, but lose on retrieval accuracy [0].
Specifically, inference will be faster because essentially you are only doing tokenization + a lookup table + an average. So despite the fact that their largest model is 32M params, you can expect inference speeds to be higher than ours, which 23M params but it is transformer-based.
I am not sure about typical inference speeds on a CPU for their models, but with ours you can expect to do ~22 docs per second, and ~120 queries per second on a standard 2vCPU server.
As far as retrieval accuracy goes, on BEIR we score 53.55, all-MiniLM-L12-v2 (a widely adopted compact text embedding model) scores 42.69, while potion-8M scores 30.43.
I can't find their larger models but you can generally get an idea of the power level of different embedding models here: https://huggingface.co/spaces/mteb/leaderboard
If you want to run them on a CPU it may make sense to filter for smaller models (e.g., <100M params). On the other side our models achieve higher retrieval scores.
[0] "accuracy" in layman terms, not in accuracy vs recall terms. The correct word here would be "effectiveness".
For retrieval I load all the vectors from the SQlite database into a numpy.array and hand it to FAISS. Faiss-gpu was impressively fast on the RTX6000 and faiss-cpu is slower on the M1 Ultra but still fast enough for my purposes (I'm firing a few queries per day, not per minute). For 5 million chunks memory usage is around 40 GB which both fit into the A6000 and easily fits into the 128GB of the M1 Ultra. It works, I'm happy.
I can recommend https://github.com/tobi/qmd/ . It’s a simple CLI tool for searching in these kinds of files. My previous workflow was based on fzf, but this tool gives better results and enables even more fuzzy queries. I don’t use it for code, though.
I haven't used it extensively, but semantic grep alone was kind of worth it.
Shameless plug: https://github.com/jankovicsandras/plpgsql_bm25 BM25 search implemented in PL/pgSQL ( Unlicense / Public domain )
The repo includes also plpgsql_bm25rrf.sql : PL/pgSQL function for hybrid search ( plpgsql_bm25 + pgvector ) with Reciprocal Rank Fusion; and Jupyter notebook examples.
For all intents and purposes, running gpt-oss 20B in a while loop with access to ripgrep works pretty dang well. gpt-oss is a tool calling god compared to everything else i've tried, and fast.
You can use pgvector for the vector lookup and paradedb for bm25.
Can you elaborate? What makes the modern alternatives easier to operate? What makes Elasticsearch complicated?
Asking because in my experience, Elasticsearch is pretty simple to operate unless you have a huge cluster with nodes operating in different modes.
BM25/tf-idf and N grams have always been extremely difficult to beat baselines in information retrieval. This is why embeddings still have not led to a "ChatGPT" moment in information retrieval.
This took about one hour to set up and works very well.
(+) At least, I don't think this counts as RAG. I'm honestly a bit hazy on the definition. But there's no vectordb anyway.
After some time we noticed a semi-structured field in the prompt had a 100% match with the content needed to process the prompt.
Turns out operators started puting tags both in the input and the documents that needed to match on every use case (not much, about 50 docs).
Now we look for the field first and put the corresponding file in the prompt, then we look for matches in the database using the embedding.
85% of the time we don't need the vectordb.
Using Ollama for the embeddings with “nomic-embed-text”, with LanceDB for the vector database. Recently updated it to use “agentic” RAG, but probably not fully needed for a small project.
Mine is much more basic than yours and I just started it a couple of weeks ago.
For the curious RAG = Retrieval Augmented Generation. From wikipedia: RAG enables large language models (LLMs) to retrieve and incorporate new information from external data sources
I store file content blobs in SQLite, and use FTS5 (bm25) to maintain a fulltext index plus sqlite-vec for storing embeddings. Search uses both of these, and then reciprocal rank fusion gets the best results and pipes those to a local transformers model to judge. It’s all Python with mlx-lm and mlx-embeddings libraries, the models are grabbed from huggingface. It’s not the fastest, but it’s local and easy to understand (and for Claude to write, mostly).
https://github.com/rhobimd-oss/shebe
One area where BM25 particularly shines is the refactoring workflow: let's say you want to upgrade your istio installation from 1.28 to 1.29 and maybe in 1.29 the authorizationpolicy crd has a breaking change in one of it's properties. BM25 allows you to efficiently enumerate all code locations in your codebase that need to change and then you can set the cli coders off using this list. Grep and LSP can still perform this enumeration but they have shortcomings. Wrote about it here https://github.com/rhobimd-oss/shebe/blob/main/WHY_SHEBE.md#...
fwiw this does look interesting.
It uses PostgreSQL with pgvector, hybrid BM25, multi-query expansion, and reranking.
(It's the first time I share it publicly, so I am sure there'll be quirks.)
Studies generally show when you do agentic retrieval w/ text search, that's pretty good. Adding vector retrieval and graph rag, so the typical parallel multi-retrieval followed by reranking, gives a bit of speedup and quality lift. That lines up with my local flow experience, where it is only enough that I want that for $$$$ consumer/prosumer tools, and not easy enough for DIY that I want to invest in that locally. For those who struggle with tools like spotlight running when it shouldn't, that kind of thing turns me off on the cost/benefit side.
For code, I experiment with unsound tools (semgrep, ...) vs sound flow analyzers, carefully setup for the project. Basically, ai coders love to use grep/sed for global replace refactors and other global needs, but keeps tripped up on sound flow analysis. Similar to lint and type checking, that needs to be setup for a project and taught as a skill. I'm not happy with any of my experiments here yet however :(
Their discussion is super relevant to exactly what I wrote --
* They note speed benefits * The quality benefit they note is synonym search... which agentic text search can do: Agents can guess synonyms in the first shot for you, eg, `navigation` -> `nav|header|footer`, and they'll be iterating anyways
To truly do better, and not make the infra experience stink, it's real work. We do it on our product (louie.ai) and our service engagements, but real costs/benefits.
It uses LanceDB and has dozens of different extraction/embedding models to choose from. It even has evals for checking retrieval accuracy, including automatically generating the eval dataset.
You can use its UI, or call the RAG via MCP.
The real lightbulb moment is when you realise the ONLY thing a RAG passes to the LLM is a short string of search results with small chunks of text. This changes it from 'magic' to 'ahh, ok - I need better search results'. With small models you cannot pass a lot of search results ( TOP_K=5 is probably the limit ), otherwise the small models 'forget context'.
It is fun trying to get decent results - and it is a rabbithole, next step I am going into is pre-summarising files and folders.
I open sourced the code I was using - https://github.com/acutesoftware/lifepim-ai-core
It’s a CLI tool and MCP server for creating discrete, versioned “libraries” of RAG-able content.
Under the hood, it uses an embedding model locally. It chunks your content and stores embeddings in SQLite. The search functionality uses vector + keyword search + a re-ranking model.
You can also point it at any GitHub repo and it will create a RAG DB out of it.
You can also use the MCP server to create and query the libraries.
Demo: https://app.dwani.ai
GitHub: https://github.com/dwani-ai/discovery
Now working on added Agentic features, by continuous analysis of Document with Generated prompts.
On the retrieval side, I built a custom search/indexing layer (Node) specifically for service traceability and discovery. It uses a hybrid approach — embeddings + full-text search + IVF-HNSW — to index and cross-reference our APIs, services, proxies and orchestration repos. The RAG pipelines sit on top of this layer, which gives us reasonable recall and predictable latency.
Compliance and observability are still a problem. Every year new vendors show up promising audits, data lineage and observability, but none of them really handle the informational sprawl of ~600 distributed systems. The entropy keeps increasing.
Lately I’ve been experimenting with a more semantic/logical KAG approach on top of knowledge graphs to map business rules scattered across those systems. The goal is to answer higher-level questions about how things actually work — Palantir-like outcomes, but with explicit logic instead of magic.
Curious if others are moving beyond “pure RAG” toward graph-based or hybrid reasoning setups.
I'm positively surprised on how well it works, especially if you also connect it to an LLM.
This is specifically a “remembrance agent”, so it surfaces related atoms to what you’re writing rather than doing anything generative.
Extension: https://github.com/mmargenot/tezcat
Also available in community plugins.
If the total size of your data isn't loo large...?
Data being a plural gets me.
You might have small datums but a lot of kilobytes!
So I use hosted one to prevent this. My business use vector db, so created a new db to vectorize and host my knowledge base. 1. All my knowledge base is markdown files. So I split that by header tags. 2. The split is hashed and hash value is stored in SQLite 3. The hashed version is vectorized and pushed to cloud db. 4. When ever I make changes , I run a script which splits and checks hash, if it is changed the. I upsert the document. If not I don’t do anything. This helps me keep the store up to date
For search I have a cli query which searches and fetches from vector store.
https://aws.amazon.com/blogs/machine-learning/use-language-e...
The code for it is here: https://github.com/aws-samples/rss-aggregator-using-cohere-e...
The example link no longer works, as I no longer work at AWS.
ragtune explain "your query" --collection prod
Shows scores, sources, and diagnostics. Helps catch when your chunking
or embeddings are silently failing or you need numeric estimations to base your judgements on.Open source: https://github.com/metawake/ragtune
The real challenge wasn't model quality - it was the chunking strategy. Financial data is weirdly structured and breaking it into sensible chunks that preserve context took more iteration than expected. Eventually settled on treating each complete record as a chunk rather than doing sliding windows over raw text. The "obvious" approaches from tutorials didn't work well at all for structured tabular-ish data.
kb = Ragi(["./docs", "s3://bucket/data/*/*.pdf", "https://api.example.com/docs"])
answer = kb.ask("How do I deploy this?")
that's it! with https://pypi.org/project/piragi/
For local deployments, Qdrant supports storing embeddings in memory as well as in a local directory (similar to sqlite) - for larger deployments Qdrant supports running as a standalone service/sidecar and can be made available over the network.
Not sure how useful it is for what you need specifically: https://blog.yakkomajuri.com/blog/local-rag
save_memory, recall_memory, search
Save memory vectorizes a session, summarizes it, and stores it in SQLite. Recall memory takes vector or a previous tool run id and loads the full text output. Search takes a vector array or string array and searches through the graph using fuzzy matching and vector dot products.
It’s not fancy, but it works really well. gpt-oss
The newer “agent” search approach can just query a file system or api. It’s slightly slower but easier to setup and maintain as no extra infrastructure.
it all moves so fast, i wouldnt be surprised if everything i made is now crazy outdated and it was probably like 2 months ago.
To answer the question more directly, I've spent the last couple of years with a few different quant models mostly running on llama.cpp and ollama, depending. The results are way slower than the paid token api versions, but they are completely free of external influence and cost.
However the models I've tests generally turn out to be pretty dumb at the quant level I'm running to be relatively fast. And their code generation capabilities are just a mess not to be dealt with.
Works well, but I didn't tested on larger scale
The problems with datasheets is tables which span multiple pages, embedded images for diagrams and plots, they're generally PDFs, and only sometimes are they 2-column layout.
Converting from PDF to markdown while retaining tables correctly seems to work well for me with Mistral's latest OCR model, but this isn't an open model. Using docling with different models has produced much worse results.
I’ve optimized https://markdownconverter.pro/pdf-to-markdown to handle complex PDFs, including those tricky tables that span multiple pages and 2-column formats that usually trip up tools like Docling. It also extracts embedded diagrams/images and links them properly in the output.
Full disclosure: I'm the developer behind it. I’d love to see if it handles your specific datasheets better than the models you've tried. Feel free to give it a spin!
Question being: WHY would I be doing RAG locally?
TL;DR: - chunk files, index chunks - vector/hybrid search over the index - node app to handle requests (was the quickest to implement, LLMs understand OpenAPI well)
I wrote about it here: https://laurentcazanove.com/blog/obsidian-rag-api
Also I've got no idea what this product does, this is just a generic page of topical ai buzzwords
Don't tell me what it is, /show me why/ you built it. Then go back and keep that reasoning in, show me why I should care