Full text search or even grep/rg are a lot faster and cheaper to work with - no need to maintain a vector database index - and turn out to work really well if you put them in some kind of agentic tool loop.
The big benefit of semantic search was that it could handle fuzzy searching - returning results that mention dogs if someone searches for canines, for example.
Give a good LLM a search tool and it can come up with searches like "dog OR canine" on its own - and refine those queries over multiple rounds of searches.
Plus it means you don't have to solve the chunking problem!
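A minimal sketch of the idea (the corpus, the tool name, and the simple `OR` query syntax are all illustrative, not from any particular product): give the model a plain keyword-search tool and let it broaden the query itself.

```python
import re

# Toy corpus standing in for the documents the agent can search.
DOCS = {
    "a.txt": "My dog loves the park.",
    "b.txt": "Canine behavior varies by breed.",
    "c.txt": "The weather was sunny all week.",
}

def keyword_search(query: str) -> list[str]:
    """Exact keyword search with a simple OR operator.

    The LLM in the agentic loop is expected to issue queries like
    "dog OR canine" and refine them over multiple rounds based on
    what comes back.
    """
    terms = [t.strip().lower() for t in query.split(" OR ")]
    hits = []
    for name, text in DOCS.items():
        words = set(re.findall(r"\w+", text.lower()))
        if any(term in words for term in terms):
            hits.append(name)
    return hits

print(keyword_search("dog"))            # only the document that literally says "dog"
print(keyword_search("dog OR canine"))  # the broadened query finds both
```

The fuzziness lives in the model's query rewriting, not in the index, which is why a dumb exact-match tool is enough.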
http://search-sensei.s3-website-us-east-1.amazonaws.com/
(warning! It will download ~50MB of data for the model weights and ONNX runtime on first load, but should otherwise run smoothly even on a phone)
It runs a small embedding model in the browser and returns search results in "real time".
It has a few illustrative examples where semantic search returns the intended results. For example, BM25 does not understand that "j lo" or "jlo" refer to Jennifer Lopez. Similarly, embedding-based methods can better handle things like typos.
EDIT: search is performed over 1000 news articles randomly sampled from 2016 to 2024
Anthropic found embeddings + BM25 (keyword search) gave the best results. (Well, after contextual summarization, and fusion, and reranking, and shoving the whole thing into an LLM...)
But sadly they didn't say how BM25 did on its own, which is the really interesting part to me.
In my own (small scale) tests with embeddings, I found that I'd be looking right at the page that contained the literal words in my query and embeddings would fail to find it... Ctrl+F wins again!
For most cases though sticking with BM25 is likely to be "good enough" and a whole lot cheaper to build and run.
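The fusion step in that hybrid setup is often done with reciprocal rank fusion, a standard technique for merging ranked lists (Anthropic's article may weight things differently; this is just the common variant):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists into one.

    Each document scores 1 / (k + rank) in every list it appears in;
    k=60 is the value from the original RRF paper and damps the
    influence of any single ranker.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["doc3", "doc1", "doc7"]   # keyword ranking
embed_results = ["doc1", "doc9", "doc3"]  # semantic ranking
print(reciprocal_rank_fusion([bm25_results, embed_results]))
```

Documents that both rankers like (here doc1 and doc3) float to the top, without needing to reconcile the two engines' incomparable raw scores.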
Embeddings just aren't the most interesting thing here if you're running a frontier foundation model.
Besides, if you could reliably verify results, you've essentially built an RL harness, which is a lot harder to do than building an effective search system and probably worth more.
In general RAG != Vector Search though. If a SQL query, grep, full-text search or something else does the job, then by all means use it. But for relevance-based search, vector search shines.
Unless I've misunderstood your post and you are already doing some form of this in your pipeline, you should see a dramatic improvement in performance once you implement this.
Back in 2023 when I compared semantic search to lexical search (tantivy; BM25), I found the search results to be marginally different.
Even if semantic search has slightly more recall, does the problem of context warrant this multi-component, homebrew search engine approach?
By what important measure does it outperform a lexical search engine? Is the engineering time worth it?
It's very dependent on the use case, IMO.
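"By what measure" can be answered cheaply: label a handful of queries with the documents that should come back, and compare recall@k between the two engines. A minimal sketch (the result lists and relevance labels here are hypothetical placeholders):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

# Hypothetical labeled query: documents "a" and "b" are the right answers.
relevant = {"a", "b"}
lexical_results = ["a", "x", "b", "y"]   # e.g. BM25 output
semantic_results = ["b", "a", "z", "x"]  # e.g. embedding output

print(recall_at_k(lexical_results, relevant, 2))   # 0.5
print(recall_at_k(semantic_results, relevant, 2))  # 1.0
```

Even a few dozen labeled queries like this will tell you whether the semantic engine's extra recall is real for your data, and whether it justifies the extra machinery.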
Huh, interesting. I might be building a German-language RAG at some point in my future and I never even considered that some models might not support German at all. Does anyone have any experience here? Do many models underperform or not support non-English languages?
Yes they do. However:
1. German is one of the more common languages, so more models will support it than, say, Bahasa
2. There should still be a reasonable amount of multi-lingual models available. Particularly if you're OK with using proprietary models via API. AFAIK all the frontier embedding and reranking models (non open-source) are multi-lingual
Check under the "Retrieval" section, either RTEB Multilingual or RTEB German (under language specific).
You may also want to filter for model sizes (under "Advanced Model Filters"). For instance if you are self-hosting and running on a CPU it may make sense to limit to something like <=100M parameters models.
I think it was like a dollar per search or something in those days. We've come a long way!
Anthropic, in their RAG article, actually say that if your thing fits in context, you should probably just put it there instead of using RAG.
I don't know where the optimal cutoff is though, since quality does suffer with long contexts. (Not to mention price and speed.)
https://www.anthropic.com/engineering/contextual-retrieval
The context size and pricing has come so far! Now the whole book fits in context, and it's like 1 cent to put the whole thing in context.
(Well, a little more with Anthropic's models ;)
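Back-of-the-envelope, with assumed numbers (a typical book is very roughly 100k tokens, and the per-token price here is an illustrative placeholder, not any vendor's actual rate):

```python
# Rough cost of stuffing a whole book into the context window.
book_tokens = 100_000            # assumed: ~80k words at ~1.3 tokens per word
price_per_million_input = 0.10   # assumed: $0.10 per 1M input tokens

cost = book_tokens / 1_000_000 * price_per_million_input
print(f"${cost:.3f} per request")  # $0.010, i.e. about one cent
```

Note this is per request: long-context gets expensive fast if you re-send the book on every query, which is one argument for retrieval even when everything fits.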
There are some patterns to help, such as RAPTOR, where you make ingestion content-aware: instead of just ingesting content, you use LLMs to question and summarise the content and save that to the vector database.
But the reality is that a one-size-fits-all approach to RAG is not an easy task.
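A crude sketch of that ingestion shape (the `summarize` function is a stand-in for a real LLM call; RAPTOR proper builds a recursive tree of summaries, while this shows just one level):

```python
def summarize(texts: list[str]) -> str:
    """Placeholder for an LLM summarization call.

    Here it just keeps each chunk's first sentence so the
    example runs without a model.
    """
    return " / ".join(t.split(".")[0] for t in texts)

def raptor_style_ingest(chunks: list[str], group_size: int = 2) -> list[dict]:
    """Index the raw chunks plus LLM summaries of groups of chunks,
    so queries can match fine-grained or high-level entries.
    Each entry would then be embedded and written to the vector DB."""
    entries = [{"level": 0, "text": c} for c in chunks]
    for i in range(0, len(chunks), group_size):
        group = chunks[i:i + group_size]
        entries.append({"level": 1, "text": summarize(group)})
    return entries

chunks = ["Dogs are mammals. They bark.", "Cats are mammals. They meow."]
for entry in raptor_style_ingest(chunks):
    print(entry["level"], entry["text"])
```

The point is that the index contains more than the raw chunks, which helps broad questions that no single chunk answers; the cost is extra LLM calls at ingestion time.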
Do vector databases do better with long grouped text vs table formats?
What are my options?
I want to avoid building my own or customising a lot. Ideally it would also recommend which models work well and have good defaults for those.
For what it's worth, I run a local-first, small-model, private RAG that uses LangGraph and Neo4J knowledge graphs, and I swap the models around constantly. It mostly just gets called by agent tools now.
Why?
- developer oriented (easy to read Python and uses pydantic-ai)
- benchmarks available
- docling with advanced citations (on branch)
- supports deep research agent
- real open source from a long-term committed developer, not fly-by-night
But I'm lazy and assumed that someone has already built such a thing. I'm just not aware of this "Wikipedia-RAG-in-a-box".
Give Jan (https://www.jan.ai/) a try for instance. You'll need to do a bit of research as to what model will give you the best perf on your system but one of the quantized Llama or Qwen models will probably suit you well.
Similarly, I used sqlite-vec, and was very happy with it. (if I were already using postgres I'd have gone with that, but this was more of a cli tool).
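For small corpora the same pattern works with nothing but the standard library: store embeddings as blobs in SQLite and brute-force the similarity scan in Python. This is a sketch of the shape, not sqlite-vec's actual API (which pushes the scan into a virtual table), and the 3-d vectors are toy stand-ins for real model embeddings:

```python
import math
import sqlite3
import struct

def pack(vec: list[float]) -> bytes:
    """Serialize a float vector to a compact binary blob."""
    return struct.pack(f"{len(vec)}f", *vec)

def unpack(blob: bytes) -> list[float]:
    return list(struct.unpack(f"{len(blob) // 4}f", blob))

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id TEXT, embedding BLOB)")

docs = {"dogs": [0.9, 0.1, 0.0], "cats": [0.8, 0.2, 0.1], "stocks": [0.0, 0.1, 0.9]}
db.executemany("INSERT INTO docs VALUES (?, ?)",
               [(k, pack(v)) for k, v in docs.items()])

def search(query_vec: list[float], top_k: int = 2) -> list[str]:
    """Brute-force nearest-neighbour scan over every stored embedding."""
    rows = db.execute("SELECT id, embedding FROM docs").fetchall()
    ranked = sorted(rows, key=lambda r: cosine(query_vec, unpack(r[1])), reverse=True)
    return [r[0] for r in ranked[:top_k]]

print(search([1.0, 0.0, 0.0]))  # the pet documents rank above "stocks"
```

A linear scan like this is fine up to tens of thousands of vectors; beyond that is where sqlite-vec or pgvector start to earn their keep.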
If the author is here, did you try any of those models? How would you compare the ones you did use?
Even starting with having "just" the documents and vector db locally is a huge first step and much more doable than going with a local LLM at the same time. I don't know anyone or any org that has the resources to run their own LLM at scale.
So for sure any medium-sized company could afford to run their own LLMs, also at scale if they want to make the investment. The question is how much they value their confidential data. (I would not trust any of the big AI companies.) And you don't usually need cutting-edge reasoning and coding abilities to process basic information.
Would you want a Rust SDK for Skald?
I just put this example together today: https://gist.github.com/davidmezzetti/d2854ed82f2d0665ec7efd...