Ask HN: How are you extracting the best performance out of your RAG pipeline?
I've been working on various RAG (Retrieval Augmented Generation) projects, and I'm curious whether you all are seeing any generalizable patterns in building the most performant RAG pipeline for a given dataset. For example: is it even possible to say that in “most” cases, the best retriever setup is going to be a combination of semantic search (embeddings) + keyword search (BM25) + some xyz technique?
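
To make the "semantic + keyword" combination concrete, here is a minimal sketch of one common way to merge two retrievers' results: reciprocal rank fusion (RRF). This is only an illustration, not a claim that RRF is the best choice. The document ids, the two hit lists, and the constant k=60 are assumptions made up for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first.

    Each document scores 1 / (k + rank) per list it appears in;
    k=60 is a commonly used default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by doc id for determinism
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of two retrievers for the same query
semantic_hits = ["doc3", "doc1", "doc7"]  # e.g. from an embedding index
bm25_hits = ["doc1", "doc9", "doc3"]      # e.g. from a BM25 index

fused = reciprocal_rank_fusion([semantic_hits, bm25_hits])
print(fused)
```

Documents that both retrievers rank highly float to the top, and no score normalization between the two retrievers is needed because only ranks are used.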

My hypothesis is that there’s no one-size-fits-all RAG design: every dataset is unique, every use case is nuanced, and each therefore requires a uniquely optimized RAG pipeline. And it’s practically impossible to find the optimal RAG setup for your dataset by manual trial and error, because the number of configurations grows exponentially with the number of parameters. For example, if you could choose from 5 different chunking strategies, 5 different chunk sizes, 5 different embedding models, 5 different retrievers, 5 different re-rankers, 5 different prompts, and 5 different LLM settings, that’s 5^7 = 78125 different RAG configurations, which is practically impossible to try out exhaustively.
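
As a quick sanity check on that arithmetic, here is a small sketch that builds the 7-parameter grid and counts it. The option labels are placeholders, and the 2 minutes per evaluation is purely an assumed figure to show the scale.

```python
from itertools import product
from math import prod

# The 7 parameters named above, each with 5 placeholder options
param_names = [
    "chunking_strategy", "chunk_size", "embedding_model",
    "retriever", "re_ranker", "prompt", "llm_setting",
]
search_space = {
    name: [f"{name}_{i}" for i in range(5)] for name in param_names
}

# Size of the full grid: 5 * 5 * ... * 5 = 5^7
total = prod(len(options) for options in search_space.values())
print(total)

# Configurations can be enumerated lazily without storing all of them
configs = product(*search_space.values())
first_config = dict(zip(search_space, next(configs)))

# Assumed cost: 2 minutes per full evaluation of one configuration
minutes_per_eval = 2
days = total * minutes_per_eval / (60 * 24)
print(f"{days:.1f} days of sequential evaluation")
```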

I’d love to hear from people who are working extensively on RAG-based use cases whether my hypothesis above is flawed, and if so, what your approach to building an optimal RAG pipeline has been, and how much time & effort it has been taking you.

I’m asking because I’m working on a project [0] that performs automatic hyperparameter optimization on the various RAG parameters. You basically just bring your dataset, and RAGBuilder will evaluate multiple configurations and help you identify the best chunking strategy, the best combination of retrievers to use, etc. for your dataset.

[0]: https://github.com/KruxAI/ragbuilder
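
To illustrate what "evaluate multiple configurations" can look like in its simplest form, here is a sketch of plain random search over a small configuration space. It is not RAGBuilder's implementation or API. The parameter names, the option values, and especially `toy_evaluate` are hypothetical stand-ins; in practice the scorer would build the pipeline and measure answer quality on an evaluation set.

```python
import random


def random_search(space, evaluate, n_trials, seed=0):
    """Sample n_trials random configs and return the best one with its score."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Pick one option per parameter, uniformly at random
        cfg = {name: rng.choice(options) for name, options in space.items()}
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


# Hypothetical search space (names and values are illustrative only)
space = {
    "chunk_size": [256, 512, 1024, 2048],
    "top_k": [3, 5, 10],
    "retriever": ["semantic", "bm25", "hybrid"],
}


def toy_evaluate(cfg):
    # Made-up scoring function so the sketch runs end to end.
    # It encodes no real finding about which settings work best.
    score = 1.0 if cfg["retriever"] == "hybrid" else 0.5
    score -= abs(cfg["chunk_size"] - 512) / 2048
    score += 0.1 if cfg["top_k"] == 5 else 0.0
    return score


best_cfg, best_score = random_search(space, toy_evaluate, n_trials=20)
print(best_cfg, best_score)
```

Random search only covers a sample of the space, so it trades the guarantee of finding the best configuration for a fixed evaluation budget.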

We got good enough performance for semantic search just with faiss.

example similar comments to your submission text: https://hn.garglet.com/similar/comment/41727287

example random query: https://hn.garglet.com/form/textSearch?input6=We+got+good+en...

example, similar users to me: https://hn.garglet.com/similar/users/naveen99

@naveen99 That's awesome! But I'm curious how you went about data chunking, embedding, etc. Did you choose them ad hoc, apply some trial-and-error approach, or something else?
Yes, generated embeddings using an LLM… no chunking. Tried a few models. CLIP and ImageBind didn’t work great for text, so went with a text-focused one.
By chance, have you tried preprocess.co for text extraction + chunking?