PageIndex takes a different approach to RAG. Instead of relying on vector databases or artificial chunking, it builds a hierarchical tree structure from documents and uses reasoning-based tree search to locate the most relevant sections. This mirrors how humans approach reading: navigating through sections and context rather than matching embeddings.
As a result, the retrieval feels transparent, structured, and explainable. It moves RAG away from approximate "semantic vibes" and toward explicit reasoning about where information lives. That clarity can help teams trust outputs and debug workflows more effectively.
The broader implication is that retrieval doesn't need to scale endlessly in vectors to be powerful. By leaning on document structure and reasoning, it reminds us that efficiency and human-like logic can be just as transformative as raw horsepower.
How is this not precisely "vibe retrieval" and much more approximate, where approximate in this case is uncertainty over the precise reasoning?
Similarity with conversion to high-dimensional vectors and then something like kNN seems significantly less approximate, less "vibe" based, than this.
This also appears to be completely predicated on pre-enrichment of the documents by adding structure through API calls to, in the example, openAI.
It doesn't at all seem accurate to:
1: Toss out mathematical similarity calculations
2: Add structure with LLMs
3: Use LLMs to traverse the structure
4: Label this as less vibe-ish
Also for any sufficiently large set of documents, or granularity on smaller sets of documents, scaling will become problematic as the doc structure approaches the context limit of the LLM doing the retrieval.
Embeddings are great at basic conceptual similarity, but in quality maximalist fields and use cases they fall apart very quickly.
For example:
"I want you to find inconsistencies across N documents." There is no concept of an inconsistency in an embedding. However, a textual summary or context stuffing entire documents can help with this.
"What was John's opinion on the European economy in 2025?" It will find a similarity to things involving the European economy, including lots of docs from 2024, 2023, etc. And because of chunking strategies with embeddings and embeddings being heavily compressed representations of data, you will absolutely get chunks from various documents that are not limited to 2025.
"Where are Sarah or John directly quoted in this folder full of legal documents?" Sarah and John might be referenced across many documents, but finding where they are directly quoted is nearly impossible even in a high dimensional vector.
Embeddings are awesome, and great for some things like product catalog lookups and other fun stuff, but for many industries the mathematical cosine similarity approach is just not effective.
This makes a lot of sense if you think about it. You want something as conceptually similar to the correct answer as possible. But with vector search, you are looking for something conceptually similar to some formulation of the question, which has some loose correlation, but is very much not the same thing.
There are ways you can prepare data to try to get a closer approximation (e.g. you can have an LLM formulate, for each indexed block, questions that it could answer and index those; then you'll be searching for material that answers a question similar to the one being asked, which is a bit closer to what you want), but it's still an approximation.
But if you ahead of time know from experience salient features of the dataset that are useful for the particular application, and can index those directly, it just makes sense that while this will be more labor intensive than generalized vector search and may generalize less well outside of that particular use case, it will also be more useful in the intended use case in many places.
> scaling will become problematic as the doc structure approaches the context limit of the LLM doing the retrieval
IIUC, retrieval is based on traversing a tree structure, so only the root nodes have to fit in the context window. I find that kinda cool about this approach. But yes, still "vibe retrieval".
That was my immediate take. [Look at the summary and answer based on where you expect the data to be found] maybe works well for reliably structured data.
I might have misunderstood of course.
If so, then the use cases for this would be fairly limited since you'd have to deal with lots of latency and costs. In some cases (legal documents, medical records, etc) it might be worth it though.
An interesting alternative I've been meaning to try out is inverting this flow. Instead of using an LLM at search time to find the pieces relevant to the query, you flip it around: at ingestion time you let an LLM note all of the possible questions that a given text can answer and store those in an index. You could then use some traditional full-text search or other algorithms (BM25?) to search for relevant documents and pieces of text. You could even go for a hybrid approach with vectors on top or next to this. Maybe vectors first and then more ranking with something more traditional.
What appeals to me with that setup is low latency and good debug-ability of the results.
But as I said, maybe I've misunderstood the linked approach.
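A rough sketch of that inverted flow, under stated assumptions: the questions in `QUESTION_INDEX` are hand-written placeholders for what an LLM would generate at ingestion, the chunk IDs are made up, and `BM25` is a small textbook Okapi BM25 written inline rather than any particular library's API.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


class BM25:
    """Minimal Okapi BM25 over a fixed list of tokenized documents."""

    def __init__(self, docs: List[List[str]], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.tfs = [Counter(d) for d in docs]
        self.lengths = [len(d) for d in docs]
        self.avgdl = sum(self.lengths) / len(docs)
        df = Counter(term for d in docs for term in set(d))
        n = len(docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query: List[str], i: int) -> float:
        tf = self.tfs[i]
        norm = self.k1 * (1 - self.b + self.b * self.lengths[i] / self.avgdl)
        return sum(self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm) for t in query if t in tf)


# Hypothetical output of the ingestion step: (generated question, chunk it came from).
QUESTION_INDEX = [
    ("how does the app store chat conversations on disk", "storage.md"),
    ("which file format is used for saved conversations", "storage.md"),
    ("how do i configure the api key", "config.md"),
    ("what retry policy is used for failed requests", "network.md"),
]

bm25 = BM25([tokenize(q) for q, _ in QUESTION_INDEX])


def search(query: str, top_k: int = 3) -> List[Tuple[str, float]]:
    """Score every stored question, then keep the best score per chunk."""
    q = tokenize(query)
    best: Dict[str, float] = {}
    for i, (_, chunk_id) in enumerate(QUESTION_INDEX):
        s = bm25.score(q, i)
        if s > 0:
            best[chunk_id] = max(best.get(chunk_id, 0.0), s)
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


print(search("where are conversations stored"))
```

Every hit can be traced back to the exact stored question and the terms that matched, which is where the debug-ability comes from. The weak spot is the one raised elsewhere in the thread: the index only covers the questions the LLM thought to write down.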
You may already know of this one, but consider giving Google LangExtract a look. A lot of companies are doing what you described in production, too!
This is what I am doing with my AI Search Assistant feature, which I discuss in more detail via the link below:
https://github.com/gitsense/chat/blob/main/packages/chat/wid...
By default, I provide what I call a "Tiny Overview Analyzer". You can read the prompt for the Analyzer with the link below:
https://github.com/gitsense/chat/blob/main/packages/chat/wid...
In a nutshell, it generates a very short summary of every document along with keywords. The basic idea is to use BM25 ranking to identify the most relevant documents for the AI to review. For example, my use case is to understand how Aider, Claude Code, etc., store their conversations so that I can make them readable in my chat app. To answer this, I would ask 'How does Aider store conversations?' and the LLM would construct a deterministic keyword search using terms that would most likely identify how conversations are stored.
Once I have the list of files, the LLM is asked again to review the summaries of all matches and suggest which documents should be loaded in full for further review. I've found this approach to be inconsistent, however. What I've found to work much better is just loading the "Tiny Overview" summaries into context and chatting with the LLM. For example, I would ask the same question: "Which files do you think can tell me how Aider stores conversations? Identify up to 20 files and create a context bundle for them so I can load them into context." For a thousand files, you can easily fit three-sentence summaries for each of them without overwhelming the LLM. Once I have my answer, I just need a few clicks to load the files into context, and then the LLM will have full access to the file content and can better answer my question.
It’s really hard to get to such a place with standard vector-based systems, even GraphRag. Because it relies on summaries of topic clusters that are pre-computed, if one of those summaries is inaccurate or none of the summaries deal with your exact question, that will never change during query processing. Moreover, GraphRag preprocessing is insanely expensive and precisely does not scale linearly with your dataset.
TLDR all the trade-offs in RAG system design are still being explored, but in practice I’ve found the main desired property to be “predictably better answer with predictably scaling cost” and I can see how similar concerns got OP to this design.
Sounds interesting. What exactly is the expensive computation?
On a separate note: I have a feeling RAG could benefit from a kind of "simultaneous vector search" across several different embedding spaces, sort of like AND in an SQL database. Do you agree?
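One way to read the AND idea, as a sketch only: a document qualifies when its similarity clears a threshold in every embedding space, and qualifying documents are ranked by their weakest space. The space names, vectors and threshold below are toy assumptions, not the output of any real embedding model.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def and_search(query_vecs: Dict[str, List[float]],
               docs: Dict[str, Dict[str, List[float]]],
               threshold: float = 0.8) -> List[str]:
    """Keep documents that are similar to the query in ALL spaces; rank by the worst space."""
    hits = []
    for doc_id, spaces in docs.items():
        sims = [cosine(query_vecs[name], spaces[name]) for name in query_vecs]
        if min(sims) >= threshold:
            hits.append((min(sims), doc_id))
    return [doc_id for _, doc_id in sorted(hits, reverse=True)]


# Toy data: one vector per document per (hypothetical) embedding space.
docs = {
    "a": {"topic": [1.0, 0.0], "author": [1.0, 0.0]},
    "b": {"topic": [1.0, 0.0], "author": [0.0, 1.0]},
    "c": {"topic": [0.0, 1.0], "author": [1.0, 0.0]},
}
query = {"topic": [1.0, 0.0], "author": [1.0, 0.0]}

matches = and_search(query, docs)
print(matches)
```

This brute-force version scans every document. Doing the same thing on top of ANN indexes would mean intersecting per-space shortlists, which needs large enough shortlists that the intersection is not empty.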
Through more thorough ANN vector search / higher recall, or would it also require different preprocessing?
My motivation back then: I had an 8k context length to work with, so I had to be very conservative about what I included. I still used vectors to narrow down the entry points and then used an LLM to drill down or pick the most relevant ones. The search threads were separate: each would summarize its response based on the tree path it took, and then the main thread would combine them.
What does this even mean? At what point do you know you have all of them?
Humans are quite ingenious coming up with new, unique questions in my observation, whereas LLMs have a hard time replicating those efficiently.
I think for most use cases, it doesn't make much sense to use vector DBs. When I started to design my AI Search feature, I researched chunking a lot and the general consensus was that you can lose context if you don't chunk in the right way, and there wasn't really a right way to chunk. This was why I decided to take the approach that I am using today, which I talk about in another comment.
With input costs for very good models as low as $0.30/1M tokens for Gemini 2.5 Flash (bulk rates would be $0.15/1M), feeding the LLM thousands of documents to generate summaries would probably cost 5 dollars or less at bulk-rate pricing. With input costs that low, and with most SOTA LLMs able to handle 50k tokens in the context window with no apparent loss in reasoning, I really don't see the reason for vector DBs anymore, especially if it means potentially less accurate results.
I can't remember which post I read this in (it was on Hacker News), but I read that when designing Claude Code, they (Anthropic) tried a RAG approach and it didn't work very well compared to loading in the full file. If my understanding of how Claude Code works is correct (this is based on comments from others), it "greps like an intern/junior developer". So what Claude Code does (provided grep is the key) is ask Sonnet for keywords to grep for based on the user's query, and then continuously revise the grep keywords until it is satisfied with the files it has found.
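As a back-of-the-envelope check on that estimate: the per-token prices are the ones quoted above, while the corpus size (5,000 documents averaging 5,000 input tokens each) is purely an assumption for illustration. Output tokens for the generated summaries are billed separately and are not counted here.

```python
def summarization_input_cost(num_docs: int, avg_tokens_per_doc: int, usd_per_million_tokens: float) -> float:
    """Input-token cost of sending every document through the model once."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Assumed corpus: 5,000 documents x 5,000 tokens = 25M input tokens.
batch = summarization_input_cost(5_000, 5_000, 0.15)     # bulk rate quoted above
standard = summarization_input_cost(5_000, 5_000, 0.30)  # standard rate quoted above
print(f"batch: ${batch:.2f}, standard: ${standard:.2f}")
```

Under these assumptions the one-off indexing pass lands under 5 dollars at the bulk rate; the cost scales linearly with corpus size and average document length.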
As ridiculous as this sounds, this approach is not horrible, albeit very inefficient. For my approach, I focus on capturing intent which is what grep can't match. And for RAG, if the code is not chunked correctly and/or if the code is just badly organized, you may miss the true intent for the code.
My feeling is that what you're getting at is actually the fact that it's hard to get semantic chunks and when embedding them, it's hard to have those chunks retain context/meaning, and then when retrieving, the cosine similarity of query/document is too vibes-y and not strictly logical.
These are all extremely real problems with the current paradigm of vector search. However, my belief is that one can fix each of these problems vs abandoning the fundamental technology. I think that we've only seen the first generation of vector search technology and there is a lot more to be built.
At Vectorsmith, we have some novel takes on both the computation and storage architecture for vector search. We have been working on this for the last 6 months and have seen some very promising results.
Fundamentally my belief is that the system is smarter when it mostly stays latent. All the steps of discretization that are implied in a search system like the above lose information in a way that likely hampers retrieval.
Yeah, exactly.
>And embeddings are also too lossy (in terms of losing context and structure)
Interestingly, it appears that the problem is not embeddings but rather retrieval. It appears that embeddings can contain a lot more information than we're currently able to pull out. Like, obviously they are lossy, but... less than maybe I thought before I started this project? Or at least can be made to be that way?
> But you guys are working on something less lossy for both semantics and context?
Yes! :) We're getting there! It's currently at the good-but-not-great like GPT-2ish kind of stage. It's a model-toddler - it can't get a job yet, but it's already doing pretty interesting stuff (i.e. it does much better than SOTA on some complex tasks). I feel pretty optimistic that we're going to be able to get it to work at a usable commercial level for at least some verticals — maybe at an alpha/design partner level — before the end of the year. We'll definitely launch the semantic part before the context part, so this probably means things like people search etc. first — and then the contextual chunking for big docs for legal etc... ideally sometime next year?
It was for a complex scenario of QA on long documents, like 200 page earning reports.
Instead of using embeddings, which are easy to make and cheap to compare, you use summarized sections of documents and process them with an LLM? LLMs are slower and more expensive to run.
Either way, your input data structure could produce bad summaries that cause the LLM to miss.
Wasn't this a feature of RAGs, though? That they could match semantics instead of structure, while us mere balls of flesh need to rely on indexes. I'd be interested in benchmarks of this versus traditional vector-based RAGs, is something to that effect planned?
Embedding based RAG is fast and conceptually accurate, but very poor for high complexity tasks. Agentic RAG is higher quality, but much higher compute and latency cost. But often worth it for complex situations.
The original documents are in HTML format and although I don't have access to them I can obtain them if I want. Is it better to just use these HTML documents instead? Previously I tried converting HTML to markdown and then use these for RAG. I wasn't too happy with the result although I fear I might be doing something wrong.
If you have scanned documents, last I checked Gemini Flash was very good cost/performance wise for document extraction. Mistral OCR claims better performance in their benchmarks but people I know used it and other benchmarks beg to differ. Personally I use Azure Document Intelligence a lot for the bounding boxes feature, but Gemini Flash apparently has this covered too.
https://getomni.ai/blog/ocr-benchmark
Sidenote: What you want for RAG is not OCR as in extracting text. The task for RAG preprocessing is typically called Document Layout Analysis or End-to-End Document Parsing/Extraction.
Good RAG is multimodal and semantic document structure and layout-aware so your pipeline needs to extract and recognize text sections, footers/headers, images, and tables. When working with PDFs you want accurate bounding boxes in your metadata for referring your users to retrieved sources etc.
Got it. Indeed, I need to do End-to-End Document Parsing/Extraction.
If it's an image / you need to OCR it, Gemini Flash is so good and so cheap that I've had good luck using it as a "meta OCR" tool
I have used Gemini for OCR and it was indeed good. I also used GPT 3.5 and liked that too.
I did some measurements and found you can't even really tell if two documents are "similar" or not. Here: https://joecooper.me/blog/redundancy/
One common way is to mix approaches. e.g. take a large top-K from ANN on embeddings as a preliminary shortlist, then run a tuned LLM or cross encoder to evaluate relevance.
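A small sketch of that two-stage pattern, with loud caveats: the vectors are hand-made toys, `shortlist` is an exact brute-force top-K standing in for an ANN index, and `overlap_score` is a crude word-overlap stand-in for a cross encoder or tuned LLM. All names are hypothetical.

```python
import math
import re
from typing import Callable, List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def shortlist(query_vec: List[float], doc_vecs: List[List[float]], k: int) -> List[int]:
    """Stage 1: cheap vector similarity picks a generous top-K (brute force here, ANN in practice)."""
    ranked = sorted(range(len(doc_vecs)), key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
    return ranked[:k]


def overlap_score(query: str, doc: str) -> float:
    """Stand-in for the expensive relevance model: count shared words."""
    words = lambda s: set(re.findall(r"\w+", s.lower()))
    return float(len(words(query) & words(doc)))


def rerank(query: str, docs: List[str], candidates: List[int],
           score: Callable[[str, str], float], top_n: int) -> List[int]:
    """Stage 2: the slower scorer only sees the shortlist."""
    return sorted(candidates, key=lambda i: score(query, docs[i]), reverse=True)[:top_n]


docs = [
    "European economy outlook for 2025",
    "European economy review of 2023",
    "Recipe for sourdough bread",
]
# Toy embeddings: the 2023 document happens to sit slightly closer to the query.
doc_vecs = [[0.88, 0.12], [0.90, 0.10], [0.05, 0.95]]
query, query_vec = "European economy 2025", [1.0, 0.0]

candidates = shortlist(query_vec, doc_vecs, k=2)
final = rerank(query, docs, candidates, overlap_score, top_n=1)
print(candidates, final)
```

In this toy setup the vector stage ranks the 2023 document first, and the second stage corrects it because it looks at the actual text. The expensive scorer runs on K items instead of the whole corpus, which is the point of the shortlist.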
I'll link here these guys' paper which you might find fun: https://arxiv.org/pdf/2310.08319
At the end of the day you just want a way to shortlist and focus information that's cheaper, computationally, and more reliable, than dumping your entire corpus into a very large context window.
So what we're doing is fitting the technique to the situation. Price of RAM; GPU price; size of dataset; etc. The "ideal" setup will evolve as the cost structure and model quality evolves, and will always depend on your activity.
But for sure, ANN-on-embedding as your RAG pipeline is a very blunt instrument and if you can afford to do better you can usually think of a way.
PageIndex does not state to what degree the semantic structuring is rule-based (document structure) or also inferred by an ML model. In any case, structuring chunks using semantic document structure is nothing new and pretty common, as is adding generated titles and summaries to the chunk nodes. But I find it dubious that prompt-based retrieval on structured chunk metadata works robustly, and if it does perform well it is because of the extra work in prompt-engineering done on chunk metadata generation and retrieval. This introduces two LLM-based components that can lead to highly variable output versus a traditional vector chunker and retriever. There are many more knobs to tune in a text prompt and an LLM-based chunker than in a sentence/paragraph chunker and a vector+text similarity hybrid retriever.
You will have to test retrieval and generation performance for your application regardless, but with so many LLM-based components this will lead to increased iteration time and cost vs. embeddings. The advantage of PageIndex is that you can probably make it really domain-specific. Claims of improved retrieval time are dubious: vector databases (even with hybrid search) are highly efficient, definitely more efficient than prompting an LLM to select relevant nodes.
1. https://pageindex.ai/blog/Mafin2.5 2. https://github.com/VectifyAI/Mafin2.5-FinanceBench
I wonder how this "vectorless" engine would deal with this. Simply put, I can't see this tech being scalable.
I think the technology is promising but I don't believe in all those "advantages" that they advertise on the website.
I like your approach because it seems like a very natural search process, like a human would navigate a website to find information. I imagine the tradeoff is performance of both indexing and search, but for some use cases (like mine) it’s a good sacrifice to make.
I wonder if it’s useful to merge the two approaches. Like you could vectorize the nodes in the tree to give you a heuristic that guides the search. Could be useful in cases where information is hidden deep in a subtree, in a way that the document’s structure doesn’t give it away.
So are we creating a tree for each document on the fly? Even if it's a batch process, don't you think we are pointing back to something that is a graph (an approximation vs. latency sort of framework)?
Looks like you are talking more along the lines of an LLM-driven outcome, where the "semantic" part is replaced with LLM intelligence.
I tried similar approaches a few months back, but they often resulted in poor scalability, predictability and quality.
When you have a question and you don't know which of the million documents in your dataspace contains the answer - I'm not sure how this approach will perform. In that case we are looking at either feeding an enormously large tree as context to an LLM or looping through potentially thousands of iterations between a tree & an LLM.
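One possible shape for that merge, sketched with toy data: give every tree node a vector and run a best-first search that always expands the most query-similar node seen so far, within a fixed expansion budget. `VNode`, `best_first` and the vectors are hypothetical, and in a real system the similarity score could be combined with (or replaced by) an LLM judgment.

```python
import heapq
import itertools
import math
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class VNode:
    title: str
    vec: List[float]
    children: List["VNode"] = field(default_factory=list)


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def best_first(root: VNode, query_vec: List[float], max_expansions: int = 10) -> List[Tuple[float, str]]:
    """Pop the most similar frontier node each step; return visited leaves as (similarity, title)."""
    tie = itertools.count()  # tie-breaker so heapq never has to compare VNode objects
    heap = [(-cosine(query_vec, root.vec), next(tie), root)]
    leaves: List[Tuple[float, str]] = []
    expansions = 0
    while heap and expansions < max_expansions:
        neg_sim, _, node = heapq.heappop(heap)
        expansions += 1
        if not node.children:
            leaves.append((-neg_sim, node.title))
            continue
        for child in node.children:
            heapq.heappush(heap, (-cosine(query_vec, child.vec), next(tie), child))
    return leaves


tree = VNode("Report", [1.0, 1.0], [
    VNode("Appendix", [0.3, 1.0], [
        VNode("Footnote on pension liabilities", [1.0, 0.1]),
    ]),
    VNode("Financials", [1.0, 0.4], [
        VNode("Revenue", [0.7, 0.7]),
    ]),
])

leaves = best_first(tree, [1.0, 0.0])
print(leaves)
print(max(leaves)[1])
```

Here the best leaf sits under "Appendix", a section whose own vector looks unpromising. A greedy descent would commit to "Financials" and stop, whereas the frontier-based search falls back to the other branch once the budget allows, at the price of more node visits.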
That said, this really is a good idea for a small search space (like a single document).
SELECT id, body
FROM docs
WHERE body ~* E'(?x)   # x = expanded syntax: whitespace and # comments are ignored
  (?:\\m(?:dr|rev(?:erend)?)\\.?\\M[\\s.]+)?   # optional title: Dr., Rev., Reverend
  (   # name forms
    (?:\\mmartin\\M[\\s.]+(?:\\mluther\\M[\\s.]+)?\\mking\\M)   # "Martin (Luther)? King"
    | (?:\\mm\\.?\\M[\\s.]+(?:\\ml\\.?\\M[\\s.]+)?\\mking\\M)   # "M. (L.)? King" / "M L King"
    | (?:\\mmlk\\M)   # "MLK"
  )
  (?:[\\s.,-]*\\m(?:jr|junior)\\M\\.?)*   # optional suffix(es): Jr, Jr., Junior
';
I'd do some large scale benchmarks before doubling down on this approach.
But the home page doesn't indicate any sort of sign up or pricing.
So I'm a little confused.
Edit: OK, I found a sign-up flow, but the verification email never came :(
But for on-demand, near instant RAG (like say in a chat application), this won't work. Speed vs accuracy vs cost. Cost will be a really big one.
Might be useful for a few hundred documents max though.
Would've loved to have seen the author run experiments on how this compares to other RAG approaches and what its limitations are.
I've found all leave something to be desired, sadly.
There are plenty of lightweight retrieval options that don't require a separate vector database (I'm the author of txtai [https://github.com/neuml/txtai], which is one of them).
It can be as simple as this in Python: you pass an index operation a data generator and save the index to a local folder. Then use that for RAG.
Context and prompt engineering are going to be replaced by algorithms, 100%.