This seems highly relevant: https://arxiv.org/abs/2406.01506

> In this paper, we study the two foundational questions in this area. First, how are categorical concepts, such as {'mammal', 'bird', 'reptile', 'fish'}, represented? Second, how are hierarchical relations between concepts encoded? For example, how is the fact that 'dog' is a kind of 'mammal' encoded? We show how to extend the linear representation hypothesis to answer these questions. We find a remarkably simple structure: simple categorical concepts are represented as simplices, hierarchically related concepts are orthogonal in a sense we make precise, and (in consequence) complex concepts are represented as polytopes constructed from direct sums of simplices, reflecting the hierarchical structure.

Basically, LLMs already partially encode information as semantic graphs internally.

With this in mind, it is less surprising that augmenting them with external knowledge graphs has a lower ROI.

    > Basically, LLMs already partially encode information as semantic graphs internally.
There's an (underutilized?) technique here to take advantage of that internal graph: have the LLM tell you the related concepts first and then perform the RAG using not just the original concept, but the expanded set of related concepts.

So:

    concept → [related concepts] → [[.. rag-rc1],[.. rag-rc2],[.. rag-rcn]] → summarize
With GPTs prior to 4o, it would have been too slow to do this as a two-step process. With 4o and some of the higher-throughput Llama 3-based options (Together.ai, Fireworks.ai, Groq.com), a two-step fan-out RAG approach takes advantage of this internal graph and could probably yield similar gains in RAG without additional infrastructure (another datastore) or data pre-processing to take advantage of a graph approach.
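
A minimal sketch of that fan-out, assuming the OpenAI Python client; `vector_search` is a stand-in for whatever retriever you already have, not a real library call:

    from openai import OpenAI

    client = OpenAI()

    def expand_concepts(concept: str) -> list[str]:
        # Step 1: ask the model for related concepts (tap the internal graph).
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content":
                       f"List 5 concepts closely related to '{concept}', one per line."}],
        )
        lines = resp.choices[0].message.content.splitlines()
        return [line.strip("-* ").strip() for line in lines if line.strip()]

    def fan_out_rag(concept: str, vector_search) -> str:
        # Step 2: retrieve for the original concept plus each related concept, then summarize.
        chunks = []
        for c in [concept] + expand_concepts(concept):
            chunks.extend(vector_search(c, k=3))  # your existing vector store / retriever
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content":
                       f"Summarize what these passages say about '{concept}':\n\n"
                       + "\n---\n".join(chunks)}],
        )
        return resp.choices[0].message.content
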
Even with old GPT, if the summary is decent, it works reasonably well even with no RAG. We are a data management platform and allow users to build data pipelines around a data model; this is basically a DAG. We autogenerate documentation for these pipelines using GPT-4, feeding a summarized version of the data pipeline, expressed in Graphviz DOT format, into the prompt. GPT-4 understands this format well, and seemingly understands the graph itself reasonably well!

It performs poorly at expressing the higher-level intent of the pipeline, but tactical details are accurately documented. We are trying to push prompting itself further before turning to RAG and fine-tuning.
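
For a rough illustration of the approach (the pipeline, DOT snippet, and prompt below are made up, not the poster's actual setup): serialize the DAG as Graphviz DOT and put it straight into the prompt.

    from openai import OpenAI

    client = OpenAI()

    # A toy pipeline DAG serialized as Graphviz DOT (illustrative only).
    pipeline_dot = """
    digraph pipeline {
        raw_orders      -> clean_orders   [label="dedupe + cast types"];
        raw_customers   -> clean_customers;
        clean_orders    -> orders_by_day  [label="aggregate daily"];
        clean_customers -> orders_by_day;
        orders_by_day   -> revenue_report;
    }
    """

    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content":
                   "Write end-user documentation for this data pipeline. "
                   "The DAG is given in Graphviz DOT format:\n" + pipeline_dot}],
    )
    print(resp.choices[0].message.content)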

Yup. Fascinating stuff really. Kind of like mnemonics for LLMs, if you squint a bit.
As a GenAI skeptic, I think this is a very cool finding. My experience with AI tools is that they are complete bullshit artists. But to a large extent that's just a result of the way they are trained. If this description of how the data is structured is correct, it indicates that these programs do encode a real model of the world. Perhaps alternative ways of training these same models, or fixing the data afterwards, will result in more truthful models.
Looks like the test setup confuses knowledge graphs with graph databases. The code just creates a neo4j database from a document, not a knowledge graph (basically using neo4j as a vector database). A knowledge graph would be created by an LLM as a preprocessing step (and queried similarly by an LLM). This is a different approach from the one tested, one that trades preprocessing time and domain knowledge for accuracy. Reference: https://python.langchain.com/v0.1/docs/use_cases/graph/const...
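
For contrast, a simplified, hypothetical version of that preprocessing step: have an LLM pull (subject, relation, object) triples out of each chunk and write them to neo4j as actual entities and relationships, rather than storing chunk embeddings. The prompt and Cypher here are illustrative; the linked LangChain docs show a fuller pipeline.

    import json
    from openai import OpenAI
    from neo4j import GraphDatabase

    client = OpenAI()
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    def extract_triples(chunk: str) -> list[dict]:
        # The LLM does the structuring; output is JSON like
        # {"triples": [{"subject": ..., "relation": ..., "object": ...}]}.
        resp = client.chat.completions.create(
            model="gpt-4o",
            response_format={"type": "json_object"},
            messages=[{"role": "user", "content":
                       'Extract entity triples from the text as JSON of the form '
                       '{"triples": [{"subject": "...", "relation": "...", "object": "..."}]}\n\n'
                       + chunk}],
        )
        return json.loads(resp.choices[0].message.content).get("triples", [])

    def load_triples(triples: list[dict]) -> None:
        # Store real entities and relationships, not just vectors.
        with driver.session() as session:
            for t in triples:
                session.run(
                    "MERGE (a:Entity {name: $s}) "
                    "MERGE (b:Entity {name: $o}) "
                    "MERGE (a)-[:RELATED {type: $r}]->(b)",
                    s=t["subject"], o=t["object"], r=t["relation"],
                )
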
Yeah, I think the dataset is flawed. GraphRAG appears to be aimed at navigating the Microsoft 365 document and people graph that you get in an organizational setting, not doing a homogeneous search.
The Microsoft GraphRAG paper focuses on global sensemaking through hierarchical summarization, which is a fundamental aspect of their approach. The blog post analysis, however, doesn't address this core feature at all. Another issue is corpus size: the paper focuses on sizes on the order of 1M tokens, while the reference text used in the blog post is probably shorter. On shorter text, a single LLM call could do the summarization directly.
I don’t believe the author read the GraphRAG paper as there is nothing in this “deep dive” that implements anything remotely close.
There is no one-size-fits-all formula. For simple RAG, a search query (vector, keyword, SQL, etc.) works to build a context.

For more complex questions or research, a knowledge graph can be beneficial. I wrote an article[1] earlier this year that used graph path traversal to build a context.

The goal was to build a short narrative about English history from 500 to 1000 using Wikipedia articles. Vector similarity alone won't bring back good results. This article used a Cypher graph path query that jumped multiple hops through concepts of interest. The articles on that path were then brought in as the context.

[1] https://neuml.hashnode.dev/advanced-rag-with-graph-path-trav...
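
For a rough idea of the pattern (node labels, relationship types, and properties below are invented, and this is not the exact query from the article), a multi-hop path query that collects articles along a chain of related concepts might look like:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    # Hop through related concepts and pull in the articles found along the path.
    query = """
    MATCH p = (a:Article {title: $start})-[:MENTIONS|RELATED_TO*1..4]-(b:Article)
    UNWIND nodes(p) AS n
    RETURN DISTINCT n.title AS title, n.text AS text
    LIMIT 25
    """

    with driver.session() as session:
        rows = session.run(query, start="Alfred the Great")
        context = "\n\n".join(f"{r['title']}\n{r['text']}" for r in rows)

    # `context` then goes into the LLM prompt instead of (or alongside) vector-search hits.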

I really need to dig into the more recent advances in knowledge graphs + LLMs. I've been out of the game for ~10 months now, and am just starting to dig back into things and get my training pipeline working (darn bitrot...)

I had previously trained a llama2 13b model (https://huggingface.co/Tostino/Inkbot-13B-8k-0.2) on a whole bunch of knowledge graph tasks (in addition to a number of other tasks).

Here is an example of the training data for training it how to use knowledge graphs:

easy - https://gist.github.com/Tostino/76c55bdeb1f099fb2bfab00ce144...

medium - https://gist.github.com/Tostino/0460c18024697efc2ac34fe86ecd...

I also trained it on generating KGs from conversations, or articles you have provided. So from the LLM side, it's way more knowledgeable about the connections in the graph than GPT4 is by default.

Here are a couple examples of the trained model actually generating a knowledge graph:

1. https://gist.github.com/Tostino/c3541f3a01d420e771f66c62014e...

2. https://gist.github.com/Tostino/44bbc6a6321df5df23ba5b400a01...

I haven't done any work on integrating those into larger structures, combining the graphs generated from different documents, or using a graph database to augment my use case...all things I am eager to try out, and I am glad there is a bunch more to read on the topic available now.

Anyway, near-term plans are to train a llama3 8b version of Inkbot, and likely a phi-3 13b version, on an improved version of my dataset. Glad to see others as excited about this topic as I am!

Knowledge graphs were created to solve the problem of making natural, free-flowing text machine-processable. We now have a technology that completely understands natural, free-flowing text and can extract meaning. Why would going back to structure help, when that structure can never be as rich as the text itself? I get it if the KB has new information, but that's not what I'm saying.
> Why would going back to structure help

When your corpus is large, it is useful to split it up and combine hierarchically. In their place I would do both bottom-up and top-down summarization passes, so information can percolate from a leaf to the root and from the root to a different leaf. Global context can illuminate local summaries; think of the twist in a novel, which sheds new light on everything.
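
A minimal sketch of that two-pass idea, with `llm_summarize` standing in for whatever model call you use (purely illustrative, assumes a non-empty chunk list):

    def two_pass_summaries(chunks: list[str], llm_summarize, fanout: int = 4) -> list[str]:
        # Bottom-up: summarize leaves, then summaries of summaries, up to a single root.
        level = list(chunks)
        while len(level) > 1:
            level = [llm_summarize("\n\n".join(level[i:i + fanout]))
                     for i in range(0, len(level), fanout)]
        root = level[0]

        # Top-down: re-summarize each leaf with the global (root) summary as context,
        # so information from one part of the corpus can illuminate another.
        return [llm_summarize(f"Global context:\n{root}\n\nPassage:\n{chunk}")
                for chunk in chunks]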

That's not what a KB is.
> We now have a technology that completely understands natural, free-flowing text and can extract meaning.

Actually, we don't. I know it certainly feels like LLMs do this, but no one who knows how they work would dare stake their life on their output. Still useful!

But RAG without graphs just relies on similarity search, which isn't very smart.
This is a nice sandbox walkthrough of the author's objective, which was to test Microsoft's claims in the paper -- but with all due respect, the buzz around graphs is because they add a whole third layer in a combined approach like Reciprocal Rank Fusion (RRF). You do a BM25 search, then a vector-based nearest-neighbors search, and now you can add a KG search, all combined with local and global reranking, etc.; the expectation is that this produces a better final outcome. These findings aside, it still makes sense that adding a KG to a hybrid search pipeline is going to be useful.
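
For reference, a bare-bones Reciprocal Rank Fusion over the three ranked lists (BM25, vector, KG); the retrievers themselves are assumed to already exist:

    from collections import defaultdict

    def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
        # Each input is a list of doc ids, best first. Standard RRF: score += 1 / (k + rank).
        scores: dict[str, float] = defaultdict(float)
        for ranking in ranked_lists:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # fused = rrf_fuse([bm25_results, vector_results, kg_results])
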
Knowledge / property graphs provide truths that can guide the retrieval. LLMs lack a truth function, i.e. causality. The KPG provides this as sort of a lace across the LLM's vector space. A KPG can be used either as a filter or as a router of sorts. I expect we'll see KPGs colocated with the LLM's vector data, with a tuned router layer that uses them to guide retrieval and course-correct the output. Kind of like MoE.
It seems to me that the "knowledge graph" generated in this article is incredibly naive and not comparable to the process in the MS paper, which requires multiple rounds of preprocessing the source content using LLMs to extract, summarize, find relationships at multiple levels and model them in the graph store. This just splats chunks and words into a vector graph and is barely defensible as a "knowledge graph".

Please tell me I'm missing something because this is egregious. How can you expect a graph approach to improve over naive rag if you don't actually build a knowledge graph that captures high quality, higher level entity relationships?

That is an interesting writeup, but I had trouble understanding what they meant by what for me is a new term: “faithfulness.”

This is supposedly a measure of reducing hallucinations. Is it just me, or did other people here have difficulty understanding how faithfulness was evaluated?

EDIT: OK, faithfulness is calculated by human evaluation, and can be automatically calculated with ROUGE and BLEU.

I'm happy to see third-party comparisons; most of the marketing here indeed just assumes KGs are better with zero proof: marketers to be wary of. Unfortunately, I suspect a few key steps need to happen for this post to fairly reflect what the Microsoft NLP researchers called their algorithm, vs. the broader family named by neo4j. Afaict, they're talking about a different graph.

* The KG index should be text documents hierarchically summarized based on an extracted named-entity-relation graph. The blog version seems to instead do (document, word), not the KG, and afaict skips the hierarchical NER community summarization (see the sketch after these bullets). The blog post is doing what neo4j calls a lexical graph, not the novel KG summary index of the MSR paper.

* The data volume should go up. Think a corpus like 100k+ tweets or 100+ documents. You start to see challenges like redundant tweets that clog retrieval/ranking, or many pieces of the puzzle spread over disparate chunks requiring indirect 'multi-hop' reasoning. Something like a debate can fit into one ChatGPT call, with no RAG. It's an interesting question how summarization preprocessing can still help small documents, but that's a more nuanced topic (and we have Thoughts on it ;-))

* The tasks should reflect the challenges: multi-hop reasoning, wider summarization with a fixed budget, etc. Retesting simple queries naive RAG already solves isn't the point. The paper focused on a couple of query types, which is also why they route to two different retrieval modes. Subtly, part of the challenge in bigger data is how many resources we give the retriever & reasoner, and that's part of why graph RAG is exciting IMO.

Afaict the blog post essentially did a lexical graph with chunk/node embeddings, ran it on a small document, and at that scale asked simple questions... so it's close to naive retrieval and, unsurprisingly, got parity. It's not too much more work to improve, so I'd encourage doing a bit more. Beyond the MSR paper, I would also experiment a bit more with retrieval strategies, e.g. an agentic layer on top, and include simple text search mixed in with reranking. And as validation before any of that, focus specifically on the queries expected to fail with naive RAG and work with a graph, and make sure those work.
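
As a rough sketch of the index those bullets describe (extracted entity graph -> communities -> per-community summaries), using networkx community detection and a placeholder `llm` callable; this is a simplification of what the MSR paper does, not their implementation:

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    def build_community_summaries(triples: list[tuple[str, str, str]], llm) -> list[str]:
        # triples = (entity, relation, entity) tuples already extracted from the corpus by an LLM.
        g = nx.Graph()
        for subj, rel, obj in triples:
            g.add_edge(subj, obj, relation=rel)

        summaries = []
        for community in greedy_modularity_communities(g):
            # Describe each community by its internal edges, then summarize it.
            facts = [f"{u} -[{d['relation']}]-> {v}"
                     for u, v, d in g.subgraph(community).edges(data=True)]
            summaries.append(llm("Summarize these related facts:\n" + "\n".join(facts)))
        return summaries  # these become the retrieval units for "global" questions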

Related: we are working on a variant of graph RAG that solves some additional scale & quality challenges in our data (investigations: threat intel reports, real-time social & news, misinfo, ...), and may be open to an internship or contract role for the right person. One big focus area is ensuring AI quality & AI scale, as our version is more GPU/AI-centric and is used in serious situations by less technical users... A bit ironic given the article :) LMK if interested, see my profile. We'll need proof of capability for both the engineering and AI challenges, and it's easier for us to teach the latter than the former.