Workflow of GraphRAG
Phase 1: Compose TextUnits
Objective: Break down input documents into smaller, manageable chunks of text (TextUnits) for detailed analysis
Example Input
A text document containing "A Christmas Carol" by Charles Dickens
Example Output
Document 1: "A Christmas Carol"
TextUnit 1: "Marley was dead, to begin with..."
TextUnit 2: "Scrooge was his sole executor, his sole administrator..."
https://scrapbox.io/files/66b5ac73b7f9da001c25f12c.png
https://scrapbox.io/files/66b5b25a79cd47001d837484.png
Explanation
TextUnits are chunks of the document, each of a specific token size (e.g., 1200 tokens)
Chunking keeps each piece small enough for the LLM calls in later phases, and lets every extracted entity or relationship be traced back to the TextUnit it came from
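The chunking step can be sketched as follows. This is a minimal illustration, not GraphRAG's own implementation: whitespace-separated words stand in for real tokenizer tokens, and the `overlap` parameter, the `TextUnit` dataclass, and the `compose_text_units` helper are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class TextUnit:
    id: int
    document_id: str
    text: str
    n_tokens: int


def compose_text_units(document_id, text, chunk_size=1200, overlap=100):
    """Split text into chunks of at most chunk_size tokens that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    tokens = text.split()  # assumption: whitespace words stand in for tokenizer tokens
    step = chunk_size - overlap
    units = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        units.append(TextUnit(len(units) + 1, document_id, " ".join(window), len(window)))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return units


sample = "Marley was dead, to begin with. There is no doubt whatever about that."
units = compose_text_units("A Christmas Carol", sample, chunk_size=8, overlap=2)
for unit in units:
    print(unit.id, unit.n_tokens, repr(unit.text))
```

The tiny `chunk_size=8` is only there to make the split visible on a two-sentence sample. With overlap, the last tokens of one TextUnit reappear at the start of the next, so a sentence cut at a boundary is still seen whole by at least one chunk.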
Phase 2: Graph Extraction
Objective: Extract entities (people, places, events) and relationships from the TextUnits and represent them as a graph
Example Input
TextUnit 1: "Marley was dead, to begin with..."
Example Output
Entity: "Marley" (Person)
Entity: "Death" (Event)
Relationship: Marley -> Death (Relationship)
https://scrapbox.io/files/66b5b27e4a7b9a001c48fee4.png
Explanation
Each TextUnit is processed to extract key entities and the relationships between them.
These extracted items form the basic nodes and edges of the graph
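The extraction itself is done by prompting a language model, which is not reproduced here. The sketch below only shows how per-TextUnit results could be merged into nodes and edges. The `extracted` records are typed by hand for illustration, and `build_graph` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict

# Hypothetical extraction results, one record per TextUnit. In a real
# pipeline these would be parsed from an LLM response.
extracted = {
    "TextUnit 1": {
        "entities": [("Marley", "Person"), ("Death", "Event")],
        "relationships": [("Marley", "Death")],
    },
    "TextUnit 2": {
        "entities": [("Scrooge", "Person"), ("Marley", "Person")],
        "relationships": [("Scrooge", "Marley")],
    },
}


def build_graph(extracted):
    """Merge per-TextUnit records into one node table and one edge table."""
    nodes = {}  # entity name -> {"type", "text_units"}
    edges = defaultdict(lambda: {"weight": 0, "text_units": []})
    for unit_id, record in extracted.items():
        for name, entity_type in record["entities"]:
            node = nodes.setdefault(name, {"type": entity_type, "text_units": []})
            node["text_units"].append(unit_id)  # remember where the entity was seen
        for source, target in record["relationships"]:
            edge = edges[(source, target)]
            edge["weight"] += 1  # repeated mentions strengthen the edge
            edge["text_units"].append(unit_id)
    return nodes, dict(edges)


nodes, edges = build_graph(extracted)
print(nodes["Marley"])
print(edges)
```

An entity mentioned in several TextUnits becomes a single node that lists all of them, which is what keeps the graph traceable back to the source text.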
Phase 3: Graph Augmentation
Objective: Enhance the graph produced in Phase 2 (Graph Extraction) by performing two main tasks
Community Detection: Identify and group related entities into clusters or communities within the graph
Graph Embedding: Represent the graph in a vector space, enabling further analysis such as similarity searches, clustering, or classification
Step 1: Community Detection
What is Community Detection?
Community detection is a process in graph analysis where entities (nodes) are grouped together into clusters based on their relationships (edges)
Entities that are more closely related (i.e., have more or stronger connections) are grouped into the same community, while entities that are less connected are placed into different communities
Example Input
Entities and Relationships extracted in Phase 2:
Entity: "Marley" (Person)
Entity: "Scrooge" (Person)
Entity: "Ghost of Christmas Past" (Person)
Entity: "Tiny Tim" (Person)
Entity: "Bob Cratchit" (Person)
Entity: "Fred" (Person)
Relationships
Marley -> Scrooge (Mentor)
Scrooge -> Ghost of Christmas Past (Visited by)
Bob Cratchit -> Tiny Tim (Father of)
Fred -> Scrooge (Nephew of)
How Does Community Detection Work?
Algorithm Used: GraphRAG uses the Hierarchical Leiden algorithm, which clusters the graph recursively to produce a hierarchy of communities; other graph clustering algorithms could fill the same role
Output: A set of communities, each containing entities that are closely related
Example Output
Community 1: {Marley, Scrooge, Ghost of Christmas Past}
Community 2: {Tiny Tim, Bob Cratchit, Fred}
In this case:
Community 1 groups characters involved in the supernatural aspects of Scrooge's journey
Community 2 groups characters that are part of Scrooge's personal and familial relationships
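To make the idea concrete, here is a small sketch that groups the example entities by greedily maximizing modularity. This is a simpler stand-in, not the Leiden algorithm, and `modularity` and `greedy_communities` are names introduced here. It sees only the four relationships listed in the example input, so it places Fred with Scrooge (Fred's only listed relationship); the illustrative grouping above would need more relationships than those four.

```python
from itertools import combinations

edges = [
    ("Marley", "Scrooge"),
    ("Scrooge", "Ghost of Christmas Past"),
    ("Bob Cratchit", "Tiny Tim"),
    ("Fred", "Scrooge"),
]


def modularity(communities, edges):
    """Q = sum over communities of (internal edges / m) - (degree sum / 2m)^2."""
    m = len(edges)
    label = {node: i for i, group in enumerate(communities) for node in group}
    internal = [0] * len(communities)
    degree = [0] * len(communities)
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            internal[label[u]] += 1
    return sum(internal[i] / m - (degree[i] / (2 * m)) ** 2
               for i in range(len(communities)))


def greedy_communities(edges):
    """Start from singletons; keep applying the merge that raises modularity most."""
    communities = [frozenset([n]) for n in sorted({n for e in edges for n in e})]
    while len(communities) > 1:
        best_score, best_merge = modularity(communities, edges), None
        for a, b in combinations(communities, 2):
            merged = [c for c in communities if c not in (a, b)] + [a | b]
            score = modularity(merged, edges)
            if score > best_score + 1e-12:
                best_score, best_merge = score, merged
        if best_merge is None:
            break  # no merge improves modularity any further
        communities = best_merge
    return communities


communities = greedy_communities(edges)
for group in communities:
    print(sorted(group))
```

Modularity rewards groups with many internal edges relative to what their degrees would predict, which is the formal version of "more closely related entities end up together".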
Step 2: Graph Embedding
What is Graph Embedding?
Graph embedding is the process of transforming the nodes (entities) and edges (relationships) of a graph into a vector space representation
Each entity in the graph is represented as a vector (a list of numbers)
This vector captures the structural and relational properties of the entity within the graph
Example Input
The entity graph from Phase 2 (nodes and edges), whose entities were grouped in Step 1 as:
Community 1: {Marley, Scrooge, Ghost of Christmas Past}
Community 2: {Tiny Tim, Bob Cratchit, Fred}
How Does Graph Embedding Work?
Algorithm Used: #Node2Vec is commonly used for this purpose. It learns low-dimensional representations of nodes that preserve network neighborhoods
It does this by simulating random walks on the graph and capturing the sequence of visited nodes.
Process
Random Walks: Start from a node and simulate a random walk to traverse the graph.
Context Window: For each node in the random walk, the nodes within a fixed number of steps before and after it are recorded as its context.
Training: A skip-gram model (the same kind used by word2vec) is trained to predict each node's context nodes, and the learned weights become the vector for each node.
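The first two steps of the process above can be sketched as follows. The walks here are uniform, which corresponds to Node2Vec with its return and in-out parameters both set to 1; the biased walks and the training step are left out. The adjacency list is built by hand from the Phase 2 example relationships, and `random_walk` and `context_pairs` are names introduced for this sketch.

```python
import random

graph = {  # undirected adjacency list built from the Phase 2 relationships
    "Marley": ["Scrooge"],
    "Scrooge": ["Marley", "Ghost of Christmas Past", "Fred"],
    "Ghost of Christmas Past": ["Scrooge"],
    "Fred": ["Scrooge"],
    "Bob Cratchit": ["Tiny Tim"],
    "Tiny Tim": ["Bob Cratchit"],
}


def random_walk(graph, start, length, rng):
    """Uniform random walk: at each step move to a randomly chosen neighbor."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph[walk[-1]]
        if not neighbors:
            break  # dead end: stop the walk early
        walk.append(rng.choice(neighbors))
    return walk


def context_pairs(walk, window):
    """Return (center, context) pairs for nodes within `window` steps in the walk."""
    pairs = []
    for i, center in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        pairs.extend((center, walk[j]) for j in range(lo, hi) if j != i)
    return pairs


rng = random.Random(0)  # fixed seed so the walks are reproducible
walks = [random_walk(graph, node, length=5, rng=rng) for node in graph for _ in range(3)]
pairs = [pair for walk in walks for pair in context_pairs(walk, window=2)]
print(walks[0])
print(len(pairs), "training pairs")
```

The resulting pairs are the training data: an embedding model that learns to predict the context node from the center node ends up placing nodes with similar neighborhoods near each other.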
Example Output
The output is one vector per entity, encoding that entity's position and neighborhood in the graph
Nodes that are close in the vector space (for example, by cosine similarity) have similar roles or contexts in the graph
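Closeness in the vector space is often measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings come from training and usually have many more dimensions. `most_similar` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors, not the output of any trained model.
vectors = {
    "Bob Cratchit": [0.9, 0.1, 0.2],
    "Tiny Tim": [0.8, 0.2, 0.1],
    "Marley": [0.1, 0.9, 0.3],
}


def most_similar(name, vectors):
    """Return the other entity whose vector is closest to `name` by cosine similarity."""
    others = (other for other in vectors if other != name)
    return max(others, key=lambda other: cosine_similarity(vectors[name], vectors[other]))


print(most_similar("Bob Cratchit", vectors))
```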
https://scrapbox.io/files/66b5b2ae1695df001de09dfd.png
Phase 4: Community Summarization
Objective: Summarize the communities detected in Phase 3 to provide a high-level understanding.
Example Input
Community 1: {Marley, Scrooge, Ghost of Christmas Past}
Example Output
Community Report: "This community highlights the key characters involved in Scrooge's transformation..."
Explanation
Each community report summarizes the entities and relationships found within one community
Read together, the reports give an overview of the broader structure of the graph without inspecting individual nodes.
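The summary text itself would be written by a language model. The sketch below only shows one way the input for such a call could be assembled from a community's members and the relationships between them. The prompt wording and the `build_report_prompt` helper are assumptions made for illustration, not GraphRAG's actual prompt.

```python
relationships = [  # (source, label, target), taken from the Phase 3 example
    ("Marley", "Mentor", "Scrooge"),
    ("Scrooge", "Visited by", "Ghost of Christmas Past"),
    ("Bob Cratchit", "Father of", "Tiny Tim"),
    ("Fred", "Nephew of", "Scrooge"),
]


def build_report_prompt(community_id, members, relationships):
    """Collect a community's entities and internal relationships into one prompt."""
    member_set = set(members)
    internal = [
        (source, label, target)
        for source, label, target in relationships
        if source in member_set and target in member_set  # skip cross-community edges
    ]
    lines = [f"Write a short report on community {community_id}.", "Entities:"]
    lines += [f"- {name}" for name in members]
    lines.append("Relationships:")
    lines += [f"- {source} -> {target} ({label})" for source, label, target in internal]
    return "\n".join(lines)


prompt = build_report_prompt(
    1, ["Marley", "Scrooge", "Ghost of Christmas Past"], relationships
)
print(prompt)
```

Only relationships with both endpoints inside the community are included, so each report stays focused on its own cluster.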
Phase 5: Document Processing
Objective: Link the original documents to the TextUnits and generate embeddings for the documents
Example Input
Document 1: "A Christmas Carol"
TextUnits from Phase 1
Example Output
Document 1 -> TextUnit 1, TextUnit 2, ...
Explanation
The documents are linked to the TextUnits to maintain traceability
Document embeddings provide a vector representation of the entire document, aiding in searching and querying
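Both outputs of this phase can be sketched in a few lines. The text above does not say how the document embedding is computed; averaging the chunk embeddings weighted by token count is one common choice and is an assumption here. The rows and their 2-dimensional embeddings are made up, and `link_documents` and `document_embedding` are hypothetical helpers.

```python
from collections import defaultdict

# Hypothetical TextUnit rows: (text_unit_id, document_id, n_tokens, embedding).
text_units = [
    ("TextUnit 1", "Document 1", 1200, [0.2, 0.4]),
    ("TextUnit 2", "Document 1", 600, [0.8, 0.1]),
]


def link_documents(text_units):
    """Map each document id to the ids of the TextUnits cut from it."""
    links = defaultdict(list)
    for unit_id, document_id, _, _ in text_units:
        links[document_id].append(unit_id)
    return dict(links)


def document_embedding(text_units, document_id):
    """Token-weighted average of the embeddings of a document's chunks."""
    rows = [(n, vec) for _, doc, n, vec in text_units if doc == document_id]
    total = sum(n for n, _ in rows)
    dims = len(rows[0][1])
    return [sum(n * vec[d] for n, vec in rows) / total for d in range(dims)]


links = link_documents(text_units)
embedding = document_embedding(text_units, "Document 1")
print(links)
print(embedding)
```

Weighting by token count keeps a short trailing chunk from counting as much as a full-size one.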
CODE