Workflow of GraphRAG

Phase 1: Compose TextUnits

Objective: Break down input documents into smaller, manageable chunks of text (TextUnits) for detailed analysis

Example Input

A text document containing "A Christmas Carol" by Charles Dickens

Example Output

Document 1: "A Christmas Carol"

TextUnit 1: "Marley was dead, to begin with..."

TextUnit 2: "Scrooge was his sole executor, his sole administrator..."

https://scrapbox.io/files/66b5ac73b7f9da001c25f12c.png

https://scrapbox.io/files/66b5b25a79cd47001d837484.png

Explanation

TextUnits are chunks of the document, each of a specific token size (e.g., 1200 tokens)

Each chunk is small enough to be analyzed on its own in the later phases, and anything extracted from it can be traced back to the TextUnit it came from
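The chunking step can be sketched as follows. This is a minimal illustration, not GraphRAG's own chunker: it counts whitespace-separated words instead of model tokens, and the function name `split_into_text_units` and the `overlap` parameter are assumptions made for the example.

```python
def split_into_text_units(text: str, chunk_size: int = 1200, overlap: int = 100) -> list[str]:
    """Split text into chunks of at most `chunk_size` tokens.

    Consecutive chunks share `overlap` tokens so that a sentence cut at a
    boundary still appears whole in one of the two chunks.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    tokens = text.split()  # stand-in tokenizer: whitespace words, not model tokens
    step = chunk_size - overlap
    units = []
    for start in range(0, len(tokens), step):
        units.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this chunk already reached the end of the document
    return units


opening = "Marley was dead, to begin with. There is no doubt whatever about that."
for number, unit in enumerate(split_into_text_units(opening, chunk_size=5, overlap=1), start=1):
    print(f"TextUnit {number}: {unit}")
```

With a real tokenizer only the `tokens = text.split()` line would change; the sliding-window logic stays the same.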

Phase 2: Graph Extraction

Objective: Extract entities (people, places, events) and relationships from the TextUnits and represent them as a graph

Example Input

TextUnit 1: "Marley was dead, to begin with..."

Example Output

Entity: "Marley" (Person)

Entity: "Death" (Event)

Relationship: Marley -> Death (Relationship)

https://scrapbox.io/files/66b5b27e4a7b9a001c48fee4.png

Explanation

Each TextUnit is processed to extract key entities and the relationships between them.

These extracted items form the basic nodes and edges of the graph
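One way to picture the resulting structure is the sketch below. The extraction itself is not shown; the example only stores already-extracted items as nodes and edges. The class names `Entity`, `Relationship`, and `Graph` are hypothetical and are not GraphRAG's data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    name: str
    type: str


@dataclass(frozen=True)
class Relationship:
    source: str
    target: str
    description: str


@dataclass
class Graph:
    nodes: dict[str, Entity] = field(default_factory=dict)
    edges: list[Relationship] = field(default_factory=list)

    def add_entity(self, entity: Entity) -> None:
        # The same entity may be found in several TextUnits; keep the first copy.
        self.nodes.setdefault(entity.name, entity)

    def add_relationship(self, rel: Relationship) -> None:
        if rel.source not in self.nodes or rel.target not in self.nodes:
            raise KeyError("both endpoints must be added as entities first")
        self.edges.append(rel)

    def neighbors(self, name: str) -> set[str]:
        # Treat edges as undirected when looking up neighbors.
        found = set()
        for edge in self.edges:
            if edge.source == name:
                found.add(edge.target)
            elif edge.target == name:
                found.add(edge.source)
        return found


graph = Graph()
graph.add_entity(Entity("Marley", "Person"))
graph.add_entity(Entity("Death", "Event"))
graph.add_relationship(Relationship("Marley", "Death", "Marley was dead"))
print(graph.neighbors("Marley"))
```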

Phase 3: Graph Augmentation

Objective: Enhance the graph produced in Phase 2 (Graph Extraction) through two tasks

Community Detection: Identify and group related entities into clusters or communities within the graph

Graph Embedding: Represent the graph in a vector space, enabling further analysis such as similarity searches, clustering, or classification

Step 1: Community Detection

What is Community Detection?

Community detection is a process in graph analysis where entities (nodes) are grouped together into clusters based on their relationships (edges)

Entities that are more closely related (i.e., have more or stronger connections) are grouped into the same community, while entities that are less connected are placed into different communities

Example Input

Entities and Relationships extracted in Phase 2:

Entity: "Marley" (Person)

Entity: "Scrooge" (Person)

Entity: "Ghost of Christmas Past" (Person)

Entity: "Tiny Tim" (Person)

Entity: "Bob Cratchit" (Person)

Entity: "Fred" (Person)

Relationships

Marley -> Scrooge (Mentor)

Scrooge -> Ghost of Christmas Past (Visited by)

Bob Cratchit -> Tiny Tim (Father of)

Fred -> Scrooge (Nephew of)

How Does Community Detection Work?

Algorithm Used: GraphRAG uses the hierarchical Leiden algorithm; other graph clustering algorithms could fill the same role

Output: The output is a set of communities, where each community contains entities that are closely related

Example Output

Community 1: {Marley, Scrooge, Ghost of Christmas Past}

Community 2: {Tiny Tim, Bob Cratchit, Fred}

In this case:

Community 1 groups characters involved in the supernatural aspects of Scrooge's journey

Community 2 groups characters that are part of Scrooge's personal and familial relationships
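Algorithms in the Leiden family search for a grouping that scores well on a quality function such as modularity. The sketch below does not implement Leiden; it only computes modularity for a grouping that is already given, which is enough to compare candidate groupings on the example relationships. The function name `modularity` and the second, alternative grouping are assumptions made for illustration.

```python
def modularity(edges, communities):
    """Modularity of a partition of an undirected, unweighted graph.

    Q = sum over communities c of  L_c / m - (d_c / (2 * m)) ** 2
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the total degree of the nodes in c.
    """
    m = len(edges)
    if m == 0:
        return 0.0

    community_of = {node: i for i, members in enumerate(communities) for node in members}
    internal = [0] * len(communities)
    degree = [0] * len(communities)
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            internal[cu] += 1

    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in range(len(communities)))


edges = [
    ("Marley", "Scrooge"),
    ("Scrooge", "Ghost of Christmas Past"),
    ("Bob Cratchit", "Tiny Tim"),
    ("Fred", "Scrooge"),
]

# Grouping from the example output above.
example = [
    {"Marley", "Scrooge", "Ghost of Christmas Past"},
    {"Tiny Tim", "Bob Cratchit", "Fred"},
]
# Hypothetical alternative that keeps Fred with Scrooge.
alternative = [
    {"Marley", "Scrooge", "Ghost of Christmas Past", "Fred"},
    {"Tiny Tim", "Bob Cratchit"},
]

print(modularity(edges, example))
print(modularity(edges, alternative))
```

On these four edges alone the alternative grouping scores higher, because Fred's only edge leads to Scrooge. A full graph extracted from the whole text has many more edges, so the example communities should be read as illustrative.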

Step 2: Graph Embedding

What is Graph Embedding?

Graph embedding is the process of transforming the nodes (entities) and edges (relationships) of a graph into a vector space representation

Each entity in the graph is represented as a vector (a list of numbers)

This vector captures the structural and relational properties of the entity within the graph

Example Input

Community 1: {Marley, Scrooge, Ghost of Christmas Past}

Community 2: {Tiny Tim, Bob Cratchit, Fred}

How Does Graph Embedding Work?

Algorithm Used: #Node2Vec is commonly used for this purpose

It learns low-dimensional representations of nodes that preserve network neighborhoods

It does this by simulating random walks on the graph and capturing the sequence of visited nodes.

Process

Random Walks: Start from a node and simulate a random walk to traverse the graph.

Context Window: For each node in the random walk, the context (neighboring nodes) is noted.

Training: A skip-gram model (as in Word2Vec) is trained to predict each node's neighbors in the walks; the learned weights give one vector per node.
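The first two steps of the process above can be sketched in plain Python. This is a simplified illustration: the walks are uniform, which corresponds to Node2Vec with its return and in-out parameters both set to 1, and the training step is left out. The function names `random_walk` and `context_pairs` are hypothetical.

```python
import random


def random_walk(adjacency, start, length, rng):
    """Walk `length` nodes from `start`, picking a uniform random neighbor each step."""
    walk = [start]
    while len(walk) < length:
        neighbors = adjacency[walk[-1]]
        if not neighbors:
            break  # dead end: isolated node
        walk.append(rng.choice(neighbors))
    return walk


def context_pairs(walk, window):
    """Pair every node in the walk with the nodes at most `window` steps away."""
    pairs = []
    for i, center in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        pairs.extend((center, walk[j]) for j in range(lo, hi) if j != i)
    return pairs


edges = [
    ("Marley", "Scrooge"),
    ("Scrooge", "Ghost of Christmas Past"),
    ("Bob Cratchit", "Tiny Tim"),
    ("Fred", "Scrooge"),
]
adjacency = {}
for u, v in edges:  # build an undirected adjacency list
    adjacency.setdefault(u, []).append(v)
    adjacency.setdefault(v, []).append(u)

rng = random.Random(0)  # fixed seed so the walks are reproducible
walks = [random_walk(adjacency, node, 5, rng) for node in sorted(adjacency)]
training_pairs = [pair for walk in walks for pair in context_pairs(walk, window=2)]
print(len(walks), "walks,", len(training_pairs), "training pairs")
```

The `(node, context node)` pairs are what an embedding model would be trained on: nodes that co-occur often in walks end up with similar vectors.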

Example Output

Marley: 0.12, 0.45, ..., 0.98

Scrooge: 0.34, 0.67, ..., 0.21

Ghost of Christmas Past: 0.78, 0.45, ..., 0.56

Tiny Tim: 0.11, 0.88, ..., 0.33

Bob Cratchit: 0.22, 0.99, ..., 0.47

Fred: 0.58, 0.12, ..., 0.67

These vectors encode each entity's position in the graph: which nodes it connects to and which neighborhoods it belongs to

Nodes that are close in the vector space (i.e., have similar vector values) have similar roles or contexts in the graph
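Closeness in the vector space can be measured in several ways; cosine similarity is assumed here as one common choice. The three-dimensional vectors below are made up for the example and are not values produced by GraphRAG or Node2Vec.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def nearest(name, embeddings):
    """Return the other entity whose vector is most similar to `name`'s vector."""
    others = (other for other in embeddings if other != name)
    return max(others, key=lambda other: cosine_similarity(embeddings[name], embeddings[other]))


# Made-up vectors, for illustration only.
embeddings = {
    "Scrooge": [0.9, 0.1, 0.3],
    "Marley": [0.8, 0.2, 0.4],
    "Tiny Tim": [0.1, 0.9, 0.2],
}
print(nearest("Scrooge", embeddings))
```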

https://scrapbox.io/files/66b5b2ae1695df001de09dfd.png

Phase 4: Community Summarization

Objective: Summarize the communities detected in Phase 3 to provide a high-level understanding.

Example Input

Community 1: {Marley, Scrooge, Ghost of Christmas Past}

Example Output

Community Report: "This community highlights the key characters involved in Scrooge's transformation..."

Explanation

Community reports summarize the distinct information within each community

These summaries help in understanding the broader structure of the graph.
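A report of this kind could be produced by handing a community's entities and relationships to a language model. The sketch below only shows how such a request might be assembled. The prompt wording, the function names, and the `llm` callable are all hypothetical and do not reflect GraphRAG's actual prompts or API.

```python
from typing import Callable


def build_community_prompt(members, relationships):
    """Assemble a plain-text prompt listing a community's entities and internal edges."""
    member_set = set(members)
    lines = ["Summarize the following community of entities.", "", "Entities:"]
    lines += [f"- {name}" for name in members]
    lines += ["", "Relationships:"]
    for source, target, description in relationships:
        # Keep only relationships whose endpoints are both inside the community.
        if source in member_set and target in member_set:
            lines.append(f"- {source} -> {target} ({description})")
    return "\n".join(lines)


def write_community_report(members, relationships, llm: Callable[[str], str]) -> str:
    """`llm` is any function that maps a prompt to generated text (hypothetical)."""
    return llm(build_community_prompt(members, relationships))


community_1 = ["Marley", "Scrooge", "Ghost of Christmas Past"]
relationships = [
    ("Marley", "Scrooge", "Mentor"),
    ("Scrooge", "Ghost of Christmas Past", "Visited by"),
    ("Fred", "Scrooge", "Nephew of"),
]
print(build_community_prompt(community_1, relationships))
```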

Phase 5: Document Processing

Objective: Link the original documents to the TextUnits and generate embeddings for the documents

Example Input

Document 1: "A Christmas Carol"

TextUnits from Phase 1

Example Output

Document Embedding: 0.23, 0.67, ..., 0.89 (vector representation)

Document 1 -> TextUnit 1, TextUnit 2, ...

Explanation

The documents are linked to the TextUnits to maintain traceability

Document embeddings provide a vector representation of the entire document, aiding in searching and querying
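Both outputs can be sketched briefly. The example assumes that each TextUnit records the document it came from and that a document vector is the average of its chunk vectors weighted by token count. The field names, function names, and numbers are illustrative assumptions.

```python
def link_documents(text_units):
    """Map each document id to the ids of the TextUnits cut from it."""
    links = {}
    for unit in text_units:
        links.setdefault(unit["document_id"], []).append(unit["id"])
    return links


def average_embedding(vectors, weights):
    """Weighted average of equally sized vectors (weights could be token counts)."""
    if not vectors or len(vectors) != len(weights):
        raise ValueError("need one weight per vector and at least one vector")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total for i in range(dim)]


# Hypothetical TextUnit records and made-up chunk vectors.
text_units = [
    {"id": "TextUnit 1", "document_id": "Document 1"},
    {"id": "TextUnit 2", "document_id": "Document 1"},
]
print(link_documents(text_units))
print(average_embedding([[1.0, 0.0], [0.0, 1.0]], weights=[3, 1]))
```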

CODE

https://github.com/microsoft/graphrag/blob/7376f149d2f9fcb9ee792f0a33616c2a7338491f/graphrag/index/verbs/graph/layout/methods/umap.py

https://github.com/microsoft/graphrag/blob/7376f149d2f9fcb9ee792f0a33616c2a7338491f/graphrag/index/graph/visualization/compute_umap_positions.py