A brief explanation of GraphMat and Graphicionado

This post was copy pasted from my previous site, apologies for any inconsistency in formatting, but we all have to start somewhere.

Dec 31, 2022

Overview: GraphMat

Vertex programming

In this model, the entire algorithm can be expressed as operations on a single vertex and its edges. This programming model is generally considered easy to use and not overly restrictive, but implementation performance can vary significantly in practice. GraphMat has been shown to have the best performance amongst many different software graph frameworks on a single node.

What?

Traditionally, when we implement a graph algorithm, we can freely iterate through all vertices, and all edges, and maybe create some additional data structure. Basically, we have a global view of the graph and everything else. In the vertex-centric model, however, we write algorithms from a single vertex's point of view. For any vertex, the only information it has is its neighbor list and its own properties, which vary by an algorithm depending on what needs to be stored!

How?

A vertex-centric algorithm consists of a so-called vertex program. This program is executed by each vertex. It can manipulate any local information, send messages to neighboring vertices, and receive messages from them. An algorithm runs iteratively. In each iteration, the vertex program is executed by the vertices, and messages are exchanged between vertices. The algorithm terminates when no messages are sent from any, vertex indicating a halt.

Why?

This model seems to be more restrictive, and it is. However, it makes distributed graph processing much easier.

The algorithm can be parallelized because all vertices can execute the vertex program in parallel.
With a limited set of operations available to the vertex program, distributed graph processing frameworks can be developed. This would enable all kinds of features you want to see in a distributed system. This also improves the programmer's productivity because we only need to write a few lines of code.
As a bonus, this can enable some additional optimizations because the computation and communication patterns are constrained.

Sparse Matrix vector multiplication

From Wiki: When storing and manipulating sparse matrices on a computer, it is beneficial and often necessary to use specialized algorithms and data structures that take advantage of the sparse structure of the matrix. Specialized computers have been made for sparse matrices, as they are common in the machine learning field. Operations using standard dense-matrix structures and algorithms are slow and inefficient when applied to large sparse matrices as processing and memory are wasted on the zeros. Sparse data is by nature more easily compressed and thus requires significantly less storage. Some very large sparse matrices are infeasible to manipulate using standard dense-matrix algorithms.

\(y = Ax : y \text{ and } x \text{ are dense matrices }, A \text{ is a sparse matrix}\)

function sparse_matrix_mult(A,x) =
  {sum({v * x[i] : (i,v) in row})
   : row in A};

A common operation on sparse matrices is to multiply them by a dense vector. In such an operation, the result is the dot-product of each sparse row of the matrix with the dense vector.

This code takes each index-value pair (i,v) in the sparse row, multiplies v with the ith value of x, and sums the results. The work and depth is easily calculated using the performance rules. If n is the number of non-zeros in the row, then the depth of the computation is the depth of the sum, which is O(log n), and the work is the sum of the work across the elements, which is O(n).

GraphMat Processing Model: Surface

Simplified GraphMat processing model

For each Vertex V
	For each incoming edge E(U,V) from active vertex U
		Res <- Process_Edge (E_weight, U_prop, [OPTIONAL]V_prop)
		V_temp <- Reduce (V_temp, Res)
	End
End
For each Vertex V,
	V_prop <- Apply (V_temp, V_prop, V_const)
End

Fig. 1 Slightly different from GraphMat that integrates Send_message with Apply

Fig. 1 shows a vertex programming example using three essential operations

Process_Edge
Reduce
Apply

As previously stated, in a vertex program, each vertex has an application-specific vertex property that is updated iteratively. For all incoming edges from active vertices (vertices whose states were updated in the last iteration), the corresponding vertex processes each edge separately using Process_Edge and performs a reduction using Reduce to form a single value. Lastly, the reduced value, its vertex property, and a constant vertex property are used to update the current state of the vertex using Apply. Iterations proceed until there are no more active vertices or until a maximum number of iterations is specified.

GraphMat Processing Model: Deep

GraphMat functions by taking vertex programs and mapping them to high performance sparse matrix operations in the backend. We thus get the productivity benefits of a vertex programming framework without sacrificing performance.

GraphMat is based on the idea that graph analytics via vertex programming can be performed through a backend that supports only sparse matrix operations. GraphMat takes graph algorithms written as vertex programs and performs generalized sparse matrix-vector multiplication on them (iteratively in many cases).

This is possible as edge traversals from a set of vertices can be written as a sparse matrix-sparse vector multiplication routine on the graph adjacency matrix (or its transpose). Example to get out-degrees of a graph

Mapping Vertex to Generalized SPMV

It's important to note that while vertex programs can have slightly different semantics, they are all equivalent in terms of expressibility.

As previously stated, a typical vertex program has a state associated with each vertex that is updated iteratively. Each iteration starts with a subset of vertices that are active. A vertex receiving such messages from its neighbors processes each message separately and reduces them to a single value. The reduced value is used to update the current state of the vertex. The iterative process continues for a fixed number of iterations or until no vertices change state. This follows the Bulk-synchronous parallel model.

Bulk Synchronous Parallel (BSP) is a programming model and computation framework for parallel computing. Computation is divided into a sequence of supersteps. In each superstep, a set of processes, running the same code, executes concurrently and creates messages that are sent to other processes. The superstep ends when all the computation in the superstep is complete and all messages have been sent. A barrier synchronization at the end of the superstep ensures that all messages have been transmitted (but not yet delivered to the processes). The next superstep begins with the delivery of all those messages to the processes, that then execute their superstep and send messages that will be delivered at the start of the next superstep. This process continues until all processors vote to halt.
BSP is a model that comes up in the context of large multi-core machines, sharing main memory. M/R comes up in the context of distributing over many small machines, where it is not possible to share memory.

Specifications for a graph program

Each vertex has user-defined property data that is initialized (based on the algorithm used).
A set of vertices are marked active.
The user-defined function Send_Message() reads the vertex data and produces a message object (done for each active vertex)
Process_Message() reads the message object, edge data along which the message arrived, and the destination vertex data and produces a processed message for that edge.
Reduce() is typically a commutative function taking in the processed messages for a vertex and producing a single reduced value.
Apply() reads the reduced value and modifies its vertex data (done for each vertex that receives a message).
Send_Message() can be called to scatter along in and/or out edges.

Example: Single Source Shortest Path (SSSP)

We want to calculate the shortest path to all vertices from a source vertex A.

At a given iteration, we generate a sparse vector using Send_Message() on the active vertices. The message is the shortest distance to the vertex calculated so far.
Process_Message() adds this message to the edge length
Reduce() performs a min operation
Process_Edge() and Reduce() together form a sparse matrix sparse vector multiply operation replacing traditional SPMV multiply operation with addition and SPMV addition with min respectively.

SPMV Implementations

In this framework, a vertex program with Process_Message and Reduce can be written as a generalized SPMV. We can also partition this matrix into many chunks to improve parallelism and load balancing.

Generalized SPMV

function SPMV(Graph G, SparseVector x, Process_Message, Reduce)
	y <- new SparseVector()
	for j in G^T.column_indices do
		if j is present in x then
			for k in G^T .column_j do
				result <- Process_Message(x_j, G.edge_value(k,j), G.getVertexProperty(k))
				y_k <- Reduce(y_k, result)
	return y

Assuming that the graph adjacency matrix transpose G^T is stored in a Compressed Sparse Column (CSC) format

We implement SPMV by traversing the non-zero columns in G^T. If a particular column j has a corresponding non-zero at position j in the sparse vector, then the elements in the column are processed and values accumulated in the output vector y.

Overall framework

The set of active vertices is maintained using a boolean array for performance reasons. In each iteration, this array is scanned to find the active vertices and a sparse vector of messages is generated. Then, a generalized SPMV is performed using this vector. The resulting output vector is used to update the state of vertices. If any vertices change state, they are marked active for the next iteration. The algorithm continues for a user-specified maximum number of iterations or until convergence.

GraphMat overview `x`,`y`, are sparse vectors

function Run_Graph_Program(Graph G, GraphProgram P)
	for i = 1 to MaxIterations do
		for v = 1 to Vertices do
			if v is active then
				x_v <- P.Process_Message, P.Reduce)
				Reset active for all vertices
				for j = 1 to y.length do
					v <- y.getVertex(j)
					old_vertexproperty <- G.getVertexProperty(v)
					G.setVertexProperty(v, y.getValye(j), P.Apply)
					if G.getVertexProperty(v) != old_vertexproperty then 
						v set to active
		if Number of active vertices == 0 then
			break

GraphMat follows an iterative process of Send_Message (lines 3-5), SPMV (line 6), and Apply (lines 8-13). Each such iteration is a superstep. GraphMat does not work on graphs that are too large to fit in the main memory, as it assumes in-memory graph processing.

Overview: Graphicionado

Key characteristics that must be accounted for when considering graph domain accelerators

Graph analytics applications are often memory latency/bandwidth bound
- Graph traversals often require many memory accesses relative to only small amounts of computation.
General purpose processors are not the ideal platform for executing such applications. Their inefficiencies include
- Waste of off-chip memory bandwidth from inefficient memory access granularity
- ineffective on-chip memory usage
- mismatch in execution granularity

Graphicionado: Surface

Graphicionado pipeline is inspired by the vertex programming paradigm coupled with a few reconfigurable blocks.

Graphicionado makes the following contributions

A specialized graph analytics processing hardware pipeline that employs datatype and memory subsystem specializations while offering workload-specific reconfigurable blocks called Graphicionado.
An in-depth tour of the various microarchitecture optimizations to provide performance and energy efficiency, techniques to extract more parallelism, and tactics to support large-scale real-world graphs using slicing and partitioning.

We previously discussed the GraphMat framework, now we delve into Graphicionado: a hardware accelerator specialized for processing graph analytics algorithms. For better programming and flexibility, Graphicionado inherits the advantage of GraphMat. As in software graph processing frameworks, Graphicionado allows users to express specific graph algorithms by defining three computations

Process_Edge
Reduce
Apply

In addition, it transparently handles all necessary data movement and communication on-chip and off-chip to support those operations. Graphicionado overcomes the limitations of GraphMat (and others) by applying specializations on the compute pipeline and the memory subsystem.

Graphicionado: Deep

Graphicionado Base Graph Processing Model

Graphicionado takes an input graph in the coordinate format. In this format, a graph is represented as a list of vertices where each vertex v is associated with a vertex property VProperty, and each edge e is associated with a 3-tuple (srcid, dstid, weight) indexed by the edge id eid. This edge list is sorted according to the srcid and then the dstid. Before the input graph can be fed into the Graphicionado processing pipeline, some preprocessing of the input graph is done: an EdgeIDTable is constructed and stored in memory. This array stores the eid of the first edge of each vertex to allow streaming accesses of the edges starting at a particular vertex. Graphicionado also uses memory to store the vertex property array VProperty, the temporary vertex property array VTempProperty, and the constant vertex property array VConst associated with all vertices.

Processing Phase

for (int i=0; i < ActiveVertexCount; i++) {
	Vertex src = ActiveVertex[i]; //Sequential Vertex Read
	int eid = EdgeIDTable[src.id]; //Edge ID Read
	Edge e = Edges[eid]; //Edge Read
	while (e.srcid == src.id) {
		dst.prop = VProperty[e.dstid]; //[Optional] Random Vertex Read
		VertexProperty res = Process_Edge(e.weight, src.prop, dst.prop);
		VertexProperty temp = VTempProperty[e.dstid]; //Random Vertex Read
		temp = Reduce(temp, res);
		VTempProperty[e.dstid] = temp; //Random Vertex Write
		e = Edges[++eid] //Edge Read
	}
}
//Reset ActiveVertex and ActiveVertexCount

In this phase, all outgoing edges from every active vertex are examined, and the necessary user-defined computations, Process-Edge and Reduce, and updates to the associated vertex properties are calculated and stored in the temporary vertex property array VTempProperty. This phase is only terminated when all active vertices are processed. For some graph analytics workloads such as Collaborative Filtering, not only does the property associated with the source vertex need to be read and manipulated, but also the property associated with the destination vertex, as shown in the pseudocode (line 6),

Apply Phase

for (int i=1; i < TotalVertexCount; i++) {
	VertexProperty vprop = VProperty[i]; //Sequential Vertex Read
	VertexProperty temp = VTempProperty[i]; //Sequential Vertex Read
	VertexProperty vconst = VConst[i]; 
	temp = Apply(vprop, temp, vconst);
	VProperty[i] = temp; //Sequential Vertex Write
	if (temp != vprop) {
		Vertex v;
		v.id = i;
		v.prop = temp;
		ActiveVertex[ActiveVertexCount++] = v; //Sequential Vertex Write
	}
}

In this phase, properties associated with all vertices are updated using the user-defined Apply function. Apply uses input values from the vertex property array VProperty, the constant vertex property array VConst stored in memory and the temporary vertex property array VTempProperty computed from the processing phase to make necessary updates to the vertex property array. The ActiveVertex array keeps track of which active vertices changed their property values in this phase.

Bandwidth

As the number of cores on chips increases, off-chip bandwidth is becoming a big wall. The off-chip bandwidth is divided between traffic to memory and traffic from memory. The traffic from memory is mainly due to cache misses. The traffic to memory occurs due to blocks evicted from the last level cache to be sent down to the memory.

From the last level on-chip cache to memory is due to the writes sent to the memory whenever a block in LLC is evicted and is dirty. The block chosen to be evicted depends on the replacement policy of LLC. the victim block is then written to the memory.

Microarchitecture

Each module corresponds to one or more lines of pseudocode written above, and their physical characteristics are obtained using the implementation methodology described later on.

Modules

The Sequential Vertex Read performs sequential memory read accesses given a starting address. It buffers one cacheline worth of data and passes the vertex properties one at a time to the output queue for the consumption of the next pipe stage. This module is used in the stage P1 of the Processing phase, and A1 and A2 of the Apply phase to read VProperty and VTempProperty.

The Sequential Vertex Write performs sequential memory writes given a starting address. It takes the data to be stored from the input queue and writes it to its internal buffer. When its internal buffer has a cacheline worth of data, it performs a written request to memory. This module is used to store the updated VProperty in stage A4 and ActiveVertex in stage A5 during the Apply phase.

The Random Vertex Read/Write performs random reads and writes given a vertex id. It is used to read the destination VProperty in stage P4, read and write VTempProperty in stage P7 and P9.

The EdgeID Read performs a read from the preconstructed EdgeIdTable given vertex id and outputs the eid of the first edge associated with the input vertex id. Implementations of this module are different for the base and the optimized Graphicionado pipeline.

The Edge Read performs random and sequential reads of the edge data given an eid. The initial access is random but subsequent accesses are sequential. All edge data streamed in are examined for their srcid. The module continues fetching edge data from the memory as long as the srcid fetched edges match the srcid of the edges being processed.

The Atomic Update performs the update of the destination VProperty in three steps. First, the module reads the VProperty in stage P7, then it performs a Reduce computation in stage P8, and finally, it writes the modified VProperty back to memory in stage P9. Since this is a read-modify-write update, the process needs to be atomic for the same vertex (i.e. the same vertex cannot be in more than one of these pipeline stages at the same time). This hardware module enforces the atomicity in the stage P6 by stalling the pipeline and blocking the issue of the vertex id being processed when a violation of the condition is detected. We use a small 4-way associative CAM-like structure to enforce such atomicity (16 4-byte entries).

Custom Modules

As previously described, Graphicionado uses three user-defined custom computations, which express the target graph algorithm. There are several options to implement these custom computations:

Fully reconfigurable fabric (FPGA on-chip/on-package)
Programmable ALUs
Fully custom implementations on chip for a set of algorithms.

Optimizations

There are many algorithms (such as PageRank and CF) that treat all vertices as active. From this, we can make optimizations to the pipeline that satisfy these specific algorithms, essentially all that don't operate with frontiers of active vertices

serious thoughts and empty ideas

Discussion about this post