# Distributed GraphLab

# A Framework for Machine Learning and Data Mining in the Cloud [12]

# 1- INTRODUCTION

This report summarizes the extended multicore GraphLab abstraction introduced in [12] along with critiques, current state-of-the-art, and related work. Systems capable of executing MLDM algorithms efficiently on distributed large clusters (in parallel) were highly needed to cope with the challenges of MLDM problems and techniques. MapReduce [6] [8], Dryad [13], and Pregel [9] were the only existing high-level distributed systems at that time. However, these systems were unable to fulfill the needs of MLDM community, like enabling cloud services fully in a distributed setting.

Therefore, MLDM community came up with a high-level abstraction of GraphLab targeting dynamic, asynchronous, and graph-parallel computation (found in many MLDM algorithms) in the shared memory setting while hiding the underlying complexities of distributed system design.

To implement an efficient distributed execution model while preserving the strict consistency requirements, several methods such as *data versioning *(to reduce network congestion), *pipelined distributed locking *(to decrease the effects of network latency) and *atom graph *(to address the challenges of data locality) are incorporated in GraphLab abstraction. Fault tolerance is added using the classic Chandy-Lamport [7] snapshot algorithm.

# 2- MLDM ALGORITHM PROPERTIES

GraphLab abstraction addresses the following key properties of efficient and parallel MLDM systems.

## 2.1 Graph Structured Computation

Sometimes computation requires modeling dependencies between data. In this way, more signals can be extracted from noisy data.

Graph-parallel abstractions e.g. Pregel [9] and GraphLab [12] naturally express computational dependencies by adopting vertex-centric model i.e.computation run on each vertex as kernels. In Pregel, vertices communicate via messages following the bulk synchronous model. In GraphLab, each vertex can read and write to data on adjacent vertices and edges based on sequential shared memory abstraction. GraphLab abstraction enables users to focus on sequential computation rather than the parallel movement of data (i.e., messaging).

## 2.2 Asynchronous Iterative Computation

Asynchronous systems are more beneficial for many MLDM algorithms e.g.; linear systems converge faster when solved asynchronously [5]. Asynchronous systems update parameters using the most recent parameter values as input. Accelerated convergence of PageRank is shown in fig.1(a). Other data parallel abstractions extended to iterative settings were not supportive for asynchronous computation. Therefore, GraphLab abstraction was designed to efficiently express the asynchronous iterative behavior for advanced MLDM algorithms.

## 2.3 Dynamic Computation

Iterative computation converges asymmetrically in many MLDM algorithms e.g., in parameter optimization and PageRank. Fig.1(b) shows how in PageRank, only 3% of the vertices required more than 10 updates; the rest of them just required a single update to converge. Dynamic computation enabled prioritizations of computations across vertex-kernels to accelerate the convergence of MLDM algorithms.

Fig.1(c) shows accelerated convergence of a popular MLDM algorithm called Loopy Belief propagation using dynamic computation in web spam detection.

GraphLab and Pregel both support dynamic computation while only GraphLab permits prioritizations as well as a pull-based model to retrieve information from adjacent vertices. Note that in this GraphLab abstraction, some of the original GraphLab scheduling requirements are relaxed to enable efficient distributed priority scheduling and FIFIO.

## 2.4 Serializability

It is desirable to have an equivalent serial execution for all parallel executions to achieve correctness and faster convergence. GraphLab allows users to choose the desired level of consistency required for the correctness. The complexity introduced by concurrency gets eliminated allowing the experts to focus more on the model of algorithm instead of complex configuration settings. Fig.1(d) shows the unstable behavior of the *dynamic ALS algorithm *between serializable and non-serializable executions for the Netflix movie recommendation problem.

# 3 DIST. GRAPHLAB ABSTRACTION

The main components of GraphLab abstraction include *Data Graph *to represent user mutable program state, *Update function *to represent the user computation and operation on the data graph, and *Sync operation *to maintain the global aggregates concurrently. To work on the data graph, update function transforms data in small overlapping contexts called scopes.

The program state is stored as a directed graph called **data graph ***G *= *V*, *E*, *D*. Data graph is in control of managing the user-defined data *D*. Any data can be associated with each vertex and edge in the graph. GraphLab abstraction is mutable, independent of edge directions, and static. It is not modifiable during the execution.

**Update Function **is a stateless method having access to modify data within the scope of a vertex and schedule its future execution on other vertices.

• **Input**: Vertex *v *and its scope *Sv *(data stored in *v *and its all adjacent vertices and edges)

• **Output**: New version of the data in the scope and set of vertices *T*

**Update **: *f *(*v*, *Sv *) → (*Sv *, *T *)

After the execution of the update function, modified data *Sv *is written back to the data graph. GraphLab gives users the freedom to read/write data from adjacent vertices and edges in a simple manner by removing the complexity to manage the movement of data like in message passing or data flow models [9][13]. Update function can efficiently express adaptive computation by controlling its scheduling for the returned set of vertices *T*. The expressiveness of dynamic computation actually differentiates the GraphLab from Pregel. In Pregel, messages initiate the update functions which can only access the data in the message while GraphLab completely decouples the scheduling of future computation from the data movement and naturally expresses the pull model.

**Sync Operation & Global Values **Many MLDM algorithms require to maintain a global statistic to describe the data stored in the data graph e.g. tracking of convergence estimators by statistical inference algorithms. Therefore, GraphLab abstraction defines global values that are written by *sync operation *and read by *update functions*. Sync operation contains a finalization phase to support normalization (a common task in MLDM algorithms). In order to maintain updated estimates of the global values, sync operation needs to run continuously in the background which makes serializability of the sync operation an expensive task. Therefore, multiple consistency levels are available for update functions along with the choice of the consistent or inconsistent Sync operation.

## 3.1 GraphLab Execution Model

The graphLab execution model is based on a single loop semantics. Input to the model is data graph *G *= *V*, *E*, *D*, an update function, and an initial set of vertices *T *to be executed. Each step takes a vertex from *T*, update and add to the updated set of vertices *T *for future computations. There is no specific order of execution of vertices besides only one constraint that eventually all the vertices should be executed. As GraphLab allows prioritization, therefore users can assign priorities to vertices in *T*.

## 3.2 Consistency Models

An execution model automatically translated from sequential to parallel in such a way that multiple processors are executing the same loop on the same graph i.e. simultaneous addition and execution of vertices is not considered as a risk-free model. Overlapping computation should not run simultaneously. Therefore, to retain the original semantics of sequential execution several **consistency models **are introduced to work on optimization of parallel execution while maintaining the serializability.

GraphLab’ runtime ensures serializability in three different ways mentioned below.

**Full consistency **in which scope of concurrently updating functions cannot overlap and it has full access to read/write but limited potential parallelism(because concurrently updating vertices must be at least two vertices apart).

**Edge Consistency **has a slightly overlapping scope having read-only access to the adjacent vertices with full read/write access on the vertex and adjacent edges. It increases the parallelism and used by many MLDM algorithms e.g. PageRank.

**Vertex Consistency **allows read-only access to the adjacent vertices and edges with write-only access to the vertex. Update function can run simultaneously on all the vertices in this type of consistency model.

# 4 DISTRIBUTED GRAPHLAB DESIGN

In this section, the shared memory setting design of GraphLab is extended to more challenging distributed settings to discuss the various required techniques. Overview of distributed GraphLab design is illustrated by

**Figure 2: System Overview**

## 4.1 Distributed Data Graph

The key to implementing an efficient distributed data graph is computation, communication, storage, and an appropriate balance between all of them. Therefore, a two-phased partitioning process for load balancing the graph across arbitrary cluster sizes is developed. The first phase partitions the graph into k parts (called **atom**) where k is greater than the number of machines. Atom is a file containing graph generating commands e.g. *AddVertex(100,vdata), AddEdge(2 -> 4,edata)*. Atom also stores the **ghosts **information. Ghosts are set of vertices and edges that are adjacent to the partition boundary. The atom index file contains metadata (connectivity structure and file location) for k atoms. Second phase partitions this meta-graph over the physical machines.

## 4.2 Distributed GraphLab Engines

GraphLab engine is responsible for the execution of *Sync *operation and *Update *function. It also maintains the scheduling set of Vertices *T *and ensures serializability w.r.t to different consistency models. As discussed in section 3.2, performance and expressiveness of the execution model is dependent on the implementation of how vertices get removed from *T*. Two type of engines are introduced to evaluate this trade-off, **Chromatic Engine **to support partial asynchronous execution of a set of vertices *T *and **Locking Engine **a fully asynchronous engine to allow vertex priorities.

*4.2.1* *Chromatic Engine. *Vertex coloring is a classic technique to achieve a serializable parallel execution of a set of dependent tasks (vertices in a graph). It requires a full communication barrier between two color steps. Given a vertex coloring of the data graph, **edge consistency **model can be satisfied by executing all vertices of the same color before going to the next color and running sync operation between these color steps. Changes made to ghost vertices and edges are communicated asynchronously. **Vertex consistency **can be achieved by assigning the same color to all the vertices. And finally, **full consistency **can be achieved by constructing second-order vertex coloring (no vertex shares the same color as any of its distance two neighbors). As an example in MLDM optimization problems, bipartite (two-color) graphs are used.

*4.2.2* *Distributed Locking Engine. The chromatic* engine can not allow sufficient scheduling flexibility. Therefore, the Distributed Locking Engine extends the mutual exclusion technique to overcome this limitation. It associates reader-writer locks on each vertex. Only the local vertices can be updated by each machine. Different consistency models using different protocols can be implemented. All lock and sync requests are pipelined to reduce network latency. A pipeline of vertices is maintained by each machine for which locks are requested but not granted yet. Once the lock acquisition and data synchronization are completed, a vertex is executed. It has been observed in experiments that strong and nearly linear scalability is provided by distributed locking systems. The more the length of the pipeline, the less is the runtime.

## 4.3 Fault Tolerance

Fault tolerance in GraphLab abstraction is achieved by distributed checkpointing in two modes, synchronous and asynchronous checkpointing.

**Synchronous checkpointing **suspends computation to save all modified data since the last checkpoint. While **Asynchronous checkpointing **is based on the Chandy-Lamport snapshot algorithm [7]. The snapshot step becomes an update function in the GraphLab abstraction and is proven to be better than synchronous checkpointing.

## 4.4 System Design

Each machine runs one instance of GraphLab in a distributed setting. GraphLab processes are symmetric and communicate via remote procedural calls. The first process acts as a master and computes the placement of atoms based on the atom’ index. A local scheduler is maintained by each process (for its vertices) and a cache to access the remote data. A distributed consensus algorithm decides when all the schedulers will become empty.

# 5- APPLICATIONS

GraphLab was evaluated using three state-of-the-art MLDM applications. Collaborative filtering for Netflix Movie Recommendation, Named Entity Recognition (NER) using Chromatic Engine and Video Co-segmentation (CoSeg) using distributed Locking Engine. It was found out that GraphLab’s performance is comparable to tailored MPI implementations and it outperforms Hadoop by 20–60x. Also, Netflix, CoSeg, and NER were more compactly expressed by GraphLab abstraction as compared to MapReduce or MPI. (Ref to [12] for details about datasets and clusters specifications).

In **Netflix Movie Recommendation**, ALS [14] algorithm takes a sparse *Users Movies *matrix as input. This matrix contains movie ratings for every user. ALS algorithm computes low-rank matrix factorization. The rank of the matrices is denoted as *d*. Higher the rank, the higher the accuracy with a high computational cost. The high speedup is achieved by varying the values of *d *& corresponding number of cycles per update by adding more machines in incremental order. A fair comparison with MPI implementation and Hadoop for a fixed value of *d *= 20 and a varying number of machines from 4 to 64 has shown the runtime of the experiment in seconds in the following order:

*Hadoop *> *MPI *> *GraphLab*

GraphLab performs 40–60 times faster than Hadoop.

**Video Co-segmentation **automatically identify and cluster Spatio-temporal segments of a video using LBP [10]. It makes use of the distributed locking engine pipelines. Experiments showed that the locking engine is capable of achieving high performance and scalability on a large 10.5 million vertex graph(utilized by application). It resulted in 10x speedup with 16x more machines. Optimal weak scaling provided by the locking engine states that an increase in the size of a graph in proportion to the number of machines does not influence the runtime. Experiments by varying the length of the pipeline showed that the length of the pipeline is directly proportional to the performance and thus compensates for the poor partitioning.

It is concluded that distributed GraphLab abstraction provides excellent performance for video CoSeg tasks by allowing dynamically prioritized scheduling. Pipelining has been proven as an effective way to minimize latency and poor partitioning.

**Named Entity Recognition (NER) **is a technique of Natural Language Processing and information extraction. In experiments, a large amount of web-crawled data is used to count the number of occurrences of noun-phrases in each context. NER problem constructs two sets of vertices corresponding to noun phrases and their contexts. This can be mapped as a bipartite data graph, if a noun-phrase occurs in the context, an edge is drawn between noun-phrase and its context.

Experiments showed that NER is not able to scale like CoSeg and Netflix. It could only achieve a modest 3x improvement using 16x more machines. Such poor scaling is because of large vertex data size (816 bytes), random cut partitioning strategy, and dense connectivity structure, resulting in high communication overhead in each iteration. Network analysis showed that NER utilizes more network bandwidth than Netflix and CoSeg. It saturates with each machine having sending rate over 100MBPS.

# 6- STRENGTHS AND LIMITATIONS

In this section, we summarize a few strengths and limitations of distributed GraphLab abstraction. Application-oriented extensive experiments are considered as main strength of this paper. It naturally exhibits dynamic priority scheduling asynchronously. Fault tolerance using snapshotting schemes is reducing a large amount of communication overhead. The use of distributed locking engine pipelines hides the network latency up-to a high extent. Random vertex-cut partitioning can perform badly in some applications like NER. There is a self-edges issue in GraphLab, it is possible to change the GraphLab system to allow self-edges using flags but it requires significant changes to GraphLab code [3].

# 7- CURRENT STATE-OF-THE-ART

Initially, GraphLab was started in 2009 by Prof. Carlos Guestrin of Carnegie Mellon University as a research project. Later on, it turned into a start-up company named Dato. Its name was changed to *Turi a *few years back and finally, on August 05, 2016, it was acquired by Apple for 200 million dollars.

Currently, *Turi Create*1 supports the development of custom machine learning models. It claims that it is no specifically designed for

machine learning experts. Common ML tasks such as Recommender, Regression, Clustering, and Classifiers can easily be accomplished by Turi Create. It has built-in support for data visualization and it is easy to deploy on cross-platforms.

# 8- RELATED WORK

In this section, we will discuss related distributed graph processing systems including those introduced after GraphLab. In Sec.2 while explaining the core properties of MLDM algorithms, few comparisons of GraphLab with high-level parallel distributed systems [6][13][9] are discussed.

**MapReduce **[6] framework simplifies parallel processing by the map & reduce semantics. It partitions data randomly across machines. MapReduce implementation is not suitable for iterative Graph algorithms due to the high I/O cost of data shuffling at every iteration.

**Pregel **[9] differs from GraphLab in terms of the expressiveness of dynamic computations. Update function in Pregel is initiated by messages and it can only access the data associated with each message, while GraphLab decouples the data movement from future scheduling completely. Also, Pregel doesn’t support the asynchronous properties of MLDM algorithms.

**Giraph **[1] (An open-source implementation of Pregel[9]) is a vertex-centric BSP system. It uses the edge-cut approach to partition data randomly. Similar to GraphLab it keeps all data in memory. Giraph API has a *Compute *function that updates the state of the vertex based on its own or neighbor’s data. Giraph is found to be very competitive with GraphLab when GraphLab runs for a fixed number of iterations using random partitioning. GraphLab wins Giraph over large clusters.

**Dryad **[13] construct a data flow graph by combining computational vertices with communication channels and runs the data flow by communicating through TCP pipes and shared-memory FIFOs. It is not based on vertex-centric computations.

Graph structured databases like **Neo4J **[2] mainly focus on efficient CRUD operations of graph-structured data while GraphLab focuses on iterative graph-structured computation.

**HaLoop **[15] (A modified MapReduce System) minimizes data shuffling and network utilization after the very first iteration. It makes master node loop-aware to reduce network communication. Slave nodes contain a special module to cache (on disk) and index loop-invariant data (used in iterations).

The extended implementation of Spark abstraction known as **GraphX **[11] is designed for iterative graph operations. Similar to GraphLab abstraction it uses a vertex-cut partitioning strategy. It allows the developer to decide on what data portions to cache. GraphX is not efficient when experimenting on applications requiring a large number of iterations [3].

**Vertica **[4] is a relational graph processing system that is not competitive to existing distributed graph processing systems due to high I/O and network cost overhead. It is found that this overhead increases with the increase in cluster size.

# 9- CONCLUSION

We summarized the distributed GraphLab abstraction and other related distributed graph processing systems.

To address the key properties exhibited by most of the MLDM algorithms shared memory GraphLab abstraction is extended to the distributed setting by improving the execution model, relaxing some scheduling requirements, integrating a new distributed data graph, introducing new fault-tolerance schemes and execution engines.

Two-stage partitioning scheme is introduced in distributed data graph design to achieve efficient load-balancing and distributed ingress for varying size clusters. A partially synchronous chromatic engine and a fully synchronous distributed locking engine are designed.

Distributed GraphLab abstraction was implemented in C++ and evaluated on Netflix, Named Entity Recognition, and Video CoSeg- mentation using real data. It was found by extensive experiments that distributed GraphLab abstraction outperforms Hadoop (20–60x) and compete with other MPI implementations.

# REFERENCES

[1] 2010. Giraph. http://giraph.apache.org

[2] 2011. Neo4j. https://neo4j.com

[3] 2018. Analysis of Distributed Graph Systems. https://arxiv.org/abs/1806.08082

[4] Malu Castellanos Meichun Hsu Alekh Jindal, Samuel Madden. [n. d.]. Graph analytics using Vertica relational database. *IEEE *([n. d.]).

[5] D. P. Bertsekas and J. N. Tsitsiklis. 1989. Parallel and distributed computation: numerical methods. *Prentice-Hall Inc. *(1989).

[6] Y.-A. Lin Y. Yu G. Bradski A. Y. Ng C.-T. Chu, S. K. Kim. 2006. Map-reduce for machine learning on multicore. *NIPS *(2006), 281–288.

[7] K. M. Chandy and L. Lamport. 1985. Distributed Snapshots: Determining global states of distributed systems. *ACM Trans *3, 1 (1985), 63–75.

[8] J. Dean and S. Ghemawat. 2004. simplified data processing on large clusters. *OSDI *(2004).

[9] A. J. Bik J. Dehnert I. Horn N. Leiser G. Malewicz, M. H. Austern and G. Czajkowski. 2010. Pregel: a system for large-scale graph processing. *SIGMOD *(November 2010), 135–146.

[10] Y. Low J. Gonzalez and C. Guestrin. 2009. Belief Propagation: Residual splash for optimally parallelizing belief propagation. *AISTATS *5 (2009), 177–184.

[11] Ankur Dave Daniel Crankshaw Michael J. Franklin Joseph E. Gonzalez, Reynold S. Xin, and Ion Stoica. 2014. Graph processing in a distributed dataflow framework. *Proc. 11th USENIX Symp *(2014).

[12] GONZALEZ J. KYROLA A. BICKSON GUESTRIN C. LOW, Y. and J. M. HELLERSTEIN. 2012. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. *PVLDB*(2012).

[13] Y. Yu A. Birrell M. Isard, M. Budiu, and D.Fetterly. 2007. distributed data-parallel programs from sequential building blocks. *EuroSys *(2007), 59âĂŞ72.

[14] R. Schreiber Y. Zhou, D. Wilkinson, and R.Pan. 2008. Large-scale parallel collaborative filtering for the Netflix prize. *AAIM *(2008), 337âĂŞ348.

[15] Magdalena Balazinska Yingyi Bu, Bill Howe, and Michael D. Ernst. 2012. The HaLoop approach to large-scale iterative data analysis. *VLDB *(2012), 169–190.