Open-sourcing PyTorch-BigGraph for faster embeddings of extremely large graphs

April 02, 2019

Working effectively with large graphs is crucial to advancing both the research and applications of artificial intelligence. So Facebook AI has created and is now open-sourcing PyTorch-BigGraph (PBG), a tool that makes it much faster and easier to produce graph embeddings for extremely large graphs — in particular, multi-relation graph embeddings for graphs where the model is too large to fit in memory. PBG is faster than commonly used embedding software and produces embeddings of comparable quality to state-of-the-art models on standard benchmarks. With this new tool, anyone can take a large graph and quickly produce high-quality embeddings using a single machine or multiple machines in parallel.

Related Research

PyTorch-BigGraph: A Large-Scale Graph Embedding System

As an example, we are also releasing the first published embeddings of the full Wikidata graph of 50 million Wikipedia concepts, which serves as structured data for use in the AI research community. The embeddings, which were created with PBG, can help other researchers perform machine learning tasks on Wikidata concepts. PBG is available for download here and the Wikidata embeddings can be found here.

Further, since PBG is written in PyTorch, researchers and engineers can easily swap in their own loss functions, models, and other components, and PBG will be able to compute the gradients and will be scalable automatically.

Creating embeddings of graphs with billions of nodes

Graphs are a core tool to represent many types of data. They can be used to encode networks of related items, such as facts about the world. For example, knowledge bases like Freebase have various entities (e.g., “Stan Lee” and “New York City”) as nodes and edges that describe their relationships (e.g., “was born in”).

Graph embedding methods learn a vector representation of each node in a graph by optimizing the objective that the embeddings for pairs of nodes with edges between them are closer together than pairs of nodes without a shared edge. This is similar to how word embeddings like word2vec are trained on text.

Graph embedding methods are a form of unsupervised learning, in that they learn representations of nodes using only the graph structure and no task-specific “labels” for nodes. Like text embeddings, these representations can be leveraged for a wide variety of downstream tasks.

Embeddings of massive graphs

Modern graphs can be extremely large, with billions of nodes and trillions of edges. Standard graph embedding methods don’t scale well out of the box to operate on very large graphs. There are two challenges for embedding graphs of this size. First, an embedding system must be fast enough to allow for practical research and production uses. With existing methods, for example, training a graph with a trillion edges could take weeks or even years. Memory is a second significant challenge. For example, embedding two billion nodes with 128 float parameters per node would require 1 terabyte of parameters. That exceeds the memory capacity of commodity servers.

PBG uses a block partitioning of the graph to overcome the memory limitations of graph embeddings. Nodes are randomly divided into P partitions that are sized so that two partitions can fit in memory. The edges are then divided into P² buckets based on their source and destination node.

The PBG partitioning scheme for large graphs. Nodes are divided into P partitions that are sized to fit in memory. Edges are divided into buckets based on the partition of their source and destination nodes. In distributed mode, multiple buckets with non-overlapping partitions can be executed in parallel (as shown with blue squares).

Once the nodes and edges are partitioned, training can be performed on one bucket at a time. The training of bucket (i, j) only requires the embeddings for partitions i and j to be stored in memory.

PBG provides two ways to train embeddings of partitioned graph data. In single-machine training, embeddings and edges are swapped out to disk when they are not being used. In distributed training, embeddings are distributed across the memory of multiple machines.

Distributed training

PBG uses PyTorch parallelization primitives to perform distributed training. Since a single model partition can only be used by one machine at a time, embeddings can be trained on up to P/2 machines at a time. Model data is communicated only when a machine needs to switch to a new bucket. For distributed training, shared parameters that represent the different types of edges are synchronized using a classical parameter server model.

Architecture diagram for PBG distributed training. Machines coordinate to train on disjoint buckets using a lock server. Partitioned model parameters are exchanged via a sharded partition server, and shared parameters are updated asynchronously via a sharded parameter server.

Negative sampling

Graph embeddings, like text embeddings, construct random “false” edges as negative training examples along with the true positive edges. This significantly speeds training because only a small percentage of weights must be updated with each new sample. Typically, these negative examples are constructed by “corrupting” true edges with random source or destination nodes. However, we found that several modifications to standard negative sampling were necessary for large graphs.

First, we observed that in traditional methods used to generate graph embeddings, almost all the training time was spent on the negative edges. We took advantage of the linearity of the functional form to reuse a single batch of N random nodes to produce corrupted negative samples for N training edges. In comparison to other embedding methods, this technique allows us to train on many negative examples per true edge at little computational cost.

We also found that to produce embeddings that were useful in a variety of downstream tasks, we found an effective approach was to corrupt edges with a mix of 50 percent nodes sampled uniformly from the nodes, along with 50 percent nodes sampled based on their number of edges.

Finally, we introduce a notion of “entity types,” which restrict how nodes are used for constructing negative samples. For example, consider a graph that contains nodes for songs and artists and genres, and suppose there’s a “produced” relation from artists to songs. If we uniformly sampled source entities for this relation, we would overwhelmingly sample songs (since there are far more songs than artists), but these are not valid potential edges (since a song can only be produced by an artist). PBG restricts which negatives can be constructed based on the entity type of the relation.

Evaluating PyTorch-BigGraph

To evaluate PBG’s performance, we used the publicly available Freebase knowledge graph, which contains more than 120 million nodes and 2.7 billion edges. We also used a smaller subset of the Freebase graph, known as FB15k, which contains 15,000 nodes and 600,000 edges and is commonly used as a benchmark for multi-relation embedding methods.

A t-SNE plot of some of the embeddings trained by PBG for the Freebase knowledge graph. Entities such as countries, numbers, and scientific journals have similar embeddings.

In our paper, we show that PBG performs comparably to state-of-the-art embedding methods for the FB15k dataset.

Performance of embedding methods on a link-prediction task on the FB15k dataset. PBG matches the performance of TransE and ComplEx embedding methods using their models. We measure mean reciprocal rank (MRR) and Hit@10 statistics for link prediction on the FB15k test set. Lacroix et al. achieve higher MRR using extremely large embedding dimension, which we can replicate in PBG but do not report here.

We then use PBG to train embeddings for the full Freebase graph. A dataset of this size can fit on a modern server, but PBG’s partitioned and distributed execution reduces both memory usage and training time. We are releasing the first embeddings of Wikidata, which is a more current knowledge graph of similar data.

PBG’s partitioning scheme reduces memory usage by 88% without degrading model quality. Training time can be reduced using multiple machines in parallel.

We also evaluated PBG embeddings for several publicly available social graph datasets in our paper. We find that PBG outperforms competing methods, and that partitioning and distributed execution decrease memory usage and reduce training time. For knowledge graphs, partitioned or distributed execution makes training more sensitive to hyperparameters and modeling choices. For social graphs, however, embedding quality appears to be insensitive to the partitioning and parallelization choices.

Unlocking the benefits of distributed training for embeddings

PBG allows the entire AI community to train embeddings for large graphs — including knowledge graphs, as well as other large graphs such as graphs of stock transactions, online content, and biological data — without specialized computing resources like GPUs or huge amounts of memory. We also hope that PBG will be a useful tool for smaller companies and organizations that may have large graph datasets but not the tools to apply this data to their ML applications.

While we demonstrate PBG on datasets like Freebase, PBG is designed to work with graphs that are 10-100x larger. We hope that this encourages practitioners to release and experiment with even larger datasets. Recent breakthroughs in computer vision (improving image recognition with deep learning on hashtags) and natural language processing (word2vec, Bert, Elmo) have come as the result of task-agnostic pretraining on massive datasets. We hope that unsupervised learning on massive graphs may eventually lead to better algorithms for inference on graph-structured data.