April 02, 2019
Written byAdam Lerer, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit Bose, Alex Peysakhovich
Adam Lerer, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit Bose, Alex Peysakhovich
Working effectively with large graphs is crucial to advancing both the research and applications of artificial intelligence. So Facebook AI has created and is now open-sourcing PyTorch-BigGraph (PBG), a tool that makes it much faster and easier to produce graph embeddings for extremely large graphs — in particular, multi-relation graph embeddings for graphs where the model is too large to fit in memory. PBG is faster than commonly used embedding software and produces embeddings of comparable quality to state-of-the-art models on standard benchmarks. With this new tool, anyone can take a large graph and quickly produce high-quality embeddings using a single machine or multiple machines in parallel.
As an example, we are also releasing the first published embeddings of the full Wikidata graph of 50 million Wikipedia concepts, which serves as structured data for use in the AI research community. The embeddings, which were created with PBG, can help other researchers perform machine learning tasks on Wikidata concepts. PBG is available for download here and the Wikidata embeddings can be found here.
Further, since PBG is written in PyTorch, researchers and engineers can easily swap in their own loss functions, models, and other components, and PBG will be able to compute the gradients and will be scalable automatically.
Graphs are a core tool to represent many types of data. They can be used to encode networks of related items, such as facts about the world. For example, knowledge bases like Freebase have various entities (e.g., “Stan Lee” and “New York City”) as nodes and edges that describe their relationships (e.g., “was born in”).
Graph embedding methods learn a vector representation of each node in a graph by optimizing the objective that the embeddings for pairs of nodes with edges between them are closer together than pairs of nodes without a shared edge. This is similar to how word embeddings like word2vec are trained on text.
Graph embedding methods are a form of unsupervised learning, in that they learn representations of nodes using only the graph structure and no task-specific “labels” for nodes. Like text embeddings, these representations can be leveraged for a wide variety of downstream tasks.
Modern graphs can be extremely large, with billions of nodes and trillions of edges. Standard graph embedding methods don’t scale well out of the box to operate on very large graphs. There are two challenges for embedding graphs of this size. First, an embedding system must be fast enough to allow for practical research and production uses. With existing methods, for example, training a graph with a trillion edges could take weeks or even years. Memory is a second significant challenge. For example, embedding two billion nodes with 128 float parameters per node would require 1 terabyte of parameters. That exceeds the memory capacity of commodity servers.
PBG uses a block partitioning of the graph to overcome the memory limitations of graph embeddings. Nodes are randomly divided into P partitions that are sized so that two partitions can fit in memory. The edges are then divided into P2 buckets based on their source and destination node.
Once the nodes and edges are partitioned, training can be performed on one bucket at a time. The training of bucket (i, j) only requires the embeddings for partitions i and j to be stored in memory.
PBG provides two ways to train embeddings of partitioned graph data. In single-machine training, embeddings and edges are swapped out to disk when they are not being used. In distributed training, embeddings are distributed across the memory of multiple machines.
PBG uses PyTorch parallelization primitives to perform distributed training. Since a single model partition can only be used by one machine at a time, embeddings can be trained on up to P/2 machines at a time. Model data is communicated only when a machine needs to switch to a new bucket. For distributed training, shared parameters that represent the different types of edges are synchronized using a classical parameter server model.
Graph embeddings, like text embeddings, construct random “false” edges as negative training examples along with the true positive edges. This significantly speeds training because only a small percentage of weights must be updated with each new sample. Typically, these negative examples are constructed by “corrupting” true edges with random source or destination nodes. However, we found that several modifications to standard negative sampling were necessary for large graphs.
First, we observed that in traditional methods used to generate graph embeddings, almost all the training time was spent on the negative edges. We took advantage of the linearity of the functional form to reuse a single batch of N random nodes to produce corrupted negative samples for N training edges. In comparison to other embedding methods, this technique allows us to train on many negative examples per true edge at little computational cost.
We also found that to produce embeddings that were useful in a variety of downstream tasks, we found an effective approach was to corrupt edges with a mix of 50 percent nodes sampled uniformly from the nodes, along with 50 percent nodes sampled based on their number of edges.
Finally, we introduce a notion of “entity types,” which restrict how nodes are used for constructing negative samples. For example, consider a graph that contains nodes for songs and artists and genres, and suppose there’s a “produced” relation from artists to songs. If we uniformly sampled source entities for this relation, we would overwhelmingly sample songs (since there are far more songs than artists), but these are not valid potential edges (since a song can only be produced by an artist). PBG restricts which negatives can be constructed based on the entity type of the relation.
To evaluate PBG’s performance, we used the publicly available Freebase knowledge graph, which contains more than 120 million nodes and 2.7 billion edges. We also used a smaller subset of the Freebase graph, known as FB15k, which contains 15,000 nodes and 600,000 edges and is commonly used as a benchmark for multi-relation embedding methods.
In our paper, we show that PBG performs comparably to state-of-the-art embedding methods for the FB15k data set.
We then use PBG to train embeddings for the full Freebase graph. A data set of this size can fit on a modern server, but PBG’s partitioned and distributed execution reduces both memory usage and training time. We are releasing the first embeddings of Wikidata, which is a more current knowledge graph of similar data.
We also evaluated PBG embeddings for several publicly available social graph data sets in our paper. We find that PBG outperforms competing methods, and that partitioning and distributed execution decrease memory usage and reduce training time. For knowledge graphs, partitioned or distributed execution makes training more sensitive to hyperparameters and modeling choices. For social graphs, however, embedding quality appears to be insensitive to the partitioning and parallelization choices.
PBG allows the entire AI community to train embeddings for large graphs — including knowledge graphs, as well as other large graphs such as graphs of stock transactions, online content, and biological data — without specialized computing resources like GPUs or huge amounts of memory. We also hope that PBG will be a useful tool for smaller companies and organizations that may have large graph data sets but not the tools to apply this data to their ML applications.
While we demonstrate PBG on data sets like Freebase, PBG is designed to work with graphs that are 10-100x larger. We hope that this encourages practitioners to release and experiment with even larger data sets. Recent breakthroughs in computer vision (improving image recognition with deep learning on hashtags) and natural language processing (word2vec, Bert, Elmo) have come as the result of task-agnostic pretraining on massive data sets. We hope that unsupervised learning on massive graphs may eventually lead to better algorithms for inference on graph-structured data.
Research Assistant, CIFRE