NLP

Developer Tools

Neural Code Search: ML-based code search using natural language queries

6/24/2019

Engineers work best when they can easily find code examples to guide them on particular coding tasks. For some questions — for example, “How to programmatically close or hide the Android soft keyboard?” — information is readily available from popular resources like Stack Overflow. But questions specific to proprietary code or APIs (or code written in less common programming languages) need a different solution, since they are not typically discussed in those forums.

To address this need, we’ve developed a code search tool that applies natural language processing (NLP) and information retrieval (IR) techniques directly to source code text. This tool, called Neural Code Search (NCS), accepts natural language queries and returns relevant code fragments retrieved directly from the code corpus. Our premise is that with the availability of large codebases, code fragments related to a developer's query are likely to be discoverable somewhere within existing large codebases. In this blog, we introduce two models that accomplish this task:

NCS is an unsupervised model that combines NLP and IR techniques.
UNIF is an extension of NCS that uses a supervised neural network model to improve performance when good supervision data is available for training.

Leveraging open source Facebook AI tools (including fastText, FAISS, and PyTorch), NCS and UNIF both represent natural language queries and code snippets as vectors, and then train a network such that the vector representations of semantically similar code snippets and queries are close together in the vector space. With these models, we are able to find code snippets directly from code corpus in order to efficiently answer engineers’ questions. To evaluate NCS and UNIF, we used our newly created dataset of the public queries on Stack Overflow. Our models can correctly answer questions from this dataset, such as:

How to close / hide the Android soft keyboard?
How to convert a bitmap to drawable in Android?
How to delete a whole folder and content?
How to handle back button in activity?

NCS's performance demonstrates that a relatively simple approach can work well in the domain of source code. UNIF’s performance shows that a simple supervised learning approach can offer significant additional benefits when labeled data is available. In conjunction with other Facebook-built systems, like Aroma and Getafix, this project provides our engineers with an extensive and growing ML-powered toolkit to help them write and manage code more effectively.

How NCS uses embeddings

The NCS model captures program semantics (in this case, the intent of the code snippet) by using embeddings — continuous vector representations that, when computed appropriately, have the desirable property of putting semantically similar entities close to each other in the vector space. In the example below, there are two distinct method bodies that both pertain to closing or hiding the Android soft keyboard (the first question above). Since they share similar semantic meanings, even if they do not share the exact same lines of code, they are represented by points in the vector space that are close to each other.

This graphic shows how similar code snippets are clustered in the vector space.

(This graphic shows publicly available code from GitHub shared here and here under a GNU General Public License.)

We use this concept to build the NCS model. At a high level, each code fragment during model generation is embedded in a vector space at a method-level granularity. Once the model is built, a given query is mapped to the same vector space, and vector distance is used to estimate the relevance of code fragments to the query. This section will describe the model generation and search retrieval pipeline, as shown in the diagram below, in more detail.

This graphic shows NCS's overall model generation and search retrieval process.

Model generation

To generate a model, NCS must extract words, build word embeddings, and then build document embeddings. (Here “document” refers to a method body.)

Extracting words

NCS extracts words from source code and tokenizes them to produce a linear sequence of words.

(The sample data shown here is from publicly available code shared on GitHub here under the Apache 2.0 license.)

To produce a vector that represents a method body, we treat the source code as text and extract from the following syntactic categories: method names, method invocations, enums, string literals, and comments. Then we tokenize it based on standard English conventions (e.g. white space, punctuation) and punctuation relevant to code (e.g. snake and camel case). For example, for the method body “pxToDp” in the graphic above, the source code can be treated as the collection of words: “converts pixel in dp px to dp get resources get display metrics”.

For each method body in our corpus, we can tokenize the source code in this manner and learn an embedding for each word. After this step, the list of words we extracted for each method body resembles a natural language document.

Building word embeddings

We use fastText to build word embeddings for all the words in the vocabulary corpus. FastText computes vector representations using a two-layer dense neural network that can be trained unsupervised on a large corpus. In particular, we use the skip-gram model, where the embedding for a target token is used to predict embeddings of context tokens within a fixed window size. In our example above, given the embedding for the token “dp” and a window size 2, the skip-gram model would learn to predict the embeddings for the tokens “pixel,” “in,” “px,” and “to.” The objective is to learn an embedding matrix T_q ∈ R^|V_c|×d where |V_c | is the size of the corpus, d is the word embedding dimension, and the k^th row in T is the embedding for the k^th word in V_c.

In this matrix, two vector representations are close together if the corresponding words often occur in similar contexts. We use the converse of this statement to help define semantic relation: words with vectors that are closer together should have related meanings. This is called the distributional hypothesis in the NLP literature, and we believe the same concept holds for source text.

Building document embeddings

The next step is to express the overall intent of a method body using the words that are present in the method body. To do so, we take a weighted average of the word embedding vectors for the set of words in the method body. We call this a document embedding.

Here, d is a set of words representing a method body, v_w is the fastText word embedding for the word w, C is the corpus containing all documents, and u is a normalizing function.

We use term frequency–inverse document frequency function (TF–IDF), which assigns a weight for a given word in a given document. Its goal is to highlight the most representative words in the document — a word has a higher weight if it appears frequently in the document, but it’s also penalized if it appears in too many documents in the corpus.

At the end of this step, we have an index of each method body in the corpus to its document vector representation, and the model generation is complete.

Search retrieval

The search query is expressed as natural language sentences, such as “Close/hide soft keyboard” or “How to create a dialog without title”. We tokenize the query in the same fashion as for source code, and using the same fastText embedding matrix T, we simply average the vector representations of words to create a document embedding for the query sentence; out-of-vocab words are dropped. We then use a standard similarity search algorithm, FAISS, to find the document vectors with closest cosine distance to the query, and return the top_n results (plus some post-processed ranking, which is explained further in the paper).

The two method bodies and the query are mapped to points close together in the same vector space. This means that the query and these two method bodies are semantically similar and relevant for the query.

Results

We evaluated NCS’s performance using Stack Overflow questions, with the title as the query and a code snippet from the answers as the desired code answer. Given a query, we measure whether our model was able to retrieve from a collection of GitHub repositories a correct answer in the top 1, 5, and 10 results (labeled Answered@1, 5, 10 in the table below, respectively). We also report mean reciprocal rank (MRR) to measure at which n^th result NCS was able to answer the question correctly. We explain the experimental setup in more detail below. Out of the 287 questions in the Stack Overflow evaluation dataset we created, NCS answered 175 questions correctly in the top 10 results; this equates to over 60% of the entire dataset. We also compared NCS’s performance with that of a traditional IR technique, BM25. As shown in this table, NCS outperformed BM25.

An example of a question that NCS answers well is “Start Android Market from app”, where the top result returned form NCS is:

private void showMarketAppIn() {
  try {
    startActivity(new Intent(Intent.ACTION_VIEW, Uri.parse("market://details?id=" + BuildConfig.APPLICATION_ID)));
  } catch (ActivityNotFoundException e) {
    startActivity(new Intent(Intent.ACTION_VIEW, Uri.parse("http://play.google.com/store/apps/details?id="
                + BuildConfig.APPLICATION_ID)));
  }
}

(The snippet is from publicly available code on Github shared here under the MIT license.)

UNIF: Exploring a supervised approach

The crux of NCS is its use of word embeddings. Since NCS is an unsupervised model, it has several significant advantages: It can learn directly over the search corpus, and it can be trained quickly and easily. NCS assumes that the words in the query come from the same domain as the words extracted from source code, as the queries and the code snippets are both mapped to the same vector space. However, this is not always the case. For example, none of the words in the query “Get free space on internal memory”, appear in the code snippet below. What we want is to map the query words “free space” to the word “available” in the code.

File path = Environment.getDataDirectory();
StatFs stat = new StatFs(path.getPath());
long blockSize = stat.getBlockSize();
long availableBlocks = stat.getAvailableBlocks();
return Formatter.formatFileSize(this, availableBlocks * blockSize);

(The snippet was taken from publicly available code on Stack Overflow shared here under the CC-By-SA 3.0 license.)

From a dataset of 14,005 Stack Overflow posts, we analyzed the overlap of the words in the queries with the words in the source code. We found that out of 13,972 unique words in the queries, fewer than half (6,072 words) also existed in the source code domain. This shows that if a query contains words that do not exist in source code, our model is not going to be effective in retrieving the correct method, since we drop out-of-vocab words. This observation motivated us to explore supervised learning to map words in the query to words in source code.

We decided to experiment with UNIF, a supervised minimal extension of the NCS technique, in order to bridge the gap between natural language words and source code words. In this model, we use supervised learning to modify the word embedding matrix T to produce two embedding matrices, T_c and T_q, for code and query tokens, respectively. We also replace the TF-IDF weighting of the code token embeddings with a learned attention-based weighing scheme.

How the UNIF model works

We trained UNIF on the same collection of (c, q) data points as with NCS, where c, and q represent code and query tokens, respectively. (See the section below for details about this dataset.) The model architecture can be described as follows. Let T_q ∈ R^|V_q^|×d and T_c ∈ R^|V_c^|×d be two embedding matrices mapping each word from the natural language description and code tokens, respectively, to a vector of length d (V_q is the query vocabulary corpus, and V_c is the code vocabulary corpus.) The two matrices are initialized using the same initial weights, T, and modified separately during training (counterpart to fastText). To combine each bag of code token vectors into a document vector, we use an attention mechanism to compute the weighted average. The attention weights, a_c ∈ R^d, is a d-dimensional vector that is learned during training, and acts as a counterpart to TF-IDF. Given a bag of code word embedding vector {e₁, ....., e_n}, the attention weight a_i for each e_i is computed as follows:

The document vector is then computed as the sum of the word embedding vectors weighted by the attention weights:

To create a query document vector, e_q, we compute a simple average of the query word embedding, similar to the approach in NCS. Our training process learns parameters T_q, T_c, and a_c during classic back-propagation.

This graphic shows the UNIF network.

Retrieval works in the same fashion as it does for NCS. Given a query, we represent it as a document vector using the approach explained above, and use FAISS to find the document vectors with the closest cosine distance to the query. (In principal, UNIF would also benefit from post-processed ranking, as NCS does.)

Comparing results with NCS

We compared NCS and UNIF against the Stack Overflow evaluation dataset, to see whether the model correctly answered the query in the top 1, 5, 10 results, along with the MRR score. The table below shows that UNIF improves the number of questions answered significantly over NCS.

This highlights the impressive search performance that a supervised technique could deliver if given access to an ideal training corpus. For example, with the search query “How to exit from the application and show the home screen?”, NCS returned:

public void showHomeScreenDialog(View view) {
  Intent nextScreen = new Intent(getApplicationContext(), HomeScreenActivity.class);
  startActivity(nextScreen);
}

UNIF provided a more relevant code snippet:

public void clickExit(MenuItem item) {
  Intent intent=new Intent(Intent.ACTION_MAIN);
  intent.addCategory(Intent.CATEGORY_HOME);
  intent.setFlags(Intent.FLAG_ACTIVITY_NEW_TASK);
  metr.stop();
  startActivity(intent);
  finish();
}

(The first snippet is from publicly available code shared to GitHub here under the Apache 2.0 license. The second was shared to GitHub here under the MIT license.)

Another example is with the query “How to get the ActionBar height?” NCS returned:

public int getActionBarHeight() {
  return mActionBarHeight;
}

UNIF returned a relevant code snippet:

public static int getActionBarHeightPixel(Context context) {
  TypedValue tv = new TypedValue();
  if (context.getTheme().resolveAttribute(android.R.attr.actionBarSize, tv, true)) {
    return TypedValue.complexToDimensionPixelSize(tv.data,
              context.getResources().getDisplayMetrics());
  } else if (context.getTheme().resolveAttribute(R.attr.actionBarSize, tv, true)) {
    return TypedValue.complexToDimensionPixelSize(tv.data,
              context.getResources().getDisplayMetrics());
  } else {
    return 0;
  }
}

(The first snippet is from publicly available code shared to GitHub here under the Apache 2.0 license. The second was shared to GitHub here also under the Apache 2.0 license.)

Additional data on UNIF’s performance is available in this paper.

Building effective ML-powered tools

One of the keys to creating a successful machine learning tool is obtaining a high-quality training dataset. For our models, we leveraged the availability of the large open source codebase from GitHub. Furthermore, having a high-quality evaluation dataset is equally important for assessing the quality of the model. When exploring a relatively new realm of research, such as code search, the lack of available evaluation datasets can limit our ability to assess the performance across various code search tools. So to help benchmark performance in this area, we curated a dataset of 287 publicly available data points from Stack Overflow, where each data point consists of a natural language query and a “golden” code snippet answer.

Creating a training dataset

We trained our unsupervised model, NCS, directly over the search corpus by picking the 26,109 most popular Android projects on GitHub. This also became the search corpus from which NCS returns code snippets.

To incorporate supervision for our UNIF model, we needed an aligned pair of data points to learn the mapping. We trained UNIF on a collection of (c, q) data points, where q is the natural language description or query, and c is the corresponding code snippet. We obtained this dataset by extracting Stack Overflow question titles and code snippets from data publicly released by Stack Exchange under a CC-BY-SA 3.0 license. After filtering the questions using a variety of heuristics — the code snippet must have an Android tag, for example, or must have a method call, or must not contain XML tags — we ended up with 451,000 training data points. This dataset is disjoint from the evaluation queries. (This reflects a best case availability of a training dataset; as we note in the paper, a docstring based training did not give good results.)

The evaluation dataset

We evaluated the effectiveness of NCS using Stack Overflow questions. Stack Overflow is a useful resource for evaluation because it contains a large number of natural language queries, along with upvoted answers that can be treated as an acceptable response. Given a Stack Overflow question title as the query, NCS retrieves a list of methods from GitHub. In our work creating and refining NCS, we considered a search successful if at least one of the top_n results from NCS matched the method described in the Stack Overflow answer code snippet. (For our evaluation, we calculated using top₁, top₅, and top₁₀.)

We chose the Stack Overflow questions using a script with the following criteria: 1) the question included “Android” and “Java” tags, 2) there was an upvoted code answer, and 3) the ground truth code snippet had at least one match in our corpus of GitHub Android repos. With some manual processing to make sure the questions were acceptable, we obtained a dataset of 287 questions.

Using Aroma for automatic evaluation

We found that manually assessing the correctness of search results can be difficult to do in a reproducible fashion, as opinions can vary across authors and people. We decided instead to implement an automated evaluation pipeline using Aroma. Aroma gives a similarity score between search results and a ground truth code snippet to assess whether a query was correctly answered if the score passes a threshold value. With this pipeline, we can evaluate the models in a reproducible fashion. We use code answers found on Stack Overflow as the ground truth for evaluation.

We used the above evaluation procedure to not only compare UNIF and NCS, but also compare UNIF against some of the other code search solutions in the literature. (Details on the comparisons are available here.)

An expanding ML-powered toolkit for writing and editing code

With the widespread availability of large code repositories in production today, machine learning can extract patterns and insights that can boost engineer productivity. At Facebook, these machine learning tools include code-to-code recommendation with Aroma and automatic bug fixing with Getafix. NCS and UNIF are examples of code search models that can bridge the gap between natural language queries and finding relevant code snippets. With tools such as these, engineers will be able to easily find and use relevant code snippets, even when working with proprietary source code or writing in less common programming languages. In the future, we hope to explore other deep learning models in the realm of synthesis to further boost engineers’ productivity.