Neural Code Search Evaluation Dataset

10/3/2019

What the research is:

A new benchmark for evaluating code search techniques. The benchmark includes the largest evaluation dataset currently available for Java, consisting of natural language query and code snippet pairs. This dataset comprises 287 Stack Overflow question-and-answer pairs drawn from the Stack Exchange Data Dump. Also included is a search corpus that contains more than 24,000 of the most popular Android repositories on GitHub (ranked by number of stars) and is indexed using the more than 4.7 million method bodies parsed from these repositories. A score sheet on the evaluation dataset, using two models from our recent work, is also included. We intend for this dataset to serve as a benchmark for evaluating search quality across a variety of code search models.
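
As a rough illustration of how a single evaluation entry is structured, the sketch below pairs a natural language question with an answering code snippet in Java. The class and field names are hypothetical and do not reflect the released dataset's actual schema.

```java
// Hypothetical sketch of one evaluation entry: a Stack Overflow question paired
// with a code snippet that answers it. Field names are illustrative only.
public final class EvaluationEntry {
    public final int questionId;        // Stack Overflow question id
    public final String query;          // natural language question, e.g., the post title
    public final String answerSnippet;  // code snippet that correctly answers the query

    public EvaluationEntry(int questionId, String query, String answerSnippet) {
        this.questionId = questionId;
        this.query = query;
        this.answerSnippet = answerSnippet;
    }
}
```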

How it works:

The search corpus is used to evaluate a code search model. Given a natural language query, such as "How can I convert a stack trace to a string?", the code search model searches the corpus for relevant code snippets (i.e., method bodies). The evaluation dataset includes code snippet examples that correctly answer each query, as well as the results of two code search models from our recent work, NCS and UNIF, each with two variations. For each model variation, we provide the rank of the first correct answer among the search results for each question in our evaluation dataset.
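
To make the example concrete, a good search result for the stack trace query would be a method body along the following lines; this is a minimal sketch of the well-known Stack Overflow answer, not necessarily the exact snippet ranked in our score sheet.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class StackTraces {
    // The kind of method body a code search model should surface for the query
    // "How can I convert a stack trace to a string?"
    public static String toString(Throwable t) {
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw));
        return sw.toString();
    }
}
```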

Why it matters:

In recent years, learning the mapping between natural language and code snippets has become a popular field of research. In particular, NCS, UNIF, and other models have explored retrieving relevant code snippets for a natural language query, with approaches ranging from word embeddings and information retrieval techniques to sophisticated neural networks. Stack Overflow question and code answer pairs are prime candidates for evaluating these models, as Stack Overflow questions effectively represent what a developer may ask. Our dataset is not only the largest currently available for Java but also the only one validated against ground-truth answers from Stack Overflow in an automated (and therefore consistent) manner.

Take, for example, a query for "Close/hide the Android Soft Keyboard." One of the first answers on Stack Overflow correctly answers this question. But collecting these questions can be tedious. To simplify this task and make it easier to evaluate a new code search model on a common set of questions, we are releasing our dataset to serve as a benchmark for evaluating performance across various models.
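
For reference, that widely cited Stack Overflow answer amounts to a method like the one below, shown as a sketch of the commonly accepted approach rather than the exact snippet in the dataset.

```java
import android.app.Activity;
import android.content.Context;
import android.view.View;
import android.view.inputmethod.InputMethodManager;

public class KeyboardUtil {
    // The kind of snippet a top-voted Stack Overflow answer gives for
    // "Close/hide the Android Soft Keyboard".
    public static void hideSoftKeyboard(Activity activity) {
        View view = activity.getCurrentFocus();
        if (view != null) {
            InputMethodManager imm =
                    (InputMethodManager) activity.getSystemService(Context.INPUT_METHOD_SERVICE);
            imm.hideSoftInputFromWindow(view.getWindowToken(), 0);
        }
    }
}
```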

Read the full paper:

Neural Code Search Evaluation Dataset