We present a new benchmark for evaluating code search techniques. The benchmark includes the largest evaluation data set currently available for Java, consisting of pairs of natural language queries and code snippets. This data set comprises 287 Stack Overflow question-and-answer pairs drawn from the Stack Exchange Data Dump. Also included is a search corpus of more than 24,000 of the most popular Android repositories on GitHub (ranked by number of stars), indexed over the more than 4.7 million method bodies parsed from those repositories. We also provide a score sheet on the evaluation data set for two models from our recent work. We intend for this data set to serve as a benchmark for evaluating search quality across a variety of code search models.
The search corpus is used to evaluate a code search model: given a natural language query, such as "How can I convert a stack trace to a string?", the model searches the corpus for relevant code snippets (i.e., method bodies). The evaluation data set includes code snippet examples that correctly answer each query, along with the results of two code search models from our recent work, NCS and UNIF, each in two variations. For each model variation, we provide the rank of the first correct answer among the search results for every question in the evaluation data set.
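Once the rank of the first correct answer is known for each question, standard retrieval metrics follow directly. Here is a minimal sketch (the rank list and the handling of unanswered questions are illustrative, not taken from the released score sheet):

```python
def mean_reciprocal_rank(first_correct_ranks):
    """Mean reciprocal rank over all questions.

    first_correct_ranks: 1-based rank of the first correct answer per
    question, or None when no correct answer was retrieved.
    """
    total = 0.0
    for rank in first_correct_ranks:
        if rank is not None:
            total += 1.0 / rank
    return total / len(first_correct_ranks)

def answered_at_k(first_correct_ranks, k):
    """Fraction of questions with a correct answer in the top k results."""
    hits = sum(1 for r in first_correct_ranks if r is not None and r <= k)
    return hits / len(first_correct_ranks)

# Toy score sheet: ranks for four questions; None = no correct hit.
ranks = [1, 3, None, 10]
print(round(mean_reciprocal_rank(ranks), 3))  # 0.358
print(answered_at_k(ranks, 5))                # 0.5
```

Reporting both metrics is useful: MRR rewards placing the correct snippet near the top, while Answered@k measures how often a correct snippet appears at all within the first page of results.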
In recent years, learning the mapping between natural language and code snippets has become a popular area of research. In particular, NCS, UNIF, and others have explored retrieving relevant code snippets for a natural language query, with techniques ranging from word embeddings and classical information retrieval to sophisticated neural networks. Stack Overflow question-and-answer pairs are prime candidates for evaluating these models, because Stack Overflow questions closely reflect what a developer might actually ask. Our data set is not only the largest currently available for Java but also the only one validated against ground-truth answers from Stack Overflow in an automated (and therefore consistent) manner.
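At their core, these approaches rank method bodies by similarity to the query in a shared vector space. The following toy sketch uses raw bag-of-words cosine similarity purely for illustration; models like NCS and UNIF instead compare learned embeddings of the query and the code (the two-method corpus here is made up):

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Crude tokenizer: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    # Cosine similarity between two token-count vectors (Counters).
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(query, corpus):
    """Return method ids from corpus, ranked by similarity to the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(code))), mid)
              for mid, code in corpus]
    return [mid for score, mid in sorted(scored, reverse=True)]

# Hypothetical two-method search corpus.
corpus = [
    ("m1", "String toString(Throwable t) { StringWriter sw = new StringWriter(); }"),
    ("m2", "void hideKeyboard(View v) { InputMethodManager imm = getSystemService(); }"),
]
print(search("convert a stack trace to a string", corpus)[0])  # m1
```

The sketch already exposes the key weakness that motivates learned embeddings: exact token overlap misses semantically related terms (e.g., "stack trace" vs. `Throwable`), which is where word embeddings and neural models improve over plain information retrieval.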
Take, for example, the query "Close/hide the Android Soft Keyboard." One of the top answers on Stack Overflow correctly answers this question, but collecting such question-and-answer pairs by hand is tedious. To simplify this task and make it easier to evaluate a new code search model against a common set of questions, we are releasing our data set to serve as a benchmark for evaluating performance across various models.