For too long, the AI research community hasn’t consistently and comprehensively shown its work — sometimes for good reasons, such as proprietary code or data privacy concerns. But in many ways, this habit is stalling the industry instead of propelling it.
It’s difficult, if not impossible, to reproduce many state-of-the-art machine learning results without access to information that was crucial to the original research, including the datasets, the software environment, and the randomization controls (such as random seeds) that were used. Yet study after study has shown that AI researchers typically don’t share enough information to enable reproducibility, and that merely describing the experiments may not be enough to show that the findings themselves are robust. Last year, one survey of 400 algorithms presented at leading AI conferences found that just 6 percent of the presentations included the algorithm’s code, and only one third included test datasets. An earlier survey by the journal Nature of 1,576 researchers, published in 2016, revealed that more than 70 percent had tried and failed to reproduce others’ experiments. More than half couldn’t even reproduce their own.
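To see why randomization controls matter, consider this minimal sketch (plain Python standard library, not any particular lab’s code): fixing the random seed makes a stochastic computation exactly repeatable, while omitting it makes each run different.

```python
import random

def noisy_scores(seed, n=5):
    """Simulate a stochastic experiment (e.g., random weight initialization)."""
    rng = random.Random(seed)  # a seeded generator yields a deterministic stream
    return [round(rng.gauss(0.0, 1.0), 6) for _ in range(n)]

# The same seed reproduces the run exactly; a different seed does not.
run_a = noisy_scores(seed=42)
run_b = noisy_scores(seed=42)
run_c = noisy_scores(seed=43)
assert run_a == run_b   # reproducible: identical numbers, run after run
assert run_a != run_c   # unreported seed: a reader cannot match the results
```

Real experiments add further sources of nondeterminism (GPU kernels, data shuffling, library versions), which is why the seed alone is necessary but not sufficient.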
I saw the reproducibility problem firsthand a few years ago while helping my students at McGill University design a new algorithm. They were attempting to improve another lab’s model for training reinforcement learning policies, but it wasn’t working. After much digging into papers, reading code, and communicating with the authors, they found that some “creative manipulations” were necessary to approach the published results. As I watched them painstakingly re-create someone else’s work and waste time guessing what other groups might have done to reach their numbers, it became very clear that the community wasn’t doing enough to share its findings in a way that was efficient and sound.
To make AI reproducibility both practical and effective, I helped introduce the first Machine Learning Reproducibility Checklist, presented at the 2018 Conference on Neural Information Processing Systems (NeurIPS). Essentially, the checklist is a road map of where the work is and how it arrived there, so others can test and replicate it. It can also serve as a guide to our expectations for scientific excellence in a community that is growing quickly. I am now supporting the use of this checklist as part of the NeurIPS 2019 paper submission process. In parallel, I am also co-organizing the accompanying 2019 reproducibility challenge at NeurIPS. The challenge invites members of the community to reproduce papers accepted at the conference and to report on their findings via the OpenReview platform. Challenge participants from all over the world can take advantage of the fact that NeurIPS authors were encouraged to submit code and data along with their papers.
In many ways, reproducibility is a process, and the checklist is just a tool to facilitate this process. No one is demanding that a company release all its proprietary code and data. There are myriad other ways to support reproducibility. Data can be aggregated or artificially generated so that it is anonymized or doesn’t otherwise compromise confidentiality. Companies can release partial code that won’t run but can be read. The goal is to be as transparent and complete as possible with respect to the scientific methodology and the results of the work. Indeed, the checklist seeks to make the industry more collaborative, as well as eliminate confusion about methodology, by making the science more transparent. Plus, by adding reproducibility to the process, it also promotes the use of robust experimental workflows, potentially reducing unintentional errors.
The checklist is showing promising early results. This year, NeurIPS required researchers to include a completed checklist with all 6,743 submissions. Authors of accepted papers filled out the checklist again when submitting their final camera-ready paper in late October. Overall, 75 percent of papers accepted at NeurIPS this year included a link to code in their final camera-ready submission, compared to 50 percent of papers at NeurIPS 2018 and 67 percent at ICML 2019.
We are still analyzing the data from those checklists and correlating the answers with what we have observed in the submission and review process. From these preliminary results, we’ve noticed that 89 percent of submissions stated that they were providing a description of how their experiments were run. (It’s worth noting that 9 percent of all submissions fell under the “Theory” category and are thus unlikely to have an experimental component.) We also found that of the researchers submitting papers to NeurIPS, 36 percent thought that providing error bars was “not applicable” to their paper. Finally, the data shows a strong correlation (p = 1e-8) between code availability and higher reviewer scores, though more study would be required before drawing conclusions. Our analysis, carried out in collaboration with Vincent Larivière and Philippe Vincent-Lamarre from the Université de Montréal, is ongoing, and we expect to share more findings in the coming months to help improve the community’s publishing and reviewing practices.
These encouraging results complement several other efforts to make reproducibility a core component of AI research. Scientists from various research universities are arguing that AI researchers must release more details so that techniques and results can be corroborated. The International Conference on Learning Representations (ICLR) held its second reproducibility challenge this year, asking participants to choose any of the 2019 ICLR submissions and attempt to replicate its experiments. Workshops on reproducibility have been offered this year by a number of AI conferences, including ICLR, the Conference on Artificial Intelligence (AAAI), and the International Conference on Machine Learning (ICML).
Many of my colleagues at Facebook AI are also helping advance reproducibility through building blocks such as PyTorch Hub, a repository of pretrained models designed specifically to facilitate reproducibility and enable researchers to learn from and build on others’ work. PyTorch Hub also has built-in support for Colab and integration with Papers With Code, making it faster and easier to evaluate results. From a workflow perspective, PyTorch Lightning, a lightweight wrapper on PyTorch, automates much of the research workflow and guarantees tested, correct, and modern best practices for the automated portions of training new models. And recently launched tools such as sotabench provide an automated comparison of papers using real code on GitHub, along with a transparent, continuous-integration-like infrastructure for reproducing research.
These are important parts of solving the challenge of reproducibility in AI research, but much more is needed from both industry leaders and academic experts. Since reproducibility will ultimately benefit everyone in the AI community, we must all work together, share ideas, and hold ourselves accountable for improving how we work.
If we claim machine learning as a scientific discipline — and especially as the AI community continues its rapid growth — our best practices must evolve. We will need new tools and a shared sense of responsibility to put reproducibility front and center when publishing our work. The benefits of this shift, however, are well worth the effort. If we integrate reproducibility into AI research, our field will become more open and accessible, and I believe we will make faster progress and spur new breakthroughs that will benefit the entire AI community as well as the world of people who use the technologies we are building.
Managing Director, Facebook AI