March 22, 2022
To develop new AI systems, researchers must be able to experiment with different data to train their models. But many widely used public data sets (and even test sets) have labeling errors, which can make it difficult to train robust models that work as intended, especially on novel tasks. Researchers typically overcome these challenges by applying various techniques for controlling data quality; however, there is no centralized collection of examples for using these methods. As a result, researchers often risk running into the same pitfalls of data collection that prior work has already resolved.
As part of our commitment to open and reproducible science, we’re introducing Mephisto, a new open, collaborative way to collect, share, and iterate on best practices for collecting data to train AI models. Mephisto allows researchers to share novel collection methodologies in an immediately reusable and iterable form. Researchers can swap out components and easily find the exact annotations they need, lowering the barrier for creating custom tasks.
In Mephisto, we identify a number of common workflows for driving a complex annotation task from the idea stage through gathering the data. This allows for iterating on task design and quality control in a meaningful way before launching any large-scale data collection. It also allows us to publish our methodologies for the wider AI community to use or improve upon.
With Mephisto, researchers and engineers collecting data across different research domains, crowdsourcing platforms, and server configurations can all use the same code to run their tasks. Mephisto handles this with a number of plug-and-play abstractions that do the heavy lifting to get a data collection job started. It also provides workflow guides covering the process from ideation to full-fledged creation.
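To illustrate the plug-and-play idea, here is a minimal sketch in Python. The class names below (Blueprint, Architect, CrowdProvider) mirror the concepts described in the post, but the method names and signatures are hypothetical, chosen only to show how separating these concerns lets one component be swapped without touching the others; this is not Mephisto's actual API.

```python
# Hypothetical sketch: each concern in a data collection job is its own
# interface, so the task logic, server, and worker pool can vary independently.
from abc import ABC, abstractmethod


class Blueprint(ABC):
    """Defines what the task shows workers and what annotations it collects."""

    @abstractmethod
    def render_unit(self, item: str) -> str: ...


class Architect(ABC):
    """Defines where the task server runs (locally, on a cloud host, etc.)."""

    @abstractmethod
    def serve(self, blueprint: Blueprint) -> str: ...


class CrowdProvider(ABC):
    """Defines which worker pool the task is posted to."""

    @abstractmethod
    def post(self, task_url: str) -> None: ...


class SimpleFormBlueprint(Blueprint):
    """A toy form-style task: show one item, collect one free-text answer."""

    def render_unit(self, item: str) -> str:
        return f"<form><p>{item}</p><input name='answer'/></form>"
```

Because the three interfaces are independent, moving a working task from a local test server to a crowd platform means swapping the Architect and CrowdProvider while the Blueprint, and therefore the task itself, stays unchanged.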
For example, researchers may first find an existing task that looks relevant to what they want to collect. This blueprint will serve as a starting point, and the researcher can make changes directly to the code to tweak the data displayed, the types of annotations to return, and more. At this stage, researchers could be using anything from simple HTML forms to powerful tasks invoking models in the loop, or annotations requiring live collaboration between workers. Mephisto makes it easy to test and iterate locally before piloting.
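As a concrete sketch of that tweaking step, the snippet below assembles a small task configuration and swaps in different annotation fields. All names here (`make_task_config`, the config keys, the file name) are illustrative assumptions for this post, not Mephisto's real configuration schema.

```python
# Hypothetical sketch of customizing an existing form task: change the data
# displayed and the annotations returned without rewriting the task itself.

def make_task_config(input_data, annotation_fields, units_per_item=1):
    """Build a minimal config for a form-style annotation task.

    input_data: file of items to show workers.
    annotation_fields: the labels each worker should return per item.
    units_per_item: how many workers see each item (redundancy).
    """
    return {
        "task_title": "Sentence quality rating",  # shown to workers
        "input_data": input_data,
        "annotation_fields": list(annotation_fields),
        "units_per_item": units_per_item,
    }


# Start from a simple rating task, then swap in the annotations we need:
config = make_task_config(
    input_data="sentences.csv",
    annotation_fields=["fluency_1_to_5", "is_offensive"],
    units_per_item=3,
)
```

In practice the edit is often this small: the surrounding collection machinery stays fixed, and only the displayed data and returned fields change between iterations.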
Once the task appears ready, Mephisto offers a clean workflow for launching small pilot batches and viewing the results across many workers, making it easy to identify possible issues with the task or identify workers intentionally submitting invalid data. Researchers can then use a number of existing quality control methods to improve the task quality, or they can construct their own heuristics specific to the data being collected.
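A custom heuristic of the kind described above might look like the following sketch: it flags workers whose pilot submissions are implausibly fast or who fail embedded attention checks. The thresholds and record format are assumptions for illustration, not part of Mephisto.

```python
# Illustrative quality-control heuristic over pilot submissions.
# Each submission is a dict: {"worker_id": str, "duration_s": float,
# "passed_check": bool (optional, present only for attention-check items)}.
from collections import defaultdict


def flag_suspicious_workers(submissions, min_seconds=5.0, min_check_accuracy=0.8):
    """Return the set of worker IDs whose aggregate behavior looks invalid."""
    stats = defaultdict(lambda: {"n": 0, "fast": 0, "checks": 0, "passed": 0})
    for sub in submissions:
        st = stats[sub["worker_id"]]
        st["n"] += 1
        if sub["duration_s"] < min_seconds:
            st["fast"] += 1  # answered faster than plausible reading time
        if "passed_check" in sub:
            st["checks"] += 1
            st["passed"] += int(sub["passed_check"])

    flagged = set()
    for worker_id, st in stats.items():
        mostly_too_fast = st["fast"] / st["n"] > 0.5
        failing_checks = (
            st["checks"] > 0 and st["passed"] / st["checks"] < min_check_accuracy
        )
        if mostly_too_fast or failing_checks:
            flagged.add(worker_id)
    return flagged
```

Running a heuristic like this over a small pilot batch is cheap, and it separates task-design problems (everyone fails) from individual workers submitting invalid data (a few fail badly).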
Once the pilots display high-quality results, researchers can launch the complete task and monitor progress while it’s in flight. From here, they can package up their data set and publish the complete code with which others can collect something similar.
Open research is a core value for Meta AI, and gathering training data is a crucial component of AI research. By publishing the code for data collection, we’re also making it reproducible, which allows others to re-create similar data sets or extend existing ones.
Our goal is to set an industry-wide standard for including collection methodologies as part of data set releases. In the longer term, we believe this work will enable everyone to share, adopt, and standardize on more responsible techniques. For example, Mephisto currently includes important privacy protection protocols, like hiding worker identification. In the future, we plan to add additional features that report worker statistics on contributions to a data set, include warnings about fair pay and protections that support responsible treatment of workers, and highlight projects that explicitly try to debias data sets.
Improving data quality is an ongoing process, and we hope Mephisto improves not only the quality of data sets in AI research and the models trained on them, but also the experience of both the researchers and the annotators who construct those data sets.
Research Tech Lead Manager