Advancing computer vision research with new Detectron2 Mask R-CNN baselines

June 21, 2021

What the research is:

Since Facebook AI released Mask R-CNN, our state-of-the-art model for instance segmentation, in 2018, it has become a widely used core tool for computer vision research and applications. We are now sharing new, significantly improved baselines based on the recently published state-of-the-art results produced by other experts in the field. We’re also offering new analysis of how these improvements were achieved and adding new baseline recipes to our open source detection library, Detectron2, making it easy for other researchers to reproduce and build on these advancements.

It is difficult to make rapid scientific progress without a way to measure results and compare them with others’ work. AI researchers use baselines to do this because they can serve as an easily reproducible yardstick. But because the field is evolving rapidly, we must often update them to reflect the field’s progress.

Recent work in the field, such as Simple Copy-Paste Data Augmentation, has shown substantial improvements in accuracy (measured by average precision, or AP) on two core tasks: creating a bounding box around each object and drawing a detailed mask over it. The paper’s highest reported Mask R-CNN ResNet-50-FPN baseline is 47.2 Box AP and 41.8 Mask AP, which exceeds Detectron2's highest reported baseline of 41.0 Box AP and 37.2 Mask AP. This difference is significant because most research papers publish improvements on the order of 1 percent to 3 percent.

Without a thorough understanding of this gap in baselines and the ability to reproduce these results, it is hard for the research community to advance their own work or to understand what drove others’ performance gains. In this case, the very significant improvements in AP appear to be attributable to two simple factors: longer training and stronger random image resizing augmentation.

This image shows the box predictions using the new Mask R-CNN baseline.

How it works:

Reproducing research is a core mechanism for advancing scientific knowledge, but it is often difficult in practice. Details of a particular experiment may be unclear or unavailable, and different labs may use different hardware (e.g., Tensor Processing Units instead of GPUs) and software platforms (e.g., TensorFlow vs. PyTorch), which may introduce subtle differences in their output. Reproducibility has been a particular focus for Facebook AI, with efforts such as the reproducibility checklist and challenge as well as our work with Papers with Code.

To reproduce the ResNet-50-FPN baselines achieved in the Copy-Paste paper mentioned above, we started with the TensorFlow implementation of Mask R-CNN to train the recipes from that paper using the COCO data set. (We used information from the Bottleneck Transformer paper to approximate some implementation details that were not available.)

In the next step, we implemented the Scale Jitter algorithm (the primary data augmentation method used in the Copy-Paste paper's baseline) in Detectron2. Although many low-level differences exist between the TensorFlow and PyTorch-based Detectron2 implementations, we wanted to test whether the basic principles of longer training and stronger data augmentations would be robust to these lower-level details.
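To make the Scale Jitter step more concrete, here is a minimal, framework-free sketch of the geometry such an augmentation involves: sample a random scale, resize the image to fit the scaled target while keeping its aspect ratio, then randomly crop or pad back to a fixed output size. This is not the Detectron2 implementation. The helper name `plan_scale_jitter`, the 1024-pixel target, and the default 0.1 to 2.0 scale range are assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class JitterPlan:
    scale: float       # sampled jitter factor
    resized_hw: tuple  # (height, width) after the aspect-preserving resize
    crop_yx: tuple     # top-left corner of the crop inside the resized image
    crop_hw: tuple     # (height, width) of the region that is kept
    pad_hw: tuple      # rows/columns of padding needed to reach the target


def plan_scale_jitter(img_h, img_w, target=1024, min_scale=0.1,
                      max_scale=2.0, rng=random):
    """Plan one scale-jitter augmentation (geometry only, no pixel work).

    All default values are illustrative assumptions, not taken from the post.
    """
    scale = rng.uniform(min_scale, max_scale)
    # Resize so the image fits inside a (scale * target) square.
    side = scale * target
    ratio = min(side / img_h, side / img_w)
    new_h = max(1, round(img_h * ratio))
    new_w = max(1, round(img_w * ratio))
    # Images larger than the target are randomly cropped down to it ...
    crop_h, crop_w = min(new_h, target), min(new_w, target)
    y0 = rng.randint(0, new_h - crop_h)
    x0 = rng.randint(0, new_w - crop_w)
    # ... and images smaller than the target are padded up to it.
    pad_h, pad_w = target - crop_h, target - crop_w
    return JitterPlan(scale, (new_h, new_w), (y0, x0),
                      (crop_h, crop_w), (pad_h, pad_w))


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    for _ in range(3):
        print(plan_scale_jitter(480, 640, rng=rng))
```

A wider scale range means the model sees the same objects at more sizes, which is the sense in which this augmentation is "stronger" than a fixed or narrow resize.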

The new recipes increased the Mask R-CNN ResNet-50-FPN Box AP metric from 41 (using ImageNet initialization) to 46.7 (using ImageNet initialization) and 47.4 (using random initialization).

We conducted a series of ablation experiments to understand which hyperparameter changes drove these improvements. To see whether we could drive accuracy even higher, we also tried deeper models with larger images. Our experiments demonstrated that:

A longer training schedule, a larger input image size, and a larger scale jitter range all have positive effects on AP. Box AP and Mask AP continued to scale with increases in training schedule (as shown in the chart above). Box AP and Mask AP plateaued for scale jitter at 0.5–1.6 when trained with a 144-epoch schedule (as shown in the chart below).
Sync Batch Norm, a lower weight decay, and deeper Region Proposal Network (RPN) and Region of Interest (ROI) heads also have a positive impact on Box AP and Mask AP, as shown in the table below.
Enabling PyTorch’s automatic mixed precision (AMP) with FP16 gradients improved training speed by 30 percent with little effect on accuracy (in our ablations, Box AP and Mask AP dropped by 0.3 and 0.5, respectively). These speed gains were measured on an eight-node cluster, where each node had eight Nvidia V100 32GB GPUs.
Deeper heads can potentially improve AP further, but this needs more investigation across a broad range of training schedules.
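The schedule lengths discussed here are given in epochs, whereas detection codebases such as Detectron2 typically configure training length in iterations. The sketch below shows the conversion. The batch size of 64 images and the rounded figure of 118,000 COCO training images per epoch are assumptions for illustration and are not stated in this post; `epochs_to_iterations` is a hypothetical helper.

```python
def epochs_to_iterations(epochs, images_per_epoch=118_000, batch_size=64):
    """Number of optimizer steps needed to pass over the data `epochs` times.

    images_per_epoch and batch_size are assumed values, not from the post.
    """
    return round(epochs * images_per_epoch / batch_size)


for epochs in (144, 396):
    print(f"{epochs} epochs -> {epochs_to_iterations(epochs):,} iterations")
```

Under these assumptions, the 144-epoch schedule corresponds to 265,500 iterations and the 396-epoch schedule to 730,125, which gives a sense of why the shorter schedule was preferred for ablations.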

Box AP and Mask AP plateaued for scale jitter at 0.5–1.6 when trained with a 144-epoch schedule.

Ablation experiments	Box AP	Mask AP	Notes
LSJ, 144 epochs, RandomInit	45.2	41.0	We ran ablation experiments using a 144-epoch schedule (instead of the 396-epoch schedule) to shorten training time.
- SyncBN, + Group Norm	43.5 (-1.7)	39.8 (-1.2)
+ AMP, FP16 gradients	44.9 (-0.3)	40.5 (-0.5)	Using AMP and FP16 improved throughput by 30%.
D2 default weight decay	43.7 (-1.6)	39.7 (-1.3)	Weight decay increased from 4E-5 to 1E-4.
D2 default box head	44.0 (-1.2)	40.1 (-0.9)	Replace 4 conv, 1 FC with 2 FC.

This table summarizes the results of our ablation experiments.
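As a sanity check, the deltas in parentheses can be recomputed from the baseline row. The snippet below does so using the rounded values printed in the table. Because it works from rounded numbers, a recomputed delta may differ from the printed one by 0.1 (as it does for the Box AP of the weight decay row), presumably a rounding effect. The dictionary layout and the `deltas` helper are illustrative, not part of Detectron2.

```python
# Baseline row: LSJ, 144 epochs, random initialization.
BASELINE = {"box": 45.2, "mask": 41.0}

# Absolute Box AP / Mask AP for each ablation, copied from the table.
ABLATIONS = {
    "- SyncBN, + Group Norm": {"box": 43.5, "mask": 39.8},
    "+ AMP, FP16 gradients": {"box": 44.9, "mask": 40.5},
    "D2 default weight decay": {"box": 43.7, "mask": 39.7},
    "D2 default box head": {"box": 44.0, "mask": 40.1},
}


def deltas(row, baseline=BASELINE):
    """Change in each metric relative to the baseline, to one decimal."""
    return {key: round(row[key] - baseline[key], 1) for key in baseline}


for name, row in ABLATIONS.items():
    d = deltas(row)
    print(f"{name:<26} box {d['box']:+.1f}  mask {d['mask']:+.1f}")
```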

Why it matters:

For computer vision tasks ranging from AR effects to detecting harmful content, performance depends in large part on the accuracy of the image detection models used. Improving AP can directly improve the user experience with products like Portal, which uses a Smart Camera system powered by our Mask R-CNN2Go algorithm to intelligently frame shots during video calls, much as an experienced camera operator would.

By sharing our work here and implementing it with Detectron2, we hope to not only help others build better computer vision tools, but also make it easy for the research community to use them as a foundation for their new detection research. Ultimately, we hope this will help lead to new breakthroughs in building machines that can master challenging computer vision tasks.