Using a classical rendering technique for state-of-the-art image segmentation

June 12, 2020

What the research is:

PointRend is a new, efficient, high-resolution image segmentation approach inspired by the classical adaptive sampling technique used in computer graphics rendering. It produces sharper and more accurate segmentation of objects and scenes than previous state-of-the-art methods.

Modern image segmentation approaches are based on convolutional neural networks that distribute computation evenly across an input image. Typically, these methods make predictions at a coarser resolution than the input image in order to limit computational complexity. Because they spend a large part of their computation budget on unambiguous regions of an image (oversampling), such as a dog’s torso or the background, they undersample the more challenging parts of an object and miss fine-grained details such as a dog’s paw or, more generally, object edges.

Instance segmentation with the standard Mask R-CNN segmentation head (left) and with the PointRend head (right). Qualitatively, PointRend outputs crisp object boundaries in regions that are over-smoothed by previous methods.

The PointRend module makes point-based segmentation predictions at adaptively selected locations, an idea inspired by the classical technique of adaptive subdivision used for rendering in computer graphics. As a result, our model efficiently produces significantly more detailed segmentation, with a pixel-level precision that was not possible with the previous best segmentation approaches, such as Mask R-CNN or Semantic FPN.

Quantitatively, PointRend yields significant gains on two major benchmarks, for both instance and semantic segmentation tasks. Mask R-CNN enhanced with a PointRend segmentation head produces masks that are 8x more detailed than the standard Mask R-CNN output, increasing average precision by up to 2.8 percent. The code is available here.

How it works:

The PointRend approach builds on existing image segmentation approaches, such as Mask R-CNN, Semantic FPN, and DeepLab, which efficiently produce a coarse prediction. Starting from a coarse global prediction, PointRend gradually upscales and refines the prediction, reaching the resolution of the input image in just a few steps.

At each step, PointRend selects a subset of locations on the upscaled prediction that require refinement. For each of these locations, the new model updates its prediction independently, using features from the intermediate representation of the underlying convolutional neural network. You can see an example of the process in the animation above.
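To make the refinement step concrete, here is a minimal NumPy sketch of the idea, not the released implementation. It assumes a single-object foreground probability map, picks the points whose probability is closest to 0.5 as the ones that "require refinement," and uses nearest-neighbor upsampling for brevity. The helper names (upsample2x, select_uncertain, refine_step) and the toy_point_head oracle are hypothetical; in a real model the point head would be a small learned network fed with features from the backbone.

```python
import numpy as np

def upsample2x(prob):
    """Nearest-neighbor 2x upsampling (a simple stand-in; a real model would likely interpolate)."""
    return np.repeat(np.repeat(prob, 2, axis=0), 2, axis=1)

def select_uncertain(prob, num_points):
    """Indices of the num_points locations whose foreground probability is closest to 0.5."""
    k = min(num_points, prob.size)
    uncertainty = -np.abs(prob - 0.5)  # largest where the prediction is least confident
    flat = np.argpartition(uncertainty.ravel(), -k)[-k:]
    return np.unravel_index(flat, prob.shape)

def refine_step(prob, point_head, num_points):
    """Upscale the prediction, then re-predict only the selected uncertain points."""
    up = upsample2x(prob)
    rows, cols = select_uncertain(up, num_points)
    up[rows, cols] = point_head(rows, cols, up.shape)  # each point is predicted independently
    return up

def toy_point_head(rows, cols, shape):
    """Hypothetical point head: an oracle that returns 1.0 inside a centered disk.
    It ignores image features entirely and only stands in for a trained predictor."""
    h, w = shape
    y = (rows + 0.5) / h - 0.5
    x = (cols + 0.5) / w - 0.5
    return (x ** 2 + y ** 2 < 0.3 ** 2).astype(float)

# Build a blurry 8x8 "coarse prediction" by average-pooling a 64x64 disk mask.
fine = toy_point_head(*np.indices((64, 64)), (64, 64))
coarse = fine.reshape(8, 8, 8, 8).mean(axis=(1, 3))

# Three refinement steps: 8x8 -> 16x16 -> 32x32 -> 64x64.
mask = coarse
for _ in range(3):
    mask = refine_step(mask, toy_point_head, num_points=64)
print(mask.shape)  # (64, 64)
```

Because the selection only looks at the current prediction, confident interior and background pixels are simply copied from the coarser level, while the budget of point predictions concentrates along the object boundary.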

Overall, the technique allows researchers to obtain a high-resolution segmentation while making an actual prediction for only a small fraction of the pixels in the input image. PointRend avoids spending excessive computation on areas that a coarse prediction already handles accurately. At the same time, it refines its prediction for pixels that lie in the challenging regions of an object or a scene. Such adaptivity leads to efficient inference that can be adjusted to the computational resources at hand without retraining the model.
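A quick back-of-the-envelope count shows why only a small fraction of pixels needs an actual prediction. The sketch below assumes each step doubles the resolution and re-predicts a fixed budget of points. The sizes used (a 7x7 coarse mask, a 224x224 target, 784 points per step) are illustrative assumptions rather than figures from this post, and point_prediction_count is a hypothetical helper.

```python
def point_prediction_count(coarse_size, target_size, points_per_step):
    """Count point predictions needed to reach target_size by repeated 2x upsampling."""
    size, total = coarse_size, 0
    while size < target_size:
        size *= 2
        # A step can never refine more points than the upscaled grid contains.
        total += min(points_per_step, size * size)
    return total

adaptive = point_prediction_count(7, 224, 784)  # 3332
dense = 224 * 224                               # 50176
print(adaptive, dense, round(dense / adaptive, 1))

# A smaller per-step budget trades detail for speed at inference time.
print(point_prediction_count(7, 224, 196))
```

Under these assumptions the adaptive scheme makes roughly 15 times fewer point predictions than predicting every pixel of the final grid. Changing points_per_step changes only the inference cost, which is the sense in which the budget can be adjusted without retraining.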

Why it matters:

PointRend’s efficiency enables output resolutions that are impractical for existing approaches in terms of memory or computation. With fewer memory and compute constraints, we’ll be able to deploy image segmentation models to smaller devices and low-resource areas. Pixel-perfect segmentation can also enhance a wide range of AR/VR experiences that were previously limited, like seamlessly transforming backgrounds or inserting more precise, realistic objects in a scene.



This work will be presented at CVPR 2020. Learn full details about Facebook AI at CVPR here.