We’re introducing a new framework, called TensorMask, that uses a dense, sliding-window technique for highly accurate instance segmentation. TensorMask uses novel architectures and operators to capture the 4D geometric structure of dense mask prediction with rich, effective representations. This is the first time this approach has been used to achieve results that are both qualitatively and quantitatively on par with Facebook AI’s pioneering bounding box-driven framework, Mask R-CNN.
The direct sliding-window paradigm has recently witnessed a resurgence for bounding box object detection, making it possible to accurately detect objects in a single stage without requiring a follow-up refinement step. However, this approach has not been effective for instance segmentation tasks, because instance masks are complex 2D geometric structures, not simple rectangles. When sliding densely on a 2D regular grid, instance masks require high-dimensional 4D tensors with scale-adaptive sizes for effective representation.
TensorMask accomplishes this using structured, high-dimensional 4D geometric tensors, which are composed of subtensors that have axes with well-defined units of pixels. These subtensors enable geometrically meaningful operations, such as coordinate transformations, upscaling and downscaling, and the use of scale pyramids. In contrast, previous attempts like DeepMask used unstructured 3D tensors that lacked clear geometric meaning, which makes the representation harder to manipulate.
To generate masks efficiently in sliding windows, we use various tensor representations in which subtensors represent the mask values. For instance, in the aligned representation, the subtensor at a given pixel enumerates the mask values at that pixel in all windows that overlap it. As we show in the image below, the aligned representation enables the use of coarse subtensors to better predict finer-resolution masks.
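To make the aligned representation concrete, here is a minimal NumPy sketch of the re-indexing involved. It assumes a 4D tensor of shape (V, U, H, W) in which, before conversion, the subtensor at location (y, x) holds the V×U mask of the window centered there (we call this the "natural" layout here). The function names, the `alpha` stride parameter, and the zero fill for out-of-bounds windows are illustrative assumptions, not the released TensorMask operators.

```python
import numpy as np

def natural_to_aligned(natural, alpha=1):
    """Re-index a (V, U, H, W) tensor so that the subtensor at (y, x)
    lists the values predicted for pixel (y, x) by every overlapping window.
    Illustrative sketch only; entries whose window centre is outside the grid stay 0."""
    V, U, H, W = natural.shape
    aligned = np.zeros_like(natural)
    for vi in range(V):
        for ui in range(U):
            dv = alpha * (vi - V // 2)  # vertical offset inside the window
            du = alpha * (ui - U // 2)  # horizontal offset inside the window
            for y in range(H):
                for x in range(W):
                    cy, cx = y - dv, x - du  # centre of the window that sees (y, x) at this offset
                    if 0 <= cy < H and 0 <= cx < W:
                        aligned[vi, ui, y, x] = natural[vi, ui, cy, cx]
    return aligned

def pixel_id_natural(V, U, H, W, alpha=1):
    """Toy natural tensor: each entry stores an id (row * 1000 + col)
    of the image pixel that the entry describes."""
    t = np.zeros((V, U, H, W), dtype=np.int64)
    for vi in range(V):
        for ui in range(U):
            for y in range(H):
                for x in range(W):
                    py = y + alpha * (vi - V // 2)
                    px = x + alpha * (ui - U // 2)
                    t[vi, ui, y, x] = py * 1000 + px
    return t

natural = pixel_id_natural(3, 3, 4, 5)
aligned = natural_to_aligned(natural)
# In the natural layout the subtensor at (2, 2) covers nine different pixels;
# in the aligned layout all nine entries refer to pixel (2, 2), i.e. id 2002.
print(natural[:, :, 2, 2])
print(aligned[:, :, 2, 2])
```

The explicit loops are kept for readability; a practical version would use slicing or a dedicated operator instead.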
We used the TensorMask framework to develop Tensor Bipyramid, a new pyramid structure that naturally captures the geometric structure of the task, in which large objects have high-resolution masks in coarse locations and small objects have low-resolution masks in fine locations. The best TensorMask model, which leverages the Tensor Bipyramid structure, achieves 37.1 AP (average precision, the standard metric for this task), while the Mask R-CNN counterpart achieves 38.3 AP.
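The trade-off between mask resolution and location resolution can be illustrated by listing tensor shapes per pyramid level. The sketch below assumes the shape rule we understand the bipyramid to follow: at level k, the mask axes grow by a factor of 2^k while the spatial axes shrink by the same factor. The function names and the example sizes (15×15 masks on a 64×64 grid) are hypothetical and chosen only for illustration.

```python
def feature_pyramid_shapes(V, U, H, W, levels):
    """Conventional pyramid: mask size (V, U) is fixed, only the grid gets coarser."""
    return [(V, U, H // 2**k, W // 2**k) for k in range(levels)]

def tensor_bipyramid_shapes(V, U, H, W, levels):
    """Assumed bipyramid rule: coarser grids carry proportionally larger masks."""
    return [(2**k * V, 2**k * U, H // 2**k, W // 2**k) for k in range(levels)]

for k, shape in enumerate(tensor_bipyramid_shapes(15, 15, 64, 64, 3)):
    print(f"level {k}: (V, U, H, W) = {shape}")
# level 0: (V, U, H, W) = (15, 15, 64, 64)
# level 1: (V, U, H, W) = (30, 30, 32, 32)
# level 2: (V, U, H, W) = (60, 60, 16, 16)
```

Under this rule, the coarsest level pairs the fewest locations with the highest-resolution masks, which matches the observation that large objects need detailed masks but only coarse localization.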
TensorMask establishes the foundation to explore a new direction for instance segmentation research compared with the standard approach driven by Mask R-CNN. With TensorMask, boxes are no longer necessary for high-performance instance segmentation. This new, complementary approach can help advance research toward a ground-up unification of object and background segmentation into a single model. This research will help us better understand the task of dense mask prediction more broadly, an important part of our continuing to innovate and build stronger image-understanding systems.