Pushing state-of-the-art in 3D content understanding

October 29, 2019

Written by Georgia Gkioxari, Shubham Tulsiani, and David Novotny

In order to interpret the world around us, AI systems must understand visual scenes in three dimensions. This need extends beyond robotics, navigation, and even augmented reality applications. Even with 2D photos and videos, the scenes and objects depicted are themselves three-dimensional, of course, and truly intelligent content-understanding systems must be able to recognize the geometry of a cup’s handle when it’s being rotated in a video, or identify which objects are in the foreground and background of a photo.

Today, we’re sharing details on several new Facebook AI research projects that advance the state of the art in 3D image understanding in different but complementary ways. This work, which is being presented at the International Conference on Computer Vision (ICCV) in Seoul, addresses a variety of use cases and circumstances, with different types and amounts of training data and inputs.

  • Mesh R-CNN is a novel, state-of-the-art method to predict the most accurate 3D shapes in a wide range of real-world 2D images. This method, which leverages our general Mask R-CNN framework for object instance segmentation, can detect even complex objects, such as the legs of a chair or overlapping furniture.

  • Using an alternative and complementary approach to Mesh R-CNN, termed C3DPO, we’re the first to achieve a successful large-scale 3D reconstruction of nonrigid shapes on three benchmarks for more than 14 object categories by interpreting 3D geometry. We achieve this using only 2D keypoints and zero 3D annotations.

  • We’ve introduced a novel method to learn association between images and 3D shapes while significantly reducing the need for annotated training examples. This brings us closer to self-supervised systems that can create 3D representations for more kinds of objects.

  • We’ve developed a novel technique, called VoteNet, to perform object detection when 3D input from LIDAR or other sensors is available. While most traditional systems for this task depend on 2D image signals, ours is based purely on 3D point clouds and achieves higher precision than prior work.

This research builds on recent advances in using deep learning to predict and localize objects in an image, as well as new tools and architectures for 3D shape understanding, like voxels, point clouds, and meshes. The field of computer vision extends to a wide range of tasks, but 3D understanding will play a central role in advancing the ability of AI systems to more closely understand, interpret, and operate in the real world.

Achieving state-of-the-art in predicting 3D shapes of unconstrained, obstructed objects

Perception systems like Mask R-CNN are powerful and versatile tools for understanding images. But because they make predictions in 2D, they ignore the 3D structure of the world. Leveraging the advances in 2D perception, we designed a 3D object reconstruction model that predicts 3D object shapes from unconstrained real-world images with a range of optical challenges, including objects with occlusion, clutter, and diverse topologies. Adding a third dimension to object detection systems that are robust against such complexities requires stronger engineering capabilities, and current engineering frameworks have hindered progress in this area.

Mesh R-CNN takes an input image, predicts object instances in that image, and infers their 3D shape. To capture diversity in geometries and topologies, it first predicts coarse voxels, which are refined for accurate mesh predictions.

To address these challenges, we augmented Mask R-CNN’s 2D object segmentation system with a mesh prediction branch, and we built Torch3d, a PyTorch library with highly optimized 3D operators, in order to implement the system. Mesh R-CNN uses Mask R-CNN to detect and classify the various objects in an image. It then infers 3D shapes with a novel mesh predictor, a hybrid approach in which voxel prediction is followed by mesh refinement. This two-step process enables us to achieve better results than prior work for predicting fine-grained 3D structures. Torch3d helps make this possible by enabling efficient, flexible, and modular implementation of complex operations, like chamfer distance, differentiable mesh sampling, and a differentiable renderer.
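To make one of these operations concrete, here is a minimal sketch of chamfer distance in plain Python. It is not the optimized operator from the library described above. It assumes one common formulation (squared nearest-neighbor distances, averaged per set and summed over both directions) and uses a brute-force search, so it is only suitable for small point sets. The function name `chamfer_distance` is chosen for illustration.

```python
def chamfer_distance(points_a, points_b):
    """Symmetric chamfer distance between two non-empty sets of 3D points.

    Assumed formulation: for every point in one set, take the squared
    distance to its nearest neighbor in the other set, average those
    values, and add the results for both directions.
    """
    def squared_distance(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def one_way(source, target):
        # Brute-force nearest neighbor: O(len(source) * len(target)).
        return sum(
            min(squared_distance(p, q) for q in target) for p in source
        ) / len(source)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


# Two tiny point sets that differ in a single point.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(predicted, target))
```

The distance is zero when the two sets coincide and grows as points sampled from a predicted mesh drift away from the target surface, which is why it is a natural training loss for shape prediction.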

We use Detectron2 to implement the resulting system, which uses RGB images as input in order to both detect objects and predict 3D shapes. Similar to Mask R-CNN’s use of supervised learning for strong 2D perception, our novel approach learns 3D prediction using fully supervised learning with pairs of images and meshes. For training, we use the Pix3D data set, composed of 10,000 pairs of images and meshes, which is significantly smaller than typical 2D benchmarks, which contain hundreds of thousands of images and object annotations.

We evaluated Mesh R-CNN on two data sets and achieved strong results on both. On the Pix3D data set, Mesh R-CNN is the first system to be able to jointly detect objects of all categories and estimate their full 3D shape across diverse, cluttered, and occluded scenes of furniture. Previous work focused on evaluating models that were trained on perfectly cropped, unoccluded image segments. And on the ShapeNet data set, our hybrid approach of voxel prediction and mesh refinement outperforms prior work by a 7 percent relative margin.

System overview of Mesh R-CNN. We augment Mask R-CNN with 3D shape inference.

Accurately predicting and reconstructing the shapes of unconstrained scenes in the real world is an important step toward enhancing new experiences, like virtual reality and other forms of telepresence. Still, gathering annotated data for 3D images is substantially more complex and time-consuming than doing so for 2D images, which is why data sets for 3D shape prediction have lagged compared with their 2D counterparts. We’re therefore exploring different approaches to leveraging both supervised and self-supervised learning for reconstructing objects in 3D.

Read the full paper on Mesh R-CNN here.

Reconstructing 3D object categories with 2D keypoints

For scenarios in which meshes and corresponding images are not available for training and full reconstruction of static objects or scenes is not necessary, we’ve developed an alternative approach. Our new C3DPO (Canonical 3D Pose Networks) system builds reconstructions of 3D keypoint models and achieves state-of-the-art reconstruction results using the more widely accessible and abundant 2D keypoint supervision. C3DPO helps us understand the 3D geometry of objects in a weakly supervised fashion suitable for large-scale deployment.

C3DPO generates 3D keypoints from detected 2D keypoints for a range of object categories, accurately differentiating between viewpoint changes and shape deformations.

2D keypoints, which track specific parts of the object category (e.g., human joints or bird wings), provide a complete set of cues about the object geometry and its deformations, or viewpoint changes. The resulting 3D keypoints are useful, for instance, in modeling 3D faces and full-body meshes for more lifelike avatar graphics in VR. Similar to Mesh R-CNN, C3DPO reconstructs 3D objects using unconstrained images with occlusions and missing values.

C3DPO is the first method capable of reconstructing data sets consisting of hundreds of thousands of images with several thousand 2D keypoints. We achieve state-of-the-art reconstruction accuracy on three different data sets for more than 14 diverse nonrigid object categories. And we’ve made the code for this work available here.

Our model has two important innovations. First, given a set of monocular 2D keypoints, our new 3D reconstruction network predicts the parameters of the corresponding camera viewpoint as well as the 3D keypoint locations in a canonical orientation. Second, we introduce a novel regularization technique termed canonicalization, which consists of a second auxiliary deep network that learns alongside the 3D reconstruction network. This technique addresses the ambiguity that comes with factorizing 3D viewpoint and shape. These two innovations enable us to capture much better statistical models of the data than is possible with traditional approaches.
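The factorization into viewpoint and canonical shape can be illustrated with a small sketch. It assumes an orthographic camera and a rotation about a single axis, neither of which is specified in this post, and every function name here is hypothetical. In the real system both the shape and the viewpoint are predicted by a network; here they are supplied by hand.

```python
import math


def rotation_y(angle):
    """3x3 rotation about the y axis (a stand-in for a predicted viewpoint)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rotate(R, point):
    """Apply the 3x3 matrix R to a 3D point."""
    return tuple(sum(R[i][j] * point[j] for j in range(3)) for i in range(3))


def project_orthographic(point):
    """Assumed camera model: drop the depth coordinate."""
    x, y, _ = point
    return (x, y)


def reprojection_error(keypoints_2d, canonical_shape, R):
    """Mean 2D distance between observed keypoints and the rotated,
    projected canonical 3D keypoints."""
    total = 0.0
    for observed, canonical in zip(keypoints_2d, canonical_shape):
        u, v = project_orthographic(rotate(R, canonical))
        total += math.hypot(u - observed[0], v - observed[1])
    return total / len(keypoints_2d)


# A toy canonical shape with three keypoints, viewed after a 90-degree turn.
shape = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
view = rotation_y(math.pi / 2)
observed = [project_orthographic(rotate(view, p)) for p in shape]
print(reprojection_error(observed, shape, view))             # close to zero
print(reprojection_error(observed, shape, rotation_y(0.0)))  # clearly larger
```

The same 2D keypoints are explained equally well by any rotated copy of the shape paired with a compensating camera rotation. That is the viewpoint-versus-shape ambiguity that the canonicalization technique described above is meant to resolve.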

Such reconstructions were previously unachievable mainly because of the memory constraints of earlier matrix-factorization-based methods, which, unlike our deep network, cannot operate in a “minibatch” regime. Previous methods addressed the modeling of deformations by leveraging multiple simultaneous images and establishing correspondences between instantaneous 3D reconstructions, which requires hardware that’s mostly found in special labs. The efficiencies introduced by C3DPO make it possible to enable 3D reconstruction in cases where employing hardware for 3D capture isn’t feasible, such as with large-scale objects like airplanes. Read the full paper on C3DPO here.

Learning pixel-to-surface mappings from image collections

Our system learns a parameterized convolutional neural network (CNN) that takes an image as input and predicts a per-pixel canonical surface map that indicates a corresponding location point on the template shape. The similar coloring of the predicted canonical surface mapping between the 2D image and 3D shape implies correspondence.

We take a step further toward reducing the supervision required for developing 3D understanding for generic classes of objects. We introduce an approach that can leverage unannotated image collections with approximate automatic instance segmentations. Instead of explicitly predicting the 3D structure underlying an image, we tackle a complementary task of mapping pixels in an image to the surface of a category-level template for 3D shapes.

Not only does this mapping allow us to understand the image in the context of a category-level 3D shape, but it also gives us the ability to generalize correspondences between objects of the same class or category. For instance, when we see the highlighted beak of the bird in the left image, we can easily locate the corresponding point in the image on the right.

This is possible because we intuitively understand the shared 3D structure across these instances. Our novel approach of mapping pixels of images to a canonical 3D surface enables our learned system to have this capability as well. When evaluating our approach by measuring its accuracy of transferring correspondences across instances, we achieved results that are about twice as accurate as previous self-supervised methods that did not leverage the underlying 3D structure of the task.

Our key insight – which allows learning with significantly less supervision – is that mapping from pixel to 3D surface can be paired with the inverse operation (going from 3D to pixel) in order to complete a cycle. Our novel approach operationalizes this and can learn using only unannotated, free, publicly available image collections with approximate segmentations from a detection method. Our resulting system can be used off the shelf, applied generally alongside other methods of top-down 3D prediction to provide a complementary pixelwise 3D understanding, and we’ve released the code here.
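The cycle can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the released code: the scaled-orthographic camera, the `pixel_to_surface` callable (standing in for the learned CNN), and all names and numbers below are hypothetical.

```python
import math


def project(point_3d, scale, offset):
    """Assumed scaled-orthographic camera: drop depth, then scale and
    shift into pixel coordinates."""
    x, y, _ = point_3d
    return (scale * x + offset[0], scale * y + offset[1])


def cycle_consistency_loss(pixels, pixel_to_surface, scale, offset):
    """Mean distance between each pixel and the reprojection of the 3D
    surface point that the mapping assigns to it."""
    total = 0.0
    for pixel in pixels:
        surface_point = pixel_to_surface(pixel)        # 2D -> 3D
        u, v = project(surface_point, scale, offset)   # 3D -> 2D
        total += math.hypot(u - pixel[0], v - pixel[1])
    return total / len(pixels)


pixels = [(10.0, 20.0), (30.0, 40.0)]
scale, offset = 2.0, (5.0, 5.0)


def consistent_mapping(pixel):
    """A mapping that agrees with the camera, so the cycle closes."""
    return ((pixel[0] - 5.0) / 2.0, (pixel[1] - 5.0) / 2.0, 0.3)


def collapsed_mapping(pixel):
    """A degenerate mapping that sends every pixel to one surface point."""
    return (0.0, 0.0, 0.0)


print(cycle_consistency_loss(pixels, consistent_mapping, scale, offset))
print(cycle_consistency_loss(pixels, collapsed_mapping, scale, offset))
```

No keypoint or correspondence labels appear anywhere in the loss. The only requirement is that going from pixel to surface and back lands on the starting pixel, which is what lets the approach learn from unannotated images.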

As demonstrated by the consistency of the colors of the cars that are moving in the video above, our system yields an invariant pixelwise embedding for objects undergoing motion and rotation. This consistency extends beyond a specific instance and can be useful in scenarios where we need to understand the commonalities across objects.

Instead of learning the 2D-to-2D correspondence between two images directly, we learn a 2D-to-3D correspondence and ensure consistency with a 3D-to-2D reprojection. This consistent cycle serves as the supervisory signal for learning the 2D-to-3D correspondence.

For instance, if we train a system to learn the correct place to sit on a chair or where to grasp a mug, our representation can be useful the next time the system needs to understand where to sit on a different chair or how to grasp another mug. Such tasks can not only help deepen our understanding of traditional 2D images and video content, but also enhance AR/VR experiences by transferring representations of objects. Read more about canonical surface mapping here.

Improving the fundamentals of object detection in current 3D systems

As leading-edge technologies, like autonomous agents and systems to scan 3D spaces, continue to advance, we need to push forward the mechanisms for detecting objects when 3D data is readily available. In these cases, a 3D scene understanding system needs to know what objects are in a scene and where they are in order to support high-level tasks like navigation. We’ve improved upon existing systems by constructing VoteNet, a highly accurate end-to-end 3D object detection network tailored for point clouds, which was nominated for the Best Paper Award at ICCV 2019. Unlike traditional systems for this task, which depend on 2D image signals, ours is one of the first systems based purely on 3D point clouds. This approach is more efficient and achieves much higher recognition precision than previous works.

Our model, which we’ve open-sourced here, achieves state-of-the-art 3D detection, outperforming all previous methods for 3D object detection by at least 3.7 mAP (mean average precision) on SUN RGB-D and at least 18.4 mAP on ScanNet. VoteNet does so using only geometric information, without relying on standard color images.

VoteNet has a simple design, compact model size, and high efficiency, with a speed of about 100 milliseconds for a full scene and a smaller memory footprint than previous methods designed for research. Our algorithm takes in 3D point clouds from depth cameras and returns 3D bounding boxes of objects with their semantic classes.

Illustration of the VoteNet architecture for 3D object detection in point clouds.

We introduce a voting mechanism that’s inspired by the classical Hough voting algorithm. Using this method, we essentially generate new points that lie close to object centers, and these points can then be grouped and aggregated to generate box proposals. The votes themselves are learned by deep neural networks: a set of 3D seed points vote for object centers in order to recover where the objects are and what they are.
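The voting idea can be sketched as follows. This is a simplified illustration, not the VoteNet implementation: the offsets are hand-written where the real system predicts them with a network, and the greedy radius-based grouping is a stand-in chosen for brevity. All function names are hypothetical.

```python
import math


def cast_votes(seeds, offsets):
    """Each seed point moves by its offset to vote for an object center."""
    return [
        tuple(s + o for s, o in zip(seed, offset))
        for seed, offset in zip(seeds, offsets)
    ]


def centroid(points):
    """Mean position of a non-empty list of 3D points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def group_votes(votes, radius):
    """Greedy grouping (a simplification): a vote joins the first cluster
    whose current centroid lies within `radius`, else starts a new one.
    Returns one aggregated center per cluster."""
    clusters = []
    for vote in votes:
        for cluster in clusters:
            if math.dist(vote, centroid(cluster)) <= radius:
                cluster.append(vote)
                break
        else:
            clusters.append([vote])
    return [centroid(cluster) for cluster in clusters]


# Seed points on the surfaces of two objects, with offsets pointing inward.
seeds = [(0, 0, 0), (2, 0, 0), (10, 0, 0), (12, 0, 0)]
offsets = [(1, 0, 0), (-1, 0, 0), (1, 0, 0), (-1, 0, 0)]
votes = cast_votes(seeds, offsets)
centers = group_votes(votes, radius=0.5)
print(centers)
```

Points on an object's surface can be far from its center, which makes direct box regression from them hard. After voting, the evidence for each object is concentrated near its center, so grouping becomes simple.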

As the use of 3D scanners grows in the real world — already common in applications from autonomous vehicles to biomedicine — it’s important for us to be able to achieve semantic understanding of the 3D content by localizing and classifying objects of a 3D scene. Supplementing 2D cameras with more advanced depth camera sensors for 3D recognition allows us to capture a more robust view of any given scene. With VoteNet, systems can better recognize major objects in a scene, supporting tasks like placing a virtual object, or navigation and LiveMap construction.

Developing systems with richer understanding of the real world

3D computer vision has many open research questions, and we are experimenting with multiple problem statements, techniques, and methods of supervision as we explore the best way to push the field forward as we did for 2D understanding. As the digital world adapts and shifts to use products like 3D Photos and immersive AR and VR experiences, we need to keep pushing sophisticated systems to more accurately understand and interact with objects in a visual scene.

It’s also part of Facebook AI’s long-term goal of developing AI systems that understand and interact with the real world as humans do. We have been creating scientific breakthroughs across a broad range of capabilities focused on narrowing the gap between physical and virtual spaces. Our latest 3D-focused research can also help improve and better populate 3D objects in Facebook AI’s simulation platform, which is important for training virtual agents to operate in the real world. In the same way that robotics pushes us to address complex challenges that come from conducting experiments in the physical world, where conditions are more unpredictable, 3D research is important for teaching systems how to understand all viewpoints of objects, even when they’re occluded, hidden, or have other optical challenges.

When combined with other senses, like tactile sensing and natural language understanding, AI systems, such as virtual assistants, can function in a way that’s more seamless and useful. Collectively, this leading-edge research helps us move one step closer to building AI systems that can more intuitively understand three dimensions in the same way that humans do.

The research papers described in this blog post are being presented at ICCV 2019, along with other new work in computer vision, including:

  • SlowFast, a method for extracting information from video using input at two different frame rates.

  • TensorMask, an alternative method of object segmentation using the dense, sliding-window technique.

Written by

Georgia Gkioxari

Research Scientist

Shubham Tulsiani

Research Scientist

David Novotny

Research Scientist