September 2, 2021
Reconstructing objects in 3D is a seminal computer vision problem with AR/VR applications ranging from telepresence to the generation of 3D models for gaming. With photorealistic, versatile 3D reconstruction, it becomes possible to seamlessly combine real and virtual objects on traditional smartphone and laptop screens, as well as on the AR glasses that will power future experiences. Current 3D reconstruction methods, however, rely on learning models for various object categories (“car,” “donut,” “apple,” etc.), and progress is hindered by a lack of data sets that contain both videos of real-world objects and accurate 3D re-creations of these objects. Since models rely on these examples to learn how to create 3D reconstructions, researchers typically just use data sets of synthetic objects, which only approximately match the challenging nature of the real-world problems.
To help address this gap and spur advances in this field, Facebook AI is releasing Common Objects in 3D (CO3D), a large-scale data set comprising real videos of common object categories with 3D annotations. CO3D contains a total of 1.5 million frames from nearly 19,000 videos capturing objects from 50 categories in the widely used MS-COCO data set. CO3D surpasses the existing alternatives in terms of the number of both categories and objects.
We are also sharing our work on NeRFormer, a novel method that learns to synthesize images of an object from novel viewpoints by observing the videos in the CO3D data set. NeRFormer efficiently combines two recent machine learning contributions: Transformers and Neural Radiance Fields. As a result, it is up to 17 percent more accurate than the nearest competitor in synthesizing new object views.
Our main goal is to collect a large-scale real-life data set of common objects in the wild annotated with 3D shapes. While it is possible to collect 3D shapes with specialized hardware (e.g., a turntable 3D scanner), that approach is difficult to scale to match the scope of synthetic data sets comprising thousands of objects across diverse categories. Instead, we devised a photogrammetric approach requiring only object-centric multiview images. Such data can be gathered effectively in large quantities by crowdsourcing “turntable” videos captured with consumer smartphones.
To this end, we crowdsourced object-centric videos on Amazon Mechanical Turk (AMT). Each AMT task asked a worker to select an object in a given category, place it on a solid surface, and record a video, keeping the whole object in view while moving a full circle around it (examples can be seen in the video below). We selected 50 MS-COCO categories comprising stationary objects that have a well-defined notion of shape and are good candidates for a successful 3D reconstruction.
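To make the capture geometry concrete, here is a minimal sketch of an idealized "turntable" trajectory: cameras evenly spaced on a circle, each looking at an object placed at the origin. The function name, the radius and height parameters, and the choice of the origin as the object center are assumptions made for illustration only. Real handheld videos deviate from a perfect circle, which is why the camera positions have to be recovered from the footage afterward.

```python
import math


def turntable_cameras(num_views, radius=1.0, height=0.3):
    """Return (position, view_direction) pairs on a circle around the origin.

    Hypothetical helper: the object is assumed to sit at (0, 0, 0) and the
    camera is assumed to stay at a fixed radius and height while circling it.
    """
    cameras = []
    for i in range(num_views):
        # Evenly spaced angles covering one full revolution.
        angle = 2.0 * math.pi * i / num_views
        position = (radius * math.cos(angle), radius * math.sin(angle), height)
        # Unit vector pointing from the camera toward the object at the origin.
        norm = math.sqrt(sum(c * c for c in position))
        direction = tuple(-c / norm for c in position)
        cameras.append((position, direction))
    return cameras


if __name__ == "__main__":
    for position, direction in turntable_cameras(4):
        print(position, direction)
```

Each returned direction is a unit vector, so stepping from a camera position along its direction by the camera's distance to the origin lands on the object center.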
We then used COLMAP, a mature photogrammetry framework, to produce 3D annotations, which we treat as ground truth. COLMAP tracks the positions of the smartphone camera in 3D space and then reconstructs a dense 3D point cloud capturing the object’s surface. Example reconstructions and camera tracks can be seen in the example above. Finally, to ensure high-quality 3D annotations, we devised a semiautomated active-learning algorithm that filters out videos with insufficient 3D reconstruction accuracy.
Along with releasing the CO3D data set, we propose NeRFormer, a novel deep architecture that learns the geometric structure of object categories by observing the collected videos. During training, NeRFormer learns by differentiably rendering a neural radiance field (NeRF) that represents the geometry and appearance of an object. Importantly, rendering is carried out by a novel deep Transformer that jointly learns to predict the properties of the radiance field by analyzing the content of the object’s video frames, and to render the new view by “marching” along the rendering rays. In this manner, once NeRFormer learns the common structure of a category, it is able to synthesize new views of a previously unseen object given only a small number of its known views.
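For readers unfamiliar with ray marching, the sketch below shows the standard emission-absorption compositing used by the original NeRF formulation: samples along a ray are blended front to back, each weighted by its opacity and by how much light survives the samples in front of it. This is not NeRFormer's renderer, which replaces this step with a learned Transformer; the function name, its arguments, and the uniform step size are assumptions chosen for illustration.

```python
import math


def render_ray(densities, colors, step):
    """Composite samples along one ray with the emission-absorption model.

    densities: non-negative volume densities, ordered from near to far.
    colors:    (r, g, b) tuples, one per sample.
    step:      distance between consecutive samples (assumed uniform).
    Returns the pixel color and the transmittance left after the last sample.
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    for sigma, color in zip(densities, colors):
        # Opacity of this segment of the ray.
        alpha = 1.0 - math.exp(-sigma * step)
        # Contribution is the opacity scaled by what still reaches the sample.
        weight = transmittance * alpha
        for k in range(3):
            pixel[k] += weight * color[k]
        transmittance *= 1.0 - alpha
    return tuple(pixel), transmittance


if __name__ == "__main__":
    # Two empty samples followed by a dense red sample.
    red = (1.0, 0.0, 0.0)
    print(render_ray([0.0, 0.0, 50.0], [red, red, red], step=0.1))
```

In this model, empty space (zero density) contributes nothing and leaves the transmittance unchanged, while a very dense sample absorbs nearly all remaining light and hides whatever lies behind it.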
As the first data set of its kind, CO3D is well suited to enable reconstruction of real-life 3D objects. Indeed, CO3D already provides the training data that allows our NeRFormer to tackle the new-view synthesis (NVS) task. Photorealistic NVS is a major step on the path to fully immersive AR/VR effects, where objects can be virtually transported across different environments, allowing users to connect by sharing or recollecting their experiences.
Besides practical applications in AR/VR, we hope that the data set will become a standard testbed for the recent proliferation of methods (including NeRFormer, Implicit Differentiable Renderer, NeRF, and others) that reconstruct 3D scenes by means of an implicit shape model.