May 14, 2021
Understanding what objects are present in a scene and where they are located is a standard task in computer vision, widely used for applications ranging from self-driving cars to augmented reality (AR). Training such systems to recognize 3D spaces usually involves capturing a scene with a sensor (often a 3D sensor) and then hand-labeling the spatial extent of each object in the scene with a 3D box. Though a popular and powerful way to train AI models, manual labeling is very time-consuming: on average, it takes more than 20 minutes to label and draw boxes in a small indoor 3D scene. Labeling 3D scenes would be much faster and easier if annotators could skip the boxes and instead provide only scene-level labels, such as a list of the objects present in the scene.
In this work, we ask the following question: Can we learn to perform spatial recognition (e.g., detecting and segmenting objects) in 3D data (e.g., point clouds) using only scene-level tags (e.g., a list of objects present in a scene) as supervision during training? Through our proposed method, WyPR, we show that by jointly tackling the two tasks of segmentation and detection, which naturally constrain each other, we are able to learn effective representations for this weakly supervised problem setup. WyPR combines advances in 2D weakly supervised learning with unique properties of 3D point cloud data. It outperforms the previous state of the art by 6 percent mIoU on the challenging ScanNet dataset, and it establishes new benchmarks and baselines for future work.
WyPR first extracts a point-level feature representation from the input using standard 3D deep learning techniques. To obtain object segmentations, it classifies each point into an object class. Since it does not assume point-level supervision is available to train this part of the network, WyPR employs multiple-instance learning (MIL) along with self-supervised objectives (such as requiring predictions to be consistent across augmented views of the input) for training.
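To make the training signal concrete, here is a minimal NumPy sketch of the two objectives described above: an MIL loss that pools per-point class scores into a scene-level multi-label prediction supervised by the scene tags, and a self-supervised consistency loss between two augmented views. The function names, array shapes, and the choice of log-sum-exp pooling are illustrative assumptions, not the exact formulation used in WyPR.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def logsumexp(x, axis=0):
    # Numerically stable smooth max over the given axis.
    m = x.max(axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True)), axis=axis)

def mil_scene_loss(point_logits, scene_tags):
    """MIL objective (illustrative): pool per-point class scores (N, C) into
    scene-level scores (C,) with a smooth max, then apply multi-label binary
    cross-entropy against the binary scene-tag vector (C,)."""
    scene_logits = logsumexp(point_logits, axis=0)   # one score per class
    p = 1.0 / (1.0 + np.exp(-scene_logits))          # sigmoid
    t = scene_tags.astype(float)
    eps = 1e-9
    return float(-(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps)).mean())

def consistency_loss(logits_a, logits_b):
    """Self-supervised term: per-point class distributions should agree
    across two augmented views of the same point cloud."""
    pa, pb = softmax(logits_a), softmax(logits_b)
    return float(((pa - pb) ** 2).sum(axis=1).mean())
```

The key point is that neither loss ever touches a per-point label: the MIL term only sees the list of classes present in the scene, while the consistency term compares the network with itself across augmentations.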
Next, to obtain object bounding boxes, it leverages a novel 3D object proposal technique inspired by selective search and referred to as geometric selective search (GSS). Each proposal is classified into one of the object classes using MIL as before, along with similar self-supervised losses. Finally, WyPR enforces consistency across the predictions made by the segmentation and detection subsystems, requiring, for instance, that all points within a detected bounding box be consistent with the box-level prediction. The following figure illustrates the overall process.
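The cross-task constraint can be sketched as a simple penalty: for each detected box, points falling inside it should assign high probability to the box's predicted class. The following is a simplified NumPy illustration with axis-aligned boxes; the function name, shapes, and the negative-log-likelihood form are assumptions for exposition, not WyPR's exact loss.

```python
import numpy as np

def seg_det_consistency(points, point_probs, boxes, box_classes):
    """Cross-task consistency (simplified sketch): penalize points inside
    each detected box whose segmentation distribution puts low mass on the
    box's predicted class.

    points:      (N, 3) xyz coordinates
    point_probs: (N, C) per-point class probabilities from the seg branch
    boxes:       (B, 6) axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)
    box_classes: (B,)   class index predicted by the detection branch
    """
    total, count = 0.0, 0
    for (x0, y0, z0, x1, y1, z1), c in zip(boxes, box_classes):
        # Boolean mask of points contained in this box.
        inside = np.all((points >= [x0, y0, z0]) & (points <= [x1, y1, z1]), axis=1)
        if inside.any():
            # Negative log-likelihood of the box class under the seg branch.
            total += -np.log(point_probs[inside, c] + 1e-9).sum()
            count += int(inside.sum())
    return total / max(count, 1)
```

Minimizing this term pulls the two branches toward agreement: the segmentation branch is nudged to label the interior of a confident detection with the detected class, and gradients flowing the other way (in the full differentiable version) discourage boxes that contradict the segmentation.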
As the following semantic segmentation results show, WyPR is able to detect and segment objects in the scene fairly well without ever having seen a scene labeled at the point level! Additionally, WyPR formalizes the weakly supervised 3D detection problem setup, including setting up baselines and benchmarks, which we believe will spur future research in this area.
Spatial 3D scene understanding is important for various downstream tasks, such as when a robot needs to help an elderly person fetch items from another room or when projecting colleagues sitting around someone’s dining table via an AR device. WyPR gives models spatial 3D understanding capabilities without the need for point-level labels on training scenes, which are extremely time-consuming to produce. By lowering the training data barrier and enabling finer-grained understanding over larger numbers of classes, WyPR could help make spatial 3D scene understanding a lot more accessible, thus bringing previously imagined experiences closer to reality.