Facebook researchers will join computer vision experts from around the world to discuss the latest advances at the International Conference on Computer Vision (ICCV) in Seoul, Korea, from October 27 to November 2. The research conference is one of the preeminent gatherings for leaders in the field. More than 5,000 students, academics, industry professionals, and researchers will be in attendance.
During the conference, Facebook researchers will be presenting more than 40 papers in oral presentations, poster sessions, workshops, and tutorials. Topics include a new platform for research in embodied AI; leveraging explanations to make vision and language models more grounded; novel object captioning at scale; order-aware generative modeling using the new 3D-Craft data set; 360-degree perception and interaction; and computer vision for fashion, art, and design.
For those attending the conference, be sure to stop by Facebook Research booth C4 to chat with our program managers, researchers, and recruiters. Demos and scheduled booth events include the following:
Demo: Fashion Segmentation
Demo: Habitat — Beat the Bot
Demo: Replica on Quest
Preview of Detectron2: Tuesday, October 29, from 4:00 p.m. to 4:30 p.m.
Preview of the Deepfake Detection Challenge: Wednesday, October 30, from 11:30 a.m. to 12:00 p.m.
A day-by-day schedule of research being presented at ICCV is available here.
Samyak Datta, Karan Sikka, Anirban Roy, Karuna Ahuja, Devi Parikh, and Ajay Divakaran
We address the problem of grounding free-form textual phrases by using weak supervision from image-caption pairs. We propose a novel end-to-end model that uses caption-to-image retrieval as a “downstream” task to guide the process of phrase localization. Our method, as a first step, infers the latent correspondences between regions of interest (RoIs) and phrases in the caption and creates a discriminative image representation using these matched RoIs. In the subsequent step, this learned representation is aligned with the caption. Our key contribution lies in building this “caption-conditioned” image encoding, which tightly couples both tasks and allows the weak supervision to effectively guide visual grounding. We provide extensive empirical and qualitative analysis to investigate the different components of our proposed model and compare it with competitive baselines. For phrase localization, we report improvements of 4.9% and 1.3% (absolute) over prior state of the art on the VisualGenome and Flickr30k Entities data sets. We also report results that are on par with the state of the art on the downstream caption-to-image retrieval task on the COCO and Flickr30k data sets.
David Novotny, Nikhila Ravi, Benjamin Graham, Natalia Neverova, and Andrea Vedaldi
We propose C3DPO, a method for extracting 3D models of deformable objects from 2D keypoint annotations in unconstrained images. We do so by learning a deep network that reconstructs a 3D object from a single view at a time, accounting for partial occlusions, and explicitly factoring the effects of viewpoint changes and object deformations. In order to achieve this factorization, we introduce a novel regularization technique. We first show that the factorization is successful if, and only if, there exists a certain canonicalization function of the reconstructed shapes. Then, we learn the canonicalization function together with the reconstruction one, which constrains the result to be consistent. We demonstrate state-of-the-art reconstruction results for methods that do not use ground-truth 3D supervision for a number of benchmarks, including Up3D and PASCAL3D+.
Nilesh Kulkarni, Abhinav Gupta, and Shubham Tulsiani
We explore the task of Canonical Surface Mapping (CSM). Specifically, given an image, we learn to map pixels on the object to their corresponding locations on an abstract 3D model of the category. But how do we learn such a mapping? A supervised approach would require extensive manual labeling, which is not scalable beyond a few hand-picked categories. Our key insight is that the CSM task (pixel to 3D), when combined with 3D projection (3D to pixel), completes a cycle. Hence, we can exploit a geometric cycle consistency loss, thereby allowing us to forgo the dense manual supervision. Our approach allows us to train a CSM model for a diverse set of classes, without sparse or dense keypoint annotation, by leveraging only foreground mask labels for training. We show that our predictions also allow us to infer dense correspondence between two images, and compare the performance of our approach against several methods that predict correspondence by leveraging varying amounts of supervision.
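The cycle described above (pixel to 3D, then 3D back to pixel) can be illustrated with a minimal sketch. It is not the authors' implementation: the scaled orthographic camera, the function names `project` and `cycle_consistency_loss`, and the mean squared pixel distance are all assumptions made here for illustration.

```python
def project(point3d, scale, tx, ty):
    """Project a 3D point to the image with a hypothetical scaled
    orthographic camera (no rotation): drop depth, scale, translate."""
    x, y, _z = point3d
    return (scale * x + tx, scale * y + ty)


def cycle_consistency_loss(pixels, predicted_points, camera):
    """Mean squared distance between each pixel and the reprojection
    of the 3D point that the (assumed) CSM predictor assigned to it.

    pixels: list of (u, v) image coordinates.
    predicted_points: list of (x, y, z) points on the template shape.
    camera: (scale, tx, ty) tuple for the toy camera above.
    """
    if len(pixels) != len(predicted_points) or not pixels:
        raise ValueError("need equally sized, non-empty inputs")
    total = 0.0
    for (u, v), point in zip(pixels, predicted_points):
        pu, pv = project(point, *camera)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total / len(pixels)
```

A loss of zero means every pixel maps to a 3D point that projects back onto itself, which is the consistency the abstract relies on in place of keypoint labels.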
Keren Ye, Mingda Zhang, Adriana Kovashka, Wei Li, Danfeng Qin, and Jesse Berent
Learning to localize and name object instances is a fundamental problem in vision, but state-of-the-art approaches rely on expensive bounding box supervision. While weakly supervised detection (WSOD) methods relax the need for boxes to that of image-level annotations, even cheaper supervision is naturally available in the form of unstructured textual descriptions that users may freely provide when uploading image content. However, straightforward approaches to using such data for WSOD wastefully discard captions that do not exactly match object names. Instead, we show how to squeeze the most information out of these captions by training a text-only classifier that generalizes beyond data set boundaries. Our discovery provides an opportunity for learning detection models from noisy but more abundant and freely available caption data. We also validate our model on three classic object detection benchmarks and achieve state-of-the-art WSOD performance. Our code is available here.
Ruohan Gao and Kristen Grauman
Learning how objects sound from video is challenging, since they often heavily overlap in a single audio channel. Current methods for visually guided audio source separation sidestep the issue by training with artificially mixed video clips, but this puts unwieldy restrictions on training data collection and may even prevent learning the properties of “true” mixed sounds. We introduce a co-separation training paradigm that permits learning object-level sounds from unlabeled multisource videos. Our novel training objective requires that the deep neural network’s separated audio for similar-looking objects be consistently identifiable, while simultaneously reproducing accurate video-level audio tracks for each source training pair. Our approach disentangles sounds in realistic test videos, even in cases where an object was not observed individually during training. We obtain state-of-the-art results on visually guided audio source separation and audio denoising for the MUSIC, AudioSet, and AV-Bench data sets.
Yufei Ye, Maneesh Singh, Abhinav Gupta, and Shubham Tulsiani
We present an approach for pixel-level future prediction given an input image of a scene. We observe that a scene is composed of distinct entities that undergo motion and present an approach that operationalizes this insight. We implicitly predict future states of independent entities while reasoning about their interactions, and compose future video frames using these predicted states. We overcome the inherent multimodality of the task using a global trajectory-level latent random variable, and show that this allows us to sample diverse and plausible futures. We empirically validate our approach against alternate representations and ways of incorporating multimodality. We examine two data sets, one comprising stacked objects that may fall, and the other containing videos of humans performing activities in a gym, and show that our approach allows realistic stochastic video prediction across these diverse settings. See project website for video predictions.
Cross-X Learning for Fine-Grained Visual Categorization
Wei Luo, Xitong Yang, Xianjie Mo, Yuheng Lu, Larry S. Davis, Jun Li, Jian Yang, and Ser-Nam Lim
Recognizing objects from subcategories with very subtle differences remains a challenging task due to the large intraclass and small interclass variation. Recent work tackles this problem in a weakly supervised manner: Object parts are first detected and the corresponding part-specific features are extracted for fine-grained classification. However, these methods typically treat the part-specific features of each image in isolation while neglecting their relationships between different images. In this paper, we propose Cross-X learning, a simple yet effective approach that exploits the relationships between different images and between different network layers for robust multiscale feature learning. Our approach involves two novel components: 1) a cross-category cross-semantic regularizer that guides the extracted features to represent semantic parts, and 2) a cross-layer regularizer that improves the robustness of multiscale features by matching the prediction distribution across multiple layers. Our approach can be easily trained end-to-end and is scalable to large data sets like NABirds. We empirically analyze the contributions of different components of our approach and demonstrate its robustness, effectiveness, and state-of-the-art performance on five benchmark data sets. Code is available here.
DenseRaC: Joint 3D Pose and Shape Estimation by Dense Render-and-Compare
Yuanlu Xu, Song-Chun Zhu, and Tony Tung
We present DenseRaC, a novel end-to-end framework for jointly estimating 3D human pose and body shape from a monocular RGB image. Our two-step framework takes the body pixel-to-surface correspondence map (i.e., IUV map) as proxy representation and then performs estimation of parameterized human pose and shape. Specifically, given an estimated IUV map, we develop a deep neural network optimizing 3D body reconstruction losses and further integrating a render-and-compare scheme to minimize differences between the input and the rendered output, i.e., dense body landmarks, body part masks, and adversarial priors. To boost learning, we further construct a large-scale synthetic data set (MOCA) utilizing web-crawled Mocap sequences, 3D scans, and animations. The generated data covers diversified camera views, human actions, and body shapes, and is paired with full ground truth. Our model jointly learns to represent the 3D human body from hybrid data sets, mitigating the problem of unpaired training data. Our experiments show that DenseRaC obtains superior performance against state of the art on public benchmarks of various human-related tasks.
Rohit Girdhar, Du Tran, Lorenzo Torresani, and Deva Ramanan
Video recognition models have progressed significantly over the past few years, evolving from shallow classifiers trained on handcrafted features to deep spatiotemporal networks. However, labeled video data required to train such models has not been able to keep up with the ever-increasing depth and sophistication of these networks. In this work we propose an alternative approach to learning video representations that requires no semantically labeled videos, and instead leverages the years of effort in collecting and labeling large and clean still-image data sets. We do so by using state-of-the-art models pretrained on image data sets as “teachers” to train video models in a distillation framework. We demonstrate that our method learns truly spatiotemporal features, despite being trained only using supervision from still-image networks. Moreover, it learns good representations across different input modalities, using completely uncurated raw video data sources and with different 2D teacher models. Our method obtains strong transfer performance, outperforming standard techniques for bootstrapping video architectures with image-based models by 16%. We believe that our approach opens up new approaches for learning spatiotemporal representations from unlabeled video data.
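As a rough illustration of the teacher-student idea above, the sketch below computes a generic soft-target distillation loss: the cross-entropy of the student's predicted distribution against the teacher's. The exact loss used in the paper is not stated in this abstract, so the function names, the `temperature` parameter, and the choice of cross-entropy are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy between the teacher's soft targets and the
    student's predictions (hypothetical stand-in for the paper's loss).

    In the setting described above, teacher_logits would come from an
    image model applied to video frames and student_logits from the
    video model; here both are plain lists of floats.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))
```

The loss is smallest when the student reproduces the teacher's distribution, so minimizing it needs no human-provided video labels.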
Yunpeng Chen, Haoqi Fan, Bing Xu, Zhicheng Yan, Yannis Kalantidis, Marcus Rohrbach, Shuicheng Yan, and Jiashi Feng
In natural images, information is conveyed at different frequencies where higher frequencies are usually encoded with fine details and lower frequencies are usually encoded with global structures. Similarly, the output feature maps of a convolution layer can also be seen as a mixture of information at different frequencies. In this work, we propose to factorize the mixed feature maps by their frequencies, and design a novel Octave Convolution (OctConv) operation to store and process feature maps that vary spatially “slower” at a lower spatial resolution, reducing both memory and computation cost. Unlike existing multiscale methods, OctConv is formulated as a single, generic, plug-and-play convolutional unit that can be used as a direct replacement of (vanilla) convolutions without any adjustments in the network architecture. It is also orthogonal and complementary to methods that suggest better topologies or reduce channel-wise redundancy like group or depth-wise convolutions. We experimentally show that by simply replacing convolutions with OctConv, we can consistently boost accuracy for both image and video recognition tasks, while reducing memory and computational cost. An OctConv-equipped ResNet-152 can achieve 82.9% top-1 classification accuracy on ImageNet with merely 22.2 GFLOPs.
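To make the saving concrete, here is a back-of-the-envelope cost sketch. It assumes that a fraction `alpha` of the channels (a name introduced here) is treated as low-frequency and stored at half the width and half the height; the abstract says only "a lower spatial resolution", so the factor of one quarter is an assumption. Costs are relative to a vanilla convolution with the same total channel counts.

```python
def octconv_relative_cost(alpha):
    """Relative feature-map memory and multiply-add cost of an
    octave-style convolution versus a vanilla convolution.

    Assumes (hypothetically) that a fraction `alpha` of both input and
    output channels is low-frequency and kept at 1/4 of the spatial
    positions (half width, half height).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    high = 1.0 - alpha
    # Storage: high-frequency maps at full size, low-frequency at 1/4.
    memory = high + alpha / 4.0
    # Multiply-adds per path: input share * output share * spatial share.
    high_to_high = high * high
    high_to_low = high * alpha / 4.0  # evaluated at the low resolution
    low_to_high = alpha * high / 4.0  # evaluated at the low resolution
    low_to_low = alpha * alpha / 4.0
    flops = high_to_high + high_to_low + low_to_high + low_to_low
    return {"memory": memory, "flops": flops}
```

Under these assumptions, `alpha = 0.5` gives 62.5% of the feature-map memory and 43.75% of the multiply-adds, and `alpha = 0` recovers the vanilla convolution.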
Dmitrii Marin, Zijian He, Peter Vajda, Priyam Chatterjee, Sam Tsai, Fei Yang, and Yuri Boykov
Many automated processes such as autopiloting rely on a good semantic segmentation as a critical component. To speed up performance, it is common to downsample the input frame. However, this comes at the cost of missed small objects and reduced accuracy at semantic boundaries. To address this problem, we propose a new content-adaptive downsampling technique that learns to favor sampling locations near semantic boundaries of target classes. Cost-performance analysis shows that our method consistently outperforms the uniform sampling, improving balance between accuracy and computational efficiency. Our adaptive sampling gives segmentation with better quality of boundaries and more reliable support for smaller-size objects.
Jianwei Yang, Zhile Ren, Mingze Xu, Xinlei Chen, David J. Crandall, Devi Parikh, and Dhruv Batra
Passive visual systems typically fail to recognize objects in the amodal setting where they are heavily occluded. In contrast, humans and other embodied agents have the ability to move in the environment and actively control the viewing angle to better understand object shapes and semantics. In this work, we introduce the task of Embodied Amodal Recognition (EAR): An agent is instantiated in a 3D environment close to an occluded target object, and is free to move in the environment to perform object classification, amodal object localization, and amodal object segmentation. To address this problem, we develop a new model called Embodied Mask R-CNN for agents to learn to move strategically to improve their visual recognition abilities. We conduct experiments using a simulator for indoor environments. Experimental results show that 1) agents with embodiment (movement) achieve better visual recognition performance than passive ones, and 2) in order to improve visual recognition abilities, agents can learn strategic paths that are different from shortest paths.
Qian Huang, Isay Katsman, Horace He, Zeqi Gu, Serge Belongie, and Ser-Nam Lim
Neural networks are vulnerable to adversarial examples, malicious inputs crafted to fool trained models. Adversarial examples often exhibit black-box transfer, meaning that adversarial examples for one model can fool another model. However, adversarial examples are typically overfit to exploit the particular architecture and feature representation of a source model, resulting in suboptimal black-box transfer attacks to other target models. We introduce the Intermediate Level Attack (ILA), which attempts to fine-tune an existing adversarial example for greater black-box transferability by increasing its perturbation on a prespecified layer of the source model, improving upon state-of-the-art methods. We show that we can select a layer of the source model to perturb without any knowledge of the target models while achieving high transferability. Additionally, we provide some explanatory insights regarding our method and the effect of optimizing for adversarial examples in intermediate feature maps.
Saining Xie, Alexander Kirillov, Ross Girshick, and Kaiming He
Neural networks for image recognition have evolved through extensive manual design from simple chainlike models to structures with multiple wiring paths. The success of ResNets and DenseNets is due in large part to their innovative wiring plans. Now, neural architecture search (NAS) studies are exploring the joint optimization of wiring and operation types; however, the space of possible wirings is constrained and still driven by manual design despite being searched. In this paper, we explore a more diverse set of connectivity patterns through the lens of randomly wired neural networks. To do this, we first define the concept of a stochastic network generator that encapsulates the entire network generation process. Encapsulation provides a unified view of NAS and randomly wired networks. Then, we use three classical random graph models to generate randomly wired graphs for networks. The results are surprising: Several variants of these random generators yield network instances that have competitive accuracy on the ImageNet benchmark. These results suggest that new efforts focusing on designing better network generators may lead to new breakthroughs by exploring less constrained search spaces with more room for novel design. The code is publicly available online.
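A minimal sketch of what a stochastic network generator might look like follows. The abstract does not name the three random graph models, so the Erdős–Rényi model used here is only one classical example, and orienting every edge from the lower to the higher node index (to obtain an acyclic wiring) is an assumption. The function names are hypothetical.

```python
import random


def random_dag_edges(num_nodes, edge_prob, seed=0):
    """Sample an Erdős–Rényi graph and orient it into a DAG.

    Each unordered node pair is connected independently with
    probability edge_prob; the edge is then directed from the lower
    index to the higher index, so no cycle can form.
    """
    rng = random.Random(seed)  # seeded so the generator is reproducible
    edges = []
    for i in range(num_nodes):
        for j in range(i + 1, num_nodes):
            if rng.random() < edge_prob:
                edges.append((i, j))
    return edges


def input_and_output_nodes(num_nodes, edges):
    """Nodes without incoming edges would receive the network input;
    nodes without outgoing edges would feed the network output."""
    has_incoming = {j for _, j in edges}
    has_outgoing = {i for i, _ in edges}
    inputs = [n for n in range(num_nodes) if n not in has_incoming]
    outputs = [n for n in range(num_nodes) if n not in has_outgoing]
    return inputs, outputs
```

In this sketch, the seed and the graph parameters together define one network instance; changing the seed draws a different wiring from the same generator.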
Wei-Lin Hsiao, Isay Katsman, Chao-Yuan Wu, Devi Parikh, and Kristen Grauman
Given an outfit, what small changes would most improve its fashionability? This question presents an intriguing new vision challenge. We introduce Fashion++, an approach that proposes minimal adjustments to a full-body clothing outfit that will have maximal impact on its fashionability. Our model consists of a deep image generation neural network that learns to synthesize clothing conditioned on learned per-garment encodings. The latent encodings are explicitly factorized according to shape and texture, thereby allowing direct edits for both fit/presentation and color/patterns/material, respectively. We show how to bootstrap web photos to automatically train a fashionability model, and develop an activation maximization-style approach to transform the input image into its more fashionable self. The edits suggested range from swapping in a new garment to tweaking its color, how it is worn (e.g., rolling up sleeves), or its fit (e.g., making pants baggier). Experiments demonstrate that Fashion++ provides successful edits, both according to automated metrics and human opinion.
Tushar Nagarajan, Christoph Feichtenhofer, and Kristen Grauman
Learning how to interact with objects is an important step toward embodied visual intelligence, but existing techniques suffer from heavy supervision or sensing requirements. We propose an approach to learn human-object interaction “hotspots” directly from video. Rather than treat affordances as a manually supervised semantic segmentation task, our approach learns about interactions by watching videos of real human behavior and anticipating afforded actions. Given a novel image or video, our model infers a spatial hotspot map indicating where an object would be manipulated in a potential interaction — even if the object is currently at rest. Through results with both first- and third-person video, we show the value of grounding affordances in real human-object interactions. Not only are our weakly supervised hotspots competitive with strongly supervised affordance methods, but they can also anticipate object interaction for novel object categories. Project page: http://vision.cs.utexas.edu/projects/interaction-hotspots/
Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra
We present Habitat, a platform for research in embodied artificial intelligence (AI). Habitat enables training embodied agents (virtual robots) in highly efficient photorealistic 3D simulation. Specifically, Habitat consists of 1) Habitat-Sim: a flexible, high-performance 3D simulator with configurable agents, sensors, and generic 3D data set handling. Habitat-Sim is fast — when rendering a scene from Matterport3D, it achieves several thousand frames per second (fps) running single-threaded, and can reach over 10,000 fps multiprocess on a single GPU. 2) Habitat-API: a modular high-level library for end-to-end development of embodied AI algorithms — defining tasks (e.g., navigation, instruction following, question answering), configuring, training, and benchmarking embodied agents.
Cheng-Yang Fu, Tamara L. Berg, and Alexander C. Berg
In this work, we present a new operator, called Instance Mask Projection (IMP), which projects a predicted instance segmentation as a new feature for semantic segmentation. It also supports backpropagation, so it is trainable end-to-end. By adding this operator, we introduce a new paradigm which combines top-down and bottom-up information in semantic segmentation. Our experiments show the effectiveness of IMP on both Clothing Parsing (with complex layering, large deformations, and non-convex objects), and on Street Scene Segmentation (with many overlapping instances and small objects). On the Varied Clothing Parsing data set (VCP), we show instance mask projection can improve mIOU by three points over a state-of-the-art Panoptic FPN segmentation approach. On the ModaNet clothing parsing data set, we show a dramatic absolute improvement of 20.4% over existing baseline semantic segmentation results. In addition, the instance mask projection operator works well on other (nonclothing) data sets, providing an improvement of three points in mIOU on Thing classes of Cityscapes and a self-driving data set, on top of a state-of-the-art approach.
Lluís Castrejon, Nicolas Ballas, and Aaron Courville
Predicting future frames for a video sequence is a challenging generative modeling task. Promising approaches include probabilistic latent variable models such as the Variational Auto-Encoder. While VAEs can handle uncertainty and model multiple possible future outcomes, they have a tendency to produce blurry predictions. In this work we argue that this is a sign of underfitting. To address this issue, we propose to increase the expressiveness of the latent distributions and to use higher capacity likelihood models. Our approach relies on a hierarchy of latent variables, which defines a family of flexible prior and posterior distributions in order to better model the probability of future sequences. We validate our proposal through a series of ablation experiments and compare our approach to current state-of-the-art latent variable models. Our method performs favorably under several metrics in three different data sets.
Oran Gafni, Lior Wolf, and Yaniv Taigman
We propose a method for face de-identification that enables fully automatic video modification at high frame rates. The goal is to maximally de-correlate the identity while having the perception (pose, illumination, and expression) fixed. We achieve this by a novel feed-forward encoder-decoder network architecture that is conditioned on the high-level representation of a person’s facial image. The network is global, in the sense that it does not need to be retrained for a given video or for a given identity, and it creates natural-looking image sequences with little distortion in time.
Georgia Gkioxari, Jitendra Malik, and Justin Johnson
Rapid advances in 2D perception have led to systems that accurately detect objects in real-world images. However, these systems make predictions in 2D, ignoring the 3D structure of the world. Concurrently, advances in 3D shape prediction have mostly focused on synthetic benchmarks and isolated objects. We unify advances in these two areas. We propose a system that detects objects in real-world images and produces a triangle mesh giving the full 3D shape of each detected object. Our system, called Mesh R-CNN, augments Mask R-CNN with a mesh prediction branch that outputs meshes with varying topological structure by first predicting coarse voxel representations which are converted to meshes and refined with a graph convolution network operating over the mesh’s vertices and edges. We validate our mesh prediction branch on ShapeNet, where we outperform prior work on single-image shape prediction. We then deploy our full Mesh R-CNN system on Pix3D, where we jointly detect objects and predict their 3D shapes.
Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson
Image captioning models have achieved impressive results on data sets containing limited visual concepts and large amounts of paired image-caption training data. However, if these models are to ever function in the wild, a much larger variety of visual concepts must be learned, ideally from less supervision. To encourage the development of image captioning models that can learn visual concepts from alternative data sources, such as object detection data sets, we present the first large-scale benchmark for this task. Dubbed nocaps, for novel object captioning at scale, our benchmark consists of 166,100 human-generated captions describing 15,100 images from the Open Images validation and test sets. The associated training data consists of COCO image-caption pairs, plus Open Images image-level labels and object bounding boxes. Since Open Images contains many more classes than COCO, nearly 400 object classes seen in test images have no or very few associated training captions (hence, nocaps). We extend existing novel object captioning models to establish strong baselines for this benchmark and provide analysis to guide future work.
Ilija Radosavovic, Justin Johnson, Saining Xie, Wan-Yen Lo, and Piotr Dollár
Over the past several years, progress in designing better neural network architectures for visual recognition has been substantial. To help sustain this rate of progress, in this work we propose to reexamine the methodology for comparing network architectures. In particular, we introduce a new comparison paradigm of distribution estimates, in which network design spaces are compared by applying statistical techniques to populations of sampled models, while controlling for confounding factors like network complexity. Compared to current methodologies of comparing point and curve estimates of model families, distribution estimates paint a more complete picture of the entire design landscape. As a case study, we examine design spaces used in neural architecture search (NAS). We find significant statistical differences between recent NAS design space variants that have been largely overlooked. Furthermore, our analysis reveals that the design spaces for standard model families like ResNeXt can be comparable to the more complex ones used in recent NAS work. We hope these insights into distribution analysis will enable more robust progress toward discovering better networks for visual recognition.
Zhuoyuan Chen, Demi Guo, Tong Xiao, Saining Xie, Xinlei Chen, Haonan Yu, Jonathan Gray, Kavya Srinet, Haoqi Fan, Jerry Ma, Charles R. Qi, Shubham Tulsiani, Arthur Szlam, and C. Lawrence Zitnick
Research on 2D and 3D generative models typically focuses on the final artifact being created, e.g., an image or a 3D structure. Unlike 2D image generation, the generation of 3D objects in the real world is commonly constrained by the process and order in which the object is constructed. For instance, gravity needs to be taken into account when building a block tower.
In this paper, we explore the prediction of ordered actions to construct 3D objects. Instead of predicting actions based on physical constraints, we propose learning through observing human actions. To enable large-scale data collection, we use the Minecraft environment. We introduce 3D-Craft, a new data set of 2,500 Minecraft houses, each built by human players sequentially from scratch. To learn from these human action sequences, we propose an order-aware 3D generative model called VoxelCNN. In contrast to other 3D generative models, which either have no explicit order (e.g., holistic generation with 3DGAN) or follow a simple heuristic order (e.g., raster-scan), VoxelCNN is trained to imitate human building order with spatial awareness. We also transferred the order to other data sets such as ShapeNet. The 3D-Craft data set, models, and benchmark system will be made publicly available, which may inspire new directions for future research exploration.
Yuyin Zhou, Zhe Li, Song Bai, Chong Wang, Xinlei Chen, Mei Han, Elliot Fishman, and Alan L. Yuille
Accurate multi-organ abdominal CT segmentation is essential to many clinical applications, such as computer-aided intervention. As data annotation requires massive human labor from experienced radiologists, it is common that training data are partially labeled, e.g., pancreas data sets only have the pancreas labeled while leaving the rest marked as background. However, these background labels can be misleading in multi-organ segmentation, since the background usually contains some other organs of interest. To address the background ambiguity in these partially labeled data sets, we propose Prior-aware Neural Network (PaNN) via explicitly incorporating anatomical priors on abdominal organ sizes, guiding the training process with domain-specific knowledge. More specifically, PaNN assumes that the average organ size distributions in the abdomen should approximate their empirical distributions, prior statistics obtained from the fully labeled data set. As our training objective is difficult to be directly optimized using stochastic gradient descent, we propose to reformulate it in a min-max form and optimize it via the stochastic primal-dual gradient algorithm. PaNN achieves state-of-the-art performance on the MICCAI2015 challenge “Multi-Atlas Labeling Beyond the Cranial Vault,” a competition on organ segmentation in the abdomen. We report an average Dice score of 84.97%, surpassing the prior art by a large margin of 3.27%.
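For readers unfamiliar with the metric reported above, the Dice score of a predicted mask against a ground-truth mask is twice the size of their overlap divided by the sum of their sizes. The sketch below computes it for flat binary masks; the function name and the convention of returning 1.0 when both masks are empty are choices made here, and how the challenge averages scores over organs and cases is not covered.

```python
def dice_score(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|).

    pred and target are equal-length sequences of 0/1 labels, for
    example a flattened per-organ segmentation mask.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # convention chosen here: two empty masks agree
    return 2.0 * intersection / total
```

For example, `dice_score([1, 1, 0, 0], [1, 0, 1, 0])` evaluates to 0.5: one voxel overlaps and each mask contains two.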
Kaiming He, Ross Girshick, and Piotr Dollár
We report competitive results on object detection and instance segmentation on the COCO data set using standard models trained from random initialization. The results are no worse than their ImageNet pretraining counterparts, even when using the hyperparameters of the baseline system (Mask R-CNN) that were optimized for fine-tuning pretrained models, with the sole exception of increasing the number of training iterations so the randomly initialized models may converge. Training from random initialization is surprisingly robust; our results hold even when 1) using only 10% of the training data, 2) for deeper and wider models, and 3) for multiple tasks and metrics. Experiments show that ImageNet pretraining speeds up convergence early in training, but does not necessarily provide regularization or improve final target task accuracy. To push the envelope, we demonstrate 50.9 AP on COCO object detection without using any external data — a result on par with the top COCO 2017 competition results that used ImageNet pretraining. These observations challenge the conventional wisdom of ImageNet pretraining for dependent tasks and we expect these discoveries will encourage people to rethink the current de facto paradigm of pretraining and fine-tuning in computer vision.
Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan Misra
Self-supervised learning aims to learn representations from the data itself without explicit manual supervision. Existing efforts ignore a crucial aspect of self-supervised learning: the ability to scale to large amounts of data, since self-supervision requires no manual labels. In this work, we revisit this principle and scale two popular self-supervised approaches to 100 million images. We show that by scaling on various axes (including data size and problem “hardness”), one can largely match or even exceed the performance of supervised pretraining on a variety of tasks such as object detection, surface normal estimation (3D), and visual navigation using reinforcement learning. Scaling these methods also provides many interesting insights into the limitations of current self-supervised techniques and evaluations. We conclude that current self-supervised methods are not “hard” enough to take full advantage of large-scale data and do not seem to learn effective high-level semantic representations. We also introduce an extensive benchmark across nine different data sets and tasks. We believe that such a benchmark along with comparable evaluation settings is necessary to make meaningful progress. Code can be found here.
Bruno Korbar, Du Tran, and Lorenzo Torresani
While many action recognition data sets consist of collections of brief, trimmed videos each containing a relevant action, videos in the real world (e.g., on YouTube) exhibit very different properties: they are often several minutes long, with brief relevant clips interleaved with segments of extended duration containing little change. Densely applying an action recognition system to every temporal clip within such videos is prohibitively expensive. Furthermore, as we show in our experiments, this results in suboptimal recognition accuracy as informative predictions from relevant clips are outnumbered by meaningless classification outputs over long uninformative sections of the video. In this paper we introduce a lightweight “clip-sampling” model that can efficiently identify the most salient temporal clips within a long video. We demonstrate that the computational cost of action recognition on untrimmed videos can be dramatically reduced by invoking recognition only on these most salient clips. Furthermore, we show that this yields significant gains in recognition accuracy compared to analysis of all clips or randomly/uniformly selected clips. On Sports1M, our clip sampling scheme elevates the accuracy of an already state-of-the-art action classifier by 7% and reduces its computational cost by more than 15 times.
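The selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `saliency_fn` and `classifier_fn` are hypothetical stand-ins for the lightweight sampler and the expensive action classifier, and averaging the per-class scores of the selected clips is an assumed aggregation rule.

```python
from typing import Callable, List, Sequence


def classify_with_salient_clips(
    clips: Sequence[object],
    saliency_fn: Callable[[object], float],
    classifier_fn: Callable[[object], List[float]],
    k: int,
) -> List[float]:
    """Run the expensive classifier only on the k most salient clips."""
    if not clips:
        raise ValueError("need at least one clip")
    k = max(1, min(k, len(clips)))

    # Cheap pass: score every clip with the lightweight sampler (hypothetical).
    ranked = sorted(
        range(len(clips)), key=lambda i: saliency_fn(clips[i]), reverse=True
    )
    selected = ranked[:k]

    # Expensive pass: classify only the selected clips.
    predictions = [classifier_fn(clips[i]) for i in selected]
    num_classes = len(predictions[0])

    # Assumed aggregation: average the per-class scores over the selected clips.
    return [sum(p[c] for p in predictions) / k for c in range(num_classes)]
```

With N clips in a video, the expensive classifier runs k times instead of N times, so the saving depends on how cheap the sampler is relative to the classifier.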
Gines Hidalgo, Yaadhav Raaj, Haroon Idrees, Donglai Xiang, Hanbyul Joo, Tomas Simon, and Yaser Sheikh
We present the first single-network approach for 2D whole-body pose estimation, which entails simultaneous localization of body, face, hands, and feet keypoints. Due to the bottom-up formulation, our method maintains constant real-time performance regardless of the number of people in the image. The network is trained in a single stage using multitask learning, through an improved architecture which can handle scale differences between body/foot and face/hand keypoints. Our approach considerably improves upon OpenPose, the only work so far capable of whole-body pose estimation, both in terms of speed and global accuracy. Unlike OpenPose, our method does not need to run an additional network for each hand and face candidate, making it substantially faster for multiperson scenarios. This work directly results in a reduction of computational complexity for applications that require 2D whole-body information (e.g., VR/AR, retargeting). In addition, it yields higher accuracy, especially for occluded, blurry, and low-resolution faces and hands. For code, trained models, and validation benchmarks, visit our project page.
SlowFast Networks for Video Recognition
Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He
We present SlowFast networks for video recognition. Our model involves 1) a Slow pathway, operating at low frame rate, to capture spatial semantics, and 2) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pinpointed as contributions by our SlowFast concept. We report state-of-the-art accuracy on major video recognition benchmarks, Kinetics, Charades, and AVA. Code has been made available here.
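The two-rate sampling idea can be sketched as below. The function name and the example values (a 64-frame clip, a Slow stride of 16, a speed ratio of 8) are illustrative assumptions, not taken from the text above.

```python
from typing import List, Tuple


def slowfast_frame_indices(
    num_frames: int, slow_stride: int, speed_ratio: int
) -> Tuple[List[int], List[int]]:
    """Return frame indices for a Slow and a Fast pathway over one clip.

    The Fast pathway samples `speed_ratio` times more densely than the
    Slow pathway. All parameter names here are hypothetical.
    """
    if slow_stride % speed_ratio != 0:
        raise ValueError("slow_stride must be divisible by speed_ratio")
    fast_stride = slow_stride // speed_ratio
    slow = list(range(0, num_frames, slow_stride))  # low frame rate
    fast = list(range(0, num_frames, fast_stride))  # high frame rate
    return slow, fast


slow_idx, fast_idx = slowfast_frame_indices(64, slow_stride=16, speed_ratio=8)
print(len(slow_idx), len(fast_idx))
```

In this sketch the Fast pathway sees many more frames than the Slow pathway, which is why the abstract stresses keeping it lightweight by reducing its channel capacity.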
SplitNet: Sim2Sim and Task2Task Transfer for Embodied Visual Navigation
Daniel Gordon, Abhishek Kadian, Devi Parikh, Judy Hoffman, and Dhruv Batra
We propose SplitNet, a method for decoupling visual perception and policy learning. By incorporating auxiliary tasks and selective learning of portions of the model, we explicitly decompose the learning objectives for visual navigation into perceiving the world and acting on that perception. We show dramatic improvements over baseline models on transferring between simulators, an encouraging step toward Sim2Real. Additionally, SplitNet generalizes better to unseen environments from the same simulator and transfers faster and more effectively to novel embodied navigation tasks. Further, given only a small sample from a target domain, SplitNet can match the performance of traditional end-to-end pipelines, which receive the entire data set.
Devi Parikh, Dhruv Batra, Hongxia Jin, Larry Heck, Shalini Ghosh, Stefan Lee, and Yilin Shen
Many vision and language models suffer from poor visual grounding — often falling back on easy-to-learn language priors rather than basing their decisions on visual concepts in the image. In this work, we propose a generic approach called Human Importance-aware Network Tuning (HINT) that effectively leverages human demonstrations to improve visual grounding. HINT encourages deep networks to be sensitive to the same input regions as humans. Our approach optimizes the alignment between human attention maps and gradient-based network importances — ensuring that models learn not just to look at but rather rely on visual concepts that humans found relevant for a task when making predictions. We apply HINT to Visual Question Answering and Image Captioning tasks, outperforming top approaches on splits that penalize overreliance on language priors (VQA-CP and robust captioning) using human attention demonstrations for just 6% of the training data.
Gilwoo Lee, Zhiwei Deng, Shugao Ma, Takaaki Shiratori, Siddhartha S. Srinivasa, and Yaser Sheikh
We present a 16.2 million-frame (50-hour) multimodal data set of two-person face-to-face spontaneous conversations. Our data set features synchronized body and finger motion as well as audio data. To the best of our knowledge, it represents the largest motion capture and audio data set of natural conversations to date. The statistical analysis verifies strong intraperson and interperson covariance of arm, hand, and speech features, potentially enabling new directions on data-driven social behavior analysis, prediction, and synthesis. As an illustration, we propose a novel real-time finger motion synthesis method: a temporal neural network innovatively trained with an inverse kinematics (IK) loss, which adds skeletal structural information to the generative model. Our qualitative user study shows that the finger motion generated by our method is perceived as natural and conversation enhancing, while the quantitative ablation study demonstrates the effectiveness of IK loss.
Senthil Purushwalkam, Maximilian Nickel, Abhinav Gupta, and Marc’Aurelio Ranzato
One of the hallmarks of human intelligence is the ability to compose learned knowledge into novel concepts which can be recognized without a single training example. In contrast, current state-of-the-art methods require hundreds of training examples for each possible category to build reliable and accurate classifiers. To alleviate this striking difference in efficiency, we propose a task-driven modular architecture for compositional reasoning and sample efficient learning. Our architecture consists of a set of neural network modules, which are small, fully connected layers operating in semantic concept space. These modules are configured through a gating function conditioned on the task to produce features representing the compatibility between the input image and the concept under consideration. This enables us to express tasks as a combination of subtasks and to generalize to unseen categories by reweighting a set of small modules. Furthermore, the network can be trained efficiently, as it is fully differentiable and its modules operate on small subspaces. We focus our study on the problem of compositional zero-shot classification of object-attribute categories. We show in our experiments that current evaluation metrics are flawed as they only consider unseen object-attribute pairs. When extending the evaluation to the generalized setting, which accounts also for pairs seen during training, we discover that naïve baseline methods perform similarly or better than current approaches. However, our modular network is able to outperform all existing approaches on two widely used benchmark data sets.
Xinlei Chen, Kaiming He, Piotr Dollár, and Ross Girshick
Sliding-window object detectors that generate bounding-box object predictions over a dense, regular grid have advanced rapidly and proven popular. In contrast, modern instance segmentation approaches are dominated by methods that first detect object bounding boxes, and then crop and segment these regions, as popularized by Mask R-CNN. In this work, we investigate the paradigm of dense sliding-window instance segmentation, which is surprisingly underexplored. Our core observation is that this task is fundamentally different than other dense prediction tasks such as semantic segmentation or bounding-box object detection, as the output at every spatial location is itself a geometric structure with its own spatial dimensions. To formalize this, we treat dense instance segmentation as a prediction task over 4D tensors and present a general framework called TensorMask that explicitly captures this geometry and enables novel operators on 4D tensors. We demonstrate that the tensor view leads to large gains over baselines that ignore this structure, and leads to results comparable to Mask R-CNN. These promising results suggest that TensorMask can serve as a foundation for novel advances in dense mask prediction and a more complete understanding of the task. Code will be made available.
Anh T. Tran, Cuong V. Nguyen, and Tal Hassner
We propose a novel approach for estimating the difficulty and transferability of supervised classification tasks. Unlike previous work, our approach is solution agnostic and does not require or assume trained models. Instead, we estimate these values using an information theoretic approach: treating training labels as random variables and exploring their statistics. When transferring from a source to a target task, we consider the conditional entropy between two such variables (i.e., label assignments of the two tasks). We show analytically and empirically that this value is related to the loss of the transferred model. We further show how to use this value to estimate task hardness. We test our claims extensively on three large scale data sets — CelebA (40 tasks), Animals with Attributes 2 (85 tasks), and Caltech-UCSD Birds 200 (312 tasks) — together representing 437 classification tasks. We provide results showing that our hardness and transferability estimates are strongly correlated with empirical hardness and transferability. As a case study, we transfer a learned face recognition model to CelebA attribute classification tasks, showing state-of-the-art accuracy for tasks estimated to be highly transferable.
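The conditional entropy between two label assignments can be estimated from empirical counts. The sketch below assumes both tasks label the same set of inputs and conditions the target labels on the source labels. The function name and that choice of direction are our assumptions, not details confirmed by the abstract.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def conditional_entropy(
    target: Sequence[Hashable], source: Sequence[Hashable]
) -> float:
    """Empirical H(target | source) in nats, from paired label sequences."""
    if len(target) != len(source) or not target:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(target)
    joint = Counter(zip(source, target))  # counts of (source, target) pairs
    marginal = Counter(source)            # counts of source labels

    entropy = 0.0
    for (z, _y), count in joint.items():
        # p(z, y) * log p(y | z), with p(y | z) = count(z, y) / count(z)
        entropy -= (count / n) * math.log(count / marginal[z])
    return entropy
```

A value of zero means the source labels fully determine the target labels. Larger values mean the source labels say less about the target task.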
Ruth Fong, Mandela Patrick, and Andrea Vedaldi
The problem of attribution is concerned with identifying the parts of an input that are responsible for a model’s output. An important family of attribution methods is based on measuring the effect of perturbations applied to the input. In this paper, we discuss some of the shortcomings of existing approaches to perturbation analysis and address them by introducing the concept of extremal perturbations, which are theoretically grounded and interpretable. We also introduce a number of technical innovations to compute extremal perturbations, including a new area constraint and a parametric family of smooth perturbations, which allow us to remove all tunable hyperparameters from the optimization problem. We analyze the effect of perturbations as a function of their area, demonstrating excellent sensitivity to the spatial properties of the deep neural network under stimulation. We also extend perturbation analysis to the intermediate layers of a network. This application allows us to identify the salient channels necessary for classification, which, when visualized using feature inversion, can be used to elucidate model behavior. Lastly, we introduce TorchRay, an interpretability library built on PyTorch.
Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin
Pretraining general-purpose visual features with convolutional neural networks without relying on annotations is a challenging and important task. Most recent efforts in unsupervised feature learning have focused on either small or highly curated data sets like ImageNet, whereas using non-curated raw data sets was found to decrease the feature quality when evaluated on a transfer task. Our goal is to bridge the performance gap between unsupervised methods trained on curated data, which are costly to obtain, and massive raw data sets that are easily available. To that effect, we propose a new unsupervised approach which leverages self-supervision and clustering to capture complementary statistics from large-scale data. We validate our approach on 96 million images from YFCC100M, achieving state-of-the-art results among unsupervised methods on standard benchmarks, which confirms the potential of unsupervised learning when only non-curated raw data are available. We also show that pretraining a supervised VGG-16 with our method achieves 74.9% top-1 classification accuracy on the validation set of ImageNet, which is an improvement of +0.8% over the same network trained from scratch. Our code is available here.
Du Tran, Heng Wang, Lorenzo Torresani, and Matt Feiszli
Group convolution has been shown to offer great computational savings in various 2D convolutional architectures for image classification. It is natural to ask 1) if group convolution can help to alleviate the high computational cost of video classification networks, 2) what factors matter the most in 3D group convolutional networks, and 3) what are good computation/accuracy trade-offs with 3D group convolutional networks. This paper studies the effects of different design choices in 3D group convolutional networks for video classification. We empirically demonstrate that the amount of channel interactions plays an important role in the accuracy of 3D group convolutional networks.
Our experiments suggest two main findings. First, it is a good practice to factorize 3D convolutions by separating channel interactions and spatiotemporal interactions, as this leads to improved accuracy and lower computational cost. Second, 3D channel-separated convolutions provide a form of regularization, yielding lower training accuracy but higher test accuracy compared to 3D convolutions. These two empirical findings lead us to design an architecture — Channel-Separated Convolutional Network (CSN) — which is simple and efficient, yet accurate. On Sports1M and Kinetics, our CSNs are comparable with or better than the state of the art while being 2 to 3 times more efficient.
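To make the computational argument concrete, the sketch below counts weights for a full 3D convolution and for one possible factorization: a 1x1x1 convolution for channel interactions followed by a depthwise 3x3x3 convolution for spatiotemporal interactions. The function names, the 64-channel example, and the exact factorization are illustrative assumptions. Biases are ignored.

```python
from typing import Tuple


def conv3d_weights(
    c_in: int, c_out: int, kernel: Tuple[int, int, int] = (3, 3, 3), groups: int = 1
) -> int:
    """Weight count of a (grouped) 3D convolution, ignoring biases."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    kt, kh, kw = kernel
    return (c_in // groups) * c_out * kt * kh * kw


def channel_separated_weights(
    channels: int, kernel: Tuple[int, int, int] = (3, 3, 3)
) -> int:
    """Pointwise (channel mixing) plus depthwise (spatiotemporal) weights."""
    pointwise = conv3d_weights(channels, channels, (1, 1, 1))
    depthwise = conv3d_weights(channels, channels, kernel, groups=channels)
    return pointwise + depthwise


full = conv3d_weights(64, 64)
separated = channel_separated_weights(64)
print(full, separated, round(full / separated, 1))
```

For 64 channels the full convolution has 110,592 weights against 5,824 for the factorized pair. This per-layer weight count is not the same thing as the end-to-end efficiency figure reported in the abstract, which also depends on the rest of the network.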
xR-EgoPose: Egocentric 3D Human Pose from an HMD Camera
Denis Tome, Patrick Peluse, Lourdes Agapito, and Hernan Badino
We present a new solution to egocentric 3D body pose estimation from monocular images captured from a downward looking fish-eye camera installed on the rim of a head-mounted virtual reality device. This unusual viewpoint, just 2 cm away from the user’s face, leads to images with unique visual appearance, characterized by severe self-occlusions and strong perspective distortions that result in a drastic difference in resolution between lower and upper body. Our contribution is two-fold. First, we propose a new encoder-decoder architecture with a novel dual branch decoder designed specifically to account for the varying uncertainty in the 2D joint locations. Our quantitative evaluation, both on synthetic and real-world data sets, shows that our strategy leads to substantial improvements in accuracy over state-of-the-art egocentric pose estimation approaches. Our second contribution is a new large-scale photorealistic synthetic data set — xR-EgoPose — offering 383K frames of high-quality renderings of people with a diversity of skin tones, body shapes, and clothing, in a variety of backgrounds and lighting conditions, and performing a range of actions. Our experiments show that the high variability in our new synthetic training corpus leads to good generalization to real-world footage and to state-of-the-art results on real-world data sets with ground truth. Moreover, an evaluation on the Human3.6M benchmark shows that the performance of our method is on par with top-performing approaches on the more classic problem of 3D human pose from a third-person viewpoint.
Kristen Grauman (co-organizer); Hanbyul Joo (speaker)
Closing the Loop Between Vision and Language (CLVL)
Marcus Rohrbach (co-organizer)
Computer Vision for Fashion, Art, and Design
Tamara Berg (speaker); Kristen Grauman (steering committee)
CroMoL: Cross-Modal Learning in Real World
Lior Wolf (speaker)
Disguised Faces in the Wild (DFW)
Manohar Paluri (speaker)
Egocentric Perception, Interaction, and Computing (EPIC)
Kristen Grauman (co-organizer)
Extreme Vision Modeling
Vignesh Ramanathan, Dhruv Mahajan, Laurens van der Maaten, Alexander C. Berg, and Ishan Misra (co-organizers)
Eye Tracking for AR and VR
Sachin Talathi, Immo Schütz, Chen Jixu, Robert Cavin, and Stephan Garbin (co-organizers)
Geometry Meets Deep Learning (GMDL)
Andrea Vedaldi (speaker)
Interpreting and Explaining Visual Artificial Intelligence Models
Paper: Occlusions for Effective Data Augmentation in Image Classification
Ruth Fong and Andrea Vedaldi
Joint COCO and Mapillary Recognition Workshop
Detection, Keypoint, Panoptic, and DensePose Challenges
Larry Zitnick, Piotr Dollár, and Ross Girshick (COCO Consortium)
Large Scale Holistic Video Understanding
Christoph Feichtenhofer (program committee); Rohit Girdhar, Kristen Grauman, Manohar Paluri, and Du Tran (speakers)
Low-Power Computer Vision Workshop
Alexander C. Berg (co-organizer)
Multi-modal Video Analysis and Moments in Time
Alexander C. Berg (co-organizer)
Ross Girshick (speaker)
Scene Graph Representation and Learning
Devi Parikh and Laurens van der Maaten (speakers)
Visual Recognition for Images, Video, and 3D Tutorial
Alexander Kirillov, Ross Girshick, Kaiming He, Georgia Gkioxari, Christoph Feichtenhofer, Saining Xie, Haoqi Fan, Yuxin Wu, Nikhila Ravi, Wan-Yen Lo, and Piotr Dollár
Visual Recognition for Medical Images
Kyunghyun Cho (co-organizer); Adriana Romero (program committee)
Workshop on Preregistration
Michela Paganini (speaker)
YouTube-8M Large-Scale Video Understanding Workshop
Jitendra Malik (speaker)
Facebook Artificial Intelligence Research Scientist Tamara Berg is a 2019 recipient of the PAMI Mark Everingham Prize for her work in generating and maintaining the Labeled Faces in the Wild (LFW) data set and benchmark since its inception in 2007. The Everingham Prize is given to a researcher or a team of researchers who have made a selfless contribution of significant benefit to other members of the computer vision community.