SlowFast is a new approach to video recognition that improves action classification and action detection by extracting information from video at both slow and fast frame rates simultaneously. The model uses two pathways: one processes spatial appearance semantics (such as colors, textures, and objects) that can be viewed at low frame rates, while the other looks for rapidly changing motions (such as clapping or waving) that are more easily recognized at higher frame rates. Our approach, which was inspired in part by the dual-pathway nature of primate vision, is more lightweight than previous video recognition systems and sets a new state of the art on four major public benchmark datasets.
By analyzing raw video at different frame rates, a SlowFast network can essentially divide and conquer, with each pathway leveraging its particular strengths in video modeling. The slow pathway processes video clips at rates as low as two frames per second (fps) in video that was originally recorded at 30 fps. Even at these rates, features such as the color, texture, or identity of an object or a person do not change. The fast pathway, meanwhile, operates on the same raw video clips but at a much higher frame rate: given 30 fps footage, it might process the clip at 16 fps. The higher frame rate gives a better understanding of what kinds of movements are taking place in the video. The main benefit of this approach, however, is the efficiency gained by reducing the fast pathway’s channel capacity while also boosting its temporal modeling ability. The result is a system with lower overall computational complexity and higher accuracy than other, more compute-heavy approaches.
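To make the two sampling rates concrete, here is a minimal sketch of how a single 30 fps clip could be read at 2 fps and at 16 fps, along with a rough estimate of what a narrower fast pathway costs. It is an illustration only and is not taken from the SlowFast codebase: the helper names `sample_frame_indices` and `relative_conv_cost`, the channel widths of 64 and 8, and the cost proxy (frames times channels squared for a convolution with equal input and output width) are all assumptions made for this example.

```python
def sample_frame_indices(num_frames, source_fps, target_fps):
    """Pick the indices of frames to keep when reading a clip at target_fps.

    Hypothetical helper for illustration; not part of the SlowFast codebase.
    """
    if not 0 < target_fps <= source_fps:
        raise ValueError("target_fps must be in (0, source_fps]")
    indices = []
    i = 0
    while True:
        # Position of the i-th sampled frame in the source clip, rounded down.
        idx = (i * source_fps) // target_fps
        if idx >= num_frames:
            break
        indices.append(idx)
        i += 1
    return indices


def relative_conv_cost(num_frames, num_channels):
    """Rough cost proxy (an assumption): a convolution with equal input and
    output width scales with frames x channels^2, all else held equal."""
    return num_frames * num_channels ** 2


# A 2-second clip recorded at 30 fps.
clip_frames = 60
slow_idx = sample_frame_indices(clip_frames, source_fps=30, target_fps=2)
fast_idx = sample_frame_indices(clip_frames, source_fps=30, target_fps=16)

# Assumed channel widths (not from the post): the fast pathway is 8x narrower.
slow_cost = relative_conv_cost(len(slow_idx), num_channels=64)
fast_cost = relative_conv_cost(len(fast_idx), num_channels=8)

print(slow_idx)               # [0, 15, 30, 45]
print(len(fast_idx))          # 32
print(fast_cost / slow_cost)  # 0.125
```

In this toy setup the fast pathway sees eight times as many frames as the slow pathway, yet under the assumed widths each of its layers costs roughly one-eighth as much. That is the kind of trade described above: more temporal resolution in exchange for less channel capacity.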
We evaluated this approach’s ability to classify actions in video on the Kinetics-400, Kinetics-600, and Charades datasets, and its ability to detect actions on the AVA dataset. The results of these experiments show that SlowFast networks are consistently more accurate than systems that rely on pretraining, and they beat state-of-the-art models by several percentage points on Kinetics and Charades. Our SlowFast-based system also ranked first in the AVA video activity detection challenge at CVPR 2019.
We haven’t used SlowFast or the public datasets mentioned in this post to train production models, but our research could have broad applications for video analysis, including improving how systems automatically identify and classify video content. Progress in this area could advance efforts to find and remove harmful videos, while also offering better personalization for video suggestions. In addition to sharing our results in the paper below, we’re open-sourcing the codebase for this method, which is available to download from GitHub.
Those attending ICCV 2019 can learn more about SlowFast and the corresponding codebase at our tutorial on October 28 and our oral presentation on October 31.