Computer Vision

Open Source

PyTorchVideo: A deep learning library for video understanding

May 18, 2021

What it is:

PyTorchVideo is a deep learning library for research and applications in video understanding. It provides easy-to-use, efficient, and reproducible implementations of state-of-the-art video models, data sets, transforms, and tools in PyTorch.
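
For example, pretrained models from the PyTorchVideo model zoo can be pulled through Torch Hub. The sketch below is a minimal illustration, assuming the "slow_r50" Kinetics-400 checkpoint from the model zoo and a random dummy clip in place of real video frames:

```python
import torch

# Load a pretrained video classification model from the PyTorchVideo model zoo
# via Torch Hub ("slow_r50" is a ResNet-50-based Slow network trained on
# Kinetics-400).
model = torch.hub.load("facebookresearch/pytorchvideo", "slow_r50", pretrained=True)
model = model.eval()

# A dummy clip shaped (batch, channels, time, height, width); the model zoo
# lists 8 frames at 256x256 for this model.
clip = torch.randn(1, 3, 8, 256, 256)
with torch.no_grad():
    logits = model(clip)

print(logits.shape)  # torch.Size([1, 400]), scores over the Kinetics-400 classes
```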

What it does:

The PyTorchVideo library supports components that can be used for a variety of video understanding tasks, such as video classification, detection, self-supervised learning, and optical flow. Importantly, it is not limited to visual signals: PyTorchVideo also supports other modalities, including audio and text. Nor is it limited to desktop devices: the Accelerator package provides mobile hardware-specific optimizations and a model deployment flow, pushing the boundaries of on-device performance.
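
As a sketch of how those components compose, the snippet below builds a typical clip-preprocessing pipeline with pytorchvideo.transforms; the frame count, normalization statistics, and sizes are illustrative values rather than the settings of any particular pretrained model:

```python
from torchvision.transforms import Compose, Lambda
from pytorchvideo.transforms import (
    ApplyTransformToKey,
    Normalize,
    ShortSideScale,
    UniformTemporalSubsample,
)

# Transform only the "video" entry of a clip dict, leaving other modalities
# (such as "audio") untouched.
transform = ApplyTransformToKey(
    key="video",
    transform=Compose([
        UniformTemporalSubsample(8),   # keep 8 evenly spaced frames
        Lambda(lambda x: x / 255.0),   # scale uint8 pixel values to [0, 1]
        Normalize((0.45, 0.45, 0.45), (0.225, 0.225, 0.225)),
        ShortSideScale(size=256),      # resize so the shorter side is 256 px
    ]),
)
```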

A PyTorchVideo-accelerated X3D model running on a Samsung Galaxy S10 phone. The model runs ~8x faster than real time, requiring roughly 130 ms to process one second of video.

A PyTorchVideo-based SlowFast model performing video action detection.

Features that allow PyTorchVideo to accelerate a project include:

  • A suite of state-of-the-art video models with pretrained weights, along with customizable components that enable researchers to build new video architectures.

  • A set of downstream tasks including action classification, acoustic event detection, action detection, and self-supervised learning (SSL).

  • Support for a wide variety of data sets and tasks for benchmarking different video models under a range of evaluation protocols.

  • Efficient building blocks and a deployment flow optimized for inference on hardware (mobile devices, Intel NNPI, etc.), enabling hardware-aware model design and full-speed on-device model execution.

  • A growing toolkit of common scripts for video processing, including decoding, tracking, and optical flow extraction (see the decoding sketch after this list).
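
For instance, decoding is exposed through the EncodedVideo utility. A minimal sketch, assuming a local file named archery.mp4 (a hypothetical path):

```python
from pytorchvideo.data.encoded_video import EncodedVideo

# Decode a two-second clip from a local video file ("archery.mp4" is a
# hypothetical path); get_clip returns a dict of decoded modalities.
video = EncodedVideo.from_path("archery.mp4")
clip = video.get_clip(start_sec=0.0, end_sec=2.0)

frames = clip["video"]  # float tensor of shape (C, T, H, W)
audio = clip["audio"]   # audio waveform tensor, or None if there is no audio stream
```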

Going forward, we are committed to continuing to enhance the PyTorchVideo library to enable and support more groundbreaking research in video understanding. We welcome contributions from the entire community, and we will direct our efforts toward supporting the rich open source ecosystem that is pushing the boundaries of video research.

Why it matters:

Understanding video is one of the grand challenges of computer vision. Increases in computational resources and in the amount of video data on the web are driving advances in the field. However, the scale, richness, and difficulty of analyzing video data mean there is strong demand for effective, efficient, cutting-edge models, infrastructure, and tools for video understanding.

PyTorchVideo aims to meet that demand by providing a unified repository of reproducible and efficient video understanding components that are readily available for centralized use in research and production applications.

Another major challenge is the lack of a standardized, video-focused library that serves a variety of video use cases in one place. This has created a barrier to entry for developers looking to work with videos for the first time. Lack of standardization also makes it difficult to collaborate and to build upon others’ work. In this regard, PyTorchVideo is our sincere effort to address some of these bottlenecks.

At Facebook, PyTorchVideo supports state-of-the-art research from FAIR, such as the SlowFast and X3D work shown above, and it has been used to power recent advances in video transformers and self-supervised learning.

Check out the PyTorchVideo website
Get the code on GitHub

Acknowledgments:

PyTorchVideo is supported and developed by the following contributors: Tullie Murrell, Haoqi Fan, Kalyan Vasudev Alwala, Yilei Li, Yanghao Li, Heng Wang, Bo Xiong, Nikhila Ravi, Matt Feiszli, Aaron Adcock, Wan-Yen Lo, Jitendra Malik, Ross Girshick, and Christoph Feichtenhofer.

Written By

The PyTorchVideo Team