RESEARCH

COMPUTER VISION

Learning State-Aware Visual Representations from Audible Interactions

November 10, 2022

Abstract

We propose a self-supervised algorithm to learn representations from egocentric video data. Recently, significant efforts have been made to capture humans interacting with their own environments as they go about their daily activities. In result, several large egocentric datasets of interaction-rich multi-modal data have emerged. However, learning representations from videos can be challenging. First, given the uncurated nature of long-form continuous videos, learning effective representations require focusing on moments in time when interactions take place. Second, visual representations of daily activities should be sensitive to changes in the state of the environment. However, current successful multi-modal learning frameworks encourage representation invariance over time. To address these challenges, we leverage audio signals to identify moments of likely interactions which are conducive to better learning. We also propose a novel self-supervised objective that learns from audible state changes caused by interactions. We validate these contributions extensively on two large-scale egocentric datasets, EPIC-Kitchens-100 and the recently released Ego4D, and show improvements on several downstream tasks, including action recognition, long-term action anticipation, and object state change classification.

Download the Paper

AUTHORS

Written by

Unnat Jain

Abhinav Gupta

Himangi Mittal

Pedro Morgado

Publisher

NeurIPS

Research Topics

Computer Vision

Related Publications

June 04, 2023

COMPUTER VISION

Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation

Dahyun Kang, Peter Koniusz, Minsu Cho, Naila Murray

June 04, 2023

May 09, 2023

COMPUTER VISION

ImageBind: One Embedding Space To Bind Them All

Rohit Girdhar, Alaa El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

May 09, 2023

April 20, 2023

COMPUTER VISION

SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision

Xubo Liu, Egor Lakomkin, Dino Vougioukas, Pingchuan Ma, Honglie Chen, Ruiming Xie, Morrie Doulaty, Niko Moritz, Jachym Kolar, Stavros Petridis, Maja Pantic, Christian Fuegen

April 20, 2023

April 06, 2023

COMPUTER VISION

On the Benefits of 3D Pose and Tracking for Human Action Recognition

Jathushan Rajasegaran, Georgios Pavlakos, Angjoo Kanazawa, Christoph Feichtenhofer, Jitendra Malik

April 06, 2023

Help Us Pioneer The Future of AI

We share our open source frameworks, tools, libraries, and models for everything from research exploration to large-scale production deployment.