September 08, 2022
While transformers have begun to dominate many tasks in vision, applying them to large images is still computationally difficult. A large reason for this is that self-attention scales quadratically with the number of tokens, which in turn, scales quadratically with the image size. On larger images (e.g., 1080p), over 60% of the total computation in the network is spent solely on creating and applying attention matrices. We take a step toward solving this issue by introducing Hydra Attention, an extremely efficient attention operation for Vision Transformers (ViTs). Paradoxically, this efficiency comes from taking multi-head attention to its extreme: by using as many attention heads as there are features, Hydra Attention is computationally linear in both tokens and features with no hidden constants, making it significantly faster than standard self-attention in an off-the-shelf ViT-B/16 by a factor of the token count. Moreover, Hydra Attention retains high accuracy on ImageNet and, in some cases, actually improves it.
Written by
Cheng-Yang Fu
Daniel Bolya
Peizhao Zhang
Xiaoliang Dai
Judy Hoffman
Publisher
ECCV - International Workshop on Computational Aspects of Deep Learning
Research Topics
April 18, 2024
Jonas Kohler, Albert Pumarola, Edgar Schoenfeld, Artsiom Sanakoyeu, Roshan Sumbaly, Peter Vajda, Ali Thabet
April 18, 2024
March 20, 2024
Armen Avetisyan, Chris Xie, Henry Howard-Jenkins, Tsun-Yi Yang, Samir Aroudj, Suvam Patra, Fuyang Zhang, Duncan Frost, Luke Holland, Campbell Orme, Jakob Julian Engel, Edward Miller, Richard Newcombe, Vasileios Balntas
March 20, 2024
February 13, 2024
Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, Natalia Neverova, Andrea Vedaldi, Oran Gafni, Filippos Kokkinos
February 13, 2024
January 25, 2024
Felix Xu, Di Lin, Jianjun Zhao, Jianlang Chen, Lei Ma, Qing Guo, Wei Feng, Xuhong Ren
January 25, 2024
Product experiences
Foundational models
Product experiences
Latest news
Foundational models