Understanding Video Transformers
Overview:
This paper studies concept-based interpretability for video Transformer representations. Specifically, we aim to explain the decision-making process of video Transformers in terms of high-level spatio-temporal concepts that are discovered automatically. Previous research on concept-based interpretability has focused primarily on image-level tasks. Video models must also handle the time dimension, which increases complexity and makes it harder to identify dynamic concepts that evolve over time. In this work, we systematically address these challenges by introducing the first Video Transformer Concept Discovery (VTCD) algorithm. To this end, we propose an effective unsupervised method for identifying units of video Transformer representations (concepts) and ranking their importance to the model's output. The resulting concepts are highly interpretable, revealing the spatio-temporal reasoning mechanisms and object-centric representations inside otherwise black-box video models. By jointly analyzing a diverse set of supervised and self-supervised representations, we find that some of these mechanisms are common across video Transformers. Finally, we demonstrate that VTCD can be used to improve model performance on fine-grained tasks.
Target Users:
Anyone who needs to explain the decision-making process of video Transformers or improve the performance of video models
Use Cases
Explain the decision-making process of video Transformers
Improve the performance of video models
Discover universal mechanisms within video Transformers
Features
Unsupervised video Transformer Concept Discovery
Ranking the importance of video Transformer concepts
Revealing the spatio-temporal reasoning mechanisms and object representations in video Transformers