Second Large Scale Holistic Video Understanding Workshop Call for Papers

 Second Large Scale Holistic Video Understanding Workshop @CVPR'21

CVPR Dates: June 19-25, 2021 / Workshop Date: TBD




CAMERA READY: April 18, 2021

Please submit papers via CMT:

WORKSHOP REGISTRATION: In conjunction with CVPR'21


In recent years, we have seen tremendous progress in the capabilities
of computer systems to classify video clips taken from the Internet or
to analyze human actions in videos. Many works in the video
recognition field focus on specific video understanding tasks, such as
action recognition or scene understanding. While these tasks have seen
great achievements, holistic video understanding has not received
enough attention as a problem in its own right. Current systems are
experts in specific sub-problems of the general video understanding
problem. However, real-world applications, such as analyzing multiple
concepts of a video for video search engines and media monitoring
systems, or providing an appropriate description of the surrounding
environment of a humanoid robot, require a combination of current
state-of-the-art methods. Therefore, in this workshop, we introduce
holistic video understanding as a new challenge for the video
understanding community. This challenge focuses on the recognition of
scenes, objects, actions, attributes, and events in real-world
user-generated videos. To address such tasks, we also introduce our
new dataset, Holistic Video Understanding (HVU), which is organized
hierarchically in a semantic taxonomy of holistic video understanding.
Almost all video datasets captured in real-world conditions target
human action or sport recognition, so our new dataset can help the
vision community and draw more attention to developing new solutions
for holistic video understanding. The workshop aims to bring together
ideas around multi-label and multi-task recognition of different
semantic concepts in real-world videos, and research efforts can be
evaluated on our new dataset.

HVU Dataset:

Topics of interest include:

    Large scale video understanding

    Multi-Modal learning from videos

    Multi-concept recognition from videos

    Multi-task deep neural networks for videos

    Learning holistic representation from videos

    Weakly supervised learning from web videos

    Object, scene and event recognition from videos

    Unsupervised video visual representation learning

    Unsupervised and self-supervised learning with videos

Invited Speakers:

    Cordelia Schmid, Google AI
    Joao Carreira, Google DeepMind
    Carl Vondrick, Columbia University
    Dima Damen, University of Bristol
    Sanja Fidler, University of Toronto
    Kristen Grauman, University of Texas at Austin

For questions about the HVU workshop, please contact:

Also, follow HVU on Twitter for the latest news: or

Organizers:

Mohsen Fayyaz, University of Bonn

Ali Diba, KU Leuven

Vivek Sharma, Harvard & MIT

Juergen Gall, University of Bonn

Ehsan Adeli, Stanford University

Rainer Stiefelhagen, KIT

Luc Van Gool, ETH Zurich & KU Leuven

David Ross, Google AI

Manohar Paluri, Facebook AI