Learning Streaming Video Representation via Multitask Training

Yibin Yan*, Jilan Xu*, Shangzhe Di, Yikun Liu, Yudi Shi, Qirui Chen, Zeqian Li, Yifei Huang, Weidi Xie
School of Artificial Intelligence, Shanghai Jiao Tong University
Shanghai Innovation Institute, Shanghai AI Laboratory, Fudan University
Technical Report

Abstract

Understanding continuous video streams plays a fundamental role in real-time applications including embodied AI and autonomous driving. Unlike offline video understanding, streaming video understanding requires the ability to process video streams frame by frame, preserve historical information, and make low-latency decisions. To address these challenges, our main contributions are three-fold. (i) We develop a novel streaming video backbone, termed StreamFormer, by incorporating causal temporal attention into a pre-trained vision transformer. This enables efficient streaming video processing while maintaining the image representation capability. (ii) To train StreamFormer, we propose to unify diverse spatial-temporal video understanding tasks within a multitask visual-language alignment framework. Hence, StreamFormer learns global semantics, temporal dynamics, and fine-grained spatial relationships simultaneously. (iii) We conduct extensive experiments on online action detection, online video instance segmentation, and video question answering. StreamFormer achieves competitive results while maintaining efficiency, demonstrating its potential for real-time applications.
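As a rough illustration of the backbone modification described above, the snippet below sketches a causal temporal attention layer applied on top of per-frame patch tokens from a pre-trained vision transformer. This is a minimal sketch, not the released implementation: the (batch, frames, patches, dim) tensor layout, the pre-norm residual design, and the head count are illustrative assumptions.

# Minimal sketch (not the official implementation): a causal temporal attention
# layer operating on per-frame patch tokens from a pre-trained ViT.
# Assumed tensor layout: x has shape (B, T, N, D) = (batch, frames, patches, dim).
import torch
import torch.nn as nn


class CausalTemporalAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, N, D = x.shape
        # Attend over time independently at each spatial location.
        h = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        # Upper-triangular mask so frame t only attends to frames <= t.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        out, _ = self.attn(self.norm(h), self.norm(h), self.norm(h), attn_mask=mask)
        out = (h + out).reshape(B, N, T, D).permute(0, 2, 1, 3)
        return out


frames = torch.randn(2, 8, 196, 768)           # e.g. 8 frames of 14x14 patch tokens
temporal_layer = CausalTemporalAttention(768)
streaming_feats = temporal_layer(frames)       # causal along the time axis
print(streaming_feats.shape)                   # torch.Size([2, 8, 196, 768])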

Streaming Video Representation

Teaser


StreamFormer learns streaming video representations at various granularities through multitask training, making it applicable to diverse downstream tasks such as Online Action Detection, Online Video Instance Segmentation, and Video Question Answering.

Method

Overall Structure


Overall framework of StreamFormer. Left: StreamFormer is trained under a unified visual-language alignment framework, enabling simultaneous understanding of global semantics, temporal dynamics, and fine-grained spatial relationships. Right: each level utilizes features of a different granularity: (i) the last-frame feature for the global level, (ii) per-frame features for the temporal level, and (iii) per-frame, per-patch features for the spatial level.
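For concreteness, the following minimal sketch shows how the three feature granularities could be sliced from the streaming token features. The (batch, frames, patches, dim) layout and the mean pooling over patches are assumptions for illustration; the actual model may use a different pooling or a dedicated summary token.

# Sketch (assumed tensor layout, not the official code): deriving the three
# feature granularities from streaming token features of shape (B, T, N, D).
import torch

tokens = torch.randn(2, 8, 196, 768)    # (batch, frames, patches, dim)

# (i) Global level: the last frame summarizes the clip seen so far, since causal
#     temporal attention lets it aggregate information from all previous frames.
global_feat = tokens[:, -1].mean(dim=1)   # (B, D)

# (ii) Temporal level: one feature per frame, e.g. mean-pooled over patches.
temporal_feat = tokens.mean(dim=2)        # (B, T, D)

# (iii) Spatial level: per-frame, per-patch features for dense prediction.
spatial_feat = tokens                     # (B, T, N, D)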

Multitask Training


Specifically, we optimize the backbone with three complementary objectives:
(i) Global-level Tasks: The objective is to learn the global semantics of a video clip, encompassing both the primary action and the scene.
(ii) Temporal-level Tasks: The objective here is to develop fine-grained discriminative capabilities along the temporal dimension, enabling the model to perform tasks such as per-frame action understanding and perceiving events that occur across frames.
(iii) Spatial-level Tasks: The aim is to learn the fine-grained spatial relationships within a video clip, which involves understanding the interactions between different objects across all video frames. A simplified sketch of the combined objective is given below.
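The sketch below illustrates one way the three objectives could be combined into a single training loss, using a symmetric InfoNCE for visual-language alignment at each granularity. The loss choice, tensor shapes, and the weights w_global / w_temporal / w_spatial are hypothetical; the actual task-specific losses in the paper (e.g., for segmentation) may differ.

# Hypothetical multitask objective: one alignment loss per granularity,
# summed with illustrative weights (not the paper's exact formulation).
import torch
import torch.nn.functional as F


def info_nce(v: torch.Tensor, t: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss between matched visual/text embeddings of shape (B, D)."""
    v, t = F.normalize(v, dim=-1), F.normalize(t, dim=-1)
    logits = v @ t.t() / temperature
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


def multitask_loss(global_v, global_t, frame_v, frame_t, patch_v, patch_t,
                   w_global=1.0, w_temporal=1.0, w_spatial=1.0):
    B, T, D = frame_v.shape
    Bp, Np, Dp = patch_v.shape
    loss_global = info_nce(global_v, global_t)                                  # clip <-> caption
    loss_temporal = info_nce(frame_v.reshape(B * T, D), frame_t.reshape(B * T, D))   # frame <-> frame-level text
    loss_spatial = info_nce(patch_v.reshape(Bp * Np, Dp), patch_t.reshape(Bp * Np, Dp))  # patch <-> region text
    return w_global * loss_global + w_temporal * loss_temporal + w_spatial * loss_spatial


# Toy usage with random embeddings.
B, T, N, D = 2, 8, 16, 512
loss = multitask_loss(torch.randn(B, D), torch.randn(B, D),
                      torch.randn(B, T, D), torch.randn(B, T, D),
                      torch.randn(B, N, D), torch.randn(B, N, D))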

Experiments

Downstream Performance


We present our downstream performance on three tasks: Online Action Detection, Online Video Instance Segmentation, and Video Question Answering. Our baseline is the original image encoder of SigLIP, i.e., without multitask training or temporal modeling. StreamFormer achieves significantly better performance than this baseline on all three downstream tasks.

Inference Efficiency


Equipped with a KV cache, StreamFormer achieves a significant reduction in inference latency and memory consumption compared to the bi-directional attention baseline. The KV cache mechanism allows the model to store and reuse previously computed key-value pairs during inference, enabling efficient processing of streaming video data.
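A minimal sketch of this mechanism for a single temporal attention layer is shown below (illustrative only, not the released code): each newly arriving frame computes its query against the cached keys and values of all previous frames, so past frames are never reprocessed.

# KV-cache sketch for causal temporal attention at inference time. Single-head
# attention for brevity; cache shapes and the per-patch batching are assumptions.
import torch
import torch.nn as nn


class CachedTemporalAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5
        self.k_cache, self.v_cache = None, None    # grow along the time axis

    @torch.no_grad()
    def step(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (B*N, 1, D) tokens of the newly arrived frame.
        q = self.q(frame_tokens)
        k_new, v_new = self.k(frame_tokens), self.v(frame_tokens)
        if self.k_cache is None:
            self.k_cache, self.v_cache = k_new, v_new
        else:
            self.k_cache = torch.cat([self.k_cache, k_new], dim=1)
            self.v_cache = torch.cat([self.v_cache, v_new], dim=1)
        attn = (q @ self.k_cache.transpose(-2, -1)) * self.scale   # (B*N, 1, t)
        attn = attn.softmax(dim=-1)
        return attn @ self.v_cache                                  # (B*N, 1, D)


layer = CachedTemporalAttention(768)
for t in range(8):                                  # frames arriving one by one
    new_frame = torch.randn(196, 1, 768)            # (patches, 1, dim), batch size 1
    out = layer.step(new_frame)                     # past frames are read from the cache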

Data Efficiency


With our proposed multitask training strategy, we achieve competitive performance compared to the video-text contrastive baseline. We randomly select 1M video-text pairs from WebVid-10M, which is comparable in scale to our pre-training data. The models trained on WebVid-1M exhibit relatively low performance, possibly due to insufficient pre-training data for video-text contrastive learning. In comparison, our approach outperforms the WebVid-1M model even when using only 0.1M pre-training samples, significantly reducing the training cost.

BibTeX

@misc{yan2025learning,
  title={Learning Streaming Video Representation via Multitask Training},
  author={Yibin Yan and Jilan Xu and Shangzhe Di and Yikun Liu and Yudi Shi and Qirui Chen and Zeqian Li and Yifei Huang and Weidi Xie},
  year={2025},
  eprint={2504.20041},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}