Learning Streaming Video Representation via Multitask Training

Yibin Yan*, Jilan Xu*, Shangzhe Di, Yikun Liu, Yudi Shi, Qirui Chen, Zeqian Li, Yifei Huang, Weidi Xie
School of Artificial Intelligence, Shanghai Jiao Tong University
Shanghai Innovation Institute, Shanghai AI Laboratory, Fudan University
Technical Report

Abstract

Understanding continuous video streams plays a fundamental role in real-time applications including embodied AI and autonomous driving. Unlike offline video understanding, streaming video understanding requires the ability to process video streams frame by frame, preserve historical information, and make low-latency decisions. To address these challenges, our main contributions are three-fold. (i) We develop a novel streaming video backbone, termed StreamFormer, by incorporating causal temporal attention into a pre-trained vision transformer. This enables efficient streaming video processing while maintaining the image representation capability. (ii) To train StreamFormer, we propose to unify diverse spatial-temporal video understanding tasks within a multitask visual-language alignment framework. Hence, StreamFormer learns global semantics, temporal dynamics, and fine-grained spatial relationships simultaneously. (iii) We conduct extensive experiments on online action detection, online video instance segmentation, and video question answering. StreamFormer achieves competitive results while maintaining efficiency, demonstrating its potential for real-time applications.
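As a rough illustration of the backbone modification described above, the snippet below sketches a causal temporal attention layer applied on top of per-frame patch tokens from a pre-trained vision transformer. This is a minimal sketch, not the released implementation: the (batch, frames, patches, dim) tensor layout, the pre-norm residual design, and the head count are illustrative assumptions.

# Minimal sketch (not the official implementation): a causal temporal attention
# layer operating on per-frame patch tokens from a pre-trained ViT.
# Assumed tensor layout: x has shape (B, T, N, D) = (batch, frames, patches, dim).
import torch
import torch.nn as nn


class CausalTemporalAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, N, D = x.shape
        # Attend over time independently at each spatial location.
        h = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        # Upper-triangular mask so frame t only attends to frames <= t.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        out, _ = self.attn(self.norm(h), self.norm(h), self.norm(h), attn_mask=mask)
        out = (h + out).reshape(B, N, T, D).permute(0, 2, 1, 3)
        return out


frames = torch.randn(2, 8, 196, 768)           # e.g. 8 frames of 14x14 patch tokens
temporal_layer = CausalTemporalAttention(768)
streaming_feats = temporal_layer(frames)       # causal along the time axis
print(streaming_feats.shape)                   # torch.Size([2, 8, 196, 768])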

Streaming Video Representation

Teaser


StreamFormer learns streaming video representations at various granularities through multitask training, making it applicable to diverse downstream tasks such as Online Action Detection, Online Video Instance Segmentation, and Video Question Answering.

Method

Overall Structure


Overall framework of StreamFormer. Left: StreamFormer is trained under a unified visual-language alignment framework, enabling simultaneous understanding of global semantics, temporal dynamics, and fine-grained spatial relationships. Right: each level utilizes features of a different granularity: (i) the last-frame feature for the global level, (ii) per-frame features for the temporal level, and (iii) per-frame, per-patch features for the spatial level.
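For concreteness, the following minimal sketch shows how the three feature granularities could be sliced from the streaming token features. The (batch, frames, patches, dim) layout and the mean pooling over patches are assumptions for illustration; the actual model may use a different pooling or a dedicated summary token.

# Sketch (assumed tensor layout, not the official code): deriving the three
# feature granularities from streaming token features of shape (B, T, N, D).
import torch

tokens = torch.randn(2, 8, 196, 768)    # (batch, frames, patches, dim)

# (i) Global level: the last frame summarizes the clip seen so far, since causal
#     temporal attention lets it aggregate information from all previous frames.
global_feat = tokens[:, -1].mean(dim=1)   # (B, D)

# (ii) Temporal level: one feature per frame, e.g. mean-pooled over patches.
temporal_feat = tokens.mean(dim=2)        # (B, T, D)

# (iii) Spatial level: per-frame, per-patch features for dense prediction.
spatial_feat = tokens                     # (B, T, N, D)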

Multitask Training


Specifically, we optimize the backbone with three complementary objectives:
(i) Global-level Tasks: The objective is to learn the global semantics of a video clip, encompassing both the primary action and the scene.
(ii) Temporal-level Tasks: The objective here is to develop fine-grained discriminative capabilities along the temporal dimension, enabling the model to perform tasks such as per-frame action understanding and perceiving events that occur across frames.
(iii) Spatial-level Tasks: The aim is to learn the fine-grained spatial relationships within a video clip, which involves understanding the interactions between different objects across all video frames. A simplified sketch of the combined objective is given below.
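The sketch below illustrates one way the three objectives could be combined into a single training loss, using a symmetric InfoNCE for visual-language alignment at each granularity. The loss choice, tensor shapes, and the weights w_global / w_temporal / w_spatial are hypothetical; the actual task-specific losses in the paper (e.g., for segmentation) may differ.

# Hypothetical multitask objective: one alignment loss per granularity,
# summed with illustrative weights (not the paper's exact formulation).
import torch
import torch.nn.functional as F


def info_nce(v: torch.Tensor, t: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss between matched visual/text embeddings of shape (B, D)."""
    v, t = F.normalize(v, dim=-1), F.normalize(t, dim=-1)
    logits = v @ t.t() / temperature
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


def multitask_loss(global_v, global_t, frame_v, frame_t, patch_v, patch_t,
                   w_global=1.0, w_temporal=1.0, w_spatial=1.0):
    B, T, D = frame_v.shape
    Bp, Np, Dp = patch_v.shape
    loss_global = info_nce(global_v, global_t)                                  # clip <-> caption
    loss_temporal = info_nce(frame_v.reshape(B * T, D), frame_t.reshape(B * T, D))   # frame <-> frame-level text
    loss_spatial = info_nce(patch_v.reshape(Bp * Np, Dp), patch_t.reshape(Bp * Np, Dp))  # patch <-> region text
    return w_global * loss_global + w_temporal * loss_temporal + w_spatial * loss_spatial


# Toy usage with random embeddings.
B, T, N, D = 2, 8, 16, 512
loss = multitask_loss(torch.randn(B, D), torch.randn(B, D),
                      torch.randn(B, T, D), torch.randn(B, T, D),
                      torch.randn(B, N, D), torch.randn(B, N, D))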

Experiments

Downstream Performance


We present our downstream performance on three tasks: Online Action Detection, Online Video Instance Segmentation, and Video Question Answering. Our baseline is the original image encoder of SigLIP, i.e., without multitask training or temporal modeling. StreamFormer achieves significantly better performance than this baseline on all three downstream tasks.

Inference Efficiency


Equipped with a KV cache, StreamFormer achieves a significant reduction in inference latency and memory consumption compared to the bi-directional attention baseline. The KV cache mechanism allows the model to store and reuse previously computed key-value pairs during inference, enabling efficient processing of streaming video data.
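A minimal sketch of this mechanism for a single temporal attention layer is shown below (illustrative only, not the released code): each newly arriving frame computes its query against the cached keys and values of all previous frames, so past frames are never reprocessed.

# KV-cache sketch for causal temporal attention at inference time. Single-head
# attention for brevity; cache shapes and the per-patch batching are assumptions.
import torch
import torch.nn as nn


class CachedTemporalAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5
        self.k_cache, self.v_cache = None, None    # grow along the time axis

    @torch.no_grad()
    def step(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (B*N, 1, D) tokens of the newly arrived frame.
        q = self.q(frame_tokens)
        k_new, v_new = self.k(frame_tokens), self.v(frame_tokens)
        if self.k_cache is None:
            self.k_cache, self.v_cache = k_new, v_new
        else:
            self.k_cache = torch.cat([self.k_cache, k_new], dim=1)
            self.v_cache = torch.cat([self.v_cache, v_new], dim=1)
        attn = (q @ self.k_cache.transpose(-2, -1)) * self.scale   # (B*N, 1, t)
        attn = attn.softmax(dim=-1)
        return attn @ self.v_cache                                  # (B*N, 1, D)


layer = CachedTemporalAttention(768)
for t in range(8):                                  # frames arriving one by one
    new_frame = torch.randn(196, 1, 768)            # (patches, 1, dim), batch size 1
    out = layer.step(new_frame)                     # past frames are read from the cache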

Data Efficiency


With our proposed multitask training strategy, we achieve competitive performance compared to the video-text contrastive baseline. We randomly select 1M video-text pairs from WebVid-10M, which is comparable in scale to our pre-training data. The models trained on WebVid-1M exhibit relatively low performance, possibly due to insufficient pre-training data for video-text contrastive learning. In comparison, our approach outperforms the WebVid-1M model even when using only 0.1M pre-training samples, significantly reducing the training cost.

BibTeX

@misc{yan2025learning,
  title={Learning Streaming Video Representation via Multitask Training},
  author={Yibin Yan and Jilan Xu and Shangzhe Di and Yikun Liu and Yudi Shi and Qirui Chen and Zeqian Li and Yifei Huang and Weidi Xie},
  year={2025},
  eprint={2504.20041},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}