Technical Report

OmniStream

Mastering Perception, Reconstruction and Action in Continuous Streams

School of Artificial Intelligence, SJTU · Shanghai Innovation Institute · VGG, Oxford
* Joint first author
OmniStream Teaser

Left: OmniStream supports a wide spectrum of tasks, including 2D/3D perception, vision-language understanding, and embodied robotic manipulation.
Right: The frozen features of our single backbone achieve highly competitive or superior performance compared to leading domain-specific experts.


Abstract

Modern visual agents require representations that are general, causal, and physically structured to operate in real-time streaming environments. However, current vision foundation models remain fragmented, specializing narrowly in image semantic perception, offline temporal modeling, or spatial geometry. This paper introduces OmniStream, a unified streaming visual backbone that effectively perceives, reconstructs, and acts from diverse visual inputs. By incorporating causal spatiotemporal attention and 3D rotary positional embeddings (3D-RoPE), our model supports efficient, frame-by-frame online processing of video streams via a persistent KV-cache. We pre-train OmniStream using a synergistic multi-task framework coupling static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment on 29 datasets. Extensive evaluations show that, even with a strictly frozen backbone, OmniStream achieves consistently competitive performance with specialized experts across image and video probing, streaming geometric reconstruction, complex video and spatial reasoning, as well as robotic manipulation (unseen during training). Rather than pursuing benchmark-specific dominance, our work demonstrates the viability of training a single, versatile vision backbone that generalizes across semantic, spatial, and temporal reasoning: a meaningful step toward general-purpose visual understanding for interactive and embodied agents.

๐Ÿ—๏ธ

Unified Architecture

Causal spatiotemporal attention + 3D-RoPE turn a pre-trained image ViT into an efficient online streaming backbone with KV-cache support.

🎯

Multi-task Pre-training

Synergistic training across 29 datasets spanning self-supervised learning, geometric reconstruction, and vision-language alignment (~200M frames).

🧊

Frozen Backbone Transfer

A single frozen backbone achieves competitive results across perception, 3D reconstruction, VLM reasoning, and robotic manipulation.




Method

OmniStream is built upon a pre-trained DINOv3 ViT-L, extended with two key modifications: (i) causal spatiotemporal attention that enforces strict temporal causality and enables efficient frame-by-frame inference via a persistent KV-cache; (ii) 3D rotary positional embeddings (3D-RoPE) that extend 2D spatial RoPE to a consistent spatiotemporal relative encoding.
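To make the positional scheme concrete, here is a minimal NumPy sketch of 3D-RoPE under one common design assumption: the channel dimension is split into three equal groups, each rotated by the token's frame index t, row y, and column x, respectively. The paper's exact channel allocation may differ; `rope_1d` and `rope_3d` are illustrative names, not the actual implementation.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard 1D rotary embedding applied to the last dim of x.

    x: (..., d) with d even; pos: scalar position of the token.
    Each channel pair (i, i + d/2) is rotated by pos * freq_i.
    """
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair frequencies
    angles = np.asarray(pos)[..., None] * freqs    # (..., half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def rope_3d(x, t, y, xpos):
    """3D-RoPE sketch: split channels into three equal groups and
    rotate each group by the token's (t, y, x) coordinate, giving a
    relative spatiotemporal encoding inside dot-product attention."""
    d = x.shape[-1]
    assert d % 6 == 0, "need d divisible by 6 (3 groups, each even)"
    g = d // 3
    return np.concatenate([
        rope_1d(x[..., :g], t),          # temporal group
        rope_1d(x[..., g:2 * g], y),     # vertical group
        rope_1d(x[..., 2 * g:], xpos),   # horizontal group
    ], axis=-1)
```

Because each channel pair is rotated orthogonally, token norms are preserved and query-key dot products depend only on coordinate differences, which is what makes the encoding relative and lets it extrapolate beyond the training horizon.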

OmniStream Overall Framework
Overall framework of OmniStream. Equipped with 3D-RoPE and causal spatiotemporal attention, our unified backbone is trained via a multi-task framework that couples static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment.

Causal Spatiotemporal Attention

A naive Transformer that attends over all frames violates causality and is inefficient for streaming. We apply spatiotemporal self-attention with a causal temporal mask, so tokens at time t may only attend to tokens from times ≤ t. This enables streaming inference: when a new frame arrives, we compute its queries and reuse cached keys/values from past frames, achieving online processing without recomputing attention over the full stream.

Causal Attention Mechanism
Causal attention mechanism. Tokens at frame t attend only to tokens from frames ≤ t, enabling efficient streaming inference via KV-cache.
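The masking rule can be sketched in a few lines. This assumes full bidirectional attention among tokens of the same frame and causal attention to all tokens of earlier frames (a block-causal pattern); it is an illustrative reconstruction, not the exact implementation.

```python
import numpy as np

def block_causal_mask(num_frames, tokens_per_frame):
    """Boolean attention mask for causal spatiotemporal attention.

    mask[i, j] is True when query token i may attend to key token j.
    Tokens attend bidirectionally within their own frame and causally
    to every token of earlier frames.
    """
    n = num_frames * tokens_per_frame
    frame_of = np.arange(n) // tokens_per_frame  # frame index per token
    return frame_of[:, None] >= frame_of[None, :]
```

Scores at masked positions are set to negative infinity before the softmax, so future frames contribute zero attention weight.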

Streaming Inference Efficiency

Thanks to our causal temporal attention mechanism, OmniStream natively supports KV-caching during inference. This allows the model to process incoming video streams frame-by-frame with O(T) temporal complexity per step, avoiding redundant re-computation. Furthermore, 3D-RoPE enables zero-shot length extrapolation (e.g., 110 frames) far beyond the training horizon of T=16 frames.

KV-Cache Inference Efficiency
Streaming inference with KV-cache. OmniStream achieves significant reduction in inference latency and memory consumption via KV-cache reuse.
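A toy single-head example of the streaming loop, with hypothetical shapes: each incoming frame's keys and values are appended to a cache, and only the new frame's queries attend over the cache. Under the block-causal mask this reproduces full-clip attention exactly, while per-step cost grows linearly with the stream length.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def stream_attention(frames_q, frames_k, frames_v):
    """Process frames one at a time, reusing cached keys/values.

    frames_*: sequences of (tokens_per_frame, d) arrays, one per frame.
    Each frame's queries attend over the cache plus the current frame,
    so nothing from earlier frames is ever recomputed.
    """
    k_cache, v_cache, outputs = [], [], []
    for q, k, v in zip(frames_q, frames_k, frames_v):
        k_cache.append(k)             # cache current frame's keys/values
        v_cache.append(v)
        K = np.concatenate(k_cache)   # (tokens_so_far, d)
        V = np.concatenate(v_cache)
        attn = softmax(q @ K.T / np.sqrt(q.shape[-1]))
        outputs.append(attn @ V)
    return np.concatenate(outputs)
```

Equivalence with offline block-causal attention is easy to check numerically, which is the property that lets a causally trained model run online without approximation.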

Unified Multi-task Training

We train OmniStream with three complementary objectives that jointly encourage representations that are temporally coherent, geometrically grounded, and language-aligned:

๐Ÿ“ Static & Temporal SSL

Student-teacher distillation that unifies image and causal video modeling through global and patch-level objectives.

๐ŸŒ Geometric Reconstruction

Feed-forward geometric heads to inject explicit 3D constraints, ensuring features encode physical scene structure.

💬 Vision-Language Alignment

A lightweight language decoder trained on captioning, OCR, and grounding aligns visual tokens with linguistic concepts.
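As a sketch of how such heterogeneous objectives can share one training loop, here is a hypothetical weighted combination; the actual loss names and weights are not specified on this page and are illustrative only.

```python
# Hypothetical loss weights; the paper's actual weighting is not given here.
LOSS_WEIGHTS = {"ssl": 1.0, "geometry": 0.5, "language": 0.5}

def total_loss(losses, weights=LOSS_WEIGHTS):
    """Combine per-objective losses into one scalar for backprop.

    losses: dict mapping objective name -> scalar loss value.
    Objectives absent from a batch (e.g. no 3D labels available) are
    simply omitted from the dict, which lets datasets with different
    annotations share a single training loop.
    """
    return sum(weights[name] * value for name, value in losses.items())
```

Skipping missing objectives per batch is one common way to mix 2D, 3D/4D, and vision-language data sources without padding every sample with every label type.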


Experiments

We evaluate OmniStream across five domains with a strictly frozen backbone: image probing, video probing, streaming geometric reconstruction, visual backbone for VLMs, and visual backbone for VLA policies. The model is pre-trained on ~200M frames from 29 datasets, covering 2D image and video datasets, 3D/4D datasets, and vision-language datasets (more details in the paper).

Holistic Evaluation

| Benchmark | Metric | Ours | DINOv3-L | V-JEPA2-L | CUT3R | LLaVA-Video | OpenVLA |
|---|---|---|---|---|---|---|---|
| ImageNet (cls) | ACC@1 ↑ | 84.7 | 86.7 | - | - | - | - |
| NYUv2 (depth) | RMSE ↓ | 0.377 | 0.377 | - | - | - | - |
| ADE20K (seg) | mIoU ↑ | 49.1 | 51.5 | - | - | - | - |
| SSv2 (act) | ACC@1 ↑ | 68.5 | 54.0 | 73.7 | - | - | - |
| K400 (act) | ACC@1 ↑ | 85.7 | 83.6 | 85.1 | - | - | - |
| DAVIS'17 (vos) | J&F ↑ | 71.6 | 73.2 | 44.2 | - | - | - |
| Sintel (depth) | absRel ↓ | 0.314 | - | - | 0.421 | - | - |
| BONN (depth) | absRel ↓ | 0.072 | - | - | 0.078 | - | - |
| ScanNet (pose) | ATE ↓ | 0.076 | - | - | 0.099 | - | - |
| VideoMME (vqa) | Acc. ↑ | 60.7 | - | - | - | 61.8 | - |
| VideoMMMU (vqa) | Acc. ↑ | 40.0 | - | - | - | 38.7 | - |
| PerceptionTest (vqa) | Acc. ↑ | 68.9 | - | - | - | 67.6 | - |
| EgoSchema (vqa) | Acc. ↑ | 60.9 | - | - | - | 57.3 | - |
| VSI-Bench (vqa) | Acc. ↑ | 70.6 | - | - | - | 35.6 | - |
| CALVIN (mani) | Avg. Len ↑ | 3.89 | - | - | - | - | 2.55 |
| Simpler-Bridge (mani) | SR% ↑ | 45.8 | - | - | - | - | 53.7 |

Holistic evaluation. We compare OmniStream across 5 domains with a frozen backbone. "-" indicates a specialized baseline is not natively applicable to the given task.

Image Probing: OmniStream preserves robust spatial discrimination comparable to the image specialist DINOv3, achieving competitive results on dense prediction tasks (NYUv2 and ADE20K), confirming that multi-task training does not compromise fine-grained spatial priors.
Video Probing: OmniStream significantly outperforms DINOv3 on motion-intensive tasks (SSv2: 68.5% vs. 54.0%). Unlike typical video backbones that sacrifice spatial quality (e.g., V-JEPA2 at 44.2 J&F on DAVIS), OmniStream maintains strong dense spatiotemporal tracking (71.6 J&F), bridging the gap between static and dynamic perception.
3D Geometric Reconstruction: OmniStream achieves highly competitive or superior results against specialized online 3D models on depth estimation (Sintel, BONN) and camera pose estimation (TUM, ScanNet), while natively supporting KV-cache streaming inference.
VLM Reasoning: Our frozen OmniStream, aligned with an LLM (Qwen2.5-7B), achieves state-of-the-art on VSI-Bench (70.6%), surpassing even geometry-aware baselines without any auxiliary geometric modules, validating that unified streaming pre-training naturally engenders rich spatial understanding.
VLA Embodied Control: Building upon the VLM, we attach a lightweight MLP action decoder to adapt OmniStream into a VLA policy. It is the first frozen visual encoder to demonstrate effective zero-shot transfer to robotic manipulation benchmarks (CALVIN: 3.89), bridging perception and action by explicitly encoding 3D geometry and temporal dynamics during pre-training.
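The frozen-backbone protocol can be illustrated with a minimal linear probe on extracted features. A closed-form ridge regression stands in here for whatever probe head the paper actually trains; the function names and data are hypothetical.

```python
import numpy as np

def fit_linear_probe(features, labels, l2=1e-3):
    """Closed-form ridge-regression probe on frozen features.

    features: (n, d) activations from the frozen backbone;
    labels: (n,) integer class ids, turned into one-hot targets.
    Returns a (d, num_classes) weight matrix; the backbone itself
    receives no gradient updates.
    """
    n, d = features.shape
    num_classes = labels.max() + 1
    targets = np.eye(num_classes)[labels]          # one-hot (n, C)
    gram = features.T @ features + l2 * np.eye(d)  # regularized Gram matrix
    return np.linalg.solve(gram, features.T @ targets)

def probe_accuracy(weights, features, labels):
    """Top-1 accuracy of the probe's argmax predictions."""
    preds = (features @ weights).argmax(axis=1)
    return (preds == labels).mean()
```

Because only the probe's weights are fitted, scores of this kind measure what the frozen representation already encodes rather than what fine-tuning could recover.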

Ablation Study

We investigate the contribution of each pre-training objective. The results demonstrate that our unified multi-task formulation is not merely a concatenation of independent losses, but a synergistic framework where semantic, dynamic, and geometric objectives mutually reinforce the backbone.

| Method | SSv2 ↑ | DAVIS ↑ | ImageNet ↑ | NYUv2 ↓ | ADE20K ↑ | VSI-Bench ↑ | VideoMME ↑ | CALVIN ↑ |
|---|---|---|---|---|---|---|---|---|
| OmniStream (Full) | 69.3 | 71.6 | 85.2 | 0.379 | 49.6 | 57.3 | 54.1 | 3.80 |
| w/o VideoSSL | 63.0 | 67.7 | 85.4 | 0.420 | 47.2 | 57.9 | 55.8 | 3.42 |
| w/o 3D Geometry | 68.4 | 69.7 | 85.0 | 0.471 | 42.3 | 52.5 | 53.8 | 3.34 |
| w/o Captioning | 67.4 | 71.0 | 84.4 | 0.395 | 46.9 | 44.9 | 45.0 | 2.38 |

Ablation study. Each pre-training objective contributes uniquely: VideoSSL drives temporal dynamics, 3D Geometry is a prerequisite for spatial intelligence and embodied control, and Captioning is critical for VLM integration.

w/o VideoSSL: Omitting video data from self-supervised learning severely degrades dynamic perception (SSv2: 69.3→63.0, DAVIS: 71.6→67.7) and embodied control (CALVIN: 3.80→3.42), confirming its necessity for capturing temporal motions and dynamics.
w/o 3D Geometry: Disabling geometric reconstruction collapses spatial perception (NYUv2 RMSE: 0.379→0.471, ADE20K drops 7.3 mIoU) and causes sharp declines in spatial intelligence (VSI-Bench drops 4.8%) and embodied control (CALVIN drops 0.46), validating that explicit 3D priors are prerequisites for Embodied AI.
w/o Captioning: While pure vision probing remains stable, omitting vision-language alignment causes catastrophic failures in VLM integration (VideoMME: 54.1→45.0, VSI-Bench: 57.3→44.9) and devastates VLA performance (CALVIN: 3.80→2.38), highlighting that early language alignment within the backbone is critical to bridging the semantic gap.

BibTeX

@article{yan2026omnistream,
  title   = {OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams},
  author  = {Yibin Yan and Jilan Xu and Shangzhe Di and Haoning Wu and Weidi Xie},
  journal = {arXiv preprint arXiv:2603.12265},
  year    = {2026},
  url     = {https://arxiv.org/abs/2603.12265}
}