Technical Report

OmniStream

Mastering Perception, Reconstruction and Action in Continuous Streams

School of Artificial Intelligence, SJTU · Shanghai Innovation Institute · VGG, Oxford
* Joint first author
OmniStream Teaser

Left: OmniStream supports a wide spectrum of tasks, including 2D/3D perception, vision-language understanding, and embodied robotic manipulation.
Right: The frozen features of our single backbone achieve highly competitive or superior performance compared to leading domain-specific experts.


Abstract

Modern visual agents require representations that are general, causal, and physically structured to operate in real-time streaming environments. However, current vision foundation models remain fragmented, specializing narrowly in image semantic perception, offline temporal modeling, or spatial geometry. This paper introduces OmniStream, a unified streaming visual backbone that effectively perceives, reconstructs, and acts from diverse visual inputs. By incorporating causal spatiotemporal attention and 3D rotary positional embeddings (3D-RoPE), our model supports efficient, frame-by-frame online processing of video streams via a persistent KV-cache. We pre-train OmniStream using a synergistic multi-task framework coupling static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment on 29 datasets. Extensive evaluations show that, even with a strictly frozen backbone, OmniStream achieves consistently competitive performance with specialized experts across image and video probing, streaming geometric reconstruction, complex video and spatial reasoning, as well as robotic manipulation (unseen during training). Rather than pursuing benchmark-specific dominance, our work demonstrates the viability of training a single, versatile vision backbone that generalizes across semantic, spatial, and temporal reasoning: a meaningful step toward general-purpose visual understanding for interactive and embodied agents.

๐Ÿ—๏ธ

Unified Architecture

Causal spatiotemporal attention + 3D-RoPE turn a pre-trained image ViT into an efficient online streaming backbone with KV-cache support.

🎯

Multi-task Pre-training

Synergistic training across 29 datasets spanning self-supervised learning, geometric reconstruction, and vision-language alignment (~200M frames).

🧊

Frozen Backbone Transfer

A single frozen backbone achieves competitive results across perception, 3D reconstruction, VLM reasoning, and robotic manipulation.




Method

OmniStream is built upon a pre-trained DINOv3 ViT-L, extended with two key modifications: (i) causal spatiotemporal attention that enforces strict temporal causality and enables efficient frame-by-frame inference via a persistent KV-cache; (ii) 3D rotary positional embeddings (3D-RoPE) that extend 2D spatial RoPE to a consistent spatiotemporal relative encoding.
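To make the positional scheme concrete, here is a minimal NumPy sketch of 3D-RoPE under one common design assumption: the channel dimension is split into three equal groups, each rotated by the token's frame index t, row y, and column x, respectively. The paper's exact channel allocation may differ; `rope_1d` and `rope_3d` are illustrative names, not the actual implementation.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard 1D rotary embedding applied to the last dim of x.

    x: (..., d) with d even; pos: scalar position of the token.
    Each channel pair (i, i + d/2) is rotated by pos * freq_i.
    """
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair frequencies
    angles = np.asarray(pos)[..., None] * freqs    # (..., half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def rope_3d(x, t, y, xpos):
    """3D-RoPE sketch: split channels into three equal groups and
    rotate each group by the token's (t, y, x) coordinate, giving a
    relative spatiotemporal encoding inside dot-product attention."""
    d = x.shape[-1]
    assert d % 6 == 0, "need d divisible by 6 (3 groups, each even)"
    g = d // 3
    return np.concatenate([
        rope_1d(x[..., :g], t),          # temporal group
        rope_1d(x[..., g:2 * g], y),     # vertical group
        rope_1d(x[..., 2 * g:], xpos),   # horizontal group
    ], axis=-1)
```

Because each channel pair is rotated orthogonally, token norms are preserved and query-key dot products depend only on coordinate differences, which is what makes the encoding relative and lets it extrapolate beyond the training horizon.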

OmniStream Overall Framework
Overall framework of OmniStream. Equipped with 3D-RoPE and causal spatiotemporal attention, our unified backbone is trained via a multi-task framework that couples static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment.

Causal Spatiotemporal Attention

A naive Transformer that attends over all frames violates causality and is inefficient for streaming. We apply spatiotemporal self-attention with a causal temporal mask, so tokens at time t may only attend to tokens from times ≤ t. This enables streaming inference: when a new frame arrives, we compute its queries and reuse cached keys/values from past frames, achieving online processing without recomputing attention over the full stream.

Causal Attention Mechanism
Causal attention mechanism. Tokens at frame t attend only to tokens from frames ≤ t, enabling efficient streaming inference via KV-cache.
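The masking rule can be sketched in a few lines. This assumes full bidirectional attention among tokens of the same frame and causal attention to all tokens of earlier frames (a block-causal pattern); it is an illustrative reconstruction, not the exact implementation.

```python
import numpy as np

def block_causal_mask(num_frames, tokens_per_frame):
    """Boolean attention mask for causal spatiotemporal attention.

    mask[i, j] is True when query token i may attend to key token j.
    Tokens attend bidirectionally within their own frame and causally
    to every token of earlier frames.
    """
    n = num_frames * tokens_per_frame
    frame_of = np.arange(n) // tokens_per_frame  # frame index per token
    return frame_of[:, None] >= frame_of[None, :]
```

Scores at masked positions are set to negative infinity before the softmax, so future frames contribute zero attention weight.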

Streaming Inference Efficiency

Thanks to our causal temporal attention mechanism, OmniStream natively supports KV-caching during inference. This allows the model to process incoming video streams frame-by-frame with O(T) temporal complexity per step, avoiding redundant re-computation. Furthermore, 3D-RoPE enables zero-shot length extrapolation (e.g., 110 frames) far beyond the training horizon of T=16 frames.

KV-Cache Inference Efficiency
Streaming inference with KV-cache. OmniStream achieves significant reduction in inference latency and memory consumption via KV-cache reuse.
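A toy single-head example of the streaming loop, with hypothetical shapes: each incoming frame's keys and values are appended to a cache, and only the new frame's queries attend over the cache. Under the block-causal mask this reproduces full-clip attention exactly, while per-step cost grows linearly with the stream length.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def stream_attention(frames_q, frames_k, frames_v):
    """Process frames one at a time, reusing cached keys/values.

    frames_*: sequences of (tokens_per_frame, d) arrays, one per frame.
    Each frame's queries attend over the cache plus the current frame,
    so nothing from earlier frames is ever recomputed.
    """
    k_cache, v_cache, outputs = [], [], []
    for q, k, v in zip(frames_q, frames_k, frames_v):
        k_cache.append(k)             # cache current frame's keys/values
        v_cache.append(v)
        K = np.concatenate(k_cache)   # (tokens_so_far, d)
        V = np.concatenate(v_cache)
        attn = softmax(q @ K.T / np.sqrt(q.shape[-1]))
        outputs.append(attn @ V)
    return np.concatenate(outputs)
```

Equivalence with offline block-causal attention is easy to check numerically, which is the property that lets a causally trained model run online without approximation.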

Unified Multi-task Training

We train OmniStream with three complementary objectives that jointly encourage representations that are temporally coherent, geometrically grounded, and language-aligned:

๐Ÿ“ Static & Temporal SSL

Student-teacher distillation that unifies image and causal video modeling through global and patch-level objectives.

๐ŸŒ Geometric Reconstruction

Feed-forward geometric heads to inject explicit 3D constraints, ensuring features encode physical scene structure.

💬 Vision-Language Alignment

A lightweight language decoder trained on captioning, OCR, and grounding aligns visual tokens with linguistic concepts.
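As a sketch of how such heterogeneous objectives can share one training loop, here is a hypothetical weighted combination; the actual loss names and weights are not specified on this page and are illustrative only.

```python
# Hypothetical loss weights; the paper's actual weighting is not given here.
LOSS_WEIGHTS = {"ssl": 1.0, "geometry": 0.5, "language": 0.5}

def total_loss(losses, weights=LOSS_WEIGHTS):
    """Combine per-objective losses into one scalar for backprop.

    losses: dict mapping objective name -> scalar loss value.
    Objectives absent from a batch (e.g. no 3D labels available) are
    simply omitted from the dict, which lets datasets with different
    annotations share a single training loop.
    """
    return sum(weights[name] * value for name, value in losses.items())
```

Skipping missing objectives per batch is one common way to mix 2D, 3D/4D, and vision-language data sources without padding every sample with every label type.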


Experiments

We evaluate OmniStream across five domains with a strictly frozen backbone: image probing, video probing, streaming geometric reconstruction, visual backbone for VLMs, and visual backbone for VLA policies. The model is pre-trained on ~200M frames from 29 datasets, covering 2D image and video datasets, 3D/4D datasets, and vision-language datasets (more details in the paper).

Holistic Evaluation

| Benchmark | Metric | Ours | DINOv3-L | V-JEPA2-L | CUT3R | LLaVA-Video | OpenVLA |
|---|---|---|---|---|---|---|---|
| ImageNet (cls) | ACC@1 ↑ | 84.7 | 86.7 | - | - | - | - |
| NYUv2 (depth) | RMSE ↓ | 0.377 | 0.377 | - | - | - | - |
| ADE20K (seg) | mIoU ↑ | 49.1 | 51.5 | - | - | - | - |
| SSv2 (act) | ACC@1 ↑ | 68.5 | 54.0 | 73.7 | - | - | - |
| K400 (act) | ACC@1 ↑ | 85.7 | 83.6 | 85.1 | - | - | - |
| DAVIS'17 (vos) | J&F ↑ | 71.6 | 73.2 | 44.2 | - | - | - |
| Sintel (depth) | absRel ↓ | 0.314 | - | - | 0.421 | - | - |
| BONN (depth) | absRel ↓ | 0.072 | - | - | 0.078 | - | - |
| ScanNet (pose) | ATE ↓ | 0.076 | - | - | 0.099 | - | - |
| VideoMME (vqa) | Acc. ↑ | 60.7 | - | - | - | 61.8 | - |
| VideoMMMU (vqa) | Acc. ↑ | 40.0 | - | - | - | 38.7 | - |
| PerceptionTest (vqa) | Acc. ↑ | 68.9 | - | - | - | 67.6 | - |
| EgoSchema (vqa) | Acc. ↑ | 60.9 | - | - | - | 57.3 | - |
| VSI-Bench (vqa) | Acc. ↑ | 70.6 | - | - | - | 35.6 | - |
| CALVIN (mani) | Avg. Len ↑ | 3.89 | - | - | - | - | 2.55 |
| Simpler-Bridge (mani) | SR% ↑ | 45.8 | - | - | - | - | 53.7 |

Holistic evaluation. We compare OmniStream across 5 domains with a frozen backbone. "-" indicates a specialized baseline is not natively applicable to the given task.

Image Probing: OmniStream preserves robust spatial discrimination comparable to the image specialist DINOv3, achieving competitive results on dense prediction tasks (NYUv2 and ADE20K), confirming that multi-task training does not compromise fine-grained spatial priors.
Video Probing: OmniStream significantly outperforms DINOv3 on motion-intensive tasks (SSv2: 68.5% vs. 54.0%). Unlike typical video backbones that sacrifice spatial quality (e.g., V-JEPA2 at 44.2 J&F on DAVIS), OmniStream maintains strong dense spatiotemporal tracking (71.6 J&F), bridging the gap between static and dynamic perception.
3D Geometric Reconstruction: OmniStream achieves highly competitive or superior results against specialized online 3D models on depth estimation (Sintel, BONN) and camera pose estimation (TUM, ScanNet), while natively supporting KV-cache streaming inference.
VLM Reasoning: Our frozen OmniStream, aligned with an LLM (Qwen2.5-7B), achieves state-of-the-art on VSI-Bench (70.6%), surpassing even geometry-aware baselines without any auxiliary geometric modules, validating that unified streaming pre-training naturally engenders rich spatial understanding.
VLA Embodied Control: Building upon the VLM, we attach a lightweight MLP action decoder to adapt OmniStream into a VLA policy. It is the first frozen visual encoder to demonstrate effective zero-shot transfer to robotic manipulation benchmarks (CALVIN: 3.89), bridging perception and action by explicitly encoding 3D geometry and temporal dynamics during pre-training.
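The frozen-backbone protocol can be illustrated with a minimal linear probe on extracted features. A closed-form ridge regression stands in here for whatever probe head the paper actually trains; the function names and data are hypothetical.

```python
import numpy as np

def fit_linear_probe(features, labels, l2=1e-3):
    """Closed-form ridge-regression probe on frozen features.

    features: (n, d) activations from the frozen backbone;
    labels: (n,) integer class ids, turned into one-hot targets.
    Returns a (d, num_classes) weight matrix; the backbone itself
    receives no gradient updates.
    """
    n, d = features.shape
    num_classes = labels.max() + 1
    targets = np.eye(num_classes)[labels]          # one-hot (n, C)
    gram = features.T @ features + l2 * np.eye(d)  # regularized Gram matrix
    return np.linalg.solve(gram, features.T @ targets)

def probe_accuracy(weights, features, labels):
    """Top-1 accuracy of the probe's argmax predictions."""
    preds = (features @ weights).argmax(axis=1)
    return (preds == labels).mean()
```

Because only the probe's weights are fitted, scores of this kind measure what the frozen representation already encodes rather than what fine-tuning could recover.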

Ablation Study

We investigate the contribution of each pre-training objective. The results demonstrate that our unified multi-task formulation is not merely a concatenation of independent losses, but a synergistic framework where semantic, dynamic, and geometric objectives mutually reinforce the backbone.

| Method | SSv2 ↑ | DAVIS ↑ | ImageNet ↑ | NYUv2 ↓ | ADE20K ↑ | VSI-Bench ↑ | VideoMME ↑ | CALVIN ↑ |
|---|---|---|---|---|---|---|---|---|
| OmniStream (Full) | 69.3 | 71.6 | 85.2 | 0.379 | 49.6 | 57.3 | 54.1 | 3.80 |
| w/o VideoSSL | 63.0 | 67.7 | 85.4 | 0.420 | 47.2 | 57.9 | 55.8 | 3.42 |
| w/o 3D Geometry | 68.4 | 69.7 | 85.0 | 0.471 | 42.3 | 52.5 | 53.8 | 3.34 |
| w/o Captioning | 67.4 | 71.0 | 84.4 | 0.395 | 46.9 | 44.9 | 45.0 | 2.38 |

Ablation study. Each pre-training objective contributes uniquely: VideoSSL drives temporal dynamics, 3D Geometry is a prerequisite for spatial intelligence and embodied control, and Captioning is critical for VLM integration.

w/o VideoSSL: Omitting video data from self-supervised learning severely degrades dynamic perception (SSv2: 69.3→63.0, DAVIS: 71.6→67.7) and embodied control (CALVIN: 3.80→3.42), confirming its necessity for capturing temporal motions and dynamics.
w/o 3D Geometry: Disabling geometric reconstruction collapses spatial perception (NYUv2 RMSE: 0.379→0.471, ADE20K drops 7.3 mIoU) and causes sharp declines in spatial intelligence (VSI-Bench drops 4.8%) and embodied control (CALVIN drops 0.46), validating that explicit 3D priors are prerequisites for Embodied AI.
w/o Captioning: While pure vision probing remains stable, omitting vision-language alignment causes catastrophic failures in VLM integration (VideoMME: 54.1→45.0, VSI-Bench: 57.3→44.9) and devastates VLA performance (CALVIN: 3.80→2.38), highlighting that early language alignment within the backbone is critical to bridging the semantic gap.

BibTeX

@article{yan2026omnistream,
  title   = {OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams},
  author  = {Yibin Yan and Jilan Xu and Shangzhe Di and Haoning Wu and Weidi Xie},
  journal = {arXiv preprint arXiv:2603.12265},
  year    = {2026},
  url     = {https://arxiv.org/abs/2603.12265}
}