Seeing Across Views: Benchmarking Spatial
Reasoning of Vision-Language Models in Robotic Scenes

ICLR 2026
* Equal contribution · Work done during research internship at Microsoft Research · Corresponding author
1Tsinghua University    2Peking University    3Fudan University
4Microsoft Research Asia    5HKUST    6Zhejiang University

Abstract

Vision-language models (VLMs) are essential to Embodied AI, enabling robots to perceive, reason, and act in complex environments; they also serve as the foundation for recent Vision-Language-Action (VLA) models. Yet most evaluations of VLMs focus on single-view settings, leaving their ability to integrate multi-view information underexplored. At the same time, multi-camera setups are increasingly standard on robotic platforms, since they provide complementary perspectives that mitigate occlusion and depth ambiguity. Whether VLMs can effectively leverage such multi-view inputs for robotic reasoning therefore remains an open question.

To bridge this gap, we introduce MV-RoboBench, a benchmark specifically designed to evaluate the multi-view spatial reasoning capabilities of VLMs in robotic manipulation. MV-RoboBench consists of 1.7k manually curated QA items across eight subtasks, divided into two primary categories: spatial understanding and robotic execution. We evaluate a diverse set of existing VLMs, both open-source and closed-source, along with variants enhanced by Chain-of-Thought (CoT)-inspired prompting techniques.

The results show that state-of-the-art models remain far below human performance, underscoring the substantial challenges VLMs face in multi-view robotic perception. Additionally, our analysis uncovers two key findings: (i) spatial intelligence and robotic task execution are positively correlated in multi-view robotic scenarios; and (ii) strong performance on existing general-purpose single-view spatial understanding benchmarks does not reliably translate to success in the robotic spatial tasks assessed by our benchmark.

Figure 1: Representative multi-view QA instances from the eight tasks in MV-RoboBench, with spatial tasks shown on the left and robotic tasks on the right. For clarity, only simplified versions with ground-truth answers are presented here, omitting distractors.
Table 1: Comparison of spatial reasoning benchmarks

Existing spatial reasoning benchmarks mostly focus on single-view inputs or non-embodied settings, leaving the multi-view robotic manipulation regime underexplored. MV-RoboBench addresses this gap with synchronized multi-camera observations from real robot demonstrations, covering both spatial reasoning and robotic execution tasks. The benchmark includes 1.7K human-curated QA items across diverse tasks and environments, enabling a systematic evaluation of whether VLMs can integrate complementary viewpoints for decision-making.
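
To make the item format concrete, the sketch below shows what a single multi-view QA record could look like. The field names and example values are illustrative assumptions, not the released data schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical record layout for one MV-RoboBench item; field names and values
# are illustrative assumptions, not the released data schema.
@dataclass
class MultiViewQAItem:
    item_id: str
    category: str           # "spatial" or "robotic"
    subtask: str            # e.g. "cross_view_match", "action_planning"
    image_paths: List[str]  # synchronized frames of the same scene from different cameras
    question: str
    choices: List[str]      # multiple-choice options, including distractors
    answer: str             # ground-truth option label

example = MultiViewQAItem(
    item_id="mv-0001",
    category="spatial",
    subtask="cross_view_match",
    image_paths=["views/cam_front.png", "views/cam_side.png", "views/cam_wrist.png"],
    question="Which object in the front view is the one highlighted in the wrist view?",
    choices=["A. red mug", "B. blue bowl", "C. green block", "D. yellow cup"],
    answer="B",
)
```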

Leaderboard

Evaluation on MV-RoboBench under a unified zero-shot prompt; models are sorted by Avg. (higher is better). A sketch of the query and scoring setup follows the table.

Rank  Model  Avg.  Cross-View Match  Distance Judge  Viewpoint ID  3D Spatial Consist.  Action Plan.  Step Exec.  Trajectory Sel.  Affordance Rec.
(Cross-View Match through 3D Spatial Consist. are Spatial Tasks; Action Plan. through Affordance Rec. are Robotic Tasks.)
1 🥇 GPT-5 56.41 29.00 55.22 44.14 82.35 79.41 68.38 54.50 39.23
2 🥈 Gemini-2.5-pro 49.52 39.50 56.22 38.28 49.02 65.20 50.85 65.50 31.58
3 🥉 o4-mini 46.47 21.50 48.26 26.17 65.69 74.51 63.25 44.00 25.36
4 GPT-5-mini 38.28 22.00 49.25 25.78 72.55 66.18 48.72 47.00 27.75
5 GPT-5-nano 32.75 21.50 33.33 17.58 56.86 39.71 35.47 31.00 26.32
6 Claude-3.7-think 31.67 24.40 35.04 36.00 52.45 21.50 37.81 21.08 23.05
7 GPT-5-chat 31.63 30.00 42.79 31.64 4.90 36.76 40.17 38.00 27.75
8 GPT-4.1 30.90 26.00 43.28 32.03 6.37 29.90 31.62 41.50 28.23
9 Gemini-2.0-flash 28.94 28.00 32.84 21.48 7.35 32.84 29.91 52.50 20.57
10 GPT-4o 27.59 24.50 37.31 19.92 6.37 33.33 33.76 33.00 20.10
11 Gemini-2.5-flash 27.23 26.50 37.31 27.34 6.37 34.80 30.34 42.00 19.14
12 Llama-4-Maverick 26.11 14.00 42.79 17.58 5.88 37.75 37.18 36.00 20.10
13 Claude-3.7 25.47 18.00 35.32 20.31 6.86 36.76 29.06 34.50 22.97
14 Qwen2.5-vl-72b 24.29 20.50 34.83 27.34 4.90 28.43 27.35 29.00 24.88
15 GPT-4.1-mini 23.98 28.50 33.83 25.00 7.84 26.47 21.79 32.00 18.18
16 Claude-3.5 23.71 17.50 27.86 20.31 8.82 34.80 20.09 33.00 27.27
17 InternVL3-78b 23.25 19.00 28.86 23.83 11.76 29.90 29.06 26.50 21.05
18 GPT-4-turbo 22.91 19.00 13.43 19.92 7.84 41.67 31.20 20.00 27.27
19 InternVL3-38b 22.80 24.50 25.87 23.44 6.86 27.94 25.21 27.50 21.05
20 GPT-4o-mini 22.52 24.00 22.89 23.44 11.76 24.51 28.21 20.50 23.44
21 Qwen2.5-vl-32b 22.48 20.50 25.87 25.39 10.78 24.51 19.66 30.50 22.49
22 Llama-4-Scout 22.12 20.50 22.39 23.83 7.35 25.49 28.21 23.00 18.18
23 InternVL3-14b 21.47 19.50 22.39 24.61 10.78 23.53 23.50 24.00 23.44
24 InternVL3-8b 20.97 19.00 21.39 26.17 12.75 26.47 21.37 20.50 20.10
25 GPT-4.1-nano 20.85 17.50 25.37 18.75 14.71 22.55 22.22 20.00 17.22
26 Qwen2.5-vl-7b 20.84 20.50 20.40 20.70 8.82 22.55 26.07 24.50 22.49
27 Gemma-3-27b 20.55 21.50 23.88 20.31 9.31 20.10 23.08 29.00 17.22
28 Gemma-3-12b 20.49 18.00 26.37 20.31 9.80 22.55 20.94 25.50 20.57
29 Qwen2.5-vl-3b 20.37 17.50 21.89 22.66 17.65 17.16 17.95 22.00 25.84
30 Gemma-3-4b 19.79 21.00 22.89 21.09 11.76 17.65 16.67 25.50 22.01
31 Random Choice 19.71 17.80 19.40 20.00 19.07 19.41 21.54 20.65 19.81
32 InternVL3-2b 18.93 16.50 15.42 20.70 20.59 17.16 20.94 21.00 19.14
33 GPT-3.5-turbo 18.52 15.50 22.39 20.31 12.25 21.57 18.38 23.00 16.75
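
For reference, the sketch below illustrates one way the unified zero-shot evaluation and the Avg. column could be computed: each item's synchronized views are sent together with the question in a single query, per-subtask accuracy is exact match against the option label, and Avg. is assumed to be an item-count-weighted mean over the eight subtasks. The call targets an OpenAI-compatible chat API; the prompt wording, helper names, and model string are assumptions, not the benchmark's exact protocol.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint; model name below is illustrative

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def ask_multiview(question: str, choices: list[str], image_paths: list[str]) -> str:
    """Send all synchronized camera views plus the question in one zero-shot query."""
    content = [{
        "type": "text",
        "text": question + "\n" + "\n".join(choices) + "\nAnswer with the option letter only.",
    }]
    for path in image_paths:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{encode_image(path)}"},
        })
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content.strip()

def subtask_accuracy(predictions: dict[str, str], answers: dict[str, str]) -> float:
    """Exact-match accuracy (%) over one subtask."""
    correct = sum(predictions[i] == answers[i] for i in answers)
    return 100.0 * correct / len(answers)

def overall_average(per_subtask_acc: dict[str, float], item_counts: dict[str, int]) -> float:
    """Assumption: Avg. weights each subtask by its number of items rather than uniformly."""
    total = sum(item_counts.values())
    return sum(per_subtask_acc[t] * item_counts[t] for t in item_counts) / total
```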
Figure 2: Construction pipeline of MV-RoboBench with three stages: data collection from synchronized multi-view demonstrations, template-guided QA generation, and human-in-the-loop quality review to refine and balance the final QA pool.
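
The middle stage of Figure 2, template-guided QA generation, can be pictured as instantiating question templates from per-scene annotations before human review. The template text, metadata fields, and helper below are hypothetical illustrations of that stage, not the pipeline's actual implementation.

```python
import random

# Hypothetical template for the Distance Judge subtask; the wording and the
# scene-metadata fields are illustrative, not the benchmark's actual templates.
DISTANCE_TEMPLATE = (
    "Across the given camera views, which object is closer to the {anchor}: "
    "{option_a} or {option_b}?"
)

def make_distance_item(scene_meta: dict) -> dict:
    """Instantiate one QA item from scene annotations; a human reviewer then
    vets and balances the generated items before they enter the final pool."""
    anchor = scene_meta["anchor_object"]
    a, b = random.sample(scene_meta["candidate_objects"], 2)
    closer = a if scene_meta["distance_to_anchor"][a] < scene_meta["distance_to_anchor"][b] else b
    return {
        "question": DISTANCE_TEMPLATE.format(anchor=anchor, option_a=a, option_b=b),
        "choices": [a, b],
        "answer": closer,
        "views": scene_meta["image_paths"],
    }
```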
Figure 3: MV-RoboBench data distribution across subtasks and sources (AgiWorld, BridgeV2), highlighting the balance between spatial and robotic domains.

From Perception to Action: Correlation and Transfer

We analyze two axes: the internal correlation between spatial and robotic reasoning, and the external generalization from single-view to multi-view spatial intelligence. The results highlight that spatial and robotic reasoning can align, but only for models with sufficient multi-view integration ability, and that strong single-view performance does not reliably transfer to embodied multi-view scenarios.
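
A simple way to quantify the first axis is to correlate each model's average spatial accuracy with its average robotic accuracy across the leaderboard. The snippet below sketches this with standard Pearson and Spearman statistics; it is our illustration, not necessarily the exact analysis behind Figure 4.

```python
from scipy.stats import pearsonr, spearmanr

def spatial_robotic_correlation(model_scores: dict[str, dict[str, float]]):
    """model_scores maps model name -> {"spatial": ..., "robotic": ...} average accuracies (%)."""
    spatial = [s["spatial"] for s in model_scores.values()]
    robotic = [s["robotic"] for s in model_scores.values()]
    return {
        "pearson": pearsonr(spatial, robotic),    # linear association
        "spearman": spearmanr(spatial, robotic),  # rank-order association
    }
```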

Figure 4: Spatial vs. robotic accuracy on MV-RoboBench. Models clustered near the lower-left operate close to random guessing, while reasoning-enhanced proprietary models show a clear upward trend across both axes.
Figure 5: Comparison of model accuracies on OmniSpatial versus MV-RoboBench, with the left plot for spatial subtasks and the right plot for robotic subtasks.

BibTeX

@article{feng2025seeing,
  title={Seeing across views: Benchmarking spatial reasoning of vision-language models in robotic scenes},
  author={Feng, Zhiyuan and Kang, Zhaolu and Wang, Qijie and Du, Zhiying and Yan, Jiongrui and Shi, Shubin and Yuan, Chengbo and Liang, Huizhi and Deng, Yu and Li, Qixiu and others},
  journal={arXiv preprint arXiv:2510.19400},
  year={2025}
}