Vision-language models (VLMs) are essential to Embodied AI, enabling robots to perceive, reason, and act in complex environments. They also serve as the foundation for the recent Vision-Language-Action (VLA) models. Yet most evaluations of VLMs focus on single-view settings, leaving their ability to integrate multi-view information underexplored. At the same time, multi-camera setups are increasingly standard in robotic platforms, as they provide complementary perspectives to mitigate occlusion and depth ambiguity. Whether VLMs can effectively leverage such multi-view inputs for robotic reasoning therefore remains an open question.
To bridge this gap, we introduce MV-RoboBench, a benchmark specifically designed to evaluate the multi-view spatial reasoning capabilities of VLMs in robotic manipulation. MV-RoboBench consists of 1.7k manually curated QA items across eight subtasks, divided into two primary categories: spatial understanding and robotic execution. We evaluate a diverse set of existing VLMs, including both open-source and closed-source models, along with enhanced versions incorporating Chain-of-Thought (CoT)-inspired techniques.
The results show that state-of-the-art models remain far below human performance, underscoring the substantial challenges VLMs face in multi-view robotic perception. Additionally, our analysis uncovers two key findings: (i) spatial intelligence and robotic task execution are positively correlated in multi-view robotic scenarios; and (ii) strong performance on existing general-purpose single-view spatial understanding benchmarks does not reliably translate to success in the robotic spatial tasks assessed by our benchmark.
Existing spatial reasoning benchmarks mostly focus on single-view inputs or non-embodied settings, leaving multi-view robotic manipulation underexplored. MV-RoboBench addresses this gap with synchronized multi-camera observations from real robot demonstrations, covering both spatial understanding and robotic execution tasks. The benchmark includes 1.7k human-curated QA items across diverse tasks and environments, enabling a systematic evaluation of whether VLMs can integrate complementary viewpoints for decision-making.
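To make the zero-shot protocol concrete, here is a minimal sketch of how one multi-view multiple-choice item could be scored: all synchronized camera frames are sent alongside the question and its options, and the predicted option letter is compared against the ground truth. The item schema, camera names, file paths, OpenAI-style client, and default model are illustrative assumptions, not the benchmark's released evaluation code.

```python
# Minimal sketch of scoring one multi-view multiple-choice item (assumptions:
# item schema, image paths, and model name are placeholders, not MV-RoboBench's
# released harness).
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def encode_image(path: str) -> str:
    """Base64-encode an image so it can be sent inline."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def answer_multiview_item(item: dict, model: str = "gpt-4o") -> str:
    """Send every camera view plus the question; return the predicted option letter."""
    content = [
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{encode_image(p)}"}}
        for p in item["views"]  # e.g., synchronized head / wrist / side frames
    ]
    options = "\n".join(f"{k}. {v}" for k, v in item["options"].items())
    content.append({
        "type": "text",
        "text": f"{item['question']}\n{options}\nAnswer with the option letter only.",
    })
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()[:1].upper()

# Hypothetical item; real MV-RoboBench items may use a different schema.
item = {
    "views": ["head_cam.jpg", "wrist_cam.jpg", "side_cam.jpg"],
    "question": "From which camera is the red block occluded by the robot arm?",
    "options": {"A": "head camera", "B": "wrist camera", "C": "side camera", "D": "none"},
    "answer": "C",
}
# correct = answer_multiview_item(item) == item["answer"]
```

Accuracy on a subtask would then be the fraction of items answered with the correct option letter, under whatever answer-parsing rule the evaluation harness adopts.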
Evaluation on MV-RoboBench under a unified zero-shot prompt. Sorted by Avg. (higher is better). Spatial tasks: Cross-View Match, Distance Judge, Viewpoint ID, 3D Spatial Consist. Robotic tasks: Action Plan., Step Exec., Trajectory Sel., Affordance Rec.

| Rank | Model | Avg. | Cross-View Match | Distance Judge | Viewpoint ID | 3D Spatial Consist. | Action Plan. | Step Exec. | Trajectory Sel. | Affordance Rec. |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 🥇 | GPT-5 | 56.41 | 29.00 | 55.22 | 44.14 | 82.35 | 79.41 | 68.38 | 54.50 | 39.23 |
| 2 🥈 | Gemini-2.5-pro | 49.52 | 39.50 | 56.22 | 38.28 | 49.02 | 65.20 | 50.85 | 65.50 | 31.58 |
| 3 🥉 | o4-mini | 46.47 | 21.50 | 48.26 | 26.17 | 65.69 | 74.51 | 63.25 | 44.00 | 25.36 |
| 4 | GPT-5-mini | 38.28 | 22.00 | 49.25 | 25.78 | 72.55 | 66.18 | 48.72 | 47.00 | 27.75 |
| 5 | GPT-5-nano | 32.75 | 21.50 | 33.33 | 17.58 | 56.86 | 39.71 | 35.47 | 31.00 | 26.32 |
| 6 | Claude-3.7-think | 31.67 | 24.40 | 35.04 | 36.00 | 52.45 | 21.50 | 37.81 | 21.08 | 23.05 |
| 7 | GPT-5-chat | 31.63 | 30.00 | 42.79 | 31.64 | 4.90 | 36.76 | 40.17 | 38.00 | 27.75 |
| 8 | GPT-4.1 | 30.90 | 26.00 | 43.28 | 32.03 | 6.37 | 29.90 | 31.62 | 41.50 | 28.23 |
| 9 | Gemini-2.0-flash | 28.94 | 28.00 | 32.84 | 21.48 | 7.35 | 32.84 | 29.91 | 52.50 | 20.57 |
| 10 | GPT-4o | 27.59 | 24.50 | 37.31 | 19.92 | 6.37 | 33.33 | 33.76 | 33.00 | 20.10 |
| 11 | Gemini-2.5-flash | 27.23 | 26.50 | 37.31 | 27.34 | 6.37 | 34.80 | 30.34 | 42.00 | 19.14 |
| 12 | Llama-4-Maverick | 26.11 | 14.00 | 42.79 | 17.58 | 5.88 | 37.75 | 37.18 | 36.00 | 20.10 |
| 13 | Claude-3.7 | 25.47 | 18.00 | 35.32 | 20.31 | 6.86 | 36.76 | 29.06 | 34.50 | 22.97 |
| 14 | Qwen2.5-vl-72b | 24.29 | 20.50 | 34.83 | 27.34 | 4.90 | 28.43 | 27.35 | 29.00 | 24.88 |
| 15 | GPT-4.1-mini | 23.98 | 28.50 | 33.83 | 25.00 | 7.84 | 26.47 | 21.79 | 32.00 | 18.18 |
| 16 | Claude-3.5 | 23.71 | 17.50 | 27.86 | 20.31 | 8.82 | 34.80 | 20.09 | 33.00 | 27.27 |
| 17 | InternVL3-78b | 23.25 | 19.00 | 28.86 | 23.83 | 11.76 | 29.90 | 29.06 | 26.50 | 21.05 |
| 18 | GPT-4-turbo | 22.91 | 19.00 | 13.43 | 19.92 | 7.84 | 41.67 | 31.20 | 20.00 | 27.27 |
| 19 | InternVL3-38b | 22.80 | 24.50 | 25.87 | 23.44 | 6.86 | 27.94 | 25.21 | 27.50 | 21.05 |
| 20 | GPT-4o-mini | 22.52 | 24.00 | 22.89 | 23.44 | 11.76 | 24.51 | 28.21 | 20.50 | 23.44 |
| 21 | Qwen2.5-vl-32b | 22.48 | 20.50 | 25.87 | 25.39 | 10.78 | 24.51 | 19.66 | 30.50 | 22.49 |
| 22 | Llama-4-Scout | 22.12 | 20.50 | 22.39 | 23.83 | 7.35 | 25.49 | 28.21 | 23.00 | 18.18 |
| 23 | InternVL3-14b | 21.47 | 19.50 | 22.39 | 24.61 | 10.78 | 23.53 | 23.50 | 24.00 | 23.44 |
| 24 | InternVL3-8b | 20.97 | 19.00 | 21.39 | 26.17 | 12.75 | 26.47 | 21.37 | 20.50 | 20.10 |
| 25 | GPT-4.1-nano | 20.85 | 17.50 | 25.37 | 18.75 | 14.71 | 22.55 | 22.22 | 20.00 | 17.22 |
| 26 | Qwen2.5-vl-7b | 20.84 | 20.50 | 20.40 | 20.70 | 8.82 | 22.55 | 26.07 | 24.50 | 22.49 |
| 27 | Gemma-3-27b | 20.55 | 21.50 | 23.88 | 20.31 | 9.31 | 20.10 | 23.08 | 29.00 | 17.22 |
| 28 | Gemma-3-12b | 20.49 | 18.00 | 26.37 | 20.31 | 9.80 | 22.55 | 20.94 | 25.50 | 20.57 |
| 29 | Qwen2.5-vl-3b | 20.37 | 17.50 | 21.89 | 22.66 | 17.65 | 17.16 | 17.95 | 22.00 | 25.84 |
| 30 | Gemma-3-4b | 19.79 | 21.00 | 22.89 | 21.09 | 11.76 | 17.65 | 16.67 | 25.50 | 22.01 |
| 31 | Random Choice | 19.71 | 17.80 | 19.40 | 20.00 | 19.07 | 19.41 | 21.54 | 20.65 | 19.81 |
| 32 | InternVL3-2b | 18.93 | 16.50 | 15.42 | 20.70 | 20.59 | 17.16 | 20.94 | 21.00 | 19.14 |
| 33 | GPT-3.5-turbo | 18.52 | 15.50 | 22.39 | 20.31 | 12.25 | 21.57 | 18.38 | 23.00 | 16.75 |
We analyze two axes: the internal correlation between spatial and robotic reasoning, and the external generalization from single-view to multi-view spatial intelligence. The results highlight that spatial and robotic reasoning can align, but only for models with sufficient multi-view integration ability, and that strong single-view performance does not reliably transfer to embodied multi-view scenarios.
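As a sketch of the first axis, one can correlate each model's mean score on the four spatial subtasks with its mean score on the four robotic subtasks across the leaderboard. The choice of Pearson correlation and of unweighted category means below is an assumption rather than the paper's analysis code, and the demo numbers are illustrative.

```python
# Minimal sketch (assumption: unweighted category means and a Pearson test) of
# checking whether spatial and robotic scores correlate across models.
from scipy.stats import pearsonr

SPATIAL = ["Cross-View Match", "Distance Judge", "Viewpoint ID", "3D Spatial Consist."]
ROBOTIC = ["Action Plan.", "Step Exec.", "Trajectory Sel.", "Affordance Rec."]

def category_mean(scores: dict, tasks: list) -> float:
    """Unweighted mean over a task group (weighting by item count is also plausible)."""
    return sum(scores[t] for t in tasks) / len(tasks)

def spatial_robotic_correlation(leaderboard: dict) -> tuple:
    """leaderboard maps model name -> {subtask name: accuracy}; returns (r, p-value)."""
    spatial_avgs = [category_mean(s, SPATIAL) for s in leaderboard.values()]
    robotic_avgs = [category_mean(s, ROBOTIC) for s in leaderboard.values()]
    return pearsonr(spatial_avgs, robotic_avgs)

# Toy example with three hypothetical models (numbers are illustrative only).
demo = {
    "model_a": dict(zip(SPATIAL + ROBOTIC, [29, 55, 44, 82, 79, 68, 55, 39])),
    "model_b": dict(zip(SPATIAL + ROBOTIC, [22, 49, 26, 73, 66, 49, 47, 28])),
    "model_c": dict(zip(SPATIAL + ROBOTIC, [18, 21, 26, 13, 26, 21, 21, 20])),
}
r, p = spatial_robotic_correlation(demo)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```

A clearly positive r over the full leaderboard is the kind of evidence behind finding (i) above.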
```bibtex
@article{feng2025seeing,
  title={Seeing across views: Benchmarking spatial reasoning of vision-language models in robotic scenes},
  author={Feng, Zhiyuan and Kang, Zhaolu and Wang, Qijie and Du, Zhiying and Yan, Jiongrui and Shi, Shubin and Yuan, Chengbo and Liang, Huizhi and Deng, Yu and Li, Qixiu and others},
  journal={arXiv preprint arXiv:2510.19400},
  year={2025}
}
```