Seeing Across Views: Benchmarking Spatial
Reasoning of Vision-Language Models in Robotic Scenes

ICLR 2026
* Equal contribution · Work done during research internship at Microsoft Research · Corresponding author
1Tsinghua University    2Peking University    3Fudan University
4Microsoft Research Asia    5HKUST    6Zhejiang University

Abstract

Vision-language models (VLMs) are essential to Embodied AI, enabling robots to perceive, reason, and act in complex environments; they also serve as the foundation for recent Vision-Language-Action (VLA) models. Yet most evaluations of VLMs focus on single-view settings, leaving their ability to integrate multi-view information underexplored. At the same time, multi-camera setups are increasingly standard on robotic platforms, since they provide complementary perspectives that mitigate occlusion and depth ambiguity. Whether VLMs can effectively leverage such multi-view inputs for robotic reasoning therefore remains an open question.

To bridge this gap, we introduce MV-RoboBench, a benchmark specifically designed to evaluate the multi-view spatial reasoning capabilities of VLMs in robotic manipulation. MV-RoboBench consists of 1.7k manually curated QA items across eight subtasks, divided into two primary categories: spatial understanding and robotic execution. We evaluate a diverse set of existing VLMs, both open-source and closed-source, along with variants enhanced by Chain-of-Thought (CoT)-inspired prompting techniques.

The results show that state-of-the-art models remain far below human performance, underscoring the substantial challenges VLMs face in multi-view robotic perception. Additionally, our analysis uncovers two key findings: (i) spatial intelligence and robotic task execution are positively correlated in multi-view robotic scenarios; and (ii) strong performance on existing general-purpose single-view spatial understanding benchmarks does not reliably translate to success in the robotic spatial tasks assessed by our benchmark.

Figure 1: Representative multi-view QA instances from the eight tasks in MV-RoboBench, with spatial tasks shown on the left and robotic tasks on the right. For clarity, only simplified versions with ground-truth answers are presented here, omitting distractors.
Table 1: Comparison of spatial reasoning benchmarks

Existing spatial reasoning benchmarks mostly focus on single-view inputs or non-embodied settings, leaving the multi-view robotic manipulation regime underexplored. MV-RoboBench addresses this gap with synchronized multi-camera observations from real robot demonstrations, covering both spatial reasoning and robotic execution tasks. The benchmark includes 1.7K human-curated QA items across diverse tasks and environments, enabling a systematic evaluation of whether VLMs can integrate complementary viewpoints for decision-making.
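
To make the item format concrete, the sketch below shows what a single multi-view QA record could look like. The field names and example values are illustrative assumptions, not the released data schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical record layout for one MV-RoboBench item; field names and values
# are illustrative assumptions, not the released data schema.
@dataclass
class MultiViewQAItem:
    item_id: str
    category: str           # "spatial" or "robotic"
    subtask: str            # e.g. "cross_view_match", "action_planning"
    image_paths: List[str]  # synchronized frames of the same scene from different cameras
    question: str
    choices: List[str]      # multiple-choice options, including distractors
    answer: str             # ground-truth option label

example = MultiViewQAItem(
    item_id="mv-0001",
    category="spatial",
    subtask="cross_view_match",
    image_paths=["views/cam_front.png", "views/cam_side.png", "views/cam_wrist.png"],
    question="Which object in the front view is the one highlighted in the wrist view?",
    choices=["A. red mug", "B. blue bowl", "C. green block", "D. yellow cup"],
    answer="B",
)
```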

Leaderboard

Evaluation on MV-RoboBench under a unified zero-shot prompt; models are sorted by Avg. (higher is better). A sketch of the query and scoring setup follows the table.

Rank  Model  Avg.  Cross-View Match  Distance Judge  Viewpoint ID  3D Spatial Consist.  Action Plan.  Step Exec.  Trajectory Sel.  Affordance Rec.
(Cross-View Match through 3D Spatial Consist. are Spatial Tasks; Action Plan. through Affordance Rec. are Robotic Tasks.)
1 🥇 GPT-5 56.41 29.00 55.22 44.14 82.35 79.41 68.38 54.50 39.23
2 🥈 Gemini-2.5-pro 49.52 39.50 56.22 38.28 49.02 65.20 50.85 65.50 31.58
3 🥉 o4-mini 46.47 21.50 48.26 26.17 65.69 74.51 63.25 44.00 25.36
4 GPT-5-mini 38.28 22.00 49.25 25.78 72.55 66.18 48.72 47.00 27.75
5 GPT-5-nano 32.75 21.50 33.33 17.58 56.86 39.71 35.47 31.00 26.32
6 Claude-3.7-think 31.67 24.40 35.04 36.00 52.45 21.50 37.81 21.08 23.05
7 GPT-5-chat 31.63 30.00 42.79 31.64 4.90 36.76 40.17 38.00 27.75
8 GPT-4.1 30.90 26.00 43.28 32.03 6.37 29.90 31.62 41.50 28.23
9 Gemini-2.0-flash 28.94 28.00 32.84 21.48 7.35 32.84 29.91 52.50 20.57
10 GPT-4o 27.59 24.50 37.31 19.92 6.37 33.33 33.76 33.00 20.10
11 Gemini-2.5-flash 27.23 26.50 37.31 27.34 6.37 34.80 30.34 42.00 19.14
12 Llama-4-Maverick 26.11 14.00 42.79 17.58 5.88 37.75 37.18 36.00 20.10
13 Claude-3.7 25.47 18.00 35.32 20.31 6.86 36.76 29.06 34.50 22.97
14 Qwen2.5-vl-72b 24.29 20.50 34.83 27.34 4.90 28.43 27.35 29.00 24.88
15 GPT-4.1-mini 23.98 28.50 33.83 25.00 7.84 26.47 21.79 32.00 18.18
16 Claude-3.5 23.71 17.50 27.86 20.31 8.82 34.80 20.09 33.00 27.27
17 InternVL3-78b 23.25 19.00 28.86 23.83 11.76 29.90 29.06 26.50 21.05
18 GPT-4-turbo 22.91 19.00 13.43 19.92 7.84 41.67 31.20 20.00 27.27
19 InternVL3-38b 22.80 24.50 25.87 23.44 6.86 27.94 25.21 27.50 21.05
20 GPT-4o-mini 22.52 24.00 22.89 23.44 11.76 24.51 28.21 20.50 23.44
21 Qwen2.5-vl-32b 22.48 20.50 25.87 25.39 10.78 24.51 19.66 30.50 22.49
22 Llama-4-Scout 22.12 20.50 22.39 23.83 7.35 25.49 28.21 23.00 18.18
23 InternVL3-14b 21.47 19.50 22.39 24.61 10.78 23.53 23.50 24.00 23.44
24 InternVL3-8b 20.97 19.00 21.39 26.17 12.75 26.47 21.37 20.50 20.10
25 GPT-4.1-nano 20.85 17.50 25.37 18.75 14.71 22.55 22.22 20.00 17.22
26 Qwen2.5-vl-7b 20.84 20.50 20.40 20.70 8.82 22.55 26.07 24.50 22.49
27 Gemma-3-27b 20.55 21.50 23.88 20.31 9.31 20.10 23.08 29.00 17.22
28 Gemma-3-12b 20.49 18.00 26.37 20.31 9.80 22.55 20.94 25.50 20.57
29 Qwen2.5-vl-3b 20.37 17.50 21.89 22.66 17.65 17.16 17.95 22.00 25.84
30 Gemma-3-4b 19.79 21.00 22.89 21.09 11.76 17.65 16.67 25.50 22.01
31 Random Choice 19.71 17.80 19.40 20.00 19.07 19.41 21.54 20.65 19.81
32 InternVL3-2b 18.93 16.50 15.42 20.70 20.59 17.16 20.94 21.00 19.14
33 GPT-3.5-turbo 18.52 15.50 22.39 20.31 12.25 21.57 18.38 23.00 16.75
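
For reference, the sketch below illustrates one way the unified zero-shot evaluation and the Avg. column could be computed: each item's synchronized views are sent together with the question in a single query, per-subtask accuracy is exact match against the option label, and Avg. is assumed to be an item-count-weighted mean over the eight subtasks. The call targets an OpenAI-compatible chat API; the prompt wording, helper names, and model string are assumptions, not the benchmark's exact protocol.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint; model name below is illustrative

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def ask_multiview(question: str, choices: list[str], image_paths: list[str]) -> str:
    """Send all synchronized camera views plus the question in one zero-shot query."""
    content = [{
        "type": "text",
        "text": question + "\n" + "\n".join(choices) + "\nAnswer with the option letter only.",
    }]
    for path in image_paths:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{encode_image(path)}"},
        })
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content.strip()

def subtask_accuracy(predictions: dict[str, str], answers: dict[str, str]) -> float:
    """Exact-match accuracy (%) over one subtask."""
    correct = sum(predictions[i] == answers[i] for i in answers)
    return 100.0 * correct / len(answers)

def overall_average(per_subtask_acc: dict[str, float], item_counts: dict[str, int]) -> float:
    """Assumption: Avg. weights each subtask by its number of items rather than uniformly."""
    total = sum(item_counts.values())
    return sum(per_subtask_acc[t] * item_counts[t] for t in item_counts) / total
```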
Figure 2: Construction pipeline of MV-RoboBench with three stages: data collection from synchronized multi-view demonstrations, template-guided QA generation, and human-in-the-loop quality review to refine and balance the final QA pool.
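
The middle stage of Figure 2, template-guided QA generation, can be pictured as instantiating question templates from per-scene annotations before human review. The template text, metadata fields, and helper below are hypothetical illustrations of that stage, not the pipeline's actual implementation.

```python
import random

# Hypothetical template for the Distance Judge subtask; the wording and the
# scene-metadata fields are illustrative, not the benchmark's actual templates.
DISTANCE_TEMPLATE = (
    "Across the given camera views, which object is closer to the {anchor}: "
    "{option_a} or {option_b}?"
)

def make_distance_item(scene_meta: dict) -> dict:
    """Instantiate one QA item from scene annotations; a human reviewer then
    vets and balances the generated items before they enter the final pool."""
    anchor = scene_meta["anchor_object"]
    a, b = random.sample(scene_meta["candidate_objects"], 2)
    closer = a if scene_meta["distance_to_anchor"][a] < scene_meta["distance_to_anchor"][b] else b
    return {
        "question": DISTANCE_TEMPLATE.format(anchor=anchor, option_a=a, option_b=b),
        "choices": [a, b],
        "answer": closer,
        "views": scene_meta["image_paths"],
    }
```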
Figure 3: MV-RoboBench data distribution across subtasks and sources (AgiWorld, BridgeV2), highlighting the balance between spatial and robotic domains.

From Perception to Action: Correlation and Transfer

We analyze two axes: the internal correlation between spatial and robotic reasoning, and the external generalization from single-view to multi-view spatial intelligence. The results highlight that spatial and robotic reasoning can align, but only for models with sufficient multi-view integration ability, and that strong single-view performance does not reliably transfer to embodied multi-view scenarios.
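
A simple way to quantify the first axis is to correlate each model's average spatial accuracy with its average robotic accuracy across the leaderboard. The snippet below sketches this with standard Pearson and Spearman statistics; it is our illustration, not necessarily the exact analysis behind Figure 4.

```python
from scipy.stats import pearsonr, spearmanr

def spatial_robotic_correlation(model_scores: dict[str, dict[str, float]]):
    """model_scores maps model name -> {"spatial": ..., "robotic": ...} average accuracies (%)."""
    spatial = [s["spatial"] for s in model_scores.values()]
    robotic = [s["robotic"] for s in model_scores.values()]
    return {
        "pearson": pearsonr(spatial, robotic),    # linear association
        "spearman": spearmanr(spatial, robotic),  # rank-order association
    }
```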

Figure 4: Spatial vs. robotic accuracy on MV-RoboBench. Models clustered near the lower-left operate close to random guessing, while reasoning-enhanced proprietary models show a clear upward trend across both axes.
Figure 5: Comparison of model accuracies on OmniSpatial versus MV-RoboBench, with the left plot for spatial subtasks and the right plot for robotic subtasks.

BibTeX

@article{feng2025seeing,
  title={Seeing across views: Benchmarking spatial reasoning of vision-language models in robotic scenes},
  author={Feng, Zhiyuan and Kang, Zhaolu and Wang, Qijie and Du, Zhiying and Yan, Jiongrui and Shi, Shubin and Yuan, Chengbo and Liang, Huizhi and Deng, Yu and Li, Qixiu and others},
  journal={arXiv preprint arXiv:2510.19400},
  year={2025}
}