1 Benchmarks

1.1 Textual Benchmarks

arXiv 2026

Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions

🔻
🏛️ Beijing Institute of Technology 🏛️ BUCT
🏷️
Subject: Textual spatial reasoning benchmark for intrinsic LLM spatial intelligence evaluation
Problem:
  • Perception–reasoning entanglement in VLM benchmarks
  • Lack of high-fidelity text-only spatial tasks
  • Over-reliance on language priors/pattern matching
  • Weak evaluation of global consistency, mental mapping
💡
Idea: Convert visual scenes into coordinate-aware text to isolate and test symbolic spatial reasoning in LLMs.
🛠️
Solution:
  • SiT-Bench: 3.8K QA pairs across 5 categories and 17 subtasks for spatial cognition
  • Textual Encoding: Multi-view scenes → coordinate-aware descriptions enabling symbolic reasoning
  • Dual Construction: Image-based generation + vision-benchmark-to-text adaptation
  • R1 Filtering: Reasoning-based filtering removes trivial, inconsistent, leakage samples
  • Evaluation Protocol: Compare LLMs/VLMs with/without CoT to isolate reasoning ability
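A minimal sketch of the coordinate-aware textual encoding idea (the object list, coordinate frame, and rendering function below are illustrative assumptions, not the paper's actual pipeline):

```python
# Illustrative sketch: render a scene layout as coordinate-aware text so an
# LLM can be queried about spatial relations without any pixels.
def scene_to_text(objects):
    """objects: list of (name, (x, y, z)) positions in a shared world frame."""
    return " ".join(
        f"The {name} is at (x={x:.1f}, y={y:.1f}, z={z:.1f})."
        for name, (x, y, z) in objects
    )

scene = [("chair", (1.0, 0.0, 2.0)), ("table", (1.0, 0.0, 4.0))]
prompt = scene_to_text(scene) + " Which object is closer to the origin?"
print(prompt)
```

The key property is that every spatial fact is stated symbolically, so a correct answer requires coordinate arithmetic rather than visual perception.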
🏆
Results: Best model 59.46% vs. 74.42% human; large gap on global tasks (<10% on mental mapping). CoT significantly improves performance, revealing latent but underutilized spatial reasoning.
Example of SiT Benchmark

1.2 Text-to-Image Benchmarks

ICLR 2026

Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models

🔻
🏛️ AMAP - Alibaba Group 🏛️ Beijing University of Posts and Telecommunications
🏷️
Subject: Information-dense Spatial Benchmarking for Text-to-Image Spatial Intelligence
Problem:
  • Prompt Sparsity: short/sparse prompts → fail to probe complex spatial constraints
  • Metric Coarseness: yes/no or detection-based metrics → lack fine-grained diagnosis
  • Spatial Intelligence Gap: strong "what", weak "where/how/why"
  • Reasoning Blind Spot: comparison, occlusion, causality under-evaluated
💡
Idea: Use information-dense prompts + omni-dimensional QA to explicitly decompose and measure spatial intelligence across perception, reasoning, and interaction.
🛠️
Solution:
  • SpatialGenEval: 1,230 long prompts + 10 sub-domains; comprehensive spatial coverage
  • Omni-QA Evaluation: 10 multi-choice QAs per prompt; fine-grained capability diagnosis
  • Hierarchical Decomposition: foundation → perception → reasoning → interaction modeling
  • Leakage-Free Evaluation: image-only QA, “None” option prevents forced guessing
  • SpatialT2I Dataset: 15.4K pairs; rewritten dense prompts for training consistency
  • Data-Centric SFT: fine-tune T2I models to enhance spatial reasoning
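The fine-grained capability diagnosis could be aggregated roughly as follows (the tier labels and record format are assumptions for illustration; the benchmark's actual sub-domain taxonomy is richer):

```python
from collections import defaultdict

# Illustrative aggregation: group per-QA correctness by capability tier
# (e.g. perception / reasoning / interaction) to surface which spatial
# capability bottlenecks a given T2I model.
def diagnose(records):
    """records: iterable of (tier, is_correct) pairs over all prompts/QAs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for tier, correct in records:
        totals[tier] += 1
        hits[tier] += int(correct)
    return {tier: hits[tier] / totals[tier] for tier in totals}

scores = diagnose([("perception", True), ("perception", False), ("reasoning", True)])
# → {"perception": 0.5, "reasoning": 1.0}
```

Grouping by tier rather than reporting one overall score is what lets the benchmark localize the reasoning bottleneck mentioned in the results.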
🏆
Results: Spatial reasoning emerges as dominant bottleneck (~20–30% on key sub-tasks); SpatialT2I yields consistent gains (+4.2%–5.7%), validating data-centric improvement.
💭 Thoughts:
  • Need Bidirectional Evaluation: Current T2I benchmarks only test forward generation, but spatial intelligence should be bidirectional and reversible. Can a model truly understand spatial relations if it cannot consistently reconstruct them across generation and interpretation (T2I ↔ I2T)?
  • Cross-modal Spatial Consistency: Do multimodal models maintain a unified spatial representation when reasoning across image and text, or do they rely on modality-specific shortcuts?
  • Structure-aware Spatial Robustness: Can a model still perform correct spatial reasoning when specific spatial factors (e.g., position, occlusion) are selectively removed rather than randomly missing?
Samples of SpatialGenEval. T2I Generation → MLLMs as a judge evaluation.
Comparisons between SpatialGenEval and previous T2I Benchmarks. 'L' and 'S' denote long and short prompts.

1.3 Video-based Benchmarks

arXiv 2025

QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models

🔻
🏛️ Stanford University 🏛️ UST
🏷️
Subject: Quantitative Kinematic Benchmark for VLM Physical Reasoning Evaluation
Problem:
  • Qualitative Evaluation Bias: current benchmarks are VQA-style; insensitive to numerical precision.
  • Missing Kinematic Quantification: no explicit size/velocity/acceleration inference.
💡
Idea: Cast physical reasoning as prior-conditioned kinematic scaling with numerical error calibration.
🛠️
Solution:
  • QuantiPhy Benchmark: 3.3K video–text pairs; numeric GT for kinematics.
  • Kinematic Inference Task: single prior → infer remaining quantities via scaling.
  • MRA Metric: multi-threshold relative error aggregation.
  • Diagnostic Probing Suite: prior-only, counterfactual, CoT analyses.
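One common formulation of a multi-threshold mean relative accuracy (MRA) metric; the specific threshold ladder below is an assumption for illustration, not necessarily the paper's exact choice:

```python
# Sketch of an MRA-style metric: a numeric prediction counts as correct at
# threshold theta if its relative error is below 1 - theta; the final score
# averages correctness over a ladder of increasingly strict thresholds.
def mra(pred, gt, thresholds=(0.50, 0.55, 0.60, 0.65, 0.70,
                              0.75, 0.80, 0.85, 0.90, 0.95)):
    rel_err = abs(pred - gt) / abs(gt)
    return sum(rel_err < (1 - t) for t in thresholds) / len(thresholds)

print(mra(8.0, 10.0))  # rel_err = 0.2 → passes thresholds 0.50–0.75 → 0.6
```

Averaging over thresholds rewards predictions in proportion to their numerical closeness, unlike a single pass/fail tolerance.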
🏆
Results: Best VLM achieves 53.1 MRA vs. human 55.6; performance drops of 70–80% under counterfactual priors reveal reliance on memorized priors over input-faithful quantitative reasoning.
Examples from QuantiPhy Benchmark