# Spatial Intelligence in Large Models: Benchmarks, Mechanisms, and Reasoning
## 1 Benchmark

### 1.1 Textual Benchmarks

**arXiv 2026 | Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions**

🏛️ Beijing Institute of Technology, BUCT | 👤 Author | 📄 Paper | 💻 Code

🏷️ **Subject:** Text-only spatial reasoning benchmark for evaluating the intrinsic spatial intelligence of LLMs.

❓ **Problem:**
- Perception and reasoning are entangled in existing VLM benchmarks
- Lack of high-fidelity text-only spatial tasks
- Over-reliance on language priors and pattern matching
- Weak evaluation of global consistency and mental mapping

💡 **Idea:** Convert visual scenes into coordinate-aware text to isolate and test symbolic spatial reasoning in LLMs.

🛠️ **Solution:**
- **SiT-Bench:** 3.8K QA pairs across 5 categories and 17 subtasks probing spatial cognition
- **Textual Encoding:** Multi-view scenes are converted into coordinate-aware descriptions that support symbolic reasoning (see the first sketch after this entry)
- **Dual Construction:** Image-based generation plus adaptation of existing vision benchmarks into text
- **R1 Filtering:** Reasoning-based filtering removes trivial, inconsistent, and leakage-prone samples
- **Evaluation Protocol:** Compare LLMs/VLMs with and without CoT prompting to isolate reasoning ability (see the second sketch after this entry)

🏆 **Results:** The best model scores 59.46% vs. 74.42% for humans, with the largest gap on global tasks (below 10% on mental mapping). CoT significantly improves performance, validating latent but underutilized spatial reasoning.

*Figure: Example of SiT Benchmark.*
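To make the **Textual Encoding** step concrete, here is a minimal sketch of how a scene with object coordinates could be rendered as coordinate-aware text and paired with a spatial question whose answer is checkable from the coordinates. Everything in it (the `SceneObject` class, the coordinate convention, the QA template) is a hypothetical illustration, not SiT-Bench's actual construction pipeline.

```python
# Minimal sketch of coordinate-aware textual encoding. The object names,
# coordinate convention (x = right, y = forward, z = up, in metres), and the
# QA template are hypothetical, not SiT-Bench's actual format.

from dataclasses import dataclass


@dataclass
class SceneObject:
    name: str
    x: float  # metres to the right of the reference viewpoint
    y: float  # metres in front of the reference viewpoint
    z: float  # metres above the floor


def encode_scene(objects: list[SceneObject]) -> str:
    """Render a scene as a coordinate-aware textual description."""
    lines = ["Scene (coordinates in metres, viewed from the origin):"]
    for obj in objects:
        lines.append(f"- {obj.name} at (x={obj.x:.1f}, y={obj.y:.1f}, z={obj.z:.1f})")
    return "\n".join(lines)


def relation_question(a: SceneObject, b: SceneObject) -> tuple[str, str]:
    """Build one left/right relation QA pair directly from the coordinates."""
    question = f"Is the {a.name} to the left or to the right of the {b.name}?"
    answer = "left" if a.x < b.x else "right"
    return question, answer


if __name__ == "__main__":
    scene = [SceneObject("lamp", -1.2, 2.0, 0.8), SceneObject("sofa", 0.5, 2.4, 0.4)]
    print(encode_scene(scene))
    q, a = relation_question(scene[0], scene[1])
    print(q, "->", a)  # -> "left": the lamp's x is smaller than the sofa's
```

Because the gold answer is derived from the same coordinates the model reads, this setup tests symbolic spatial reasoning without any visual perception in the loop.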
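Likewise, the with/without-CoT comparison in the **Evaluation Protocol** can be sketched as below. `query_model` is a hypothetical stand-in for whatever chat-completion API is used, and the prompt suffixes and answer matching are illustrative assumptions, not the benchmark's official scoring rules.

```python
# Sketch of the with/without-CoT comparison. `query_model`, the prompt
# suffixes, and the answer-matching rule are assumptions for illustration.

def query_model(model: str, prompt: str) -> str:
    """Hypothetical model call; replace with your provider's API."""
    raise NotImplementedError


COT_SUFFIX = "\nThink step by step, then give the final answer on the last line."
DIRECT_SUFFIX = "\nAnswer directly with the final answer only."


def evaluate(model: str, items: list[dict], use_cot: bool) -> float:
    """Accuracy of `model` on QA items, with or without a CoT instruction."""
    correct = 0
    for item in items:
        prompt = item["question"] + (COT_SUFFIX if use_cot else DIRECT_SUFFIX)
        reply = query_model(model, prompt)
        # Match the gold answer against the last line of the reply.
        if item["answer"].lower() in reply.strip().splitlines()[-1].lower():
            correct += 1
    return correct / len(items)


# Usage: the gap between the two scores on the same items isolates how much
# explicit reasoning (rather than direct pattern matching) contributes.
# direct_acc = evaluate("some-llm", sit_bench_items, use_cot=False)
# cot_acc    = evaluate("some-llm", sit_bench_items, use_cot=True)
```

Running the same model on the same items under both settings is what lets the benchmark attribute the reported CoT gains to reasoning rather than to differences in the question set.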