1 Benchmarks

1.1 Textual Benchmarks

arXiv 2026

Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions

🔻
🏛️ Beijing Institute of Technology 🏛️ BUCT
🏷️
Subject: Textual spatial reasoning benchmark for intrinsic LLM spatial intelligence evaluation
Problem:
  • Perception–reasoning entanglement in VLM benchmarks
  • Lack of high-fidelity text-only spatial tasks
  • Over-reliance on language priors/pattern matching
  • Weak evaluation of global consistency, mental mapping
💡
Idea: Convert visual scenes into coordinate-aware text to isolate and test symbolic spatial reasoning in LLMs.
🛠️
Solution:
  • SiT-Bench: 3.8K QA pairs across 5 categories and 17 subtasks for spatial cognition
  • Textual Encoding: Multi-view scenes → coordinate-aware descriptions enabling symbolic reasoning
  • Dual Construction: Image-based generation + vision-benchmark-to-text adaptation
  • R1 Filtering: Reasoning-based filtering removes trivial, inconsistent, leakage samples
  • Evaluation Protocol: Compare LLMs/VLMs with/without CoT to isolate reasoning ability
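A minimal sketch of the coordinate-aware textual encoding idea (the object list, coordinate frame, and rendering function below are illustrative assumptions, not the paper's actual pipeline):

```python
# Illustrative sketch: render a scene layout as coordinate-aware text so an
# LLM can be queried about spatial relations without any pixels.
def scene_to_text(objects):
    """objects: list of (name, (x, y, z)) positions in a shared world frame."""
    return " ".join(
        f"The {name} is at (x={x:.1f}, y={y:.1f}, z={z:.1f})."
        for name, (x, y, z) in objects
    )

scene = [("chair", (1.0, 0.0, 2.0)), ("table", (1.0, 0.0, 4.0))]
prompt = scene_to_text(scene) + " Which object is closer to the origin?"
print(prompt)
```

The key property is that every spatial fact is stated symbolically, so a correct answer requires coordinate arithmetic rather than visual perception.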
🏆
Results: Best model 59.46% vs. 74.42% human; large gap on global tasks (<10% on mental mapping). CoT significantly improves performance, revealing latent but underutilized spatial reasoning.
Example of SiT Benchmark

1.2 Text-to-Image Benchmarks

ICLR 2026

Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models

🔻
🏛️ AMAP - Alibaba Group 🏛️ Beijing University of Posts and Telecommunications
🏷️
Subject: Information-dense Spatial Benchmarking for Text-to-Image Spatial Intelligence
Problem:
  • Prompt Sparsity: short/sparse prompts → fail to probe complex spatial constraints
  • Metric Coarseness: yes/no or detection-based metrics → lack fine-grained diagnosis
  • Spatial Intelligence Gap: strong "what", weak "where/how/why"
  • Reasoning Blind Spot: comparison, occlusion, causality under-evaluated
💡
Idea: Use information-dense prompts + omni-dimensional QA to explicitly decompose and measure spatial intelligence across perception, reasoning, and interaction.
🛠️
Solution:
  • SpatialGenEval: 1,230 long prompts + 10 sub-domains; comprehensive spatial coverage
  • Omni-QA Evaluation: 10 multi-choice QAs per prompt; fine-grained capability diagnosis
  • Hierarchical Decomposition: foundation → perception → reasoning → interaction modeling
  • Leakage-Free Evaluation: image-only QA, “None” option prevents forced guessing
  • SpatialT2I Dataset: 15.4K pairs; rewritten dense prompts for training consistency
  • Data-Centric SFT: fine-tune T2I models to enhance spatial reasoning
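The fine-grained capability diagnosis could be aggregated roughly as follows (the tier labels and record format are assumptions for illustration; the benchmark's actual sub-domain taxonomy is richer):

```python
from collections import defaultdict

# Illustrative aggregation: group per-QA correctness by capability tier
# (e.g. perception / reasoning / interaction) to surface which spatial
# capability bottlenecks a given T2I model.
def diagnose(records):
    """records: iterable of (tier, is_correct) pairs over all prompts/QAs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for tier, correct in records:
        totals[tier] += 1
        hits[tier] += int(correct)
    return {tier: hits[tier] / totals[tier] for tier in totals}

scores = diagnose([("perception", True), ("perception", False), ("reasoning", True)])
# → {"perception": 0.5, "reasoning": 1.0}
```

Grouping by tier rather than reporting one overall score is what lets the benchmark localize the reasoning bottleneck mentioned in the results.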
🏆
Results: Spatial reasoning emerges as dominant bottleneck (~20–30% on key sub-tasks); SpatialT2I yields consistent gains (+4.2%–5.7%), validating data-centric improvement.
💭 Thoughts:
  • Need Bidirectional Evaluation: Current T2I benchmarks only test forward generation, but spatial intelligence should be bidirectional and reversible. Can a model truly understand spatial relations if it cannot consistently reconstruct them across generation and interpretation (T2I ↔ I2T)?
  • Cross-modal Spatial Consistency: Do multimodal models maintain a unified spatial representation when reasoning across image and text, or do they rely on modality-specific shortcuts?
  • Structure-aware Spatial Robustness: Can a model still perform correct spatial reasoning when specific spatial factors (e.g., position, occlusion) are selectively removed rather than randomly missing?
Samples of SpatialGenEval. T2I Generation → MLLMs as a judge evaluation.
Comparisons between SpatialGenEval and previous T2I Benchmarks. 'L' and 'S' denote long and short prompts.

1.3 Video-based Benchmarks

arXiv 2025

QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models

🔻
🏛️ Stanford University 🏛️ UST
🏷️
Subject: Quantitative Kinematic Benchmark for VLM Physical Reasoning Evaluation
Problem:
  • Qualitative Evaluation Bias: current benchmarks are VQA-style; insensitive to numerical precision.
  • Missing Kinematic Quantification: no explicit size/velocity/acceleration inference.
💡
Idea: Cast physical reasoning as prior-conditioned kinematic scaling with numerical error calibration.
🛠️
Solution:
  • QuantiPhy Benchmark: 3.3K video–text pairs; numeric GT for kinematics.
  • Kinematic Inference Task: single prior → infer remaining quantities via scaling.
  • MRA Metric: multi-threshold relative error aggregation.
  • Diagnostic Probing Suite: prior-only, counterfactual, CoT analyses.
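One common formulation of a multi-threshold mean relative accuracy (MRA) metric; the specific threshold ladder below is an assumption for illustration, not necessarily the paper's exact choice:

```python
# Sketch of an MRA-style metric: a numeric prediction counts as correct at
# threshold theta if its relative error is below 1 - theta; the final score
# averages correctness over a ladder of increasingly strict thresholds.
def mra(pred, gt, thresholds=(0.50, 0.55, 0.60, 0.65, 0.70,
                              0.75, 0.80, 0.85, 0.90, 0.95)):
    rel_err = abs(pred - gt) / abs(gt)
    return sum(rel_err < (1 - t) for t in thresholds) / len(thresholds)

print(mra(8.0, 10.0))  # rel_err = 0.2 → passes thresholds 0.50–0.75 → 0.6
```

Averaging over thresholds rewards predictions in proportion to their numerical closeness, unlike a single pass/fail tolerance.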
🏆
Results: Best VLM achieves 53.1 MRA vs. human 55.6; performance drops of 70–80% under counterfactual priors reveal reliance on memorized priors over input-faithful quantitative reasoning.
Examples from QuantiPhy Benchmark