Evaluation Protocol: Compare LLMs/VLMs with and without CoT prompting to isolate latent reasoning ability
🏆
Results: Best model 59.46% vs. 74.42% human; large gap on global tasks (<10% on mapping). CoT significantly improves performance, validating latent but underutilized spatial reasoning.
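The with/without-CoT protocol above can be sketched as a paired evaluation over the same multiple-choice questions, differing only in the prompt suffix. This is a minimal illustration; `ask_model` is a hypothetical stand-in for any real LLM/VLM API call, and the question record format is assumed.

```python
# Minimal sketch of a paired CoT vs. direct-answer evaluation protocol.
# `ask_model` is a hypothetical placeholder for a real model API client.

def ask_model(prompt: str) -> str:
    """Hypothetical model call; replace with a real API client."""
    return "B"  # placeholder: always answers "B"

def evaluate(questions, use_cot: bool) -> float:
    """Accuracy over multiple-choice questions, with or without CoT."""
    correct = 0
    for q in questions:
        suffix = ("\nThink step by step, then answer with a single letter."
                  if use_cot else "\nAnswer with a single letter only.")
        # Take the final character as the answer letter (toy parsing).
        answer = ask_model(q["prompt"] + suffix).strip()[-1].upper()
        correct += answer == q["gold"]
    return correct / len(questions)

questions = [{"prompt": "Which object is left of the chair? (A) lamp (B) sofa",
              "gold": "B"}]
# The CoT contribution is the accuracy delta between the two conditions.
delta = evaluate(questions, use_cot=True) - evaluate(questions, use_cot=False)
```

Holding the question set fixed and varying only the prompt suffix is what lets the delta be attributed to reasoning elicitation rather than question difficulty.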
Idea: Use information-dense prompts + omni-dimensional QA to explicitly decompose and measure spatial intelligence across perception, reasoning, and interaction.
🛠️
Solution:
SpatialGenEval: 1,230 long prompts + 10 sub-domains; comprehensive spatial coverage
Omni-QA Evaluation: 10 multi-choice QAs per prompt; fine-grained capability diagnosis
Hierarchical Decomposition: foundation → perception → reasoning → interaction modeling
SpatialT2I Dataset: 15.4K pairs; rewritten dense prompts for training consistency
Data-Centric SFT: fine-tune T2I models to enhance spatial reasoning
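The Omni-QA design (10 multiple-choice QAs per prompt, grouped by sub-domain) implies a simple per-sub-domain accuracy aggregation for capability diagnosis. A minimal sketch, assuming a flat list of per-QA records (the record schema is illustrative, not the benchmark's actual format):

```python
from collections import defaultdict

def sub_domain_scores(records):
    """Aggregate per-QA correctness into per-sub-domain accuracy.

    Each record is assumed to look like {"sub_domain": str, "correct": bool},
    one per QA (10 QAs per prompt in SpatialGenEval)."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["sub_domain"]] += 1
        hits[r["sub_domain"]] += r["correct"]
    return {d: hits[d] / totals[d] for d in totals}

# Toy records for two sub-domains.
records = [
    {"sub_domain": "reasoning", "correct": True},
    {"sub_domain": "reasoning", "correct": False},
    {"sub_domain": "perception", "correct": True},
]
scores = sub_domain_scores(records)  # e.g. {'reasoning': 0.5, 'perception': 1.0}
```

Reporting accuracy per sub-domain rather than a single aggregate is what surfaces reasoning as the bottleneck in the results below.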
🏆
Results: Spatial reasoning emerges as the dominant bottleneck (~20–30% accuracy on key sub-tasks); SpatialT2I fine-tuning yields consistent gains (+4.2–5.7%), validating the data-centric improvement.
💭 Thoughts:
Need Bidirectional Evaluation: Current T2I benchmarks only test forward generation, but spatial intelligence should be bidirectional and reversible. Can a model truly understand spatial relations if it cannot consistently reconstruct them across generation and interpretation (T2I ↔ I2T)?
Cross-modal Spatial Consistency: Do multimodal models maintain a unified spatial representation when reasoning across image and text, or do they rely on modality-specific shortcuts?
Structure-aware Spatial Robustness: Can a model still perform correct spatial reasoning when specific spatial cues (e.g., position, occlusion) are selectively removed, rather than randomly dropped?
Samples from SpatialGenEval: T2I generation $\rightarrow$ MLLM-as-a-judge evaluation.
Comparisons between SpatialGenEval and previous T2I benchmarks. 'L' and 'S' denote long and short prompts, respectively.
Results: Best VLM achieves 53.1 MRA vs. 55.6 human; performance drops of 70–80% under counterfactual edits reveal a failure of input-faithful quantitative reasoning and a reliance on memorized priors.
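MRA here presumably means Mean Relative Accuracy for numerical answers: exact-match accuracy is relaxed to "within a relative error," averaged over a sweep of confidence thresholds. A sketch under that assumed definition (threshold sweep 0.50–0.95 is also an assumption; check the benchmark's official metric code):

```python
def mean_relative_accuracy(preds, golds, thresholds=None):
    """Assumed Mean Relative Accuracy: a prediction counts as correct at
    threshold theta if its relative error is below (1 - theta); the final
    score averages this accuracy over all thresholds."""
    if thresholds is None:
        thresholds = [0.5 + 0.05 * i for i in range(10)]  # 0.50 .. 0.95

    def acc(theta):
        ok = sum(abs(p - g) / abs(g) < 1 - theta for p, g in zip(preds, golds))
        return ok / len(golds)

    return sum(acc(t) for t in thresholds) / len(thresholds)

# A prediction with 10% relative error passes thresholds 0.50–0.85 (8 of 10).
score = mean_relative_accuracy([9.0], [10.0])  # 0.8
```

A threshold-averaged metric like this rewards being numerically close, which is exactly what the counterfactual drops probe: a model leaning on memorized priors stays close to typical values but far from the edited input.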