<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Spatial Intelligence on PaperMoon&#39;s blog</title>
    <link>https://milknocandy.github.io/tags/spatial-intelligence/</link>
    <description>Recent content in Spatial Intelligence on PaperMoon&#39;s blog</description>
    <generator>Hugo -- 0.154.3</generator>
    <language>en</language>
    <lastBuildDate>Sun, 05 Apr 2026 17:58:05 +0800</lastBuildDate>
    <atom:link href="https://milknocandy.github.io/tags/spatial-intelligence/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>When VLMs Become Cognitive Mimics, Not Physical Reasoners: A QuantiPhy Study</title>
      <link>https://milknocandy.github.io/posts/2026-03-23-quantiphy/</link>
      <pubDate>Mon, 23 Mar 2026 16:42:46 +0800</pubDate>
      <guid>https://milknocandy.github.io/posts/2026-03-23-quantiphy/</guid>
      <description>&lt;div class=&#34;paperbox&#34;&gt;
    &lt;div class=&#34;pb-item&#34;&gt;
        &lt;span class=&#34;pb-key&#34;&gt;TOPIC&lt;/span&gt;
        &lt;span class=&#34;pb-sep&#34;&gt;&lt;/span&gt;
        &lt;span class=&#34;pb-val&#34;&gt;Quantitative Physical Understanding&lt;/span&gt;
    &lt;/div&gt;
    &lt;div class=&#34;pb-item&#34;&gt;
        &lt;span class=&#34;pb-key&#34;&gt;WHY READ&lt;/span&gt;
        &lt;span class=&#34;pb-sep&#34;&gt;&lt;/span&gt;
        &lt;span class=&#34;pb-val&#34;&gt;Exposes that top VLMs guess physical quantities from memory (pre-trained world knowledge) rather than measure them from video, with rigorous tests to diagnose this failure.&lt;/span&gt;
    &lt;/div&gt;
    &lt;div class=&#34;pb-item&#34;&gt;
        &lt;span class=&#34;pb-key&#34;&gt;TAKEAWAY&lt;/span&gt;
        &lt;span class=&#34;pb-sep&#34;&gt;&lt;/span&gt;
        &lt;span class=&#34;pb-val&#34;&gt;Current VLMs are cognitive mimics, not physical reasoners, so build systems that arbitrate between perception and memory rather than forcing pure end-to-end inference (see the sketch below). (Context Learning, Agentic AI)&lt;/span&gt;
    &lt;/div&gt;
    &lt;div class=&#34;pb-links&#34;&gt;
        &lt;span class=&#34;pb-org&#34;&gt;Stanford University, UST&lt;/span&gt;
        &lt;div class=&#34;pb-link-group&#34;&gt;&lt;a href=&#34;https://arxiv.org/abs/2512.19526&#34; target=&#34;_blank&#34; class=&#34;pb-link&#34;&gt;📄 Paper&lt;/a&gt;&lt;a href=&#34;https://github.com/Paulineli/QuantiPhy&#34; target=&#34;_blank&#34; class=&#34;pb-link&#34;&gt;💻 Code&lt;/a&gt;&lt;a href=&#34;https://github.com/Paulineli/QuantiPhy&#34; target=&#34;_blank&#34; class=&#34;pb-link&#34;&gt;🌐 Project&lt;/a&gt;&lt;a href=&#34;https://github.com/Paulineli&#34; target=&#34;_blank&#34; class=&#34;pb-link&#34;&gt;👤 Author&lt;/a&gt;
        &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;
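&lt;p&gt;As a rough illustration of that takeaway, the sketch below arbitrates between a quantity measured from video and a quantity recalled from pre-trained priors. The &lt;code&gt;arbitrate&lt;/code&gt; helper, its confidence input, and the threshold are illustrative assumptions, not QuantiPhy&#39;s protocol.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Hypothetical perception-vs-memory arbitration (not the paper&#39;s method).
def arbitrate(perceived, prior, perception_confidence, threshold=0.6):
    &#39;&#39;&#39;Prefer the quantity measured from video when perception is trusted,
    otherwise fall back to the memorized prior.&#39;&#39;&#39;
    if perception_confidence &gt;= threshold:
        return perceived, &#39;perception&#39;
    return prior, &#39;prior&#39;

# Example: speed of a rolling ball (m/s) estimated by tracking vs. a typical value.
value, source = arbitrate(perceived=1.8, prior=1.4, perception_confidence=0.72)
print(f&#39;estimate={value} m/s (source: {source})&#39;)
&lt;/code&gt;&lt;/pre&gt;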

&lt;hr&gt;
&lt;h2 id=&#34;-1-motivation--problem&#34;&gt;🚀 1 Motivation &amp;amp; Problem&lt;/h2&gt;
&lt;p&gt;Humans understand the physical world through structured mathematical abstractions. From Isaac Newton’s formulation of universal gravitation inspired by a falling apple, to modern physics, quantitative laws enable precise reasoning about the dynamics of the real world. In contrast, although state-of-the-art AI systems demonstrate remarkable capabilities in mathematical reasoning, programming, and scientific writing, enabling artificial intelligence to &lt;u&gt;&lt;i&gt;ground its understanding in the physical world&lt;/i&gt;&lt;/u&gt; remains a fundamental and unresolved challenge. This limitation poses a critical barrier to deploying AI systems in real-world, embodied environments.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Spatial Intelligence in Large Models: Benchmarks, Mechanisms, and Reasoning</title>
      <link>https://milknocandy.github.io/posts/2026-03-19-si/</link>
      <pubDate>Thu, 19 Mar 2026 11:15:09 +0800</pubDate>
      <guid>https://milknocandy.github.io/posts/2026-03-19-si/</guid>
      <description>&lt;h2 id=&#34;1-benchmark&#34;&gt;1 Benchmark&lt;/h2&gt;
&lt;h3 id=&#34;11-textual-benchmarks&#34;&gt;1.1 Textual Benchmarks&lt;/h3&gt;
&lt;p&gt;&lt;details class=&#34;paper-details-wrapper&#34;&gt;
    &lt;summary class=&#34;paper-summary&#34;&gt;
        &lt;div class=&#34;summary-inner&#34;&gt;
            

            
            
            
            

            &lt;span class=&#34;s-venue-dynamic v-arxiv-2026&#34;&gt;
                &lt;svg viewBox=&#34;0 0 24 24&#34; fill=&#34;none&#34; stroke=&#34;currentColor&#34; stroke-width=&#34;2&#34;
                stroke-linecap=&#34;round&#34; stroke-linejoin=&#34;round&#34; class=&#34;v-icon&#34;&gt;
                &lt;path d=&#34;M14.5 2H6a2 2 0 0 0-2 2v16a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2V7.5L14.5 2z&#34;&gt;&lt;/path&gt;
                &lt;polyline points=&#34;14 2 14 8 20 8&#34;&gt;&lt;/polyline&gt;
            &lt;/svg&gt;
                &lt;span class=&#34;v-text&#34;&gt;Arxiv 2026&lt;/span&gt;
            &lt;/span&gt;

            &lt;p class=&#34;s-title&#34;&gt;Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions&lt;/p&gt;
            &lt;span class=&#34;s-toggle-icon&#34;&gt;🔻&lt;/span&gt;
        &lt;/div&gt;
    &lt;/summary&gt;

    &lt;div class=&#34;paper-card-expanded&#34;&gt;
        &lt;div class=&#34;expand-action-bar&#34;&gt;
            &lt;div class=&#34;org-outer-container&#34;&gt;
                
                &lt;div class=&#34;org-group&#34;&gt;
                    
                    
                      &lt;span class=&#34;org-tag&#34;&gt;🏛️
                        Beijing Institute of Technology&lt;/span&gt;
                    
                    
                      &lt;span class=&#34;org-tag&#34;&gt;🏛️
                        BUCT&lt;/span&gt;
                    
                    
                &lt;/div&gt;
                
            &lt;/div&gt;

            &lt;div class=&#34;action-btns-fixed&#34;&gt;
                &lt;a href=&#34;https://binisalegend.github.io/&#34; target=&#34;_blank&#34; class=&#34;act-btn&#34;&gt;👤 Author&lt;/a&gt;
                &lt;a href=&#34;https://arxiv.org/abs/2601.03590&#34; target=&#34;_blank&#34; class=&#34;act-btn&#34;&gt;📄 Paper&lt;/a&gt;
                &lt;a href=&#34;https://github.com/binisalegend/SiT-Bench&#34; target=&#34;_blank&#34; class=&#34;act-btn&#34;&gt;💻 Code&lt;/a&gt;
                
            &lt;/div&gt;
        &lt;/div&gt;&lt;div class=&#34;expand-grid&#34;&gt;&lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;🏷️&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Subject:&lt;/b&gt; Textual spatial reasoning benchmark for intrinsic LLM spatial intelligence evaluation&lt;/div&gt;
            &lt;/div&gt;
            &lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;❓&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Problem:&lt;/b&gt;
                    &lt;div class=&#34;ex-markdown-inner&#34;&gt; &lt;ul&gt;
&lt;li&gt;Perception–reasoning entanglement in VLM benchmarks&lt;/li&gt;
&lt;li&gt;Lack of high-fidelity text-only spatial tasks&lt;/li&gt;
&lt;li&gt;Over-reliance on language priors/pattern matching&lt;/li&gt;
&lt;li&gt;Weak evaluation of global consistency and mental mapping&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
                &lt;/div&gt;
            &lt;/div&gt;
            &lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;💡&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Idea:&lt;/b&gt; Convert visual scenes into &lt;mark&gt;coordinate-aware text&lt;/mark&gt; to isolate and test &lt;mark&gt;symbolic spatial reasoning&lt;/mark&gt; in LLMs (see the sketch after this card).&lt;/div&gt;
            &lt;/div&gt;

            &lt;div class=&#34;ex-row ex-sol-box&#34;&gt;
                &lt;span class=&#34;ex-icon&#34;&gt;🛠️&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;
                    &lt;b&gt;Solution:&lt;/b&gt;
                    &lt;div class=&#34;ex-markdown-inner&#34;&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SiT-Bench:&lt;/strong&gt; 3.8K QA across 5 categories, 17 subtasks for spatial cognition&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Textual Encoding:&lt;/strong&gt; Multi-view scenes → coordinate-aware descriptions enabling symbolic reasoning&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dual Construction:&lt;/strong&gt; Image-based generation + vision-benchmark-to-text adaptation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;R1 Filtering:&lt;/strong&gt; Reasoning-based filtering removes trivial, inconsistent, or leakage-prone samples&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Evaluation Protocol:&lt;/strong&gt; Compare LLMs/VLMs with/without CoT to isolate reasoning ability&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
                &lt;/div&gt;
            &lt;/div&gt;

            &lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;🏆&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Results:&lt;/b&gt; Best model reaches 59.46% vs. 74.42% for humans, with the largest gap on global tasks (below 10% on mapping). CoT significantly improves performance, indicating latent but underutilized spatial reasoning.&lt;/div&gt;
            &lt;/div&gt;

            

            
            
        &lt;/div&gt;
    &lt;/div&gt;
&lt;/details&gt;
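&lt;p&gt;A minimal sketch of the encoding idea above: serializing a scene&#39;s objects into coordinate-aware text that a text-only LLM can reason over. The object list, the metre-based coordinate convention, and the &lt;code&gt;scene_to_text&lt;/code&gt; helper are assumptions for illustration, not SiT-Bench&#39;s actual construction pipeline.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Hypothetical scene-to-text encoding for a text-only spatial question.
def scene_to_text(objects):
    &#39;&#39;&#39;objects: list of (name, (x, y, z)) tuples in a shared world frame (metres).&#39;&#39;&#39;
    lines = [&#39;Scene description (coordinates in metres, shared world frame):&#39;]
    for name, (x, y, z) in objects:
        lines.append(f&#39;- {name} is at (x={x:.2f}, y={y:.2f}, z={z:.2f}).&#39;)
    return &#39;\n&#39;.join(lines)

scene = [(&#39;sofa&#39;, (1.20, 0.00, 2.50)),
         (&#39;lamp&#39;, (2.80, 0.00, 2.40)),
         (&#39;table&#39;, (1.90, 0.00, 1.10))]
prompt = scene_to_text(scene) + &#39;\nQuestion: which object is closest to the sofa?&#39;
print(prompt)
&lt;/code&gt;&lt;/pre&gt;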

&lt;figure &gt;
    &lt;img src=&#34;1_Sample4SiT.png&#34; alt=&#34;Example of SiT Benchmark&#34; /&gt;&lt;figcaption&gt;
        &lt;span class=&#34;auto-fig-title&#34;&gt;Example of SiT Benchmark&lt;/span&gt;
    &lt;/figcaption&gt;&lt;/figure&gt;&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
