When VLMs Become Cognitive Mimics, Not Physical Reasoners: A QuantiPhy Study

TOPIC: Quantitative Physical Understanding
WHY READ: Exposes that top VLMs guess physical quantities from memory (pre-trained world knowledge) rather than measuring them from video, with rigorous tests to diagnose this failure.
TAKEAWAY: Current VLMs are cognitive mimics, not physical reasoners; build systems that arbitrate between perception and memory rather than forcing pure end-to-end inference. (Context Learning, Agentic AI)
🏛️ Stanford University, UST 📄 Paper 💻 Code 🌐 Project 👤 Author

🚀 1 Motivation & Problem

Humans understand the physical world through structured mathematical abstractions. From Isaac Newton's formulation of universal gravitation, inspired by a falling apple, to modern physics, quantitative laws enable precise reasoning about real-world dynamics. In contrast, although state-of-the-art AI systems demonstrate remarkable capabilities in mathematical reasoning, programming, and scientific writing, grounding artificial intelligence's understanding in the physical world remains a fundamental and unresolved challenge. This limitation poses a critical barrier to deploying AI systems in real-world, embodied environments. ...
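As a minimal sketch of the takeaway (not the paper's actual protocol), one way to probe whether a model measures from video or answers from memory is to query it twice, once with the clip and once with the clip withheld, and treat a near-identical numeric answer as a sign of memory guessing. The `query_vlm` function below is a hypothetical stand-in for a real VLM API; the numbers and tolerance are illustrative assumptions.

```python
# Hedged sketch: perception-vs-memory probe for a quantitative physical question.
# `query_vlm` is a hypothetical placeholder; swap in a real VLM call to use this.

def query_vlm(question: str, video_path: str | None) -> float:
    """Hypothetical VLM call returning a numeric physical quantity (e.g., speed in m/s)."""
    # Stub behavior: a memory-driven model returns a "typical" value whether or not it sees video.
    return 9.8 if video_path is None else 9.6

def perception_memory_gap(question: str, video_path: str, tol: float = 0.10) -> dict:
    """Compare the video-conditioned answer with a blind (no-video) answer.

    A small relative gap suggests the model is answering from priors
    instead of measuring from the video.
    """
    with_video = query_vlm(question, video_path)
    blind = query_vlm(question, None)
    rel_gap = abs(with_video - blind) / max(abs(blind), 1e-9)
    return {
        "with_video": with_video,
        "blind": blind,
        "relative_gap": rel_gap,
        "likely_memory_guess": rel_gap < tol,
    }

if __name__ == "__main__":
    report = perception_memory_gap(
        "What is the ball's speed just before impact, in m/s?",
        "clips/falling_ball.mp4",  # illustrative path
    )
    print(report)  # e.g. {'with_video': 9.6, 'blind': 9.8, 'relative_gap': 0.02, ...}
```

The same gap signal could also feed a simple arbitration rule: trust the video-conditioned answer only when it diverges meaningfully from the blind prior.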

Date: Mar. 23, 2026 | Total: 2336 words | Author: PaperMoon | Last Modified: Apr. 5, 2026

Spatial Intelligence in Large Models: Benchmarks, Mechanisms, and Reasoning

1 Benchmark

1.1 Textual Benchmarks

Arxiv 2026 | Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions 🔻
🏛️ Beijing Institute of Technology 🏛️ BUCT 👤 Author 📄 Paper 💻 Code
🏷️ Subject: Textual spatial reasoning benchmark for intrinsic LLM spatial intelligence evaluation
❓ Problem:
- Perception–reasoning entanglement in VLM benchmarks
- Lack of high-fidelity text-only spatial tasks
- Over-reliance on language priors / pattern matching
- Weak evaluation of global consistency and mental mapping
💡 Idea: Convert visual scenes into coordinate-aware text to isolate and test symbolic spatial reasoning in LLMs.
🛠️ Solution:
- SiT-Bench: 3.8K QA pairs across 5 categories and 17 subtasks for spatial cognition
- Textual Encoding: multi-view scenes → coordinate-aware descriptions enabling symbolic reasoning (see the sketch after this entry)
- Dual Construction: image-based generation + vision-benchmark-to-text adaptation
- R1 Filtering: reasoning-based filtering removes trivial, inconsistent, and leakage-prone samples
- Evaluation Protocol: compare LLMs/VLMs with and without CoT to isolate reasoning ability
🏆 Results: Best model reaches 59.46% vs. 74.42% for humans, with a large gap on global tasks (<10% on mapping). CoT significantly improves performance, validating latent but underutilized spatial reasoning.

Example of SiT Benchmark ...
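To make the "Textual Encoding" step concrete, here is a minimal sketch (not the SiT-Bench pipeline itself) of turning a toy scene with object coordinates into a coordinate-aware description plus a spatial QA item whose answer is derivable from the text alone. The object names, coordinate frame, and question template are invented for illustration.

```python
# Hedged sketch: coordinate-aware textual encoding of a toy scene, with one
# left/right relation question that an LLM can answer symbolically (no pixels).

# Toy scene: object -> (x, y, z) in meters; camera at the origin, x right, y forward, z up.
scene = {
    "sofa":   (0.0, 3.0, 0.0),
    "lamp":   (1.5, 2.0, 0.0),
    "window": (-2.0, 4.5, 1.0),
}

def describe_scene(objects: dict[str, tuple[float, float, float]]) -> str:
    """Render a coordinate-aware textual description of the scene."""
    lines = ["Coordinate frame: origin at the camera, x right, y forward, z up (meters)."]
    for name, (x, y, z) in objects.items():
        lines.append(f"- {name}: position ({x:.1f}, {y:.1f}, {z:.1f})")
    return "\n".join(lines)

def relation_question(objects, a: str, b: str) -> tuple[str, str]:
    """Build a left/right relation QA item answerable from the coordinates alone."""
    question = f"From the camera's viewpoint, is the {a} to the left or to the right of the {b}?"
    answer = "left" if objects[a][0] < objects[b][0] else "right"
    return question, answer

if __name__ == "__main__":
    print(describe_scene(scene))
    q, a = relation_question(scene, "window", "lamp")
    print(q, "->", a)  # the window is to the left of the lamp
```

Pairing such descriptions with questions whose ground truth follows from the coordinates is what lets the benchmark test spatial reasoning separately from visual perception.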

Date: Mar. 19, 2026 | Total: 594 words | Author: PaperMoon | Last Modified: Apr. 5, 2026