When VLMs Become Cognitive Mimics, Not Physical Reasoners: A QuantiPhy Study

TOPIC Quantitative Physical Understanding

WHY READ Shows that top VLMs guess physical quantities from memory (pre-trained world knowledge) rather than measuring them from video, and provides rigorous tests to diagnose this failure.

TAKEAWAY Current VLMs are cognitive mimics, not physical reasoners, so build systems that arbitrate between perception and memory rather than forcing pure end-to-end inference. (Context Learning, Agentic AI)

Stanford University, UST


🚀 1 Motivation & Problem

Humans understand the physical world through structured mathematical abstractions. From Isaac Newton’s formulation of universal gravitation inspired by a falling apple, to modern physics, quantitative laws enable precise reasoning about the dynamics of the real world. In contrast, although state-of-the-art AI systems demonstrate remarkable capabilities in mathematical reasoning, programming, and scientific writing, enabling artificial intelligence to ground its understanding in the physical world remains a fundamental and unresolved challenge. This limitation poses a critical barrier to deploying AI systems in real-world, embodied environments.

Modern large language models (LLMs) are predominantly trained under the next-token prediction paradigm, which implicitly encourages models to capture statistical regularities in data. A natural extension toward building world models is to train systems to predict future states—such as future frames in videos or evolving spatial configurations. While such approaches can improve perceptual modeling and temporal prediction, they do not necessarily lead to a true understanding of physical laws. Instead, models may learn to imitate surface-level patterns in visual data without acquiring the underlying causal and quantitative structure of the physical world.

This limitation can be intuitively understood by analogy to human cognition. If humans were to perceive the world purely through passive observation, without forming explicit conceptual or physical knowledge, their behavior would be driven by superficial correlations rather than grounded reasoning. As a result, actions would lack an understanding of physical consequences (e.g., failing to infer the danger of falling from a height), reflecting a gap between perception and cognition. Similarly, current AI systems often rely on learned statistical priors rather than principled physical reasoning.

To mitigate this issue, prior work has introduced large-scale datasets in the form of Visual Question Answering (VQA) to inject world knowledge into models. However, such approaches remain insufficient for evaluating true physical understanding.

Problem: Existing benchmarks for physical world understanding are predominantly VQA-based and qualitative. These evaluations often reduce reasoning to discrete answer selection or linguistic plausibility, which can be solved via pattern matching rather than genuine physical inference.
Insight: To address this limitation, the authors introduce a new paradigm that evaluates quantitative physical reasoning, focusing on whether models can infer numerical kinematic properties (e.g., size, velocity, acceleration) from visual inputs.

💡 2 Methodology

2.1 Task Formulation

The paper formulates a kinematic inference task for evaluating physical reasoning in vision-language models. Given a video and a single physical prior (e.g., size $\bold{S}_t^{\text{world}}$, velocity $\bold{V}_t^{\text{world}}$, or acceleration $\bold{A}_t^{\text{world}}$), the model is required to estimate another physical quantity of a target object in real-world units.

Tab. 1: Pixel-to-World Representation and Scale Mapping
Component	Definition	Measurement Units
Pixel Space	Observable quantities derived from video frames	[pixel], [pixel/s], [pixel/s²]
World Space	Physical quantities in real-world coordinates	[m], [m/s], [m/s²]
Scale Factor (γ)	Mapping between pixel space and world space	[m/pixel]

Given a video capturing the translational motion of a target object under a fixed camera, the object's position in pixel space, denoted as $\mathbf{X}_t^{\text{pixel}}$, can be obtained at each time step $t$ from the frames. Based on the resulting discrete trajectory, the velocity and acceleration in pixel space can be estimated using finite difference approximations:

$$ \bold{V}_t^{\text{pixel}}\approx\frac{\bold{X}_{t+\mathrm{d}t}^{\text{pixel}}-\bold{X}_t^{\text{pixel}}}{\mathrm{d}t}; \bold{A}_t^{\text{pixel}}\approx\frac{\bold{X}_{t+2\mathrm{d}t}^{\text{pixel}}-2\bold{X}_{t+\mathrm{d}t}^{\text{pixel}}+\bold{X}_t^{\text{pixel}}}{\mathrm{d}t^2}. \tag{1} $$
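The finite differences in Eq. (1) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `pixel_kinematics`, the list-of-tuples trajectory format, and the synthetic track are hypothetical and not taken from the paper's code.

```python
def pixel_kinematics(xs, dt):
    """Forward-difference velocity and acceleration in pixel space (Eq. 1).

    xs: list of (x, y) pixel positions, one per frame (assumed format).
    dt: time between consecutive frames in seconds, i.e. 1 / fps.
    Returns (v, a): lists of (vx, vy) in pixel/s and (ax, ay) in pixel/s^2.
    """
    # V_t ~ (X_{t+dt} - X_t) / dt
    v = [((x1 - x0) / dt, (y1 - y0) / dt)
         for (x0, y0), (x1, y1) in zip(xs, xs[1:])]
    # A_t ~ (X_{t+2dt} - 2 X_{t+dt} + X_t) / dt^2
    a = [((x2 - 2 * x1 + x0) / dt ** 2, (y2 - 2 * y1 + y0) / dt ** 2)
         for (x0, y0), (x1, y1), (x2, y2) in zip(xs, xs[1:], xs[2:])]
    return v, a


# Synthetic track (hypothetical): x fixed at 100 px, y = 50 * t^2 px,
# sampled at 10 fps, so the vertical acceleration is 100 pixel/s^2.
fps = 10
track = [(100.0, 50.0 * (i / fps) ** 2) for i in range(5)]
v, a = pixel_kinematics(track, dt=1.0 / fps)
```

A trajectory of T frames yields T-1 velocity samples and T-2 acceleration samples, which is one reason dropping frames hurts acceleration estimates more than velocity estimates.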

To convert these pixel-based measurements into real-world physical quantities, a scale factor $\gamma$ is introduced, which maps pixel space to world space. The relationship can be expressed as follows:

$$ \bold{S}_t^{\text{world}}=\gamma \cdot \bold{S}_t^{\text{pixel}}; \bold{V}_t^{\text{world}}=\gamma \cdot \bold{V}_t^{\text{pixel}}; \bold{A}_t^{\text{world}}=\gamma \cdot \bold{A}_t^{\text{pixel}}. \tag{2} $$

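Thus, a single prior suffices: dividing the provided world-space quantity by its pixel-space counterpart measured from the video yields $\gamma$, and multiplying any other pixel-space measurement by $\gamma$ gives the target quantity in real-world units.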
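As a worked illustration of Eq. (2), the sketch below recovers $\gamma$ from one known quantity and applies it to another measurement. The helper names `scale_factor` and `to_world` are hypothetical, and the pixel measurements (324 px, 800 px/s) are made-up numbers; only the 5.67 m car length echoes an example prior quoted in the benchmark prompt.

```python
def scale_factor(prior_world, prior_pixel):
    """gamma in world units per pixel, from one known quantity (Eq. 2)."""
    if prior_pixel == 0:
        raise ValueError("pixel-space prior must be non-zero")
    return prior_world / prior_pixel


def to_world(value_pixel, gamma):
    """Convert a pixel-space size, velocity, or acceleration to world units."""
    return gamma * value_pixel


# Hypothetical measurement: a car known to be 5.67 m long spans 324 px,
# and its tracked speed is 800 px/s.
gamma = scale_factor(5.67, 324.0)        # about 0.0175 m/pixel
speed_world = to_world(800.0, gamma)     # about 14 m/s
```

The same `gamma` applies to size, velocity, and acceleration because all three differ only in their time units, which are shared between pixel space and world space. This holds for 2D motion at constant depth; in the 3D tasks the effective scale changes with distance, which is why depth information is supplied.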

2.2 Benchmark Design

To comprehensively evaluate the kinematic quantities above, QuantiPhy organizes its video-question pairs along three primary axes. The first two axes define the core reasoning task:

Dimensionality: {2D, 3D}. 2D movement assumes motion strictly in the x-y plane (constant depth), while 3D movement includes the z-axis (varying depth), making it intrinsically more challenging.
Physical prior: {Static, Dynamic}. The Static prior provides constant object size $\bold{S}^{\text{world}}$ throughout the video, while the Dynamic prior provides velocity $\bold{V}_t^{\text{world}}$ or acceleration $\bold{A}_t^{\text{world}}$ at a given timestep $t$.

These two axes yield four tasks: 2D-Static (2S), 2D-Dynamic (2D), 3D-Static (3S), and 3D-Dynamic (3D). Table 2 summarizes the statistics of the QuantiPhy benchmark.

Tab. 2: Data Statistics of the QuantiPhy Benchmark
Category	Value	Description
Total Videos	569	Unique video samples collected from multiple sources
Total QA Pairs	3,355	Video-question pairs with numerical ground truth
Task Types	4	2D-Static, 2D-Dynamic, 3D-Static, 3D-Dynamic
Video Duration	2–3 seconds	Typical length of each video clip
Data Sources	3	Blender simulation, lab capture, internet videos
Storage Size	~115 MB	Total dataset size after processing

2.3 Data Construction

QuantiPhy employs a three-stage construction pipeline that balances experimental control with real-world diversity. As illustrated in Figure 2.1, the authors integrate synthetic simulation, controlled laboratory capture, and in-the-wild internet videos to create a comprehensive evaluation benchmark.

Stage 1: Data Collection. The authors source videos from three complementary channels to ensure broad coverage of physical scenarios:

Tab. 3: Data Source Characteristics and Collection Methodology
Source	Quantity	Key Characteristics	Primary Use Case
Blender Simulation	300 videos	Full physical control; precise ground-truth; scalable scene variation	Controlled experiments; counterfactual testing
Lab Capture	112 videos	Real-world physics; 4D metric reconstruction; calibrated multi-view	Real sensor validation; depth-varying 3D motion
Internet Scraping	72 videos	Natural scenes; diverse distributions; uncontrolled conditions	Out-of-distribution evaluation
Segmented (SAM2)	85 videos	Isolated objects on plain backgrounds; background ablation	Scene complexity analysis

Blender Simulation enables precise control over object kinematics, camera parameters, and scene composition. They render scenes using Cycles/EEVEE engines with varying resolutions (1920×1080, 1080×1080, 480×960), frame rates (24–120 fps), and lighting conditions. Motion types include: (i) keyframed animation for articulated objects (humans, animals), and (ii) physics-driven simulation for rigid-body dynamics with Newtonian constraints.
Lab Capture utilizes four Orbbec Femto Mega RGB-D cameras arranged in a multi-view stereo configuration. They capture diverse motions including free fall, sliding, pendulum oscillation, and bouncing across small-scale (desktop) and large-scale (room-scale) setups.
Internet Videos are manually curated from open-source platforms and author-recorded footage, strictly filtered for static camera, translational motion, and visible reference objects. All identifiable information (faces, license plates) is anonymized via blurring.

Stage 2: Data Annotation. They employ source-specific annotation protocols to extract precise kinematic ground truth:

Tab. 4: Annotation Methods by Data Source
Source	Annotation Method	Extracted Quantities	Precision
Blender	Automated Python scripts querying scene graph	Size, displacement, velocity, acceleration, depth	Exact (floating-point)
Lab	UI-assisted depth clicking + multi-view triangulation	Metric depth, 3D trajectory, instantaneous velocity/acceleration	±1 cm (depth camera limited)
Internet	Interactive pixel measurement tool + reference scaling	Pixel kinematics → world units via γ estimation	Approximate (reference-dependent)

Stage 3: Task Formulation. Each video is associated with multiple (prior, question, ground-truth) triplets following the kinematic inference framework:

Tab. 5: Video-Text Record Schema
Field	Description	Example
video_id	Unique identifier	simulation_0032
video_type	4-character code: [Prior][Dim][Objects][Background]	A3MC (Acceleration, 3D, Multiple objects, Complex)
inference_type	Prior dynamics → Target dynamics (S=static, D=dynamic)	DD (Dynamic prior → Dynamic target)
ground_truth_prior	Provided physical constant with unit	gravity acc = 9.8 m/s²
depth_info	Temporal depth annotations (3D tasks only)	t=1s, distance_ball_camera = 1.4020m
ground_truth_posterior	Numerical answer (unit specified in question)	2.86

The four-character video type code systematically encodes task complexity:

1st character: S (Size prior), V (Velocity prior), or A (Acceleration prior)
2nd character: 2 (2D planar motion) or 3 (3D depth-varying motion)
3rd character: S (Single object) or M (Multiple objects requiring relational reasoning)
4th character: X (Plain background), S (Simple texture), or C (Complex scene)

This schema yields 36 fine-grained categories (3 × 2 × 2 × 3; e.g., A2SX, V3MC), each populated with ≥4 videos to ensure statistical validity. The final dataset comprises 569 unique videos and 3,355 question-answer pairs, with a 2D:3D ratio of approximately 4:3 and a balanced distribution across inference types.
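The four-character code can be decoded with a small lookup, and enumerating every combination reproduces the count of 36 categories. This is a sketch for illustration only; the function name `parse_video_type` and the dictionary labels are assumptions, not identifiers from the QuantiPhy release.

```python
from itertools import product

# Lookup tables for each position of the video_type code.
PRIOR = {"S": "size", "V": "velocity", "A": "acceleration"}
DIM = {"2": "2D", "3": "3D"}
OBJECTS = {"S": "single", "M": "multiple"}
BACKGROUND = {"X": "plain", "S": "simple", "C": "complex"}

_FIELDS = (("prior", PRIOR), ("dim", DIM),
           ("objects", OBJECTS), ("background", BACKGROUND))


def parse_video_type(code):
    """Decode a code such as 'A3MC' into its four named attributes."""
    if len(code) != len(_FIELDS):
        raise ValueError(f"expected 4 characters, got {code!r}")
    decoded = {}
    for (name, table), ch in zip(_FIELDS, code):
        if ch not in table:
            raise ValueError(f"invalid {name} character {ch!r} in {code!r}")
        decoded[name] = table[ch]
    return decoded


# Every valid code: 3 priors x 2 dims x 2 object counts x 3 backgrounds.
ALL_CODES = ["".join(p) for p in product(PRIOR, DIM, OBJECTS, BACKGROUND)]
```

Note that `S` means "size prior" in the first position, "single object" in the third, and "simple texture" in the fourth, so the code must be decoded by position rather than by character.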

🛠️ 3 Evaluation Protocol

The QuantiPhy evaluation framework is designed to rigorously assess Vision-Language Models' quantitative physical reasoning through standardized prompting, robust parsing, and calibrated metrics. Their protocol addresses three critical challenges: (i) ensuring consistent model behavior across diverse architectures, (ii) extracting reliable numerical predictions from potentially verbose outputs, and (iii) measuring proximity to ground truth with appropriate tolerance for physical measurement uncertainty.

3.1 Benchmark Models

The authors evaluate 21 state-of-the-art VLMs spanning proprietary APIs and open-weight architectures to ensure comprehensive coverage of current capabilities:

Tab. 6: Evaluated Model Suite
Category	Models	Key Characteristics
Proprietary	ChatGPT-5.1, ChatGPT-5	OpenAI multimodal with extended CoT reasoning
	Gemini-2.5 Pro/Flash	Google long-context video understanding
	Grok-4.1 (Fast Reasoning)	xAI rapid inference with reasoning optimization
	Claude-4.5 Sonnet	Anthropic detailed explanatory generation
Open-Weight (Scaling Series)	Qwen3-VL-Instruct (2B/8B/32B)	Alibaba architecture scaling analysis
	InternVL-3.5 (2B/8B/30B)	Shanghai AI Lab vision-language alignment
	Phi-4-Multimodal / Phi-3-Mini	Microsoft efficient multimodal design
	SmolVLM-Instruct (256M)	Ultra-lightweight edge deployment
Specialized	Molmo-7B, VILA-7B, LLaVA-13B	Academic research architectures
Specialized	MiniCPM-V 4.5, CogVLM2-Video	Native video input processing

Deployment Configuration: Proprietary models are accessed via official APIs (OpenAI, Google, Anthropic, xAI). Open-weight models are hosted via Replicate API or self-deployed with vLLM. Temperature is fixed at 0–0.1 for deterministic outputs; token limits range from 500 (lightweight models) to 10,000 (reasoning-intensive models).

3.2 Prompting Strategy

They employ a constrained generation protocol designed to minimize output variance and enforce numerical precision:

Tab. 7: Standardized Prompt Structure
Component	Content	Purpose
[Video Frames]	Full temporal sequence at 480p resolution; all frames retained	Preserve motion dynamics; avoid temporal aliasing
[System Prompt]	"You are an expert video analyst specializing in physics measurements"	Establish authoritative persona; pilot-validated for adherence
[Ground Truth Prior]	Single physical constant (e.g., "length of yellow car = 5.67m")	Enable scale factor γ determination
[Depth Info] (3D only)	Temporal camera-object distances	Support depth-varying kinematic inference
[Question]	Target quantity with explicit unit and timestamp	Remove ambiguity in prediction target
[Post-Prompt]	"Output ONLY the numerical answer and unit. No explanation."	Suppress verbose CoT; enforce parseability

Critical Design Choices:

Temporal fidelity over spatial resolution: 480p preserves all frames; subsampling degrades velocity/acceleration tracking
Single prior constraint: Exactly one physical constant provided to test scale transformation, not multi-factor estimation
Deterministic decoding: Greedy decoding (temperature=0) where supported; default parameters otherwise.
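The text portion of the prompt in Tab. 7 could be assembled roughly as follows. The two quoted strings come from the table; everything else (the function name `build_text_prompt`, the `Prior:` / `Depth info:` / `Question:` labels, the ordering of lines, and the sample question) is an assumption made for illustration. Video frames and the system prompt would be passed separately through whatever model API is in use.

```python
SYSTEM_PROMPT = ("You are an expert video analyst "
                 "specializing in physics measurements")
POST_PROMPT = "Output ONLY the numerical answer and unit. No explanation."


def build_text_prompt(prior, question, depth_info=None):
    """Join the textual prompt components in the order listed in Tab. 7.

    prior: the single ground-truth prior, e.g. "length of yellow car = 5.67m".
    question: target quantity with explicit unit and timestamp.
    depth_info: optional list of depth annotations (3D tasks only).
    """
    parts = [f"Prior: {prior}"]
    if depth_info:
        parts.append("Depth info: " + "; ".join(depth_info))
    parts.append(f"Question: {question}")
    parts.append(POST_PROMPT)
    return "\n".join(parts)


# Hypothetical 2D example: exactly one prior, no depth information.
prompt_2d = build_text_prompt(
    "length of yellow car = 5.67m",
    "What is the speed of the yellow car at t = 1.0 s, in m/s?",
)
```

Keeping the post-prompt as the last line mirrors the table's intent of suppressing verbose reasoning right before the model starts generating.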

3.3 Answer Retrieval and Parsing

Given model outputs ranging from concise numerical responses to extensive analytical narratives, they implement a hierarchical parsing pipeline:

1. Exact Match: Check if response matches [number][unit] format
2. Delimiter Search: Scan for "=", "Final Answer:", "=>", ":" → Retain substring after last delimiter
3. Unit Sanitization: Remove "meters", "m/s", "cm/s²" etc.
4. Heuristic Extraction: Apply regex for floating-point numbers → Take absolute value; select last valid number if multiple
5. Failure Handling: Return None if no valid number identified
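The five steps above can be sketched with the standard `re` module. The regular expressions, the delimiter order, and the unit list below are assumptions chosen to make the example work; the paper's actual patterns may differ.

```python
import re

NUMBER = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?"
# Step 1 pattern: a number optionally followed by a unit token.
EXACT = re.compile(rf"\s*({NUMBER})\s*(?:[A-Za-zµ][A-Za-z0-9/^²³µ]*)?\s*")
DELIMITERS = ("Final Answer:", "=>", "=", ":")
# Assumed unit list; sorted longest-first so "cm/s^2" is removed before "m/s".
UNITS = sorted(("meters", "metres", "cm/s^2", "m/s^2", "cm/s²", "m/s²",
                "cm/s", "m/s"), key=len, reverse=True)


def parse_answer(text):
    """Return the predicted number as a float, or None if none is found."""
    # 1. Exact match: the whole response is "[number][unit]".
    m = EXACT.fullmatch(text)
    if m:
        return float(m.group(1))
    # 2. Delimiter search: keep only the text after the last delimiter.
    cut = 0
    for d in DELIMITERS:
        i = text.rfind(d)
        if i != -1:
            cut = max(cut, i + len(d))
    tail = text[cut:]
    # 3. Unit sanitization: drop unit strings so digits such as the "2"
    #    in "m/s^2" are not mistaken for the answer.
    for u in UNITS:
        tail = tail.replace(u, " ")
    # 4. Heuristic extraction: last number in the remainder, absolute value.
    numbers = re.findall(NUMBER, tail)
    if numbers:
        return abs(float(numbers[-1]))
    # 5. Failure handling.
    return None
```

For instance, `parse_answer("v = d / t = 4.2 / 1.5 => 2.8 m/s")` keeps only the text after `=>` and returns `2.8`, while a refusal such as `"I cannot determine this."` returns `None`.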

3.4 Evaluation Metric: Mean Relative Accuracy (MRA)

The authors adopt MRA as the primary metric, extending the design from VSI-Bench with threshold calibration for physical reasoning tasks:

$$ \text{MRA}=\frac{1}{10}\sum_{\theta\in\mathcal{C}}\mathbb{1}\bigg(\frac{|\hat{y}-y|}{|y|}\lt 1-\theta\bigg),\quad \mathcal{C}=\{0.5,0.55,...,0.95\} \tag{3} $$

Tab. 8: MRA Design Rationale and Properties
Property	Description	Physical Reasoning Justification
Multi-threshold	10 confidence levels (0.5–0.95)	Captures gradations of "accurate enough"; avoids binary rigidity
Relative error	$|\hat{y}-y|/|y|$ rather than absolute	Scale-invariant; comparable across microscopic to astronomical scenes
Partial credit	Linear accumulation across thresholds	3.1m error (3% relative) rewarded; 31m error (1000% relative) penalized
Robustness	Indicator function rather than continuous loss	Tolerates annotation ambiguity (hair inclusion in height, rim vs. outer diameter)
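Eq. (3) translates directly into code. The sketch below is a minimal reading of the formula; the function name and the handling of `None` predictions and zero ground truth are assumptions. The example values use a hypothetical ground truth of 3.0 m.

```python
def mean_relative_accuracy(pred, truth):
    """MRA for one question, following Eq. (3).

    pred: predicted value, or None if parsing failed (scored as 0 here).
    truth: non-zero ground-truth value in the same unit.
    """
    if pred is None:
        return 0.0
    if truth == 0:
        raise ValueError("relative error is undefined for truth == 0")
    rel_err = abs(pred - truth) / abs(truth)
    # C = {0.50, 0.55, ..., 0.95}; built from integers to avoid float drift.
    thresholds = [k / 100 for k in range(50, 100, 5)]
    # A threshold theta is passed when the relative error is below 1 - theta.
    passed = sum(rel_err < 1 - theta for theta in thresholds)
    return passed / len(thresholds)


# Hypothetical ground truth of 3.0 m:
close = mean_relative_accuracy(3.1, 3.0)   # ~3% error passes all 10 thresholds
far = mean_relative_accuracy(31.0, 3.0)    # ~930% error passes none
```

Because the strictest threshold requires a relative error below 5%, any prediction within that band scores 1.0, and any prediction off by 50% or more scores 0.0; values in between earn partial credit in steps of 0.1.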

Aggregation Protocol:

Question-level: MRA computed per (video, question) pair.
Category-level: Average MRA across all questions in {2D-Static, 2D-Dynamic, 3D-Static, 3D-Dynamic}.
Model-level: Unweighted mean of four category scores.

Questions with no valid numerical output after 5 retries contribute MRA = 0 to the category average.
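The three aggregation levels can be sketched as below. The record format (a list of `(category, score)` pairs) and the function name `aggregate` are assumptions; failed questions are represented by `None` and counted as 0, as the protocol describes.

```python
from collections import defaultdict
from statistics import mean

CATEGORIES = ("2D-Static", "2D-Dynamic", "3D-Static", "3D-Dynamic")


def aggregate(records):
    """Return (category_scores, model_score) from per-question MRA values.

    records: iterable of (category, mra) pairs, where mra is a float or
    None when no valid number was obtained after all retries.
    """
    by_category = defaultdict(list)
    for category, score in records:
        by_category[category].append(0.0 if score is None else score)
    # Category-level: mean over all questions in that category.
    category_scores = {c: mean(by_category[c])
                       for c in CATEGORIES if by_category[c]}
    # Model-level: unweighted mean of category scores, so a category with
    # few questions counts as much as one with many.
    model_score = mean(category_scores.values())
    return category_scores, model_score


# Hypothetical per-question results for one model.
records = [("2D-Static", 1.0), ("2D-Static", 0.5), ("2D-Dynamic", None),
           ("3D-Static", 0.5), ("3D-Dynamic", 1.0)]
category_scores, model_score = aggregate(records)
```

The unweighted model-level mean is a deliberate choice in the protocol: it prevents the larger 2D categories from dominating the headline number.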

📊 4 Experiments

4.1 Main Results

Figure 4.1 presents performance across four kinematic inference categories. ChatGPT-5.1 achieves the highest overall MRA (53.1%), marginally surpassing humans on 2D-Dynamic tasks but remaining below the human average of 55.6%. Open-weight models exhibit clear scaling effects: Qwen3-VL improves from 29.0% (2B) to 46.0% (32B), with gains most pronounced on dynamic categories requiring temporal integration.

4.2 Effect of Scene Context

The authors analyze performance across scene difficulty axes:

Background Complexity: Performance in complex backgrounds (C, 0.40 MRA) slightly exceeds simple textures (S, 0.38) and plain backgrounds (X, 0.35). Realistic backgrounds provide additional scale reference cues (road markings, architectural elements) that aid inference.
Object Multiplicity: Multiple-object scenes (M) consistently outperform single-object scenes (S) by 3–5 MRA points. Additional objects serve as implicit comparison standards for size and speed estimation.

🧠 5 Reflection & Inspiration

The study reveals that current vision-language models struggle with quantitative physical reasoning. Instead of relying on visual evidence and provided priors, they tend to depend heavily on pre-trained world knowledge, leading to limited numerical accuracy and poor input faithfulness.

Pros:
1. Novel quantitative paradigm: Moves beyond binary VQA evaluation to continuous numerical accuracy with MRA metric, distinguishing 3.1m error (acceptable) from 31m error (catastrophic).
2. Controlled yet diverse data: Blender simulation enables exact ground-truth and systematic variation; lab capture adds real-world validation; internet data tests distribution generalization.
Cons:
1. Simplified physical scope: Restricted to translational motion of rigid objects—no rotation, deformation, fluid dynamics, or multi-body contact physics relevant to real robotics.
2. Fixed camera assumption: Removes the ego-motion ambiguity present in embodied navigation and AR/VR applications, so the results do not cover general moving-camera scenarios.
3. No proposed solution: The benchmark identifies failure modes but does not demonstrate improved training recipes or fine-tuning strategies to address them.
Inspiration:
1. From perception to cognition: Personal testing on VSI-Bench confirms SOTA models maintain strong performance (e.g., Object Counting) without visual input—mirroring QuantiPhy's findings. Rather than forcing pure perception, we should architect cognitive systems that strategically arbitrate between sensing and memory, transforming VLMs from passive perceivers into active agents that know when to look and when to recall.
2. Agentic system as solution pathway: Even with substantial room for base model improvement, the immediate deployment of embodied AI may benefit more from intelligent system design—explicit uncertainty quantification, selective memory retrieval, and input-confidence gating—than from waiting for perfect end-to-end physical reasoning to emerge.
