<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Posts on PaperMoon&#39;s blog</title>
    <link>https://milknocandy.github.io/posts/</link>
    <description>Recent content in Posts on PaperMoon&#39;s blog</description>
    <generator>Hugo -- 0.154.3</generator>
    <language>en</language>
    <lastBuildDate>Sun, 05 Apr 2026 17:58:05 +0800</lastBuildDate>
    <atom:link href="https://milknocandy.github.io/posts/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>When VLMs Become Cognitive Mimics, Not Physical Reasoners: A QuantiPhy Study</title>
      <link>https://milknocandy.github.io/posts/2026-03-23-quantiphy/</link>
      <pubDate>Mon, 23 Mar 2026 16:42:46 +0800</pubDate>
      <guid>https://milknocandy.github.io/posts/2026-03-23-quantiphy/</guid>
      <description>&lt;div class=&#34;paperbox&#34;&gt;
    &lt;div class=&#34;pb-item&#34;&gt;
        &lt;span class=&#34;pb-key&#34;&gt;TOPIC&lt;/span&gt;
        &lt;span class=&#34;pb-sep&#34;&gt;&lt;/span&gt;
        &lt;span class=&#34;pb-val&#34;&gt;Quantitative Physical Understanding&lt;/span&gt;
    &lt;/div&gt;
    &lt;div class=&#34;pb-item&#34;&gt;
        &lt;span class=&#34;pb-key&#34;&gt;WHY READ&lt;/span&gt;
        &lt;span class=&#34;pb-sep&#34;&gt;&lt;/span&gt;
        &lt;span class=&#34;pb-val&#34;&gt;Exposes that top VLMs guess physical quantities from memory (pre-trained world knowledge) rather than measure from video, with rigorous tests to diagnose this failure.&lt;/span&gt;
    &lt;/div&gt;
    &lt;div class=&#34;pb-item&#34;&gt;
        &lt;span class=&#34;pb-key&#34;&gt;TAKEAWAY&lt;/span&gt;
        &lt;span class=&#34;pb-sep&#34;&gt;&lt;/span&gt;
        &lt;span class=&#34;pb-val&#34;&gt;Current VLMs are cognitive mimics, not physical reasoners, so build systems that arbitrate between perception and memory rather than forcing pure end-to-end inference. (Context Learning, Agentic AI)&lt;/span&gt;
    &lt;/div&gt;
    &lt;div class=&#34;pb-links&#34;&gt;
        &lt;span class=&#34;pb-org&#34;&gt;Stanford University, UST&lt;/span&gt;
        &lt;div class=&#34;pb-link-group&#34;&gt;&lt;a href=&#34;https://arxiv.org/abs/2512.19526&#34; target=&#34;_blank&#34; class=&#34;pb-link&#34;&gt;📄 Paper&lt;/a&gt;&lt;a href=&#34;https://github.com/Paulineli/QuantiPhy&#34; target=&#34;_blank&#34; class=&#34;pb-link&#34;&gt;💻 Code&lt;/a&gt;&lt;a href=&#34;https://github.com/Paulineli/QuantiPhy&#34; target=&#34;_blank&#34; class=&#34;pb-link&#34;&gt;🌐 Project&lt;/a&gt;&lt;a href=&#34;https://github.com/Paulineli&#34; target=&#34;_blank&#34; class=&#34;pb-link&#34;&gt;👤 Author&lt;/a&gt;
        &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;

&lt;hr&gt;
&lt;h2 id=&#34;-1-motivation--problem&#34;&gt;🚀 1 Motivation &amp;amp; Problem&lt;/h2&gt;
&lt;p&gt;Humans understand the physical world through structured mathematical abstractions. From Isaac Newton’s formulation of universal gravitation inspired by a falling apple, to modern physics, quantitative laws enable precise reasoning about the dynamics of the real world. In contrast, although state-of-the-art AI systems demonstrate remarkable capabilities in mathematical reasoning, programming, and scientific writing, enabling artificial intelligence to &lt;u&gt;&lt;i&gt;ground its understanding in the physical world&lt;/i&gt;&lt;/u&gt; remains a fundamental and unresolved challenge. This limitation poses a critical barrier to deploying AI systems in real-world, embodied environments.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Spatial Intelligence in Large Models: Benchmarks, Mechanisms, and Reasoning</title>
      <link>https://milknocandy.github.io/posts/2026-03-19-si/</link>
      <pubDate>Thu, 19 Mar 2026 11:15:09 +0800</pubDate>
      <guid>https://milknocandy.github.io/posts/2026-03-19-si/</guid>
      <description>&lt;h2 id=&#34;1-benchmark&#34;&gt;1 Benchmark&lt;/h2&gt;
&lt;h3 id=&#34;11-textual-benchmarks&#34;&gt;1.1 Textual Benchmarks&lt;/h3&gt;
&lt;p&gt;&lt;details class=&#34;paper-details-wrapper&#34;&gt;
    &lt;summary class=&#34;paper-summary&#34;&gt;
        &lt;div class=&#34;summary-inner&#34;&gt;
            

            
            
            
            

            &lt;span class=&#34;s-venue-dynamic v-arxiv-2026&#34;&gt;
                &lt;svg viewBox=&#34;0 0 24 24&#34; fill=&#34;none&#34; stroke=&#34;currentColor&#34; stroke-width=&#34;2&#34;
                stroke-linecap=&#34;round&#34; stroke-linejoin=&#34;round&#34; class=&#34;v-icon&#34;&gt;
                &lt;path d=&#34;M14.5 2H6a2 2 0 0 0-2 2v16a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2V7.5L14.5 2z&#34;&gt;&lt;/path&gt;
                &lt;polyline points=&#34;14 2 14 8 20 8&#34;&gt;&lt;/polyline&gt;
            &lt;/svg&gt;
                &lt;span class=&#34;v-text&#34;&gt;Arxiv 2026&lt;/span&gt;
            &lt;/span&gt;

            &lt;p class=&#34;s-title&#34;&gt;Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions&lt;/p&gt;
            &lt;span class=&#34;s-toggle-icon&#34;&gt;🔻&lt;/span&gt;
        &lt;/div&gt;
    &lt;/summary&gt;

    &lt;div class=&#34;paper-card-expanded&#34;&gt;
        &lt;div class=&#34;expand-action-bar&#34;&gt;
            &lt;div class=&#34;org-outer-container&#34;&gt;
                
                &lt;div class=&#34;org-group&#34;&gt;
                    
                    
                      &lt;span class=&#34;org-tag&#34;&gt;🏛️
                        Beijing Institute of Technology&lt;/span&gt;
                    
                    
                      &lt;span class=&#34;org-tag&#34;&gt;🏛️
                        BUCT&lt;/span&gt;
                    
                    
                &lt;/div&gt;
                
            &lt;/div&gt;

            &lt;div class=&#34;action-btns-fixed&#34;&gt;
                &lt;a href=&#34;https://binisalegend.github.io/&#34; target=&#34;_blank&#34; class=&#34;act-btn&#34;&gt;👤 Author&lt;/a&gt;
                &lt;a href=&#34;https://arxiv.org/abs/2601.03590&#34; target=&#34;_blank&#34; class=&#34;act-btn&#34;&gt;📄 Paper&lt;/a&gt;
                &lt;a href=&#34;https://github.com/binisalegend/SiT-Bench&#34; target=&#34;_blank&#34; class=&#34;act-btn&#34;&gt;💻 Code&lt;/a&gt;
                
            &lt;/div&gt;
        &lt;/div&gt;&lt;div class=&#34;expand-grid&#34;&gt;&lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;🏷️&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Subject:&lt;/b&gt; Textual spatial reasoning benchmark for intrinsic LLM spatial intelligence evaluation&lt;/div&gt;
            &lt;/div&gt;
            &lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;❓&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Problem:&lt;/b&gt;
                    &lt;div class=&#34;ex-markdown-inner&#34;&gt; &lt;ul&gt;
&lt;li&gt;Perception–reasoning entanglement in VLM benchmarks&lt;/li&gt;
&lt;li&gt;Lack of high-fidelity text-only spatial tasks&lt;/li&gt;
&lt;li&gt;Over-reliance on language priors/pattern matching&lt;/li&gt;
&lt;li&gt;Weak evaluation of global consistency, mental mapping&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
                &lt;/div&gt;
            &lt;/div&gt;
            &lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;💡&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Idea:&lt;/b&gt; Convert visual scenes into &lt;mark&gt;coordinate-aware text&lt;/mark&gt; to isolate and test &lt;mark&gt;symbolic spatial reasoning&lt;/mark&gt; in LLMs.&lt;/div&gt;
            &lt;/div&gt;

            &lt;div class=&#34;ex-row ex-sol-box&#34;&gt;
                &lt;span class=&#34;ex-icon&#34;&gt;🛠️&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;
                    &lt;b&gt;Solution:&lt;/b&gt;
                    &lt;div class=&#34;ex-markdown-inner&#34;&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SiT-Bench:&lt;/strong&gt; 3.8K QA across 5 categories, 17 subtasks for spatial cognition&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Textual Encoding:&lt;/strong&gt; Multi-view scenes → coordinate-aware descriptions enabling symbolic reasoning (see the toy example after this card)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dual Construction:&lt;/strong&gt; Image-based generation + vision-benchmark-to-text adaptation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;R1 Filtering:&lt;/strong&gt; Reasoning-based filtering removes trivial, inconsistent, or leaked samples&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Evaluation Protocol:&lt;/strong&gt; Compare LLMs/VLMs with/without CoT to isolate reasoning ability&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
                &lt;/div&gt;
            &lt;/div&gt;

            &lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;🏆&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Results:&lt;/b&gt; Best model 59.46% vs. 74.42% for humans; large gap on global tasks (below 10% on mapping). CoT significantly improves performance, indicating latent but underutilized spatial reasoning.&lt;/div&gt;
            &lt;/div&gt;

            

            
            
        &lt;/div&gt;
    &lt;/div&gt;
&lt;/details&gt;
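&lt;p&gt;A toy sketch of what a coordinate-aware textual scene description could look like (the object names and coordinates are invented for illustration; the benchmark&#39;s own templates may differ):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Toy example: turn object coordinates into coordinate-aware text plus a
# spatial-relation question, in the spirit of the textual encoding above.
objects = {
    &#39;chair&#39;: (1.0, 0.0, 2.5),
    &#39;table&#39;: (1.2, 0.0, 3.8),
    &#39;lamp&#39;: (-0.5, 0.0, 1.0),
}

lines = [
    f&#39;A {name} is located at (x={x:.1f}, y={y:.1f}, z={z:.1f}).&#39;
    for name, (x, y, z) in objects.items()
]
scene_text = &#39; &#39;.join(lines)
question = (&#39;Standing at the origin and facing the positive z direction, &#39;
            &#39;is the lamp to the left or to the right of the chair?&#39;)
print(scene_text)
print(question)
&lt;/code&gt;&lt;/pre&gt;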

&lt;figure &gt;
    &lt;img src=&#34;1_Sample4SiT.png&#34; alt=&#34;Example of SiT Benchmark&#34; /&gt;&lt;figcaption&gt;
        &lt;span class=&#34;auto-fig-title&#34;&gt;Example of SiT Benchmark&lt;/span&gt;
    &lt;/figcaption&gt;&lt;/figure&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>The Evolution of Unified Multimodal Models</title>
      <link>https://milknocandy.github.io/posts/2026-03-07-umm/</link>
      <pubDate>Sat, 07 Mar 2026 14:51:21 +0800</pubDate>
      <guid>https://milknocandy.github.io/posts/2026-03-07-umm/</guid>
      <description>&lt;h2 id=&#34;1-timeline-order&#34;&gt;1 Timeline Order&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;Summarize the literature reviewed in chronological order.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;h3 id=&#34;2026&#34;&gt;2026&lt;/h3&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;details class=&#34;paper-details-wrapper&#34;&gt;
    &lt;summary class=&#34;paper-summary&#34;&gt;
        &lt;div class=&#34;summary-inner&#34;&gt;
            

            
            
            
            

            &lt;span class=&#34;s-venue-dynamic v-arxiv-2026&#34;&gt;
                &lt;svg viewBox=&#34;0 0 24 24&#34; fill=&#34;none&#34; stroke=&#34;currentColor&#34; stroke-width=&#34;2&#34;
                stroke-linecap=&#34;round&#34; stroke-linejoin=&#34;round&#34; class=&#34;v-icon&#34;&gt;
                &lt;path d=&#34;M14.5 2H6a2 2 0 0 0-2 2v16a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2V7.5L14.5 2z&#34;&gt;&lt;/path&gt;
                &lt;polyline points=&#34;14 2 14 8 20 8&#34;&gt;&lt;/polyline&gt;
            &lt;/svg&gt;
                &lt;span class=&#34;v-text&#34;&gt;Arxiv 2026&lt;/span&gt;
            &lt;/span&gt;

            &lt;p class=&#34;s-title&#34;&gt;WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens&lt;/p&gt;
            &lt;span class=&#34;s-toggle-icon&#34;&gt;🔻&lt;/span&gt;
        &lt;/div&gt;
    &lt;/summary&gt;

    &lt;div class=&#34;paper-card-expanded&#34;&gt;
        &lt;div class=&#34;expand-action-bar&#34;&gt;
            &lt;div class=&#34;org-outer-container&#34;&gt;
                
                &lt;div class=&#34;org-group&#34;&gt;
                    
                    
                      &lt;span class=&#34;org-tag&#34;&gt;🏛️
                        MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition&lt;/span&gt;
                    
                    
                      &lt;span class=&#34;org-tag&#34;&gt;🏛️
                        University of Science and Technology of China&lt;/span&gt;
                    
                    
                      &lt;span class=&#34;org-tag&#34;&gt;🏛️
                        Zhejiang University&lt;/span&gt;
                    
                    
                      &lt;span class=&#34;org-tag&#34;&gt;🏛️
                        The Hong Kong University of Science and Technology&lt;/span&gt;
                    
                    
                &lt;/div&gt;
                
            &lt;/div&gt;

            &lt;div class=&#34;action-btns-fixed&#34;&gt;
                
                &lt;a href=&#34;https://arxiv.org/abs/2512.02536&#34; target=&#34;_blank&#34; class=&#34;act-btn&#34;&gt;📄 Paper&lt;/a&gt;
                
                
            &lt;/div&gt;
        &lt;/div&gt;&lt;div class=&#34;expand-grid&#34;&gt;&lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;🏷️&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Subject:&lt;/b&gt; Bridging Pre-trained VLMs and Diffusion Models for UMMs&lt;/div&gt;
            &lt;/div&gt;
            &lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;❓&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Problem:&lt;/b&gt;
                    &lt;div class=&#34;ex-markdown-inner&#34;&gt; Existing methods (e.g., MetaQuery) perform &lt;mark&gt;alignment via learnable queries&lt;/mark&gt;, but suffer from poor task generalization: they require early-stage retraining for significantly different task types.&lt;/div&gt;
                &lt;/div&gt;
            &lt;/div&gt;
            &lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;💡&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Idea:&lt;/b&gt; Probabilistic Expert Bridge (from Bagel) samples Noisy Query Tokens.&lt;/div&gt;
            &lt;/div&gt;

            &lt;div class=&#34;ex-row ex-sol-box&#34;&gt;
                &lt;span class=&#34;ex-icon&#34;&gt;🛠️&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;
                    &lt;b&gt;Solution:&lt;/b&gt;
                    &lt;div class=&#34;ex-markdown-inner&#34;&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Noisy Query Tokens:&lt;/strong&gt; Sample tokens from the standard normal distribution $N(0, I)$ at each training step to learn a robust distributed intermediate representation space instead of task-specific features (see the sketch after this card).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Probabilistic Expert Bridge:&lt;/strong&gt; Freeze VLM core parameters, add a parallel generative pathway, follow the division of labor (VLM for understanding, Diffusion Model for generation), and use Position MLP for feature alignment and spatial cue injection.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;VAE Branch:&lt;/strong&gt; Inject fine-grained VAE features into the VLM via a linear projection layer to fuse high-level semantics and low-level visual details, reducing the Diffusion Model&#39;s burden.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Progressive Training:&lt;/strong&gt; Adopt a four-stage curriculum training strategy, flexibly switch between contrastive/conditional flow matching loss, and gradually upgrade resolution and task complexity.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
                &lt;/div&gt;
            &lt;/div&gt;

            &lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;🏆&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Results:&lt;/b&gt; Though the performance is not SOTA, it alleviates the task-generalization collapse of UMMs, facilitates stable cross-task continual learning, and retains fine-grained image details.&lt;/div&gt;
            &lt;/div&gt;

            

            
            
        &lt;/div&gt;
    &lt;/div&gt;
&lt;/details&gt;
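&lt;p&gt;A minimal PyTorch-style sketch of the noisy query token idea summarized in the card above (illustrative only; the class name, dimensions, and the VLM call signature are assumptions, not the released implementation):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import torch
import torch.nn as nn

class NoisyQueryBridge(nn.Module):
    # Sketch: a frozen VLM consumes freshly sampled noisy query tokens, and a
    # small trainable MLP maps their hidden states into the conditioning space
    # expected by the diffusion decoder.
    def __init__(self, vlm, num_queries=64, dim=1024, cond_dim=768):
        super().__init__()
        self.vlm = vlm.eval()                      # frozen understanding core
        for p in self.vlm.parameters():
            p.requires_grad_(False)
        self.num_queries, self.dim = num_queries, dim
        self.bridge = nn.Sequential(               # trainable generative pathway
            nn.Linear(dim, cond_dim), nn.GELU(), nn.Linear(cond_dim, cond_dim)
        )

    def forward(self, text_tokens):
        b = text_tokens.shape[0]
        # Re-sample query tokens from N(0, I) at every step so the bridge learns
        # a distributed intermediate representation, not task-specific queries.
        queries = torch.randn(b, self.num_queries, self.dim, device=text_tokens.device)
        hidden = self.vlm(text_tokens, extra_tokens=queries)   # assumed interface
        return self.bridge(hidden[:, -self.num_queries:])      # diffusion condition
&lt;/code&gt;&lt;/pre&gt;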

&lt;figure &gt;
    &lt;img src=&#34;1_WeMMU.png&#34; alt=&#34;WeMMU&#34; /&gt;&lt;figcaption&gt;
        &lt;span class=&#34;auto-fig-title&#34;&gt;WeMMU&lt;/span&gt;
    &lt;/figcaption&gt;&lt;/figure&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>LoRA Variants Surveys</title>
      <link>https://milknocandy.github.io/posts/2026-01-16-lora/</link>
      <pubDate>Fri, 16 Jan 2026 00:09:30 +0800</pubDate>
      <guid>https://milknocandy.github.io/posts/2026-01-16-lora/</guid>
      <description>&lt;h2 id=&#34;1-timeline-order&#34;&gt;1 Timeline Order&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;Summarize the literature reviewed in chronological order.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;h3 id=&#34;2023&#34;&gt;2023&lt;/h3&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;📝【&lt;em&gt;&lt;strong&gt;EMNLP 2023 - Main&lt;/strong&gt;&lt;/em&gt;】- Sparse Low-rank Adaptation of Pre-trained Language Models (&lt;em&gt;Tsinghua University, The University of Chicago&lt;/em&gt;)&lt;/p&gt;
&lt;div class=&#34;highlight-box default&#34;&gt;
    &lt;div class=&#34;box-content&#34;&gt;
        &lt;p&gt;&lt;strong&gt;Subject:&lt;/strong&gt; Adaptive Rank Selection&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Standard LoRA uses a fixed, inflexible rank (hyperparameter $r$), requiring expensive manual tuning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Core Idea:&lt;/strong&gt; Make the rank learnable rather than fixed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mechanism:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Gating:&lt;/strong&gt; Introduces an optimizable gating unit between the low-rank matrices.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Optimization:&lt;/strong&gt; Uses proximal gradient methods to update the gates (a minimal sketch follows below the box).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dynamics:&lt;/strong&gt; Prunes less important ranks during training automatically.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Result:&lt;/strong&gt; Eliminates discrete rank search; the model discovers its own optimal rank structure.&lt;/li&gt;
&lt;/ul&gt;
    &lt;/div&gt;
&lt;/div&gt;
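&lt;p&gt;A minimal sketch of the gate-plus-proximal-step mechanism summarized above (assuming a soft-thresholding proximal update for an L1-penalized gate; names and defaults are illustrative, not the official SoRA code):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import torch
import torch.nn as nn

class SparseLoRALinear(nn.Module):
    # LoRA update of the form B diag(g) A x, where the gate vector g sits
    # between the low-rank matrices and is shrunk toward zero by a proximal
    # (soft-thresholding) step; gates that reach exactly zero prune their rank.
    def __init__(self, d_in, d_out, r=16, alpha=16.0):
        super().__init__()
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.gate = nn.Parameter(torch.ones(r))
        self.scale = alpha / r

    def forward(self, x, base_out):
        # base_out: output of the frozen pretrained linear layer for x
        return base_out + self.scale * ((x @ self.A.t()) * self.gate) @ self.B.t()

    @torch.no_grad()
    def proximal_step(self, lr, lam):
        # soft-thresholding applied to the gate after the usual gradient update
        g = self.gate
        self.gate.copy_(torch.sign(g) * torch.clamp(g.abs() - lr * lam, min=0.0))
&lt;/code&gt;&lt;/pre&gt;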
&lt;p&gt;
&lt;figure &gt;
    &lt;img src=&#34;1-sora.png&#34; alt=&#34;SoRA&#34; /&gt;&lt;figcaption&gt;
        &lt;span class=&#34;auto-fig-title&#34;&gt;SoRA&lt;/span&gt;
    &lt;/figcaption&gt;&lt;/figure&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Designing BERT for Convolutional Networks</title>
      <link>https://milknocandy.github.io/posts/2025-08-28-spark/</link>
      <pubDate>Thu, 28 Aug 2025 20:47:43 +0800</pubDate>
      <guid>https://milknocandy.github.io/posts/2025-08-28-spark/</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;SparK: &lt;a href=&#34;https://github.com/keyu-tian/SparK&#34;&gt;Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling&lt;/a&gt; (ICLR 2023 Spotlight)&lt;/p&gt;
&lt;p&gt;Video introduction: &lt;a href=&#34;https://www.bilibili.com/video/BV11s4y1M7qL/&#34;&gt;https://www.bilibili.com/video/BV11s4y1M7qL/&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;BERT masks part of the data and has the model predict it, which yields a self-supervised learning signal. Works such as MAE transfer this idea to Vision Transformers in the image domain, but directly swapping the Transformer for a convolutional network runs into problems. In the figure below, zero-outing denotes this direct replacement:&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
    &lt;img src=&#34;fig1.png&#34; alt=&#34;&#34; /&gt;&lt;/figure&gt;&lt;/p&gt;
&lt;p&gt;Zero-outing brings only a 0.1-point improvement, which is essentially no gain at all. The authors&#39; analysis follows.&lt;/p&gt;
&lt;h2 id=&#34;为什么失败&#34;&gt;Why does it fail?&lt;/h2&gt;
&lt;h3 id=&#34;问题1pixel-intensity-distribution-shift&#34;&gt;Problem 1: Pixel Intensity Distribution Shift&lt;/h3&gt;
&lt;p&gt;When a Transformer processes patches, as long as patches are dropped at random, the remaining input keeps the same pixel distribution as the original image. A convolutional network, however, cannot simply drop pixels; it can only zero out (blacken) some pixels to simulate the loss of that information.&lt;/p&gt;
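&lt;p&gt;A quick NumPy illustration of this distribution shift (a toy check, not an experiment from the paper):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np

rng = np.random.default_rng(0)
img = rng.normal(0.5, 0.15, size=(224, 224)).clip(0, 1)  # toy grayscale image
mask = rng.random(img.shape) &lt; 0.6                        # mask 60 percent of pixels

zeroed = np.where(mask, 0.0, img)   # what a plain CNN sees after zero-outing
kept = img[~mask]                   # what a Transformer / sparse CNN still sees

# The zeroed image collapses toward 0, while the kept pixels keep the
# original statistics, i.e. the input distribution is not shifted.
print(img.mean(), zeroed.mean(), kept.mean())
&lt;/code&gt;&lt;/pre&gt;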
&lt;p&gt;
&lt;figure &gt;
    &lt;img src=&#34;fig2.png&#34; alt=&#34;Pixel distributions. The horizontal axis is pixel intensity; the vertical axis is the frequency of that intensity&#34; /&gt;&lt;figcaption&gt;
        &lt;span class=&#34;auto-fig-title&#34;&gt;Pixel distributions. The horizontal axis is pixel intensity; the vertical axis is the frequency of that intensity&lt;/span&gt;
    &lt;/figcaption&gt;&lt;/figure&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Self-supervised Object-Centric Learning for Videos</title>
      <link>https://milknocandy.github.io/posts/2023-12-10-cutler/</link>
      <pubDate>Sun, 10 Dec 2023 11:35:33 +0800</pubDate>
      <guid>https://milknocandy.github.io/posts/2023-12-10-cutler/</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Venue: &lt;a href=&#34;https://openreview.net/group?id=NeurIPS.cc/2023/Conference&#34;&gt;NeurIPS 2023&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Paper: &lt;a href=&#34;http://arxiv.org/abs/2310.06907&#34;&gt;http://arxiv.org/abs/2310.06907&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Code: ❌&lt;/p&gt;
&lt;p&gt;Author homepage (second author, Weidi Xie): &lt;a href=&#34;https://weidixie.github.io/&#34;&gt;https://weidixie.github.io/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Project page: &lt;a href=&#34;https://kuis-ai.github.io/solv/&#34;&gt;https://kuis-ai.github.io/solv/&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id=&#34;介绍&#34;&gt;Introduction&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Background&lt;/strong&gt;: &lt;u&gt;Unsupervised multi-object segmentation&lt;/u&gt; has shown remarkable results by leveraging the strong semantic features learned during self-supervised pre-training. Segmentation of video sequences is usually further strengthened by adding extra modalities (e.g., depth, motion). However, the gains observed on &lt;i&gt;synthetic sequences&lt;/i&gt; &lt;u&gt;depend&lt;/u&gt; on the robustness of this extra information and do not transfer to more challenging real-world scenes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Task&lt;/strong&gt;: Given a video sequence of a complex scene, the goal is to train a visual system that can &lt;u&gt;discover, track, and segment&lt;/u&gt; the objects in each frame, abstracting millions of pixels of visual information into &lt;i&gt;semantic parts&lt;/i&gt; (object-centric visual representation learning).&lt;/p&gt;
&lt;figure class=&#34;main-figure&#34;&gt;
  &lt;div class=&#34;side-by-side-wrapper grid-layout&#34;&gt;
    &lt;div class=&#34;side-item&#34; style=&#34;--w: 45%&#34;&gt;
      &lt;img src=&#34;1.gif&#34;&gt;
      &lt;p&gt;(a) Ground Truth&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class=&#34;side-item&#34; style=&#34;--w: 45%&#34;&gt;
      &lt;img src=&#34;1-2.gif&#34;&gt;
      &lt;p&gt;(b) Prediction&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/figure&gt;
&lt;p&gt;&lt;strong&gt;Progress in the field&lt;/strong&gt;: Starting from &lt;i&gt;synthetic images&lt;/i&gt;, the field has moved toward &lt;u&gt;in-the-wild&lt;/u&gt; images and &lt;u&gt;real-world&lt;/u&gt; videos. Existing methods typically adopt an autoencoder training paradigm (e.g., reconstructing the input signal in the hope that data- or structure-based priors will group &lt;u&gt;region pixels&lt;/u&gt; into semantically meaningful objects).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For images: &lt;u&gt;low-level features&lt;/u&gt; from pre-trained models (e.g., color, semantic features) are used to decide the assignment of pixels to objects&lt;/li&gt;
&lt;li&gt;For videos: extra modalities or signals (e.g., optical flow, depth maps) are usually incorporated, so segmentation masks can be derived directly from discontinuities&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;提出问题&#34;&gt;Problem Statement&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Problems with extra information&lt;/strong&gt;: Using additional signals for video increases &lt;strong&gt;computational cost&lt;/strong&gt; and &lt;strong&gt;error accumulation&lt;/strong&gt;. For example, optical flow can struggle with &lt;u&gt;static or deformable&lt;/u&gt; objects and with &lt;u&gt;large displacements&lt;/u&gt; between frames, while depth values may be hard to obtain for ordinary videos and their estimation degrades in &lt;u&gt;low-light&lt;/u&gt; or &lt;u&gt;low-contrast&lt;/u&gt; environments.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Over-segmentation&lt;/strong&gt;: Because visual scenes are complex, using a fixed number of &lt;u&gt;slots&lt;/u&gt; can lead to an over-segmentation issue.&lt;/p&gt;
&lt;h3 id=&#34;解决问题&#34;&gt;Proposed Solution&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;The authors&#39; method&lt;/strong&gt;: The &lt;strong&gt;first&lt;/strong&gt; fully unsupervised method for &lt;u&gt;multi-object segmentation on real-world sequences&lt;/u&gt;. SOLV discovers multiple objects in real-world video sequences without extra modality information or any weak-supervision-like cues (such as first-frame initialization).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Approach&lt;/strong&gt;: axial spatial-temporal slot attention&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;First, group spatial regions within each frame&lt;/li&gt;
&lt;li&gt;Then, enrich the slot representations through interactions with neighboring frames&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Training strategy&lt;/strong&gt;: the masked autoencoder (MAE) training paradigm, which has two advantages:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;It acts as an information bottleneck: the model only observes part of the input, forcing it to learn high-level semantic structure.&lt;/li&gt;
&lt;li&gt;It eases memory constraints and improves computational efficiency.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To address the &lt;strong&gt;over-segmentation&lt;/strong&gt; issue, the authors merge similar slots with a simple clustering algorithm.&lt;/p&gt;
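&lt;p&gt;A minimal sketch of merging similar slots via greedy clustering on cosine similarity (one plausible reading of &#34;simple clustering algorithm&#34;, not necessarily the exact SOLV procedure):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import torch
import torch.nn.functional as F

def merge_slots(slots, threshold=0.9):
    # slots: (num_slots, dim). Greedily group slots whose cosine similarity
    # exceeds the threshold, then average each group into a single slot.
    normed = F.normalize(slots, dim=-1)
    sim = normed @ normed.t()
    groups, assigned = [], set()
    for i in range(slots.shape[0]):
        if i in assigned:
            continue
        group = [i]
        assigned.add(i)
        for j in range(i + 1, slots.shape[0]):
            if j not in assigned and sim[i, j] &gt; threshold:
                group.append(j)
                assigned.add(j)
        groups.append(group)
    return torch.stack([slots[g].mean(dim=0) for g in groups])

merged = merge_slots(torch.randn(8, 128))  # at most 8 slots remain
&lt;/code&gt;&lt;/pre&gt;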
&lt;p&gt;In summary, the contributions are as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A self-supervised multi-object segmentation model for real-world videos that uses axial spatial-temporal slot attention to effectively group visual regions with similar properties, without relying on &lt;u&gt;additional signals&lt;/u&gt;.&lt;/li&gt;
&lt;li&gt;An object-centric learning scheme based on masked feature reconstruction, together with a slot-merging method.&lt;/li&gt;
&lt;li&gt;State-of-the-art results on the MOVi-E and YouTube-VIS 2019 datasets, and competitive performance on DAVIS 2017.&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;A slot corresponds to an individual object in the video; see the figure below.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;
&lt;figure &gt;
    &lt;img src=&#34;2.png&#34; alt=&#34;Source from: Conditional object-centric learning from video&#34; /&gt;&lt;figcaption&gt;
        &lt;span class=&#34;auto-fig-title&#34;&gt;Source from: Conditional object-centric learning from video&lt;/span&gt;
    &lt;/figcaption&gt;&lt;/figure&gt;&lt;/p&gt;
&lt;h3 id=&#34;相关工作&#34;&gt;Related Work&lt;/h3&gt;
&lt;h4 id=&#34;object-centric-learning&#34;&gt;Object-centric Learning&lt;/h4&gt;
&lt;p&gt;Several approaches exist for unsupervised object-centric representation learning on images and videos:&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
