Hi there 👋

Welcome to my blog

When VLMs Become Cognitive Mimics, Not Physical Reasoners: A QuantiPhy Study

TOPIC: Quantitative Physical Understanding
WHY READ: Exposes that top VLMs guess physical quantities from memory (pre-trained world knowledge) rather than measuring them from video, with rigorous tests to diagnose this failure.
TAKEAWAY: Current VLMs are cognitive mimics, not physical reasoners, so build systems that arbitrate between perception and memory rather than forcing pure end-to-end inference. (Context Learning, Agentic AI)
Stanford University, UST · 📄 Paper · 💻 Code · 🌐 Project · 👤 Author

🚀 1 Motivation & Problem
Humans understand the physical world through structured mathematical abstractions. From Isaac Newton's formulation of universal gravitation, inspired by a falling apple, to modern physics, quantitative laws enable precise reasoning about the dynamics of the real world. In contrast, although state-of-the-art AI systems demonstrate remarkable capabilities in mathematical reasoning, programming, and scientific writing, grounding an AI's understanding in the physical world remains a fundamental, unresolved challenge. This limitation poses a critical barrier to deploying AI systems in real-world, embodied environments. ...

Date: Mar. 23, 2026 | Total: 2336 words | Author: PaperMoon | Last Modified: Apr. 5, 2026

Spatial Intelligence in Large Models: Benchmarks, Mechanisms, and Reasoning

1 Benchmark
1.1 Textual Benchmarks
Arxiv 2026 — Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions 🔻
🏛️ Beijing Institute of Technology · 🏛️ BUCT · 👤 Author · 📄 Paper · 💻 Code
🏷️ Subject: Textual spatial reasoning benchmark for evaluating intrinsic LLM spatial intelligence
❓ Problem: Perception–reasoning entanglement in VLM benchmarks; lack of high-fidelity text-only spatial tasks; over-reliance on language priors and pattern matching; weak evaluation of global consistency and mental mapping.
💡 Idea: Convert visual scenes into coordinate-aware text to isolate and test symbolic spatial reasoning in LLMs.
🛠️ Solution:
- SiT-Bench: 3.8K QA pairs across 5 categories and 17 subtasks for spatial cognition
- Textual Encoding: multi-view scenes → coordinate-aware descriptions enabling symbolic reasoning
- Dual Construction: image-based generation + vision-benchmark-to-text adaptation
- R1 Filtering: reasoning-based filtering removes trivial, inconsistent, and leakage samples
- Evaluation Protocol: compare LLMs/VLMs with and without CoT to isolate reasoning ability
🏆 Results: Best model reaches 59.46% vs. 74.42% for humans, with a large gap on global tasks (<10% on mapping). CoT significantly improves performance, validating latent but underutilized spatial reasoning.
Example of SiT Benchmark ...
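The "Textual Encoding" step above can be illustrated with a toy example: a scene is serialized into coordinate-aware text so the question becomes answerable by symbolic reasoning alone, with no pixels involved. The schema and object names below are illustrative assumptions, not SiT-Bench's actual format.

```python
import math

# Toy scene: object name -> (x, y) coordinates (hypothetical values).
scene = {"chair": (1.0, 0.0), "table": (1.0, 2.0), "lamp": (4.0, 2.0)}

# Serialize into a coordinate-aware textual description plus a question.
lines = [f"{name} is at (x={x}, y={y})" for name, (x, y) in scene.items()]
prompt = "Scene: " + "; ".join(lines) + ". Which object is closest to the table?"

# The ground-truth answer is derivable purely from the coordinates,
# which is what lets a benchmark isolate reasoning from perception.
others = {k: v for k, v in scene.items() if k != "table"}
answer = min(others, key=lambda k: math.dist(others[k], scene["table"]))
```

Because the answer is computed from the coordinates, any remaining model error must come from reasoning, not from visual perception.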

Date: Mar. 19, 2026 | Total: 594 words | Author: PaperMoon | Last Modified: Apr. 5, 2026

The Evolution of Unified Multimodal Models

1 Timeline Order
Summarize the literature reviewed in chronological order.
2026
Arxiv 2026 — WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens 🔻
🏛️ MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition · 🏛️ University of Science and Technology of China · 🏛️ Zhejiang University · 🏛️ The Hong Kong University of Science and Technology · 📄 Paper
🏷️ Subject: Bridging pre-trained VLMs and Diffusion Models for UMMs
❓ Problem: Existing methods (e.g., MetaQuery) perform alignment via learnable queries but suffer from poor task generalization: they require early-stage retraining for significantly different task types.
💡 Idea: A Probabilistic Expert Bridge (following Bagel) samples Noisy Query Tokens.
🛠️ Solution:
- Noisy Query Tokens: sample tokens from the standard normal distribution $N(0, I)$ at each training step to learn a robust, distributed intermediate representation space instead of task-specific features.
- Probabilistic Expert Bridge: freeze the VLM's core parameters, add a parallel generative pathway, follow the division of labor (VLM for understanding, Diffusion Model for generation), and use a Position MLP for feature alignment and spatial cue injection.
- VAE Branch: inject fine-grained VAE features into the VLM via a linear projection layer to fuse high-level semantics and low-level visual details, reducing the Diffusion Model's burden.
- Progressive Training: adopt a four-stage curriculum training strategy, flexibly switch between contrastive and conditional flow-matching losses, and gradually increase resolution and task complexity.
🏆 Results: Although the performance is not SOTA, it alleviates task-generalization collapse in UMMs, facilitates stable cross-task continual learning, and retains fine-grained image details.
WeMMU ...
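A minimal sketch of the Noisy Query Tokens step described above, assuming a MetaQuery-style bridge shape of (batch, num_queries, dim). The function name and shapes are illustrative assumptions, not WeMMU's code; only the core idea — resampling from $N(0, I)$ each step instead of keeping a learnable query embedding — comes from the summary.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_noisy_queries(batch: int, num_queries: int, dim: int) -> np.ndarray:
    # Drawn fresh from the standard normal N(0, I) at every training step,
    # so no task-specific information can be stored in the queries and the
    # bridge must learn a distributed intermediate representation space.
    return rng.standard_normal((batch, num_queries, dim))

# Two consecutive "training steps" see different query tokens.
q_step1 = sample_noisy_queries(2, 64, 1024)
q_step2 = sample_noisy_queries(2, 64, 1024)
```

Contrast this with fixed learnable queries, which are identical at every step and can overfit to the training task distribution.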

Date: Mar. 7, 2026 | Total: 749 words | Author: PaperMoon | Last Modified: Mar. 23, 2026

LoRA Variants Surveys

1 Timeline Order
Summarize the literature reviewed in chronological order.
2023
📝【EMNLP 2023 - Main】Sparse Low-rank Adaptation of Pre-trained Language Models (Tsinghua University, The University of Chicago)
Subject: Adaptive rank selection
Problem: Standard LoRA uses a fixed, inflexible rank (hyperparameter $r$), requiring expensive manual tuning.
Core Idea: Make the rank learnable rather than fixed.
Mechanism:
- Gating: introduces an optimizable gating unit to the low-rank matrices.
- Optimization: uses proximal gradient methods to update the gates.
- Dynamics: automatically prunes less important ranks during training.
Result: Eliminates discrete rank search; the model discovers its own optimal rank structure.
SoRA ...
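The gating-plus-proximal-gradient mechanism above can be sketched in a few lines. This is a simplified illustration, not SoRA's implementation: the update $\Delta W = B\,\mathrm{diag}(g)\,A$ with an $L_1$ proximal (soft-threshold) step on the gate vector $g$ is the standard way such gates are sparsified, and all shapes and values below are made up for the example.

```python
import numpy as np

def soft_threshold(g: np.ndarray, lam: float) -> np.ndarray:
    # Proximal operator of the L1 penalty: shrinks each gate toward zero
    # and sets gates with |g| <= lam exactly to zero, pruning that rank.
    return np.sign(g) * np.maximum(np.abs(g) - lam, 0.0)

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 8, 4
B = rng.standard_normal((d_out, r))       # down/up low-rank factors
A = rng.standard_normal((r, d_in))
g = np.array([1.0, 0.3, 0.05, -0.02])     # one gate per rank (toy values)

g = soft_threshold(g, lam=0.1)            # proximal step prunes weak ranks
delta_W = B @ np.diag(g) @ A              # effective low-rank update
active_rank = int(np.count_nonzero(g))    # the learned rank after pruning
```

Here the two smallest gates fall below the threshold and are zeroed, so the effective rank of `delta_W` drops from 4 to 2 without any discrete rank search.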

Date: Jan. 16, 2026 | Total: 423 words | Author: PaperMoon | Last Modified: Mar. 23, 2026

Designing BERT for Convolutional Networks

SparK: Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling (ICLR 2023 Spotlight) Video introduction: https://www.bilibili.com/video/BV11s4y1M7qL/ The BERT recipe masks part of the input and trains a model to predict it, yielding a self-supervised learning signal. Transferring this to vision with Vision Transformers works (e.g., MAE), but directly replacing the Transformer with a convolutional network breaks down. In the figure below, "zero-outing" means directly replacing masked pixels with zeros: the gain is only 0.1 points, which is essentially no improvement at all. The author's analysis follows. Why does it fail? Problem 1: Pixel Intensity Distribution Shift. When a Transformer processes patches, it can simply drop random patches, so the pixel distribution of the remaining patches matches that of the original image. A convolutional network cannot drop pixels; it can only "black out" some pixels to simulate losing their information. [Figure: pixel intensity distribution — x-axis is pixel intensity, y-axis is pixel frequency] ...
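The distribution-shift argument above can be checked numerically: dropping pixels (the ViT/MAE route) leaves the intensity distribution of the visible pixels unchanged, while zeroing them (the naive CNN route) injects a spurious mass at zero and drags the global mean down. The toy image and mask ratio below are made-up values for illustration, not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.uniform(0.2, 0.8, size=(16, 16))   # toy image, intensities in [0.2, 0.8]
mask = rng.random((16, 16)) < 0.6            # mask ~60% of pixels

kept = img[~mask]                            # ViT-style: masked patches are dropped
zeroed = np.where(mask, 0.0, img)            # CNN-style: masked pixels are zeroed

# Dropping preserves the mean intensity of what the model sees;
# zero-outing shifts it sharply toward zero.
mean_orig, mean_kept, mean_zeroed = img.mean(), kept.mean(), zeroed.mean()
```

This is exactly the mismatch the figure illustrates: the zeroed image's pixel histogram gains a large spike at intensity 0 that never occurs in natural images, so the CNN's input statistics no longer match pre-training.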

Date: Aug. 28, 2025 | Total: 72 words | Author: PaperMoon | Last Modified: Mar. 23, 2026