<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Flash Card on PaperMoon&#39;s blog</title>
    <link>https://milknocandy.github.io/tags/flash-card/</link>
    <description>Recent content in Flash Card on PaperMoon&#39;s blog</description>
    <generator>Hugo -- 0.154.3</generator>
    <language>en</language>
    <lastBuildDate>Sun, 05 Apr 2026 15:49:51 +0800</lastBuildDate>
    <atom:link href="https://milknocandy.github.io/tags/flash-card/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Spatial Intelligence in Large Models: Benchmarks, Mechanisms, and Reasoning</title>
      <link>https://milknocandy.github.io/posts/2026-03-19-si/</link>
      <pubDate>Thu, 19 Mar 2026 11:15:09 +0800</pubDate>
      <guid>https://milknocandy.github.io/posts/2026-03-19-si/</guid>
      <description>&lt;h2 id=&#34;1-benchmark&#34;&gt;1 Benchmark&lt;/h2&gt;
&lt;h3 id=&#34;11-textual-benchmarks&#34;&gt;1.1 Textual Benchmarks&lt;/h3&gt;
&lt;p&gt;&lt;details class=&#34;paper-details-wrapper&#34;&gt;
    &lt;summary class=&#34;paper-summary&#34;&gt;
        &lt;div class=&#34;summary-inner&#34;&gt;
            

            
            
            
            

            &lt;span class=&#34;s-venue-dynamic v-arxiv-2026&#34;&gt;
                &lt;svg viewBox=&#34;0 0 24 24&#34; fill=&#34;none&#34; stroke=&#34;currentColor&#34; stroke-width=&#34;2&#34;
                stroke-linecap=&#34;round&#34; stroke-linejoin=&#34;round&#34; class=&#34;v-icon&#34;&gt;
                &lt;path d=&#34;M14.5 2H6a2 2 0 0 0-2 2v16a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2V7.5L14.5 2z&#34;&gt;&lt;/path&gt;
                &lt;polyline points=&#34;14 2 14 8 20 8&#34;&gt;&lt;/polyline&gt;
            &lt;/svg&gt;
                &lt;span class=&#34;v-text&#34;&gt;Arxiv 2026&lt;/span&gt;
            &lt;/span&gt;

            &lt;p class=&#34;s-title&#34;&gt;Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions&lt;/p&gt;
            &lt;span class=&#34;s-toggle-icon&#34;&gt;🔻&lt;/span&gt;
        &lt;/div&gt;
    &lt;/summary&gt;

    &lt;div class=&#34;paper-card-expanded&#34;&gt;
        &lt;div class=&#34;expand-action-bar&#34;&gt;
            &lt;div class=&#34;org-outer-container&#34;&gt;
                
                &lt;div class=&#34;org-group&#34;&gt;
                    
                    
                      &lt;span class=&#34;org-tag&#34;&gt;🏛️
                        Beijing Institute of Technology&lt;/span&gt;
                    
                    
                      &lt;span class=&#34;org-tag&#34;&gt;🏛️
                        BUCT&lt;/span&gt;
                    
                    
                &lt;/div&gt;
                
            &lt;/div&gt;

            &lt;div class=&#34;action-btns-fixed&#34;&gt;
                &lt;a href=&#34;https://binisalegend.github.io/&#34; target=&#34;_blank&#34; class=&#34;act-btn&#34;&gt;👤 Author&lt;/a&gt;
                &lt;a href=&#34;https://arxiv.org/abs/2601.03590&#34; target=&#34;_blank&#34; class=&#34;act-btn&#34;&gt;📄 Paper&lt;/a&gt;
                &lt;a href=&#34;https://github.com/binisalegend/SiT-Bench&#34; target=&#34;_blank&#34; class=&#34;act-btn&#34;&gt;💻 Code&lt;/a&gt;
                
            &lt;/div&gt;
        &lt;/div&gt;&lt;div class=&#34;expand-grid&#34;&gt;&lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;🏷️&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Subject:&lt;/b&gt; Textual spatial reasoning benchmark for intrinsic LLM spatial intelligence evaluation&lt;/div&gt;
            &lt;/div&gt;
            &lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;❓&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Problem:&lt;/b&gt;
                    &lt;div class=&#34;ex-markdown-inner&#34;&gt; &lt;ul&gt;
&lt;li&gt;Perception–reasoning entanglement in VLM benchmarks&lt;/li&gt;
&lt;li&gt;Lack of high-fidelity text-only spatial tasks&lt;/li&gt;
&lt;li&gt;Over-reliance on language priors/pattern matching&lt;/li&gt;
&lt;li&gt;Weak evaluation of global consistency, mental mapping&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
                &lt;/div&gt;
            &lt;/div&gt;
            &lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;💡&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Idea:&lt;/b&gt; Convert visual scenes into &lt;mark&gt;coordinate-aware text&lt;/mark&gt; to isolate and test &lt;mark&gt;symbolic spatial reasoning&lt;/mark&gt; in LLMs.&lt;/div&gt;
            &lt;/div&gt;

            &lt;div class=&#34;ex-row ex-sol-box&#34;&gt;
                &lt;span class=&#34;ex-icon&#34;&gt;🛠️&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;
                    &lt;b&gt;Solution:&lt;/b&gt;
                    &lt;div class=&#34;ex-markdown-inner&#34;&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SiT-Bench:&lt;/strong&gt; 3.8K QA pairs across 5 categories and 17 subtasks of spatial cognition&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Textual Encoding:&lt;/strong&gt; Multi-view scenes → coordinate-aware descriptions enabling symbolic reasoning&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dual Construction:&lt;/strong&gt; Image-based generation + vision-benchmark-to-text adaptation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;R1 Filtering:&lt;/strong&gt; Reasoning-based filtering removes trivial, inconsistent, and leakage-prone samples&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Evaluation Protocol:&lt;/strong&gt; Compare LLMs/VLMs with/without CoT to isolate reasoning ability&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
                &lt;/div&gt;
            &lt;/div&gt;

            &lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;🏆&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Results:&lt;/b&gt; The best model reaches 59.46% vs. 74.42% for humans, with a large gap on global tasks (&lt;10% on mapping). CoT significantly improves performance, indicating latent but underutilized spatial reasoning.&lt;/div&gt;
            &lt;/div&gt;

            

            
            
        &lt;/div&gt;
    &lt;/div&gt;
&lt;/details&gt;
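
&lt;p&gt;A minimal sketch of the coordinate-aware textual encoding described above is given below. It is an illustration under assumed details, not the authors&#39; pipeline: the object names, coordinates, prompt wording, and the &lt;code&gt;build_prompt&lt;/code&gt; helper are all invented. It only shows how a scene might be serialized into coordinate-aware text and queried with or without CoT, mirroring the evaluation protocol.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Hypothetical sketch: serialize a scene as coordinate-aware text for an LLM.
# Object names and coordinates are made up for illustration.

scene = [
    {&#39;name&#39;: &#39;sofa&#39;,  &#39;xyz&#39;: (1.0, 0.0, 0.0)},
    {&#39;name&#39;: &#39;lamp&#39;,  &#39;xyz&#39;: (1.0, 2.5, 0.0)},
    {&#39;name&#39;: &#39;table&#39;, &#39;xyz&#39;: (3.0, 1.0, 0.0)},
]

def encode_scene(objects):
    # One coordinate-aware sentence per object, so reasoning stays symbolic.
    lines = []
    for obj in objects:
        name = obj[&#39;name&#39;]
        x, y, z = obj[&#39;xyz&#39;]
        lines.append(f&#39;The {name} is at (x={x}, y={y}, z={z}) in meters.&#39;)
    return &#39;\n&#39;.join(lines)

def build_prompt(objects, question, use_cot=True):
    # With/without chain-of-thought, mirroring the evaluation protocol.
    suffix = &#39;Think step by step, then answer.&#39; if use_cot else &#39;Answer directly.&#39;
    return encode_scene(objects) + &#39;\n&#39; + question + &#39;\n&#39; + suffix

print(build_prompt(scene, &#39;Which object is closest to the lamp?&#39;))
&lt;/code&gt;&lt;/pre&gt;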

&lt;figure &gt;
    &lt;img src=&#34;1_Sample4SiT.png&#34; alt=&#34;Example of SiT Benchmark&#34; /&gt;&lt;figcaption&gt;
        &lt;span class=&#34;auto-fig-title&#34;&gt;Example of SiT Benchmark&lt;/span&gt;
    &lt;/figcaption&gt;&lt;/figure&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>The Evolution of Unified Multimodal Models</title>
      <link>https://milknocandy.github.io/posts/2026-03-07-umm/</link>
      <pubDate>Sat, 07 Mar 2026 14:51:21 +0800</pubDate>
      <guid>https://milknocandy.github.io/posts/2026-03-07-umm/</guid>
      <description>&lt;h2 id=&#34;1-timeline-order&#34;&gt;1 Timeline Order&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;Summarize the literature reviewed in chronological order.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;h3 id=&#34;2026&#34;&gt;2026&lt;/h3&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;details class=&#34;paper-details-wrapper&#34;&gt;
    &lt;summary class=&#34;paper-summary&#34;&gt;
        &lt;div class=&#34;summary-inner&#34;&gt;
            

            
            
            
            

            &lt;span class=&#34;s-venue-dynamic v-arxiv-2026&#34;&gt;
                &lt;svg viewBox=&#34;0 0 24 24&#34; fill=&#34;none&#34; stroke=&#34;currentColor&#34; stroke-width=&#34;2&#34;
                stroke-linecap=&#34;round&#34; stroke-linejoin=&#34;round&#34; class=&#34;v-icon&#34;&gt;
                &lt;path d=&#34;M14.5 2H6a2 2 0 0 0-2 2v16a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2V7.5L14.5 2z&#34;&gt;&lt;/path&gt;
                &lt;polyline points=&#34;14 2 14 8 20 8&#34;&gt;&lt;/polyline&gt;
            &lt;/svg&gt;
                &lt;span class=&#34;v-text&#34;&gt;Arxiv 2026&lt;/span&gt;
            &lt;/span&gt;

            &lt;p class=&#34;s-title&#34;&gt;WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens&lt;/p&gt;
            &lt;span class=&#34;s-toggle-icon&#34;&gt;🔻&lt;/span&gt;
        &lt;/div&gt;
    &lt;/summary&gt;

    &lt;div class=&#34;paper-card-expanded&#34;&gt;
        &lt;div class=&#34;expand-action-bar&#34;&gt;
            &lt;div class=&#34;org-outer-container&#34;&gt;
                
                &lt;div class=&#34;org-group&#34;&gt;
                    
                    
                      &lt;span class=&#34;org-tag&#34;&gt;🏛️
                        MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition&lt;/span&gt;
                    
                    
                      &lt;span class=&#34;org-tag&#34;&gt;🏛️
                        University of Science and Technology of China&lt;/span&gt;
                    
                    
                      &lt;span class=&#34;org-tag&#34;&gt;🏛️
                        Zhejiang University&lt;/span&gt;
                    
                    
                      &lt;span class=&#34;org-tag&#34;&gt;🏛️
                        The Hong Kong University of Science and Technology&lt;/span&gt;
                    
                    
                &lt;/div&gt;
                
            &lt;/div&gt;

            &lt;div class=&#34;action-btns-fixed&#34;&gt;
                
                &lt;a href=&#34;https://arxiv.org/abs/2512.02536&#34; target=&#34;_blank&#34; class=&#34;act-btn&#34;&gt;📄 Paper&lt;/a&gt;
                
                
            &lt;/div&gt;
        &lt;/div&gt;&lt;div class=&#34;expand-grid&#34;&gt;&lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;🏷️&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Subject:&lt;/b&gt; Bridging Pre-trained VLMs and Diffusion Models for UMMs&lt;/div&gt;
            &lt;/div&gt;
            &lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;❓&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Problem:&lt;/b&gt;
                    &lt;div class=&#34;ex-markdown-inner&#34;&gt; Existing methods such as MetaQuery perform &lt;mark&gt;alignment via learnable queries&lt;/mark&gt;, but they generalize poorly across tasks: significantly different task types require retraining from an early stage.&lt;/div&gt;
                &lt;/div&gt;
            &lt;/div&gt;
            &lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;💡&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Idea:&lt;/b&gt; A Probabilistic Expert Bridge (following Bagel) that samples Noisy Query Tokens instead of learning fixed queries.&lt;/div&gt;
            &lt;/div&gt;

            &lt;div class=&#34;ex-row ex-sol-box&#34;&gt;
                &lt;span class=&#34;ex-icon&#34;&gt;🛠️&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;
                    &lt;b&gt;Solution:&lt;/b&gt;
                    &lt;div class=&#34;ex-markdown-inner&#34;&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Noisy Query Tokens:&lt;/strong&gt; Sample tokens from the standard normal distribution $N(0, I)$ at each training step to learn a robust distributed intermediate representation space instead of task-specific features.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Probabilistic Expert Bridge:&lt;/strong&gt; Freeze VLM core parameters, add a parallel generative pathway, follow the division of labor (VLM for understanding, Diffusion Model for generation), and use Position MLP for feature alignment and spatial cue injection.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;VAE Branch:&lt;/strong&gt; Inject VAE fine-grained features into the VLM via a linear projection layer to fuse high-level semantics and low-level visual details, reducing the Diffusion Model&#39;s burden.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Progressive Training:&lt;/strong&gt; Adopt a four-stage curriculum training strategy, flexibly switch between contrastive/conditional flow matching loss, and gradually upgrade resolution and task complexity.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
                &lt;/div&gt;
            &lt;/div&gt;

            &lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;🏆&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Results:&lt;/b&gt; Though the performance is not SOTA, it alleviates the task-generalization collapse of UMMs, enables stable cross-task continual learning, and retains fine-grained image details.&lt;/div&gt;
            &lt;/div&gt;

            

            
            
        &lt;/div&gt;
    &lt;/div&gt;
&lt;/details&gt;
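
&lt;p&gt;A minimal sketch of the noisy-query idea, assuming PyTorch, is given below. It is not the paper&#39;s implementation: the module names (&lt;code&gt;NoisyQueryBridge&lt;/code&gt;, &lt;code&gt;position_mlp&lt;/code&gt;), layer counts, and dimensions are invented. It only shows the core move of drawing fresh query tokens from $N(0, I)$ at every forward pass while the pre-trained VLM stays frozen, then projecting the fused result into the diffusion model&#39;s conditioning space.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Hypothetical sketch (PyTorch) of a noisy-query bridge; names and sizes are illustrative.
import torch
import torch.nn as nn

class NoisyQueryBridge(nn.Module):
    def __init__(self, num_queries=64, dim=1024, cond_dim=768):
        super().__init__()
        self.num_queries = num_queries
        self.dim = dim
        # Lightweight bridge on top of a frozen VLM; only this pathway is trained.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.bridge = nn.TransformerEncoder(layer, num_layers=2)
        # Position MLP: injects spatial cues and maps to the diffusion
        # model&#39;s conditioning dimension.
        self.position_mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, cond_dim),
        )

    def forward(self, vlm_hidden):
        # vlm_hidden: (batch, seq, dim) hidden states from the frozen VLM.
        b = vlm_hidden.size(0)
        # Fresh noisy queries drawn from N(0, I) at every step, instead of
        # fixed learnable query embeddings.
        queries = torch.randn(b, self.num_queries, self.dim, device=vlm_hidden.device)
        fused = self.bridge(torch.cat([vlm_hidden, queries], dim=1))
        # Keep only the query positions as conditioning tokens for the diffusion model.
        return self.position_mlp(fused[:, -self.num_queries:])
&lt;/code&gt;&lt;/pre&gt;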

&lt;figure &gt;
    &lt;img src=&#34;1_WeMMU.png&#34; alt=&#34;WeMMU&#34; /&gt;&lt;figcaption&gt;
        &lt;span class=&#34;auto-fig-title&#34;&gt;WeMMU&lt;/span&gt;
    &lt;/figcaption&gt;&lt;/figure&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>LoRA Variants Surveys</title>
      <link>https://milknocandy.github.io/posts/2026-01-16-lora/</link>
      <pubDate>Fri, 16 Jan 2026 00:09:30 +0800</pubDate>
      <guid>https://milknocandy.github.io/posts/2026-01-16-lora/</guid>
      <description>&lt;h2 id=&#34;1-timeline-order&#34;&gt;1 Timeline Order&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;Summarize the literature reviewed in chronological order.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;h3 id=&#34;2023&#34;&gt;2023&lt;/h3&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;📝【&lt;em&gt;&lt;strong&gt;EMNLP 2023 - Main&lt;/strong&gt;&lt;/em&gt;】- Sparse Low-rank Adaptation of Pre-trained Language Models (&lt;em&gt;Tsinghua University, The University of Chicago&lt;/em&gt;)&lt;/p&gt;
&lt;div class=&#34;highlight-box default&#34;&gt;
    &lt;div class=&#34;box-content&#34;&gt;
        &lt;p&gt;&lt;strong&gt;Subject:&lt;/strong&gt; Adaptive Rank Selection&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Standard LoRA uses a fixed, inflexible rank (hyperparameter $r$), requiring expensive manual tuning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Core Idea:&lt;/strong&gt; Make the rank learnable rather than fixed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mechanism:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Gating:&lt;/strong&gt; Introduces an optimizable gating unit to the low-rank matrices.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Optimization:&lt;/strong&gt; Uses proximal gradient methods to update the gates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dynamics:&lt;/strong&gt; Prunes less important ranks during training automatically.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Result:&lt;/strong&gt; Eliminates discrete rank search; the model discovers its own optimal rank structure.&lt;/li&gt;
&lt;/ul&gt;
    &lt;/div&gt;
&lt;/div&gt;
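&lt;p&gt;A minimal sketch of the gated low-rank update and the proximal (soft-threshold) step, assuming PyTorch, is given below. It is illustrative rather than the authors&#39; code: the class name, shapes, and the sparsity strength &lt;code&gt;lam&lt;/code&gt; are assumptions. The point is that a per-rank gate scales each rank-1 component, and soft-thresholding after the gradient step drives unimportant gates to exactly zero, pruning those ranks without a discrete rank search.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Hypothetical sketch (PyTorch) of a SoRA-style gated low-rank adapter.
import torch
import torch.nn as nn

class GatedLoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=16):
        super().__init__()
        # Frozen pre-trained weight; only A, B, and the gate are trained.
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))
        self.gate = nn.Parameter(torch.ones(r))  # learnable per-rank gate

    def forward(self, x):
        # Each of the r components is scaled by its gate before projecting back up.
        delta = ((x @ self.A.t()) * self.gate) @ self.B.t()
        return x @ self.weight.t() + delta

def prox_step(gate, lr, lam):
    # Proximal (soft-threshold) update applied to the gate after the usual
    # gradient step; lam controls how aggressively ranks are zeroed out.
    with torch.no_grad():
        gate.copy_(torch.sign(gate) * torch.clamp(gate.abs() - lr * lam, min=0.0))
&lt;/code&gt;&lt;/pre&gt;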
&lt;p&gt;
&lt;figure &gt;
    &lt;img src=&#34;1-sora.png&#34; alt=&#34;SoRA&#34; /&gt;&lt;figcaption&gt;
        &lt;span class=&#34;auto-fig-title&#34;&gt;SoRA&lt;/span&gt;
    &lt;/figcaption&gt;&lt;/figure&gt;&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
