# Spatial Intelligence in Large Models: Benchmarks, Mechanisms, and Reasoning
## 1 Benchmark

### 1.1 Textual Benchmarks

**arXiv 2026 | Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions**

🏛️ Beijing Institute of Technology, BUCT | 👤 Author | 📄 Paper | 💻 Code

🏷️ **Subject:** Text-only spatial reasoning benchmark for evaluating the intrinsic spatial intelligence of LLMs.

❓ **Problem:**
- Perception and reasoning are entangled in existing VLM benchmarks
- Lack of high-fidelity text-only spatial tasks
- Over-reliance on language priors and pattern matching
- Weak evaluation of global consistency and mental mapping

💡 **Idea:** Convert visual scenes into coordinate-aware text to isolate and test symbolic spatial reasoning in LLMs.

🛠️ **Solution:**
- **SiT-Bench:** 3.8K QA pairs across 5 categories and 17 subtasks probing spatial cognition
- **Textual Encoding:** Multi-view scenes are converted into coordinate-aware descriptions that support symbolic reasoning (see the first sketch after this entry)
- **Dual Construction:** Image-based generation plus adaptation of existing vision benchmarks into text
- **R1 Filtering:** Reasoning-based filtering removes trivial, inconsistent, and leakage-prone samples
- **Evaluation Protocol:** Compare LLMs/VLMs with and without CoT prompting to isolate reasoning ability (see the second sketch after this entry)

🏆 **Results:** The best model scores 59.46% vs. 74.42% for humans, with the largest gap on global tasks (below 10% on mental mapping). CoT significantly improves performance, validating latent but underutilized spatial reasoning.

*Figure: Example of SiT Benchmark.*
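To make the **Textual Encoding** step concrete, here is a minimal sketch of how a scene with object coordinates could be rendered as coordinate-aware text and paired with a spatial question whose answer is checkable from the coordinates. Everything in it (the `SceneObject` class, the coordinate convention, the QA template) is a hypothetical illustration, not SiT-Bench's actual construction pipeline.

```python
# Minimal sketch of coordinate-aware textual encoding. The object names,
# coordinate convention (x = right, y = forward, z = up, in metres), and the
# QA template are hypothetical, not SiT-Bench's actual format.

from dataclasses import dataclass


@dataclass
class SceneObject:
    name: str
    x: float  # metres to the right of the reference viewpoint
    y: float  # metres in front of the reference viewpoint
    z: float  # metres above the floor


def encode_scene(objects: list[SceneObject]) -> str:
    """Render a scene as a coordinate-aware textual description."""
    lines = ["Scene (coordinates in metres, viewed from the origin):"]
    for obj in objects:
        lines.append(f"- {obj.name} at (x={obj.x:.1f}, y={obj.y:.1f}, z={obj.z:.1f})")
    return "\n".join(lines)


def relation_question(a: SceneObject, b: SceneObject) -> tuple[str, str]:
    """Build one left/right relation QA pair directly from the coordinates."""
    question = f"Is the {a.name} to the left or to the right of the {b.name}?"
    answer = "left" if a.x < b.x else "right"
    return question, answer


if __name__ == "__main__":
    scene = [SceneObject("lamp", -1.2, 2.0, 0.8), SceneObject("sofa", 0.5, 2.4, 0.4)]
    print(encode_scene(scene))
    q, a = relation_question(scene[0], scene[1])
    print(q, "->", a)  # -> "left": the lamp's x is smaller than the sofa's
```

Because the gold answer is derived from the same coordinates the model reads, this setup tests symbolic spatial reasoning without any visual perception in the loop.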
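Likewise, the with/without-CoT comparison in the **Evaluation Protocol** can be sketched as below. `query_model` is a hypothetical stand-in for whatever chat-completion API is used, and the prompt suffixes and answer matching are illustrative assumptions, not the benchmark's official scoring rules.

```python
# Sketch of the with/without-CoT comparison. `query_model`, the prompt
# suffixes, and the answer-matching rule are assumptions for illustration.

def query_model(model: str, prompt: str) -> str:
    """Hypothetical model call; replace with your provider's API."""
    raise NotImplementedError


COT_SUFFIX = "\nThink step by step, then give the final answer on the last line."
DIRECT_SUFFIX = "\nAnswer directly with the final answer only."


def evaluate(model: str, items: list[dict], use_cot: bool) -> float:
    """Accuracy of `model` on QA items, with or without a CoT instruction."""
    correct = 0
    for item in items:
        prompt = item["question"] + (COT_SUFFIX if use_cot else DIRECT_SUFFIX)
        reply = query_model(model, prompt)
        # Match the gold answer against the last line of the reply.
        if item["answer"].lower() in reply.strip().splitlines()[-1].lower():
            correct += 1
    return correct / len(items)


# Usage: the gap between the two scores on the same items isolates how much
# explicit reasoning (rather than direct pattern matching) contributes.
# direct_acc = evaluate("some-llm", sit_bench_items, use_cot=False)
# cot_acc    = evaluate("some-llm", sit_bench_items, use_cot=True)
```

Running the same model on the same items under both settings is what lets the benchmark attribute the reported CoT gains to reasoning rather than to differences in the question set.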