The Evolution of Unified Multimodal Models

Sat, 07 Mar 2026 14:51:21 +0800

1 Timeline Order

A summary of the reviewed literature in chronological order.

2026

arXiv 2026

WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens


🏛️ MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition 🏛️ University of Science and Technology of China 🏛️ Zhejiang University 🏛️ The Hong Kong University of Science and Technology

📄 Paper

🏷️

Subject: Bridging Pre-trained VLMs and Diffusion Models for UMMs

❓

Problem:

Existing methods (e.g., MetaQuery) perform alignment via learnable queries but suffer from poor task generalization. They must be retrained from the early stage when the task type changes significantly.

💡

Idea: Probabilistic Expert Bridge (from Bagel) samples Noisy Query Tokens.

🛠️

Solution:

Noisy Query Tokens: Sample tokens from the standard normal distribution $\mathcal{N}(0, I)$ at each training step, so the model learns a robust, distributed intermediate representation space instead of task-specific features.
Probabilistic Expert Bridge: Freeze the VLM core parameters, add a parallel generative pathway, keep the division of labor (VLM for understanding, diffusion model for generation), and use a Position MLP for feature alignment and spatial cue injection.
VAE Branch: Inject fine-grained VAE features into the VLM via a linear projection layer to fuse high-level semantics and low-level visual details, reducing the diffusion model's burden.
Progressive Training: Adopt a four-stage curriculum training strategy, switch flexibly between contrastive and conditional flow matching losses, and gradually increase resolution and task complexity.
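
To make the first and last items more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. The function names (`sample_noisy_query_tokens`, `flow_matching_pair`, `mse`), the token count, and the dimension are all hypothetical. The linear interpolation path is one common form of conditional flow matching, assumed here for illustration; the summary above does not say which form WeMMU uses.

```python
import random


def sample_noisy_query_tokens(num_tokens, dim, rng):
    """Draw a fresh (num_tokens x dim) set of query tokens from N(0, I).

    Unlike learnable queries, nothing is stored between calls:
    every training step sees a new random draw.
    """
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]


def flow_matching_pair(x0, x1, t):
    """Assumed linear-path flow matching.

    Returns the point x_t = (1 - t) * x0 + t * x1 on the path from
    noise x0 to data x1, and the regression target v = x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    v = [b - a for a, b in zip(x0, x1)]
    return x_t, v


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


rng = random.Random(0)  # seeded only to make the sketch reproducible

# Two training steps, two independent draws of query tokens.
queries_step1 = sample_noisy_query_tokens(4, 8, rng)
queries_step2 = sample_noisy_query_tokens(4, 8, rng)

# Toy flow matching example with 2-dimensional "latents".
x_t, v_target = flow_matching_pair([0.0, 0.0], [2.0, 4.0], 0.5)
loss_if_perfect = mse(v_target, v_target)  # a perfect prediction has zero loss
```

In a real system the sampled queries would pass through the VLM, and a network would predict the velocity from `x_t`, `t`, and the VLM-derived condition. The sketch only shows the sampling and the regression target.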

🏆

Results: Although the performance is not SOTA, the method alleviates the task generalization collapse of UMMs, facilitates stable cross-task continual learning, and retains fine-grained image details.


Unified Multimodal Models on PaperMoon's blog
