The Evolution of Unified Multimodal Models

1 Timeline Order

Summarize the literature reviewed in chronological order.

2026

WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens (arXiv 2026)

🏛️ MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition · University of Science and Technology of China · Zhejiang University · The Hong Kong University of Science and Technology
📄 Paper
🏷️ Subject: Bridging pre-trained VLMs and Diffusion Models for UMMs
❓ Problem: Existing methods such as MetaQuery perform alignment via learnable queries but suffer from poor task generalization: they require retraining from an early stage for significantly different task types.
💡 Idea: A Probabilistic Expert Bridge (following Bagel) that samples Noisy Query Tokens.
🛠️ Solution (see the sketches after this entry):
- Noisy Query Tokens: sample query tokens from the standard normal distribution $N(0, I)$ at each training step, so the bridge learns a robust, distributed intermediate representation space instead of task-specific features.
- Probabilistic Expert Bridge: freeze the VLM core parameters, add a parallel generative pathway, keep the division of labor (VLM for understanding, Diffusion Model for generation), and use a Position MLP for feature alignment and spatial-cue injection.
- VAE Branch: inject fine-grained VAE features into the VLM via a linear projection layer to fuse high-level semantics with low-level visual details, reducing the Diffusion Model's burden.
- Progressive Training: adopt a four-stage curriculum, flexibly switching between contrastive and conditional flow-matching losses while gradually raising resolution and task complexity.
🏆 Results: Though the performance is not SOTA, WeMMU alleviates the task-generalization collapse of UMMs, enables stable cross-task continual learning, and retains fine-grained image details.
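A minimal sketch of the Noisy Query Token idea as described above: instead of MetaQuery-style fixed learnable queries, the queries are resampled from $N(0, I)$ at every step, and a Position MLP maps the frozen VLM's responses into the diffusion model's conditioning space. All class, dimension, and argument names here are assumptions, and the parallel generative expert inside the VLM is omitted for brevity:

```python
import torch
import torch.nn as nn

class NoisyQueryBridge(nn.Module):
    """Resamples query tokens from N(0, I) each step and projects the frozen
    VLM's hidden states into the diffusion model's conditioning space."""

    def __init__(self, num_queries: int, vlm_dim: int, diff_dim: int):
        super().__init__()
        self.num_queries = num_queries
        self.vlm_dim = vlm_dim
        # Position MLP (hypothetical shape): feature alignment plus
        # spatial-cue injection for the diffusion backbone.
        self.position_mlp = nn.Sequential(
            nn.Linear(vlm_dim, diff_dim),
            nn.GELU(),
            nn.Linear(diff_dim, diff_dim),
        )

    def forward(self, frozen_vlm: nn.Module, text_embeds: torch.Tensor) -> torch.Tensor:
        # text_embeds: (B, T, vlm_dim) from the understanding pathway.
        b = text_embeds.size(0)
        # Noisy Query Tokens: drawn fresh from N(0, I) every training step,
        # so no task-specific query parameters are ever learned.
        queries = torch.randn(b, self.num_queries, self.vlm_dim,
                              device=text_embeds.device)
        # The VLM core is frozen; only the bridge receives gradients.
        hidden = frozen_vlm(torch.cat([text_embeds, queries], dim=1))
        bridge_feats = hidden[:, -self.num_queries:, :]
        return self.position_mlp(bridge_feats)  # conditioning for generation
```

Because the queries carry no learned parameters, switching to a very different task never invalidates them, which is the claimed source of the improved task generalization.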
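The VAE branch is described only as a linear projection feeding VAE features into the VLM; a minimal sketch under that reading (class and dimension names are assumptions):

```python
import torch
import torch.nn as nn

class VAEBranch(nn.Module):
    """Projects fine-grained VAE latents into the VLM token space, fusing
    low-level visual detail with the VLM's high-level semantics."""

    def __init__(self, vae_latent_dim: int, vlm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vae_latent_dim, vlm_dim)  # the linear projection layer

    def forward(self, vae_latents: torch.Tensor) -> torch.Tensor:
        # (B, L, vae_latent_dim) flattened latent patches -> (B, L, vlm_dim)
        # tokens, to be concatenated with the VLM's input sequence.
        return self.proj(vae_latents)
```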
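The four-stage curriculum is only outlined in the notes above; the stage names, resolutions, and loss assignments below are illustrative assumptions. The conditional flow-matching loss follows the standard rectified-flow formulation, and the contrastive loss is a symmetric InfoNCE:

```python
import torch
import torch.nn.functional as F

# Illustrative four-stage curriculum; boundaries and resolutions are
# assumptions, not the paper's exact recipe.
STAGES = [
    {"name": "alignment",     "resolution": 256,  "loss": "contrastive"},
    {"name": "low-res gen",   "resolution": 256,  "loss": "flow_matching"},
    {"name": "high-res gen",  "resolution": 512,  "loss": "flow_matching"},
    {"name": "complex tasks", "resolution": 1024, "loss": "flow_matching"},
]

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch of paired embeddings.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

def conditional_flow_matching_loss(velocity_net, x1, cond):
    # Regress the velocity along the straight path from noise x0 to data x1.
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.size(0), device=x1.device).view(-1, *([1] * (x1.dim() - 1)))
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    return F.mse_loss(velocity_net(x_t, t.flatten(), cond), v_target)
```

Switching the loss per stage lets the early stage shape the shared representation space before the later stages commit capacity to generation at progressively higher resolutions.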

Date: Mar. 7, 2026 | Author: PaperMoon | Last Modified: Mar. 23, 2026