WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition
University of Science and Technology of China
Zhejiang University
The Hong Kong University of Science and Technology
Subject: Bridging Pre-trained VLMs and Diffusion Models for UMMs
❓
Problem:
Existing methods (e.g., MetaQuery) perform alignment via learnable queries but generalize poorly across tasks: significantly different task types require retraining from an early stage.
Noisy Query Tokens: Sample query tokens from the standard normal distribution $\mathcal{N}(0, I)$ at each training step, so the bridge learns a robust, distributed intermediate representation space instead of task-specific features.
Probabilistic Expert Bridge: Freeze the VLM core parameters, add a parallel generative pathway that follows the division of labor (VLM for understanding, Diffusion Model for generation), and use a Position MLP for feature alignment and spatial-cue injection.
VAE Branch: Inject fine-grained VAE features into the VLM via a linear projection layer to fuse high-level semantics and low-level visual details, reducing the Diffusion Model's burden.
Progressive Training: Adopt a four-stage curriculum training strategy, flexibly switch between contrastive/conditional flow matching loss, and gradually upgrade resolution and task complexity.
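A minimal sketch of the noisy-query idea, with numpy standing in for a tensor library and all shapes assumed for illustration: instead of reading queries from a learned embedding table, each training step draws them fresh from $\mathcal{N}(0, I)$, so only the bridge weights persist across steps.

```python
import numpy as np

def sample_noisy_queries(batch_size: int, num_queries: int, dim: int,
                         rng: np.random.Generator) -> np.ndarray:
    """Draw query tokens from N(0, I) anew at each training step, so the
    bridge must map a *distribution* of queries, not fixed task-specific
    embeddings, into the diffusion model's condition space."""
    return rng.standard_normal((batch_size, num_queries, dim))

rng = np.random.default_rng(0)
q1 = sample_noisy_queries(2, 64, 768, rng)  # step t
q2 = sample_noisy_queries(2, 64, 768, rng)  # step t+1: different queries
assert q1.shape == (2, 64, 768) and not np.allclose(q1, q2)
```

Because no query embedding is tied to any particular task, switching tasks does not invalidate the queries themselves, which is the intuition behind the improved cross-task continual learning.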
📊
Results: Though not SOTA in raw performance, WeMMU alleviates the task-generalization collapse of UMMs, supports stable cross-task continual learning, and retains fine-grained image details.
WeMMU (CVPR 2026)
UAE: Incentivizing Mutual Benefits for Unified Multimodal Understanding and Generation via RL
Subject: Bridging Pre-trained VLMs and Diffusion Models for UMMs
❓
Problem:
Image-to-text (I2T) and text-to-image (T2I) tasks are optimized independently, failing to leverage their inherent connection for mutual enhancement.
Joint training of existing UMMs leads to mutual degradation of understanding and generation capabilities, while decoupled training misses cross-task reciprocal benefits.
💡
Idea: Links I2T and T2I via an auto-encoder perspective (text as intermediate latent representation) + Unified-GRPO RL post-training with reconstructive rewards
🛠️
Solution:
Unified Auto-Encoder Paradigm: Define I2T as image-to-text semantic encoding and T2I as text-to-image decoding, taking semantic similarity between input and reconstructed images as the core optimization objective.
Unified-GRPO Post-Training Strategy: Adapt the strategy to two mainstream UMMs, freeze the visual modules so only the LLM is optimized, and use a frozen CLIP + generator as the reconstructive reward module.
Unified-Bench Evaluation Benchmark: Design dual protocols: compute a Unified-Score across four visual backbones and evaluate caption quality via a commercial LLM.
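Both the reconstructive reward and the Unified-Score reduce to embedding similarity between the input image and its reconstruction. A hedged sketch, with random vectors standing in for embeddings (in the paper these come from CLIP and three other frozen visual backbones; the exact aggregation is an assumption):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def unified_score(orig_embs, recon_embs) -> float:
    """Average cosine similarity over several frozen visual backbones,
    in the spirit of Unified-Bench (averaging scheme is assumed)."""
    return float(np.mean([cosine(o, r) for o, r in zip(orig_embs, recon_embs)]))

rng = np.random.default_rng(0)
origs = [rng.standard_normal(512) for _ in range(4)]  # four backbones
perfect = unified_score(origs, origs)  # identical reconstruction scores ~1
assert abs(perfect - 1.0) < 1e-6
```

Because the reward module (CLIP + generator) is frozen, the gradient pressure falls entirely on the LLM to produce captions informative enough to reconstruct the image, which is what drives the long, detail-rich text noted in the results.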
📊
Results: UAE achieves an overall Unified-Score of 86.09 on Unified-Bench, surpassing GPT-4o-Imageβs 85.95, and attains SOTA generation performance of 0.86 on GenEval and 0.475 on GenEval++. The core innovation of reconstructive reinforcement learning is fully validated, as it successfully drives the model to produce long, detail-rich text that indirectly enhances image perception, establishing a bidirectional synergistic mechanism.
Subject: Visual Instruction Tuning for Large Multimodal Models
❓
Problem:
LLM-centric Training Paradigm: Conventional visual instruction tuning for LMMs relies on vision-to-text alignment and text-only supervision.
Extrinsic Assistance: Previous vision-centric methods attach extra vision experts [1] at the encoder end to enrich crucial visual details for MLLMs, but they require careful manual selection of experts and result in a complex inference process.
Spatial Redundancy in Images: Visual signals have heavy spatial redundancy, making it hard to generate meaningful feedback from natural images.
💡
Idea: Reconstruct the latent visual tokens of input images with a denoiser to supervise the visual outputs of LMMs
🛠️
Solution:
Reconstruction Variant Design: Proposes three regression-based reconstruction variants: $\textbf{ROSS}^R\text{-Pixel}$ (regresses raw RGB pixel values via a patchify operation), $\textbf{ROSS}^R\text{-Latent}$ (regresses fine-grained latent tokens extracted by frozen teacher tokenizers VAE/DINOv2/DEiT-III), and $\textbf{ROSS}^R\text{-Latent2Pixel}$ (maps back to RGB pixel space for regression).
Training Objective:
How to reconstruct: Replaces vanilla regression with a per-token denoising objective to address visual spatial redundancy.
How to train: Trains the model with a joint loss of original textual next-token prediction and visual reconstructive denoising.
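A sketch of how such a joint objective could look, under assumed shapes and with a placeholder linear map standing in for the denoiser (all names here are illustrative, not the paper's API): noise each latent visual token, predict the clean token back, and add the resulting regression loss to the usual next-token loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def per_token_denoising_loss(clean: np.ndarray, W: np.ndarray,
                             noise_std: float = 0.5) -> float:
    """Noise every latent token independently, recover the clean token with
    a (placeholder) linear denoiser, and score the prediction with MSE.
    Denoising per token avoids the trivial solutions that spatial
    redundancy allows when plainly regressing natural images."""
    noised = clean + noise_std * rng.standard_normal(clean.shape)
    pred = noised @ W
    return float(np.mean((pred - clean) ** 2))

tokens = rng.standard_normal((196, 768))  # latent visual tokens (assumed)
W = np.eye(768)                           # identity "denoiser" baseline
l_denoise = per_token_denoising_loss(tokens, W)
l_text = 2.3                              # stand-in next-token CE loss
total = l_text + 1.0 * l_denoise          # joint objective; weight assumed
assert l_denoise > 0.0
```

With the identity baseline the loss is just the injected noise energy; a trained denoiser would have to exploit the token's content to push the loss below that floor, which is the feedback signal the text supervision alone does not provide.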
📊
Results: Reconstructive objectives significantly boost LMMs' fine-grained visual comprehension and reduce hallucinations, while generative objectives focus only on high-aesthetic image generation instead of text-image alignment and thus fail to improve multimodal comprehension.
References: [1] S. Tong et al., "Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs," in CVPR 2024.