1 Timeline Order

Summarize the literature reviewed in chronological order.

  • 2026

Arxiv 2026

WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens

πŸ”»
πŸ›οΈ MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition πŸ›οΈ University of Science and Technology of China πŸ›οΈ Zhejiang University πŸ›οΈ The Hong Kong University of Science and Technology
🏷️
Subject: Bridging Pre-trained VLMs and Diffusion Models for UMMs
❓
Problem:
Existing methods (e.g., MetaQuery) perform alignment via learnable queries but suffer from poor task generalization: they require early-stage retraining for significantly different task types.
πŸ’‘
Idea: Probabilistic Expert Bridge (from Bagel) samples Noisy Query Tokens.
πŸ› οΈ
Solution:
  • Noisy Query Tokens: Sample tokens from the standard normal distribution $N(0, I)$ at each training step to learn a robust distributed intermediate representation space instead of task-specific features.
  • Probabilistic Expert Bridge: Freeze VLM core parameters, add a parallel generative pathway, follow the division of labor (VLM for understanding, Diffusion Model for generation), and use Position MLP for feature alignment and spatial cue injection.
  • VAE Branch: Inject VAE fine-grained features into the VLM via a linear projection layer to fuse high-level semantics and low-level visual details, reducing the Diffusion Model's burden.
  • Progressive Training: Adopt a four-stage curriculum training strategy, flexibly switch between contrastive/conditional flow matching loss, and gradually upgrade resolution and task complexity.
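As a minimal sketch, the Noisy Query Tokens step amounts to resampling the bridge's query tokens from $N(0, I)$ at every training step. The function and shape names below are illustrative assumptions, not the paper's code:

```python
import numpy as np

def sample_noisy_query_tokens(batch_size, num_queries, dim, rng=None):
    """Draw query tokens from the standard normal N(0, I).

    Resampling fresh noise at every training step (instead of reusing
    fixed learnable queries) pushes the bridge toward a distributed
    intermediate representation space rather than task-specific features.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    return rng.standard_normal((batch_size, num_queries, dim)).astype(np.float32)

queries = sample_noisy_query_tokens(batch_size=2, num_queries=64, dim=1024)
print(queries.shape)  # (2, 64, 1024)
```

These tokens would then be fed through the Probabilistic Expert Bridge and Position MLP before conditioning the diffusion model.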
πŸ†
Results: Though the performance is not SOTA, it alleviates the task-generalization collapse of UMMs, enables stable cross-task continual learning, and retains fine-grained image details.
WeMMU

CVPR 2026

UAE: Incentivizing Mutual Benefits for Unified Multimodal Understanding and Generation via RL

πŸ”»
πŸ›οΈ Peking University πŸ›οΈ Baidu πŸ›οΈ Rabbitpre AI πŸ›οΈ SYSU πŸ›οΈ USTC πŸ›οΈ CASIA
🏷️
Subject: Bridging Pre-trained VLMs and Diffusion Models for UMMs
❓
Problem:
  • Image-to-text (I2T) and text-to-image (T2I) tasks are optimized independently, failing to leverage their inherent connection for mutual enhancement.
  • Joint training of existing UMMs leads to mutual degradation of understanding and generation capabilities, while decoupled training misses cross-task reciprocal benefits.
πŸ’‘
Idea: Link I2T and T2I via an auto-encoder perspective (text as the intermediate latent representation) + Unified-GRPO RL post-training with reconstructive rewards
πŸ› οΈ
Solution:
  • Unified Auto-Encoder Paradigm: Define I2T as image-to-text semantic encoding and T2I as text-to-image decoding, taking semantic similarity between input and reconstructed images as the core optimization objective.
  • Unified-GRPO Post-Training Strategy: Adapt to two mainstream UMMs, freeze visual modules to only optimize LLMs, and adopt CLIP+generator as a frozen reconstructive reward module.
  • Unified-Bench Evaluation Benchmark: Design two evaluation protocols: compute the Unified-Score through four visual backbones, and evaluate caption quality via a commercial LLM.
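The reconstructive reward can be sketched as a cosine similarity between embeddings of the input image and the image regenerated from the model's caption. In UAE this comes from a frozen CLIP + generator module; the helper below is an illustrative assumption operating on bare vectors:

```python
import numpy as np

def reconstructive_reward(emb_input, emb_recon, eps=1e-8):
    """Cosine similarity between the embedding of the original image and
    that of the image reconstructed from the model's caption (the
    I2T -> T2I round trip). Embeddings would come from a frozen vision
    backbone such as CLIP; here they are plain vectors."""
    a = np.asarray(emb_input, dtype=np.float64)
    b = np.asarray(emb_recon, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

# A near-perfect reconstruction scores close to 1.
v = np.ones(512)
print(round(reconstructive_reward(v, v), 4))  # 1.0
```

A higher reward requires the caption to preserve enough detail for faithful regeneration, which is what drives the long, detail-rich text described in the results.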
πŸ†
Results: UAE achieves an overall Unified-Score of 86.09 on Unified-Bench, surpassing GPT-4o-Image’s 85.95, and attains SOTA generation performance of 0.86 on GenEval and 0.475 on GenEval++. The core innovation of reconstructive reinforcement learning is fully validated, as it successfully drives the model to produce long, detail-rich text that indirectly enhances image perception, establishing a bidirectional synergistic mechanism.
The workflow of RAE

  • 2025

ICLR 2025

Reconstructive Visual Instruction Tuning

πŸ”»
πŸ›οΈ CASIA πŸ›οΈ University of Hongkong πŸ›οΈ MEGVII Tech. πŸ›οΈ StepFun
🏷️
Subject: Visual Instruction Tuning for Large Multimodal Models
❓
Problem:
  • LLM-centric Training Paradigm: Conventional visual instruction tuning for LMMs relies on vision-to-text alignment and text-only supervision.
  • Extrinsic Assistance: Previous vision-centric methods leverage extra vision experts [1] at the encoder end to enrich crucial visual details for MLLMs, but they require careful manual selection of experts and result in a complex inference process.
  • Spatial Redundancy in Images: Visual signals have heavy spatial redundancy, making it hard to generate meaningful feedback from natural images.
πŸ’‘
Idea: Reconstruct the latent visual tokens of input images with a denoiser to supervise the visual outputs of LMMs
πŸ› οΈ
Solution:
  • Reconstruction Variant Design: Proposes three regression-based reconstruction variants: $\textbf{ROSS}^R\text{-Pixel}$ (regresses raw RGB pixel values via a patchify operation), $\textbf{ROSS}^R\text{-Latent}$ (regresses fine-grained latent tokens extracted by frozen teacher tokenizers VAE/DINOv2/DEiT-III), and $\textbf{ROSS}^R\text{-Latent2Pixel}$ (maps back to RGB pixel space for regression).
  • Training Objective:
    • How to reconstruct: Replaces vanilla regression with a per-token denoising objective to address visual spatial redundancy.
    • How to train: Trains the model with a joint loss of original textual next-token prediction and visual reconstructive denoising.
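The per-token denoising objective can be sketched as follows. This is a NumPy toy with an assumed cosine noise schedule and a simplified denoiser signature; in ROSS the denoiser is a small network conditioned on the LMM's visual hidden states:

```python
import numpy as np

def per_token_denoising_loss(latents, denoiser, t, rng=None):
    """Noise the frozen-tokenizer latents at level t in [0, 1], ask the
    denoiser to recover the clean tokens, and average the per-token
    squared error. Denoising (vs. vanilla regression) keeps the
    objective informative despite heavy spatial redundancy in images."""
    rng = np.random.default_rng(0) if rng is None else rng
    noise = rng.standard_normal(latents.shape)
    alpha, sigma = np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)  # assumed schedule
    noised = alpha * latents + sigma * noise
    recon = denoiser(noised, t)
    return float(np.mean((recon - latents) ** 2))

# Sanity check: an oracle denoiser returning the clean latents gives zero loss.
latents = np.random.default_rng(1).standard_normal((16, 32))
oracle = lambda noised, t: latents
print(per_token_denoising_loss(latents, oracle, t=0.5))  # 0.0
```

During training this loss would be added to the standard textual next-token prediction loss, as described above.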
πŸ†
Results: Reconstructive objectives significantly boost LMMs' fine-grained visual comprehension and reduce hallucinations, while generative objectives focus only on high-aesthetic image generation instead of text-image alignment and thus fail to improve multimodal comprehension.
References:
[1] S. Tong et al., Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. In CVPR 2024.
Training Procedure of ROSS