Shanghai Innovation Institute, DeepGen Team
- Feb 13, 2026: We released DeepGen 1.0. The Pre-training, Supervised Fine-Tuning, and Reinforcement Learning checkpoints are available on Hugging Face and support both T2I generation and image editing.
- Feb 13, 2026: We released the training code, which supports Pre-training, Supervised Fine-Tuning, and Reinforcement Learning, and the evaluation code, which supports a wide range of benchmarks.
- Feb 13, 2026: We released the DeepGen 1.0 technical report on arXiv.
We propose DeepGen 1.0, a lightweight unified multimodal model with only 5B parameters (3B VLM + 2B DiT). It integrates five core capabilities within a single model: general image generation, general image editing, reasoning image generation, reasoning image editing, and text rendering. Across multiple authoritative benchmarks, DeepGen 1.0 is competitive with or surpasses state-of-the-art unified multimodal models that are 3× to 16× larger, demonstrating that massive scaling is not the sole path to high-performance multimodal generation.
Our core observation is that a lightweight model, when empowered by synergistic architecture design and data-centric training strategies, can achieve comprehensive capabilities competitive with or even surpassing much larger counterparts. To overcome the limitations of lightweight models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable "think tokens" to provide the generative backbone with structured, reasoning-rich guidance. We further design a data-centric training strategy spanning three progressive stages: (1) Alignment Pre-training on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations, (2) Joint Supervised Fine-tuning on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities, and (3) Reinforcement Learning with MR-GRPO, which leverages a mixture of reward functions and supervision signals, resulting in substantial gains in generation quality and alignment with human preferences, while maintaining stable training progress and avoiding visual artifacts.
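As a rough illustration of the third stage, the sketch below combines several reward signals into one scalar per sample and then normalizes the result within a group of samples drawn for the same prompt, which is the group-relative advantage used by GRPO-style methods. The reward names, the weights, and the weighted-sum combination rule are assumptions made for illustration only; the actual reward mixture and update rule of MR-GRPO are defined in the technical report, not here.

```python
from statistics import mean, pstdev


def mix_rewards(scores, weights):
    """Weighted sum of named reward scores for one sample (assumed combination rule)."""
    return sum(weights[name] * scores[name] for name in weights)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's sample group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward names and weights, for illustration only.
weights = {"preference": 0.5, "text_alignment": 0.3, "ocr": 0.2}

# Four images sampled for the same prompt, each scored by every reward.
group_scores = [
    {"preference": 0.9, "text_alignment": 0.8, "ocr": 1.0},
    {"preference": 0.6, "text_alignment": 0.7, "ocr": 0.0},
    {"preference": 0.4, "text_alignment": 0.9, "ocr": 1.0},
    {"preference": 0.2, "text_alignment": 0.3, "ocr": 0.0},
]

mixed = [mix_rewards(scores, weights) for scores in group_scores]
advantages = group_relative_advantages(mixed)

for reward, advantage in zip(mixed, advantages):
    print(f"mixed reward {reward:.2f} -> advantage {advantage:+.2f}")
```

Because the advantages are centered within the group, samples that beat their siblings get a positive signal and the rest a negative one, regardless of the absolute reward scale.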
```bash
# Clone the repository
git clone https://github.com/deepgenteam/deepgen.git
cd deepgen

# Create and activate the conda environment
conda create -n deepgen python=3.12 -y
conda activate deepgen

# Install dependencies
pip install -r requirements.txt
pip install flash_attn==2.8.3 --no-build-isolation
pip install xtuner==0.2.0
pip install transformers==4.56.1
pip install triton==2.3.0
pip install -U opencv-python-headless
```

See DATA for more details. It describes how to download and use the data for both the Pre-training and Supervised Fine-Tuning stages.
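
Since several packages above are pinned to exact versions, a quick check after installation can catch a mismatched environment early. The following is a minimal sketch, not part of the repository: the `PINNED` table simply copies the pins from the install commands, and `find_mismatches` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install commands above.
PINNED = {
    "flash_attn": "2.8.3",
    "xtuner": "0.2.0",
    "transformers": "4.56.1",
    "triton": "2.3.0",
}


def find_mismatches(pinned, get_version=version):
    """Return {package: (expected, found)} for missing or differently versioned packages."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None  # package is not installed at all
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every pinned package is installed at the expected version.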
Diffusers
We provide a diffusers-compatible format at 🤗deepgenteam/DeepGen-1.0-diffusers.
Text-to-Image:
```python
import torch
from diffusers import DiffusionPipeline

# trust_remote_code=True allows the custom pipeline code in the model repo to be loaded.
pipe = DiffusionPipeline.from_pretrained(
    "deepgenteam/DeepGen-1.0-diffusers",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
pipe.to("cuda")

result = pipe(
    prompt="a photo of a blue pizza and a yellow baseball glove",
    height=512, width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("output.png")
```

Image Editing:
```python
from PIL import Image

# Reuses the `pipe` object created in the text-to-image example.
result = pipe(
    prompt="Place this guitar on a sandy beach with the sunset in the background.",
    image=Image.open("guitar.png"),
    negative_prompt="blurry, low quality, low resolution, distorted, deformed, broken content, missing parts, damaged details, artifacts, glitch, noise, pixelated, grainy, compression artifacts, bad composition, wrong proportion, incomplete editing, unfinished, unedited areas.",
    height=512, width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("edited.png")
```

See INFERENCE for more details, including native pipeline usage.
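
To compare several random seeds for one prompt, the pipeline call can be wrapped in a small loop. This is only a sketch: `generate_variants` is a hypothetical helper, and it assumes the pipeline accepts `prompt=` and `seed=` and returns an object with an `.images` list, as in the calls shown above.

```python
from pathlib import Path


def generate_variants(pipe, prompt, seeds, out_dir="outputs", **pipe_kwargs):
    """Call `pipe` once per seed and save the first image of each result.

    Assumes `pipe` accepts `prompt=` and `seed=` keyword arguments and returns
    an object whose `.images[0]` has a PIL-style `save(path)` method.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for seed in seeds:
        result = pipe(prompt=prompt, seed=seed, **pipe_kwargs)
        target = out_path / f"seed_{seed}.png"
        result.images[0].save(target)
        saved.append(target)
    return saved
```

With a loaded pipeline, a call such as `generate_variants(pipe, "a photo of a blue pizza", seeds=[0, 1, 2], height=512, width=512, num_inference_steps=50, guidance_scale=4.0)` would write one image per seed into `outputs/`.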
See TRAIN for more details. It describes the model and training configs for both the Pre-training and Supervised Fine-Tuning stages.
We provide scripts for evaluating a wide range of T2I and editing benchmarks. See EVAL for more details.

General image generation:

| Model | Params | GenEval ↑ | DPGBench ↑ | UniGenBench ↑ |
|---|---|---|---|---|
| OmniGen2 | 3B + 4B | 0.80 | 83.57 | 63.09 |
| BAGEL | 14B | 0.82 | 85.10 | 61.53 |
| X-Omni | 7B + 12B | 0.83 | 87.65 🥉 | 53.77 |
| Lumina-DiMOO | 8B | 0.88 🥇 | 86.04 | 71.12 |
| Hunyuan-Image-3.0 | 80B | 0.72 | 86.10 | — |
| Qwen-Image | 7B + 20B | 0.87 🥈 | 88.32 🥇 | 78.81 🥇 |
| LongCat-Image | 7B + 6B | 0.87 🥈 | 86.80 | — |
| Z-Image-Turbo | 4B + 6B | 0.84 | 85.15 | 71.40 |
| GLM-Image | 9B + 7B | — | 84.78 | — |
| DeepGen 1.0 (SFT) | 3B + 2B | 0.86 🥉 | 87.05 | 74.18 🥉 |
| DeepGen 1.0 (RL) | 3B + 2B | 0.87 🥈 | 87.90 🥈 | 75.74 🥈 |

General image editing:

| Model | Params | GEdit-EN ↑ | ImgEdit ↑ |
|---|---|---|---|
| BAGEL | 14B | 6.52 | 3.20 |
| Qwen-Image-Edit [2509] | 7B + 20B | 7.54 🥈 | 4.35 🥈 |
| LongCat-Image-Edit | 7B + 6B | 7.60 🥇 | 4.50 🥇 |
| Mammoth2 | 8B + 3B + 2B | 6.60 | 4.06 |
| DeepGen 1.0 (SFT) | 3B + 2B | 7.12 | 4.09 |
| DeepGen 1.0 (RL) | 3B + 2B | 7.17 🥉 | 4.14 🥉 |

Reasoning image generation:

| Model | Params | WISE ↑ | T2I-CoREBench ↑ |
|---|---|---|---|
| OmniGen2 | 3B + 4B | 0.47 | 36.1 |
| BAGEL | 14B | 0.70 🥉 | 41.1 |
| Hunyuan-Image-3.0 | 80B | 0.57 | 46.0 |
| Qwen-Image | 7B + 20B | 0.62 | 46.3 🥉 |
| LongCat-Image | 7B + 6B | 0.65 | 52.2 🥇 |
| Z-Image-Turbo | 4B + 6B | - | 43.7 |
| DeepGen 1.0 (SFT) | 3B + 2B | 0.72 🥈 | 45.7 |
| DeepGen 1.0 (RL) | 3B + 2B | 0.73 🥇 | 46.5 🥈 |

Reasoning image editing:

| Model | Params | RISE ↑ | UniREditBench ↑ |
|---|---|---|---|
| OmniGen2 | 3B + 4B | - | 43.4 |
| BAGEL | 14B | 11.9 🥈 | 51.0 |
| Qwen-Image-Edit [2509] | 7B + 20B | 8.9 | 56.5 🥉 |
| DeepGen 1.0 (SFT) | 3B + 2B | 13.3 🥇 | 77.5 🥇 |
| DeepGen 1.0 (RL) | 3B + 2B | 10.8 🥉 | 75.7 🥈 |
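
To see at a glance where the RL stage helps, the SFT and RL rows of the tables above can be compared directly. The sketch below only tabulates the reported numbers; the variable and function names are illustrative.

```python
# (SFT, RL) scores for DeepGen 1.0, copied from the tables above.
SCORES = {
    "GenEval": (0.86, 0.87),
    "DPGBench": (87.05, 87.90),
    "UniGenBench": (74.18, 75.74),
    "GEdit-EN": (7.12, 7.17),
    "ImgEdit": (4.09, 4.14),
    "WISE": (0.72, 0.73),
    "T2I-CoREBench": (45.7, 46.5),
    "RISE": (13.3, 10.8),
    "UniREditBench": (77.5, 75.7),
}


def rl_deltas(scores):
    """Return {benchmark: RL - SFT}, rounded to 2 decimals to hide float noise."""
    return {name: round(rl - sft, 2) for name, (sft, rl) in scores.items()}


deltas = rl_deltas(SCORES)
regressions = sorted(name for name, delta in deltas.items() if delta < 0)

for name, delta in deltas.items():
    print(f"{name:>14}: {delta:+.2f}")
print("Lower after RL:", ", ".join(regressions))
```

On these numbers, the RL checkpoint scores higher on the seven generation and general editing benchmarks, while the SFT checkpoint scores higher on the two reasoning editing benchmarks (RISE and UniREditBench).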
```bibtex
@article{wang2026deepgen,
  title={DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing},
  author={Wang, Dianyi and Li, Ruihang and Han, Feng and Ma, Chaofan and Song, Wei and Wang, Siyuan and Wang, Yibin and Xin, Yi and Liu, Hongjian and Zhang, Zhixiong and others},
  journal={arXiv preprint arXiv:2602.12205},
  year={2026}
}
```

The project builds upon the following pioneering works:
- OpenUni: We thank the OpenUni team for releasing their elegant and concise code and pre-training dataset.
- UniPic2-SD3.5M-Kontext-2B: We use UniPic2-SD3.5M-Kontext-2B as our diffusion module for its efficiency and strong performance on both T2I generation and editing.
- UnifiedReward-Think: We use UnifiedReward-Think as our reward model for RL for its strong performance.
- Qwen2.5 VL: We use Qwen2.5 VL-3B as our VLM module for its efficiency and strong multimodal understanding.
- BLIP3-o: We thank the BLIP3-o team for releasing their valuable high-quality tuning dataset.
- OpenGPT-4o-Image: We thank the OpenGPT-4o-Image team for releasing their valuable high-quality tuning dataset.
- ShareGPT-4o-Image: We thank the ShareGPT-4o-Image team for releasing their valuable high-quality tuning dataset.
- Echo-4o: We thank the Echo-4o team for releasing their valuable high-quality tuning dataset.
- OmniGen2: We thank the OmniGen2 team for releasing their valuable high-quality editing tuning dataset and code.
- Uniworld-V1: We thank the Uniworld team for releasing their valuable high-quality tuning dataset and code.
- Picobanana: We thank the Picobanana team for releasing their valuable high-quality editing tuning dataset.
- Nano-consist: We thank the Nano-consist team for releasing their valuable high-quality editing tuning dataset.
- NHR-edit: We thank the NHR-edit team for releasing their valuable high-quality editing tuning dataset.
- UniREditBench: We thank the UniREditBench team for releasing their valuable high-quality reasoning-based editing tuning dataset.



