DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

Shanghai Innovation Institute, DeepGen Team

🔥 News

Feb 13, 2026: We released DeepGen 1.0. Pre-training, Supervised Fine-Tuning, and Reinforcement Learning checkpoints are available on Hugging Face and support both T2I generation and image editing.
Feb 13, 2026: We released the training code, which supports Pre-training, Supervised Fine-Tuning, and Reinforcement Learning, and the evaluation code, which supports a wide range of benchmarks.
Feb 13, 2026: We released the DeepGen 1.0 technical report on arXiv.

✨ Introduction

Broader Scenario and Dimension Coverage: We propose DeepGen 1.0, a lightweight unified multimodal model with only 5B parameters (3B VLM + 2B DiT). It integrates five core capabilities within a single model: general image generation, general image editing, reasoning image generation, reasoning image editing, and text rendering. Across multiple authoritative benchmarks, DeepGen 1.0 is competitive with or surpasses state-of-the-art unified multimodal models that are 3× to 16× larger, demonstrating that massive scaling is not the sole path to high-performance multimodal generation.

🧠 Method

Our core observation is that a lightweight model, when empowered by synergistic architecture design and data-centric training strategies, can achieve comprehensive capabilities competitive with or even surpassing much larger counterparts. To overcome the limitations of lightweight models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable "think tokens" to provide the generative backbone with structured, reasoning-rich guidance. We further design a data-centric training strategy spanning three progressive stages: (1) Alignment Pre-training on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations, (2) Joint Supervised Fine-tuning on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities, and (3) Reinforcement Learning with MR-GRPO, which leverages a mixture of reward functions and supervision signals, resulting in substantial gains in generation quality and alignment with human preferences, while maintaining stable training progress and avoiding visual artifacts.
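
To make the third stage more concrete, here is a minimal sketch of how a mixture of rewards could feed a group-relative advantage, which is the general idea behind GRPO-style training. It is not the MR-GRPO implementation from this repository: the reward names, the weights, and the helpers `mix_rewards` and `group_relative_advantages` are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def mix_rewards(scores, weights):
    """Weighted average of several reward signals for one generated sample."""
    total_weight = sum(weights[name] for name in scores)
    return sum(weights[name] * value for name, value in scores.items()) / total_weight


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical weights and per-sample scores (assumptions, not values from the paper).
weights = {"preference": 0.5, "text_alignment": 0.3, "ocr": 0.2}
group = [
    {"preference": 0.9, "text_alignment": 0.8, "ocr": 1.0},
    {"preference": 0.4, "text_alignment": 0.6, "ocr": 0.0},
    {"preference": 0.7, "text_alignment": 0.7, "ocr": 0.5},
]

rewards = [mix_rewards(sample, weights) for sample in group]
advantages = group_relative_advantages(rewards)
```

Samples that score above the group mean receive a positive advantage and those below receive a negative one, so the policy update depends on relative quality within the group rather than on the absolute reward scale.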

💻 Train & Eval

Set up environment

git clone https://github.com/deepgenteam/deepgen.git
cd deepgen
conda create -n deepgen python=3.12 -y
conda activate deepgen
pip install -r requirements.txt
pip install flash_attn==2.8.3 --no-build-isolation
pip install xtuner==0.2.0
pip install transformers==4.56.1
pip install triton==2.3.0
pip install -U opencv-python-headless
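
After installation, it can be useful to confirm that the pinned versions are the ones actually installed. The following is a small sketch using only the Python standard library; the helper `check_pins` is hypothetical and not part of this repository, and it assumes the installed distribution names match the names used on the pip command lines above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the installation commands above.
PINNED = {
    "flash_attn": "2.8.3",
    "xtuner": "0.2.0",
    "transformers": "4.56.1",
    "triton": "2.3.0",
}


def check_pins(pinned, get_version=version):
    """Return {package: (expected, found)} for every mismatch or missing package."""
    problems = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None  # package is not installed in this environment
        if found != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every pinned package is present at the expected version.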

Data Preparation

See DATA for details. It describes how to download and use the data for both the Pre-training and Supervised Fine-Tuning stages.

Inference

Diffusers

We provide a diffusers-compatible format at 🤗 deepgenteam/DeepGen-1.0-diffusers.

Text-to-Image:

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "deepgenteam/DeepGen-1.0-diffusers",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
pipe.to("cuda")

result = pipe(
    prompt="a photo of a blue pizza and a yellow baseball glove",
    height=512, width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("output.png")

Image Editing:

from PIL import Image

result = pipe(
    prompt="Place this guitar on a sandy beach with the sunset in the background.",
    image=Image.open("guitar.png"),
    negative_prompt="blurry, low quality, low resolution, distorted, deformed, broken content, missing parts, damaged details, artifacts, glitch, noise, pixelated, grainy, compression artifacts, bad composition, wrong proportion, incomplete editing, unfinished, unedited areas.",
    height=512, width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("edited.png")

Please refer to INFERENCE for more details, including the native pipeline usage.

Train

See TRAIN for details. It describes the model and training configs for both the Pre-training and Supervised Fine-Tuning stages.

Eval

We provide scripts for evaluating a wide range of T2I and editing benchmarks. See EVAL for details.

📊 Benchmarks

1. General Image Generation

Model	Params	Geneval ↑	DPGBench ↑	UniGenBench ↑
OmniGen2	3B + 4B	0.80	83.57	63.09
BAGEL	14B	0.82	85.10	61.53
X-Omni	7B + 12B	0.83	87.65 🥉	53.77
Lumina-DiMOO	8B	0.88 🥇	86.04	71.12
Hunyuan-Image-3.0	80B	0.72	86.10	-
Qwen-Image	7B + 20B	0.87 🥈	88.32 🥇	78.81 🥇
LongCat-Image	7B + 6B	0.87 🥈	86.80	-
Z-Image-Turbo	4B + 6B	0.84	85.15	71.40
GLM-Image	9B + 7B	-	84.78	-
DeepGen 1.0 (SFT)	3B + 2B	0.86 🥉	87.05	74.18 🥉
DeepGen 1.0 (RL)	3B + 2B	0.87 🥈	87.90 🥈	75.74 🥈

2. General Image Editing

Model	Params	GEdit-EN ↑	ImgEdit ↑
BAGEL	14B	6.52	3.20
Qwen-Image-Edit [2509]	7B + 20B	7.54 🥈	4.35 🥈
LongCat-Image-Edit	7B + 6B	7.60 🥇	4.50 🥇
Mammoth2	8B + 3B + 2B	6.60	4.06
DeepGen 1.0 (SFT)	3B + 2B	7.12	4.09
DeepGen 1.0 (RL)	3B + 2B	7.17 🥉	4.14 🥉

3. Reasoning Image Generation

Model	Params	WISE ↑	T2I-CoREBench ↑
OmniGen2	3B + 4B	0.47	36.1
BAGEL	14B	0.70 🥉	41.1
Hunyuan-Image-3.0	80B	0.57	46.0
Qwen-Image	7B + 20B	0.62	46.3 🥉
LongCat-Image	7B + 6B	0.65	52.2 🥇
Z-Image-Turbo	4B + 6B	-	43.7
DeepGen 1.0 (SFT)	3B + 2B	0.72 🥈	45.7
DeepGen 1.0 (RL)	3B + 2B	0.73 🥇	46.5 🥈

4. Reasoning Image Editing

Model	Params	RISE ↑	UniREditBench ↑
OmniGen2	3B + 4B	-	43.4
BAGEL	14B	11.9 🥈	51.0
Qwen-Image-Edit [2509]	7B + 20B	8.9	56.5 🥉
DeepGen 1.0 (SFT)	3B + 2B	13.3 🥇	77.5 🥇
DeepGen 1.0 (RL)	3B + 2B	10.8 🥉	75.7 🥈
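
The four tables above report both the SFT and RL checkpoints, so the effect of the RL stage can be read off as a per-benchmark difference. The short sketch below does that arithmetic using only the scores listed in the tables; the helper `rl_deltas` is hypothetical and written for illustration. The benchmarks use different scales, so the differences are not comparable across rows.

```python
# DeepGen 1.0 scores copied from the four benchmark tables above: (SFT, RL).
SCORES = {
    "Geneval": (0.86, 0.87),
    "DPGBench": (87.05, 87.90),
    "UniGenBench": (74.18, 75.74),
    "GEdit-EN": (7.12, 7.17),
    "ImgEdit": (4.09, 4.14),
    "WISE": (0.72, 0.73),
    "T2I-CoREBench": (45.7, 46.5),
    "RISE": (13.3, 10.8),
    "UniREditBench": (77.5, 75.7),
}


def rl_deltas(scores):
    """Return {benchmark: RL score minus SFT score}, rounded to two decimals."""
    return {name: round(rl - sft, 2) for name, (sft, rl) in scores.items()}


deltas = rl_deltas(SCORES)
# Benchmarks where the RL checkpoint scores lower than the SFT checkpoint.
regressions = sorted(name for name, delta in deltas.items() if delta < 0)

for name, delta in deltas.items():
    print(f"{name:15s} {delta:+.2f}")
print("Lower after RL:", regressions)
```

On these numbers the RL checkpoint improves the generation and general editing benchmarks, while the SFT checkpoint scores higher on the two reasoning-editing benchmarks (RISE and UniREditBench), which may matter when choosing a checkpoint for a given task.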

📧 Contact

[email protected], [email protected]

🎨 Quantitative results

⭐ Citation

@article{wang2026deepgen,
  title={DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing},
  author={Wang, Dianyi and Li, Ruihang and Han, Feng and Ma, Chaofan and Song, Wei and Wang, Siyuan and Wang, Yibin and Xin, Yi and Liu, Hongjian and Zhang, Zhixiong and others},
  journal={arXiv preprint arXiv:2602.12205},
  year={2026}
}

🙏 Acknowledgement

The project builds upon the following pioneering works:

OpenUni: We thank the OpenUni team for releasing their elegant and concise code and pre-training dataset.
UniPic2-SD3.5M-Kontext-2B: We use UniPic2-SD3.5M-Kontext-2B as our diffusion module, considering its efficiency and strong performance on both T2I generation and editing.
UnifiedReward-Think: We use UnifiedReward-Think as our reward model for RL, considering its strong performance.
Qwen2.5 VL: We use Qwen2.5 VL-3B as our VLM module, considering its efficiency and strong multimodal understanding abilities.
BLIP3-o: We thank the BLIP3-o team for releasing the precious high-quality tuning dataset.
OpenGPT-4o-Image: We thank the OpenGPT-4o-Image team for releasing the precious high-quality tuning dataset.
ShareGPT-4o-Image: We thank the ShareGPT-4o-Image team for releasing the precious high-quality tuning dataset.
Echo-4o: We thank the Echo-4o team for releasing the precious high-quality tuning dataset.
OmniGen2: We thank the OmniGen2 team for releasing the precious high-quality editing tuning dataset and code.
Uniworld-V1: We thank the Uniworld team for releasing the precious high-quality tuning dataset and code.
Picobanana: We thank the Picobanana team for releasing the precious high-quality editing tuning dataset.
Nano-consist: We thank the Nano-consist team for releasing the precious high-quality editing tuning dataset.
NHR-edit: We thank the NHR-edit team for releasing the precious high-quality editing tuning dataset.
UniREditBench: We thank the UniREditBench team for releasing the precious high-quality reason-based editing tuning dataset.
