Automatically benchmark and optimize the attention mechanism in diffusion models for maximum generation speed.
Modern diffusion models (SDXL, Flux, WAN, LTX-V, Hunyuan Video) are built on transformer architectures. The core operation, attention, computes relationships between all elements of the image/video latent space. Attention is:
- The most expensive operation - attention takes 40-70% of total generation time
- O(n²) complexity - cost grows quadratically with resolution/frames
- GPU-dependent - different GPUs perform best with different implementations
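To make the quadratic scaling concrete, here is a small back-of-the-envelope calculation. It assumes an 8x VAE downscale (typical for SD-family models) and ignores patchification, so the token counts are a simplification, not the plugin's actual accounting:

```python
# Illustrative arithmetic only: why attention cost explodes with resolution.
# Assumes an 8x VAE downscale (typical for SD-family models); real models
# may also patchify, so treat the numbers as a simplification.

def latent_tokens(width: int, height: int, downscale: int = 8) -> int:
    """Number of latent elements attention operates over."""
    return (width // downscale) * (height // downscale)

def relative_attention_cost(w1, h1, w2, h2):
    """How much more expensive attention is at (w2, h2) vs (w1, h1)."""
    n1, n2 = latent_tokens(w1, h1), latent_tokens(w2, h2)
    return (n2 * n2) / (n1 * n1)  # attention is O(n^2) in token count

print(latent_tokens(1024, 1024))                      # 16384 tokens
print(relative_attention_cost(512, 512, 1024, 1024))  # 16.0
```

Doubling the resolution quadruples the token count and therefore multiplies the attention cost by 16 — which is why the backend choice matters most at high resolutions and for video.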
Multiple optimized attention backends exist:
- PyTorch SDPA - built-in, always available
- Flash Attention - CUDA kernels, memory efficient
- SageAttention - INT8 quantization, up to 2-4x faster
- xFormers - memory efficient attention
But which one is fastest for YOUR specific GPU and model?
This plugin benchmarks all available backends and automatically applies the fastest one.
Tested on RTX 4090 with head_dim=128 (SDXL, Flux):
| Backend | Time | Speedup |
|---|---|---|
| PyTorch SDPA | 5.0ms | 1.0x (baseline) |
| Flash Attention | 5.4ms | 0.93x |
| SageAttention | 2.7ms | 1.9x |
Result: 1.9x faster generation just by switching the attention backend.
For video models (WAN, Hunyuan) with longer sequences, speedups can reach 2-4x.
Via ComfyUI Manager:

- Open ComfyUI Manager
- Click "Install via Git URL"
- Paste: `https://github.com/D-Ogi/ComfyUI-Attention-Optimizer.git`
- Restart ComfyUI

Or install manually:

```bash
cd ComfyUI/custom_nodes
git clone https://github.com/D-Ogi/ComfyUI-Attention-Optimizer.git
```

Then restart ComfyUI.
The plugin works out-of-the-box with PyTorch SDPA. For better performance, install additional backends:
```bash
# SageAttention - recommended for RTX 30xx/40xx (1.5-2x speedup)
pip install sageattention

# Flash Attention - alternative for Ampere+ GPUs
pip install flash-attn

# xFormers - memory-efficient option
pip install xformers
```

Note: On Windows, Flash Attention requires building from source or using prebuilt wheels. SageAttention is easier to install and often faster on consumer GPUs.
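If you want to verify which optional backends are importable before running a workflow, a quick check like the following works. This is a hypothetical helper, not part of the plugin's API; the module names (`sageattention`, `flash_attn`, `xformers`) are assumed from the pip packages above:

```python
# Hypothetical helper (not part of the plugin's API): list which optional
# attention backend modules can be imported in the current environment.
from importlib.util import find_spec

def available_backends(candidates=("sageattention", "flash_attn", "xformers")):
    # find_spec returns None when a top-level module is not installed
    return [name for name in candidates if find_spec(name) is not None]

print(available_backends())  # e.g. ['sageattention'], depending on your install
```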
- Add the "Attention Optimizer" node to your workflow (category: `model_patches`)
- Connect your model to the `model` input
- Run - the node benchmarks once, caches the results, and auto-applies the fastest backend
┌─────────────────┐ ┌──────────────────────────┐ ┌─────────────┐
│ Load Checkpoint │────▶│ Attention Optimizer │────▶│ KSampler │
└─────────────────┘ │ │ └─────────────┘
│ 1. Detect model params │
│ 2. Check cache │
│ 3. Benchmark (if needed) │
│ 4. Clone model & apply │
│ attention override │
└──────────────────────────┘
First run: Benchmarks all backends (~5-10 seconds), saves to cache. Subsequent runs: Loads from cache (instant), applies optimal backend.
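The benchmark-then-cache flow above can be sketched in a few lines. This is a minimal illustration, not the plugin's code: the backends are dummy callables standing in for real attention kernels, and the cache is a plain dict rather than `benchmark_db.json`:

```python
# Minimal sketch of "benchmark once, cache, pick fastest". Dummy callables
# stand in for real attention kernels; names and cache shape are illustrative.
import time

def time_backend(fn, warmup=3, iters=10):
    for _ in range(warmup):           # warm up before timing
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1000.0  # ms per call

def pick_fastest(backends, cache, key):
    if key in cache:                  # subsequent runs: instant cache hit
        return cache[key]
    timings = {name: time_backend(fn) for name, fn in backends.items()}
    best = min(timings, key=timings.get)
    cache[key] = best                 # first run: benchmark, then persist
    return best

backends = {
    "slow": lambda: sum(range(20000)),
    "fast": lambda: sum(range(200)),
}
cache = {}
print(pick_fastest(backends, cache, key=("head_dim=128", "seq_len=8192")))  # fast
```

The second call with the same key skips the timing loop entirely, which is why only the first run costs the ~5-10 seconds.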
| Input | Type | Default | Description |
|---|---|---|---|
| `model` | MODEL | required | The diffusion model to optimize |
| `attention_backend` | dropdown | `auto` | `auto` = benchmark & select best, or force a specific backend |
| `force_refresh` | bool | False | Re-run the benchmark even if cached |
| `auto_apply` | bool | True | Apply the selected backend to this model |
| `seq_len` | int | 8192 | Sequence length for the benchmark |
| `num_heads` | int | 24 | Number of attention heads |
| Output | Type | Description |
|---|---|---|
| `model` | MODEL | Cloned model with optimized attention applied |
| `best_attention` | STRING | Name of the applied backend |
| `kjnodes_mode` | STRING | Compatible mode string for KJNodes PatchSageAttention |
| `impl_type` | STRING | Implementation type (cuda/triton/pytorch) |
| `speedup` | FLOAT | Speedup vs. the PyTorch SDPA baseline |
| `time_ms` | FLOAT | Time per attention call in milliseconds |
| `head_dim` | INT | Head dimension detected from the model |
| `report` | STRING | Full benchmark report text |
| Backend | Implementation | Best For |
|---|---|---|
| `pytorch` | PyTorch SDPA | Always available, baseline |
| `xformers` | xFormers CUDA | Memory efficiency |
| `sage_auto` | SageAttention auto | General use (auto-selects the best variant) |
| `sage_cuda` | SageAttention CUDA | RTX 30xx/40xx |
| `sage_triton` | SageAttention Triton | When the CUDA kernel is unavailable |
| `sage_fp8_cuda` | SageAttention FP8 | Maximum speed, slight quality trade-off |
| `sage_fp8_cuda_fast` | SageAttention FP8++ | Even faster FP8 |
| `sage3` | SageAttention 3 | RTX 50xx (Blackwell) only |
| `flash` | Flash Attention 2 | H100, A100, RTX 30xx/40xx |
| Model | Status | Notes |
|---|---|---|
| SDXL | ✅ Full | head_dim=128, SageAttention optimal |
| SD 1.5 | ✅ Full | head_dim=64 |
| SD 3 | ✅ Full | |
| Flux | ✅ Full | Per-model attention override |
| LTX-V | ✅ Full | head_dim=160 |
| WAN 2.1/2.2 | ✅ Full | Per-model attention override |
| Hunyuan Video | ✅ Full | Per-model attention override |
| Cosmos | ✅ Full | Per-model attention override |
| SeedVR2 | ❌ N/A | Uses own attention system, not affected |
| GPU | Recommended Backend | Expected Speedup |
|---|---|---|
| RTX 4090/4080 | `sage_auto` or `sage_fp8_cuda_fast` | 1.5-2.0x |
| RTX 3090/3080 | `sage_auto` or `flash` | 1.3-1.8x |
| RTX 50xx (Blackwell) | `sage3` | 2-4x |
| H100/A100 | `flash` | 1.5-2.0x |
| AMD (ROCm) | `pytorch` | 1.0x (baseline) |
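The table above can be distilled into a simple name-matching lookup. This is a hypothetical helper for illustration only; a real selection should benchmark rather than trust GPU-name matching, which is exactly what the plugin does:

```python
# Hypothetical lookup distilled from the recommendation table. A real
# selection should benchmark instead of matching on the GPU name.
def recommended_backend(gpu_name: str) -> str:
    name = gpu_name.lower()
    if "rtx 50" in name:
        return "sage3"                 # Blackwell-only SageAttention 3
    if "rtx 40" in name or "rtx 30" in name:
        return "sage_auto"             # consumer Ampere/Ada
    if "h100" in name or "a100" in name:
        return "flash"                 # datacenter GPUs
    return "pytorch"                   # safe baseline (e.g. AMD/ROCm)

print(recommended_backend("NVIDIA GeForce RTX 4090"))  # sage_auto
```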
```text
=================================================================
 BENCHMARK REPORT
=================================================================
dtype: float16 | head_dim: 128 | seq_len: 8192 | CUDA: 12.4 | Triton: 3.0.0
SageAttention: v2.1.1

>>> BEST: sage_fp8_cuda_fast (1.89x speedup) <<<
    impl: cuda | kjnodes mode: sageattn_qk_int8_pv_fp8_cuda++

Results (fastest first):
-----------------------------------------------------------------
[v] sage_fp8_cuda_fast    2.671ms   1.89x  (cuda)  <<<
[v] sage_auto             2.679ms   1.88x  (auto)
[v] sage_fp8_cuda         3.100ms   1.63x  (cuda)
[v] sage_triton           3.446ms   1.47x  (triton)
[v] sage_cuda             3.947ms   1.28x  (cuda)
[v] pytorch               5.049ms   1.00x  (pytorch)
[v] xformers              5.194ms   0.97x  (cuda/triton)
[v] flash                 5.430ms   0.93x  (cuda)
[ ] sage3                    ---    (N/A)  Not installed
-----------------------------------------------------------------
[v] = validated (tested underlying library directly)
=================================================================
```
PyTorch SDPA uses cuDNN/cuBLAS - general purpose, always works.
Flash Attention fuses operations into single CUDA kernel, reducing memory bandwidth. Great for long sequences.
SageAttention quantizes Q/K to INT8, reducing memory and compute. Works best for head_dim ≤ 128.
xFormers is similar to Flash Attention, with good memory efficiency.
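The INT8 idea behind SageAttention can be illustrated with a toy round-trip: quantize values with a scale factor, represent them as small integers, then dequantize. Real kernels do this per-block in fused CUDA code; this sketch shows only the numerics, not anything from the SageAttention codebase:

```python
# Toy illustration of INT8 quantization (the idea behind SageAttention's
# Q/K handling). Real kernels quantize per-block in fused CUDA code;
# this shows only the numerics.
def quantize_int8(xs):
    scale = max(abs(x) for x in xs) / 127.0   # per-tensor scale
    q = [round(x / scale) for x in xs]        # integers in [-127, 127]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

xs = [0.5, -1.25, 3.0, -0.01]
q, scale = quantize_int8(xs)
back = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(xs, back))
print(q)        # all values fit in int8 range
print(max_err)  # rounding error, bounded by scale / 2
```

The payoff is that INT8 values halve (vs. FP16) the memory traffic and enable faster integer math, at the cost of a small, bounded rounding error — the "slight quality trade-off" noted in the backend table.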
Models have different attention head dimensions:
- SD 1.5: head_dim=64
- SDXL, Flux: head_dim=128
- LTX-V: head_dim=160
SageAttention works best with head_dim ≤ 128. For larger dimensions, SDPA or Flash Attention may be faster.
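That rule of thumb can be codified directly. A hedged heuristic, not the plugin's actual selection logic (which benchmarks instead of guessing):

```python
# Heuristic mirroring the guidance above: prefer SageAttention when
# head_dim <= 128, otherwise fall back to SDPA/Flash. Hypothetical helper,
# not the plugin's actual selection logic.
def preferred_family(head_dim: int) -> str:
    return "sage" if head_dim <= 128 else "sdpa_or_flash"

print(preferred_family(64))   # sage           (SD 1.5)
print(preferred_family(128))  # sage           (SDXL, Flux)
print(preferred_family(160))  # sdpa_or_flash  (LTX-V)
```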
Benchmark results are cached in `benchmark_db.json` based on:
- Model hash (architecture + weights)
- head_dim
- seq_len / num_heads parameters
Cache is per-machine - different GPUs will have different optimal backends.
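A cache keyed on those parameters could look like the following sketch. The field names and key format are assumptions for illustration; the real `benchmark_db.json` layout may differ:

```python
# Sketch of how a benchmark cache could be keyed (field names assumed;
# the real benchmark_db.json layout may differ). A deterministic hash of
# the parameters gives one entry per (model, shape) combination.
import hashlib
import json

def cache_key(model_hash: str, head_dim: int, seq_len: int, num_heads: int) -> str:
    payload = json.dumps(
        {"model": model_hash, "head_dim": head_dim,
         "seq_len": seq_len, "num_heads": num_heads},
        sort_keys=True,  # stable ordering -> stable hash
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

k1 = cache_key("abc123", 128, 8192, 24)
k2 = cache_key("abc123", 128, 8192, 24)
print(k1 == k2)  # True - deterministic, so cached results are found again
```

Because the key includes the benchmark shape, changing `seq_len` or `num_heads` triggers a fresh benchmark instead of reusing stale timings.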
If a backend shows as unavailable, install the missing package:

```bash
pip install sageattention   # for sage_*
pip install flash-attn      # for flash
pip install xformers        # for xformers
```

If the backend isn't being applied:

- Check that `auto_apply` is enabled
- Try `force_refresh=True` to re-run the benchmark
- Check the console for the `[Benchmark] Applied: X` message
Some models (like SeedVR2) use their own attention implementation and won't be affected by this plugin. Check the compatibility table above.
MIT License - see LICENSE
- SageAttention - THU-ML
- Flash Attention - Dao-AILab
- xFormers - Meta