
Commit 3c70440

Update distributed_inference.md to reposition sections (huggingface#12971)
1 parent: 7299121

File tree

1 file changed (+21 −20 lines)


docs/source/en/training/distributed_inference.md

Lines changed: 21 additions & 20 deletions
````diff
@@ -314,25 +314,6 @@ Pass the [`ContextParallelConfig`] to [`~ModelMixin.enable_parallelism`].
 pipeline.transformer.enable_parallelism(config=ContextParallelConfig(ulysses_degree=2))
 ```
 
-### parallel_config
-
-Pass `parallel_config` during model initialization to enable context parallelism.
-
-```py
-CKPT_ID = "black-forest-labs/FLUX.1-dev"
-
-cp_config = ContextParallelConfig(ring_degree=2)
-transformer = AutoModel.from_pretrained(
-    CKPT_ID,
-    subfolder="transformer",
-    torch_dtype=torch.bfloat16,
-    parallel_config=cp_config
-)
-
-pipeline = DiffusionPipeline.from_pretrained(
-    CKPT_ID, transformer=transformer, torch_dtype=torch.bfloat16,
-).to(device)
-```
 ### Unified Attention
 
 [Unified Sequence Parallelism](https://huggingface.co/papers/2405.07719) combines Ring Attention and Ulysses Attention into a single approach for efficient long-sequence processing. It applies Ulysses's *all-to-all* communication first to redistribute heads and sequence tokens, then uses Ring Attention to process the redistributed data, and finally reverses the *all-to-all* to restore the original layout.
````
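The Ulysses-style *all-to-all* mentioned in the hunk above can be sketched in plain Python as a single-process simulation (no real communication; a real implementation would use `torch.distributed.all_to_all` across GPUs). The helper name and the pair-based "activation" representation are made up for illustration; the point is why `ulysses_degree` is capped by the attention-head count.

```python
# Single-process sketch (illustrative only, not the diffusers implementation):
# before the all-to-all, each device holds its slice of the sequence for ALL
# heads; afterwards, each device holds ALL tokens for its slice of the heads.

def ulysses_all_to_all(device_shards, num_heads):
    """Redistribute sequence-sharded activations into head-sharded ones.

    device_shards[d] is a list of (token_id, head_id) pairs held by device d.
    """
    degree = len(device_shards)
    if num_heads % degree != 0:
        # This is the limitation unified attention works around: you cannot
        # use more Ulysses ranks than (divisors of) the number of heads.
        raise ValueError("ulysses_degree must divide the number of attention heads")
    heads_per_device = num_heads // degree
    out = [[] for _ in range(degree)]
    for shard in device_shards:
        for token_id, head_id in shard:
            # Device d now owns heads [d*hpd, (d+1)*hpd) for every token.
            out[head_id // heads_per_device].append((token_id, head_id))
    return out

# 2 devices, 4 tokens, 4 heads: each device starts with 2 tokens x 4 heads.
shards = [
    [(t, h) for t in range(0, 2) for h in range(4)],
    [(t, h) for t in range(2, 4) for h in range(4)],
]
redistributed = ulysses_all_to_all(shards, num_heads=4)
# Each device now sees all 4 tokens but only 2 of the 4 heads.
assert all(len({t for t, _ in dev}) == 4 for dev in redistributed)
assert all(len({h for _, h in dev}) == 2 for dev in redistributed)
```

Unified attention then runs Ring Attention over the head-sharded layout and inverts the all-to-all afterwards, which is why it is not bound by the head count the way pure Ulysses is.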
````diff
@@ -360,4 +341,24 @@ We ran a benchmark with Ulysses, Ring, and Unified Attention with [this script](
 | ring | 13076.492 | 3.82 | 56.02 |
 | unified_balanced | 11068.705 | 4.52 | 33.85 |
 
-From the above table, it's clear that Ulysses provides better throughput, but the number of devices it can use remains limited to number of attention-heads, a limitation that is solved by unified attention.
+From the above table, it's clear that Ulysses provides better throughput, but the number of devices it can use remains limited to the number of attention heads, a limitation that is solved by unified attention.
+
+### parallel_config
+
+Pass `parallel_config` during model initialization to enable context parallelism.
+
+```py
+CKPT_ID = "black-forest-labs/FLUX.1-dev"
+
+cp_config = ContextParallelConfig(ring_degree=2)
+transformer = AutoModel.from_pretrained(
+    CKPT_ID,
+    subfolder="transformer",
+    torch_dtype=torch.bfloat16,
+    parallel_config=cp_config
+)
+
+pipeline = DiffusionPipeline.from_pretrained(
+    CKPT_ID, transformer=transformer, torch_dtype=torch.bfloat16,
+).to(device)
+```
````
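The relocated `parallel_config` snippet is meant to run under a distributed launcher (e.g. `torchrun`), where every rank executes the same script. A common sizing rule for such setups is that the configured parallel degrees must multiply out to the number of participating devices; the helper below is a hypothetical sanity check written for this page (it is *not* a diffusers API), sketching that assumed constraint:

```python
# Hypothetical sanity check, NOT part of diffusers: assumes the product of the
# ring and Ulysses degrees must equal the number of devices in the
# context-parallel group.

def check_cp_config(ring_degree: int, ulysses_degree: int, world_size: int) -> bool:
    if ring_degree < 1 or ulysses_degree < 1:
        raise ValueError("degrees must be positive integers")
    if ring_degree * ulysses_degree != world_size:
        raise ValueError(
            f"ring_degree ({ring_degree}) * ulysses_degree ({ulysses_degree}) "
            f"must equal world_size ({world_size})"
        )
    return True

# Matches the snippet above: ring_degree=2 on a 2-GPU launch.
assert check_cp_config(ring_degree=2, ulysses_degree=1, world_size=2)
```

Under this assumption, `ContextParallelConfig(ring_degree=2)` pairs with a 2-process launch, while mixing ring and Ulysses degrees (as unified attention does) requires correspondingly more devices.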

0 commit comments
