
Synchronize keys and handle missing values in dist_utils#136

Open
Jimmy-Mendez wants to merge 1 commit into Intellindust-AI-Lab:main from Jimmy-Mendez:patch-1

Conversation


@Jimmy-Mendez Jimmy-Mendez commented Jan 15, 2026

When training with multiple GPUs, batches with no ground truth objects cause some ranks to produce fewer loss keys (e.g., denoising losses are skipped). This results in reduce_dict attempting all_reduce on tensors of different sizes across ranks, causing a deadlock.

Fix: Synchronize the loss dictionary keys across all ranks before all_reduce, filling missing keys with zero tensors. (Fixes #6, fixes #34, and possibly fixes #113.)

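A minimal sketch of the synchronization step described above, assuming PyTorch's `torch.distributed` is the backend in use. The names `fill_missing_keys` and `sync_loss_dict_keys` are illustrative, not the actual identifiers in `dist_utils`; the idea is to gather each rank's key list with `all_gather_object`, take the union in a deterministic (sorted) order, and pad missing entries with zero tensors so every rank hands `reduce_dict` identically keyed, identically shaped tensors:

```python
def fill_missing_keys(loss_dict, key_lists, make_zero):
    """Union the per-rank key lists and fill keys this rank lacks with zeros.

    Sorting gives every rank the same deterministic key order, which is
    required for the subsequent element-wise all_reduce to line up.
    """
    all_keys = sorted({k for keys in key_lists for k in keys})
    return {k: loss_dict.get(k, make_zero()) for k in all_keys}


def sync_loss_dict_keys(loss_dict):
    """Call right before reduce_dict in the training loop (sketch).

    Ranks whose batch skipped some losses (e.g. denoising losses when the
    batch has no ground-truth boxes) get those keys filled with zeros, so
    all_reduce no longer deadlocks on mismatched tensor counts.
    """
    import torch
    import torch.distributed as dist

    if not (dist.is_available() and dist.is_initialized()):
        return loss_dict  # single-process training: nothing to synchronize

    # all_gather_object handles arbitrary picklable objects such as
    # lists of strings, so each rank learns every other rank's key set.
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, sorted(loss_dict.keys()))

    device = next(iter(loss_dict.values())).device
    return fill_missing_keys(
        loss_dict, gathered, lambda: torch.zeros((), device=device)
    )
```

Filling with zeros keeps the averaged losses slightly biased low on steps where some ranks skipped a term, but it is a collective-safe choice: every rank contributes a tensor of the same shape for every key.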
