🐛 Describe the bug
During experimentation, I observed an unexpected ~320 MB GPU memory allocation on GPU 0 (presumably the default CUDA device) in each subprocess, and it persists after the process group is destroyed. I've constructed a minimal reproducible example demonstrating the issue.
import torch
import os
import glob
import time
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    # Bind this worker to its own GPU before initializing the process group
    torch.cuda.set_device(rank)
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",
        world_size=world_size,
        rank=rank,
    )
    dist.barrier()
    device = f"cuda:{rank}"
    # do something
    print("to sleep")
    time.sleep(5)
    dist.destroy_process_group()
    # sleep so GPU memory can be observed with nvidia-smi after destruction
    time.sleep(100)


def test_nccl():
    world_size = torch.cuda.device_count()
    mp.spawn(
        worker,
        args=(world_size,),
        nprocs=world_size,
        join=True,
    )


if __name__ == '__main__':
    test_nccl()

After calling dist.destroy_process_group(), approximately 320 MB of GPU memory remains allocated and the CUDA context appears to persist. This can be verified by adding sleep intervals before and after dist.destroy_process_group() and monitoring GPU memory usage with nvidia-smi.
Versions
PyTorch version: 2.6.0
Is debug build: False
CUDA used to build PyTorch: 12.6
ROCM used to build PyTorch: N/A
Versions of relevant libraries:
[pip3] numpy==1.24.1
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] onnx==1.17.0
[pip3] torch==2.6.0
[pip3] torchvision==0.21.0
[pip3] triton==3.2.0
[conda] Could not collect