
Unexpected cuda context after dist.destroy_process_group #163741

@ABNER-1

Description

🐛 Describe the bug

During experimentation, I observed an unexpected ~320 MB GPU memory allocation on GPU 0 (presumably the default CUDA device) in each subprocess, and it persists after the process group is destroyed. I've constructed a minimal reproducible example demonstrating the issue.

import time

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Pin this process to its own GPU before initializing NCCL.
    torch.cuda.set_device(rank)
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",
        world_size=world_size,
        rank=rank,
    )
    dist.barrier()

    # do something
    print("to sleep")
    time.sleep(5)

    dist.destroy_process_group()

    # Sleep so the residual allocation can be observed with nvidia-smi.
    time.sleep(100)

def test_nccl():
    world_size = torch.cuda.device_count()
    mp.spawn(
        worker,
        args=(world_size,),
        nprocs=world_size,
        join=True,
    )

if __name__ == '__main__':
    test_nccl()

After dist.destroy_process_group() is called, approximately 320 MB of GPU 0 memory remains allocated in every subprocess and the CUDA context appears to persist. This can be verified by adding sleep intervals before and after dist.destroy_process_group() and monitoring GPU memory usage with nvidia-smi.
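For reference, the same check can be scripted instead of watching nvidia-smi by hand. The sketch below is my own addition, assuming the nvidia-ml-py package (imported as pynvml); it queries driver-level memory per GPU, which, unlike torch.cuda.memory_allocated(), does include CUDA context overhead:

import pynvml  # assumption: nvidia-ml-py is installed

def print_gpu_usage(tag):
    # Report driver-level usage per GPU, i.e. what nvidia-smi shows.
    pynvml.nvmlInit()
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        print(f"[{tag}] GPU{i}: {info.used / 1024**2:.0f} MiB used, "
              f"{len(procs)} compute process(es)")
    pynvml.nvmlShutdown()

Calling this in worker() right before and right after dist.destroy_process_group() should show the same residual usage that nvidia-smi reports.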

(Screenshot: nvidia-smi output showing the residual ~320 MB allocations on GPU 0 after dist.destroy_process_group().)
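One diagnostic worth trying (a sketch, not something verified in this report): recent PyTorch versions accept a device_id argument to init_process_group that binds the process group to a specific device. If the stray GPU 0 context comes from a collective defaulting to device 0, initializing this way might avoid it:

dist.init_process_group(
    backend="nccl",
    init_method="tcp://127.0.0.1:29500",
    world_size=world_size,
    rank=rank,
    # Assumption: binding the group to the local device may keep NCCL off GPU 0.
    device_id=torch.device(f"cuda:{rank}"),
)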

Versions

PyTorch version: 2.6.0
Is debug build: False
CUDA used to build PyTorch: 12.6
ROCM used to build PyTorch: N/A

Versions of relevant libraries:
[pip3] numpy==1.24.1
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] onnx==1.17.0
[pip3] torch==2.6.0
[pip3] torchvision==0.21.0
[pip3] triton==3.2.0
[conda] Could not collect
