🐛 Describe the bug
During experimentation, I observed an unexpected ~320 MB GPU memory allocation on GPU 0 (presumably the default CUDA device) in each subprocess, and it persists after the process group is destroyed. I've constructed a minimal reproducible example demonstrating the issue.
import torch
import os
import glob
import time
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    # Bind this worker to its own GPU before initializing the process group
    torch.cuda.set_device(rank)
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",
        world_size=world_size,
        rank=rank,
    )
    dist.barrier()
    device = f"cuda:{rank}"
    # do something
    print("to sleep")
    time.sleep(5)
    dist.destroy_process_group()
    # sleep so GPU memory can be observed with nvidia-smi after destruction
    time.sleep(100)


def test_nccl():
    world_size = torch.cuda.device_count()
    mp.spawn(
        worker,
        args=(world_size,),
        nprocs=world_size,
        join=True,
    )


if __name__ == '__main__':
    test_nccl()

After calling dist.destroy_process_group(), approximately 320 MB of GPU memory remains allocated and the CUDA context appears to persist. This can be verified by adding sleep intervals before and after dist.destroy_process_group() and monitoring GPU memory usage with nvidia-smi.
Versions
PyTorch version: 2.6.0
Is debug build: False
CUDA used to build PyTorch: 12.6
ROCM used to build PyTorch: N/A
Versions of relevant libraries:
[pip3] numpy==1.24.1
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] onnx==1.17.0
[pip3] torch==2.6.0
[pip3] torchvision==0.21.0
[pip3] triton==3.2.0
[conda] Could not collect