TSAastro/DistributedTraining
Deep Learning Training Examples with PyTorch

This repository provides a comprehensive guide and practical examples for training deep learning models with PyTorch across a range of parallelism strategies. Whether you are training on a single GPU or scaling to multi-GPU setups with Distributed Data Parallel (DDP), these examples walk you through the process.


Contents

01. Introduction to Deep Learning

  • Foundational concepts of deep learning and PyTorch.
  • HPC Environment Setup:
    • Using SLURM for job scheduling: Submitting and managing training jobs.
    • Loading necessary modules: Configuring PyTorch and CUDA on an HPC cluster.
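The job-submission and module steps above typically come together in a SLURM batch script. The sketch below is illustrative only: partition and module names (gpu, cuda, pytorch) are assumptions and vary by cluster, so check them against `sinfo` and `module avail`.

```shell
#!/bin/bash
#SBATCH --job-name=train-gpu       # name shown in squeue
#SBATCH --partition=gpu            # assumed GPU partition name
#SBATCH --gres=gpu:1               # request one GPU
#SBATCH --cpus-per-task=8
#SBATCH --time=02:00:00
#SBATCH --output=train_%j.log      # %j expands to the SLURM job ID

# Module names differ between clusters; adjust to your site's catalog.
module load cuda pytorch

srun python train.py
```

Submit with `sbatch train.sh` and monitor with `squeue -u $USER`.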

02. Single-GPU Training

  • Efficiently training models on a single GPU.
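A minimal single-GPU training loop might look like the sketch below. The model and the synthetic batch are placeholders for a real network and DataLoader; the device fallback lets the same script run on a CPU-only machine.

```python
import torch
import torch.nn as nn

# Use the GPU when present; fall back to CPU so the script still runs locally.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 2).to(device)  # toy model standing in for a real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# Synthetic batch standing in for a DataLoader.
inputs = torch.randn(32, 10, device=device)
targets = torch.randint(0, 2, (32,), device=device)

for step in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()   # compute gradients
    optimizer.step()  # update weights
    print(f"step {step}: loss {loss.item():.4f}")
```

The key single-GPU habits shown here, moving both the model and every batch to the same device, carry over unchanged to the multi-GPU examples later.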

03. Multi-GPU Training with Data Parallelism (DP)

  • Scaling models across multiple GPUs using torch.nn.DataParallel.
  • Key Considerations:
    • Understanding inter-GPU communication overhead.
    • Differences between DP and DDP, and why DDP usually performs better.
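Wrapping a model in torch.nn.DataParallel is a one-line change, sketched below. DataParallel splits each batch along dimension 0, runs a model replica on every visible GPU in a single process, and gathers outputs back on GPU 0; that scatter/gather plus the Python GIL is the communication overhead the bullets above refer to.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # toy model standing in for a real network

# With two or more GPUs, DataParallel replicates the model on each device
# and shards the input batch; with fewer, the plain model is used as-is.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(64, 10).to(next(model.parameters()).device)
out = model(x)  # outputs are gathered back, so the shape is (64, 2)
```

Because everything runs in one process, DataParallel cannot scale past a single node, which is the main motivation for DDP in the next section.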

04. Distributed Data Parallel (DDP) Training

  • Leveraging torch.nn.parallel.DistributedDataParallel for efficient multi-GPU training.
  • Setting up process groups and distributed samplers.
  • Advantages of DDP Over DP:
    • Lower communication overhead.
    • Better scalability across multiple nodes.
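The DDP pieces above can be sketched in one script. This is a minimal illustration, not the repository's exact code: the toy model and dataset are placeholders, and the environment-variable defaults only exist so the script also runs as a single CPU process; under torchrun they are set for you.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # torchrun sets these automatically; the defaults allow a
    # single-process CPU run for illustration.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")

    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)
    rank = dist.get_rank()

    device = torch.device(f"cuda:{rank}" if torch.cuda.is_available() else "cpu")
    model = DDP(nn.Linear(10, 2).to(device))

    # DistributedSampler gives each rank a disjoint shard of the dataset.
    dataset = TensorDataset(torch.randn(128, 10), torch.randint(0, 2, (128,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=16, sampler=sampler)

    loss_fn = nn.CrossEntropyLoss()
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x.to(device)), y.to(device))
            loss.backward()  # DDP all-reduces gradients during backward
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

On a single node with four GPUs this would be launched as, for example, `torchrun --nproc_per_node=4 train_ddp.py`. Because each process owns one GPU and gradients are all-reduced during backward, there is no per-step scatter/gather through a master device as in DataParallel.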

05. Containerized Training with Enroot and NGC Containers

  • Running PyTorch training using NVIDIA Enroot and NGC Containers on HPC.
  • Topics Covered:
    • Importing and running NGC PyTorch containers with Enroot.
    • Running single and multi-GPU PyTorch workloads inside containers.
    • Using SLURM to launch containerized PyTorch jobs on GPU clusters.
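A typical Enroot workflow for those topics might look like the following. The container tag is only an example (pick a current one from the NGC catalog), and the srun form assumes the cluster has the pyxis SLURM plugin installed; without pyxis, the enroot commands can be run inside an ordinary GPU job instead.

```shell
# Pull a PyTorch image from NGC and convert it to an Enroot squashfs.
enroot import docker://nvcr.io#nvidia/pytorch:24.01-py3
enroot create --name pytorch-ngc nvidia+pytorch+24.01-py3.sqsh

# Interactive run inside the container; NVIDIA GPUs are passed through.
enroot start --mount "$PWD:/workspace" pytorch-ngc python train.py

# With the pyxis plugin, SLURM can launch the container directly:
srun --gres=gpu:4 --container-image=nvcr.io#nvidia/pytorch:24.01-py3 \
     --container-mounts="$PWD:/workspace" \
     torchrun --nproc_per_node=4 /workspace/train_ddp.py
```

NGC containers ship with matched PyTorch, CUDA, and NCCL versions, which sidesteps the module-loading step from section 01.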

About

ML in ICTS FTSky