
Distributed Data Parallel in PyTorch - Video Tutorials

Author: Suraj Subramanian

Follow along with the video below or on YouTube.

This series of video tutorials walks you through distributed training in PyTorch via DDP.

The series starts with a simple non-distributed training job, and ends with deploying a training job across several machines in a cluster. Along the way, you will also learn about torchrun for fault-tolerant distributed training.

The tutorial assumes a basic familiarity with model training in PyTorch.
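As a preview, the pattern the series builds toward can be sketched as follows. This is a minimal, hedged sketch, not the tutorial's own scripts (those live in the linked repo): it uses an assumed toy `torch.nn.Linear` model, reads the environment variables that `torchrun` sets for each worker, and falls back to a single CPU process with the `gloo` backend so it can run without GPUs.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker;
    # the defaults below let the script also run as a single process.
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")

    # "nccl" is the usual backend for CUDA GPUs; "gloo" works on CPU.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)

    model = torch.nn.Linear(10, 1)  # toy model for illustration
    if torch.cuda.is_available():
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
        torch.cuda.set_device(local_rank)
        ddp_model = DDP(model.to(local_rank), device_ids=[local_rank])
    else:
        ddp_model = DDP(model)

    # one illustrative training step
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss = ddp_model(torch.randn(32, 10)).sum()
    loss.backward()  # DDP all-reduces gradients across workers here
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched via `torchrun`, the same script runs unchanged on one GPU, several GPUs, or several machines; the later sections fill in the details this sketch glosses over.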

Running the code

You will need multiple CUDA GPUs to run the tutorial code, typically on a cloud instance with multiple GPUs (the tutorials use an Amazon EC2 P3 instance with 4 GPUs).

The tutorial code is hosted at this GitHub repo. Clone the repo and follow along!

Tutorial sections

  1. Introduction (this page)

  2. What is DDP? A gentle introduction to what DDP is doing under the hood

  3. Single-Node Multi-GPU Training: training models using multiple GPUs on a single machine

  4. Fault-Tolerant Distributed Training: making your distributed training job robust with torchrun

  5. Multi-Node Training: training models using multiple GPUs on multiple machines

  6. Training a GPT model with DDP: a "real-world" example of training a minGPT model with DDP


© Copyright 2018-2023, PyTorch & PyTorch Korea User Group (파이토치 한국 사용자 모임).
