What is Distributed Data Parallel (DDP)

Authors: Suraj Subramanian

What you will learn
  • How DDP works under the hood
  • What the DistributedSampler is
  • How gradients are synchronized across GPUs

Prerequisites

Follow along with the video below or on YouTube.

This tutorial is a gentle introduction to PyTorch DistributedDataParallel (DDP), which enables data-parallel training in PyTorch. Data parallelism processes multiple data batches on multiple devices simultaneously to achieve better performance. In PyTorch, the DistributedSampler ensures that each device gets a non-overlapping input batch. The model is replicated on all the devices; each replica calculates gradients on its own batch and simultaneously synchronizes them with the other replicas using the ring all-reduce algorithm.
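As a rough illustration of the paragraph above, the following plain-Python sketch simulates one data-parallel step in a single process. It is not the DDP implementation: `shard_indices`, `local_gradient`, and `all_reduce_mean` are hypothetical helpers, the strided split only approximates what DistributedSampler does with shuffling turned off, and a simple average stands in for ring all-reduce. It also assumes the dataset size is divisible by the number of replicas.

```python
import math


def shard_indices(num_samples, world_size, rank):
    """Strided split: rank r takes samples r, r + world_size, r + 2 * world_size, ..."""
    return list(range(rank, num_samples, world_size))


def local_gradient(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values):
    """Stand-in for all-reduce: every replica ends up holding the same average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.0 * x for x in xs]  # toy data generated with w = 3
w = 0.5                     # every replica starts from the same weight
world_size = 4              # number of simulated devices

# Each simulated device gets its own non-overlapping slice of the data.
shards = [shard_indices(len(xs), world_size, rank) for rank in range(world_size)]

# Each replica computes a gradient on its own shard only.
local_grads = [
    local_gradient(w, [xs[i] for i in shard], [ys[i] for i in shard])
    for shard in shards
]

# After synchronization, all replicas hold the same averaged gradient.
synced_grads = all_reduce_mean(local_grads)

full_grad = local_gradient(w, xs, ys)
print(shards)                                       # [[0, 4], [1, 5], [2, 6], [3, 7]]
print(math.isclose(synced_grads[0], full_grad))     # True
```

With equally sized shards, the averaged gradient matches the gradient computed over the full batch, so every replica applies the same update and the copies of the model stay identical.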

Why you should prefer DDP over DataParallel (DP)

DataParallel is an older approach to data parallelism. DP is simple to adopt (just one extra line of code), but it performs much worse. DDP improves on its architecture in a few ways:

| DataParallel | DistributedDataParallel |
| --- | --- |
| More overhead; model is replicated and destroyed at each forward pass | Model is replicated only once |
| Only supports single-node parallelism | Supports scaling to multiple machines |
| Slower; uses multithreading on a single process and runs into Global Interpreter Lock (GIL) contention | Faster (no GIL contention) because it uses multiprocessing |
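The difference in setup between the two wrappers can be sketched as follows. This is a minimal sketch under stated assumptions, not the tutorial's own training script: `wrap_with_dp`, `wrap_with_ddp`, and `cleanup_ddp` are hypothetical helper names, the rendezvous address and port are placeholder defaults, and the `gloo` backend is chosen only so the code also runs on CPU (`nccl` is the usual choice for GPU training).

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_with_dp(model: nn.Module) -> nn.Module:
    # DataParallel: a single process and one extra line of code.
    # With several GPUs visible, the model is re-replicated on each forward
    # pass; with no GPU it simply calls the wrapped module.
    if torch.cuda.is_available():
        model = model.cuda()
    return nn.DataParallel(model)


def wrap_with_ddp(model: nn.Module, rank: int, world_size: int) -> nn.Module:
    # DistributedDataParallel: call this once in each of `world_size` processes.
    # Assumption: all processes rendezvous over localhost on a placeholder port.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)
    # The replica is set up once here; gradients are synchronized
    # across processes during backward().
    return DDP(model)


def cleanup_ddp() -> None:
    # Tear down the process group when training is finished.
    dist.destroy_process_group()
```

In a real job, `wrap_with_ddp` would run in one process per GPU, typically started with `torchrun` or `torch.multiprocessing.spawn`, which is where the multiprocessing column of the table comes from.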

Further Reading



© Copyright 2018-2023, PyTorch & PyTorch Korea User Group.
