

Training “real-world” models with DDP

Authors: Suraj Subramanian

What you will learn
  • Best practices when writing a distributed training script

  • Increased flexibility with saving/loading artifacts in the cloud

  • When DDP is NOT suitable

View the code used in this tutorial on GitHub


Follow along with the video below or on YouTube.

In this video, we review the process of training a GPT model with multi-node DDP. We first clone the minGPT repo and refactor the Trainer to resemble the structure we have used in this series. Watch the video for details on these changes.

We use Hydra to centrally manage all the configurations for our training run. Once the code has been refactored, we run it first on a single node with 4 GPUs, and then on a Slurm cluster.

Files used for training

  • trainer.py includes the Trainer class that runs the distributed training iterations on the model with the provided dataset.

  • model.py defines the model architecture.

  • char_dataset.py contains the Dataset class for a character-level dataset.

  • gpt2_train_cfg.yaml contains the configurations for data, model, optimizer, and training run.

  • main.py is the entry point to the training job. It sets up the DDP process group, reads all the configurations and runs the training job.
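
As a rough illustration of the first thing an entry point like main.py has to do, the sketch below reads the per-process settings that a launcher exports as environment variables. It assumes the job is started with torchrun, which sets RANK, LOCAL_RANK and WORLD_SIZE. The DistEnv class and read_dist_env function are hypothetical names for this example and are not taken from the tutorial's repository. The process-group setup itself is only indicated in a comment.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    """Hypothetical helper: the per-process settings exported by the launcher."""

    rank: int        # global index of this process across all nodes
    local_rank: int  # index of this process (and its GPU) on the current node
    world_size: int  # total number of processes in the job

    @property
    def is_main(self) -> bool:
        # By convention, rank 0 handles logging and snapshot saving.
        return self.rank == 0


def read_dist_env(environ=os.environ) -> DistEnv:
    # The defaults let the same script run as a plain single-process job.
    return DistEnv(
        rank=int(environ.get("RANK", 0)),
        local_rank=int(environ.get("LOCAL_RANK", 0)),
        world_size=int(environ.get("WORLD_SIZE", 1)),
    )


if __name__ == "__main__":
    env = read_dist_env()
    # A real entry point would now initialize the process group
    # (torch.distributed.init_process_group), load the configuration,
    # and wrap the model in DistributedDataParallel on GPU env.local_rank.
    print(env)
```

Keeping this parsing in one place means the rest of the script can ask env.is_main instead of re-reading environment variables.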

Saving and Loading from the cloud

In the video above, we save training snapshots directly to the cloud. This gives us the flexibility to continue training from any node that has access to the cloud bucket.
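
One possible way to structure this is sketched below: the snapshot is serialized to bytes in memory, and the destination string decides whether those bytes go to local disk or to a bucket. Everything named here is an assumption made for illustration: the function names, the "s3://" URL scheme, the MODEL_STATE and EPOCHS_RUN keys, and the upload/download callables that stand in for a real cloud client. pickle is used only to keep the sketch dependency-free; a PyTorch trainer would typically write into the buffer with torch.save instead.

```python
import io
import pickle
from pathlib import Path
from urllib.parse import urlparse


def serialize_snapshot(model_state: dict, epochs_run: int) -> bytes:
    # Serialize into memory so the same bytes can go to disk or to a bucket.
    buffer = io.BytesIO()
    pickle.dump({"MODEL_STATE": model_state, "EPOCHS_RUN": epochs_run}, buffer)
    return buffer.getvalue()


def _split_cloud_url(location: str):
    # Returns (bucket, key) for "s3://bucket/key", or None for a local path.
    url = urlparse(location)
    if url.scheme == "s3":
        return url.netloc, url.path.lstrip("/")
    return None


def save_snapshot(payload: bytes, destination: str, upload=None) -> str:
    target = _split_cloud_url(destination)
    if target is not None:
        if upload is None:
            raise ValueError("a cloud destination needs an upload callable")
        bucket, key = target
        upload(bucket, key, payload)  # stand-in for a real cloud client call
        return "cloud"
    path = Path(destination)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)
    return "local"


def load_snapshot(source: str, download=None) -> dict:
    target = _split_cloud_url(source)
    if target is not None:
        if download is None:
            raise ValueError("a cloud source needs a download callable")
        payload = download(*target)
    else:
        payload = Path(source).read_bytes()
    # Only unpickle snapshots that your own training job produced.
    return pickle.loads(payload)
```

Because loading goes through the same URL, any node that can reach the bucket can call load_snapshot and resume from the recorded epoch.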

Using Mixed Precision

To speed things up, you might be able to train your models with mixed precision. In mixed precision, some parts of the training process are carried out in reduced precision, while other steps that are more sensitive to precision drops are kept in FP32.
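
To make "sensitive to precision drops" concrete, the sketch below uses only the Python standard library to round numbers to half precision (FP16) and shows two effects: a small weight update that disappears when the weight is held in FP16, and a tiny gradient that underflows to zero unless it is scaled up first. This illustrates the motivation only. It is not how PyTorch implements mixed precision, where torch.autocast and a gradient scaler are the usual tools. The to_fp16 helper and the specific numbers are choices made for this example.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# 1) A small update vanishes when the weight itself is stored in FP16,
#    because adjacent FP16 values near 1.0 are roughly 0.001 apart.
weight, update = 1.0, 1e-4
fp16_result = to_fp16(to_fp16(weight) + to_fp16(update))
# Python floats are 64-bit; they stand in for the FP32 copy here.
full_precision_result = weight + update

# 2) A tiny gradient underflows to zero in FP16, but survives if it is
#    scaled up before the cast and scaled back down afterwards.
grad, scale = 1e-8, 1024.0
unscaled_fp16 = to_fp16(grad)
rescued = to_fp16(grad * scale) / scale

if __name__ == "__main__":
    print("FP16 weight after update:", fp16_result)
    print("full-precision weight after update:", full_precision_result)
    print("gradient cast directly to FP16:", unscaled_fp16)
    print("gradient scaled, cast, then unscaled:", rescued)
```

The first effect is why weight updates are among the steps usually kept in FP32, and the second is the reason gradient scaling exists.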

When is DDP not enough?

A typical training run’s memory footprint consists of model weights, activations, gradients, the input batch, and the optimizer state. Since DDP replicates the model on each GPU, it only works when each GPU has sufficient capacity to accommodate the full footprint. When models grow larger, more aggressive techniques might be useful:

  • Activation checkpointing: Instead of saving intermediate activations during the forward pass, the activations are recomputed during the backward pass. This approach trades extra compute for a smaller memory footprint.

  • Fully-Sharded Data Parallel: Here the model is not replicated but "sharded" across all the GPUs, and computation is overlapped with communication in the forward and backward passes. Read our blog to learn how we trained a 1 trillion parameter model with FSDP.
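
A back-of-the-envelope calculation can show when replication stops fitting. The sketch below estimates only the per-GPU memory for weights, gradients and optimizer state, under assumptions that are not from the tutorial: FP32 storage (4 bytes per value) and an Adam-style optimizer that keeps two extra values per parameter. Activations and the input batch are left out because they depend on batch size and architecture. The "sharded" figure is an idealized even split that ignores the temporary buffers a sharded implementation needs while gathering parameters. The function names and the 1.5 billion parameter example are hypothetical.

```python
GIB = 1024 ** 3


def model_state_bytes(
    num_params: int,
    bytes_per_value: int = 4,             # assumption: FP32 storage
    optimizer_values_per_param: int = 2,  # assumption: Adam-style optimizer
) -> int:
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    optimizer_state = num_params * bytes_per_value * optimizer_values_per_param
    return weights + gradients + optimizer_state


def per_gpu_gib(num_params: int, world_size: int, sharded: bool) -> float:
    total = model_state_bytes(num_params)
    # Replication: every GPU holds everything.
    # Idealized sharding: the state is split evenly across GPUs.
    per_gpu = total / world_size if sharded else total
    return per_gpu / GIB


if __name__ == "__main__":
    params = 1_500_000_000  # hypothetical model size
    for sharded in (False, True):
        label = "sharded" if sharded else "replicated"
        print(f"{label}: {per_gpu_gib(params, world_size=8, sharded=sharded):.1f} GiB per GPU")
```

Under these assumptions the model state costs 16 bytes per parameter on every GPU when replicated, which is the number to compare against the device's memory before activations are even counted.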


© Copyright 2018-2024, PyTorch & PyTorch Korea User Group.
