
Multinode Training

Author: Suraj Subramanian

What you will learn
  • Launching multinode training jobs with torchrun

  • Code changes (and things to keep in mind) when moving from single-node to multinode training.

View the code used in this tutorial on GitHub

Prerequisites

  • Familiarity with multi-GPU training and torchrun

  • 2 or more TCP-reachable GPU machines (this tutorial uses AWS p3.2xlarge instances)

  • PyTorch installed with CUDA on all machines

Follow along with the video below or on YouTube.

Multinode training involves deploying a training job across several machines. There are two ways to do this:

  • running a torchrun command on each machine with identical rendezvous arguments, or

  • deploying it on a compute cluster using a workload manager (like SLURM)

In this video we will go over the (minimal) code changes required to move from single-node multi-GPU to multinode training, and run our training script in both of the above ways.

Note that multinode training is bottlenecked by inter-node communication latencies. Running a training job on 4 GPUs on a single node will be faster than running it on 4 nodes with 1 GPU each.

Local and Global ranks

In single-node settings, we were tracking the gpu_id of each device running our training process. torchrun tracks this value in an environment variable LOCAL_RANK which uniquely identifies each GPU-process on a node. For a unique identifier across all the nodes, torchrun provides another variable RANK which refers to the global rank of a process.
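As a minimal sketch, a training script can read these torchrun-provided environment variables at startup (LOCAL_RANK, RANK, and WORLD_SIZE are all set by torchrun; the helper name here is illustrative):

```python
import os

def get_ranks():
    # torchrun sets these environment variables for every worker process
    local_rank = int(os.environ["LOCAL_RANK"])  # unique GPU/process index on this node
    global_rank = int(os.environ["RANK"])       # unique process index across all nodes
    world_size = int(os.environ["WORLD_SIZE"])  # total number of processes in the job
    return local_rank, global_rank, world_size

# In a DDP script, the device for this process is chosen with the local rank,
# e.g. torch.cuda.set_device(local_rank)
```

The local rank selects the GPU on each machine, while the global rank is what you would use for job-wide coordination such as logging from a single process.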


Do not use RANK for critical logic in your training job. When torchrun restarts processes after a failure or a membership change, there is no guarantee that a given process will keep the same LOCAL_RANK and RANK.
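One way to stay robust to rank reassignment is to keep all restart-critical state in a shared snapshot file rather than in rank-dependent logic. The sketch below uses plain pickle only to keep the example dependency-free; in a real training script you would snapshot model.state_dict() with torch.save, and the function names are illustrative:

```python
import os
import pickle

def save_snapshot(state, path="snapshot.pkl"):
    # Let exactly one process write the shared snapshot. Which physical process
    # holds rank 0 may change across restarts, and that is fine, because the
    # snapshot itself carries all the state needed to resume.
    if int(os.environ["RANK"]) == 0:
        with open(path, "wb") as f:
            pickle.dump(state, f)

def load_snapshot(path="snapshot.pkl"):
    # Every process, whatever rank it now holds, resumes from the same file.
    with open(path, "rb") as f:
        return pickle.load(f)
```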

Heterogeneous Scaling

Torchrun supports heterogeneous scaling, i.e., each of your multinode machines can have a different number of GPUs participating in the training job. In the video, I deployed the code on 2 machines, where one machine used 4 GPUs and the other used only 2 GPUs.
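The process count follows directly from this: WORLD_SIZE is the sum of each node's --nproc_per_node, and global RANKs run from 0 to WORLD_SIZE - 1. A tiny sketch of that bookkeeping, assuming (for illustration) that ranks are assigned in contiguous per-node blocks, with a hypothetical helper name:

```python
def world_layout(procs_per_node):
    # e.g. [4, 2] for the two machines used in the video
    world_size = sum(procs_per_node)
    # contiguous blocks of global ranks, one block per node
    ranks, start = [], 0
    for n in procs_per_node:
        ranks.append(list(range(start, start + n)))
        start += n
    return world_size, ranks
```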


Troubleshooting

  • Ensure that your nodes are able to communicate with each other over TCP.

  • Set the environment variable NCCL_DEBUG to INFO (using export NCCL_DEBUG=INFO) to print verbose logs that can help diagnose connectivity issues.

  • Sometimes you might need to explicitly set the network interface for the distributed backend (export NCCL_SOCKET_IFNAME=eth0).
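These variables can also be set from inside the training script, as long as that happens before the process group (and hence the NCCL communicator) is created; a minimal sketch, where eth0 is only an example interface name:

```python
import os

# Must run before torch.distributed.init_process_group(backend="nccl"),
# i.e. before any NCCL communicator exists. setdefault lets a value
# exported in the shell take precedence over these in-script defaults.
os.environ.setdefault("NCCL_DEBUG", "INFO")          # verbose NCCL logging
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # pick the NIC your nodes share
```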



© Copyright 2018-2024, PyTorch & PyTorch Korea User Group.
