Multi-GPU and Multi-Node Training#

Isaac Lab supports multi-GPU and multi-node reinforcement learning. Currently, this feature is only available for the RL-Games, RSL-RL, and skrl library workflows. We are working on extending this feature to other workflows.

Attention

Multi-GPU and multi-node training is only supported on Linux. Windows support is not available at this time. This is due to limitations of the NCCL library on Windows.

Multi-GPU Training#

Isaac Lab supports the following multi-GPU training frameworks:

Torchrun through PyTorch distributed
JAX distributed

PyTorch Torchrun Implementation#

We use PyTorch Torchrun to manage multi-GPU training. Torchrun manages the distributed training by:

Process Management: Launching one process per GPU, where each process is assigned to a specific GPU.
Script Execution: Running the same training script (e.g., RL Games trainer) on each process.
Environment Instances: Each process creates its own instance of the Isaac Lab environment.
Gradient Synchronization: Aggregating gradients across all processes and broadcasting the synchronized gradients back to each process after each training step.
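As a rough illustration of the process-management step, each worker launched by Torchrun can discover its identity from the environment variables Torchrun exports for it (RANK, LOCAL_RANK, and WORLD_SIZE). The helper name resolve_worker and the WorkerInfo type below are hypothetical and not part of Isaac Lab; this is only a sketch of how a training script might pick its GPU.

```python
import os
from typing import Mapping, NamedTuple


class WorkerInfo(NamedTuple):
    rank: int         # global index of this process across all nodes
    local_rank: int   # index of this process on its own node
    world_size: int   # total number of processes in the job
    device: str       # device string this process should use


def resolve_worker(env: Mapping[str, str] = os.environ) -> WorkerInfo:
    """Read the variables Torchrun exports; fall back to a single-process setup."""
    rank = int(env.get("RANK", "0"))
    local_rank = int(env.get("LOCAL_RANK", "0"))
    world_size = int(env.get("WORLD_SIZE", "1"))
    # One process per GPU: the local rank doubles as the GPU index.
    return WorkerInfo(rank, local_rank, world_size, f"cuda:{local_rank}")


if __name__ == "__main__":
    # Without Torchrun, this reports a single process on cuda:0.
    print(resolve_worker())
```

Run directly, the script falls back to the single-process defaults; under Torchrun, each process would see its own values.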

Tip

Check out this 3-minute YouTube video from PyTorch to understand how Torchrun works.

The key components in this setup are:

Torchrun: Handles process spawning and sets up the process group used for communication and gradient synchronization.
RL Library: The reinforcement learning library that runs the actual training algorithm.
Isaac Lab: Provides the simulation environment that each process instantiates independently.

Under the hood, the processes launched by Torchrun communicate through PyTorch's torch.distributed package, following the data-parallel design of the DistributedDataParallel module. When training with multiple GPUs using Torchrun, the following happens:

Each GPU runs an independent process
Each process executes the full training script
Each process maintains its own:
- Isaac Lab environment instance (with n parallel environments)
- Policy network copy
- Experience buffer for rollout collection
All processes synchronize only for gradient updates
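The synchronization step in the list above can be sketched without any GPU. The snippet below is a toy, pure-Python stand-in for a gradient all-reduce, not the code any of the RL libraries actually run; the function names all_reduce_mean and sgd_step are made up for illustration. It shows why averaging gradients keeps the per-process policy copies identical.

```python
from typing import List


def all_reduce_mean(per_process_grads: List[List[float]]) -> List[float]:
    """Average gradients element-wise across processes (stand-in for an all-reduce)."""
    world_size = len(per_process_grads)
    return [sum(values) / world_size for values in zip(*per_process_grads)]


def sgd_step(weights: List[float], grads: List[float], lr: float) -> List[float]:
    """Apply one plain gradient-descent update."""
    return [w - lr * g for w, g in zip(weights, grads)]


# Two processes start from identical policy weights ...
weights = [[1.0, 2.0], [1.0, 2.0]]
# ... but compute different gradients from their own rollouts.
local_grads = [[0.5, 1.0], [1.5, 3.0]]

# Synchronization point: every process receives the same averaged gradient.
synced = all_reduce_mean(local_grads)

# Each process applies the same update, so the policy copies stay identical.
weights = [sgd_step(w, synced, lr=0.5) for w in weights]
print(synced, weights)
```

Because every process applies the same averaged gradient to the same starting weights, no separate weight broadcast is needed after each step.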

For a deeper dive into how Torchrun works, check out PyTorch Docs: DistributedDataParallel - Internal Design.

JAX Implementation#

Tip

JAX is only supported with the skrl library.

With JAX, we use skrl.utils.distributed.jax. Since JAX does not automatically start multiple processes from a single program invocation, the skrl library provides this module to start them.

Running Multi-GPU Training#

To train with multiple GPUs, use one of the following commands, where --nproc_per_node sets the number of processes to launch and should match the number of GPUs to use:

rl_games

python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 scripts/reinforcement_learning/train.py --rl_library rl_games --task=Isaac-Cartpole --distributed

rsl_rl

python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 scripts/reinforcement_learning/train.py --rl_library rsl_rl --task=Isaac-Cartpole --distributed

skrl

PyTorch

python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 scripts/reinforcement_learning/train.py --rl_library skrl --task=Isaac-Cartpole --distributed

JAX

python -m skrl.utils.distributed.jax --nnodes=1 --nproc_per_node=2 scripts/reinforcement_learning/train.py --rl_library skrl --task=Isaac-Cartpole --distributed --ml_framework jax
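When the GPU count or library varies between runs, it can help to assemble the Torchrun command programmatically. The sketch below is a hypothetical helper (build_torchrun_command is not an Isaac Lab function); it only recombines the flags shown in the commands above and covers the torch-based launcher, not the JAX one.

```python
import shlex
from typing import List


def build_torchrun_command(rl_library: str, num_gpus: int,
                           task: str = "Isaac-Cartpole") -> List[str]:
    """Assemble the single-node torch launch command as an argument list."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        "python", "-m", "torch.distributed.run",
        "--nnodes=1",
        f"--nproc_per_node={num_gpus}",  # one process per GPU
        "scripts/reinforcement_learning/train.py",
        "--rl_library", rl_library,
        f"--task={task}",
        "--distributed",
    ]


cmd = build_torchrun_command("rsl_rl", 2)
# Print a shell-ready string; pass the list itself to subprocess.run to launch it.
print(shlex.join(cmd))
```

Keeping the command as a list avoids shell-quoting mistakes if it is later passed to subprocess.run.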

Troubleshooting NCCL Errors#

On some Linux multi-GPU systems, distributed training may fail with the error "CUDA error: an illegal memory access was encountered", reported by ProcessGroupNCCL during or shortly after communicator initialization.

If this occurs, try disabling the NCCL shared-memory transport before launching training:

export NCCL_SHM_DISABLE=1

If the issue persists, additional NCCL fallbacks that may help are:

export NCCL_IB_DISABLE=1
export NCCL_ALGO=Ring

Separately, restricting training to a subset of a node’s GPUs with CUDA_VISIBLE_DEVICES (for example, CUDA_VISIBLE_DEVICES=0,1 on a larger machine) can cause training to hang during communicator initialization or on the first collective, with no error reported. On affected systems, disabling NCCL’s peer-to-peer (P2P) transport resolves the hang:

export NCCL_P2P_DISABLE=1

Then relaunch the distributed training command as usual.

Note

These variables are NCCL-level workarounds intended for affected systems. They are not required on all machines, and may change communication behavior or performance depending on the hardware topology. In particular, NCCL_P2P_DISABLE=1 routes inter-GPU traffic through host/shared memory instead of a direct P2P link, which can reduce communication bandwidth, so only set it when you observe a hang while restricting visible devices.
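To keep these workarounds scoped to a single training run rather than the whole shell session, the variables can be set on a copy of the environment that is passed only to the launched process. The following is a minimal sketch under that assumption; nccl_workaround_env is a hypothetical helper name, and only the NCCL variables named in this section are used.

```python
import os
from typing import Dict, Mapping


def nccl_workaround_env(base: Mapping[str, str],
                        disable_shm: bool = True,
                        disable_p2p: bool = False) -> Dict[str, str]:
    """Return a copy of `base` with the selected NCCL workarounds applied."""
    env = dict(base)  # copy, so the caller's environment is left untouched
    if disable_shm:
        # First thing to try for illegal-memory-access errors at init.
        env["NCCL_SHM_DISABLE"] = "1"
    if disable_p2p:
        # Only for hangs seen while restricting CUDA_VISIBLE_DEVICES.
        env["NCCL_P2P_DISABLE"] = "1"
    return env


env = nccl_workaround_env(os.environ)
print({k: v for k, v in env.items() if k.startswith("NCCL_")})
```

The resulting dictionary could then be handed to a launcher call such as subprocess.run(command, env=env), where command is the distributed training command.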

`train_multigpu` Command (Experimental)#

Warning

This command is experimental and subject to change in future releases.

Isaac Lab provides a train_multigpu convenience script that wraps the distributed launchers, adds --distributed automatically, and forwards remaining arguments to the selected training library. It defaults to rsl_rl and uses torch.distributed.run for torch-based workflows.

Single-node training (defaults to all available GPUs):

isaaclab.sh

./isaaclab.sh -p scripts/reinforcement_learning/train_multigpu.py \
   --task Isaac-Reorient-KukaAllegro \
   --num_envs 4096 --max_iterations 100

uv run

uv run isaaclab train_multigpu \
   --task Isaac-Reorient-KukaAllegro \
   --num_envs 4096 --max_iterations 100

Override the GPU count or torchrun settings when needed:

isaaclab.sh

./isaaclab.sh -p scripts/reinforcement_learning/train_multigpu.py \
   --num_gpus 4 --master_port 29504 \
   --task Isaac-Reorient-KukaAllegro \
   --num_envs 4096 --max_iterations 100

uv run

uv run isaaclab train_multigpu --num_gpus 4 --master_port 29504 \
   --task Isaac-Reorient-KukaAllegro \
   --num_envs 4096 --max_iterations 100

Use --rl_library to select other distributed-capable libraries (rsl_rl, rl_games, or skrl). For skrl JAX training, pass an integer GPU count and the --coordinator_address:

isaaclab.sh

./isaaclab.sh -p scripts/reinforcement_learning/train_multigpu.py \
   --rl_library skrl --ml_framework jax --num_gpus 4 \
   --coordinator_address localhost:5000 \
   --task Isaac-Reorient-KukaAllegro \
   --num_envs 4096 --max_iterations 100

uv run

uv run isaaclab train_multigpu --rl_library skrl --ml_framework jax --num_gpus 4 \
   --coordinator_address localhost:5000 \
   --task Isaac-Reorient-KukaAllegro \
   --num_envs 4096 --max_iterations 100

For multi-node torch jobs, pass torchrun settings such as --nnodes, --node_rank, --rdzv_backend, --rdzv_endpoint, and --rdzv_id before the training arguments. For skrl JAX multi-node jobs, pass --nnodes, --node_rank, and --coordinator_address.
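For multi-node torch jobs, every node typically runs the same command and differs only in its --node_rank. The sketch below generates the torchrun settings for each node; the helper name multinode_args, the c10d backend choice, the endpoint 10.0.0.1:29500, and the job id demo-job are all illustrative assumptions, not values prescribed by Isaac Lab.

```python
from typing import List


def multinode_args(nnodes: int, node_rank: int, rdzv_endpoint: str,
                   rdzv_id: str, rdzv_backend: str = "c10d") -> List[str]:
    """Build the torchrun settings that precede the training arguments."""
    if not 0 <= node_rank < nnodes:
        raise ValueError("node_rank must be in [0, nnodes)")
    return [
        "--nnodes", str(nnodes),
        "--node_rank", str(node_rank),
        "--rdzv_backend", rdzv_backend,
        "--rdzv_endpoint", rdzv_endpoint,  # must be reachable from every node
        "--rdzv_id", rdzv_id,              # must be identical on every node
    ]


# Assumed two-node job; the endpoint and id are placeholders.
per_node = [multinode_args(2, rank, "10.0.0.1:29500", "demo-job") for rank in range(2)]
for rank, args in enumerate(per_node):
    print(f"node {rank}:", " ".join(args))
```

The training arguments (for example --task and --num_envs) would follow these settings unchanged on every node.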
