Multi-GPU and Multi-Node Training#
Isaac Lab supports multi-GPU and multi-node reinforcement learning. Currently, this feature is only available for RL-Games, RSL-RL and skrl libraries workflows. We are working on extending this feature to other workflows.
Attention
Multi-GPU and multi-node training is only supported on Linux. Windows support is not available at this time. This is due to limitations of the NCCL library on Windows.
Multi-GPU Training#
Isaac Lab supports the following multi-GPU training frameworks:
Pytorch Torchrun Implementation#
We are using Pytorch Torchrun to manage multi-GPU training. Torchrun manages the distributed training by:
Process Management: Launching one process per GPU, where each process is assigned to a specific GPU.
Script Execution: Running the same training script (e.g., RL Games trainer) on each process.
Environment Instances: Each process creates its own instance of the Isaac Lab environment.
Gradient Synchronization: Aggregating gradients across all processes and broadcasting the synchronized gradients back to each process after each training step.
Tip
Check out this 3 minute youtube video from PyTorch to understand how Torchrun works.
The key components in this setup are:
Torchrun: Handles process spawning, communication, and gradient synchronization.
RL Library: The reinforcement learning library that runs the actual training algorithm.
Isaac Lab: Provides the simulation environment that each process instantiates independently.
Under the hood, Torchrun uses the DistributedDataParallel module to manage the distributed training. When training with multiple GPUs using Torchrun, the following happens:
Each GPU runs an independent process
Each process executes the full training script
Each process maintains its own:
Isaac Lab environment instance (with n parallel environments)
Policy network copy
Experience buffer for rollout collection
All processes synchronize only for gradient updates
For a deeper dive into how Torchrun works, checkout PyTorch Docs: DistributedDataParallel - Internal Design.
Jax Implementation#
Tip
JAX is only supported with the skrl library.
With JAX, we are using skrl.utils.distributed.jax Since the ML framework doesn’t automatically start multiple processes from a single program invocation, the skrl library provides a module to start them.
Running Multi-GPU Training#
To train with multiple GPUs, use the following command, where --nproc_per_node represents the number of available GPUs:
python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 scripts/reinforcement_learning/train.py --rl_library rl_games --task=Isaac-Cartpole --headless --distributed
python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 scripts/reinforcement_learning/train.py --rl_library rsl_rl --task=Isaac-Cartpole --headless --distributed
python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 scripts/reinforcement_learning/train.py --rl_library skrl --task=Isaac-Cartpole --headless --distributed
python -m skrl.utils.distributed.jax --nnodes=1 --nproc_per_node=2 scripts/reinforcement_learning/train.py --rl_library skrl --task=Isaac-Cartpole --headless --distributed --ml_framework jax
Troubleshooting NCCL Errors#
On some Linux multi-GPU systems, distributed training may fail with
CUDA error: an illegal memory access was encountered reported by ProcessGroupNCCL
during or shortly after communicator initialization.
If this occurs, try disabling the NCCL shared-memory transport before launching training:
export NCCL_SHM_DISABLE=1
If the issue persists, additional NCCL fallbacks that may help are:
export NCCL_IB_DISABLE=1
export NCCL_ALGO=Ring
Separately, restricting training to a subset of a node’s GPUs with CUDA_VISIBLE_DEVICES
(for example, CUDA_VISIBLE_DEVICES=0,1 on a larger machine) can cause training to hang during
communicator initialization or on the first collective, with no error reported. On affected
systems, disabling NCCL’s peer-to-peer (P2P) transport resolves the hang:
export NCCL_P2P_DISABLE=1
Then relaunch the distributed training command as usual.
Note
These variables are NCCL-level workarounds intended for affected systems. They are not
required on all machines, and may change communication behavior or performance depending
on the hardware topology. In particular, NCCL_P2P_DISABLE=1 routes inter-GPU traffic
through host/shared memory instead of a direct P2P link, which can reduce communication
bandwidth, so only set it when you observe a hang while restricting visible devices.
Multi-Node Training#
To scale up training beyond multiple GPUs on a single machine, it is also possible to train across multiple nodes. To train across multiple nodes/machines, it is required to launch an individual process on each node.
For the master node, use the following command, where --nproc_per_node represents the number of available GPUs, and
--nnodes represents the number of nodes:
python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=<ip_of_master> --master_port=5555 scripts/reinforcement_learning/train.py --rl_library rl_games --task=Isaac-Cartpole --headless --distributed
python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=<ip_of_master> --master_port=5555 scripts/reinforcement_learning/train.py --rl_library rsl_rl --task=Isaac-Cartpole --headless --distributed
python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=<ip_of_master> --master_port=5555 scripts/reinforcement_learning/train.py --rl_library skrl --task=Isaac-Cartpole --headless --distributed
python -m skrl.utils.distributed.jax --nproc_per_node=2 --nnodes=2 --node_rank=0 --coordinator_address=ip_of_master_machine:5555 scripts/reinforcement_learning/train.py --rl_library skrl --task=Isaac-Cartpole --headless --distributed --ml_framework jax
Note that the port (5555) can be replaced with any other available port.
For non-master nodes, use the following command, replacing --node_rank with the index of each machine:
python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr=<ip_of_master> --master_port=5555 scripts/reinforcement_learning/train.py --rl_library rl_games --task=Isaac-Cartpole --headless --distributed
python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr=<ip_of_master> --master_port=5555 scripts/reinforcement_learning/train.py --rl_library rsl_rl --task=Isaac-Cartpole --headless --distributed
python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr=<ip_of_master> --master_port=5555 scripts/reinforcement_learning/train.py --rl_library skrl --task=Isaac-Cartpole --headless --distributed
python -m skrl.utils.distributed.jax --nproc_per_node=2 --nnodes=2 --node_rank=1 --coordinator_address=ip_of_master_machine:5555 scripts/reinforcement_learning/train.py --rl_library skrl --task=Isaac-Cartpole --headless --distributed --ml_framework jax
For more details on multi-node training with PyTorch, please visit the PyTorch documentation. For more details on multi-node training with JAX, please visit the skrl documentation and the JAX documentation.
Note
As mentioned in the PyTorch documentation, “multi-node training is bottlenecked by inter-node communication latencies”. When this latency is high, it is possible multi-node training will perform worse than running on a single node instance.
train_multigpu Command (Experimental)#
Warning
This command is experimental and subject to change in future releases.
Isaac Lab provides a train_multigpu convenience script that wraps the distributed launchers,
adds --distributed automatically, and forwards remaining arguments to the selected training library.
It defaults to rsl_rl and uses torch.distributed.run for torch-based workflows.
Single-node training (defaults to all available GPUs):
./isaaclab.sh -p scripts/reinforcement_learning/train_multigpu.py \
--task Isaac-Dexsuite-Kuka-Allegro-Reorient-v0 \
--headless --num_envs 4096 --max_iterations 100
uv run train_multigpu \
--task Isaac-Dexsuite-Kuka-Allegro-Reorient-v0 \
--headless --num_envs 4096 --max_iterations 100
Override the GPU count or torchrun settings when needed:
./isaaclab.sh -p scripts/reinforcement_learning/train_multigpu.py \
--num_gpus 4 --master_port 29504 \
--task Isaac-Dexsuite-Kuka-Allegro-Reorient-v0 \
--headless --num_envs 4096 --max_iterations 100
uv run train_multigpu --num_gpus 4 --master_port 29504 \
--task Isaac-Dexsuite-Kuka-Allegro-Reorient-v0 \
--headless --num_envs 4096 --max_iterations 100
Use --rl_library to select other distributed-capable libraries (rsl_rl, rl_games, or skrl).
For skrl JAX training, pass an integer GPU count and the --coordinator_address:
./isaaclab.sh -p scripts/reinforcement_learning/train_multigpu.py \
--rl_library skrl --ml_framework jax --num_gpus 4 \
--coordinator_address localhost:5000 \
--task Isaac-Dexsuite-Kuka-Allegro-Reorient-v0 \
--headless --num_envs 4096 --max_iterations 100
uv run train_multigpu --rl_library skrl --ml_framework jax --num_gpus 4 \
--coordinator_address localhost:5000 \
--task Isaac-Dexsuite-Kuka-Allegro-Reorient-v0 \
--headless --num_envs 4096 --max_iterations 100
For multi-node torch jobs, pass torchrun settings such as --nnodes, --node_rank,
--rdzv_backend, --rdzv_endpoint, and --rdzv_id before the training arguments. For
skrl JAX multi-node jobs, pass --nnodes, --node_rank, and --coordinator_address.