Closed-Loop Policy Inference and Evaluation#
Docker Container: Base (see Docker Containers for more details)
./docker/run_docker.sh
Once inside the container, set the models directory if you plan to download pre-trained checkpoints:
export MODELS_DIR=models/isaaclab_arena/reinforcement_learning
mkdir -p $MODELS_DIR
This tutorial assumes you’ve completed Policy Training and have a trained checkpoint; alternatively, download a pre-trained one as described below.
Download Pre-trained Model (skip if you already have a trained checkpoint)
hf download \
nvidia/IsaacLab-Arena-Lift-Object-RL \
model_11999.pt \
--local-dir $MODELS_DIR/lift_object_checkpoint
After downloading, the checkpoint is at:
$MODELS_DIR/lift_object_checkpoint/model_11999.pt
Replace checkpoint paths in the examples below with this path.
Evaluation Methods#
There are three ways to evaluate a trained policy:
- Single environment (policy_runner.py): detailed evaluation with metrics
- Parallel environments (policy_runner.py): larger-scale statistical evaluation
- Batch evaluation (eval_runner.py): automated evaluation across multiple checkpoints
Method 1: Single Environment Evaluation#
python isaaclab_arena/evaluation/policy_runner.py \
--policy_type rsl_rl \
--num_steps 1000 \
--checkpoint_path logs/rsl_rl/generic_experiment/2026-01-28_17-26-10/model_11999.pt \
lift_object \
--rl_training_mode False
Note
If you downloaded the pre-trained model from Hugging Face, replace the checkpoint path with:
$MODELS_DIR/lift_object_checkpoint/model_11999.pt
Policy-specific arguments (--policy_type, --checkpoint_path, etc.) must come before the
environment name. Environment-specific arguments (--rl_training_mode, --object, etc.) must
come after it.
At the end of the run, metrics are printed to the console:
Metrics: {'success_rate': 0.85, 'num_episodes': 12}
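The metrics dictionary is an aggregate over completed episodes. As a rough mental model (a minimal sketch, not the actual policy_runner internals; the `episode_results` list and rounding convention are illustrative):

```python
# Illustrative aggregation of per-episode outcomes into a metrics dict
# like the one printed above. Not the toolkit's actual implementation.
episode_results = [True] * 10 + [False] * 2  # 12 episodes, 10 successes

metrics = {
    "success_rate": round(sum(episode_results) / len(episode_results), 2),
    "num_episodes": len(episode_results),
}
print(metrics)  # -> {'success_rate': 0.83, 'num_episodes': 12}
```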
Method 2: Parallel Environment Evaluation#
For statistically stronger results, run the evaluation across many environments in parallel:
python isaaclab_arena/evaluation/policy_runner.py \
--policy_type rsl_rl \
--num_steps 5000 \
--num_envs 64 \
--checkpoint_path logs/rsl_rl/generic_experiment/2026-01-28_17-26-10/model_11999.pt \
--headless \
lift_object \
--rl_training_mode False
Metrics: {'success_rate': 0.83, 'num_episodes': 156}
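The larger episode count is what makes the parallel run statistically stronger: the uncertainty on a measured success rate shrinks with the number of episodes. A quick way to see this (a standalone sketch using the normal-approximation confidence interval, not part of the toolkit):

```python
import math

def success_rate_ci(successes, n, z=1.96):
    """Approximate 95% confidence interval for a success rate
    (normal approximation to the binomial)."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

# Similar observed rates, very different certainty:
print(success_rate_ci(10, 12))    # 12 episodes  -> wide interval
print(success_rate_ci(129, 156))  # 156 episodes -> much narrower interval
```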
Method 3: Batch Evaluation#
To evaluate multiple checkpoints in sequence, use eval_runner.py with a JSON config.
1. Create an evaluation config
Create a file eval_config.json:
{
"policy_runner_args": {
"policy_type": "rsl_rl",
"num_steps": 5000,
"num_envs": 64,
"headless": true
},
"evaluations": [
{
"checkpoint_path": "logs/rsl_rl/generic_experiment/2026-01-28_17-26-10/model_5999.pt",
"environment": "lift_object",
"environment_args": {
"rl_training_mode": false
}
},
{
"checkpoint_path": "logs/rsl_rl/generic_experiment/2026-01-28_17-26-10/model_11999.pt",
"environment": "lift_object",
"environment_args": {
"rl_training_mode": false
}
}
]
}
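Writing the evaluations list by hand gets tedious once a run produces many checkpoints. A small sketch that generates the same config schema by globbing a run directory (the `run_dir` path is illustrative; adjust it to your own training output):

```python
import glob
import json

# Hypothetical training run directory; replace with your own.
run_dir = "logs/rsl_rl/generic_experiment/2026-01-28_17-26-10"

config = {
    "policy_runner_args": {
        "policy_type": "rsl_rl",
        "num_steps": 5000,
        "num_envs": 64,
        "headless": True,
    },
    "evaluations": [
        {
            "checkpoint_path": path,
            "environment": "lift_object",
            "environment_args": {"rl_training_mode": False},
        }
        # One evaluation entry per saved checkpoint, in training order.
        for path in sorted(glob.glob(f"{run_dir}/model_*.pt"))
    ],
}

with open("eval_config.json", "w") as f:
    json.dump(config, f, indent=2)
```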
2. Run
python isaaclab_arena/evaluation/eval_runner.py --eval_jobs_config eval_config.json
Evaluating checkpoint 1/2: model_5999.pt
Metrics: {'success_rate': 0.72, 'num_episodes': 152}
Evaluating checkpoint 2/2: model_11999.pt
Metrics: {'success_rate': 0.85, 'num_episodes': 156}
Summary:
========================================
model_5999.pt | Success: 72% | Episodes: 152
model_11999.pt | Success: 85% | Episodes: 156
Understanding the Metrics#
The Lift Object task reports two metrics:
- success_rate: fraction of episodes where the object reached the target position within tolerance
- num_episodes: total number of completed episodes during the evaluation run
A well-trained policy should reach 70–90% success rate. Results will vary with the target range, random seed, and hardware.
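The success criterion amounts to a distance check between the object and its target at episode end. As a hedged sketch (the function name and the 0.05 m tolerance are illustrative, not the environment's actual values):

```python
import math

def lift_success(object_pos, target_pos, tolerance=0.05):
    """Illustrative success check: object within `tolerance` meters
    of the target position. Not the environment's real implementation."""
    return math.dist(object_pos, target_pos) <= tolerance

print(lift_success((0.40, 0.00, 0.32), (0.40, 0.02, 0.30)))  # near target -> True
print(lift_success((0.40, 0.00, 0.10), (0.40, 0.02, 0.30)))  # not lifted -> False
```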
Note
Always set --rl_training_mode False when evaluating. During training this flag is True
to disable success termination; setting it to False re-enables it for proper evaluation.