Closed-Loop Policy Inference and Evaluation#

Docker Container: Base (see Docker Containers for more details)

./docker/run_docker.sh

Once inside the container, set the models directory if you plan to download pre-trained checkpoints:

export MODELS_DIR=models/isaaclab_arena/reinforcement_learning
mkdir -p $MODELS_DIR

This tutorial assumes you have completed Policy Training and have a trained checkpoint; alternatively, you can download a pre-trained one as described below.

Download Pre-trained Model (if you skipped Policy Training)
hf download \
   nvidia/IsaacLab-Arena-Lift-Object-RL \
   model_11999.pt \
   --local-dir $MODELS_DIR/lift_object_checkpoint

After downloading, the checkpoint is at:

$MODELS_DIR/lift_object_checkpoint/model_11999.pt

Replace checkpoint paths in the examples below with this path.

Evaluation Methods#

There are three ways to evaluate a trained policy:

  1. Single environment (policy_runner.py): detailed evaluation with metrics

  2. Parallel environments (policy_runner.py): larger-scale statistical evaluation

  3. Batch evaluation (eval_runner.py): automated evaluation across multiple checkpoints

Method 1: Single Environment Evaluation#

python isaaclab_arena/evaluation/policy_runner.py \
  --policy_type rsl_rl \
  --num_steps 1000 \
  --checkpoint_path logs/rsl_rl/generic_experiment/2026-01-28_17-26-10/model_11999.pt \
  lift_object \
  --rl_training_mode False

Note

If you downloaded the pre-trained model from Hugging Face, replace the checkpoint path with: $MODELS_DIR/lift_object_checkpoint/model_11999.pt

Policy-specific arguments (--policy_type, --checkpoint_path, etc.) must come before the environment name. Environment-specific arguments (--rl_training_mode, --object, etc.) must come after it.

At the end of the run, metrics are printed to the console:

Metrics: {'success_rate': 0.85, 'num_episodes': 12}
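The printed payload is a Python dict literal, so it can be scraped from a captured log with `ast.literal_eval` if you want to post-process results. A minimal sketch (the `parse_metrics` helper is our own, not part of the toolkit):

```python
import ast

def parse_metrics(line: str) -> dict:
    """Extract the metrics dict from a console line such as
    "Metrics: {'success_rate': 0.85, 'num_episodes': 12}"."""
    prefix = "Metrics: "
    if not line.startswith(prefix):
        raise ValueError(f"not a metrics line: {line!r}")
    # The payload after the prefix is a dict literal, which
    # ast.literal_eval parses without executing arbitrary code.
    return ast.literal_eval(line[len(prefix):])

metrics = parse_metrics("Metrics: {'success_rate': 0.85, 'num_episodes': 12}")
print(metrics["success_rate"])  # 0.85
```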

Method 2: Parallel Environment Evaluation#

To obtain statistically more robust results, run the evaluation across many environments in parallel:

python isaaclab_arena/evaluation/policy_runner.py \
  --policy_type rsl_rl \
  --num_steps 5000 \
  --num_envs 64 \
  --checkpoint_path logs/rsl_rl/generic_experiment/2026-01-28_17-26-10/model_11999.pt \
  --headless \
  lift_object \
  --rl_training_mode False
Metrics: {'success_rate': 0.83, 'num_episodes': 156}

Method 3: Batch Evaluation#

To evaluate multiple checkpoints in sequence, use eval_runner.py with a JSON config.

1. Create an evaluation config

Create a file eval_config.json:

{
  "policy_runner_args": {
    "policy_type": "rsl_rl",
    "num_steps": 5000,
    "num_envs": 64,
    "headless": true
  },
  "evaluations": [
    {
      "checkpoint_path": "logs/rsl_rl/generic_experiment/2026-01-28_17-26-10/model_5999.pt",
      "environment": "lift_object",
      "environment_args": {
        "rl_training_mode": false
      }
    },
    {
      "checkpoint_path": "logs/rsl_rl/generic_experiment/2026-01-28_17-26-10/model_11999.pt",
      "environment": "lift_object",
      "environment_args": {
        "rl_training_mode": false
      }
    }
  ]
}
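If a training run produced many checkpoints, writing this config by hand gets tedious. A short sketch that generates it from the checkpoint files in a run directory, following the schema above (the run directory path is an example; adjust it to your own logs):

```python
import json
from pathlib import Path

# Example run directory; replace with your own training log path.
run_dir = Path("logs/rsl_rl/generic_experiment/2026-01-28_17-26-10")

config = {
    "policy_runner_args": {
        "policy_type": "rsl_rl",
        "num_steps": 5000,
        "num_envs": 64,
        "headless": True,
    },
    # One evaluation entry per model_*.pt checkpoint found in the run.
    "evaluations": [
        {
            "checkpoint_path": str(ckpt),
            "environment": "lift_object",
            "environment_args": {"rl_training_mode": False},
        }
        for ckpt in sorted(run_dir.glob("model_*.pt"))
    ],
}

Path("eval_config.json").write_text(json.dumps(config, indent=2))
```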

2. Run

python isaaclab_arena/evaluation/eval_runner.py --eval_jobs_config eval_config.json
Evaluating checkpoint 1/2: model_5999.pt
Metrics: {'success_rate': 0.72, 'num_episodes': 152}

Evaluating checkpoint 2/2: model_11999.pt
Metrics: {'success_rate': 0.85, 'num_episodes': 156}

Summary:
========================================
model_5999.pt  | Success: 72% | Episodes: 152
model_11999.pt | Success: 85% | Episodes: 156

Understanding the Metrics#

The Lift Object task reports two metrics:

  • success_rate: fraction of episodes where the object reached the target position within tolerance

  • num_episodes: total number of completed episodes during the evaluation run

A well-trained policy should reach 70–90% success rate. Results will vary with the target range, random seed, and hardware.
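As a concrete illustration of how the two metrics relate (the episode outcomes below are made up, not real evaluation data):

```python
# True = the object reached the target position within tolerance.
episode_successes = [True, True, False, True, True, True, False, True, True, True]

num_episodes = len(episode_successes)
success_rate = sum(episode_successes) / num_episodes

print({"success_rate": success_rate, "num_episodes": num_episodes})
# {'success_rate': 0.8, 'num_episodes': 10}
```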

Note

Always set --rl_training_mode False when evaluating. During training this flag is True to disable success termination; setting it to False re-enables it for proper evaluation.