Evaluation in Arena#

Docker Container: Base (see Installation for more details)

./docker/run_docker.sh

Once inside the container, set the models directory:

export MODELS_DIR=/models/isaaclab_arena/dexsuite_lift
mkdir -p $MODELS_DIR

This step evaluates a checkpoint using Arena’s dexsuite_lift environment. Pass --presets newton to use Newton physics (recommended when the checkpoint was trained with Newton).

Download the pre-trained model (to skip training):

hf download \
  nvidia/Arena-Dexsuite-Lift-RL-Newton-Task \
  --local-dir $MODELS_DIR

After downloading, the checkpoint is at:

$MODELS_DIR/model_14999.pt

Note

If you trained locally (see Policy Training (Isaac Lab)), your checkpoints are at:

logs/rsl_rl/dexsuite_kuka_allegro/<timestamp>/model_<iter>.pt

Replace the checkpoint paths in the examples below accordingly.

Single Environment Evaluation#

python isaaclab_arena/evaluation/policy_runner.py \
  --viz newton \
  --presets newton \
  --policy_type rsl_rl \
  --num_steps 800 \
  --checkpoint_path $MODELS_DIR/model_14999.pt \
  dexsuite_lift

At the end of the run, metrics are printed to the console:

Metrics: {'success_rate': 0.75, 'num_episodes': 12}
[Animation: dexsuite_lift task]
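If you script over many runs, the printed `Metrics:` line can be parsed back into a dictionary. A minimal sketch (the `parse_metrics` helper is hypothetical, not part of Arena; adjust to your logging setup):

```python
import ast

def parse_metrics(line: str) -> dict:
    """Parse the `Metrics: {...}` line printed at the end of a run."""
    prefix = "Metrics: "
    # Everything after the prefix is a Python dict literal.
    payload = line[line.index(prefix) + len(prefix):]
    return ast.literal_eval(payload)

m = parse_metrics("Metrics: {'success_rate': 0.75, 'num_episodes': 12}")
```

`ast.literal_eval` safely evaluates the dict literal without executing arbitrary code, which makes it suitable for scraping log files.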

Tip

You can also evaluate a Newton-trained model using PhysX:

python isaaclab_arena/evaluation/policy_runner.py \
  --viz kit \
  --policy_type rsl_rl \
  --num_steps 800 \
  --checkpoint_path $MODELS_DIR/model_14999.pt \
  dexsuite_lift

However, model behaviour may differ significantly when training and evaluation use different physics backends: the model above, trained with Newton, completely fails to grasp or lift the cube when evaluated with PhysX.

Parallel Environment Evaluation#

For statistically meaningful results, run the evaluation across many environments in parallel:

python isaaclab_arena/evaluation/policy_runner.py \
  --presets newton \
  --policy_type rsl_rl \
  --num_steps 5000 \
  --num_envs 64 \
  --env_spacing 3 \
  --checkpoint_path $MODELS_DIR/model_14999.pt \
  dexsuite_lift

Metrics: {'success_rate': 0.72, 'num_episodes': 320}
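With hundreds of episodes you can also attach an uncertainty estimate to the reported success rate. This is generic statistics, not part of Arena; a sketch using the Wilson score interval for a binomial proportion:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial success rate."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# e.g. roughly the run above: 230 successes over 320 episodes (~0.72)
lo, hi = wilson_interval(successes=230, n=320)
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")
```

For 320 episodes the interval is still about ±5 percentage points wide, which is why the parallel run is preferable to the 12-episode single-environment run when comparing checkpoints.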

Batch Evaluation#

To evaluate multiple checkpoints in sequence, use eval_runner.py with a JSON config.

1. Create an evaluation config

Create a file eval_config.json:

{
  "jobs": [
    {
      "name": "dexsuite_lift_7500",
      "arena_env_args": {
        "environment": "dexsuite_lift",
        "num_envs": 64,
        "env_spacing": 3
      },
      "num_steps": 5000,
      "policy_type": "rsl_rl",
      "policy_config_dict": {
        "checkpoint_path": "models/isaaclab_arena/dexsuite_lift/model_7500.pt"
      }
    },
    {
      "name": "dexsuite_lift_14999",
      "arena_env_args": {
        "environment": "dexsuite_lift",
        "num_envs": 64,
        "env_spacing": 3
      },
      "num_steps": 5000,
      "policy_type": "rsl_rl",
      "policy_config_dict": {
        "checkpoint_path": "models/isaaclab_arena/dexsuite_lift/model_14999.pt"
      }
    }
  ]
}
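When sweeping many checkpoints, writing the config by hand gets tedious. A sketch that generates an equivalent `eval_config.json` programmatically, using the same field names as the example above (the checkpoint iteration list is illustrative):

```python
import json

# Checkpoint iterations to evaluate; paths follow the example config above.
checkpoints = [7500, 14999]

jobs = []
for it in checkpoints:
    jobs.append({
        "name": f"dexsuite_lift_{it}",
        "arena_env_args": {
            "environment": "dexsuite_lift",
            "num_envs": 64,
            "env_spacing": 3,
        },
        "num_steps": 5000,
        "policy_type": "rsl_rl",
        "policy_config_dict": {
            "checkpoint_path": f"models/isaaclab_arena/dexsuite_lift/model_{it}.pt"
        },
    })

with open("eval_config.json", "w") as f:
    json.dump({"jobs": jobs}, f, indent=2)
```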

2. Run

python isaaclab_arena/evaluation/eval_runner.py --presets newton --eval_jobs_config eval_config.json

Understanding the Metrics#

The dexsuite_lift task reports:

  • success_rate: the fraction of episodes in which the object reached the target position within a 5 cm tolerance.

  • num_episodes: total number of completed episodes.
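The exact success criterion lives inside the task implementation; as a minimal sketch of how such metrics could be computed from final object and target positions (function names here are hypothetical, not Arena APIs):

```python
import numpy as np

def episode_success(object_pos: np.ndarray, target_pos: np.ndarray, tol: float = 0.05) -> bool:
    """Success if the object's final position is within `tol` metres of the target."""
    return bool(np.linalg.norm(object_pos - target_pos) <= tol)

def compute_metrics(final_object_pos: np.ndarray, target_pos: np.ndarray, tol: float = 0.05) -> dict:
    """Aggregate per-episode outcomes into the reported metrics dict.

    `final_object_pos` has shape (num_episodes, 3); `target_pos` has shape (3,).
    """
    dists = np.linalg.norm(final_object_pos - target_pos, axis=-1)
    successes = dists <= tol  # 5 cm tolerance by default
    return {"success_rate": float(successes.mean()), "num_episodes": int(len(successes))}
```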