Evaluation Types#
Isaac Lab Arena supports three main ways to run policy evaluation: a single-job policy runner (single or multi-GPU), a sequential batch eval runner for multiple jobs in one process, and a server–client setup for remote policies. This section summarizes when to use each and how they work. Each section below links to the relevant concept docs: Policy Design, Environment Design, Metrics Design, and Remote Policies Design.
Summary#
Type |
Use case |
Entry point |
Multi-GPU |
|---|---|---|---|
Policy runner |
Single job, one env config, one policy |
|
Yes (torchrun) |
Sequential batch eval runner |
Multiple jobs (env/policy combos) in sequence |
|
No |
Server–client |
Policy runs in separate process/machine |
Policy runner + remote server |
Depends on client |
1. Policy runner — single job (single GPU and multi-GPU)#
The policy runner (isaaclab_arena/evaluation/policy_runner.py) runs one
evaluation job: one environment configuration and one policy. It is the right
choice for ad-hoc runs, debugging, or when you want to drive one scenario with
full control over CLI arguments.
Design context: For how policies are defined and integrated with environments, see Policy Design.
Features:
Single environment configuration (scene, embodiment, task) and one policy.
Heterogeneous objects: When the environment supports it, you can pass
--object_setwith a space-separated list of object names. Each parallel environment is assigned a different object from the set (e.g. env 0 gets the first object, env 1 the second, etc.). This allows evaluating one policy across multiple object types in a single run without changing the scene or task logic.Run length by steps (
--num_steps) or episodes (--num_episodes); policies that define a length (e.g.policy.has_length()) can override this.Single GPU: one process, one Isaac Sim instance.
Multi-GPU: use
torchrunwith--distributed; one process per GPU, each with its own Isaac Sim instance and device (e.g.cuda:0,cuda:1).Metrics are computed at the end if the environment registers metrics and are logged to the console.
Single-GPU example
python isaaclab_arena/evaluation/policy_runner.py \
--policy_type <policy_type> \
--num_steps 2000 \
--num_envs 10 \
<arena_environment> \
--embodiment <embodiment> \
--object <object>
...
Heterogeneous objects example (single or multi-GPU)
Use --object_set so each of the --num_envs parallel environments gets a
different object from the list. Object-to-environment mapping: with the default
deterministic assignment, environment \(i\) gets the object at index
\(i \\mod n\) in the list (where \(n\) = len(object_set))—so when
num_envs > len(object_set) the assignment cycles (no truncation or
error). If the object set is created with random_choice=True, each environment
gets a randomly chosen object from the set. Some environments may require
num_envs == len(object_set).
python isaaclab_arena/evaluation/policy_runner.py \
--policy_type <policy_type> \
--num_steps 2000 \
--num_envs 4 \
--enable_cameras \
put_item_in_fridge_and_close_door \
--embodiment gr1_joint \
--object_set ketchup_bottle_hope_robolab ranch_dressing_hope_robolab bbq_sauce_bottle_hope_robolab mayonnaise_bottle_hope_robolab
Multi-GPU example
Use torch.distributed.run (or torchrun) with --nproc_per_node=<num_gpus>
and pass --distributed so each process uses a different GPU (via LOCAL_RANK):
python -m torch.distributed.run --nnode=1 --nproc_per_node=<num_gpus> \
isaaclab_arena/evaluation/policy_runner.py \
--policy_type <policy_type> \
--num_steps 2000 \
--num_envs 10 \
--distributed \
--headless \
<arena_environment> \
...
Policy runner CLI (relevant flags)
--policy_type: Registered policy name or dotted path to policy class (e.g.module.submodule.ClassName).--num_steps: Total simulation steps (mutually exclusive with--num_episodes).--num_episodes: Total episodes (mutually exclusive with--num_steps).--distributed: Enable distributed mode; use withtorchrunand set device per rank (e.g.cuda:{local_rank}).
The rest of the arguments (environment, embodiment, object, etc.) come from the
Arena environments CLI and the policy’s own add_args_to_parser.
2. Sequential batch eval runner — batch jobs#
The sequential batch eval runner (isaaclab_arena/evaluation/eval_runner.py)
runs a batch of evaluation jobs sequentially in a single process. Each job can have
a different environment (scene/object/embodiment), policy type, policy config,
and length (steps or episodes). This is suited for benchmarking many
configurations (e.g. many objects or tasks) without launching multiple processes
by hand. Persistence of the simulation application is maintained between jobs.
Design context: For how environments are composed and how metrics are defined and computed, see Environment Design and Metrics Design. Policies used per job follow Policy Design.
Features:
One JSON config file (
--eval_jobs_config) listing all jobs.Jobs run one after another; each job builds its environment, creates the policy from the job config, runs
rollout_policy, then tears down the env before the next job.If a job fails, the runner continues with the next job and marks the failed job accordingly.
Metrics are aggregated and printed at the end (e.g. via
MetricsLogger).Distributed evaluation is not supported: the sequential batch eval runner runs in a single process. For multi-GPU, use multiple policy runner invocations (e.g. with
torchrun) or split the batch across machines.
Todo
Experiment with distributed evaluation in the sequential batch eval runner.
Jobs config format
The config file must be a JSON object with a "jobs" array. Each job is an
object with:
name: Unique job name (for logging and metrics).arena_env_args: Environment arguments as a dict (e.g.environment,num_envs,object,embodiment,enable_cameras, etc.). Converted internally to the same CLI-style list the policy runner uses.policy_type: Same as policy runner (registered name or dotted class path).policy_config_dict: Policy configuration (e.g. checkpoint path, model options). Used withPolicyBase.from_dictif the policy has aconfig_class, otherwise converted to CLI args andfrom_args.num_stepsornum_episodes(optional): Simulation length for this job. If both are omitted, the runner uses the policy’s length if defined, or a CLI default (e.g.--num_steps).
Example config structure
{
"jobs": [
{
"name": "gr1_open_microwave_cracker_box",
"arena_env_args": {
"environment": "gr1_open_microwave",
"object": "cracker_box",
"embodiment": "gr1_joint",
"num_envs": 4
},
"num_steps": 500,
"policy_type": "zero_action",
"policy_config_dict": {}
},
{
"name": "gr1_sequential_static_manipulation_put_ranch_dressing_bottle_in_fridge_and_close_door",
"arena_env_args": {
"enable_cameras": true,
"environment": "gr1_sequential_static_manipulation",
"object": "ranch_dressing_hope_robolab",
"embodiment": "gr1_joint"
},
"num_steps": 100,
"policy_type": "isaaclab_arena_gr00t.policy.gr00t_closedloop_policy.Gr00tClosedloopPolicy",
"policy_config_dict": {
"policy_config_yaml_path": "isaaclab_arena_gr00t/policy/config/gr1_manip_ranch_bottle_gr00t_closedloop_config.yaml",
"policy_device": "cuda:0"
}
}
]
}
Running the sequential batch eval runner
python isaaclab_arena/evaluation/eval_runner.py \
--eval_jobs_config path/to/eval_jobs_config.json \
--num_steps 1000
If any job needs cameras, set enable_cameras: true in that job’s
arena_env_args; the sequential batch eval runner automatically enables camera support if any job requires it.
3. Server–client (remote policies)#
When the policy runs in a separate process or machine (e.g. a GPU server with a large model), evaluation still uses the policy runner on the client side, but the policy is a client-side remote policy that talks to a remote policy server over the network.
Design context: For the full remote policy design, protocol, and how to implement custom server-side and client-side policies, see Remote Policies Design.
Features:
Server: Runs the model and a
ServerSidePolicy(e.g. GR00T), often started viaremote_policy_server_runneror a wrapper script (e.g.docker/run_gr00t_server.sh). No Isaac Sim on the server.Client: Runs Isaac Lab Arena and the simulation; the policy is a
ClientSidePolicy(e.g.ActionChunkingClientSidePolicy) that packs observations, sends them to the server, and applies returned actions (with optional chunking or post-processing).Protocol: Server and client agree on an
ActionProtocol(e.g. observation keys, action shape, chunk length). The protocol is negotiated at handshake; no policy logic lives in the protocol.
Typical workflow
Start the remote policy server (separate terminal or machine), e.g. with GR00T:
bash docker/run_gr00t_server.sh \ --host 127.0.0.1 \ --port 5555 \ --policy_type isaaclab_arena_gr00t.policy.gr00t_remote_policy.Gr00tRemoteServerSidePolicy \ --policy_config_yaml_path isaaclab_arena_gr00t/policy/config/gr1_manip_ranch_bottle_gr00t_closedloop_config.yaml
Run evaluation on the client with a client-side policy (i.e.
ActionChunkingClientSidePolicy) and remote connection. Within theBasecontainer, run:python isaaclab_arena/evaluation/policy_runner.py \ --policy_type isaaclab_arena.policy.action_chunking_client.ActionChunkingClientSidePolicy \ --remote_host 127.0.0.1 \ --remote_port 5555 \ --num_steps 2000 \ --num_envs 10 \ --remote_kill_on_exit \ <arena_environment> \ --embodiment <embodiment> \ ...
The same policy runner is used as in the single-job case; only the policy type and remote options change.
Choosing an evaluation type#
One-off run, one setup: use the policy runner (single or multi-GPU); use
--object_setfor heterogeneous objects in one run.Many env/policy combinations in one go: use the sequential batch eval runner with a jobs JSON; use
--object_setfor heterogeneous objects in one run.Heavy model on another machine or process: use server–client with the policy runner on the client and a remote policy server.