Creating a Manager-Based RL Environment#
Having learnt how to create a base environment in Creating a Manager-Based Base Environment, we will now look at how to create a manager-based task environment for reinforcement learning.
The base environment is designed as a sense-act environment where the agent can send commands to the environment
and receive observations from it. This minimal interface is sufficient for many applications such as
traditional motion planning and control. However, many applications require a task specification, which often
serves as the learning objective for the agent. For instance, in a navigation task, the agent may be required to
reach a goal location. To this end, we use the envs.ManagerBasedRLEnv class which extends the base environment
to include a task specification.
Similar to other components in Isaac Lab, instead of directly modifying the base class envs.ManagerBasedRLEnv, we
encourage users to simply implement a configuration envs.ManagerBasedRLEnvCfg for their task environment.
This practice allows us to separate the task specification from the environment implementation, making it easier
to reuse components of the same environment for different tasks.
In this tutorial, we will configure the cartpole environment using the envs.ManagerBasedRLEnvCfg to create a manager-based task
for balancing the pole upright. We will learn how to specify the task using reward terms, termination criteria,
curriculum and commands.
The Code#
For this tutorial, we use the cartpole environment defined in isaaclab_tasks.manager_based.classic.cartpole module.
Code for cartpole_env_cfg.py
# Copyright (c) 2022-2026, The Isaac Lab Project Developers (https://github.com/isaac-sim/IsaacLab/blob/main/CONTRIBUTORS.md).
# All rights reserved.
#
# SPDX-License-Identifier: BSD-3-Clause

import math

from isaaclab_newton.physics import MJWarpSolverCfg, NewtonCfg
from isaaclab_physx.physics import PhysxCfg

import isaaclab.sim as sim_utils
from isaaclab.assets import ArticulationCfg, AssetBaseCfg
from isaaclab.envs import ManagerBasedRLEnvCfg
from isaaclab.managers import EventTermCfg as EventTerm
from isaaclab.managers import ObservationGroupCfg as ObsGroup
from isaaclab.managers import ObservationTermCfg as ObsTerm
from isaaclab.managers import RewardTermCfg as RewTerm
from isaaclab.managers import SceneEntityCfg
from isaaclab.managers import TerminationTermCfg as DoneTerm
from isaaclab.scene import InteractiveSceneCfg
from isaaclab.utils import configclass

import isaaclab_tasks.manager_based.classic.cartpole.mdp as mdp
from isaaclab_tasks.utils import PresetCfg

##
# Pre-defined configs
##
from isaaclab_assets.robots.cartpole import CARTPOLE_CFG  # isort:skip


##
# Physics backend presets
##


@configclass
class CartpolePhysicsCfg(PresetCfg):
    default: PhysxCfg = PhysxCfg()
    physx: PhysxCfg = PhysxCfg()
    newton: NewtonCfg = NewtonCfg(
        solver_cfg=MJWarpSolverCfg(
            njmax=5,
            nconmax=3,
            cone="pyramidal",
            impratio=1,
            integrator="implicitfast",
        ),
        num_substeps=1,
        debug_mode=False,
        use_cuda_graph=True,
    )


##
# Scene definition
##


@configclass
class CartpoleSceneCfg(InteractiveSceneCfg):
    """Configuration for a cart-pole scene."""

    # ground plane
    ground = AssetBaseCfg(
        prim_path="/World/ground",
        spawn=sim_utils.GroundPlaneCfg(size=(100.0, 100.0)),
    )

    # cartpole
    robot: ArticulationCfg = CARTPOLE_CFG.replace(prim_path="{ENV_REGEX_NS}/Robot")

    # lights
    dome_light = AssetBaseCfg(
        prim_path="/World/DomeLight",
        spawn=sim_utils.DomeLightCfg(color=(0.9, 0.9, 0.9), intensity=500.0),
    )


##
# MDP settings
##


@configclass
class ActionsCfg:
    """Action specifications for the MDP."""

    joint_effort = mdp.JointEffortActionCfg(asset_name="robot", joint_names=["slider_to_cart"], scale=100.0)


@configclass
class ObservationsCfg:
    """Observation specifications for the MDP."""

    @configclass
    class PolicyCfg(ObsGroup):
        """Observations for policy group."""

        # observation terms (order preserved)
        joint_pos_rel = ObsTerm(func=mdp.joint_pos_rel)
        joint_vel_rel = ObsTerm(func=mdp.joint_vel_rel)

        def __post_init__(self) -> None:
            self.enable_corruption = False
            self.concatenate_terms = True

    # observation groups
    policy: PolicyCfg = PolicyCfg()


@configclass
class EventCfg:
    """Configuration for events."""

    # reset
    reset_cart_position = EventTerm(
        func=mdp.reset_joints_by_offset,
        mode="reset",
        params={
            "asset_cfg": SceneEntityCfg("robot", joint_names=["slider_to_cart"]),
            "position_range": (-1.0, 1.0),
            "velocity_range": (-0.5, 0.5),
        },
    )

    reset_pole_position = EventTerm(
        func=mdp.reset_joints_by_offset,
        mode="reset",
        params={
            "asset_cfg": SceneEntityCfg("robot", joint_names=["cart_to_pole"]),
            "position_range": (-0.25 * math.pi, 0.25 * math.pi),
            "velocity_range": (-0.25 * math.pi, 0.25 * math.pi),
        },
    )


@configclass
class RewardsCfg:
    """Reward terms for the MDP."""

    # (1) Constant running reward
    alive = RewTerm(func=mdp.is_alive, weight=1.0)
    # (2) Failure penalty
    terminating = RewTerm(func=mdp.is_terminated, weight=-2.0)
    # (3) Primary task: keep pole upright
    pole_pos = RewTerm(
        func=mdp.joint_pos_target_l2,
        weight=-1.0,
        params={"asset_cfg": SceneEntityCfg("robot", joint_names=["cart_to_pole"]), "target": 0.0},
    )
    # (4) Shaping tasks: lower cart velocity
    cart_vel = RewTerm(
        func=mdp.joint_vel_l1,
        weight=-0.01,
        params={"asset_cfg": SceneEntityCfg("robot", joint_names=["slider_to_cart"])},
    )
    # (5) Shaping tasks: lower pole angular velocity
    pole_vel = RewTerm(
        func=mdp.joint_vel_l1,
        weight=-0.005,
        params={"asset_cfg": SceneEntityCfg("robot", joint_names=["cart_to_pole"])},
    )


@configclass
class TerminationsCfg:
    """Termination terms for the MDP."""

    # (1) Time out
    time_out = DoneTerm(func=mdp.time_out, time_out=True)
    # (2) Cart out of bounds
    cart_out_of_bounds = DoneTerm(
        func=mdp.joint_pos_out_of_manual_limit,
        params={"asset_cfg": SceneEntityCfg("robot", joint_names=["slider_to_cart"]), "bounds": (-3.0, 3.0)},
    )


##
# Environment configuration
##


@configclass
class CartpoleEnvCfg(ManagerBasedRLEnvCfg):
    """Configuration for the cartpole environment."""

    # Scene settings
    scene: CartpoleSceneCfg = CartpoleSceneCfg(num_envs=4096, env_spacing=4.0, clone_in_fabric=True)
    # Basic settings
    observations: ObservationsCfg = ObservationsCfg()
    actions: ActionsCfg = ActionsCfg()
    events: EventCfg = EventCfg()
    # MDP settings
    rewards: RewardsCfg = RewardsCfg()
    terminations: TerminationsCfg = TerminationsCfg()

    # Post initialization
    def __post_init__(self) -> None:
        """Post initialization."""
        # general settings
        self.decimation = 2
        self.episode_length_s = 5
        # viewer settings
        self.viewer.eye = (8.0, 0.0, 5.0)
        # simulation settings
        self.sim.dt = 1 / 120
        self.sim.render_interval = self.decimation
        self.sim.physics = CartpolePhysicsCfg()
The script for running the environment run_cartpole_rl_env.py is present in the
isaaclab/scripts/tutorials/03_envs directory. The script is similar to the
cartpole_base_env.py script in the previous tutorial, except that it uses the
envs.ManagerBasedRLEnv instead of the envs.ManagerBasedEnv.
Code for run_cartpole_rl_env.py
# Copyright (c) 2022-2026, The Isaac Lab Project Developers (https://github.com/isaac-sim/IsaacLab/blob/main/CONTRIBUTORS.md).
# All rights reserved.
#
# SPDX-License-Identifier: BSD-3-Clause

"""
This script demonstrates how to run the RL environment for the cartpole balancing task.

.. code-block:: bash

    ./isaaclab.sh -p scripts/tutorials/03_envs/run_cartpole_rl_env.py --num_envs 32

"""

"""Launch Isaac Sim Simulator first."""

import argparse

from isaaclab.app import AppLauncher

# add argparse arguments
parser = argparse.ArgumentParser(description="Tutorial on running the cartpole RL environment.")
parser.add_argument("--num_envs", type=int, default=16, help="Number of environments to spawn.")

# append AppLauncher cli args
AppLauncher.add_app_launcher_args(parser)
# parse the arguments
args_cli = parser.parse_args()

# launch omniverse app
app_launcher = AppLauncher(args_cli)
simulation_app = app_launcher.app

"""Rest everything follows."""

import torch

from isaaclab.envs import ManagerBasedRLEnv

from isaaclab_tasks.manager_based.classic.cartpole.cartpole_env_cfg import CartpoleEnvCfg


def main():
    """Main function."""
    # create environment configuration
    env_cfg = CartpoleEnvCfg()
    env_cfg.scene.num_envs = args_cli.num_envs
    env_cfg.sim.device = args_cli.device
    # setup RL environment
    env = ManagerBasedRLEnv(cfg=env_cfg)

    # simulate physics
    count = 0
    while simulation_app.is_running():
        with torch.inference_mode():
            # reset
            if count % 300 == 0:
                count = 0
                env.reset()
                print("-" * 80)
                print("[INFO]: Resetting environment...")
            # sample random actions
            joint_efforts = torch.randn_like(env.action_manager.action)
            # step the environment
            obs, rew, terminated, truncated, info = env.step(joint_efforts)
            # print current orientation of pole
            print("[Env 0]: Pole joint: ", obs["policy"][0][1].item())
            # update counter
            count += 1

    # close the environment
    env.close()


if __name__ == "__main__":
    # run the main function
    main()
    # close sim app
    simulation_app.close()
The Code Explained#
We already went through parts of the above in the Creating a Manager-Based Base Environment tutorial to learn about how to specify the scene, observations, actions and events. Thus, in this tutorial, we will focus only on the RL components of the environment.
In Isaac Lab, we provide various implementations of different terms in the envs.mdp module. We will use
some of these terms in this tutorial, but users are free to define their own terms as well. These
are usually placed in their task-specific sub-package
(for instance, in isaaclab_tasks.manager_based.classic.cartpole.mdp).
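The one convention every term shares is that it returns one value per environment. As a plain-Python illustration of that convention (the real terms take the environment instance and operate on batched torch tensors; the function below is a hypothetical stand-in mimicking `mdp.joint_pos_target_l2`):

```python
def joint_pos_target_l2(joint_pos, target):
    # hypothetical stand-in for an MDP reward term: for each environment,
    # sum the squared deviation of every joint position from the target
    return [sum((q - target) ** 2 for q in env_joints) for env_joints in joint_pos]

# three environments, one joint each
values = joint_pos_target_l2([[0.0], [0.5], [-0.5]], target=0.0)  # → [0.0, 0.25, 0.25]
```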
Defining rewards#
The managers.RewardManager is used to compute the reward terms for the agent. Similar to the other
managers, its terms are configured using the managers.RewardTermCfg class. The
managers.RewardTermCfg class specifies the function or callable class that computes the reward
as well as the weighting associated with it. It also takes in a dictionary of arguments, "params",
that is passed to the reward function when it is called.
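Conceptually, the manager evaluates each term, multiplies the result by its weight, and sums the contributions (Isaac Lab additionally scales each term by the environment step time). A minimal sketch of that combination, with hypothetical term values:

```python
def combine_rewards(term_values, weights, step_dt):
    # weighted sum of per-term rewards, scaled by the step duration
    return sum(weights[name] * value for name, value in term_values.items()) * step_dt

terms = {"alive": 1.0, "pole_pos": 0.04, "cart_vel": 0.3}
weights = {"alive": 1.0, "pole_pos": -1.0, "cart_vel": -0.01}
reward = combine_rewards(terms, weights, step_dt=1.0)  # 1.0 - 0.04 - 0.003 ≈ 0.957
```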
For the cartpole task, we will use the following reward terms:
Alive Reward: Encourage the agent to stay alive for as long as possible.
Terminating Penalty: Penalize the agent when the episode terminates due to failure.
Pole Angle Reward: Encourage the agent to keep the pole at the desired upright position.
Cart Velocity Reward: Encourage the agent to keep the cart velocity as small as possible.
Pole Velocity Reward: Encourage the agent to keep the pole velocity as small as possible.
@configclass
class RewardsCfg:
    """Reward terms for the MDP."""

    # (1) Constant running reward
    alive = RewTerm(func=mdp.is_alive, weight=1.0)
    # (2) Failure penalty
    terminating = RewTerm(func=mdp.is_terminated, weight=-2.0)
    # (3) Primary task: keep pole upright
    pole_pos = RewTerm(
        func=mdp.joint_pos_target_l2,
        weight=-1.0,
        params={"asset_cfg": SceneEntityCfg("robot", joint_names=["cart_to_pole"]), "target": 0.0},
    )
    # (4) Shaping tasks: lower cart velocity
    cart_vel = RewTerm(
        func=mdp.joint_vel_l1,
        weight=-0.01,
        params={"asset_cfg": SceneEntityCfg("robot", joint_names=["slider_to_cart"])},
    )
    # (5) Shaping tasks: lower pole angular velocity
    pole_vel = RewTerm(
        func=mdp.joint_vel_l1,
        weight=-0.005,
        params={"asset_cfg": SceneEntityCfg("robot", joint_names=["cart_to_pole"])},
    )
Defining termination criteria#
Most learning tasks happen over a finite number of steps that we call an episode. For instance, in the cartpole task, we want the agent to balance the pole for as long as possible. However, if the agent reaches an unstable or unsafe state, we want to terminate the episode. On the other hand, if the agent is able to balance the pole for a long time, we want to terminate the episode and start a new one so that the agent can learn to balance the pole from a different starting configuration.
The TerminationsCfg class configures the conditions under which an episode terminates. In this example,
we want the task to terminate when either of the following conditions is met:
Episode Length: The episode length exceeds the defined max_episode_length.
Cart out of bounds: The cart goes outside of the bounds [-3, 3].
The flag managers.TerminationTermCfg.time_out specifies whether the term is a time-out (truncation) term
or a terminated term. These are used to indicate the two types of terminations as described in Gymnasium's documentation.
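The distinction matters for learning algorithms: on a true termination the value of the next state is zero, while on a time-out the episode is merely cut off and the agent should still bootstrap from the next state. A minimal sketch of how an update rule treats the two flags (the function name and discount factor are illustrative):

```python
def td_target(reward, next_value, terminated, gamma=0.99):
    # a failure (terminated) stops the return; a time-out (truncation)
    # still bootstraps from the estimated value of the next state
    return reward + gamma * next_value * (0.0 if terminated else 1.0)

failure = td_target(1.0, 10.0, terminated=True)    # → 1.0 (no bootstrap)
timeout = td_target(1.0, 10.0, terminated=False)   # ≈ 10.9 (bootstraps)
```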
@configclass
class TerminationsCfg:
    """Termination terms for the MDP."""

    # (1) Time out
    time_out = DoneTerm(func=mdp.time_out, time_out=True)
    # (2) Cart out of bounds
    cart_out_of_bounds = DoneTerm(
        func=mdp.joint_pos_out_of_manual_limit,
        params={"asset_cfg": SceneEntityCfg("robot", joint_names=["slider_to_cart"]), "bounds": (-3.0, 3.0)},
    )
Defining commands#
For various goal-conditioned tasks, it is useful to specify the goals or commands for the agent. These are
handled through the managers.CommandManager. The command manager handles resampling and updating the
commands at each step. It can also be used to provide the commands as an observation to the agent.
For this simple task, we do not use any commands. Hence, we leave this attribute as its default value, which is None. You can see an example of how to define a command manager in the other locomotion or manipulation tasks.
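For reference, a command term is configured much like the other manager terms. The sketch below shows the general shape of such a configuration for a pose-style command; the body name and the sampling ranges are illustrative, and the exact parameters depend on the command classes available in envs.mdp, so treat this as a sketch rather than a working configuration.

```python
@configclass
class CommandsCfg:
    """Command specifications for the MDP (illustrative; not used in this tutorial)."""

    # resample a goal pose for the robot at a fixed interval
    goal_pose = mdp.UniformPoseCommandCfg(
        asset_name="robot",
        body_name="end_effector",  # hypothetical body name
        resampling_time_range=(4.0, 4.0),
        ranges=mdp.UniformPoseCommandCfg.Ranges(
            pos_x=(0.3, 0.6), pos_y=(-0.2, 0.2), pos_z=(0.2, 0.5),
            roll=(0.0, 0.0), pitch=(0.0, 0.0), yaw=(-math.pi, math.pi),
        ),
    )
```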
Defining curriculum#
Oftentimes when training a learning agent, it helps to start with a simple task and gradually increase the
task's difficulty as training progresses. This is the idea behind curriculum learning. In Isaac Lab,
we provide a managers.CurriculumManager class that can be used to define a curriculum for your environment.
In this tutorial we don’t implement a curriculum for simplicity, but you can see an example of a curriculum definition in the other locomotion or manipulation tasks.
Tying it all together#
With all the above components defined, we can now create the ManagerBasedRLEnvCfg configuration for the
cartpole environment. This is similar to the ManagerBasedEnvCfg defined in Creating a Manager-Based Base Environment,
only with the added RL components explained in the above sections.
@configclass
class CartpoleEnvCfg(ManagerBasedRLEnvCfg):
    """Configuration for the cartpole environment."""

    # Scene settings
    scene: CartpoleSceneCfg = CartpoleSceneCfg(num_envs=4096, env_spacing=4.0, clone_in_fabric=True)
    # Basic settings
    observations: ObservationsCfg = ObservationsCfg()
    actions: ActionsCfg = ActionsCfg()
    events: EventCfg = EventCfg()
    # MDP settings
    rewards: RewardsCfg = RewardsCfg()
    terminations: TerminationsCfg = TerminationsCfg()

    # Post initialization
    def __post_init__(self) -> None:
        """Post initialization."""
        # general settings
        self.decimation = 2
        self.episode_length_s = 5
        # viewer settings
        self.viewer.eye = (8.0, 0.0, 5.0)
        # simulation settings
        self.sim.dt = 1 / 120
        self.sim.render_interval = self.decimation
        self.sim.physics = CartpolePhysicsCfg()
Running the simulation loop#
Coming back to the run_cartpole_rl_env.py script, the simulation loop is similar to the previous tutorial.
The only difference is that we create an instance of envs.ManagerBasedRLEnv instead of the
envs.ManagerBasedEnv. Consequently, now the envs.ManagerBasedRLEnv.step() method returns additional signals
such as the reward and termination status. The information dictionary also logs quantities
such as the reward contribution from individual terms, the termination status of each term, and the episode length.
def main():
    """Main function."""
    # create environment configuration
    env_cfg = CartpoleEnvCfg()
    env_cfg.scene.num_envs = args_cli.num_envs
    env_cfg.sim.device = args_cli.device
    # setup RL environment
    env = ManagerBasedRLEnv(cfg=env_cfg)

    # simulate physics
    count = 0
    while simulation_app.is_running():
        with torch.inference_mode():
            # reset
            if count % 300 == 0:
                count = 0
                env.reset()
                print("-" * 80)
                print("[INFO]: Resetting environment...")
            # sample random actions
            joint_efforts = torch.randn_like(env.action_manager.action)
            # step the environment
            obs, rew, terminated, truncated, info = env.step(joint_efforts)
            # print current orientation of pole
            print("[Env 0]: Pole joint: ", obs["policy"][0][1].item())
            # update counter
            count += 1

    # close the environment
    env.close()
The Code Execution#
Similar to the previous tutorial, we can run the environment by executing the run_cartpole_rl_env.py script.
./isaaclab.sh -p scripts/tutorials/03_envs/run_cartpole_rl_env.py --num_envs 32 --viz kit
This should open a similar simulation as in the previous tutorial. However, this time, the environment returns more signals that specify the reward and termination status. Additionally, the individual environments reset themselves when they terminate based on the termination criteria specified in the configuration.
To stop the simulation, you can either close the window, or press Ctrl+C in the terminal
where you started the simulation.
In this tutorial, we learnt how to create a task environment for reinforcement learning. We do this
by extending the base environment to include the rewards, terminations, commands and curriculum terms.
We also learnt how to use the envs.ManagerBasedRLEnv class to run the environment and receive various
signals from it.
While it is possible to manually create an instance of envs.ManagerBasedRLEnv class for a desired task,
this is not scalable as it requires specialized scripts for each task. Thus, we exploit the
gymnasium.make() function to create the environment with the gym interface. We will learn how to do this
in the next tutorial.
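As a preview, once a task is registered with Gymnasium, creating it reduces to a single call. The task id below is illustrative and must match a prior registration of the environment:

```python
import gymnasium as gym

# the id must match the one used when registering the task with Gymnasium
env = gym.make("Isaac-Cartpole-v0", cfg=env_cfg)
```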