Task Design Workflows#
Environments define the interface between the agent and the simulation. In the simplest case, the environment provides the agent with the current observations and executes the actions provided by the agent. In a Markov Decision Process (MDP) formulation, the environment can also provide additional information such as the current reward, done flag, and information about the current episode.
While the environment interface is simple to understand, its implementation can vary significantly depending on the complexity of the task. In the context of reinforcement learning (RL), the environment implementation can be broken down into several components, such as the reward function, observation function, termination function, and reset function. Each of these components can be implemented in different ways depending on the complexity of the task and the desired level of modularity.
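To make the interface concrete, the following is a minimal sketch of a typical agent-environment interaction loop. It assumes a Gymnasium-style API, and the task name "Isaac-Cartpole-v0" is used purely for illustration; the exact registration and stepping details depend on the framework version.

import gymnasium as gym

# Hypothetical task name, shown only to illustrate the interface.
env = gym.make("Isaac-Cartpole-v0")

obs, info = env.reset()
for _ in range(100):
    # A random policy stands in for a trained agent.
    action = env.action_space.sample()
    # The environment steps the simulation and returns the MDP quantities:
    # observations, reward, termination/truncation flags, and extra info.
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()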
We provide two different workflows for designing environments with the framework:
Manager-based: The environment is decomposed into individual components (or managers) that handle different aspects of the environment (such as computing observations, applying actions, and applying randomization). The user defines configuration classes for each component and the environment is responsible for coordinating the managers and calling their functions.
Direct: The user defines a single class that implements the entire environment directly without the need for separate managers. This class is responsible for computing observations, applying actions, and computing rewards.
Both workflows have their own advantages and disadvantages. The manager-based workflow is more modular and allows different components of the environment to be swapped out easily. This is useful when prototyping the environment and experimenting with different configurations. On the other hand, the direct workflow is more efficient and allows for more fine-grained control over the environment logic. This is useful when optimizing the environment for performance or when implementing complex logic that is difficult to decompose into separate components.
Manager-Based Environments#
A majority of environment implementations follow a similar structure. The environment processes the input actions, steps through the simulation, computes observations and reward signals, applies randomization, and resets the terminated environments. Motivated by this, the environment can be decomposed into individual components that handle each of these tasks. For example, the observation manager is responsible for computing the observations, the reward manager is responsible for computing the rewards, and the termination manager is responsible for computing the termination signal. This approach is known as the manager-based environment design in the framework.
Manager-based environments promote modular implementations of tasks by decomposing the task into individual components that are managed by separate classes. Each component of the task, such as rewards, observations, termination can all be specified as individual configuration classes that are then passed to the corresponding manager classes. The manager is then responsible for parsing the configurations and processing the contents specified in its configuration.
The coordination between the different managers is orchestrated by the class envs.ManagerBasedRLEnv. It takes in a task configuration class instance (envs.ManagerBasedRLEnvCfg) that contains the configurations for each of the components of the task. Based on the configurations, the scene is set up and the task is initialized. Afterwards, while stepping through the environment, all the managers are called sequentially to perform the necessary operations.
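As a minimal sketch of how this looks in practice, the environment class can be instantiated directly from a configuration instance and then stepped like any other RL environment. The import paths, the CartpoleEnvCfg class, and the use of the action manager's buffer are assumptions based on the framework's manager-based examples and may differ between versions.

import torch

# Import paths are assumptions; depending on the version the package may be
# named differently (e.g. omni.isaac.lab.envs).
from isaaclab.envs import ManagerBasedRLEnv

# CartpoleEnvCfg stands in for a user-defined ManagerBasedRLEnvCfg subclass,
# e.g. the Cartpole configuration sketched later in this section.
from my_task.cartpole_env_cfg import CartpoleEnvCfg  # hypothetical module

env = ManagerBasedRLEnv(cfg=CartpoleEnvCfg())
obs, _ = env.reset()
for _ in range(100):
    # Random actions with the same shape as the action manager's buffer.
    actions = torch.randn_like(env.action_manager.action)
    # Each step calls the action, observation, reward, termination, and event
    # managers in sequence and returns the batched MDP signals.
    obs, rew, terminated, truncated, info = env.step(actions)
env.close()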
For their own tasks, we expect the user to mainly define the task configuration class and use the existing envs.ManagerBasedRLEnv class for the task implementation. The task configuration class should inherit from the base class envs.ManagerBasedRLEnvCfg and contain variables assigned to the configuration classes for each component (such as the ObservationsCfg and RewardsCfg).
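As a hedged sketch of how such a task configuration might be assembled, the class below gathers the per-component configuration classes into a single ManagerBasedRLEnvCfg subclass. The attribute names (scene, observations, actions, rewards, terminations) follow the convention used by the framework's manager-based examples, and CartpoleSceneCfg, ObservationsCfg, ActionsCfg, and TerminationsCfg are assumed to be defined elsewhere in the task module; the import paths may differ between versions.

# Import paths are assumptions and may differ between versions.
from isaaclab.envs import ManagerBasedRLEnvCfg
from isaaclab.utils import configclass


@configclass
class CartpoleEnvCfg(ManagerBasedRLEnvCfg):
    """Configuration for the Cartpole task (illustrative sketch)."""

    # Scene settings: which assets to spawn and how to replicate them.
    scene: CartpoleSceneCfg = CartpoleSceneCfg(num_envs=4096, env_spacing=4.0)
    # MDP settings: one configuration class per manager.
    observations: ObservationsCfg = ObservationsCfg()
    actions: ActionsCfg = ActionsCfg()
    rewards: RewardsCfg = RewardsCfg()
    terminations: TerminationsCfg = TerminationsCfg()

    def __post_init__(self):
        # General simulation settings.
        self.decimation = 2
        self.episode_length_s = 5.0
        self.sim.dt = 1 / 120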
Example for defining the reward function for the Cartpole task using the manager-style
The following class is a part of the Cartpole environment configuration. The RewardsCfg class defines the individual terms that compose the reward function. Each reward term is defined by its function implementation, weight, and additional parameters to be passed to the function. Users can define multiple reward terms and their weights to be used in the reward function.
@configclass
class RewardsCfg:
    """Reward terms for the MDP."""

    # (1) Constant running reward
    alive = RewTerm(func=mdp.is_alive, weight=1.0)
    # (2) Failure penalty
    terminating = RewTerm(func=mdp.is_terminated, weight=-2.0)
    # (3) Primary task: keep pole upright
    pole_pos = RewTerm(
        func=mdp.joint_pos_target_l2,
        weight=-1.0,
        params={"asset_cfg": SceneEntityCfg("robot", joint_names=["cart_to_pole"]), "target": 0.0},
    )
    # (4) Shaping tasks: lower cart velocity
    cart_vel = RewTerm(
        func=mdp.joint_vel_l1,
        weight=-0.01,
        params={"asset_cfg": SceneEntityCfg("robot", joint_names=["slider_to_cart"])},
    )
    # (5) Shaping tasks: lower pole angular velocity
    pole_vel = RewTerm(
        func=mdp.joint_vel_l1,
        weight=-0.005,
        params={"asset_cfg": SceneEntityCfg("robot", joint_names=["cart_to_pole"])},
    )
Through this approach, it is possible to easily vary the implementation of the task by switching out some components while leaving the rest of the code intact. This flexibility is desirable when prototyping the environment and experimenting with different configurations. It also makes it easy to collaborate with others on implementing an environment, since contributors may choose different combinations of configurations for their own task specifications.
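For instance, a variant of the task can be created by subclassing the configuration and adjusting only the terms of interest. The following is a minimal sketch that assumes a CartpoleEnvCfg class like the one described above; setting a term to None to disable it follows the framework's configuration conventions, but the exact attribute names depend on how the task configuration is defined.

@configclass
class CartpoleAltRewardsEnvCfg(CartpoleEnvCfg):
    """Variant of the Cartpole task with modified reward shaping (illustrative)."""

    def __post_init__(self):
        super().__post_init__()
        # Re-weight the primary balancing term.
        self.rewards.pole_pos.weight = -2.0
        # Drop the cart-velocity shaping term; managers skip terms set to None.
        self.rewards.cart_vel = None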
See also
We provide a more detailed tutorial for setting up an environment using the manager-based workflow at Creating a Manager-Based RL Environment.
Direct Environments#
The direct-style environment aligns more closely with traditional implementations of environments, where a single script directly implements the reward function, observation function, resets, and all the other components of the environment. This approach does not require the manager classes. Instead, users are given complete freedom to implement their task through the APIs of the base classes envs.DirectRLEnv or envs.DirectMARLEnv. For users migrating from the IsaacGymEnvs and OmniIsaacGymEnvs frameworks, this workflow may be more familiar.
When defining an environment with the direct-style implementation, we expect the user to define a single class that implements the entire environment. The task class should inherit from the base class envs.DirectRLEnv or envs.DirectMARLEnv and should have a corresponding configuration class that inherits from envs.DirectRLEnvCfg or envs.DirectMARLEnvCfg, respectively. The task class is responsible for setting up the scene, processing the actions, and computing the rewards, observations, resets, and termination signals.
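As a hedged sketch, a direct-style task class typically overrides a small set of callbacks on the base class. The method names below (_setup_scene, _pre_physics_step, _apply_action, _get_observations, _get_rewards, _get_dones, _reset_idx) follow the pattern used by the framework's direct-workflow examples, but the exact set of required overrides, their signatures, and the attribute names (e.g. action_scale, joint_pos, joint_vel) are assumptions and may vary between versions.

import torch

# Import path is an assumption and may differ between versions.
from isaaclab.envs import DirectRLEnv


class CartpoleEnv(DirectRLEnv):
    """Direct-style Cartpole task (illustrative skeleton)."""

    cfg: "CartpoleEnvCfg"  # assumed DirectRLEnvCfg subclass defined elsewhere

    def _setup_scene(self):
        # Spawn the robot, ground plane, and lights; replicate the environments.
        ...

    def _pre_physics_step(self, actions: torch.Tensor):
        # Cache and scale the raw policy actions before the physics steps.
        self.actions = self.cfg.action_scale * actions.clone()

    def _apply_action(self):
        # Write the processed actions to the articulation (e.g. joint efforts).
        ...

    def _get_observations(self) -> dict:
        # Assemble the observation tensor; joint_pos/joint_vel are assumed buffers.
        obs = torch.cat([self.joint_pos, self.joint_vel], dim=-1)
        return {"policy": obs}

    def _get_rewards(self) -> torch.Tensor:
        # Compute the reward, e.g. via a JIT-compiled helper as in the example below.
        ...

    def _get_dones(self) -> tuple[torch.Tensor, torch.Tensor]:
        # Return boolean tensors for termination and time-out (truncation).
        ...

    def _reset_idx(self, env_ids):
        # Reset the state of the environments listed in env_ids.
        ...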
Example for defining the reward function for the Cartpole task using the direct-style
The following function is a part of the Cartpole environment class and is responsible for computing the rewards.
def _get_rewards(self) -> torch.Tensor:
    total_reward = compute_rewards(
        self.cfg.rew_scale_alive,
        self.cfg.rew_scale_terminated,
        self.cfg.rew_scale_pole_pos,
        self.cfg.rew_scale_cart_vel,
        self.cfg.rew_scale_pole_vel,
        self.joint_pos[:, self._pole_dof_idx[0]],
        self.joint_vel[:, self._pole_dof_idx[0]],
        self.joint_pos[:, self._cart_dof_idx[0]],
        self.joint_vel[:, self._cart_dof_idx[0]],
        self.reset_terminated,
    )
    return total_reward
It calls the compute_rewards() function, which is Torch JIT-compiled for performance benefits.
@torch.jit.script
def compute_rewards(
    rew_scale_alive: float,
    rew_scale_terminated: float,
    rew_scale_pole_pos: float,
    rew_scale_cart_vel: float,
    rew_scale_pole_vel: float,
    pole_pos: torch.Tensor,
    pole_vel: torch.Tensor,
    cart_pos: torch.Tensor,
    cart_vel: torch.Tensor,
    reset_terminated: torch.Tensor,
):
    rew_alive = rew_scale_alive * (1.0 - reset_terminated.float())
    rew_termination = rew_scale_terminated * reset_terminated.float()
    rew_pole_pos = rew_scale_pole_pos * torch.sum(torch.square(pole_pos).unsqueeze(dim=1), dim=-1)
    rew_cart_vel = rew_scale_cart_vel * torch.sum(torch.abs(cart_vel).unsqueeze(dim=1), dim=-1)
    rew_pole_vel = rew_scale_pole_vel * torch.sum(torch.abs(pole_vel).unsqueeze(dim=1), dim=-1)
    total_reward = rew_alive + rew_termination + rew_pole_pos + rew_cart_vel + rew_pole_vel
    return total_reward
This approach provides more transparency in the implementation of the environment, as the logic is defined within the task class instead of being abstracted away by managers. This can be beneficial when implementing complex logic that is difficult to decompose into separate components. Additionally, the direct-style implementation may bring performance benefits, as it allows large chunks of logic to be implemented with optimized frameworks such as PyTorch JIT or Warp. This becomes valuable when scaling training to a very large number of environments, where individual operations in the environment need to be optimized.
See also
We provide a more detailed tutorial for setting up an RL environment using the direct workflow at Creating a Direct Workflow RL Environment.