Task Design Workflows#

Environments define the interface between the agent and the simulation. In the simplest case, the environment provides the agent with the current observations and executes the actions provided by the agent. In a Markov Decision Process (MDP) formulation, the environment can also provide additional information such as the current reward, done flag, and information about the current episode.
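
As a point of reference, the snippet below illustrates this interface using the standard Gymnasium API (the environments in this framework expose a similar stepping interface). The classic CartPole-v1 task is used here purely for illustration and is unrelated to the framework's own Cartpole environments.

import gymnasium as gym

# Classic CartPole, used only to illustrate the agent-environment interface.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

for _ in range(10):
    action = env.action_space.sample()  # action chosen by the agent
    # the environment executes the action and returns the MDP quantities
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:  # done flag: episode ended or timed out
        obs, info = env.reset()

env.close()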

While the environment interface is simple to understand, its implementation can vary significantly depending on the complexity of the task. In the context of reinforcement learning (RL), the environment implementation can be broken down into several components, such as the reward function, observation function, termination function, and reset function. Each of these components can be implemented in different ways depending on the complexity of the task and the desired level of modularity.

We provide two different workflows for designing environments with the framework:

  • Manager-based: The environment is decomposed into individual components (or managers) that handle different aspects of the environment (such as computing observations, applying actions, and applying randomization). The user defines configuration classes for each component and the environment is responsible for coordinating the managers and calling their functions.

  • Direct: The user defines a single class that implements the entire environment directly without the need for separate managers. This class is responsible for computing observations, applying actions, and computing rewards.

Both workflows have their own advantages and disadvantages. The manager-based workflow is more modular and allows different components of the environment to be swapped out easily. This is useful when prototyping the environment and experimenting with different configurations. On the other hand, the direct workflow is more efficient and allows for more fine-grained control over the environment logic. This is useful when optimizing the environment for performance or when implementing complex logic that is difficult to decompose into separate components.

Manager-Based Environments#

Manager-based Task Workflow

A majority of environment implementations follow a similar structure. The environment processes the input actions, steps through the simulation, computes observations and reward signals, applies randomization, and resets the terminated environments. Motivated by this, the environment can be decomposed into individual components that handle each of these tasks. For example, the observation manager is responsible for computing the observations, the reward manager is responsible for computing the rewards, and the termination manager is responsible for computing the termination signal. This approach is known as the manager-based environment design in the framework.

Manager-based environments promote modular implementations of tasks by decomposing the task into individual components that are managed by separate classes. Each component of the task, such as rewards, observations, and terminations, is specified as an individual configuration class that is then passed to the corresponding manager class. Each manager is responsible for parsing its configuration and processing the terms specified in it.
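
Conceptually, a manager iterates over the terms listed in its configuration and combines their results. The following self-contained toy sketch illustrates this pattern for rewards; it is not the framework's actual manager code, and the term functions and state dictionary are made up for illustration. Each term is a function of the environment state plus a scalar weight, and the total reward is the weighted sum of the per-environment term values.

import torch

# Toy stand-ins for reward terms: each term maps the environment state to a
# per-environment value and carries a scalar weight.
def alive_bonus(state):
    # constant reward while the episode is running
    return torch.ones_like(state["pole_pos"])

def pole_pos_penalty(state):
    # squared deviation of the pole from the upright position
    return torch.square(state["pole_pos"])

terms = [(alive_bonus, 1.0), (pole_pos_penalty, -1.0)]

def compute_total_reward(state):
    """Weighted sum over all configured reward terms (one value per environment)."""
    total = torch.zeros_like(state["pole_pos"])
    for func, weight in terms:
        total += weight * func(state)
    return total

state = {"pole_pos": torch.tensor([0.0, 0.3, -0.5])}
print(compute_total_reward(state))  # tensor([1.0000, 0.9100, 0.7500])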

The coordination between the different managers is orchestrated by the class envs.ManagerBasedRLEnv. It takes in a task configuration class instance (envs.ManagerBasedRLEnvCfg) that contains the configurations for each of the components of the task. Based on the configurations, the scene is set up and the task is initialized. Afterwards, while stepping through the environment, all the managers are called sequentially to perform the necessary operations.
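
To make the orchestration concrete, the following is a minimal sketch of constructing and stepping a manager-based environment. The import paths and the CartpoleEnvCfg class are assumptions that depend on the installed version of the framework and the task at hand, and the simulation application must already be launched before creating the environment (omitted here for brevity).

import torch

# NOTE: import paths are assumptions and depend on the framework version.
from isaaclab.envs import ManagerBasedRLEnv
from isaaclab_tasks.manager_based.classic.cartpole.cartpole_env_cfg import CartpoleEnvCfg  # hypothetical path

# The environment parses the configuration, sets up the scene, and creates the
# managers (observations, actions, rewards, terminations, events).
env = ManagerBasedRLEnv(cfg=CartpoleEnvCfg())

obs, _ = env.reset()
for _ in range(100):
    # zero actions just to exercise the stepping loop; Cartpole has a single
    # action (force applied to the cart)
    actions = torch.zeros(env.num_envs, 1, device=env.device)
    # each step calls the managers sequentially to produce these quantities
    obs, rew, terminated, truncated, extras = env.step(actions)
env.close()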

For their own tasks, users are expected to mainly define the task configuration class and use the existing envs.ManagerBasedRLEnv class for the task implementation. The task configuration class should inherit from the base class envs.ManagerBasedRLEnvCfg and contain variables assigned to the configuration classes of the individual components (such as the ObservationsCfg and RewardsCfg).
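
Such a configuration class typically has the shape of the skeleton below. The attribute names mirror the managers they configure; the scene and per-manager configuration classes (MySceneCfg, ObservationsCfg, ActionsCfg, and so on) are placeholders for classes defined by the user, and the values shown are illustrative assumptions.

# assumed imports (exact paths depend on the framework version):
# from isaaclab.utils import configclass
# from isaaclab.envs import ManagerBasedRLEnvCfg

@configclass
class CartpoleEnvCfg(ManagerBasedRLEnvCfg):
    """Skeleton of a task configuration class (for illustration)."""

    # scene settings: assets to spawn and number of parallel environment copies
    scene: MySceneCfg = MySceneCfg(num_envs=4096, env_spacing=4.0)
    # manager settings: each attribute configures one manager
    observations: ObservationsCfg = ObservationsCfg()
    actions: ActionsCfg = ActionsCfg()
    rewards: RewardsCfg = RewardsCfg()
    terminations: TerminationsCfg = TerminationsCfg()
    events: EventCfg = EventCfg()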

Example for defining the reward function for the Cartpole task using the manager-style

The following class is a part of the Cartpole environment configuration class. The RewardsCfg class defines individual terms that compose the reward function. Each reward term is defined by its function implementation, weight, and additional parameters to be passed to the function. Users can define multiple reward terms and their weights to be used in the reward function.

@configclass
class RewardsCfg:
    """Reward terms for the MDP."""

    # (1) Constant running reward
    alive = RewTerm(func=mdp.is_alive, weight=1.0)
    # (2) Failure penalty
    terminating = RewTerm(func=mdp.is_terminated, weight=-2.0)
    # (3) Primary task: keep pole upright
    pole_pos = RewTerm(
        func=mdp.joint_pos_target_l2,
        weight=-1.0,
        params={"asset_cfg": SceneEntityCfg("robot", joint_names=["cart_to_pole"]), "target": 0.0},
    )
    # (4) Shaping tasks: lower cart velocity
    cart_vel = RewTerm(
        func=mdp.joint_vel_l1,
        weight=-0.01,
        params={"asset_cfg": SceneEntityCfg("robot", joint_names=["slider_to_cart"])},
    )
    # (5) Shaping tasks: lower pole angular velocity
    pole_vel = RewTerm(
        func=mdp.joint_vel_l1,
        weight=-0.005,
        params={"asset_cfg": SceneEntityCfg("robot", joint_names=["cart_to_pole"])},
    )

Through this approach, it is possible to easily vary the implementation of the task by swapping some of its components while leaving the rest of the code intact. This flexibility is desirable when prototyping the environment and experimenting with different configurations. It also makes it easier to collaborate with others on implementing an environment, since contributors may choose different combinations of configurations for their own task specifications.
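
For example, a variant of the task can inherit the configuration shown above and only re-tune or replace individual terms, leaving the others untouched. The class below reuses the symbols from the previous snippet; its name and the new weight are arbitrary illustrative choices.

@configclass
class HeavyCartPenaltyRewardsCfg(RewardsCfg):
    """Variant of the reward configuration with a stronger cart-velocity penalty."""

    # override a single term; all other terms are inherited unchanged
    cart_vel = RewTerm(
        func=mdp.joint_vel_l1,
        weight=-0.05,
        params={"asset_cfg": SceneEntityCfg("robot", joint_names=["slider_to_cart"])},
    )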

See also

We provide a more detailed tutorial for setting up an environment using the manager-based workflow at Creating a Manager-Based RL Environment.

Direct Environments#

Direct-based Task Workflow

The direct-style environment aligns more closely with traditional implementations of environments, where a single script directly implements the reward function, observation function, resets, and all the other components of the environment. This approach does not require the manager classes. Instead, users are given complete freedom to implement their task through the APIs of the base classes envs.DirectRLEnv or envs.DirectMARLEnv. For users migrating from the IsaacGymEnvs and OmniIsaacGymEnvs frameworks, this workflow may be more familiar.

When defining an environment with the direct-style implementation, we expect the user to define a single class that implements the entire environment. The task class should inherit from the base class envs.DirectRLEnv or envs.DirectMARLEnv and should have a corresponding configuration class that inherits from envs.DirectRLEnvCfg or envs.DirectMARLEnvCfg, respectively. The task class is responsible for setting up the scene, processing the actions, and computing the rewards, observations, resets, and termination signals.
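
A direct-style task class therefore typically looks like the skeleton below. The method names follow the override points commonly used in the direct workflow, but treat them as an illustrative assumption and refer to the API documentation of envs.DirectRLEnv for the exact interface.

import torch

# assumed import (exact path depends on the framework version):
# from isaaclab.envs import DirectRLEnv

class CartpoleEnv(DirectRLEnv):
    cfg: CartpoleEnvCfg  # the task's configuration class

    def _setup_scene(self):
        ...  # spawn the robot, ground plane, and lights; clone the environments

    def _pre_physics_step(self, actions: torch.Tensor):
        ...  # cache and process the policy actions before the physics steps

    def _apply_action(self):
        ...  # write the processed actions to the simulated articulation

    def _get_observations(self) -> dict:
        ...  # assemble the observation buffer for the policy

    def _get_rewards(self) -> torch.Tensor:
        ...  # compute the per-environment reward (see the example below)

    def _get_dones(self) -> tuple[torch.Tensor, torch.Tensor]:
        ...  # return the (terminated, truncated) flags

    def _reset_idx(self, env_ids):
        ...  # reset the state of the selected environments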

Example for defining the reward function for the Cartpole task using the direct-style

The following function is a part of the Cartpole environment class and is responsible for computing the rewards.

def _get_rewards(self) -> torch.Tensor:
    total_reward = compute_rewards(
        self.cfg.rew_scale_alive,
        self.cfg.rew_scale_terminated,
        self.cfg.rew_scale_pole_pos,
        self.cfg.rew_scale_cart_vel,
        self.cfg.rew_scale_pole_vel,
        self.joint_pos[:, self._pole_dof_idx[0]],
        self.joint_vel[:, self._pole_dof_idx[0]],
        self.joint_pos[:, self._cart_dof_idx[0]],
        self.joint_vel[:, self._cart_dof_idx[0]],
        self.reset_terminated,
    )
    return total_reward

It calls the compute_rewards() function, which is compiled with TorchScript (torch.jit.script) for performance benefits.

import torch


@torch.jit.script
def compute_rewards(
    rew_scale_alive: float,
    rew_scale_terminated: float,
    rew_scale_pole_pos: float,
    rew_scale_cart_vel: float,
    rew_scale_pole_vel: float,
    pole_pos: torch.Tensor,
    pole_vel: torch.Tensor,
    cart_pos: torch.Tensor,
    cart_vel: torch.Tensor,
    reset_terminated: torch.Tensor,
):
    # constant reward for every environment that has not terminated
    rew_alive = rew_scale_alive * (1.0 - reset_terminated.float())
    # penalty for the environments that terminated this step
    rew_termination = rew_scale_terminated * reset_terminated.float()
    # penalize the squared deviation of the pole from the upright position
    rew_pole_pos = rew_scale_pole_pos * torch.sum(torch.square(pole_pos).unsqueeze(dim=1), dim=-1)
    # penalize cart and pole velocities (L1) to encourage smooth motion
    rew_cart_vel = rew_scale_cart_vel * torch.sum(torch.abs(cart_vel).unsqueeze(dim=1), dim=-1)
    rew_pole_vel = rew_scale_pole_vel * torch.sum(torch.abs(pole_vel).unsqueeze(dim=1), dim=-1)
    # total reward is the sum of the weighted terms
    total_reward = rew_alive + rew_termination + rew_pole_pos + rew_cart_vel + rew_pole_vel
    return total_reward

This approach provides more transparency in the implementation of the environment, as the logic is defined within the task class instead of being abstracted behind managers. This may be beneficial when implementing complex logic that is difficult to decompose into separate components. Additionally, the direct-style implementation can bring performance benefits, as it allows implementing large chunks of logic with optimized frameworks such as PyTorch JIT or Warp. This becomes valuable when scaling training up to the point where individual operations in the environment need to be optimized.

See also

We provide a more detailed tutorial for setting up an RL environment using the direct workflow at Creating a Direct Workflow RL Environment.