Piper RL Demo: Reach Target

This repository contains simple demonstration code for Piper RL, showing how to train a simple task, Reach Target: the center of Piper's gripper must reach a specified target position, with no constraint on the end-effector's final orientation, driven by a simple reward function. The repository provides training examples for two simulators: MuJoCo and Genesis.

Repositories

  • Piper_rl (Agilex-College)

Environment Dependencies

  • Install RL-related dependencies
pip install -r requirements.txt
  • Install Genesis-related dependencies
  1. PyTorch
    Install PyTorch, selecting the installation command that matches your CUDA version (see link). For example, for CUDA 12.9, use:
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu129
  2. Genesis World
    Install Genesis World with:
pip install genesis-world
  • Install MuJoCo
pip install mujoco==3.3.2
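
After installing, a small sanity-check script (not part of the repository) can confirm that the key packages are importable. This is a minimal sketch; note that `genesis` is assumed to be the import name of the `genesis-world` package, which is worth verifying locally:

```python
import importlib.util

# Packages the demos rely on; "genesis" is assumed to be the import
# name used by the genesis-world package
REQUIRED = ["torch", "mujoco", "genesis"]

def installed(name: str) -> bool:
    """Return True if a module can be found without importing it."""
    return importlib.util.find_spec(name) is not None

if __name__ == "__main__":
    for name in REQUIRED:
        status = "ok" if installed(name) else "MISSING"
        print(f"{name}: {status}")
```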

Genesis Examples

Load Piper model in Genesis

Run genesis_demo/hello_genesis.py

python genesis_demo/hello_genesis.py

You should see the Piper model load successfully.

Control Piper model in Genesis

Run genesis_demo/control_piper.py

python genesis_demo/control_piper.py

You should see Piper move to the commanded position.

Implement parallel simulation of multiple Pipers in Genesis

Run genesis_demo/multi_piper.py

python genesis_demo/multi_piper.py

Implement parallel training of multiple Pipers in Genesis

Run piper_rl_genesis.py

python piper_rl_genesis.py

You should see multiple Pipers attempting to reach the target position in parallel.

Launch TensorBoard to view how the reward of the parallel Pipers evolves during training:

tensorboard --logdir tensorboard/piper_reach_target/

Basic Steps to Implement Piper Env in Genesis

  1. Initialize the env
#  For more detailed methods of inheriting gym.Env, refer to the gym examples provided by OpenAI
class PiperEnv(gym.Env):
  
    #__init__(): Initializes the environment, robot parameters, action space and observation space, 
    # facilitating the reinforcement learning algorithm to search for appropriate actions in the given observation space
    def __init__(self, visualize: bool = False):
        super(PiperEnv, self).__init__()

        self.visualize = visualize

        # Set the joint indices to be controlled
        self.jnt_name = [
            "joint1",
            "joint2",
            "joint3",
            "joint4",
            "joint5",
            "joint6"
        ]

        # Set the generation space for random target points and the workspace of the Piper robotic arm
        self.workspace = {
            'x': [-0.5, 1.5],
            'y': [-0.8, 0.8],
            'z': [0.05, 0.5]
        }

        # Set the device for the environment
        self.tensor_device = "cpu"
        self.gs_device = gs.cpu

        #  Set joint limits
        self.jnt_range = torch.tensor([
            [-2.61, 2.61],
            [0, 3.14],
            [-2.7, 0],
            [-1.83, 1.83],
            [-1.22, 1.22],
            [-1.57, 1.57]
        ], device=self.tensor_device)

        # PD controller parameters for Piper robotic arm joints
        self.kp = torch.tensor([4500, 4500, 3500, 3500, 2500.0, 2500.0], device=self.tensor_device)
        self.kv = torch.tensor([450.0, 450.0, 350.0, 350.0, 250.0, 250.0], device=self.tensor_device)

        gs.init(backend = self.gs_device)

        # Create genesis scene
        self.scene = gs.Scene(
            # Set camera
            show_viewer = self.visualize,
            viewer_options = gs.options.ViewerOptions(
                camera_pos    = (3.5, -1.0, 2.5),
                camera_lookat = (0.0, 0.0, 0.5),
                camera_fov    = 40,
            ),
            # Set physics engine
            rigid_options = gs.options.RigidOptions(
                dt = 0.01,
            ),
        )

        # Add ground plane
        plane = self.scene.add_entity(
            gs.morphs.Plane(),
        )

        # add piper robot
        self.robot = self.scene.add_entity(
            gs.morphs.MJCF(file='xml/agilex_piper/piper.xml'),
        )

        # Build the scene
        self.scene.build()

        # The initial pose is Piper robotic arm's default pose (all joint angles are 0)
        self.default_joint_pos = torch.tensor([0.0, 0.0, 0.0, 0.0, 0.0, 0.0], device=self.tensor_device)
        # Placeholder for the end-effector position (x, y, z); note the actual
        # end-effector position is not (0, 0, 0) when Piper is in the initial pose
        self.default_ee_pos = torch.tensor([0.0, 0.0, 0.0], device=self.tensor_device)

        # Define a normalized action space in [-1, 1]; step() rescales each
        # action component onto the corresponding joint's limit range
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(6,), dtype=np.float32)

        #  Define observation space, including 6 joint angles and 3 end-effector positions
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(9,), dtype=np.float32)

        
        self.motors_dof_idx = [self.robot.get_joint(name).dof_start for name in self.jnt_name]
        self.robot.set_dofs_kp(self.kp, self.motors_dof_idx)
        self.robot.set_dofs_kv(self.kv, self.motors_dof_idx)

        self.goal = torch.zeros(3, dtype=torch.float32, device=self.tensor_device)
        self.last_action = torch.zeros(6, dtype=torch.float32, device=self.tensor_device)
        self.goal_threshold = 0.005
  2. Design the reward function
def calc_reward(self, action, obs):

        # Calculate the distance between the end-effector position (obs[6:]) and the target
        dist_to_goal = torch.linalg.norm(obs[6:] - self.goal)
        
        # Non-linear distance reward
        if dist_to_goal < self.goal_threshold:
            distance_reward = 100.0
        elif dist_to_goal < 2*self.goal_threshold:
            distance_reward = 50.0
        elif dist_to_goal < 3*self.goal_threshold:
            distance_reward = 10.0
        else:
            distance_reward = 1.0 / (1.0 + dist_to_goal)
        
        #  Action-related penalty: penalize large joint changes
        action_diff = action - self.last_action
        smooth_penalty = 0.1 * torch.linalg.norm(action_diff)

        # Joint angle limit penalty
        joint_penalty = 0.0
        for i in range(6):
            min_angle = self.jnt_range[i][0]
            max_angle = self.jnt_range[i][1]
            if obs[i] < min_angle:
                joint_penalty += 0.5 * (min_angle - obs[i])
            elif obs[i] > max_angle:
                joint_penalty += 0.5 * (obs[i] - max_angle)
        
        # Total reward calculation
        total_reward = distance_reward - smooth_penalty - joint_penalty
        #  Update last action
        self.last_action = action.clone()
        
        return total_reward, dist_to_goal
  3. Set the step function
def step(self, action):
        action_tensor = torch.tensor(action, device=self.tensor_device, dtype=torch.float32)
        
        # Action scaling: the policy outputs normalized actions in [-1, 1];
        # map each component affinely onto the corresponding joint's limit
        # range so the commanded positions are always feasible
        scaled_action = torch.zeros(6, device=self.tensor_device, dtype=torch.float32)
        for i in range(6):
            scaled_action[i] = self.jnt_range[i][0] + (action_tensor[i] + 1) * 0.5 * (self.jnt_range[i][1] - self.jnt_range[i][0])
        
        # Genesis robot executes the action 
        self.robot.control_dofs_position(scaled_action, self.motors_dof_idx)
        # Genesis scene simulates one step
        self.scene.step()

        # Observation
        obs = self.get_observation()

        # Calculate reward
        reward, dist_to_goal = self.calc_reward(action_tensor, obs)
        terminated = False
        if dist_to_goal < self.goal_threshold:
            terminated = True

        if not terminated:
            if time.time() - self.start_t > 20.0:
                reward -= 10.0
                print("[Timeout] Time limit exceeded, applying penalty")
                terminated = True
        
        info = {
            'is_success': terminated and (dist_to_goal < self.goal_threshold),
            'distance_to_goal': dist_to_goal.item()
        }

        return obs.cpu().numpy(), reward.item(), terminated, False, info
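
The two core pieces of the environment above — the affine action rescaling and the tiered distance reward — can be isolated as plain-Python helpers for quick experimentation. This is a framework-free sketch (no torch), with the threshold default mirroring `goal_threshold`:

```python
def scale_action(a: float, lo: float, hi: float) -> float:
    """Affinely map a normalized action a in [-1, 1] onto [lo, hi]."""
    return lo + (a + 1.0) * 0.5 * (hi - lo)

def distance_reward(dist: float, threshold: float = 0.005) -> float:
    """Tiered reward: large bonuses near the goal, shaped 1/(1+d) elsewhere."""
    if dist < threshold:
        return 100.0
    elif dist < 2 * threshold:
        return 50.0
    elif dist < 3 * threshold:
        return 10.0
    else:
        return 1.0 / (1.0 + dist)
```

With these helpers, scale_action(-1.0, 0.0, 3.14) yields 0.0 (the joint's lower limit) and distance_reward(0.001) yields 100.0, matching the tiers in calc_reward.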

MuJoCo Examples

Implement parallel training of multiple Pipers in MuJoCo

Run piper_rl_mujoco.py

python piper_rl_mujoco.py

Launch TensorBoard to view how the reward of the parallel Pipers evolves during training:

tensorboard --logdir tensorboard/piper_reach_target/

Test the trained model in MuJoCo

Run piper_rl_mujoco.py

python piper_rl_mujoco.py

You should see Piper successfully reach the target position.

References

https://github.com/LitchiCheng/mujoco-learning
