Genesis physics simulator (5) Reinforcement Learning
Clone Genesis from GitHub and move into the Genesis directory.
git clone https://github.com/Genesis-Embodied-AI/Genesis.git
cd Genesis
Install TensorBoard to visualize training logs.
TensorBoard lets you visualize learning curves (reward, loss), parameter changes, scalars, histograms, and computation graphs, so you can follow how training is progressing during reinforcement learning. It is widely regarded as an essential tool for this kind of work.
pip install tensorboard
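As a side note, the scalars that show up in TensorBoard are written with PyTorch's SummaryWriter, and as far as I can tell rsl-rl writes its training metrics the same way under the logs folder. A minimal self-contained example (the tag names and values here are made up):

from torch.utils.tensorboard import SummaryWriter

# Write a few dummy scalars into ./logs/demo; TensorBoard plots them as curves.
writer = SummaryWriter(log_dir="logs/demo")
for step in range(100):
    writer.add_scalar("Train/mean_reward", step * 0.1, step)          # hypothetical tag
    writer.add_scalar("Loss/value_function", 1.0 / (step + 1), step)  # hypothetical tag
writer.close()

Pointing the tensorboard command shown later at the same logs folder displays these curves in the browser.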
Next, install rsl-rl-lib, a reinforcement learning library from ETH Zurich's Robotic Systems Lab (the leggedrobotics GitHub organization). It is used not only with Genesis but also with Isaac Lab and Isaac Gym, and it is lightweight and fast.
It provides a fast implementation of PPO (Proximal Policy Optimization) with actor-critic training and runs on the GPU via CUDA.
pip install rsl-rl-lib==2.2.4
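As a refresher, the core of PPO is a clipped surrogate objective, which is where the clip_param value of 0.2 and the "Surrogate loss" line in the logs below come from. A minimal conceptual sketch, not rsl-rl's actual code:

import torch

# Conceptual sketch of the PPO clipped surrogate loss (policy term only).
# rsl-rl's real update also includes value-function and entropy terms.
def ppo_surrogate_loss(log_prob_new, log_prob_old, advantages, clip_param=0.2):
    ratio = torch.exp(log_prob_new - log_prob_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param) * advantages
    # Maximizing the clipped surrogate is the same as minimizing its negative mean
    return -torch.min(unclipped, clipped).mean()

Training alternates between collecting rollouts from the parallel environments and running several epochs of this update over mini-batches (the num_learning_epochs and num_mini_batches settings that appear later).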
Now run a trial training. The -B option sets the number of parallel environments; you can try changing 1024 to 4096 or other values.
cd Genesis
python examples/locomotion/go2_train.py -B 1024 --max_iterations 100
Output like the following repeats as the iterations progress. Although 100 iterations were requested, the last one is labeled 99/100, presumably because iterations are counted from 0. Total time was 44.56s.
################################################################################
Learning iteration 99/100
Computation: 61363 steps/s (collection: 0.299s, learning 0.101s)
Value function loss: 0.0029
Surrogate loss: -0.0049
Mean action noise std: 0.92
Mean total reward: 7.58
Mean episode length: 796.85
Mean episode rew_tracking_lin_vel: 0.5305
Mean episode rew_tracking_ang_vel: 0.1222
Mean episode rew_lin_vel_z: -0.0128
Mean episode rew_base_height: -0.0106
Mean episode rew_action_rate: -0.0869
Mean episode rew_similar_to_default: -0.1561
--------------------------------------------------------------------------------
Total timesteps: 2457600
Iteration time: 0.40s
Total time: 44.56s
ETA: 0.4s
The above was B=1024; below is the same run with B=4096. Total time increased to 76.89s because four times as many environments means four times the timesteps per iteration, but throughput more than doubled (136,453 vs. 61,363 steps/s) and the mean total reward was much higher (16.83 vs. 7.58).
################################################################################
Learning iteration 99/100
Computation: 136453 steps/s (collection: 0.421s, learning 0.300s)
Value function loss: 0.0001
Surrogate loss: -0.0025
Mean action noise std: 0.58
Mean total reward: 16.83
Mean episode length: 1001.00
Mean episode rew_tracking_lin_vel: 0.9318
Mean episode rew_tracking_ang_vel: 0.1741
Mean episode rew_lin_vel_z: -0.0152
Mean episode rew_base_height: -0.0068
Mean episode rew_action_rate: -0.0821
Mean episode rew_similar_to_default: -0.1565
--------------------------------------------------------------------------------
Total timesteps: 9830400
Iteration time: 0.72s
Total time: 76.89s
ETA: 0.8s
Note: I fumbled around trying to start TensorBoard with the command below, but training finished before I got it running.
python -m tensorboard.main --logdir logs --port 6006
# Open http://localhost:6006 to see the learning status (supposedly)
After learning is complete, execute the following.
python examples/locomotion/go2_eval.py
For some reason, I got an error saying Genesis/logs/go2-walking/model_100.pt doesn't exist. When I checked the folder, there was a model_99.pt, so I renamed it to model_100.pt, re-executed, and the quadruped walking was displayed successfully. The off-by-one is presumably because checkpoints are numbered by zero-based iteration, so training with --max_iterations 100 leaves model_99.pt as the last checkpoint, while the eval script's default checkpoint number (100) matches the training script's default of 101 iterations. If the eval script accepts a checkpoint-number argument, passing 99 should avoid the rename.
Contents of the Training Program go2_train.py
I went through the script and added comments in order. First, it checks that the old 'rsl-rl' package is not installed and that 'rsl-rl-lib' is pinned to exactly version 2.2.4:
from importlib import metadata

try:
    try:
        # If the legacy "rsl-rl" package is installed, trigger the error below
        if metadata.version("rsl-rl"):
            raise ImportError
    except metadata.PackageNotFoundError:
        # Otherwise require exactly rsl-rl-lib 2.2.4
        if metadata.version("rsl-rl-lib") != "2.2.4":
            raise ImportError
except (metadata.PackageNotFoundError, ImportError) as e:
    raise ImportError("Please uninstall 'rsl_rl' and install 'rsl-rl-lib==2.2.4'.") from e
Next, import the PPO runner (OnPolicyRunner), the Genesis physics simulator, and Go2Env, the custom environment for the quadruped walking task.
from rsl_rl.runners import OnPolicyRunner
import genesis as gs
from go2_env import Go2Env
Define the function get_train_cfg, which returns the PPO training configuration.
def get_train_cfg(exp_name, max_iterations):
    train_cfg_dict = {
        "algorithm": { ... },
        "policy": { ... },
        "runner": { ... },
        ...
    }
- algorithm (PPO hyperparameters):
  clip_param: ε for PPO clipping (0.2)
  desired_kl: monitors the KL divergence and adapts the learning rate
  gamma: discount factor
  lam: GAE lambda
  learning_rate: 0.001
  num_learning_epochs: 5
  num_mini_batches: 4
- policy (network structure):
  Hidden layers: 512 → 256 → 128
  Activation: ELU
  The actor and critic have the same structure
- runner settings:
  This is where the RSL-RL training manager is configured (experiment name, maximum iterations, etc.). A rough sketch of the whole dictionary follows right after this list.
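Here is that sketch, assembled from the values listed above. The gamma, lam, and desired_kl values and the runner keys are my assumptions for illustration, and the real go2_train.py contains more entries than this.

def get_train_cfg(exp_name, max_iterations):
    # Rough sketch only: values marked "assumed" are not taken from the actual script.
    return {
        "algorithm": {
            "clip_param": 0.2,       # PPO clipping epsilon
            "desired_kl": 0.01,      # assumed target KL for the adaptive learning rate
            "gamma": 0.99,           # assumed discount factor
            "lam": 0.95,             # assumed GAE lambda
            "learning_rate": 0.001,
            "num_learning_epochs": 5,
            "num_mini_batches": 4,
        },
        "policy": {
            "activation": "elu",
            "actor_hidden_dims": [512, 256, 128],
            "critic_hidden_dims": [512, 256, 128],
        },
        "runner": {
            "experiment_name": exp_name,      # assumed key names
            "max_iterations": max_iterations,
        },
    }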
Next, define the function get_cfgs(), which returns the Go2 robot environment configuration.
def get_cfgs():
    env_cfg = {...}
    obs_cfg = {...}
    reward_cfg = {...}
    command_cfg = {...}
- env_cfg (environment physics settings):
  num_actions = 12 → quadruped (3 joints × 4 legs)
  Initial angles (initial posture of the hip, thigh, and calf joints)
  PD gains kp = 20, kd = 0.5
  Fall detection: the episode ends if roll > 10° or pitch > 10°
  Action scale: 0.25
  Simulated action latency: simulate_action_latency = True
- obs_cfg (observation settings) for scaling the velocity and joint data:
  num_obs = 45
  obs_scales = {
      "lin_vel": 2.0,
      "ang_vel": 0.25,
      "dof_pos": 1.0,
      "dof_vel": 0.05,
  }
- reward_cfg (reward settings): tracking-type rewards plus penalties (a small sketch of the tracking reward follows this list)
  tracking_lin_vel: linear velocity tracking
  tracking_ang_vel: angular velocity tracking
  lin_vel_z: vertical velocity penalty
  base_height: large penalty (-50) if the base height is low
  action_rate: penalty on the rate of change of actions
  similar_to_default: keep joint angles close to the default posture
- command_cfg (target commands): external command settings (target velocity, etc.)
  lin_vel_x_range = [0.5, 0.5] → the task is to walk at a constant velocity of 0.5 m/s in the x direction.
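To make the tracking-type rewards more concrete: rewards of this kind are usually an exponential of the squared tracking error, so they approach 1 as the robot matches the commanded velocity. A rough sketch under that assumption (not necessarily Go2Env's exact code; tracking_sigma is an assumed parameter):

import torch

# Conceptual sketch of a linear-velocity tracking reward for num_envs parallel robots.
# base_lin_vel, commands: tensors of shape (num_envs, 3); tracking_sigma is assumed.
def reward_tracking_lin_vel(base_lin_vel, commands, tracking_sigma=0.25):
    # Squared error between commanded and actual planar (x, y) velocity
    lin_vel_error = torch.sum((commands[:, :2] - base_lin_vel[:, :2]) ** 2, dim=1)
    # 1.0 when tracking is perfect, decaying toward 0 as the error grows
    return torch.exp(-lin_vel_error / tracking_sigma)

The penalty terms (lin_vel_z, action_rate, and so on) typically work the other way around: squared quantities multiplied by negative weights, which is why they appear as negative numbers in the training log above.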
The following is the main part.
import argparse
import os
import pickle
import shutil

def main():
    # Parse command-line arguments: experiment name, number of environments, iteration count
    parser = argparse.ArgumentParser()
    parser.add_argument("-e", "--exp_name", type=str, default="go2-walking")
    parser.add_argument("-B", "--num_envs", type=int, default=4096)
    parser.add_argument("--max_iterations", type=int, default=101)
    args = parser.parse_args()

    # Initialize Genesis
    gs.init(logging_level="warning")

    # Prepare the log folder (delete previous logs if they exist, then create a fresh one)
    log_dir = f"logs/{args.exp_name}"
    env_cfg, obs_cfg, reward_cfg, command_cfg = get_cfgs()
    train_cfg = get_train_cfg(args.exp_name, args.max_iterations)
    if os.path.exists(log_dir):
        shutil.rmtree(log_dir)
    os.makedirs(log_dir, exist_ok=True)

    # Save all training settings to cfgs.pkl
    pickle.dump(
        [env_cfg, obs_cfg, reward_cfg, command_cfg, train_cfg],
        open(f"{log_dir}/cfgs.pkl", "wb"),
    )

    # Create the Go2 environment
    env = Go2Env(
        num_envs=args.num_envs, env_cfg=env_cfg, obs_cfg=obs_cfg, reward_cfg=reward_cfg, command_cfg=command_cfg
    )

    # Create the PPO runner (uses the GPU automatically; manages all rollout collection and learning)
    runner = OnPolicyRunner(env, train_cfg, log_dir, device=gs.device)

    # Start training (episodes start at random lengths so the environments don't all reset in sync)
    runner.learn(num_learning_iterations=args.max_iterations, init_at_random_ep_len=True)
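For context on why the checkpoint file name mattered, evaluation roughly has to reload the saved cfgs.pkl, rebuild the environment, and restore the checkpoint into the runner. A hedged sketch of that flow, not the actual go2_eval.py:

import pickle

import genesis as gs
from rsl_rl.runners import OnPolicyRunner
from go2_env import Go2Env

gs.init()
log_dir = "logs/go2-walking"

# Reload the configuration dictionaries saved by go2_train.py
env_cfg, obs_cfg, reward_cfg, command_cfg, train_cfg = pickle.load(open(f"{log_dir}/cfgs.pkl", "rb"))

# Rebuild a single environment and restore the trained policy
env = Go2Env(num_envs=1, env_cfg=env_cfg, obs_cfg=obs_cfg, reward_cfg=reward_cfg, command_cfg=command_cfg)
runner = OnPolicyRunner(env, train_cfg, log_dir, device=gs.device)
runner.load(f"{log_dir}/model_100.pt")  # the file name the eval script was looking for
policy = runner.get_inference_policy(device=gs.device)
# The real script then steps the environment in a loop, feeding observations to policy(obs)
# and showing the result in the viewer.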
Explaining every parameter in detail would get extremely long, so I'll stop here.