
Enginius/Python&TensorFlow

MuJoCo Ant-V2 + PPO implementation

Some descriptions of Ant-V2 in MuJoCo

(An excellent PPO implementation: https://github.com/pat-coady/trpo)


Code snippet for random movements of the Ant. 

import gym
import numpy as np

env = gym.make('Ant-v2')
obs = env.reset()
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.shape[0]

print(obs_dim, act_dim)  # 111 8

for i in range(1000):
    env.render()
    # Random torques, one per joint (np.random.randn is unbounded,
    # so this is purely for visualizing random flailing)
    action = np.random.randn(act_dim).astype(np.float32)
    obs, reward, done, _ = env.step(action)




You'll see something like this: 


The four-legged (quadruped) thing is called an Ant. It has a 111-dim observation space and an 8-dim action space. Interestingly, the physical meaning of these spaces is hidden inside the ant.xml file, which looks like this:


This format should be familiar to roboticists who use ROS often, but that was never the case for me, so I googled around and found this nice explanation: https://github.com/openai/gym/issues/585.

In summary:


111-dim observation space

z (height) of the Torso -> 1

orientation (quaternion x, y, z, w) of the Torso -> 4

8 joint angles -> 8

3-dim directional velocity and 3-dim angular velocity -> 3+3=6

8 Joint velocity -> 8

External forces (force x,y,z + torque x,y,z) applied to the CoM of each link. The Ant has 14 links: the ground, the torso, and 12 leg links (3 links per leg × 4 legs) -> (3+3)×14 = 84


=> 1+4+8+6+8+84 = 111
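The breakdown above can be sketched as index slices over the observation vector. The ordering (height, orientation, joint angles, velocities, external forces) follows the list above, but the exact slice boundaries are my assumption, illustrated here on a dummy vector:

```python
import numpy as np

obs = np.zeros(111)  # placeholder observation vector

torso_z      = obs[0:1]     # torso height                    -> 1
torso_quat   = obs[1:5]     # torso orientation (quaternion)  -> 4
joint_angles = obs[5:13]    # 8 joint angles                  -> 8
lin_ang_vel  = obs[13:19]   # 3-dim linear + 3-dim angular    -> 6
joint_vel    = obs[19:27]   # 8 joint velocities              -> 8
ext_forces   = obs[27:111]  # (force + torque) * 14 links     -> 84

# The pieces account for the full 111 dimensions
total = sum(x.size for x in
            [torso_z, torso_quat, joint_angles,
             lin_ang_vel, joint_vel, ext_forces])
assert total == 111
```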


8-dim action space

Torque of each joint (value from -1 to +1)
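Since each action dimension is a torque bounded in [-1, +1], a valid random action can be sampled uniformly (unlike np.random.randn in the snippet above, which is unbounded). A minimal sketch:

```python
import numpy as np

act_dim = 8  # one torque command per joint

# Uniform sampling directly respects the [-1, +1] actuator range
action = np.random.uniform(-1.0, 1.0, size=act_dim)

assert action.shape == (act_dim,)
assert np.all(np.abs(action) <= 1.0)
```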


Reward structure

Make a four-legged creature walk forward as fast as possible. 

Ant-v1 defines "solving" as getting an average reward of 6000.0 over 100 consecutive trials. 

xposbefore = self.get_body_com("torso")[0]
self.do_simulation(a, self.frame_skip)
xposafter = self.get_body_com("torso")[0]
forward_reward = (xposafter - xposbefore)/self.dt
ctrl_cost = .5 * np.square(a).sum()
contact_cost = 0.5 * 1e-3 * np.sum(
    np.square(np.clip(self.sim.data.cfrc_ext, -1, 1)))
survive_reward = 1.0
reward = forward_reward - ctrl_cost - contact_cost + survive_reward

: x-directional velocity (displacement per timestep) - sum of squared torques - contact penalty + a constant survival bonus
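The same reward can be recomputed outside the environment with plain numpy. The timestep dt = 0.05 and the toy values below are assumptions for illustration, not values read from a real rollout:

```python
import numpy as np

dt = 0.05                           # assumed env timestep (frame_skip * model dt)
xpos_before, xpos_after = 0.0, 0.1  # toy torso x-positions before/after a step
a = np.full(8, 0.5)                 # toy action: torque of 0.5 on each joint
cfrc_ext = np.zeros((14, 6))        # toy external forces: no contact at all

forward_reward = (xpos_after - xpos_before) / dt                           # 2.0
ctrl_cost      = 0.5 * np.square(a).sum()                                  # 1.0
contact_cost   = 0.5 * 1e-3 * np.sum(np.square(np.clip(cfrc_ext, -1, 1)))  # 0.0
survive_reward = 1.0
reward = forward_reward - ctrl_cost - contact_cost + survive_reward

print(reward)  # 2.0
```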


Random thoughts

- Torque-controlled actuators in this small form factor? I wish I had one too.

- An IMU at every joint plus force sensors at every link is not likely to be feasible in practice.

- The planar position (x, y) is NOT used. Interesting.

- Do we really need this huge state representation?

- The contact model can never be perfectly accurate.

- What happens if we modify the friction coefficients?













