# Reviews I got from RSS 2017

Posted 2017.05.10 17:56

__Review 1__

The paper proposes a new approach to inverse reinforcement learning. The authors take inspiration from the average-reward formulation of the RL problem to formulate the objective of maximizing the inner product between the reward function and the stationary state-action distribution. The resulting algorithm is simple yet powerful enough to compete with existing methods. The authors establish the **sample complexity of their algorithm** and provide empirical comparisons to the main existing algorithms that show their method is competitive and outperforms baselines when the underlying reward function is non-linear. The method explicitly addresses problems where the transition model is not known and state and action spaces are continuous, which is important for applying such methods on robot systems.

I like the straightforwardness of the approach

- The proposed kernel method for dealing with non-linearity yields a convex problem with a closed form solution

- The scenario where no model is available and states and actions can be continuous is important for applying such methods to real robot systems

- I've learned about a simple yet potentially powerful method for IRL

- Leveraged kernel functions were an interesting new concept for me

Fair (a paper that is on its way to making a good contribution but not there yet)

**propose seems novel and interesting**. I have one main concern regarding the simplifying assumption about the Bellman flow constraint. It seems this assumption can easily break the method. For example, in a driving scenario, let's say a vehicle in a congested city is stuck in traffic 90% of the time and on a quiet road 9% of the time. Only in a small fraction of cases (1%, say) is the driver at a point where he can choose between a busy and a quiet road. The estimated rewards would then be roughly 0.9 for driving in traffic, 0.09 for driving on a quiet road, 0.01 for choosing the quiet road at the intersection, and 0 for choosing the busy road at the intersection. Planning ahead for two or more steps would make the vehicle choose the more congested road, if I'm not mistaken. What are the limitations on the kind of problems where the algorithm would do well? Should the reward function be quite similar to the value function, for example? This is not clear from the paper.
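The arithmetic behind the reviewer's driving example can be checked with a small sketch. The rewards are the reviewer's numbers; the state names, the deterministic transitions, and the helper functions `rollout_return` and `best_first_action` are my own assumptions for illustration and are not from the paper or the review.

```python
# Rewards from the reviewer's example: each equals the fraction of time
# the expert spends in that state-action pair.
reward = {
    ("intersection", "busy"): 0.0,
    ("intersection", "quiet"): 0.01,
    ("busy_road", "drive"): 0.9,
    ("quiet_road", "drive"): 0.09,
}

# Assumed (hypothetical) deterministic dynamics: the choice at the
# intersection decides the road, and the vehicle then stays on it.
next_state = {
    ("intersection", "busy"): "busy_road",
    ("intersection", "quiet"): "quiet_road",
    ("busy_road", "drive"): "busy_road",
    ("quiet_road", "drive"): "quiet_road",
}


def rollout_return(first_action, horizon):
    """Undiscounted sum of rewards over `horizon` steps from the intersection."""
    state, action, total = "intersection", first_action, 0.0
    for _ in range(horizon):
        total += reward[(state, action)]
        state = next_state[(state, action)]
        action = "drive"  # only one action is available on either road
    return total


def best_first_action(horizon):
    """Action at the intersection that maximizes the finite-horizon return."""
    return max(["busy", "quiet"], key=lambda a: rollout_return(a, horizon))


for h in (1, 2, 3):
    print(h, best_first_action(h))
```

Under these assumptions a one-step planner picks the quiet road (0.01 versus 0), while any planner looking two or more steps ahead picks the busy road (0.9 versus about 0.10 at horizon 2), which is the failure the reviewer describes.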

__Review 2__

**reward estimation**” seems to be inappropriate. The “reward” being estimated is simply a density estimate. It is easy to construct examples where high-reward states are visited rarely by an optimal policy simply due to the stochasticity of the decision process’s state transition dynamics. The “reward” estimated then would have no relationship at all to the reward function motivating the behavior, unlike inverse reinforcement learning (IRL) methods. As a result of this, I find the analysis misleading since, unlike IRL, there is no relationship between the estimated reward function and the unknown reward function assumed to motivate observed behavior. Perhaps there is an unstated claim that state-action densities should transfer across MDPs in the same manner that reward functions transfer. Unfortunately, this is not true in general and these densities will depend heavily on the state-transition dynamics, making transfer of densities much less useful than transfer of reward functions.
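A toy calculation makes the reviewer's point about dynamics concrete. This is only a sketch under my own assumptions: a two-state Markov chain (states "other" and "goal") induced by a fixed policy, with hypothetical parameters `p_reach` and `p_leave`; none of it comes from the paper.

```python
def stationary_two_state(p_reach, p_leave):
    """Stationary distribution (other, goal) of a two-state Markov chain.

    p_reach: probability of moving other -> goal in one step.
    p_leave: probability of moving goal -> other in one step.
    """
    total = p_reach + p_leave
    return p_leave / total, p_reach / total


def power_iterate(p_reach, p_leave, steps=10000):
    """Numerical check: repeatedly apply the transition matrix to a distribution."""
    other, goal = 1.0, 0.0
    for _ in range(steps):
        other, goal = (
            other * (1 - p_reach) + goal * p_leave,
            other * p_reach + goal * (1 - p_leave),
        )
    return other, goal


# Same high-reward "goal" state, two different (hypothetical) dynamics.
slippery = stationary_two_state(p_reach=0.01, p_leave=0.5)
sticky = stationary_two_state(p_reach=0.5, p_leave=0.01)
print("goal visitation, slippery dynamics:", slippery[1])
print("goal visitation, sticky dynamics:", sticky[1])
```

In the first setting the goal state is visited about 2% of the time and in the second about 98%, even though the state the agent is trying to reach is the same. A density-based "reward" would therefore differ sharply between the two environments.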

__Review 3__

**whether it's possible to characterize the behavior of the policy** that results from the estimated reward function. I describe that below, but first mention a question that seems pretty central to being able to reproduce these results.

**I'm a little concerned, though, that the theory doesn't address the "correctness" of the formulation.** For instance, it would be great if better estimates of the state-action distribution resulted in an estimated reward function that was in some sense more *similar* to the expert's reward function (the original characterization of IRL). Or, even better, that better estimates of that state-action distribution resulted in more similar *behavior* of the resulting trained agent to the expert's behavior (the characterization of MMP, MaxEnt, and RelEnt (and even GPIRL, although indirectly)).

**exploration and interpretation of what maximizing the inner product means,** and it would be great if there were some theoretical results connecting back to some of the traditional formulations. Especially, something exploring how an optimal policy under the estimated reward function is expected to behave w.r.t. the original demonstrated policy.
