# DMPL Reviews

Posted 2017. 11. 30. 15:16

__Meta Review__

As pointed out by the reviewers, the paper is interesting and provides novel components. The reviewers' comments give plenty of hints on how to improve, particularly the presentation and writing style, which should address the robotics community and not just the ML community. Issues about continuous vs. discrete spaces should be formulated very clearly, as this is important for the paper (and mentioned in its title).

No major issues here.

__Review 1__

This paper introduces a method to infer a policy from demonstrations by (1) using inverse reinforcement learning to learn a reward function, and (2) using model-predictive control for the policy (which removes the need to actually search for a policy). The authors provide an extensive theoretical analysis, and they benchmark their approach on two problems: a discrete "object world" and a continuous "track driving" experiment. According to the benchmark, their approach outperforms the state of the art.

I am not a specialist of inverse reinforcement learning, but there are many points that are unclear to me, and they should at least be clarified. In addition, it seems to me that this paper would fit better in a machine learning conference / journal than a robotics journal (no robot experiment, no realistic simulations of a robot). However, the experiments look promising.

**Major points**

- The problem formulation section presents both the problem and the proposed approach. This makes the paper difficult to understand. The problem formulation should only present the basic notation and the question asked; a separate "Approach" section should describe how that question is answered.

In the problem formulation, only pose the problem; present our method in an Approach or Proposed Method section!

- The problem formulation is for the discrete case whereas the title is about the continuous case. Do the authors want to solve the continuous case or the discrete case? The continuous case is added as a small equation mixed with the approach proposed by the authors. I suppose they should start with the continuous case and then derive a discrete formulation.

The reviewer claims the problem formulation covers only the discrete case, but that is not really true.

- It is very hard to identify the novel idea of this paper. Is it to rewrite the product of eq. (1) as an inner product? That sounds like a straightforward generalization of equation (1). Is it to use inverse average reinforcement learning for imitation learning? Is it the kernel formulation?

Why do they say there is no novel idea? We should present the intuition.

The main intuition behind the proposed method is that

- The authors claim that their approach is model-free, but for the Model Predictive Control policy, they need to have access to the dynamics function: s_{t+1} = f(s_t, a_t). This is basically the model of the environment, right (you can almost always infer the transition function T from f and the inverse)?

We never say it is model-free? Ah, we did put it in the table. Let's fix that!
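The reviewer's point can be illustrated with a minimal random-shooting MPC sketch (the 1-D dynamics `f` and reward are hypothetical, not the paper's): without access to the dynamics function, the planner cannot roll candidate action sequences forward at all.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(s, a):
    # Hypothetical known dynamics: this is exactly the model the MPC needs.
    return s + 0.1 * a

def reward(s, a):
    # Stand-in for a learned reward, peaked at s = 1.
    return np.exp(-(s - 1.0) ** 2)

def mpc_action(s0, horizon=10, n_samples=256, gamma=0.95):
    """Random-shooting MPC: sample action sequences, roll them out
    through f, and return the first action of the best sequence."""
    best_a, best_ret = 0.0, -np.inf
    for _ in range(n_samples):
        actions = rng.uniform(-1.0, 1.0, size=horizon)
        s, ret = s0, 0.0
        for t, a in enumerate(actions):
            ret += gamma ** t * np.log(reward(s, a) + 1e-8)
            s = f(s, a)  # rollout is impossible without the model
        if ret > best_ret:
            best_a, best_ret = actions[0], ret
    return best_a

print(mpc_action(0.0))
```

Every rollout calls `f`, which is why "model-free" is hard to defend for the MPC stage.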

**Minor points**

- The 5th paragraph of the introduction (Sec. I) needs re-phrasing to be clearer. The same applies to the first two paragraphs of Related Work (Sec. II).

- The authors should double-check the notation: for example, they use T both as the transition function and as the number of horizon steps.

- The choice of kernel in Eq. (12) seems ad-hoc and no useful insight is given.

- Why is GPIRL so much slower than DMPL? Both approaches invert a kernel matrix of similar dimensions. If not, this means the authors didn't use sparse GPs (with inducing points or otherwise) for GPIRL, which makes the computation-time comparison really unfair. The authors should clarify this.

We can address this point.

- Quoting the authors: "Once a reward function is optimized, a simple sample-based random steering method is used to control the car". Why didn't the authors apply the MPC-based policy as defined in Eq. (4)? I know that the underlying idea is the same, but why the need to change the policy?

These two are the same thing; we should clarify this further.

__Review 2__

(1) Summary:

The paper proposes an approach for Imitation Learning in continuous state and action spaces based on a non-iterative two-stage learning process. First, a reward is learned to maximize value under an empirical estimate of the stationary state-action distribution of the expert. Then, MPC is used to find a policy that maximizes the cumulated, discounted log rewards. The paper provides a theoretical analysis for the approach.

Furthermore, it proposes a practical IRL algorithm using KDE to estimate the joint probability distribution of states and actions. Then, a reward is learned based on a finite set of basis functions and optimized to maximize value under regularization. The approach is evaluated based on the object world and a driving environment and mostly outperforms the other approaches.
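The KDE step described in the summary could look roughly like the following sketch (using SciPy's `gaussian_kde` on a hypothetical 1-D state and action; this is not the authors' implementation):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)

# Hypothetical expert demonstrations: 500 state-action pairs (1-D each),
# roughly following the policy a = 0.5 * s.
states = rng.normal(0.0, 1.0, size=500)
actions = 0.5 * states + rng.normal(0.0, 0.1, size=500)

# KDE over the joint (s, a) space; gaussian_kde expects rows = dimensions.
kde = gaussian_kde(np.vstack([states, actions]))

# The estimated joint density is higher near demonstrated behavior.
on_expert = kde([[0.0], [0.0]])[0]   # (s, a) consistent with the expert
off_expert = kde([[0.0], [1.0]])[0]  # action far from the expert's
print(on_expert, off_expert)
```

A reward derived from such a density would then score expert-like state-action pairs higher, which matches the norm-ball discussion in [4.1] below.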

(2) Related Work:

Most relevant work is cited.

- [8] Ho et al.: This isn't an IRL approach. They don't learn the expert's reward function; rather, they estimate a surrogate reward that guides an RL agent to learn a policy matching the state-action distribution. As soon as it matches, the surrogate reward function becomes constant or undefined.
- [16] Boularias et al.: They propose to minimize the relative entropy between the modeled distribution over trajectories and some baseline distribution under feature-matching constraints. If the baseline distribution is assumed to be uniform, this corresponds to MaxEnt IRL. However, it does not directly minimize the entropy between the empirical distribution of demonstrations and the modeled one.
- [18] Herman et al.: They proposed a model-based approach, since they incorporate learning a model of the environment's dynamics into IRL.
- [7] Levine & Koltun: If the feature function is differentiable, numerical differentiation is not necessary.

(3) Evaluation of Strengths and Weaknesses:

The proposed approach is non-iterative and can be applied in continuous spaces. The results show competitive or superior performance. However, the proof by contradiction for the policy will barely hold for real human demonstrations, which are stochastic, while the method is evaluated on human demonstrations. A theoretical analysis of learning from stochastic behavior would be necessary. Furthermore, motivations for some algorithmic choices are missing.

(4) Detailed Comments:

[4.1]: The norm ball constraint in Eq. (3) is introduced to handle scale ambiguity. But it further enforces that more likely state-action tuples result in higher rewards.

[4.2]: It is not clear why the cumulated, discounted "log" rewards are maximized. Is it due to Eq. (3), which enforces $R: S \times A \to [0,1]$ and $\lVert \mathbf{R} \rVert_2 = 1$ while more likely state-action tuples have higher rewards? Then, rewards correspond to the nonlinearly scaled density of the expert's state-action tuples. If this is the case, why didn't the authors use $\mu^\ast$ for $R^\ast$, since this argumentation is used in the first proof of the supplement?

[4.3]: While the proof by contradiction of Theorem 1 holds for deterministic policies, it might not hold for stochastic policies. However, human demonstrations are rarely optimal and used in the experiments. Is it possible to provide bounds in matching the expert's policy to stochastic behavior?

[4.4]: The policy in Eq. (4) maximizes the discounted, cumulated log reward. However, in the experiment the transition model is stochastic. Then, optimizing expected discounted, cumulated log rewards would be more meaningful. Otherwise the policy might be suboptimal.
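The reviewer's suggestion (scoring action sequences by their expected discounted, cumulated log reward under stochastic transitions, rather than by a single deterministic rollout) can be sketched as a Monte Carlo average; the dynamics noise and reward below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

def f_stochastic(s, a):
    # Hypothetical stochastic transition: known mean plus Gaussian noise.
    return s + 0.1 * a + rng.normal(0.0, 0.05)

def reward(s, a):
    # Stand-in learned reward in (0, 1], peaked at s = 1.
    return np.exp(-(s - 1.0) ** 2)

def expected_return(s0, actions, gamma=0.95, n_rollouts=100):
    """Estimate E[sum_t gamma^t log R(s_t, a_t)] by averaging many
    noisy rollouts instead of scoring one deterministic trajectory."""
    total = 0.0
    for _ in range(n_rollouts):
        s, ret = s0, 0.0
        for t, a in enumerate(actions):
            ret += gamma ** t * np.log(reward(s, a) + 1e-8)
            s = f_stochastic(s, a)
        total += ret
    return total / n_rollouts

print(expected_return(0.0, np.full(10, 1.0)))
```

Ranking action sequences by this estimate rather than a single rollout is what "optimizing expected discounted, cumulated log rewards" would amount to.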

[4.5] In V. it is argued that under infinite-length trajectories, $\mu^\ast$ converges to the true stationary joint distribution of states and actions regardless of the initial state distribution. This is not always true: with two equally good states, the stationary state-action distribution depends strongly on the initial state distribution.
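This objection is easy to check numerically. In a chain with two absorbing states, the long-run state distribution is determined by the initial distribution (a toy example, not from the paper):

```python
import numpy as np

# Two absorbing states 0 and 2; state 1 transitions to either.
P = np.array([
    [1.0, 0.0, 0.0],   # state 0 is absorbing
    [0.5, 0.0, 0.5],   # state 1 splits between 0 and 2
    [0.0, 0.0, 1.0],   # state 2 is absorbing
])

def long_run(init, steps=100):
    """Propagate an initial state distribution through the chain."""
    d = np.asarray(init, dtype=float)
    for _ in range(steps):
        d = d @ P
    return d

print(long_run([1, 0, 0]))  # stays concentrated on state 0
print(long_run([0, 0, 1]))  # stays concentrated on state 2
print(long_run([0, 1, 0]))  # splits evenly between states 0 and 2
```

Three different initial distributions yield three different long-run distributions, so convergence "regardless of the initial state distribution" requires extra assumptions (e.g., ergodicity).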

[4.6] In VI. B the baselines MaxEnt IRL, GPIRL, and RelEnt IRL use stochastic policies to model human behavior. It is not clear whether predictions are based on stochastic policies or on the optimal policy.

(5) Minor Suggestions:

[5.1] Zeibart et al. [12] -> Ziebart et al.

[5.2] The distance measure between probability distributions is typically referred to as total variation distance and the formula in the footnote only holds for categorical distributions.
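For reference, the total variation distance between two categorical distributions is $\mathrm{TV}(p, q) = \tfrac{1}{2}\sum_i |p_i - q_i|$; a minimal sketch:

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two categorical distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

print(tv_distance([0.5, 0.5], [0.5, 0.5]))  # 0.0 for identical distributions
print(tv_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 for disjoint supports
```

For general (non-categorical) distributions, TV is instead a supremum over measurable sets, which is the reviewer's point.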

[5.3] In VI. A, the last paragraph says "We also conducted experiments by changing the number of demonstrations..." while the figures show evaluations over grid world size.

[5.4] In VI. B it is argued that CIOC could not be compared since computations would be too demanding. However, computing (sub-)derivatives of the proposed features should be simple.

(6) Overall evaluation:

Sound idea and proofs, though some motivations are missing. The approach shows superior results in the experiments, but an analysis of learning from stochastic demonstrations is missing.
