# ICML 2018 Review

Posted 2018.04.11 18:01 | ICML 2018 (International Conference on Machine Learning 2018) | July 10 to July 15, 2018, Stockholm, Sweden

Reviews For Paper | Paper ID | Title |

**Masked Reviewer ID:** Assigned_Reviewer_1

**Review:**

**Summary of the paper** (Summarize the main claims/contributions of the paper.)

The authors propose an inverse reinforcement learning method, which estimates the density of the state-action distribution from demonstrations and uses it to find a reward function by maximizing the value function. The proposed problem can be solved using linear programming for discrete state-action spaces. The paper includes a theoretical analysis of this formulation with a probabilistic bound between the estimated and optimal value function. Further, an algorithm is introduced for continuous state-action spaces. The state-action density is estimated via kernel density estimation, and the assumption is made that the reward function lies in a reproducing kernel Hilbert space. The original constrained problem is approximated with an unconstrained quadratic problem that can be solved analytically. The experimental results show that the proposed method outperforms existing methods in terms of computational complexity and performance.
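The density-estimation step this summary describes can be sketched with a hand-rolled Gaussian kernel density estimator. This is a minimal illustration only; the demonstration data, bandwidth, and function names below are mine, not the paper's:

```python
import numpy as np

def make_kde(samples, bandwidth):
    """Return a Gaussian KDE of the demonstrated (state, action) pairs.

    samples: (n, d) array of stacked state-action vectors from demonstrations.
    """
    n, d = samples.shape
    norm = (2 * np.pi * bandwidth ** 2) ** (d / 2)

    def density(x):
        # Average of isotropic Gaussian kernels centred at each demo pair.
        sq_dists = np.sum((samples - x) ** 2, axis=1)
        return float(np.mean(np.exp(-sq_dists / (2 * bandwidth ** 2))) / norm)

    return density

# Toy demonstrations: 1-D state and 1-D action stacked into (s, a) vectors.
rng = np.random.default_rng(0)
demos = rng.normal(loc=[0.0, 1.0], scale=0.3, size=(200, 2))
rho_hat = make_kde(demos, bandwidth=0.2)
print(rho_hat(np.array([0.0, 1.0])))   # high near the demonstration mode
print(rho_hat(np.array([3.0, -3.0])))  # essentially zero far from any demo
```

The estimated density is high where the expert actually visited and near zero elsewhere, which is what lets a reward be fit to it.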

**Clarity** (Assess the clarity of the presentation and reproducibility of the results.): Above Average

**Significance** (Does the paper contribute a major breakthrough or an incremental advance?): Above Average

**Correctness** (Is the paper technically correct?): Paper is technically correct

**Overall Rating:** Weak accept

**Detailed comments.** (Explain the basis for your ratings while providing constructive feedback.)

I think the paper is very interesting and the results look promising. The proposed algorithm in Sect. 3 and the corresponding theoretical analysis in Sect. 4 are important contributions to the inverse reinforcement learning research area. Having said this, it would be interesting to see whether this approach is still feasible with: 1) real-world, noisy data; 2) only a few (local) demonstrations; 3) higher-dimensional state spaces (is it still possible to approximate the density?). However, it is clear that not all these aspects can be addressed in one paper. I also have some questions below regarding Sect. 5 that are unclear to me. Otherwise, the paper is clearly structured and well written, though some proofreading (grammar, typos) should be done to improve readability.

Questions:
- What is the advantage of approximating the problem in Eq. 8 as an unconstrained optimization problem? Why not stay with the problem in Eq. 3 and solve it with a constrained optimization method? Would this be possible? What is the advantage of doing it the proposed way?
- Related to the previous question: why is the additional l2-norm regularization with hyper-parameter beta necessary in Eq. 13? Shouldn't the lambda term be sufficient to regularize alpha?
- Experiment 6.1: Is the algorithm used DMPL or KDMPL (Fig. 1 says DMPL and the text says KDMPL)? It is a discrete state-action space problem, so it should be possible to solve with Eq. 3, right?

Minor things:
- Line 144: Is there a reason why the continuous form is used here? It was a bit confusing, since the other parts are described in the discrete domain.
- Double usage of the symbol \gamma (lines 150 and 230).
- Double usage of the symbol N (lines 115 and 178).
- What is d in Theorem 3? I couldn't find it in the paper.
- What exactly is the expected value difference? Is it the difference from the optimal value?

Typos:
- Appendix: KDMRL -> KDMPL
- Line 154: 'goes to infinity'
- Line 210 (right column): 'convex quadratic program'
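On the reviewer's question about the beta term: a generic reason to add an l2 penalty on alpha is that it makes a quadratic objective strictly convex with a closed-form minimizer. The sketch below is a standard regularized least-squares illustration of this point, not the paper's actual Eq. 13; the kernel, data, and value of beta are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = rng.normal(size=(n, 1))
K = np.exp(-(X - X.T) ** 2)   # Gaussian kernel Gram matrix on the inputs
y = np.sin(X).ravel()         # toy targets
beta = 1e-3

# With the l2 term beta * ||alpha||^2, the objective
#     ||K @ alpha - y||^2 + beta * ||alpha||^2
# has the unique analytic minimizer below; beta > 0 guarantees that
# K.T @ K + beta * I is invertible even when K is ill-conditioned.
alpha = np.linalg.solve(K.T @ K + beta * np.eye(n), K.T @ y)

residual = float(np.linalg.norm(K @ alpha - y))
print(residual)  # small: the regularized fit still matches the targets well
```

Whether beta is redundant given the lambda term depends on the exact role lambda plays in Eq. 13, which is precisely what the reviewer is asking the authors to clarify.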

**Reviewer confidence:** Reviewer is knowledgeable

**Masked Reviewer ID:** Assigned_Reviewer_2

**Review:**

**Summary of the paper** (Summarize the main claims/contributions of the paper.)

This paper proposes a method for inverse reinforcement learning (IRL). The proposed method involves finding a reward function that maximizes the average reward of the expert given the empirical state-action density of the expert. A policy is then recovered with model-predictive control using the logarithm of this estimated reward. The proposed method involves an optimization problem that is intractable for continuous state-action spaces. This motivates a practical algorithm that uses kernel density estimation to estimate the state-action density of the expert and then solves an optimization problem to find the reward function. Theoretical results for asymptotic performance and finite-sample performance are given, along with empirical results showing the proposed method performs better than previous methods.
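The model-predictive-control step this summary mentions can be sketched as simple random-shooting MPC under known dynamics. Everything here, the dynamics, reward, horizon, and candidate count, is a toy stand-in of my own, not the paper's setup:

```python
import numpy as np

def mpc_action(state, step, reward, horizon=10, n_candidates=200, seed=0):
    """Pick the first action of the best randomly sampled action sequence."""
    rng = np.random.default_rng(seed)
    best_ret, best_a0 = -np.inf, 0.0
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=horizon)
        s, ret = state, 0.0
        for a in actions:
            s = step(s, a)                       # roll out the known dynamics
            ret += np.log(reward(s, a) + 1e-12)  # log of the estimated reward
        if ret > best_ret:
            best_ret, best_a0 = ret, actions[0]
    return best_a0

# Toy problem: drive a 1-D state toward 0; the reward peaks at s = 0.
step = lambda s, a: s + 0.1 * a
reward = lambda s, a: np.exp(-s ** 2)

s = 2.0
for _ in range(30):
    s = step(s, mpc_action(s, step, reward))
print(s)  # the controller pulls the state toward the reward peak at 0
```

Replanning from scratch at every step is what makes this MPC rather than open-loop trajectory optimization; only the first action of each planned sequence is ever executed.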

**Clarity** (Assess the clarity of the presentation and reproducibility of the results.): Below Average

**Significance** (Does the paper contribute a major breakthrough or an incremental advance?): Above Average

**Correctness** (Is the paper technically correct?): Paper is technically correct

**Overall Rating:** Weak accept

**Detailed comments.** (Explain the basis for your ratings while providing constructive feedback.)

Overall this paper proposes an interesting approach with good empirical and theoretical results supporting it. The paper would be much stronger with some organizational changes and if the presentation of the methods and experiments were clarified.

My comments on the organization of the paper concern Section 4 (Theoretical Analysis). First, while proofs are given in the supplemental material, I suggest stating so in this section; as-is, it is unclear that the theorems are supported. I would also suggest, for each theorem, giving the theorem first and then describing what it tells us, so the reader sees the result before its interpretation. Finally, I think Theorem 3 should not be introduced until KDMPL is introduced; it's hard to understand the significance of the theorem before the method has been presented. Related to this, Theorem 3 relies on assumptions that are not given in the main text. The paper would be more self-contained if these assumptions were present.

I also have some clarification questions that I would like to hear the authors' response to:
1. Why do you use the logarithm of the reward when solving for the policy?
2. How restrictive is the RKHS assumption? Are there cases where this assumption wouldn't hold?
3. What is the RKHS referred to in the paper? The RKHS should be clearly defined.
4. In the introduction, the authors mention the work of Finn et al. and say that it requires an additional sampling stage. But later in the paper the authors discuss using trust-region policy optimization (TRPO), which would require additional sampling from the environment. How is it that the work of Finn et al. requires additional sampling but the proposed approach does not? Also, is there a reason that the method of Finn et al. isn't used as a baseline?
5. I'm confused about how TRPO is being used in this method. Is the policy not recovered by MPC when using KDMPL? Do you assume access to a model or simulator that TRPO can be run in, or are you running it on the true environment?
6. What is the median trick? The paper would be more self-contained if this were given instead of just cited.

Minor comments:
1. Intro: to overcome this issue, *a* few IRL methods...
2. Intro: Main intuition -> The main intuition
3. 3.1: a chain rule -> the chain rule of probability
4. After Theorem 3: The second part -> The second *term*
5. Section 5: we have to compensate *for* the effect of
6. Line 425: minimum -> smallest (unless you mean the method could not do better).
7. Conclusion: continuous spaces -> continuous state-action spaces.
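For context on question 6: the "median trick" (median heuristic) is a common rule of thumb that sets a kernel bandwidth to the median pairwise distance between the data points, so that the kernel scale matches the data's typical spread. A minimal sketch, with function and variable names of my own:

```python
import numpy as np

def median_bandwidth(samples):
    """Median heuristic: bandwidth = median pairwise Euclidean distance."""
    # Pairwise squared distances via ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y
    sq = np.sum(samples ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * samples @ samples.T
    d2 = np.maximum(d2, 0.0)  # guard against tiny negatives from round-off
    # Use only off-diagonal entries (a point's distance to itself is 0).
    iu = np.triu_indices(len(samples), k=1)
    return float(np.sqrt(np.median(d2[iu])))

rng = np.random.default_rng(2)
pts = rng.normal(size=(100, 3))
h = median_bandwidth(pts)
print(h)  # a positive scalar on the scale of typical inter-point distances
```

The appeal of the heuristic is that it is parameter-free and scale-equivariant, which is presumably why the paper only cites it, but the reviewer is right that one sentence would make the paper self-contained.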

**Reviewer confidence:** Reviewer is knowledgeable

**Masked Reviewer ID:** Assigned_Reviewer_3

**Review:**

**Summary of the paper** (Summarize the main claims/contributions of the paper.)

The authors propose an imitation learning method that estimates a "proximal reward function" by matching the joint distribution of states and actions. Although the authors use the term "proximal reward function" in this paper, what is actually estimated by the proposed method is the value function V(s), which is different from the reward function r(s, a) in the standard definition.

**Clarity** (Assess the clarity of the presentation and reproducibility of the results.): Below Average

**Significance** (Does the paper contribute a major breakthrough or an incremental advance?): Below Average

**Correctness** (Is the paper technically correct?): Paper is technically correct

**Overall Rating:** Weak reject

**Detailed comments.** (Explain the basis for your ratings while providing constructive feedback.)

My main concern is that the proposed method seems almost the same as "OptV", proposed in [1] "Inverse Optimal Control with Linearly-Solvable MDPs", K. Dvijotham & E. Todorov, ICML 2010. In [1], the value function indicated by the demonstrations is estimated as a maximum-likelihood solution, which is very similar to what is described in this paper. From the description in Section 3, I don't see any clear difference. The authors should cite [1] and clarify the difference. Assuming that the method described in Section 3 is equivalent to OptV in [1], the contribution of the paper is just the convergence analysis in Section 4. Since function approximation for OptV is discussed in [1], the contribution of using an RKHS seems minor.

Another concern is the evaluation of the proposed method. Although the authors compare their proposed method with several previous methods, the compared methods are not recent. The authors should compare with more recent methods such as Generative Adversarial Imitation Learning or Guided Cost Learning. In addition, I expect that similar results could be obtained in the racing-car task using OptV. The experimental results do not support the contribution of this paper.

**Reviewer confidence:** Reviewer is knowledgeable

**Masked Reviewer ID:** Assigned_Reviewer_4

**Review:**

**Summary of the paper** (Summarize the main claims/contributions of the paper.)

This paper casts imitation learning as the problem of learning a reward function from expert demonstrations, akin to inverse reinforcement learning. It uses a simple formulation of finding a reward function that matches the empirical state-action distribution induced by the demonstrations. Once a reward function is learnt, the policy is defined by model predictive control assuming known dynamics. A kernel-based extension for learning reward functions is also presented. Favorable empirical comparisons are reported against maximum margin planning, learning to search, multiplicative weights apprenticeship learning, maximum entropy inverse reinforcement learning, and Gaussian process inverse reinforcement learning.

**Clarity** (Assess the clarity of the presentation and reproducibility of the results.): Above Average

**Significance** (Does the paper contribute a major breakthrough or an incremental advance?): Below Average

**Correctness** (Is the paper technically correct?): Paper is technically correct

**Overall Rating:** Weak accept

**Detailed comments.** (Explain the basis for your ratings while providing constructive feedback.)

Since it relies on density estimation, one would expect the proposed method to be less effective than other baselines when the state/action dimensionality is high and the number of demonstrations is smallish. This issue is not fully explored or discussed in the paper.

Strictly speaking, Eqn 8 will not admit a representer theorem for continuous state-action spaces. As such, the method is less in the RKHS spirit and more in the basis-functions spirit.

The object-world task is not clearly described. Please rewrite the section with an illustration.

It is somewhat surprising that MMP, MaxEnt, etc. perform poorly. What are the benefits of the proposed scheme over these alternatives?

All experiments are on relatively smallish state/action spaces. To get a full understanding of the potential of the method, larger-scale experiments would be welcome.

**Reviewer confidence:** Reviewer is knowledgeable
