# T-RO reviews

Posted 2018.07.27 05:05

Dear Prof. Songhwai Oh,

Your paper 18-0082, "Robust Learning from Demonstrations with Mixed Qualities Using Leveraged Gaussian Processes" by Sungjoon Choi, Kyungjae Lee, and Songhwai Oh, submitted as a Regular Paper to the IEEE Transactions on Robotics (T-RO), has been reviewed by the Associate Editor and selected Reviewers. The reviews of the paper are attached. The editorial report follows:

*****

The reviewers agree that the authors have carefully addressed the comments raised in the first review round and the additional experiments and discussion clarify the novelty of the paper relative to the authors' prior work.

However, the overlap with previous publications remains too high (36% on iThenticate). As stated by the authors, the algorithmic novelty pertains primarily to one section. The authors must reduce the overlap with their previous publications. They are requested to **drastically reduce the technical description of previous work** (GP, the authors' own published contribution) and to focus the algorithmic developments only on the novel contribution.

In addition, there are a few more issues that should be addressed before the paper is ready for publication:

- The optimization properties of the proposed doubling variables method should be described in greater detail (Reviewer 3)

- The relationship between signal noise adaptation and leverage parameters should be elaborated (Reviewer 3)

- The expert ratio under which the approach no longer works should be better characterized (Reviewer 4)

- Please consider the suggested additional related work (Reviewer 2)

*****

On the basis of the reviewers' ratings and comments, your paper is conditionally accepted as a Regular Paper in the IEEE Transactions on Robotics. You should prepare and submit the revised version of the paper within 60 days from today. Note that this is a strict deadline.

Please consult the T-RO website http://www.ieee-ras.org/publications/t-ro for instructions on electronic resubmission. The revised manuscript should be formatted so that any changes can be easily identified by the reviewers, e.g., by using colored or bold text to indicate revised passages. In addition to the revised version of your conditionally accepted paper, you should also upload a single PDF file containing your reply to the reviewers' comments and the list of changes made in the paper.

I will then verify the changes with the help of the Associate Editor (and, possibly, of selected Reviewers).

If you have any related concern, do not hesitate to contact me.

Sincerely,

Aude Billard

Editor

IEEE Transactions on Robotics

__Now, this is where the real reviews begin. I just need to read the reviews from Reviewers 2, 3, and 4 and respond to them.__

**#2 - Nothing much to do here. It seems I just need to read paper [1] below carefully and mention it.**

Brief Summary
-------------

The article presents an algorithm for learning from demonstration (LfD) with multiple demonstrators.

- Right.

While LfD usually assumes that the demonstrations come from experts, the proposed approach exploits leveraged Gaussian process (LGP) regression to overcome this limitation, allowing it to learn from demonstrations of mixed quality. LGP uses an additional set of (hyper)parameters to discriminate between positive and negative samples (i.e., samples the regressor tries to stay away from). The intuitive idea behind LGP is to automatically give different importance (leverage) to samples coming from real experts and to those coming from inexperienced users. This problem can be interpreted as a supervised learning problem with unlabeled data, which is solved here by means of an optimization with a sparsity constraint.

- Right. The reviewer understood it well.
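The leveraged kernel summarized above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the form k_L = cos(π/2·(γ_i − γ_j))·k_SE(x_i, x_j) with leverages γ ∈ [−1, 1], and it assumes that all samples from one demonstrator share a single leverage. The names `se_kernel` and `leveraged_gram` are hypothetical.

```python
import numpy as np


def se_kernel(xa, xb, length_scale=1.0):
    """Squared-exponential kernel between two 1-D input arrays."""
    d = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)


def leveraged_gram(x, demo_id, leverage_per_demo, length_scale=1.0):
    """Gram matrix of a hypothetical leveraged kernel.

    Each sample inherits the leverage of its demonstrator, so there is one
    parameter per demonstrator instead of one per sample. The SE kernel is
    scaled by cos(pi/2 * (g_i - g_j)): +1 for equal leverages, -1 for
    leverages +1 and -1, and 0 for leverages 0 and +1.
    """
    g = leverage_per_demo[demo_id]  # per-sample leverages
    scale = np.cos(0.5 * np.pi * (g[:, None] - g[None, :]))
    return scale * se_kernel(x, x, length_scale)


if __name__ == "__main__":
    # Toy data: six samples from three demonstrators (indices 0, 1, 2).
    x = np.array([0.0, 0.5, 0.0, 0.5, 1.0, 1.5])
    demo_id = np.array([0, 0, 1, 1, 2, 2])
    leverage_per_demo = np.array([1.0, -1.0, 0.5])  # 3 parameters instead of 6
    K = leveraged_gram(x, demo_id, leverage_per_demo)
    print(np.round(K, 3))
    print(np.linalg.eigvalsh(K).min())  # non-negative up to round-off
```

Since cos(a − b) = cos a cos b + sin a sin b, the leverage factor is a sum of two rank-one kernels, and its elementwise product with the SE kernel stays positive semidefinite (Schur product theorem). The Gram matrix is therefore a valid GP covariance for any leverage values.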

The optimization problem presented in [5] (based on proximal linearized mapping) suffers from two major drawbacks: 1) scalability of the optimization problem and 2) difficulty in dealing with negative demonstrations. While the first issue is solved by reducing the number of leverage parameters from one per trajectory to one per demonstrator, the second one is solved by rewriting the L1 sparsity constraint using the "doubling variable" trick.
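The doubling-variable trick can be illustrated on a generic least-squares problem with an L1 penalty. This sketch assumes that objective, 0.5·‖Ax − b‖² + λ‖x‖₁, and solves it with plain projected gradient descent; it is not the leverage objective or the solver used in the paper, and `l1_least_squares_doubling` is a hypothetical name.

```python
import numpy as np


def l1_least_squares_doubling(A, b, lam, n_iter=2000):
    """Minimize 0.5*||A x - b||^2 + lam*||x||_1 via the doubling-variable trick.

    Write x = u - v with u, v >= 0 and replace ||x||_1 by sum(u + v). The
    objective becomes smooth in (u, v); the only constraints are simple
    nonnegativity bounds, handled here by projected gradient descent.
    """
    n = A.shape[1]
    u = np.zeros(n)
    v = np.zeros(n)
    # Step 1/L, where L = 2 * ||A||_2^2 is the Lipschitz constant of the
    # gradient with respect to the stacked variables (u, v).
    step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)
    for _ in range(n_iter):
        g = A.T @ (A @ (u - v) - b)  # gradient of the data term w.r.t. x
        u_new = np.maximum(0.0, u - step * (g + lam))   # d/du = g + lam
        v_new = np.maximum(0.0, v - step * (-g + lam))  # d/dv = -g + lam
        u, v = u_new, v_new
    return u - v


if __name__ == "__main__":
    A = np.eye(3)
    b = np.array([3.0, -0.5, 1.0])
    print(l1_least_squares_doubling(A, b, lam=1.0))
```

When λ > 0, u_i and v_i cannot both be positive at an optimum (their partial derivatives g_i + λ and −g_i + λ cannot both vanish), so sum(u + v) equals ‖u − v‖₁ there and the smooth reformulation has the same minimizer as the original nonsmooth problem. With A = I the result matches soft-thresholding: b = (3, −0.5, 1) and λ = 1 give x ≈ (2, 0, 0).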

Comments
--------

The authors seem to have carefully addressed the issues raised by the reviewers in the first round. Overall, I think that the paper is ready for publication. In particular, my concern about the novelty of the paper has been addressed in the rebuttal and I agree with the authors' arguments.

- Thanks for saying so.

By looking into the literature, I found a recent paper in the reinforcement learning literature that addresses the same problem: demonstrations collected from multiple demonstrators that can be suboptimal. In [1], the authors call this setting multi-expert inverse reinforcement learning. Since they both recover a policy and a reward function, I think it is necessary to compare against it in the paper.

- It looks like I only need to compare against this paper. **To do: read and understand paper [1].**

[1] Davide Tateo, Matteo Pirotta, Andrea Bonarini and Marcello Restelli: Gradient-Based Minimization for Multi-Expert Inverse Reinforcement Learning. IEEE SSCI 2017

**#3 - This reviewer isn't saying anything terribly harsh either. However, it may be worth running an experiment on what happens if the expected measurement noise is learned instead.**

The revised manuscript has clearly improved in the readability and presentation of the leveraged Gaussian process algorithm. However, my concerns about the **scientific contribution remain**. The extensive experimental analysis shows the wide applicability of the approach, though.

While I find the idea to incorporate repellent training samples very interesting in principle, I still see the following shortcomings in the contribution of the revised manuscript:

- The authors confirm that adapting the signal noise per training data point results in GP behavior similar to setting the leverage parameters between 0 and 1. I even believe that one can find signal noise parameters such that the resulting GPs are identical. The behavior for training data with leverage parameter 0 would be obtained for infinite signal noise. As this isn’t compatible with a Gaussian prior, one could omit these training samples (as they don’t influence the GP anyway) or approximate the behavior by pushing the signal noise towards infinity. Hence, in my opinion, the contribution reduces to multiplying the covariance of a “good” and a “bad” training sample by -1. This approach is definitely worth a publication, but it has already been published in [5] and [6].

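**: The point here is that, instead of leverage optimization, one could simply increase the expected measurement noise of the data. This seems worth trying as an experiment. One difference is that this noise is independent per data point, whereas the leverage has a kind of clustering effect.**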
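Reviewer 3's argument, that leverage 0 behaves like infinite per-sample signal noise, can be checked numerically on toy data. The sketch assumes the hypothetical leverage factor cos(π/2·(γ_i − γ_j)), test queries with leverage +1, and a fixed SE kernel with noise variance 0.1; the data and the function names are made up for illustration.

```python
import numpy as np


def se_kernel(xa, xb, length_scale=1.0):
    """Squared-exponential kernel between two 1-D input arrays."""
    d = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)


def gp_mean_per_point_noise(x, y, noise_vars, x_test):
    """Standard GP predictive mean with an independent noise variance per training point."""
    K = se_kernel(x, x) + np.diag(noise_vars)
    return se_kernel(x_test, x) @ np.linalg.solve(K, y)


def lgp_mean(x, y, gamma, x_test, noise_var=0.1):
    """Leveraged-GP predictive mean under the hypothetical cos(pi/2 * (g_i - g_j)) factor.

    Test inputs are treated as expert queries (leverage +1).
    """
    K = np.cos(0.5 * np.pi * (gamma[:, None] - gamma[None, :])) * se_kernel(x, x)
    K += noise_var * np.eye(len(x))
    k_star = np.cos(0.5 * np.pi * (1.0 - gamma))[None, :] * se_kernel(x_test, x)
    return k_star @ np.linalg.solve(K, y)


if __name__ == "__main__":
    # Three expert samples plus one bad sample at x = 0.2 (toy data).
    x = np.array([-1.0, 0.0, 1.0, 0.2])
    y = np.array([0.5, 0.0, -0.5, 2.0])
    x_test = np.linspace(-1.5, 1.5, 7)

    omitted = gp_mean_per_point_noise(x[:3], y[:3], np.full(3, 0.1), x_test)
    inflated = gp_mean_per_point_noise(x, y, np.array([0.1, 0.1, 0.1, 1e6]), x_test)
    leverage0 = lgp_mean(x, y, np.array([1.0, 1.0, 1.0, 0.0]), x_test)
    repelled = lgp_mean(x, y, np.array([1.0, 1.0, 1.0, -1.0]), x_test)

    print(np.allclose(omitted, inflated, atol=1e-4))   # huge noise ~ dropping the sample
    print(np.allclose(omitted, leverage0, atol=1e-9))  # leverage 0 ~ dropping the sample
    print(repelled - omitted)  # effect of the repulsive (leverage -1) sample
```

With leverages restricted to ±1 this kernel gives K_L = D(K + σ²I)D with D = diag(γ), so the leveraged mean equals an ordinary GP mean with the bad targets sign-flipped; this is the "multiply the covariance by −1" case the reviewer describes. Per-sample noise can only shrink a sample's influence toward zero, never make it repulsive, and each noise value is independent, whereas a shared leverage ties all samples of one demonstrator together.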

- A valid scientific contribution of this paper is presented in Section V C, “Leverage Optimization by Doubling Variables”. Concerning this algorithm, I agree with Reviewer #6 that a definite drawback is the large number of parameters that need to be optimised. This section lacks a mathematical proof or an explanation of how an n×m matrix is properly optimised with m observations.

**: Lacking theory? Learning...? An n×m matrix being learned from only m observations?! That doesn't seem right.**

Additionally, I noticed some issues in the formulation of statements. I would suggest the authors to carefully revise the correctness of the following claims:

- In the abstract the authors write that they “propose a novel algorithm for learning from demonstration (LfD)”. The algorithm isn’t novel; it was previously published in the authors' prior work [5,6].

**: Reword this.**

- In Section IV A, where the leveraged kernel function is introduced, only reference [5] is mentioned (the original publication of equation (5)). The authors omit that equation (6) was likewise already published in [6]. Even the proof of Proposition 1 was already given in [6].

**: Reword this.**

- In section V the authors write that they assume that the proficiencies of the demonstrations are known. This contradicts the statement in the abstract “demonstrations with different proficiencies are provided without labeling”.

**: Reword this.**

Minor comments:

- Section III B can be shortened and focused more on the essential part, the hyperparameter optimisation. I find the discussion about choosing an appropriate kernel function redundant, as the only kernel investigated is the squared exponential kernel anyway.

- Equations (5) and (6) and the surrounding text: the usage of l and γ is not consistent

**#4: This reviewer is asking up to what point learning works and when it stops working.**

The authors have provided additional empirical evidence of the robustness of their method to bad demonstrations (which is the main claim of the paper), although I would have preferred to see this study on the control problem rather than on the regression problem (a minor complaint).

**: On the control problem...?! How would we apply it there?**

In addition, the authors responded to all minor comments and fixed them accordingly.

I have only one enquiry regarding the new experiments: is there an explanation for why DVM stops working below an expert ratio of 0.2 in Fig. 5-(c,f)? And how does this relate to a property of the task? For instance, if the problem had 5 actions, it would be understandable, since that is the threshold where the good action has the same density as any other action (assuming bad experts are uniform over the remaining actions). But that is not the case for these regression problems. Maybe the authors have some other intuition about which properties of the task the critical expert ratio might depend on (which could be added to the article when discussing these figures)?
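The reviewer's five-action example can be written out as a short calculation. Here r denotes the expert ratio and K the number of discrete actions; both symbols are introduced only for this illustration, and the uniform-bad-expert assumption is the reviewer's, not a property of the regression tasks in the paper.

```latex
% Expert ratio r: probability mass on the correct action a^\star.
% Bad demonstrators spread the remaining mass uniformly over the other K-1 actions.
p(a^\star) = r, \qquad p(a) = \frac{1 - r}{K - 1} \quad \text{for } a \neq a^\star
% The correct action stops being the most frequent one when the two densities meet:
r = \frac{1 - r}{K - 1} \;\Longleftrightarrow\; r(K - 1) = 1 - r \;\Longleftrightarrow\; r = \frac{1}{K}
% With K = 5 actions the threshold is r = 1/5 = 0.2.
```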

**: Is it a matter of the expert ratio used during training?**
