# T-RO reviews

Posted 2017. 12. 27. 14:53

__Reviewer 3__

A good review came in. It looks like I will need to implement a multi-task GP.

This paper proposes a method for learning from demonstration using not only optimal but also sub-optimal (or negative) demonstrations. This is done by using a new model called leveraged Gaussian Processes. The idea is in general interesting and would definitely be of interest in the learning from demonstration community.

Regarding the proposed approach, my main concern is the training procedure for the LGP. Since the weight vector l has the same dimensionality as the number of data points, the optimization **can be very challenging**, even using the leverage optimization by doubling variables. Only the experiments in Sec VII seem to employ optimization of the leverage. It might be beneficial to discuss the current optimization procedure in more detail (e.g., computational time, presence of multiple local minima), and I would also like to see more thorough experiments showing the efficiency of this training process in practice. Moreover, the proposed approach seems to be very close in spirit to **multi-task GPs**; it would be useful to discuss the differences.

Regarding the experiments, I think that a comparison to existing **multi-task GP** approaches should also be included. Another point that you mentioned in the introduction was being able to use optimal and suboptimal demonstrations jointly. However, most of the experiments make use of positive/negative examples, which is substantially different. Perhaps you could extend Sec VIIB to include more experiments with mixtures of optimal/suboptimal only demonstrations.
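Since the reviewer asks for a multi-task GP baseline, here is a minimal sketch of the kind of model in {1}, where the covariance between task l at input x and task k at input x' is K^f[l, k] * k^x(x, x'). This is only an illustration under simplifying assumptions of my own: every task is observed at the same inputs, the task-similarity matrix `Kf` is given rather than learned, and the names `se_kernel` and `multitask_gp_mean` are hypothetical.

```python
import numpy as np

def se_kernel(A, B, length=0.3):
    """Squared-exponential kernel between inputs A (n, d) and B (m, d)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length**2)

def multitask_gp_mean(X, Y, Xs, Kf, noise=1e-2, length=0.3):
    """Posterior mean for T tasks that share the same N training inputs.

    X: (N, d) inputs, Y: (N, T) outputs, Xs: (M, d) test inputs,
    Kf: (T, T) PSD task-similarity matrix (assumed fixed here, not learned).
    """
    N, T = Y.shape
    Kx = se_kernel(X, X, length)
    # Kronecker structure: block (l, k) of K equals Kf[l, k] * Kx.
    K = np.kron(Kf, Kx) + noise * np.eye(N * T)
    y = Y.T.reshape(-1)  # task-major stacking: all of task 0, then task 1, ...
    alpha = np.linalg.solve(K, y)
    Ks = np.kron(Kf, se_kernel(Xs, X, length))  # (T*M, T*N)
    return (Ks @ alpha).reshape(T, -1).T  # back to (M, T)

# Toy data: the second task is the mirror image of the first.
X = np.linspace(0.0, 1.0, 5)[:, None]
Y = np.stack([np.sin(3 * X[:, 0]), -np.sin(3 * X[:, 0])], axis=1)
Kf = np.array([[1.0, -0.9], [-0.9, 1.0]])  # strongly anti-correlated tasks
Xs = np.array([[0.1], [0.5], [0.9]])
pred = multitask_gp_mean(X, Y, Xs, Kf)
print(pred.shape)
```

A negative off-diagonal entry in `Kf` plays a role loosely similar to a negative leverage, which is presumably why the reviewer sees the two approaches as close in spirit.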

Detailed comments:

- regarding the title: nowhere in the paper is robustness measured or demonstrated. It might be more accurate if you remove the "**robust**".

- "multiple desiderata to be traded off [3]" -> It is unclear to me what the advantages are compared to multi-objective optimization. Could you elaborate?

- "a common sense tells us that directly using a set of demonstrations with mixed qualities may results in ..." -> Common sense has no place in Science. Also common sense can often be wrong. Please rephrase your argument and preferably add citations.

- "we first model a policy function of each demonstrator as a Gaussian process and demonstrations of multiple demonstrators are collected for learning a policy function" -> This sentence is unclear. I had to read it multiple times, and I am still not sure I got its meaning right. Please rephrase.

- Sec III: "existing work focuses on modeling a single Gaussian process which is fully specified by the kernel" -> this statement is incorrect. There are many works that deal with learning multiple correlated Gaussian processes. see {1} and following literature.

- Sec III: from equation (3) to the end of the subsection, could be moved to the appendix since it is not crucial for the rest of the paper.

- "model selection in Gaussian Process Regression" -> I would change this to Hyperparameters Optimization in GPR, since model selection has a slightly different meaning and might imply that you are also selecting the best kernel, which you are not.

- At the end of Sec III you mention the leverage parameter vector l, without any further explanation. Since it is the first time that it is introduced, you might want to explain a bit more about its function and dimensionality (which is the number of datapoints in your training set?), and possibly refer to the next section.

- At the end of Sec IV: "flipping the sign of the outputs of a negative training data will make the regressor be far from the negative data" -> Could you quantify far? This is something that should be discussed in more detail because I think it is important and relevant.

- Sec IV: the text corresponding to Fig. 3 states that for the stationary GP the curve is the average over 100 trials. However, for the leveraged GP (random guess), the best of 100 trials is plotted, and the leveraged GP instead requires full knowledge of the demonstrations. Overall, I think that this comparison is somewhat misleading, especially considering that optimizing the leverages seems to be a major issue with your approach (due to the high dimensionality). If you want to use random search as comparison, you should plot the mean instead of the best MLL. Also, you should compare to multi-task approaches such as {1}.

- It is unclear in Sec VI, when you talk about positive and negative data, whether this also means that the l are fixed or are learned from the data.

- In Sec VI, instead of running experiments with a fixed number of positive and negative data, could you perform a sweep across the full ratio (from 0 to 1) between the two (positive and negative) given a fixed number of data (e.g., 150 and 300)? It would help to make the experiments more consistent and insightful.

- Figure 4b is not discussed in the text. Could you elaborate on what the important information is there? Is the average minimum distance to obstacle good or bad? Arguably it might be good, if the collision rate is low...

- In Sec VIB it is unclear how the 200 negative and the 100 positive data for the LGPMC are generated. Are the 300 positive data for the GPMC the same 200+100 for the LGPMC? The origin and the optimality of the data should be discussed. Potentially it would be interesting to compare the same data, so as to show that GPMC can not deal with suboptimal data.

- In Table II it is unclear if the average minimum distance should be high or low and eventually why. A more detailed description in the text might help to better understand this result.

- Sec VIIA does not have sufficient details to understand and replicate the experiment. What is the function that generates the data?

- Fig 9, what is the label of the x axis? the leverage parameters?
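The Fig. 3 objection above (best of 100 random guesses versus a mean over 100 trials) can be illustrated with a small simulation. The scores below are stand-ins drawn from a Gaussian with mean -50 and standard deviation 5, numbers I chose purely for illustration; they are not marginal log-likelihoods from the paper, and `expected_best_of_n` is a hypothetical helper.

```python
import random
import statistics

def expected_best_of_n(n, repeats=2000, seed=0):
    """Monte Carlo estimate of the expected maximum of n random scores."""
    rng = random.Random(seed)
    bests = []
    for _ in range(repeats):
        # Stand-in for the MLL values of n random leverage guesses.
        scores = [rng.gauss(-50.0, 5.0) for _ in range(n)]
        bests.append(max(scores))
    return statistics.mean(bests)

best_of = {n: expected_best_of_n(n) for n in (1, 10, 100)}
for n, value in best_of.items():
    print(f"best of {n:3d} random guesses: {value:.1f} (mean score is -50.0)")
```

The reported "best" keeps improving as more guesses are drawn even though the guessing strategy itself never changes, which is why the reviewer asks for the mean.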

{1} Multi-task Gaussian Process Prediction, Edwin V. Bonilla, Kian Ming A. Chai and Christopher K. I. Williams

__Reviewer4__

Not very interested. Incremental?

In this paper the authors present a method for rating the training data of Gaussian process regression (GPR). This means, the GP regressor is attracted by positively rated training data, ignores training samples rated as zero and is repelled by negatively rated training data.

The paper is clearly written and precisely formulates the goal and approach of the authors in a mathematically correct way. However, I have severe concerns about the scientific contribution in the paper.

In the introduction the authors claim to present a twofold contribution, which is not true, as the first contribution, the leverage GP framework was presented in [4]. Further, the second contribution was previously presented in [5] such that the approved novelty solely concerns the method improvement for training the GP hyperparameters, including the leverage parameters. To my understanding **such a contribution is incremental**.

In the following, an outline is given to assess the paper's scientific novelty. Section III, the preliminaries, presents general GP regression characteristics and other relevant previous works.

Section IV:

A) repeats the authors' approaches in publication [4] and [5]

B) provides an alternative formulation for the GP posterior distribution

Section V:

A) almost literally adopts the formulation of [5]

B) and C) jointly provide an improved parameter optimisation method, compared to the previous approach presented in [5]. This is undoubtedly a scientific contribution of the present paper. However, the ultimate goal (as formulated in [5]), to find an optimisation using the l_0 norm instead of the l_1 norm, is omitted in those sections. Therefore, I consider the improvement **incremental** in comparison with the previous work.

Section VI presents a novel experiment for autonomous driving, besides simulation and experiment of the previous applications [4,5].

Finally, I would like to ask the authors to comment on the difference between their approach and the following: One could set the variance w (in equation (3)) of the noise parameter for each training sample separately. Further, one could allow w to take values from 0 to infinity and, for repellent samples, additionally multiply the whole kernel with the conditionally psd kernel k = -1. Then, I assume the resulting kernel would have the same properties as the proposed leveraged kernel function.
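The first half of this suggestion (a separate noise variance per training sample) is easy to try out. The sketch below is a plain GP regressor, not the leveraged GP from the paper; `gp_mean`, the kernel length scale, and the toy data are my own assumptions. It only shows that a very large per-sample variance makes the regressor ignore that sample, and it does not cover the repelling case.

```python
import numpy as np

def se_kernel(a, b, length=0.5):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length**2)

def gp_mean(x, y, xs, noise_var):
    """GP posterior mean with one noise variance per training sample."""
    K = se_kernel(x, x) + np.diag(noise_var)
    return se_kernel(xs, x) @ np.linalg.solve(K, y)

x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, -3.0, 1.0])  # the middle sample disagrees with its neighbours
xs = np.array([1.0])            # predict at the location of the middle sample

trusted = gp_mean(x, y, xs, np.array([1e-4, 1e-4, 1e-4]))
ignored = gp_mean(x, y, xs, np.array([1e-4, 1e8, 1e-4]))
dropped = gp_mean(x[[0, 2]], y[[0, 2]], xs, np.array([1e-4, 1e-4]))
print(trusted, ignored, dropped)
```

With a huge variance on the middle sample the prediction matches the one obtained by deleting it, which corresponds to the "rated as zero" case this reviewer describes in the summary.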

__Reviewer5__

Just a nitpicker.

The paper is a continuation of the authors' prior work on learning from demonstrations with negative examples ([4], [5] in the paper). [4] introduced the 'leverage' Gaussian Process so that GP regression fits functions that avoid samples with negative leverage. While [4] assumed the 'leverage' (defined as a score in [+1, -1] indicating if the sample is positive or negative) to be known, [5] extends [4] by treating the leverage as a hyper-parameter that is optimized for. This paper modifies the optimization routine of the leverage variables and demonstrates on two domains (one being completely new) that the algorithm better identifies the quality of the demonstrations.

My main criticism of the paper is the **lack of discussion on the required hypotheses to correctly infer the demonstrators' quality** and the lack of experiments that would push the method to its limits, showing instances where the algorithm wrongly identifies the quality of a demonstrator. The main assumption of the paper ensuring proper identification of the quality of the demonstrations is that good demonstrations are much more common in the data set. But how much is much more? Is it problem dependent? What happens if it's 50/50? Also, my guess is that not only do bad demonstrations need to be the exception rather than the rule, but they should also overlap in state space with sufficiently many good demonstrations to be identified as bad. Does this explain why such a high number of demonstrations was provided in the driving experiment? How would this scale in problems with higher dimensional state spaces? A discussion along these lines would make prospective users of the method much more confident about the results to expect.

Some sections of the paper should be rectified, clarified or expanded:

- In the intro you say 'Furthermore, a common sense tells us that directly using a set of demonstrations with mixed qualities may result in unexpected or disastrous outcomes'. My common sense tells me on the contrary that the learned policy cannot be worse than the worst demonstration. You should clarify (for instance by giving an example) in which situation mixing demonstrations would result in disastrous outcomes.

- The related work should be better organized, for example by adding subsections. In the current state, some paragraphs seem redundant (e.g. 3 with 7 that both introduce methods using negative examples).

- At the first mention of LfD methods able to learn from negative or unlabeled data, you should not limit yourself to citing [4].

- The self-citations [17] and [18] are hardly if at all related to LfD and thus seem out of place in the related work section.

- [29] is missing from the table

(!) Related work is described but not compared to the present work. A discussion indicating in which settings this work is a competitive alternative to the state-of-the-art would be appreciated.

- In Sec 3.A, you introduce the mean of the GP prior but never mention it afterwards. You should at least mention that it is assumed to be 0.

(!) Theorem 2 is never referenced in the text after its definition.

- I do not understand why the proof of Thm 2 does not end at Eq. (5). To me, Thm 2 is rather a corollary of Thm 1 and follows directly from it.

- The provided proof of Thm 1/2 should either be omitted or expanded and moved to an appendix. Provide the definition of a semi-definite function. Justify why only the difference between x_n and x_m is considered. Optionally, you can also explain that the transition to the last line comes from the (x - y)(x + y) = x^2 - y^2 identity.

- Imaginary unit is i initially then becomes j. Do you mean a_m instead of a_m^*? On the last line of the proof, | . | is more common than || . || for the absolute value of a complex number.

- 3.B: K(X,X) used instead of the previous notation k(X,X).

- Sec. 4.B is very informative and is a nice addition

- Sec. 5.A: it is hard to appreciate the problem without more details on your definition and instantiation of correlated GPs.

- Sec. 5.B:

(!) What do you mean by sparsity when you say '\delta ensures the level of sparsity' after (17)? This is a hand-wavy comment, especially since 1 is subtracted within the l1-norm, which encourages l to contain a high number of ones and not zeros as is usually understood by sparsity.

- Is the dimensionality of l_i m and not n in (18)? Similarly, in (17) you use 1_n, in (18) 1_m in the objective, and you omit the index in the first constraint of (18).

- What has become of the noise in the definition of K?

- In the correlated fields experiment, it would be informative to know the ratio of highly/poorly correlated functions.
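The sparsity comment on (17) above can be checked numerically. Assuming the penalty has the form \delta * ||l - 1||_1 (my reading of the reviewer's remark, not a quote from the paper), its proximal operator is soft-thresholding centred at 1, so entries close to 1 snap exactly to 1 rather than to 0. The function name `prox_l1_shifted` is hypothetical.

```python
def prox_l1_shifted(v, delta, center=1.0):
    """Proximal operator of delta * ||x - center||_1, applied entry by entry."""
    out = []
    for vi in v:
        r = vi - center
        # Shrink the distance to the center by delta, clipping at zero.
        shrunk = max(abs(r) - delta, 0.0)
        out.append(center + (shrunk if r > 0 else -shrunk))
    return out

leverage = [0.9, 1.05, 0.2, -1.0]
result = prox_l1_shifted(leverage, delta=0.15)
print(result)
```

Entries within `delta` of 1 become exactly 1 and all others move towards 1 by `delta`, so under this assumed form the vector that ends up sparse is l - 1, not l itself.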

The multimedia file synthesizes the work well and provides qualitative comparisons of the algorithms.

__Reviewer6__

This one is the problem... They say too little has changed.

Brief Summary
-------------

The article presents an algorithm for learning from demonstration (LfD) with multiple demonstrators. While LfD usually assumes that the demonstrations come from experts, the proposed approach exploits leveraged Gaussian process (LGP) regression to overcome this limitation, allowing learning from demonstrations of mixed quality. LGP uses an additional set of (hyper) parameters to discriminate between positive and negative samples (i.e., samples from which the regressor tries to stay away). The intuitive idea behind LGP is to be able to automatically give different importance (leverage) to samples coming from real experts and the ones that come from inexperienced users. This problem can be interpreted as a supervised learning problem with unlabeled data that is here solved by means of an optimization with sparsity constraint.

The optimization problem presented in [5] (based on proximal linearized mapping) suffers from two major drawbacks: 1) scalability of the optimization problem and 2) difficulty in dealing with negative demonstrations. While the first issue is solved by reducing the number of leverage parameters from one per trajectory to one per demonstrator, the second one is solved by rewriting the L1 sparsity constraint by using the "doubling variable" trick.
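For reference, here is a minimal sketch of what a "doubling variable" reformulation of an L1 term usually looks like, assuming it means the standard split x = p - q with p, q >= 0 so that ||x||_1 becomes the linear term sum(p + q). The paper's actual formulation may differ. The toy objective 0.5 * ||x - v||^2 + delta * ||x||_1 and the names `solve_by_doubling` and `soft_threshold` are my own choices; the toy problem was picked because its exact solution is known in closed form.

```python
def soft_threshold(v, delta):
    """Closed-form minimizer of 0.5*||x - v||^2 + delta*||x||_1."""
    return [max(abs(t) - delta, 0.0) * (1.0 if t > 0 else -1.0) for t in v]

def solve_by_doubling(v, delta, step=0.5, iters=200):
    """Solve the same problem with x = p - q, p >= 0, q >= 0.

    The non-smooth L1 term turns into the linear term delta * sum(p + q),
    leaving a smooth objective with simple non-negativity constraints,
    handled here by projected gradient descent.
    """
    n = len(v)
    p = [0.0] * n
    q = [0.0] * n
    for _ in range(iters):
        for i in range(n):
            g = (p[i] - q[i]) - v[i]  # gradient of the quadratic part w.r.t. x_i
            p_new = max(p[i] - step * (g + delta), 0.0)   # d/dp = g + delta
            q_new = max(q[i] - step * (-g + delta), 0.0)  # d/dq = -g + delta
            p[i], q[i] = p_new, q_new
    return [pi - qi for pi, qi in zip(p, q)]

v = [2.0, 0.3, -1.0]
x_doubled = solve_by_doubling(v, delta=0.5)
x_exact = soft_threshold(v, delta=0.5)
print(x_doubled, x_exact)
```

The price of the trick is twice as many variables, in exchange for an objective that is differentiable everywhere on the feasible set.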

Comments
--------

I want to start discussing the novelty of the paper. I **carefully checked the papers** previously published by the authors (i.e., [4] and [5]) and I found that several sections (both theoretical and empirical) of the journal are a copy and paste of previous articles:

- leveraged Gaussian process (Sec. IV.A)

- Optimization (Sec. V.A)

- Learning from Demonstration Experiment (Sec. VI)

Besides an extended discussion about LGP (Section IV-B), the main (new) contribution is the reformulation of the optimization process by doubling variables (Section V.B). There is also the reduction of the number of optimization parameters that is part of the new algorithm.

I checked the T-RO information for authors and it is hard for me to say if the paper is compliant with the T-RO policy, I just pointed out this issue to the editor.

I found the paper technically sound. All the ideas (already reviewed or new) are well explained and the empirical section covers several case studies. However, I think that a few things can be improved.

Concerning "Related Work" I think that it is worth mentioning the papers about LfD with multiple intents, e.g.,

A) Babes, Monica, Marivate, Vukosi N., Subramanian, Kaushik, and Littman, Michael L. Apprenticeship learning about multiple intentions. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011

B) Choi, Jaedeug and Kim, Kee-Eung. Nonparametric bayesian inverse reinforcement learning for multiple reward functions. In Advances in Neural Information Processing Systems 25 (2012)

C) Dimitrakakis, Christos and Rothkopf, Constantin A. Bayesian multitask inverse reinforcement learning. In EWRL, volume 7188 of Lecture Notes in Computer Science, pp. 273–284. Springer, 2011

While (A)-(B) explicitly focus on multiple reward functions, (C) presents a Bayesian approach that is able to generalize between multiple-reward and multiple-expert scenarios. I think that it is worth at least mentioning these approaches alongside the main differences (e.g., they belong to the IRL category).

Another point that can be better investigated is the difference between IRL and BC. In particular, several recent approaches in the RL community have tried to exploit the synergy between BC and IRL (e.g., by mixing a step of BC with IRL):

- Matteo Pirotta and Marcello Restelli. Inverse reinforcement learning through policy gradient minimization. In AAAI, pages 1993-1999, 2016.

- Jonathan Ho, Jayesh K. Gupta, and Stefano Ermon. Model-free imitation learning with policy optimization. In ICML, volume 48 of JMLR Workshop and Conference Proceedings, pages 2760-2769. JMLR.org, 2016.

- Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In NIPS, pages 4565-4573, 2016.

Concerning the DVM algorithm, it relies on the information about the number of demonstrators (to define \bar{l}). In the experiments this parameter was known (Sec. VII.A) or fixed (Sec. VII.B). However, it would be interesting to evaluate how this parameter affects the overall performance of the method. For example, you can collect a fixed number of trajectories and let the number of demonstrators vary. I think that in many applications this information is unknown.

Concerning the experiments, I think that there is a lack of comparison of the proposed approaches with the state-of-the-art. As mentioned in the "Related Work" there are other algorithms that can handle the case of multiple demonstrators. Even if they are IRL methods, they are able to recover (directly or indirectly) the expert's policy. It would be interesting to compare the results.

Minor comments and typos
------------------------

- Page 4 (below Fig. 2): I think that l=-1 (and l=1) should be replaced by l_j=-1 (l_j=1) because it refers to a specific sample

- Eq. (7) \gamma -> l

- I think that Hadamard product is commonly used to denote the component-wise product (rather than Hadamard operator)
