
Reviews that I got from IROS 2016

1. Leveraged IRL

The paper presents a method that allows using trajectories from suboptimal demonstrators for Inverse Reinforcement Learning. The method is interesting and tackles an important and under-researched problem. A useful characteristic is that the quality of unlabelled trajectories can be inferred from labelled ones. The experimental results are good, showing that using negative demonstrations in this way works.

The background section is well written but slightly quirky. As an example, Maximum Entropy IRL is most commonly called MaxEnt or MaxEnt IRL, not MEIRL; similarly for MEDIRL (referred to as DeepIRL in the originating paper). The part on NPBFIRL (which should be BNP-FIRL) is summarized by a sentence from its abstract, and there is no attempt to relate it to what this paper is about.

There is a sentence on MESSIRL, but the paper does not make clear why this method is different. MESSIRL thus looks like the main candidate to be compared against, but this is not the case. An explanation is necessary.

The experimental validation could also be improved by including real-valued proficiencies. While the authors point out that the method can handle these, the experiments only test the method at the extremes of the proficiency range rather than at intermediate values. The authors do mention the problem of a demonstration containing some desirable behaviour within a failed trajectory. It is worth testing whether using intermediate-proficiency experts, which are more likely to mix desirable and undesirable behaviour, makes learning more difficult.

Overall the paper is interesting and well written, and the method is promising. The evaluation is sufficient to show that this is a valuable contribution, but I would like to see the inclusion of real-valued proficiencies.

The authors propose a method for IRL that incorporates "both positive and negative" example trajectories. These trajectories are assumed labeled with a "proficiency score" although the authors propose a relaxation allowing for a mixture of labeled and unlabeled trajectories. The authors use leveraged Gaussian processes (their prior work) to perform regression over these trajectories, yielding an estimated immediate reward function.
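To make the setup concrete, here is a minimal sketch of reward regression from scored trajectories. This is my own simplification and not the paper's leveraged-GP machinery: each visited state simply takes its trajectory's proficiency as a regression target, so expert states pull the estimated reward up and failed-demonstration states push it down. The 1-D state space, RBF kernel, and noise level are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, length=0.5, sigma=1.0):
    # Squared-exponential kernel between the row vectors of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma ** 2 * np.exp(-0.5 * d2 / length ** 2)

def reward_from_demos(states, proficiency, query, noise=1e-2):
    """GP regression that uses each trajectory's proficiency as the target
    for the states it visits: +1 pulls the estimated reward up, -1 pushes
    it down, and values near zero contribute little."""
    K = rbf_kernel(states, states) + noise * np.eye(len(states))
    alpha = np.linalg.solve(K, proficiency)
    return rbf_kernel(query, states) @ alpha  # posterior mean reward

# Toy usage: an expert lingers near state 1.0, a failed demo near -1.0.
demo_states = np.array([[1.0], [0.9], [-1.0], [-0.8]])
demo_prof = np.array([+1.0, +1.0, -1.0, -1.0])
query = np.linspace(-1.0, 1.0, 5)[:, None]
print(reward_from_demos(demo_states, demo_prof, query))
```

The estimated reward comes out high near the expert states and low near the failed ones, which is the intuition the reviews refer to when discussing positive and negative examples.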

Although the related work could be expanded, the paper is novel, technically interesting, and tackles an important problem: leveraging imperfect data to assist in learning. Although the authors do not consider the situation where proficiency scores are noisy, their experimental results are compelling and they present a convincing case for their technique's efficacy.

Detailed Comments:

The proficiency label on trajectories is essentially a trajectory score and this problem can also be conceptualized as learning from scored trajectories. The authors neglect some related work in this area including "Score-based Inverse Reinforcement Learning", "Distance Minimization for Reward Learning from Scored Trajectories", and "Active Reward Learning".

My main concern with this paper is the proficiency score. While the authors examine learning with different ratios of good and bad demonstrations, they do not analyze the behavior of their approach with noisy proficiency scores. In practice, the quality of these scores may vary as someone (presumably a human) needs to determine at least some of these scores by hand. Addressing this would increase the impact of the paper.

This paper addresses an inverse reinforcement learning problem in which demonstrations include not only positive examples from experts but also examples of incorrect behavior. The approach is a probabilistic framing of IRL and proposes a Gaussian process generative model in which each trajectory is associated with a demonstrated proficiency between -1 and +1, where, at the extremes, +1 indicates a true expert and -1 indicates a demonstrator performing the worst-case behavior. The model parameters are learned using a gradient ascent approach. Results are presented on two domains and compared to many existing IRL algorithms, including those that only use positive examples and algorithms that include both positive and negative examples.
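The gradient-ascent learning of model parameters can be illustrated with a generic Gaussian process example. The sketch below is not the paper's actual objective, which also involves the proficiency model; it performs gradient ascent on the standard GP log marginal likelihood over a single RBF length-scale, with a finite-difference gradient for brevity, and the data, learning rate, and step count are assumptions.

```python
import numpy as np

def log_marginal_likelihood(X, y, log_length, noise=1e-2):
    # Standard GP log marginal likelihood with a unit-variance RBF kernel.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * d2 / np.exp(2.0 * log_length)) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.log(np.diag(L)).sum()
            - 0.5 * len(y) * np.log(2.0 * np.pi))

def fit_length_scale(X, y, steps=300, lr=0.02, eps=1e-4):
    # Gradient ascent on the log marginal likelihood over one hyper-parameter,
    # using a finite-difference gradient and clipping to keep the toy stable.
    log_length = 0.0
    for _ in range(steps):
        grad = (log_marginal_likelihood(X, y, log_length + eps)
                - log_marginal_likelihood(X, y, log_length - eps)) / (2.0 * eps)
        log_length += lr * np.clip(grad, -10.0, 10.0)
    return np.exp(log_length)

# Toy usage: noisy samples of a smooth 1-D function.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(30, 1))
y = np.sin(2.0 * X[:, 0]) + 0.05 * rng.standard_normal(30)
print("fitted length-scale:", fit_length_scale(X, y))
```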

Overall I think this is a fairly good paper. The paper is easy to follow, has a good related work summary, and includes a lot of different baselines in the empirical results. However, I think there could be more explanation of the empirical results. As it stands now, it's mostly a data dump with a very large number of values. It would be nice to have a discussion about why certain algorithms didn't do as well as LIRL, especially algorithms that support both positive and negative examples. Additionally, many of the IRL algorithms appear to do quite poorly, even when given many demonstrations. A discussion about why they do so badly would be very helpful, especially because my expectation would be that all of the IRL algorithms would eventually do pretty well when given enough data. Is this an issue with reward functions being linear for some of them when the demonstrations require non-linear reward functions?

The paper also introduces a method to estimate expert proficiency when it is only observable for a subset of the demonstrations. However, from what I can tell, no empirical results exist for that approach. It might then be worth cutting it from the paper to save space for a larger discussion, or only mentioning it in passing in a future work discussion.

Finally, the paper also discusses the ability to handle multiple expert proficiencies across the -1 to +1 spectrum, but it's unclear to me when this would actually be useful, and the empirical results do not include any examples. That is, from what I can tell, a negative proficiency value effectively inverts the goal reward function (so that at -1 it motivates the opposite of what is good and bad). If a bad demonstration is meant to truly indicate "never do this", this interpretation makes sense, but when, then, is an intermediate value useful? What is the interpretation of a near-zero proficiency? There is a plot earlier showing samples with different proficiencies, but it's hard to discern what it really means. Is a near-zero proficiency indicating more random behavior, or is it motivating behavior toward intermediately valued states, or something else?

The authors may also want to consider the relation to work on GP-based IRL by Englert et al. (2013), "Model-based Imitation Learning by Probabilistic Trajectory Matching," ICRA.


2. Occupancy Flow

The authors propose an occupancy flow strategy for estimating vehicle velocity and predicting future occupancy in dynamic environments. The problem is very interesting and important, since we will have more and more unmanned ground, air, and maritime vehicles driving around. This is a fine conference paper. I have the following suggestions:

(1) The authors may want to separate the motion estimation problem from the occupancy prediction problem; this would make the evaluation clearer. So far, these two problems are somewhat tangled.

(2) As the authors mention, conventional optical flow algorithms such as Lucas-Kanade (L-K) only generate the relative motion field between the vehicle and the surrounding environment, so it is not fair to use them for the prediction part.

(3) The results in Fig. 8 should be similar to structure estimation results from optical flow algorithms like L-K; the occupancy prediction part is not well demonstrated in this example.

(4) More literature on optical-flow-based navigation should be cited.
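For context on the occupancy prediction part, here is a toy sketch of predicting a future occupancy grid from a per-cell velocity estimate. It illustrates the general idea rather than the authors' occupancy flow algorithm, and the grid size, velocity field, and time step are assumptions.

```python
import numpy as np

def predict_occupancy(occ, vel, dt=1.0):
    """Propagate an occupancy grid forward by shifting each cell's mass
    along its estimated velocity. occ: (H, W) with values in [0, 1];
    vel: (H, W, 2) giving (row, col) velocities in cells per step."""
    H, W = occ.shape
    pred = np.zeros_like(occ)
    for r in range(H):
        for c in range(W):
            if occ[r, c] <= 0.0:
                continue
            nr = int(round(r + vel[r, c, 0] * dt))
            nc = int(round(c + vel[r, c, 1] * dt))
            if 0 <= nr < H and 0 <= nc < W:
                pred[nr, nc] = min(1.0, pred[nr, nc] + occ[r, c])
    return pred

# Toy usage: a single occupied cell moving one column to the right per step.
occ = np.zeros((5, 5))
occ[2, 1] = 1.0
vel = np.zeros((5, 5, 2))
vel[2, 1] = (0.0, 1.0)
print(predict_occupancy(occ, vel))
```

Separating the velocity estimation from this prediction step is exactly the kind of decoupled evaluation that suggestion (1) asks for.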


3. Gaussian Random Path

This paper proposes a motion planning method using Gaussian processes, which are a common tool. The performance of the proposed method is shown by comparison with a look-ahead planner (LAP), and the authors state that the proposed method can be effectively used as an initialization step for trajectory optimization methods.

However, the contribution of the paper does not stand out, due to the lack of explanation of related methods in the introduction or elsewhere. Because local planning is regarded as a common problem that has been covered many times before, the paper would be better if it described the related work. In the reviewed version, the introduction did not show the necessity of the proposed method.

As an initial trajectory for trajectory optimization, the proposed method outperforms the linearly interpolated initialization (LI) method in extreme cases. However, some curiosity remains about the effectiveness of the proposed method, because linear interpolation is not a strong local planner to begin with. Also, in simple cases, the LI method shows better performance with cross-entropy optimization methods, and this raises more questions about the effectiveness of the proposed method as an initialization for trajectory optimization.

It would be better to describe in more detail the logical reasons why LI and LAP were chosen as comparison methods.

This paper presents the "Gaussian random paths" method to generate a set of paths passing through anchoring points using Gaussian process regression. For implementation, an "\epsilon run-up method" is proposed to make the sampled paths trackable with respect to a specific unicycle model. A GRP distribution suitable for unicycle models can be further learnt by optimizing the hyper-parameters of the Gaussian process to maximize the average log likelihood of collected paths with unicycle dynamics. Simulation and experimental studies were conducted on the local path planning problem with the proposed GRP method and look-ahead planners (LAP). The proposed GRP method can also be used as an initialization step for some existing trajectory optimization approaches.
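To illustrate the core construction being reviewed, below is a sketch of drawing smooth random paths through anchoring points with GP regression. The kernel, length-scale, and jitter are my own choices; the paper's GRP formulation, the \epsilon run-up method, and the unicycle-specific hyper-parameter learning are not reproduced here.

```python
import numpy as np

def rbf(a, b, length=0.3):
    # Squared-exponential kernel over scalar time indices.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def gaussian_random_paths(t_anchor, x_anchor, t_query, n_paths=5, noise=1e-6):
    """Sample smooth 1-D paths that (almost) pass through the anchoring
    points by drawing from a GP posterior conditioned on them."""
    K = rbf(t_anchor, t_anchor) + noise * np.eye(len(t_anchor))
    Ks = rbf(t_query, t_anchor)
    Kss = rbf(t_query, t_query)
    Kinv = np.linalg.inv(K)
    mean = Ks @ Kinv @ x_anchor
    cov = Kss - Ks @ Kinv @ Ks.T
    # Symmetrize and add jitter for numerical stability before sampling.
    cov = 0.5 * (cov + cov.T) + 1e-6 * np.eye(len(t_query))
    return np.random.default_rng(0).multivariate_normal(mean, cov, size=n_paths)

# Toy usage: start, via, and goal values for one coordinate of the path.
t_anchor = np.array([0.0, 0.5, 1.0])
x_anchor = np.array([0.0, 0.8, 0.0])
paths = gaussian_random_paths(t_anchor, x_anchor, np.linspace(0.0, 1.0, 50))
print(paths.shape)  # (5, 50): five sampled paths over 50 time steps
```

For a 2-D path, the same conditioning would be applied to each coordinate independently.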

Overall, the paper was well presented and the results are good. The provided video shows the effectiveness of the proposed method. However, the reviewer has some questions about the details.

* In section III.A, the authors suggest that \epsilon be set small, but how small should it be? What are the consequences of choosing different sizes of \epsilon?

* In section III.B, how will different exemplar trajectories affect the learning results? In other words, how did you decide the pattern of the exemplar trajectories? How many exemplar trajectories are needed to train the GRP?