
Reviews I got from RSS 2017


Review 1


The paper proposes a new approach to inverse reinforcement learning (IRL). The authors take inspiration from the average-reward formulation of the RL problem to formulate the objective of maximizing the inner product between the reward function and the stationary state-action distribution. The resulting algorithm is simple yet powerful enough to compete with existing methods. The authors establish the sample complexity of their algorithm and provide empirical comparisons to the main existing algorithms that show their method is competitive and outperforms baselines when the underlying reward function is non-linear. The method explicitly addresses problems where the transition model is not known and the state and action spaces are continuous, which is important for applying such methods to robot systems.


- I like the straightforwardness of the approach

- The proposed kernel method for dealing with non-linearity yields a convex problem with a closed-form solution

- The scenario where no model is available and states and actions can be continuous is important for applying such methods to real robot systems


- I've learned about a simple yet potentially powerful method for IRL

- Leveraged kernel functions were an interesting new concept for me


Fair (a paper that is on its way to making a good contribution but not there yet)


TECHNICAL QUALITY
The authors provide a sample complexity bound. I didn't find faults in the proofs. However, Assumption 1, that the samples are independent, seems questionable for samples along trajectories in MDPs. A brief discussion of how this could affect the results should be added to the paper.

The authors discuss all prominent relevant work that I know of. In the fourth paragraph of Section I, [16, 12, 8] are mentioned as IRL methods that can deal with continuous state-action spaces and do not solve an MDP as a subroutine. [3] should be added to this list.

The method that the authors propose seems novel and interesting. I have one main concern regarding the simplifying assumption about the Bellman flow constraint. It seems this assumption can easily break the method. For example, in a driving scenario, let's say a vehicle in a congested city is stuck in traffic 90% of the time and on a quiet road 9% of the time. Only in a small fraction of cases (1%, say) is the driver at a point where he can choose between a busy and a quiet road. The estimated rewards would then be roughly 0.9 for driving in traffic, 0.09 for driving on a quiet road, 0.01 for choosing the quiet road at the intersection, and 0 for choosing the busy road at the intersection. Planning ahead for two or more steps would make the vehicle choose the more congested road, if I'm not mistaken. What are the limitations on the kind of problems where the algorithm would do well? Should the reward function be quite similar to the value function, for example? This is not clear from the paper.
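To make the arithmetic in this example concrete (my own illustration, using the density-derived estimates above), a two-step lookahead at the intersection compares

```latex
\underbrace{0.01}_{\text{choose quiet}} + \underbrace{0.09}_{\text{quiet road}} = 0.10
\quad\text{vs.}\quad
\underbrace{0.00}_{\text{choose busy}} + \underbrace{0.90}_{\text{in traffic}} = 0.90,
```

so a planner that maximizes the estimated reward would indeed steer toward the congested road.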
Smaller comments and suggestions:
- The authors propose a kernel density estimation step. KDE can perform badly in high-dimensional domains. I think using kernel embeddings of the distribution could improve results, and it has a nice synergy with the embedding of the reward function in the RKHS. For example, the approximation of the inner product in (7) would become extremely straightforward and more elegant (see the sketch after this list). E.g., "Kernel Embeddings of Conditional Distributions" by Song et al. provides an accessible introduction to this topic.
- For that approximation in (7), it seems the inducing points should be chosen homogeneously for (7) to be unbiased. Another approach could be to use an average of R over samples from the steady-state distribution for an unbiased approximation. Could a brief description of how the inducing points are chosen be included in the paper?
- I'm not really sure how (11) corrects for the bias towards the initial state distribution. Is this a heuristic or a principled method? In either case, in MDPs the transient behavior is actually of interest. By modifying the MDP to include resets to the initial distribution with constant probability in each time step (analogous to discount factors), transient behavior becomes part of the steady-state distribution and episodic roll-outs can be used directly to estimate the relevant quantities (see the identity sketched after this list). See e.g. Vanhoof et al., "Learning of Non-Parametric Control Policies with High-Dimensional State Features".
- Adding the squared norm on alpha in addition to the Hilbert space norm feels a bit hacky. Is this necessary because some of the entries in K are very low if the leveraged kernel is used? (See also the previous point.)
- In the last paragraph of Section VI, a potential combination with policy search methods is mentioned. There could be a nice synergy with non-parametric policy search methods such as the work by Bagnell and Schneider (2003), Lever and Stafford (2015), Vien et al. (2016), or the previously mentioned Vanhoof paper.
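To spell out the kernel-embedding suggestion above (a rough sketch in my own notation, not the paper's): if the reward is represented in the RKHS as R = sum_j alpha_j k((s_j, a_j), .) and the demonstrated state-action distribution is summarized by its empirical kernel mean embedding, the objective in (7) reduces to a finite double sum, with no density estimate or inducing-point quadrature needed:

```latex
\hat\mu = \frac{1}{N}\sum_{i=1}^{N} k\big((\tilde s_i,\tilde a_i),\cdot\big),
\qquad
\langle R, \hat\mu\rangle_{\mathcal H}
= \frac{1}{N}\sum_{i=1}^{N}\sum_{j}\alpha_j\,
  k\big((s_j,a_j),(\tilde s_i,\tilde a_i)\big).
```

Likewise, for the reset idea: if every time step resets to the initial state distribution with probability 1 - gamma, the stationary distribution of the modified chain is exactly the discounted visitation distribution

```latex
\mu_\gamma(s,a) = (1-\gamma)\sum_{t\ge 0}\gamma^{t}\,\Pr(s_t = s,\; a_t = a),
```

which episodic roll-outs estimate directly, so the transient behavior is no longer thrown away.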


SIGNIFICANCE OF RESULTS
The proposed method is evaluated both in grid worlds and on a larger racetrack problem with continuous states and actions. The method is compared to multiple state-of-the-art methods. The experiments show that the method works on a variety of non-trivial problems.

Most experiments consider non-linear reward functions. As such, most of the methods compared against perform very poorly. It is good to see that the proposed method can naturally handle such non-linearities. However, it does seem that only the linear experiment is a fair comparison to the baseline methods, and on large, linear problems the proposed method does not perform better than most baselines. But that comparison isn't quite fair to the proposed method either, as it can't capitalize on the problem's linearity. I would strongly suggest the authors either evaluate their method with a linear kernel on the problems where the reward function is actually linear, or use the kernel vector as features so that the other methods can find a reward function in the same space as the proposed method (i.e., just treat the alpha's in (8) as parameters to be learned, even if that is a bit of a heuristic; see the sketch below).
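To spell out the second option (my notation, not the authors'): with inducing points x_1, ..., x_m and x = (s, a), the baselines could be handed the kernel vector as their feature map,

```latex
\phi(x) = \big(k(x, x_1), \dots, k(x, x_m)\big)^{\top},
\qquad
R_\theta(x) = \theta^{\top}\phi(x),
```

so that a nominally linear IRL method searches the same function class as (8), with theta playing the role of the alpha's.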

- I'm curious why KDMRL is so much faster than GPIRL. Does your implementation of GPIRL also use inducing inputs? If so, shouldn't the slowest part of either method be the inversion of the kernel matrix, so that both run in roughly similar time?
- The paper states that the features are only numerically differentiable with respect to states and actions. It seems that for such a simple system, analytic gradients should be computable, although I do agree this is a limiting factor in experiments where the model is not known.

CLARITY
The paper is overall well-written and clear. I didn't find many errors in notation, spelling, or grammar. For some suggestions, see the minor comments below.

SUPPLEMENTARY MATERIAL
As supplementary material, proofs are provided as well as a video that shows the results on the racetrack task. I commented on the proofs under 'technical quality'. The video is a good illustration of the task as well as the solutions found by various algorithms. However, it doesn't always allow enough time for the viewer to read all the text shown. I would either leave some material out or make the video longer, if possible. 

MINOR COMMENTS
- The paper mentions 'the reproducing kernel Hilbert space' at some points. I think 'a reproducing kernel Hilbert space' is meant (the one defined by the particular choice of kernel).
- page 1: "most of IRL methods" -> "most IRL methods"
- I'm not sure what is meant by "without requiring a mapping between the feature space and the state action space" (page 3)
- page 4: "controls the smoothness of the reward function" -> functions with a low Hilbert space norm can be very non-smooth. I would suggest "complexity" instead of "smoothness".
- "is a convex quadratic programming" -> quadratic program.
- "resulting four different sets experiments" -> "resulting in four different experiments" or "resulting in four different sets of experiments"


Review 2


This paper uses (kernelized) density estimation from observed trajectories to select an imitation trajectory from a set of short, randomly sampled trajectories. The approach performs well compared to reward estimation methods (inverse reinforcement learning) on a set of experiments.

Addressing continuous state-action spaces and incorporating high-dimensional feature representations, as this paper does, is important for imitation learning.

The results of this paper suggest that for many decision process settings, long-term planning is not necessary and short-term search heuristics can be applied effectively.

Poor (a paper that needs significant work)

My main criticism is that characterizing the proposed approach as “reward estimation” seems to be inappropriate. The “reward” being estimated is simply a density estimate. It is easy to construct examples where high-reward states are visited rarely by an optimal policy simply due to the stochasticity of the decision process’s state transition dynamics. The “reward” estimated then would have no relationship at all to the reward function motivating the behavior, unlike inverse reinforcement learning (IRL) methods. As a result of this, I find the analysis misleading since, unlike IRL, there is no relationship between the estimated reward function and the unknown reward function assumed to motivate observed behavior. Perhaps there is an unstated claim that state-action densities should transfer across MDPs in the same manner that reward functions transfer. Unfortunately, this is not true in general and these densities will depend heavily on the state-transition dynamics, making transfer of densities much less useful than transfer of reward functions.
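To make this concrete, here is a minimal example of my own construction (not from the paper) in which the optimal policy rarely visits the highest-reward state, so a visitation density read as a "reward" inverts the true ordering:

```python
# Hypothetical two-state MDP (my construction, not the paper's): from s0 the only
# sensible action, "try", reaches the high-reward state s1 with probability 0.01
# and otherwise stays in s0; from s1 the agent returns to s0 deterministically.
import numpy as np

p_win = 0.01
# State-transition matrix over [s0, s1] under the optimal policy (always "try").
P = np.array([[1.0 - p_win, p_win],
              [1.0,         0.0]])

# Stationary distribution: the left eigenvector of P with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi = pi / pi.sum()

true_reward = np.array([0.0, 10.0])
print("stationary distribution:", np.round(pi, 3))  # ~[0.99, 0.01]
print("true reward:            ", true_reward)      # [ 0. 10.]
# A density estimate used as a "reward" ranks s0 far above s1: the opposite of
# the ordering induced by the reward that actually motivates the behavior.
```

Here the stochasticity of the transition dynamics, not the reward, dictates how often each state is visited, which is exactly why the density and the reward decouple.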

My second major criticism relates to the experimental design. First, given my previous point, comparisons to behavioral cloning methods would be far more appropriate for the proposed approach. I am curious whether a (kernelized) logistic regression approach would outperform the proposed method because it could be trained to discriminatively select the demonstrated trajectory from a larger set of trajectories. Second, the features employed in the second set of experiments seem especially unfavorable to IRL methods. Indeed, a feature indicating a “collision” within the IRL method would eliminate these negative outcomes at testing time. Similarly, features of following distance thresholds (i.e., between two discrete values) [16] would likely realize the tailgating behavior.
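A sketch of the kind of behavioral-cloning baseline I have in mind (the data shapes, feature dimension, and hyperparameters below are placeholders of my own, not anything specified in the paper):

```python
# Kernelized logistic regression as a behavioral-cloning / trajectory-scoring
# baseline: discriminate demonstrated state-action pairs from randomly sampled
# ones, then pick the candidate trajectory whose pairs look most "demonstrated".
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 8))          # placeholder: demonstrated (s, a) features
X_rand = rng.normal(size=(2000, 8)) + 0.5   # placeholder: randomly sampled (s, a) features
X = np.vstack([X_demo, X_rand])
y = np.concatenate([np.ones(len(X_demo)), np.zeros(len(X_rand))])

# Approximate an RBF kernel with random Fourier features, then fit the classifier.
clf = make_pipeline(RBFSampler(gamma=1.0, n_components=500, random_state=0),
                    LogisticRegression(max_iter=1000))
clf.fit(X, y)

# Score each candidate short trajectory by the summed probability that its
# state-action pairs come from the demonstrations, and execute the best one.
candidates = rng.normal(size=(10, 2, 8))    # 10 candidates x 2 steps x 8 features
scores = [clf.predict_proba(traj)[:, 1].sum() for traj in candidates]
best = int(np.argmax(scores))
```

If such a purely discriminative scorer matches the proposed method on these tasks, that would support reading the method as imitation rather than reward estimation.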

Some of the experimental details raise additional specific questions. First, since sampled trajectories are 0.8s long and controls are only sampled every 0.4s, only 2 sequential actions are actually being considered. Why? Again, this seems to suggest behavioral cloning rather than estimation of meaningful rewards. Differences in relative performance between the 16x16 and 32x32 gridworld experiments are also not well explained.

In summary, while there is merit in exploring simple methods for imitation learning, I think this paper misses the mark in framing the method as reward estimation and comparing with IRL methods rather than behavioral cloning techniques.


Review 3


The authors propose an interesting and effective analytical method for uncovering a reward function from demonstrated behavior that empirically shows strong performance on continuous control problems.

I agree that there's room for new formulations of the IRL problem-- it's easier to take existing formulations and try to devise better algorithms for them, but this paper does a good job of breaking out of that box.

The idea of maximizing the inner product between the reward function and the demonstrated state-action distribution is inspired and insightful.

Good (a paper that has its flaws but makes a good contribution)

I'm curious about whether it's possible to characterize the behavior of the policy that results from the estimated reward function. I describe that below, but first mention a question that seems pretty central to being able to reproduce these results.

The implementational question: I'm not connecting the dots on how the features are actually used to represent the reward function. It's stated that there are 6 features, and they're pretty intuitive, but it's not clear from Equation 8 how they enter the kernel function that represents the reward function.

Now the question about the theoretical underpinnings. This question is concerning to me, which is why I've selected the overall score that I have, despite the strong empirical results. If this question can be resolved, the ideas have the potential to make a strong impact.

The theoretical results are useful to prove the estimator is consistent, but pretty intuitive: they state that better estimates of the state-action distribution result in better estimates of the reward function that would be found if the exact state-action distribution were known. I'm a little concerned, though, that the theory doesn't address the "correctness" of the formulation. For instance, it would be great if better estimates of the state-action distribution resulted in an estimated reward function that was in some sense more *similar* to the expert's reward function (the original characterization of IRL). Or, even better, that better estimates of that state-action distribution resulted in more similar *behavior* of the resulting trained agent to the expert's behavior (the characterization of MMP, MaxEnt, and RelEnt, and even GPIRL, although indirectly).

Without discussion of that, it's not clear what the IRL formulation is really trying to do. It seems intuitive that since max_mu inner(mu, R) s.t. mu in G is the planning formulation (given R, find an optimal policy), flipping that by maximizing over R given mu would try to make the given policy look good. But it's not clear to me, especially given the norm penalization, that this is consistent. For instance, would something like the following reciprocity property hold?

Given mu (as a distribution):
  first find R* = argmax_R inner(mu, R) s.t. ||R|| <= 1,
  then find mu* = argmax_mu inner(mu, R*) s.t. mu in G.
Is it true that mu* = mu?

If we say that we're trying to find a reward function that makes mu look optimal, then we'd hope so. But it's not clear that such reciprocity holds for this formulation. (E.g., the above combined constrained objective over both mu and R can be optimized with coordinate descent: first optimize R given mu, then optimize mu given R, and repeat. Each cycle continues to improve the objective until an equilibrium is reached, but the first cycle usually doesn't find itself at an equilibrium. So my best guess is that this sort of reciprocity property doesn't actually hold here.) The presented theoretical results only show consistency of the estimator, not that the formulation imitates or uncovers a latent reward function. Those properties are only suggested anecdotally in the empirical analysis.
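To sketch why I suspect it does not hold (my own back-of-the-envelope argument, which ignores the RKHS embedding and the extra norm penalty): treating mu and R as elements of the same inner-product space, Cauchy-Schwarz gives the first step in closed form, and the second step becomes a linear program over the Bellman-flow polytope G:

```latex
R^* = \arg\max_{\|R\|\le 1}\,\langle \mu, R\rangle = \frac{\mu}{\|\mu\|},
\qquad
\mu^* = \arg\max_{\mu'\in G}\,\Big\langle \mu',\, \tfrac{\mu}{\|\mu\|}\Big\rangle .
```

The second maximization attains its optimum at a vertex of G, i.e. the occupation measure of a deterministic policy, so unless the demonstrated mu is itself such a vertex we should expect mu* != mu in general.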

I will say, the experimental results are quite impressive, so I'm hopeful for this technique. And the observation about maximizing the inner product is quite interesting, especially since it results in an analytical technique that doesn't require iteration to solve. But I'd like to see more exploration and interpretation of what maximizing the inner product means, and it would be great if there were some theoretical results connecting back to the traditional formulations. In particular, something exploring how an optimal policy under the estimated reward function is expected to behave w.r.t. the original demonstrated policy.


