# Reviews from ICRA 2018

Posted 2018.02.19 13:35

__1. Density Matching Reward Learning (rejected)__

Reviewer 3 of ICRA 2018 submission 2528

Comments to the author

======================

This paper introduces a method to infer a policy from demonstrations by

using (1) inverse reinforcement learning to learn a reward function,

(2) model predictive control for the policy (which removes the need

to actually search for a policy). The authors provide an extensive

theoretical analysis and they benchmark their approach on two problems:

(1) a discrete "object world" and (2) a continuous "track driving"

experiment. According to the benchmark, their approach outperforms the

state of the art.

I am not a specialist of inverse reinforcement learning, but there are

many points that are unclear to me and they should at least be

clarified. In addition, it seems to me that this paper would fit better

in a machine learning conference / journal than a robotics conference (no

robot experiment, no realistic simulations of a robot). However, the

experiments look promising.

Major points

- The problem formulation section presents both the problem and the

proposed approach. This makes understanding the paper difficult.

The problem formulation should only present the basic notation and the

question asked. Instead, there should be an "Approach" section

describing how the question is answered.

- The problem formulation is for the discrete case whereas the title is

about the continuous case. Do the authors want to solve the continuous

case or the discrete case? The continuous case is added as a small

equation mixed with the approach proposed by the authors. I suppose

they should start with the continuous case and then derive a discrete

formulation.

- It is very hard to identify the novel idea of this paper. Is it to

rewrite the product of eq. (1) as an inner product? That sounds like a

straightforward generalization of equation (1). Is it to use inverse

average reinforcement learning for imitation learning? Is it the kernel

formulation?

- The authors claim that their approach is model-free, but for the

Model Predictive Control policy, they need to have access to the

dynamics function: s_{t+1} = f(s_t, a_t). This is basically the model

of the environment, right (you can almost always infer the transition

function T from f and the inverse)?
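
The reviewer's objection can be made concrete with a sketch of a random-shooting MPC policy in the spirit of Eq. (4); all names here are illustrative, not the authors' code. The inner rollout calls the dynamics `f(s, a)` directly, which is exactly a transition model:

```python
import numpy as np

def mpc_action(s, f, reward, horizon=3, n_samples=256,
               actions=(-1.0, 0.0, 1.0), gamma=0.95, seed=0):
    """Random-shooting MPC: sample action sequences, roll each one out
    through the dynamics f, and return the first action of the best plan."""
    rng = np.random.default_rng(seed)
    best_a, best_ret = None, -np.inf
    for _ in range(n_samples):
        plan = rng.choice(actions, size=horizon)
        s_t, ret = s, 0.0
        for t, a in enumerate(plan):
            s_t = f(s_t, a)  # <-- the transition model is required here
            ret += gamma ** t * np.log(reward(s_t, a))
        if ret > best_ret:
            best_a, best_ret = plan[0], ret
    return best_a
```

With a toy integrator f(s, a) = s + a and reward exp(-s^2), the planner steers the state toward zero; without a callable f, no rollout is possible.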

Minor points

- The 5th paragraph of the introduction (Sec. I) needs re-phrasing to

be clearer. The same applies to the first two paragraphs of Related

Work (Sec. II).

- The authors should double-check the notations: for example they use T

as the transition function and as the horizon steps.

- The choice of kernel in Eq. (12) seems ad-hoc and no useful insight

is given.

- Why is GPIRL so much slower than DMPL? Both approaches are inverting

a kernel matrix of similar dimensions. If not, this means that the

authors didn't use sparse GPs (with inducing points or something else)

for GPIRL, which makes the comparison of the computation time really

unfair. The authors should clarify this.

- Quoting the authors: "Once a reward function is optimized, a simple

sample-based random steering method is used to control the car". Why

didn't the authors apply the MPC-based policy as defined in Eq. (4)? I

know that the underlying idea is the same, but why the need to change

the policy?

Reviewer 4 of ICRA 2018 submission 2528

Comments to the author

======================

(1) Summary:

The paper proposes an approach for Imitation Learning in continuous

state and action spaces based on a non-iterative two-stage learning

process. First, a reward is learned to maximize value under an

empirical estimate of the stationary state-action distribution of the

expert. Then, MPC is used to find a policy that maximizes the

cumulated, discounted log rewards. The paper provides a theoretical

analysis for the approach.

Furthermore, it proposes a practical IRL algorithm using KDE to

estimate the joint probability distribution of states and actions.

Then, a reward is learned based on a finite set of basis functions and

optimized to maximize value under regularization.
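
A minimal sketch of the KDE step summarized above (a Gaussian kernel and a fixed bandwidth are my assumptions; the paper's estimator may differ):

```python
import numpy as np

def kde_density(query, samples, bandwidth=0.5):
    """Gaussian kernel density estimate of the expert's joint
    state-action density at a query point."""
    samples = np.asarray(samples, dtype=float)  # (N, d) demonstrated pairs
    query = np.asarray(query, dtype=float)      # (d,)
    n, d = samples.shape
    sq_dists = ((samples - query) ** 2).sum(axis=1)
    norm = (2.0 * np.pi * bandwidth ** 2) ** (d / 2.0)
    return np.exp(-0.5 * sq_dists / bandwidth ** 2).mean() / norm
```

Demonstrated (state, action) pairs concentrated near a region give a much higher estimated density there than far away, which is what the learned reward is then fit to.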

The approach is evaluated on the object world and a driving

environment and mostly outperforms the other approaches.

(2) Related Work:

Most relevant work is cited.

[8] Ho et al.: It isn't an IRL approach. They don't learn the expert's

reward function. Rather they estimate a surrogate reward that guides an

RL agent to learn a policy that matches the state-action distribution.

As soon as it matches, the surrogate reward function will be constant

or undefined.

[16] Boularias et al.: They propose to minimize rel. entropy between

the modeled distribution over trajectories and some baseline

distribution under feature matching constraints. If the baseline

distribution is assumed to be uniform, then it corresponds to MaxEnt

IRL. However, it is not directly minimizing entropy between empirical

distribution of demonstrations and the modeled one.

[18] Herman et al.: They proposed a model-based approach, since they

incorporate learning a model of the environment's dynamics into IRL.

[7] Levine & Koltun: If the feature function is differentiable,

numerical differentiation is not necessary.

(3) Evaluation of Strengths and Weaknesses:

The proposed approach is non-iterative and can be applied in continuous

spaces. The results show competitive or superior performance. However,

the proof by contradiction for the policy will barely hold for real

human demonstrations that are stochastic, while it is evaluated on

human demonstrations. A theoretical analysis regarding learning from

stochastic behavior would be necessary. Furthermore, some motivations

for algorithmic choices are missing.

(4) Detailed Comments:

[4.1]: The norm ball constraint in Eq. (3) is introduced to handle

scale ambiguity. But it further enforces that more likely state-action

tuples result in higher rewards.

[4.2]: It is not clear why the cumulated, discounted "log" rewards are

maximized. Is it due to Eq. (3), which enforces $R: SxA \to [0,1]$ and

$\lVert \mathbf{R} \rVert_2 = 1$ while more likely state action tuples

have higher rewards? Then, rewards correspond to the nonlinearly

scaled density of the expert's state action tuples. If this is the

case, why didn't the authors use $\mu^\ast$ for $R^\ast$, since this

argumentation is used in the first proof of the supplement?
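
One possible reading of the log (my interpretation, not a motivation stated in the review): if the learned reward is a rescaling of the expert density, $R(s,a) \propto \mu_E(s,a)$, then

```latex
\sum_{t=0}^{T} \gamma^t \log R(s_t, a_t)
  = \log \prod_{t=0}^{T} R(s_t, a_t)^{\gamma^t}
  = \log \prod_{t=0}^{T} \mu_E(s_t, a_t)^{\gamma^t} + \mathrm{const},
```

so maximizing the cumulated, discounted log reward amounts to maximizing a discounted log-likelihood of the trajectory under the expert's state-action density.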

[4.3]: While the proof by contradiction of Theorem 1 holds for

deterministic policies, it might not hold for stochastic policies.

However, human demonstrations are rarely optimal and used in the

experiments. Is it possible to provide bounds in matching the expert's

policy to stochastic behavior?

[4.4]: The policy in Eq. (4) maximizes the discounted, cumulated log

reward. However, in the experiment the transition model is stochastic.

Then, optimizing expected discounted, cumulated log rewards would be

more meaningful. Otherwise the policy might be suboptimal.

[4.5] In V. it is argued that under infinite length trajectories,

$\mu^\ast$ converges to the true stationary joint distribution of

states and actions regardless of the initial state distribution. This

is not always true. With two equally good states, the stationary state

action distribution depends heavily on the initial state

distribution.
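
The reviewer's counterexample is easy to reproduce: a chain with two absorbing, "equally good" states has a limiting distribution that depends entirely on where you start (a toy sketch, not the paper's setup):

```python
import numpy as np

# States 0 and 2 are absorbing; state 1 moves to either with probability 1/2.
P = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0]])

def long_run(p0, steps=50):
    """Propagate an initial state distribution p0 through the chain."""
    p = np.asarray(p0, dtype=float)
    for _ in range(steps):
        p = p @ P
    return p
```

Starting in state 0 ends in state 0; starting in state 2 ends in state 2; the two limits never agree, so no initial-distribution-free stationary distribution exists here.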

[4.6] In VI. B the baselines MaxEnt IRL, GPIRL, and RelEnt IRL use

stochastic policies to model human behavior. It is not clear whether

predictions are based on stochastic policies or on the optimal policy.

(5) Minor Suggestions:

[5.1] Zeibart et al. [12] -> Ziebart et al.

[5.2] The distance measure between probability distributions is

typically referred to as total variation distance and the formula in

the footnote only holds for categorical distributions.
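
For reference, the categorical-case formula the reviewer refers to is simply half the L1 distance:

```python
import numpy as np

def total_variation(p, q):
    """Total variation distance between two categorical distributions:
    TV(p, q) = 0.5 * sum_i |p_i - q_i|, which lies in [0, 1]."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()
```

For continuous distributions the analogous quantity is a supremum over measurable sets, which is why the footnote's sum only holds in the categorical case.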

[5.3] In VI. A, the last paragraph says "We also conducted experiments

by changing the number of demonstrations..." while the figures show

evaluations over grid world size.

[5.4] In VI. B it is argued that CIOC could not be compared since

computations would be too demanding. However, computing

(sub-)derivatives of the proposed features should be simple.

(6) Overall evaluation:

Sound idea and proofs, while some motivations are missing. The approach

shows superior results in the experiments, but an analysis for learning

from stochastic demonstrations is missing.

__2. Sparse MDP (accepted)__

Reviewer 3 of ICRA 2018 submission 2417

Comments to the author

======================

The paper describes a new MDP representation for more efficient

exploration using the sparse Tsallis entropy regularization. This

approach allows for learning policies whose densities focus on relevant

actions given the current state. This overcomes the problem of softmax

policies in the standard MDP formulation where we have nonzero

densities on suboptimal actions. Ultimately the sparse MDP allows for

more efficient exploration and faster convergence towards the optimal

policy for problems with multimodal reward maps. The authors also show

that the performance error bound is constant in the number of actions

as opposed to the softmax policy, which scales logarithmically.
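
The contrast between the two policy classes can be sketched with the sparsemax projection that Tsallis-entropy regularization induces (Martins & Astudillo, 2016); this is my illustration, not the paper's exact algorithm, and temperature handling is omitted:

```python
import numpy as np

def softmax(q):
    e = np.exp(q - np.max(q))
    return e / e.sum()          # every action keeps nonzero probability

def sparsemax(q):
    """Euclidean projection of q onto the probability simplex:
    low-value actions get exactly zero probability."""
    q = np.asarray(q, dtype=float)
    q_sorted = np.sort(q)[::-1]
    k = np.arange(1, q.size + 1)
    cssv = np.cumsum(q_sorted)
    support = 1.0 + k * q_sorted > cssv   # which sorted entries stay positive
    rho = k[support][-1]
    tau = (cssv[rho - 1] - 1.0) / rho
    return np.maximum(q - tau, 0.0)
```

For Q-values [2.0, 1.0, -1.0], softmax spreads probability over all three actions, while sparsemax puts all mass on the best one, which is the "zero density on suboptimal actions" behavior described above.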

Major comments

==============

The paper explores an interesting and important topic in RL, that is,

more efficient exploration. The main idea of the paper is well

summarized in Figure 1. The authors show the connection of the sparse

MDP formulation to Tsallis generalized entropy, however, it is unclear

how much importance this adds to the paper. The paper overall gives a

sound theoretical contribution and the clarity of its description is

good.

My main concern with the paper is the experimental section and its

connection to the aim of the paper. It seems that the two important

contributions are 1) efficient exploration for problems with multimodal

reward maps, and 2) scaling better for larger action spaces. However,

the experimental section seems to be mostly focusing on the latter

aspect.

The problems are also not introduced formally, only the video

attachment gives a hint of what is going on. My guess is that the two

problems only have bi-modal reward maps, but equally good solutions. I

think the experimental section should focus more on this aspect as it

is introduced in the beginning of the paper. At the moment it is a bit

unclear why you focus more on the scaling w.r.t. size of action space.

Some more considerations on the evaluation:

- How many times did you repeat the experiment, what is the standard

deviation of the outcome?

- How did you find the optimal \epsilon for the greedy policy? Grid

search, or best guess?

- Perhaps you could use a more interesting and/or real robot task that

shows the strength of your algorithm. At the moment there does not seem

to be much advantage of your approach compared to the softmax policy,

especially in the reacher task.

Reviewer 6 of ICRA 2018 submission 2417

Comments to the author

======================

The authors propose a new regularization scheme for reinforcement

learning based on Tsallis entropy. The main contributions of the paper

are a modified version of the Bellman equation for the regularized

objective, called sparse Bellman equation; a proof that the sparse

Bellman equation is a contraction; and a TD-based algorithm for

learning sparse policies for systems with unknown dynamics. The

experimental section compares the proposed algorithm to conventional

Q-learning and soft Q-learning.

As the authors explain, Tsallis entropy favours sparse distributions,

thus resulting in stochastic optimal policies that assign a zero

probability for actions with low value. The authors suggest that the

benefits of sparse policies include multimodality and efficient

exploration, but the motivation why that is the case is somewhat

unsatisfying. In particular, soft reinforcement learning already

addresses multimodality, and in terms of exploration, it is unclear

whether completely ignoring actions with low value is desirable, as it

may also cause good actions to be ignored due to noise in the Q-value

estimates.

Regarding the experiments, does the expected return in Fig. 3 and 4

show the training return or test return? Obviously soft Bellman does

not reach the same expected return since it is trained to maximize a

different objective. For comparing the effectiveness of different

exploration strategies, it would be fairer to set the temperature

(alpha) to zero at test time (for example, by learning two policies

simultaneously, one with non-zero temperature for exploration and

another one with zero-temperature for testing). Also it would be better

to show all nine combinations of exploration and update-strategies by

choosing the temperature value that worked the best for each one of the

combinations. Also, the number of episodes needed sounds quite large.

For the OpenAI versions of the inverted pendulum and reacher, DDPG

should take about 500 and 1500 episodes, respectively (Gu, 2016).

Perhaps a discrete domain, such as Atari games, would be better for

illustrating the benefits of sparse learning?

To conclude, while the technical derivation seems sound (I did not

check it in detail), the benefits of the proposed method are hard to

gauge based on the experiments only, and it remains unclear whether the

increased complexity of the algorithm is justifiable. Therefore I would

suggest improving the experiments substantially based on the comments

for the final version.

__3. Uncertainty-Aware LfD__

Reviewer 1 of ICRA 2018 submission 220

Comments to the author

======================

The paper discusses a model that is able to discern when it

may perform poorly, by estimating the uncertainty of the

model. By decomposing the uncertainty into explained and

unexplained components, the authors are able to determine

when they can safely execute the output of the model.

Overall the paper is presented clearly and efficiently.

All major points seem well explained.

However, there are a few unclear areas.

In your example implementation, you only use the explained

uncertainty to determine when to switch driving modes

(UALfD, the best version of your approach by tables II and

III). If the previous work of Kendall and Gal has already

proposed a method that outputs this explained

uncertainty, what **benefits** does your approach have over

their work?

Our method does not require sampling!!!

It would be nice to have a more thorough

comparison of the results of your approach with that of

related work in this domain, such as the work in your

citations [15] and [16].

In addition, your testing procedure is slightly unclear.

It seems like you're testing on new vehicles inserted into

an existing dataset. As such, it seems natural to ask, how

natural the resulting trajectories are, i.e. if you

replaced a car in the original dataset, how well does your

method track the original trajectory. If you are inserting

a new car into the existing dataset when you do the

training, the cars around you will not react according to

your actions. It would be interesting to see how the

proposed method would behave in a reactive environment.

Are you inserting the new vehicles into portions of the

dataset you've trained on?

Comments on the Video Attachment

================================

The video attachment is generally good. It gives a good

overview of the work, and gives a good representation of

the results. The music was, however, slightly loud.

Reviewer 2 of ICRA 2018 submission 220

Comments to the author

======================

This paper proposes a new uncertainty estimation method

based on a mixture density network. The authors classified

existing methods into the methods using multiple models and

the methods using Monte Carlo sampling. Compared to the

existing methods, the authors have convincingly conveyed

the advantages of the proposed method. In particular, the

proposed uncertainty measure can separately estimate the

explained variance and the unexplained variance.

The concept of extracting aleatoric and epistemic

uncertainties from deep neural networks is not new,

but the specific combination of the MDN and the uncertainty

measure is new.

Although the proposed combination is technically simple,

the authors convincingly compare the specific advantages of

the combination with previous studies and also

experimentally show that it actually works in various

examples. Especially, I liked that the proposed approach is

successfully applied to a real-world driving dataset.

The overall writing is acceptable, but not clear enough

for readers to understand. A few more passes

over the paper seem necessary to improve readability and

come up with a better top-down explanation of the method.

In the introduction, the case of a Tesla accident appeared

without explaining motivation, and even after reading

through the paper, I am not quite convinced that the

proposed method would work particularly well for such a

case.

I agree, mentioning Tesla was just for introduction.

But the summary of contribution part in the introduction

section clearly explains the advantages of the proposed

research and its contribution in comparison to the

state-of-the-art. References are pretty adequate too. The

comparison of the differences with the existing approaches

is intuitive and easy to understand.

The main weakness of this paper is that it lacks

quantitative comparisons with previous methods.

It is mentioned in the paper that Monte Carlo sampling is

computationally-heavy and thus unsuitable for real-time

applications. It's basically right, but they did not

compare how speed and performance differ exactly for a

specific example.

I guess it would be nice to do an additional experiment on the performance of the previous method by changing the number of samples.

The parameters such as the number of MC

samples or the number of mixtures both would affect the

speed and performance. It would have been much better if

there was a comparative experiment on such a trade-off even

for a simple toy example, for example, similar to the one

in Figure 1 of [9]. However, this paper also seems to

overcome such weakness with many convincing examples and

experiments.

Comments on the Video Attachment

================================

The video was very helpful in understanding the basic idea.

Reviewer 5 of ICRA 2018 submission 220

Comments to the author

======================

This paper addresses the problem of incorporating

uncertainty estimation in deep neural networks, in the

context of learning from demonstration (LfD). Specifically,

it advocates using the mixture density network (MDN) as the

underlying process model, due to its ability for better

representing complex distributions. The authors decompose

the uncertainty in their MDN-based model in explained and

unexplained variances and propose a LfD method (combining

learning-based and rule-based methods) that uses the

explained variance as the switching factor. Finally, the

authors provide experimental results supporting their

claims: i) The multi-mixture MDN and LfD methods outperform

the single density network of [6], and ii) the explained

variance (which does not include the measurement noise) is

a good choice for the switching criterion of the proposed

LfD.

The key idea of the paper is definitely of interest and, as

seen from the results, seems quite promising for

incorporating uncertainty in complex real-world learning

problems. The simulated behavior of the proposed MDN-based

model under different input types (absence of data, noisy

data, and composition of functions), shown in the paper,

also provides useful intuition.

There are, however, three aspects of the paper that can be

improved: (1) Comparison against multi-network systems is

not provided;

What are multi-network systems?

(2) the proposed rule-based safe controller

does not seem sufficient; (3) the experimental result is

not comprehensive. Specifically,

1) For a real-world complex application, the authors show

that a 10-mixture MDN performs better than a 1-mixture MDN

or a standard uncertainty-unaware neural network, which is

an expected behavior. They, however, do not present any

comparison against the multiple deep network-based system

of [15], which seems to be a more relevant competitor

method.

2) The proposed rule-based safe controller only depends on

the center forward and rearward velocities and is designed

to keep the vehicle in the same lane. In dynamic

traffic, where vehicles often change lanes, this strategy

does not seem sufficient. As a result, although in the

datasets (as shown in the attached video) vehicles rarely

change lanes, the safe mode has a non-zero collision rate

in the 70% dataset.

3) The authors use only one few-hundred-meter-long

dataset for demonstrating their results. From the attached

video it also seemed that the dataset does not have any

challenging cases, with frequent lane changes or abrupt

driving behavior, when an uncertainty-aware system is

mostly needed for ensuring safe driving.

Comments on the Video Attachment

================================

One slide explaining what the proposed LfD method is (the

transition between the learning-based and rule-based

methods) will be helpful.

Reviewer 3 of ICRA 2018 submission 220

Comments to the author

======================

The contribution of this paper is an uncertainty modeling

approach using Mixture Density Networks (MDN), and the

uncertainty model is decomposed into explained and

unexplained variance. The authors apply this uncertainty

modeling method to uncertainty-aware learning from

demonstrations.

The key idea of this paper is apparently uncertainty

estimation using MDN. I think the work is interesting, but

I have some concerns.

First, the definitions are inconsistent, and some

definitions are unclear. Supplying too many definitions for

each concept is confusing. Why is another definition,

explained and unexplained variances, necessary? You

already have definitions of aleatoric and epistemic

uncertainties.

But they are different..!

Epistemic and aleatoric uncertainty assume that the network itself is stochastic in that the weights are random variables.

However, for the explained and unexplained variances, the randomness is in the mixture weight probabilities, and the variances are derived from this.

Overall, the sources of uncertainty are totally different.

Superfluous definitions make the paper

confusing, especially from Eqs. (7) to (10). If you first

define the relationship between explained and unexplained

variances and aleatoric and epistemic uncertainties, the

paper should be easier to read.

According to Eq. (10), the second term in Eq. (7) is

epistemic uncertainty. I think that the second term

(epistemic variance) of Eqs. (7) and (10) could be high in

the multiple-choice case, since a GMM has multiple separate

Gaussians when it has multiple choices and the second term

has a larger value. However, in the first paragraph on the

third page, the authors said the high aleatoric uncertainty

means multiple steering angles.

In addition, I don't understand how epistemic uncertainty

is reducible by using more training data. However, since

MDN represents multimodal distribution with a mixture of

Gaussians, it can have strongly separated Gaussians, such

as the example where an obstacle is in front of the car. It

has high epistemic values according to the paper's Eqs. (7)

and (10), but this second term is not reducible despite the

large quantity of training data. With a lot of data, the

first term in Eqs. (7) and (10) can be reduced. According

to the example in the first paragraph of the third page,

the authors said high aleatoric uncertainty indicates

multiple possible steering angles. The authors are mistaken

to think a high value of the first term in Eqs. (7) and

(10) means multiple solutions.
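
The decomposition under discussion is the law of total variance applied to a mixture; a minimal numeric sketch (my notation, with pi_k, mu_k, sigma2_k the mixture weights, means, and variances):

```python
import numpy as np

def mdn_variances(pi, mu, sigma2):
    """Law of total variance for a 1-D Gaussian mixture:
    explained   = Var_k(mu_k)   (spread of the component means),
    unexplained = E_k[sigma2_k] (average within-component variance)."""
    pi, mu, sigma2 = (np.asarray(v, dtype=float) for v in (pi, mu, sigma2))
    mean = np.sum(pi * mu)
    explained = np.sum(pi * (mu - mean) ** 2)
    unexplained = np.sum(pi * sigma2)
    return explained, unexplained
```

Two well-separated, low-noise components ("multiple obvious choices") make the explained term large no matter how much data is available, which is exactly the reviewer's point: this term measures the spread of the component means, not a data-reducible model error.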

Next, I don't think MDN is a good approach for

uncertainty-aware learning.

What??

Generally, an MDN is used

to represent a multimodal distribution, and it is proper to

represent a probabilistic distribution of multiple

solutions. The authors said V(E[y|x,k]) is suitable for

estimating model error, but I think it is a variance of

multiple choices (variance of means of Gaussians), not

model error. The policy switches its mode to the safe

policy when the variance of means is large. However, it

indicates that it switches to the safe driving policy

whenever it has multiple obvious choices. It doesn't take

advantage of MDN which represents the multimodal

distribution.

Minor point

Equation (1) is wrong: 1/T is missing from the second

term, and sigma^2 should appear in the third term.

Comments on the Video Attachment

================================

The video is of good quality and gives nice context to the

paper.

__4. Nonparametric Motion Flow__

Reviewer 2 of ICRA 2018 submission 221

Comments to the author

======================

This paper proposes a nonparametric motion representation,

called motion flow model, that accounts for both the

spatial and temporal information of a given motion

trajectory. The proposed motion flow model is based on a

similarity measure, also defined by the authors, that uses

the mean and variance functions of a Gaussian Process. The

authors then show the application of both the proposed

similarity measure and motion flow model on a human-robot

cooperation scenario.

The work presented in this paper is highly relevant to the

human-robot cooperation field since it addresses one of its

main current issues: the inference and recognition of human

motion. In particular, the results presented in this paper

indicate that the motion model and similarity measure

proposed by the authors seem to be particularly suitable

for applications in which only partial trajectory

observations are available or reactive and anticipative

robot behaviors are desirable.

The exposition is quite clear. However, a few comments:

- In the related work and introduction sections I would

recommend that the authors better position their work with

respect to the cited references. The authors could

explicitly list the advantages of the proposed

motion model and similarity measure compared to

existing work.

- The human-robot cooperation application is based on two

existing approaches: DMRL and GRPs. It is not clear to me

what are the reasons for employing these two particular

algorithms, especially since, in Section V.B, the

authors state that any other algorithm could work.

- Even though motion inference is the main issue addressed

by the authors, the paper lacks a more detailed

exposition of how the actual motion recognition and

inference is done. Section IV.C does not provide enough

information, and later in the document, motion inference is

only mentioned.

- The exposition of how a pertinent trajectory for the robot

arm is optimized should be reviewed. According to Eq. 11,

the final pose of the robot trajectory is determined using

all interaction observations. However, in Eq. 12 the

interacting trajectory of a robot end-effector is computed

using the different motion flow clusters. Why the number of

observations in one equation and the number of clusters in the

other?

As the focus of the paper is on motion trajectory

inference for human-robot cooperation, some relevant recent

references should be added:

- Maeda et al., "Phase Estimation for Fast Action

Recognition and Trajectory Generation in Human-Robot

Collaboration", IJRR 2017

- Maeda et al., "Probabilistic movement primitives for

coordination of multiple human-robot collaborative tasks",

Autonomous Robot 2017

Finally, one of the main contributions of this paper is the

definition of a motion similarity measure for which time

alignment is not required. It will be interesting to see

how this similarity measure performs when compared to other

metrics. In particular, how it scales to more dimensions

and to an increased number of samples and clusters.

Comments on the Video Attachment

================================

The video exemplifies well the work presented in the paper.

It helps to better understand how motion trajectory

inference is done (Paper lacks details about this part).

The transitions should be longer. Currently, the time to

read each bullet item is too short.

Reviewer 5 of ICRA 2018 submission 221

Comments to the author

======================

The paper describes a method for interaction between a person

and a robot, where the robot acts predictively. The method is

based on two statistical approaches previously proposed by

the authors: Density Matching Reward Learning and

Gaussian Random Paths (citations 10 and 11). An important

aspect of the paper is the introduction of a motion similarity

measure based on a so-called motion flow model. The

results are compared to the state of the art method:

Mixture of Interaction Primitives (MIP) (Amor et al,

2014).

The paper is very technical and a lot of work stands behind

it. Possibility to predict action just after 20% of

trajectory sounds impressive (even though prediction among

only four different actions is considered). It is shown

that much earlier prediction as compared to the competitor

method MIP is possible.

However, it is very difficult to read the paper.

-_ㅠ

Partially

because it is quite technical, but also some things are not

explained well (or extensively) enough.

First of all the paper is a bit over-crowded. Three

different things are shown: planning with partial

observations without obstacles, planning with partial

observations with obstacles and Manipulator-X experiments.

I find the planning with obstacles experiment not very

impressive and not well enough explained, and I would

suggest omitting this experiment and explaining the remaining

experiments better.

Concerning better explanations, a diagram showing how the

three main technical components (Density

Matching Reward Learning, Gaussian Random Paths, and the

proposed similarity measure) work together would help. All

components were explained, but how the final picture comes

out of those was not clear.

Figures 3, 7, and 9 have details that are far too small and

can be read only on a computer with heavy magnification.

Figure 9 needs more explanations, as it is not clear where

one sees intention change.

The experiment with obstacles, if left in the paper, needs

more explanations. For me it was unclear if the MIP or the

proposed method performs better. Do the authors consider it

better when the robot acts closer or further away from the

obstacle? It seems that the obstacle was not actually

reached with either method, so it is hard

to say which method is better.

The issue of grasping in the Manipulator-X experiment was also

not described well enough. (The description is too brief.)

When discussing the similarity measure, the statement

"Both similarities go to zero if xi_i equals xi_j" seems

incorrect (it is just about the wording: when trajectories

are the same, the similarity should be high).

There is a copy-paste mistake in describing the

interaction demonstrations in (4): "its end effector to

the top left" should be written.

Comments on the Video Attachment

================================

The video attachment is good, but too crowded. One cannot

read a slide before the next slide appears. This can be

fixed by greatly reducing the time devoted to the demonstration

at the beginning of the video.

Reviewer 6 of ICRA 2018 submission 221

Comments to the author

======================

The aim of this paper is to propose a motion flow model that

can map a trajectory to a latent vector considering both

spatial and temporal aspects of a trajectory. This paper

demonstrates that its performance is comparable with Mixture of

Interaction Primitives (MIP) when the full observation of

human demonstrations is given, and is superior when a

partial observation is given. Compared with MIP, it seems

to have an advantage in handling additional constraints

directly as described in V-D, although there are still

things to be revised and reconsidered as follows.

<Major Comment>

- P.3, Eq.(5) computes the similarity at each time t from 1

to L, which means (5) assumes a time-alignment between

x_i,t and x_j,t, although the kernel function could

alleviate the time-alignment restriction. The

partial-observation experiments in this paper are not

enough to support the time-alignment-free aspect of the

proposed algorithm.

- p.6, regarding Fig. 6, it is unclear how to interpret

these panels. For each panel, the x-axis is trajectories (trials).

Why do the minimum distances indicated by the red circles

vary largely across trajectories, while the ones indicated by

the blue circles remain the same? In particular, what do the jumps

in the RMS errors of the proposed method indicate?

- p.6, regarding Eq.(13), how were beta and gamma

determined? Also please explain why each needed three

significant digits instead of two or one significant

digits.
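
The time-alignment issue raised in the first major comment can be seen with a deliberately naive per-timestep similarity in the spirit of Eq. (5) (the kernel choice, averaging, and variable names are my assumptions, not the paper's definition):

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """Squared-exponential kernel between two points."""
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))

def similarity(xi_i, xi_j, gamma=1.0):
    """Average kernel similarity of time-aligned samples x_i[t], x_j[t]."""
    xi_i, xi_j = np.asarray(xi_i, dtype=float), np.asarray(xi_j, dtype=float)
    return float(np.mean([rbf(a, b, gamma) for a, b in zip(xi_i, xi_j)]))
```

An identical trajectory scores 1, while a time-shifted copy of the same path scores lower; whether the GP-based kernel fully removes this sensitivity is what the reviewer asks the experiments to demonstrate.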

<Minor Comments>

p.4, in the final paragraph of IV-B, "Figuere" should be "Figure".

p.5, in the first paragraph of V-C, "MIL" must be "MIP".

Comments on the Video Attachment

================================

This movie helps readers a lot.
