# ICLR 2018 reviews

Posted 2018.11.22 04:11

__Reviewer-1__

Review: This paper presents **an apparently original method** targeted toward models training in the presence of low-quality or corrupted data. To accomplish this they introduce a "mixture of correlated density network" (MCDN), which processes representations from a backbone network, and the MCDN models the corrupted data generating process. Evaluation is on a regression problem with an analytic function, two MuJoCo problems, MNIST, and CIFAR-10.

This paper's primary strength is that the proposed method is a tool quite distinct from recent work, in that it does not use bootstrapping or solely use corruption transition matrices. The paper is typeset well. In addition to this, the experimentation has unusual breadth.

However, the synthetic regression task is a nice proof-of-concept, but thorough **regression evaluation could perhaps include the Boston Housing Prices dataset** or some UCI datasets.

We confirmed that the method works well on the Boston housing price dataset.

The hamartia of this paper is that it does not provide **sufficient depth in its computer vision experiments**. For one, experimentation on CIFAR-100 would be appreciated.

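We did not run CIFAR-100 because of GPU resource constraints. Besides, this paper was not submitted to a vision conference. However, we did run additional and comparison experiments on several other problems on CIFAR-10.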

In the CIFAR-10 experiments, they consider one label corruption setting and **lack experimentation on uniform label corruptions**.

Ian will run this experiment, right? The results are not that good yet..

The related works has thorough coverage on label corruption, but these works do not appear in the experiments. They instead compare their label corruption technique to mixup, a general-purpose network regularizer. It is not clear why it is thought the "state-of-the-art technique on noisy labels"; this may be true among network regularization approaches (such as dropout) but not among label correction techniques. For this problem I would expect comparison to at least three label correction techniques, but the comparison is to one technique which was not primarily designed for label corruption.

Compare against at least three? Kyungjae ran MentorNet and VAT, and if Ian runs experiments in the Co-teaching setting, we can show results for Pair-45% and Symmetry 50%/20%. That way we can compare in the same setting as the experiments in that paper.

Nitpicks:

-In the related works we are told that a smaller learning rate can improve label corruption robustness. They train their method with a learning rate of 0.001; the baseline gets a learning rate of 0.1.

-The larger-than-usual batch size is 256 for their 22-4 Wide ResNets, and at the same time they do not use dropout (standard for WRNs of this width) and use less weight decay than is common. Is this because of mixup? If so why is the weight decay two orders of magnitude less for your approach compared to the baseline? How were these various atypical parameters chosen?

-They also use gradient clipping for their method, which is extremely rare for CIFAR-10 classification. Why is this necessary?

-This document could be cleaner by eschewing the Theorem of this paper, which "states that a correlation between two random matrices is invariant to an affine transform." For this audience, I suspect this theorem is unnecessary. Likewise the three lines expended for the maths of a Gaussian probability density function could probably be used for other parts of this paper.

-"a leverage optimization method which optimizes the leverage of each demonstrations is proposed. Unlike to former study," -> "a leverage optimization method which optimizes the leverage of each demonstration is proposed. Unlike a former study,"

-"In the followings," -> "In the following,"

Rating: 4: Ok but not good enough - rejection

Confidence: 4: The reviewer is confident but not absolutely certain that the evaluation is correct

The nitpicks are understood. We can just say that this is what we did and explain the details.

We thank the reviewer for the helpful comments.

We agree that more evaluation on regression problems is helpful, and we conducted additional experiments on the Boston housing prices dataset. We compared our proposed method with [1] using two different robust loss functions and showed that the proposed method outperforms all compared methods with respect to handling outliers. Specifically, we replaced the output training data with outliers sampled from a uniform distribution and then computed the RMSE of each method with six different random seeds.
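The outlier protocol described above can be sketched as follows. This is only an illustrative sketch, not the code used for the rebuttal: the uniform range (taken here as the min and max of the clean targets), the `outlier_rate` parameter, and the `fit_predict` callable are all assumptions introduced for the example.

```python
import numpy as np


def replace_with_outliers(y, outlier_rate, low, high, rng):
    """Replace a random fraction of the targets with samples from U[low, high)."""
    y_noisy = np.array(y, dtype=float, copy=True)
    n_outliers = int(round(outlier_rate * y_noisy.shape[0]))
    idx = rng.choice(y_noisy.shape[0], size=n_outliers, replace=False)
    y_noisy[idx] = rng.uniform(low, high, size=n_outliers)
    return y_noisy, idx


def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def evaluate_over_seeds(fit_predict, x_train, y_train, x_test, y_test,
                        outlier_rate, seeds=range(6)):
    """Corrupt the training targets once per seed and report mean/std test RMSE.

    `fit_predict(x_train, y_noisy, x_test)` is a hypothetical callable that
    trains any regressor on the corrupted targets and returns test predictions.
    """
    y_train = np.asarray(y_train, dtype=float)
    scores = []
    for seed in seeds:
        rng = np.random.default_rng(seed)
        # Assumption: outliers are drawn from the range of the clean targets.
        y_noisy, _ = replace_with_outliers(
            y_train, outlier_rate, y_train.min(), y_train.max(), rng)
        scores.append(rmse(y_test, fit_predict(x_train, y_noisy, x_test)))
    return float(np.mean(scores)), float(np.std(scores))
```

Only the training targets are corrupted; the test targets stay clean, so the RMSE measures how well each method recovers the underlying function despite the outliers.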

For the classification problems, we compared our method with two additional baselines, MentorNet [2] and VAT [3], in the current symmetric noise setting using CIFAR-10 and showed superior performance in the random shuffling setting.

We also tested our method on three different permutation settings (both symmetric and asymmetric) following [4]. The experimental results including our method are:

Add table here.

We did not conduct CIFAR-100 experiments due to the limited time and computational resources available.

Responses to 'nitpicks':

-In the related works we are told that a smaller learning rate can improve label corruption robustness. They train their method with a learning rate of 0.001; the baseline gets a learning rate of 0.1.

=>

-The larger-than-usual batch size is 256 for their 22-4 Wide ResNets, and at the same time they do not use dropout (standard for WRNs of this width) and use less weight decay than is common. Is this because of mixup? If so why is the weight decay two orders of magnitude less for your approach compared to the baseline? How were these various atypical parameters chosen?

=>

-They also use gradient clipping for their method, which is extremely rare for CIFAR-10 classification. Why is this necessary?

=> The main reason for using gradient clipping is that

-This document could be cleaner by eschewing the Theorem of this paper, which "states that a correlation between two random matrices is invariant to an affine transform." For this audience, I suspect this theorem is unnecessary. Likewise the three lines expended for the maths of a Gaussian probability density function could probably be used for other parts of this paper.

=>

-"a leverage optimization method which optimizes the leverage of each demonstrations is proposed. Unlike to former study," -> "a leverage optimization method which optimizes the leverage of each demonstration is proposed. Unlike a former study,"

=> We will fix this in the revised version.

-"In the followings," -> "In the following,"

=> We will fix this in the revised version.

[1] Vasileios Belagiannis, Christian Rupprecht, Gustavo Carneiro, Nassir Navab, "Robust Optimization for Deep Regression", ICCV, 2015

[2] L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, 2018.

[3] T. Miyato, S. Maeda, M. Koyama, and S. Ishii. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. ICLR, 2016.

[4] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, Masashi Sugiyama, Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels, NIPS, 2018

__Reviewer-2__

Review: The paper presents a framework, called ChoiceNet, for learning when the supervision outputs (e.g., labels) are corrupted by noise. The method relies on estimating the correlation between the training data distribution and a target distribution, where training data distribution is assumed to be a mixture of that target distribution and other unknown distributions. The paper also presents some compelling results on synthetic and real datasets, for both regression and classification problems.

The proposed idea builds on top of previously published work on Mixture Density Networks (MDNs) and Mixup (Zhang et al, 2017). The main difference is the MDN are modified to construct the Mixture of Correlated Density Network (MCDN) block, that forms the main component of ChoiceNets.

I **like the overall direction and idea of modelling correlation** between the target distribution and the data distribution to deal with noisy labels. The results are also compelling and I thus lean towards accepting this paper. My decision on "marginal accept" is based primarily on my unfamiliarity with this specific area and that some parts of the paper are not very easy or intuitive to read through.

== Related Work ==

I like the related work discussion, but would emphasize more the connection to MDNs and to Mixup. Only one sentence is mentioned about Mixup but reading through the abstract and the introduction that is the first paper that came to my mind and thus I believe that it may deserve a bit more discussion.

Also, there are a couple more papers that felt relevant to this work but are not mentioned:

- Estimating Accuracy from Unlabeled Data: A Bayesian Approach, Platanios et al., ICML 2016. I believe this is related in how noisy labels are modeled (i.e., section 3 in the reviewed paper) and in the idea of correlation/consistency as a means to detect errors. There are couple more papers in this line of work that may be relevant.

- ADIOS: Architectures Deep In Output Space, Al-Shedivat et al., ICML 2016. I believe this is related in learning some structure in the output space, even though not directly dealing with noisy labels.

I read all of these papers. They are interesting, and we can add them. Capturing the structure of the output space seems like a good idea in many ways.

== Method ==

I believe the methods section could have been written in a more clear/easy-to-follow way, but this may also be due to my unfamiliarity with this area. Figure 1 is hard to parse and does not really offer much more than section 3.2 currently does. If the figure is improved with some more text/labels on boxes rather than plain equations, it may go a long way in making the methods section easier to follow.

Explain it more simply / draw the figure better. It is too hard to follow.

I would also point out MCDN as the key contribution of this paper as ChoiceNet is just any base network with an MCDN block stacked on top of this. Thus, I believe this should be emphasized more to make your key contribution clear.

Let's explain ChoiceNet better.

== Experiments ==

The experiments are nicely presented and are quite thorough. A couple minor comments I have are:

- It would be nice to run regression experiments for bigger real-world datasets, as the ones used seem to be quite small.

Additional regression experiments: we did Boston housing prices, and one more dataset should be enough.

- I am a bit confused at the fact that in table 3 you compare your method to mixup and in table 4 you also show results when using both your method and mixup combined. Up until that point I thought that mixup was posed as an alternative method, but here it seems it's quite orthogonal and can be used together, which I think makes sense, but would be good to clarify. Also, given that you show combined results in table 4, why not also perform exactly the same analysis for table 3 and also show numbers for CN + Mixup?

It would also be nice to use the same naming scheme for both tables. I would use: ConvNet, ConvNet + CN, ConvNet + CN + Mixup, and the same with WRN for table 4. This would make the tables easier to read because currently the first thing that comes to mind is what may be different between the two setups given that they are presented side-by-side but use different naming conventions.

This is about how to present the proposed method. The suggestion does not look bad.
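As a reminder of why mixup can be combined with a method such as CN rather than only compared against it, here is a minimal sketch of mixup as a batch-level transform. It follows the standard formulation (a mixing weight drawn from Beta(alpha, alpha), applied to both inputs and one-hot targets); the function name and the choice of `alpha` are assumptions for illustration, and nothing here reflects how the paper's own experiments were implemented.

```python
import numpy as np


def mixup_batch(x, y_onehot, alpha, rng):
    """Mix each example with a randomly chosen partner from the same batch.

    x:        array of shape (batch, ...), the inputs
    y_onehot: array of shape (batch, num_classes), one-hot targets
    alpha:    Beta distribution parameter controlling the mixing strength
    """
    lam = rng.beta(alpha, alpha)          # one mixing weight for the whole batch
    perm = rng.permutation(x.shape[0])    # partner index for every example
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix
```

Because the transform only touches the batch before it reaches the network, any backbone and any output block can be trained on `x_mix` and `y_mix`, which is consistent with the reviewer's reading that the two are orthogonal.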

One question that comes to mind is that you make certain assumptions on the kinds of noise your model can capture, so are there any cases where you have good intuition as to why your model may fail? It would be good to present a short discussion on this to help readers understand whether they can benefit by using your model or not.

Give failure cases.. hmm, for this we could mention random permutation.

Rating: 6: Marginally above acceptance threshold

Confidence: 4: The reviewer is confident but not absolutely certain that the evaluation is correct

Overall a positive review. Thanks.

We first thank the reviewer for the helpful comments.

First of all, we added the suggested papers to the related work, because modeling the structure of the output space has a lot in common with our proposed method.

__Reviewer-3__

Review: This paper formulates a new deep learning method called ChoiceNet for noisy data. Their main idea is to estimate the densities of data distributions using a set of correlated mean functions. They argue that ChoiceNet can robustly infer the target distribution on corrupted data.

Pros:

1. The authors find a new angle for learning with noisy labels. For example, the keypoint of ChoiceNet is to design the mixture of correlated density network block.

2. The authors perform numerical experiments to demonstrate the effectiveness of their framework in both regression tasks and classification tasks. And their experimental results support their previous claims.

Cons:

We have three questions in the following.

1. Related works: In deep learning with noisy labels, there are three main directions, including small-loss trick [1-3], estimating noise transition matrix [4-6], and explicit and implicit regularization [7-9]. I would appreciate if the authors can survey and compare more baselines in their paper instead of listing some basic ones.

Understood. I read all of the related work and will add discussion of it.

2. Experiment:

2.1 Baselines: For noisy labels, the authors should add MentorNet [1] as a baseline (https://github.com/google/mentornet). From my own experience, this baseline is very strong. At the same time, they should compare with VAT [7].

Kyungjae ran the MentorNet and VAT experiments.

2.2 Datasets: For datasets, I think the author should first compare their methods on symmetric and asymmetric noisy data [4]. Besides, the current paper only verifies on vision datasets. The authors are encouraged to conduct 1 NLP dataset.

Once Ian runs the experiments, we will have covered both sym and asym. And they want NLP?!

We did it, hahaha, and the results are not bad.

3. Motivation: The authors are encouraged to re-write their paper with more motivated storyline. The current version is okay but not very exciting for idea selling.

I will try to explain the motivation better.

References:

[1] L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, 2018.

[2] M. Ren, W. Zeng, B. Yang, and R. Urtasun. Learning to reweight examples for robust deep learning. In ICML, 2018.

[3] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, M. Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NIPS, 2018.

[4] G. Patrini, A. Rozza, A. Menon, R. Nock, and L. Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, 2017.

[5] J. Goldberger and E. Ben-Reuven. Training deep neural-networks using a noise adaptation layer. In ICLR, 2017.

[6] S. Sukhbaatar, J. Bruna, M. Paluri, L. Bourdev, and R. Fergus. Training convolutional networks with noisy labels. In ICLR workshop, 2015.

[7] T. Miyato, S. Maeda, M. Koyama, and S. Ishii. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. ICLR, 2016.

[8] A. Tarvainen and H. Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NIPS, 2017.

[9] S. Laine and T. Aila. Temporal ensembling for semi-supervised learning. In ICLR, 2017.

Rating: 5: Marginally below acceptance threshold

Confidence: 5: The reviewer is absolutely certain that the evaluation is correct and very familiar with the relevant literature

We thank the reviewer for the helpful reviews.

1. Related work: We admit that the current manuscript lacks a comprehensive survey of related work. We recategorized the related work into three groups and compared them in a more principled way.

2. Experiments: We compared our method with MentorNet [2] and VAT [3] in our current setting (symmetric noise). Furthermore, we conducted additional experiments on both symmetric and asymmetric noise following the experimental setting of Co-teaching [4].

We also conducted a regression experiment using a real-world dataset (Boston housing price dataset) and compared our method with two robust loss functions [1] as baselines.

We didn't conduct NLP experiments due to limited time and computational resources available.
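For readers unfamiliar with the two noise types mentioned above, the sketch below builds label transition matrices under one common convention: symmetric noise spreads the flip probability evenly over the other classes, and pair (asymmetric) noise flips each class only to the next class. This convention and all function names are assumptions for illustration; the exact construction used in the experiments should be checked against [4].

```python
import numpy as np


def symmetric_transition(num_classes, noise_rate):
    """Keep a label with prob. 1 - noise_rate, else flip uniformly to another class."""
    t = np.full((num_classes, num_classes), noise_rate / (num_classes - 1))
    np.fill_diagonal(t, 1.0 - noise_rate)
    return t


def pair_transition(num_classes, noise_rate):
    """Keep a label with prob. 1 - noise_rate, else flip class i to class i + 1."""
    t = (1.0 - noise_rate) * np.eye(num_classes)
    for i in range(num_classes):
        t[i, (i + 1) % num_classes] = noise_rate   # the last class wraps to class 0
    return t


def corrupt_labels(labels, transition, rng):
    """Sample a noisy label for every clean label from its transition row."""
    labels = np.asarray(labels)
    noisy = np.empty_like(labels)
    for i, c in enumerate(labels):
        noisy[i] = rng.choice(transition.shape[0], p=transition[c])
    return noisy


rng = np.random.default_rng(0)
clean = rng.integers(0, 10, size=1000)
noisy_sym = corrupt_labels(clean, symmetric_transition(10, 0.5), rng)
noisy_pair = corrupt_labels(clean, pair_transition(10, 0.45), rng)
```

Under this convention, row `i` of the matrix is the distribution of the observed label given true class `i`, so every row sums to one.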

[1] Vasileios Belagiannis, Christian Rupprecht, Gustavo Carneiro, Nassir Navab, "Robust Optimization for Deep Regression", ICCV, 2015

[2] L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, 2018.

[3] T. Miyato, S. Maeda, M. Koyama, and S. Ishii. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. ICLR, 2016.

[4] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, Masashi Sugiyama, Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels, NIPS, 2018
