https://openreview.net/forum?id=Bkg3g2R9FX

Review: This paper presents new variants of ADAM and AMSGrad that bound the gradients above and below to avoid the potential negative effects on generalization of excessively large and small gradients, and it demonstrates their effectiveness on several commonly used machine learning test cases. The paper also presents detailed proofs that there exists a convex optimization problem for which the ADAM regret does not converge to zero.

This paper is very well written and easy to read, and I thank the authors for their hard work. I also believe that their bounding approach is well structured in that it converges to SGD in the infinite limit, allowing the algorithm to get the best of both worlds: faster convergence and better generalization. The authors' experimental results support the value of their proposed algorithms. In sum, this is an important result that I believe will be of interest to a wide audience at ICLR.

The proofs in the paper, although impressive, are not very compelling for the point the authors want to get across. The fact that such cases of poor performance can exist says nothing about the average performance of the algorithms, which in practice is what really matters. The paper could be improved by including more and larger datasets. For example, the authors ran on CIFAR-10; they could also have run on CIFAR-100 to obtain more convincing results.

The authors add a useful section on notation, but go on to abuse it a bit. This could be improved. Specifically, they use an "i" subscript to indicate the i-th coordinate of a vector and then, in Table 1, sum over t using i as a subscript. Also, superscripts on vectors are said to denote element-wise powers. If so, why is a diag() operation required? Either make the outer product explicit, or get rid of the diag().
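To make the bounding construction concrete, here is a minimal sketch of a bounded adaptive update that reduces to SGD in the limit. It assumes the bound is applied to the per-coordinate effective step size (AdaBound-style clipping); the function name `bounded_adam_step` and the exact bound formulas are illustrative, not necessarily the authors' own.

```python
# Hypothetical sketch (not necessarily the authors' exact algorithm): an
# Adam-style update whose per-coordinate step size is clipped between a lower
# and an upper bound. Both bounds tighten toward final_lr as t grows, so the
# update gradually approaches plain SGD with learning rate final_lr.
import numpy as np

def bounded_adam_step(x, grad, m, v, t, lr=1e-3, final_lr=0.1,
                      beta1=0.9, beta2=0.999, gamma=1e-3, eps=1e-8):
    """One parameter update at step t >= 1; bound formulas are illustrative."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    step = lr / (np.sqrt(v) + eps)            # Adam's per-coordinate step size
    # Lower bound rises from 0, upper bound falls from a large value; both
    # converge to final_lr, so the clipped step size eventually equals final_lr
    # for every coordinate.
    lower = final_lr * (1.0 - 1.0 / (gamma * t + 1.0))
    upper = final_lr * (1.0 + 1.0 / (gamma * t))
    step = np.clip(step, lower, upper)
    x = x - step * m
    return x, m, v
```

Because both bounds converge to final_lr as t goes to infinity, the method behaves like Adam early in training and like SGD (with momentum) later on, which is the "best of both worlds" interpolation referred to above.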
Comment: Thanks for your comments! We fully agree that the average performance of different algorithms is what matters most in practice. However, as also mentioned in our reply to the anonymous comment before (on 11.12), our understanding of the generalization behavior of deep neural networks is still quite shallow at present, and investigating it from a theoretical perspective is a major challenge. Indeed, the theoretical analysis in most recent related work still relies on strong or particular assumptions. We believe that a convincing theoretical proof without such strong assumptions would be worth a publication of its own.

We are conducting more experiments on larger datasets such as CIFAR-100 and on more tasks in other fields, and the results are very positive as well. We will add these results and their analysis in the final revision if space permits.

Regarding notation, we want to argue that the use of diag() is necessary, since $\phi_t$ is a matrix rather than a vector. Also, $g$ is not a vector, but $g_t$ is, and $g_{t,i}$ is its i-th coordinate. It is true that the expression $x_i$ might be ambiguous without context: either 1) $x$ is a vector and $x_i$ denotes its i-th coordinate, or 2) $x$ is not a vector and $x_i$ is a vector at time $i$. But since $x$ cannot both be and not be a vector at the same time, the meaning is clear in any specific context; this kind of notation is also used in many other works. We have re-checked the mathematical expressions in our paper and believe they are correct.
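For readers following along, a minimal illustration of the distinction being discussed (the symbols below are generic and may not match the paper's exactly): superscripts act element-wise on vectors, while diag() lifts a vector into the diagonal matrix used as the preconditioner.

```latex
% Illustrative only; symbols are generic and may differ from the paper's.
% Superscripts on vectors act element-wise, while diag() produces a matrix.
\[
  g_t^{2} = \bigl(g_{t,1}^{2},\, g_{t,2}^{2},\, \dots,\, g_{t,d}^{2}\bigr)
  \in \mathbb{R}^{d},
  \qquad
  V_t = \mathrm{diag}(v_t) \in \mathbb{R}^{d \times d},
\]
\[
  x_{t+1} = x_t - \alpha_t\, V_t^{-1/2} m_t ,
\]
% Here V_t^{-1/2} m_t is a matrix-vector product; writing the same update with
% the vector v_t alone would require element-wise operations throughout, which
% is why the diag() appears even though powers of vectors are element-wise.
```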