Disclaimer: This post will be a little different from my usual ones. In fact, I won’t prove anything and I will just briefly explain some of my conjectures around optimization in deep neural networks. Unlike my usual posts, it is totally possible that what I wrote is completely wrong 🙂
I have been working on online and stochastic optimization for a while, from a practical and empirical point of view. So, I was already in this field when Adam (Kingma and Ba, 2015) was proposed.
The paper was ok but not a breakthrough, and even less so by today’s standards. Indeed, the theory was weak: a regret guarantee, which is a statement about online convex optimization, for an algorithm meant for stochastic optimization of non-convex functions. The experiments were also weak: the exact same experiments would result in a surefire rejection these days. Later, people also discovered an error in the proof and the fact that the algorithm does not converge on certain one-dimensional stochastic convex functions. Despite all of this, these days Adam is considered the King of optimization algorithms. Let me be clear: it is known that Adam will not always give you the best performance, yet most of the time people know that they can use it with its default parameters and get, if not the best performance, at least the second-best performance on their particular deep learning problem. In other words, Adam is nowadays considered the default optimizer for deep learning. So, what is the secret behind Adam?
Over the years, people have published a vast number of papers that tried to explain Adam and its performance, too many to list. From the “adaptive learning rate” (adaptive to what? Nobody exactly knows…) to the momentum, to the almost scale-invariance, every single aspect of its arcane recipe has been examined. Yet, none of these analyses gave us the final answer on its performance. It is clear that most of these ingredients are beneficial to the optimization process of any function, but it is still unclear why this exact combination, and not another one, makes it the best algorithm. The equilibrium in the mix is so delicate that even the small change required to fix the non-convergence issue was considered to give slightly worse performance than Adam.
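For readers who have not seen the recipe written out, here is a minimal NumPy sketch of the update as described in (Kingma and Ba, 2015), with the default parameters from the paper. The helper names `adam_step` and `run_adam`, as well as the toy quadratic, are my own choices for illustration and not part of any library.

```python
import numpy as np

def adam_step(x, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the 1-based iteration counter."""
    m = beta1 * m + (1 - beta1) * g          # momentum: running mean of gradients
    v = beta2 * v + (1 - beta2) * g * g      # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias corrections for the zero init
    v_hat = v / (1 - beta2 ** t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-coordinate rescaled step
    return x, m, v

def run_adam(grad_fn, x0, steps=100, **kwargs):
    """Run Adam for a fixed number of steps (hypothetical helper)."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, steps + 1):
        x, m, v = adam_step(x, grad_fn(x), m, v, t, **kwargs)
    return x

# Toy check of the "almost scale-invariance": minimize 0.5 * c * x^2
# for c = 1 and c = 1000, starting from the same point.
x_small = run_adam(lambda x: 1.0 * x, [1.0])
x_large = run_adam(lambda x: 1000.0 * x, [1.0])
print(x_small, x_large)
```

Multiplying the objective by 1000 multiplies both `m_hat` and `np.sqrt(v_hat)` by 1000, so their ratio, and hence the iterates, stay essentially the same. The “almost” comes from `eps`, which is the only term that does not rescale.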
The fame of Adam is also accompanied by strong sentiments: It is enough to read posts on r/MachineLearning on Reddit to see the passion that people put into defending their favorite optimizers against the other ones. It is the sort of fervor that you see in religion, in sports, and in politics.
However, how likely is all this? I mean, how likely is it that Adam is really the best optimization algorithm? How likely is it that we reached the apex of optimization for deep learning a few years ago, in a field that is so young? Could there be another explanation for its prodigious performance?
I have a hypothesis, but before explaining it we have to briefly talk about the applied deep learning community.
In a talk, Olivier Bousquet described the deep learning community as a giant genetic algorithm: Researchers in this community are exploring the space of all variants of algorithms and architectures in a semi-random way. Things that consistently work in large experiments are kept, the ones that do not work are discarded. Note that this process seems to be independent of the acceptance and rejection of papers: The community is so big and active that good ideas in rejected papers are still saved and transformed into best practices in a few months, see for example (Loshchilov and Hutter, 2019). Analogously, ideas in published papers are reproduced by hundreds of people who mercilessly trash things that do not reproduce. This process has created a number of heuristics that consistently produce good results in experiments, and the stress here is on “consistently”. Indeed, despite being based on non-convex formulations, the performance of deep learning methods turns out to be extremely reliable. (Note that the deep learning community also has a large bias towards “famous” people, so not all ideas receive the same level of attention…)
So, what is the link between this giant genetic algorithm and Adam? Well, looking carefully at the creative process in the deep learning community I noticed a pattern: Usually people try new architectures keeping the optimization algorithm fixed, and most of the time the algorithm of choice is Adam. This happens because, as explained above, Adam is the default optimizer.
So, here is my hypothesis: Adam was a very good optimization algorithm for the neural network architectures we had a few years ago, and people kept evolving new architectures on which Adam works. So, we might not see many architectures on which Adam does not work because such ideas are discarded prematurely! Such ideas would require designing a new architecture and a new optimizer at the same time, which would be a very difficult task. In other words, the community is evolving only one set of parameters (architectures, initialization strategies, hyperparameter search algorithms, etc.) while keeping the optimizer fixed to Adam most of the time.
Now, I am sure many people won’t buy into this hypothesis. I am sure they will list all sorts of specific problems in which Adam is not the best algorithm, in which Stochastic Gradient Descent with momentum is the best one, and so on and so forth. However, I would like to point out two things: 1) I am not describing a law of nature here, but simply a tendency of the community that might have influenced the co-evolution of some architectures and optimizers; 2) I actually have some evidence to support this claim 🙂
If my claims were true, we would expect Adam to be extremely good on deep neural networks and very poor on anything else. And this does happen! For example, Adam is known to perform very poorly on simple convex and non-convex problems that are not deep neural networks, see for example the experiments in (Vaswani et al., 2019).
It seems that the moment we move away from the specific setting of deep neural networks with their specific choice of initialization, specific scale of weights, specific loss function, etc., Adam loses its adaptivity and its magic default learning rate must be tuned again. Note that you can always write a linear predictor as a one-layer neural network, yet Adam does not work so well in this case either. So, all the particular choices of architectures in deep learning might have evolved to make Adam work better and better, while the simple problems above do not have any of these nice properties that allow Adam to shine.
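If you want to play with this yourself, here is a minimal sketch of such a test on a linear predictor written as a one-layer network. Everything in it is an assumption of mine for illustration: the synthetic data, the problem size, the number of steps, and the helper names. It is not the setup of (Vaswani et al., 2019). Gradient descent with step size 1/L, where L is the largest eigenvalue of the Hessian, serves as a baseline.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10                        # hypothetical problem size
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def forward(w, inputs):
    # A "one-layer network" without nonlinearity is just a linear map.
    return inputs @ w

def loss(w):
    return 0.5 * np.mean((forward(w, X) - y) ** 2)

def grad(w):
    return X.T @ (forward(w, X) - y) / n

def gd_history(steps):
    # Gradient descent with step 1/L, L = largest eigenvalue of X^T X / n.
    L = np.linalg.eigvalsh(X.T @ X / n).max()
    w = np.zeros(d)
    history = [loss(w)]
    for _ in range(steps):
        w = w - grad(w) / L
        history.append(loss(w))
    return history

def adam_history(steps, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Plain Adam with bias correction; defaults are the ones from the paper.
    w, m, v = np.zeros(d), np.zeros(d), np.zeros(d)
    history = [loss(w)]
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        history.append(loss(w))
    return history

gd_hist = gd_history(200)
adam_hist = adam_history(200)             # default learning rate
adam_tuned_hist = adam_history(200, lr=0.1)  # one hand-picked alternative
print("GD, step 1/L:     ", gd_hist[-1])
print("Adam, lr = 1e-3:  ", adam_hist[-1])
print("Adam, lr = 1e-1:  ", adam_tuned_hist[-1])
```

Comparing the three final losses shows how much the result depends on the learning rate once we leave the deep learning setting. Of course, a single synthetic problem proves nothing by itself; the point is only that the experiment is cheap to run.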
Overall, Adam might be the best optimizer because the deep learning community might be exploring only a small region in the joint search space of architectures/optimizers. If true, that would be ironic for a community that departed from convex methods because they focused only on a narrow region of the possible machine learning algorithms and it was like, as Yann LeCun wrote, “looking for your lost car keys under the street light knowing you lost them someplace else”.
EDIT: After the publication of this post, Sam Power pointed me to this tweet by Roger Grosse that seems to share a similar sentiment 🙂
If only there was a form of alternating least squares to optimize the architecture and optimizer in tandem… 🤔
One way to empirically explore this area: neural architecture And Optimizer search.
I.e. jointly search for both architectures and optimizers.
Thanks, that’s a pretty surprising observation indeed, and it sounds like a good idea to validate it.
But why do we think that these toy tasks are actually representative of the large-scale setups that are usually validated with those optimisers? Can’t we just miss some properties of deep learning tasks that make Adam more effective?
And a side note – does actually switching between optimisers make much less difference compared to switching between model architectures/training tasks and objectives? If yes, then what’s the problem here – we know that on average tuning optimizer gives minor differences, so we just stick to the default one similarly with how we do it with hyperparameters, don’t we?
The idea that “small” optimization problems have radically different characteristics than “big” ones is a common belief in the deep learning literature, but I don’t know of any real evidence for it. It is one thing to say that MNIST is not a good dataset because it is too easy (I agree), and another to say that a paper *needs* experiments on ImageNet to compare optimizers (I am very skeptical about it). “Big” and “small” must also be defined: are they related to the dimension of the problem or to the number of training samples? For example, perhaps counter-intuitively, increasing the number of training samples might increase the smoothness of the objective function, making the optimization easier.
Regarding the second note, based on what I wrote, I would argue that switching optimizers on the *current* architectures gives a small advantage, but there might exist a different architecture on which we can achieve, for example, an exponential rate of convergence (what is usually called “linear convergence”) with a quasi-Newton algorithm, and the community will probably never find it because it is too focused on SGD.
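To spell out the terminology, here is the standard definition in my own notation, where $f^\star$ is the minimum value, $x_t$ the iterate after $t$ steps, and $C$ a constant. The second line shows, for contrast, the typical rate of SGD on a convex Lipschitz function.

```latex
% Linear (i.e. exponential) convergence: the error shrinks geometrically.
f(x_t) - f^\star \le C \rho^{t}, \qquad 0 < \rho < 1
% Sublinear convergence, e.g. SGD on convex Lipschitz functions:
\mathbb{E}[f(x_t)] - f^\star \le \frac{C}{\sqrt{t}}
```

With a linear rate, reaching an error of $\epsilon$ takes on the order of $\log(1/\epsilon)$ iterations, instead of the $1/\epsilon^2$ needed with the rate in the second line.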
Didn’t catch the air plane picture connection to the topic 😡
I wonder if Adam happens to work best because NNs are just too tolerant to noise and hence a “bad optimizer” does a better job.
Take a look here for the airplane: https://en.wikipedia.org/wiki/Survivorship_bias
Thanks, got it 🙂
I can not understand why PSO is not included as an optimiser.