In this post, I want to explain what we can show for Stochastic Gradient Descent (SGD) when used on non-convex smooth functions. The first results are known and very easy to obtain, the last one instead is a result by (Bertsekas and Tsitsiklis, 2000) that is not as known as it should be, maybe for their long proof. Some years ago, I found a way to distill their proof in a simple lemma that I present here.
1. SGD on Non-Convex Smooth Functions
We are interested in minimizing a smooth non-convex function using stochastic gradient descent with unbiased stochastic gradients. More in details, we assume to have access to an oracle that returns in any point , , where is the realization of a mechanism for computing the stochastic gradient. For example, could be the random index of a training sample we use to calculate the gradient of the training loss or just random noise that is added on top of our gradient computation. We will also assume that the variance of the stochastic gradient is bounded: , for all . Weaker assumptions on the variance are possible, but they don’t add much to the general message nor to the scheme of the proof.
Given that the function is non-convex, we clearly cannot hope to converge to the minimum of , so we need a less ambitious goal. We assumed that the function is smooth. As you might remember from my previous posts, smooth functions are differentiable functions whose gradient is Lipschitz. Formally, we say that is -smooth when , for all . This assumption assures us that when we approach a local minimum the gradient goes to zero. Hence, decreasing the norm of the gradient will be our objective function for SGD. Note that smoothness is necessary to study the norm of the gradients. In fact, consider the function whose derivative does not go to zero when we approach the minimum, on the contrary it is always different than 0 in any point different than the minimum.
Last thing we will assume is that the function is bounded from below. Remember that the boundedness from below does not imply that the minimum of the function exists, e.g., .
Hence, I start from a point and the SGD update is
where are deterministic learning rates or stepsizes.
First, let’s see practically how SGD behaves w.r.t. Gradient Descent (GD) on the same problem.
In Figure 1, we are minimizing , where the stochastic gradient in SGD is given by the gradient of the function corrupted by Gaussian noise with zero mean and standard deviation 1. On the other hand, there is no noise for GD. In both cases, we use and we plot the absolute value of the derivative. We can see that GD will monotonically minimize the gradient till numerical precision as expected, converging to one of the local minima. Note that with a constant learning rate GD on this problem would converge even faster. Instead, SGD will jump back and forth resulting in only some iterates having small gradient. So, our basic question is the following:
Will converge to zero with probability 1 in SGD when goes to infinity?
This is more difficult to answer than what you might think. However, this is a basic question to know if it actually makes sense to run SGD for a bunch of iterations and return the last iterate, that is how 99% of the people use SGD on a non-convex problem.
To warm up, let’s first see what we can prove in a finite-time setting.
As all other similar analysis, we need to construct a potential (Lyapunov) function that allows us to analyze it. In the convex case, we would study , where . Here, this potential does not even make sense because we are not even trying to converge to . It turns out that a better choice is to study . We will make use of the following property of -smooth functions:
In words, this means that a smooth function is always upper bounded by a quadratic function. Note that this property does not require convexity, so we can safely use it. Thanks to this property, let’s see how our potential evolves over time during the optimization of SGD.
Now, let’s denote by the expectation w.r.t. given , so we have
where in the inequality we have used the fact that the variance of the stochastic gradient is bounded by . Taking the total expectation and reordering the terms, we have
Let’s see how useful this inequality is: consider a constant step size , where is the usual critical parameter of the learning rate (that you’ll never be able to tune properly unless you know things that you clearly don’t know…). With this choice, we have . So, we have
What we got is almost a convergence result: it says that the average of the norm of the gradients is going to zero as . Given that the average of a set of numbers is bigger or equal to its minimum, this means that there exists at least one in my set of iterates that has a small expected gradient. This is interesting but slightly disappointing. We were supposed to prove that the gradient converges to zero, but instead we only proved that at least one of the iterates has indeed small expected norm, but we don’t know which one. Also, trying to find the right iterate might be annoying because we only have access to stochastic gradients.
It is also interesting to see that the convergence rate has two terms: a fast rate and a slow rate . This means that we can expect the algorithm to make fast progress at the beginning of the optimization and then slowly converge once the number of iterations becomes big enough compared to the variance of the stochastic gradients. In case the noise on the gradients is zero, SGD becomes simply gradient descent and it will converge at a rate. In the noiseless case, we can also show that the last iterate is the one with the smallest gradient. However, note that the learning rate has in it, so effectively we can achieve a faster convergence in the noiseless case because we would be using a constant and independent from stepsize.
2. The Magic Trick: Randomly Stopped SGD
The above reasoning is interesting but it is not a solution to our question: does the last iterate of SGD converge? Yes or no?
There is a possible work-around that looks like a magic trick. Let’s take one iterate of SGD uniformly at random among and call it . Taking the expectation with respect to this randomization and the noise in the stochastic gradients we have that
Basically, it says that if run SGD for iterations, then we stop and return not the last iterate but one of the iterates at random, in expectation with respect to everything the norm will be small! Note that this is equivalent to run SGD with a random stopping time. In other words, given that we didn’t know how to prove if SGD converges, we changed the algorithm adding a random stopping time and now the random iterate on which we stop in expectation will have the desired convergence rate.
This is a very important result and also a standard one in these days. It should be intuitive why the randomization helps: From Figure 1 it is clear that we might be unlucky in the last iteration of SGD, however randomizing in expectation we smooth out the noise and get a decreasing gradient. However, we just changed the target because we still didn’t prove if the last iterate converges. So, we need an alternative way.
3. The Disappointing Lim Inf
Let’s consider again (1). This time let’s select any time-varying positive stepsizes that satisfy
These two conditions are classic in the study of stochastic approximation. The first condition is needed to be able to travel arbitrarily far from the initial point, while the second one is needed to keep the variance of the noise under control. The classic learning rate of does not satisfy these assumptions, but something decaying a little bit faster as will do.
With such a choice, we get
where we have used the second condition in the inequality. Now, the condition implies that converges to 0. So, there exists such that for all . So, we get that
This implies that with probability 1. We are almost done: From this last inequality and the condition that , we can derive the fact that .
Wait, what? What is this ??? Unfortunately, it seems that we proved something weaker than we wanted to. In words, the lim inf result says that there exists a subsequence of that has a gradient converging to zero.
This is very disappointing and we might be tempted to believe that this is the best that we can do. Fortunately, this is not the case. In fact, in a seminal paper (Bertsekas and Tsitsiklis, 2000) proved the convergence of the gradients of SGD to zero with probability 1 under very weak assumptions. Their proof is very convoluted also due to the assumptions they used, but in the following I’ll show a much simpler proof.
4. The Asymptotic Proof in Few Lines
In 2018, I found a way to get the same result of (Bertsekas and Tsitsiklis, 2000) distilling their long proof in the following Lemma, whose proof is in the Appendix. It turns out that this Lemma is essentially all what we need.
Lemma 1. Let be two non-negative sequences and a sequence of vectors in a vector space . Let and assume and . Assume also that there exists such that , where is such that . Then, converges to 0.
We are now finally ready to prove the asymptotic convergence with probability 1.
Theorem 2. Assume that we use SGD on a -smooth function, with stepsizes that satisfies the conditions (2). Then, goes to zero with probability 1.
Proof: We want to use Lemma 1 on . So, first observe that by the -smoothness of , we have
The assumptions and the reasoning above imply that, with probability 1, . This also suggest to set . Also, we have, with probability 1, , because for is a martingale whose variance is bounded by . Hence, for is a martingale in , so it converges in with probability 1.
Overall, with probability 1 the assumptions of Lemma 1 are verified with .
We did it! Finally, we proved that the gradients of SGD do indeed converge to zero with probability 1. This means that with probability 1 for any there exists such that for .
Even if I didn’t actually use any intuition in crafting the above proof (I rarely use “intuition” to prove things), Yann Ollivier provided the following intuition for this proof: the proof is implicitly studying how far apart GD and SGD are. However, instead of estimating the distance between the two processes over a single update, it does it over large period of time through the term that can be controlled thanks to the choice of the learning rates.
5. History Bits
The idea of taking one iterate at random in SGD was proposed in (Ghadimi and Lan, 2013) and it reminds me the well-known online-to-batch conversion through randomization. The conditions on the learning rates in (2) go back to (Robbins and Monro, 1951). (Bertsekas and Tsitsiklis, 2000) contains a good review of previous work on asymptotic convergence of SGD, while a recent paper on this topic is (Patel, V., 2020).
I derived Lemma 1 as an extension of Proposition 2 in (Alber et al., 1998)/Lemma A.5 in (Mairal, 2013). Studying the proof of (Bertsekas and Tsitsiklis, 2000), I realized that I could change (Alber et al., 1998, Proposition 2) into what I needed. I had this proof sitting in my unpublished notes for 2 years, so I decided to write a blog post on it.
My actual small contribution to this line of research is a lim inf convergence for SGD with AdaGrad stepsizes (Li and Orabona, 2019), but under stronger assumptions on the noise.
Note that the 20-30 years ago there were many papers studying the asymptotic convergence of SGD and its variants in various settings. Then, the taste of the community changed moving from asymptotic convergence to finite-time rates. As it often happens when a new trend takes over the previous one, new generations tend to be oblivious to the old results and proof techniques. The common motivation to ignore these past results is that the finite-time analysis is superior to the asymptotic one, but this is clearly false (ask a statistician!). It should be instead clear to anyone that both analyses have pros and cons.
I thank Léon Bottou for telling me of the problem of analyzing the asymptotic convergence of SGD in the non-convex case with a simple and general proof in 2018. Léon also helped me checking my proofs and finding an error in a previous version. Also, I thank Yann Ollivier for reading my proof and kindly providing an alternative proof and the intuition that I report above.
Proof of Lemma : Since the series diverges, given that converges, we necessarily have . Hence, we have to prove that .
Let us proceed by contradiction and assume that . First, assume that .
Given the values of the and , we can then build two sequences of indices and such that
- , for ,
- , for .
Define . The convergence of the series implies that the sequence of partial sums are Cauchy sequences. Hence, there exists large enough such for all we have and are less or equal to . Then, we have for all and all with ,
Therefore, using the triangle inequality, And finally for all , which contradicts . Therefore, goes to zero.
To rule out the case that , proceed in the same way, choosing any . Hence, we get that for , that contradicts .
I want to cite your work, especially Lemma 1 and Theorem 2. Do you have some published paper about Theorem 2 so that I can include it as my reference?
Unfortunately, I didn’t publish it anywhere. I guess you can cite the blog post, I know of other people that are citing it 🙂
Hi, can you please give any reference to the equation “eta_t = min(1/L, \alpha/(\sigma*\sqrt(T))”? How to arrive at this equation?
In equation (1), first of all we need the left hand side to be big and positive. In particular, it is maximized for eta_t = 1/L and it is positive for any 0<= eta_t < =1/L, that gives the first condition in the min. With this condition the term eta_t-eta_t^2 L/2 is lower bounded by eta_t/2. So, now divide both terms by eta_t and on the right hand side we have (F(x_1)-F^*)/eta+sigma^2 L /2 T eta_t (because eta_t is constant). We want the right hand side to be as small as possible, so the best eta that minimizes this expression is eta_t=sqrt[(F(x_1)-F^*)/(sigma L T)]. Given that you don't know F(x_1)-F^*, you use some hyperparameter alpha that includes L too. This gives the second condition in the min. FYI, I didn't invent it, it is a standard reasoning for this kind of proofs.