Almost Sure Convergence of SGD on Smooth Non-Convex Functions

In this post, I want to explain what we can show for Stochastic Gradient Descent (SGD) when used on non-convex smooth functions. The first results are known and very easy to obtain, the last one instead is a result by (Bertsekas and Tsitsiklis, 2000) that is not as known as it should be, maybe for their long proof. Some years ago, I found a way to distill their proof in a simple lemma that I present here.

1. SGD on Non-Convex Smooth Functions

We are interested in minimizing a smooth non-convex function {F: {\mathbb R}^d \rightarrow {\mathbb R}} using stochastic gradient descent with unbiased stochastic gradients. More in details, we assume to have access to an oracle that returns in any point {{\boldsymbol x}}, {g({\boldsymbol x}, \xi) \in {\mathbb R}^d}, where {\xi} is the realization of a mechanism for computing the stochastic gradient. For example, {\xi} could be the random index of a training sample we use to calculate the gradient of the training loss or just random noise that is added on top of our gradient computation. We will also assume that the variance of the stochastic gradient is bounded: {\mathop{\mathbb E}_{\xi}[\|\nabla F({\boldsymbol x}) - g({\boldsymbol x},\xi)\|^2_2]\leq \sigma^2 <\infty}, for all {{\boldsymbol x} \in \mathop{\mathrm{dom}} \nabla F({\boldsymbol x})}. Weaker assumptions on the variance are possible, but they don’t add much to the general message nor to the scheme of the proof.

Given that the function {F} is non-convex, we clearly cannot hope to converge to the minimum of {F}, so we need a less ambitious goal. We assumed that the function {F} is smooth. As you might remember from my previous posts, smooth functions are differentiable functions whose gradient is Lipschitz. Formally, we say that {F} is {L}-smooth when {\|\nabla F({\boldsymbol x}) - \nabla F({\boldsymbol y})\|_2 \leq L \|{\boldsymbol x} - {\boldsymbol y}\|_2}, for all {{\boldsymbol x}, {\boldsymbol y} \in \mathop{\mathrm{dom}} \nabla F}. This assumption assures us that when we approach a local minimum the gradient goes to zero. Hence, decreasing the norm of the gradient will be our objective function for SGD. Note that smoothness is necessary to study the norm of the gradients. In fact, consider the function {f(x)=|x|} whose derivative does not go to zero when we approach the minimum, on the contrary it is always different than 0 in any point different than the minimum.

Last thing we will assume is that the function {F} is bounded from below. Remember that the boundedness from below does not imply that the minimum of the function exists, e.g., {F(x)=\exp(-x)}.

Hence, I start from a point {{\boldsymbol x}_1 \in \mathop{\mathrm{dom}} \nabla F} and the SGD update is

\displaystyle {\boldsymbol x}_{t+1} = {\boldsymbol x}_t - \eta_t g({\boldsymbol x}_t, \xi_t),

where {\eta_t>0} are deterministic learning rates or stepsizes.

First, let’s see practically how SGD and Gradient Descent (GD) behave w.r.t. gradient descent on the same problem.

Figure 1. GD vs SGD to minimize {F(x)=\sin(x)}. In SGD the derivative is corrupted by additive Gaussian noise.

In Figure 1, we are minimizing {F(x)=\sin(x)}, where the stochastic gradient in SGD is given by the gradient of the function corrupted by Gaussian noise with zero mean and standard deviation 1. On the other hand, there is no noise for GD. In both cases, we use {\eta_t=\frac{1}{\sqrt{t}}} and we plot the absolute value of the derivative. We can see that GD will monotonically minimize the gradient till numerical precision as expected, converging to one of the local minima. Note that with a constant learning rate GD on this problem would converge even faster. Instead, SGD will jump back and forth resulting in only some iterates having small gradient. So, our basic question is the following:

Will {\|\nabla F({\boldsymbol x}_T)\|_2} converge to zero with probability 1 in SGD when {T} goes to infinity?

This is more difficult to answer than what you might think. However, this is a basic question to know if it actually makes sense to run SGD for a bunch of iterations and return the last iterate, that is how 99% of the people use SGD on a non-convex problem.

To warm up, let’s first see what we can prove in a finite-time setting.

As all other similar analysis, we need to construct a potential (Lyapunov) function that allows us to analyze it. In the convex case, we would study {\|{\boldsymbol x}_t - {\boldsymbol x}^\star\|^2_2}, where {{\boldsymbol x}^\star \in \mathop{\mathrm{argmin}}_{{\boldsymbol x}} \ F(x)}. Here, this potential does not even make sense because we are not even trying to converge to {{\boldsymbol x}^\star}. It turns out that a better choice is to study {F({\boldsymbol x}_t)}. We will make use of the following property of {L}-smooth functions:

\displaystyle F({\boldsymbol y}) \leq F({\boldsymbol x}) + \langle \nabla F({\boldsymbol x}), {\boldsymbol y}-{\boldsymbol x}\rangle + \frac{L}{2} \|{\boldsymbol x}-{\boldsymbol y}\|_2^2, \ \forall {\boldsymbol x} \in \mathop{\mathrm{dom}} \nabla F, {\boldsymbol y} \in \mathop{\mathrm{dom}} F~.

In words, this means that a smooth function is always upper bounded by a quadratic function. Note that this property does not require convexity, so we can safely use it. Thanks to this property, let’s see how our potential evolves over time during the optimization of SGD.

\displaystyle \begin{aligned} F({\boldsymbol x}_{t+1}) &\leq F({\boldsymbol x}_t) + \langle \nabla F({\boldsymbol x}_t), {\boldsymbol x}_{t+1} -{\boldsymbol x}_t \rangle + \frac{L}{2} \|{\boldsymbol x}_{t+1}-{\boldsymbol x}_t\|^2_2 \\ &= F({\boldsymbol x}_t) - \eta_t \langle \nabla F({\boldsymbol x}_t), g({\boldsymbol x}_t,\xi_t) \rangle + \frac{\eta_t^2 L}{2} \| g({\boldsymbol x}_t, \xi_t)\|^2_2~. \end{aligned}

Now, let’s denote by {E_t} the expectation w.r.t. {\xi_t} given {{\boldsymbol x}_t}, so we have

\displaystyle \begin{aligned} \mathop{\mathbb E}_t[F({\boldsymbol x}_{t+1})] &\leq F({\boldsymbol x}_t) - \eta_t \|\nabla F({\boldsymbol x}_t)\|^2_2 + \frac{\eta_t^2 L}{2} \mathop{\mathbb E}_t \| g({\boldsymbol x}_t, \xi_t)\|^2_2 \\ &= F({\boldsymbol x}_t) - \eta_t \|\nabla F({\boldsymbol x}_t)\|^2_2 + \frac{\eta_t^2 L}{2} \mathop{\mathbb E}_t \|\nabla F({\boldsymbol x}_t) - g({\boldsymbol x}_t, \xi_t) - \nabla F({\boldsymbol x}_t)\|^2_2 \\ &= F({\boldsymbol x}_t) - \eta_t \|\nabla F({\boldsymbol x}_t)\|^2_2 + \frac{\eta_t^2 L}{2} \left(\mathop{\mathbb E}_t \|\nabla F({\boldsymbol x}_t) - g({\boldsymbol x}_t, \xi_t)\|^2_2 + \|\nabla F({\boldsymbol x}_t)\|^2_2\right) \\ &\leq F({\boldsymbol x}_t) - \left(\eta_t - \frac{\eta_t^2 L}{2}\right)\|\nabla F({\boldsymbol x}_t)\|^2_2 + \frac{\eta_t^2 L}{2} \sigma^2, \end{aligned}

where in the inequality we have used the fact that the variance of the stochastic gradient is bounded by {\sigma^2}. Taking the total expectation and reordering the terms, we have

\displaystyle \begin{aligned} \sum_{t=1}^T \left(\eta_t - \frac{\eta_t^2 L}{2}\right) \mathop{\mathbb E}[\|\nabla F({\boldsymbol x}_t)\|^2_2] &\leq \sum_{t=1}^T \left(\mathop{\mathbb E}[F({\boldsymbol x}_t)] - \mathop{\mathbb E}[F({\boldsymbol x}_{t+1})]\right) + \frac{\sigma^2 L}{2} \sum_{t=1}^T \eta_t^2 \\ &= F({\boldsymbol x}_1) - \mathop{\mathbb E}[F({\boldsymbol x}_{T+1})] + \frac{\sigma^2 L}{2} \sum_{t=1}^T \eta_t^2 \\ &\leq F({\boldsymbol x}_1) - F^\star + \frac{\sigma^2 L}{2} \sum_{t=1}^T \eta_t^2~. & (1) \\\end{aligned}

Let’s see how useful this inequality is: consider a constant step size {\eta_t=\min\left(\frac{1}{L},\frac{\alpha}{\sigma \sqrt{T}}\right)}, where {\alpha} is the usual critical parameter of the learning rate (that you’ll never be able to tune properly unless you know things that you clearly don’t know…). With this choice, we have {\eta_t - \frac{\eta_t^2 L}{2}\geq \eta_t - \frac{\eta_t}{2} = \frac12 \eta_t}. So, we have

\displaystyle \begin{aligned} \frac{1}{T}\sum_{t=1}^T \mathop{\mathbb E}[\|\nabla F({\boldsymbol x}_t)\|^2_2] &\leq \frac{2(F({\boldsymbol x}_1) - F^\star)}{\eta_1 T} + \sigma^2 L \eta_1 \\ &= \frac{2(F({\boldsymbol x}_1) - F^\star)}{T}\max\left(L, \frac{\sigma \sqrt{T}}{\alpha}\right) + \sigma L \alpha \frac{1}{\sqrt{T}} \\ &\leq \frac{2(F({\boldsymbol x}_1) - F^\star)}{T}\left(L+\frac{\sigma \sqrt{T}}{\alpha}\right) + \sigma L \alpha \frac{1}{\sqrt{T}} \\ &= \frac{2 L (F({\boldsymbol x}_1) - F^\star)}{T} + \left(\frac{2 (F({\boldsymbol x}_1) - F^\star)}{ \alpha} + L \alpha\right) \frac{\sigma}{\sqrt{T}}~. \end{aligned}

What we got is almost a convergence result: it says that the average of the norm of the gradients is going to zero as {O(1/T + \sigma\sqrt{T})}. Given that the average of a set of numbers is bigger or equal to its minimum, this means that there exists at least one {{\boldsymbol x}_t} in my set of iterates {{\boldsymbol x}_1, \dots, {\boldsymbol x}_T} that has a small expected gradient. This is interesting but slightly disappointing. We were supposed to prove that the gradient converges to zero, but instead we only proved that at least one of the iterates has indeed small expected norm, but we don’t know which one. Also, trying to find the right iterate might be annoying because we only have access to stochastic gradients.

It is also interesting to see that the convergence rate has two terms: a fast rate {1/T} and a slow rate {\sigma/\sqrt{T}}. This means that we can expect the algorithm to make fast progress at the beginning of the optimization and then slowly converge once the number of iterations becomes big enough compared to the variance of the stochastic gradients. In case the noise on the gradients is zero, SGD becomes simply gradient descent and it will converge at a {O(1/T)} rate. In the noiseless case, we can also show that the last iterate is the one with the smallest gradient. However, note that the learning rate has {\sigma} in it, so effectively we can achieve a faster convergence in the noiseless case because we would be using a constant and independent from {T} stepsize.

2. The Magic Trick: Randomly Stopped SGD

The above reasoning is interesting but it is not a solution to our question: does the last iterate of SGD converge? Yes or no?

There is a possible work-around that looks like a magic trick. Let’s take one iterate of SGD uniformly at random among {{\boldsymbol x}_1, \dots, {\boldsymbol x}_{T}} and call it {{\boldsymbol x}_I}. Taking the expectation with respect to this randomization and the noise in the stochastic gradients we have that

\displaystyle \mathop{\mathbb E}[\|\nabla F({\boldsymbol x}_I)\|^2_2] = \frac{1}{T} \sum_{t=1}^T \mathop{\mathbb E}[\|\nabla F({\boldsymbol x}_t)\|^2_2] \leq \frac{2 L (F({\boldsymbol x}_1) - F^\star)}{T} + \left(\frac{2 (F({\boldsymbol x}_1) - F^\star)}{ \alpha} + L \alpha\right) \frac{\sigma}{\sqrt{T}}~.

Basically, it says that if run SGD for {T} iterations, then we stop and return not the last iterate but one of the {T} iterates at random, in expectation with respect to everything the norm will be small! Note that this is equivalent to run SGD with a random stopping time. In other words, given that we didn’t know how to prove if SGD converges, we changed the algorithm adding a random stopping time and now the random iterate on which we stop in expectation will have the desired convergence rate.

This is a very important result and also a standard one in these days. It should be intuitive why the randomization helps: From Figure 1 it is clear that we might be unlucky in the last iteration of SGD, however randomizing in expectation we smooth out the noise and get a decreasing gradient. However, we just changed the target because we still didn’t prove if the last iterate converges. So, we need an alternative way.

3. The Disappointing Lim Inf

Let’s consider again (1). This time let’s select any time-varying positive stepsizes that satisfy

\displaystyle \label{eq:cond_eta} \sum_{t=1}^\infty \eta_t = \infty \quad \text{ and } \quad \sum_{t=1}^\infty \eta_t^2 < \infty~. \ \ \ \ \ (2)

These two conditions are classic in the study of stochastic approximation. The first condition is needed to be able to travel arbitrarily far from the initial point, while the second one is needed to keep the variance of the noise under control. The classic learning rate of {\eta_t=\frac{\alpha}{\sqrt{t}}} does not satisfy these assumptions, but something decaying a little bit faster as {\frac{\alpha}{\sqrt{t} \ln(t+1)}} will do.

With such a choice, we get

\displaystyle \sum_{t=1}^\infty \left(\eta_t - \frac{\eta_t^2 L}{2}\right) \mathop{\mathbb E}[\|\nabla F({\boldsymbol x}_t)\|^2_2] \leq F({\boldsymbol x}_1) - F^\star + \frac{\sigma^2 L}{2} \sum_{t=1}^\infty \eta_t^2 < \infty,

where we have used the second condition in the inequality. Now, the condition {\sum_{t=1}^T \eta_t^2 <\infty} implies that {\eta_t} converges to 0. So, there exists {T_L} such that {\eta_t - \eta_t - \frac{\eta_t^2 L}{2} \geq \frac{\eta_t}{2}} for all {t\geq T_L}. So, we get that

\displaystyle \sum_{t=T_L}^\infty \eta_t \mathop{\mathbb E}[\|\nabla F({\boldsymbol x}_t)\|^2_2] < \infty~.

This implies that {\sum_{t=T_L}^\infty \eta_t \|\nabla F({\boldsymbol x}_t)\|^2_2 < \infty} with probability 1. We are almost done: From this last inequality and the condition that {\sum_{t=1}^\infty \eta_t = \infty}, we can derive the fact that {\lim\inf_{t\rightarrow\infty} \ \|\nabla F({\boldsymbol x}_t)\|_2 = 0}.

Wait, what? What is this {\lim\inf}??? Unfortunately, it seems that we proved something weaker than we wanted to. In words, the lim inf result says that there exists a subsequence of {{\boldsymbol x}_t} that has a gradient converging to zero.

This is very disappointing and we might be tempted to believe that this is the best that we can do. Fortunately, this is not the case. In fact, in a seminal paper (Bertsekas and Tsitsiklis, 2000) proved the convergence of the gradients of SGD to zero with probability 1 under very weak assumptions. Their proof is very convoluted also due to the assumptions they used, but in the following I’ll show a much simpler proof.

4. The Asymptotic Proof in Few Lines

In 2018, I found a way to get the same result of (Bertsekas and Tsitsiklis, 2000) distilling their long proof in the following Lemma, whose proof is in the Appendix. It turns out that this Lemma is essentially all what we need.

Lemma 1. Let {(b_t)_{t \geq 1}, (\eta_t)_{t \geq 1}} be two non-negative sequences and {(a_t)_{t \geq 1}} a sequence of vectors in a vector space {X}. Let {p\geq1} and assume {\sum_{t=1}^{\infty} \eta_t b_t^p<\infty} and {\sum_{t=1}^{\infty} \eta_t=\infty}. Assume also that there exists {L \geq 0} such that {|b_{t+\tau}-b_t| \leq L (\sum_{i=t}^{t+\tau-1} \eta_i b_i + \|\sum_{i=t}^{t+\tau-1} \eta_i a_i \|)}, where {a_t} is such that {\|\sum_{i=1}^\infty \eta_t a_t\| <\infty}. Then, {b_t} converges to 0.

We are now finally ready to prove the asymptotic convergence with probability 1.

Theorem 2. Assume that we use SGD on a {L}-smooth function, with stepsizes {\eta_t>0} that satisfies the conditions (2). Then, {\|\nabla f({\boldsymbol x}_t)\|} goes to zero with probability 1.

Proof: We want to use Lemma 1 on {b_t=\|\nabla F({\boldsymbol x}_t)\|}. So, first observe that by the {L}-smoothness of {F}, we have

\displaystyle \begin{aligned} \left| \|\nabla F({\boldsymbol x}_{t+\tau})\| - \|\nabla F({\boldsymbol x}_t)\|\right| &\leq \left\| \nabla F({\boldsymbol x}_{t+\tau}) - \nabla F({\boldsymbol x}_t)\right\| \\ &\leq L \|{\boldsymbol x}_{t+\tau}-{\boldsymbol x}_t\| \\ &= L \left\|\sum_{i=t}^{t+\tau-1} \eta_i g({\boldsymbol x}_i,\xi_i)\right\| \\ &= L \left\|\sum_{i=t}^{t+\tau-1} \eta_i (\nabla F({\boldsymbol x}_i) + g({\boldsymbol x}_i,\xi_i)-\nabla F({\boldsymbol x}_i))\right\| \\ &\leq L \sum_{i=t}^{t+\tau-1} \eta_i \|\nabla F({\boldsymbol x}_i)\| + L \left\|\sum_{i=t}^{t+\tau-1} \eta_i (g({\boldsymbol x}_i,\xi_i)-\nabla F({\boldsymbol x}_i))\right\|~. \end{aligned}

The assumptions and the reasoning above imply that, with probability 1, {\sum_{t=1}^\infty \eta_t \|\nabla F({\boldsymbol x}_t)\|^2 < \infty}. This also suggest to set {a_t=g({\boldsymbol x}_t,\xi_t)-\nabla F({\boldsymbol x}_t)}. Also, we have, with probability 1, {\|\sum_{t=1}^\infty \eta_t a_t\| < \infty}, because {\sum_{t=1}^T \eta_t a_t} for {T=1,2, \dots} is a martingale whose variance is bounded by {\sigma^2 \sum_{t=1}^\infty \eta_t^2<\infty}. Hence, {\sum_{t=1}^T \eta_t a_t} for {T=1,2, \dots} is a martingale in {L^2}, so it converges in {L^2} with probability 1.

Overall, with probability 1 the assumptions of Lemma 1 are verified with {p=2}. \Box

We did it! Finally, we proved that the gradients of SGD do indeed converge to zero with probability 1. This means that with probability 1 for any {\epsilon>0} there exists {N_\epsilon} such that {\|\nabla F({\boldsymbol x}_t)\|\leq \epsilon} for {t\geq N_\epsilon}.

Even if I didn’t actually use any intuition in crafting the above proof (I rarely use “intuition” to prove things), Yann Ollivier provided the following intuition for this proof: the proof is implicitly studying how far apart GD and SGD are. However, instead of estimating the distance between the two processes over a single update, it does it over large period of time through the term {\|\sum_{t=1}^m \eta_t (\nabla F({\boldsymbol x}_t) - g({\boldsymbol x}_t,\xi_t))\|} that can be controlled thanks to the choice of the learning rates.

5. History Bits

The idea of taking one iterate at random in SGD was proposed in (Ghadimi and Lan, 2013) and it reminds me the well-known online-to-batch conversion through randomization. The conditions on the learning rates in (2) go back to (Robbins and Monro, 1951). (Bertsekas and Tsitsiklis, 2000) contains a good review of previous work on asymptotic convergence of SGD, while a recent paper on this topic is (Patel, V., 2020).

I derived Lemma 1 as an extension of Proposition 2 in (Alber et al., 1998)/Lemma A.5 in (Mairal, 2013). Studying the proof of (Bertsekas and Tsitsiklis, 2000), I realized that I could change (Alber et al., 1998, Proposition 2) into what I needed. I had this proof sitting in my unpublished notes for 2 years, so I decided to write a blog post on it.

My actual small contribution to this line of research is a lim inf convergence for SGD with AdaGrad stepsizes (Li and Orabona, 2019), but under stronger assumptions on the noise.

Note that the 20-30 years ago there were many papers studying the asymptotic convergence of SGD and its variants in various settings. Then, the taste of the community changed moving from asymptotic convergence to finite-time rates. As it often happens when a new trend takes over the previous one, new generations tend to be oblivious to the old results and proof techniques. The common motivation to ignore these past results is that the finite-time analysis is superior to the asymptotic one, but this is clearly false (ask a statistician!). It should be instead clear to anyone that both analyses have pros and cons.

6. Acknowledgements

I thank Léon Bottou for telling me of the problem of analyzing the asymptotic convergence of SGD in the non-convex case with a simple and general proof in 2018. Léon also helped me checking my proofs and finding an error in a previous version. Also, I thank Yann Ollivier for reading my proof and kindly providing an alternative proof and the intuition that I report above.

7. Appendix

Proof of Lemma : Since the series {\sum_{t=1}^{\infty} \eta_t} diverges, given that {\sum_{t=1}^{\infty} \eta_t b_t^p} converges, we necessarily have {\liminf_{t \rightarrow \infty}b_t = 0}. Hence, we have to prove that {\lim \sup_{t\rightarrow\infty} b_t = 0}.

Let us proceed by contradiction and assume that {\lim \sup_{t\rightarrow\infty} b_t = \lambda>0}. First, assume that {\lambda <\infty}.

Given the values of the {\lim\inf} and {\lim\sup}, we can then build two sequences of indices {(m_j)_{j\geq1}} and {(n_j)_{j\geq1}} such that

  • {m_j < n_j <m_{j+1}},
  • {\frac{\lambda}{3} < b_k}, for {m_j \leq k < n_j},
  • {b_k\leq \frac{\lambda}{3}}, for {n_j \leq k < m_{j+1}}.

Define {\epsilon=\frac{\lambda}{6 L}\min\left(\frac{\lambda^{p-1}}{3^{p-1}},1\right)}. The convergence of the series implies that the sequence of partial sums are Cauchy sequences. Hence, there exists {\tilde{j}} large enough such for all {N \geq m_{\tilde{j}}} we have {\sum_{t=m_{\tilde{j}}}^N \eta_t b_t^p} and {\left\|\sum_{t= m_{\tilde{j}}}^N \eta_t a_t \right\|} are less or equal to {\epsilon}. Then, we have for all {j\geq \tilde{j}} and all {m} with {m_j\leq m\leq n_j -1},

\displaystyle \begin{aligned} |b_{n_j}-b_m| &\leq L \left(\sum_{i=m}^{n_j-1} \eta_i b_i + \left\|\sum_{i=m}^{n_j-1} \eta_i a_i \right\|\right) \\ &= \frac{3^{p-1} L}{\lambda^{p-1}} \left(\sum_{i=m}^{n_j-1} \eta_i b_i \frac{\lambda^{p-1}}{3^{p-1}} + \left\|\sum_{i=m}^{n_j-1} \eta_i a_i \right\|\right) \\ &\leq \frac{3^{p-1} L}{\lambda^{p-1}} \left(\sum_{i=m}^{n_j-1} \eta_i b_i^p + \left\|\sum_{i=m}^{n_j-1} \eta_i a_i \right\|\right) \leq \frac{\lambda}{3}~. \end{aligned}

Therefore, using the triangle inequality, { b_m \leq b_{n_j} + \frac{\lambda}{3} \leq \frac{2\lambda}{3}~. } And finally for all {m \geq \tilde{j}}, { b_m \leq \frac{2\lambda}{3}, } which contradicts {\lim \sup_{t\rightarrow\infty} b_t = \lambda>0}. Therefore, {b_t} goes to zero.

To rule out the case that {\lim \sup_{t\rightarrow\infty} b_t = \infty}, proceed in the same way, choosing any {\lambda>0}. Hence, we get that for {m \geq \tilde{j}}, { b_m \leq \frac{2\lambda}{3}, } that contradicts {\lim \sup_{t\rightarrow\infty} b_t = \infty}. \Box

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s