In this blog (and in our ICML 2020 tutorial) we have often discussed the issue of how to tune learning rates in Stochastic (sub)Gradient Descent (SGD). This issue is delicate and sometimes a student might not realize its complexity due to the use of the simplifying assumption of bounded domains.
Don’t get me wrong: assuming bounded domains is perfectly fine and justified most of the time. However, sometimes it is unnecessary and it might also obscure critical issues in the analysis, as in this case. So, to balance the universe of first-order methods, I decided to show how to easily prove the convergence of the iterates in SGD, even in unbounded domains.
Technically speaking, the following result might be new, but definitely not worth a fight with Reviewer 2 to publish it somewhere.
1. Setting
First, let’s define our setting. We want to solve the following optimization problem

$$\min_{x \in V} \ f(x),$$

where $V \subseteq \mathbb{R}^d$ is a non-empty closed convex set and $f: V \rightarrow \mathbb{R}$ is a convex function. Now, various assumptions are possible on $f$, and choosing the right one depends on your particular problem; there are no right answers. Here, we will not make any strong assumption on $f$. Also, we will not assume $V$ to be bounded. Indeed, in most of the modern applications in Machine Learning, $V$ is simply the entire space $\mathbb{R}^d$. We will also assume that $\mathop{\mathrm{argmin}}_{x \in V} f(x)$ is not empty and that $x^\star$ is any element in it.
We also assume to have access to a first-order stochastic oracle that returns stochastic subgradients of $f$ at any point $x \in V$. In formulas, we get a vector $g(x, \xi)$ such that $\mathbb{E}_{\xi}[g(x, \xi)] \in \partial f(x)$. Practically speaking, every time you calculate the (sub)gradient on a minibatch of training data, that is a stochastic (sub)gradient, and roughly speaking the random minibatch is the random variable $\xi$.
Here, for didactic reasons, we will assume that $\mathbb{E}_{\xi}[\|g(x, \xi)\|_2^2]$ is bounded by 1; similar results can also be shown with more realistic assumptions. This holds, for example, if $f$ is an average of 1-Lipschitz functions and you draw some of them to calculate the stochastic subgradient.
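To make the oracle concrete, here is a minimal Python sketch, my own illustration rather than anything from the post: $f$ is an average of 1-Lipschitz absolute-loss terms, and a stochastic subgradient is computed on a random minibatch, which plays the role of $\xi$. The names `A`, `b`, and `batch_size` are hypothetical.

```python
import numpy as np

def stochastic_subgradient(x, A, b, batch_size, rng):
    """Stochastic subgradient of f(x) = (1/n) * sum_i |<a_i, x> - b_i|.

    If every row satisfies ||a_i||_2 <= 1, each term is 1-Lipschitz and the
    returned vector has norm at most 1, matching the assumption above.
    """
    idx = rng.choice(len(b), size=batch_size, replace=False)  # the random minibatch, i.e., xi
    r = A[idx] @ x - b[idx]
    # A subgradient of |r| is sign(r) (any value in [-1, 1] works at r = 0).
    return A[idx].T @ np.sign(r) / batch_size
```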
The algorithm we want to focus on is SGD. So, what is SGD? SGD is an incredibly simple optimization algorithm, almost primitive. Indeed, part of its fame depends critically on its simplicity. Basically, you start from a certain $x_1 \in V$ and you update your solution iteratively, moving in the direction of the negative stochastic subgradients multiplied by a learning rate $\eta_t > 0$. We also use a projection onto $V$. Of course, if $V = \mathbb{R}^d$ no projection is needed. So, the update of SGD is

$$x_{t+1} = \Pi_V\left(x_t - \eta_t g_t\right),$$

where $g_t = g(x_t, \xi_t)$ and $\Pi_V$ is the Euclidean projection onto $V$. Remember that when you use subgradients, SGD is not a descent algorithm: I already blogged about the fact that the common intuition of moving towards a descent direction is wrong when you use subgradients.
2. Convergence of the Average of the Iterates
Now, the most common analysis of SGD can be done in two different ways: constant learning rates and non-increasing learning rates. We already saw both of them in my lecture notes on online learning, so let’s summarize here the one-step inequality for SGD we need:

$$\eta_t\left(\mathbb{E}[f(x_t)] - \mathbb{E}[f(u)]\right) \le \frac{1}{2}\mathbb{E}\left[\|x_t - u\|_2^2\right] - \frac{1}{2}\mathbb{E}\left[\|x_{t+1} - u\|_2^2\right] + \frac{\eta_t^2}{2} \qquad (1)$$

for all $u$ measurable w.r.t. $\xi_1, \ldots, \xi_{t-1}$.
If you plan to use $T$ iterations, you can use a constant learning rate $\eta_t = \frac{\alpha}{\sqrt{T}}$, and summing (1) we get

$$\frac{1}{T}\sum_{t=1}^T \mathbb{E}[f(x_t)] - f(x^\star) \le \frac{\|x_1 - x^\star\|_2^2 + \alpha^2}{2\alpha\sqrt{T}},$$

where we set $u = x^\star$. This is not a convergence result yet, because it just says that on average we are converging. To extract a single solution, we can use Jensen’s inequality and obtain

$$\mathbb{E}[f(\bar{x}_T)] - f(x^\star) \le \frac{\|x_1 - x^\star\|_2^2 + \alpha^2}{2\alpha\sqrt{T}},$$

where $\bar{x}_T = \frac{1}{T}\sum_{t=1}^T x_t$. In words, we show a convergence guarantee for the average of the iterates of SGD, not for the last one.
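For completeness, here is the intermediate telescoping step that the summation hides; this is my own expansion of the standard argument. With $u = x^\star$ and $\eta_t = \eta = \frac{\alpha}{\sqrt{T}}$, summing (1) over $t = 1, \ldots, T$ gives

$$\eta\sum_{t=1}^T\left(\mathbb{E}[f(x_t)] - f(x^\star)\right) \le \frac{1}{2}\|x_1 - x^\star\|_2^2 - \frac{1}{2}\mathbb{E}\left[\|x_{T+1} - x^\star\|_2^2\right] + \frac{T\eta^2}{2} \le \frac{1}{2}\|x_1 - x^\star\|_2^2 + \frac{T\eta^2}{2},$$

and dividing by $\eta T$ and substituting $\eta = \frac{\alpha}{\sqrt{T}}$ gives exactly the rate displayed above.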
Constant learning rates are a bit annoying because they depend on how many iterations you plan to do, theoretically and empirically. So, let’s now take a look at non-increasing learning rates, $\eta_t = \frac{\alpha}{\sqrt{t}}$. In this case, the correct way to analyze SGD without the bounded assumption is to sum (1) without dividing by the learning rate, to have

$$\sum_{t=1}^T \eta_t\left(\mathbb{E}[f(x_t)] - f(x^\star)\right) \le \frac{1}{2}\|x_1 - x^\star\|_2^2 + \frac{1}{2}\sum_{t=1}^T \eta_t^2, \qquad (2)$$

where we set $u = x^\star$. From this one, we have two alternatives. First, we can observe that

$$\eta_T \sum_{t=1}^T \left(\mathbb{E}[f(x_t)] - f(x^\star)\right) \le \sum_{t=1}^T \eta_t\left(\mathbb{E}[f(x_t)] - f(x^\star)\right),$$

because $x^\star$ is a minimizer and the learning rates are non-increasing. So, using again Jensen’s inequality, we get

$$\mathbb{E}[f(\bar{x}_T)] - f(x^\star) \le \frac{\|x_1 - x^\star\|_2^2 + \alpha^2(1 + \ln T)}{2\alpha\sqrt{T}}.$$
Note that if you like these sorts of games, you can even change the learning rate to shave a logarithmic factor, but it is probably useless from an applied point of view.
Another possibility is to use a weighted average:

$$\mathbb{E}\left[f(\tilde{x}_T)\right] - f(x^\star) \le \frac{\|x_1 - x^\star\|_2^2 + \alpha^2(1 + \ln T)}{2\alpha\sqrt{T}},$$

where $\tilde{x}_T = \frac{\sum_{t=1}^T \eta_t x_t}{\sum_{t=1}^T \eta_t}$ and we used $\sum_{t=1}^T \eta_t \ge \alpha\sqrt{T}$. Note that this option does not seem to give any advantage over the unweighted average above. Also, it weights the first iterations more than the last ones, which in most cases is a bad idea: the first iterates tend to be farther away from the optimum than the last ones.
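By the way, if you do want one of these averages in practice, you do not have to store every iterate: both can be kept as running averages. A small sketch, with names of my own choosing:

```python
def update_averages(x, t, eta, x_bar, x_wavg, eta_sum):
    """Online updates of the plain average and the eta_t-weighted average of the iterates."""
    x_bar = x_bar + (x - x_bar) / t                    # bar{x}_t = (1/t) * sum of iterates
    eta_sum = eta_sum + eta                            # running sum of the learning rates
    x_wavg = x_wavg + (eta / eta_sum) * (x - x_wavg)   # weighted average with weights eta_t
    return x_bar, x_wavg, eta_sum
```

Starting from `x_bar = x_wavg = 0.0` and `eta_sum = 0.0` and calling it once per iteration with the current iterate reproduces both averages exactly.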
Let’s summarize what we have till now:
- Unbounded domains are fine with both constant and time-varying learning rates.
- The optimal learning rate depends on the distance between the optimal solution and the initial iterate, because the optimal setting of $\alpha$ is proportional to $\|x_1 - x^\star\|_2$.
- The weighted average is probably a bad idea and not strictly necessary.
- It seems we can only guarantee convergence for (weighted) averages of iterates.
The last point is a bit concerning: most of the time we take the last iterate of SGD, so why do we do it if the theory applies to the average?
3. Convergence of the Last Iterate
Actually, we do know that
- the last solution of SGD converges in unbounded domains with constant learning rate (Zhang, T., 2004).
- the last iterate of SGD converges in bounded domains with non-increasing learning rates (Shamir, O. and Zhang, T., 2013).
So, what about unbounded domains and non-increasing learning rates, i.e., 90% of the uses of SGD? It turns out that it is equally simple, and I think the proof is also instructive! As surprising as it might sound, not dividing (1) by the learning rate is again the key ingredient we need. The proof plan is the following: we want to prove that the value of $f$ on the last iterate $x_T$ is not too far from the average value of $f$ over the iterates. To prove it, we need the following technical lemma on sequences of non-negative numbers multiplied by non-increasing learning rates, whose proof is in the Appendix. This lemma relates the last element of a sequence of numbers to their average.
Lemma 1. Let $\{\eta_t\}_{t=1}^T$ be a non-increasing sequence of positive numbers and let $q_t \ge 0$ for $t = 1, \ldots, T$. Then

$$\eta_T q_T \le \frac{1}{T}\sum_{t=1}^T \eta_t q_t + \sum_{k=1}^{T-1}\frac{1}{k(k+1)}\sum_{t=T-k}^T \eta_t\left(q_t - q_{T-k}\right).$$
With the above Lemma, we can prove the following guarantee for the convergence of the last iterate of SGD.
Theorem 2. Assume the stepsizes $\eta_t$ are deterministic and non-increasing. Then

$$\eta_T\left(\mathbb{E}[f(x_T)] - f(x^\star)\right) \le \frac{1}{T}\sum_{t=1}^T \eta_t\left(\mathbb{E}[f(x_t)] - f(x^\star)\right) + \frac{1}{2}\sum_{k=1}^{T-1}\frac{1}{k(k+1)}\sum_{t=T-k}^T \eta_t^2. \qquad (3)$$
Proof: We use Lemma 1, with $q_t = \mathbb{E}[f(x_t)] - f(x^\star) \ge 0$, to have

$$\eta_T\left(\mathbb{E}[f(x_T)] - f(x^\star)\right) \le \frac{1}{T}\sum_{t=1}^T \eta_t\left(\mathbb{E}[f(x_t)] - f(x^\star)\right) + \sum_{k=1}^{T-1}\frac{1}{k(k+1)}\sum_{t=T-k}^T \eta_t\left(\mathbb{E}[f(x_t)] - \mathbb{E}[f(x_{T-k})]\right).$$

Now, we bound the inner sums in the r.h.s. of the last inequality. Summing (1) from $t = T-k$ to $T$, we have the following inequality that holds for any $u$ measurable w.r.t. $\xi_1, \ldots, \xi_{T-k-1}$:

$$\sum_{t=T-k}^T \eta_t\left(\mathbb{E}[f(x_t)] - \mathbb{E}[f(u)]\right) \le \frac{1}{2}\mathbb{E}\left[\|x_{T-k} - u\|_2^2\right] + \frac{1}{2}\sum_{t=T-k}^T \eta_t^2.$$

Hence, setting $u = x_{T-k}$, we have

$$\sum_{t=T-k}^T \eta_t\left(\mathbb{E}[f(x_t)] - \mathbb{E}[f(x_{T-k})]\right) \le \frac{1}{2}\sum_{t=T-k}^T \eta_t^2.$$

Putting all together, we have the stated bound. $\square$
There are a couple of nice tricks in the proof that might be interesting to study carefully. First, we use the fact that the one-step inequality in (1) holds for any $u$ measurable w.r.t. the past. Most of the time, we state it with $u$ equal to $x^\star$, but it turns out that the more general statement is actually important! In fact, it is possible to know how far the performance of the last iterate is from the performance of the average because the incremental nature of SGD makes it possible to know exactly how far $x_t$ is from any previous iterate $x_{T-k}$, with $T-k \le t \le T$. Please note that all of this would be hidden in the case of bounded domains, where all the distances are bounded by the diameter of the set, and you don’t get the dependency on $\|x_1 - x^\star\|_2$.
Now we have all the ingredients and we only have to substitute a particular choice of the learning rate.
Corollary 3. Assume $\mathbb{E}_{\xi}[\|g(x, \xi)\|_2^2] \le 1$, $x^\star \in \mathop{\mathrm{argmin}}_{x \in V} f(x)$, and $\eta_t = \frac{\alpha}{\sqrt{t}}$ for $t = 1, \ldots, T$. Then, we have

$$\mathbb{E}[f(x_T)] - f(x^\star) \le \frac{\|x_1 - x^\star\|_2^2}{2\alpha\sqrt{T}} + \frac{3\alpha\left(1 + \ln T\right)}{2\sqrt{T}}.$$
Proof: First, observe that

$$\sum_{t=T-k}^T \eta_t^2 = \alpha^2\sum_{t=T-k}^T \frac{1}{t} \le \frac{\alpha^2(k+1)}{T-k}.$$

Now, considering the last term in (3), we have

$$\frac{1}{2}\sum_{k=1}^{T-1}\frac{1}{k(k+1)}\sum_{t=T-k}^T \eta_t^2 \le \frac{\alpha^2}{2}\sum_{k=1}^{T-1}\frac{1}{k(T-k)} = \frac{\alpha^2}{2T}\sum_{k=1}^{T-1}\left(\frac{1}{k} + \frac{1}{T-k}\right) = \frac{\alpha^2}{T}\sum_{k=1}^{T-1}\frac{1}{k} \le \frac{\alpha^2(1 + \ln T)}{T}.$$

Using (2) to bound the first term in (3) and dividing everything by $\eta_T = \frac{\alpha}{\sqrt{T}}$, we have the stated bound. $\square$
Note that the above proof works similarly with a constant learning rate $\eta_t = \frac{\alpha}{\sqrt{T}}$.
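To see the guarantee in action, here is a tiny self-contained experiment, purely my own illustration with made-up data: SGD with $\eta_t = \frac{\alpha}{\sqrt{t}}$ on $f(x) = \frac{1}{n}\sum_i |x - b_i|$, whose minimizer is the median of the $b_i$, comparing the suboptimality of the last iterate against that of the running average.

```python
import numpy as np

rng = np.random.default_rng(0)
b = rng.normal(size=1000)              # made-up data; f(x) = mean_i |x - b_i|
f = lambda x: np.mean(np.abs(x - b))
f_star = f(np.median(b))               # the median minimizes f

alpha, T = 1.0, 10**5
x, x_bar = 10.0, 10.0                  # start far from the optimum (unbounded domain)
for t in range(1, T + 1):
    g = np.sign(x - b[rng.integers(len(b))])   # stochastic subgradient, |g| <= 1
    x -= alpha / np.sqrt(t) * g                # eta_t = alpha / sqrt(t)
    x_bar += (x - x_bar) / t                   # running average of the iterates

print(f"last iterate : {f(x) - f_star:.2e}")
print(f"average      : {f(x_bar) - f_star:.2e}")
```

Both suboptimalities shrink roughly like $1/\sqrt{T}$, in line with the bounds above.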
4. History Bits
The first finite-time convergence proof for the last iterate of SGD is from (Zhang, T., 2004), where he considered the constant learning rate case. It was later extended in (Shamir, O. and Zhang, T., 2013) to time-varying learning rates, but only for bounded domains. The convergence rate for the weighted average in unbounded domains is from (Zhang, T., 2004). The observation that the weighted average is not needed and the plain average works equally well for non-increasing learning rates is from (X. Li and F. Orabona, 2019), where we needed it for the particular case of AdaGrad learning rates. The idea of analyzing SGD without dividing by the learning rate is by (Zhang, T., 2004). Lemma 1 is new but actually hidden in the convergence proof of the last iterate of SGD with linear predictors and square losses in (Lin, J. and Rosasco, L. and Zhou, D.-X., 2016), which in turn is based on the one in (Shamir, O. and Zhang, T., 2013). As far as I know, Corollary 3 is new, but please let me know if you happen to know a reference for it! It is possible to remove the logarithmic term in the bound using a different learning rate, but the proof is only for bounded domains (Jain, P. and Nagaraj, D. and Netrapalli, P., 2019).
5. Exercises
Exercise 1. Generalize the above proofs to the Stochastic Mirror Descent case.
Exercise 2. Remove the assumption of expected bounded stochastic subgradients and instead assume that $f$ is $L$-smooth, i.e., has an $L$-Lipschitz gradient, and that the variance of the noise is bounded. Hint: take a look at the proofs in (Zhang, T., 2004) and (X. Li and F. Orabona, 2019).
6. Appendix
Proof of Lemma 1: Define $S_k = \frac{1}{k}\sum_{t=T-k+1}^T \eta_t q_t$, so we have

$$k S_k = (k+1) S_{k+1} - \eta_{T-k} q_{T-k},$$

that implies

$$S_k = S_{k+1} + \frac{1}{k}\left(S_{k+1} - \eta_{T-k} q_{T-k}\right).$$

Now, from the definition of $S_{k+1}$ and the above equality, using the fact that the $\eta_t$ are non-increasing and the $q_t$ non-negative, we have

$$S_{k+1} - \eta_{T-k} q_{T-k} = \frac{1}{k+1}\sum_{t=T-k}^T\left(\eta_t q_t - \eta_{T-k} q_{T-k}\right) \le \frac{1}{k+1}\sum_{t=T-k}^T \eta_t\left(q_t - q_{T-k}\right),$$

that implies

$$S_k \le S_{k+1} + \frac{1}{k(k+1)}\sum_{t=T-k}^T \eta_t\left(q_t - q_{T-k}\right).$$

Unrolling the inequality from $k = 1$ to $k = T-1$, we have

$$S_1 \le S_T + \sum_{k=1}^{T-1}\frac{1}{k(k+1)}\sum_{t=T-k}^T \eta_t\left(q_t - q_{T-k}\right).$$

Using the definition of $S_k$ and the fact that $S_1 = \eta_T q_T$, we have the stated bound. $\square$
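As a quick numerical sanity check of Lemma 1 as reconstructed above, here is a small script I added, not part of the original post, that verifies the inequality on random sequences:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
eta = np.sort(rng.uniform(0.1, 1.0, T))[::-1]   # non-increasing positive learning rates
q = rng.uniform(0.0, 5.0, T)                    # non-negative numbers q_1, ..., q_T

lhs = eta[-1] * q[-1]                            # eta_T * q_T
rhs = np.mean(eta * q)                           # (1/T) * sum_t eta_t q_t
for k in range(1, T):                            # k = 1, ..., T-1
    t = np.arange(T - k - 1, T)                  # 0-based indices for t = T-k, ..., T
    rhs += np.sum(eta[t] * (q[t] - q[T - k - 1])) / (k * (k + 1))

assert lhs <= rhs + 1e-12                        # Lemma 1 holds on this draw
```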
EDIT: inspired by the discussion on Twitter, I added exercise 2.
Very interesting! My first guess for why the last iterate should converge was that $x_T$ is close to $\bar x_T$. However for a simple one dimensional case where all $g_t = 1$, it can be shown that $x_T - \bar x_T = O(\sqrt{T})$.
Interesting! Can you sketch your construction?
I was thinking $x_T = -\sum_{t=1}^{T-1} \alpha g_t/\sqrt{t}$. Then we get $\bar x_T = -\frac{1}{T}\sum_{k=1}^T\sum_{t=1}^{k-1} \alpha g_t/\sqrt{t}$. Now by assuming $g_t=1$ and simplifying $x_T - \bar x_T$ it is easy to show that it is of the order $O(\sqrt{T})$. Though the problem is the loss function seems unbounded and $\bar x_T$ actually doesn’t converge! So I was totally wrong.