I am back! I finally found some spare time to write something interesting.
A lot of people told me that they like this blog because I present “new” results. Sometimes the results I present are so “new” that people cite me in papers. This makes me happy: not only did you learn something reading my posts, you also found a use for them! So, this post is again one of those things that many people will think of as “new”, but in reality I had it in 2018 and never knew what to do with it. The result this time is obvious from Levy (2017), but in order to see it, you must know online learning. Now, classic optimization people usually do not know online learning. This means that if you know classic optimization, I bet you did not know about this result: let me know if I was wrong!
This time I will try an experiment: I will publish this post here and in a few days I’ll submit a slightly more serious version to arXiv. So, in case you need it, this time you’ll have an arXiv paper to cite.
EDIT: This actually took a few months, but it is now on arXiv.
1. Introduction
Here, I present some very easy results about the use of normalized gradients. Nothing is really new; there is only one bit about adapting to Hölder smoothness using a new inequality, but the core idea is very well known from Levy (2017), at least to online learning people. The main result is the ability to adapt to Hölder smoothness, without knowing the Hölder exponent, essentially for free using normalized gradients. The reduction is very generic, so we will apply it to OGD, Dual Averaging, and parameter-free algorithms. But first I'll show that a similar rate can be obtained with AdaGrad-norm stepsizes, which has an even easier proof.
Let’s first describe Hölder smoothness.
2. Hölder smoothness
For the following definition, see Nesterov (2015).
Definition 1. Let $\nu \in [0,1]$ and $M \ge 0$, and define the class of $(\nu, M)$-Hölder smooth functions with respect to $\|\cdot\|$ as the ones that satisfy
$\|\nabla f(x) - \nabla f(y)\|_\star \le M \|x - y\|^\nu$ for all $x, y$,
where $\|\cdot\|_\star$ is the dual norm of $\|\cdot\|$.
This definition is useful because it allows us to capture a wide range of functions in a unified way. It is a generalization of smoothness and Lipschitzness: in fact, $\nu = 1$ corresponds to smooth functions and $\nu = 0$ to Lipschitz functions.
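To make the definition concrete, here is a small example of mine (not from the original post): for $\nu \in (0,1]$, consider
$f(x) = \frac{1}{1+\nu}\|x\|_2^{1+\nu}$, so that $\nabla f(x) = \|x\|_2^{\nu-1} x$.
One can check that $\|\nabla f(x) - \nabla f(y)\|_2 \le 2^{1-\nu}\|x - y\|_2^{\nu}$, so this $f$ is $(\nu, M)$-Hölder smooth with $M \le 2^{1-\nu}$: $\nu = 1$ gives the smooth quadratic $\frac{1}{2}\|x\|_2^2$, while as $\nu \to 0$ the function approaches the Lipschitz, non-smooth $\|x\|_2$.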
We can prove the following well-known result for this class of functions, see, for example, Nesterov (2015).
Theorem 2. Let $f$ be differentiable with $(\nu, M)$-Hölder continuous gradients with respect to $\|\cdot\|$. Then, for any $x$ and $y$, we have
$f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \frac{M}{1+\nu} \|y - x\|^{1+\nu}$.
Proof: From the fundamental theorem of calculus, the Hölder condition on the gradients, and the definition of dual norm, we have
$f(y) - f(x) - \langle \nabla f(x), y - x\rangle = \int_0^1 \langle \nabla f(x + t(y-x)) - \nabla f(x), y - x\rangle \, dt \le \int_0^1 M t^\nu \|y - x\|^{1+\nu} \, dt$.
Therefore,
$f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \frac{M}{1+\nu} \|y - x\|^{1+\nu}$.
I could not find the following Corollary in any paper, but I am sure this is also well-known.
Corollary 3. Let $f$ be $(\nu, M)$-Hölder smooth with respect to $\|\cdot\|$ and $f^\star = \inf_x f(x) > -\infty$. Then, for any $x$, we have
$\|\nabla f(x)\|_\star \le M^{\frac{1}{1+\nu}} \left(\frac{1+\nu}{\nu} (f(x) - f^\star)\right)^{\frac{\nu}{1+\nu}}$.
Proof: Let $v$ be any vector such that $\langle \nabla f(x), v\rangle = \|\nabla f(x)\|_\star$ and scaled in such a way that $\|v\| = 1$. Then, from Theorem 2 with $y = x - \lambda v$ and $\lambda > 0$, we have
$f^\star \le f(x - \lambda v) \le f(x) - \lambda \|\nabla f(x)\|_\star + \frac{M \lambda^{1+\nu}}{1+\nu}$.
So, for any $\lambda > 0$, we have
$\|\nabla f(x)\|_\star \le \frac{f(x) - f^\star}{\lambda} + \frac{M \lambda^{\nu}}{1+\nu}$.
With the optimal setting of $\lambda = \left(\frac{(1+\nu)(f(x) - f^\star)}{\nu M}\right)^{\frac{1}{1+\nu}}$, the right-hand side of this inequality becomes
$M^{\frac{1}{1+\nu}} \left(\frac{1+\nu}{\nu} (f(x) - f^\star)\right)^{\frac{\nu}{1+\nu}}$,
which is the stated bound.
3. AdaGrad-Norm Adapts to Hölder Smoothness
Our objective is to minimize a convex Hölder smooth function $f$. However, we do not know $M$ and $\nu$. Suppose you want to use gradient descent: $x_{t+1} = x_t - \eta \nabla f(x_t)$.
It should be easy to see that the optimal learning rate will depend on the Hölder exponent $\nu$. Can you see why? It is enough to consider the two extreme cases, $\nu = 0$ and $\nu = 1$. For $\nu = 0$, the function is Lipschitz and the optimal learning rate is proportional to $\frac{1}{\sqrt{T}}$ or $\frac{1}{\sqrt{t}}$, which would give you a convergence rate of $O(1/\sqrt{T})$. Instead, for the case $\nu = 1$, we get that the optimal learning rate is constant and independent of $T$, which would give you a rate of $O(1/T)$. So, you should know in which case you are in order to set the right learning rate, and this is clearly annoying. Note that in the smooth case ($\nu = 1$) the rate is not optimal, because you can get the accelerated one, $O(1/T^2)$, for example using Nesterov's momentum algorithm (Nesterov, 1983).
From this reasoning, you should see that the optimal rate that gradient descent (without acceleration) can achieve on Hölder smooth functions depends on $\nu$, and it goes from $O(1/\sqrt{T})$ to $O(1/T)$. See, for example, Grimmer (2019, Corollary 9) for the precise learning rate and convergence guarantee.
Now, our objective is simple: We want to achieve the above rate without knowing $\nu$ for all the Hölder smooth functions. Sometimes people call this kind of property “adaptivity” or “universality”, even if universality as defined by Nesterov (2015) would achieve the optimal accelerated rate for all $\nu$.
Let's now warm up a bit considering the so-called AdaGrad-norm stepsizes, that is, stepsizes proportional to $1/\sqrt{\sum_{i=1}^t \|\nabla f(x_i)\|_2^2}$.
It is known that gradient descent with the AdaGrad-norm stepsizes gets a faster rate on smooth functions, that is, $O(1/T)$ instead of the usual $O(1/\sqrt{T})$ (Li and Orabona, 2019). So, what happens on Hölder smooth functions? Given that these things tend to be continuous, it should be intuitive that AdaGrad-norm stepsizes should give us a rate that smoothly interpolates between the Lipschitz and smooth case. Let's see how.
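To make the stepsize concrete, here is a minimal Python sketch of plain gradient descent with AdaGrad-norm stepsizes; the scaling constant `alpha`, the averaging of the iterates, and the quadratic in the usage line are illustrative choices of mine, not prescriptions from the analysis below.

```python
import numpy as np

def adagrad_norm_gd(grad, x0, T, alpha=1.0):
    """Gradient descent with AdaGrad-norm stepsizes:
    eta_t = alpha / sqrt(sum_{i<=t} ||grad_i||^2)."""
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    sum_sq = 0.0
    for _ in range(T):
        g = grad(x)
        sum_sq += float(np.dot(g, g))
        if sum_sq == 0.0:                 # zero gradient: already at a stationary point
            break
        x = x - (alpha / np.sqrt(sum_sq)) * g
        iterates.append(x.copy())
    return np.mean(iterates, axis=0)      # average iterate, for the convergence guarantee

# illustrative usage on a smooth quadratic
x_bar = adagrad_norm_gd(lambda x: 2.0 * x, x0=np.ones(5), T=1000)
```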
We will consider the Follow-The-Regularized-Leader (FTRL) with linearized losses (or, as it is known in the optimization community, Dual Averaging (DA)) version of AdaGrad to have a simpler proof, but the analysis in the gradient descent case is also possible and easy (hint: consider the guarantee on ). In this case, the update is
$x_{t+1} = x_1 - \frac{D}{\sqrt{\sum_{i=1}^t \|\nabla f(x_i)\|_2^2}} \sum_{i=1}^t \nabla f(x_i)$,
where $D$ is an upper bound on $\|x^\star - x_1\|_2$.
From its known guarantee, we have
Now, we want to transform the terms in
to be able to apply Corollary 3. So, focus on the last term and use Hölder’s inequality:
Setting so that
, we have
If the function is $(\nu, M)$-Hölder smooth, from Corollary 3 we have
Assuming the term can be ignored in the sum, solving the inequality, we have
The online-to-batch conversion then gives the convergence rate:
where $\bar{x}_T$ is the usual average of the iterates, $\bar{x}_T = \frac{1}{T} \sum_{t=1}^T x_t$.
So, we obtained that AdaGrad-norm stepsizes will give a $O(1/\sqrt{T})$ rate for Lipschitz functions, $O(1/T)$ for the usual smooth functions, and a rate in between for Hölder smooth functions, without the need to know $\nu$ nor $M$. Pretty cool, right?
In the next section, we will see that we can generalize this idea a bit more and get rid of the knowledge of $D$. (You can probably remove $D$ even in this proof, but I was lazy!) Also, I'll show a way to prove that the iterates are bounded that can be used with AdaGrad-norm stepsizes too.
4. Adapting to Local Hölder Smoothness with Normalized Gradients
AdaGrad-norm stepsizes are nice, but they are a bit restrictive. Can I use different stepsizes in a different first-order optimization algorithm and still get the same rate?
It turns out this is very simple: just use normalized gradients and evaluate the function on a weighted average of the iterates. That's it. Moreover, the result is stronger: here we will adapt to a certain notion of “local” Hölder smoothness, and the proof is particularly simple too! It is also a black-box reduction: you can use it with any online optimization algorithm that guarantees a regret bound. No need to know what the online optimization algorithm does! This procedure is summarized in Algorithm 1.
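Since Algorithm 1 is not reproduced in this post, here is a minimal Python sketch of how I read the reduction: query the online learner, feed it the normalized gradient, and return an average of the iterates weighted by the inverse gradient norms. The `predict`/`update` interface and the zero-gradient check are my own illustrative choices.

```python
import numpy as np

def normalized_gradient_reduction(learner, grad, T, eps=1e-12):
    """Black-box reduction (my reading of Algorithm 1): the online learner only
    ever sees unit-norm linear losses, and the output is the average of its
    iterates weighted by the inverse gradient norms."""
    xs, inv_norms = [], []
    for _ in range(T):
        x = np.asarray(learner.predict(), dtype=float)  # current iterate of the learner
        g = grad(x)
        norm = float(np.linalg.norm(g))
        if norm <= eps:                  # (near-)zero gradient: x is (near-)optimal
            return x
        learner.update(g / norm)         # feed the NORMALIZED gradient
        xs.append(x)
        inv_norms.append(1.0 / norm)
    return np.average(np.array(xs), axis=0, weights=inv_norms)
```

Any online linear optimization algorithm exposing these two methods can be plugged in; the sketches below reuse this function.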
Let’s see the details in the following theorem.
Theorem 4. Suppose you have an online linear optimization algorithm
that guarantees
when fed with linear losses
where
for all
, for some function
and any
. Let
convex and
. Assume that there exists
and
such that for all
we have
Then, Algorithm 1 guarantees
So, in words, the algorithm does not need to know anything about the function $f$. Also, you don't need to know anything about the internal workings of the online learning algorithm, only the fact that it guarantees some regret upper bound. Just use normalized gradients and that's it. It is difficult to be more general than this! The rate will depend on the geometric mean of the “local” smoothnesses at the iterates $x_t$, which in turn is upper bounded by their arithmetic mean. Note that if the function is $(\nu, M)$-Hölder smooth everywhere, then the local exponents and constants reduce to the global $\nu$ and $M$.
To prove it, we will use the HM-GM-AM inequality.
Lemma 5. Let $a_1, \dots, a_n$ be positive numbers. Then, we have
$\frac{n}{\frac{1}{a_1} + \dots + \frac{1}{a_n}} \le \left(\prod_{i=1}^n a_i\right)^{1/n} \le \frac{1}{n} \sum_{i=1}^n a_i$.
We can now prove our main Theorem. The proof is immediate from the results in Levy (2017), but kind of hidden. Again, if you know online learning and you read the paper, I assure you the following proof will be evident, at least for the smooth case. The generalization is a nice trick about geometric means, and this kind of trick is my specialty 😉
Proof: In the case that the statement is still true because we just pass the true gradients and
is just the average of the iterates. So, the guarantee follows from classic online-to-batch conversion. Hence, in the following let’s assume
.
For shortness of notation, denote by . In particular, from the guarantee of the algorithm and the setting of
we obtain for all
Now, from the definition and Jensen’s inequality, we have
Hence, we have
where we used Lemma 5 in the second inequality, Corollary 3 in the third one, and the assumption on the online learning algorithm in the last one.
We now consider some examples.
Online Gradient Descent Consider online gradient descent with a constant learning rate $\eta$. In this case, Algorithm 1 simply becomes Algorithm 2.
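Algorithm 2 is not shown here either, so below is a self-contained sketch of how I picture it: constant-stepsize gradient descent on normalized gradients, returning the same inverse-norm weighted average as in the reduction above. The stepsize and the quadratic in the usage line are illustrative.

```python
import numpy as np

def normalized_gd(grad, x0, eta, T, eps=1e-12):
    """Sketch of normalized gradient descent with constant stepsize eta:
    x_{t+1} = x_t - eta * grad(x_t) / ||grad(x_t)||, with the output averaged
    using weights 1/||grad(x_t)|| (my reading of Algorithm 2)."""
    x = np.asarray(x0, dtype=float)
    xs, inv_norms = [], []
    for _ in range(T):
        g = grad(x)
        norm = float(np.linalg.norm(g))
        if norm <= eps:                  # zero gradient: x is already optimal
            return x
        xs.append(x.copy())
        inv_norms.append(1.0 / norm)
        x = x - eta * g / norm           # normalized gradient step
    return np.average(np.array(xs), axis=0, weights=inv_norms)

# illustrative usage on a quadratic
x_bar = normalized_gd(lambda x: 2.0 * (x - 1.0), x0=np.zeros(3), eta=0.5, T=500)
```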
Then, from very standard online learning results, we immediately have
$\sum_{t=1}^T \langle g_t, x_t - u\rangle \le \frac{\|u - x_1\|_2^2}{2\eta} + \frac{\eta T}{2}. \qquad (1)$
So, using Theorem 4 and assuming the function is $(\nu, M)$-Hölder smooth, we obtain
So, we get $O(1/T)$ in the classic smooth case and $O(1/\sqrt{T})$ in the Lipschitz case, and rates in between in the general Hölder case.
But we can do even better! From (1), using the fact that , we can also say that
Hence, the iterates are bounded (nope, this is not new, this is also well-known, see for example Xiao (2010, Corollary 2.b)). This means that the Hölder smoothness only has to hold on a ball around the initial point. So, this result generalizes the one in Mishchenko and Malitsky (2020), because they only consider the usual notion of smoothness.
FTRL/DA With enough math, we could even prove the same thing for OGD with a time-varying learning rate. However, given that I am not religious about OGD, we can do it in a much simpler way using FTRL with linearized losses / DA.
In this case, the update is
As before, proving that the iterates are bounded is very easy (reason as above and use Exercise 7.2 in Orabona (2019)). So, using the known guarantee for FTRL and reasoning as above, we obtain
but this time we don't need to know $D$ ahead of time. In case you are wondering, we lost a “2” because
. Note that we could even consider the non-Euclidean case; everything is essentially the same, you can try it just for fun.
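The exact FTRL/DA update is not reproduced above, so the following is only a sketch of a dual-averaging learner that fits the `predict`/`update` interface of the reduction sketched after Algorithm 1; the $\eta/\sqrt{t}$ scaling is an assumption of mine, not necessarily the stepsize used in the post.

```python
import numpy as np

class DualAveraging:
    """FTRL with linearized losses / dual averaging, unconstrained Euclidean case:
    x_{t+1} = x_1 - (eta / sqrt(t)) * (sum of the normalized gradients seen so far).
    The eta / sqrt(t) scaling is an illustrative choice."""
    def __init__(self, x0, eta=1.0):
        self.x1 = np.asarray(x0, dtype=float)
        self.eta = eta
        self.grad_sum = np.zeros_like(self.x1)
        self.t = 0
    def predict(self):
        if self.t == 0:
            return self.x1.copy()
        return self.x1 - (self.eta / np.sqrt(self.t)) * self.grad_sum
    def update(self, g):
        self.grad_sum = self.grad_sum + np.asarray(g, dtype=float)
        self.t += 1

# e.g.: normalized_gradient_reduction(DualAveraging(np.zeros(3)), grad, T=500)
```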
Parameter-free version In the above bounds, the optimal value of $\eta$ depends on $\|x^\star - x_1\|_2$. If you knew it (but we don't), we would obtain the term $\|x^\star - x_1\|_2$ inside the parentheses above. Luckily, the past 11 years of research in parameter-free algorithms gave us an easy solution: algorithms that do not need to know $\|x^\star - x_1\|_2$ yet achieve the best rates up to (unavoidable) polylogarithmic factors.
Remark 1. “Parameter-free” is a generic adjective, but it is also a technical word like “universal”. So, it might be the case that I do not use it in the same way you use it!
In my subfield, we call an optimization algorithm “parameter-free” if it can achieve the convergence rate (in expectation or high probability)
uniformly over all convex functions with bounded stochastic subgradients, with $T$ stochastic subgradient queries and possible knowledge of the bound on the stochastic subgradients. I know, it is our fault for never defining it formally in the papers, but this is how this community understands this term. So, it does not mean “without anything to tune” nor “without knowing anything about the function”, sorry! The name is motivated by the fact that most of the algorithms that achieve this guarantee (but not all! See, e.g., Foster, Kale, Mohri, and Sridharan (2017)) happen to do it with methods that do not have learning rates.
However, the main disadvantage of parameter-free algorithms is the need to have bounded gradients. But normalized gradients are always bounded! In particular, consider again the Euclidean case and just use the parameter-free KT algorithm with normalized gradients:
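The KT update itself is not written out above, so here is a sketch of the Krichevsky-Trofimov coin-betting learner as I recall it from the parameter-free literature, written to fit the same `predict`/`update` interface; the initial wealth `eps` and the centering at the initial point are my choices, so treat this as illustrative rather than as the exact algorithm referenced in the post.

```python
import numpy as np

class KTBettor:
    """Sketch of the Krichevsky-Trofimov coin-betting learner: it bets a
    (1/(t+1))-weighted sum of the negative past gradients times its current
    wealth, centered at the initial point x1. Initial wealth eps is illustrative."""
    def __init__(self, x0, eps=1.0):
        self.x1 = np.asarray(x0, dtype=float)
        self.grad_sum = np.zeros_like(self.x1)
        self.wealth = eps
        self.t = 0
        self.x = self.x1.copy()
    def predict(self):
        return self.x
    def update(self, g):
        g = np.asarray(g, dtype=float)
        # wealth decreases by the linear loss of the bet just played
        self.wealth -= float(np.dot(g, self.x - self.x1))
        self.grad_sum += g
        self.t += 1
        self.x = self.x1 - (self.grad_sum / (self.t + 1)) * self.wealth

# e.g.: normalized_gradient_reduction(KTBettor(np.zeros(3)), grad, T=500)
```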
In case you think that this is a weird algorithm, think again: it is very difficult to beat this algorithm, because it is designed exactly for the case in which all the vectors have norm equal to 1! Again, using Theorem 4 and the regret guarantee of KT, we immediately obtain
The advantage of this bound over the ones above is that it has almost the optimal dependency on $\|x^\star - x_1\|_2$, without the need to know anything nor to set a learning rate! This rate also tells us that
can be safely set to anything between
and
without affecting the algorithm much.
I can hear at least one party pooper saying that this bound is not optimal in $\|x^\star - x_1\|_2$ because of the $\|x^\star - x_1\|_2$ term in the log. However, this is easy to fix, again with known stuff. There are a lot of parameter-free algorithms; KT is just the simplest one. So, using one of the parameter-free algorithms that do not have $\|x^\star - x_1\|_2$ inside the log (e.g., McMahan and Orabona, 2014; Zhang, Cutkosky, and Paschalidis, 2022), we can also remove the $\|x^\star - x_1\|_2$
term in the log at the expense of the term
that becomes
inside the parentheses. So, the final bound would be
This is a beautiful guarantee: the algorithm adapts to , adapts to
, adapts to
, asymptotically optimal in
. Bonus points: no learning rate to tune and the analysis was very simple because we just put together a known algorithm and its bound with Theorem 4.
But wait, there is also another way! Just scale each gradient by
and feed them to your parameter-free algorithm. Reasoning as in the AdaGrad-norm stepsizes, you will get the same bound, I leave it as an exercise for the reader.
Note that in this case proving that the iterates are bounded is more delicate, but it can still be done using a parameter-free algorithm based on FTRL and reasoning as in Orabona and Pál (2021). So, yeah, everything solved.
5. Open Problems
That's it! Easy, right? Two different reductions, gradients rescaled by AdaGrad-norm stepsizes and normalized gradients, to obtain adaptive rates. Also, if you know classic optimization and you don't know online learning, I hope the short and elegant proofs above will convince you to study more about online learning: I happen to have written an (almost) complete draft book on online learning 😉
As I wrote above, everything was (essentially) known. So, what remains to be done? For sure the stochastic case. Indeed, the deterministic case is too easy to be interesting. In fact, Levy (2017) has an entire section about adaptive minibatches to use normalized gradients in the stochastic case (minus the adaptation to $\nu$ that is “new”), and things get really hairy. On the issue of focusing on the deterministic setting, I feel that it is becoming the new “let's assume a bounded domain”: some people seem to do it not because it makes sense in their applications, but just because it is easier to deal with. Don't avoid the technical difficulties, because sometimes they indicate that you do have to change the algorithm.
Another interesting direction is to obtain accelerated rates using parameter-free algorithms: it is still deterministic, but I don't know of ways to do it (unless I am missing some very recent result).
6. History Bits
AdaGrad-norm stepsizes were proposed by Streeter and McMahan (2010) and widely used in the online learning community (e.g., Orabona and Pál, 2015) before being rediscovered in the stochastic optimization one by… us! (Li and Orabona, 2019) (yep, we were the first ones, check the arXiv date).
The idea of using normalized gradients goes back at least to Nesterov (2004, Section 3.2.3), where he gets adaptation to the Lipschitz constant. Reading his proof, it should be immediate to realize that the same trick works essentially for any algorithm. The majority of this blog post comes from reading and truly enjoying the underappreciated paper by Levy (2017). Levy (2017) presents his result for a particular algorithm, but if you look at his proof you should see that it applies to any online linear optimization algorithm, as I did here. I am not sure where I took the trick about the geometric mean to adapt to $\nu$. Given that I wrote this proof in 2018, I don't remember anymore; maybe I actually invented it. As far as I know, the dependency on the geometric mean of the smoothness is new, and it mirrors a similar bound I proved in the blog post on pseudogradients for the Perceptron. For a paper with true universality, i.e., optimal accelerated rates in all cases, see Nesterov (2015). Also, Grimmer (2022) proves a more fine-grained rate for Nesterov's method on sums of functions.
Some weaker and less general results appear in other papers too. Mishchenko and Malitsky (2020) proved that one can obtain a $O(1/T)$ rate in the smooth case ($\nu = 1$) using gradient descent with an adaptive stepsize that tries to estimate the local smoothness. I believe (but I might be wrong) this also implies that in the non-smooth case their algorithm might fail. Note that Levy (2017) already proved the stronger result of adaptation to the Lipschitz and smooth cases for normalized gradient descent. However, the algorithm in Mishchenko and Malitsky (2020) is much better in the strongly convex case. In fact, in the smooth strongly convex case, it essentially uses a learning rate that is
that will give a linear rate, while the algorithms above will not get you a linear rate. That said, Levy (2017) has other results for the strongly convex case, but you’ll have to read the paper to see which ones. The bit about the boundedness of the OGD iterates with normalized gradient descent is “new” but kind of obvious and I wrote it after reading a similar guarantee proved in Mishchenko and Malitsky (2020).
The adaptation to smoothness in the parameter-free case by rescaling the gradients of a parameter-free algorithm (and actually of any FTRL/DA algorithm) by is in Orabona and Pál (2021, Lemma 26), where in the deterministic case
can be set to 0 and
substituted by
because these changes are needed in the stochastic analysis only, as explained in Li and Orabona (2019). The use of normalized gradients in parameter-free algorithms for the deterministic setting is explicitly mentioned as an easy alternative to avoid knowing the Lipschitz constant in Orabona and Pál (2021) on page 13. Khaled, Mishchenko, and Jin (2023) appear to have rediscovered similar guarantees without adaptation to $\nu$ and with a more convoluted proof. They also rediscover the guarantee of normalized gradient descent from Levy (2017). For the smooth deterministic case, Carmon and Hinder (2022) proposed a parameter-free algorithm that on smooth functions achieves a rate of
paying only an additional log log factor in
, but with the knowledge of the smoothness constant.
The first parameter-free algorithm to prove bounded iterates (with probability one) was in Orabona and Pál (2021), but without a precise bound. Ivgi, Hinder, and Carmon (2023) were the first ones to design a parameter-free algorithm where, with high probability, $\|x_t - x_1\|_2$ is bounded by $c\,\|x^\star - x_1\|_2$, where $c$ is a universal constant.
Acknowledgments
Thanks to Yair Carmon, Benjamin Grimmer, Kfir Levy, Mingrui Liu, Nicolas Loizou, Yura Malitsky, and Aryan Mokthari for feedback and comments on a preliminary version of this blog post.
Versions
1.0: initial post
1.1: Fixed missing term in the AdaGrad proof. Thanks to Aaron Defazio for pointing it out!



Great post as usual. I wonder if, as for gradient descent, the theory can be tightened if we use the equivalent of
$f(x) \leq f(y) + \langle \nabla f(x), x - y\rangle - \frac{1}{2L}\|\nabla f(x) - \nabla f(y)\|^2$
for Hölder smoothness, the same way that using the above instead of the more traditional L-smoothness and convexity separately allows one to improve the convergence rate of gradient descent (as done for example in https://link.springer.com/article/10.1007/s10107-022-01899-0).
Thanks Fabian! Also, thanks for sharing that paper, I didn’t know about it, very cool results!
I would guess it is probably possible to use similar methods in the Hölder case. But if we want something that adapts to the smoothness, we should probably not start from the descent lemma, because we need something that works in the Lipschitz case too. Anyway, it would be interesting to see if the refined inequalities in that paper could find applications in other settings too.
…or at least I would not know how to start from the descent lemma to design adaptive strategies 🙂