Adapting to Smoothness with Normalized Gradients

I am back! I finally found some spare time to write something interesting.

A lot of people told me that they like this blog because I present “new” results. Sometimes the results I present are so “new” that people cite me in papers. This makes me happy: Not only you learned something reading my posts, but you also found a use for them! So, this post is again one of these things that many people will think as “new” but in reality I had it in 2018 and I never knew what to do with it. The result this time is obvious from Levy (2017). But in order to see it, you must know online learning. Now, classic optimization people usually do not know online learning. This means that if you know classic optimization, I bet you probably did not know about this result: Let me know if I was wrong!

This time I will try an experiment: I will publish this post here and in a few days I’ll submit a slightly more serious version to arXiv. So, in case you need it, this time you’ll have an arXiv paper to cite.

EDIT: This actually took few months to do it, but it is now on Arxiv.

1. Introduction

Here, I present some very easy results about the use of normalized gradients. Nothing is really new, there is only one bit about adapting to Hölder smoothness using a new inequality, but the core idea is very well-known from Levy (2017), at least to online learning people. The main result is the ability to adapt to Hölder smoothness without knowing the Hölder exponent essentially for free using normalized gradients. The reduction is very generic, so we will apply it to OGD, Dual Averaging, and parameter-free algorithms. But first I’ll show that a similar rate can be obtained by AdaGrad-norm stepsizes, that is an even easier proof.

Let’s first describe Hölder smoothness.

2. Hölder smoothness

For the following definition, see Nesterov (2015).

Definition 1. Let ${\nu \in [0,1]}$ and define the class of ${(L_\nu,\nu)}$ -Hölder smooth functions with respect to ${\|\cdot\|}$ the ones that satisfy

$\displaystyle L_\nu = \max_{{\boldsymbol x},{\boldsymbol y} \in {\mathbb R}^d, {\boldsymbol x}\neq {\boldsymbol y}} \frac{\|\nabla f({\boldsymbol x}) - \nabla f({\boldsymbol y})\|_\star}{\|{\boldsymbol x} - {\boldsymbol y}\|^\nu} < \infty$

where ${\|\cdot\|_\star}$ is the dual norm of ${\|\cdot\|}$ .

This definition is useful because it allows us to capture a wide range of function in a unified way. It is a generalization of smoothness and Lipschitzness, in fact for smooth functions ${\nu=1}$ and for Lipschitz functions ${\nu=0}$ .

We can prove the following well-known result for this class of functions, see, for example, Nesterov (2015).

Theorem 2. Let ${f:{\mathbb R}^d \rightarrow {\mathbb R}}$ with ${L_\nu}$ -Hölder continuous gradients with respect to ${\|\cdot\|}$ . Then, for ${{\boldsymbol x}, {\boldsymbol y} \in \mathop{\mathrm{dom}} f}$ and ${{\boldsymbol g} \in \nabla f(x)}$ , we have

$\displaystyle f({\boldsymbol y}) \leq f({\boldsymbol x}) + \langle \nabla f({\boldsymbol x}), {\boldsymbol y}-{\boldsymbol x}\rangle + \frac{L_\nu}{1+\nu}\|{\boldsymbol x}-{\boldsymbol y}\|^{1+\nu}~.$

Proof:

$\displaystyle \begin{aligned} f({\boldsymbol y}) &= f({\boldsymbol x}) + \int_0^1 \langle \nabla f({\boldsymbol x}+\tau ({\boldsymbol y}-{\boldsymbol x})), {\boldsymbol y} -{\boldsymbol x}\rangle d \tau \\ &= f({\boldsymbol x}) + \langle \nabla f({\boldsymbol x}), {\boldsymbol y}-{\boldsymbol x}\rangle + \int_0^1 \langle \nabla f({\boldsymbol x}+\tau ({\boldsymbol y}-{\boldsymbol x}))- \nabla f({\boldsymbol x}), {\boldsymbol y} -{\boldsymbol x}\rangle d \tau~. \end{aligned}$

Therefore,

$\displaystyle \begin{aligned} |f({\boldsymbol y})-f({\boldsymbol x}) -\langle \nabla f({\boldsymbol x}), {\boldsymbol y} -{\boldsymbol x}\rangle| &= \left|\int_0^1 \langle \nabla f({\boldsymbol x}+\tau ({\boldsymbol y}-{\boldsymbol x}))- \nabla f({\boldsymbol x}), {\boldsymbol y} -{\boldsymbol x}\rangle d \tau\right| \\ &\leq \int_0^1 |\langle \nabla f({\boldsymbol x}+\tau ({\boldsymbol y}-{\boldsymbol x}))- \nabla f({\boldsymbol x}), {\boldsymbol y} -{\boldsymbol x}\rangle| d \tau \\ &\leq \int_0^1 \|\nabla f({\boldsymbol x}+\tau ({\boldsymbol y}-{\boldsymbol x}))- \nabla f({\boldsymbol x})\|_\star \|{\boldsymbol y} -{\boldsymbol x}\| d \tau \\ &\leq \int_0^1 \tau^\nu L_\nu \|{\boldsymbol y} -{\boldsymbol x}\|^{1+\nu} d \tau = \frac{L_\nu}{1+\nu} \|{\boldsymbol y} -{\boldsymbol x}\|^{1+\nu}~. \end{aligned}$

$\Box$

I could not find the following Corollary in any paper, but I am sure this is also well-known.

Cor 3. Let ${f:{\mathbb R}^d \rightarrow {\mathbb R}}$ be ${(L_\nu,\nu)}$ -Hölder smooth with respect to ${\|\cdot\|}$ and ${{\boldsymbol x}^\star \in \mathop{\mathrm{argmin}}_{{\boldsymbol x}} f({\boldsymbol x})}$ . Then, for any ${{\boldsymbol x} \in {\mathbb R}^d}$

$\displaystyle \|\nabla f({\boldsymbol x})\|_\star^{\frac{1}{\nu}+1} \leq \left(1+\frac{1}{\nu}\right) (L_\nu)^\frac{1}{\nu}(f({\boldsymbol x}) - f({\boldsymbol x}^\star))~.$

Proof: Let ${{\boldsymbol v}}$ is any vector such that ${\langle \nabla f({\boldsymbol x}), {\boldsymbol v}\rangle=-\|\nabla f({\boldsymbol x})\|_\star \|{\boldsymbol v}\|}$ and scaled in such a way that ${\|{\boldsymbol v}\| = \alpha \|\nabla f({\boldsymbol x})\|_\star }$ . Then, from Theorem 2 with ${{\boldsymbol y}={\boldsymbol x}+{\boldsymbol v}}$ , we have

$\displaystyle f({\boldsymbol x}^\star) \leq f({\boldsymbol x}+{\boldsymbol v}) \leq f({\boldsymbol x}) - \alpha \|\nabla f({\boldsymbol x})\|_\star^2 + \alpha^{1+\nu} \frac{L_\nu}{1+\nu}\|\nabla f({\boldsymbol x})\|_\star^{1+\nu}~.$

So, for any ${\alpha>0}$ , we have

$\displaystyle f({\boldsymbol x}^\star) - f({\boldsymbol x}) \leq -\alpha \|\nabla f({\boldsymbol x})\|_\star^2 + \alpha^{1+\nu} \frac{L_\nu}{1+\nu}\|\nabla f({\boldsymbol x})\|_\star^{1+\nu}~.$

With the optimal setting of ${\alpha= (L_\nu)^{-\frac{1}{\nu}} \|\nabla f({\boldsymbol x})\|^\frac{1-\nu}{\nu}}$ , the right hand side of this inequality becomes

$\displaystyle \begin{aligned} &-(L_\nu)^{-\frac{1}{\nu}} \|\nabla f({\boldsymbol x})\|_\star^\frac{1-\nu}{\nu} \|\nabla f({\boldsymbol x})\|_\star^2 + (L_\nu)^{-\frac{1+\nu}{\nu}} \|\nabla f({\boldsymbol x})\|_\star^\frac{(1+\nu)(1-\nu)}{\nu} \frac{L_\nu}{1+\nu}\|\nabla f({\boldsymbol x})\|_\star^{1+\nu} \\ &=-(L_\nu)^{-\frac{1}{\nu}} \|\nabla f({\boldsymbol x})\|_\star^{\frac{1}{\nu}+1}+\frac{1}{1+\nu}(L_\nu)^{-\frac{1}{\nu}}\|\nabla f({\boldsymbol x})\|_\star^{\frac{1}{\nu}+1}\\ &=-\frac{\nu}{1+\nu}(L_\nu)^{-\frac{1}{\nu}} \|\nabla f({\boldsymbol x})\|_\star^{\frac{1}{\nu}+1}~. \end{aligned}$

Reordering, we have the stated bound. $\Box$

3. AdaGrad-Norm Adapts to Hölder Smoothness

Our objective is to minimize a convex Hölder smooth function ${f:{\mathbb R}^d \rightarrow {\mathbb R}}$ . However, we do not know ${\nu}$ and ${L_\nu}$ . Suppose you want to use gradient descent: ${{\boldsymbol x}_{t+1}={\boldsymbol x}_t-\eta \nabla f({\boldsymbol x}_t)}$ .

It should be easy to see that the optimal learning rate ${\eta}$ will depend on the Hölder exponent ${\nu}$ . Can you see why? It is enough to consider the two extreme cases, ${\nu=0}$ and ${\nu=1}$ . For ${\nu=0}$ , the function is Lipschitz and the optimal learning rate is ${\eta \propto \frac{1}{\sqrt{T}}}$ or ${\eta_t \propto \frac{1}{\sqrt{t}}}$ that would give you a convergence rate of ${O(\frac{1}{\sqrt{T}})}$ . Instead, for the case ${\nu=1}$ we get that the optimal learning rate is constant and independent of ${T}$ , that would give you a rate of ${O(1/T)}$ . So, you should know in which case you are in order to set the right learning rate and this is clearly annoying. Note that in the smooth case ( ${\nu=1}$ ) the rate is not optimal because you can get the accelerated one that is ${O(1/T^2)}$ , for example using Nesterov’s momentum algorithm by Nesterov (1983).

From this reasoning, you should see that the optimal rate that gradient descent (without acceleration) can achieve on Hölder smooth functions depend on ${\nu}$ and it goes from ${O(1/\sqrt{T})}$ to ${O(1/T)}$ . See, for example, Grimmer (2019, Corollary 9) for the precise learning rate and convergence guarantee.

Now, our objective is simple: We want to achieve the above rate without knowing ${\nu}$ for all the Hölder smooth functions. Sometime people call this kind of property “adaptivity” or “universality”, even if universality as defined by Nesterov (2015) would achieve the optimal accelerated rate for all ${\nu}$ .

Let’s now warm up a bit considering the so-called AdaGrad-norm stepsizes, that is ${\eta_t=\frac{\alpha}{\sqrt{\sum_{i=1}^t \|{\boldsymbol g}_i\|_2^2}}}$ .

It is known that gradient descent with the AdaGrad-norm stepsizes gets a faster rate on smooth function, that is ${O(1/T)}$ instead of the usual ${O(1/\sqrt{T})}$ (Li and Orabona, 2019). So, what happens on Hölder smooth functions? Given that these things tend to be continuous, it should be intuitive that AdaGrad-norm stepsizes should give us a rate that smoothly interpolates between the Lipschitz and smooth case. Let’s see how.

We will consider the Follow-The-Regularized-Leader (FTRL) with linearized losses (or as it is know in the optimization community Dual Averaging (DA)) version of AdaGrad to have a simpler proof, but the analysis in the gradient descent case is also possible and easy (hint: consider the guarantee on ${\eta_t (f({\boldsymbol x}_t)-f({\boldsymbol x}^\star))}$ ). In this case, the update is

$\displaystyle {\boldsymbol x}_{t+1} ={\boldsymbol x}_1 - \frac{\alpha}{\sqrt{G^2+\sum_{i=1}^{t} \|{\boldsymbol g}_i\|_2^2}}\sum_{i=1}^t {\boldsymbol g}_i,$

where ${G^2}$ is an upper bound on ${\|{\boldsymbol g}_i\|_2^2}$ .
From its known guarantee, we have

$\displaystyle \begin{aligned} \sum_{t=1}^T (f({\boldsymbol x}_t) - f({\boldsymbol x}^\star)) \leq \left(\frac{\|{\boldsymbol x}_1-{\boldsymbol x}^\star\|_2^2}{2\alpha} + \alpha\right) \sqrt{G^2+\sum_{t=1}^T \|{\boldsymbol g}_t\|_2^2}~. \end{aligned}$

Now, we want to transform the terms ${\|{\boldsymbol g}_t\|_2^2}$ in ${\|{\boldsymbol g}_t\|_2^{1+1/\nu}}$ to be able to apply Corollary 3. So, focus on the last term and use Hölder’s inequality:

$\displaystyle \begin{aligned} \sum_{t=1}^T \|{\boldsymbol g}_t\|^2_2 \leq \left(\sum_{t=1}^T 1^{q}\right)^{1/q} \left(\sum_{t=1}^T \|{\boldsymbol g}_t\|_2^{2p}\right)^\frac{1}{p} = T^{1-\frac{1}{p}}\left(\sum_{t=1}^T \|{\boldsymbol g}_t\|_2^{2p}\right)^\frac{1}{p}~. \end{aligned}$

Setting ${p= \frac{1}{2}+\frac{1}{2\nu}}$ so that ${1/p=\frac{2\nu}{\nu+1}}$ , we have

$\displaystyle \begin{aligned} \sqrt{\sum_{t=1}^T \|{\boldsymbol g}_t\|^2} \leq T^{\frac{-\nu+1}{2(\nu+1)}}\left(\sum_{t=1}^T \|{\boldsymbol g}_t\|_2^{1+1/\nu}\right)^\frac{\nu}{\nu+1} \end{aligned}$

If the function ${f}$ is ${(L_\nu,\nu)}$ -Hölder smooth, from Corollary 3 we have

$\displaystyle \begin{aligned} \sum_{t=1}^T (f({\boldsymbol x}_t)-f({\boldsymbol x}^\star)) &\leq \left(\frac{\|{\boldsymbol x}_1-{\boldsymbol x}^\star\|_2^2}{2\alpha} + \alpha\right) \sqrt{G^2+\sum_{t=1}^T \|{\boldsymbol g}_t\|_2^2} \\ &\leq \left(\frac{\|{\boldsymbol x}_1-{\boldsymbol x}^\star\|_2^2}{2\alpha} + \alpha\right) \left(G+T^\frac{-\nu+1}{2(\nu+1)}\left(\sum_{t=1}^T \|{\boldsymbol g}_t\|_2^{1+\frac{1}{\nu}}\right)^\frac{\nu}{\nu+1}\right) \\ &\leq \left(\frac{\|{\boldsymbol x}_1-{\boldsymbol x}^\star\|_2^2}{2\alpha} + \alpha\right) \left(G+T^\frac{-\nu+1}{2(\nu+1)}\left[(1/\nu+1)L_\nu^{1/\nu} \sum_{t=1}^T (f({\boldsymbol x}_t)-f({\boldsymbol x}^\star)) \right]^\frac{\nu}{\nu+1}\right)~. \end{aligned}$

Assuming the term ${\|{\boldsymbol g}_t\|_2^2}$ can be ignored in the sum, solving the inequality, we have

$\displaystyle \begin{aligned} \sum_{t=1}^T (f({\boldsymbol x}_t)-f({\boldsymbol x}^\star)) =O\left( L_\nu \left(\frac{1}{\nu}+1\right)^\nu \left(\frac{\|{\boldsymbol x}_1-{\boldsymbol x}^\star\|_2^2}{2\alpha} + \alpha\right)^{\nu+1} T^\frac{-\nu+1}{2}\right)~. \end{aligned}$

The online-to-batch conversion completes the convergence rate:

$\displaystyle f(\bar{{\boldsymbol x}}_T)-f({\boldsymbol x}^\star) = O\left(L_\nu \left(\frac{1}{\nu}+1\right)^\nu \left(\frac{1}{\sqrt{T}}\left(\frac{\|{\boldsymbol x}_1-{\boldsymbol x}^\star\|_2^2}{2\alpha} + \alpha\right)\right)^{\nu+1}\right),$

where ${\bar{{\boldsymbol x}}_T}$ is the usual average of the iterates, ${\frac1T\sum_{t=1}^T f({\boldsymbol x}_t)}$ .

So, we obtained that AdaGrad-norm stepsizes will give a ${O(1/\sqrt{T})}$ rate for Lipschitz functions, ${O(1/T)}$ for the usual smooth function and a rate in between for Hölder smooth functions, without the need to know ${L_\nu}$ nor ${\nu}$ . Pretty cool, right?

In the next section, we see that we can generalize this idea a bit more and we can get rid of the knowledge of ${G}$ . (You can probably remove ${G}$ even in this proof, but I was lazy!) Also, I’ll show a way to prove that the iterates are bounded that can be used on AdaGrad-norm stepsizes too.

4. Adapting to Local Hölder Smoothness with Normalized Gradients

AdaGrad-norm stepsizes are nice, but they are a bit restrictive. Can I use different stepsizes in a different first-order optimization algorithm and still get the same rate?

It turns out this is very simple: Just use normalized gradients and evaluate the function on a weighted average of the iterates. That’s it. Moreover, the result is stronger: here we will adapt to a certain notion of “local” Hölder smoothness. And the proof is particularly simple too! Moreover, it is a black-box reduction: you can use it with any online optimization algorithm that guarantees a regret. No need to know what the online optimization algorithm does! This procedure is summarized in Algorithm 1.

Screenshot_20230619_092627

Let’s see the details in the following theorem.

Theorem 4. Suppose you have an online linear optimization algorithm ${\mathcal{A}}$ that guarantees

$\displaystyle \sum_{t=1}^T \langle {\boldsymbol q}_t, {\boldsymbol x}_t - {\boldsymbol u}\rangle \leq \psi_T(\|{\boldsymbol u}\|)$

when fed with linear losses ${\ell_t({\boldsymbol x})=\langle{\boldsymbol q}_t,{\boldsymbol x}\rangle}$ where ${\|{\boldsymbol q}_t\|_\star= 1}$ for all ${t}$ , for some function ${\psi_T:{\mathbb R}^d \rightarrow {\mathbb R}}$ and any ${{\boldsymbol u} \in {\mathbb R}^d}$ . Let ${f:{\mathbb R}^d \rightarrow {\mathbb R}}$ convex and ${{\boldsymbol x}^\star \in \mathop{\mathrm{argmin}}_{{\boldsymbol x}} f({\boldsymbol x})}$ . Assume that there exists ${\nu}$ and ${L({\boldsymbol x})}$ such that for all ${{\boldsymbol x} \in {\mathbb R}^d}$ we have

$\displaystyle \|\nabla f({\boldsymbol x})\|_\star^{\frac{1}{\nu}+1} \leq \alpha (L({\boldsymbol x}))^\frac{1}{\nu}(f({\boldsymbol x}) - f({\boldsymbol x}^\star))~.$

Then, Algorithm 1 guarantees

$\displaystyle \begin{aligned} f(\bar{{\boldsymbol x}}_T) - f({\boldsymbol x}^\star) \leq \alpha^\nu \left(\frac{\psi_T({\boldsymbol x}^\star)}{T}\right)^{1+\nu} \left( \prod_{t=1}^T L({\boldsymbol x}_t)\right)^\frac{1}{T} \leq \alpha^\nu \left(\frac{\psi_T({\boldsymbol x}^\star)}{T}\right)^{1+\nu} \frac{1}{T}\sum_{t=1}^T L({\boldsymbol x}_t) ~. \end{aligned}$

So, in words, the algorithm does not need to know anything about the function ${f}$ . Also, you don’t need to know anything about the internal working of the online learning algorithm, only the fact that it guarantees some regret upper bound. Just use normalized gradient and that’s it. It is difficult to be more general than this! The rate will depend on the geometric mean of the “local” smoothnesses on the iterates ${L({\boldsymbol x}_t)}$ , that in turn is upper bounded by their arithmetic mean. Note that if the function is ${(L_\nu,\nu)}$ -Hölder smooth everywhere, then ${\alpha^\nu\leq \left(1+\frac{1}{\nu}\right)^\nu\in[1,2]}$ and ${L({\boldsymbol x}_t)\leq L_\nu}$ .

To prove it, we will use the HM-GM-AM inequality

Lemma 5. Let ${a_1, \dots, a_T}$ positive numbers. Then, we have

$\displaystyle \frac{T}{\sum_{t=1}^T \frac{1}{a_t}} \leq \left(\prod_{t=1}^T a_t\right)^{1/T} \leq \left(\frac{1}{T} \sum_{t=1}^T a_t\right)~.$

We can now prove our main Theorem. The proof is immediate from the results in Levy (2017), but kind of hidden. Again, if you know online learning and you read the paper, I assure you the following proof will be evident, at least for ${\nu=1}$ . The generalization is a nice trick about geometric means and this kind of tricks is my specialty 😉

Proof: In the case that ${\nu=0}$ the statement is still true because we just pass the true gradients and ${\bar{{\boldsymbol x}}_T}$ is just the average of the iterates. So, the guarantee follows from classic online-to-batch conversion. Hence, in the following let’s assume ${\nu>0}$ .

For shortness of notation, denote by ${{\boldsymbol g}_t := \nabla f({\boldsymbol x}_t)}$ . In particular, from the guarantee of the algorithm and the setting of ${{\boldsymbol q}_t}$ we obtain for all ${{\boldsymbol u} \in {\mathbb R}^d}$

$\displaystyle \begin{aligned} \sum_{t=1}^T \frac{f({\boldsymbol x}_t) - f({\boldsymbol u})}{\|{\boldsymbol g}_t\|_\star} \leq \sum_{t=1}^T \frac{\langle {\boldsymbol g}_t, {\boldsymbol x}_t -{\boldsymbol u}\rangle}{\|{\boldsymbol g}_t\|_\star} \leq \psi_T({\boldsymbol u})~. \end{aligned}$

Now, from the definition ${\bar{{\boldsymbol x}}_T}$ and Jensen’s inequality, we have

$\displaystyle \sum_{t=1}^T \frac{f({\boldsymbol x}_t) -f({\boldsymbol x}^\star)}{\|{\boldsymbol g}_t\|_\star} \geq (f(\bar{{\boldsymbol x}}_T) -f({\boldsymbol x}^\star))\sum_{t=1}^T \frac{1}{\|{\boldsymbol g}_t\|_\star}~.$

Hence, we have

$\displaystyle \begin{aligned} f(\bar{{\boldsymbol x}}_T) - f({\boldsymbol x}^\star) &\leq \frac{\psi_T({\boldsymbol x}^\star)}{\sum_{t=1}^T \frac{1}{\|{\boldsymbol g}_t\|_\star} } \\ &\leq \frac{\psi_T({\boldsymbol x}^\star)}{T} \left(\prod_{t=1}^T \|{\boldsymbol g}_t\|_\star\right)^\frac{1}{T} \\ &= \frac{\psi_T({\boldsymbol x}^\star)}{T} \left(\prod_{t=1}^T \frac{\|{\boldsymbol g}_t\|_\star^{1+\frac{1}{\nu}}}{\|{\boldsymbol g}_t\|_\star}\right)^\frac{\nu}{T} \\ &\leq \frac{\psi_T({\boldsymbol x}^\star)}{T} \left( \prod_{t=1}^T \frac{\alpha(L({\boldsymbol x}_t))^\frac{1}{\nu}(f({\boldsymbol x}_t) - f({\boldsymbol x}^\star))}{\|{\boldsymbol g}_t\|_\star}\right)^\frac{\nu}{T} \\ &= \frac{\psi_T({\boldsymbol x}^\star)}{T}\alpha^\nu \left( \prod_{t=1}^T L({\boldsymbol x}_t)\right)^\frac{1}{T} \left( \prod_{t=1}^T \frac{f({\boldsymbol x}_t) - f({\boldsymbol x}^\star)}{\|{\boldsymbol g}_t\|_\star} \right)^\frac{\nu}{T}\\ &\leq \frac{\psi_T({\boldsymbol x}^\star)}{T} \alpha^\nu \left( \prod_{t=1}^T L({\boldsymbol x}_t)\right)^\frac{1}{T} \left( \frac{1}{T}\sum_{t=1}^T \frac{f({\boldsymbol x}_t) - f({\boldsymbol x}^\star)}{\|{\boldsymbol g}_t\|_\star} \right)^{\nu}\\ &\leq \frac{\psi_T({\boldsymbol x}^\star)}{T} \alpha^\nu \left( \prod_{t=1}^T L({\boldsymbol x}_t)\right)^\frac{1}{T} \left(\frac{\psi_T({\boldsymbol x}^\star)}{T}\right)^{\nu}~. \end{aligned}$

where we used Lemma 5 in the second inequality, Corollary 3 in the third one, the assumption on the online learning algorithm in the last one. $\Box$

We now consider some examples.

Online Gradient Descent Consider online gradient descent with constant learning rate ${\eta=\frac{\alpha}{\sqrt{T}}}$ . In this case, Algorithm 1 simply becomes Algorithm 2.

nogd2

Then, from very standard online learning results, we have immediately have

$\displaystyle \label{eq:nogd} \sum_{t=1}^T \frac{f({\boldsymbol x}_t) -f({\boldsymbol u})}{\|{\boldsymbol g}_t\|_2} \leq \frac{\sqrt{T}\|{\boldsymbol u}-{\boldsymbol x}_1\|_2^2}{2\alpha}+ \frac{\alpha\sqrt{T}}{2} - \frac{\sqrt{T}}{2\alpha}\|{\boldsymbol x}_{T+1}-{\boldsymbol u}\|_2^2~. \ \ \ \ \ (1)$

So, using Theorem 4 and assuming the function ${(L_\nu,\nu)}$ -Hölder smooth, we obtain

$\displaystyle f(\bar{{\boldsymbol x}}_T) -f({\boldsymbol x}^\star) \leq L_\nu \left(1+\frac{1}{\nu}\right)^\nu \left(\frac{1}{2\sqrt{T}} \left(\frac{\|{\boldsymbol x}_1-{\boldsymbol x}^\star\|_2^2}{\alpha}+\alpha\right)\right)^{1+\nu}~.$

So, we get ${O(1/T)}$ in the classic smooth case and ${O(1/\sqrt{T})}$ in the Lipschitz case, and rates in between in the general Hölder case.

But we can do even better! From (1), using the fact that ${f({\boldsymbol x}_t)-f({\boldsymbol x}^\star)\geq0}$ , we can also say that

$\displaystyle \|{\boldsymbol x}_{T+1}-{\boldsymbol u}\|_2^2 \leq \|{\boldsymbol u}-{\boldsymbol x}_1\|_2^2+\alpha^2~.$

Hence, the iterates are bounded (nope, this is not new, this is also well-known, see for example Xiao (2010, Corollary 2.b)). This means that the Hölder smoothness only have to hold on a ball around the initial point. So, this result generalizes the one in Mishchenko and Malitsky (2020) because they only consider the usual notion of smoothness.

FTRL/DA With enough math, we could even prove the same thing for OGD a time-varying learning rate. However, given that I am not religious about OGD, we can do it in a much simpler way using FTRL with linearized losses / DA.

In this case, the update is

$\displaystyle {\boldsymbol x}_{t+1} = - \frac{\alpha}{\sqrt{t}} \sum_{i=1}^t {\boldsymbol q}_i = - \frac{\alpha}{\sqrt{t}} \sum_{i=1}^t \frac{{\boldsymbol g}_i}{\|{\boldsymbol g}_i\|_2}~.$

As before, proving that the iterates are bounded is very easy (reason as above and use Exercise 7.2 in Orabona (2019)). So, using the known guarantee for FTRL and reasoning as above, we obtain

$\displaystyle f(\bar{{\boldsymbol x}}_T) -f({\boldsymbol x}^\star) \leq L_\nu \left(1+\frac{1}{\nu}\right)^\nu \left(\frac{1}{\sqrt{T}}\left(\frac{\|{\boldsymbol x}_1-{\boldsymbol x}^\star\|_2^2}{2\alpha}+\alpha\right)\right)^{1+\nu},$

but this time we don’t need to know ${T}$ ahead of time. In case you are wondering, we lost a “2” because ${\sum_{t=1}^T \frac{1}{\sqrt{t}}\leq 2\sqrt{T}}$ . Note that we could even consider the non-Euclidean case, everything is essentially the same, you can try just for fun.

Parameter-free version In the above bounds, the optimal value of ${\alpha}$ depends on ${\|{\boldsymbol x}_1-{\boldsymbol x}^\star\|_2}$ . If you knew ${\|{\boldsymbol x}_1-{\boldsymbol x}^\star\|_2}$ (but we don’t), we would obtain the term ${\|{\boldsymbol x}_1-{\boldsymbol x}^\star\|_2}$ inside the parentheses above. Luckily, the past 11 years of research in parameter-free algorithms gave us an easy solution: algorithms that do not need to know ${\|{\boldsymbol x}_1-{\boldsymbol x}^\star\|_2}$ yet achieve the best rates up to (unavoidable) polylogarithmic factors.

Remark 1. “Parameter-free” is a generic adjective but it is also a technical word like “universal”. So, it might be the case that I do not use it in the same you use it!

In my subfield, we call an optimization algorithm “parameter-free” if it can achieve the convergence rate (in expectation or high probability) ${\tilde{O}(\frac{\|{\boldsymbol x}_1-{\boldsymbol x}^\star\|}{\sqrt{T}})}$ uniformly over all convex functions with bounded stochastic subgradients, with ${T}$ stochastic subgradients query and possible knowledge of the bound of the stochastic subgradients. I know, our fault was never to define it formally in the papers, but this is how this community understands this term. So, it does not mean “without anything to tune” nor “without knowing anything on the function”, sorry! The name is motivated by the fact that most of the algorithms that achieve this guarantee (but not all! See, e.g., Foster, Kale, Mohri, and Sridharan (2017)) happen to do it with methods that do not have learning rates.

However, the main disadvantage of parameter-free algorithms is the need to have bounded gradients. But normalized gradients are always bounded! In particular, consider again the Euclidean case and just use the parameter-free KT algorithm with normalized gradients:

$\displaystyle {\boldsymbol x}_{t+1} = {\boldsymbol x}_1 + \frac{-\sum_{i=1}^{t} {\boldsymbol q}_i }{t+1}\left(d_0 - \sum_{i=1}^t \langle {\boldsymbol q}_i, {\boldsymbol x}_i\rangle\right) = {\boldsymbol x}_1 + \frac{-\sum_{i=1}^{t} \frac{{\boldsymbol g}_i}{\|{\boldsymbol g}_i\|_2} }{t+1}\left(d_0 - \sum_{i=1}^t \left\langle \frac{{\boldsymbol g}_i}{\|{\boldsymbol g}_i\|_2}, {\boldsymbol x}_i\right\rangle\right)$

In case you think that this is a weird algorithm, think again: it is very difficult to beat this algorithm, because it is designed exactly for the case in which all the vectors have norm equal to 1! Again, using Theorem 4 and the regret guarantee of KT, we immediately obtain

$\displaystyle f(\bar{{\boldsymbol x}}_T) -f({\boldsymbol x}^\star) \leq L_\nu \left(1+\frac{1}{\nu}\right)^\nu \left(\frac{\|{\boldsymbol x}_1-{\boldsymbol x}^\star\|_2}{\sqrt{T}} \sqrt{\ln \left( \frac{24T^2\|{\boldsymbol x}_1-{\boldsymbol x}^\star\|^2_2}{d_0^2}+1\right)}+ \frac{d_0}{T}\right)^{1+\nu}~.$

The advantage of this bound over the ones above is that this has almost the optimal dependency on ${\|{\boldsymbol x}_1-{\boldsymbol x}^\star\|_2}$ without the need to know anything nor to set a learning rate! This rate also tells us that ${d_0}$ can be safely set to anything between ${O(1)}$ and ${O(\sqrt{T})}$ without affecting much the algorithm.

I can hear at least one party pooper saying that this bound is not optimal in ${T}$ because of the ${T}$ term in the log. However, this is easy to fix, again known stuff. There are a lot of parameter-free algorithms, KT is just the simplest one. So, using the parameter-free algorithms that do not have ${T}$ inside the log (e.g., McMahan and Orabona, 2014)(Zhang, Cutkosky, and Paschalidis, 2022), we can also remove the ${T}$ term in the log at the expense of the term ${\frac{d_0}{T}}$ that becomes ${O(\frac{d_0}{\sqrt{T}})}$ inside the parentheses. So, the final bound would be

$\displaystyle f(\bar{{\boldsymbol x}}_T) -f({\boldsymbol x}^\star) = O\left(L_\nu \left(1+\frac{1}{\nu}\right)^\nu \left(\frac{\|{\boldsymbol x}_1-{\boldsymbol x}^\star\|_2}{\sqrt{T}} \sqrt{\ln \left( \frac{\|{\boldsymbol x}_1-{\boldsymbol x}^\star\|_2}{d_0}+1\right)}+ \frac{d_0}{\sqrt{T}}\right)^{1+\nu}\right)~.$

This is a beautiful guarantee: the algorithms adapts to ${\nu}$ , adapts to ${\|{\boldsymbol x}_1-{\boldsymbol x}^\star\|_2}$ , adapts to ${L_\nu}$ , asymptotically optimal in ${T}$ . Bonus points: no learning rate to tune and the analysis was very simple because we just put together a known algorithm and its bound with Theorem 4.

But wait, there is also another way! Just scale each gradient ${{\boldsymbol g}_t}$ by ${\sqrt{\sum_{i=1}^t \|{\boldsymbol g}_i\|^2_2}}$ and feed them to your parameter-free algorithm. Reasoning as in the AdaGrad-norm stepsizes, you will get the same bound, I leave it as an exercise for the reader.

Note that in this case the bounded iterates are more delicate, but it can still be done using a parameter-free algorithm based on FTRL and reason as in Orabona and Pál (2021). So, yeah, everything solved.

5. Open Problems

That’s it! Easy, right? Two different reductions, rescaled gradients by AdaGrad-norm stepsizes and normalized gradients, to obtain adaptive rates. Also, if you know classic optimization and you don’t know online learning, I hope the short and elegant proofs above will convince you to study more about online learning: I happen to have written an (almost) complete draft book on online learning 😉

As I wrote above, everything was (essentially) known. So, what remains to be done? For sure the stochastic case. Indeed, the deterministic case is too easy to be interesting. In fact, Levy (2017) has an entire section about adaptive minibatches to use normalized gradients (minus the adaptation to ${\nu}$ that is “new”) in the stochastic case and things get really hairy. On the issue of focusing on the deterministic setting, I feel that it is becoming the new “let’s assume a bounded domain”: Some people seem to do it not because it makes sense in their applications but just because it is easier to deal with. Don’t avoid the technical difficulties because sometimes they indicate that you do have to change the algorithm.

Another interesting direction is to obtain accelerated rates using parameter-free algorithms: it is still deterministic but I don’t know ways to do it (unless I am missing some very recent result).

6. History Bits

AdaGrad-norm stepsizes were proposed by Streeter and McMahan (2010) and widely used in the online learning community (e.g., Orabona and Pál , 2015) before being rediscovered in the stochastic optimization one by… us! (Li and Orabona, 2019) (yep, we were the first ones, check the arXiv date).

The idea of using normalized gradients goes back at least to Nesterov (2004, Section 3.2.3), where he gets adaptation to the Lipschitz constant. Reading his proof, it should be immediate to realize that the same trick works essentially for any algorithm. The majority of this blog post comes from reading and truly enjoying the underappreciated paper by Levy (2017). Levy (2017) presents his result for a particular algorithm, but if you look at his proof you should see that it applies to any online linear optimization algorithm, as I did. I am not sure where I took the trick about the geometric mean to adapt to ${\nu}$ . Given that I wrote this proof in 2018, I don’t remember anymore, maybe I actually invented it. As far as I know, the dependency on the geometric mean of the smoothness is new and it mirrors a similar bound I proved in the blog post on pseudogradients for the Perceptron. For a paper with true universality, i.e., optimal accelerated rates in all cases, see Nesterov (2015). Also, Grimmer (2022) proves a more fine-grained rate for Nesterov’s method on sum of functions.

Some weaker and less general results appear in other papers too. Mishchenko and Malitsky (2020) proved that one can obtain ${O(1/T)}$ rate in the smooth case ( ${\nu=1}$ ) using gradient descent with an adaptive stepsize that tries to estimate the local smoothness. I believe (but I might be wrong) this also implies that in the non-smooth case their algorithm might fail. Note that Levy (2017) already proved the stronger result of adaptation to Lipschitz and smooth case for normalized gradient descent. However, the algorithm in Mishchenko and Malitsky (2020) is much better in the strongly convex case. In fact, in the smooth strongly convex case, it essentially uses a learning rate that is ${\approx \frac{1}{2L}}$ that will give a linear rate, while the algorithms above will not get you a linear rate. That said, Levy (2017) has other results for the strongly convex case, but you’ll have to read the paper to see which ones. The bit about the boundedness of the OGD iterates with normalized gradient descent is “new” but kind of obvious and I wrote it after reading a similar guarantee proved in Mishchenko and Malitsky (2020).

The adaptation to smoothness in the parameter-free case by rescaling the gradients of a parameter-free algorithm (and actually of any FTRL/DA algorithm) by ${(G^2+\sum_{i=1}^{t-1} \|{\boldsymbol g}_i\|^2_2)^{1/2+\epsilon}}$ is in Orabona and Pál (2021, Lemma 26), where in the deterministic case ${\epsilon}$ can be set to 0 and ${G^2}$ substituted by ${\|{\boldsymbol g}_t\|^2_2}$ because these changes are needed in the stochastic analysis only as explained in Li and Orabona (2019). The use of normalized gradients in parameter-free algorithms for the deterministic setting is explicitly mentioned as an easy alternative to avoid knowing the Lipschitz constant in Orabona and Pál (2021) on page 13. Khaled, Mishchenko, and Jin (2023) appear to have rediscovered similar guarantees without adaptation to ${\nu}$ and with a more convoluted proof. They also rediscover the guarantee of normalized gradient descent from Levy (2017). For the smooth deterministic case, Carmon and Hinder (2022) proposed a parameter-free algorithm that on smooth functions achieves a rate of ${O(1/T)}$ paying only an additional log log factor in ${\|{\boldsymbol x}_1-{\boldsymbol x}^\star\|}$ , but with the knowledge of the smoothness constant.

The first parameter-free algorithm to prove bounded iterates (with probability one) was in Orabona and Pál (2021) but without a precise bound. Ivgi, Hinder, and Carmon (2023) were the first one to design a parameter-free algorithm where ${\|{\boldsymbol x}_1-{\boldsymbol x}_t\|_2}$ is bounded by a ${K \|{\boldsymbol x}_1-{\boldsymbol x}^\star\|_2}$ , where ${K}$ is a universal constant with high probability.

Acknowledgments

Thanks to Yair Carmon, Benjamin Grimmer, Kfir Levy, Mingrui Liu, Nicolas Loizou, Yura Malitsky, and Aryan Mokthari for feedback and comments on a preliminary version of this blog post.

Versions
1.0: initial post
1.1: Fixed missing ${G}$ term in the AdaGrad proof. Thanks to Aaron Defazio for pointing it out!

3 Comments

Fabian Pedregosa says:

June 21, 2023 at 3:35 AM

great post as usual. I wonder if as for gradient descent, the theory can be tightened if we use the equivalent of

f(x) \leq f(y) + \langle \nabla f(x), x – y\rangle – \frac{1}{2 L}\|\nabla f(x) – \nabla f(y)\|^2

for Holder smoothness the same way that using the above instead of the more traditional L-smoothness and convexity separately allows to improve the convergence rate of gradient descent (as done for example in https://link.springer.com/article/10.1007/s10107-022-01899-0)

LikeLiked by 1 person

1. bremen79 says:
  
  June 21, 2023 at 7:35 AM
  
  Thanks Fabian! Also, thanks for sharing that paper, I didn’t know about it, very cool results!
  I would guess it is probably possible to use similar methods in the Hölder case. But if we want something that adapts to the smoothness, we should probably not start from the descent lemma, because we need something that works in the Lipschitz case too. Anyway, it would be interesting to see if the refined inequalities in that paper could find applications in other settings too.
  
  LikeLike
  
  1. bremen79 says:
    
    June 21, 2023 at 7:45 AM
    
    …or at least I would not know how to start from the descent lemma to design adaptive strategies 🙂
    
    LikeLike
Pingback: Is Nesterov’s Universal Algorithm Really Universal? – Parameter-free Learning and Optimization Algorithms

Share this:

Related

Leave a comment Cancel reply