This post is about Nesterov’s universal algorithm and how delicate it is to claim that an algorithm is “universal”, “parameter-free”, “adaptive”, or any other similar word denoting the fact that the algorithm does not need prior knowledge of the characteristics of a function to converge at its best rate. This post was born from a visit I paid to George Lan a few weeks ago.
This post is also a good occasion to point out an important distinction between algorithms that guarantee that the suboptimality gap shrinks at a rate $O(1/\sqrt{t})$ for any number of iterations $t$, and algorithms that accept a parameter $\epsilon>0$ and require $O(1/\epsilon^2)$ iterations to reach $\epsilon$-suboptimality. They might look like the same thing, but they are not!
1. Nesterov’s Universal Algorithm
We will consider the setting of non-smooth functions, but similar considerations can be made for the weakly-smooth one. So, we have a convex, non-smooth, Lipschitz function $f$. We assume the Lipschitz constant to be $L$, we assume that at least a minimizer exists, and we call it $x^\star$.
What is a “universal” optimization algorithm? Well, this concept was introduced by Nesterov (2015), but not precisely defined. More or less, he implied that a universal algorithm achieves the optimal convergence rate without knowledge of the characteristics of a function, where the function comes from a known and large function class.
Now, Nesterov’s universal algorithm uses subgradients and function values to guarantee the following complexity: $O\big(\frac{L^2 \|x_0 - x^\star\|^2}{\epsilon^2}\big)$, where $x_0$ is the point where we query the first subgradient. (The algorithm actually guarantees a rate uniformly over all Hölder-smooth functions, but here we focus only on the non-smooth case.) What is the meaning of the above expression? This is the way in which optimization papers state how fast you will converge: if you want to produce $x$ such that the suboptimality gap is less than or equal to $\epsilon$, that is, $f(x) - f(x^\star) \le \epsilon$, you need the number of iterations to be $O\big(\frac{L^2 \|x_0 - x^\star\|^2}{\epsilon^2}\big)$.
It is possible to show that this complexity bound is optimal; moreover, the algorithm does not need to know $L$ nor $\|x_0 - x^\star\|$ to calculate its update. So, it seems that Nesterov’s algorithm is indeed parameter-free. That is, it seems we can generate a solution with precision $\epsilon$ in the optimal number of iterations and without knowing anything about the function. Right? Well, no.
Indeed, given everything we know about parameter-free optimization, the complexity above is too good to be true. In fact, you are expected to pay something for your lack of knowledge of the characteristics of $f$. Maybe you pay something small, like a logarithmic factor, but you typically have to pay something.
So, where is the issue? Let’s take a look at Nesterov’s algorithm:

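Only the guarantee of the algorithm matters for the argument that follows, but to fix ideas, here is a minimal Python sketch of the universal primal gradient method restricted to the unconstrained Euclidean case. The function name, the doubling/halving schedule for the smoothness estimate, and the $1/M_k$-weighted averaging are my own simplifications for illustration, not a faithful transcription of Nesterov’s pseudocode. Note that the method takes $\epsilon$ as input, but never $L$:

```python
import numpy as np

def universal_pgm(f, subgrad, x0, eps, iters, M0=1.0):
    """Sketch of a universal primal gradient method (unconstrained,
    Euclidean case): a backtracking search doubles the estimate M until
    an inexact descent condition with additive slack eps/2 holds, then
    the gradient step x - g/M is taken. No Lipschitz constant is needed."""
    x = np.asarray(x0, dtype=float)
    M = M0
    points, weights = [], []
    for _ in range(iters):
        g = subgrad(x)
        M = max(M / 2.0, M0)  # optimistically shrink M between iterations
        while True:
            x_new = x - g / M
            d = x_new - x
            # inexact "smoothness" condition with slack eps/2
            if f(x_new) <= f(x) + g.dot(d) + 0.5 * M * d.dot(d) + eps / 2.0:
                break
            M *= 2.0  # condition failed: double M and retry
        points.append(x_new)
        weights.append(1.0 / M)  # iterates are averaged with weights 1/M_k
        x = x_new
    return np.average(np.array(points), axis=0, weights=np.array(weights))

# Demo on the 1-Lipschitz function f(x) = |x|, with minimizer x* = 0:
f = lambda x: abs(float(x[0]))
subgrad = lambda x: np.array([1.0 if x[0] >= 0 else -1.0])
xbar = universal_pgm(f, subgrad, np.array([5.0]), eps=0.1, iters=2000)
```

As advertised, the update needs neither $L$ nor $\|x_0 - x^\star\|$; the trouble, as we will see, is entirely in deciding when to stop.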
The guarantee for this algorithm is the following one: for any parameter $\epsilon>0$, after $t$ iterations we have

$$f(\bar{x}_t) - f(x^\star) \le \frac{\epsilon}{2} + \frac{L^2 \|x_0 - x^\star\|^2}{2 \epsilon t}.$$
Observe a peculiar thing: there is no stopping condition for this algorithm. Now, an algorithm has to terminate after a finite number of steps, otherwise it is not an algorithm! So, either we define a precise stopping condition or we decide to run it for a fixed number of iterations. When do we stop? The above guarantee tells us that if $t \ge \frac{L^2 \|x_0 - x^\star\|^2}{\epsilon^2}$, then $f(\bar{x}_t) - f(x^\star) \le \epsilon$. Hence, we need the number of iterations $t$ to be greater than or equal to $\frac{L^2 \|x_0 - x^\star\|^2}{\epsilon^2}$, which gives the stated complexity. (If we keep running the algorithm for more than $\frac{L^2 \|x_0 - x^\star\|^2}{\epsilon^2}$ iterations, the suboptimality is not guaranteed to decrease! For example, the algorithm might keep circling around the minimum, at or below $\epsilon$-suboptimality.) Here is the caveat: $\frac{L^2 \|x_0 - x^\star\|^2}{\epsilon^2}$ is unknown because we do not know $L$ nor $\|x_0 - x^\star\|$. In other words, the stated complexity holds only if we assume to know things that we do not know!
I am sure at this point you are trying to find a bug in my reasoning, because you cannot believe that Nesterov’s universal algorithm is not universal! So, let me present two pieces of evidence to show that the above problem is real.
First, I’ll show you how to obtain the exact same complexity with a very dumb algorithm. The algorithm is so dumb that you would have a hard time claiming that it is parameter-free. Here is the algorithm:

$$x_{t+1} = x_t - \eta g_t, \quad g_t \in \partial f(x_t),$$

where $\eta = \frac{\epsilon}{L^2}$ and $\epsilon$ is again the desired suboptimality.
From the usual analysis of subgradient descent, we have

$$\sum_{t=1}^{T} \big(f(x_t) - f(x^\star)\big) \le \frac{\|x_0 - x^\star\|^2}{2\eta} + \frac{\eta L^2 T}{2}.$$

Define $\bar{x}_T = \frac{1}{T}\sum_{t=1}^T x_t$, divide by $T$, and use Jensen’s inequality to obtain

$$f(\bar{x}_T) - f(x^\star) \le \frac{\|x_0 - x^\star\|^2}{2\eta T} + \frac{\eta L^2}{2} = \frac{\epsilon}{2} + \frac{L^2 \|x_0 - x^\star\|^2}{2 \epsilon T}.$$

Observe that this is the same guarantee as Nesterov’s algorithm above. So, again, we need $T$ to be greater than or equal to $\frac{L^2 \|x_0 - x^\star\|^2}{\epsilon^2}$.
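The derivation above is easy to check numerically. Here is a minimal sketch of the dumb algorithm on $f(x)=|x|$, where the starting point $x_0=5$ and the target $\epsilon=0.5$ are arbitrary choices of mine:

```python
import numpy as np

# Constant-step subgradient descent on f(x) = |x| with eta = eps / L^2.
# L = 1 (|x| is 1-Lipschitz), x* = 0, so ||x0 - x*|| = 5.
L, eps, x0 = 1.0, 0.5, 5.0
eta = eps / L**2
# Stop exactly at T >= L^2 ||x0 - x*||^2 / eps^2 (requires knowing L and x*!)
T = int(np.ceil(L**2 * x0**2 / eps**2))

x, iterates = x0, []
for _ in range(T):
    g = 1.0 if x >= 0 else -1.0  # a subgradient of |x|
    x = x - eta * g
    iterates.append(x)

xbar = np.mean(iterates)  # averaged iterate
# Guarantee: f(xbar) - f(x*) <= eps/2 + L^2 x0^2 / (2 eps T) = eps
```

Note that both the step size $\eta$ and the stopping time $T$ are computed from $L$ and $\|x_0 - x^\star\|$, which is exactly the knowledge a parameter-free algorithm is not supposed to have.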
I hope you see it now: both Nesterov’s algorithm and the dumb procedure above are not parameter-free. Both procedures sooner or later reach $\epsilon$-suboptimality, but both require knowledge of the unknown characteristics of $f$ to decide when to optimally stop. If you do not have the required knowledge, either you stop too soon and cannot guarantee anything, or you stop too late and the complexity bound becomes false.
While you are still thinking about the above proof, let me also present the second piece of evidence. That is, how is it possible that Nesterov called “universal” an algorithm that still requires knowledge of unknown quantities? It turns out that Nesterov had in mind the constrained and bounded setting. This is clear from the stopping condition that he proposes:

Here, you can see that you can implement this stopping condition only when the feasible set has a bounded diameter, or when you have some knowledge about the function. However, it is worth noting that in that case the algorithm stops, in the worst case, after a number of iterations of order $\frac{L^2 D^2}{\epsilon^2}$ rather than $\frac{L^2 \|x_0 - x^\star\|^2}{\epsilon^2}$, where $D$ is the diameter of the set, and the ratio between the two can be arbitrarily large.
2. Could we Tune $\epsilon$?
There might still be skeptical people, so let’s try to see if we can fix Nesterov’s algorithm. Let’s start from its guarantee:

$$f(\bar{x}_t) - f(x^\star) \le \frac{\epsilon}{2} + \frac{L^2 \|x_0 - x^\star\|^2}{2 \epsilon t},$$

where $\epsilon$ is a parameter of the algorithm.

An idea one might have is to say: I have a budget of $T$ iterations and I want to set $\epsilon$ such that the rate is optimal in $T$, $L$, and $\|x_0 - x^\star\|$. Maybe in this way we can get around the problem. So, one reasonable setting is $\epsilon = \frac{L \|x_0 - x^\star\|}{\sqrt{T}}$, which gives us

$$f(\bar{x}_T) - f(x^\star) \le \frac{L \|x_0 - x^\star\|}{\sqrt{T}}.$$
This is optimal in $T$, but now we have the problem of choosing $\epsilon$ without knowing anything. We could actually design an algorithm optimal in $L$ (very, very easy, see this blog post), but being optimal in $\|x_0 - x^\star\|$ is a completely different beast. In fact, let me remind you that this is non-smooth optimization, so you have zero information from the subgradient alone on how far you are from the optimum (think about the subgradients of $f(x) = |x|$).
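To make this last point concrete, here is a tiny sketch: for $f(x) = |x|$, the subgradient at any $x > 0$ is exactly $1$, whether we are at distance $10^{-6}$ or $10^{6}$ from the minimizer $x^\star = 0$, so the subgradient alone carries no information about $\|x - x^\star\|$.

```python
import numpy as np

# The subgradient of f(x) = |x| at x > 0 is sign(x) = 1, regardless of
# how far x is from the minimizer x* = 0.
subgrad = lambda x: np.sign(x)
distances_to_optimum = [1e-6, 1.0, 1e6]
same_subgradient = all(subgrad(x) == 1.0 for x in distances_to_optimum)
```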
3. Is There any Fix at All?
As I see it, the issue above is due to the fact that we are hiding the required knowledge inside the stopping condition. Pretty sneaky if you ask me!
The presence of $\epsilon$ is also an issue: if you construct an optimization algorithm that requires the desired suboptimality $\epsilon$ as input, then it is very likely that you will need a stopping criterion. In turn, stopping criteria are a very complex business in the unconstrained non-smooth setting.
What is the alternative? Suppose I had an algorithm that does not take any parameter and after $t$ iterations guarantees $f(\bar{x}_t) - f(x^\star) \le O\big(\frac{L \|x_0 - x^\star\|}{\sqrt{t}}\big)$. This would immediately imply the same complexity bound above, that is, I can have $\epsilon$-suboptimality after $O\big(\frac{L^2 \|x_0 - x^\star\|^2}{\epsilon^2}\big)$ iterations. However, you still cannot calculate this number of iterations, because it depends on $L$ and $\|x_0 - x^\star\|$, but here it is fine!
Let me explain why this would be fine. This is a delicate point and the key message of this entire post, so let’s go slowly. I think we agreed that 1) algorithms must terminate and 2) having a stopping condition that depends on unknown quantities is a bad idea. Instead, an algorithm that guarantees $f(\bar{x}_t) - f(x^\star) \le O\big(\frac{L \|x_0 - x^\star\|}{\sqrt{t}}\big)$ for any number of iterations $t$ is telling me that no matter when I stop, I am always sure that the algorithm did the best possible thing to minimize the (worst-case) function, in the sense that the unknown suboptimality depends optimally on everything: $t$, $L$, and $\|x_0 - x^\star\|$. So, with this second hypothetical algorithm we still do not know when to stop to guarantee $\epsilon$-suboptimality, but whenever we stop we know the algorithm did the best possible thing!
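The shape of such an anytime guarantee can be illustrated numerically. The sketch below is *not* parameter-free (it uses $L$ and a bound $D$ on $\|x_0 - x^\star\|$, and the specific instance $f(x)=|x|$ is my choice); it only serves to show the “stop whenever you want” property: projected subgradient descent with decaying steps $\eta_t = D/(L\sqrt{t})$ satisfies $f(\bar{x}_t) - f(x^\star) \le \frac{3}{2}\frac{L D}{\sqrt{t}}$ for every $t$ simultaneously.

```python
import numpy as np

# Anytime guarantee demo (NOT parameter-free: this method needs L and D).
# Projected subgradient descent on f(x) = |x|, feasible set [-D, D],
# steps eta_t = D / (L sqrt(t)). For EVERY t, the averaged iterate obeys
#   f(xbar_t) - f(x*) <= 1.5 * L * D / sqrt(t).
L, D = 1.0, 5.0
x, xs = 5.0, []
for t in range(1, 1001):
    g = 1.0 if x >= 0 else -1.0                         # subgradient of |x|
    x = np.clip(x - (D / (L * np.sqrt(t))) * g, -D, D)  # step + projection
    xs.append(x)

# No stopping condition was needed: the bound holds at every horizon.
gaps = {t: abs(np.mean(xs[:t])) for t in [1, 10, 100, 1000]}
```

Whichever of the four horizons we pick as our “stopping time”, the suboptimality is bounded by the optimal-order rate at that horizon; that is exactly the comfort the hypothetical parameter-free algorithm would give, without needing $L$ and $D$.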
This second type of guarantee is less common because classic optimization focuses on producing a solution with a predefined accuracy. Think for example of an algorithm to calculate eigenvalues: you decide the desired accuracy and the algorithm stops only when that accuracy is reached (in some measure), or it gives up with a warning saying that it reached the maximum number of iterations. Instead, in statistical learning theory we are very used to this type of guarantee, proving, for example, that the difference between the test error of a predictor and the test error of the best predictor is upper bounded by a term that depends optimally on all the unknown characteristics of the problem. Again, we cannot calculate how many samples we need to have a certain gap from the optimal predictor, because these bounds depend on unknown quantities. However, we still have the comfort of knowing that, no matter how many samples we use, the gap shrinks at the optimal rate. And I feel this is exactly the right thing to show.
4. What is the Best we Can (Currently) Do?
As far as I know, unfortunately we do not have an algorithm that guarantees $\epsilon$-suboptimality in $O\big(\frac{L^2 \|x_0 - x^\star\|^2}{\epsilon^2}\big)$ iterations without paying any additional price. The best currently known asymptotic rate in the deterministic case is $\frac{L \|x_0 - x^\star\|}{\sqrt{t}}$ up to a logarithmic factor, as I showed in another blog post, assuming only access to subgradients. If we also have access to function values, you should read this hidden gem: “Fast Bundle-Level Type Methods for Unconstrained and Ball-Constrained Convex Optimization” by Yunmei Chen, Guanghui Lan, Yuyuan Ouyang, and Wei Zhang.
This paper also has an interesting story that reminds us of how poor any scientific community is at recognizing good work: