*This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week.*

* You can find the lectures I published till now here.*

In this lecture we will present some lower bounds for online linear optimization (OLO). Remembering that linear losses are convex, this immediately gives us lower bounds for online convex optimization (OCO). We will consider both the constrained and the unconstrained case. The lower bounds are important because they inform us on what are the optimal algorithms and where are the gaps in our knowledge.

**1. Lower bounds for Bounded OLO **

We will first consider the bounded constrained case. Finding a lower bound accounts to find a strategy for the adversary that forces a certain regret onto the algorithm, *no matter what the algorithm does*. We will use the probabilistic method to construct our lower bound.

The basic method relies on the fact that for any random variable with domain , and any function

For us, it means that we can lower bound the effect of the worst-case sequence of functions with an expectation over any distribution over the functions. If the distribution is chosen wisely, we can expect that gap in the inequality to be small. Why do you rely on expectations rather than actually constructing an adversarial sequence? Because the use of a stochastic loss functions makes very easy to deal with arbitrary algorithms. In particular, we will choose stochastic loss functions that makes the expected loss of the algorithm 0, independently from the strategy of the algorithm.

Theorem 1Let be any non-empty bounded closed convex subset. Let be the diameter of . Let be any (possibly randomized) algorithm for OLO on . Let be any non-negative integer. There exists a sequence of vectors with and such that the regret of algorithm satisfies

*Proof:* Let’s denote by . Let such that . Let , so that . Let be i.i.d. Rademacher random variables, that is and set the losses .

where we used and the are independent in the first equality, in the fourth equality, and Khintchine inequality in the last inequality.

We see that the lower bound is a constant multiplicative factor from the upper bound we proved for Online Subgradient Descent (OSD) with learning rates or . This means that OSD is asymptotically optimal with both settings of the learning rate.

At this point there is an important consideration to do: How can this be the optimal regret when we managed to proved better regret, for example with adaptive learning rates? The subtlety is that, constraining the adversary to play -Lipschitz losses, the adversary could always force on the algorithm at least the regret in Theorem 1. However, we can design algorithms that take advantage of *suboptimal plays of the adversary*. Indeed, for example, if the adversary plays in a way that all the subgradients have the same norm equal to , there is nothing to adapt to!

We now move to the unconstrained case, however first we have to enrich our math toolbox with another essential tool, *Fenchel conjugates*.

**2. Convex Analysis Bits: Fenchel Conjugate **

Definition 2A function isclosediff is closed for every .

Note that in a Hausdorff space a function is closed iff it is lower semicontinuos (Bauschke, H. H. and Combettes, P. L., 2011, Lemma 1.24).

Example 1The indicator function of a set , is closed iff is closed.

Definition 3For a function , we define theFenchel conjugateas

From the definition we immediately obtain the Fenchel’s inequality

We have the following useful property for the Fenchel conjugate

Theorem 4 (Rockafellar, R. T., 1970, Corollary 23.5.1 and Theorem 23.5)Let be convex, proper, and closed. Then

- iff .
- achieves its supremum in at iff .

Example 2Let , hence we have . Solving the optimization, we have that if and , , and for .

Example 3Consider the function , where is a norm in , with dual norm . We can show that its conjugate is . From

for all . The right hand side is a quadratic function of , which has maximum value . Therefore for all , we have

which shows that . To show the other inequality, let be any vector with , scaled so that . Then we have, for this ,

which shows that .

Lemma 5Let be a function and let be its Fenchel conjugate. For and , the Fenchel conjugate of is .

*Proof:* From the definition of conjugate function, we have

Lemma 6 (Bauschke, H. H. and Combettes, P. L., 2011, Example 13.7)Let even. Then .

**3. Unconstrained OLO **

The above lower bound applies only to the constrained setting. In the unconstrained setting, we proved that OSD with and constant learning rate of gives a regret of for any . Is this regret optimal? It is clear that the regret must be at least linear in , but is linear enough?

The approach I will follow is to *reduce the OLO game to the online game of betting on a coin*, where the lower bounds are known. So, let’s introduce the coin-betting online game:

- Start with an initial amount of money .
- In each round, the algorithm bets a fraction of its current wealth on the outcome of a coin.
- The outcome of the coin is revealed and the algorithm wins or lose its bet, 1 to 1.

The aim of this online game is to win as much money as possible. Also, as in all the online games we consider, we do not assume anything on how the outcomes of the coin are decided. Note that this game can also be written as OCO using the log loss.

We will denote by the outcomes of the coin. We will use the absolute value of to denote the fraction of money to bet and its sign to denote on which side we are betting. The money the algorithm has won from the beginning of the game till the end of round will be denoted by and given that the money are won or lost 1 to 1, we have

where we used the fact that . We will also denote by the bet of the algorithm on round .

If we got all the outcomes of the coin correct, we would double our money in each round, so that . However, given the adversarial nature of the game, we can actually prove a stronger lower bound to the maximum wealth we can gain.

Theorem 7 (Cesa-Bianchi, N. and Lugosi, G. , 2006, a simplified statement of Theorem 9.2)Let . Then, for any betting strategy with initial money that bets fractions of its current wealth, there exists a sequence of coin outcomes , such that

Now, let’s connect the coin-betting game with OLO. Remember that proving a regret guarantee in OLO consists in showing

for some function , where we want the dependency on to be sublinear. Using our new learned concept of Fenchel conjugate, this is equivalent to prove that

Hence, for a given online algorithm we can prove regret bounds proving that there exists a function or equivalently finding its conjugate . Similarly, proving a lower bound in unconstrained OLO means finding a sequence of and a function or a function that lower bound the regret or the cumulative losses of the algorithm respectively.

Without any other information, it can be challenging to guess what is the slowest increasing function . So, we restrict our attention to online algorithms that guarantee a constant regret against the zero vector. This immediately imply the following important consequence.

Theorem 8Let a non-decreasing function of the index of the rounds and an OLO algorithm that guarantees for any sequence of with . Then, there exists such that and for .

*Proof:* Define the “reward” of the algorithm. So, we have

Since, we assumed that , we always have . Using this, we claim that for all . To see this, assume that there is a sequence that gives . We then set . For this sequence, we would have , that contradicts the observation that .

So, from the fact that we have that there exists such that for a and .

This theorem informs us of something important: *any OLO algorithm that suffer a non-decreasing regret against the null competitor must predict in the form of a “vectorial” coin-betting algorithm*. This immediately implies the following.

Theorem 9Let . For any OLO algorithm, under the assumptions of Theorem 8, there exist a sequence of with and , such that

*Proof:* The proof works by reducing the OLO game to a coin-betting game and then using the upper bound to the reward for coin-betting games.

First, set , where will be defined in the following, so that for any . Given Theorem 8, we have that the first coordinate of has to satisfy

for some such that . Hence, the above is nothing else than a coin-betting algorithm that bets money on the outcome of a coin , with initial money . This means that the upper bound to its reward in Theorem 7 applies: there exists a sequence of such that

where is the Fenchel conjugate of the function and by Theorem 4 part 2. Using the closed form solution for the Fenchel conjugate as well as its lower bound from (Orabona, F. and Pal, D., 2016) and reordering the terms, we get the stated bound.

From the above theorem we have that OSD with learning rate does not have the optimal dependency on for any .

In the future classes, we will see that the connection between coin-betting and OLO can also be used to design OLO algorithm. This will give us *optimal unconstrained OLO algorithms with the surprising property of not requiring a learning rate at all*.

**4. History Bits **

The lower bound for OCO is quite standard, the proof presented is a simplified version of the one in (F. Orabona and D. Pál, 2018).

On the other hand, both the online learning literature and the optimization one almost ignored the issue of lower bounds for the unconstrained case. The connection between coin betting and OLO was first unveiled in (Orabona, F. and Pal, D., 2016). Theorem 8 is an unpublished result by Ashok Cutkosky (Thanks for allowing me to use it here!), that proved similar and more general results in his PhD thesis (Cutkosky, A., 2018). Theorem 9 is new, by me. (McMahan, B. and Abernethy, J., 2013) implicitly also proposed using the conjugate function for lower bound in unconstrained OLO.

There is a caveat in the unconstrained lower bound: A stronger statement would be to choose the norm of beforehand. To do this, we would have to explicitely construct the sequence of the . One way to do is to use Rademacher coins and then leverage again the hypothesis on the regret against the null competitor. This route was used in (M. Streeter and B. McMahan, 2012), but the proof relied on assuming the value of the global optimum of a non-convex function with an infinite number of local minima. The correct proof avoiding that step was then given in (Orabona, F., 2013). Yet, the proof presented here, that converts reward upper bounds in regret lower bound, is simpler in spirit and (I hope!) more understandable. Given that, as far as I know, this is the first time that unconstrained OLO lower bounds are taught in a class, I valued simplicity over generality.

**5. Exercises **

Exercise 1Fix . Mimicking the proof of Theorem 1, prove that for any OCO algorithm there exists a and a sequence of loss functions such that where and the loss functions are -Lipschitz w.r.t. .

Exercise 2Extend the proof of Theorem 1 to an arbitrary norm to measure the diameter of and with .

Exercise 3Let be even. Prove that is even

Exercise 4Find the exact expression of the conjugate function of , for . Hint: Wolfram Alpha or any other kind of symbolic solver can be very useful for this type of problems.

Update: Fixed a problem in the proof of Theorem 8 that one of my students anonymously posted on our internal forum.

LikeLike

Hi Francesco,

First fo all, nice post! I’m pretty sure I may be missing something simple, but I don’t quite see why Theorem 1 requires a constrained domain. If we remove that assumption, where does the proof break? We could define v and w in a slightly different way. Define them such that ||v-w||=1, and change the equality at the beginning of line 3 of the proof to an inequality (restricting the optimization domain to {v, w}). Then, I feel that all of the steps would still be valid. Of course, the diameter D won’t show up anymore, but that’s a constant. What’s wrong with these modifications?

Thanks!

LikeLike

Hi Mass,

yes, you are very right! The theorem would work also for unbounded domains. However, the resulting lower bound would be supoptimal because it would lack of the logarithmic factor we get from the other proof. I guess I should change a bit the text to explain this bit.

LikeLike