This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week.
You can find the lectures I have published so far here.
In this lecture we will present some lower bounds for online linear optimization (OLO). Remembering that linear losses are convex, this immediately gives us lower bounds for online convex optimization (OCO) as well. We will consider both the constrained and the unconstrained case. The lower bounds are important because they tell us which algorithms are optimal and where the gaps in our knowledge are.
1. Lower bounds for Bounded OLO
We will first consider the bounded, constrained case. Finding a lower bound amounts to finding a strategy for the adversary that forces a certain regret onto the algorithm, no matter what the algorithm does. We will use the probabilistic method to construct our lower bound.
The basic method relies on the fact that for any random variable $z$ with domain $\mathcal{Z}$, and any function $f$, we have
$\sup_{z \in \mathcal{Z}} f(z) \ge \mathbb{E}[f(z)].$
For us, it means that we can lower bound the effect of the worst-case sequence of functions with an expectation over any distribution over the functions. If the distribution is chosen wisely, we can expect the gap in the inequality to be small. Why do we rely on expectations rather than directly constructing an adversarial sequence? Because the use of stochastic loss functions makes it very easy to deal with arbitrary algorithms. In particular, we will choose stochastic loss functions that make the expected loss of the algorithm 0, independently of the strategy of the algorithm.
Theorem 1 Let $V \subset \mathbb{R}^d$ be any non-empty bounded closed convex subset. Let $D = \sup_{v, w \in V} \|v - w\|_2$ be the diameter of $V$. Let $\mathcal{A}$ be any (possibly randomized) algorithm for OLO on $V$. Let $T$ be any non-negative integer. There exists a sequence of vectors $g_1, \dots, g_T$ with $\|g_t\|_2 \le L$ and $u \in V$ such that the regret of algorithm $\mathcal{A}$ satisfies
$\text{Regret}_T(u) = \sum_{t=1}^T \langle g_t, x_t \rangle - \sum_{t=1}^T \langle g_t, u \rangle \ge \frac{\sqrt{2}}{4} L D \sqrt{T}.$
Proof: Let's denote by $\text{Regret}_T = \max_{u \in V} \text{Regret}_T(u)$. Let $v, w \in V$ such that $\|v - w\|_2 = D$. Let $z = \frac{v - w}{D}$, so that $\|z\|_2 = 1$. Let $\epsilon_1, \dots, \epsilon_T$ be i.i.d. Rademacher random variables, that is $\Pr[\epsilon_t = 1] = \Pr[\epsilon_t = -1] = \frac{1}{2}$, and set the losses $g_t = L \epsilon_t z$. Then,
$\mathbb{E}[\text{Regret}_T] = \mathbb{E}\left[\sum_{t=1}^T L \epsilon_t \langle z, x_t \rangle - \min_{u \in V} \sum_{t=1}^T L \epsilon_t \langle z, u \rangle\right] = \mathbb{E}\left[\max_{u \in V} \left\langle -L \sum_{t=1}^T \epsilon_t z, u \right\rangle\right] \ge \mathbb{E}\left[\max_{u \in \{v, w\}} \left\langle -L \sum_{t=1}^T \epsilon_t z, u \right\rangle\right] = \mathbb{E}\left[\left\langle -L \sum_{t=1}^T \epsilon_t z, \frac{v + w}{2} \right\rangle + L \left|\sum_{t=1}^T \epsilon_t\right| \left\langle z, \frac{v - w}{2} \right\rangle\right] = \frac{L D}{2} \mathbb{E}\left[\left|\sum_{t=1}^T \epsilon_t\right|\right] \ge \frac{L D \sqrt{T}}{2 \sqrt{2}} = \frac{\sqrt{2}}{4} L D \sqrt{T},$
where we used $\mathbb{E}[\epsilon_t \langle z, x_t \rangle] = 0$, which holds because the $\epsilon_t$ are i.i.d. and $x_t$ depends only on the past, in the second equality, the identity $\max(a, b) = \frac{a + b}{2} + \frac{|a - b|}{2}$ in the third equality, $\mathbb{E}[\sum_{t=1}^T \epsilon_t] = 0$ and $\langle z, v - w \rangle = D$ in the fourth equality, and the Khintchine inequality $\mathbb{E}\left|\sum_{t=1}^T \epsilon_t\right| \ge \sqrt{T/2}$ in the last inequality. Finally, since the expected regret is lower bounded, there must exist at least one realization of the $\epsilon_t$, and hence a deterministic sequence of $g_t$, whose regret is at least $\frac{\sqrt{2}}{4} L D \sqrt{T}$. $\Box$
We see that the lower bound is within a constant multiplicative factor of the upper bound we proved for Online Subgradient Descent (OSD) with learning rates $\eta_t = \frac{D}{L\sqrt{t}}$ or $\eta = \frac{D}{L\sqrt{T}}$. This means that OSD is asymptotically optimal with both settings of the learning rate.
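To make the construction concrete, here is a minimal simulation of the adversary of Theorem 1 against projected OSD in one dimension. This is only an illustrative sketch: the domain, the constants, and the OSD implementation are my own choices, not taken from the lecture.

```python
import numpy as np

# Simulate the adversary of Theorem 1 against projected OSD on V = [-1, 1],
# so the diameter is D = 2. The adversary plays g_t = L * eps_t * z with
# i.i.d. Rademacher eps_t. All constants here are illustrative choices.

rng = np.random.default_rng(0)
L, D, T = 1.0, 2.0, 10_000
v, w = np.array([1.0]), np.array([-1.0])
z = (v - w) / D                          # unit vector along the diameter

x = np.zeros(1)                          # OSD iterate
eta = D / (L * np.sqrt(T))               # constant learning rate
cum_loss, S = 0.0, 0.0                   # algorithm's loss and sum of eps_t
for t in range(T):
    eps = rng.choice([-1.0, 1.0])
    g = L * eps * z
    cum_loss += float(g @ x)
    S += eps
    x = np.clip(x - eta * g, -1.0, 1.0)  # projection onto V = [-1, 1]

best_loss = min(L * S * float(z @ v), L * S * float(z @ w))  # best of {v, w}
print(f"regret      = {cum_loss - best_loss:8.1f}")
print(f"lower bound = {np.sqrt(2) / 4 * L * D * np.sqrt(T):8.1f}")
```

A single run fluctuates around the bound, since the guarantee is on the expected regret; averaging over many runs brings the empirical regret close to $\frac{LD}{2}\mathbb{E}\left|\sum_t \epsilon_t\right|$.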
At this point there is an important consideration to make: how can this be the optimal regret, when we managed to prove better regret guarantees, for example with adaptive learning rates? The subtlety is that, if we constrain the adversary to play $L$-Lipschitz losses, the adversary can always force on the algorithm at least the regret in Theorem 1. However, we can design algorithms that take advantage of suboptimal plays of the adversary. On the other hand, if the adversary plays in a way that all the subgradients have the same norm equal to $L$, there is nothing to adapt to!
We now move to the unconstrained case. First, however, we have to enrich our math toolbox with another essential tool: Fenchel conjugates.
2. Convex Analysis Bits: Fenchel Conjugate
Definition 2 A function $f : \mathbb{R}^d \to [-\infty, +\infty]$ is closed iff $\{x : f(x) \le \alpha\}$ is closed for every $\alpha \in \mathbb{R}$.
Note that in a Hausdorff space a function is closed iff it is lower semicontinuous (Bauschke, H. H. and Combettes, P. L., 2011, Lemma 1.24).
Example 1 The indicator function of a set $V \subseteq \mathbb{R}^d$, defined as $i_V(x) = 0$ for $x \in V$ and $+\infty$ otherwise, is closed iff $V$ is closed.
Definition 3 For a function $f : \mathbb{R}^d \to [-\infty, +\infty]$, we define the Fenchel conjugate $f^\star : \mathbb{R}^d \to [-\infty, +\infty]$ as
$f^\star(\theta) = \sup_{x \in \mathbb{R}^d} \ \langle \theta, x \rangle - f(x).$
From the definition we immediately obtain Fenchel's inequality:
$\langle \theta, x \rangle \le f(x) + f^\star(\theta), \quad \forall x, \theta \in \mathbb{R}^d.$
We have the following useful properties of the Fenchel conjugate.
Theorem 4 (Rockafellar, R. T., 1970, Corollary 23.5.1 and Theorem 23.5) Let $f : \mathbb{R}^d \to (-\infty, +\infty]$ be convex, proper, and closed. Then
1. $\theta \in \partial f(x)$ iff $x \in \partial f^\star(\theta)$.
2. $\langle \theta, x \rangle - f(x)$ achieves its supremum in $x$ at $x = x^\star$ iff $x^\star \in \partial f^\star(\theta)$.
Example 2 Let $f(x) = \exp(x)$, hence we have $f^\star(\theta) = \sup_{x} \ \theta x - \exp(x)$. Solving the optimization, we have that $f^\star(\theta) = \theta \ln \theta - \theta$ if $\theta > 0$ and $f^\star(0) = 0$, and $f^\star(\theta) = +\infty$ for $\theta < 0$.
Example 3 Consider the function $f(x) = \frac{1}{2}\|x\|^2$, where $\|\cdot\|$ is a norm in $\mathbb{R}^d$, with dual norm $\|\cdot\|_\star$. We can show that its conjugate is $f^\star(\theta) = \frac{1}{2}\|\theta\|_\star^2$. From the definition of dual norm, $\langle \theta, x \rangle - \frac{1}{2}\|x\|^2 \le \|\theta\|_\star \|x\| - \frac{1}{2}\|x\|^2$ for all $x$. The right hand side is a quadratic function of $\|x\|$, which has maximum value $\frac{1}{2}\|\theta\|_\star^2$. Therefore, for all $x$, we have
$\langle \theta, x \rangle - \frac{1}{2}\|x\|^2 \le \frac{1}{2}\|\theta\|_\star^2,$
which shows that $f^\star(\theta) \le \frac{1}{2}\|\theta\|_\star^2$. To show the other inequality, let $x$ be any vector with $\langle \theta, x \rangle = \|\theta\|_\star \|x\|$, scaled so that $\|x\| = \|\theta\|_\star$. Then we have, for this $x$,
$\langle \theta, x \rangle - \frac{1}{2}\|x\|^2 = \|\theta\|_\star^2 - \frac{1}{2}\|\theta\|_\star^2 = \frac{1}{2}\|\theta\|_\star^2,$
which shows that $f^\star(\theta) \ge \frac{1}{2}\|\theta\|_\star^2$.
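Examples like these are easy to sanity-check numerically, by replacing the supremum in the definition of the conjugate with a maximum over a fine grid. The grid and the test points below are arbitrary illustrative choices of mine.

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 200_001)   # grid approximating the sup over x

# Example 2: f(x) = exp(x) has f*(theta) = theta*ln(theta) - theta for theta > 0.
for theta in [0.5, 1.0, 3.0]:
    grid = np.max(theta * x - np.exp(x))
    closed = theta * np.log(theta) - theta
    print(f"exp: theta={theta}: grid {grid:.4f} vs closed form {closed:.4f}")

# Example 3 in one dimension: f(x) = x^2/2 has f*(theta) = theta^2/2.
for theta in [0.5, 1.0, 3.0]:
    grid = np.max(theta * x - x**2 / 2)
    print(f"sq : theta={theta}: grid {grid:.4f} vs closed form {theta**2 / 2:.4f}")
```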
Lemma 5 Let $f$ be a function and let $f^\star$ be its Fenchel conjugate. For $a > 0$ and $b \in \mathbb{R}$, the Fenchel conjugate of $g(x) = a f(x) + b$ is $g^\star(\theta) = a f^\star\left(\frac{\theta}{a}\right) - b$.
Proof: From the definition of conjugate function, we have
$g^\star(\theta) = \sup_{x} \ \langle \theta, x \rangle - a f(x) - b = a \sup_{x} \ \left\langle \frac{\theta}{a}, x \right\rangle - f(x) - b = a f^\star\left(\frac{\theta}{a}\right) - b. \qquad \Box$
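A quick numeric check of Lemma 5, using $f(x) = \frac{1}{2}x^2$ from Example 3 in one dimension; the values of $a$ and $b$ are illustrative choices.

```python
import numpy as np

# Lemma 5: the conjugate of g(x) = a*f(x) + b is g*(theta) = a*f*(theta/a) - b.
# Here f(x) = x^2/2, so f*(theta) = theta^2/2 (Example 3 in one dimension).

a, b = 2.0, 3.0
x = np.linspace(-50.0, 50.0, 500_001)
for theta in [0.5, 1.0, 4.0]:
    grid = np.max(theta * x - (a * x**2 / 2 + b))   # direct sup over a grid
    formula = a * (theta / a)**2 / 2 - b            # Lemma 5's closed form
    print(f"theta={theta}: grid {grid:.4f} vs formula {formula:.4f}")
```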
Lemma 6 (Bauschke, H. H. and Combettes, P. L., 2011, Example 13.7) Let $f : \mathbb{R} \to (-\infty, +\infty]$ be even. Then $(f \circ \|\cdot\|)^\star = f^\star \circ \|\cdot\|_\star$.
3. Unconstrained OLO
The above lower bound applies only to the constrained setting. In the unconstrained setting, we proved that OSD with $x_1 = 0$ and constant learning rate of $\eta = \frac{\alpha}{L\sqrt{T}}$ gives a regret of $\frac{1}{2}\left(\frac{\|u\|_2^2}{\alpha} + \alpha\right) L \sqrt{T}$ for any $u \in \mathbb{R}^d$. Is this regret optimal? It is clear that the regret must be at least linear in $\|u\|_2$, but is linear enough?
The approach I will follow is to reduce the OLO game to the online game of betting on a coin, where the lower bounds are known. So, let’s introduce the coin-betting online game:
- Start with an initial amount of money $\epsilon > 0$.
- In each round, the algorithm bets a fraction of its current wealth on the outcome of a coin.
- The outcome of the coin is revealed and the algorithm wins or loses its bet, 1 to 1.
The aim of this online game is to win as much money as possible. Also, as in all the online games we consider, we do not assume anything about how the outcomes of the coin are decided. Note that this game can also be written as OCO using the log loss.
We will denote by $c_t \in \{-1, 1\}$, $t = 1, \dots, T$, the outcomes of the coin. We will use the absolute value of $\beta_t \in [-1, 1]$ to denote the fraction of money to bet and its sign to denote on which side we are betting. The money the algorithm has won from the beginning of the game till the end of round $t$ will be denoted by $\text{Wealth}_t$, and given that the money is won or lost 1 to 1, we have
$\text{Wealth}_t = \epsilon + \sum_{i=1}^{t} c_i x_i = \epsilon \prod_{i=1}^{t} (1 + \beta_i c_i),$
where we used the fact that $x_i = \beta_i \text{Wealth}_{i-1}$. We will also denote by $x_t = \beta_t \text{Wealth}_{t-1}$ the bet of the algorithm on round $t$.
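As an illustration of the wealth recursion above, here is a minimal simulation of the betting protocol; the constant betting fraction and the coin bias are arbitrary choices of mine, not part of the lecture.

```python
import numpy as np

# Coin-betting protocol: Wealth_t = Wealth_{t-1} * (1 + beta_t * c_t),
# i.e., the bet is x_t = beta_t * Wealth_{t-1} and it is won or lost 1 to 1.

rng = np.random.default_rng(1)
eps, T = 1.0, 100                            # initial money and rounds
wealth = eps
for t in range(T):
    beta = 0.1                               # bet 10% of the wealth on heads (+1)
    c = rng.choice([-1, 1], p=[0.4, 0.6])    # a coin biased toward heads
    wealth *= 1 + beta * c
print(f"final wealth = {wealth:.3f} (started with {eps})")
```

Betting a constant fraction on a favorable coin grows the wealth exponentially; the next theorem shows what happens against adversarial outcomes.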
If we got all the outcomes of the coin correct, we would double our money in each round, so that $\text{Wealth}_T = \epsilon\, 2^T$. However, given the adversarial nature of the game, we can actually prove a much stronger upper bound on the maximum wealth we can gain.
Theorem 7 (Cesa-Bianchi, N. and Lugosi, G., 2006, a simplified statement of Theorem 9.2) Let $T \ge 21$. Then, for any betting strategy with initial money $\epsilon > 0$ that bets fractions of its current wealth, there exists a sequence of coin outcomes $c_1, \dots, c_T \in \{-1, 1\}$, such that
$\text{Wealth}_T \le \frac{\epsilon}{\sqrt{T}} \exp\left(\frac{\left(\sum_{t=1}^T c_t\right)^2}{2T}\right).$
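To get some intuition for why the adversary can cap the wealth, consider the crudest adversary: a coin that always lands on the side opposite to the bet. The sketch below uses an illustrative constant-fraction strategy; the actual theorem is stronger, since it also handles strategies that bet nothing.

```python
# A coin that always opposes the bet: the bettor loses |x_t| every round,
# so the wealth is eps * (1 - beta)^T for a constant fraction beta > 0.

eps, T, beta = 1.0, 100, 0.1
wealth = eps
for t in range(T):
    c = -1 if beta > 0 else 1    # the outcome opposes the side of the bet
    wealth *= 1 + beta * c       # the bet is lost, 1 to 1
print(f"wealth after {T} rounds: {wealth:.2e}")
```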
Now, let's connect the coin-betting game with OLO. Remember that proving a regret guarantee in OLO consists in showing
$\text{Regret}_T(u) = \sum_{t=1}^T \langle g_t, x_t \rangle - \sum_{t=1}^T \langle g_t, u \rangle \le \psi_T(u), \quad \forall u \in \mathbb{R}^d,$
for some function $\psi_T$, where we want the dependency on $T$ to be sublinear. Using our newly learned concept of Fenchel conjugate, this is equivalent to proving that
$-\sum_{t=1}^T \langle g_t, x_t \rangle \ge \sup_{u \in \mathbb{R}^d} \ \left\langle -\sum_{t=1}^T g_t, u \right\rangle - \psi_T(u) = \psi_T^\star\left(-\sum_{t=1}^T g_t\right).$
Hence, for a given online algorithm we can prove a regret bound either by exhibiting a function $\psi_T$ that upper bounds the regret, or, equivalently, by finding its conjugate $\psi_T^\star$.
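For instance, as a sketch using the unconstrained OSD bound recalled above: taking $\psi_T(u) = \frac{L\sqrt{T}}{2\alpha}\|u\|_2^2 + \frac{\alpha L \sqrt{T}}{2}$, Lemma 5 and Example 3 give
$\psi_T^\star(\theta) = \frac{\alpha}{2 L \sqrt{T}}\|\theta\|_2^2 - \frac{\alpha L \sqrt{T}}{2},$
so that the OSD regret bound is equivalent to the statement that the reward satisfies $-\sum_{t=1}^T \langle g_t, x_t \rangle \ge \frac{\alpha}{2 L \sqrt{T}}\left\|\sum_{t=1}^T g_t\right\|_2^2 - \frac{\alpha L \sqrt{T}}{2}$.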
Similarly, proving a lower bound in unconstrained OLO means finding a sequence of $g_t$ and a function $\psi_T$, or a function $\psi_T^\star$, that lower bound the regret or the cumulative losses of the algorithm, respectively.
Without any other information, it can be challenging to guess what the slowest-increasing function $\psi_T$ is. So, we restrict our attention to online algorithms that guarantee a constant regret against the zero vector. This immediately implies the following important consequence.
Theorem 8 Let $\epsilon_t$ be a non-decreasing function of the index of the rounds and $\mathcal{A}$ an OLO algorithm that guarantees $\text{Regret}_t(0) \le \epsilon_t$ for any sequence of $g_1, \dots, g_t \in \mathbb{R}^d$ with $\|g_i\|_2 \le 1$. Then, there exists $\beta_t \in \mathbb{R}^d$ such that $\|\beta_t\|_2 \le 1$ and
$x_t = \beta_t \left(\epsilon_t + \sum_{i=1}^{t-1} \langle -g_i, x_i \rangle\right)$
for $t = 1, \dots, T$.
Proof: Define the "reward" of the algorithm $\text{Reward}_t = \sum_{i=1}^{t} \langle -g_i, x_i \rangle$. So, we have $\text{Reward}_t = -\text{Regret}_t(0)$. Since we assumed that $\text{Regret}_t(0) \le \epsilon_t$, we always have
$\text{Reward}_t \ge -\epsilon_t$. Using this, we claim that $\|x_t\|_2 \le \text{Reward}_{t-1} + \epsilon_t$
for all $t$. To see this, assume that there is a sequence $g_1, \dots, g_{t-1}$
that gives $\|x_t\|_2 > \text{Reward}_{t-1} + \epsilon_t$. We then set $g_t = \frac{x_t}{\|x_t\|_2}$. For this sequence, we would have $\text{Reward}_t = \text{Reward}_{t-1} - \|x_t\|_2 < -\epsilon_t$, that contradicts the observation that $\text{Reward}_t \ge -\epsilon_t$.
So, from the fact that $\|x_t\|_2 \le \text{Reward}_{t-1} + \epsilon_t$, we have that there exists $\beta_t$ such that $x_t = \beta_t \left(\epsilon_t + \sum_{i=1}^{t-1} \langle -g_i, x_i \rangle\right)$ for a $\beta_t \in \mathbb{R}^d$ with $\|\beta_t\|_2 \le 1$. $\Box$
This theorem informs us of something important: any OLO algorithm whose regret against the null competitor is bounded by a non-decreasing $\epsilon_t$ must predict in the form of a "vectorial" coin-betting algorithm. This immediately implies the following.
Theorem 9 Let $T \ge 21$. For any OLO algorithm satisfying the assumptions of Theorem 8 with constant $\epsilon_t = \epsilon > 0$, there exist a sequence of $g_1, \dots, g_T \in \mathbb{R}^d$ with $\|g_t\|_2 \le 1$ and $u^\star \in \mathbb{R}^d$, such that
$\text{Regret}_T(u^\star) = \Omega\left(\|u^\star\|_2 \sqrt{T \ln\left(1 + \frac{\|u^\star\|_2 \sqrt{T}}{\epsilon}\right)}\right).$
Proof: The proof works by reducing the OLO game to a coin-betting game and then using the upper bound to the reward for coin-betting games.
First, set $g_t = [-c_t, 0, \dots, 0]$, where $c_t \in \{-1, 1\}$ will be defined in the following, so that $\|g_t\|_2 \le 1$ for any $t$. Given Theorem 8, we have that the first coordinate of $x_t$ has to satisfy
$x_{t,1} = \beta_{t,1} \left(\epsilon + \sum_{i=1}^{t-1} c_i x_{i,1}\right)$
for some $\beta_{t,1}$ such that $|\beta_{t,1}| \le 1$. Hence, the above is nothing else than a coin-betting algorithm that bets $x_{t,1}$ money on the outcome of a coin $c_t$, with initial money $\epsilon$. This means that the upper bound to its reward in Theorem 7 applies: there exists a sequence of $c_t$ such that
$\sum_{t=1}^T \langle -g_t, x_t \rangle = \sum_{t=1}^T c_t x_{t,1} \le \frac{\epsilon}{\sqrt{T}} \exp\left(\frac{\left(\sum_{t=1}^T c_t\right)^2}{2T}\right) - \epsilon =: \psi\left(\sum_{t=1}^T c_t\right).$
Hence, for any $u^\star = [u_1, 0, \dots, 0]$, we have
$\text{Regret}_T(u^\star) = \sum_{t=1}^T \langle g_t, x_t \rangle - \sum_{t=1}^T \langle g_t, u^\star \rangle \ge u_1 \sum_{t=1}^T c_t - \psi\left(\sum_{t=1}^T c_t\right) = \psi^\star(u_1),$
where $\psi^\star$ is the Fenchel conjugate of the function $\psi$, and the last equality holds choosing $u_1 \in \partial \psi\left(\sum_{t=1}^T c_t\right)$, so that the supremum in the definition of $\psi^\star$ is achieved, by Theorem 4 part 2. Using the closed form solution for the Fenchel conjugate, as well as its lower bound from (Orabona, F. and Pal, D., 2016), and reordering the terms, we get the stated bound. $\Box$
From the above theorem we have that OSD with learning rate $\eta = \frac{\alpha}{L\sqrt{T}}$ does not have the optimal dependency on $\|u\|_2$ for any $\alpha > 0$: its regret grows quadratically in $\|u\|_2$, while the lower bound grows only slightly faster than linearly.
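To see the gap concretely, here is a quick numeric comparison of the growth in $\|u\|_2$ of the OSD upper bound and of a curve with the shape of the lower bound in Theorem 9; the constants are illustrative choices, as only the growth rates matter.

```python
import numpy as np

# Growth in ||u|| of the OSD upper bound (quadratic in ||u||) versus the
# ||u|| * sqrt(T * ln(1 + ||u|| * sqrt(T)))-type shape of the lower bound.
# The constants (alpha = 1, epsilon = 1, L = 1) are illustrative.

T, alpha = 10_000, 1.0
for u in [0.1, 1.0, 10.0, 100.0]:
    osd = 0.5 * (u**2 / alpha + alpha) * np.sqrt(T)
    lb_shape = u * np.sqrt(T * np.log(1 + u * np.sqrt(T)))
    print(f"||u||={u:7.1f}: OSD bound {osd:12.1f}  lower-bound shape {lb_shape:12.1f}")
```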
In future classes, we will see that the connection between coin-betting and OLO can also be used to design OLO algorithms. This will give us optimal unconstrained OLO algorithms with the surprising property of not requiring a learning rate at all.
4. History Bits
The lower bound for OCO is quite standard; the proof presented here is a simplified version of the one in (F. Orabona and D. Pál, 2018).
On the other hand, both the online learning literature and the optimization one have almost ignored the issue of lower bounds for the unconstrained case. The connection between coin betting and OLO was first unveiled in (Orabona, F. and Pal, D., 2016). Theorem 8 is an unpublished result by Ashok Cutkosky (thanks for allowing me to use it here!), who proved similar and more general results in his PhD thesis (Cutkosky, A., 2018). Theorem 9 is new, by me. (McMahan, B. and Abernethy, J., 2013) also implicitly proposed using the conjugate function for lower bounds in unconstrained OLO.
There is a caveat in the unconstrained lower bound: a stronger statement would be to choose the norm of $u^\star$ beforehand. To do this, we would have to explicitly construct the sequence of the $g_t$. One way to do it is to use Rademacher coins and then leverage again the hypothesis on the regret against the null competitor. This route was used in (M. Streeter and B. McMahan, 2012), but the proof relied on assuming the value of the global optimum of a non-convex function with an infinite number of local minima. The correct proof avoiding that step was then given in (Orabona, F., 2013). Yet, the proof presented here, which converts reward upper bounds into regret lower bounds, is simpler in spirit and (I hope!) more understandable. Given that, as far as I know, this is the first time that unconstrained OLO lower bounds are taught in a class, I valued simplicity over generality.
5. Exercises
Exercise 1 Fix $U > 0$. Mimicking the proof of Theorem 1, prove that for any OCO algorithm there exists a $u^\star$ and a sequence of loss functions such that $\text{Regret}_T(u^\star) \ge \frac{\sqrt{2}}{4} U L \sqrt{T}$, where $\|u^\star\|_2 = U$ and the loss functions are $L$-Lipschitz w.r.t. $\|\cdot\|_2$.
Exercise 2 Extend the proof of Theorem 1 to an arbitrary norm $\|\cdot\|$ to measure the diameter of $V$, and with $\|g_t\|_\star \le L$.
Exercise 3 Let $f$ be even. Prove that $f^\star$ is even.
Exercise 4 Find the exact expression of the conjugate function of $f(x) = \alpha \exp(\beta x^2)$, for $\alpha, \beta > 0$. Hint: Wolfram Alpha or any other kind of symbolic solver can be very useful for this type of problem.
Update: Fixed a problem in the proof of Theorem 8 that one of my students anonymously posted on our internal forum.
Hi Francesco,
First of all, nice post! I’m pretty sure I may be missing something simple, but I don’t quite see why Theorem 1 requires a constrained domain. If we remove that assumption, where does the proof break? We could define v and w in a slightly different way. Define them such that ||v-w||=1, and change the equality at the beginning of line 3 of the proof to an inequality (restricting the optimization domain to {v, w}). Then, I feel that all of the steps would still be valid. Of course, the diameter D won’t show up anymore, but that’s a constant. What’s wrong with these modifications?
Thanks!
Hi Mass,
yes, you are very right! The theorem would also work for unbounded domains. However, the resulting lower bound would be suboptimal, because it would lack the logarithmic factor we get from the other proof. I guess I should change the text a bit to explain this point.