*This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week.*

* You can find the lectures I published till now here.*

**1. Online Mirror Descent **

Last time we introduced the Online Mirror Descent (OMD) algorithm, in Algorithm 1. We also said that we need one of these two assumptions to hold.

We also proved the following Lemma.

Lemma 1Let the Bregman divergence w.r.t. and assume to be -strongly convex with respect to in . Let a closed convex set. Set . Assume (1) or (2) hold. Then, and with the notation in Algorithm 1, the following inequality holds

Today, we will finally prove a regret bound for OMD.

Theorem 2Set such that is differentiable in . Assume . Then, under the assumptions of Lemma 1 and , the following regret bounds hold

Moreover, if is constant, i.e. , we have

*Proof:* Fix . As in the proof of OGD, dividing the inequality in Lemma 1 by and summing from , we get

where we denoted by .

The second statement is left as an exercise.

In words, OMD allows us to prove regret guarantees that depend on arbitrary couple of dual norms and . In particular, the primal norm will be used to measure the feasible set or the distance between the competitor and the initial point, and the dual norm will be used to measure the gradients. If you happen to know something about these quantities, we can choose the most appropriate couple of norm to guarantee a small regret. The only thing you need is a function that is strongly convex with respect to the primal norm you have chosen .

Overall, the regret bound is still of the order of for Lipschitz functions, that only difference is that now the Lipschitz constant is measured with a different norm. Also, everything we did for Online Subgradient Descent (OSD) can be trivially used here. So, for example, we can use stepsize of the form

to achieve a regret bound of .

Next time, we will see practical examples of OMD that guarantee strictly better regret than OSD. As we did in the case of AdaGrad, the better guarantee will depend on the shape of the domain and the characteristics of the subgradients.

Instead, now we see the meaning of the “Mirror”.

**2. The “Mirror” Interpretation **

First, we need a couple of convex analysis results.

When we introduced the Fenchel conjugate, we said that iff , that in words means that in the sense of multivalued mappings. Now, we state a stronger result for the case that the function is strongly convex.

Theorem 3Let be a proper, convex, and closed function, strongly convex w.r.t. . Then,

- is finite everywhere and differentiable.
- is -smooth w.r.t. .

We will also use the following optimality condition.

Theorem 4Let proper. Then iff .

Hence, we can state the following theorem.

Theorem 5Let the Bregman divergence w.r.t. , where is strongly convex. Let a non-empty closed convex set such that . Then

*Proof:* Consider the update rule in Algorithm 1 and let’s see

Now, we want to use the first order optimality condition, so we have to use a little bit of subdifferential calculus. Given that , by the subdifferential calculus theorem we saw, we have . So, we have

that is

or equivalently

Using that fact that is -strongly convex, we have that . Hence

Noting that for vectors in , we have the stated bound.

Let’s explain what this theorem says. We said that Online Mirror Descent extends the Online Subgradient Descent method to non-euclidean norms. Hence, the regret bound we proved contains dual norms, that measure the iterate and the gradients. We also said that it makes sense to use a dual norm to measure a gradient, because it is a natural way to measure how “big” is the linear functional . In a more correct way, gradients actually live in the *dual space*, that is in a different space of the predictions . Hence, we cannot sum iterates and gradients together, in the same way in which we cannot sum pear and apples together. So, why we were doing it in OSD? The reason is that in that case the dual space coincides with the primal space. But, it is a very particular case due to the fact that we used the L2 norm. Instead, in the general case, iterates and gradients are in two different spaces.

So, in OMD we need a way to go from one space to the other. And this is exactly the role of and , that are called *duality mappings*. We can now understand that the theorem tells us that OMD takes the primal vector , transforms it into a dual vector through , does a subgradient descent step in the dual space, and finally transforms the vector back to the primal space through . This reasoning is summarized in Figure 1.

Example 1Let equal to where . Then,

Solving the constrained optimization problem, we have . Hence, we have

that is finite everywhere and differentiable. So, we have and

So, using (3), we obtain exactly the update of projected online subgradient descent.

**3. Yet Another Way to Write the OMD Update **

There exists yet another way to write the update of OMD. This third method uses the concept of *Bregman projections*. Extending the definition of Euclidean projections, we can define the projection with respect to a Bregman divergence. Let be defined by

In the online learning literature, the OMD algorithm is typically presented with a two step update: first, solving the argmin over the entire space and then projecting back over with respect to the Bregman divergence. In the following, we show that most of the time the two-step update is equivalent to the one-step update in (3).

First, we prove a general theorem that allows to break the constrained minimization of functions in the minimization over the entire space plus and Bregman projection step.

Theorem 6Let proper, closed, strictly convex, and differentiable in . Also, let a non-empty, closed convex set with and assume that exists and . Denote by . Then the following hold:

- exists and is unique.
- .

*Proof:* For the first point, from (Bauschke, H. H. and Combettes, P. L., 2011, Proposition 11.12) and the existence of , we have that is coercive. So, from (Bauschke, H. H. and Combettes, P. L., 2011, Proposition 11.14), the minimizer of in exists. Given that is strictly convex, the minimizer must be unique too.

For the second point, from the definition of , we have . On the other hand, from the first-order optimality condition, we have . So, we have

that is . Given that is strictly convex, .

Now, note that, if , then

Also, defining is equal to for some and . Hence, under the assumption of the above theorem, we have that is equivalent to

The advantage of this update is that sometimes it gives two easier problems to solve rather than a single difficult one.

**4. History Bits **

Most of the online learning literature for OMD assumes to be *Legendre*, (Cesa-Bianchi, N. and Lugosi, G. , 2006, e.g.), that corresponds to assuming (1). This condition allows to prove that . However, it turns out that the Legendre condition is not necessary and we only need the function to be differentiable on the predictions . In fact, we only need one of the two conditions in (1) or (2) to hold. Removing the Legendre assumption makes it easier to use OMD with different combinations of feasibility sets/Bregman divergences. So, I didn’t introduce the concept of Legendre functions at all, relying instead on (a minor modification of) OMD as described by (Beck, A. and Teboulle, M., 2003).

**5. Exercises **

Exercise 1Find the conjugate function of defined over .

Exercise 2Generalize the concept of strong convexity to Bregman functions, instead of norms, and prove a logarithmic regret guarantee for such functions using OMD.

Very nice post! I never fully understood why the Legendre assumption was needed.

LikeLiked by 1 person

Hi, I have one question regarding Example 1: the supremum in the Fenchel conjugate of \psi(\theta) was written as a supremum of an expression dependent on \alpha in [-1, 1]. Could you elaborate on how the supremum can be written in that form? Thanks.

LikeLike

You can write x in the supremum as the sum of two orthogonal vectors: one parallel to theta and one orthogonal to theta. So, now you can maximize over both vectors. The orthogonal part can only decrease the argument of the sup, so it has to be zero. Hence, the sup over x in V is actually a sup over a vector parallel to theta with bounded norm and it becomes a sup over alpha in [-1,1]. Does it make sense?

LikeLiked by 1 person

That makes sense. After posting the question I realized that we could also use an observation that for whatever x^\star that maximizes \psi, if it is not already parallel to \theta then it can be rotated so that x^\star is parallel to \theta and the dot product is increased (while the norm ||x^\star|| stays the same). This argument holds only when the feasible set V is closed under rotation, which it is in this case since V is an Euclidean ball.

LikeLiked by 1 person

Excellent post! I’ve learned a lot – thank you very much!

I have a quick question regarding the regret bound in Theorem 2: if the norm of the subgradient g_t is in general unbounded for x_t in V. Are there any effective upper bounds we could use? Thank you.

LikeLike

In general, you cannot prove a vanishing average regret without any assumption on the losses. Assuming bounded subgradients is a common assumption, but other assumptions are possible too. For example, you could assume that the losses are smooth (that is, the gradient is Lipschitz), see post on L* bounds. In other cases, we can still have vanishing average regret with unbounded losses, as in the case of Universal Portfolio that I’ll post soon. So, it really depends on your specific application what kind of assumption you can use.

LikeLike