This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week.
You can find the lectures I published till now here.
In this lecture, we will introduce the Online Mirror Descent (OMD) algorithm. To explain its genesis, I think it is essential to understand what subgradients do. In particular, the negative subgradients are not always pointing towards a direction that minimizes the function. We already discussed this problem in a previous blog post, but copy&paste is free so I’ll repeat the important bits here.
1. Subgradients are not Informative
You know that in online learning we receive a sequence of loss functions and we have to output a vector before observing the loss function on which we will be evaluated. However, we can gain a lot of intuition if we consider the easy case that the sequence of loss functions is always a fixed function, i.e., . If our hypothetical online algorithm does not work in this situation, for sure it won’t work on the more general case.
In words, to minimize the left hand side of this equation, it is enough to minimize the right hand side, that is nothing else than the instantaneous linear regret on the linear function . This is the only reason why OSD works! However, I am sure you heard a million of times the (wrong) intuition that gradient points towards the minimum, and you might be tempted to think that the same (even more wrong) intuition holds for subgradients. Indeed, I am sure that even if we proved the regret guarantee based on (1), in the back of your mind you keep thinking “yeah, sure, it works because the subgradient tells me where to go to minimize the function”. Typically this idea is so strong that I have to present explicit counterexamples to fully convince a person.
So, take a look at the following examples that illustrate the fact that a subgradient does not always point in a direction where the function decreases.
Let , see Figure 1. The vector is a subgradient in of . No matter how we choose the stepsize, moving in the direction of the negative subgradient will not decrease the objective function. An even more extreme example is in Figure 2, with the function . Here, in the point , any positive step in the direction of the negative subgradient will increase the objective function.
2. Reinterpreting the Online Subgradient Descent Algorithm.
How Online Subgradient Descent works? It works exactly as I told you before: thanks to (1). But, what does that inequality really mean?
A way to understand how OSD algorithms works is to think that it minimizes a local approximation of the original objective function. This is not unusual for optimization algorithms, for example the Netwon’s algorithm constructs an approximation with a Taylor expansion truncated to the second term. Thanks to the definition of subgradients, we can immediately build a linear lower bound to a function around :
So, in our setting, this would mean that we update the online algorithm with the minimizer of a linear approximation of the loss function you received. Unfortunately, minimizing a linear function is unlikely to give us a good online algorithm. Indeed, over unbounded domains the minimum of a linear function is .
So, let’s introduce the other key concept: we constraint the minimization of this lower bound only in a neighborhood of , where we have good reason to believe that the approximation is more precise. Coding the neighborhood constraint with a L2 squared distance from less than some positive number , we might think to use the following update
This is a well-defined update scheme, that hopefully moves closer to the optimum of . See Figure 2 for a graphical representation in one-dimension.
And now the final element of our story: the argmin in (2) is exactly the update we used in OSD! Indeed, solving the argmin and completing the square, we get
where is the Euclidean projection onto , i.e. .
The new way to write the update of OSD in (2) will be the core ingredient for designing Online Mirror Descent. In fact, OMD is a strict generalization of that update when we use a different way to measure the locality of from . Tha is, we measured the distance between to the current point with the squared L2 norm. What happens if we change the norm? Do we even have to use a norm?
To answer these questions we have to introduce another useful mathematical object: the Bregman divergence.
3. Convex Analysis Bits: Bregman Divergence
We first give a new definition, a slightly stronger notion of convexity.
Definition 1 Let and a convex set. is strictly convex if
From the definition, it is immediate to see that strong convexity w.r.t. any norm implies strict convexity. Note that for a differentiable function, strict convexity also implies that for (Bauschke, H. H. and Combettes, P. L., 2011, Proposition 17.13).
We now define our new notion of “distance”.
Definition 2 Let be strictly convex and continuously differentiable on . The Bregman Divergence w.r.t. is defined as
From the definition, we see that the Bregman divergence is always non-negative for , from the convexity of . However, something stronger holds. By the strict convexity of , fixed a point we have that , with equality only for . Hence, the strict convexity allows us to use the Bregman divergence as a similarity measure between and . Moreover, this similarity measure changes with the reference point . This also implies that, as you can see from the definition, the Bregman divergence is not symmetric.
Let me give you some more intuition on the concept Bregman divergence. Consider the case that is twice differentiable in a ball around and . So, by the Taylor’s theorem, there exists such that
where . Hence, we are using a metric that depends on the Hessian of . Different areas of the space will have a different value of the Hessian, and so the Bregman will behave differently.
Example 2 If , then .
We also have the following lemma that links the Bregman divergences between 3 points.
Lemma 3 (Chen, Gong and Teboulle, Marc, 1993) Let the Bregman divergence w.r.t. . Then, for any three points and , the following identity holds
4. Online Mirror Descent
Based on what we said before, we can start from the equivalent formulation of the OSD update
and we can change the last term with another measure of distance. In particular, using the Bregman divergence w.r.t. a function , we have
These two updates are exactly the same when .
So we get the Online Mirror Descent algorithm in Algorithm 4.
However, without an additional assumption, this algorithm has a problem. Can you see it? The problem is that might be on the boundary of and in the next step we would have to evaluate for a point on the boundary of . Given that , we might end up on the boundary of where the Bregman is not defined!
Now we have a well-defined algorithm, but does it guarantee a sublinear regret? We know that at least in one case it recovers the OSD algorithm, that does work. So, from an intuitive point of view, how well the algorithm work should depend on some characteristic on . In particular, a key property will be the strong convexity of .
To analyze OMD, we first prove a one step relationship, similar to the one we proved for Online Gradient Descent and OSD. Note how in this Lemma, we will use a lot of the concepts we introduced till now: strong convexity, dual norms, subgradients, Fenchel-Young inequality, etc. In a way, over the past lectures I slowly prepared you to be able to prove this lemma.
Lemma 4 Let the Bregman divergence w.r.t. and assume to be -strongly convex with respect to in . Let a convex set. Set . Assume (4) or (5) hold. Then, and with the notation in Algorithm 1, the following inequality holds
The lower bound with the function values is due, as usual, to the definition of subgradients.
Next time, we will see how to use this one step relationship to prove a regret bound, that will finally show us if and when this entire construction is a good idea. In fact, it is worth stressing that the above motivation is not enough in any way to justify the existence of the OMD algorithm. Also, next time we will explain why the algorithm is called Online “Mirror” Descent.
Exercise 1 Prove that the defined in Example 3 is 1-strongly convex w.r.t. the L1 norm.
Exercise 2 Derive a closed form update for OMD when using the of Example 3 and .
6. History Bits
Mirror Descent (MD) was introduced by (Nemirovsky, A.S. and Yudin, D., 1983) in the batch setting. The description of MD with Bregman divergence that I described here (with minor changes) was done by (Beck, A. and Teboulle, M., 2003). The minor changes are in decoupling the domain of from the feasibility set . This allows to use functions that do not satisfy the condition (4) but they satisfy (5). In the online setting, the mirror descent scheme was used for the first time by (Warmuth, M. K. and Jagota, A. K., 1997).