Online Mirror Descent I: Bregman version

This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week.

You can find the lectures I published till now here.

In this lecture, we will introduce the Online Mirror Descent (OMD) algorithm. To explain its genesis, I think it is essential to understand what subgradients do. In particular, the negative subgradients are not always pointing towards a direction that minimizes the function. We already discussed this problem in a previous blog post, but copy&paste is free so I’ll repeat the important bits here.

1. Subgradients are not Informative

You know that in online learning we receive a sequence of loss functions and we have to output a vector before observing the loss function on which we will be evaluated. However, we can gain a lot of intuition if we consider the easy case that the sequence of loss functions is always a fixed function, i.e., ${\ell_t({\boldsymbol x})=\ell({\boldsymbol x})}$ . If our hypothetical online algorithm does not work in this situation, for sure it won’t work on the more general case.

That said, we proved that the convergence of Online Subgradient Descent (OSD) depends on the following key property of the subgradients:

$\displaystyle \label{eq:subgradient} \ell({\boldsymbol x}_t)-\ell({\boldsymbol u})\leq \langle {\boldsymbol g}_t, {\boldsymbol x}_t -{\boldsymbol u}\rangle~. \ \ \ \ \ (1)$

In words, to minimize the left hand side of this equation, it is enough to minimize the right hand side, that is nothing else than the instantaneous linear regret on the linear function ${\langle {\boldsymbol g}_t, \cdot\rangle}$ . This is the only reason why OSD works! However, I am sure you heard a million of times the (wrong) intuition that gradient points towards the minimum, and you might be tempted to think that the same (even more wrong) intuition holds for subgradients. Indeed, I am sure that even if we proved the regret guarantee based on (1), in the back of your mind you keep thinking “yeah, sure, it works because the subgradient tells me where to go to minimize the function”. Typically this idea is so strong that I have to present explicit counterexamples to fully convince a person.

So, take a look at the following examples that illustrate the fact that a subgradient does not always point in a direction where the function decreases.

sd_not_descent1 — Figure 1. 3D plot (left) and level sets (right) of ${f(\boldsymbol x)=\max[-x_1,x_1-x_2,x_1+x_2]}$ . The negative subgradient is indicated by the black arrow.

Figure 1. 3D plot (left) and level sets (right) of ${f(\boldsymbol x)=\max[-x_1,x_1-x_2,x_1+x_2]}$ . The negative subgradient is indicated by the black arrow.

sd_not_descent2 — Figure 2. 3D plot (left) and level sets (right) of ${f(\boldsymbol x)=\max[x_1^2+(x_2+1)^2,x_1^2+(x_2-1)^2]}$ . The negative subgradient is indicated by the black arrow.

Figure 2. 3D plot (left) and level sets (right) of ${f(\boldsymbol x)=\max[x_1^2+(x_2+1)^2,x_1^2+(x_2-1)^2]}$ . The negative subgradient is indicated by the black arrow.

Example 1

Let ${f({\boldsymbol x})=\max[-x_1,x_1-x_2,x_1+x_2]}$ , see Figure 1. The vector ${{\boldsymbol g}=[1,1]}$ is a subgradient in ${{\boldsymbol x}=(1,0)}$ of ${f({\boldsymbol x})}$ . No matter how we choose the stepsize, moving in the direction of the negative subgradient will not decrease the objective function. An even more extreme example is in Figure 2, with the function ${f({\boldsymbol x})=\max[x_1^2+(x_2+1)^2,x_1^2+(x_2-1)^2]}$ . Here, in the point ${(1,0)}$ , any positive step in the direction of the negative subgradient will increase the objective function.

2. Reinterpreting the Online Subgradient Descent Algorithm.

How Online Subgradient Descent works? It works exactly as I told you before: thanks to (1). But, what does that inequality really mean?

A way to understand how OSD algorithms works is to think that it minimizes a local approximation of the original objective function. This is not unusual for optimization algorithms, for example the Netwon’s algorithm constructs an approximation with a Taylor expansion truncated to the second term. Thanks to the definition of subgradients, we can immediately build a linear lower bound to a function ${f}$ around ${{\boldsymbol x}_0}$ :

$\displaystyle f({\boldsymbol x}) \geq \tilde{f}({\boldsymbol x}) := f({\boldsymbol x}_0) + \langle {\boldsymbol g}, {\boldsymbol x}-{\boldsymbol x}_0\rangle, \ \forall {\boldsymbol x} \in V~.$

So, in our setting, this would mean that we update the online algorithm with the minimizer of a linear approximation of the loss function you received. Unfortunately, minimizing a linear function is unlikely to give us a good online algorithm. Indeed, over unbounded domains the minimum of a linear function is ${-\infty}$ .

So, let’s introduce the other key concept: we constraint the minimization of this lower bound only in a neighborhood of ${{\boldsymbol x}_0}$ , where we have good reason to believe that the approximation is more precise. Coding the neighborhood constraint with a L2 squared distance from ${{\boldsymbol x}_0}$ less than some positive number ${h}$ , we might think to use the following update

$\displaystyle \begin{aligned} {\boldsymbol x}_{t+1} = \mathop{\mathrm{argmin}}_{{\boldsymbol x} \in V} & \ f({\boldsymbol x}_t) + \langle {\boldsymbol g}, {\boldsymbol x}-{\boldsymbol x}_t\rangle \\ \text{s.t.} & \ \|{\boldsymbol x}_t-{\boldsymbol x}\|^2 \leq h~. \end{aligned}$

Equivalently, for some ${\eta>0}$ , we can consider the unconstrained formulation

$\displaystyle \label{eq:quad_approx} \mathop{\mathrm{argmin}}_{{\boldsymbol x} \in V} \ \hat{f}({\boldsymbol x}):=f({\boldsymbol x}_0) + \langle {\boldsymbol g}, {\boldsymbol x}-{\boldsymbol x}_0\rangle + \frac{1}{2\eta}\|{\boldsymbol x}_{0}-{\boldsymbol x}\|_2^2~. \ \ \ \ \ (2)$

This is a well-defined update scheme, that hopefully moves ${{\boldsymbol x}_t}$ closer to the optimum of ${f}$ . See Figure 2 for a graphical representation in one-dimension.

sd_approximations — Figure 4. Approximations of ${f({\boldsymbol x})}$ .

Figure 4. Approximations of ${f({\boldsymbol x})}$ .

And now the final element of our story: the argmin in (2) is exactly the update we used in OSD! Indeed, solving the argmin and completing the square, we get

$\displaystyle \begin{aligned} \mathop{\mathrm{argmin}}_{{\boldsymbol x} \in V} \ \langle {\boldsymbol g}_t, x\rangle + \frac{1}{2\eta_t}\|{\boldsymbol x}_{t}-{\boldsymbol x}\|_2^2 &= \mathop{\mathrm{argmin}}_{{\boldsymbol x} \in V} \ \|\eta_t {\boldsymbol g}_t\|^2 + 2\eta_t \langle {\boldsymbol g}_t, x-{\boldsymbol x}_t\rangle + \|{\boldsymbol x}_{t}-{\boldsymbol x}\|_2^2 \\ &= \mathop{\mathrm{argmin}}_{{\boldsymbol x} \in V} \ \|{\boldsymbol x}_t -\eta_t {\boldsymbol g}_t -{\boldsymbol x}\|_2^2 \\ &= \Pi_V({\boldsymbol x}_t - \eta_t {\boldsymbol g}_t), \end{aligned}$

where ${\Pi_V}$ is the Euclidean projection onto ${V}$ , i.e. ${\Pi_V({\boldsymbol x})=\mathop{\mathrm{argmin}}_{{\boldsymbol y} \in V} \ \|{\boldsymbol x}-{\boldsymbol y}\|_2}$ .

The new way to write the update of OSD in (2) will be the core ingredient for designing Online Mirror Descent. In fact, OMD is a strict generalization of that update when we use a different way to measure the locality of ${{\boldsymbol x}}$ from ${{\boldsymbol x}_t}$ . Tha is, we measured the distance between to the current point with the squared L2 norm. What happens if we change the norm? Do we even have to use a norm?

To answer these questions we have to introduce another useful mathematical object: the Bregman divergence.

3. Convex Analysis Bits: Bregman Divergence

We first give a new definition, a slightly stronger notion of convexity.

Definition 1 Let ${f:V \subseteq {\mathbb R}^d \rightarrow {\mathbb R}}$ and ${V}$ a convex set. ${f}$ is strictly convex if

$\displaystyle f(\alpha {\boldsymbol x} + (1-\alpha) {\boldsymbol y}) < \alpha f({\boldsymbol x}) + (1-\alpha) f({\boldsymbol y}), \ \forall {\boldsymbol x}, {\boldsymbol y} \in V, {\boldsymbol x}\neq {\boldsymbol y}, 0<\alpha<1~.$

From the definition, it is immediate to see that strong convexity w.r.t. any norm implies strict convexity. Note that for a differentiable function, strict convexity also implies that ${f({\boldsymbol y}) > f({\boldsymbol x}) + \langle \nabla f({\boldsymbol x}), {\boldsymbol y}-{\boldsymbol x}\rangle}$ for ${{\boldsymbol x}\neq{\boldsymbol y}}$ (Bauschke, H. H. and Combettes, P. L., 2011, Proposition 17.13).

We now define our new notion of “distance”.

Definition 2 Let ${\psi: X \rightarrow {\mathbb R}}$ be strictly convex and continuously differentiable on ${\mathop{\mathrm{int}} X}$ . The Bregman Divergence w.r.t. ${\psi}$ is ${B_\psi : X \times \mathop{\mathrm{int}} X \rightarrow {\mathbb R}}$ defined as

$\displaystyle B_\psi({\boldsymbol x}; {\boldsymbol y}) = \psi({\boldsymbol x}) - \psi({\boldsymbol y}) - \langle \nabla \psi({\boldsymbol y}), {\boldsymbol x} - {\boldsymbol y}\rangle~.$

From the definition, we see that the Bregman divergence is always non-negative for ${{\boldsymbol x},{\boldsymbol y}\in \mathop{\mathrm{int}} X}$ , from the convexity of ${\psi}$ . However, something stronger holds. By the strict convexity of ${\psi}$ , fixed a point ${{\boldsymbol y} \in \mathop{\mathrm{int}} X}$ we have that ${\psi({\boldsymbol y}) \geq \psi({\boldsymbol x}) +\langle \nabla \psi({\boldsymbol y}), {\boldsymbol x}-{\boldsymbol y}\rangle, \ \forall {\boldsymbol y} \in X}$ , with equality only for ${{\boldsymbol y}={\boldsymbol x}}$ . Hence, the strict convexity allows us to use the Bregman divergence as a similarity measure between ${{\boldsymbol x}}$ and ${{\boldsymbol y}}$ . Moreover, this similarity measure changes with the reference point ${{\boldsymbol y}}$ . This also implies that, as you can see from the definition, the Bregman divergence is not symmetric.

Let me give you some more intuition on the concept Bregman divergence. Consider the case that ${\psi}$ is twice differentiable in a ball ${B}$ around ${{\boldsymbol y}}$ and ${{\boldsymbol x} \in B}$ . So, by the Taylor’s theorem, there exists ${0\leq\alpha\leq1}$ such that

$\displaystyle B_\psi({\boldsymbol x}; {\boldsymbol y}) = \psi({\boldsymbol x}) - \psi({\boldsymbol y}) - \nabla \psi({\boldsymbol y})^\top {\boldsymbol x} - {\boldsymbol y}\rangle = \frac{1}{2} ({\boldsymbol x}-{\boldsymbol y})^\top \nabla^2 \psi({\boldsymbol z}) ({\boldsymbol x}-{\boldsymbol y}),$

where ${{\boldsymbol z}=\alpha {\boldsymbol x}+(1-\alpha){\boldsymbol y}}$ . Hence, we are using a metric that depends on the Hessian of ${\psi}$ . Different areas of the space will have a different value of the Hessian, and so the Bregman will behave differently.

We can also lower bound the Bregman divergence if the function ${\psi}$ is strongly convex. In particular, if ${\psi}$ is ${\lambda}$ -strongly convex w.r.t. a norm ${\|\cdot\|}$ in ${\mathop{\mathrm{int}} X}$ , then we have

$\displaystyle \label{eq:bregman_strongly_convex} B_\psi({\boldsymbol x};{\boldsymbol y})\geq \frac{\lambda}{2}\|{\boldsymbol x}-{\boldsymbol y}\|^2~. \ \ \ \ \ (3)$

Example 2 If ${\psi({\boldsymbol x})=\frac{1}{2}\|{\boldsymbol x}\|^2_2}$ , then ${B_\psi({\boldsymbol x};{\boldsymbol y})=\frac{1}{2}\|{\boldsymbol x}\|^2_2-\frac{1}{2}\|{\boldsymbol y}\|^2_2-\langle {\boldsymbol y},{\boldsymbol x}\rangle=\frac{1}{2}\|{\boldsymbol x}-{\boldsymbol y}\|^2_2}$ .

Example 3 If ${\psi({\boldsymbol x})=\sum_{i=1}^d x_i \ln x_i}$ and ${X=\{{\boldsymbol x} \in {\mathbb R}^d: x_i\geq 0, \|{\boldsymbol x}\|_1=1\}}$ , then ${B_\psi({\boldsymbol x}; {\boldsymbol y}) = \sum_{i=1}^d x_i \ln \tfrac{x_i}{y_i}}$ , that is the Kullback-Leibler divergence between the discrete distributions ${{\boldsymbol x}}$ and ${{\boldsymbol y}}$ .

We also have the following lemma that links the Bregman divergences between 3 points.

Lemma 3 (Chen, Gong and Teboulle, Marc, 1993) Let ${B_\psi}$ the Bregman divergence w.r.t. ${\psi: X \rightarrow {\mathbb R}}$ . Then, for any three points ${{\boldsymbol x},{\boldsymbol y} \in \mathop{\mathrm{int}} X}$ and ${{\boldsymbol z} \in X}$ , the following identity holds

$\displaystyle B_\psi({\boldsymbol z}; {\boldsymbol x}) + B_\psi({\boldsymbol x}; {\boldsymbol y}) - B_\psi({\boldsymbol z}; {\boldsymbol y}) = \langle \nabla \psi({\boldsymbol y}) - \nabla \psi({\boldsymbol x}), {\boldsymbol z} - {\boldsymbol x}\rangle~.$

4. Online Mirror Descent

Based on what we said before, we can start from the equivalent formulation of the OSD update

$\displaystyle {\boldsymbol x}_{t+1} = \mathop{\mathrm{argmin}}_{{\boldsymbol x} \in V} \ \langle {\boldsymbol g}_t, x\rangle + \frac{1}{2\eta_t}\|{\boldsymbol x}_{t}-{\boldsymbol x}\|_2^2,$

and we can change the last term with another measure of distance. In particular, using the Bregman divergence w.r.t. a function ${\psi}$ , we have

$\displaystyle {\boldsymbol x}_{t+1} = \mathop{\mathrm{argmin}}_{{\boldsymbol x} \in V} \ \langle {\boldsymbol g}_t, x\rangle + \frac{1}{\eta_t}B_\psi({\boldsymbol x};{\boldsymbol x}_{t})~.$

These two updates are exactly the same when ${\psi({\boldsymbol x})=\frac{1}{2}\|{\boldsymbol x}\|_2^2}$ .

So we get the Online Mirror Descent algorithm in Algorithm 4.

omd

However, without an additional assumption, this algorithm has a problem. Can you see it? The problem is that ${{\boldsymbol x}_{t+1}}$ might be on the boundary of ${V}$ and in the next step we would have to evaluate ${B_\psi({\boldsymbol x}; {\boldsymbol x}_{t+1})}$ for a point on the boundary of ${V}$ . Given that ${V\subseteq X}$ , we might end up on the boundary of ${X}$ where the Bregman is not defined!

To fix this problem, we need either one of the following assumptions

$\displaystyle \begin{aligned} &\lim_{{\boldsymbol x} \rightarrow \partial X} \|\nabla \psi({\boldsymbol x})\|_2 = +\infty, & (4) \\ &V\subseteq \mathop{\mathrm{int}} X~. & (5) \\\end{aligned}$

If either of these conditions are true, the update is well-defined on all rounds (proof left as an exercise).

Now we have a well-defined algorithm, but does it guarantee a sublinear regret? We know that at least in one case it recovers the OSD algorithm, that does work. So, from an intuitive point of view, how well the algorithm work should depend on some characteristic on ${\psi}$ . In particular, a key property will be the strong convexity of ${\psi}$ .

To analyze OMD, we first prove a one step relationship, similar to the one we proved for Online Gradient Descent and OSD. Note how in this Lemma, we will use a lot of the concepts we introduced till now: strong convexity, dual norms, subgradients, Fenchel-Young inequality, etc. In a way, over the past lectures I slowly prepared you to be able to prove this lemma.

Lemma 4 Let ${B_\psi}$ the Bregman divergence w.r.t. ${\psi: X \rightarrow {\mathbb R}}$ and assume ${\psi}$ to be ${\lambda}$ -strongly convex with respect to ${\|\cdot\|}$ in ${V}$ . Let ${V \subseteq X}$ a convex set. Set ${{\boldsymbol g}_t \in \partial \ell_t({\boldsymbol x}_t)}$ . Assume (4) or (5) hold. Then, ${\forall {\boldsymbol u} \in V}$ and with the notation in Algorithm 1, the following inequality holds

$\displaystyle \eta_t (\ell_t({\boldsymbol x}_t) - \ell_t({\boldsymbol u}) ) \leq \eta_t \langle {\boldsymbol g}_t, {\boldsymbol x}_t -{\boldsymbol u}\rangle \leq B_\psi({\boldsymbol u};{\boldsymbol x}_t) - B_\psi({\boldsymbol u};{\boldsymbol x}_{t+1}) + \frac{\eta_t^2}{2\lambda}\|{\boldsymbol g}_t\|_\star^2~.$

Proof: From the optimality condition for the update of OMD, we have

$\displaystyle \label{eq:proof_regret_md_eq1} \langle \eta_t {\boldsymbol g}_t + \nabla \psi ({\boldsymbol x}_{t+1})-\nabla \psi({\boldsymbol x}_t), {\boldsymbol u} - {\boldsymbol x}_{t+1}\rangle \geq 0, \ \forall {\boldsymbol u} \in V~. \ \ \ \ \ (6)$

From the definition of subgradient, we have that

$\displaystyle \begin{aligned} \langle \eta_t {\boldsymbol g}_t, {\boldsymbol x}_t - {\boldsymbol u}\rangle &= \langle \nabla \psi({\boldsymbol x}_t)- \nabla \psi({\boldsymbol x}_{t+1})-\eta_t {\boldsymbol g}_t, {\boldsymbol u} - {\boldsymbol x}_{t+1} \rangle + \langle \nabla \psi({\boldsymbol x}_{t+1}) - \nabla \psi({\boldsymbol x}_t), {\boldsymbol u} - {\boldsymbol x}_{t+1} \rangle + \langle \eta_t {\boldsymbol g}_t, {\boldsymbol x}_t - {\boldsymbol x}_{t+1} \rangle \\ &\leq \langle \nabla \psi({\boldsymbol x}_{t+1}) - \nabla \psi({\boldsymbol x}_t), {\boldsymbol u} - {\boldsymbol x}_{t+1} \rangle + \langle \eta_t {\boldsymbol g}_t, {\boldsymbol x}_t - {\boldsymbol x}_{t+1} \rangle \\ &= B_\psi({\boldsymbol u};{\boldsymbol x}_t) - B_\psi({\boldsymbol u};{\boldsymbol x}_{t+1}) - B_\psi({\boldsymbol x}_{t+1};{\boldsymbol x}_t) + \langle \eta_t {\boldsymbol g}_t, {\boldsymbol x}_t - {\boldsymbol x}_{t+1} \rangle \\ &\leq B_\psi({\boldsymbol u};{\boldsymbol x}_t) - B_\psi({\boldsymbol u};{\boldsymbol x}_{t+1}) - B_\psi({\boldsymbol x}_{t+1};{\boldsymbol x}_t) + \frac{\eta_t^2}{2\lambda}\|{\boldsymbol g}_t\|_\star^2 + \frac{\lambda}{2}\|{\boldsymbol x}_t - {\boldsymbol x}_{t+1} \|^2 \\ &\leq B_\psi({\boldsymbol u};{\boldsymbol x}_t) - B_\psi({\boldsymbol u};{\boldsymbol x}_{t+1}) + \frac{\eta_t^2}{2\lambda}\|{\boldsymbol g}_t\|_\star^2, \end{aligned}$

where in the second inequality we used (6), in the second equality we used Lemma 3, in the third inequality we used ${\langle {\boldsymbol x}, {\boldsymbol y}\rangle \leq \frac{1}{2\lambda}\|{\boldsymbol x}\|^2 +\frac{\lambda}{2}\|{\boldsymbol y}\|_\star^2, \forall {\boldsymbol x},{\boldsymbol y} \in {\mathbb R}^d, \lambda>0}$ , and in the last inequality we used (3) because ${\psi}$ is ${\lambda}$ -strongly convex w.r.t. ${\|\cdot\|}$ .

The lower bound with the function values is due, as usual, to the definition of subgradients. $\Box$

Next time, we will see how to use this one step relationship to prove a regret bound, that will finally show us if and when this entire construction is a good idea. In fact, it is worth stressing that the above motivation is not enough in any way to justify the existence of the OMD algorithm. Also, next time we will explain why the algorithm is called Online “Mirror” Descent.

5. Exercises

Exercise 1 Prove that the ${\psi}$ defined in Example 3 is 1-strongly convex w.r.t. the L1 norm.

Exercise 2 Derive a closed form update for OMD when using the ${\psi}$ of Example 3 and ${V=X}$ .

6. History Bits

Mirror Descent (MD) was introduced by (Nemirovsky, A.S. and Yudin, D., 1983) in the batch setting. The description of MD with Bregman divergence that I described here (with minor changes) was done by (Beck, A. and Teboulle, M., 2003). The minor changes are in decoupling the domain ${X}$ of ${\psi}$ from the feasibility set ${V}$ . This allows to use functions ${\psi}$ that do not satisfy the condition (4) but they satisfy (5). In the online setting, the mirror descent scheme was used for the first time by (Warmuth, M. K. and Jagota, A. K., 1997).