Online Mirror Descent III: Examples and Learning with Expert Advice

This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week.

You can find the lectures I published till now here.


Today, we will see a couple of practical implementations of online mirror descent, with two different Bregman functions and we will introduce the setting of Learning with Expert Advice.

1. Exponentiated Gradient


eg

Let {X=\{{\boldsymbol x} \in {\mathbb R}^d: x_i\geq0, \|{\boldsymbol x}\|_1=1\}} and {\psi({\boldsymbol x}):X\rightarrow {\mathbb R}} defined as the negative entropy {\psi({\boldsymbol x})=\sum_{i=1}^d x_i \log x_i}, where we define {0 \log(0)=0}. Also, we set the feasibility set {V=X}. So, in words, we want to output discrete probability distributions over {{\mathbb R}^d}.

The Fenchel conjugate {\psi_V^\star({\boldsymbol \theta})} is defined as

\displaystyle \psi_V^\star({\boldsymbol \theta}) = \sup_{{\boldsymbol x} \in V} \ \langle {\boldsymbol \theta}, {\boldsymbol x}\rangle - \psi({\boldsymbol x}) = \sup_{{\boldsymbol x} \in V} \ \langle {\boldsymbol \theta}, {\boldsymbol x}\rangle - \sum_{i=1}^d x_i \log x_i~.

The solution is {\psi_V^\star({\boldsymbol \theta}) = \log\left(\sum_{i=1}^d \exp(\theta_i)\right)}, see Appendix.

We also have {(\nabla \psi_V^\star)_j({\boldsymbol \theta})= \frac{\exp(\theta_j)}{\sum_{i=1}^d \exp(\theta_i)}} and {(\nabla \psi_V({\boldsymbol x}))_j=\ln(x_j)+1}.

Putting all together, we have the online mirror descent update rule for entropic distance generating function.

\displaystyle x_{t+1,j} = \frac{\exp(\ln x_{t,j}+1-\eta_t g_{t,j})}{\sum_{i=1}^d \exp(\ln x_{t,i}+1-\eta_{t} g_{t,i})} = \frac{x_{t,j}\exp(-\eta_t g_{t,j})}{\sum_{i=1}^d x_{t,i} \exp(-\eta_{t} g_{t,i})}~.

The algorithm is summarized in Algorithm 1. This algorithm is called Exponentiated Gradient (EG) because in the update rule we take the component-wise exponent of the gradient vector.

Let’s take a look at the regret bound we get. First, as we said, observe that

\displaystyle \begin{aligned} B_\psi({\boldsymbol x};{\boldsymbol y}) &= \sum_{i=1}^d (x_i \ln x_i - y_i \ln y_i - (\ln(y_i)+1)(x_i-y_i)) = \sum_{i=1}^d (x_i \ln x_i - y_i \ln y_i - x_i \ln(y_i) + y_i \ln(y_i)) \\ &= \sum_{i=1}^d x_i \ln \frac{x_i}{y_i}, \end{aligned}

that is the KL divergence between the 1-dimensional discrete distributions {{\boldsymbol x}} and {{\boldsymbol y}}. Now, the following Lemma tells us about the strong convexity of {\psi}.

Lemma 1 (S. Shalev-Shwartz, 2007, Lemma 16) {\psi({\boldsymbol x})=\sum_{i=1}^d x_i \ln x_i} is 1-strongly convex with respect to the {L_1} norm over the set {\{{\boldsymbol x} \in R^d: x_i \geq0, \|{\boldsymbol x}\|_1=1\}}.

Another thing to decide is the initial point {{\boldsymbol x}_1}. We can set {{\boldsymbol x}_1} to be the minimizer of {\psi} in {V}. In this way {B_\psi({\boldsymbol u}; {\boldsymbol x}_1)} simplifies to {\psi({\boldsymbol u})-\min_{{\boldsymbol x} \in V} \psi({\boldsymbol x})}. Hence, we set {{\boldsymbol x}_1=[1/d, \dots, 1/d] \in {\mathbb R}^d}. So, we have {B_\psi({\boldsymbol u};{\boldsymbol x}_1) = \sum_{i=1}^d u_i \ln u_i + \ln{d} \leq \ln d}.

Putting all together, we have

\displaystyle \sum_{t=1}^T (\ell_t({\boldsymbol x}_t) - \ell_t({\boldsymbol u})) \leq \frac{\ln d}{\eta} + \frac{\eta}{2} \sum_{t=1}^T \|{\boldsymbol g}_t\|_\infty^2~.

Assuming {\|{\boldsymbol g}_t\|_\infty\leq L_\infty}, we can set {\eta=\sqrt{\frac{2\ln d}{L_\infty^2 T}}}, to obtain that a regret of { \frac{\sqrt{2}}{2} L_\infty \sqrt{T \ln d}}.

Remark 1 Note that the time-varying version of OMD with entropic distance generating function would give rise to a vacuous bound, can you see why? We will see how FTRL overcomes this issue using a time-varying regularizer rather than a time-varying learning rate.

How would Online Subgradient Descent (OSD) work on the same problem? First, it is important to realize that nothing prevents us to use OSD on this problem. We just have to implement the euclidean projection onto the simplex. The regret bound we would get from OSD is

\displaystyle \sum_{t=1}^T (\ell_t({\boldsymbol x}_t) - \ell_t({\boldsymbol u})) \leq \frac{1}{\eta} + \frac{\eta}{2} \sum_{t=1}^T \|{\boldsymbol g}_t\|_2^2,

where we set {{\boldsymbol x}_1=\boldsymbol{0}} and {\|{\boldsymbol u}\|_2\leq 1} for any {{\boldsymbol u} \in V}. Assuming {\|{\boldsymbol g}_t\|_\infty\leq L_\infty}, we have that in the worst case {\|{\boldsymbol g}_t\|_2^2 \leq d L_\infty^2}. Hence, we can set {\eta=\sqrt{\frac{1}{d L_\infty^2 T}}}, to obtain that a regret of { L_\infty \sqrt{d T}}. Hence, in a worst case sense, using an entropic distance generating function transforms a dependency on the dimension from {\sqrt{d}} to {\sqrt{\ln d}} for Online Convex Optimization (OCO) over the probability simplex.

So, as we already saw analyzing AdaGrad, the shape of the domain is the important ingredient when we change from euclidean norms to other norms.

2. {p}-norm Algorithms

Consider the distance generating function {\psi({\boldsymbol x})=\frac{1}{2}\|{\boldsymbol x}\|^2_p}, for {1 < p\leq2} over {X=V={\mathbb R}^d}. Let's remind the reader that the {p}-norm of a vector {{\boldsymbol x}} is defined as {(\sum_{i=1}^d |x_i|^p)^\frac{1}{p}}. We already proved that {\psi^\star_V({\boldsymbol \theta}) = \frac{1}{2}\|{\boldsymbol \theta}\|_q^2}, where {\frac{1}{p}+\frac{1}{q}=1}, so that {q\geq 2}. Let's calculate the dual maps: {(\nabla \psi({\boldsymbol x}))_j = \rm sign(x_j) |x_j|^{p-1} \|{\boldsymbol x}\|^{2/p-1}_p} and {(\nabla \psi^\star_V({\boldsymbol x}))_j = \rm sign(x_j) |x_j|^{q-1} \|{\boldsymbol x}\|^{2/q-1}_q}. Hence, we can write the update rule as

\displaystyle \begin{aligned} \tilde{x}_{t+1,j} &= \rm sign(x_{t,j}) |x_{t,j}|^{p-1}\|{\boldsymbol x}_t\|^{2/p-1}_p - \eta g_{t,j}, \ j=1,\dots,d, \\ x_{t+1,j} &= \rm sign(\tilde{x}_{t+1,j})|\tilde{x}_{t+1,j}|^{q-1} \left\| \tilde{{\boldsymbol x}}_{t+1} \right\|^{2/q-1}_q, \ j=1,\dots,d, \end{aligned}

where we broke the update in two steps to simplify the notation (and the implementation). Starting from {{\boldsymbol x}_1=\boldsymbol{0}}, we have that

\displaystyle B_\psi({\boldsymbol u}; {\boldsymbol x}_1) = \psi({\boldsymbol u}) - \psi({\boldsymbol x}_1) - \langle \nabla \psi({\boldsymbol x}_1), {\boldsymbol u} - {\boldsymbol x}_1 \rangle = \psi({\boldsymbol u})~.

The last ingredient is the fact that {\psi({\boldsymbol x})} is {p-1} strongly convex with respect to {\|\cdot\|_p}.

Lemma 2 (S. Shalev-Shwartz, 2007, Lemma 17) {\psi({\boldsymbol x})=\frac{1}{2}\|{\boldsymbol x}\|^2_p} is {(p-1)}-strongly convex with respect to {\|\cdot\|_p}, for {1\leq p\leq 2}.

Hence, the regret bound will be

\displaystyle \sum_{t=1}^T (\ell_t({\boldsymbol w}_t) - \ell_t({\boldsymbol u})) \leq \frac{\|{\boldsymbol u}\|_p^2}{2 \eta} + \frac{\eta}{2(p-1)}\sum_{t=1}^T \|{\boldsymbol g}_t\|_q^2~.

Setting {p=2}, we get the (unprojected) Online Subgradient Descent. However, we can set {p} to achieve a logarithmic dependency in the dimension {d} as in EG. Let’s assume again that {\|{\boldsymbol g}_t\|_\infty\leq L_\infty}, so we have

\displaystyle \sum_{t=1}^T \|{\boldsymbol g}_t\|_q^2 \leq L_\infty^2 d^{2/q} T~.

Also, note that {\|{\boldsymbol u}\|_p\leq \|{\boldsymbol u}\|_1}, so we have an upper bound to the regret of

\displaystyle \text{Regret}_T({\boldsymbol u}) \leq \frac{\|{\boldsymbol u}\|^2_1}{2 \eta} + \frac{L_\infty^2 d^{2/q} T \eta}{2(p-1)}, \ \forall {\boldsymbol u} \in {\mathbb R}^d~.

Setting {\eta=\frac{\alpha\sqrt{p-1}}{L_\infty d^{1/q} \sqrt{T}}}, we get an upper bound to the regret of

\displaystyle \frac{1}{2}\left(\frac{\|{\boldsymbol u}\|^2_1}{\alpha} + \alpha\right) L_\infty \sqrt{T} \frac{d^{1/q}}{\sqrt{p-1}} = \frac{1}{2}\left(\frac{\|{\boldsymbol u}\|^2_1}{\alpha} + \alpha\right) L_\infty \sqrt{T} \sqrt{q-1}d^{1/q} \leq \frac{1}{2}\left(\frac{\|{\boldsymbol u}\|^2_1}{\alpha} + \alpha\right) L_\infty \sqrt{T} \sqrt{q}d^{1/q}~.

Assuming {d\geq 3}, the choice of {q} that minimizes the last term is {q=2 \ln d} that makes the term {\sqrt{q}d^{1/q}=\sqrt{2 e \ln d}}. Hence, we have regret bound of the order of {O(\sqrt{T \ln d})}.

Note that {\alpha} should be tuned with the knowledge of {\|\boldsymbol{u}\|_1}, that is impossible given the adversarial nature of the game. We will see how to solve this issue when we will introduce the parameter-free algorithms.

So, the {p}-norm allows to interpolate from the behaviour of OSD to the one of EG. Note that here the set {V} is the entire space, however we could still set {V=\{{\boldsymbol x} \in{\mathbb R}^d: x_i\geq0, \|{\boldsymbol x}\|_1=1\}}. While this would allow us to get the same asymptotic bound of EG, the update would not be in a closed form anymore.

3. Learning with Expert Advice

Let’s introduce a particular Online Convex Optimization game called Learning with Expert Advice.

In this setting, we have {d} experts that gives us some advice on each round. In turn, in each round we have to decide which expert we want to follow. After we made our choice, the losses associated to each expert are revealed and we pay the loss associated to the expert we picked. The aim of the game is to minimize the losses we make compared to cumulative losses of the best expert. This is a general setting that allows to model many interesting cases. For example, we have a number of different online learning algorithms and we would like to close to the best among them.

Is this problem solvable? If we put ourselves in the adversarial setting, unfortunately it cannot be solved! Indeed, even with 2 experts, the adversary can force on us linear regret. Let’s see how. In each round we have to pick expert 1 or expert 2. In each round, the adversary can decide that the expert we pick has loss 1 and the other one has loss 0. This means that the cumulative loss of the algorithm over {T} rounds is {T}. On the other hand, the best cumulative loss over expert 1 and 2 is less than {T/2}. This means that our regret, no matter what we do, can be as big as {T/2}.

The problem above is due to the fact that the adversary has too much power. One way to reduce its power is using randomization. We can allow the algorithm to be randomized and force the adversary to decide the losses at time {t} without knowing the outcome of the randomization of the algorithm at time {t} (but it can depend on the past randomization). This is enough to make the problem solvable. Moreover, it will also make the problem convex, allowing us to use any OCO algorithm on it.

First, let’s write the problem in the original formulation. We set a discrete feasible set {V=\{{\boldsymbol e}_i\}_{i=1}^d}, where {{\boldsymbol e}_i} is the vector will all zeros but a {1} in the coordinate {i}. Our predictions and the competitor are from {V}. The losses are linear losses: {\ell_t({\boldsymbol x}) = \langle {\boldsymbol g}_t, {\boldsymbol x}_t\rangle}, where we assume {0 \leq g_{t,i} \leq 1}, for {t=1,\dots,T} and {i=1, \dots,d}. The regret is

\displaystyle \label{eq:regret_lea} \text{Regret}_T({\boldsymbol e}_i) = \sum_{t=1}^T \langle {\boldsymbol g}_{t}, {\boldsymbol x}_t\rangle - \sum_{t=1}^T \langle {\boldsymbol g}_t, {\boldsymbol e}_i\rangle , \ i=1, \dots, d~. \ \ \ \ \ (1)

The only thing that makes this problem non-convex is the feasibility set, that is clearly a non-convex one.

Let’s now see how the randomization makes this problem convex. Let’s extend the feasible set to {V'=\{{\boldsymbol x} \in {\mathbb R}^d : x_i>0, \|{\boldsymbol x}\|_1=1\}}. Note that {{\boldsymbol e}_i \in V'}. For this problem we can use an OCO algorithm to minimize the regret

\displaystyle \text{Regret}'_T({\boldsymbol u}) = \sum_{t=1}^T \langle {\boldsymbol g}_t, {\boldsymbol x}_t\rangle - \sum_{t=1}^T \langle {\boldsymbol g}_t, {\boldsymbol u}\rangle, \forall {\boldsymbol u} \in V'~.

Can we find a way to transform an upper bound to this regret to the one we care in (1)? One way is the following one: On each time step, construct a random variable {i_t} that is equal to {i} with probability {x_{t,i}} for {i=1,\dots,d}. Then, select the expert according to the outcome of {i_t}. Now, in expectation we have

\displaystyle \mathop{\mathbb E}[g_{t,i_t}] = \langle {\boldsymbol g}_t,{\boldsymbol x}_t\rangle,

and

\displaystyle \mathop{\mathbb E}[\text{Regret}_T({\boldsymbol e}_i)] = \mathop{\mathbb E}[\text{Regret}'_T({\boldsymbol e}_i)] = \mathop{\mathbb E}\left[\sum_{t=1}^T \langle {\boldsymbol g}_{t}, {\boldsymbol x}_t\rangle - \sum_{t=1}^T \langle {\boldsymbol g}_t, {\boldsymbol e}_i\rangle\right]~.

This means that we can minimize in expectation the non-convex regret with a randomized OCO algorithm. We can summarize this reasoning in Algorithm 2.


lea

For example, if we use EG as the OCO algorithm, setting {{\boldsymbol x}_1=[1/d, \dots, 1/d]}, then we obtain the following update rule

\displaystyle x_{t+1,j} = \frac{x_{t,j}\exp(-\eta g_{t,j})}{\sum_{i=1}^d x_{t,i} \exp(-\eta g_{t,i})}, \ j=1,\dots,d

and the regret will be

\displaystyle \mathop{\mathbb E}[\text{Regret}_T({\boldsymbol e}_i)] \leq \frac{\sqrt{2}}{2} L_\infty \sqrt{T \ln d}, \ \forall {\boldsymbol e}_i~.

It is worth stressing the importance of the result just obtained: We can design an algorithm that in expectation is close to the best expert in a set, paying only a logarithmic penalty in the size of the set.

In the future, we will see algorithms that achieve even the better regret guarantee of {O(\sqrt{T\cdot \rm KL({\boldsymbol u};{\boldsymbol x}_1)})}, for any {{\boldsymbol u}} in the simplex. You should be able to convince yourself that no setting of {\eta} in EG allows to achieve such regret guarantee. Indeed, the algorithm will be based on a very different strategy.

4. History Bits

The EG algorithm was introduced by (Kivinen, J. and Warmuth, M., 1997), but not as a specific instantiation of OMD. I am actually not sure when people first pointed out the link between EG and OMD, do you know something about this? If yes, please let me know!
The trick to set {q=2\ln d} is from (Gentile, C. and Littlestone, N., 1999) (online learning) and apparently rediscovered in (Ben-Tal, A. and Margalit, T. and Nemirovski, A., 2001) (optimization). The learning with expert setting was introduced in (Littlestone, N. and Warmuth, M. K., 1994) and (Vovk, V. G, 1990). The ideas in Algorithm 3 are based on the Multiplicative Weights algorithm (Littlestone, N. and Warmuth, M. K., 1994) and the Hedge algorithm (Freund, Y. and Schapire, R. E., 1997). By now, the literature on learning with expert is huge, with tons of variations over algorithms and settings.

5. Exercises

Exercise 1 Derive the EG update rule and regret bound in the case that the algorithm starts from an arbitrary vector {{\boldsymbol x}_1} in the probability simplex.

6. Appendix

Here we find the conjugate of the negative entropy.

It is a constrained optimization problem, we can solve it using the KKT conditions. We first rewrite it as a minimization problem

\displaystyle \sup_{{\boldsymbol x} \in V} \ \langle {\boldsymbol \theta}, {\boldsymbol x}\rangle - \sum_{i=1}^d x_i \log x_i = -\min_{{\boldsymbol x} \in V} \ \sum_{i=1}^d x_i \log x_i - \langle {\boldsymbol \theta}, {\boldsymbol x}\rangle~.

The KKT conditions are, for {i=1, \cdots, d},

\displaystyle \begin{aligned} 1+\log x^\star_i - \theta_i -\lambda^\star_i+\nu^\star&=0 \\ \lambda_i^\star&\geq0\\ \sum_{i=1}^d x^\star_i &=1\\ \lambda_i^\star x_i^\star&=0\\ x_i^\star&\geq0~. \end{aligned}

From the first one we get {x^\star_i = \exp(\theta_i + \lambda^\star_i-\nu^\star-1)}. Using the third one we have {\exp(\nu^\star+1)=\sum_{i=1}^d \exp(\theta_i +\lambda^\star_i)}. Then, from the complementarity condition, fourth one, we get {\lambda^\star_i=0}. Putting all together, we have {x^\star_i = \frac{\exp(\theta_i)}{\sum_{j=1}^d \exp(\theta_j)}}.

Denoting with {\alpha=\sum_{i=1}^d \exp(\theta_i)}, and substituting in the definition of the conjugate function we get

\displaystyle \begin{aligned} \psi_V^\star({\boldsymbol \theta}) &= \sum_{i=1}^d \left(\frac{1}{\alpha} \theta_i\exp(\theta_i) - \frac{1}{\alpha} \exp(\theta_i)(\theta_i-\log(\alpha))\right)\\ &= \log(\alpha) \frac{1}{\alpha}\sum_{i=1}^d \exp(\theta_i )\\ &= \log(\alpha)\\ &= \log\left(\sum_{i=1}^d \exp(\theta_i)\right)~. \end{aligned}

7 Comments

  1. Hi Francesco, nice post as always!
    Two typos: when you define the set X in the beginning after the Exponentiated Gradient algorithm, it should be \| x \|_1 = 1. And then, when you discuss the suboptimality of OSD wrt EG the difference in the regret bound is \sqrt(d) vs \sqrt(ln d), not just ln d, right?
    Also, I have few questions. Regarding remark 1, is the problem related to the fact that in the time-varying version of OMD the KL divergence between u and x_t could potentially explode for some t, if u_{i, t} is not 0 but x_{i, t} goes to 0 for some i?
    The second is about p-norm algorithms. In the tuning of \eta, we introduce the parameter \alpha, because I guess we do not know \| u \|_1^2 and it shouldn’t matter on the order of the regret bound of O(\sqrt(T ln d)). However, how would you practically set it? This is indeed not parameter-free 😛

    Liked by 1 person

    1. Thanks!
      Typos fixed.

      First question: yes. Probably that term is boundable in some way, but I suspect that you get something more that sqrt{T}. But I actually never tried.

      Second question: I introduce alpha exactly to make you notice that there is an untuned parameter! In Online Learning papers you will find 3 different ways to cope with this problem: 1) assume to know ||u|| (that is impossible, given the adversarial nature of the game); 2) compete only with ||u||\leq U, where U is a parameter (bad idea: we still have a parameter and bound now depends on U); 3) say that the order of the bound is the same even if you don’t tune it (it assumes that argmin_u sum_{t=1}^T l_t(u) exists, but this is false in many settings, e.g. logistic regression on a separable dataset and regression with a universal kernel. I discussed the logistic case when explaining a variant of online-to-batch).
      We will be able to remove the need of that parameter completely when we will use the coin-betting methods, I’ll add a link to this.

      Like

      1. Thank you for the replies!

        Regarding the first question, can’t we use a “clipped” simplex? I’m thinking of Fixed-Share kind of updates where we want to prevent weights going to 0 to force somehow additional exploration, thus we consider X = \{ x \in R^d: x_i \geq \alpha, \| x \|_1 = 1 \}. Then of course we should tune \alpha but maybe we can still get \sqrt{T}? (For the expert setting Fixed-Share kind of algorithms are \tilde{O}(\sqrt(T S \ln K)), where S is the number of shifts in the comparator sequence).

        Like

  2. If you change V, then you have to explicitly compute the projection w.r.t. the Bregman divergence, that in this case is the KL divergence. I don’t know how difficult it would be.
    A part from that, it might work, but I am not sure what is the advantage: FTRL gives use the right bound in this case, why using OMD? Also, even with your fix, you wouldn’t get the KL divergence between the first point and the competitor, but only the worst KL divergence in your “clipped” simplex.

    Like

  3. Yes, I guess the FTRL solution is the best approach. I was just curious whether MD could be used with time-varying learning rates, since I don’t think I have seen such a case 🙂

    Like

    1. Now that I think about it, David proved a lower bound in the journal version of our scale-free paper. It says that the regret of EG is linear in time if used with time-varying learning rates!

      Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s