# Online Mirror Descent III: Examples and Learning with Expert Advice

This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week.

You can find the lectures I published till now here.

Today, we will see a couple of practical implementations of online mirror descent, with two different Bregman functions and we will introduce the setting of Learning with Expert Advice.

Let ${X=\{{\boldsymbol x} \in {\mathbb R}^d: x_i\geq0, \|{\boldsymbol x}\|_1\}}$ and ${\psi({\boldsymbol x}):X\rightarrow {\mathbb R}}$ defined as the negative entropy ${\psi({\boldsymbol x})=\sum_{i=1}^d x_i \log x_i}$, where we define ${0 \log(0)=0}$. Also, we set the feasibility set ${V=X}$. So, in words, we want to output discrete probability distributions over ${{\mathbb R}^d}$.

The Fenchel conjugate ${\psi_V^\star({\boldsymbol \theta})}$ is defined as

$\displaystyle \psi_V^\star({\boldsymbol \theta}) = \sup_{{\boldsymbol x} \in V} \ \langle {\boldsymbol \theta}, {\boldsymbol x}\rangle - \psi({\boldsymbol x}) = \sup_{{\boldsymbol x} \in V} \ \langle {\boldsymbol \theta}, {\boldsymbol x}\rangle - \sum_{i=1}^d x_i \log x_i~.$

The solution is ${\psi_V^\star({\boldsymbol \theta}) = \log\left(\sum_{i=1}^d \exp(\theta_i)\right)}$, see Appendix.

We also have ${(\nabla \psi_V^\star)_j({\boldsymbol \theta})= \frac{\exp(\theta_j)}{\sum_{i=1}^d \exp(\theta_i)}}$ and ${(\nabla \psi_V({\boldsymbol x}))_j=\ln(x_j)+1}$.

Putting all together, we have the online mirror descent update rule for entropic distance generating function.

$\displaystyle x_{t+1,j} = \frac{\exp(\ln x_{t,j}+1-\eta_t g_{t,j})}{\sum_{i=1}^d \exp(\ln x_{t,i}+1-\eta_{t} g_{t,i})} = \frac{x_{t,j}\exp(-\eta_t g_{t,j})}{\sum_{i=1}^d x_{t,i} \exp(-\eta_{t} g_{t,i})}~.$

The algorithm is summarized in Algorithm 1. This algorithm is called Exponentiated Gradient (EG) because in the update rule we take the component-wise exponent of the gradient vector.

Let’s take a look at the regret bound we get. First, as we said, observe that

\displaystyle \begin{aligned} B_\psi({\boldsymbol x};{\boldsymbol y}) &= \sum_{i=1}^d (x_i \ln x_i - y_i \ln y_i - (\ln(y_i)+1)(x_i-y_i)) = \sum_{i=1}^d (x_i \ln x_i - y_i \ln y_i - x_i \ln(y_i) + y_i \ln(y_i)) \\ &= \sum_{i=1}^d x_i \ln \frac{x_i}{y_i}, \end{aligned}

that is the KL divergence between the 1-dimensional discrete distributions ${{\boldsymbol x}}$ and ${{\boldsymbol y}}$. Now, the following Lemma tells us about the strong convexity of ${\psi}$.

Lemma 1 (S. Shalev-Shwartz, 2007, Lemma 16) ${\psi({\boldsymbol x})=\sum_{i=1}^d x_i \ln x_i}$ is 1-strongly convex with respect to the ${L_1}$ norm over the set ${\{{\boldsymbol x} \in R^d: x_i \geq0, \|{\boldsymbol x}\|_1=1\}}$.

Another thing to decide is the initial point ${{\boldsymbol x}_1}$. We can set ${{\boldsymbol x}_1}$ to be the minimizer of ${\psi}$ in ${V}$. In this way ${B_\psi({\boldsymbol u}; {\boldsymbol x}_1)}$ simplifies to ${\psi({\boldsymbol u})-\min_{{\boldsymbol x} \in V} \psi({\boldsymbol x})}$. Hence, we set ${{\boldsymbol x}_1=[1/d, \dots, 1/d] \in {\mathbb R}^d}$. So, we have ${B_\psi({\boldsymbol u};{\boldsymbol x}_1) = \sum_{i=1}^d u_i \ln u_i + \ln{d} \leq \ln d}$.

Putting all together, we have

$\displaystyle \sum_{t=1}^T (\ell_t({\boldsymbol x}_t) - \ell_t({\boldsymbol u})) \leq \frac{\ln d}{\eta} + \frac{\eta}{2} \sum_{t=1}^T \|{\boldsymbol g}_t\|_\infty^2~.$

Assuming ${\|{\boldsymbol g}_t\|_\infty\leq L_\infty}$, we can set ${\eta=\sqrt{\frac{2\ln d}{L_\infty^2 T}}}$, to obtain that a regret of ${ \frac{\sqrt{2}}{2} L_\infty \sqrt{T \ln d}}$.

Remark 1 Note that the time-varying version of OMD with entropic distance generating function would give rise to a vacuous bound, can you see why? We will see how FTRL overcomes this issue using a time-varying regularizer rather than a time-varying learning rate.

How would Online Subgradient Descent (OSD) work on the same problem? First, it is important to realize that nothing prevents us to use OSD on this problem. We just have to implement the euclidean projection onto the simplex. The regret bound we would get from OSD is

$\displaystyle \sum_{t=1}^T (\ell_t({\boldsymbol x}_t) - \ell_t({\boldsymbol u})) \leq \frac{1}{\eta} + \frac{\eta}{2} \sum_{t=1}^T \|{\boldsymbol g}_t\|_2^2,$

where we set ${{\boldsymbol x}_1=\boldsymbol{0}}$ and ${\|{\boldsymbol u}\|_2\leq 1}$ for any ${{\boldsymbol u} \in V}$. Assuming ${\|{\boldsymbol g}_t\|_\infty\leq L_\infty}$, we have that in the worst case ${\|{\boldsymbol g}_t\|_2^2 \leq d L_\infty^2}$. Hence, we can set ${\eta=\sqrt{\frac{1}{d L_\infty^2 T}}}$, to obtain that a regret of ${ L_\infty \sqrt{d T}}$. Hence, in a worst case sense, using an entropic distance generating function transforms a dependency on the dimension from ${\sqrt{d}}$ to ${\ln d}$ for Online Convex Optimization (OCO) over the probability simplex.

So, as we already saw analyzing AdaGrad, the shape of the domain is the important ingredient when we change from euclidean norms to other norms.

2. ${p}$-norm Algorithms

Consider the distance generating function ${\psi({\boldsymbol x})=\frac{1}{2}\|{\boldsymbol x}\|^2_p}$, for ${1 < p\leq2}$ over ${X=V={\mathbb R}^d}$. Let’s remind the reader that the ${p}$-norm of a vector ${{\boldsymbol x}}$ is defined as ${(\sum_{i=1}^d |x_i|^p)^\frac{1}{p}}$. We already proved that ${\psi^\star_V({\boldsymbol \theta}) = \frac{1}{2}\|{\boldsymbol \theta}\|_q^2}$, where ${\frac{1}{p}+\frac{1}{q}=1}$, so that ${q\geq 2}$. Let’s calculate the dual maps: ${(\nabla \psi({\boldsymbol x}))_j = \rm sign(x_j) |x_j|^{p-1} \|{\boldsymbol x}\|^{2/p-1}_p}$ and ${(\nabla \psi^\star_V({\boldsymbol x}))_j = \rm sign(x_j) |x_j|^{q-1} \|{\boldsymbol x}\|^{2/q-1}_q}$. Hence, we can write the update rule as

\displaystyle \begin{aligned} \tilde{x}_{t+1,j} &= \rm sign(x_{t,j}) |x_{t,j}|^{p-1}\|{\boldsymbol x}_t\|^{2/p-1}_p - \eta g_{t,j}, \ j=1,\dots,d, \\ x_{t+1,j} &= \rm sign(\tilde{x}_{t+1,j})|\tilde{x}_{t+1,j}|^{q-1} \left\| \tilde{{\boldsymbol x}}_{t+1} \right\|^{2/q-1}_q, \ j=1,\dots,d, \end{aligned}

where we broke the update in two steps to simplify the notation (and the implementation). Starting from ${{\boldsymbol x}_1=\boldsymbol{0}}$, we have that

$\displaystyle B_\psi({\boldsymbol u}; {\boldsymbol x}_1) = \psi({\boldsymbol u}) - \psi({\boldsymbol x}_1) - \langle \nabla \psi({\boldsymbol x}_1), {\boldsymbol u} - {\boldsymbol x}_1 \rangle = \psi({\boldsymbol u})~.$

The last ingredient is the fact that ${\psi({\boldsymbol x})}$ is ${p-1}$ strongly convex with respect to ${\|\cdot\|_p}$.

Lemma 2 (S. Shalev-Shwartz, 2007, Lemma 17) ${\psi({\boldsymbol x})=\frac{1}{2}\|{\boldsymbol x}\|^2_p}$ is ${(p-1)}$-strongly convex with respect to ${\|\cdot\|_p}$, for ${1\leq p\leq 2}$.

Hence, the regret bound will be

$\displaystyle \sum_{t=1}^T (\ell_t({\boldsymbol w}_t) - \ell_t({\boldsymbol u})) \leq \frac{\|{\boldsymbol u}\|_p^2}{2 \eta} + \frac{\eta}{2(p-1)}\sum_{t=1}^T \|{\boldsymbol g}_t\|_q^2~.$

Setting ${p=2}$, we get the (unprojected) Online Subgradient Descent. However, we can set ${p}$ to achieve a logarithmic dependency in the dimension ${d}$ as in EG. Let’s assume again that ${\|{\boldsymbol g}_t\|_\infty\leq L_\infty}$, so we have

$\displaystyle \sum_{t=1}^T \|{\boldsymbol g}_t\|_q^2 \leq L_\infty^2 d^{2/q} T~.$

Also, note that ${\|{\boldsymbol u}\|_p\leq \|{\boldsymbol u}\|_1}$, so we have an upper bound to the regret of

$\displaystyle \text{Regret}_T({\boldsymbol u}) \leq \frac{\|{\boldsymbol u}\|^2_1}{2 \eta} + \frac{L_\infty^2 d^{2/q} T \eta}{2(p-1)}, \ \forall {\boldsymbol u} \in {\mathbb R}^d~.$

Setting ${\eta=\frac{\alpha\sqrt{p-1}}{L_\infty d^{1/q} \sqrt{T}}}$, we get an upper bound to the regret of

$\displaystyle \frac{1}{2}\left(\frac{\|{\boldsymbol u}\|^2_1}{\alpha} + \alpha\right) L_\infty \sqrt{T} \frac{d^{1/q}}{\sqrt{p-1}} = \frac{1}{2}\left(\frac{\|{\boldsymbol u}\|^2_1}{\alpha} + \alpha\right) L_\infty \sqrt{T} \sqrt{q-1}d^{1/q} \leq \frac{1}{2}\left(\frac{\|{\boldsymbol u}\|^2_1}{\alpha} + \alpha\right) L_\infty \sqrt{T} \sqrt{q}d^{1/q}~.$

Assuming ${d\geq 3}$, the choice of ${q}$ that minimizes the last term is ${q=2 \ln d}$ that makes the term ${\sqrt{q}d^{1/q}=\sqrt{2 e \ln d}}$. Hence, we have regret bound of the order of ${O(\sqrt{T \ln d})}$.

So, the ${p}$-norm allows to interpolate from the behaviour of OSD to the one of EG. Note that here the set ${V}$ is the entire space, however we could still set ${V=\{{\boldsymbol x} \in{\mathbb R}^d: x_i\geq0, \|{\boldsymbol x}\|_1=1\}}$. While this would allow us to get the same asymptotic bound of EG, the update would not be in a closed form anymore.

Let’s introduce a particular Online Convex Optimization game called Learning with Expert Advice.

In this setting, we have ${d}$ experts that gives us some advice on each round. In turn, in each round we have to decide which expert we want to follow. After we made our choice, the losses associated to each expert are revealed and we pay the loss associated to the expert we picked. The aim of the game is to minimize the losses we make compared to cumulative losses of the best expert. This is a general setting that allows to model many interesting cases. For example, we have a number of different online learning algorithms and we would like to close to the best among them.

Is this problem solvable? If we put ourselves in the adversarial setting, unfortunately it cannot be solved! Indeed, even with 2 experts, the adversary can force on us linear regret. Let’s see how. In each round we have to pick expert 1 or expert 2. In each round, the adversary can decide that the expert we pick has loss 1 and the other one has loss 0. This means that the cumulative loss of the algorithm over ${T}$ rounds is ${T}$. On the other hand, the best cumulative loss over expert 1 and 2 is less than ${T/2}$. This means that our regret, no matter what we do, can be as big as ${T/2}$.

The problem above is due to the fact that the adversary has too much power. One way to reduce its power is using randomization. We can allow the algorithm to be randomized and force the adversary to decide the losses at time ${t}$ without knowing the outcome of the randomization of the algorithm at time ${t}$ (but it can depend on the past randomization). This is enough to make the problem solvable. Moreover, it will also make the problem convex, allowing us to use any OCO algorithm on it.

First, let’s write the problem in the original formulation. We set a discrete feasible set ${V=\{{\boldsymbol e}_i\}_{i=1}^d}$, where ${{\boldsymbol e}_i}$ is the vector will all zeros but a ${1}$ in the coordinate ${i}$. Our predictions and the competitor are from ${V}$. The losses are linear losses: ${\ell_t({\boldsymbol x}) = \langle {\boldsymbol g}_t, {\boldsymbol x}_t\rangle}$, where we assume ${0 \leq g_{t,i} \leq 1}$, for ${t=1,\dots,T}$ and ${i=1, \dots,d}$. The regret is

$\displaystyle \label{eq:regret_lea} \text{Regret}_T({\boldsymbol e}_i) = \sum_{t=1}^T \langle {\boldsymbol g}_{t}, {\boldsymbol x}_t\rangle - \sum_{t=1}^T \langle {\boldsymbol g}_t, {\boldsymbol e}_i\rangle , \ i=1, \dots, d~. \ \ \ \ \ (1)$

The only thing that makes this problem non-convex is the feasibility set, that is clearly a non-convex one.

Let’s now see how the randomization makes this problem convex. Let’s extend the feasible set to ${V'=\{{\boldsymbol x} \in {\mathbb R}^d : x_i>0, \|{\boldsymbol x}\|_1=1\}}$. Note that ${{\boldsymbol e}_i \in V'}$. For this problem we can use an OCO algorithm to minimize the regret

$\displaystyle \text{Regret}'_T({\boldsymbol u}) = \sum_{t=1}^T \langle {\boldsymbol g}_t, {\boldsymbol x}_t\rangle - \sum_{t=1}^T \langle {\boldsymbol g}_t, {\boldsymbol u}\rangle, \forall {\boldsymbol u} \in V'~.$

Can we find a way to transform an upper bound to this regret to the one we care in (1)? One way is the following one: On each time step, construct a random variable ${i_t}$ that is equal to ${i}$ with probability ${x_{t,i}}$ for ${i=1,\dots,d}$. Then, select the expert according to the outcome of ${i_t}$. Now, in expectation we have

$\displaystyle \mathop{\mathbb E}[g_{t,i_t}] = \langle {\boldsymbol g}_t,{\boldsymbol x}_t\rangle,$

and

$\displaystyle \mathop{\mathbb E}[\text{Regret}_T({\boldsymbol e}_i)] = \mathop{\mathbb E}[\text{Regret}'_T({\boldsymbol e}_i)] = \mathop{\mathbb E}\left[\sum_{t=1}^T \langle {\boldsymbol g}_{t}, {\boldsymbol x}_t\rangle - \sum_{t=1}^T \langle {\boldsymbol g}_t, {\boldsymbol e}_i\rangle\right]~.$

This means that we can minimize in expectation the non-convex regret with a randomized OCO algorithm. We can summarize this reasoning in Algorithm 2.

For example, if we use EG as the OCO algorithm, setting ${{\boldsymbol x}_1=[1/d, \dots, 1/d]}$, then we obtain the following update rule

$\displaystyle x_{t+1,j} = \frac{x_{t,j}\exp(-\eta g_{t,j})}{\sum_{i=1}^d x_{t,i} \exp(-\eta g_{t,i})}, \ j=1,\dots,d$

and the regret will be

$\displaystyle \mathop{\mathbb E}[\text{Regret}_T({\boldsymbol e}_i)] \leq \frac{\sqrt{2}}{2} L_\infty \sqrt{T \ln d}, \ \forall {\boldsymbol e}_i~.$

It is worth stressing the importance of the result just obtained: We can design an algorithm that in expectation is close to the best expert in a set, paying only a logarithmic penalty in the size of the set.

In the future, we will see algorithms that achieve even the better regret guarantee of ${O(\sqrt{T\cdot \rm KL({\boldsymbol u};{\boldsymbol x}_1)})}$, for any ${{\boldsymbol u}}$ in the simplex. You should be able to convince yourself that no setting of ${\eta}$ in EG allows to achieve such regret guarantee. Indeed, the algorithm will be based on a very different strategy.

4. History Bits

The EG algorithm was introduced by (Kivinen, J. and Warmuth, M., 1997), but not as a specific instantiation of OMD. I am actually not sure when people first pointed out the link between EG and OMD, do you know something about this? If yes, please let me know!
The trick to set ${q=2\ln d}$ is from (Gentile, C. and Littlestone, N., 1999) (online learning) and apparently rediscovered in (Ben-Tal, A. and Margalit, T. and Nemirovski, A., 2001) (optimization). The learning with expert setting was introduced in (Littlestone, N. and Warmuth, M. K., 1994) and (Vovk, V. G, 1990). The ideas in Algorithm 3 are based on the Multiplicative Weights algorithm (Littlestone, N. and Warmuth, M. K., 1994) and the Hedge algorithm (Freund, Y. and Schapire, R. E., 1997). By now, the literature on learning with expert is huge, with tons of variations over algorithms and settings.

5. Exercises

Exercise 1 Derive the EG update rule and regret bound in the case that the algorithm starts from an arbitrary vector ${{\boldsymbol x}_1}$ in the probability simplex.

6. Appendix

Here we find the conjugate of the negative entropy.

It is a constrained optimization problem, we can solve it using the KKT conditions. We first rewrite it as a minimization problem

$\displaystyle \sup_{{\boldsymbol x} \in V} \ \langle {\boldsymbol \theta}, {\boldsymbol x}\rangle - \sum_{i=1}^d x_i \log x_i = -\min_{{\boldsymbol x} \in V} \ \sum_{i=1}^d x_i \log x_i - \langle {\boldsymbol \theta}, {\boldsymbol x}\rangle~.$

The KKT conditions are, for ${i=1, \cdots, d}$,

\displaystyle \begin{aligned} 1+\log x^\star_i - \theta_i -\lambda^\star_i+\nu^\star&=0 \\ \lambda_i^\star&\geq0\\ \sum_{i=1}^d x^\star_i &=1\\ \lambda_i^\star x_i^\star&=0\\ x_i^\star&\geq0~. \end{aligned}

From the first one we get ${x^\star_i = \exp(\theta_i + \lambda^\star_i-\nu^\star-1)}$. Using the third one we have ${\exp(\nu^\star+1)=\sum_{i=1}^d \exp(\theta_i +\lambda^\star_i)}$. Then, from the complementarity condition, fourth one, we get ${\lambda^\star_i=0}$. Putting all together, we have ${x^\star_i = \frac{\exp(\theta_i)}{\sum_{j=1}^d \exp(\theta_j)}}$.

Denoting with ${\alpha=\sum_{i=1}^d \exp(\theta_i)}$, and substituting in the definition of the conjugate function we get

\displaystyle \begin{aligned} \psi_V^\star({\boldsymbol \theta}) &= \sum_{i=1}^d \left(\frac{1}{\alpha} \theta_i\exp(\theta_i) - \frac{1}{\alpha} \exp(\theta_i)(\theta_i-\log(\alpha))\right)\\ &= \log(\alpha) \frac{1}{\alpha}\sum_{i=1}^d \exp(\theta_i )\\ &= \log(\alpha)\\ &= \log\left(\sum_{i=1}^d \exp(\theta_i)\right)~. \end{aligned}