Multi-Armed Bandit II

This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week.

You can find the lectures I have published so far here.



[Figure: Algorithm 1, the Exp3 algorithm.]

Last time, we saw that for Online Mirror Descent (OMD) with an entropic regularizer and learning rate {\eta} it might be possible to get the regret guarantee

\displaystyle \label{eq:eg_improved} \sum_{t=1}^T \langle {\boldsymbol g}_t, {\boldsymbol x}_t \rangle - \sum_{t=1}^T \langle {\boldsymbol g}_t,{\boldsymbol u}\rangle \leq \frac{\ln d}{\eta} + \frac{\eta}{2}\sum_{t=1}^T \sum_{i=1}^d x_{t,i} g_{t,i}^2, \ \forall {\boldsymbol u} \in V, \ \ \ \ \ (1)

where {V=\{{\boldsymbol x} \in {\mathbb R}^d: \|{\boldsymbol x}\|_1=1, x_i\geq0\}}. This time we will see how to obtain this guarantee, and we will use it to prove an almost optimal regret bound for Exp3, shown in Algorithm 1.

Remark 1 While it is possible to prove (1) from first principles using the specific properties of the entropic regularizer, such a proof would not shed any light on what is actually going on. So, in the following we will instead prove such a regret bound in a very general way. Indeed, this general proof will allow us to easily prove the optimal bound for multi-armed bandits using OMD with the Tsallis entropy as regularizer.

Now, for a generic {\psi}, consider the OMD algorithm that produces the predictions in two steps:

  • Set {\tilde{{\boldsymbol x}}_{t+1}} such that {\nabla \psi(\tilde{{\boldsymbol x}}_{t+1}) = \nabla \psi({\boldsymbol x}_{t}) - \eta {\boldsymbol g}_t}.
  • Set {{\boldsymbol x}_{t+1}= \mathop{\mathrm{argmin}}_{{\boldsymbol x} \in V} \ B_\psi({\boldsymbol x}; \tilde{{\boldsymbol x}}_{t+1})}.

As we showed, under weak conditions, these two steps are equivalent to the usual OMD single-step update.
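Concretely, for the entropic regularizer {\psi({\boldsymbol x})=\sum_{i=1}^d x_i \ln x_i} the mirror step is a coordinate-wise multiplicative update, and the Bregman (KL) projection onto the simplex reduces to an L1 normalization. A minimal Python sketch of the two steps (the function name is mine):

```python
import numpy as np

def omd_entropic_two_steps(x, g, eta):
    """One two-step OMD update with the entropic regularizer.

    Step 1: solve grad psi(x_tilde) = grad psi(x) - eta * g; since
            grad psi(x)_i = log(x_i) + 1, this gives x_tilde_i = x_i * exp(-eta * g_i).
    Step 2: the Bregman (KL) projection onto the simplex is L1 normalization.
    """
    x_tilde = x * np.exp(-eta * g)
    x_next = x_tilde / x_tilde.sum()
    return x_tilde, x_next
```

Returning the intermediate point {\tilde{{\boldsymbol x}}_{t+1}} alongside the projected one mirrors the two-step decomposition used in the analysis.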

Now, the idea is to consider an alternative analysis of OMD that explicitly depends on {\tilde{{\boldsymbol x}}_{t+1}}, the new prediction before the Bregman projection step. First, let’s state the Generalized Pythagorean Theorem for Bregman divergences.

Lemma 1 Let {\tilde{{\boldsymbol x}} \in \mathop{\mathrm{int}} \mathop{\mathrm{dom}} \psi} and define {{\boldsymbol x}' = \mathop{\mathrm{argmin}}_{{\boldsymbol y} \in V} \ B_\psi({\boldsymbol y};\tilde{{\boldsymbol x}})}, then {B_{\psi} ({\boldsymbol x}; \tilde{{\boldsymbol x}}) \geq B_{\psi}({\boldsymbol x};{\boldsymbol x}') + B_{\psi}({\boldsymbol x}';\tilde{{\boldsymbol x}})} for all {{\boldsymbol x} \in V}.

Proof: From the first order optimality condition of {{\boldsymbol x}'} we have that {\langle\nabla \psi({\boldsymbol x}')-\nabla\psi(\tilde{{\boldsymbol x}}),{\boldsymbol x}-{\boldsymbol x}'\rangle \geq 0, \forall {\boldsymbol x} \in V}. Hence, we have

\displaystyle \begin{aligned} B_{\psi}({\boldsymbol x};{\boldsymbol x}') + B_{\psi}({\boldsymbol x}';\tilde{{\boldsymbol x}}) &= \psi({\boldsymbol x})-\psi({\boldsymbol x}')-\langle \nabla \psi({\boldsymbol x}'), {\boldsymbol x}-{\boldsymbol x}'\rangle + \psi({\boldsymbol x}')-\psi(\tilde{{\boldsymbol x}})-\langle \nabla \psi(\tilde{{\boldsymbol x}}), {\boldsymbol x}'-\tilde{{\boldsymbol x}}\rangle \\ &= \psi({\boldsymbol x}) - \psi(\tilde{{\boldsymbol x}}) -\langle\nabla \psi({\boldsymbol x}'),{\boldsymbol x}-{\boldsymbol x}'\rangle - \langle \nabla \psi(\tilde{{\boldsymbol x}}), {\boldsymbol x}'-\tilde{{\boldsymbol x}}\rangle \\ &\leq \psi({\boldsymbol x}) - \psi(\tilde{{\boldsymbol x}})-\langle\nabla \psi(\tilde{{\boldsymbol x}}),{\boldsymbol x}-{\boldsymbol x}'\rangle - \langle \nabla \psi(\tilde{{\boldsymbol x}}), {\boldsymbol x}'-\tilde{{\boldsymbol x}}\rangle\\ &=\psi({\boldsymbol x}) - \psi(\tilde{{\boldsymbol x}}) - \langle \nabla \psi(\tilde{{\boldsymbol x}}), {\boldsymbol x} -\tilde{{\boldsymbol x}}\rangle \\ &=B_\psi({\boldsymbol x};\tilde{{\boldsymbol x}})~. \end{aligned}

\Box

The Generalized Pythagorean Theorem is often used to prove that the Bregman divergence between any point {{\boldsymbol x}} in {V} and an arbitrary point {\tilde{{\boldsymbol x}}} decreases when we replace {\tilde{{\boldsymbol x}}} with its Bregman projection onto {V}.

We are now ready to prove our regret guarantee.

Lemma 2 For the two-step OMD update above, the following regret bound holds:

\displaystyle \sum_{t=1}^T \langle {\boldsymbol g}_t, {\boldsymbol x}_t - {\boldsymbol u}\rangle \leq \frac{B_\psi({\boldsymbol u}; {\boldsymbol x}_1)}{\eta} + \frac{\eta}{2} \sum_{t=1}^T {\boldsymbol g}_t^\top (\nabla^2 \psi({\boldsymbol z}_t))^{-1} {\boldsymbol g}_t, \ \forall {\boldsymbol u} \in V,

where {{\boldsymbol z}_t=\alpha_t {\boldsymbol x}_t+(1-\alpha_t) \tilde{{\boldsymbol x}}_{t+1}} and {\alpha_t\in[0,1]}.

Proof: From the update rule, we have that

\displaystyle \begin{aligned} \langle \eta {\boldsymbol g}_t, {\boldsymbol x}_t -{\boldsymbol u}\rangle &= \langle \nabla \psi({\boldsymbol x}_t)-\nabla \psi(\tilde{{\boldsymbol x}}_{t+1}), {\boldsymbol x}_t -{\boldsymbol u}\rangle \\ &= B_\psi({\boldsymbol u}; {\boldsymbol x}_t) - B_\psi({\boldsymbol u}; \tilde{{\boldsymbol x}}_{t+1}) + B_\psi({\boldsymbol x}_t; \tilde{{\boldsymbol x}}_{t+1}) \\ &\leq B_\psi({\boldsymbol u}; {\boldsymbol x}_t) - B_\psi({\boldsymbol u}; {\boldsymbol x}_{t+1}) + B_\psi({\boldsymbol x}_t; \tilde{{\boldsymbol x}}_{t+1}) - B_\psi({\boldsymbol x}_{t+1}; \tilde{{\boldsymbol x}}_{t+1}) \\ &\leq B_\psi({\boldsymbol u}; {\boldsymbol x}_t) - B_\psi({\boldsymbol u}; {\boldsymbol x}_{t+1}) + B_\psi({\boldsymbol x}_t; \tilde{{\boldsymbol x}}_{t+1}), \end{aligned}

where in the second equality we used the 3-points equality for Bregman divergences, and in the first inequality the Generalized Pythagorean Theorem. Hence, summing over time we have

\displaystyle \sum_{t=1}^T \langle {\boldsymbol g}_t, {\boldsymbol x}_t - {\boldsymbol u}\rangle \leq \frac{B_\psi({\boldsymbol u}; {\boldsymbol x}_1)}{\eta} + \frac{1}{\eta}\sum_{t=1}^T B_\psi({\boldsymbol x}_t; \tilde{{\boldsymbol x}}_{t+1})~.

So, as we did in the previous lecture, we have

\displaystyle \begin{aligned} B_\psi({\boldsymbol x}_t; \tilde{{\boldsymbol x}}_{t+1}) &=\psi({\boldsymbol x}_t)-\psi(\tilde{{\boldsymbol x}}_{t+1}) -\langle \nabla \psi(\tilde{{\boldsymbol x}}_{t+1}), {\boldsymbol x}_t-\tilde{{\boldsymbol x}}_{t+1}\rangle \\ &=-B_{\psi}(\tilde{{\boldsymbol x}}_{t+1}; {\boldsymbol x}_t) + \langle \nabla \psi({\boldsymbol x}_t) - \nabla \psi(\tilde{{\boldsymbol x}}_{t+1}), {\boldsymbol x}_{t}-\tilde{{\boldsymbol x}}_{t+1}\rangle \\ &=-B_{\psi}(\tilde{{\boldsymbol x}}_{t+1}; {\boldsymbol x}_t) + \langle \eta {\boldsymbol g}_t, {\boldsymbol x}_{t}-\tilde{{\boldsymbol x}}_{t+1}\rangle \\ &\leq -B_{\psi}(\tilde{{\boldsymbol x}}_{t+1}; {\boldsymbol x}_t) + \frac{1}{2} \eta^2 {\boldsymbol g}_t^\top (\nabla^2 \psi({\boldsymbol z}_t))^{-1} {\boldsymbol g}_t + \frac{1}{2} \|{\boldsymbol x}_{t}-\tilde{{\boldsymbol x}}_{t+1}\|^2_{\nabla^2 \psi({\boldsymbol z}_t)} \\ &= \frac{1}{2}\eta^2 {\boldsymbol g}_t^\top (\nabla^2 \psi({\boldsymbol z}_t))^{-1} {\boldsymbol g}_t, \end{aligned}

where {{\boldsymbol z}_t=\alpha_t {\boldsymbol x}_t+(1-\alpha_t) \tilde{{\boldsymbol x}}_{t+1}} and {\alpha_t\in[0,1]}.

Putting all together, we have the stated bound. \Box

This time it is easier to get a handle on {z_{t,i}}. Given that we only need an upper bound, we can just compare {x_{t,i}} and {\tilde{x}_{t+1,i}} and see which one is bigger. This is easy to do: using the update rule, we have

\displaystyle \ln(\tilde{x}_{t+1,i})+1=\ln(x_{t,i})+1-\eta g_{t,i},

that is

\displaystyle \tilde{x}_{t+1,i} = x_{t,i} \exp(-\eta g_{t,i})~.

Assuming {g_{t,i}\geq 0}, we have {\tilde{x}_{t+1,i}\leq x_{t,i}} that implies {z_{t,i} \leq x_{t,i}}.

Overall, we have the following improved regret guarantee for the Learning with Experts setting with positive losses.

Theorem 3 Assume {g_{t,i}\geq0} for {t=1,\dots,T} and {i=1,\dots,d}. Let {V=\{{\boldsymbol x} \in {\mathbb R}^d: \|{\boldsymbol x}\|_1=1, x_i\geq0\}} and {\eta>0}. Using OMD with the entropic regularizer {\psi: {\mathbb R}^d_+ \rightarrow {\mathbb R}} defined as {\psi({\boldsymbol x})=\sum_{i=1}^d x_i \ln x_i}, learning rate {\eta}, and {{\boldsymbol x}_1=[1/d,\dots,1/d]} gives the following regret guarantee

\displaystyle \sum_{t=1}^T \langle {\boldsymbol g}_t, {\boldsymbol x}_t\rangle - \sum_{t=1}^T \langle {\boldsymbol g}_t, {\boldsymbol u}\rangle \leq \frac{\ln d}{\eta} + \frac{\eta}{2} \sum_{t=1}^T \sum_{i=1}^d x_{t,i} g_{t,i}^2, \ \forall {\boldsymbol u} \in V~.

Armed with this new tool, we can now turn to the multi-armed bandit problem again.

Let’s now consider the OMD with entropic regularizer, learning rate {\eta}, and set {\tilde{{\boldsymbol g}}_{t}} equal to the stochastic estimate of {{\boldsymbol g}_t}, as in Algorithm 1. Applying Theorem 3 and taking expectation, we have

\displaystyle \mathop{\mathbb E}\left[\sum_{t=1}^T g_{t,A_t}\right] - \sum_{t=1}^T \langle {\boldsymbol g}_t,{\boldsymbol u}\rangle = \mathop{\mathbb E}\left[\sum_{t=1}^T \langle \tilde{{\boldsymbol g}}_{t},{\boldsymbol x}_t\rangle - \sum_{t=1}^T \langle \tilde{{\boldsymbol g}}_t,{\boldsymbol u}\rangle\right] \leq \frac{B_\psi({\boldsymbol u}; {\boldsymbol x}_1)}{\eta} + \frac{\eta}{2}\mathop{\mathbb E}\left[\sum_{t=1}^T \sum_{i=1}^d x_{t,i} \tilde{g}_{t,i}^2\right]~.

Now, focusing on the terms {\mathop{\mathbb E}[x_{t,i} \tilde{g}_{t,i}^2]}, we have

\displaystyle \label{eq:expectation_exp3} \mathop{\mathbb E}\left[\sum_{i=1}^d x_{t,i} \tilde{g}_{t,i}^2\right] = \mathop{\mathbb E}\left[\mathop{\mathbb E}\left[\sum_{i=1}^d x_{t,i} \tilde{g}_{t,i}^2\middle|A_1, \dots, A_{t-1}\right]\right] = \mathop{\mathbb E}\left[\sum_{i=1}^d x_{t,i} \frac{g_{t,i}^2}{x_{t,i}}\right] \leq d L_\infty^2~. \ \ \ \ \ (2)

So, setting {\eta \propto \sqrt{\frac{\ln d}{d L^2_\infty T}}}, we have

\displaystyle \mathop{\mathbb E}\left[\sum_{t=1}^T g_{t,A_t}\right] - \sum_{t=1}^T \langle {\boldsymbol g}_t,{\boldsymbol u}\rangle \leq O\left(L_\infty \sqrt{d T \ln d}\right)~.
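For concreteness, here is a possible Python sketch of the whole Exp3 loop: sample {A_t \sim {\boldsymbol x}_t}, build the importance-weighted estimate {\tilde{g}_{t,i} = \frac{g_{t,A_t}}{x_{t,A_t}} \boldsymbol{1}[A_t = i]}, then run the entropic OMD update. The function signature and names are my own:

```python
import numpy as np

def exp3(loss_fn, d, T, eta, seed=0):
    """Exp3 sketch: OMD with the entropic regularizer run on
    importance-weighted loss estimates."""
    rng = np.random.default_rng(seed)
    x = np.full(d, 1.0 / d)               # x_1 = [1/d, ..., 1/d]
    total_loss = 0.0
    for t in range(T):
        arm = rng.choice(d, p=x)          # A_t ~ x_t
        loss = loss_fn(t, arm)            # only g_{t, A_t} is observed
        total_loss += loss
        g_tilde = np.zeros(d)
        g_tilde[arm] = loss / x[arm]      # unbiased estimate of g_t
        x = x * np.exp(-eta * g_tilde)    # mirror step
        x = x / x.sum()                   # Bregman projection onto the simplex
    return total_loss, x
```

Note that no explicit exploration term is added: the multiplicative update alone keeps every arm's probability positive.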

Remark 2 The need for a different analysis of OMD comes from the fact that we want an easy way to upper bound the Hessian. Indeed, in this analysis {\tilde{{\boldsymbol x}}_{t+1}} comes before the normalization into a probability distribution, which simplifies the analysis a lot. The same idea will be used for the Tsallis entropy in the next section.

So, with a tighter analysis we showed that, even without an explicit exploration term, OMD with entropic regularizer solves the multi-armed bandit problem paying only a factor {\sqrt{d}} more than the full information case. However, this is still not the optimal regret!

In the next section, we will see that changing the regularizer, with the same analysis, will remove the {\sqrt{\ln d}} term in the regret.

1. Optimal Regret Using OMD with Tsallis Entropy

In this section, we present the Implicitly Normalized Forecaster (INF), also known as OMD with the Tsallis entropy, for the multi-armed bandit problem.

Define {\psi_q:{\mathbb R}^d_+ \rightarrow {\mathbb R}} as {\psi_q({\boldsymbol x})=\frac{1}{1-q}\left(1-\sum_{i=1}^d x_i^q\right)}, where {q \in [0,1]} and at {q=1} we extend the function by continuity. This is the negative Tsallis entropy of the vector {{\boldsymbol x}}, a strict generalization of the (negative) Shannon entropy: when {q} goes to 1, {\psi_q({\boldsymbol x})} converges to the negative Shannon entropy of {{\boldsymbol x}}.
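As a quick numerical sanity check of this limit, we can compare {\psi_q} for {q} close to 1 with the negative Shannon entropy (the test vector below is arbitrary):

```python
import numpy as np

def neg_tsallis(x, q):
    """Negative Tsallis entropy; at q = 1 it is the negative Shannon entropy."""
    if q == 1.0:
        return float(np.sum(x * np.log(x)))
    return float((1.0 - np.sum(x ** q)) / (1.0 - q))

x = np.array([0.5, 0.3, 0.2])
# the gap to the negative Shannon entropy shrinks as q -> 1
gaps = [abs(neg_tsallis(x, q) - neg_tsallis(x, 1.0)) for q in (0.9, 0.99, 0.999)]
```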

We will instantiate OMD with this regularizer for the multi-armed bandit problem, as in Algorithm 2.


[Figure: Algorithm 2, the INF algorithm.]

Note that {\mathop{\mathrm{argmin}}_{{\boldsymbol x} \in V} \psi_q({\boldsymbol x}) = [1/d, \dots, 1/d]} and {\min_{{\boldsymbol x} \in V} \psi_q({\boldsymbol x}) = \frac{1 - d^{1-q}}{1-q}}.

We will not use any interpretation of this regularizer from the information theory point of view. As we will see in the following, the only reason to choose it is its Hessian. In fact, the Hessian of this regularizer is still diagonal and it is equal to

\displaystyle (\nabla^2 \psi_q ({\boldsymbol z}))_{ii}= q \frac{1}{z_i^{2-q}}~.
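Since this Hessian is the only property of {\psi_q} that the analysis uses, it is worth double-checking it numerically, e.g., with finite differences (the test point, value of {q}, and tolerances below are arbitrary choices of mine):

```python
import numpy as np

q = 0.7                              # an arbitrary value of q in (0, 1)

def psi(x):
    # negative Tsallis entropy
    return (1.0 - np.sum(x ** q)) / (1.0 - q)

z = np.array([0.2, 0.5, 0.3])        # an arbitrary point in the simplex
d, eps = len(z), 1e-5
H = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        ei, ej = eps * np.eye(d)[i], eps * np.eye(d)[j]
        # second-order finite difference of psi
        H[i, j] = (psi(z + ei + ej) - psi(z + ei) - psi(z + ej) + psi(z)) / eps ** 2

assert np.allclose(np.diag(H), q / z ** (2.0 - q), rtol=1e-3)   # matches q / z_i^(2-q)
assert np.allclose(H - np.diag(np.diag(H)), 0.0, atol=1e-3)     # psi_q is separable
```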

Now, we can again use the modified analysis for OMD in Lemma 2. So, for any {{\boldsymbol u} \in V}, setting {{\boldsymbol x}_1=[1/d,\dots,1/d]}, we obtain

\displaystyle \sum_{t=1}^T \langle {\boldsymbol g}_t, {\boldsymbol x}_t - {\boldsymbol u}\rangle \leq \frac{B_{\psi_q}({\boldsymbol u}; {\boldsymbol x}_1)}{\eta} + \frac{\eta}{2} \sum_{t=1}^T {\boldsymbol g}_t^\top (\nabla^2 \psi_q({\boldsymbol z}_t))^{-1} {\boldsymbol g}_t = \frac{d^{1-q} - \sum_{i=1}^d u_i^q}{\eta(1-q)} + \frac{\eta}{2q} \sum_{t=1}^T \sum_{i=1}^d g_{t,i}^2 z_{t,i}^{2-q},

where {{\boldsymbol z}_t=\alpha_t {\boldsymbol x}_t+(1-\alpha_t) \tilde{{\boldsymbol x}}_{t+1}} and {\alpha_t\in[0,1]}.

As we did for Exp3, we now need an upper bound on the {z_{t,i}}. From the update rule and the definition of {\psi_q}, we have

\displaystyle -\frac{q}{1-q} \tilde{x}^{q-1}_{t+1,i} = -\frac{q}{1-q} x^{q-1}_{t,i} - \eta g_{t,i},

that is

\displaystyle \tilde{x}_{t+1,i} = \frac{x_{t,i}}{\left(1+\frac{1-q}{q}\eta g_{t,i} x^{1-q}_{t,i}\right)^\frac{1}{1-q}}~.

So, if {g_{t,i}\geq0}, then {\tilde{x}_{t+1,i}\leq x_{t,i}}, which implies that {z_{t,i} \leq x_{t,i}}.
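We can verify this closed form numerically: plugged back into the mirror equation {\nabla \psi_q(\tilde{{\boldsymbol x}}_{t+1}) = \nabla \psi_q({\boldsymbol x}_t) - \eta {\boldsymbol g}_t} it should give an identity, and nonnegative losses should shrink every coordinate. A small sketch with arbitrary values:

```python
import numpy as np

q, eta = 0.5, 0.1                 # arbitrary test values
x = np.array([0.4, 0.6])
g = np.array([1.0, 0.2])          # nonnegative loss estimates

def grad_psi(x):
    # gradient of the negative Tsallis entropy psi_q
    return -q / (1.0 - q) * x ** (q - 1.0)

# the closed-form solution of the mirror step
x_tilde = x / (1.0 + (1.0 - q) / q * eta * g * x ** (1.0 - q)) ** (1.0 / (1.0 - q))

assert np.allclose(grad_psi(x_tilde), grad_psi(x) - eta * g)  # mirror equation holds
assert np.all(x_tilde <= x)       # nonnegative losses shrink every coordinate
```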

Hence, putting all together, we have

\displaystyle \sum_{t=1}^T \langle {\boldsymbol g}_t, {\boldsymbol x}_t - {\boldsymbol u}\rangle \leq \frac{d^{1-q} - \sum_{i=1}^d u_i^q}{\eta(1-q)} + \frac{\eta}{2q} \sum_{t=1}^T \sum_{i=1}^d g_{t,i}^2 x_{t,i}^{2-q}~.

We can now specialize the above reasoning, considering {q=1/2} in the Tsallis entropy, to obtain the following theorem.

Theorem 4 Assume {g_{t,i} \in [0, L_\infty]}. Set {q=1/2} and {{\boldsymbol x}_1=[1/d, \dots, 1/d]}. Then, Algorithm 2 guarantees

\displaystyle \mathop{\mathbb E}\left[\sum_{t=1}^T g_{t,A_t}\right] - \sum_{t=1}^T \langle {\boldsymbol g}_t, {\boldsymbol u}\rangle \leq \frac{2\sqrt{d}}{\eta} + \eta \sqrt{d} L_\infty^2 T~.

Proof: We only need to calculate the terms

\displaystyle \mathop{\mathbb E}\left[\sum_{i=1}^d \tilde{g}^2_{t,i} x_{t,i}^{3/2}\right]~.

Proceeding as in (2), we obtain

\displaystyle \begin{aligned} \mathop{\mathbb E}\left[\sum_{i=1}^d \tilde{g}^2_{t,i} x_{t,i}^{3/2}\right] &= \mathop{\mathbb E}\left[\mathop{\mathbb E}\left[\sum_{i=1}^d x_{t,i}^{3/2} \tilde{g}_{t,i}^2\middle|A_1, \dots, A_{t-1}\right]\right] = \mathop{\mathbb E}\left[\sum_{i=1}^d x_{t,i}^{3/2} \frac{g_{t,i}^2}{x_{t,i}^2} x_{t,i}\right] = \mathop{\mathbb E}\left[\sum_{i=1}^d g_{t,i}^2 \sqrt{x_{t,i}}\right] \\ &\leq \mathop{\mathbb E}\left[\sqrt{\sum_{i=1}^d g_{t,i}^2} \sqrt{\sum_{i=1}^d g_{t,i}^2 x_{t,i}}\right] \leq \sqrt{d} L_\infty^2~. \end{aligned}

\Box

Choosing {\eta \propto \frac{1}{L_\infty \sqrt{T}}}, we finally obtain an expected regret of {O(L_\infty\sqrt{d T})}, which can be shown to be optimal.

One last thing remains: how do we compute the predictions of this algorithm? In each step, we have to solve a constrained optimization problem. So, we can write the corresponding Lagrangian:

\displaystyle L({\boldsymbol x},\beta) = \sum_{i=1}^d \left(\eta\tilde{g}_{t,i} + \frac{q}{1-q}x_{t,i}^{q-1}\right)x_i - \frac{1}{1-q} \sum_{i=1}^d x_i^q + \beta\left(\sum_{i=1}^d x_i-1\right)~.

From the KKT conditions, we have

\displaystyle x_{t+1,i} = \left[\frac{1-q}{q}\left(\beta+ \frac{q}{1-q}x_{t,i}^{q-1}+\eta \tilde{g}_{t,i}\right)\right]^\frac{1}{q-1},

and we also know that {\sum_{i=1}^d x_{t+1,i}=1}. So, in each round we have to solve a 1-dimensional problem in {\beta}.
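A possible way to solve this 1-dimensional problem is bisection: the map {\beta \mapsto \sum_{i=1}^d x_{t+1,i}(\beta)} is monotonically decreasing, blows up as {\beta} approaches the boundary of the feasible region, and goes to 0 as {\beta\rightarrow\infty}. A sketch (function name, starting bracket, and iteration count are my own choices):

```python
import numpy as np

def inf_update(x_prev, g_tilde, eta, q=0.5, iters=200):
    """One INF projection step: find beta by bisection so that the
    KKT solution x_{t+1}(beta) sums to 1 (q in (0, 1))."""
    a = q / (1.0 - q) * x_prev ** (q - 1.0) + eta * g_tilde

    def coords(beta):
        # x_{t+1, i}(beta) from the KKT conditions
        return ((1.0 - q) / q * (beta + a)) ** (1.0 / (q - 1.0))

    lo = -a.min() + 1e-12           # sum(coords) -> infinity at this endpoint
    hi = 1.0
    while coords(hi).sum() > 1.0:   # grow hi until the sum drops below 1
        hi *= 2.0
    for _ in range(iters):          # bisect on the monotone map beta -> sum
        mid = 0.5 * (lo + hi)
        if coords(mid).sum() > 1.0:
            lo = mid
        else:
            hi = mid
    return coords(hi)
```

With {\tilde{{\boldsymbol g}}_t = \boldsymbol{0}} the step should return {{\boldsymbol x}_t} itself (the optimal {\beta} is 0), which gives a cheap sanity check.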

2. History Bits

The INF algorithm was proposed by (Audibert, J.-Y. and Bubeck, S., 2009) and re-cast as an OMD procedure in (Audibert, J.-Y. and Bubeck, S. and Lugosi, G., 2011). The connection with the Tsallis entropy was made in (Abernethy, J. D. and Lee, C. and Tewari, A., 2015). The specific proof presented here is new and it builds on the proof by (Abernethy, J. D. and Lee, C. and Tewari, A., 2015). Note that (Abernethy, J. D. and Lee, C. and Tewari, A., 2015) proved the same regret bound for a Follow-The-Regularized-Leader procedure over the stochastic estimates of the losses (which they call Gradient-Based Prediction Algorithm), while here we proved it using an OMD procedure.

3. Exercises

Exercise 1 Prove that in the modified proof of OMD, the terms {B_\psi({\boldsymbol x}_t; \tilde{{\boldsymbol x}}_{t+1})} can be upper bounded by {\langle \eta {\boldsymbol g}_t, {\boldsymbol x}_t - \tilde{{\boldsymbol x}}_{t+1}\rangle}.

Exercise 2 Building on the previous exercise, prove that regret bounds of the same order can be obtained for Exp3 and for the INF/OMD with Tsallis entropy directly upper bounding the terms {\langle \eta {\boldsymbol g}_t, {\boldsymbol x}_t - \tilde{{\boldsymbol x}}_{t+1}\rangle}, without passing through the Bregman divergences.

 
