*This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week.*

* You can find the lectures I published till now here.*

Last time, we saw that for Online Mirror Descent (OMD) with an entropic regularizer and learning rate it might be possible to get the regret guarantee

where . This time we will see how and we will use this guarantee to prove an almost optimal regret guarantee for Exp3, in Algorithm 1.

Remark 1While it is possible to prove (1) from first principles using the specific properties for the entropic regularizer, such proof will not shed any light of what is actually going on. So, in the following we will instead try to prove such regret in a very general way. Indeed, this general proof will allow us to easily prove the optimal bound for multi-armed bandits using OMD with the Tsallis entropy as regularizer.

Now, for a generic , consider the OMD algorithm that produces the predictions in two steps:

- Set such that .
- Set .

As we showed, under weak conditions, these two steps are equivalent to the usual OMD single-step update.

Now, the idea is to consider an alternative analysis of OMD that explicitly depends on , the new prediction before the Bregman projection step. First, let’s state the Generalized Pythagorean Theorem for Bregman divergences.

Lemma 1Let and define , then for all .

*Proof:* From the first order optimality condition of we have that . Hence, we have

The Generalized Pythagorean Theorem is often used to prove that the Bregman divergence between any point in and an arbitrary point decreases when the consider the Bregman projection in .

We are now ready to prove our regret guarantee.

Lemma 2For the two-steps OMD update above the following regret bound holds:

where and .

*Proof:* From the update rule, we have that

where in the second equality we used the 3-points equality for the Bregman divergences and the Generalized Pythagorean Theorem in the first inequality. Hence, summing over time we have

So, as we did in the previous lecture, we have

where and .

Putting all together, we have the stated bound.

This time it might be easier to get a handle over . Given that we only need an upper bound, we can just take a look at and and see which one is bigger. This is easy to do: using the update rule we have

that is

Assuming , we have that implies .

Overall, we have the following improved regret guarantee for the Learning with Experts setting with positive losses.

Theorem 3Assume for and . Let and . Using OMD with the entropic regularizer defined as , learning rate , and gives the following regret guarantee

Armed with this new tool, we can now turn to the multi-armed bandit problem again.

Let’s now consider the OMD with entropic regularizer, learning rate , and set equal to the stochastic estimate of , as in Algorithm 1. Applying Theorem 3 and taking expectation, we have

Now, focusing on the terms , we have

So, setting , we have

Remark 2The need for a different analysis for OMD is due to the fact that we want an easy way to upper bound the Hessian. Indeed, in this analysis comes before the normalization into a probability distribution, that simplifies a lot the analysis. The same idea will be used for the Tsallis entropy in the next section.

So, with a tighter analysis we showed that, even without an explicit exploration term, OMD with entropic regularizer solves the multi-armed bandit problem paying only a factor more than the full information case. However, this is still not the optimal regret!

In the next section, we will see that changing the regularizer, *with the same analysis*, will remove the term in the regret.

**1. Optimal Regret Using OMD with Tsallis Entropy **

In this section, we present the Implicitly Normalized Forecaster (INF) as known as OMD with Tsallis entropy algorithm for multi-armed bandit.

Define as , where and in we extend the function by continuity. This is the negative **Tsallis entropy** of the vector . This is a strict generalization of the Shannon entropy, because when goes to 1, converges to the negative (Shannon) entropy of .

We will instantiate OMD with this regularizer for the multi-armed problem, as in Algorithm 2.

Note that and .

We will not use any interpretation of this regularizer from the information theory point of view. As we will see in the following, the only reason to choose it is its Hessian. In fact, the Hessian of this regularizer is still diagonal and it is equal to

Now, we can use again the modified analysis for OMD in Lemma 2. So, for any , we obtain

where and .

As we did for Exp3, now we need an upper bounds to the . From the update rule and the definition of , we have

that is

So, if , , that implies that .

Hence, putting all together, we have

We can now specialize the above reasoning, considering in the Tsallis entropy, to obtain the following theorem.

Theorem 4Assume . Set and . Then, Algorithm 2

*Proof:* We only need to calculate the terms

Proceeding as in (2), we obtain

Choosing , we finally obtain an expected regret of , that can be proved to be the optimal one.

There is one last thing, is how do we compute the prediction of this algorithm? In each step, we have to solve a constrained optimization problem. So, we can write the corresponding Lagragian:

From the KKT conditions, we have

and we also know that . So, we have a 1-dimensional problem in that must be solved in each round.

**2. History Bits **

The INF algorithm was proposed by (Audibert, J.-Y. and Bubeck, S., 2009) and re-casted as an OMD procedure in (Audibert, J.-Y. and Bubeck, S. and Lugosi, G., 2011). The connection with the Tsallis entropy was done in (Abernethy, J. D. and Lee, C. and Tewari, A., 2015). The specific proof presented here is new and it builds on the proof by (Abernethy, J. D. and Lee, C. and Tewari, A., 2015). Note that (Abernethy, J. D. and Lee, C. and Tewari, A., 2015) proved the same regret bound for a Follow-The-Regularized-Leader procedure over the stochastic estimates of the losses (that they call Gradient-Based Prediction Algorithm), while here we proved it using a OMD procedure.

**3. Exercises **

Exercise 1Prove that in the modified proof of OMD, the terms can be upper bounded by .

Exercise 2Building on the previous exercise, prove that regret bounds of the same order can be obtained for Exp3 and for the INF/OMD with Tsallis entropy directly upper bounding the terms , without passing through the Bregman divergences.