This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019.
You can find all the lectures I published here.
where . This time we will see how and we will use this guarantee to prove an almost optimal regret guarantee for Exp3, in Algorithm 1.
Remark 1 While it is possible to prove (1) from first principles using the specific properties for the entropic regularizer, such proof will not shed any light of what is actually going on. So, in the following we will instead try to prove such regret in a very general way. Indeed, this general proof will allow us to easily prove the optimal bound for multi-armed bandits using OMD with the Tsallis entropy as regularizer.
Now, for a generic , consider the OMD algorithm that produces the predictions in two steps:
- Set such that .
- Set .
As we showed, under weak conditions, these two steps are equivalent to the usual OMD single-step update.
Now, the idea is to consider an alternative analysis of OMD that explicitly depends on , the new prediction before the Bregman projection step. First, let’s state the Generalized Pythagorean Theorem for Bregman divergences.
Lemma 1 Let and define , then for all .
Proof: From the first order optimality condition of we have that . Hence, we have
The Generalized Pythagorean Theorem is often used to prove that the Bregman divergence between any point in and an arbitrary point decreases when the consider the Bregman projection in .
We are now ready to prove our regret guarantee.
where and .
Proof: From the update rule, we have that
where in the second equality we used the 3-points equality for the Bregman divergences and the Generalized Pythagorean Theorem in the first inequality. Hence, summing over time we have
So, as we did in the previous lecture, we have
where and .
Putting all together, we have the stated bound.
This time it might be easier to get a handle over . Given that we only need an upper bound, we can just take a look at and and see which one is bigger. This is easy to do: using the update rule we have
Assuming , we have that implies .
Overall, we have the following improved regret guarantee for the Learning with Experts setting with positive losses.
Armed with this new tool, we can now turn to the multi-armed bandit problem again.
So, setting , we have
Remark 2 The need for a different analysis for OMD is due to the fact that we want an easy way to upper bound the Hessian. Indeed, in this analysis comes before the normalization into a probability distribution, that simplifies a lot the analysis. The same idea will be used for the Tsallis entropy in the next section.
So, with a tighter analysis we showed that, even without an explicit exploration term, OMD with entropic regularizer solves the multi-armed bandit problem paying only a factor more than the full information case. However, this is still not the optimal regret!
In the next section, we will see that changing the regularizer, with the same analysis, will remove the term in the regret.
1. Optimal Regret Using OMD with Tsallis Entropy
In this section, we present the Implicitly Normalized Forecaster (INF) also known as OMD with Tsallis entropy for multi-armed bandit.
Define as , where and in we extend the function by continuity. This is the negative Tsallis entropy of the vector . This is a strict generalization of the Shannon entropy, because when goes to 1, converges to the negative (Shannon) entropy of .
We will instantiate OMD with this regularizer for the multi-armed problem, as in Algorithm 2.
Note that and .
We will not use any interpretation of this regularizer from the information theory point of view. As we will see in the following, the only reason to choose it is its Hessian. In fact, the Hessian of this regularizer is still diagonal and it is equal to
Now, we can use again the modified analysis for OMD in Lemma 2. So, for any , we obtain
where and .
As we did for Exp3, now we need an upper bounds to the . From the update rule and the definition of , we have
So, if , , that implies that .
Hence, putting all together, we have
We can now specialize the above reasoning, considering in the Tsallis entropy, to obtain the following theorem.
Theorem 4 Assume . Set and . Then, Algorithm 2
Proof: We only need to calculate the terms
Proceeding as in (2), we obtain
Choosing , we finally obtain an expected regret of , that can be proved to be the optimal one.
There is one last thing, is how do we compute the prediction of this algorithm? In each step, we have to solve a constrained optimization problem. So, we can write the corresponding Lagragian:
From the KKT conditions, we have
and we also know that . So, we have a 1-dimensional problem in that must be solved in each round.
2. History Bits
The INF algorithm was proposed by (Audibert, J.-Y. and Bubeck, S., 2009) and re-casted as an OMD procedure in (Audibert, J.-Y. and Bubeck, S. and Lugosi, G., 2011). The connection with the Tsallis entropy was done in (Abernethy, J. D. and Lee, C. and Tewari, A., 2015). The specific proof presented here is new and it builds on the proof by (Abernethy, J. D. and Lee, C. and Tewari, A., 2015). Note that (Abernethy, J. D. and Lee, C. and Tewari, A., 2015) proved the same regret bound for a Follow-The-Regularized-Leader procedure over the stochastic estimates of the losses (that they call Gradient-Based Prediction Algorithm), while here we proved it using a OMD procedure.
Exercise 1 Prove that in the modified proof of OMD, the terms can be upper bounded by .
Exercise 2 Building on the previous exercise, prove that regret bounds of the same order can be obtained for Exp3 and for the INF/OMD with Tsallis entropy directly upper bounding the terms , without passing through the Bregman divergences.