This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019.
You can find all the lectures I published here.
Last time, we saw that for Online Mirror Descent (OMD) with an entropic regularizer and learning rate $\eta > 0$ it might be possible to get the regret guarantee

$$\sum_{t=1}^T \langle g_t, x_t - u\rangle \le \frac{\ln d}{\eta} + \frac{\eta}{2} \sum_{t=1}^T \sum_{i=1}^d x_{t,i}\, g_{t,i}^2, \qquad (1)$$

where $g_{t,i} \ge 0$ for all $t$ and $i$. This time we will see how, and we will use this guarantee to prove an almost optimal regret guarantee for Exp3, in Algorithm 1.
Remark 1 While it is possible to prove (1) from first principles using the specific properties of the entropic regularizer, such a proof would not shed any light on what is actually going on. So, in the following we will instead try to prove such a regret bound in a very general way. Indeed, this general proof will allow us to easily prove the optimal bound for multi-armed bandits using OMD with the Tsallis entropy as regularizer.
Now, for a generic regularizer $\psi$, consider the OMD algorithm that produces the predictions in two steps:
- Set $\tilde{x}_{t+1}$ such that $\nabla \psi(\tilde{x}_{t+1}) = \nabla \psi(x_t) - \eta\, g_t$.
- Set $x_{t+1} = \mathrm{argmin}_{x \in V}\ B_\psi(x, \tilde{x}_{t+1})$.
As we showed, under weak conditions, these two steps are equivalent to the usual OMD single-step update.
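To make the two steps concrete, here is a minimal sketch in Python of one round of the update for the entropic regularizer on the simplex (my own illustration, not from the original notes): the mirror step becomes a coordinate-wise multiplicative update and the Bregman projection reduces to a normalization.

```python
import numpy as np

def omd_entropic_two_steps(x_t, g_t, eta):
    """One round of the two-step OMD update with the entropic regularizer.

    Step 1: nabla psi(x_tilde) = nabla psi(x_t) - eta * g_t, which for
    psi(x) = sum_i x_i log(x_i) gives x_tilde_i = x_t_i * exp(-eta * g_t_i).
    Step 2: the Bregman projection onto the simplex, which for this
    regularizer is just a normalization.
    """
    x_tilde = x_t * np.exp(-eta * g_t)   # un-normalized prediction
    x_next = x_tilde / x_tilde.sum()     # Bregman projection onto the simplex
    return x_tilde, x_next
```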
Now, the idea is to consider an alternative analysis of OMD that explicitly depends on $\tilde{x}_{t+1}$, the new prediction before the Bregman projection step. First, let’s state the Generalized Pythagorean Theorem for Bregman divergences.
Lemma 1 Let $y \in X$ and define $x' = \mathrm{argmin}_{x \in V}\ B_\psi(x, y)$, then

$$B_\psi(x, y) \ge B_\psi(x, x') + B_\psi(x', y)$$

for all $x \in V$.
Proof: From the first-order optimality condition of $x' = \mathrm{argmin}_{x \in V}\ B_\psi(x, y)$ we have that $\langle \nabla\psi(x') - \nabla\psi(y), x - x'\rangle \ge 0$ for all $x \in V$. Hence, we have

$$B_\psi(x, y) - B_\psi(x, x') - B_\psi(x', y) = \langle \nabla\psi(x') - \nabla\psi(y), x - x'\rangle \ge 0.$$
The Generalized Pythagorean Theorem is often used to prove that the Bregman divergence between any point $x$ in $V$ and an arbitrary point $y$ decreases when we consider the Bregman projection of $y$ onto $V$.
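As a small sanity check (mine, not part of the original notes), the inequality can be verified numerically for the entropic regularizer, whose Bregman divergence is the unnormalized KL divergence and whose Bregman projection of a positive vector onto the simplex is a simple normalization; in this particular case the inequality in fact holds with equality.

```python
import numpy as np

def bregman_entropy(x, y):
    # B_psi(x, y) for psi(x) = sum_i x_i log(x_i): the unnormalized KL divergence
    return np.sum(x * np.log(x / y) - x + y)

rng = np.random.default_rng(0)
y = rng.uniform(0.1, 2.0, size=5)      # an arbitrary positive point outside the simplex
x_proj = y / y.sum()                   # Bregman projection of y onto the simplex
x = rng.dirichlet(np.ones(5))          # an arbitrary point in the simplex

lhs = bregman_entropy(x, y)
rhs = bregman_entropy(x, x_proj) + bregman_entropy(x_proj, y)
assert lhs >= rhs - 1e-12              # B(x, y) >= B(x, x') + B(x', y)
```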
We are now ready to prove our regret guarantee.
Lemma 2 For the two-step OMD update above, for any $u \in V$ the following regret bound holds:

$$\sum_{t=1}^T \langle g_t, x_t - u\rangle \le \frac{B_\psi(u, x_1)}{\eta} + \frac{\eta}{2} \sum_{t=1}^T \|g_t\|^2_{(\nabla^2\psi(z_t))^{-1}},$$

where $\|g\|^2_{A^{-1}} := g^\top A^{-1} g$ and $z_t$ is a point on the line segment between $x_t$ and $\tilde{x}_{t+1}$.
Proof: From the update rule, we have that

$$\begin{aligned}
\eta\, \langle g_t, x_t - u\rangle
&= \langle \nabla\psi(x_t) - \nabla\psi(\tilde{x}_{t+1}), x_t - u\rangle \\
&= B_\psi(u, x_t) - B_\psi(u, \tilde{x}_{t+1}) + B_\psi(x_t, \tilde{x}_{t+1}) \\
&\le B_\psi(u, x_t) - B_\psi(u, x_{t+1}) + B_\psi(x_t, \tilde{x}_{t+1}),
\end{aligned}$$

where in the second equality we used the 3-points equality for the Bregman divergences and in the first inequality the Generalized Pythagorean Theorem. Hence, summing over time we have

$$\sum_{t=1}^T \langle g_t, x_t - u\rangle \le \frac{B_\psi(u, x_1)}{\eta} + \frac{1}{\eta} \sum_{t=1}^T B_\psi(x_t, \tilde{x}_{t+1}).$$
So, as we did in the previous lecture, by Taylor’s theorem we have

$$B_\psi(x_t, \tilde{x}_{t+1}) \le \frac{\eta^2}{2}\, \|g_t\|^2_{(\nabla^2\psi(z_t))^{-1}},$$

where $z_t$ is a point on the line segment between $x_t$ and $\tilde{x}_{t+1}$ and $\|g\|^2_{A^{-1}} := g^\top A^{-1} g$.
Putting it all together, we have the stated bound.
This time it might be easier to get a handle on $z_t$. Given that we only need an upper bound, we can just take a look at $x_t$ and $\tilde{x}_{t+1}$ and see which one is bigger, coordinate-wise. This is easy to do: using the update rule with the entropic regularizer we have

$$\ln \tilde{x}_{t+1,i} + 1 = \ln x_{t,i} + 1 - \eta\, g_{t,i},$$

that is

$$\tilde{x}_{t+1,i} = x_{t,i} \exp(-\eta\, g_{t,i}).$$

Assuming $g_{t,i} \ge 0$, we have $\tilde{x}_{t+1,i} \le x_{t,i}$, that implies $z_{t,i} \le x_{t,i}$ and

$$\|g_t\|^2_{(\nabla^2\psi(z_t))^{-1}} = \sum_{i=1}^d z_{t,i}\, g_{t,i}^2 \le \sum_{i=1}^d x_{t,i}\, g_{t,i}^2.$$
Overall, we have the following improved regret guarantee for the Learning with Experts setting with non-negative losses.
Theorem 3 Assume $g_{t,i} \ge 0$ for $t = 1, \dots, T$ and $i = 1, \dots, d$. Let $V = \{x \in \mathbb{R}^d : x_i \ge 0, \|x\|_1 = 1\}$ and $\eta > 0$. Using OMD with the entropic regularizer $\psi : \mathbb{R}^d_{>0} \to \mathbb{R}$ defined as $\psi(x) = \sum_{i=1}^d x_i \ln x_i$, learning rate $\eta$, and $x_1 = [1/d, \dots, 1/d]$ gives the following regret guarantee

$$\sum_{t=1}^T \langle g_t, x_t - u\rangle \le \frac{\ln d}{\eta} + \frac{\eta}{2} \sum_{t=1}^T \sum_{i=1}^d x_{t,i}\, g_{t,i}^2, \qquad \forall u \in V.$$
Armed with this new tool, we can now turn to the multi-armed bandit problem again.
Let’s now consider OMD with the entropic regularizer, learning rate $\eta$, and set $g_t = \tilde{\ell}_t$ equal to the stochastic estimate of $\ell_t$, as in Algorithm 1, that is $\tilde{\ell}_{t,i} = \frac{\ell_{t,i}}{x_{t,i}}\, \mathbf{1}[A_t = i]$, where $A_t \sim x_t$ is the arm pulled at round $t$. Applying Theorem 3 and taking expectations, we have

$$\mathbb{E}[\mathrm{Regret}_T] \le \frac{\ln d}{\eta} + \frac{\eta}{2} \sum_{t=1}^T \mathbb{E}\left[\sum_{i=1}^d x_{t,i}\, \tilde{\ell}_{t,i}^2\right].$$
Now, focusing on the terms $\mathbb{E}\left[\sum_{i=1}^d x_{t,i}\, \tilde{\ell}_{t,i}^2\right]$, we have

$$\mathbb{E}\left[\sum_{i=1}^d x_{t,i}\, \tilde{\ell}_{t,i}^2\right] = \mathbb{E}\left[\sum_{i=1}^d x_{t,i}\, \frac{\ell_{t,i}^2\, \mathbf{1}[A_t = i]}{x_{t,i}^2}\right] = \mathbb{E}\left[\sum_{i=1}^d \ell_{t,i}^2\right] \le d. \qquad (2)$$

So, setting $\eta = \sqrt{\frac{2 \ln d}{d T}}$, we have

$$\mathbb{E}[\mathrm{Regret}_T] \le \sqrt{2\, d\, T \ln d}.$$
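The following is a minimal sketch in Python of Exp3 seen as OMD with the entropic regularizer over the importance-weighted loss estimates, matching the tuning above; the function `loss_of_arm`, the interface, and the variable names are mine, not part of the original Algorithm 1.

```python
import numpy as np

def exp3(loss_of_arm, d, T, rng=None):
    """Exp3 as OMD with the entropic regularizer on importance-weighted losses.

    loss_of_arm(t, i) must return the loss of arm i at round t, in [0, 1];
    only the loss of the pulled arm is queried, as in the bandit setting.
    """
    rng = rng or np.random.default_rng()
    eta = np.sqrt(2 * np.log(d) / (d * T))  # tuning from the bound above
    x = np.full(d, 1.0 / d)                 # x_1 = uniform distribution
    total_loss = 0.0
    for t in range(T):
        arm = rng.choice(d, p=x)            # sample A_t ~ x_t
        loss = loss_of_arm(t, arm)          # observe only the pulled arm's loss
        total_loss += loss
        ell_hat = np.zeros(d)
        ell_hat[arm] = loss / x[arm]        # importance-weighted estimate
        x = x * np.exp(-eta * ell_hat)      # mirror step with the entropic regularizer
        x /= x.sum()                        # Bregman projection = normalization
    return total_loss
```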
Remark 2 The need for a different analysis for OMD is due to the fact that we want an easy way to upper bound the Hessian. Indeed, in this analysis $\tilde{x}_{t+1}$ comes before the normalization into a probability distribution, which simplifies the analysis a lot. The same idea will be used for the Tsallis entropy in the next section.
So, with a tighter analysis we showed that, even without an explicit exploration term, OMD with the entropic regularizer solves the multi-armed bandit problem paying only a factor $\sqrt{d}$ more than the full information case. However, this is still not the optimal regret!
In the next section, we will see that changing the regularizer, with the same analysis, will remove the $\ln d$ term in the regret.
1. Optimal Regret Using OMD with Tsallis Entropy
In this section, we present the Implicitly Normalized Forecaster (INF), also known as OMD with Tsallis entropy, for the multi-armed bandit problem.
Define $\psi_q : \mathbb{R}_{\ge 0}^d \to \mathbb{R}$ as $\psi_q(x) = \frac{1 - \sum_{i=1}^d x_i^q}{1-q}$, where $q \in (0,1)$ and in $q = 1$ we extend the function by continuity. This is the negative Tsallis entropy of the vector $x$. It is a strict generalization of the Shannon entropy, because when $q$ goes to 1, $\psi_q(x)$ converges to the negative (Shannon) entropy of $x$.
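As a quick numerical illustration of this limit (mine, not from the notes), one can check that $\psi_q(x)$ approaches $\sum_{i} x_i \ln x_i$ as $q \to 1$:

```python
import numpy as np

x = np.array([0.5, 0.3, 0.2])
neg_shannon = np.sum(x * np.log(x))          # negative Shannon entropy
for q in [0.9, 0.99, 0.999]:
    psi_q = (1 - np.sum(x ** q)) / (1 - q)   # negative Tsallis entropy
    print(q, psi_q, neg_shannon)             # psi_q approaches neg_shannon as q -> 1
```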
We will instantiate OMD with this regularizer for the multi-armed bandit problem, as in Algorithm 2.
Note that $\psi_q$ is non-positive on the simplex and that its minimum over the simplex is attained at the uniform distribution $[1/d, \dots, 1/d]$.
We will not use any interpretation of this regularizer from an information-theoretic point of view. As we will see in the following, the only reason to choose it is its Hessian. In fact, the Hessian of this regularizer is still diagonal and it is equal to

$$\nabla^2 \psi_q(x) = \mathrm{diag}\left(q\, x_1^{q-2}, \dots, q\, x_d^{q-2}\right).$$
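For completeness, the expression above follows by differentiating coordinate-wise:

$$\frac{\partial \psi_q}{\partial x_i}(x) = -\frac{q}{1-q}\, x_i^{q-1}, \qquad \frac{\partial^2 \psi_q}{\partial x_i^2}(x) = -\frac{q(q-1)}{1-q}\, x_i^{q-2} = q\, x_i^{q-2},$$

while all the mixed second derivatives are zero.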
Now, we can again use the modified analysis for OMD in Lemma 2. So, for any $u \in V$, we obtain

$$\sum_{t=1}^T \langle g_t, x_t - u\rangle \le \frac{B_{\psi_q}(u, x_1)}{\eta} + \frac{\eta}{2q} \sum_{t=1}^T \sum_{i=1}^d z_{t,i}^{2-q}\, g_{t,i}^2,$$

where $x_1 = [1/d, \dots, 1/d]$ and $z_t$ is a point on the line segment between $x_t$ and $\tilde{x}_{t+1}$.
As we did for Exp3, we now need an upper bound on the $z_{t,i}$. From the update rule and the definition of $\psi_q$, we have

$$-\frac{q}{1-q}\, \tilde{x}_{t+1,i}^{\,q-1} = -\frac{q}{1-q}\, x_{t,i}^{\,q-1} - \eta\, g_{t,i},$$

that is

$$\tilde{x}_{t+1,i} = \left(x_{t,i}^{\,q-1} + \frac{(1-q)\,\eta}{q}\, g_{t,i}\right)^{\frac{1}{q-1}}.$$

So, if $g_{t,i} \ge 0$, $\tilde{x}_{t+1,i} \le x_{t,i}$, that implies that $z_{t,i} \le x_{t,i}$.
Hence, putting it all together, we have

$$\sum_{t=1}^T \langle g_t, x_t - u\rangle \le \frac{B_{\psi_q}(u, x_1)}{\eta} + \frac{\eta}{2q} \sum_{t=1}^T \sum_{i=1}^d x_{t,i}^{2-q}\, g_{t,i}^2.$$
We can now specialize the above reasoning, considering $q = 1/2$ in the Tsallis entropy, to obtain the following theorem.
Theorem 4 Assume $\ell_{t,i} \in [0,1]$ for all $t$ and $i$. Set $q = 1/2$ and $\eta = \sqrt{2/T}$. Then, Algorithm 2 guarantees

$$\mathbb{E}[\mathrm{Regret}_T] \le 2\sqrt{2\, d\, T}.$$
Proof: We only need to calculate the terms $B_{\psi_q}(u, x_1)$ and $\mathbb{E}\left[\sum_{i=1}^d x_{t,i}^{2-q}\, \tilde{\ell}_{t,i}^2\right]$. For the first one, with $x_1 = [1/d, \dots, 1/d]$ we have

$$B_{\psi_q}(u, x_1) = \frac{d^{1-q} - \sum_{i=1}^d u_i^q}{1-q} \le \frac{d^{1-q} - 1}{1-q},$$

that for $q = 1/2$ gives $B_{\psi_{1/2}}(u, x_1) \le 2\sqrt{d}$. Proceeding as in (2), we obtain

$$\mathbb{E}\left[\sum_{i=1}^d x_{t,i}^{3/2}\, \tilde{\ell}_{t,i}^2\right] = \mathbb{E}\left[\sum_{i=1}^d \sqrt{x_{t,i}}\, \ell_{t,i}^2\right] \le \mathbb{E}\left[\sum_{i=1}^d \sqrt{x_{t,i}}\right] \le \sqrt{d},$$

where the last inequality is the Cauchy-Schwarz inequality.
Choosing $\eta = \sqrt{2/T}$, we finally obtain an expected regret of $2\sqrt{2\, d\, T}$, a rate that can be proved to be the optimal one.
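In detail, with the two bounds computed in the proof, the tuning is the usual balancing of the two terms:

$$\min_{\eta > 0}\ \left(\frac{2\sqrt{d}}{\eta} + \eta\, T \sqrt{d}\right) = 2\sqrt{2\, d\, T}, \qquad \text{attained at } \eta = \sqrt{\frac{2}{T}}.$$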
There is one last thing: how do we compute the predictions of this algorithm? In each step, we have to solve a constrained optimization problem. So, we can write the corresponding Lagrangian:

$$L(x, \beta) = \eta\, \langle g_t, x\rangle + B_{\psi_q}(x, x_t) + \beta \left(\sum_{i=1}^d x_i - 1\right).$$

From the KKT conditions, we have

$$x_{t+1,i} = \left(x_{t,i}^{\,q-1} + \frac{1-q}{q} \left(\eta\, g_{t,i} + \beta\right)\right)^{\frac{1}{q-1}},$$

and we also know that $\sum_{i=1}^d x_{t+1,i} = 1$. So, we have a 1-dimensional problem in $\beta$ that must be solved in each round.
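A minimal sketch in Python of this computation, assuming $q = 1/2$ and $g_t = \tilde{\ell}_t$, using bisection on the normalization constraint (the function and variable names are mine):

```python
import numpy as np

def tsallis_omd_update(x_t, g_t, eta, tol=1e-10):
    """OMD update with the Tsallis entropy regularizer for q = 1/2.

    Finds the Lagrange multiplier beta such that
        x_{t+1, i} = (x_{t, i}^{-1/2} + eta * g_{t, i} + beta)^{-2}
    sums to one, by bisection (the sum is decreasing in beta).
    """
    base = x_t ** (-0.5) + eta * g_t

    def total(beta):
        return np.sum((base + beta) ** (-2.0))

    lo = -np.min(base) + 1e-12   # just above the smallest admissible beta: total(lo) is huge
    hi = 1.0
    while total(hi) > 1.0:       # grow hi until the sum drops below one
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if total(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    x_next = (base + hi) ** (-2.0)
    return x_next / x_next.sum()  # tiny renormalization for numerical safety
```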
2. History Bits
The INF algorithm was proposed by (Audibert, J.-Y. and Bubeck, S., 2009) and recast as an OMD procedure in (Audibert, J.-Y. and Bubeck, S. and Lugosi, G., 2011). The connection with the Tsallis entropy was made in (Abernethy, J. D. and Lee, C. and Tewari, A., 2015). The specific proof presented here is new and builds on the proof by (Abernethy, J. D. and Lee, C. and Tewari, A., 2015). Note that (Abernethy, J. D. and Lee, C. and Tewari, A., 2015) proved the same regret bound for a Follow-The-Regularized-Leader procedure over the stochastic estimates of the losses (that they call Gradient-Based Prediction Algorithm), while here we proved it using an OMD procedure.
3. Exercises
Exercise 1 Prove that in the modified proof of OMD, the terms $B_\psi(x_t, \tilde{x}_{t+1})$ can be upper bounded by $\eta\, \langle g_t, x_t - \tilde{x}_{t+1}\rangle$.
Exercise 2 Building on the previous exercise, prove that regret bounds of the same order can be obtained for Exp3 and for the INF/OMD with Tsallis entropy by directly upper bounding the terms $\langle g_t, x_t - \tilde{x}_{t+1}\rangle$, without passing through the Bregman divergences.