Disclaimer: This post assumes that you understand how FTRL works. If not, please take a look here.
In this post, I explain a variation of the EG/Hedge algorithm, called AdaHedge. The basic idea is to design an algorithm that is adaptive to the sum of the squared $\ell_\infty$ norms of the losses, without any prior information on the range of the losses.
First, consider the case in which we use as constant regularizer the negative entropy $\psi(\boldsymbol{x}) = \lambda \left(\ln d + \sum_{i=1}^d x_i \ln x_i\right)$, shifted so that it is non-negative on the simplex, where $\lambda > 0$ will be determined in the following and the feasible set is $\Delta^{d-1} := \{\boldsymbol{x} \in \mathbb{R}^d : x_i \ge 0, \sum_{i=1}^d x_i = 1\}$, the simplex in $\mathbb{R}^d$. Using FTRL with linear losses with this regularizer, we immediately obtain

$$\text{Regret}_T(\boldsymbol{u}) \le \lambda \ln d + \sum_{t=1}^T \langle \boldsymbol{g}_t, \boldsymbol{x}_t - \boldsymbol{x}_{t+1}\rangle,$$
where we upper bounded the negative entropy of $\boldsymbol{u}$ with 0. Using the strong convexity of the regularizer w.r.t. the $\ell_1$ norm and Lemma 4 here, we would further upper bound this as

$$\text{Regret}_T(\boldsymbol{u}) \le \lambda \ln d + \frac{1}{\lambda} \sum_{t=1}^T \|\boldsymbol{g}_t\|_\infty^2.$$
This suggests that the optimal $\lambda$ should be $\lambda = \sqrt{\sum_{t=1}^T \|\boldsymbol{g}_t\|_\infty^2 / \ln d}$. However, as we have seen for L* bounds, such a choice of a parameter of the algorithm is never feasible, because it would require knowing the future. Hence, exactly as we did for L* bounds, we might think of using an online version of this choice:

$$\lambda_t = \frac{1}{\alpha}\sqrt{\sum_{i=1}^{t-1} \|\boldsymbol{g}_i\|_\infty^2}, \qquad (1)$$

where $\alpha > 0$ is a constant that will be determined later. An important property of such a choice is that it gives rise to an algorithm that is scale-free, that is, its predictions $\boldsymbol{x}_t$ are invariant to the scaling of the losses by any constant factor. This is easy to see because

$$\boldsymbol{x}_t = \mathop{\text{argmin}}_{\boldsymbol{x} \in \Delta^{d-1}} \ \left\langle \sum_{i=1}^{t-1} \boldsymbol{g}_i, \boldsymbol{x}\right\rangle + \lambda_t \psi(\boldsymbol{x}),$$

and multiplying all the losses by $c > 0$ multiplies both $\sum_{i=1}^{t-1} \boldsymbol{g}_i$ and $\lambda_t$ by $c$, leaving the minimizer unchanged.
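To make the scale-freeness concrete, here is a quick numerical check (my own sketch, not from the original post). It uses the closed-form solution of the entropic FTRL minimization, $x_{t,j} \propto \exp(-\theta_j/\lambda_t)$ with $\boldsymbol{\theta} = \sum_{i=1}^{t-1} \boldsymbol{g}_i$, and verifies that multiplying all the losses by a constant leaves the prediction unchanged:

```python
# A minimal numerical check of scale-freeness (my sketch, not from the post).
# The FTRL prediction with the entropic regularizer has the closed form
# x_t = softmax(-theta / lambda_t), where theta is the sum of the past losses.
import numpy as np

def prediction(past_losses, alpha=2.0):
    """FTRL prediction with the entropic regularizer and lambda_t as in (1)."""
    lam = np.sqrt(np.sum(np.max(np.abs(past_losses), axis=1) ** 2)) / alpha
    theta = past_losses.sum(axis=0)
    z = np.exp(-(theta - theta.min()) / lam)  # shift by the min for numerical stability
    return z / z.sum()

rng = np.random.default_rng(0)
G = rng.normal(size=(50, 5))                              # arbitrary losses g_1, ..., g_50
print(np.allclose(prediction(G), prediction(10.0 * G)))   # True: scaling does not matter
```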
Note that this choice makes the regularizer non-decreasing over time and immediately gives us

$$\text{Regret}_T(\boldsymbol{u}) \le \lambda_{T+1} \ln d + \sum_{t=1}^T \frac{\|\boldsymbol{g}_t\|_\infty^2}{\lambda_t}.$$

At this point, we might be tempted to use Lemma 1 from the L* post to upper bound the sum in the upper bound, but unfortunately we cannot! Indeed, the denominator does not contain the term $\|\boldsymbol{g}_t\|_\infty^2$, because $\lambda_t$ only sums the squared norms up to round $t-1$. We might add a constant to $\lambda_t$, but that would destroy the scale-freeness of the algorithm. However, it turns out that we can still prove our bound without any change to the regularizer. The key observation is that we can bound the term $\langle \boldsymbol{g}_t, \boldsymbol{x}_t - \boldsymbol{x}_{t+1}\rangle$ in two different ways. The first way is the one above, while the other one is

$$\langle \boldsymbol{g}_t, \boldsymbol{x}_t - \boldsymbol{x}_{t+1}\rangle \le \|\boldsymbol{g}_t\|_\infty \|\boldsymbol{x}_t - \boldsymbol{x}_{t+1}\|_1 \le 2\|\boldsymbol{g}_t\|_\infty,$$

where we used Hölder's inequality, the definition of $\Delta^{d-1}$, whose $\ell_1$ diameter is 2, and the fact that the regularizer is non-decreasing over time. So, we can now write

$$\text{Regret}_T(\boldsymbol{u}) \le \lambda_{T+1} \ln d + \sum_{t=1}^T \min\left(\frac{\|\boldsymbol{g}_t\|_\infty^2}{\lambda_t}, 2\|\boldsymbol{g}_t\|_\infty\right) \le \lambda_{T+1} \ln d + \sum_{t=1}^T \frac{4\|\boldsymbol{g}_t\|_\infty^2}{2\lambda_t + \|\boldsymbol{g}_t\|_\infty},$$

where we used the fact that the minimum between two numbers is less than their harmonic mean. Assuming $\alpha \ge 2$, we have $2\lambda_t + \|\boldsymbol{g}_t\|_\infty \ge \frac{2}{\alpha}\big(\sqrt{\sum_{i=1}^{t-1}\|\boldsymbol{g}_i\|_\infty^2} + \|\boldsymbol{g}_t\|_\infty\big) \ge \frac{2}{\alpha}\sqrt{\sum_{i=1}^{t}\|\boldsymbol{g}_i\|_\infty^2}$, so using Lemma 1 here, we have

$$\sum_{t=1}^T \frac{4\|\boldsymbol{g}_t\|_\infty^2}{2\lambda_t + \|\boldsymbol{g}_t\|_\infty} \le 2\alpha \sum_{t=1}^T \frac{\|\boldsymbol{g}_t\|_\infty^2}{\sqrt{\sum_{i=1}^{t}\|\boldsymbol{g}_i\|_\infty^2}} \le 4\alpha \sqrt{\sum_{t=1}^T \|\boldsymbol{g}_t\|_\infty^2}.$$
The bound and the assumption on $\alpha$ suggest to set

$$\alpha = \max\left(2, \frac{\sqrt{\ln d}}{2}\right),$$

which yields

$$\text{Regret}_T(\boldsymbol{u}) \le \left(\frac{\ln d}{\alpha} + 4\alpha\right)\sqrt{\sum_{t=1}^T \|\boldsymbol{g}_t\|_\infty^2} = O\left(\sqrt{\ln d \sum_{t=1}^T \|\boldsymbol{g}_t\|_\infty^2}\right). \qquad (2)$$

To summarize, we obtained a scale-free algorithm with regret bound $O\big(\sqrt{\ln d \sum_{t=1}^T \|\boldsymbol{g}_t\|_\infty^2}\big)$.
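Putting the pieces together, a minimal sketch of the resulting algorithm could look as follows (my own implementation of the update in (1) with the $\alpha$ above; the function name and interface are mine, and the losses are assumed to be given as a $(T, d)$ array):

```python
# A sketch of scale-free FTRL on the simplex with the entropic regularizer:
# lambda_t is the online choice in (1) and alpha = max(2, sqrt(ln d)/2) as in (2).
import numpy as np

def scale_free_ftrl(losses):
    """Run the algorithm on a (T, d) array of losses; return the regret."""
    T, d = losses.shape
    alpha = max(2.0, np.sqrt(np.log(d)) / 2)
    theta = np.zeros(d)  # sum of the past losses
    sq_sum = 0.0         # sum of the past squared infinity norms
    total = 0.0
    for t in range(T):
        lam = np.sqrt(sq_sum) / alpha
        if lam > 0:
            z = np.exp(-(theta - theta.min()) / lam)
            x = z / z.sum()
        else:
            x = np.full(d, 1.0 / d)  # lambda_t = 0: play the uniform prediction
        total += losses[t] @ x
        theta += losses[t]
        sq_sum += np.max(np.abs(losses[t])) ** 2
    return total - theta.min()  # regret w.r.t. the best fixed expert
```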
We might consider ourselves happy, but there is a clear problem in the above algorithm: the choice of $\lambda_t$ in the time-varying regularizer strictly depends on our upper bound. So, a loose bound will result in a poor choice of the regularization! In general, every time we use a part of the proof in the design of an algorithm, we cannot expect an exciting empirical performance, unless our upper bound was really tight. So, can we design a better regularizer? Well, we need a better upper bound!
Let’s consider a generic non-negative regularizer $\psi$, used in the time-varying form $\psi_t(\boldsymbol{x}) = \lambda_t \psi(\boldsymbol{x})$, and its corresponding FTRL with linear losses regret upper bound

$$\text{Regret}_T(\boldsymbol{u}) \le \lambda_{T+1} \psi(\boldsymbol{u}) + \sum_{t=1}^T \left(\langle \boldsymbol{g}_t, \boldsymbol{x}_t\rangle + F_t(\boldsymbol{x}_t) - F_{t+1}(\boldsymbol{x}_{t+1})\right),$$

where $F_t(\boldsymbol{x}) := \lambda_t \psi(\boldsymbol{x}) + \sum_{i=1}^{t-1} \langle \boldsymbol{g}_i, \boldsymbol{x}\rangle$, $\boldsymbol{x}_t = \mathop{\text{argmin}}_{\boldsymbol{x} \in V} F_t(\boldsymbol{x})$, and we assume $\lambda_t$ to be non-decreasing in time.
Now, observe that the sum is unlikely to disappear for this kind of algorithm, so we could try to make the term $\lambda_{T+1}\psi(\boldsymbol{u})$ of the same order of the sum. So, we would like to set $\lambda_{t+1}$ of the same order of $\sum_{i=1}^t \left(\langle \boldsymbol{g}_i, \boldsymbol{x}_i\rangle + F_i(\boldsymbol{x}_i) - F_{i+1}(\boldsymbol{x}_{i+1})\right)$. However, this approach would cause an annoying recurrence, because $F_{t+1}$ depends on $\lambda_{t+1}$ itself. So, using the fact that $\lambda_t$ is non-decreasing, let’s upper bound the terms in the sum just a little bit:

$$\langle \boldsymbol{g}_t, \boldsymbol{x}_t\rangle + F_t(\boldsymbol{x}_t) - F_{t+1}(\boldsymbol{x}_{t+1}) \le \langle \boldsymbol{g}_t, \boldsymbol{x}_t\rangle + F_t(\boldsymbol{x}_t) - \min_{\boldsymbol{x} \in V}\left(\lambda_t \psi(\boldsymbol{x}) + \sum_{i=1}^t \langle \boldsymbol{g}_i, \boldsymbol{x}\rangle\right) =: \delta_t,$$

where the inequality holds because $\lambda_{t+1} \ge \lambda_t$ and $\psi \ge 0$.
Now, we can set, for a parameter $\alpha > 0$, $\lambda_1 = 0$, and

$$\lambda_{t+1} = \lambda_t + \frac{\delta_t}{\alpha}.$$

Note that $\delta_t \ge 0$, because the minimum in its definition is at most the value of the objective in $\boldsymbol{x}_t$, so the $\lambda_t$ are indeed non-decreasing. This immediately implies that

$$\text{Regret}_T(\boldsymbol{u}) \le \lambda_{T+1}\psi(\boldsymbol{u}) + \sum_{t=1}^T \delta_t = \lambda_{T+1}\psi(\boldsymbol{u}) + \alpha\,\lambda_{T+1}.$$
Setting $\psi(\boldsymbol{x}) = \ln d + \sum_{i=1}^d x_i \ln x_i$, that is the negative entropy shifted to be non-negative on the simplex, we get an algorithm known as AdaHedge. It is easy to see that this choice makes the algorithm scale-free as well.
With this choice of the regularizer, we can simplify a bit the expression of $\delta_t$. Denote by $\boldsymbol{L}_t := \sum_{i=1}^t \boldsymbol{g}_i$ the vector of the cumulative losses. For $\lambda_t = 0$, we have

$$\delta_t = \langle \boldsymbol{g}_t, \boldsymbol{x}_t\rangle + \min_{j} L_{t-1,j} - \min_{j} L_{t,j}.$$

Instead, for $\lambda_t > 0$, using the properties of the Fenchel conjugates, we have that

$$\delta_t = \langle \boldsymbol{g}_t, \boldsymbol{x}_t\rangle + \lambda_t \ln\left(\sum_{j=1}^d x_{t,j} \exp\left(-\frac{g_{t,j}}{\lambda_t}\right)\right), \quad \text{where} \quad x_{t,j} = \frac{\exp(-L_{t-1,j}/\lambda_t)}{\sum_{k=1}^d \exp(-L_{t-1,k}/\lambda_t)}.$$
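For completeness, the Fenchel-conjugate fact used in the last step is the standard closed form of the entropic minimization over the simplex (the constant $\lambda_t \ln d$ coming from the shift of $\psi$ appears in both terms of $\delta_t$ and cancels): for any $\boldsymbol{\theta} \in \mathbb{R}^d$ and $\lambda > 0$,

```latex
% Closed form of the entropic minimization over the simplex:
\min_{\boldsymbol{x} \in \Delta^{d-1}}\left(\langle \boldsymbol{\theta}, \boldsymbol{x}\rangle
    + \lambda \sum_{i=1}^d x_i \ln x_i\right)
  = -\lambda \ln\left(\sum_{j=1}^d \exp\left(-\frac{\theta_j}{\lambda}\right)\right),
\qquad
x^\star_j = \frac{\exp(-\theta_j/\lambda)}{\sum_{k=1}^d \exp(-\theta_k/\lambda)}.
% Applying it once with theta = L_{t-1} (giving F_t(x_t)) and once with
% theta = L_t (giving the min in the definition of delta_t), and subtracting,
% the two log-sum-exp terms combine into the expression of delta_t above.
```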
Overall, we get the pseudo-code of AdaHedge in Algorithm 1.
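In code, one round of AdaHedge only requires a softmax and a log-sum-exp. Here is a sketch in Python (my own rendering of the updates above, not the authors' reference implementation; the choice $\alpha = \ln d$ used as default is the one suggested by the analysis below):

```python
# A sketch of AdaHedge (my implementation of the updates above):
# lambda_1 = 0 and lambda_{t+1} = lambda_t + delta_t / alpha.
import numpy as np

def adahedge(losses, alpha=None):
    """Run AdaHedge on a (T, d) array of losses; return the regret."""
    T, d = losses.shape
    if alpha is None:
        alpha = np.log(d)  # the choice suggested by the analysis below
    L = np.zeros(d)        # cumulative loss vector L_t
    lam = 0.0              # lambda_1 = 0
    total = 0.0
    for t in range(T):
        if lam > 0:
            z = np.exp(-(L - L.min()) / lam)  # shift by the min for stability
            x = z / z.sum()
        else:
            mask = L == L.min()               # lambda_t = 0: minimize <L_{t-1}, x>
            x = mask / mask.sum()
        g = losses[t]
        if lam > 0:
            # delta_t = <g_t, x_t> + lambda_t ln(sum_j x_{t,j} exp(-g_{t,j}/lambda_t))
            m = g.min()
            delta = g @ x - m + lam * np.log(x @ np.exp(-(g - m) / lam))
        else:
            # delta_t = <g_t, x_t> + min_j L_{t-1,j} - min_j L_{t,j}
            delta = g @ x + L.min() - (L + g).min()
        lam += delta / alpha                  # lambda_{t+1} = lambda_t + delta_t/alpha
        total += g @ x
        L += g
    return total - L.min()  # regret w.r.t. the best fixed expert
```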
So, now we need an upper bound for $\lambda_{T+1}$. Observe that, unrolling the update, $\lambda_{T+1} = \frac{1}{\alpha}\sum_{t=1}^T \delta_t$. Moreover, as we have done before, we can upper bound $\delta_t$ in two different ways. In fact, from Lemma 4 here, we have

$$\delta_t \le \frac{\|\boldsymbol{g}_t\|_\infty^2}{\lambda_t}$$

for $\lambda_t > 0$. Also, denoting by $\tilde{\boldsymbol{x}}_{t+1} := \mathop{\text{argmin}}_{\boldsymbol{x} \in \Delta^{d-1}} \left(\lambda_t \psi(\boldsymbol{x}) + \sum_{i=1}^t \langle \boldsymbol{g}_i, \boldsymbol{x}\rangle\right)$ the minimizer in the definition of $\delta_t$, we have

$$\delta_t \le \langle \boldsymbol{g}_t, \boldsymbol{x}_t - \tilde{\boldsymbol{x}}_{t+1}\rangle \le \|\boldsymbol{g}_t\|_\infty \|\boldsymbol{x}_t - \tilde{\boldsymbol{x}}_{t+1}\|_1 \le 2\|\boldsymbol{g}_t\|_\infty.$$

Hence, we have

$$\lambda_{t+1} = \lambda_t + \frac{\delta_t}{\alpha} \le \lambda_t + \frac{1}{\alpha}\min\left(\frac{\|\boldsymbol{g}_t\|_\infty^2}{\lambda_t}, 2\|\boldsymbol{g}_t\|_\infty\right) = \lambda_t + \min\left(\frac{\|\boldsymbol{g}_t\|_\infty^2/\alpha}{\lambda_t}, \frac{2\|\boldsymbol{g}_t\|_\infty}{\alpha}\right).$$
We can solve this recurrence using the following Lemma, where $a_t = \frac{\|\boldsymbol{g}_t\|_\infty^2}{\alpha}$ and $b_t = \frac{2\|\boldsymbol{g}_t\|_\infty}{\alpha}$.
Lemma 1. Let $\{a_t\}_{t=1}^T$ and $\{b_t\}_{t=1}^T$ be any sequences of non-negative real numbers. Suppose that $\{\lambda_t\}_{t=1}^{T+1}$ is a sequence of non-negative real numbers satisfying

$$\lambda_1 = 0 \quad \text{and} \quad \lambda_{t+1} \le \lambda_t + \min\left(\frac{a_t}{\lambda_t}, b_t\right), \quad t = 1, \dots, T,$$

where we interpret $a_t/\lambda_t$ as $+\infty$ when $\lambda_t = 0$. Then, for any $T \ge 1$,

$$\lambda_{T+1} \le \sqrt{\sum_{t=1}^T \left(2 a_t + b_t^2\right)}.$$
Proof: Observe that

$$\lambda_{T+1}^2 = \sum_{t=1}^T \left(\lambda_{t+1}^2 - \lambda_t^2\right) = \sum_{t=1}^T (\lambda_{t+1} - \lambda_t)\left(2\lambda_t + \lambda_{t+1} - \lambda_t\right).$$

We bound each term in the sum separately. The left term of the minimum in the definition of $\lambda_{t+1}$ gives $2\lambda_t(\lambda_{t+1} - \lambda_t) \le 2\lambda_t \frac{a_t}{\lambda_t} = 2a_t$, while the right term gives $(\lambda_{t+1} - \lambda_t)^2 \le b_t^2$. So, we conclude that $\lambda_{T+1}^2 \le \sum_{t=1}^T (2 a_t + b_t^2)$. $\square$
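As a sanity check on the Lemma (mine, not from the post), we can simulate the recurrence with the worst case allowed by the assumption, that is, with equality, on random inputs and compare with the bound:

```python
# A quick numerical sanity check of Lemma 1: simulate the recurrence
# lambda_{t+1} = lambda_t + min(a_t/lambda_t, b_t) with random non-negative
# a_t, b_t and compare against sqrt(sum_t (2 a_t + b_t^2)).
import numpy as np

rng = np.random.default_rng(1)
for trial in range(1000):
    T = rng.integers(1, 100)
    a, b = rng.random(T), rng.random(T)
    lam = 0.0
    for t in range(T):
        lam += b[t] if lam == 0 else min(a[t] / lam, b[t])
    assert lam <= np.sqrt(np.sum(2 * a + b ** 2)) + 1e-12
print("Lemma 1 held on all random trials")
```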
So, overall we got

$$\text{Regret}_T(\boldsymbol{u}) \le \lambda_{T+1}\left(\psi(\boldsymbol{u}) + \alpha\right) \le (\ln d + \alpha)\sqrt{\left(\frac{2}{\alpha} + \frac{4}{\alpha^2}\right)\sum_{t=1}^T \|\boldsymbol{g}_t\|_\infty^2},$$

where we used $\psi(\boldsymbol{u}) \le \ln d$ and Lemma 1, and setting $\alpha = \ln d$, we have

$$\text{Regret}_T(\boldsymbol{u}) \le 2\sqrt{(2 \ln d + 4)\sum_{t=1}^T \|\boldsymbol{g}_t\|_\infty^2}.$$
Note that this is roughly the same regret as in (2), but the very important difference is that this new regret bound depends on the much tighter quantity $\sum_{t=1}^T \delta_t$, that we upper bounded with $O\big(\sqrt{\sum_{t=1}^T \|\boldsymbol{g}_t\|_\infty^2}\big)$, but in general will be much smaller than that. For example, $\delta_t$ can be upper bounded using the tighter local norms, see the analysis of Exp3. Instead, in the first solution, the regret will always be dominated by the term $\sqrt{\sum_{t=1}^T \|\boldsymbol{g}_t\|_\infty^2}$, because we explicitly use it in the regularizer!
There is an important lesson to be learned from AdaHedge: the regret is not the full story and algorithms with the same worst-case guarantee can exhibit vastly different empirical behaviors. Unfortunately, this message is rarely heard and there is a part of the community that focuses too much on the worst-case guarantee rather than on the empirical performance. Even worse, sometimes people favor algorithms with a “more elegant analysis” completely ignoring the likely worse empirical performance.
1. History Bits
The use of FTRL with the regularizer in (1) was proposed in (Orabona and Pál, 2015); I presented a simpler version of their proof that does not require Fenchel conjugates. The AdaHedge algorithm was introduced in (van Erven et al., 2011) and refined in (de Rooij et al., 2014). The analysis reported here is from (Orabona and Pál, 2015), which generalized AdaHedge to arbitrary regularizers in an algorithm called AdaFTRL. Additional properties of AdaHedge for the stochastic case were proven in (van Erven et al., 2011).
2. Exercises
Exercise 1. Implement AdaHedge and compare its empirical performance to FTRL with the time-varying regularizer in (1).
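If you want a starting point, here is a tiny driver (my own; it assumes the `scale_free_ftrl` and `adahedge` sketches above are in scope):

```python
# A possible starting point for Exercise 1: compare regrets on random losses.
import numpy as np

rng = np.random.default_rng(42)
for d in (2, 10, 100):
    losses = rng.random((1000, d))  # losses in [0, 1], T = 1000 rounds
    print(f"d={d:4d}  FTRL(1): {scale_free_ftrl(losses):8.2f}  "
          f"AdaHedge: {adahedge(losses):8.2f}")
```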
Hi, again your blog is making my life better. I think that it should be '\alpha = \max(4, ...)' in (2), not min, and the inequality in the assumption should be \geq, not strict. And it's going to be 4 anyway unless d is in the billions. Aaaand if we assume \alpha \leq 4, 4 ends up being optimal, so perhaps just set alpha to 4?
Typo in the middle step of the third equation bounding lambda ("Hence, we have ..."): it should be '\lambda_{t+1} = \lambda_t + \alpha\delta_t' instead of '\lambda_{t+1} = \lambda_t + \frac{\delta_t}{\alpha^2}'.
Missing brackets in the first sum of the proof of Lemma 1 (for clarity).
Happy to know you like it!
For the setting of alpha: if we set alpha = 4, then the bound would depend asymptotically on d as log(d) instead of sqrt(log(d)). As you observe, alpha = 4 would be the optimal setting for any reasonable d, but theoreticians really like to have the optimal dependency in their asymptotics 🙂
Also, fixed all the bugs you found, thanks again!