This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week. You can find all the lectures I have published so far here.
In the last lecture, we showed a very simple and parameter-free algorithm for Online Convex Optimization (OCO) in 1 dimension, based on a reduction to a coin-betting problem. Now, we will see how to reduce Learning with Expert Advice (LEA) to betting on coins, obtaining again parameter-free and optimal algorithms.
1. Reduction to Learning with Experts
First, remember that the regret we got from Online Mirror Descent (OMD) with the entropic regularizer, and similarly for Follow-The-Regularized-Leader (FTRL), is

$$\text{Regret}_T(\boldsymbol{u}) \leq \frac{\text{KL}(\boldsymbol{u};\boldsymbol{\pi})}{\eta} + \frac{\eta T}{2}, \quad (1)$$

where $\boldsymbol{\pi}$ is the prior distribution on the experts and $\text{KL}(\cdot;\cdot)$ is the KL-divergence. As we reasoned in the OCO case, in order to set the learning rate $\eta$ optimally we should know the value of $\text{KL}(\boldsymbol{u};\boldsymbol{\pi})$. If we could set $\eta$ to $\sqrt{\frac{2\,\text{KL}(\boldsymbol{u};\boldsymbol{\pi})}{T}}$, we would obtain a regret of $\sqrt{2\,T\,\text{KL}(\boldsymbol{u};\boldsymbol{\pi})}$. However, given the adversarial nature of the game, this is impossible. So, as we did in the OCO case, we will show that even this problem can be reduced to betting on a coin, obtaining optimal guarantees with a parameter-free algorithm.
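The claimed tuning can be checked with one line of calculus. Assuming the bound has the standard form $\frac{\text{KL}(\boldsymbol{u};\boldsymbol{\pi})}{\eta} + \frac{\eta T}{2}$ (losses bounded by 1), balancing the two terms gives:

```latex
\frac{d}{d\eta}\left(\frac{\mathrm{KL}(\boldsymbol{u};\boldsymbol{\pi})}{\eta} + \frac{\eta T}{2}\right) = 0
\quad\Longrightarrow\quad
\eta^\star = \sqrt{\frac{2\,\mathrm{KL}(\boldsymbol{u};\boldsymbol{\pi})}{T}}
\quad\Longrightarrow\quad
\frac{\mathrm{KL}(\boldsymbol{u};\boldsymbol{\pi})}{\eta^\star} + \frac{\eta^\star T}{2}
= \sqrt{2\,T\,\mathrm{KL}(\boldsymbol{u};\boldsymbol{\pi})}.
```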
First, let’s introduce some notation. Let $d$ be the number of experts and $\Delta^{d-1} = \{\boldsymbol{x} \in \mathbb{R}^d : x_i \geq 0, \ \sum_{i=1}^d x_i = 1\}$ be the $(d-1)$-dimensional probability simplex. Let $\boldsymbol{\pi} = [\pi_1, \dots, \pi_d] \in \Delta^{d-1}$ be any prior distribution. Let $\mathcal{A}$ be a coin-betting algorithm. We will instantiate $d$ copies of $\mathcal{A}$, one per expert.
Consider any round $t$. Denote by $\text{Wealth}_{t-1,i}$ the money of the $i$-th copy of $\mathcal{A}$ at the beginning of the round, and let $x_{t,i} = \beta_{t,i}\,\text{Wealth}_{t-1,i}$ be the (signed) bet of the $i$-th copy of $\mathcal{A}$, where $\beta_{t,i} \in [-1,1]$ is its betting fraction. The LEA algorithm computes $\hat{\boldsymbol{p}}_t$ as

$$\hat{p}_{t,i} = \pi_i\,[x_{t,i}]_+, \quad i = 1, \dots, d,$$

where $[z]_+ = \max(z, 0)$. Then, the LEA algorithm predicts as

$$\boldsymbol{p}_t = \begin{cases} \frac{\hat{\boldsymbol{p}}_t}{\|\hat{\boldsymbol{p}}_t\|_1}, & \text{if } \|\hat{\boldsymbol{p}}_t\|_1 > 0, \\ \boldsymbol{\pi}, & \text{otherwise}. \end{cases} \quad (2)$$

Then, the algorithm receives the reward vector $\boldsymbol{g}_t \in [0,1]^d$. Finally, it feeds the reward to each copy of $\mathcal{A}$ in the form of a coin outcome. The reward for the $i$-th copy of $\mathcal{A}$ is $c_{t,i}$, defined as

$$c_{t,i} = \begin{cases} g_{t,i} - \langle \boldsymbol{g}_t, \boldsymbol{p}_t \rangle, & \text{if } x_{t,i} > 0, \\ \left[ g_{t,i} - \langle \boldsymbol{g}_t, \boldsymbol{p}_t \rangle \right]_+, & \text{if } x_{t,i} \leq 0. \end{cases} \quad (3)$$

The construction above defines a LEA algorithm, defined by the predictions $\boldsymbol{p}_t$, based on the coin-betting algorithm $\mathcal{A}$.
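To make the construction concrete, here is a minimal Python sketch of the reduction (mine, not from the original notes), instantiating the $d$ copies with the KT bettor from the last lecture; the function and variable names are invented for illustration:

```python
import numpy as np

def kt_bet(wealth, sum_c, t):
    # KT betting fraction: beta_t = (sum of past coin outcomes) / t;
    # the signed bet is x_t = beta_t * current wealth.
    return wealth * sum_c / t

def lea_coin_betting(rewards, prior):
    """Coin-betting reduction for LEA.

    rewards: array of shape (T, d) with entries in [0, 1].
    prior:   array of shape (d,) in the probability simplex.
    Returns the (T, d) array of predictions, one distribution per row.
    """
    T, d = rewards.shape
    wealth = np.ones(d)       # wealth of each of the d coin-betting copies
    sum_c = np.zeros(d)       # running sum of the coin outcomes fed to each copy
    preds = np.zeros((T, d))
    for t in range(T):
        x = kt_bet(wealth, sum_c, t + 1)       # signed bets of the d copies
        p_hat = prior * np.maximum(x, 0.0)     # truncate negative bets, weight by prior
        p = p_hat / p_hat.sum() if p_hat.sum() > 0 else prior.copy()
        preds[t] = p
        g = rewards[t]
        diff = g - g @ p
        # coin fed back to copy i: truncated to be non-negative when its bet was <= 0
        c = np.where(x > 0, diff, np.maximum(diff, 0.0))
        wealth += c * x
        sum_c += c
    return preds
```

Running it on a sequence where one expert always gets reward 1 and the others 0, the predictions concentrate on the best expert after a few rounds.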
. We can prove the following regret bound for it.
Theorem 1 (Regret Bound for Experts) Let $\mathcal{A}$ be a coin-betting algorithm that guarantees a wealth after $T$ rounds with initial money equal to 1 of $\exp\left(f\left(\sum_{t=1}^T c_t\right)\right)$ for any sequence of continuous coin outcomes $c_1, \dots, c_T \in [-1,1]$. Then, the regret of the LEA algorithm with prior $\boldsymbol{\pi} \in \Delta^{d-1}$ that predicts at each round with $\boldsymbol{p}_t$ in (2) satisfies

$$\text{Regret}_T(\boldsymbol{u}) := \sum_{t=1}^T \langle \boldsymbol{g}_t, \boldsymbol{u} - \boldsymbol{p}_t \rangle \leq f^{-1}\left(\text{KL}(\boldsymbol{u};\boldsymbol{\pi})\right), \quad \forall \boldsymbol{u} \in \Delta^{d-1},$$

for any invertible $f$ such that $f^{-1}$ is concave and non-decreasing.
Proof: We first prove that $\sum_{i=1}^d \pi_i\, c_{t,i}\, x_{t,i} \leq 0$. Indeed,

$$\sum_{i=1}^d \pi_i\, c_{t,i}\, x_{t,i} \leq \sum_{i=1}^d \pi_i \left(g_{t,i} - \langle \boldsymbol{g}_t, \boldsymbol{p}_t \rangle\right) [x_{t,i}]_+ = \langle \boldsymbol{g}_t, \hat{\boldsymbol{p}}_t \rangle - \langle \boldsymbol{g}_t, \boldsymbol{p}_t \rangle\, \|\hat{\boldsymbol{p}}_t\|_1 = 0.$$

The inequality holds because, by (3), $c_{t,i}\, x_{t,i} = \left(g_{t,i} - \langle \boldsymbol{g}_t, \boldsymbol{p}_t \rangle\right) [x_{t,i}]_+$ when $x_{t,i} > 0$, while $c_{t,i}\, x_{t,i} \leq 0$ when $x_{t,i} \leq 0$. The first equality follows from the definition of $\hat{\boldsymbol{p}}_t$. To see the second equality, consider two cases: If $x_{t,i} \leq 0$ for all $i$, then $\hat{\boldsymbol{p}}_t = \boldsymbol{0}$ and therefore both $\langle \boldsymbol{g}_t, \hat{\boldsymbol{p}}_t \rangle$ and $\langle \boldsymbol{g}_t, \boldsymbol{p}_t \rangle\, \|\hat{\boldsymbol{p}}_t\|_1$ are trivially zero. If $\|\hat{\boldsymbol{p}}_t\|_1 > 0$, then $p_{t,i} = \frac{\hat{p}_{t,i}}{\|\hat{\boldsymbol{p}}_t\|_1}$ for all $i$, so that $\langle \boldsymbol{g}_t, \boldsymbol{p}_t \rangle\, \|\hat{\boldsymbol{p}}_t\|_1 = \langle \boldsymbol{g}_t, \hat{\boldsymbol{p}}_t \rangle$.
From the assumption on $\mathcal{A}$, we have, for any sequence $c_{1,i}, \dots, c_{T,i} \in [-1,1]$, that

$$\text{Wealth}_{T,i} = 1 + \sum_{t=1}^T c_{t,i}\, x_{t,i} \geq \exp\left(f\left(\sum_{t=1}^T c_{t,i}\right)\right). \quad (4)$$

So, the inequality $\sum_{i=1}^d \pi_i\, c_{t,i}\, x_{t,i} \leq 0$ and (4) imply

$$\sum_{i=1}^d \pi_i\, \text{Wealth}_{T,i} = \sum_{i=1}^d \pi_i \left(1 + \sum_{t=1}^T c_{t,i}\, x_{t,i}\right) = 1 + \sum_{t=1}^T \sum_{i=1}^d \pi_i\, c_{t,i}\, x_{t,i} \leq 1. \quad (5)$$

Now, for any competitor $\boldsymbol{u} \in \Delta^{d-1}$,

$$\begin{aligned} \text{Regret}_T(\boldsymbol{u}) &= \sum_{i=1}^d u_i \sum_{t=1}^T \left(g_{t,i} - \langle \boldsymbol{g}_t, \boldsymbol{p}_t \rangle\right) \leq \sum_{i=1}^d u_i \sum_{t=1}^T c_{t,i} \leq \sum_{i=1}^d u_i\, f^{-1}\left(\ln \text{Wealth}_{T,i}\right) \\ &\leq f^{-1}\left(\sum_{i=1}^d u_i \ln \text{Wealth}_{T,i}\right) = f^{-1}\left(\sum_{i=1}^d u_i \ln \frac{\pi_i\, \text{Wealth}_{T,i}}{u_i} + \text{KL}(\boldsymbol{u};\boldsymbol{\pi})\right) \\ &\leq f^{-1}\left(\ln\left(\sum_{i=1}^d \pi_i\, \text{Wealth}_{T,i}\right) + \text{KL}(\boldsymbol{u};\boldsymbol{\pi})\right) \leq f^{-1}\left(\text{KL}(\boldsymbol{u};\boldsymbol{\pi})\right), \end{aligned}$$

where the first inequality uses the definition (3) of $c_{t,i}$, the second uses (4) and the fact that $f^{-1}$ is non-decreasing, the third uses Jensen’s inequality and the concavity of $f^{-1}$, the fourth uses Jensen’s inequality on the concave function $\ln$, and the last one uses (5). $\Box$
Now, we could think to use the Krichevsky–Trofimov (KT) bettor with this theorem. However, we would obtain a sub-optimal regret guarantee. In fact, remembering the lower bound on the wealth of the KT bettor and setting $f(x) = \frac{x^2}{4T} - \ln\left(c\sqrt{T}\right)$, where $c$ is a universal constant, we have

$$\text{Regret}_T(\boldsymbol{u}) \leq 2\sqrt{T\left(\text{KL}(\boldsymbol{u};\boldsymbol{\pi}) + \ln\left(c\sqrt{T}\right)\right)} = O\left(\sqrt{T\left(\text{KL}(\boldsymbol{u};\boldsymbol{\pi}) + \ln T\right)}\right), \quad \forall \boldsymbol{u} \in \Delta^{d-1}.$$

We might think that the $\ln T$ term is the price we have to pay to adapt to the unknown competitor $\boldsymbol{u}$. However, it turns out it can be removed. In the next section, we will see how to change the KT strategy to obtain the optimal guarantee.
2. A Betting Strategy that Loses at Most a Constant Fraction of Money
In the reduction above, if we use the KT betting strategy we would have a $\ln T$ term under the square root. It turns out that we can avoid that term if we know the number of rounds $T$ beforehand. Then, in case $T$ is unknown, we can just use a doubling trick, paying only a constant multiplicative factor in the regret.
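The doubling trick mentioned here can be sketched generically as follows (the wrapper and its interface are hypothetical, only meant to illustrate the restarting scheme): the fixed-horizon algorithm is restarted on epochs of doubling length, and since the per-epoch regrets of order $\sqrt{\text{epoch length}}$ sum geometrically, the total regret is only a constant factor worse than with a known horizon.

```python
def doubling_trick(make_algo, rounds):
    """Run a fixed-horizon algorithm without knowing T in advance: restart it
    on epochs of length 1, 2, 4, 8, ... Each epoch's algorithm is tuned for
    that epoch's length only."""
    t = 0
    epoch_len = 1
    preds = []
    while t < rounds:
        algo = make_algo(horizon=epoch_len)   # fresh instance tuned for this epoch
        for _ in range(min(epoch_len, rounds - t)):
            preds.append(algo.predict())
            # ... feed the loss/reward for round t to `algo` here ...
            t += 1
        epoch_len *= 2
    return preds
```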
The logarithmic term in the regret comes from the fact that the lower bound on the wealth of the KT bettor is of the form

$$\text{Wealth}_T \geq \frac{1}{c\sqrt{T}} \exp\left(\frac{\left(\sum_{t=1}^T c_t\right)^2}{4T}\right),$$

where $c$ is a universal constant. Note that in the case in which the number of heads in the sequence is equal to the number of tails, so that $\sum_{t=1}^T c_t = 0$, the guaranteed wealth becomes proportional to $\frac{1}{\sqrt{T}}$. So, for $T$ that goes to infinity, the bettor can lose all of its money.
Instead, we need a more conservative strategy that guarantees

$$\text{Wealth}_T \geq \alpha \exp\left(\frac{\left(\sum_{t=1}^T c_t\right)^2}{4T}\right)$$

for some $\alpha > 0$ small enough and independent of $T$. In this case, the betting strategy has to pace its betting, possibly with the knowledge of the duration of the game, so that even in the case that the number of heads is equal to the number of tails it will only lose a constant fraction of its money. At the same time, it will still gain an exponential amount of money when the coin outcomes are biased towards one side.
We will prove that this is possible, designing a new betting strategy. Denote by $x_t = \sum_{s=1}^t c_s$ the sum of the coin outcomes and define the potential

$$F_t(x) = \exp\left(\frac{x^2}{2(t+T)} - \sum_{s=1}^{t} \frac{1}{2(s+T)}\right), \quad (6)$$

and the betting fraction

$$\beta_t = \frac{F_t(x_{t-1}+1) - F_t(x_{t-1}-1)}{F_t(x_{t-1}+1) + F_t(x_{t-1}-1)}. \quad (7)$$

If we can prove that

$$(1 + \beta_t\, c)\, F_{t-1}(x_{t-1}) \geq F_t(x_{t-1} + c), \quad \forall c \in [-1,1], \quad (8)$$

then, by induction, $\text{Wealth}_t \geq F_t(x_t)$ for all $t$, because $\text{Wealth}_0 = 1 = F_0(0)$. In fact, we have

$$\text{Wealth}_t = (1 + \beta_t\, c_t)\, \text{Wealth}_{t-1} \geq (1 + \beta_t\, c_t)\, F_{t-1}(x_{t-1}) \geq F_t(x_{t-1} + c_t) = F_t(x_t).$$

Hence, we have to prove that (8) is true in order to guarantee a minimum wealth of our betting strategy.
First, given that $(1 + \beta_t\, c)\, F_{t-1}(x_{t-1}) - F_t(x_{t-1} + c)$ is a concave function of $c$, its minimum over $c \in [-1,1]$ is attained at $c \in \{-1, 1\}$, so it is enough to verify (8) at these two points. Also, our choice of $\beta_t$ in (7) makes the two quantities above equal with $c \in \{-1, 1\}$, that is

$$\frac{F_t(x_{t-1}+1)}{1 + \beta_t} = \frac{F_t(x_{t-1}-1)}{1 - \beta_t} = \frac{F_t(x_{t-1}+1) + F_t(x_{t-1}-1)}{2}.$$

For other choices of $\beta_t$, the two alternatives would be different and the minimum one could always be the one picked by the adversary. Instead, making the two choices’ worst outcomes equivalent, we minimize the damage of the adversarial choice of the outcomes of the coin. Hence, (8) at $c \in \{-1, 1\}$ reduces to showing that $\frac{F_t(x+1) + F_t(x-1)}{2} \leq F_{t-1}(x)$ for any $x$. So, we have that

$$\frac{F_t(x+1) + F_t(x-1)}{2} = \exp\left(\frac{x^2+1}{2(t+T)} - \sum_{s=1}^{t} \frac{1}{2(s+T)}\right) \cosh\left(\frac{x}{t+T}\right) \leq \exp\left(\frac{x^2}{2(t+T)} + \frac{x^2}{2(t+T)^2} - \sum_{s=1}^{t-1} \frac{1}{2(s+T)}\right) \leq F_{t-1}(x),$$

where in the first equality we used the definition of $F_t$, in the first inequality we used the fact that $\cosh(y) \leq \exp\left(\frac{y^2}{2}\right)$, and in the second inequality the elementary inequality $\frac{1}{t+T} + \frac{1}{(t+T)^2} \leq \frac{1}{t+T-1}$.
Hence, given that (8) is true, this strategy guarantees

$$\text{Wealth}_T \geq F_T\left(\sum_{t=1}^T c_t\right) = \exp\left(\frac{\left(\sum_{t=1}^T c_t\right)^2}{4T} - \sum_{s=1}^{T} \frac{1}{2(s+T)}\right) \geq \frac{1}{\sqrt{2}} \exp\left(\frac{\left(\sum_{t=1}^T c_t\right)^2}{4T}\right),$$

where we used $\sum_{s=1}^{T} \frac{1}{2(s+T)} \leq \frac{1}{2} \ln \frac{2T}{T} = \frac{\ln 2}{2}$. We can now use this betting strategy in the expert reduction in Theorem 1, setting $f(x) = \frac{x^2}{4T} - \frac{\ln 2}{2}$, to have

$$\text{Regret}_T(\boldsymbol{u}) \leq 2\sqrt{T\left(\text{KL}(\boldsymbol{u};\boldsymbol{\pi}) + \frac{\ln 2}{2}\right)}, \quad \forall \boldsymbol{u} \in \Delta^{d-1}. \quad (9)$$
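As a numerical sanity check, here is a small Python simulation (my own sketch, not from the original notes) of a conservative, potential-based bettor of the kind described in this section: it knows the horizon $T$, uses a shifted quadratic-exponential potential, bets the fraction that balances the two outcomes $c_t = \pm 1$, and compares the final wealth against a lower bound of the form $\alpha \exp\big((\sum_t c_t)^2/(4T)\big)$ with $\alpha = 1/\sqrt{2}$. The exact potential below is an assumption for illustration:

```python
import math

def conservative_bettor_wealth(coins):
    """Run the pacing betting strategy for a known horizon T = len(coins).
    Assumed potential: F_t(x) = exp(x^2 / (2(t+T)) - sum_{s<=t} 1/(2(s+T)))."""
    T = len(coins)

    def F(t, x):
        shift = sum(1.0 / (2 * (s + T)) for s in range(1, t + 1))
        return math.exp(x * x / (2 * (t + T)) - shift)

    wealth, x = 1.0, 0.0
    for t, c in enumerate(coins, start=1):
        up, down = F(t, x + 1), F(t, x - 1)
        beta = (up - down) / (up + down)  # balances the two outcomes c = +1 and c = -1
        wealth *= 1 + beta * c            # bet the fraction beta of the current wealth
        x += c                            # running sum of the coin outcomes
    return wealth, x
```

On random $\pm 1$ coins the final wealth stays above the constant-fraction bound, while on a biased coin it grows exponentially.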
Note that this betting strategy could also be used in the OCO reduction. Given that we removed the logarithmic term in the exponent of the wealth, in the 1-dimensional case we would obtain a regret of

$$\text{Regret}_T(u) \leq \epsilon + O\left(|u|\sqrt{T \ln\left(1 + \frac{|u|\sqrt{T}}{\epsilon}\right)}\right), \quad \forall u \in \mathbb{R},$$

where we gained a $\sqrt{T}$ in the term inside the logarithm, instead of the $T$ term of the KT algorithm. This implies that now we can set $\epsilon$ to $\sqrt{T}$ and obtain an asymptotic rate of $O(\sqrt{T})$ rather than $O(\sqrt{T \ln T})$.
3. History Bits
The first parameter-free algorithm for experts is from (Chaudhuri, K. and Freund, Y. and Hsu, D. J., 2009), named NormalHedge, where they obtained a bound similar to the one in (9), but with an additional logarithmic term. Then, (Chernov, A. and Vovk, V., 2010) removed the log factors, with an update that lacks a closed form. (Orabona, F. and Pal, D., 2016) showed that this guarantee can be efficiently obtained through the novel reduction to coin-betting in Theorem 1. Later, this kind of regret guarantee was improved to depend on the sum of the squared losses rather than on time, but with an additional $\ln \ln$ factor, in the Squint algorithm (Koolen, W. M. and van Erven, T., 2015). It is worth noting that the Squint algorithm can be interpreted exactly as a coin-betting algorithm plus the reduction in Theorem 1.
The betting strategy in (6) and (7) is new, and derived from the shifted-KT potentials in (Orabona, F. and Pal, D., 2016). The guarantee is the same as the one obtained by the shifted-KT potentials, but the analysis can be done without knowing the properties of the gamma function.
4. Exercises
Exercise 1 Using the same proof technique as in the lecture, find a betting strategy whose wealth guarantee depends on $\sum_{t=1}^T |c_t|$ rather than on $T$.
Nice post! The amount of work needed to remove just a ln T factor is impressive!
I have a few questions: since you mentioned Squint, can you elaborate more on how it can be seen as a coin-betting algorithm? I find Squint really appealing, since the sum of squared losses is always no worse than T, but it can actually be much better; for example, it implies constant regret in a stochastic setting (see, for example, Thm. 11 in http://proceedings.mlr.press/v35/gaillard14.pdf). Is it possible to get such bounds, or first-order bounds, using a KT strategy?
Also, there should be some applications where the knowledge of the time horizon is not available in advance. For example, in the construction used to get strongly-adaptive regret bounds (as in your paper http://proceedings.mlr.press/v54/jun17a/jun17a.pdf), I guess in this case it would be better to use Squint?
Thank you!
Sorry, regarding the strongly-adaptive construction I suppose the real problem is not the knowledge of T but the fact that you shouldn’t be able to use a doubling trick.
Nicolò
The main difficulty in dealing with Squint is the “improper prior” they use. The rest is easy: by induction, you can show that any potential of the form used by Squint gives a lower bound on the wealth. I will try to find the time to write a post on it: it is easier to explain with equations.
If you don’t care about the improper prior, you get additional log factors. Note that Squint has a log log term that is probably unavoidable.
Alternatively, if you just care about the constant regret in the stochastic setting, you can get it easily with a strategy that depends on sum |c_i| rather than sum c_i^2. This is much easier to do, and it was done in OCO with PiSTOL (https://arxiv.org/abs/1406.3816) and in Learning with Experts with AdaNormalHedge (http://proceedings.mlr.press/v40/Luo15.pdf). Note that, again, it can be proven that these are betting algorithms. The only difference is that they bet with the lower bound on the wealth times the betting fraction, rather than using the actual wealth times the betting fraction. Note that the proof of AdaNormalHedge is complex only because they truncate negative predictions inside the potential rather than outside, as in equation (3). If you remove the truncation, the proof becomes easy.
Trying to adapt KT directly to obtain this kind of guarantee is not a good idea: you quickly lose the closed-form expression. Indeed, KT is nothing other than FTRL with certain losses and a certain regularizer. You can use the same FTRL strategy if the coin is “continuous”, but you don’t have a closed form anymore.
Cool, thanks! You are probably already aware of it but there’s also a “dual” view of KT for experts from an exponential weights point of view in http://proceedings.mlr.press/v75/hoeven18a/hoeven18a.pdf