This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week.
You can find the lectures I published till now here.
In the last lecture, we have shown a very simple and parameter-free algorithm for Online Convex Optimization (OCO) in -dimensions, based on a reduction to a coin-betting problem. Now, we will see how to reduce Learning with Expert Advice (LEA) to betting on coins, obtaining again parameter-free and optimal algorithms.
1. Reduction to Learning with Experts
First, remember that the regret we got from Online Mirror Descent (OMD), and similarly for Follow-The-Regularized-Leader (FTRL), is
where is the prior distribution on the experts and is the KL-divergence. As we reasoned in the OCO case, in order to set the learning rate we should know the value of . If we could set to , we would obtain a regret of . However, given the adversarial nature of the game, this is impossible. So, as we did in the OCO case, we will show that even this problem can be reduced to betting on a coin, obtaining optimal guarantees with a parameter-free algorithm.
First, let’s introduce some notation. Let be the number of experts and be the -dimensional probability simplex. Let be any prior distribution. Let be a coin-betting algorithm. We will instantiate copies of .
Consider any round . Let be the bet of the -th copy of . The LEA algorithm computes as
Then, the LEA algorithm predicts as
Then, the algorithm receives the reward vector . Finally, it feeds the reward to each copy of . The reward for the -th copy of is defined as
The construction above defines a LEA algorithm defined by the predictions , based on the algorithm . We can prove the following regret bound for it.
Theorem 1 (Regret Bound for Experts) Let be a coin-betting algorithm that guarantees a wealth after rounds with initial money equal to 1 of for any sequence of continuous coin outcomes . Then, the regret of the LEA algorithm with prior that predicts at each round with in (2) satisfies
for any concave and non-decreasing such that .
Proof: We first prove that . Indeed,
The first equality follows from definition of . To see the second equality, consider two cases: If for all then and therefore both and are trivially zero. If then for all .
From the assumption on , we have for any sequence such that that
So, inequality and (4) imply
Now, for any competitor ,
Now, we could think to use the Krichevsky–Trofimov (KT) bettor with this theorem. However, we would obtain a sub-optimal regret guarantee. In fact, remembering the lower bound on the wealth of KT and setting where is a universal constant, we have
We might think that the is the price we have to pay to adapt to the unknown competitor . However, it turns out it can be removed. In the next section, we see how to change the KT strategy to obtain the optimal guarantee.
2. A Betting Strategy that Looses at Most a Constant Fraction of Money
In the reduction before, if we use the KT betting strategy we would have a term under the square root. It turns out that we can avoid that term if we know the number of rounds beforehand. Then, in case is unknown we can just use a doubling trick, paying only a constant multiplicative factor in the regret.
The logarithmic term in the regret comes from the fact that the lower bound on the wealth is
Note that in the case in which the number of heads in the sequence is equal to the number of heads, so that , the guaranteed wealth becomes proportional to . So, for that goes to infinity the bettor will lose all of its money.
Instead, we need a more conservative strategy that guarantees
for small enough and independent of . In this case, the betting strategy has to pace its betting, possibly with the knowledge of the duration of the game, so that even in the case that the number of heads is equal to the number of tails it will only lose a fraction of its money. At the same time, it will still gain an exponential amount of money when the coin outcomes are biased towards one side.
We will prove that this is possible, designing a new betting strategy.
then, by induction, . In fact, we have
Hence, we have to prove that (8) is true in order to guarantee a minimum wealth of our betting strategy.
First, given that is a concave function of , we have
Also, our choice of makes the two quantities above equal with , that is
For other choices of , the two alternatives would be different and the minimum one could always be the one picked by the adversary. Instead, making the two choices worst outcomes equivalent, we minimize the damage of the adversarial choice of the outcomes of the coin. So, we have that
where in the second equality we used the definition of and in the second inequality we used the fact that .
Hence, given that (8) is true, this strategy guarantees
We can now use this betting strategy in the expert reduction in Theorem 1, setting , to have
Note that this betting strategy could also be used in the OCO reduction. Given that we removed the logarithmic term in the exponent, in the 1-dimensional case, we would obtain a regret of
where we gained in the term inside the logarithmic, instead of the term of the KT algorithm. This implies that now we can set to and obtain an asymptotic rate of rather than .
3. History Bits
The first parameter-free algorithm for experts is from (Chaudhuri, K. and Freund, Y. and Hsu, D. J., 2009), named NormalHedge, where they obtained a bound similar to the one in (9) but with an additional term. Then, (Chernov, A. and Vovk, V., 2010) removed the log factors with an update without a closed form. (Orabona, F. and Pal, D., 2016) showed that this guarantee can be efficiently obtained through the novel reduction to coin-betting in Theorem 1. Later, these kind of regret guarantees were improved to depend on the sum of the squared losses rather than on time, but with an additional factor, in the Squint algorithm (Koolen, W. M. and van Erven, T., 2015). It is worth noting that the Squint algorithm can be interpreted exactly as a coin-betting algorithm plus the reduction in Theorem 1.
The betting strategy in (6) and (7) are new, and derived from the shifted-KT potentials in (Orabona, F. and Pal, D., 2016). The guarantee is the same obtained by the shifted-KT potentials, but the analysis can be done without knowing the properties of the gamma function.
Exercise 1 Using the same proof technique in the lecture, find a betting strategy whose wealth depends on rather than on .
Nice, post! The amount of work to remove just a ln T factor is impressive!
I have few questions: since you mentioned Squint, can you elaborate more on how it can be seen as a coin betting algorithm? I find Squint really appealing, since the sum of square losses should always be no worse than T, but it can actually be much better, for example it implies constant regret in a stochastic setting (see for example thm. 11 in http://proceedings.mlr.press/v35/gaillard14.pdf). Is it possible to give such bounds, or first-order bounds using a KT strategy?
Also, there should be some applications where the knowledge of the time horizon is not available in advance. For example, in the construction used to get strongly-adaptive regret bounds (as in your paper http://proceedings.mlr.press/v54/jun17a/jun17a.pdf), I guess in this case it would be better to use Squint?
Sorry, regarding the strongly-adaptive construction I suppose the real problem is not the knowledge of T but the fact that you shouldn’t be able to use a doubling trick.
The main difficulty in dealing with Squint is the “improper prior” they use. The rest is easy: by induction you can show that any potential of the form used by Squint gives a lower bound to the wealth. I will try to find the time to write a post on it: it is easier to explain with equations.
If you don’t care about improper prior, you would get additional log factors. Note that Squint gets a log log term that is probably unavoidable.
In alternative, if you just care about the constant regret in the stochastic setting, you can get it easily with a strategy that depends on sum |c_i| rather than sum c_i^2. This is much easier to do and it was done in OCO with Pistol (https://arxiv.org/abs/1406.3816) and Learning with Experts with AdaNormalHedge (http://proceedings.mlr.press/v40/Luo15.pdf). Note that again it can be proven that these are betting algorithms. The only difference is that they bet with the lower bound to the wealth times the betting fraction rather than using the actual wealth times the betting fraction. Note that the proof of AdaNormalHedge is complex only because they truncate negative predictions inside the potential rather than outside as in equation (3). If you remove the truncation, the proof becomes easy.
Trying to adapt directly KT to obtain this kind of guarantees is not a good idea: you quickly loose the closed form expression. Indeed, KT is nothing else than FTRL with certain losses and a certain regularizer. You can use the same FTRL strategy if the coin is “continuous”, but you don’t have a closed form anymore.
Cool, thanks! You are probably already aware of it but there’s also a “dual” view of KT for experts from an exponential weights point of view in http://proceedings.mlr.press/v75/hoeven18a/hoeven18a.pdf
LikeLiked by 1 person
Nice post! Your work on parameter-free online learning is rather impressive! I’m curious if it is possible to develop parameter-free version of accelerated gradient descent as well?
LikeLiked by 1 person
Hi Yichi, thanks for your question!
The key point is that accelerated algorithms makes sense for smooth functions, while parameter-free algorithms make sense for non-smooth ones. In fact, with a non-smooth function the (sub)-gradient does not give any information on how far is the optimum and the setting of the learning rate becomes critical. Instead, in the smooth world the gradients are very informative and knowledge of the smoothness constant allows to easily optimally set the learning rate. Hence, we should first define what we mean by “parameter-free” for smooth optimization. Also, it is relatively easy to estimate smoothness. Indeed, the accelerated algorithm FISTA (https://www.math.mcgill.ca/yyang/comp/papers/FISTA.pdf) has a line-search procedure that essentially estimates the smoothness parameter. It is still non-trivial to estimate the smoothness constant in the stochastic setting, but definitely doable.
A different but related result you might be interested is that parameter-free algorithms allow you to adapt to the strong convexity (that is harder to estimate than smoothness), see section 6 in https://arxiv.org/pdf/1802.06293.pdf