This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week.
You can find the lectures I published till now here.
Today, we will see a couple of practical implementations of online mirror descent, with two different Bregman functions and we will introduce the setting of Learning with Expert Advice.
1. Exponentiated Gradient
Let and defined as the negative entropy , where we define . Also, we set the feasibility set . So, in words, we want to output discrete probability distributions over .
The Fenchel conjugate is defined as
The solution is , see Appendix.
We also have and .
Putting all together, we have the online mirror descent update rule for entropic distance generating function.
The algorithm is summarized in Algorithm 1. This algorithm is called Exponentiated Gradient (EG) because in the update rule we take the component-wise exponent of the gradient vector.
Let’s take a look at the regret bound we get. First, as we said, observe that
that is the KL divergence between the 1-dimensional discrete distributions and . Now, the following Lemma tells us about the strong convexity of .
Lemma 1 (S. Shalev-Shwartz, 2007, Lemma 16) is 1-strongly convex with respect to the norm over the set .
Another thing to decide is the initial point . We can set to be the minimizer of in . In this way simplifies to . Hence, we set . So, we have .
Putting all together, we have
Assuming , we can set , to obtain that a regret of .
Remark 1 Note that the time-varying version of OMD with entropic distance generating function would give rise to a vacuous bound, can you see why? We will see how FTRL overcomes this issue using a time-varying regularizer rather than a time-varying learning rate.
How would Online Subgradient Descent (OSD) work on the same problem? First, it is important to realize that nothing prevents us to use OSD on this problem. We just have to implement the euclidean projection onto the simplex. The regret bound we would get from OSD is
where we set and for any . Assuming , we have that in the worst case . Hence, we can set , to obtain that a regret of . Hence, in a worst case sense, using an entropic distance generating function transforms a dependency on the dimension from to for Online Convex Optimization (OCO) over the probability simplex.
So, as we already saw analyzing AdaGrad, the shape of the domain is the important ingredient when we change from euclidean norms to other norms.
2. -norm Algorithms
Consider the distance generating function , for over . Let’s remind the reader that the -norm of a vector is defined as . We already proved that , where , so that . Let’s calculate the dual maps: and . Hence, we can write the update rule as
where we broke the update in two steps to simplify the notation (and the implementation). Starting from , we have that
The last ingredient is the fact that is strongly convex with respect to .
Lemma 2 (S. Shalev-Shwartz, 2007, Lemma 17) is -strongly convex with respect to , for .
Hence, the regret bound will be
Setting , we get the (unprojected) Online Subgradient Descent. However, we can set to achieve a logarithmic dependency in the dimension as in EG. Let’s assume again that , so we have
Also, note that , so we have an upper bound to the regret of
Setting , we get an upper bound to the regret of
Assuming , the choice of that minimizes the last term is that makes the term . Hence, we have regret bound of the order of .
So, the -norm allows to interpolate from the behaviour of OSD to the one of EG. Note that here the set is the entire space, however we could still set . While this would allow us to get the same asymptotic bound of EG, the update would not be in a closed form anymore.
3. Learning with Expert Advice
Let’s introduce a particular Online Convex Optimization game called Learning with Expert Advice.
In this setting, we have experts that gives us some advice on each round. In turn, in each round we have to decide which expert we want to follow. After we made our choice, the losses associated to each expert are revealed and we pay the loss associated to the expert we picked. The aim of the game is to minimize the losses we make compared to cumulative losses of the best expert. This is a general setting that allows to model many interesting cases. For example, we have a number of different online learning algorithms and we would like to close to the best among them.
Is this problem solvable? If we put ourselves in the adversarial setting, unfortunately it cannot be solved! Indeed, even with 2 experts, the adversary can force on us linear regret. Let’s see how. In each round we have to pick expert 1 or expert 2. In each round, the adversary can decide that the expert we pick has loss 1 and the other one has loss 0. This means that the cumulative loss of the algorithm over rounds is . On the other hand, the best cumulative loss over expert 1 and 2 is less than . This means that our regret, no matter what we do, can be as big as .
The problem above is due to the fact that the adversary has too much power. One way to reduce its power is using randomization. We can allow the algorithm to be randomized and force the adversary to decide the losses at time without knowing the outcome of the randomization of the algorithm at time (but it can depend on the past randomization). This is enough to make the problem solvable. Moreover, it will also make the problem convex, allowing us to use any OCO algorithm on it.
First, let’s write the problem in the original formulation. We set a discrete feasible set , where is the vector will all zeros but a in the coordinate . Our predictions and the competitor are from . The losses are linear losses: , where we assume , for and . The regret is
The only thing that makes this problem non-convex is the feasibility set, that is clearly a non-convex one.
Let’s now see how the randomization makes this problem convex. Let’s extend the feasible set to . Note that . For this problem we can use an OCO algorithm to minimize the regret
Can we find a way to transform an upper bound to this regret to the one we care in (1)? One way is the following one: On each time step, construct a random variable that is equal to with probability for . Then, select the expert according to the outcome of . Now, in expectation we have
This means that we can minimize in expectation the non-convex regret with a randomized OCO algorithm. We can summarize this reasoning in Algorithm 2.
For example, if we use EG as the OCO algorithm, setting , then we obtain the following update rule
and the regret will be
It is worth stressing the importance of the result just obtained: We can design an algorithm that in expectation is close to the best expert in a set, paying only a logarithmic penalty in the size of the set.
In the future, we will see algorithms that achieve even the better regret guarantee of , for any in the simplex. You should be able to convince yourself that no setting of in EG allows to achieve such regret guarantee. Indeed, the algorithm will be based on a very different strategy.
4. History Bits
The EG algorithm was introduced by (Kivinen, J. and Warmuth, M., 1997), but not as a specific instantiation of OMD. I am actually not sure when people first pointed out the link between EG and OMD, do you know something about this? If yes, please let me know!
The trick to set is from (Gentile, C. and Littlestone, N., 1999) (online learning) and apparently rediscovered in (Ben-Tal, A. and Margalit, T. and Nemirovski, A., 2001) (optimization). The learning with expert setting was introduced in (Littlestone, N. and Warmuth, M. K., 1994) and (Vovk, V. G, 1990). The ideas in Algorithm 3 are based on the Multiplicative Weights algorithm (Littlestone, N. and Warmuth, M. K., 1994) and the Hedge algorithm (Freund, Y. and Schapire, R. E., 1997). By now, the literature on learning with expert is huge, with tons of variations over algorithms and settings.
Exercise 1 Derive the EG update rule and regret bound in the case that the algorithm starts from an arbitrary vector in the probability simplex.
Here we find the conjugate of the negative entropy.
It is a constrained optimization problem, we can solve it using the KKT conditions. We first rewrite it as a minimization problem
The KKT conditions are, for ,
From the first one we get . Using the third one we have . Then, from the complementarity condition, fourth one, we get . Putting all together, we have .
Denoting with , and substituting in the definition of the conjugate function we get