This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019.
You can find all the lectures I published here.
Last time, we saw that we can use Follow-The-Regularized-Leader (FTRL) on linearized losses, that is, predicting at each round with

$$\boldsymbol{x}_{t+1} = \mathop{\mathrm{argmin}}_{\boldsymbol{x} \in V} \ \psi_{t+1}(\boldsymbol{x}) + \sum_{i=1}^t \langle \boldsymbol{g}_i, \boldsymbol{x} \rangle,$$

where $\boldsymbol{g}_i \in \partial \ell_i(\boldsymbol{x}_i)$.
Today, we will show a number of applications of FTRL with linearized losses, some easy ones and some more advanced ones.
We also said that we are free to choose $\psi_{T+1}$, so we will often set it to $\psi_T$.
Remark 1 As we said last time, the algorithm is invariant to any positive constant added to the regularizer, hence we can always state the regret guarantee with $\psi_{T+1}(\boldsymbol{u}) - \min_{\boldsymbol{x}} \psi_{T+1}(\boldsymbol{x})$ instead of $\psi_{T+1}(\boldsymbol{u})$. However, for clarity, in the following we will instead explicitly choose the regularizers such that their minimum is 0.
1. FTRL with Linearized Losses Can Be Equivalent to OMD
First, we see that even if FTRL and OMD seem very different, in certain cases they are equivalent. For example, consider the case that . The output of OMD is
Assume that for all . This implies that , that is . Assuming , we have
On the other hand, consider FTRL with linearized losses with regularizers , then
Assuming that , this implies that . Further, assuming that $\nabla \psi$ is invertible, the predictions of FTRL and OMD are the same.
This equivalence immediately gives us some intuition on the role of $\psi$ in both algorithms: the same function $\psi$ induces the Bregman divergence, that is, our similarity measure, and acts as the regularizer in FTRL. Moreover, the inverse of the growth rate of the regularizers in FTRL takes the role of the learning rate in OMD.
Example 1 Consider and , then it satisfies the conditions above to have the predictions of OMD equal to the ones of FTRL.
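As a quick numerical check of this equivalence in the Euclidean case, here is a small sketch (my own, not from the notes), assuming an unconstrained domain, $\boldsymbol{x}_1 = \boldsymbol{0}$, and a constant learning rate $\eta$: online subgradient descent and FTRL with linearized losses and regularizers $\psi_t(\boldsymbol{x}) = \frac{1}{2\eta}\|\boldsymbol{x}\|_2^2$ produce exactly the same iterates.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, eta = 3, 50, 0.1
grads = rng.standard_normal((T, d))   # subgradients revealed one per round

x_omd = np.zeros(d)                   # OSD iterate, x_1 = 0
g_sum = np.zeros(d)                   # cumulative gradient for FTRL

for g in grads:
    x_omd = x_omd - eta * g           # OSD: x_{t+1} = x_t - eta * g_t
    g_sum += g
    # FTRL with psi_t(x) = ||x||^2 / (2 * eta) over linearized losses has
    # the closed form x_{t+1} = -eta * (g_1 + ... + g_t)
    x_ftrl = -eta * g_sum
    assert np.allclose(x_omd, x_ftrl)
```

With a constrained domain or a time-varying learning rate, the two algorithms are no longer identical in general.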
2. Exponentiated Gradient with FTRL: No Need to know $T$
Let’s see an example of an instantiation of FTRL with linearized losses to have the FTRL version of Exponentiated Gradient (EG).
Let $V = \{\boldsymbol{x} \in \mathbb{R}^d : x_i \geq 0, \|\boldsymbol{x}\|_1 = 1\}$ and the sequence of loss functions $\ell_t$ be convex and $L_\infty$-Lipschitz w.r.t. the L-infinity norm. Let $\psi$ be the negative entropy $\psi(\boldsymbol{x}) = \sum_{i=1}^d x_i \ln x_i$, where we define $0 \ln 0 = 0$. Set $\psi_t(\boldsymbol{x}) = \frac{\sqrt{t}}{\alpha}\left(\psi(\boldsymbol{x}) + \ln d\right)$, that is $\frac{\sqrt{t}}{\alpha}$-strongly convex w.r.t. the L1 norm, where $\alpha > 0$ is a parameter of the algorithm.
Given that the regularizers are strongly convex, we know that
We already saw that , that implies that . So, we have that
Note that this is exactly the same update of EG based on OMD, but here we are effectively using time-varying learning rates.
We also get that the regret guarantee is
where we used the fact that using $\psi_T$ or $\psi_{T+1}$ in the bound is equivalent. Choosing $\alpha = \frac{\sqrt{\ln d}}{L_\infty}$, we obtain a regret of $O(L_\infty \sqrt{T \ln d})$. This regret guarantee is similar to the one we proved for OMD, but with an important difference: we don't have to know in advance the number of rounds $T$. In OMD, a similar bound would be vacuous because it would depend on $\sup_{\boldsymbol{x} \in V} B_\psi(\boldsymbol{u}; \boldsymbol{x})$, that is infinite.
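In code, the resulting prediction is just a softmax of the negative cumulative subgradients, rescaled at every round. A sketch (the function and parameter names are mine, and I use the scaling $\alpha/\sqrt{t}$ as the effective time-varying learning rate):

```python
import numpy as np

def eg_ftrl_prediction(grad_sum, t, alpha=1.0):
    """Prediction of EG run as FTRL on the simplex: a softmax of the
    negative cumulative subgradients, scaled by alpha / sqrt(t).
    The scaling shrinks with t, acting as a time-varying learning
    rate, so no knowledge of the horizon T is needed."""
    z = -(alpha / np.sqrt(t)) * np.asarray(grad_sum, dtype=float)
    z -= z.max()                      # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

# d = 3 experts, cumulative subgradients after t = 4 rounds
x = eg_ftrl_prediction([1.0, -2.0, 0.5], t=4)
# x is a probability vector putting most of its mass on the second expert
```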
3. Composite Losses
Let’s now see a variant of the linearization of the losses: partial linearization of composite losses.
Suppose that the losses we receive are composed of two terms: a convex function changing over time plus a part that is fixed and known. These losses are called composite. For example, we might have $\tilde{\ell}_t(\boldsymbol{x}) = \ell_t(\boldsymbol{x}) + \lambda \|\boldsymbol{x}\|_1$, where $\lambda > 0$. Using the linearization, we might just take the subgradient of $\tilde{\ell}_t$. However, in this particular case, we would lose the ability of the L1 norm to produce sparse solutions.
There is a better way to deal with this kind of losses: move the fixed part of the loss inside the regularization term. In this way, that part will not be linearized, but used exactly in the argmin of the update. Assuming that the argmin is still easily computable, you can always expect better performance from this approach. In particular, in the case of adding an L1 norm to the losses, you will be predicting in each step with the solution of an L1-regularized optimization problem.
Practically speaking, in the example above, we will put the L1 terms inside the regularizer, where we assume $\psi$ to be 1-strongly convex and the losses $\ell_t$ to be Lipschitz. Note that at time $t$ the regularizer includes the composite term of round $t$, because we anticipate the term of the next round. Given that the regularizer is strongly convex, using (1), we have
where . Reordering the terms, we have
Example 2 Let's also take a look at the update rule in the case that $\psi(\boldsymbol{x}) = \frac{1}{2}\|\boldsymbol{x}\|_2^2$ and the composite part of the losses is the L1 norm. We have
We can solve this problem observing that the minimization decomposes over the coordinates of $\boldsymbol{x}$. Hence, from the first-order optimality condition, we know that $x_{t+1,i}$ is the solution for the $i$-th coordinate iff there exists $v_i$ such that
Consider the 3 different cases:
- , then and .
- , then and .
- , then and .
So, overall we have
Observe that this update will produce sparse solutions, while just taking the subgradient of the L1 norm would never have produced sparse predictions.
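In code, the coordinate-wise update above is the classic soft-thresholding operation. A minimal sketch, with variable names of my choosing:

```python
import numpy as np

def soft_threshold(v, tau):
    """Closed-form solution of argmin_x 0.5 * ||x - v||_2^2 + tau * ||x||_1,
    applied coordinate-wise: shrink towards zero by tau, clipping at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

v = np.array([0.3, -1.2, 0.05, 2.0])
x = soft_threshold(v, 0.5)            # -> [0.0, -0.7, 0.0, 1.5]
# coordinates with |v_i| <= tau become exactly zero: sparse predictions
```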
Remark 2 (Proximal operators) In the example above, we calculated something like
This operation is known in the optimization literature as the Proximal Operator of the L1 norm. In general, the proximal operator of a convex, proper, and closed function $f$ is defined as

$$\mathrm{prox}_f(\boldsymbol{v}) = \mathop{\mathrm{argmin}}_{\boldsymbol{x}} \ \frac{1}{2}\|\boldsymbol{x} - \boldsymbol{v}\|_2^2 + f(\boldsymbol{x}).$$
Proximal operators are used in optimization in the same way as we used them here: they allow us to minimize the entire function rather than a linear approximation of it. Also, proximal operators generalize the concept of Euclidean projection. Indeed, $\mathrm{prox}_{i_V}(\boldsymbol{x}) = \Pi_V(\boldsymbol{x})$, where $i_V$ is the indicator function of the set $V$ and $\Pi_V$ is the Euclidean projection onto $V$.
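A small sanity check of this last fact, taking $V = [-1, 1]^d$ (my choice of set for illustration): the proximal operator of the indicator function of the box is the Euclidean projection onto it, which is coordinate-wise clipping.

```python
import numpy as np

# For V = [-1, 1]^d, the indicator function i_V is 0 inside the box and
# +infinity outside, so prox_{i_V}(v) = argmin_{x in V} 0.5 * ||x - v||_2^2,
# the Euclidean projection onto V: coordinate-wise clipping.
def prox_box_indicator(v):
    return np.clip(v, -1.0, 1.0)

v = np.array([2.0, -0.5, -3.0])
x = prox_box_indicator(v)             # coordinates outside V hit the boundary
```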
4. FTRL with Strongly Convex Functions
Let's now go back to the FTRL regret bound and see if we can strengthen it in the case that the regularizer is proximal, that is, it satisfies $\boldsymbol{x}_t \in \mathop{\mathrm{argmin}}_{\boldsymbol{x}} \ \psi_{t+1}(\boldsymbol{x}) - \psi_t(\boldsymbol{x})$.
Proof: We have
where in the second inequality we used Corollary 1 from last lecture, the fact that , and . Observing that from the proximal property, we have that , . Hence, using the theorem of the subdifferential of sum of functions, and remembering that , we can choose such that we have .
Remark 3 Note that a constant regularizer is proximal because any point is the minimizer of the zero function. On the other hand, a constant regularizer makes the two Lemmas the same, unless the loss functions contribute to the total strong convexity.
We will now use the above lemma to prove a logarithmic regret bound for strongly convex losses.
The above regret guarantee is the same as the one of OMD over strongly convex losses, but here we don't need to know the strong convexity of the losses. In fact, we just need to output the minimizer over the past losses. However, as we noticed last time, this might be undesirable because now each update is an optimization problem.
Hence, we can again use the idea of replacing the losses with an easy surrogate. In the Lipschitz case, it made sense to use linear losses. However, here we can do better and use quadratic losses, because the losses are strongly convex. So, we can run FTRL on the quadratic losses $\tilde{\ell}_t(\boldsymbol{x}) = \langle \boldsymbol{g}_t, \boldsymbol{x} \rangle + \frac{\mu_t}{2}\|\boldsymbol{x} - \boldsymbol{x}_t\|_2^2$, where $\boldsymbol{g}_t \in \partial \ell_t(\boldsymbol{x}_t)$ and $\mu_t$ is the strong convexity of $\ell_t$. The algorithm would be the following one:
Moreover, we will get exactly the same regret bound as in Corollary 2, with the only difference that here the guarantee holds for a specific choice of the subgradients $\boldsymbol{g}_t$ rather than for any subgradient in $\partial \ell_t(\boldsymbol{x}_t)$.
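For an unconstrained domain and no extra regularizer (an assumption made for this sketch), the argmin over the quadratic surrogates has a simple closed form, obtained by setting the gradient of their sum to zero:

```python
import numpy as np

def ftrl_quadratic_prediction(grads, mus, xs):
    """Minimizer of sum_i <g_i, x> + (mu_i / 2) * ||x - x_i||_2^2.
    Setting the gradient sum_i (g_i + mu_i * (x - x_i)) to zero gives
    x_{t+1} = (sum_i mu_i * x_i - sum_i g_i) / (sum_i mu_i)."""
    grads = np.asarray(grads, dtype=float)
    mus = np.asarray(mus, dtype=float)
    xs = np.asarray(xs, dtype=float)
    return ((mus[:, None] * xs).sum(axis=0) - grads.sum(axis=0)) / mus.sum()

# One round: g_1 = [1, 0], mu_1 = 2, x_1 = [0, 0]  ->  x_2 = [-0.5, 0]
x2 = ftrl_quadratic_prediction([[1.0, 0.0]], [2.0], [[0.0, 0.0]])
```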
Example 3 Going back to the example in the first lecture, where the losses are strongly convex, we now see immediately that FTRL without a regularizer, that is Follow the Leader, gives logarithmic regret. Note that in this case the losses were defined only over the feasible set, so that the minimization is carried over it.
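As a concrete illustration (the specific losses below are my choice, not necessarily the ones of the first lecture), take the 1-strongly convex losses $\ell_t(x) = \frac{1}{2}(x - z_t)^2$ with $z_t \in [0,1]$: Follow the Leader predicts the running mean of the past $z_t$, and its measured regret stays far below the $\sqrt{T}$ rate of generic convex losses.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
z = rng.uniform(0.0, 1.0, size=T)

# FTL on l_t(x) = 0.5 * (x - z_t)^2: the minimizer of the sum of the
# past losses is the mean of z_1, ..., z_{t-1}.
preds = np.empty(T)
preds[0] = 0.0                        # arbitrary first prediction
for t in range(1, T):
    preds[t] = z[:t].mean()

best_x = z.mean()                     # best fixed prediction in hindsight
regret = (0.5 * (preds - z) ** 2).sum() - (0.5 * (best_x - z) ** 2).sum()
# regret grows like log T, far below sqrt(T)
```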
5. History Bits
The first analysis of FTRL with composite losses is in (L. Xiao, 2010). The analysis presented here using the negative terms to easily prove regret bounds for FTRL for composite losses is from (F. Orabona and K. Crammer and N. Cesa-Bianchi, 2015).
The first proof of FTRL for strongly convex losses was in (S. Shalev-Shwartz and Y. Singer, 2007) (even if they don’t call it FTRL).
There is an interesting bit about FTRL-Proximal (McMahan, H. B., 2011): FTRL-Proximal is an instantiation of FTRL that uses a particular proximal regularizer. It became very famous in internet companies when Google disclosed in a very influential paper that they were using FTRL-Proximal to train the classifier for click prediction (McMahan, H. B. and Holt, G. and Sculley, D. and Young, M. and Ebner, D. and Grady, J. and Nie, L. and Phillips, T. and Davydov, E. and Golovin, D. and Chikkerur, S. and Liu, D. and Wattenberg, M. and Hrafnkelsson, A. M. and Boulos, T. and Kubica, J., 2013). This generated even more confusion because many people started conflating the term FTRL-Proximal (a specific algorithm) with FTRL (an entire family of algorithms). Unfortunately, this confusion is still going on these days.
Exercise 1 Prove that the update in (2) is equivalent to the one of OSD with learning rate $\eta_t = \frac{1}{\sum_{i=1}^t \mu_i}$.
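The equivalence can at least be checked numerically (a sketch, not a proof, assuming an unconstrained domain and matching first predictions):

```python
import numpy as np

rng = np.random.default_rng(3)
d, T = 2, 50
grads = rng.standard_normal((T, d))
mus = rng.uniform(0.5, 2.0, size=T)       # strong convexity of each loss

xs = [np.zeros(d)]                        # FTRL predictions, x_1 = 0
x_osd = np.zeros(d)                       # OSD iterate, x_1 = 0
for t in range(T):
    M = mus[: t + 1].sum()
    # FTRL over <g_i, x> + (mu_i / 2) * ||x - x_i||^2 for i = 1..t
    x_next = ((mus[: t + 1, None] * np.array(xs)).sum(axis=0)
              - grads[: t + 1].sum(axis=0)) / M
    # OSD step with learning rate eta_t = 1 / (mu_1 + ... + mu_t)
    x_osd = x_osd - grads[t] / M
    assert np.allclose(x_next, x_osd)
    xs.append(x_next)
```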