This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019.
You can find all the lectures I published here.
Till now, we focused only on Online Subgradient Descent and its generalization, Online Mirror Descent (OMD), with a brief ad-hoc analysis of a Follow-The-Leader (FTL) analysis in the first lecture. In this class, we will extend FTL to a powerful and generic algorithm to do online convex optimization: Follow-the-Regularized-Leader (FTRL).
FTRL is a very intuitive algorithm: At each time step it will play the minimizer of the sum of the past losses plus a time-varying regularization. We will see that the regularization is needed to make the algorithm “more stable” with linear losses and avoid the jumping back and forth that we saw in Lecture 2 for Follow-the-Leader.
As said above, in FTRL we output the minimizer of the regularized cumulative past losses. It should be clear that FTRL is not an algorithm, but rather a family of algorithms, in the same way as OMD is a family of algorithms.
Before analyzing the algorithm, let’s get some intuition on it. In OMD, we saw that the “state” of the algorithm is stored in the current iterate , in the sense that the next iterate depends on and the loss received at time (the choice of the learning rate has only a little influence on the next iterate). Instead in FTRL, the next iterate depends on the entire history of losses received up to time . This has an immediate consequence: In the case that is bounded, OMD will only “remember” the last , and not the iterate before the projection. On the other hand, FTRL keeps in memory the entire history of the past, that in principle allows to recover the iterates before the projection in .
This difference in behavior might make the reader think that FTRL is more computationally and memory expensive. And indeed it is! But, we will also see that there is a way to consider approximate losses that makes the algorithm as expensive as OMD, yet retaining strictly more information than OMD.
For FTRL, we prove a surprising result: an equality for the regret! The proof is in the Appendix.
Proof: Given that the terms are appearing at both side of the equality, we just have to verify that
Remembering that and using the fact that the sum with is telescopic, we have
that is true by the definition of .
Remark 1 Note that we basically didn’t assume anything on nor on , the above equality holds even for non-convex losses and regularizers. Yet, solving the minimization problem at each step might be computationally infeasible.
Remark 3 Note that the algorithm is invariant to any positive constant added to the regularizers, hence we can always state the regret guarantee with instead of .
However, while surprising, the above equality is not yet a regret bound, because it is somehow “implicit” because the losses are appearing on both sides of the equality.
Let’s take a closer look at the equality. If , we have that the sum of the last two terms on the r.h.s. is negative. On the other hand, the first two terms on the r.h.s. are similar to what we got in OMD. The interesting part is the sum of the terms . To give an intuition of what is going on, let’s consider that case that the regularizer is constant over time, i.e., . Hence, the terms in the sum can be rewritten as
Hence, we are measuring the distance between the minimizer of the regularized losses (with two different regularizers) in two consecutive predictions of the algorithms. Roughly speaking, this term will be small if and the losses+regularization are “nice”. This should remind you exactly the OMD update, where we constrain to be close to . Instead, here the two predictions will be close one to the other if the minimizer of the regularized losses up to time is close to the minimizer of the losses up to time . So, like in OMD, the regularizer here will play the critical role of stabilizing the predictions, if the losses don’t possess enough curvature.
To quantify this intuition, we need a property of strongly convex functions.
2. Convex Analysis Bits: Properties of Strongly Convex Functions
We will use the following lemma for strongly convex functions.
Proof: Define . Observe that , hence is the minimizer of . Also, note that . Hence, we can write
where the last step comes from the conjugate function of the squared norm (See Example 3 in the lecture on OLO lower bounds).
Corollary 3 Let -strongly convex with respect to a norm . Let . Then, for all , and , we have
In words, the above lemma says that an upper bound to the suboptimality gap is proportional to the squared norm of the subgradient.
3. An Explicit Regret Bound using Strongly Convex Regularizers
We now state a Lemmas quantifying the intuition on the “stability” of the predictions.
Lemma 4 With the notation and assumptions of Lemma 1, assume that is proper and -strongly convex w.r.t. , and proper and convex. Also, assume that is non-empty. Then, we have
for all .
Proof: We have
where in the second inequality we used Lemma 2, the fact that , and . Observing that , we have . Hence, using the theorem of the subdifferential of sum of functions, we can choose such that we have .
Let’s see some immediate applications of FTRL
for all . Moreover, if the functions are -Lipschitz, setting we get
This might look like the same regret guarantee of OMD, however here there is a very important difference: The last term contains a time-varying element () but the domain does not have to be bounded! Also, I used the regularizer and not to remind you another important difference: In OMD the learning rate is chosen after receiving the subgradient while here you have to choose it before receiving it!
The another important difference is that here the update rule seems way more expensive than in OMD, because we need to solve an optimization problem at each step. However, it turns out we can use FTRL on linearized losses and obtain the same bound with the same computational complexity of OMD.
4. FTRL with Linearized Losses
If we consider that case in which the losses are linear, we have that the prediction of FTRL is
Now, if we assume to be proper, convex, and closed, using theorem 4 in the lecture on OLO lower bounds, we have that . Moreover, if is strongly convex, we know that is differentiable and we get
This corresponds to Figure 1.
Compare it to the mirror update of OMD, rewritten in a similar way:
They are very similar, but with important differences:
- In OMD, the state is kept in , so we need to transform it into a dual variable before making the update and then back to the primal variable.
- In FTRL with linear losses, the state is kept directly in the dual space, updated and then transformed in the primal variable. The primal variable is only used to predict, but not directly in the update.
- In OMD, the samples are weighted by the learning rates that is typically decreasing
- In FTRL with linear losses, all the subgradients have the same weight, but the regularizer is typically increasing over time.
Also, we will not loose anything in the bound! Indeed, we can run FTRL on the linearized losses , where , guaranting exactly the same regret on the losses . The algorithm for such procedure is in Algorithm 2.
In fact, using the definition of the subgradients and the assumptions of Corollary 5, we have
In the next example, we can see the different behavior of FTRL and OMD.
On the other hand in FTRL with linearized losses, we can use and it is easy to verify that the update in (1) becomes
While the regret guarantee would be the same for these two updates, from an intuitive point of view OMD seems to be loosing a lot of potential information due to the projection and the fact that we only memorize the projected iterate.
Next time, we will see how to obtain logarithmic regret bounds for strongly convex losses for FTRL and more applications.
5. History Bits
Follow the Regularized Leader was introduced in (Abernethy, J. D. and Hazan, E. and Rakhlin, A., 2008) where at each step the prediction is computed as the minimizer of a regularization term plus the sum of losses on all past rounds. However, the key ideas of FTRL, and in particular its analysis through the dual, were planted by Shai Shalev-Shwartz and Yoram Singer way before (Shalev-Shwartz, S. and Singer, Y., 2006)(Shalev-Shwartz, S. and Singer, Y., 2007). Later, the PhD thesis of Shai Shalev-Shwartz (S. Shalev-Shwartz, 2007) contained the most precise dual analysis of FTRL, but he called it “online mirror descent” because the name FTRL was only invented later! Even later, I contributed to the confusion naming a general analysis of FTRL with time-varying regularizer and linear losses “generalized online mirror descent” (F. Orabona and K. Crammer and N. Cesa-Bianchi, 2015). So, now I am trying to set the record straight 🙂
Later to all this, the optimization community starts using FTRL with linear losses as well, calling it Dual Averaging (Nesterov, Y., 2009). Note that this paper by Nesterov is actually from 2005 (Nesterov, Y., 2005), and in the paper he claims that these ideas are from 2001-2002, but he decided not to publish them for a while because he was convinced that “the importance of black-box approach in Convex Optimization will be irreversibly vanishing, and, finally, this approach will be completely replaced by other ones based on a clever use of problem’s structure (interior-point methods, smoothing, etc.).” Hence, Nesterov is for sure the the first inventor of offline FTRL with linear losses. It is interesting to note that Nesterov introduced the Dual Averaging algorithm to fix the fact that in OMD gradients enter the algorithm with decreasing weights, contradicting the common sense understanding of how optimization should work. The same ideas were then translated to online learning and stochastic optimization in (L. Xiao, 2010), essentially rediscovering the framework of Shalev-Shwartz and rebranding it Regularized Dual Averaging (RDA). Finally, (McMahan, H B., 2017) gives the elegant equality result that I presented here (with minor improvements) that holds for general loss functions and regularizers. Note that Dual Averaging stems from the dual interpretation of FTRL for linear losses, but Lemma 1 underlines the fact that FTRL is actually more general.
Another source of confusion stems from the fact that some people differentiate among a “lazy” and “greedy” version of OMD. In reality, as proved in (McMahan, H B., 2017), the lazy algorithm is just FTRL with linearized losses and the greedy one is just OMD. The notation “lazy online mirror descent” was introduced in (Zinkevich, M., 2004), where he basically introduced for the first time FTRL with linearized losses in the special case of .
Exercise 1 Prove that the update of FTRL with linearized loss in Example 1 is correct.
Exercise 2 Find a way to have bounds for smooth losses with linearized FTRL: Do you need an additional assumption compared to what we did for OSD?