*This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week.*

* You can find the lectures I published till now here.*

Till now, we focused only on Online Subgradient Descent and its generalization, Online Mirror Descent (OMD), with a brief ad-hoc analysis of a Follow-The-Leader (FTL) analysis in the first lecture. In this class, we will extend FTL to a powerful and generic algorithm to do online convex optimization: **Follow-the-Regularized-Leader** (FTRL).

FTRL is a very intuitive algorithm: At each time step it will play the minimizer of the sum of the past losses *plus* a time-varying regularization. We will see that the regularization is needed to make the algorithm “more stable” with linear losses and avoid the jumping back and forth that we saw in Lecture 2 for Follow-the-Leader.

**1. Follow-the-Regularized-Leader **

As said above, in FTRL we output the minimizer of the regularized cumulative past losses. It should be clear that FTRL is not an algorithm, but rather a family of algorithms, in the same way as OMD is a family of algorithms.

Before analyzing the algorithm, let’s get some intuition on it. In OMD, we saw that the “state” of the algorithm is stored in the current iterate , in the sense that the next iterate depends on and the loss received at time (the choice of the learning rate has only a little influence on the next iterate). Instead in FTRL, the next iterate depends on the entire history of losses received up to time . This has an immediate consequence: In the case that is bounded, OMD will only “remember” the last , and not the iterate before the projection. On the other hand, FTRL keeps in memory the entire history of the past, that in principle allows to recover the iterates before the projection in .

This difference in behavior might make the reader think that FTRL is more computationally and memory expensive. And indeed it is! But, we will also see that there is a way to consider approximate losses that makes the algorithm as expensive as OMD, yet retaining strictly more information than OMD.

For FTRL, we prove a surprising result: an equality for the regret! The proof is in the Appendix.

Lemma 1Denote by . Assume that is not empty and set . Then, for any , we have

Remark 1Note that we basically didn’t assume anything on nor on , the above equality holds even for non-convex losses and regularizers. Yet, solving the minimization problem at each step might be computationally infeasible.

Remark 2Note that the left hand side of the equality in the theorem does not depend on , so if needed we can set it to .

Remark 3Note that the algorithm is invariant to any positive constant added to the regularizers, hence we can always state the regret guarantee with instead of .

However, while surprising, the above equality is not yet a regret bound, because it is somehow “implicit” because the losses are appearing on both sides of the equality.

Let’s take a closer look at the equality. If , we have that the sum of the last two terms on the r.h.s. is negative. On the other hand, the first two terms on the r.h.s. are similar to what we got in OMD. The interesting part is the sum of the terms . To give an intuition of what is going on, let’s consider that case that the regularizer is constant over time, i.e., . Hence, the terms in the sum can be rewritten as

Hence, we are measuring the distance between the minimizer of the regularized losses (with two different regularizers) in two consecutive predictions of the algorithms. Roughly speaking, this term will be small if and the losses+regularization are “nice”. This should remind you exactly the OMD update, where we *constrain* to be close to . Instead, here the two predictions will be close one to the other if the minimizer of the regularized losses up to time is close to the minimizer of the losses up to time . So, like in OMD, the regularizer here will play the critical role of *stabilizing* the predictions, if the losses don’t possess enough curvature.

To quantify this intuition, we need a property of strongly convex functions.

**2. Convex Analysis Bits: Properties of Strongly Convex Functions **

We will use the following lemma for strongly convex functions.

Lemma 2Let -strongly convex with respect to a norm . Then, for all , , and , we have

*Proof:* Define . Observe that , hence is the minimizer of . Also, note that . Hence, we can write

where the last step comes from the conjugate function of the squared norm (See Example 3 in the lecture on OLO lower bounds).

Corollary 3Let -strongly convex with respect to a norm . Let . Then, for all , and , we have

In words, the above lemma says that an upper bound to the suboptimality gap is proportional to the squared norm of the subgradient.

**3. An Explicit Regret Bound using Strongly Convex Regularizers **

We now state a Lemmas quantifying the intuition on the “stability” of the predictions.

Lemma 4With the notation and assumptions of Lemma 1, assume that is proper and -strongly convex w.r.t. , and proper and convex. Also, assume that is non-empty. Then, we have

for all .

*Proof:* We have

where in the second inequality we used Lemma 2, the fact that , and . Observing that , we have . Hence, using the theorem of the subdifferential of sum of functions, we can choose such that we have .

Let’s see some immediate applications of FTRL

Corollary 5Let a sequence of convex loss functions. Let a -strongly convex function w.r.t. . Set the sequence of regularizers as , where . Then, FTRL guarantees

for all . Moreover, if the functions are -Lipschitz, setting we get

*Proof:* The corollary is immediate from Lemma 1, Lemma 4, and the observation that from the assumptions we have . We also set , thanks to Remark 2.

This might look like the same regret guarantee of OMD, however here there is a very important difference: The last term contains a time-varying element () but the domain does not have to be bounded! Also, I used the regularizer and not to remind you another important difference: In OMD the learning rate is chosen after receiving the subgradient while here you have to choose it before receiving it!

The another important difference is that here the update rule seems way more expensive than in OMD, because we need to solve an optimization problem at each step. However, it turns out we can use FTRL on *linearized losses* and obtain the same bound with the same computational complexity of OMD.

**4. FTRL with Linearized Losses **

If we consider that case in which the losses are linear, we have that the prediction of FTRL is

Now, if we assume to be proper, convex, and closed, using theorem 4 in the lecture on OLO lower boundslecture on OLO lower bounds, we have that . Moreover, if is strongly convex, we know that is differentiable and we get

In turn, this update can be written in the following way

This corresponds to Figure 1.

Compare it to the mirror update of OMD, rewritten in a similar way:

They are very similar, but with important differences:

- In OMD, the state is kept in , so we need to transform it into a dual variable before making the update and then back to the primal variable.
- In FTRL with linear losses, the state is kept directly in the dual space, updated and then transformed in the primal variable. The primal variable is only used to predict, but not directly in the update.
- In OMD, the samples are weighted by the learning rates that is typically decreasing
- In FTRL with linear losses, all the subgradients have the same weight, but the regularizer is typically increasing over time.

Also, we will not loose anything in the bound! Indeed, we can run FTRL on the linearized losses , where , guaranting exactly the same regret on the losses . The algorithm for such procedure is in Algorithm 2.

In fact, using the definition of the subgradients and the assumptions of Corollary 5, we have

The only difference with respect to Corollary 5 is that here the are the specific ones we use in the algorithm, while in Corollary 5 the statement holds for any choice of the .

In the next example, we can see the different behavior of FTRL and OMD.

Example 1Consider . With Online Subgradient Descent (OSD) with learning rate and , the update is

On the other hand in FTRL with linearized losses, we can use and it is easy to verify that the update in (1) becomes

While the regret guarantee would be the same for these two updates, from an intuitive point of view OMD seems to be loosing a lot of potential information due to the projection and the fact that we only memorize the projected iterate.

Next time, we will see how to obtain logarithmic regret bounds for strongly convex losses for FTRL and more applications.

**5. History Bits **

Follow the Regularized Leader was introduced in (Abernethy, J. D. and Hazan, E. and Rakhlin, A., 2008) where at each step the prediction is computed as the minimizer of a regularization term plus the sum of losses on all past rounds. However, the key ideas of FTRL, and in particular its analysis through the dual, were planted by Shai Shalev-Shwartz and Yoram Singer way before (Shalev-Shwartz, S. and Singer, Y., 2006)(Shalev-Shwartz, S. and Singer, Y., 2007). Later, the PhD thesis of Shai Shalev-Shwartz (S. Shalev-Shwartz, 2007) contained the most precise dual analysis of FTRL, but he called it “online mirror descent” because the name FTRL was only invented later! Even later, I contributed to the confusion naming a general analysis of FTRL with time-varying regularizer and linear losses “generalized online mirror descent” (F. Orabona and K. Crammer and N. Cesa-Bianchi, 2015). So, now I am trying to set the record straight 🙂

Later to all this, the optimization community rediscovers FTRL with linear losses and calls it Dual Averaging (Nesterov, Y., 2009), even if Nesterov used similar ideas already in 2005 (Nesterov, Y., 2005). It is interesting to note that Nesterov introduced the Dual Averaging algorithm to fix the fact that in OMD gradients enter the algorithm with decreasing weights, contradicting the common sense understanding of how optimization should work. The same ideas were then translated to online learning and stochastic optimization in (L. Xiao, 2010), essentially rediscovering the framework of Shalev-Shwartz and rebranding it Regularized Dual Averaging (RDA). Finally, (McMahan, H B., 2017) gives the elegant equality result that I presented here (with minor improvements) that holds for general loss functions and regularizers. Note that the dual interpretation of FTRL comes out naturally for linear losses, but Lemma 1 underlines the fact that the algorithm is actually more general.

Another source of confusion stems from the fact that some people differentiate among a “lazy” and “greedy” version of OMD. In reality, as proved in (McMahan, H B., 2017), the lazy algorithm is just FTRL with linearized losses and the greedy one is just OMD. The notation “lazy online mirror descent” was introduced in (Zinkevich, M., 2004), where he basically introduced for the first time FTRL with linearized losses.

**6. Exercises **

Exercise 1Prove that the update of FTRL with linearized loss in Example 1 is correct.

Exercise 2Find a way to have bounds for smooth losses with linearized FTRL: Do you need an additional assumption compared to what we did for OSD?

**7. Appendix **

*Proof of Lemma 1:* Define and for . Hence, we have that . Now, consider

We also have

Hence, putting these two inequalities together, we get

Observing that

and that , we get the equality

Corollary 5’s proof should have that \psi_t(x) – \psi_{t+1}(x) \leq 0 (currently the sign of the inequality is incorrect)

LikeLike

Hi Anil, thanks for the bug report! Fixed now.

LikeLike

Hi Francesco, I am a bit confused about corollary 5. Here’s my reasoning: starting from the regret equality lemma, we can say that F_{T+1}(x_{T+1}) \leq F_{T+1}(u) because x_{T+1} is the minimiser of F_{T+1}(x), thus we can get rid of the last 2 terms since their sum is less or equal than 0. Then using lemma 4 and the choice of regularisers that you gave, the sum over t of the terms in the rhs of the regret equality lemma reduces to $ \frac{1}{2 \mu} \sum_{t=1}^T \n_{t-1} \| g_t \|_*^2 $ as written. Until here, everything should be fine. Now regarding the first two terms, using the definition of $\psi$ you gave we should get

$$ \psi_{T+1}(u) – \min_x \psi_1(x) = \frac{1}{\eta_T}( \Psi(u) – \min_z \Psi(z) ) – \min_x ( \frac{1}{\eta_0} ( \Psi(x) – \min_z \Psi(z) ) ) $$

which, if I’m not wrong reduces to

$$ \frac{1}{\eta_T} \Psi(u) – \frac{1}{\eta_0} \min_x \Psi(x) $$

assuming decreasing learning rates, i.e. \eta_t \leq \eta_{t-1}. But, if this is true we get the final bound as

$$ R_T(u) \leq \frac{1}{\eta_T} \Psi(u) – \frac{1}{\eta_0} \min_x \Psi(x) + \frac{1}{2\mu} \sum_{t=1}^T \| g_t \|_*^2 $$

where the first two terms are bit different from what you have written, in particular the one with \eta_0. Do we make any assumption on \eta_0 or maybe I did some mistakes in the calculations?

Also, some I think there are some typos when you mention the linearised losses part. In particular, when you say: “Now, if we assume {\psi} to be proper, convex, and closed, using the theorem 2” – I think it should be theorem 4 – “in the lecture on OLO lower bounds, we have that {x_t}” – I think it should be x_{t+1}.

Thank you,

Nicolò

LikeLike

Hi Nicolò,

thanks for reporting the typos, they are fixed now.

Regarding the main issue, there is a remark that I forgot to write, probably it is in the next lecture: constant biases do not play any role in the update rule. Hence, I can always consider psi_t(x) substituted with psi(x) – min_y psi(y). So, the term psi_1(x_1)-min_y psi_1(y)=0 and you also get \frac{1}{\eta_T} (\psi(u) – min_x \psi(x)).

EDIT: I now added the remark.

LikeLike

All right, thanks!

LikeLiked by 1 person