Follow-The-Regularized-Leader I: Regret Equality

This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019.

You can find all the lectures I published here.

Till now, we focused only on Online Subgradient Descent and its generalization, Online Mirror Descent (OMD), with a brief ad-hoc analysis of a Follow-The-Leader (FTL) analysis in the first lecture. In this class, we will extend FTL to a powerful and generic algorithm to do online convex optimization: Follow-the-Regularized-Leader (FTRL).

FTRL is a very intuitive algorithm: At each time step it will play the minimizer of the sum of the past losses plus a time-varying regularization. We will see that the regularization is needed to make the algorithm “more stable” with linear losses and avoid the jumping back and forth that we saw in Lecture 2 for Follow-the-Leader.

1. Follow-the-Regularized-Leader

ftrl_algo

As said above, in FTRL we output the minimizer of the regularized cumulative past losses. It should be clear that FTRL is not an algorithm, but rather a family of algorithms, in the same way as OMD is a family of algorithms.

Before analyzing the algorithm, let’s get some intuition on it. In OMD, we saw that the “state” of the algorithm is stored in the current iterate ${{\boldsymbol x}_t}$ , in the sense that the next iterate ${{\boldsymbol x}_{t+1}}$ depends on ${{\boldsymbol x}_t}$ and the loss received at time ${t}$ (the choice of the learning rate has only a little influence on the next iterate). Instead in FTRL, the next iterate ${{\boldsymbol x}_{t+1}}$ depends on the entire history of losses received up to time ${t}$ . This has an immediate consequence: In the case that ${V}$ is bounded, OMD will only “remember” the last ${{\boldsymbol x}_t}$ , and not the iterate before the projection. On the other hand, FTRL keeps in memory the entire history of the past, that in principle allows to recover the iterates before the projection in ${V}$ .

This difference in behavior might make the reader think that FTRL is more computationally and memory expensive. And indeed it is! But, we will also see that there is a way to consider approximate losses that makes the algorithm as expensive as OMD, yet retaining strictly more information than OMD.

For FTRL, we prove a surprising result: an equality for the regret! The proof is in the Appendix.

Lemma 1 Denote by ${F_t({\boldsymbol x}) = \psi_{t}({\boldsymbol x}) + \sum_{i=1}^{t-1} \ell_i({\boldsymbol x})}$ . Assume that ${\mathop{\mathrm{argmin}}_{{\boldsymbol x} \in {\mathbb R}^d} \ F_{t}({\boldsymbol x})}$ is not empty and set ${{\boldsymbol x}_t \in \mathop{\mathrm{argmin}}_{{\boldsymbol x} \in {\mathbb R}^d} \ F_{t}({\boldsymbol x})}$ . Then, for any ${{\boldsymbol u}}$ , we have

$\displaystyle \sum_{t=1}^T ( \ell_t({\boldsymbol x}_t) - \ell_t({\boldsymbol u}) ) = \psi_{T+1}({\boldsymbol u}) - \min_{{\boldsymbol x}} \ \psi_{1}({\boldsymbol x}) + \sum_{t=1}^T [F_t({\boldsymbol x}_t) - F_{t+1}({\boldsymbol x}_{t+1}) + \ell_t({\boldsymbol x}_t)] + F_{T+1}({\boldsymbol x}_{T+1}) - F_{T+1}({\boldsymbol u})~.$

Proof: Given that the terms ${\ell_t({\boldsymbol x}_t)}$ are appearing at both side of the equality, we just have to verify that

$\displaystyle - \sum_{t=1}^T \ell_t({\boldsymbol u}) = \psi_{T+1}({\boldsymbol u}) - \min_{{\boldsymbol x} \in V} \ \psi_{1}({\boldsymbol x}) + \sum_{t=1}^T [F_t({\boldsymbol x}_t) - F_{t+1}({\boldsymbol x}_{t+1})] + F_{T+1}({\boldsymbol x}_{T+1}) - F_{T+1}({\boldsymbol u})~.$

Remembering that ${F_1({\boldsymbol x}_1) = \min_{{\boldsymbol x} \in V} \ \psi_1({\boldsymbol x})}$ and using the fact that the sum with ${F_t}$ is telescopic, we have

$\displaystyle - \sum_{t=1}^T \ell_t({\boldsymbol u}) = \psi_{T+1}({\boldsymbol u}) - F_1({\boldsymbol x}_1) + F_1({\boldsymbol x}_1) - F_{T+1}({\boldsymbol x}_{T+1}) + F_{T+1}({\boldsymbol x}_{T+1}) - F_{T+1}({\boldsymbol u}) = \psi_{T+1}({\boldsymbol u}) - F_{T+1}({\boldsymbol u}),$

that is true by the definition of ${F_{T+1}}$ . $\Box$

Remark 1 Note that we basically didn’t assume anything on ${\ell_t}$ nor on ${\psi_t}$ , the above equality holds even for non-convex losses and regularizers. Yet, solving the minimization problem at each step might be computationally infeasible.

Remark 2 Note that the left hand side of the equality in the theorem does not depend on ${\psi_{T+1}}$ , so if needed we can set it to ${\psi_{T}}$ .

Remark 3 Note that the algorithm is invariant to any positive constant added to the regularizers, hence we can always state the regret guarantee with ${\psi_t({\boldsymbol u})-\min_{{\boldsymbol x}}\psi_t({\boldsymbol x})}$ instead of ${\psi_t}$ .

However, while surprising, the above equality is not yet a regret bound, because it is somehow “implicit” because the losses are appearing on both sides of the equality.

Let’s take a closer look at the equality. If ${{\boldsymbol u} \in \mathop{\mathrm{dom}} F_{t+1}}$ , we have that the sum of the last two terms on the r.h.s. is negative. On the other hand, the first two terms on the r.h.s. are similar to what we got in OMD. The interesting part is the sum of the terms ${F_t({\boldsymbol x}_t) - F_{t+1}({\boldsymbol x}_{t+1}) + \ell_t({\boldsymbol x}_t)}$ . To give an intuition of what is going on, let’s consider that case that the regularizer is constant over time, i.e., ${\psi_t=\psi}$ . Hence, the terms in the sum can be rewritten as

$\displaystyle F_t({\boldsymbol x}_t) - F_{t+1}({\boldsymbol x}_{t+1}) + \ell_t({\boldsymbol x}_t)= \psi({\boldsymbol x}_t) + \sum_{i=1}^{t} \ell_i({\boldsymbol x}_t) - \left( \psi({\boldsymbol x}_{t+1}) + \sum_{i=1}^{t} \ell_i({\boldsymbol x}_{t+1})\right)~.$

Hence, we are measuring the distance between the minimizer of the regularized losses (with two different regularizers) in two consecutive predictions of the algorithms. Roughly speaking, this term will be small if ${{\boldsymbol x}_{t}\approx {\boldsymbol x}_{t+1}}$ and the losses+regularization are “nice”. This should remind you exactly the OMD update, where we constrain ${{\boldsymbol x}_{t+1}}$ to be close to ${{\boldsymbol x}_{t+1}}$ . Instead, here the two predictions will be close one to the other if the minimizer of the regularized losses up to time ${t}$ is close to the minimizer of the losses up to time ${t+1}$ . So, like in OMD, the regularizer here will play the critical role of stabilizing the predictions, if the losses don’t possess enough curvature.

To quantify this intuition, we need a property of strongly convex functions.

2. Convex Analysis Bits: Properties of Strongly Convex Functions

We will use the following lemma for strongly convex functions.

Lemma 2 Let ${f:{\mathbb R}^d \rightarrow (-\infty, +\infty]}$ ${\mu}$ -strongly convex with respect to a norm ${\|\cdot\|}$ . Then, for all ${{\boldsymbol x}, {\boldsymbol y} \in \mathop{\mathrm{dom}} f}$ , ${{\boldsymbol g} \in \partial f({\boldsymbol y})}$ , and ${{\boldsymbol g}' \in \partial f({\boldsymbol x})}$ , we have

$\displaystyle f({\boldsymbol x}) - f({\boldsymbol y}) \leq \langle {\boldsymbol g}, {\boldsymbol x}-{\boldsymbol y}\rangle + \frac{1}{2\mu} \|{\boldsymbol g} - {\boldsymbol g}'\|_*^2~.$

Proof: Define ${\phi({\boldsymbol z})= f({\boldsymbol z}) -\langle {\boldsymbol g}, {\boldsymbol z}\rangle}$ . Observe that ${\boldsymbol{0} \in \partial \phi({\boldsymbol y})}$ , hence ${{\boldsymbol y}}$ is the minimizer of ${\phi({\boldsymbol z})}$ . Also, note that ${{\boldsymbol g}' - {\boldsymbol g} \in \partial \phi({\boldsymbol x})}$ . Hence, we can write

$\displaystyle \begin{aligned} \phi({\boldsymbol y}) &= \min_{{\boldsymbol z} \in \mathop{\mathrm{dom}} \phi} \ \phi({\boldsymbol z}) \\ &\geq \min_{{\boldsymbol z} \in \mathop{\mathrm{dom}} \phi} \ \left(\phi({\boldsymbol x}) + \langle {\boldsymbol g}' - {\boldsymbol g}, {\boldsymbol z} - {\boldsymbol x}\rangle + \frac{\mu}{2} \| {\boldsymbol z} - {\boldsymbol x} \|^2\right) \\ &\geq \inf_{{\boldsymbol z} \in {\mathbb R}^d} \ \left(\phi({\boldsymbol x}) + \langle {\boldsymbol g}' - {\boldsymbol g}, {\boldsymbol z} - {\boldsymbol x}\rangle + \frac{\mu}{2} \| {\boldsymbol z} - {\boldsymbol x} \|^2\right) \\ &= \phi({\boldsymbol x}) - \frac{1}{2\mu}\|{\boldsymbol g}'-{\boldsymbol g}\|^2_*, \end{aligned}$

where the last step comes from the conjugate function of the squared norm (See Example 3 in the lecture on OLO lower bounds). $\Box$

Corollary 3 Let ${f:{\mathbb R}^d \rightarrow (-\infty, +\infty]}$ ${\mu}$ -strongly convex with respect to a norm ${\|\cdot\|}$ . Let ${{\boldsymbol x}^\star = \mathop{\mathrm{argmin}}_{{\boldsymbol x}} f({\boldsymbol x})}$ . Then, for all ${{\boldsymbol x} \in \mathop{\mathrm{dom}} f}$ , and ${{\boldsymbol g} \in \partial f({\boldsymbol x})}$ , we have

$\displaystyle f({\boldsymbol x}) - f({\boldsymbol x}^\star) \leq \frac{1}{2\mu} \|{\boldsymbol g}\|_*^2~.$

In words, the above lemma says that an upper bound to the suboptimality gap is proportional to the squared norm of the subgradient.

3. An Explicit Regret Bound using Strongly Convex Regularizers

We now state a Lemmas quantifying the intuition on the “stability” of the predictions.

Lemma 4 With the notation and assumptions of Lemma 1, assume that ${F_t}$ is proper and ${\lambda_t}$ -strongly convex w.r.t. ${\|\cdot\|}$ , and ${\ell_t}$ proper and convex. Also, assume that ${\partial \ell_t({\boldsymbol x}_t)}$ is non-empty. Then, we have

$\displaystyle F_t({\boldsymbol x}_t) - F_{t+1}({\boldsymbol x}_{t+1}) + \ell_t({\boldsymbol x}_t) \leq \frac{\|{\boldsymbol g}_t\|_*^2}{2\lambda_t} + \psi_t({\boldsymbol x}_{t+1}) - \psi_{t+1}({\boldsymbol x}_{t+1}),$

for all ${{\boldsymbol g}_t \in \partial \ell_t({\boldsymbol x}_t)}$ .

Proof: We have

$\displaystyle \begin{aligned} F_t({\boldsymbol x}_t) - F_{t+1}({\boldsymbol x}_{t+1}) + \ell_t({\boldsymbol x}_t) &= (F_t({\boldsymbol x}_t) + \ell_t({\boldsymbol x}_t)) - (F_{t}({\boldsymbol x}_{t+1}) + \ell_t({\boldsymbol x}_{t+1})) + \psi_t({\boldsymbol x}_{t+1}) - \psi_{t+1}({\boldsymbol x}_{t+1}) \\ &\leq (F_t({\boldsymbol x}_t) + \ell_t({\boldsymbol x}_t)) - (F_{t}({\boldsymbol x}_t^\star) + \ell_t({\boldsymbol x}_t^\star)) + \psi_t({\boldsymbol x}_{t+1}) - \psi_{t+1}({\boldsymbol x}_{t+1}) \\ &\leq \frac{\|{\boldsymbol g}'_t\|_*^2}{2\lambda_t} + \psi_t({\boldsymbol x}_{t+1}) - \psi_{t+1}({\boldsymbol x}_{t+1}), \end{aligned}$

where in the second inequality we used Lemma 2, the fact that ${{\boldsymbol x}_t^\star = \mathop{\mathrm{argmin}}_{{\boldsymbol x}} \ F_{t}({\boldsymbol x}) + \ell_t({\boldsymbol x})}$ , and ${{\boldsymbol g}'_t \in \partial (F_t({\boldsymbol x}_t) + \ell_t({\boldsymbol x}_t))}$ . Observing that ${{\boldsymbol x}_t = \mathop{\mathrm{argmin}}_{{\boldsymbol x}} F_t({\boldsymbol x})}$ , we have ${\boldsymbol{0} \in \partial F_t({\boldsymbol x}_t)}$ . Hence, using the theorem of the subdifferential of sum of functions, we can choose ${{\boldsymbol g}'_t}$ such that we have ${{\boldsymbol g}'_t \in \partial \ell_t({\boldsymbol x}_t)}$ . $\Box$

Let’s see some immediate applications of FTRL

Corollary 5 Let ${\ell_t}$ a sequence of convex loss functions. Let ${\psi: V \rightarrow {\mathbb R}}$ a ${\mu}$ -strongly convex function w.r.t. ${\|\cdot\|}$ . Set the sequence of regularizers as ${\psi_t({\boldsymbol x})=\frac{1}{\eta_{t-1}} (\psi({\boldsymbol x})-\min_{{\boldsymbol z}} \psi({\boldsymbol z}))}$ , where ${\eta_{t+1} \leq \eta_t, \ t=1, \dots, T}$ . Then, FTRL guarantees

$\displaystyle \sum_{t=1}^T \ell({\boldsymbol x}_t) - \sum_{t=1}^T \ell_t({\boldsymbol u}) \leq \frac{\psi({\boldsymbol u}) - \min_{{\boldsymbol x}} \psi({\boldsymbol x})}{\eta_{T-1}} + \frac{1}{2 \mu} \sum_{t=1}^T \eta_{t-1} \|{\boldsymbol g}_t\|^2_\star,$

for all ${{\boldsymbol g}_t \in \partial \ell_t({\boldsymbol x}_t)}$ . Moreover, if the functions ${\ell_t}$ are ${L}$ -Lipschitz, setting ${\eta_{t-1}=\frac{\alpha \sqrt{\mu}}{L\sqrt{t}}}$ we get

$\displaystyle \sum_{t=1}^T \ell({\boldsymbol x}_t) - \sum_{t=1}^T \ell_t({\boldsymbol u}) \leq \left(\frac{\psi({\boldsymbol u}) - \min_{{\boldsymbol x}} \psi({\boldsymbol x})}{\alpha} +\alpha \right) \frac{L \sqrt{T}}{\sqrt{\mu}}~.$

Proof: The corollary is immediate from Lemma 1, Lemma 4, and the observation that from the assumptions we have ${\psi_t({\boldsymbol x}) - \psi_{t+1}({\boldsymbol x})\leq0, \ \forall {\boldsymbol x}}$ . We also set ${\psi_{T+1}=\psi_T}$ , thanks to Remark 2. $\Box$

This might look like the same regret guarantee of OMD, however here there is a very important difference: The last term contains a time-varying element ( ${\eta_t}$ ) but the domain does not have to be bounded! Also, I used the regularizer ${\frac{1}{\eta_{t-1}} \psi({\boldsymbol x})}$ and not ${\frac{1}{\eta_{t}} \psi({\boldsymbol x})}$ to remind you another important difference: In OMD the learning rate ${\eta_t}$ is chosen after receiving the subgradient ${{\boldsymbol g}_t}$ while here you have to choose it before receiving it!

The another important difference is that here the update rule seems way more expensive than in OMD, because we need to solve an optimization problem at each step. However, it turns out we can use FTRL on linearized losses and obtain the same bound with the same computational complexity of OMD.

4. FTRL with Linearized Losses

If we consider that case in which the losses are linear, we have that the prediction of FTRL is

$\displaystyle {\boldsymbol x}_{t+1} \in \mathop{\mathrm{argmin}}_{{\boldsymbol x}} \ \psi_{t+1}({\boldsymbol x}) + \sum_{i=1}^{t} \langle {\boldsymbol g}_i, {\boldsymbol x}\rangle = \mathop{\mathrm{argmax}}_{{\boldsymbol x}} \ \left\langle -\sum_{i=1}^{t} {\boldsymbol g}_i, {\boldsymbol x}\right\rangle -\psi_{t+1}({\boldsymbol x})~.$

Now, if we assume ${\psi}$ to be proper, convex, and closed, using theorem 4 in the lecture on OLO lower bounds, we have that ${{\boldsymbol x}_{t+1} \in \partial \psi_{t+1}^\star(-\sum_{i=1}^{t} {\boldsymbol g}_i)}$ . Moreover, if ${\psi_{t+1}}$ is strongly convex, we know that ${\psi_{t+1}^\star}$ is differentiable and we get

$\displaystyle \label{eq:ftrl_update_linear} {\boldsymbol x}_{t+1} = \nabla \psi_{t+1}^\star\left(-\sum_{i=1}^{t} {\boldsymbol g}_i\right)~. \ \ \ \ \ (1)$

In turn, this update can be written in the following way

$\displaystyle \begin{aligned} {\boldsymbol \theta}_{t+1} &= {\boldsymbol \theta}_{t} - {\boldsymbol g}_{t}, \\ {\boldsymbol x}_{t+1} &= \nabla \psi_{t+1}^\star ({\boldsymbol \theta}_{t+1})~. \end{aligned}$

This corresponds to Figure 1.

dual_mappings_ftrl — Figure 1. Dual mapping for FTRL with linear losses.

Compare it to the mirror update of OMD, rewritten in a similar way:

$\displaystyle \begin{aligned} {\boldsymbol \theta}_{t+1} &= \nabla \psi ({\boldsymbol x}_{t}) - \eta_t {\boldsymbol g}_{t}, \\ {\boldsymbol x}_{t+1} &= \nabla \psi^\star ({\boldsymbol \theta}_{t+1})~. \end{aligned}$

They are very similar, but with important differences:

In OMD, the state is kept in ${{\boldsymbol x}_t}$ , so we need to transform it into a dual variable before making the update and then back to the primal variable.
In FTRL with linear losses, the state is kept directly in the dual space, updated and then transformed in the primal variable. The primal variable is only used to predict, but not directly in the update.
In OMD, the samples are weighted by the learning rates that is typically decreasing
In FTRL with linear losses, all the subgradients have the same weight, but the regularizer is typically increasing over time.

Also, we will not loose anything in the bound! Indeed, we can run FTRL on the linearized losses ${\tilde{\ell}_t({\boldsymbol x})=\langle {\boldsymbol g}_t, {\boldsymbol x}\rangle}$ , where ${{\boldsymbol g}_t \in \partial \ell_t({\boldsymbol x}_t)}$ , guaranting exactly the same regret on the losses ${\ell_t}$ . The algorithm for such procedure is in Algorithm 2.

ftrl_algo_linear

In fact, using the definition of the subgradients and the assumptions of Corollary 5, we have

$\displaystyle \text{Regret}_T({\boldsymbol u}) = \sum_{t=1}^T (\ell_t({\boldsymbol x}_t) -\ell_t({\boldsymbol u})) \leq \sum_{t=1}^T (\tilde{\ell}_t({\boldsymbol x}_t) - \tilde{\ell}_t({\boldsymbol u})) \leq \frac{\psi({\boldsymbol u}) - \min_{{\boldsymbol x}} \psi({\boldsymbol x})}{\eta_{T-1}} + \frac{1}{2 \mu} \sum_{t=1}^T \eta_{t-1} \|{\boldsymbol g}_t\|^2_\star, \ \forall {\boldsymbol u}~.$

The only difference with respect to Corollary 5 is that here the ${{\boldsymbol g}_t}$ are the specific ones we use in the algorithm, while in Corollary 5 the statement holds for any choice of the ${{\boldsymbol g}_t \in \partial \ell_t({\boldsymbol x}_t)}$ .

In the next example, we can see the different behavior of FTRL and OMD.

Example 1 Consider ${V=\{{\boldsymbol x} \in {\mathbb R}^d: \|{\boldsymbol x}\|_2\leq 1\}}$ . With Online Subgradient Descent (OSD) with learning rate ${\eta_t=\frac{1}{\sqrt{t}}}$ and ${{\boldsymbol x}_1=\boldsymbol{0}}$ , the update is

$\displaystyle \begin{aligned} \tilde{{\boldsymbol x}}_{t+1} &= {\boldsymbol x}_t -\frac{1}{\sqrt{t}} {\boldsymbol g}_t, \\ {\boldsymbol x}_{t+1} &= \tilde{{\boldsymbol x}}_{t+1} \, \min\left(\frac{1}{\|\tilde{{\boldsymbol x}}_{t+1}\|_2},1\right)~. \end{aligned}$

On the other hand in FTRL with linearized losses, we can use ${\psi_t({\boldsymbol x}) = \frac{\sqrt{t}}{2}\|{\boldsymbol x}\|_2^2 + i_V({\boldsymbol x})}$ and it is easy to verify that the update in (1) becomes

$\displaystyle \begin{aligned} \tilde{{\boldsymbol x}}_{t+1} &= \frac{-\sum_{i=1}^t {\boldsymbol g}_i}{\sqrt{t}},\\ {\boldsymbol x}_{t+1} &= \tilde{{\boldsymbol x}}_{t+1} \, \min\left(\frac{1}{\|\tilde{{\boldsymbol x}}_{t+1}\|_2},1\right)~. \end{aligned}$

While the regret guarantee would be the same for these two updates, from an intuitive point of view OMD seems to be loosing a lot of potential information due to the projection and the fact that we only memorize the projected iterate.

Next time, we will see how to obtain logarithmic regret bounds for strongly convex losses for FTRL and more applications.

5. History Bits

Follow the Regularized Leader was introduced in (Abernethy, J. D. and Hazan, E. and Rakhlin, A., 2008) where at each step the prediction is computed as the minimizer of a regularization term plus the sum of losses on all past rounds. However, the key ideas of FTRL, and in particular its analysis through the dual, were planted by Shai Shalev-Shwartz and Yoram Singer way before (Shalev-Shwartz, S. and Singer, Y., 2006)(Shalev-Shwartz, S. and Singer, Y., 2007). Later, the PhD thesis of Shai Shalev-Shwartz (S. Shalev-Shwartz, 2007) contained the most precise dual analysis of FTRL, but he called it “online mirror descent” because the name FTRL was only invented later! Even later, I contributed to the confusion naming a general analysis of FTRL with time-varying regularizer and linear losses “generalized online mirror descent” (F. Orabona and K. Crammer and N. Cesa-Bianchi, 2015). So, now I am trying to set the record straight 🙂

Later to all this, the optimization community starts using FTRL with linear losses as well, calling it Dual Averaging (Nesterov, Y., 2009). Note that this paper by Nesterov is actually from 2005 (Nesterov, Y., 2005), and in the paper he claims that these ideas are from 2001-2002, but he decided not to publish them for a while because he was convinced that “the importance of black-box approach in Convex Optimization will be irreversibly vanishing, and, finally, this approach will be completely replaced by other ones based on a clever use of problem’s structure (interior-point methods, smoothing, etc.).” Hence, Nesterov is for sure the the first inventor of offline FTRL with linear losses. It is interesting to note that Nesterov introduced the Dual Averaging algorithm to fix the fact that in OMD gradients enter the algorithm with decreasing weights, contradicting the common sense understanding of how optimization should work. The same ideas were then translated to online learning and stochastic optimization in (L. Xiao, 2010), essentially rediscovering the framework of Shalev-Shwartz and rebranding it Regularized Dual Averaging (RDA). Finally, (McMahan, H B., 2017) gives the elegant equality result that I presented here (with minor improvements) that holds for general loss functions and regularizers. Note that Dual Averaging stems from the dual interpretation of FTRL for linear losses, but Lemma 1 underlines the fact that FTRL is actually more general.

Another source of confusion stems from the fact that some people differentiate among a “lazy” and “greedy” version of OMD. In reality, as proved in (McMahan, H B., 2017), the lazy algorithm is just FTRL with linearized losses and the greedy one is just OMD. The notation “lazy online mirror descent” was introduced in (Zinkevich, M., 2004), where he basically introduced for the first time FTRL with linearized losses in the special case of ${\psi_t({\boldsymbol x})=\frac{\eta}{2}\|{\boldsymbol x}\|_2^2}$ .

6. Exercises

Exercise 1 Prove that the update of FTRL with linearized loss in Example 1 is correct.

Exercise 2 Find a way to have ${L^\star}$ bounds for smooth losses with linearized FTRL: Do you need an additional assumption compared to what we did for OSD?

13 Comments

Pingback: Follow the Regularized Leader II: Applications – Parameter-free Learning and Optimization Algorithms
Pingback: Online Linear Classification – Parameter-free Learning and Optimization Algorithms
Anil Kag says:

October 29, 2019 at 6:32 PM

Corollary 5’s proof should have that \psi_t(x) – \psi_{t+1}(x) \leq 0 (currently the sign of the inequality is incorrect)

LikeLike

1. bremen79 says:
  
  October 29, 2019 at 9:41 PM
  
  Hi Anil, thanks for the bug report! Fixed now.
  
  LikeLike
  
Pingback: Parameter-free Online Learning I: Coin-Betting and 1-d OCO – Parameter-free Learning and Optimization Algorithms
Pingback: Parameter-free Online Learning II: Reductions – Parameter-free Learning and Optimization Algorithms
Pingback: Optimistic Follow-The-Regularized-Leader – Parameter-free Learning and Optimization Algorithms
Nicolò says:

November 10, 2019 at 6:31 AM

Hi Francesco, I am a bit confused about corollary 5. Here’s my reasoning: starting from the regret equality lemma, we can say that F_{T+1}(x_{T+1}) \leq F_{T+1}(u) because x_{T+1} is the minimiser of F_{T+1}(x), thus we can get rid of the last 2 terms since their sum is less or equal than 0. Then using lemma 4 and the choice of regularisers that you gave, the sum over t of the terms in the rhs of the regret equality lemma reduces to $ \frac{1}{2 \mu} \sum_{t=1}^T \n_{t-1} \| g_t \|_*^2 $ as written. Until here, everything should be fine. Now regarding the first two terms, using the definition of $\psi$ you gave we should get

$$ \psi_{T+1}(u) – \min_x \psi_1(x) = \frac{1}{\eta_T}( \Psi(u) – \min_z \Psi(z) ) – \min_x ( \frac{1}{\eta_0} ( \Psi(x) – \min_z \Psi(z) ) ) $$

which, if I’m not wrong reduces to

$$ \frac{1}{\eta_T} \Psi(u) – \frac{1}{\eta_0} \min_x \Psi(x) $$

assuming decreasing learning rates, i.e. \eta_t \leq \eta_{t-1}. But, if this is true we get the final bound as

$$ R_T(u) \leq \frac{1}{\eta_T} \Psi(u) – \frac{1}{\eta_0} \min_x \Psi(x) + \frac{1}{2\mu} \sum_{t=1}^T \| g_t \|_*^2 $$

where the first two terms are bit different from what you have written, in particular the one with \eta_0. Do we make any assumption on \eta_0 or maybe I did some mistakes in the calculations?

Also, some I think there are some typos when you mention the linearised losses part. In particular, when you say: “Now, if we assume {\psi} to be proper, convex, and closed, using the theorem 2” – I think it should be theorem 4 – “in the lecture on OLO lower bounds, we have that {x_t}” – I think it should be x_{t+1}.

Thank you,

Nicolò

LikeLike

bremen79 says:

November 10, 2019 at 7:37 AM

Hi Nicolò,

thanks for reporting the typos, they are fixed now.

Regarding the main issue, there is a remark that I forgot to write, probably it is in the next lecture: constant biases do not play any role in the update rule. Hence, I can always consider psi_t(x) substituted with psi(x) – min_y psi(y). So, the term psi_1(x_1)-min_y psi_1(y)=0 and you also get \frac{1}{\eta_T} (\psi(u) – min_x \psi(x)).

EDIT: I now added the remark.

LikeLike

1. Nicolò says:
  
  November 10, 2019 at 8:31 AM
  
  All right, thanks!
  
  LikeLiked by 1 person
  
Pingback: Multi-Armed Bandit I – Parameter-free Learning and Optimization Algorithms
bremen79 says:

January 26, 2020 at 6:54 PM

EDIT: Greatly simplified the proof of the equality Lemma for FTRL.

LikeLike

bremen79 says:

April 20, 2020 at 10:03 PM

Thanks to Christian Kroer for pointing out the exact dates in Nesterov’s Dual Averaging.

LikeLike

Pingback: AdaHedge – Parameter-free Learning and Optimization Algorithms
Fernando says:

June 1, 2021 at 2:50 PM

Hello Francesco, I’ve been following along your modern introduction to online convex optimization. It is a great resource. I just have a comment about example 1. Shouldn’t it be max instead of min? Otherwise, we would run into situations where x_{t+1} is not in V, i.e., || x_{t+1} || > 1.

LikeLike

1. bremen79 says:
  
  June 1, 2021 at 6:04 PM
  
  I think it is correct as it is. For example, if || \tilde{x}_{t+1} ||=10, then 1/|| \tilde{x}_{t+1} || is less than 1, and we correctly divide the vector by its norm. The min keeps vectors with norm less than 1 to be normalized to 1.
  
  LikeLike
  
  1. Fernando says:
    
    June 3, 2021 at 6:39 PM
    
    Yeah, I see it now, my bad. Thank you so much for the quick response.
    
    LikeLiked by 1 person
Pingback: Implicit, Passive-Aggressive, and aProx updates – Parameter-free Learning and Optimization Algorithms
Pingback: A Simple Proof for Optimistic OMD – Parameter-free Learning and Optimization Algorithms
rpSebastian says:

June 19, 2023 at 4:10 AM

Hi Francesco, I have a small question about Corollary 5.

For the summation term in Lemma 1, I apply Lemma 4 and get $\sum_{t=1}^{T}F_t(x_t) – F_{t+1}(x_{t+1}) + l_t(x_t) \le \sum_{t=1}^{T} \frac{||g_t||_*^2}{2\mu} $. I wonder why $\eta_{t-1}$ appears in the final result.

Thanks a lot.

LikeLike

rpSebastian says:

June 19, 2023 at 4:25 AM

Sorry to interrupt. I have found the reason. $\psi$ is a $\mu$-strongly convex function and $\psi_t$ is a $\frac{\mu}{\eta_{t-1}}$-strongly convex function.

LikeLike

1. bremen79 says:
  
  June 19, 2023 at 7:56 AM
  
  Correct!
  
  LikeLike
  
Pingback: Adapting to Smoothness with Normalized Gradients – Parameter-free Learning and Optimization Algorithms
Pingback: The Weighted Average Algorithm – Parameter-free Learning and Optimization Algorithms
Pingback: The Gaptron Algorithm – Parameter-free Learning and Optimization Algorithms