Follow-The-Regularized-Leader I: Regret Equality

This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week.

You can find the lectures I published till now here.


Till now, we focused only on Online Subgradient Descent and its generalization, Online Mirror Descent (OMD), with a brief ad-hoc analysis of Follow-The-Leader (FTL) in the first lecture. In this class, we will extend FTL to a powerful and generic algorithm for online convex optimization: Follow-the-Regularized-Leader (FTRL).

FTRL is a very intuitive algorithm: at each time step it plays the minimizer of the sum of the past losses plus a time-varying regularization term. We will see that the regularization is needed to make the algorithm “more stable” with linear losses and to avoid the jumping back and forth that we saw in Lecture 2 for Follow-the-Leader.

1. Follow-the-Regularized-Leader


Algorithm 1. Follow-the-Regularized-Leader (FTRL).

As said above, in FTRL we output the minimizer of the regularized cumulative past losses. It should be clear that FTRL is not a single algorithm, but rather a family of algorithms, in the same way that OMD is a family of algorithms.
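
To make this concrete, here is a minimal sketch in Python of the generic FTRL loop, solving the argmin numerically at each round; the helper names, the quadratic losses, and the constant regularizer are my own choices for illustration, not part of the lecture.

```python
import numpy as np
from scipy.optimize import minimize

def ftrl_predict(past_losses, psi, d):
    """Return x_t, a minimizer of psi(x) + sum of the past losses (solved numerically)."""
    objective = lambda x: psi(x) + sum(ell(x) for ell in past_losses)
    return minimize(objective, np.zeros(d)).x

# Toy run: quadratic losses ell_t(x) = ||x - z_t||^2 / 2, revealed after predicting,
# and a constant regularizer psi(x) = ||x||^2 / 2.
rng = np.random.default_rng(0)
d, past_losses = 2, []
psi = lambda x: 0.5 * np.dot(x, x)
for t in range(10):
    x_t = ftrl_predict(past_losses, psi, d)              # predict x_t
    z_t = rng.normal(size=d)                             # the adversary picks the loss
    past_losses.append(lambda x, z=z_t: 0.5 * np.dot(x - z, x - z))
```

Of course, re-solving the full problem at every round is exactly the computational issue discussed below; the linearized version in Section 4 avoids it.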

Before analyzing the algorithm, let’s get some intuition about it. In OMD, we saw that the “state” of the algorithm is stored in the current iterate {{\boldsymbol x}_t}, in the sense that the next iterate {{\boldsymbol x}_{t+1}} depends on {{\boldsymbol x}_t} and the loss received at time {t} (the choice of the learning rate has only a little influence on the next iterate). Instead, in FTRL the next iterate {{\boldsymbol x}_{t+1}} depends on the entire history of losses received up to time {t}. This has an immediate consequence: in the case that {V} is bounded, OMD will only “remember” the last {{\boldsymbol x}_t}, and not the iterate before the projection. On the other hand, FTRL keeps the entire history of the past in memory, which in principle allows recovering the iterates before the projection onto {V}.

This difference in behavior might make the reader think that FTRL is more expensive, both computationally and in memory. And indeed it is! But we will also see that there is a way to consider approximate losses that makes the algorithm as expensive as OMD, while retaining strictly more information than OMD.

For FTRL, we prove a surprising result: an equality for the regret! The proof is in the Appendix.

Lemma 1 Denote by {F_t({\boldsymbol x}) = \psi_{t}({\boldsymbol x}) + \sum_{i=1}^{t-1} \ell_i({\boldsymbol x})}. Assume that {\mathop{\mathrm{argmin}}_{{\boldsymbol x} \in {\mathbb R}^d} \ F_{t}({\boldsymbol x})} is not empty and set {{\boldsymbol x}_t \in \mathop{\mathrm{argmin}}_{{\boldsymbol x} \in {\mathbb R}^d} \ F_{t}({\boldsymbol x})}. Then, for any {{\boldsymbol u}}, we have

\displaystyle \sum_{t=1}^T ( \ell_t({\boldsymbol x}_t) - \ell_t({\boldsymbol u}) ) = \psi_{T+1}({\boldsymbol u}) - \min_{{\boldsymbol x}} \ \psi_{1}({\boldsymbol x}) + \sum_{t=1}^T [F_t({\boldsymbol x}_t) - F_{t+1}({\boldsymbol x}_{t+1}) + \ell_t({\boldsymbol x}_t)] + F_{T+1}({\boldsymbol x}_{T+1}) - F_{T+1}({\boldsymbol u})~.

Remark 1 Note that we basically didn’t assume anything about {\ell_t} nor about {\psi_t}: the above equality holds even for non-convex losses and regularizers. Yet, solving the minimization problem at each step might be computationally infeasible.

Remark 2 Note that the left hand side of the equality in the theorem does not depend on {\psi_{T+1}}, so if needed we can set it to {\psi_{T}}.
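
Since Lemma 1 is an exact equality, it is easy to check numerically. Here is a small sanity check in Python on one-dimensional quadratic losses with {\psi_t(x)=\frac{t}{2}x^2}; the losses and regularizers are my own arbitrary choices, picked so that the minimizers have a closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
T, u = 20, 0.3
z = rng.normal(size=T + 1)                   # ell_t(x) = (x - z_t)^2 / 2
psi = lambda t, x: t * x * x / 2             # psi_t(x) = t x^2 / 2
ell = lambda t, x: (x - z[t]) ** 2 / 2
F = lambda t, x: psi(t, x) + sum(ell(i, x) for i in range(1, t))

# x_t minimizes F_t: setting F_t'(x) = t*x + sum_{i<t} (x - z_i) = 0 gives
x = {t: sum(z[1:t]) / (2 * t - 1) for t in range(1, T + 2)}

lhs = sum(ell(t, x[t]) - ell(t, u) for t in range(1, T + 1))
rhs = (psi(T + 1, u) - 0.0                   # min_x psi_1(x) = 0
       + sum(F(t, x[t]) - F(t + 1, x[t + 1]) + ell(t, x[t]) for t in range(1, T + 1))
       + F(T + 1, x[T + 1]) - F(T + 1, u))
assert abs(lhs - rhs) < 1e-8                 # the regret equality of Lemma 1
```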

However, while surprising, the above equality is not yet a regret bound: it is “implicit”, in the sense that the losses appear on both sides of the equality.

Let’s take a closer look at the equality. If {{\boldsymbol u} \in \mathop{\mathrm{dom}} F_{T+1}}, the sum of the last two terms on the r.h.s. is non-positive, because {{\boldsymbol x}_{T+1}} minimizes {F_{T+1}}. On the other hand, the first two terms on the r.h.s. are similar to what we got in OMD. The interesting part is the sum of the terms {F_t({\boldsymbol x}_t) - F_{t+1}({\boldsymbol x}_{t+1}) + \ell_t({\boldsymbol x}_t)}. To get an intuition of what is going on, let’s consider the case in which the regularizer is constant over time, i.e., {\psi_t=\psi}. Hence, each term in the sum can be rewritten as

\displaystyle \psi({\boldsymbol x}_t) + \sum_{i=1}^{t} \ell_i({\boldsymbol x}_t) - \left( \psi({\boldsymbol x}_{t+1}) + \sum_{i=1}^{t} \ell_i({\boldsymbol x}_{t+1})\right)~.

Hence, we are measuring the difference in the value of the regularized cumulative losses (up to time {t}) at two consecutive predictions of the algorithm. Roughly speaking, this term will be small if {{\boldsymbol x}_{t}\approx {\boldsymbol x}_{t+1}} and the losses plus regularizer are “nice”. This should remind you exactly of the OMD update, where we constrain {{\boldsymbol x}_{t+1}} to be close to {{\boldsymbol x}_{t}}. Instead, here the two predictions will be close to each other if the minimizer of the regularized losses up to time {t-1} is close to the minimizer of the regularized losses up to time {t}. So, as in OMD, the regularizer here will play the critical role of stabilizing the predictions, when the losses don’t possess enough curvature.

To quantify this intuition, we need a property of strongly convex functions.

2. Convex Analysis Bits: Properties of Strongly Convex Functions

We will use the following lemma for strongly convex functions.

Lemma 2 Let {f:{\mathbb R}^d \rightarrow (-\infty, +\infty]} be {\mu}-strongly convex with respect to a norm {\|\cdot\|}. Then, for all {{\boldsymbol x}, {\boldsymbol y} \in \mathop{\mathrm{dom}} f}, {{\boldsymbol g} \in \partial f({\boldsymbol y})}, and {{\boldsymbol g}' \in \partial f({\boldsymbol x})}, we have

\displaystyle f({\boldsymbol x}) - f({\boldsymbol y}) \leq \langle {\boldsymbol g}, {\boldsymbol x}-{\boldsymbol y}\rangle + \frac{1}{2\mu} \|{\boldsymbol g} - {\boldsymbol g}'\|_*^2~.

Proof: Define {\phi({\boldsymbol z})= f({\boldsymbol z}) -\langle {\boldsymbol g}, {\boldsymbol z}\rangle}. Observe that {\boldsymbol{0} \in \partial \phi({\boldsymbol y})}, hence {{\boldsymbol y}} is the minimizer of {\phi({\boldsymbol z})}. Also, note that {{\boldsymbol g}' - {\boldsymbol g} \in \partial \phi({\boldsymbol x})}. Hence, we can write

\displaystyle \begin{aligned} \phi({\boldsymbol y}) &= \min_{{\boldsymbol z} \in \mathop{\mathrm{dom}} \phi} \ \phi({\boldsymbol z}) \\ &\geq \min_{{\boldsymbol z} \in \mathop{\mathrm{dom}} \phi} \ \left(\phi({\boldsymbol x}) + \langle {\boldsymbol g}' - {\boldsymbol g}, {\boldsymbol z} - {\boldsymbol x}\rangle + \frac{\mu}{2} \| {\boldsymbol z} - {\boldsymbol x} \|^2\right) \\ &\geq \inf_{{\boldsymbol z} \in {\mathbb R}^d} \ \left(\phi({\boldsymbol x}) + \langle {\boldsymbol g}' - {\boldsymbol g}, {\boldsymbol z} - {\boldsymbol x}\rangle + \frac{\mu}{2} \| {\boldsymbol z} - {\boldsymbol x} \|^2\right) \\ &= \phi({\boldsymbol x}) - \frac{1}{2\mu}\|{\boldsymbol g}'-{\boldsymbol g}\|^2_*, \end{aligned}

where in the last step we used the fact that {\inf_{{\boldsymbol z}} \ \langle {\boldsymbol v}, {\boldsymbol z}\rangle + \frac{\mu}{2} \|{\boldsymbol z}\|^2 = -\frac{1}{2\mu}\|{\boldsymbol v}\|_*^2} with {{\boldsymbol v}={\boldsymbol g}'-{\boldsymbol g}}, that is, the expression of the conjugate function of the squared norm (see Example 3 in the lecture on OLO lower bounds). \Box

Corollary 3 Let {f:{\mathbb R}^d \rightarrow (-\infty, +\infty]} be {\mu}-strongly convex with respect to a norm {\|\cdot\|}. Let {{\boldsymbol x}^\star = \mathop{\mathrm{argmin}}_{{\boldsymbol x}} f({\boldsymbol x})}. Then, for all {{\boldsymbol x} \in \mathop{\mathrm{dom}} f} and {{\boldsymbol g} \in \partial f({\boldsymbol x})}, we have

\displaystyle f({\boldsymbol x}) - f({\boldsymbol x}^\star) \leq \frac{1}{2\mu} \|{\boldsymbol g}\|_*^2~.

In words, the above corollary says that the suboptimality gap is upper bounded by a quantity proportional to the squared dual norm of the subgradient.
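
As a quick sanity check (this example is mine, not from the lecture): for {f({\boldsymbol x})=\frac{\mu}{2}\|{\boldsymbol x}\|_2^2} we have {{\boldsymbol x}^\star=\boldsymbol{0}} and {\nabla f({\boldsymbol x})=\mu {\boldsymbol x}}, so that, the Euclidean norm being self-dual,

\displaystyle f({\boldsymbol x}) - f({\boldsymbol x}^\star) = \frac{\mu}{2}\|{\boldsymbol x}\|_2^2 = \frac{\|\mu {\boldsymbol x}\|_2^2}{2\mu},

that is, Corollary 3 holds with equality.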

3. An Explicit Regret Bound using Strongly Convex Regularizers

We now state a lemma quantifying the intuition on the “stability” of the predictions.

Lemma 4 With the notation and assumptions of Lemma 1, assume that {F_t} is proper and {\lambda_t}-strongly convex w.r.t. {\|\cdot\|}, and {\ell_t} proper and convex. Also, assume that {\partial \ell_t({\boldsymbol x}_t)} is non-empty. Then, we have

\displaystyle F_t({\boldsymbol x}_t) - F_{t+1}({\boldsymbol x}_{t+1}) + \ell_t({\boldsymbol x}_t) \leq \frac{\|{\boldsymbol g}_t\|_*^2}{2\lambda_t} + \psi_t({\boldsymbol x}_{t+1}) - \psi_{t+1}({\boldsymbol x}_{t+1}),

for all {{\boldsymbol g}_t \in \partial \ell_t({\boldsymbol x}_t)}.

Proof: We have

\displaystyle \begin{aligned} F_t({\boldsymbol x}_t) - F_{t+1}({\boldsymbol x}_{t+1}) + \ell_t({\boldsymbol x}_t) &= (F_t({\boldsymbol x}_t) + \ell_t({\boldsymbol x}_t)) - (F_{t}({\boldsymbol x}_{t+1}) + \ell_t({\boldsymbol x}_{t+1})) + \psi_t({\boldsymbol x}_{t+1}) - \psi_{t+1}({\boldsymbol x}_{t+1}) \\ &\leq (F_t({\boldsymbol x}_t) + \ell_t({\boldsymbol x}_t)) - (F_{t}({\boldsymbol x}_t^\star) + \ell_t({\boldsymbol x}_t^\star)) + \psi_t({\boldsymbol x}_{t+1}) - \psi_{t+1}({\boldsymbol x}_{t+1}) \\ &\leq \frac{\|{\boldsymbol g}'_t\|_*^2}{2\lambda_t} + \psi_t({\boldsymbol x}_{t+1}) - \psi_{t+1}({\boldsymbol x}_{t+1}), \end{aligned}

where in the second inequality we used Corollary 3, the fact that {{\boldsymbol x}_t^\star = \mathop{\mathrm{argmin}}_{{\boldsymbol x}} \ F_{t}({\boldsymbol x}) + \ell_t({\boldsymbol x})}, and {{\boldsymbol g}'_t \in \partial (F_t + \ell_t)({\boldsymbol x}_t)}. Observing that {{\boldsymbol x}_t = \mathop{\mathrm{argmin}}_{{\boldsymbol x}} \ F_t({\boldsymbol x})}, we have {\boldsymbol{0} \in \partial F_t({\boldsymbol x}_t)}. Hence, using the theorem on the subdifferential of a sum of functions, we can choose {{\boldsymbol g}'_t} such that {{\boldsymbol g}'_t \in \partial \ell_t({\boldsymbol x}_t)}. \Box

Let’s now see some immediate applications of FTRL.

Corollary 5 Let {\ell_t} be a sequence of convex loss functions. Let {\Psi: V \rightarrow {\mathbb R}} be a {\mu}-strongly convex function w.r.t. {\|\cdot\|}. Set the sequence of regularizers as {\psi_t({\boldsymbol x})=\frac{1}{\eta_{t-1}} (\Psi({\boldsymbol x})-\min_{{\boldsymbol z}} \Psi({\boldsymbol z}))}, where {\eta_{t+1} \leq \eta_t, \ t=1, \dots, T}. Then, FTRL guarantees

\displaystyle \sum_{t=1}^T \ell_t({\boldsymbol x}_t) - \sum_{t=1}^T \ell_t({\boldsymbol u}) \leq \frac{\Psi({\boldsymbol u}) - \min_{{\boldsymbol x}} \Psi({\boldsymbol x})}{\eta_{T-1}} + \frac{1}{2 \mu} \sum_{t=1}^T \eta_{t-1} \|{\boldsymbol g}_t\|^2_\star,

for all {{\boldsymbol g}_t \in \partial \ell_t({\boldsymbol x}_t)}. Moreover, if the functions {\ell_t} are {L}-Lipschitz, setting {\eta_{t-1}=\frac{\alpha \sqrt{\mu}}{L\sqrt{t}}} we get

\displaystyle \sum_{t=1}^T \ell_t({\boldsymbol x}_t) - \sum_{t=1}^T \ell_t({\boldsymbol u}) \leq \left(\frac{\Psi({\boldsymbol u}) - \min_{{\boldsymbol x}} \Psi({\boldsymbol x})}{\alpha} +\alpha \right) \frac{L \sqrt{T}}{\sqrt{\mu}}~.

Proof: The corollary is immediate from Lemma 1, Lemma 4, and the observation that from the assumptions we have {\psi_t({\boldsymbol x}) - \psi_{t+1}({\boldsymbol x})\geq0, \ \forall {\boldsymbol x}}. We also set {\psi_{T+1}=\psi_T}, thanks to Remark 2. \Box
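
To see where the tuned bound comes from: with {\|{\boldsymbol g}_t\|_* \leq L} and {\eta_{t-1}=\frac{\alpha \sqrt{\mu}}{L\sqrt{t}}}, the first bound becomes

\displaystyle \frac{\Psi({\boldsymbol u}) - \min_{{\boldsymbol x}} \Psi({\boldsymbol x})}{\eta_{T-1}} + \frac{1}{2 \mu} \sum_{t=1}^T \eta_{t-1} \|{\boldsymbol g}_t\|^2_\star \leq \frac{(\Psi({\boldsymbol u}) - \min_{{\boldsymbol x}} \Psi({\boldsymbol x})) L \sqrt{T}}{\alpha \sqrt{\mu}} + \frac{\alpha L}{2\sqrt{\mu}} \sum_{t=1}^T \frac{1}{\sqrt{t}},

and the stated bound follows from {\sum_{t=1}^T \frac{1}{\sqrt{t}} \leq 2\sqrt{T}}.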

This might look like the same regret guarantee of OMD; however, here there is a very important difference: the last term contains a time-varying element ({\eta_t}), but the domain does not have to be bounded! Also, I used the regularizer {\frac{1}{\eta_{t-1}} \Psi({\boldsymbol x})} and not {\frac{1}{\eta_{t}} \Psi({\boldsymbol x})} to remind you of another important difference: in OMD the learning rate {\eta_t} is chosen after receiving the subgradient {{\boldsymbol g}_t}, while here you have to choose it before receiving it!

Another important difference is that here the update rule seems way more expensive than in OMD, because we need to solve an optimization problem at each step. However, it turns out that we can use FTRL on linearized losses and obtain the same bound with the same computational complexity as OMD.

4. FTRL with Linearized Losses

If we consider the case in which the losses are linear, we have that the prediction of FTRL is

\displaystyle {\boldsymbol x}_{t+1} \in \mathop{\mathrm{argmin}}_{{\boldsymbol x}} \ \psi_{t+1}({\boldsymbol x}) + \sum_{i=1}^{t} \langle {\boldsymbol g}_i, {\boldsymbol x}\rangle = \mathop{\mathrm{argmax}}_{{\boldsymbol x}} \ \left\langle -\sum_{i=1}^{t} {\boldsymbol g}_i, {\boldsymbol x}\right\rangle -\psi_{t+1}({\boldsymbol x})~.

Now, if we assume {\psi_{t+1}} to be proper, convex, and closed, using Theorem 2 in the lecture on OLO lower bounds, we have that {{\boldsymbol x}_{t+1} \in \partial \psi_{t+1}^\star(-\sum_{i=1}^{t} {\boldsymbol g}_i)}. Moreover, if {\psi_{t+1}} is strongly convex, we know that {\psi_{t+1}^\star} is differentiable and we get

\displaystyle \label{eq:ftrl_update_linear} {\boldsymbol x}_{t+1} = \nabla \psi_{t+1}^\star\left(-\sum_{i=1}^{t} {\boldsymbol g}_i\right)~. \ \ \ \ \ (1)

In turn, this update can be written in the following way

\displaystyle \begin{aligned} {\boldsymbol \theta}_{t+1} &= {\boldsymbol \theta}_{t} - {\boldsymbol g}_{t}, \\ {\boldsymbol x}_{t+1} &= \nabla \psi_{t+1}^\star ({\boldsymbol \theta}_{t+1})~. \end{aligned}

This corresponds to Figure 1.

Figure 1. Dual mapping for FTRL with linear losses.
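
In code, the linearized FTRL loop touches the primal variable only to predict: everything else happens in the dual space. Here is a minimal sketch, where `grad_psi_star(t, theta)` stands for an assumed implementation of {\nabla \psi_{t}^\star} and `get_subgradient` for the oracle returning {{\boldsymbol g}_t}; both names are mine (Example 1 below gives a concrete instance).

```python
import numpy as np

def ftrl_linearized(grad_psi_star, get_subgradient, d, T):
    """FTRL with linear losses: the state theta_t = -sum of the past subgradients."""
    theta = np.zeros(d)
    for t in range(1, T + 1):
        x = grad_psi_star(t, theta)   # predict x_t = grad psi_t^*(theta_t)
        g = get_subgradient(t, x)     # receive the loss, compute g_t
        theta -= g                    # dual update: theta_{t+1} = theta_t - g_t
    return x
```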

Compare it to the mirror update of OMD, rewritten in a similar way:

\displaystyle \begin{aligned} {\boldsymbol \theta}_{t+1} &= \nabla \psi ({\boldsymbol x}_{t}) - \eta_t {\boldsymbol g}_{t}, \\ {\boldsymbol x}_{t+1} &= \nabla \psi^\star ({\boldsymbol \theta}_{t+1})~. \end{aligned}

They are very similar, but with important differences:

  • In OMD, the state is kept in {{\boldsymbol x}_t}, so we need to transform it into a dual variable before making the update, and then back to the primal variable.
  • In FTRL with linear losses, the state is kept directly in the dual space, updated, and then transformed into the primal variable. The primal variable is only used to predict, but not directly in the update.
  • In OMD, the subgradients are weighted by the learning rates {\eta_t}, which are typically decreasing.
  • In FTRL with linear losses, all the subgradients have the same weight, but the regularizer is typically increasing over time.

Also, we will not lose anything in the bound! Indeed, we can run FTRL on the linearized losses {\tilde{\ell}_t({\boldsymbol x})=\langle {\boldsymbol g}_t, {\boldsymbol x}\rangle}, where {{\boldsymbol g}_t \in \partial \ell_t({\boldsymbol x}_t)}, guaranteeing exactly the same regret on the losses {\ell_t}. The pseudo-code of this procedure is in Algorithm 2.


Algorithm 2. FTRL with Linearized Losses.

In fact, using the definition of the subgradients and the assumptions of Corollary 5, we have

\displaystyle \text{Regret}_T({\boldsymbol u}) = \sum_{t=1}^T (\ell_t({\boldsymbol x}_t) -\ell_t({\boldsymbol u})) \leq \sum_{t=1}^T (\tilde{\ell}_t({\boldsymbol x}_t) - \tilde{\ell}_t({\boldsymbol u})) \leq \frac{\Psi({\boldsymbol u}) - \min_{{\boldsymbol x}} \Psi({\boldsymbol x})}{\eta_{T-1}} + \frac{1}{2 \mu} \sum_{t=1}^T \eta_{t-1} \|{\boldsymbol g}_t\|^2_\star, \ \forall {\boldsymbol u}~.
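
Explicitly, the first inequality is just the definition of subgradient applied term by term: {\ell_t({\boldsymbol x}_t) - \ell_t({\boldsymbol u}) \leq \langle {\boldsymbol g}_t, {\boldsymbol x}_t - {\boldsymbol u}\rangle = \tilde{\ell}_t({\boldsymbol x}_t) - \tilde{\ell}_t({\boldsymbol u})}.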

The only difference with respect to Corollary 5 is that here the {{\boldsymbol g}_t} are the specific ones we use in the algorithm, while in Corollary 5 the statement holds for any choice of the {{\boldsymbol g}_t \in \partial \ell_t({\boldsymbol x}_t)}.

In the next example, we can see the different behavior of FTRL and OMD.

Example 1 Consider {V=\{{\boldsymbol x} \in {\mathbb R}^d: \|{\boldsymbol x}\|_2\leq 1\}}. With Online Subgradient Descent (OSD) with learning rate {\eta_t=\frac{1}{\sqrt{t}}} and {{\boldsymbol x}_1=\boldsymbol{0}}, the update is

\displaystyle \begin{aligned} \tilde{{\boldsymbol x}}_{t+1} &= {\boldsymbol x}_t -\frac{1}{\sqrt{t}} {\boldsymbol g}_t, \\ {\boldsymbol x}_{t+1} &= \tilde{{\boldsymbol x}}_{t+1} \, \min\left(\frac{1}{\|\tilde{{\boldsymbol x}}_{t+1}\|_2},1\right)~. \end{aligned}

On the other hand in FTRL with linearized losses, we can use {\psi_t({\boldsymbol x}) = \frac{\sqrt{t}}{2}\|{\boldsymbol x}\|_2^2 + i_V({\boldsymbol x})} and it is easy to verify that the update in (1) becomes

\displaystyle \begin{aligned} \tilde{{\boldsymbol x}}_{t+1} &= \frac{-\sum_{i=1}^t {\boldsymbol g}_i}{\sqrt{t+1}},\\ {\boldsymbol x}_{t+1} &= \tilde{{\boldsymbol x}}_{t+1} \, \min\left(\frac{1}{\|\tilde{{\boldsymbol x}}_{t+1}\|_2},1\right)~. \end{aligned}

While the regret guarantee is the same for these two updates, from an intuitive point of view OMD seems to be losing a lot of potential information due to the projection and to the fact that we only memorize the projected iterate.
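
Here is Example 1 in code, a minimal sketch with synthetic linear losses (the random {{\boldsymbol g}_t} are my own choice, just to run the two updates side by side); since the losses are linear, the subgradients do not depend on the iterate, so we can feed the same {{\boldsymbol g}_t} to both algorithms.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 2, 100
proj = lambda x: x / max(np.linalg.norm(x), 1.0)   # Euclidean projection onto the ball

x_osd = np.zeros(d)   # OSD only remembers the projected iterate
theta = np.zeros(d)   # FTRL remembers the (unprojected) dual state
for t in range(1, T + 1):
    g = rng.normal(size=d)                         # g_t of the linear loss
    x_osd = proj(x_osd - g / np.sqrt(t))           # OSD with eta_t = 1/sqrt(t)
    theta -= g
    x_ftrl = proj(theta / np.sqrt(t + 1))          # FTRL update in (1)
```

Note how `theta` is never projected: the projection only happens when producing the prediction.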

Next time, we will see how to obtain logarithmic regret bounds for strongly convex losses for FTRL and more applications.

5. History Bits

Follow the Regularized Leader was introduced in (Abernethy, J. D. and Hazan, E. and Rakhlin, A., 2008), where at each step the prediction is computed as the minimizer of a regularization term plus the sum of the losses on all past rounds. However, the key ideas of FTRL, and in particular its analysis through the dual, were planted by Shai Shalev-Shwartz and Yoram Singer way before (Shalev-Shwartz, S. and Singer, Y., 2006)(Shalev-Shwartz, S. and Singer, Y., 2007). Later, the PhD thesis of Shai Shalev-Shwartz (S. Shalev-Shwartz, 2007) contained the most precise dual analysis of FTRL, but he called it “online mirror descent”, because the name FTRL was only invented later! Even later, I contributed to the confusion by naming a general analysis of FTRL with time-varying regularizers and linear losses “generalized online mirror descent” (F. Orabona and K. Crammer and N. Cesa-Bianchi, 2015). So, now I am trying to set the record straight 🙂

Later, the optimization community rediscovered FTRL with linear losses and called it Dual Averaging (Nesterov, Y., 2009), even if Nesterov had used similar ideas already in 2005 (Nesterov, Y., 2005). It is interesting to note that Nesterov introduced the Dual Averaging algorithm to fix the fact that in OMD the gradients enter the algorithm with decreasing weights, contradicting the common-sense understanding of how optimization should work. The same ideas were then translated to online learning and stochastic optimization in (L. Xiao, 2010), essentially rediscovering the framework of Shalev-Shwartz and rebranding it Regularized Dual Averaging (RDA). Finally, (McMahan, H B., 2017) gives the elegant equality result that I presented here (with minor improvements), which holds for general loss functions and regularizers. Note that the dual interpretation of FTRL comes out naturally for linear losses, but Lemma 1 underlines the fact that the algorithm is actually more general.

Another source of confusion stems from the fact that some people differentiate between a “lazy” and a “greedy” version of OMD. In reality, as proved in (McMahan, H B., 2017), the lazy algorithm is just FTRL with linearized losses and the greedy one is just OMD. The name “lazy online mirror descent” was introduced in (Zinkevich, M., 2004), which basically introduced FTRL with linearized losses for the first time.

6. Exercises

Exercise 1 Prove that the update of FTRL with linearized loss in Example 1 is correct.

Exercise 2 Find a way to have {L^\star} bounds for smooth losses with linearized FTRL: Do you need an additional assumption compared to what we did for OSD?

7. Appendix

Proof of Lemma 1: Define {f_0({\boldsymbol x})=\psi_1({\boldsymbol x})} and {f_t({\boldsymbol x})=\ell_t({\boldsymbol x})-\psi_t({\boldsymbol x})+\psi_{t+1}({\boldsymbol x})} for {t\geq1}. Hence, the regularizers telescope and we have that {\sum_{t=0}^T f_t({\boldsymbol x})=\psi_{T+1}({\boldsymbol x}) + \sum_{t=1}^T \ell_t({\boldsymbol x}) = F_{T+1}({\boldsymbol x})}. Now, consider

\displaystyle \begin{aligned} \sum_{t=1}^T f_t({\boldsymbol x}_t) - \sum_{t=0}^T f_t({\boldsymbol x}_{T+1}) &= \sum_{t=1}^T \overbrace{(F_{t+1}({\boldsymbol x}_t) - F_t({\boldsymbol x}_t))}^{f_t({\boldsymbol x}_t)} - F_{T+1}({\boldsymbol x}_{T+1}) \\ &= \sum_{t=1}^T F_{t+1}({\boldsymbol x}_t) - \sum_{t=0}^{T-1} F_{t+1}({\boldsymbol x}_{t+1}) - F_{T+1}({\boldsymbol x}_{T+1}) \\ &= \sum_{t=1}^T (F_{t+1}({\boldsymbol x}_t) - F_{t+1}({\boldsymbol x}_{t+1})) - \psi_1({\boldsymbol x}_1)~. & (2) \\\end{aligned}

We also have

\displaystyle \begin{aligned} \sum_{t=1}^T f_t({\boldsymbol x}_t) - \sum_{t=0}^T f_t({\boldsymbol u}) &= \sum_{t=1}^T (\ell_t({\boldsymbol x}_t) - \psi_t({\boldsymbol x}_t)+\psi_{t+1}({\boldsymbol x}_t)) - \sum_{t=1}^T \ell_t({\boldsymbol u}) - \psi_{T+1}({\boldsymbol u})~. \end{aligned}

Hence, putting these two equalities together, we get

\displaystyle \begin{aligned} \sum_{t=1}^T &(\ell_t({\boldsymbol x}_t) - \psi_t({\boldsymbol x}_t)+\psi_{t+1}({\boldsymbol x}_t)) - \sum_{t=1}^T \ell_t({\boldsymbol u}) - \psi_{T+1}({\boldsymbol u}) \\ &= \sum_{t=1}^T f_t({\boldsymbol x}_t) - \sum_{t=0}^T f_t({\boldsymbol u}) \\ &= \sum_{t=1}^T f_t({\boldsymbol x}_t) - \sum_{t=0}^T f_t({\boldsymbol x}_{T+1}) + \sum_{t=0}^T f_t({\boldsymbol x}_{T+1}) - \sum_{t=0}^T f_t({\boldsymbol u}) \\ &\stackrel{(2)}{=} \sum_{t=1}^T (F_{t+1}({\boldsymbol x}_t) - F_{t+1}({\boldsymbol x}_{t+1})) - \psi_1({\boldsymbol x}_1) - \sum_{t=0}^T f_t({\boldsymbol u}) + \sum_{t=0}^T f_t({\boldsymbol x}_{T+1}) \\ &= \sum_{t=1}^T (F_{t+1}({\boldsymbol x}_t) - F_{t+1}({\boldsymbol x}_{t+1})) - \psi_1({\boldsymbol x}_1) - F_{T+1}({\boldsymbol u}) + F_{T+1}({\boldsymbol x}_{T+1})~. \end{aligned}

Observing that

\displaystyle \begin{aligned} \psi_{t}({\boldsymbol x}_t)-\psi_{t+1}({\boldsymbol x}_{t}) + F_{t+1}({\boldsymbol x}_t) &= \psi_{t}({\boldsymbol x}_t)-\psi_{t+1}({\boldsymbol x}_{t}) + \sum_{i=1}^{t} \ell_i({\boldsymbol x}_t) + \psi_{t+1}({\boldsymbol x}_t) \\ &= \psi_{t}({\boldsymbol x}_t) + \sum_{i=1}^{t-1} \ell_i({\boldsymbol x}_t) + \ell_t({\boldsymbol x}_t) \\ &= F_{t}({\boldsymbol x}_t)+ \ell_t({\boldsymbol x}_t), \end{aligned}

and that {\psi_1({\boldsymbol x}_1)=\min_{{\boldsymbol x}} \ \psi_1({\boldsymbol x})}, and reordering the terms, we get the stated equality

\displaystyle \sum_{t=1}^T ( \ell_t({\boldsymbol x}_t) - \ell_t({\boldsymbol u}) ) = \psi_{T+1}({\boldsymbol u}) - \min_{{\boldsymbol x}} \ \psi_1({\boldsymbol x}) + \sum_{t=1}^T [F_t({\boldsymbol x}_t) - F_{t+1}({\boldsymbol x}_{t+1}) + \ell_t({\boldsymbol x}_t)] + F_{T+1}({\boldsymbol x}_{T+1}) - F_{T+1}({\boldsymbol u})~.

\Box
