Optimistic Follow-The-Regularized-Leader

This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week.

You can find the lectures I published till now here.


Throughout this class, we considered the adversarial model as our model of the environment. This allowed us to design algorithm that work in this setting, as well as in other more benign settings. However, the world is never completely adversarial. So, we might be tempted to model the environment in some way, but that would leave our algorithm vulnerable to attacks. An alternative, is to consider the data as generated by some predictable process plus adversarial noise. In this view, it might be beneficial to try to model the predictable part, without compromising the robustness to the adversarial noise.

In this class, we will explore this possibility through a particular version of Follow-The-Regularized-Leader (FTRL), where we predict the next loss. In very intuitive terms, if our predicted loss is correct, we can expect the regret to decrease. However, if our prediction is wrong we still want to recover the worst case guarantee. Such algorithm is called Optimistic FTRL.


ftrl_optimistic

The core idea of Optimistic FTRL is to predict the next loss and use it in the update rule, as summarized in Algorithm 1. Note that for the sake of the analysis, it does not matter how the prediction is generated. It can be even generated by another online learning procedure!

Let’s see why this is a good idea. Remember that FTRL simply predicts with the minimizer of the previous losses plus a time-varying regularizer. Let’s assume for a moment that instead we have the gift of predicting the future, so we do know the next loss ahead of time. Then, we could predict with its minimizer and suffer a negative regret. However, probably our foresight abilities are not so powerful, so our prediction of the next loss might be inaccurate. In this case, a better idea might be just to add our predicted loss to the previous ones and minimize the regularized sum. We would expect the regret guarantee to improve if our prediction of the future loss is precise. At the same time, if the prediction is wrong, we expect its influence to be limited, given that we use it together with all the past losses.

All these intuitions can be formalized in the following Theorem.

Theorem 1 With the notation in Algorithm 1, let {V\subseteq {\mathbb R}^d} be convex, closed, and non-empty. Denote by {F_t({\boldsymbol x}) = \psi_{t}({\boldsymbol x}) + \sum_{i=1}^{t-1} \ell_i({\boldsymbol x})}. Assume for {t=1, \dots, T} that {F_t} is proper and {\lambda_t}-strongly convex w.r.t. {\|\cdot\|}, {\ell_t} and {\tilde{\ell_t}} proper and convex, and {\mathop{\mathrm{int}} \mathop{\mathrm{dom}} F_t \cap V\neq \{\}}. Also, assume that {\partial \ell_t({\boldsymbol x}_t)} and {\partial \tilde{\ell}_t({\boldsymbol x}_t)} are non-empty. Then, there exists {\tilde{{\boldsymbol g}}_t \in \partial \tilde{\ell}_t({\boldsymbol x}_t)} for {t=1, \dots, T}, such that we have

\displaystyle \begin{aligned} \sum_{t=1}^T &\ell({\boldsymbol x}_t) - \sum_{t=1}^T \ell_t({\boldsymbol u})\\ &\leq \psi_{T+1}({\boldsymbol u}) - \psi_{1}({\boldsymbol x}_1) + \sum_{t=1}^T \left[\langle {\boldsymbol g}_t-\tilde{{\boldsymbol g}}_t,{\boldsymbol x}_t-{\boldsymbol x}_{t+1}\rangle -\frac{\lambda_t}{2} \|{\boldsymbol x}_t-{\boldsymbol x}_{t+1}\|^2 +\psi_t({\boldsymbol x}_{t+1}) - \psi_{t+1}({\boldsymbol x}_{t+1})\right] \\ &\leq \psi_{T+1}({\boldsymbol u}) - \psi_{1}({\boldsymbol x}_1) + \sum_{t=1}^T \left[\frac{\| {\boldsymbol g}_t-\tilde{{\boldsymbol g}}_t\|_\star^2}{2\lambda_t} +\psi_t({\boldsymbol x}_{t+1}) - \psi_{t+1}({\boldsymbol x}_{t+1})\right], \end{aligned}

for all {{\boldsymbol g}_t \in \partial \ell_t({\boldsymbol x}_t)}.

Proof: We can interpret the Optimistic-FTRL as FTRL with a regularizer {\tilde{\psi}_t({\boldsymbol x})=\psi_t({\boldsymbol x})+\ell_t({\boldsymbol x})}. Also, note that {\tilde{\ell}_{T+1}({\boldsymbol x})} has no influence on the algorithm, so we can set it to the null function.

Hence, from the equality for FTRL, we immediately get

\displaystyle \begin{aligned} \sum_{t=1}^T &\ell({\boldsymbol x}_t) - \sum_{t=1}^T \ell_t({\boldsymbol u}) \\ &\leq \tilde{\ell}_{T+1}({\boldsymbol u}) + \psi_{T+1}({\boldsymbol u}) - \min_{{\boldsymbol x} \in V} \ (\tilde{\ell}_1({\boldsymbol x})+\psi_{1}({\boldsymbol x})) + \sum_{t=1}^T [F_t({\boldsymbol x}_t) - F_{t+1}({\boldsymbol x}_{t+1}) + \ell_t({\boldsymbol x}_t) +\tilde{\ell}_t({\boldsymbol x}_t)-\tilde{\ell}_{t+1}({\boldsymbol x}_{t+1})] \\ &= \psi_{T+1}({\boldsymbol u}) - \psi_{1}({\boldsymbol x}_1) + \sum_{t=1}^T [F_t({\boldsymbol x}_t) - F_{t+1}({\boldsymbol x}_{t+1}) + \ell_t({\boldsymbol x}_t)]~. \end{aligned}

Now focus on the terms {F_t({\boldsymbol x}_t) - F_{t+1}({\boldsymbol x}_{t+1}) + \ell_t({\boldsymbol x}_t)}. Observe that {F_t({\boldsymbol x})+\ell_t({\boldsymbol x})+i_V({\boldsymbol x})} is {\lambda_t}-strongly convex w.r.t. {\|\cdot\|}, hence we have

\displaystyle \begin{aligned} F_t({\boldsymbol x}_t) - F_{t+1}({\boldsymbol x}_{t+1}) + \ell_t({\boldsymbol x}_t) &= (F_t({\boldsymbol x}_t) + \ell_t({\boldsymbol x}_t)) - (F_{t}({\boldsymbol x}_{t+1}) + \ell_t({\boldsymbol x}_{t+1})) + \psi_t({\boldsymbol x}_{t+1}) - \psi_{t+1}({\boldsymbol x}_{t+1}) \\ &\leq \langle {\boldsymbol g}'_t,{\boldsymbol x}_t-{\boldsymbol x}_{t+1}\rangle -\frac{\lambda_t}{2} \|{\boldsymbol x}_t-{\boldsymbol x}_{t+1}\|^2 +\psi_t({\boldsymbol x}_{t+1}) - \psi_{t+1}({\boldsymbol x}_{t+1}), \end{aligned}

where {{\boldsymbol g}'_t \in \partial (F_t({\boldsymbol x}_t) + \ell_t({\boldsymbol x}_t) + i_V({\boldsymbol x}_t))}. Observing that {{\boldsymbol x}_t = \mathop{\mathrm{argmin}}_{{\boldsymbol x} \in V} F_t({\boldsymbol x}) + \tilde{\ell}_t({\boldsymbol x})}, we have {\boldsymbol{0} \in \partial (F_t({\boldsymbol x}_t) + \tilde{\ell}_t({\boldsymbol x}_t) + i_V({\boldsymbol x}_t)}. Hence, given that our assumptions guarantee that the subdifferential of the sum is equal to the sum of the subdifferentials, there exists {\tilde{{\boldsymbol g}}_t\in \partial \tilde{\ell}_t({\boldsymbol x}_t)} such that {{\boldsymbol g}'_t = {\boldsymbol g}_t -\tilde{{\boldsymbol g}}_t}. So, we have

\displaystyle F_t({\boldsymbol x}_t) - F_{t+1}({\boldsymbol x}_{t+1}) + \ell_t({\boldsymbol x}_t) \leq \langle {\boldsymbol g}_t-\tilde{{\boldsymbol g}}_t,{\boldsymbol x}_t-{\boldsymbol x}_{t+1}\rangle -\frac{\lambda_t}{2} \|{\boldsymbol x}_t-{\boldsymbol x}_{t+1}\|^2 + \psi_t({\boldsymbol x}_{t+1}) - \psi_{t+1}({\boldsymbol x}_{t+1})~.

By the definition of dual norms, we also have that

\displaystyle \langle {\boldsymbol g}_t-\tilde{{\boldsymbol g}}_t,{\boldsymbol x}_t-{\boldsymbol x}_{t+1}\rangle -\frac{\lambda_t}{2} \|{\boldsymbol x}_t-{\boldsymbol x}_{t+1}\|^2 \leq \| {\boldsymbol g}_t-\tilde{{\boldsymbol g}}_t\|_\star \|{\boldsymbol x}_t-{\boldsymbol x}_{t+1}\| -\frac{\lambda_t}{2} \|{\boldsymbol x}_t-{\boldsymbol x}_{t+1}\|^2 \leq \frac{\lambda_t}{2}\|{\boldsymbol g}_t-\tilde{{\boldsymbol g}}_t\|^2_\star~.

\Box

Let’s take a look at the second bound in the theorem. Compared to the similar bound for FTRL, we now have the terms {\|{\boldsymbol g}_t-\tilde{{\boldsymbol g}}_t\|_\star^2} instead of the ones {\|{\boldsymbol g}_t\|_\star^2}. So, if the prediction of the next loss is good, that term can become smaller and possibly even zero! On the other hand, if the predictions are bad, for Lipschitz losses we only lose a constant factor. Overall, in the best case we can gain a lot, in the worst case we don’t lose that much.

Despite the simplicity of the algorithm and its analysis, there are many applications of this principle. We will only describe a couple of them. Recently, this idea was used even to recover the Nesterov’s acceleration algorithm and to prove faster convergence in repeated games.

1. Regret that Depends on the Variance of the Subgradients

Consider of running Optimistic-FTRL on the linearized losses {\ell_t({\boldsymbol x})=\langle {\boldsymbol g}_t, {\boldsymbol x}\rangle}. We can gain something out of the Optimistic-FTRL compared to plain FTRL if we are able to predict the next {{\boldsymbol g}_t}. A simple possibility is to predict the average of the past values, {\bar{{\boldsymbol g}}_t=\frac{1}{t-1}\sum_{i=1}^{t-1} {\boldsymbol g}_i}. Indeed, from the first lecture, we know that such strategy is itself an online learning procedure! In particular, it corresponds to a Follow-The-Leader algorithm on the losses {\ell_t({\boldsymbol x})=\|{\boldsymbol x}-{\boldsymbol g}_t\|_2^2}. Hence, from the strong convexity of this losses, we know that

\displaystyle \sum_{t=1}^T \|\bar{{\boldsymbol x}}_t - {\boldsymbol g}_t\|^2_2 - \sum_{t=1}^T \|{\boldsymbol u} - {\boldsymbol g}_t\|^2_2 \leq 4(1+\ln T), \ \forall {\boldsymbol u} \in R^d~.

This implies

\displaystyle \sum_{t=1}^T \|\bar{{\boldsymbol x}}_t - {\boldsymbol g}_t\|^2_2 \leq 4(1 + \ln T) + \min_{{\boldsymbol u}} \sum_{t=1}^T \|{\boldsymbol u} - {\boldsymbol g}_t\|^2_2~.

It is immediate to see that the minimizer is {{\boldsymbol u}=\frac{1}{T}\sum_{t=1}^T {\boldsymbol g}_t}, that results in {T} times the empirical variance of the subgradients. Plugging it in the Optimistic-FTRL regret, with {\psi_t=\psi}, we have

\displaystyle \sum_{t=1}^T \ell({\boldsymbol x}_t) - \sum_{t=1}^T \ell_t({\boldsymbol u}) \leq \psi({\boldsymbol u}) - \psi({\boldsymbol x}_1) + 4+4\ln T+\min_{{\boldsymbol g}} \sum_{t=1}^T \frac{\|{\boldsymbol g}_t-{\boldsymbol g}\|_*^2}{2\lambda}~.

Remark 1 Instead of using the mean of the past subgradients, we could use any other strategy or even a mix of different strategies. For example, assuming the subgradients bounded, we could use an algorithm to solve the Learning with Expert problem, where each expert is a strategy. Then, we would obtain a bound that depends on the predictions of the best strategy, plus the regret of the expert algorithm.

2. Online Convex Optimization with Gradual Variations

In this section, we consider the case that the losses we receive have small variations over time. We will show that in this case it is possible to get constant regret in the case that the losses are equal.

In this case, the simple strategy we can use to predict the next subgradient is to use the previous one, that is {\tilde{\ell}_t({\boldsymbol x})=\langle {\boldsymbol g}_{t-1}, {\boldsymbol x}\rangle} for {t\geq2} and {\tilde{\ell}_1({\boldsymbol x})=0}.

Corollary 2 Under the assumptions of Theorem 1, define {\tilde{\ell}_t({\boldsymbol x})=\langle {\boldsymbol g}_{t-1}, {\boldsymbol x}\rangle} for {t\geq2} and {\tilde{\ell}_1({\boldsymbol x})=0}. Set {\psi_t({\boldsymbol x})=\lambda_t \psi({\boldsymbol x})} where {\psi} is 1-strongly convex w.r.t. {\|\cdot\|} and {\lambda_t} satisfies {\lambda_t \lambda_{t-1}\geq 8M^2} for {t=2, \dots, T}, where {M} is the smoothness constant of the losses {\ell_t}. Then, {\forall {\boldsymbol u} \in V}, we have

\displaystyle \text{Regret}_T({\boldsymbol u}) = \sum_{t=1}^T \langle {\boldsymbol g}_t, {\boldsymbol x}_t\rangle - \sum_{t=1}^T \langle {\boldsymbol g}_t, {\boldsymbol u}\rangle \leq \lambda_t \psi({\boldsymbol u}) - \lambda_1 \psi({\boldsymbol x}_1) + \frac{1}{\lambda_1} \|\nabla \ell_1({\boldsymbol x}_1)\|_\star^2 + \sum_{t=2}^T \frac{2}{\lambda_t} \|\nabla \ell_t({\boldsymbol x}_{t-1})-\nabla \ell_{t-1}({\boldsymbol x}_{t-1})\|^2_\star~.

Moreover, assuming {\|\nabla \ell_t({\boldsymbol x})\|_\star \leq L} for all {{\boldsymbol x} \in V}, setting {\lambda_t = \sqrt{\max(8M^2 ,4L^2) + \sum_{i=2}^{t-1} \|\nabla \ell_t({\boldsymbol x}_{t-1})-\nabla \ell_{t-1}({\boldsymbol x}_{t-1})\|^2_\star}}, we have

\displaystyle \text{Regret}_T({\boldsymbol u}) \leq \left(\psi({\boldsymbol u}) - \min_{{\boldsymbol x}} \psi({\boldsymbol x})+ 4\right)\sqrt{\max(8M^2 ,4L^2) + \sum_{t=2}^T \|\nabla \ell_t({\boldsymbol x}_{t-1})-\nabla \ell_{t-1}({\boldsymbol x}_{t-1})\|^2_\star} + \frac{1}{4}~.

Proof: From the Optimistic-FTRL bound with a fixed regularizer, we immediately get

\displaystyle \sum_{t=1}^T \langle{\boldsymbol g}_t,{\boldsymbol x}_t\rangle - \sum_{t=1}^T \langle {\boldsymbol g}_t, {\boldsymbol u}\rangle \leq \lambda_{T}\psi({\boldsymbol u}) - \lambda_1 \psi({\boldsymbol x}_1) + \sum_{t=1}^T \left[\langle {\boldsymbol g}_t-\tilde{{\boldsymbol g}}_t,{\boldsymbol x}_t-{\boldsymbol x}_{t+1}\rangle -\frac{\lambda_t}{2} \|{\boldsymbol x}_t-{\boldsymbol x}_{t+1}\|^2 \right]~.

Now, consider the case that the losses {\ell_t} are {M}-smooth. So, for any {\alpha_t>0}, we have

\displaystyle \begin{aligned} \langle {\boldsymbol g}_t-\tilde{{\boldsymbol g}}_t,{\boldsymbol x}_t-{\boldsymbol x}_{t+1}\rangle -\frac{\lambda_t}{2} \|{\boldsymbol x}_t-{\boldsymbol x}_{t+1}\|^2 &= \langle \nabla \ell_t({\boldsymbol x}_t)-\nabla \ell_{t-1}({\boldsymbol x}_{t-1}),{\boldsymbol x}_t-{\boldsymbol x}_{t+1}\rangle -\frac{\lambda_t}{2} \|{\boldsymbol x}_t-{\boldsymbol x}_{t+1}\|^2 \\ &\leq \frac{\alpha_t}{2} \|\nabla \ell_t({\boldsymbol x}_t)-\nabla \ell_{t-1}({\boldsymbol x}_{t-1})\|^2_\star +\frac{1}{2\alpha_t}\|{\boldsymbol x}_t-{\boldsymbol x}_{t+1}\|^2-\frac{\lambda_t}{2} \|{\boldsymbol x}_t-{\boldsymbol x}_{t+1}\|^2~. \end{aligned}

Focusing on the first term, for {t=2, \dots, T}, we have

\displaystyle \begin{aligned} \frac{\alpha_t}{2} \|\nabla \ell_t({\boldsymbol x}_t)-\nabla \ell_{t-1}({\boldsymbol x}_{t-1})\|^2_\star &= \frac{\alpha_t}{2} \|\nabla \ell_t({\boldsymbol x}_t)-\nabla \ell_{t-1}({\boldsymbol x}_{t-1}) - \nabla \ell_t({\boldsymbol x}_{t-1}) + \nabla \ell_t({\boldsymbol x}_{t-1})\|^2_\star \\ &\leq \alpha_t \|\nabla \ell_t({\boldsymbol x}_t)-\nabla \ell_{t}({\boldsymbol x}_{t-1})\|^2_\star +\alpha_t \|\nabla \ell_t({\boldsymbol x}_{t-1})-\nabla \ell_{t-1}({\boldsymbol x}_{t-1})\|^2_\star \\ &\leq \alpha_t M^2 \|{\boldsymbol x}_t-{\boldsymbol x}_{t-1}\|^2 +\alpha_t \|\nabla \ell_t({\boldsymbol x}_{t-1})-\nabla \ell_{t-1}({\boldsymbol x}_{t-1})\|^2_\star~. \end{aligned}

Choose {\alpha_t=\frac{2}{\lambda_t}}. We have for {t=2,\dots, T}

\displaystyle \begin{aligned} &\langle {\boldsymbol g}_t-\tilde{{\boldsymbol g}}_t,{\boldsymbol x}_t-{\boldsymbol x}_{t+1}\rangle -\frac{\lambda_t}{2} \|{\boldsymbol x}_t-{\boldsymbol x}_{t+1}\|^2\\ &\quad \leq \frac{2M^2}{\lambda_t} \|{\boldsymbol x}_t-{\boldsymbol x}_{t-1}\|^2 + \frac{2}{\lambda_t} \|\nabla \ell_t({\boldsymbol x}_{t-1})-\nabla \ell_{t-1}({\boldsymbol x}_{t-1})\|^2_\star -\frac{\lambda_t}{4} \|{\boldsymbol x}_t-{\boldsymbol x}_{t+1}\|^2~. \end{aligned}

For {t=1}, we have

\displaystyle \langle {\boldsymbol g}_t-\tilde{{\boldsymbol g}}_t,{\boldsymbol x}_t-{\boldsymbol x}_{t+1}\rangle -\frac{\lambda_t}{2} \|{\boldsymbol x}_t-{\boldsymbol x}_{t+1}\|^2 \\ \leq \frac{1}{\lambda_t}\|\nabla \ell_t({\boldsymbol x}_t)-\tilde{{\boldsymbol g}}_t\|^2_\star -\frac{\lambda_t}{4} \|{\boldsymbol x}_t-{\boldsymbol x}_{t+1}\|^2~.

Now observe the assumption {\lambda_t} implies {\frac{2M^2}{\lambda_t}\leq \frac{\lambda_{t-1}}{4}} for {t=2, \dots,T}. So, summing for {t=1,\dots,T}, we have

\displaystyle \begin{aligned} \sum_{t=1}^T \left( \langle {\boldsymbol g}_t-\tilde{{\boldsymbol g}}_t,{\boldsymbol x}_t-{\boldsymbol x}_{t+1}\rangle - \frac{\lambda_t}{2} \|{\boldsymbol x}_t-{\boldsymbol x}_{t+1}\|^2 \right) \leq \frac{1}{\lambda_1} \|\nabla \ell_1({\boldsymbol x}_1)\|_\star^2 + \sum_{t=2}^T \frac{2}{\lambda_t} \|\nabla \ell_t({\boldsymbol x}_{t-1})-\nabla \ell_{t-1}({\boldsymbol x}_{t-1})\|^2_\star~. \end{aligned}

Putting all together, we have the first stated bound.

The second one is obtained observing that

\displaystyle \begin{aligned} \sum_{t=2}^T \frac{\|\nabla \ell_t({\boldsymbol x}_{t-1})-\nabla \ell_{t-1}({\boldsymbol x}_{t-1})\|^2_\star}{\lambda_t} &= \sum_{t=2}^T \frac{\|\nabla \ell_t({\boldsymbol x}_{t-1})-\nabla \ell_{t-1}({\boldsymbol x}_{t-1})\|^2_\star}{\sqrt{\max(8M^2 ,4L^2) + \sum_{i=2}^{t-1} \|\nabla \ell_t({\boldsymbol x}_{t-1})-\nabla \ell_{t-1}({\boldsymbol x}_{t-1})\|^2_\star}} \\ &\leq \sum_{t=2}^T \frac{\|\nabla \ell_t({\boldsymbol x}_{t-1})-\nabla \ell_{t-1}({\boldsymbol x}_{t-1})\|^2_\star}{\sqrt{\sum_{i=2}^{t} \|\nabla \ell_t({\boldsymbol x}_{t-1})-\nabla \ell_{t-1}({\boldsymbol x}_{t-1})\|^2_\star}} \\ &\leq 2\sqrt{\sum_{i=2}^{T} \|\nabla \ell_t({\boldsymbol x}_{t-1})-\nabla \ell_{t-1}({\boldsymbol x}_{t-1})\|^2_\star}~. \end{aligned}

\Box

Note that if the losses are all the same, the regret becomes a constant! This is not surprising, because the prediction of the next loss is a linear approximation of the previous loss. Indeed, looking back at the proof, the key idea is to use the smoothness to argue that, if even the past subgradient was taken in a different point than the current one, it is still a good prediction of the current subgradient.

Remark 2 Note that the assumption of smoothness is necessary. Indeed, passing always the same function and using online-to-batch conversion, would result in a convergence rate of {O(1/T)} for a Lipschitz function, that is impossible.

3. History Bits

The Optimistic Online Mirror Descent algorithm was proposed by (Chiang, C.-K. and Yang, T. and Lee, C.-J. and Mahdavi, M. and Lu, C.-J. and Jin, R. and Zhu, S., 2012) and extended in (A. Rakhlin and K. Sridharan, 2013) to use arbitrary “hallucinated” losses. The Optimistic FTRL version was proposed in (A. Rakhlin and K. Sridharan, 2013) and rediscovered in (Steinhardt, J. and Liang, P., 2014), even if it was called Online Mirror Descent for the misnaming problem we already explained. The proof of Theorem 1 I present here is new.

Corollary 2 was proved by (Chiang, C.-K. and Yang, T. and Lee, C.-J. and Mahdavi, M. and Lu, C.-J. and Jin, R. and Zhu, S., 2012) for Optimistic OMD and presented in a similar form in (P. Joulani and A. György and C. Szepesvári, 2017) for Optimistic FTRL, but for bounded domains.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s