More Online-to-Batch Examples and Strong Convexity

This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week.

You can find the lectures I published till now here.


1. Online-to-Batch Conversion

Last time we saw the following theorem, let’s now prove it.

Theorem 1 Let {F({\boldsymbol x})=\mathop{\mathbb E}[f({\boldsymbol x},{\boldsymbol \xi})]} where the expectation is w.r.t. {{\boldsymbol \xi}} drawn from {\rho} over some vector space {X} and {f:{\mathbb R}^d \times X\rightarrow (-\infty,+\infty]} is convex in the first argument. Draw {T} samples {{\boldsymbol \xi}_1,\dots,{\boldsymbol \xi}_T} i.i.d. from {\rho} and construct the sequence of losses {\ell_t({\boldsymbol x}) = \alpha_t f({\boldsymbol x},{\boldsymbol \xi}_t)}, where {\alpha_t} are deterministic. Run any OCO algorithm over the losses {\ell_t}, to construct the sequence of predictions {{\boldsymbol x}_1,\dots,{\boldsymbol x}_{T+1}}. Then, we have

\displaystyle \mathop{\mathbb E}\left[F\left(\frac{1}{\sum_{t=1}^T \alpha_t}\sum_{t=1}^T \alpha_t {\boldsymbol x}_t\right)\right] \leq F({\boldsymbol u}) + \frac{\mathop{\mathbb E}[\text{Regret}_T({\boldsymbol u})]}{\sum_{t=1}^T \alpha_t}, \ \forall {\boldsymbol u} \in {\mathbb R}^d,

where the expectation is with respect to {{\boldsymbol \xi}_1,\dots, {\boldsymbol \xi}_T}.

Proof: We first show that

\displaystyle \label{eq:o2b_1} \mathop{\mathbb E}\left[\sum_{t=1}^T \alpha_t F({\boldsymbol x}_t)\right] = \mathop{\mathbb E}\left[ \sum_{t=1}^T \ell_t({\boldsymbol x}_t)\right]~. \ \ \ \ \ (1)

In fact, from the linearity of the expectation we have

\displaystyle \mathop{\mathbb E}\left[ \sum_{t=1}^T \ell_t({\boldsymbol x}_t)\right] = \sum_{t=1}^T \mathop{\mathbb E}\left[\ell_t({\boldsymbol x}_t)\right]~.

Then, from the law of total expectation, we have

\displaystyle \mathop{\mathbb E}\left[\ell_t({\boldsymbol x}_t)\right] = \mathop{\mathbb E}\left[\mathop{\mathbb E}[\ell_t({\boldsymbol x}_t)|{\boldsymbol \xi}_1, \dots, {\boldsymbol \xi}_{t-1}\right] = \mathop{\mathbb E}\left[\mathop{\mathbb E}[\alpha_t f({\boldsymbol x}_t, {\boldsymbol \xi}_t)|{\boldsymbol \xi}_1, \dots, {\boldsymbol \xi}_{t-1}]\right] = \mathop{\mathbb E}\left[\alpha_t F({\boldsymbol x}_t)\right],

where we used the fact that {{\boldsymbol x}_{t}} and {\alpha_{t}} depend only on {{\boldsymbol \xi}_1,\dots,{\boldsymbol x}_{t-1}}. Hence, (1) is proved.

It remains only to use Jensen’s inequality, using the fact that {F} is convex, to have

\displaystyle F\left(\frac{1}{\sum_{t=1}^T \alpha_t} \sum_{t=1}^T \alpha_t {\boldsymbol x}_t\right) \leq \frac{1}{\sum_{t=1}^T \alpha_t} \sum_{t=1}^T \alpha_t F({\boldsymbol x}_t)~.

Dividing the regret by {\sum_{t=1}^T \alpha_t} and using the above inequalities gives the stated theorem. \Box

In example last time, we had to use a constant learning rate to be able to minimize the training error over the entire space {{\mathbb R}^d}. In the next one, we will see a different approach, that allows to use a varying learning rate without the need of a bounded feasible set.

Example 1 Consider the same setting of the previous example, and let’s change the way in which the construct the online losses. Now use {\ell_t({\boldsymbol x})=\frac{1}{R\sqrt{t}} \max(1-y_t\langle {\boldsymbol z}_t, {\boldsymbol x}\rangle,0)} and step size {\eta=1}. Hence, we have

\displaystyle \mathop{\mathbb E}\left[F\left(\sum_{t=1}^T \frac{1}{R\sqrt{t}} {\boldsymbol x}_t\right)\right] - F({\boldsymbol x}^\star) \leq \frac{\|{\boldsymbol x}^\star\|_2^2}{2\sum_{t=1}^T \frac{1}{R\sqrt{t}}} +\frac{1}{2 \sum_{t=1}^T \frac{1}{R\sqrt{t}}} \sum_{t=1}^T \frac{1}{t} \leq R\frac{\|{\boldsymbol x}^\star\|_2^2+1+\log T}{4 \sqrt{T+1}-2},

where we used {\sum_{t=1}^T \frac{1}{\sqrt{t}} \geq 2\sqrt{T+1}-1}.

I stressed the fact that the only meaningful way to define a regret is with respect to an arbitrary point in the feasible set. This is obvious in the case we consider unconstrained OLO, because the optimal competitor is unbounded. But, it is also true in unconstrained OCO. Let’s see an example of this.

Example 2 Consider a problem of binary classification, with inputs {{\boldsymbol z}_i \in R^d} and outputs {y_i \in \{-1,1\}}. The loss function is the logistic loss: {f({\boldsymbol x}, ({\boldsymbol z},y)) = \ln(1+\exp(-y\langle {\boldsymbol z}, {\boldsymbol x}\rangle))}. Suppose that you want to minimize the training error over a training set of {N} samples, {\{({\boldsymbol z}_i,y_i)\}_{i=1}^N}. Also, assume the maximum L2 norm of the samples is {R}. That is, we want to minimize

\displaystyle \min_{{\boldsymbol x}} \ F({\boldsymbol x}):= \frac{1}{N}\sum_{i=1}^N \ln(1+\exp(-y_i\langle {\boldsymbol z}_i, {\boldsymbol x}\rangle))~.

So, run the reduction described in Theorem 1 for {T} iterations using OSD. In each iteration, construct {\ell_t({\boldsymbol x})=\ln(1+\exp(-y_t\langle {\boldsymbol z}_t, {\boldsymbol x}\rangle))} sampling a training point uniformly at random from {1} to {N}. Set {{\boldsymbol x}_1=\boldsymbol{0}} and {\eta=\frac{1}{R\sqrt{T}}}. We have that

\displaystyle \mathop{\mathbb E}\left[F\left(\frac{1}{T}\sum_{t=1}^T {\boldsymbol x}_t\right)\right] \leq \frac{R}{2\sqrt{T}} + \min_{{\boldsymbol u} \in R^d} F({\boldsymbol u}) + R\frac{\|{\boldsymbol u}\|^2}{2\sqrt{T}}~.

In words, we will be {\frac{R}{2\sqrt{T}}} away from the optimal value of regularized empirical risk minimization problem, where the weight of the regularization is {\frac{R}{2\sqrt{T}}}. Now, let’s consider the case that the training set is linearly separable, this means that the infimum of {F} is 0 and the optimal solution does not exist, i.e., it has norm equal to infinity. So, any convergence guarantee that depends on {{\boldsymbol x}^\star} would be vacuous. On the other hand, our guarantee above still make perfectly sense.

Note that the above examples only deal with training error. However, there is a more interesting application of the online-to-batch conversion, that is to directly minimize the generalization error.

Example 3 Consider the same setting of Example 1, but now consider the case that we want to minimize the true risk, that is

\displaystyle \min_{{\boldsymbol x} \in {\mathbb R}^d} \ \text{Risk}({\boldsymbol x}):=\mathop{\mathbb E}[\max(1-y\langle {\boldsymbol z}, {\boldsymbol x}\rangle,0)],

where {({\boldsymbol z}, y) \sim \rho}. Our Theorem 1 still applies as is. Indeed, draw {T} samples i.i.d. from {\rho} and run OSD with losses {\ell_t({\boldsymbol x})=\max(1-y_t\langle {\boldsymbol z}_t, {\boldsymbol x}\rangle,0)}. We obtain that

\displaystyle \mathop{\mathbb E}\left[\text{Risk}\left(\frac{1}{T}\sum_{t=1}^T {\boldsymbol x}_t\right)\right] \leq \min_{{\boldsymbol u} \in {\mathbb R}^d} \text{Risk}({\boldsymbol u}) + R\frac{\|{\boldsymbol u}\|^2 + 1}{2\sqrt{T}}~.

Note the fact that the risk will be close to the best regularized solution, even if we didn’t use any regularizer! The presence of the regularizer is due to OSD and we will be able to change it to other regularizer when we will see Online Mirror Descent.

2. Can We Do Better Than {\sqrt{T}} Regret?

Let’s now go back to online convex optimization theory. The example in the first class showed us that it is possible to get logarithmic regret in time. However, we saw that we get only {\sqrt{T}}-regret with Online Subgradient Descent (OSD) on the same game. What is the reason? It turns out that the losses in the first game, {\ell_t(x)=(x-y_t)^2} on {[0,1]}, are not just Lipschitz. They also posses some curvature that can be exploited to achieve a better regret. In a moment we will see that the only change we will need to OSD is a different learning rate, dictated as usual by the regret analysis.

The key concept we will need is the one of strong convexity.

3. Convex Analysis Bits: Strong Convexity

Here, we introduce a stronger concept of convexity, that allows to build better lower bound to a function. Instead of the linear lower bound achievable through the use of subgradients, we will make use of quadratic lower bound.

Definition 2 Let {\mu\geq0}. A proper function {f : {\mathbb R}^d \rightarrow (-\infty, +\infty]} is {\mu}-strongly convex over a convex set {V \subseteq \mathop{\mathrm{int}} \mathop{\mathrm{dom}} f} w.r.t. {\|\cdot\|} if

\displaystyle \forall {\boldsymbol x}, {\boldsymbol y} \in V, {\boldsymbol g} \in \partial f({\boldsymbol x}), \quad f({\boldsymbol y}) \geq f({\boldsymbol x}) + \langle {\boldsymbol g} , {\boldsymbol y} - {\boldsymbol x} \rangle + \frac{\mu}{2} \| {\boldsymbol x} - {\boldsymbol y} \|^2~.

Remark 1 You might find another definition of strong convexity, that resembles the one of convexity. However, it is easy to show that these two definitions are equivalent.

Note that the any convex function is {0}-strongly convex, by the definition of subgradient. Also, any {\mu}-strongly convex function is also {\lambda}-strongly convex for any {0\leq \lambda\leq \mu}.

In words, the definition above tells us that a strongly convex function can be lower bounded by a quadratic, where the linear term is the usual one constructed through the subgradient, and the quadratic term depends on the strong convexity. Hence, we have a tighter lower bound to the function w.r.t. simply using convexity. This is what we would expect using a Taylor expansion on a twice-differentiable convex function and lower bounding the smallest eigenvalue of the Hessian. Indeed, we have the following Theorem

Theorem 3 (S. Shalev-Shwartz, 2007, Lemma 14) Let {V\subseteq {\mathbb R}^d} convex and {f:V\rightarrow {\mathbb R}} twice differentiable. Then, a sufficient condition for {\mu}-strong convexity in {V} w.r.t. {\|\cdot\|} is that for all {{\boldsymbol x},{\boldsymbol y}} we have {\langle \nabla^2 f({\boldsymbol x}){\boldsymbol y},{\boldsymbol y}\rangle \geq \mu \|{\boldsymbol y}\|^2}, where {\nabla^2 f({\boldsymbol x})} is the Hessian matrix of {f} at {{\boldsymbol x}}.

However, here there is the important difference that we do not assume the function to be twice differentiable. Indeed, we don’t even need plain differentiability. Hence, the use of the subgradient implies that this lower bound does not have to be uniquely determined, as in the next Example.

strongly_convex
Figure 1. Possible lower bounds to the strongly convex non-differentiable function {f(x)=|x|+x^2}.

Example 4 Consider the strongly convex function {f(x)=|x|+x^2}. In Figure 1, we show two possible quadratic lower bounds to the function in {x=0}.

We also have the following easy but useful property on the sum of strong convex functions.

Theorem 4 Let {f:{\mathbb R}^d\rightarrow (-\infty, +\infty]} be {\mu_1}-strongly convex and {g:{\mathbb R}^d \rightarrow (-\infty, +\infty]} a {\mu_2}-strongly convex function in a non-empty convex set {V \subseteq \mathop{\mathrm{int}} \mathop{\mathrm{dom}} f\cap \mathop{\mathrm{int}} \mathop{\mathrm{dom}} g} w.r.t. {\|\cdot\|_2}. Then, {f+g} is {\mu_1+\mu_2}-strongly convex in {V} w.r.t. {\|\cdot\|_2}.

Proof: Note that the assumption on {V} give us that the subdifferential set of the sum is equal to the sum of the subdifferential sets. Hence, the proof is immediate from the definition of strong convexity. \Box

Example 5 Let {f({\boldsymbol x})=\frac{1}{2}\|{\boldsymbol x}\|_2^2}. Using Theorem 3, we have that {f} is 1-strongly convex w.r.t. {\|\cdot\|_2} in {{\mathbb R}^d}.

4. Online Subgradient Descent for Strongly Convex Losses

Theorem 5 Assume that the functions {\ell_t: V \rightarrow {\mathbb R}} are {\mu_t}-strongly convex w.r.t {\|\cdot\|_2} over {V \subseteq \cap_{i=1}^T \mathop{\mathrm{int}} \mathop{\mathrm{dom}} \ell_i}, where {\mu_t>0}. Use OSD with stepsizes equal to {\eta_t=\frac{1}{\sum_{i=1}^t \mu_t}}. Then, for any {{\boldsymbol u} \in V}, we have the following regret guarantee

\displaystyle \sum_{t=1}^T \ell_t({\boldsymbol x}_t) - \sum_{t=1}^T \ell_t({\boldsymbol u}) \leq \frac{1}{2} \sum_{t=1}^T \eta_t \|{\boldsymbol g}_t\|^2_2~.

Proof: From the assumption of {\mu_t}-strong convexity of the functions {\ell_t}, we have that

\displaystyle \ell_t({\boldsymbol x}_t)-\ell_t({\boldsymbol u}) \leq \langle {\boldsymbol g}_t, {\boldsymbol x}_t - {\boldsymbol u} \rangle - \frac{\mu_t}{2} \|{\boldsymbol x}_t-{\boldsymbol u}\|^2_2~.

From the fact that {\eta_t=\frac{1}{\sum_{i=1}^t \mu_i}}, we have

\displaystyle \begin{aligned} &\frac{1}{2\eta_1}-\frac{\mu_1}{2}=0,\\ &\frac{1}{2\eta_t}-\frac{\mu_t}{2}=\frac{1}{2\eta_{t-1}}, t=2,\cdots,T~. \end{aligned}

Hence, use Lemma 2 in Lecture 3 and sum from {t=1,\dots,T}, to obtain

\displaystyle \begin{aligned} \sum_{t=1}^T \left(\ell_t({\boldsymbol x}_t)- \ell_t({\boldsymbol u})\right) &\leq \sum_{t=1}^T \left(\frac{1}{2\eta_t}\|{\boldsymbol x}_t-{\boldsymbol u}\|_2^2 - \frac{1}{2\eta_t}\|{\boldsymbol x}_{t+1}-{\boldsymbol u}\|_2^2 - \frac{\mu_t}{2} \|{\boldsymbol x}_t-{\boldsymbol u}\|^2 + \frac{\eta_t}{2}\|{\boldsymbol g}_t\|_2^2\right) \\ &= - \frac{1}{2\eta_1}\|{\boldsymbol x}_{2}-{\boldsymbol u}\|_2^2 + \sum_{t=2}^T \left(\frac{1}{2 \eta_{t-1}}\|{\boldsymbol x}_t-{\boldsymbol u}\|_2^2 - \frac{1}{2 \eta_t}\|{\boldsymbol x}_{t+1}-{\boldsymbol u}\|_2^2\right) + \sum_{t=1}^T \frac{\eta_t}{2} \|{\boldsymbol g}_t\|_2^2 ~. \end{aligned}

Observing that the first sum on the left hand side is a telescopic sum, we have the stated bound. \Box

Remark 2 Notice that the theorem requires a bounded domain, otherwise the loss functions will not be Lipschitz given that they are also strongly convex.

Corollary 6 Under the assumptions of Theorem 5, if in addiction we have {\mu_t = \mu >0} and {\ell_t} is {L}-Lipschitz w.r.t. {\|\cdot\|_2}, for {t=1, \dots, T}, then we have

\displaystyle \sum_{t=1}^T \ell_t({\boldsymbol x}_t) - \sum_{t=1}^T \ell_t({\boldsymbol u}) \leq \frac{L^2}{2 \mu} \left(1+\ln T\right)~.

 

Remark 3 Corollary 6 does not imply that for any finite {T} the regret will be smaller than using learning rates {\propto \frac{1}{L\sqrt{t}}}. Instead, asymptotically the regret in Corollary 6 is always better than to one of OSD with Lipschitz losses.

Example 6 Consider once again the example in the first class: {\ell_t(x)=(x-y_t)^2}. Note that the loss functions are {2}-strongly convex w.r.t. {|\cdot|}. Hence, setting {\eta_t=\frac{1}{2t}} and {\ell_t'(x)=2(x-y_t)} gives a regret of {\ln(T)+1}.

Let’s now use again the online-to-batch conversion on strongly convex stochastic problems.

Example 7 As done before, we can use the online-to-batch conversion to use Corollary 6 to obtain stochastic subgradient descent algorithms for strongly convex stochastic functions. For example, consider the classic Support Vector Machine objective

\displaystyle \min_{{\boldsymbol x}} \ F({\boldsymbol x}):=\frac{\lambda}{2}\|{\boldsymbol x}\|_2^2 + \frac{1}{N} \sum_{i=1}^N \max(1-y_i \langle {\boldsymbol z}_i, {\boldsymbol x}\rangle,0),

or any other regularized formulation like regularized logistic regression:

\displaystyle \min_{{\boldsymbol x}} \ F({\boldsymbol x}):=\frac{\lambda}{2}\|{\boldsymbol x}\|_2^2 + \frac{1}{N} \sum_{i=1}^N \ln(1+\exp(-y_i \langle {\boldsymbol z}_i, {\boldsymbol x}\rangle)),

where {{\boldsymbol z}_i \in {\mathbb R}^d}, {\|{\boldsymbol z}_i\|_2\leq R}, and {y_i \in\{-1,1\}}. First, notice that the minimizer of both expressions as to be in the L2 ball of radius proportional to {\sqrt{\frac{1}{\lambda}}} (proof left as exercise). Hence, we can set {V} equal to this set. Then, setting {\ell_t({\boldsymbol x})=\frac{\lambda}{2}\|{\boldsymbol x}\|_2^2+\max(1-y_i \langle {\boldsymbol z}_i, {\boldsymbol x}\rangle,0)} or {\ell_t({\boldsymbol x})=\frac{\lambda}{2}\|{\boldsymbol x}\|_2^2+\ln(1+\exp(-y_i \langle {\boldsymbol z}_i, {\boldsymbol x}\rangle))} results in {\lambda}-strongly convex loss functions. Using Corollary 6 and Theorem 1 gives immediately

\displaystyle \mathop{\mathbb E}\left[F\left(\frac{1}{T}\sum_{t=1}^T {\boldsymbol x}_t\right)\right] - \min_{{\boldsymbol x}} \ F({\boldsymbol x}) \leq O\left(\frac{\ln T}{ \lambda T}\right)~.

However, we can do better! Indeed, {\ell_t({\boldsymbol x})=\frac{\lambda t}{2}\|{\boldsymbol x}\|_2^2+t \max(1-y_i \langle {\boldsymbol z}_i, {\boldsymbol x}\rangle,0)} or {\ell_t({\boldsymbol x})=\frac{\lambda t}{2}\|{\boldsymbol x}\|_2^2+t \ln(1+\exp(-y_i \langle {\boldsymbol z}_i, {\boldsymbol x}\rangle))} results in {\lambda t}-strongly convex loss functions. Using Corollary 6, we have that {\eta_t=\frac{2}{\lambda t(t+1)}} and Theorem 1 gives immediately

\displaystyle \mathop{\mathbb E}\left[F\left(\frac{2}{T(T+1)}\sum_{t=1}^T t\, {\boldsymbol x}_t\right)\right] - \min_{{\boldsymbol x}} \ F({\boldsymbol x}) \leq O\left(\frac{1}{\lambda T}\right),

that is asymptotically better because it does not have the logarithmic term.

5. History Bits

The logarithmic regret in Corollary 6 was shown for the first time in the seminal paper (Hazan, E. and Kalai, A. and Kale, S. and Agarwal, A., 2006). The general statement in Theorem 5 was proven by (Hazan, E. and Rakhlin, A. and Bartlett, P. L., 2008).

As we said last time, the non-uniform averaging of Example 1 is from (Zhang, T., 2004), even if there it is not proposed explicitly as an online-to-batch conversion. Instead, the non-uniform averaging of Example 7 is from (Lacoste-Julien, S. and Schmidt, M. and Bach, F., 2012), but again there is not proposed as an online-to-batch conversion. The basic idea of solving the problem of SVM with OSD and online-to-batch conversion of Example 7 was the Pegasos algorithm (S. Shalev-Shwartz and Y. Singer and N. Srebro, 2007), for many years the most used optimizer for SVMs.

A more recent method to do online-to-batch conversion has been introduced in (Cutkosky, 2019). The new method allows to prove the convergence of the last iterate rather than the one of the weighted average, with a small change in the online learning algorithm.

6. Exercises

Exercise 1 Prove that OSD in Example 6 with {x_1=0} is exactly the Follow-the-Leader strategy for that particular problem.

Exercise 2 Prove that {\ell_t({\boldsymbol x})=\|{\boldsymbol x}-{\boldsymbol z}_t\|_2^2} is {2}-strongly convex w.r.t. {\|\cdot\|_2} and derive the OSD update for it and its regret guarantee.

Exercise 3 (Difficult) Prove that {f({\boldsymbol x})=\frac{1}{2}\|{\boldsymbol x}\|^2_p} is {p-1}-strongly convex w.r.t. {\|\cdot\|_p} where {1\leq p\leq 2}.

4 Comments

  1. Hi! Found some typos: in Example 2 “training set of N samples, {x_i, y_i}” should be “{z_i, y_i}” ; in Example 3 the min should be over x \in R^d ; in Example 7 you first mention Corollary 4 and then Theorem 4 (which I guess are the same) even if neither of them are formally stated anywhere (though there is a reference pointing to a bound).

    I have a doubt regarding the proof of Theorem 5. In the end we have a telescopic sum, thus we should have the first and the last term. Now, the last term should be negative so we can get rid of it, but the first term should be $ \frac{1}{2 \eta_0} \| x_1 – u \| $. I wanted to ask how is $ \eta_0 $ defined and what happens to this first term?

    Thanks,

    Nicolò

    Like

    1. I fixed the typos, thanks a lot! (The missing corollary was actually an error in my version of latex2wp: It should have affected any other corollary I published, I’ll have to doublecheck…)

      In Theorem 5, I added some more details to the proof and avoided to define 1/eta_0=0, it should be fine now.

      Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s