*This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019.*

* You can find all the lectures I published here.*

**1. Online-to-Batch Conversion **

Last time we saw the following theorem, let’s now prove it.

Theorem 1Let where the expectation is w.r.t. drawn from over some vector space and is convex in the first argument. Draw samples i.i.d. from and construct the sequence of losses , where are deterministic. Run any OCO algorithm over the losses , to construct the sequence of predictions . Then, we have

where the expectation is with respect to .

In fact, from the linearity of the expectation we have

Then, from the law of total expectation, we have

where we used the fact that depends only on . Hence, (1) is proved.

It remains only to use Jensen’s inequality, using the fact that is convex, to have

Dividing the regret by and using the above inequalities gives the stated theorem.

In example last time, we had to use a constant learning rate to be able to minimize the training error over the entire space . In the next one, we will see a different approach, that allows to use a varying learning rate without the need of a bounded feasible set.

Example 1Consider the same setting of the previous example, and let’s change the way in which the construct the online losses. Now use and step size . Hence, we have

where we used .

I stressed the fact that the only meaningful way to define a regret is with respect to an arbitrary point in the feasible set. This is obvious in the case we consider unconstrained OLO, because the optimal competitor is unbounded. But, it is also true in unconstrained OCO. Let’s see an example of this.

Example 2Consider a problem of binary classification, with inputs and outputs . The loss function is the logistic loss: . Suppose that you want to minimize the training error over a training set of samples, . Also, assume the maximum L2 norm of the samples is . That is, we want to minimize

So, run the reduction described in Theorem 1 for iterations using OSD. In each iteration, construct sampling a training point uniformly at random from to . Set and . We have that

In words, we will be away from the optimal value of regularized empirical risk minimization problem, where the weight of the regularization is . Now, let’s consider the case that the training set is linearly separable, this means that the infimum of is 0 and the optimal solution does not exist, i.e., it has norm equal to infinity. So, any convergence guarantee that depends on would be vacuous. On the other hand, our guarantee above still make perfectly sense.

Note that the above examples only deal with training error. However, there is a more interesting application of the online-to-batch conversion, that is to directly minimize the generalization error.

Example 3Consider the same setting of Example 1, but now consider the case that we want to minimize the true risk, that is

where . Our Theorem 1 still applies as is. Indeed, draw samples i.i.d. from and run OSD with losses . We obtain that

Note the fact that the risk will be close to the best

regularizedsolution, even if we didn’t use any regularizer! The presence of the regularizer is due to OSD and we will be able to change it to other regularizer when we will see Online Mirror Descent.

**2. Can We Do Better Than Regret? **

Let’s now go back to online convex optimization theory. The example in the first class showed us that it is possible to get logarithmic regret in time. However, we saw that we get only -regret with Online Subgradient Descent (OSD) on the same game. What is the reason? It turns out that the losses in the first game, on , are not just Lipschitz. They also posses some *curvature* that can be exploited to achieve a better regret. In a moment we will see that the only change we will need to OSD is a different learning rate, dictated as usual by the regret analysis.

The key concept we will need is the one of *strong convexity*.

**3. Convex Analysis Bits: Strong Convexity **

Here, we introduce a stronger concept of convexity, that allows to build better lower bound to a function. Instead of the linear lower bound achievable through the use of subgradients, we will make use of *quadratic* lower bound.

Definition 2Let . A proper function is -strongly convex over a convex set w.r.t. if

Remark 1You might find another definition of strong convexity, that resembles the one of convexity. However, it is easy to show that these two definitions are equivalent.

Note that the any convex function is -strongly convex, by the definition of subgradient. Also, any -strongly convex function is also -strongly convex for any .

In words, the definition above tells us that a strongly convex function can be lower bounded by a quadratic, where the linear term is the usual one constructed through the subgradient, and the quadratic term depends on the strong convexity. Hence, we have a tighter lower bound to the function w.r.t. simply using convexity. This is what we would expect using a Taylor expansion on a twice-differentiable convex function and lower bounding the smallest eigenvalue of the Hessian. Indeed, we have the following Theorem

Theorem 3 (S. Shalev-Shwartz, 2007, Lemma 14)Let convex and twice differentiable. Then, a sufficient condition for -strong convexity in w.r.t. is that for all we have , where is the Hessian matrix of at .

However, here there is the important difference that we do not assume the function to be twice differentiable. Indeed, we don’t even need plain differentiability. Hence, the use of the subgradient implies that this lower bound does not have to be uniquely determined, as in the next Example.

Example 4Consider the strongly convex function . In Figure 1, we show two possible quadratic lower bounds to the function in .

We also have the following easy but useful property on the sum of strong convex functions.

Theorem 4Let be -strongly convex and a -strongly convex function in a non-empty convex set w.r.t. . Then, is -strongly convex in w.r.t. .

*Proof:* Note that the assumption on give us that the subdifferential set of the sum is equal to the sum of the subdifferential sets. Hence, the proof is immediate from the definition of strong convexity.

Example 5Let . Using Theorem 3, we have that is 1-strongly convex w.r.t. in .

**4. Online Subgradient Descent for Strongly Convex Losses **

Theorem 5Assume that the functions are -strongly convex w.r.t over , where . Use OSD with stepsizes equal to . Then, for any , we have the following regret guarantee

*Proof:* From the assumption of -strong convexity of the functions , we have that

From the fact that , we have

Hence, use Lemma 2 in Lecture 3 and sum from , to obtain

Observing that the first sum on the left hand side is a telescopic sum, we have the stated bound.

Remark 2Notice that the theorem requires a bounded domain, otherwise the loss functions will not be Lipschitz given that they are also strongly convex.

Corollary 6Under the assumptions of Theorem 5, if in addiction we have and is -Lipschitz w.r.t. , for , then we have

Remark 3Corollary 6does notimply that for any finite the regret will be smaller than using learning rates . Instead, asymptotically the regret in Corollary 6 is always better than to one of OSD with Lipschitz losses.

Example 6Consider once again the example in the first class: . Note that the loss functions are -strongly convex w.r.t. . Hence, setting and gives a regret of .

Let’s now use again the online-to-batch conversion on strongly convex stochastic problems.

Example 7As done before, we can use the online-to-batch conversion to use Corollary 6 to obtain stochastic subgradient descent algorithms for strongly convex stochastic functions. For example, consider the classic Support Vector Machine objective

or any other regularized formulation like regularized logistic regression:

where , , and . First, notice that the minimizer of both expressions has to be in the L2 ball of radius proportional to (proof left as exercise). Hence, we can set equal to this set. Then, setting or results in -strongly convex loss functions. Using Corollary 6 and Theorem 1 gives immediately

However, we can do better! Indeed, or results in -strongly convex loss functions. Using Corollary 6, we have that and Theorem 1 gives immediately

that is asymptotically better because it does not have the logarithmic term.

**5. History Bits **

The logarithmic regret in Corollary 6 was shown for the first time in the seminal paper (Hazan, E. and Kalai, A. and Kale, S. and Agarwal, A., 2006). The general statement in Theorem 5 was proven by (Hazan, E. and Rakhlin, A. and Bartlett, P. L., 2008).

As we said last time, the non-uniform averaging of Example 1 is from (Zhang, T., 2004), even if there it is not proposed explicitly as an online-to-batch conversion. Instead, the non-uniform averaging of Example 7 is from (Lacoste-Julien, S. and Schmidt, M. and Bach, F., 2012), but again there is not proposed as an online-to-batch conversion. The basic idea of solving the problem of SVM with OSD and online-to-batch conversion of Example 7 was the Pegasos algorithm (S. Shalev-Shwartz and Y. Singer and N. Srebro, 2007), for many years the most used optimizer for SVMs.

A more recent method to do online-to-batch conversion has been introduced in (Cutkosky, 2019). The new method allows to prove the convergence of the last iterate rather than the one of the weighted average, with a small change in the online learning algorithm.

**6. Exercises **

Exercise 1Prove that OSD in Example 6 with is exactly the Follow-the-Leader strategy for that particular problem.

Exercise 2Prove that is -strongly convex w.r.t. and derive the OSD update for it and its regret guarantee.

Exercise 3 (Difficult)Prove that is -strongly convex w.r.t. where .

There is a nice recent paper by Ashok Cutkosky that deals with online-to-batch conversion in an elegant way: http://proceedings.mlr.press/v97/cutkosky19a/cutkosky19a.pdf

LikeLike

Hi David, nice to see you here đź™‚

Good point, I forgot about that paper! I’ll add it to the references.

LikeLike

Hi! Found some typos: in Example 2 “training set of N samples, {x_i, y_i}” should be “{z_i, y_i}” ; in Example 3 the min should be over x \in R^d ; in Example 7 you first mention Corollary 4 and then Theorem 4 (which I guess are the same) even if neither of them are formally stated anywhere (though there is a reference pointing to a bound).

I have a doubt regarding the proof of Theorem 5. In the end we have a telescopic sum, thus we should have the first and the last term. Now, the last term should be negative so we can get rid of it, but the first term should be $ \frac{1}{2 \eta_0} \| x_1 – u \| $. I wanted to ask how is $ \eta_0 $ defined and what happens to this first term?

Thanks,

NicolĂ˛

LikeLike

I fixed the typos, thanks a lot! (The missing corollary was actually an error in my version of latex2wp: It should have affected any other corollary I published, I’ll have to doublecheck…)

In Theorem 5, I added some more details to the proof and avoided to define 1/eta_0=0, it should be fine now.

LikeLike

Hi. Cool stuff, it’s great someone wrote in a less abstract and exotic way.

Qestion: shouldn’t the ball in example 7 be just 1 over lambda without the square root? In theorem 5 there is a typo (subscript of \mu is t instead of i).

LikeLike

Thanks for finding the typo, fixed it now.

Regarding the ball in example 7, I think it is correct. The trick is to observe that the minimum is smaller than any other value, in particular it is smaller than the objective function in the null vector. Let me know if you see how to do it.

LikeLiked by 1 person

Yes, I see it. But am I correct that comparing gradient to 0 and upper-bounding the norm by L from Lipschitz we get 1/lambda and this should tighter for bigger lambdas >1? (which probably is irrelevant considering mostly used lambda values are lower).

LikeLike

I am not sure I see how to do it: || lambda x^* + \nabla \ell_t(x^\star)|| \geq 0?

LikeLike

From first order optimality condition we get x = -\nabla \ell(x) /\lambda which implies \|x\|_* \|\nabla \ell(x)/\lambda \|_* and it’s \leq L/\lambda since both losses are Lipschitz. Ooor did I do something terribly wrong?

LikeLike

Ah yes, now I understand what you mean! It is a good point indeed, I don’t think I ever saw it. Probably the reason is that, as you say, people only care about small lambda. However, a practical implementation could definitely use this observation and take the min between the two estimated radiuses.

LikeLiked by 1 person

Dear Prof. Orabona,

I am a little confused about Theorem 1. In the statement you have mentioned that the terms $(\alpha_t)_{t \geq 1}$ are `deterministic. Later in the proof, you have mentioned that $x_t$ and $\alpha_t$ `depend only on $x_1, \ldots, \xi_{t-1}$’. But since $\xi_i$ are random variables, doesn’t that make $\alpha_t$ random as well? So, why is the term $\sum_{t=1}^T \alpha_t$ in the right-hand-side of the statement in Theorem 1 not within expectation?

Thanks!

LikeLiked by 1 person

Good catch! It is actually a typo: I didn’t want to deal with stochastic alpha_t, I have fixed now the post and I’ll fix the arxiv in the next release. Thanks!

LikeLike