This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week.
You can find the lectures I published till now here.
1. Online-to-Batch Conversion
Last time we saw the following theorem, let’s now prove it.
Theorem 1 Let where the expectation is w.r.t. drawn from over some vector space and is convex in the first argument. Draw samples i.i.d. from and construct the sequence of losses , where are deterministic. Run any OCO algorithm over the losses , to construct the sequence of predictions . Then, we have
where the expectation is with respect to .
Then, from the law of total expectation, we have
where we used the fact that and depend only on . Hence, (1) is proved.
It remains only to use Jensen’s inequality, using the fact that is convex, to have
Dividing the regret by and using the above inequalities gives the stated theorem.
In example last time, we had to use a constant learning rate to be able to minimize the training error over the entire space . In the next one, we will see a different approach, that allows to use a varying learning rate without the need of a bounded feasible set.
where we used .
I stressed the fact that the only meaningful way to define a regret is with respect to an arbitrary point in the feasible set. This is obvious in the case we consider unconstrained OLO, because the optimal competitor is unbounded. But, it is also true in unconstrained OCO. Let’s see an example of this.
Example 2 Consider a problem of binary classification, with inputs and outputs . The loss function is the logistic loss: . Suppose that you want to minimize the training error over a training set of samples, . Also, assume the maximum L2 norm of the samples is . That is, we want to minimize
So, run the reduction described in Theorem 1 for iterations using OSD. In each iteration, construct sampling a training point uniformly at random from to . Set and . We have that
In words, we will be away from the optimal value of regularized empirical risk minimization problem, where the weight of the regularization is . Now, let’s consider the case that the training set is linearly separable, this means that the infimum of is 0 and the optimal solution does not exist, i.e., it has norm equal to infinity. So, any convergence guarantee that depends on would be vacuous. On the other hand, our guarantee above still make perfectly sense.
Note that the above examples only deal with training error. However, there is a more interesting application of the online-to-batch conversion, that is to directly minimize the generalization error.
Example 3 Consider the same setting of Example 1, but now consider the case that we want to minimize the true risk, that is
where . Our Theorem 1 still applies as is. Indeed, draw samples i.i.d. from and run OSD with losses . We obtain that
Note the fact that the risk will be close to the best regularized solution, even if we didn’t use any regularizer! The presence of the regularizer is due to OSD and we will be able to change it to other regularizer when we will see Online Mirror Descent.
2. Can We Do Better Than Regret?
Let’s now go back to online convex optimization theory. The example in the first class showed us that it is possible to get logarithmic regret in time. However, we saw that we get only -regret with Online Subgradient Descent (OSD) on the same game. What is the reason? It turns out that the losses in the first game, on , are not just Lipschitz. They also posses some curvature that can be exploited to achieve a better regret. In a moment we will see that the only change we will need to OSD is a different learning rate, dictated as usual by the regret analysis.
The key concept we will need is the one of strong convexity.
3. Convex Analysis Bits: Strong Convexity
Here, we introduce a stronger concept of convexity, that allows to build better lower bound to a function. Instead of the linear lower bound achievable through the use of subgradients, we will make use of quadratic lower bound.
Definition 2 Let . A proper function is -strongly convex over a convex set w.r.t. if
Remark 1 You might find another definition of strong convexity, that resembles the one of convexity. However, it is easy to show that these two definitions are equivalent.
Note that the any convex function is -strongly convex, by the definition of subgradient. Also, any -strongly convex function is also -strongly convex for any .
In words, the definition above tells us that a strongly convex function can be lower bounded by a quadratic, where the linear term is the usual one constructed through the subgradient, and the quadratic term depends on the strong convexity. Hence, we have a tighter lower bound to the function w.r.t. simply using convexity. This is what we would expect using a Taylor expansion on a twice-differentiable convex function and lower bounding the smallest eigenvalue of the Hessian. Indeed, we have the following Theorem
Theorem 3 (S. Shalev-Shwartz, 2007, Lemma 14) Let convex and twice differentiable. Then, a sufficient condition for -strong convexity in w.r.t. is that for all we have , where is the Hessian matrix of at .
However, here there is the important difference that we do not assume the function to be twice differentiable. Indeed, we don’t even need plain differentiability. Hence, the use of the subgradient implies that this lower bound does not have to be uniquely determined, as in the next Example.
Example 4 Consider the strongly convex function . In Figure 1, we show two possible quadratic lower bounds to the function in .
We also have the following easy but useful property on the sum of strong convex functions.
Theorem 4 Let be -strongly convex and a -strongly convex function in a non-empty convex set w.r.t. . Then, is -strongly convex in w.r.t. .
Proof: Note that the assumption on give us that the subdifferential set of the sum is equal to the sum of the subdifferential sets. Hence, the proof is immediate from the definition of strong convexity.
Example 5 Let . Using Theorem 3, we have that is 1-strongly convex w.r.t. in .
4. Online Subgradient Descent for Strongly Convex Losses
Proof: From the assumption of -strong convexity of the functions , we have that
From the fact that , we have
Hence, use Lemma 2 in Lecture 3 and sum from , to obtain
Observing that the first sum on the left hand side is a telescopic sum, we have the stated bound.
Remark 2 Notice that the theorem requires a bounded domain, otherwise the loss functions will not be Lipschitz given that they are also strongly convex.
Corollary 6 Under the assumptions of Theorem 5, if in addiction we have and is -Lipschitz w.r.t. , for , then we have
Remark 3 Corollary 6 does not imply that for any finite the regret will be smaller than using learning rates . Instead, asymptotically the regret in Corollary 6 is always better than to one of OSD with Lipschitz losses.
Let’s now use again the online-to-batch conversion on strongly convex stochastic problems.
Example 7 As done before, we can use the online-to-batch conversion to use Corollary 6 to obtain stochastic subgradient descent algorithms for strongly convex stochastic functions. For example, consider the classic Support Vector Machine objective
or any other regularized formulation like regularized logistic regression:
where , , and . First, notice that the minimizer of both expressions as to be in the L2 ball of radius proportional to (proof left as exercise). Hence, we can set equal to this set. Then, setting or results in -strongly convex loss functions. Using Corollary 6 and Theorem 1 gives immediately
that is asymptotically better because it does not have the logarithmic term.
5. History Bits
The logarithmic regret in Corollary 6 was shown for the first time in the seminal paper (Hazan, E. and Kalai, A. and Kale, S. and Agarwal, A., 2006). The general statement in Theorem 5 was proven by (Hazan, E. and Rakhlin, A. and Bartlett, P. L., 2008).
As we said last time, the non-uniform averaging of Example 1 is from (Zhang, T., 2004), even if there it is not proposed explicitly as an online-to-batch conversion. Instead, the non-uniform averaging of Example 7 is from (Lacoste-Julien, S. and Schmidt, M. and Bach, F., 2012), but again there is not proposed as an online-to-batch conversion. The basic idea of solving the problem of SVM with OSD and online-to-batch conversion of Example 7 was the Pegasos algorithm (S. Shalev-Shwartz and Y. Singer and N. Srebro, 2007), for many years the most used optimizer for SVMs.
A more recent method to do online-to-batch conversion has been introduced in (Cutkosky, 2019). The new method allows to prove the convergence of the last iterate rather than the one of the weighted average, with a small change in the online learning algorithm.
Exercise 1 Prove that OSD in Example 6 with is exactly the Follow-the-Leader strategy for that particular problem.
Exercise 2 Prove that is -strongly convex w.r.t. and derive the OSD update for it and its regret guarantee.
Exercise 3 (Difficult) Prove that is -strongly convex w.r.t. where .