This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week.
You can find the lectures I published till now here.
Last time, we introduced Projected Online Gradient Descent, whose update is

$$x_{t+1} = \Pi_V(x_t - \eta_t \nabla \ell_t(x_t)),$$

where $\Pi_V$ denotes the Euclidean projection onto $V$.
And we proved the following regret guarantee:
Theorem 1 Let $V \subseteq \mathbb{R}^d$ a closed non-empty convex set with diameter $D$, i.e. $\max_{x,y \in V} \|x - y\|_2 \le D$. Let $\ell_1, \dots, \ell_T$ an arbitrary sequence of convex functions differentiable in open sets containing $V$, for $t = 1, \dots, T$. Pick any $x_1 \in V$, assume $\eta_{t+1} \le \eta_t$, $t = 1, \dots, T$. Then, $\forall u \in V$, the following regret bound holds

$$\sum_{t=1}^T \left(\ell_t(x_t) - \ell_t(u)\right) \le \frac{D^2}{2 \eta_T} + \sum_{t=1}^T \frac{\eta_t}{2} \|\nabla \ell_t(x_t)\|_2^2.$$
Moreover, if $\eta_t$ is constant, i.e. $\eta_t = \eta$, we have

$$\sum_{t=1}^T \left(\ell_t(x_t) - \ell_t(u)\right) \le \frac{D^2}{2 \eta} + \frac{\eta}{2} \sum_{t=1}^T \|\nabla \ell_t(x_t)\|_2^2.$$
However, the differentiability assumption for the $\ell_t$ is quite strong. What happens when the losses are convex but not differentiable? For example, $\ell_t(x) = |x - y_t|$. Note that this situation is more common than one would think. For example, the hinge loss, $\max(1 - y \langle z, x \rangle, 0)$, and the ReLU activation function used in neural networks, $\max(x, 0)$, are not differentiable. It turns out that we can just use Online Gradient Descent, substituting subgradients for the gradients. For this, we need some more convex analysis!
1. Convex Analysis Bits: Subgradients
First, we need a technical definition.
Definition 2 If a function $f : \mathbb{R}^d \rightarrow [-\infty, +\infty]$ is nowhere $-\infty$ and finite somewhere, then $f$ is called proper.
In this class, we are mainly interested in proper convex functions, which basically better conform to our intuition of what a convex function looks like.
Let’s first define formally what is a subgradient.
Definition 3 For a proper function $f : \mathbb{R}^d \rightarrow (-\infty, +\infty]$, we define a subgradient of $f$ in $x \in \mathbb{R}^d$ as a vector $g \in \mathbb{R}^d$ that satisfies

$$f(y) \ge f(x) + \langle g, y - x \rangle, \quad \forall y \in \mathbb{R}^d.$$
Basically, a subgradient of $f$ in $x$ is any vector $g$ that allows us to construct a linear lower bound to $f$. Note that the subgradient is not unique, so we denote the set of subgradients of $f$ in $x$ by $\partial f(x)$, called the subdifferential of $f$ at $x$.
Observe that if $f$ is proper and convex, then $\partial f(x)$ is empty for $x \notin \operatorname{dom} f$, because the inequality in the definition cannot be satisfied when $f(x) = +\infty$. Also, the domain of $\partial f$, denoted by $\operatorname{dom} \partial f$, is the set of all $x$ such that $\partial f(x)$ is nonempty; it is a subset of $\operatorname{dom} f$. A proper convex function $f$ is always subdifferentiable in $\operatorname{int} \operatorname{dom} f$ (Rockafellar, R. T., 1970, Theorem 23.4). If the function $f$ is convex, differentiable in $x$, and $f(x)$ is finite, we have that the subdifferential is composed of a unique element equal to $\nabla f(x)$ (Rockafellar, R. T., 1970, Theorem 25.1).
We can also calculate the subgradient of a sum of functions.
Theorem 4 (Rockafellar, R. T., 1970, Theorem 23.8, Bauschke, H. H. and Combettes, P. L., 2011, Corollary 16.39) Let $f_1, \dots, f_m$ be proper convex functions on $\mathbb{R}^d$, and $f = f_1 + \dots + f_m$. Then $\partial f(x) \supseteq \partial f_1(x) + \dots + \partial f_m(x)$. If $\bigcap_{i=1}^m \operatorname{int} \operatorname{dom} f_i \neq \emptyset$, then actually $\partial f(x) = \partial f_1(x) + \dots + \partial f_m(x)$.
Example 1 Let $f(x) = |x|$, then the subdifferential set is

$$\partial f(x) = \begin{cases} \{1\}, & x > 0 \\ [-1, 1], & x = 0 \\ \{-1\}, & x < 0. \end{cases}$$
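As a quick sanity check, the case analysis above is easy to turn into code. This is only an illustrative sketch: the function name is mine, and the value returned at $x = 0$ is an arbitrary pick among the valid subgradients in $[-1, 1]$.

```python
def subgradient_abs(x):
    """Return one subgradient of f(x) = |x|.

    At x = 0 any value in [-1, 1] is a valid subgradient;
    we arbitrarily pick 0.
    """
    if x > 0:
        return 1.0
    elif x < 0:
        return -1.0
    return 0.0

# Verify the subgradient inequality f(y) >= f(x) + g * (y - x) at the kink:
x, g = 0.0, subgradient_abs(0.0)
assert all(abs(y) >= abs(x) + g * (y - x) for y in [-2.0, -0.5, 0.0, 0.5, 2.0])
```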
Example 2 Let’s calculate the subgradient of the indicator function $i_V : \mathbb{R}^d \rightarrow (-\infty, +\infty]$ for a non-empty convex set $V \subseteq \mathbb{R}^d$. By definition, $g \in \partial i_V(x)$ if

$$i_V(y) \ge i_V(x) + \langle g, y - x \rangle, \quad \forall y \in \mathbb{R}^d.$$

This condition implies that $x \in V$ and $\langle g, y - x \rangle \le 0, \forall y \in V$ (because for $y \notin V$ the inequality is always verified). The set of all $g$ that satisfies the above inequality is called the normal cone of $V$ at $x$. Note that $0$ belongs to the normal cone for any $x \in V$ (Hint: take $g = 0$). For example, for $x \in \operatorname{int} V$, we have $\partial i_V(x) = \{0\}$.
Another useful theorem allows us to calculate the subdifferential of the pointwise maximum of convex functions.
Theorem 5 (Bauschke, H. H. and Combettes, P. L., 2011, Theorem 18.5) Let $f_1, \dots, f_m$ be a finite set of convex functions from $\mathbb{R}^d$ to $(-\infty, +\infty]$ and suppose $x \in \bigcap_{i=1}^m \operatorname{dom} f_i$ and $f_1, \dots, f_m$ continuous at $x$. Set $f = \max_{i=1,\dots,m} f_i$ and let $A(x) = \{i : f_i(x) = f(x)\}$ the set of the active functions. Then

$$\partial f(x) = \operatorname{conv}\left(\bigcup_{i \in A(x)} \partial f_i(x)\right).$$
Example 3 (Subgradients of the Hinge loss) Consider the loss $\ell(x) = \max(1 - y \langle z, x \rangle, 0)$ for $z \in \mathbb{R}^d$ and $y \in \{-1, 1\}$. Using Theorem 5, the subdifferential set is

$$\partial \ell(x) = \begin{cases} \{0\}, & 1 - y \langle z, x \rangle < 0 \\ \{-\alpha y z : \alpha \in [0, 1]\}, & 1 - y \langle z, x \rangle = 0 \\ \{-y z\}, & 1 - y \langle z, x \rangle > 0. \end{cases}$$
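In code, one valid element of this subdifferential can be picked as follows (a minimal sketch; the function name is mine, and at the kink we arbitrarily return the zero vector, which corresponds to $\alpha = 0$):

```python
import numpy as np

def hinge_subgradient(x, z, y):
    """Return one subgradient of ell(x) = max(1 - y * <z, x>, 0).

    x, z: numpy arrays of the same shape; y: label in {-1, 1}.
    At the kink (margin exactly 1) any -alpha * y * z with
    alpha in [0, 1] is valid; we arbitrarily pick alpha = 0.
    """
    margin = y * np.dot(z, x)
    if margin >= 1.0:
        return np.zeros_like(x)   # flat region (or the kink): loss is 0
    return -y * z                 # loss is 1 - y<z,x>, with gradient -y z
```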
Definition 6 Let $f : \mathbb{R}^d \rightarrow (-\infty, +\infty]$. $f$ is $L$-Lipschitz over a set $V$ w.r.t. a norm $\|\cdot\|$ if $|f(x) - f(y)| \le L \|x - y\|, \forall x, y \in V$.
We also have this handy result that upper bounds the norm of subgradients of convex Lipschitz functions.
Theorem 7 Let $f$ proper and convex. Then, $f$ is $L$-Lipschitz in $\operatorname{int} \operatorname{dom} f$ w.r.t. the L2 norm iff for all $x \in \operatorname{int} \operatorname{dom} f$ and $g \in \partial f(x)$ we have $\|g\|_2 \le L$.
Proof: Assume $f$ is $L$-Lipschitz and let $g \in \partial f(x)$. For $\epsilon > 0$ small enough, $y = x + \epsilon g \in \operatorname{int} \operatorname{dom} f$, then

$$\epsilon \|g\|_2^2 = \langle g, y - x \rangle \le f(y) - f(x) \le L \|y - x\|_2 = \epsilon L \|g\|_2,$$

that implies $\|g\|_2 \le L$.
For the other implication, the definition of subgradient and the Cauchy–Schwarz inequality give us

$$f(x) - f(y) \le \langle g, x - y \rangle \le \|g\|_2 \|x - y\|_2 \le L \|x - y\|_2$$

for any $g \in \partial f(x)$. Taking $g' \in \partial f(y)$, we also get

$$f(y) - f(x) \le L \|x - y\|_2,$$

that completes the proof. $\Box$
2. Analysis with Subgradients
As I promised you, with the proper mathematical tools, analyzing online algorithms becomes easy. Indeed, switching from gradients to subgradients comes for free! In fact, our analysis of OGD with differentiable losses holds as is using subgradients instead of gradients. The reason is that the only property of the gradients that we used in our proof was that

$$\ell_t(x_t) - \ell_t(u) \le \langle g_t, x_t - u \rangle,$$
where $g_t = \nabla \ell_t(x_t)$. However, by the definition of subgradient, the exact same property holds when $g_t \in \partial \ell_t(x_t)$. So, we can state the Online Subgradient Descent algorithm in the following way, where the only difference is line 4: the update becomes $x_{t+1} = \Pi_V(x_t - \eta_t g_t)$ for an arbitrary $g_t \in \partial \ell_t(x_t)$.
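The full loop can be sketched in a few lines of Python (an illustrative sketch, not a definitive implementation: the function and argument names are mine, and `project` and the per-round subgradient oracles are left as problem-specific callables):

```python
import numpy as np

def projected_osd(x1, subgradients, project, etas):
    """Projected Online Subgradient Descent (sketch).

    x1:           starting point, assumed to lie in the feasible set V
    subgradients: list of callables; subgradients[t](x) returns some
                  g in the subdifferential of the t-th loss at x
    project:      Euclidean projection onto V
    etas:         sequence of learning rates eta_t (non-increasing)
    Returns the list of predictions x_1, ..., x_T.
    """
    x = np.asarray(x1, dtype=float)
    predictions = []
    for sg, eta in zip(subgradients, etas):
        predictions.append(x.copy())   # predict x_t
        g = sg(x)                      # receive g_t in the subdifferential
        x = project(x - eta * g)       # x_{t+1} = projection of x_t - eta_t g_t
    return predictions
```

For instance, on the loss $|x - 0.5|$ over $V = [0, 1]$, `project` is just `np.clip(v, 0.0, 1.0)` and the subgradient oracle can be `np.sign(x - 0.5)`.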
Also, the regret bounds we proved hold as well, just changing differentiability to subdifferentiability and gradients to subgradients.
3. From Convex Losses to Linear Losses
Let’s take a deeper look at the step

$$\ell_t(x_t) - \ell_t(u) \le \langle g_t, x_t - u \rangle, \quad g_t \in \partial \ell_t(x_t).$$
And summing over time, we have

$$\sum_{t=1}^T \left(\ell_t(x_t) - \ell_t(u)\right) \le \sum_{t=1}^T \langle g_t, x_t - u \rangle.$$
Now, define the linear (and convex) losses $\tilde{\ell}_t(x) := \langle g_t, x \rangle$, so we have

$$\sum_{t=1}^T \left(\ell_t(x_t) - \ell_t(u)\right) \le \sum_{t=1}^T \left(\tilde{\ell}_t(x_t) - \tilde{\ell}_t(u)\right).$$
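The per-round inequality underlying this reduction is nothing but convexity, and it is easy to verify numerically. Here is a quick check on a test function of my choosing, $\ell(x) = (x - 1)^2$:

```python
# Check ell(x_t) - ell(u) <= <g_t, x_t - u> for the convex ell(x) = (x - 1)^2,
# where g_t is the (sub)gradient of ell at x_t.
def ell(x):
    return (x - 1.0) ** 2

def grad(x):
    return 2.0 * (x - 1.0)

for x_t in [-1.0, 0.0, 0.5, 3.0]:
    for u in [-2.0, 0.0, 1.0, 2.0]:
        g = grad(x_t)
        # convexity: ell(u) >= ell(x_t) + g * (u - x_t), rearranged
        assert ell(x_t) - ell(u) <= g * (x_t - u) + 1e-12
```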
This is more powerful than it seems: We upper bounded the regret with respect to the convex losses with a regret with respect to another sequence of linear losses. This is important because it implies that we can build online algorithms that deal only with linear losses, and through the reduction above they can be seamlessly used as OCO algorithms! Note that this does not imply that this reduction is always optimal, it isn’t! But, it allows us to easily construct optimal OCO algorithms in many interesting cases.
So, we will often consider just the problem of minimizing the linear regret

$$\text{Regret}_T(u) = \sum_{t=1}^T \langle g_t, x_t \rangle - \sum_{t=1}^T \langle g_t, u \rangle.$$
This problem is called Online Linear Optimization (OLO).
Example 4 Consider the guessing game of the first class: we can solve it easily with Online Gradient Descent. Indeed, we just need to calculate the gradients, prove that they are bounded, and find a way to calculate the projection of a real number onto $[0, 1]$. So, with $\ell_t(x) = (x - y_t)^2$ we have $\ell_t'(x) = 2(x - y_t)$, that is bounded by $2$ for $x, y_t \in [0, 1]$. The projection onto $[0, 1]$ is just $\Pi_{[0,1]}(x) = \min(\max(x, 0), 1)$. With the optimal learning rate, the resulting regret would be $O(\sqrt{T})$. This is worse than the logarithmic one we found in the first class, showing that the reduction does not always give the best possible regret.
Example 5 Consider again the guessing game of the first class, but now change the loss function to the absolute loss of the difference: $\ell_t(x) = |x - y_t|$. Now we will need to use Online Subgradient Descent, because the functions are non-differentiable. We can easily see that

$$\partial \ell_t(x) = \begin{cases} \{1\}, & x > y_t \\ [-1, 1], & x = y_t \\ \{-1\}, & x < y_t. \end{cases}$$
Again, running Online Subgradient Descent with the optimal learning rate on this problem will immediately give us a regret of $O(\sqrt{T})$, without having to think of a particular strategy for it.
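To see this in action, here is a small simulation of Example 5. The sequence $y_t$ and the horizon $T$ are made up for illustration; with $\eta_t = 1/\sqrt{t}$, Theorem 1 with $D = 1$ and $\|g_t\|_2 \le 1$ guarantees a regret of at most $\frac{3}{2}\sqrt{T}$:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
ys = rng.random(T)          # the adversary's numbers: i.i.d. uniform in [0, 1]

x = 0.5                     # x_1 in V = [0, 1]
total_loss = 0.0
for t in range(1, T + 1):
    y = ys[t - 1]
    total_loss += abs(x - y)                            # suffer |x_t - y_t|
    g = np.sign(x - y)                                  # a subgradient g_t
    x = float(np.clip(x - g / np.sqrt(t), 0.0, 1.0))    # OSD step, eta_t = 1/sqrt(t)

# Best fixed prediction in hindsight: the minimizer of sum_t |u - y_t|
# is the median of the y_t.
best = np.sum(np.abs(np.median(ys) - ys))
regret = total_loss - best
assert regret <= 1.5 * np.sqrt(T)   # guaranteed by the regret bound
```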
4. Online-to-Batch Conversion
It is a good moment to take a break from online learning theory and see some application of online learning to other domains. For example, we may wonder what is the connection between online learning and stochastic optimization. Given that Projected Online (Sub)Gradient Descent looks basically the same as Projected Stochastic (Sub)Gradient Descent, they must have something in common. Indeed, we can show that, for example, we can reduce stochastic optimization of convex functions to OCO. Let’s see how.
Theorem 8 Let $F(x) = \mathbb{E}[f(x, \xi)]$, where the expectation is w.r.t. $\xi$ drawn from $\rho$ over some vector space $X$ and $f : \mathbb{R}^d \times X \rightarrow (-\infty, +\infty]$ is convex in the first argument. Draw $T$ samples $\xi_1, \dots, \xi_T$ i.i.d. from $\rho$ and construct the sequence of losses $\ell_t(x) = \alpha_t f(x, \xi_t)$, where the $\alpha_t > 0$ are deterministic. Run any OCO algorithm over the losses $\ell_t$, to construct the sequence of predictions $x_1, \dots, x_T$. Then, for any $u$, we have

$$\mathbb{E}\left[F\left(\frac{\sum_{t=1}^T \alpha_t x_t}{\sum_{t=1}^T \alpha_t}\right)\right] \le F(u) + \frac{\mathbb{E}[\text{Regret}_T(u)]}{\sum_{t=1}^T \alpha_t},$$

where the expectation is with respect to $\xi_1, \dots, \xi_T$.
We will prove it next time; for now, let’s see an application of it: how to use the above theorem to transform Online Subgradient Descent into Stochastic Subgradient Descent, to minimize the training error of a classifier.
Example 6 Consider a problem of binary classification, with inputs $z_i \in \mathbb{R}^d$ and outputs $y_i \in \{-1, 1\}$. The loss function is the hinge loss: $f(x, (z, y)) = \max(1 - y \langle z, x \rangle, 0)$. Suppose that you want to minimize the training error over a training set of $N$ samples, $\{(z_i, y_i)\}_{i=1}^N$. Also, assume the maximum L2 norm of the samples is $R$, i.e. $\|z_i\|_2 \le R$. That is, we want to minimize

$$F(x) = \frac{1}{N} \sum_{i=1}^N \max(1 - y_i \langle z_i, x \rangle, 0).$$
Run the reduction described in Theorem 8 for $T$ iterations using Online Subgradient Descent over $V = \{x : \|x\|_2 \le D\}$. In each iteration, construct $\ell_t(x) = f(x, \xi_t)$ by sampling a training point $\xi_t = (z_{i_t}, y_{i_t})$ uniformly at random from $1$ to $N$. Set $x_1 = 0$ and $\eta = \frac{D}{R \sqrt{T}}$. Noting that the subgradients satisfy $\|g_t\|_2 \le R$, we have that

$$\mathbb{E}\left[F\left(\frac{1}{T} \sum_{t=1}^T x_t\right)\right] \le \min_{\|u\|_2 \le D} \ F(u) + \frac{D R}{\sqrt{T}}.$$
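The whole pipeline of Example 6 can be sketched as follows. Everything specific here is my own choice for illustration: the synthetic data, the radius `D`, the horizon `T`, and the helper names; the projection onto an L2 ball keeps the iterates in the feasible set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, roughly separable data (made up for illustration).
N, d = 200, 5
Z = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
Y = np.sign(Z @ w_true)

R = np.max(np.linalg.norm(Z, axis=1))   # max L2 norm of the samples
D = 2.0                                  # radius of the feasible set (assumption)
T = 5000
eta = D / (R * np.sqrt(T))               # constant learning rate

def project_ball(x, radius):
    """Euclidean projection onto the L2 ball of the given radius."""
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

x = np.zeros(d)
avg = np.zeros(d)
for t in range(T):
    i = rng.integers(N)                  # sample a training point uniformly
    margin = Y[i] * (Z[i] @ x)
    g = np.zeros(d) if margin >= 1 else -Y[i] * Z[i]   # hinge subgradient
    x = project_ball(x - eta * g, D)
    avg += x
avg /= T                                 # online-to-batch: average the iterates

def training_error(w):
    return np.mean(np.maximum(1 - Y * (Z @ w), 0.0))
```

After training, `training_error(avg)` should be well below the value $F(0) = 1$ at the starting point, matching the convergence guarantee.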
In words, we used an OCO algorithm to stochastically optimize a function, transforming the regret guarantee into a convergence rate guarantee.
Next time we will prove the Online-to-Batch theorem and show many more examples on how to use it.
5. History Bits
The specific shape of Theorem 8 is new, but I wouldn’t be surprised if it appeared somewhere in the literature. In particular, the uniform averaging is from (N. Cesa-Bianchi and A. Conconi and Gentile, C., 2004), but it was proposed for the absolute loss in (Blum, A. and Kalai, A. and Langford, J., 1999). The non-uniform averaging, that we will use next time, is from (Zhang, T., 2004), even if there it is not explicitly proposed as an online-to-batch conversion.
Exercise 1 Calculate the subdifferential set of the $\epsilon$-insensitive loss: $\ell(x) = \max(|x - y| - \epsilon, 0)$, for $\epsilon > 0$. It is a loss used in regression problems where we don’t want to penalize predictions $x$ within $\pm \epsilon$ of the correct value $y$.
Exercise 2 Consider Projected Online Subgradient Descent for the example in the previous lecture about the failure of Follow-the-Leader: Can we use it on that problem? Would it guarantee sublinear regret? How would the behaviour of the algorithm differ from FTL’s?
Exercise 3 Implement the algorithm in Example 6 in any language you like: implementing an algorithm is the perfect way to see if you understood all its details.