This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week.
You can find the lectures I published till now here.
In this lecture, we will consider the problem of online linear classification. We consider the following setting:
- At each time step we receive a sample
- We output a prediction of the binary label of
- We receive the true label and we see if we did a mistake or not
- We update our online classifier
The aim of the online algorithm is to minimize the number of mistakes it does compared to some best fixed classifier.
We will focus on linear classifiers, that predicts with the sign of the inner product between a vector and the input features . Hence, . This problem can be written again as a regret minimization problem:
where . It should be clear that these losses are non-convex. Hence, we need an alternative way to deal with them. In the following, we will see two possible approaches to this problem.
1. Online Randomized Classifier
As we did for the Learning with Expert Advice framework, we might think to convexify the losses using randomization. Hence, on each round we can predict a number in and output the label according with probability and the label with probability . So, define the random variable
Now observe that . If we consider linear predictors, we can think to have and similarly for the competitor . Constraining both the algorithm and the competitor to the space of vectors where for , we can write
Hence, the surrogate convex loss becomes and the feasible set is any convex set where we have the property for .
Given that this problem is convex, assuming to be bounded w.r.t. some norm, we can use almost any of the algorithms we have seen till now, from Online Mirror Descent to Follow-The-Regularized-Leader (FTRL). All of them would result in regret upper bounds, assuming that are bounded in some norm. The only caveat is to restrict in . One way to do it might be to consider assuming and choose the feasible set .
Putting all together, for example, we can have the following strategy using FTRL with regularizers .
Theorem 1 Let an arbitrary sequence of samples/labels couples where and . Assume , for . Then, running the Randomized Online Linear Classifier algorithm with where , for any we have the following guarantee
Proof: The proof is straightforward from the FTRL regret bound with the chosen increasing regularizer.
2. The Perceptron Algorithm
The above strategy has the shortcoming of restricting the feasible vectors in a possibly very small set. In turn, this could make the performance of the competitor low. In turn, the performance of the online algorithm is only close to the one of the competitor.
Another way to deal with the non-convexity is to compare the number of mistakes that the algorithm does with a convex cumulative loss of the competitor. That is, we can try to prove a weaker regret guarantee:
In particular, the convex loss we consider is powers of the Hinge Loss: . The hinge loss is a convex upper bound to the 0/1 loss and it achieves the value of zero when the sign of the prediction is correct and the magnitude of the inner product is big enough. Moreover, taking powers of it, we get a family of functions that trade-offs the loss for the wrongly classified samples with the one for the correctly classified samples but with a value of , see Figure 1.
The Perceptron algorithm updates the current prediction moving in the direction of the current sample multiplied by its label. Let’s see why this is a good idea. Assume that and the algorithm made a mistake. Then, the updated prediction would predict a more positive number on the same sample . In fact, we have
In the same way, if and the algorithm made a mistake, the update would result in a more negative prediction on the same sample.
For the Perceptron algorithm, we can prove the following guarantee.
Before proving the theorem, let’s take a look to its meaning. If there exists a such that , then the Perceptron algorithm makes a finite number of mistakes upper bounded by . In case that are many that achieves we have that the finite number of mistakes is bounded the norm of the smallest among them. What is the meaning of this quantity?
Remember that a hyperplane represented by its normal vector divides the space in two half spaces: one with the points that give a positive value for the inner product and other one where the same inner product is negative. Now, we have that the distance of a sample from the hyperplane whose normal is is
Also, given that we are considering a that gives cumulative hinge loss zero, we have that that quantity is at least . So, the norm of the minimal that has cumulative hinge loss equal to zero is inversely proportional to the minimum distance between the points and the separating hyperplane. This distance is called the margin of the samples . So, if the margin is small, the Perceptron algorithm can do more mistakes than when the margin is big.
If the problem is not linearly separable, the Perceptron algorithm satisfies a regret of , where is the loss of the competitor. Moreover, we measure the competitor with a family of loss functions and compete with the best measured with the best loss. This adaptivity is achieved through two basic ingredients:
- The Perceptron is independent of scaling of the update by a hypothetical learning rate , in the sense that the mistakes it does are independent of the scaling. That is, we could update with and have the same mistakes and updates because they only depend on the sign of . Hence, we can think as it is always using the best possible learning rate .
- The weakened definition of regret allows to consider a family of loss functions, because the Perceptron is not using any of them in the update.
Let’s now prove the regret guarantee. For the proof, we will need the two following technical lemmas.
Proof: Let , then we have . Solving for we have . Hence, .
Proof: } Denote by the total number of the mistakes of the Perceptron algorithm by .
Now, as said above, does not affect in any way the sign of the predictions, hence the Perceptron algorithm could be run with (2) and its predictions would be exactly the same. Hence, we have
Note that . Also, we have
So, denoting by , we can rewrite (3) as
where we used Holder’s inequality and .
Given that and denoting by , we have
Using the fact that , we have
Finally, using Lemma 4, we have the stated bound.
3. History Bits
The Perceptron was proposed by Rosenblatt (F. Rosenblatt, 1958). The proof of convergence in the non-separable case for is by (C. Gentile, 2003) and for is from (Y. Freund and R. E. Schapire, 1999). The proof presented here is based on the one in (Beygelzimer, A. and Orabona, F. and Zhang, C., 2017).