Online Linear Classification

This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week.

You can find the lectures I have published so far here.


In this lecture, we will consider the problem of online linear classification. We consider the following setting:

  • At each time step we receive a sample {{\boldsymbol z}_t}
  • We output a prediction of the binary label {y_t\in \{-1,1\}} of {{\boldsymbol z}_t}
  • We receive the true label {y_t} and we see whether we made a mistake or not
  • We update our online classifier

The aim of the online algorithm is to minimize the number of mistakes it makes compared to the best fixed classifier.

We will focus on linear classifiers, which predict with the sign of the inner product between a vector {{\boldsymbol x}_t} and the input features {{\boldsymbol z}_t}. Hence, {\tilde{y}_t=\rm sign(\langle {\boldsymbol z}_t,{\boldsymbol x}_t\rangle)}. This problem can again be written as a regret minimization problem:

\displaystyle \text{Regret}_T({\boldsymbol u})=\sum_{t=1}^T \ell_t({\boldsymbol x}_t) - \sum_{t=1}^T \ell_t({\boldsymbol u}),

where {\ell_t({\boldsymbol x}) = \boldsymbol{1}[\rm sign(\langle {\boldsymbol z}_t,{\boldsymbol x}\rangle) \neq y_t]}. It should be clear that these losses are non-convex. Hence, we need an alternative way to deal with them. In the following, we will see two possible approaches to this problem.

1. Online Randomized Classifier

As we did for the Learning with Expert Advice framework, we might think of convexifying the losses using randomization. Hence, on each round we can predict a number {q_t \in [-1,1]} and output the label {+1} with probability {\frac{1+q_t}{2}} and the label {-1} with probability {\frac{1-q_t}{2}}. So, define the random variable

\displaystyle \tilde{y}(p) = \begin{cases} +1, & \text{ with probability } p,\\ -1, & \text{ with probability } 1-p~. \end{cases}

Now observe that {\mathop{\mathbb E}_{\tilde{y}_t}[\boldsymbol{1}[\tilde{y}(\frac{1+q_t}{2})\neq y_t]]=\frac{1}{2}|q_t-y_t|}. If we consider linear predictors, we can think of setting {q_t=\langle {\boldsymbol z}_t, {\boldsymbol x}_t\rangle} and similarly for the competitor {q'_t=\langle {\boldsymbol z}_t, {\boldsymbol u}\rangle}. Constraining both the algorithm and the competitor to the space of vectors where {|\langle {\boldsymbol z}_t, {\boldsymbol x}\rangle|\leq 1} for {t=1, \dots,T}, we can write

\displaystyle \mathop{\mathbb E}\left[\sum_{t=1}^T \boldsymbol{1}\left[\tilde{y}(\langle {\boldsymbol z}_t,{\boldsymbol x}_t\rangle) \neq y_t\right] - \sum_{t=1}^T \boldsymbol{1}\left[\tilde{y}(\langle {\boldsymbol z}_t,{\boldsymbol u}\rangle) \neq y_t\right]\right] = \mathop{\mathbb E}\left[\sum_{t=1}^T \frac12 \left|\langle {\boldsymbol z}_t,{\boldsymbol x}_t\rangle - y_t\right| - \sum_{t=1}^T \frac12 \left|\langle {\boldsymbol z}_t, {\boldsymbol u}\rangle - y_t\right|\right]~.

Hence, the surrogate convex loss becomes {\tilde{\ell}_t({\boldsymbol x}) = \frac{1}{2}|\langle {\boldsymbol z}_t, {\boldsymbol x}\rangle - y_t|} and the feasible set is any convex set where we have the property {|\langle {\boldsymbol z}_t,{\boldsymbol x}\rangle|\leq 1} for {t=1,\dots,T}.
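
As a quick sanity check of the identity {\mathop{\mathbb E}[\boldsymbol{1}[\tilde{y}(\frac{1+q_t}{2})\neq y_t]]=\frac{1}{2}|q_t-y_t|}, here is a small Python simulation (the function name and the chosen values of {q} and {y} are my own, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_label(q, rng):
    """Draw +1 with probability (1+q)/2 and -1 otherwise, for q in [-1, 1]."""
    return 1 if rng.random() < (1 + q) / 2 else -1

q, y = 0.3, -1                          # a prediction in [-1, 1] and a true label
n = 100_000
mistakes = sum(randomized_label(q, rng) != y for _ in range(n))
print(mistakes / n, 0.5 * abs(q - y))   # the two numbers should be close (both ~0.65)
```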

Given that this problem is convex, assuming the {{\boldsymbol z}_t} to be bounded w.r.t. some norm, we can use almost any of the algorithms we have seen so far, from Online Mirror Descent to Follow-The-Regularized-Leader (FTRL). All of them would give {O(\sqrt{T})} regret upper bounds. The only caveat is to restrict {\langle {\boldsymbol z}_t,{\boldsymbol x}_t\rangle} to {[-1,1]}. One way to do it is to assume {\|{\boldsymbol z}_t\|_\star\leq R} and choose the feasible set {V=\{{\boldsymbol x} \in{\mathbb R}^d : \|{\boldsymbol x}\|\leq \frac{1}{R}\}}.

Putting it all together, for example, we can use the following strategy based on FTRL with the regularizers {\psi_t({\boldsymbol x})=\frac{\sqrt{2t}}{4}\|{\boldsymbol x}\|_2^2}.


Algorithm 1. Randomized Online Linear Classifier.
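
The exact pseudocode is in the algorithm box above. As a complement, here is a minimal Python sketch of one possible instantiation, using projected online subgradient descent on the surrogate losses with the stepsizes of Theorem 1 and projection onto the ball of radius {\frac{1}{R}}; the function and variable names are my own, and this is not necessarily identical to the FTRL-based version in the box:

```python
import numpy as np

def randomized_online_linear_classifier(samples, R, rng):
    """Sketch: projected online subgradient descent on the surrogate
    losses 0.5 * |<z_t, x> - y_t| over the ball {x : ||x||_2 <= 1/R}.

    samples: list of (z, y) pairs with ||z||_2 <= R and y in {-1, +1}.
    """
    d = len(samples[0][0])
    x = np.zeros(d)
    mistakes = 0
    for t, (z, y) in enumerate(samples, start=1):
        q = float(np.dot(z, x))            # in [-1, 1] thanks to the projection below
        y_hat = 1 if rng.random() < (1 + q) / 2 else -1   # randomized prediction
        mistakes += (y_hat != y)
        eta = np.sqrt(2) / np.sqrt(t)      # stepsize as in Theorem 1
        g = 0.5 * np.sign(q - y) * z       # subgradient of 0.5 * |<z, x> - y|
        x = x - eta * g
        norm = np.linalg.norm(x)
        if norm > 1.0 / R:                 # project back onto {x : ||x||_2 <= 1/R}
            x = x / (norm * R)
    return mistakes
```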

Theorem 1 Let {({\boldsymbol z}_t, y_t)_{t=1}^T} be an arbitrary sequence of sample/label pairs where {({\boldsymbol z}_t, y_t) \in X \times \{-1,1\}} and {X\subset {\mathbb R}^d}. Assume {\|{\boldsymbol z}_t\|_2\leq R}, for {t=1, \dots,T}. Then, running the Randomized Online Linear Classifier algorithm with {\eta_t=\frac{\sqrt{2}}{\sqrt{t}}}, for any {{\boldsymbol u} \in \{{\boldsymbol x} \in {\mathbb R}^d : \|{\boldsymbol x}\|_2\leq \frac{1}{R}\}} we have the following guarantee

\displaystyle \mathop{\mathbb E}\left[\sum_{t=1}^T \boldsymbol{1}[\tilde{y}(\langle{\boldsymbol z}_t,{\boldsymbol x}_t\rangle) \neq y_t]\right] - \mathop{\mathbb E}\left[\sum_{t=1}^T \boldsymbol{1}[\tilde{y}(\langle{\boldsymbol z}_t,{\boldsymbol u}\rangle) \neq y_t] \right] \leq \sqrt{2 T}~.

Proof: The proof is straightforward from the FTRL regret bound with the chosen increasing regularizer. \Box

2. The Perceptron Algorithm

The above strategy has the shortcoming of restricting the feasible vectors to a possibly very small set. This could make the performance of the competitor poor and, since the online algorithm is only guaranteed to be close to the competitor, the performance of the online algorithm would be poor as well.

Another way to deal with the non-convexity is to compare the number of mistakes that the algorithm does with a convex cumulative loss of the competitor. That is, we can try to prove a weaker regret guarantee:

\displaystyle \label{eq:weak_regret} \sum_{t=1}^T \boldsymbol{1}[y_t \neq \tilde{y}_t] - \sum_{t=1}^T \ell(\langle {\boldsymbol z}_t, {\boldsymbol u}\rangle, y_t) = O\left(\sqrt{T}\right)~. \ \ \ \ \ (1)

In particular, the convex losses we consider are powers of the Hinge Loss: {\ell_q(\tilde{y}, y)=\max(1-\tilde{y} y,0)^q}. The hinge loss is a convex upper bound to the 0/1 loss and it achieves the value of zero when the sign of the prediction is correct and the magnitude of the inner product is big enough. Moreover, taking powers of it, we get a family of functions that trade off the loss on the wrongly classified samples against the loss on the samples that are correctly classified but with {|\langle {\boldsymbol z}_t, {\boldsymbol x}\rangle|\leq 1}, see Figure 1.

Figure 1. Powers of the Hinge Loss.
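
A plot like Figure 1 can be reproduced with a few lines of Python; the range of the margin and the chosen values of {q} below are my own guesses, not necessarily the ones used in the figure:

```python
import numpy as np
import matplotlib.pyplot as plt

margins = np.linspace(-2, 2, 400)                    # values of y * <z, x>
for q in (1, 2, 3):
    plt.plot(margins, np.maximum(1 - margins, 0) ** q, label=f"q = {q}")
plt.plot(margins, (margins <= 0).astype(float), "k--", label="0/1 loss")
plt.xlabel("y <z, x>")
plt.ylabel("loss")
plt.legend()
plt.show()
```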

The oldest algorithm to minimize the modified regret in (1) is the Perceptron algorithm, shown in Algorithm 2.


Algorithm 2. The Perceptron Algorithm.
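
In code, the update in the box amounts to a few lines; here is a minimal Python sketch (ties at zero are broken towards {+1}, and the function name is mine):

```python
import numpy as np

def perceptron(samples):
    """Run the Perceptron on a sequence of (z, y) pairs with y in {-1, +1}."""
    d = len(samples[0][0])
    x = np.zeros(d)                              # x_1 = 0
    mistakes = 0
    for z, y in samples:
        y_hat = 1 if np.dot(z, x) >= 0 else -1   # predict sign(<z_t, x_t>)
        if y_hat != y:                           # update only when we make a mistake
            mistakes += 1
            x = x + y * z                        # x_{t+1} = x_t + y_t z_t
    return x, mistakes
```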

The Perceptron algorithm updates the current predictor {{\boldsymbol x}_t} by moving in the direction of the current sample multiplied by its label. Let’s see why this is a good idea. Assume that {y_t=1} and the algorithm made a mistake. Then, the updated predictor {{\boldsymbol x}_{t+1}} would predict a more positive number on the same sample {{\boldsymbol z}_t}. In fact, we have

\displaystyle \langle {\boldsymbol z}_t, {\boldsymbol x}_{t+1}\rangle = \langle {\boldsymbol z}_t, {\boldsymbol x}_t + y_t {\boldsymbol z}_t\rangle = \langle {\boldsymbol z}_t, {\boldsymbol x}_t\rangle + \|{\boldsymbol z}_t\|^2_2~.

In the same way, if {y_t=-1} and the algorithm made a mistake, the update would result in a more negative prediction on the same sample.

For the Perceptron algorithm, we can prove the following guarantee.

Theorem 2 Let {({\boldsymbol z}_t, y_t)_{t=1}^T} be an arbitrary sequence of sample/label pairs where {({\boldsymbol z}_t, y_t) \in X \times \{-1,1\}} and {X\subset {\mathbb R}^d}. Assume {\|{\boldsymbol z}_t\|_2\leq R}, for {t=1, \dots,T}. Then, running the Perceptron algorithm we have the following guarantee

\displaystyle \sum_{t=1}^T \boldsymbol{1}[y_t \neq \tilde{y}_t] - \sum_{t=1}^T \ell_q(\langle {\boldsymbol z}_t, {\boldsymbol u}\rangle, y_t) \leq \frac{q^2 R^2 \|{\boldsymbol u}\|^2_2}{2} + q R\|{\boldsymbol u}\|_2\sqrt{\frac{q^2 R^2 \|{\boldsymbol u}\|^2_2}{4} + \sum_{t=1}^T \ell_q(\langle {\boldsymbol z}_t, {\boldsymbol u}\rangle, y_t)},\ \forall {\boldsymbol u} \in {\mathbb R}^d, q\geq 1~.

Before proving the theorem, let’s take a look at its meaning. If there exists a {{\boldsymbol u} \in {\mathbb R}^d} such that {\sum_{t=1}^T \ell_q(\langle {\boldsymbol z}_t, {\boldsymbol u}\rangle, y_t)=0}, then the Perceptron algorithm makes a finite number of mistakes upper bounded by {R^2 \|{\boldsymbol u}\|^2_2} (take {q=1} in the bound). In case there are many {{\boldsymbol u}} that achieve {\sum_{t=1}^T \ell_q(\langle {\boldsymbol z}_t, {\boldsymbol u}\rangle, y_t)=0}, the number of mistakes is bounded by {R^2 \|{\boldsymbol u}\|^2_2} for the {{\boldsymbol u}} with the smallest norm among them. What is the meaning of this quantity?

Remember that a hyperplane represented by its normal vector {{\boldsymbol u}} divides the space into two half-spaces: one with the points {{\boldsymbol z}} that give a positive value for the inner product {\langle {\boldsymbol z}, {\boldsymbol u}\rangle} and the other one where the same inner product is negative. Now, we have that the distance of a sample {{\boldsymbol z}_t} from the hyperplane whose normal is {{\boldsymbol u}} is

\displaystyle \frac{|\langle {\boldsymbol z}_t, {\boldsymbol u}\rangle|}{\|{\boldsymbol u}\|_2}~.

Also, given that we are considering a {{\boldsymbol u}} that gives cumulative hinge loss zero, we have that this quantity is at least {\frac{1}{\|{\boldsymbol u}\|_2}}. So, the norm of the minimal {{\boldsymbol u}} that has cumulative hinge loss equal to zero is inversely proportional to the minimum distance between the points and the separating hyperplane. This distance is called the margin of the samples {({\boldsymbol z}_t,y_t)_{t=1}^T}. So, if the margin is small, the Perceptron algorithm can make more mistakes than when the margin is big.
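
To make this quantitative, suppose the samples are linearly separable with margin {\gamma=\min_{t} \frac{y_t \langle {\boldsymbol z}_t, {\boldsymbol u}^\star\rangle}{\|{\boldsymbol u}^\star\|_2}>0} for some {{\boldsymbol u}^\star}. Then the rescaled vector {{\boldsymbol u}=\frac{{\boldsymbol u}^\star}{\gamma \|{\boldsymbol u}^\star\|_2}} satisfies {y_t\langle {\boldsymbol z}_t, {\boldsymbol u}\rangle\geq 1} for all {t}, hence it has zero cumulative hinge loss and norm {\frac{1}{\gamma}}. Plugging it into the bound of Theorem 2 with {q=1} recovers the classic Perceptron mistake bound

\displaystyle \sum_{t=1}^T \boldsymbol{1}[y_t \neq \tilde{y}_t] \leq \frac{R^2}{\gamma^2}~.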

If the problem is not linearly separable, the Perceptron algorithm satisfies a regret of {O(\sqrt{L^\star})}, where {L^\star} is the loss of the competitor. Moreover, we measure the competitor with a family of loss functions and compete with the best {{\boldsymbol u}} measured with the best loss. This adaptivity is achieved through two basic ingredients:

  • The Perceptron is invariant to the scaling of the update by a hypothetical learning rate {\eta}, in the sense that the mistakes it makes are independent of the scaling. That is, we could update with {{\boldsymbol x}_{t+1}={\boldsymbol x}_t + \eta \boldsymbol{1}[y_t\neq \tilde{y}_t] y_t{\boldsymbol z}_t} and have the same mistakes and updates, because the predictions {\tilde{y}_t} depend only on the sign of {\langle {\boldsymbol z}_t, {\boldsymbol x}_t\rangle}, which is unchanged by the scaling. Hence, we can think of it as always using the best possible learning rate {\eta}.
  • The weakened definition of regret allows us to consider a family of loss functions, because the Perceptron is not using any of them in the update.

Let’s now prove the regret guarantee. For the proof, we will need the two following technical lemmas.

Lemma 3 (F. Cucker and D. X. Zhou, 2007, Lemma 10.17) Let {p,q>1} be such that {\frac{1}{p}+\frac{1}{q}=1}. Then

\displaystyle a b \leq \frac{1}{q} a^q + \frac{1}{p} b^p , \ \forall a,b\geq 0~.

Lemma 4 Let {a,c>0} and {x\geq0} be such that {x - a\sqrt{x} \leq c}. Then, {x \leq c + \frac{a^2}{2}+a\sqrt{\frac{a^2}{4}+c}}.

Proof: Let {y=\sqrt{x}}, then we have {y^2 - a y - c \leq 0}. Solving for {y} we have {y \leq \frac{a+\sqrt{a^2+4c}}{2}}. Hence, {x = y^2 \leq \frac{a^2 + 2a\sqrt{a^2 + 4 c}+a^2 +4c}{4}=\frac{a^2}{2}+\frac{a}{2}\sqrt{a^2+4c}+c}. \Box

Proof: Denote the total number of mistakes of the Perceptron algorithm by {M=\sum_{t=1}^T \boldsymbol{1}[y_t \neq \tilde{y}_t]}.

First, note that the Perceptron algorithm can be thought as running Online Subgradient Descent (OSD) with a fixed stepsize {\eta} over the losses {\tilde{\ell}_t({\boldsymbol x})=-\langle \boldsymbol{1}[y_t\neq \tilde{y}_t] y_t {\boldsymbol z}_t, {\boldsymbol x}\rangle} over {V={\mathbb R}^d}. Indeed, OSD over such losses would update

\displaystyle \label{eq:proof_perc_eq1} {\boldsymbol x}_{t+1}={\boldsymbol x}_t + \eta\boldsymbol{1}[y_t\neq \tilde{y}_t] y_t{\boldsymbol z}_t~. \ \ \ \ \ (2)

Now, as said above, {\eta} does not affect in any way the sign of the predictions, hence the Perceptron algorithm could be run with (2) and its predictions would be exactly the same. Hence, we have

\displaystyle \sum_{t=1}^T -\langle \boldsymbol{1}[y_t\neq \tilde{y}_t] y_t {\boldsymbol z}_t, {\boldsymbol x}_t\rangle + \sum_{t=1}^T \langle \boldsymbol{1}[y_t\neq \tilde{y}_t] y_t {\boldsymbol z}_t, {\boldsymbol u}\rangle \leq \frac{\|{\boldsymbol u}\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^T \boldsymbol{1}[y_t\neq \tilde{y}_t] \|{\boldsymbol z}_t\|^2_2, \ \forall \eta>0~.

Given that this inequality holds for any {\eta>0}, we can choose the one that minimizes the r.h.s., to have

\displaystyle \label{eq:proof_perc_eq2} \sum_{t=1}^T -\langle \boldsymbol{1}[y_t\neq \tilde{y}_t] y_t {\boldsymbol z}_t, {\boldsymbol x}_t\rangle + \sum_{t=1}^T \langle \boldsymbol{1}[y_t\neq \tilde{y}_t] y_t {\boldsymbol z}_t, {\boldsymbol u}\rangle \leq \|{\boldsymbol u}\| \sqrt{\sum_{t=1}^T \boldsymbol{1}[y_t\neq \tilde{y}_t] \|{\boldsymbol z}_t\|^2_2} \leq \|{\boldsymbol u}\| R\sqrt{\sum_{t=1}^T \boldsymbol{1}[y_t\neq \tilde{y}_t]}~. \ \ \ \ \ (3)

Note that {-\boldsymbol{1}[y_t\neq \tilde{y}_t]y_t\langle {\boldsymbol z}_t, {\boldsymbol x}_t\rangle\geq 0}. Also, we have

\displaystyle \langle y_t {\boldsymbol z}_t, {\boldsymbol u}\rangle =1-(1-\langle y_t {\boldsymbol z}_t, {\boldsymbol u}\rangle) \geq 1 - \max(1-\langle y_t {\boldsymbol z}_t, {\boldsymbol u}\rangle,0) = 1 - \ell(\langle {\boldsymbol z}_t,{\boldsymbol u}\rangle,y_t)~.

So, denoting {\tau_t=\boldsymbol{1}[y_t\neq \tilde{y}_t]}, we can rewrite (3) as

\displaystyle \begin{aligned} \sum_{t=1}^T \tau_t &\leq \|{\boldsymbol u}\| R\sqrt{\sum_{t=1}^T \tau_t } + \sum_{t=1}^T \tau_t \ell(\langle {\boldsymbol z}_t,{\boldsymbol u}\rangle,y_t) \\ &\leq \|{\boldsymbol u}\| R\sqrt{\sum_{t=1}^T \tau_t } + \left(\sum_{t=1}^T \tau_t^p\right)^{1/p} \left(\sum_{t=1}^T\ell(\langle {\boldsymbol z}_t,{\boldsymbol u}\rangle,y_t)^q\right)^{1/q} \\ &= \|{\boldsymbol u}\| R\sqrt{\sum_{t=1}^T \tau_t } + \left(\sum_{t=1}^T \tau_t\right)^{1/p} \left(\sum_{t=1}^T\ell(\langle {\boldsymbol z}_t,{\boldsymbol u}\rangle,y_t)^q\right)^{1/q}, \end{aligned}

where we used Hölder’s inequality with {\frac{1}{p}+\frac{1}{q}=1}, and the fact that {\tau_t\in\{0,1\}} implies {\tau_t^p=\tau_t}.

Given that {M=\sum_{t=1}^T \tau_t} and denoting {L_q=\sum_{t=1}^T\ell(\langle {\boldsymbol z}_t,{\boldsymbol u}\rangle,y_t)^q}, we have

\displaystyle \begin{aligned} M &\leq \|{\boldsymbol u}\| R\sqrt{M} + M^{1/p} L_q^{1/q}~. \end{aligned}

Let’s now consider two cases. For {q=1}, we have {p=\infty} and {M^{1/p}=1}, so {M \leq \|{\boldsymbol u}\| R\sqrt{M} + L_1} and Lemma 4 gives the stated bound. Instead, for {q>1}, using Lemma 3 we have

\displaystyle \begin{aligned} M &\leq \|{\boldsymbol u}\| R\sqrt{M} + M^{1/p} L_q^{1/q} \leq \|{\boldsymbol u}\| R\sqrt{M} + \frac{1}{p} M + \frac{1}{q} L_q, \end{aligned}

that implies

\displaystyle M\left(1-\frac{1}{p} \right)\leq \|{\boldsymbol u}\| R\sqrt{M} + \frac{1}{q} L_q~.

Using the fact that {1-\frac{1}{p}=\frac{1}{q}}, we have

\displaystyle M \leq q \|{\boldsymbol u}\| R\sqrt{M} + L_q~.

Finally, using Lemma 4, we have the stated bound. \Box

3. History Bits

The Perceptron was proposed by Rosenblatt (F. Rosenblatt, 1958). The proof of convergence in the non-separable case for {q=1} is by (C. Gentile, 2003) and for {q=2} is from (Y. Freund and R. E. Schapire, 1999). The proof presented here is based on the one in (Beygelzimer, A. and Orabona, F. and Zhang, C., 2017).
