The Gaptron Algorithm

This time I will describe an online algorithm that improves on the Perceptron algorithm. This is one of those results that I consider fundamental in online learning, yet not widely known enough.

1. The Gaptron Algorithm

We introduce the Gaptron algorithm, a randomized first-order algorithm for online binary and multiclass classification. The key motivation behind Gaptron is to exploit the surrogate gap, which is the difference between the true zero-one loss and the convex surrogate loss function used for optimization. Standard analyses of online learning algorithms bound the regret with respect to the surrogate loss. However, this can be a loose upper bound on the actual number of mistakes. The Gaptron algorithm is designed to tighten this analysis by actively managing the surrogate gap, leading to better mistake bounds. For didactic reasons, in the following we will focus on the binary version of the Gaptron algorithm.

The algorithm maintains a weight vector {{\boldsymbol x}_t} and, at each round, computes a measure of confidence using the features {{\boldsymbol z}_t} as {m_t=\langle {\boldsymbol z}_t, {\boldsymbol x}_t\rangle}. Based on this confidence measure and the loss function used, it decides whether to follow the prediction of the current weight vector or to explore by choosing a label uniformly at random. This exploration is controlled by a gap map, {a_t}, which allows the algorithm to exploit the surrogate gap. Intuitively, the algorithm randomizes its prediction each time it is unsure, and in this way the expected number of mistakes approaches the value of the surrogate loss. Finally, the weight vector is updated using Online Gradient Descent (OGD) on the surrogate loss. Algorithm 1 summarizes the entire procedure.
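To make the procedure concrete, here is a minimal Python sketch of the binary version. This is not the official pseudocode of Algorithm 1: it assumes the smoothed hinge surrogate of Eq. (2) below and the gap map {a_t = \ell(m_t, \hat{y}_t)} from Theorem 1, and all function names are mine.

```python
import numpy as np

def smoothed_hinge(m, y):
    """Smoothed hinge surrogate of Eq. (2) and its derivative in the margin m."""
    p = y * m
    if p < 0:
        return 1.0 - 2.0 * p, -2.0 * y
    if p <= 1:
        return (1.0 - p) ** 2, -2.0 * (1.0 - p) * y
    return 0.0, 0.0

def gaptron(data, eta, seed=0):
    """One pass of binary Gaptron over a list of (z_t, y_t) pairs.

    Predicts sign(<z_t, x_t>) with probability 1 - a_t, and a uniformly
    random label with probability a_t, then runs OGD on the surrogate loss.
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(len(data[0][0]))            # x_1 = 0
    mistakes = 0
    for z, y in data:
        z = np.asarray(z, dtype=float)
        m = float(z @ x)                     # confidence m_t
        y_hat = 1 if m >= 0 else -1          # deterministic prediction
        a = smoothed_hinge(m, y_hat)[0]      # gap map a_t = l(m_t, y_hat_t), in [0, 1]
        y_pred = rng.choice([-1, 1]) if rng.random() < a else y_hat
        mistakes += int(y_pred != y)
        x -= eta * smoothed_hinge(m, y)[1] * z   # OGD step on the surrogate loss
    return x, mistakes
```

With {\|{\boldsymbol z}_t\|_2 \leq R} and the smoothed hinge loss, which is {2}-self-bounded, Theorem 1 suggests the learning rate {\eta = \frac{1}{4R^2}}.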


We now present a theorem on the expected number of mistakes for the Gaptron algorithm when using self-bounded losses, a weaker assumption than smoothness for convex functions.

Definition 1 (Self-bounded Function). Let {f:{\mathbb R}^d \rightarrow {\mathbb R}} be bounded from below and subdifferentiable in a set {\mathcal{V}}. We say that {f} is {s}-self-bounded in {\mathcal{V}} with respect to {\|\cdot\|} if

\displaystyle \|{\boldsymbol g}\|^2_\star \leq 2s (f({\boldsymbol x})-\inf_{{\boldsymbol y} \in {\mathbb R}^d} \ f({\boldsymbol y})), \qquad \forall {\boldsymbol x} \in \mathcal{V}, \forall {\boldsymbol g} \in \partial f({\boldsymbol x})~.

Remark 1. Self-bounded functions are also convex in {\mathcal{V}} because we are assuming that they are subdifferentiable in {\mathcal{V}}.

Clearly, a convex {s}-smooth function is also {s}-self-bounded, but the converse is not true, as shown in the next example.

Example 1. Let {f:{\mathbb R}\rightarrow {\mathbb R}} defined as {f(x)=\frac12 x^2 + |x-1|}. The function {f} is not differentiable in {1}, hence it is not smooth. However, it is easy to verify that it is {4}-self-bounded.

Theorem 1. Let {\ell_t({\boldsymbol x}) = \ell(\langle {\boldsymbol z}_t, {\boldsymbol x} \rangle, y_t)} where {\ell:{\mathbb R}\times \{-1,1 \}\rightarrow {\mathbb R}_{\geq0}}. Assume that

  • {\ell} is {s}-self-bounded in the first argument;
  • {\ell(x, 1)+\ell(x, -1)\geq 2} for all { x\in {\mathbb R}};
  • {\ell(x, \rm sign(x)) \in [0,1]} for {x\neq 0};
  • {\ell(0, 1) \in [0,1]} and {\ell(0, -1) \in [0,1]}.

Let {\|{\boldsymbol z}_t\|_2 \leq R} for all {t}. Set the learning rate {\eta \leq \frac{1}{2 s R^2}} and the gap map {a_t = \ell(\langle {\boldsymbol z}_t, {\boldsymbol x}_t \rangle, \hat{y}_t)}. Then, the expected number of mistakes of the Gaptron algorithm (Algorithm 1) is upper bounded as

\displaystyle \mathop{\mathbb E}\left[\sum_{t=1}^T \boldsymbol{1}[Y'_t \neq y_t]\right] \leq \sum_{t=1}^T \ell_t({\boldsymbol u}) + \frac{\|{\boldsymbol u}-{\boldsymbol x}_1\|_2^2}{2\eta}, \qquad \forall {\boldsymbol u} \in {\mathbb R}^d~.

Proof: Using the fact that self-boundedness implies convexity, the analysis starts from the standard regret bound for Online Subgradient Descent (OSD), which for any {{\boldsymbol u}} gives

\displaystyle \sum_{t=1}^T \ell_t({\boldsymbol x}_t) - \sum_{t=1}^T \ell_t({\boldsymbol u}) \leq \frac{\|{\boldsymbol u}-{\boldsymbol x}_1\|_2^2}{2\eta} + \frac{\eta}{2}\sum_{t=1}^T \|{\boldsymbol g}_t\|_2^2~.

In round {t}, the algorithm predicts {\hat{y}_t} with probability {1-a_t} and a uniformly random label with probability {a_t}. Hence, the expected number of mistakes in round {t}, conditioned on the history up to {t-1}, is

\displaystyle \mathop{\mathbb E}_t[\boldsymbol{1}[Y'_t \neq y_t]] = (1-a_t)\boldsymbol{1}[\hat{y}_t \neq y_t] + \frac{a_t}{2}~.

Hence, we have

\displaystyle \begin{aligned} \sum_{t=1}^T \mathop{\mathbb E}[\boldsymbol{1}[Y'_t \neq y_t]] -\sum_{t=1}^T \ell_t({\boldsymbol u}) &= \sum_{t=1}^T \mathop{\mathbb E}[\boldsymbol{1}[Y'_t \neq y_t]] - \sum_{t=1}^T \ell_t({\boldsymbol x}_t) + \sum_{t=1}^T \ell_t({\boldsymbol x}_t)-\sum_{t=1}^T \ell_t({\boldsymbol u})\\ &\leq \frac{\|{\boldsymbol u}-{\boldsymbol x}_1\|_2^2}{2\eta} + \sum_{t=1}^T \mathop{\mathbb E}\left[\boldsymbol{1}[Y'_t \neq y_t]-\ell_t({\boldsymbol x}_t) +\frac{\eta}{2} \|{\boldsymbol g}_t\|_2^2\right]\\ &= \frac{\|{\boldsymbol u}-{\boldsymbol x}_1\|_2^2}{2\eta} + \sum_{t=1}^T \mathop{\mathbb E}\left[(1-a_t)\boldsymbol{1}[\hat{y}_t \neq y_t] + \frac{a_t}{2}-\ell_t({\boldsymbol x}_t) +\frac{\eta}{2} \|{\boldsymbol g}_t\|_2^2\right]~. \end{aligned}

Let’s analyze the term we call the surrogate gap:

\displaystyle \gamma_t = (1-a_t)\boldsymbol{1}[\hat{y}_t \neq y_t] + \frac{a_t}{2} - \ell_t({\boldsymbol x}_t) + \frac{\eta}{2}\|{\boldsymbol g}_t\|_2^2~.

The theorem is proven if we can show that {\gamma_t \leq 0} for our choice of {a_t} and {\eta}.

The bound on the norm of the subgradient follows from the self-boundedness of {\ell} in its first argument: by the chain rule, {{\boldsymbol g}_t = \ell'(m_t, y_t)\, {\boldsymbol z}_t} for some subderivative {\ell'(m_t, y_t)} of {\ell} in its first argument, so that

\displaystyle \|{\boldsymbol g}_t\|^2_2 \leq 2 s \|{\boldsymbol z}_t\|^2_2 \ell(m_t, y_t) \leq 2 s R^2 \ell(m_t, y_t)~.

Hence, we have two cases.

Case { y_t \neq \hat{y}_t}:

\displaystyle \begin{aligned} \gamma_t &=(1-a_t)\boldsymbol{1}[\hat{y}_t \neq y_t] + \frac{a_t}{2} - \ell_t({\boldsymbol x}_t) + \frac{\eta}{2}\|{\boldsymbol g}_t\|_2^2\\ &\leq 1-\ell(m_t, \hat{y}_t) + \frac12 \ell(m_t, \hat{y}_t) - \ell(m_t, y_t) + \eta s R^2 \ell(m_t,y_t)\\ &= 1-\frac12 \ell(m_t, \hat{y}_t) - \frac12 \ell(m_t, y_t) - \frac12 \ell(m_t, y_t) + \eta s R^2 \ell(m_t, y_t)\\ &\leq - \frac12 \ell(m_t, y_t) + \eta s R^2 \ell(m_t, y_t), \end{aligned}

where in the second inequality we used the assumption on the loss. Hence, we need {\eta\leq \frac{1}{2 s R^2}} to have {\gamma_t \leq 0}.

Case { y_t = \hat{y}_t}:

\displaystyle \begin{aligned} \gamma_t &=(1-a_t)\boldsymbol{1}[\hat{y}_t \neq y_t] + \frac{a_t}{2} - \ell_t({\boldsymbol x}_t) + \frac{\eta}{2}\|{\boldsymbol g}_t\|_2^2\\ &\leq \frac12 \ell(m_t, \hat{y}_t) - \ell(m_t, y_t) + \eta s R^2 \ell(m_t, y_t)\\ &= -\frac12 \ell(m_t, y_t) + \eta s R^2 \ell(m_t, y_t)~. \end{aligned}

Hence, again we need {\eta\leq \frac{1}{2 s R^2}} to have {\gamma_t \leq 0}. \Box

Surprisingly, the use of randomization allows us to obtain a constant regret with respect to the cumulative surrogate loss of any comparator.

Remark 2. Observe that if {\ell(x,y)=f(x y)} for some convex function {f:{\mathbb R} \rightarrow {\mathbb R}}, then {f(0)\geq 1} implies the condition {\ell(x, 1)+\ell(x, -1)\geq 2} for all { x\in {\mathbb R}} by Jensen's inequality.

We now show some examples of binary classification losses that satisfy the assumptions of the theorem.

Example 2. Consider the hinge loss, defined as

\displaystyle \label{eq:hinge} \ell^\text{hinge}(x,y) =\max(1 - y x,0)~. \ \ \ \ \ (1)

We will use the squared hinge loss, {(\ell^\text{hinge})^2}, which satisfies all the assumptions in Theorem 1 and is {2}-self-bounded in its first argument. Hence, the Gaptron using the squared hinge loss and {\eta=\frac{1}{4 R^2}} satisfies

\displaystyle \mathop{\mathbb E}\left[\sum_{t=1}^T \boldsymbol{1}[Y'_t \neq y_t]\right] \leq \sum_{t=1}^T (\ell^\text{hinge}_t)^2({\boldsymbol u}) + 2\|{\boldsymbol u}-{\boldsymbol x}_1\|_2^2 R^2, \qquad \forall {\boldsymbol u} \in {\mathbb R}^d~.

So, if the problem is not linearly separable, we can greatly improve (in expectation) on the bound on the number of mistakes we gave for the Perceptron in terms of the squared hinge loss.
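As a quick sanity check (a small script of mine, not part of the analysis), the {2}-self-boundedness of the squared hinge loss in its first argument can be verified numerically: for {y x < 1} the derivative is {-2y(1-yx)}, so its square is exactly {4} times the loss.

```python
import numpy as np

def sq_hinge(x, y):
    """Squared hinge loss and its derivative in the first argument."""
    h = max(1.0 - y * x, 0.0)
    return h * h, -2.0 * h * y

# s-self-boundedness with s = 2: |l'(x, y)|^2 <= 2 * s * l(x, y), since inf l = 0
for x in np.linspace(-3.0, 3.0, 121):
    for y in (-1, 1):
        loss, grad = sq_hinge(x, y)
        assert grad ** 2 <= 2 * 2 * loss + 1e-12
```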

An even better choice is the smoothed hinge loss:

\displaystyle \label{eq:smooth_hinge} \ell^\text{sh}(x,y) = \begin{cases} 1 - 2 y x , & \text{ if } y x < 0\\ (1 - y x)^2, & \text{ if } 0 \leq y x \leq 1\\ 0, & \text{ if } y x > 1 \end{cases} \ \ \ \ \ (2)

This loss is always less than or equal to the squared hinge loss,  it still satisfies all the assumptions of Theorem 1, and it is {2}-self-bounded in its first argument.

Example 3. One can also use loss functions that give different weights to the errors of the two classes. For example, we can use the following function

\displaystyle \ell^\text{wsh}(x,y) = \begin{cases} 1 - 2 w_y y x , & \text{ if } y x < 0\\ (1 - y x)^2, & \text{ if } 0 \leq y x \leq 1\\ 0, & \text{ if } y x > 1 \end{cases}

where {w_1=2} and {w_{-1}=1}. This loss penalizes errors on class {1} more heavily. In this case, the loss function is {8}-self-bounded (but it is not smooth!) and it still satisfies the assumptions of Theorem 1.
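The {8}-self-boundedness of this weighted loss can also be checked numerically (the weights {w_1=2} and {w_{-1}=1} are the ones above; the constant is tight at {y x \rightarrow 0^-} for class {1}):

```python
import numpy as np

def weighted_sh(x, y, w1=2.0, wm1=1.0):
    """Class-weighted smoothed hinge loss and its derivative in x."""
    w = w1 if y == 1 else wm1
    p = y * x
    if p < 0:
        return 1.0 - 2.0 * w * p, -2.0 * w * y
    if p <= 1:
        return (1.0 - p) ** 2, -2.0 * (1.0 - p) * y
    return 0.0, 0.0

# 8-self-bounded: |l'(x, y)|^2 <= 2 * 8 * l(x, y)
for x in np.linspace(-3.0, 3.0, 121):
    for y in (-1, 1):
        loss, grad = weighted_sh(x, y)
        assert grad ** 2 <= 16.0 * loss + 1e-12
```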

2. History Bits

The Gaptron was introduced in van der Hoeven (2020), in turn based on some of the ideas in Neu and Zhivotovskiy (2020). Van der Hoeven (2020) also described a multiclass variant for different surrogate losses, as well as a bandit variant. The proof here and the conditions in Theorem 1 were developed in collaboration with Dirk van der Hoeven, who was so kind as to allow me to reproduce them here.
More recently, Sakaue et al. (2024) extended the Gaptron algorithm to the structured prediction case, while Sakaue et al. (2025) extended it to the dynamic setting.

Acknowledgments

Thanks to Dirk van der Hoeven for the help in writing a short and general proof, and to Abed Razawy and Valentina Masarotto for feedback and comments.

Exercises

Exercise 1. Use FTRL with an increasing regularizer instead of OSD in the Gaptron to remove the need to know {R} when setting the learning rate, while still allowing unbounded feasible sets. Note that we make use of the knowledge of {{\boldsymbol z}_t} to set the regularizer at time {t}.
