*This post is part of the lecture notes of my class “Introduction to Online Learning” I taught at Boston University during Fall 2019.*

* You can find the other lectures here.*

In this lecture, we will explore the link between Online Learning and and Statistical Learning Theory.

**1. Agnostic PAC Learning **

We now consider a different setting from what we have seen till now. We will assume that we have a prediction strategy parametrized by a vector and we want to learn the relationship between an input and its associated label . Moreover, we will assume that is drawn from a joint probability distribution . Also, we are equipped with a loss function that measures how good is our prediction compared to the true label , that is . So, learning the relationship can be cast as minimizing the expected loss of our predictor

In machine learning terms, the object above is nothing else than the *test error* and our predictor.

Note that the above setting assumes labeled samples, but we can generalize it even more considering the *Vapnik’s general setting of learning*, where we collapse the prediction function and the loss in a unique function. This allows, for example, to treat supervised and unsupervised learning in the same unified way. So, we want to minimize the *risk*

where is an unknown distribution over and is measurable w.r.t. the second argument. Also, the set of all predictors that can be expressed by vectors in is called the *hypothesis class*.

Example 1.In a linear regression task where the loss is the square loss, we have and . Hence, .

Example 2.In linear binary classification where the loss is the hinge loss, we have and . Hence, .

Example 3.In binary classification with a neural network with the logistic loss, we have and is the network corresponding to the weights . Hence, .

The key difficulty of the above problem is that we don’t know the distribution . Hence, there is no hope to exactly solve this problem. Instead, we are interested in understanding *what is the best we can do if we have access to samples drawn i.i.d. from *. More in details, we want to upper bound the *excess risk*

where is a predictor that was *learned* using samples.

It should be clear that this is just an optimization problem and we are interested in upper bounding the suboptimality gap. In this view, the objective of machine learning can be considered as a particular optimization problem.

Remark 1.Note that this is not the only way to approach the problem of learning. Indeed, the regret minimization model is an alternative model to learning. Moreover, another approach would be to try to estimate the distribution and then solve the risk minimization problem, the approach usually taken in Statistics. No approach is superior to the other and each of them has its pros and cons.

Given that we have access to the distribution through samples drawn from it, any procedure we might think to use to minimize the risk will be stochastic in nature. This means that we cannot assure a deterministic guarantee. Instead, *we can try to prove that with high probability our minimization procedure will return a solution that is close to the minimizer of the risk*. It is also intuitive that the precision and probability we can guarantee must depend on how many samples we draw from .

Quantifying the dependency of precision and probability of failure on the number of samples used is the objective of the **Agnostic Probably Approximately Correct** (PAC) framework, where the keyword “agnostic” refers to the fact that we don’t assume anything on the best possible predictor. More in details, given a precision parameter and a probability of failure , we are interested in characterizing the *sample complexity of the hypothesis class * that is defined as the number of samples necessary to guarantee with probability at least that the best learning algorithm using the hypothesis class outputs a solution that has an excess risk upper bounded by . Note that the sample complexity does not depend on , so it is a worst-case measure w.r.t. all the possible distributions. This makes sense if you think that we know nothing about the distribution , so if your guarantee holds for the worst distribution it will also hold for any other distribution. Mathematically, we will say that the hypothesis class is agnostic PAC-learnable is such sample complexity function exists.

Definition 1.We will say that a function class isAgnostic-PAC-learnableif there exists an algorithm and a function such that when is used with samples drawn from , with probability at least the solution returned by the algorithm has excess risk at most .

Note that the Agnostic PAC learning setting does not say what is the procedure we should follow to find such sample complexity. The approach most commonly used in machine learning to solve the learning problem is the so-called *Empirical Risk Minimization (ERM) problem*. It consist of drawing samples i.i.d. from and minimizing the *empirical risk*:

In words, ERM is nothing else than minimize the error on a training set. However, in many interesting cases we can have that can be very far from the true optimum , even with an infinite number of samples! So, we need to modify the ERM formulation in some way, e.g., using a *regularization* term or a Bayesian prior of , or find conditions under which ERM works.

The ERM approach is so widespread that machine learning itself is often wrongly identified with some kind of minimization of the training error. We now show that ERM is not the entire world of ML, showing that *the existence of a no-regret algorithm, that is an online learning algorithm with sublinear regret, guarantee Agnostic-PAC learnability*. More in details, we will show that an online algorithm with sublinear regret can be used to solve machine learning problems. This is not just a curiosity, for example this gives rise to computationally efficient parameter-free algorithms, that can be achieved through ERM only running a two-step procedure, i.e. running ERM with different parameters and selecting the best solution among them.

We already mentioned this possibility when we talked about the online-to-batch conversion, but this time we will strengthen it proving high probability guarantees rather than expectation ones.

So, we need some more bits on concentration inequalities.

**2. Bits on Concentration Inequalities **

We will use a concentration inequality to prove the high probability guarantee, but we will need to go beyond the sum of i.i.d.random variables. In particular, we will use the concept of *martingales*.

Definition 2.A sequence of random variables is called amartingaleif for all it satisfies:

Example 4.Consider a fair coin and a betting algorithm that bets money on each round on the side of the coin equal to . We win or lose money 1:1, so the total money we won up to round is . is a martingale. Indeed, we have

For bounded martingales we can prove high probability guarantees as for bounded i.i.d. random variables. The following Theorem will be the key result we will need.

Theorem 3 (Hoeffding-Azuma inequality).Let be a martingale of random variables that satisfy almost surely. Then, we have

Also, the same upper bounds hold on .

**3. From Regret to Agnostic PAC **

We now show how the online-to-batch conversion we introduced before gives us high probability guarantee for our machine learning problem.

Theorem 4.Let , where the expectation is w.r.t. drawn from with support over some vector space and . Draw samples i.i.d. from and construct the sequence of losses . Run any online learning algorithm over the losses , to construct the sequence of predictions . Then, we have with probability at least , it holds that

*Proof:* Define . We claim that is a martingale. In fact, we have

where we used the fact that depends only on Hence, we have

that proves our claim.

Hence, using Theorem 3, we have

This implies that, with probability at least , we have

or equivalently

We now use the definition of regret w.r.t. any , to have

The last step is to upper bound with high probability with . This is easier than the previous upper bound because is a fixed vector, so are i.i.d. random variables, so for sure forms a martingale. So, reasoning as above, we have that with probability at least it holds that

Putting all together and using the union bound, we have the stated bound.

The theorem above upper bounds the average risk of the predictors, while we are interested in producing a single predictor. If the risk is a convex function and is convex, than we can lower bound the l.h.s. of the inequalities in the theorem with the risk evaluated on the average of the . That is

If the risk is not a convex function, we need a way to generate a single solution with small risk. One possibility is to construct a *stochastic classifier* that samples one of the with uniform probability and predicts with it. For this classifier, we immediately have

where the expectation in the definition of the risk of the stochastic classifier is also with respect to the random index. Yet another way, is to select among the predictors, the one with the smallest risk. This works because the average is lower bounded by the minimum. This is easily achieved using samples for the online learning procedure and samples to generate a validation set to evaluate the solution and pick the best one. The following Theorem shows that selecting the predictor with the smallest empirical risk on a validation set will give us a predictor close to the best one with high probability.

Theorem 5.We have a finite set of predictors and a dataset of samples drawn i.i.d. from . Denote by . Then, with probability at least , we have

*Proof:* We want to calculate the probability that the hypothesis that minimizes the validation error is far from the best hypothesis in the set. We cannot do it directly because we don’t have the required independence to use a concentration. Instead, *we will upper bound the probability that there exists at least one function whose empirical risk is far from the risk.* So, we have

Hence, with probability at least , we have that for all

We are now able to upper bound the risk of , just using the fact that the above applies to too. So, we have

where in the last inequality we used the fact that minimizes the empirical risk.

Using this theorem, we can use samples for the training and samples for the validation. Denoting by the predictor with the best empirical risk on the validation set among the generated during the online procedure, we have with probability at least that

It is important to note that with any of the above three methods to select one among the generated by the online learning procedure, the sample complexity guarantee we get matches the one we would have obtained by ERM, up to polylogarithmic factors. In other words, there is nothing special about ERM compared to the online learning approach to statistical learning.

Another important point is that the above guarantee does not imply the existence of online learning algorithms with sublinear regret for any learning problem. It just says that, if it exists, it can be used in the statistical setting too.

**4. History Bits **

Theorem 4 is from (N. Cesa-Bianchi and A. Conconi and Gentile, C. , 2004). Theorem 5 is nothing else than the Agnostic PAC learning guarantee for ERM for hypothesis classes with finite cardinality. (N. Cesa-Bianchi and A. Conconi and Gentile, C. , 2004) gives also an alternative procedure to select a single hypothesis among the generated during the online procedure that does not require splitting the data in training and validation. However, the obtained guarantee matches the one we have proved.