This post is part of the lecture notes of my class “Introduction to Online Learning” I taught at Boston University during Fall 2019.
You can find the other lectures here.
In this lecture, we will explore the link between Online Learning and Statistical Learning Theory.
1. Agnostic PAC Learning
We now consider a different setting from what we have seen till now. We will assume that we have a prediction strategy parametrized by a vector $\boldsymbol{x} \in \mathbb{R}^d$, and we want to learn the relationship between an input $\boldsymbol{a}$ and its associated label $y$. Moreover, we will assume that $(\boldsymbol{a}, y)$ is drawn from a joint probability distribution $\rho$. Also, we are equipped with a loss function that measures how good our prediction $\hat{y} = h(\boldsymbol{x}, \boldsymbol{a})$ is compared to the true label $y$, that is $\ell(\hat{y}, y)$. So, learning the relationship can be cast as minimizing the expected loss of our predictor:

$$\min_{\boldsymbol{x}} \ \mathbb{E}_{(\boldsymbol{a}, y) \sim \rho}\left[\ell(h(\boldsymbol{x}, \boldsymbol{a}), y)\right].$$
In machine learning terms, the object above is nothing else than the test error of our predictor.
Note that the above setting assumes labeled samples, but we can generalize it even more by considering Vapnik's general setting of learning, where we collapse the prediction function and the loss into a unique function. This allows, for example, to treat supervised and unsupervised learning in the same unified way. So, we want to minimize the risk

$$\min_{\boldsymbol{x} \in V} \ \operatorname{Risk}(\boldsymbol{x}) := \mathbb{E}_{\boldsymbol{z} \sim \rho}\left[\ell(\boldsymbol{x}, \boldsymbol{z})\right],$$

where $\rho$ is an unknown distribution over the samples $\boldsymbol{z} \in Z$ and $\ell: \mathbb{R}^d \times Z \to \mathbb{R}$ is measurable w.r.t. the second argument. In the supervised setting above, $\boldsymbol{z} = (\boldsymbol{a}, y)$ and $\ell(\boldsymbol{x}, \boldsymbol{z}) = \ell(h(\boldsymbol{x}, \boldsymbol{a}), y)$. Also, the set $\mathcal{H}$ of all predictors that can be expressed by vectors $\boldsymbol{x}$ in $V \subseteq \mathbb{R}^d$ is called the hypothesis class.
Example 1. In a linear regression task where the loss is the square loss, we have $\boldsymbol{z} = (\boldsymbol{a}, y)$ and $h(\boldsymbol{x}, \boldsymbol{a}) = \langle \boldsymbol{a}, \boldsymbol{x} \rangle$. Hence, $\ell(\boldsymbol{x}, \boldsymbol{z}) = (\langle \boldsymbol{a}, \boldsymbol{x} \rangle - y)^2$.
Example 2. In linear binary classification where the loss is the hinge loss, we have $\boldsymbol{z} = (\boldsymbol{a}, y)$ with $y \in \{-1, 1\}$ and $h(\boldsymbol{x}, \boldsymbol{a}) = \langle \boldsymbol{a}, \boldsymbol{x} \rangle$. Hence, $\ell(\boldsymbol{x}, \boldsymbol{z}) = \max(1 - y \langle \boldsymbol{a}, \boldsymbol{x} \rangle, 0)$.
Example 3. In binary classification with a neural network with the logistic loss, we have $\boldsymbol{z} = (\boldsymbol{a}, y)$ with $y \in \{-1, 1\}$, and $h(\boldsymbol{x}, \boldsymbol{a})$ is the network corresponding to the weights $\boldsymbol{x}$. Hence, $\ell(\boldsymbol{x}, \boldsymbol{z}) = \ln\left(1 + \exp(-y \, h(\boldsymbol{x}, \boldsymbol{a}))\right)$.
The key difficulty of the above problem is that we don't know the distribution $\rho$. Hence, there is no hope of exactly solving this problem. Instead, we are interested in understanding what is the best we can do if we have access to $T$ samples drawn i.i.d. from $\rho$. In more detail, we want to upper bound the excess risk

$$\operatorname{Risk}(\bar{\boldsymbol{x}}) - \min_{\boldsymbol{x} \in V} \ \operatorname{Risk}(\boldsymbol{x}),$$

where $\bar{\boldsymbol{x}}$ is a predictor that was learned using $T$ samples.
It should be clear that this is just an optimization problem, where we are interested in upper bounding the suboptimality gap. In this view, the objective of machine learning can be considered a particular instance of an optimization problem.
Remark 1. Note that this is not the only way to approach the problem of learning. Indeed, regret minimization is an alternative model of learning. Moreover, another approach would be to try to estimate the distribution $\rho$ and then solve the risk minimization problem; this is the approach usually taken in Statistics. No approach is superior to the others, and each of them has its pros and cons.
Given that we have access to the distribution $\rho$ only through samples drawn from it, any procedure we might use to minimize the risk will be stochastic in nature. This means that we cannot give a deterministic guarantee. Instead, we can try to prove that, with high probability, our minimization procedure returns a solution that is close to the minimizer of the risk. It is also intuitive that the precision and probability we can guarantee must depend on how many samples we draw from $\rho$.
Quantifying the dependency of precision and probability of failure on the number of samples used is the objective of the Agnostic Probably Approximately Correct (PAC) framework, where the keyword "agnostic" refers to the fact that we don't assume anything about the best possible predictor. In more detail, given a precision parameter $\epsilon$ and a probability of failure $\delta$, we are interested in characterizing the sample complexity of the hypothesis class $\mathcal{H}$, which is defined as the number of samples $m$ necessary to guarantee, with probability at least $1 - \delta$, that the best learning algorithm using the hypothesis class $\mathcal{H}$ outputs a solution $\bar{\boldsymbol{x}}$ that has an excess risk upper bounded by $\epsilon$. Note that the sample complexity does not depend on $\rho$, so it is a worst-case measure w.r.t. all the possible distributions. This makes sense if you consider that we know nothing about the distribution $\rho$: if your guarantee holds for the worst distribution, it will also hold for any other distribution. Mathematically, we will say that the hypothesis class is agnostic PAC-learnable if such a sample complexity function exists.
Definition 1. We will say that a function class $\mathcal{H}$ is Agnostic-PAC-learnable if there exist an algorithm $\mathcal{A}$ and a function $m(\epsilon, \delta)$ such that, when $\mathcal{A}$ is run with $m(\epsilon, \delta)$ samples drawn i.i.d. from $\rho$, with probability at least $1 - \delta$ the solution $\bar{\boldsymbol{x}}$ returned by the algorithm has excess risk at most $\epsilon$, for any distribution $\rho$.
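Anticipating the finite-class result we will prove in Theorem 5 below, here is a hedged sketch of what a sample complexity function can look like: for a class of $N$ predictors and losses bounded in $[0, 1]$, $m(\epsilon, \delta) = \lceil 8 \ln(2N/\delta) / \epsilon^2 \rceil$ samples suffice. The function below just evaluates that formula; the constants come from Theorem 5.

```python
import math

def sample_complexity(eps, delta, N):
    """Samples sufficient for a class of N predictors with losses in [0, 1]:
    Theorem 5 gives excess risk <= 2*sqrt(2*ln(2N/delta)/M), which is at
    most eps whenever M >= 8*ln(2N/delta)/eps**2."""
    return math.ceil(8.0 * math.log(2 * N / delta) / eps ** 2)

print(sample_complexity(0.05, 0.01, N=1000))  # on the order of 4e4 samples
```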
Note that the Agnostic PAC learning setting does not say what procedure we should follow to achieve such a sample complexity. The approach most commonly used in machine learning to solve the learning problem is the so-called Empirical Risk Minimization (ERM) problem. It consists of drawing $T$ samples $\boldsymbol{z}_1, \dots, \boldsymbol{z}_T$ i.i.d. from $\rho$ and minimizing the empirical risk:

$$\hat{\boldsymbol{x}} \in \operatorname{argmin}_{\boldsymbol{x} \in V} \ \widehat{\operatorname{Risk}}(\boldsymbol{x}) := \frac{1}{T} \sum_{t=1}^T \ell(\boldsymbol{x}, \boldsymbol{z}_t).$$
In words, ERM is nothing else than minimizing the error on a training set. However, in many interesting cases $\operatorname{Risk}(\hat{\boldsymbol{x}})$ can be very far from the true optimum $\min_{\boldsymbol{x} \in V} \operatorname{Risk}(\boldsymbol{x})$, even with an infinite number of samples! So, we need to modify the ERM formulation in some way, e.g., using a regularization term or a Bayesian prior on $\boldsymbol{x}$, or find conditions under which ERM works.
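For concreteness, here is a minimal ERM sketch for the linear regression of Example 1, on a synthetic distribution invented for illustration; for the square loss, the empirical risk minimizer has a closed-form least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 200, 5
A = rng.normal(size=(T, d))                  # inputs a_1, ..., a_T
x_star = rng.normal(size=d)                  # hypothetical "true" parameter
y = A @ x_star + 0.1 * rng.normal(size=T)    # noisy labels

# ERM: argmin_x (1/T) * sum_t (<a_t, x> - y_t)^2, solved by least squares
x_erm, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.mean((A @ x_erm - y) ** 2))         # empirical risk of the ERM solution
```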
The ERM approach is so widespread that machine learning itself is often wrongly identified with some kind of minimization of the training error. We now show that ERM is not the entire world of ML, by showing that the existence of a no-regret algorithm, that is, an online learning algorithm with sublinear regret, guarantees Agnostic-PAC learnability. In more detail, we will show that an online algorithm with sublinear regret can be used to solve machine learning problems. This is not just a curiosity: for example, this gives rise to computationally efficient parameter-free algorithms, which can be achieved through ERM only by running a two-step procedure, i.e., running ERM with different parameters and selecting the best solution among them.
We already mentioned this possibility when we talked about the online-to-batch conversion, but this time we will strengthen it, proving high probability guarantees rather than guarantees in expectation.
So, we need some more bits on concentration inequalities.
2. Bits on Concentration Inequalities
We will use a concentration inequality to prove the high probability guarantee, but we will need to go beyond sums of i.i.d. random variables. In particular, we will use the concept of martingales.
Definition 2. A sequence of random variables $Z_1, Z_2, \dots$ is called a martingale if for all $t \geq 1$ it satisfies:

$$\mathbb{E}\left[|Z_t|\right] < \infty \quad \text{and} \quad \mathbb{E}\left[Z_{t+1} \mid Z_1, \dots, Z_t\right] = Z_t.$$
Example 4. Consider a fair coin $c_t \in \{-1, 1\}$ and a betting algorithm that bets $|x_t|$ money on each round on the side of the coin equal to $\operatorname{sign}(x_t)$, where $x_t$ may depend only on the past outcomes $c_1, \dots, c_{t-1}$. We win or lose money 1:1, so the total money we won up to round $t$ is $Z_t = \sum_{i=1}^{t} c_i x_i$. $Z_t$ is a martingale. Indeed, we have

$$\mathbb{E}\left[Z_{t+1} \mid Z_1, \dots, Z_t\right] = \mathbb{E}\left[Z_t + c_{t+1} x_{t+1} \mid Z_1, \dots, Z_t\right] = Z_t + \mathbb{E}\left[c_{t+1} x_{t+1} \mid Z_1, \dots, Z_t\right] = Z_t,$$

where we used the fact that $c_{t+1}$ is independent of the past and $\mathbb{E}[c_{t+1}] = 0$.
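The martingale property can also be checked numerically: conditioning on a fixed past, the expected wealth after one more fair flip equals the current wealth. A small sketch, with an arbitrary betting rule that depends only on the past:

```python
import numpy as np

rng = np.random.default_rng(42)

def bet(past):
    # hypothetical strategy: one unit on the most frequent side so far
    return 1.0 if sum(past) >= 0 else -1.0

history = [1, -1, -1, 1, -1]                 # a fixed past of fair coin flips
Z_t = sum(c * bet(history[:i]) for i, c in enumerate(history))

# Estimate E[Z_{t+1} | past] by averaging over many draws of the next coin
c_next = rng.choice([-1, 1], size=100_000)
Z_next = Z_t + c_next * bet(history)
print(Z_t, Z_next.mean())                    # the two values nearly coincide
```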
For bounded martingales, we can prove high probability guarantees similar to the ones for bounded i.i.d. random variables. The following theorem is the key result we will need.
Theorem 3 (Hoeffding-Azuma inequality). Let $Z_1, \dots, Z_T$ be a martingale of $T$ random variables that satisfies $|Z_t - Z_{t+1}| \leq B$ for $t = 1, \dots, T-1$ almost surely. Then, we have

$$\Pr\left[Z_T - Z_1 \geq \epsilon\right] \leq \exp\left(-\frac{\epsilon^2}{2 B^2 (T-1)}\right).$$

Also, the same upper bound holds for $Z_1 - Z_T$.
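As a quick numerical sanity check, consider the betting martingale of Example 4 with unit bets, prepending $Z_0 = 0$ so that the theorem applies to the $T+1$ variables $Z_0, \dots, Z_T$ with $B = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, runs, eps = 100, 200_000, 25.0
coins = rng.choice([-1, 1], size=(runs, T))  # one unit bet on heads each round
Z_T = coins.sum(axis=1)

print((Z_T >= eps).mean())                   # empirical tail, well below the bound
print(np.exp(-eps**2 / (2 * T)))             # Azuma bound exp(-eps^2/(2T)), ~0.044
```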
3. From Regret to Agnostic PAC
We now show how the online-to-batch conversion we introduced before gives us a high probability guarantee for our machine learning problem.
Theorem 4. Let $\operatorname{Risk}(\boldsymbol{x}) = \mathbb{E}_{\boldsymbol{z}}\left[\ell(\boldsymbol{x}, \boldsymbol{z})\right]$, where the expectation is w.r.t. $\boldsymbol{z}$ drawn from $\rho$ with support over some vector space $Z$ and $\ell: \mathbb{R}^d \times Z \to [0, 1]$. Draw $T$ samples $\boldsymbol{z}_1, \dots, \boldsymbol{z}_T$ i.i.d. from $\rho$ and construct the sequence of losses $\ell_t(\boldsymbol{x}) := \ell(\boldsymbol{x}, \boldsymbol{z}_t)$. Run any online learning algorithm over the losses $\ell_t$ to construct the sequence of predictions $\boldsymbol{x}_1, \dots, \boldsymbol{x}_T$. Then, for any fixed $\boldsymbol{u} \in \mathbb{R}^d$, with probability at least $1 - 2\delta$, it holds that

$$\frac{1}{T} \sum_{t=1}^T \operatorname{Risk}(\boldsymbol{x}_t) \leq \operatorname{Risk}(\boldsymbol{u}) + \frac{\operatorname{Regret}_T(\boldsymbol{u})}{T} + 2\sqrt{\frac{2 \ln \frac{1}{\delta}}{T}},$$

where $\operatorname{Regret}_T(\boldsymbol{u}) = \sum_{t=1}^T \ell_t(\boldsymbol{x}_t) - \sum_{t=1}^T \ell_t(\boldsymbol{u})$.
Proof: Define $Z_t = \sum_{i=1}^{t} \left(\operatorname{Risk}(\boldsymbol{x}_i) - \ell_i(\boldsymbol{x}_i)\right)$ for $t = 0, 1, \dots, T$, so that $Z_0 = 0$. We claim that $Z_t$ is a martingale. In fact, we have

$$\mathbb{E}\left[Z_{t+1} \mid \boldsymbol{z}_1, \dots, \boldsymbol{z}_t\right] = Z_t + \mathbb{E}_{\boldsymbol{z}_{t+1}}\left[\operatorname{Risk}(\boldsymbol{x}_{t+1}) - \ell(\boldsymbol{x}_{t+1}, \boldsymbol{z}_{t+1})\right],$$

where we used the fact that $\boldsymbol{x}_{t+1}$ depends only on $\boldsymbol{z}_1, \dots, \boldsymbol{z}_t$, while $\boldsymbol{z}_{t+1}$ is independent of them. Hence, we have

$$\mathbb{E}\left[Z_{t+1} \mid \boldsymbol{z}_1, \dots, \boldsymbol{z}_t\right] = Z_t,$$

that proves our claim.
Moreover, $|Z_t - Z_{t+1}| = |\operatorname{Risk}(\boldsymbol{x}_{t+1}) - \ell_{t+1}(\boldsymbol{x}_{t+1})| \leq 1$, because the losses are bounded in $[0, 1]$. Hence, using Theorem 3 on the $T+1$ random variables $Z_0, \dots, Z_T$, we have

$$\Pr\left[Z_T \geq \epsilon\right] \leq \exp\left(-\frac{\epsilon^2}{2T}\right).$$

This implies that, with probability at least $1 - \delta$, we have

$$Z_T \leq \sqrt{2 T \ln \frac{1}{\delta}},$$

or equivalently

$$\frac{1}{T} \sum_{t=1}^T \operatorname{Risk}(\boldsymbol{x}_t) \leq \frac{1}{T} \sum_{t=1}^T \ell_t(\boldsymbol{x}_t) + \sqrt{\frac{2 \ln \frac{1}{\delta}}{T}}.$$
We now use the definition of regret w.r.t. any fixed $\boldsymbol{u}$, to have

$$\frac{1}{T} \sum_{t=1}^T \ell_t(\boldsymbol{x}_t) = \frac{1}{T} \sum_{t=1}^T \ell_t(\boldsymbol{u}) + \frac{\operatorname{Regret}_T(\boldsymbol{u})}{T}.$$
The last step is to upper bound $\frac{1}{T} \sum_{t=1}^T \ell_t(\boldsymbol{u})$ with high probability with $\operatorname{Risk}(\boldsymbol{u})$. This is easier than the previous upper bound because $\boldsymbol{u}$ is a fixed vector, so the $\ell_t(\boldsymbol{u})$ are i.i.d. random variables, and for sure $\sum_{i=1}^{t} \left(\ell_i(\boldsymbol{u}) - \operatorname{Risk}(\boldsymbol{u})\right)$ forms a martingale. So, reasoning as above, we have that with probability at least $1 - \delta$ it holds that

$$\frac{1}{T} \sum_{t=1}^T \ell_t(\boldsymbol{u}) \leq \operatorname{Risk}(\boldsymbol{u}) + \sqrt{\frac{2 \ln \frac{1}{\delta}}{T}}.$$
Putting everything together and using the union bound, we have the stated bound. $\Box$
The theorem above upper bounds the average risk of the $T$ predictors, while we are interested in producing a single predictor. If the risk is a convex function and $V$ is convex, then we can lower bound the l.h.s. of the inequalities in the theorem with the risk evaluated on the average of the predictors $\boldsymbol{x}_t$. That is, by Jensen's inequality,

$$\operatorname{Risk}\left(\frac{1}{T} \sum_{t=1}^T \boldsymbol{x}_t\right) \leq \frac{1}{T} \sum_{t=1}^T \operatorname{Risk}(\boldsymbol{x}_t).$$
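Here is a minimal sketch of the whole pipeline for a convex risk: online (sub)gradient descent run on losses built from i.i.d. samples, followed by averaging of the iterates. The synthetic distribution, the feasible radius R, and the step sizes are assumptions made for illustration; any no-regret algorithm would work in place of OGD.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, R = 5, 1000, 2.0                       # R: radius of the feasible L2 ball

x_star = rng.normal(size=d)
x_star /= np.linalg.norm(x_star)             # keep the optimum inside the ball

def sample_z():
    a = rng.normal(size=d) / np.sqrt(d)
    return a, np.dot(a, x_star) + 0.1 * rng.normal()

x = np.zeros(d)
iterates = []
for t in range(1, T + 1):
    a, y = sample_z()                        # z_t drawn i.i.d. from rho
    grad = 2.0 * (np.dot(a, x) - y) * a      # gradient of ell_t(x) = (<a,x> - y)^2
    x = x - (R / np.sqrt(t)) * grad          # OGD step, eta_t proportional to 1/sqrt(t)
    norm = np.linalg.norm(x)
    if norm > R:                             # projection onto the feasible ball
        x *= R / norm
    iterates.append(x.copy())

x_bar = np.mean(iterates, axis=0)            # single predictor via averaging
print(np.linalg.norm(x_bar - x_star))        # distance to the risk minimizer
```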
If the risk is not a convex function, we need a way to generate a single solution with small risk. One possibility is to construct a stochastic classifier that samples one of the $\boldsymbol{x}_t$ with uniform probability and predicts with it. For this classifier, we immediately have

$$\operatorname{Risk}(\text{stochastic classifier}) = \frac{1}{T} \sum_{t=1}^T \operatorname{Risk}(\boldsymbol{x}_t),$$

where the expectation in the definition of the risk of the stochastic classifier is also with respect to the random index. Yet another way is to select, among the $T$ predictors, the one with the smallest risk. This works because the average is lower bounded by the minimum. This is easily achieved using $T/2$ samples for the online learning procedure and the remaining $T/2$ samples to generate a validation set to evaluate the solutions and pick the best one. The following theorem shows that selecting the predictor with the smallest empirical risk on a validation set gives us a predictor close to the best one with high probability.
Theorem 5. Let $\boldsymbol{x}_1, \dots, \boldsymbol{x}_N$ be a finite set of predictors and let $\boldsymbol{z}_1, \dots, \boldsymbol{z}_M$ be a dataset of $M$ samples drawn i.i.d. from $\rho$, with losses bounded in $[0, 1]$. Denote by $\hat{\boldsymbol{x}}$ the predictor among them that minimizes the empirical risk $\frac{1}{M} \sum_{j=1}^M \ell(\boldsymbol{x}_i, \boldsymbol{z}_j)$. Then, with probability at least $1 - \delta$, we have

$$\operatorname{Risk}(\hat{\boldsymbol{x}}) \leq \min_{i=1,\dots,N} \ \operatorname{Risk}(\boldsymbol{x}_i) + 2\sqrt{\frac{2 \ln \frac{2N}{\delta}}{M}}.$$
Proof: We want to calculate the probability that the hypothesis that minimizes the validation error is far from the best hypothesis in the set. We cannot do this directly because we don't have the required independence to use a concentration inequality. Instead, we will upper bound the probability that there exists at least one function whose empirical risk is far from its risk. So, denoting by $\widehat{\operatorname{Risk}}(\boldsymbol{x}_i) = \frac{1}{M} \sum_{j=1}^M \ell(\boldsymbol{x}_i, \boldsymbol{z}_j)$ the empirical risk on the $M$ samples, we have

$$\Pr\left[\exists i : \left|\widehat{\operatorname{Risk}}(\boldsymbol{x}_i) - \operatorname{Risk}(\boldsymbol{x}_i)\right| \geq \epsilon\right] \leq \sum_{i=1}^N \Pr\left[\left|\widehat{\operatorname{Risk}}(\boldsymbol{x}_i) - \operatorname{Risk}(\boldsymbol{x}_i)\right| \geq \epsilon\right] \leq 2 N \exp\left(-\frac{M \epsilon^2}{2}\right),$$

where in the last inequality we reasoned as in the proof of Theorem 4: for a fixed $\boldsymbol{x}_i$, the partial sums of $\ell(\boldsymbol{x}_i, \boldsymbol{z}_j) - \operatorname{Risk}(\boldsymbol{x}_i)$ form a martingale with increments bounded by 1. Hence, with probability at least $1 - \delta$, we have that for all $i$

$$\left|\widehat{\operatorname{Risk}}(\boldsymbol{x}_i) - \operatorname{Risk}(\boldsymbol{x}_i)\right| \leq \sqrt{\frac{2 \ln \frac{2N}{\delta}}{M}}.$$

We are now able to upper bound the risk of $\hat{\boldsymbol{x}}$, just using the fact that the above applies to $\boldsymbol{x}^\star = \operatorname{argmin}_{i=1,\dots,N} \operatorname{Risk}(\boldsymbol{x}_i)$ too. So, we have

$$\operatorname{Risk}(\hat{\boldsymbol{x}}) \leq \widehat{\operatorname{Risk}}(\hat{\boldsymbol{x}}) + \sqrt{\frac{2 \ln \frac{2N}{\delta}}{M}} \leq \widehat{\operatorname{Risk}}(\boldsymbol{x}^\star) + \sqrt{\frac{2 \ln \frac{2N}{\delta}}{M}} \leq \operatorname{Risk}(\boldsymbol{x}^\star) + 2\sqrt{\frac{2 \ln \frac{2N}{\delta}}{M}},$$

where in the second inequality we used the fact that $\hat{\boldsymbol{x}}$ minimizes the empirical risk. $\Box$
Using this theorem, we can use $T/2$ samples for the training and $T/2$ samples for the validation. Denoting by $\hat{\boldsymbol{x}}$ the predictor with the best empirical risk on the validation set among the $\boldsymbol{x}_1, \dots, \boldsymbol{x}_{T/2}$ generated during the online procedure, and combining Theorem 4 and Theorem 5, we have, for any fixed $\boldsymbol{u} \in \mathbb{R}^d$, with probability at least $1 - 3\delta$, that

$$\operatorname{Risk}(\hat{\boldsymbol{x}}) \leq \operatorname{Risk}(\boldsymbol{u}) + \frac{2 \operatorname{Regret}_{T/2}(\boldsymbol{u})}{T} + 2\sqrt{\frac{4 \ln \frac{1}{\delta}}{T}} + 2\sqrt{\frac{4 \ln \frac{T}{\delta}}{T}},$$

where we also used the fact that the minimum risk among the $\boldsymbol{x}_t$ is upper bounded by their average risk.
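A short, self-contained sketch of the selection step: given the predictors produced by the online run on the first half of the data, keep the one with the smallest empirical risk on the held-out second half. The random predictors and validation pairs below are placeholders for the $\boldsymbol{x}_t$ and $\boldsymbol{z}_t$ of the real procedure.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
predictors = [rng.normal(size=d) for _ in range(50)]   # stand-in for x_1, ..., x_{T/2}
validation = [(rng.normal(size=d), rng.normal()) for _ in range(50)]  # held-out (a, y) pairs

def empirical_risk(x, data):
    """Validation estimate of the risk under the square loss."""
    return np.mean([(np.dot(a, x) - y) ** 2 for a, y in data])

scores = [empirical_risk(x, validation) for x in predictors]
x_hat = predictors[int(np.argmin(scores))]             # empirical risk minimizer on validation
print(empirical_risk(x_hat, validation))
```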
It is important to note that, with any of the above three methods to select one among the $T$ predictors $\boldsymbol{x}_t$ generated by the online learning procedure, the sample complexity guarantee we get matches the one we would have obtained by ERM, up to polylogarithmic factors. In other words, there is nothing special about ERM compared to the online learning approach to statistical learning. Another important point is that the above guarantee does not imply the existence of online learning algorithms with sublinear regret for any learning problem. It just says that, if such an algorithm exists, it can be used in the statistical setting too.
4. History Bits
Theorem 4 is from (N. Cesa-Bianchi, A. Conconi, and C. Gentile, 2004). Theorem 5 is nothing else than the Agnostic PAC learning guarantee of ERM for hypothesis classes with finite cardinality. (N. Cesa-Bianchi, A. Conconi, and C. Gentile, 2004) also gives an alternative procedure to select a single hypothesis among the $\boldsymbol{x}_t$ generated during the online procedure that does not require splitting the data into training and validation sets. However, the obtained guarantee matches the one we proved here.