This is the first of three posts to show that online learning is more than just online learning!
Each of these posts will show how we can go from online learning to X, where X will be something completely unrelated to online learning. These results are based on the fact that a linear regret proof is, in its essence, a statement about an arbitrary sequence of vectors, so it can be used in other settings as well.
The results I’ll show in these posts are not new, but they are somehow not as widely known as they should be.
1. From Linear Regret to Rademacher Complexity
Previously, we used online-to-batch conversions to turn regret bounds into guarantees on the risk. A different (and extremely useful) way to quantify the statistical difficulty of a hypothesis class is through its Rademacher complexity. It measures how well the class can correlate with pure noise, and it is the key quantity behind many uniform convergence and generalization results. Here, we show a simple reduction that upper bounds the Rademacher complexity of a class using the regret guarantee of an online learning algorithm.
We consider Vapnik’s general setting of learning for generic function classes. Hence, we seek to minimize

$$\min_{f \in \mathcal{F}} \ \mathbb{E}_{z \sim \rho}[f(z)], \qquad (1)$$

where $\mathcal{F}$ is a class of real-valued functions on $\mathcal{Z}$ and $\rho$ is a distribution over $\mathcal{Z}$. One common way to solve this problem is through empirical risk minimization, that is, minimizing the training error. We gather a sample of size $n$, $S = (z_1, \dots, z_n)$, drawn i.i.d. from $\rho$, and then we solve the empirical version of (1):

$$\hat{f} \in \operatorname*{argmin}_{f \in \mathcal{F}} \ \frac{1}{n} \sum_{i=1}^n f(z_i).$$

It is now natural to ask if $\hat{f}$ will also approximately minimize the objective function in (1).
One way to study this problem is through the concept of Rademacher complexity of a function class.
Definition 1 (Rademacher complexity). Let $S = (z_1, \dots, z_n)$ be an arbitrary set of $n$ vectors in some domain $\mathcal{Z}$. Let $\mathcal{F}$ be a class of real-valued functions on $\mathcal{Z}$. Let $\sigma_1, \dots, \sigma_n$ be i.i.d. Rademacher random variables, that is, $\Pr[\sigma_i = 1] = \Pr[\sigma_i = -1] = \tfrac{1}{2}$. The empirical Rademacher complexity of $\mathcal{F}$ on $S$ is defined as

$$\mathrm{Rad}(\mathcal{F} \circ S) = \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{F}} \ \frac{1}{n} \sum_{i=1}^n \sigma_i f(z_i)\right]. \qquad (2)$$
Intuitively, if $\mathrm{Rad}(\mathcal{F} \circ S)$ is small, then no function in the class can fit random signs too well on the sample. When this happens, we can expect the minimizers of the empirical risk to be close to the minimizers of the true risk, because the class is less likely to be fooled by random fluctuations of the training data.
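To make the definition concrete, here is a minimal sketch that computes the empirical Rademacher complexity of a tiny finite class exactly, by enumerating all $2^n$ sign patterns. The three "functions" (represented by their value vectors on the sample) and the sample size are made up purely for illustration.

```python
import itertools

# Hypothetical finite class: each row is (f_j(z_1), ..., f_j(z_n)) on a fixed sample.
n = 8
values = [
    [1.0] * n,                          # constant function
    [(-1.0) ** i for i in range(n)],    # alternating signs
    [0.5] * 4 + [-0.5] * 4,             # step function
]

# Rad(F o S) = E_sigma[ sup_f (1/n) sum_i sigma_i f(z_i) ]. With n = 8 we can
# enumerate all 2^n sign patterns and compute the expectation exactly.
total = 0.0
for sigma in itertools.product([-1, 1], repeat=n):
    total += max(sum(s * v for s, v in zip(sigma, f)) / n for f in values)
rad = total / 2 ** n

print(f"empirical Rademacher complexity: {rad:.4f}")
```

Since every value is bounded by 1 in absolute value, the complexity must land in $[0, 1]$; a richer class (more rows, better able to match arbitrary signs) would push it closer to 1.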
This idea can be made formal and yields generalization guarantees; for example, under mild boundedness assumptions, bounds of the form

$$\sup_{f \in \mathcal{F}} \left(\mathbb{E}_{z \sim \rho}[f(z)] - \frac{1}{n}\sum_{i=1}^n f(z_i)\right) \le 2\,\mathrm{Rad}(\mathcal{F} \circ S) + O\left(\sqrt{\frac{\ln \frac{1}{\delta}}{n}}\right)$$

with probability at least $1-\delta$. Given that the above bound holds uniformly over $\mathcal{F}$, it also applies to the empirical risk minimizer $\hat{f}$.
Moreover, one can easily obtain a generalization guarantee that depends on $\hat{f}$ itself, rather than being uniform over a single class. It is enough to consider a countable nested family $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots$ whose union covers the entire space of possible outcomes of the empirical risk minimization process. Then, apply the uniform bound to each $\mathcal{F}_k$ with a union bound, and choose the class with the smallest Rademacher complexity containing the empirical risk minimizer.
We will not prove such results here; instead, we focus on the following question: how can we upper bound $\mathrm{Rad}(\mathcal{F} \circ S)$ for interesting classes?
Here, we consider function classes obtained by composing Lipschitz functions with linear predictors. A typical example is the following: given a class $\mathcal{H}$ of predictors $h : \mathcal{X} \to \mathbb{R}$ and a loss $\ell : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$, we obtain the induced function class

$$\mathcal{F} = \{ (x, y) \mapsto \ell(h(x), y) : h \in \mathcal{H} \},$$

where $z = (x, y)$. Hence, we would like to control $\mathrm{Rad}(\mathcal{F} \circ S)$. A common situation is that $\ell$ is Lipschitz in its first argument, say $L$-Lipschitz for all $y \in \mathcal{Y}$. In this case, it is intuitive that a Lipschitz loss cannot increase the ability to correlate with noise by more than a factor of $L$. This is formalized by contraction inequalities, as in the next lemma.
Lemma 2 (Contraction Lemma). Let $\phi_i : \mathbb{R} \to \mathbb{R}$, $i = 1, \dots, n$, be $L$-Lipschitz. Consider the function class $\phi \circ \mathcal{F}$ obtained by applying $\phi_i$ to the $i$-th coordinate. Then, for any sample $S = (z_1, \dots, z_n)$, we have

$$\mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{F}} \ \frac{1}{n} \sum_{i=1}^n \sigma_i \, \phi_i(f(z_i))\right] \le L \cdot \mathrm{Rad}(\mathcal{F} \circ S).$$
Hence, in our setting it suffices to control the complexity of linear functions, which we do next.
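As a quick numerical sanity check of the contraction lemma, the sketch below compares the two complexities exactly on a tiny sample. The finite class is made up for illustration, and $\phi = \tanh$ is a $1$-Lipschitz function with $\phi(0) = 0$, so the lemma predicts $\mathrm{Rad}(\phi \circ \mathcal{F} \circ S) \le \mathrm{Rad}(\mathcal{F} \circ S)$.

```python
import itertools
import math

n = 8
# Illustrative finite class: each row lists a function's values on the sample.
F = [
    [0.9, -0.3, 0.5, -0.7, 0.2, -0.1, 0.8, -0.6],
    [-0.4, 0.4, -0.4, 0.4, -0.4, 0.4, -0.4, 0.4],
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
]
phi_F = [[math.tanh(v) for v in f] for f in F]  # tanh is 1-Lipschitz, tanh(0) = 0

def rademacher(cls):
    """Exact empirical Rademacher complexity, enumerating all 2^n sign patterns."""
    total = 0.0
    for sigma in itertools.product([-1, 1], repeat=n):
        total += max(sum(s * v for s, v in zip(sigma, f)) / n for f in cls)
    return total / 2 ** n

rad_F = rademacher(F)
rad_phi_F = rademacher(phi_F)
print(f"Rad(phi o F o S) = {rad_phi_F:.4f} <= Rad(F o S) = {rad_F:.4f}")
```

The inequality holds for every $L$-Lipschitz contraction, not just for this example; the enumeration merely confirms it on one instance.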
2. Rademacher Complexity of Linear Classes through Regret Guarantees
We now state the core reduction. Let $V \subseteq \mathbb{R}^d$ be a non-empty, bounded, closed set and consider the linear class

$$\mathcal{F}_V = \{ z \mapsto \langle z, x \rangle : x \in V \}.$$

Fix a sample $S = (z_1, \dots, z_n)$ and define the empirical Rademacher complexity of $\mathcal{F}_V$ on $S$ as in (2). The following theorem shows how to use the existence of a regret upper bound for an online learning algorithm to upper bound this empirical Rademacher complexity.
Theorem 3. Let $Z \subseteq \mathbb{R}^d$ be such that for any $z \in Z$ we also have $-z \in Z$. Fix any sequence $z_1, \dots, z_n \in Z$ and any non-empty, bounded, closed set $V \subseteq \mathbb{R}^d$. Assume that there exists an online algorithm that, on any sequence of linear losses $\ell_t(x) = \langle g_t, x \rangle$, where $g_t \in Z$, produces iterates $x_t \in V$ and satisfies the regret guarantee

$$\sum_{t=1}^n \langle g_t, x_t \rangle - \min_{x \in V} \ \sum_{t=1}^n \langle g_t, x \rangle \le R_n. \qquad (3)$$

Then, the empirical Rademacher complexity of the linear class $\mathcal{F}_V$ on $S = (z_1, \dots, z_n)$ satisfies

$$\mathrm{Rad}(\mathcal{F}_V \circ S) \le \frac{R_n}{n}. \qquad (4)$$
Proof: Fix $S = (z_1, \dots, z_n)$ and draw $\sigma_1, \dots, \sigma_n$ i.i.d. Rademacher. Consider the online linear game with gradients $g_t = -\sigma_t z_t$. Let $x_1, \dots, x_n$ be the iterates produced by the online algorithm when fed with $g_1, \dots, g_n$. By the regret guarantee (3) we have, for every realization of the signs,

$$-\sum_{t=1}^n \sigma_t \langle z_t, x_t \rangle - \min_{x \in V} \left(-\sum_{t=1}^n \sigma_t \langle z_t, x \rangle\right) \le R_n. \qquad (5)$$

Rearranging, we get

$$\sup_{x \in V} \ \sum_{t=1}^n \sigma_t \langle z_t, x \rangle \le \sum_{t=1}^n \sigma_t \langle z_t, x_t \rangle + R_n.$$

Taking the expectation with respect to $\sigma$ and dividing by $n$ yields

$$\mathrm{Rad}(\mathcal{F}_V \circ S) \le \frac{R_n}{n} + \frac{1}{n}\,\mathbb{E}\left[\sum_{t=1}^n \sigma_t \langle z_t, x_t \rangle\right]. \qquad (6)$$

It remains to show that the second term is zero. The key observation is that $x_t$ is measurable with respect to the past signs $\sigma_1, \dots, \sigma_{t-1}$ (and possibly the internal randomization of the algorithm), while $\sigma_t$ is independent of the past and has zero mean. Hence, conditioning on the past,

$$\mathbb{E}\left[\sigma_t \langle z_t, x_t \rangle \mid \sigma_1, \dots, \sigma_{t-1}\right] = \langle z_t, x_t \rangle \, \mathbb{E}[\sigma_t] = 0,$$

and by taking the expectation again we obtain $\mathbb{E}[\sigma_t \langle z_t, x_t \rangle] = 0$. Summing over $t$ gives $\mathbb{E}[\sum_{t=1}^n \sigma_t \langle z_t, x_t \rangle] = 0$, so (6) implies (4). $\square$
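To see the proof mechanics in action, here is a small simulation that feeds Rademacher signs to projected online gradient descent (OMD with the Euclidean regularizer) on an $\ell_2$ ball. The sample, dimension, and radius are arbitrary choices for illustration. Enumerating all sign patterns shows that the regret inequality holds for every realization and that the correlation term $\sum_t \sigma_t \langle z_t, x_t \rangle$ averages to zero, exactly as in the proof.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, d, U = 10, 3, 1.0                       # illustrative sizes and ball radius
zs = rng.normal(size=(n, d))               # the fixed sample z_1, ..., z_n
Z = max(np.linalg.norm(z) for z in zs)     # bound on the gradient norms

def ogd_iterates(sigma, eta):
    """Projected OGD on {||x||_2 <= U}, fed losses <g_t, x> with g_t = -sigma_t z_t."""
    x, iterates = np.zeros(d), []
    for t in range(n):
        iterates.append(x)
        x = x + eta * sigma[t] * zs[t]     # gradient step with g_t = -sigma_t z_t
        if np.linalg.norm(x) > U:          # Euclidean projection onto the ball
            x = x * (U / np.linalg.norm(x))
    return np.array(iterates)

eta = U / (Z * np.sqrt(n))                 # standard tuning: regret <= U * Z * sqrt(n)
R_n = U * Z * np.sqrt(n)

sup_total, corr_total = 0.0, 0.0
for signs in itertools.product([-1.0, 1.0], repeat=n):
    sigma = np.array(signs)
    xs = ogd_iterates(sigma, eta)
    corr = float(np.sum(sigma * np.einsum("td,td->t", zs, xs)))
    sup = U * np.linalg.norm(sigma @ zs)   # sup over the ball of sum_t sigma_t <z_t, x>
    assert sup <= corr + R_n + 1e-9        # regret inequality, per realization
    sup_total += sup
    corr_total += corr

rad = sup_total / 2 ** n / n               # exact empirical Rademacher complexity
mean_corr = corr_total / 2 ** n            # E[sum_t sigma_t <z_t, x_t>] = 0
```

Since the expectation is computed exactly over all $2^n$ patterns, `mean_corr` vanishes up to floating-point error and `rad` sits below $R_n / n$, matching (4).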
Remark 1. The proof of Theorem 3 has the same flavor as the probabilistic method argument used in lower bounds: we introduce Rademacher signs, feed them to an online algorithm, and then exploit the fact that the algorithm cannot correlate with the fresh randomness at time $t$. The entire complexity term is paid by the regret upper bound $R_n$.
We can instantiate the above theorem using the Online Mirror Descent (OMD) algorithm with $p$-norms.
Corollary 4. Let $p \in (1, 2]$ and $q$ such that $\frac{1}{p} + \frac{1}{q} = 1$. For $U > 0$, let $V = \{ x \in \mathbb{R}^d : \|x\|_p \le U \}$. Let $S = (z_1, \dots, z_n)$ where $z_i \in \mathbb{R}^d$ and $\|z_i\|_q \le Z$ for $i = 1, \dots, n$. Let $\mathcal{F}_V = \{ z \mapsto \langle z, x \rangle : x \in V \}$. Then,

$$\mathrm{Rad}(\mathcal{F}_V \circ S) \le U Z \sqrt{\frac{q-1}{n}}.$$

Proof: We instantiate Theorem 3 with OMD with the $p$-norm regularizer $\psi(x) = \frac{1}{2}\|x\|_p^2$, learning rate $\eta = \frac{U \sqrt{p-1}}{Z \sqrt{n}}$, and $x_1 = 0$, which gives $R_n \le U Z \sqrt{(q-1) n}$. $\square$
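For the Euclidean case $p = q = 2$ (so $q - 1 = 1$), the supremum over the ball has the closed form $\sup_{\|x\|_2 \le U} \sum_i \sigma_i \langle z_i, x \rangle = U \|\sum_i \sigma_i z_i\|_2$, which lets us check the corollary's bound exactly on made-up data, without running any online algorithm:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, d, U = 10, 4, 2.0                       # illustrative sizes and ball radius
zs = rng.normal(size=(n, d))
Z = max(np.linalg.norm(z) for z in zs)     # ||z_i||_2 <= Z for all i

# Exact Rademacher complexity over all 2^n sign patterns, via the closed-form sup.
total = 0.0
for signs in itertools.product([-1.0, 1.0], repeat=n):
    total += U * np.linalg.norm(np.array(signs) @ zs)
rad = total / 2 ** n / n

bound = U * Z * np.sqrt(1 / n)             # U * Z * sqrt((q - 1) / n) with q = 2
print(f"Rad = {rad:.4f} <= bound = {bound:.4f}")
```

By Jensen's inequality, $\mathbb{E}\|\sum_i \sigma_i z_i\|_2 \le \sqrt{\sum_i \|z_i\|_2^2} \le Z\sqrt{n}$, so the exact value computed here is guaranteed to sit below the bound.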
We can also consider the case of classes of functions with finite cardinality.
Corollary 5 (Massart's Lemma). Let $\mathcal{F} = \{f_1, \dots, f_m\}$ where $f_j : \mathcal{Z} \to \mathbb{R}$ and $|f_j(z_i)| \le B$ for $i = 1, \dots, n$ and $j = 1, \dots, m$. Let $S = (z_1, \dots, z_n)$. Then,

$$\mathrm{Rad}(\mathcal{F} \circ S) \le B \sqrt{\frac{2 \ln m}{n}}.$$

Proof: We apply Theorem 3 with $d = n$, $z_t = e_t$ the standard basis vectors, and $V$ the convex hull of the vectors $a_j = (f_j(z_1), \dots, f_j(z_n))$, $j = 1, \dots, m$, noting that the supremum of a linear function over the convex hull is attained at a vertex. We instantiate the online algorithm as OMD with the entropic regularizer over the probability simplex on the $m$ vertices, learning rate $\eta = \frac{1}{B}\sqrt{\frac{2 \ln m}{n}}$, and $x_1$ equal to the uniform distribution, which gives $R_n \le B \sqrt{2 n \ln m}$. $\square$
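As a final sanity check, Massart's bound can be verified exactly on a tiny finite class by enumerating all sign patterns; the four value vectors below are arbitrary, chosen only to respect $|f_j(z_i)| \le B$ with $B = 1$:

```python
import itertools
import math

n = 8
# Illustrative finite class: each row is (f_j(z_1), ..., f_j(z_n)).
F = [
    [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5],
    [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0],
    [0.2, 0.9, -0.3, 0.7, -0.8, 0.1, 0.4, -0.6],
    [-1.0, 0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0],
]
m = len(F)
B = max(abs(v) for f in F for v in f)

# Exact empirical Rademacher complexity by enumerating all 2^n sign patterns.
total = 0.0
for sigma in itertools.product([-1, 1], repeat=n):
    total += max(sum(s * v for s, v in zip(sigma, f)) / n for f in F)
rad = total / 2 ** n

bound = B * math.sqrt(2 * math.log(m) / n)
print(f"Rad = {rad:.4f} <= Massart bound = {bound:.4f}")
```

The gap between the exact value and the bound is typically noticeable: Massart's lemma pays for generality, since it only uses the cardinality of the class and the magnitude of its values.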
3. History Bits
The contraction lemma appears in many places in the literature (see, e.g., Ledoux & Talagrand, 1991; Bartlett & Mendelson, 2001; Bartlett & Mendelson, 2002; Meir & Zhang, 2003).
Kakade et al. (2008) proved Theorem 3 using strong convexity, while Kakade et al. (2009) proved it through the proof of FTRL. Here, we simply used a black-box conversion to the linear regret upper bound for any online learning algorithm.
Massart’s lemma was originally proved by Massart (2000).
Acknowledgments
Thanks to ChatGPT for checking my post.