This is the second post in my series of “From Online Learning to X”.
Last time, we saw how to derive Rademacher complexity bounds using the existence of online learning algorithms. This time, we will see how to go from online learning to PAC-Bayes bounds, using the existence of parameter-free algorithms for the learning with experts setting.
1. From Linear Regret to PAC-Bayes
In this section, we show that one can directly obtain generalization bounds for any machine learning algorithm from upper bounds on linear regret.
We will consider the same setting we used in the online-to-batch reduction, where one aims at minimizing the risk of a function $f : \mathcal{Z} \to [0,1]$, defined as
$$R(f) := \mathbb{E}_{z \sim \rho}[f(z)].$$
For example, in a regression setting, $f$ is the composition of a loss with a prediction function, evaluated on the data $z = (x, y)$: $f(z) = \ell(h(x), y)$.
Given a training set $S = (z_1, \dots, z_n)$ drawn i.i.d. from $\rho$, here we are interested in the generalization gap, defined as the difference between the risk and the training error of a function $f$:
$$\text{gen}(f, S) := R(f) - \frac{1}{n} \sum_{t=1}^{n} f(z_t).$$
We can also consider the case where the predictor is randomized, in the sense that we draw $f$ according to a probability distribution $Q$ in $\mathcal{M}_1(\mathcal{F})$, that is, the set of probability distributions over a set of functions $\mathcal{F}$. In this case, we study
$$\text{gen}(Q, S) := \mathbb{E}_{f \sim Q}[\text{gen}(f, S)] = \mathbb{E}_{f \sim Q}[R(f)] - \frac{1}{n} \sum_{t=1}^{n} \mathbb{E}_{f \sim Q}[f(z_t)].$$
In particular, we are interested in upper bounding $\text{gen}(Q, S)$ with high probability, where $Q$ is selected from $\mathcal{M}_1(\mathcal{F})$ after looking at $S$.
We now describe the reduction from this problem to an online learning one. Consider an online learning algorithm that at each round $t$ produces a distribution $Q_t \in \mathcal{M}_1(\mathcal{F})$ after observing the training samples $z_1, \dots, z_{t-1}$. Define the linear losses $g_t(f) := f(z_t) - R(f)$ and the regret with respect to a competitor $Q \in \mathcal{M}_1(\mathcal{F})$ as
$$\text{Regret}_n(Q) := \sum_{t=1}^{n} \mathbb{E}_{f \sim Q_t}[g_t(f)] - \sum_{t=1}^{n} \mathbb{E}_{f \sim Q}[g_t(f)].$$
So, for any $Q$, we have that
$$n \, \text{gen}(Q, S) = -\sum_{t=1}^{n} \mathbb{E}_{f \sim Q}[g_t(f)] = \text{Regret}_n(Q) - \sum_{t=1}^{n} \mathbb{E}_{f \sim Q_t}[g_t(f)].$$
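This decomposition is purely algebraic, so we can verify it numerically. The sketch below uses a toy finite function class (all sizes and distributions are made up for illustration), where expectations over $\mathcal{F}$ become finite sums:

```python
import numpy as np

# Numerical sanity check of the identity
#   n * gen(Q, S) = Regret_n(Q) - sum_t E_{f~Q_t}[g_t(f)]
# on a toy finite class; all sizes and distributions are made up.
rng = np.random.default_rng(0)
N, Z, n = 5, 4, 50                       # num. functions, num. data values, sample size
L = rng.uniform(size=(N, Z))             # loss table: L[f, z] = f(z) in [0, 1]
rho = rng.dirichlet(np.ones(Z))          # data distribution over Z
R = L @ rho                              # true risks R(f)

S = rng.choice(Z, size=n, p=rho)         # i.i.d. training set
Qts = rng.dirichlet(np.ones(N), size=n)  # arbitrary online iterates Q_1, ..., Q_n
Q = rng.dirichlet(np.ones(N))            # arbitrary comparator posterior

g = L[:, S].T - R                        # g_t(f) = f(z_t) - R(f), shape (n, N)
gen_gap = Q @ R - (Q @ L[:, S]).mean()   # gen(Q, S)
martingale = np.einsum("tf,tf->", Qts, g)   # sum_t E_{f~Q_t}[g_t(f)]
regret = martingale - (g @ Q).sum()         # Regret_n(Q)

assert np.isclose(n * gen_gap, regret - martingale)
```

Note that the identity holds for any sequence of iterates $Q_t$, which is why the choice of online algorithm only enters through its regret bound.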
We use $\mathcal{G}_t$ to denote the $\sigma$-algebra generated by the data points $z_1, \dots, z_t$ and by all the random variables generated by the online learning algorithm up to the end of round $t$. Now, given that $Q_t$ is $\mathcal{G}_{t-1}$-measurable and $z_t$ is independent of $\mathcal{G}_{t-1}$, we have
$$\mathbb{E}\left[\mathbb{E}_{f \sim Q_t}[g_t(f)] \,\middle|\, \mathcal{G}_{t-1}\right] = \mathbb{E}_{f \sim Q_t}\left[\mathbb{E}_{z_t \sim \rho}[f(z_t)] - R(f)\right] = 0.$$
So, we have that the terms $X_t := \mathbb{E}_{f \sim Q_t}[g_t(f)]$ form a martingale difference sequence. Since $f(z) \in [0,1]$ for all $f$ and $z$, we have $g_t(f) \in [-1, 1]$ and therefore
$$|X_t| \le 1.$$
Hence, $(X_t)_{t=1}^{n}$ is a martingale difference sequence bounded in $[-1, 1]$, so we can apply the Hoeffding–Azuma inequality: with probability at least $1 - \delta$,
$$-\sum_{t=1}^{n} X_t \le \sqrt{2 n \ln \frac{1}{\delta}}.$$
This concentration result does not depend on $Q$, so it holds simultaneously for all $Q$.
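As a quick illustration of this concentration step, the sketch below simulates the martingale term on a toy finite class (sizes and distributions are made up) and checks that the Hoeffding–Azuma threshold is violated with frequency at most $\delta$:

```python
import numpy as np

# Empirical check of the Hoeffding-Azuma step: over many simulated samples,
# the event  -sum_t X_t > sqrt(2 n ln(1/delta))  should occur with
# frequency at most delta. Sizes and distributions are made up.
rng = np.random.default_rng(1)
N, Z, n, delta, trials = 5, 4, 200, 0.05, 2000
L = rng.uniform(size=(N, Z))            # loss table: L[f, z] = f(z) in [0, 1]
rho = rng.dirichlet(np.ones(Z))
R = L @ rho
threshold = np.sqrt(2 * n * np.log(1 / delta))

violations = 0
for _ in range(trials):
    S = rng.choice(Z, size=n, p=rho)
    g = L[:, S].T - R                   # g_t(f) = f(z_t) - R(f)
    X = g.mean(axis=1)                  # X_t with Q_t = uniform (a valid, past-measurable choice)
    violations += float(-X.sum() > threshold)

print(violations / trials)              # empirical frequency, at most delta
```

Here we used the uniform distribution for every $Q_t$ for simplicity; any sequence of iterates measurable with respect to the past would do, since the bound is uniform over the algorithm's behavior.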
We are done! We can now put everything together to have the following theorem.
Theorem 1. Let $\mathcal{F}$ be a measurable space of measurable functions $f : \mathcal{Z} \to [0,1]$, $n \ge 1$, and $\delta \in (0,1)$. Let $S = (z_1, \dots, z_n)$ be drawn i.i.d. from a distribution $\rho$ over $\mathcal{Z}$. Consider any online learning algorithm that outputs at each round $t$ a distribution $Q_t$ over $\mathcal{F}$ and is fed with the linear losses $g_t(f) = f(z_t) - R(f)$. Then, simultaneously for all $Q \in \mathcal{M}_1(\mathcal{F})$ such that $\text{Regret}_n(Q) < \infty$, even selected with the knowledge of $S$, and with probability at least $1 - \delta$,
$$\text{gen}(Q, S) \le \frac{\text{Regret}_n(Q)}{n} + \sqrt{\frac{2 \ln \frac{1}{\delta}}{n}}.$$
2. Which Online Algorithm Should We Use?
We can now instantiate this theorem with the regret of any online learning algorithm. For example, we could use the Exponentiated Gradient (EG) algorithm, where we think of each function $f \in \mathcal{F}$ as an expert. There are only two caveats: first, we would have to extend it to the case where the number of experts is infinite, possibly continuous; second, we would have to select an appropriate learning rate.
The first issue is easy to deal with: roughly speaking, it is enough to substitute any sum over the experts with an integral. Using the fact that the linear losses satisfy $g_t(f) \in [-1, 1]$, the regret guarantee of the continuous version of Exponentiated Gradient with learning rate $\eta > 0$ is
$$\text{Regret}_n(Q) \le \frac{\mathrm{KL}(Q \| P)}{\eta} + \frac{\eta n}{2},$$
where $P$ is the prior distribution over the infinite experts and $Q$ is the competitor distribution. We might not know how to run such an algorithm due to the presence of integrals, but we do not need to! We only need to know that such an algorithm and its regret bound exist.
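To make the learning-rate issue explicit, here is the standard tuning computation for a bound of the form $\mathrm{KL}(Q\|P)/\eta + \eta n/2$:

```latex
% Optimal tuning of the EG learning rate for a fixed competitor Q:
%   Regret_n(Q) <= KL(Q||P)/eta + eta n / 2 .
% Minimizing the right-hand side over eta > 0 gives
\eta^\star = \sqrt{\frac{2\,\mathrm{KL}(Q\|P)}{n}},
\qquad
\min_{\eta > 0}\left(\frac{\mathrm{KL}(Q\|P)}{\eta} + \frac{\eta n}{2}\right)
  = \sqrt{2\, n\, \mathrm{KL}(Q\|P)} .
```

The optimal $\eta^\star$ depends on $\mathrm{KL}(Q\|P)$, which is exactly the quantity we do not know in advance.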
The second problem instead is a difficult one: the optimal learning rate depends on $\mathrm{KL}(Q \| P)$. However, we want generalization guarantees that hold uniformly for any $Q$. This is exactly the problem we saw many times in online learning, where the optimal choice of the learning rate depends on the unknown comparator. In standard EG with $N$ experts, we upper bounded the KL term for the case of a uniform prior $P$ with $\ln N$, but here we cannot do it because the number of experts is infinite. One could construct a grid of learning rates, instantiate the bound for each of them, and use a union bound, but this approach could introduce additional poly-logarithmic terms in the final bound. However, we already know how to solve this problem in online learning: simply use parameter-free algorithms.
So, we can consider a continuous version of the parameter-free algorithm for learning with expert advice, for which we can show the following regret bound.
Theorem 2. Let $\mathcal{F}$ be a measurable space of measurable functions $f : \mathcal{Z} \to [0,1]$, $P \in \mathcal{M}_1(\mathcal{F})$, and $n \ge 1$. At each round $t$, the adversary reveals a measurable loss function $\ell_t : \mathcal{F} \to [0,1]$. There exists an online learning algorithm that depends on $P$, outputs a distribution $Q_t \in \mathcal{M}_1(\mathcal{F})$, and incurs the loss $\mathbb{E}_{f \sim Q_t}[\ell_t(f)]$. For any competitor distribution $Q$ such that $\mathrm{KL}(Q \| P) < \infty$, its regret satisfies
$$\text{Regret}_n(Q) = \sum_{t=1}^{n} \mathbb{E}_{f \sim Q_t}[\ell_t(f)] - \sum_{t=1}^{n} \mathbb{E}_{f \sim Q}[\ell_t(f)] \le 2 \sqrt{n \left(\mathrm{KL}(Q \| P) + \ln(2\sqrt{n})\right)}.$$
The proof is a straightforward adaptation of the finite-expert case, but for completeness we include it in the Appendix below.
Combining the two previous theorems, and taking care of the fact that the linear losses $g_t$ take values in $[-1,1]$ while Theorem 2 requires the losses to be in $[0,1]$ (feeding the rescaled losses $\ell_t(f) = (g_t(f)+1)/2$ to the algorithm doubles the regret), we obtain the following so-called PAC-Bayes bound.
Corollary 3 (PAC-Bayes Bound). Let $\mathcal{F}$ be a measurable space of measurable functions $f : \mathcal{Z} \to [0,1]$, $P \in \mathcal{M}_1(\mathcal{F})$, and $\delta \in (0,1)$. Let $S = (z_1, \dots, z_n)$ be drawn i.i.d. from a distribution $\rho$ over $\mathcal{Z}$. Then, simultaneously for all $Q \in \mathcal{M}_1(\mathcal{F})$ such that $\mathrm{KL}(Q \| P) < \infty$, even selected with the knowledge of $S$, and with probability at least $1 - \delta$, we have
$$\text{gen}(Q, S) \le 4 \sqrt{\frac{\mathrm{KL}(Q \| P) + \ln(2\sqrt{n})}{n}} + \sqrt{\frac{2 \ln \frac{1}{\delta}}{n}}.$$
The role of the prior $P$ is to allow the bound to hold simultaneously for all posterior distributions $Q$. Indeed, the theorem guarantees that, with probability at least $1 - \delta$ over the draw of the sample $S$, the inequality holds for every $Q$ such that $\mathrm{KL}(Q \| P) < \infty$, even if $Q$ is selected after observing the data. This uniformity is possible because the prior acts as a continuous analogue of the union bound. The divergence term $\mathrm{KL}(Q \| P)$ plays the role of the logarithmic penalty that appears in the discrete union bound. Hence, choosing $P$ corresponds to specifying how the bound is distributed over the different functions in $\mathcal{F}$.
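This analogy is easy to see concretely on a hypothetical finite class, where the integrals become sums: with a uniform prior over $N$ functions, a posterior concentrated on a single function pays $\mathrm{KL}(Q\|P) = \ln N$, exactly the union-bound penalty, while $Q = P$ pays nothing.

```python
import numpy as np

# KL(Q||P) as a continuous union bound, on a finite class of N functions
# with a uniform prior (sizes are made up for illustration).
N = 100
P = np.full(N, 1.0 / N)

def kl(q, p):
    """KL divergence between discrete distributions, with 0 log 0 = 0."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

point_mass = np.zeros(N)
point_mass[0] = 1.0                     # posterior concentrated on one function
print(kl(point_mass, P))                # ln(100): the union-bound penalty
print(kl(P, P))                         # 0: no penalty for posterior = prior
```

Posteriors spread over a few functions pay an intermediate price, interpolating between the two extremes.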
Remark 1. As in the previous post, the existence of an online learning algorithm with a regret upper bound is enough to prove our bound: we never need to actually run the online algorithm. Moreover, in this case we could not run the reduction even if we wanted to, since the linear losses $g_t(f) = f(z_t) - R(f)$ depend on the unknown distribution $\rho$.
3. History Bits
PAC-Bayes bounds were first proposed by McAllester (1998) and, as first explained by van Erven (2014), they can be seen as a continuous generalization of the union bound. There is now a vast literature on this subject, and we refer the reader to the recent review by Alquier (2024) for an introduction. Recently, PAC-Bayes bounds gained popularity because they can yield non-vacuous generalization bounds for deep neural networks. Yet one should not confuse a certificate of generalization with an “explanation of generalization” in deep neural networks (see, e.g., Picard-Weibel et al., 2025).
The reduction from online learning to PAC-Bayes bounds, as well as the idea of using parameter-free algorithms, is from Lugosi&Neu (2023). Despite what is claimed in Lugosi&Neu (2023), the PAC-Bayes bound in Corollary 3 is not new: it was obtained by Kakade et al. (2008). Such a bound shaves off a logarithmic term under the square root compared to previous PAC-Bayes guarantees. Lugosi&Neu (2023) also present many more results based on the same reduction.
Independently, Jang et al. (2023) proposed another reduction from coin-betting to PAC-Bayes. While less general than the one presented here, it allows one to prove tighter results by changing the surrogate loss to a non-linear one. Later, Kuzborskij et al. (2024) used a similar approach to show the first PAC-Bayes bound with a better-than-KL divergence.
Acknowledgments
Thanks to ChatGPT for checking my post.
Appendix: Proof of the Parameter-free Algorithm for Continuous Experts
It is enough to show the continuous analogue of Theorem 1 here. As in the finite-dimensional case, let $\mathcal{A}$ be a coin-betting algorithm. We formally instantiate one copy of $\mathcal{A}$ for each $f \in \mathcal{F}$. At round $t$, let $\beta_{t,f} \in (-1, 1)$ be the bet produced by the copy corresponding to $f$, and assume that $f \mapsto \beta_{t,f}$ is measurable. We denote by $W_{t,f}$ the wealth of the copy corresponding to $f$ at the end of round $t$, with $W_{0,f} = 1$. Define
$$p_{t,f} := \max(\beta_{t,f} W_{t-1,f}, 0). \qquad (1)$$
Then, the continuous Learning with Experts Advice (LEA) algorithm predicts with the distribution $Q_t$ defined by
$$dQ_t(f) := \frac{p_{t,f}}{\int_{\mathcal{F}} p_{t,g} \, dP(g)} \, dP(f), \qquad (2)$$
where $P$ is the prior over $\mathcal{F}$; if the normalization factor is $0$, we set $Q_t := P$. After the prediction, the algorithm receives $\ell_t$ and defines the outcome of the continuous coin for the copy corresponding to $f$ as
$$c_{t,f} := \begin{cases} \mathbb{E}_{g \sim Q_t}[\ell_t(g)] - \ell_t(f), & \text{if } \beta_{t,f} W_{t-1,f} > 0, \\ \max\left(\mathbb{E}_{g \sim Q_t}[\ell_t(g)] - \ell_t(f), 0\right), & \text{otherwise}, \end{cases} \qquad (3)$$
so that the wealth of each copy evolves as $W_{t,f} = W_{t-1,f}(1 + \beta_{t,f} c_{t,f})$.
The construction above defines a LEA algorithm on a continuous set of experts. The regret guarantee is the following one.
Theorem 4 (Regret Bound for a Continuous Set of Experts). Let $\mathcal{A}$ be a coin-betting algorithm such that, for any sequence of continuous coin outcomes $c_1, \dots, c_n \in [-1, 1]$, its wealth after $n$ rounds with initial money equal to $1$ satisfies
$$W_n \ge F\left(\sum_{t=1}^{n} c_t\right). \qquad (4)$$
Then, the continuous LEA algorithm with prior $P$ that predicts at each round with $Q_t$ in (2) satisfies
$$\text{Regret}_n(Q) := \sum_{t=1}^{n} \mathbb{E}_{f \sim Q_t}[\ell_t(f)] - \sum_{t=1}^{n} \mathbb{E}_{f \sim Q}[\ell_t(f)] \le \phi^{-1}\left(\mathrm{KL}(Q \| P)\right)$$
for any $\phi$ concave and non-decreasing such that $\phi(x) \le \ln F(x)$ for all $x \in [-n, n]$.
Proof: We first prove that
$$\mathbb{E}_{f \sim P}[c_{t,f} \, \beta_{t,f} \, W_{t-1,f}] \le 0.$$
Indeed, by the definition of $c_{t,f}$ in (3), we have
$$\mathbb{E}_{f \sim P}[c_{t,f} \beta_{t,f} W_{t-1,f}] = \int_{\{\beta_{t,f} W_{t-1,f} > 0\}} \left(\mathbb{E}_{g \sim Q_t}[\ell_t(g)] - \ell_t(f)\right) \beta_{t,f} W_{t-1,f} \, dP(f) + \int_{\{\beta_{t,f} W_{t-1,f} \le 0\}} c_{t,f} \, \beta_{t,f} W_{t-1,f} \, dP(f).$$
Now define
$$S_t := \int_{\mathcal{F}} \max(\beta_{t,f} W_{t-1,f}, 0) \, dP(f).$$
If $S_t = 0$, then $\beta_{t,f} W_{t-1,f} \le 0$ for $P$-almost every $f$, so the first integral is equal to $0$. If instead $S_t > 0$, then by (1) and (2),
$$\int_{\{\beta_{t,f} W_{t-1,f} > 0\}} \left(\mathbb{E}_{g \sim Q_t}[\ell_t(g)] - \ell_t(f)\right) \beta_{t,f} W_{t-1,f} \, dP(f) = S_t \left(\mathbb{E}_{g \sim Q_t}[\ell_t(g)] - \int_{\mathcal{F}} \ell_t(f) \, dQ_t(f)\right) = 0.$$
Hence,
$$\mathbb{E}_{f \sim P}[c_{t,f} \beta_{t,f} W_{t-1,f}] \le \int_{\{\beta_{t,f} W_{t-1,f} \le 0\}} c_{t,f} \, \beta_{t,f} W_{t-1,f} \, dP(f) \le 0,$$
because $c_{t,f} \ge 0$ on the integration domain and $\beta_{t,f} W_{t-1,f} \le 0$. Therefore, in all cases, $\mathbb{E}_{f \sim P}[c_{t,f} \beta_{t,f} W_{t-1,f}] \le 0$, which, together with $W_{t,f} = W_{t-1,f}(1 + \beta_{t,f} c_{t,f})$, gives $\mathbb{E}_{f \sim P}[W_{t,f}] \le \mathbb{E}_{f \sim P}[W_{t-1,f}]$ and, unrolling the recursion, $\mathbb{E}_{f \sim P}[W_{n,f}] \le W_{0,f} = 1$.
From the assumption (4) on $\mathcal{A}$, for every fixed $f$ and for the sequence of coins $c_{1,f}, \dots, c_{n,f}$, we have
$$W_{n,f} \ge F\left(\sum_{t=1}^{n} c_{t,f}\right) \ge \exp\left(\phi\left(\sum_{t=1}^{n} c_{t,f}\right)\right).$$
Integrating both sides with respect to $P$ and using $\mathbb{E}_{f \sim P}[W_{n,f}] \le 1$, we obtain
$$\int_{\mathcal{F}} \exp\left(\phi\left(\sum_{t=1}^{n} c_{t,f}\right)\right) dP(f) \le 1. \qquad (5)$$
Now, fix any competitor $Q$ such that $\mathrm{KL}(Q \| P) < \infty$. Since, by (3), $c_{t,f} \ge \mathbb{E}_{g \sim Q_t}[\ell_t(g)] - \ell_t(f)$, we have
$$\text{Regret}_n(Q) = \sum_{t=1}^{n} \left(\mathbb{E}_{f \sim Q_t}[\ell_t(f)] - \mathbb{E}_{f \sim Q}[\ell_t(f)]\right) \le \mathbb{E}_{f \sim Q}\left[\sum_{t=1}^{n} c_{t,f}\right].$$
So, it remains to upper bound $\mathbb{E}_{f \sim Q}\left[\sum_{t=1}^{n} c_{t,f}\right]$. Define $X(f) := \sum_{t=1}^{n} c_{t,f}$. Since $\mathrm{KL}(Q \| P) < \infty$, $Q$ is absolutely continuous with respect to $P$; let $r := \frac{dQ}{dP}$. Then,
$$\mathbb{E}_{f \sim Q}[\phi(X(f))] = \mathbb{E}_{f \sim Q}\left[\ln \frac{\exp(\phi(X(f)))}{r(f)}\right] + \mathbb{E}_{f \sim Q}[\ln r(f)] \le \ln \mathbb{E}_{f \sim Q}\left[\frac{\exp(\phi(X(f)))}{r(f)}\right] + \mathrm{KL}(Q \| P),$$
where in the inequality we used Jensen's inequality for the concave function $\ln$. Substituting back the definition of $r$, we obtain
$$\mathbb{E}_{f \sim Q}\left[\frac{\exp(\phi(X(f)))}{r(f)}\right] = \int_{\{r > 0\}} \exp(\phi(X(f))) \, dP(f) \le \int_{\mathcal{F}} \exp(\phi(X(f))) \, dP(f).$$
Using (5), the logarithm is at most $\ln 1 = 0$, so
$$\mathbb{E}_{f \sim Q}[\phi(X(f))] \le \mathrm{KL}(Q \| P).$$
Finally, using Jensen's inequality for the concave function $\phi$ and the fact that $\phi$ is non-decreasing, we get
$$\phi\left(\mathbb{E}_{f \sim Q}[X(f)]\right) \le \mathbb{E}_{f \sim Q}[\phi(X(f))] \le \mathrm{KL}(Q \| P) \quad \Longrightarrow \quad \mathbb{E}_{f \sim Q}\left[\sum_{t=1}^{n} c_{t,f}\right] \le \phi^{-1}\left(\mathrm{KL}(Q \| P)\right).$$
Putting everything together, we have the stated bound. ∎
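To see the construction in action, here is a minimal discrete instantiation (a finite set of experts is a special case of the continuous one, with sums in place of integrals), using the Krichevsky–Trofimov bettor, which bets the fraction $\beta_t = \frac{1}{t}\sum_{s<t} c_s$ of its wealth, as $\mathcal{A}$. It numerically verifies the key invariant of the proof, $\mathbb{E}_{f \sim P}[W_{n,f}] \le 1$; all sizes and losses are made up for illustration.

```python
import numpy as np

# Discrete instantiation of the coin-betting -> LEA construction with the
# KT bettor; verifies the aggregated-wealth invariant E_{f~P}[W_n] <= 1.
rng = np.random.default_rng(2)
N, n = 10, 500
P = np.full(N, 1.0 / N)          # uniform prior over the experts
W = np.ones(N)                   # wealth of each copy, W_0 = 1
csum = np.zeros(N)               # running sum of past coin outcomes

for t in range(1, n + 1):
    beta = csum / t                          # KT bet, always in (-1, 1)
    p = P * np.maximum(beta * W, 0.0)        # unnormalized weights
    Q = p / p.sum() if p.sum() > 0 else P    # prediction (prior if degenerate)
    loss = rng.uniform(size=N)               # adversary's losses in [0, 1]
    c = Q @ loss - loss                      # coin outcomes ...
    c = np.where(beta * W > 0, c, np.maximum(c, 0.0))  # ... truncated
    W *= 1.0 + beta * c                      # wealth update of each copy
    csum += c

print(P @ W)                     # aggregated wealth, provably <= 1
```

Note that the truncation of the coins only helps the comparator side of the regret, while the prediction rule makes the wealth created on the untruncated copies cancel exactly, mirroring the two cases in the proof.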