From Online Learning to PAC-Bayes

This is the second post in my series of “From Online Learning to X”.

Last time, we saw how to derive Rademacher complexity bounds using the existence of online learning algorithms. This time, we will see how to go from online learning to PAC-Bayes bounds, using the existence of parameter-free algorithms for the learning with experts setting.

1. From Linear Regret to PAC-Bayes

In this section, we show that one can directly obtain generalization bounds for any machine learning algorithm from upper bounds on linear regret.

We will consider the same setting we used in the online-to-batch reduction, where one aims at minimizing the risk of a function {f:\mathcal{Z} \rightarrow [0,1]}, defined as

\displaystyle \text{Risk}(f)=\mathop{\mathbb E}_{{\boldsymbol \xi}\sim \rho} [f({\boldsymbol \xi})]~.

For example, in a regression setting, {f({\boldsymbol \xi})} is the composition of a loss with a prediction function, evaluated on the data {{\boldsymbol \xi}=({\boldsymbol z},y)}.

Given a training set {\mathcal{S}=\{{\boldsymbol \xi}_1, \dots, {\boldsymbol \xi}_T\}} drawn i.i.d. from {\rho}, here we are interested in the generalization gap, defined as the difference between the risk and the training error of a function {f}:

\displaystyle \text{Gen}(f,\mathcal{S}) := \text{Risk}(f) - \frac{1}{T}\sum_{t=1}^T f({\boldsymbol \xi}_t)~.
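As a quick numerical illustration of the generalization gap (a toy sketch, with my own choice of {f({\boldsymbol \xi})=\xi^2} and {\rho} uniform on {[0,1]}, so that {\text{Risk}(f)=1/3} in closed form):

```python
import random

def gen_gap(f, risk, sample):
    """Gen(f, S) = Risk(f) minus the empirical average of f on the sample."""
    return risk - sum(f(xi) for xi in sample) / len(sample)

# Toy setup: rho = Uniform[0,1] and f(xi) = xi^2, so Risk(f) = E[xi^2] = 1/3.
random.seed(0)
T = 1000
S = [random.random() for _ in range(T)]
gap = gen_gap(lambda x: x * x, 1 / 3, S)
print(gap)  # a small number, of order 1/sqrt(T)
```

As expected, the gap shrinks at roughly the {1/\sqrt{T}} rate that the bounds below quantify.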

We can also consider the case where the predictor is randomized, in the sense that we draw {f} according to a probability distribution {Q} in {\Delta(\mathcal{F})}, that is, the set of probability distributions over a set of functions {\mathcal{F}:=\{f \mid f:\mathcal{Z} \rightarrow [0,1]\}}. In this case, we study

\displaystyle \overline{\text{Gen}}(Q,\mathcal{S}) := \mathop{\mathbb E}_{f \sim Q}[\text{Gen}(f,\mathcal{S})]~.

In particular, we are interested in upper bounding {\overline{\text{Gen}}(Q_\mathcal{S},\mathcal{S})} in high probability, where {Q_\mathcal{S}} is selected from {\Delta(\mathcal{F})} after looking at {\mathcal{S}}.

We now describe the reduction from this problem to an online learning one. Consider an online algorithm that at each round {t} produces a distribution {P_t \in \Delta(\mathcal{F})} after observing the training samples {{\boldsymbol \xi}_1, \dots, {\boldsymbol \xi}_{t-1}}. Define {g_t(f)=f({\boldsymbol \xi}_t) - \mathop{\mathbb E}_{{\boldsymbol \xi}\sim\rho} [f({\boldsymbol \xi})]} and {\ell_t(P)=\mathop{\mathbb E}_{f\sim P}[g_t(f)]}. So, for any {Q\in \Delta(\mathcal{F})}, we have that

\displaystyle \begin{aligned} \overline{\text{Gen}}(Q,\mathcal{S}) &= \mathop{\mathbb E}_{f \sim Q}[\mathop{\mathbb E}_{{\boldsymbol \xi}\sim\rho}[f({\boldsymbol \xi})]] - \frac{1}{T}\sum_{t=1}^T \mathop{\mathbb E}_{f \sim Q}[f({\boldsymbol \xi}_t)]\\ &= \frac{1}{T}\sum_{t=1}^T -\ell_t(Q) \\ &= \frac{1}{T}\sum_{t=1}^T (\ell_t(P_t)-\ell_t(Q)) - \frac{1}{T}\sum_{t=1}^T \ell_t(P_t)\\ &= \frac{\text{Regret}_T(Q)}{T} - \frac{1}{T}\sum_{t=1}^T \ell_t(P_t)~. \end{aligned}

We use {\mathcal{F}_t} to denote the {\sigma}-algebra generated by the data points {{\boldsymbol \xi}_1,\dots,{\boldsymbol \xi}_t} and by all the random variables generated by the online learning algorithm up to the end of round {t}. Now, given that {P_t} is {\mathcal{F}_{t-1}}-measurable and {{\boldsymbol \xi}_t} is independent of {\mathcal{F}_{t-1}}, we have

\displaystyle \mathop{\mathbb E}[\ell_t(P_t)|\mathcal{F}_{t-1}] =\mathop{\mathbb E}[\mathop{\mathbb E}_{f \sim P_t}[f({\boldsymbol \xi}_t)-\mathop{\mathbb E}_{{\boldsymbol \xi}\sim\rho}[f({\boldsymbol \xi})]]|\mathcal{F}_{t-1}] =0~.

So, we have that {\ell_1(P_1), \dots, \ell_T(P_T)} form a martingale difference sequence. Since {f({\boldsymbol \xi})\in[0,1]}, we have

\displaystyle g_t(f)=f({\boldsymbol \xi}_t)-\mathop{\mathbb E}_{{\boldsymbol \xi}\sim\rho}[f({\boldsymbol \xi})] \in [-1,1],

and therefore

\displaystyle \ell_t(P_t)=\mathop{\mathbb E}_{f\sim P_t}[g_t(f)] \in [-1,1]~.

Hence, {\ell_1(P_1),\dots,\ell_T(P_T)} is a martingale difference sequence bounded in {[-1,1]}, so we can apply the Hoeffding–Azuma inequality. This concentration does not depend on {Q}, so it holds simultaneously for all {Q \in \Delta(\mathcal{F})}.
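To make the Hoeffding–Azuma step concrete, here is a small simulation. It uses the simplifying assumption that the martingale differences are i.i.d. zero-mean variables bounded in {[-1,1]} (a special case of a martingale difference sequence): the fraction of runs where the average exceeds {\sqrt{2\ln(1/\delta)/T}} should be at most {\delta}.

```python
import math
import random

random.seed(0)
T, trials, delta = 200, 2000, 0.05
threshold = math.sqrt(2 * math.log(1 / delta) / T)

violations = 0
for _ in range(trials):
    # A bounded martingale difference sequence; here simply i.i.d.
    # zero-mean variables in [-1, 1], which is a special case.
    avg = sum(random.uniform(-1, 1) for _ in range(T)) / T
    if avg > threshold:
        violations += 1

print(violations / trials)  # empirically at most delta = 0.05
```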

We are done! We can now put everything together to have the following theorem.

Theorem 1. Let {\mathcal{F}} be a measurable space of measurable functions {f:\mathcal{Z}\rightarrow [0,1]}, {\delta \in (0,1)}, and {\pi \in \Delta(\mathcal{F})}. Let {\mathcal{S}=\{{\boldsymbol \xi}_1,\dots, {\boldsymbol \xi}_T\}} be drawn i.i.d. from a distribution {\rho} over {\mathcal{Z}}. Consider any online learning algorithm that outputs a distribution over {\mathcal{F}} and is fed with the linear losses {\ell_t(P)=\mathop{\mathbb E}_{f \sim P}[f({\boldsymbol \xi}_t) - \mathop{\mathbb E}_{{\boldsymbol \xi}\sim\rho} [f({\boldsymbol \xi})]]}. Then, with probability at least {1-\delta}, simultaneously for all {Q\in \Delta(\mathcal{F})} such that {Q \ll \pi}, even if {Q} is selected with the knowledge of {\mathcal{S}}, we have

\displaystyle \overline{\text{Gen}}(Q,\mathcal{S}) = \frac{\text{Regret}_T(Q)}{T} - \frac{1}{T} \sum_{t=1}^T \ell_t(P_t) \leq \frac{\text{Regret}_T(Q)}{T} + \sqrt{\frac{2\ln \frac{1}{\delta}}{T}}~.

2. Which Online Algorithm Should We Use?

We can now instantiate this theorem with the regret of any online learning algorithm. For example, we could use the Exponentiated Gradient algorithm, where we think of each {f \in \mathcal{F}} as an expert. There are only two caveats: first, we would have to extend it to the case where the number of experts is infinite, possibly continuous; second, we would have to select an appropriate learning rate.

The first issue is easy to deal with: Roughly speaking, it is enough to substitute any sum over the experts with integrals. Using the fact that {g_t(f) \in [-1,1]}, the regret guarantee of the continuous version of Exponentiated Gradient is

\displaystyle \frac{\text{KL}(Q||\pi)}{\eta} + \frac{\eta T}{2},

where {\pi} is the prior distribution over the infinite experts and {Q} is the competitor distribution. We might not know how to run such an algorithm, due to the presence of the integrals, but we do not need to! We only need to know that such an algorithm and its regret bound exist.
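To see why tuning {\eta} matters, we can optimize the bound above over {\eta}. Setting the derivative of {\frac{\text{KL}(Q||\pi)}{\eta} + \frac{\eta T}{2}} with respect to {\eta} to zero gives

\displaystyle \eta^\star = \sqrt{\frac{2\,\text{KL}(Q||\pi)}{T}}, \qquad \frac{\text{KL}(Q||\pi)}{\eta^\star} + \frac{\eta^\star T}{2} = \sqrt{2\,T\,\text{KL}(Q||\pi)}~.

So the optimal tuning depends on {\text{KL}(Q||\pi)}, that is, on the unknown competitor {Q}.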

The second problem, instead, is a difficult one: the optimal learning rate depends on {Q}, while we want generalization guarantees that hold uniformly over all {Q}. This is exactly the problem we saw many times in online learning, where the optimal choice of the learning rate depends on the unknown comparator. In standard EG with a uniform prior over {d} experts we upper bounded the KL term with {\ln d}, but here we cannot do that because the number of experts is infinite. One could construct a grid of learning rates, instantiate the bound for each of them, and take a union bound, but this approach could introduce additional poly-logarithmic terms in the final bound. However, we already know how to solve this problem in online learning: we simply use parameter-free algorithms.

So, we can consider a continuous version of the parameter-free algorithm for learning with expert advice, for which we can prove the following regret bound.

Theorem 2. Let {\mathcal{F}} be a measurable space of measurable functions {f:\mathcal{Z}\rightarrow [0,1]}, {\delta \in (0,1)}, and {\pi \in \Delta(\mathcal{F})}. At each round {t}, the adversary reveals a measurable loss function {g_t:\mathcal{F}\rightarrow [0,1]}. There exists an online learning algorithm that depends on {\pi}, outputs a distribution {P_t \in \Delta(\mathcal{F})}, and incurs the loss

\displaystyle \ell_t(P)=\mathop{\mathbb E}_{f \sim P}[g_t(f)]~.

For any competitor distribution {Q \in \Delta(\mathcal{F})} such that {Q \ll \pi}, its regret satisfies

\displaystyle \text{Regret}_T(Q) :=\sum_{t=1}^T \left(\mathop{\mathbb E}_{f\sim P_t}[g_t(f)]-\mathop{\mathbb E}_{f\sim Q}[g_t(f)]\right) \leq 2\sqrt{T\left(\text{KL}(Q||\pi)+\frac12 \ln 2\right)}~.

The proof is a straightforward adaptation of the finite-expert case, but for completeness we include it in the Appendix below.

Combining the two previous theorems, and taking care of the fact that {\ell_t(P)\in [-1,1]} while Theorem 2 requires the losses {g_t} to be in {[0,1]} (rescaling the losses to {[0,1]} doubles the regret bound, which accounts for the factor {4}), we obtain the following so-called PAC-Bayes bound.

Corollary 3 (PAC-Bayes Bound). Let {\mathcal{F}} be a measurable space of measurable functions {f:\mathcal{Z}\rightarrow [0,1]}, {\delta \in (0,1)}, and {\pi \in \Delta(\mathcal{F})}. Let {\mathcal{S}=\{{\boldsymbol \xi}_1,\dots, {\boldsymbol \xi}_T\}} be drawn i.i.d. from a distribution {\rho} over {\mathcal{Z}}. Then, with probability at least {1-\delta}, simultaneously for all {Q\in \Delta(\mathcal{F})} such that {Q \ll \pi}, even if {Q} is selected with the knowledge of {\mathcal{S}}, we have

\displaystyle \overline{\text{Gen}}(Q,\mathcal{S}) \leq 4\frac{\sqrt{\text{KL}(Q||\pi)+\frac12 \ln 2}}{\sqrt{T}} + \sqrt{\frac{2\ln \frac{1}{\delta}}{T}}~.
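To get a feel for the rate, here is a small numeric evaluation of the bound in Corollary 3, under my own toy assumption of a finite class of {d} functions with a uniform prior and a posterior concentrated on a single function, so that {\text{KL}(Q||\pi)=\ln d}:

```python
import math

def pac_bayes_bound(kl, T, delta):
    """Right-hand side of Corollary 3."""
    return (4 * math.sqrt((kl + 0.5 * math.log(2)) / T)
            + math.sqrt(2 * math.log(1 / delta) / T))

# Uniform prior over d functions, point-mass posterior: KL(Q||pi) = ln d.
d, T, delta = 100, 10_000, 0.05
bound = pac_bayes_bound(math.log(d), T, delta)
print(bound)  # about 0.113
```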

The role of {\pi} is to allow the bound to hold simultaneously for all posterior distributions {Q}. Indeed, the theorem guarantees that, with probability at least {1-\delta} over the draw of the sample {\mathcal{S}}, the inequality holds for every {Q \in \Delta(\mathcal{F})} such that {Q \ll \pi}, even if {Q} is selected after observing the data. This uniformity is possible because {\pi} acts as a continuous analogue of the union bound. The divergence term {\text{KL}(Q||\pi)} plays the role of the logarithmic penalty that appears in the discrete union bound. Hence, choosing {\pi} corresponds to specifying how the bound is distributed over the different functions in {\mathcal{F}}.

Remark 1. As in the previous post, the existence of an online learning algorithm with a regret upper bound is enough to prove our bound. We never need to actually run the online algorithm. Moreover, in this case we could not run the reduction even if we wanted to, since the linear losses {\ell_t} depend on the unknown distribution {\rho}.

3. History Bits

PAC-Bayes bounds were first proposed by McAllester (1998) and, as first explained by van Erven (2014), they can be seen as a continuous generalization of the union bound. There is now a vast literature on this subject, and we refer the reader to the recent review by Alquier (2024) for an introduction. Recently, PAC-Bayes bounds gained popularity because they can yield non-vacuous generalization bounds for deep neural networks. Yet one should not confuse a certificate of generalization with an “explanation of generalization” in deep neural networks (see, e.g., Picard-Weibel et al., 2025).

The reduction from online learning to PAC-Bayes bounds, as well as the idea of using parameter-free algorithms, is from Lugosi & Neu (2023). Despite what is claimed in Lugosi & Neu (2023), the PAC-Bayes bound in Corollary 3 is not new: it was already obtained by Kakade et al. (2008). This bound shaves off a {\ln T} term under the square root compared to previous PAC-Bayes guarantees. Lugosi & Neu (2023) also present many more results based on the same reduction.

Independently, Jang et al. (2023) proposed another reduction from coin-betting to PAC-Bayes. While less general than the one presented here, it allows one to prove tighter results by changing the surrogate loss {\ell_t} to a non-linear one. Later, Kuzborskij et al. (2024) used a similar approach to show the first PAC-Bayes bound with a better-than-{\text{KL}} divergence.

Acknowledgments

Thanks to ChatGPT for checking my post.

Appendix: Proof of the Parameter-free Algorithm for Continuous Experts

It is enough to prove here the continuous analogue of the reduction from coin betting to learning with expert advice that we used for a finite number of experts. As in the finite-dimensional case, let {\mathcal{A}} be a coin-betting algorithm. We formally instantiate one copy of {\mathcal{A}} for each {f \in \mathcal{F}}. At round {t}, let {x_t(f)\in{\mathbb R}} be the bet produced by the copy corresponding to {f}, and assume that {f \mapsto x_t(f)} is measurable. Define

\displaystyle \label{eq:phat_cont} \widehat p_t(f)=\max(x_t(f),0)~. \ \ \ \ \ (1)

Then, the continuous Learning with Experts Advice (LEA) algorithm predicts with the distribution {P_t} defined by

\displaystyle \label{eq:preds_experts_cont} P_t(\mathrm{d}f)= \begin{cases} \frac{\widehat p_t(f)}{\int_{\mathcal{F}} \widehat p_t(\phi)\,\pi(\mathrm{d}\phi)}\,\pi(\mathrm{d}f), & \text{if } \int_{\mathcal{F}} \widehat p_t(\phi)\,\pi(\mathrm{d}\phi)>0,\\ \pi(\mathrm{d}f), & \text{otherwise}~. \end{cases} \ \ \ \ \ (2)

After the prediction, the algorithm receives {g_t} and defines the outcome of the continuous coin for the copy corresponding to {f} as

\displaystyle \label{eq:gradients_experts_reduction_cont} c_t(f)= \begin{cases} \mathop{\mathbb E}_{\phi\sim P_t}[g_t(\phi)]-g_t(f), & \text{if } x_t(f)>0,\\ \max\bigl(\mathop{\mathbb E}_{\phi\sim P_t}[g_t(\phi)]-g_t(f),0\bigr), & \text{if } x_t(f)\le 0~. \end{cases} \ \ \ \ \ (3)
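For intuition, here is a runnable sketch of the construction in (1)–(3), discretized to a finite grid of experts with a uniform prior. Each copy of {\mathcal{A}} is a Krichevsky–Trofimov-style bettor (this particular bettor is my own choice for the sketch; the construction works with any coin-betting algorithm):

```python
# Sketch of eqs. (1)-(3) on a finite grid of experts with a uniform prior.
# Each expert runs a KT-style coin-betting copy of A: bet a fraction
# (sum of past coins)/t of its current wealth (an assumed choice of A).
d, T = 2, 200
losses = [0.2, 0.8]           # g_t(f): constant loss per expert, in [0, 1]
prior = [1.0 / d] * d
wealth = [1.0] * d
coin_sums = [0.0] * d
regret = 0.0

for t in range(1, T + 1):
    # Bets x_t(f) and truncations p_hat_t(f) = max(x_t(f), 0), eq. (1).
    x = [coin_sums[i] / t * wealth[i] for i in range(d)]
    p_hat = [max(xi, 0.0) for xi in x]
    Z = sum(p_hat[i] * prior[i] for i in range(d))
    # Prediction P_t, eq. (2): normalize p_hat, or fall back to the prior.
    if Z > 0:
        P = [p_hat[i] * prior[i] / Z for i in range(d)]
    else:
        P = prior[:]
    assert abs(sum(P) - 1.0) < 1e-12
    expected_loss = sum(P[i] * losses[i] for i in range(d))
    regret += expected_loss - min(losses)
    # Coin outcomes, eq. (3), and wealth updates of each copy of A.
    for i in range(d):
        c = expected_loss - losses[i]
        if x[i] <= 0:
            c = max(c, 0.0)
        wealth[i] += c * x[i]
        coin_sums[i] += c

print(regret)
```

On this instance the algorithm moves all the mass to the better expert after a single round, so only round 1 contributes to the regret.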

The construction above defines a LEA algorithm on a continuous set of experts. The regret guarantee is the following one.

Theorem 4 (Regret Bound for a Continuous Set of Experts). Let {\mathcal{A}} be a coin-betting algorithm such that, for any sequence of continuous coin outcomes {c'_1,\dots,c'_T \in [-1,1]}, its wealth after {T} rounds with initial wealth equal to {1} satisfies

\displaystyle 1+\sum_{t=1}^T c'_t x_t \ge \exp\left(f_T\left(\sum_{t=1}^T c'_t\right)\right)~.

Then, the continuous LEA algorithm with prior {\pi \in \Delta(\mathcal{F})} that predicts at each round with {P_t} in (2) satisfies

\displaystyle \text{Regret}_T(Q) \le h\bigl(\text{KL}(Q||\pi)\bigr), \quad \forall Q \in \Delta(\mathcal{F}) \text{ with } Q \ll \pi,

for any {h:{\mathbb R}\rightarrow{\mathbb R}} concave and non-decreasing such that {x\le h(f_T(x))} for all {x \in {\mathbb R}}.

Proof: We first prove that

\displaystyle \int_{\mathcal{F}} c_t(f)x_t(f)\,\pi(\mathrm{d}f)\le 0~.

Indeed, by the definition of {c_t(f)}, we have

\displaystyle \begin{aligned} \int_{\mathcal{F}} c_t(f)x_t(f)\,\pi(\mathrm{d}f) &= \int_{\{f\,:\,x_t(f)>0\}} x_t(f)\bigl(\mathop{\mathbb E}_{u\sim P_t}[g_t(u)]-g_t(f)\bigr)\,\pi(\mathrm{d}f) \\ &\qquad + \int_{\{f\,:\,x_t(f)\le 0\}} x_t(f)\max\bigl(\mathop{\mathbb E}_{u\sim P_t}[g_t(u)]-g_t(f),0\bigr)\,\pi(\mathrm{d}f)~. \end{aligned}

Now define

\displaystyle Z_t:=\int_{\mathcal{F}} \widehat p_t(f)\,\pi(\mathrm{d}f)=\int_{\mathcal{F}} \max(x_t(f),0)\,\pi(\mathrm{d}f)~.

If {Z_t=0}, then {x_t(f)\le 0} for {\pi}-almost every {f}, so the first integral is equal to {0}. If instead {Z_t>0}, then by (1) and (2),

\displaystyle x_t(f)=\widehat p_t(f)=Z_t \frac{\mathrm{d}P_t}{\mathrm{d}\pi}(f) \qquad \text{for } \pi \text{-almost every } f \text{ such that } x_t(f)>0~.

Hence,

\displaystyle \begin{aligned} \int_{\{f\,:\,x_t(f)>0\}} x_t(f)\bigl(\mathop{\mathbb E}_{u\sim P_t}[g_t(u)]-g_t(f)\bigr)\,\pi(\mathrm{d}f) &= Z_t \int_{\mathcal{F}} \bigl(\mathop{\mathbb E}_{u\sim P_t}[g_t(u)]-g_t(f)\bigr)\,P_t(\mathrm{d}f) \\ &= Z_t \left(\mathop{\mathbb E}_{u\sim P_t}[g_t(u)]-\mathop{\mathbb E}_{f\sim P_t}[g_t(f)]\right)\\ &=0~. \end{aligned}

Therefore, in all cases,

\displaystyle \begin{aligned} \int_{\mathcal{F}} c_t(f)x_t(f)\,\pi(\mathrm{d}f) &= 0+\int_{\{f\,:\,x_t(f)\le 0\}} x_t(f)\max\bigl(\mathop{\mathbb E}_{u\sim P_t}[g_t(u)]-g_t(f),0\bigr)\,\pi(\mathrm{d}f) \\ &\le 0~, \end{aligned}

because {x_t(f)\le 0} on the integration domain and {\max(\mathop{\mathbb E}_{u\sim P_t}[g_t(u)]-g_t(f),0)\ge 0}.

From the assumption on {\mathcal{A}}, for every fixed {f\in\mathcal{F}} and for the sequence {(c_1(f),\dots,c_T(f)) \in [-1,1]^T}, we have

\displaystyle 1+\sum_{t=1}^T c_t(f)x_t(f) \ge \exp\left(f_T\left(\sum_{t=1}^T c_t(f)\right)\right)~.

Integrating both sides with respect to {\pi}, we obtain

\displaystyle \begin{aligned} \int_{\mathcal{F}} \exp\left(f_T\left(\sum_{t=1}^T c_t(f)\right)\right)\pi(\mathrm{d}f) &\le \int_{\mathcal{F}} \left(1+\sum_{t=1}^T c_t(f)x_t(f)\right)\pi(\mathrm{d}f) & (4) \\ &= 1+\sum_{t=1}^T \int_{\mathcal{F}} c_t(f)x_t(f)\,\pi(\mathrm{d}f) \le 1~. & (5) \end{aligned}

Now, fix any competitor {Q \in \Delta(\mathcal{F})} such that {Q\ll \pi}. Then,

\displaystyle \begin{aligned} \text{Regret}_T(Q) &= \sum_{t=1}^T \left(\mathop{\mathbb E}_{f\sim P_t}[g_t(f)]-\mathop{\mathbb E}_{f\sim Q}[g_t(f)]\right) \\ &= \sum_{t=1}^T \int_{\mathcal{F}} \left(\mathop{\mathbb E}_{u\sim P_t}[g_t(u)]-g_t(f)\right)Q(\mathrm{d}f) \\ &\le \sum_{t=1}^T \int_{\mathcal{F}} c_t(f)\,Q(\mathrm{d}f) \qquad \text{(by the definition of } c_t(f)\text{)} \\ &= \int_{\mathcal{F}} \sum_{t=1}^T c_t(f)\,Q(\mathrm{d}f) \\ &\le \int_{\mathcal{F}} h\left(f_T\left(\sum_{t=1}^T c_t(f)\right)\right)Q(\mathrm{d}f) \qquad \text{(since } x \le h(f_T(x))\text{)} \\ &\le h\left(\int_{\mathcal{F}} f_T\left(\sum_{t=1}^T c_t(f)\right)Q(\mathrm{d}f)\right) \qquad \text{(by concavity of } h \text{ and Jensen's inequality)}. \end{aligned}

So, it remains to upper bound {\int_{\mathcal{F}} f_T\left(\sum_{t=1}^T c_t(f)\right)Q(\mathrm{d}f)}. Define {\varphi(f):=f_T\left(\sum_{t=1}^T c_t(f)\right)}. Since {Q\ll \pi}, let {r=\frac{\mathrm{d}Q}{\mathrm{d}\pi}}. Then

\displaystyle \begin{aligned} \int_{\mathcal{F}} \varphi(f)\,Q(\mathrm{d}f)-\text{KL}(Q||\pi) &= \int_{\mathcal{F}} \left(\varphi(f)-\ln r(f)\right)Q(\mathrm{d}f) \\ &= \int_{\mathcal{F}} \ln\!\left(\frac{e^{\varphi(f)}}{r(f)}\right)\,Q(\mathrm{d}f) \\ &\le \ln \int_{\mathcal{F}} \frac{e^{\varphi(f)}}{r(f)}\,Q(\mathrm{d}f) \\ &= \ln \int_{\mathcal{F}} e^{\varphi(f)}\,\pi(\mathrm{d}f), \end{aligned}

where in the inequality we used Jensen’s inequality for the concave function {\ln}. Rearranging, we get

\displaystyle \int_{\mathcal{F}} \varphi(f)\,Q(\mathrm{d}f) \le \text{KL}(Q||\pi)+\ln \int_{\mathcal{F}} e^{\varphi(f)}\,\pi(\mathrm{d}f)~.

Substituting back the definition of {\varphi}, we obtain

\displaystyle \int_{\mathcal{F}} f_T\left(\sum_{t=1}^T c_t(f)\right)Q(\mathrm{d}f) \le \text{KL}(Q||\pi)+\ln \int_{\mathcal{F}} \exp\left(f_T\left(\sum_{t=1}^T c_t(f)\right)\right)\pi(\mathrm{d}f).

Using (5), the logarithm is at most {\ln 1 = 0}, so

\displaystyle \int_{\mathcal{F}} f_T\left(\sum_{t=1}^T c_t(f)\right)Q(\mathrm{d}f) \le \text{KL}(Q||\pi)~.

Putting everything together, we have the stated bound. \Box
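As a concrete instantiation of Theorem 4, assume we have a coin-betting algorithm {\mathcal{A}} with wealth guarantee {f_T(x)=\frac{x^2}{4T}-\frac12\ln 2} (guarantees of this form are known for coin-betting algorithms; the exact constant here is my assumption, chosen to match Theorem 2). Take {h(y)=2\sqrt{T\left(y+\frac12\ln 2\right)}}, which is concave and non-decreasing on the range of {f_T}. Then

\displaystyle h(f_T(x)) = 2\sqrt{T\cdot \frac{x^2}{4T}} = |x| \ge x, \qquad \forall x \in {\mathbb R},

so the condition of Theorem 4 is satisfied and, since {\text{KL}(Q||\pi)\ge 0}, we get {\text{Regret}_T(Q) \le h(\text{KL}(Q||\pi)) = 2\sqrt{T\left(\text{KL}(Q||\pi)+\frac12\ln 2\right)}}, exactly the bound of Theorem 2.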
