From Online Learning to Rademacher Complexity

This is the first of three posts showing that online learning is more than just online learning!
Each of these posts will show how we can go from online learning to X, where X will be something completely unrelated to online learning. These results are based on the fact that a linear regret proof is, at its core, a statement about arbitrary sequences of vectors, so it can be reused in other settings as well.
The results I'll show in these posts are not new, but they are not as widely known as they should be.

1. From Linear Regret to Rademacher Complexity

Previously, we used online-to-batch conversions to turn regret bounds into guarantees on the risk. A different (and extremely useful) way to quantify the statistical difficulty of a hypothesis class is through its Rademacher complexity. It measures how well the class can correlate with pure noise, and it is the key quantity behind many uniform convergence and generalization results. Here, we show a simple reduction that upper bounds the Rademacher complexity of a class using the regret guarantee of an online learning algorithm.

We consider Vapnik's general setting of learning for generic function classes. Hence, we seek to minimize

\displaystyle \label{eq:vapnik_general_obj} \min_{f \in \mathcal{F}} \ \mathop{\mathbb E}_{{\boldsymbol \xi}\sim\rho} [f({\boldsymbol \xi})] \ \ \ \ \ (1)

where {\mathcal{F}} is a class of real-valued functions on {\mathcal{Z}} and {\rho} is a distribution over {\mathcal{Z}}. One common way to solve this problem is empirical risk minimization, that is, minimizing the training error: we gather a sample of size {T}, {\{{\boldsymbol \xi}_1, \dots, {\boldsymbol \xi}_T\}}, drawn i.i.d. from {\rho}, and then solve the empirical version of (1):

\displaystyle f_\mathcal{T} =\mathop{\mathrm{argmin}}_{f \in \mathcal{F}} \ \frac{1}{T} \sum_{t=1}^T f({\boldsymbol \xi}_t)~.

It is now natural to ask if {f_\mathcal{T}} will also minimize the objective function in (1).
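To fix ideas, here is a minimal Python sketch of empirical risk minimization over a finite class; the toy function class and sample below are my own illustrative choices, not part of the setting above:

```python
import numpy as np

def erm(sample, function_class):
    """Empirical risk minimization over a finite class: return the function
    with the smallest average value (empirical risk) on the sample."""
    empirical_risks = [np.mean([f(xi) for xi in sample]) for f in function_class]
    return function_class[int(np.argmin(empirical_risks))]

# Toy class: squared distance to a candidate center c.
candidates = [0.0, 1.0, 2.0]
function_class = [lambda xi, c=c: (xi - c) ** 2 for c in candidates]
sample = [0.9, 1.1, 1.0]          # points clustered around 1, so ERM picks c = 1
f_T = erm(sample, function_class)
```

Of course, nothing in this computation tells us whether the selected function is also good for the objective in (1); that is exactly the question addressed next.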

One way to study this problem is through the concept of Rademacher complexity of a function class.

Definition 1 (Rademacher complexity). Let {\mathcal{T}=\{{\boldsymbol \xi}_1,\dots,{\boldsymbol \xi}_T\}} be an arbitrary set of {T} vectors in some domain {\mathcal{Z}}. Let {\mathcal{F}} be a class of real-valued functions on {\mathcal{Z}}. Let {\epsilon_1,\dots,\epsilon_T} be i.i.d. Rademacher random variables, that is {\mathop{\mathbb P}\{\epsilon_t=1\}=\mathop{\mathbb P}\{\epsilon_t=-1\}=1/2}. The empirical Rademacher complexity of {\mathcal{F}} on {\mathcal{T}} is defined as

\displaystyle \label{eq:emp_rad} \widehat{\mathfrak{R}}_{\mathcal{T}}(\mathcal{F}) \stackrel{\text{def}}{=} \frac{1}{T}\, \mathop{\mathbb E}_{\epsilon_1, \dots, \epsilon_T}\left[ \sup_{f\in\mathcal{F}} \ \sum_{t=1}^T \epsilon_t f({\boldsymbol \xi}_t)\right]~. \ \ \ \ \ (2)

Intuitively, if {\widehat{\mathfrak{R}}_{\mathcal{T}}(\mathcal{F})} is small, then no function in the class can fit random signs too well on the sample. When this happens, we can expect the minimizers of the empirical risk to be close to the minimizers of the true risk, because they are less likely to be fooled by random fluctuations of the training data.
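The expectation in (2) is easy to estimate by Monte Carlo whenever the supremum has a closed form. Here is a sketch for the linear class over an {\ell_2} ball (my own choice of class, made so that the sup is computable exactly):

```python
import numpy as np

def empirical_rademacher_l2_ball(Z, U=1.0, n_mc=10000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity (2) of
    the linear class {z -> <x, z> : ||x||_2 <= U} on the sample Z (T x d).
    For this class the supremum has the closed form
        sup_{||x||_2 <= U} sum_t eps_t <x, z_t> = U * ||sum_t eps_t z_t||_2.
    """
    rng = np.random.default_rng(seed)
    T, _ = Z.shape
    eps = rng.choice([-1.0, 1.0], size=(n_mc, T))   # i.i.d. Rademacher signs
    sups = U * np.linalg.norm(eps @ Z, axis=1)      # one supremum per sign draw
    return sups.mean() / T
```

On {T} unit-norm vectors the estimate stays below {U/\sqrt{T}}, in line with the classical bound for this class (and with Corollary 4 below for {p=q=2}).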

This idea can be made formal and yields generalization guarantees; for example, bounds of the form

\displaystyle \sup_{f\in\mathcal{F}} \ \left| \mathop{\mathbb E}[f({\boldsymbol \xi})] - \frac{1}{T}\sum_{t=1}^T f({\boldsymbol \xi}_t) \right| \lesssim \widehat{\mathfrak{R}}_{\mathcal{T}}(\mathcal{F}) + \sqrt{\frac{\ln(1/\delta)}{T}}

with probability at least {1-\delta} under mild boundedness assumptions. Given that the above bound applies uniformly over {\mathcal{F}}, it also applies to the empirical risk minimizer {f_\mathcal{T}}.

Moreover, one can easily obtain a generalization guarantee that depends on {f_\mathcal{T}}, rather than being uniform over a class. It is enough to consider a countable nested family {(\mathcal{F}_k)_{k=1}^\infty} whose union covers the entire space of possible outcomes of the empirical risk minimization process. Then, apply the uniform bound to each {\mathcal{F}_k} with a union bound, and choose the smallest class containing the empirical risk minimizer.

We will not prove such results here; instead, we focus on the following question: how can we upper bound {\widehat{\mathfrak{R}}_{\mathcal{T}}(\mathcal{F})} for interesting classes?

Here, we consider function classes given by composing Lipschitz functions with linear predictors. A typical example is the following: given a prediction function class {\{\phi_{{\boldsymbol x}}:{\boldsymbol x}\in\mathcal{V}\}} and a loss {\ell(\hat{y},y)}, we obtain the induced function class

\displaystyle \mathcal{F}_{\ell} \stackrel{\text{def}}{=} \left\{{\boldsymbol \xi}=({\boldsymbol z},y)\mapsto \ell(\phi_{{\boldsymbol x}}({\boldsymbol z}),y)\ :\ {\boldsymbol x}\in\mathcal{V} \right\},

where {({\boldsymbol z},y) \in \mathcal{Z} \times \mathcal{Y}}. Hence, we would like to control {\widehat{\mathfrak{R}}_{\mathcal{T}}(\mathcal{F}_{\ell})}. A common situation is that {\ell(\cdot,y)} is Lipschitz, say {L}-Lipschitz for all {y}. In this case, it is intuitive that a Lipschitz loss cannot increase the ability to correlate with noise by more than a factor {L}. This is formalized by contraction inequalities, as in the next lemma.

Lemma 2 (Contraction Lemma). Let {h:{\mathbb R} \rightarrow {\mathbb R}} be {L}-Lipschitz. Consider the function class {h \circ \mathcal{F}=\{h \circ f: f \in \mathcal{F}\}}. Then, for any sample {\mathcal{T}}, we have

\displaystyle \widehat{\mathfrak{R}}_{\mathcal{T}}(h\circ \mathcal{F}) \leq L \, \widehat{\mathfrak{R}}_{\mathcal{T}}(\mathcal{F})~.

Hence, in our setting it suffices to control the complexity of linear functions, which we do next.

2. Rademacher Complexity of Linear Classes through Regret Guarantees

We now state the core reduction. Let {\mathcal{V}\subset{\mathbb R}^d} be a non-empty, bounded, closed set and consider the linear class

\displaystyle \mathcal{F}_{\mathrm{lin}} \stackrel{\text{def}}{=} \left\{ {\boldsymbol z} \mapsto \langle {\boldsymbol x},{\boldsymbol z}\rangle \ :\ {\boldsymbol x}\in\mathcal{V} \right\}~.

Fix a sample {\mathcal{T}=\{{\boldsymbol z}_1,\dots,{\boldsymbol z}_T\}} and define the empirical Rademacher complexity of {\mathcal{F}_{\mathrm{lin}}} on {\mathcal{T}} as in (2). The following theorem shows how to use the existence of a regret upper bound for an online learning algorithm to upper bound such empirical Rademacher complexity.

Theorem 3. Let {\mathcal{Z}\subset{\mathbb R}^d} be such that for any {{\boldsymbol z} \in \mathcal{Z}} we also have {-{\boldsymbol z} \in \mathcal{Z}}. Fix any sequence {{\boldsymbol z}_1,\dots,{\boldsymbol z}_T\in\mathcal{Z}} and any non-empty, bounded, closed set {\mathcal{V}\subset{\mathbb R}^d}. Assume that there exists an online algorithm that, on any sequence of linear losses {\ell_t({\boldsymbol x})=\langle {\boldsymbol g}_t,{\boldsymbol x}\rangle}, where {{\boldsymbol g}_t \in \mathcal{Z}}, produces {{\boldsymbol x}_t\in\mathcal{V}} and satisfies the regret guarantee

\displaystyle \label{eq:regret_generic} \sum_{t=1}^T \langle {\boldsymbol g}_t,{\boldsymbol x}_t\rangle - \min_{{\boldsymbol u}\in\mathcal{V}} \ \sum_{t=1}^T \langle {\boldsymbol g}_t,{\boldsymbol u}\rangle \leq R(T)~. \ \ \ \ \ (3)

Then, the empirical Rademacher complexity of the linear class satisfies

\displaystyle \label{eq:rad_bound_linear} \widehat{\mathfrak{R}}_{\mathcal{T}}(\mathcal{F}_{\mathrm{lin}}) = \frac{1}{T} \mathop{\mathbb E}_{\epsilon_1, \dots, \epsilon_T}\left[ \sup_{{\boldsymbol x}\in\mathcal{V}}\ \sum_{t=1}^T \epsilon_t \langle {\boldsymbol x},{\boldsymbol z}_t\rangle \right] \leq \frac{R(T)}{T}~. \ \ \ \ \ (4)

Proof: Fix {{\boldsymbol z}_1,\dots,{\boldsymbol z}_T} and draw {\epsilon_1,\dots,\epsilon_T}. Consider the online linear game with gradients {{\boldsymbol g}_t\stackrel{\text{def}}{=} -\epsilon_t {\boldsymbol z}_t}. Let {{\boldsymbol x}_1,\dots,{\boldsymbol x}_T} be the iterates produced by the online algorithm when fed with {{\boldsymbol g}_1,\dots,{\boldsymbol g}_T}. By the regret guarantee (3) we have, for every realization of the signs,

\displaystyle \begin{aligned} \label{eq:rad_regret_step} \sum_{t=1}^T \langle -\epsilon_t {\boldsymbol z}_t,{\boldsymbol x}_t\rangle - \min_{{\boldsymbol u}\in\mathcal{V}} \ \sum_{t=1}^T \langle -\epsilon_t {\boldsymbol z}_t,{\boldsymbol u}\rangle &\leq R(T)~. \ \ \ \ \ (5)\end{aligned}

Rearranging, we get

\displaystyle \max_{{\boldsymbol u}\in\mathcal{V}} \ \sum_{t=1}^T \epsilon_t \langle {\boldsymbol u},{\boldsymbol z}_t\rangle \leq R(T) + \sum_{t=1}^T \epsilon_t \langle {\boldsymbol x}_t,{\boldsymbol z}_t\rangle~.

Taking expectation with respect to {\epsilon_1,\dots,\epsilon_T} and dividing by {T} yields

\displaystyle \label{eq:rad_exp_split} \widehat{\mathfrak{R}}_{\mathcal{T}}(\mathcal{F}_{\mathrm{lin}}) \leq \frac{R(T)}{T} + \frac{1}{T}\mathop{\mathbb E}_{\epsilon_1, \dots, \epsilon_T}\left[\sum_{t=1}^T \epsilon_t \langle {\boldsymbol x}_t,{\boldsymbol z}_t\rangle\right]~. \ \ \ \ \ (6)

It remains to show that the second term is {0}. The key observation is that {{\boldsymbol x}_t} depends only on the past signs {\epsilon_1,\dots,\epsilon_{t-1}} (and possibly on the algorithm's internal randomization), while {\epsilon_t} is independent of the past and has zero mean. Hence, conditioning on the past,

\displaystyle \mathop{\mathbb E}_{\epsilon_1, \dots, \epsilon_T}\!\left[\epsilon_t \langle {\boldsymbol x}_t,{\boldsymbol z}_t\rangle \middle| \epsilon_1,\dots,\epsilon_{t-1}, \text{internal randomness}\right] = \langle {\boldsymbol x}_t,{\boldsymbol z}_t\rangle\,\mathop{\mathbb E}[\epsilon_t] =0,

and by taking expectation again we obtain {\mathop{\mathbb E}[\epsilon_t \langle {\boldsymbol x}_t,{\boldsymbol z}_t\rangle]=0}. Summing over {t} gives {\mathop{\mathbb E}_{\epsilon_1, \dots, \epsilon_T}\left[\sum_{t=1}^T \epsilon_t \langle {\boldsymbol x}_t,{\boldsymbol z}_t\rangle\right]=0}, so (6) implies (4). \Box

Remark 1. The proof of Theorem 3 has the same flavor as the probabilistic method argument used in lower bounds: we introduce Rademacher signs, feed them to an online algorithm, and then exploit the fact that the algorithm cannot correlate with the fresh randomness at time {t}. The entire complexity term is paid by the regret upper bound {R(T)}.
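The proof is constructive, so we can watch it work. Below is a sketch of my own instantiation (not from the theorem itself) using projected online gradient descent on an {\ell_2} ball, whose regret with a fixed step size satisfies {R(T)=U G \sqrt{T}}: we feed it the losses {\langle -\epsilon_t {\boldsymbol z}_t, \cdot\rangle} and track the correlation {\sum_t \epsilon_t \langle {\boldsymbol x}_t, {\boldsymbol z}_t\rangle}, which averages to zero over sign draws.

```python
import numpy as np

def ogd_correlation(Z, eps, U=1.0):
    """Run projected online gradient descent on the ball {||x||_2 <= U}
    against the linear losses <g_t, x> with g_t = -eps_t * z_t, and return
    the algorithm's correlation sum_t eps_t <x_t, z_t> -- the term that
    vanishes in expectation in the proof of Theorem 3."""
    T, d = Z.shape
    G = np.linalg.norm(Z, axis=1).max()
    eta = U / (G * np.sqrt(T))          # fixed step size: regret <= U*G*sqrt(T)
    x = np.zeros(d)
    corr = 0.0
    for t in range(T):
        corr += eps[t] * (x @ Z[t])     # x_t depends only on eps_1..eps_{t-1}
        x = x + eta * eps[t] * Z[t]     # gradient step x - eta * g_t
        norm_x = np.linalg.norm(x)
        if norm_x > U:                  # Euclidean projection onto the ball
            x *= U / norm_x
    return corr

# For every realization of the signs, rearranging the regret bound gives
#   U * ||sum_t eps_t z_t||_2  <=  R(T) + sum_t eps_t <x_t, z_t>.
rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 5))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)   # ||z_t||_2 = 1, so G = 1
R = np.sqrt(200)                                # R(T) = U*G*sqrt(T) with U = 1
eps = rng.choice([-1.0, 1.0], size=200)
assert np.linalg.norm(eps @ Z) <= R + ogd_correlation(Z, eps) + 1e-9
```

The per-realization inequality above is exactly the rearranged regret bound in the proof; averaging the correlation term over many sign draws gives a value close to zero, so the complexity is paid entirely by {R(T)/T}.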

We can instantiate the above theorem using the Online Mirror Descent (OMD) algorithm with {p}-norms.

Corollary 4. Let {q\in [2, \infty)} and {p\in (1,2]} such that {1/p+1/q=1}. For {U_p>0}, let {\mathcal{V}_p\stackrel{\text{def}}{=}\{{\boldsymbol x} \in {\mathbb R}^d: \|{\boldsymbol x}\|_p\leq U_p\}}. Let {\mathcal{T}\stackrel{\text{def}}{=}\{{\boldsymbol z}_1, \dots, {\boldsymbol z}_T\}} where {{\boldsymbol z}_t\in \mathcal{Z}\subset{\mathbb R}^d} and {\|{\boldsymbol z}_t\|_q\leq G_q} for {t=1, \dots, T}. Let {\mathcal{F}_{p} \stackrel{\text{def}}{=} \left\{ {\boldsymbol z} \mapsto \langle {\boldsymbol x},{\boldsymbol z}\rangle \ :\ {\boldsymbol x}\in\mathcal{V}_p \right\}}. Then,

\displaystyle \frac{1}{T} \mathop{\mathbb E}_{\epsilon_1, \dots, \epsilon_T}\left[ \sup_{{\boldsymbol x}\in\mathcal{V}_p} \ \sum_{t=1}^T \epsilon_t \langle {\boldsymbol x},{\boldsymbol z}_t\rangle \right] \leq \frac{U_p G_q}{\sqrt{(p-1)T}}~.

Proof: We instantiate Theorem 3 with OMD with {p}-norms, learning rate {\eta=\frac{U_p\sqrt{p-1}}{G_q \sqrt{T}}} and {{\boldsymbol x}_1=\boldsymbol{0}}. \Box
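We can also sanity-check Corollary 4 numerically without running OMD at all: by duality of {p}- and {q}-norms, {\sup_{\|{\boldsymbol x}\|_p \le U_p} \langle {\boldsymbol x}, {\boldsymbol v}\rangle = U_p \|{\boldsymbol v}\|_q}, so the expectation in the corollary admits a direct Monte Carlo estimate (a sketch, with illustrative data):

```python
import numpy as np

def rademacher_p_ball(Z, p=1.5, U=1.0, n_mc=5000, seed=0):
    """Monte Carlo estimate of the Rademacher complexity of the linear class
    over the p-norm ball of radius U. By duality of the p- and q-norms,
        sup_{||x||_p <= U} sum_t eps_t <x, z_t> = U * ||sum_t eps_t z_t||_q.
    """
    q = p / (p - 1)                                  # 1/p + 1/q = 1
    rng = np.random.default_rng(seed)
    T = Z.shape[0]
    eps = rng.choice([-1.0, 1.0], size=(n_mc, T))    # i.i.d. Rademacher signs
    return U * np.mean(np.linalg.norm(eps @ Z, ord=q, axis=1)) / T
```

On data normalized so that {G_q=1}, the estimate sits below {U_p/\sqrt{(p-1)T}}, as the corollary predicts.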

We can also consider the case of classes of functions with finite cardinality.

Corollary 5 (Massart’s Lemma). Let {\mathcal{T}\stackrel{\text{def}}{=}\{{\boldsymbol z}_1, \dots, {\boldsymbol z}_T\}} where {{\boldsymbol z}_t\in \mathcal{Z}\subset{\mathbb R}^d} and {\|{\boldsymbol z}_t\|_\infty\leq G_\infty} for {t=1, \dots, T}. Let {\mathcal{F} \stackrel{\text{def}}{=} \left\{ {\boldsymbol z} \mapsto \langle {\boldsymbol x},{\boldsymbol z}\rangle \ :\ {\boldsymbol x}\in\Delta^{d-1} \right\}}. Then,

\displaystyle \frac{1}{T} \mathop{\mathbb E}_{\epsilon_1,\dots, \epsilon_T}\left[ \max_{i=1,\dots, d} \ \sum_{t=1}^T \epsilon_t z_{t,i}\right] \leq \frac{G_\infty \sqrt{2 \ln d}}{\sqrt{T}}~.

Proof: We instantiate Theorem 3 with OMD with entropic regularizer, learning rate {\eta=\frac{\sqrt{2 \ln d}}{G_\infty \sqrt{T}}} and {{\boldsymbol x}_1=[1/d, \dots, 1/d]^\top}. \Box
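Again, the left-hand side can be checked by direct Monte Carlo, since the sup of a linear function over the simplex is attained at a vertex, i.e., it is a max over the {d} coordinates (a sketch, with illustrative sign-matrix data):

```python
import numpy as np

def massart_lhs(Z, n_mc=20000, seed=0):
    """Monte Carlo estimate of (1/T) E[max_i sum_t eps_t z_{t,i}], the
    Rademacher complexity of the simplex linear class: the supremum of a
    linear function over the simplex is attained at a vertex, so it reduces
    to a max over the d coordinates."""
    rng = np.random.default_rng(seed)
    T, _ = Z.shape
    eps = rng.choice([-1.0, 1.0], size=(n_mc, T))   # i.i.d. Rademacher signs
    return np.mean(np.max(eps @ Z, axis=1)) / T
```

With entries bounded by {G_\infty=1}, the estimate stays below {\sqrt{2 \ln d}/\sqrt{T}}, as Massart's lemma predicts.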

3. History Bits

The contraction lemma appears in many places in the literature (see, e.g., Ledoux & Talagrand, 1991; Bartlett & Mendelson, 2001; Bartlett & Mendelson, 2002; Meir & Zhang, 2003).

Kakade et al. (2008) proved Theorem 3 using strong convexity, while Kakade et al. (2009) proved it through the proof of FTRL. Here, we simply used a black-box conversion to the linear regret upper bound for any online learning algorithm.

Massart’s lemma was originally proved by Massart (2000).

Acknowledgments

Thanks to ChatGPT for checking my post.
