There is a popular interpretation of the Perceptron as a stochastic (sub)gradient descent procedure. I even found slides online with this idea. The thought of so many young minds twisted by these false claims was too much to bear. So, I felt compelled to write a blog post to explain why this is wrong…

Moreover, I will also give a different and (I think) much better interpretation of the Perceptron algorithm.

**1. Perceptron Algorithm**

The Perceptron algorithm was introduced by Rosenblatt in 1958. To be more precise, he introduced a family of algorithms characterized by a certain architecture. He also considered what we now call supervised and unsupervised training procedures. However, nowadays when we talk about the Perceptron we mean the following algorithm:

In the algorithm, the couples $(\boldsymbol{x}_t, y_t)$ for $t=1, \dots, T$, with $\boldsymbol{x}_t \in \mathbb{R}^d$ and $y_t \in \{-1, 1\}$, represent a set of input/output pairs that we want to learn to classify correctly into the two categories $-1$ and $1$. We assume that there exists an unknown vector $\boldsymbol{u}$ that correctly classifies all the samples, that is $y_t \langle \boldsymbol{u}, \boldsymbol{x}_t \rangle > 0$ for all $t$. Note that any scaling of $\boldsymbol{u}$ by a positive constant still correctly classifies all the samples, so there are infinitely many solutions. The aim of the Perceptron is to find any one of these solutions.
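Since the pseudocode is not reproduced here, the following is a minimal sketch of the algorithm as described in the text (the multi-pass loop and the toy data are my own additions):

```python
import numpy as np

def perceptron(X, y, max_passes=100):
    """Minimal Perceptron: update w += y_t * x_t on every mistake.

    X: (n, d) array of samples, y: (n,) array of labels in {-1, +1}.
    Returns w once a full pass makes no mistakes (or after max_passes).
    """
    n, d = X.shape
    w = np.zeros(d)  # standard initialization w_1 = 0
    for _ in range(max_passes):
        mistakes = 0
        for t in range(n):
            if y[t] * np.dot(w, X[t]) <= 0:  # mistake: update
                w += y[t] * X[t]
                mistakes += 1
        if mistakes == 0:  # all samples correctly classified
            return w
    return w

# Linearly separable toy data: label given by the sign of the first coordinate
X = np.array([[1.0, 0.5], [2.0, -1.0], [-1.0, 0.3], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(all(y[t] * np.dot(w, X[t]) > 0 for t in range(len(y))))  # True
```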

From an optimization point of view, this is called a *feasibility problem*, that is, something like

$\text{find } \boldsymbol{x} \ \text{ such that } \ \boldsymbol{x} \in C,$

where $C$ is some set. Feasibility problems are an essential step in constrained optimization for algorithms that require a feasible initial point. They are not optimization problems, even if in some cases they can be solved through an optimization formulation.

In the Perceptron case, we can restate the problem as

$\text{find } \boldsymbol{u} \ \text{ such that } \ y_t \langle \boldsymbol{u}, \boldsymbol{x}_t \rangle \geq 1, \ t=1, \dots, T,$

where the “1” on the r.h.s. is clearly arbitrary and it can be changed through a rescaling of $\boldsymbol{u}$. So, in optimization language, the Perceptron algorithm is nothing else than an iterative procedure to solve the above feasibility problem.

**2. Issues with the SGD Interpretation**

As said above, sometimes people refer to the Perceptron as a stochastic (sub)gradient descent algorithm on the objective function

$F(\boldsymbol{w}) = \frac{1}{T} \sum_{t=1}^T \max(-y_t \langle \boldsymbol{w}, \boldsymbol{x}_t \rangle, 0). \qquad (1)$

I think there are many problems with this idea; let me list some of them:

- First of all, the above interpretation assumes that we take the samples uniformly at random from the training set. However, this is not needed in the Perceptron, and it was not needed in the first proofs of the Perceptron convergence (Novikoff, 1963). There is a tendency to call anything that receives one sample at a time “stochastic”, but “arbitrary order” and “stochastic” are clearly not the same.
- The Perceptron is typically initialized with $\boldsymbol{w}_1 = \boldsymbol{0}$. Now, we have two problems. The first one is that with a black-box first-order oracle, we would get a subgradient of $\max(-y_t \langle \boldsymbol{w}, \boldsymbol{x}_t \rangle, 0)$, where $t$ is drawn uniformly at random. A possible subgradient of this function in $\boldsymbol{w}_1 = \boldsymbol{0}$ is the zero vector, for any sample. This means that SGD would not update. Instead, the Perceptron in this case does update. So, we are forced to consider a different model than the black-box one. Changing the oracle model is a minor problem, but this fact hints at another very big issue.
- The biggest issue is that $\boldsymbol{w} = \boldsymbol{0}$ is a global optimum of $F$! So, there is nothing to minimize: we are already done at the first iteration. However, from a classification point of view, this solution seems clearly wrong. So, it seems we constructed an objective function we want to minimize and a corresponding algorithm, but for some reason we do not like one of its infinite minimizers. So, maybe, the objective function is wrong? So, maybe, this interpretation misses something?

There is an easy way to avoid some of the above problems: change the objective function to a parametrized loss that has a non-zero gradient in zero. For example, something like this:

$F_\beta(\boldsymbol{w}) = \frac{1}{T} \sum_{t=1}^T \frac{1}{\beta} \ln\left(1 + \exp(-\beta\, y_t \langle \boldsymbol{w}, \boldsymbol{x}_t \rangle)\right).$

Now, when $\beta$ goes to infinity, you recover the function $F$. However, for any finite $\beta$, $\boldsymbol{0}$ is not a global optimum anymore. As a side effect, we also solved the issue of the subgradient of the max function. In this way, you could interpret the Perceptron algorithm as the *limit behaviour of SGD on a family of optimization problems*.
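To see the limit behaviour concretely, here is a small numerical check. I use the scaled logistic (softplus) loss $\frac{1}{\beta}\ln(1+e^{\beta z})$ as the smoothing, which is my assumption for the exact form of the parametrized loss mentioned in the History Bits:

```python
import math

def softplus_beta(z, beta):
    """Scaled logistic loss: (1/beta) * log(1 + exp(beta * z)).
    As beta -> infinity this converges to max(z, 0), the hinge-like term."""
    # numerically stable form: max(z, 0) + (1/beta) * log(1 + exp(-beta*|z|))
    return max(z, 0) + math.log1p(math.exp(-beta * abs(z))) / beta

def grad_softplus_beta(z, beta):
    """Derivative w.r.t. z: the sigmoid of beta * z."""
    return 1.0 / (1.0 + math.exp(-beta * z))

# The smoothed loss approaches max(z, 0) as beta grows...
for beta in [1, 10, 100]:
    print(beta, softplus_beta(-2.0, beta))  # -> 0 as beta -> infinity

# ...but, unlike the max, it has a non-zero derivative at z = 0,
# so (S)GD started at w = 0 does move:
print(grad_softplus_beta(0.0, beta=10))  # 0.5
```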

To be honest, I am not sure this is a satisfying solution. Moreover, the stochasticity is still there and it should be removed.

Now, I already proved a mistake bound for the Perceptron, without any particular interpretation attached to it. As a matter of fact, proofs do not need interpretations to be correct. I showed that the Perceptron competes with a *family of loss functions*, which implies that it does not just use the subgradient of a single function. However, if you need an *intuitive way* to think about it, let me present the idea of *pseudogradients*.

**3. Pseudogradients**

Suppose we want to minimize an $L$-smooth function $f$ and we would like to use something like gradient descent. However, we do not have access to its gradient. In this situation, (Polyak and Tsypkin, 1973) proposed to use a “pseudogradient”, that is *any* vector $\boldsymbol{p}_t$ that forms an angle of 90 degrees or less with the actual gradient in $\boldsymbol{x}_t$:

$\langle \nabla f(\boldsymbol{x}_t), \boldsymbol{p}_t \rangle \geq 0.$

In a very intuitive way, $\boldsymbol{p}_t$ gives me some information that should allow me to minimize $f$, at least in the limit. The algorithm then becomes a “pseudogradient descent” procedure that updates the current solution in the direction of the negative pseudogradient:

$\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \eta_t \boldsymbol{p}_t,$

where the $\eta_t$ are the step sizes or learning rates.
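To make this concrete, here is a tiny numerical sketch of pseudogradient descent; the objective and the pseudogradient choice are illustrative assumptions, not from the post. We minimize $f(\boldsymbol{x}) = \frac{1}{2}\|\boldsymbol{x}\|^2$, whose gradient is $\boldsymbol{x}$, using as pseudogradient the coordinate-wise sign of the gradient, which always has a non-negative inner product with it:

```python
import numpy as np

# Minimize f(x) = 0.5 * ||x||^2 (gradient: x) with a pseudogradient:
# p_t = sign(grad f(x_t)) satisfies <grad f(x_t), p_t> = ||x_t||_1 >= 0.
def f(x):
    return 0.5 * np.dot(x, x)

x = np.array([3.0, -2.0])
for t in range(1, 201):
    p = np.sign(x)          # pseudogradient: ignores the gradient's magnitude
    x = x - (1.0 / t) * p   # pseudogradient descent with eta_t = 1/t
print(f(x) < f(np.array([3.0, -2.0])))  # True: f decreased without grad f's magnitude
```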

Note that (Polyak and Tsypkin, 1973) define the pseudogradients as *stochastic* vectors that satisfy the above inequality in conditional expectation, and for a time-varying function. Indeed, there are a number of very interesting results in that paper. However, for simplicity of exposition I will only consider the deterministic case and only describe the application to the Perceptron.

Let’s see how this would work. Let’s assume that, at least for an initial number of rounds, $\langle \nabla f(\boldsymbol{x}_t), \boldsymbol{p}_t \rangle > 0$, which means that the angle between the pseudogradient and the gradient is acute. From the $L$-smoothness of $f$, we have that

$f(\boldsymbol{x}_{t+1}) \leq f(\boldsymbol{x}_t) - \eta_t \langle \nabla f(\boldsymbol{x}_t), \boldsymbol{p}_t \rangle + \frac{\eta_t^2 L}{2} \|\boldsymbol{p}_t\|^2.$

Now, if $\eta_t < \frac{2 \langle \nabla f(\boldsymbol{x}_t), \boldsymbol{p}_t \rangle}{L \|\boldsymbol{p}_t\|^2}$, we have that $f(\boldsymbol{x}_{t+1}) < f(\boldsymbol{x}_t)$, so we can guarantee that the value of $f$ decreases at each step. So, we are minimizing $f$ without using a gradient!

To get a rate of convergence, we should know something more about $\boldsymbol{p}_t$. For example, we could assume that $\langle \nabla f(\boldsymbol{x}_t), \boldsymbol{p}_t \rangle \geq 1$ and $\|\boldsymbol{p}_t\| \leq G$. Then, setting $\eta_t = \frac{1}{L G^2}$, we obtain

$f(\boldsymbol{x}_{t+1}) \leq f(\boldsymbol{x}_t) - \frac{1}{2 L G^2}.$

This is still not enough, because it is clear that $\langle \nabla f(\boldsymbol{x}_t), \boldsymbol{p}_t \rangle \geq 1$ cannot be true on all rounds, because $\nabla f(\boldsymbol{x}_t) = \boldsymbol{0}$ when we are in the minimizer. However, with enough assumptions, following this route you can even get a rate of convergence.

**4. Pseudogradients for the Perceptron**

How do we use this to explain the Perceptron? Suppose your set of samples is *linearly separable* with a margin of 1. This means that there exists a vector $\boldsymbol{u}$ such that

$y_t \langle \boldsymbol{u}, \boldsymbol{x}_t \rangle \geq 1, \ t=1, \dots, T. \qquad (2)$

Note that the value of the margin is arbitrary: we can change it just by rescaling $\boldsymbol{u}$.

Remark 1. An equivalent way to restate this condition is to constrain $\boldsymbol{u}$ to have unitary norm and require

$y_t \langle \boldsymbol{u}, \boldsymbol{x}_t \rangle \geq \gamma, \ t=1, \dots, T,$

where $\gamma$ is called the *maximum margin* of the set of samples. However, in the following I will not use the margin notation because it makes things a bit less clear from an optimization point of view.

We would like to construct an algorithm to find $\boldsymbol{u}$ (or any positive scaling of it) from the samples $(\boldsymbol{x}_t, y_t)$. So, we need an objective function. Here is the brilliant idea of Polyak and Tsypkin: in each iteration take an arbitrary sample $(\boldsymbol{x}_t, y_t)$ on which the current prediction is wrong, that is $y_t \langle \boldsymbol{w}_t, \boldsymbol{x}_t \rangle \leq 0$, and define $\boldsymbol{p}_t = -y_t \boldsymbol{x}_t$, that is exactly the negative update we use in the Perceptron. This turns out to be a pseudogradient for $f(\boldsymbol{w}) = \frac{1}{2}\|\boldsymbol{w} - \boldsymbol{u}\|^2$. Indeed,

$\langle \nabla f(\boldsymbol{w}_t), \boldsymbol{p}_t \rangle = \langle \boldsymbol{w}_t - \boldsymbol{u}, -y_t \boldsymbol{x}_t \rangle = -y_t \langle \boldsymbol{w}_t, \boldsymbol{x}_t \rangle + y_t \langle \boldsymbol{u}, \boldsymbol{x}_t \rangle \geq 1,$

where in the last inequality we used (2).

Let’s pause for a moment to look at what we did: we want to minimize $f(\boldsymbol{w}) = \frac{1}{2}\|\boldsymbol{w} - \boldsymbol{u}\|^2$, but its gradient is just impossible to calculate because it depends on $\boldsymbol{u}$, which we clearly do not know. However, *every time the Perceptron finds a sample on which its prediction is wrong*, we can construct a pseudogradient, without any knowledge of $\boldsymbol{u}$. It is even more surprising if you consider the fact that there is an infinite number of possible solutions $\boldsymbol{u}$, and hence functions $f$, yet the pseudogradient correlates positively with the gradient of any of them! Moreover, no stochasticity is necessary.

At this point we are basically done. In fact, observe that $f(\boldsymbol{w}) = \frac{1}{2}\|\boldsymbol{w} - \boldsymbol{u}\|^2$ is 1-smooth. So, every time $y_t \langle \boldsymbol{w}_t, \boldsymbol{x}_t \rangle \leq 0$, the analysis above tells us that

$f(\boldsymbol{w}_{t+1}) \leq f(\boldsymbol{w}_t) - \eta_t + \frac{\eta_t^2}{2}\|\boldsymbol{x}_t\|^2 \leq f(\boldsymbol{w}_t) - \eta_t + \frac{\eta_t^2 R^2}{2},$

where in the last inequality we have assumed $\|\boldsymbol{x}_t\| \leq R$.

Setting $\eta_t = \eta$, summing over time, and denoting by $M$ the number of updates we have over $T$ iterations, we obtain

$0 \leq f(\boldsymbol{w}_{T+1}) \leq f(\boldsymbol{w}_1) - \eta M + \frac{\eta^2 R^2 M}{2} = \frac{1}{2}\|\boldsymbol{u}\|^2 - \eta M + \frac{\eta^2 R^2 M}{2},$

where we used the fact that $\boldsymbol{w}_1 = \boldsymbol{0}$.

Now, there is the actual magic of the (parameter-free!) Perceptron update rule: as we explained here, the updates of the Perceptron are independent of $\eta$. That is, given an order in which the samples are presented to the algorithm, any fixed $\eta > 0$ makes the Perceptron update on the same samples, and it only changes the scale of $\boldsymbol{w}_t$. Hence, even if the Perceptron algorithm uses $\eta = 1$, we can consider an arbitrary $\eta$ decided post-hoc to minimize the upper bound. Hence, choosing $\eta = \frac{1}{R^2}$, we obtain

$0 \leq \frac{1}{2}\|\boldsymbol{u}\|^2 - \frac{M}{R^2} + \frac{M}{2 R^2},$

that is

$M \leq R^2 \|\boldsymbol{u}\|^2.$

Now, observing that the r.h.s. is independent of $T$, we proved that the maximum number of updates, or equivalently mistakes, of the Perceptron algorithm is bounded.

Are we done? Not yet! We can now improve the Perceptron algorithm by taking full advantage of the pseudogradient interpretation.

**5. An Improved Perceptron**

This is a little known idea to improve the Perceptron. It can be shown with the classic analysis as well, but it comes very naturally from the pseudogradient analysis.

Let’s start from

$f(\boldsymbol{w}_{t+1}) \leq f(\boldsymbol{w}_t) - \eta_t + \frac{\eta_t^2}{2}\|\boldsymbol{x}_t\|^2.$

Now consider only the rounds in which $y_t \langle \boldsymbol{w}_t, \boldsymbol{x}_t \rangle \leq 0$ and set $\eta_t = \frac{1}{\|\boldsymbol{x}_t\|^2}$, that is obtained by an optimization of the expression $-\eta_t + \frac{\eta_t^2}{2}\|\boldsymbol{x}_t\|^2$. So, we obtain

$f(\boldsymbol{w}_{t+1}) \leq f(\boldsymbol{w}_t) - \frac{1}{2 \|\boldsymbol{x}_t\|^2}. \qquad (3)$

This means that now the update rule becomes

$\boldsymbol{w}_{t+1} = \boldsymbol{w}_t + \frac{y_t \boldsymbol{x}_t}{\|\boldsymbol{x}_t\|^2}.$

Now, denoting by $\mathcal{M}$ the set of rounds in which we update and by $M = |\mathcal{M}|$ the number of updates, summing (3) over time we get

$\sum_{t \in \mathcal{M}} \frac{1}{\|\boldsymbol{x}_t\|^2} \leq \|\boldsymbol{u}\|^2.$

It is clear that this inequality implies the previous one, because each term of the sum is at least $\frac{1}{R^2}$. But we can even obtain a tighter bound. Using the inequality between harmonic, geometric, and arithmetic mean, we have

$M \leq \|\boldsymbol{u}\|^2 \frac{M}{\sum_{t \in \mathcal{M}} \frac{1}{\|\boldsymbol{x}_t\|^2}} \leq \|\boldsymbol{u}\|^2 \left(\prod_{t \in \mathcal{M}} \|\boldsymbol{x}_t\|^2\right)^{1/M} \leq \|\boldsymbol{u}\|^2 \frac{1}{M} \sum_{t \in \mathcal{M}} \|\boldsymbol{x}_t\|^2.$

In words, the original Perceptron bound depends on the maximum squared norm of the samples on which we updated. Instead, this bound depends on the geometric or arithmetic mean of the squared norms of the samples on which we updated, which is less than or equal to the maximum.
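Here is a minimal sketch of the improved algorithm; the per-mistake stepsize $\eta_t = 1/\|\boldsymbol{x}_t\|^2$ is my reading of the optimization above, so treat it as an assumption:

```python
import numpy as np

def improved_perceptron(X, y, max_passes=100):
    """Perceptron with per-mistake stepsize 1 / ||x_t||^2 (assumed from the
    derivation above): the update becomes w += y_t * x_t / ||x_t||^2."""
    w = np.zeros(X.shape[1])
    for _ in range(max_passes):
        mistakes = 0
        for xt, yt in zip(X, y):
            if yt * np.dot(w, xt) <= 0:  # mistake: normalized update
                w += yt * xt / np.dot(xt, xt)
                mistakes += 1
        if mistakes == 0:
            break
    return w

# Toy data with very different sample norms, where the mean-based bound helps
X = np.array([[10.0, 5.0], [0.2, -0.1], [-10.0, 3.0], [-0.2, 0.1]])
y = np.array([1, 1, -1, -1])
w = improved_perceptron(X, y)
print(all(yt * np.dot(w, xt) > 0 for xt, yt in zip(X, y)))  # True
```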

**6. Pseudogradients and Lyapunov Potential Functions**

Some people might have realized yet another way to look at this: $\frac{1}{2}\|\boldsymbol{w}_t - \boldsymbol{u}\|^2$ is the Lyapunov function typically used to analyze subgradient descent. In fact, the classic analysis of SGD considers the guaranteed decrement of this function at each step. The two things coincide, but I find that the pseudogradient idea adds a non-trivial amount of information, because it allows us to bypass the idea of using a subgradient of the loss function completely.

Moreover, the idea of pseudogradients is more general, because it applies to any smooth function, not only to the choice $f(\boldsymbol{w}) = \frac{1}{2}\|\boldsymbol{w} - \boldsymbol{u}\|^2$.

Overall, it is clear that all the good analyses of the Perceptron must have something in common. However, sometimes recasting a problem in a particular framework might have some advantages because it helps our intuition. In this view, I find the pseudogradient view particularly compelling because it aligns with my intuition of how an optimization algorithm is supposed to work.

**7. History Bits**

I already wrote about the Perceptron, so I will just add a few more relevant bits.

As I said, it seems that the family of Perceptron algorithms was intended to be something much more general than what we mean now. The particular class of Perceptrons we use nowadays was called the $\alpha$-system (Block, 1962). I hypothesize that the fact that the $\alpha$-system survived the test of time is exactly due to the simple convergence proofs in (Block, 1962) and (Novikoff, 1963). Both proofs are non-stochastic. For the sake of proper credit assignment, it seems that the convergence of the Perceptron was proved by many others before Block and Novikoff (see references in Novikoff, 1963). However, the proof in (Novikoff, 1963) seems to be the cleanest one. (Aizerman, Braverman, and Rozonoer, 1964) (essentially) describe for the first time the Kernel Perceptron and prove a finite mistake bound for it.

I got the idea of smoothing the Perceptron algorithm with a scaled logistic loss from a discussion on Twitter with Maxim Raginsky. He wrote that (Aizerman, Braverman, and Rozonoer, 1970) proposed some kind of smoothing in a Russian book for the objective function in (1), but I don’t have access to it, so I am not sure what the details are. I just thought of a very natural one.

The idea of pseudogradients and the application to the Perceptron algorithm is in (Polyak and Tsypkin, 1973). However, there the input/output samples are still stochastic and the finite bound is not explicitly calculated. As I have shown, stochasticity is not needed. It is important to remember that online convex optimization as a field would come much later, so there was no reason for these people to consider an arbitrary or even adversarial order of the samples.

The improved Perceptron mistake bound could be new (but please let me know if it isn’t!) and it is inspired by the idea in (Graepel, Herbrich, and Williamson, 2001) of normalizing the samples to show a tighter bound.

**Acknowledgements**

Given the insane amount of mistakes that Nicolò Campolongo usually finds in my posts, this time I asked him to proofread it. So, I thank Nicolò for finding an insane amount of mistakes in a draft of this post.

I have been working on online and stochastic optimization for a while, from a practical and empirical point of view. So, I was already in this field when Adam (Kingma and Ba, 2015) was proposed.

The paper was OK but not a breakthrough, even more so by today’s standards. Indeed, the theory was weak: a regret guarantee for an algorithm supposed to work on stochastic optimization of non-convex functions. The experiments were also weak: the exact same experiments would result in a surefire rejection these days. Later, people also discovered an error in the proof and the fact that the algorithm will not converge on certain one-dimensional stochastic convex functions. Despite all of this, these days Adam is considered the King of the optimization algorithms. Let me be clear: it is known that Adam will not always give you the best performance, yet most of the time people know that they can use it with its default parameters and get, if not the best performance, at least the second best performance on their particular deep learning problem. In other words, Adam is considered nowadays the *default optimizer* for deep learning. So, what is the secret behind Adam?

Over the years, people published a vast number of papers that tried to explain Adam and its performance, too many to list. From the “adaptive learning rate” (adaptive to what? Nobody exactly knows…) to the momentum, to the almost scale-invariance, every single aspect of its arcane recipe has been examined. Yet, none of these analyses gave us the final answer on its performance. It is clear that most of these ingredients are beneficial to the optimization process of *any* function, but it is still unclear why this exact combination, and not another one, makes it the best algorithm. The equilibrium in the mix is so delicate that even the small change required to fix the non-convergence issue was considered to give slightly worse performance than Adam.

The fame of Adam is also accompanied by strong sentiments: It is enough to read posts on r/MachineLearning on Reddit to see the passion that people put in defending their favorite optimizers against the other ones. It is the sort of fervor that you see in religion, in sports, and in politics.

However, how *likely* is all this? I mean, how likely is it that Adam is really the *best* optimization algorithm? How likely is it that we reached the apex of optimization for deep learning a few years ago, in a field that is so young? Could there be another explanation for its prodigious performance?

I have a hypothesis, but before explaining it we have to briefly talk about the applied deep learning community.

In a talk, Olivier Bousquet described the deep learning community as a giant genetic algorithm: researchers in this community are exploring the space of all variants of algorithms and architectures in a semi-random way. Things that consistently work in large experiments are kept, the ones not working are discarded. Note that this process seems to be independent of the acceptance and rejection of papers: the community is so big and active that good ideas in rejected papers are still saved and transformed into best practices in a few months, see for example (Loshchilov and Hutter, 2019). Analogously, ideas in published papers are reproduced by hundreds of people, who mercilessly trash the things that do not reproduce. This process has created a number of heuristics that consistently produce good results in experiments, and the stress here is on “consistently”. Indeed, despite being based on non-convex formulations, the performance of deep learning methods turns out to be extremely reliable. (Note that the deep learning community also has a large bias towards “famous” people, so not all ideas receive the same level of attention…)

So, what is the link between this giant genetic algorithm and Adam? Well, looking carefully at this evolutionary process in the deep learning community, I noticed a pattern: usually people try new architectures *keeping the optimization algorithm fixed*, and most of the time the algorithm of choice is Adam. This happens because, as explained above, Adam is the *default optimizer*.

So, here is my hypothesis: Adam was a very good optimization algorithm for the neural network architectures we had a few years ago, and **people kept evolving new architectures on which Adam works**. So, we might not see many architectures on which Adam does not work, simply because such ideas are discarded prematurely! Such ideas would require designing a new architecture and a new optimizer at the same time.

Now, I am sure many people won’t buy this hypothesis. I am sure they will list all sorts of specific problems on which Adam is not the best algorithm, on which Stochastic Gradient Descent with momentum is the best one, and so on and so forth. However, I would like to point out two things: 1) I don’t describe here a law of nature, but simply a tendency the community has that might have influenced the co-evolution of some architectures and optimizers; 2) I actually have some evidence to support this claim.

If my claims were true, we would expect Adam to be extremely good on deep neural networks and very poor on anything else. And this does happen! For example, Adam is known to perform very poorly on simple convex and non-convex problems that are not deep neural networks, see for example the following experiments from (Vaswani et al., 2019):

It seems that the moment we move away from the specific setting of deep neural networks, with their specific choice of initialization, specific scale of weights, specific loss function, etc., Adam loses its *adaptivity* and its magic default learning rate must be tuned again. Note that you can always write a linear predictor as a one-layer neural network, yet Adam does not work so well in this case either. So, **all the particular choices of architectures in deep learning might have evolved to make Adam work better and better, while the simple problems above do not have any of these nice properties that allow Adam to shine**.

Overall, Adam might be the best optimizer because the deep learning community might be exploring only a small region in the joint search space of architectures/optimizers. If true, that would be ironic for a community that departed from convex methods because they focused only on a narrow region of the possible machine learning algorithms, and it was like, as Yann LeCun wrote, “looking for your lost car keys under the street light knowing you lost them someplace else”.

EDIT: After the publication of this post, Sam Power pointed me to this tweet by Roger Grosse that seems to share a similar sentiment.

**1. SGD on Non-Convex Smooth Functions**

We are interested in minimizing a smooth non-convex function $f$ using stochastic gradient descent (SGD) with unbiased stochastic gradients. More in detail, we assume to have access to an oracle that returns, in any point $\boldsymbol{x}$, a vector $\boldsymbol{g}(\boldsymbol{x}, \xi)$ such that $\mathbb{E}_{\xi}[\boldsymbol{g}(\boldsymbol{x}, \xi)] = \nabla f(\boldsymbol{x})$, where $\xi$ is the realization of a mechanism for computing the stochastic gradient. For example, $\xi$ could be the random index of a training sample we use to calculate the gradient of the training loss, or just random noise that is added on top of our gradient computation. We will also assume that the variance of the stochastic gradient is bounded: $\mathbb{E}_{\xi}[\|\boldsymbol{g}(\boldsymbol{x}, \xi) - \nabla f(\boldsymbol{x})\|^2] \leq \sigma^2$, for all $\boldsymbol{x}$. Weaker assumptions on the variance are possible, but they don’t add much to the general message nor to the scheme of the proof.

Given that the function is non-convex, we clearly cannot hope to converge to the minimum of $f$, so we need a less ambitious goal. We assumed that the function is smooth. As you might remember from my previous posts, smooth functions are differentiable functions whose gradient is Lipschitz. Formally, we say that $f$ is $L$-smooth when $\|\nabla f(\boldsymbol{x}) - \nabla f(\boldsymbol{y})\| \leq L \|\boldsymbol{x} - \boldsymbol{y}\|$, for all $\boldsymbol{x}, \boldsymbol{y}$. This assumption assures us that when we approach a local minimum the gradient goes to zero. Hence, **decreasing the norm of the gradient will be our objective function for SGD.** Note that smoothness is necessary to study the norm of the gradients. In fact, consider the function $f(x) = |x|$, whose derivative does not go to zero when we approach the minimum; on the contrary, it is always different from 0 at any point different from the minimum.

The last thing we will assume is that the function is bounded from below. Remember that boundedness from below does not imply that the minimum of the function exists, e.g., $f(x) = e^{-x}$.

Hence, I start from a point $\boldsymbol{x}_1$ and the SGD update is

$\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \eta_t \boldsymbol{g}(\boldsymbol{x}_t, \xi_t),$

where the $\eta_t$ are deterministic learning rates or stepsizes.

First, let’s see practically how SGD behaves w.r.t. Gradient Descent (GD) on the same problem.

In Figure 1, we are minimizing a one-dimensional non-convex function, where the stochastic gradient in SGD is given by the gradient of the function corrupted by Gaussian noise with zero mean and standard deviation 1. On the other hand, there is no noise for GD. In both cases we use the same stepsizes, and we plot the absolute value of the derivative. We can see that GD will monotonically minimize the gradient till numerical precision, as expected, converging to one of the local minima. Note that with a constant learning rate, GD on this problem would converge even faster. Instead, SGD will jump back and forth, resulting in only *some* iterates having a small gradient. So, our basic question is the following:
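An experiment in the spirit of Figure 1 can be sketched as follows; the test function, stepsize, and horizon here are my own assumptions, not the ones used in the figure:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 1-d non-convex test function (an assumption, not the one in Figure 1)
f = lambda x: x**4 / 4 - x**2          # local minima at x = +/- sqrt(2)
df = lambda x: x**3 - 2 * x            # true derivative

x_gd, x_sgd = 3.0, 3.0
eta = 0.01
for t in range(2000):
    x_gd -= eta * df(x_gd)                         # exact gradient
    x_sgd -= eta * (df(x_sgd) + rng.normal(0, 1))  # gradient + N(0,1) noise

# GD drives |f'| to ~0 at a local minimum; SGD keeps jumping around it.
print(abs(df(x_gd)))   # essentially 0
print(abs(df(x_sgd)))  # small but noisy, not 0
```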

*Will $\nabla f(\boldsymbol{x}_t)$ converge to zero with probability 1 when $t$ goes to infinity in SGD?*

This is more difficult to answer than you might think. However, this is a basic question to know if it actually makes sense to run SGD for a bunch of iterations and return the last iterate, which is how 99% of people use SGD on a non-convex problem.

To warm up, let’s first see what we can prove in a finite-time setting.

As in all other similar analyses, we need to construct a potential (Lyapunov) function that allows us to analyze the algorithm. In the convex case, we would study $\|\boldsymbol{x}_t - \boldsymbol{x}^\star\|^2$, where $\boldsymbol{x}^\star$ is a minimizer of $f$. Here, this potential does not even make sense, because we are not even trying to converge to $\boldsymbol{x}^\star$. It turns out that a better choice is to study $f(\boldsymbol{x}_t)$. We will make use of the following property of $L$-smooth functions:

$f(\boldsymbol{y}) \leq f(\boldsymbol{x}) + \langle \nabla f(\boldsymbol{x}), \boldsymbol{y} - \boldsymbol{x} \rangle + \frac{L}{2}\|\boldsymbol{y} - \boldsymbol{x}\|^2, \ \forall \boldsymbol{x}, \boldsymbol{y}.$
In words, this means that a smooth function is always upper bounded by a quadratic function. Note that this property does not require convexity, so we can safely use it. Thanks to this property, let’s see how our potential evolves over time during the optimization of SGD.

Now, let’s denote by $\mathbb{E}_t$ the expectation w.r.t. $\xi_t$, conditioned on $\xi_1, \dots, \xi_{t-1}$. So, we have

$\mathbb{E}_t[f(\boldsymbol{x}_{t+1})] \leq f(\boldsymbol{x}_t) - \eta_t \left(1 - \frac{L \eta_t}{2}\right) \|\nabla f(\boldsymbol{x}_t)\|^2 + \frac{L \eta_t^2 \sigma^2}{2},$

where in the inequality we have used the fact that the variance of the stochastic gradient is bounded by $\sigma^2$. Taking the total expectation, summing over time, and reordering the terms, we have

$\sum_{t=1}^T \eta_t \left(1 - \frac{L \eta_t}{2}\right) \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|^2] \leq f(\boldsymbol{x}_1) - f^\star + \frac{L \sigma^2}{2} \sum_{t=1}^T \eta_t^2, \qquad (1)$

where $f^\star$ is the infimum of $f$.

Let’s see how useful this inequality is: consider a constant step size $\eta_t = \eta = \min\left(\frac{1}{L}, \frac{\alpha}{\sigma \sqrt{T}}\right)$, where $\alpha$ is the usual critical parameter of the learning rate (that you’ll never be able to tune properly unless you know things that you clearly don’t know…). With this choice, we have $1 - \frac{L \eta}{2} \geq \frac{1}{2}$. So, we have

$\frac{1}{T} \sum_{t=1}^T \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|^2] \leq \frac{2 (f(\boldsymbol{x}_1) - f^\star)}{\eta T} + L \sigma^2 \eta \leq \frac{2 L (f(\boldsymbol{x}_1) - f^\star)}{T} + \left(\frac{2 (f(\boldsymbol{x}_1) - f^\star)}{\alpha} + L \alpha\right) \frac{\sigma}{\sqrt{T}}.$

What we got is almost a convergence result: it says that the average of the expected norms of the gradients goes to zero as $T$ goes to infinity. Given that the average of a set of numbers is greater than or equal to its minimum, this means that there exists at least one iterate in my set of iterates that has a small expected gradient. This is interesting but slightly disappointing. We were supposed to prove that the gradient converges to zero, but instead we only proved that at least *one* of the iterates has indeed a small expected norm, but we don’t know which one. Also, trying to find the right iterate might be annoying, because we only have access to stochastic gradients.

It is also interesting to see that the convergence rate has two terms: a fast $O(1/T)$ rate and a slow $O(\sigma/\sqrt{T})$ rate. This means that we can expect the algorithm to make fast progress at the beginning of the optimization, and then to slowly converge once the number of iterations becomes big enough compared to the variance of the stochastic gradients. In case the noise on the gradients is zero, SGD becomes simply gradient descent and it will converge at the $O(1/T)$ rate. In the noiseless case, we can also show that the last iterate is the one with the smallest gradient. However, note that the learning rate has $\sigma$ in it, so effectively we can achieve a faster convergence in the noiseless case because we would be using a constant stepsize, independent of $T$.

**2. The Magic Trick: Randomly Stopped SGD**

The above reasoning is interesting but it is not a solution to our question: does the last iterate of SGD converge? Yes or no?

There is a possible work-around that looks like a magic trick. Let’s take one iterate of SGD uniformly at random among $\boldsymbol{x}_1, \dots, \boldsymbol{x}_T$ and call it $\boldsymbol{x}_{\text{rand}}$. Taking the expectation with respect to this randomization and the noise in the stochastic gradients, we have that

$\mathbb{E}[\|\nabla f(\boldsymbol{x}_{\text{rand}})\|^2] = \frac{1}{T} \sum_{t=1}^T \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|^2] = O\left(\frac{1}{T} + \frac{\sigma}{\sqrt{T}}\right).$

Basically, it says that if we run SGD for $T$ iterations, and then we stop and return not the last iterate but one of the iterates chosen at random, then, in expectation with respect to everything, the norm of the gradient will be small! Note that this is equivalent to running SGD with a random stopping time. In other words, given that we didn’t know how to prove whether SGD converges, we changed the algorithm by adding a random stopping time, and now the random iterate on which we stop will have, in expectation, the desired convergence rate.

This is a very important result and also a standard one these days. It should be intuitive why the randomization helps: from Figure 1 it is clear that we might be unlucky in the last iteration of SGD; however, randomizing smooths out the noise in expectation and gives a decreasing gradient. However, we just changed the target, because we still didn’t prove whether the last iterate converges. So, we need an alternative way.
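A sketch of the randomly stopped variant; the 1-d test problem and all its constants are illustrative assumptions:

```python
import numpy as np

def sgd_random_iterate(grad, x0, eta, T, sigma, rng):
    """Run SGD for T steps, then return one iterate chosen uniformly at
    random: this is equivalent to stopping at a random time."""
    xs = [x0]
    x = x0
    for _ in range(T):
        x = x - eta * (grad(x) + rng.normal(0, sigma))  # stochastic gradient step
        xs.append(x)
    return xs[rng.integers(0, len(xs))]  # uniform over all stored iterates

rng = np.random.default_rng(0)
grad = lambda x: x**3 - 2 * x  # derivative of an assumed 1-d test function
x_out = sgd_random_iterate(grad, x0=1.0, eta=0.01, T=2000, sigma=1.0, rng=rng)
print(abs(grad(x_out)))  # small in expectation over the randomization
```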

**3. The Disappointing Lim Inf**

Let’s consider again (1). This time let’s select any time-varying positive stepsizes that satisfy

$\sum_{t=1}^{\infty} \eta_t = \infty \quad \text{and} \quad \sum_{t=1}^{\infty} \eta_t^2 < \infty. \qquad (2)$

These two conditions are classic in the study of stochastic approximation. The first condition is needed to be able to travel arbitrarily far from the initial point, while the second one is needed to keep the variance of the noise under control. The classic learning rate $\eta_t \propto \frac{1}{\sqrt{t}}$ does not satisfy these assumptions, but something decaying a little bit faster, such as $\eta_t \propto \frac{1}{\sqrt{t} \ln t}$, will do.

With such a choice, we get

$\sum_{t=1}^{\infty} \eta_t \left(1 - \frac{L \eta_t}{2}\right) \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|^2] \leq f(\boldsymbol{x}_1) - f^\star + \frac{L \sigma^2}{2} \sum_{t=1}^{\infty} \eta_t^2 < \infty,$

where we have used the second condition in (2) in the inequality. Now, that same condition implies that $\eta_t$ converges to 0. So, there exists $T_0$ such that $\eta_t \leq \frac{1}{L}$ for all $t \geq T_0$. So, we get that

$\sum_{t=T_0}^{\infty} \eta_t \, \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|^2] < \infty.$

This implies that $\sum_{t=T_0}^{\infty} \eta_t \|\nabla f(\boldsymbol{x}_t)\|^2 < \infty$ with probability 1. We are almost done: from this last inequality and the condition $\sum_{t=1}^{\infty} \eta_t = \infty$, we can derive the fact that $\liminf_{t \to \infty} \|\nabla f(\boldsymbol{x}_t)\| = 0$.

**Wait, what? What is this $\liminf$???** Unfortunately, it seems that we proved something weaker than we wanted. In words, the lim inf result says that there exists a *subsequence* of the iterates whose gradients converge to zero.

This is very disappointing and we might be tempted to believe that this is the best that we can do. Fortunately, this is not the case. In fact, in a seminal paper (Bertsekas and Tsitsiklis, 2000) proved the convergence of the gradients of SGD to zero with probability 1 under very weak assumptions. Their proof is very convoluted also due to the assumptions they used, but in the following I’ll show a much simpler proof.

**4. The Asymptotic Proof in Few Lines**

In 2018, I found a way to obtain the same result of (Bertsekas and Tsitsiklis, 2000), distilling their long proof into the following Lemma, whose proof is in the Appendix. It turns out that this Lemma is essentially all that we need.

Lemma 1.Let be two non-negative sequences and a sequence of vectors in a vector space . Let and assume and . Assume also that there exists such that , where is such that . Then, converges to 0.

We are now finally ready to prove the asymptotic convergence with probability 1.

Theorem 2. Assume that we use SGD on an $L$-smooth function, with stepsizes that satisfy the conditions in (2). Then, $\|\nabla f(\boldsymbol{x}_t)\|$ goes to zero with probability 1.

*Proof:* We want to use Lemma 1 on . So, first observe that by the -smoothness of , we have

The assumptions and the reasoning above imply that, with probability 1, . This also suggests to set . Also, we have, with probability 1, , because for is a martingale whose variance is bounded by . Hence, for is a martingale in , so it converges in with probability 1.

Overall, with probability 1 the assumptions of Lemma 1 are verified with .

We did it! Finally, we proved that the gradients of SGD do indeed converge to zero with probability 1. This means that, with probability 1, for any $\epsilon > 0$ there exists $T_\epsilon$ such that $\|\nabla f(\boldsymbol{x}_t)\| \leq \epsilon$ for all $t \geq T_\epsilon$.

Even if I didn’t actually use any intuition in crafting the above proof (I rarely use “intuition” to prove things), Yann Ollivier provided the following intuition for it: the proof is implicitly studying how far apart GD and SGD are. However, instead of estimating the distance between the two processes over a single update, it does it over a large period of time, through the term that can be controlled thanks to the choice of the learning rates.

**5. History Bits**

The idea of taking one iterate at random in SGD was proposed in (Ghadimi and Lan, 2013), and it reminds me of the well-known online-to-batch conversion through randomization. The conditions on the learning rates in (2) go back to (Robbins and Monro, 1951). (Bertsekas and Tsitsiklis, 2000) contains a good review of previous work on the asymptotic convergence of SGD, while a recent paper on this topic is (Patel, V., 2020).

I derived Lemma 1 as an extension of Proposition 2 in (Alber et al., 1998)/Lemma A.5 in (Mairal, 2013). Studying the proof of (Bertsekas and Tsitsiklis, 2000), I realized that I could change (Alber et al., 1998, Proposition 2) into what I needed. I had this proof sitting in my unpublished notes for 2 years, so I decided to write a blog post on it.

My actual small contribution to this line of research is a lim inf convergence for SGD with AdaGrad stepsizes (Li and Orabona, 2019), but under stronger assumptions on the noise.

Note that 20-30 years ago there were many papers studying the asymptotic convergence of SGD and its variants in various settings. Then, the taste of the community changed, moving from asymptotic convergence to finite-time rates. As it often happens when a new trend takes over the previous one, new generations tend to be oblivious to the old results and proof techniques. The common motivation to ignore these past results is that the finite-time analysis is superior to the asymptotic one, but this is clearly false (ask a statistician!). It should instead be clear to anyone that both analyses have pros and cons.

**6. Acknowledgements**

I thank Léon Bottou for telling me about the problem of analyzing the asymptotic convergence of SGD in the non-convex case with a simple and general proof in 2018. Léon also helped me check my proofs and found an error in a previous version. Also, I thank Yann Ollivier for reading my proof and kindly providing an alternative proof and the intuition that I report above.

**7. Appendix**

*Proof of Lemma 1:* Since the series diverges, given that converges, we necessarily have . Hence, we have to prove that .

Let us proceed by contradiction and assume that . First, assume that .

Given the values of the and , we can then build two sequences of indices and such that

- ,
- , for ,
- , for .

Define . The convergence of the series implies that the sequence of partial sums is a Cauchy sequence. Hence, there exists an index large enough such that for all we have and are less than or equal to . Then, we have for all and all with ,

Therefore, using the triangle inequality, . Finally, for all , which contradicts . Therefore, goes to zero.

To rule out the case that , proceed in the same way, choosing any . Hence, we get that for , that contradicts .

Don’t get me wrong: assuming bounded domains is perfectly fine and justified most of the time. However, sometimes it is unnecessary and it might also obscure critical issues in the analysis, as in this case. So, to balance the universe of first-order methods, I decided to show how to easily prove the convergence of the iterates of SGD, even in unbounded domains.

Technically speaking, the following result might be new, but definitely not worth a fight with Reviewer 2 to publish it somewhere.

**1. Setting **

First, let’s define our setting. We want to solve the following optimization problem

where and is a convex function. Now, various assumptions are possible on and choosing the right one depends on *your* particular problem; there are no right answers. Here, we will not make any strong assumption on . Also, we will *not* assume to be bounded. Indeed, in most of the modern applications in Machine Learning, is simply the entire space . We will also assume that is not empty and that is any element in it.

We also assume to have access to a *first-order stochastic oracle* that returns stochastic sub-gradients of on any point . In formulas, we get such that . Practically speaking, every time you calculate the (sub)gradient on a minibatch of training data, that is a stochastic (sub)gradient and roughly speaking the random minibatch is the random variable .

Here, for didactic reasons, we will assume that is bounded by 1; similar results can also be shown with more realistic assumptions. This holds, for example, if is an average of 1-Lipschitz functions and you draw some of them to calculate the stochastic subgradient.

The algorithm we want to focus on is SGD. So, what is SGD? SGD is an incredibly simple optimization algorithm, almost primitive. Indeed, part of its fame depends critically on its simplicity. Basically, you start from a certain and you update your solution iteratively, moving in the direction of the negative stochastic subgradients, multiplied by a *learning rate* . We also use a projection onto . Of course, if , no projection is needed. So, the update of SGD is

where and is the projection onto . Remember that when you use subgradients, SGD is not a descent algorithm: I already blogged about the fact that the common intuition of moving towards a descent direction is wrong when you use subgradients.
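To make the update concrete, here is a minimal sketch in Python/NumPy. Everything in it is my own illustration, not part of the post: the projection onto an L2 ball stands in for a generic projection onto V, and the noisy subgradient of f(w) = |w| is a toy 1-Lipschitz oracle.

```python
import numpy as np

def project_l2_ball(w, radius):
    # Euclidean projection onto the L2 ball of the given radius (identity inside it).
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def sgd(stochastic_subgrad, w0, etas, project=None):
    # Plain SGD: move against the stochastic subgradient, then (optionally) project onto V.
    w = np.asarray(w0, dtype=float)
    iterates = [w.copy()]
    for eta in etas:
        g = stochastic_subgrad(w)
        w = w - eta * g
        if project is not None:
            w = project(w)
        iterates.append(w.copy())
    return np.array(iterates)

# Toy oracle: noisy subgradient of f(w) = |w| (1-Lipschitz, minimized at 0).
rng = np.random.default_rng(0)
noisy_subgrad = lambda w: np.sign(w) + 0.1 * rng.standard_normal(w.shape)

etas = [0.5 / np.sqrt(t + 1) for t in range(2000)]
iterates = sgd(noisy_subgrad, np.array([3.0]), etas,
               project=lambda w: project_l2_ball(w, 10.0))
avg_iterate = iterates.mean(axis=0)
```

On this toy problem, the average iterate ends up close to the minimizer at zero, as the theory below predicts.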

**2. Convergence of the Average of the Iterates **

Now, the most common analysis of SGD can be done in two different ways: constant learning rate and non-increasing learning rate. We already saw both of them in my lecture notes on online learning, so let’s summarize here the one-step inequality for SGD we need:

for all measurable w.r.t. .

If you plan to use iterations, you can use a learning rate , and summing (1) we get

where we set . This is not a convergence result yet, because it just says that *on average* we are converging. To extract a single solution, we can use Jensen’s inequality and obtain

where . In words, we show a convergence guarantee for *the average of the iterates of SGD*, not for the last one.

Constant learning rates are a bit annoying because they depend on how many iterations you plan to do, theoretically and empirically. So, let’s now take a look at non-increasing learning rates, . In this case, the correct way to analyze SGD without the boundedness assumption is to sum (1) *without dividing by *, to have

where we set . From this one, we have two alternatives. First, we can observe that

because is a minimizer and the learning rate is non-increasing. So, using again Jensen’s inequality, we get

Note that if you like these sorts of games, you can even change the learning rate to shave a factor, but it is probably useless from an applied point of view.

Another possibility is to use a weighted average:

where and we used . Note that this option does not seem to give any advantage over the unweighted average above. Also, it weights the first iterations more than the last ones, which in most cases is a bad idea: the first iterations tend to be farther away from the optimum than the last ones.
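A small numerical comparison of the three candidates (last iterate, plain average, eta-weighted average) makes this tangible. The setup is my own toy problem, SGD on f(w) = |w| with noisy subgradients on an unbounded domain:

```python
import numpy as np

# SGD on f(w) = |w| with noisy subgradients, unbounded domain, eta_t = 1/sqrt(t).
rng = np.random.default_rng(1)
T = 4000
etas = 1.0 / np.sqrt(np.arange(1, T + 1))
w, iterates = 5.0, []
for t in range(T):
    g = np.sign(w) + 0.2 * rng.standard_normal()  # noisy subgradient of |w|
    w -= etas[t] * g                              # no projection: V is the whole line
    iterates.append(w)
iterates = np.array(iterates)

last = abs(iterates[-1])                                 # last iterate
plain_avg = abs(iterates.mean())                         # unweighted average
weighted_avg = abs(np.average(iterates, weights=etas))   # eta_t-weighted average
```

All three estimators end up near the minimizer, but the weighted one gives extra mass to the early, far-away iterates, consistent with the remark above.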

Let’s summarize what we have till now:

- Unbounded domains are fine with both constant and time-varying learning rates.
- The optimal learning rate depends on the distance between the optimal solution and the initial iterate, because the optimal setting of is proportional to .
- The weighted average is probably a bad idea and not strictly necessary.
- It seems we can only guarantee convergence for (weighted) averages of iterates.

The last point is a bit concerning: most of the time we take the last iterate of SGD, so why do we do it if the theory applies to the average?

**3. Convergence of the Last Iterate **

Actually, we do know that

- the last solution of SGD converges in unbounded domains with constant learning rate (Zhang, T., 2004).
- the last iterate of SGD converges in bounded domains with non-increasing learning rates (Shamir, O. and Zhang, T., 2013).

So, what about unbounded domains and non-increasing learning rates, i.e., 90% of the uses of SGD? It turns out that it is equally simple and I think the proof is also instructive! As surprising as it might sound, not dividing (1) by the learning rate is the key ingredient we need. The proof plan is the following: we want to prove that the value of on the last iterate is not too far from the value of on . To prove it, we need the following technical lemma on sequences of non-negative numbers multiplied by non-increasing learning rates, whose proof is in the Appendix. This Lemma relates the last element of a sequence of numbers to their average.

Lemma 1. Let be a non-increasing sequence of positive numbers and . Then

With the above Lemma, we can prove the following guarantee for the convergence of the last iterate of SGD.

Theorem 2. Assume the stepsizes are deterministic and non-increasing. Then

*Proof:* We use Lemma 1, with , to have

Now, we bound the sum on the r.h.s. of the last inequality. Summing (1) from to , we have the following inequality that holds for any :

Hence, setting , we have

Putting all together, we have the stated bound.

There are a couple of nice tricks in the proof that might be interesting to study carefully. First, we use the fact that the one-step inequality in (1) holds for any . Most of the time, we state it with equal to , but it turns out that the more general statement is actually important! In fact, it is possible to know how far the performance of the last iterate is from the performance of the average, because the incremental nature of SGD makes it possible to know exactly how far is from any previous iterate , with . Please note that all of this would be hidden in the case of bounded domains, where all the distances are bounded by the diameter of the set, and you don’t get the dependency on .

Now we have all the ingredients and we only have to substitute a particular choice of the learning rate.

*Proof:* First, observe that

Now, considering the last term in (3), we have

Using (2) and dividing by , we have the stated bound.

Note that the above proof works similarly if .

**4. History Bits **

The first finite-time convergence proof for the last iterate of SGD is from (Zhang, T., 2004), where he considered the constant learning rate case. It was later extended in (Shamir, O. and Zhang, T., 2013) to time-varying learning rates, but only for bounded domains. The convergence rate for the weighted average in unbounded domains is from (Zhang, T., 2004). The observation that the weighted average is not needed and the plain average works equally well for non-increasing learning rates is from (X. Li and F. Orabona, 2019), where we needed it for the particular case of AdaGrad learning rates. The idea of analyzing SGD without dividing by the learning rate is by (Zhang, T., 2004). Lemma 1 is new but actually hidden in the convergence proof of the last iterate of SGD with linear predictors and square losses in (Lin, J. and Rosasco, L. and Zhou, D.-X., 2016), which in turn is based on the one in (Shamir, O. and Zhang, T., 2013). As far as I know, Corollary 3 is new, but please let me know if you happen to know a reference for it! It is possible to remove the logarithmic term in the bound using a different learning rate, but the proof is only for bounded domains (Jain, P. and Nagaraj, D. and Netrapalli, P., 2019).

**5. Exercises **

Exercise 1.Generalize the above proofs to the Stochastic Mirror Descent case.

Exercise 2. Remove the assumption of expected bounded stochastic subgradients and instead assume that is -smooth, i.e., has -Lipschitz gradient, and that the variance of the noise is bounded. Hint: take a look at the proofs in (Zhang, T., 2004) and (X. Li and F. Orabona, 2019).

**6. Appendix **

*Proof of Lemma 1:* Define , so we have

that implies

Now, from the definition of and the above inequality, we have

that implies

Unrolling the inequality, we have

Using the definition of and the fact that , we have the stated bound.

In this post, I explain a variation of the EG/Hedge algorithm, called *AdaHedge*. The basic idea is to design an algorithm that is adaptive to the sum of the squared norm of the losses, without any prior information on the range of the losses.

First, consider the case in which we use as constant regularizer the negative entropy , where will be determined in the following and is the simplex in . Using FTRL with linear losses with this regularizer, we immediately obtain

where we upper bounded the negative entropy of with 0. Using the strong convexity of the regularizer w.r.t. the norm and Lemma 4 here, we would further upper bound this as

This suggests that the optimal should be . However, as we have seen for L* bounds, this choice of any parameter of the algorithm is never feasible. Hence, exactly as we did for L* bounds, we might think of using an online version of this choice

where is a constant that will be determined later. An important property of this choice is that it gives rise to an algorithm that is scale-free, that is, its predictions are invariant to the scaling of the losses by any constant factor. This is easy to see because

Note that this choice makes the regularizer non-decreasing over time and immediately gives us

At this point, we might be tempted to use Lemma 1 from the L* post to upper bound the sum in the upper bound, but unfortunately we cannot! Indeed, the denominator does not contain the term . We might add a constant to , but that would destroy the scale-freeness of the algorithm. However, it turns out that we can still prove our bound without any change to the regularizer. The key observation is that we can bound the term in two different ways. The first way is the one above, while the other one is

where we used the definition of and the fact that the regularizer is non-decreasing over time. So, we can now write

where we used the fact that the minimum between two numbers is less than their harmonic mean. Assuming and using Lemma 1 here, we have

The bound and the assumption on suggest setting . To summarize, we obtained a scale-free algorithm with regret bound .

We might consider ourselves happy, but there is a clear problem in the above algorithm: the choice of in the time-varying regularizer strictly depends on our upper bound. So, a loose bound will result in a poor choice of the regularization! In general, every time we use a part of the proof in the design of an algorithm we cannot expect an exciting empirical performance, unless our upper bound was really tight. So, can we design a better regularizer? Well, we need a better upper bound!

Let’s consider a generic regularizer and its corresponding FTRL with linear losses regret upper bound

where we assume to be non-decreasing in time.

Now, observe that the sum is unlikely to disappear for this kind of algorithms, so we could try to make the term of the same order of the sum. So, we would like to set of the same order of . However, this approach would cause an annoying recurrence. So, using the fact that is non-decreasing, let’s upper bound the terms in the sum just a little bit:

Now, we can set for , , and . This immediately implies that

Setting to be equal to the negative entropy, we get an algorithm known as AdaHedge. It is easy to see that this choice makes the algorithm scale-free as well.

With this choice of the regularizer, we can simplify a bit the expression of . For , we have . Instead, for , using the properties of the Fenchel conjugates, we have that

Overall, we get the pseudo-code of AdaHedge in Algorithm 1.
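Since the recipe is easier to parse with code in hand, here is a minimal sketch of AdaHedge in Python, in the losses convention used in this post. The function name and the (T, N) loss-matrix interface are my own; the logic follows the standard AdaHedge recipe, with the learning rate set to ln(N) divided by the running sum of mixability gaps.

```python
import numpy as np

def adahedge(losses):
    # Sketch of AdaHedge on a (T, N) array of per-round expert losses.
    # Returns the algorithm's total loss and its regret vs. the best expert.
    T, N = losses.shape
    L = np.zeros(N)     # cumulative expert losses
    Delta = 0.0         # cumulative mixability gap; eta_t = ln(N) / Delta
    total = 0.0
    for ell in losses:
        if Delta > 0.0:
            eta = np.log(N) / Delta
            v = np.exp(-eta * (L - L.min()))        # shifted for numerical stability
            w = v / v.sum()
            lo = ell.min()
            z = w @ np.exp(-eta * (ell - lo))       # > 0 for moderate eta
            mix = lo - np.log(z) / eta              # mix loss of this round
        else:
            w = (L == L.min()).astype(float)        # eta = inf: follow the leader
            w /= w.sum()
            mix = ell[w > 0].min()
        h = w @ ell                   # Hedge's (expected) loss this round
        Delta += h - mix              # the mixability gap is non-negative
        total += h
        L += ell
    return total, total - L.min()
```

Note that multiplying all losses by a constant scales Delta, and hence 1/eta, by the same constant, so the weights (and the algorithm's choices) are unchanged: the sketch is scale-free, as claimed above.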

So, now we need an upper bound for . Observe that . Moreover, as we have done before, we can upper bound in two different ways. In fact, from Lemma 4 here, we have for . Also, denoting by , we have

Hence, we have

We can solve this recurrence using the following Lemma, where and .

Lemma 1.Let be any sequence of non-negative real numbers. Suppose that is a sequence of non-negative real numbers satisfying

Then, for any , .

*Proof:* Observe that

We bound each term in the sum separately. The left term of the minimum inequality in the definition of gives , while the right term gives . So, we conclude .

So, overall we got

and setting , we have

Note that this is roughly the same regret in (2), but the very important difference is that this new regret bound depends on the *much tighter quantity* , that we upper bounded with , but in general will be much smaller than that. For example, can be upper bounded using the tighter local norms, see the analysis of Exp3. Instead, in the first solution, the regret will always be dominated by the term because we explicitly use it in the regularizer!

There is an important lesson to be learned from AdaHedge: the regret is not the full story and algorithms with the same worst-case guarantee can exhibit vastly different empirical behaviors. Unfortunately, this message is rarely heard and there is a part of the community that focuses too much on the worst-case guarantee rather than on the empirical performance. Even worse, sometimes people favor algorithms with a “more elegant analysis” completely ignoring the likely worse empirical performance.

**1. History Bits **

The use of FTRL with the regularizer in (1) was proposed in (Orabona and Pál, 2015); I presented a simpler version of their proof that does not require Fenchel conjugates. The AdaHedge algorithm was introduced in (van Erven et al., 2011) and refined in (de Rooij et al., 2014). The analysis reported here is from (Orabona and Pál, 2015), which generalized AdaHedge to arbitrary regularizers in AdaFTRL. Additional properties of AdaHedge for the stochastic case were proven in (van Erven et al., 2011).

**2. Exercises **


Exercise 1.Implement AdaHedge and compare its empirical performance to FTRL with the time-varying regularizer in (1).

* You can find the other lectures here.*

In this lecture, we will explore the link between Online Learning and Statistical Learning Theory.

**1. Agnostic PAC Learning **

We now consider a different setting from what we have seen till now. We will assume that we have a prediction strategy parametrized by a vector and we want to learn the relationship between an input and its associated label . Moreover, we will assume that is drawn from a joint probability distribution . Also, we are equipped with a loss function that measures how good our prediction is compared to the true label , that is . So, learning the relationship can be cast as minimizing the expected loss of our predictor

In machine learning terms, the object above is nothing else than the *test error* of our predictor.

Note that the above setting assumes labeled samples, but we can generalize it even more by considering *Vapnik’s general setting of learning*, where we collapse the prediction function and the loss into a unique function. This allows us, for example, to treat supervised and unsupervised learning in the same unified way. So, we want to minimize the *risk*

where is an unknown distribution over and is measurable w.r.t. the second argument. Also, the set of all predictors that can be expressed by vectors in is called the *hypothesis class*.

Example 1.In a linear regression task where the loss is the square loss, we have and . Hence, .

Example 2.In linear binary classification where the loss is the hinge loss, we have and . Hence, .

Example 3.In binary classification with a neural network with the logistic loss, we have and is the network corresponding to the weights . Hence, .
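To make the three examples concrete, here are the corresponding loss evaluations in Python (the function names and argument conventions are mine):

```python
import numpy as np

def square_loss(w, x, y):
    # Example 1: linear regression with the square loss.
    return (np.dot(w, x) - y) ** 2

def hinge_loss(w, x, y):
    # Example 2: linear binary classification with the hinge loss, y in {-1, +1}.
    return max(0.0, 1.0 - y * np.dot(w, x))

def logistic_loss(score, y):
    # Example 3: any real-valued predictor output (e.g., a network), y in {-1, +1}.
    return np.log1p(np.exp(-y * score))
```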

The key difficulty of the above problem is that we don’t know the distribution . Hence, there is no hope to exactly solve this problem. Instead, we are interested in understanding *what is the best we can do if we have access to samples drawn i.i.d. from *. In more detail, we want to upper bound the *excess risk*

where is a predictor that was *learned* using samples.

It should be clear that this is just an optimization problem and we are interested in upper bounding the suboptimality gap. In this view, the objective of machine learning can be considered as a particular optimization problem.

Remark 1.Note that this is not the only way to approach the problem of learning. Indeed, the regret minimization model is an alternative model to learning. Moreover, another approach would be to try to estimate the distribution and then solve the risk minimization problem, the approach usually taken in Statistics. No approach is superior to the other and each of them has its pros and cons.

Given that we have access to the distribution through samples drawn from it, any procedure we might think to use to minimize the risk will be stochastic in nature. This means that we cannot assure a deterministic guarantee. Instead, *we can try to prove that with high probability our minimization procedure will return a solution that is close to the minimizer of the risk*. It is also intuitive that the precision and probability we can guarantee must depend on how many samples we draw from .

Quantifying the dependency of precision and probability of failure on the number of samples used is the objective of the **Agnostic Probably Approximately Correct** (PAC) framework, where the keyword “agnostic” refers to the fact that we don’t assume anything on the best possible predictor. In more detail, given a precision parameter and a probability of failure , we are interested in characterizing the *sample complexity of the hypothesis class *, that is defined as the number of samples necessary to guarantee with probability at least that the best learning algorithm using the hypothesis class outputs a solution that has an excess risk upper bounded by . Note that the sample complexity does not depend on , so it is a worst-case measure w.r.t. all the possible distributions. This makes sense if you think that we know nothing about the distribution , so if your guarantee holds for the worst distribution it will also hold for any other distribution. Mathematically, we will say that the hypothesis class is agnostic PAC-learnable if such a sample complexity function exists.

Definition 1. We will say that a function class is *Agnostic-PAC-learnable* if there exists an algorithm and a function such that, when is used with samples drawn from , with probability at least the solution returned by the algorithm has excess risk at most .

Note that the Agnostic PAC learning setting does not say what procedure we should follow to achieve such a sample complexity. The approach most commonly used in machine learning to solve the learning problem is the so-called *Empirical Risk Minimization (ERM) problem*. It consists of drawing samples i.i.d. from and minimizing the *empirical risk*:

In words, ERM is nothing else than minimizing the error on a training set. However, in many interesting cases can be very far from the true optimum , even with an infinite number of samples! So, we need to modify the ERM formulation in some way, e.g., using a *regularization* term or a Bayesian prior of , or find conditions under which ERM works.
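As a concrete (and well-behaved) instance of ERM, consider the square loss over linear predictors, where the empirical risk minimizer has a closed form. The synthetic data below is my own toy setup, not from the post:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
w_star = rng.standard_normal(d)                 # hypothetical "true" predictor
X = rng.standard_normal((n, d))                 # training inputs
y = X @ w_star + 0.1 * rng.standard_normal(n)   # noisy labels

# ERM with the square loss over linear predictors = ordinary least squares.
w_erm = np.linalg.lstsq(X, y, rcond=None)[0]

train_risk = np.mean((X @ w_erm - y) ** 2)      # empirical risk of the ERM solution
```

In this benign case ERM works: the estimate lands close to the true predictor. The point of the paragraph above is that this is not guaranteed in general.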

The ERM approach is so widespread that machine learning itself is often wrongly identified with some kind of minimization of the training error. We now show that ERM is not the entire world of ML, proving that *the existence of a no-regret algorithm, that is an online learning algorithm with sublinear regret, guarantees Agnostic-PAC learnability*. In more detail, we will show that an online algorithm with sublinear regret can be used to solve machine learning problems. This is not just a curiosity; for example, this gives rise to computationally efficient parameter-free algorithms, something that can be achieved through ERM only by running a two-step procedure, i.e., running ERM with different parameters and selecting the best solution among them.

We already mentioned this possibility when we talked about the online-to-batch conversion, but this time we will strengthen it by proving high probability guarantees rather than in-expectation ones.

So, we need some more bits on concentration inequalities.

**2. Bits on Concentration Inequalities **

We will use a concentration inequality to prove the high probability guarantee, but we will need to go beyond sums of i.i.d. random variables. In particular, we will use the concept of *martingales*.

Definition 2. A sequence of random variables is called a *martingale* if for all it satisfies:

Example 4.Consider a fair coin and a betting algorithm that bets money on each round on the side of the coin equal to . We win or lose money 1:1, so the total money we won up to round is . is a martingale. Indeed, we have

For bounded martingales we can prove high probability guarantees as for bounded i.i.d. random variables. The following Theorem will be the key result we will need.

Theorem 3 (Hoeffding-Azuma inequality).Let be a martingale of random variables that satisfy almost surely. Then, we have

Also, the same upper bounds hold on .
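A quick simulation (my own, using the fair-coin martingale of Example 4 with unit bets, so the increments are bounded by 1) illustrates the theorem: the observed deviation probability stays below the Hoeffding-Azuma bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n, runs = 400, 5000
# Fair-coin martingale with unit bets: Z_n is a sum of n independent +/-1
# increments, so |Z_t - Z_{t-1}| <= 1 almost surely.
Z = rng.choice([-1.0, 1.0], size=(runs, n)).sum(axis=1)

eps = 2.0 * np.sqrt(n)                 # a deviation of two standard deviations
empirical = np.mean(Z >= eps)          # observed tail probability
azuma = np.exp(-eps ** 2 / (2 * n))    # Hoeffding-Azuma upper bound: exp(-2)
```

The bound is loose for this Gaussian-like tail (the empirical frequency is roughly an order of magnitude smaller), but it holds without any independence between the increments, which is exactly what we need below.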

**3. From Regret to Agnostic PAC **

We now show how the online-to-batch conversion we introduced before gives us high probability guarantee for our machine learning problem.

Theorem 4.Let , where the expectation is w.r.t. drawn from with support over some vector space and . Draw samples i.i.d. from and construct the sequence of losses . Run any online learning algorithm over the losses , to construct the sequence of predictions . Then, we have with probability at least , it holds that

*Proof:* Define . We claim that is a martingale. In fact, we have

where we used the fact that depends only on . Hence, we have

that proves our claim.

Hence, using Theorem 3, we have

This implies that, with probability at least , we have

or equivalently

We now use the definition of regret w.r.t. any , to have

The last step is to upper bound with high probability with . This is easier than the previous upper bound because is a fixed vector, so are i.i.d. random variables, so for sure forms a martingale. So, reasoning as above, we have that with probability at least it holds that

Putting all together and using the union bound, we have the stated bound.

The theorem above upper bounds the average risk of the predictors, while we are interested in producing a single predictor. If the risk is a convex function and is convex, then we can lower bound the l.h.s. of the inequalities in the theorem with the risk evaluated on the average of the . That is

If the risk is not a convex function, we need a way to generate a single solution with small risk. One possibility is to construct a *stochastic classifier* that samples one of the with uniform probability and predicts with it. For this classifier, we immediately have

where the expectation in the definition of the risk of the stochastic classifier is also with respect to the random index. Yet another way is to select, among the predictors, the one with the smallest risk. This works because the average is lower bounded by the minimum. This is easily achieved using samples for the online learning procedure and samples to generate a validation set to evaluate the solutions and pick the best one. The following Theorem shows that selecting the predictor with the smallest empirical risk on a validation set will give us a predictor close to the best one with high probability.

Theorem 5.We have a finite set of predictors and a dataset of samples drawn i.i.d. from . Denote by . Then, with probability at least , we have

*Proof:* We want to calculate the probability that the hypothesis that minimizes the validation error is far from the best hypothesis in the set. We cannot do it directly because we don’t have the required independence to use a concentration inequality. Instead, *we will upper bound the probability that there exists at least one function whose empirical risk is far from the risk.* So, we have

Hence, with probability at least , we have that for all

We are now able to upper bound the risk of , just using the fact that the above applies to too. So, we have

where in the last inequality we used the fact that minimizes the empirical risk.

Using this theorem, we can use samples for the training and samples for the validation. Denoting by the predictor with the best empirical risk on the validation set among the generated during the online procedure, we have with probability at least that

It is important to note that with any of the above three methods to select one among the generated by the online learning procedure, the sample complexity guarantee we get matches the one we would have obtained by ERM, up to polylogarithmic factors. In other words, there is nothing special about ERM compared to the online learning approach to statistical learning.
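As a small illustration of the validation-based selection, here is a sketch in Python. The data, the square loss, and the candidate predictors (mimicking iterates of an online run, plus one deliberately bad candidate) are my own toy choices:

```python
import numpy as np

def pick_best(predictors, X_val, y_val, loss):
    # Empirical risk of each candidate on the validation set; return the minimizer.
    risks = [np.mean(loss(w, X_val, y_val)) for w in predictors]
    best = int(np.argmin(risks))
    return predictors[best], risks[best]

rng = np.random.default_rng(0)
w_star = np.array([1.0, -2.0])
X_val = rng.standard_normal((500, 2))
y_val = X_val @ w_star + 0.1 * rng.standard_normal(500)

# Candidates: small perturbations of w_star, plus the zero vector as a bad one.
candidates = [w_star + 0.05 * rng.standard_normal(2) for _ in range(5)]
candidates.append(np.zeros(2))

sq = lambda w, X, y: (X @ w - y) ** 2
w_best, risk_best = pick_best(candidates, X_val, y_val, sq)
```

As expected, the selection discards the bad candidate and returns one of the near-optimal ones.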

Another important point is that the above guarantee does not imply the existence of online learning algorithms with sublinear regret for any learning problem. It just says that, if it exists, it can be used in the statistical setting too.

**4. History Bits **

Theorem 4 is from (Cesa-Bianchi, N. and Conconi, A. and Gentile, C., 2004). Theorem 5 is nothing else than the Agnostic PAC learning guarantee of ERM for hypothesis classes with finite cardinality. (Cesa-Bianchi, N. and Conconi, A. and Gentile, C., 2004) also gives an alternative procedure to select a single hypothesis among the ones generated during the online procedure that does not require splitting the data into training and validation sets. However, the obtained guarantee matches the one we have proved.

* You can find all the lectures I published here.*

In the last lecture, we introduced the Explore-Then-Commit (ETC) algorithm that solves the stochastic bandit problem, but requires the knowledge of the *gaps*. This time we will introduce a parameter-free strategy that achieves the same optimal regret guarantee.

**1. Upper Confidence Bound Algorithm **

The ETC algorithm has the disadvantage of requiring the knowledge of the gaps to tune the exploration phase. Moreover, it solves the exploration vs. exploitation trade-off in a clunky way. It would be better to have an algorithm that smoothly transitions from one phase into the other *in a data-dependent way*. So, we now describe an optimal and adaptive strategy called the Upper Confidence Bound (UCB) algorithm. It employs the principle of *optimism in the face of uncertainty* to select in each round the arm that has the *potential to be the best one*.

UCB works by keeping an estimate of the expected loss of each arm and also a confidence interval at a certain probability level. Roughly speaking, we have that with probability at least

where the “roughly” comes from the fact that is a random variable itself. Then, UCB will query the arm with the smallest lower bound, that is the one that could potentially have the smallest expected loss.

Remark 1.The name Upper Confidence Bound comes from the fact that traditionally stochastic bandits are defined over rewards, rather than losses. So, in our case we actually use the lower confidence bound in the algorithm. However, to avoid confusion with the literature, we still call it Upper Confidence Bound algorithm.

The key points in the proof are on how to choose the right confidence level and how to get around the dependency issues.

The algorithm is summarized in Algorithm 1 and we can prove the following regret bound.

Theorem 1. Assume that the rewards of the arms are -subgaussian and let . Then, UCB guarantees a regret of

*Proof:* We analyze one arm at a time. Also, without loss of generality, assume that the optimal arm is the first one. For arm , we want to prove that .

The proof is based on the fact that once I have sampled an arm enough times, the probability of taking a suboptimal arm is small.

Let be the biggest time index such that . If , then the statement above is true. Hence, we can safely assume . Now, for bigger than , we have

Consider and such that . Then, we claim that at least one of the two following inequalities must be true:

If the first one is true, the confidence interval around our estimate of the expectation of the optimal arm does not contain . On the other hand, if the second one is true the confidence interval around our estimate of the expectation does not contain . So, we claim that if and we selected a suboptimal arm, then at least one of these two bad events happened.

Let’s prove the claim: *if both the inequalities above are false*, , and , we have

that, by the selection strategy of the algorithm, would imply .

Note that . Hence, we have

Now, we upper bound the probabilities in the sum. Given that the losses on the arms are i.i.d. and using the union bound, we have

Hence, we have

Given that the same bound holds for , we have

Using the decomposition of the regret we proved last time, , we have the stated bound.

It is instructive to observe an actual run of the algorithm. I have considered 5 arms and Gaussian losses. In the left plot of the figure below, I have plotted how the estimates and confidence intervals of UCB vary over time (in blue), compared to the actual true means (in black). On the right side, you can see the number of times each arm was pulled by the algorithm.

It is interesting to note that the logarithmic factor in the confidence term makes the confidence intervals of the arms that are not pulled *increase* over time. In turn, this assures that the algorithm does not miss the optimal arm, even if the estimates were off. Also, the algorithm will keep pulling the two arms that are close together, to be sure which one of the two is the best.
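A run like the one just described can be reproduced with a short sketch (losses version, matching the Remark above). The Gaussian losses, the confidence width, and all names are my choices for the demo, not the exact setup of the figure:

```python
import numpy as np

def ucb_run(means, T, rng):
    # UCB for losses: after one pull per arm, pull the arm with the smallest
    # lower confidence bound  mu_hat_i - sqrt(2 * log(t) / n_i).
    K = len(means)
    n = np.zeros(K)               # pull counts
    s = np.zeros(K)               # cumulative observed losses
    pulls = np.empty(T, dtype=int)
    for t in range(1, T + 1):
        if t <= K:
            i = t - 1                                     # pull each arm once first
        else:
            lcb = s / n - np.sqrt(2.0 * np.log(t) / n)    # lower confidence bounds
            i = int(np.argmin(lcb))
        loss = means[i] + rng.standard_normal()           # 1-subgaussian losses
        n[i] += 1
        s[i] += loss
        pulls[t - 1] = i
    return pulls

rng = np.random.default_rng(0)
means = np.array([0.0, 0.5, 0.5, 1.0, 1.0])   # arm 0 is optimal
pulls = ucb_run(means, 20000, rng)
```

The optimal arm ends up pulled the vast majority of the time, while each suboptimal arm collects a number of pulls roughly proportional to 1 over its squared gap, as in Theorem 1.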

The bound above can become meaningless if the gaps are too small. So, here we prove another bound that does not depend on the inverse of the gaps.

Theorem 2.Assume that the rewards of the arms minus their expectations are -subgaussian and let . Then, UCB guarantees a regret of

*Proof:* Let be some value to be tuned subsequently and recall from the proof of Theorem 1 that for each suboptimal arm we can bound

Hence, using the regret decomposition we proved last time, we have

Choosing , we have the stated bound.

Remark 2.Note that while the UCB algorithm is considered parameter-free, we still have to know the subgaussianity of the arms. While this can be easily upper bounded for stochastic arms with bounded support, it is unclear how to do it without any prior knowledge on the distribution of the arms.

It is possible to prove that the UCB algorithm is asymptotically optimal, in the sense of the following Theorem.

Theorem 3 (Bubeck, S. and Cesa-Bianchi, N., 2012, Theorem 2.2). Consider a strategy that satisfies for any set of Bernoulli reward distributions, any arm with , and any . Then, for any set of Bernoulli reward distributions, the following holds

**2. History Bits **

The use of confidence bounds and the idea of optimism first appeared in the work by (T. L. Lai and H. Robbins, 1985). The first version of UCB is by (T. L. Lai, 1987). The version of UCB I presented is by (P. Auer and N. Cesa-Bianchi and P. Fischer, 2002) under the name UCB1. Note that, rather than considering 1-subgaussian environments, (P. Auer and N. Cesa-Bianchi and P. Fischer, 2002) considers bandits where the rewards are confined to the interval. The proof of Theorem 1 is a minor variation of the one of Theorem 2.1 in (Bubeck, S. and Cesa-Bianchi, N. , 2012), which also popularized the subgaussian setup. Theorem 2 is from (Bubeck, S. and Cesa-Bianchi, N. , 2012).

**3. Exercises **


Exercise 1.Prove a similar regret bound to the one in Theorem 2 for an optimally tuned Explore-Then-Commit algorithm.

* You can find the lectures I published till now here.*

Today, we will consider the *stochastic bandit* setting. Here, each arm is associated with an unknown probability distribution. At each time step, the algorithm selects one arm and it receives a loss (or reward) drawn i.i.d. from the distribution of the arm . We focus on minimizing the *pseudo-regret*, that is the regret with respect to the optimal action in expectation, rather than the optimal action on the sequence of realized losses:

where we denoted by the expectation of the distribution associated with the arm .

Remark 1The usual notation in the stochastic bandit literature is to consider rewards instead of losses. Instead, to keep our notation coherent with the OCO literature, we will consider losses. The two settings are completely equivalent up to a multiplication by -1.

Before presenting our first algorithm for stochastic bandits, we will introduce some basic notions on concentration inequalities that will be useful in our definitions and proofs.

**1. Concentration Inequalities Bits **

Suppose that is a sequence of independent and identically distributed random variables with mean and variance . Having observed we would like to estimate the common mean . The most natural estimator is the *empirical mean*

Linearity of expectation shows that , which means that is an *unbiased estimator* of . Yet, is a random variable itself. So, can we quantify how far will be from ?

We could use Chebyshev’s inequality to upper bound the probability that is far from :

Using the fact that , we have that

So, we can expect the probability of having a “bad” estimate to go to zero as one over the number of samples in our empirical mean. Is this the best we can get? To understand what we can hope for, let’s take a look at the central limit theorem.
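To get a feel for how loose the Chebyshev rate can be, here is a small Monte Carlo sketch; the uniform samples, the sample size, and the threshold are illustrative assumptions.

```python
import random

def deviation_frequency(n, eps, trials=20000, seed=0):
    """Estimate P(|empirical mean - mu| >= eps) for n uniform[0,1]
    samples (mu = 0.5, sigma^2 = 1/12), to compare against the
    Chebyshev bound sigma^2 / (n * eps^2)."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        m = sum(rng.random() for _ in range(n)) / n
        if abs(m - 0.5) >= eps:
            bad += 1
    return bad / trials

# For n=50, eps=0.1 the Chebyshev bound is (1/12)/(50*0.01) ~ 0.167,
# while the observed frequency is typically far smaller.
```

The observed deviation frequency sits well below the Chebyshev bound, which is exactly the slack the central limit theorem argument below quantifies.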

We know that, defining , , the standard Gaussian distribution, as goes to infinity. This means that

where the approximation comes from the central limit theorem. The integral cannot be calculated in closed form, but we can easily upper bound it. Indeed, for , we have

This is better than what we got with Chebyshev’s inequality and we would like to obtain an exact bound with a similar asymptotic rate. To do that, we will focus our attention on *subgaussian* random variables.

Definition 1We say that a random variable is –subgaussianif for all we have that .

Example 1The following random variables are subgaussian:

- If is Gaussian with mean zero and variance , then is -subgaussian.
- If has mean zero and almost surely, then is -subgaussian.

We have the following properties for subgaussian random variables.

Lemma 2 (Lattimore and Szepesvári, 2018, Lemma 5.4) Assume that and are independent and -subgaussian and -subgaussian respectively. Then,

- = 0 and .
- is -subgaussian.
- is -subgaussian.
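As a quick numerical check of the bounded-support example above: a Rademacher variable (uniform on {-1, +1}) is 1-subgaussian, since its moment generating function cosh(λ) never exceeds exp(λ²/2). This is a sketch of that specific case, not a general proof.

```python
import math

def rademacher_mgf(lam):
    """E[exp(lam * X)] for X uniform on {-1, +1}, computed exactly."""
    return 0.5 * (math.exp(lam) + math.exp(-lam))  # = cosh(lam)

# The subgaussian condition E[exp(lam X)] <= exp(lam^2 sigma^2 / 2)
# holds with sigma = 1 for every lam we try.
for lam in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    assert rademacher_mgf(lam) <= math.exp(lam ** 2 / 2)
```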

Subgaussian random variables behave like Gaussian random variables, in the sense that their tail probabilities are upper bounded by those of a Gaussian with variance . To prove it, let’s first state Markov’s inequality.

Theorem 3 (Markov’s inequality)For a non-negative random variable and , we have that .

With Markov’s inequality, we can now formalize the above statement on subgaussian random variables.

*Proof:* For any , we have

Minimizing the right hand side of the inequality w.r.t. , we have the stated result.

An easy consequence of the above theorem is that the empirical average of subgaussian random variables concentrates around its expectation, *with the same asymptotic rate as in (1)*.

Corollary 5Assume that are independent, -subgaussian random variables. Then, for any , we have

where .

Equating the upper bounds on the r.h.s. of the inequalities in the Corollary to , we have the equivalent statement that, with probability at least , we have
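In code, the resulting high-probability confidence interval looks like the following sketch. The constants follow the standard subgaussian tail bound; treat them as an assumption matching the corollary above rather than its exact statement.

```python
import math

def subgaussian_ci(n, sigma=1.0, delta=0.05):
    """Half-width of the confidence interval for the empirical mean of n
    independent sigma-subgaussian variables: with probability at least
    1 - delta, the empirical mean is within
    sqrt(2 * sigma^2 * log(2 / delta) / n) of the true mean."""
    return math.sqrt(2 * sigma ** 2 * math.log(2 / delta) / n)
```

The width shrinks as 1/sqrt(n), matching the asymptotic rate suggested by the central limit theorem.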

**2. Explore-Then-Commit Algorithm **

We are now ready to present the most natural algorithm for the stochastic bandit setting, called Explore-Then-Commit (ETC) algorithm. That is, we first identify the best arm over exploration rounds and then we commit to it. This algorithm is summarized in Algorithm 2.
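A minimal sketch of ETC for losses; the Gaussian losses and the parameter names are illustrative assumptions.

```python
import random

def explore_then_commit(means, m, T, seed=0):
    """Explore-Then-Commit sketch: pull each of the K arms m times,
    then commit to the arm with the smallest empirical loss for the
    remaining T - K*m rounds. Losses are Gaussian for illustration."""
    rng = random.Random(seed)
    K = len(means)
    sums = [0.0] * K
    pulls = []
    for i in range(K):           # exploration phase
        for _ in range(m):
            sums[i] += rng.gauss(means[i], 1.0)
            pulls.append(i)
    best = min(range(K), key=lambda i: sums[i] / m)
    pulls.extend([best] * (T - K * m))   # commit phase
    return pulls
```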

In the following, we will denote by , that is the number of times that the arm was pulled in the first rounds.

Define by the expected loss of the arm with the smallest expectation, that is . Critical quantities in our analysis will be the *gaps*, for , that measure the expected difference in losses between the arms and the optimal one. In particular, we can decompose the regret as a sum over the arms of the expected number of times we pull an arm multiplied by its gap.

Lemma 6For any policy of selection of the arms, the regret is upper bounded by

*Proof:* Observe that

Hence,

The above Lemma quantifies the intuition that, in order to have a small regret, we have to select the suboptimal arms less often than the best one.

We are now ready to prove the regret guarantee of the ETC algorithm.

Theorem 7Assume that the losses of the arms minus their expectations are -subgaussian and . Then, ETC guarantees a regret of

*Proof:* Let’s assume without loss of generality that the optimal arm is the first one.

So, for , we have

From Lemma 2, we have that is -subgaussian. So, from Theorem 4, we have

The bound shows the trade-off between exploration and exploitation: if is too big, we pay too much during the exploration phase (first term in the bound). On the other hand, if is small, the probability to select a suboptimal arm increases (second term in the bound). Knowing all the gaps , it is possible to choose that minimizes the bound.

For example, in the case that , the regret is upper bounded by

that is minimized by

Remembering that must be a natural number, we can choose

When , we select . So, we have . Hence, the regret is upper bounded by

The main drawback of this algorithm is that its optimal tuning depends on the gaps . Assuming knowledge of the gaps amounts to making the stochastic bandit problem completely trivial. However, its tuned regret bound gives us a baseline against which to compare other bandit algorithms. In particular, in the next lecture we will present an algorithm that achieves the same asymptotic regret without any knowledge of the gaps.

**3. History Bits **

The ETC algorithm goes back to (Robbins, H., 1952), even if Robbins proposed what is now called epoch-greedy (Langford, J. and Zhang, T., 2008). For more history on ETC, take a look at chapter 6 in (Lattimore, T. and Szepesvári, C., 2018). The proofs presented here are from (Lattimore, T. and Szepesvári, C., 2018) as well.

* You can find all the lectures I published here.*

Last time, we saw that for Online Mirror Descent (OMD) with an entropic regularizer and learning rate it might be possible to get the regret guarantee

where . This time we will see how and we will use this guarantee to prove an almost optimal regret guarantee for Exp3, in Algorithm 1.

Remark 1While it is possible to prove (1) from first principles using the specific properties of the entropic regularizer, such a proof will not shed any light on what is actually going on. So, in the following we will instead try to prove such a regret bound in a very general way. Indeed, this general proof will allow us to easily prove the optimal bound for multi-armed bandits using OMD with the Tsallis entropy as regularizer.

Now, for a generic , consider the OMD algorithm that produces the predictions in two steps:

- Set such that .
- Set .
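For the entropic regularizer on the simplex, the two steps above take a particularly simple form. This is a sketch under that assumption: the unconstrained step is a coordinate-wise exponential update, and the Bregman (KL) projection onto the simplex reduces to a plain normalization.

```python
import math

def entropic_omd_step(p, loss, eta):
    """Two-step OMD update with the entropic regularizer: first the
    unconstrained step tilde_i = p_i * exp(-eta * loss_i), then the
    Bregman projection onto the simplex, i.e. a normalization."""
    tilde = [pi * math.exp(-eta * gi) for pi, gi in zip(p, loss)]
    z = sum(tilde)
    return tilde, [x / z for x in tilde]
```

The pre-projection point `tilde` is exactly the quantity the refined analysis below tracks explicitly.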

As we showed, under weak conditions, these two steps are equivalent to the usual OMD single-step update.

Now, the idea is to consider an alternative analysis of OMD that explicitly depends on , the new prediction before the Bregman projection step. First, let’s state the Generalized Pythagorean Theorem for Bregman divergences.

Lemma 1Let and define , then for all .

*Proof:* From the first order optimality condition of we have that . Hence, we have

The Generalized Pythagorean Theorem is often used to prove that the Bregman divergence between any point in and an arbitrary point decreases when we consider the Bregman projection onto .

We are now ready to prove our regret guarantee.

Lemma 2For the two-step OMD update above, the following regret bound holds:

where and .

*Proof:* From the update rule, we have that

where we used the three-points equality for Bregman divergences in the second equality and the Generalized Pythagorean Theorem in the first inequality. Hence, summing over time we have

So, as we did in the previous lecture, we have

where and .

Putting all together, we have the stated bound.

This time it might be easier to get a handle over . Given that we only need an upper bound, we can just take a look at and and see which one is bigger. This is easy to do: using the update rule we have

that is

Assuming , we have that implies .

Overall, we have the following improved regret guarantee for the Learning with Experts setting with positive losses.

Theorem 3Assume for and . Let and . Using OMD with the entropic regularizer defined as , learning rate , and gives the following regret guarantee

Armed with this new tool, we can now turn to the multi-armed bandit problem again.

Let’s now consider the OMD with entropic regularizer, learning rate , and set equal to the stochastic estimate of , as in Algorithm 1. Applying Theorem 3 and taking expectation, we have

Now, focusing on the terms , we have

So, setting , we have

Remark 2The need for a different analysis of OMD is due to the fact that we want an easy way to upper bound the Hessian. Indeed, in this analysis comes before the normalization into a probability distribution, which simplifies the analysis a lot. The same idea will be used for the Tsallis entropy in the next section.

So, with a tighter analysis we showed that, even without an explicit exploration term, OMD with entropic regularizer solves the multi-armed bandit problem paying only a factor more than the full information case. However, this is still not the optimal regret!

In the next section, we will see that changing the regularizer, *with the same analysis*, will remove the term in the regret.

**1. Optimal Regret Using OMD with Tsallis Entropy **

In this section, we present the Implicitly Normalized Forecaster (INF), also known as OMD with the Tsallis entropy, for the multi-armed bandit problem.

Define as , where and in we extend the function by continuity. This is the negative **Tsallis entropy** of the vector . This is a strict generalization of the Shannon entropy, because when goes to 1, converges to the negative (Shannon) entropy of .

We will instantiate OMD with this regularizer for the multi-armed problem, as in Algorithm 2.

Note that and .

We will not use any interpretation of this regularizer from the information theory point of view. As we will see in the following, the only reason to choose it is its Hessian. In fact, the Hessian of this regularizer is still diagonal and it is equal to

Now, we can use again the modified analysis for OMD in Lemma 2. So, for any , we obtain

where and .

As we did for Exp3, we now need an upper bound on the . From the update rule and the definition of , we have

that is

So, if , , that implies that .

Hence, putting all together, we have

We can now specialize the above reasoning, considering in the Tsallis entropy, to obtain the following theorem.

Theorem 4Assume . Set and . Then, Algorithm 2

*Proof:* We only need to calculate the terms

Proceeding as in (2), we obtain

Choosing , we finally obtain an expected regret of , that can be proved to be the optimal one.

One last thing remains: how do we compute the predictions of this algorithm? In each step, we have to solve a constrained optimization problem. So, we can write the corresponding Lagrangian:

From the KKT conditions, we have

and we also know that . So, we have a 1-dimensional problem in that must be solved in each round.
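The 1-dimensional search can be carried out by bisection. The sketch below assumes the α = 1/2 Tsallis entropy, for which the stationarity condition gives weights of the form x_i = 1/(η L_i + λ)²; the exact expression depends on the chosen normalization of the regularizer, so treat the formulas as illustrative.

```python
import math

def tsallis_weights(L, eta, iters=200):
    """Solve the 1-d normalization problem for OMD/FTRL with the
    alpha = 1/2 Tsallis entropy: bisect on the Lagrange multiplier lam
    so that the weights x_i = 1 / (eta * L_i + lam)^2 sum to 1."""
    K = len(L)
    lo = -eta * min(L) + 1e-12          # here the sum of weights blows up
    hi = -eta * min(L) + math.sqrt(K)   # here the sum of weights is <= 1
    for _ in range(iters):
        lam = (lo + hi) / 2
        s = sum(1.0 / (eta * l + lam) ** 2 for l in L)
        if s > 1.0:
            lo = lam
        else:
            hi = lam
    lam = (lo + hi) / 2
    return [1.0 / (eta * l + lam) ** 2 for l in L]
```

Arms with smaller cumulative estimated loss receive larger weights, and the bisection converges to machine precision in a few hundred iterations.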

**2. History Bits **

The INF algorithm was proposed by (Audibert, J.-Y. and Bubeck, S., 2009) and re-casted as an OMD procedure in (Audibert, J.-Y. and Bubeck, S. and Lugosi, G., 2011). The connection with the Tsallis entropy was done in (Abernethy, J. D. and Lee, C. and Tewari, A., 2015). The specific proof presented here is new and it builds on the proof by (Abernethy, J. D. and Lee, C. and Tewari, A., 2015). Note that (Abernethy, J. D. and Lee, C. and Tewari, A., 2015) proved the same regret bound for a Follow-The-Regularized-Leader procedure over the stochastic estimates of the losses (that they call Gradient-Based Prediction Algorithm), while here we proved it using a OMD procedure.

**3. Exercises **

Exercise 1Prove that in the modified proof of OMD, the terms can be upper bounded by .

Exercise 2Building on the previous exercise, prove that regret bounds of the same order can be obtained for Exp3 and for the INF/OMD with Tsallis entropy directly upper bounding the terms , without passing through the Bregman divergences.


* You can find the lectures I published till now here.*

Today, we will present the problem of multi-armed bandit in the adversarial setting and show how to obtain sublinear regret.

**1. Multi-Armed Bandit **

This setting is similar to the Learning with Expert Advice (LEA) setting: in each round, we select one expert and, unlike in the full-information setting, we only observe the loss of that expert . The aim is still to compete with the cumulative loss of the best expert in hindsight.

As in the learning with expert case, we need randomization in order to have a sublinear regret. Indeed, this is just a harder problem than LEA. However, we will assume that the adversary is **oblivious**, that is, he decides the losses of all the rounds before the game starts, but with the knowledge of the online algorithm. This makes the losses deterministic quantities and it avoids the inadequacy in our definition of regret when the adversary is adaptive (see (Arora, R. and Dekel, O. and Tewari, A., 2012)).

This kind of problems where we don’t receive the full-information, i.e., we don’t observe the loss vector, are called **bandit problems**. The name comes from the problem of a gambler who plays a pool of slot machines, that can be called “one-armed bandits”. On each round, the gambler places his bet on a slot machine and his goal is to win almost as much money as if he had known in advance which slot machine would return the maximal total reward.

In this problem, we clearly have an *exploration-exploitation trade-off*. In fact, on one hand we would like to play at the slot machine which, based on previous rounds, we believe will give us the biggest win. On the other hand, we have to explore the slot machines to find the best ones. On each round, we have to solve this trade-off.

Given that we don’t completely observe the loss, we cannot use our two frameworks: Online Mirror Descent (OMD) and Follow-The-Regularized-Leader (FTRL) both need the loss functions, or at least lower bounds on them.

One way to solve this issue is to construct *stochastic estimates* of the unknown losses. This is a natural choice given that we already know that the prediction strategy has to be a randomized one. So, in each round we construct a probability distribution over the arms and we sample one action according to this probability distribution. Then, we only observe the coordinate of the loss vector . One possibility to have a stochastic estimate of the losses is to use an *importance-weighted estimator*: Construct the estimator of the unknown vector in the following way:

Note that this estimator has all the coordinates equal to 0, except the coordinate corresponding to the arm that was pulled.

This estimator is unbiased, that is . To see why, note that and . Hence, for , we have
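A quick Monte Carlo sketch of this estimator and its unbiasedness; the loss vector and the sampling distribution are arbitrary made-up numbers.

```python
import random

def iw_estimate(losses, p, rng):
    """One round of the importance-weighted estimator: sample arm A ~ p,
    observe only losses[A], and return the vector that equals
    losses[A] / p[A] on coordinate A and zero elsewhere."""
    u, acc = rng.random(), 0.0
    A = len(p) - 1  # guard against floating-point rounding in acc
    for i, pi in enumerate(p):
        acc += pi
        if u <= acc:
            A = i
            break
    est = [0.0] * len(p)
    est[A] = losses[A] / p[A]
    return est

# Averaging many i.i.d. rounds recovers the full loss vector,
# illustrating unbiasedness despite observing one coordinate per round.
rng = random.Random(0)
losses, p = [0.2, 0.8, 0.5], [0.5, 0.3, 0.2]
n = 200000
avg = [0.0] * 3
for _ in range(n):
    e = iw_estimate(losses, p, rng)
    avg = [a + x / n for a, x in zip(avg, e)]
```

Note how the coordinates with small probability are estimated with larger variance, which is exactly the issue the analysis below runs into.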

Let’s also calculate the (uncentered) variance of the coordinates of this estimator. We have

We can now think of using OMD with an entropic regularizer and the estimated losses. Hence, assume and set defined as , that is the unnormalized negative entropy. Also, set . Using the OMD analysis, we have

We can now take the expectation at both sides and get

We are now in trouble, because the terms in the sum scale as . So, we need a way to control the smallest probability over the arms.

One way to do it is to take a convex combination of and a uniform probability. That is, we can predict with , where will be chosen in the following. So, can be seen as the minimum amount of exploration we require of the algorithm. Its value will be chosen by the regret analysis to optimally trade off exploration and exploitation. The resulting algorithm is in Algorithm 1.

The same probability distribution is used in the estimator:

We can have that . However, we pay a price in the bias introduced:

Observing that , we have

Putting together the last inequality and the upper bound to the expected regret in (2), we have

Setting and , we obtain a regret of .

This is much worse than the regret of the full-information case. However, while it is expected that the bandit case must be more difficult than the full-information one, it turns out that this is not the optimal strategy.

**2. Exponential-weight algorithm for Exploration and Exploitation: Exp3 **

It turns out that the algorithm above actually works, even without the mixing with the uniform distribution! We were just too loose in our regret guarantee. So, we will analyse the following algorithm, that is called Exponential-weight algorithm for Exploration and Exploitation (Exp3), that is nothing else than OMD with entropic regularizer and stochastic estimates of the losses. Note that now we will assume that .
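Here is a minimal sketch of Exp3 as just described: OMD with the entropic regularizer on importance-weighted loss estimates, with no explicit exploration term. The loss matrix and learning rate are illustrative assumptions.

```python
import math
import random

def exp3(loss_matrix, eta, seed=0):
    """Exp3 sketch: multiplicative-weights (entropic OMD) update on
    importance-weighted loss estimates. loss_matrix[t][i] is the
    (oblivious) loss of arm i at round t, assumed in [0, 1]."""
    rng = random.Random(seed)
    K = len(loss_matrix[0])
    w = [1.0] * K
    total_loss = 0.0
    for row in loss_matrix:
        z = sum(w)
        p = [x / z for x in w]
        A = rng.choices(range(K), weights=p)[0]  # sample arm A ~ p
        total_loss += row[A]
        est = row[A] / p[A]               # importance-weighted estimate
        w[A] *= math.exp(-eta * est)      # update only the pulled arm
    return total_loss
```

On a toy instance where one arm is always better, the algorithm quickly concentrates on it even without the uniform mixing.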

Let’s take another look at the regret guarantee we have. From the OMD analysis, we have the following one-step inequality that holds for any

Let’s now focus on the term . We said that for a twice differentiable function , there exists such that , where . Hence, there exists such that and

So, assuming the Hessian in to be positive definite, we can bound the last two terms in the one-step inequality of OMD as

where we used Fenchel-Young inequality with the function and and .

When we use the strong convexity, we are upper bounding the terms in the sum with the inverse of the smallest eigenvalue of the Hessian of the regularizer. However, we can do better if we consider the actual Hessian. In fact, in the coordinates where is small, we have a smaller growth of the divergence. This can also be seen graphically in Figure 1. Indeed, for the entropic regularizer, we have that the Hessian is a diagonal matrix. This expression of the Hessian gives a regret of

where and . Note that for any is in the simplex, so this upper bound is always better than

that we derived just using the strong convexity of the entropic regularizer.

However, we don’t know the exact value of , but only that it is on the line segment between and . Yet, if we could say that , in the bandit case we would obtain an expected regret guarantee of , greatly improving the bound we proved above!

In the next lecture, we will see an alternative way to analyze OMD that will give us exactly this kind of guarantee for Exp3, and will also give us the optimal regret guarantee using the Tsallis entropy in a few lines of proof.

**3. History Bits **

The algorithm in Algorithm 1 is from (Cesa-Bianchi, N. and Lugosi, G. , 2006, Theorem 6.9). The Exp3 algorithm was proposed in (Auer, P. and Cesa-Bianchi, N. and Freund, Y. and Schapire, R. E., 2002).
