The Aggregating Algorithm

We continue our journey into algorithms that predict with distributions, and this time we talk about the Aggregating Algorithm.

1. The Aggregating Algorithm and Mixable Losses

Here, we show how to extend the Weighted Average Algorithm (WAA) we saw last time to a larger class of loss functions. We will assume {P_1} to be a probability density with respect to a measure {m} on the set {\mathcal{X}}, the domain of the losses. As before, for a set {\mathcal{X}}, we denote by {\Delta(\mathcal{X})} the set of all probability distributions with support on {\mathcal{X}}.

In the case of {\alpha}-exp-concave losses, we used the fact that

\displaystyle \lambda_t\ln \mathop{\mathbb E}_{{\boldsymbol x} \sim P_t} \left[\exp\left(-\frac{1}{\lambda_t} \ell_t({\boldsymbol x})\right)\right] + \ell_t\left(\mathop{\mathbb E}_{{\boldsymbol x} \sim P_t}[{\boldsymbol x}]\right) \leq0,

for all {\frac{1}{\lambda_t} \leq \alpha}. Now, we want to extend the class of functions for which we can use a similar inequality. So, we introduce a more general class of losses, the {\alpha}-mixable ones.

Definition 1. Let {f:\mathcal{X}\rightarrow {\mathbb R}}. We say that {f} is {\alpha}-mixable if there exists a mapping {s:\Delta(\mathcal{X}) \rightarrow \mathcal{X}}, called the substitution function, such that

\displaystyle \frac{1}{\alpha}\ln \mathop{\mathbb E}_{{\boldsymbol x} \sim P} [\exp(-\alpha f({\boldsymbol x}))] + f\left(s(P)\right) \leq0, \quad \forall P \in \Delta(\mathcal{X})~.

It is clear that the proof of the WAA we saw last time holds with {{\boldsymbol x}_t=s(P_t)} and {\lambda_t \geq \frac{1}{\alpha}}, instead of {{\boldsymbol x}_t = \mathop{\mathbb E}_{{\boldsymbol x} \sim P_t} [{\boldsymbol x}]}. Moreover, every {\alpha}-exp-concave function is {\alpha}-mixable, with substitution function {s(P)=\mathop{\mathbb E}_{{\boldsymbol x} \sim P}[{\boldsymbol x}]}. This implies that all the following regret guarantees hold for exp-concave functions too.

Proposition 2. Let {\alpha >0} and {\mathcal{X}\subseteq {\mathbb R}^d}. If {f:\mathcal{X} \rightarrow {\mathbb R}} is {\alpha}-mixable, then {f} is {\beta}-mixable for any {0<\beta<\alpha}, with the same substitution function.

Proof: To prove that {f} is {\beta}-mixable with the same substitution function {s}, it is enough to show that {h(\alpha)\geq h(\beta)} for {h(y)=\left(\mathop{\mathbb E}_{{\boldsymbol x} \sim P} [\exp(-y f({\boldsymbol x}))]\right)^\frac{1}{y}}: indeed, this is equivalent to {\frac{1}{\beta}\ln \mathop{\mathbb E}_{{\boldsymbol x} \sim P} [\exp(-\beta f({\boldsymbol x}))] \leq \frac{1}{\alpha}\ln \mathop{\mathbb E}_{{\boldsymbol x} \sim P} [\exp(-\alpha f({\boldsymbol x}))]}, so the {\alpha}-mixability inequality implies the {\beta}-mixability one. Observe that

\displaystyle \begin{aligned} h(\alpha) &=\left(\mathop{\mathbb E}_{{\boldsymbol x} \sim P} \left[\exp(-\alpha f({\boldsymbol x}))\right]\right)^\frac{1}{\alpha} = \left(\mathop{\mathbb E}_{{\boldsymbol x} \sim P} \left[(\exp(-f({\boldsymbol x})))^\alpha\right]\right)^\frac{1}{\alpha}\\ &=\left(\mathop{\mathbb E}_{{\boldsymbol x} \sim P} \left[(\exp(-f({\boldsymbol x})))^{\beta\frac{\alpha}{\beta}} \right]\right)^\frac{1}{\alpha} \geq \left(\mathop{\mathbb E}_{{\boldsymbol x} \sim P} \left[(\exp(-f({\boldsymbol x})))^\beta\right]\right)^\frac{1}{\beta} =h(\beta), \end{aligned}

where in the inequality we used Jensen’s inequality on the convex function {x\rightarrow x^\frac{\alpha}{\beta}}. \Box
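As a numerical sanity check of this proof, the following Python sketch evaluates {h} on an empirical measure, where Jensen's inequality holds exactly; the sampled distribution and the loss {f} are arbitrary toy choices, not from the post.

```python
import math
import random

# Sanity check of Proposition 2: h(y) = (E_P[exp(-y f(x))])^(1/y) is
# nondecreasing in y, so alpha-mixability implies beta-mixability for
# any 0 < beta < alpha.  P and f below are arbitrary toy choices.
random.seed(0)
xs = [random.uniform(0, 1) for _ in range(1000)]  # empirical measure P
f = lambda x: (x - 0.3) ** 2                      # an arbitrary loss on [0, 1]

def h(y):
    return (sum(math.exp(-y * f(x)) for x in xs) / len(xs)) ** (1 / y)

values = [h(y) for y in (0.5, 1.0, 2.0, 4.0)]     # increasing values of y
assert all(a <= b + 1e-12 for a, b in zip(values, values[1:]))
```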

There is an additional caveat: The substitution function here depends on {f} and in online learning we only know the loss function after producing our prediction. So, in our setting we need to find a generic substitution function that holds for a class of loss functions.

Hence, we consider {\ell_t({\boldsymbol x},{\boldsymbol y}_t)} where {{\boldsymbol y}_t \in \mathcal{Y}} is decided adversarially and we require

\displaystyle \frac{1}{\alpha}\ln \mathop{\mathbb E}_{{\boldsymbol x} \sim P} [\exp(-\alpha \ell({\boldsymbol x},{\boldsymbol y}))] + \ell\left(s(P),{\boldsymbol y}\right) \leq 0, \quad \forall P \in \Delta(\mathcal{X}), {\boldsymbol y} \in \mathcal{Y}~.

An example of such losses is given in the following proposition.

Proposition 3. Define the softmax {\sigma: {\mathbb R}^K \rightarrow \Delta_{>0}^{K-1}} as {\sigma({\boldsymbol z})_k=\frac{\exp(z_k)}{\sum_{i=1}^K \exp(z_i)}}. The multiclass logistic loss, also referred to as softmax-cross-entropy loss, {\ell:{\mathbb R}^K \times \{1, \dots, K\} \rightarrow {\mathbb R}}, defined as {\ell({\boldsymbol x}, y) = -\ln(\sigma({\boldsymbol x})_y)}, is 1-mixable.

Proof: The proof is by construction: define the mapping {s} as {s:\pi \rightarrow \ln(\mathop{\mathbb E}_{{\boldsymbol x}\sim\pi}[\sigma({\boldsymbol x})])} for any distribution {\pi} on {{\mathbb R}^K}, where the logarithm is applied entry-wise. So, for any {y \in \{1, \dots, K\}}, we have

\displaystyle \mathop{\mathbb E}_{{\boldsymbol x}\sim\pi}[\exp(-\ell({\boldsymbol x}, y))] = \mathop{\mathbb E}_{{\boldsymbol x}\sim\pi}[\sigma({\boldsymbol x})_y ] = \sigma\left(\ln \mathop{\mathbb E}_{{\boldsymbol x}\sim\pi}[\sigma({\boldsymbol x}) ]\right)_y = \sigma(s(\pi))_y = \exp(-\ell(s(\pi), y)),

where the second equality uses the fact that for any {{\boldsymbol p} \in \Delta_{>0}^{K-1}}, {\sigma(\ln({\boldsymbol p})) = {\boldsymbol p}}. Thus, {\ell} is 1-mixable. \Box
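The chain of equalities above can also be verified numerically. In the following Python sketch, the score vectors and the weights of {\pi} are arbitrary toy values; the point is only that the mixability inequality holds, with equality, for the logistic loss.

```python
import math
import random

# Check of Proposition 3: with s(pi) = ln E_pi[softmax(x)] (entry-wise log),
# E_pi[exp(-loss(x, y))] = exp(-loss(s(pi), y)) for every class y, so the
# multiclass logistic loss is 1-mixable, with equality.  The score vectors
# and the weights of pi are arbitrary toy values.
random.seed(1)
K = 4

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    return [v / sum(e) for v in e]

def loss(z, y):  # multiclass logistic (softmax cross-entropy) loss
    return -math.log(softmax(z)[y])

points = [[random.gauss(0, 1) for _ in range(K)] for _ in range(5)]
weights = [0.1, 0.2, 0.3, 0.25, 0.15]  # pi: a finite distribution over points

mean_sm = [sum(w * softmax(x)[k] for w, x in zip(weights, points))
           for k in range(K)]
s_pi = [math.log(p) for p in mean_sm]  # the substitution function s(pi)

for y in range(K):
    lhs = sum(w * math.exp(-loss(x, y)) for w, x in zip(weights, points))
    rhs = math.exp(-loss(s_pi, y))
    assert abs(lhs - rhs) < 1e-12
```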

It is worth stressing that not all convex losses are mixable.

Proposition 4. The loss {\ell(x,y)=|x-y|}, where {y \in \{-1, 1\}}, is not mixable.

Proof: To show mixability, we would need to show that there exists {\alpha>0} and a substitution function {s} such that

\displaystyle \frac{1}{\alpha}\ln \mathop{\mathbb E}_{{\boldsymbol x} \sim P} [\exp(-\alpha |x-y|)] + |s(P)-y| \leq 0, \quad \forall P \in \Delta(\mathcal{X}), y \in \{-1,1\}~.

Consider the case where {\mathcal{X}=\{-1, 1\}} and {P(-1)=P(1)=1/2}. Denoting {\mu=s(P)}, we would need to find {\mu \in {\mathbb R}} and {\alpha > 0} such that

\displaystyle \begin{aligned} \mathop{\mathbb E}_{{\boldsymbol x} \sim P} [\exp(-\alpha |x-1|)]\leq \exp(-\alpha |\mu-1|),\\ \mathop{\mathbb E}_{{\boldsymbol x} \sim P} [\exp(-\alpha |x+1|)]\leq \exp(-\alpha |\mu+1|)~. \end{aligned}

Let’s calculate the expectations:

\displaystyle \begin{aligned} \mathop{\mathbb E}_{{\boldsymbol x} \sim P} [\exp(-\alpha |x-1|)]=1/2+1/2\exp(-2\alpha ),\\ \mathop{\mathbb E}_{{\boldsymbol x} \sim P} [\exp(-\alpha |x+1|)]=1/2+1/2\exp(-2\alpha )~. \end{aligned}

Hence, we have

\displaystyle \begin{aligned} -\frac{\ln(1/2+1/2\exp(-2\alpha ))}{\alpha} \geq |\mu-1|,\\ -\frac{\ln(1/2+1/2\exp(-2\alpha ))}{\alpha} \geq |\mu+1|~. \end{aligned}

Summing these two inequalities, we have

\displaystyle -2\frac{\ln(1/2+1/2\exp(-2\alpha ))}{\alpha} \geq |\mu-1| + |\mu+1| \geq 2~.

Rearranging, the last inequality is equivalent to {\exp(\alpha)+\exp(-\alpha) \leq 2}, whose only solution is {\alpha=0}. Hence, the loss is not mixable. \Box
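The conclusion can also be checked on a grid of values of {\alpha}: the left-hand side above stays strictly below 2 for every {\alpha>0}, so no valid {\mu} can exist. A quick Python check (the grid is arbitrary):

```python
import math

# Check of Proposition 4 on a grid: mixability would require
# -2*ln(1/2 + 1/2*exp(-2*alpha))/alpha >= 2 for some alpha > 0, which is
# equivalent to exp(alpha) + exp(-alpha) <= 2.  Both fail for every alpha > 0.
for alpha in [0.01, 0.1, 0.5, 1.0, 2.0, 10.0]:
    lhs = -2 * math.log(0.5 + 0.5 * math.exp(-2 * alpha)) / alpha
    assert lhs < 2  # so |mu - 1| + |mu + 1| >= 2 cannot hold
    assert math.exp(alpha) + math.exp(-alpha) > 2
```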

Equipped with this definition, we can introduce the Aggregating Algorithm in Algorithm 1. Its regret guarantee is the following.

Theorem 5. Assume {\ell:\mathcal{X}\times \mathcal{Y} \rightarrow {\mathbb R}} to be {\alpha}-mixable in its first argument for all {{\boldsymbol y} \in \mathcal{Y}}, and {\frac{1}{\alpha}<\lambda_{t}\leq \lambda_{t+1}} for all {t}. Then, Algorithm 1 satisfies

\displaystyle \sum_{t=1}^T \ell( {\boldsymbol x}_t, {\boldsymbol y}_t) \leq \mathop{\mathbb E}_{{\boldsymbol x} \sim Q}\left[\sum_{t=1}^T \ell({\boldsymbol x},{\boldsymbol y}_t)\right] + \lambda_T \text{KL}(Q||P_1), \quad \forall Q \in \Delta(\mathcal{X})~.

Moreover, we also have

\displaystyle \sum_{t=1}^T \ell( {\boldsymbol x}_t,{\boldsymbol y}_t) \leq -\lambda_T \ln \mathop{\mathbb E}_{{\boldsymbol x} \sim P_1}\left[ \exp\left(-\frac{1}{\lambda_T} \sum_{t=1}^T \ell({\boldsymbol x},{\boldsymbol y}_t)\right)\right]~.
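To make this concrete, here is a minimal Python sketch of the Aggregating Algorithm for a finite set of experts, with the multiclass logistic loss of Proposition 3 and {\lambda_t=1}; the experts and labels are toy values, not from the post. The final assertion checks the second bound of Theorem 5.

```python
import math
import random

# A minimal sketch of the Aggregating Algorithm for a finite set of N
# "experts" (fixed score vectors in R^K), with the 1-mixable multiclass
# logistic loss of Proposition 3, lambda_t = 1, and P_1 uniform.  The
# experts and the labels are arbitrary toy values.
random.seed(2)
K, N, T = 3, 5, 200

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    return [v / sum(e) for v in e]

def loss(z, y):
    return -math.log(softmax(z)[y])

experts = [[random.gauss(0, 1) for _ in range(K)] for _ in range(N)]
sm = [softmax(x) for x in experts]   # each expert's fixed prediction
logw = [0.0] * N                     # log of the unnormalized weights of P_t
total_alg = 0.0

for t in range(T):
    y_t = random.randrange(K)
    # prediction via the substitution function of Proposition 3 applied to P_t
    m = max(logw)
    w = [math.exp(v - m) for v in logw]
    Z = sum(w)
    mean_sm = [sum(wi * p[k] for wi, p in zip(w, sm)) / Z for k in range(K)]
    x_t = [math.log(p) for p in mean_sm]
    total_alg += loss(x_t, y_t)
    # exponential-weights update: P_{t+1}(x) proportional to P_t(x) exp(-loss)
    logw = [lw - loss(x, y_t) for lw, x in zip(logw, experts)]

# second bound of Theorem 5: total loss <= -ln E_{P_1}[exp(-sum_t loss)]
m = max(logw)
bound = -(m + math.log(sum(math.exp(v - m) for v in logw) / N))
assert total_alg <= bound + 1e-6
```

For the logistic loss the mixability inequality holds with equality, so the algorithm's total loss matches the bound up to floating-point error.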


As we did for WAA last time, under additional assumptions we can also give a regret bound with respect to a deterministic competitor.

Theorem 6. Let {\mathcal{X}\equiv\mathcal{V}} be a non-empty closed convex set in {{\mathbb R}^d} and {D:=\max_{{\boldsymbol x}, {\boldsymbol y} \in \mathcal{V}} \ \|{\boldsymbol x}-{\boldsymbol y}\|<\infty}, where {\|\cdot\|} is an arbitrary norm. Let {\ell:\mathcal{V}\times \mathcal{Y}\rightarrow {\mathbb R}} and assume that {\ell(\cdot, {\boldsymbol y}_t)} is {L_t}-Lipschitz w.r.t. {\|\cdot\|} and {\alpha}-mixable. Then, for any {{\boldsymbol u} \in \mathcal{V}}, the Aggregating Algorithm in Algorithm 1 with a sequence of {\lambda_t} such that {\frac{1}{\alpha}<\lambda_{t}\leq \lambda_{t+1}} for all {t} and {P_1} the uniform distribution on {\mathcal{V}} satisfies

\displaystyle \sum_{t=1}^T (\ell({\boldsymbol x}_t,{\boldsymbol y}_t) - \ell({\boldsymbol u}, {\boldsymbol y}_t)) \leq d \lambda_T \ln\left(\max\left(1, \frac{\frac{1}{\lambda_T} D \sum_{t=1}^T L_t}{d}\right)\right) + d \lambda_T \leq 2d \lambda_T \ln\left(\frac{\frac{1}{\lambda_T} D \sum_{t=1}^T L_t}{d}+e\right)~.

Proof: We start from the second bound in Theorem 5. Fix {\theta \in [0,1)}, {{\boldsymbol u} \in \mathcal{V}}, and define {\mathcal{V}'=\{{\boldsymbol x} \in {\mathbb R}^d: {\boldsymbol x}=\theta {\boldsymbol u} + (1-\theta){\boldsymbol w}, {\boldsymbol w} \in \mathcal{V}\} \subseteq \mathcal{V}}. Define {P'_1=\frac{1}{\beta} P_1} on {\mathcal{V}'} and 0 otherwise, where {\beta=\int_{\mathcal{V}'} P_1(\mathrm{d} {\boldsymbol x})=(1-\theta)^d}. Remember that {P_1} is the uniform distribution, so {P'_1} is uniform over {\mathcal{V}'}. So, we have

\displaystyle \begin{aligned} \mathop{\mathbb E}_{{\boldsymbol x} \sim P_1} \left[\exp\left(-\frac{1}{\lambda_T} \sum_{t=1}^T \ell({\boldsymbol x}, {\boldsymbol y}_t)\right)\right] &\geq (1-\theta)^d \mathop{\mathbb E}_{{\boldsymbol x} \sim P'_1} \left[\exp\left(-\frac{1}{\lambda_T} \sum_{t=1}^T \ell({\boldsymbol x},{\boldsymbol y}_t)\right)\right] \\ &= (1-\theta)^d \mathop{\mathbb E}_{{\boldsymbol x} \sim P_1} \left[\exp\left(-\frac{1}{\lambda_T} \sum_{t=1}^T\ell(\theta {\boldsymbol u} + (1-\theta){\boldsymbol x},{\boldsymbol y}_t)\right)\right] \\ &= (1-\theta)^d \mathop{\mathbb E}_{{\boldsymbol x} \sim P_1} \left[\exp\left(-\frac{1}{\lambda_T} \sum_{t=1}^T\ell((1-\theta)({\boldsymbol x}-{\boldsymbol u})+{\boldsymbol u},{\boldsymbol y}_t )\right)\right] \\ &\geq (1-\theta)^d \mathop{\mathbb E}_{{\boldsymbol x} \sim P_1} \left[\exp\left(-\frac{1}{\lambda_T} \sum_{t=1}^T\left(\ell({\boldsymbol u},{\boldsymbol y}_t ) + L_t (1-\theta)\|{\boldsymbol x}-{\boldsymbol u}\|\right)\right)\right] \\ &\geq (1-\theta)^d \mathop{\mathbb E}_{{\boldsymbol x} \sim P_1} \left[\exp\left(-\frac{1}{\lambda_T} \sum_{t=1}^T \left(\ell({\boldsymbol u},{\boldsymbol y}_t ) + L_t (1-\theta) D\right)\right)\right] \\ &= (1-\theta)^d \exp\left(-\frac{1}{\lambda_T} \sum_{t=1}^T \ell({\boldsymbol u},{\boldsymbol y}_t )\right) \exp\left(-\frac{1}{\lambda_T} \sum_{t=1}^T L_t (1-\theta)D\right)~. \end{aligned}

Putting it all together, we have

\displaystyle \sum_{t=1}^T (\ell({\boldsymbol x}_t,{\boldsymbol y}_t) - \ell({\boldsymbol u},{\boldsymbol y}_t)) \leq d \lambda_T \ln \frac{1}{1-\theta} + (1-\theta)D \sum_{t=1}^T L_t ~.

Now, we set {\theta} such that {1-\theta=\min\left(\frac{d \lambda_T}{ D \sum_{t=1}^T L_t},1\right)}. In this way, we have {(1-\theta) D \sum_{t=1}^T L_t\leq \lambda_T d} and {\ln \frac{1}{1-\theta} = \ln\left(\max\left(1, \frac{ \frac{1}{\lambda_T} D \sum_{t=1}^T L_t}{d}\right)\right)}, which gives the stated bound. \Box
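The second inequality in the statement of Theorem 6 follows from the elementary fact that {\ln(\max(1,z)) + 1 \leq 2\ln(z+e)} for all {z\geq 0}; a quick numerical check (on an arbitrary grid of values):

```python
import math

# The step from the max(1, .) form to the final bound of Theorem 6 uses
# ln(max(1, z)) + 1 <= 2*ln(z + e) for z >= 0; checked here on a grid.
for z in [0.0, 0.1, 1.0, 5.0, 100.0, 1e6]:
    assert math.log(max(1.0, z)) + 1 <= 2 * math.log(z + math.e) + 1e-12
```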

2. Example of AA: Online Multiclass Logistic Regression

In this section, we show an application of the AA, obtaining logarithmic regret for online multiclass logistic regression. In this problem, in each round we receive a covariate {{\boldsymbol z}_t \in \mathcal{Z} \subseteq {\mathbb R}^d} and we produce a discrete probability distribution {\hat{{\boldsymbol y}}_t \in \Delta^{K-1}} over the {K} classes. Then, we receive the true class {y_t \in \{1, \dots, K\}} and we pay the loss {- \ln \hat{y}_{t,y_t}}.

We could use a linear classifier, {\hat{{\boldsymbol y}}_t=\sigma({\boldsymbol X}^\top_t {\boldsymbol z}_t)}, where {{\boldsymbol X}_t \in \mathcal{V}\subseteq {\mathbb R}^{d \times K}} is the linear classifier at time {t}. Assuming that the covariates {{\boldsymbol z}_t} have bounded norms and that the columns of the matrices in {\mathcal{V}} have norms bounded by {D}, one can show that this problem has exp-concave losses, so we could use the ONS or the WAA algorithm. Unfortunately, the exp-concavity constant is of the order of {\exp(-D)}.

Here, we show that we can instead prove a regret bound that depends only logarithmically on {D}. The price that we pay for this improved rate is that the algorithm we will use is improper, that is, our predictor will not be linear in the covariates {{\boldsymbol z}_t}, but we will still measure its regret with respect to a linear competitor.

Here, we will proceed a little bit differently than in the standard AA because we will use two loss functions, {\ell_t:\mathcal{V}\times \mathcal{Y} \rightarrow {\mathbb R}} and {\hat{\ell}: {\mathbb R}^K \times \mathcal{Y}\rightarrow {\mathbb R}}, that satisfy a generalized form of mixability:

\displaystyle \frac{1}{\alpha}\ln \mathop{\mathbb E}_{{\boldsymbol x} \sim P} [\exp(-\alpha \ell_t({\boldsymbol x}, y))] + \hat{\ell}\left(s_t(P), y\right) \leq 0, \quad \forall P \in \Delta(\mathcal{V}), y \in \mathcal{Y}~.

It is easy to see that Theorem 5 extends to this case:

\displaystyle \sum_{t=1}^T \hat{\ell}( {\boldsymbol x}_t,{\boldsymbol y}_t) \leq -\lambda_T \ln \mathop{\mathbb E}_{{\boldsymbol x} \sim P_1}\left[ \exp\left(-\frac{1}{\lambda_T} \sum_{t=1}^T \ell_t({\boldsymbol x},{\boldsymbol y}_t)\right)\right]~.

Similarly, we have that if {\ell_t} is {L_t}-Lipschitz in its first argument, we have the corresponding version of Theorem 6:

\displaystyle \label{eq:mixable_lipschitz_extended} \sum_{t=1}^T (\hat{\ell}({\boldsymbol x}_t,{\boldsymbol y}_t) - \ell_t({\boldsymbol u}, {\boldsymbol y}_t)) \leq 2 \dim(\mathcal{V}) \lambda_T \ln\left(\frac{\frac{1}{\lambda_T} D \sum_{t=1}^T L_t}{\dim(\mathcal{V})}+e\right)~. \ \ \ \ \ (1)

In particular, from Proposition 3, we will define {\ell_t({\boldsymbol X}, y) = -\ln (\sigma({\boldsymbol X}^\top {\boldsymbol z}_t)_{y})}, {s_t(\pi) = \ln \mathop{\mathbb E}_{{\boldsymbol X}\sim\pi}[\sigma({\boldsymbol X}^\top {\boldsymbol z}_t)]}, and {\hat{\ell}({\boldsymbol x}, y)=-\ln (\sigma({\boldsymbol x})_{y})}.


Theorem 7. Assume that {\|{\boldsymbol z}_t\|_2\leq R} for all {t}, {\mathcal{V}=\{{\boldsymbol X}=[{\boldsymbol x}_1, \dots, {\boldsymbol x}_K] \in {\mathbb R}^{d\times K}: {\boldsymbol x}_i \in \mathcal{B} \subset {\mathbb R}^d\}}, and {\max_{{\boldsymbol x}, {\boldsymbol y} \in \mathcal{B}} \ \|{\boldsymbol x}-{\boldsymbol y}\|_2\leq D}. Then, Algorithm 2 satisfies

\displaystyle -\sum_{t=1}^T \ln (\sigma({\boldsymbol x}_t)_{y_t}) + \sum_{t=1}^T \ln (\sigma({\boldsymbol U}^\top {\boldsymbol z}_t)_{y_t}) \leq 2d\, K \ln\left(\frac{D\, R\, T}{d\, K}+e\right), \quad \forall {\boldsymbol U} \in \mathcal{V}~.

Proof: We have that

\displaystyle \frac{\partial \hat{\ell}({\boldsymbol x},y)}{\partial x_j} = \sigma({\boldsymbol x})_j-\boldsymbol{1}[y=j]~.

Defining {\|{\boldsymbol X}\|_{2,\infty}} as the infinity norm of the L2 norms of the columns of {{\boldsymbol X}}, we have that {\|\nabla_{{\boldsymbol X}} \ell_t({\boldsymbol X},y_t)\|_{2,\infty}\leq \|{\boldsymbol z}_t\|_2\leq R}. So, {\ell_t} is {R}-Lipschitz w.r.t. {\|\cdot\|_{2,\infty}}. The bound on the diameter of {\mathcal{V}} with respect to the same norm is {D}. So, using (1) with {\frac{1}{\lambda_t}=1} and {\dim(\mathcal{V})=d K}, we obtain the stated result. \Box
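To see the prediction and update steps in action, here is a rough Python sketch of this scheme that crudely discretizes {\mathcal{V}} into finitely many random candidate matrices. The actual algorithm uses a continuous prior over {\mathcal{V}}, so the discretization, dimensions, and data below are illustrative only; with finitely many candidates, the AA guarantee reduces to a {\ln N} regret bound, which the last line checks.

```python
import math
import random

# A rough sketch of improper online multiclass logistic regression via the
# AA with lambda_t = 1.  The continuous set V of d x K matrices is crudely
# discretized into N random candidates, so this only illustrates the
# prediction/update steps; dimensions, data, and candidates are toy values.
random.seed(3)
d, K, T, N = 2, 3, 100, 50

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    return [v / sum(e) for v in e]

def scores(X, z):  # the K entries of X^T z
    return [sum(X[k][i] * z[i] for i in range(d)) for k in range(K)]

# candidate matrices, stored as K columns of length d each
cands = [[[random.uniform(-1, 1) for _ in range(d)] for _ in range(K)]
         for _ in range(N)]
logw = [0.0] * N
total = 0.0

for t in range(T):
    z_t = [random.gauss(0, 1) for _ in range(d)]
    y_t = random.randrange(K)
    sm = [softmax(scores(X, z_t)) for X in cands]  # per-candidate predictions
    m = max(logw)
    w = [math.exp(v - m) for v in logw]
    Z = sum(w)
    # improper prediction: x_t = s_t(P_t) = ln E_{X ~ P_t}[softmax(X^T z_t)]
    mean_sm = [sum(wi * p[k] for wi, p in zip(w, sm)) / Z for k in range(K)]
    x_t = [math.log(p) for p in mean_sm]
    total += -math.log(softmax(x_t)[y_t])
    # exponential-weights update with the linear-model loss ell_t(X, y_t)
    logw = [lw + math.log(p[y_t]) for lw, p in zip(logw, sm)]

best = min(-lw for lw in logw)  # loss of the best candidate in hindsight
assert total <= best + math.log(N) + 1e-6  # finite-set AA regret bound
```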

3. History Bits

The AA was introduced in Vovk (1990) (see also Vovk (1998, Appendix A) for an easier description of the algorithm).

The observation that the AA also works for infinite sets of experts was made by Freund (1996) and Freund (2003).

The concept of mixability is introduced in Vovk (2001). Proposition 2 is in the proof of Vovk (1998, Lemma 9).

The mixability of the logistic loss, Proposition 3, and the content of Section 2 are from Foster et al. (2018). Theorem 6 is a generalization of a similar one for the logistic loss from Foster et al. (2018).

Acknowledgments
Thanks to Wei-Cheng Lee for feedback on a preliminary version of this post and to Gemini 2.5 and ChatGPT5 for checking all the proofs.
