This time we will introduce the Weighted Average Algorithm (WAA). I will do it my way: I am allergic to presenting a different analysis for each algorithm! From my blog it should be clear that we only have two main algorithms in online learning: OMD and FTRL. So, 99% of the online algorithms are instantiations of one or the other. In this case, I will show that WAA is nothing else than FTRL on distributions.
Next time, I will introduce the Aggregating Algorithm (AA) as a variant of the WAA.
1. Follow-the-Regularized-Leader with Distributions
Our analysis of the WAA will be based on a generalization of the FTRL algorithm with the entropic regularizer, to work with distributions with infinite support, either countable or continuous.
Let $X \subseteq \mathbb{R}^d$ be the intersection of all domains of the losses, $\mu$ be a fixed measure on $X$, and $p$ be some probability density with respect to $\mu$ (this means that $\int_X p(x) \, d\mu(x) = 1$; in what follows, we will drop "with respect to $\mu$"). In the finite case, it is natural to take $\mu$ to be the counting measure, while in the continuous case the natural choice of $\mu$ is the Lebesgue measure.
For some set of distributions $\mathcal{P}$, in each round we output $p_t \in \mathcal{P}$, then we receive the loss functions $\ell_t: X \to \mathbb{R}$, where $t = 1, \dots, T$. To simplify the notation, we introduce the duality pairing
$$\langle p, \ell \rangle := \int_X \ell(x) \, p(x) \, d\mu(x).$$
Note that the duality pairing is bi-linear but not symmetric, so it should not be confused with the inner product. Yet, it reduces to the inner product in some cases, for example, when $X$ is a discrete set. With this notation, we can write $\mathbb{E}_{x \sim p}[\ell(x)] = \langle p, \ell \rangle$. We also have that the functional derivative of $\langle p, \ell \rangle$ with respect to $p$ is
$$\frac{\delta}{\delta p} \langle p, \ell \rangle = \ell.$$
The above notation will greatly facilitate the generalization of FTRL to infinite distributions. Let's use FTRL with regularizers $\psi_t(p) = \frac{1}{\eta_t} \mathrm{KL}(p \,\|\, \pi)$, where $\pi$ is a prior distribution and $\eta_t > 0$. Hence, at round $t$ we produce a probability distribution $p_t \in \mathcal{P}$ defined as
$$p_t = \mathop{\mathrm{argmin}}_{p \in \mathcal{P}} \; \frac{1}{\eta_t} \mathrm{KL}(p \,\|\, \pi) + \sum_{i=1}^{t-1} \langle p, \ell_i \rangle.$$
It is easy to see that the FTRL equality works even in this setting, so, denoting by $F_t(p) = \frac{1}{\eta_t} \mathrm{KL}(p \,\|\, \pi) + \sum_{i=1}^{t-1} \langle p, \ell_i \rangle$ and using $F_1(p_1) = \min_p \frac{1}{\eta_1}\mathrm{KL}(p\,\|\,\pi) = 0$, we have that
$$-\sum_{t=1}^T \langle q, \ell_t \rangle = \frac{\mathrm{KL}(q \,\|\, \pi)}{\eta_{T+1}} + \sum_{t=1}^T \left[F_t(p_t) - F_{t+1}(p_{t+1})\right] + F_{T+1}(p_{T+1}) - F_{T+1}(q),$$
where we removed the loss term of the algorithm on both sides. Define $\bar{x}_t = \mathbb{E}_{x \sim p_t}[x]$ and add $\sum_{t=1}^T \ell_t(\bar{x}_t)$ to both sides, to have
$$\sum_{t=1}^T \ell_t(\bar{x}_t) - \sum_{t=1}^T \langle q, \ell_t \rangle = \frac{\mathrm{KL}(q \,\|\, \pi)}{\eta_{T+1}} + \sum_{t=1}^T \left[\ell_t(\bar{x}_t) + F_t(p_t) - F_{t+1}(p_{t+1})\right] + F_{T+1}(p_{T+1}) - F_{T+1}(q). \qquad (1)$$
The above equality will be the starting point to analyze WAA and AA.
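To make the FTRL-on-distributions step concrete in the simplest discrete case, here is a small numerical sketch (the prior, the losses, and the learning rate are arbitrary illustrative choices): the regularized objective is minimized by $p(x) \propto \pi(x) \exp(-\eta \sum_i \ell_i(x))$, and its minimum value is the "mix loss" $-\frac{1}{\eta} \ln \mathbb{E}_{x\sim\pi}[\exp(-\eta \sum_i \ell_i(x))]$.

```python
import numpy as np

rng = np.random.default_rng(0)

eta = 0.5
pi = np.array([0.2, 0.3, 0.5])               # prior over a 3-point domain
losses = rng.uniform(0.0, 1.0, size=(4, 3))  # 4 rounds of losses

def F(p, cum_loss):
    """FTRL objective over distributions: KL(p||pi)/eta + <p, cumulative loss>."""
    return np.sum(p * np.log(p / pi)) / eta + p @ cum_loss

cum = losses.sum(axis=0)

# closed-form minimizer: p proportional to pi * exp(-eta * cumulative loss)
w = pi * np.exp(-eta * cum)
p_star = w / w.sum()

# its value equals the mix loss -(1/eta) * log E_pi[exp(-eta * cum loss)]
min_val = F(p_star, cum)
mix_val = -np.log(np.sum(pi * np.exp(-eta * cum))) / eta

# sanity check: no random distribution on the simplex does better
others = rng.dirichlet(np.ones(3), size=1000)
gap = min(F(q, cum) for q in others) - min_val
```

Here `gap` is non-negative because `p_star` is the exact minimizer.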
2. The Weighted Average Algorithm
We consider the FTRL for distributions described in the previous section. As usual, we will denote by $V$ the feasible set of the problem. Also, we will make a subtle but important distinction between the domain of the loss functions, $X$, and a smaller feasible set $V \subseteq X$. This allows the flexibility to use priors defined on larger sets (like a Gaussian on $\mathbb{R}^d$), while still ensuring that the predictions $\bar{x}_t$ adhere to the problem's constraints, that is $\bar{x}_t \in V$. So, for a set $V$, define by $\Delta(V)$ the set of all probability distributions with support on $V$, and by $\mathcal{P}_V = \{p \in \Delta(X) : \mathbb{E}_{x \sim p}[x] \in V\}$ the set of all probability distributions whose expectation is in $V$.
We set $\mathcal{P}$ as $\mathcal{P}_V$. With this choice, the prediction rule becomes
$$\bar{x}_t = \mathbb{E}_{x \sim p_t}[x] = \frac{\mathbb{E}_{x \sim \pi}\left[x \exp\left(-\eta_t \sum_{i=1}^{t-1} \ell_i(x)\right)\right]}{\mathbb{E}_{x \sim \pi}\left[\exp\left(-\eta_t \sum_{i=1}^{t-1} \ell_i(x)\right)\right]}.$$
WAA is stated in Algorithm 1. Essentially, WAA predicts with a weighted average of predictions (hence the name of the algorithm), where the weights are proportional to the exponential of the negative cumulative losses of each predictor.
Remark 1. Note that if $\pi$ is such that $\pi(x) = 0$ for all $x \notin V$, then $p_t(x) = 0$ for all $x \notin V$. Hence, if $V$ is convex, then $\bar{x}_t \in V$ for all $t$ and the constraint to have $\mathbb{E}_{x \sim p_t}[x] \in V$ is automatically satisfied.
Remark 2. If the losses are convex, predicting with the average instead of sampling makes sense because by Jensen's inequality, we have
$$\ell_t(\bar{x}_t) = \ell_t\left(\mathbb{E}_{x \sim p_t}[x]\right) \le \mathbb{E}_{x \sim p_t}[\ell_t(x)] = \langle p_t, \ell_t \rangle.$$
Next, we will dig deeper into the concept of exp-concavity, which will be used as a key assumption on the losses.
2.1. Convex Analysis Bits: Exp-concavity
Definition 1. Let $\alpha > 0$ and $V \subseteq \mathbb{R}^d$ convex. A function $f: V \to \mathbb{R}$ is called $\alpha$-exp-concave if $x \mapsto \exp(-\alpha f(x))$ is a concave function on $V$.
Proposition 2. Let $f: V \to \mathbb{R}$. If $f$ is $\alpha$-exp-concave, then $f$ is $\beta$-exp-concave for any $0 < \beta \le \alpha$.
Proof: Observe that $\exp(-\beta f(x)) = \left(\exp(-\alpha f(x))\right)^{\beta/\alpha}$ and $\exp(-\alpha f)$ is non-negative. Given that $z \mapsto z^{\beta/\alpha}$ is increasing and concave on the non-negative reals because $\beta/\alpha \le 1$, we have a composition of a concave function with an increasing concave function. Hence, $\exp(-\beta f)$ is concave.
Proposition 3. Assume $f: V \to \mathbb{R}$ is $\alpha$-exp-concave and let $g(x) = f(A x + b)$, where $A \in \mathbb{R}^{d \times k}$ and $b \in \mathbb{R}^d$. Then, $g$ is $\alpha$-exp-concave on $\{x \in \mathbb{R}^k : A x + b \in V\}$.
Proof: Since $\exp(-\alpha f)$ is concave, then $\exp(-\alpha g(x)) = \exp(-\alpha f(A x + b))$ is also concave because it is the composition of a concave function with an affine transformation.
The next proposition shows that exp-concavity is a stronger property than convexity.
Proposition 4. Let $f: V \to \mathbb{R}$ be $\alpha$-exp-concave with $\alpha > 0$. Then, $f$ is convex.
Proof: Let $g(x) = \exp(-\alpha f(x))$. By hypothesis $g$ is concave on $V$ and $g(x) > 0$ for all $x \in V$ (since the exponential is positive). Fix $x, y \in V$ and $\lambda \in [0, 1]$. Concavity of $g$ gives
$$g(\lambda x + (1-\lambda) y) \ge \lambda g(x) + (1-\lambda) g(y).$$
For the positive numbers $g(x)$ and $g(y)$ the weighted AM–GM inequality yields
$$\lambda g(x) + (1-\lambda) g(y) \ge g(x)^{\lambda} \, g(y)^{1-\lambda}.$$
Combining the two inequalities,
$$g(\lambda x + (1-\lambda) y) \ge g(x)^{\lambda} \, g(y)^{1-\lambda}.$$
Taking $-\frac{1}{\alpha}\ln$ of both sides, which reverses the inequality, yields
$$f(\lambda x + (1-\lambda) y) \le \lambda f(x) + (1-\lambda) f(y),$$
so $f$ is convex.
We can also characterize the exp-concavity in terms of the Hessian of a twice differentiable function.
Theorem 5. Let $f: V \to \mathbb{R}$ twice-differentiable. Then, $f$ is $\alpha$-exp-concave iff
$$\nabla^2 f(x) \succeq \alpha \, \nabla f(x) \nabla f(x)^\top, \quad \forall x \in V.$$
Proof: The concavity of $g(x) := \exp(-\alpha f(x))$ is equivalent to its Hessian being negative semi-definite: $\nabla^2 g(x) \preceq 0$.
Let's compute $\nabla g$ and $\nabla^2 g$. The gradient is
$$\nabla g(x) = -\alpha \exp(-\alpha f(x)) \, \nabla f(x).$$
The Hessian is
$$\nabla^2 g(x) = \alpha^2 \exp(-\alpha f(x)) \, \nabla f(x) \nabla f(x)^\top - \alpha \exp(-\alpha f(x)) \, \nabla^2 f(x).$$
Re-ordering,
$$\nabla^2 g(x) = \alpha \exp(-\alpha f(x)) \left(\alpha \, \nabla f(x) \nabla f(x)^\top - \nabla^2 f(x)\right).$$
Since $\alpha > 0$ and $\exp(-\alpha f(x)) > 0$, the condition $\nabla^2 g(x) \preceq 0$ is equivalent to $\nabla^2 f(x) \succeq \alpha \, \nabla f(x) \nabla f(x)^\top$.
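As a sanity check of the computation in the proof, we can compare the analytic Hessian of $g = \exp(-\alpha f)$ with a finite-difference approximation; the quadratic test function and the evaluation point below are arbitrary choices for illustration.

```python
import numpy as np

alpha = 0.7
A = np.array([[2.0, 0.5], [0.5, 1.5]])  # Hessian of the test function f
b = np.array([0.3, -0.2])

def f(x):                      # arbitrary smooth test function
    return 0.5 * x @ A @ x + b @ x

def grad_f(x):
    return A @ x + b

def g(x):                      # g = exp(-alpha f)
    return np.exp(-alpha * f(x))

x0 = np.array([0.4, -0.1])

# analytic Hessian from the proof:
# alpha * exp(-alpha f) * (alpha * grad f grad f^T - Hess f)
gf = grad_f(x0)
hess_analytic = alpha * g(x0) * (alpha * np.outer(gf, gf) - A)

# numerical Hessian of g by central finite differences
h = 1e-5
d = len(x0)
hess_num = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        e_i, e_j = np.eye(d)[i] * h, np.eye(d)[j] * h
        hess_num[i, j] = (g(x0 + e_i + e_j) - g(x0 + e_i - e_j)
                          - g(x0 - e_i + e_j) + g(x0 - e_i - e_j)) / (4 * h * h)

err = np.abs(hess_analytic - hess_num).max()
```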
Example 1. Consider $V = [0, 1]$, $y \in [0, 1]$, and $f: V \to \mathbb{R}$ defined as $f(x) = (x - y)^2$, where $y$ is fixed. We have $f'(x) = 2(x - y)$ and
$$f''(x) = 2.$$
Using Theorem 5, $f$ is $\alpha$-exp-concave iff
$$2 \ge 4 \alpha (x - y)^2, \quad \forall x \in V.$$
Hence, given that $(x - y)^2 \le 1$ on $V$, we have that the square loss is $\frac{1}{2}$-exp-concave.
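We can also verify the condition of Theorem 5 for the square loss numerically; the grid of predictions and the pair of extreme outcomes below are arbitrary illustrative choices. The condition holds everywhere for $\alpha = \frac{1}{2}$, while it fails for $\alpha = 1$.

```python
import numpy as np

def exp_concavity_gap(alpha, x, y):
    """f(x) = (x - y)^2: return alpha * f'(x)^2 - f''(x).
    The Theorem 5 condition holds at x iff this is <= 0."""
    return alpha * (2 * (x - y)) ** 2 - 2.0

xs = np.linspace(0.0, 1.0, 201)
# worst case over predictions x in [0, 1] and outcomes y in {0, 1}
worst_half = max(exp_concavity_gap(0.5, x, y) for x in xs for y in (0.0, 1.0))
worst_one  = max(exp_concavity_gap(1.0, x, y) for x in xs for y in (0.0, 1.0))
```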
Finally, we show that exp-concavity holds on the entire $\mathbb{R}^d$ only for constant functions.
Proposition 6. Let $f: \mathbb{R}^d \to \mathbb{R}$ be $\alpha$-exp-concave on $\mathbb{R}^d$. Then, $f$ is a constant function.
Proof: Denote by $g(x) = -\exp(-\alpha f(x))$, that is convex by definition and upper bounded by 0. Suppose that $g$ is not constant; this means that there exist $x, y \in \mathbb{R}^d$ such that $g(y) > g(x)$. Since $g$ is convex, we have, for any $\lambda \ge 1$,
$$g(x + \lambda (y - x)) \ge g(x) + \lambda \left(g(y) - g(x)\right).$$
Hence, we have
$$0 \ge g(x + \lambda (y - x)) \ge g(x) + \lambda \left(g(y) - g(x)\right).$$
Since $g(y) - g(x) > 0$, we have that the right hand side of the inequality goes to infinity as $\lambda \to \infty$, giving a contradiction. Hence, $g$ is constant, that implies that $f$ is constant.
Remark 3. The previous result and the fact that exp-concave functions are often used only on bounded domains might induce someone to think that exp-concave functions only exist on bounded domains. This is not true: there exist exp-concave $f: V \to \mathbb{R}$, where $V$ is not bounded. For example, let $V = (0, +\infty)$ and $f(x) = -\ln x$, that is 1-exp-concave because $\exp(-f(x)) = x$ is linear, hence concave.
2.2. Analysis of WAA
We now state the regret guarantee for WAA.
Theorem 7. Assume $\ell_t$ to be $\alpha$-exp-concave for all $t = 1, \dots, T$, and $\eta_t = \alpha$ for all $t$ in Algorithm 2. Then, we have
$$\sum_{t=1}^T \ell_t(\bar{x}_t) - \mathbb{E}_{x \sim q}\left[\sum_{t=1}^T \ell_t(x)\right] \le \frac{\mathrm{KL}(q \,\|\, \pi)}{\alpha}, \quad \forall q \in \mathcal{P}_V.$$
Moreover, if $\pi$ is supported on $V$, we also have
$$\sum_{t=1}^T \ell_t(\bar{x}_t) \le -\frac{1}{\alpha} \ln \mathbb{E}_{x \sim \pi}\left[\exp\left(-\alpha \sum_{t=1}^T \ell_t(x)\right)\right].$$
Proof: We start from (1). First of all, observe that it holds that (proof left as an exercise)
$$\min_p \left[\frac{1}{\eta} \mathrm{KL}(p \,\|\, \pi) + \sum_{i=1}^{t} \langle p, \ell_i \rangle\right] = -\frac{1}{\eta} \ln \mathbb{E}_{x \sim \pi}\left[\exp\left(-\eta \sum_{i=1}^{t} \ell_i(x)\right)\right], \qquad (2)$$
with the minimizer proportional to $\pi(x) \exp\left(-\eta \sum_{i=1}^{t} \ell_i(x)\right)$. Given that $\mathrm{KL}(p \,\|\, \pi) \ge 0$, we have $F_{t+1}(p_{t+1}) \ge \min_p \left[\frac{1}{\eta_t} \mathrm{KL}(p \,\|\, \pi) + \sum_{i=1}^{t} \langle p, \ell_i \rangle\right]$, because the term depending on the KL is non-negative and $\frac{1}{\eta_{t+1}} \ge \frac{1}{\eta_t}$.
Assuming $\eta_t$ non-increasing, we then have
$$F_t(p_t) - F_{t+1}(p_{t+1}) \le \frac{1}{\eta_t} \ln \frac{\mathbb{E}_{x \sim \pi}\left[\exp\left(-\eta_t \sum_{i=1}^{t} \ell_i(x)\right)\right]}{\mathbb{E}_{x \sim \pi}\left[\exp\left(-\eta_t \sum_{i=1}^{t-1} \ell_i(x)\right)\right]} = \frac{1}{\eta_t} \ln \mathbb{E}_{x \sim p_t}\left[\exp\left(-\eta_t \ell_t(x)\right)\right],$$
where in the inequality we used (2) twice, and in the last equality we used the closed-form expression of $p_t$.
Finally, if $\ell_t$ is $\alpha$-exp-concave and $\eta_t \le \alpha$, then $\ell_t$ is also $\eta_t$-exp-concave by Proposition 2, so using Jensen's inequality we have
$$\ell_t(\bar{x}_t) = \ell_t\left(\mathbb{E}_{x \sim p_t}[x]\right) \le -\frac{1}{\eta_t} \ln \mathbb{E}_{x \sim p_t}\left[\exp\left(-\eta_t \ell_t(x)\right)\right]$$
for all $t$. Putting all together and observing that the last term in (1) is negative for all $q \in \mathcal{P}_V$, we obtained the first stated bound.
The second bound is obtained by choosing $q$ to minimize $\mathbb{E}_{x \sim q}\left[\sum_{t=1}^T \ell_t(x)\right] + \frac{\mathrm{KL}(q \,\|\, \pi)}{\alpha}$: by (2), the minimum value is exactly $-\frac{1}{\alpha} \ln \mathbb{E}_{x \sim \pi}\left[\exp\left(-\alpha \sum_{t=1}^T \ell_t(x)\right)\right]$. In this case, the assumption of $\pi$ supported on $V$ makes sure that the last term in (1) is negative.
So, the WAA algorithm has the surprising property of having a *constant* regret on exp-concave losses with respect to a stochastic competitor.
Unfortunately, the update in the WAA does not have a closed-form solution in general. In fact, it requires the numerical evaluation of expectations. However, if the distribution is discrete, we can always calculate the prediction exactly.
Example 2. As a practical example, consider the case that $V$ is composed by $N$ vectors in $\mathbb{R}^d$, $V = \{v_1, \dots, v_N\}$, and $\pi$ is the uniform distribution over them. In this case, we have that
$$\bar{x}_t = \frac{\sum_{j=1}^N v_j \exp\left(-\eta_t \sum_{i=1}^{t-1} \ell_i(v_j)\right)}{\sum_{j=1}^N \exp\left(-\eta_t \sum_{i=1}^{t-1} \ell_i(v_j)\right)}.$$
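A minimal sketch of this discrete case (the points, the losses, and the learning rate are arbitrary illustrative choices); shifting by the maximum before exponentiating is the usual log-sum-exp trick to avoid underflow when the cumulative losses are large.

```python
import numpy as np

def waa_prediction(points, past_losses, eta):
    """Weighted average prediction: weights proportional to
    exp(-eta * cumulative loss of each point), computed stably.

    points:      (N, d) array of the vectors v_1, ..., v_N
    past_losses: (t-1, N) array, entry (i, j) = loss of v_j at round i+1
    """
    cum = past_losses.sum(axis=0) if len(past_losses) else np.zeros(len(points))
    z = -eta * cum
    w = np.exp(z - z.max())          # log-sum-exp shift for stability
    w /= w.sum()
    return w @ points

# toy run: 3 points in R^2, squared-distance losses to random targets
rng = np.random.default_rng(1)
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
targets = rng.uniform(0, 1, size=(5, 2))
losses = np.array([[np.sum((v - y) ** 2) for v in points] for y in targets])

pred = waa_prediction(points, losses, eta=0.5)
pred0 = waa_prediction(points, np.zeros((0, 3)), eta=0.5)  # uniform average
```

With no past losses the prediction is the plain average of the points, and it always stays in their convex hull.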
We have proved an upper bound to the regret of WAA that uses a randomized comparator, that is, $\mathbb{E}_{x \sim q}\left[\sum_{t=1}^T \ell_t(x)\right]$. However, sometimes one would like to prove an upper bound that depends on a deterministic one. Hence, we now show how to link the performance of the stochastic comparator to the performance of a deterministic one.
Theorem 8. Assume that $V \subset \mathbb{R}^d$ is a convex closed bounded set and set $\pi$ to be the uniform distribution on it. Assume that the losses $\ell_t$ are $\alpha$-exp-concave and set $\eta_t = \alpha$. Then, Algorithm 2 satisfies
$$\sum_{t=1}^T \ell_t(\bar{x}_t) - \sum_{t=1}^T \ell_t(u) \le \frac{d}{\alpha}\left(1 + \ln\left(1 + \frac{T}{d}\right)\right), \quad \forall u \in V.$$
Proof: Fix $u \in V$, and define $V_\delta = \{(1-\delta) u + \delta x : x \in V\}$. Choose $q$ equal to $\frac{1}{\delta^d \mathrm{vol}(V)}$ in $V_\delta$ and 0 otherwise, where $\delta = \frac{d}{T + d}$.
By the exp-concavity of $\ell_t$, we have for any $x \in V$
$$\exp\left(-\alpha \ell_t((1-\delta) u + \delta x)\right) \ge (1-\delta) \exp\left(-\alpha \ell_t(u)\right) + \delta \exp\left(-\alpha \ell_t(x)\right) \ge (1-\delta) \exp\left(-\alpha \ell_t(u)\right).$$
Hence, we have for any $x \in V$
$$\ell_t((1-\delta) u + \delta x) \le \ell_t(u) + \frac{1}{\alpha} \ln \frac{1}{1-\delta},$$
that implies
$$\mathbb{E}_{y \sim q}\left[\sum_{t=1}^T \ell_t(y)\right] \le \sum_{t=1}^T \ell_t(u) + \frac{T}{\alpha} \ln \frac{1}{1-\delta} \le \sum_{t=1}^T \ell_t(u) + \frac{T}{\alpha} \cdot \frac{\delta}{1-\delta} = \sum_{t=1}^T \ell_t(u) + \frac{d}{\alpha}.$$
Moreover, the KL term becomes
$$\mathrm{KL}(q \,\|\, \pi) = \ln \frac{\mathrm{vol}(V)}{\mathrm{vol}(V_\delta)} = d \ln \frac{1}{\delta} = d \ln\left(1 + \frac{T}{d}\right).$$
Given that $\pi$ is uniform on $V$, we have that $\mathrm{vol}(V_\delta) = \delta^d \, \mathrm{vol}(V)$ because $V_\delta$ is a scaling and translation of $V$.
Using Theorem 7 and putting all together, we have the stated bound.
2.3. ONS and OSD as instantiations of WAA
In this section, we show that WAA is more general than one might think. In fact, we will show that it can be equivalent to online subgradient descent and to the Online Newton Step.
Here, we will set $V = X = \mathbb{R}^d$ and $\pi$ as the Gaussian distribution $\mathcal{N}(\mu_1, \sigma^2 I)$, where $\mu_1 \in \mathbb{R}^d$ and $\sigma > 0$. We also use a constant learning rate $\eta_t = \eta > 0$.
The losses we use are the quadratic surrogates $\tilde{\ell}_t(x) = \langle g_t, x \rangle + \frac{\beta_t}{2} \|x - \bar{x}_t\|_2^2$, with $g_t \in \mathbb{R}^d$ and $\beta_t \ge 0$. Hence, we have
$$p_t = \mathcal{N}(\mu_t, \sigma_t^2 I),$$
where
$$\frac{1}{\sigma_t^2} = \frac{1}{\sigma^2} + \eta \sum_{i=1}^{t-1} \beta_i \quad \text{and} \quad \mu_t = \sigma_t^2 \left(\frac{\mu_1}{\sigma^2} - \eta \sum_{i=1}^{t-1} \left(g_i - \beta_i \bar{x}_i\right)\right).$$
From the above, we clearly have $\bar{x}_t = \mu_t$.
However, $\mu_t$ can also be written as
$$\mu_t = \mathop{\mathrm{argmin}}_{x \in \mathbb{R}^d} \; \frac{1}{2 \eta \sigma^2} \|x - \mu_1\|_2^2 + \sum_{i=1}^{t-1} \tilde{\ell}_i(x),$$
that is, the solution of FTRL with regularizer $\frac{1}{2 \eta \sigma^2} \|x - \mu_1\|_2^2$ on the surrogate losses $\tilde{\ell}_i$. This implies that $\bar{x}_t$ is equivalent to the output of the FTRL algorithm for a weaker notion of strong convexity. So, using the notation $S_t = \frac{1}{\eta \sigma^2} + \sum_{i=1}^{t} \beta_i$, we can directly use the bound we proved for it:
$$\sum_{t=1}^T \tilde{\ell}_t(\bar{x}_t) - \sum_{t=1}^T \tilde{\ell}_t(u) \le \frac{\|u - \mu_1\|_2^2}{2 \eta \sigma^2} + \frac{1}{2} \sum_{t=1}^T \frac{\|g_t\|_2^2}{S_t}, \quad \forall u \in \mathbb{R}^d.$$
Hence, we have the following cases:
- For exp-concave losses, we recover the regret upper bound of the ONS algorithm.
- For convex losses, we have $\beta_t = 0$ and $\tilde{\ell}_t(x) = \langle g_t, x \rangle$. Hence, we have
$$\sum_{t=1}^T \langle g_t, \bar{x}_t - u \rangle \le \frac{\|u - \mu_1\|_2^2}{2 \eta \sigma^2} + \frac{\eta \sigma^2}{2} \sum_{t=1}^T \|g_t\|_2^2.$$
If in addition $\|g_t\|_2 \le G$ and $\|u - \mu_1\|_2 \le D$, setting $\eta \sigma^2 = \frac{D}{G \sqrt{T}}$, we have
$$\sum_{t=1}^T \langle g_t, \bar{x}_t - u \rangle \le D G \sqrt{T}.$$
- For $\mu$-strongly convex losses with respect to $\|\cdot\|_2$, we have $\ell_t(x) \ge \ell_t(\bar{x}_t) + \langle g_t, x - \bar{x}_t \rangle + \frac{\mu}{2} \|x - \bar{x}_t\|_2^2$. Hence, we can set $\beta_t = \mu$ and $\eta \sigma^2 = \frac{1}{\mu}$, so that $S_t = \mu (t + 1)$, to have
$$\sum_{t=1}^T \left(\ell_t(\bar{x}_t) - \ell_t(u)\right) \le \frac{\mu \|u - \mu_1\|_2^2}{2} + \frac{1}{2} \sum_{t=1}^T \frac{\|g_t\|_2^2}{\mu (t+1)} \le \frac{\mu \|u - \mu_1\|_2^2}{2} + \frac{G^2}{2 \mu} \ln (T + 1).$$
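As a quick numerical check of this equivalence, the following 1-d sketch (all parameters are arbitrary illustrative choices) runs the Gaussian posterior recursion and compares its mean with a grid minimization of the corresponding FTRL objective on the surrogate losses.

```python
import numpy as np

# arbitrary illustration parameters
eta, sigma2, mu1 = 0.5, 2.0, 0.0
g    = [0.8, -0.3, 0.5]      # linear terms g_t of the surrogates
beta = [0.4, 0.2, 0.7]       # curvatures beta_t

# run the recursion: the posterior over x stays Gaussian N(mu_t, s2_t)
mus = []
prec, lin = 1.0 / sigma2, mu1 / sigma2   # precision and linear statistic
for g_t, b_t in zip(g, beta):
    mu_t = lin / prec                    # posterior mean = prediction
    mus.append(mu_t)
    prec += eta * b_t                    # 1/s2 <- 1/s2 + eta * beta_t
    lin  += eta * (b_t * mu_t - g_t)     # accumulate -eta * (g_t - beta_t * mu_t)
waa_next = lin / prec                    # mean after all rounds

# compare with the FTRL minimizer of
#   (x - mu1)^2 / (2 eta sigma2) + sum_i [g_i x + beta_i/2 (x - mu_i)^2]
xs = np.linspace(-5, 5, 200001)
obj = (xs - mu1) ** 2 / (2 * eta * sigma2)
for g_i, b_i, m_i in zip(g, beta, mus):
    obj += g_i * xs + 0.5 * b_i * (xs - m_i) ** 2
ftrl_next = xs[np.argmin(obj)]
```

The two agree up to the grid resolution, as the Gaussian mean is exactly the minimizer of the quadratic FTRL objective.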
In the next theorem, we also show that the bound in Theorem 7 is powerful enough to capture all the cases above.
Theorem 9. Let $\eta > 0$ and run WAA on the losses $\tilde{\ell}_t(x) = \langle g_t, x \rangle + \frac{\beta_t}{2} \|x - \bar{x}_t\|_2^2$, where $g_t \in \mathbb{R}^d$ and $\beta_t \ge 0$ are arbitrary for all $t$. Set $\pi = \mathcal{N}(\mu_1, \sigma^2 I)$, where $\mu_1 \in \mathbb{R}^d$ and $\sigma > 0$. Then, we have
$$\sum_{t=1}^T \tilde{\ell}_t(\bar{x}_t) - \sum_{t=1}^T \tilde{\ell}_t(u) \le \frac{\|u - \mu_1\|_2^2}{2 \eta \sigma^2} + \frac{d}{2 \eta} \ln\left(1 + \eta \sigma^2 \sum_{t=1}^T \beta_t\right) + \frac{1}{2} \sum_{t=1}^T \frac{\|g_t\|_2^2}{S_t}, \quad \forall u \in \mathbb{R}^d,$$
where $S_t = \frac{1}{\eta \sigma^2} + \sum_{i=1}^{t} \beta_i$ for all $t$.
Proof: Select $q = \mathcal{N}(u, \epsilon^2 I)$ for any $u \in \mathbb{R}^d$ and we will specify $\epsilon$ in the following.
Observe that the quadratic nature of the losses allows us to easily go from a stochastic to a deterministic competitor:
$$\langle q, \tilde{\ell}_t \rangle = \mathbb{E}_{x \sim q}\left[\tilde{\ell}_t(x)\right] = \tilde{\ell}_t(u) + \frac{\beta_t \epsilon^2 d}{2}.$$
Hence, we have
$$\sum_{t=1}^T \langle q, \tilde{\ell}_t \rangle = \sum_{t=1}^T \tilde{\ell}_t(u) + \frac{\epsilon^2 d}{2} \sum_{t=1}^T \beta_t.$$
From (1), dropping its last term that is non-positive, we have
$$\sum_{t=1}^T \tilde{\ell}_t(\bar{x}_t) - \sum_{t=1}^T \langle q, \tilde{\ell}_t \rangle \le \frac{\mathrm{KL}(q \,\|\, \pi)}{\eta} + \sum_{t=1}^T \left[\tilde{\ell}_t(\bar{x}_t) + F_t(p_t) - F_{t+1}(p_{t+1})\right].$$
From the update rule, we have
$$\tilde{\ell}_t(\bar{x}_t) + F_t(p_t) - F_{t+1}(p_{t+1}) = \frac{\eta \|g_t\|_2^2}{2 \left(\eta \beta_t + \frac{1}{\sigma_t^2}\right)} - \frac{d}{2 \eta} \ln\left(1 + \eta \beta_t \sigma_t^2\right) \le \frac{\|g_t\|_2^2}{2 S_t},$$
where in the last inequality we used that $\eta \beta_t + \frac{1}{\sigma_t^2} = \eta S_t$ and $\ln\left(1 + \eta \beta_t \sigma_t^2\right) \ge 0$. Finally, we have
$$\mathrm{KL}(q \,\|\, \pi) = \frac{\|u - \mu_1\|_2^2}{2 \sigma^2} + \frac{d}{2} \left(\frac{\epsilon^2}{\sigma^2} - 1 + \ln \frac{\sigma^2}{\epsilon^2}\right).$$
Putting all together, we obtain
$$\sum_{t=1}^T \tilde{\ell}_t(\bar{x}_t) - \sum_{t=1}^T \tilde{\ell}_t(u) \le \frac{\|u - \mu_1\|_2^2}{2 \eta \sigma^2} + \frac{d}{2 \eta} \left(\frac{\epsilon^2}{\sigma^2} - 1 + \ln \frac{\sigma^2}{\epsilon^2}\right) + \frac{\epsilon^2 d}{2} \sum_{t=1}^T \beta_t + \frac{1}{2} \sum_{t=1}^T \frac{\|g_t\|_2^2}{S_t}.$$
We now select $\epsilon^2 = \frac{1}{\eta S_T}$, which minimizes the bound, to have the stated result.
3. Example of WAA: The Krichevsky-Trofimov Betting Algorithm
Here, we show how to derive the Krichevsky-Trofimov betting algorithm and its regret upper bound. We consider the WAA algorithm where we set the losses to $\ell_t(x) = -\ln(1 + c_t x)$, where $c_t \in \{-1, 1\}$, with the feasible set $V = [-1, 1]$. It is immediate to verify that these losses are 1-exp-concave: $\exp(-\ell_t(x)) = 1 + c_t x$ is linear, hence concave. Finally, we set $\eta_t = 1$ and $\pi(x) = \frac{1}{\pi \sqrt{1 - x^2}}$ over $[-1, 1]$.
Given that $c_t \in \{-1, 1\}$, we define $a_t$ and $b_t$ as the number of rounds $i \le t$ in which $c_i = 1$ and $c_i = -1$ respectively, to have
$$\sum_{i=1}^{t} \ell_i(x) = -a_t \ln(1 + x) - b_t \ln(1 - x).$$
With the above formulas we can calculate that
$$\bar{x}_t = \frac{\int_{-1}^{1} x \, (1+x)^{a_{t-1}} (1-x)^{b_{t-1}} \frac{1}{\pi \sqrt{1-x^2}} \, dx}{\int_{-1}^{1} (1+x)^{a_{t-1}} (1-x)^{b_{t-1}} \frac{1}{\pi \sqrt{1-x^2}} \, dx} = \frac{a_{t-1} - b_{t-1}}{t} = \frac{\sum_{i=1}^{t-1} c_i}{t}$$
and
$$\mathbb{E}_{x \sim \pi}\left[\exp\left(-\sum_{t=1}^T \ell_t(x)\right)\right] = \int_{-1}^{1} (1+x)^{a_T} (1-x)^{b_T} \frac{1}{\pi \sqrt{1-x^2}} \, dx = \frac{2^T \, \Gamma\left(a_T + \frac{1}{2}\right) \Gamma\left(b_T + \frac{1}{2}\right)}{\pi \, \Gamma(T+1)}.$$
Using the second result in Theorem 7, we have
$$\sum_{t=1}^T \ell_t(\bar{x}_t) - \min_{x \in [-1,1]} \sum_{t=1}^T \ell_t(x) \le a_T \ln \frac{2 a_T}{T} + b_T \ln \frac{2 b_T}{T} - \ln \frac{2^T \, \Gamma\left(a_T + \frac{1}{2}\right) \Gamma\left(b_T + \frac{1}{2}\right)}{\pi \, \Gamma(T+1)},$$
where we used the fact that the minimizer of the cumulative loss is $x = \frac{a_T - b_T}{T}$.
Using Lemma 13.6 in my draft book, we have that this expression is maximized for $a_T = T$ and $b_T = 0$ (or equivalently $a_T = 0$ and $b_T = T$), so we obtain
$$\sum_{t=1}^T \ell_t(\bar{x}_t) - \min_{x \in [-1,1]} \sum_{t=1}^T \ell_t(x) \le \ln \frac{\sqrt{\pi} \, \Gamma(T+1)}{\Gamma\left(T + \frac{1}{2}\right)} \le \frac{1}{2} \ln T + \ln 2.$$
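We can check numerically that the WAA prediction with this prior indeed reduces to the KT formula: with the substitution $x = \cos\theta$, the prior becomes uniform over $\theta \in [0, \pi]$, so a simple grid average suffices (the grid size is an arbitrary choice).

```python
import numpy as np

def waa_bet(a, b, n=200001):
    """WAA/KT prediction after a rounds with c = +1 and b rounds with c = -1.
    Under x = cos(theta), the prior 1/(pi*sqrt(1-x^2)) becomes the uniform
    distribution on theta in [0, pi], so we average on a theta grid."""
    theta = np.linspace(0.0, np.pi, n)
    x = np.cos(theta)
    w = (1 + x) ** a * (1 - x) ** b      # exp(-cumulative loss)
    return np.mean(x * w) / np.mean(w)

# Krichevsky-Trofimov closed form: (a - b) / (a + b + 1)
checks = [(0, 0), (1, 0), (3, 2), (5, 1)]
max_err = max(abs(waa_bet(a, b) - (a - b) / (a + b + 1)) for a, b in checks)
```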
4. History Bits
The WAA is from Kivinen&Warmuth (1999), as a simplification of the AA of Vovk (1990) (see also Vovk (1998, Appendix A) for an easier description of the AA).
This algorithm is known with many names: Cesa-Bianchi&Lugosi (2006) call it "exponentially weighted mixture forecaster", Hazan et al. (2006), Hazan et al. (2007) rediscover it (see below) and name it "exponentially weighted online optimization algorithm", and Wouter Koolen calls it simply "exponential weights algorithm" in his blog post in 2016 (see below). I preferred to use the name that its designers gave to it, also because its acronym nicely matches the one of the Aggregating Algorithm.
Theorem 5 and Example 1 are from Kivinen&Warmuth (1999).
The observation that the AA also works for infinite sets of experts was made by Freund (1996), Freund (2003).
Theorem 8 is from Hazan et al. (2006), Hazan et al. (2007), where they seem to rediscover the WAA algorithm with uniform prior, but I improved it by adding a term in the logarithm. The proof of Hazan et al. (2006), Hazan et al. (2007) is a generalization of the one of Blum&Kalai (1997), Blum&Kalai (1999) for universal portfolio.
Almost the entire Section 2.3 is from a blog post by Wouter Koolen (the blog of Wouter is really good, full of gems and this is one of them!), where it is done for OGD, while I derived it for FTRL. van der Hoeven&van Erven (2016) have extended this equivalence to Online Mirror Descent. The subtlety of defining the prior on $X$ instead of on $V$ is mine and it allows stating a single theorem that covers all the cases.
Acknowledgments
Thanks to Wei-Cheng Lee for feedback on a preliminary version of this post and to Gemini 2.5 for checking all the proofs.
