In this post, we will see how to extend the notion of regret to a sequence of comparators instead of a single one. We saw that the definition of regret makes sense as a direct generalization of both the stochastic setting and the offline optimization. However, in some cases, we know that the environment is changing over time, so we would like an algorithm that guarantees a stronger notion of regret that captures the dynamics of the environment.
Our first extension is to use multiple comparators, through the concept of dynamic regret, defined as

$$\text{Regret}_T(u_1, \dots, u_T) := \sum_{t=1}^T \ell_t(x_t) - \sum_{t=1}^T \ell_t(u_t),$$

where $u_1, \dots, u_T \in V$ and $V \subseteq \mathbb{R}^d$ is the feasible set. For additional clarity, we will also refer to the standard notion of regret as the static regret.
Is it possible to obtain sublinear dynamic regret? We already know that it is possible in the case that $u_1 = \dots = u_T$, because we recover the standard regret case. On the other hand, it should be intuitive that the problem becomes more and more difficult the more the sequence of comparators $u_1, \dots, u_T$ changes over time. There are various ways to quantify this complexity and a common one is the path length, defined as

$$P_T := \sum_{t=2}^T d(u_t, u_{t-1}),$$

where $d$ is a function that measures the shift from $u_{t-1}$ to $u_t$. In particular, we will instantiate $d$ to be a norm. In the following, we will show how to design an online algorithm whose dynamic regret depends on the path length, in the case that the feasible set is bounded.
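To make these definitions concrete, here is a minimal Python sketch (the function and variable names are just for illustration, not part of the original text) that computes the path length, with $d$ instantiated as the L2 norm, and the dynamic regret of a sequence of predictions against a sequence of comparators.

```python
import numpy as np

def path_length(comparators):
    """Path length P_T = sum_{t=2}^T ||u_t - u_{t-1}||_2 of a comparator sequence."""
    diffs = np.diff(np.asarray(comparators), axis=0)   # u_t - u_{t-1}, for t = 2, ..., T
    return np.linalg.norm(diffs, axis=1).sum()

def dynamic_regret(losses, predictions, comparators):
    """Dynamic regret: sum_t loss_t(x_t) - sum_t loss_t(u_t).

    losses: list of callables, losses[t](x) is the loss at round t
    predictions: array of shape (T, d) with the algorithm's iterates x_t
    comparators: array of shape (T, d) with the comparators u_t
    """
    return sum(f(x) - f(u) for f, x, u in zip(losses, predictions, comparators))
```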
1. Dynamic Regret of Online Mirror Descent
It turns out that some online learning algorithms already satisfy a dynamic regret guarantee without any additional change. For Online Mirror Descent (OMD), we can state the following theorem.
Theorem 1. Let $B_\psi$ be the Bregman divergence w.r.t. $\psi: X \to \mathbb{R}$ and assume $\psi$ to be closed and $\lambda$-strongly convex with respect to $\|\cdot\|$ in $X$. Let $V \subseteq X$ be a non-empty closed convex set. Assume that $\|\nabla \psi(x) - \nabla \psi(y)\|_\star \le M$ for all $x, y \in V$. Then, for any $u_1, \dots, u_T \in V$, OMD with constant learning rate $\eta > 0$ satisfies

$$\sum_{t=1}^T \ell_t(x_t) - \sum_{t=1}^T \ell_t(u_t) \le \frac{B_\psi(u_T; x_1) + M P_T}{\eta} + \frac{\eta}{2\lambda}\sum_{t=1}^T \|g_t\|_\star^2,$$

where $g_t \in \partial \ell_t(x_t)$ and $P_T = \sum_{t=2}^T \|u_t - u_{t-1}\|$.
Proof: From the one-step inequality for OMD with competitor $u_t$, we have

$$\eta \langle g_t, x_t - u_t\rangle \le B_\psi(u_t; x_t) - B_\psi(u_t; x_{t+1}) + \frac{\eta^2}{2\lambda}\|g_t\|_\star^2~.$$

Dividing by $\eta$ and summing over time, we have

$$\sum_{t=1}^T \langle g_t, x_t - u_t\rangle \le \frac{1}{\eta}\sum_{t=1}^T \left(B_\psi(u_t; x_t) - B_\psi(u_t; x_{t+1})\right) + \frac{\eta}{2\lambda}\sum_{t=1}^T \|g_t\|_\star^2~.$$

Now observe that

$$\sum_{t=1}^T \left(B_\psi(u_t; x_t) - B_\psi(u_t; x_{t+1})\right) = B_\psi(u_1; x_1) - B_\psi(u_T; x_{T+1}) + \sum_{t=2}^T \left(B_\psi(u_t; x_t) - B_\psi(u_{t-1}; x_t)\right)~.$$

Hence, from the definition of the Bregman divergence, we have

$$B_\psi(u_t; x_t) - B_\psi(u_{t-1}; x_t) = \psi(u_t) - \psi(u_{t-1}) - \langle \nabla \psi(x_t), u_t - u_{t-1}\rangle~.$$

Putting it all together, adding and subtracting $\langle \nabla \psi(x_1), u_t - u_{t-1}\rangle$ in each term and telescoping, we have

$$\sum_{t=1}^T \left(B_\psi(u_t; x_t) - B_\psi(u_t; x_{t+1})\right) \le B_\psi(u_T; x_1) - \sum_{t=2}^T \langle \nabla \psi(x_t) - \nabla \psi(x_1), u_t - u_{t-1}\rangle~.$$

Assuming that $\|\nabla \psi(x_t) - \nabla \psi(x_1)\|_\star \le M$ and using the definition of dual norm and the convexity of the losses, we have the stated bound. $\square$
Assuming that $B_\psi(u_T; x_1) \le D^2$ and $\|g_t\|_\star \le L$ for all $t$, and selecting the usual learning rate $\eta = \frac{D\sqrt{2\lambda}}{L\sqrt{T}}$, we have

$$\sum_{t=1}^T \ell_t(x_t) - \sum_{t=1}^T \ell_t(u_t) \le \left(2D + \frac{M P_T}{D}\right) L\sqrt{\frac{T}{2\lambda}}~.$$

In other words, we suffer an additional regret of $O\left(\frac{M P_T L}{D}\sqrt{T/\lambda}\right)$ compared to the static case, for any $u_1, \dots, u_T \in V$.
Example 1. Consider the case that the feasible set $V$ has diameter $D$ with respect to the L2 norm, i.e., $\max_{x, y \in V} \|x - y\|_2 \le D$. Set $\psi(x) = \frac{1}{2}\|x\|_2^2$ and assume the subgradients to satisfy $\|g_t\|_2 \le L$ for $t = 1, \dots, T$. In this case, we have that $\lambda = 1$, $B_\psi(u_T; x_1) = \frac{1}{2}\|u_T - x_1\|_2^2 \le \frac{D^2}{2}$, and $\|\nabla \psi(x) - \nabla \psi(y)\|_2 = \|x - y\|_2 \le D$, and setting $\eta = \frac{D}{L\sqrt{T}}$ we have

$$\sum_{t=1}^T \ell_t(x_t) - \sum_{t=1}^T \ell_t(u_t) \le (D + P_T)\, L\sqrt{T}~.$$
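As a sanity check for this example, here is a minimal sketch of projected online subgradient descent on an L2 ball of diameter $D$ with the constant learning rate $\eta = D/(L\sqrt{T})$ used above; the subgradient oracle and the choice of starting from the center of the ball are my own illustrative assumptions, not part of the statement.

```python
import numpy as np

def project_l2_ball(x, radius):
    """Euclidean projection onto the ball {y : ||y||_2 <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def ogd_constant_lr(subgradient_oracle, d, D, L, T):
    """Projected OGD on an L2 ball of diameter D, with eta = D / (L * sqrt(T)).

    subgradient_oracle(t, x) should return a subgradient of the round-t loss at x.
    Returns the iterates x_1, ..., x_T.
    """
    eta = D / (L * np.sqrt(T))
    x = np.zeros(d)                                   # x_1: center of the ball
    iterates = []
    for t in range(T):
        iterates.append(x.copy())
        g = subgradient_oracle(t, x)
        x = project_l2_ball(x - eta * g, D / 2)       # radius D/2 gives diameter D
    return iterates
```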
Could we obtain a better regret guarantee? We could set the learning rate to $\eta = \frac{\sqrt{D^2 + 2 D P_T}}{L\sqrt{T}}$ to obtain the dynamic regret

$$\sum_{t=1}^T \ell_t(x_t) - \sum_{t=1}^T \ell_t(u_t) \le L\sqrt{T(D^2 + 2 D P_T)}~.$$

However, assuming the knowledge of the exact value of $P_T$ is unreasonable, because it violates the adversarial nature of the problem. This mirrors the problem of tuning the learning rate with the knowledge of $\|u\|_2$ in online gradient descent. Fortunately, there is a solution and we will see it in the next section.
2. ADER: Optimal Dependency on the Path Length for Online Subgradient Descent
In the previous section, we saw that the algorithm needs to know the path length of the comparators to optimally tune its learning rate. Here, we show how to construct an online learning algorithm that achieves the same guarantee, up to polylogarithmic terms, without this knowledge. For simplicity, we will consider the Euclidean case, but it is easy to extend it to the Bregman case. Moreover, we assume the losses to be $L$-Lipschitz w.r.t. the L2 norm.
We will use a classic online learning technique: running in parallel many projected Online Gradient Descent (OGD) algorithms with different learning rates and using an Exponentiated Gradient (EG) algorithm to learn online the best combination of their iterates. In this way, we will show that the cumulative loss of the resulting algorithm is close to the cumulative loss of the OGD copy with the best learning rate, which in turn gives us the right dependency on the path length.
Consider a feasible set $V$ with bounded L2 diameter $D$ and assume the losses $\ell_t$ to be $L$-Lipschitz. Using projected OGD with learning rate $\eta$, we obtain the following dynamic regret upper bound

$$\sum_{t=1}^T \ell_t(x_t) - \sum_{t=1}^T \ell_t(u_t) \le \frac{D^2 + 2 D P_T}{2\eta} + \frac{\eta L^2 T}{2}~.$$

Using the fact that $0 \le P_T \le D T$, the optimal choice of $\eta$ satisfies

$$\frac{D}{L\sqrt{T}} \le \eta^\star = \frac{\sqrt{D^2 + 2 D P_T}}{L\sqrt{T}} \le \frac{D\sqrt{1 + 2T}}{L\sqrt{T}}~.$$
So, consider a grid of learning rates $\eta_i = \frac{2^{i-1} D}{L\sqrt{T}}$ for $i = 1, \dots, N$ with $N = \left\lceil \frac{1}{2}\log_2(1 + 2T)\right\rceil + 1$, so that $\eta_1 \le \eta^\star$ and $\eta_N \ge \eta^\star$. This implies that there exists $i^\star \in \{1, \dots, N\}$ such that

$$\eta_{i^\star} \le \eta^\star \le 2\,\eta_{i^\star}~.$$
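For concreteness, this is one way to construct such a grid in Python, using the constants chosen above; it is only an illustration of the doubling grid, not a prescription.

```python
import numpy as np

def learning_rate_grid(D, L, T):
    """Exponential grid eta_i = 2**(i-1) * D / (L * sqrt(T)), i = 1, ..., N,
    with N chosen so that the (unknown) optimal learning rate lies within a
    factor of 2 of some grid point."""
    N = int(np.ceil(0.5 * np.log2(1 + 2 * T))) + 1
    return [2 ** (i - 1) * D / (L * np.sqrt(T)) for i in range(1, N + 1)]
```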
To combine the OGD algorithms with different learning rates, we use the EG algorithm, where the OGD algorithms are our experts. We construct the loss vector for EG as $\hat{\ell}_t = (\ell_t(x_{t,1}), \dots, \ell_t(x_{t,N}))$, where $x_{t,i}$ is the prediction at time $t$ of the OGD algorithm with learning rate $\eta_i$. The EG algorithm is invariant to additive constants to the coordinates of the loss vector, hence we can consider it as if it were running on the losses $\ell_t(x_{t,i}) - \ell_t(x_{t,1})$, that are bounded in absolute value by $L D$. Hence, using a uniform prior and learning rate $\eta_{EG} = \frac{1}{LD}\sqrt{\frac{2\ln N}{T}}$, we obtain the regret

$$\sum_{t=1}^T \langle p_t, \hat{\ell}_t\rangle - \min_{i} \sum_{t=1}^T \ell_t(x_{t,i}) \le L D\sqrt{2 T \ln N}~.$$
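The following minimal sketch shows one EG/Hedge step over the $N$ experts; shifting by the minimum coordinate is just one convenient way to exploit the invariance to additive constants mentioned above, and the names are illustrative.

```python
import numpy as np

def eg_update(weights, expert_losses, eta_eg):
    """One EG/Hedge step over the N experts (the OGD copies).

    weights: current probability vector over the experts
    expert_losses: vector with entries loss_t(x_{t,i}), i = 1, ..., N
    """
    shifted = expert_losses - expert_losses.min()   # additive shifts do not change the update
    new_w = weights * np.exp(-eta_eg * shifted)
    return new_w / new_w.sum()
```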
Now observe that by Jensen's inequality and the convexity of $\ell_t$, we have

$$\ell_t(x_t) = \ell_t\left(\sum_{i=1}^N p_{t,i}\, x_{t,i}\right) \le \sum_{i=1}^N p_{t,i}\, \ell_t(x_{t,i}) = \langle p_t, \hat{\ell}_t\rangle~.$$

This motivates the choice of predicting with a convex combination of the predictions of the OGD algorithms, $x_t = \sum_{i=1}^N p_{t,i}\, x_{t,i}$. Moreover, we have $\min_i \sum_{t=1}^T \ell_t(x_{t,i}) \le \sum_{t=1}^T \ell_t(x_{t,i^\star})$. From the regret of OGD, the choice of $\eta_{i^\star}$, and the fact that $\eta_{i^\star} \le \eta^\star \le 2\eta_{i^\star}$, we have

$$\sum_{t=1}^T \ell_t(x_{t,i^\star}) - \sum_{t=1}^T \ell_t(u_t) \le \frac{D^2 + 2 D P_T}{2\eta_{i^\star}} + \frac{\eta_{i^\star} L^2 T}{2} \le \frac{D^2 + 2 D P_T}{\eta^\star} + \frac{\eta^\star L^2 T}{2} \le \frac{3}{2}\, L\sqrt{T(D^2 + 2 D P_T)}~.$$

Putting it all together, we have

$$\sum_{t=1}^T \ell_t(x_t) - \sum_{t=1}^T \ell_t(u_t) \le \sum_{t=1}^T \langle p_t, \hat{\ell}_t\rangle - \sum_{t=1}^T \ell_t(x_{t,i^\star}) + \sum_{t=1}^T \ell_t(x_{t,i^\star}) - \sum_{t=1}^T \ell_t(u_t) \le L D\sqrt{2 T \ln N} + \frac{3}{2}\, L\sqrt{T(D^2 + 2 D P_T)}~.$$
We are still not completely done: the construction above queries $N$ subgradients in each step, so now we show how to reduce it to only one subgradient per iteration. This is easily achieved: it is enough to run the construction on the convex surrogate losses $\tilde{\ell}_t(x) = \langle g_t, x\rangle$, where $g_t \in \partial \ell_t(x_t)$. The advantage is that the subgradient of $\tilde{\ell}_t$ is equal to $g_t$ for all $x \in V$, hence all the OGD algorithms receive the same subgradient! Once again, we have that the surrogate losses are $L$-Lipschitz, because $\|g_t\|_2 \le L$. Moreover, we have

$$\sum_{t=1}^T \left(\tilde{\ell}_t(x_t) - \tilde{\ell}_t(u_t)\right) = \sum_{t=1}^T \langle g_t, x_t - u_t\rangle \ge \sum_{t=1}^T \left(\ell_t(x_t) - \ell_t(u_t)\right)~.$$

Hence, a dynamic regret guarantee on the surrogate losses $\tilde{\ell}_t$ immediately translates to a dynamic regret guarantee on $\ell_t$. The resulting algorithm is called Adaptive learning for Dynamic EnviRonment (ADER) and it is summarized in Algorithm 1. Formally, we have the following theorem.
Theorem 2. Let $V \subset \mathbb{R}^d$ be a non-empty closed convex set with bounded diameter with respect to the L2 norm equal to $D$. Assume the losses $\ell_t$ to be convex functions, subdifferentiable in $V$ and $L$-Lipschitz with respect to the L2 norm. Then, for any $u_1, \dots, u_T \in V$, Algorithm 1 satisfies

$$\sum_{t=1}^T \ell_t(x_t) - \sum_{t=1}^T \ell_t(u_t) \le L D\sqrt{2 T \ln N} + \frac{3}{2}\, L\sqrt{T(D^2 + 2 D P_T)},$$

where $P_T = \sum_{t=2}^T \|u_t - u_{t-1}\|_2$ and $N = \left\lceil \frac{1}{2}\log_2(1 + 2T)\right\rceil + 1$.
Note that while we query only one subgradient per round, the computational complexity of ADER per round still grows as $O(\ln T)$ when $T$ goes to infinity. This is because in each round we need to update $N = O(\ln T)$ different OGD algorithms.
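To summarize the whole construction, here is a compact Python sketch in the spirit of the algorithm described above; it is not a verbatim transcription of Algorithm 1, and the EG step size and the uniform prior follow the illustrative choices made earlier in this post. It reuses project_l2_ball and learning_rate_grid from the earlier sketches.

```python
import numpy as np

def ader_sketch(subgradient_oracle, d, D, L, T):
    """Grid of projected-OGD experts combined with EG, one subgradient per round.

    subgradient_oracle(t, x) should return g_t, a subgradient of the round-t loss at x.
    Returns the predictions x_1, ..., x_T.
    """
    etas = learning_rate_grid(D, L, T)                 # grid of OGD learning rates
    N = len(etas)
    eta_eg = np.sqrt(2.0 * np.log(N) / T) / (L * D)    # EG step size for losses bounded by L*D
    experts = np.zeros((N, d))                         # iterates of the N OGD copies
    weights = np.full(N, 1.0 / N)                      # uniform prior over the experts
    predictions = []
    for t in range(T):
        x = weights @ experts                          # prediction: convex combination
        predictions.append(x)
        g = subgradient_oracle(t, x)                   # the only subgradient query of the round
        surrogate = experts @ g                        # surrogate losses <g_t, x_{t,i}>
        shifted = surrogate - surrogate.min()          # EG is invariant to additive shifts
        weights = weights * np.exp(-eta_eg * shifted)
        weights /= weights.sum()
        for i in range(N):                             # every expert steps on the same g_t
            experts[i] = project_l2_ball(experts[i] - etas[i] * g, D / 2)
    return predictions
```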
3. History Bits
Herbster&Warmuth (1998) introduced the idea of tracking the best expert in the learning with experts game. In this setting, the best expert is allowed to change at most $k$ times. Zinkevich (2003) is often erroneously credited with introducing the dynamic regret, while it first appeared in Herbster&Warmuth (1998), Herbster&Warmuth (2001), with upper bounds that depend on the drift of the comparator sequence, a quantity analogous to the path length $P_T$.
Theorem 1 is a generalization of Cesa-Bianchi&Lugosi (2006, Theorem 11.4) to arbitrary distance generating functions and to Lipschitz losses. Also, following Jacobsen&Cutkosky (2022, Lemma 4), I have fixed the offset in the proof so that we do not need an additional assumption on $\psi$.
Zhang et al. (2018) designed the ADER algorithm and they also proved that it is optimal in bounded domains. The idea of combining online algorithms with different learning rates comes directly from the MetaGrad algorithm (van Erven&Koolen, 2016, van Erven et al., 2021) which also showed how to query a single gradient per round. In turn, MetaGrad is based on prior work using a grid of learning rates in EG (Koolen et al., 2014). By now, this is a well-known method that allows us to solve essentially all problems of tuning learning rates in bounded domains, at least theoretically. One can also combine different learning rates in unbounded domains, with a multi-scale expert algorithm (Foster et al., 2017, Cutkosky&Orabona, 2018). This method can be considered a better “doubling trick” because it allows tracking non-monotonic quantities at the price of a logarithmic computational overhead. It is worth also stressing that the general idea of combining the outputs of different online learning algorithms through another online learning algorithm is instead much older, and it goes back at least to Blum&Mansour (2005), Blum&Mansour (2007).
I described a slightly simpler version of ADER with a flat prior, see Exercise 1 for the original bound. I also removed an unnecessary assumption present in the original analysis. Recently, Zhao et al. (2020), Zhao et al. (2024) improved the guarantees of ADER using optimistic algorithms, obtaining smaller bounds in easy environments.
Acknowledgements
Thanks to Nicolò Campolongo for telling me about the papers by Blum and Mansour.
4. Exercises
Exercise 1. Change the prior in the EG algorithm to obtain a dependency on the index $i^\star$ of the best learning rate in the grid instead of on $N$. In this way, if the path length is zero, that is, when we are in the static case, the regret is $O(L D\sqrt{T})$ instead of $O(L D\sqrt{T \ln \ln T})$.
