Strongly Adaptive Regret and CBCE

In this post, we introduce yet another way to quantify the ability of online learning algorithms to compete with different comparators, besides the dynamic regret that we saw last time.

1. Strongly Adaptive Regret

We introduce the concept of strongly adaptive regret:

\displaystyle \text{SA-Regret}_{T,\tau}({\boldsymbol u}) :=\max_{[s,s+\tau-1] \subseteq [1,2,\dots, T]} \ \sum_{t=s}^{s+\tau-1} \ell_t({\boldsymbol x}_t) - \sum_{t=s}^{s+\tau-1} \ell_t({\boldsymbol u}), \ \forall {\boldsymbol u} \in \mathcal{V}~.

This definition captures the fact that we want the performance of the algorithm to be good on any interval of length {\tau}. We will say that the algorithm is strongly adaptive if the additional price we pay with respect to learning on any specific interval of length {\tau} is at most polylogarithmic in {T}.
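
To make the definition concrete, here is a minimal Python sketch (the function name and interface are mine, not from any library) that computes {\text{SA-Regret}_{T,\tau}} for a fixed comparator {{\boldsymbol u}}, given the per-round losses of the algorithm and of the comparator:

```python
import numpy as np

def sa_regret(alg_losses, comp_losses, tau):
    """Strongly adaptive regret for a fixed comparator u.

    alg_losses[t]  = loss of the algorithm at round t+1, i.e., ell_t(x_t)
    comp_losses[t] = loss of the comparator at round t+1, i.e., ell_t(u)
    Returns the maximum regret over all intervals of length tau.
    """
    diff = np.asarray(alg_losses, dtype=float) - np.asarray(comp_losses, dtype=float)
    T = len(diff)
    assert 1 <= tau <= T
    # prefix sums: the regret on [s, s+tau-1] becomes an O(1) lookup
    cums = np.concatenate(([0.0], np.cumsum(diff)))
    window_sums = cums[tau:] - cums[:-tau]
    return window_sums.max()
```

The prefix-sum trick makes the maximum over all {T-\tau+1} windows cost {\mathcal{O}(T)} instead of {\mathcal{O}(T\tau)}.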

Remark 1. One might be tempted to remove the dependency on {\tau} and consider

\displaystyle \max_\tau \ \text{SA-Regret}_{T,\tau}({\boldsymbol u})~.

This notion is known as adaptive regret. However, it is clear that {\max_\tau \ \text{SA-Regret}_{T,\tau}({\boldsymbol u})=\Omega(\sqrt{T})} for bounded domains and Lipschitz losses (from the OLO lower bound), which is meaningless for intervals of size {\mathcal{O}(\sqrt{T})}. Hence, the adaptive regret does not allow us to reason about the performance of the algorithm on small intervals.

In the next section, we show how to obtain strongly adaptive algorithms, using once again a combination of online learning algorithms.

1.1. CBCE: A Meta Algorithm for Strongly Adaptive Regret


We now introduce the Coin Betting for Changing Environment (CBCE) algorithm, see Algorithm 1. We use {T} different OCO algorithms, each one starting at a different time step, and we combine them with a sleeping-experts algorithm over a countably infinite number of experts. In particular, {{\boldsymbol x}_t^{(s)}} is the output at time {t} of the OCO algorithm started at time {s}, while {{\boldsymbol x}_t} is the convex combination of {{\boldsymbol x}^{(1)}_t, \dots, {\boldsymbol x}^{(t)}_t} using the probability distribution produced by the sleeping expert algorithm. Using the sleeping expert algorithm in Example 1 in the sleeping expert post immediately gives the following theorem.

Theorem 1. Let {\mathcal{V}\subseteq {\mathbb R}^d} be a non-empty closed convex set. Let {\mathcal{M}} be the algorithm in Example 1 in the sleeping expert post, where {\epsilon_i=\frac{1}{i(i+1)}} for {i=1,2,\dots}, and let {\mathcal{B}} be an online learning algorithm that satisfies {\text{Regret}_t({\boldsymbol u})\leq R_t({\boldsymbol u})} for all {t \in {\mathbb N}}, where {R_t:\mathcal{V}\rightarrow {\mathbb R}}. Let {I=[s,s+\tau-1]\subset [1,2,\dots]}. Then, Algorithm 1 satisfies

\displaystyle \text{Regret}_I({\boldsymbol u}) := \sum_{t\in I} (\ell_t({\boldsymbol x}_t) - \ell_t({\boldsymbol u})) \leq R_{\tau}({\boldsymbol u}) + \mathcal{O}\left(\sqrt{\tau \ln (s^2 \tau)}\right), \ \forall {\boldsymbol u} \in \mathcal{V}~.

Proof: As in the case of ADER, a naive implementation of this algorithm would require querying {t} values of the function {\ell_t} at time {t}. However, we run the algorithm on the linearized losses {\tilde{\ell}_t(\cdot)=\langle {\boldsymbol g}_t, \cdot\rangle}, where {{\boldsymbol g}_t \in \partial \ell_t({\boldsymbol x}_t)}, so that we need only one subgradient per round. Hence, we upper bound the regret using the linearized losses:

\displaystyle \sum_{t=s}^{s+\tau-1} (\ell_t({\boldsymbol x}_t) - \ell_t({\boldsymbol u})) \leq \sum_{t=s}^{s+\tau-1} \langle {\boldsymbol g}_t,{\boldsymbol x}_t- {\boldsymbol u}\rangle = \sum_{t=s}^{s+\tau-1} (\tilde{\ell}_t({\boldsymbol x}_t) - \tilde{\ell}_t({\boldsymbol u}))~.

Next, we decompose the regret into the contributions of {\mathcal{B}} and {\mathcal{M}} as

\displaystyle \sum_{t=s}^{s+\tau-1} (\tilde{\ell}_t({\boldsymbol x}_t) - \tilde{\ell}_t({\boldsymbol u})) = \sum_{t=s}^{s+\tau-1} (\tilde{\ell}_t({\boldsymbol x}_t) - \tilde{\ell}_t({\boldsymbol x}^{(s)}_t)) + \sum_{t=s}^{s+\tau-1}(\tilde{\ell}_t({\boldsymbol x}^{(s)}_t) - \tilde{\ell}_t({\boldsymbol u}) )~.

The first sum can be written as

\displaystyle \sum_{t=s}^{s+\tau-1} (\tilde{\ell}_t({\boldsymbol x}_t) - \tilde{\ell}_t({\boldsymbol x}^{(s)}_t)) = \text{Regret}^\text{sleeping}_{s+\tau-1}({\boldsymbol e}_s),

because the {s}-th copy of the algorithm {\mathcal{B}} has been active only on rounds {s} to {s+\tau-1}. With the choice of {\epsilon_i=\frac{1}{i(i+1)}}, the sleeping expert algorithm in Example 1 in the sleeping expert post achieves a sleeping regret against the {s}-th expert of {\mathcal{O}(\sqrt{{\tau}\ln (s^2 \tau)})}, because the expert {s} has been active for {\tau} rounds. For the second sum, we simply have {R_{\tau}({\boldsymbol u})} because the {s}-th copy of algorithm {\mathcal{B}} was started on round {s}. \Box

Hence, assuming that {R_\tau({\boldsymbol u})=\mathcal{O}(\sqrt{\tau})}, the CBCE algorithm is strongly adaptive because the regret on any interval of length {\tau} depends on {\sqrt{\tau}} and only logarithmically on {T}.

However, while we ask only one subgradient per round, the computational complexity of CBCE in round {t} is proportional to {t}, because we have to update {t} OCO algorithms. So, in the next section, we consider an efficient variant of the CBCE algorithm, whose update is proportional to {\ln t}, with the same strongly adaptive regret guarantee.
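
To make the structure of this naive version and its {\mathcal{O}(t)} per-round cost concrete, here is a schematic Python sketch. It is not the post's Algorithm 1: `make_base`, `meta`, and `subgradient` are hypothetical placeholders standing in for the base OCO algorithm {\mathcal{B}}, the sleeping-experts meta algorithm {\mathcal{M}}, and the subgradient oracle.

```python
def naive_cbce(T, make_base, meta, subgradient):
    # Schematic sketch of the naive CBCE combination (not the post's Algorithm 1).
    # make_base() returns a base OCO algorithm with predict()/update(g) methods;
    # meta has predict(active)/update(losses, active) methods (assumed interfaces).
    bases = []                                  # bases[s-1] is the copy started at round s
    for t in range(1, T + 1):
        bases.append(make_base())               # start a new copy of B at round t
        active = list(range(1, t + 1))          # experts 1, ..., t are awake at round t
        p = meta.predict(active)                # distribution over the awake experts
        xs = [b.predict() for b in bases]       # x_t^{(1)}, ..., x_t^{(t)}
        x_t = sum(p_s * x_s for p_s, x_s in zip(p, xs))   # convex combination
        g_t = subgradient(t, x_t)               # a single subgradient of ell_t at x_t
        lin = [float(g_t @ x_s) for x_s in xs]  # linearized losses <g_t, x_t^{(s)}>
        meta.update(lin, active)                # meta algorithm sees the linearized losses
        for b in bases:                         # O(t) work: every copy is updated
            b.update(g_t)
```

The linearization is visible in the code: `subgradient` is called once per round, even though all {t} copies are updated with it.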

1.2. Efficient Version of CBCE


The previous algorithm has the disadvantage of requiring us to start a new copy of algorithm {\mathcal{B}} in each iteration. Instead, here we will see how to reduce the number of copies to the logarithm of the number of rounds, using geometric coverings.

Figure 1. Geometric covering intervals. Each interval is denoted by [ ].

For {k=0,1,2,\dots}, define the intervals

\displaystyle \mathcal{I}_k :=\{[i 2^k,(i+1)2^k-1] : i \in {\mathbb N}\}~.

That is, each {\mathcal{I}_k} is a partition of {{\mathbb N} \backslash \{1, 2,\dots, 2^k-1\}} into consecutive intervals of length {2^k}. Also, define

\displaystyle \mathcal{I}:=\cup_{k=0, 1, 2, \dots} \mathcal{I}_k

and

\displaystyle ACTIVE(t) := \{J \in \mathcal{I} : t \in J\},

the set of the “active” intervals at time {t}, that is, the intervals that contain {t}. By the definition of {\mathcal{I}_k}, for every {t \leq 2^k-1} no interval in {\mathcal{I}_k} contains {t}, while for every {t \geq 2^k} exactly one interval in {\mathcal{I}_k} contains {t}. Therefore, {|ACTIVE(t)| = \lfloor\log_2 t\rfloor + 1}. We will run a copy of an OCO algorithm for each active interval, which means that the number of algorithms at each time step {t} is logarithmic in {t}. Given an interval {J} of {{\mathbb N}}, we will also use the notation

\displaystyle \mathcal{I}|_J:=\{ I \in \mathcal{I}: I \subseteq J \},

to denote the set of the intervals in {\mathcal{I}} that are contained in {J}.
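
As a sanity check on these definitions, the following small Python snippet (mine, purely for illustration) enumerates {ACTIVE(t)} directly from the definition of {\mathcal{I}_k} and checks that its cardinality is {\lfloor\log_2 t\rfloor + 1}:

```python
import math

def active(t):
    """Return the intervals [a, b] of the geometric covering I that contain round t >= 1."""
    out = []
    k = 0
    while 2 ** k <= t:                  # I_k contains t only when t >= 2^k
        i = t // 2 ** k                 # unique i >= 1 with i*2^k <= t <= (i+1)*2^k - 1
        out.append((i * 2 ** k, (i + 1) * 2 ** k - 1))
        k += 1
    return out

for t in [1, 7, 16, 30]:
    assert len(active(t)) == math.floor(math.log2(t)) + 1
    print(t, active(t))   # e.g., 30 -> [(30, 30), (30, 31), (28, 31), (24, 31), (16, 31)]
```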

The next lemma allows us to decompose the regret over any interval using the regret over intervals in {\mathcal{I}}.

Lemma 2. Let {J = [q, s] \subseteq {\mathbb N}} be an arbitrary interval. Then, the interval {J} can be partitioned into two finite sequences of disjoint and consecutive intervals, denoted {(J_{-k}, \dots , J_0) \subseteq \mathcal{I}|_J} and {(J_1, J_2,\dots, J_p) \subseteq \mathcal{I}|_J}, such that

  • {|J_{-i}|/|J_{-i+1}| \leq 1/2, \forall i \geq 1}.
  • {|J_{i}|/|J_{i-1}| \leq 1/2, \forall i \geq 2}.

Proof: The intuition of the proof is the following: first, we choose {J_0} to be the leftmost among the biggest intervals in {\mathcal{I}|_J}. Then, we show that the remaining intervals can be selected so that they decrease in size going to the left, while on the right {J_1} can have the same size as {J_0} and the others decrease in size.

Denote the size of {J_{i}} by {b_{i}} for {i=-k, \dots, 0, \dots, p} and let {b_0 = \max\{|J'| : J' \in \mathcal{I}|_J\}} be the maximal size of any interval {J' \in \mathcal{I}} that is contained in {J}. Among all the intervals in {\mathcal{I}|_J} of size {b_0}, let {J_0} be the leftmost one, i.e., we define

\displaystyle \begin{aligned} q_0 &:= \mathop{\mathrm{argmin}}\{q': [q',q'+b_0-1]\in \mathcal{I}|_J\},\\ s_0 &:= q_0+b_0-1,\\ J_0 &:=[q_0,s_0]~. \end{aligned}

Starting from {q_0 - 1}, we now define a sequence of intervals (in reverse order), denoted {(J_{-1}, \dots , J_{-k})}, to cover the interval {[q, q_0-1]}, as follows:

\displaystyle \begin{aligned} [q_{-1},s_{-1}] &:=J_{-1} :=\mathop{\mathrm{argmax}}_{J'=[q',s'] \in \mathcal{I}|_{[q,q_0-1]}: s'=q_0-1} |J'|\\ &\vdots\\ [q_{-i},s_{-i}] &:=J_{-i} :=\mathop{\mathrm{argmax}}_{J'=[q',s'] \in \mathcal{I}|_{[q,q_{-i+1}-1]}: s'=q_{-i+1}-1} |J'|\\ &\vdots \end{aligned}

Clearly, this sequence is finite and the left endpoint of the leftmost interval, {J_{-k}}, is {q}. Denote the size of {J_{-i}} by {b_{-i}}. We next prove that for every {i \geq 1}, {b_{-i}/b_{-i+1} \leq 1/2}. Given that the sizes of the intervals are powers of 2, it suffices to show that {b_{-i} < b_{-i+1}} for every {i \geq 1}. We use induction. For the base case, we have {b_{-1} \leq b_0} by the maximality of {b_0}, and {b_{-1} = b_0} would contradict the fact that {J_0} is the leftmost biggest interval, hence {b_{-1} < b_0}. We next assume that the claim holds for every {i \in \{1, \dots , n - 1\}} and prove it for {n}. Assume by contradiction that {b_{-n} \geq b_{-n+1}}. Consider the interval {\hat{J}_{-n+1}} obtained by extending {J_{-n+1}} to its left by an additional size of {b_{-n+1}}, that is, {\hat{J}_{-n+1}:=[q_{-n+1}-b_{-n+1},q_{-n+1}-1]\cup J_{-n+1}}. Hence, {\hat{J}_{-n+1}} is an interval of size {2b_{-n+1}} and it is contained in {J}, because the contradiction assumption {b_{-n} \geq b_{-n+1}} implies that the added part is covered by {J_{-n} \subseteq J}. According to the induction hypothesis, {|\hat{J}_{-n+1}| = 2b_{-n+1} = 2^j b_{-n+2}} for some {j\leq 0}. Given that {\hat{J}_{-n+1}} is adjacent to the interval {J_{-n+2}} and its size is {2^j|J_{-n+2}|} for some {j\leq 0}, from the definition of {\mathcal{I}}, we have that {\hat{J}_{-n+1}\in \mathcal{I}|_{J}}, contradicting the maximality of {J_{-n+1}}.

Similarly, we define a sequence of disjoint and consecutive intervals, denoted {(J_1, \dots , J_p)}, that covers {[s_0 + 1, s]}:

\displaystyle \begin{aligned} [q_{1},s_{1}] &:=J_{1} :=\mathop{\mathrm{argmax}}_{J'=[q',s'] \in \mathcal{I}|_{[s_0+1,s]}: q'=s_0+1} |J'|\\ &\vdots\\ [q_{i},s_{i}] &:=J_{i} :=\mathop{\mathrm{argmax}}_{J'=[q',s'] \in \mathcal{I}|_{[s_{i-1}+1,s]}: q'=s_{i-1}+1} |J'|\\ &\vdots \end{aligned}

Clearly, this sequence is finite and the right endpoint of the rightmost interval, {J_p}, is {s}. Denote the size of {J_i} by {b_i}. We next prove that for every {i \geq 2}, {b_i/b_{i-1} \leq 1/2}, which is equivalent to {b_i< b_{i-1}}.

We prove it again by induction. We assume that {b_i/b_{i-1} \in \{2^j: j\leq 0\}} for every {i \in \{1, \dots, n-1\}} and prove that {b_n < b_{n-1}} for {n\geq 2}. For the base case, from the definition of {b_0} we have that {b_1 \leq b_0}. Now, assume by contradiction that {b_n \geq b_{n-1}}. Then, we can consider the interval {\hat{J}_{n-1}} which is obtained by extending {J_{n-1}} to its right by an additional size of {b_{n-1}}. It follows that {\hat{J}_{n-1}} is an interval of size {2b_{n-1}} and it is contained in {J}. According to the induction hypothesis, {|\hat{J}_{n-1}| = 2b_{n-1} = 2^j b_{n-2}} for some {j\leq 1}. We need to consider the following two cases:

  • Case {j \leq 0} (i.e., {b_{n-1}/b_{n-2} \leq 1/2}). Then, {\hat{J}_{n-1}} is consecutive to {J_{n-2} \in \mathcal{I}} and its size is {2^j |J_{n-2}|}, hence {\hat{J}_{n-1} \in \mathcal{I}|_J}, contradicting the maximality of {J_{n-1}}.
  • Case {j = 1} (i.e., {b_{n-1} = b_{n-2}}). Then, {|\hat{J}_{n-1}| = 2b_{n-2}} and this contradicts the maximality of {J_{n-2}}.

\Box

As an example of the above lemma, consider the interval {J=[1,30]}, which is partitioned into the intervals {J_{-3}=[1,1]}, {J_{-2}=[2,3]}, {J_{-1}=[4,7]}, {J_{0}=[8,15]}, {J_1=[16,23]}, {J_2=[24,27]}, {J_3=[28,29]}, {J_4=[30,30]}.
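
The partition in the example can also be computed programmatically. The following Python sketch (mine, for illustration) uses a greedy left-to-right variant of the construction: at each step it takes the largest interval of {\mathcal{I}} that starts at the current point and still fits inside {[q,s]}. On {J=[1,30]} it reproduces exactly the partition above; I have not checked that it coincides with the proof's construction on every interval.

```python
def geometric_partition(q, s):
    """Greedily partition [q, s] into intervals from the geometric covering I."""
    parts = []
    start = q
    while start <= s:
        best = (start, start)             # [start, start] is always in I_0
        k = 1
        while start % (2 ** k) == 0:      # start is the left endpoint of an interval in I_k
            end = start + 2 ** k - 1
            if end <= s:
                best = (start, end)       # a larger interval of I still fits in [q, s]
            k += 1
        parts.append(best)
        start = best[1] + 1
    return parts

print(geometric_partition(1, 30))
# [(1, 1), (2, 3), (4, 7), (8, 15), (16, 23), (24, 27), (28, 29), (30, 30)]
```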

We will use an expert algorithm over a countable number of experts, hence we also have a prior {{\boldsymbol \pi}} over the infinite-dimensional simplex {\Delta^\infty:=\{ {\boldsymbol x} \in {\mathbb R}^\infty: x_i\geq 0, \sum_{i=1}^\infty x_i=1\}}. For simplicity of notation, we denote by {\pi_I} the elements of {{\boldsymbol \pi} \in \Delta^\infty} associated with the interval {I}.

Theorem 3. Let {\mathcal{V}\subseteq {\mathbb R}^d} be a non-empty closed convex set. Let {\mathcal{M}} be the algorithm in Example 1 in the sleeping expert post, where the copy associated with the interval {J=[i_1,i_2] \in \mathcal{I}} has the parameter {\epsilon_{J}} defined recursively as

\displaystyle \begin{aligned} \epsilon_{J} = \begin{cases} \frac{1}{i_1 (i_1+1) (\lfloor\log_2 i_1\rfloor + 1)}, & J \in \mathcal{I}_0\\ \sum_{I \in \mathcal{I}_{0}|_J} \epsilon_{I}, & J \in \mathcal{I}_k, k\geq 1 \end{cases}~. \end{aligned}

Let {\mathcal{B}} be an online learning algorithm that satisfies {\text{Regret}_t({\boldsymbol u})\leq c t^\alpha} for all {{\boldsymbol u} \in \mathcal{V}} and {t \in {\mathbb N}}, where {\alpha \in (0,1)}. Let {I=[s,s+\tau-1]\subset [1,2,\dots]}. Then, Algorithm 1 satisfies

\displaystyle \text{Regret}_I({\boldsymbol u}) := \sum_{t\in I} (\ell_t({\boldsymbol x}_t) - \ell_t({\boldsymbol u})) \leq 2\frac{2^\alpha-(4 \tau^2)^{-\alpha}}{2^\alpha-1} c \tau^\alpha + \mathcal{O}(\sqrt{\tau \ln (s+\tau)}), \ \forall {\boldsymbol u} \in \mathcal{V}~.

Proof: First of all, we upper bound the regret over the interval {I} with the one over linearized losses:

\displaystyle \sum_{t\in I} (\ell_t({\boldsymbol x}_t) - \ell_t({\boldsymbol u})) \leq \sum_{t\in I} \langle {\boldsymbol g}_t, {\boldsymbol x}_t-{\boldsymbol u} \rangle = \sum_{t\in I} (\tilde{\ell}_t ({\boldsymbol x}_t)- \tilde{\ell}_t ({\boldsymbol u})),

where {\tilde{\ell}_t({\boldsymbol x})=\langle {\boldsymbol g}_t, {\boldsymbol x}\rangle}.

Now, denote by {{\boldsymbol x}^{(J)}_t} the decision from the black-box run {\mathcal{B}_J} associated with the interval {J} at time {t} and by {{\boldsymbol x}_t} the combined decision of the meta algorithm at time {t}. Since the complete algorithm is a combination of a meta algorithm {\mathcal{M}} and several black-box algorithms {\mathcal{B}}, its regret depends on both {\mathcal{M}} and {\mathcal{B}}. We now decompose the two sources of regret additively through the geometric covering.

Let {\mathcal{J}=\{J_{-a},\dots, J_{b}\}} be the partition of {I} obtained from Lemma 2, so that {I=\bigcup_{i=-a}^{b} J_i}. Then, the regret on {I} can be decomposed as follows:

\displaystyle \begin{aligned} \text{Regret}_I({\boldsymbol u}) &= \sum_{t\in I} \left(\ell_t({\boldsymbol x}_t) - \ell_t({\boldsymbol u}) \right) \\ &\leq \sum_{t\in I} \left(\tilde{\ell}_t({\boldsymbol x}_t) - \tilde{\ell}_t({\boldsymbol u}) \right) \\ &= \sum_{i=-a}^{b} \sum_{t\in J_i} \left(\tilde{\ell}_t({\boldsymbol x}_t) - \tilde{\ell}_t({\boldsymbol x}^{(J_i)}_t) + \tilde{\ell}_t({\boldsymbol x}^{(J_i)}_t) - \tilde{\ell}_t({\boldsymbol u}) \right) \\ &= \sum_{i=-a}^{b} \sum_{t\in J_i} \left(\tilde{\ell}_t({\boldsymbol x}_t) - \tilde{\ell}_t({\boldsymbol x}^{(J_i)}_t)\right) + \sum_{i=-a}^{b} \sum_{t\in J_i} \left(\tilde{\ell}_t({\boldsymbol x}^{(J_i)}_t) - \tilde{\ell}_t({\boldsymbol u}) \right) ~. & (1) \\\end{aligned}

The black-box regret on {J_i \in \mathcal{J}} is exactly the regret of the black-box algorithm on {|J_i|} rounds, since the black-box run {\mathcal{B}_{J_i}} was started at the beginning of the interval {J_i}. In particular, we have

\displaystyle \begin{aligned} \sum_{i=-a}^b \sum_{t\in J_i} (\tilde{\ell}_t({\boldsymbol x}^{(J_i)}_t)-\tilde{\ell}_t({\boldsymbol u})) &= \sum_{i=-a}^b \sum_{t\in J_i} \langle{\boldsymbol g}_t,{\boldsymbol x}^{(J_i)}_t-{\boldsymbol u}\rangle \leq \sum_{i=-a}^b c |J_i|^\alpha \leq 2 c \sum_{k=0}^{a+b+1} (2^{-k} |I|)^\alpha\\ &= 2 c |I|^\alpha \frac{2^\alpha-2^{-\alpha (a+b+1)}}{2^\alpha-1} \leq 2 c |I|^\alpha \frac{2^\alpha-(4 |I|^2)^{-\alpha}}{2^\alpha-1}, \end{aligned}

where the second inequality is due to Lemma 2 and in the last one we used that the number of geometric cover intervals is less than {2(\lfloor \log_2 |I|\rfloor+1)} (again from Lemma 2) and {2^{-2(\lfloor \log_2 |I|\rfloor+1)}\geq \frac{1}{4|I|^2}}.
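
For the reader's convenience, the equality in the chain above is just a truncated geometric series with ratio {2^{-\alpha}}:

\displaystyle \sum_{k=0}^{a+b+1} 2^{-k\alpha} = \frac{1-2^{-\alpha(a+b+2)}}{1-2^{-\alpha}} = \frac{2^\alpha-2^{-\alpha(a+b+1)}}{2^\alpha-1}~.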

Now, we show that {\mathcal{M}} has low regret on each interval {J_i \in \mathcal{J}}, considering separately the regret on {J_i} for {i\in [-a,0]} and for {i \in [1,b]}. The critical observation is that

\displaystyle \sum_{i=-a}^{0} \sum_{t\in J_i} (\tilde{\ell}_t({\boldsymbol x}_t)-\tilde{\ell}_t({\boldsymbol x}^{(J_i)}_t)) =\sum_{i=-a}^{0} \text{Regret}^\text{sleeping}_{s+\tau-1}({\boldsymbol e}_{J_i}),

where the equality holds because the expert associated to {J_i} is active only for the rounds in {J_i}. Using the guarantee on the sleeping regret in Example 1 in the sleeping expert post, we have

\displaystyle \begin{aligned} \sum_{i=-a}^{0} \text{Regret}^\text{sleeping}_{s+\tau-1}({\boldsymbol e}_{J_i}) &\leq \sum_{i=-a}^{0} \mathcal{O}\left(\sqrt{|J_i| \ln \frac{|J_i|}{\epsilon_{J_i}}}\right) \leq \sum_{i=-a}^{0} \mathcal{O}\left(\sqrt{2^{i} \tau \ln (s+\tau)}\right) \leq \sum_{k=0}^\infty \mathcal{O}\left(\sqrt{2^{-k} \tau \ln (s+\tau) }\right)\\ &= \mathcal{O}\left(\sqrt{\tau \ln (s+\tau) }\right), \end{aligned}

where the second inequality is due to Lemma 2. Analogously, we have

\displaystyle \sum_{i=1}^b \text{Regret}^\text{sleeping}_{s+\tau-1}({\boldsymbol e}_{J_i}) = \mathcal{O}(\sqrt{\tau \ln (s+\tau) })~.

Putting everything together, we have the stated bound. \Box

Remark 2. Observe that the limit for {\alpha \rightarrow 0} of the first term of the regret upper bound is {2 c \log_2(8\tau^2)}. However, the second term is always of order {\sqrt{\tau}}.
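
To see this, write {2^\alpha = 1+\alpha \ln 2 + \mathcal{O}(\alpha^2)} and {(4\tau^2)^{-\alpha} = 1-\alpha \ln (4\tau^2) + \mathcal{O}(\alpha^2)}, so that

\displaystyle \lim_{\alpha\rightarrow 0} \ 2 c \tau^\alpha \frac{2^\alpha-(4\tau^2)^{-\alpha}}{2^\alpha-1} = 2 c \lim_{\alpha\rightarrow 0} \frac{\alpha \ln(8\tau^2)+\mathcal{O}(\alpha^2)}{\alpha \ln 2+\mathcal{O}(\alpha^2)} = 2 c \frac{\ln(8\tau^2)}{\ln 2} = 2 c \log_2(8\tau^2)~.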

From the above theorem, we immediately get a guarantee on the strongly adaptive regret by observing that

\displaystyle \text{SA-Regret}_{T,\tau}({\boldsymbol u}) = \max_{I=[s,s+\tau-1] \subseteq [1,2,\dots, T]} \ \text{Regret}_I ({\boldsymbol u})~.

2. History Bits

The adaptive regret was defined in Hazan&Seshadhri (2007), Hazan&Seshadhri (2009). The strongly adaptive regret was defined by Adamskiy et al. (2012), Adamskiy et al. (2016) (still calling it “adaptive regret”) and then reinvented in Daniely et al. (2015) (which coined the name). Note that the strongly adaptive regret is usually defined for bounded domains, taking the maximum with respect to {{\boldsymbol u} \in \mathcal{V}}; instead, I defined it more generally for any competitor in the feasible set. Lemma 2 is from Daniely et al. (2015) as well. The efficient version of the CBCE algorithm was proposed by Jun et al. (2017), Jun et al. (2017), using a stronger version of the sleeping expert algorithm, still based on coin betting. I simplified the sleeping expert algorithm for didactical reasons. CBCE improves over the guarantee in Daniely et al. (2015) by improving the logarithmic term. The ideas of using linearized losses in CBCE to avoid querying more than one subgradient per round and the better behaviour of the first term in the bound when {\alpha \rightarrow 0} are new. The non-efficient version of CBCE is also new, but straightforward.

Acknowledgements

Thanks to Wei-Cheng Lee and Yulian Wu for feedback on this post.
