We will continue with black-box reductions, this time solving the problem of sleeping experts.
1. Sleeping Experts
Consider now the setting of learning with experts where only a subset of the experts is active in each round. In particular, we have that $s_{t,i}=1$ if expert $i$ is active at time $t$, and $s_{t,i}=0$ if the expert is inactive, that is, sleeping. This setting is useful in the case that some experts might become unavailable in some rounds, but also to model the case that the set of experts is growing over time.
The online algorithm receives at the beginning of each round the information about which experts will be active, and it can pick only among the active experts. So, we constrain the learning with experts algorithm to produce a probability distribution $p_t$ supported only on the active experts, that is, with $p_{t,i}=0$ whenever $s_{t,i}=0$. Hence, we will use a different notion of regret: our regret with respect to expert $i$ is defined as
$$\text{Regret}_T(e_i) := \sum_{t=1}^T s_{t,i} \left(\langle g_t, p_t\rangle - g_{t,i}\right)~.$$
In other words, we measure the regret against expert $i$ only on rounds where the expert was active. This notion can be easily generalized to an arbitrary convex combination $u \in \Delta^{d-1}$ of experts as
$$\text{Regret}_T(u) := \sum_{i=1}^d u_i \sum_{t=1}^T s_{t,i} \left(\langle g_t, p_t\rangle - g_{t,i}\right)~.$$
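To make the definition concrete, here is a minimal Python sketch (the function name and the list-of-lists encoding of $s_t$, $p_t$, $g_t$ are my own, not part of the setting above) that computes the sleeping regret against a single expert:

```python
def sleeping_regret(S, P, G, i):
    """Regret against expert i, counted only on the rounds where it is active:
    sum over t of s_{t,i} * (<g_t, p_t> - g_{t,i}).
    S, P, G are lists of T rows: activity flags, the algorithm's
    distributions, and the linear losses, each of dimension d."""
    total = 0.0
    for s_t, p_t, g_t in zip(S, P, G):
        alg_loss = sum(g * p for g, p in zip(g_t, p_t))  # <g_t, p_t>
        total += s_t[i] * (alg_loss - g_t[i])
    return total
```

Note that deactivating an expert in a round simply removes that round from its regret sum, regardless of what the algorithm played.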
In the case that all the experts are active on all rounds, this notion recovers the usual one. However, in the general case, this is a different notion than the one we used in Online Linear Optimization.
For the reduction, we need a way to transform a vector $x_t \in \mathbb{R}^d_{\geq 0}$ into a vector in the simplex supported on the active experts, and we also need appropriate surrogate losses. The reduction will be similar to the one we used last time. For the first part, we construct a probability distribution as
$$p_{t,i} = \frac{s_{t,i}\, x_{t,i}}{\sum_{j=1}^d s_{t,j}\, x_{t,j}},$$
choosing any distribution over the active experts when the denominator is zero. For the second part, from the original linear losses $g_t$ we construct the modified losses $\tilde{g}_t$, where $\tilde{g}_{t,i} = s_{t,i}\left(g_{t,i} - \langle g_t, p_t\rangle\right)$. The overall algorithm is in Algorithm 1.
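The two pieces of the reduction can be sketched in a few lines of Python (helper names are mine; this is a sketch of the construction above, not Algorithm 1 verbatim):

```python
def active_distribution(x, s):
    """Normalize the OLO iterate x in R^d_{>=0} over the active experts:
    p_{t,i} proportional to s_{t,i} * x_{t,i}."""
    w = [s_i * x_i for s_i, x_i in zip(s, x)]
    z = sum(w)
    if z == 0.0:
        # Degenerate case: any distribution over the active experts works;
        # here we pick the uniform one.
        n_active = sum(s)
        return [s_i / n_active for s_i in s]
    return [w_i / z for w_i in w]

def surrogate_losses(g, p, s):
    """Modified losses: tilde-g_{t,i} = s_{t,i} * (g_{t,i} - <g_t, p_t>)."""
    alg_loss = sum(g_i * p_i for g_i, p_i in zip(g, p))
    return [s_i * (g_i - alg_loss) for s_i, g_i in zip(s, g)]
```

By construction, $\langle \tilde{g}_t, x_t\rangle = 0$ whenever the normalizer is nonzero, which is the property that drives the reduction.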
With the above definitions, we have
$$\langle \tilde{g}_t, x_t\rangle = \sum_{i=1}^d s_{t,i}\, x_{t,i} \left(g_{t,i} - \langle g_t, p_t\rangle\right) = \left(\sum_{j=1}^d s_{t,j}\, x_{t,j}\right)\left(\langle g_t, p_t\rangle - \langle g_t, p_t\rangle\right) = 0~.$$
In turn, this implies that
$$\text{Regret}_T(u) = \sum_{i=1}^d u_i \sum_{t=1}^T s_{t,i}\left(\langle g_t, p_t\rangle - g_{t,i}\right) = -\sum_{t=1}^T \langle \tilde{g}_t, u\rangle = \sum_{t=1}^T \langle \tilde{g}_t, x_t - u\rangle~.$$
In words, we can construct surrogate losses to transform the sleeping expert problem into an OLO problem over $\mathbb{R}^d_{\geq 0}$, obtaining that
$$\text{Regret}_T(u) = \text{Regret}^{\text{OLO}}_T(u)~.$$
The norm of the surrogate losses is controlled because $\|\tilde{g}_t\|_\infty \leq 2\|g_t\|_\infty$.
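As a numerical sanity check of this equality, one can simulate arbitrary rounds and verify that the sleeping-experts regret coincides with the OLO regret on the surrogate losses (a self-contained sketch; all names and the simulation setup are invented here, and the arbitrary positive iterates stand in for any OLO algorithm):

```python
import random

def compare_regrets(T=200, d=4, seed=0):
    """Feed arbitrary positive iterates x_t through the reduction and
    accumulate both the sleeping regret and the OLO regret on the
    surrogate losses; by the identity above they must be equal."""
    rng = random.Random(seed)
    u = [0.1, 0.2, 0.3, 0.4]  # fixed comparator in the simplex, d = 4
    sleeping, olo = 0.0, 0.0
    for _ in range(T):
        s = [1.0 if rng.random() < 0.7 else 0.0 for _ in range(d)]
        x = [rng.random() + 0.01 for _ in range(d)]  # arbitrary iterate
        g = [rng.random() for _ in range(d)]          # arbitrary losses
        z = sum(s_i * x_i for s_i, x_i in zip(s, x))
        if z == 0.0:
            continue  # no active expert in this round
        p = [s_i * x_i / z for s_i, x_i in zip(s, x)]
        alg_loss = sum(g_i * p_i for g_i, p_i in zip(g, p))
        gt = [s_i * (g_i - alg_loss) for s_i, g_i in zip(s, g)]
        sleeping += sum(u_i * s_i * (alg_loss - g_i)
                        for u_i, s_i, g_i in zip(u, s, g))
        olo += sum(gt_i * (x_i - u_i) for gt_i, x_i, u_i in zip(gt, x, u))
    return sleeping, olo
```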
Remark 1. Clearly, if we have an algorithm for learning with experts, we can use it in the reduction because $\Delta^{d-1} \subset \mathbb{R}^d_{\geq 0}$.
Remark 2. The above definitions and reduction generalize to the setting that $s_{t,i} \in [0,1]$, denoting the confidence that the expert has in its prediction, where $s_{t,i}=0$ means that the expert has no confidence and abstains from producing a prediction.
Example 1. Consider the learning with sleeping experts setting with linear losses $g_t$ such that $\|g_t\|_\infty \leq 1$. Let's design a series of reductions to easily solve this problem: this is a good exercise to show how easy it is to combine online learning algorithms and reductions like LEGO blocks.
Consider running a 1d coin-betting algorithm to solve online linear optimization in $\mathbb{R}$. For example, we can use the KT algorithm, where we ignore the rounds where the gradients are zero:
$$x_t = \frac{-\sum_{\tau=1}^{t-1} g_\tau}{N_{t-1}+1}\left(\epsilon - \sum_{\tau=1}^{t-1} g_\tau x_\tau\right), \quad \text{where } N_{t-1} = |\{\tau \leq t-1 : g_\tau \neq 0\}|~.$$
Use a black-box reduction from last time to constrain it to $\mathbb{R}_{\geq 0}$ and obtain a regret of
$$\text{Regret}_T(u) = O\left(\epsilon + u\sqrt{N_T \ln\left(1 + \frac{u N_T}{\epsilon}\right)}\right), \quad \forall u \in \mathbb{R}_{\geq 0},$$
where $N_T$ is the number of times that $g_t \neq 0$. Use it to produce an algorithm over $\mathbb{R}^d_{\geq 0}$, using a 1d algorithm on each coordinate, obtaining
$$\text{Regret}_T(u) = \sum_{i=1}^d \text{Regret}_{T,i}(u_i), \quad \forall u \in \mathbb{R}^d_{\geq 0}~.$$
Finally, set $u = e_i$ and use the sleeping expert reduction from $\mathbb{R}^d_{\geq 0}$:
$$\text{Regret}_T(e_i) = \text{Regret}^{\text{OLO}}_T(e_i)~.$$
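A minimal sketch of the 1d KT variant that ignores zero gradients might look as follows (the function name and interface are my own; the constraint and per-coordinate reductions are omitted):

```python
def kt_skip_zeros(gradients, eps=1.0):
    """1d KT coin-betting iterates, skipping rounds with zero gradient.
    eps is the initial wealth; returns the sequence of predictions x_t."""
    wealth = eps
    grad_sum = 0.0  # sum of past nonzero gradients
    count = 0       # number of past rounds with nonzero gradient
    xs = []
    for g in gradients:
        x = -grad_sum / (count + 1) * wealth  # KT betting fraction times wealth
        xs.append(x)
        if g != 0.0:
            wealth -= g * x   # charge the incurred linear loss to the wealth
            grad_sum += g
            count += 1
    return xs
```

Rounds with $g_t = 0$ leave the internal state untouched, so the predictions on the nonzero rounds match what plain KT would produce on the subsequence of nonzero gradients.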
Moreover, given that $s_{t,i} = 0$ implies $\tilde{g}_{t,i} = 0$, the upper bound on the regret of the final algorithm against any expert $i$ is $O\left(\sqrt{T_i \ln(1+T_i)}\right)$ with $T_i = \sum_{t=1}^T s_{t,i}$, that is, it depends on the number of rounds that the expert $i$ was active, rather than on the total number of rounds.
2. History Bits
The setting of sleeping experts was proposed by Blum (1997) and Freund, Schapire, Singer, and Warmuth (1997). The reduction above is an extension of the one from Gaillard, Stoltz, and Van Erven (2014), which was designed to reduce the sleeping experts problem to the learning with experts problem rather than to OLO in $\mathbb{R}^d_{\geq 0}$. I also added the minor improvement of removing the constant term from the surrogate losses. Such a reduction is implicitly used by Luo and Schapire (2015) and by Jun, Orabona, Wright, and Willett (2017). The variant of KT in Example 1 that updates only when $g_t \neq 0$ appeared for the first time in Chen, Langford, and Orabona (2022).
3. Exercises
Exercise 1. Prove that the variant of KT that does not update on rounds where $g_t = 0$ has the regret stated in Example 1.
