Parameter-free Online Learning III: from Coins to Experts

This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week.

You can find all the lectures I have published so far here.


In the last lecture, we showed a very simple parameter-free algorithm for Online Convex Optimization (OCO) in {d} dimensions, based on a reduction to a coin-betting problem. Now, we will see how to reduce Learning with Expert Advice (LEA) to betting on {d} coins, again obtaining an optimal, parameter-free algorithm.

1. Reduction to Learning with Experts

First, recall that the regret bound we obtained from Online Mirror Descent (OMD), and similarly from Follow-The-Regularized-Leader (FTRL), is

\displaystyle \text{Regret}_T({\boldsymbol u}) \leq O\left(\frac{KL({\boldsymbol u};{\boldsymbol \pi})}{\eta} + \eta T \right),

where {{\boldsymbol \pi}} is the prior distribution on the experts and {KL(\cdot; \cdot)} is the KL-divergence. As we reasoned in the OCO case, in order to set the learning rate optimally we would need to know the value of {KL({\boldsymbol u};{\boldsymbol \pi})} in advance. If we could set {\eta=\sqrt{\frac{KL({\boldsymbol u};{\boldsymbol \pi})}{T}}}, we would obtain a regret of {O(\sqrt{T\, KL({\boldsymbol u}; {\boldsymbol \pi})})}. However, given the adversarial nature of the game, this is impossible. So, as we did in the OCO case, we will show that this problem too can be reduced to betting on a coin, obtaining optimal guarantees with a parameter-free algorithm.

First, let’s introduce some notation. Let {d \ge 2} be the number of experts and {\Delta^d} be the {d}-dimensional probability simplex. Let {{\boldsymbol \pi} = [\pi_1, \pi_2, \dots, \pi_d] \in \Delta^d} be any prior distribution. Let {\mathcal{A}} be a coin-betting algorithm. We will instantiate {d} copies of {\mathcal{A}}.

Consider any round {t}. Let {x_{t,i} \in {\mathbb R}} be the bet of the {i}-th copy of {\mathcal{A}}. The LEA algorithm computes {\widehat {\boldsymbol p}_t = [\widehat p_{t,1}, \widehat p_{t,2}, \dots, \widehat p_{t,d}] \in {\mathbb R}_{+}^d} as

\displaystyle \label{eq:phat} \widehat p_{t,i} = \pi_i \cdot \max(x_{t,i},0)~. \ \ \ \ \ (1)

Then, the LEA algorithm predicts {{\boldsymbol p}_t = [p_{t,1}, p_{t,2}, \dots, p_{t,d}] \in \Delta^d} as

\displaystyle \label{eq:preds_experts} {\boldsymbol p}_t = \begin{cases} \tfrac{\widehat {\boldsymbol p}_t}{\|\widehat {\boldsymbol p}_t\|_1}, & \text{ if } \widehat {\boldsymbol p}_t\neq \boldsymbol{0}, \\ {\boldsymbol \pi}, & \text{ otherwise}~. \end{cases} \ \ \ \ \ (2)

Then, the algorithm receives the reward vector {{\boldsymbol g}_t = [g_{t,1}, g_{t,2}, \dots, g_{t,d}] \in [0,1]^d}. Finally, it feeds the reward to each copy of {\mathcal{A}}. The reward for the {i}-th copy of {\mathcal{A}} is {c_{t,i} \in [-1,1]} defined as

\displaystyle \label{eq:gradients_experts_reduction} c_{t,i} = \begin{cases} \langle {\boldsymbol g}_t, {\boldsymbol p}_t \rangle - g_{t,i} & \text{if } x_{t,i} > 0, \\ \max\left(\langle {\boldsymbol g}_t, {\boldsymbol p}_t \rangle - g_{t,i} , 0\right) & \text{if } x_{t,i} \le 0~. \end{cases} \ \ \ \ \ (3)

The construction above gives a LEA algorithm, defined by the predictions {{\boldsymbol p}_t}, built on top of the coin-betting algorithm {\mathcal{A}}.
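To make the reduction concrete, here is a minimal Python sketch of one round of it. The bettor interface ({\tt bet()} returning the signed bet {x_{t,i}} and {\tt update()} receiving {c_{t,i}}) is hypothetical naming of mine, not something fixed by the lecture; any coin-betting algorithm exposed in this form would do.

```python
import numpy as np

def lea_round(bettors, pi, g):
    """One round of the reduction: `bettors` are the d copies of the
    coin-betting algorithm A, `pi` is the prior (a numpy array), and
    `g` is the reward vector g_t in [0, 1]^d."""
    x = np.array([b.bet() for b in bettors])      # bets x_{t,i}
    p_hat = pi * np.maximum(x, 0.0)               # eq. (1)
    norm = p_hat.sum()
    p = p_hat / norm if norm > 0 else pi          # eq. (2)
    dot = float(np.dot(g, p))                     # <g_t, p_t>
    # rewards c_{t,i} for the coin-betting copies, eq. (3)
    c = np.where(x > 0, dot - g, np.maximum(dot - g, 0.0))
    for bettor, c_i in zip(bettors, c):
        bettor.update(float(c_i))
    return p
```

We can prove the following regret bound for this construction.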

Theorem 1 (Regret Bound for Experts) Let {\mathcal{A}} be a coin-betting algorithm that, starting from an initial wealth of 1, guarantees a wealth after {t} rounds of {\exp(f_t(\sum_{i=1}^t c'_i))} for any sequence of continuous coin outcomes {c'_1, \dots, c'_t \in [-1,1]}. Then, the regret of the LEA algorithm with prior {{\boldsymbol \pi} \in \Delta^d} that predicts at each round with {{\boldsymbol p}_t} in (2) satisfies

\displaystyle \forall T \ge 0, \quad \forall {\boldsymbol u} \in \Delta^d, \qquad \qquad \text{Regret}_T({\boldsymbol u}) = \sum_{t=1}^T \langle {\boldsymbol g}_t, {\boldsymbol p}_t -{\boldsymbol u}\rangle \le h\left( KL({\boldsymbol u};{\boldsymbol \pi}) \right),

for any {h:{\mathbb R}\rightarrow{\mathbb R}} concave and non-decreasing such that {x\leq h(f_T(x))}.

Proof: We first prove that {\sum_{i=1}^d \pi_i c_{t,i} x_{t,i} \le 0}. Indeed,

\displaystyle \begin{aligned} \sum_{i=1}^d \pi_i c_{t,i} x_{t,i} & = \sum_{i \, : \, \pi_i x_{t,i} > 0} \pi_i \max(x_{t,i},0) (\langle {\boldsymbol g}_t, {\boldsymbol p}_t \rangle-g_{t,i}) \ + \ \sum_{i \, : \, \pi_i x_{t,i} \le 0} \pi_i x_{t,i} \max(\langle {\boldsymbol g}_t, {\boldsymbol p}_t \rangle - g_{t,i} ,0) \\ & = \|\widehat {\boldsymbol p}_t\|_1 \sum_{i=1}^d p_{t,i} (\langle {\boldsymbol g}_t, {\boldsymbol p}_t \rangle - g_{t,i} ) \ + \ \sum_{i \, : \, \pi_i x_{t,i} \le 0} \pi_i x_{t,i} \max(\langle {\boldsymbol g}_t, {\boldsymbol p}_t\rangle - g_{t,i} ,0) \\ & = 0 \ + \ \sum_{i \, : \, \pi_i x_{t,i} \le 0} \pi_i x_{t,i} \max(\langle {\boldsymbol g}_t, {\boldsymbol p}_t\rangle - g_{t,i} ,0) \ \le 0 \; . \end{aligned}

The first equality follows from the definition of {c_{t,i}}. To see the second equality, consider two cases: if {\pi_i x_{t,i} \le 0} for all {i}, then {\|\widehat {\boldsymbol p}_t\|_1 = 0} and therefore both {\|\widehat {\boldsymbol p}_t\|_1 \sum_{i=1}^d p_{t,i} (\langle {\boldsymbol g}_t, {\boldsymbol p}_t \rangle - g_{t,i})} and {\sum_{i \, : \, \pi_i x_{t,i} > 0} \pi_i \max(x_{t,i},0) (\langle {\boldsymbol g}_t, {\boldsymbol p}_t \rangle - g_{t,i})} are trivially zero. If {\|\widehat {\boldsymbol p}_t\|_1 > 0}, then {\pi_i \max(x_{t,i},0) = \widehat p_{t,i} = \|\widehat {\boldsymbol p}_t\|_1 p_{t,i}} for all {i}.

From the assumption on {\mathcal{A}}, we have for any sequence {\{c'_t\}_{t=1}^T} such that {c'_t \in [-1,1]} that

\displaystyle \label{equation:experts-one-dimensional-assumption} \text{Wealth}_T = 1 + \sum_{t=1}^T c'_t x_t \ge \exp\left(f_T\left(\sum_{t=1}^T c'_t\right)\right)~. \ \ \ \ \ (4)

So, the inequality {\sum_{i=1}^d \pi_i c_{t,i} x_{t,i} \le 0} and (4) imply

\displaystyle \label{equation:bounded-potential} \sum_{i=1}^d \pi_i \exp\left(f_T \left(\sum_{t=1}^T c_{t,i} \right)\right) \le 1 + \sum_{i=1}^d \pi_i \sum_{t=1}^T c_{t,i} x_{t,i} \le 1~ . \ \ \ \ \ (5)

Now, for any competitor {{\boldsymbol u} \in \Delta^d},

\displaystyle \begin{aligned} \text{Regret}_T({\boldsymbol u}) &= \sum_{t=1}^T \langle {\boldsymbol g}_t, {\boldsymbol p}_t - {\boldsymbol u} \rangle \\ &= \sum_{t=1}^T \sum_{i=1}^d u_i \left(\langle {\boldsymbol g}_t, {\boldsymbol p}_t \rangle - g_{t,i} \right) \\ & \le \sum_{t=1}^T \sum_{i=1}^d u_i c_{t,i} \qquad \text{(by definition of } c_{t,i} \text{)} \\ & \leq \sum_{i=1}^d u_i h\left(f_T\left( \sum_{t=1}^T c_{t,i}\right) \right) \qquad \text{(by the assumption } x \le h(f_T(x)) \text{)} \\ & \le h\left[\sum_{i=1}^d u_i f_T\left( \sum_{t=1}^T c_{t,i}\right) \right] \qquad \text{(by concavity of } h \text{ and Jensen's inequality)} \\ & \le h\left[KL({\boldsymbol u};{\boldsymbol \pi})+\ln \left(\sum_{i=1}^d \pi_i \exp\left(f_T\left(\sum_{t=1}^T c_{t,i}\right)\right) \right)\right] \qquad \text{(Fenchel-Young inequality and } h \text{ non-decreasing)} \\ & \le h(KL({\boldsymbol u};{\boldsymbol \pi})) \qquad \text{(by (5))}~. \end{aligned}

\Box

Now, we could think of using the Krichevsky–Trofimov (KT) bettor with this theorem. However, we would obtain a sub-optimal regret guarantee. In fact, recalling the lower bound on the wealth of the KT bettor and setting {h(x)=\sqrt{4T(x+\frac12 \ln T + K )}}, where {K} is a universal constant, we have

\displaystyle \text{Regret}_T({\boldsymbol u}) \leq O(\sqrt{T(KL({\boldsymbol u};{\boldsymbol \pi})+\ln T)})~.
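To verify this choice of {h}, here is the quick check left implicit above: the KT lower bound guarantees a wealth of {\exp(f_T(x))} with {f_T(x) = \frac{x^2}{4T} - \frac{1}{2}\ln T - K}, so that

\displaystyle h(f_T(x)) = \sqrt{4T\left(\frac{x^2}{4T} - \frac{1}{2}\ln T - K + \frac{1}{2}\ln T + K\right)} = |x| \geq x,

as required by Theorem 1.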

We might think that the {\ln T} is the price we have to pay to adapt to the unknown competitor {{\boldsymbol u}}. However, it turns out it can be removed. In the next section, we will see how to change the KT strategy to obtain the optimal guarantee.

2. A Betting Strategy that Loses at Most a Constant Fraction of Money

In the reduction above, if we used the KT betting strategy we would have a {\ln T} term under the square root. It turns out that we can avoid that term if we know the number of rounds beforehand. If {T} is unknown, we can just use a doubling trick, paying only a constant multiplicative factor in the regret, as sketched below.
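For concreteness, here is a minimal Python sketch of the doubling trick; the {\tt make\_alg} factory is a hypothetical interface of mine, returning a fresh copy of a fixed-horizon algorithm tuned for a given horizon.

```python
def with_doubling(make_alg, T_total):
    """Doubling trick (a sketch): run fresh copies of a fixed-horizon
    algorithm on epochs of length 1, 2, 4, ..., so that each epoch runs
    with a known horizon. Yields the instance to use at each round; the
    fixed-horizon regret degrades by a constant multiplicative factor."""
    t, k = 0, 0
    while t < T_total:
        epoch_len = 2 ** k
        alg = make_alg(epoch_len)   # restart with the (now known) horizon
        for _ in range(min(epoch_len, T_total - t)):
            yield alg               # caller queries/updates alg this round
            t += 1
        k += 1
```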

The logarithmic term in the regret comes from the fact that the lower bound on the wealth of the KT bettor is of the order of

\displaystyle \frac{1}{\sqrt{T}}\exp\left(\frac{(\sum_{t=1}^T c_t)^2}{4T}\right)~.

Note that in the case in which the number of heads in the sequence is equal to the number of tails, so that {\sum_{t=1}^T c_t=0}, the guaranteed wealth becomes proportional to {\frac{1}{\sqrt{T}}}. So, as {T} goes to infinity, the bettor will lose all of its money.

Instead, we need a more conservative strategy that guarantees

\displaystyle O\left(\alpha\exp\left(\frac{(\sum_{t=1}^T c_t)^2}{4 T}\right)\right),

for {\alpha} small enough and independent of {T}. In this case, the betting strategy has to pace its betting, possibly with the knowledge of the duration of the game, so that even in the case that the number of heads is equal to the number of tails it will only lose a fraction of its money. At the same time, it will still gain an exponential amount of money when the coin outcomes are biased towards one side.

We will prove that this is possible by designing a new betting strategy.

Define {S_{t}=\sum_{i=1}^{t} c_i} for {t=1, \dots, T}, and set

\displaystyle \begin{aligned} F_t(x)&=\epsilon \exp\left(\frac{x^2}{2(t+T)} - \sum_{i=1}^{t} \frac{1}{2(i+T)}\right), & (6) \\ \beta_t&=\frac{F_{t}(S_{t-1}+1)-F_{t}(S_{t-1}-1)}{F_{t}(S_{t-1}+1)+F_{t}(S_{t-1}-1)}~.& (7) \\\end{aligned}

Note that if

\displaystyle \label{eq:property_betting_potentials} (1+\beta_t c_t) F_{t-1}\left(\sum_{i=1}^{t-1} c_i\right) \geq F_{t}\left(\sum_{i=1}^{t} c_i\right) \ \ \ \ \ (8)

then, assuming {\epsilon \le 1} (so that the base case {\text{Wealth}_0 = 1 \geq F_0(0)=\epsilon} holds), by induction we have {\text{Wealth}_T \geq F_{T}\left(\sum_{t=1}^{T} c_t\right)}. In fact, we have

\displaystyle \text{Wealth}_{t} = (1+\beta_t c_t) \text{Wealth}_{t-1} \geq (1+\beta_t c_t) F_{t-1} \left(\sum_{i=1}^{t-1} c_i\right) \geq F_{t}\left(\sum_{i=1}^{t} c_i\right)~.

Hence, we have to prove that (8) is true in order to guarantee a minimum wealth of our betting strategy.

First, given that {\ln(1+\beta_t c) - \frac{(x+c)^2}{2(t+T)}} is a concave function of {c}, its minimum over {[-1,1]} is attained at one of the endpoints:

\displaystyle \min_{ c \in [-1,1]} \ \ln(1+\beta_t c) - \frac{(x+c)^2}{2(t+T)} = \min\left(\ln(1+\beta_t) - \frac{(x+1)^2}{2(t+T)}, \ln(1-\beta_t) - \frac{(x-1)^2}{2(t+T)}\right)~.

Also, our choice of {\beta_t} makes the two quantities above equal with {x=S_{t-1}}, that is

\displaystyle \ln(1+\beta_t) - \ln F_t (S_{t-1} + 1) = \ln(1-\beta_t) - \ln F_t (S_{t-1} - 1)~.

For other choices of {\beta_t}, the two alternatives would differ, and the adversary could always pick the coin outcome corresponding to the smaller one. Instead, by making the two worst-case outcomes equal, we minimize the damage of the adversarial choice of the coin outcomes. So, we have

\displaystyle \begin{aligned} \ln (1+\beta_t c_t) - \ln F_t(S_t) &= \ln (1+\beta_t c_t) - \ln F_t(S_{t-1}+c_t) \\ &\geq \min_{ c \in [-1,1]} \ \ln (1+\beta_t c) - \ln F_t(S_{t-1}+c) \\ &= -\ln \frac{F_{t}(S_{t-1}+1)+F_{t}(S_{t-1}-1)}{2} \\ &=- \ln \left[\epsilon \exp\left(\frac{S_{t-1}^2+1}{2(t+T)} - \sum_{i=1}^t \frac{1}{2(i+T)}\right) \frac{1}{2}\left(\exp\left(\frac{S_{t-1}}{t+T}\right) + \exp\left(\frac{-S_{t-1}}{t+T}\right) \right) \right] \\ &= -\ln \epsilon -\frac{S_{t-1}^2+1}{2(t+T)} - \ln \cosh\frac{S_{t-1}}{t+T} + \sum_{i=1}^t \frac{1}{2(i+T)} \\ &\geq -\ln \epsilon -\frac{S_{t-1}^2}{2(t+T)} - \frac{S_{t-1}^2}{2(t+T)^2} + \sum_{i=1}^{t-1} \frac{1}{2(i+T)} \\ &\geq -\ln \epsilon -\frac{S_{t-1}^2}{2(t+T)} - \frac{S_{t-1}^2}{2(t+T)(t-1+T)} + \sum_{i=1}^{t-1} \frac{1}{2(i+T)} \\ &= -\ln \epsilon -\frac{S_{t-1}^2}{2(t-1+T)}+ \sum_{i=1}^{t-1} \frac{1}{2(i+T)} \\ &= -\ln F_{t-1}(S_{t-1}), \end{aligned}

where in the second equality we used the definition of {\beta_t}, in the second inequality the fact that {\ln \cosh(x)\leq \frac{x^2}{2}, \ \forall x}, and in the third inequality the fact that {(t+T)^2 \ge (t+T)(t-1+T)}.

Hence, we have proved that (8) holds and, setting {\epsilon=1}, this strategy guarantees

\displaystyle \text{Wealth}_T \geq F_T\left(\sum_{t=1}^T c_t\right) = \exp\left(\frac{\left(\sum_{t=1}^T c_t\right)^2}{4T} - \sum_{i=1}^T \frac{1}{2(i+T)}\right) \geq \exp\left(\frac{\left(\sum_{t=1}^T c_t\right)^2}{4T} - \frac{1}{2}\ln 2\right) = \frac{\sqrt{2}}{2} \exp\left(\frac{\left(\sum_{t=1}^T c_t\right)^2}{4T}\right)~,

where we used {\sum_{i=1}^{T} \frac{1}{2(i+T)} \leq \frac{1}{2}\ln \frac{2T}{T} = \frac{1}{2}\ln 2}.
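Here is a minimal Python sketch of this strategy with {\epsilon=1} and known horizon {T}. Note that the common factors of {F_t} cancel in (7), so one can check that {\beta_t} simplifies to {\tanh\left(\frac{S_{t-1}}{t+T}\right)}.

```python
import math

def run_bettor(coin_outcomes, T):
    """Run the strategy of (6)-(7) on a sequence c_1, ..., c_T in [-1, 1],
    starting from wealth 1 (epsilon = 1); a sketch, not from the lecture."""
    wealth, S = 1.0, 0.0
    for t, c in enumerate(coin_outcomes, start=1):
        beta = math.tanh(S / (t + T))   # beta_t from (7), simplified
        wealth *= 1.0 + beta * c        # Wealth_t = (1 + beta_t c_t) Wealth_{t-1}
        S += c
    return wealth

# Example: with all heads (c_t = 1) and T = 100, the guarantee above
# promises Wealth_T >= (sqrt(2)/2) * exp(T / 4), roughly 5e10.
print(run_bettor([1.0] * 100, T=100))
```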

We can now use this betting strategy in the expert reduction of Theorem 1, setting {h(x)=\sqrt{4T (x+\frac12 \ln 2)}}, to obtain

\displaystyle \label{eq:kl_experts} \text{Regret}_T({\boldsymbol u}) \leq \sqrt{4 T \left(KL({\boldsymbol u};{\boldsymbol \pi})+\frac{1}{2}\ln 2\right)}~. \ \ \ \ \ (9)
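Continuing the sketches above, the same strategy can be wrapped in the hypothetical {\tt bet()}/{\tt update()} interface used in the reduction of Section 1:

```python
import math

class ConservativeBettor:
    """The betting strategy of Section 2 (epsilon = 1), exposing the
    bet()/update() interface assumed by lea_round above."""
    def __init__(self, T):
        self.T, self.t, self.S, self.wealth = T, 1, 0.0, 1.0

    def bet(self):
        # signed bet x_t = beta_t * Wealth_{t-1}
        self.beta = math.tanh(self.S / (self.t + self.T))
        return self.beta * self.wealth

    def update(self, c):
        self.wealth *= 1.0 + self.beta * c
        self.S += c
        self.t += 1

# Usage with d experts and a uniform prior:
# bettors = [ConservativeBettor(T) for _ in range(d)]
# pi = np.full(d, 1.0 / d)
# ... each round: p_t = lea_round(bettors, pi, g_t)
```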

Note that this betting strategy could also be used in the OCO reduction of the last lecture. Given that we removed the logarithmic term in the exponent of the wealth, in the 1-dimensional case we would obtain a regret of

\displaystyle \text{Regret}_T(u) \leq O\left(\epsilon + |u|\sqrt{T \ln \left(\frac{|u|\sqrt{T}}{\epsilon}+1\right)}\right),

where we gained a {\sqrt{T}} term inside the logarithm, instead of the {T} term of the KT algorithm. This implies that we can now set {\epsilon} to {\sqrt{T}} and obtain an asymptotic rate of {O(\sqrt{T})} rather than {O(\sqrt{T \ln T})}.
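Indeed, plugging {\epsilon=\sqrt{T}} into the bound above gives

\displaystyle \text{Regret}_T(u) \leq O\left(\sqrt{T} + |u| \sqrt{T \ln\left(|u|+1\right)}\right),

which is {O(\sqrt{T})} for any fixed {u}.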

3. History Bits

The first parameter-free algorithm for experts is from (Chaudhuri, K. and Freund, Y. and Hsu, D. J., 2009), named NormalHedge, where they obtained a bound similar to the one in (9) but with an additional {\ln \ln T} term. Then, (Chernov, A. and Vovk, V., 2010) removed the log factors with an update without a closed form. (Orabona, F. and Pal, D., 2016) showed that this guarantee can be efficiently obtained through the novel reduction to coin-betting in Theorem 1. Later, this kind of regret guarantee was improved to depend on the sum of the squared losses rather than on time, but with an additional {\ln \ln T} factor, in the Squint algorithm (Koolen, W. M. and van Erven, T., 2015). It is worth noting that the Squint algorithm can be interpreted exactly as a coin-betting algorithm plus the reduction in Theorem 1.

The betting strategy in (6) and (7) is new, and is derived from the shifted-KT potentials in (Orabona, F. and Pal, D., 2016). The guarantee is the same as the one obtained by the shifted-KT potentials, but the analysis can be carried out without knowing the properties of the gamma function.

4. Exercises

Exercise 1 Using the same proof technique as in the lecture, find a betting strategy whose wealth guarantee depends on {\sum_{t=1}^T |c_t|} rather than on {T}.
