# Multi-Armed Bandit IV: UCB

This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019.

You can find all the lectures I published here.

In the last lecture, we introduced the Explore-Then-Commit (ETC) algorithm that solves the stochastic bandit problem, but requires the knowledge of the gaps. This time we will introduce a parameter-free strategy that achieves the same optimal regret guarantee.

1. Upper Confidence Bound Algorithm

The ETC algorithm has the disadvantage of requiring the knowledge of the gaps to tune the exploration phase. Moreover, it solves the exploration vs. exploitation trade-off in a clunky way. It would be better to have an algorithm that smoothly transition from one phase into the other in a data-dependent way. So, we now describe an optimal and adaptive strategy called Upper Confidence Bound (UCB) algorithm. It employs the principle of optimism in the face of uncertainty, to select in each round the arm that has the potential to be the best one.

UCB works keeping an estimate of the expected loss of each arm and also a confidence interval at a certain probability. Roughly speaking, we have that with probability at least ${1-\delta}$

$\displaystyle \mu_i \in \left[\hat{\mu}_i - \sqrt{\frac{2\ln \frac{1}{\delta}}{S_{t-1,i}}}, \hat{\mu}_i\right],$

where the “roughly” comes from the fact that ${S_{t-1,i}}$ is a random variable itself. Then, UCB will query the arm with the smallest lower bound, that is the one that could potentially have the smallest expected loss.

Remark 1. The name Upper Confidence Bound comes from the fact that traditionally stochastic bandits are defined over rewards, rather than losses. So, in our case we actually use the lower confidence bound in the algorithm. However, to avoid confusion with the literature, we still call it Upper Confidence Bound algorithm.

The key points in the proof are on how to choose the right confidence level ${\delta}$ and how to get around the dependency issues.

The algorithm is summarized in Algorithm 1 and we can prove the following regret bound.

Theorem 1. Assume that the rewards of the arms are ${1}$-subgaussian and ${\alpha>2}$ and let ${\alpha>2}$. Then, UCB guarantees a regret of

$\displaystyle \text{Regret}_T \leq \frac{\alpha}{\alpha-2}\sum_{i=1}^d \Delta_i + \sum_{i: \Delta_i>0} \frac{8\alpha \ln T}{\Delta_i} ~.$

Proof: We analyze one arm at the time. Also, without loss of generality, assume that the optimal arm is the first one. For arm ${i}$, we want to prove that ${\mathop{\mathbb E}[S_{T,i}]\leq \frac{8\alpha \ln T}{\Delta^2_i} + \frac{\alpha}{\alpha-2}}$.

The proof is based on the fact that once I have sampled an arm enough times, the probability to take a suboptimal arm is small.

Let ${t^\star}$ the biggest time index such that ${S_{t^\star-1,i} \leq \frac{8\alpha \ln T}{\Delta_i^2}}$. If ${t^\star=T}$, then the statement above is true. Hence, we can safely assume ${t^\star < T}$. Now, for ${t}$ bigger than ${t^\star}$ we have

$\displaystyle \label{eq:thm_ucb_eq1} S_{t-1,i} > \frac{8\alpha \ln T}{\Delta_i^2}~. \ \ \ \ \ (1)$

Consider ${t> t^\star}$ and such that ${A_t = i}$, then we claim that at least one of the two following equations must be true:

\displaystyle \begin{aligned} \hat{\mu}_{t-1,1} - \sqrt{\frac{2\alpha \ln t}{S_{t-1,1}}} &\geq \mu_1, & (2) \\ \hat{\mu}_{t-1,i} + \sqrt{\frac{2\alpha \ln t}{S_{t-1,i}}} &< \mu_i~. & (3) \\\end{aligned}

If the first one is true, the confidence interval around our estimate of the expectation of the optimal arm does not contain ${\mu_1}$. On the other hand, if the second one is true the confidence interval around our estimate of the expectation ${\mu_i}$ does not contain ${\mu_i}$. So, we claim that if ${t> t^\star}$ and we selected a suboptimal arm, then at least one of these two bad events happened.

Let’s prove the claim: if both the inequalities above are false, ${t> t^\star}$, and ${A_t = i}$, we have

\displaystyle \begin{aligned} \hat{\mu}_{t-1,1} - \sqrt{\frac{2\alpha \ln t}{S_{t-1,1}}} &< \mu_1 \qquad ((2) \text{ false}) \\ &= \mu_i - \Delta_i \\ &< \mu_i - 2\sqrt{\frac{2\alpha \ln T}{S_{t-1,i}}} \qquad (\text{for } (1))\\ &\leq \mu_i - 2\sqrt{\frac{2\alpha \ln t}{S_{t-1,i}}} \\ &\leq \hat{\mu}_{t-1,i} - \sqrt{\frac{2 \alpha \ln t}{S_{t-1,i}}} \qquad ((3) \text{ false}), \end{aligned}

that, by the selection strategy of the algorithm, would imply ${A_t \neq i}$.

Note that ${S_{t^\star,i} \leq \frac{8\alpha \ln T}{\Delta_i^2}+1}$. Hence, we have

\displaystyle \begin{aligned} \mathop{\mathbb E}\left[ S_{T,i}\right] &= \mathop{\mathbb E}[S_{t^\star,i}] + \sum_{t=t^\star+1}^T \mathop{\mathbb E}[\boldsymbol{1}[A_t=i, (2) \text{ or } (3) \text{ true}]] \\ &\leq \frac{8\alpha \ln T}{\Delta_i^2} + 1 + \sum_{t=t^\star+1}^T \mathop{\mathbb E}[\boldsymbol{1}[(2) \text{ or } (3) \text{ true}]] \\ &\leq \frac{8\alpha \ln T}{\Delta_i^2} + 1 + \sum_{t=t^\star+1}^T \left(\mathop{\mathbb P}[(2) \text{ true}] + \mathop{\mathbb P}[(3) \text{ true}]\right)~. \end{aligned}

Now, we upper bound the probabilities in the sum. Given that the losses on the arms are i.i.d. and using the union bound, we have

\displaystyle \begin{aligned} \mathop{\mathbb P} \left[\hat{\mu}_{t-1,1} - \sqrt{\frac{2\alpha \ln t}{S_{t-1,1}}} \geq \mu_1 \right] &\leq \mathop{\mathbb P} \left[\max_{s=1, \dots, t-1} \ \frac{1}{s}\sum_{j=1}^s g_{j,1} - \sqrt{\frac{2\alpha \ln t}{s}} \geq \mu_1\right] \\ &= \mathop{\mathbb P}\left[\bigcup_{s=1}^{t-1} \left\{\frac{1}{s}\sum_{j=1}^s g_{j,1} - \sqrt{\frac{2\alpha \ln t}{s}} \geq \mu_1\right\}\right]~. \end{aligned}

Hence, we have

\displaystyle \begin{aligned} \mathop{\mathbb P}[(2) \text{ true} ] &\leq \sum_{s=1}^{t-1} \mathop{\mathbb P}\left[\frac{1}{s}\sum_{j=1}^s g_{j,1} - \sqrt{\frac{2\alpha \ln t}{s}} \geq \mu_1\right] \qquad \text{(union bound)}\\ &\leq \sum_{s=1}^{t-1} t^{-\alpha} = (t-1) t^{-\alpha}~. \end{aligned}

Given that the same bound holds for ${\mathop{\mathbb P}[(3) \text{ true}]}$, we have

\displaystyle \begin{aligned} \mathop{\mathbb E}\left[ S_{T,i}\right] &\leq \frac{8\alpha \ln T}{\Delta_i^2} + 1 + \sum_{t=1}^{\infty} 2 (t-1) t^{-\alpha} \\ &\leq \frac{8\alpha \ln T}{\Delta_i^2} + 1 + \sum_{t=2}^{\infty} 2 t^{1-\alpha} \\ &\leq \frac{8\alpha \ln T}{\Delta_i^2} + 1 + 2\int_{1}^{\infty} x^{1-\alpha}\\ &=\frac{8\alpha \ln T}{\Delta_i^2} + \frac{\alpha}{\alpha-2}~. \end{aligned}

Using the decomposition of the regret we proved last time, ${\text{Regret}_T=\sum_{i=1}^d \Delta_i \mathop{\mathbb E}[S_{T,i}]}$, we have the stated bound. $\Box$

It is instructive to observe an actual run of the algorithm. I have considered 5 arms and Gaussian losses. In the left plot of figure below, I have plotted how the estimates and confidence intervals of UCB varies over time (in blue), compared to the actual true means (in black). In the right side, you can see the number of times each arm was pulled by the algorithm.

It is interesting to note that the logarithmic factor in the confidence term will make the confidences of the arm that are not pulled to increase over time. In turn, this will assure that the algorithm does not miss the optimal arm, even if the estimates were off. Also, the algorithm will keep pulling the two arms that are close together, to be sure on which one is the best among the two.

The bound above can become meaningless if the gaps are too small. So, here we prove another bound that does not depend on the inverse of the gaps.

Theorem 2. Assume that the rewards of the arms minus their expectations are ${1}$-subgaussian and let ${\alpha>2}$. Then, UCB guarantees a regret of

$\displaystyle \text{Regret}_T \leq 4\sqrt{2\alpha d T \ln T} + \frac{\alpha}{\alpha-2} \sum_{i=1}^d \Delta_i~.$

Proof: Let ${\Delta > 0}$ be some value to be tuned subsequently and recall from the proof of Theorem 1 that for each suboptimal arm ${i}$ we can bound

$\displaystyle \mathop{\mathbb E}[S_{T,i}] \leq \frac{\alpha}{\alpha-2} + \frac{8\alpha \ln T}{\Delta_i^2}~.$

Hence, using the regret decomposition we proved last time, we have

\displaystyle \begin{aligned} \text{Regret}_T &= \sum_{i:\Delta_i < \Delta} \Delta_i \mathop{\mathbb E}[S_{T,i}] + \sum_{i:\Delta_i \geq \Delta} \Delta_i \mathop{\mathbb E}[S_{T,i}] \\ &\leq T\Delta + \sum_{i:\Delta_i \geq \Delta} \Delta_i \mathop{\mathbb E}[S_{T,i}] \\ &\leq T\Delta + \sum_{i:\Delta_i \geq \Delta} \left(\Delta_i \frac{\alpha}{\alpha-2} + \frac{8\alpha \ln T}{\Delta_i}\right) \\ &\leq T\Delta + \frac{\alpha}{\alpha-2} \sum_{i=1}^d \Delta_i + \frac{8\alpha d \ln T}{\Delta}~. \end{aligned}

Choosing ${\Delta=\sqrt{\frac{8\alpha d \ln T}{T}}}$, we have the stated bound. $\Box$

Remark 2. Note that while the UCB algorithm is considered parameter-free, we still have to know the subgaussianity of the arms. While this can be easily upper bounded for stochastic arms with bounded support, it is unclear how to do it without any prior knowledge on the distribution of the arms.

It is possible to prove that the UCB algorithm is asymptotically optimal, in the sense of the following Theorem.

Theorem 3 (Bubeck, S. and Cesa-Bianchi, N. , 2012, Theorem 2.2). Consider a strategy that satisfies ${\mathop{\mathbb E}[S_{T,i}] = o(T^a)}$ for any set of Bernoulli rewards distributions, any arm ${i}$ with ${\Delta_i>0}$ and any ${a>0}$. Then, for any set of Bernoulli reward distributions, the following holds

$\displaystyle \liminf\limits_{T\rightarrow +\infty} \frac{\text{Regret}_T}{\ln T} \geq \sum_{i: \Delta_i} \frac{1}{2\Delta_i}~.$

2. History Bits

The use of confidence bounds and the idea of optimism first appeared in the work by (T. L. Lai and H. Robbins, 1985). The first version of UCB is by (T. L. Lai, 1987). The version of UCB I presented is by (P. Auer and N. Cesa-Bianchi and P. Fischer, 2002) under the name UCB1. Note that, rather than considering 1-subgaussian environments, (P. Auer and N. Cesa-Bianchi and P. Fischer, 2002) considers bandits where the rewards are confined to the ${[0, 1]}$ interval. The proof of Theorem 1 is a minor variation of the one of Theorem 2.1 in (Bubeck, S. and Cesa-Bianchi, N. , 2012), which also popularized the subgaussian setup. Theorem 2 is from (Bubeck, S. and Cesa-Bianchi, N. , 2012).

3. Exercises

Exercise 1. Prove a similar regret bound to the one in Theorem 2 for an optimally tuned Explore-Then-Commit algorithm.

## 1 Comment

1. bremen79 says:

EDIT: Fixed a small bug in the proof of UCB found by Daniel Hsu

Like