Multi-Armed Bandit III: Stochastic Bandits

This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week.

You can find the lectures I published till now here.

Today, we will consider the stochastic bandit setting. Here, each arm is associated with an unknown probability distribution. At each time step, the algorithm selects one arm {A_t} and it receives a loss (or reward) {g_{t,A_t}} drawn i.i.d. from the distribution of the arm {A_t}. We focus on minimizing the pseudo-regret, that is the regret with respect to the optimal action in expectation, rather than the optimal action on the sequence of realized losses:

\displaystyle \text{Regret}_T :=\mathop{\mathbb E}\left[\sum_{t=1}^T g_{t,A_t}\right] - \min_{i=1,\dots,d} \ \mathop{\mathbb E}\left[\sum_{t=1}^T g_{t,i}\right] =\mathop{\mathbb E}\left[\sum_{t=1}^T g_{t,A_t}\right] - \min_{i=1,\dots,d} \ \mu_i,

where we denoted by {\mu_i} the expectation of the distribution associated with the arm {i}.

Remark 1 The usual notation in the stochastic bandit literature is to consider rewards instead of losses. Instead, to keep our notation coherent with the OCO literature, we will consider losses. The two things are completely equivalent up to a multiplication by {-1}.

Before presenting our first algorithm for stochastic bandits, we will introduce some basic notions on concentration inequalities that will be useful in our definitions and proofs.

1. Concentration Inequalities Bits

Suppose that {X_1, X_2, \dots , X_n} is a sequence of independent and identically distributed random variables and with mean {\mu = \mathop{\mathbb E}[X_1]} and variance {\sigma = \mathrm{Var}[X_1]}. Having observed {X_1, X_2, \dots , X_n} we would like to estimate the common mean {\mu}. The most natural estimator is the empirical mean

\displaystyle \hat{\mu}=\frac{1}{n}\sum_{i=1}^n X_i~.

Linearity of expectation shows that {\mathop{\mathbb E}[\hat{\mu}] = \mu}, which means that {\hat{\mu}} is an unbiased estimator of {\mu}. Yet, {\hat{\mu}} is a random variable itself. So, can we quantify how far {\hat{\mu}} will be from {\mu}?

We could use Chebyshev’s inequality to upper bound the probability that {\hat{\mu}} is far from {\mu}:

\displaystyle \Pr[|\hat{\mu}-\mu|\geq\epsilon]\leq \frac{\mathrm{Var}[\hat{\mu}]}{\epsilon^2}~.

Using the fact that {\mathrm{Var}[\hat{\mu}] = \frac{\sigma^2}{n}}, we have that

\displaystyle \Pr[|\hat{\mu}-\mu|\geq\epsilon]\leq \frac{\sigma^2}{n \epsilon^2}~.

So, we can expect the probability of having a “bad” estimate to go to zero as one over the number of samples in our empirical mean. Is this the best we can get? To understand what we can hope for, let’s take a look at the central limit theorem.

We know that, defining {S_n=\sum_{t=1}^n (X_i-\mu)}, {\frac{S_n}{\sqrt{n \sigma^2}}\rightarrow N(0,1)}, the standard Gaussian distribution, as {n} goes to infinity. This means that

\displaystyle \Pr[\hat{\mu}-\mu \geq \epsilon] = \Pr[S_n \geq n\epsilon] = \Pr\left[\frac{S_n}{\sqrt{2 \sigma^2}} \geq \sqrt{\frac{n}{\sigma^2}}\epsilon\right] \approx \int_{\epsilon\sqrt{\frac{n}{\sigma^2}}}^\infty \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right) dx,

where the approximation comes from the central limit theorem. The integral cannot be calculated with a closed form, but we can easily upper bound it. Indeed, for {a>0}, we have

\displaystyle \begin{aligned} \int_{a}^\infty \exp\left(-\frac{x^2}{2}\right) dx = \int_{a}^\infty \frac{x}{x}\exp\left(-\frac{x^2}{2}\right) dx \leq \frac{1}{a}\int_{a}^\infty x\exp\left(-\frac{x^2}{2}\right) dx = \frac{1}{a} \exp\left(-\frac{a^2}{2}\right)~. \end{aligned}

Hence, we have

\displaystyle \label{eq:conc_gaussian} \Pr[\hat{\mu}-\mu \geq \epsilon] \lessapprox \sqrt{\frac{\sigma^2}{2 \pi \epsilon^2 n}} \exp\left(-\frac{n \epsilon^2}{2 \sigma^2}\right)~. \ \ \ \ \ (1)

This is better than what we got with Chebyshev’s inequality and we would like to obtain an exact bound with a similar asymptotic rate. To do that, we will focus our attention on subgaussian random variables.

Definition 1 We say that a random variable is {\sigma}subgaussian if for all {\lambda \in {\mathbb R}} we have that {\mathop{\mathbb E}[\exp(\lambda X)]\leq \exp(\lambda^2 \sigma^2/2)}.

Example 1 The following random variable are subgaussian:

  • If {X} is Gaussian with mean zero and variance {\sigma^2}, then {X} is {\sigma}-subgaussian.
  • If {X} has mean zero and {X \in [a, b]} almost surely, then {X} is {(b - a)/2}-subgaussian.

We have the following properties for subgaussian random variables.

Lemma 2 (Lattimore and Szepesvári, 2018, Lemma 5.4) Assume that {X_1} and {X_2} are independent and {\sigma_1}-subgaussian and {\sigma_2}-subgaussian respectively. Then,

  1. {\mathop{\mathbb E}[X_1]} = 0 and {\mathrm{Var}[X_1] \leq \sigma_1^2}.
  2. {c X_1} is {|c|\sigma_1}-subgaussian.
  3. {X_1+X_2} is {\sqrt{\sigma^2_1+\sigma_2^2}}-subgaussian.

Subgaussians random variables behaves like Gaussian random variables, in the sense that their tail probabilities are upper bounded by the ones of a Gaussian of variance {\sigma^2}. To prove it, let’s first state the Markov’s inequality.

Theorem 3 (Markov’s inequality) For a non-negative random variable {X} and {\epsilon>0}, we have that {\Pr[X\geq \epsilon] \leq \frac{\mathop{\mathbb E}[X]}{\epsilon}}.

With Markov’s inequality, we can now formalize the above statement on subgaussian random variables.

Theorem 4 If a random variable is {\sigma}-subgaussian, then {\Pr[X\geq \epsilon] \leq \exp\left(-\frac{\epsilon^2}{2\sigma^2}\right)}.

Proof: For any {\lambda>0}, we have

\displaystyle \begin{aligned} \Pr[X\geq \epsilon] = \Pr[\exp(\lambda X)\geq \exp(\lambda\epsilon)] \leq \frac{\mathop{\mathbb E}[\exp(\lambda X)]}{\exp(\lambda \epsilon)} \leq \exp(\lambda^2 \sigma^2/2-\lambda \epsilon)~. \end{aligned}

Minimizing the right hand side of the inequality w.r.t. {\lambda}, we have the stated result. \Box

An easy consequence of the above theorem is that the empirical average of subgaussian random variables concentrates around its expectation, with the same asymptotic rate in (1).

Corollary 5 Assume that {X_i - \mu} are independent, {\sigma}-subgaussian random variables. Then, for any {\epsilon \geq 0}, we have

\displaystyle \begin{aligned} \Pr\left[\hat{\mu} \geq \mu + \epsilon\right] \leq \exp\left(- \frac{n \epsilon^2}{2\sigma^2}\right) &&\text{ and }&& \Pr\left[\hat{\mu} \leq \mu - \epsilon\right] \leq \exp\left(- \frac{n \epsilon^2}{2\sigma^2}\right), \end{aligned}

where {\hat{\mu} =\frac1n \sum_{i=1}^n X_i}.

Equating the upper bounds on the r.h.s. of the inequalities in the Corollary to {\delta}, we have the equivalent statement that, with probability at least {1-2\delta}, we have

\displaystyle \mu \in \left[\hat{\mu}-\sqrt{\frac{2 \sigma^2\ln\frac{1}{\delta}}{n}}, \hat{\mu}+\sqrt{\frac{2 \sigma^2\ln\frac{1}{\delta}}{n}}\right]~.

2. Explore-Then-Commit Algorithm


We are now ready to present the most natural algorithm for the stochastic bandit setting, called Explore-Then-Commit (ETC) algorithm. That is, we first identify the best arm over {m d} exploration rounds and then we commit to it. This algorithm is summarized in Algorithm 2.

In the following, we will denote by {S_{t,i}=\sum_{j=1}^t \boldsymbol{1}[A_t=i]}, that is the number of times that the arm {i} was pulled in the first {t} rounds.

Define by {\mu^\star} the expected loss of the arm with the smallest expectation, that is {\min_{i=1,\dots,d} \ \mu_i}. Critical quantities in our analysis will be the gaps, {\Delta_i:=\mu_i-\mu^\star} for {i=1, \dots,d}, that measure the expected difference in losses between the arms and the optimal one. In particular, we can decompose the regret as a sum over the arms of the expected number of times we pull an arm multiplied by its gap.

Lemma 6 For any policy of selection of the arms, the regret is upper bounded by

\displaystyle \text{Regret}_T = \sum_{i=1}^d \mathop{\mathbb E}[S_{T,i}] \Delta_i~.

Proof: Observe that

\displaystyle \sum_{t=1}^T g_{t,A_t} = \sum_{t=1}^T \sum_{i=1}^d g_{t,i} \boldsymbol{1}[A_t=i]~.


\displaystyle \begin{aligned} \text{Regret}_T &= \mathop{\mathbb E}\left[\sum_{t=1}^T g_{t,A_t}\right] - T \mu^\star = \mathop{\mathbb E}\left[\sum_{t=1}^T (g_{t,A_t}- \mu^\star)\right] = \sum_{i=1}^d \sum_{t=1}^T \mathop{\mathbb E}[\boldsymbol{1}[A_t=i](g_{t,i}-\mu^\star)] \\ &= \sum_{i=1}^d \sum_{t=1}^T \mathop{\mathbb E}[\mathop{\mathbb E}[\boldsymbol{1}[A_t=i](g_{t,i}- \mu^\star)|A_t]] = \sum_{i=1}^d \sum_{t=1}^T \mathop{\mathbb E}[\boldsymbol{1}[A_t=i] \mathop{\mathbb E}[g_{t,i} - \mu^\star |A_t]] \\ &= \sum_{i=1}^d \sum_{t=1}^T \mathop{\mathbb E}[\boldsymbol{1}[A_t=i] (\mu_{A_t}- \mu^\star)] = \sum_{i=1}^d \sum_{t=1}^T \mathop{\mathbb E}[\boldsymbol{1}[A_t=i]] (\mu_i- \mu^\star)~. \end{aligned}


The above Lemma quantifies the intuition that in order to have a small regret we have to select the suboptimal arms less often then the best one.

We are now ready to prove the regret guarantee of the ETC algorithm.

Theorem 7 Assume that the losses of the arms minus their expectations are {1}-subgaussian and {1\leq m \leq T/d}. Then, ETC guarantees a regret of

\displaystyle \text{Regret}_T \leq m \sum_{i=1}^d \Delta_i + (T-md)\sum_{i=1}^d \Delta_i \exp\left(-\frac{m \Delta^2_i}{4}\right)~.

Proof: Let’s assume without loss of generality that the optimal arm is the first one.

So, for {i\neq 1}, we have

\displaystyle \begin{aligned} \sum_{t=1}^T \mathop{\mathbb E}[\boldsymbol{1}[A_t=i]] &= m + (T-m d) \Pr\left[\hat{\mu}_{md,i} \leq \min_{j\geq i} \ \hat{\mu}_{md,j}\right] \\ &\leq m + (T-m d) \Pr\left[\hat{\mu}_{md,i} \leq \hat{\mu}_{md,1}\right]\\ &=m + (T-m d) \Pr\left[\hat{\mu}_{md,1}-\mu_1-(\hat{\mu}_{md,i} -\mu_i)\geq \Delta_i\right]~. \end{aligned}

From Lemma 2, we have that {\hat{\mu}_{md,i} -\mu_i -(\hat{\mu}_{md,1}-\mu_1)} is {\sqrt{2/m}}-subgaussian. So, from Theorem 4, we have

\displaystyle \Pr\left[\hat{\mu}_{md,1}-\mu_1-(\hat{\mu}_{md,i} -\mu_i) \geq \Delta_i\right] \leq \exp\left(-\frac{m \Delta^2_i}{4}\right)~.


The bound shows the trade-off between exploration and exploitation: if {m} is too big, we pay too much during the exploration phase (first term in the bound). On the other hand, if {m} is small, the probability to select a suboptimal arm increases (second term in the bound). Knowing all the gaps {\Delta_i}, it is possible to choose {m} that minimizes the bound.

For example, in that case that {d=2}, the regret is upper bounded by

\displaystyle m \Delta + (T-2m) \Delta \exp\left(-m \frac{\Delta^2}{4}\right) \leq m \Delta + T\Delta \exp\left(-m \frac{\Delta^2}{4}\right),

that is minimized by

\displaystyle m = \frac{4}{\Delta^2} \ln \frac{T\Delta^2}{4}~.

Remembering that {m} must be a natural number we can choose

\displaystyle m=\max\left(\left\lceil \frac{4}{\Delta^2} \ln \frac{T\Delta^2}{4} \right\rceil,1\right)~.

When {\frac{T\Delta^2}{4}\leq 1}, we select {m=1}. So, we have {\Delta + (T-2) \Delta \leq T \Delta \leq \frac{4}{\Delta}}. Hence, the regret is upper bounded by

\displaystyle \min\left(\Delta T,\frac{4}{\Delta} \left(1+ \max\left(\ln \frac{T\Delta^2}{4},0\right)\right) + \Delta\right)~.

The main drawback of this algorithm is that its optimal tuning depends on the gaps {\delta_i}. Assuming the knowledge of the gaps account to make the stochastic bandit problem completely trivial. However, its tuned regret bound gives us a baseline to which compare other bandit algorithms. In particular, in the next lecture we will present an algorithm that achieves the same asymptotic regret without any knowledge of the gaps.

3. History Bits

The ETC algorithm goes back to (Robbins, H., 1952), even if Robbins proposed what is now called epoch-greedy (Langford, J. and Zhang, T., 2008). For more history on ETC, take a look at chapter 6 in (Lattimore, T. and Szepesvári, C., 2018). The proofs presented here are from (Lattimore, T. and Szepesvári, C., 2018) as well.


  1. Nice tutorial! Could you explain why there is an additional +\Delta in the final upper bound? When I substitute m into the bound I don’t get that extra +\Delta. Thanks in advance


    1. It comes from the fact that you need to take the ceiling of the optimal value of m to obtain an integer number. Note that in the case the optimal m is 1, the +\Delta is not needed, but it is a small overapproximation to make the expression simpler.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s