*This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week.*

* You can find the lectures I published till now here.*

Today, we will consider the *stochastic bandit* setting. Here, each arm is associated with an unknown probability distribution. At each time step, the algorithm selects one arm and it receives a loss (or reward) drawn i.i.d. from the distribution of the arm . We focus on minimizing the *pseudo-regret*, that is the regret with respect to the optimal action in expectation, rather than the optimal action on the sequence of realized losses:

where we denoted by the expectation of the distribution associated with the arm .

Remark 1The usual notation in the stochastic bandit literature is to consider rewards instead of losses. Instead, to keep our notation coherent with the OCO literature, we will consider losses. The two things are completely equivalent up to a multiplication by .

Before presenting our first algorithm for stochastic bandits, we will introduce some basic notions on concentration inequalities that will be useful in our definitions and proofs.

**1. Concentration Inequalities Bits **

Suppose that is a sequence of independent and identically distributed random variables and with mean and variance . Having observed we would like to estimate the common mean . The most natural estimator is the *empirical mean*

Linearity of expectation shows that , which means that is an *unbiased estimator* of . Yet, is a random variable itself. So, can we quantify how far will be from ?

We could use Chebyshev’s inequality to upper bound the probability that is far from :

Using the fact that , we have that

So, we can expect the probability of having a “bad” estimate to go to zero as one over the number of samples in our empirical mean. Is this the best we can get? To understand what we can hope for, let’s take a look at the central limit theorem.

We know that, defining , , the standard Gaussian distribution, as goes to infinity. This means that

where the approximation comes from the central limit theorem. The integral cannot be calculated with a closed form, but we can easily upper bound it. Indeed, for , we have

This is better than what we got with Chebyshev’s inequality and we would like to obtain an exact bound with a similar asymptotic rate. To do that, we will focus our attention on *subgaussian* random variables.

Definition 1We say that a random variable is –subgaussianif for all we have that .

Example 1The following random variable are subgaussian:

- If is Gaussian with mean zero and variance , then is -subgaussian.
- If has mean zero and almost surely, then is -subgaussian.

We have the following properties for subgaussian random variables.

Lemma 2 (Lattimore and Szepesvári, 2018, Lemma 5.4) Assume that and are independent and -subgaussian and -subgaussian respectively. Then,

- = 0 and .
- is -subgaussian.
- is -subgaussian.

Subgaussians random variables behaves like Gaussian random variables, in the sense that their tail probabilities are upper bounded by the ones of a Gaussian of variance . To prove it, let’s first state the Markov’s inequality.

Theorem 3 (Markov’s inequality)For a non-negative random variable and , we have that .

With Markov’s inequality, we can now formalize the above statement on subgaussian random variables.

*Proof:* For any , we have

Minimizing the right hand side of the inequality w.r.t. , we have the stated result.

An easy consequence of the above theorem is that the empirical average of subgaussian random variables concentrates around its expectation, *with the same asymptotic rate in (1)*.

Corollary 5Assume that are independent, -subgaussian random variables. Then, for any , we have

where .

Equating the upper bounds on the r.h.s. of the inequalities in the Corollary to , we have the equivalent statement that, with probability at least , we have

**2. Explore-Then-Commit Algorithm **

We are now ready to present the most natural algorithm for the stochastic bandit setting, called Explore-Then-Commit (ETC) algorithm. That is, we first identify the best arm over exploration rounds and then we commit to it. This algorithm is summarized in Algorithm 2.

In the following, we will denote by , that is the number of times that the arm was pulled in the first rounds.

Define by the expected loss of the arm with the smallest expectation, that is . Critical quantities in our analysis will be the *gaps*, for , that measure the expected difference in losses between the arms and the optimal one. In particular, we can decompose the regret as a sum over the arms of the expected number of times we pull an arm multiplied by its gap.

Lemma 6For any policy of selection of the arms, the regret is upper bounded by

*Proof:* Observe that

Hence,

The above Lemma quantifies the intuition that in order to have a small regret we have to select the suboptimal arms less often then the best one.

We are now ready to prove the regret guarantee of the ETC algorithm.

Theorem 7Assume that the losses of the arms minus their expectations are -subgaussian and . Then, ETC guarantees a regret of

*Proof:* Let’s assume without loss of generality that the optimal arm is the first one.

So, for , we have

From Lemma 2, we have that is -subgaussian. So, from Theorem 4, we have

The bound shows the trade-off between exploration and exploitation: if is too big, we pay too much during the exploration phase (first term in the bound). On the other hand, if is small, the probability to select a suboptimal arm increases (second term in the bound). Knowing all the gaps , it is possible to choose that minimizes the bound.

For example, in that case that , the regret is upper bounded by

that is minimized by

Remembering that must be a natural number we can choose

When , we select . So, we have . Hence, the regret is upper bounded by

The main drawback of this algorithm is that its optimal tuning depends on the gaps . Assuming the knowledge of the gaps account to make the stochastic bandit problem completely trivial. However, its tuned regret bound gives us a baseline to which compare other bandit algorithms. In particular, in the next lecture we will present an algorithm that achieves the same asymptotic regret without any knowledge of the gaps.

**3. History Bits **

The ETC algorithm goes back to (Robbins, H., 1952), even if Robbins proposed what is now called epoch-greedy (Langford, J. and Zhang, T., 2008). For more history on ETC, take a look at chapter 6 in (Lattimore, T. and Szepesvári, C., 2018). The proofs presented here are from (Lattimore, T. and Szepesvári, C., 2018) as well.