This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week.
You can find all the lectures published so far here.
Today, we will consider the stochastic bandit setting. Here, each arm is associated with an unknown probability distribution. At each time step, the algorithm selects one arm $A_t \in \{1, \dots, d\}$ and it receives a loss (or reward) $g_{t, A_t}$ drawn i.i.d. from the distribution of the arm $A_t$. We focus on minimizing the pseudo-regret, that is the regret with respect to the optimal action in expectation, rather than the optimal action on the sequence of realized losses:
$$\text{Regret}_T := \mathbb{E}\left[\sum_{t=1}^T g_{t, A_t}\right] - T \min_{i=1,\dots,d} \mu_i,$$
where we denoted by $\mu_i$ the expectation of the distribution associated with the arm $i$.
Remark 1 The usual notation in the stochastic bandit literature is to consider rewards instead of losses. Instead, to keep our notation coherent with the OCO literature, we will consider losses. The two things are completely equivalent up to a multiplication by $-1$.
Before presenting our first algorithm for stochastic bandits, we will introduce some basic notions on concentration inequalities that will be useful in our definitions and proofs.
1. Concentration Inequalities Bits
Suppose that $X_1, \dots, X_n$ is a sequence of independent and identically distributed random variables, with mean $\mu$ and variance $\sigma^2$. Having observed $X_1, \dots, X_n$, we would like to estimate the common mean $\mu$. The most natural estimator is the empirical mean
$$\hat{\mu} := \frac{1}{n}\sum_{t=1}^n X_t.$$
Linearity of expectation shows that $\mathbb{E}[\hat{\mu}] = \mu$, which means that $\hat{\mu}$ is an unbiased estimator of $\mu$. Yet, $\hat{\mu}$ is a random variable itself. So, can we quantify how far $\hat{\mu}$ will be from $\mu$?
We could use Chebyshev's inequality to upper bound the probability that $\hat{\mu}$ is far from $\mu$:
$$\Pr[|\hat{\mu} - \mu| \geq \epsilon] \leq \frac{\text{Var}[\hat{\mu}]}{\epsilon^2}.$$
Using the fact that $\text{Var}[\hat{\mu}] = \frac{\sigma^2}{n}$, we have that
$$\Pr[|\hat{\mu} - \mu| \geq \epsilon] \leq \frac{\sigma^2}{n\epsilon^2}.$$
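For completeness, the variance computation used above follows from the independence of the $X_t$:
$$\text{Var}[\hat{\mu}] = \text{Var}\left[\frac{1}{n}\sum_{t=1}^n X_t\right] = \frac{1}{n^2}\sum_{t=1}^n \text{Var}[X_t] = \frac{\sigma^2}{n}.$$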
So, we can expect the probability of having a "bad" estimate to go to zero as one over the number of samples in our empirical mean. Is this the best we can get? To understand what we can hope for, let's take a look at the central limit theorem.
We know that, defining $Z_n := \frac{\sqrt{n}(\hat{\mu} - \mu)}{\sigma}$, $Z_n$ converges in distribution to $N(0,1)$, the standard Gaussian distribution, as $n$ goes to infinity. This means that
$$\Pr[\hat{\mu} - \mu \geq \epsilon] = \Pr\left[Z_n \geq \frac{\sqrt{n}\,\epsilon}{\sigma}\right] \approx \int_{\frac{\sqrt{n}\epsilon}{\sigma}}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx,$$
where the approximation comes from the central limit theorem. The integral cannot be calculated with a closed form, but we can easily upper bound it. Indeed, for $a > 0$, we have
$$\int_a^\infty \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx \leq \frac{1}{a\sqrt{2\pi}} e^{-a^2/2},$$
so that
$$\Pr[\hat{\mu} - \mu \geq \epsilon] \lessapprox \sqrt{\frac{\sigma^2}{2\pi n \epsilon^2}}\, \exp\left(-\frac{n\epsilon^2}{2\sigma^2}\right). \qquad (1)$$
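The upper bound on the Gaussian tail can be verified in one line: for $x \geq a > 0$ we have $\frac{x}{a} \geq 1$, so
$$\int_a^\infty \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx \leq \int_a^\infty \frac{x}{a} \cdot \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx = \frac{1}{a\sqrt{2\pi}} e^{-a^2/2}.$$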
This is better than what we got with Chebyshev’s inequality and we would like to obtain an exact bound with a similar asymptotic rate. To do that, we will focus our attention on subgaussian random variables.
Definition 1 We say that a random variable $X$ is $\sigma$-subgaussian if for all $\lambda \in \mathbb{R}$ we have that $\mathbb{E}[\exp(\lambda X)] \leq \exp\left(\frac{\lambda^2\sigma^2}{2}\right)$.
Example 1 The following random variables are subgaussian:
- If $X$ is Gaussian with mean zero and variance $\sigma^2$, then $X$ is $\sigma$-subgaussian.
- If $X$ has mean zero and $|X| \leq B$ almost surely, then $X$ is $B$-subgaussian (see the numerical check below for a concrete instance).
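As a quick numerical sanity check of the second item (this snippet is my own addition, not part of the original notes), take a Rademacher variable $X \in \{-1, +1\}$ with equal probabilities: it has mean zero and is bounded by $B = 1$, so it should be $1$-subgaussian, that is, $\mathbb{E}[\exp(\lambda X)] = \cosh(\lambda) \leq \exp(\lambda^2/2)$ for every $\lambda$:

```python
# Check numerically that a Rademacher variable is 1-subgaussian:
# its moment generating function cosh(lambda) stays below exp(lambda^2 / 2).
import numpy as np

lambdas = np.linspace(-5, 5, 101)
mgf = np.cosh(lambdas)            # E[exp(lambda * X)] in closed form
bound = np.exp(lambdas**2 / 2)    # 1-subgaussian bound
assert np.all(mgf <= bound + 1e-12)
print("max ratio mgf / bound:", float((mgf / bound).max()))  # <= 1
```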
We have the following properties for subgaussian random variables.
Lemma 2 (Lattimore and Szepesvári, 2018, Lemma 5.4) Assume that $X_1$ and $X_2$ are independent and $\sigma_1$-subgaussian and $\sigma_2$-subgaussian respectively. Then,
- $\mathbb{E}[X_1] = 0$ and $\mathbb{E}[X_1^2] \leq \sigma_1^2$.
- $c X_1$ is $|c|\sigma_1$-subgaussian for all $c \in \mathbb{R}$.
- $X_1 + X_2$ is $\sqrt{\sigma_1^2 + \sigma_2^2}$-subgaussian.
Subgaussian random variables behave like Gaussian random variables, in the sense that their tail probabilities are upper bounded by the ones of a Gaussian of variance $\sigma^2$. To prove it, let's first state Markov's inequality.
Theorem 3 (Markov's inequality) For a non-negative random variable $X$ and $\epsilon > 0$, we have that $\Pr[X \geq \epsilon] \leq \frac{\mathbb{E}[X]}{\epsilon}$.
With Markov's inequality, we can now formalize the above statement on subgaussian random variables.
Theorem 4 If $X$ is $\sigma$-subgaussian, then, for any $\epsilon > 0$, we have $\Pr[X \geq \epsilon] \leq \exp\left(-\frac{\epsilon^2}{2\sigma^2}\right)$.
Proof: For any $\lambda > 0$, we have
$$\Pr[X \geq \epsilon] = \Pr[\exp(\lambda X) \geq \exp(\lambda\epsilon)] \leq \frac{\mathbb{E}[\exp(\lambda X)]}{\exp(\lambda\epsilon)} \leq \exp\left(\frac{\lambda^2\sigma^2}{2} - \lambda\epsilon\right).$$
Minimizing the right hand side of the inequality w.r.t. $\lambda$, we have the stated result. $\Box$
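Spelling out the minimization: the exponent $\frac{\lambda^2\sigma^2}{2} - \lambda\epsilon$ is a convex quadratic in $\lambda$, minimized at $\lambda = \frac{\epsilon}{\sigma^2} > 0$, which gives
$$\exp\left(\frac{\lambda^2\sigma^2}{2} - \lambda\epsilon\right)\bigg|_{\lambda = \epsilon/\sigma^2} = \exp\left(\frac{\epsilon^2}{2\sigma^2} - \frac{\epsilon^2}{\sigma^2}\right) = \exp\left(-\frac{\epsilon^2}{2\sigma^2}\right).$$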
An easy consequence of the above theorem is that the empirical average of subgaussian random variables concentrates around its expectation, with the same asymptotic rate as in (1).
Corollary 5 Assume that $X_1 - \mu, \dots, X_n - \mu$ are independent, $\sigma$-subgaussian random variables. Then, for any $\epsilon > 0$, we have
$$\Pr[\hat{\mu} \geq \mu + \epsilon] \leq \exp\left(-\frac{n\epsilon^2}{2\sigma^2}\right) \quad \text{and} \quad \Pr[\hat{\mu} \leq \mu - \epsilon] \leq \exp\left(-\frac{n\epsilon^2}{2\sigma^2}\right),$$
where $\hat{\mu} := \frac{1}{n}\sum_{i=1}^n X_i$.
Equating the upper bounds on the r.h.s. of the inequalities in the Corollary to $\delta$, we have the equivalent statement that, with probability at least $1-\delta$, we have
$$\hat{\mu} \leq \mu + \sqrt{\frac{2\sigma^2 \ln\frac{1}{\delta}}{n}},$$
and the symmetric statement holds for the other tail.
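As a quick illustration (my own snippet, with made-up values), we can compare the width predicted by the corollary for $\sigma = 1$ with the empirical deviations of the sample mean of $n$ i.i.d. standard Gaussians:

```python
# Monte Carlo check of the concentration bound for sigma = 1.
import numpy as np

rng = np.random.default_rng(0)
n, delta, reps = 100, 0.05, 10_000
width = np.sqrt(2 * np.log(1 / delta) / n)   # ~0.245 for these values
dev = np.abs(rng.standard_normal((reps, n)).mean(axis=1))
print("bound width:", width)
# We check both tails, so the failure rate should be below 2 * delta
# (empirically it is much smaller, since the bound is not tight here).
print("fraction of runs exceeding the width:", float((dev > width).mean()))
```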
2. Explore-Then-Commit Algorithm
We are now ready to present the most natural algorithm for the stochastic bandit setting, called the Explore-Then-Commit (ETC) algorithm. That is, we first identify the best arm over $dm$ exploration rounds, pulling each of the $d$ arms $m$ times, and then we commit to it for the remaining rounds. This algorithm is summarized in Algorithm 2.
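To make the procedure concrete, here is a minimal Python sketch of ETC (my own illustration, not the pseudocode of Algorithm 2; the Gaussian losses and the specific values in the example run are made up). It returns the sum of the gaps of the pulled arms, whose expectation is exactly the pseudo-regret, as formalized in Lemma 6 below.

```python
# A minimal sketch of Explore-Then-Commit with simulated Gaussian losses.
# All numeric values below are for illustration only.
import numpy as np

def etc_regret(mu, m, T, sigma=1.0, seed=None):
    """Run ETC on d arms with true mean losses `mu`, pulling each arm m times
    during exploration; return the sum of the gaps of the pulled arms."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    d = len(mu)
    # Exploration phase: pull each arm m times and record its empirical mean loss.
    emp_mean = np.array([rng.normal(mu[i], sigma, size=m).mean() for i in range(d)])
    # Commit phase: play the empirically best arm for the remaining T - d*m rounds.
    best = int(np.argmin(emp_mean))
    # Each pull of arm i contributes its gap; taking expectations over the
    # exploration randomness gives the pseudo-regret (cf. Lemma 6).
    gaps = mu - mu.min()
    return m * gaps.sum() + (T - d * m) * gaps[best]

# Example run: two arms with gap 0.5 and m = 50 exploration pulls per arm.
print(etc_regret(mu=[0.0, 0.5], m=50, T=10_000, seed=0))
```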
In the following, we will denote by $T_i(t) := \sum_{s=1}^{t} \mathbb{1}[A_s = i]$ the number of times that the arm $i$ was pulled in the first $t$ rounds.
Define by $\mu^\star$ the expected loss of the arm with the smallest expectation, that is $\mu^\star := \min_{i=1,\dots,d} \mu_i$. Critical quantities in our analysis will be the gaps, $\Delta_i := \mu_i - \mu^\star$ for $i = 1, \dots, d$, that measure the expected difference in losses between each arm and the optimal one. In particular, we can decompose the regret as a sum over the arms of the expected number of times we pull an arm multiplied by its gap.
Lemma 6 For any policy of selection of the arms, the regret satisfies
$$\text{Regret}_T = \sum_{i=1}^d \Delta_i\, \mathbb{E}[T_i(T)].$$
Proof: Observe that
$$\text{Regret}_T = \mathbb{E}\left[\sum_{t=1}^T g_{t, A_t}\right] - T\mu^\star = \mathbb{E}\left[\sum_{t=1}^T (\mu_{A_t} - \mu^\star)\right] = \mathbb{E}\left[\sum_{t=1}^T \Delta_{A_t}\right],$$
where we used the fact that $\mathbb{E}[g_{t, A_t} \mid A_t] = \mu_{A_t}$. Hence,
$$\text{Regret}_T = \mathbb{E}\left[\sum_{t=1}^T \sum_{i=1}^d \Delta_i\, \mathbb{1}[A_t = i]\right] = \sum_{i=1}^d \Delta_i\, \mathbb{E}\left[\sum_{t=1}^T \mathbb{1}[A_t = i]\right] = \sum_{i=1}^d \Delta_i\, \mathbb{E}[T_i(T)]. \qquad \Box$$
The above Lemma quantifies the intuition that in order to have a small regret we have to select the suboptimal arms less often than the best one.
We are now ready to prove the regret guarantee of the ETC algorithm.
Theorem 7 Assume that the losses of the arms minus their expectations are $1$-subgaussian and $1 \leq dm \leq T$. Then, ETC guarantees a regret of
$$\text{Regret}_T \leq m\sum_{i=1}^d \Delta_i + (T - md)\sum_{i=1}^d \Delta_i \exp\left(-\frac{m\Delta_i^2}{4}\right).$$
Proof: Let's assume without loss of generality that the optimal arm is the first one, that is $\mu_1 = \mu^\star$. From Lemma 6 and the fact that after the exploration phase ETC always plays the arm with the smallest empirical mean, we have
$$\text{Regret}_T = \sum_{i=1}^d \Delta_i\, \mathbb{E}[T_i(T)] \leq m\sum_{i=1}^d \Delta_i + (T - md)\sum_{i=1}^d \Delta_i\, \Pr[\hat{\mu}_i(md) \leq \min_{j\neq i} \hat{\mu}_j(md)],$$
where $\hat{\mu}_i(md)$ denotes the empirical mean of the losses of arm $i$ at the end of the exploration phase.
So, for $i \neq 1$, we have
$$\Pr[\hat{\mu}_i(md) \leq \min_{j\neq i} \hat{\mu}_j(md)] \leq \Pr[\hat{\mu}_i(md) \leq \hat{\mu}_1(md)] = \Pr[\hat{\mu}_i(md) - \mu_i - (\hat{\mu}_1(md) - \mu_1) \leq -\Delta_i].$$
From Lemma 2, we have that $\hat{\mu}_i(md) - \mu_i - (\hat{\mu}_1(md) - \mu_1)$ is $\sqrt{2/m}$-subgaussian. So, from Theorem 4, we have
$$\Pr[\hat{\mu}_i(md) - \mu_i - (\hat{\mu}_1(md) - \mu_1) \leq -\Delta_i] \leq \exp\left(-\frac{m\Delta_i^2}{4}\right). \qquad \Box$$
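To spell out the subgaussian algebra used in the last two steps: by Lemma 2, $\hat{\mu}_i(md) - \mu_i$ is the average of $m$ independent $1$-subgaussian variables, hence $\frac{1}{\sqrt{m}}$-subgaussian (the sum is $\sqrt{m}$-subgaussian and multiplying by $\frac{1}{m}$ rescales the constant), and the difference of two such independent variables is $\sqrt{\frac{1}{m} + \frac{1}{m}} = \sqrt{\frac{2}{m}}$-subgaussian. Applying Theorem 4 to its negation with $\epsilon = \Delta_i$ and $\sigma = \sqrt{2/m}$ gives
$$\exp\left(-\frac{\Delta_i^2}{2 \cdot \frac{2}{m}}\right) = \exp\left(-\frac{m\Delta_i^2}{4}\right).$$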
The bound shows the trade-off between exploration and exploitation: if $m$ is too big, we pay too much during the exploration phase (first term in the bound). On the other hand, if $m$ is small, the probability to select a suboptimal arm increases (second term in the bound). Knowing all the gaps $\Delta_i$, it is possible to choose the $m$ that minimizes the bound.
For example, in the case that $d = 2$, denoting by $\Delta := \Delta_2$ the gap of the suboptimal arm, the regret is upper bounded by
$$\text{Regret}_T \leq m\Delta + T\Delta \exp\left(-\frac{m\Delta^2}{4}\right),$$
that is minimized by
$$m = \frac{4}{\Delta^2}\ln\frac{T\Delta^2}{4}.$$
Remembering that $m$ must be a natural number, we can choose
$$m = \max\left(\left\lceil \frac{4}{\Delta^2}\ln\frac{T\Delta^2}{4}\right\rceil, 1\right).$$
When $\frac{T\Delta^2}{4} \leq 1$, we select $m = 1$. So, we have $\text{Regret}_T \leq \Delta + (T-2)\Delta \leq T\Delta \leq \frac{4}{\Delta}$. Hence, the regret is upper bounded by
$$\text{Regret}_T \leq \Delta + \frac{4}{\Delta}\left(1 + \max\left(\ln\frac{T\Delta^2}{4}, 0\right)\right).$$
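As a quick numerical illustration (the values of $T$ and $\Delta$ are made up), here is the tuned exploration length and the resulting bound, compared with the trivial bound $T\Delta$:

```python
# Tuned exploration length for the two-armed case and the resulting regret bound.
# T and Delta are arbitrary illustration values.
import math

T, Delta = 10_000, 0.1
m = max(math.ceil(4 / Delta**2 * math.log(T * Delta**2 / 4)), 1)
bound = Delta + 4 / Delta * (1 + max(math.log(T * Delta**2 / 4), 0.0))
print("m =", m)                          # 1288 exploration pulls per arm
print("tuned bound =", round(bound, 1))  # ~168.9, versus the trivial T * Delta = 1000.0
```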
The main drawback of this algorithm is that its optimal tuning depends on the gaps $\Delta_i$. Assuming knowledge of the gaps amounts to making the stochastic bandit problem completely trivial. However, its tuned regret bound gives us a baseline against which to compare other bandit algorithms. In particular, in the next lecture we will present an algorithm that achieves the same asymptotic regret without any knowledge of the gaps.
3. History Bits
The ETC algorithm goes back to (Robbins, H., 1952), even though Robbins proposed what is now called epoch-greedy (Langford, J. and Zhang, T., 2008). For more history on ETC, take a look at Chapter 6 in (Lattimore, T. and Szepesvári, C., 2018). The proofs presented here are from (Lattimore, T. and Szepesvári, C., 2018) as well.
Thanks for the tutorial. May I know how you obtained $\Delta + (T-2)\Delta \leq T\Delta \leq \frac{4}{\Delta}$ for $m=1$?
Hi Jeng, it is enough to use the condition $\frac{T\Delta^2}{4} \leq 1$.
Nice tutorial! Could you explain why there is an additional $+\Delta$ in the final upper bound? When I substitute $m$ into the bound I don't get that extra $+\Delta$. Thanks in advance
It comes from the fact that you need to take the ceiling of the optimal value of $m$ to obtain an integer number. Note that in the case where the optimal $m$ is 1, the $+\Delta$ is not needed, but it is a small overapproximation to make the expression simpler.