* You can find the other lectures here.*

In this lecture, we will explore the link between Online Learning and Statistical Learning Theory.

**1. Agnostic PAC Learning**

We now consider a different setting from what we have seen till now. We will assume that we have a prediction strategy $\phi_{x}$ parametrized by a vector $x$ and we want to learn the relationship between an input $\omega$ and its associated label $y$. Moreover, we will assume that $(\omega, y)$ is drawn from a joint probability distribution $\rho$. Also, we are equipped with a loss function $\ell$ that measures how good our prediction $\hat{y} = \phi_{x}(\omega)$ is compared to the true label $y$, that is $\ell(\hat{y}, y)$. So, learning the relationship can be cast as minimizing the expected loss of our predictor

$$\min_{x} \ \mathbb{E}_{(\omega, y) \sim \rho}\left[\ell(\phi_{x}(\omega), y)\right].$$

In machine learning terms, the object above is nothing else than the *test error* of our predictor.

Note that the above setting assumes labeled samples, but we can generalize it even more by considering *Vapnik’s general setting of learning*, where we collapse the prediction function and the loss into a unique function. This allows us, for example, to treat supervised and unsupervised learning in the same unified way. So, we want to minimize the *risk*

$$\min_{x \in V} \ \text{Risk}(x) := \mathbb{E}_{\xi \sim \rho}\left[f(x, \xi)\right],$$

where $\rho$ is an unknown distribution over $D$ and $f: \mathbb{R}^d \times D \to \mathbb{R}$ is measurable w.r.t. the second argument. Also, the set $\mathcal{F}$ of all predictors that can be expressed by vectors $x$ in $V$ is called the *hypothesis class*.

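Example 1. In a linear regression task where the loss is the square loss, we have $\xi = (\omega, y) \in \mathbb{R}^{d+1}$ and $\phi_{x}(\omega) = \langle \omega, x \rangle$. Hence, $f(x, (\omega, y)) = (\langle \omega, x \rangle - y)^2$.

Example 2. In linear binary classification where the loss is the hinge loss, we have $\xi = (\omega, y) \in \mathbb{R}^{d} \times \{-1, 1\}$ and $\phi_{x}(\omega) = \langle \omega, x \rangle$. Hence, $f(x, (\omega, y)) = \max(1 - y \langle \omega, x \rangle, 0)$.

Example 3. In binary classification with a neural network with the logistic loss, we have $\xi = (\omega, y) \in \mathbb{R}^{d} \times \{-1, 1\}$ and $\phi_{x}$ is the network corresponding to the weights $x$. Hence, $f(x, (\omega, y)) = \ln(1 + \exp(-y \phi_{x}(\omega)))$.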

The key difficulty of the above problem is that we don’t know the distribution $\rho$. Hence, there is no hope to exactly solve this problem. Instead, we are interested in understanding *what is the best we can do if we have access to $T$ samples drawn i.i.d. from $\rho$*. In more detail, we want to upper bound the *excess risk*

$$\text{Risk}(x_T) - \min_{x} \ \text{Risk}(x),$$

where $x_T$ is a predictor that was *learned* using $T$ samples.

It should be clear that this is just an optimization problem and we are interested in upper bounding the suboptimality gap. In this view, the objective of machine learning can be considered as a particular optimization problem.

Remark 1. Note that this is not the only way to approach the problem of learning. Indeed, the regret minimization model is an alternative model of learning. Moreover, another approach would be to try to estimate the distribution $\rho$ and then solve the risk minimization problem, the approach usually taken in Statistics. No approach is superior to the others and each of them has its pros and cons.

Given that we have access to the distribution $\rho$ only through samples drawn from it, any procedure we might think to use to minimize the risk will be stochastic in nature. This means that we cannot give a deterministic guarantee. Instead, *we can try to prove that with high probability our minimization procedure will return a solution that is close to the minimizer of the risk*. It is also intuitive that the precision and probability we can guarantee must depend on how many samples we draw from $\rho$.

Quantifying the dependency of precision and probability of failure on the number of samples used is the objective of the **Agnostic Probably Approximately Correct** (PAC) framework, where the keyword “agnostic” refers to the fact that we don’t assume anything on the best possible predictor. In more detail, given a precision parameter $\epsilon$ and a probability of failure $\delta$, we are interested in characterizing the *sample complexity of the hypothesis class $\mathcal{F}$*, which is defined as the number of samples $T$ necessary to guarantee with probability at least $1-\delta$ that the best learning algorithm using the hypothesis class $\mathcal{F}$ outputs a solution $x_T$ that has an excess risk upper bounded by $\epsilon$. Note that the sample complexity does not depend on $\rho$, so it is a worst-case measure w.r.t. all the possible distributions. This makes sense if you think that we know nothing about the distribution $\rho$, so if your guarantee holds for the worst distribution it will also hold for any other distribution. Mathematically, we will say that the hypothesis class is agnostic PAC-learnable if such a sample complexity function exists.

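Definition 1. We will say that a function class $\mathcal{F} = \{f(x, \cdot) : x \in \mathbb{R}^d\}$ is *Agnostic-PAC-learnable* if there exists an algorithm $\mathcal{A}$ and a function $T(\epsilon, \delta)$ such that when $\mathcal{A}$ is used with $T \geq T(\epsilon, \delta)$ samples drawn from $\rho$, with probability at least $1-\delta$ the solution $x_T$ returned by the algorithm has excess risk at most $\epsilon$.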

Note that the Agnostic PAC learning setting does not say what procedure we should follow to achieve such a sample complexity. The approach most commonly used in machine learning to solve the learning problem is the so-called *Empirical Risk Minimization (ERM) problem*. It consists of drawing $T$ samples $\xi_1, \dots, \xi_T$ i.i.d. from $\rho$ and minimizing the *empirical risk*:

$$\widehat{\text{Risk}}(x) := \frac{1}{T} \sum_{t=1}^T f(x, \xi_t).$$

In words, ERM is nothing else than minimizing the error on a training set. However, in many interesting cases the minimizer of the empirical risk can be very far from the true optimum of the risk, even with an infinite number of samples! So, we need to modify the ERM formulation in some way, e.g., using a *regularization* term or a Bayesian prior on the parameters, or find conditions under which ERM works.
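As a minimal sketch of ERM (not part of the original lecture), consider one-dimensional linear regression with the square loss, so that each sample is a pair (input, label) and the empirical risk is the average squared error. The function names `empirical_risk`, `erm_1d`, and `draw_samples`, as well as the synthetic data model, are assumptions made only for illustration, and the closed-form minimizer is specific to this toy case.

```python
import random

def empirical_risk(x, samples):
    """Average square loss of the linear predictor omega -> x * omega."""
    return sum((x * omega - y) ** 2 for omega, y in samples) / len(samples)

def erm_1d(samples):
    """Minimizer of the empirical risk for 1-d least squares (closed form)."""
    num = sum(omega * y for omega, y in samples)
    den = sum(omega * omega for omega, _ in samples)
    return num / den if den > 0 else 0.0

def draw_samples(n, true_x=2.0, noise=0.1, seed=0):
    """Hypothetical data model: y = true_x * omega + Gaussian noise."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        omega = rng.uniform(-1.0, 1.0)
        out.append((omega, true_x * omega + rng.gauss(0.0, noise)))
    return out
```

In this well-behaved example the empirical minimizer lands close to the parameter that generated the data; the warning in the text concerns richer hypothesis classes, where this need not happen.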

The ERM approach is so widespread that machine learning itself is often wrongly identified with some kind of minimization of the training error. We now show that ERM is not the entire world of ML, showing that *the existence of a no-regret algorithm, that is an online learning algorithm with sublinear regret, guarantees Agnostic-PAC learnability*. In more detail, we will show that an online algorithm with sublinear regret can be used to solve machine learning problems. This is not just a curiosity: for example, this gives rise to computationally efficient parameter-free algorithms, a result that can be achieved through ERM only by running a two-step procedure, i.e., running ERM with different parameters and selecting the best solution among them.

We already mentioned this possibility when we talked about the online-to-batch conversion, but this time we will strengthen it proving high probability guarantees rather than expectation ones.

So, we need some more bits on concentration inequalities.

**2. Bits on Concentration Inequalities**

We will use a concentration inequality to prove the high probability guarantee, but we will need to go beyond sums of i.i.d. random variables. In particular, we will use the concept of *martingales*.

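Definition 2. A sequence of random variables $Z_1, Z_2, \dots$ is called a *martingale* if for all $t \geq 1$ it satisfies:

$$\mathbb{E}[|Z_t|] < \infty, \qquad \mathbb{E}[Z_{t+1} | Z_1, \dots, Z_t] = Z_t.$$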

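Example 4. Consider a fair coin $c_t \in \{-1, 1\}$ and a betting algorithm that bets $|x_t|$ money on each round on the side of the coin equal to $\text{sign}(x_t)$, where $x_t$ is chosen using only the outcomes of the previous rounds. We win or lose money 1:1, so the total money we won up to round $t$ is $Z_t = \sum_{i=1}^t c_i x_i$. $Z_1, \dots, Z_t$ is a martingale. Indeed, we have

$$\mathbb{E}[Z_t | Z_1, \dots, Z_{t-1}] = Z_{t-1} + \mathbb{E}[c_t x_t | Z_1, \dots, Z_{t-1}] = Z_{t-1}.$$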

For bounded martingales we can prove high probability guarantees as for bounded i.i.d. random variables. The following Theorem will be the key result we will need.

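Theorem 3 (Hoeffding-Azuma inequality). Let $Z_1, \dots, Z_T$ be a martingale of $T$ random variables that satisfy $|Z_t - Z_{t+1}| \leq B$, $t = 1, \dots, T-1$, almost surely. Then, we have

$$\mathbb{P}[Z_T - Z_1 \geq \epsilon] \leq \exp\left(-\frac{\epsilon^2}{2 B^2 T}\right).$$

Also, the same upper bound holds on $\mathbb{P}[Z_1 - Z_T \geq \epsilon]$.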
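To make the inequality concrete, here is a small simulation sketch of the coin-betting martingale of Example 4. It uses the standard form of the Hoeffding-Azuma bound for a martingale started at zero whose increments are bounded by `B` in absolute value, namely a tail probability of at most exp(-eps^2 / (2 B^2 T)) after `T` rounds. The betting rule (bet `B` when losing, `B/2` otherwise) and all function names are hypothetical choices, made only so that the bet depends on the past.

```python
import math
import random

def betting_wealth(T, rng, B=1.0):
    """Total winnings after T fair-coin rounds; the bet uses past outcomes only."""
    z = 0.0
    for _ in range(T):
        bet = B if z < 0 else B / 2      # decided before the coin is flipped
        coin = rng.choice((-1.0, 1.0))   # fair coin
        z += coin * bet                  # increment bounded by B in absolute value
    return z

def tail_frequency(T, eps, runs=5000, seed=0):
    """Empirical frequency of the event {Z_T >= eps} over independent runs."""
    rng = random.Random(seed)
    hits = sum(betting_wealth(T, rng) >= eps for _ in range(runs))
    return hits / runs

def azuma_bound(T, eps, B=1.0):
    """Hoeffding-Azuma upper bound on P[Z_T >= eps] for a martingale with Z_0 = 0."""
    return math.exp(-eps ** 2 / (2 * B ** 2 * T))
```

Comparing `tail_frequency(100, 20.0)` with `azuma_bound(100, 20.0)` shows that the bound holds but can be loose, as expected from a worst-case statement.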

**3. From Regret to Agnostic PAC**

We now show how the online-to-batch conversion we introduced before gives us a high probability guarantee for our machine learning problem.

Theorem 4.Let , where the expectation is w.r.t. drawn from with support over some vector space and . Draw samples i.i.d. from and construct the sequence of losses . Run any online learning algorithm over the losses , to construct the sequence of predictions . Then, we have with probability at least , it holds that

*Proof:* Define . We claim that is a martingale. In fact, we have

where we used the fact that $x_t$ depends only on $\xi_1, \dots, \xi_{t-1}$. Hence, we have

that proves our claim.

Hence, using Theorem 3, we have

This implies that, with probability at least , we have

or equivalently

We now use the definition of regret w.r.t. any , to have

The last step is to upper bound with high probability with . This is easier than the previous upper bound because is a fixed vector, so are i.i.d. random variables, so for sure forms a martingale. So, reasoning as above, we have that with probability at least it holds that

Putting all together and using the union bound, we have the stated bound.

The theorem above upper bounds the average risk of the predictors, while we are interested in producing a single predictor. If the risk is a convex function and the feasible set is convex, then we can lower bound the l.h.s. of the inequalities in the theorem with the risk evaluated on the average of the predictors. That is

If the risk is not a convex function, we need a way to generate a single solution with small risk. One possibility is to construct a *stochastic classifier* that samples one of the with uniform probability and predicts with it. For this classifier, we immediately have

where the expectation in the definition of the risk of the stochastic classifier is also with respect to the random index. Yet another way is to select, among the predictors, the one with the smallest risk. This works because the average is lower bounded by the minimum. This is easily achieved using some of the samples for the online learning procedure and the remaining samples to generate a validation set to evaluate the solutions and pick the best one. The following Theorem shows that selecting the predictor with the smallest empirical risk on a validation set will give us a predictor close to the best one with high probability.

Theorem 5.We have a finite set of predictors and a dataset of samples drawn i.i.d. from . Denote by . Then, with probability at least , we have

*Proof:* We want to calculate the probability that the hypothesis that minimizes the validation error is far from the best hypothesis in the set. We cannot do it directly because we don’t have the required independence to use a concentration inequality. Instead, *we will upper bound the probability that there exists at least one function whose empirical risk is far from the risk.* So, we have

Hence, with probability at least , we have that for all

We are now able to upper bound the risk of , just using the fact that the above applies to too. So, we have

where in the last inequality we used the fact that minimizes the empirical risk.

Using this theorem, we can use samples for the training and samples for the validation. Denoting by the predictor with the best empirical risk on the validation set among the generated during the online procedure, we have with probability at least that

It is important to note that with any of the above three methods to select one among the generated by the online learning procedure, the sample complexity guarantee we get matches the one we would have obtained by ERM, up to polylogarithmic factors. In other words, there is nothing special about ERM compared to the online learning approach to statistical learning.
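The three ways of extracting a single predictor discussed above (averaging, sampling an iterate uniformly at random, and selecting on a validation set) can be sketched on a toy problem. This is only an illustration under stated assumptions: the online algorithm is projected online gradient descent with a hypothetical step size of 1/sqrt(t), the problem is one-dimensional least squares with synthetic data, and all function names are made up for this example.

```python
import random

def sample(rng, true_x=1.5, noise=0.1):
    """Hypothetical data model: one (input, label) pair."""
    omega = rng.uniform(-1.0, 1.0)
    return omega, true_x * omega + rng.gauss(0.0, noise)

def loss(x, xi):
    omega, y = xi
    return (x * omega - y) ** 2

def grad(x, xi):
    omega, y = xi
    return 2.0 * (x * omega - y) * omega

def online_gradient_descent(data, radius=5.0):
    """Projected OGD on [-radius, radius]; returns the iterates x_1, ..., x_T."""
    x, iterates = 0.0, []
    for t, xi in enumerate(data, start=1):
        iterates.append(x)                 # x_t is fixed before sample t is used
        x -= grad(x, xi) / (t ** 0.5)      # assumed step size 1/sqrt(t)
        x = max(-radius, min(radius, x))   # projection onto the feasible set
    return iterates

def online_to_batch(T=1000, n_val=200, seed=0):
    rng = random.Random(seed)
    train = [sample(rng) for _ in range(T)]
    val = [sample(rng) for _ in range(n_val)]
    xs = online_gradient_descent(train)
    averaged = sum(xs) / len(xs)           # valid choice when the risk is convex
    randomized = rng.choice(xs)            # one draw of the stochastic predictor
    def val_risk(x):
        return sum(loss(x, xi) for xi in val) / len(val)
    validated = min(xs, key=val_risk)      # smallest empirical risk on validation
    return averaged, randomized, validated
```

Note that the validation samples are drawn separately from the ones used by the online algorithm, which is what makes the iterates a fixed finite set of predictors from the point of view of the validation step.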

Another important point is that the above guarantee does not imply the existence of online learning algorithms with sublinear regret for any learning problem. It just says that, if it exists, it can be used in the statistical setting too.

**4. History Bits**

Theorem 4 is from (Cesa-Bianchi, N. and Conconi, A. and Gentile, C., 2004). Theorem 5 is nothing else than the Agnostic PAC learning guarantee for ERM for hypothesis classes with finite cardinality. (Cesa-Bianchi, N. and Conconi, A. and Gentile, C., 2004) also gives an alternative procedure to select a single hypothesis among the ones generated during the online procedure that does not require splitting the data into training and validation sets. However, the obtained guarantee matches the one we have proved.

* You can find the lectures I published till now here.*

In the last lecture, we introduced the Explore-Then-Commit (ETC) algorithm that solves the stochastic bandit problem, but requires the knowledge of the *gaps*. This time we will introduce a parameter-free strategy that achieves the same optimal regret guarantee.

**1. Upper Confidence Bound Algorithm**

The ETC algorithm has the disadvantage of requiring the knowledge of the gaps to tune the exploration phase. Moreover, it solves the exploration vs. exploitation trade-off in a clunky way. It would be better to have an algorithm that smoothly transitions from one phase into the other *in a data-dependent way*. So, we now describe an optimal and adaptive strategy called the Upper Confidence Bound (UCB) algorithm. It employs the principle of *optimism in the face of uncertainty*, to select in each round the arm that has the *potential to be the best one*.

UCB works by keeping an estimate of the expected loss of each arm and also a confidence interval at a certain probability. Roughly speaking, we have that with probability at least

where the “roughly” comes from the fact that is a random variable itself. Then, UCB will query the arm with the smallest lower bound, that is the one that could potentially have the smallest expected loss.

Remark 1. The name Upper Confidence Bound comes from the fact that traditionally stochastic bandits are defined over rewards, rather than losses. So, in our case we actually use the lower confidence bound in the algorithm. However, to avoid confusion with the literature, we still call it Upper Confidence Bound algorithm.

The key points in the proof are on how to choose the right confidence level and how to get around the dependency issues.

The algorithm is summarized in Algorithm 1 and we can prove the following regret bound.

Theorem 1.Assume that the rewards of the arms are -subgaussian and and let . Then, UCB guarantees a regret of

*Proof:* We analyze one arm at a time. Also, without loss of generality, assume that the optimal arm is the first one. For arm , we want to prove that .

The proof is based on the fact that once I have sampled an arm enough times, the probability to take a suboptimal arm is small.

Let the biggest time index such that . If , then the statement above is true. Hence, we can safely assume , we have

Consider and such that , then we claim that at least one of the following two inequalities must be true:

If the first one is true, the confidence interval around our estimate of the expectation of the optimal arm does not contain . On the other hand, if the second one is true the confidence interval around our estimate of the expectation does not contain . So, we claim that if and we selected a suboptimal arm, then at least one of these two bad events happened.

Let’s prove the claim: *if both the inequalities above are false*, , and , we have

that, by the selection strategy of the algorithm, would imply .

Note that . Hence, we have

Now, we upper bound the probabilities in the sum. First, note that, given that the losses on the arms are i.i.d., we have

Hence, we have

Given that the same bound holds for , we have

Using the decomposition of the regret we proved last time, , we have the stated bound.

It is instructive to observe an actual run of the algorithm. I have considered 5 arms and Gaussian losses. In the left plot of the figure below, I have plotted how the estimates and confidence intervals of UCB vary over time (in blue), compared to the actual true means (in black). On the right side, you can see the number of times each arm was pulled by the algorithm.

It is interesting to note that the logarithmic factor in the confidence term will make the confidence intervals of the arms that are not pulled *increase* over time. In turn, this will ensure that the algorithm does not miss the optimal arm, even if the estimates were off. Also, the algorithm will keep pulling the two arms that are close together, to be sure which one is the best of the two.
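A run like the one described above can be reproduced with a short simulation. The sketch below is an assumption-laden illustration rather than the exact Algorithm 1: it pulls each arm once and then selects the arm with the smallest index of the form "empirical mean minus sqrt(2 * alpha * sigma^2 * ln(t) / pulls)". The parameter names `alpha` and `sigma`, and the specific constants in the confidence width, are hypothetical.

```python
import math
import random

def ucb_losses(means, T, alpha=3.0, sigma=1.0, seed=0):
    """Optimistic index policy for losses: pull the arm with the smallest
    lower confidence bound. Returns the number of pulls of each arm."""
    rng = random.Random(seed)
    d = len(means)
    pulls = [0] * d
    totals = [0.0] * d
    for t in range(1, T + 1):
        if t <= d:
            arm = t - 1                       # initialization: pull every arm once
        else:
            def lower_bound(i):
                width = math.sqrt(2 * alpha * sigma ** 2 * math.log(t) / pulls[i])
                return totals[i] / pulls[i] - width
            arm = min(range(d), key=lower_bound)
        observed = rng.gauss(means[arm], sigma)   # Gaussian loss of the pulled arm
        pulls[arm] += 1
        totals[arm] += observed
    return pulls
```

For instance, `ucb_losses([0.0, 1.0, 1.0, 1.0, 1.0], T=5000)` returns pull counts in which the first arm dominates, while the other arms keep being pulled occasionally because their confidence widths grow with ln(t) when they are left alone.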

The bound above can become meaningless if the gaps are too small. So, here we prove another bound that does not depend on the inverse of the gaps.

Theorem 2.Assume that the rewards of the arms minus their expectations are -subgaussian and let . Then, UCB guarantees a regret of

*Proof:* Let be some value to be tuned subsequently and recall from the proof of Theorem 1 that for each suboptimal arm we can bound

Hence, using the regret decomposition we proved last time, we have

Choosing , we have the stated bound.

Remark 2. Note that while the UCB algorithm is considered parameter-free, we still have to know the subgaussianity of the arms. While this can be easily upper bounded for stochastic arms with bounded support, it is unclear how to do it without any prior knowledge on the distribution of the arms.

It is possible to prove that the UCB algorithm is asymptotically optimal, in the sense of the following Theorem.

Theorem 3 (Bubeck, S. and Cesa-Bianchi, N. , 2012, Theorem 2.2).Consider a strategy that satisfies for any set of Bernoulli rewards distributions, any arm with and any . Then, for any set of Bernoulli reward distributions, the following holds

**2. History Bits**

The use of confidence bounds and the idea of optimism first appeared in the work by (Lai, T. L. and Robbins, H., 1985). The first version of UCB is by (Lai, T. L., 1987). The version of UCB I presented is by (Auer, P. and Cesa-Bianchi, N. and Fischer, P., 2002) under the name UCB1. Note that, rather than considering 1-subgaussian environments, (Auer, P. and Cesa-Bianchi, N. and Fischer, P., 2002) considers bandits where the rewards are confined to the interval $[0,1]$. The proof of Theorem 1 is a minor variation of the one of Theorem 2.1 in (Bubeck, S. and Cesa-Bianchi, N., 2012), which also popularized the subgaussian setup. Theorem 2 is from (Bubeck, S. and Cesa-Bianchi, N., 2012).

**3. Exercises**


Exercise 1. Prove a similar regret bound to the one in Theorem 2 for an optimally tuned Explore-Then-Commit algorithm.

* You can find the lectures I published till now here.*

Today, we will consider the *stochastic bandit* setting. Here, each arm is associated with an unknown probability distribution. At each time step, the algorithm selects one arm and it receives a loss (or reward) drawn i.i.d. from the distribution of the arm . We focus on minimizing the *pseudo-regret*, that is the regret with respect to the optimal action in expectation, rather than the optimal action on the sequence of realized losses:

where we denoted by the expectation of the distribution associated with the arm .

Remark 1. The usual notation in the stochastic bandit literature is to consider rewards instead of losses. Instead, to keep our notation coherent with the OCO literature, we will consider losses. The two things are completely equivalent up to a multiplication by $-1$.

Before presenting our first algorithm for stochastic bandits, we will introduce some basic notions on concentration inequalities that will be useful in our definitions and proofs.

**1. Concentration Inequalities Bits**

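Suppose that $X_1, X_2, \dots, X_n$ is a sequence of independent and identically distributed random variables with mean $\mu = \mathbb{E}[X_1]$ and variance $\sigma^2 = \text{Var}[X_1]$. Having observed $X_1, X_2, \dots, X_n$ we would like to estimate the common mean $\mu$. The most natural estimator is the *empirical mean*

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^n X_i.$$

Linearity of expectation shows that $\mathbb{E}[\hat{\mu}] = \mu$, which means that $\hat{\mu}$ is an *unbiased estimator* of $\mu$. Yet, $\hat{\mu}$ is a random variable itself. So, can we quantify how far $\hat{\mu}$ will be from $\mu$?

We could use Chebyshev’s inequality to upper bound the probability that $\hat{\mu}$ is far from $\mu$:

$$\mathbb{P}[|\hat{\mu} - \mu| \geq \epsilon] \leq \frac{\text{Var}[\hat{\mu}]}{\epsilon^2}.$$

Using the fact that $\text{Var}[\hat{\mu}] = \frac{\sigma^2}{n}$, we have that

$$\mathbb{P}[|\hat{\mu} - \mu| \geq \epsilon] \leq \frac{\sigma^2}{n \epsilon^2}.$$

So, we can expect the probability of having a “bad” estimate to go to zero as one over the number of samples in our empirical mean. Is this the best we can get? To understand what we can hope for, let’s take a look at the central limit theorem.

We know that, defining $S_n = \sum_{i=1}^n (X_i - \mu)$, $\frac{S_n}{\sqrt{n \sigma^2}} \to N(0,1)$, the standard Gaussian distribution, as $n$ goes to infinity. This means that

$$\mathbb{P}[\hat{\mu} - \mu \geq \epsilon] = \mathbb{P}\left[\frac{S_n}{\sqrt{n \sigma^2}} \geq \epsilon \sqrt{\frac{n}{\sigma^2}}\right] \approx \int_{\epsilon \sqrt{n/\sigma^2}}^{\infty} \frac{1}{\sqrt{2 \pi}} \exp\left(-\frac{x^2}{2}\right) dx,$$

where the approximation comes from the central limit theorem. The integral cannot be calculated with a closed form, but we can easily upper bound it. Indeed, for $a > 0$, we have

$$\int_{a}^{\infty} \frac{1}{\sqrt{2 \pi}} \exp\left(-\frac{x^2}{2}\right) dx \leq \frac{1}{a \sqrt{2 \pi}} \int_{a}^{\infty} x \exp\left(-\frac{x^2}{2}\right) dx = \frac{1}{a \sqrt{2 \pi}} \exp\left(-\frac{a^2}{2}\right).$$

Setting $a = \epsilon \sqrt{n/\sigma^2}$, this gives

$$\mathbb{P}[\hat{\mu} - \mu \geq \epsilon] \lessapprox \sqrt{\frac{\sigma^2}{2 \pi \epsilon^2 n}} \exp\left(-\frac{n \epsilon^2}{2 \sigma^2}\right). \qquad (1)$$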

This is better than what we got with Chebyshev’s inequality and we would like to obtain an exact bound with a similar asymptotic rate. To do that, we will focus our attention on *subgaussian* random variables.
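To get a feeling for the gap between the two rates, the following sketch evaluates the Chebyshev bound sigma^2 / (n eps^2) next to a two-sided exponential bound of the form 2 exp(-n eps^2 / (2 sigma^2)), which is the rate suggested by the central limit theorem argument. The function names and the default values of `eps` and `sigma` are hypothetical; both bounds are capped at 1 since they bound probabilities.

```python
import math

def chebyshev_bound(n, eps, sigma=1.0):
    """Chebyshev: P(|mean - mu| >= eps) <= sigma^2 / (n eps^2)."""
    return min(1.0, sigma ** 2 / (n * eps ** 2))

def exponential_bound(n, eps, sigma=1.0):
    """Two-sided Gaussian-style bound 2 exp(-n eps^2 / (2 sigma^2))."""
    return min(1.0, 2.0 * math.exp(-n * eps ** 2 / (2 * sigma ** 2)))

def compare(eps=0.5, sigma=1.0, sizes=(10, 100, 1000)):
    """Rows of (n, Chebyshev bound, exponential bound)."""
    return [(n, chebyshev_bound(n, eps, sigma), exponential_bound(n, eps, sigma))
            for n in sizes]
```

For very small sample sizes the exponential bound can actually be the larger of the two, because of its constant factor; the advantage shows up as the number of samples grows, where it decays exponentially rather than as one over the number of samples.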

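Definition 1. We say that a random variable $X$ is *$\sigma$-subgaussian* if for all $\lambda \in \mathbb{R}$ we have that $\mathbb{E}[\exp(\lambda X)] \leq \exp(\lambda^2 \sigma^2 / 2)$.

Example 1. The following random variables are subgaussian:

- If $X$ is Gaussian with mean zero and variance $\sigma^2$, then $X$ is $\sigma$-subgaussian.
- If $X$ has mean zero and $X \in [a, b]$ almost surely, then $X$ is $(b-a)/2$-subgaussian.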

We have the following properties for subgaussian random variables.

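Lemma 2 (Lattimore and Szepesvári, 2018, Lemma 5.4). Assume that $X_1$ and $X_2$ are independent and $\sigma_1$-subgaussian and $\sigma_2$-subgaussian respectively. Then,

- $\mathbb{E}[X_1] = 0$ and $\text{Var}[X_1] \leq \sigma_1^2$.
- $c X_1$ is $|c| \sigma_1$-subgaussian for all $c \in \mathbb{R}$.
- $X_1 + X_2$ is $\sqrt{\sigma_1^2 + \sigma_2^2}$-subgaussian.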

Subgaussian random variables behave like Gaussian random variables, in the sense that their tail probabilities are upper bounded by the ones of a Gaussian of variance $\sigma^2$. To prove it, let’s first state Markov’s inequality.

Theorem 3 (Markov’s inequality). For a non-negative random variable $X$ and $\epsilon > 0$, we have that $\mathbb{P}[X \geq \epsilon] \leq \frac{\mathbb{E}[X]}{\epsilon}$.

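With Markov’s inequality, we can now formalize the above statement on subgaussian random variables.

Theorem 4. If a random variable $X$ is $\sigma$-subgaussian, then $\mathbb{P}[X \geq \epsilon] \leq \exp\left(-\frac{\epsilon^2}{2 \sigma^2}\right)$ for any $\epsilon \geq 0$.

*Proof:* For any $\lambda > 0$, we have

$$\mathbb{P}[X \geq \epsilon] = \mathbb{P}[\exp(\lambda X) \geq \exp(\lambda \epsilon)] \leq \mathbb{E}[\exp(\lambda X)] \exp(-\lambda \epsilon) \leq \exp\left(\frac{\lambda^2 \sigma^2}{2} - \lambda \epsilon\right).$$

Minimizing the right hand side of the inequality w.r.t. $\lambda$, we have the stated result.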

An easy consequence of the above theorem is that the empirical average of subgaussian random variables concentrates around its expectation, *with the same asymptotic rate in (1)*.

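Corollary 5. Assume that $X_i - \mu$ are independent, $\sigma$-subgaussian random variables. Then, for any $\epsilon \geq 0$, we have

$$\mathbb{P}[\hat{\mu} \geq \mu + \epsilon] \leq \exp\left(-\frac{n \epsilon^2}{2 \sigma^2}\right) \quad \text{and} \quad \mathbb{P}[\hat{\mu} \leq \mu - \epsilon] \leq \exp\left(-\frac{n \epsilon^2}{2 \sigma^2}\right),$$

where $\hat{\mu} = \frac{1}{n} \sum_{i=1}^n X_i$.

Equating the upper bounds on the r.h.s. of the inequalities in the Corollary to $\delta$, we have the equivalent statement that, with probability at least $1 - 2\delta$, we have

$$\mu \in \left[\hat{\mu} - \sqrt{\frac{2 \sigma^2 \ln \frac{1}{\delta}}{n}}, \ \hat{\mu} + \sqrt{\frac{2 \sigma^2 \ln \frac{1}{\delta}}{n}}\right].$$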

**2. Explore-Then-Commit Algorithm**

We are now ready to present the most natural algorithm for the stochastic bandit setting, called Explore-Then-Commit (ETC) algorithm. That is, we first identify the best arm over exploration rounds and then we commit to it. This algorithm is summarized in Algorithm 2.
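The explore-then-commit idea can be sketched in a few lines. This is a hedged illustration and not a transcription of Algorithm 2: it assumes round-robin exploration with `m` pulls per arm, Gaussian losses with known standard deviation `sigma`, and a commitment to the arm with the smallest empirical mean loss. The function name and its parameters are hypothetical.

```python
import random

def explore_then_commit(means, T, m, sigma=1.0, seed=0):
    """Pull each arm m times in round-robin, then commit to the arm with the
    smallest empirical mean loss. Assumes 1 <= m and m * len(means) <= T.
    Returns the number of pulls of each arm."""
    rng = random.Random(seed)
    d = len(means)
    pulls = [0] * d
    totals = [0.0] * d
    committed = None
    for t in range(T):
        if t < m * d:
            arm = t % d                   # exploration phase
        else:
            if committed is None:         # decided once, at the end of exploration
                committed = min(range(d), key=lambda i: totals[i] / pulls[i])
            arm = committed               # exploitation phase
        pulls[arm] += 1
        totals[arm] += rng.gauss(means[arm], sigma)
    return pulls
```

With two arms whose expected losses differ by 1 and `m = 100`, the empirical means are accurate enough that the commitment almost surely goes to the better arm, so the suboptimal arm is pulled only during exploration.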

In the following, we will denote by , that is the number of times that the arm was pulled in the first rounds.

Define by the expected loss of the arm with the smallest expectation, that is . Critical quantities in our analysis will be the *gaps*, for , that measure the expected difference in losses between the arms and the optimal one. In particular, we can decompose the regret as a sum over the arms of the expected number of times we pull an arm multiplied by its gap.

Lemma 6For any policy of selection of the arms, the regret is upper bounded by

*Proof:* Observe that

Hence,

The above Lemma quantifies the intuition that in order to have a small regret we have to select the suboptimal arms less often than the best one.

We are now ready to prove the regret guarantee of the ETC algorithm.

Theorem 7Assume that the losses of the arms minus their expectations are -subgaussian and . Then, ETC guarantees a regret of

*Proof:* Let’s assume without loss of generality that the optimal arm is the first one.

So, for , we have

From Lemma 2, we have that is -subgaussian. So, from Theorem 4, we have

The bound shows the trade-off between exploration and exploitation: if is too big, we pay too much during the exploration phase (first term in the bound). On the other hand, if is small, the probability to select a suboptimal arm increases (second term in the bound). Knowing all the gaps , it is possible to choose that minimizes the bound.

For example, in the case that , the regret is upper bounded by

that is minimized by

Remembering that must be a natural number we can choose

When , we select . So, we have . Hence, the regret is upper bounded by

The main drawback of this algorithm is that its optimal tuning depends on the gaps . Assuming the knowledge of the gaps amounts to making the stochastic bandit problem completely trivial. However, its tuned regret bound gives us a baseline against which to compare other bandit algorithms. In particular, in the next lecture we will present an algorithm that achieves the same asymptotic regret without any knowledge of the gaps.

**3. History Bits**

The ETC algorithm goes back to (Robbins, H., 1952), even if Robbins proposed what is now called epoch-greedy (Langford, J. and Zhang, T., 2008). For more history on ETC, take a look at chapter 6 in (Lattimore, T. and Szepesvári, C., 2018). The proofs presented here are from (Lattimore, T. and Szepesvári, C., 2018) as well.

* You can find the lectures I published till now here.*

Last time, we saw that for Online Mirror Descent (OMD) with an entropic regularizer and learning rate it might be possible to get the regret guarantee

where . This time we will see how and we will use this guarantee to prove an almost optimal regret guarantee for Exp3, in Algorithm 1.

Remark 1. While it is possible to prove (1) from first principles using the specific properties of the entropic regularizer, such a proof will not shed any light on what is actually going on. So, in the following we will instead try to prove such regret in a very general way. Indeed, this general proof will allow us to easily prove the optimal bound for multi-armed bandits using OMD with the Tsallis entropy as regularizer.

Now, for a generic , consider the OMD algorithm that produces the predictions in two steps:

- Set such that .
- Set .

As we showed, under weak conditions, these two steps are equivalent to the usual OMD single-step update.
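For the entropic regularizer on the probability simplex, the two steps take a particularly simple form, which the sketch below illustrates. It relies on two standard facts: the unconstrained mirror step is a coordinate-wise multiplicative update, and the Bregman projection of a positive vector onto the simplex with respect to the (generalized) KL divergence is a normalization. The function names are hypothetical, and `bregman_objective` is included only to check numerically that the result minimizes the usual single-step OMD objective.

```python
import math

def omd_entropic_two_steps(x, g, eta):
    """Two-step OMD update on the simplex with the negative entropy regularizer."""
    # Step 1: unconstrained mirror step, ln(x_tilde_i) = ln(x_i) - eta * g_i.
    x_tilde = [xi * math.exp(-eta * gi) for xi, gi in zip(x, g)]
    # Step 2: Bregman (KL) projection onto the simplex, i.e. a normalization.
    total = sum(x_tilde)
    return [v / total for v in x_tilde]

def bregman_objective(x_new, x, g, eta):
    """eta * <g, x_new> + KL(x_new ; x): the single-step OMD objective on the simplex."""
    linear = eta * sum(gi * vi for gi, vi in zip(g, x_new))
    kl = sum(vi * math.log(vi / xi) for vi, xi in zip(x_new, x) if vi > 0)
    return linear + kl
```

Evaluating `bregman_objective` at the output of `omd_entropic_two_steps` and at any other point of the simplex gives a quick sanity check of the equivalence between the two-step and single-step updates in this special case.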

Now, the idea is to consider an alternative analysis of OMD that explicitly depends on , the new prediction before the Bregman projection step. First, let’s state the Generalized Pythagorean Theorem for Bregman divergences.

Lemma 1Let and define , then for all .

*Proof:* From the first order optimality condition of we have that . Hence, we have

The Generalized Pythagorean Theorem is often used to prove that the Bregman divergence between any point in the feasible set and an arbitrary point decreases when we replace the latter with its Bregman projection onto the feasible set.

We are now ready to prove our regret guarantee.

Lemma 2. For the two-step OMD update above the following regret bound holds:

where and .

*Proof:* From the update rule, we have that

where in the second equality we used the 3-points equality for the Bregman divergences and the Generalized Pythagorean Theorem in the first inequality. Hence, summing over time we have

So, as we did in the previous lecture, we have

where and .

Putting all together, we have the stated bound.

This time it might be easier to get a handle over . Given that we only need an upper bound, we can just take a look at and and see which one is bigger. This is easy to do: using the update rule we have

that is

Assuming , we have that implies .

Overall, we have the following improved regret guarantee for the Learning with Experts setting with positive losses.

Theorem 3Assume for and . Let and . Using OMD with the entropic regularizer defined as , learning rate , and gives the following regret guarantee

Armed with this new tool, we can now turn to the multi-armed bandit problem again.

Let’s now consider the OMD with entropic regularizer, learning rate , and set equal to the stochastic estimate of , as in Algorithm 1. Applying Theorem 3 and taking expectation, we have

Now, focusing on the terms , we have

So, setting , we have

Remark 2. The need for a different analysis for OMD is due to the fact that we want an easy way to upper bound the Hessian. Indeed, in this analysis comes before the normalization into a probability distribution, which simplifies the analysis a lot. The same idea will be used for the Tsallis entropy in the next section.

So, with a tighter analysis we showed that, even without an explicit exploration term, OMD with entropic regularizer solves the multi-armed bandit problem paying only a factor more than the full information case. However, this is still not the optimal regret!

In the next section, we will see that changing the regularizer, *with the same analysis*, will remove the term in the regret.

**1. Optimal Regret Using OMD with Tsallis Entropy**

In this section, we present the Implicitly Normalized Forecaster (INF) also known as OMD with Tsallis entropy for multi-armed bandit.

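Define $\psi_q: \mathbb{R}^d_{\geq 0} \to \mathbb{R}$ as $\psi_q(x) = \frac{1}{1-q}\left(1 - \sum_{i=1}^d x_i^q\right)$, where $q \in [0,1]$ and in $q = 1$ we extend the function by continuity. This is the negative **Tsallis entropy** of the vector $x$. This is a strict generalization of the Shannon entropy, because when $q$ goes to 1, $\psi_q(x)$ converges to the negative (Shannon) entropy of $x$.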

We will instantiate OMD with this regularizer for the multi-armed problem, as in Algorithm 2.

Note that and .

We will not use any interpretation of this regularizer from the information theory point of view. As we will see in the following, the only reason to choose it is its Hessian. In fact, the Hessian of this regularizer is still diagonal and it is equal to

Now, we can use again the modified analysis for OMD in Lemma 2. So, for any , we obtain

where and .

As we did for Exp3, now we need an upper bounds to the . From the update rule and the definition of , we have

that is

So, if , , that implies that .

Hence, putting all together, we have

We can now specialize the above reasoning, considering in the Tsallis entropy, to obtain the following theorem.

Theorem 4Assume . Set and . Then, Algorithm 2

*Proof:* We only need to calculate the terms

Proceeding as in (2), we obtain

Choosing , we finally obtain an expected regret of , that can be proved to be the optimal one.

There is one last thing: how do we compute the prediction of this algorithm? In each step, we have to solve a constrained optimization problem. So, we can write the corresponding Lagrangian:

From the KKT conditions, we have

and we also know that . So, we have a 1-dimensional problem in that must be solved in each round.
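The one-dimensional search can be sketched as follows for the case q = 1/2. With the regularizer 2 (1 - sum_i sqrt(x_i)), whose derivative in coordinate i is -1/sqrt(x_i), the stationarity condition of the OMD update gives each new coordinate as (1/sqrt(x_i) + eta * g_i + beta)^(-2), where beta is the multiplier of the sum-to-one constraint. The sign convention for beta, the use of bisection, and the function name are assumptions of this sketch and may differ from the derivation in the lecture.

```python
def tsallis_omd_step(x, g_tilde, eta, iterations=100):
    """One OMD step on the simplex with the q = 1/2 Tsallis regularizer.

    Each new coordinate is (x[i] ** -0.5 + eta * g_tilde[i] + beta) ** -2,
    with beta found by bisection so that the coordinates sum to 1.
    Assumes every x[i] > 0.
    """
    a = [xi ** -0.5 + eta * gi for xi, gi in zip(x, g_tilde)]
    d = len(x)

    def total(beta):
        return sum((ai + beta) ** -2 for ai in a)

    lo = -min(a)          # total(beta) grows without bound as beta decreases to lo
    hi = lo + d ** 0.5    # here every coordinate is at most 1/d, so total(hi) <= 1
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if total(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    x_new = [(ai + hi) ** -2 for ai in a]
    s = sum(x_new)
    return [v / s for v in x_new]   # remove the residual bisection error
```

Bisection is used here only because the sum is monotonically decreasing in beta on the search interval; a Newton iteration on the same scalar equation would be a natural faster alternative.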

**2. History Bits**

The INF algorithm was proposed by (Audibert, J.-Y. and Bubeck, S., 2009) and recast as an OMD procedure in (Audibert, J.-Y. and Bubeck, S. and Lugosi, G., 2011). The connection with the Tsallis entropy was made in (Abernethy, J. D. and Lee, C. and Tewari, A., 2015). The specific proof presented here is new and it builds on the proof by (Abernethy, J. D. and Lee, C. and Tewari, A., 2015). Note that (Abernethy, J. D. and Lee, C. and Tewari, A., 2015) proved the same regret bound for a Follow-The-Regularized-Leader procedure over the stochastic estimates of the losses (that they call Gradient-Based Prediction Algorithm), while here we proved it using an OMD procedure.

**3. Exercises**

Exercise 1Prove that in the modified proof of OMD, the terms can be upper bounded by .

Exercise 2Building on the previous exercise, prove that regret bounds of the same order can be obtained for Exp3 and for the INF/OMD with Tsallis entropy directly upper bounding the terms , without passing through the Bregman divergences.


* You can find the lectures I published till now here.*

Today, we will present the problem of multi-armed bandit in the adversarial setting and show how to obtain sublinear regret.

**1. Multi-Armed Bandit**

This setting is similar to the Learning with Expert Advice (LEA) setting: In each round, we select one expert and, differently from the full-information setting, we only observe the loss of that expert . The aim is still to compete with the cumulative loss of the best expert in hindsight.

As in the learning with experts case, we need randomization in order to have a sublinear regret. Indeed, this is just a harder problem than LEA. However, we will assume that the adversary is **oblivious**, that is, he decides the losses of all the rounds before the game starts, but with the knowledge of the online algorithm. This makes the losses deterministic quantities and it avoids the inadequacy in our definition of regret when the adversary is adaptive (see (Arora, R. and Dekel, O. and Tewari, A., 2012)).

This kind of problem, where we don’t receive the full information, i.e., we don’t observe the loss vector, is called a **bandit problem**. The name comes from the problem of a gambler who plays a pool of slot machines, that can be called “one-armed bandits”. On each round, the gambler places his bet on a slot machine and his goal is to win almost as much money as if he had known in advance which slot machine would return the maximal total reward.

In this problem, we clearly have an *exploration-exploitation trade-off*. In fact, on one hand we would like to play at the slot machine which, based on previous rounds, we believe will give us the biggest win. On the other hand, we have to explore the slot machines to find the best ones. On each round, we have to solve this trade-off.

Given that we don’t completely observe the loss, we cannot directly use our two frameworks: Online Mirror Descent (OMD) and Follow-The-Regularized-Leader (FTRL) both need the loss functions, or at least lower bounds to them.

One way to solve this issue is to construct *stochastic estimates* of the unknown losses. This is a natural choice given that we already know that the prediction strategy has to be a randomized one. So, in each round we construct a probability distribution over the arms and we sample one action according to this probability distribution. Then, we only observe the coordinate of the loss vector . One possibility to have a stochastic estimate of the losses is to use an *importance-weighted estimator*: Construct the estimator of the unknown vector in the following way:

Note that this estimator has all the coordinates equal to 0, except the coordinate corresponding to the arm that was pulled.
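As a minimal sketch of the estimator just described, the following Python code builds the importance-weighted estimate and computes its exact expectation by enumerating the arms. The names `losses`, `probs`, and both function names are illustrative assumptions, not notation from the lecture, and the sketch assumes that every arm has strictly positive probability.

```python
def importance_weighted_estimate(loss_of_pulled_arm, pulled_arm, probs):
    """Estimated loss vector: zero everywhere except on the pulled arm."""
    estimate = [0.0] * len(probs)
    # Dividing by the probability of the pulled arm compensates for how
    # rarely that coordinate is observed.
    estimate[pulled_arm] = loss_of_pulled_arm / probs[pulled_arm]
    return estimate


def expected_estimate(losses, probs):
    """Exact expectation of the estimator over the random choice of the arm."""
    d = len(losses)
    expectation = [0.0] * d
    for arm in range(d):  # the arm is pulled with probability probs[arm]
        estimate = importance_weighted_estimate(losses[arm], arm, probs)
        for i in range(d):
            expectation[i] += probs[arm] * estimate[i]
    return expectation


losses = [0.2, 0.9, 0.5]  # hypothetical loss vector, never fully observed
probs = [0.5, 0.3, 0.2]   # hypothetical sampling distribution over the arms
print(expected_estimate(losses, probs))
```

Under these assumptions, the computed expectation coincides with the true loss vector up to floating-point error, which is the unbiasedness property.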

This estimator is unbiased, that is . To see why, note that and . Hence, for , we have

Let’s also calculate the (uncentered) variance of the coordinates of this estimator. We have

We can now think of using OMD with an entropic regularizer and the estimated losses. Hence, assume and set defined as , that is the unnormalized negative entropy. Also, set . Using the OMD analysis, we have

We can now take the expectation at both sides and get

We are now in trouble, because the terms in the sum scale as . So, we need a way to control the smallest probability over the arms.

One way to do it is to take a convex combination of and a uniform probability. That is, we can predict with , where will be chosen in the following. So, can be seen as the minimum amount of exploration we require of the algorithm. Its value will be chosen by the regret analysis to optimally trade off exploration vs exploitation. The resulting algorithm is in Algorithm 1.
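A small sketch of the mixing step may help. The name `alpha` for the mixing weight and the function name are assumptions made for illustration only.

```python
def mix_with_uniform(probs, alpha):
    """Convex combination of a distribution with the uniform one.

    alpha in [0, 1] is the (hypothetical) exploration weight: every arm
    ends up with probability at least alpha / d.
    """
    d = len(probs)
    return [(1.0 - alpha) * p + alpha / d for p in probs]


mixed = mix_with_uniform([0.98, 0.02, 0.0], alpha=0.3)
print(mixed, min(mixed))
```

The smallest probability of the mixed distribution is never below `alpha / d`, which is what keeps the importance weights bounded.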

The same probability distribution is used in the estimator:

We can have that . However, we pay a price in the bias introduced:

Observing that , we have

Putting together the last inequality and the upper bound to the expected regret in (2), we have

Setting and , we obtain a regret of .

This is way worse than the of the full-information case. However, while it is expected that the bandit case must be more difficult than the full information one, it turns out that this is not the optimal strategy.

**2. Exponential-weight algorithm for Exploration and Exploitation: Exp3 **

It turns out that the algorithm above actually works, even without the mixing with the uniform distribution! We were just too loose in our regret guarantee. So, we will analyse the following algorithm, that is called Exponential-weight algorithm for Exploration and Exploitation (Exp3), that is nothing else than OMD with entropic regularizer and stochastic estimates of the losses. Note that now we will assume that .
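The following is a minimal sketch of Exp3 as described above: exponential weights over the arms, fed with importance-weighted loss estimates. It assumes losses in [0, 1] and a fixed learning rate `eta` supplied by the caller; the function name, the seed argument, and the learning rate used in the example are assumptions for illustration, not necessarily the tuning used in the lecture.

```python
import math
import random


def exp3(loss_matrix, eta, seed=0):
    """Run Exp3 on a fixed (oblivious) sequence of loss vectors.

    Returns the cumulative loss of the pulled arms and the final distribution.
    """
    rng = random.Random(seed)
    d = len(loss_matrix[0])
    probs = [1.0 / d] * d
    total_loss = 0.0
    for losses in loss_matrix:
        # Sample one arm and observe only its loss.
        arm = rng.choices(range(d), weights=probs)[0]
        observed = losses[arm]
        total_loss += observed
        # Importance-weighted estimate: nonzero only on the pulled arm.
        estimate = observed / probs[arm]
        # Entropic OMD update: multiplicative update followed by normalization.
        probs[arm] *= math.exp(-eta * estimate)
        norm = sum(probs)
        probs = [p / norm for p in probs]
    return total_loss, probs


T, d = 500, 3
loss_matrix = [[0.0, 1.0, 1.0] for _ in range(T)]  # arm 0 is always the best
eta = math.sqrt(2 * math.log(d) / (d * T))         # one common choice, assumed here
print(exp3(loss_matrix, eta))
```

Only the coordinate of the pulled arm changes before normalization, because the estimated loss of every other arm is zero.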

Let’s take another look at the regret guarantee we have. From the OMD analysis, we have the following one-step inequality that holds for any

Let’s now focus on the term . We said that for a twice differentiable function , there exists such that , where . Hence, there exists such that and

So, assuming the Hessian in to be positive definite, we can bound the last two terms in the one-step inequality of OMD as

where we used Fenchel-Young inequality with the function and and .

When we use the strong convexity, we are upper bounding the terms in the sum with the inverse of the smallest eigenvalue of the Hessian of the regularizer. However, we can do better if we consider the actual Hessian. In fact, in the coordinates where is small, we have a smaller growth of the divergence. This can be seen also graphically in Figure 1. Indeed, for the entropic regularizer, we have that the Hessian is a diagonal matrix whose diagonal entries are the reciprocals of the coordinates. This expression of the Hessian gives a regret of

where and . Note that for any is in the simplex, so this upper bound is always better than

that we derived just using the strong convexity of the entropic regularizer.

However, we don’t know the exact value of , but only that it is on the line segment between and . Yet, if you could say that , in the bandit case we would obtain an expected regret guarantee of , greatly improving the bound we proved above!

In the next lecture, we will see an alternative way to analyze OMD that will give us exactly this kind of guarantee for Exp3 and will give us the optimal regret guarantee using the Tsallis entropy in a few lines of proof.

**3. History Bits **

The algorithm in Algorithm 1 is from (Cesa-Bianchi, N. and Lugosi, G., 2006, Theorem 6.9). The Exp3 algorithm was proposed in (Auer, P. and Cesa-Bianchi, N. and Freund, Y. and Schapire, R. E., 2002).

* You can find the lectures I published till now here.*

Throughout this class, we considered the adversarial model as our model of the environment. This allowed us to design algorithms that work in this setting, as well as in other more benign settings. However, the world is never completely adversarial. So, we might be tempted to model the environment in some way, but that would leave our algorithm vulnerable to attacks. An alternative is to consider the data as generated by some *predictable process plus adversarial noise*. In this view, it might be beneficial to try to model the predictable part, without compromising the robustness to the adversarial noise.

In this class, we will explore this possibility through a particular version of Follow-The-Regularized-Leader (FTRL), where we *predict* the next loss. In very intuitive terms, if our predicted loss is correct, we can expect the regret to decrease. However, if our prediction is wrong we still want to recover the worst case guarantee. Such an algorithm is called **Optimistic FTRL**.

The core idea of Optimistic FTRL is to predict the next loss and use it in the update rule, as summarized in Algorithm 1. Note that for the sake of the analysis, it does not matter how the prediction is generated. It can be even generated by another online learning procedure!

Let’s see why this is a good idea. Remember that FTRL simply predicts with the minimizer of the previous losses plus a time-varying regularizer. Let’s assume for a moment that instead we have the gift of predicting the future, so we do know the next loss ahead of time. Then, we could predict with its minimizer and suffer a negative regret. However, probably our foresight abilities are not so powerful, so our prediction of the next loss might be inaccurate. In this case, a better idea might be just to add our predicted loss to the previous ones and minimize the regularized sum. We would expect the regret guarantee to improve if our prediction of the future loss is precise. At the same time, if the prediction is wrong, we expect its influence to be limited, given that we use it together with all the past losses.
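To make the idea concrete, here is a minimal sketch of Optimistic FTRL in a special case chosen for illustration: linear losses, an unconstrained domain, and a fixed squared Euclidean regularizer with weight `1 / (2 * eta)`. In this special case the regularized minimizer has a closed form. The function names and the `hints` argument (the predicted next subgradients) are assumptions, not notation from the lecture.

```python
def optimistic_ftrl(gradients, hints, eta):
    """Optimistic FTRL with regularizer ||x||^2 / (2 * eta) on linear losses.

    hints[t] is the prediction of gradients[t], available before round t.
    """
    d = len(gradients[0])
    grad_sum = [0.0] * d
    predictions = []
    for g, hint in zip(gradients, hints):
        # Minimizer of ||x||^2 / (2 * eta) + <past gradients + hint, x>.
        x = [-eta * (s + h) for s, h in zip(grad_sum, hint)]
        predictions.append(x)
        # Only the true gradient is accumulated; the hint is discarded.
        grad_sum = [s + gi for s, gi in zip(grad_sum, g)]
    return predictions


def linear_regret(gradients, predictions, competitor):
    """Sum over rounds of <g_t, x_t - u>."""
    return sum(
        gi * (xi - ui)
        for g, x in zip(gradients, predictions)
        for gi, xi, ui in zip(g, x, competitor)
    )


gradients = [[1.0, -0.5] for _ in range(20)]
no_hints = [[0.0, 0.0] for _ in range(20)]
u = [-1.0, 0.5]
plain = linear_regret(gradients, optimistic_ftrl(gradients, no_hints, 0.1), u)
perfect = linear_regret(gradients, optimistic_ftrl(gradients, gradients, 0.1), u)
print(plain, perfect)
```

With all-zero hints the update reduces to plain FTRL, while with perfect hints each round's linear loss decreases by `eta` times the squared norm of the subgradient, so the regret can only improve.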

All these intuitions can be formalized in the following Theorem.

Theorem 1. With the notation in Algorithm 1, let be convex, closed, and non-empty. Denote by . Assume for that is proper and -strongly convex w.r.t. , and proper and convex, and . Also, assume that and are non-empty. Then, there exists for , such that we have

for all .

*Proof:* We can interpret the Optimistic-FTRL as FTRL with a regularizer . Also, note that has no influence on the algorithm, so we can set it to the null function.

Hence, from the equality for FTRL, we immediately get

Now focus on the terms . Observe that is -strongly convex w.r.t. , hence we have

where . Observing that , we have . Hence, given that our assumptions guarantee that the subdifferential of the sum is equal to the sum of the subdifferentials, there exists such that . So, we have

By the definition of dual norms, we also have that

Let’s take a look at the second bound in the theorem. Compared to the similar bound for FTRL, we now have the terms instead of the ones . So, if the prediction of the next loss is good, that term can become smaller and possibly even zero! On the other hand, if the predictions are bad, for Lipschitz losses we only lose a constant factor. Overall, in the best case we can gain a lot, in the worst case we don’t lose that much.

Despite the simplicity of the algorithm and its analysis, there are many applications of this principle. We will only describe a couple of them. Recently, this idea was used even to recover Nesterov’s acceleration algorithm and to prove faster convergence in repeated games.

**1. Regret that Depends on the Variance of the Subgradients **

Consider running Optimistic-FTRL on the linearized losses . We can gain something out of the Optimistic-FTRL compared to plain FTRL if we are able to predict the next . A simple possibility is to predict the average of the past values, . Indeed, from the first lecture, we know that such a strategy is itself an online learning procedure! In particular, it corresponds to a Follow-The-Leader algorithm on the losses . Hence, from the strong convexity of these losses, we know that

This implies

It is immediate to see that the minimizer is , that results in times the empirical variance of the subgradients. Plugging it in the Optimistic-FTRL regret, with , we have

Remark 1. Instead of using the mean of the past subgradients, we could use any other strategy or even a mix of different strategies. For example, assuming the subgradients bounded, we could use an algorithm to solve the Learning with Expert problem, where each expert is a strategy. Then, we would obtain a bound that depends on the predictions of the best strategy, plus the regret of the expert algorithm.

**2. Online Convex Optimization with Gradual Variations **

In this section, we consider the case that the losses we receive have small variations over time. We will show that in this case it is possible to get constant regret in the case that the losses are equal.

In this case, the simple strategy we can use to predict the next subgradient is to use the previous one, that is for and .

Corollary 2. Under the assumptions of Theorem 1, define for and . Set where is 1-strongly convex w.r.t. and satisfies for , where is the smoothness constant of the losses . Then, , we have

Moreover, assuming for all , setting , we have

*Proof:* From the Optimistic-FTRL bound with a fixed regularizer, we immediately get

Now, consider the case that the losses are -smooth. So, for any , we have

Focusing on the first term, for , we have

Choose . We have for

For , we have

Now observe the assumption implies for . So, summing for , we have

Putting all together, we have the first stated bound.

The second one is obtained observing that

Note that if the losses are all the same, the regret becomes a constant! This is not surprising, because the prediction of the next loss is a linear approximation of the previous loss. Indeed, looking back at the proof, the key idea is to use the smoothness to argue that, even if the past subgradient was taken at a different point than the current one, it is still a good prediction of the current subgradient.

Remark 2. Note that the assumption of smoothness is necessary. Indeed, always passing the same function and using online-to-batch conversion would result in a convergence rate of for a Lipschitz function, that is impossible.

**3. History Bits **

The Optimistic Online Mirror Descent algorithm was proposed by (Chiang, C.-K. and Yang, T. and Lee, C.-J. and Mahdavi, M. and Lu, C.-J. and Jin, R. and Zhu, S., 2012) and extended in (A. Rakhlin and K. Sridharan, 2013) to use arbitrary “hallucinated” losses. The Optimistic FTRL version was proposed in (A. Rakhlin and K. Sridharan, 2013) and rediscovered in (Steinhardt, J. and Liang, P., 2014), even if it was called Online Mirror Descent for the misnaming problem we already explained. The proof of Theorem 1 I present here is new.

Corollary 2 was proved by (Chiang, C.-K. and Yang, T. and Lee, C.-J. and Mahdavi, M. and Lu, C.-J. and Jin, R. and Zhu, S., 2012) for Optimistic OMD and presented in a similar form in (P. Joulani and A. György and C. Szepesvári, 2017) for Optimistic FTRL, but for bounded domains.

* You can find the lectures I published till now here.*

In the last lecture, we have shown a very simple and parameter-free algorithm for Online Convex Optimization (OCO) in -dimensions, based on a reduction to a coin-betting problem. Now, we will see how to reduce Learning with Expert Advice (LEA) to betting on coins, obtaining again parameter-free and optimal algorithms.

**1. Reduction to Learning with Experts **

First, remember that the regret we got from Online Mirror Descent (OMD), and similarly for Follow-The-Regularized-Leader (FTRL), is

where is the prior distribution on the experts and is the KL-divergence. As we reasoned in the OCO case, in order to set the learning rate we should know the value of . If we could set to , we would obtain a regret of . However, given the adversarial nature of the game, this is impossible. So, as we did in the OCO case, we will show that even this problem can be reduced to betting on a coin, obtaining optimal guarantees with a parameter-free algorithm.

First, let’s introduce some notation. Let be the number of experts and be the -dimensional probability simplex. Let be any *prior* distribution. Let be a coin-betting algorithm. We will instantiate copies of .

Consider any round . Let be the bet of the -th copy of . The LEA algorithm computes as

Then, the LEA algorithm predicts as

Then, the algorithm receives the reward vector . Finally, it feeds the reward to each copy of . The reward for the -th copy of is defined as

The construction above defines a LEA algorithm defined by the predictions , based on the algorithm . We can prove the following regret bound for it.

Theorem 1 (Regret Bound for Experts). Let be a coin-betting algorithm that guarantees a wealth after rounds with initial money equal to 1 of for any sequence of continuous coin outcomes . Then, the regret of the LEA algorithm with prior that predicts at each round with in (2) satisfies

for any concave and non-decreasing such that .

*Proof:* We first prove that . Indeed,

The first equality follows from definition of . To see the second equality, consider two cases: If for all then and therefore both and are trivially zero. If then for all .

From the assumption on , we have for any sequence such that that

So, inequality and (4) imply

Now, for any competitor ,

Now, we could think to use the Krichevsky–Trofimov (KT) bettor with this theorem. However, we would obtain a sub-optimal regret guarantee. In fact, remembering the lower bound on the wealth of KT and setting where is a universal constant, we have

We might think that the is the price we have to pay to adapt to the unknown competitor . However, it turns out it can be removed. In the next section, we see how to change the KT strategy to obtain the optimal guarantee.
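The reduction described in this section can be sketched as follows, using a KT bettor as the coin-betting algorithm. This is a hedged reconstruction: the displayed formulas are not reproduced in the text above, so the specific rules used here (keeping only the positive part of each bet weighted by the prior, falling back to the prior when all bets are non-positive, and truncating the reward of copies with non-positive bets) follow my reading of the reduction in (Orabona, F. and Pal, D., 2016) and should be treated as assumptions. Rewards are assumed to be in [0, 1], and all names are illustrative.

```python
class KTBettor:
    """Krichevsky-Trofimov bettor for continuous coins in [-1, 1]."""

    def __init__(self, initial_wealth=1.0):
        self.wealth = initial_wealth
        self.coin_sum = 0.0
        self.rounds = 0

    def bet(self):
        # Signed bet: average of the past coins, scaled by the current wealth.
        return self.coin_sum / (self.rounds + 1) * self.wealth

    def update(self, coin):
        # The bet is recomputed from the state before it is modified.
        self.wealth += coin * self.bet()
        self.coin_sum += coin
        self.rounds += 1


def experts_from_coin_betting(reward_vectors, prior):
    """Learning with expert advice through one coin-betting copy per expert."""
    bettors = [KTBettor() for _ in prior]
    predictions = []
    for rewards in reward_vectors:
        bets = [b.bet() for b in bettors]
        # Keep only the positive bets, weighted by the prior.
        p_hat = [pi * max(x, 0.0) for pi, x in zip(prior, bets)]
        norm = sum(p_hat)
        p = [v / norm for v in p_hat] if norm > 0 else list(prior)
        predictions.append(p)
        average_reward = sum(pi * r for pi, r in zip(p, rewards))
        for bettor, x, r in zip(bettors, bets, rewards):
            # Copies with a non-positive bet receive a truncated reward.
            gap = r - average_reward
            bettor.update(gap if x > 0 else max(gap, 0.0))
    return predictions


rewards = [[1.0, 0.0] for _ in range(5)]  # expert 0 is always the best
for p in experts_from_coin_betting(rewards, prior=[0.5, 0.5]):
    print(p)
```

No learning rate appears anywhere in the sketch: the only inputs are the prior and the observed rewards.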

**2. A Betting Strategy that Loses at Most a Constant Fraction of Money **

In the reduction before, if we use the KT betting strategy we would have a term under the square root. It turns out that we can avoid that term if we know the number of rounds beforehand. Then, in case is unknown we can just use a doubling trick, paying only a constant multiplicative factor in the regret.

The logarithmic term in the regret comes from the fact that the lower bound on the wealth is

Note that in the case in which the number of heads in the sequence is equal to the number of tails, so that , the guaranteed wealth becomes proportional to . So, for that goes to infinity the bettor will lose all of its money.
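As a small numerical illustration of this effect, the sketch below runs the KT betting rule (bet a fraction of the wealth equal to the average of the past coin outcomes) on a perfectly balanced alternating sequence. Note that it tracks the actual wealth on one specific balanced sequence, not the lower bound discussed above; the function name and the choice of sequence are assumptions made for illustration.

```python
def kt_wealth(coins, initial_wealth=1.0):
    """Final wealth of the KT bettor on a sequence of coins in [-1, 1]."""
    wealth = initial_wealth
    coin_sum = 0.0
    for t, coin in enumerate(coins, start=1):
        beta = coin_sum / t            # signed betting fraction
        wealth *= 1.0 + coin * beta    # win or lose the amount that was bet
        coin_sum += coin
    return wealth


for rounds in (2, 4, 100, 1000):
    balanced = [1.0 if t % 2 == 0 else -1.0 for t in range(rounds)]
    print(rounds, kt_wealth(balanced))
```

On this sequence the wealth keeps shrinking as the number of rounds grows, even though heads and tails are perfectly balanced.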

Instead, we need a more conservative strategy that guarantees

for small enough and independent of . In this case, the betting strategy has to pace its betting, possibly with the knowledge of the duration of the game, so that even in the case that the number of heads is equal to the number of tails it will only lose a fraction of its money. At the same time, it will still gain an exponential amount of money when the coin outcomes are biased towards one side.

We will prove that this is possible, designing a new betting strategy.

then, by induction, . In fact, we have

Hence, we have to prove that (8) is true in order to guarantee a minimum wealth of our betting strategy.

First, given that is a concave function of , we have

Also, our choice of makes the two quantities above equal with , that is

For other choices of , the two alternatives would be different and the minimum one could always be the one picked by the adversary. Instead, making the worst outcomes of the two choices equivalent, we minimize the damage of the adversarial choice of the outcomes of the coin. So, we have that

where in the second equality we used the definition of and in the second inequality we used the fact that .

Hence, given that (8) is true, this strategy guarantees

We can now use this betting strategy in the expert reduction in Theorem 1, setting , to have

Note that this betting strategy could also be used in the OCO reduction. Given that we removed the logarithmic term in the exponent, in the 1-dimensional case, we would obtain a regret of

where we gained in the term inside the logarithm, instead of the term of the KT algorithm. This implies that now we can set to and obtain an asymptotic rate of rather than .

**3. History Bits **

The first parameter-free algorithm for experts is from (Chaudhuri, K. and Freund, Y. and Hsu, D. J., 2009), named NormalHedge, where they obtained a bound similar to the one in (9) but with an additional term. Then, (Chernov, A. and Vovk, V., 2010) removed the log factors with an update without a closed form. (Orabona, F. and Pal, D., 2016) showed that this guarantee can be efficiently obtained through the novel reduction to coin-betting in Theorem 1. Later, this kind of regret guarantee was improved to depend on the sum of the squared losses rather than on time, but with an additional factor, in the Squint algorithm (Koolen, W. M. and van Erven, T., 2015). It is worth noting that the Squint algorithm can be interpreted exactly as a coin-betting algorithm plus the reduction in Theorem 1.

The betting strategy in (6) and (7) is new, and derived from the shifted-KT potentials in (Orabona, F. and Pal, D., 2016). The guarantee is the same obtained by the shifted-KT potentials, but the analysis can be done without knowing the properties of the gamma function.

**4. Exercises **


Exercise 1. Using the same proof technique in the lecture, find a betting strategy whose wealth depends on rather than on .

* You can find the lectures I published till now here.*

In the last lecture, we have shown a very simple and parameter-free algorithm for Online Convex Optimization (OCO) in 1-dimension. Now, we will see how to reduce OCO in a -dimensional space to OCO in 1-dimension, so that we can use the parameter-free algorithm given by a coin-betting strategy in any number of dimensions.

**1. Coordinate-wise Parameter-free OCO **

We have already seen that it is always possible to decompose an OCO problem over the coordinates and use a different 1-dimensional Online Linear Optimization (OLO) algorithm on each coordinate. In particular, we saw that

where the is exactly the regret w.r.t. the linear losses constructed by the coordinate of the subgradient.

Hence, if we have a 1-dimensional OLO algorithm, we can run copies of it, each one fed with the coordinate of the subgradient. In particular, we might think to use the KT algorithm over each coordinate. The pseudo-code of this procedure is in Algorithm 1.
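Since the pseudo-code is not reproduced here, the following is a minimal sketch of the coordinate-wise procedure, assuming that each coordinate of the subgradients lies in [-1, 1] and that the coin fed to each copy is the negative of the corresponding subgradient coordinate. The function names, the `initial_wealth` parameter, and the test function are assumptions for illustration.

```python
def coordinatewise_kt(subgradient, dim, rounds, initial_wealth=1.0):
    """Run one KT bettor per coordinate and return the average iterate."""
    wealth = [initial_wealth] * dim
    coin_sum = [0.0] * dim
    iterate_sum = [0.0] * dim
    for t in range(1, rounds + 1):
        # Each coordinate predicts its own bet: betting fraction times wealth.
        x = [coin_sum[i] / t * wealth[i] for i in range(dim)]
        g = subgradient(x)
        for i in range(dim):
            # The coin of coordinate i is -g[i], so the gain is -g[i] * x[i].
            wealth[i] -= g[i] * x[i]
            coin_sum[i] -= g[i]
            iterate_sum[i] += x[i]
    average = [s / rounds for s in iterate_sum]
    return average, wealth


target = [3.0, -2.0]  # hypothetical minimizer of the test function


def l1_subgradient(x):
    """A subgradient of sum_i |x_i - target_i|; each coordinate is in {-1, 0, 1}."""
    return [(xi > ti) - (xi < ti) for xi, ti in zip(x, target)]


average, wealth = coordinatewise_kt(l1_subgradient, dim=2, rounds=2000)
print(average, wealth)
```

The copies do not share any state, so the total regret is just the sum of the per-coordinate regrets.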

The regret bound we get is immediate: We just have to sum the regret over the coordinates.

Theorem 1. With the notation in Algorithm 1, assume that . Then, , the following regret bounds hold

where is a universal constant.

Note that the Theorem above suggests that in high dimensional settings should be proportional to .

**2. Parameter-free in Any Norm **

The above reduction works only in a finite dimensional space. Moreover, it gives a dependency on the competitor w.r.t. the norm that might be undesirable. So, here we present another simple reduction from 1-dimensional OCO to infinite dimensions.

This reduction requires an unconstrained OCO algorithm for the 1-dimensional case and an algorithm for learning in -dimensional (or infinite dimensional) balls. For the 1-dimensional learner, we could use the KT algorithm, while for learning in -dimensional balls we can use, for example, Online Mirror Descent (OMD). Given these two learners, we decompose the problem of learning a vector into the problems of learning a *direction* and a *magnitude*. The regret of this procedure turns out to be just the sum of the regret of the two learners.
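A minimal sketch of this decomposition follows, under stated assumptions: the magnitude is learned by a KT bettor, the direction by projected online subgradient descent on the Euclidean unit ball with a fixed learning rate `eta`, the prediction is the product of the two, the magnitude learner receives the scalar loss given by the inner product of the subgradient with the direction, and the subgradients have Euclidean norm at most 1. All names are illustrative.

```python
import math


def project_to_unit_ball(v):
    """Euclidean projection onto the unit L2 ball."""
    norm = math.sqrt(sum(vi * vi for vi in v))
    return list(v) if norm <= 1.0 else [vi / norm for vi in v]


def magnitude_direction_learner(gradients, eta):
    """Predict magnitude * direction; return the predictions and the KT wealth."""
    d = len(gradients[0])
    y = [0.0] * d                   # direction, kept inside the unit ball
    wealth, coin_sum = 1.0, 0.0     # state of the 1-dimensional KT learner
    predictions = []
    for t, g in enumerate(gradients, start=1):
        z = coin_sum / t * wealth   # magnitude chosen by the KT learner
        predictions.append([z * yi for yi in y])
        # Scalar subgradient sent to the 1-d learner: <g, y>, in [-1, 1].
        s = sum(gi * yi for gi, yi in zip(g, y))
        wealth -= s * z             # the coin is -s
        coin_sum -= s
        # The direction learner takes a projected subgradient step on g.
        y = project_to_unit_ball([yi - eta * gi for yi, gi in zip(y, g)])
    return predictions, wealth


gradients = [[0.6, -0.8] for _ in range(20)]  # unit-norm subgradients
predictions, wealth = magnitude_direction_learner(gradients, eta=1.0)
print(predictions[-1], wealth)
```

The cumulative linear loss of the combined predictions equals the initial wealth minus the final wealth of the 1-dimensional learner, so the regret against the origin is at most the initial wealth.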

We can formalize this idea in the following Theorem.

Theorem 2. Denote by the linear regret of algorithm for any in the unit ball w.r.t. a norm , and the linear regret of algorithm for any competitor . Then, for any , Algorithm 2 guarantees regret

Further, the subgradients sent to satisfy .

*Proof:* First, observe that since for all . Now, compute:

Remark 1. Note that the direction vector is not constrained to have norm equal to 1, yet this does not seem to affect the regret equality.

We can instantiate the above theorem using the KT betting algorithm for the 1d learner and OMD for the direction learner. We obtain the following examples.

Example 1. Let be OSD with and learning rate . Let be the KT algorithm for 1-dimensional OCO with . Assume the loss functions are -Lipschitz w.r.t. the . Then, using the construction in Algorithm 2, we have

Using an online-to-batch conversion, this algorithm is a stochastic gradient descent procedure without learning rates to tune.

To better appreciate this kind of guarantee, let’s take a look at the one of Follow-The-Regularized-Leader (Online Subgradient Descent can be used in unbounded domains only with constant learning rates). With the regularizer and 1-Lipschitz losses we get a regret of

So, to get the right dependency on we need to tune , but we saw this is impossible. On the other hand, the regret in Example 1 suffers from a logarithmic factor, that is the price to pay not to have to tune parameters.

In the same way, we can even have a parameter-free regret bound for norms.

Example 2. Let be OMD with and learning rate . Let be the KT algorithm for 1-dimensional OCO with . Assume the loss functions are -Lipschitz w.r.t. the . Then, using the construction in Algorithm 2, we have

If we want to measure the competitor w.r.t. the norm, we have to use the same method we saw for OMD: Set and such that . Now, assuming that , we have that . Hence, we have to divide all the losses by and, for all , we obtain

Note that the regret against of the parameter-free construction is *constant*. It is important to understand that there is nothing special in the origin: We could translate the prediction by any offset and get a guarantee that treats the offset as the point with constant regret. This is shown in the next Proposition.

Proposition 3. Let be an OLO algorithm that predicts and guarantees linear regret for any . We have that the regret of the predictions for OCO is

**3. Combining OCO Algorithms **

Finally, we now show a useful application of the property of parameter-free OCO algorithms of having constant regret against .

Theorem 4. Let and be two OLO algorithms that produce the predictions and respectively. Then, predicting with , we have for any

Moreover, if both algorithms guarantee a constant regret of against , we have for any

*Proof:* Set . Then,

In words, the above theorem allows us to combine online learning algorithms. If the algorithms we combine have constant regret against the null competitor, then we always get the best of the two guarantees.
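The mechanism behind this result can be checked numerically on a toy example. The sketch below, written for the 1-dimensional case with made-up numbers, verifies that the linear regret of the summed predictions against a competitor equals the sum of the linear regrets of the two algorithms against any split of that competitor. The function names and all values are hypothetical.

```python
def linear_regret(gradients, predictions, competitor):
    """Sum over rounds of g_t * (x_t - u) in one dimension."""
    return sum(g * (x - competitor) for g, x in zip(gradients, predictions))


def combine(predictions_a, predictions_b):
    """Predict with the sum of the predictions of the two algorithms."""
    return [a + b for a, b in zip(predictions_a, predictions_b)]


gradients = [0.5, -1.0, 0.25, 1.0]  # hypothetical subgradients
pred_a = [0.0, -0.1, 0.3, 0.2]      # hypothetical predictions of the first algorithm
pred_b = [0.0, 0.4, -0.2, 0.1]      # hypothetical predictions of the second algorithm
u, u_a = 2.0, 0.5                   # competitor and an arbitrary split of it

lhs = linear_regret(gradients, combine(pred_a, pred_b), u)
rhs = linear_regret(gradients, pred_a, u_a) + linear_regret(gradients, pred_b, u - u_a)
print(lhs, rhs)
```

Choosing the split that assigns the whole competitor to one algorithm and the origin to the other leaves only that algorithm's regret plus the constant regret of the other against the origin.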

Example 3. We can combine two parameter-free OCO algorithms, one that gives a bound that depends on the norm of the competitor and subgradients and another one specialized to the norm of competitor/subgradients. The above theorem assures us that we will also get the best guarantee between the two, paying only a constant factor in the regret.

Of course, upper bounding the OCO regret with the linear regret, the above theorem also upper bounds the OCO regret.

**4. History Bits **

The approach of using a coordinate-wise version of the coin-betting algorithm was proposed in the first paper on parameter-free OLO in (M. Streeter and B. McMahan, 2012). Recently, the same approach with a special coin-betting algorithm was also used for optimization of deep neural networks (Orabona, F. and Tommasi, T., 2017). Theorem 2 is from (A. Cutkosky and F. Orabona, 2018). Note that the original theorem is more general because it works even in Banach spaces. The idea of combining two parameter-free OLO algorithms to obtain the best of the two guarantees is from (A. Cutkosky, 2019).

(Orabona, F. and Pal, D., 2016) proposed a different way to transform a coin-betting algorithm into an OCO algorithm that works in or even in Hilbert spaces. However, that approach seems to work only for the norm and it is not a black-box reduction. That said, the reduction in (Orabona, F. and Pal, D., 2016) seems to have a better empirical performance compared to the one in Theorem 2.

There are also reductions that allow us to transform an unconstrained OCO learner into a constrained one (A. Cutkosky and F. Orabona, 2018). They work by constructing a Lipschitz barrier function on the domain and passing to the algorithm the original subgradients plus the subgradients of the barrier function.

**5. Exercises **

Exercise 1. Prove that with and are exp-concave. Then, using the Online Newton Step Algorithm, give an algorithm and a regret bound for a game with these losses. Finally, show a wealth guarantee of the corresponding coin-betting strategy.


* You can find all the lectures I published here.*

In the previous classes, we have shown that Online Mirror Descent (OMD) and Follow-The-Regularized-Leader (FTRL) achieve a regret of for convex Lipschitz losses. We have also shown that for bounded domains these bounds are optimal up to constant multiplicative factors. However, in the unbounded case the bounds we get are suboptimal w.r.t. the dependency on the competitor. More in particular, let’s consider an example with Online Subgradient Descent with over -Lipschitz losses and learning rate . We get the following regret guarantee

So, in order to get the best possible guarantee, we should know and set . As we said, this strategy does not work for a couple of reasons: i) we don’t know ; ii) if we guessed any value of , the adversary could easily change the losses to make that value completely wrong.

Far from being a technicality, this is an important issue as shown in the next example.

Example 1. Consider that we want to use OSD with online-to-batch conversion to minimize a function that is 1-Lipschitz. The convergence rate will be using a learning rate of . Consider the case that , specifying will result in a convergence rate 100 times slower than specifying the optimal choice in hindsight . Note that this is a real effect, not an artifact of the proof. Indeed, it is intuitive that the optimal learning rate should be proportional to the distance between the initial point that the algorithm picks and the optimal solution.

If we could tune the learning rate in the optimal way, we would get a regret of

However, this is also impossible, because we proved a lower bound that says that the regret must be .

In the following, we will show that it is possible to reduce any Online Convex Optimization (OCO) game to betting on a non-stochastic coin. This will allow us to use a radically different way to design OCO algorithms that will enjoy the optimal regret and will not require any parameter (e.g. learning rates, regularization weights) to be tuned. We call this kind of algorithm *parameter-free*.

**1. Coin-Betting Game **

Imagine the following repeated game:

- Set the initial Wealth to : .
- In each round
  - You bet money on side of the coin equal to ; you cannot bet more money than what you currently have.
  - The adversary reveals the outcome of the coin .
  - You gain money , that is .

Given that we cannot borrow money, we can codify the bets as , with . So, is the fraction of money to bet and the side of the coin on which we bet.

The aim of the game is to make as much money as possible. As usual, given the adversarial nature of the game, we cannot hope to always win money. Instead, we try to gain as much money as the strategy that bets a fixed amount of money for the entire game.

Note that

So, given the multiplicative nature of the wealth, it is also useful to take the logarithm of the ratio of the wealth of the algorithm and wealth of the optimal betting fraction. Hence, we want to minimize the following regret

In words, this is nothing else than the regret of an OCO game where the losses are and . We can also extend a bit the formulation allowing “continuous coins”, where rather than in .

Remark 1. Note that the constraint to bet a fraction between and is not strictly necessary. We could allow the algorithm to bet more money than what it currently has, lending it some money in each round. However, the restriction makes the analysis easier because it allows the transformation above into an OCO problem, using the non-negativity of .

We could just use OMD or FTRL, taking special care of the non-Lipschitzness of the functions, but it turns out that there exists a better strategy specifically for this problem. There exists a very simple strategy to solve the coin-betting game above, that is called **Krichevsky-Trofimov (KT) bettor**. It simply says that on each time step you bet . So, the algorithm is the following one.
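A minimal sketch of the KT bettor in Python follows. It assumes that the betting fraction at round t is the sum of the previous coin outcomes divided by t, which is the standard form of the KT strategy; the function name and the `initial_wealth` parameter are illustrative.

```python
def kt_bettor(coins, initial_wealth=1.0):
    """Play the coin-betting game with the KT strategy.

    coins: outcomes in [-1, 1] (+1 and -1 for an ordinary coin).
    Returns the wealth after each round.
    """
    wealth = initial_wealth
    coin_sum = 0.0
    history = []
    for t, coin in enumerate(coins, start=1):
        beta = coin_sum / t   # signed fraction of the wealth to bet
        bet = beta * wealth   # never more than the current wealth, since |beta| < 1
        wealth += coin * bet  # win the bet if the sign is right, lose it otherwise
        coin_sum += coin
        history.append(wealth)
    return history


print(kt_bettor([1.0, 1.0, 1.0]))  # a coin that always comes up heads
```

The first bet is always zero, because there are no past outcomes to average; afterwards the bettor leans towards the side that has appeared more often so far.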

For it, we can prove the following theorem.

Theorem 1 (Cesa-Bianchi, N. and Lugosi, G., 2006, Theorem 9.4). Let for . Then, the KT bettor in Algorithm 1 guarantees

where is a universal constant.

Note that if the outcomes of the coin are skewed towards one side, the optimal betting fraction will gain an exponential amount of money, as proved in the next Lemma.

*Proof:*

where we used the elementary inequality for .

Hence, KT guarantees an exponential amount of money, paying only a penalty. It is possible to prove that the guarantee above for the KT algorithm is optimal up to constant additive factors. Moreover, observe that the KT strategy does not require any parameter to be set: no learning rates, nor regularizer. That is, KT is *parameter-free*.

Also, we can extend the guarantee of the KT algorithm to the case in which the coins are “continuous”, that is . We have the following Theorem.

Theorem 3 (Orabona, F. and Pal, D., 2016, Lemma 14). Let for . Then, the KT bettor in Algorithm 1 guarantees

where is a universal constant.

So, we have introduced the coin-betting game, extended it to continuous coins and presented a simple and optimal parameter-free strategy. In the next Section, we show *how to use the KT bettor as a parameter-free 1-d OCO algorithm!*

**2. Parameter-free 1d OCO through Coin-Betting **

So, Theorem 1 tells us that we can win almost as much money as a strategy betting the optimal fixed fraction of money at each step. We only pay a logarithmic price in the log wealth, that corresponds to a term in the actual wealth.

Now, let’s see why this problem is interesting in OCO. It turns out that *solving the coin-betting game is equivalent to solving a 1-dimensional unconstrained online linear optimization problem*. That is, designing a coin-betting algorithm is equivalent to designing an online learning algorithm that produces a sequence of that minimizes the 1-dimensional regret with linear losses:

where the are adversarial and bounded. Without loss of generality, we will assume . Also, remembering that OCO games can be reduced to Online Linear Optimization (OLO) games, such a reduction would effectively reduce OCO to coin-betting! Moreover, through online-to-batch conversion, any stochastic 1-d problem could be reduced to a coin-betting game! The key theorem that allows the conversion between OLO and coin-betting is the following one.

Theorem 4 Let be a proper closed convex function and let be its Fenchel conjugate. An algorithm that generates guarantees

where , if and only if it guarantees

*Proof:* Let’s prove the left to right implication.
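As a sketch of this direction, assume the reward guarantee is written in the form shown in the first line below. The symbols $\phi$ (the convex function), $\phi^\star$ (its Fenchel conjugate), $g_t$, $x_t$, $u$, $\epsilon$ and the sign convention are assumptions of this sketch. For any competitor $u$, adding and subtracting $\langle \sum_t g_t, u\rangle$ and then using the definition of the conjugate gives the regret bound:

```latex
\begin{align*}
\text{Assume: } \quad & \sum_{t=1}^T \langle g_t, x_t \rangle \le \epsilon - \phi\Big(-\sum_{t=1}^T g_t\Big). \\
\text{Then: } \quad \sum_{t=1}^T \langle g_t, x_t - u \rangle
  &= \sum_{t=1}^T \langle g_t, x_t \rangle + \Big\langle -\sum_{t=1}^T g_t, u \Big\rangle \\
  &\le \epsilon + \Big\langle -\sum_{t=1}^T g_t, u \Big\rangle - \phi\Big(-\sum_{t=1}^T g_t\Big) \\
  &\le \epsilon + \sup_{\theta} \big( \langle \theta, u \rangle - \phi(\theta) \big)
   = \epsilon + \phi^\star(u).
\end{align*}
```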

For the other implication, we have

To make sense of the above theorem, assume that we are considering a 1-d problem and . Then, guaranteeing a lower bound to

can be done through a betting strategy that bets money on the coins . So, the theorem implies that *proving a reward lower bound for the wealth in a coin-betting game implies a regret upper bound for the corresponding 1-dimensional OLO game*. However, proving a reward lower bound is easier because it doesn’t depend on the competitor . Indeed, not knowing the norm of the competitor is exactly the reason why tuning the learning rates in OMD is hard!

This consideration immediately gives us the conversion between 1-d OLO and coin-betting: **the outcome of the coin is the negative of the subgradient of the losses on the current prediction.** Indeed, setting , we have that a coin-betting algorithm that bets would give us

So, a lower bound on the wealth corresponds to a lower bound that can be used in Theorem 4. To obtain a regret guarantee, we only need to calculate the Fenchel conjugate of the reward function, assuming it can be expressed as a function of .

The last step is to reduce 1-d OCO to 1-d OLO. But, this is an easy step that we have done many times. Indeed, we have

where .

So, to summarize, the Fenchel conjugate of the wealth lower bound for the coin-betting game becomes the regret guarantee for the OCO game. In the next section, we specialize all these considerations to the KT algorithm.

**3. KT as a 1d Online Convex Optimization Algorithm **

Here, we want to use the considerations in the above section to use KT as a parameter-free 1-d OCO algorithm. First, let’s see what such an algorithm looks like. KT bets , starting with money. Now, set where and assume the losses are -Lipschitz. So, we get

The pseudo-code is in Algorithm 3.
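The reduction can be sketched in a few lines, assuming 1-Lipschitz losses, an initial wealth `eps`, and the standard KT fraction (sum of past coins divided by t). The function `kt_oco` and the `subgradient(t, x)` callback are hypothetical names introduced for this example.

```python
def kt_oco(subgradient, T, eps=1.0):
    """1-d KT-based learner; subgradient(t, x) must return a value in [-1, 1]."""
    wealth = eps                        # initial money of the bettor
    sum_coins = 0.0
    iterates = []
    for t in range(1, T + 1):
        x = (sum_coins / t) * wealth    # the prediction is the signed KT bet
        g = subgradient(t, x)           # subgradient of the t-th loss at x
        c = -g                          # coin outcome = negative subgradient
        wealth += c * x                 # wealth equals eps minus sum of g_s * x_s
        sum_coins += c
        iterates.append(x)
    return iterates


# Usage on the fixed loss f_t(x) = |x - 3|, whose subgradient is the sign of x - 3.
xs = kt_oco(lambda t, x: 1.0 if x > 3 else -1.0, T=200)
print(xs[:5], sum(xs) / len(xs))
```

Note that the learner has no learning rate: the only input besides the subgradients is the initial wealth.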

Let’s now see what kind of regret we get. From Theorem 3 and Lemma 2, we have that the KT bettor guarantees the following lower bound on the wealth when used with :

So, we have found the function ; we just need , or an upper bound on it, which is given by the following lemma.

where is the Lambert function, i.e. defined as to satisfy .

*Proof:* From the definition of Fenchel dual, we have

where . We now use the fact that satisfies , to have , where is the Lambert function. Using Lemma 5 in the Appendix, we obtain the stated bound.

So, the regret guarantee of KT used as a 1-d OLO algorithm is upper bounded by

where the only assumption was that the first derivatives (or sub-derivatives) of are bounded in absolute value by 1. Also, it is important to note that any setting of in would not change the asymptotic rate.

To better appreciate this regret, compare this bound to the one of OMD with learning rate :

Hence, the coin-betting approach allows us to get an almost optimal bound, without having to guess the correct learning rate! The price that we pay for this parameter-freeness is the log factor, which is optimal according to our lower bound.

It is also interesting to look at what the algorithm would do on an easy problem, where . In Figure 3, we show the different predictions that the KT algorithm and online subgradient descent (OSD) would make. Note how the convergence rate of OSD critically depends on the learning rate: a learning rate that is too big will prevent convergence, and one that is too small will slow it down. On the other hand, KT will go *exponentially fast* towards the minimum and then it will automatically backtrack. This exponential growth effectively works like a line search procedure that allows us to get the optimal regret without tuning learning rates. Later in the iterations, KT will oscillate around the minimum, *automatically shrinking its steps, without any parameter to tune.* Of course, this is a simplified example. In a true OCO game, the losses are different at each time step and the intuition behind the algorithm becomes more difficult. Yet, the optimality of the regret assures us that the KT strategy is the right strategy.
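This behaviour can be reproduced with a small simulation. The sketch below assumes the fixed loss |x - 10| (an illustrative choice), initial wealth 1 for KT, and a constant stepsize for OSD started at 0; all function names are hypothetical.

```python
def subgrad_abs(x, target=10.0):
    """A subgradient of |x - target| (0 is a valid choice at the kink)."""
    return float((x > target) - (x < target))


def kt_iterates(T, eps=1.0):
    """Predictions of the KT-based learner on the loss |x - 10|."""
    wealth, sum_coins, xs = eps, 0.0, []
    for t in range(1, T + 1):
        x = (sum_coins / t) * wealth
        c = -subgrad_abs(x)              # coin = negative subgradient
        wealth += c * x
        sum_coins += c
        xs.append(x)
    return xs


def osd_iterates(T, eta, x0=0.0):
    """Predictions of online subgradient descent with constant stepsize eta."""
    x, xs = x0, []
    for _ in range(T):
        xs.append(x)
        x -= eta * subgrad_abs(x)
    return xs


kt = kt_iterates(20)
print("KT :", [round(x, 3) for x in kt[:8]])
for eta in (0.01, 1.0, 25.0):
    print("OSD eta =", eta, [round(x, 3) for x in osd_iterates(20, eta)[:8]])
```

In this run KT roughly doubles its prediction at every step until it overshoots the minimizer at the seventh round and then backtracks. OSD with stepsize 0.01 has barely moved by then, and with stepsize 25 it jumps back and forth between 0 and 25.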

Next time, we will see that we can also reduce OCO in and learning with experts to coin-betting games.

**4. History Bits **

The keyword “parameter-free” has been introduced in (Chaudhuri, K. and Freund, Y. and Hsu, D. J., 2009) for a similar strategy for the learning with experts problem. It is now used as an umbrella term for all online algorithms that guarantee the optimal regret uniformly over the competitor class. The first algorithm for 1-d parameter-free OCO is from (M. Streeter and B. McMahan, 2012), but the bound was suboptimal. The algorithm was then extended to Hilbert spaces in (Orabona, F., 2013), still with a suboptimal bound. The optimal bound in Hilbert space was obtained in (McMahan, H. B. and Orabona, F., 2014). The idea of using coin-betting to do parameter-free OCO was introduced in (Orabona, F. and Pal, D., 2016). The Krichevsky-Trofimov algorithm is from (Krichevsky, R. and Trofimov, V., 1981) and its extension to the “continuous coin” is from (Orabona, F. and Pal, D., 2016). The regret-reward duality relationship was proved for the first time in (McMahan, H. B. and Orabona, F., 2014). Lemma 5 is from (Orabona, F. and Pal, D., 2016).

**5. Exercises **

Exercise 1 While the original proof of the KT regret bound is difficult, it is possible to obtain a looser bound using the be-the-leader method in FTRL. In particular, it is easy to show a regret of for the log wealth.

**6. Appendix **

The Lambert function is defined by the equality

The following lemma provides bounds on .

*Proof:* The inequalities are satisfied for , hence in the following we assume . We first prove the lower bound. From (1) we have

From this equality, using the elementary inequality for any , we get

Consider now the function defined in where is a positive number that will be decided in the following. This function has a maximum in , the derivative is positive in and negative in . Hence the minimum is in and in , where it is equal to . Using the property just proved on , setting , we have

For , setting , we have

Hence, we set such that

Numerically, , so

For the upper bound, we use Theorem 2.3 in (Hoorfar, A. and Hassani, M., 2008), that says that

Setting , we obtain the stated bound.
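A quick numerical sanity check of bounds of this kind can be sketched as follows. The sandwich tested here, 0.6321 ln(x + 1) ≤ W(x) ≤ ln(x + 1) for x ≥ 0, is an assumption of this example, as is the Newton-based helper `lambert_w`, which is a hand-rolled routine rather than a library call.

```python
import math


def lambert_w(x, iters=50):
    """Principal branch of the Lambert function for x >= 0, via Newton's method."""
    w = math.log1p(x)                   # starting point, never below W(x)
    for _ in range(iters):
        ew = math.exp(w)
        w -= (w * ew - x) / (ew * (w + 1.0))   # Newton step on w * exp(w) - x = 0
    return w


# Logarithmic grid from 1e-5 to 1e10, plus the point x = 0.
grid = [0.0] + [10.0 ** (k / 4) for k in range(-20, 41)]
tol = 1e-12
upper_ok = all(lambert_w(x) <= math.log1p(x) + tol for x in grid)
lower_ok = all(0.6321 * math.log1p(x) <= lambert_w(x) + tol for x in grid)
print(upper_ok, lower_ok)
```

A grid check like this does not replace the proof, but it is a cheap way to catch a wrong constant.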

* You can find the lectures I published till now here.*

In this lecture, we will consider the problem of *online linear classification*. We consider the following setting:

- At each time step we receive a sample
- We output a prediction of the binary label of
- We receive the true label and we see if we made a mistake or not
- We update our online classifier

The aim of the online algorithm is to minimize the number of mistakes it makes compared to some best fixed classifier.

We will focus on linear classifiers, which predict with the sign of the inner product between a vector and the input features . Hence, . This problem can be written again as a regret minimization problem:

where . It should be clear that these losses are non-convex. Hence, we need an alternative way to deal with them. In the following, we will see two possible approaches to this problem.

**1. Online Randomized Classifier **

As we did for the Learning with Expert Advice framework, we might think to convexify the losses using randomization. Hence, on each round we can predict a number in and output the label with probability and the label with probability . So, define the random variable

Now observe that . If we consider linear predictors, we can think to have and similarly for the competitor . Constraining both the algorithm and the competitor to the space of vectors where for , we can write

Hence, the surrogate convex loss becomes and the feasible set is any convex set where we have the property for .
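The effect of the randomization can be checked numerically. The sketch below assumes one concrete parametrization: a score q in [-1, 1] is turned into the label +1 with probability (1 + q) / 2 and into -1 otherwise. Under that assumption, the expected number of mistakes on a label y in {-1, +1} is |q - y| / 2, which is convex in q. The function names are hypothetical.

```python
import random


def randomized_label(q, rng):
    """Draw +1 with probability (1 + q) / 2 and -1 otherwise, for q in [-1, 1]."""
    return 1 if rng.random() < (1.0 + q) / 2.0 else -1


def expected_mistake(q, y):
    """Exact probability that the randomized label differs from y in {-1, +1}."""
    p_plus = (1.0 + q) / 2.0
    return 1.0 - p_plus if y == 1 else p_plus


rng = random.Random(0)
q, y = 0.4, 1
n = 100_000
empirical = sum(randomized_label(q, rng) != y for _ in range(n)) / n
print(empirical, expected_mistake(q, y), 0.5 * abs(q - y))
```

With a linear predictor, q would be the inner product between the weight vector and the sample, which is why the feasible set has to keep that inner product in [-1, 1].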

Given that this problem is convex, assuming to be bounded w.r.t. some norm, we can use almost any of the algorithms we have seen till now, from Online Mirror Descent to Follow-The-Regularized-Leader (FTRL). All of them would result in regret upper bounds, assuming that are bounded in some norm. The only caveat is to restrict in . One way to do it might be to consider assuming and choose the feasible set .

Putting it all together, for example, we can have the following strategy using FTRL with regularizers .

Theorem 1 Let be an arbitrary sequence of sample/label pairs where and . Assume , for . Then, running the Randomized Online Linear Classifier algorithm with where , for any we have the following guarantee

*Proof:* The proof is straightforward from the FTRL regret bound with the chosen increasing regularizer.

**2. The Perceptron Algorithm **

The above strategy has the shortcoming of restricting the feasible vectors to a possibly very small set. In turn, this could make the performance of the competitor poor, and the online algorithm is only guaranteed to be close to that competitor.

Another way to deal with the non-convexity is to compare the number of mistakes that the algorithm does with a convex cumulative loss of the competitor. That is, we can try to prove a weaker regret guarantee:

In particular, the convex losses we consider are *powers* of the **Hinge Loss**: . The hinge loss is a convex upper bound to the 0/1 loss and it achieves the value of zero when the sign of the prediction is correct *and* the magnitude of the inner product is big enough. Moreover, taking powers of it, we get a family of functions that trade off the loss for the wrongly classified samples with the one for the correctly classified samples but with a value of , see Figure 1.

The oldest algorithm we have to minimize the modified regret in (1) is the **Perceptron** algorithm, in Algorithm 2.

The Perceptron algorithm updates the current prediction moving in the direction of the current sample multiplied by its label. Let’s see why this is a good idea. Assume that and the algorithm made a mistake. Then, the updated prediction would predict a more positive number on the same sample . In fact, we have

In the same way, if and the algorithm made a mistake, the update would result in a more negative prediction on the same sample.
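A minimal sketch of the update described above, in plain Python. The tie-breaking rule (predict +1 when the inner product is zero) and the toy data are assumptions of this example. The data are separated by u = (1, -1) with y⟨u, x⟩ ≥ 1 on every sample, so the classical separable-case bound of R² ‖u‖² mistakes applies, with R the largest sample norm.

```python
def perceptron(samples):
    """Run the Perceptron on (x, y) pairs with y in {-1, +1}; return (w, mistakes)."""
    w = [0.0] * len(samples[0][0])
    mistakes = 0
    for x, y in samples:
        score = sum(wi * xi for wi, xi in zip(w, x))
        y_hat = 1 if score >= 0 else -1          # tie at 0 broken towards +1 (assumption)
        if y_hat != y:
            mistakes += 1
            w = [wi + y * xi for wi, xi in zip(w, x)]   # move towards y * x
    return w, mistakes


data = [((2, 0), 1), ((0, 2), -1), ((1, 0), 1),
        ((0, 1), -1), ((3, 1), 1), ((1, 3), -1)]
w, mistakes = perceptron(data * 2)               # two passes over the same samples

u = (1, -1)
R2 = max(sum(xi * xi for xi in x) for x, _ in data)
bound = R2 * sum(ui * ui for ui in u)            # R^2 * ||u||^2
print(w, mistakes, bound)
```

On this toy sequence all the mistakes happen in the first pass, well below the bound.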

For the Perceptron algorithm, we can prove the following guarantee.

Theorem 2 Let be an arbitrary sequence of sample/label pairs where and . Assume , for . Then, running the Perceptron algorithm we have the following guarantee

Before proving the theorem, let’s take a look at its meaning. If there exists a such that , then the Perceptron algorithm makes a *finite* number of mistakes upper bounded by . If there are many that achieve , the number of mistakes is bounded in terms of the norm of the smallest among them. What is the meaning of this quantity?

Remember that a hyperplane represented by its normal vector divides the space in two half spaces: one with the points that give a positive value for the inner product and the other one where the same inner product is negative. Now, we have that the distance of a sample from the hyperplane whose normal is is

Also, given that we are considering a that gives cumulative hinge loss zero, we have that this quantity is at least . So, *the norm of the minimal that has cumulative hinge loss equal to zero is inversely proportional to the minimum distance between the points and the separating hyperplane*. This distance is called the **margin** of the samples . So, if the margin is small, the Perceptron algorithm can make more mistakes than when the margin is big.
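The relation between the norm of the separator and the margin can be checked on a toy example. The sketch assumes the usual point-to-hyperplane distance |⟨u, x⟩| / ‖u‖; the data and the vector `u` are illustrative. Since y⟨u, x⟩ ≥ 1 on every sample, the margin is at least 1 / ‖u‖.

```python
import math


def margin(u, samples):
    """Smallest signed distance y * <u, x> / ||u|| of the samples from the hyperplane."""
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    return min(y * sum(ui * xi for ui, xi in zip(u, x)) / norm_u
               for x, y in samples)


data = [((2, 0), 1), ((0, 2), -1), ((1, 0), 1),
        ((0, 1), -1), ((3, 1), 1), ((1, 3), -1)]
u = (1, -1)
gamma = margin(u, data)
print(gamma, 1.0 / math.sqrt(2.0))
```

Rescaling `u` does not change the margin, so the smallest-norm vector with zero hinge loss is the one for which the closest sample satisfies y⟨u, x⟩ = 1.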

If the problem is not linearly separable, the Perceptron algorithm satisfies a regret of , where is the loss of the competitor. Moreover, we measure the competitor with a *family of loss functions* and compete with the best measured with the best loss. This adaptivity is achieved through two basic ingredients:

- *The Perceptron is independent of scaling of the update by a hypothetical learning rate*, in the sense that the mistakes it makes are independent of the scaling. That is, we could update with and have the same mistakes and updates because they only depend on the sign of . Hence, we can think of it as always using the best possible learning rate .
- The weakened definition of regret allows us to consider a family of loss functions, because *the Perceptron is not using any of them in the update.*
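The first ingredient is easy to verify empirically. The sketch below runs a Perceptron whose update is scaled by a hypothetical learning rate `eta` and records the rounds in which it errs. The data are illustrative, and the learning rates are powers of two so that the scaling is exact in floating point.

```python
def perceptron_mistake_rounds(samples, eta=1.0):
    """Indices of the rounds where a Perceptron with update eta * y * x makes a mistake."""
    w = [0.0] * len(samples[0][0])
    rounds = []
    for t, (x, y) in enumerate(samples):
        score = sum(wi * xi for wi, xi in zip(w, x))
        y_hat = 1 if score >= 0 else -1          # only the sign of the score matters
        if y_hat != y:
            rounds.append(t)
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return rounds


data = [((2, 0), 1), ((0, 2), -1), ((1, 0), 1), ((0, 1), -1),
        ((3, 1), 1), ((1, 3), -1), ((1, 1), -1), ((2, 1), 1)] * 3
reference = perceptron_mistake_rounds(data, eta=1.0)
same = all(perceptron_mistake_rounds(data, eta=eta) == reference
           for eta in (0.25, 8.0, 1024.0))
print(reference, same)
```

With the weights started at zero, scaling the update by `eta` scales every weight vector by `eta`, which leaves all the signs, and hence all the mistakes, unchanged.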

Let’s now prove the regret guarantee. For the proof, we will need the two following technical lemmas.

Lemma 3 (F. Cucker and D. X. Zhou, 2007, Lemma 10.17) Let be such that . Then

*Proof:* Let , then we have . Solving for we have . Hence, .

*Proof:* Denote the total number of mistakes of the Perceptron algorithm by .

First, note that the Perceptron algorithm can be thought of as running Online Subgradient Descent (OSD) with a fixed stepsize over the losses over . Indeed, OSD over such losses would update

Now, as said above, does not affect in any way the sign of the predictions, hence the Perceptron algorithm could be run with (2) and its predictions would be exactly the same. Hence, we have

Given that this inequality holds for any , we can choose the one that minimizes the r.h.s., to have

Note that . Also, we have

So, denoting by , we can rewrite (3) as

where we used Hölder’s inequality and .

Given that and denoting by , we have

Let’s now consider two cases. For , we can use Lemma 4 and have the stated bound. Instead, for , using Lemma 3 we have

that implies

Using the fact that , we have

Finally, using Lemma 4, we have the stated bound.

**3. History Bits **

The Perceptron was proposed by Rosenblatt (F. Rosenblatt, 1958). The proof of convergence in the non-separable case for is by (C. Gentile, 2003) and for is from (Y. Freund and R. E. Schapire, 1999). The proof presented here is based on the one in (Beygelzimer, A. and Orabona, F. and Zhang, C., 2017).
