There is a popular interpretation of the Perceptron as a stochastic (sub)gradient descent procedure. I even found slides online with this idea. The thought of so many young minds twisted by these false claims was too much to bear. So, I felt compelled to write a blog post to explain why this is wrong…

Moreover, I will also give a different and (I think) much better interpretation of the Perceptron algorithm.

**1. Perceptron Algorithm**

The Perceptron algorithm was introduced by Rosenblatt in 1958. To be more precise, he introduced a family of algorithms characterized by a certain architecture. He also considered what we now call supervised and unsupervised training procedures. However, nowadays when we talk about the Perceptron we mean the following algorithm:

In the algorithm, the couples $(\boldsymbol{x}_t, y_t)$ for $t=1, \dots, T$, with $\boldsymbol{x}_t \in \mathbb{R}^d$ and $y_t \in \{-1, 1\}$, represent a set of input/output pairs that we want to learn to classify correctly into the two categories $-1$ and $1$. We assume that there exists an unknown vector $\boldsymbol{u}$ that correctly classifies all the samples, that is $y_t \langle \boldsymbol{u}, \boldsymbol{x}_t \rangle > 0$ for all $t$. Note that any scaling of $\boldsymbol{u}$ by a positive constant still correctly classifies all the samples, so there are infinitely many solutions. The aim of the Perceptron is to find any one of these solutions.
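Since the pseudocode is not reproduced here, the following is a minimal sketch of the algorithm as described in the text (the multi-pass loop and the toy data are my own additions):

```python
import numpy as np

def perceptron(X, y, max_passes=100):
    """Minimal Perceptron: update w += y_t * x_t on every mistake.

    X: (n, d) array of samples, y: (n,) array of labels in {-1, +1}.
    Returns w once a full pass makes no mistakes (or after max_passes).
    """
    n, d = X.shape
    w = np.zeros(d)  # standard initialization w_1 = 0
    for _ in range(max_passes):
        mistakes = 0
        for t in range(n):
            if y[t] * np.dot(w, X[t]) <= 0:  # mistake: update
                w += y[t] * X[t]
                mistakes += 1
        if mistakes == 0:  # all samples correctly classified
            return w
    return w

# Linearly separable toy data: label given by the sign of the first coordinate
X = np.array([[1.0, 0.5], [2.0, -1.0], [-1.0, 0.3], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(all(y[t] * np.dot(w, X[t]) > 0 for t in range(len(y))))  # True
```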

From an optimization point of view, this is called a *feasibility problem*, that is, something like

$\text{find } \boldsymbol{x} \ \text{ such that } \ \boldsymbol{x} \in C,$

where $C$ is some set. Feasibility problems are an essential step in constrained optimization for algorithms that require a feasible initial point. They are not optimization problems, even if in some cases they can be solved through an optimization formulation.

In the Perceptron case, we can restate the problem as

$\text{find } \boldsymbol{u} \ \text{ such that } \ y_t \langle \boldsymbol{u}, \boldsymbol{x}_t \rangle \geq 1, \ t=1, \dots, T,$

where the “1” on the r.h.s. is clearly arbitrary and it can be changed through a rescaling of $\boldsymbol{u}$. So, in optimization language, the Perceptron algorithm is nothing else than an iterative procedure to solve the above feasibility problem.

**2. Issues with the SGD Interpretation**

As said above, sometimes people refer to the Perceptron as a stochastic (sub)gradient descent algorithm on the objective function

$F(\boldsymbol{w}) = \frac{1}{T} \sum_{t=1}^T \max(-y_t \langle \boldsymbol{w}, \boldsymbol{x}_t \rangle, 0). \qquad (1)$

I think there are many problems with this idea; let me list some of them:

- First of all, the above interpretation assumes that we take the samples uniformly at random from the training set. However, this is not needed in the Perceptron, and it was not needed in the first proofs of the Perceptron convergence (Novikoff, 1963). There is a tendency to call anything that receives one sample at a time “stochastic”, but “arbitrary order” and “stochastic” are clearly not the same.
- The Perceptron is typically initialized with $\boldsymbol{w}_1 = \boldsymbol{0}$. Now, we have two problems. The first one is that with a black-box first-order oracle, we would get a subgradient of $\max(-y_t \langle \boldsymbol{w}, \boldsymbol{x}_t \rangle, 0)$, where $t$ is drawn uniformly at random. A possible subgradient of this function in $\boldsymbol{w}_1 = \boldsymbol{0}$ is the zero vector, for any sample. This means that SGD would not update. Instead, the Perceptron in this case does update. So, we are forced to consider a different model than the black-box one. Changing the oracle model is a minor problem, but this fact hints at another very big issue.
- The biggest issue is that $\boldsymbol{w} = \boldsymbol{0}$ is a global optimum of $F$! So, there is nothing to minimize: we are already done at the first iteration. However, from a classification point of view, this solution seems clearly wrong. So, it seems we constructed an objective function we want to minimize and a corresponding algorithm, but for some reason we do not like one of its infinite minimizers. So, maybe, the objective function is wrong? So, maybe, this interpretation misses something?

There is an easy way to avoid some of the above problems: change the objective function to a parametrized loss that has a non-zero gradient in zero. For example, something like this:

$F_\beta(\boldsymbol{w}) = \frac{1}{T} \sum_{t=1}^T \frac{1}{\beta} \ln\left(1 + \exp(-\beta\, y_t \langle \boldsymbol{w}, \boldsymbol{x}_t \rangle)\right).$

Now, when $\beta$ goes to infinity, you recover the function $F$. However, for any finite $\beta$, $\boldsymbol{0}$ is not a global optimum anymore. As a side effect, we also solved the issue of the subgradient of the max function. In this way, you could interpret the Perceptron algorithm as the *limit behaviour of SGD on a family of optimization problems*.
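To see the limit behaviour concretely, here is a small numerical check. I use the scaled logistic (softplus) loss $\frac{1}{\beta}\ln(1+e^{\beta z})$ as the smoothing, which is my assumption for the exact form of the parametrized loss mentioned in the History Bits:

```python
import math

def softplus_beta(z, beta):
    """Scaled logistic loss: (1/beta) * log(1 + exp(beta * z)).
    As beta -> infinity this converges to max(z, 0), the hinge-like term."""
    # numerically stable form: max(z, 0) + (1/beta) * log(1 + exp(-beta*|z|))
    return max(z, 0) + math.log1p(math.exp(-beta * abs(z))) / beta

def grad_softplus_beta(z, beta):
    """Derivative w.r.t. z: the sigmoid of beta * z."""
    return 1.0 / (1.0 + math.exp(-beta * z))

# The smoothed loss approaches max(z, 0) as beta grows...
for beta in [1, 10, 100]:
    print(beta, softplus_beta(-2.0, beta))  # -> 0 as beta -> infinity

# ...but, unlike the max, it has a non-zero derivative at z = 0,
# so (S)GD started at w = 0 does move:
print(grad_softplus_beta(0.0, beta=10))  # 0.5
```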

To be honest, I am not sure this is a satisfying solution. Moreover, the stochasticity is still there and it should be removed.

Now, I already proved a mistake bound for the Perceptron, without any particular interpretation attached to it. As a matter of fact, proofs do not need interpretations to be correct. I showed that the Perceptron competes with a *family of loss functions*, which implies that it does not just use the subgradient of a single function. However, if you need an *intuitive way* to think about it, let me present the idea of *pseudogradients*.

**3. Pseudogradients**

Suppose we want to minimize an $L$-smooth function $f$ and we would like to use something like gradient descent. However, we do not have access to its gradient. In this situation, (Polyak and Tsypkin, 1973) proposed to use a “pseudogradient”, that is *any* vector $\boldsymbol{p}_t$ that forms an angle of 90 degrees or less with the actual gradient in $\boldsymbol{x}_t$:

$\langle \nabla f(\boldsymbol{x}_t), \boldsymbol{p}_t \rangle \geq 0.$

In a very intuitive way, $\boldsymbol{p}_t$ gives me some information that should allow me to minimize $f$, at least in the limit. The algorithm then becomes a “pseudogradient descent” procedure that updates the current solution in the direction of the negative pseudogradient:

$\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \eta_t \boldsymbol{p}_t,$

where the $\eta_t$ are the step sizes or learning rates.
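To make this concrete, here is a tiny numerical sketch of pseudogradient descent; the objective and the pseudogradient choice are illustrative assumptions, not from the post. We minimize $f(\boldsymbol{x}) = \frac{1}{2}\|\boldsymbol{x}\|^2$, whose gradient is $\boldsymbol{x}$, using as pseudogradient the coordinate-wise sign of the gradient, which always has a non-negative inner product with it:

```python
import numpy as np

# Minimize f(x) = 0.5 * ||x||^2 (gradient: x) with a pseudogradient:
# p_t = sign(grad f(x_t)) satisfies <grad f(x_t), p_t> = ||x_t||_1 >= 0.
def f(x):
    return 0.5 * np.dot(x, x)

x = np.array([3.0, -2.0])
for t in range(1, 201):
    p = np.sign(x)          # pseudogradient: ignores the gradient's magnitude
    x = x - (1.0 / t) * p   # pseudogradient descent with eta_t = 1/t
print(f(x) < f(np.array([3.0, -2.0])))  # True: f decreased without grad f's magnitude
```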

Note that (Polyak and Tsypkin, 1973) define the pseudogradients as *stochastic* vectors that satisfy the above inequality in conditional expectation, and for a time-varying function. Indeed, there are a number of very interesting results in that paper. However, for simplicity of exposition I will only consider the deterministic case and only describe the application to the Perceptron.

Let’s see how this would work. Let’s assume that, at least for an initial number of rounds, $\langle \nabla f(\boldsymbol{x}_t), \boldsymbol{p}_t \rangle > 0$, which means that the angle between the pseudogradient and the gradient is acute. From the $L$-smoothness of $f$, we have that

$f(\boldsymbol{x}_{t+1}) \leq f(\boldsymbol{x}_t) - \eta_t \langle \nabla f(\boldsymbol{x}_t), \boldsymbol{p}_t \rangle + \frac{\eta_t^2 L}{2} \|\boldsymbol{p}_t\|^2.$

Now, if $\eta_t < \frac{2 \langle \nabla f(\boldsymbol{x}_t), \boldsymbol{p}_t \rangle}{L \|\boldsymbol{p}_t\|^2}$, we have that $f(\boldsymbol{x}_{t+1}) < f(\boldsymbol{x}_t)$, so we can guarantee that the value of $f$ decreases at each step. So, we are minimizing $f$ without using a gradient!

To get a rate of convergence, we should know something more about $\boldsymbol{p}_t$. For example, we could assume that $\langle \nabla f(\boldsymbol{x}_t), \boldsymbol{p}_t \rangle \geq 1$ and $\|\boldsymbol{p}_t\| \leq G$. Then, setting $\eta_t = \frac{1}{L G^2}$, we obtain

$f(\boldsymbol{x}_{t+1}) \leq f(\boldsymbol{x}_t) - \frac{1}{2 L G^2}.$

This is still not enough, because it is clear that $\langle \nabla f(\boldsymbol{x}_t), \boldsymbol{p}_t \rangle \geq 1$ cannot be true on all rounds, because $\nabla f(\boldsymbol{x}_t) = \boldsymbol{0}$ when we are in the minimizer. However, with enough assumptions, following this route you can even get a rate of convergence.

**4. Pseudogradients for the Perceptron**

How do we use this to explain the Perceptron? Suppose your set of samples is *linearly separable* with a margin of 1. This means that there exists a vector $\boldsymbol{u}$ such that

$y_t \langle \boldsymbol{u}, \boldsymbol{x}_t \rangle \geq 1, \ t=1, \dots, T. \qquad (2)$

Note that the value of the margin is arbitrary: we can change it just by rescaling $\boldsymbol{u}$.

Remark 1. An equivalent way to restate this condition is to constrain $\boldsymbol{u}$ to have unitary norm and require

$y_t \langle \boldsymbol{u}, \boldsymbol{x}_t \rangle \geq \gamma, \ t=1, \dots, T,$

where $\gamma$ is called the *maximum margin* of the set of samples. However, in the following I will not use the margin notation because it makes things a bit less clear from an optimization point of view.

We would like to construct an algorithm to find $\boldsymbol{u}$ (or any positive scaling of it) from the samples $(\boldsymbol{x}_t, y_t)$. So, we need an objective function. Here is the brilliant idea of Polyak and Tsypkin: in each iteration take an arbitrary sample $(\boldsymbol{x}_t, y_t)$ on which the current prediction is wrong, that is $y_t \langle \boldsymbol{w}_t, \boldsymbol{x}_t \rangle \leq 0$, and define $\boldsymbol{p}_t = -y_t \boldsymbol{x}_t$, that is exactly the negative update we use in the Perceptron. This turns out to be a pseudogradient for $f(\boldsymbol{w}) = \frac{1}{2}\|\boldsymbol{w} - \boldsymbol{u}\|^2$. Indeed,

$\langle \nabla f(\boldsymbol{w}_t), \boldsymbol{p}_t \rangle = \langle \boldsymbol{w}_t - \boldsymbol{u}, -y_t \boldsymbol{x}_t \rangle = -y_t \langle \boldsymbol{w}_t, \boldsymbol{x}_t \rangle + y_t \langle \boldsymbol{u}, \boldsymbol{x}_t \rangle \geq 1,$

where in the last inequality we used (2).

Let’s pause for a moment to look at what we did: we want to minimize $f(\boldsymbol{w}) = \frac{1}{2}\|\boldsymbol{w} - \boldsymbol{u}\|^2$, but its gradient is just impossible to calculate because it depends on $\boldsymbol{u}$, which we clearly do not know. However, *every time the Perceptron finds a sample on which its prediction is wrong*, we can construct a pseudogradient, without any knowledge of $\boldsymbol{u}$. It is even more surprising if you consider the fact that there is an infinite number of possible solutions $\boldsymbol{u}$, and hence functions $f$, yet the pseudogradient correlates positively with the gradient of any of them! Moreover, no stochasticity is necessary.

At this point we are basically done. In fact, observe that $f(\boldsymbol{w}) = \frac{1}{2}\|\boldsymbol{w} - \boldsymbol{u}\|^2$ is 1-smooth. So, every time $y_t \langle \boldsymbol{w}_t, \boldsymbol{x}_t \rangle \leq 0$, the analysis above tells us that

$f(\boldsymbol{w}_{t+1}) \leq f(\boldsymbol{w}_t) - \eta_t + \frac{\eta_t^2}{2}\|\boldsymbol{x}_t\|^2 \leq f(\boldsymbol{w}_t) - \eta_t + \frac{\eta_t^2 R^2}{2},$

where in the last inequality we have assumed $\|\boldsymbol{x}_t\| \leq R$.

Setting $\eta_t = \eta$, summing over time, and denoting by $M$ the number of updates we have over $T$ iterations, we obtain

$0 \leq f(\boldsymbol{w}_{T+1}) \leq f(\boldsymbol{w}_1) - \eta M + \frac{\eta^2 R^2 M}{2} = \frac{1}{2}\|\boldsymbol{u}\|^2 - \eta M + \frac{\eta^2 R^2 M}{2},$

where we used the fact that $\boldsymbol{w}_1 = \boldsymbol{0}$.

Now, there is the actual magic of the (parameter-free!) Perceptron update rule: as we explained here, the updates of the Perceptron are independent of $\eta$. That is, given an order in which the samples are presented to the algorithm, any fixed $\eta > 0$ makes the Perceptron update on the same samples, and it only changes the scale of $\boldsymbol{w}_t$. Hence, even if the Perceptron algorithm uses $\eta = 1$, we can consider an arbitrary $\eta$ decided post-hoc to minimize the upper bound. Hence, choosing $\eta = \frac{1}{R^2}$, we obtain

$0 \leq \frac{1}{2}\|\boldsymbol{u}\|^2 - \frac{M}{R^2} + \frac{M}{2 R^2},$

that is

$M \leq R^2 \|\boldsymbol{u}\|^2.$

Now, observing that the r.h.s. is independent of $T$, we proved that the maximum number of updates, or equivalently mistakes, of the Perceptron algorithm is bounded.

Are we done? Not yet! We can now improve the Perceptron algorithm by taking full advantage of the pseudogradient interpretation.

**5. An Improved Perceptron**

This is a little known idea to improve the Perceptron. It can be shown with the classic analysis as well, but it comes very naturally from the pseudogradient analysis.

Let’s start from

$f(\boldsymbol{w}_{t+1}) \leq f(\boldsymbol{w}_t) - \eta_t + \frac{\eta_t^2}{2}\|\boldsymbol{x}_t\|^2.$

Now consider only the rounds in which $y_t \langle \boldsymbol{w}_t, \boldsymbol{x}_t \rangle \leq 0$ and set $\eta_t = \frac{1}{\|\boldsymbol{x}_t\|^2}$, that is obtained by an optimization of the expression $-\eta_t + \frac{\eta_t^2}{2}\|\boldsymbol{x}_t\|^2$. So, we obtain

$f(\boldsymbol{w}_{t+1}) \leq f(\boldsymbol{w}_t) - \frac{1}{2 \|\boldsymbol{x}_t\|^2}. \qquad (3)$

This means that now the update rule becomes

$\boldsymbol{w}_{t+1} = \boldsymbol{w}_t + \frac{y_t \boldsymbol{x}_t}{\|\boldsymbol{x}_t\|^2}.$

Now, denoting by $\mathcal{M}$ the set of rounds in which we update and by $M = |\mathcal{M}|$ the number of updates, summing (3) over time we get

$\sum_{t \in \mathcal{M}} \frac{1}{\|\boldsymbol{x}_t\|^2} \leq \|\boldsymbol{u}\|^2.$

It is clear that this inequality implies the previous one, because each term of the sum is at least $\frac{1}{R^2}$. But we can even obtain a tighter bound. Using the inequality between harmonic, geometric, and arithmetic mean, we have

$M \leq \|\boldsymbol{u}\|^2 \frac{M}{\sum_{t \in \mathcal{M}} \frac{1}{\|\boldsymbol{x}_t\|^2}} \leq \|\boldsymbol{u}\|^2 \left(\prod_{t \in \mathcal{M}} \|\boldsymbol{x}_t\|^2\right)^{1/M} \leq \|\boldsymbol{u}\|^2 \frac{1}{M} \sum_{t \in \mathcal{M}} \|\boldsymbol{x}_t\|^2.$

In words, the original Perceptron bound depends on the maximum squared norm of the samples on which we updated. Instead, this bound depends on the geometric or arithmetic mean of the squared norms of the samples on which we updated, which is less than or equal to the maximum.
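Here is a minimal sketch of the improved algorithm; the per-mistake stepsize $\eta_t = 1/\|\boldsymbol{x}_t\|^2$ is my reading of the optimization above, so treat it as an assumption:

```python
import numpy as np

def improved_perceptron(X, y, max_passes=100):
    """Perceptron with per-mistake stepsize 1 / ||x_t||^2 (assumed from the
    derivation above): the update becomes w += y_t * x_t / ||x_t||^2."""
    w = np.zeros(X.shape[1])
    for _ in range(max_passes):
        mistakes = 0
        for xt, yt in zip(X, y):
            if yt * np.dot(w, xt) <= 0:  # mistake: normalized update
                w += yt * xt / np.dot(xt, xt)
                mistakes += 1
        if mistakes == 0:
            break
    return w

# Toy data with very different sample norms, where the mean-based bound helps
X = np.array([[10.0, 5.0], [0.2, -0.1], [-10.0, 3.0], [-0.2, 0.1]])
y = np.array([1, 1, -1, -1])
w = improved_perceptron(X, y)
print(all(yt * np.dot(w, xt) > 0 for xt, yt in zip(X, y)))  # True
```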

**6. Pseudogradients and Lyapunov Potential Functions**

Some people might have realized yet another way to look at this: $\frac{1}{2}\|\boldsymbol{w}_t - \boldsymbol{u}\|^2$ is the Lyapunov function typically used to analyze subgradient descent. In fact, the classic analysis of SGD considers the guaranteed decrement of this function at each step. The two things coincide, but I find that the pseudogradient idea adds a non-trivial amount of information, because it allows us to bypass the idea of using a subgradient of the loss function completely.

Moreover, the idea of pseudogradients is more general, because it applies to any smooth function, not only to the choice $f(\boldsymbol{w}) = \frac{1}{2}\|\boldsymbol{w} - \boldsymbol{u}\|^2$.

Overall, it is clear that all the good analyses of the Perceptron must have something in common. However, sometimes recasting a problem in a particular framework might have some advantages because it helps our intuition. In this view, I find the pseudogradient view particularly compelling because it aligns with my intuition of how an optimization algorithm is supposed to work.

**7. History Bits**

I already wrote about the Perceptron, so I will just add a few more relevant bits.

As I said, it seems that the family of Perceptron algorithms was intended to be something much more general than what we mean now. The particular class of Perceptrons we use nowadays was called the $\alpha$-system (Block, 1962). I hypothesize that the fact that the $\alpha$-system survived the test of time is exactly due to the simple convergence proofs in (Block, 1962) and (Novikoff, 1963). Both proofs are non-stochastic. For the sake of proper credit assignment, it seems that the convergence of the Perceptron was proved by many others before Block and Novikoff (see references in Novikoff, 1963). However, the proof in (Novikoff, 1963) seems to be the cleanest one. (Aizerman, Braverman, and Rozonoer, 1964) (essentially) describe for the first time the Kernel Perceptron and prove a finite mistake bound for it.

I got the idea of smoothing the Perceptron algorithm with a scaled logistic loss from a discussion on Twitter with Maxim Raginsky. He wrote that (Aizerman, Braverman, and Rozonoer, 1970) proposed some kind of smoothing in a Russian book for the objective function in (1), but I don’t have access to it, so I am not sure what the details are. I just thought of a very natural one.

The idea of pseudogradients and the application to the Perceptron algorithm is in (Polyak and Tsypkin, 1973). However, there the input/output samples are still stochastic and the finite bound is not explicitly calculated. As I have shown, stochasticity is not needed. It is important to remember that online convex optimization as a field would come much later, so there was no reason for these people to consider an arbitrary or even adversarial order of the samples.

The improved Perceptron mistake bound could be new (but please let me know if it isn’t!) and it is inspired by the idea in (Graepel, Herbrich, and Williamson, 2001) of normalizing the samples to show a tighter bound.

**Acknowledgements**

Given the insane amount of mistakes that Nicolò Campolongo usually finds in my posts, this time I asked him to proofread it. So, I thank Nicolò for finding an insane amount of mistakes in a draft of this post.

I have been working on online and stochastic optimization for a while, from a practical and empirical point of view. So, I was already in this field when Adam (Kingma and Ba, 2015) was proposed.

The paper was OK but not a breakthrough, even more so by today’s standards. Indeed, the theory was weak: a regret guarantee for an algorithm supposed to work on stochastic optimization of non-convex functions. The experiments were also weak: the exact same experiments would result in a surefire rejection these days. Later, people also discovered an error in the proof and the fact that the algorithm will not converge on certain one-dimensional stochastic convex functions. Despite all of this, these days Adam is considered the King of the optimization algorithms. Let me be clear: it is known that Adam will not always give you the best performance, yet most of the time people know that they can use it with its default parameters and get, if not the best performance, at least the second best performance on their particular deep learning problem. In other words, Adam is considered nowadays the *default optimizer* for deep learning. So, what is the secret behind Adam?

Over the years, people published a vast number of papers that tried to explain Adam and its performance, too many to list. From the “adaptive learning rate” (adaptive to what? Nobody exactly knows…) to the momentum, to the almost scale-invariance, every single aspect of its arcane recipe has been examined. Yet, none of these analyses gave us the final answer on its performance. It is clear that most of these ingredients are beneficial to the optimization process of *any* function, but it is still unclear why this exact combination, and not another one, makes it the best algorithm. The equilibrium in the mix is so delicate that even the small change required to fix the non-convergence issue was considered to give slightly worse performance than Adam.

The fame of Adam is also accompanied by strong sentiments: It is enough to read posts on r/MachineLearning on Reddit to see the passion that people put in defending their favorite optimizers against the other ones. It is the sort of fervor that you see in religion, in sports, and in politics.

However, how *likely* is all this? I mean, how likely is it that Adam is really the *best* optimization algorithm? How likely is it that we reached the apex of optimization for deep learning a few years ago, in a field that is so young? Could there be another explanation for its prodigious performance?

I have a hypothesis, but before explaining it we have to briefly talk about the applied deep learning community.

In a talk, Olivier Bousquet described the deep learning community as a giant genetic algorithm: researchers in this community are exploring the space of all variants of algorithms and architectures in a semi-random way. Things that consistently work in large experiments are kept, the ones not working are discarded. Note that this process seems to be independent of the acceptance and rejection of papers: the community is so big and active that good ideas in rejected papers are still saved and transformed into best practices in a few months, see for example (Loshchilov and Hutter, 2019). Analogously, ideas in published papers are reproduced by hundreds of people, who mercilessly trash the things that do not reproduce. This process has created a number of heuristics that consistently produce good results in experiments, and the stress here is on “consistently”. Indeed, despite being based on non-convex formulations, the performance of deep learning methods turns out to be extremely reliable. (Note that the deep learning community also has a large bias towards “famous” people, so not all ideas receive the same level of attention…)

So, what is the link between this giant genetic algorithm and Adam? Well, looking carefully at this evolutionary process in the deep learning community, I noticed a pattern: usually people try new architectures *keeping the optimization algorithm fixed*, and most of the time the algorithm of choice is Adam. This happens because, as explained above, Adam is the *default optimizer*.

So, here is my hypothesis: Adam was a very good optimization algorithm for the neural network architectures we had a few years ago, and **people kept evolving new architectures on which Adam works**. So, we might not see many architectures on which Adam does not work, simply because such ideas are discarded prematurely! Such ideas would require designing a new architecture and a new optimizer at the same time.

Now, I am sure many people won’t buy this hypothesis. I am sure they will list all sorts of specific problems on which Adam is not the best algorithm, on which Stochastic Gradient Descent with momentum is the best one, and so on and so forth. However, I would like to point out two things: 1) I don’t describe here a law of nature, but simply a tendency the community has that might have influenced the co-evolution of some architectures and optimizers; 2) I actually have some evidence to support this claim.

If my claims were true, we would expect Adam to be extremely good on deep neural networks and very poor on anything else. And this does happen! For example, Adam is known to perform very poorly on simple convex and non-convex problems that are not deep neural networks, see for example the following experiments from (Vaswani et al., 2019):

It seems that the moment we move away from the specific setting of deep neural networks, with their specific choice of initialization, specific scale of weights, specific loss function, etc., Adam loses its *adaptivity* and its magic default learning rate must be tuned again. Note that you can always write a linear predictor as a one-layer neural network, yet Adam does not work so well in this case either. So, **all the particular choices of architectures in deep learning might have evolved to make Adam work better and better, while the simple problems above do not have any of these nice properties that allow Adam to shine**.

Overall, Adam might be the best optimizer because the deep learning community might be exploring only a small region in the joint search space of architectures/optimizers. If true, that would be ironic for a community that departed from convex methods because they focused only on a narrow region of the possible machine learning algorithms, and it was like, as Yann LeCun wrote, “looking for your lost car keys under the street light knowing you lost them someplace else”.

EDIT: After the publication of this post, Sam Power pointed me to this tweet by Roger Grosse that seems to share a similar sentiment.

**1. SGD on Non-Convex Smooth Functions**

We are interested in minimizing a smooth non-convex function $f$ using stochastic gradient descent (SGD) with unbiased stochastic gradients. More in detail, we assume to have access to an oracle that returns, in any point $\boldsymbol{x}$, a vector $\boldsymbol{g}(\boldsymbol{x}, \xi)$ such that $\mathbb{E}_{\xi}[\boldsymbol{g}(\boldsymbol{x}, \xi)] = \nabla f(\boldsymbol{x})$, where $\xi$ is the realization of a mechanism for computing the stochastic gradient. For example, $\xi$ could be the random index of a training sample we use to calculate the gradient of the training loss, or just random noise that is added on top of our gradient computation. We will also assume that the variance of the stochastic gradient is bounded: $\mathbb{E}_{\xi}[\|\boldsymbol{g}(\boldsymbol{x}, \xi) - \nabla f(\boldsymbol{x})\|^2] \leq \sigma^2$, for all $\boldsymbol{x}$. Weaker assumptions on the variance are possible, but they don’t add much to the general message nor to the scheme of the proof.

Given that the function is non-convex, we clearly cannot hope to converge to the minimum of $f$, so we need a less ambitious goal. We assumed that the function is smooth. As you might remember from my previous posts, smooth functions are differentiable functions whose gradient is Lipschitz. Formally, we say that $f$ is $L$-smooth when $\|\nabla f(\boldsymbol{x}) - \nabla f(\boldsymbol{y})\| \leq L \|\boldsymbol{x} - \boldsymbol{y}\|$, for all $\boldsymbol{x}, \boldsymbol{y}$. This assumption assures us that when we approach a local minimum the gradient goes to zero. Hence, **decreasing the norm of the gradient will be our objective function for SGD.** Note that smoothness is necessary to study the norm of the gradients. In fact, consider the function $f(x) = |x|$, whose derivative does not go to zero when we approach the minimum; on the contrary, it is always different from 0 at any point different from the minimum.

The last thing we will assume is that the function is bounded from below. Remember that boundedness from below does not imply that the minimum of the function exists, e.g., $f(x) = e^{-x}$.

Hence, I start from a point $\boldsymbol{x}_1$ and the SGD update is

$\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \eta_t \boldsymbol{g}(\boldsymbol{x}_t, \xi_t),$

where the $\eta_t$ are deterministic learning rates or stepsizes.

First, let’s see practically how SGD behaves w.r.t. Gradient Descent (GD) on the same problem.

In Figure 1, we are minimizing a one-dimensional non-convex function, where the stochastic gradient in SGD is given by the gradient of the function corrupted by Gaussian noise with zero mean and standard deviation 1. On the other hand, there is no noise for GD. In both cases we use the same stepsizes, and we plot the absolute value of the derivative. We can see that GD will monotonically minimize the gradient till numerical precision, as expected, converging to one of the local minima. Note that with a constant learning rate, GD on this problem would converge even faster. Instead, SGD will jump back and forth, resulting in only *some* iterates having a small gradient. So, our basic question is the following:
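An experiment in the spirit of Figure 1 can be sketched as follows; the test function, stepsize, and horizon here are my own assumptions, not the ones used in the figure:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 1-d non-convex test function (an assumption, not the one in Figure 1)
f = lambda x: x**4 / 4 - x**2          # local minima at x = +/- sqrt(2)
df = lambda x: x**3 - 2 * x            # true derivative

x_gd, x_sgd = 3.0, 3.0
eta = 0.01
for t in range(2000):
    x_gd -= eta * df(x_gd)                         # exact gradient
    x_sgd -= eta * (df(x_sgd) + rng.normal(0, 1))  # gradient + N(0,1) noise

# GD drives |f'| to ~0 at a local minimum; SGD keeps jumping around it.
print(abs(df(x_gd)))   # essentially 0
print(abs(df(x_sgd)))  # small but noisy, not 0
```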

*Will $\nabla f(\boldsymbol{x}_t)$ converge to zero with probability 1 when $t$ goes to infinity in SGD?*

This is more difficult to answer than you might think. However, this is a basic question to know if it actually makes sense to run SGD for a bunch of iterations and return the last iterate, which is how 99% of people use SGD on a non-convex problem.

To warm up, let’s first see what we can prove in a finite-time setting.

As in all other similar analyses, we need to construct a potential (Lyapunov) function that allows us to analyze the algorithm. In the convex case, we would study $\|\boldsymbol{x}_t - \boldsymbol{x}^\star\|^2$, where $\boldsymbol{x}^\star$ is a minimizer of $f$. Here, this potential does not even make sense, because we are not even trying to converge to $\boldsymbol{x}^\star$. It turns out that a better choice is to study $f(\boldsymbol{x}_t)$. We will make use of the following property of $L$-smooth functions:

$f(\boldsymbol{y}) \leq f(\boldsymbol{x}) + \langle \nabla f(\boldsymbol{x}), \boldsymbol{y} - \boldsymbol{x} \rangle + \frac{L}{2}\|\boldsymbol{y} - \boldsymbol{x}\|^2, \ \forall \boldsymbol{x}, \boldsymbol{y}.$
In words, this means that a smooth function is always upper bounded by a quadratic function. Note that this property does not require convexity, so we can safely use it. Thanks to this property, let’s see how our potential evolves over time during the optimization of SGD.

Now, let’s denote by $\mathbb{E}_t$ the expectation w.r.t. $\xi_t$, conditioned on $\xi_1, \dots, \xi_{t-1}$. So, we have

$\mathbb{E}_t[f(\boldsymbol{x}_{t+1})] \leq f(\boldsymbol{x}_t) - \eta_t \left(1 - \frac{L \eta_t}{2}\right) \|\nabla f(\boldsymbol{x}_t)\|^2 + \frac{L \eta_t^2 \sigma^2}{2},$

where in the inequality we have used the fact that the variance of the stochastic gradient is bounded by $\sigma^2$. Taking the total expectation, summing over time, and reordering the terms, we have

$\sum_{t=1}^T \eta_t \left(1 - \frac{L \eta_t}{2}\right) \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|^2] \leq f(\boldsymbol{x}_1) - f^\star + \frac{L \sigma^2}{2} \sum_{t=1}^T \eta_t^2, \qquad (1)$

where $f^\star$ is the infimum of $f$.

Let’s see how useful this inequality is: consider a constant step size $\eta_t = \eta = \min\left(\frac{1}{L}, \frac{\alpha}{\sigma \sqrt{T}}\right)$, where $\alpha$ is the usual critical parameter of the learning rate (that you’ll never be able to tune properly unless you know things that you clearly don’t know…). With this choice, we have $1 - \frac{L \eta}{2} \geq \frac{1}{2}$. So, we have

$\frac{1}{T} \sum_{t=1}^T \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|^2] \leq \frac{2 (f(\boldsymbol{x}_1) - f^\star)}{\eta T} + L \sigma^2 \eta \leq \frac{2 L (f(\boldsymbol{x}_1) - f^\star)}{T} + \left(\frac{2 (f(\boldsymbol{x}_1) - f^\star)}{\alpha} + L \alpha\right) \frac{\sigma}{\sqrt{T}}.$

What we got is almost a convergence result: it says that the average of the expected norms of the gradients goes to zero as $T$ goes to infinity. Given that the average of a set of numbers is greater than or equal to its minimum, this means that there exists at least one iterate in my set of iterates that has a small expected gradient. This is interesting but slightly disappointing. We were supposed to prove that the gradient converges to zero, but instead we only proved that at least *one* of the iterates has indeed a small expected norm, but we don’t know which one. Also, trying to find the right iterate might be annoying, because we only have access to stochastic gradients.

It is also interesting to see that the convergence rate has two terms: a fast $O(1/T)$ rate and a slow $O(\sigma/\sqrt{T})$ rate. This means that we can expect the algorithm to make fast progress at the beginning of the optimization, and then to slowly converge once the number of iterations becomes big enough compared to the variance of the stochastic gradients. In case the noise on the gradients is zero, SGD becomes simply gradient descent and it will converge at the $O(1/T)$ rate. In the noiseless case, we can also show that the last iterate is the one with the smallest gradient. However, note that the learning rate has $\sigma$ in it, so effectively we can achieve a faster convergence in the noiseless case because we would be using a constant stepsize, independent of $T$.

**2. The Magic Trick: Randomly Stopped SGD**

The above reasoning is interesting but it is not a solution to our question: does the last iterate of SGD converge? Yes or no?

There is a possible work-around that looks like a magic trick. Let’s take one iterate of SGD uniformly at random among $\boldsymbol{x}_1, \dots, \boldsymbol{x}_T$ and call it $\boldsymbol{x}_{\text{rand}}$. Taking the expectation with respect to this randomization and the noise in the stochastic gradients, we have that

$\mathbb{E}[\|\nabla f(\boldsymbol{x}_{\text{rand}})\|^2] = \frac{1}{T} \sum_{t=1}^T \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|^2] = O\left(\frac{1}{T} + \frac{\sigma}{\sqrt{T}}\right).$

Basically, it says that if we run SGD for $T$ iterations, and then we stop and return not the last iterate but one of the iterates chosen at random, then, in expectation with respect to everything, the norm of the gradient will be small! Note that this is equivalent to running SGD with a random stopping time. In other words, given that we didn’t know how to prove whether SGD converges, we changed the algorithm by adding a random stopping time, and now the random iterate on which we stop will have, in expectation, the desired convergence rate.

This is a very important result and also a standard one these days. It should be intuitive why the randomization helps: from Figure 1 it is clear that we might be unlucky in the last iteration of SGD; however, randomizing smooths out the noise in expectation and gives a decreasing gradient. However, we just changed the target, because we still didn’t prove whether the last iterate converges. So, we need an alternative way.
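A sketch of the randomly stopped variant; the 1-d test problem and all its constants are illustrative assumptions:

```python
import numpy as np

def sgd_random_iterate(grad, x0, eta, T, sigma, rng):
    """Run SGD for T steps, then return one iterate chosen uniformly at
    random: this is equivalent to stopping at a random time."""
    xs = [x0]
    x = x0
    for _ in range(T):
        x = x - eta * (grad(x) + rng.normal(0, sigma))  # stochastic gradient step
        xs.append(x)
    return xs[rng.integers(0, len(xs))]  # uniform over all stored iterates

rng = np.random.default_rng(0)
grad = lambda x: x**3 - 2 * x  # derivative of an assumed 1-d test function
x_out = sgd_random_iterate(grad, x0=1.0, eta=0.01, T=2000, sigma=1.0, rng=rng)
print(abs(grad(x_out)))  # small in expectation over the randomization
```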

**3. The Disappointing Lim Inf**

Let’s consider again (1). This time let’s select any time-varying positive stepsizes that satisfy

$\sum_{t=1}^{\infty} \eta_t = \infty \quad \text{and} \quad \sum_{t=1}^{\infty} \eta_t^2 < \infty. \qquad (2)$

These two conditions are classic in the study of stochastic approximation. The first condition is needed to be able to travel arbitrarily far from the initial point, while the second one is needed to keep the variance of the noise under control. The classic learning rate $\eta_t \propto \frac{1}{\sqrt{t}}$ does not satisfy these assumptions, but something decaying a little bit faster, such as $\eta_t \propto \frac{1}{\sqrt{t} \ln t}$, will do.

With such a choice, we get

$\sum_{t=1}^{\infty} \eta_t \left(1 - \frac{L \eta_t}{2}\right) \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|^2] \leq f(\boldsymbol{x}_1) - f^\star + \frac{L \sigma^2}{2} \sum_{t=1}^{\infty} \eta_t^2 < \infty,$

where we have used the second condition in (2) in the inequality. Now, that same condition implies that $\eta_t$ converges to 0. So, there exists $T_0$ such that $\eta_t \leq \frac{1}{L}$ for all $t \geq T_0$. So, we get that

$\sum_{t=T_0}^{\infty} \eta_t \, \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|^2] < \infty.$

This implies that $\sum_{t=T_0}^{\infty} \eta_t \|\nabla f(\boldsymbol{x}_t)\|^2 < \infty$ with probability 1. We are almost done: from this last inequality and the condition $\sum_{t=1}^{\infty} \eta_t = \infty$, we can derive the fact that $\liminf_{t \to \infty} \|\nabla f(\boldsymbol{x}_t)\| = 0$.

**Wait, what? What is this $\liminf$???** Unfortunately, it seems that we proved something weaker than we wanted. In words, the lim inf result says that there exists a *subsequence* of the iterates whose gradients converge to zero.

This is very disappointing and we might be tempted to believe that this is the best that we can do. Fortunately, this is not the case. In fact, in a seminal paper (Bertsekas and Tsitsiklis, 2000) proved the convergence of the gradients of SGD to zero with probability 1 under very weak assumptions. Their proof is very convoluted also due to the assumptions they used, but in the following I’ll show a much simpler proof.

**4. The Asymptotic Proof in Few Lines**

In 2018, I found a way to obtain the same result of (Bertsekas and Tsitsiklis, 2000), distilling their long proof into the following Lemma, whose proof is in the Appendix. It turns out that this Lemma is essentially all that we need.

Lemma 1.Let be two non-negative sequences and a sequence of vectors in a vector space . Let and assume and . Assume also that there exists such that , where is such that . Then, converges to 0.

We are now finally ready to prove the asymptotic convergence with probability 1.

Theorem 2. Assume that we use SGD on an $L$-smooth function, with stepsizes that satisfy the conditions in (2). Then, $\|\nabla f(\boldsymbol{x}_t)\|$ goes to zero with probability 1.

*Proof:* We want to use Lemma 1 on . So, first observe that by the -smoothness of , we have

The assumptions and the reasoning above imply that, with probability 1, . This also suggests to set . Also, we have, with probability 1, , because for is a martingale whose variance is bounded by . Hence, for is a martingale in , so it converges in with probability 1.

Overall, with probability 1 the assumptions of Lemma 1 are verified with .

We did it! Finally, we proved that the gradients of SGD do indeed converge to zero with probability 1. This means that, with probability 1, for any $\epsilon > 0$ there exists $T_\epsilon$ such that $\|\nabla f(\boldsymbol{x}_t)\| \leq \epsilon$ for all $t \geq T_\epsilon$.

Even if I didn’t actually use any intuition in crafting the above proof (I rarely use “intuition” to prove things), Yann Ollivier provided the following intuition for it: the proof is implicitly studying how far apart GD and SGD are. However, instead of estimating the distance between the two processes over a single update, it does it over a large period of time, through the term that can be controlled thanks to the choice of the learning rates.

**5. History Bits**

The idea of taking one iterate at random in SGD was proposed in (Ghadimi and Lan, 2013), and it reminds me of the well-known online-to-batch conversion through randomization. The conditions on the learning rates in (2) go back to (Robbins and Monro, 1951). (Bertsekas and Tsitsiklis, 2000) contains a good review of previous work on the asymptotic convergence of SGD, while a recent paper on this topic is (Patel, V., 2020).

I derived Lemma 1 as an extension of Proposition 2 in (Alber et al., 1998)/Lemma A.5 in (Mairal, 2013). Studying the proof of (Bertsekas and Tsitsiklis, 2000), I realized that I could change (Alber et al., 1998, Proposition 2) into what I needed. I had this proof sitting in my unpublished notes for 2 years, so I decided to write a blog post on it.

My actual small contribution to this line of research is a lim inf convergence for SGD with AdaGrad stepsizes (Li and Orabona, 2019), but under stronger assumptions on the noise.

Note that 20-30 years ago there were many papers studying the asymptotic convergence of SGD and its variants in various settings. Then, the taste of the community changed, moving from asymptotic convergence to finite-time rates. As it often happens when a new trend takes over the previous one, new generations tend to be oblivious to the old results and proof techniques. The common motivation to ignore these past results is that the finite-time analysis is superior to the asymptotic one, but this is clearly false (ask a statistician!). It should instead be clear to anyone that both analyses have pros and cons.

**6. Acknowledgements**

I thank Léon Bottou for telling me about the problem of analyzing the asymptotic convergence of SGD in the non-convex case with a simple and general proof in 2018. Léon also helped me check my proofs and found an error in a previous version. Also, I thank Yann Ollivier for reading my proof and kindly providing an alternative proof and the intuition that I report above.

**7. Appendix**

*Proof of Lemma 1:* Since the series diverges, given that converges, we necessarily have . Hence, we have to prove that .

Let us proceed by contradiction and assume that . First, assume that .

Given the values of the and , we can then build two sequences of indices and such that

- ,
- , for ,
- , for .

Define . The convergence of the series implies that the sequence of partial sums is a Cauchy sequence. Hence, there exists an index large enough such that for all we have and are less than or equal to . Then, we have for all and all with ,

Therefore, using the triangle inequality, . Finally, for all , which contradicts . Therefore, goes to zero.

To rule out the case that , proceed in the same way, choosing any . Hence, we get that for , that contradicts .

Don’t get me wrong: assuming bounded domains is perfectly fine and justified most of the time. However, sometimes it is unnecessary and it might also obscure critical issues in the analysis, as in this case. So, to balance the universe of first-order methods, I decided to show how to easily prove the convergence of the iterates of SGD, even in unbounded domains.

Technically speaking, the following result might be new, but definitely not worth a fight with Reviewer 2 to publish it somewhere.

**1. Setting **

First, let’s define our setting. We want to solve the following optimization problem

where and is a convex function. Now, various assumptions are possible on and choosing the right one depends on *your* particular problem; there are no right answers. Here, we will not make any strong assumption on . Also, we will *not* assume to be bounded. Indeed, in most of the modern applications in Machine Learning, is simply the entire space . We will also assume that is not empty and that is any element in it.

We also assume to have access to a *first-order stochastic oracle* that returns stochastic sub-gradients of on any point . In formulas, we get such that . Practically speaking, every time you calculate the (sub)gradient on a minibatch of training data, that is a stochastic (sub)gradient and roughly speaking the random minibatch is the random variable .

Here, for didactic reasons, we will assume that is bounded by 1; similar results can also be shown with more realistic assumptions. This holds, for example, if is an average of 1-Lipschitz functions and you draw some of them to calculate the stochastic subgradient.

The algorithm we want to focus on is SGD. So, what is SGD? SGD is an incredibly simple optimization algorithm, almost primitive. Indeed, part of its fame depends critically on its simplicity. Basically, you start from a certain and you update your solution iteratively, moving in the direction of the negative stochastic subgradients, multiplied by a *learning rate* . We also use a projection onto . Of course, if , no projection is needed. So, the update of SGD is

where and is the projection onto . Remember that when you use subgradients, SGD is not a descent algorithm: I already blogged about the fact that the common intuition of moving towards a descent direction is wrong when you use subgradients.
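To make the update concrete, here is a minimal sketch in Python/NumPy. Everything in it is my own illustration, not part of the post: the projection onto an L2 ball stands in for a generic projection onto V, and the noisy subgradient of f(w) = |w| is a toy 1-Lipschitz oracle.

```python
import numpy as np

def project_l2_ball(w, radius):
    # Euclidean projection onto the L2 ball of the given radius (identity inside it).
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def sgd(stochastic_subgrad, w0, etas, project=None):
    # Plain SGD: move against the stochastic subgradient, then (optionally) project onto V.
    w = np.asarray(w0, dtype=float)
    iterates = [w.copy()]
    for eta in etas:
        g = stochastic_subgrad(w)
        w = w - eta * g
        if project is not None:
            w = project(w)
        iterates.append(w.copy())
    return np.array(iterates)

# Toy oracle: noisy subgradient of f(w) = |w| (1-Lipschitz, minimized at 0).
rng = np.random.default_rng(0)
noisy_subgrad = lambda w: np.sign(w) + 0.1 * rng.standard_normal(w.shape)

etas = [0.5 / np.sqrt(t + 1) for t in range(2000)]
iterates = sgd(noisy_subgrad, np.array([3.0]), etas,
               project=lambda w: project_l2_ball(w, 10.0))
avg_iterate = iterates.mean(axis=0)
```

On this toy problem, the average iterate ends up close to the minimizer at zero, as the theory below predicts.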

**2. Convergence of the Average of the Iterates **

Now, the most common analysis of SGD can be done in two different ways: constant learning rate and non-increasing learning rate. We already saw both of them in my lecture notes on online learning, so let’s summarize here the one-step inequality for SGD we need:

for all measurable w.r.t. .

If you plan to use iterations, you can use a learning rate , and summing (1) we get

where we set . This is not a convergence result yet, because it just says that *on average* we are converging. To extract a single solution, we can use Jensen’s inequality and obtain

where . In words, we show a convergence guarantee for *the average of the iterates of SGD*, not for the last one.

Constant learning rates are a bit annoying because they depend on how many iterations you plan to do, theoretically and empirically. So, let’s now take a look at non-increasing learning rates, . In this case, the correct way to analyze SGD without the boundedness assumption is to sum (1) *without dividing by *, to have

where we set . From this one, we have two alternatives. First, we can observe that

because is a minimizer and the learning rate is non-increasing. So, using again Jensen’s inequality, we get

Note that if you like these sorts of games, you can even change the learning rate to shave a factor, but it is probably useless from an applied point of view.

Another possibility is to use a weighted average:

where and we used . Note that this option does not seem to give any advantage over the unweighted average above. Also, it weights the first iterations more than the last ones, which in most cases is a bad idea: the first iterations tend to be farther away from the optimum than the last ones.
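A small numerical comparison of the three candidates (last iterate, plain average, eta-weighted average) makes this tangible. The setup is my own toy problem, SGD on f(w) = |w| with noisy subgradients on an unbounded domain:

```python
import numpy as np

# SGD on f(w) = |w| with noisy subgradients, unbounded domain, eta_t = 1/sqrt(t).
rng = np.random.default_rng(1)
T = 4000
etas = 1.0 / np.sqrt(np.arange(1, T + 1))
w, iterates = 5.0, []
for t in range(T):
    g = np.sign(w) + 0.2 * rng.standard_normal()  # noisy subgradient of |w|
    w -= etas[t] * g                              # no projection: V is the whole line
    iterates.append(w)
iterates = np.array(iterates)

last = abs(iterates[-1])                                 # last iterate
plain_avg = abs(iterates.mean())                         # unweighted average
weighted_avg = abs(np.average(iterates, weights=etas))   # eta_t-weighted average
```

All three estimators end up near the minimizer, but the weighted one gives extra mass to the early, far-away iterates, consistent with the remark above.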

Let’s summarize what we have till now:

- Unbounded domains are fine with both constant and time-varying learning rates.
- The optimal learning rate depends on the distance between the optimal solution and the initial iterate, because the optimal setting of is proportional to .
- The weighted average is probably a bad idea and not strictly necessary.
- It seems we can only guarantee convergence for (weighted) averages of iterates.

The last point is a bit concerning: most of the time we take the last iterate of SGD, so why do we do it if the theory applies to the average?

**3. Convergence of the Last Iterate **

Actually, we do know that

- the last solution of SGD converges in unbounded domains with constant learning rate (Zhang, T., 2004).
- the last iterate of SGD converges in bounded domains with non-increasing learning rates (Shamir, O. and Zhang, T., 2013).

So, what about unbounded domains and non-increasing learning rates, i.e., 90% of the uses of SGD? It turns out that it is equally simple and I think the proof is also instructive! As surprising as it might sound, not dividing (1) by the learning rate is the key ingredient we need. The proof plan is the following: we want to prove that the value of on the last iterate is not too far from the value of on . To prove it, we need the following technical lemma on sequences of non-negative numbers multiplied by non-increasing learning rates, whose proof is in the Appendix. This Lemma relates the last element of a sequence of numbers to their average.

Lemma 1. Let be a non-increasing sequence of positive numbers and . Then

With the above Lemma, we can prove the following guarantee for the convergence of the last iterate of SGD.

Theorem 2. Assume the stepsizes are deterministic and non-increasing. Then

*Proof:* We use Lemma 1, with , to have

Now, we bound the sum on the r.h.s. of the last inequality. Summing (1) from to , we have the following inequality that holds for any :

Hence, setting , we have

Putting all together, we have the stated bound.

There are a couple of nice tricks in the proof that might be interesting to study carefully. First, we use the fact that the one-step inequality in (1) holds for any . Most of the time, we state it with equal to , but it turns out that the more general statement is actually important! In fact, it is possible to know how far the performance of the last iterate is from the performance of the average, because the incremental nature of SGD makes it possible to know exactly how far is from any previous iterate , with . Please note that all of this would be hidden in the case of bounded domains, where all the distances are bounded by the diameter of the set, and you don’t get the dependency on .

Now we have all the ingredients and we only have to substitute a particular choice of the learning rate.

*Proof:* First, observe that

Now, considering the last term in (3), we have

Using (2) and dividing by , we have the stated bound.

Note that the above proof works similarly if .

**4. History Bits **

The first finite-time convergence proof for the last iterate of SGD is from (Zhang, T., 2004), where he considered the constant learning rate case. It was later extended in (Shamir, O. and Zhang, T., 2013) to time-varying learning rates, but only for bounded domains. The convergence rate for the weighted average in unbounded domains is from (Zhang, T., 2004). The observation that the weighted average is not needed and the plain average works equally well for non-increasing learning rates is from (X. Li and F. Orabona, 2019), where we needed it for the particular case of AdaGrad learning rates. The idea of analyzing SGD without dividing by the learning rate is by (Zhang, T., 2004). Lemma 1 is new but actually hidden in the convergence proof of the last iterate of SGD with linear predictors and square losses in (Lin, J. and Rosasco, L. and Zhou, D.-X., 2016), which in turn is based on the one in (Shamir, O. and Zhang, T., 2013). As far as I know, Corollary 3 is new, but please let me know if you happen to know a reference for it! It is possible to remove the logarithmic term in the bound using a different learning rate, but the proof is only for bounded domains (Jain, P. and Nagaraj, D. and Netrapalli, P., 2019).

**5. Exercises **

Exercise 1.Generalize the above proofs to the Stochastic Mirror Descent case.

Exercise 2. Remove the assumption of expected bounded stochastic subgradients and instead assume that is -smooth, i.e., has -Lipschitz gradient, and that the variance of the noise is bounded. Hint: take a look at the proofs in (Zhang, T., 2004) and (X. Li and F. Orabona, 2019).

**6. Appendix **

*Proof of Lemma 1:* Define , so we have

that implies

Now, from the definition of and the above inequality, we have

that implies

Unrolling the inequality, we have

Using the definition of and the fact that , we have the stated bound.

In this post, I explain a variation of the EG/Hedge algorithm, called *AdaHedge*. The basic idea is to design an algorithm that is adaptive to the sum of the squared norm of the losses, without any prior information on the range of the losses.

First, consider the case in which we use as constant regularizer the negative entropy , where will be determined in the following and is the simplex in . Using FTRL with linear losses with this regularizer, we immediately obtain

where we upper bounded the negative entropy of with 0. Using the strong convexity of the regularizer w.r.t. the norm and Lemma 4 here, we would further upper bound this as

This suggests that the optimal should be . However, as we have seen for L* bounds, this choice of any parameter of the algorithm is never feasible. Hence, exactly as we did for L* bounds, we might think of using an online version of this choice

where is a constant that will be determined later. An important property of this choice is that it gives rise to an algorithm that is scale-free, that is, its predictions are invariant to the scaling of the losses by any constant factor. This is easy to see because

Note that this choice makes the regularizer non-decreasing over time and immediately gives us

At this point, we might be tempted to use Lemma 1 from the L* post to upper bound the sum in the upper bound, but unfortunately we cannot! Indeed, the denominator does not contain the term . We might add a constant to , but that would destroy the scale-freeness of the algorithm. However, it turns out that we can still prove our bound without any change to the regularizer. The key observation is that we can bound the term in two different ways. The first way is the one above, while the other one is

where we used the definition of and the fact that the regularizer is non-decreasing over time. So, we can now write

where we used the fact that the minimum between two numbers is less than their harmonic mean. Assuming and using Lemma 1 here, we have

The bound and the assumption on suggest setting . To summarize, we obtained a scale-free algorithm with regret bound .

We might consider ourselves happy, but there is a clear problem in the above algorithm: the choice of in the time-varying regularizer strictly depends on our upper bound. So, a loose bound will result in a poor choice of the regularization! In general, every time we use a part of the proof in the design of an algorithm we cannot expect an exciting empirical performance, unless our upper bound was really tight. So, can we design a better regularizer? Well, we need a better upper bound!

Let’s consider a generic regularizer and its corresponding FTRL with linear losses regret upper bound

where we assume to be non-decreasing in time.

Now, observe that the sum is unlikely to disappear for this kind of algorithms, so we could try to make the term of the same order of the sum. So, we would like to set of the same order of . However, this approach would cause an annoying recurrence. So, using the fact that is non-decreasing, let’s upper bound the terms in the sum just a little bit:

Now, we can set for , , and . This immediately implies that

Setting to be equal to the negative entropy, we get an algorithm known as AdaHedge. It is easy to see that this choice makes the algorithm scale-free as well.

With this choice of the regularizer, we can simplify a bit the expression of . For , we have . Instead, for , using the properties of the Fenchel conjugates, we have that

Overall, we get the pseudo-code of AdaHedge in Algorithm 1.
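Since the recipe is easier to parse with code in hand, here is a minimal sketch of AdaHedge in Python, in the losses convention used in this post. The function name and the (T, N) loss-matrix interface are my own; the logic follows the standard AdaHedge recipe, with the learning rate set to ln(N) divided by the running sum of mixability gaps.

```python
import numpy as np

def adahedge(losses):
    # Sketch of AdaHedge on a (T, N) array of per-round expert losses.
    # Returns the algorithm's total loss and its regret vs. the best expert.
    T, N = losses.shape
    L = np.zeros(N)     # cumulative expert losses
    Delta = 0.0         # cumulative mixability gap; eta_t = ln(N) / Delta
    total = 0.0
    for ell in losses:
        if Delta > 0.0:
            eta = np.log(N) / Delta
            v = np.exp(-eta * (L - L.min()))        # shifted for numerical stability
            w = v / v.sum()
            lo = ell.min()
            z = w @ np.exp(-eta * (ell - lo))       # > 0 for moderate eta
            mix = lo - np.log(z) / eta              # mix loss of this round
        else:
            w = (L == L.min()).astype(float)        # eta = inf: follow the leader
            w /= w.sum()
            mix = ell[w > 0].min()
        h = w @ ell                   # Hedge's (expected) loss this round
        Delta += h - mix              # the mixability gap is non-negative
        total += h
        L += ell
    return total, total - L.min()
```

Note that multiplying all losses by a constant scales Delta, and hence 1/eta, by the same constant, so the weights (and the algorithm's choices) are unchanged: the sketch is scale-free, as claimed above.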

So, now we need an upper bound for . Observe that . Moreover, as we have done before, we can upper bound in two different ways. In fact, from Lemma 4 here, we have for . Also, denoting by , we have

Hence, we have

We can solve this recurrence using the following Lemma, where and .

Lemma 1.Let be any sequence of non-negative real numbers. Suppose that is a sequence of non-negative real numbers satisfying

Then, for any , .

*Proof:* Observe that

We bound each term in the sum separately. The left term of the minimum inequality in the definition of gives , while the right term gives . So, we conclude .

So, overall we got

and setting , we have

Note that this is roughly the same regret in (2), but the very important difference is that this new regret bound depends on the *much tighter quantity* , that we upper bounded with , but in general will be much smaller than that. For example, can be upper bounded using the tighter local norms, see the analysis of Exp3. Instead, in the first solution, the regret will always be dominated by the term because we explicitly use it in the regularizer!

There is an important lesson to be learned from AdaHedge: the regret is not the full story and algorithms with the same worst-case guarantee can exhibit vastly different empirical behaviors. Unfortunately, this message is rarely heard and there is a part of the community that focuses too much on the worst-case guarantee rather than on the empirical performance. Even worse, sometimes people favor algorithms with a “more elegant analysis” completely ignoring the likely worse empirical performance.

**1. History Bits **

The use of FTRL with the regularizer in (1) was proposed in (Orabona and Pál, 2015); I presented a simpler version of their proof that does not require Fenchel conjugates. The AdaHedge algorithm was introduced in (van Erven et al., 2011) and refined in (de Rooij et al., 2014). The analysis reported here is from (Orabona and Pál, 2015), which generalized AdaHedge to arbitrary regularizers in AdaFTRL. Additional properties of AdaHedge for the stochastic case were proven in (van Erven et al., 2011).

**2. Exercises **


Exercise 1.Implement AdaHedge and compare its empirical performance to FTRL with the time-varying regularizer in (1).

* You can find the other lectures here.*

In this lecture, we will explore the link between Online Learning and Statistical Learning Theory.

**1. Agnostic PAC Learning **

We now consider a different setting from what we have seen till now. We will assume that we have a prediction strategy parametrized by a vector and we want to learn the relationship between an input and its associated label . Moreover, we will assume that is drawn from a joint probability distribution . Also, we are equipped with a loss function that measures how good our prediction is compared to the true label , that is . So, learning the relationship can be cast as minimizing the expected loss of our predictor

In machine learning terms, the object above is nothing else than the *test error* of our predictor.

Note that the above setting assumes labeled samples, but we can generalize it even more by considering *Vapnik’s general setting of learning*, where we collapse the prediction function and the loss into a unique function. This allows us, for example, to treat supervised and unsupervised learning in the same unified way. So, we want to minimize the *risk*

where is an unknown distribution over and is measurable w.r.t. the second argument. Also, the set of all predictors that can be expressed by vectors in is called the *hypothesis class*.

Example 1.In a linear regression task where the loss is the square loss, we have and . Hence, .

Example 2.In linear binary classification where the loss is the hinge loss, we have and . Hence, .

Example 3.In binary classification with a neural network with the logistic loss, we have and is the network corresponding to the weights . Hence, .
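To make the three examples concrete, here are the corresponding loss evaluations in Python (the function names and argument conventions are mine):

```python
import numpy as np

def square_loss(w, x, y):
    # Example 1: linear regression with the square loss.
    return (np.dot(w, x) - y) ** 2

def hinge_loss(w, x, y):
    # Example 2: linear binary classification with the hinge loss, y in {-1, +1}.
    return max(0.0, 1.0 - y * np.dot(w, x))

def logistic_loss(score, y):
    # Example 3: any real-valued predictor output (e.g., a network), y in {-1, +1}.
    return np.log1p(np.exp(-y * score))
```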

The key difficulty of the above problem is that we don’t know the distribution . Hence, there is no hope to exactly solve this problem. Instead, we are interested in understanding *what is the best we can do if we have access to samples drawn i.i.d. from *. In more detail, we want to upper bound the *excess risk*

where is a predictor that was *learned* using samples.

It should be clear that this is just an optimization problem and we are interested in upper bounding the suboptimality gap. In this view, the objective of machine learning can be considered as a particular optimization problem.

Remark 1.Note that this is not the only way to approach the problem of learning. Indeed, the regret minimization model is an alternative model to learning. Moreover, another approach would be to try to estimate the distribution and then solve the risk minimization problem, the approach usually taken in Statistics. No approach is superior to the other and each of them has its pros and cons.

Given that we have access to the distribution through samples drawn from it, any procedure we might think to use to minimize the risk will be stochastic in nature. This means that we cannot assure a deterministic guarantee. Instead, *we can try to prove that with high probability our minimization procedure will return a solution that is close to the minimizer of the risk*. It is also intuitive that the precision and probability we can guarantee must depend on how many samples we draw from .

Quantifying the dependency of precision and probability of failure on the number of samples used is the objective of the **Agnostic Probably Approximately Correct** (PAC) framework, where the keyword “agnostic” refers to the fact that we don’t assume anything on the best possible predictor. In more detail, given a precision parameter and a probability of failure , we are interested in characterizing the *sample complexity of the hypothesis class *, that is defined as the number of samples necessary to guarantee with probability at least that the best learning algorithm using the hypothesis class outputs a solution that has an excess risk upper bounded by . Note that the sample complexity does not depend on , so it is a worst-case measure w.r.t. all the possible distributions. This makes sense if you think that we know nothing about the distribution , so if your guarantee holds for the worst distribution it will also hold for any other distribution. Mathematically, we will say that the hypothesis class is agnostic PAC-learnable if such a sample complexity function exists.

Definition 1. We will say that a function class is *Agnostic-PAC-learnable* if there exists an algorithm and a function such that, when is used with samples drawn from , with probability at least the solution returned by the algorithm has excess risk at most .

Note that the Agnostic PAC learning setting does not say what procedure we should follow to achieve such a sample complexity. The approach most commonly used in machine learning to solve the learning problem is the so-called *Empirical Risk Minimization (ERM) problem*. It consists of drawing samples i.i.d. from and minimizing the *empirical risk*:

In words, ERM is nothing else than minimizing the error on a training set. However, in many interesting cases can be very far from the true optimum , even with an infinite number of samples! So, we need to modify the ERM formulation in some way, e.g., using a *regularization* term or a Bayesian prior of , or find conditions under which ERM works.
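As a concrete (and well-behaved) instance of ERM, consider the square loss over linear predictors, where the empirical risk minimizer has a closed form. The synthetic data below is my own toy setup, not from the post:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
w_star = rng.standard_normal(d)                 # hypothetical "true" predictor
X = rng.standard_normal((n, d))                 # training inputs
y = X @ w_star + 0.1 * rng.standard_normal(n)   # noisy labels

# ERM with the square loss over linear predictors = ordinary least squares.
w_erm = np.linalg.lstsq(X, y, rcond=None)[0]

train_risk = np.mean((X @ w_erm - y) ** 2)      # empirical risk of the ERM solution
```

In this benign case ERM works: the estimate lands close to the true predictor. The point of the paragraph above is that this is not guaranteed in general.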

The ERM approach is so widespread that machine learning itself is often wrongly identified with some kind of minimization of the training error. We now show that ERM is not the entire world of ML, proving that *the existence of a no-regret algorithm, that is an online learning algorithm with sublinear regret, guarantees Agnostic-PAC learnability*. In more detail, we will show that an online algorithm with sublinear regret can be used to solve machine learning problems. This is not just a curiosity; for example, this gives rise to computationally efficient parameter-free algorithms, something that can be achieved through ERM only by running a two-step procedure, i.e., running ERM with different parameters and selecting the best solution among them.

We already mentioned this possibility when we talked about the online-to-batch conversion, but this time we will strengthen it by proving high probability guarantees rather than in-expectation ones.

So, we need some more bits on concentration inequalities.

**2. Bits on Concentration Inequalities **

We will use a concentration inequality to prove the high probability guarantee, but we will need to go beyond sums of i.i.d. random variables. In particular, we will use the concept of *martingales*.

Definition 2. A sequence of random variables is called a *martingale* if for all it satisfies:

Example 4.Consider a fair coin and a betting algorithm that bets money on each round on the side of the coin equal to . We win or lose money 1:1, so the total money we won up to round is . is a martingale. Indeed, we have

For bounded martingales we can prove high probability guarantees as for bounded i.i.d. random variables. The following Theorem will be the key result we will need.

Theorem 3 (Hoeffding-Azuma inequality).Let be a martingale of random variables that satisfy almost surely. Then, we have

Also, the same upper bounds hold on .
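A quick simulation (my own, using the fair-coin martingale of Example 4 with unit bets, so the increments are bounded by 1) illustrates the theorem: the observed deviation probability stays below the Hoeffding-Azuma bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n, runs = 400, 5000
# Fair-coin martingale with unit bets: Z_n is a sum of n independent +/-1
# increments, so |Z_t - Z_{t-1}| <= 1 almost surely.
Z = rng.choice([-1.0, 1.0], size=(runs, n)).sum(axis=1)

eps = 2.0 * np.sqrt(n)                 # a deviation of two standard deviations
empirical = np.mean(Z >= eps)          # observed tail probability
azuma = np.exp(-eps ** 2 / (2 * n))    # Hoeffding-Azuma upper bound: exp(-2)
```

The bound is loose for this Gaussian-like tail (the empirical frequency is roughly an order of magnitude smaller), but it holds without any independence between the increments, which is exactly what we need below.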

**3. From Regret to Agnostic PAC **

We now show how the online-to-batch conversion we introduced before gives us high probability guarantee for our machine learning problem.

Theorem 4.Let , where the expectation is w.r.t. drawn from with support over some vector space and . Draw samples i.i.d. from and construct the sequence of losses . Run any online learning algorithm over the losses , to construct the sequence of predictions . Then, we have with probability at least , it holds that

*Proof:* Define . We claim that is a martingale. In fact, we have

where we used the fact that depends only on . Hence, we have

that proves our claim.

Hence, using Theorem 3, we have

This implies that, with probability at least , we have

or equivalently

We now use the definition of regret w.r.t. any , to have

The last step is to upper bound with high probability with . This is easier than the previous upper bound because is a fixed vector, so are i.i.d. random variables, so for sure forms a martingale. So, reasoning as above, we have that with probability at least it holds that

Putting all together and using the union bound, we have the stated bound.

The theorem above upper bounds the average risk of the predictors, while we are interested in producing a single predictor. If the risk is a convex function and is convex, then we can lower bound the l.h.s. of the inequalities in the theorem with the risk evaluated on the average of the . That is

If the risk is not a convex function, we need a way to generate a single solution with small risk. One possibility is to construct a *stochastic classifier* that samples one of the with uniform probability and predicts with it. For this classifier, we immediately have

where the expectation in the definition of the risk of the stochastic classifier is also with respect to the random index. Yet another way is to select, among the predictors, the one with the smallest risk. This works because the average is lower bounded by the minimum. This is easily achieved using samples for the online learning procedure and samples to generate a validation set to evaluate the solutions and pick the best one. The following Theorem shows that selecting the predictor with the smallest empirical risk on a validation set will give us a predictor close to the best one with high probability.

Theorem 5.We have a finite set of predictors and a dataset of samples drawn i.i.d. from . Denote by . Then, with probability at least , we have

*Proof:* We want to calculate the probability that the hypothesis that minimizes the validation error is far from the best hypothesis in the set. We cannot do it directly because we don’t have the required independence to use a concentration inequality. Instead, *we will upper bound the probability that there exists at least one function whose empirical risk is far from the risk.* So, we have

Hence, with probability at least , we have that for all

We are now able to upper bound the risk of , just using the fact that the above applies to too. So, we have

where in the last inequality we used the fact that minimizes the empirical risk.

Using this theorem, we can use samples for the training and samples for the validation. Denoting by the predictor with the best empirical risk on the validation set among the generated during the online procedure, we have with probability at least that

It is important to note that with any of the above three methods to select one among the generated by the online learning procedure, the sample complexity guarantee we get matches the one we would have obtained by ERM, up to polylogarithmic factors. In other words, there is nothing special about ERM compared to the online learning approach to statistical learning.
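As a small illustration of the validation-based selection, here is a sketch in Python. The data, the square loss, and the candidate predictors (mimicking iterates of an online run, plus one deliberately bad candidate) are my own toy choices:

```python
import numpy as np

def pick_best(predictors, X_val, y_val, loss):
    # Empirical risk of each candidate on the validation set; return the minimizer.
    risks = [np.mean(loss(w, X_val, y_val)) for w in predictors]
    best = int(np.argmin(risks))
    return predictors[best], risks[best]

rng = np.random.default_rng(0)
w_star = np.array([1.0, -2.0])
X_val = rng.standard_normal((500, 2))
y_val = X_val @ w_star + 0.1 * rng.standard_normal(500)

# Candidates: small perturbations of w_star, plus the zero vector as a bad one.
candidates = [w_star + 0.05 * rng.standard_normal(2) for _ in range(5)]
candidates.append(np.zeros(2))

sq = lambda w, X, y: (X @ w - y) ** 2
w_best, risk_best = pick_best(candidates, X_val, y_val, sq)
```

As expected, the selection discards the bad candidate and returns one of the near-optimal ones.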

Another important point is that the above guarantee does not imply the existence of online learning algorithms with sublinear regret for any learning problem. It just says that, if it exists, it can be used in the statistical setting too.

**4. History Bits **

Theorem 4 is from (Cesa-Bianchi, N. and Conconi, A. and Gentile, C., 2004). Theorem 5 is nothing else than the Agnostic PAC learning guarantee of ERM for hypothesis classes with finite cardinality. (Cesa-Bianchi, N. and Conconi, A. and Gentile, C., 2004) also gives an alternative procedure to select a single hypothesis among the ones generated during the online procedure that does not require splitting the data into training and validation sets. However, the obtained guarantee matches the one we have proved.

* You can find all the lectures I published here.*

In the last lecture, we introduced the Explore-Then-Commit (ETC) algorithm that solves the stochastic bandit problem, but requires the knowledge of the *gaps*. This time we will introduce a parameter-free strategy that achieves the same optimal regret guarantee.

**1. Upper Confidence Bound Algorithm **

The ETC algorithm has the disadvantage of requiring the knowledge of the gaps to tune the exploration phase. Moreover, it solves the exploration vs. exploitation trade-off in a clunky way. It would be better to have an algorithm that smoothly transitions from one phase into the other *in a data-dependent way*. So, we now describe an optimal and adaptive strategy called the Upper Confidence Bound (UCB) algorithm. It employs the principle of *optimism in the face of uncertainty* to select in each round the arm that has the *potential to be the best one*.

UCB works by keeping an estimate of the expected loss of each arm and also a confidence interval at a certain probability level. Roughly speaking, we have that with probability at least

where the “roughly” comes from the fact that is a random variable itself. Then, UCB will query the arm with the smallest lower bound, that is the one that could potentially have the smallest expected loss.

Remark 1.The name Upper Confidence Bound comes from the fact that traditionally stochastic bandits are defined over rewards, rather than losses. So, in our case we actually use the lower confidence bound in the algorithm. However, to avoid confusion with the literature, we still call it Upper Confidence Bound algorithm.

The key points in the proof are on how to choose the right confidence level and how to get around the dependency issues.

The algorithm is summarized in Algorithm 1 and we can prove the following regret bound.

Theorem 1. Assume that the rewards of the arms are -subgaussian and let . Then, UCB guarantees a regret of

*Proof:* We analyze one arm at a time. Also, without loss of generality, assume that the optimal arm is the first one. For arm , we want to prove that .

The proof is based on the fact that once I have sampled an arm enough times, the probability of taking a suboptimal arm is small.

Let be the biggest time index such that . If , then the statement above is true. Hence, we can safely assume . Now, for bigger than , we have

Consider and such that . Then, we claim that at least one of the two following inequalities must be true:

If the first one is true, the confidence interval around our estimate of the expectation of the optimal arm does not contain . On the other hand, if the second one is true the confidence interval around our estimate of the expectation does not contain . So, we claim that if and we selected a suboptimal arm, then at least one of these two bad events happened.

Let’s prove the claim: *if both the inequalities above are false*, , and , we have

that, by the selection strategy of the algorithm, would imply .

Note that . Hence, we have

Now, we upper bound the probabilities in the sum. Given that the losses on the arms are i.i.d. and using the union bound, we have

Hence, we have

Given that the same bound holds for , we have

Using the decomposition of the regret we proved last time, , we have the stated bound.

It is instructive to observe an actual run of the algorithm. I have considered 5 arms and Gaussian losses. In the left plot of the figure below, I have plotted how the estimates and confidence intervals of UCB vary over time (in blue), compared to the actual true means (in black). On the right side, you can see the number of times each arm was pulled by the algorithm.

It is interesting to note that the logarithmic factor in the confidence term makes the confidence intervals of the arms that are not pulled *increase* over time. In turn, this assures that the algorithm does not miss the optimal arm, even if the estimates were off. Also, the algorithm will keep pulling the two arms that are close together, to be sure which one of the two is the best.
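A run like the one just described can be reproduced with a short sketch (losses version, matching the Remark above). The Gaussian losses, the confidence width, and all names are my choices for the demo, not the exact setup of the figure:

```python
import numpy as np

def ucb_run(means, T, rng):
    # UCB for losses: after one pull per arm, pull the arm with the smallest
    # lower confidence bound  mu_hat_i - sqrt(2 * log(t) / n_i).
    K = len(means)
    n = np.zeros(K)               # pull counts
    s = np.zeros(K)               # cumulative observed losses
    pulls = np.empty(T, dtype=int)
    for t in range(1, T + 1):
        if t <= K:
            i = t - 1                                     # pull each arm once first
        else:
            lcb = s / n - np.sqrt(2.0 * np.log(t) / n)    # lower confidence bounds
            i = int(np.argmin(lcb))
        loss = means[i] + rng.standard_normal()           # 1-subgaussian losses
        n[i] += 1
        s[i] += loss
        pulls[t - 1] = i
    return pulls

rng = np.random.default_rng(0)
means = np.array([0.0, 0.5, 0.5, 1.0, 1.0])   # arm 0 is optimal
pulls = ucb_run(means, 20000, rng)
```

The optimal arm ends up pulled the vast majority of the time, while each suboptimal arm collects a number of pulls roughly proportional to 1 over its squared gap, as in Theorem 1.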

The bound above can become meaningless if the gaps are too small. So, here we prove another bound that does not depend on the inverse of the gaps.

Theorem 2.Assume that the rewards of the arms minus their expectations are -subgaussian and let . Then, UCB guarantees a regret of

*Proof:* Let be some value to be tuned subsequently and recall from the proof of Theorem 1 that for each suboptimal arm we can bound

Hence, using the regret decomposition we proved last time, we have

Choosing , we have the stated bound.

Remark 2.Note that while the UCB algorithm is considered parameter-free, we still have to know the subgaussianity of the arms. While this can be easily upper bounded for stochastic arms with bounded support, it is unclear how to do it without any prior knowledge on the distribution of the arms.

It is possible to prove that the UCB algorithm is asymptotically optimal, in the sense of the following Theorem.

Theorem 3 (Bubeck, S. and Cesa-Bianchi, N., 2012, Theorem 2.2). Consider a strategy that satisfies for any set of Bernoulli reward distributions, any arm with , and any . Then, for any set of Bernoulli reward distributions, the following holds

**2. History Bits **

The use of confidence bounds and the idea of optimism first appeared in the work by (T. L. Lai and H. Robbins, 1985). The first version of UCB is by (T. L. Lai, 1987). The version of UCB I presented is by (P. Auer and N. Cesa-Bianchi and P. Fischer, 2002) under the name UCB1. Note that, rather than considering 1-subgaussian environments, (P. Auer and N. Cesa-Bianchi and P. Fischer, 2002) considers bandits where the rewards are confined to the interval. The proof of Theorem 1 is a minor variation of the one of Theorem 2.1 in (Bubeck, S. and Cesa-Bianchi, N. , 2012), which also popularized the subgaussian setup. Theorem 2 is from (Bubeck, S. and Cesa-Bianchi, N. , 2012).

**3. Exercises **


Exercise 1.Prove a similar regret bound to the one in Theorem 2 for an optimally tuned Explore-Then-Commit algorithm.

* You can find the lectures I published till now here.*

Today, we will consider the *stochastic bandit* setting. Here, each arm is associated with an unknown probability distribution. At each time step, the algorithm selects one arm and it receives a loss (or reward) drawn i.i.d. from the distribution of the arm . We focus on minimizing the *pseudo-regret*, that is the regret with respect to the optimal action in expectation, rather than the optimal action on the sequence of realized losses:

where we denoted by the expectation of the distribution associated with the arm .

Remark 1The usual notation in the stochastic bandit literature is to consider rewards instead of losses. Instead, to keep our notation coherent with the OCO literature, we will consider losses. The two settings are completely equivalent up to a multiplication by -1.

Before presenting our first algorithm for stochastic bandits, we will introduce some basic notions on concentration inequalities that will be useful in our definitions and proofs.

**1. Concentration Inequalities Bits **

Suppose that is a sequence of independent and identically distributed random variables with mean and variance . Having observed we would like to estimate the common mean . The most natural estimator is the *empirical mean*

Linearity of expectation shows that , which means that is an *unbiased estimator* of . Yet, is a random variable itself. So, can we quantify how far will be from ?

We could use Chebyshev’s inequality to upper bound the probability that is far from :

Using the fact that , we have that

So, we can expect the probability of having a “bad” estimate to go to zero as one over the number of samples in our empirical mean. Is this the best we can get? To understand what we can hope for, let’s take a look at the central limit theorem.
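To get a feel for how loose the Chebyshev rate can be, here is a small Monte Carlo sketch; the uniform samples, the sample size, and the threshold are illustrative assumptions.

```python
import random

def deviation_frequency(n, eps, trials=20000, seed=0):
    """Estimate P(|empirical mean - mu| >= eps) for n uniform[0,1]
    samples (mu = 0.5, sigma^2 = 1/12), to compare against the
    Chebyshev bound sigma^2 / (n * eps^2)."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        m = sum(rng.random() for _ in range(n)) / n
        if abs(m - 0.5) >= eps:
            bad += 1
    return bad / trials

# For n=50, eps=0.1 the Chebyshev bound is (1/12)/(50*0.01) ~ 0.167,
# while the observed frequency is typically far smaller.
```

The observed deviation frequency sits well below the Chebyshev bound, which is exactly the slack the central limit theorem argument below quantifies.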

We know that, defining , , the standard Gaussian distribution, as goes to infinity. This means that

where the approximation comes from the central limit theorem. The integral cannot be calculated in closed form, but we can easily upper bound it. Indeed, for , we have

This is better than what we got with Chebyshev’s inequality and we would like to obtain an exact bound with a similar asymptotic rate. To do that, we will focus our attention on *subgaussian* random variables.

Definition 1We say that a random variable is –subgaussianif for all we have that .

Example 1The following random variables are subgaussian:

- If is Gaussian with mean zero and variance , then is -subgaussian.
- If has mean zero and almost surely, then is -subgaussian.

We have the following properties for subgaussian random variables.

Lemma 2 (Lattimore and Szepesvári, 2018, Lemma 5.4) Assume that and are independent and -subgaussian and -subgaussian respectively. Then,

- = 0 and .
- is -subgaussian.
- is -subgaussian.
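As a quick numerical check of the bounded-support example above: a Rademacher variable (uniform on {-1, +1}) is 1-subgaussian, since its moment generating function cosh(λ) never exceeds exp(λ²/2). This is a sketch of that specific case, not a general proof.

```python
import math

def rademacher_mgf(lam):
    """E[exp(lam * X)] for X uniform on {-1, +1}, computed exactly."""
    return 0.5 * (math.exp(lam) + math.exp(-lam))  # = cosh(lam)

# The subgaussian condition E[exp(lam X)] <= exp(lam^2 sigma^2 / 2)
# holds with sigma = 1 for every lam we try.
for lam in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    assert rademacher_mgf(lam) <= math.exp(lam ** 2 / 2)
```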

Subgaussian random variables behave like Gaussian random variables, in the sense that their tail probabilities are upper bounded by those of a Gaussian with variance . To prove it, let’s first state Markov’s inequality.

Theorem 3 (Markov’s inequality)For a non-negative random variable and , we have that .

With Markov’s inequality, we can now formalize the above statement on subgaussian random variables.

*Proof:* For any , we have

Minimizing the right hand side of the inequality w.r.t. , we have the stated result.

An easy consequence of the above theorem is that the empirical average of subgaussian random variables concentrates around its expectation, *with the same asymptotic rate as in (1)*.

Corollary 5Assume that are independent, -subgaussian random variables. Then, for any , we have

where .

Equating the upper bounds on the r.h.s. of the inequalities in the Corollary to , we have the equivalent statement that, with probability at least , we have
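In code, the resulting high-probability confidence interval looks like the following sketch. The constants follow the standard subgaussian tail bound; treat them as an assumption matching the corollary above rather than its exact statement.

```python
import math

def subgaussian_ci(n, sigma=1.0, delta=0.05):
    """Half-width of the confidence interval for the empirical mean of n
    independent sigma-subgaussian variables: with probability at least
    1 - delta, the empirical mean is within
    sqrt(2 * sigma^2 * log(2 / delta) / n) of the true mean."""
    return math.sqrt(2 * sigma ** 2 * math.log(2 / delta) / n)
```

The width shrinks as 1/sqrt(n), matching the asymptotic rate suggested by the central limit theorem.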

**2. Explore-Then-Commit Algorithm **

We are now ready to present the most natural algorithm for the stochastic bandit setting, called Explore-Then-Commit (ETC) algorithm. That is, we first identify the best arm over exploration rounds and then we commit to it. This algorithm is summarized in Algorithm 2.
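A minimal sketch of ETC for losses; the Gaussian losses and the parameter names are illustrative assumptions.

```python
import random

def explore_then_commit(means, m, T, seed=0):
    """Explore-Then-Commit sketch: pull each of the K arms m times,
    then commit to the arm with the smallest empirical loss for the
    remaining T - K*m rounds. Losses are Gaussian for illustration."""
    rng = random.Random(seed)
    K = len(means)
    sums = [0.0] * K
    pulls = []
    for i in range(K):           # exploration phase
        for _ in range(m):
            sums[i] += rng.gauss(means[i], 1.0)
            pulls.append(i)
    best = min(range(K), key=lambda i: sums[i] / m)
    pulls.extend([best] * (T - K * m))   # commit phase
    return pulls
```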

In the following, we will denote by , that is the number of times that the arm was pulled in the first rounds.

Define by the expected loss of the arm with the smallest expectation, that is . Critical quantities in our analysis will be the *gaps*, for , that measure the expected difference in losses between the arms and the optimal one. In particular, we can decompose the regret as a sum over the arms of the expected number of times we pull an arm multiplied by its gap.

Lemma 6For any policy of selection of the arms, the regret is upper bounded by

*Proof:* Observe that

Hence,

The above Lemma quantifies the intuition that, in order to have a small regret, we have to select the suboptimal arms less often than the best one.

We are now ready to prove the regret guarantee of the ETC algorithm.

Theorem 7Assume that the losses of the arms minus their expectations are -subgaussian and . Then, ETC guarantees a regret of

*Proof:* Let’s assume without loss of generality that the optimal arm is the first one.

So, for , we have

From Lemma 2, we have that is -subgaussian. So, from Theorem 4, we have

The bound shows the trade-off between exploration and exploitation: if is too big, we pay too much during the exploration phase (first term in the bound). On the other hand, if is small, the probability to select a suboptimal arm increases (second term in the bound). Knowing all the gaps , it is possible to choose that minimizes the bound.

For example, in the case that , the regret is upper bounded by

that is minimized by

Remembering that must be a natural number, we can choose

When , we select . So, we have . Hence, the regret is upper bounded by

The main drawback of this algorithm is that its optimal tuning depends on the gaps . Assuming knowledge of the gaps amounts to making the stochastic bandit problem completely trivial. However, its tuned regret bound gives us a baseline against which to compare other bandit algorithms. In particular, in the next lecture we will present an algorithm that achieves the same asymptotic regret without any knowledge of the gaps.

**3. History Bits **

The ETC algorithm goes back to (Robbins, H., 1952), even if Robbins proposed what is now called epoch-greedy (Langford, J. and Zhang, T., 2008). For more history on ETC, take a look at chapter 6 in (Lattimore, T. and Szepesvári, C., 2018). The proofs presented here are from (Lattimore, T. and Szepesvári, C., 2018) as well.

* You can find all the lectures I published here.*

Last time, we saw that for Online Mirror Descent (OMD) with an entropic regularizer and learning rate it might be possible to get the regret guarantee

where . This time we will see how and we will use this guarantee to prove an almost optimal regret guarantee for Exp3, in Algorithm 1.

Remark 1While it is possible to prove (1) from first principles using the specific properties of the entropic regularizer, such a proof will not shed any light on what is actually going on. So, in the following we will instead try to prove such a regret bound in a very general way. Indeed, this general proof will allow us to easily prove the optimal bound for multi-armed bandits using OMD with the Tsallis entropy as regularizer.

Now, for a generic , consider the OMD algorithm that produces the predictions in two steps:

- Set such that .
- Set .
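For the entropic regularizer on the simplex, the two steps above take a particularly simple form. This is a sketch under that assumption: the unconstrained step is a coordinate-wise exponential update, and the Bregman (KL) projection onto the simplex reduces to a plain normalization.

```python
import math

def entropic_omd_step(p, loss, eta):
    """Two-step OMD update with the entropic regularizer: first the
    unconstrained step tilde_i = p_i * exp(-eta * loss_i), then the
    Bregman projection onto the simplex, i.e. a normalization."""
    tilde = [pi * math.exp(-eta * gi) for pi, gi in zip(p, loss)]
    z = sum(tilde)
    return tilde, [x / z for x in tilde]
```

The pre-projection point `tilde` is exactly the quantity the refined analysis below tracks explicitly.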

As we showed, under weak conditions, these two steps are equivalent to the usual OMD single-step update.

Now, the idea is to consider an alternative analysis of OMD that explicitly depends on , the new prediction before the Bregman projection step. First, let’s state the Generalized Pythagorean Theorem for Bregman divergences.

Lemma 1Let and define , then for all .

*Proof:* From the first order optimality condition of we have that . Hence, we have

The Generalized Pythagorean Theorem is often used to prove that the Bregman divergence between any point in and an arbitrary point decreases when we consider the Bregman projection onto .

We are now ready to prove our regret guarantee.

Lemma 2For the two-step OMD update above, the following regret bound holds:

where and .

*Proof:* From the update rule, we have that

where we used the three-points equality for Bregman divergences in the second equality and the Generalized Pythagorean Theorem in the first inequality. Hence, summing over time we have

So, as we did in the previous lecture, we have

where and .

Putting all together, we have the stated bound.

This time it might be easier to get a handle over . Given that we only need an upper bound, we can just take a look at and and see which one is bigger. This is easy to do: using the update rule we have

that is

Assuming , we have that implies .

Overall, we have the following improved regret guarantee for the Learning with Experts setting with positive losses.

Theorem 3Assume for and . Let and . Using OMD with the entropic regularizer defined as , learning rate , and gives the following regret guarantee

Armed with this new tool, we can now turn to the multi-armed bandit problem again.

Let’s now consider the OMD with entropic regularizer, learning rate , and set equal to the stochastic estimate of , as in Algorithm 1. Applying Theorem 3 and taking expectation, we have

Now, focusing on the terms , we have

So, setting , we have

Remark 2The need for a different analysis of OMD is due to the fact that we want an easy way to upper bound the Hessian. Indeed, in this analysis comes before the normalization into a probability distribution, which simplifies the analysis a lot. The same idea will be used for the Tsallis entropy in the next section.

So, with a tighter analysis we showed that, even without an explicit exploration term, OMD with entropic regularizer solves the multi-armed bandit problem paying only a factor more than the full information case. However, this is still not the optimal regret!

In the next section, we will see that changing the regularizer, *with the same analysis*, will remove the term in the regret.

**1. Optimal Regret Using OMD with Tsallis Entropy **

In this section, we present the Implicitly Normalized Forecaster (INF), also known as OMD with the Tsallis entropy, for the multi-armed bandit problem.

Define as , where and in we extend the function by continuity. This is the negative **Tsallis entropy** of the vector . This is a strict generalization of the Shannon entropy, because when goes to 1, converges to the negative (Shannon) entropy of .

We will instantiate OMD with this regularizer for the multi-armed problem, as in Algorithm 2.

Note that and .

We will not use any interpretation of this regularizer from the information theory point of view. As we will see in the following, the only reason to choose it is its Hessian. In fact, the Hessian of this regularizer is still diagonal and it is equal to

Now, we can use again the modified analysis for OMD in Lemma 2. So, for any , we obtain

where and .

As we did for Exp3, we now need an upper bound on the . From the update rule and the definition of , we have

that is

So, if , , that implies that .

Hence, putting all together, we have

We can now specialize the above reasoning, considering in the Tsallis entropy, to obtain the following theorem.

Theorem 4Assume . Set and . Then, Algorithm 2

*Proof:* We only need to calculate the terms

Proceeding as in (2), we obtain

Choosing , we finally obtain an expected regret of , that can be proved to be the optimal one.

One last thing remains: how do we compute the predictions of this algorithm? In each step, we have to solve a constrained optimization problem. So, we can write the corresponding Lagrangian:

From the KKT conditions, we have

and we also know that . So, we have a 1-dimensional problem in that must be solved in each round.
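The 1-dimensional search can be carried out by bisection. The sketch below assumes the α = 1/2 Tsallis entropy, for which the stationarity condition gives weights of the form x_i = 1/(η L_i + λ)²; the exact expression depends on the chosen normalization of the regularizer, so treat the formulas as illustrative.

```python
import math

def tsallis_weights(L, eta, iters=200):
    """Solve the 1-d normalization problem for OMD/FTRL with the
    alpha = 1/2 Tsallis entropy: bisect on the Lagrange multiplier lam
    so that the weights x_i = 1 / (eta * L_i + lam)^2 sum to 1."""
    K = len(L)
    lo = -eta * min(L) + 1e-12          # here the sum of weights blows up
    hi = -eta * min(L) + math.sqrt(K)   # here the sum of weights is <= 1
    for _ in range(iters):
        lam = (lo + hi) / 2
        s = sum(1.0 / (eta * l + lam) ** 2 for l in L)
        if s > 1.0:
            lo = lam
        else:
            hi = lam
    lam = (lo + hi) / 2
    return [1.0 / (eta * l + lam) ** 2 for l in L]
```

Arms with smaller cumulative estimated loss receive larger weights, and the bisection converges to machine precision in a few hundred iterations.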

**2. History Bits **

The INF algorithm was proposed by (Audibert, J.-Y. and Bubeck, S., 2009) and re-casted as an OMD procedure in (Audibert, J.-Y. and Bubeck, S. and Lugosi, G., 2011). The connection with the Tsallis entropy was done in (Abernethy, J. D. and Lee, C. and Tewari, A., 2015). The specific proof presented here is new and it builds on the proof by (Abernethy, J. D. and Lee, C. and Tewari, A., 2015). Note that (Abernethy, J. D. and Lee, C. and Tewari, A., 2015) proved the same regret bound for a Follow-The-Regularized-Leader procedure over the stochastic estimates of the losses (that they call Gradient-Based Prediction Algorithm), while here we proved it using a OMD procedure.

**3. Exercises **

Exercise 1Prove that in the modified proof of OMD, the terms can be upper bounded by .

Exercise 2Building on the previous exercise, prove that regret bounds of the same order can be obtained for Exp3 and for the INF/OMD with Tsallis entropy directly upper bounding the terms , without passing through the Bregman divergences.


* You can find the lectures I published till now here.*

Today, we will present the problem of multi-armed bandit in the adversarial setting and show how to obtain sublinear regret.

**1. Multi-Armed Bandit **

This setting is similar to the Learning with Expert Advice (LEA) setting: in each round, we select one expert and, unlike in the full-information setting, we only observe the loss of that expert . The aim is still to compete with the cumulative loss of the best expert in hindsight.

As in the learning with expert case, we need randomization in order to have a sublinear regret. Indeed, this is just a harder problem than LEA. However, we will assume that the adversary is **oblivious**, that is, he decides the losses of all the rounds before the game starts, but with the knowledge of the online algorithm. This makes the losses deterministic quantities and it avoids the inadequacy in our definition of regret when the adversary is adaptive (see (Arora, R. and Dekel, O. and Tewari, A., 2012)).

This kind of problems where we don’t receive the full-information, i.e., we don’t observe the loss vector, are called **bandit problems**. The name comes from the problem of a gambler who plays a pool of slot machines, that can be called “one-armed bandits”. On each round, the gambler places his bet on a slot machine and his goal is to win almost as much money as if he had known in advance which slot machine would return the maximal total reward.

In this problem, we clearly have an *exploration-exploitation trade-off*. In fact, on one hand we would like to play at the slot machine which, based on previous rounds, we believe will give us the biggest win. On the other hand, we have to explore the slot machines to find the best ones. On each round, we have to solve this trade-off.

Given that we don’t completely observe the loss, we cannot use our two frameworks: Online Mirror Descent (OMD) and Follow-The-Regularized-Leader (FTRL) both need the loss functions, or at least lower bounds on them.

One way to solve this issue is to construct *stochastic estimates* of the unknown losses. This is a natural choice given that we already know that the prediction strategy has to be a randomized one. So, in each round we construct a probability distribution over the arms and we sample one action according to this probability distribution. Then, we only observe the coordinate of the loss vector . One possibility to have a stochastic estimate of the losses is to use an *importance-weighted estimator*: Construct the estimator of the unknown vector in the following way:

Note that this estimator has all the coordinates equal to 0, except the coordinate corresponding to the arm that was pulled.

This estimator is unbiased, that is . To see why, note that and . Hence, for , we have
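A quick Monte Carlo sketch of this estimator and its unbiasedness; the loss vector and the sampling distribution are arbitrary made-up numbers.

```python
import random

def iw_estimate(losses, p, rng):
    """One round of the importance-weighted estimator: sample arm A ~ p,
    observe only losses[A], and return the vector that equals
    losses[A] / p[A] on coordinate A and zero elsewhere."""
    u, acc = rng.random(), 0.0
    A = len(p) - 1  # guard against floating-point rounding in acc
    for i, pi in enumerate(p):
        acc += pi
        if u <= acc:
            A = i
            break
    est = [0.0] * len(p)
    est[A] = losses[A] / p[A]
    return est

# Averaging many i.i.d. rounds recovers the full loss vector,
# illustrating unbiasedness despite observing one coordinate per round.
rng = random.Random(0)
losses, p = [0.2, 0.8, 0.5], [0.5, 0.3, 0.2]
n = 200000
avg = [0.0] * 3
for _ in range(n):
    e = iw_estimate(losses, p, rng)
    avg = [a + x / n for a, x in zip(avg, e)]
```

Note how the coordinates with small probability are estimated with larger variance, which is exactly the issue the analysis below runs into.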

Let’s also calculate the (uncentered) variance of the coordinates of this estimator. We have

We can now think of using OMD with an entropic regularizer and the estimated losses. Hence, assume and set defined as , that is the unnormalized negative entropy. Also, set . Using the OMD analysis, we have

We can now take the expectation at both sides and get

We are now in trouble, because the terms in the sum scale as . So, we need a way to control the smallest probability over the arms.

One way to do it is to take a convex combination of and a uniform probability. That is, we can predict with , where will be chosen in the following. So, can be seen as the minimum amount of exploration we require of the algorithm. Its value will be chosen by the regret analysis to optimally trade off exploration and exploitation. The resulting algorithm is in Algorithm 1.

The same probability distribution is used in the estimator:

We can have that . However, we pay a price in the bias introduced:

Observing that , we have

Putting together the last inequality and the upper bound to the expected regret in (2), we have

Setting and , we obtain a regret of .

This is much worse than the regret of the full-information case. However, while it is expected that the bandit case must be more difficult than the full-information one, it turns out that this is not the optimal strategy.

**2. Exponential-weight algorithm for Exploration and Exploitation: Exp3 **

It turns out that the algorithm above actually works, even without the mixing with the uniform distribution! We were just too loose in our regret guarantee. So, we will analyse the following algorithm, that is called Exponential-weight algorithm for Exploration and Exploitation (Exp3), that is nothing else than OMD with entropic regularizer and stochastic estimates of the losses. Note that now we will assume that .
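Here is a minimal sketch of Exp3 as just described: OMD with the entropic regularizer on importance-weighted loss estimates, with no explicit exploration term. The loss matrix and learning rate are illustrative assumptions.

```python
import math
import random

def exp3(loss_matrix, eta, seed=0):
    """Exp3 sketch: multiplicative-weights (entropic OMD) update on
    importance-weighted loss estimates. loss_matrix[t][i] is the
    (oblivious) loss of arm i at round t, assumed in [0, 1]."""
    rng = random.Random(seed)
    K = len(loss_matrix[0])
    w = [1.0] * K
    total_loss = 0.0
    for row in loss_matrix:
        z = sum(w)
        p = [x / z for x in w]
        A = rng.choices(range(K), weights=p)[0]  # sample arm A ~ p
        total_loss += row[A]
        est = row[A] / p[A]               # importance-weighted estimate
        w[A] *= math.exp(-eta * est)      # update only the pulled arm
    return total_loss
```

On a toy instance where one arm is always better, the algorithm quickly concentrates on it even without the uniform mixing.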

Let’s take another look at the regret guarantee we have. From the OMD analysis, we have the following one-step inequality that holds for any

Let’s now focus on the term . We said that for a twice differentiable function , there exists such that , where . Hence, there exists such that and

So, assuming the Hessian in to be positive definite, we can bound the last two terms in the one-step inequality of OMD as

where we used Fenchel-Young inequality with the function and and .

When we use the strong convexity, we are upper bounding the terms in the sum with the inverse of the smallest eigenvalue of the Hessian of the regularizer. However, we can do better if we consider the actual Hessian. In fact, in the coordinates where is small, we have a smaller growth of the divergence. This can also be seen graphically in Figure 1. Indeed, for the entropic regularizer, we have that the Hessian is a diagonal matrix. This expression of the Hessian gives a regret of

where and . Note that for any is in the simplex, so this upper bound is always better than

that we derived just using the strong convexity of the entropic regularizer.

However, we don’t know the exact value of , but only that it is on the line segment between and . Yet, if we could say that , in the bandit case we would obtain an expected regret guarantee of , greatly improving the bound we proved above!

In the next lecture, we will see an alternative way to analyze OMD that will give us exactly this kind of guarantee for Exp3, and will also give us the optimal regret guarantee using the Tsallis entropy in a few lines of proof.

**3. History Bits **

The algorithm in Algorithm 1 is from (Cesa-Bianchi, N. and Lugosi, G. , 2006, Theorem 6.9). The Exp3 algorithm was proposed in (Auer, P. and Cesa-Bianchi, N. and Freund, Y. and Schapire, R. E., 2002).
