**1. Implicit Updates**

Let’s consider the two most commonly used frameworks for online learning: Online Mirror Descent (OMD) and Follow-The-Regularized-Leader (FTRL).

We already explained that in OMD we update the iterate as the minimizer of a linear approximation of the last loss function we received plus a term that measures the distance from the previous iterate:

$$x_{t+1} = \mathop{\mathrm{argmin}}_{x \in V} \ \langle g_t, x \rangle + \frac{1}{\eta_t} B_\psi(x; x_t),$$

where $g_t \in \partial \ell_t(x_t)$.

On the other hand, in FTRL we have two possibilities: we can minimize the regularized sum of the losses we have received till now *or* the regularized sum of the *linear approximations* of the losses. In the first case, we update with

$$x_{t+1} = \mathop{\mathrm{argmin}}_{x \in V} \ \psi_{t+1}(x) + \sum_{i=1}^t \ell_i(x),$$

while in the second case, we use

$$x_{t+1} = \mathop{\mathrm{argmin}}_{x \in V} \ \psi_{t+1}(x) + \sum_{i=1}^t \langle g_i, x \rangle,$$

where $g_i \in \partial \ell_i(x_i)$ for $i = 1, \dots, t$. The second update is what optimization people call Dual Averaging. We also saw that under some reasonable conditions, the two updates of FTRL have the same regret guarantee. However, we would expect the first approach, the one using the exact loss functions, to perform much better in practice. Also, with the linearized losses we have to assume knowledge of some characteristics of the losses, e.g., the strong convexity parameter, to achieve the same regret as the full-losses FTRL.

Overall, we have two different frameworks and two different ways to use the loss functions in the update. So, it should be obvious that there is at least another possibility, that is *OMD with exact losses*. That is, we would like to consider the update

$$x_{t+1} = \mathop{\mathrm{argmin}}_{x \in V} \ \ell_t(x) + \frac{1}{\eta_t} B_\psi(x; x_t). \qquad (1)$$

As in the FTRL case, we would expect this update to be better than the linearized one, at least empirically.

To gain some more intuition, let’s consider the simple case that $V = \mathbb{R}^d$, $\psi(x) = \frac{1}{2}\|x\|_2^2$, and the losses are differentiable. In this case, we have that the linearized OMD update becomes

$$x_{t+1} = x_t - \eta_t \nabla \ell_t(x_t).$$

For example, with square loss and linear predictors over couples input/label $(z_t, y_t)$, we have $\ell_t(x) = \frac{1}{2}(\langle z_t, x \rangle - y_t)^2$ and the update becomes

$$x_{t+1} = x_t - \eta_t (\langle z_t, x_t \rangle - y_t)\, z_t.$$

On the other hand, the update of OMD with the exact loss function becomes

$$x_{t+1} = \mathop{\mathrm{argmin}}_{x} \ \frac{1}{2}(\langle z_t, x \rangle - y_t)^2 + \frac{1}{2\eta_t}\|x - x_t\|_2^2.$$

The optimality condition tells us that $x_{t+1}$ satisfies

$$(\langle z_t, x_{t+1} \rangle - y_t)\, z_t + \frac{1}{\eta_t}(x_{t+1} - x_t) = 0,$$

that is

$$x_{t+1} = x_t - \eta_t (\langle z_t, x_{t+1} \rangle - y_t)\, z_t.$$

So, the update is not in a closed form anymore, but it has an *implicit* form, where $x_{t+1}$ appears on both sides of the equality. This is exactly the reason why the updates of OMD with exact losses are known in the online learning literature as *implicit updates*. So, we will call the update in (1) *Implicit OMD*.

Remark 1. Observe that for linear losses OMD and the Implicit OMD are equivalent.

In general, calculating the update of Implicit OMD can be an annoying optimization problem. However, in some cases the Implicit OMD update can still be solved in a closed form.

Example 1. Consider again linear regression with the square loss. The update of Implicit OMD becomes

$$x_{t+1} = x_t - \eta_t (\langle z_t, x_{t+1} \rangle - y_t)\, z_t. \qquad (3)$$

To solve the equation, we take the inner product of both sides with $z_t$, to obtain

$$\langle z_t, x_{t+1} \rangle = \langle z_t, x_t \rangle - \eta_t (\langle z_t, x_{t+1} \rangle - y_t) \|z_t\|_2^2,$$

that is

$$\langle z_t, x_{t+1} \rangle = \frac{\langle z_t, x_t \rangle + \eta_t y_t \|z_t\|_2^2}{1 + \eta_t \|z_t\|_2^2}.$$

Substituting this expression in (3), we have

$$x_{t+1} = x_t - \frac{\eta_t (\langle z_t, x_t \rangle - y_t)}{1 + \eta_t \|z_t\|_2^2}\, z_t.$$
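Since the post has no accompanying code, here is a small Python sketch (names mine) of the two updates for the square loss: the linearized OMD step, and the Implicit OMD step in closed form.

```python
# Sketch (my notation) of the two updates for the square loss
# ell_t(x) = 0.5 * (<z, x> - y)^2 with the squared-L2 Bregman divergence.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def linearized_omd_step(x, z, y, eta):
    # x_{t+1} = x_t - eta * (<z, x_t> - y) * z
    r = dot(z, x) - y
    return [xi - eta * r * zi for xi, zi in zip(x, z)]

def implicit_omd_step(x, z, y, eta):
    # closed form of argmin_u 0.5*(<z,u> - y)^2 + ||u - x||^2 / (2*eta):
    # x_{t+1} = x_t - eta * (<z, x_t> - y) / (1 + eta * ||z||^2) * z
    r = dot(z, x) - y
    scale = eta / (1.0 + eta * dot(z, z))
    return [xi - scale * r * zi for xi, zi in zip(x, z)]
```

One can check numerically that the output of `implicit_omd_step` satisfies the implicit fixed-point equation, with the residual evaluated at $x_{t+1}$ instead of $x_t$.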

**Implicit Updates are Always Descending** Till now, we have motivated implicit updates purely from an intuitive point of view: we expect this algorithm to be better because we do not approximate the loss functions. Indeed, we often gain a lot in performance by switching to implicit updates. However, we can even prove that implicit updates have interesting theoretical properties.

First, contrary to OGD, implicit updates remain “sane” even when the learning rate goes to infinity. Indeed, taking $\eta_t$ to infinity in (1), $x_{t+1}$ becomes simply the minimizer of the last loss function. On the other hand, in OMD, when the learning rate goes to infinity we can take a step that is arbitrarily far from the minimizer of the function!
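As a toy illustration (the example is mine, not from the post), take the one-dimensional loss $\ell(x) = \frac{1}{2}(x - 3)^2$ with minimizer at 3: with a huge learning rate, the gradient step overshoots wildly, while the implicit step lands close to the minimizer.

```python
# Toy 1-D illustration (my own example): loss ell(x) = 0.5 * (x - 3)^2,
# minimizer at x* = 3, starting point x0 = 0.

def gd_step(x, eta):
    # plain (linearized) gradient step: x - eta * ell'(x)
    return x - eta * (x - 3.0)

def implicit_step(x, eta):
    # argmin_u 0.5*(u - 3)^2 + (1/(2*eta))*(u - x)^2;
    # first-order condition: (u - 3) + (u - x)/eta = 0
    return (3.0 * eta + x) / (eta + 1.0)

x0 = 0.0
print(gd_step(x0, 100.0))       # 300.0: overshoots wildly
print(implicit_step(x0, 100.0)) # ~2.97: approaches the minimizer 3
```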

When we consider non-differentiable convex functions, there is another important difference between implicit updates and subgradient descent updates. We already saw that the subgradient might not point in a descent direction. That is, no matter how we choose the learning rate $\eta_t$, the value of the function in $x_t - \eta_t g_t$ might be *higher* than in $x_t$. On the other hand, this cannot happen with implicit updates, no matter how we choose the learning rate: the optimality of $x_{t+1}$ immediately gives $\ell_t(x_{t+1}) + \frac{1}{\eta_t} B_\psi(x_{t+1}; x_t) \le \ell_t(x_t)$, hence $\ell_t(x_{t+1}) \le \ell_t(x_t)$.
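A one-dimensional sketch (my example) with the non-differentiable loss $f(x) = |x|$: the subgradient step can overshoot and increase the loss, while the implicit step, which here is just soft-thresholding, never does.

```python
# Non-differentiable example (mine, not from the post): f(x) = |x|.
# A subgradient step can increase f; the implicit (proximal) step never does.

def subgrad_step(x, eta):
    g = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)  # a subgradient of |x|
    return x - eta * g

def implicit_step(x, eta):
    # argmin_u |u| + (1/(2*eta)) * (u - x)^2  ->  soft-thresholding
    if x > eta:
        return x - eta
    if x < -eta:
        return x + eta
    return 0.0

x = 0.1
print(abs(subgrad_step(x, 1.0)))   # 0.9 > 0.1: the loss went *up*
print(abs(implicit_step(x, 1.0)))  # 0.0 <= 0.1: never increases
```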

**Proximal updates** Are implicit updates actually an invention of online learning people? Actually, no. Indeed, this kind of update was known at least since 1965 (!) and it was proposed for the (offline) optimization of functions with the name of *proximal updates*. Basically, we have a function $f$ and we minimize it iteratively with the update

$$x_{t+1} = \mathop{\mathrm{argmin}}_{x} \ f(x) + \frac{1}{2\eta_t}\|x - x_t\|_2^2, \qquad (4)$$

starting from an initial point $x_1$. At first sight, such an update might seem pointless in offline optimization: being able to solve (4) implies being also able to find the minimizer of $f$ in one step! However, as we have previously discussed, this kind of update finds an application when the function is composed of two parts and we decide to linearize only one part.

**2. Passive-Aggressive**

Now, let me show you that implicit updates were actually used a lot in the online literature, even if many people did not realize it.

Let’s take a look at a very famous online learning algorithm: the Passive-Aggressive (PA) algorithm. PA was a major success in online learning: 2099 citations and counting, that is huge for the online learning area. The theory was not very strong, but the performance of these algorithms was way better than anything else we had at that time. Let’s see how the PA algorithm works.

The PA algorithm was introduced before the Online Convex Optimization (OCO) framework was proposed. So, at that time, online learning for classification and regression focused on the particular case in which the loss functions have the form $\ell_t(x) = \ell(\langle z_t, x \rangle, y_t)$, basically the loss of linear predictors over couples input/label $(z_t, y_t)$. The PA algorithm in particular focused on losses that can be zero over intervals, like the hinge loss, the squared hinge loss, the $\epsilon$-insensitive loss, and the squared $\epsilon$-insensitive loss. For these losses, the update they proposed was

$$x_{t+1} = \mathop{\mathrm{argmin}}_{x} \ C\, \ell_t(x) + \frac{1}{2}\|x - x_t\|_2^2,$$

where $C > 0$ is a hyperparameter, playing the role of the learning rate. Now, this is exactly the Implicit OMD update with the special case of the squared L2 Bregman divergence! The choice of the loss functions makes this update always available in a closed form. So, for example, for the hinge loss $\ell_t(x) = \max(1 - y_t \langle z_t, x \rangle, 0)$ and linear predictors, we have

$$x_{t+1} = \mathop{\mathrm{argmin}}_{x} \ C \max(1 - y_t \langle z_t, x \rangle, 0) + \frac{1}{2}\|x - x_t\|_2^2 = x_t + \min\!\left(C, \frac{\max(1 - y_t \langle z_t, x_t \rangle, 0)}{\|z_t\|_2^2}\right) y_t z_t, \qquad (5)$$

where the second equality is calculated using the optimality condition and it is left as an exercise to the reader.
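To make the closed form concrete, here is a Python sketch of a PA-style hinge-loss step (names, data, and the cap on the step size follow the PA-I variant; this is my own illustration, not code from the paper).

```python
# Sketch of a PA-I-style update for the hinge loss (my names):
# passive when the hinge loss is zero, aggressive otherwise.

def pa_update(w, z, y, C):
    margin = y * sum(wi * zi for wi, zi in zip(w, z))
    loss = max(1.0 - margin, 0.0)
    if loss == 0.0:
        return w  # passive: no update when the margin is already >= 1
    tau = min(C, loss / sum(zi * zi for zi in z))  # capped step size
    return [wi + tau * y * zi for wi, zi in zip(w, z)]
```

With a large `C`, a single update moves the predictor exactly onto the margin, i.e., the hinge loss on the current sample becomes zero.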

So, the huge boost in performance of PA over other online algorithms is due *uniquely* to the implicit updates.

**3. Implicit Updates on Truncated Linear Models: aProx**

There is another first-order optimization algorithm inspired by implicit updates. As we said, implicit updates are rarely available in a closed form. So, we can try to approximate the implicit updates in some way. One possibility is to use the implicit update on a *surrogate loss function*. Indeed, when we use a linear approximation we recover plain OMD. Instead, when we use the exact function we get the implicit updates. What can we use in between the two cases? We could think of using a *truncated linear model*. That is, in the case in which we know that the functions $f_t$ are lower bounded by some $\ell^\star$, we define

$$\hat{f}_t(x) = \max\left(f_t(x_t) + \langle g_t, x - x_t \rangle,\ \ell^\star\right)$$

for any $g_t \in \partial f_t(x_t)$. Note that this is a lower bound to the loss function and it is piecewise linear.

Now, we can use these surrogate functions in the implicit OMD:

$$x_{t+1} = \mathop{\mathrm{argmin}}_{x} \ \hat{f}_t(x) + \frac{1}{2\eta_t}\|x - x_t\|_2^2.$$

Implicit OMD with truncated linear models and the squared L2 Bregman is called aProx (Asi and Duchi, 2019).

Considering $\psi(x) = \frac{1}{2}\|x\|_2^2$ and the truncated linear model $\hat{f}_t$, we again have

$$x_{t+1} = \mathop{\mathrm{argmin}}_{x} \ \max\left(f_t(x_t) + \langle g_t, x - x_t \rangle,\ \ell^\star\right) + \frac{1}{2\eta_t}\|x - x_t\|_2^2,$$

where $g_t$ is a specific vector in $\partial f_t(x_t)$. Now, we have 2 possibilities: $x_{t+1}$ is in the linear part or in the flat part of $\hat{f}_t$. Indeed, it should be easy to see that the proximal update assures us that we cannot miss the corner and land on the flat part. So, if we are in the linear part, then $x_{t+1} = x_t - \eta_t g_t$. Instead, if we are in the corner we have $x_{t+1} = x_t - \tau_t g_t$, where $0 \le \tau_t \le \eta_t$. Hence, we always have

$$x_{t+1} = x_t - \tau_t g_t$$

and we only need to find $\tau_t$. Substituting this expression in the update and using the first-order optimality condition, we can verify that the following is the closed formula of the update (left as an exercise):

$$x_{t+1} = x_t - \min\!\left(\eta_t, \frac{f_t(x_t) - \ell^\star}{\|g_t\|_2^2}\right) g_t.$$

The similarity between this update and the one of PA in (5) should be evident: it is due to the similarity between the truncated linear model and the hinge loss. Indeed, running aProx on linear classifiers with the hinge loss is exactly the PA algorithm.
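Here is a minimal Python sketch of the resulting aProx step (my names), assuming we know a lower bound `lower` on the loss, e.g., 0 for non-negative losses.

```python
# Sketch of the aProx step: proximal update on the truncated linear model
# max(f(x_t) + <g, x - x_t>, lower), which is available in closed form.

def aprox_step(x, grad, f_x, lower, eta):
    g_sq = sum(gi * gi for gi in grad)
    if g_sq == 0.0:
        return x  # zero gradient: the model is flat, nothing to do
    tau = min(eta, (f_x - lower) / g_sq)  # truncation caps the step size
    return [xi - tau * gi for xi, gi in zip(x, grad)]
```

When the learning rate is small, the step coincides with the plain subgradient step; when it is large, the step is truncated so that the linear model never goes below `lower`.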

**4. More Updates Similar to the Implicit Updates**

From an empirical point of view, we can gain a lot of performance using implicit updates, even just approximating them. So, it should not be surprising that people proposed and used similar ideas in many optimization algorithms. Let me give you some examples.

The default optimization algorithm in the machine learning library Vowpal Wabbit (VW) uses the Importance Weight Aware Updates (Karampatziakis and Langford, 2011). These updates essentially approximate the implicit update using a differential equation that for linear models can be calculated in a closed formula. So, if you ever used VW, you already used a close relative of implicit updates, probably without knowing it.

Another interesting example is the setting of adaptive filtering, where one wants to minimize $\mathbb{E}[(\langle z, x \rangle - y)^2]$. In this setting, a classic algorithm is the Least Mean Squares (LMS) algorithm, which corresponds to Online Gradient Descent with linear models and squared loss. Now, a known better version of the LMS is the normalized LMS, which is nothing else than Implicit OMD with linear models and squared loss.

There are even interpretations of the Nesterov’s accelerated gradient method as an implicit update on a curved space (Defazio, 2019).

So, I am sure there are even more examples of implicit updates hiding in other well-known algorithms.

**5. Regret Guarantee for Implicit Updates**

From the above reasoning, it seems very intuitive to expect a better regret bound for implicit updates. However, it turns out particularly challenging to prove a *quantifiable* advantage of implicit updates over OMD ones in the adversarial setting.

Here, I show a very recent result of mine on Implicit OMD that for the first time shows a clear advantage of Implicit OMD in some situations.

First, we can show the following theorem.

Theorem 1. Assume a constant learning rate $\eta_t = \eta > 0$. Then, Implicit OMD guarantees

Moreover, assume the distance generating function $\psi$ to be 1-strongly convex w.r.t. $\|\cdot\|$. Then, there exist subgradients $g'_t \in \partial \ell_t(x_{t+1})$ such that we have

*Proof:* To obtain this bound, we proceed in a slightly different way than in the classic OMD proof. In particular, for any $u \in V$ we have

where in the second inequality we have used the optimality condition of the update. Adding to both sides of the inequality, dividing by $\eta$, and reordering, we have

Summing over time, we get the first bound.

For the second bound, let’s now focus on the terms $\ell_t(x_t) - \ell_t(x_{t+1})$ and upper bound them in two different ways. First, using the convexity of the losses, we can bound the difference between $\ell_t(x_t)$ and $\ell_t(x_{t+1})$:

where $g'_t \in \partial \ell_t(x_{t+1})$. Also, from the strong convexity of $\psi$, we have

Hence, putting everything together, we have

where in the last inequality we used the elementary inequality $\langle a, b \rangle \le \frac{1}{2}\|a\|_\star^2 + \frac{1}{2}\|b\|^2$.

From the optimality condition of the implicit OMD update, we know that there exists $g'_t \in \partial \ell_t(x_{t+1})$ such that

Hence, we have

where we used the convexity of the Bregman divergence in its first argument in the second inequality and the optimality condition of the update in the third inequality. This chain of inequalities implies the inequality that gives the second bound in the minimum.

The theorem shows a *possible* and *small* improvement over the OMD regret bound. In particular, there might be sequences of losses where the negative terms in the bound are large. The fact that the improvement is only possible on some sequences is to be expected: the OMD regret bound is worst-case optimal on bounded domains, so there is not much to gain. However, maybe we could expect a larger gain on some particular sequences of functions. Indeed, we can show that on some sequences of losses we can achieve *constant* regret! Let’s see how.

From the regret above, we have

Denoting by $V_T = \sum_{t=2}^T \max_{x \in V} \left(\ell_t(x) - \ell_{t-1}(x)\right)$ the *temporal variability* of the losses, we have that the regret guarantee is

Now, in the case that the loss functions are all the same, the temporal variability is zero and the regret upper bound becomes a *constant* independent of the number of rounds $T$. It is worth reminding that constant regret is the best we can hope for in online convex optimization! In other words, when online learning becomes as easy as offline learning (i.e., all the losses are equal), implicit updates give us a provably large boost.

However, there is a caveat: in order to get the optimal regret in the general case, we need a small learning rate, while to get a constant regret when all the losses are equal, we need the learning rate to go to infinity. The problem in online learning is that we do not know the future, so we need some *adaptive* strategy that changes the learning rate in a dynamic way. This is indeed possible and we leave it as an exercise, see below.

Our last observation is that we can recover the constant regret bound even for FTRL when used on the exact losses. Again, this is due to the use of the exact losses rather than the linear approximation. Remember that FTRL predicts with $x_{t+1} = \mathop{\mathrm{argmin}}_{x} \ \psi_{t+1}(x) + \sum_{i=1}^t \ell_i(x)$, where $\psi_{t+1}$ is the regularizer. Hence, from the FTRL regret equality and assuming a non-decreasing regularizer, we have

However, FTRL with exact losses requires to solve a finite sum optimization problem whose size grows with the number of iterations. Instead, Implicit OMD uses only one loss in each round, resulting in a closed formula in a number of interesting cases. We also note that we would have the same tuning problem as before: in order to get a constant regret when all the losses are equal, we would need the regularizer to be constant and independent of time, while it should grow over time in the general case.

**6. History Bits**

The implicit updates in online learning were proposed for the first time by (Kivinen and Warmuth, 1997). However, such an update with the Euclidean divergence is the Proximal update in the optimization literature, dating back at least to 1965 (Moreau, 1965), (Martinet, 1970), (Rockafellar, 1976), (Parikh and Boyd, 2014), and more recently used even in the stochastic setting (Toulis and Airoldi, 2017), (Asi and Duchi, 2019).

The PA algorithms were proposed in (Crammer et al., 2006), but the connection with implicit updates was absent in the paper. I am not sure who first realized the connection: I realized it in 2011 and I showed it to Joseph Keshet (one of the authors of PA), who encouraged me to publish it somewhere. Only 10 years later, I am doing it. Note that the mistake bound proved in the PA paper is worse than the Perceptron bound. Later, we proved a mistake bound for PA that is strictly better than the classic Perceptron’s bound (Jie et al., 2010).

The very nice idea of truncated linear models was proposed by (Asi and Duchi, 2019) as a way to approximate proximal updates while retaining closed-form updates.

The connection between implicit OMD and normalized LMS was shown by (Kivinen et al., 2006).

(Kulis and Bartlett, 2010) provide the first regret bounds for implicit updates that match those of OMD, while (McMahan, 2010) makes the first attempt to quantify the advantage of the implicit updates in the regret bound. Finally, (Song et al., 2018) generalize the results in (McMahan, 2010) to Bregman divergences and strongly convex functions, and quantify the gain differently in the regret bound. Note that in (McMahan, 2010) and (Song et al., 2018) the gain cannot be exactly quantified: they provide just a non-negative data-dependent quantity subtracted from the regret bound. The connection between temporal variability and implicit updates was shown in (Campolongo and Orabona, 2020), together with a matching lower bound.

**7. Acknowledgements**

Thanks to Nicolò Campolongo for feedback on a draft of this post.

**8. Exercises**

Exercise 1. Prove that the update of PA given above is correct.

Exercise 2. Prove that the update of aProx given above is correct.


Exercise 3. Find a learning rate strategy to adapt to the temporal variability of the losses without knowing it (Campolongo and Orabona, 2020).

There is a popular interpretation of the Perceptron as a stochastic (sub)gradient descent procedure. I even found slides online with this idea. The thought of so many young minds twisted by these false claims was too much to bear. So, I felt compelled to write a blog post to explain why this is wrong…

Moreover, I will also give a different and (I think) much better interpretation of the Perceptron algorithm.

**1. Perceptron Algorithm**

The Perceptron algorithm was introduced by Rosenblatt in 1958. To be more precise, he introduced a family of algorithms characterized by a certain architecture. Also, he considered what we call now supervised and unsupervised training procedures. However, nowadays when we talk about the Perceptron we intend the following algorithm:

In the algorithm, the couples for , with and , represent a set of input/output pairs that we want to learn to classify correctly in the two categories and . We assume that there exists an unknown vector the correctly classify all the samples, that is . Note that any scaling of by a positive constant still correctly classify all the samples, so there are infinite solutions. The aim of the Perceptron is to find any of these solutions.
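Here is a minimal Python sketch of the Perceptron algorithm on a toy linearly separable set (data and names are mine):

```python
# Minimal Perceptron sketch on a toy linearly separable set (my example).

def perceptron(samples, epochs=100):
    d = len(samples[0][0])
    w = [0.0] * d
    for _ in range(epochs):
        mistakes = 0
        for z, y in samples:
            if y * sum(wi * zi for wi, zi in zip(w, z)) <= 0:
                w = [wi + y * zi for wi, zi in zip(w, z)]  # update on mistakes only
                mistakes += 1
        if mistakes == 0:  # all samples correctly classified: feasibility reached
            break
    return w

data = [([1.0, 1.0], 1), ([2.0, 0.5], 1), ([-1.0, -0.5], -1), ([-0.5, -2.0], -1)]
w = perceptron(data)
assert all(y * sum(wi * zi for wi, zi in zip(w, z)) > 0 for z, y in data)
```

Note that the algorithm stops as soon as it finds *any* separator, matching the feasibility-problem view discussed below.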

From an optimization point of view, this is called a *feasibility problem*, that is, something like

$$\text{find} \ x \in S,$$

where $S$ is some set. Feasibility problems are an essential step in constrained optimization for algorithms that require a feasible initial point. Feasibility problems are not optimization problems, even if in some cases they can be solved with an optimization formulation.

In the Perceptron case, we can restate the problem as

$$\text{find} \ w \ \text{such that} \ y_t \langle w, z_t \rangle \ge 1, \quad t = 1, \dots, n,$$

where the “1” on the r.h.s. is clearly arbitrary and it can be changed through a rescaling of $w$. So, in optimization language, the Perceptron algorithm is nothing else than an iterative procedure to solve the above feasibility problem.

**2. Issues with the SGD Interpretation**

As said above, sometimes people refer to the Perceptron as a stochastic (sub)gradient descent algorithm on the objective function

$$F(w) = \frac{1}{n} \sum_{i=1}^n \max(-y_i \langle w, z_i \rangle, 0). \qquad (1)$$

I think there are many problems with this idea; let me list some of them.

- First of all, the above interpretation assumes that we take the samples randomly from the training set. However, this is not needed in the Perceptron and it was not needed in the first proofs of the Perceptron convergence (Novikoff, 1963). There is a tendency to call anything that receives one sample at a time “stochastic”, but “arbitrary order” and “stochastic” are clearly not the same.
- The Perceptron is typically initialized with $w_1 = 0$. Now, we have two problems. The first one is that with a black-box first-order oracle, we would get a subgradient of $\max(-y_i \langle w, z_i \rangle, 0)$ at $w_1 = 0$, where $i$ is drawn uniformly at random in $\{1, \dots, n\}$. A possible subgradient for any $i$ is $0$. This means that SGD would not update. Instead, the Perceptron in this case does update. So, we are forced to consider a different model than the black-box one. Changing the oracle model is a minor problem, but this fact hints at another very big issue.
- The biggest issue is that $w = 0$ is a global optimum of the objective function! So, there is nothing to minimize, we are already done in the first iteration. However, from a classification point of view, this solution seems clearly wrong. So, it seems we constructed an objective function we want to minimize and a corresponding algorithm, but for some reason we do not like one of its infinite minimizers. So, maybe, the objective function is wrong? So, maybe, this interpretation misses something?

There is an easy way to avoid some of the above problems: change the objective function to a parametrized loss that has a non-zero gradient in zero. For example, something like this:

$$F_\alpha(w) = \frac{1}{n} \sum_{i=1}^n \frac{1}{\alpha} \ln\left(1 + \exp(-\alpha\, y_i \langle w, z_i \rangle)\right).$$

Now, when $\alpha$ goes to infinity, you recover the original objective function. However, for any finite $\alpha$, $w = 0$ is not a global optimum anymore. As a side effect, we also solved the issue of the subgradient of the max function. In this way, you could interpret the Perceptron algorithm as the *limit behaviour of SGD on a family of optimization problems*.

To be honest, I am not sure this is a satisfying solution. Moreover, the stochasticity is still there and it should be removed.

Now, I already proved a mistake bound for the Perceptron, without any particular interpretation attached to it. As a matter of fact, proofs do not need interpretations to be correct. I showed that the Perceptron competes with a *family of loss functions*, which implies that it does not just use the subgradient of a single function. However, if you need an *intuitive way* to think about it, let me present you the idea of *pseudogradients*.

**3. Pseudogradients**

Suppose we want to minimize an $L$-smooth function $F$ and we would like to use something like gradient descent. However, we do not have access to its gradient. In this situation, (Polyak and Tsypkin, 1973) proposed to use a “pseudogradient”, that is *any* vector $p_t$ that forms an angle of 90 degrees or less with the actual gradient in $w_t$:

$$\langle \nabla F(w_t), p_t \rangle \ge 0.$$

In a very intuitive way, $p_t$ gives me some information that should allow me to minimize $F$, at least in the limit. The algorithm then becomes a “pseudogradient descent” procedure that updates the current solution in the direction of the negative pseudogradient:

$$w_{t+1} = w_t - \eta_t p_t,$$

where $\eta_t$ are the step sizes or learning rates.

Note that (Polyak and Tsypkin, 1973) define the pseudogradients as *stochastic* vectors that satisfy the above inequality in conditional expectation, and for a time-varying function $F_t$. Indeed, there are a number of very interesting results in that paper. However, for simplicity of exposition I will only consider the deterministic case and only describe the application to the Perceptron.

Let’s see how this would work. Let’s assume that $\langle \nabla F(w_t), p_t \rangle > 0$, at least for an initial number of rounds; this means that the angle between the pseudogradient and the gradient is acute. From the $L$-smoothness of $F$, we have that

$$F(w_{t+1}) \le F(w_t) - \eta_t \langle \nabla F(w_t), p_t \rangle + \frac{L \eta_t^2}{2} \|p_t\|_2^2.$$

Now, if $\eta_t < \frac{2 \langle \nabla F(w_t), p_t \rangle}{L \|p_t\|_2^2}$, we have that $-\eta_t \langle \nabla F(w_t), p_t \rangle + \frac{L \eta_t^2}{2} \|p_t\|_2^2 < 0$, so we can guarantee that the value of $F$ decreases at each step. So, we are minimizing $F$ without using a gradient!

To get a rate of convergence, we should know something more about $p_t$. For example, we could assume that $\langle \nabla F(w_t), p_t \rangle \ge \alpha \|p_t\|_2^2$ for some $\alpha > 0$. Then, setting $\eta_t = \frac{\alpha}{L}$, we obtain

$$F(w_{t+1}) \le F(w_t) - \frac{\alpha^2}{2L} \|p_t\|_2^2.$$

This is still not enough, because it is clear that $\langle \nabla F(w_t), p_t \rangle > 0$ cannot be true on all rounds: when we are in the minimizer, $\nabla F(w_t) = 0$. However, with enough assumptions, following this route you can even get a rate of convergence.
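To make the idea concrete, here is a toy Python sketch (entirely my example): we minimize $F(w) = \frac{1}{2}\|w\|_2^2$ using as pseudogradient the true gradient rotated by 60 degrees, which always keeps a positive inner product with it.

```python
import math

# Toy pseudogradient descent (my example): F(w) = 0.5 * ||w||^2,
# true gradient is w, but we only use the gradient rotated by 60 degrees.
# The rotated vector still forms an acute angle with the true gradient.

def F(w):
    return 0.5 * (w[0] ** 2 + w[1] ** 2)

def pseudograd(w, theta=math.pi / 3):
    c, s = math.cos(theta), math.sin(theta)
    return [c * w[0] - s * w[1], s * w[0] + c * w[1]]  # rotated gradient

w = [3.0, -1.0]
for _ in range(50):
    p = pseudograd(w)
    assert w[0] * p[0] + w[1] * p[1] >= 0.0  # acute angle with the gradient w
    w = [wi - 0.3 * pv for wi, pv in zip(w, p)]

print(F(w))  # tiny: F decreased from 5.0 toward the minimum 0
```

Despite never using the true gradient, the iterates steadily decrease $F$, exactly as the analysis above predicts.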

**4. Pseudogradients for the Perceptron**

How do we use this to explain the Perceptron? Suppose your set is *linearly separable* with a margin of 1. This means that there exists a vector $w^\star$ such that

$$y_t \langle w^\star, z_t \rangle \ge 1, \quad t = 1, \dots, n. \qquad (2)$$

Note that the value of the margin is arbitrary, we can change it by just rescaling $w^\star$.

Remark 1. An equivalent way to restate this condition is to constrain $w^\star$ to have unitary norm and require

$$y_t \langle w^\star, z_t \rangle \ge \gamma, \quad t = 1, \dots, n,$$

where $\gamma > 0$ is called the *maximum margin* of the set of samples. However, in the following I will not use the margin notation because it makes things a bit less clear from an optimization point of view.

We would like to construct an algorithm to find $w^\star$ (or any positive scaling of it) from the samples $(z_t, y_t)$. So, we need an objective function. Here the brilliant idea of Polyak and Tsypkin: consider the function $F(w) = \frac{1}{2}\|w - w^\star\|_2^2$; in each iteration in which the Perceptron makes a mistake on an arbitrary couple $(z_t, y_t)$, that is $y_t \langle w_t, z_t \rangle \le 0$, define $p_t = -y_t z_t$, that is exactly the negative of the update we use in the Perceptron. This turns out to be a pseudogradient for $F$. Indeed,

$$\langle \nabla F(w_t), p_t \rangle = \langle w_t - w^\star, -y_t z_t \rangle = -y_t \langle w_t, z_t \rangle + y_t \langle w^\star, z_t \rangle \ge 1 > 0,$$

where in the last inequality we used (2).

Let’s pause for a moment to look at what we did: we want to minimize $F(w) = \frac{1}{2}\|w - w^\star\|_2^2$, but its gradient is just impossible to calculate because it depends on $w^\star$, which we clearly do not know. However, *every time the Perceptron finds a sample on which its prediction is wrong*, we can construct a pseudogradient, without any knowledge of $w^\star$. It is even more surprising if you consider the fact that there is an infinite number of possible solutions $w^\star$ and hence functions $F$, yet the pseudogradient correlates positively with the gradient of any of them! Moreover, no stochasticity is necessary.

At this point we are basically done. In fact, observe that $F$ is 1-smooth. So, every time $y_t \langle w_t, z_t \rangle \le 0$, the analysis above tells us that

$$F(w_{t+1}) \le F(w_t) - \eta_t \langle \nabla F(w_t), p_t \rangle + \frac{\eta_t^2}{2} \|p_t\|_2^2 \le F(w_t) - \eta_t + \frac{\eta_t^2 R^2}{2},$$

where in the last inequality we have assumed $\|z_t\|_2 \le R$.

Setting $\eta_t = \eta$ for all $t$, summing over time, and denoting by $M$ the number of updates we have over $T$ iterations, we obtain

$$0 \le F(w_{T+1}) \le F(w_1) - \eta M + \frac{\eta^2 M R^2}{2} = \frac{1}{2}\|w^\star\|_2^2 - \eta M + \frac{\eta^2 M R^2}{2},$$

where we used the fact that $F(w) \ge 0$ and $w_1 = 0$.

Now, there is the actual magic of the (parameter-free!) Perceptron update rule: as we explained here, the updates of the Perceptron are independent of $\eta$. That is, given an order in which the samples are presented to the algorithm, any fixed $\eta$ makes the Perceptron update on the same samples and it only changes the scale of the $w_t$. Hence, even if the Perceptron algorithm uses $\eta = 1$, we can consider an arbitrary $\eta$ decided post-hoc to minimize the upper bound. Hence, choosing $\eta = \frac{1}{R^2}$, we obtain

$$0 \le \frac{1}{2}\|w^\star\|_2^2 - \frac{M}{2R^2},$$

that is

$$M \le R^2 \|w^\star\|_2^2.$$

Now, observing that the r.h.s. is independent of $T$, we proved that the maximum number of updates, or equivalently mistakes, of the Perceptron algorithm is bounded.

Are we done? Not yet! We can now improve the Perceptron algorithm taking full advantage of the pseudogradients interpretation.

**5. An Improved Perceptron**

This is a little known idea to improve the Perceptron. It can be shown with the classic analysis as well, but it comes very naturally from the pseudogradient analysis.

Let’s start from

$$F(w_{t+1}) \le F(w_t) - \eta_t \langle \nabla F(w_t), p_t \rangle + \frac{\eta_t^2}{2} \|z_t\|_2^2 \le F(w_t) - \eta_t + \frac{\eta_t^2}{2} \|z_t\|_2^2.$$

Now consider only the rounds in which $y_t \langle w_t, z_t \rangle \le 0$ and set $\eta_t = \frac{1}{\|z_t\|_2^2}$, that is obtained by an optimization of the expression $-\eta_t + \frac{\eta_t^2}{2}\|z_t\|_2^2$. So, we obtain

$$F(w_{t+1}) \le F(w_t) - \frac{1}{2\|z_t\|_2^2}. \qquad (3)$$

This means that now the update rule becomes

$$w_{t+1} = w_t + \frac{y_t z_t}{\|z_t\|_2^2}.$$

Now, summing (3) over time, we get

$$\sum_{t} \frac{1}{\|z_t\|_2^2} \le \|w^\star\|_2^2,$$

where the sum is over the rounds in which we updated.

It is clear that this inequality implies the previous bound on the number of mistakes $M$, because $\frac{1}{\|z_t\|_2^2} \ge \frac{1}{R^2}$. But we can even obtain a tighter bound. Using the inequality between harmonic, geometric, and arithmetic means, we have

$$M \le \|w^\star\|_2^2 \, \frac{M}{\sum_{t \in \mathcal{M}} \frac{1}{\|z_t\|_2^2}} \le \|w^\star\|_2^2 \left(\prod_{t \in \mathcal{M}} \|z_t\|_2^2\right)^{1/M} \le \|w^\star\|_2^2 \, \frac{1}{M} \sum_{t \in \mathcal{M}} \|z_t\|_2^2,$$

where $\mathcal{M}$ is the set of rounds in which we updated and $M = |\mathcal{M}|$.

In words, the original Perceptron bound depends on the maximum squared norm of the samples on which we updated. Instead, this bound depends on the geometric or arithmetic mean of the squared norms of the samples on which we updated, which is less than or equal to the maximum.
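In code, the improved variant amounts to a one-line change of the Perceptron step: on a mistake, update with learning rate $1/\|z_t\|_2^2$ instead of 1 (Python sketch, my names).

```python
# Normalized Perceptron step suggested by the pseudogradient analysis:
# on a mistake, step with learning rate 1/||z||^2 instead of 1.

def normalized_perceptron_step(w, z, y):
    margin = y * sum(wi * zi for wi, zi in zip(w, z))
    if margin > 0:
        return w  # correctly classified: no update
    inv_sq = 1.0 / sum(zi * zi for zi in z)
    return [wi + inv_sq * y * zi for wi, zi in zip(w, z)]
```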

**6. Pseudogradients and Lyapunov Potential Functions**

Some people might have realized yet another way to look at this: $\frac{1}{2}\|w_t - w^\star\|_2^2$ is the Lyapunov function typically used to analyze subgradient descent. In fact, the classic analysis of SGD considers the guaranteed decrement at each step of this function. The two things coincide, but I find that the pseudogradient idea adds a non-trivial amount of information, because it allows us to bypass the idea of using a subgradient of the loss function completely.

Moreover, the idea of the pseudogradients is more general, because it applies to any smooth function, not only to the choice of $F(w) = \frac{1}{2}\|w - w^\star\|_2^2$.

Overall, it is clear that all the good analyses of the Perceptron must have something in common. However, sometimes recasting a problem in a particular framework might have some advantages because it helps our intuition. In this view, I find the pseudogradient view particularly compelling because it aligns with my intuition of how an optimization algorithm is supposed to work.

**7. History Bits**

I already wrote about the Perceptron, so I will just add a few more relevant bits.

As I said, it seems that the family of Perceptron algorithms was intended to be something much more general than what we intend now. The particular class of Perceptrons we use nowadays was called the $\alpha$-system (Block, 1962). I hypothesize that the fact that the $\alpha$-system survived the test of time is exactly due to the simple convergence proofs in (Block, 1962) and (Novikoff, 1963). Both proofs are non-stochastic. For the sake of proper credit assignment, it seems that the convergence of the Perceptron was proved by many others before Block and Novikoff (see references in Novikoff, 1963). However, the proof in (Novikoff, 1963) seems to be the cleanest one. (Aizerman, Braverman, and Rozonoer, 1964) (essentially) describe for the first time the Kernel Perceptron and prove a finite mistake bound for it.

I got the idea of smoothing the Perceptron algorithm with a scaled logistic loss from a discussion on Twitter with Maxim Raginsky. He wrote that (Aizerman, Braverman, and Rozonoer, 1970) proposed some kind of smoothing in a Russian book for the objective function in (1), but I don’t have access to it, so I am not sure what the details are. I just thought of a very natural one.

The idea of pseudogradients and the application to the Perceptron algorithm is in (Polyak and Tsypkin, 1973). However, there the input/output samples are still stochastic and the finite bound is not explicitly calculated. As I have shown, stochasticity is not needed. It is important to remember that online convex optimization as a field came much later, so there was no reason for these people to consider arbitrary or even adversarial orders of the samples.

The improved Perceptron mistake bound could be new (but please let me know if it isn’t!) and it is inspired by the idea in (Graepel, Herbrich, and Williamson, 2001) of normalizing the samples to show a tighter bound.

**Acknowledgements**

Given the insane amount of mistakes that Nicolò Campolongo usually finds in my posts, this time I asked him to proofread it. So, I thank Nicolò for finding an insane amount of mistakes on a draft of this post.

I have been working on online and stochastic optimization for a while, from a practical and empirical point of view. So, I was already in this field when Adam (Kingma and Ba, 2015) was proposed.

The paper was ok but not a breakthrough, and even more so by today’s standards. Indeed, the theory was weak: a regret guarantee for an algorithm supposed to work on stochastic optimization of non-convex functions. The experiments were also weak: the exact same experiments would result in a surefire rejection these days. Later, people also discovered an error in the proof and the fact that the algorithm will not converge on certain one-dimensional stochastic convex functions. Despite all of this, nowadays Adam is considered the King of the optimization algorithms. Let me be clear: it is known that Adam will not always give you the best performance, yet most of the time people know that they can use it with its default parameters and get, if not the best performance, at least the second best performance on their particular deep learning problem. In other words, Adam is considered nowadays the *default optimizer* for deep learning. So, what is the secret behind Adam?

Over the years, people published a vast number of papers that tried to explain Adam and its performance, too many to list. From the “adaptive learning rate” (adaptive to what? Nobody exactly knows…) to the momentum, to the almost scale-invariance, each single aspect of its arcane recipe has been examined. Yet, none of these analyses gave us the final answer on its performance. It is clear that most of these ingredients are beneficial to the optimization process of *any* function, but it is still unclear why this exact combination, and not another one, makes it the best algorithm. The equilibrium in the mix is so delicate that even the small change required to fix the non-convergence issue was considered to give slightly worse performance than Adam.

The fame of Adam is also accompanied by strong sentiments: It is enough to read posts on r/MachineLearning on Reddit to see the passion that people put in defending their favorite optimizers against the other ones. It is the sort of fervor that you see in religion, in sports, and in politics.

However, how *likely* is all this? I mean, how likely is it that Adam is really the *best* optimization algorithm? How likely is it that we reached the apex of optimization for deep learning a few years ago in a field that is so young? Could there be another explanation for its prodigious performance?

I have a hypothesis, but before explaining it we have to briefly talk about the applied deep learning community.

In a talk, Olivier Bousquet has described the deep learning community as a giant genetic algorithm: researchers in this community are exploring the space of all variants of algorithms and architectures in a semi-random way. Things that consistently work in large experiments are kept, the ones not working are discarded. Note that this process seems to be independent of the acceptance and rejection of papers: the community is so big and active that good ideas in rejected papers are still saved and transformed into best practices in a few months, see for example (Loshchilov and Hutter, 2019). Analogously, ideas in published papers are reproduced by hundreds of people that mercilessly trash things that do not reproduce. This process has created a number of heuristics that consistently produce good results in experiments, and the stress here is on “consistently”. Indeed, despite being a method based on non-convex formulations, the performance of deep learning methods turns out to be extremely reliable. (Note that the deep learning community also has a large bias towards “famous” people, so not all the ideas receive the same level of attention…)

So, what is the link between this giant genetic algorithm and Adam? Well, looking carefully at this creative process in the deep learning community, I noticed a pattern: usually people try new architectures *keeping the optimization algorithm fixed*, and most of the time the algorithm of choice is Adam. This happens because, as explained above, Adam is the *default optimizer*.

So, here is my hypothesis: Adam was a very good optimization algorithm for the neural network architectures we had a few years ago, and **people kept evolving new architectures on which Adam works**. So, we might not see many architectures on which Adam does not work, because such ideas are discarded prematurely: they would require designing a new optimizer together with the new architecture, a much harder task.

Now, I am sure many people won’t buy into this hypothesis. I am sure they will list all sorts of specific problems in which Adam is not the best algorithm, in which Stochastic Gradient Descent with momentum is the best one, and so on and so forth. However, I would like to point out two things: 1) I am not describing a law of nature here, but simply a tendency the community has that might have influenced the co-evolution of some architectures and optimizers; 2) I actually have some evidence to support this claim.

If my claims were true, we would expect Adam to be extremely good on deep neural networks and very poor on anything else. And this does happen! For example, Adam is known to perform very poorly on simple convex and non-convex problems that are not deep neural networks; see, for example, the following experiments from (Vaswani et al., 2019):

It seems that the moment we move away from the specific setting of deep neural networks, with their specific choice of initialization, specific scale of the weights, specific loss functions, etc., Adam loses its *adaptivity* and its magic default learning rate must be tuned again. Note that you can always write a linear predictor as a one-layer neural network, yet Adam does not work so well in this case either. So, **all the particular choices of architectures in deep learning might have evolved to make Adam work better and better, while the simple problems above do not have any of the nice properties that allow Adam to shine**.

Overall, Adam might be the best optimizer because the deep learning community might be exploring only a small region of the joint search space of architectures/optimizers. If true, that would be ironic for a community that departed from convex methods because these focused only on a narrow region of the possible machine learning algorithms, which was like, as Yann LeCun wrote, “looking for your lost car keys under the street light knowing you lost them someplace else”.

EDIT: After the publication of this post, Sam Power pointed me to this tweet by Roger Grosse that seems to share a similar sentiment.

**1. SGD on Non-Convex Smooth Functions **

We are interested in minimizing a smooth non-convex function $f$ using stochastic gradient descent (SGD) with unbiased stochastic gradients. In more detail, we assume to have access to an oracle that returns, at any point $\boldsymbol{x}$, a vector $\boldsymbol{g}(\boldsymbol{x}, \xi)$ such that $\mathbb{E}_{\xi}[\boldsymbol{g}(\boldsymbol{x}, \xi)] = \nabla f(\boldsymbol{x})$, where $\xi$ is the realization of a mechanism for computing the stochastic gradient. For example, $\xi$ could be the random index of a training sample we use to calculate the gradient of the training loss, or just random noise that is added on top of our gradient computation. We will also assume that the variance of the stochastic gradient is bounded: $\mathbb{E}_{\xi}[\|\boldsymbol{g}(\boldsymbol{x}, \xi) - \nabla f(\boldsymbol{x})\|^2] \le \sigma^2$ for all $\boldsymbol{x}$. Weaker assumptions on the variance are possible, but they don’t add much to the general message nor to the scheme of the proof.

Given that the function is non-convex, we clearly cannot hope to converge to the minimum of $f$, so we need a less ambitious goal. We assumed that the function is smooth. As you might remember from my previous posts, smooth functions are differentiable functions whose gradient is Lipschitz. Formally, we say that $f$ is $L$-smooth when $\|\nabla f(\boldsymbol{x}) - \nabla f(\boldsymbol{y})\| \le L \|\boldsymbol{x} - \boldsymbol{y}\|$ for all $\boldsymbol{x}, \boldsymbol{y}$. This assumption assures us that when we approach a local minimum the gradient goes to zero. Hence, **decreasing the norm of the gradient will be our objective function for SGD.** Note that smoothness is necessary to study the norm of the gradients. In fact, consider the non-smooth function $f(x) = |x|$, whose derivative does not go to zero when we approach the minimum; on the contrary, its magnitude is constant at any point different from the minimum.

The last thing we will assume is that the function $f$ is bounded from below. Remember that boundedness from below does not imply that the minimum of the function exists, e.g., $f(x) = \exp(-x)$.

Hence, I start from a point $\boldsymbol{x}_1$ and the SGD update is

$$\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \eta_t \boldsymbol{g}_t, \qquad (1)$$

where $\boldsymbol{g}_t := \boldsymbol{g}(\boldsymbol{x}_t, \xi_t)$ and the $\eta_t > 0$ are deterministic learning rates or stepsizes.

First, let’s see practically how SGD behaves w.r.t. Gradient Descent (GD) on the same problem.

In Figure 1, we are minimizing , where the stochastic gradient in SGD is given by the gradient of the function corrupted by Gaussian noise with zero mean and standard deviation 1. On the other hand, there is no noise for GD. In both cases, we use and we plot the absolute value of the derivative. We can see that GD will monotonically minimize the gradient till numerical precision as expected, converging to one of the local minima. Note that with a constant learning rate GD on this problem would converge even faster. Instead, SGD will jump back and forth resulting in only *some* iterates having small gradient. So, our basic question is the following:

*Will $\|\nabla f(\boldsymbol{x}_t)\|$ converge to zero with probability 1 in SGD when $t$ goes to infinity?*

This is more difficult to answer than what you might think. However, this is a basic question to know if it actually makes sense to run SGD for a bunch of iterations and return the last iterate, that is how 99% of the people use SGD on a non-convex problem.
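The experiment in Figure 1 is easy to reproduce. A minimal sketch, where the 1-d objective, the stepsizes, and the constants are my own illustrative choices, not necessarily the ones used for the figure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical smooth non-convex 1-d objective (the post's exact f is not given)
f = lambda x: x**2 + 2 * np.sin(3 * x)
grad = lambda x: 2 * x + 6 * np.cos(3 * x)   # L-smooth with L <= 20

def run(noise_std, T=2000, x0=3.0):
    """SGD with eta_t = 0.05/sqrt(t); noise_std=0 recovers plain GD."""
    x, grad_norms = x0, []
    for t in range(1, T + 1):
        g = grad(x) + noise_std * rng.normal()   # unbiased stochastic gradient
        x -= 0.05 / np.sqrt(t) * g
        grad_norms.append(abs(grad(x)))
    return np.array(grad_norms)

gd = run(noise_std=0.0)    # |f'(x_t)| shrinks towards a stationary point
sgd = run(noise_std=1.0)   # |f'(x_t)| keeps jumping back and forth
```

Plotting `gd` and `sgd` reproduces the qualitative behavior described for Figure 1: GD drives the gradient to numerical precision, while only *some* SGD iterates have a small gradient.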

To warm up, let’s first see what we can prove in a finite-time setting.

As in all other similar analyses, we need to construct a potential (Lyapunov) function that allows us to analyze the algorithm. In the convex case, we would study $\|\boldsymbol{x}_t - \boldsymbol{x}^\star\|^2$, where $\boldsymbol{x}^\star \in \operatorname{argmin}_{\boldsymbol{x}} f(\boldsymbol{x})$. Here, this potential does not even make sense, because we are not even trying to converge to $\boldsymbol{x}^\star$. It turns out that a better choice is to study $f(\boldsymbol{x}_t)$. We will make use of the following property of $L$-smooth functions:

$$f(\boldsymbol{y}) \le f(\boldsymbol{x}) + \langle \nabla f(\boldsymbol{x}), \boldsymbol{y} - \boldsymbol{x} \rangle + \frac{L}{2} \|\boldsymbol{y} - \boldsymbol{x}\|^2, \quad \forall \boldsymbol{x}, \boldsymbol{y}~.$$

In words, this means that a smooth function is always upper bounded by a quadratic function. Note that this property does not require convexity, so we can safely use it. Thanks to this property, let’s see how our potential evolves over time during the optimization of SGD.
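Before that, a quick numerical sanity check of this quadratic upper bound, using $f(x) = \sin(x)$, which is non-convex but $1$-smooth (an illustrative choice of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# sin is 1-smooth (|f''| <= 1) but non-convex: the quadratic bound still holds
f, grad, L = np.sin, np.cos, 1.0

x = rng.uniform(-10, 10, size=1000)
y = rng.uniform(-10, 10, size=1000)
quad = f(x) + grad(x) * (y - x) + 0.5 * L * (y - x) ** 2
gap = quad - f(y)   # non-negative at every sampled pair (x, y)
```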

Now, let’s denote by $\mathbb{E}_t[\cdot]$ the expectation with respect to $\xi_t$, conditioned on $\boldsymbol{x}_t$, so we have

$$\mathbb{E}_t[f(\boldsymbol{x}_{t+1})] \le f(\boldsymbol{x}_t) - \eta_t \|\nabla f(\boldsymbol{x}_t)\|^2 + \frac{L \eta_t^2}{2} \mathbb{E}_t[\|\boldsymbol{g}_t\|^2] \le f(\boldsymbol{x}_t) - \eta_t \left(1 - \frac{L \eta_t}{2}\right) \|\nabla f(\boldsymbol{x}_t)\|^2 + \frac{L \eta_t^2 \sigma^2}{2},$$

where in the second inequality we have used the fact that the variance of the stochastic gradient is bounded by $\sigma^2$. Taking the total expectation and reordering the terms, we have

$$\sum_{t=1}^T \eta_t \left(1 - \frac{L \eta_t}{2}\right) \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|^2] \le f(\boldsymbol{x}_1) - f^\star + \frac{L \sigma^2}{2} \sum_{t=1}^T \eta_t^2,$$

where $f^\star := \inf_{\boldsymbol{x}} f(\boldsymbol{x}) > -\infty$.

Let’s see how useful this inequality is: consider the constant stepsize $\eta_t = \eta = \min\left(\frac{1}{L}, \frac{\alpha}{\sigma \sqrt{T}}\right)$, where $\alpha$ is the usual critical parameter of the learning rate (that you’ll never be able to tune properly unless you know things that you clearly don’t know…). With this choice, we have $\eta_t \left(1 - \frac{L \eta_t}{2}\right) \ge \frac{\eta}{2}$. So, we have

$$\frac{1}{T} \sum_{t=1}^T \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|^2] \le \frac{2 (f(\boldsymbol{x}_1) - f^\star)}{\eta T} + L \sigma^2 \eta \le \frac{2 L (f(\boldsymbol{x}_1) - f^\star)}{T} + \left(\frac{2 (f(\boldsymbol{x}_1) - f^\star)}{\alpha} + L \alpha\right) \frac{\sigma}{\sqrt{T}}~.$$

What we got is almost a convergence result: it says that the average of the norms of the gradients goes to zero as $O(1/\sqrt{T})$. Given that the average of a set of numbers is greater than or equal to its minimum, this means that there exists at least one iterate in my set of iterates that has a small expected gradient norm. This is interesting but slightly disappointing. We were supposed to prove that the gradient converges to zero, but instead we only proved that at least *one* of the iterates has indeed a small expected gradient norm, and we don’t know which one. Also, trying to find the right iterate might be annoying, because we only have access to stochastic gradients.

It is also interesting to see that the convergence rate has two terms: a fast rate $O(1/T)$ and a slow rate $O(\sigma/\sqrt{T})$. This means that we can expect the algorithm to make fast progress at the beginning of the optimization and then slowly converge once the number of iterations becomes big enough compared to the variance of the stochastic gradients. In case the noise on the gradients is zero, SGD simply becomes gradient descent and it will converge at a $O(1/T)$ rate. In the noiseless case, we can also show that the last iterate is the one with the smallest gradient. However, note that the learning rate has $\sigma$ in it, so effectively we can achieve a faster convergence in the noiseless case because we would be using a constant stepsize, independent of $T$.

**2. The Magic Trick: Randomly Stopped SGD **

The above reasoning is interesting but it is not a solution to our question: does the last iterate of SGD converge? Yes or no?

There is a possible work-around that looks like a magic trick. Let’s take one iterate of SGD uniformly at random among $\boldsymbol{x}_1, \dots, \boldsymbol{x}_T$ and call it $\boldsymbol{x}_{\bar{t}}$. Taking the expectation with respect to this randomization and to the noise in the stochastic gradients, we have that

$$\mathbb{E}[\|\nabla f(\boldsymbol{x}_{\bar{t}})\|^2] = \frac{1}{T} \sum_{t=1}^T \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|^2] \le \frac{2 (f(\boldsymbol{x}_1) - f^\star)}{\eta T} + L \sigma^2 \eta~.$$

Basically, it says that if we run SGD for $T$ iterations, and then we stop and return not the last iterate but one of the iterates chosen at random, then in expectation with respect to everything the norm of the gradient at the returned iterate will be small! Note that this is equivalent to running SGD with a random stopping time. In other words, given that we didn’t know how to prove whether SGD converges, we changed the algorithm by adding a random stopping time, and now the random iterate on which we stop will have, in expectation, the desired convergence rate.
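In code, the randomly-stopped variant is a one-line change to SGD: run for $T$ steps and return an iterate chosen uniformly at random. A sketch, where the toy objective and the constants are illustrative assumptions of mine:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy non-convex 1-d objective's gradient (illustrative choice)
grad = lambda x: 2 * x + 6 * np.cos(3 * x)

def sgd_random_iterate(T=1000, x0=3.0, eta=0.01):
    """Run SGD for T steps, then return one of the T iterates chosen
    uniformly at random (equivalently, a random stopping time)."""
    xs, x = [], x0
    for _ in range(T):
        xs.append(x)
        x -= eta * (grad(x) + rng.normal())   # unbiased stochastic gradient
    return xs[rng.integers(T)]

x_hat = sgd_random_iterate()
```

The returned point is exactly the random iterate whose expected squared gradient norm the bound above controls.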

This is a very important result and also a standard one these days. It should be intuitive why the randomization helps: from Figure 1 it is clear that we might be unlucky in the last iteration of SGD; however, by randomizing, in expectation we smooth out the noise and get a small gradient. However, we just changed the target, because we still didn’t prove whether the last iterate converges. So, we need an alternative way.

**3. The Disappointing Lim Inf **

Let’s consider again (1). This time, let’s select any time-varying positive stepsizes that satisfy

$$\sum_{t=1}^{\infty} \eta_t = \infty \quad \text{and} \quad \sum_{t=1}^{\infty} \eta_t^2 < \infty~. \qquad (2)$$

These two conditions are classic in the study of stochastic approximation. The first condition is needed to be able to travel arbitrarily far from the initial point, while the second one is needed to keep the variance of the noise under control. The classic learning rate of $\eta_t \propto \frac{1}{\sqrt{t}}$ does not satisfy these assumptions, but something decaying a little bit faster, as $\eta_t \propto \frac{1}{t^\beta}$ with $\beta \in (1/2, 1]$, will do.
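Numerically, the two conditions are easy to check on polynomially-decaying stepsizes; a small sketch comparing $\eta_t = t^{-1/2}$ with $\eta_t = t^{-0.6}$:

```python
import numpy as np

t = np.arange(1, 10**6 + 1)

# eta_t = 1/sqrt(t): sum eta_t diverges, but sum eta_t^2 is the harmonic
# series, which diverges too, so the second condition fails
sum_eta_half, sum_eta2_half = (t**-0.5).sum(), (t**-1.0).sum()

# eta_t = t^(-0.6): sum eta_t still diverges, while sum eta_t^2 = sum t^(-1.2)
# stays below its finite limit
sum_eta_06, sum_eta2_06 = (t**-0.6).sum(), (t**-1.2).sum()
```

With $10^6$ terms, the first three partial sums keep growing with $T$, while $\sum_t t^{-1.2}$ is already close to its finite limit $\zeta(1.2) \approx 5.6$.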

With such a choice, we get

$$\sum_{t=1}^{\infty} \eta_t \left(1 - \frac{L \eta_t}{2}\right) \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|^2] \le f(\boldsymbol{x}_1) - f^\star + \frac{L \sigma^2}{2} \sum_{t=1}^{\infty} \eta_t^2 < \infty,$$

where we have used the second condition in the last inequality. Now, the condition $\sum_{t=1}^{\infty} \eta_t^2 < \infty$ implies that $\eta_t$ converges to 0. So, there exists $t_0$ such that $\eta_t \le \frac{1}{L}$ for all $t \ge t_0$. So, we get that

$$\sum_{t=t_0}^{\infty} \eta_t \, \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|^2] < \infty~.$$

This implies that $\sum_{t=t_0}^{\infty} \eta_t \|\nabla f(\boldsymbol{x}_t)\|^2 < \infty$ with probability 1. We are almost done: from this last inequality and the condition $\sum_{t=1}^{\infty} \eta_t = \infty$, we can derive the fact that $\liminf_{t \to \infty} \|\nabla f(\boldsymbol{x}_t)\| = 0$ with probability 1.

**Wait, what? What is this $\liminf$???** Unfortunately, it seems that we proved something weaker than what we wanted. In words, the $\liminf$ result says that there exists a *subsequence* of $(\boldsymbol{x}_t)_t$ whose gradients converge to zero.

This is very disappointing and we might be tempted to believe that this is the best that we can do. Fortunately, this is not the case. In fact, in a seminal paper (Bertsekas and Tsitsiklis, 2000) proved the convergence of the gradients of SGD to zero with probability 1 under very weak assumptions. Their proof is very convoluted also due to the assumptions they used, but in the following I’ll show a much simpler proof.

**4. The Asymptotic Proof in Few Lines **

In 2018, I found a way to get the same result of (Bertsekas and Tsitsiklis, 2000), distilling their long proof into the following Lemma, whose proof is in the Appendix. It turns out that this Lemma is essentially all we need.

Lemma 1. Let $(a_t)_{t \ge 1}$ and $(\eta_t)_{t \ge 1}$ be two non-negative sequences and $(\boldsymbol{v}_t)_{t \ge 1}$ a sequence of vectors in a vector space $V$. Assume that $\sum_{t=1}^{\infty} \eta_t a_t^2 < \infty$ and $\sum_{t=1}^{\infty} \eta_t = \infty$. Assume also that there exists $L > 0$ such that $|a_{t+\tau} - a_t| \le L \left(\left\|\sum_{i=t}^{t+\tau-1} \eta_i \boldsymbol{v}_i\right\| + \sum_{i=t}^{t+\tau-1} \eta_i a_i\right)$ for any $t, \tau \ge 1$, where $(\boldsymbol{v}_t)_t$ is such that $\sum_{t=1}^{\infty} \eta_t \boldsymbol{v}_t$ converges. Then, $a_t$ converges to 0.

We are now finally ready to prove the asymptotic convergence with probability 1.

Theorem 2. Assume that we run SGD on an $L$-smooth function, with stepsizes that satisfy the conditions (2). Then, $\|\nabla f(\boldsymbol{x}_t)\|$ goes to zero with probability 1.

*Proof:* We want to use Lemma 1 on $\|\nabla f(\boldsymbol{x}_t)\|$. So, first observe that by the $L$-smoothness of $f$, we have

$$\big| \|\nabla f(\boldsymbol{x}_{t+\tau})\| - \|\nabla f(\boldsymbol{x}_t)\| \big| \le \|\nabla f(\boldsymbol{x}_{t+\tau}) - \nabla f(\boldsymbol{x}_t)\| \le L \|\boldsymbol{x}_{t+\tau} - \boldsymbol{x}_t\| = L \left\|\sum_{i=t}^{t+\tau-1} \eta_i \boldsymbol{g}_i\right\| \le L \left\|\sum_{i=t}^{t+\tau-1} \eta_i (\boldsymbol{g}_i - \nabla f(\boldsymbol{x}_i))\right\| + L \sum_{i=t}^{t+\tau-1} \eta_i \|\nabla f(\boldsymbol{x}_i)\|~.$$

The assumptions and the reasoning above imply that, with probability 1, $\sum_{t=1}^{\infty} \eta_t \|\nabla f(\boldsymbol{x}_t)\|^2 < \infty$. This also suggests to set $a_t = \|\nabla f(\boldsymbol{x}_t)\|$. Also, we have, with probability 1, that $\sum_{t=1}^{\infty} \eta_t (\boldsymbol{g}_t - \nabla f(\boldsymbol{x}_t))$ converges, because $\sum_{t=1}^{T} \eta_t (\boldsymbol{g}_t - \nabla f(\boldsymbol{x}_t))$ for $T = 1, 2, \dots$ is a martingale whose variance is bounded by $\sigma^2 \sum_{t=1}^{\infty} \eta_t^2 < \infty$. Hence, it is a martingale bounded in $L^2$, so it converges with probability 1.

Overall, with probability 1 the assumptions of Lemma 1 are verified with $\boldsymbol{v}_t = \boldsymbol{g}_t - \nabla f(\boldsymbol{x}_t)$.

We did it! Finally, we proved that the gradients of SGD do indeed converge to zero with probability 1. This means that, with probability 1, for any $\epsilon > 0$ there exists $t_0$ such that $\|\nabla f(\boldsymbol{x}_t)\| \le \epsilon$ for all $t \ge t_0$.

Even if I didn’t actually use any intuition in crafting the above proof (I rarely use “intuition” to prove things), Yann Ollivier provided the following intuition for it: the proof is implicitly studying how far apart GD and SGD are. However, instead of estimating the distance between the two processes over a single update, it does it over a large period of time, through a term that can be controlled thanks to the choice of the learning rates.

**5. History Bits **

The idea of taking one iterate at random in SGD was proposed in (Ghadimi and Lan, 2013) and it reminds me of the well-known online-to-batch conversion through randomization. The conditions on the learning rates in (2) go back to (Robbins and Monro, 1951). (Bertsekas and Tsitsiklis, 2000) contains a good review of previous work on the asymptotic convergence of SGD, while a recent paper on this topic is (Patel, V., 2020).

I derived Lemma 1 as an extension of Proposition 2 in (Alber et al., 1998)/Lemma A.5 in (Mairal, 2013). Studying the proof of (Bertsekas and Tsitsiklis, 2000), I realized that I could change (Alber et al., 1998, Proposition 2) into what I needed. I had this proof sitting in my unpublished notes for 2 years, so I decided to write a blog post on it.

My actual small contribution to this line of research is a lim inf convergence for SGD with AdaGrad stepsizes (Li and Orabona, 2019), but under stronger assumptions on the noise.

Note that 20-30 years ago there were many papers studying the asymptotic convergence of SGD and its variants in various settings. Then, the taste of the community changed, moving from asymptotic convergence to finite-time rates. As often happens when a new trend takes over the previous one, new generations tend to be oblivious to the old results and proof techniques. The common motivation to ignore these past results is that the finite-time analysis is superior to the asymptotic one, but this is clearly false (ask a statistician!). It should instead be clear to anyone that both analyses have pros and cons.

**6. Acknowledgements **

I thank Léon Bottou for telling me of the problem of analyzing the asymptotic convergence of SGD in the non-convex case with a simple and general proof in 2018. Léon also helped me checking my proofs and finding an error in a previous version. Also, I thank Yann Ollivier for reading my proof and kindly providing an alternative proof and the intuition that I report above.

**7. Appendix **

*Proof of Lemma 1:* Since the series $\sum_t \eta_t$ diverges while $\sum_t \eta_t a_t^2$ converges, we necessarily have $\liminf_{t \to \infty} a_t = 0$. Hence, we have to prove that $\limsup_{t \to \infty} a_t = 0$.

Let us proceed by contradiction and assume that . First, assume that .

Given the values of the and , we can then build two sequences of indices and such that

- ,
- , for ,
- , for .

Define . The convergence of the series implies that the sequences of partial sums are Cauchy sequences. Hence, there exists large enough such that for all we have and are less than or equal to . Then, we have for all and all with ,

Therefore, using the triangle inequality, and finally, for all , which contradicts . Therefore, goes to zero.

To rule out the case that , proceed in the same way, choosing any . Hence, we get that for , that contradicts .

Don’t get me wrong: assuming bounded domains is perfectly fine and justified most of the time. However, sometimes it is unnecessary and it might also obscure critical issues in the analysis, as in this case. So, to balance the universe of first-order methods, I decided to show how to easily prove the convergence of the iterates in SGD, even in unbounded domains.

Technically speaking, the following result might be new, but definitely not worth a fight with Reviewer 2 to publish it somewhere.

**1. Setting **

First, let’s define our setting. We want to solve the following optimization problem

$$\min_{\boldsymbol{x} \in V} \ f(\boldsymbol{x}),$$

where $V \subseteq \mathbb{R}^d$ and $f$ is a convex function. Now, various assumptions on $f$ are possible, and choosing the right one depends on *your* particular problem; there are no right answers. Here, we will not make any strong assumption on $f$. Also, we will *not* assume $V$ to be bounded. Indeed, in most of the modern applications in Machine Learning, $V$ is simply the entire space $\mathbb{R}^d$. We will also assume that $\operatorname{argmin}_{\boldsymbol{x} \in V} f(\boldsymbol{x})$ is not empty and that $\boldsymbol{x}^\star$ is any element of it.

We also assume to have access to a *first-order stochastic oracle* that returns stochastic subgradients of $f$ at any point $\boldsymbol{x}$. In formulas, we get $\boldsymbol{g}(\boldsymbol{x}, \xi)$ such that $\mathbb{E}_{\xi}[\boldsymbol{g}(\boldsymbol{x}, \xi)] \in \partial f(\boldsymbol{x})$. Practically speaking, every time you calculate the (sub)gradient on a minibatch of training data, that is a stochastic (sub)gradient, and, roughly speaking, the random minibatch is the random variable $\xi$.

Here, for didactic reasons, we will assume that the expected squared norm of the stochastic subgradients is bounded by 1, that is $\mathbb{E}_{\xi}[\|\boldsymbol{g}(\boldsymbol{x}, \xi)\|^2] \le 1$; similar results can also be shown with more realistic assumptions. This holds, for example, if $f$ is an average of 1-Lipschitz functions and you draw some of them to calculate the stochastic subgradient.

The algorithm we want to focus on is SGD. So, what is SGD? SGD is an incredibly simple optimization algorithm, almost primitive. Indeed, part of its fame depends critically on its simplicity. Basically, you start from a certain $\boldsymbol{x}_1 \in V$ and you update your solution iteratively, moving in the direction of the negative stochastic subgradients, multiplied by a *learning rate* $\eta_t > 0$. We also use a projection onto $V$. Of course, if $V = \mathbb{R}^d$ no projection is needed. So, the update of SGD is

$$\boldsymbol{x}_{t+1} = \Pi_V(\boldsymbol{x}_t - \eta_t \boldsymbol{g}_t),$$

where $\boldsymbol{g}_t := \boldsymbol{g}(\boldsymbol{x}_t, \xi_t)$ and $\Pi_V$ is the Euclidean projection onto $V$. Remember that when you use subgradients, SGD is not a descent algorithm: I already blogged about the fact that the common intuition of moving towards a descent direction is wrong when you use subgradients.
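As a concrete sketch, here is projected SGD on a toy problem, $f(\boldsymbol{x}) = \|\boldsymbol{x}\|_1$ over the unit Euclidean ball (the objective, the ball constraint, and the noise model are my own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def project_l2_ball(x, radius=1.0):
    """Euclidean projection onto {x : ||x||_2 <= radius}."""
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

def sgd(subgrad, x1, etas, project=lambda x: x):
    """Projected SGD: x_{t+1} = Pi_V(x_t - eta_t g_t)."""
    x = np.asarray(x1, dtype=float)
    iterates = [x.copy()]
    for eta in etas:
        x = project(x - eta * subgrad(x))
        iterates.append(x.copy())
    return np.array(iterates)

# f(x) = ||x||_1: a stochastic subgradient is sign(x) plus zero-mean noise
subgrad = lambda x: np.sign(x) + 0.1 * rng.normal(size=x.shape)
T = 1000
etas = 1.0 / np.sqrt(np.arange(1, T + 1))
xs = sgd(subgrad, x1=[0.6, -0.6], etas=etas, project=project_l2_ball)
```

Every iterate stays in the constraint set, and the iterates cluster around the minimizer $\boldsymbol{x}^\star = \boldsymbol{0}$, even though single steps are not descent steps.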

**2. Convergence of the Average of the Iterates **

Now, the most common analysis of SGD can be done in two different ways: constant learning rate and non-increasing learning rate. We already saw both of them in my lecture notes on online learning, so let’s summarize here the one-step inequality for SGD that we need:

$$\mathbb{E}_t[\|\boldsymbol{x}_{t+1} - \boldsymbol{u}\|^2] \le \|\boldsymbol{x}_t - \boldsymbol{u}\|^2 - 2 \eta_t (f(\boldsymbol{x}_t) - f(\boldsymbol{u})) + \eta_t^2, \qquad (1)$$

for all $\boldsymbol{u}$ measurable w.r.t. $\xi_1, \dots, \xi_{t-1}$.

If you plan to use $T$ iterations, you can use the constant learning rate $\eta_t = \eta = \frac{\alpha}{\sqrt{T}}$, and summing (1) over $t = 1, \dots, T$ and taking total expectations we get

$$\sum_{t=1}^T \left(\mathbb{E}[f(\boldsymbol{x}_t)] - f(\boldsymbol{x}^\star)\right) \le \frac{\|\boldsymbol{x}_1 - \boldsymbol{x}^\star\|^2}{2 \eta} + \frac{\eta T}{2} = \left(\frac{\|\boldsymbol{x}_1 - \boldsymbol{x}^\star\|^2}{2 \alpha} + \frac{\alpha}{2}\right) \sqrt{T},$$

where we set $\boldsymbol{u} = \boldsymbol{x}^\star$. This is not a convergence result yet, because it just says that *on average* we are converging. To extract a single solution, we can use Jensen’s inequality and obtain

$$\mathbb{E}[f(\bar{\boldsymbol{x}}_T)] - f(\boldsymbol{x}^\star) \le \left(\frac{\|\boldsymbol{x}_1 - \boldsymbol{x}^\star\|^2}{2 \alpha} + \frac{\alpha}{2}\right) \frac{1}{\sqrt{T}},$$

where $\bar{\boldsymbol{x}}_T = \frac{1}{T} \sum_{t=1}^T \boldsymbol{x}_t$. In words, we have shown a convergence guarantee for *the average of the iterates of SGD*, not for the last one.

Constant learning rates are a bit annoying because they depend on how many iterations you plan to do, both theoretically and empirically. So, let’s now take a look at non-increasing learning rates, $\eta_t = \frac{\alpha}{\sqrt{t}}$. In this case, the correct way to analyze SGD without the boundedness assumption is to sum (1) *without dividing by $\eta_t$*, to have

$$\sum_{t=1}^T \eta_t \left(\mathbb{E}[f(\boldsymbol{x}_t)] - f(\boldsymbol{x}^\star)\right) \le \frac{\|\boldsymbol{x}_1 - \boldsymbol{x}^\star\|^2}{2} + \frac{1}{2} \sum_{t=1}^T \eta_t^2,$$

where we set $\boldsymbol{u} = \boldsymbol{x}^\star$. From this one, we have two alternatives. First, we can observe that

$$\sum_{t=1}^T \eta_t \left(\mathbb{E}[f(\boldsymbol{x}_t)] - f(\boldsymbol{x}^\star)\right) \ge \eta_T \sum_{t=1}^T \left(\mathbb{E}[f(\boldsymbol{x}_t)] - f(\boldsymbol{x}^\star)\right),$$

because $\boldsymbol{x}^\star$ is a minimizer and the learning rates are non-increasing. So, using again Jensen’s inequality, we get

$$\mathbb{E}[f(\bar{\boldsymbol{x}}_T)] - f(\boldsymbol{x}^\star) \le \frac{1}{T \eta_T} \left(\frac{\|\boldsymbol{x}_1 - \boldsymbol{x}^\star\|^2}{2} + \frac{1}{2} \sum_{t=1}^T \eta_t^2\right), \quad \text{where } \bar{\boldsymbol{x}}_T = \frac{1}{T} \sum_{t=1}^T \boldsymbol{x}_t~.$$

Note that if you like these sorts of games, you can even change the learning rate to shave a logarithmic factor, but it is probably useless from an applied point of view.

Another possibility is to use the weighted average

$$\bar{\boldsymbol{x}}_T^{\eta} = \frac{\sum_{t=1}^T \eta_t \boldsymbol{x}_t}{\sum_{t=1}^T \eta_t},$$

for which, by Jensen’s inequality, we get

$$\mathbb{E}[f(\bar{\boldsymbol{x}}_T^{\eta})] - f(\boldsymbol{x}^\star) \le \frac{1}{\sum_{t=1}^T \eta_t} \left(\frac{\|\boldsymbol{x}_1 - \boldsymbol{x}^\star\|^2}{2} + \frac{1}{2} \sum_{t=1}^T \eta_t^2\right)~.$$

Note that this option does not seem to give any advantage over the unweighted average above. Also, it weights the first iterations more than the last ones, which in most cases is a bad idea: the first iterations tend to be farther away from the optimum than the last ones.

Let’s summarize what we have till now:

- Unbounded domains are fine with both constant and time-varying learning rates.
- The optimal learning rate depends on the distance between the optimal solution and the initial iterate, because the optimal setting of $\alpha$ is proportional to $\|\boldsymbol{x}_1 - \boldsymbol{x}^\star\|$.
- The weighted average is probably a bad idea and not strictly necessary.
- It seems we can only guarantee convergence for (weighted) averages of iterates.

The last point is a bit concerning: most of the time we take the last iterate of SGD, so why do we do it if the theory applies to the average?
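Empirically, though, the last iterate behaves well even on simple non-smooth problems. A quick sketch on $f(x) = |x|$ with $\eta_t = 1/\sqrt{t}$ (an illustrative setup of mine, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# f(x) = |x|, minimized at x* = 0; stochastic subgradient: sign(x) + noise
T = 10_000
x, iterates = 5.0, np.empty(T)
for t in range(1, T + 1):
    iterates[t - 1] = x
    x -= (np.sign(x) + rng.normal()) / np.sqrt(t)   # eta_t = 1/sqrt(t)

last_subopt = abs(iterates[-1])       # f(x_T) - f(x*)
avg_subopt = abs(iterates.mean())     # f(average of iterates) - f(x*)
```

Both quantities end up small: the average enjoys the guarantee derived above, while the last iterate needs the separate analysis discussed next.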

**3. Convergence of the Last Iterate **

Actually, we do know that

- the last solution of SGD converges in unbounded domains with constant learning rate (Zhang, T., 2004).
- the last iterate of SGD converges in bounded domains with non-increasing learning rates (Shamir, O. and Zhang, T., 2013).

So, what about unbounded domains and non-increasing learning rates, i.e., 90% of the uses of SGD? It turns out that it is equally simple, and I think the proof is also instructive! As surprising as it might sound, not dividing (1) by the learning rate is the key ingredient we need. The proof plan is the following: we want to prove that the value of $f$ at the last iterate is not too far from the value of $f$ at the average of the iterates. To prove it, we need the following technical lemma on sequences of non-negative numbers multiplied by non-increasing learning rates, whose proof is in the Appendix. This Lemma relates the last element of a sequence of numbers to their average.

Lemma 1.Let a non-increasing sequence of positive numbers and . Then

With the above Lemma, we can prove the following guarantee for the convergence of the last iterate of SGD.

Theorem 2. Assume that the stepsizes are deterministic and non-increasing. Then

*Proof:* We use Lemma 1, with , to have

Now, we bound the sum in the r.h.s. of last inequality. Summing (1) from to , we have the following inequality that holds for any :

Hence, setting , we have

Putting all together, we have the stated bound.

There are a couple of nice tricks in the proof that might be interesting to study carefully. First, we use the fact that the one-step inequality in (1) holds for any $\boldsymbol{u}$. Most of the time, we state it with $\boldsymbol{u}$ equal to $\boldsymbol{x}^\star$, but it turns out that the more general statement is actually important! In fact, it is possible to know how far the performance of the last iterate is from the performance of the average, because the incremental nature of SGD makes it possible to know exactly how far $\boldsymbol{x}_T$ is from any previous iterate $\boldsymbol{x}_t$, with $t < T$. Please note that all of this would be hidden in the case of bounded domains, where all the distances are bounded by the diameter of the set, and you don’t get this dependency.

Now we have all the ingredients and we only have to substitute a particular choice of the learning rate.

*Proof:* First, observe that

Now, considering the last term in (3), we have

Using (2) and dividing by , we have the stated bound.

Note that the above proof works similarly if .

**4. History Bits **

The first finite-time convergence proof for the last iterate of SGD is from (Zhang, T., 2004), where he considered the constant learning rate case. It was later extended in (Shamir, O. and Zhang, T., 2013) to time-varying learning rates, but only for bounded domains. The convergence rate for the weighted average in unbounded domains is from (Zhang, T., 2004). The observation that the weighted average is not needed and the plain average works equally well for non-increasing learning rates is from (X. Li and F. Orabona, 2019), where we needed it for the particular case of AdaGrad learning rates. The idea of analyzing SGD without dividing by the learning rate is from (Zhang, T., 2004). Lemma 1 is new but actually hidden in the convergence proof of the last iterate of SGD with linear predictors and square losses in (Lin, J. and Rosasco, L. and Zhou, D.-X., 2016), which in turn is based on the one in (Shamir, O. and Zhang, T., 2013). As far as I know, Corollary 3 is new, but please let me know if you happen to know a reference for it! It is possible to remove the logarithmic term in the bound using a different learning rate, but the proof is only for bounded domains (Jain, P. and Nagaraj, D. and Netrapalli, P., 2019).

**5. Exercises **

Exercise 1.Generalize the above proofs to the Stochastic Mirror Descent case.

Exercise 2. Remove the assumption of expected bounded stochastic subgradients and instead assume that $f$ is $L$-smooth, i.e., has $L$-Lipschitz gradient, and that the variance of the noise is bounded. Hint: take a look at the proofs in (Zhang, T., 2004) and (X. Li and F. Orabona, 2019).

**6. Appendix **

*Proof of Lemma 1:* Define , so we have

that implies

Now, from the definition of and the above inequality, we have

that implies

Unrolling the inequality, we have

Using the definition of and the fact that , we have the stated bound.

In this post, I explain a variation of the EG/Hedge algorithm, called *AdaHedge*. The basic idea is to design an algorithm that is adaptive to the sum of the squared norm of the losses, without any prior information on the range of the losses.

First, consider the case in which we use as a constant regularizer the negative entropy , where will be determined in the following and is the simplex in . Using FTRL with linear losses with this regularizer, we immediately obtain

where we upper bounded the negative entropy of with 0. Using the strong convexity of the regularizer w.r.t. the norm and Lemma 4 here, we would further upper bound this as

This suggests that the optimal should be . However, as we have seen for L* bounds, this choice of any parameter of the algorithm is never feasible. Hence, exactly as we did for L* bounds, we might think of using an online version of this choice

where is a constant that will be determined later. An important property of such a choice is that it gives rise to an algorithm that is scale-free, that is, its predictions are invariant to the scaling of the losses by any constant factor. This is easy to see because

Note that this choice makes the regularizer non-decreasing over time and immediately gives us

At this point, we might be tempted to use Lemma 1 from the L* post to upper bound the sum in the upper bound, but unfortunately we cannot! Indeed, the denominator does not contain the term . We might add a constant to , but that would destroy the scale-freeness of the algorithm. However, it turns out that we can still prove our bound without any change to the regularizer. The key observation is that we can bound the term in two different ways. The first way is the one above, while the other one is

where we used the definition of and the fact that the regularizer is non-decreasing over time. So, we can now write

where we used the fact that the minimum between two numbers is less than their harmonic mean. Assuming and using Lemma 1 here, we have

The bound and the assumption on suggest to set . To summarize, we obtained a scale-free algorithm with regret bound .
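Scale-freeness can be checked numerically. The sketch below runs FTRL over the simplex with the entropic regularizer and a stepsize inversely proportional to the square root of the cumulative squared loss norms (the specific norm and constant are my own assumptions, mirroring the construction above), and verifies that rescaling all the losses leaves the predictions unchanged:

```python
import numpy as np

def scale_free_ftrl(losses, alpha=1.0):
    """FTRL over the simplex with entropic regularizer scaled by
    sqrt(sum of squared infinity-norms of past losses)."""
    T, d = losses.shape
    G, S, preds = np.zeros(d), 0.0, []
    for t in range(T):
        if S == 0.0:
            p = np.full(d, 1.0 / d)                    # no information yet
        else:
            z = -(alpha / np.sqrt(S)) * (G - G.min())  # shift for stability
            p = np.exp(z)
            p /= p.sum()
        preds.append(p)
        G += losses[t]                                 # cumulative loss
        S += np.max(np.abs(losses[t])) ** 2            # cumulative sq. norms
    return np.array(preds)

L = np.random.default_rng(0).normal(size=(50, 4))
p1 = scale_free_ftrl(L)
p2 = scale_free_ftrl(10.0 * L)   # same predictions: the algorithm is scale-free
```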

We might consider ourselves happy, but there is a clear problem in the above algorithm: the choice of in the time-varying regularizer strictly depends on our upper bound. So, a loose bound will result in a poor choice of the regularization! In general, every time we use a part of the proof in the design of an algorithm we cannot expect an exciting empirical performance, unless our upper bound was really tight. So, can we design a better regularizer? Well, we need a better upper bound!

Let’s consider a generic regularizer and its corresponding FTRL with linear losses regret upper bound

where we assume to be non-decreasing in time.

Now, observe that the sum is unlikely to disappear for this kind of algorithms, so we could try to make the term of the same order of the sum. So, we would like to set of the same order of . However, this approach would cause an annoying recurrence. So, using the fact that is non-decreasing, let’s upper bound the terms in the sum just a little bit:

Now, we can set for , , and . This immediately implies that

Setting to be equal to the negative entropy, we get an algorithm known as AdaHedge. It is easy to see that this choice makes the algorithm scale-free as well.

With this choice of the regularizer, we can simplify a bit the expression of . For , we have . Instead, for , using the properties of the Fenchel conjugates, we have that

Overall, we get the pseudo-code of AdaHedge in Algorithm 1.
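Algorithm 1 appears as an image in the original post; the following is a minimal sketch of the AdaHedge update in the standard form of (de Rooij et al., 2014), with the cumulative mixability gap playing the role of the time-varying regularizer weight:

```python
import numpy as np

def adahedge(losses):
    """AdaHedge on T rounds of loss vectors over d experts (T x d array).
    Returns the T prediction vectors on the simplex."""
    T, d = losses.shape
    L = np.zeros(d)   # cumulative expert losses
    Delta = 0.0       # cumulative mixability gap
    preds = []
    for t in range(T):
        eta = np.inf if Delta == 0.0 else np.log(d) / Delta
        if np.isinf(eta):
            w = (L == L.min()).astype(float)   # limit case: mass on leaders
        else:
            w = np.exp(-eta * (L - L.min()))   # shifted for stability
        w /= w.sum()
        ell = losses[t]
        h = w @ ell                            # Hedge (dot) loss
        if np.isinf(eta):
            m = ell[w > 0].min()               # limit of the mix loss
        else:
            m = ell.min() - np.log(w @ np.exp(-eta * (ell - ell.min()))) / eta
        Delta += max(0.0, h - m)               # mixability gap, non-negative
        L += ell
        preds.append(w)
    return np.array(preds)

losses = np.random.default_rng(0).normal(size=(200, 5))
W = adahedge(losses)
```

Note that this update is scale-free as well: multiplying all the losses by a constant multiplies the mixability gaps by the same constant, leaving the weights unchanged.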

So, now we need an upper bound for . Observe that . Moreover, as we have done before, we can upper bound in two different ways. In fact, from Lemma 4 here, we have for . Also, denoting by , we have

Hence, we have

We can solve this recurrence using the following Lemma, where and .

Lemma 1.Let be any sequence of non-negative real numbers. Suppose that is a sequence of non-negative real numbers satisfying

Then, for any , .

*Proof:* Observe that

We bound each term in the sum separately. The left term of the minimum inequality in the definition of gives , while the right term gives . So, we conclude .

So, overall we got

and setting , we have

Note that this is roughly the same regret in (2), but the very important difference is that this new regret bound depends on the *much tighter quantity* , that we upper bounded with , but in general will be much smaller than that. For example, can be upper bounded using the tighter local norms, see the analysis of Exp3. Instead, in the first solution, the regret will always be dominated by the term because we explicitly use it in the regularizer!

There is an important lesson to be learned from AdaHedge: the regret is not the full story and algorithms with the same worst-case guarantee can exhibit vastly different empirical behaviors. Unfortunately, this message is rarely heard and there is a part of the community that focuses too much on the worst-case guarantee rather than on the empirical performance. Even worse, sometimes people favor algorithms with a “more elegant analysis” completely ignoring the likely worse empirical performance.

**1. History Bits **

The use of FTRL with the regularizer in (1) was proposed in (Orabona and Pál, 2015), I presented a simpler version of their proof that does not require Fenchel conjugates. The AdaHedge algorithm was introduced in (van Erven et al., 2011) and refined in (de Rooij et al., 2014). The analysis reported here is from (Orabona and Pál, 2015), that generalized AdaHedge to arbitrary regularizers in AdaFTRL. Additional properties of AdaHedge for the stochastic case were proven in (van Erven et al., 2011).

**2. Exercises **


Exercise 1.Implement AdaHedge and compare its empirical performance to FTRL with the time-varying regularizer in (1).

* You can find the other lectures here.*

In this lecture, we will explore the link between Online Learning and Statistical Learning Theory.

**1. Agnostic PAC Learning **

We now consider a different setting from what we have seen till now. We will assume that we have a prediction strategy parametrized by a vector $\boldsymbol{w}$, and we want to learn the relationship between an input $\boldsymbol{x}$ and its associated label $y$. Moreover, we will assume that $(\boldsymbol{x}, y)$ is drawn from a joint probability distribution $\rho$. Also, we are equipped with a loss function that measures how good our prediction $\hat{y}$ is compared to the true label $y$, that is $\ell(\hat{y}, y)$. So, learning the relationship can be cast as minimizing the expected loss of our predictor

In machine learning terms, the object above is nothing else than the *test error* of our predictor.

Note that the above setting assumes labeled samples, but we can generalize it even more, considering *Vapnik’s general setting of learning*, where we collapse the prediction function and the loss into a unique function. This allows us, for example, to treat supervised and unsupervised learning in the same unified way. So, we want to minimize the *risk*

where is an unknown distribution over and is measurable w.r.t. the second argument. Also, the set of all predictors that can be expressed by vectors in is called the *hypothesis class*.

Example 1. In a linear regression task where the loss is the square loss, we have $\ell(\hat{y}, y) = (\hat{y} - y)^2$ and $\hat{y} = \langle \boldsymbol{w}, \boldsymbol{x} \rangle$. Hence, the loss of $\boldsymbol{w}$ on $(\boldsymbol{x}, y)$ is $(\langle \boldsymbol{w}, \boldsymbol{x} \rangle - y)^2$.

Example 2. In linear binary classification where the loss is the hinge loss, we have $y \in \{-1, 1\}$ and $\hat{y} = \langle \boldsymbol{w}, \boldsymbol{x} \rangle$. Hence, the loss of $\boldsymbol{w}$ on $(\boldsymbol{x}, y)$ is $\max(1 - y \langle \boldsymbol{w}, \boldsymbol{x} \rangle, 0)$.

Example 3.In binary classification with a neural network with the logistic loss, we have and is the network corresponding to the weights . Hence, .

The key difficulty of the above problem is that we don’t know the distribution . Hence, there is no hope to exactly solve this problem. Instead, we are interested in understanding *what is the best we can do if we have access to samples drawn i.i.d. from *. In more detail, we want to upper bound the *excess risk*

where is a predictor that was *learned* using samples.

It should be clear that this is just an optimization problem and we are interested in upper bounding the suboptimality gap. In this view, the objective of machine learning can be considered as a particular optimization problem.

Remark 1.Note that this is not the only way to approach the problem of learning. Indeed, the regret minimization model is an alternative model to learning. Moreover, another approach would be to try to estimate the distribution and then solve the risk minimization problem, the approach usually taken in Statistics. No approach is superior to the other and each of them has its pros and cons.

Given that we have access to the distribution through samples drawn from it, any procedure we might think to use to minimize the risk will be stochastic in nature. This means that we cannot assure a deterministic guarantee. Instead, *we can try to prove that with high probability our minimization procedure will return a solution that is close to the minimizer of the risk*. It is also intuitive that the precision and probability we can guarantee must depend on how many samples we draw from .

Quantifying the dependency of precision and probability of failure on the number of samples used is the objective of the **Agnostic Probably Approximately Correct** (PAC) framework, where the keyword “agnostic” refers to the fact that we don’t assume anything on the best possible predictor. In more detail, given a precision parameter and a probability of failure , we are interested in characterizing the *sample complexity of the hypothesis class *, defined as the number of samples necessary to guarantee with probability at least that the best learning algorithm using the hypothesis class outputs a solution with excess risk upper bounded by . Note that the sample complexity does not depend on , so it is a worst-case measure w.r.t. all possible distributions. This makes sense if you consider that we know nothing about the distribution , so if your guarantee holds for the worst distribution it will also hold for any other one. Mathematically, we will say that the hypothesis class is agnostic PAC-learnable if such a sample complexity function exists.

Definition 1. We will say that a function class is Agnostic-PAC-learnable if there exists an algorithm and a function such that, when is used with samples drawn from , with probability at least the solution returned by the algorithm has excess risk at most .

Note that the Agnostic PAC learning setting does not say anything about the procedure we should follow to achieve such a sample complexity. The approach most commonly used in machine learning to solve the learning problem is the so-called *Empirical Risk Minimization (ERM) problem*. It consists of drawing samples i.i.d. from and minimizing the *empirical risk*:

In words, ERM is nothing else than minimizing the error on a training set. However, in many interesting cases can be very far from the true optimum , even with an infinite number of samples! So, we need to modify the ERM formulation in some way, e.g., using a *regularization* term or a Bayesian prior on , or find conditions under which ERM works.
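To make the ERM procedure concrete, here is a minimal sketch in Python for linear regression with the square loss; the data distribution, the true parameter `w_star`, the noise level, and the sample size are all my own illustrative choices, not anything prescribed in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data distribution: y = <w_star, x> + noise (all choices arbitrary).
w_star = np.array([1.0, -2.0])
X = rng.normal(size=(500, 2))
y = X @ w_star + 0.1 * rng.normal(size=500)

# ERM with the square loss over linear predictors: minimize the average
# squared error on the training set (closed form via least squares).
w_erm, *_ = np.linalg.lstsq(X, y, rcond=None)
train_error = np.mean((X @ w_erm - y) ** 2)
print(w_erm, train_error)
```

With enough samples and a well-specified model the empirical minimizer lands close to the true parameter; the caveat above is that, for other hypothesis classes and losses, the ERM solution can be far from the risk minimizer without extra conditions.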

The ERM approach is so widespread that machine learning itself is often wrongly identified with some kind of minimization of the training error. We now show that ERM is not the entire world of ML, proving that *the existence of a no-regret algorithm, that is an online learning algorithm with sublinear regret, guarantees Agnostic-PAC learnability*. In more detail, we will show that an online algorithm with sublinear regret can be used to solve machine learning problems. This is not just a curiosity: for example, it gives rise to computationally efficient parameter-free algorithms, which through ERM can be achieved only by running a two-step procedure, i.e., running ERM with different parameters and selecting the best solution among them.

We already mentioned this possibility when we talked about the online-to-batch conversion, but this time we will strengthen it proving high probability guarantees rather than expectation ones.

So, we need some more bits on concentration inequalities.

**2. Bits on Concentration Inequalities **

We will use a concentration inequality to prove the high probability guarantee, but we will need to go beyond the sum of i.i.d. random variables. In particular, we will use the concept of *martingales*.

Definition 2.A sequence of random variables is called amartingaleif for all it satisfies:

Example 4. Consider a fair coin and a betting algorithm that bets money on each round on the side of the coin equal to . We win or lose money 1:1, so the total money we have won up to round is . Then, is a martingale. Indeed, we have

For bounded martingales we can prove high probability guarantees as for bounded i.i.d. random variables. The following Theorem will be the key result we will need.

Theorem 3 (Hoeffding-Azuma inequality).Let be a martingale of random variables that satisfy almost surely. Then, we have

Also, the same upper bounds hold on .
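As a quick numerical sanity check, we can simulate the coin-betting martingale of Example 4, whose increments are bounded by 1, and compare the empirical tail probability with the Hoeffding-Azuma bound exp(-ε²/(2n)); the horizon, the threshold, and the number of repetitions below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

n, trials = 100, 20000
# Martingale with increments in {-1, +1}: fair-coin betting with unit stakes.
increments = rng.choice([-1.0, 1.0], size=(trials, n))
Z_n = increments.sum(axis=1)

eps = 20.0
empirical = np.mean(Z_n >= eps)
azuma = np.exp(-eps**2 / (2 * n))  # Hoeffding-Azuma bound with c_t = 1
print(empirical, azuma)
```

The empirical tail frequency should sit comfortably below the bound, which is loose but, as the theorem states, holds for *any* bounded martingale, not just this symmetric one.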

**3. From Regret to Agnostic PAC **

We now show how the online-to-batch conversion we introduced before gives us high probability guarantee for our machine learning problem.

Theorem 4. Let , where the expectation is w.r.t. drawn from with support over some vector space and . Draw samples i.i.d. from and construct the sequence of losses . Run any online learning algorithm over the losses , to construct the sequence of predictions . Then, with probability at least , it holds that

*Proof:* Define . We claim that is a martingale. In fact, we have

where we used the fact that depends only on . Hence, we have

that proves our claim.

Hence, using Theorem 3, we have

This implies that, with probability at least , we have

or equivalently

We now use the definition of regret w.r.t. any , to have

The last step is to upper bound with high probability with . This is easier than the previous upper bound because is a fixed vector, so are i.i.d. random variables, so for sure forms a martingale. So, reasoning as above, we have that with probability at least it holds that

Putting all together and using the union bound, we have the stated bound.

The theorem above upper bounds the average risk of the predictors, while we are interested in producing a single predictor. If the risk is a convex function and is convex, then we can lower bound the l.h.s. of the inequalities in the theorem with the risk evaluated on the average of the . That is

If the risk is not a convex function, we need a way to generate a single solution with small risk. One possibility is to construct a *stochastic classifier* that samples one of the with uniform probability and predicts with it. For this classifier, we immediately have

where the expectation in the definition of the risk of the stochastic classifier is also with respect to the random index. Yet another way is to select, among the predictors, the one with the smallest risk. This works because the average is lower bounded by the minimum. This is easily achieved using samples for the online learning procedure and samples to generate a validation set to evaluate the solutions and pick the best one. The following Theorem shows that selecting the predictor with the smallest empirical risk on a validation set gives us a predictor close to the best one with high probability.

Theorem 5.We have a finite set of predictors and a dataset of samples drawn i.i.d. from . Denote by . Then, with probability at least , we have

*Proof:* We want to calculate the probability that the hypothesis that minimizes the validation error is far from the best hypothesis in the set. We cannot do it directly because we don’t have the required independence to use a concentration inequality. Instead, *we will upper bound the probability that there exists at least one function whose empirical risk is far from the risk.* So, we have

Hence, with probability at least , we have that for all

We are now able to upper bound the risk of , just using the fact that the above applies to too. So, we have

where in the last inequality we used the fact that minimizes the empirical risk.

Using this theorem, we can use samples for the training and samples for the validation. Denoting by the predictor with the best empirical risk on the validation set among the generated during the online procedure, we have with probability at least that
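The split procedure just described can be sketched as follows, under illustrative assumptions of my own: square loss with linear predictors, plain online gradient descent as the no-regret algorithm, and an even train/validation split:

```python
import numpy as np

rng = np.random.default_rng(0)

w_star = np.array([0.5, -1.0])   # illustrative "true" parameter

def sample(n):                   # i.i.d. samples from an assumed distribution
    X = rng.normal(size=(n, 2))
    y = X @ w_star + 0.1 * rng.normal(size=n)
    return X, y

def risk_hat(w, X, y):           # empirical risk with the square loss
    return np.mean((X @ w - y) ** 2)

# Phase 1: run online gradient descent on the training half, keeping all iterates.
X_tr, y_tr = sample(200)
w = np.zeros(2)
iterates = []
for t, (x, label) in enumerate(zip(X_tr, y_tr), start=1):
    iterates.append(w.copy())
    grad = 2 * (w @ x - label) * x          # gradient of the square loss
    w = w - (0.1 / np.sqrt(t)) * grad

# Phase 2: among the iterates, pick the one with the smallest validation risk.
X_val, y_val = sample(200)
best = min(iterates, key=lambda u: risk_hat(u, X_val, y_val))
print(best)
```

Phase 2 is exactly the finite-class selection of Theorem 5, applied to the predictors generated in Phase 1.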

It is important to note that with any of the above three methods to select one among the generated by the online learning procedure, the sample complexity guarantee we get matches the one we would have obtained by ERM, up to polylogarithmic factors. In other words, there is nothing special about ERM compared to the online learning approach to statistical learning.

Another important point is that the above guarantee does not imply the existence of online learning algorithms with sublinear regret for any learning problem. It just says that, if it exists, it can be used in the statistical setting too.

**4. History Bits **

Theorem 4 is from (Cesa-Bianchi, Conconi, and Gentile, 2004). Theorem 5 is nothing else than the Agnostic PAC learning guarantee of ERM for hypothesis classes with finite cardinality. (Cesa-Bianchi, Conconi, and Gentile, 2004) also gives an alternative procedure to select a single hypothesis among the ones generated during the online procedure that does not require splitting the data into training and validation sets. However, the obtained guarantee matches the one we have proved.

* You can find all the lectures I published here.*

In the last lecture, we introduced the Explore-Then-Commit (ETC) algorithm that solves the stochastic bandit problem, but requires the knowledge of the *gaps*. This time we will introduce a parameter-free strategy that achieves the same optimal regret guarantee.

**1. Upper Confidence Bound Algorithm **

The ETC algorithm has the disadvantage of requiring knowledge of the gaps to tune the exploration phase. Moreover, it solves the exploration vs. exploitation trade-off in a clunky way. It would be better to have an algorithm that smoothly transitions from one phase to the other *in a data-dependent way*. So, we now describe an optimal and adaptive strategy called the Upper Confidence Bound (UCB) algorithm. It employs the principle of *optimism in the face of uncertainty* to select in each round the arm that has the *potential to be the best one*.

UCB works by keeping an estimate of the expected loss of each arm and a confidence interval at a certain probability level. Roughly speaking, we have that with probability at least

where the “roughly” comes from the fact that is a random variable itself. Then, UCB will query the arm with the smallest lower bound, that is the one that could potentially have the smallest expected loss.

Remark 1.The name Upper Confidence Bound comes from the fact that traditionally stochastic bandits are defined over rewards, rather than losses. So, in our case we actually use the lower confidence bound in the algorithm. However, to avoid confusion with the literature, we still call it Upper Confidence Bound algorithm.

The key points in the proof are on how to choose the right confidence level and how to get around the dependency issues.

The algorithm is summarized in Algorithm 1 and we can prove the following regret bound.
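Algorithm 1 is not reproduced in this text-only version, so here is a minimal sketch of UCB in loss form, pulling in each round the arm with the smallest lower confidence bound. The Gaussian losses, the means, the horizon, and the exact constant inside the logarithm of the confidence width are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-subgaussian bandit: Gaussian losses with these (assumed) means.
means = np.array([0.0, 0.5, 1.0])
K, T = len(means), 5000

counts = np.zeros(K)   # number of pulls per arm
sums = np.zeros(K)     # cumulative observed loss per arm

for t in range(1, T + 1):
    if t <= K:
        arm = t - 1    # pull each arm once to initialize the estimates
    else:
        mu_hat = sums / counts
        width = np.sqrt(4 * np.log(t) / counts)  # assumed confidence tuning
        arm = int(np.argmin(mu_hat - width))     # optimism: lower confidence bound
    loss = means[arm] + rng.normal()
    counts[arm] += 1
    sums[arm] += loss

pseudo_regret = counts @ (means - means.min())
print(counts, pseudo_regret)
```

Note how the confidence width grows logarithmically in t for arms that are not pulled: this is exactly the mechanism, discussed below, that prevents the algorithm from discarding the optimal arm prematurely.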

Theorem 1. Assume that the rewards of the arms are -subgaussian and , and let . Then, UCB guarantees a regret of

*Proof:* We analyze one arm at a time. Also, without loss of generality, assume that the optimal arm is the first one. For arm , we want to prove that .

The proof is based on the fact that once I have sampled an arm enough times, the probability to take a suboptimal arm is small.

Let be the largest time index such that . If , then the statement above is true. Hence, we can safely assume . Now, for bigger than , we have

Consider and such that , then we claim that at least one of the two following equations must be true:

If the first one is true, the confidence interval around our estimate of the expectation of the optimal arm does not contain . On the other hand, if the second one is true, the confidence interval around our estimate of the expectation of arm does not contain . So, we claim that if and we selected a suboptimal arm, then at least one of these two bad events happened.

Let’s prove the claim: *if both the inequalities above are false*, , and , we have

that, by the selection strategy of the algorithm, would imply .

Note that . Hence, we have

Now, we upper bound the probabilities in the sum. Given that the losses on the arms are i.i.d. and using the union bound, we have

Hence, we have

Given that the same bound holds for , we have

Using the decomposition of the regret we proved last time, , we have the stated bound.

It is instructive to observe an actual run of the algorithm. I considered 5 arms and Gaussian losses. In the left plot of the figure below, I plotted how the estimates and confidence intervals of UCB vary over time (in blue), compared to the actual true means (in black). On the right side, you can see the number of times each arm was pulled by the algorithm.

It is interesting to note that the logarithmic factor in the confidence term makes the confidence intervals of the arms that are not pulled *increase* over time. In turn, this assures that the algorithm does not miss the optimal arm, even if the estimates were off. Also, the algorithm keeps pulling the two arms that are close together, to be sure about which one is the best of the two.

The bound above can become meaningless if the gaps are too small. So, here we prove another bound that does not depend on the inverse of the gaps.

Theorem 2.Assume that the rewards of the arms minus their expectations are -subgaussian and let . Then, UCB guarantees a regret of

*Proof:* Let be some value to be tuned subsequently and recall from the proof of Theorem 1 that for each suboptimal arm we can bound

Hence, using the regret decomposition we proved last time, we have

Choosing , we have the stated bound.

Remark 2. Note that while the UCB algorithm is considered parameter-free, we still have to know the subgaussianity constant of the arms. While this can be easily upper bounded for stochastic arms with bounded support, it is unclear how to do it without any prior knowledge of the distribution of the arms.

It is possible to prove that the UCB algorithm is asymptotically optimal, in the sense of the following Theorem.

Theorem 3 (Bubeck and Cesa-Bianchi, 2012, Theorem 2.2). Consider a strategy that satisfies for any set of Bernoulli reward distributions, any arm with , and any . Then, for any set of Bernoulli reward distributions, the following holds

**2. History Bits **

The use of confidence bounds and the idea of optimism first appeared in the work of (Lai and Robbins, 1985). The first version of UCB is by (Lai, 1987). The version of UCB I presented is by (Auer, Cesa-Bianchi, and Fischer, 2002) under the name UCB1. Note that, rather than considering 1-subgaussian environments, (Auer, Cesa-Bianchi, and Fischer, 2002) considers bandits where the rewards are confined to the interval. The proof of Theorem 1 is a minor variation of the one of Theorem 2.1 in (Bubeck and Cesa-Bianchi, 2012), which also popularized the subgaussian setup. Theorem 2 is from (Bubeck and Cesa-Bianchi, 2012).

**3. Exercises **


Exercise 1.Prove a similar regret bound to the one in Theorem 2 for an optimally tuned Explore-Then-Commit algorithm.

* You can find the lectures I published till now here.*

Today, we will consider the *stochastic bandit* setting. Here, each arm is associated with an unknown probability distribution. At each time step, the algorithm selects one arm and receives a loss (or reward) drawn i.i.d. from the distribution of the arm . We focus on minimizing the *pseudo-regret*, that is, the regret with respect to the optimal action in expectation, rather than the optimal action on the sequence of realized losses:

where we denoted by the expectation of the distribution associated with the arm .

Remark 1. The usual notation in the stochastic bandit literature is to consider rewards instead of losses. Instead, to keep our notation coherent with the OCO literature, we will consider losses. The two settings are completely equivalent up to a multiplication by .

Before presenting our first algorithm for stochastic bandits, we will introduce some basic notions on concentration inequalities that will be useful in our definitions and proofs.

**1. Concentration Inequalities Bits **

Suppose that is a sequence of independent and identically distributed random variables with mean and variance . Having observed , we would like to estimate the common mean . The most natural estimator is the *empirical mean*

Linearity of expectation shows that , which means that is an *unbiased estimator* of . Yet, is a random variable itself. So, can we quantify how far will be from ?

We could use Chebyshev’s inequality to upper bound the probability that is far from :

Using the fact that , we have that

So, we can expect the probability of having a “bad” estimate to go to zero as one over the number of samples in our empirical mean. Is this the best we can get? To understand what we can hope for, let’s take a look at the central limit theorem.

We know that, defining , , the standard Gaussian distribution, as goes to infinity. This means that

where the approximation comes from the central limit theorem. The integral cannot be calculated in closed form, but we can easily upper bound it. Indeed, for , we have

This is better than what we got with Chebyshev’s inequality and we would like to obtain an exact bound with a similar asymptotic rate. To do that, we will focus our attention on *subgaussian* random variables.

Definition 1. We say that a random variable is -subgaussian if for all we have that .

Example 1. The following random variables are subgaussian:

- If is Gaussian with mean zero and variance , then is -subgaussian.
- If has mean zero and almost surely, then is -subgaussian.

We have the following properties for subgaussian random variables.

Lemma 2 (Lattimore and Szepesvári, 2018, Lemma 5.4). Assume that and are independent and -subgaussian and -subgaussian, respectively. Then,

- = 0 and .
- is -subgaussian.
- is -subgaussian.

Subgaussian random variables behave like Gaussian random variables, in the sense that their tail probabilities are upper bounded by those of a Gaussian with variance . To prove it, let’s first state Markov’s inequality.

Theorem 3 (Markov’s inequality). For a non-negative random variable and , we have that .

With Markov’s inequality, we can now formalize the above statement on subgaussian random variables.

*Proof:* For any , we have

Minimizing the right hand side of the inequality w.r.t. , we have the stated result.

An easy consequence of the above theorem is that the empirical average of subgaussian random variables concentrates around its expectation, *with the same asymptotic rate as in (1)*.

Corollary 5. Assume that are independent, -subgaussian random variables. Then, for any , we have

where .

Equating the upper bounds on the r.h.s. of the inequalities in the Corollary to , we have the equivalent statement that, with probability at least , we have
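We can check this confidence interval numerically. The sketch below uses Rademacher random variables, which are 1-subgaussian since they are bounded in [-1, 1], and verifies that the interval of radius sqrt(2 σ² log(2/δ)/n) around the empirical mean covers the true mean in well more than a 1-δ fraction of repetitions; the sample size and the number of repetitions are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

n, trials, delta = 50, 10000, 0.05
sigma = 1.0  # Rademacher variables are 1-subgaussian

# Empirical means of n i.i.d. Rademacher samples, repeated many times.
samples = rng.choice([-1.0, 1.0], size=(trials, n))
mu_hat = samples.mean(axis=1)

# Radius from the corollary: with probability >= 1 - delta, |mu_hat - mu| <= radius.
radius = np.sqrt(2 * sigma**2 * np.log(2 / delta) / n)
coverage = np.mean(np.abs(mu_hat) <= radius)  # the true mean is 0
print(radius, coverage)
```

The observed coverage typically exceeds 1 - δ by a wide margin, consistent with the bound being a worst-case guarantee over all 1-subgaussian distributions.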

**2. Explore-Then-Commit Algorithm **

We are now ready to present the most natural algorithm for the stochastic bandit setting, called the Explore-Then-Commit (ETC) algorithm: we first identify the best arm over exploration rounds and then commit to it. This algorithm is summarized in Algorithm 2.
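Algorithm 2 is not reproduced in this text-only version; the following is a minimal sketch of ETC with Gaussian losses, where the means, the horizon T, and the exploration length m per arm are illustrative choices of mine:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stochastic bandit: Gaussian losses with these (assumed) means.
means = np.array([0.0, 0.3, 0.6])
K, T, m = len(means), 3000, 200   # m = exploration pulls per arm

# Exploration phase: pull each arm m times and estimate its expected loss.
explore = means[None, :] + rng.normal(size=(m, K))
mu_hat = explore.mean(axis=0)

# Commit phase: play the arm with the smallest estimated loss until the end.
best = int(np.argmin(mu_hat))
counts = np.full(K, m)
counts[best] += T - m * K

pseudo_regret = counts @ (means - means.min())
print(best, pseudo_regret)
```

The trade-off analyzed below is visible here: a larger m makes committing to the wrong arm less likely, but increases the cost of the exploration phase.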

In the following, we will denote by the number of times that arm was pulled in the first rounds.

Denote by the expected loss of the arm with the smallest expectation, that is, . Critical quantities in our analysis will be the *gaps* , for , which measure the expected difference in loss between each arm and the optimal one. In particular, we can decompose the regret as a sum over the arms of the expected number of times we pull each arm multiplied by its gap.

Lemma 6. For any policy of selection of the arms, the regret is upper bounded by

*Proof:* Observe that

Hence,

The above Lemma quantifies the intuition that, in order to have a small regret, we have to select the suboptimal arms less often than the best one.

We are now ready to prove the regret guarantee of the ETC algorithm.

Theorem 7. Assume that the losses of the arms minus their expectations are -subgaussian and . Then, ETC guarantees a regret of

*Proof:* Let’s assume without loss of generality that the optimal arm is the first one.

So, for , we have

From Lemma 2, we have that is -subgaussian. So, from Theorem 4, we have

The bound shows the trade-off between exploration and exploitation: if is too big, we pay too much during the exploration phase (first term in the bound). On the other hand, if is small, the probability to select a suboptimal arm increases (second term in the bound). Knowing all the gaps , it is possible to choose that minimizes the bound.

For example, in the case that , the regret is upper bounded by

that is minimized by

Remembering that must be a natural number, we can choose

When , we select . So, we have . Hence, the regret is upper bounded by

The main drawback of this algorithm is that its optimal tuning depends on the gaps . Assuming knowledge of the gaps amounts to making the stochastic bandit problem almost trivial. However, the tuned regret bound gives us a baseline against which to compare other bandit algorithms. In particular, in the next lecture we will present an algorithm that achieves the same asymptotic regret without any knowledge of the gaps.

**3. History Bits **

The ETC algorithm goes back to (Robbins, 1952), even if Robbins proposed what is now called epoch-greedy (Langford and Zhang, 2008). For more history on ETC, take a look at chapter 6 in (Lattimore and Szepesvári, 2018). The proofs presented here are from (Lattimore and Szepesvári, 2018) as well.

* You can find all the lectures I published here.*

Last time, we saw that for Online Mirror Descent (OMD) with an entropic regularizer and learning rate it might be possible to get the regret guarantee

where . This time we will see how, and we will use this guarantee to prove an almost optimal regret bound for Exp3, given in Algorithm 1.

Remark 1. While it is possible to prove (1) from first principles using the specific properties of the entropic regularizer, such a proof would not shed any light on what is actually going on. So, in the following we will instead prove this regret bound in a very general way. Indeed, this general proof will allow us to easily prove the optimal bound for multi-armed bandits using OMD with the Tsallis entropy as regularizer.

Now, for a generic , consider the OMD algorithm that produces the predictions in two steps:

- Set such that .
- Set .

As we showed, under weak conditions, these two steps are equivalent to the usual OMD single-step update.

Now, the idea is to consider an alternative analysis of OMD that explicitly depends on , the new prediction before the Bregman projection step. First, let’s state the Generalized Pythagorean Theorem for Bregman divergences.

Lemma 1. Let and define . Then, for all .

*Proof:* From the first order optimality condition of we have that . Hence, we have

The Generalized Pythagorean Theorem is often used to prove that the Bregman divergence between any point in and an arbitrary point decreases when we take the Bregman projection of onto .

We are now ready to prove our regret guarantee.

Lemma 2. For the two-step OMD update above, the following regret bound holds:

where and .

*Proof:* From the update rule, we have that

where we used the 3-points equality for Bregman divergences in the second equality and the Generalized Pythagorean Theorem in the first inequality. Hence, summing over time, we have

So, as we did in the previous lecture, we have

where and .

Putting all together, we have the stated bound.

This time it might be easier to get a handle on . Given that we only need an upper bound, we can just take a look at and and see which one is bigger. This is easy to do: using the update rule, we have

that is

Assuming , we have that implies .

Overall, we have the following improved regret guarantee for the Learning with Experts setting with positive losses.

Theorem 3. Assume for and . Let and . Using OMD with the entropic regularizer defined as , learning rate , and gives the following regret guarantee

Armed with this new tool, we can now turn to the multi-armed bandit problem again.

Let’s now consider the OMD with entropic regularizer, learning rate , and set equal to the stochastic estimate of , as in Algorithm 1. Applying Theorem 3 and taking expectation, we have

Now, focusing on the terms , we have

So, setting , we have

Remark 2. The need for a different analysis of OMD is due to the fact that we want an easy way to upper bound the Hessian. Indeed, in this analysis comes before the normalization into a probability distribution, which greatly simplifies the analysis. The same idea will be used for the Tsallis entropy in the next section.

So, with a tighter analysis, we showed that, even without an explicit exploration term, OMD with the entropic regularizer solves the multi-armed bandit problem paying only a factor more than in the full information case. However, this is still not the optimal regret!
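A minimal sketch of this Exp3-style procedure, that is, OMD with the entropic regularizer run on the importance-weighted loss estimates. The Bernoulli losses, their means, the horizon, and the tuning of η of order sqrt(log K/(KT)) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

means = np.array([0.2, 0.5, 0.8])        # illustrative Bernoulli loss means
K, T = len(means), 20000
eta = np.sqrt(2 * np.log(K) / (K * T))   # assumed tuning of order sqrt(log K/(KT))

w = np.ones(K) / K   # prediction over the arms (a point on the simplex)
counts = np.zeros(K)
for _ in range(T):
    arm = int(rng.choice(K, p=w))
    loss = float(rng.random() < means[arm])   # only the pulled arm's loss is seen
    loss_hat = np.zeros(K)
    loss_hat[arm] = loss / w[arm]             # unbiased importance-weighted estimate
    w = w * np.exp(-eta * loss_hat)           # entropic OMD: multiplicative step...
    w = w / w.sum()                           # ...then normalization (the projection)
    counts[arm] += 1

print(counts)
```

Note that there is no explicit exploration term: the multiplicative update followed by normalization is exactly the two-step OMD procedure analyzed above.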

In the next section, we will see that changing the regularizer, *with the same analysis*, will remove the term in the regret.

**1. Optimal Regret Using OMD with Tsallis Entropy **

In this section, we present the Implicitly Normalized Forecaster (INF), also known as OMD with Tsallis entropy, for multi-armed bandits.

Define as , where and in we extend the function by continuity. This is the negative **Tsallis entropy** of the vector . It is a strict generalization of the Shannon entropy: when goes to 1, converges to the negative (Shannon) entropy of .

We will instantiate OMD with this regularizer for the multi-armed problem, as in Algorithm 2.

Note that and .

We will not use any information-theoretic interpretation of this regularizer. As we will see in the following, the only reason to choose it is its Hessian. In fact, the Hessian of this regularizer is still diagonal and equal to

Now, we can use again the modified analysis for OMD in Lemma 2. So, for any , we obtain

where and .

As we did for Exp3, we now need an upper bound on . From the update rule and the definition of , we have

that is

So, if , then , which implies that .

Hence, putting all together, we have

We can now specialize the above reasoning, considering in the Tsallis entropy, to obtain the following theorem.

Theorem 4. Assume . Set and . Then, Algorithm 2 guarantees

*Proof:* We only need to calculate the terms

Proceeding as in (2), we obtain

Choosing , we finally obtain an expected regret of , that can be proved to be the optimal one.

There is one last thing: how do we compute the predictions of this algorithm? In each step, we have to solve a constrained optimization problem. So, we can write the corresponding Lagrangian:

From the KKT conditions, we have

and we also know that . So, we have a 1-dimensional problem in that must be solved in each round.
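For concreteness, here is a hedged sketch of one such update for the case q = 1/2, under the convention ψ(x) = -2 Σ_i sqrt(x_i) (any constant scaling can be absorbed into η, which here is an arbitrary choice). Under this convention, the KKT conditions give x_i = (1/sqrt(x_{t,i}) + η ℓ̂_{t,i} - λ)^{-2}, and the 1-dimensional problem in the multiplier λ can be solved by bisection:

```python
import numpy as np

def tsallis_omd_step(x, loss_hat, eta, iters=100):
    """One OMD step with the q = 1/2 Tsallis regularizer psi(x) = -2 sum sqrt(x_i).

    x must lie in the interior of the simplex. The multiplier lam is found by
    bisection: sum_i (c_i - lam)^{-2} is increasing in lam, it is at most 1 at
    lam = c.min() - sqrt(K), and it blows up as lam approaches c.min().
    """
    c = 1.0 / np.sqrt(x) + eta * loss_hat   # from the KKT conditions
    K = len(x)
    lo, hi = c.min() - np.sqrt(K), c.min() - 1e-9
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        if np.sum((c - lam) ** -2) > 1.0:
            hi = lam
        else:
            lo = lam
    return (c - 0.5 * (lo + hi)) ** -2

# Toy usage: a loss on the first arm shifts probability mass to the other arms.
x = np.ones(3) / 3
x_new = tsallis_omd_step(x, np.array([3.0, 0.0, 0.0]), eta=0.5)
print(x_new, x_new.sum())
```

Bisection is a simple choice here; any 1-dimensional root finder works, since the normalization constraint is monotone in the multiplier.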

**2. History Bits **

The INF algorithm was proposed by (Audibert and Bubeck, 2009) and recast as an OMD procedure in (Audibert, Bubeck, and Lugosi, 2011). The connection with the Tsallis entropy was made in (Abernethy, Lee, and Tewari, 2015). The specific proof presented here is new and builds on the proof by (Abernethy, Lee, and Tewari, 2015). Note that (Abernethy, Lee, and Tewari, 2015) proved the same regret bound for a Follow-The-Regularized-Leader procedure over the stochastic estimates of the losses (which they call Gradient-Based Prediction Algorithm), while here we proved it using an OMD procedure.

**3. Exercises **

Exercise 1. Prove that in the modified proof of OMD, the terms can be upper bounded by .

Exercise 2. Building on the previous exercise, prove that regret bounds of the same order can be obtained for Exp3 and for INF/OMD with Tsallis entropy by directly upper bounding the terms , without passing through the Bregman divergences.
