In this post, we talk about solving saddle-point problems with online convex optimization (OCO) algorithms. Next time, I’ll show the connection to two-person zero-sum games and boosting.

**1. Saddle-Point Problems**

We want to solve the following saddle-point problem

Let’s say from the beginning that we need inf and sup rather than min and max because the minimum or maximum might not exist. Everytime we know for sure the inf/sup are attained, I’ll substitute them with min/max.

While for the minimization of functions is clear what it means to solve it, it might not be immediate to see what is the meaning of “solving” the saddle-point problem in (1). It turns out that the proper notion we are looking for is the one of saddle-point.

Definition 1.Let , , and . A point is a saddle-point of in if

We will now state conditions under which there *exists* a saddle-point that solves (1). First, we need an easy lemma.

Lemma 2.Let is any function from a non-empty product set to . Then,

*Proof:* For any and we have that . This implies that for all that gives the stated inequality.

We can now state the following theorem.

Theorem 3.Let any function from a non-empty product set to . A point is a saddle-point of if and only if the supremum inis attained at , the infimum in

is attained at , and these two expressions are equal.

*Proof:* If is a saddle-point, then we have

From Lemma 2, we have that these quantities must be equal, so that the three conditions in the theorem are satisfied.

For the other direction, if the conditions are satisfied, then we have

Hence, is a saddle-point.

Remark 1.The above theorem tells a surprising thing: If a saddle-point exists, then there might be multiple ones, and all of them must have the same minimax value. This might seem surprising, but it is due to the fact that the definition of saddle-point is a global and not a local property. Moreover, if the and problem have different values, no saddle-point exists.

Remark 2.Consider the case that the value of the and are different. If your favorite optimization procedure to find the solution of a saddle-point problem does not distinguish between a and formulation, then it cannot be correct!

Let’s show a couple of examples that show that the conditions above are indeed necessary.

Example 1.Let , , and . Then, we havewhile

Indeed, from Figure 1 we can see that there is no saddle-point.

Example 2.Let , , and . Then, we haveand

Here, even if inf sup is equal to sup inf, the saddle-point does not exist because the inf in the first expression is not attained in a point of .

This theorem also tells us that in order to find a saddle-point of a function , we need to find the minimizer in of and the maximizer in of . Let’s now use this knowledge to design a proper measure of progress towards the saddle-point.

We might be tempted to use as a measure of suboptimality of with respect to the saddle-point . Unfortunately, this quantity can be negative or equal to zero for an infinite number of points that are not saddle-points. We might then think to use some notion of distance to the saddle-point, like , but this quantity in general can go to zero at an arbitrarily slow rate. To see why consider the case that , so that the saddle-point problem reduces to minimize a convex function. So, assuming only convexity, the rate of convergence to a minimizer of can be arbitrarily slow. Hence, we need something different.

Observe that the Theorem 3 says one of the problems we should solve is

where . In this view, the problem looks like a standard convex optimization problem, where the objective function has a particular structure. Moreover, in this view we only focus on the variables . The standard measure of convergence in this case for a point , the *suboptimality gap*, can be written as

We also have to find the maximizer in of , hence we have

where . This also implies another measure of convergence in which we focus only on the variable :

Finally, in case we are interested in studying the quality of a joint solution , a natural measure is a sum of the two measures above:

where we assumed the existence of a saddle-point to say that from Theorem 3. This measure is called *duality gap*. From the definition it should be clear that the duality gap is always non-negative. Let’s add even more intuition of the duality gap definition, using the one of -saddle-point.

Definition 4.Let , , and . A point is a -saddle-point of in if

This definition is useful because we cannot expect to numerically calculate a saddle-point with infinite precision, but we can be find something that satisfies the saddle-point definition up to a . Obviously, any saddle-point is also an -saddle-point.

Now, the notion of -saddle-point is equivalent up to a multiplicative constant to the duality gap. In fact, it is easy to see that if is an -saddle-point then its duality gap of . In turn, a duality gap of and the existence of a saddle-point imply that the point is a -saddle.

The above reasoning told us that finding the saddle-point of the function is equivalent to solving a maximization problem and a minimization problem. However, is it always the case that the saddle-point exists? So, let’s now move to easily checkable sufficient conditions for the existence of saddle-point. For this, we can state the following theorem.

Theorem 5.Let , be compact convex subsets of and respectively. Let a continuous function, convex in its first argument, and concave in its second. Then, we have that

This theorem gives us sufficient conditions to have the min-max problem equal to the max-min one. So, for example, thanks to the Weierstrass theorem, the assumptions in Theorem 5, in light of Theorem 3, are sufficient conditions for the existence of a saddle-point.

We defer the proof of this theorem for a bit and we now turn ways to solve the saddle-point problem in (1).

**2. Solving Saddle-Point Problems with OCO **

Let’s show how to use Online Convex Optimization (OCO) algorithms to solve saddle-point problems.

Suppose to use an OCO algorithm fed with losses that produces the iterates and another OCO algorithm fed with losses that produces the iterates . Then, we can state the following Theorem.

Theorem 6.Let be convex in the first argument and concave in the second. Then, with the notation in Algorithm 2, for any , we havewhere .

For any , we have

where .

Also, if is continuous and , compact, we have

for any and .

*Proof:* The first two equalities are obtained simply observing that and .

For the inequality, using Jensen’s inequality, we obtain

Summing the first two equalities, using the above inequality, and taking and , we get the stated inequality.

From this theorem, we can immediately prove the following corollary.

Corollary 7.Let be continuous, convex in the first argument, and concave in the second. Assume that and are compact and that the -player and -player use no-regret algorithms, possibly different, than

Example 3.Consider the saddle-point problemThe saddle-point of this problem is . We can find it using, for example, Projected Online Gradient Descent. So, setting , we have the iterations

According to Theorem 6, the duality gap in converges to 0.

Surprisingly, we can even prove a (simpler version of the) minimax theorem from the above result! In particular, we will use the additional assumption that there exist OCO algorithms that minimize and have sublinear regret.

*Proof with OCO assumption:* From Lemma 2, we have one inequality. Hence, we now have to prove the other inequality.

We will use a constructive proof. Let’s use Algorithm 2 and Theorem 6. For the first player, for any we have

Observe that

Hence, using an OCO algorithm that has regret for each , we have

In the same way, we have

Summing the two inequalities, taking , and using the sublinear regret assumption, we have

** 2.1. Variations with Best Response and Alternation **

In some cases, it is easy to compute the max with respect to of for a given . For example, this is trivial for bilinear games over the simplex. In these cases, we can remove the second learner and just use its *best response* in each round. Note that in this way we are making one of the two players “stronger” through the knowledge of the loss of the next round. However, this is perfectly fine: The proof in Theorem 6 is still perfectly valid.

In this case, the -player has an easy life: it knows the loss before making the prediction, hence it can just output the minimizer of the loss in . Hence, we also have that the regret of the -player will be non-positive and it will not show up in Theorem 6. Putting all together, we can state the following corollary.

Corollary 8.Let be continuous, convex in the first argument, and concave in the second. Assume compact and that the argmax of the -player is never empty. Then, with the notation in Algorithm 2, we have

for any and where .

This alternative seems interesting from a theoretical point of view because it allows to avoid the complexity of learning in the space, for example removing the dependency on its dimension. Of course, an analogous result can be stated using best-response for the -player and an OCO algorithm for the -player.

There is a third variant, very used in empirical implementations, especially of Counterfactual Regret Minimization (CFR) (Zinkevich et al., 2007). It is called *alternation* and it breaks the simultaneous reveal of the actions of the two players. Instead, we use the updated prediction of the first player to construct the loss of the second player. Empirically, this variant seems to greatly speed-up the convergence of the duality gap.

For this version, Theorem 6 does not hold anymore because the terms and are now different, however we can prove a similar guarantee.

Theorem 9.Let continuous, convex in the first argument, and concave in the second. Assume that and are compact. Then, with the notation in Algorithm 4, for any and any we have

where and .

*Proof:* Note that , and . By Jensen’s inequality, we have

Taking and , we get the stated result.

Remark 3.In the case that is linear in the first argument, using OMD for the -player we have that . Hence, in this case the additional term in Theorem 9 is negative, showing a (marginal) improvement to the convergence rate.

Next post, we will show how to connect saddle-point problems with Game Theory.

**3. History Bits **

Theorem 3 is (Rockafellar, 1970, Lemma 36.2).

The proof of Theorem 6 is from (Liu and Orabona, 2021) and it is a minor variation on the one in (Abernethy and Wang, 2017): (Liu and Orabona, 2021) stressed the dependency of the regret on a competitor that can be useful for refined bounds, as we will show next time for Boosting. It is my understanding that different variant of this theorem are known in the game theory community as “Folk Theorems”, because such result was widely known among game theorists in the 1950s, even though no one had published it.

The celebrated minimax theorem for zero-sum two-person games was first discovered by John von Neumann in the 1920s (von Neumann, 1928)(von Neumann and Morgenstern, 1944). The version is state here is a simplification of the generalization due to (Sion, 1958). The proof here is from (Abernethy and Wang, 2017). A similar proof is in (Cesa-Bianchi and Lugosi, 2006) based on a discretization of the space that in turn is based on the one in (Freund, Y. and Schapire, 1996)(Freund, Y. and Schapire, 1999).

Algorithms 2 and 3 are a generalization of the algorithm for boosting in (Freund and Schapire, 1996)(Freund and Schapire, 1999). Algorithm 2 was also used in (Abernethy and Wang, 2017) to recover variants of the Frank-Wolfe algorithm (Frank and Wolfe, 1956).

It is not clear who invented alternation. Christian Kroer told me that it was a known trick used in implementations of CFR for the computer poker competition from 2010 or so. Note that in CFR the method of choice is Regret Matching (Hart and Mas-Colell, 2000). However, (Kroer, 2020) empirically shows that alternation improves a lot even OGD for solving bilinear games. (Tammelin et al., 2015) explicitly include this trick in their implementation of an improved version of CFR called CFR+, claiming that it would still guarantee convergence. However, (Farina et al., 2019) pointed out that averaging of the iterates in alternation might not produce a solution to the min-max problem, providing a counterexample. Theorem 9 is from (Burch et al., 2019) (see also (Kroer, 2020)).

There is also a complementary view on alternation: (Zhang et al., 2021) link alternating updates to Gauss-Seidel methods in numerical linear algebra, in contrast to the simultaneous updates of the Jacobi method. Also, they provide a good review of the optimization literature on the advantages of alternation, but this paper and the papers they cite do not seem to be aware of the use of alternation in CFR.

** Acknowledgements **

Thanks to Christian Kroer for the history of alternation in CFR, and to Gergely Neu for the references on alternation in the optimization community. Thanks to NicolÃ² Campolongo for hunting down my typos. Also, thanks to Christian Kroer for his great lecture notes that helped me getting started on this topic.

]]>First of all, two months ago I got the official announcement that I got tenure. This is the end of a journey that started in 2003 with a PhD in computer vision for humanoid robotics at the University of Genova (Italy). This was an incredibly stressful journey for a number of reasons: Switching research topic after my PhD, living in Milan with the salary of a post-doc, moving to the US with a PhD from an Italian university, not having a famous American professor as mentor/letter writer, etc.

In “Mostly Harmless” by Douglas Adams, the protagonist Arthur Dent visits a woman seer to receive advice. The woman, who swats at flies in front of a cave and smells horrible, hands photocopies of her life to him suggesting he should live his life *the opposite way* she did so he will not end up living in a rancid cave. I should do the same for my academic life…

However, I cannot complain too much: I did get tenure, I contributed 2 or 3 things that I am proud of, and I do not live in a rancid cave! So, now I am going to use my new tenure powers to explain why I think the expert reviewer is a myth.

**2. The Fairytale of the Expert Reviewer **

There is a common myth in the theory ML community that the main problem with reviews at conferences is that reviewers are not really experts.

First, I think it is hardly controversial that nowadays most of the reviewers do not have the experience and knowledge to judge scientific papers: reviewers at big ML conferences are for the majority young and inexperienced PhD students. Also, for the Dunning-Kruger effect, they vastly overestimate what they actually know. The final effect is that the average review tend to be a mix of arrogant meaningless semi-random critiques. Let me be clear: PhD students from 10 years ago were equally terrible as reviewers. For example, I cringe thinking at the reviews I used to write as a PhD student. However, my terrible inexperienced reviews were also for third-tier conferences. Nowadays, instead, the exponential growth of ML submissions means that we have to enroll *all* the possible reviewers at first-tier venues.

However, I do not want to focus on the young PhD students: Mainly, it is not their fault. I understand that the growth of some community made the reviewing process basically random and nobody has a real solution to it, me neither. Instead, here I want to talk about the **Expert Reviewer**.

You should know that the theory Machine Learning community is convinced to have a better reviewing system. The reason, they say, is because they have the Expert Reviewers. This mythological figure knows all the results in his/her area, read all the Russian papers from 50 years ago, and thanks to his/her infinite wisdom only accepts the best and finest products of the human intellect. He/she can also check the correctness of 40 pages of appendix purely laying on hands and can kill any living beings just pronouncing the magic words “this is not surprising”.

Now, the fact that not even in Advanced Dungeons & Dragons you can have a Mage/Paladin with Wisdom and Intelligence to 19, it should make you suspect this is not exactly the reality…

**3. The Reality **

Now, let me tell you what actually happens, *most of the time*. The Expert Reviewer typically is someone obsessed with a particular area whose volume tends to 0, but very often with close to 0 knowledge of anything outside of it. You can visualize him/her as a Delta function.

If you work on a less fashionable area, the probability to randomly meet an Expert Reviewer for your specific sub-topic is close to 0. Moreover, He/She will often refuse to think that anything outside of his/her area is actually of any interest. Finally, for a weird bug in the symmetric property, the Expert Reviewer is *always* sure to know more than the authors of the paper he/she reviews.

The net effect is that

- most of the time the authors are actually
*the only real experts*; - papers are seldom accepted or rejected based on a deep understanding of what is going on;
- most of the time the most important decision factor is
*how much the reviewer likes your topic and/or you*.

Sometimes it does happen that the review process actually work as intended, but I would argue that the above is what happens at least 60% of the times an Expert Reviewer is involved.

Now, I could argue forever on these 3 points above. Instead, thanks to my tenure, I decided to do something that is taboo in academia: I’ll describe the review process of a real paper!

The paper is here and it was accepted at COLT 2017 only thanks to me.

Let me start from the beginning.

**4. The Story of the Unlucky Paper **

First of all, for a long time COLT had a reviewer/subreviewer system: Each PC member was allocated a number of papers to review and he/she could decide if review all the paper by him/herself or send it to subreviewers. The subreviewers were not taken from a fixed list of reviewers, but they had to be contacted personally: As you can imagine, it was not an easy task. How many times you accept review requests from random people? Exactly! On the other hand, this gave the possibility to PC members to really select the best reviewer for each paper, even going outside of the usual clique of people.

So, I was a PC member and I assigned that paper to an expert of regression in Reproducing Kernel Hilbert Spaces, because the paper was clearly an extension of the seminal work on 1/T rate of SGD without strong convexity, that in turn was based on the line of work developed by Rosasco, Caponnetto, De Vito, Smale, Zhou, etc. Now, for a number of reasons, this specific line of work is unknown to 99% of the theory ML people. I happen to know it because I published in this subfield after Lorenzo Rosasco showed to me its beauty. This is to say that I was pretty confident about my understanding of the paper and my choice of the subreviewer.

The other two PC members assigned to the paper were online learning people. This requires some explanation: for long time at COLT online learning people were the only people with *some* background in convex optimization. So, any optimization paper was going to them, by default. If you know classic online learning theory and convex optimization, you should see why sometimes this can be a terrible idea. (This situation improved a bit in the latest years, but not by much.) Anyway, the reviews are in and one PC personally destroys the paper, clearly not understanding it. The other one sent it to an OR person, that also did not understand the paper at all. My subreviewer instead firmly accepted the paper.

Now, it is necessary to open a parenthesis. There is something that young reviewers might not understand: Most of the time, it is adamant when a reviewer did not understand a paper. It is painfully clear for the authors, but it is also very clear for an experienced Area Chair/PC/Chair in charge of the paper. So, luckily the Chair decided to intervene and asked for the opinion of an external (famous but oblivious to the area of the paper) reviewer. Let me say that this is also not common: It heavily depends on the Chairs and the load they have. Quite understandably, they often do not have the time to intervene on single papers. As expected, the fourth review also did not understand the paper and rejected it… At this point, I started a long discussion in which I successfully refuted each single point of the fourth external reviewer, till he/she concedes that the paper was borderline because he/she was still not *excited* about it.

Let me pause here to explain you another thing: Expert Reviewers are humans and humans are rarely rational. So, one of the *main* ways they have to judge a paper is “Do I like this topic?” that often translates to “Do I work on this topic? No, I don’t, because this is clearly a bad topic!”. So, the external reviewer decided that the paper was not interesting because he/she thought the entire area was not interesting. End of the story.

Let me stress here that the problem is not how well a paper is written. In fact, all the involved reviewers understood the paper and its claims. However, deciding if these results were warrant or not acceptance it is just a matter of *taste* of different mostly orthogonal sub-communities, like the pineapple-on-pizza community and the never-pineapple-on-pizza one.

By this stage, the paper was doomed: two reviewers against me and my subreviewer, and one reviewer completely silent. At this stage, any further discussion was also counter-productive because the Expert Reviewer sees any attack to his/her argument as an attack to his/her auctoritas.

So, I had an idea: I stopped sending messages to the other reviewers and wrote directly to the COLT Chairs. I plainly explained that none of these people but my subreviewer and I actually worked in this area. The Chair was not convinced, so I had to propose another PC member who i) actually understood the area, and ii) was famous enough to have a weight on the decision. (The Chair did not mention the second point, I read between the lines…) I proposed a fifth reviewer and the Chair contacted him/her.

After a few days, the fifth reviewer came back and the first sentence of the message was:

I’m a bit surprised: the improvement over existing work is pretty clear!!

The Chair was convinced, the paper accepted.

It took only 2 weeks of discussions, 7 Expert Reviewers, and 1 Chair.

Easy-peasy.

**5. Epilogue **

Was it worth it? In my personal opinion, definitely yes.

What did I get in return? Absolutely nothing, zero, nada, zip, zilch.

So, trust me when I tell you that this is not what normally happens. And it does not matter how many Experts Reviewers you have: The problem is that the probability to get someone that *really* understands your subtopic is very low, even when you submit to a prestigious theory conference. Even assuming you got reviewers that actually understand your paper, you have to be really lucky to avoid your *Nemesis* (that rejects all your papers just because he/she does not like you and you are not even sure why), the *Egomaniac* (that rejects anything vaguely similar to what he/she does, because nothing compares to what he/she does), and the *Purist* (that rejects anything that might actually work in practice). All the above are things that actually happened, but not even tenure makes me so brave to describe these episodes. But just to give you some fun facts, recently a Chair of a prestigious conference told me that I indeed might have “enemies”. He/she also plainly told me that I should declare the people I *suspect* are my enemies as conflicts (never mind that almost none of the conferences have a system in place for “negative” conflicts…).

In reality, in my years of experience (yes, I am old) I very rarely saw the reviewing system working as it should. Most of the time, in order to get a meaningful decision on a paper, you have to work hard, so hard that people might end up deciding that it is not worth it. I myself have less and less strength and patience to fight many of these battles. I did not gain anything in any of them, probably only more enemies.

An entire system of checks and balances is badly needed in the conference reviews, much more than just amassing expert reviewers. Indeed, you also need somebody that allocates them properly, that checks that they do their job, that prevents them from acting like jerks, that keep them open to discussions, that makes them plainly admit that after all they might not have understood the paper, that helps them admit that they might not know the area, that (God forbid!) prevents them from rejecting papers just because they don’t like them, etc.

However, the main problem is why should anyone waste so much time on reviewing/discussing/analyzing papers of *other* people? More importantly, how exactly the community gives feedback for these poor reviews? How are we teaching people to simply say “Actually, this is not exactly my topic”? Indeed, not only no feedback whatsoever was given to any of the people of my story, somehow they also got a prize: in later years I fought similar battles to have papers of the above mentioned reviewers accepted in other conferences!

Overall, I do not know what is a better system to review papers, but I am pretty sure Expert Reviewers are not the answer.

So, there is no happy ending here: The Expert Reviewers are still convinced to be always right and, from time to time, still rejecting your papers they did not actually understand.

P.S. I am also an Expert Reviewer.

]]>**1. Implicit Updates **

Let’s consider the two mostly commonly used frameworks for online learning: Online Mirror Descent (OMD) and Follow-The-Regularized-Leader (FTRL).

We already explained that in OMD we update the iterate as the minimizer of an linear approximation of the last loss function we received plus a term that measures the distance to the previous iterate:

where .

On the other hand, in FTRL we have two possibilities: we can minimize the regularized sum of the loss we have received till now *or* the regularized sum of the *linear approximation* of the losses. In the first case, we update with

while in the second case, we use

where for . The second update is what optimization people call Dual Averaging. We also saw that under some reasonable conditions, the two updates of FTRL have the same regret guarantee. However, we would expect the second approach, the one using the exact loss functions, to perform much better in practice. Also, in the linearization of the losses we have to assume to know some characteristics of the losses, e.g., strong convexity parameter, to achieve the same regret of the full-losses FTRL.

Overall, we have two different frameworks and two different ways to use the loss functions in the update. So, it should be obvious that there is at least another possibility, that is *OMD with exact loss*. That is, we would like to consider the update

As in the FTRL case, we would expect this update to be better than the linearized one, at least empirically.

To gain some more intuition, let’s consider the simple case that , , and the losses are differentiable. In this case, we have that the linearized OMD update becomes

For example, with square loss and linear predictors over couples input/labels , we have and the update becomes

On the other hand, the update of OMD with the exact loss function becomes

The optimality condition tells us that satisfies

that is

So, the update is not in a closed formula anymore, but it has an *implicit* form, where appears in both side of the equality. This is exactly the reason why the update of OMD with exact losses is known in the online learning literature as *implicit updates*. So, we will call the update in (1) *Implicit OMD*.

Remark 1.Observe that for linear losses OMD and the Implicit OMD are equivalent.

In general, calculating the update of Implicit OMD can be an annoying optimization problem. However, in some cases the Implicit OMD update can still be solved in a closed form.

Example 1.Consider again linear regression with square loss. The update of Implicit OMD becomesTo solve the equation, we take the inner product of both sides with , to obtain

that is

Substituting this expression in (3), we have

**Implicit Updates are Always Descending** Till now, we have motivated implicit updates purely from an intuitive point of view: We expect this algorithm to be better because we do not approximate the loss functions. Indeed, we often gain a lot in performance switching to implicit updates. However, we can even prove that implicit updates have interesting theoretical properties.

First, contrarily to OGD, implicit updates remains “sane” even when the learning rate goes to infinity. Indeed, taking to infinity in (1), becomes simply the minimizer of the loss function. On the other hand, in OMD when the learning rate goes to infinity we can take a step that is arbitrarily far from the minimizer of the function!

When we consider non-differentiable convex functions, there is another important difference between implicit updates and subgradient descent updates. We already saw that the subgradient might not point in a descent direction. That is, no matter how we choose , the value of the function in might be *higher* than in . On the other hand, this cannot happen with implicit updates, no matter how we choose the learning rate:

**Proximal updates** Are implicit updates actually an invention of online learning people? Actually, no. Indeed, these kind of updates were known at least in 1965 (!) and they were proposed for the (offline) optimization of functions with the name of *proximal updates*. Basically, we have a function and we minimize it iteratively with the update

starting from an initial point . At first sight, such update might seem pointless in offline optimization: being able to solve (4) implies being also able to find the minimizer of in one step! However, as we have previously discussed, these kind of updates find an application when the function is composed by two parts and we decide to linearize only one part.

**2. Passive-Aggressive **

Now, let me show you that implicit updates were actually used a lot in the online literature, even if many people did not realize it.

Let’s take a look at a very famous online learning algorithm: the Passive-Aggressive (PA) algorithm. PA was a major success in online learning: 2099 citations and counting, that is huge for the online learning area. The theory was not very strong, but the performance of these algorithms was way better than anything else we had at that time. Let’s see how the PA algorithm works.

The PA algorithm was introduced before the Online Convex Optimization (OCO) framework was proposed. So, at that time, online learning for classification and regression focused on the particular case in which the loss functions have the form , basically the loss of linear predictors over couples input/label . The PA algorithm in particular focused on losses that can be zero over intervals, like the hinge loss, the squared hinge loss, the -insensitive loss, and the squared -insensitive loss. For these losses, the update they proposed was

where is a hyperparameter. Now, this is exactly the Implicit OMD update with the special case of the squared L2 Bregman divergence! The choice of the loss functions makes this update always in a closed form. So, for example, for the hinge loss and linear predictors, we have

where the second equality is calculated using the optimality condition and it is left as an exercise to the reader.

So, the huge boost in performance of PA over other online algorithm is due *uniquely* to the implicit updates.

**3. Implicit Updates on Truncated Linear Models: aProx **

There is another first-order optimization algorithm inspired to implicit updates. As we said, implicit updates are rarely in a closed form. So, we can try to approximate the implicit updates in some way. One possibility is to use the implicit update on a *surrogate loss function*. Indeed, when we use a linear approximation we recover plain OMD. Instead, when we use the exact function we get the implicit updates. What can we use in between the two cases? We could think to use a *truncated linear model*. That is, in the case we know that the functions are lower bounded by some , we define

for any . Note that this is a lower bound to the loss function and it is piecewise linear.

Now, we can use these surrogate function in the implicit OMD:

Implicit OMD with truncated linear models and the squared L2 Bregman is called aProx (Asi and Duchi, 2019).

Considering and , we again have

where is a specific vector in . Now, we have 2 possibilities: is in the flat part or in the corner of . Indeed, it should be easy to see that the proximal update assures us that we cannot miss the corner and land on the flat part. So, if we are in the linear part, then . Instead, if we are in the corner we have where . Hence, we always have

and we only need to find . Substituting in the definition of and using first-order optimality condition, we can verify that the following is the closed formula of the update (left as an exercise)

The similarity between this update and the one of PA in (5) should be adamant, that is due to the similarity between the truncated linear model and the hinge loss. Indeed, running aProx on linear classifiers with hinge loss is exactly the PA algorithm.

**4. More Updates Similar to the Implicit Updates **

From an empirical point of view, we can gain a lot of performance using implicit updates, even just approximating them. So, it should not be surprising if people proposed and used similar ideas in many optimization algorithm. Let me give you some examples.

The default optimization algorithm in the machine learning library Vowpal Wabbit (VW) uses the Importance Weight Aware Updates (Karampatziakis and Langford, 2011). These updates essentially approximate the implicit update using a differential equation that for linear models can be calculated in a closed formula. So, if you ever used VW, you already used a close relative of implicit updates, probably without knowing it.

Another interesting example is the setting of adaptive filtering, where one wants to minimize . In this setting, a classic algorithm is Least Mean Squared (LMS) algorithm that corresponds to Online Gradient Descent with linear models and squared loss. Now, a known better version of the LMS is the normalized LMS, that is nothing else that Implicit OMD with linear models and squared loss.

There are even interpretations of the Nesterov’s accelerated gradient method as an implicit update on a curved space (Defazio, 2019).

So, implicit updates are so “natural” that I personally think that any offline/online optimization algorithm that has good performance must be a good approximation of implict updates. Hence, I am sure there are even more examples of implicit updates hiding in other well-known algorithms.

**5. Regret Guarantee for Implicit Updates **

From the above reasoning, it seems very intuitive to expect a better regret bound for implicit updates. However, it turns out particularly challenging to prove a *quantifiable* advantage of implicit updates over OMD ones in the adversarial setting.

Here, I show a very recent result of mine on Implicit OMD that for the first time shows a clear advantage of Implicit OMD in some situations.

First, we can show the following theorem.

Theorem 1.Assume a constant learning rate . Then, implicit OMD guarantees

Moreover, assume the distance generating function to be 1-strongly convex w.r.t. . Then, there exists such that we have

*Proof:* To obtain this bound, we proceed in a slightly different way than in the classic OMD proof. In particular, for any we have

where and in the second inequality we have used the optimality condition of the update. Adding to both terms of the inequality, dividing by , and reordering, we have

Summing over time, we get the first bound.

For the second bound, let’s now focus on the terms and we upper bound them in two different ways. First, using the convexity of the losses, we can bound the difference between and :

where . Also, from the strong convexity of , we have

Hence, putting all together we have

where in the last inequality we used the elementary inequality .

From the optimality condition of the implicit OMD update, we know that there exists such that

Hence, we have

where we used the convexity of the Bregman divergence in its first argument in the second inequality and the optimality condition of the update in the third inequality. This chain of inequalities implies that that gives the second bound in the minimum.

The theorem shows a *possible* and *small* improvement over the OMD regret bound. In particular, there might be sequences of losses where . The fact that the improvement is only possible on some sequences is to be expected: the OMD regret is worst-case optimal on bounded domains, so there is not much to gain. However, maybe we could expect a larger gain on some particular sequence of functions. Indeed, we can show that on some sequences of losses we can achieve *constant* regret! Let’s see how.

From the regret above, we have

Denoting by the *temporal variability* of the losses, we have that the regret guarantee is

Now, in the case that the loss functions are all the same and the regret upper bound becomes a *constant* independent of . It is worth reminding that constant regret is the best we can hope for in online convex optimization! In other words, when online learning becomes as easy as offline learning (i.e., all the losses are equal), that implicit updates give us a provable large boost.

However, there is a caveat: In order to get a regret in the general case we need . On the other had, if we want . The problem in online learning learning is that we do not know the future, so we need some *adaptive* strategy that changes in a dynamic way. This is indeed possible and we leave this as an exercise, see below.

Our last observation is that we can recover the constant regret bound even for FTRL when used on the exact losses. Again, this is due to the use of the exact losses rather than the linear approximation. Remember that FTRL predicts with , where . Hence, from the FTRL regret equality and assuming a non-decreasing regularizer, we have

However, FTRL with exact losses requires to solve a finite sum optimization problem whose size grows with the number of iterations. Instead, Implicit OMD uses only one loss in each round, resulting in a closed formula in a number of interesting cases. We also note that we would have the same tuning problem as before: in order to get a constant regret when , we would need the regularizer to be constant and independent from time, while it should grow as in the general case.

**6. History Bits **

The implicit updates in online learning were proposed for the first time by (Kivinen and Warmuth, 1997). However, such update with the Euclidean divergence is the Proximal update in the optimization literature dating back at least to 1965 (Moreau, 1965)(Martinet, 1970)(Rockafellar, 1976)(Parikh and Boyd, 2014), and more recently used even in the stochastic setting (Toulis and Airoldi, 2017)(Asi and Duchi, 2019).

The PA algorithms were proposed in (Crammer et al., 2006), but the connection with implicit updates was absent in the paper. I am not sure who first realized the connection: I realized it in 2011 and I showed it to Joseph Keshet (one of the author of PA) that encouraged me to publish it somewhere. Only 10 years later, I am doing it Note that the mistake bound proved in the PA paper is worse than the Perceptron bound. Later, we proved a mistake bound for PA that is strictly better than the classic Perceptron’s bound (Jie et al., 2010).

The very nice idea of truncated linear models was proposed by (Asi and Duchi, 2019) as a way to approximate proximal updates and retaining closed form updates.

The connection between implicit OMD and normalized LMS was shown by (Kivinen et al., 2006).

(Kulis and Bartlett, 2010) provide the first regret bounds for implicit updates that match those of OMD, while (McMahan, 2010) makes the first attempt to quantify the advantage of the implicit updates in the regret bound. Finally, (Song et al., 2018) generalize the results in (McMahan, 2010) to Bregman divergences and strongly convex functions, and quantify the gain differently in the regret bound. Note that in (McMahan, 2010)(Song et al., 2018) the gain cannot be exactly quantified, providing just a non-negative data-dependent quantity subtracted to the regret bound. The connection between temporal variation and implicit updates was shown in (Campolongo and Orabona, 2020), together with a matching lower bound.

**7. Acknowledgements **

Thanks to NicolÃ² Campolongo for feedback on a draft of this post.

**8. Exercises **

Exercise 1.Prove that the update of PA given above is correct.

Exercise 2.Prove that the update of aProx given above is correct.

]]>

Exercise 3.Find an learning rate strategy to adapt to the value of without knowing it (Campolongo and Orabona, 2020).

There is a popular interpretation of the Perceptron as a stochastic (sub)gradient descent procedure. I even found slides online with this idea. The thought of so many young minds twisted by these false claims was too much to bear. So, I felt compelled to write a blog post to explain why this is wrong…

Moreover, I will also give a different and (I think) much better interpretation of the Perceptron algorithm.

**1. Perceptron Algorithm**

The Perceptron algorithm was introduced by Rosenblatt in 1958. To be more precise, he introduced a family of algorithms characterized by a certain architecture. Also, he considered what we call now supervised and unsupervised training procedures. However, nowadays when we talk about the Perceptron we intend the following algorithm:

In the algorithm, the couples for , with and , represent a set of input/output pairs that we want to learn to classify correctly in the two categories and . We assume that there exists an unknown vector the correctly classify all the samples, that is . Note that any scaling of by a positive constant still correctly classify all the samples, so there are infinite solutions. The aim of the Perceptron is to find any of these solutions.

From an optimization point of view, this is called a *feasibility problem*, that is something like

where is some set. They are an essential step in constrained optimization for algorithms that require an feasible initial point. Feasibility problems are not optimization problems even if in some cases can be solved with an optimization formulation.

In the Perceptron case, we can restate the problem as

where the “1” on the r.h.s. is clearly arbitrary and it can be changed through rescaling of . So, in optimization language, the Perceptron algorithm is nothing else than an iterative procedure to solve the above feasibility problem.

**2. Issues with the SGD Interpretation**

As said above, sometimes people refer to the Perceptron as a stochastic (sub)gradient descent algorithm on the objective function

I think they are many problems with this ideas, let me list some of them

- First of all, the above interpretation assumes that we take the samples randomly from . However, this is not needed in the Perceptron and it was not needed in the first proofs of the Perceptron convergence (Novikoff, 1963). There is a tendency to call anything that receive one sample at a time as “stochastic”, but “arbitrary order” and “stochastic” are clearly not the same.
- The Perceptron is typically initialized with . Now, we have two problems. The first one is that with a black-box first-order oracle, we would get a subgradient of a , where is drawn uniformly at random in . A possible subgradient for any is . This means that SGD would not update. Instead, the Perceptron in this case does update. So, we are forced to consider a different model than the black-box one. Changing the oracle model is a minor problem, but this fact hints to another very big issue.
- The biggest issue is that is a global optimum of ! So, there is nothing to minimize, we are already done in the first iteration. However, from a classification point of view, this solution seems clearly wrong. So, it seems we constructed an objective function we want to minimize and a corresponding algorithm, but for some reason we do not like one of its infinite minimizers. So, maybe, the objective function is wrong? So, maybe, this interpretation misses something?

There is an easy way to avoid some of the above problems: change the objective function to a parametrized loss that has non-zero gradient in zero. For example, something like this

Now, when goes to infinity, you recover the function . However, for any finite , is not a global optimum anymore. As a side effect, we also solved the issue of the subgradient of the max function. In this way, you could interpret the Perceptron algorithm as the *limit behaviour of SGD on a family of optimization problems*.

To be honest, I am not sure this is a satisfying solution. Moreover, the stochasticity is still there and it should be removed.

Now, I already proved a mistake bound for the Perceptron, without any particular interpretation attached to it. As a matter of fact, proofs do not need interpretations to be correct. I showed that the Perceptron competes with a *family of loss functions* that implies that it does not just use the subgradient of a single function. However, if you need an *intuitive way* to think about it, let me present you the idea of *pseudogradients*.

**3. Pseudogradients**

Suppose we want to minimize a function -smooth and we would like to use something like gradient descent. However, we do not have access to its gradient. In this situation, (Polyak and Tsypkin, 1973) proposed to use a “pseudogradient”, that is *any* vector that forms an angle of 90 degrees or less with the actual gradient in

In a very intuitive way, gives me some information that should allow me to minimize , at least in the limit. The algorithm then becomes a “pseudogradient descent” procedure that updates the current solution in the direction of the negative pseudogradient

where are the step size or learning rates.

Note that (Polyak and Tsypkin, 1973) define the pseudogradients as a *stochastic* vector that satisfies the above inequality in conditional expectation and for a time-varying . Indeed, there are a number of very interesting results in that paper. However, for simplicity of exposition I will only consider the deterministic case and only describe the application to the Perceptron.

Let’s see how this would work. Let’s assume that at least for an initial number of rounds, that means that the angle between the pseudogradient and the gradient is acute. From the -smoothness of , we have that

Now, if , we have that so can guarantee that the value of decreases at each step. So, we are minimizing without using a gradient!

To get a rate of convergence, we should know something more about . For example, we could assume that . Then, setting , we obtain

This is still not enough because it is clear that cannot be true on all rounds because when we are in the minimizer . However, with enough assumptions, following this route you can even get a rate of convergence.

**4. Pseudogradients for the Perceptron**

How do we use this to explain the Perceptron? Suppose your set is *linearly separable* with a margin of 1. This means that there exists a vector such that

Note that the value of the margin is arbitrary, we can change it just rescaling .

Remark 1.An equivalent way to restate this condition is to constrain to have unitary norm and require

where is called themaximum marginof . However, in the following I will not use the margin notation because it makes things a bit less clear from an optimization point of view.

We would like to construct an algorithm to find (or any positive scaling of it) from the samples . So, we need an objective function. Here the brilliant idea of Polyak and Tsypkin: in each iteration take an arbitrary and define , that is exactly the negative update we use in the Perceptron. This turns out to be a pseudogradient for . Indeed,

where in the last inequality we used (2).

Let’s pause for a moment to look at what we did: We want to minimize , but its gradient is just impossible to calculate because it depends on that we clearly do not know. However, *every time the Perceptron finds a sample on which its prediction is wrong*, we can construct a pseudogradient, without any knowledge of . It is even more surprising if you consider the fact that there is an infinite number of possible solutions and hence functions , yet the pseudogradient correlates positively with the gradient of any of them! Moreover, no stochasticity is necessary.

At this point we are basically done. In fact, observe that is 1-smooth. So, every time , the analysis above tells us that

where in the last inequality we have assumed .

Setting , summing over time, and denoting the number of updates we have over iterations, we obtain

where used the fact that .

Now, there is the actual magic of the (parameter-free!) Perceptron update rule: as we explained here, the updates of the Perceptron are independent of . That is, given an order in which the samples are presented to the algorithm, any fixed makes the Perceptron update on the same samples and it only changes the scale of . Hence, even if the Perceptron algorithm uses , we can consider an arbitrary decided post-hoc to minimize the upper bound. Hence, we obtain

that is

Now, observing that the r.h.s. is independent of , we proved that the maximum number of updates, or equivalently mistakes, of the Perceptron algorithm is bounded.

Are we done? Not yet! We can now improve the Perceptron algorithm taking full advantage of the pseudogradients interpretation.

**5. An Improved Perceptron**

This is a little known idea to improve the Perceptron. It can be shown with the classic analysis as well, but it comes very naturally from the pseudogradient analysis.

Let’s start from

Now consider only the rounds in which and set , that is obtained by an optimization of the expression . So, we obtain

This means that now the update rule becomes

Now, summing (3) over time, we get

It is clear that this inequality implies the previous one because . But we can even obtain a tighter bound. Using the inequality between harmonic, geometric, and arithmetic mean, we have

In words, the original Perceptron bound depends on the maximum squared norm of the samples on which we updated. Instead, this bound depends on the geometric or arithmetic mean of the squared norm of the samples on which we updated, that is less or equal to the maximum.

**6. Pseudogradients and Lyapunov Potential Functions**

Some people might have realized yet another way to look at this: is the Lyapunov function typically used to analyze subgradient descent. In fact, the classic analysis of SGD considers the guaranteed decrement at each step of this function. The two things coincide, but I find the pseudogradient idea to add a non-trivial amount of information because it allows to bypass the idea of using a subgradient of the loss function completely.

Moreover, the idea of the pseudogradients is more general because it applies to any smooth function, not only to the choice of .

Overall, it is clear that all the good analyses of the Perceptron must have something in common. However, sometimes recasting a problem in a particular framework might have some advantages because it helps our intuition. In this view, I find the pseudogradient view particularly compelling because it aligns with my intuition of how an optimization algorithm is supposed to work.

**7. History Bits **

I already wrote about the Perceptron, so I will just add few more relevant bits.

As I said, it seems that the family of Perceptrons algorithms was intended to be something much more general than what we intend now. The particular class of Perceptron we use nowadays were called -system (Block, 1962). I hypothesize that the fact the -system survived the test of time is exactly due to the simple convergence proof in (Block, 1962) and (Novikoff, 1963). Both proofs are non-stochastic. For the sake of proper credits assignment, it seems that the convergence proof of the Perceptron was proved by many other before Block and Novikoff (see references in Novikoff, 1963). However, the proof in (Novikoff, 1963) seems to be the cleanest one. (Aizerman, Braverrnan, and Rozonoer, 1964) (essentially) describe for the first time the Kernel Perceptron and prove a finite mistake bound for it.

I got the idea of smoothing the Perceptron algorithm with a scaled logistic loss from a discussion on Twitter with Maxim Raginsky. He wrote that (Aizerman, Braverrnan, and Rozonoer, 1970) proposed some kind of smoothing in a Russian book for the objective function in (1), but I don’t have access to it so I am not sure what are the details. I just thought of a very natural one.

The idea of pseudogradients and the application to the Perceptron algorithm is in (Polyak and Tsypkin, 1973). However, there the input/output samples are still stochastic and the finite bound is not explicitly calculated. As I have shown, stochasticity is not needed. It is important to remember that online convex optimization as a field will come much later, so there was no reason for these people to consider arbitrary or even adversarial order of the samples.

The improved Perceptron mistake bound could be new (but please let me know if it isn’t!) and it is inspired from the idea in (Graepel, Herbrich, and Williamson, 2001) of normalizing the samples to show a tighter bound.

**Acknowledgements**

Given the insane amount of mistakes that NicolÃ² Campolongo usually finds in my posts, this time I asked him to proofread it. So, I thank NicolÃ² for finding an insane amout of mistakes on a draft of this post

]]>I have been working on online and stochastic optimization for a while, from a practical and empirical point of view. So, I was already in this field when Adam (Kingma and Ba, 2015) was proposed.

The paper was ok but not a breakthrough, and even more so for today standards. Indeed, the theory was weak: A regret guarantee for an algorithm supposed to work on stochastic optimization of non-convex functions. The experiments were also weak: The exact same experiments would result in a surefire rejection in these days. Later people also discovered an error in the proof and the fact that the algorithm will not converge on certain one-dimensional stochastic convex functions. Despite all of this, in these days Adam is considered the King of the optimization algorithms. Let me be clear: it is known that Adam will not always give you the best performance, yet most of the time people know that they can use it with its default parameters and get, if not the best performance, at least the second best performance on their particular deep learning problem. In other words, Adam is considered nowadays the *default optimizer* for deep learning. So, what is the secret behind Adam?

Over the years, people published a vast number of papers that tried to explain Adam and its performance, too many to list. From the “adaptive learning rate” (adaptive to what? Nobody exactly knows…) to the momentum, to the almost scale-invariance, each single aspect of its arcane recipe has been examined. Yet, none of these analyses gave us the final answer on its performance. It is clear that most of these ingredients are beneficial to the optimization process of *any* function, but it is still unclear why this exact combination and not another one make it the best algorithm. The equilibrium in the mix is so delicate that even the small change required to fix the non-convergence issue was considered to give slightly worse performance than Adam.

The fame of Adam is also accompanied by strong sentiments: It is enough to read posts on r/MachineLearning on Reddit to see the passion that people put in defending their favorite optimizers against the other ones. It is the sort of fervor that you see in religion, in sports, and in politics.

However, how *likely* is all this? I mean, how likely is that Adam is really the *best* optimization algorithm? How likely is that we reached the apex of optimization for deep learning few years ago in a field that is so young? Could there be another explanation to its prodigious performance?

I have a hypothesis, but before explaining it we have to briefly talk about the applied deep learning community.

In a talk, Olivier Bousquet has described the deep learning community as a giant genetic algorithm: Researchers in this community are exploring the space of all variants of algorithms and architectures in a semi-random way. Things that consistently work in large experiments are kept, the ones not working are discarded. Note that this process seems to be independent of acceptance and rejection of papers: The community is so big and active that good ideas on rejected papers are still saved and transformed into best practices in few months, see for example (Loshchilov and Hutter, 2019). Analogously, ideas in published papers are reproduced by hundred of people that mercilessly trash things that will not reproduce. This process has created a number of heuristics that consistently produce good results in experiments, and the stress here is on “consistently”. Indeed, despite being a method based on non-convex formulations, the performance of deep learning methods turns out to be extremely reliable. (Note that the deep learning community has also a large bias towards “famous” people, so not all the ideas receive the same level of attention…)

So, what is the link between this giant genetic algorithm and Adam? Well, looking carefully at the creating process in the deep learning community I noticed a pattern: Usually people try new architectures *keeping the optimization algorithm fixed*, and most of the time the algorithm of choice is Adam. This happens because, as explained above, Adam is the *default optimizer*.

So, here my hypothesis: Adam was a very good optimization algorithm for the neural networks architectures we had few years ago and ** people kept evolving new architectures on which Adam works**. So, we might not see many architectures on which Adam does not work because such ideas are discarded prematurely! Such ideas would require to design a new architecture

Now, I am sure many people won’t buy in this hypothesis, I am sure they will list all sort of specific problems in which Adam is not the best algorithm, in which Stochastic Gradient Descent with momentum is the best one, and so on and so forth. However, I would like to point out two things: 1) I don’t describe here a law of nature, but simply a tendency the community has that might have influenced the co-evolution of some architectures and optimizers; 2) I actually have some evidence to support this claim

If my claims were true, we would expect Adam to be extremely good on deep neural networks and very poor on anything else. And this does happen! For example, Adam is known to perform very poorly on simple convex and non-convex problems that are not deep neural networks, see for example the following experiments from (Vaswani et al., 2019):

It seems that the moment we move away from the specific setting of deep neural networks with their specific choice of initialization, specific scale of weights, specific loss function, etc., Adam loses its *adaptivity* and its magic default learning rate must be tuned again. Note that you can always write a linear predictor as a one-layer neural network, yet Adam does not work so well on this case too. So, ** all the particular choices of architectures in deep learning might have evolved to make Adam work better and better, while the simple problems above do not have any of these nice properties that allow Adam to shine**.

Overall, Adam might be the best optimizer because the deep learning community might be exploring only a small region in the joint search space of architectures/optimizers. If true, that would be ironic for a community that departed from convex methods because they focused only on a narrow region of the possible machine learning algorithms and it was like, as Yann LeCun wrote, “looking for your lost car keys under the street light knowing you lost them someplace else“.

EDIT: After the pubblication of this post, Sam Power pointed me to this tweet by Roger Grosse that seems to share a similar sentiment

]]>**1. SGD on Non-Convex Smooth Functions **

We are interested in minimizing a smooth non-convex function using stochastic gradient descent with unbiased stochastic gradients. More in details, we assume to have access to an oracle that returns in any point , , where is the realization of a mechanism for computing the stochastic gradient. For example, could be the random index of a training sample we use to calculate the gradient of the training loss or just random noise that is added on top of our gradient computation. We will also assume that the variance of the stochastic gradient is bounded: , for all . Weaker assumptions on the variance are possible, but they don’t add much to the general message nor to the scheme of the proof.

Given that the function is non-convex, we clearly cannot hope to converge to the minimum of , so we need a less ambitious goal. We assumed that the function is smooth. As you might remember from my previous posts, smooth functions are differentiable functions whose gradient is Lipschitz. Formally, we say that is -smooth when , for all . This assumption assures us that when we approach a local minimum the gradient goes to zero. Hence, **decreasing the norm of the gradient will be our objective function for SGD.** Note that smoothness is necessary to study the norm of the gradients. In fact, consider the function whose derivative does not go to zero when we approach the minimum, on the contrary it is always different than 0 in any point different than the minimum.

Last thing we will assume is that the function is bounded from below. Remember that the boundedness from below does not imply that the minimum of the function exists, e.g., .

Hence, I start from a point and the SGD update is

where are deterministic learning rates or stepsizes.

First, let’s see practically how SGD behaves w.r.t. Gradient Descent (GD) on the same problem.

In Figure 1, we are minimizing , where the stochastic gradient in SGD is given by the gradient of the function corrupted by Gaussian noise with zero mean and standard deviation 1. On the other hand, there is no noise for GD. In both cases, we use and we plot the absolute value of the derivative. We can see that GD will monotonically minimize the gradient till numerical precision as expected, converging to one of the local minima. Note that with a constant learning rate GD on this problem would converge even faster. Instead, SGD will jump back and forth resulting in only *some* iterates having small gradient. So, our basic question is the following:

*Will converge to zero with probability 1 in SGD when goes to infinity?*

This is more difficult to answer than what you might think. However, this is a basic question to know if it actually makes sense to run SGD for a bunch of iterations and return the last iterate, that is how 99% of the people use SGD on a non-convex problem.

To warm up, let’s first see what we can prove in a finite-time setting.

As all other similar analysis, we need to construct a potential (Lyapunov) function that allows us to analyze it. In the convex case, we would study , where . Here, this potential does not even make sense because we are not even trying to converge to . It turns out that a better choice is to study . We will make use of the following property of -smooth functions:

In words, this means that a smooth function is always upper bounded by a quadratic function. Note that this property does not require convexity, so we can safely use it. Thanks to this property, let’s see how our potential evolves over time during the optimization of SGD.

Now, let’s denote by the expectation w.r.t. given , so we have

where in the inequality we have used the fact that the variance of the stochastic gradient is bounded by . Taking the total expectation and reordering the terms, we have

Let’s see how useful this inequality is: consider a constant step size , where is the usual critical parameter of the learning rate (that you’ll never be able to tune properly unless you know things that you clearly don’t know…). With this choice, we have . So, we have

What we got is almost a convergence result: it says that the average of the norm of the gradients is going to zero as . Given that the average of a set of numbers is bigger or equal to its minimum, this means that there exists at least one in my set of iterates that has a small expected gradient. This is interesting but slightly disappointing. We were supposed to prove that the gradient converges to zero, but instead we only proved that at least *one* of the iterates has indeed small expected norm, but we don’t know which one. Also, trying to find the right iterate might be annoying because we only have access to stochastic gradients.

It is also interesting to see that the convergence rate has two terms: a fast rate and a slow rate . This means that we can expect the algorithm to make fast progress at the beginning of the optimization and then slowly converge once the number of iterations becomes big enough compared to the variance of the stochastic gradients. In case the noise on the gradients is zero, SGD becomes simply gradient descent and it will converge at a rate. In the noiseless case, we can also show that the last iterate is the one with the smallest gradient. However, note that the learning rate has in it, so effectively we can achieve a faster convergence in the noiseless case because we would be using a constant and independent from stepsize.

**2. The Magic Trick: Randomly Stopped SGD **

The above reasoning is interesting but it is not a solution to our question: does the last iterate of SGD converge? Yes or no?

There is a possible work-around that looks like a magic trick. Let’s take one iterate of SGD uniformly at random among and call it . Taking the expectation with respect to this randomization and the noise in the stochastic gradients we have that

Basically, it says that if run SGD for iterations, then we stop and return not the last iterate but one of the iterates at random, in expectation with respect to everything the norm will be small! Note that this is equivalent to run SGD with a random stopping time. In other words, given that we didn’t know how to prove if SGD converges, we changed the algorithm adding a random stopping time and now the random iterate on which we stop in expectation will have the desired convergence rate.

This is a very important result and also a standard one in these days. It should be intuitive why the randomization helps: From Figure 1 it is clear that we might be unlucky in the last iteration of SGD, however randomizing in expectation we smooth out the noise and get a decreasing gradient. However, we just changed the target because we still didn’t prove if the last iterate converges. So, we need an alternative way.

**3. The Disappointing Lim Inf **

Let’s consider again (1). This time let’s select any time-varying positive stepsizes that satisfy

These two conditions are classic in the study of stochastic approximation. The first condition is needed to be able to travel arbitrarily far from the initial point, while the second one is needed to keep the variance of the noise under control. The classic learning rate of does not satisfy these assumptions, but something decaying a little bit faster as will do.

With such a choice, we get

where we have used the second condition in the inequality. Now, the condition implies that converges to 0. So, there exists such that for all . So, we get that

This implies that with probability 1. We are almost done: From this last inequality and the condition that , we can derive the fact that .

**Wait, what? What is this ???** Unfortunately, it seems that we proved something weaker than we wanted to. In words, the lim inf result says that there exists a *subsequence* of that has a gradient converging to zero.

This is very disappointing and we might be tempted to believe that this is the best that we can do. Fortunately, this is not the case. In fact, in a seminal paper (Bertsekas and Tsitsiklis, 2000) proved the convergence of the gradients of SGD to zero with probability 1 under very weak assumptions. Their proof is very convoluted also due to the assumptions they used, but in the following I’ll show a much simpler proof.

**4. The Asymptotic Proof in Few Lines **

In 2018, I found a way to get the same result of (Bertsekas and Tsitsiklis, 2000) distilling their long proof in the following Lemma, whose proof is in the Appendix. It turns out that this Lemma is essentially all what we need.

Lemma 1.Let be two non-negative sequences and a sequence of vectors in a vector space . Let and assume and . Assume also that there exists such that , where is such that . Then, converges to 0.

We are now finally ready to prove the asymptotic convergence with probability 1.

Theorem 2.Assume that we use SGD on a -smooth function, with stepsizes that satisfies the conditions (2). Then, goes to zero with probability 1.

*Proof:* We want to use Lemma 1 on . So, first observe that by the -smoothness of , we have

The assumptions and the reasoning above imply that, with probability 1, . This also suggest to set . Also, we have, with probability 1, , because for is a martingale whose variance is bounded by . Hence, for is a martingale in , so it converges in with probability 1.

Overall, with probability 1 the assumptions of Lemma 1 are verified with .

We did it! Finally, we proved that the gradients of SGD do indeed converge to zero with probability 1. This means that with probability 1 for any there exists such that for .

Even if I didn’t actually use any intuition in crafting the above proof (I rarely use “intuition” to prove things), Yann Ollivier provided the following intuition for this proof: the proof is implicitly studying how far apart GD and SGD are. However, instead of estimating the distance between the two processes over a single update, it does it over large period of time through the term that can be controlled thanks to the choice of the learning rates.

**5. History Bits **

The idea of taking one iterate at random in SGD was proposed in (Ghadimi and Lan, 2013) and it reminds me the well-known online-to-batch conversion through randomization. The conditions on the learning rates in (2) go back to (Robbins and Monro, 1951). (Bertsekas and Tsitsiklis, 2000) contains a good review of previous work on asymptotic convergence of SGD, while a recent paper on this topic is (Patel, V., 2020).

I derived Lemma 1 as an extension of Proposition 2 in (Alber et al., 1998)/Lemma A.5 in (Mairal, 2013). Studying the proof of (Bertsekas and Tsitsiklis, 2000), I realized that I could change (Alber et al., 1998, Proposition 2) into what I needed. I had this proof sitting in my unpublished notes for 2 years, so I decided to write a blog post on it.

My actual small contribution to this line of research is a lim inf convergence for SGD with AdaGrad stepsizes (Li and Orabona, 2019), but under stronger assumptions on the noise.

Note that the 20-30 years ago there were many papers studying the asymptotic convergence of SGD and its variants in various settings. Then, the taste of the community changed moving from asymptotic convergence to finite-time rates. As it often happens when a new trend takes over the previous one, new generations tend to be oblivious to the old results and proof techniques. The common motivation to ignore these past results is that the finite-time analysis is superior to the asymptotic one, but this is clearly false (ask a statistician!). It should be instead clear to anyone that both analyses have pros and cons.

**6. Acknowledgements **

I thank LÃ©on Bottou for telling me of the problem of analyzing the asymptotic convergence of SGD in the non-convex case with a simple and general proof in 2018. LÃ©on also helped me checking my proofs and finding an error in a previous version. Also, I thank Yann Ollivier for reading my proof and kindly providing an alternative proof and the intuition that I report above.

**7. Appendix **

*Proof of Lemma : *Since the series diverges, given that converges, we necessarily have . Hence, we have to prove that .

Let us proceed by contradiction and assume that . First, assume that .

Given the values of the and , we can then build two sequences of indices and such that

- ,
- , for ,
- , for .

Define . The convergence of the series implies that the sequence of partial sums are Cauchy sequences. Hence, there exists large enough such for all we have and are less or equal to . Then, we have for all and all with ,

Therefore, using the triangle inequality, And finally for all , which contradicts . Therefore, goes to zero.

To rule out the case that , proceed in the same way, choosing any . Hence, we get that for , that contradicts .

]]>Don’t get me wrong: assuming bounded domains is perfectly fine and justified most of the time. However, sometimes it is unnecessary and it might also obscure critical issues in the analysis, as in this case. So, to balance the universe of first-order methods, I decided to show how to easily prove the convergence of the iterates in SGD, even in unbounded domains.

Technically speaking, the following result might be new, but definitely not worth a fight with Reviewer 2 to publish it somewhere.

**1. Setting **

First, let’s define our setting. We want to solve the following optimization problem

where and is a convex function. Now, various assumptions are possible on and choosing the right one depends on *your* particular problem, there are not right answers. Here, we will not make any strong assumption on . Also, we will *not* assume to be bounded. Indeed, in most of the modern applications in Machine Learning, is simply the entire space . We will also assume that is not empty and is any element in it.

We also assume to have access to a *first-order stochastic oracle* that returns stochastic sub-gradients of on any point . In formulas, we get such that . Practically speaking, every time you calculate the (sub)gradient on a minibatch of training data, that is a stochastic (sub)gradient and roughly speaking the random minibatch is the random variable .

Here, for didactic reasons, we will assume that is bounded by 1; similar results can be also show with more realistic assumptions. This holds, for example, if is an average of 1-Lipschitz functions and you draw some of them to calculate the stochastic subgradient.

The algorithm we want to focus on is SGD. So, what is SGD? SGD is an incredibly simple optimization algorithm, almost primitive. Indeed, part of its fame depends critically on its simplicity. Basically, you start from a certain and you update your solution iteratively moving in the direction of the negative stochastic subgradients, multiplied by a *learning rate* . We also use a projection onto . Of course, if no projection is needed. So, the update of SGD is

where and is the projection onto . Remember that when you use subgradients, SGD is not a descent algorithm: I already blogged about the fact that the common intuition of moving towards a descent direction is wrong when you use subgradients.

**2. Convergence of the Average of the Iterates **

Now, the most common analysis of SGD can be done in two different ways: constant learning rate and non-increasing learning rate. We already saw both of them in my lecture notes on online learning, so let’s summarize here the one-step inequality for SGD we need:

for all measureable w.r.t. .

If you plan to use iterations, you can you use a learning rate , and summing (1) we get

where we set . This is not a convergence results yet, because it just says that *on average* we are converging. To extract a single solution, we can use Jensen’s inequality and obtain

where . In words, we show a convergence guarantee for *the average of the iterates of SGD*, not for the last one.

Constant learning rates are a bit annoying because they depends on how many iterations you plan to do, theoretically and empirically. So, let’s now take a look at non-increasing learning rates, . In this case, the correct way to analyze SGD without the bounded assumption is to sum (1) *without dividing by *, to have

where we set . From this one, we have two alternatives. First, we can observe that

because is a minimizer and the learning rate non-increasing. So, using again Jensen’s inequality, we get

Note that if you like these sorts of games, you can even change the learning rate to shave a factor, but it is probably useless from an applied point of view.

Another possibility is to use a weighted average:

where and we used . Note that this option does not seem to give any advantage over the unweighted average above. Also, it weights the first iterations more than the last ones, that in most of the cases is a bad idea: First iterations tend to be farther away from the optimum then the last ones.

Let’s summarize what we have till now:

- Unbounded domains are fine with both constant and time-varying learning rates.
- The optimal learning rate depends on the distance between the optimal solution and the initial iterate, because the optimal setting of is proportional to .
- The weighted average is probably a bad idea and not strictly necessary.
- It seems we can only guarantee convergence for (weighted) averages of iterates.

The last point is a bit concerning: most of the time we take the last iterate of SGD, why we do it if the theory applies to the average?

**3. Convergence of the Last Iterate **

Actually, we do know that

- the last solution of SGD converges in unbounded domains with constant learning rate (Zhang, T., 2004).
- the last iterate of SGD converges in bounded domains with non-increasing learning rates (Shamir, O. and Zhang, T., 2013).

So, what about unbounded domains and non-increasing learning rate, i.e., 90% of the uses of SGD? It turns out that it is equally simple and I think the proof is also instructive! As surprising as it might sound, not dividing by (1) is the key ingredient we need. The proof plan is the following: we want to prove that the value of on the last iterate is not too far from the value of on . To prove it, we need the following technical lemma on sequences of non-negative numbers multiplied by non-increasing learning rates, whose proof is in the Appendix. This Lemma relates the last element of a sequence of numbers to their average.

Lemma 1.Let a non-increasing sequence of positive numbers and . Then

With the above Lemma, we can prove the following guarantee for the convergence of the last iterate of SGD.

Theorem 2.Assume the stepsizes deterministic and non-increasing. Then

*Proof:* We use Lemma 1, with , to have

Now, we bound the sum in the r.h.s. of last inequality. Summing (1) from to , we have the following inequality that holds for any :

Hence, setting , we have

Putting all together, we have the stated bound.

There are a couple of nice tricks in the proof that might be interesting to study carefully. First, we use the fact that one-step inequality in (1) holds for any . Most of the time, we state it with equal to , but it turns out that the more general statement is actually important! In fact, it is possible to know how far is the performance of last iterate from the performance of the average because the incremental nature of SGD makes possible to know exactly how far is from any previous iterate , with . Please note that all of this would be hidden in the case of bounded domains, where all the distances are bounded by the diameter of the set, and you don’t get the dependency on .

Now we have all the ingredients and we only have to substitute a particular choice of the learning rate.

*Proof:* First, observe that

Now, considering the last term in (3), we have

Using (2) and dividing by , we have the stated bound.

Note that the above proof works similarly if .

**4. History Bits **

The first finite-time convergence proof for the last iterate of SGD is from (Zhang, T., 2004), where he considered the constant learning rate case. It was later extended in (Shamir, O. and Zhang, T., 2013) for time-varying learning rates but only for bounded domains. The convergence rate for the weighted average in unbounded domains is from (Zhang, T., 2004). The observation that the weighted average is not needed and the plain average works equally well for non-increasing learning rates is from (X. Li and F. Orabona, 2019), where we needed it for the particular case of AdaGrad learning rates. The idea of analyzing SGD without dividing by the learning rate is byÂ (Zhang, T., 2004). Lemma 1 is new but actually hidden in the the convergence proof of the last iterate of SGD with linear predictors and square losses in (Lin, J. and Rosasco, L. and Zhou, D.-X., 2016), that in turn is based on the one in (Shamir, O. and Zhang, T., 2013). As far as I know, Corollary 3 is new, but please let me know if you happen to know a reference for it! It is possible to remove the logarithmic term in the bound using a different learning rate, but the proof is only for bounded domains (Jain, P. and Nagaraj, D. and Netrapalli, P., 2019).

**5. Exercises **

Exercise 1.Generalize the above proofs to the Stochastic Mirror Descent case.

Exercise 2.Remove the assumption of expected bounded stochastic subgradients and instead assume that is -smooth, i.e., has -Lipschitz gradient, and the variance of the noise is bounded. Hint: take a look at the proofs in (Zhang, T., 2004) and (X. Li and F. Orabona, 2019)

**6. Appendix **

*Proof of Lemma 1:*Â Define , so we have

that implies

Now, from the definition of and the above inequality, we have

that implies

Unrolling the inequality, we have

Using the definition of and the fact that , we have the stated bound.

]]>In this post, I explain a variation of the EG/Hedge algorithm, called *AdaHedge*. The basic idea is to design an algorithm that is adaptive to the sum of the squared norm of the losses, without any prior information on the range of the losses.

First, consider the case in which we use as constant regularizer the negative entropy , where will be determined in the following and is the simplex in . Using FTRL with linear losses with this regularizer, we immediately obtain

where we upper bounded the negative entropy of with 0. Using the strong convexity of the regularizer w.r.t. the norm and Lemma 4 here, we would further upper bound this as

This suggests that the optimal should be . However, as we have seen for L* bounds, this choice of any parameter of the algorithm is never feasible. Hence, exactly as we did for L* bounds, we might think of using an online version of this choice

where is a constant that will be determined later. An important property of such choice is that it gives rise to an algorithm that is scale-free, that is its predictions are invariant from the scaling of the losses by any constant factor. This is easy to see because

Note that this choice makes the regularizer non-decreasing over time and immediately gives us

At this point, we might be tempted to use Lemma 1 from the L* post to upper bound the sum in the upper bound, but unfortunately we cannot! Indeed, the denominator does not contain the term . We might add a constant to , but that would destroy the scale-freeness of the algorithm. However, it turns out that we can still prove our bound without any change to the regularizer. The key observation is that we can bound the term in two different ways. The first way is the one above, while the other one is

where we used the definition of and the fact that the regularizer is non-decreasing over time. So, we can now write

where we used the fact that the minimum between two numbers is less than their harmonic mean. Assuming and using Lemma 1 here, we have

The bound and the assumption on suggest to set . To summarize, we obtained a scale-free algorithm with regret bound .

We might consider ourselves happy, but there is a clear problem in the above algorithm: the choice of in the time-varying regularizer strictly depends on our upper bound. So, a loose bound will result in a poor choice of the regularization! In general, every time we use a part of the proof in the design of an algorithm we cannot expect an exciting empirical performance, unless our upper bound was really tight. So, can we design a better regularizer? Well, we need a better upper bound!

Let’s consider a generic regularizer and its corresponding FTRL with linear losses regret upper bound

where we assume to be non-decreasing in time.

Now, observe that the sum is unlikely to disappear for this kind of algorithms, so we could try to make the term of the same order of the sum. So, we would like to set of the same order of . However, this approach would cause an annoying recurrence. So, using the fact that is non-decreasing, let’s upper bound the terms in the sum just a little bit:

Now, we can set for , , and . This immediately implies that

Setting to be equal to the negative entropy, we get an algorithm known as AdaHedge. It is easy to see that this choice makes the algorithm scale-free as well.

With this choice of the regularizer, we can simplify a bit the expression of . For , we have . Instead, for , using the properties of the Fenchel conjugates, we have that

Overall, we get the pseudo-code of AdaHedge in Algorithm 1.

So, now we need an upper bound for . Observe that . Moreover, as we have done before, we can upper bound in two different ways. In fact, from Lemma 4 here, we have for . Also, denoting by , we have

Hence, we have

We can solve this recurrence using the following Lemma, where and .

Lemma 1.Let be any sequence of non-negative real numbers. Suppose that is a sequence of non-negative real numbers satisfying

Then, for any , .

*Proof:* Observe that

We bound each term in the sum separately. The left term of the minimum inequality in the definition of gives , while the right term gives . So, we conclude .

So, overall we got

and setting , we have

Note that this is roughly the same regret in (2), but the very important difference is that this new regret bound depends on the *much tighter quantity* , that we upper bounded with , but in general will be much smaller than that. For example, can be upper bounded using the tighter local norms, see the analysis of Exp3. Instead, in the first solution, the regret will always be dominated by the term because we explicitly use it in the regularizer!

There is an important lesson to be learned from AdaHedge: the regret is not the full story and algorithms with the same worst-case guarantee can exhibit vastly different empirical behaviors. Unfortunately, this message is rarely heard and there is a part of the community that focuses too much on the worst-case guarantee rather than on the empirical performance. Even worse, sometimes people favor algorithms with a “more elegant analysis” completely ignoring the likely worse empirical performance.

**1. History Bits **

The use of FTRL with the regularizer in (1) was proposed in (Orabona and PÃ¡l, 2015), I presented a simpler version of their proof that does not require Fenchel conjugates. The AdaHedge algorithm was introduced in (van Erven et al., 2011) and refined in (de Rooij et al., 2014). The analysis reported here is from (Orabona and PÃ¡l, 2015), that generalized AdaHedge to arbitrary regularizers in AdaFTRL. Additional properties of AdaHedge for the stochastic case were proven in (van Erven et al., 2011).

**2. Exercises **

]]>

Exercise 1.Implement AdaHedge and compare its empirical performance to FTRL with the time-varying regularizer in (1).

* You can find the other lectures here.*

In this lecture, we will explore the link between Online Learning and and Statistical Learning Theory.

**1. Agnostic PAC Learning **

We now consider a different setting from what we have seen till now. We will assume that we have a prediction strategy parametrized by a vector and we want to learn the relationship between an input and its associated label . Moreover, we will assume that is drawn from a joint probability distribution . Also, we are equipped with a loss function that measures how good is our prediction compared to the true label , that is . So, learning the relationship can be cast as minimizing the expected loss of our predictor

In machine learning terms, the object above is nothing else than the *test error* and our predictor.

Note that the above setting assumes labeled samples, but we can generalize it even more considering the *Vapnik’s general setting of learning*, where we collapse the prediction function and the loss in a unique function. This allows, for example, to treat supervised and unsupervised learning in the same unified way. So, we want to minimize the *risk*

where is an unknown distribution over and is measurable w.r.t. the second argument. Also, the set of all predictors that can be expressed by vectors in is called the *hypothesis class*.

Example 1.In a linear regression task where the loss is the square loss, we have and . Hence, .

Example 2.In linear binary classification where the loss is the hinge loss, we have and . Hence, .

Example 3.In binary classification with a neural network with the logistic loss, we have and is the network corresponding to the weights . Hence, .

The key difficulty of the above problem is that we don’t know the distribution . Hence, there is no hope to exactly solve this problem. Instead, we are interested in understanding *what is the best we can do if we have access to samples drawn i.i.d. from *. More in details, we want to upper bound the *excess risk*

where is a predictor that was *learned* using samples.

It should be clear that this is just an optimization problem and we are interested in upper bounding the suboptimality gap. In this view, the objective of machine learning can be considered as a particular optimization problem.

Remark 1.Note that this is not the only way to approach the problem of learning. Indeed, the regret minimization model is an alternative model to learning. Moreover, another approach would be to try to estimate the distribution and then solve the risk minimization problem, the approach usually taken in Statistics. No approach is superior to the other and each of them has its pros and cons.

Given that we have access to the distribution through samples drawn from it, any procedure we might think to use to minimize the risk will be stochastic in nature. This means that we cannot assure a deterministic guarantee. Instead, *we can try to prove that with high probability our minimization procedure will return a solution that is close to the minimizer of the risk*. It is also intuitive that the precision and probability we can guarantee must depend on how many samples we draw from .

Quantifying the dependency of precision and probability of failure on the number of samples used is the objective of the **Agnostic Probably Approximately Correct** (PAC) framework, where the keyword “agnostic” refers to the fact that we don’t assume anything on the best possible predictor. More in details, given a precision parameter and a probability of failure , we are interested in characterizing the *sample complexity of the hypothesis class * that is defined as the number of samples necessary to guarantee with probability at least that the best learning algorithm using the hypothesis class outputs a solution that has an excess risk upper bounded by . Note that the sample complexity does not depend on , so it is a worst-case measure w.r.t. all the possible distributions. This makes sense if you think that we know nothing about the distribution , so if your guarantee holds for the worst distribution it will also hold for any other distribution. Mathematically, we will say that the hypothesis class is agnostic PAC-learnable is such sample complexity function exists.

Definition 1.We will say that a function class isAgnostic-PAC-learnableif there exists an algorithm and a function such that when is used with samples drawn from , with probability at least the solution returned by the algorithm has excess risk at most .

Note that the Agnostic PAC learning setting does not say what is the procedure we should follow to find such sample complexity. The approach most commonly used in machine learning to solve the learning problem is the so-called *Empirical Risk Minimization (ERM) problem*. It consist of drawing samples i.i.d. from and minimizing the *empirical risk*:

In words, ERM is nothing else than minimize the error on a training set. However, in many interesting cases we can have that can be very far from the true optimum , even with an infinite number of samples! So, we need to modify the ERM formulation in some way, e.g., using a *regularization* term or a Bayesian prior of , or find conditions under which ERM works.

The ERM approach is so widespread that machine learning itself is often wrongly identified with some kind of minimization of the training error. We now show that ERM is not the entire world of ML, showing that *the existence of a no-regret algorithm, that is an online learning algorithm with sublinear regret, guarantee Agnostic-PAC learnability*. More in details, we will show that an online algorithm with sublinear regret can be used to solve machine learning problems. This is not just a curiosity, for example this gives rise to computationally efficient parameter-free algorithms, that can be achieved through ERM only running a two-step procedure, i.e. running ERM with different parameters and selecting the best solution among them.

We already mentioned this possibility when we talked about the online-to-batch conversion, but this time we will strengthen it proving high probability guarantees rather than expectation ones.

So, we need some more bits on concentration inequalities.

**2. Bits on Concentration Inequalities **

We will use a concentration inequality to prove the high probability guarantee, but we will need to go beyond the sum of i.i.d.random variables. In particular, we will use the concept of *martingales*.

Definition 2.A sequence of random variables is called amartingaleif for all it satisfies:

Example 4.Consider a fair coin and a betting algorithm that bets money on each round on the side of the coin equal to . We win or lose money 1:1, so the total money we won up to round is . is a martingale. Indeed, we have

For bounded martingales we can prove high probability guarantees as for bounded i.i.d. random variables. The following Theorem will be the key result we will need.

Theorem 3 (Hoeffding-Azuma inequality).Let be a martingale of random variables that satisfy almost surely. Then, we have

Also, the same upper bounds hold on .

**3. From Regret to Agnostic PAC **

We now show how the online-to-batch conversion we introduced before gives us high probability guarantee for our machine learning problem.

Theorem 4.Let , where the expectation is w.r.t. drawn from with support over some vector space and . Draw samples i.i.d. from and construct the sequence of losses . Run any online learning algorithm over the losses , to construct the sequence of predictions . Then, we have with probability at least , it holds that

*Proof:* Define . We claim that is a martingale. In fact, we have

where we used the fact that depends only on Hence, we have

that proves our claim.

Hence, using Theorem 3, we have

This implies that, with probability at least , we have

or equivalently

We now use the definition of regret w.r.t. any , to have

The last step is to upper bound with high probability with . This is easier than the previous upper bound because is a fixed vector, so are i.i.d. random variables, so for sure forms a martingale. So, reasoning as above, we have that with probability at least it holds that

Putting all together and using the union bound, we have the stated bound.

The theorem above upper bounds the average risk of the predictors, while we are interested in producing a single predictor. If the risk is a convex function and is convex, than we can lower bound the l.h.s. of the inequalities in the theorem with the risk evaluated on the average of the . That is

If the risk is not a convex function, we need a way to generate a single solution with small risk. One possibility is to construct a *stochastic classifier* that samples one of the with uniform probability and predicts with it. For this classifier, we immediately have

where the expectation in the definition of the risk of the stochastic classifier is also with respect to the random index. Yet another way, is to select among the predictors, the one with the smallest risk. This works because the average is lower bounded by the minimum. This is easily achieved using samples for the online learning procedure and samples to generate a validation set to evaluate the solution and pick the best one. The following Theorem shows that selecting the predictor with the smallest empirical risk on a validation set will give us a predictor close to the best one with high probability.

Theorem 5.We have a finite set of predictors and a dataset of samples drawn i.i.d. from . Denote by . Then, with probability at least , we have

*Proof:* We want to calculate the probability that the hypothesis that minimizes the validation error is far from the best hypothesis in the set. We cannot do it directly because we don’t have the required independence to use a concentration. Instead, *we will upper bound the probability that there exists at least one function whose empirical risk is far from the risk.* So, we have

Hence, with probability at least , we have that for all

We are now able to upper bound the risk of , just using the fact that the above applies to too. So, we have

where in the last inequality we used the fact that minimizes the empirical risk.

Using this theorem, we can use samples for the training and samples for the validation. Denoting by the predictor with the best empirical risk on the validation set among the generated during the online procedure, we have with probability at least that

It is important to note that with any of the above three methods to select one among the generated by the online learning procedure, the sample complexity guarantee we get matches the one we would have obtained by ERM, up to polylogarithmic factors. In other words, there is nothing special about ERM compared to the online learning approach to statistical learning.

Another important point is that the above guarantee does not imply the existence of online learning algorithms with sublinear regret for any learning problem. It just says that, if it exists, it can be used in the statistical setting too.

**4. History Bits **

Theorem 4 is from (N. Cesa-Bianchi and A. Conconi and Gentile, C. , 2004). Theorem 5 is nothing else than the Agnostic PAC learning guarantee for ERM for hypothesis classes with finite cardinality. (N. Cesa-Bianchi and A. Conconi and Gentile, C. , 2004) gives also an alternative procedure to select a single hypothesis among the generated during the online procedure that does not require splitting the data in training and validation. However, the obtained guarantee matches the one we have proved.

]]>* You can find all the lectures I published here.*

In the last lecture, we introduced the Explore-Then-Commit (ETC) algorithm that solves the stochastic bandit problem, but requires the knowledge of the *gaps*. This time we will introduce a parameter-free strategy that achieves the same optimal regret guarantee.

**1. Upper Confidence Bound Algorithm **

The ETC algorithm has the disadvantage of requiring the knowledge of the gaps to tune the exploration phase. Moreover, it solves the exploration vs. exploitation trade-off in a clunky way. It would be better to have an algorithm that smoothly transition from one phase into the other *in a data-dependent way*. So, we now describe an optimal and adaptive strategy called Upper Confidence Bound (UCB) algorithm. It employs the principle of *optimism in the face of uncertainty*, to select in each round the arm that has the *potential to be the best one*.

UCB works keeping an estimate of the expected loss of each arm and also a confidence interval at a certain probability. Roughly speaking, we have that with probability at least

where the “roughly” comes from the fact that is a random variable itself. Then, UCB will query the arm with the smallest lower bound, that is the one that could potentially have the smallest expected loss.

Remark 1.The name Upper Confidence Bound comes from the fact that traditionally stochastic bandits are defined over rewards, rather than losses. So, in our case we actually use the lower confidence bound in the algorithm. However, to avoid confusion with the literature, we still call it Upper Confidence Bound algorithm.

The key points in the proof are on how to choose the right confidence level and how to get around the dependency issues.

The algorithm is summarized in Algorithm 1 and we can prove the following regret bound.

Theorem 1.Assume that the rewards of the arms are -subgaussian and and let . Then, UCB guarantees a regret of

*Proof:* We analyze one arm at the time. Also, without loss of generality, assume that the optimal arm is the first one. For arm , we want to prove that .

The proof is based on the fact that once I have sampled an arm enough times, the probability to take a suboptimal arm is small.

Let the biggest time index such that . If , then the statement above is true. Hence, we can safely assume . Now, for bigger than we have

Consider and such that , then we claim that at least one of the two following equations must be true:

If the first one is true, the confidence interval around our estimate of the expectation of the optimal arm does not contain . On the other hand, if the second one is true the confidence interval around our estimate of the expectation does not contain . So, we claim that if and we selected a suboptimal arm, then at least one of these two bad events happened.

Let’s prove the claim: *if both the inequalities above are false*, , and , we have

that, by the selection strategy of the algorithm, would imply .

Note that . Hence, we have

Now, we upper bound the probabilities in the sum. Given that the losses on the arms are i.i.d. and using the union bound, we have

Hence, we have

Given that the same bound holds for , we have

Using the decomposition of the regret we proved last time, , we have the stated bound.

It is instructive to observe an actual run of the algorithm. I have considered 5 arms and Gaussian losses. In the left plot of figure below, I have plotted how the estimates and confidence intervals of UCB varies over time (in blue), compared to the actual true means (in black). In the right side, you can see the number of times each arm was pulled by the algorithm.

It is interesting to note that the logarithmic factor in the confidence term will make the confidences of the arm that are not pulled to *increase* over time. In turn, this will assure that the algorithm does not miss the optimal arm, even if the estimates were off. Also, the algorithm will keep pulling the two arms that are close together, to be sure on which one is the best among the two.

The bound above can become meaningless if the gaps are too small. So, here we prove another bound that does not depend on the inverse of the gaps.

Theorem 2.Assume that the rewards of the arms minus their expectations are -subgaussian and let . Then, UCB guarantees a regret of

*Proof:* Let be some value to be tuned subsequently and recall from the proof of Theorem 1 that for each suboptimal arm we can bound

Hence, using the regret decomposition we proved last time, we have

Choosing , we have the stated bound.

Remark 2.Note that while the UCB algorithm is considered parameter-free, we still have to know the subgaussianity of the arms. While this can be easily upper bounded for stochastic arms with bounded support, it is unclear how to do it without any prior knowledge on the distribution of the arms.

It is possible to prove that the UCB algorithm is asymptotically optimal, in the sense of the following Theorem.

Theorem 3 (Bubeck, S. and Cesa-Bianchi, N. , 2012, Theorem 2.2).Consider a strategy that satisfies for any set of Bernoulli rewards distributions, any arm with and any . Then, for any set of Bernoulli reward distributions, the following holds

**2. History Bits **

The use of confidence bounds and the idea of optimism first appeared in the work by (T. L. Lai and H. Robbins, 1985). The first version of UCB is by (T. L. Lai, 1987). The version of UCB I presented is by (P. Auer and N. Cesa-Bianchi and P. Fischer, 2002) under the name UCB1. Note that, rather than considering 1-subgaussian environments, (P. Auer and N. Cesa-Bianchi and P. Fischer, 2002) considers bandits where the rewards are confined to the interval. The proof of Theorem 1 is a minor variation of the one of Theorem 2.1 in (Bubeck, S. and Cesa-Bianchi, N. , 2012), which also popularized the subgaussian setup. Theorem 2 is from (Bubeck, S. and Cesa-Bianchi, N. , 2012).

**3. Exercises **

]]>

Exercise 1.Prove a similar regret bound to the one in Theorem 2 for an optimally tuned Explore-Then-Commit algorithm.