**1. Faster Rates Through Optimism**

Assume that is smooth in an open interval containing its domain, in the sense that for any and , we have

where and denote the gradients with respect to the first and second variable respectively, and we have denoted by and the norms in and respectively, while the norms with the are their duals.

Remark 1. At this point, one might be tempted to consider the maximum between the three quantities “to simplify the math”, but the units are different!

Let’s again use two online algorithms to solve the saddle-point problem . However, instead of using two standard no-regret algorithms, we will use two *optimistic* ones. Optimistic online algorithms use a hint on the next subgradient. We will use the same strategy and proof as for the algorithm we saw for gradual variations, that is, we will use the previously observed gradient as a prediction for the next one.

For example, use two Optimistic FTRL algorithms with fixed strongly convex regularizers and hint at time constructed using the previously observed gradient: where we set . We now show that these hints allow us to cancel out terms when we consider the sum of the regrets and obtain a faster rate of rather than just .

From the regret of Optimistic FTRL, for the -player we have

From the Fenchel-Young inequality, we have . Putting all together, we have

Note that there are multiple choices of the coefficient in the Fenchel-Young inequality, but without additional information all choices are equally good.

Now, using the smoothness assumption, for we have

We can proceed in the exact same way for the -player too.

Summing the regret of the two algorithms, we have

Choosing and for any kills all the terms in the sum. In fact, we have

and similarly for the other term. One might wonder why we need to introduce and whether it can just be set to 1. However, has units and it is what allows us to sum the smoothness coefficients, so it is better to keep it around as a reminder.

Assuming that the regularizers are bounded over and and using the usual online-to-batch conversion, we have that the duality gap evaluated at the pair goes to zero as when .

Overall, we can state the following theorem.

Theorem 1. With the notation in Algorithm 1, let be convex in the first argument and concave in the second, satisfying assumptions (1)–(4). For a fixed , let and . Let be -strongly convex w.r.t. and be -strongly convex w.r.t. . Assume and non-empty. Then, we have

for any and .

Looking back at the proof of the algorithm, we have a faster convergence because the regret of one player depends on the “stability” of the other player, measured by the terms and . Hence, we have a sort of “stabilization loop” in which the stability of one algorithm makes the other more stable, which in turn stabilizes the first one even more. Indeed, we can also show that the regret of each of the two algorithms does not grow over time. Note that such a result cannot be obtained just by looking at the fact that the sum of the regrets does not grow over time.

In fact, setting for example and , we have that

and

Hence, using the fact that the existence of a saddle-point guarantees that , we have

Plugging this guarantee back in the regret of each algorithm, we have that their regret is bounded and independent of . From (5), we also have that and converge to 0. Hence, the algorithms are getting more and more stable over time, even if they use constant regularizers.

**Version with Optimistic OMD** The *exact* same reasoning holds for Optimistic OMD, because the key terms of its regret bound are exactly the same as those of Optimistic FTRL. To better show this fact, we also instantiate Optimistic OMD with stepsizes equal to and for the -player and -player respectively. Following the same reasoning as above and the regret bound of Optimistic OMD, we obtain the following theorem.

Theorem 2. With the notation in Algorithm 1, let be convex in the first argument and concave in the second, satisfying assumptions (1)–(4). For a fixed , let and . Let be -strongly convex w.r.t. and be -strongly convex w.r.t. . Assume and non-empty. Then, we have

for any and .

Example 1. Consider the bilinear saddle-point problem

In this case, we have that , , , , and where is the operator norm of the matrix . The specific shape of the operator norm depends on the norms we use on and . For example, if we choose the Euclidean norm on both and , the operator norm of is the largest singular value of . On the other hand, if and as in the two-person zero-sum games, then the operator norm of a matrix is the maximum absolute value of the entries of .
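To make the two cases concrete, here is a minimal NumPy sketch that computes both operator norms for a small matrix. The matrix `A` and the random probes are made up for illustration, and the L1 case assumes L1 norms on both spaces (so that the dual norm on the first one is the max norm).

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, -2.0, 0.5],
              [0.0,  3.0, -1.0]])   # arbitrary example matrix

# Euclidean norms on both spaces: operator norm = largest singular value.
op_norm_l2 = np.linalg.norm(A, 2)

# L1 norms on both spaces (e.g., probability simplices): |x^T A y| is at
# most max |A_ij| * ||x||_1 * ||y||_1, with equality at basis vectors.
op_norm_l1 = np.abs(A).max()

# Sanity check on random probes normalized in the corresponding norm.
for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=3)
    x2, y2 = x / np.linalg.norm(x), y / np.linalg.norm(y)
    x1, y1 = x / np.abs(x).sum(), y / np.abs(y).sum()
    assert abs(x2 @ A @ y2) <= op_norm_l2 + 1e-12
    assert abs(x1 @ A @ y1) <= op_norm_l1 + 1e-12

print(op_norm_l2, op_norm_l1)
```

The two numbers can differ substantially, which is one reason to pick the norms that match the geometry of the feasible sets.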

**2. Prescient Online Mirror Descent and Be-The-Regularized-Leader**

The above result is interesting from a game-theoretic point of view, because it shows that two players can converge to an equilibrium without any “communication”. If instead we only care about converging to the saddle-point, we can easily do better. In fact, we can use the fact that it is fine if one of the two players “cheats” by looking at the loss at the beginning of each round, making its regret non-positive.

For example, we saw the use of Best Response. However, Best Response only guarantees non-positive regret, while for the optimistic proof above we need some specific negative terms. This is not only an artifact of the proof: Best Response is very unstable and it would ruin the “stabilization loop” we have discussed above. It turns out there is an alternative: Prescient Online Mirror Descent, that predicts in each round with . We can interpret it as a conservative version of Best Response that trades off the best response with the distance from its previous prediction.
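The difference between the two strategies is easy to see on a toy problem. The sketch below assumes a linear loss over the probability simplex and the entropic Bregman divergence (the KL divergence), for which the prescient update has the closed form "previous prediction times exp(-eta * loss), renormalized". The function names, the loss vector, and the value of `eta` are all made up for illustration.

```python
import numpy as np

def best_response(loss_vec):
    """Minimizer of the linear loss <loss_vec, x> over the simplex: a vertex."""
    x = np.zeros_like(loss_vec)
    x[np.argmin(loss_vec)] = 1.0
    return x

def prescient_eg_step(x_prev, loss_vec, eta):
    """argmin over the simplex of <loss_vec, x> + KL(x || x_prev) / eta."""
    logits = np.log(x_prev) - eta * loss_vec
    w = np.exp(logits - logits.max())
    return w / w.sum()

def objective(x, x_prev, loss_vec, eta):
    """Current loss plus the scaled KL divergence from the previous prediction."""
    pos = x > 0  # convention: 0 * log 0 = 0
    kl = np.sum(x[pos] * np.log(x[pos] / x_prev[pos]))
    return loss_vec @ x + kl / eta

x_prev = np.ones(3) / 3
loss_vec = np.array([1.0, 0.0, 2.0])   # current loss, known before playing
eta = 0.5

x_br = best_response(loss_vec)
x_new = prescient_eg_step(x_prev, loss_vec, eta)
print(x_br, x_new)
```

Best Response jumps to a vertex of the simplex, while the prescient step only shifts mass towards the best action. As `eta` grows the two predictions get closer.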

Theorem 3. Let be differentiable in , closed, and strictly convex. Let be a non-empty closed convex set. Assume , subdifferentiable in , and , for . Then, , the following inequality holds

Moreover, if is constant, i.e., , we have

*Proof:* From the first-order optimality condition on the update, we have that there exists such that

Hence, we have

where in the last equality we used the 3-points equality for Bregman divergences. Dividing by and summing over , we have

where we denoted by .

The second statement is left as an exercise.

The regret of Prescient Online Mirror Descent contains the negative terms we needed from the optimistic algorithms.

Analogously, we can obtain a version of FTRL that uses the knowledge of the current loss: Be-The-Regularized-Leader (BTRL), that predicts in each time step with . In the case that , Be-The-Regularized-Leader becomes the Be-The-Leader algorithm. BTRL can be thought of as Optimistic FTRL where . Hence, from the regret of Optimistic FTRL, we immediately have the following theorem.

Theorem 4. Let be convex, closed, and non-empty. Assume for that is proper, closed, and -strongly convex w.r.t. . Then, for all we have

Remark 2. In the Be-The-Leader algorithm, if all the , then the theorem states that the regret is non-positive.

Notably, the non-negative gradient terms are missing in the bound of BTRL, but we still have the negative ones associated with the change in .

Using, for example, BTRL for the -player and Optimistic FTRL for the -player, we have

**3. Code and Experiments**

This time I will also show some empirical experiments. In fact, I decided to write a small online learning library in Python, to quickly test old and new algorithms. It is called PyOL (Python Online Learning) and you can find it on GitHub and on PyPI, and install it with pip. I designed it in a modular way: you can use FTRL or OMD and choose the projection you want, the learning rates, the hints, etc. I implemented some online learning algorithms, projections, learning rate, reductions, but I plan to add more. At the moment there is no documentation, but I plan to add it and probably I’ll also blog about it.

The Python notebook below will show the effect of optimism in Exponentiated Gradient when used to solve a 2×2 bilinear saddle-point problem with simplex constraints. You can see that the optimistic algorithm converges faster, for both the last and the averaged solutions. Moreover, even if we did not prove it, the last iterate of the optimistic algorithm converges to the saddle-point, while the one of the non-optimistic algorithm goes farther and farther away from the saddle-point.
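As a rough, self-contained stand-in for that experiment, here is a minimal NumPy sketch. It does not use PyOL: it implements Exponentiated Gradient directly in its FTRL form with a fixed learning rate, and the optimistic variant adds the previously observed gradient as a hint. The payoff matrix, the learning rate, and the number of rounds are made-up choices; the matrix was picked so that its unique saddle-point is (0.4, 0.6) for both players.

```python
import numpy as np

def eg_weights(cum_grad, eta):
    """FTRL with entropic regularizer: weights proportional to exp(-eta * cum_grad)."""
    z = -eta * cum_grad
    w = np.exp(z - z.max())
    return w / w.sum()

def duality_gap(A, x, y):
    """max over y' of x^T A y' minus min over x' of x'^T A y, on the simplices."""
    return (A.T @ x).max() - (A @ y).min()

def solve(A, T, eta, optimistic):
    n, m = A.shape
    Gx, Gy = np.zeros(n), np.zeros(m)     # sums of past gradients
    hx, hy = np.zeros(n), np.zeros(m)     # hints on the next gradients
    x_avg, y_avg = np.zeros(n), np.zeros(m)
    for _ in range(T):
        x = eg_weights(Gx + hx, eta)
        y = eg_weights(Gy + hy, eta)
        gx, gy = A @ y, -A.T @ x          # the second player minimizes -x^T A y
        Gx += gx
        Gy += gy
        if optimistic:
            hx, hy = gx, gy               # hint = previously observed gradient
        x_avg += x / T
        y_avg += y / T
    return x, y, x_avg, y_avg

A = np.array([[1.0, -0.5], [-0.5, 0.5]])  # made-up 2x2 game
x_star = y_star = np.array([0.4, 0.6])    # its saddle-point

results = {}
for optimistic in (False, True):
    x, y, x_avg, y_avg = solve(A, T=5000, eta=0.125, optimistic=optimistic)
    results[optimistic] = {
        "gap_avg": duality_gap(A, x_avg, y_avg),
        "last_dist": np.linalg.norm(x - x_star) + np.linalg.norm(y - y_star),
    }
    print("optimistic" if optimistic else "plain", results[optimistic])
```

The quantity `last_dist` is the distance of the last iterate from the saddle-point, while `gap_avg` is the duality gap of the averaged iterates.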


That’s all for this time!

We won’t see other saddle-point results for a while, time to cover new topics.

**4. History Bits**

Daskalakis et al. (2011) proposed the first no-regret algorithm that achieved a rate of for the duality gap when used by the two players of a zero-sum game without any communication between the players. However, the algorithm was rather complex and they posed the problem of obtaining the same or faster rate with a simpler algorithm. Rakhlin and Sridharan (2013) solved this problem showing that two Optimistic OMD algorithms can solve the problem in a simpler way, proving a version of Theorems 1 and 2. The possibility to achieve constant regret for each player observed after Theorem 1 is from Luo (2022).

The use of Prescient Online Mirror Descent in saddle-point optimization is from Wang et al. (2021), but renaming to it is also equivalent to implicit online mirror descent (Kivinen and Warmuth, 1997; Kulis and Bartlett, 2010). In fact, Theorem 3 is from the guarantee of implicit online mirror descent in Campolongo and Orabona (2020).

There is also a tight connection between optimistic updates using the previous gradients and classic approaches to solve saddle-point optimization. In fact, Gidel et al. (2019) showed that using two optimistic gradient descent algorithms to solve a saddle-point problem can be seen as a variant of the Extra-gradient updates (Korpelevich, 1976), while Mokhtari et al. (2020) showed that they can be interpreted as an approximate proximal point algorithm.

Regarding the convergence of the iterates of optimistic algorithms, Daskalakis et al. (2018) proved the convergence of the last iterate to a neighborhood of the saddle-point in the unconstrained case when using two optimistic online gradient descent algorithms with fixed and small enough stepsizes. Liang and Stokes (2019) improved their result showing that if in addition the matrix is square and full-rank then the iterates of two optimistic online gradient descent algorithms will converge exponentially fast to the saddle-point . Later, Daskalakis and Panageas (2019) proved the asymptotic convergence of optimistic OMD/FTRL EG with fixed stepsize for bilinear games over the probability simplex, assuming a unique saddle-point. Wei et al. (2021) proved an exponential rate for the same algorithm under the same assumptions. Finally, Lee et al. (2021, Theorem 4) proved that the iterates of Optimistic OMD with a constant and small enough learning rate asymptotically converge to a saddle-point, without assuming a unique saddle-point.

**Acknowledgements**

Thanks to Haipeng Luo and Aryan Mokhtari for comments and references.

The post will be short: the algorithm is straightforward, especially after knowing Optimistic Follow-The-Regularized-Leader (FTRL), and the proof is the simplest one I know. Yet, there are a few interesting bits that one can extract from this short proof.

**1. Optimistic OMD**

I have already discussed the Optimistic version of FTRL and shown how the proof is immediate once we change the regularizers. By immediate, I mean that it is just the FTRL regret equality and a telescopic sum over the hints. Here, I’ll show that a regret bound for Optimistic OMD can be proven in the exact same way.

I always found previous proofs of optimistic OMD unnecessarily complex. Instead, here we show that the proof is just the usual OMD proof applied to a different sequence of subgradients and a telescopic sum: easy peasy! I want to stress that this is not a “trick”, but the very essence of the Optimistic OMD algorithm.

First, let’s introduce the Optimistic OMD algorithm, see Algorithm 1. Here, at round the algorithm receives a hint on the next subgradient and uses it to construct the update. At the same time, you have to remove the hint you used at the previous time step, . (Note that this is the more recent one-update-per-step variant of Optimistic OMD, rather than the original two-updates-per-step Optimistic OMD, see the History Bits.)

To gain some intuition on why this update makes sense, consider the case that , , and . In this case, . Unrolling the update, we get . Without hints, that is in plain OMD, under the same assumptions the unrolled update would be and . Hence, acts as a proxy for the next (unknown) subgradient .
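A quick numerical check of the unrolling can help here. The sketch below assumes the unconstrained Euclidean setting with a constant learning rate and a zero hint in the first round, and it uses the update "subtract the current subgradient, remove the old hint, add the new hint" described above. The subgradients and hints are random numbers made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, eta = 20, 3, 0.1

g = rng.normal(size=(T, d))          # g[t-1] plays the role of the subgradient at round t
hint = rng.normal(size=(T + 1, d))   # hint[t-1] is the hint used at round t
hint[0] = 0.0                        # no hint is available in the first round

x1 = np.zeros(d)
x = x1.copy()
for t in range(1, T + 1):
    # Optimistic step: use the new subgradient, remove the old hint, add the new one.
    x = x - eta * (g[t - 1] - hint[t - 1] + hint[t])

# Unrolled form: all past hints telescope away, only the newest one survives.
unrolled = x1 - eta * (g.sum(axis=0) + hint[T])

# Plain (non-optimistic) update under the same assumptions.
plain = x1 - eta * g.sum(axis=0)

print(np.allclose(x, unrolled), np.allclose(x - plain, -eta * hint[T]))
```

The only difference from the plain update is the single extra term given by the latest hint, which is why the hint acts as a stand-in for the next, still unknown, subgradient.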

Note that one might be tempted to multiply by , because in the previous iteration we used the learning rate . However, the online learning proof reveals that the correct way to see the update is to think of the learning rate as attached to the Bregman divergence rather than to the subgradients. (Things might be different in the batch and stochastic setting, where the best proofs deal with the learning rates in a slightly different way.)

One might also be tempted to find a way to study this algorithm with a special proof. However, the one-step lemma we proved for OMD is essentially tight: we only used two inequalities, one to deal with the set and the other one to linearize the losses. But both steps can be made tight, considering and linear losses. Hence, if the update is just OMD with a different sequence of subgradients, the proof *must* follow from the one of OMD with a different set of subgradients. This is a general rule: If we have a theorem based on a tight inequality, any other proof of the same theorem, no matter how complex, must be looser or in the best case equivalent. On the other hand, some (fake) complexity might help you with some reviewers, but this is another story…

Theorem 1. Let be the Bregman divergence w.r.t. and assume to be proper, closed, and -strongly convex with respect to in . Let be a non-empty closed convex set. With the notation in Algorithm 1, assume exists, and it is in .

Assume . Then, and , the following regret bounds hold

Moreover, if is constant, i.e., , we have

*Proof:* The proof closely follows the one of Lemma 4 in the OMD proof with , so we only show the different steps. First of all, changing the sequence of subgradients, we immediately have

Summing over the l.h.s., we obtain

Summing the last terms on the r.h.s., we have that

Finally, observe that

Given that the loss on the rounds does not depend on , we can safely set it to . Putting all together, we have the stated bounds.

To obtain the second inequalities, we use the strong convexity of as in the proof of OMD.

These regret bounds are essentially the same as those of Optimistic FTRL, minus the intrinsic differences between FTRL and OMD. In particular, the constant factors are also the same. Hence, results similar to the ones that we proved for Optimistic FTRL can be proved for Optimistic OMD.

As said above, we will use this result to show how to accelerate the optimization of smooth saddle-point problems.

**2. History Bits**

The original Optimistic OMD was proposed in Rakhlin and Sridharan (2013) with two-updates-per-step. Later, Joulani et al. (2017) showed that the same bounds could be obtained with a simpler one-update-per-step version of Optimistic OMD, that is the version I describe here. The proof I present here is based on the one I proposed for Optimistic FTRL.

**1. Game-Theory interpretation of Saddle-Point Problems**

An instantiation of saddle-point problems also has an interpretation in Game Theory as a *Two-player Zero-Sum Game*. Note that Game Theory is a vast field and two-person zero-sum games are only a very small subset of the problems in this domain, and what I describe here is an even smaller subset of this subset of problems.

Game theory studies what happens when self-interested agents interact. By self-interested, we mean that each agent has an ideal state of things he wants to reach, that can include a description of what should happen to other agents as well, and he works towards this goal. In two-person games the players act simultaneously and then they receive their losses. In particular, the -player chooses the play and the -player chooses the play , the -player suffers the loss and the -player the loss . It is important to understand that this is only one round, that is, there is only one play for each player. Note that the standard game-theoretic terminology uses payoffs instead of losses, but we will keep using losses for coherence with the online convex optimization notation that I use on this blog.

We consider the so-called *two-person normal-form games*, that is when the first player has possible actions and the second player . A player can use a *pure strategy*, that is a single fixed action, or randomize over a set of actions according to some probability distribution, a so-called *mixed strategy*. In this case, we consider and and they are known as the *action spaces* for the two players. In this setting, for a pair of pure strategies , the first player receives the loss and the second player , where is the vector with all zeros but a ‘1’ in position . The goal of each player is to minimize the received loss. Given the discrete nature of this game, the function is the bilinear function , where is a matrix with rows and columns. Hence, for a pair of mixed strategies , the *expected* loss of the first player is and the one of the second player is .

A fundamental concept in game theory is the one of *Nash Equilibrium*. We have a Nash equilibrium if all players are playing their best strategy to the other players’ strategies. That is, none of the players has incentive to change their strategy if the other player does not change it. For the zero-sum two-person game, this can be formalized saying that is a Nash equilibrium if

This is *exactly* the definition of saddle-point for the function that we gave last time. Given that is continuous, convex in the first argument and concave in the second one, and the sets and are convex and compact, we can deduce from Theorem 5 and Theorem 3 from my previous post that a saddle-point always exists. Hence, there is always at least one (possibly mixed) Nash equilibrium in two-person zero-sum games. The common value of the minimax and maxmin problem is called *value of the game* and we will denote it by .

For zero-sum two-person games, the Nash equilibrium has an immediate interpretation: From the definition above, if the first player uses the strategy then his loss is at most the value of the game , regardless of the strategy of the second player. Analogously, if the second player uses the strategy then his loss is at most , regardless of the strategy of the first player. Both players achieve the value of the game if they both play the Nash strategy. Moreover, even if one of the players announced his strategy in advance to the other player, he would not increase his loss in expectation.

Example 1 (Cake cutting). Suppose we have a game between two people: The first player cuts the cake in two and the second one chooses a piece; the first player receives the piece that was not chosen. We can formalize it with the following matrix

| | larger piece | smaller piece |
|---|---|---|
| cut evenly | 0 | 0 |
| cut unevenly | 10 | -10 |

When the first player plays action and the second player action , the first player receives the loss and the second player receives . The losses represent how much less in percentage compared to half of the cake the first player is receiving. The second player receives the negative of the same number. It should be easy to convince oneself that the strategy pair is an equilibrium with value of the game of 0.

Example 2 (Rock-paper-scissors). Let’s consider the game of Rock-paper-scissors. We describe it with the following matrix

| | Rock | Paper | Scissors |
|---|---|---|---|
| Rock | 0 | 1 | -1 |
| Paper | -1 | 0 | 1 |
| Scissors | 1 | -1 | 0 |

It should be immediate to realize that there are no pure Nash equilibria for this game. However, there is a mixed Nash equilibrium when each player randomizes the action with a uniform probability distribution, and the value of the game is equal to 0.

Example 3 (Matching Pennies). In this game, both players show a face of a penny. If the two faces are the same, the first player wins both, otherwise the second player wins both. The associated matrix is

| | head | tail |
|---|---|---|
| head | -1 | 1 |
| tail | 1 | -1 |

It is easy to see that the Nash equilibrium is when both players randomize the face to show with equal probability.
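The claims in the three examples above can be checked mechanically. The sketch below is a minimal illustration that assumes the matrix entries are the losses of the first (row) player, who minimizes, while the second (column) player maximizes. The helper names `pure_equilibria` and `duality_gap` are made up for this example.

```python
import numpy as np

def pure_equilibria(A):
    """Pairs (i, j) where row i minimizes column j and column j maximizes row i."""
    return [(i, j)
            for i in range(A.shape[0]) for j in range(A.shape[1])
            if A[i, j] == A[:, j].min() and A[i, j] == A[i, :].max()]

def duality_gap(A, x, y):
    """Zero if and only if the mixed strategies (x, y) form a Nash equilibrium."""
    return (A.T @ x).max() - (A @ y).min()

cake = np.array([[0, 0], [10, -10]], dtype=float)
rps = np.array([[0, 1, -1], [-1, 0, 1], [1, -1, 0]], dtype=float)
pennies = np.array([[-1, 1], [1, -1]], dtype=float)

print(pure_equilibria(cake))      # rows: cut evenly / unevenly, columns: larger / smaller piece
print(pure_equilibria(rps), pure_equilibria(pennies))

uniform3, uniform2 = np.ones(3) / 3, np.ones(2) / 2
print(duality_gap(rps, uniform3, uniform3), duality_gap(pennies, uniform2, uniform2))
```

In the cake-cutting game the check finds the single pure pair "cut evenly, take the larger piece", while the other two games only have the uniform mixed equilibrium.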

In this simple case, we can visualize the saddle point associated to this problem in Figure 1.

Unless the game is very small, we find Nash equilibria using numerical procedures that typically give us only approximate solutions. Hence, as for -saddle points, we also define an *-Nash equilibrium* for a zero-sum two-person game when and satisfy

Obviously, any Nash equilibrium is also an -Nash equilibrium.

From what we said last time, it should be immediate to see how to numerically calculate the Nash equilibrium of a two-person zero-sum game. In fact, we know that we can use online convex optimization algorithms to find -saddle points, so we can do the same for -Nash equilibria of two-person zero-sum games. Assuming that the average regret of both players is , the theorem we saw last time says that is a -Nash equilibrium.

**2. Boosting as a Two-Person Game**

We will now show that the Boosting problem can also be seen as the solution of a zero-sum two-person game. In fact, this is my favorite explanation of Boosting because it focuses on the problem rather than on a specific algorithm.

Let be a training set of feature/label pairs, where and . Let be a set of functions .

The aim of boosting is to find a combination of the functions in that has arbitrarily low misclassification error on . Of course, this is not always possible. However, we will make a (strong) assumption: It is always possible to find a function such that its misclassification error over weighted with any probability distribution is better than chance by a constant . We now show that this assumption guarantees that the boosting problem is solvable!

First, we construct a matrix of the misclassifications for each function: where . Setting and , we write the saddle-point problem/two-person zero-sum game

Given the definition of the matrix , this is equivalent to

Let’s now formalize the assumption on the functions: We assume the existence of a *weak learning oracle* that, for any , returns such that

where . In words, is the index of the function in that gives a -weighted error better than chance. Moreover, given that

and

we have that the value of the game satisfies . Using the inequality on the value of the game and the fact that the Nash equilibrium exists, we obtain that there exists such that for any

In words, this means that every sample is misclassified by less than half of the functions when weighted by . Hence, *we can correctly classify all the samples using a weighted majority vote rule where the weights over the function are .* This means that we can learn a perfect classifier rule using weak learners, through the solution of a minimax game. So, our job is to find a way to calculate this optimal distribution on the functions.

Given what we have said till now, a natural strategy is to use online convex optimization algorithms. In particular, we can use Algorithm 3 from my previous post, where in each round the -player is the weak learning oracle, that knows the play by the -Learner, while the -player is an Online Convex Optimization (OCO) algorithm. Specialized to our setting we have the following algorithm. In words, the -player looks for the function that has small enough weighted misclassification loss, where the weights are constructed by the -player.

Let’s show a guarantee on the misclassification error of this algorithm. From the second equality of Theorem 6 in my previous post, for any , we have

From the assumption on the weak-learnability oracle, we have . Moreover, choosing we have . Putting all together, for any , we have

If , less than half of the functions selected by the boosting procedure will make a mistake on . Given that the predictor is a majority rule, this means that the majority rule will make 0 mistakes on the training samples.

In this scheme, we construct by approximating it with the frequency with which is equal to .

Let’s now instantiate this framework with a specific OCO algorithm. For example, using Exponentiated Gradient (EG) as the algorithm for the -player, we have that for any , which implies that after rounds the training error is exactly 0. This is exactly the same guarantee achieved by AdaBoost (Freund and Schapire, 1997).
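The whole reduction fits in a few lines of NumPy. The sketch below is a toy instantiation under my own assumptions: a made-up 1-D dataset, a hypothetical class of threshold stumps, Exponentiated Gradient with a fixed learning rate for the player that weights the samples, and a best-response search over the stumps playing the role of the weak learning oracle. On this dataset a vote of three stumps misclassifies each sample with weight 1/3, so the weak learning assumption holds with an edge of 1/6.

```python
import numpy as np

# Toy 1-D data: the labels change sign twice, so no single stump is perfect.
z = np.arange(6, dtype=float)
labels = np.array([1, 1, -1, -1, 1, 1])

# Hypothetical function class: threshold stumps with both polarities.
thresholds = np.arange(-0.5, 6.0, 1.0)
stumps = [(th, s) for th in thresholds for s in (1, -1)]

def predict(stump, z):
    th, s = stump
    return s * np.where(z > th, 1, -1)

# M[i, j] = 1 if function j misclassifies sample i, 0 otherwise.
M = np.stack([predict(h, z) != labels for h in stumps], axis=1).astype(float)

N, T = len(z), 500
eta = np.sqrt(8 * np.log(N) / T)   # fixed learning rate tuned for T rounds
cum = np.zeros(N)                  # number of mistakes on each sample so far
chosen = []
for _ in range(T):
    # EG over the samples: more weight on samples misclassified more often.
    q = np.exp(eta * (cum - cum.max()))
    q /= q.sum()
    # Weak learner: the function with the smallest q-weighted error.
    j = int(np.argmin(q @ M))
    chosen.append(j)
    cum += M[:, j]

# Majority vote over the functions selected in the T rounds.
votes = sum(predict(stumps[j], z) for j in chosen)
train_error = np.mean(np.sign(votes) != labels)
print(train_error, cum.max() / T)
```

The second printed number is the largest fraction of rounds in which a single sample was misclassified. The majority vote is correct on every sample exactly when this fraction is below one half.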

**Boosting and Margins** What happens if we keep boosting after the training error reaches 0? It turns out we maximize the *margin*, defined as . In fact, given that , we have for any

Hence, when the number of rounds goes to infinity the minimum margin on the training samples reaches . This property of boosting of maximizing the margin has been used as an explanation of the fact that in boosting additional rounds after the training error reaches 0 often keep improving the test error on test samples coming i.i.d. from the same distribution that generated the training samples.

The above reduction does not tell us how the training error precisely behaves. However, we can get this information by changing the learning with experts algorithm. Indeed, if you read my old posts you should know that there are learning with experts algorithms provably better than EG. We can use a learning with experts algorithm that guarantees a regret for a given prior , where is the uniform prior. This kind of algorithm allows us to upper bound the fraction of mistakes after any iterations. Denote by the number of misclassified samples after iterations of boosting and set as the vector whose coordinates are equal to if misclassifies it. Hence, we have

Using the expression of the regret, we have that the fraction of misclassification errors is bounded by . That is, the fraction of misclassification errors goes to zero exponentially fast in the number of boosting rounds. Again, a similar guarantee was proved for AdaBoost, but here we quickly derived it using a reduction from boosting to learning with experts, passing through two-person games.

**3. History Bits**

The reduction from boosting to two-person games and learning with experts is from Freund and Schapire (1996). It seems that the question of whether a weak learner can be boosted into a strong learner was originally posed by Kearns and Valiant (1988) (see also Kearns (1988)) but I could not verify this claim because I could not find their technical report anywhere. It was answered in the positive by Schapire (1990). The AdaBoost algorithm is from Freund and Schapire (1997). The idea of using algorithms that guarantee a KL regret bound in (2) is from Luo and Schapire (2014).

**Acknowledgements**

Thanks to Nicolò Campolongo for hunting down my typos.

In this post, we talk about solving saddle-point problems with online convex optimization (OCO) algorithms. Next time, I’ll show the connection to two-person zero-sum games and boosting.

**1. Saddle-Point Problems**

We want to solve the following saddle-point problem

Let’s say from the beginning that we need inf and sup rather than min and max because the minimum or maximum might not exist. Every time we know for sure the inf/sup are attained, I’ll substitute them with min/max.

While for the minimization of functions it is clear what it means to solve it, it might not be immediate to see what is the meaning of “solving” the saddle-point problem in (1). It turns out that the proper notion we are looking for is the one of saddle-point.

Definition 1. Let , , and . A point is a saddle-point of in if

We will now state conditions under which there *exists* a saddle-point that solves (1). First, we need an easy lemma.

Lemma 2. Let be any function from a non-empty product set to . Then,

*Proof:* For any and we have that . This implies that for all that gives the stated inequality.
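A tiny numerical illustration of the lemma, under the assumption that both sets are finite so that the function is just a table of values. The first table is made up (it is the matching-pennies pattern) and shows that the inequality can be strict. The loop then checks the inequality on random tables.

```python
import numpy as np

# F[i, j] is the value of the function at the i-th first argument and
# the j-th second argument (a made-up table).
F = np.array([[-1.0, 1.0], [1.0, -1.0]])

inf_sup = F.max(axis=1).min()   # inf over the first argument of sup over the second
sup_inf = F.min(axis=0).max()   # sup over the second argument of inf over the first
print(inf_sup, sup_inf)         # 1.0 -1.0

# The inequality holds for any table, here checked on random ones.
rng = np.random.default_rng(0)
for _ in range(1000):
    G = rng.normal(size=(4, 5))
    assert G.max(axis=1).min() >= G.min(axis=0).max()
```

Here the two values differ because the players are restricted to finitely many choices. No convexity is involved in the lemma itself.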

We can now state the following theorem.

Theorem 3. Let be any function from a non-empty product set to . A point is a saddle-point of if and only if the supremum in

is attained at , the infimum in

is attained at , and these two expressions are equal.

*Proof:* If is a saddle-point, then we have

From Lemma 2, we have that these quantities must be equal, so that the three conditions in the theorem are satisfied.

For the other direction, if the conditions are satisfied, then we have

Hence, is a saddle-point.

Remark 1. The above theorem tells us something surprising: If a saddle-point exists, then there might be multiple ones, and all of them must have the same minimax value. This might seem odd, but it is due to the fact that the definition of saddle-point is a global and not a local property. Moreover, if the and problems have different values, no saddle-point exists.

Remark 2. Consider the case that the values of the and are different. If your favorite optimization procedure to find the solution of a saddle-point problem does not distinguish between a and formulation, then it cannot be correct!

Let’s show a couple of examples that show that the conditions above are indeed necessary.

Example 1. Let , , and . Then, we have

while

Indeed, from Figure 1 we can see that there is no saddle-point.

Example 2. Let , , and . Then, we have

and

Here, even if inf sup is equal to sup inf, the saddle-point does not exist because the inf in the first expression is not attained in a point of .

This theorem also tells us that in order to find a saddle-point of a function , we need to find the minimizer in of and the maximizer in of . Let’s now use this knowledge to design a proper measure of progress towards the saddle-point.

We might be tempted to use as a measure of suboptimality of with respect to the saddle-point . Unfortunately, this quantity can be negative or equal to zero for an infinite number of points that are not saddle-points. We might then think to use some notion of distance to the saddle-point, like , but this quantity in general can go to zero at an arbitrarily slow rate. To see why, consider the case that , so that the saddle-point problem reduces to minimizing a convex function. So, assuming only convexity, the rate of convergence to a minimizer of can be arbitrarily slow. Hence, we need something different.

Observe that Theorem 3 says one of the problems we should solve is

where . In this view, the problem looks like a standard convex optimization problem, where the objective function has a particular structure. Moreover, in this view we only focus on the variables . The standard measure of convergence in this case for a point , the *suboptimality gap*, can be written as

We also have to find the maximizer in of , hence we have

where . This also implies another measure of convergence in which we focus only on the variable :

Finally, in case we are interested in studying the quality of a joint solution , a natural measure is a sum of the two measures above:

where we assumed the existence of a saddle-point to say that from Theorem 3. This measure is called *duality gap*. From the definition it should be clear that the duality gap is always non-negative. Let’s add even more intuition of the duality gap definition, using the one of -saddle-point.

Definition 4. Let , , and . A point is a -saddle-point of in if

This definition is useful because we cannot expect to numerically calculate a saddle-point with infinite precision, but we can find something that satisfies the saddle-point definition up to a . Obviously, any saddle-point is also an -saddle-point.

Now, the notion of -saddle-point is equivalent up to a multiplicative constant to the duality gap. In fact, it is easy to see that if is an -saddle-point then its duality gap is . In turn, a duality gap of and the existence of a saddle-point imply that the point is a -saddle.
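For completeness, here is a short worked derivation of both directions. All symbols are my own naming: I assume that the $\epsilon$-saddle-point condition for a point $(x', y')$ has the usual two-sided form in the first line, and that the duality gap is the difference between $\sup_{y} f(x', y)$ and $\inf_{x} f(x, y')$.

```latex
% Assumed form of the \epsilon-saddle-point condition for (x', y'):
f(x', y) - \epsilon \le f(x', y') \le f(x, y') + \epsilon
  \qquad \forall x \in X, \ \forall y \in Y .

% Taking the sup over y on the left and the inf over x on the right:
\sup_{y \in Y} f(x', y) - \inf_{x \in X} f(x, y')
  \le \bigl( f(x', y') + \epsilon \bigr) - \bigl( f(x', y') - \epsilon \bigr)
  = 2 \epsilon .

% Conversely, if the duality gap of (x', y') is at most \epsilon, then for any x and y:
f(x', y) \le \sup_{y \in Y} f(x', y)
  \le \inf_{x \in X} f(x, y') + \epsilon \le f(x', y') + \epsilon ,
\qquad
f(x', y') \le \sup_{y \in Y} f(x', y)
  \le \inf_{x \in X} f(x, y') + \epsilon \le f(x, y') + \epsilon .
```

So, under these assumptions, the two notions match up to a factor of 2.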

The above reasoning told us that finding the saddle-point of the function $f$ is equivalent to solving a maximization problem and a minimization problem. However, is it always the case that a saddle-point exists? Let’s now move to easily checkable sufficient conditions for the existence of a saddle-point. For this, we can state the following theorem.

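Theorem 5. Let $X$, $Y$ be compact convex subsets of $\mathbb{R}^n$ and $\mathbb{R}^m$ respectively. Let $f: X \times Y \to \mathbb{R}$ be a continuous function, convex in its first argument, and concave in its second. Then, we have that

$$\min_{x \in X} \max_{y \in Y} f(x, y) = \max_{y \in Y} \min_{x \in X} f(x, y).$$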

This theorem gives us sufficient conditions to have the min-max problem equal to the max-min one. So, for example, thanks to the Weierstrass theorem, the assumptions in Theorem 5, in light of Theorem 3, are sufficient conditions for the existence of a saddle-point.

We defer the proof of this theorem for a bit and we now turn to ways to solve the saddle-point problem in (1).

**2. Solving Saddle-Point Problems with OCO **

Let’s show how to use Online Convex Optimization (OCO) algorithms to solve saddle-point problems.

Suppose we use an OCO algorithm fed with losses $\ell_t(x) = f(x, y_t)$ that produces the iterates $x_t$ and another OCO algorithm fed with losses $h_t(y) = -f(x_t, y)$ that produces the iterates $y_t$. Then, we can state the following theorem.

Theorem 6. Let $f: X \times Y \to \mathbb{R}$ be convex in the first argument and concave in the second. Then, with the notation in Algorithm 2, for any $x \in X$, we have

where.

For any $y \in Y$, we have

where.

Also, if $f$ is continuous and $X$, $Y$ are compact, we have

for any and .

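*Proof:* The first two equalities are obtained by simply observing that the loss of the $x$-player at $x_t$ is $f(x_t, y_t)$ and the loss of the $y$-player at $y_t$ is $-f(x_t, y_t)$.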

For the inequality, using Jensen’s inequality, we obtain

Summing the first two equalities, using the above inequality, and taking and , we get the stated inequality.

From this theorem, we can immediately prove the following corollary.

Corollary 7. Let $f: X \times Y \to \mathbb{R}$ be continuous, convex in the first argument, and concave in the second. Assume that $X$ and $Y$ are compact and that the $x$-player and $y$-player use no-regret algorithms, possibly different. Then

Example 3.Consider the saddle-point problem

The saddle-point of this problem is. We can find it using, for example, Projected Online Gradient Descent. So, setting, we have the iterations

According to Theorem 6, the duality gap in converges to 0.
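To see Theorem 6 at work, here is a minimal Python sketch of two Projected Online Gradient Descent players run against each other. The instance $f(x, y) = (x - 1)(y - 1)$ over $[-2, 2]^2$, the constant learning rate, the starting points, and all the function names are assumptions made for illustration; they are not necessarily the ones of the example above.

```python
import math

LO, HI = -2.0, 2.0  # assumed domain [-2, 2] for both players


def f(x, y):
    # Assumed bilinear objective, with saddle-point (1, 1)
    return (x - 1.0) * (y - 1.0)


def project(v):
    # Euclidean projection onto [LO, HI]
    return max(LO, min(HI, v))


def duality_gap(x_bar, y_bar):
    # f is linear in each argument, so the max over y and the min over x
    # are attained at the endpoints of the interval
    return max(f(x_bar, LO), f(x_bar, HI)) - min(f(LO, y_bar), f(HI, y_bar))


def solve_simultaneous(T, x1=-2.0, y1=2.0):
    # Constant step D / (G sqrt(T)), with diameter D = 4 and gradient bound G = 3
    eta = 4.0 / (3.0 * math.sqrt(T))
    x, y = x1, y1
    sum_x = sum_y = 0.0
    for _ in range(T):
        sum_x += x
        sum_y += y
        grad_x = y - 1.0  # gradient of f(., y_t) at x_t
        grad_y = x - 1.0  # gradient of f(x_t, .) at y_t
        # Both players move at the same time: descent for x, ascent for y
        x, y = project(x - eta * grad_x), project(y + eta * grad_y)
    # Return the averages of the played iterates
    return sum_x / T, sum_y / T


x_bar, y_bar = solve_simultaneous(10_000)
print(x_bar, y_bar, duality_gap(x_bar, y_bar))
```

With this step size the standard OGD analysis bounds each regret by $12\sqrt{T}$, so the duality gap of the averaged pair is at most $24/\sqrt{T}$. Note that the guarantee is on the averages, not on the last iterates.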

Surprisingly, we can even prove a (simpler version of the) minimax theorem from the above result! In particular, we will use the additional assumption that there exist OCO algorithms that minimize and have sublinear regret.

*Proof with OCO assumption:* From Lemma 2, we have one inequality. Hence, we now have to prove the other inequality.

We will use a constructive proof. Let’s use Algorithm 2 and Theorem 6. For the first player, for any we have

Observe that

Hence, using an OCO algorithm that has regret for each , we have

In the same way, we have

Summing the two inequalities, taking , and using the sublinear regret assumption, we have

** 2.1. Variations with Best Response and Alternation **

In some cases, it is easy to compute the max with respect to $y$ of $f(x_t, y)$ for a given $x_t$. For example, this is trivial for bilinear games over the simplex. In these cases, we can remove the second learner and just use its *best response* in each round. Note that in this way we are making one of the two players “stronger” through the knowledge of the loss of the next round. However, this is perfectly fine: the proof of Theorem 6 is still valid.

In this case, the $y$-player has an easy life: it knows the loss before making the prediction, hence it can just output the minimizer of the loss in $Y$. Hence, the regret of the $y$-player will be non-positive and it will not show up in Theorem 6. Putting all together, we can state the following corollary.

Corollary 8. Let $f: X \times Y \to \mathbb{R}$ be continuous, convex in the first argument, and concave in the second. Assume $X$ compact and that the argmax of the $y$-player is never empty. Then, with the notation in Algorithm 2, we have

for any and where .

This alternative seems interesting from a theoretical point of view because it allows us to avoid the complexity of learning in the $Y$ space, for example removing the dependency on its dimension. Of course, an analogous result can be stated using best-response for the $x$-player and an OCO algorithm for the $y$-player.

There is a third variant, widely used in empirical implementations, especially of Counterfactual Regret Minimization (CFR) (Zinkevich et al., 2007). It is called *alternation* and it breaks the simultaneous reveal of the actions of the two players. Instead, we use the updated prediction of the first player to construct the loss of the second player. Empirically, this variant seems to greatly speed up the convergence of the duality gap.

For this version, Theorem 6 does not hold anymore because the terms and are now different, however we can prove a similar guarantee.

Theorem 9. Let $f: X \times Y \to \mathbb{R}$ be continuous, convex in the first argument, and concave in the second. Assume that $X$ and $Y$ are compact. Then, with the notation in Algorithm 4, for any $x \in X$ and any $y \in Y$ we have

where and .

*Proof:* Note that , and . By Jensen’s inequality, we have

Taking and , we get the stated result.

Remark 3. In the case that $f$ is linear in the first argument, using OMD for the $x$-player we have that $f(x_{t+1}, y_t) - f(x_t, y_t) \le 0$. Hence, in this case the additional term in Theorem 9 is non-positive, showing a (marginal) improvement to the convergence rate.
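Here is a minimal Python sketch of alternation with two Projected Online Gradient Descent players. As assumptions for illustration, I use the bilinear instance $f(x, y) = (x - 1)(y - 1)$ over $[-2, 2]^2$, a constant learning rate, and I average $x_{t+1}$ and $y_t$, which is the pairing suggested by the Jensen step in the proof; the function names are hypothetical.

```python
import math

LO, HI = -2.0, 2.0  # assumed domain [-2, 2] for both players


def f(x, y):
    # Assumed bilinear objective, with saddle-point (1, 1)
    return (x - 1.0) * (y - 1.0)


def project(v):
    # Euclidean projection onto [LO, HI]
    return max(LO, min(HI, v))


def duality_gap(x_bar, y_bar):
    # Bilinear f: max over y and min over x are attained at the endpoints
    return max(f(x_bar, LO), f(x_bar, HI)) - min(f(LO, y_bar), f(HI, y_bar))


def solve_alternating(T, x1=-2.0, y1=2.0):
    eta = 4.0 / (3.0 * math.sqrt(T))  # D / (G sqrt(T)) with D = 4, G = 3
    x, y = x1, y1
    sum_x_next = sum_y = 0.0
    extra = 0.0  # accumulates f(x_{t+1}, y_t) - f(x_t, y_t)
    for _ in range(T):
        sum_y += y
        # The x-player sees the loss f(., y_t) and updates first
        x_next = project(x - eta * (y - 1.0))
        extra += f(x_next, y) - f(x, y)
        sum_x_next += x_next
        # The y-player sees the loss -f(x_{t+1}, .), built on the updated x
        y = project(y + eta * (x_next - 1.0))
        x = x_next
    return sum_x_next / T, sum_y / T, extra


x_bar, y_bar, extra = solve_alternating(10_000)
print(duality_gap(x_bar, y_bar), extra)
```

On this instance the accumulated extra term is never positive, in line with the remark above: each projected gradient step of the $x$-player can only decrease a loss that is linear in $x$.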

In the next post, we will show how to connect saddle-point problems with Game Theory.

**3. History Bits **

Theorem 3 is (Rockafellar, 1970, Lemma 36.2).

The proof of Theorem 6 is from (Liu and Orabona, 2021) and it is a minor variation on the one in (Abernethy and Wang, 2017): (Liu and Orabona, 2021) stressed the dependency of the regret on a competitor that can be useful for refined bounds, as we will show next time for Boosting. It is my understanding that different variants of this theorem are known in the game theory community as “Folk Theorems”, because such a result was widely known among game theorists in the 1950s, even though no one had published it.

The celebrated minimax theorem for zero-sum two-person games was first discovered by John von Neumann in the 1920s (von Neumann, 1928)(von Neumann and Morgenstern, 1944). The version stated here is a simplification of the generalization due to (Sion, 1958). The proof here is from (Abernethy and Wang, 2017). A similar proof is in (Cesa-Bianchi and Lugosi, 2006) based on a discretization of the space that in turn is based on the one in (Freund and Schapire, 1996)(Freund and Schapire, 1999).

Algorithms 2 and 3 are a generalization of the algorithm for boosting in (Freund and Schapire, 1996)(Freund and Schapire, 1999). Algorithm 2 was also used in (Abernethy and Wang, 2017) to recover variants of the Frank-Wolfe algorithm (Frank and Wolfe, 1956).

It is not clear who invented alternation. Christian Kroer told me that it was a known trick used in implementations of CFR for the computer poker competition from 2010 or so. Note that in CFR the method of choice is Regret Matching (Hart and Mas-Colell, 2000). However, (Kroer, 2020) empirically shows that alternation improves a lot even OGD for solving bilinear games. (Tammelin et al., 2015) explicitly include this trick in their implementation of an improved version of CFR called CFR+, claiming that it would still guarantee convergence. However, (Farina et al., 2019) pointed out that averaging of the iterates in alternation might not produce a solution to the min-max problem, providing a counterexample. Theorem 9 is from (Burch et al., 2019) (see also (Kroer, 2020)).

There is also a complementary view on alternation: (Zhang et al., 2021) link alternating updates to Gauss-Seidel methods in numerical linear algebra, in contrast to the simultaneous updates of the Jacobi method. Also, they provide a good review of the optimization literature on the advantages of alternation, but this paper and the papers they cite do not seem to be aware of the use of alternation in CFR.

** Acknowledgements **

Thanks to Christian Kroer for the history of alternation in CFR, and to Gergely Neu for the references on alternation in the optimization community. Thanks to Nicolò Campolongo for hunting down my typos. Also, thanks to Christian Kroer for his great lecture notes that helped me get started on this topic.

First of all, two months ago I got the official announcement that I got tenure. This is the end of a journey that started in 2003 with a PhD in computer vision for humanoid robotics at the University of Genova (Italy). This was an incredibly stressful journey for a number of reasons: Switching research topic after my PhD, living in Milan with the salary of a post-doc, moving to the US with a PhD from an Italian university, not having a famous American professor as mentor/letter writer, etc.

In “Mostly Harmless” by Douglas Adams, the protagonist Arthur Dent visits a woman seer to receive advice. The woman, who swats at flies in front of a cave and smells horrible, hands photocopies of her life to him suggesting he should live his life *the opposite way* she did so he will not end up living in a rancid cave. I should do the same for my academic life…

However, I cannot complain too much: I did get tenure, I contributed 2 or 3 things that I am proud of, and I do not live in a rancid cave! So, now I am going to use my new tenure powers to explain why I think the expert reviewer is a myth.

**2. The Fairytale of the Expert Reviewer **

There is a common myth in the theory ML community that the main problem with reviews at conferences is that reviewers are not really experts.

First, I think it is hardly controversial that nowadays most of the reviewers do not have the experience and knowledge to judge scientific papers: reviewers at big ML conferences are for the majority young and inexperienced PhD students. Also, because of the Dunning-Kruger effect, they vastly overestimate what they actually know. The final effect is that the average review tends to be a mix of arrogant, meaningless, semi-random critiques. Let me be clear: PhD students from 10 years ago were equally terrible as reviewers. For example, I cringe thinking of the reviews I used to write as a PhD student. However, my terrible inexperienced reviews were also for third-tier conferences. Nowadays, instead, the exponential growth of ML submissions means that we have to enroll *all* the possible reviewers at first-tier venues.

However, I do not want to focus on the young PhD students: Mainly, it is not their fault. I understand that the growth of some community made the reviewing process basically random and nobody has a real solution to it, me neither. Instead, here I want to talk about the **Expert Reviewer**.

You should know that the theory Machine Learning community is convinced it has a better reviewing system. The reason, they say, is that they have the Expert Reviewers. This mythological figure knows all the results in his/her area, has read all the Russian papers from 50 years ago, and thanks to his/her infinite wisdom only accepts the best and finest products of the human intellect. He/she can also check the correctness of 40 pages of appendix purely by laying on hands and can kill any living being just by pronouncing the magic words “this is not surprising”.

Now, the fact that not even in Advanced Dungeons & Dragons can you have a Mage/Paladin with Wisdom and Intelligence at 19 should make you suspect this is not exactly the reality…

**3. The Reality **

Now, let me tell you what actually happens, *most of the time*. The Expert Reviewer typically is someone obsessed with a particular area whose volume tends to 0, but very often with close to 0 knowledge of anything outside of it. You can visualize him/her as a Delta function.

If you work on a less fashionable area, the probability of randomly meeting an Expert Reviewer for your specific sub-topic is close to 0. Moreover, he/she will often refuse to think that anything outside of his/her area is actually of any interest. Finally, due to a weird bug in the symmetric property, the Expert Reviewer is *always* sure to know more than the authors of the paper he/she reviews.

The net effect is that

- most of the time the authors are actually *the only real experts*;
- papers are seldom accepted or rejected based on a deep understanding of what is going on;
- most of the time the most important decision factor is *how much the reviewer likes your topic and/or you*.

Sometimes it does happen that the review process actually works as intended, but I would argue that the above is what happens at least 60% of the time an Expert Reviewer is involved.

Now, I could argue forever on these 3 points above. Instead, thanks to my tenure, I decided to do something that is taboo in academia: I’ll describe the review process of a real paper!

The paper is here and it was accepted at COLT 2017 only thanks to me.

Let me start from the beginning.

**4. The Story of the Unlucky Paper **

First of all, for a long time COLT had a reviewer/subreviewer system: each PC member was allocated a number of papers to review and he/she could decide whether to review each paper by him/herself or send it to subreviewers. The subreviewers were not taken from a fixed list of reviewers, but they had to be contacted personally: as you can imagine, it was not an easy task. How many times do you accept review requests from random people? Exactly! On the other hand, this gave PC members the possibility to really select the best reviewer for each paper, even going outside of the usual clique of people.

So, I was a PC member and I assigned that paper to an expert of regression in Reproducing Kernel Hilbert Spaces, because the paper was clearly an extension of the seminal work on 1/T rate of SGD without strong convexity, that in turn was based on the line of work developed by Rosasco, Caponnetto, De Vito, Smale, Zhou, etc. Now, for a number of reasons, this specific line of work is unknown to 99% of the theory ML people. I happen to know it because I published in this subfield after Lorenzo Rosasco showed to me its beauty. This is to say that I was pretty confident about my understanding of the paper and my choice of the subreviewer.

The other two PC members assigned to the paper were online learning people. This requires some explanation: for a long time at COLT online learning people were the only people with *some* background in convex optimization. So, any optimization paper was going to them, by default. If you know classic online learning theory and convex optimization, you should see why sometimes this can be a terrible idea. (This situation improved a bit in recent years, but not by much.) Anyway, the reviews came in and one PC member personally destroyed the paper, clearly not understanding it. The other one sent it to an OR person, who also did not understand the paper at all. My subreviewer instead firmly accepted the paper.

Now, it is necessary to open a parenthesis. There is something that young reviewers might not understand: most of the time, it is evident when a reviewer did not understand a paper. It is painfully clear for the authors, but it is also very clear for an experienced Area Chair/PC/Chair in charge of the paper. So, luckily the Chair decided to intervene and asked for the opinion of an external (famous but oblivious to the area of the paper) reviewer. Let me say that this is also not common: it heavily depends on the Chairs and the load they have. Quite understandably, they often do not have the time to intervene on single papers. As expected, the fourth reviewer also did not understand the paper and rejected it… At this point, I started a long discussion in which I successfully refuted every single point of the fourth external reviewer, until he/she conceded that the paper was borderline because he/she was still not *excited* about it.

Let me pause here to explain another thing: Expert Reviewers are humans and humans are rarely rational. So, one of the *main* ways they have to judge a paper is “Do I like this topic?” that often translates to “Do I work on this topic? No, I don’t, because this is clearly a bad topic!”. So, the external reviewer decided that the paper was not interesting because he/she thought the entire area was not interesting. End of the story.

Let me stress here that the problem is not how well a paper is written. In fact, all the involved reviewers understood the paper and its claims. However, deciding whether these results warranted acceptance is just a matter of *taste* of different mostly orthogonal sub-communities, like the pineapple-on-pizza community and the never-pineapple-on-pizza one.

By this stage, the paper was doomed: two reviewers against me and my subreviewer, and one reviewer completely silent. At this stage, any further discussion was also counter-productive because the Expert Reviewer sees any attack to his/her argument as an attack to his/her auctoritas.

So, I had an idea: I stopped sending messages to the other reviewers and wrote directly to the COLT Chairs. I plainly explained that none of these people but my subreviewer and I actually worked in this area. The Chair was not convinced, so I had to propose another PC member who i) actually understood the area, and ii) was famous enough to have a weight on the decision. (The Chair did not mention the second point, I read between the lines…) I proposed a fifth reviewer and the Chair contacted him/her.

After a few days, the fifth reviewer came back and the first sentence of the message was:

I’m a bit surprised: the improvement over existing work is pretty clear!!

The Chair was convinced, the paper accepted.

It took only 2 weeks of discussions, 7 Expert Reviewers, and 1 Chair.

Easy-peasy.

**5. Epilogue **

Was it worth it? In my personal opinion, definitely yes.

What did I get in return? Absolutely nothing, zero, nada, zip, zilch.

So, trust me when I tell you that this is not what normally happens. And it does not matter how many Expert Reviewers you have: the problem is that the probability of getting someone who *really* understands your subtopic is very low, even when you submit to a prestigious theory conference. Even assuming you got reviewers that actually understand your paper, you have to be really lucky to avoid your *Nemesis* (that rejects all your papers just because he/she does not like you and you are not even sure why), the *Egomaniac* (that rejects anything vaguely similar to what he/she does, because nothing compares to what he/she does), and the *Purist* (that rejects anything that might actually work in practice). All the above are things that actually happened, but not even tenure makes me so brave to describe these episodes. But just to give you some fun facts, recently a Chair of a prestigious conference told me that I indeed might have “enemies”. He/she also plainly told me that I should declare the people I *suspect* are my enemies as conflicts (never mind that almost none of the conferences have a system in place for “negative” conflicts…).

In reality, in my years of experience (yes, I am old) I very rarely saw the reviewing system working as it should. Most of the time, in order to get a meaningful decision on a paper, you have to work hard, so hard that people might end up deciding that it is not worth it. I myself have less and less strength and patience to fight many of these battles. I did not gain anything in any of them, probably only more enemies.

An entire system of checks and balances is badly needed in the conference reviews, much more than just amassing expert reviewers. Indeed, you also need somebody that allocates them properly, that checks that they do their job, that prevents them from acting like jerks, that keeps them open to discussions, that makes them plainly admit that after all they might not have understood the paper, that helps them admit that they might not know the area, that (God forbid!) prevents them from rejecting papers just because they don’t like them, etc.

However, the main problem is: why should anyone waste so much time on reviewing/discussing/analyzing papers of *other* people? More importantly, how exactly does the community give feedback on these poor reviews? How are we teaching people to simply say “Actually, this is not exactly my topic”? Indeed, not only was no feedback whatsoever given to any of the people of my story, somehow they also got a prize: in later years I fought similar battles to have papers of the above mentioned reviewers accepted in other conferences!

Overall, I do not know what is a better system to review papers, but I am pretty sure Expert Reviewers are not the answer.

So, there is no happy ending here: the Expert Reviewers are still convinced that they are always right and, from time to time, they still reject your papers that they did not actually understand.

P.S. I am also an Expert Reviewer.

**1. Implicit Updates **

Let’s consider the two most commonly used frameworks for online learning: Online Mirror Descent (OMD) and Follow-The-Regularized-Leader (FTRL).

We already explained that in OMD we update the iterate as the minimizer of a linear approximation of the last loss function we received plus a term that measures the distance to the previous iterate:

where .

On the other hand, in FTRL we have two possibilities: we can minimize the regularized sum of the loss we have received till now *or* the regularized sum of the *linear approximation* of the losses. In the first case, we update with

while in the second case, we use

where $g_i \in \partial \ell_i(x_i)$ for $i = 1, \dots, t$. The second update is what optimization people call Dual Averaging. We also saw that under some reasonable conditions, the two updates of FTRL have the same regret guarantee. However, we would expect the first approach, the one using the exact loss functions, to perform much better in practice. Also, with linearized losses we have to assume knowledge of some characteristics of the losses, e.g., the strong convexity parameter, to achieve the same regret as the full-losses FTRL.

Overall, we have two different frameworks and two different ways to use the loss functions in the update. So, it should be obvious that there is at least another possibility, that is *OMD with exact loss*. That is, we would like to consider the update

$$x_{t+1} = \mathop{\mathrm{argmin}}_{x \in V} \ \ell_t(x) + \frac{1}{\eta_t} B_\psi(x; x_t). \qquad (1)$$

As in the FTRL case, we would expect this update to be better than the linearized one, at least empirically.

To gain some more intuition, let’s consider the simple case that , , and the losses are differentiable. In this case, we have that the linearized OMD update becomes

For example, with square loss and linear predictors over couples input/labels , we have and the update becomes

On the other hand, the update of OMD with the exact loss function becomes

The optimality condition tells us that satisfies

that is

So, the update is not in a closed form anymore, but it has an *implicit* form, where $x_{t+1}$ appears in both sides of the equality. This is exactly the reason why the update of OMD with exact losses is known in the online learning literature as *implicit updates*. So, we will call the update in (1) *Implicit OMD*.

Remark 1. Observe that for linear losses OMD and Implicit OMD are equivalent.

In general, calculating the update of Implicit OMD can be an annoying optimization problem. However, in some cases the Implicit OMD update can still be solved in a closed form.

Example 1. Consider again linear regression with square loss. The update of Implicit OMD becomes

To solve the equation, we take the inner product of both sides with $z_t$, to obtain

that is

Substituting this expression in (3), we have
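Under the assumption that the loss is $\ell_t(x) = \frac{1}{2}(\langle z_t, x\rangle - y_t)^2$ and the Bregman divergence is the squared L2 one, the inner-product trick gives $x_{t+1} = x_t - \frac{\eta (\langle z_t, x_t\rangle - y_t)}{1 + \eta \|z_t\|^2} z_t$. The following Python sketch compares it with the linearized step; the factor $\frac{1}{2}$, the names, and the numbers are my choices for illustration.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def ogd_step(x, z, y, eta):
    # Linearized update: gradient of 0.5 * (<z, x> - y)^2 evaluated at x
    residual = dot(z, x) - y
    return [xi - eta * residual * zi for xi, zi in zip(x, z)]


def implicit_step(x, z, y, eta):
    # Closed form of argmin_w 0.5 * (<z, w> - y)^2 + ||w - x||^2 / (2 * eta)
    residual = dot(z, x) - y
    step = eta * residual / (1.0 + eta * dot(z, z))
    return [xi - step * zi for xi, zi in zip(x, z)]


x, z, y = [1.0, -2.0, 0.5], [0.5, 1.0, -1.0], 2.0

# The implicit step satisfies the implicit equation
# x_new = x - eta * (<z, x_new> - y) * z
eta = 0.5
x_new = implicit_step(x, z, y, eta)
rhs = [xi - eta * (dot(z, x_new) - y) * zi for xi, zi in zip(x, z)]
print(max(abs(a - b) for a, b in zip(x_new, rhs)))

# With a huge learning rate the implicit step lands near a minimizer of the
# loss, while the linearized step overshoots by a huge amount
big = 1e6
print(abs(dot(z, implicit_step(x, z, y, big)) - y))
print(abs(dot(z, ogd_step(x, z, y, big)) - y))
```

The denominator $1 + \eta \|z_t\|^2$ is what keeps the step bounded: as $\eta$ grows, the update converges to the projection of $x_t$ onto the set where the loss is zero.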

**Implicit Updates are Always Descending** Till now, we have motivated implicit updates purely from an intuitive point of view: We expect this algorithm to be better because we do not approximate the loss functions. Indeed, we often gain a lot in performance switching to implicit updates. However, we can even prove that implicit updates have interesting theoretical properties.

First, contrary to OGD, implicit updates remain “sane” even when the learning rate goes to infinity. Indeed, taking the learning rate to infinity in (1), $x_{t+1}$ becomes simply the minimizer of the loss function. On the other hand, in OMD when the learning rate goes to infinity we can take a step that is arbitrarily far from the minimizer of the function!

When we consider non-differentiable convex functions, there is another important difference between implicit updates and subgradient descent updates. We already saw that the subgradient might not point in a descent direction. That is, no matter how we choose the learning rate, the value of the function at $x_{t+1}$ might be *higher* than at $x_t$. On the other hand, this cannot happen with implicit updates, no matter how we choose the learning rate:

$$\ell_t(x_{t+1}) + \frac{1}{\eta_t} B_\psi(x_{t+1}; x_t) \le \ell_t(x_t) + \frac{1}{\eta_t} B_\psi(x_t; x_t) = \ell_t(x_t).$$

**Proximal updates** Are implicit updates actually an invention of online learning people? Actually, no. Indeed, this kind of update was known at least as early as 1965 (!) and it was proposed for the (offline) optimization of functions under the name of *proximal updates*. Basically, we have a function $f$ and we minimize it iteratively with the update

starting from an initial point $x_1$. At first sight, such an update might seem pointless in offline optimization: being able to solve (4) implies also being able to find the minimizer of $f$ in one step! However, as we have previously discussed, this kind of update finds an application when the function is composed of two parts and we decide to linearize only one part.

**2. Passive-Aggressive **

Now, let me show you that implicit updates were actually used a lot in the online literature, even if many people did not realize it.

Let’s take a look at a very famous online learning algorithm: the Passive-Aggressive (PA) algorithm. PA was a major success in online learning: 2099 citations and counting, that is huge for the online learning area. The theory was not very strong, but the performance of these algorithms was way better than anything else we had at that time. Let’s see how the PA algorithm works.

The PA algorithm was introduced before the Online Convex Optimization (OCO) framework was proposed. So, at that time, online learning for classification and regression focused on the particular case in which the loss functions are the losses of linear predictors over input/label pairs. The PA algorithm in particular focused on losses that can be zero over intervals, like the hinge loss, the squared hinge loss, the $\epsilon$-insensitive loss, and the squared $\epsilon$-insensitive loss. For these losses, the update they proposed was

where is a hyperparameter. Now, this is exactly the Implicit OMD update with the special case of the squared L2 Bregman divergence! The choice of the loss functions makes this update always in a closed form. So, for example, for the hinge loss and linear predictors, we have

where the second equality is calculated using the optimality condition and it is left as an exercise to the reader.
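For the reader who wants to check the exercise numerically, here is a minimal Python sketch. It assumes the hinge loss $\max(0, 1 - y \langle w, z\rangle)$ with labels in $\{-1, +1\}$ and writes the hyperparameter as `eta`; under these assumptions the closed form is $w + \min(\eta, \text{hinge}/\|z\|^2)\, y z$. The names and the numbers are hypothetical.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def hinge(w, z, y):
    return max(0.0, 1.0 - y * dot(w, z))


def pa_step(w, z, y, eta):
    # Closed form of argmin_v hinge(v) + ||v - w||^2 / (2 * eta), with z != 0
    tau = min(eta, hinge(w, z, y) / dot(z, z))
    return [wi + tau * y * zi for wi, zi in zip(w, z)]


def objective(v, w, z, y, eta):
    # The function minimized by the implicit update
    dist2 = sum((vi - wi) ** 2 for vi, wi in zip(v, w))
    return hinge(v, z, y) + dist2 / (2.0 * eta)


w, z, y = [0.0, 0.0], [1.0, 2.0], 1.0

# Small learning rate: "aggressive" step of size eta, the loss stays positive
print(pa_step(w, z, y, 0.1), hinge(pa_step(w, z, y, 0.1), z, y))

# Large learning rate: the step stops exactly where the hinge loss becomes zero
print(pa_step(w, z, y, 10.0), hinge(pa_step(w, z, y, 10.0), z, y))
```

When the hinge loss at $w$ is already zero the update does nothing (the “passive” part); otherwise it moves along $y z$, never past the point where the loss vanishes.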

So, the huge boost in performance of PA over other online algorithms is due *uniquely* to the implicit updates.

**3. Implicit Updates on Truncated Linear Models: aProx **

There is another first-order optimization algorithm inspired by implicit updates. As we said, implicit updates are rarely in a closed form. So, we can try to approximate the implicit updates in some way. One possibility is to use the implicit update on a *surrogate loss function*. Indeed, when we use a linear approximation we recover plain OMD. Instead, when we use the exact function we get the implicit updates. What can we use in between the two cases? We could think of using a *truncated linear model*. That is, in case we know that the loss functions are lower bounded by some known constant, we define

for any . Note that this is a lower bound to the loss function and it is piecewise linear.

Now, we can use this surrogate function in the implicit OMD:

Implicit OMD with truncated linear models and the squared L2 Bregman is called aProx (Asi and Duchi, 2019).

Considering and , we again have

where is a specific vector in . Now, we have 2 possibilities: is in the flat part or in the corner of . Indeed, it should be easy to see that the proximal update assures us that we cannot miss the corner and land on the flat part. So, if we are in the linear part, then . Instead, if we are in the corner we have where . Hence, we always have

and we only need to find . Substituting in the definition of and using first-order optimality condition, we can verify that the following is the closed formula of the update (left as an exercise)

The similarity between this update and the one of PA in (5) should be evident, and it is due to the similarity between the truncated linear model and the hinge loss. Indeed, running aProx on linear classifiers with hinge loss is exactly the PA algorithm.
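As a sanity check of this claim, here is a small Python sketch. It assumes the well-known form of the aProx step with squared L2 Bregman, $x - \min(\eta, (\ell(x) - \text{lower bound})/\|g\|^2)\, g$ for a subgradient $g$, and compares it with the PA-style closed form on the hinge loss with labels in $\{-1, +1\}$. The function names and the test points are hypothetical.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def aprox_step(x, loss_value, grad, eta, lower_bound=0.0):
    # Implicit step on the truncated linear model
    # max(loss_value + <grad, v - x>, lower_bound)
    g2 = dot(grad, grad)
    if g2 == 0.0:
        return list(x)  # zero subgradient: the model is flat, stay put
    step = min(eta, (loss_value - lower_bound) / g2)
    return [xi - step * gi for xi, gi in zip(x, grad)]


def hinge_and_subgradient(w, z, y):
    margin = y * dot(w, z)
    if margin < 1.0:
        return 1.0 - margin, [-y * zi for zi in z]
    return 0.0, [0.0] * len(z)


def pa_step(w, z, y, eta):
    # PA-style closed form for the hinge loss, with z != 0
    loss, _ = hinge_and_subgradient(w, z, y)
    tau = min(eta, loss / dot(z, z))
    return [wi + tau * y * zi for wi, zi in zip(w, z)]


cases = [
    ([0.0, 0.0], [1.0, 2.0], 1.0, 0.1),     # step limited by eta
    ([0.0, 0.0], [1.0, 2.0], 1.0, 10.0),    # step limited by the corner
    ([0.3, -0.2], [2.0, 1.0], -1.0, 0.5),
    ([1.0, 1.0], [1.0, 2.0], 1.0, 1.0),     # zero loss: no update
]
for w, z, y, eta in cases:
    loss, grad = hinge_and_subgradient(w, z, y)
    print(aprox_step(w, loss, grad, eta), pa_step(w, z, y, eta))
```

The only information aProx needs beyond a subgradient is the lower bound on the loss, which is 0 for the hinge loss.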

**4. More Updates Similar to the Implicit Updates **

From an empirical point of view, we can gain a lot of performance using implicit updates, even just approximating them. So, it should not be surprising if people proposed and used similar ideas in many optimization algorithms. Let me give you some examples.

The default optimization algorithm in the machine learning library Vowpal Wabbit (VW) uses the Importance Weight Aware Updates (Karampatziakis and Langford, 2011). These updates essentially approximate the implicit update using a differential equation that for linear models can be calculated in a closed formula. So, if you ever used VW, you already used a close relative of implicit updates, probably without knowing it.

Another interesting example is the setting of adaptive filtering, where one wants to minimize the squared error of a linear filter. In this setting, a classic algorithm is the Least Mean Squares (LMS) algorithm, which corresponds to Online Gradient Descent with linear models and squared loss. Now, a known better version of LMS is the normalized LMS, which is nothing else than Implicit OMD with linear models and squared loss.

There are even interpretations of Nesterov’s accelerated gradient method as an implicit update on a curved space (Defazio, 2019).

So, implicit updates are so “natural” that I personally think that any offline/online optimization algorithm that has good performance must be a good approximation of implicit updates. Hence, I am sure there are even more examples of implicit updates hiding in other well-known algorithms.

**5. Regret Guarantee for Implicit Updates **

From the above reasoning, it seems very intuitive to expect a better regret bound for implicit updates. However, it turns out to be particularly challenging to prove a *quantifiable* advantage of implicit updates over OMD ones in the adversarial setting.

Here, I show a very recent result of mine on Implicit OMD that for the first time shows a clear advantage of Implicit OMD in some situations.

First, we can show the following theorem.

Theorem 1.Assume a constant learning rate . Then, implicit OMD guarantees

Moreover, assume the distance generating function to be 1-strongly convex w.r.t. . Then, there exists such that we have

*Proof:* To obtain this bound, we proceed in a slightly different way than in the classic OMD proof. In particular, for any we have

where and in the second inequality we have used the optimality condition of the update. Adding to both terms of the inequality, dividing by , and reordering, we have

Summing over time, we get the first bound.

For the second bound, let’s now focus on the terms and we upper bound them in two different ways. First, using the convexity of the losses, we can bound the difference between and :

where . Also, from the strong convexity of , we have

Hence, putting all together we have

where in the last inequality we used the elementary inequality .

From the optimality condition of the implicit OMD update, we know that there exists such that

Hence, we have

where we used the convexity of the Bregman divergence in its first argument in the second inequality and the optimality condition of the update in the third inequality. This chain of inequalities gives the second bound in the minimum.

The theorem shows a *possible* and *small* improvement over the OMD regret bound. In particular, there might be sequences of losses where . The fact that the improvement is only possible on some sequences is to be expected: the OMD regret is worst-case optimal on bounded domains, so there is not much to gain. However, maybe we could expect a larger gain on some particular sequence of functions. Indeed, we can show that on some sequences of losses we can achieve *constant* regret! Let’s see how.

From the regret above, we have

Denoting by the *temporal variability* of the losses, we have that the regret guarantee is

Now, in the case that the loss functions are all the same and the regret upper bound becomes a *constant* independent of . It is worth remembering that constant regret is the best we can hope for in online convex optimization! In other words, when online learning becomes as easy as offline learning (i.e., all the losses are equal), implicit updates give us a provably large boost.

However, there is a caveat: In order to get a regret in the general case we need . On the other hand, if we want . The problem in online learning is that we do not know the future, so we need some *adaptive* strategy that changes in a dynamic way. This is indeed possible and we leave this as an exercise, see below.
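To see the constant-regret phenomenon on a toy case, here is a minimal sketch under assumptions that are mine and not part of the analysis above: a one-dimensional problem, the Euclidean divergence, a constant learning rate `eta`, and the same loss `|x - 1|` in every round. In this case the implicit update has a closed form (move towards 1 by at most `eta`, without overshooting), while the OMD update is the usual subgradient step. The function names are hypothetical.

```python
def implicit_step(x, eta, target=1.0):
    """Implicit (proximal) update for the loss |x - target| with the Euclidean divergence.

    Solves argmin_z |z - target| + (z - x)^2 / (2 * eta) in closed form:
    move towards the target by at most eta, without overshooting.
    """
    gap = target - x
    return x + max(-eta, min(eta, gap))


def ogd_step(x, eta, target=1.0):
    """Standard subgradient step on |x - target| (subgradient 0 at the kink)."""
    g = (x > target) - (x < target)  # sign(x - target)
    return x - eta * g


def regret(step, T, eta=0.3, x1=0.0, target=1.0):
    """Cumulative loss minus the loss of the best fixed point (which is 0 at the target)."""
    x, total = x1, 0.0
    for _ in range(T):
        total += abs(x - target)
        x = step(x, eta, target)
    return total


for T in (100, 1000, 10000):
    print(T, round(regret(implicit_step, T), 3), round(regret(ogd_step, T), 3))
```

With the same constant learning rate, the implicit update lands on the minimizer and stays there, so its regret stops growing, while the subgradient step keeps oscillating around the minimizer and its regret grows linearly with the number of rounds.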

Our last observation is that we can recover the constant regret bound even for FTRL when used on the exact losses. Again, this is due to the use of the exact losses rather than the linear approximation. Remember that FTRL predicts with , where . Hence, from the FTRL regret equality and assuming a non-decreasing regularizer, we have

However, FTRL with exact losses requires solving a finite-sum optimization problem whose size grows with the number of iterations. Instead, Implicit OMD uses only one loss in each round, resulting in a closed formula in a number of interesting cases. We also note that we would have the same tuning problem as before: in order to get a constant regret when , we would need the regularizer to be constant and independent from time, while it should grow as in the general case.

**6. History Bits **

The implicit updates in online learning were proposed for the first time by (Kivinen and Warmuth, 1997). However, this update with the Euclidean divergence is the Proximal update in the optimization literature dating back at least to 1965 (Moreau, 1965)(Martinet, 1970)(Rockafellar, 1976)(Parikh and Boyd, 2014), and more recently used even in the stochastic setting (Toulis and Airoldi, 2017)(Asi and Duchi, 2019).

The PA algorithms were proposed in (Crammer et al., 2006), but the connection with implicit updates was absent in the paper. I am not sure who first realized the connection: I realized it in 2011 and I showed it to Joseph Keshet (one of the authors of PA), who encouraged me to publish it somewhere. Only 10 years later, I am doing it. Note that the mistake bound proved in the PA paper is worse than the Perceptron bound. Later, we proved a mistake bound for PA that is strictly better than the classic Perceptron’s bound (Jie et al., 2010).

The very nice idea of truncated linear models was proposed by (Asi and Duchi, 2019) as a way to approximate proximal updates while retaining closed-form updates.

The connection between implicit OMD and normalized LMS was shown by (Kivinen et al., 2006).

(Kulis and Bartlett, 2010) provide the first regret bounds for implicit updates that match those of OMD, while (McMahan, 2010) makes the first attempt to quantify the advantage of the implicit updates in the regret bound. Finally, (Song et al., 2018) generalize the results in (McMahan, 2010) to Bregman divergences and strongly convex functions, and quantify the gain differently in the regret bound. Note that in (McMahan, 2010)(Song et al., 2018) the gain cannot be exactly quantified, providing just a non-negative data-dependent quantity subtracted from the regret bound. The connection between temporal variation and implicit updates was shown in (Campolongo and Orabona, 2020), together with a matching lower bound.

**7. Acknowledgements **

Thanks to Nicolò Campolongo for feedback on a draft of this post.

**8. Exercises **

Exercise 1. Prove that the update of PA given above is correct.

Exercise 2. Prove that the update of aProx given above is correct.

Exercise 3. Find a learning rate strategy to adapt to the value of without knowing it (Campolongo and Orabona, 2020).

There is a popular interpretation of the Perceptron as a stochastic (sub)gradient descent procedure. I even found slides online with this idea. The thought of so many young minds twisted by these false claims was too much to bear. So, I felt compelled to write a blog post to explain why this is wrong…

Moreover, I will also give a different and (I think) much better interpretation of the Perceptron algorithm.

**1. Perceptron Algorithm**

The Perceptron algorithm was introduced by Rosenblatt in 1958. To be more precise, he introduced a family of algorithms characterized by a certain architecture. Also, he considered what we now call supervised and unsupervised training procedures. However, nowadays when we talk about the Perceptron we mean the following algorithm:

In the algorithm, the couples for , with and , represent a set of input/output pairs that we want to learn to classify correctly in the two categories and . We assume that there exists an unknown vector that correctly classifies all the samples, that is . Note that any scaling of by a positive constant still correctly classifies all the samples, so there are infinitely many solutions. The aim of the Perceptron is to find any of these solutions.
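As a concrete reference, here is a minimal sketch of the standard Perceptron in Python. It assumes that the samples are visited cyclically and that an update is triggered whenever `y * <w, x> <= 0`; the toy data and the `max_epochs` safeguard are hypothetical additions for illustration.

```python
def perceptron(samples, max_epochs=100):
    """Cycle over (x, y) pairs and update on every mistake; y must be +1 or -1."""
    dim = len(samples[0][0])
    w = [0.0] * dim
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in samples:
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            if margin <= 0:  # wrong (or undecided) prediction
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # a full pass without updates: every sample is classified correctly
            break
    return w


# Toy linearly separable data (hypothetical): the label is the sign of x[0] - x[1].
data = [((2.0, 0.0), 1), ((0.0, 2.0), -1), ((3.0, 1.0), 1), ((1.0, 3.0), -1)]
w = perceptron(data)
print(w)
```

Note that the loop never looks at an objective function: it only checks whether the current vector violates one of the constraints and, if so, moves in the direction `y * x`.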

From an optimization point of view, this is called a *feasibility problem*, that is something like

where is some set. Feasibility problems are an essential step in constrained optimization for algorithms that require a feasible initial point. They are not optimization problems, even if in some cases they can be solved through an optimization formulation.

In the Perceptron case, we can restate the problem as

where the “1” on the r.h.s. is clearly arbitrary and it can be changed through rescaling of . So, in optimization language, the Perceptron algorithm is nothing else than an iterative procedure to solve the above feasibility problem.

**2. Issues with the SGD Interpretation**

As said above, sometimes people refer to the Perceptron as a stochastic (sub)gradient descent algorithm on the objective function

I think there are many problems with this idea. Let me list some of them:

- First of all, the above interpretation assumes that we take the samples randomly from . However, this is not needed in the Perceptron and it was not needed in the first proofs of the Perceptron convergence (Novikoff, 1963). There is a tendency to call anything that receives one sample at a time “stochastic”, but “arbitrary order” and “stochastic” are clearly not the same.
- The Perceptron is typically initialized with . Now, we have two problems. The first one is that with a black-box first-order oracle, we would get a subgradient of a , where is drawn uniformly at random in . A possible subgradient for any is . This means that SGD would not update. Instead, the Perceptron in this case does update. So, we are forced to consider a different model than the black-box one. Changing the oracle model is a minor problem, but this fact hints to another very big issue.
- The biggest issue is that is a global optimum of ! So, there is nothing to minimize, we are already done in the first iteration. However, from a classification point of view, this solution seems clearly wrong. So, it seems we constructed an objective function we want to minimize and a corresponding algorithm, but for some reason we do not like one of its infinite minimizers. So, maybe, the objective function is wrong? So, maybe, this interpretation misses something?

There is an easy way to avoid some of the above problems: change the objective function to a parametrized loss that has non-zero gradient in zero. For example, something like this

Now, when goes to infinity, you recover the function . However, for any finite , is not a global optimum anymore. As a side effect, we also solved the issue of the subgradient of the max function. In this way, you could interpret the Perceptron algorithm as the *limit behaviour of SGD on a family of optimization problems*.
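To illustrate the effect of the smoothing, the sketch below uses one possible parametrization, which is an assumption of this example: the scaled logistic loss `(1/alpha) * ln(1 + exp(-alpha * z))` applied to `z = y * <w, x>` and averaged over the samples. As `alpha` grows it approaches `max(0, -z)`, but for any finite `alpha` its gradient at `w = 0` is non-zero and points along the average of `-y * x`, so a gradient step from zero moves in a Perceptron-like direction.

```python
import math


def softplus_loss(z, alpha):
    """(1/alpha) * ln(1 + exp(-alpha * z)), computed in a numerically stable way."""
    a = -alpha * z
    return (max(a, 0.0) + math.log1p(math.exp(-abs(a)))) / alpha


def softplus_grad_z(z, alpha):
    """Derivative of softplus_loss with respect to z: -1 / (1 + exp(alpha * z))."""
    a = alpha * z
    if a >= 0:
        e = math.exp(-a)
        return -e / (1.0 + e)
    return -1.0 / (1.0 + math.exp(a))


def grad_at(w, samples, alpha):
    """Gradient of the averaged smoothed objective at w."""
    g = [0.0] * len(w)
    for x, y in samples:
        z = y * sum(wi * xi for wi, xi in zip(w, x))
        c = softplus_grad_z(z, alpha) * y / len(samples)
        g = [gi + c * xi for gi, xi in zip(g, x)]
    return g


data = [((2.0, 0.0), 1), ((0.0, 2.0), -1)]  # hypothetical toy samples
print(grad_at([0.0, 0.0], data, alpha=10.0))
for alpha in (1.0, 10.0, 100.0):
    # The smoothed loss approaches max(0, -z) as alpha grows.
    print(alpha, softplus_loss(-0.5, alpha), max(0.0, 0.5))
```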

To be honest, I am not sure this is a satisfying solution. Moreover, the stochasticity is still there and it should be removed.

Now, I already proved a mistake bound for the Perceptron, without any particular interpretation attached to it. As a matter of fact, proofs do not need interpretations to be correct. I showed that the Perceptron competes with a *family of loss functions* that implies that it does not just use the subgradient of a single function. However, if you need an *intuitive way* to think about it, let me present you the idea of *pseudogradients*.

**3. Pseudogradients**

Suppose we want to minimize a function -smooth and we would like to use something like gradient descent. However, we do not have access to its gradient. In this situation, (Polyak and Tsypkin, 1973) proposed to use a “pseudogradient”, that is *any* vector that forms an angle of 90 degrees or less with the actual gradient in

In a very intuitive way, gives me some information that should allow me to minimize , at least in the limit. The algorithm then becomes a “pseudogradient descent” procedure that updates the current solution in the direction of the negative pseudogradient

where are the step sizes or learning rates.

Note that (Polyak and Tsypkin, 1973) define the pseudogradients as a *stochastic* vector that satisfies the above inequality in conditional expectation and for a time-varying . Indeed, there are a number of very interesting results in that paper. However, for simplicity of exposition I will only consider the deterministic case and only describe the application to the Perceptron.

Let’s see how this would work. Let’s assume that at least for an initial number of rounds, that means that the angle between the pseudogradient and the gradient is acute. From the -smoothness of , we have that

Now, if , we have that so we can guarantee that the value of decreases at each step. So, we are minimizing without using a gradient!

To get a rate of convergence, we should know something more about . For example, we could assume that . Then, setting , we obtain

This is still not enough because it is clear that cannot be true on all rounds because when we are in the minimizer . However, with enough assumptions, following this route you can even get a rate of convergence.
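A tiny numerical illustration may help here. The sketch below is built on assumptions of mine: the function is `f(x) = 0.5 * ||x||^2` in two dimensions (which is 1-smooth and whose gradient is `x`), and the pseudogradient is the true gradient rotated by 60 degrees, so that the angle between the two is acute. With a step size of 0.5, which satisfies the standard descent condition for smooth functions in this setup, the function value decreases at every step even though the true gradient is never used in the update.

```python
import math


def f(x):
    """A 1-smooth function: f(x) = 0.5 * ||x||^2, whose gradient at x is x itself."""
    return 0.5 * (x[0] ** 2 + x[1] ** 2)


def pseudogradient(x, angle=math.pi / 3):
    """Rotate the true gradient by `angle` (< 90 degrees): the inner product with it stays positive."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * x[0] - s * x[1], s * x[0] + c * x[1])


def pseudogradient_descent(x, eta=0.5, steps=30):
    """Iterate x <- x - eta * g and record f along the way."""
    values = [f(x)]
    for _ in range(steps):
        g = pseudogradient(x)
        x = (x[0] - eta * g[0], x[1] - eta * g[1])
        values.append(f(x))
    return x, values


x_final, values = pseudogradient_descent((3.0, -1.0))
print(values[0], values[-1])
```

In this example each step multiplies the squared norm by `1 - 2 * eta * cos(angle) + eta^2 = 0.75`, so the iterates spiral towards the minimizer.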

**4. Pseudogradients for the Perceptron**

How do we use this to explain the Perceptron? Suppose your set is *linearly separable* with a margin of 1. This means that there exists a vector such that

Note that the value of the margin is arbitrary, we can change it just rescaling .

Remark 1. An equivalent way to restate this condition is to constrain to have unitary norm and require

where is called the *maximum margin* of . However, in the following I will not use the margin notation because it makes things a bit less clear from an optimization point of view.

We would like to construct an algorithm to find (or any positive scaling of it) from the samples . So, we need an objective function. Here is the brilliant idea of Polyak and Tsypkin: in each iteration take an arbitrary and define , that is exactly the negative update we use in the Perceptron. This turns out to be a pseudogradient for . Indeed,

where in the last inequality we used (2).

Let’s pause for a moment to look at what we did: We want to minimize , but its gradient is just impossible to calculate because it depends on that we clearly do not know. However, *every time the Perceptron finds a sample on which its prediction is wrong*, we can construct a pseudogradient, without any knowledge of . It is even more surprising if you consider the fact that there is an infinite number of possible solutions and hence functions , yet the pseudogradient correlates positively with the gradient of any of them! Moreover, no stochasticity is necessary.

At this point we are basically done. In fact, observe that is 1-smooth. So, every time , the analysis above tells us that

where in the last inequality we have assumed .

Setting , summing over time, and denoting the number of updates we have over iterations, we obtain

where we used the fact that .

Now, there is the actual magic of the (parameter-free!) Perceptron update rule: as we explained here, the updates of the Perceptron are independent of . That is, given an order in which the samples are presented to the algorithm, any fixed makes the Perceptron update on the same samples and it only changes the scale of . Hence, even if the Perceptron algorithm uses , we can consider an arbitrary decided post-hoc to minimize the upper bound. Hence, we obtain

that is

Now, observing that the r.h.s. is independent of , we proved that the maximum number of updates, or equivalently mistakes, of the Perceptron algorithm is bounded.
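Both claims can be checked numerically on a toy example. The sketch below assumes hypothetical data separated with margin at least 1 by a known `w_star`, runs the Perceptron with two different step sizes, and compares the number of updates with the classic bound `||w_star||^2 * max ||x||^2`. The step sizes are powers of two and the data are integers, so the rescaling is exact in floating point.

```python
def perceptron_updates(samples, eta=1.0, max_epochs=1000):
    """Run the Perceptron with step size eta; return the weights and the list of update positions."""
    w = [0.0] * len(samples[0][0])
    updates = []  # (epoch, sample index) of every update
    for epoch in range(max_epochs):
        changed = False
        for i, (x, y) in enumerate(samples):
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                updates.append((epoch, i))
                changed = True
        if not changed:
            break
    return w, updates


# Hypothetical data, separable by w_star = (1, -1) with margin at least 1.
data = [((2.0, 0.0), 1), ((0.0, 2.0), -1), ((3.0, 2.0), 1), ((1.0, 3.0), -1), ((-1.0, -2.0), 1)]
w_star = (1.0, -1.0)
assert all(y * (w_star[0] * x[0] + w_star[1] * x[1]) >= 1 for x, y in data)

w_a, upd_a = perceptron_updates(data, eta=0.5)
w_b, upd_b = perceptron_updates(data, eta=4.0)
bound = (w_star[0] ** 2 + w_star[1] ** 2) * max(x[0] ** 2 + x[1] ** 2 for x, _ in data)
print(len(upd_a), len(upd_b), bound)
```

The two runs update on exactly the same samples and differ only in the scale of the final vector, which is why the step size can be chosen after the fact in the analysis.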

Are we done? Not yet! We can now improve the Perceptron algorithm taking full advantage of the pseudogradients interpretation.

**5. An Improved Perceptron**

This is a little known idea to improve the Perceptron. It can be shown with the classic analysis as well, but it comes very naturally from the pseudogradient analysis.

Let’s start from

Now consider only the rounds in which and set , that is obtained by an optimization of the expression . So, we obtain

This means that now the update rule becomes

Now, summing (3) over time, we get

It is clear that this inequality implies the previous one because . But we can even obtain a tighter bound. Using the inequality between harmonic, geometric, and arithmetic mean, we have

In words, the original Perceptron bound depends on the maximum squared norm of the samples on which we updated. Instead, this bound depends on the geometric or arithmetic mean of the squared norm of the samples on which we updated, that is less than or equal to the maximum.
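Here is a minimal sketch of the improved update, under the assumption that the optimized step size on a mistake is `1 / ||x||^2` (this is what minimizing `-eta + (eta^2 / 2) * ||x||^2` over `eta` gives). With this choice, each update decreases `0.5 * ||w - w_star||^2` by at least `1 / (2 * ||x||^2)`, so the sum of `1 / ||x||^2` over the updated samples is at most `||w_star||^2`. The data and `w_star` are hypothetical.

```python
def normalized_perceptron(samples, max_epochs=1000):
    """Perceptron with step size 1 / ||x||^2 on every mistake.

    Returns the weights and the list of samples on which an update happened.
    Samples are assumed to be non-zero, which separability with a margin implies.
    """
    w = [0.0] * len(samples[0][0])
    updated = []
    for _ in range(max_epochs):
        changed = False
        for x, y in samples:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                sq_norm = sum(xi * xi for xi in x)
                w = [wi + y * xi / sq_norm for wi, xi in zip(w, x)]
                updated.append(x)
                changed = True
        if not changed:
            break
    return w, updated


# Hypothetical data, separable by w_star = (1, -1) with margin at least 1.
data = [((2.0, 0.0), 1), ((0.0, 2.0), -1), ((3.0, 2.0), 1), ((1.0, 3.0), -1), ((-1.0, -2.0), 1)]
w_star_sq_norm = 2.0

w, updated = normalized_perceptron(data)
inverse_norm_sum = sum(1.0 / sum(xi * xi for xi in x) for x in updated)
print(len(updated), inverse_norm_sum, w_star_sq_norm)
```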

**6. Pseudogradients and Lyapunov Potential Functions**

Some people might have realized yet another way to look at this: is the Lyapunov function typically used to analyze subgradient descent. In fact, the classic analysis of SGD considers the guaranteed decrement at each step of this function. The two things coincide, but I find the pseudogradient idea to add a non-trivial amount of information because it allows us to bypass the idea of using a subgradient of the loss function completely.

Moreover, the idea of the pseudogradients is more general because it applies to any smooth function, not only to the choice of .

Overall, it is clear that all the good analyses of the Perceptron must have something in common. However, sometimes recasting a problem in a particular framework might have some advantages because it helps our intuition. In this view, I find the pseudogradient view particularly compelling because it aligns with my intuition of how an optimization algorithm is supposed to work.

**7. History Bits **

I already wrote about the Perceptron, so I will just add a few more relevant bits.

As I said, it seems that the family of Perceptron algorithms was intended to be something much more general than what we intend now. The particular class of Perceptrons we use nowadays was called -system (Block, 1962). I hypothesize that the fact the -system survived the test of time is exactly due to the simple convergence proofs in (Block, 1962) and (Novikoff, 1963). Both proofs are non-stochastic. For the sake of proper credit assignment, it seems that the convergence of the Perceptron was proved by many others before Block and Novikoff (see references in Novikoff, 1963). However, the proof in (Novikoff, 1963) seems to be the cleanest one. Aizerman, Braverman, and Rozonoer (1964) (essentially) describe for the first time the Kernel Perceptron and prove a finite mistake bound for it.

I got the idea of smoothing the Perceptron algorithm with a scaled logistic loss from a discussion on Twitter with Maxim Raginsky. He wrote that (Aizerman, Braverman, and Rozonoer, 1970) proposed some kind of smoothing in a Russian book for the objective function in (1), but I don’t have access to it so I am not sure what the details are. I just thought of a very natural one.

The idea of pseudogradients and the application to the Perceptron algorithm is in (Polyak and Tsypkin, 1973). However, there the input/output samples are still stochastic and the finite bound is not explicitly calculated. As I have shown, stochasticity is not needed. It is important to remember that online convex optimization as a field came much later, so there was no reason for these people to consider arbitrary or even adversarial order of the samples.

The improved Perceptron mistake bound could be new (but please let me know if it isn’t!) and it is inspired by the idea in (Graepel, Herbrich, and Williamson, 2001) of normalizing the samples to show a tighter bound.

**Acknowledgements**

Given the insane amount of mistakes that Nicolò Campolongo usually finds in my posts, this time I asked him to proofread it. So, I thank Nicolò for finding an insane amount of mistakes on a draft of this post.

EDIT 4/25/23

This blog post went viral in 2020 and this idea is now widely accepted by the deep learning community. In fact, this is not only the most read post on my blog, but I might say that this is my most influential scientific idea! So, if you want to mention it in a paper, please cite this blog post.

Thanks,

Francesco Orabona

*Disclaimer: This post will be a little different than my usual ones. In fact, I won’t prove anything and I will just briefly explain some of my conjectures around optimization in deep neural networks. Differently from my usual posts, it is totally possible that what I wrote is completely wrong.*

I have been working on online and stochastic optimization for a while, from a practical and empirical point of view. So, I was already in this field when Adam (Kingma and Ba, 2015) was proposed.

The paper was ok but not a breakthrough, and even more so for today’s standards. Indeed, the theory was weak: A regret guarantee for an algorithm supposed to work on stochastic optimization of non-convex functions. The experiments were also weak: The exact same experiments would result in a surefire rejection in these days. Later people also discovered an error in the proof and the fact that the algorithm will not converge on certain one-dimensional stochastic convex functions. Despite all of this, in these days Adam is considered the King of the optimization algorithms. Let me be clear: it is known that Adam will not always give you the best performance, yet most of the time people know that they can use it with its default parameters and get, if not the best performance, at least the second best performance on their particular deep learning problem. In other words, Adam is considered nowadays the *default optimizer* for deep learning. So, what is the secret behind Adam?

Over the years, people published a vast number of papers that tried to explain Adam and its performance, too many to list. From the “adaptive learning rate” (adaptive to what? Nobody exactly knows…) to the momentum, to the almost scale-invariance, each single aspect of its arcane recipe has been examined. Yet, none of these analyses gave us the final answer on its performance. It is clear that most of these ingredients are beneficial to the optimization process of *any* function, but it is still unclear why this exact combination and not another one makes it the best algorithm. The equilibrium in the mix is so delicate that even the small change required to fix the non-convergence issue was considered to give slightly worse performance than Adam.

The fame of Adam is also accompanied by strong sentiments: It is enough to read posts on r/MachineLearning on Reddit to see the passion that people put in defending their favorite optimizers against the other ones. It is the sort of fervor that you see in religion, in sports, and in politics.

However, how *likely* is all this? I mean, how likely is it that Adam is really the *best* optimization algorithm? How likely is it that we reached the apex of optimization for deep learning a few years ago in a field that is so young? Could there be another explanation for its prodigious performance?

I have a hypothesis, but before explaining it we have to briefly talk about the applied deep learning community.

In a talk, Olivier Bousquet has described the deep learning community as a giant genetic algorithm: Researchers in this community are exploring the space of all variants of algorithms and architectures in a semi-random way. Things that consistently work in large experiments are kept, the ones not working are discarded. Note that this process seems to be independent of acceptance and rejection of papers: The community is so big and active that good ideas in rejected papers are still saved and transformed into best practices in a few months, see for example (Loshchilov and Hutter, 2019). Analogously, ideas in published papers are reproduced by hundreds of people that mercilessly trash things that will not reproduce. This process has created a number of heuristics that consistently produce good results in experiments, and the stress here is on “consistently”. Indeed, despite being a method based on non-convex formulations, the performance of deep learning methods turns out to be extremely reliable. (Note that the deep learning community has also a large bias towards “famous” people, so not all the ideas receive the same level of attention…)

So, what is the link between this giant genetic algorithm and Adam? Well, looking carefully at the creation process in the deep learning community I noticed a pattern: Usually people try new architectures *keeping the optimization algorithm fixed*, and most of the time the algorithm of choice is Adam. This happens because, as explained above, Adam is the *default optimizer*.

So, here is my hypothesis: Adam was a very good optimization algorithm for the neural network architectures we had a few years ago and **people kept evolving new architectures on which Adam works**. So, we might not see many architectures on which Adam does not work because such ideas are discarded prematurely! Such ideas would require to design a new architecture

Now, I am sure many people won’t buy into this hypothesis, I am sure they will list all sorts of specific problems in which Adam is not the best algorithm, in which Stochastic Gradient Descent with momentum is the best one, and so on and so forth. However, I would like to point out two things: 1) I don’t describe here a law of nature, but simply a tendency the community has that might have influenced the co-evolution of some architectures and optimizers; 2) I actually have some evidence to support this claim.

If my claims were true, we would expect Adam to be extremely good on deep neural networks and very poor on anything else. And this does happen! For example, Adam is known to perform very poorly on simple convex and non-convex problems that are not deep neural networks, see for example the following experiments from (Vaswani et al., 2019):

It seems that the moment we move away from the specific setting of deep neural networks with their specific choice of initialization, specific scale of weights, specific loss function, etc., Adam loses its *adaptivity* and its magic default learning rate must be tuned again. Note that you can always write a linear predictor as a one-layer neural network, yet Adam does not work so well in this case either. So, **all the particular choices of architectures in deep learning might have evolved to make Adam work better and better, while the simple problems above do not have any of these nice properties that allow Adam to shine**.

Overall, Adam might be the best optimizer because the deep learning community might be exploring only a small region in the joint search space of architectures/optimizers. If true, that would be ironic for a community that departed from convex methods because they focused only on a narrow region of the possible machine learning algorithms and it was like, as Yann LeCun wrote, “looking for your lost car keys under the street light knowing you lost them someplace else”.

EDIT: After the publication of this post, Sam Power pointed me to this tweet by Roger Grosse that seems to share a similar sentiment.

**1. SGD on Non-Convex Smooth Functions **

We are interested in minimizing a smooth non-convex function using stochastic gradient descent with unbiased stochastic gradients. In more detail, we assume that we have access to an oracle that returns in any point , , where is the realization of a mechanism for computing the stochastic gradient. For example, could be the random index of a training sample we use to calculate the gradient of the training loss or just random noise that is added on top of our gradient computation. We will also assume that the variance of the stochastic gradient is bounded: , for all . Weaker assumptions on the variance are possible, but they don’t add much to the general message nor to the scheme of the proof.

Given that the function is non-convex, we clearly cannot hope to converge to the minimum of , so we need a less ambitious goal. We assumed that the function is smooth. As you might remember from my previous posts, smooth functions are differentiable functions whose gradient is Lipschitz. Formally, we say that is -smooth when , for all . This assumption assures us that when we approach a local minimum the gradient goes to zero. Hence, **decreasing the norm of the gradient will be our objective function for SGD.** Note that smoothness is necessary to study the norm of the gradients. In fact, consider the function whose derivative does not go to zero when we approach the minimum, on the contrary it is always different than 0 in any point different than the minimum.

The last thing we will assume is that the function is bounded from below. Remember that the boundedness from below does not imply that the minimum of the function exists, e.g., .

Hence, I start from a point and the SGD update is

where are deterministic learning rates or stepsizes.

First, let’s see practically how SGD behaves w.r.t. Gradient Descent (GD) on the same problem.

In Figure 1, we are minimizing , where the stochastic gradient in SGD is given by the gradient of the function corrupted by Gaussian noise with zero mean and standard deviation 1. On the other hand, there is no noise for GD. In both cases, we use and we plot the absolute value of the derivative. We can see that GD will monotonically minimize the gradient till numerical precision as expected, converging to one of the local minima. Note that with a constant learning rate GD on this problem would converge even faster. Instead, SGD will jump back and forth resulting in only *some* iterates having small gradient. So, our basic question is the following:

*Will converge to zero with probability 1 in SGD when goes to infinity?*

This is more difficult to answer than what you might think. However, this is a basic question to know if it actually makes sense to run SGD for a bunch of iterations and return the last iterate, that is how 99% of the people use SGD on a non-convex problem.

To warm up, let’s first see what we can prove in a finite-time setting.

As in all other similar analyses, we need to construct a potential (Lyapunov) function that allows us to analyze it. In the convex case, we would study , where . Here, this potential does not even make sense because we are not even trying to converge to . It turns out that a better choice is to study . We will make use of the following property of -smooth functions:

In words, this means that a smooth function is always upper bounded by a quadratic function. Note that this property does not require convexity, so we can safely use it. Thanks to this property, let’s see how our potential evolves over time during the optimization of SGD.

Now, let’s denote by the expectation w.r.t. given , so we have

where in the inequality we have used the fact that the variance of the stochastic gradient is bounded by . Taking the total expectation and reordering the terms, we have

Let’s see how useful this inequality is: consider a constant step size , where is the usual critical parameter of the learning rate (that you’ll never be able to tune properly unless you know things that you clearly don’t know…). With this choice, we have . So, we have

What we got is almost a convergence result: it says that the average of the norm of the gradients is going to zero as . Given that the average of a set of numbers is greater than or equal to its minimum, this means that there exists at least one in my set of iterates that has a small expected gradient. This is interesting but slightly disappointing. We were supposed to prove that the gradient converges to zero, but instead we only proved that at least *one* of the iterates has indeed small expected norm, but we don’t know which one. Also, trying to find the right iterate might be annoying because we only have access to stochastic gradients.

It is also interesting to see that the convergence rate has two terms: a fast rate and a slow rate . This means that we can expect the algorithm to make fast progress at the beginning of the optimization and then slowly converge once the number of iterations becomes big enough compared to the variance of the stochastic gradients. In case the noise on the gradients is zero, SGD becomes simply gradient descent and it will converge at a rate. In the noiseless case, we can also show that the last iterate is the one with the smallest gradient. However, note that the learning rate has in it, so effectively we can achieve a faster convergence in the noiseless case because we would be using a constant and independent from stepsize.
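The finite-time guarantee can be checked on a toy problem. The sketch below rests on several assumptions of mine: a hypothetical one-dimensional objective `f(x) = x^2 / 2 + 2 * cos(x)` (non-convex, 3-smooth, bounded from below by -2), Gaussian noise with standard deviation `sigma` added to the derivative, and a constant step size of the form `min(1/M, alpha / (sigma * sqrt(T)))`. The bound computed at the end is the standard one that follows from the smoothness inequality with a step size of at most `1/M`, namely `2 * (f(x_1) - f_lower) / (eta * T) + M * eta * sigma^2`; it holds in expectation, so a single run is only an illustration.

```python
import math
import random


def f(x):
    """Hypothetical non-convex objective, bounded from below by -2."""
    return 0.5 * x * x + 2.0 * math.cos(x)


def grad_f(x):
    """Derivative of f; its own derivative lies in [-1, 3], so f is 3-smooth."""
    return x - 2.0 * math.sin(x)


def sgd_average_sq_grad(x1, T, alpha=1.0, sigma=1.0, smoothness=3.0, seed=0):
    """Run SGD with the constant step min(1/M, alpha / (sigma * sqrt(T))).

    Returns the average of the squared true gradients over the T iterates and the step size.
    """
    rng = random.Random(seed)
    eta = min(1.0 / smoothness, alpha / (sigma * math.sqrt(T)))
    x, total = x1, 0.0
    for _ in range(T):
        total += grad_f(x) ** 2
        noisy_grad = grad_f(x) + rng.gauss(0.0, sigma)  # unbiased, variance sigma^2
        x -= eta * noisy_grad
    return total / T, eta


x1, T, sigma, M = 4.0, 10000, 1.0, 3.0
avg, eta = sgd_average_sq_grad(x1, T, sigma=sigma, smoothness=M)
# Upper bound from the analysis, using -2 as a lower bound on f.
bound = 2.0 * (f(x1) + 2.0) / (eta * T) + M * eta * sigma ** 2
print(avg, bound)
```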

**2. The Magic Trick: Randomly Stopped SGD **

The above reasoning is interesting but it is not a solution to our question: does the last iterate of SGD converge? Yes or no?

There is a possible work-around that looks like a magic trick. Let’s take one iterate of SGD uniformly at random among and call it . Taking the expectation with respect to this randomization and the noise in the stochastic gradients we have that

Basically, it says that if we run SGD for iterations, then we stop and return not the last iterate but one of the iterates at random, in expectation with respect to everything the norm will be small! Note that this is equivalent to running SGD with a random stopping time. In other words, given that we didn’t know how to prove if SGD converges, we changed the algorithm adding a random stopping time and now the random iterate on which we stop in expectation will have the desired convergence rate.

This is a very important result and also a standard one in these days. It should be intuitive why the randomization helps: From Figure 1 it is clear that we might be unlucky in the last iteration of SGD, however randomizing in expectation we smooth out the noise and get a decreasing gradient. However, we just changed the target because we still didn’t prove if the last iterate converges. So, we need an alternative way.
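In code, the trick amounts to drawing the stopping time before running the algorithm. The sketch below is a minimal illustration with hypothetical names: `randomly_stopped_sgd` draws a time uniformly in `1, ..., T`, independently of the stochastic gradients, and runs SGD only up to that iterate. The oracle is an assumed toy example (the derivative of `x^2 / 2 + 2 * cos(x)` plus Gaussian noise).

```python
import math
import random


def randomly_stopped_sgd(grad_oracle, x1, T, eta, rng):
    """Run SGD and return an iterate chosen uniformly at random among x_1, ..., x_T."""
    stop = rng.randint(1, T)  # uniform stopping time, drawn independently of the gradients
    x = x1
    for _ in range(stop - 1):  # x_stop is obtained after stop - 1 updates
        x -= eta * grad_oracle(x, rng)
    return x, stop


def noisy_grad(x, rng, sigma=1.0):
    """Hypothetical oracle: derivative of x^2 / 2 + 2 * cos(x) plus Gaussian noise."""
    return x - 2.0 * math.sin(x) + rng.gauss(0.0, sigma)


rng = random.Random(0)
T = 2000
eta = min(1.0 / 3.0, 1.0 / math.sqrt(T))
x_out, stop = randomly_stopped_sgd(noisy_grad, 4.0, T, eta, rng)
print(stop, x_out, abs(x_out - 2.0 * math.sin(x_out)))
```

Drawing the stopping time first also saves computation, because the iterates after the selected one are never needed.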

**3. The Disappointing Lim Inf **

Let’s consider again (1). This time let’s select any time-varying positive stepsizes that satisfy

These two conditions are classic in the study of stochastic approximation. The first condition is needed to be able to travel arbitrarily far from the initial point, while the second one is needed to keep the variance of the noise under control. The classic learning rate of does not satisfy these assumptions, but something decaying a little bit faster as will do.
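A quick numerical look at the partial sums gives a feeling for the two conditions. The sketch below compares the stepsize `t^-0.5` with a hypothetical faster-decaying choice `t^-0.6`, picked only for illustration. For the first one, the sum of the squared stepsizes is the harmonic series and keeps growing like `ln(T)`; for the second one, the sum of the stepsizes still diverges while the sum of the squares stays below `1 + 1/0.2 = 6`. Of course, finite partial sums only illustrate the behaviour and do not prove it.

```python
import math


def partial_sums(stepsize, T):
    """Return (sum of eta_t, sum of eta_t^2) for t = 1, ..., T."""
    s1 = s2 = 0.0
    for t in range(1, T + 1):
        eta = stepsize(t)
        s1 += eta
        s2 += eta * eta
    return s1, s2


schedules = {
    "t^-0.5": lambda t: t ** -0.5,  # sum of squares is the harmonic series: grows like ln(T)
    "t^-0.6": lambda t: t ** -0.6,  # hypothetical faster decay: sum of squares stays below 6
}
for name, eta in schedules.items():
    for T in (10**3, 10**4, 10**5):
        s1, s2 = partial_sums(eta, T)
        print(name, T, round(s1, 2), round(s2, 3))
```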

With such a choice, we get

where we have used the second condition in the inequality. Now, the condition implies that converges to 0. So, there exists such that for all . So, we get that

This implies that with probability 1. We are almost done: From this last inequality and the condition that , we can derive the fact that .

**Wait, what? What is this lim inf???** Unfortunately, it seems that we proved something weaker than we wanted to. In words, the lim inf result says that there exists a *subsequence* of the iterates whose gradients converge to zero.

This is very disappointing and we might be tempted to believe that this is the best that we can do. Fortunately, this is not the case. In fact, the seminal paper (Bertsekas and Tsitsiklis, 2000) proved the convergence of the gradients of SGD to zero with probability 1 under very weak assumptions. Their proof is very convoluted, also due to the assumptions they used, but in the following I’ll show a much simpler proof.

**4. The Asymptotic Proof in Few Lines **

In 2018, I found a way to get the same result as (Bertsekas and Tsitsiklis, 2000), distilling their long proof into the following Lemma, whose proof is in the Appendix. It turns out that this Lemma is essentially all that we need.

Lemma 1.Let be two non-negative sequences and a sequence of vectors in a vector space . Let and assume and . Assume also that there exists such that , where is such that . Then, converges to 0.

We are now finally ready to prove the asymptotic convergence with probability 1.

Theorem 2. Assume that we use SGD on a -smooth function, with stepsizes that satisfy the conditions (2). Then, goes to zero with probability 1.

*Proof:* We want to use Lemma 1 on . So, first observe that by the -smoothness of , we have

The assumptions and the reasoning above imply that, with probability 1, . This also suggests setting . Also, we have, with probability 1, , because for is a martingale whose variance is bounded by . Hence, for is a martingale in , so it converges in with probability 1.

Overall, with probability 1 the assumptions of Lemma 1 are verified with .

We did it! Finally, we proved that the gradients of SGD do indeed converge to zero with probability 1. This means that with probability 1 for any there exists such that for .

Even if I didn’t actually use any intuition in crafting the above proof (I rarely use “intuition” to prove things), Yann Ollivier provided the following intuition for this proof: the proof is implicitly studying how far apart GD and SGD are. However, instead of estimating the distance between the two processes over a single update, it does it over a large period of time through the term that can be controlled thanks to the choice of the learning rates.

**5. History Bits **

The idea of taking one iterate at random in SGD was proposed in (Ghadimi and Lan, 2013) and it reminds me of the well-known online-to-batch conversion through randomization. The conditions on the learning rates in (2) go back to (Robbins and Monro, 1951). (Bertsekas and Tsitsiklis, 2000) contains a good review of previous work on asymptotic convergence of SGD, while a recent paper on this topic is (Patel, V., 2020).

I derived Lemma 1 as an extension of Proposition 2 in (Alber et al., 1998)/Lemma A.5 in (Mairal, 2013). Studying the proof of (Bertsekas and Tsitsiklis, 2000), I realized that I could change (Alber et al., 1998, Proposition 2) into what I needed. I had this proof sitting in my unpublished notes for 2 years, so I decided to write a blog post on it.

My actual small contribution to this line of research is a lim inf convergence for SGD with AdaGrad stepsizes (Li and Orabona, 2019), but under stronger assumptions on the noise.

Note that 20-30 years ago there were many papers studying the asymptotic convergence of SGD and its variants in various settings. Then, the taste of the community changed, moving from asymptotic convergence to finite-time rates. As often happens when a new trend takes over the previous one, new generations tend to be oblivious to the old results and proof techniques. The common motivation to ignore these past results is that the finite-time analysis is superior to the asymptotic one, but this is clearly false (ask a statistician!). It should instead be clear to anyone that both analyses have pros and cons.

**6. Acknowledgements **

I thank Léon Bottou for telling me in 2018 about the problem of analyzing the asymptotic convergence of SGD in the non-convex case with a simple and general proof. Léon also helped me check my proofs and found an error in a previous version. Also, I thank Yann Ollivier for reading my proof and kindly providing an alternative proof and the intuition that I report above.

**7. Appendix **

*Proof of Lemma 1:* Since the series diverges, given that converges, we necessarily have . Hence, we have to prove that .

Let us proceed by contradiction and assume that . First, assume that .

Given the values of the and , we can then build two sequences of indices and such that

- ,
- , for ,
- , for .

Define . The convergence of the series implies that the sequences of partial sums are Cauchy sequences. Hence, there exists large enough such that for all we have and are less than or equal to . Then, we have for all and all with ,

Therefore, using the triangle inequality, And finally for all , which contradicts . Therefore, goes to zero.

To rule out the case that , proceed in the same way, choosing any . Hence, we get that for , that contradicts .

Don’t get me wrong: assuming bounded domains is perfectly fine and justified most of the time. However, sometimes it is unnecessary and it might also obscure critical issues in the analysis, as in this case. So, to balance the universe of first-order methods, I decided to show how to easily prove the convergence of the iterates in SGD, even in unbounded domains.

Technically speaking, the following result might be new, but definitely not worth a fight with Reviewer 2 to publish it somewhere.

**1. Setting **

First, let’s define our setting. We want to solve the following optimization problem

where and is a convex function. Now, various assumptions are possible on and choosing the right one depends on *your* particular problem: there are no right answers. Here, we will not make any strong assumption on . Also, we will *not* assume to be bounded. Indeed, in most of the modern applications in Machine Learning, is simply the entire space . We will also assume that is not empty and is any element in it.

We also assume to have access to a *first-order stochastic oracle* that returns stochastic sub-gradients of on any point . In formulas, we get such that . Practically speaking, every time you calculate the (sub)gradient on a minibatch of training data, that is a stochastic (sub)gradient and roughly speaking the random minibatch is the random variable .

Here, for didactic reasons, we will assume that is bounded by 1; similar results can also be shown with more realistic assumptions. This holds, for example, if is an average of 1-Lipschitz functions and you draw some of them to calculate the stochastic subgradient.

The algorithm we want to focus on is SGD. So, what is SGD? SGD is an incredibly simple optimization algorithm, almost primitive. Indeed, part of its fame depends critically on its simplicity. Basically, you start from a certain and you update your solution iteratively moving in the direction of the negative stochastic subgradients, multiplied by a *learning rate* . We also use a projection onto . Of course, if no projection is needed. So, the update of SGD is

where and is the projection onto . Remember that when you use subgradients, SGD is not a descent algorithm: I already blogged about the fact that the common intuition of moving towards a descent direction is wrong when you use subgradients.
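The update described above can be sketched in a few lines of Python. The names `sgd_step` and `project_l2_ball` are hypothetical, and the choice of an L2 ball as feasible set is only an assumption made to have a concrete projection; passing `project=None` corresponds to the unconstrained case where no projection is needed.

```python
import numpy as np

def project_l2_ball(x, radius):
    """Euclidean projection onto the L2 ball of the given radius (assumed feasible set)."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def sgd_step(x, g, eta, project=None):
    """One SGD update: move along the negative stochastic subgradient, then project."""
    y = x - eta * g
    # With no constraint set (the whole space), the projection is the identity.
    return y if project is None else project(y)

# Unconstrained step.
x = sgd_step(np.array([1.0, 0.0]), np.array([1.0, 0.0]), eta=0.5)
# Step followed by a projection onto the unit ball.
x_proj = sgd_step(np.array([3.0, 4.0]), np.zeros(2), eta=0.1,
                  project=lambda y: project_l2_ball(y, 1.0))
print(x, x_proj)
```

Note that nothing in `sgd_step` checks that the function value decreases: with subgradients, a single step can increase it.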

**2. Convergence of the Average of the Iterates **

Now, the most common analysis of SGD can be done in two different ways: constant learning rate and non-increasing learning rate. We already saw both of them in my lecture notes on online learning, so let’s summarize here the one-step inequality for SGD we need:

for all measurable w.r.t. .

If you plan to use iterations, you can use a learning rate , and summing (1) we get

where we set . This is not a convergence result yet, because it just says that *on average* we are converging. To extract a single solution, we can use Jensen’s inequality and obtain

where . In words, we show a convergence guarantee for *the average of the iterates of SGD*, not for the last one.
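As an illustration of returning the average of the iterates, here is a minimal sketch on a toy problem. All specifics are assumptions for the example: the objective $f(x)=\frac{1}{n}\sum_i |x-a_i|$ (an average of 1-Lipschitz functions, so each sampled subgradient is bounded by 1), the constant learning rate `alpha / sqrt(T)`, and the function name `sgd_average`.

```python
import numpy as np

def sgd_average(a, x1, T, alpha, rng):
    """SGD with a constant learning rate on f(x) = mean_i |x - a_i|.

    Returns the average of the iterates x_1, ..., x_T and the final iterate.
    """
    eta = alpha / np.sqrt(T)  # constant stepsize tuned to the horizon T (assumed form)
    x = float(x1)
    avg = 0.0
    for t in range(1, T + 1):
        avg += (x - avg) / t        # running average of x_1, ..., x_t
        i = rng.integers(len(a))    # draw one of the functions uniformly at random
        g = np.sign(x - a[i])       # a subgradient of |x - a_i|, bounded by 1
        x = x - eta * g
    return avg, x

a = np.array([-1.0, 0.0, 1.0])      # the minimizer of f is the median, 0
rng = np.random.default_rng(0)
avg, last = sgd_average(a, x1=5.0, T=10000, alpha=1.0, rng=rng)
print(avg, last)
```

The running average is updated incrementally, so there is no need to store the iterates.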

Constant learning rates are a bit annoying because they depend on how many iterations you plan to do, theoretically and empirically. So, let’s now take a look at non-increasing learning rates, . In this case, the correct way to analyze SGD without the boundedness assumption is to sum (1) *without dividing by *, to have

where we set . From this one, we have two alternatives. First, we can observe that

because is a minimizer and the learning rate non-increasing. So, using again Jensen’s inequality, we get

Note that if you like these sorts of games, you can even change the learning rate to shave a factor, but it is probably useless from an applied point of view.

Another possibility is to use a weighted average:

where and we used . Note that this option does not seem to give any advantage over the unweighted average above. Also, it weights the first iterations more than the last ones, which in most cases is a bad idea: the first iterations tend to be farther away from the optimum than the last ones.

Let’s summarize what we have till now:

- Unbounded domains are fine with both constant and time-varying learning rates.
- The optimal learning rate depends on the distance between the optimal solution and the initial iterate, because the optimal setting of is proportional to .
- The weighted average is probably a bad idea and not strictly necessary.
- It seems we can only guarantee convergence for (weighted) averages of iterates.

The last point is a bit concerning: most of the time we take the last iterate of SGD, so why do we do it if the theory applies to the average?

**3. Convergence of the Last Iterate **

Actually, we do know that

- the last solution of SGD converges in unbounded domains with constant learning rate (Zhang, T., 2004).
- the last iterate of SGD converges in bounded domains with non-increasing learning rates (Shamir, O. and Zhang, T., 2013).

So, what about unbounded domains and non-increasing learning rates, i.e., 90% of the uses of SGD? It turns out that it is equally simple and I think the proof is also instructive! As surprising as it might sound, not dividing (1) by the learning rate is the key ingredient we need. The proof plan is the following: we want to prove that the value of $f$ on the last iterate is not too far from the value of $f$ on the average of the iterates. To prove it, we need the following technical lemma on sequences of non-negative numbers multiplied by non-increasing learning rates, whose proof is in the Appendix. This Lemma relates the last element of a sequence of numbers to their average.

Lemma 1.Let a non-increasing sequence of positive numbers and . Then

With the above Lemma, we can prove the following guarantee for the convergence of the last iterate of SGD.

Theorem 2. Assume the stepsizes are deterministic and non-increasing. Then

*Proof:* We use Lemma 1, with , to have

Now, we bound the sum in the r.h.s. of last inequality. Summing (1) from to , we have the following inequality that holds for any :

Hence, setting , we have

Putting all together, we have the stated bound.

There are a couple of nice tricks in the proof that might be interesting to study carefully. First, we use the fact that the one-step inequality in (1) holds for any . Most of the time, we state it with equal to , but it turns out that the more general statement is actually important! In fact, it is possible to know how far the performance of the last iterate is from the performance of the average because the incremental nature of SGD makes it possible to know exactly how far is from any previous iterate , with . Please note that all of this would be hidden in the case of bounded domains, where all the distances are bounded by the diameter of the set, and you don’t get the dependency on .

Now we have all the ingredients and we only have to substitute a particular choice of the learning rate.

*Proof:* First, observe that

Now, considering the last term in (3), we have

Using (2) and dividing by , we have the stated bound.

Note that the above proof works similarly if .
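To see the last iterate at work in an unbounded domain, here is a small sketch. The objective $f(x)=\frac{1}{n}\sum_i \|x-a_i\|_2$ (an average of 1-Lipschitz functions), the learning rate `c / sqrt(t)`, and the names `sgd_last_iterate` and `objective` are all assumptions made for this example; the constants do not come from the bound above.

```python
import numpy as np

def sgd_last_iterate(a, x1, T, c, rng):
    """Unprojected SGD with stepsize c/sqrt(t) on f(x) = mean_i ||x - a_i||_2.

    Returns the last iterate, with no averaging and no bounded domain.
    """
    x = np.array(x1, dtype=float)
    for t in range(1, T + 1):
        i = rng.integers(len(a))        # sample one function uniformly at random
        diff = x - a[i]
        norm = np.linalg.norm(diff)
        # Subgradient of ||x - a_i||_2; its norm is at most 1.
        g = diff / norm if norm > 0 else np.zeros_like(x)
        x = x - (c / np.sqrt(t)) * g    # non-increasing learning rate
    return x

def objective(a, x):
    return np.mean(np.linalg.norm(x - a, axis=1))

# Corners of a square: by symmetry the minimizer is the origin.
a = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
x_last = sgd_last_iterate(a, [3.0, -2.0], T=20000, c=1.0, rng=np.random.default_rng(0))
print(objective(a, x_last) - objective(a, np.zeros(2)))
```

A single run only illustrates the behavior, since the guarantee holds in expectation over the stochastic subgradients.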

**4. History Bits **

The first finite-time convergence proof for the last iterate of SGD is from (Zhang, T., 2004), where he considered the constant learning rate case. It was later extended in (Shamir, O. and Zhang, T., 2013) to time-varying learning rates, but only for bounded domains. The convergence rate for the weighted average in unbounded domains is from (Zhang, T., 2004). The observation that the weighted average is not needed and the plain average works equally well for non-increasing learning rates is from (Li, X. and Orabona, F., 2019), where we needed it for the particular case of AdaGrad learning rates. The idea of analyzing SGD without dividing by the learning rate is by (Zhang, T., 2004). Lemma 1 is new but actually hidden in the convergence proof of the last iterate of SGD with linear predictors and square losses in (Lin, J. and Rosasco, L. and Zhou, D.-X., 2016), which in turn is based on the one in (Shamir, O. and Zhang, T., 2013). As far as I know, Corollary 3 is new, but please let me know if you happen to know a reference for it! It is possible to remove the logarithmic term in the bound using a different learning rate, but the proof is only for bounded domains (Jain, P. and Nagaraj, D. and Netrapalli, P., 2019).

**5. Exercises **

Exercise 1.Generalize the above proofs to the Stochastic Mirror Descent case.

Exercise 2. Remove the assumption of expected bounded stochastic subgradients and instead assume that is -smooth, i.e., has -Lipschitz gradient, and the variance of the noise is bounded. Hint: take a look at the proofs in (Zhang, T., 2004) and (Li, X. and Orabona, F., 2019).

**6. Appendix **

*Proof of Lemma 1:* Define , so we have

that implies

Now, from the definition of and the above inequality, we have

that implies

Unrolling the inequality, we have

Using the definition of and the fact that , we have the stated bound.
