*You can find the lectures I published till now here.*

In this lecture, we will explore the possibility to obtain logarithmic regret for non-strongly convex functions. Also, we explore a bit more the strategy of moving pieces of the losses inside the regularizer, as we did for the composite and strongly convex losses.

**1. Online Newton Step **

Last time, we saw that the notion of strong convexity allows us to build quadratic surrogate loss functions, on which Follow-The-Regularized-Leader (FTRL) has smaller regret. Can we find a more general notion of strong convexity that allows us to get a small regret for a larger class of functions? We can start from strong convexity and try to generalize it. So, instead of asking that the function is strongly convex w.r.t. a norm, we might be happy requiring that strong convexity holds at a particular point w.r.t. a norm that depends on the point itself.

In particular, we can require that for each loss and for all the following holds

Note that this is a weaker property than strong convexity because depends on . On the other hand, in the definition of strong convexity we want the last term to be the same norm (or Bregman divergence in the more general formulation) everywhere in the space.

The rationale of this new definition is that it still allows us to build surrogate loss functions, but without requiring strong convexity over the entire space. Hence, we can think of using FTRL on the surrogate losses

and the proximal regularizers , where . We will denote by .

Remark 1 Note that is a norm because is Positive Definite (PD) and is -strongly convex w.r.t. defined as (because the Hessian is and ). Also, the dual norm of is .

From the above remark, we have that the regularizer is 1-strongly convex w.r.t . Hence, using the FTRL regret guarantee for proximal regularizers, we immediately get the following guarantee

So, reordering the terms we have

Note how the proof and the algorithm mirror what we did in FTRL with strongly convex losses in the last lecture.

Remark 2 It is possible to generalize our Lemma of FTRL for proximal regularizers to hold with this generalized notion of strong convexity. This would allow us to get exactly the same bound running FTRL over the original losses with regularizer .

Let’s now see a practical instantiation of this idea. Consider the case that the sequence of loss functions we receive satisfy

In words, *we assume to have a class of functions that can be upper bounded by a quadratic that depends on the current subgradient*. In particular, these functions possess some curvature only in the direction of (any of) the subgradients. Denoting by , we can use the above idea with

Hence, the update rule would be

We obtain the following algorithm, called Online Newton Step (ONS).
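The update above can be sketched in a few lines. The following is a minimal, unconstrained NumPy sketch: the constants `beta` and `eps`, and the omission of the projection onto the feasible set, are simplifying assumptions rather than the exact recipe of the original algorithm.

```python
import numpy as np

def online_newton_step(grad_fn, x0, T, beta=1.0, eps=1.0):
    """Unconstrained sketch of Online Newton Step.

    grad_fn(t, x) returns a (sub)gradient g_t of the t-th loss at x.
    The matrix A_t accumulates eps*I + beta * sum_s g_s g_s^T, and the
    iterate moves by the Newton-like step A_t^{-1} g_t.
    """
    x = np.asarray(x0, dtype=float)
    A = eps * np.eye(len(x))
    for t in range(T):
        g = np.asarray(grad_fn(t, x), dtype=float)
        A += beta * np.outer(g, g)        # rank-one update of the metric
        x = x - np.linalg.solve(A, g)     # projection onto V omitted here
    return x
```

On a toy loss like f_t(x) = ||x - 1||^2 the iterate converges to the minimizer, while the growing matrix A_t automatically shrinks the steps, playing the role of a per-direction learning rate.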

Denoting by and using (1), we have

To bound the last term, we will use the following Lemma.

Lemma 1 (Cesa-Bianchi, N. and Lugosi, G., 2006, Lemma 11.11 and Theorem 11.7) Let be a sequence of vectors in and . Define . Then, the following holds

where are the eigenvalues of .
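The lemma is easy to check numerically. A small sanity check, assuming the initial matrix is λI and the statement takes the form Σ_t z_t^⊤ A_t^{-1} z_t ≤ log(det(A_T)/det(A_0)) (the dimension, horizon, and random vectors below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, lam = 3, 50, 1.0

A = lam * np.eye(d)                        # A_0 = lam * I
lhs = 0.0
for _ in range(T):
    z = rng.standard_normal(d)
    A = A + np.outer(z, z)                 # A_t = A_{t-1} + z_t z_t^T
    lhs += z @ np.linalg.solve(A, z)       # z_t^T A_t^{-1} z_t
rhs = np.log(np.linalg.det(A)) - d * np.log(lam)   # log det(A_T)/det(A_0)
```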

Putting everything together and assuming that and that (2) holds for the losses, ONS satisfies the following regret

where in the second inequality we used the inequality of arithmetic and geometric means, , and the fact that .

Hence, if the losses satisfy (2), we can guarantee a logarithmic regret. However, differently from the strongly convex case, here the complexity of the update is at least quadratic in the number of dimensions. Moreover, the regret also depends linearly on the number of dimensions.

Remark 3 Despite the name, the ONS algorithm should not be confused with the Newton algorithm. They are similar in spirit because they both construct quadratic approximations to the function, but the Newton algorithm uses the exact Hessian, while ONS uses an approximation that works only for a restricted class of functions. In this view, ONS is more similar to Quasi-Newton methods.

Let’s now see an example of functions that satisfy (2).

Example 1 (Exp-Concave Losses) Define convex; we say that a function is -exp-concave if is concave. Choose such that for all and . Note that we need a bounded domain for to exist. Then, this class of functions satisfies property (2). In fact, given that is -exp-concave, it is also -exp-concave. Hence, from the definition we have

that is

that implies

where we used the elementary inequality , for .

Example 2 Let . The logistic loss of a linear predictor , where , is -exp-concave.

**2. Online Regression: Vovk-Azoury-Warmuth Forecaster **

Let’s now consider the specific case that and , that is, *unconstrained online linear regression with the square loss*. These losses are not strongly convex w.r.t. , but they are exp-concave when the domain is bounded. We could use the ONS algorithm, but it would not work in the unbounded case. Another possibility would be to run FTRL, but the losses are not strongly convex and we would get only a regret.

It turns out we can still get a logarithmic regret, if we make an additional assumption! We will assume to have access to before predicting . Note that this is a mild assumption in most of the interesting applications. Then, the algorithm will just be *FTRL over the past losses plus the loss on the received hallucinating a label of *. This algorithm is called Vovk-Azoury-Warmuth, from the names of its inventors. The details are in Algorithm 2.

As we did for composite losses, we look closely at the loss functions to see if there are terms that we might move inside the regularizer. The motivation is the same as in the composite losses case: the bound will depend only on the subgradients of the part of the losses that is outside of the regularizer.

So, observe that

From the above, we see that we could think of moving the terms into the regularizer, leaving the linear terms in the loss: . Hence, we will use

Note that the regularizer at time contains the that is revealed to the algorithm before it makes its prediction. For simplicity of notation, denote by .

Using this procedure, the prediction can be written in closed form:

Hence, using the regret we proved for FTRL with strongly convex regularizers and , we get the following guarantee

Noting that and reordering the terms we have

Remark 4 Note that, differently from the ONS algorithm, the regularizers here are not proximal. Yet, we get in the bound because the current sample is used in the regularizer.
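The closed-form prediction lends itself to a compact sketch. The L2 regularizer weight `lam` below is a hypothetical choice; note how the current z_t enters the matrix before the prediction, while only past labels enter the vector:

```python
import numpy as np

def vaw_predictions(Z, y, lam=1.0):
    """Sketch of the Vovk-Azoury-Warmuth forecaster.

    Z[t] is the feature vector z_t (revealed before predicting),
    y[t] is the label (revealed after the prediction).
    """
    T, d = Z.shape
    A = lam * np.eye(d)      # lam*I + sum_{s<=t} z_s z_s^T
    b = np.zeros(d)          # sum_{s<t} y_s z_s  (past labels only)
    preds = []
    for t in range(T):
        A += np.outer(Z[t], Z[t])   # z_t enters with the hallucinated label 0
        preds.append(Z[t] @ np.linalg.solve(A, b))
        b += y[t] * Z[t]
    return np.array(preds)
```

On noiseless data the predictions approach the labels as the regularization is washed out by the accumulated outer products.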

So, using again Lemma 1, we have

where are the eigenvalues of .

If we assume that , we can reason as we did for the similar term in ONS, to have

Putting everything together, we have the following theorem.

Theorem 2 Assume and . Then, using the prediction strategy in Algorithm 2, we have

Remark 5 It is possible to show that the regret of the Vovk-Azoury-Warmuth forecaster is optimal up to multiplicative factors (Cesa-Bianchi, N. and Lugosi, G., 2006, Theorem 11.9).

**3. History Bits **

The Online Newton Step algorithm was introduced in (Hazan, E. and Kalai, A. and Kale, S. and Agarwal, A., 2006) and it is described for the particular case in which the loss functions are exp-concave. Here, I described a slight generalization for any sequence of functions that satisfy (2), which in my view better shows the parallel between FTRL over strongly convex functions and ONS. Note that (Hazan, E. and Kalai, A. and Kale, S. and Agarwal, A., 2006) also describes a variant of ONS based on Online Mirror Descent, but I find its analysis less interesting from a didactic point of view. The proof presented here through the properties of proximal regularizers might be new, I am not sure.

The Vovk-Azoury-Warmuth algorithm was introduced independently by (K. S. Azoury and M. K. Warmuth, 2001) and (Vovk, V., 2001). The proof presented here is from (F. Orabona and K. Crammer and N. Cesa-Bianchi, 2015).

**4. Exercises **

Exercise 1 Prove the statement in Example 2.


Exercise 2 Prove that the losses , where , , and , are exp-concave and find the exp-concavity constant.


Last time, we saw that we can use Follow-The-Regularized-Leader (FTRL) on linearized losses:

Today, we will show a number of applications of FTRL with linearized losses, some easy ones and some more advanced ones.

As a reminder, the regret upper bound for FTRL with linearized losses that we proved last time, for the case in which is -strongly convex w.r.t. for , is

We also said that we are free to choose , so we will often set it to .

Remark 1 Note that the algorithm is invariant to any positive constant added to the regularizer, hence we can always state the regret guarantee with instead of . However, for clarity, in the following we will instead explicitly choose the regularizers such that their minimum is 0.

**1. FTRL with Linearized Losses Can Be Equivalent to OMD **

First, we see that even if FTRL and OMD seem very different, in certain cases they are equivalent. For example, consider the case that . The output of OMD is

Assume that for all . This implies that , that is . Assuming , we have

On the other hand, consider FTRL with linearized losses with regularizers , then

Assuming that , this implies that . Further, assuming that is invertible, this implies that the predictions of FTRL and OMD are the same.

This equivalence immediately gives us some intuition on the role of in both algorithms: The same function induces the Bregman divergence, that is, our similarity measure, and acts as the regularizer in FTRL. Moreover, the inverse of the growth rate of the regularizers in FTRL takes the role of the learning rate in OMD.

Example 1 Consider and ; then the conditions above are satisfied and the predictions of OMD coincide with the ones of FTRL.
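The equivalence is easy to verify numerically. A minimal check in the Euclidean, unconstrained case, assuming a constant learning rate η and the regularizer ψ(x) = ||x||²/(2η) (the dimension and random gradients below are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
eta, d, T = 0.1, 4, 30
grads = rng.standard_normal((T, d))   # linearized losses g_1, ..., g_T

x_omd = np.zeros(d)                   # OMD/OSD iterate
theta = np.zeros(d)                   # FTRL dual state: running gradient sum
for g in grads:
    x_omd = x_omd - eta * g           # OMD step with constant learning rate
    theta = theta + g
x_ftrl = -eta * theta                 # argmin_x <theta, x> + ||x||^2/(2*eta)
```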

**2. Exponentiated Gradient with FTRL: No Need to know **

Let’s see an example of an instantiation of FTRL with linearized losses: the FTRL version of Exponentiated Gradient (EG).

Let and let the sequence of loss functions be convex and -Lipschitz w.r.t. the L-infinity norm. Let be defined as , where and we define . Set , which is -strongly convex w.r.t. the L1 norm, where is a parameter of the algorithm.

Given that the regularizers are strongly convex, we know that

We already saw that , which implies that . So, we have that

Note that this is exactly the same update of EG based on OMD, but here we are effectively using time-varying learning rates.

We also get that the regret guarantee is

where we used the fact that using and are equivalent. Choosing , this regret guarantee is similar to the one we proved for OMD, but with an important difference: We don’t have to know in advance the number of rounds . In OMD, a similar bound would be vacuous because it would depend on , which is infinite.
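The resulting algorithm can be sketched compactly. The concrete schedule η_t = α/√t below is an assumed instantiation of the time-varying regularizer, with α a free parameter:

```python
import numpy as np

def ftrl_eg(grads, alpha=1.0):
    """FTRL version of Exponentiated Gradient (sketch).

    grads has shape (T, d): the linearized losses g_t. The prediction is a
    softmax of the scaled negative gradient sum, i.e. FTRL over the simplex
    with an entropic regularizer growing like sqrt(t)/alpha.
    """
    T, d = grads.shape
    theta = np.zeros(d)
    xs = [np.full(d, 1.0 / d)]                 # x_1 = uniform distribution
    for t in range(1, T + 1):
        theta += grads[t - 1]
        s = -(alpha / np.sqrt(t)) * theta      # effective learning rate ~ 1/sqrt(t)
        w = np.exp(s - s.max())                # numerically stable softmax
        xs.append(w / w.sum())
    return xs
```

Note that the iterates always live on the simplex, and the shrinking effective learning rate is exactly what removes the need to know the number of rounds in advance.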

**3. Composite Losses **

Let’s now see a variant of the linearization of the losses: *partial linearization of composite losses*.

Suppose that the losses we receive are composed of two terms: one convex function changing over time and another part that is fixed and known. These losses are called *composite*. For example, we might have . Using the linearization, we might just take the subgradient of . However, in this particular case, we would lose the ability of the L1 norm to produce sparse solutions.

There is a better way to deal with this kind of losses: Move the fixed part of the loss inside the regularization term. In this way, that part will not be linearized, but used exactly in the argmin of the update. Assuming that the argmin is still easily computable, you can always expect better performance from this approach. In particular, in the case of adding an L1 norm to the losses, you will be predicting at each step with the solution of an L1-regularized optimization problem.

Practically speaking, in the example above, we will define , where we assume to be 1-strongly convex and the losses to be -Lipschitz. Note that at time we use the term because we anticipate the next term in the next round. Given that is -strongly convex, using (1), we have

where . Reordering the terms, we have

Example 2 Let’s also take a look at the update rule in the case that and we receive composite losses with the L1 norm. We have

We can solve this problem by observing that the minimization decomposes over each coordinate of . Denote by . Hence, we know from the first-order optimality condition that is the solution for coordinate iff there exists such that

Consider the 3 different cases:

- , then and .
- , then and .
- , then and .
So, overall we have

Observe how this update produces sparse solutions, while just taking the subgradient of the L1 norm would never produce sparse predictions.

Remark 2 (Proximal operators) In the example above, we calculated something like

This operation is known in the optimization literature as the *Proximal Operator* of the L1 norm. In general, the proximal operator of a convex, proper, and closed function is defined as

Proximal operators are used in optimization in the same way as we used it here: They allow us to minimize the entire function rather than a linear approximation of it. Also, proximal operators generalize the concept of Euclidean projection. Indeed, .
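For the L1 norm, the proximal operator is exactly the coordinate-wise soft-thresholding derived in the three cases of the example above. A minimal sketch:

```python
import numpy as np

def prox_l1(v, tau):
    """prox of tau*||.||_1: argmin_x 0.5*||x - v||^2 + tau*||x||_1.

    Coordinate-wise soft-thresholding: shrink each coordinate toward
    zero by tau, and set it to zero when its magnitude is below tau.
    """
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)
```

With tau = 0 this reduces to the identity, and the proximal operator of the indicator function of a convex set is the Euclidean projection onto it, as noted in the remark.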

**4. FTRL with Strongly Convex Functions **

Let’s now go back to the FTRL regret bound and see if we can strengthen it in the case that the regularizer is *proximal*, that is, it satisfies .

Lemma 1 Denote by . Assume that is not empty and set . Also, assume that is -strongly convex w.r.t. and is convex, and the regularizer is such that . Also, assume that is non-empty. Then, we have

*Proof:* We have

where in the second inequality we used Corollary 1 from last lecture, the fact that , and . Observing that, from the proximal property, we have , . Hence, using the theorem on the subdifferential of a sum of functions, and remembering that , we can choose such that we have .

Remark 3 Note that a constant regularizer is proximal, because any point is the minimizer of the zero function. On the other hand, a constant regularizer makes the two Lemmas the same, unless the loss functions contribute to the total strong convexity.

We will now use the above lemma to prove a logarithmic regret bound for strongly convex losses.

Corollary 2 Let be strongly convex w.r.t. , for . Set the sequence of regularizers to zero. Then, FTRL guarantees a regret of

The above regret guarantee is the same as the one of OMD over strongly convex losses, but here we don’t need to know the strong convexity of the losses. In fact, we just need to output the minimizer over the past losses. However, as we noticed last time, this might be undesirable, because now each update requires solving an optimization problem.

Hence, we can again use the idea of replacing the losses with easy *surrogates*. In the Lipschitz case, it made sense to use linear losses. However, here we can do better and use *quadratic* losses, because the losses are strongly convex. So, we can run FTRL on the quadratic losses , where . The algorithm is the following one:

To see why this is a good idea, consider the case that the losses are strongly convex w.r.t. the L2 norm. The update now becomes:

Moreover, we will get exactly the same regret bound as in Corollary 2, with the only difference that here the guarantee holds for a specific choice of the rather than for any subgradient in .
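In the L2 case the FTRL minimizer over the quadratic surrogates has a simple closed form, so no optimization problem needs to be solved at each step. A sketch, assuming each loss is μ-strongly convex w.r.t. the L2 norm:

```python
import numpy as np

def ftrl_quadratic_surrogates(grad_fn, mu, x0, T):
    """FTRL (no regularizer) on surrogates <g_t, x> + (mu/2)*||x - x_t||^2.

    Setting the gradient of the sum of surrogates to zero gives
    x_{t+1} = (sum_s (mu*x_s - g_s)) / (mu*t), computed incrementally.
    """
    x = np.asarray(x0, dtype=float)
    num = np.zeros_like(x)    # accumulates mu*x_s - g_s
    den = 0.0                 # accumulates mu
    xs = [x.copy()]
    for _ in range(T):
        g = np.asarray(grad_fn(x), dtype=float)
        num += mu * x - g
        den += mu
        x = num / den
        xs.append(x.copy())
    return xs
```

Unrolling the recursion shows this is the same as online subgradient descent with learning rate 1/(μt), which is the content of the exercise at the end of this lecture.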

Example 3 Going back to the example in the first lecture, where and are strongly convex, we now see immediately that FTRL without a regularizer, that is, Follow the Leader, gives logarithmic regret. Note that in this case the losses were defined only over , so that the minimization is carried out over .

**5. History Bits **

The first analysis of FTRL with composite losses is in (L. Xiao, 2010). The analysis presented here using the negative terms to easily prove regret bounds for FTRL for composite losses is from (F. Orabona and K. Crammer and N. Cesa-Bianchi, 2015).

The first proof of FTRL for strongly convex losses was in (S. Shalev-Shwartz and Y. Singer, 2007) (even if they don’t call it FTRL).

There is an interesting bit about FTRL-Proximal (McMahan, H. B., 2011): FTRL-Proximal is an instantiation of FTRL that uses a particular proximal regularizer. It became very famous in internet companies when Google disclosed in a very influential paper that they were using FTRL-Proximal to train the classifier for click prediction (McMahan, H. B. and Holt, G. and Sculley, D. and Young, M. and Ebner, D. and Grady, J. and Nie, L. and Phillips, T. and Davydov, E. and Golovin, D. and Chikkerur, S. and Liu, D. and Wattenberg, M. and Hrafnkelsson, A. M. and Boulos, T. and Kubica, J., 2013). This generated even more confusion, because many people started conflating the term FTRL-Proximal (a specific algorithm) with FTRL (an entire family of algorithms). Unfortunately, this confusion is still going on to this day.

**6. Exercises **


Exercise 1 Prove that the update in (2) is equivalent to the one of OSD with and learning rate .


Till now, we focused only on Online Subgradient Descent and its generalization, Online Mirror Descent (OMD), with a brief ad-hoc analysis of Follow-The-Leader (FTL) in the first lecture. In this class, we will extend FTL to a powerful and generic algorithm for online convex optimization: **Follow-the-Regularized-Leader** (FTRL).

FTRL is a very intuitive algorithm: At each time step it will play the minimizer of the sum of the past losses *plus* a time-varying regularization. We will see that the regularization is needed to make the algorithm “more stable” with linear losses and avoid the jumping back and forth that we saw in Lecture 2 for Follow-the-Leader.

**1. Follow-the-Regularized-Leader **

As said above, in FTRL we output the minimizer of the regularized cumulative past losses. It should be clear that FTRL is not an algorithm, but rather a family of algorithms, in the same way as OMD is a family of algorithms.

Before analyzing the algorithm, let’s get some intuition about it. In OMD, we saw that the “state” of the algorithm is stored in the current iterate , in the sense that the next iterate depends on and the loss received at time (the choice of the learning rate has only a small influence on the next iterate). Instead, in FTRL the next iterate depends on the entire history of losses received up to time . This has an immediate consequence: In the case that is bounded, OMD will only “remember” the last , and not the iterate before the projection. On the other hand, FTRL keeps in memory the entire history of the past, which in principle allows us to recover the iterates before the projection in .

This difference in behavior might make the reader think that FTRL is more expensive in computation and memory. And indeed it is! But we will also see that there is a way to use approximate losses that makes the algorithm as expensive as OMD, while retaining strictly more information than OMD.

For FTRL, we prove a surprising result: an equality for the regret! The proof is in the Appendix.

Lemma 1 Denote by . Assume that is not empty and set . Then, for any , we have

Remark 1 Note that we basically didn’t assume anything about nor about ; the above equality holds even for non-convex losses and regularizers. Yet, solving the minimization problem at each step might be computationally infeasible.

Remark 2 Note that the left hand side of the equality in the theorem does not depend on , so if needed we can set it to .

However, while surprising, the above equality is not yet a regret bound: it is somewhat “implicit”, because the losses appear on both sides of the equality.

Let’s take a closer look at the equality. If , we have that the sum of the last two terms on the r.h.s. is negative. On the other hand, the first two terms on the r.h.s. are similar to what we got in OMD. The interesting part is the sum of the terms . To give an intuition of what is going on, let’s consider the case in which the regularizer is constant over time, i.e., . Hence, the terms in the sum can be rewritten as

Hence, we are measuring the distance between the minimizers of the regularized losses (with two different regularizers) in two consecutive predictions of the algorithm. Roughly speaking, this term will be small if and the losses plus regularization are “nice”. This should remind you exactly of the OMD update, where we *constrain* to be close to . Instead, here the two predictions will be close to each other if the minimizer of the regularized losses up to time is close to the minimizer of the losses up to time . So, as in OMD, the regularizer here plays the critical role of *stabilizing* the predictions, if the losses don’t possess enough curvature.

To quantify this intuition, we need a property of strongly convex functions.

**2. Convex Analysis Bits: Properties of Strongly Convex Functions **

We will use the following lemma for strongly convex functions.

Lemma 2 Let be -strongly convex with respect to a norm . Then, for all , , and , we have

*Proof:* Define . Observe that , hence is the minimizer of . Also, note that . Hence, we can write

where the last step comes from the conjugate function of the squared norm (See Example 3 in the lecture on OLO lower bounds).

Corollary 3 Let be -strongly convex with respect to a norm . Let . Then, for all , and , we have

In words, the above lemma says that an upper bound to the suboptimality gap is proportional to the squared norm of the subgradient.

**3. An Explicit Regret Bound using Strongly Convex Regularizers **

We now state a Lemma quantifying the intuition on the “stability” of the predictions.

Lemma 4 With the notation and assumptions of Lemma 1, assume that is proper and -strongly convex w.r.t. , and is proper and convex. Also, assume that is non-empty. Then, we have

for all .

*Proof:* We have

where in the second inequality we used Lemma 2, the fact that , and . Observing that , we have . Hence, using the theorem on the subdifferential of a sum of functions, we can choose such that we have .

Let’s see some immediate applications of FTRL.

Corollary 5 Let be a sequence of convex loss functions. Let be a -strongly convex function w.r.t. . Set the sequence of regularizers as , where . Then, FTRL guarantees

for all . Moreover, if the functions are -Lipschitz, setting we get

*Proof:* The corollary is immediate from Lemma 1, Lemma 4, and the observation that from the assumptions we have . We also set , thanks to Remark 2.

This might look like the same regret guarantee as the one of OMD; however, here there is a very important difference: The last term contains a time-varying element ( ), but the domain does not have to be bounded! Also, I used the regularizer and not to remind you of another important difference: In OMD, the learning rate is chosen after receiving the subgradient , while here you have to choose it before receiving it!

Another important difference is that here the update rule seems way more expensive than in OMD, because we need to solve an optimization problem at each step. However, it turns out we can use FTRL on *linearized losses* and obtain the same bound with the same computational complexity as OMD.

**4. FTRL with Linearized Losses **

If we consider the case in which the losses are linear, we have that the prediction of FTRL is

Now, if we assume to be proper, convex, and closed, using Theorem 2 in the lecture on OLO lower bounds, we have that . Moreover, if is strongly convex, we know that is differentiable and we get

In turn, this update can be written in the following way

This corresponds to Figure 1.

Compare it to the mirror update of OMD, rewritten in a similar way:

They are very similar, but with important differences:

- In OMD, the state is kept in , so we need to transform it into a dual variable before making the update and then back to the primal variable.
- In FTRL with linear losses, the state is kept directly in the dual space, updated and then transformed in the primal variable. The primal variable is only used to predict, but not directly in the update.
- In OMD, the subgradients are weighted by learning rates that are typically decreasing.
- In FTRL with linear losses, all the subgradients have the same weight, but the regularizer is typically increasing over time.

Also, we will not lose anything in the bound! Indeed, we can run FTRL on the linearized losses , where , guaranteeing exactly the same regret on the losses . The pseudocode for this procedure is in Algorithm 2.

In fact, using the definition of the subgradients and the assumptions of Corollary 5, we have

The only difference with respect to Corollary 5 is that here the are the specific ones we use in the algorithm, while in Corollary 5 the statement holds for any choice of the .

In the next example, we can see the different behavior of FTRL and OMD.

Example 1 Consider . With Online Subgradient Descent (OSD) with learning rate and , the update is

On the other hand in FTRL with linearized losses, we can use and it is easy to verify that the update in (1) becomes

While the regret guarantee would be the same for these two updates, from an intuitive point of view OMD seems to be losing a lot of potential information due to the projection and the fact that we only memorize the projected iterate.
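The difference is easy to see on a toy one-dimensional problem with domain [-1, 1]: after the iterate hits the boundary, greedy OSD forgets how far past the boundary it was pushed, while lazy FTRL remembers it through the gradient sum. The specific gradients below are only an illustrative assumption:

```python
# Euclidean projection onto V = [-1, 1]
proj = lambda v: max(-1.0, min(1.0, v))

eta = 1.0
grads = [2.0, 2.0, -3.0]     # two pushes to the left, then a big push right

x_osd, theta = 0.0, 0.0
for g in grads:
    x_osd = proj(x_osd - eta * g)   # greedy: project, forget the overflow
    theta -= eta * g                # lazy/FTRL: keep the full (scaled) sum
x_ftrl = proj(theta)

# OSD:  0 -> -1 -> -1 -> proj(-1 + 3) =  1
# FTRL: theta goes -2 -> -4 -> -1, so proj(-1) = -1
```

The two methods end up on opposite ends of the domain: FTRL still knows that the first two gradients pushed far beyond the boundary.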

Next time, we will see how to obtain logarithmic regret bounds for strongly convex losses for FTRL and more applications.

**5. History Bits **

Follow the Regularized Leader was introduced in (Abernethy, J. D. and Hazan, E. and Rakhlin, A., 2008), where at each step the prediction is computed as the minimizer of a regularization term plus the sum of the losses on all past rounds. However, the key ideas of FTRL, and in particular its analysis through the dual, were planted by Shai Shalev-Shwartz and Yoram Singer way before (Shalev-Shwartz, S. and Singer, Y., 2006)(Shalev-Shwartz, S. and Singer, Y., 2007). Later, the PhD thesis of Shai Shalev-Shwartz (S. Shalev-Shwartz, 2007) contained the most precise dual analysis of FTRL, but he called it “online mirror descent” because the name FTRL was invented only later! Even later, I contributed to the confusion by naming a general analysis of FTRL with time-varying regularizers and linear losses “generalized online mirror descent” (F. Orabona and K. Crammer and N. Cesa-Bianchi, 2015). So, now I am trying to set the record straight.

Later still, the optimization community rediscovered FTRL with linear losses and called it Dual Averaging (Nesterov, Y., 2009), even if Nesterov used similar ideas already in 2005 (Nesterov, Y., 2005). It is interesting to note that Nesterov introduced the Dual Averaging algorithm to fix the fact that in OMD gradients enter the algorithm with decreasing weights, contradicting the common-sense understanding of how optimization should work. The same ideas were then translated to online learning and stochastic optimization in (L. Xiao, 2010), essentially rediscovering the framework of Shalev-Shwartz and rebranding it Regularized Dual Averaging (RDA). Finally, (McMahan, H B., 2017) gives the elegant equality result that I presented here (with minor improvements), which holds for general loss functions and regularizers. Note that the dual interpretation of FTRL comes out naturally for linear losses, but Lemma 1 underlines the fact that the algorithm is actually more general.

Another source of confusion stems from the fact that some people differentiate among a “lazy” and “greedy” version of OMD. In reality, as proved in (McMahan, H B., 2017), the lazy algorithm is just FTRL with linearized losses and the greedy one is just OMD. The name “lazy online mirror descent” was introduced in (Zinkevich, M., 2004), where he basically introduced FTRL with linearized losses for the first time.

**6. Exercises **

Exercise 1 Prove that the update of FTRL with linearized losses in Example 1 is correct.

Exercise 2 Find a way to obtain bounds for smooth losses with linearized FTRL: Do you need an additional assumption compared to what we did for OSD?

**7. Appendix **

*Proof of Lemma 1:* Define and for . Hence, we have that . Now, consider

We also have

Hence, putting these two inequalities together, we get

Observing that

and that , we get the equality


Today, we will see a couple of practical implementations of Online Mirror Descent with two different Bregman functions, and we will introduce the setting of *Learning with Expert Advice*.

**1. Exponentiated Gradient **

Let and be defined as the negative entropy , where we define . Also, we set the feasibility set . So, in words, we want to output discrete probability distributions over .

The Fenchel conjugate is defined as

The solution is , see Appendix.
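The claimed solution can be sanity-checked numerically: restricted to the probability simplex, the conjugate of the negative entropy is the log-sum-exp function, and the maximizer is the softmax of the dual vector. A small check (the dimension and the random dual vector are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
theta = rng.standard_normal(5)

lse = np.log(np.sum(np.exp(theta)))              # psi*(theta) = log-sum-exp
x_star = np.exp(theta) / np.sum(np.exp(theta))   # maximizer = softmax(theta)

def objective(x):
    """<theta, x> - sum_i x_i*log(x_i): the function maximized over the simplex."""
    return theta @ x - np.sum(x * np.log(x))
```

The softmax point attains the log-sum-exp value, and no other point of the simplex does better.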

We also have and .

Putting everything together, we have the online mirror descent update rule for the entropic distance generating function.

The algorithm is summarized in Algorithm 1. This algorithm is called *Exponentiated Gradient* (EG) because in the update rule we take the component-wise exponential of the gradient vector.
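One step of the update is just a multiplicative reweighting followed by a normalization; a minimal sketch:

```python
import numpy as np

def eg_update(x, g, eta):
    """One Exponentiated Gradient step (OMD with the entropic psi).

    x is the current point on the simplex, g the (sub)gradient, and the
    new point satisfies x_{t+1,i} proportional to x_{t,i}*exp(-eta*g_i).
    """
    w = x * np.exp(-eta * g)
    return w / w.sum()
```

Coordinates with larger losses are exponentially downweighted, and the iterate never leaves the simplex, so no explicit projection is needed.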

Let’s take a look at the regret bound we get. First, as we said, observe that

that is the KL divergence between the 1-dimensional discrete distributions and . Now, the following Lemma tells us about the strong convexity of .

Lemma 1 (S. Shalev-Shwartz, 2007, Lemma 16) is 1-strongly convex with respect to the norm over the set .

Another thing to decide is the initial point . We can set to be the minimizer of in . In this way simplifies to . Hence, we set . So, we have .

Putting everything together, we have

Assuming , we can set to obtain a regret of .

Remark 1 Note that the time-varying version of OMD with the entropic distance generating function would give rise to a vacuous bound; can you see why? We will see how FTRL overcomes this issue, using a time-varying regularizer rather than a time-varying learning rate.

How would Online Subgradient Descent (OSD) work on the same problem? First, it is important to realize that nothing prevents us from using OSD on this problem. We just have to implement the Euclidean projection onto the simplex. The regret bound we would get from OSD is

where we set and for any . Assuming , we have that in the worst case . Hence, we can set to obtain a regret of . Hence, in a worst-case sense, using an entropic distance generating function transforms the dependency on the dimension from to for Online Convex Optimization (OCO) over the probability simplex.

So, as we already saw when analyzing AdaGrad, the shape of the domain is the important ingredient when we move from Euclidean norms to other norms.

**2. -norm Algorithms **

Consider the distance generating function , for over . Let’s remind the reader that the -norm of a vector is defined as . We already proved that , where , so that . Let’s calculate the dual maps: and . Hence, we can write the update rule as

where we broke the update into two steps to simplify the notation (and the implementation). Starting from , we have that
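The two dual maps can be written explicitly and checked numerically: with q = p/(p-1), the gradient of ψ(x) = ½||x||_p² and the gradient of its conjugate (the same formula with q in place of p) are inverse maps. A sketch, where the specific formulas are the standard p-norm computations and should be taken as assumptions:

```python
import numpy as np

def grad_psi(x, p):
    """Gradient of psi(x) = 0.5*||x||_p^2 (the primal-to-dual map)."""
    return np.sign(x) * np.abs(x) ** (p - 1) * np.linalg.norm(x, p) ** (2 - p)

def grad_psi_star(theta, p):
    """Gradient of the conjugate: the same formula with the dual exponent q."""
    q = p / (p - 1)
    return np.sign(theta) * np.abs(theta) ** (q - 1) * np.linalg.norm(theta, q) ** (2 - q)
```

A single mirror step is then `grad_psi_star(grad_psi(x, p) - eta * g, p)`, matching the two-step form above; for p = 2 both maps reduce to the identity and we recover OSD.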

The last ingredient is the fact that is strongly convex with respect to .

Lemma 2 (S. Shalev-Shwartz, 2007, Lemma 17) is -strongly convex with respect to , for .

Hence, the regret bound will be

Setting , we get the (unprojected) Online Subgradient Descent. However, we can set to achieve a logarithmic dependency on the dimension, as in EG. Let’s assume again that , so we have

Also, note that , so we have an upper bound to the regret of

Setting , we get an upper bound to the regret of

Assuming , the choice of that minimizes the last term is , which makes the term . Hence, we have a regret bound of the order of .

So, the -norm allows us to interpolate between the behavior of OSD and the one of EG. Note that here the set is the entire space; however, we could still set . While this would allow us to get the same asymptotic bound as EG, the update would no longer be in a closed form.

**3. Learning with Expert Advice **

Let’s introduce a particular Online Convex Optimization game called *Learning with Expert Advice*.

In this setting, we have experts that give us advice in each round. In turn, in each round we have to decide which expert we want to follow. After we make our choice, the losses associated with each expert are revealed and we pay the loss associated with the expert we picked. The aim of the game is to minimize our cumulative loss compared to the cumulative loss of the best expert. This is a general setting that allows us to model many interesting cases. For example, we may have a number of different online learning algorithms and we would like to be close to the best among them.

Is this problem solvable? If we put ourselves in the adversarial setting, unfortunately it cannot be solved! Indeed, even with 2 experts, the adversary can force linear regret on us. Let’s see how. In each round we have to pick expert 1 or expert 2. In each round, the adversary can decide that the expert we pick has loss 1 and the other one has loss 0. This means that the cumulative loss of the algorithm over T rounds is T. On the other hand, the best between the two experts has cumulative loss less than T/2. This means that our regret, no matter what we do, can be as big as T/2.

The problem above is due to the fact that the adversary has too much power. One way to reduce its power is using *randomization*. We can allow the algorithm to be randomized *and* force the adversary to decide the losses at time without knowing the outcome of the randomization of the algorithm at time (but it can depend on the past randomization). This is enough to make the problem solvable. Moreover, it will also make the problem convex, allowing us to use any OCO algorithm on it.

First, let’s write the problem in the original formulation. We set a discrete feasible set , where is the vector with all zeros except a 1 in coordinate . Our predictions and the competitor are from . The losses are linear losses: , where we assume , for and . The regret is

The only thing that makes this problem non-convex is the feasible set, which is clearly non-convex.

Let’s now see how the randomization makes this problem convex. Let’s extend the feasible set to . Note that . For this problem we can use an OCO algorithm to minimize the regret

Can we find a way to transform an upper bound on this regret into the one we care about in (1)? One way is the following: on each time step, construct a random variable that is equal to with probability for . Then, select the expert according to the outcome of . Now, in expectation we have

and

This means that we can minimize in expectation the non-convex regret with a randomized OCO algorithm. We can summarize this reasoning in Algorithm 2.
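Here is a minimal sketch of this scheme (the randomized reduction in Algorithm 2), instantiated with an EG/Hedge-style multiplicative update as the underlying OCO algorithm; all names and the loss range are my assumptions:

```python
import numpy as np

def experts_game(losses, eta, seed=0):
    """Randomized learning with expert advice.

    losses: (T, d) array; losses[t, i] is the loss of expert i at
    round t, assumed to lie in [0, 1].
    Returns the realized cumulative loss of the algorithm and the
    final distribution over experts.
    """
    rng = np.random.default_rng(seed)
    T, d = losses.shape
    x = np.full(d, 1.0 / d)          # start from the uniform distribution
    total = 0.0
    for g in losses:
        i = rng.choice(d, p=x)       # sample an expert according to x_t
        total += g[i]                # pay the loss of the sampled expert
        x = x * np.exp(-eta * g)     # multiplicative (EG-style) update
        x /= x.sum()                 # stay on the simplex
    return total, x
```

In expectation over the sampling, the loss paid equals the linear loss of the distribution, which is the key identity behind the reduction.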

For example, if we use EG as the OCO algorithm, setting , then we obtain the following update rule

and the regret will be

It is worth stressing the importance of the result just obtained: We can design an algorithm that in expectation is close to the best expert in a set, *paying only a logarithmic penalty in the size of the set*.

In the future, we will see algorithms that achieve the even better regret guarantee of , for any in the simplex. You should be able to convince yourself that no setting of in EG allows us to achieve such a regret guarantee. Indeed, the algorithm will be based on a very different strategy.

**4. History Bits **

The EG algorithm was introduced by (Kivinen, J. and Warmuth, M., 1997), but not as a specific instantiation of OMD. I am actually not sure when people first pointed out the link between EG and OMD. Do you know something about this? If yes, please let me know!

The trick to set is from (Gentile, C. and Littlestone, N., 1999) (online learning) and apparently rediscovered in (Ben-Tal, A. and Margalit, T. and Nemirovski, A., 2001) (optimization). The learning with expert advice setting was introduced in (Littlestone, N. and Warmuth, M. K., 1994) and (Vovk, V. G., 1990). The ideas in Algorithm 3 are based on the Multiplicative Weights algorithm (Littlestone, N. and Warmuth, M. K., 1994) and the Hedge algorithm (Freund, Y. and Schapire, R. E., 1997). By now, the literature on learning with expert advice is huge, with tons of variations on algorithms and settings.

**5. Exercises **

Exercise 1 Derive the EG update rule and regret bound in the case that the algorithm starts from an arbitrary vector in the probability simplex.

**6. Appendix **

Here we find the conjugate of the negative entropy.

This is a constrained optimization problem; we can solve it using the KKT conditions. We first rewrite it as a minimization problem

The KKT conditions are, for ,

From the first one we get . Using the third one we have . Then, from the complementarity condition (the fourth one), we get . Putting it all together, we have .

Denoting with , and substituting in the definition of the conjugate function we get
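This derivation can be sanity-checked numerically: with the negative entropy restricted to the simplex, the conjugate evaluates to the log-sum-exp function, attained at the softmax vector. A small check, assuming this standard form of the result:

```python
import numpy as np

def neg_entropy(x):
    # psi(x) = sum_i x_i log x_i, with the convention 0 log 0 = 0
    return float(np.sum(np.where(x > 0, x * np.log(np.where(x > 0, x, 1.0)), 0.0)))

def logsumexp(theta):
    m = np.max(theta)
    return float(m + np.log(np.sum(np.exp(theta - m))))

theta = np.array([0.3, -1.2, 2.0])
x_star = np.exp(theta - logsumexp(theta))      # softmax, the claimed maximizer
value = float(theta @ x_star - neg_entropy(x_star))
# value coincides with log sum_i exp(theta_i)
```

Evaluating the objective at any other simplex point gives a smaller value, consistent with x_star being the maximizer.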


**1. Online Mirror Descent **

Last time we introduced the Online Mirror Descent (OMD) algorithm, in Algorithm 1. We also said that we need one of these two assumptions to hold.

We also proved the following Lemma.

Lemma 1 Let the Bregman divergence w.r.t. and assume to be -strongly convex with respect to in . Let a closed convex set. Set . Assume (1) or (2) hold. Then, and with the notation in Algorithm 1, the following inequality holds

Today, we will finally prove a regret bound for OMD.

Theorem 2 Set such that is differentiable in . Assume . Then, under the assumptions of Lemma 1 and , the following regret bounds hold

Moreover, if is constant, i.e. , we have

*Proof:* Fix . As in the proof of OGD, dividing the inequality in Lemma 1 by and summing from , we get

where we denoted by .

The second statement is left as an exercise.

In words, OMD allows us to prove regret guarantees that depend on an arbitrary pair of dual norms and . In particular, the primal norm will be used to measure the feasible set or the distance between the competitor and the initial point, and the dual norm will be used to measure the gradients. If you happen to know something about these quantities, you can choose the most appropriate pair of norms to guarantee a small regret. The only thing you need is a function that is strongly convex with respect to the primal norm you have chosen.

Overall, the regret bound is still of the order of for Lipschitz functions; the only difference is that now the Lipschitz constant is measured with a different norm. Also, everything we did for Online Subgradient Descent (OSD) can be trivially used here. So, for example, we can use a stepsize of the form

to achieve a regret bound of .

Next time, we will see practical examples of OMD that guarantee strictly better regret than OSD. As we did in the case of AdaGrad, the better guarantee will depend on the shape of the domain and the characteristics of the subgradients.

Instead, now we see the meaning of the “Mirror”.

**2. The “Mirror” Interpretation **

First, we need a couple of convex analysis results.

When we introduced the Fenchel conjugate, we said that iff , which in words means that in the sense of multivalued mappings. Now, we state a stronger result for the case that the function is strongly convex.

Theorem 3 Let be a proper, convex, and closed function, strongly convex w.r.t. . Then,

- is finite everywhere and differentiable.
- is -smooth w.r.t. .

We will also use the following optimality condition.

Theorem 4 Let proper. Then iff .

Hence, we can state the following theorem.

Theorem 5 Let the Bregman divergence w.r.t. , where is strongly convex. Let a non-empty closed convex set such that . Then

*Proof:* Consider the update rule in Algorithm 1 and let’s see

Now, we want to use the first order optimality condition, so we have to use a little bit of subdifferential calculus. Given that , by the subdifferential calculus theorem we saw, we have . So, we have

that is

or equivalently

Using the fact that is -strongly convex, we have that . Hence

Noting that for vectors in , we have the stated bound.

Let’s explain what this theorem says. We said that Online Mirror Descent extends the Online Subgradient Descent method to non-Euclidean norms. Hence, the regret bound we proved contains dual norms, which measure the iterates and the gradients. We also said that it makes sense to use a dual norm to measure a gradient, because it is a natural way to measure how “big” the linear functional is. More precisely, gradients actually live in the *dual space*, that is, in a different space from the predictions . Hence, we cannot sum iterates and gradients together, in the same way in which we cannot sum pears and apples together. So, why were we doing it in OSD? The reason is that in that case the dual space coincides with the primal space. But it is a very particular case, due to the fact that we used the L2 norm. Instead, in the general case, iterates and gradients live in two different spaces.

So, in OMD we need a way to go from one space to the other. And this is exactly the role of and , that are called *duality mappings*. We can now understand that the theorem tells us that OMD takes the primal vector , transforms it into a dual vector through , does a subgradient descent step in the dual space, and finally transforms the vector back to the primal space through . This reasoning is summarized in Figure 1.

Example 1 Let equal to where . Then,

Solving the constrained optimization problem, we have . Hence, we have

that is finite everywhere and differentiable. So, we have and

So, using (3), we obtain exactly the update of projected online subgradient descent.

**3. Yet Another Way to Write the OMD Update **

There exists yet another way to write the update of OMD. This third method uses the concept of *Bregman projections*. Extending the definition of Euclidean projections, we can define the projection with respect to a Bregman divergence. Let be defined by

In the online learning literature, the OMD algorithm is typically presented with a two-step update: first, solving the argmin over the entire space, and then projecting back onto with respect to the Bregman divergence. In the following, we show that most of the time the two-step update is equivalent to the one-step update in (3).

First, we prove a general theorem that allows us to break the constrained minimization of a function into a minimization over the entire space plus a Bregman projection step.

Theorem 6 Let proper, closed, strictly convex, and differentiable in . Also, let a non-empty, closed convex set with and assume that exists and . Denote by . Then the following hold:

- exists and is unique.
- .

*Proof:* For the first point, from (Bauschke, H. H. and Combettes, P. L., 2011, Proposition 11.12) and the existence of , we have that is coercive. So, from (Bauschke, H. H. and Combettes, P. L., 2011, Proposition 11.14), the minimizer of in exists. Given that is strictly convex, the minimizer must be unique too.

For the second point, from the definition of , we have . On the other hand, from the first-order optimality condition, we have . So, we have

that is . Given that is strictly convex, .

Now, note that, if , then

Also, defining is equal to for some and . Hence, under the assumption of the above theorem, we have that is equivalent to

The advantage of this update is that sometimes it gives two easier problems to solve rather than a single difficult one.
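As an illustration of how the two-step update can give two easy problems, consider (purely as an example, with names of my choosing) the negative entropy on the probability simplex: the unconstrained argmin has a closed-form multiplicative update, and the Bregman (KL) projection onto the simplex reduces to a normalization:

```python
import numpy as np

def omd_entropic_step(x, g, eta):
    # Step 1: unconstrained minimization with psi(x) = sum_i x_i log x_i
    # gives the closed form x_i * exp(-eta * g_i).
    x_tilde = x * np.exp(-eta * g)
    # Step 2: the KL projection onto the simplex is just a normalization.
    return x_tilde / np.sum(x_tilde)
```

This recovers the EG update: two trivial subproblems instead of one constrained optimization.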

**4. History Bits **

Most of the online learning literature for OMD assumes to be *Legendre* (Cesa-Bianchi, N. and Lugosi, G., 2006, e.g.), which corresponds to assuming (1). This condition allows one to prove that . However, it turns out that the Legendre condition is not necessary and we only need the function to be differentiable on the predictions . In fact, we only need one of the two conditions in (1) or (2) to hold. Removing the Legendre assumption makes it easier to use OMD with different combinations of feasibility sets/Bregman divergences. So, I didn’t introduce the concept of Legendre functions at all, relying instead on (a minor modification of) OMD as described by (Beck, A. and Teboulle, M., 2003).

**5. Exercises **

Exercise 1 Find the conjugate function of defined over .


Exercise 2 Generalize the concept of strong convexity to Bregman functions, instead of norms, and prove a logarithmic regret guarantee for such functions using OMD.


In this lecture, we will introduce the Online Mirror Descent (OMD) algorithm. To explain its genesis, I think it is essential to understand what subgradients do. In particular, the negative subgradients are not always pointing towards a direction that minimizes the function. We already discussed this problem in a previous blog post, but copy&paste is free so I’ll repeat the important bits here.

**1. Subgradients are not Informative **

You know that in online learning we receive a sequence of loss functions and we have to output a vector before observing the loss function on which we will be evaluated. However, we can gain a lot of intuition if we consider the easy case in which the sequence of loss functions is always the same fixed function, i.e., . If our hypothetical online algorithm does not work in this situation, it surely won’t work in the more general case.

That said, we proved that the convergence of Online Subgradient Descent (OSD) depends on the following key property of the subgradients:

In words, to minimize the left hand side of this equation, it is enough to minimize the right hand side, which is nothing else than the instantaneous linear regret on the linear function . This is the only reason why OSD works! However, I am sure you have heard a million times the (wrong) intuition that the gradient points towards the minimum, and you might be tempted to think that the same (even more wrong) intuition holds for subgradients. Indeed, I am sure that even if we proved the regret guarantee based on (1), in the back of your mind you keep thinking “yeah, sure, it works because the subgradient tells me where to go to minimize the function”. Typically this idea is so strong that I have to present explicit counterexamples to fully convince a person.

So, take a look at the following examples that illustrate the fact that a subgradient does not always point in a direction where the function decreases.

Let , see Figure 1. The vector is a subgradient in of . No matter how we choose the stepsize, moving in the direction of the negative subgradient will not decrease the objective function. An even more extreme example is in Figure 2, with the function . Here, in the point , any positive step in the direction of the negative subgradient will *increase* the objective function.

**2. Reinterpreting the Online Subgradient Descent Algorithm **

How does Online Subgradient Descent work? It works exactly as I told you before: thanks to (1). But what does that inequality really mean?

A way to understand how the OSD algorithm works is to think that it minimizes a local approximation of the original objective function. This is not unusual for optimization algorithms; for example, Newton’s algorithm constructs an approximation with a Taylor expansion truncated to the second term. Thanks to the definition of subgradients, we can immediately build a linear lower bound to a function around :

So, in our setting, this would mean that we update the online algorithm with the minimizer of a linear approximation of the loss function we received. Unfortunately, minimizing a linear function is unlikely to give us a good online algorithm. Indeed, over unbounded domains the minimum of a linear function is minus infinity.

So, let’s introduce the other key concept: we constrain the minimization of this lower bound to a neighborhood of , where we have good reason to believe that the approximation is more precise. Coding the neighborhood constraint with an L2 squared distance from less than some positive number , we might think to use the following update

Equivalently, for some , we can consider the unconstrained formulation

This is a well-defined update scheme, that hopefully moves closer to the optimum of . See Figure 2 for a graphical representation in one-dimension.

And now the final element of our story: the argmin in (2) is exactly the update we used in OSD! Indeed, solving the argmin and completing the square, we get

where is the Euclidean projection onto , i.e. .
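We can verify numerically, in one dimension, that the argmin form and the projected-step form coincide. This is a brute-force check over a grid; the interval and the numbers are illustrative choices of mine:

```python
import numpy as np

def prox_form(x, g, eta, lo, hi, num=200001):
    # argmin_y  g*y + (y - x)^2 / (2*eta)  over the interval [lo, hi],
    # approximated by brute force on a fine grid
    ys = np.linspace(lo, hi, num)
    return float(ys[np.argmin(g * ys + (ys - x) ** 2 / (2.0 * eta))])

def projected_step(x, g, eta, lo, hi):
    # Proj_[lo, hi](x - eta*g): the closed form after completing the square
    return float(np.clip(x - eta * g, lo, hi))
```

Both the interior case and the case where the gradient step leaves the interval (so the projection clips) give the same answer, as the completion-of-the-square argument predicts.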

The new way to write the update of OSD in (2) will be the core ingredient for designing Online Mirror Descent. In fact, OMD is a strict generalization of that update when we use a different way to measure the locality of from . That is, we measured the distance between and the current point with the squared L2 norm. What happens if we change the norm? Do we even have to use a norm?

To answer these questions we have to introduce another useful mathematical object: the *Bregman divergence*.

**3. Convex Analysis Bits: Bregman Divergence **

We first give a new definition, a slightly stronger notion of convexity.

Definition 1 Let and a convex set. is *strictly convex* if

From the definition, it is immediate to see that strong convexity w.r.t. any norm implies strict convexity. Note that for a differentiable function, strict convexity also implies that for (Bauschke, H. H. and Combettes, P. L., 2011, Proposition 17.13).

We now define our new notion of “distance”.

Definition 2 Let be strictly convex and continuously differentiable on . The *Bregman divergence* w.r.t. is defined as

From the definition, we see that the Bregman divergence is always non-negative for , from the convexity of . However, something stronger holds. By the strict convexity of , fixing a point we have that , with equality only for . Hence, the strict convexity allows us to use the Bregman divergence as a similarity measure between and . Moreover, this similarity measure *changes* with the reference point . This also implies that, as you can see from the definition, the Bregman divergence is not symmetric.

Let me give you some more intuition on the concept of Bregman divergence. Consider the case that is twice differentiable in a ball around and . So, by Taylor’s theorem, there exists such that

where . Hence, we are using a metric that depends on the Hessian of . *Different areas of the space will have a different value of the Hessian, and so the Bregman divergence will behave differently*.

We can also lower bound the Bregman divergence if the function is strongly convex. In particular, if is -strongly convex w.r.t. a norm in , then we have

Example 2 If , then .

Example 3 If and , then , that is, the Kullback-Leibler divergence between the discrete distributions and .
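Both examples can be checked directly from the definition of the Bregman divergence (a small sketch; the helper names are mine):

```python
import numpy as np

def bregman(psi, grad_psi, x, y):
    # B_psi(x; y) = psi(x) - psi(y) - <grad psi(y), x - y>
    return float(psi(x) - psi(y) - grad_psi(y) @ (x - y))

# Example 2: psi(x) = (1/2)||x||_2^2 gives (1/2)||x - y||_2^2
psi_l2 = lambda x: 0.5 * x @ x
grad_l2 = lambda x: x

# Example 3: the negative entropy gives the KL divergence on the simplex
psi_ne = lambda x: np.sum(x * np.log(x))
grad_ne = lambda x: np.log(x) + 1.0
```

For the entropic case, the linear terms cancel exactly because both points sum to 1, leaving the KL divergence.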

We also have the following lemma that links the Bregman divergences between 3 points.

Lemma 3 (Chen, G. and Teboulle, M., 1993) Let the Bregman divergence w.r.t. . Then, for any three points and , the following identity holds

**4. Online Mirror Descent **

Based on what we said before, we can start from the equivalent formulation of the OSD update

and we can change the last term with another measure of distance. In particular, using the Bregman divergence w.r.t. a function , we have

These two updates are exactly the same when .

So we get the Online Mirror Descent algorithm in Algorithm 4.

However, without an additional assumption, this algorithm has a problem. Can you see it? The problem is that might be on the boundary of and in the next step we would have to evaluate for a point on the boundary of . Given that , we might end up on the boundary of , where the Bregman divergence is not defined!

To fix this problem, we need either one of the following assumptions

If either of these conditions is true, the update is well-defined in all rounds (proof left as an exercise).

Now we have a well-defined algorithm, but does it guarantee a sublinear regret? We know that in at least one case it recovers the OSD algorithm, which does work. So, from an intuitive point of view, how well the algorithm works should depend on some characteristic of . In particular, a key property will be the *strong convexity* of .

To analyze OMD, we first prove a one step relationship, similar to the one we proved for Online Gradient Descent and OSD. Note how in this Lemma, we will use a lot of the concepts we introduced till now: strong convexity, dual norms, subgradients, Fenchel-Young inequality, etc. In a way, over the past lectures I slowly prepared you to be able to prove this lemma.

Lemma 4 Let the Bregman divergence w.r.t. and assume to be -strongly convex with respect to in . Let a convex set. Set . Assume (4) or (5) hold. Then, and with the notation in Algorithm 4, the following inequality holds

*Proof:* From the optimality condition for the update of OMD, we have

From the definition of subgradient, we have that

where in the second inequality we used (6), in the second equality we used Lemma 3, in the third inequality we used , and in the last inequality we used (3) because is -strongly convex w.r.t. .

The lower bound with the function values is due, as usual, to the definition of subgradients.

Next time, we will see how to use this one step relationship to prove a regret bound, that will finally show us if and when this entire construction is a good idea. In fact, it is worth stressing that *the above motivation is not enough in any way to justify the existence of the OMD algorithm*. Also, next time we will explain why the algorithm is called Online “Mirror” Descent.

**5. Exercises **

Exercise 1 Prove that the defined in Example 3 is 1-strongly convex w.r.t. the L1 norm.

Exercise 2 Derive a closed form update for OMD when using the of Example 3 and .

**6. History Bits **

Mirror Descent (MD) was introduced by (Nemirovsky, A.S. and Yudin, D., 1983) in the *batch* setting. The description of MD with Bregman divergence that I described here (with minor changes) was done by (Beck, A. and Teboulle, M., 2003). The minor changes are in decoupling the domain of from the feasibility set . This allows to use functions that do not satisfy the condition (4) but they satisfy (5). In the online setting, the mirror descent scheme was used for the first time by (Warmuth, M. K. and Jagota, A. K., 1997).


In this lecture we will present some lower bounds for online linear optimization (OLO). Remembering that linear losses are convex, this immediately gives us lower bounds for online convex optimization (OCO). We will consider both the constrained and the unconstrained case. The lower bounds are important because they tell us what the optimal algorithms are and where the gaps in our knowledge lie.

**1. Lower bounds for Bounded OLO **

We will first consider the bounded constrained case. Finding a lower bound amounts to finding a strategy for the adversary that forces a certain regret onto the algorithm, *no matter what the algorithm does*. We will use the probabilistic method to construct our lower bound.

The basic method relies on the fact that for any random variable with domain , and any function

For us, it means that we can lower bound the effect of the worst-case sequence of functions with an expectation over any distribution over the functions. If the distribution is chosen wisely, we can expect the gap in the inequality to be small. Why do we rely on expectations rather than actually constructing an adversarial sequence? Because the use of stochastic loss functions makes it very easy to deal with arbitrary algorithms. In particular, we will choose stochastic loss functions that make the expected loss of the algorithm 0, independently of the strategy of the algorithm.

Theorem 1 Let be any non-empty bounded closed convex subset. Let be the diameter of . Let be any (possibly randomized) algorithm for OLO on . Let be any non-negative integer. There exists a sequence of vectors with and such that the regret of algorithm satisfies

*Proof:* Let’s denote by . Let such that . Let , so that . Let be i.i.d. Rademacher random variables, that is and set the losses .

where we used the fact that and that the are independent in the first equality, in the fourth equality, and the Khintchine inequality in the last inequality.

We see that the lower bound is within a constant multiplicative factor of the upper bound we proved for Online Subgradient Descent (OSD) with learning rates or . This means that OSD is asymptotically optimal with both settings of the learning rate.

At this point there is an important consideration to make: How can this be the optimal regret when we managed to prove better regret, for example with adaptive learning rates? The subtlety is that, constraining the adversary to play -Lipschitz losses, the adversary can always force on the algorithm at least the regret in Theorem 1. However, we can design algorithms that take advantage of *suboptimal plays of the adversary*. Indeed, for example, if the adversary plays in a way such that all the subgradients have the same norm equal to , there is nothing to adapt to!

We now move to the unconstrained case; however, first we have to enrich our math toolbox with another essential tool: *Fenchel conjugates*.

**2. Convex Analysis Bits: Fenchel Conjugate **

Definition 2 A function is *closed* iff is closed for every .

Note that in a Hausdorff space a function is closed iff it is lower semicontinuous (Bauschke, H. H. and Combettes, P. L., 2011, Lemma 1.24).

Example 1 The indicator function of a set , is closed iff is closed.

Definition 3 For a function , we define the *Fenchel conjugate* as

From the definition we immediately obtain the Fenchel’s inequality

We have the following useful properties of the Fenchel conjugate.

Theorem 4 (Rockafellar, R. T., 1970, Corollary 23.5.1 and Theorem 23.5) Let be convex, proper, and closed. Then

- iff .
- achieves its supremum in at iff .

Example 2 Let , hence we have . Solving the optimization, we have that if and , , and for .

Example 3 Consider the function , where is a norm in , with dual norm . We can show that its conjugate is . From

for all . The right hand side is a quadratic function of , which has maximum value . Therefore for all , we have

which shows that . To show the other inequality, let be any vector with , scaled so that . Then we have, for this ,

which shows that .
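This conjugate can also be checked numerically. For the Euclidean norm, the dual norm is again the Euclidean norm, so the conjugate of half the squared norm should be half the squared norm of the argument. A brute-force check over a grid (purely illustrative, names mine):

```python
import numpy as np

def conjugate_numeric(f, theta, points):
    # f*(theta) = sup_x <theta, x> - f(x), approximated over sample points
    return max(float(theta @ x - f(x)) for x in points)

f = lambda x: 0.5 * x @ x                        # (1/2)||x||_2^2
theta = np.array([0.7, -0.3])
grid = [np.array([a, b])
        for a in np.linspace(-3, 3, 121)
        for b in np.linspace(-3, 3, 121)]
approx = conjugate_numeric(f, theta, grid)       # close to 0.5 * ||theta||_2^2
```

The supremum is attained at x equal to theta itself, which lies on the grid, so the approximation is essentially exact here.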

Lemma 5 Let be a function and let be its Fenchel conjugate. For and , the Fenchel conjugate of is .

*Proof:* From the definition of conjugate function, we have

Lemma 6 (Bauschke, H. H. and Combettes, P. L., 2011, Example 13.7) Let even. Then .

**3. Unconstrained OLO **

The above lower bound applies only to the constrained setting. In the unconstrained setting, we proved that OSD with and a constant learning rate of gives a regret of for any . Is this regret optimal? It is clear that the regret must be at least linear in , but is a linear dependency enough?

The approach I will follow is to *reduce the OLO game to the online game of betting on a coin*, where the lower bounds are known. So, let’s introduce the coin-betting online game:

- Start with an initial amount of money .
- In each round, the algorithm bets a fraction of its current wealth on the outcome of a coin.
- The outcome of the coin is revealed and the algorithm wins or loses its bet, 1 to 1.

The aim of this online game is to win as much money as possible. Also, as in all the online games we consider, we do not assume anything about how the outcomes of the coin are decided. Note that this game can also be written as OCO using the log loss.

We will denote by the outcomes of the coin. We will use the absolute value of to denote the fraction of money to bet and its sign to denote which side we are betting on. The money the algorithm has won from the beginning of the game till the end of round will be denoted by , and given that the money is won or lost 1 to 1, we have

where we used the fact that . We will also denote by the bet of the algorithm on round .
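The wealth recursion just described can be written directly (a minimal sketch):

```python
def wealth(bets, coins, initial=1.0):
    """Wealth after betting the signed fractions `bets` of the current
    wealth on coin outcomes `coins` in {-1, +1}, paid 1 to 1:
    Wealth_t = Wealth_{t-1} * (1 + beta_t * c_t)."""
    w = initial
    for beta, c in zip(bets, coins):
        assert -1.0 <= beta <= 1.0 and c in (-1, 1)
        w *= 1.0 + beta * c
    return w
```

Betting everything on the correct side each round doubles the money every round: `wealth([1, 1, 1], [1, 1, 1])` gives 8.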

If we got all the outcomes of the coin correct, we would double our money in each round, so that . However, given the adversarial nature of the game, we can actually prove a strong upper bound on the maximum wealth we can gain.

Theorem 7 (Cesa-Bianchi, N. and Lugosi, G., 2006, a simplified statement of Theorem 9.2) Let . Then, for any betting strategy with initial money that bets fractions of its current wealth, there exists a sequence of coin outcomes , such that

Now, let’s connect the coin-betting game with OLO. Remember that proving a regret guarantee in OLO consists in showing

for some function , where we want the dependency on to be sublinear. Using our newly learned concept of the Fenchel conjugate, this is equivalent to proving that

Hence, for a given online algorithm we can prove regret bounds by showing that there exists a function , or equivalently by finding its conjugate . Similarly, proving a lower bound in unconstrained OLO means finding a sequence of and a function , or a function , that lower bound the regret or the cumulative losses of the algorithm, respectively.

Without any other information, it can be challenging to guess what is the slowest-increasing function . So, we restrict our attention to online algorithms that guarantee a constant regret against the zero vector. This immediately implies the following important consequence.

Theorem 8 Let a non-decreasing function of the index of the rounds and an OLO algorithm that guarantees for any sequence of with . Then, there exists such that and for .

*Proof:* Define the “reward” of the algorithm. So, we have

Since we assumed that , we always have . Using this, we claim that for all . To see this, assume that there is a sequence that gives . We then set . For this sequence, we would have , which contradicts the observation that .

So, from the fact that we have that there exists such that for a and .

This theorem informs us of something important: *any OLO algorithm that suffers a non-decreasing regret against the null competitor must predict in the form of a “vectorial” coin-betting algorithm*. This immediately implies the following.

Theorem 9 Let . For any OLO algorithm, under the assumptions of Theorem 8, there exists a sequence of with and , such that

*Proof:* The proof works by reducing the OLO game to a coin-betting game and then using the upper bound on the reward of coin-betting games.

First, set , where will be defined in the following, so that for any . Given Theorem 8, we have that the first coordinate of has to satisfy

for some such that . Hence, the above is nothing else than a coin-betting algorithm that bets money on the outcome of a coin , with initial money . This means that the upper bound to its reward in Theorem 7 applies: there exists a sequence of such that

where is the Fenchel conjugate of the function and by Theorem 4 part 2. Using the closed form solution for the Fenchel conjugate as well as its lower bound from (Orabona, F. and Pal, D., 2016) and reordering the terms, we get the stated bound.

From the above theorem we have that OSD with learning rate does not have the optimal dependency on for any .

In the future classes, we will see that the connection between coin-betting and OLO can also be used to design OLO algorithm. This will give us *optimal unconstrained OLO algorithms with the surprising property of not requiring a learning rate at all*.

**4. History Bits **

The lower bound for OCO is quite standard; the proof presented is a simplified version of the one in (F. Orabona and D. Pál, 2018).

On the other hand, both the online learning literature and the optimization one almost ignored the issue of lower bounds for the unconstrained case. The connection between coin betting and OLO was first unveiled in (Orabona, F. and Pal, D., 2016). Theorem 8 is an unpublished result by Ashok Cutkosky (Thanks for allowing me to use it here!), that proved similar and more general results in his PhD thesis (Cutkosky, A., 2018). Theorem 9 is new, by me. (McMahan, B. and Abernethy, J., 2013) implicitly also proposed using the conjugate function for lower bound in unconstrained OLO.

There is a caveat in the unconstrained lower bound: A stronger statement would be to choose the norm of beforehand. To do this, we would have to explicitly construct the sequence of the . One way to do it is to use Rademacher coins and then leverage again the hypothesis on the regret against the null competitor. This route was used in (M. Streeter and B. McMahan, 2012), but the proof relied on assuming the value of the global optimum of a non-convex function with an infinite number of local minima. The correct proof avoiding that step was then given in (Orabona, F., 2013). Yet, the proof presented here, which converts reward upper bounds into regret lower bounds, is simpler in spirit and (I hope!) more understandable. Given that, as far as I know, this is the first time that unconstrained OLO lower bounds are taught in a class, I valued simplicity over generality.

**5. Exercises **

Exercise 1 Fix . Mimicking the proof of Theorem 1, prove that for any OCO algorithm there exists a and a sequence of loss functions such that where and the loss functions are -Lipschitz w.r.t. .

Exercise 2 Extend the proof of Theorem 1 to an arbitrary norm to measure the diameter of and with .

Exercise 3 Let be even. Prove that is even.


Exercise 4 Find the exact expression of the conjugate function of , for . Hint: Wolfram Alpha or any other kind of symbolic solver can be very useful for this type of problem.


In this lecture, we will explore a bit more under which conditions we can get better regret upper bounds than . Also, we will obtain these improved guarantees in an *automatic* way. That is, the algorithm will be *adaptive* to characteristics of the sequence of loss functions, without having to rely on information about the future.

**1. Adaptive Learning Rates for Online Subgradient Descent **

Consider the minimization of the linear regret

Using Online Subgradient Descent (OSD), we said that the regret for bounded domains can be upper bounded by

With a fixed learning rate, the learning rate that minimizes this upper bound on the regret is

Unfortunately, as we said, this learning rate cannot be used because it assumes knowledge of the future rounds. However, we might be lucky: we can try to approximate it in each round using only the knowledge up to time . That is, we might try to use

Observe that , so the first term of the regret would be exactly what we need! For the other term, the optimal learning rate would give us

Now let’s see what we obtain with our approximation.

We need a way to upper bound that sum. The way to treat these sums, as we did in other cases, is to try to approximate them with integrals. So, we can use the following very handy Lemma that generalizes a lot of similar specific ones.

*Proof:* Denote by .

Summing over , we have the stated bound.

Using this Lemma, we have that

Surprisingly, this term is only a factor of 2 worse than what we would have got from the optimal choice of . However, this learning rate can be computed without knowledge of the future and it can actually be used! Overall, with this choice we get

Note that it is possible to improve the constant in front of the bound to by multiplying the learning rates by . So, putting it all together we have the following theorem.

Theorem 2 Let be a closed non-empty convex set with diameter , i.e. . Let be an arbitrary sequence of convex functions subdifferentiable in open sets containing for . Pick any and . Then, , the following regret bound holds

The second equality in the theorem clearly shows the advantage of these learning rates: we obtain (almost) the same guarantee we would have gotten knowing the future gradients!
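To make the recipe concrete, here is a minimal sketch of projected OSD with adaptive stepsizes of the form eta_t = D / sqrt(2 * sum of squared gradient norms up to time t). The feasible set (an L2 ball), the loss, and all names (`adagrad_norm_osd`, `subgrad`, `project`) are my own illustrative choices, not part of the lecture.

```python
import numpy as np

def adagrad_norm_osd(subgrad, project, x0, D, T):
    """Projected OSD with the adaptive stepsize
    eta_t = D / sqrt(2 * sum_{i<=t} ||g_i||^2) (one common instantiation)."""
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    sq_sum = 0.0
    for t in range(T):
        g = subgrad(x, t)
        sq_sum += float(np.dot(g, g))
        eta = D / np.sqrt(2.0 * sq_sum) if sq_sum > 0 else 0.0
        x = project(x - eta * g)
        iterates.append(x.copy())
    return iterates

# Toy example: f_t(x) = |x[0] - 1|, feasible set = L2 ball of radius 1 (D = 2).
def subgrad(x, t):
    return np.array([np.sign(x[0] - 1.0)])

def project(x):
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

xs = adagrad_norm_osd(subgrad, project, np.zeros(1), D=2.0, T=500)
print(xs[-1])  # approaches the constrained minimizer x = 1 on the boundary
```

Note how no knowledge of T or of the future gradients is needed: the stepsize is computed on the fly from the observed gradients only.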

This is an interesting result on its own: it gives a principled way to set the learning rates with an almost optimal guarantee. However, there are also other consequences of this simple regret bound. First, we will specialize this result to the case that the losses are *smooth*.

**2. Convex Analysis Bits: Smooth functions **

We now consider a family of loss functions that have the characteristic of being lower bounded by the squared norm of the subgradient. We will also introduce the concept of *dual norms*. While dual norms are not strictly needed for this lecture, they give more generality and at the same time allow me to slowly introduce some of the concepts that will be needed for the lectures on Online Mirror Descent.

Definition 3 The dual norm of a norm is defined as .

Example 1 The dual norm of the L2 norm is the L2 norm itself. Indeed, by the Cauchy-Schwarz inequality. Also, set , so .

If you have never seen it before, the concept of dual norm can be a bit weird at first. One way to understand it is as a way to measure how “big” linear functionals are. For example, consider the linear function ; we want to understand how big it is. So, we can measure how big the output of the linear function is compared to its input , where is measured with some norm. Now, you can show that the above is equivalent to the dual norm of .

Remark 1 The definition of dual norm immediately implies .
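As a quick numerical sanity check of the generalized Cauchy-Schwarz (Hölder) inequality implied by the definition of dual norm, the snippet below verifies ⟨theta, x⟩ ≤ ||theta||_∞ ||x||_1 on random pairs; the choice of the L1/L∞ pair (whose duality is the subject of one of the exercises below) is my own.

```python
import numpy as np

# The dual norm of the L1 norm is the L-infinity norm, so we expect
#   <theta, x> <= ||theta||_inf * ||x||_1   for all theta, x.
rng = np.random.default_rng(0)
for _ in range(1000):
    theta = rng.standard_normal(5)
    x = rng.standard_normal(5)
    lhs = np.dot(theta, x)
    rhs = np.max(np.abs(theta)) * np.sum(np.abs(x))
    assert lhs <= rhs + 1e-12
print("Holder's inequality verified on random pairs")
```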

Now we can introduce smooth functions, using the dual norms defined above.

Definition 4 Let be differentiable. We say that is -smooth w.r.t. if for all .

Keeping in mind the intuition above on dual norms, taking the dual norm of a gradient makes sense if you associate each gradient with the linear functional , that is, the one needed to create a linear approximation of in .

Smooth functions have many properties, for example a smooth function can be upper bounded by a quadratic. However, in the following we will need the following property.

Theorem 5 (e.g. Srebro, N. and Sridharan, K. and Tewari, A., 2010, Lemma 4.1) Let be -smooth and bounded from below. Then, for all
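This is the so-called self-bounding property of smooth functions: for an L-smooth function bounded from below, the squared dual norm of the gradient is at most 2L times the suboptimality gap, i.e. ||∇f(x)||²_* ≤ 2L (f(x) − inf f). A quick numerical check on a one-dimensional quadratic (my own toy example, where the inequality holds with equality):

```python
import numpy as np

# f(x) = (a/2) x^2 is a-smooth and bounded below by 0, so we expect
#   grad(x)^2 <= 2 * a * (f(x) - inf f),  with inf f = 0 here.
a = 3.0
for x in np.linspace(-5, 5, 101):
    f = 0.5 * a * x**2
    grad = a * x
    assert grad**2 <= 2.0 * a * f + 1e-9
print("self-bounding property verified for the quadratic example")
```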

**3. bounds **

Assume now that the loss functions are bounded from below and smooth. Without loss of generality, we can assume that each of them is bounded from below by 0. Under these assumptions, from the regret in (2) and Theorem 5 we immediately obtain

This is an implicit bound, in the sense that appears on both sides of the inequality. To make it explicit, we will use the following simple Lemma (proof left as an exercise).

So, we have the following theorem

Theorem 7 Let be a closed non-empty convex set with diameter , i.e. . Let be an arbitrary sequence of non-negative convex functions -smooth in open sets containing for . Pick any and . Then, , the following regret bound holds

This regret guarantee is very interesting because in the worst case it is still of the order , but in the best case scenario it becomes a constant! In fact, if there exists a such that we get a constant regret. Basically, if the losses are “easy”, the algorithm *adapts* to this situation and gives us a better regret.

These kinds of guarantees are called bounds because they depend on the cumulative loss of the competitor, which can be denoted by .

**4. AdaGrad **

We now present another application of the regret bound in (2). **AdaGrad**, which stands for Adaptive Gradient, is an Online Convex Optimization algorithm proposed independently by (H. B. McMahan and M. J. Streeter, 2010) and (J. Duchi and E. Hazan and Y. Singer, 2010). It aims at being adaptive to the sequence of gradients. It is usually known as a stochastic optimization algorithm, but in reality it was proposed for Online Convex Optimization (OCO). To use it as a stochastic algorithm, you should use an online-to-batch conversion, otherwise you do not have any guarantee of convergence.

We will present a proof that only allows hyperrectangles as feasible sets ; on the other hand, this restriction makes the proof almost trivial. Let’s see how it works.

AdaGrad has two key ingredients:

- A coordinate-wise learning process;
- The adaptive learning rates in (1).

For the first ingredient, as we said, the regret of any OCO problem can be upper bounded by the regret of the Online Linear Optimization (OLO) problem. That is,

Now, the essential observation is to explicitly write the inner product as a sum of products over the single coordinates:

where we denoted by the regret of the 1-dimensional OLO problem over coordinate , that is . In words, *we can decompose the original regret as the sum of OLO regret minimization problems and we can try to focus on each one of them separately*.

A good candidate for the 1-dimensional problems is OSD with the learning rates in (1). We can specialize the regret in (2) to the 1-dimensional case for linear losses, so we get for each coordinate

This choice gives us the AdaGrad algorithm in Algorithm 1.
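A minimal sketch of these two ingredients, per-coordinate OSD with adaptive stepsizes and projection by clipping, is below. The function name `adagrad_box`, the bounds `lo`/`hi`, and the toy linear losses are my own illustrative assumptions; in particular, the projection onto a hyperrectangle is just per-coordinate clipping, which is what makes this case so simple.

```python
import numpy as np

def adagrad_box(subgrad, lo, hi, x0, T):
    """Diagonal AdaGrad on the hyperrectangle [lo, hi]: each coordinate
    runs a 1-d OSD with its own adaptive stepsize D_i / sqrt(2 * sum g_i^2)."""
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    D = hi - lo                       # per-coordinate diameters
    x = np.asarray(x0, dtype=float)
    sq = np.zeros_like(x)             # running sums of squared gradients
    for t in range(T):
        g = subgrad(x, t)
        sq += g * g
        eta = np.where(sq > 0, D / np.sqrt(2.0 * sq), 0.0)
        x = np.clip(x - eta * g, lo, hi)  # projection = coordinate-wise clipping
    return x

# Toy run: fixed linear losses <g, x> with badly scaled g = (1, -100);
# the per-coordinate stepsizes rescale both coordinates automatically.
x = adagrad_box(lambda x, t: np.array([1.0, -100.0]),
                lo=[-1, -1], hi=[1, 1], x0=np.zeros(2), T=200)
print(x)  # reaches the corner (-1, 1)
```

Note how the second coordinate, whose gradient is 100 times larger, takes steps of exactly the same effective size as the first: this is the coordinate-wise scale-freeness discussed below.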

Putting all together, we have immediately the following regret guarantee.

Theorem 8Let with diameters along each coordinate equal to . Let an arbitrary sequence of convex functions subdifferentiable in open sets containing for . Pick any and . Then, , the following regret bound holds

Is this a better regret bound compared to the one in Theorem 2? It depends! To compare the two, we have to consider that is a hyperrectangle, because the analysis of AdaGrad above only works for hyperrectangles. Then, we have to compare

From Cauchy-Schwarz, we have that . So, *assuming the same sequence of subgradients*, AdaGrad always has a better regret on hyperrectangles. Also, note that

So, in the case that is a hypercube we have and , and the bound of AdaGrad is between and times the bound of Theorem 2. In other words, if we are lucky with the subgradients, the particular **shape of the domain** might save us a factor of in the guarantee.

Note that the more general analysis of AdaGrad allows us to consider arbitrary domains, but it does not change the general message that the best domains for AdaGrad are hypercubes. We will explore this issue of choosing the online algorithm based on the shape of the feasible set when we introduce Online Mirror Descent.

Hidden in the guarantee is the biggest advantage of AdaGrad: the property of being coordinate-wise *scale-free*. That is, if each coordinate of the gradients is multiplied by a different constant, the learning rates will automatically scale them back. This property is masked by the fact that the optimal solution would also scale accordingly, while the diameters of the feasible set stay fixed. It might be useful when the ranges of the coordinates of the gradients are vastly different from one another. Indeed, this does happen in the stochastic optimization of deep neural networks, where the first layers have gradients of smaller magnitude than the last layers.

**5. History Bits **

The adaptive learning rate in (1) first appeared in (M. Streeter and H. B. McMahan, 2010). However, similar methods were used a long time before. Indeed, the key observation of approximating oracle quantities with estimates up to time was first used in the self-confident algorithms (P. Auer and N. Cesa-Bianchi and C. Gentile, 2002), where the learning rate is inversely proportional to the square root of the cumulative loss of the algorithm, and for smooth losses it implies bounds similar to the one in Theorem 7.

AdaGrad was proposed in basically identical form independently by two groups at the same conference: (H. B. McMahan and M. J. Streeter, 2010) and (J. Duchi and E. Hazan and Y. Singer, 2010). The analysis presented here is the one in (M. Streeter and H. B. McMahan, 2010), which does not handle generic feasible sets and does not support “full matrices”, i.e. full-matrix learning rates instead of diagonal ones. However, in machine learning applications AdaGrad is rarely used with a projection step (even if doing so provably destroys the worst-case performance (F. Orabona and D. Pál, 2018)). Also, in the adversarial setting full matrices do not seem to offer advantages in terms of regret compared to diagonal ones.

AdaGrad inspired an incredible number of clones, most of them with similar, worse, or no regret guarantees. The keyword “adaptive” itself has shifted its meaning over time. It used to denote the ability of an algorithm to obtain the same guarantee as if it knew in advance a particular property of the data (i.e. adaptive to the gradients/noise/scale = (almost) same performance as if it knew the gradients/noise/scale in advance). Indeed, in Statistics this keyword is used with the same meaning. Nowadays, it seems to denote any kind of coordinate-wise learning rates that do not guarantee anything in particular.

**6. Exercises **

Exercise 1 Prove that the dual norm of is , where and .

Exercise 2 Show that using online subgradient descent with the learning rates in (1) with Lipschitz, smooth, and strongly convex functions you can get bounds.


Exercise 3 Prove that the logistic loss , where and , is 1-smooth w.r.t. .


**1. Online-to-Batch Conversion **

Last time we saw the following theorem; let’s now prove it.

Theorem 1 Let , where the expectation is w.r.t. drawn from over some vector space and is convex in the first argument. Draw samples i.i.d. from and construct the sequence of losses , where the are deterministic. Run any OCO algorithm over the losses to construct the sequence of predictions . Then, we have

where the expectation is with respect to .

In fact, from the linearity of the expectation we have

Then, from the law of total expectation, we have

where we used the fact that and depend only on . Hence, (1) is proved.

It remains only to use Jensen’s inequality, using the fact that is convex, to have

Dividing the regret by and using the above inequalities gives the stated theorem.

In the example last time, we had to use a constant learning rate to be able to minimize the training error over the entire space . In the next one, we will see a different approach that allows us to use a varying learning rate without the need for a bounded feasible set.

Example 1 Consider the same setting of the previous example, and let’s change the way in which we construct the online losses. Now use and step size . Hence, we have

where we used .

I stressed the fact that the only meaningful way to define a regret is with respect to an arbitrary point in the feasible set. This is obvious in the case of unconstrained OLO, because the optimal competitor is unbounded. But it is also true in unconstrained OCO. Let’s see an example of this.

Example 2 Consider a problem of binary classification, with inputs and outputs . The loss function is the logistic loss: . Suppose that you want to minimize the training error over a training set of samples, . Also, assume the maximum L2 norm of the samples is . That is, we want to minimize

So, run the reduction described in Theorem 1 for iterations using OSD. In each iteration, construct by sampling a training point uniformly at random from to . Set and . We have that

In words, we will be away from the optimal value of the regularized empirical risk minimization problem, where the weight of the regularization is . Now, let’s consider the case that the training set is linearly separable. This means that the infimum of is 0, but the optimal solution does not exist, i.e., it has norm equal to infinity. So, any convergence guarantee that depends on would be vacuous. On the other hand, our guarantee above still makes perfect sense.

Note that the above examples only deal with training error. However, there is a more interesting application of the online-to-batch conversion, that is to directly minimize the generalization error.

Example 3 Consider the same setting of Example 1, but now consider the case that we want to minimize the true risk, that is

where . Our Theorem 1 still applies as is. Indeed, draw samples i.i.d. from and run OSD with losses . We obtain that

Note that the risk will be close to the best *regularized* solution, even if we didn’t use any regularizer! The presence of the regularizer is due to OSD, and we will be able to change it to other regularizers when we see Online Mirror Descent.

**2. Can We Do Better Than Regret? **

Let’s now go back to online convex optimization theory. The example in the first class showed us that it is possible to get logarithmic regret in time. However, we saw that we only get -regret with Online Subgradient Descent (OSD) on the same game. What is the reason? It turns out that the losses in the first game, on , are not just Lipschitz. They also possess some *curvature* that can be exploited to achieve a better regret. In a moment we will see that the only change we need in OSD is a different learning rate, dictated as usual by the regret analysis.

The key concept we will need is the one of *strong convexity*.

**3. Convex Analysis Bits: Strong Convexity **

Here, we introduce a stronger notion of convexity that allows us to build better lower bounds to a function. Instead of the linear lower bound achievable through the use of subgradients, we will make use of *quadratic* lower bounds.

Definition 2 Let . A proper function is -strongly convex over a convex set w.r.t. if

Remark 1 You might find another definition of strong convexity that resembles the one of convexity. However, it is easy to show that these two definitions are equivalent.

Note that any convex function is -strongly convex, by the definition of subgradient. Also, any -strongly convex function is also -strongly convex for any .

In words, the definition above tells us that a strongly convex function can be lower bounded by a quadratic, where the linear term is the usual one constructed through the subgradient, and the quadratic term depends on the strong convexity. Hence, we have a tighter lower bound to the function w.r.t. simply using convexity. This is what we would expect using a Taylor expansion on a twice-differentiable convex function and lower bounding the smallest eigenvalue of the Hessian. Indeed, we have the following Theorem

Theorem 3 (S. Shalev-Shwartz, 2007, Lemma 14) Let be convex and twice differentiable. Then, a sufficient condition for -strong convexity in w.r.t. is that for all we have , where is the Hessian matrix of at .

However, here there is the important difference that we do not assume the function to be twice differentiable. Indeed, we don’t even need plain differentiability. Hence, the use of the subgradient implies that this lower bound does not have to be uniquely determined, as in the next Example.

Example 4 Consider the strongly convex function . In Figure 1, we show two possible quadratic lower bounds to the function in .

We also have the following easy but useful property on the sum of strongly convex functions.

Theorem 4 Let be -strongly convex and be -strongly convex in a non-empty convex set w.r.t. . Then, is -strongly convex in w.r.t. .

*Proof:* Note that the assumption on gives us that the subdifferential set of the sum is equal to the sum of the subdifferential sets. Hence, the proof is immediate from the definition of strong convexity.

Example 5 Let . Using Theorem 3, we have that is 1-strongly convex w.r.t. in .

**4. Online Subgradient Descent for Strongly Convex Losses **

Theorem 5 Assume that the functions are -strongly convex w.r.t. over , where . Use OSD with stepsizes equal to . Then, for any , we have the following regret guarantee

*Proof:* From the assumption of -strong convexity of the functions , we have that

From the fact that , we have

Hence, use Lemma 2 in Lecture 3 and sum from , to obtain

Observing that the first sum on the left hand side is a telescopic sum, we have the stated bound.

Remark 2Notice that the theorem requires a bounded domain, otherwise the loss functions will not be Lipschitz given that they are also strongly convex.

Corollary 6 Under the assumptions of Theorem 5, if in addition we have and is -Lipschitz w.r.t. for , then we have

Remark 3 Corollary 6 does not imply that for any finite the regret will be smaller than using learning rates . Instead, asymptotically the regret in Corollary 6 is always better than the one of OSD with Lipschitz losses.

Example 6 Consider once again the example in the first class: . Note that the loss functions are -strongly convex w.r.t. . Hence, setting and gives a regret of .
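A minimal sketch of projected OSD with the stepsizes eta_t = 1/(mu * t) from Theorem 5 follows. The quadratic losses f_t(x) = (x - z_t)^2 on [0, 1] (which are 2-strongly convex w.r.t. the L2 norm) and the synthetic z_t stream are my own guess at a toy instance, not the exact data of the first-class example.

```python
import numpy as np

def osd_strongly_convex(subgrad, project, x0, mu, T):
    """Projected OSD with stepsizes eta_t = 1 / (mu * t)."""
    x = x0
    for t in range(1, T + 1):
        g = subgrad(x, t - 1)
        x = project(x - g / (mu * t))
    return x

# Toy instance: f_t(x) = (x - z_t)^2 on [0, 1], 2-strongly convex (mu = 2);
# the z_t are synthetic data of my own choosing.
rng = np.random.default_rng(0)
z = rng.uniform(size=2000)
x = osd_strongly_convex(lambda x, t: 2.0 * (x - z[t]),       # gradient of f_t
                        lambda v: min(max(v, 0.0), 1.0),     # projection on [0, 1]
                        x0=0.5, mu=2.0, T=len(z))
print(x, z.mean())  # the final iterate equals the running mean of the z_t
```

A short induction shows that on this instance the update x ← x(1 − 1/t) + z_t/t produces exactly the running mean of the observed z_t, which is also the Follow-the-Leader prediction (compare with Exercise 1 below).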

Let’s now use again the online-to-batch conversion on strongly convex stochastic problems.

Example 7 As done before, we can use the online-to-batch conversion together with Corollary 6 to obtain stochastic subgradient descent algorithms for strongly convex stochastic functions. For example, consider the classic Support Vector Machine objective

or any other regularized formulation like regularized logistic regression:

where , , and . First, notice that the minimizer of both expressions has to be in the L2 ball of radius proportional to (proof left as exercise). Hence, we can set equal to this set. Then, setting or results in -strongly convex loss functions. Using Corollary 6 and Theorem 1 immediately gives

However, we can do better! Indeed, or results in -strongly convex loss functions. Using Corollary 6, we have that , and Theorem 1 immediately gives

which is asymptotically better because it does not have the logarithmic term.

**5. History Bits **

The logarithmic regret in Corollary 6 was shown for the first time in the seminal paper (Hazan, E. and Kalai, A. and Kale, S. and Agarwal, A., 2006). The general statement in Theorem 5 was proven by (Hazan, E. and Rakhlin, A. and Bartlett, P. L., 2008).

As we said last time, the non-uniform averaging of Example 1 is from (Zhang, T., 2004), even if it is not proposed there explicitly as an online-to-batch conversion. Instead, the non-uniform averaging of Example 7 is from (Lacoste-Julien, S. and Schmidt, M. and Bach, F., 2012), but again it is not proposed there as an online-to-batch conversion. The basic idea of solving the SVM problem with OSD and the online-to-batch conversion of Example 7 was the Pegasos algorithm (S. Shalev-Shwartz and Y. Singer and N. Srebro, 2007), for many years the most used optimizer for SVMs.

A more recent method to do online-to-batch conversion has been introduced in (Cutkosky, 2019). The new method allows proving the convergence of the last iterate rather than that of the weighted average, with a small change in the online learning algorithm.

**6. Exercises **

Exercise 1 Prove that OSD in Example 6 with is exactly the Follow-the-Leader strategy for that particular problem.

Exercise 2 Prove that is -strongly convex w.r.t. and derive the OSD update for it and its regret guarantee.


Exercise 3 (Difficult) Prove that is -strongly convex w.r.t. , where .


Last time, we introduced Projected Online Gradient Descent:

And we proved the following regret guarantee:

Theorem 1 Let be a closed non-empty convex set with diameter , i.e. . Let be an arbitrary sequence of convex functions differentiable in open sets containing for . Pick any and assume . Then, , the following regret bound holds

Moreover, if is constant, i.e. , we have

However, the differentiability assumption for the is quite strong. What happens when the losses are convex but not differentiable? For example, . Note that this situation is more common than one would think. For example, the hinge loss, , and the ReLU activation function used in neural networks, , are not differentiable. It turns out that we can still use Online Gradient Descent, substituting *subgradients* for the gradients. For this, we need some more convex analysis!

**1. Convex Analysis Bits: Subgradients **

First, we need a technical definition.

Definition 2 If a function is nowhere and finite somewhere, then is called proper.

In this class, we are mainly interested in proper convex functions, which basically better conform to our intuition of what a convex function looks like.

Let’s first define formally what a subgradient is.

Definition 3 For a proper function , we define a subgradient of in as a vector that satisfies

Basically, a subgradient of in is any vector that allows us to construct a linear lower bound to . Note that the subgradient is not unique, so we denote the *set* of subgradients of in by , called the **subdifferential of at **.

Observe that if is proper and convex, then is empty for , because the inequality cannot be satisfied when . Also, the domain of , denoted by , is the set of all such that is nonempty; it is a subset of . A proper convex function is always subdifferentiable in (Rockafellar, R. T., 1970, Theorem 23.4). If the function is convex, differentiable in , and is finite in , we have that the subdifferential is composed by a unique element equal to (Rockafellar, R. T., 1970, Theorem 25.1).

We can also calculate the subgradient of a sum of functions.

Theorem 4 (Rockafellar, R. T., 1970, Theorem 23.8; Bauschke, H. H. and Combettes, P. L., 2011, Corollary 16.39) Let be proper convex functions on , and . Then . If , then actually .

Example 1 Let ; then the subdifferential set is

Example 2 Let’s calculate the subgradient of the indicator function for a non-empty convex set . By definition, if

This condition implies that and (because for the inequality is always verified). The set of all that satisfy the above inequality is called the **normal cone** of at . Note that the normal cone for any (Hint: take ). For example, for , for all .

Another useful theorem allows us to calculate the subdifferential of the pointwise maximum of convex functions.

Theorem 5 (Bauschke, H. H. and Combettes, P. L., 2011, Theorem 18.5) Let be a finite set of convex functions from to , and suppose and continuous at . Set and let be the set of the active functions. Then

Example 3 (Subgradients of the Hinge loss) Consider the loss for . The subdifferential set is
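The pointwise-maximum rule of Theorem 5 can be sketched in code for the hinge loss: depending on which of the two functions (the constant 0 or the linear piece) is active, we return its gradient, and at the kink any convex combination works. The helper name `hinge_subgrad` and the numerical check of the subgradient inequality are my own illustration.

```python
import numpy as np

def hinge_subgrad(w, x, y):
    """One element of the subdifferential of w -> max(0, 1 - y*<w, x>),
    following the pointwise-maximum rule (active-function gradients)."""
    margin = 1.0 - y * np.dot(w, x)
    if margin > 0:
        return -y * x              # only the linear piece is active
    if margin < 0:
        return np.zeros_like(x)    # only the constant 0 is active
    return -0.5 * y * x            # at the kink, any convex combination works

w = np.array([0.0, 0.0]); x = np.array([1.0, 2.0]); y = 1.0
g = hinge_subgrad(w, x, y)
# Sanity check of the subgradient inequality f(u) >= f(w) + <g, u - w>:
for u in [np.zeros(2), np.array([1.0, 0.0]), np.array([-1.0, 3.0])]:
    lhs = max(0.0, 1.0 - y * np.dot(u, x))
    rhs = max(0.0, 1.0 - y * np.dot(w, x)) + np.dot(g, u - w)
    assert lhs >= rhs - 1e-12
print("subgradient inequality verified")
```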

Definition 6 A function is -Lipschitz over a set w.r.t. a norm if .

We also have this handy result that upper bounds the norm of the subgradients of convex Lipschitz functions.

Theorem 7 Let be proper and convex. Then, is -Lipschitz in w.r.t. the L2 norm iff for all and we have .

*Proof:* Assume -Lipschitz, then . For small enough , then

that implies that .

For the other implication, the definition of subgradient and the Cauchy-Schwarz inequality give us

for any . Taking , we also get

which completes the proof.

**2. Analysis with Subgradients **

As I promised you, with the proper mathematical tools, analyzing online algorithms becomes easy. Indeed, switching from gradients to subgradients comes for free! In fact, our analysis of OGD with differentiable losses holds as is, using subgradients instead of gradients. The reason is that the only property of the gradients that we used in our proof was that

where . However, the exact same property holds when . So, we can state the Online Subgradient Descent algorithm in the following way, where the only difference is line 4.

Also, the regret bounds we proved hold as well, just changing differentiability to subdifferentiability and gradients to subgradients.

**3. From Convex Losses to Linear Losses **

Let’s take a deeper look at this step

And summing over time, we have

Now, define the linear (and convex) losses , so we have

This is more powerful than it seems: we upper bounded the regret with respect to the convex losses with the regret with respect to another sequence of linear losses. This is important because it implies that we can build online algorithms that deal only with linear losses and, through the reduction above, they can be seamlessly used as OCO algorithms! Note that this does not imply that the reduction is always optimal (it isn’t!). But it allows us to easily construct optimal OCO algorithms in many interesting cases.

So, we will often consider just the problem of minimizing the linear regret

This problem is called **Online Linear Optimization** (OLO).

Example 4 Consider the guessing game of the first class; we can easily solve it with Online Gradient Descent. Indeed, we just need to calculate the gradients, prove that they are bounded, and find a way to calculate the projection of a real number onto . So, , which is bounded for . The projection onto is just . With the optimal learning rate, the resulting regret would be . This is worse than the one we found in the first class, showing that the reduction does not always give the best possible regret.

Example 5 Consider again the guessing game of the first class, but now change the loss function to the absolute loss of the difference: . Now we will need to use Online Subgradient Descent, because the functions are non-differentiable. We can easily see that

Again, running Online Subgradient Descent with the optimal learning rate on this problem will immediately give us a regret of , without having to think of a particular strategy for it.
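A minimal simulation of this example: OSD on the absolute losses |x − y_t| over [0, 1], using the subgradient sign(x − y_t). The biased-coin sequence of y_t, the stepsize D/√(2t) with D = 1, and all names are my own toy choices; the point is only to see the regret staying well below the O(√T) bound.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 10_000
ys = (rng.uniform(size=T) < 0.7).astype(float)  # biased binary outcomes

x, total_loss = 0.5, 0.0
for t in range(1, T + 1):
    total_loss += abs(x - ys[t - 1])
    g = np.sign(x - ys[t - 1])            # a subgradient of |x - y_t|
    eta = 1.0 / np.sqrt(2.0 * t)          # D / sqrt(2 t) with D = 1, |g| <= 1
    x = min(max(x - eta * g, 0.0), 1.0)   # projection onto [0, 1]

# The sum of absolute losses is linear in the competitor, so the best
# fixed point in [0, 1] is one of the endpoints.
best = min(np.sum(np.abs(u - ys)) for u in (0.0, 1.0))
print(total_loss - best)  # regret, growing like sqrt(T)
```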

**4. Online-to-Batch Conversion **

It is a good moment to take a break from online learning theory and see some applications of online learning to other domains. For example, we may wonder what the connection is between online learning and stochastic optimization. Given that Projected Online (Sub)Gradient Descent looks basically the same as Projected Stochastic (Sub)Gradient Descent, they must have something in common. Indeed, we can show, for example, that we can reduce the stochastic optimization of convex functions to OCO. Let’s see how.

Theorem 8 Let , where the expectation is w.r.t. drawn from over some vector space and is convex in the first argument. Draw samples i.i.d. from and construct the sequence of losses , where the are deterministic. Run any OCO algorithm over the losses to construct the sequence of predictions . Then, we have

where the expectation is with respect to .

We will prove it next time; for now, let’s see an application: how to use the above theorem to transform Online Subgradient Descent into Stochastic Subgradient Descent to minimize the training error of a classifier.

Example 6 Consider a problem of binary classification, with inputs and outputs . The loss function is the hinge loss: . Suppose that you want to minimize the training error over a training set of samples, . Also, assume the maximum L2 norm of the samples is . That is, we want to minimize

Run the reduction described in Theorem 8 for iterations using OGD. In each iteration, construct by sampling a training point uniformly at random from to . Set and . We have that

In words, we used an OCO algorithm to stochastically optimize a function, transforming the regret guarantee into a convergence rate guarantee.
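The conversion above can be sketched in a few lines: run projected OSD on the hinge loss of a uniformly sampled training point and return the average iterate. The synthetic separable data set, the radius of the feasible ball, and the fixed stepsize D/(L√T) are illustrative assumptions of mine, not prescriptions from the theorem.

```python
import numpy as np

# Synthetic linearly separable data (my own toy choice).
rng = np.random.default_rng(0)
m, d = 200, 5
X = rng.standard_normal((m, d))
w_true = rng.standard_normal(d)
y = np.sign(X @ w_true)

R = 5.0                                  # radius of the L2 feasible ball
L = np.max(np.linalg.norm(X, axis=1))    # Lipschitz constant of the hinge losses
T = 20_000
eta = (2 * R) / (L * np.sqrt(T))         # fixed stepsize D / (L sqrt(T)), D = 2R

w = np.zeros(d)
avg = np.zeros(d)
for t in range(T):
    i = rng.integers(m)                  # sample a training point uniformly
    if 1.0 - y[i] * (X[i] @ w) > 0:      # subgradient step on the sampled hinge loss
        w = w + eta * y[i] * X[i]
    nrm = np.linalg.norm(w)
    if nrm > R:
        w = w * (R / nrm)                # projection onto the ball
    avg += w
avg /= T                                 # the online-to-batch (uniform) average

train_err = np.mean(np.maximum(0.0, 1.0 - y * (X @ avg)))
print(train_err)  # close to the best hinge loss achievable in the ball
```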

Next time we will prove the Online-to-Batch theorem and show many more examples on how to use it.

**5. History Bits **

The specific shape of Theorem 8 is new, but I wouldn’t be surprised if it appeared somewhere in the literature. In particular, the uniform averaging is from (N. Cesa-Bianchi and A. Conconi and C. Gentile, 2004), but was proposed for the absolute loss in (Blum, A. and Kalai, A. and Langford, J., 1999). The non-uniform averaging, which we will use next time, is from (Zhang, T., 2004), even if it is not proposed there explicitly as an online-to-batch conversion.

**6. Exercises **

Exercise 1 Calculate the subdifferential set of the -insensitive loss: . It is a loss used in regression problems where we don’t want to penalize predictions within of the correct value .

Exercise 2 Consider Projected Online Subgradient Descent for the example in the previous lecture about the failure of Follow-the-Leader: Can we use it on that problem? Would it guarantee sublinear regret? How would the behaviour of the algorithm differ from FTL?

Exercise 3 Implement the algorithm in 6 in any language you like: implementing an algorithm is the perfect way to see if you understood all the details of the algorithm.
