*You can find the lectures I published till now here.*

Last time, we saw that for Online Mirror Descent (OMD) with an entropic regularizer and learning rate, it might be possible to get the regret guarantee

where . This time we will see how, and we will use this guarantee to prove an almost optimal regret bound for Exp3, in Algorithm 1.

Remark 1. While it is possible to prove (1) from first principles using the specific properties of the entropic regularizer, such a proof would not shed any light on what is actually going on. So, in the following, we will instead try to prove such a regret bound in a very general way. Indeed, this general proof will allow us to easily prove the optimal bound for multi-armed bandits using OMD with the Tsallis entropy as regularizer.

Now, for a generic , consider the OMD algorithm that produces the predictions in two steps:

- Set such that .
- Set .

As we showed, under weak conditions, these two steps are equivalent to the usual OMD single-step update.
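To make the two steps concrete, here is a minimal sketch for the special case of the entropic regularizer over the simplex, where the first step has a closed form and the Bregman (KL) projection onto the simplex reduces to a normalization. The function name and the choice of regularizer are assumptions of this sketch, not part of the general scheme above:

```python
import numpy as np

def omd_entropic_two_steps(w, g, eta):
    """Two-step OMD update, specialized to the entropic regularizer.

    Step 1: unconstrained mirror step; for the entropic regularizer it has
            the closed form w_tilde = w * exp(-eta * g).
    Step 2: Bregman projection onto the simplex; for the KL divergence this
            is just a normalization.
    """
    w_tilde = w * np.exp(-eta * g)   # step 1: the unprojected prediction
    return w_tilde / w_tilde.sum()   # step 2: KL projection onto the simplex
```

Note how the unprojected point `w_tilde` is exactly the quantity the alternative analysis below keeps track of, before the normalization.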

Now, the idea is to consider an alternative analysis of OMD that explicitly depends on , the new prediction before the Bregman projection step. First, let’s state the Generalized Pythagorean Theorem for Bregman divergences.

Lemma 1. Let and define , then for all .

*Proof:* From the first order optimality condition of we have that . Hence, we have

The Generalized Pythagorean Theorem is often used to prove that the Bregman divergence between any point in and an arbitrary point decreases when we consider the Bregman projection onto .

We are now ready to prove our regret guarantee.

Lemma 2. For the two-step OMD update above, the following regret bound holds:

where and .

*Proof:* From the update rule, we have that

where in the second equality we used the 3-points equality for Bregman divergences, and in the first inequality the Generalized Pythagorean Theorem. Hence, summing over time, we have

So, as we did in the previous lecture, we have

where and .

Putting everything together, we have the stated bound.

This time it might be easier to get a handle on . Given that we only need an upper bound, we can just take a look at and and see which one is bigger. This is easy to do: using the update rule, we have

that is

Assuming , we have that implies .

Overall, we have the following improved regret guarantee for the Learning with Experts setting with positive losses.

Theorem 3. Assume for and . Let and . Using OMD with the entropic regularizer defined as , learning rate , and gives the following regret guarantee

Armed with this new tool, we can now turn to the multi-armed bandit problem again.

Let’s now consider OMD with the entropic regularizer, learning rate , and set equal to the stochastic estimate of , as in Algorithm 1. Applying Theorem 3 and taking expectations, we have

Now, focusing on the terms , we have

So, setting , we have

Remark 2. The need for a different analysis of OMD comes from the fact that we want an easy way to upper bound the Hessian. Indeed, in this analysis, appears before the normalization into a probability distribution, which simplifies the analysis a lot. The same idea will be used for the Tsallis entropy in the next section.

So, with a tighter analysis, we showed that, even without an explicit exploration term, OMD with the entropic regularizer solves the multi-armed bandit problem paying only a factor more than the full-information case. However, this is still not the optimal regret!
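As a concrete illustration, here is a sketch of Exp3 along these lines: OMD with the entropic regularizer run on importance-weighted loss estimates, with no explicit exploration term. The interface, the seed, and the fixed learning rate are assumptions of this sketch; losses are assumed in [0, 1]:

```python
import numpy as np

def exp3(loss_matrix, eta):
    """Sketch of Exp3: entropic OMD on importance-weighted loss estimates.

    loss_matrix[t, i] is the (hidden) loss of arm i at round t, in [0, 1].
    Returns the sequence of pulled arms.
    """
    rng = np.random.default_rng(0)
    T, d = loss_matrix.shape
    x = np.ones(d) / d                  # uniform initial distribution
    pulls = []
    for t in range(T):
        arm = rng.choice(d, p=x)        # sample the arm from x_t
        pulls.append(arm)
        ell_hat = np.zeros(d)           # importance-weighted estimator:
        ell_hat[arm] = loss_matrix[t, arm] / x[arm]  # nonzero only on the pulled arm
        x = x * np.exp(-eta * ell_hat)  # entropic OMD step ...
        x = x / x.sum()                 # ... followed by the normalization
    return pulls
```

Running it on a toy instance where one arm is always better shows the distribution concentrating on that arm, without any forced exploration.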

In the next section, we will see that changing the regularizer, *with the same analysis*, will remove the term in the regret.

**1. Optimal Regret Using OMD with Tsallis Entropy **

In this section, we present the Implicitly Normalized Forecaster (INF), also known as OMD with the Tsallis entropy, an algorithm for the multi-armed bandit problem.

Define as , where and at we extend the function by continuity. This is the negative **Tsallis entropy** of the vector . It is a strict generalization of the Shannon entropy, because, as goes to 1, converges to the negative (Shannon) entropy of .

We will instantiate OMD with this regularizer for the multi-armed bandit problem, as in Algorithm 2.

Note that and .

We will not use any interpretation of this regularizer from the information theory point of view. As we will see in the following, the only reason to choose it is its Hessian. In fact, the Hessian of this regularizer is still diagonal and it is equal to

Now, we can use again the modified analysis for OMD in Lemma 2. So, for any , we obtain

where and .

As we did for Exp3, we now need an upper bound on the . From the update rule and the definition of , we have

that is

So, if , , which implies that .

Hence, putting everything together, we have

We can now specialize the above reasoning, considering in the Tsallis entropy, to obtain the following theorem.

Theorem 4. Assume . Set and . Then, Algorithm 2 guarantees

*Proof:* We only need to calculate the terms

Proceeding as in (2), we obtain

Choosing , we finally obtain an expected regret of , which can be proved to be optimal.

One last thing remains: how do we compute the predictions of this algorithm? In each step, we have to solve a constrained optimization problem. So, we can write the corresponding Lagrangian:

From the KKT conditions, we have

and we also know that . So, we have a 1-dimensional problem in that must be solved in each round.
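To make this computation concrete, here is a sketch for the 1/2-Tsallis case. The stationarity condition gives coordinate-wise weights of the form `w_i = (eta * L_i + lam) ** (-2)`, and the 1-dimensional problem is to find the Lagrange multiplier `lam` making the weights sum to 1, e.g. by bisection. This FTRL-style closed form and its constants are assumptions of this sketch, chosen for illustration:

```python
import numpy as np

def tsallis_inf_weights(L, eta, iters=200):
    """Sketch: solve the per-round 1-dimensional normalization problem for
    the 1/2-Tsallis regularizer (illustrative constants).

    The KKT conditions give w_i = (eta * L[i] + lam) ** (-2); we bisect on
    the Lagrange multiplier lam until the weights sum to 1.
    """
    L = np.asarray(L, dtype=float)
    lo = -eta * L.min() + 1e-12          # below this the weights blow up
    hi = lo + 1.0
    while np.sum((eta * L + hi) ** -2) > 1.0:  # grow hi until the sum drops below 1
        hi = lo + 2 * (hi - lo)
    for _ in range(iters):                # bisection on lam
        mid = 0.5 * (lo + hi)
        if np.sum((eta * L + mid) ** -2) > 1.0:
            lo = mid
        else:
            hi = mid
    w = (eta * L + hi) ** -2
    return w / w.sum()                    # tiny final renormalization for safety
```

The monotonicity of the sum in `lam` is what makes this 1-dimensional problem easy to solve numerically in each round.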

**2. History Bits **

The INF algorithm was proposed by (Audibert, J.-Y. and Bubeck, S., 2009) and recast as an OMD procedure in (Audibert, J.-Y. and Bubeck, S. and Lugosi, G., 2011). The connection with the Tsallis entropy was made in (Abernethy, J. D. and Lee, C. and Tewari, A., 2015). The specific proof presented here is new and builds on the proof by (Abernethy, J. D. and Lee, C. and Tewari, A., 2015). Note that (Abernethy, J. D. and Lee, C. and Tewari, A., 2015) proved the same regret bound for a Follow-The-Regularized-Leader procedure over the stochastic estimates of the losses (which they call Gradient-Based Prediction Algorithm), while here we proved it using an OMD procedure.

**3. Exercises **

Exercise 1. Prove that in the modified proof of OMD, the terms can be upper bounded by .

Exercise 2. Building on the previous exercise, prove that regret bounds of the same order can be obtained for Exp3 and for INF/OMD with the Tsallis entropy by directly upper bounding the terms , without passing through the Bregman divergences.



Today, we will present the problem of multi-armed bandit in the adversarial setting and show how to obtain sublinear regret.

**1. Multi-Armed Bandit **

This setting is similar to the Learning with Expert Advice (LEA) setting: In each round, we select one expert and, differently from the full-information setting, we only observe the loss of that expert . The aim is still to compete with the cumulative loss of the best expert in hindsight.

As in the learning with experts case, we need randomization in order to have sublinear regret. Indeed, this is just a harder problem than LEA. However, we will assume that the adversary is **oblivious**, that is, it decides the losses of all the rounds before the game starts, but with knowledge of the online algorithm. This makes the losses deterministic quantities and avoids the inadequacy in our definition of regret when the adversary is adaptive (see (Arora, R. and Dekel, O. and Tewari, A., 2012)).

This kind of problem, where we don’t receive full information, i.e., we don’t observe the loss vector, is called a **bandit problem**. The name comes from the problem of a gambler who plays a pool of slot machines, which can be called “one-armed bandits”. On each round, the gambler places his bet on a slot machine and his goal is to win almost as much money as if he had known in advance which slot machine would return the maximal total reward.

In this problem, we clearly have an *exploration-exploitation trade-off*. In fact, on one hand we would like to play at the slot machine which, based on previous rounds, we believe will give us the biggest win. On the other hand, we have to explore the slot machines to find the best ones. On each round, we have to solve this trade-off.

Given that we don’t completely observe the loss, we cannot use our two frameworks: Online Mirror Descent (OMD) and Follow-The-Regularized-Leader (FTRL) both need the loss functions, or at least lower bounds to them.

One way to solve this issue is to construct *stochastic estimates* of the unknown losses. This is a natural choice given that we already know that the prediction strategy has to be a randomized one. So, in each round we construct a probability distribution over the arms and we sample one action according to this probability distribution. Then, we only observe the coordinate of the loss vector . One possibility to have a stochastic estimate of the losses is to use an *importance-weighted estimator*: Construct the estimator of the unknown vector in the following way:

Note that this estimator has all coordinates equal to 0, except the coordinate corresponding to the arm that was pulled.
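The construction can be sketched as follows, together with a quick Monte Carlo check that the average of many draws approaches the true loss vector. The specific numbers are hypothetical, used only for illustration:

```python
import numpy as np

def iw_estimate(loss, x, rng):
    """One draw of the importance-weighted estimator: pull arm A ~ x and
    return the vector with loss[A] / x[A] in coordinate A, zeros elsewhere."""
    d = len(loss)
    a = rng.choice(d, p=x)
    est = np.zeros(d)
    est[a] = loss[a] / x[a]
    return est

# hypothetical losses and sampling distribution, to check unbiasedness empirically
rng = np.random.default_rng(0)
loss = np.array([0.2, 0.7, 0.5])
x = np.array([0.5, 0.3, 0.2])
avg = np.mean([iw_estimate(loss, x, rng) for _ in range(200_000)], axis=0)
# avg should be close to the true loss vector
```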

This estimator is unbiased, that is . To see why, note that and . Hence, for , we have

Let’s also calculate the (uncentered) variance of the coordinates of this estimator. We have

We can now think of using OMD with an entropic regularizer and the estimated losses. Hence, assume and set defined as , that is the unnormalized negative entropy. Also, set . Using the OMD analysis, we have

We can now take the expectation at both sides and get

We are now in trouble, because the terms in the sum scale as . So, we need a way to control the smallest probability over the arms.

One way to do it is to take a convex combination of and the uniform probability distribution. That is, we can predict with , where will be chosen in the following. So, can be seen as the minimum amount of exploration we require of the algorithm. Its value will be chosen by the regret analysis to optimally trade off exploration vs. exploitation. The resulting algorithm is in Algorithm 1.
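The mixing step itself is one line; a minimal sketch (the function name is an assumption of this sketch):

```python
import numpy as np

def mix_with_uniform(x, gamma):
    """Convex combination of x and the uniform distribution: every arm is
    now played with probability at least gamma / d, which upper bounds the
    inverse probabilities appearing in the estimator."""
    d = len(x)
    return (1 - gamma) * x + gamma / d * np.ones(d)
```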

The same probability distribution is used in the estimator:

Hence, we have that . However, we pay a price in the bias introduced:

Observing that , we have

Putting together the last inequality and the upper bound to the expected regret in (2), we have

Setting and , we obtain a regret of .

This is way worse than the of the full-information case. However, while it is expected that the bandit case is more difficult than the full-information one, it turns out that this is not the optimal strategy.

**2. Exponential-weight algorithm for Exploration and Exploitation: Exp3 **

It turns out that the algorithm above actually works, even without the mixing with the uniform distribution! We were just too loose in our regret guarantee. So, we will analyze the following algorithm, called Exponential-weight algorithm for Exploration and Exploitation (Exp3), which is nothing else than OMD with the entropic regularizer and stochastic estimates of the losses. Note that now we will assume that .

Let’s take another look at the regret guarantee we have. From the OMD analysis, we have the following one-step inequality that holds for any

Let’s now focus on the term . We said that for a twice differentiable function , there exists such that , where . Hence, there exists such that and

So, assuming the Hessian in to be positive definite, we can bound the last two terms in the one-step inequality of OMD as

where we used the Fenchel-Young inequality with the function and and .

When we use strong convexity, we are upper bounding the terms in the sum with the inverse of the smallest eigenvalue of the Hessian of the regularizer. However, we can do better if we consider the actual Hessian. In fact, in the coordinates where is small, we have a smaller growth of the divergence. This can also be seen graphically in Figure 1. Indeed, for the entropic regularizer, the Hessian is a diagonal matrix. This expression of the Hessian gives a regret of

where and . Note that for any is in the simplex, so this upper bound is always better than

that we derived just using the strong convexity of the entropic regularizer.

However, we don’t know the exact value of , only that it is on the line segment between and . Yet, if we could say that , in the bandit case we would obtain an expected regret guarantee of , greatly improving the bound we proved above!

In the next lecture, we will see an alternative way to analyze OMD that will give us exactly this kind of guarantee for Exp3, and will also give us the optimal regret guarantee using the Tsallis entropy in a few lines of proof.

**3. History Bits **

The algorithm in Algorithm 1 is from (Cesa-Bianchi, N. and Lugosi, G. , 2006, Theorem 6.9). The Exp3 algorithm was proposed in (Auer, P. and Cesa-Bianchi, N. and Freund, Y. and Schapire, R. E., 2002).


Throughout this class, we considered the adversarial model as our model of the environment. This allowed us to design algorithms that work in this setting, as well as in other more benign settings. However, the world is never completely adversarial. So, we might be tempted to model the environment in some way, but that would leave our algorithm vulnerable to attacks. An alternative is to consider the data as generated by some *predictable process plus adversarial noise*. In this view, it might be beneficial to try to model the predictable part, without compromising the robustness to the adversarial noise.

In this class, we will explore this possibility through a particular version of Follow-The-Regularized-Leader (FTRL), where we *predict* the next loss. In very intuitive terms, if our predicted loss is correct, we can expect the regret to decrease. However, if our prediction is wrong, we still want to recover the worst-case guarantee. Such an algorithm is called **Optimistic FTRL**.

The core idea of Optimistic FTRL is to predict the next loss and use it in the update rule, as summarized in Algorithm 1. Note that, for the sake of the analysis, it does not matter how the prediction is generated. It can even be generated by another online learning procedure!

Let’s see why this is a good idea. Remember that FTRL simply predicts with the minimizer of the previous losses plus a time-varying regularizer. Let’s assume for a moment that instead we have the gift of predicting the future, so we do know the next loss ahead of time. Then, we could predict with its minimizer and suffer a negative regret. However, probably our foresight abilities are not so powerful, so our prediction of the next loss might be inaccurate. In this case, a better idea might be just to add our predicted loss to the previous ones and minimize the regularized sum. We would expect the regret guarantee to improve if our prediction of the future loss is precise. At the same time, if the prediction is wrong, we expect its influence to be limited, given that we use it together with all the past losses.
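A minimal sketch of this update, assuming linearized losses and a quadratic regularizer, for which the regularized minimization has a closed form. The `hint` callback stands in for any prediction of the next loss and, like the closed form, is an assumption of this sketch (the lecture's Algorithm 1 is more general):

```python
import numpy as np

def optimistic_ftrl(gradients, hint, eta):
    """Sketch of Optimistic FTRL on linear losses with the quadratic
    regularizer ||x||^2 / (2 * eta), for which the update has the closed form
        x_t = -eta * (g_1 + ... + g_{t-1} + hint_t).

    `hint(past)` returns the guess for the next gradient; it can be any
    function of the past, e.g. the last gradient or the running mean.
    """
    G = np.zeros_like(gradients[0])
    xs, past = [], []
    for g in gradients:
        xs.append(-eta * (G + hint(past)))  # minimize over past losses plus the hint
        past.append(g)                      # the true gradient is then revealed
        G += g
    return xs
```

For example, with `hint = lambda past: past[-1] if past else 0`, this instance predicts the next gradient with the previous one, the choice used later for the gradual-variations bound.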

All these intuitions can be formalized in the following Theorem.

Theorem 1. With the notation in Algorithm 1, let be convex, closed, and non-empty. Denote by . Assume for that is proper and -strongly convex w.r.t. , and proper and convex, and . Also, assume that and are non-empty. Then, there exists for , such that we have

for all .

*Proof:* We can interpret the Optimistic-FTRL as FTRL with a regularizer . Also, note that has no influence on the algorithm, so we can set it to the null function.

Hence, from the equality for FTRL, we immediately get

Now focus on the terms . Observe that is -strongly convex w.r.t. , hence we have

where . Observing that , we have . Hence, given that our assumptions guarantee that the subdifferential of the sum is equal to the sum of the subdifferentials, there exists such that . So, we have

By the definition of dual norms, we also have that

Let’s take a look at the second bound in the theorem. Compared to the similar bound for FTRL, we now have the terms instead of . So, if the prediction of the next loss is good, those terms can become smaller and possibly even zero! On the other hand, if the predictions are bad, for Lipschitz losses we only lose a constant factor. Overall, in the best case we can gain a lot, and in the worst case we don’t lose that much.

Despite the simplicity of the algorithm and its analysis, there are many applications of this principle. We will describe only a couple of them. Recently, this idea was even used to recover Nesterov’s acceleration algorithm and to prove faster convergence in repeated games.

**1. Regret that Depends on the Variance of the Subgradients **

Consider running Optimistic-FTRL on the linearized losses . We can gain something from Optimistic-FTRL compared to plain FTRL if we are able to predict the next . A simple possibility is to predict the average of the past values, . Indeed, from the first lecture, we know that such a strategy is itself an online learning procedure! In particular, it corresponds to a Follow-The-Leader algorithm on the losses . Hence, from the strong convexity of these losses, we know that

This implies

It is immediate to see that the minimizer is , which results in times the empirical variance of the subgradients. Plugging it into the Optimistic-FTRL regret, with , we have

Remark 1. Instead of using the mean of the past subgradients, we could use any other strategy, or even a mix of different strategies. For example, assuming the subgradients are bounded, we could use an algorithm for the Learning with Expert Advice problem, where each expert is a strategy. Then, we would obtain a bound that depends on the predictions of the best strategy, plus the regret of the expert algorithm.

**2. Online Convex Optimization with Gradual Variations **

In this section, we consider the case in which the losses we receive have small variations over time. We will show that in this case it is possible to get constant regret when the losses are all equal.

In this case, a simple strategy to predict the next subgradient is to use the previous one, that is, for and .

Corollary 2. Under the assumptions of Theorem 1, define for and . Set , where is 1-strongly convex w.r.t. and satisfies for , where is the smoothness constant of the losses . Then, , we have

Moreover, assuming for all , setting , we have

*Proof:* From the Optimistic-FTRL bound with a fixed regularizer, we immediately get

Now, consider the case that the losses are -smooth. So, for any , we have

Focusing on the first term, for , we have

Choose . We have for

For , we have

Now, observe that the assumption implies for . So, summing for , we have

Putting everything together, we have the first stated bound.

The second one is obtained observing that

Note that if the losses are all the same, the regret becomes a constant! This is not surprising, because the prediction of the next loss is a linear approximation of the previous loss. Indeed, looking back at the proof, the key idea is to use the smoothness to argue that, even if the past subgradient was taken at a different point than the current one, it is still a good prediction of the current subgradient.

Remark 2. Note that the assumption of smoothness is necessary. Indeed, always passing the same function and using an online-to-batch conversion would result in a convergence rate of for a Lipschitz function, which is impossible.

**3. History Bits **

The Optimistic Online Mirror Descent algorithm was proposed by (Chiang, C.-K. and Yang, T. and Lee, C.-J. and Mahdavi, M. and Lu, C.-J. and Jin, R. and Zhu, S., 2012) and extended in (A. Rakhlin and K. Sridharan, 2013) to use arbitrary “hallucinated” losses. The Optimistic FTRL version was proposed in (A. Rakhlin and K. Sridharan, 2013) and rediscovered in (Steinhardt, J. and Liang, P., 2014), even if there it was called Online Mirror Descent, due to the misnaming problem we already explained. The proof of Theorem 1 presented here is new.

Corollary 2 was proved by (Chiang, C.-K. and Yang, T. and Lee, C.-J. and Mahdavi, M. and Lu, C.-J. and Jin, R. and Zhu, S., 2012) for Optimistic OMD and presented in a similar form in (P. Joulani and A. György and C. Szepesvári, 2017) for Optimistic FTRL, but for bounded domains.


In the last lecture, we have shown a very simple and parameter-free algorithm for Online Convex Optimization (OCO) in -dimensions, based on a reduction to a coin-betting problem. Now, we will see how to reduce Learning with Expert Advice (LEA) to betting on coins, again obtaining parameter-free and optimal algorithms.

**1. Reduction to Learning with Experts **

First, remember that the regret we got from Online Mirror Descent (OMD), and similarly for Follow-The-Regularized-Leader (FTRL), is

where is the prior distribution over the experts and is the KL divergence. As we reasoned in the OCO case, in order to set the learning rate we should know the value of . If we could set to , we would obtain a regret of . However, given the adversarial nature of the game, this is impossible. So, as we did in the OCO case, we will show that this problem too can be reduced to betting on a coin, obtaining optimal guarantees with a parameter-free algorithm.

First, let’s introduce some notation. Let be the number of experts and be the -dimensional probability simplex. Let be any *prior* distribution. Let be a coin-betting algorithm. We will instantiate copies of .

Consider any round . Let be the bet of the -th copy of . The LEA algorithm computes as

Then, the LEA algorithm predicts as

The algorithm then receives the reward vector . Finally, it feeds a reward to each copy of . The reward for the -th copy of is defined as

The construction above gives a LEA algorithm, defined by the predictions , based on the algorithm . We can prove the following regret bound for it.
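A simplified sketch of one round of this construction, instantiated with KT copies, is below. The function names, the exact truncation of the reward when a copy's bet is non-positive, and the assumption that the losses lie in [0, 1] are all assumptions of this sketch, meant only to show the data flow of the reduction:

```python
import numpy as np

def kt_bet(past_coins, wealth):
    """KT bettor: bet the signed fraction (sum of past coins)/(t+1) of wealth."""
    return sum(past_coins) / (len(past_coins) + 1) * wealth

def lea_round(prior, coins, wealth, g):
    """One round of the coin-betting-to-LEA reduction (simplified sketch).

    Each expert i has its own KT copy, with coin history coins[i] and
    wealth[i]. The LEA prediction puts mass prior[i] * max(bet_i, 0) on
    expert i, then normalizes (falling back to the prior if all bets are
    non-positive).
    """
    d = len(prior)
    bets = np.array([kt_bet(coins[i], wealth[i]) for i in range(d)])
    p_hat = prior * np.maximum(bets, 0.0)
    p = p_hat / p_hat.sum() if p_hat.sum() > 0 else prior.copy()
    for i in range(d):
        r = np.dot(g, p) - g[i]       # instantaneous regret w.r.t. expert i
        if bets[i] <= 0:              # truncation used by this sketch
            r = max(r, 0.0)
        coins[i].append(r)            # feed the reward as a coin outcome
        wealth[i] += r * max(bets[i], 0.0)
    return p
```

On a toy instance where one expert is always better, the prediction quickly concentrates on that expert.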

Theorem 1 (Regret Bound for Experts). Let be a coin-betting algorithm that, with initial money equal to 1, guarantees a wealth of after rounds, for any sequence of continuous coin outcomes . Then, the regret of the LEA algorithm with prior that predicts at each round with in (2) satisfies

for any concave and non-decreasing such that .

*Proof:* We first prove that . Indeed,

The first equality follows from the definition of . To see the second equality, consider two cases: if for all , then , and therefore both and are trivially zero. If , then for all .

From the assumption on , we have for any sequence such that that

So, inequality and (4) imply

Now, for any competitor ,

Now, we could think of using the Krichevsky–Trofimov (KT) bettor with this theorem. However, we would obtain a sub-optimal regret guarantee. In fact, remembering the lower bound on the wealth of KT and setting , where is a universal constant, we have

We might think that the is the price we have to pay to adapt to the unknown competitor . However, it turns out that it can be removed. In the next section, we will see how to change the KT strategy to obtain the optimal guarantee.

**2. A Betting Strategy that Loses at Most a Constant Fraction of Money **

In the reduction above, if we use the KT betting strategy we have a term under the square root. It turns out that we can avoid that term if we know the number of rounds beforehand. Then, in case is unknown, we can just use the doubling trick, paying only a constant multiplicative factor in the regret.

The logarithmic term in the regret comes from the fact that the lower bound on the wealth is

Note that in the case in which the number of heads in the sequence is equal to the number of tails, so that , the guaranteed wealth becomes proportional to . So, as goes to infinity, the bettor will lose all of its money.

Instead, we need a more conservative strategy that guarantees

for small enough and independent of . In this case, the betting strategy has to pace its betting, possibly with the knowledge of the duration of the game, so that even in the case that the number of heads is equal to the number of tails it will only lose a fraction of its money. At the same time, it will still gain an exponential amount of money when the coin outcomes are biased towards one side.

We will prove that this is possible, designing a new betting strategy.

Then, by induction, . In fact, we have

Hence, we have to prove that (8) is true in order to guarantee a minimum wealth of our betting strategy.

First, given that is a concave function of , we have

Also, our choice of makes the two quantities above equal with , that is

For other choices of , the two alternatives would be different, and the adversary could always pick the worse one for us. Instead, by making the worst outcomes of the two choices equal, we minimize the damage of the adversarial choice of the coin outcomes. So, we have that

where in the second equality we used the definition of and in the second inequality we used the fact that .

Hence, given that (8) is true, this strategy guarantees

We can now use this betting strategy in the expert reduction in Theorem 1, setting , to have

Note that this betting strategy could also be used in the OCO reduction. Given that we removed the logarithmic term in the exponent, in the 1-dimensional case, we would obtain a regret of

where we gained in the term inside the logarithm, instead of the term of the KT algorithm. This implies that we can now set to and obtain an asymptotic rate of rather than .

**3. History Bits **

The first parameter-free algorithm for experts is from (Chaudhuri, K. and Freund, Y. and Hsu, D. J., 2009), named NormalHedge, where they obtained a bound similar to the one in (9) but with an additional term. Then, (Chernov, A. and Vovk, V., 2010) removed the log factors with an update without a closed form. (Orabona, F. and Pal, D., 2016) showed that this guarantee can be efficiently obtained through the novel reduction to coin betting in Theorem 1. Later, this kind of regret guarantee was improved to depend on the sum of the squared losses rather than on time, but with an additional factor, in the Squint algorithm (Koolen, W. M. and van Erven, T., 2015). It is worth noting that the Squint algorithm can be interpreted exactly as a coin-betting algorithm plus the reduction in Theorem 1.

The betting strategy in (6) and (7) is new, and derived from the shifted-KT potentials in (Orabona, F. and Pal, D., 2016). The guarantee is the same as the one obtained by the shifted-KT potentials, but the analysis can be done without knowing the properties of the gamma function.

**4. Exercises **


Exercise 1. Using the same proof technique of this lecture, find a betting strategy whose wealth depends on rather than on .


In the last lecture, we have shown a very simple and parameter-free algorithm for Online Convex Optimization (OCO) in 1-dimension. Now, we will see how to reduce OCO in a -dimensional space to OCO in 1-dimension, so that we can use the parameter-free algorithm given by a coin-betting strategy in any number of dimensions.

**1. Coordinate-wise Parameter-free OCO **

We have already seen that it is always possible to decompose an OCO problem over the coordinates and use a different 1-dimensional Online Linear Optimization (OLO) algorithm on each coordinate. In particular, we saw that

where the is exactly the regret w.r.t. the linear losses constructed by the coordinate of the subgradient.

Hence, if we have a 1-dimensional OLO algorithm, we can run copies of it, each one fed with the corresponding coordinate of the subgradient. In particular, we might think of using the KT algorithm on each coordinate. The pseudo-code of this procedure is in Algorithm 1.
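A sketch of this coordinate-wise construction, with each coordinate running an independent 1-dimensional KT bettor on the coins given by the negated subgradient coordinates. The vectorized form and the initial wealth `eps` are assumptions of this sketch; subgradient coordinates are assumed in [-1, 1]:

```python
import numpy as np

def kt_oco(gradients, eps=1.0):
    """Sketch of coordinate-wise KT for OCO: coordinate i runs a 1-d KT
    bettor on the coins -gradients[t, i].

    The prediction is x[t, i] = (sum of past coins / t) * wealth[i], with
    the wealth updated as in coin betting.
    """
    T, d = gradients.shape
    wealth = np.full(d, eps)
    coin_sum = np.zeros(d)
    xs = np.zeros((T, d))
    for t in range(T):
        beta = coin_sum / (t + 1)   # KT betting fraction, per coordinate
        xs[t] = beta * wealth       # bet this (signed) amount
        c = -gradients[t]           # coin outcomes, assumed in [-1, 1]
        wealth += c * xs[t]         # wealth update of each bettor
        coin_sum += c
    return xs
```

Note that the first prediction is always the origin, and each coordinate drifts in the direction that its subgradients consistently point away from, with no learning rate to tune.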

The regret bound we get is immediate: We just have to sum the regret over the coordinates.

Theorem 1. With the notation in Algorithm 1, assume that . Then, , the following regret bounds hold

where is a universal constant.

Note that the Theorem above suggests that in high dimensional settings should be proportional to .

**2. Parameter-free in Any Norm **

The above reduction works only with in a finite-dimensional space. Moreover, it gives a dependency on the competitor w.r.t. the norm that might be undesirable. So, here we present another simple reduction from 1-dimensional OCO to infinite dimensions.

This reduction requires an unconstrained OCO algorithm for the 1-dimensional case and an algorithm for learning in -dimensional (or infinite-dimensional) balls. For the 1-dimensional learner, we could use the KT algorithm, while for learning in -dimensional balls we can use, for example, Online Mirror Descent (OMD). Given these two learners, we decompose the problem of learning a vector into the problems of learning a *direction* and a *magnitude*. The regret of this procedure turns out to be just the sum of the regrets of the two learners.
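A sketch of the reduction, instantiated with a 1-d KT bettor for the magnitude and projected online subgradient descent on the unit L2 ball for the direction. The `.predict()`/`.update()` interface, the learning rate, and the initial wealth are assumptions of this sketch; subgradients are assumed to have L2 norm at most 1:

```python
import numpy as np

class KT1D:
    """1-dimensional KT bettor, used as the unconstrained magnitude learner."""
    def __init__(self, eps=1.0):
        self.wealth, self.coin_sum, self.t = eps, 0.0, 0
    def predict(self):
        return self.coin_sum / (self.t + 1) * self.wealth
    def update(self, s):                  # s = <g_t, v_t>, assumed in [-1, 1]
        self.wealth += -s * self.predict()
        self.coin_sum += -s
        self.t += 1

class OSDBall:
    """Projected online subgradient descent on the unit L2 ball (direction learner)."""
    def __init__(self, dim, eta=0.1):
        self.v, self.eta = np.zeros(dim), eta
    def predict(self):
        return self.v
    def update(self, g):
        w = self.v - self.eta * g
        self.v = w / max(np.linalg.norm(w), 1.0)   # projection onto the unit ball

def magnitude_direction(gradients, dim):
    """Predict x_t = z_t * v_t: the scalar z_t from the 1-d learner, fed
    the 1-d subgradients <g_t, v_t>, and the direction v_t from the ball
    learner, fed the original subgradients."""
    mag, dirn = KT1D(), OSDBall(dim)
    xs = []
    for g in gradients:
        z, v = mag.predict(), dirn.predict()
        xs.append(z * v)
        mag.update(float(np.dot(g, v)))
        dirn.update(g)
    return xs
```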

We can formalize this idea in the following Theorem.

Theorem 2. Denote by the linear regret of algorithm for any in the unit ball w.r.t. a norm , and by the linear regret of algorithm for any competitor . Then, for any , Algorithm 2 guarantees regret

Further, the subgradients sent to satisfy .

*Proof:* First, observe that since for all . Now, compute:

Remark 1. Note that the direction vector is not constrained to have norm equal to 1, yet this does not seem to affect the regret equality.

We can instantiate the above theorem using the KT betting algorithm for the 1d learner and OMD for the direction learner. We obtain the following examples.

Example 1. Let be OSD with and learning rate . Let be the KT algorithm for 1-dimensional OCO with . Assume the loss functions are -Lipschitz w.r.t. the . Then, using the construction in Algorithm 2, we have

Using an online-to-batch conversion, this algorithm is a stochastic gradient descent procedure without learning rates to tune.

To better appreciate this kind of guarantee, let’s take a look at the one of Follow-The-Regularized-Leader (Online Subgradient Descent can be used in unbounded domains only with constant learning rates). With the regularizer and 1-Lipschitz losses we get a regret of

So, to get the right dependency on we would need to tune , but we saw that this is impossible. On the other hand, the regret in Example 1 suffers from a logarithmic factor, which is the price to pay for not having to tune parameters.

In the same way, we can even have a parameter-free regret bound for norms.

Example 2. Let be OMD with and learning rate . Let be the KT algorithm for 1-dimensional OCO with . Assume the loss functions are -Lipschitz w.r.t. the . Then, using the construction in Algorithm 2, we have

If we want to measure the competitor w.r.t. the norm, we have to use the same method we saw for OMD: set and such that . Now, assuming that , we have that . Hence, we have to divide all the losses by and, for all , we obtain

Note that the regret against of the parameter-free construction is *constant*. It is important to understand that there is nothing special about the origin: we could translate the predictions by any offset and get a guarantee that treats the offset as the point with constant regret. This is shown in the next Proposition.

Proposition 3. Let be an OLO algorithm that predicts and guarantees linear regret for any . Then, the regret of the predictions for OCO is

**3. Combining OCO Algorithms **

Finally, we show a useful application of the property of parameter-free OCO algorithms of having constant regret against .

Theorem 4. Let and be two OLO algorithms that produce the predictions and , respectively. Then, predicting with , we have for any

Moreover, if both algorithms guarantee a constant regret of against , we have for any

*Proof:* Set . Then

In words, the above theorem allows us to combine online learning algorithms. If the algorithms we combine have constant regret against the null competitor, then we always get the best of the two guarantees.
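The combination itself is just summing the two predictions and feeding the same subgradients to both algorithms. A sketch, using a toy online gradient descent learner with a hypothetical `.predict()`/`.update()` interface:

```python
import numpy as np

class OGD:
    """Toy unconstrained online gradient descent learner."""
    def __init__(self, dim, eta):
        self.x, self.eta = np.zeros(dim), eta
    def predict(self):
        return self.x
    def update(self, g):
        self.x = self.x - self.eta * g

def combine(alg_a, alg_b, gradients):
    """Predict with the sum of the two predictions; both algorithms
    receive the same subgradients."""
    xs = []
    for g in gradients:
        xs.append(alg_a.predict() + alg_b.predict())
        alg_a.update(g)
        alg_b.update(g)
    return xs
```

The same wrapper applies unchanged to any pair of learners exposing this interface, including two parameter-free ones as in the example below.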

Example 3. We can combine two parameter-free OCO algorithms, one that gives a bound that depends on the norm of the competitor and subgradients, and another one specialized to the norm of the competitor/subgradients. The above theorem assures us that we will get the best guarantee between the two, paying only a constant factor in the regret.

Of course, upper bounding the OCO regret by the linear regret, the above theorem also applies to the OCO regret.

**4. History Bits **

The approach of using a coordinate-wise version of the coin-betting algorithm was proposed in the first paper on parameter-free OLO in (M. Streeter and B. McMahan, 2012). Recently, the same approach with a special coin-betting algorithm was also used for optimization of deep neural networks (Orabona, F. and Tommasi, T., 2017). Theorem 2 is from (A. Cutkosky and F. Orabona, 2018). Note that the original theorem is more general because it works even in Banach spaces. The idea of combining two parameter-free OLO algorithms to obtain the best of the two guarantees is from (A. Cutkosky, 2019).

(Orabona, F. and Pal, D., 2016) proposed a different way to transform a coin-betting algorithm into an OCO algorithm that works in or even in Hilbert spaces. However, that approach seems to work only for the norm and it is not a black-box reduction. That said, the reduction in (Orabona, F. and Pal, D., 2016) seems to have a better empirical performance compared to the one in Theorem 2.

There are also reductions that allow to transform an unconstrained OCO learner into a constrained one (A. Cutkosky and F. Orabona, 2018). They work by constructing a Lipschitz barrier function on the domain and passing to the algorithm the original subgradients plus the subgradients of the barrier function.

**5. Exercises **


Exercise 1Prove that with and are exp-concave. Then, using the Online Newton Step Algorithm, give an algorithm and a regret bound for a game with these losses. Finally, show a wealth guarantee of the corresponding coin-betting strategy.


In the previous classes, we have shown that Online Mirror Descent (OMD) and Follow-The-Regularized-Leader (FTRL) achieve a regret of for convex Lipschitz losses. We have also shown that for bounded domains these bounds are optimal up to constant multiplicative factors. However, in the unbounded case the bounds we get are suboptimal w.r.t. the dependency on the competitor. In particular, let’s consider an example with Online Subgradient Descent with over -Lipschitz losses and learning rate . We get the following regret guarantee

So, in order to get the best possible guarantee, we should know and set . As we said, this strategy does not work for a couple of reasons: i) we don’t know ; ii) even if we guessed any value of , the adversary could easily change the losses to make that guess completely wrong.

Far from being a technicality, this is an important issue as shown in the next example.

Example 1Consider that we want to use OSD with online-to-batch conversion to minimize a function that is 1-Lipschitz. The convergence rate will be using a learning rate of . Consider the case that : specifying will result in a convergence rate 100 times slower than specifying the optimal choice in hindsight . Note that this is a real effect, not an artifact of the proof. Indeed, it is intuitive that the optimal learning rate should be proportional to the distance between the initial point the algorithm picks and the optimal solution.

If we could tune the learning rate in the optimal way, we would get a regret of

However, this is also impossible, because we proved a lower bound that says that the regret must be .

In the following, we will show that it is possible to reduce any Online Convex Optimization (OCO) game to betting on a non-stochastic coin. This will allow us to use a radically different way to design OCO algorithms that enjoy the optimal regret and do not require any parameter (e.g. learning rates, regularization weights) to be tuned. We call this kind of algorithm *parameter-free*.

**1. Coin-Betting Game **

Imagine the following repeated game:

- Set the initial Wealth to : .
- In each round
- You bet money on side of the coin equal to ; you cannot bet more money than what you currently have.
- The adversary reveals the outcome of the coin .
- You gain money , that is .

Given that we cannot borrow money, we can codify the bets as , with . So, is the fraction of money to bet and the side of the coin on which we bet.

The aim of the game is to make as much money as possible. As usual, given the adversarial nature of the game, we cannot hope to always win money. Instead, we try to gain as much money as the strategy that bets a fixed fraction of its money at each round for the entire game.

Note that

So, given the multiplicative nature of the wealth, it is also useful to take the logarithm of the ratio of the wealth of the algorithm and wealth of the optimal betting fraction. Hence, we want to minimize the following regret

In words, this is nothing else than the regret of an OCO game where the losses are and . We can also extend a bit the formulation allowing “continuous coins”, where rather than in .

Remark 1Note that the constraint to bet a fraction between and is not strictly necessary. We could allow the algorithm to bet more money than what it currently has, lending it some money in each round. However, the restriction makes the analysis easier because it allows the transformation above into an OCO problem, using the non-negativity of .

We could just use OMD or FTRL, taking special care of the non-Lipschitzness of the functions, but it turns out that there exists a better strategy specifically for this problem, called the **Krichevsky-Trofimov (KT) bettor**. It simply says that on each time step you bet . So, the algorithm is the following one.
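To make the strategy concrete, here is a minimal simulation of the coin-betting game with a KT bettor: at round t it bets the signed fraction given by the sum of the past outcomes divided by t, times the current wealth. The initial wealth of 1 and the coin sequence below are illustrative choices.

```python
def kt_wealth(coins):
    """Simulate the KT bettor on a sequence of coin outcomes in {-1, +1}."""
    wealth = 1.0
    sum_c = 0.0
    for t, c in enumerate(coins, start=1):
        beta = sum_c / t      # KT betting fraction, always in (-1, 1)
        x = beta * wealth     # signed amount of money bet this round
        wealth += c * x       # gain (or lose) c_t * x_t
        sum_c += c
    return wealth

# On a skewed coin (80% heads here), the wealth grows exponentially in T,
# staying close to the wealth of the optimal fixed betting fraction.
coins = [1] * 80 + [-1] * 20
print(kt_wealth(coins) > 1.0)
```

Note that the bettor never risks bankruptcy: the betting fraction has absolute value strictly less than 1, so the wealth stays positive.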

For it, we can prove the following theorem.

Theorem 1 (Cesa-Bianchi, N. and Lugosi, G. , 2006, Theorem 9.4)Let for . Then, the KT bettor in Algorithm 1 guarantees

where is a universal constant.

Note that if the outcomes of the coin are skewed towards one side, the optimal betting fraction will gain an exponential amount of money, as proved in the next Lemma.

*Proof:*

where we used the elementary inequality for .

Hence, KT guarantees an exponential amount of money, paying only a penalty. It is possible to prove that the guarantee above for the KT algorithm is optimal up to constant additive factors. Moreover, observe that the KT strategy does not require any parameter to be set: no learning rates, no regularizer. That is, KT is *parameter-free*.

Also, we can extend the guarantee of the KT algorithm to the case in which the coins are “continuous”, that is . We have the following Theorem.

Theorem 3 (Orabona, F. and Pal, D., 2016, Lemma 14)Let for . Then, the KT bettor in Algorithm 1 guarantees

where is a universal constant.

So, we have introduced the coin-betting game, extended it to continuous coins and presented a simple and optimal parameter-free strategy. In the next Section, we show *how to use the KT bettor as a parameter-free 1-d OCO algorithm!*

**2. Parameter-free 1d OCO through Coin-Betting **

So, Theorem 1 tells us that we can win almost as much money as a strategy betting the optimal fixed fraction of money at each step. We only pay a logarithmic price in the log wealth, which corresponds to a factor in the actual wealth.

Now, let’s see why this problem is interesting in OCO. It turns out that *solving the coin-betting game is equivalent to solving a 1-dimensional unconstrained online linear optimization problem*. That is, designing a coin-betting algorithm is equivalent to designing an online learning algorithm that produces a sequence of that minimizes the 1-dimensional regret with linear losses:

where the are adversarial and bounded. Without loss of generality, we will assume . Also, remembering that OCO games can be reduced to Online Linear Optimization (OLO) games, such a reduction would effectively reduce OCO to coin-betting! Moreover, through online-to-batch conversion, any stochastic 1-d problem could be reduced to a coin-betting game! The key theorem that allows the conversion between OLO and coin-betting is the following one.

Theorem 4Let be a proper closed convex function and let be its Fenchel conjugate. An algorithm that generates guarantees

where , if and only if it guarantees

*Proof:* Let’s prove the left to right implication.

For the other implication, we have

To make sense of the above theorem, assume that we are considering a 1-d problem and . Then, guaranteeing a lower bound to

can be done through a betting strategy that bets money on the coins . So, the theorem implies that *proving a reward lower bound for the wealth in a coin-betting game implies a regret upper bound for the corresponding 1-dimensional OLO game*. However, proving a reward lower bound is easier because it doesn’t depend on the competitor . Indeed, not knowing the norm of the competitor is exactly the reason why tuning the learning rates in OMD is hard!

This consideration immediately gives us the conversion between 1-d OLO and coin-betting: **the outcome of the coin is the negative of the subgradient of the losses on the current prediction.** Indeed, setting , we have that a coin-betting algorithm that bets would give us

So, a lower bound on the wealth corresponds to a lower bound that can be used in Theorem 3. To obtain a regret guarantee, we only need to calculate the Fenchel conjugate of the reward function, assuming it can be expressed as a function of .

The last step is to reduce 1-d OCO to 1-d OLO. But, this is an easy step that we have done many times. Indeed, we have

where .

So, to summarize, the Fenchel conjugate of the wealth lower bound for the coin-betting game becomes the regret guarantee for the OCO game. In the next section, we specialize all these considerations to the KT algorithm.

**3. KT as a 1d Online Convex Optimization Algorithm **

Here, we want to use the considerations in the above section to use KT as a parameter-free 1-d OCO algorithm. First, let’s see what such an algorithm looks like. KT bets , starting with money. Now, set where and assume the losses are -Lipschitz. So, we get

The pseudo-code is in Algorithm 3.
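As a minimal sketch of this reduction (Algorithm 3 in spirit): the prediction is the KT betting fraction times the current wealth, and the coin outcome is the negative subgradient. The initial wealth of 1 and the 1-Lipschitz target loss below are illustrative assumptions.

```python
def kt_oco(grad, T, eps=1.0):
    """KT used as a parameter-free 1-d OCO algorithm.

    grad(x) must return a subgradient in [-1, 1] of the current loss at x.
    """
    wealth, sum_c, xs = eps, 0.0, []
    for t in range(1, T + 1):
        x = (sum_c / t) * wealth   # KT prediction
        xs.append(x)
        g = grad(x)
        wealth += (-g) * x         # the coin outcome is c_t = -g_t
        sum_c += -g
    return xs

# 1-Lipschitz loss |x - 10|: KT races towards the minimum and then
# oscillates around it, without any learning rate to tune.
xs = kt_oco(lambda x: 1.0 if x > 10 else -1.0, T=1000)
xbar = sum(xs) / len(xs)
print(abs(xbar - 10.0) < 5.0)
```

By online-to-batch conversion, the average of the predictions is close to the minimizer, at a rate governed by the regret bound, even though no learning rate was ever specified.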

Let’s now see what kind of regret we get. From Theorem 3 and Lemma 2, we have that the KT bettor guarantees the following lower bound on the wealth when used with :

So, we found the function ; we just need , or an upper bound to it, which can be found with the following Lemma.

where is the Lambert function, i.e. the function defined to satisfy .

*Proof:* From the definition of Fenchel dual, we have

where . We now use the fact that satisfies , to have , where is the Lambert function. Using Lemma 5 in the Appendix, we obtain the stated bound.

So, the regret guarantee of KT used as a 1d OLO algorithm is upper bounded by

where the only assumption was that the first derivatives (or sub-derivatives) of are bounded in absolute value by 1. Also, it is important to note that any setting of in would not change the asymptotic rate.

To better appreciate this regret, compare this bound to the one of OMD with learning rate :

Hence, the coin-betting approach allows us to get almost the optimal bound, without having to guess the correct learning rate! The price that we pay for this parameter-freeness is the log factor, which is optimal given our lower bound.

It is interesting also to look at what the algorithm would do on an easy problem, where . In Figure 3, we show the different predictions that the KT algorithm and online subgradient descent (OSD) would make. Note how the convergence rate of OSD critically depends on the learning rate: too big a rate will not give convergence and too small a rate will slow the convergence down. On the other hand, KT will go *exponentially fast* towards the minimum and then it will automatically backtrack. This exponential growth effectively works like a line search procedure that allows us to get the optimal regret without tuning learning rates. Later in the iterations, KT will oscillate around the minimum, *automatically shrinking its steps, without any parameter to tune.* Of course, this is a simplified example. In a truly adversarial OCO game, the losses are different at each time step and the intuition behind the algorithm becomes more difficult. Yet, the optimality of the regret assures us that the KT strategy is the right one.

Next time, we will see that we can also reduce OCO in and learning with experts to coin-betting games.

**4. History Bits **

The keyword “parameter-free” has been introduced in (Chaudhuri, K. and Freund, Y. and Hsu, D. J., 2009) for a similar strategy for the learning with expert problem. It is now used as an umbrella term for all online algorithms that guarantee the optimal regret uniformly over the competitor class. The first algorithm for 1-d parameter-free OCO is from (M. Streeter and B. McMahan, 2012), but the bound was suboptimal. The algorithm was then extended to Hilbert spaces in (Orabona, F., 2013), still with a suboptimal bound. The optimal bound in Hilbert space was obtained in (McMahan, H. B. and Orabona, F., 2014). The idea of using a coin-betting to do parameter-free OCO was introduced in (Orabona, F. and Pal, D., 2016). The Krichevsky-Trofimov algorithm is from (Krichevsky, R. and Trofimov, V., 1981) and its extension to the “continuous coin” is from (Orabona, F. and Pal, D., 2016). The regret-reward duality relationship was proved for the first time in (McMahan, H. B. and Orabona, F., 2014). Lemma 5 is from (Orabona, F. and Pal, D., 2016).

**5. Exercises **

Exercise 1While the original proof of the KT regret bound is difficult, it is possible to obtain a looser bound using the be-the-leader method in FTRL. In particular, it is easy to show a regret of for the log wealth.

**6. Appendix **

The Lambert function is defined by the equality

The following lemma provides bounds on .

*Proof:* The inequalities are satisfied for , hence in the following we assume . We first prove the lower bound. From (1) we have

From this equality, using the elementary inequality for any , we get

Consider now the function defined in where is a positive number that will be decided in the following. This function has a maximum in , the derivative is positive in and negative in . Hence the minimum is in and in , where it is equal to . Using the property just proved on , setting , we have

For , setting , we have

Hence, we set such that

Numerically, , so

For the upper bound, we use Theorem 2.3 in (Hoorfar, A. and Hassani, M., 2008), that says that

Setting , we obtain the stated bound.
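Since the Lambert function only appears through the defining equality , its values are easy to check numerically. The Newton-iteration implementation below is an illustrative choice (not from the lecture), valid for non-negative arguments.

```python
import math

def lambert_w(x, iters=50):
    """Solve w * exp(w) = x for w >= 0 by Newton's method (x >= 0)."""
    w = math.log(1.0 + x)      # a reasonable starting point for x >= 0
    for _ in range(iters):
        ew = math.exp(w)
        # Newton step on f(w) = w * e^w - x, with f'(w) = e^w * (w + 1)
        w -= (w * ew - x) / (ew * (w + 1.0))
    return w

print(abs(lambert_w(math.e) - 1.0) < 1e-10)   # W(e) = 1
```

A quick sanity check of the defining equality, `w = lambert_w(10.0)`, gives `w * exp(w)` equal to 10 up to numerical precision.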


In this lecture, we will consider the problem of *online linear classification*. We consider the following setting:

- At each time step we receive a sample
- We output a prediction of the binary label of
- We receive the true label and we see if we did a mistake or not
- We update our online classifier

The aim of the online algorithm is to minimize the number of mistakes it makes compared to some best fixed classifier.

We will focus on linear classifiers, which predict with the sign of the inner product between a vector and the input features . Hence, . This problem can again be written as a regret minimization problem:

where . It should be clear that these losses are non-convex. Hence, we need an alternative way to deal with them. In the following, we will see two possible approaches to this problem.

**1. Online Randomized Classifier **

As we did for the Learning with Expert Advice framework, we might think of convexifying the losses using randomization. Hence, on each round we can predict a number in and output the label with probability and the label with probability . So, define the random variable

Now observe that . If we consider linear predictors, we can think of having and similarly for the competitor . Constraining both the algorithm and the competitor to the space of vectors where for , we can write

Hence, the surrogate convex loss becomes and the feasible set is any convex set where we have the property for .

Given that this problem is convex, assuming to be bounded w.r.t. some norm, we can use almost any of the algorithms we have seen till now, from Online Mirror Descent to Follow-The-Regularized-Leader (FTRL). All of them would result in regret upper bounds, assuming that are bounded in some norm. The only caveat is to restrict to . One way to do it might be to assume and choose the feasible set .

Putting all together, for example, we can have the following strategy using FTRL with regularizers .

Theorem 1Let be an arbitrary sequence of sample/label couples where and . Assume , for . Then, running the Randomized Online Linear Classifier algorithm with , where , for any we have the following guarantee

*Proof:* The proof is straightforward from the FTRL regret bound with the chosen increasing regularizer.

**2. The Perceptron Algorithm **

The above strategy has the shortcoming of restricting the feasible vectors to a possibly very small set. This could make the performance of the competitor low, and in turn the performance of the online algorithm is only guaranteed to be close to that of a weak competitor.

Another way to deal with the non-convexity is to compare the number of mistakes that the algorithm makes with a convex cumulative loss of the competitor. That is, we can try to prove a weaker regret guarantee:

In particular, the convex loss we consider is *powers* of the **Hinge Loss**: . The hinge loss is a convex upper bound to the 0/1 loss and it achieves the value of zero when the sign of the prediction is correct *and* the magnitude of the inner product is big enough. Moreover, taking powers of it, we get a family of functions that trade off the loss for the wrongly classified samples with the one for the correctly classified samples but with a value of , see Figure 1.

The oldest algorithm we have to minimize the modified regret in (1) is the **Perceptron** algorithm, in Algorithm 2.

The Perceptron algorithm updates the current prediction by moving in the direction of the current sample multiplied by its label. Let’s see why this is a good idea. Assume that and the algorithm made a mistake. Then, the updated prediction would predict a more positive number on the same sample . In fact, we have

In the same way, if and the algorithm made a mistake, the update would result in a more negative prediction on the same sample.
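The update can be sketched in a few lines. The toy data below is an illustrative linearly separable problem with margin bounded away from zero, so by the mistake bound the training loop is guaranteed to terminate.

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Run the Perceptron until a clean pass (or max_epochs); count mistakes."""
    w = np.zeros(X.shape[1])
    total_mistakes = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x_t, y_t in zip(X, y):
            if y_t * np.dot(w, x_t) <= 0:   # mistake (or zero margin)
                w += y_t * x_t              # move towards y_t * x_t
                mistakes += 1
        total_mistakes += mistakes
        if mistakes == 0:                   # clean pass: data fully separated
            break
    return w, total_mistakes

rng = np.random.default_rng(0)
u = np.array([1.0, -2.0]) / np.sqrt(5.0)    # unit normal of a true hyperplane
X = rng.normal(size=(300, 2))
X = X[np.abs(X @ u) > 0.3]                  # enforce a margin of at least 0.3
y = np.sign(X @ u)
w, m = perceptron(X, y)
print(bool(np.all(np.sign(X @ w) == y)))
```

Note that no learning rate appears anywhere: scaling the update would not change the signs of the predictions, hence not the mistakes.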

For the Perceptron algorithm, we can prove the following guarantee.

Theorem 2Let be an arbitrary sequence of sample/label couples where and . Assume , for . Then, running the Perceptron algorithm we have the following guarantee

Before proving the theorem, let’s take a look at its meaning. If there exists a such that , then the Perceptron algorithm makes a *finite* number of mistakes, upper bounded by . In case there are many that achieve , the number of mistakes is bounded by the norm of the smallest among them. What is the meaning of this quantity?

Remember that a hyperplane represented by its normal vector divides the space into two half spaces: one with the points that give a positive value for the inner product and the other one where the same inner product is negative. Now, the distance of a sample from the hyperplane whose normal is is

Also, given that we are considering a that gives cumulative hinge loss zero, we have that this quantity is at least . So, *the norm of the minimal that has cumulative hinge loss equal to zero is inversely proportional to the minimum distance between the points and the separating hyperplane*. This distance is called the **margin** of the samples . So, if the margin is small, the Perceptron algorithm can make more mistakes than when the margin is big.

If the problem is not linearly separable, the Perceptron algorithm satisfies a regret of , where is the loss of the competitor. Moreover, we measure the competitor with a *family of loss functions* and compete with the best competitor measured with the best loss. This adaptivity is achieved through two basic ingredients:

- *The Perceptron is independent of the scaling of the update by a hypothetical learning rate*, in the sense that the mistakes it makes are independent of the scaling. That is, we could update with and have the same mistakes and updates, because they only depend on the sign of . Hence, we can think of it as always using the best possible learning rate .
- The weakened definition of regret allows us to consider a family of loss functions, because *the Perceptron is not using any of them in the update.*

Let’s now prove the regret guarantee. For the proof, we will need the two following technical lemmas.

Lemma 3(F. Cucker and D. X. Zhou, 2007, Lemma 10.17) Let be such that . Then

*Proof:* Let , then we have . Solving for we have . Hence, .

*Proof:* Denote by the total number of mistakes of the Perceptron algorithm.

First, note that the Perceptron algorithm can be thought of as running Online Subgradient Descent (OSD) with a fixed stepsize over the losses over . Indeed, OSD over such losses would update

Now, as said above, does not affect in any way the sign of the predictions, hence the Perceptron algorithm could be run with (2) and its predictions would be exactly the same. Hence, we have

Given that this inequality holds for any , we can choose the one that minimizes the r.h.s., to have

Note that . Also, we have

So, denoting by , we can rewrite (3) as

where we used Hölder’s inequality and .

Given that and denoting by , we have

Let’s now consider two cases. For , we can use Lemma 4 and have the stated bound. Instead, for , using Lemma 3 we have

that implies

Using the fact that , we have

Finally, using Lemma 4, we have the stated bound.

**3. History Bits **

The Perceptron was proposed by Rosenblatt (F. Rosenblatt, 1958). The proof of convergence in the non-separable case for is by (C. Gentile, 2003) and for is from (Y. Freund and R. E. Schapire, 1999). The proof presented here is based on the one in (Beygelzimer, A. and Orabona, F. and Zhang, C., 2017).


In this lecture, we will explore the possibility of obtaining logarithmic regret for functions that are not strongly convex. Also, we explore a bit more the strategy of moving pieces of the losses inside the regularizer, as we did for composite and strongly convex losses.

**1. Online Newton Step **

Last time, we saw that the notion of strong convexity allows us to build quadratic surrogate loss functions, on which Follow-The-Regularized-Leader (FTRL) has smaller regret. Can we find a more general notion than strong convexity that allows us to get a small regret for a larger class of functions? We can start from strong convexity and try to generalize it. So, instead of asking that the function is strongly convex w.r.t. a norm everywhere, we might be happy requiring strong convexity to hold at a particular point, w.r.t. a norm that depends on the point itself.

In particular, we can require that for each loss and for all the following holds

where is defined as . Note that this is a weaker property than strong convexity because depends on . On the other hand, in the definition of strong convexity we want the last term to be the same norm (or Bregman divergence in the more general formulation) everywhere in the space.

The rationale of this new definition is that it still allows us to build surrogate loss functions, but without requiring strong convexity over the entire space. Hence, we can think of using FTRL on the surrogate losses

and the proximal regularizers , where . We will denote by .

Remark 1Note that is a norm because is Positive Definite (PD) and is -strongly convex w.r.t. defined as (because the Hessian is and ). Also, the dual norm of is .

From the above remark, we have that the regularizer is 1-strongly convex w.r.t . Hence, using the FTRL regret guarantee for proximal regularizers, we immediately get the following guarantee

So, reordering the terms we have

Note how the proof and the algorithm mirror what we did in FTRL with strongly convex losses in the last lecture.

Remark 2It is possible to generalize our Lemma on FTRL with proximal regularizers to hold with this generalized notion of strong convexity. This would allow us to get exactly the same bound running FTRL over the original losses with regularizer .

Let’s now see a practical instantiation of this idea. Consider the case that the sequence of loss functions we receive satisfies

In words, *we assume to have a class of functions that can be upper bounded by a quadratic that depends on the current subgradient*. In particular, these functions possess some curvature only in the direction of (any of) the subgradients. Denoting by , we can use the above idea with

Hence, the update rule would be

We obtain the following algorithm, called Online Newton Step (ONS).
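A minimal sketch of the ONS update follows, in the unconstrained case, i.e., skipping the projection step of the full algorithm. The matrix accumulates the outer products of the subgradients; the curvature constant `gamma`, the initialization `eps`, and the quadratic-like losses below are illustrative assumptions.

```python
import numpy as np

def ons(grad, d, T, gamma=1.0, eps=1.0):
    """Unconstrained Online Newton Step sketch.

    grad(x) returns a subgradient of the current loss at x.
    """
    x = np.zeros(d)
    A = eps * np.eye(d)          # eps * I keeps A_t invertible
    xs = []
    for _ in range(T):
        xs.append(x.copy())
        g = grad(x)
        A += np.outer(g, g)      # A_t = A_{t-1} + g_t g_t^T
        # x_{t+1} = x_t - (1/gamma) * A_t^{-1} g_t
        x = x - np.linalg.solve(A, g) / gamma
    return xs

# Curved losses f_t(x) = ||x - u||^2 with u = (1, -1): gradient 2 * (x - u).
xs = ons(lambda x: 2.0 * (x - np.array([1.0, -1.0])), d=2, T=300)
print(np.allclose(xs[-1], [1.0, -1.0], atol=0.1))
```

Note the per-step cost: maintaining and solving with the matrix is at least quadratic in the dimension, as remarked below (a rank-one inverse update via Sherman-Morrison would avoid the linear solve but not the quadratic storage).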

Denoting by and using (1), we have

To bound the last term, we will use the following Lemma.

Lemma 1 (Cesa-Bianchi, N. and Lugosi, G. , 2006, Lemma 11.11 and Theorem 11.7)Let be a sequence of vectors in and . Define . Then, the following holds

where are the eigenvalues of .

Putting it all together and assuming and that (2) holds for the losses, ONS satisfies the following regret

where in the second inequality we used the inequality of arithmetic and geometric means, , and the fact that .

Hence, if the losses satisfy (2), we can guarantee a logarithmic regret. However, differently from the strongly convex case, here the complexity of the update is at least quadratic in the number of dimensions. Moreover, the regret also depends linearly on the number of dimensions.

Remark 3Despite the name, the ONS algorithm should not be confused with the Newton algorithm. They are similar in spirit because they both construct quadratic approximations to the function, but the Newton algorithm uses the exact Hessian while ONS uses an approximation that works only for a restricted class of functions. In this view, the ONS algorithm is more similar to Quasi-Newton methods.

Let’s now see an example of functions that satisfy (2).

Example 1 (Exp-Concave Losses)Defining convex, we say that a function is -exp-concave if is concave. Choose such that for all and . Note that we need a bounded domain for to exist. Then, this class of functions satisfies property (2). In fact, given that is -exp-concave, it is also -exp-concave. Hence, from the definition we have

that is

that implies

where we used the elementary inequality , for .

Example 2Let . The logistic loss of a linear predictor , where is -exp-concave.

**2. Online Regression: Vovk-Azoury-Warmuth Forecaster **

Let’s now consider the specific case that and , that is, *unconstrained online linear regression with the square loss*. These losses are not strongly convex w.r.t. , but they are exp-concave when the domain is bounded. We could use the ONS algorithm, but it would not work in the unbounded case. Another possibility would be to run FTRL, but the losses are not strongly convex and we would only get a regret.

It turns out we can still get a logarithmic regret, if we make an additional assumption! We will assume to have access to before predicting . Note that this is a mild assumption in most of the interesting applications. Then, the algorithm will just be *FTRL over the past losses plus the loss on the received , hallucinating a label of *. This algorithm is called Vovk-Azoury-Warmuth, from the names of its inventors. The details are in Algorithm 2.

As we did for composite losses, we look closely at the loss functions, to see if there are terms that we might move inside the regularizer. The motivation is the same as in the composite losses case: the bound will depend only on the subgradients of the part of the losses that is outside of the regularizer.

So, observe that

From the above, we see that we could move the terms into the regularizer and leave the linear terms in the loss: . Hence, we will use

Note that the regularizer at time contains the that is revealed to the algorithm before it makes its prediction. For simplicity of notation, denote by .

Using such procedure, the prediction can be written in a closed form:

Hence, using the regret we proved for FTRL with strongly convex regularizers and , we get the following guarantee

Noting that and reordering the terms we have

Remark 4Note that, differently from the ONS algorithm, the regularizers here are not proximal. Yet, we get in the bound because the current sample is used in the regularizer.
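As a sketch of the resulting forecaster, assuming the regularizer is the quadratic above with an illustrative regularization weight `lam`: at each round the prediction is the regularized least-squares solution that already includes the current input, with its label hallucinated as 0.

```python
import numpy as np

def vaw(Z, Y, lam=1.0):
    """Vovk-Azoury-Warmuth sketch for online regression with the square loss."""
    d = Z.shape[1]
    A = lam * np.eye(d)      # lam * I plus the running sum of z_i z_i^T
    b = np.zeros(d)          # running sum of y_i * z_i
    preds = []
    for z, y in zip(Z, Y):
        A += np.outer(z, z)              # include the current z_t (label 0)
        x = np.linalg.solve(A, b)        # regularized least-squares solution
        preds.append(np.dot(x, z))       # predict <x_t, z_t>
        b += y * z                       # now the true label y_t is revealed
    return np.array(preds)

rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 3))
u = np.array([0.5, -1.0, 2.0])
Y = Z @ u                                # noiseless linear targets
preds = vaw(Z, Y)
print(float(np.mean((preds[-100:] - Y[-100:]) ** 2)) < 1e-2)
```

On this noiseless linear data the late predictions essentially match the targets, with no learning rate to tune and no bounded domain required.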

So, using again Lemma 1 and assuming , we have

where are the eigenvalues of .

If we assume that , we can reason as we did for the similar term in ONS, to have

Putting all together, we have the following theorem.

Theorem 2Assume and for . Then, using the prediction strategy in Algorithm 2, we have

Remark 5It is possible to show that the regret of the Vovk-Azoury-Warmuth forecaster is optimal up to multiplicative factors (Cesa-Bianchi, N. and Lugosi, G. , 2006, Theorem 11.9).

**3. History Bits **

The Online Newton Step algorithm was introduced in (Hazan, E. and Kalai, A. and Kale, S. and Agarwal, A., 2006) and it is described there for the particular case that the loss functions are exp-concave. Here, I described a slight generalization for any sequence of functions that satisfy (2), which in my view better shows the parallel between FTRL over strongly convex functions and ONS. Note that (Hazan, E. and Kalai, A. and Kale, S. and Agarwal, A., 2006) also describes a variant of ONS based on Online Mirror Descent, but I find its analysis less interesting from a didactical point of view. The proof presented here, through the properties of proximal regularizers, might be new, I am not sure.

The Vovk-Azoury-Warmuth algorithm was introduced independently by (K. S. Azoury and M. K. Warmuth, 2001) and (Vovk, V., 2001). The proof presented here is from (F. Orabona and K. Crammer and N. Cesa-Bianchi, 2015).

**4. Exercises **

Exercise 1Prove the statement in Example 2.


Exercise 2Prove that the losses , where , , and , are exp-concave and find the exp-concavity constant.


Last time, we saw that we can use Follow-The-Regularized-Leader (FTRL) on linearized losses:

Today, we will show a number of applications of FTRL with linearized losses, some easy ones and some more advanced ones.

As a reminder, the regret upper bound for FTRL with linearized losses that we proved last time, for the case that is -strongly convex w.r.t. for , is

We also said that we are free to choose , so we will often set it to .

Remark 1As we said last time, the algorithm is invariant to any positive constant added to the regularizer, hence we can always state the regret guarantee with instead of . However, for clarity, in the following we will instead explicitly choose the regularizers such that their minimum is 0.

**1. FTRL with Linearized Losses Can Be Equivalent to OMD **

First, we see that even if FTRL and OMD seem very different, in certain cases they are equivalent. For example, consider the case that . The output of OMD is

Assume that for all . This implies that , that is . Assuming , we have

On the other hand, consider FTRL with linearized losses with regularizers , then

Assuming that , this implies that . Further, assuming that is invertible, this implies that the predictions of FTRL and OMD are the same.

This equivalence immediately gives us some intuition on the role of in both algorithms: the same function induces the Bregman divergence, that is, our similarity measure, and is the regularizer in FTRL. Moreover, the inverse of the growth rate of the regularizers in FTRL takes the role of the learning rate in OMD.

Example 1Consider and ; then the conditions above are satisfied, and the predictions of OMD coincide with the ones of FTRL.

**2. Exponentiated Gradient with FTRL: No Need to know **

Let’s see an example of an instantiation of FTRL with linearized losses to have the FTRL version of Exponentiated Gradient (EG).

Let and the sequence of loss functions be convex and -Lipschitz w.r.t. the L-infinity norm. Let be defined as , where , and define . Set , which is -strongly convex w.r.t. the L1 norm, where is a parameter of the algorithm.

Given that the regularizers are strongly convex, we know that

We already saw that , which implies that . So, we have that

Note that this is exactly the same update of EG based on OMD, but here we are effectively using time-varying learning rates.
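As a sanity check, the resulting prediction can be computed as a softmax of the negated cumulative subgradients, scaled by the time-varying regularizer. The following is only a sketch: the scaling alpha * sqrt(t) and the function name are my choices for illustration, not exact quantities from the lecture.

```python
import numpy as np

def ftrl_eg_weights(grads, alpha=1.0):
    """FTRL version of Exponentiated Gradient (a sketch).

    `grads` is the list of subgradients g_1, ..., g_t. With an entropic
    regularizer scaled by lambda_t = alpha * sqrt(t), the prediction is
    a softmax of the negated cumulative subgradients.
    `alpha` is a hypothetical tuning parameter.
    """
    theta = np.sum(grads, axis=0)      # cumulative (linearized) losses
    t = len(grads)
    lam = alpha * np.sqrt(max(t, 1))   # time-varying regularizer scale
    z = -theta / lam
    z -= z.max()                       # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()                 # normalize to the simplex
```

Note how the effective learning rate 1/lam shrinks over time, which is exactly the "time-varying learning rates" remark above.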

We also get that the regret guarantee is

where we used the fact that using and are equivalent, and we chose . This regret guarantee is similar to the one we proved for OMD, but with an important difference: we don’t have to know the number of rounds in advance. In OMD, a similar bound would be vacuous, because it would depend on , which is infinite.

**3. Composite Losses **

Let’s now see a variant of the linearization of the losses: *partial linearization of composite losses*.

Suppose that the losses we receive are composed of two terms: one convex function changing over time, and another part that is fixed and known. These losses are called *composite*. For example, we might have . Using the linearization, we might just take the subgradient of . However, in this particular case, we would lose the ability of the L1 norm to produce sparse solutions.

There is a better way to deal with this kind of losses: move the constant part of the loss inside the regularization term. In this way, that part will not be linearized, but used exactly in the argmin of the update. Assuming that the argmin is still easily computable, you can always expect better performance from this approach. In particular, in the case of adding an L1 norm to the losses, you will be predicting at each step with the solution of an L1-regularized optimization problem.

Practically speaking, in the example above, we will define , where we assume to be 1-strongly convex and the losses to be -Lipschitz. Note that at time we use the term , because we anticipate the corresponding term arriving in the next round. Given that is -strongly convex, using (1), we have

where . Reordering the terms, we have

Example 2 Let’s also take a look at the update rule in the case that and we receive composite losses with the L1 norm. We have

We can solve this problem by observing that the minimization decomposes over each coordinate of . Denote by . Hence, from the first-order optimality condition, we know that is the solution for the coordinate iff there exists such that

Consider the 3 different cases:

- , then and .
- , then and .
- , then and .
So, overall we have

Observe that this update will produce sparse solutions, while just taking the subgradient of the L1 norm would never have produced sparse predictions.
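The three cases above are exactly the coordinate-wise soft-thresholding operator. A minimal sketch (the function name is mine; `lam` stands for the effective L1 weight appearing in the update):

```python
import numpy as np

def soft_threshold(v, lam):
    """Closed-form solution of min_x 0.5*||x - v||^2 + lam*||x||_1,
    applied coordinate-wise (the three cases in the text):
    shrink each coordinate toward zero by lam, clipping at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)
```

Any coordinate of `v` with magnitude below `lam` is set exactly to zero, which is where the sparsity comes from.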

Remark 2 (Proximal operators) In the example above, we calculated something like

This operation is known in the optimization literature as the

*Proximal Operator* of the L1 norm. In general, the proximal operator of a convex, proper, and closed function is defined as . Proximal operators are used in optimization in the same way as we used it here: they allow minimizing the entire function rather than a linear approximation of it. Also, proximal operators generalize the concept of Euclidean projection. Indeed, .
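For instance, the claim that the proximal operator generalizes the Euclidean projection can be checked directly: the prox of the indicator function of a convex set reduces to the projection onto that set. A sketch for the L2 ball (the function name and the choice of set are mine):

```python
import numpy as np

def prox_ball_indicator(v, radius=1.0):
    """Proximal operator of the indicator function of the L2 ball of a
    given radius. The indicator is 0 inside the set and +inf outside,
    so the argmin of indicator(x) + 0.5*||x - v||^2 reduces to the
    Euclidean projection onto the ball."""
    n = np.linalg.norm(v)
    return v if n <= radius else v * (radius / n)
```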

**4. FTRL with Strongly Convex Functions **

Let’s now go back to the FTRL regret bound and see if we can strengthen it in the case that the regularizer is *proximal*, that is, it satisfies .

Lemma 1 Denote by . Assume that is not empty and set . Also, assume that is -strongly convex w.r.t. and convex, and that the regularizer is such that . Finally, assume that is non-empty. Then, we have

*Proof:* We have

where in the second inequality we used Corollary 1 from last lecture, the fact that , and . Observing that, from the proximal property, we have that , . Hence, using the theorem on the subdifferential of a sum of functions, and remembering that , we can choose such that we have .

Remark 3 Note that a constant regularizer is proximal, because any point is the minimizer of the zero function. On the other hand, a constant regularizer makes the two lemmas the same, unless the loss functions contribute to the total strong convexity.

We will now use the above lemma to prove a logarithmic regret bound for strongly convex losses.

Corollary 2 Let be strongly convex w.r.t. , for . Set the sequence of regularizers to zero. Then, FTRL guarantees a regret of

The above regret guarantee is the same as the one of OMD over strongly convex losses, but here we don’t need to know the strong convexity of the losses. In fact, we just need to output the minimizer over the past losses. However, as we noticed last time, this might be undesirable, because now each update requires solving an optimization problem.

Hence, we can again use the idea of replacing the losses with an easy *surrogate*. In the Lipschitz case, it made sense to use linear losses. However, here we can do better and use *quadratic* losses, because the losses are strongly convex. So, we can run FTRL on the quadratic losses , where . The algorithm would be the following one:

To see why this is a good idea, consider the case that the losses are strongly convex w.r.t. the L2 norm. The update now becomes:

Moreover, we will get exactly the same regret bound as in Corollary 2, with the only difference that here the guarantee holds for a specific choice of the rather than for any subgradient in .
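In the L2 case, the argmin over the quadratic surrogates has a closed form, obtained by setting the gradient of the sum to zero. The following is only a sketch under assumptions of mine: an unconstrained domain, and names chosen for illustration.

```python
import numpy as np

def ftrl_quadratic_surrogate(xs, grads, mus):
    """Minimizer of sum_s [ <g_s, x> + (mu_s/2)*||x - x_s||^2 ] over R^d
    (L2 strong convexity, no constraints): setting the gradient
    sum_s [ g_s + mu_s*(x - x_s) ] to zero gives a closed form."""
    mus = np.asarray(mus, dtype=float)
    num = sum(m * x for m, x in zip(mus, xs)) - np.sum(grads, axis=0)
    return num / mus.sum()
```

So each round costs a running average, matching the claim that the surrogate makes the update as cheap as a gradient step.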

Example 3 Going back to the example in the first lecture, where and are strongly convex, we now see immediately that FTRL without a regularizer, that is, Follow the Leader, gives logarithmic regret. Note that in this case the losses were defined only over , so the minimization is carried out over .

**5. History Bits **

The first analysis of FTRL with composite losses is in (L. Xiao, 2010). The analysis presented here using the negative terms to easily prove regret bounds for FTRL for composite losses is from (F. Orabona and K. Crammer and N. Cesa-Bianchi, 2015).

The first proof of FTRL for strongly convex losses was in (S. Shalev-Shwartz and Y. Singer, 2007) (even if they don’t call it FTRL).

There is an interesting bit about FTRL-Proximal (McMahan, H. B., 2011): FTRL-Proximal is an instantiation of FTRL that uses a particular proximal regularizer. It became very famous in internet companies when Google disclosed in a very influential paper that they were using FTRL-Proximal to train the classifier for click prediction (McMahan, H. B. and Holt, G. and Sculley, D. and Young, M. and Ebner, D. and Grady, J. and Nie, L. and Phillips, T. and Davydov, E. and Golovin, D. and Chikkerur, S. and Liu, D. and Wattenberg, M. and Hrafnkelsson, A. M. and Boulos, T. and Kubica, J., 2013). This generated even more confusion, because many people started conflating the term FTRL-Proximal (a specific algorithm) with FTRL (an entire family of algorithms). Unfortunately, this confusion is still going on to this day.

**6. Exercises **


Exercise 1 Prove that the update in (2) is equivalent to the one of OSD with and learning rate .


Till now, we focused only on Online Subgradient Descent and its generalization, Online Mirror Descent (OMD), with a brief ad-hoc analysis of Follow-The-Leader (FTL) in the first lecture. In this class, we will extend FTL to a powerful and generic algorithm for online convex optimization: **Follow-the-Regularized-Leader** (FTRL).

FTRL is a very intuitive algorithm: at each time step it plays the minimizer of the sum of the past losses *plus* a time-varying regularization. We will see that the regularization is needed to make the algorithm “more stable” with linear losses, and to avoid the jumping back and forth that we saw in Lecture 2 for Follow-the-Leader.

**1. Follow-the-Regularized-Leader **

As said above, in FTRL we output the minimizer of the regularized cumulative past losses. It should be clear that FTRL is not an algorithm, but rather a family of algorithms, in the same way as OMD is a family of algorithms.

Before analyzing the algorithm, let’s get some intuition on it. In OMD, we saw that the “state” of the algorithm is stored in the current iterate , in the sense that the next iterate depends on and the loss received at time (the choice of the learning rate has only a small influence on the next iterate). Instead, in FTRL the next iterate depends on the entire history of losses received up to time . This has an immediate consequence: in the case that is bounded, OMD will only “remember” the last , and not the iterate before the projection. On the other hand, FTRL keeps in memory the entire history of the past, which in principle allows recovering the iterates before the projection in .

This difference in behavior might make the reader think that FTRL is more expensive in computation and memory. And indeed it is! But we will also see that there is a way to use approximate losses that makes the algorithm as cheap as OMD, while retaining strictly more information than OMD.

For FTRL, we prove a surprising result: an equality for the regret! The proof is in the Appendix.

Lemma 1 Denote by . Assume that is not empty and set . Then, for any , we have

Remark 1 Note that we basically didn’t assume anything on nor on : the above equality holds even for non-convex losses and regularizers. Yet, solving the minimization problem at each step might be computationally infeasible.

Remark 2 Note that the left hand side of the equality in the theorem does not depend on , so, if needed, we can set it to .

Remark 3 Note that the algorithm is invariant to any positive constant added to the regularizers, hence we can always state the regret guarantee with instead of .

However, while surprising, the above equality is not yet a regret bound: it is somewhat “implicit”, because the losses appear on both sides of the equality.

Let’s take a closer look at the equality. If , the sum of the last two terms on the r.h.s. is negative. On the other hand, the first two terms on the r.h.s. are similar to what we got in OMD. The interesting part is the sum of the terms . To give an intuition of what is going on, let’s consider the case in which the regularizer is constant over time, i.e., . Hence, the terms in the sum can be rewritten as

Hence, we are measuring the distance between the minimizers of the regularized losses (with two different regularizers) in two consecutive predictions of the algorithm. Roughly speaking, this term will be small if and the losses plus regularization are “nice”. This should remind you of the OMD update, where we *constrain* to be close to . Instead, here the two predictions will be close to each other if the minimizer of the regularized losses up to time is close to the minimizer of the losses up to time . So, as in OMD, the regularizer here plays the critical role of *stabilizing* the predictions, when the losses don’t possess enough curvature.
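The stabilizing effect can also be seen numerically, by comparing FTL and FTRL on alternating linear losses over [-1, 1], the kind of sequence that made FTL jump between the endpoints in Lecture 2. This is only an illustrative sketch: the quadratic regularizer sqrt(t)/2 * x^2 and the specific loss sequence are my choices, not from the lecture.

```python
import numpy as np

def ftl_iterates(grads):
    """Follow-the-Leader on [-1, 1] with linear losses g_t * x:
    play a minimizer of the cumulative loss, i.e. an endpoint
    (the opposite sign of the cumulative gradient)."""
    csum = np.cumsum(grads)
    return -np.sign(csum)

def ftrl_iterates(grads):
    """FTRL on [-1, 1] with the quadratic regularizer (sqrt(t)/2) * x^2
    (a hypothetical choice for illustration): the unconstrained
    minimizer is -csum/sqrt(t), clipped to the domain."""
    csum = np.cumsum(grads)
    t = np.arange(1, len(grads) + 1)
    return np.clip(-csum / np.sqrt(t), -1.0, 1.0)
```

On the loss sequence 0.5, -1, 1, -1, 1, ... FTL keeps jumping between -1 and +1, while the FTRL iterates shrink in magnitude and move less and less, exactly the behavior the sum of distance terms above is measuring.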

To quantify this intuition, we need a property of strongly convex functions.

**2. Convex Analysis Bits: Properties of Strongly Convex Functions **

We will use the following lemma for strongly convex functions.

Lemma 2 Let be -strongly convex with respect to a norm . Then, for all , , and , we have

*Proof:* Define . Observe that , hence is the minimizer of . Also, note that . Hence, we can write

where the last step comes from the conjugate function of the squared norm (See Example 3 in the lecture on OLO lower bounds).

Corollary 3 Let be -strongly convex with respect to a norm . Let . Then, for all , and , we have

In words, the above lemma says that an upper bound to the suboptimality gap is proportional to the squared norm of the subgradient.
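For reference, the standard form of this result, written in my own symbols (a reconstruction, to be checked against the lecture's statement), is:

```latex
% For f \mu-strongly convex w.r.t. \|\cdot\|, any x, y, and
% g \in \partial f(x), strong convexity gives
f(x) - f(y) \le \langle g, x - y\rangle - \frac{\mu}{2}\|x - y\|^2
\le \frac{\|g\|_\star^2}{2\mu},
% where the second inequality maximizes \langle g, v\rangle -
% \frac{\mu}{2}\|v\|^2 over v = x - y, using the conjugate of the
% squared norm. Taking y to be a minimizer of f bounds the
% suboptimality gap by the squared dual norm of the subgradient.
```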

**3. An Explicit Regret Bound using Strongly Convex Regularizers **

We now state a lemma quantifying the intuition on the “stability” of the predictions.

Lemma 4 With the notation and assumptions of Lemma 1, assume that is proper and -strongly convex w.r.t. , and that is proper and convex. Also, assume that is non-empty. Then, we have

for all .

*Proof:* We have

where in the second inequality we used Lemma 2, the fact that , and . Observing that , we have . Hence, using the theorem of the subdifferential of sum of functions, we can choose such that we have .

Let’s see some immediate applications of FTRL.

Corollary 5 Let be a sequence of convex loss functions. Let be a -strongly convex function w.r.t. . Set the sequence of regularizers as , where . Then, FTRL guarantees

for all . Moreover, if the functions are -Lipschitz, setting we get

*Proof:* The corollary is immediate from Lemma 1, Lemma 4, and the observation that from the assumptions we have . We also set , thanks to Remark 2.

This might look like the same regret guarantee of OMD; however, there is a very important difference here: the last term contains a time-varying element (), but the domain does not have to be bounded! Also, I used the regularizer and not to remind you of another important difference: in OMD the learning rate is chosen after receiving the subgradient, while here you have to choose it before receiving it!

Another important difference is that here the update rule seems much more expensive than in OMD, because we need to solve an optimization problem at each step. However, it turns out that we can use FTRL on *linearized losses* and obtain the same bound with the same computational complexity as OMD.

**4. FTRL with Linearized Losses **

If we consider the case in which the losses are linear, we have that the prediction of FTRL is

Now, if we assume to be proper, convex, and closed, using Theorem 4 in the lecture on OLO lower bounds, we have that . Moreover, if is strongly convex, we know that is differentiable, and we get

In turn, this update can be written in the following way

This corresponds to Figure 1.

Compare it to the mirror update of OMD, rewritten in a similar way:

They are very similar, but with important differences:

- In OMD, the state is kept in , so we need to transform it into a dual variable before making the update, and then back to the primal variable.
- In FTRL with linear losses, the state is kept directly in the dual space, updated, and then transformed into the primal variable. The primal variable is only used to predict, but not directly in the update.
- In OMD, the subgradients are weighted by the learning rates, which are typically decreasing.
- In FTRL with linear losses, all the subgradients have the same weight, but the regularizer is typically increasing over time.
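The points above can be sketched as a generic loop in which the state is the dual vector. The interface is hypothetical: `mirror_fn(theta, t)` stands for the gradient of the conjugate of the time-varying regularizer, mapping the dual state back to a primal prediction.

```python
import numpy as np

def ftrl_linear(grad_fn, mirror_fn, x0, T):
    """FTRL with linearized losses (a sketch): the state is the dual
    vector theta, the negated sum of subgradients, each with unit
    weight; the primal iterate is only used to query the next
    subgradient, matching the bullet points in the text."""
    theta = np.zeros_like(x0)
    x = x0
    for t in range(1, T + 1):
        g = grad_fn(x)             # subgradient at the current prediction
        theta = theta - g          # dual update: unit weight on every g_t
        x = mirror_fn(theta, t)    # map back to the primal space
    return x
```

For example, with `mirror_fn = lambda th, t: th / np.sqrt(t)` this corresponds to the (assumed) regularizer sqrt(t)/2 * ||x||^2 over the whole space, so the regularizer grows over time while every subgradient keeps the same weight.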

Also, we will not lose anything in the bound! Indeed, we can run FTRL on the linearized losses , where , guaranteeing exactly the same regret on the losses . The algorithm for such a procedure is in Algorithm 2.

In fact, using the definition of the subgradients and the assumptions of Corollary 5, we have

The only difference with respect to Corollary 5 is that here the are the specific ones we use in the algorithm, while in Corollary 5 the statement holds for any choice of the .

In the next example, we can see the different behavior of FTRL and OMD.

Example 1 Consider . With Online Subgradient Descent (OSD) with learning rate and , the update is

On the other hand in FTRL with linearized losses, we can use and it is easy to verify that the update in (1) becomes

While the regret guarantee would be the same for these two updates, from an intuitive point of view OMD seems to be losing a lot of potential information due to the projection and the fact that we only memorize the projected iterate.

Next time, we will see how to obtain logarithmic regret bounds for strongly convex losses for FTRL and more applications.

**5. History Bits **

Follow the Regularized Leader was introduced in (Abernethy, J. D. and Hazan, E. and Rakhlin, A., 2008), where at each step the prediction is computed as the minimizer of a regularization term plus the sum of losses on all past rounds. However, the key ideas of FTRL, and in particular its analysis through the dual, were planted by Shai Shalev-Shwartz and Yoram Singer well before (Shalev-Shwartz, S. and Singer, Y., 2006) (Shalev-Shwartz, S. and Singer, Y., 2007). Later, the PhD thesis of Shai Shalev-Shwartz (S. Shalev-Shwartz, 2007) contained the most precise dual analysis of FTRL, but he called it “online mirror descent”, because the name FTRL was only invented later! Even later, I contributed to the confusion by naming a general analysis of FTRL with time-varying regularizers and linear losses “generalized online mirror descent” (F. Orabona and K. Crammer and N. Cesa-Bianchi, 2015). So, now I am trying to set the record straight.

Later still, the optimization community rediscovered FTRL with linear losses and called it Dual Averaging (Nesterov, Y., 2009), even if Nesterov had used similar ideas already in 2005 (Nesterov, Y., 2005). It is interesting to note that Nesterov introduced the Dual Averaging algorithm to fix the fact that in OMD gradients enter the algorithm with decreasing weights, contradicting the common-sense understanding of how optimization should work. The same ideas were then translated to online learning and stochastic optimization in (L. Xiao, 2010), essentially rediscovering the framework of Shalev-Shwartz and rebranding it Regularized Dual Averaging (RDA). Finally, (McMahan, H B., 2017) gives the elegant equality result that I presented here (with minor improvements), which holds for general loss functions and regularizers. Note that the dual interpretation of FTRL comes out naturally for linear losses, but Lemma 1 underlines the fact that the algorithm is actually more general.

Another source of confusion stems from the fact that some people differentiate between a “lazy” and a “greedy” version of OMD. In reality, as proved in (McMahan, H B., 2017), the lazy algorithm is just FTRL with linearized losses and the greedy one is just OMD. The name “lazy online mirror descent” was introduced in (Zinkevich, M., 2004), where he basically introduced FTRL with linearized losses for the first time.

**6. Exercises **

Exercise 1 Prove that the update of FTRL with linearized losses in Example 1 is correct.

Exercise 2 Find a way to have bounds for smooth losses with linearized FTRL: do you need an additional assumption compared to what we did for OSD?

**7. Appendix **

*Proof of Lemma 1:* Define and for . Hence, we have that . Now, consider

We also have

Hence, putting these two inequalities together, we get

Observing that

and that , we get the equality
