This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019.
You can find all the lectures I published here.
In this lecture, we will explore the possibility to obtain logarithmic regret for non-strongly convex functions. Also, we explore a bit more the strategy of moving pieces of the losses inside the regularizer, as we did for the composite and strongly convex losses.
1. Online Newton Step
Last time, we saw that the notion of strong convexity allows us to build quadratic surrogate loss functions, on which Follow-The-Regularized-Leader (FTRL) has smaller regret. Can we find a more general notion of strong convexity that allows to get a small regret for a larger class of functions? We can start from strong convexity and try to generalize it. So, instead of asking that the function is strongly convex w.r.t. a norm, we might be happy requiring strong convexity holds in a particular point w.r.t. a norm that depends on the points itself.
In particular, we can require that for each loss and for all the following holds
where is defined as . Note that this is a weaker property than strong convexity because depends on . On the other hand, in the definition of strong convexity we want the last term to be the same norm (or Bregman divergence in the more general formulation) everywhere in the space.
The rationale of this new definition is that it still allows us to build surrogate loss functions, but without requiring to have strong convexity over the entire space. Hence, we can think to use FTRL on the surrogate losses
and the proximal regularizers , where . We will denote by .
Remark 1 Note that is a norm because is Positive Definite (PD) and is -strongly convex w.r.t. defined as (because the Hessian is and ). Also, the dual norm of is .
From the above remark, we have that the regularizer is 1-strongly convex w.r.t . Hence, using the FTRL regret guarantee for proximal regularizers, we immediately get the following guarantee
Note how the proof and the algorithm mirror what we did in FTRL with strongly convex losses in the last lecture.
Remark 2 It is possible to generalize our Lemma of FTRL for proximal regularizers to hold in this generalized notion of strong convexity. This would allow to get exactly the same bound running FTRL over the original losses with regularizer .
In words, we assume to have a class of functions that can be upper bounded by a quadratic that depends on the current subgradient. In particular, these functions posses some curvature only in the direction of (any of) the subgradients. Denoting by , we can use the above idea using
Hence, the update rule would be
We obtain the following algorithm, called Online Newton Step (ONS).
Denoting by and using (1), we have
To bound the last term, we will use the following Lemma.
Lemma 1 (Cesa-Bianchi, N. and Lugosi, G. , 2006, Lemma 11.11 and Theorem 11.7) Let a sequence of vectors in and . Define . Then, the following holds
where are the eigenvalues of .
Putting all together and assuming and (2) holds for the losses, then ONS satisfies the following regret
where in the second inequality we used the inequality of arithmetic and geometric means, , and the fact that .
Hence, if the losses satisfy (2), we can guarantee a logarithmic regret. However, differently from the strongly convex case, here the complexity of the update is at least quadratic in the number of dimensions. Moreover, the regret also depends linearly on the number of dimensions.
Remark 3 Despite the name, the ONS algorithm should not be confused with the Netwon algorithm. They are similar in spirit because they both construct quadratic approximation to the function, but the Netwon algorithm uses the exact Hessian while the ONS uses an approximation that works only for a restricted class of functions. In this view, the ONS algorithm is more similar to Quasi-Newton methods.
Let’s now see an example of functions that satisfy (2).
Example 1 (Exp-Concave Losses) Defining convex, we say that a function is -exp-concave if is concave.
Choose such that for all and . Note that we need a bounded domain for to exist. Then, this class of functions satisfy the property (2). In fact, given that is -exp-concave then it is also -exp-concave. Hence, from the definition we have
where we used the elementary inequality , for .
2. Online Regression: Vovk-Azoury-Warmuth Forecaster
Let’s now consider the specific case that and , that is the one of unconstrained online linear regression with square loss. These losses are not strongly convex w.r.t. , but they are exp-concave when the domain is bounded. We could use the ONS algorithm, but it would not work in the unbounded case. Another possibility would be to run FTRL, but that losses are not strongly convex and we would get only a regret.
It turns out we can still get a logarithmic regret, if we make an additional assumption! We will assume to have access to before predicting . Note that this is a mild assumptions in most of the interesting applications. Then, the algorithm will just be FTRL over the past losses plus the loss on the received hallucinating a label of . This algorithm is called Vovk-Azoury-Warmuth from the name of the inventors. The details are in Algorithm 2.
As we did for composite losses, we look closely to the loss functions, to see if there are terms that we might move inside the regularizer. The motivation would be the same as in the composite losses case: the bound will depends only on the subgradients of the part of the losses that are outside of the regularizer.
So, observe that
From the above, we see that we could think to move the terms in the regularizer and leaving the linear terms in the loss: . Hence, we will use
Note that the regularizer at time contains the that is revealed to the algorithm before it makes its prediction. For simplicity of notation, denote by .
Using such procedure, the prediction can be written in a closed form:
Hence, using the regret we proved for FTRL with strongly convex regularizers and , we get the following guarantee
Noting that and reordering the terms, we have
Remark 4 Note that, differently from the ONS algorithm, the regularizers here are not proximal. Yet, we get in the bound because the current sample is used in the regularizer. Without the knowledge of the last term would be that cannot be controlled without additional assumptions on .
So, using again Lemma 1 and assuming , we have
where are the eigenvalues of .
If we assume that , we can reason as we did for the similar term in ONS, to have
Putting all together, we have the following theorem.
Theorem 2 Assume and for . Then, using the prediction strategy in Algorithm 2, we have
Remark 5 It is possible to show that the regret of the Vovk-Azoury-Warmuth forecaster is optimal up to multiplicative factors (Cesa-Bianchi, N. and Lugosi, G. , 2006, Theorem 11.9).
3. History Bits
The Online Newton Step algorithm was introduced in (Hazan, E. and Kalai, A. and Kale, S. and Agarwal, A., 2006) and it is described to the particular case that the loss functions are exp-concave. Here, I described a slight generalization for any sequence of functions that satisfy (2), that in my view it better shows the parallel between FTRL over strongly convex functions and ONS. Note that (Hazan, E. and Kalai, A. and Kale, S. and Agarwal, A., 2006) also describes a variant of ONS based on Online Mirror Descent, but I find its analysis less interesting from a didactical point of view. The proof presented here through the properties of proximal regularizers might be new, I am not sure.
The Vovk-Azoury-Warmuth algorithm was introduced independently by (K. S. Azoury and M. K. Warmuth, 2001) and (Vovk, V., 2001). The proof presented here is from (F. Orabona and K. Crammer and N. Cesa-Bianchi, 2015).
Exercise 1 Prove the statement in Example 2.
Exercise 2 Prove that the losses , where , , and , are exp-concave and find the exp-concavity constant.