*This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week.*

* You can find the lectures I published till now here.*

In this lecture, we will explore the possibility to obtain logarithmic regret for non-strongly convex functions. Also, we explore a bit more the strategy of moving pieces of the losses inside the regularizer, as we did for the composite and strongly convex losses.

**1. Online Newton Step **

Last time, we saw that the notion of strong convexity allows us to build quadratic surrogate loss functions, on which Follow-The-Regularized-Leader (FTRL) has smaller regret. Can we find a more general notion of strong convexity that allows to get a small regret for a larger class of functions? We can start from strong convexity and try to generalize it. So, instead of asking that the function is strongly convex w.r.t. a norm, we might be happy requiring strong convexity holds in a particular point w.r.t. a norm that depends on the points itself.

In particular, we can require that for each loss and for all the following holds

where is defined as . Note that this is a weaker property than strong convexity because depends on . On the other hand, in the definition of strong convexity we want the last term to be the same norm (or Bregman divergence in the more general formulation) everywhere in the space.

The rationale of this new definition is that it still allows us to build surrogate loss functions, but without requiring to have strong convexity over the entire space. Hence, we can think to use FTRL on the surrogate losses

and the proximal regularizers , where . We will denote by .

Remark 1Note that is a norm because is Positive Definite (PD) and is -strongly convex w.r.t. defined as (because the Hessian is and ). Also, the dual norm of is .

From the above remark, we have that the regularizer is 1-strongly convex w.r.t . Hence, using the FTRL regret guarantee for proximal regularizers, we immediately get the following guarantee

So, reordering the terms we have

Note how the proof and the algorithm mirror what we did in FTRL with strongly convex losses in the last lecture.

Remark 2It is possible to generalize our Lemma of FTRL for proximal regularizers to hold in this generalized notion of strong convexity. This would allow to get exactly the same bound running FTRL over the original losses with regularizer .

Let’s now see a practical instantiation of this idea. Consider the case that the sequence of loss functions we receive satisfy

In words, *we assume to have a class of functions that can be upper bounded by a quadratic that depends on the current subgradient*. In particular, these functions posses some curvature only in the direction of (any of) the subgradients. Denoting by , we can use the above idea using

Hence, the update rule would be

We obtain the following algorithm, called Online Newton Step (ONS).

Denoting by and using (1), we have

To bound the last term, we will use the following Lemma.

Lemma 1 (Cesa-Bianchi, N. and Lugosi, G. , 2006, Lemma 11.11 and Theorem 11.7)Let a sequence of vectors in and . Define . Then, the following holds

where are the eigenvalues of .

Putting all together and assuming and (2) holds for the losses, then ONS satisfies the following regret

where in the second inequality we used the inequality of arithmetic and geometric means, , and the fact that .

Hence, if the losses satisfy (2), we can guarantee a logarithmic regret. However, differently from the strongly convex case, here the complexity of the update is at least quadratic in the number of dimensions. Moreover, the regret also depends linearly on the number of dimensions.

Remark 3Despite the name, the ONS algorithm should not be confused with the Netwon algorithm. They are similar in spirit because they both construct quadratic approximation to the function, but the Netwon algorithm uses the exact Hessian while the ONS uses an approximation that works only for a restricted class of functions. In this view, the ONS algorithm is more similar to Quasi-Newton methods.

Let’s now see an example of functions that satisfy (2).

Example 1 (Exp-Concave Losses)Defining convex, we say that a function is-exp-concaveif is concave.Choose such that for all and . Note that we need a bounded domain for to exist. Then, this class of functions satisfy the property (2). In fact, given that is -exp-concave then it is also -exp-concave. Hence, from the definition we have

that is

that implies

where we used the elementary inequality , for .

Example 2Let . The logistic loss of a linear predictor , where is -exp-concave.

**2. Online Regression: Vovk-Azoury-Warmuth Forecaster **

Let’s now consider the specific case that and , that is the one of *unconstrained online linear regression with square loss*. These losses are not strongly convex w.r.t. , but they are exp-concave when the domain is bounded. We could use the ONS algorithm, but it would not work in the unbounded case. Another possibility would be to run FTRL, but that losses are not strongly convex and we would get only a regret.

It turns out we can still get a logarithmic regret, if we make an additional assumption! We will assume to have access to before predicting . Note that this is a mild assumptions in most of the interesting applications. Then, the algorithm will just be *FTRL over the past losses plus the loss on the received hallucinating a label of *. This algorithm is called Vovk-Azoury-Warmuth from the name of the inventors. The details are in Algorithm 2.

As we did for composite losses, we look closely to the loss functions, to see if there are terms that we might move inside the regularizer. The motivation would be the same as in the composite losses case: the bound will depends only on the subgradients of the part of the losses that are outside of the regularizer.

So, observe that

From the above, we see that we could think to move the terms in the regularizer and leaving the linear terms in the loss: . Hence, we will use

Note that the regularizer at time contains the that is revealed to the algorithm before it makes its prediction. For simplicity of notation, denote by .

Using such procedure, the prediction can be written in a closed form:

Hence, using the regret we proved for FTRL with strongly convex regularizers and , we get the following guarantee

Noting that and reordering the terms we have

Remark 4Note that, differently from the ONS algorithm, the regularizers here are not proximal. Yet, we get in the bound because the current sample is used in the regularizer.

So, using again Lemma 1 and assuming , we have

where are the eigenvalues of .

If we assume that , we can reason as we did for the similar term in ONS, to have

Putting all together, we have the following theorem.

Theorem 2Assume and for . Then, using the prediction strategy in Algorithm 2, we have

Remark 5It is possible to show that the regret of the Vovk-Azoury-Warmuth forecaster is optimal up to multiplicative factors (Cesa-Bianchi, N. and Lugosi, G. , 2006, Theorem 11.9).

**3. History Bits **

The Online Newton Step algorithm was introduced in (Hazan, E. and Kalai, A. and Kale, S. and Agarwal, A., 2006) and it is described to the particular case that the loss functions are exp-concave. Here, I described a slight generalization for any sequence of functions that satisfy (2), that in my view it better shows the parallel between FTRL over strongly convex functions and ONS. Note that (Hazan, E. and Kalai, A. and Kale, S. and Agarwal, A., 2006) also describes a variant of ONS based on Online Mirror Descent, but I find its analysis less interesting from a didactical point of view. The proof presented here through the properties of proximal regularizers might be new, I am not sure.

The Vovk-Azoury-Warmuth algorithm was introduced independently by (K. S. Azoury and M. K. Warmuth, 2001) and (Vovk, V., 2001). The proof presented here is from (F. Orabona and K. Crammer and N. Cesa-Bianchi, 2015).

**4. Exercises **

Exercise 1Prove the statement in Example 2.

Exercise 2Prove that the losses , where , , and , are exp-concave and find the exp-concavity constant.

Hi,

There are some typos in the proof after remark 4.

1) y^2_t is missing

2) sum^T_{t=1} \lambda_i should be changed to sum^d_{i=1} \lambda_i

LikeLiked by 1 person

Fixed, thanks!

LikeLike

Nice post! I think I found some typos:

– In algo 1, after the for loop it’s written we receive $z_t$, which is not mentioned anywhere in the remainder.

– In the Vovk-Azoury-Warmuth section, after you say: “Using such procedure, the prediction can be written in a closed form:” then the $z_t$ in the sum after the argmin should be $z_i$

– Also, in the last section when there is the regret bound for $\tilde{\ell}$ at some point there is a matrix A_T, which was not defined but I assume it’s $ \sum_{t=1}^T z_t z_t^\top, right? But then later, “Noting that {{\boldsymbol x}_1=\boldsymbol{0}} and reordering the terms we have” in the rhs of the first equation shouldn’t we have for the last term $ 1 / 2 u^\top A_T u $?

I also have some questions:

– Regarding the sum of the eigenvalues, how do get the bound TL^2? I get a slightly different result. My reasoning:

$$ \sum_i^d \lambda_i \leq d \lambda_{max} = d \| A_T \|_2 $$

where $ \| A \|_2 $ is the spectral norm of A_T, which in this case corresponds to the largest eigenvalue of A_T, since it is a symmetric matrix. Then,

\begin{align*}

d \| A_T \|_2 &= d \| \sum_{t=1}^T g_t g_t^\top \|_2 \\

& \leq d \sum_{t=1}^T \| g_t g_t^\top \|_2 \\

& \leq d \sum_{t=1}^T \lambda_{max}(g_t g_t^\top) \\

& = d \sum_{t=1}^T \| g_t \|_2^2 \\

& \leq d \sum_{t=1}^T L^2 \\

& = d T L^2

\end{align*}

Maybe I went wrong in some step?

– At the end of the section on ONS, you say “the complexity of the update is at least quadratic in the number of dimensions”. Why is that? I know that from the version of ONS given by Hazan you have to explicitly compute the inverse of the matrix A_t at each time-step, which normally would take O(d^3), but since we have a rank-one update in every step we can reduce it to O(d^2). However, in your version of the algorithm we are not computing any inverse, so where does the quadratic cost come from?

– In the Vovk-Azoury-Warmuth, it is not really clear to me why the assumption on having z_t revealed before the prediction helps us, what would change if we don’t have it?

Thank you,

Nicolò

LikeLike