This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019.
You can find all the lectures I published here.
In this lecture, we will explore the possibility of obtaining logarithmic regret for non-strongly convex functions. Also, we explore a bit more the strategy of moving pieces of the losses inside the regularizer, as we did for the composite and strongly convex losses.
1. Online Newton Step
Last time, we saw that the notion of strong convexity allows us to build quadratic surrogate loss functions, on which Follow-The-Regularized-Leader (FTRL) has smaller regret. Can we find a more general notion of strong convexity that allows us to get a small regret for a larger class of functions? We can start from strong convexity and try to generalize it. So, instead of asking that the function is strongly convex w.r.t. a norm, we might be happy requiring that strong convexity holds at a particular point w.r.t. a norm that depends on the point itself.
In particular, we can require that for each loss $\ell_t$ and for all $\boldsymbol{x} \in V$ the following holds

$$\ell_t(\boldsymbol{x}) \geq \ell_t(\boldsymbol{x}_t) + \langle \boldsymbol{g}_t, \boldsymbol{x}-\boldsymbol{x}_t\rangle + \frac{1}{2}\|\boldsymbol{x}-\boldsymbol{x}_t\|_{A_t}^2,$$

where $\boldsymbol{g}_t \in \partial \ell_t(\boldsymbol{x}_t)$, $A_t \in \mathbb{R}^{d \times d}$ is a positive definite matrix that can depend on $\boldsymbol{x}_t$, and $\|\boldsymbol{z}\|_{A_t}$ is defined as $\sqrt{\boldsymbol{z}^\top A_t \boldsymbol{z}}$. Note that this is a weaker property than strong convexity because $\|\cdot\|_{A_t}$ depends on $\boldsymbol{x}_t$. On the other hand, in the definition of strong convexity we want the last term to be the same norm (or Bregman divergence in the more general formulation) everywhere in the space.
The rationale of this new definition is that it still allows us to build surrogate loss functions, but without requiring strong convexity over the entire space. Hence, we can think to use FTRL on the surrogate losses $\tilde{\ell}_t(\boldsymbol{x}) = \langle \boldsymbol{g}_t, \boldsymbol{x}\rangle$ and the proximal regularizers $\psi_t(\boldsymbol{x}) = \frac{\lambda}{2}\|\boldsymbol{x}\|_2^2 + \frac{1}{2}\sum_{i=1}^{t-1}\|\boldsymbol{x}-\boldsymbol{x}_i\|_{A_i}^2$, where $\lambda > 0$. We will denote by $S_t = \lambda I_d + \sum_{i=1}^{t} A_i$.
Remark 1 Note that $\|\boldsymbol{x}\|_A := \sqrt{\boldsymbol{x}^\top A \boldsymbol{x}}$ is a norm because $A$ is Positive Definite (PD) and $\frac{1}{2}\|\boldsymbol{x}\|_A^2$ is $1$-strongly convex w.r.t. $\|\cdot\|_A$ defined as $\|\boldsymbol{x}\|_A = \sqrt{\boldsymbol{x}^\top A \boldsymbol{x}}$ (because the Hessian is $A$ and $\boldsymbol{x}^\top A \boldsymbol{x} = \|\boldsymbol{x}\|_A^2$). Also, the dual norm of $\|\cdot\|_A$ is $\|\cdot\|_{A^{-1}}$.
From the above remark, we have that the regularizer $\psi_{t+1}$ is 1-strongly convex w.r.t. $\|\cdot\|_{S_t}$. Hence, using the FTRL regret guarantee for proximal regularizers, we immediately get the following guarantee

$$\sum_{t=1}^T \langle \boldsymbol{g}_t, \boldsymbol{x}_t - \boldsymbol{u}\rangle \leq \frac{\lambda}{2}\|\boldsymbol{u}\|_2^2 + \frac{1}{2}\sum_{t=1}^T \|\boldsymbol{u}-\boldsymbol{x}_t\|_{A_t}^2 + \frac{1}{2}\sum_{t=1}^T \|\boldsymbol{g}_t\|_{S_t^{-1}}^2.$$

So, reordering the terms we have

$$\sum_{t=1}^T \left(\ell_t(\boldsymbol{x}_t) - \ell_t(\boldsymbol{u})\right) \leq \sum_{t=1}^T \left(\langle \boldsymbol{g}_t, \boldsymbol{x}_t - \boldsymbol{u}\rangle - \frac{1}{2}\|\boldsymbol{u}-\boldsymbol{x}_t\|_{A_t}^2\right) \leq \frac{\lambda}{2}\|\boldsymbol{u}\|_2^2 + \frac{1}{2}\sum_{t=1}^T \|\boldsymbol{g}_t\|_{S_t^{-1}}^2. \qquad (1)$$
Note how the proof and the algorithm mirror what we did in FTRL with strongly convex losses in the last lecture.
Remark 2 It is possible to generalize our Lemma of FTRL for proximal regularizers to hold in this generalized notion of strong convexity. This would allow us to get exactly the same bound running FTRL over the original losses with the regularizer $\frac{\lambda}{2}\|\boldsymbol{x}\|_2^2$.
Let’s now see a practical instantiation of this idea. Consider the case that the sequence of loss functions we receive satisfies

$$\ell_t(\boldsymbol{x}) \geq \ell_t(\boldsymbol{x}_t) + \langle \boldsymbol{g}_t, \boldsymbol{x}-\boldsymbol{x}_t\rangle + \frac{\mu}{2}\langle \boldsymbol{g}_t, \boldsymbol{x}-\boldsymbol{x}_t\rangle^2, \quad \forall \boldsymbol{x}\in V, \ \boldsymbol{g}_t \in \partial\ell_t(\boldsymbol{x}_t). \qquad (2)$$

In words, we assume to have a class of functions that can be lower bounded by a quadratic that depends on the current subgradient. In particular, these functions possess some curvature only in the direction of (any of) the subgradients. Denoting by $\boldsymbol{g}_t$ a subgradient in $\partial\ell_t(\boldsymbol{x}_t)$, we can use the above idea using

$$A_t = \mu\, \boldsymbol{g}_t\boldsymbol{g}_t^\top.$$

Hence, the update rule would be

$$\boldsymbol{x}_{t+1} = \mathop{\mathrm{argmin}}_{\boldsymbol{x}\in V}\ \frac{\lambda}{2}\|\boldsymbol{x}\|_2^2 + \frac{\mu}{2}\sum_{i=1}^{t}\langle \boldsymbol{g}_i, \boldsymbol{x}-\boldsymbol{x}_i\rangle^2 + \sum_{i=1}^{t}\langle \boldsymbol{g}_i, \boldsymbol{x}\rangle.$$
We obtain the following algorithm, called Online Newton Step (ONS).
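To make the update above concrete, here is a minimal Python sketch of the FTRL form of the ONS update in the unconstrained case $V = \mathbb{R}^d$ (with a generic constraint set one would also need a projection in the norm induced by $S_t$). The function names and the square-loss demo at the end are mine, just to make the sketch runnable; they are not part of the original notes.

```python
import numpy as np

def ons_unconstrained(subgradient, d, T, mu=1.0, lam=1.0):
    """Sketch of Online Newton Step in its FTRL form, unconstrained case.

    At each round we minimize
        lam/2 ||x||^2 + mu/2 * sum_i <g_i, x - x_i>^2 + sum_i <g_i, x>,
    whose unconstrained minimizer solves the linear system
        (lam I + mu * sum_i g_i g_i^T) x = mu * sum_i <g_i, x_i> g_i - sum_i g_i.
    """
    x = np.zeros(d)
    S = lam * np.eye(d)          # S_t = lam I_d + mu * sum_i g_i g_i^T
    b = np.zeros(d)              # right-hand side of the linear system above
    iterates = [x.copy()]
    for t in range(T):
        g = subgradient(t, x)    # g_t in the subdifferential of ell_t at x_t
        S += mu * np.outer(g, g)
        b += mu * np.dot(g, x) * g - g
        x = np.linalg.solve(S, b)   # O(d^3) solve; see the rank-one update below
        iterates.append(x.copy())
    return iterates

# toy usage on square losses, only to make the sketch runnable end-to-end
rng = np.random.default_rng(0)
zs, w = rng.normal(size=(100, 3)), np.array([1.0, -1.0, 0.5])
ys = zs @ w

def grad(t, x):
    # gradient of 1/2 (<z_t, x> - y_t)^2 at x
    return (zs[t] @ x - ys[t]) * zs[t]

print(ons_unconstrained(grad, d=3, T=100)[-1])  # prints the last iterate
```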
With this choice of $A_t$, we have $S_t = \lambda I_d + \mu\sum_{i=1}^{t}\boldsymbol{g}_i\boldsymbol{g}_i^\top$. Hence, using (1), we have

$$\sum_{t=1}^T \left(\ell_t(\boldsymbol{x}_t) - \ell_t(\boldsymbol{u})\right) \leq \frac{\lambda}{2}\|\boldsymbol{u}\|_2^2 + \frac{1}{2}\sum_{t=1}^T \boldsymbol{g}_t^\top\left(\lambda I_d + \mu\sum_{i=1}^{t}\boldsymbol{g}_i\boldsymbol{g}_i^\top\right)^{-1}\boldsymbol{g}_t.$$
To bound the last term, we will use the following Lemma.
Lemma 1 (Cesa-Bianchi, N. and Lugosi, G. , 2006, Lemma 11.11 and Theorem 11.7) Let $\boldsymbol{z}_1, \dots, \boldsymbol{z}_T$ be a sequence of vectors in $\mathbb{R}^d$ and $\lambda > 0$. Define $H_t = \lambda I_d + \sum_{i=1}^{t}\boldsymbol{z}_i\boldsymbol{z}_i^\top$. Then, the following holds

$$\sum_{t=1}^T \boldsymbol{z}_t^\top H_t^{-1}\boldsymbol{z}_t \leq \sum_{i=1}^d \ln\left(1 + \frac{\lambda_i}{\lambda}\right),$$

where $\lambda_1, \dots, \lambda_d$ are the eigenvalues of $\sum_{t=1}^T \boldsymbol{z}_t\boldsymbol{z}_t^\top$.
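As a quick numerical sanity check of the Lemma (this is not a proof, just a snippet of mine comparing the two sides on random data):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, lam = 5, 200, 0.5
Z = rng.normal(size=(T, d))

H = lam * np.eye(d)
lhs = 0.0
for z in Z:
    H += np.outer(z, z)               # H_t = lam I_d + sum_{i<=t} z_i z_i^T
    lhs += z @ np.linalg.solve(H, z)  # z_t^T H_t^{-1} z_t

eigs = np.linalg.eigvalsh(Z.T @ Z)    # eigenvalues of sum_t z_t z_t^T
rhs = np.sum(np.log1p(eigs / lam))
assert lhs <= rhs + 1e-9
print(lhs, rhs)
```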
Putting all together and assuming $\|\boldsymbol{g}_t\|_2 \leq L$, for $t=1,\dots,T$, and that (2) holds for the losses, then ONS satisfies the following regret

$$\sum_{t=1}^T \left(\ell_t(\boldsymbol{x}_t) - \ell_t(\boldsymbol{u})\right) \leq \frac{\lambda}{2}\|\boldsymbol{u}\|_2^2 + \frac{1}{2\mu}\sum_{i=1}^d \ln\left(1 + \frac{\mu\lambda_i}{\lambda}\right) \leq \frac{\lambda}{2}\|\boldsymbol{u}\|_2^2 + \frac{d}{2\mu}\ln\left(1 + \frac{\mu T L^2}{\lambda d}\right),$$

where $\lambda_1,\dots,\lambda_d$ are the eigenvalues of $\sum_{t=1}^T \boldsymbol{g}_t\boldsymbol{g}_t^\top$, in the first inequality we used Lemma 1 with $\boldsymbol{z}_t = \sqrt{\mu}\,\boldsymbol{g}_t$, and in the second inequality we used the inequality of arithmetic and geometric means, $\left(\prod_{i=1}^d x_i\right)^{1/d} \leq \frac{1}{d}\sum_{i=1}^d x_i$, and the fact that $\sum_{i=1}^d \lambda_i = \sum_{t=1}^T \|\boldsymbol{g}_t\|_2^2 \leq T L^2$.
Hence, if the losses satisfy (2), we can guarantee a logarithmic regret. However, differently from the strongly convex case, here the complexity of the update is at least quadratic in the number of dimensions. Moreover, the regret also depends linearly on the number of dimensions.
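In practice, the quadratic per-round cost is obtained by maintaining $S_t^{-1}$ with a rank-one (Sherman-Morrison) update rather than re-solving a linear system from scratch at every round. The helper below is an implementation note of mine, written under the same notation as the sketch above, not part of the original analysis.

```python
import numpy as np

def sherman_morrison_update(S_inv, u):
    """Given S^{-1} (S symmetric PD), return (S + u u^T)^{-1} in O(d^2) time."""
    Su = S_inv @ u
    return S_inv - np.outer(Su, Su) / (1.0 + u @ Su)

# In the ONS sketch above, keep S_inv = S^{-1} and replace np.linalg.solve(S, b) with:
#     S_inv = sherman_morrison_update(S_inv, np.sqrt(mu) * g)
#     x = S_inv @ b
```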
Remark 3 Despite the name, the ONS algorithm should not be confused with the Newton algorithm. They are similar in spirit because they both construct a quadratic approximation to the function, but the Newton algorithm uses the exact Hessian while ONS uses an approximation that works only for a restricted class of functions. In this view, the ONS algorithm is more similar to Quasi-Newton methods.
Let’s now see an example of functions that satisfy (2).
Example 1 (Exp-Concave Losses) Defining $X \subseteq \mathbb{R}^d$ convex, we say that a function $f: X \rightarrow \mathbb{R}$ is $\alpha$-exp-concave if $\exp(-\alpha f(\boldsymbol{x}))$ is concave. Choose $\beta \leq \frac{\alpha}{2}$ such that $|\beta\langle \boldsymbol{g}, \boldsymbol{y}-\boldsymbol{x}\rangle| \leq \frac{1}{2}$ for all $\boldsymbol{x}, \boldsymbol{y} \in X$ and $\boldsymbol{g} \in \partial f(\boldsymbol{x})$. Note that we need a bounded domain for $\beta$ to exist. Then, this class of functions satisfies property (2) with $\mu = \beta$. In fact, given that $f$ is $\alpha$-exp-concave then it is also $2\beta$-exp-concave. Hence, from the definition we have

$$\exp(-2\beta f(\boldsymbol{y})) \leq \exp(-2\beta f(\boldsymbol{x}))\left(1 - 2\beta\langle \boldsymbol{g}, \boldsymbol{y}-\boldsymbol{x}\rangle\right),$$

that is

$$f(\boldsymbol{y}) \geq f(\boldsymbol{x}) - \frac{1}{2\beta}\ln\left(1 - 2\beta\langle \boldsymbol{g}, \boldsymbol{y}-\boldsymbol{x}\rangle\right),$$

that implies

$$f(\boldsymbol{y}) \geq f(\boldsymbol{x}) + \langle \boldsymbol{g}, \boldsymbol{y}-\boldsymbol{x}\rangle + \frac{\beta}{2}\langle \boldsymbol{g}, \boldsymbol{y}-\boldsymbol{x}\rangle^2,$$

where we used the elementary inequality $\ln(1+x) \leq x - \frac{x^2}{4}$, for $|x| \leq 1$.
Example 2 Let
. The logistic loss of a linear predictor
, where
is
-exp-concave.
2. Online Regression: Vovk-Azoury-Warmuth Forecaster
Let’s now consider the specific case that $\ell_t(\boldsymbol{x}) = \frac{1}{2}\left(\langle \boldsymbol{z}_t, \boldsymbol{x}\rangle - y_t\right)^2$ and $V = \mathbb{R}^d$, that is the one of unconstrained online linear regression with the square loss. These losses are not strongly convex w.r.t. $\|\cdot\|_2$, but they are exp-concave when the domain is bounded. We could use the ONS algorithm, but it would not work in the unbounded case. Another possibility would be to run FTRL, but these losses are not strongly convex and we would get only a $O(\sqrt{T})$ regret.
It turns out we can still get a logarithmic regret, if we make an additional assumption! We will assume to have access to $\boldsymbol{z}_t$ before predicting $\boldsymbol{x}_t$. Note that this is a mild assumption in most of the interesting applications. Then, the algorithm will just be FTRL over the past losses plus the loss on the received $\boldsymbol{z}_t$, hallucinating a label of $0$. This algorithm is called Vovk-Azoury-Warmuth, from the names of its inventors. The details are in Algorithm 2.
As we did for composite losses, we look closely at the loss functions, to see if there are terms that we might move inside the regularizer. The motivation would be the same as in the composite losses case: the bound will depend only on the subgradients of the part of the losses that are outside of the regularizer.
So, observe that

$$\frac{1}{2}\left(\langle \boldsymbol{z}_t, \boldsymbol{x}\rangle - y_t\right)^2 = \frac{1}{2}\langle \boldsymbol{z}_t, \boldsymbol{x}\rangle^2 - y_t\langle \boldsymbol{z}_t, \boldsymbol{x}\rangle + \frac{1}{2}y_t^2 = \frac{1}{2}\boldsymbol{x}^\top\boldsymbol{z}_t\boldsymbol{z}_t^\top\boldsymbol{x} - y_t\langle \boldsymbol{z}_t, \boldsymbol{x}\rangle + \frac{1}{2}y_t^2.$$

From the above, we see that we could think to move the terms $\frac{1}{2}\boldsymbol{x}^\top\boldsymbol{z}_t\boldsymbol{z}_t^\top\boldsymbol{x}$ to the regularizer, leaving the linear terms in the losses: $\tilde{\ell}_t(\boldsymbol{x}) = -y_t\langle \boldsymbol{z}_t, \boldsymbol{x}\rangle$ (the terms $\frac{1}{2}y_t^2$ do not depend on $\boldsymbol{x}$, so they do not affect the regret). Hence, we will use

$$\psi_t(\boldsymbol{x}) = \frac{1}{2}\boldsymbol{x}^\top\left(\lambda I_d + \sum_{i=1}^{t}\boldsymbol{z}_i\boldsymbol{z}_i^\top\right)\boldsymbol{x}.$$

Note that the regularizer at time $t$ contains the $\boldsymbol{z}_t$ that is revealed to the algorithm before it makes its prediction. For simplicity of notation, denote by $A_t = \sum_{i=1}^{t}\boldsymbol{z}_i\boldsymbol{z}_i^\top$, so that $\psi_t(\boldsymbol{x}) = \frac{1}{2}\boldsymbol{x}^\top(\lambda I_d + A_t)\boldsymbol{x}$.
Using such a procedure, the prediction can be written in a closed form:

$$\boldsymbol{x}_t = \mathop{\mathrm{argmin}}_{\boldsymbol{x}\in\mathbb{R}^d}\ \left(\frac{1}{2}\boldsymbol{x}^\top\left(\lambda I_d + A_t\right)\boldsymbol{x} - \sum_{i=1}^{t-1} y_i\langle \boldsymbol{z}_i, \boldsymbol{x}\rangle\right) = \left(\lambda I_d + A_t\right)^{-1}\sum_{i=1}^{t-1} y_i\boldsymbol{z}_i.$$
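In code, each prediction is a single linear solve. The following is a minimal Python sketch of mine of the Vovk-Azoury-Warmuth forecaster as described above; the synthetic data at the end is only for illustration.

```python
import numpy as np

def vaw_forecaster(zs, ys, lam=1.0):
    """Vovk-Azoury-Warmuth sketch: at round t we see z_t, predict <z_t, x_t>,
    then receive y_t, with x_t = (lam I + sum_{i<=t} z_i z_i^T)^{-1} sum_{i<t} y_i z_i."""
    T, d = zs.shape
    A = lam * np.eye(d)        # lam I_d + sum_{i<=t} z_i z_i^T
    b = np.zeros(d)            # sum_{i<t} y_i z_i
    predictions = []
    for t in range(T):
        z = zs[t]
        A += np.outer(z, z)    # z_t is available before predicting
        x_t = np.linalg.solve(A, b)
        predictions.append(z @ x_t)
        b += ys[t] * z         # the label y_t is revealed after the prediction
    return np.array(predictions)

# toy usage on synthetic data
rng = np.random.default_rng(1)
zs = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
ys = zs @ w_true + 0.1 * rng.normal(size=50)
print(vaw_forecaster(zs, ys)[:5])
```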
Hence, using the regret we proved for FTRL with strongly convex regularizers and the linear losses $\tilde{\ell}_t$ (with the convention $\psi_{T+1} = \psi_T$, given that $\boldsymbol{x}_{T+1}$ is never used), we get the following guarantee

$$\sum_{t=1}^T \tilde{\ell}_t(\boldsymbol{x}_t) - \sum_{t=1}^T \tilde{\ell}_t(\boldsymbol{u}) \leq \psi_T(\boldsymbol{u}) - \psi_1(\boldsymbol{x}_1) + \frac{1}{2}\sum_{t=1}^T y_t^2\, \boldsymbol{z}_t^\top\left(\lambda I_d + A_t\right)^{-1}\boldsymbol{z}_t - \frac{1}{2}\sum_{t=2}^{T} \langle \boldsymbol{z}_t, \boldsymbol{x}_t\rangle^2,$$

where the negative terms come from the differences $\psi_t(\boldsymbol{x}_{t+1}) - \psi_{t+1}(\boldsymbol{x}_{t+1}) = -\frac{1}{2}\langle \boldsymbol{z}_{t+1}, \boldsymbol{x}_{t+1}\rangle^2$.

Noting that $\boldsymbol{x}_1 = \boldsymbol{0}$ and reordering the terms, we have

$$\sum_{t=1}^T \frac{1}{2}\left(\langle \boldsymbol{z}_t, \boldsymbol{x}_t\rangle - y_t\right)^2 - \sum_{t=1}^T \frac{1}{2}\left(\langle \boldsymbol{z}_t, \boldsymbol{u}\rangle - y_t\right)^2 = \sum_{t=1}^T\left(\tilde{\ell}_t(\boldsymbol{x}_t) - \tilde{\ell}_t(\boldsymbol{u})\right) + \frac{1}{2}\sum_{t=1}^T \langle \boldsymbol{z}_t, \boldsymbol{x}_t\rangle^2 - \frac{1}{2}\boldsymbol{u}^\top A_T\boldsymbol{u} \leq \frac{\lambda}{2}\|\boldsymbol{u}\|_2^2 + \frac{1}{2}\sum_{t=1}^T y_t^2\, \boldsymbol{z}_t^\top\left(\lambda I_d + A_t\right)^{-1}\boldsymbol{z}_t.$$
Remark 4 Note that, differently from the ONS algorithm, the regularizers here are not proximal. Yet, we get $\boldsymbol{z}_t^\top\left(\lambda I_d + A_t\right)^{-1}\boldsymbol{z}_t$ in the bound because the current sample $\boldsymbol{z}_t$ is used in the regularizer. Without the knowledge of $\boldsymbol{z}_t$ the last term would be $\boldsymbol{z}_t^\top\left(\lambda I_d + A_{t-1}\right)^{-1}\boldsymbol{z}_t$, that cannot be controlled without additional assumptions on $\boldsymbol{z}_t$.
So, using again Lemma 1 and assuming $|y_t| \leq Y$, for $t=1,\dots,T$, we have

$$\frac{1}{2}\sum_{t=1}^T y_t^2\,\boldsymbol{z}_t^\top\left(\lambda I_d + A_t\right)^{-1}\boldsymbol{z}_t \leq \frac{Y^2}{2}\sum_{t=1}^T \boldsymbol{z}_t^\top\left(\lambda I_d + A_t\right)^{-1}\boldsymbol{z}_t \leq \frac{Y^2}{2}\sum_{i=1}^d \ln\left(1 + \frac{\lambda_i}{\lambda}\right),$$

where $\lambda_1,\dots,\lambda_d$ are the eigenvalues of $A_T = \sum_{t=1}^T \boldsymbol{z}_t\boldsymbol{z}_t^\top$.
If we assume that $\|\boldsymbol{z}_t\|_2 \leq R$, for $t=1,\dots,T$, we can reason as we did for the similar term in ONS, to have

$$\sum_{i=1}^d \ln\left(1 + \frac{\lambda_i}{\lambda}\right) \leq d\ln\left(1 + \frac{\sum_{i=1}^d \lambda_i}{\lambda d}\right) \leq d\ln\left(1 + \frac{R^2 T}{\lambda d}\right).$$
Putting all together, we have the following theorem.
Theorem 2 Assume $\|\boldsymbol{z}_t\|_2 \leq R$ and $|y_t| \leq Y$ for $t=1,\dots,T$. Then, using the prediction strategy in Algorithm 2, we have

$$\sum_{t=1}^T \frac{1}{2}\left(\langle \boldsymbol{z}_t, \boldsymbol{x}_t\rangle - y_t\right)^2 - \sum_{t=1}^T \frac{1}{2}\left(\langle \boldsymbol{z}_t, \boldsymbol{u}\rangle - y_t\right)^2 \leq \frac{\lambda}{2}\|\boldsymbol{u}\|_2^2 + \frac{d\,Y^2}{2}\ln\left(1 + \frac{R^2 T}{\lambda d}\right), \qquad \forall \boldsymbol{u}\in\mathbb{R}^d.$$
Remark 5 It is possible to show that the regret of the Vovk-Azoury-Warmuth forecaster is optimal up to multiplicative factors (Cesa-Bianchi, N. and Lugosi, G. , 2006, Theorem 11.9).
3. History Bits
The Online Newton Step algorithm was introduced in (Hazan, E. and Kalai, A. and Kale, S. and Agarwal, A., 2006) and it is described there for the particular case in which the loss functions are exp-concave. Here, I described a slight generalization for any sequence of functions that satisfy (2), which in my view better shows the parallel between FTRL over strongly convex functions and ONS. Note that (Hazan, E. and Kalai, A. and Kale, S. and Agarwal, A., 2006) also describes a variant of ONS based on Online Mirror Descent, but I find its analysis less interesting from a didactical point of view. The proof presented here through the properties of proximal regularizers might be new, I am not sure.
The Vovk-Azoury-Warmuth algorithm was introduced independently by (K. S. Azoury and M. K. Warmuth, 2001) and (Vovk, V., 2001). The proof presented here is from (F. Orabona and K. Crammer and N. Cesa-Bianchi, 2015).
4. Exercises
Exercise 1 Prove the statement in Example 2.
Exercise 2 Prove that the losses $\ell_t(\boldsymbol{x}) = \left(\langle \boldsymbol{z}_t, \boldsymbol{x}\rangle - y_t\right)^2$, where $\|\boldsymbol{z}_t\|_2 \leq R$, $|y_t| \leq Y$, and $\boldsymbol{x} \in V = \{\boldsymbol{x}\in\mathbb{R}^d : \|\boldsymbol{x}\|_2 \leq U\}$, are exp-concave and find the exp-concavity constant.
Hi,
There are some typos in the proof after remark 4.
1) y^2_t is missing
2) sum^T_{t=1} \lambda_i should be changed to sum^d_{i=1} \lambda_i
Fixed, thanks!
Nice post! I think I found some typos:
– In algo 1, after the for loop it’s written we receive $z_t$, which is not mentioned anywhere in the remainder.
– In the Vovk-Azoury-Warmuth section, after you say: “Using such procedure, the prediction can be written in a closed form:” then the $z_t$ in the sum after the argmin should be $z_i$
– Also, in the last section when there is the regret bound for $\tilde{\ell}$ at some point there is a matrix A_T, which was not defined but I assume it’s $ \sum_{t=1}^T z_t z_t^\top, right? But then later, “Noting that {{\boldsymbol x}_1=\boldsymbol{0}} and reordering the terms we have” in the rhs of the first equation shouldn’t we have for the last term $ 1 / 2 u^\top A_T u $?
I also have some questions:
– Regarding the sum of the eigenvalues, how do you get the bound TL^2? I get a slightly different result. My reasoning:
$$ \sum_i^d \lambda_i \leq d \lambda_{max} = d \| A_T \|_2 $$
where $ \| A \|_2 $ is the spectral norm of A_T, which in this case corresponds to the largest eigenvalue of A_T, since it is a symmetric matrix. Then,
\begin{align*}
d \| A_T \|_2 &= d \| \sum_{t=1}^T g_t g_t^\top \|_2 \\
& \leq d \sum_{t=1}^T \| g_t g_t^\top \|_2 \\
& \leq d \sum_{t=1}^T \lambda_{max}(g_t g_t^\top) \\
& = d \sum_{t=1}^T \| g_t \|_2^2 \\
& \leq d \sum_{t=1}^T L^2 \\
& = d T L^2
\end{align*}
Maybe I went wrong in some step?
– At the end of the section on ONS, you say “the complexity of the update is at least quadratic in the number of dimensions”. Why is that? I know that from the version of ONS given by Hazan you have to explicitly compute the inverse of the matrix A_t at each time-step, which normally would take O(d^3), but since we have a rank-one update in every step we can reduce it to O(d^2). However, in your version of the algorithm we are not computing any inverse, so where does the quadratic cost come from?
– In the Vovk-Azoury-Warmuth, it is not really clear to me why the assumption on having z_t revealed before the prediction helps us, what would change if we don’t have it?
Thank you,
Nicolò
Fixed all the typos, thanks!
For the TL^2 bound, the sum of the eigenvalues is the trace of the matrix that is equal to the sum of the traces of rank-1 matrices with eigenvalue bounded by L^2.
The quadratic cost comes from the matrix-vector multiplication. Note that the FTRL version of ONS is in the paper of Hazan et al. too, this is not “my” version.
Also, I expanded Remark 4 to explain the issue of the z_t in the regularizer.
Let me know if there are more doubts!
Clear now, thanks!
Great lecture notes!
Doesn’t equation (2) mean the loss can be *lower* bounded by a quadratic approximation?
Yes! Typo fixed, thanks.