*This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week.*

* You can find the lectures I published till now here.*

To summarize what we said in the previous note, let’s define online learning as the following general game

- For
- Outputs
- Receive
- Pay

- End for

The aim of this game is to minimize the regret with respect to any competitor :

We also said that the way the losses are decided is adversarial. Now, without additional assumption we cannot hope to solve this problem. Hence, we have to understand what are the *reasonable* assumptions we can make. Typically, we will try to restrict the choice of the loss functions in some way. This is considered reasonable because most of the time we decide the set from which loss functions are picked. So, for example, we will consider only *convex* loss functions. However, convexity might not be enough, so we might restrict the class a bit more to, for example, Lipschitz convex functions. On the other hand, assuming to know something about the future is not considered a reasonable assumption, because we very rarely have any control on the future. In general, the stronger the assumptions the better will be the upper bound on the regret we can prove. The best algorithms we will see will guarantee a sublinear regret against the weakest assumption we can make, guaranteeing *at the same time* a smaller regret for easy adversaries.

It is also important to remember why minimizing the regret is a good objective: Given that we don’t assume anything on how the adversary generates the loss functions, minimizing the regret is a good metric that takes into account the difficulty of the problem. If an online learning algorithm is able to guarantee a sublinear regret, it means that its performance on average will approach the performance of any fixed strategy. As said, we will see that in many situations if the adversary is “weak”, for example it is a fixed stochastic distribution over the loss functions, being prepared for the worst-case scenario will not preclude us to get the best guarantee anyway.

For a while, we will focus on the case that are convex, and this problem will be called **Online Convex Optimization** (OCO). Later, we will later see how to *convexify* some non-convex online problems.

**Why so much math??** I will now introduce some math concepts. If you have a background in Convex Analysis, this will be easy stuff for you. On the other hand, if you never saw these things before they might look a bit scary. Let me tell you the right way to look at them: *these are tools that will make our job easier*. Without these tools, it would be basically impossible to design any online learning algorithm. And, no, it is not enough to test random algorithms on some machine learning dataset, because fixed datasets are not adversarial. Without a correct proof, you might never realize that your online algorithm fail on particular sequences of losses, as it happened to Adam (Reddi, S. J. and Kale, S. and Kumar, S., 2018). I promise you that once you understand the key mathematical concepts, online learning is actually easy.

**1. Convex Analysis Bits: Convexity **

Definition 1isconvexif for any and any , we have

In words, this means that *the set with no holes*, see Figure 1.

We will make use of **extended-real-valued functions**, that is function that take value in . For an extended-real-valued function on , its **domain** is the set .

Extended-real-valued functions easily allow us to consider constrained set and are a standard notation in Convex Optimization, see, e.g., (Boyd, S. and Vandenberghe, L., 2004). For example, if I want the predictions of the algorithm and the competitor to be in a set , I can just add to all the losses, where is the **indicator function of the set ** defined as

In this way, the only way for the algorithm and for the competitor to suffer finite loss is to predict inside the set . Also, extended-real-valued functions will make the use of Fenchel conjugates more direct.

*Convex functions* will be an essential ingredient in online learning.

Definition 2Let . isconvexif the epigraph of the function, , is convex.

We can see a visualization of this definition in Figure 2. Note that the definition implies that the domain of a convex function is convex. Also, observe that if is convex, is convex.

Note that is convex iff is convex, so each convex set is associated with a convex function.

The definition above gives rise to the following characterization for convex functions that do not assume the value .

Theorem 3 (Rockafellar, R. T., 1970, Theorem 4.1)Let and is a convex set. Then is convex iff, for any , we have

Example 1The simplest example of convex functions are theaffine function: .

Example 2Norms are always convex, the proof is left as exercise.

How to recognize a convex function? In the most general case, you have to rely on the definition. However, most of the time we will recognize them as composed by operations that preserve the convexity. For example:

- and convex, then the linear combination with non-negative weights is also convex.
- The composition with an affine transformation preserves the convexity.
- Pointwise supremum of convex functions is convex.

The proofs are left as exercises.

A very important property of differentiable convex functions is that we can construct linear lower bound to the function.

Theorem 4 (Rockafellar, R. T., 1970, Theorem 25.1 and Corollary 25.1.1)Suppose a convex function and let . If is differentiable at then

We will also use the optimality condition for convex functions:

Theorem 5Let a convex non-empty set, , and a convex function, differentiable over an open set that contains . Then iff .

In words, at the constrained minimum, the gradient makes an angle of or less with all the feasible variations , hence we cannot minimize more the function moving inside .

Another critical property of convex functions is Jensen’s inequality.

Theorem 6Let be a measurable convex function and be an -valued random element on some probability space such that exists and with probability 1. Then

Let’s see our first OCO algorithm in the case that the functions are also differentable.

**2. Online Gradient Descent **

Last time we saw a simple strategy to obtain a logarithmic regret in the guessing game. The strategy was to use the best over the past, that is the *Follow-the-Leader* strategy. In formulas,

and in the first round we can play any admissible point. One might wonder if this strategy always works, but the answer is negative!

Example 3 (Failure of FTL)Let and consider the sequence of losses , where

Then, a part from the first round where the prediction of FTL is arbitrary in , the predictions of FTL will be for even and for odd. The cumulative loss of the FTL algorithm will therefore be while the cumulative loss of the fixed solution is 0. Thus, the regret of FTL is .

Hence, we will show an alternative strategy that guarantees sublinear regret for convex Lipschitz functions. Later, we will also prove that the dependency on is optimal. The strategy is called Projected Online Gradient Descent, or just Online Gradient Descent, see Algorithm 1. It consists in updating the prediction of the algorithm at each time step moving in the negative direction of the gradient of the loss received and projecting back onto the feasible set. It is similar to Stochastic Gradient Descent, but it is not the same thing: here the loss functions are different at each step. We will later see that Online Gradient Descent can *also* be used as Stochastic Gradient Descent.

First, we show the following two Lemmas. The first lemma proves that Euclidean projections always decrease the distance with points inside the set.

Proposition 7Let and , where is a non-empty closed convex set and define . Then, .

*Proof:* From the optimality condition of Theorem 5, we obtain

Therefore,

The next lemma upper bounds the regret in one iteration of the algorithm.

Lemma 8Let a closed non-empty convex set and a convex function differentiable in a open set that contains . Set . Then, , the following inequality holds

*Proof:* From Proposition 7 and Theorem 4, we have that

Reordering, we have the stated bound.

We can prove the following regret guarantee in the online setting.

Theorem 9Let a closed non-empty convex set with diameter , i.e. . Let an arbitrary sequence of convex functions differentiable in open sets containing for . Pick any assume . Then, , the following regret bound holds

Moreover, if is constant, i.e. , we have

*Proof:* Dividing the inequality in Lemma 8 by and summing over , we have

The proof of the second statement is left as an exercise.

We can immediately observe few things.

- If we want to use time-varying learning rates you need a bounded domain . However, this assumption is false in most of the machine learning applications. However, in the stochastic setting you can still use a time-varying learning rate in SGD with an unbounded domain if you use a non-uniform averaging. We will see this later.
- Another important observation is that the regret bound helps us choosing the learning rates . Indeed, it is the only guideline we have. Any other choice that is not justified by a regret analysis it is not justified at all.

As we said, the presence of parameters like the learning rates make no sense in online learning. So, we have to decide a strategy to set them. A simple choice is to find the constant learning rate that minimizes the bounds for a fixed number of iterations. We have to consider the expression

and minimize with respect to . It is easy to see that the optimal is , that would give the regret bound

However, we have a problem: in order to use this stepsize, we should know all the future subgradients and the distance between the optimal solution and the initial point! This is clearly impossible: Remember that the adversary can choose the sequence of functions. Hence, it can observe your choice of learning rates and decide the sequence so that your learning rate is now the wrong one!

Indeed, it turns out that this kind of rate is completely impossible because it is ruled out by a lower bound. Yet, we will see that it is indeed possible to achieve very similar rates using *adaptive* and *parameter-free* algorithms.

In the following, we will see many adaptive strategies. For the moment, we can observe that we might be happy to minimize a looser upper bound. In particular, assume that the norm of the gradients is bounded by , that is . Also, assuming a bounded diameter, we can upper bound by . Hence, we have

So, indeed the regret is sublinear in time. Next time, we will see how to remove the differentiability assumption through the use of *subgradients*.

**3. History Bits **

The Projected Online Subgradient Descent with time-varying learning rate and the concept itself of Online Convex Optimization was introduced by (M. Zinkevich, 2003). Note that, if we consider the optimization literature, the optimization people never restricted themselves to the bounded case.

**4. Exercises **

Exercise 1Prove the second part of Theorem 9.

Exercise 2Prove that .

Exercise 3Using the inequality in the previous exercise, prove that a learning rate gives rise to a regret only a constant multiplicative factor worse than the one in (1).

Hi,

Am I wrong or in the first line of the Proof of Theorem 9, the second inequality of Lemma 8 has to be divided by \eta_t instead of 2 \eta_t ?

Besides this minor detail, I was wondering if the choice of the adaptive parameter in Exercise 3 is taken by “converting to online” the ideal batch choice \eta= $\frac{D}{LT}$.

Thanks

LikeLike

Typo fixed, thanks!

The choice of the time-varying learning rate in Exercise 3 does indeed corresponds to make the choice eta proportional 1/sqrt{T} to an “online version” of eta_t proportional to 1/sqrt{t}. However, the reason why it works seems to be purely mathematical. In other words, it is not true that any fixed choice of parameters can be made “online” in a natural way: Each time you’ll have to prove that it works by proving the regret guarantee.

LikeLike

Hi,

I’m in trouble with the beginning of the Proof of Proposition 7; how can you obtain the vectors of the inner product from the Optimality condition of Theorem 5?

Pheraps it is just a gap of mine in algebra. Anyway, I tried to find a way to “connect” the things but I wasn’t able to.

Could you please explain me how to?

Thanks

LikeLike

Hi Jacopo, there are a couple of skipped steps: we want to apply the optimality condition of the objective function of the projection. Now, instead of doing it directly, we first observe that the argmin of the projection does not change if instead of considering ||x-y||, we consider 1/2 ||x-y||^2. Taking the gradient of this function and using the optimality condition should give you the inner product. Can you see it now?

LikeLike