What is Online Learning?

This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week.

You can find the lectures I published till now here.

Imagine the following repeated game:

In each round ${t=1,\dots,T}$

An adversary choose a real number in ${y_t \in [0,1]}$ and he keeps it secret
You try to guess its real number, choosing ${x_t \in [0,1]}$
The adversary’s number is revealed and you pay the squared difference ${(x_t-y_t)^2}$

Basically, we want to guess a sequence of numbers as precisely as possible. To be a game, we now have to decide what is the “winning condition”. Let’s see what makes sense to consider as winning condition.

First, let’s simplify a bit the game. Let’s assume that the adversary is drawing the numbers i.i.d. from some fixed distribution over ${[0,1]}$ . However, he is still free to decide which distribution at the beginning of the game. If we knew the distribution, we could just predict each round the mean of the distribution and in expectation we would pay ${\sigma^2 T}$ , where ${\sigma^2}$ is the variance of the distribution. We cannot do better than that! However, given that we don’t know the distribution, it is natural to benchmark any strategy with respect to the optimal one. That is, it is natural to measure the quantity

$\displaystyle \label{eq:stoch_regret} \mathop{\mathbb E}_Y\left[\sum_{t=1}^T (x_t - Y)^2\right] - \sigma^2 T \ \ \ \ \ (1)$

or equivalently considering the average

$\displaystyle \label{eq:av_stoch_regret} \frac{1}{T}\mathop{\mathbb E}_Y\left[\sum_{t=1}^T (x_t - Y)^2\right] - \sigma^2~. \ \ \ \ \ (2)$

Clearly these quantities are positive and they seem to be a good measure, because they are somehow normalized with respect to the “difficulty” of the numbers generated by the adversary, through the variance of the distribution. It is not the only possible choice to measure our “success”, but for sure it is a reasonable one. It would make sense to consider a strategy “successful” if the difference in (1) grows sublinearly over time and equivalently if the difference in (2) goes to zero as the number of rounds ${T}$ goes to infinity. That is, on average on the number of rounds, I would like my algorithm to be able to approach the optimal performance.

Minimizing Regret. Given that we have converged to what it seems a good measure of success of the algorithm. Let’s now rewrite (1) in an equivalent way

$\displaystyle \mathop{\mathbb E}\left[\sum_{t=1}^T (x_t - Y)^2\right] - \min_{ x \in [0,1]} \ \mathop{\mathbb E}\left[\sum_{t=1}^T (x-Y)^2\right]~.$

Now, the last step: remove the assumption on how the data is generated, just consider any arbitrary sequence of ${y_t}$ and keep using the same measure of success. Of course, we can remove the expectation because there is no stochasticity anymore. So, we get that we will win the game if

$\displaystyle \text{Regret}_T:=\sum_{t=1}^T (x_t - y_t)^2 - \min_{x \in [0,1]} \ \sum_{t=1}^T (x - y_t)^2$

grows sublinearly with ${T}$ . The quantity above is called the Regret, because measures how much the algorithm regrets for not sticking on all the rounds to the optimal choice in hindsight. We will denote it by ${\text{Regret}_T}$ . Our reasoning should provide sufficient justification for this metric, however in the following we will see that this also makes sense from both a convex optimization and machine learning point of view.

Let’s generalize it a bit more, considering that the algorithm outputs a vector in ${{\boldsymbol x}_t \in V\subseteq {\mathbb R}^d}$ and it pays a loss ${\ell_t: {\mathbb R}^d \rightarrow {\mathbb R}}$ that measures how good was the prediction of the algorithm in each round. Also, let’s consider an arbitrary predictor ${{\boldsymbol u}}$ in ${V\subseteq {\mathbb R}^d}$ and let’s parameterize the regret with respect to it: ${\text{Regret}_T({\boldsymbol u})}$ . So, to summarize, Online Learning is nothing else than designing and analyzing algorithms to minimize the Regret over a sequence of loss functions with respect to an arbitrary competitor ${{\boldsymbol u} \in V\subseteq {\mathbb R}^d}$ :

$\displaystyle \text{Regret}_T({\boldsymbol u}):=\sum_{t=1}^T \ell_t({\boldsymbol x}_t) - \sum_{t=1}^T \ell_t({\boldsymbol u})~.$

This framework is pretty powerful, and it allows to reformulate a bunch of different problems in machine learning and optimization as similar games. More in general, with the regret framework we can analyze situations in which the data are not i.i.d. from a distribution, yet I would like to guarantee that the algorithm is “learning” something. For example, online learning can be used to analyze

Click prediction problems;
Routing on a network;
Convergence to equilibrium of repeated games.

It can also be used to analyze stochastic algorithms, e.g., Stochastic Gradient Descent, but the adversarial nature of the analysis might give you suboptimal results. For example, it can be used to analyze momentum algorithms, but the adversarial nature of the losses essentially forces you to prove a convergence guarantee that treats the momentum term as a vanishing disturbance that does not help the algorithm in any way.

Let’s now go back to our number guessing game and let’s try a strategy to win it. Of course, this is one of the simplest example of online learning, without a real application. Yet, going through it we will uncover most of the key ingredients in online learning algorithms and their analysis.

A Winning Strategy. Can we win the number guessing name game? Note that we didn’t assume anything on how the adversary is deciding the numbers. In fact, the numbers can be generated in any way, even in an adaptive way based on our strategy. Indeed, they can be chosen adversarially, that is explicitly trying to make us lose the game. This is why we call the mechanism generating the number the adversary.

The fact that the numbers are adversarially chosen means that we can immediately rule out any strategy based on some statistical modelling on the data. In fact, it cannot work because the moment we estimate something and act on our estimate, the adversary can immediately change the way he is generating the data, ruining us. So, we have to think about something else. Yet, many times online learning algorithms will look like classic ones from statistical estimation, even if they work for different reasons.

Now, let’s try to design a strategy to make the regret provably sublinear in time, regardless of how the adversary chooses the numbers. The first thing to do is to take a look at the best strategy in hindsight, that is argmin of the second term of the regret. It should be immediate to see that

$\displaystyle x^\star_T := \mathop{\text{argmin}}_{x \in [0,1]} \sum_{t=1}^T (x - y_t)^2 = \frac{1}{T} \sum_{t=1}^T y_t~.$

Now, given that we don’t know the future, for sure we cannot use ${x^\star_T}$ as our guess in each round. However, we do know the past, so a reasonable strategy in each round is to guess the best number over the past. Why such strategy would work? For sure, the reason why it could work is not because we expect the future to be like the past, because it is not true! Instead, we want to leverage the fact that the optimal guess should not change too much between rounds, so we can try to “track” it over time.

Hence, on each round ${t}$ our strategy is to guess ${x_t = x_{t-1}^\star=\frac{1}{t-1} \sum_{i=1}^{t-1} y_i}$ . Such strategy is usually called Follow-the-Leader (FTL), because you are following what would have been the optimal thing to do on the past rounds (the Leader).

Let’s now try to show that indeed this strategy will allow us to win the game. We will use first principles and later introduce general proof methods. First, we will need a small lemma.

Lemma 1 Let ${\ell_t :{\mathbb R}^d \rightarrow {\mathbb R}}$ an arbitrary sequence of loss functions. Denote by ${x^\star_t}$ a minimizer of the cumulative losses over the previous ${t}$ rounds. Then, we have

$\displaystyle \sum_{t=1}^T \ell_t(x^\star_{t}) \leq \sum_{t=1}^T \ell_t(x^\star_{T})~.$

Proof: We prove it by induction over ${T}$ . The base case is

$\displaystyle \ell_1(x^\star_1) \leq \ell_1(x^\star_{1}),$

that is trivially true. Now, for ${T\geq2}$ , we assume that ${\sum_{t=1}^{T-1} \ell_t(x^\star_{t}) \leq \sum_{t=1}^{T-1} \ell_t(x^\star_{T-1})}$ is true and we must prove the stated inequality, that is

$\displaystyle \sum_{t=1}^T \ell_t(x^\star_{t}) \leq \sum_{t=1}^T \ell_t(x^\star_{T})~.$

This inequality is equivalent to

$\displaystyle \label{eq:lemma1} \sum_{t=1}^{T-1} \ell_t(x^\star_{t}) \leq \sum_{t=1}^{T-1} \ell_t(x^\star_{T}), \ \ \ \ \ (3)$

where we removed the last element of the sums because they are the same. Now observe that

$\displaystyle \sum_{t=1}^{T-1} \ell_t(x^\star_{t}) \leq \sum_{t=1}^{T-1} \ell_t(x^\star_{T-1}),$

by induction hypothesis, and

$\displaystyle \sum_{t=1}^{T-1} \ell_t(x^\star_{T-1}) \leq \sum_{t=1}^{T-1} \ell_t(x^\star_{T})$

because ${x^\star_{T-1}}$ is a minimizer of the left hand side. Chaining these two inequalities, we have that (3) is true, and so the theorem is proven. $\Box$

Basically, the above lemma quantifies the idea the knowing the future and being adaptive to it is typically better than not being adaptive to it.

With this lemma, we can now prove that the regret will grow sublinearly, it particular it will be logarithmic in time. Note that we will not prove that our strategy is minimax optimal, even if it is possible to show that the logarithmic dependency on time is unavoidable for this problem.

Theorem 2 Let ${y_t \in [0,1]}$ for ${t=1,\dots,T}$ an arbitrary sequence of numbers. Let the algorithm’s output ${x_t=x_{t-1}^\star:=\frac{1}{t-1}\sum_{i=1}^{t-1} y_i}$ . Then, we have

$\displaystyle \text{Regret}_T = \sum_{t=1}^T (x_t - y_t)^2 - \min_{x \in [0,1]} \ \sum_{t=1}^T (x - y_t)^2 \leq 4 + 4\ln T~.$

Proof: We use Lemma 1 to upper bound the regret:

$\displaystyle \begin{aligned} \sum_{t=1}^T (x_t - y_t)^2 - \min_{x \in [0,1]} \ \sum_{t=1}^T (x - y_t)^2 = \sum_{t=1}^T (x^\star_{t-1} - y_t)^2 - \sum_{t=1}^T (x^\star_T - y_t)^2 \leq \sum_{t=1}^T (x^\star_{t-1} - y_t)^2 - \sum_{t=1}^T (x^\star_t - y_t)^2~. \end{aligned}$

Now, let’s take a look at each difference in the sum in the last equation. We have that

$\displaystyle \begin{aligned} (x^\star_{t-1} - y_t)^2 - (x^\star_t - y_t)^2 &= (x^\star_{t-1})^2 - 2 y_t x^\star_{t-1} - (x^\star_{t})^2 + 2 y_t x^\star_{t} \\ &= (x^\star_{t-1}+x^\star_{t} - 2y_t)(x^\star_{t-1}-x^\star_{t}) \\ &\leq |x^\star_{t-1}+x^\star_{t} - 2y_t|\,|x^\star_{t-1}-x^\star_{t}| \\ &\leq 2 |x^\star_{t-1}-x^\star_{t}| \\ &=2\left|\frac{1}{t-1} \sum_{i=1}^{t-1} y_i -\frac{1}{t} \sum_{i=1}^{t} y_i\right| \\ &=2\left|\left(\frac{1}{t-1}-\frac{1}{t}\right) \sum_{i=1}^{t-1} y_i - \frac{y_t}{t}\right| \\ &\leq 2\left|\frac{1}{t(t-1)}\sum_{i=1}^{t-1} y_i\right| + \frac{2|y_t|}{t} \\ &\leq \frac{2}{t} + \frac{2|y_t|}{t} \\ &\leq \frac{4}{t}~. \end{aligned}$

Hence, overall we have

$\displaystyle \sum_{t=1}^T (x_t - y_t)^2 - \min_{x \in [0,1]} \ \sum_{t=1}^T (x - y_t)^2 \leq 4\sum_{t=1}^T\frac{1}{t}~.$

Figure 1. Approximate sum with integral.

For the upper bound the last sum, observe that we are trying to find an upper bound to the green area in Figure 1. As you can see from the picture, it can be upper bounded by 1 plus the integral of ${\frac{1}{t-1}}$ from ${2}$ to ${T+1}$ . So, we have

$\displaystyle \sum_{t=1}^T\frac{1}{t} \leq 1+\int_{2}^{T+1} \frac{1}{t-1} dt = 1+ \ln T~.$

$\Box$

There are few of things to stress on this strategy. The strategy does not have parameters to tune (e.g. learning rates, regularizers). Note that the presence of parameters does not make sense in online learning: We have only one stream of data and we cannot run our algorithm over it multiple times to select the best parameter! Also, it does not need to maintain a complete record of the past, but only a “summary” of it, through the running average. This gives a computationally efficient algorithm. When we design online learning algorithms, we will strive to achieve all these characteristics. Last thing that I want to stress is that the algorithm does not use gradients: Gradients are useful and we will use them a lot, but they don’t constitute the entire world of online optimization.

Before going on I want to remind you that, as seen above, this is different from the classic setting in statistical machine learning. So, for example, “overfitting” has absolutely no meaning here. Same for “generalization gap” and similar ideas linked to a training/testing scenario.

In the next lectures, we will introduce our first algorithm for online optimization for convex losses: Online Gradient Descent.

Exercises

Exercise 1 Extend the previous algorithm and analysis to the case when the adversary selects a vector ${{\boldsymbol y}_t \in {\mathbb R}^d}$ such that ${\|{\boldsymbol y}_t\|_2\leq1}$ , the algorithm guesses a vector ${{\boldsymbol x}_t \in{\mathbb R}^d}$ , and the loss function is ${\|{\boldsymbol x}_t-{\boldsymbol y}_t\|^2_2}$ . Show an upper bound to the regret that is logarithmic in ${T}$ . Among the other things, you will probably need the Cauchy–Schwarz inequality: ${\langle {\boldsymbol x},{\boldsymbol y}\rangle \leq \|{\boldsymbol x}\|_2 \|{\boldsymbol y}\|_2}$ .

12 Comments

Francesco_reader says:

September 3, 2019 at 11:22 PM

Hi, thanks for the blog.
Is it an argmin in the definition of x_{*}^{T} ?

LikeLike

1. bremen79 says:
  
  September 3, 2019 at 11:27 PM
  
  Yes, thanks! Fixed
  
  LikeLike
  
Pingback: Online Gradient Descent – Parameter-free Learning and Optimization Algorithms
Nicolò says:

September 15, 2019 at 6:23 PM

Hello, in the definition of regret is there a missing subscript t for the first loss, i.e. \ell ({\boldsymbol x}_t) instead of \ell_t ({\boldsymbol x}_t)?

“.. Regret over a sequence of loss functions with respect to an arbitrary competitor {{\boldsymbol u} \in V\subseteq {\mathbb R}^d}

\text{Regret}_T({\boldsymbol u}):=\sum_{t=1}^T \ell ({\boldsymbol x}_t) – \sum_{t=1}^T \ell_t({\boldsymbol u})~.
”

Regarding the exercise, is it true that the regret bound should be the same except for a factor of d, i.e. 4d (1 + \ln T)?

LikeLike

1. bremen79 says:
  
  September 15, 2019 at 10:26 PM
  
  Thanks for finding the typo! Fixed.
  
  Regarding the exercise, you should be able to prove a regret without dependency on ${d}$ , we will see how this week using strong convexity. The dependency on ${d}$ might appear if you try to solve the problem coordinate-wise. Instead, mirroring the proof we did using inner products and L2 norms you should get the right regret.
  
  LikeLike
  
Pingback: Follow-The-Regularized-Leader I: Regret Equality – Parameter-free Learning and Optimization Algorithms
Pingback: Follow the Regularized Leader II: Applications – Parameter-free Learning and Optimization Algorithms
S (@panini__17) says:

November 26, 2019 at 8:47 PM

Thanks a lot for the writeup. Is the definition of regret in the non-stochastic case (i.e., the term in the fourth display, $Regret_T$) guaranteed to be non-negative?

LikeLike

1. bremen79 says:
  
  November 27, 2019 at 5:22 PM
  
  No, it is not! Even if for many online learning problems it is possible to show that there exists a strategy for the adversary that guarantee a positive regret, no matter what the online algorithm does.
  
  LikeLike
  
LCella says:

December 5, 2019 at 2:05 PM

Hello, thanks for the blog. In the first line of the proof of Lemma 1, shouldn’t the base case be $\ell_1(x^\star_1) \leq \ell_1(x^\star_1)$ ?

LikeLike

1. bremen79 says:
  
  December 5, 2019 at 9:37 PM
  
  Fixed, thanks!
  
  LikeLike
  
Onion ranger duck (@fomiuna) says:

September 22, 2020 at 4:55 AM

Thanks for your writing! I notice that quantity (1) is non-negative but the $\mathrm{Regret}_T$ may be negative since the expectation is removed. Does it matter for Regret to take negative values?

LikeLike

1. bremen79 says:
  
  September 28, 2020 at 9:43 AM
  
  This was already asked and answered, see above 🙂
  Maybe I should add it to the text.
  
  LikeLike
  
Pedram M. Haqiqi says:

October 15, 2023 at 4:34 PM

Hi there! small suggestion, the bounding of sum_overT(1/t) can also be shown using lemma 4.13, there is also an exercise in Chapter 2, 2.1, that can leverage it. Maybe showing that earlier would be beneficial.

Best,
Pedram

LikeLike

1. bremen79 says:
  
  October 17, 2023 at 6:05 AM
  
  Hi Pedram, I know 🙂
  I wanted the first chapter to be single and only using first principles for a couple of reasons:
  1) in CS there is a tendency to explain online learning only from first principles. While I don’t like this teaching approach, I wanted at least at the beginning to follow a similar route.
  2) The notes are meant to be read chapter by chapter, so I didn’t want to scare the people right away 🙂
  But you are right that I might introduce that lemma earlier anyway.
  
  LikeLike
  
Pingback: Dynamic regret and ADER – Parameter-free Learning and Optimization Algorithms