This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019. I will publish two lectures per week.
You can find the lectures I published till now here.
In the last lecture, we have shown a very simple and parameter-free algorithm for Online Convex Optimization (OCO) in 1-dimension. Now, we will see how to reduce OCO in a -dimensional space to OCO in 1-dimension, so that we can use the parameter-free algorithm given by a coin-betting strategy in any number of dimensions.
1. Coordinate-wise Parameter-free OCO
We have already seen that it is always possible to decompose an OCO problem over the coordinate and use a different 1-dimensional Online Linear Optimization (OLO) algorithm on each coordinate. In particular, we saw that
where the is exactly the regret w.r.t. the linear losses constructed by the coordinate of the subgradient.
Hence, if we have a 1-dimensional OLO algorithm, we can copies of it, each one fed with the coordinate of the subgradient. In particular, we might think to use the KT algorithm over each coordinate. The pseudo-code of this procedure is in Algorithm 1.
The regret bound we get is immediate: We just have to sum the regret over the coordinates.
Theorem 1 With the notation in Algorithm 1, assume that . Then, , the following regret bounds hold
where is a universal constant.
Note that the Theorem above suggests that in high dimensional settings should be proportional to .
2. Parameter-free in Any Norm
The above reductions works only with in a finite dimensional space. Moreover, it gives a dependency on the competitor w.r.t. the norm that might be undesirable. So, here we present another simple reduction from 1-dimensional OCO to infinite dimensions.
This reduction requires an unconstrained OCO algorithm for the 1-dimensional case and an algorithm for learning in -dimensional (or infinite dimensional) balls. For the 1-dimensional learner, we could use the KT algorithm, while for learning in -dimensional balls we can use, for example, Online Mirror Descent (OMD). Given these two learners, we decompose the problem of learning a vector in the problem of learning a direction and a magnitude. The regret of this procedure turns out to be just the sum of the regret of the two learners.
We can formalize this idea in the following Theorem.
Theorem 2 Denote by the linear regret of algorithm for any in the unit ball w.r.t a norm , and the linear regret of algorithm for any competitor . Then, for any , Algorithm 2 guarantees regret
Further, the subgradients sent to satisfy .
Proof: First, observe that since for all . Now, compute:
Remark 1 Note that the direction vector is not constrained to have norm equal to 1, yet this does not seem to affect the regret equality.
We can instantiate the above theorem using the KT betting algorithm for the 1d learner and OMD for the direction learner. We obtain the following examples.
Example 1 Let be OSD with and learning rate . Let the KT algorithm for 1-dimensional OCO with . Assume the loss functions are -Lipschitz w.r.t. the . Then, using the construction in Algorithm 2, we have
Using an online-to-batch conversion, this algorithm is a stochastic gradient descent procedure without learning rates to tune.
To better appreciate this kind of guarantee, let’s take a look at the one of Follow-The-Regularized-Leader (Online Subgradient Descent can be used in unbounded domains only with constant learning rates). With the regularizer and 1-Lipschitz losses we get a regret of
So, to get the right dependency on we need to tune , but we saw this is impossible. On the other hand, the regret in Example 1 suffers from a logarithmic factor, that is the price to pay not to have to tune parameters.
In the same way, we can even have a parameter-free regret bound for norms.
Example 2 Let be OMD with and learning rate . Let the KT algorithm for 1-dimensional OCO with . Assume the loss functions are -Lipschitz w.r.t. the . Then, using the construction in Algorithm 2, we have
If we want to measure the competitor w.r.t the norm, we have to use the same method we saw for OMD: Set and such that . Now, assuming that , we have that . Hence, we have to divide all the losses by and, for all , we obtain
Note that the regret against of the parameter-free construction is constant. It is important to understand that there is nothing special in the origin: We could translate the prediction by any offset and get a guarantee that treats the offset as the point with constant regret. This is shown in the next Proposition.
Proposition 3 Let an OLO algorithm that predicts and guarantees linear regret for any . We have that the regret of the predictions for OCO is
3. Combining OCO Algorithms
Finally, we now show a useful application of the parameter-free OCO algorithms property to have a constant regret against .
Theorem 4 Let and two OLO algorithms that produces the predictions and respectively. Then, predicting with , we have for any
Moreover, if both algorithm guarantee a constant regret of against , we have for any
Proof: Set . Then,
In words, the above theorem allows us to combine online learning algorithm. If the algorithms we combine have constant regret against the null competitor, then we always get the best of the two guarantees.
Example 3 We can combine two parameter-free OCO algorithms, one that gives a bound that depends on the norm of the competitor and subgradients and another one specialized to the norm of competitor/subgradients. The above theorem assures us that we will also get the best guarantee between the two, paying only a constant factor in the regret.
Of course, upper bounding the OCO regret with the linear regret, the above theorem also upper bounds the OCO regret.
4. History Bits
The approach of using a coordinate-wise version of the coin-betting algorithm was proposed in the first paper on parameter-free OLO in (M. Streeter and B. McMahan, 2012). Recently, the same approach with a special coin-betting algorithm was also used for optimization of deep neural networks (Orabona, F. and Tommasi, T., 2017). Theorem 2 is from (A. Cutkosky and F. Orabona, 2018). Note that the original theorem is more general because it works even in Banach spaces. The idea of combining two parameter-free OLO algorithms to obtain the best of the two guarantees is from (A. Cutkosky, 2019).
(Orabona, F. and Pal, D., 2016) proposed a different way to transform a coin-betting algorithm into an OCO algorithm that works in or even in Hilbert spaces. However, that approach seems to work on for the norm and it is not a black-box reduction. That said, the reduction in (Orabona, F. and Pal, D., 2016) seems to have a better empirical performance compared to the one in Theorem 2.
There are also reductions that allow to transform an unconstrained OCO learner into a constrained one (A. Cutkosky and F. Orabona, 2018). They work constructing a Lipschitz barrier function on the domain and passing to the algorithm the original subgradients plus the subgradients of the barrier function.
Exercise 1 Prove that with and are exp-concave. Then, using the Online Newton Step Algorithm, give an algorithm and a regret bound for a game with these losses. Finally, show a wealth guarantee of the corresponding coin-betting strategy.