ICML Tutorial

ICML Tutorial on Parameter-free Online Optimization

ICML websiteTutorial Videos

Francesco Orabona and Ashok Cutkosky

Abstract

Classical stochastic optimization results typically assume known values for various properties of the data (e.g. Lipschitz constants, distance to an optimal point, smoothness or strong-convexity constants). Unfortunately, in practice these values are unknown, necessitating a long trial-and-error procedure to find the best parameters.
To address this issue, in recent years a number of parameter-free algorithms have been developed for online optimization and for online learning. Parameter-free algorithms make no assumptions about the properties of the data and yet nevertheless converge just as fast as the optimally tuned algorithm.
This is an exciting line of work that has now reached enough maturity to be taught to general audiences. Indeed, these algorithms have not received a proper introduction to the machine learning community and only a handful of people fully understand them. This tutorial aims at bridging this gap, presenting practice and theory for using and
designing parameter-free algorithms. We will present the latest advancements in this field, including applications to optimization, deep learning, and learning with kernels.

Description and Outline

This tutorial covers the design and use of parameter-free online optimization algorithms. Parameter-free algorithms aim to remove the need for laborious brute-force tuning of learning rates or other parameters required in typical stochastic optimization methods (e.g. stochastic gradient descent). The optimal convergence guarantees of stochastic
gradient descent are usually proved only for some optimally tuned learning rate, while parameter-free algorithms instead achieve these same guarantees without requiring any tuning. Viewed through another lens, these algorithms take the idea of adaptivity to the extreme, adapting simultaneously to many unknown data characteristics to speed
up convergence. In layman’s terms, this mean having an adaptive optimization algorithm like AdaGrad, but on steroids! In the last few years, many new parameter-free algorithms have been proposed that enjoy both extremely good theoretical properties and empirical performance. These parameter-free algorithms enjoy a strong synergy with AutoML approaches, as they remove all the unnecessary parameters and allow the AutoML algorithm to focus on the remaining ones. We will introduce the parameter-free algorithms to both practitioners as well as to the broader optimization community. The tutorial will be divided in two parts of one hour each. The first part will introduce the concept of completely removing stepsizes in online optimization. We will motivate this phenomenon from both the theoretical viewpoint of achieving greater adaptivity to the data as well as the practical viewpoint of reducing the need for brute-force hyperparmeter search. We will briefly discuss the history of these algorithms from the seminal works of Chaudhuri, Freund, and Hsu (2009) and Streeter and McMahan (2012) that independently introduced these new algorithms, respectively in the learning with expert advice and online convex optimization settings. Then, we will show more recent improvements to these algorithms and their practical performance. Since these algorithms do not clearly resemble gradient descent, we will discuss how to implement them and provide some intuition for their operation. We will stress the advantages and simplicity over classic approaches, communicating the ease of use and showing practical applications.

The second part will cover the theory behind these algorithms. We will briefly discuss approaches based on classical techniques such as Follow-the-Regularized-Leader. Then, we will make use of the coin-betting analogy developed by one of the speakers (Orabona and Pal 2016). This analogy allows a very intuitive translation between optimal gambling strategies and parameter-free online optimization algorithms. It also allows one to quickly understand how these algorithms work, without needing to dig too deeply into technical mathematical details.

Goals

The goals of this tutorial are in the following directions:

  • Applications: we want to empower the applied ICML community with new tools for machine learning and online optimization. Indeed, despite the ease of use and strong empirical performance, these tools have not yet adopted by the ICML practitioners. We believe this is due to the very technical nature of papers published on this topic, and the lack of intuitive interpretation of the algorithms. This tutorial can close this harmful gap.
  • Convergence: the tutorial aims at showing that a range of methods created in similar but separated fields share the same common cores. Showing the commonalities, we hope to increase the cross-fertilization among these
    fields.
  • Theory: there is a large part of the optimization community that is unaware of these results because they were born in a different community. Clearly explaining these new algorithms in a common language will likely increase the number of people working on these topics.

Target Audience

The target audience is anyone with the competencies of a third year PhD student in one of the fields of machine learning, optimization, or online learning. We also target machine learning researchers in academia and industry who only use stochastic optimization as a tool.

Presenters

The time will be equally divided by the two presenters.

Francesco Orabona
Boston University
http://francesco.orabona.com
Francesco Orabona is an Assistant Professor at Boston University. His background covers both theoretical and practical aspects of machine learning and optimization. His current research interests lie in online learning, and more generally the problem of designing and analyzing adaptive and parameter-free learning algorithms. He received the
PhD degree in Electrical Engineering at the University of Genoa in 2007. He is (co)author of more than 60 peer reviewed papers.

Ashok Cutkosky
Google Research -> Boston University
https://cs.stanford.edu/people/ashokc/
Ashok Cutkosky is a a research scientist at Google. He obtained his PhD in Computer Science from Stanford University in 2018. Prior to his work in optimization theory, he developed a background in pure mathematics as well as biology. He is currently working on parameter-free algorithms for practical online and stochastic optimization.
He has given talks and written several papers on this topic in ICML, NeurIPS, and COLT, and received the Best Student Paper award at COLT 2017 for his work on new lower-bounds and optimal algorithms for parameter-free online convex optimization.

Slides

References

This is list of the main papers we plan to cover in the tutorial; the majority has been published in ICML, COLT, and NeurIPS.

Parameter-free Optimization for Deep Learning

  • A. Cutkosky and T. Sarlos. “Matrix-Free Preconditioning in Online Learning”. In: Proc. of International Conference on Machine Learning. 2019
  • F. Orabona and T. Tommasi. “Training Deep Networks without Learning Rates Through Coin Betting”. In: Advances in Neural Information Processing Systems 30. 2017
  • A. Cutkosky and K. A. Boahen. “Online Convex Optimization with Unconstrained Domains and Losses”. In: Advances in Neural Information Processing Systems 29. 2016, pp. 748–756

Parameter-free Online Convex Optimization

  • D. van der Hoeven. “User-Specified Local Differential Privacy in Unconstrained Adaptive Online Learning”. In: Advances in Neural Information Processing Systems. 2019, pp. 14080–14089
  • G. Wang, S. Lu, and L. Zhang. “Adaptivity and Optimality: A Universal Algorithm for Online Convex Optimization”. In: Proc. of 35th Conference on Uncertainty in Artificial Intelligence (UAI), 2019.
  • Z. Mhammedi, W. M. Koolen, and T. Van Erven. “Lipschitz Adaptivity with Multiple Learning Rates in Online Learning”. In: Proc. of COLT. 2019, pp. 2490–2511
  • A. Cutkosky. “Combining Online Learning Guarantees”. In: Proc of COLT. 2019
  • A. Cutkosky. “Artificial Constraints and Hints for Unbounded Online Learning”. In: Proc. of COLT. 2019
  • K.-S. Jun and F. Orabona. “Parameter-free Online Convex Optimization with Sub-Exponential Noise”. In: Proc. of COLT. 2019
  • D. J. Foster, A. Rakhlin, and K. Sridharan. “Online Learning: Sufficient Statistics and the Burkholder Method”. In: Proc. of COLT. 2018
  • A. Cutkosky and F. Orabona. “Black-Box Reductions for Parameter-free Online Learning in Banach Spaces”. In: Proc. of COLT. 2018
  • D. J. Foster, S. Kale, M. Mohri, and K. Sridharan. “Parameter-free online learning via model selection”. In Advances in Neural Information Processing Systems. 2017, pp. 6020–6030
  • A. Cutkosky and K. A. Boahen. “Stochastic and Adversarial Online Learning without Hyperparameters”. In: Advances in Neural Information Processing Systems. 2017, pp. 5066–5074
  • A. Cutkosky and K. A. Boahen. “Online Learning Without Prior Information”. In: Proc of COLT. 2017, pp. 643–677
  • T. van Erven and W. M. Koolen. “MetaGrad: Multiple Learning Rates in Online Learning”. In: Advances in Neural Information Processing Systems 29. 2016, pp. 3666–3674
  • F. Orabona and D. Pal. “Coin Betting and Parameter-Free Online Learning”. In: Advances in Neural Information Processing Systems 29. 2016, pp. 577–585
  • F. Orabona. “Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning”. In: Advances in Neural Information Processing Systems 27. 2014
  • H. B. McMahan and F. Orabona. “Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations”. In: Proc. of COLT. 2014
  • F. Orabona. “Dimension-Free Exponentiated Gradient”. In: Advances in Neural Information Processing Systems 26. Curran Associates, Inc., 2013, pp. 1806–1814
  • H. B. McMahan and J. Abernethy. “Minimax optimal algorithms for unconstrained linear optimization”. In: Advances in Neural Information Processing Systems 26. 2013, pp. 2724–2732
  • M. Streeter and B. McMahan. “No-Regret Algorithms for Unconstrained Online Convex Optimization”. In: Advances in Neural Information Processing Systems 25. Curran Associates, Inc., 2012, pp. 2402–2410

Scale-free Online Convex Optimization

  • M. Kempka, W. Kotlowski, and M. K. Warmuth. “Adaptive Scale-Invariant Online Algorithms for Learning Linear Models”. In: Proc. of International Conference on Machine Learning. 2019
  • W. Kotłowski. “Scale-Invariant Unconstrained Online Learning”. In: Proc. of ALT. 2017
  • F. Orabona and D. Pál. “Scale-free online learning”. In: Theoretical Computer Science 716 (2018). Special Issue on ALT 2015, pp. 50–69
  • S. Ross, P. Mineiro, and J. Langford. “Normalized online learning”. In: Proc. of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI). 2013

Parameter-free Learning with Experts

  • N. J. A. Harvey, C. Liaw, E. Perkins, and S. Randhawa. “Optimal anytime regret with two experts”. In: arXiv:2002.08994. 2020
  • T. Koren and R. Livni. “Affine-Invariant Online Optimization and the Low-rank Experts Problem”. In: Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 2017, pp. 4747–4755
  • K.-S. Jun, F. Orabona, S. Wright, and R. Willett. “Online Learning for Changing Environments Using Coin Betting”. In: Electron. J. Statist. 11.2 (2017), pp. 5282–5310
  • D. J. Foster, A. Rakhlin, and K. Sridharan. “Adaptive Online Learning”. In: Advances in Neural Information Processing Systems 28. Curran Associates, Inc., 2015, pp. 3375–3383
  • W. M. Koolen and T. van Erven. “Second-order Quantile Methods for Experts and Combinatorial Games”. In: Proc. of COLT. 2015, pp. 1155–1175
  • H. Luo and R. E. Schapire. “Achieving All with No Parameters: AdaNormalHedge”. In: Proc. of COLT. 2015, pp. 1286–1304
  • H. Luo and R. E. Schapire. “A Drifting-Games Analysis for Online Learning and Applications to Boosting”. In: Advances in Neural Information Processing Systems. 2014
  • A. Chernov and V. Vovk. “Prediction with Advice of Unknown Number of Experts”. In: Proc. of the 26th Conf. on Uncertainty in Artificial Intelligence. AUAI Press, 2010
  • K. Chaudhuri, Y. Freund, and D. J. Hsu. “A Parameter-Free Hedging Algorithm”. In: Advances in neural information processing systems. 2009, pp. 297–305

Optimization Heuristics related to Parameter-free Algorithms

  • J. Bernstein, A. Vahdat, Y. Yue, and M.-Y. Liu. “On the distance between two neural networks and the stability of learning”. In: arXiv:2002.03432. 2020
  • Y. You, Z. Zhang, C.-J. Hsieh, J. Demmel, and K. Keutzer. “ImageNet training in minutes”. In: Proc. of the 47th International Conference on Parallel Processing. 2018
  • Y. You, I. Gitman, and B. Ginsburg. “Scaling SGD batch size to 32K for Imagenet training”. Technical Report UCB/EECS-2017-156, University of California, Berkeley, 2017

General Papers on Online Learning

  • F. Orabona. A Modern Introduction to Online Learning. In: arXiv:1912.13213. 2019
  • N. Cesa-Bianchi, A. Conconi, and C. Gentile. “On the Generalization Ability of On-Line Learning Algorithms”. IEEE Trans. Information Theory. 2004

Previous Tutorials

The rise of deep learning has clearly shifted the focus of the machine learning community. In particular, stochastic and batch optimization have become core skills for applied and theoretical machine learning researchers. This is clear from the amount of tutorials related to optimization in the latest years in ICML and NeurIPS. Yet, most of the advanced tutorials focused only on optimization under very strong assumptions, e.g. strong convexity, or finite-sum regimes, while the tutorials with a broad focus only briefly touched the basic concepts. In particular, the necessity of tuning of parameters to achieve good practical performance is usually only briefly mentioned. Indeed, none of the algorithms we will present were discussed in the previous tutorials.

  • Recent Advances in Stochastic Convex and Non-Convex Optimization, Zeyuan Allen-Zhu, ICML 2017.
    http://people.csail.mit.edu/zeyuan/topics/icml-2017
    Similarities and differences: Allen-Zhu’s tutorial was mainly a theoretical one on the latest advancements in stochastic optimization under strong assumptions on the curvature of the function. On the contrary, our tutorial focuses on the orthogonal topic of online optimization under the weakest possible assumptions.
  • Online Convex Optimization: A Game-Theoretic Approach to Learning, Elad Hazan and Satyen Kale, ICML 2016.
    https://www.cs.princeton.edu/~ehazan/tutorial/tutorial.htm
    Similarities and differences: the tutorial was an introduction to online learning methods. As such, we share only the same basic methods. Yet, none of the algorithms we focus on were presented on this tutorial.
  • Stochastic Gradient Methods For Large-Scale Machine Learning, Leon Bottou, Frank E. Curtis, and Jorge Nocedal. ICML 2016.
    http://users.iems.northwestern.edu/~nocedal/ICML
    Similarities and differences: the tutorial was a very broad introduction to optimization. As such, it is very similar to the one of Richtárik and Schmidt at ICML 2015.
  • Nonconvex optimization: Challenges and Recent Successes, Anima Anandkumar. ICML 2016.
    http://newport.eecs.uci.edu/anandkumar/ICML2016tutorial.html
    Similarities and differences: the tutorial only focused on avoiding saddle-points and tensor methods in optimization.
  • Large-Scale Optimization: Beyond Stochastic Gradient Descent and Convexity, Suvrit Sra and Francis Bach, NIPS 2016.
    http://suvrit.de/talks/vr_nips16_bach.pdf
    http://suvrit.de/talks/vr_nips16_sra.pdf
    Similarities and differences: the tutorial focused on stochastic optimization only in the so-called finite-sum regime. As such, it is very similar to the tutorial of Allen-Zhu, ICML 2017.
  • Modern Convex Optimization Methods for Large-scale Empirical Risk Minimization. Peter Richtárik and Mark Schmidt. ICML 2015.
    https://icml.cc/2015/tutorials/2015_ICML_ConvexOptimization_I.pdf
    https://icml.cc/2015/tutorials/2015_ICML_ConvexOptimization_II.pdf
    Similarities and differences: the tutorial was a very broad introduction to optimization for machine learning. Various topics were briefly touched, like optimization in the finite sum regime, second-order methods, and stochastic approximation.