Saddle-Points, Game Theory, and Boosting

In this post, we will see some connections between saddle-point problems and Game Theory, as well as with Boosting.
Warning: this post heavily relies on the material on min-max problems, so I strongly suggest reading the previous post first if you do not remember it.

1. Game-Theory interpretation of Saddle-Point Problems

An instantiation of a saddle-point problem also has an interpretation in Game Theory as a Two-player Zero-Sum Game. Note that Game Theory is a vast field, two-person zero-sum games are only a very small subset of the problems in this domain, and what I describe here is an even smaller subset of this subset of problems.

Game theory studies what happens when self-interested agents interact. By self-interested, we mean that each agent has an ideal state of things he wants to reach, which can include a description of what should happen to other agents as well, and he works towards this goal. In two-person games the players act simultaneously and then they receive their losses. In particular, the {X}-player chooses the play {{\boldsymbol x}} and the {Y}-player chooses the play {{\boldsymbol y}}; the {X}-player suffers the loss {f({\boldsymbol x},{\boldsymbol y})} and the {Y}-player the loss {-f({\boldsymbol x},{\boldsymbol y})}. It is important to understand that this is only one round, that is, there is only one play for each player. Note that the standard game-theoretic terminology uses payoffs instead of losses, but we will keep using losses for coherence with the online convex optimization notation that I use on this blog.

We consider the so-called two-person normal-form games, that is, games where the first player has {n} possible actions and the second player {m}. A player can use a pure strategy, that is, a single fixed action, or randomize over the set of actions according to some probability distribution, a so-called mixed strategy. In this case, we consider {X=\Delta^n} and {Y=\Delta^m}, and they are known as the action spaces of the two players. In this setting, for a pair of pure strategies {({\boldsymbol e}_i,{\boldsymbol e}_j)}, the first player receives the loss {f({\boldsymbol e}_i,{\boldsymbol e}_j)} and the second player {-f({\boldsymbol e}_i,{\boldsymbol e}_j)}, where {{\boldsymbol e}_i} is the vector with all zeros but a ‘1’ in position {i}. The goal of each player is to minimize the received loss. Given the discrete nature of this game, the function {f({\boldsymbol x},{\boldsymbol y})} is the bilinear function {{\boldsymbol x}^\top M {\boldsymbol y}}, where {M} is a matrix with {n} rows and {m} columns. Hence, for a pair of mixed strategies {({\boldsymbol x},{\boldsymbol y})}, the expected loss of the first player is {{\boldsymbol x}^\top M {\boldsymbol y}} and the one of the second player is {-{\boldsymbol x}^\top M {\boldsymbol y}}.
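For concreteness, here is a minimal sketch in Python (using NumPy) of how the expected losses are computed for a pair of mixed strategies; the matrix and the strategies below are made up just for the illustration.

import numpy as np

# Loss matrix of the first player: n = 2 actions against m = 3 actions.
M = np.array([[0.0, 1.0, -1.0],
              [2.0, 0.0,  1.0]])

x = np.array([0.3, 0.7])         # mixed strategy of the X-player, a point in the simplex
y = np.array([0.5, 0.25, 0.25])  # mixed strategy of the Y-player, a point in the simplex

loss_x = x @ M @ y   # expected loss of the first player
loss_y = -loss_x     # expected loss of the second player (zero-sum)
print(loss_x, loss_y)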

A fundamental concept in game theory is the one of Nash equilibrium. We have a Nash equilibrium if every player is playing a best response to the other players’ strategies. That is, no player has an incentive to change their strategy if the other players do not change theirs. For the zero-sum two-person game, this can be formalized by saying that {({\boldsymbol x}^\star,{\boldsymbol y}^\star)} is a Nash equilibrium if

\displaystyle ({\boldsymbol x}^\star)^\top M {\boldsymbol y} \leq ({\boldsymbol x}^\star)^\top M {\boldsymbol y}^\star \leq {\boldsymbol x}^\top M {\boldsymbol y}^\star, \quad \forall {\boldsymbol x} \in \Delta^n, {\boldsymbol y} \in \Delta^m~.

This is exactly the definition of saddle-point for the function {f({\boldsymbol x},{\boldsymbol y})} that we gave last time. Given that {f({\boldsymbol x},{\boldsymbol y})={\boldsymbol x}^\top M {\boldsymbol y}} is continuous, convex in the first argument and concave in the second one, and the sets {X=\Delta^n} and {Y=\Delta^m} are convex and compact, we can deduce from Theorem 5 and Theorem 3 of my previous post that a saddle-point always exists. Hence, there is always at least one (possibly mixed) Nash equilibrium in two-person zero-sum games. The common value of the minimax and maximin problems is called the value of the game and we will denote it by {v^\star}.
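Since the objective is bilinear and the constraint sets are simplices, the minimax problem can also be written as a linear program and solved exactly with an off-the-shelf solver. Here is a hedged sketch using scipy.optimize.linprog; the auxiliary-variable reformulation is standard, but this snippet is only an illustration, not the method discussed in this post.

import numpy as np
from scipy.optimize import linprog

def game_value(M):
    """Value of the zero-sum game and a minimax strategy for the row player.

    Solves  min_{x in simplex} v  subject to  M^T x <= v * 1,
    which is the LP reformulation of min_x max_y x^T M y.
    """
    n, m = M.shape
    # Variables: [x_1, ..., x_n, v]; minimize v.
    c = np.zeros(n + 1)
    c[-1] = 1.0
    # M^T x - v <= 0 for every column j (every pure strategy of the second player).
    A_ub = np.hstack([M.T, -np.ones((m, 1))])
    b_ub = np.zeros(m)
    # sum_i x_i = 1.
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])
    b_eq = np.ones(1)
    bounds = [(0, None)] * n + [(None, None)]  # x >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:n]  # value of the game, minimax strategy of the row player

# A made-up 2 x 3 loss matrix, just to exercise the function.
M_example = np.array([[1.0, -1.0, 0.5],
                      [-0.5, 2.0, -1.0]])
print(game_value(M_example))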

For zero-sum two-person games the Nash equilibrium has an immediate interpretation: From the definition above, if the first player uses the strategy {{\boldsymbol x}^\star}, then his loss is at most the value of the game {v^\star}, regardless of the strategy of the second player. Analogously, if the second player uses the strategy {{\boldsymbol y}^\star}, then his loss is at most {-v^\star}, regardless of the strategy of the first player. Both players achieve the value of the game if they both play the Nash strategy. Moreover, even if one of the players announced his strategy in advance to the other player, his expected loss would not increase.

Example 1 (Cake cutting). Suppose we have a game between two people: The first player cuts the cake in two and the second one chooses a piece; the first player receives the piece that was not chosen. We can formalize it with the following matrix

                 larger piece    smaller piece
cut evenly             0                0
cut unevenly          10              -10

When the first player plays action {i} and the second player action {j}, the first player receives the loss {M(i,j)} and the second player receives {-M(i,j)}. The losses represent how much less the first player receives, in percentage points, compared to half of the cake; the second player receives the negative of the same number. It should be easy to convince oneself that the strategy pair {(\text{cut evenly}, \text{larger piece})} is an equilibrium with value of the game equal to 0.

Example 2 (Rock-paper-scissors). Let’s consider the game of Rock-paper-scissors. We describe it with the following matrix

            Rock    Paper   Scissors
Rock          0       1        -1
Paper        -1       0         1
Scissors      1      -1         0

It should be immediate to realize that there are no pure Nash equilibria for this game. However, there is a mixed Nash equilibrium when each player randomizes over the actions with a uniform probability distribution, and the value of the game is equal to 0.
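As a quick numerical sanity check (a minimal sketch, not part of the original argument), we can verify that against the uniform strategy every pure action of the opponent yields expected loss 0, so neither player can do better than the value of the game by deviating.

import numpy as np

M = np.array([[ 0.,  1., -1.],
              [-1.,  0.,  1.],
              [ 1., -1.,  0.]])
u = np.ones(3) / 3  # uniform mixed strategy

print(M @ u)   # loss of each pure row action against the uniform column player: all zeros
print(u @ M)   # loss of the uniform row player against each pure column action: all zeros

Since every pure deviation achieves exactly the value 0, the uniform pair satisfies the Nash equilibrium inequalities.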

Figure 1. The saddle point in the Matching Pennies game.

Example 3 (Matching Pennies). In this game, both players show a face of a penny. If the two faces are the same, the first player wins both, otherwise the second player wins both. The associated matrix {M} is

          head    tail
head       -1       1
tail        1      -1

It is easy to see that the Nash equilibrium is when both players randomize the face to show with equal probability.

In this simple case, we can visualize the saddle point associated with this problem in Figure 1.
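Indeed, writing the mixed strategies as {{\boldsymbol x}=(p,1-p)} and {{\boldsymbol y}=(q,1-q)}, where {p} and {q} are the probabilities of playing head, a direct calculation (easy to verify, though not part of the original figure) gives

\displaystyle f(p,q) = {\boldsymbol x}^\top M {\boldsymbol y} = 2p+2q-4pq-1 = -(1-2p)(1-2q)~.

This function is linear in {p} for fixed {q} and linear in {q} for fixed {p}, and it vanishes whenever {p=\frac12} or {q=\frac12}, so {(\frac12,\frac12)} satisfies the saddle-point inequalities with value of the game 0.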

Unless the game is very small, we find Nash equilibria using numerical procedures that typically give us only approximate solutions. Hence, as for {\epsilon}-saddle points, we also define an {\epsilon}-Nash equilibrium for a zero-sum two-person game when {{\boldsymbol x}^\star} and {{\boldsymbol y}^\star} satisfy

\displaystyle ({\boldsymbol x}^\star)^\top M {\boldsymbol y} -\epsilon \leq ({\boldsymbol x}^\star)^\top M {\boldsymbol y}^\star \leq {\boldsymbol x}^\top M {\boldsymbol y}^\star + \epsilon, \quad \forall {\boldsymbol x} \in \Delta^n, {\boldsymbol y} \in \Delta^m~.

Obviously, any Nash equilibrium is also an {\epsilon}-Nash equilibrium.

From what we said last time, it should be immediate to see how to numerically calculate the Nash equilibrium of a two-person zero-sum game. In fact, we know that we can use online convex optimization algorithms to find {\epsilon}-saddle points, so we can do the same for {\epsilon}-Nash equilibria of two-person zero-sum games. Assuming that the average regret of both players is {\epsilon_T}, the theorem we saw last time says that {\left(\frac{1}{T}\sum_{t=1}^T {\boldsymbol x}_t, \frac{1}{T}\sum_{t=1}^T {\boldsymbol y}_t\right)} is a {2\epsilon_T}-Nash equilibrium.
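To make this concrete, here is a minimal self-play sketch in Python, where both players run multiplicative weights updates (one member of the EG/Hedge family, with a simple fixed step size chosen for the illustration rather than the tuning from the previous post) and the averaged strategies give the approximate equilibrium.

import numpy as np

def approx_nash(M, T=10000):
    """Self-play with multiplicative weights on a zero-sum matrix game.

    The X-player minimizes x^T M y_t, the Y-player maximizes x_t^T M y.
    The averaged iterates form an approximate (2*eps_T) Nash equilibrium,
    where eps_T is the average regret of the two players.
    """
    n, m = M.shape
    eta_x = np.sqrt(np.log(n) / T)  # fixed step sizes: an assumption of this sketch
    eta_y = np.sqrt(np.log(m) / T)
    wx, wy = np.zeros(n), np.zeros(m)          # log-weights of the two players
    x_avg, y_avg = np.zeros(n), np.zeros(m)
    for _ in range(T):
        x = np.exp(wx - wx.max()); x /= x.sum()
        y = np.exp(wy - wy.max()); y /= y.sum()
        x_avg += x / T
        y_avg += y / T
        wx -= eta_x * (M @ y)     # X-player: multiplicative weights on its loss vector
        wy += eta_y * (M.T @ x)   # Y-player: multiplicative weights on its gain vector
    return x_avg, y_avg

# Rock-paper-scissors: the averaged strategies approach the uniform equilibrium, value 0.
M = np.array([[0, 1, -1], [-1, 0, 1], [1, -1, 0]], dtype=float)
x_bar, y_bar = approx_nash(M)
print(x_bar, y_bar, x_bar @ M @ y_bar)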

2. Boosting as a Two-Person Game

We will now show that the Boosting problem can also be seen as the solution of a zero-sum two-person game. In fact, this is my favorite explanation of Boosting because it focuses on the problem rather than on a specific algorithm.

Let {S=\{({\boldsymbol z}_i,y_i)\}_{i=1}^m} be a training set of {m} feature/label pairs, where {{\boldsymbol z}_i \in {\mathbb R}^d} and {y_i \in \{-1,1\}}. Let {\mathcal{H} = \{h_i\}_{i=1}^n} be a set of {n} functions {h_i:{\mathbb R}^d \rightarrow \{-1,1\}}.

The aim of boosting is to find a combination of the functions in {\mathcal{H}} that has arbitrarily low misclassification error on {S}. Of course, this is not always possible. However, we will make a (strong) assumption: It is always possible to find a function {h_i \in \mathcal{H}} such that its misclassification error over {S} weighted with any probability distribution is better than chance by a constant {\gamma>0}. We now show that this assumption guarantees that the boosting problem is solvable!

First, we construct a matrix of the misclassifications for each function: {M \in {\mathbb R}^{n \times m}} where {M(i,j)=\begin{cases} 1, & \text{if } h_i({\boldsymbol z}_j)\neq y_j\\ 0, & \text{otherwise}\end{cases}}. Setting {X=\Delta^n} and {Y=\Delta^m}, we write the saddle-point problem/two-person zero-sum game

\displaystyle \min_{{\boldsymbol p} \in \Delta^n} \max_{{\boldsymbol q} \in \Delta^m} \ {\boldsymbol p}^\top M {\boldsymbol q}~.

Given the definition of the matrix {M}, this is equivalent to

\displaystyle \min_{{\boldsymbol p} \in \Delta^n} \max_{{\boldsymbol q} \in \Delta^m} \ \sum_{i=1}^n \sum_{j=1}^m p_i q_j \boldsymbol{1}[h_i({\boldsymbol z}_j)\neq y_j]~.
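In practice the mistake matrix can be built directly from the hypotheses and the training set; here is a minimal sketch (the function name and the data layout are illustrative assumptions).

import numpy as np

def mistake_matrix(hypotheses, Z, y):
    """M[i, j] = 1 if hypothesis h_i misclassifies sample j, and 0 otherwise.

    hypotheses: a list of n functions h(z) -> {-1, +1}
    Z: array of shape (m, d) with the features; y: array of shape (m,) with labels in {-1, +1}
    """
    return np.array([[1.0 if h(z) != label else 0.0 for z, label in zip(Z, y)]
                     for h in hypotheses])

# For distributions p over the hypotheses and q over the samples,
# p @ M @ q is exactly the double sum above.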

Let’s now formalize the assumption on the functions: We assume the existence of a weak learning oracle: for any {{\boldsymbol q} \in \Delta^m}, the oracle returns {i^\star} such that

\displaystyle {\boldsymbol e}_{i^\star}^\top M {\boldsymbol q} = \sum_{j=1}^m q_j \boldsymbol{1}[h_{i^\star}({\boldsymbol z}_j)\neq y_j] \leq \frac{1}{2} - \gamma,

where {\gamma>0}. In words, {i^\star} is the index of the function in {\mathcal{H}} that gives a {{\boldsymbol q}}-weighted error better than chance. Moreover, given that

\displaystyle \min_{{\boldsymbol p} \in \Delta^n} {\boldsymbol p}^\top M {\boldsymbol q} = \min_{i=1,\dots,n} {\boldsymbol e}_i^\top M {\boldsymbol q} \leq {\boldsymbol e}_{i^\star}^\top M {\boldsymbol q} \leq \frac{1}{2} - \gamma

and

\displaystyle v^\star = \max_{{\boldsymbol q} \in \Delta^m} \min_{{\boldsymbol p} \in \Delta^n} {\boldsymbol p}^\top M {\boldsymbol q},

we have that the value of the game satisfies {v^\star \leq \frac12 - \gamma < \frac12}. Using the inequality on the value of the game and the fact that the Nash equilibrium exists, we obtain that there exists {{\boldsymbol p}^\star \in \Delta^n} such that for any {j=1,\dots,m}

\displaystyle \label{eq:boosting_eq1} \sum_{i=1}^n p^\star_i \boldsymbol{1}[h_i({\boldsymbol z}_j) \neq y_j] = ({\boldsymbol p}^\star)^\top M {\boldsymbol e}_j \leq v^\star \leq \frac12 - \gamma < \frac12~. \ \ \ \ \ (1)

In words, this means that every sample {{\boldsymbol z}_j} is misclassified by less than half of the functions {h_i} when weighted by {{\boldsymbol p}^\star}. Hence, we can correctly classify all the samples using a weighted majority vote rule where the weights over the functions {h_i} are given by {{\boldsymbol p}^\star}. This means that we can learn a perfect classification rule using weak learners, through the solution of a minimax game. So, our job is to find a way to calculate this optimal distribution {{\boldsymbol p}^\star} over the functions.

Given what we have said till now, a natural strategy is to use online convex optimization algorithms. In particular, we can use Algorithm 3 from my previous post, where in each round the {X}-player is the weak learning oracle, which knows the play {{\boldsymbol q}_t} of the {Y}-player, while the {Y}-player is an Online Convex Optimization (OCO) algorithm. Specialized to our setting, we have the following algorithm. In words, the {X}-player looks for the function that has small enough weighted misclassification loss, where the weights are constructed by the {Y}-player.


[Algorithm: boosting with an OCO algorithm for the {Y}-player.]
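Since the pseudocode is not reproduced here, the following Python sketch shows one way the scheme might be implemented: the {Y}-player runs an EG/Hedge update over the samples, the {X}-player is an idealized weak-learning oracle that returns the hypothesis with the smallest {{\boldsymbol q}_t}-weighted error, and the output is the frequency with which each hypothesis was selected. The step size and the concrete oracle are illustrative assumptions, not necessarily the exact choices of Algorithm 3 from the previous post.

import numpy as np

def boost_with_oco(M, T):
    """Boosting as a two-person zero-sum game.

    M: n x m mistake matrix, M[i, j] = 1 if h_i misclassifies sample j.
    Returns an approximation of p_star: the frequency with which each hypothesis was selected.
    """
    n, m = M.shape
    eta = np.sqrt(np.log(m) / T)  # EG/Hedge step size for the Y-player (an illustrative choice)
    w = np.zeros(m)               # log-weights over the training samples
    counts = np.zeros(n)
    for _ in range(T):
        q = np.exp(w - w.max())
        q /= q.sum()                            # Y-player's play: a distribution over the samples
        weighted_errors = M @ q                 # q-weighted error of each hypothesis
        i_t = int(np.argmin(weighted_errors))   # weak-learning oracle: assumed <= 1/2 - gamma
        counts[i_t] += 1
        w += eta * M[i_t]                       # increase the weight of the samples h_{i_t} misclassifies
    return counts / T

# The final predictor is the weighted majority vote sign(sum_i p_star[i] * h_i(z)).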

Let’s show a guarantee on the misclassification error of this algorithm. From the second equality of Theorem 6 in my previous post, for any {{\boldsymbol q} \in \Delta^m}, we have

\displaystyle \frac{1}{T} \sum_{t=1}^T {\boldsymbol e}_{i_t}^\top M {\boldsymbol q} = \frac{1}{T} \sum_{t=1}^T {\boldsymbol e}_{i_t}^\top M {\boldsymbol q}_t + \frac{\text{Regret}^Y_T({\boldsymbol q})}{T}~.

From the assumption on the weak-learnability oracle, we have {\frac{1}{T} \sum_{t=1}^T {\boldsymbol e}_{i_t}^\top M {\boldsymbol q}_t\leq \frac12 - \gamma}. Moreover, choosing {{\boldsymbol q}={\boldsymbol e}_j} we have {\frac{1}{T} \sum_{t=1}^T {\boldsymbol e}_{i_t}^\top M {\boldsymbol q} = \frac1T \sum_{t=1}^T \boldsymbol{1}[h_{i_t}({\boldsymbol z}_j) \neq y_j]}. Putting all together, for any {j=1,\dots,m}, we have

\displaystyle \frac1T \sum_{t=1}^T \boldsymbol{1}[h_{i_t}({\boldsymbol z}_j) \neq y_j] \leq \frac12 - \gamma + \frac{\text{Regret}^Y_T({\boldsymbol e}_j)}{T}~.

If {\frac{\text{Regret}^Y_T({\boldsymbol e}_j)}{T}< \gamma}, less than half of the functions selected by the boosting procedure make a mistake on {({\boldsymbol z}_j,y_j)}. Given that this holds for every {j} and that the predictor is a majority vote, the majority vote makes 0 mistakes on the training samples.

In this scheme, we construct {p^\star_i} by approximating it with the frequency with which {i_t} is equal to {i}.

Let’s now instantiate this framework with a specific OCO algorithm. For example, using Exponentiated Gradient (EG) as the algorithm for the {Y}-player, we have that {\text{Regret}^Y_T({\boldsymbol e}_j)\leq O(\sqrt{T \ln m})} for any {j=1,\dots,m}, which implies that after {T = O(\frac{\ln m}{\gamma^2})} rounds the training error is exactly 0. This is exactly the same guarantee achieved by AdaBoost (Freund and Schapire, 1997).

Boosting and Margins. What happens if we keep boosting after the training error reaches 0? It turns out we maximize the margin, defined as {y_j \bar{h}_T({\boldsymbol z}_j)}, where {\bar{h}_T({\boldsymbol z}) = \frac{1}{T}\sum_{t=1}^T h_{i_t}({\boldsymbol z})}. In fact, given that {\boldsymbol{1}[h_{i_t}({\boldsymbol z}_j)\neq y_j] = \frac{1-y_j h_{i_t}({\boldsymbol z}_j)}{2}}, we have for any {j =1, \dots, m}

\displaystyle 2 \gamma - \frac{2 \text{Regret}^Y_T({\boldsymbol e}_j)}{T} \leq y_j \frac{1}{T} \sum_{t=1}^T h_{i_t}({\boldsymbol z}_j) = y_j \bar{h}_T({\boldsymbol z}_j)~.

Hence, when the number of rounds goes to infinity, the minimum margin on the training samples is guaranteed to be at least {2 \gamma} in the limit. This property of boosting of maximizing the margin has been used as an explanation of the fact that, in boosting, additional rounds after the training error reaches 0 often keep improving the error on test samples coming i.i.d. from the same distribution that generated the training samples.

The above reduction does not tell us precisely how the training error behaves. However, we can get this information by changing the learning-with-experts algorithm. Indeed, if you read my old posts, you should know that there are learning-with-experts algorithms provably better than EG. We can use a learning-with-experts algorithm that guarantees a regret {O(\sqrt{T \cdot KL({\boldsymbol q};{\boldsymbol \pi})})} for a given prior {{\boldsymbol \pi}}, where here {{\boldsymbol \pi}} is the uniform prior. This kind of algorithm allows us to upper bound the fraction of mistakes after any {T} iterations. Denote by {k} the number of misclassified samples after {T} iterations of boosting and set {{\boldsymbol q}_k} as the vector whose coordinate {j} is equal to {1/k} if {\rm sign(\bar{h}_T({\boldsymbol z}_j))\neq y_j} and 0 otherwise. Hence, we have

\displaystyle \label{eq:better_bound_boosting} 2 \gamma - \frac{2 \text{Regret}^Y_T({\boldsymbol q}_k)}{T} \leq \frac{1}{k}\sum_{j:\, {\rm sign}(\bar{h}_T({\boldsymbol z}_j))\neq y_j} y_j \bar{h}_T({\boldsymbol z}_j) \leq 0~. \ \ \ \ \ (2)

Using the expression of the regret, we have that the fraction of misclassification errors {\frac{k}{m}} is bounded by {O(\exp(-\gamma^2 T))}. That is, the fraction of misclassification errors goes to zero exponentially fast in the number of boosting rounds. Again, a similar guarantee was proved for AdaBoost, but here we quickly derived it using a reduction from boosting to learning with experts, passing through two-person games.
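To spell out that last step (a quick calculation under the stated assumptions): with the uniform prior, {KL({\boldsymbol q}_k;{\boldsymbol \pi}) = \ln\frac{m}{k}}, so plugging a regret bound of the form {\text{Regret}^Y_T({\boldsymbol q}_k)\leq c\sqrt{T\,KL({\boldsymbol q}_k;{\boldsymbol \pi})}}, for some constant {c}, into (2) gives

\displaystyle 2\gamma \leq \frac{2c}{T}\sqrt{T \ln\tfrac{m}{k}} \quad \Longrightarrow \quad \gamma^2 T \leq c^2 \ln\tfrac{m}{k} \quad \Longrightarrow \quad \frac{k}{m} \leq \exp\left(-\tfrac{\gamma^2 T}{c^2}\right)~.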

3. History Bits

The reduction from boosting to two-person games and learning with experts is from Freund and Schapire (1996). It seems that the question of whether a weak learner can be boosted into a strong learner was originally posed by Kearns and Valiant (1988) (see also Kearns (1988)), but I could not verify this claim because I could not find their technical report anywhere. It was answered in the positive by Schapire (1990). The AdaBoost algorithm is from Freund and Schapire (1997). The idea of using algorithms that guarantee a KL regret bound in (2) is from Luo and Schapire (2014).

Acknowledgements

Thanks to Nicolò Campolongo for hunting down my typos.
