# The Negative Gradient Does Not Point Towards the Minimum

In this post, we will explain how Gradient Descent (GD) works and why it can converge very slowly.

The simplest first-order optimization algorithm is Gradient Descent. It is used to minimize a convex differentiable function over ${{\mathbb R}^d}$. The update rule is simply ${{\boldsymbol x}_{t+1} = {\boldsymbol x}_t - \eta_t \nabla f({\boldsymbol x}_t)}$, and the pseudo-code is in Algorithm 1.

Last time we saw that Subgradient Descent (SD) is not a descent method. What about GD? GD with an appropriate choice of the stepsizes is a descent method, that is ${f({\boldsymbol x}_{t+1}). However, GD is far from being perfect, because the negative gradient does not exactly point towards the minimum. Let’s see why.

Condition Number. To better understand this concept we will use a nice property of gradients, often overlooked. First, let’s define what is the level set of a function, that is the set of all the points ${{\boldsymbol x}}$ that have the same value of ${f}$.

Definition 1 The ${\alpha}$-level set of a function ${f}$ is defined as ${S_\alpha=\{{\boldsymbol x}: f({\boldsymbol x})=\alpha\}}$.

To visualize the level set, consider the two-dimensional example of a topographic map in Figure 1. The black lines are exactly the level sets corresponding to different values of the 2-d function, that in this case are just the altitudes.

We can now state the following proposition.

Proposition 2 Let ${f}$ a differentiable function from ${{\mathbb R}^d}$ to ${{\mathbb R}}$ and ${{\boldsymbol x}\in {\mathbb R}^d}$. Then, ${\nabla f({\boldsymbol x})}$ is orthogonal to the level set ${S_{f({\boldsymbol x})}}$ at ${{\boldsymbol x}}$.

We will not prove it, because there is not much to learn from the proof for us.

What does it mean in words? It means that if I draw the level sets of a function, I know immediately in which direction the gradient goes. This is very powerful to develop a geometric intuition of how GD works. For example, if we take even simple convex two-dimensional functions, we can see that the negative gradient does not really point where we would like it to point.

Let’s see an example. Consider the 2-d function ${f(x,y)=x_1^2+0.5 x_2^2+1.2 x_1 x_2}$ in Figure 2, that has the minimum in ${(0,0)}$.

Figure 2. Gradient, level set, and behavior of GD.

The negative gradient in the point ${(3,-3)}$ is ${(-2.4, 0.6)}$, in black arrow in Figure 2. As stated above, this gradient is orthogonal to the level set ${S_{f(3,-3)}}$. See how the negative gradient is pointing in a direction that is not towards the minimum. In general, for convex functions, the angle between the negative gradient in a point ${{\boldsymbol x}}$ and the vector the connects the minimum and ${{\boldsymbol x}}$ can be arbitrarily close to 90 degrees.

Should we be worried about it? Definitely! The fact that the negative gradient does not point directly to the minimum is exactly what slows down the convergence of GD. In fact, even selecting the step size in the optimal way, i.e. choosing the stepsize that guarantees the maximum decrease of the function at each step, we obtain the behavior in Figure 2. Did you expect the sawtooth path of GD?

Figure 3. The negative gradient sometimes does point towards the minimum.

On the other hand, if we consider the function ${f(x,y)=x_1^2+x_2^2}$, that has the minimum in ${(0,0)}$, the level sets are circles centered in the origin, see Figure 3. So, in any point the negative gradient will point exactly towards the minimum! In this case, GD with an appropriate stepsize will reach the solution in one step.

Can we capture in some way the difference between these two cases? Yes, with the condition number, exactly that obscure concept that Ali Rahimi mentioned in his very controversial talk.

We will define the condition number as the maximum eigenvalue divided by the minimum one of the Hessian of a function. We can show that a high condition number corresponds to a slow convergence, while a low condition number (the minimum is of course 1) corresponds to a fast convergence. Note that for a two-dimensional quadratic function, the maximum and minimum eigenvalue are proportional to the length of the axes of the ellipses formed by the level sets. So, it exactly captures the geometry of our two examples above.

Convergence Guarantee. It is now time for some math and to put all this together and to prove the convergence rate of gradient descent, that we expect to depend on the condition number. The proof works by showing that in each step we decrease the value of the objective function proportional to the norm of the gradient, and that the norm of the gradient itself depends on the suboptimality gap.

Theorem 3 Let ${f}$ a convex twice differentiable function from ${{\mathbb R}^d}$ to ${{\mathbb R}}$, and be ${M}$ the maximum eigenvalue of its Hessian and ${m}$ the minimum one. Set ${\eta_t = \frac{1}{M}}$. Then we have

$\displaystyle f({\boldsymbol x}_{T}) - f({\boldsymbol x}^\star) \leq \left(1-\frac{m}{M}\right)^{T}\left(f({\boldsymbol x}_0) - f({\boldsymbol x}^\star)\right)~.$

Proof: Denoting by ${H({\boldsymbol x})}$ the Hessian of ${f}$ in ${{\boldsymbol x}}$, from the Taylor expansion of ${f}$ around ${{\boldsymbol x}_{T-1}}$, we have

$\displaystyle f({\boldsymbol x}_T) =f({\boldsymbol x}_{T-1}) + \langle \nabla f({\boldsymbol x}_{T-1}), {\boldsymbol x}_T - {\boldsymbol x}_{T-1}\rangle + \frac{1}{2} \left\langle ({\boldsymbol x}_T-{\boldsymbol x}_{T-1}) H({\boldsymbol \xi}_{T-1}), {\boldsymbol x}_T-{\boldsymbol x}_{T-1}\right\rangle,$

for some ${{\boldsymbol \xi}_{T-1}}$ on the line segment between ${{\boldsymbol x}_T}$ and ${{\boldsymbol x}_{T-1}}$. Hence, the assumption of the maximum eigenvalue of the Hessian implies

\displaystyle \begin{aligned} f({\boldsymbol x}_T) &\leq f({\boldsymbol x}_{T-1}) + \langle \nabla f({\boldsymbol x}_{T-1}), {\boldsymbol x}_T - {\boldsymbol x}_{T-1}\rangle + \frac{M}{2}\|{\boldsymbol x}_T-{\boldsymbol x}_{T-1}\|_2^2 \\ &= f({\boldsymbol x}_{T-1}) - \eta_{T-1} \|\nabla f({\boldsymbol x}_{T-1})\|_2^2 + \frac{\eta_{T-1}^2 M}{2}\|\nabla f({\boldsymbol x}_{T-1})\|_2^2 \\ &= f({\boldsymbol x}_{T-1}) + \left(\frac{\eta_{T-1}^2M}{2}- \eta_{T-1}\right) \|\nabla f({\boldsymbol x}_{T-1})\|_2^2~. \end{aligned}

Note that our definition of the step size ${\eta_{T-1}=\frac{1}{M}}$, maximize the decrement of the function in ${{\boldsymbol x}_T}$. Indeed, we have

$\displaystyle \label{eq:gd_conv_eq1} f({\boldsymbol x}_T) \leq f({\boldsymbol x}_{T-1}) -\frac{1}{2M} \|\nabla f({\boldsymbol x}_{T-1})\|_2^2~. \ \ \ \ \ (1)$

Now, using again the Taylor expansion, for some ${{\boldsymbol \xi}'_{T-1}}$ on the line segment between ${{\boldsymbol x}_{T-1}}$ and ${{\boldsymbol x}_{T-1}-\frac{1}{m}\nabla f({\boldsymbol x}_{T-1})}$, we have

\displaystyle \begin{aligned} f\left({\boldsymbol x}_{T-1}-\frac{1}{m}\nabla f({\boldsymbol x}_{T-1})\right) &= f({\boldsymbol x}_{T-1}) -\frac{1}{m} \| \nabla f({\boldsymbol x}_{T-1})\|^2_2 + \frac{1}{2m^2} \langle \nabla f({\boldsymbol x}_{T-1}) H({\boldsymbol \xi}_{T-1}), \nabla f({\boldsymbol x}_{T-1})\rangle \\ &\geq f({\boldsymbol x}_{T-1}) -\frac{1}{m} \| \nabla f({\boldsymbol x}_{T-1})\|^2_2 + \frac{1}{2m} \| \nabla f({\boldsymbol x}_{T-1}) \|^2_2~. \end{aligned}

Hence, using the fact that ${f({\boldsymbol x}^\star)\leq f({\boldsymbol x}_{T-1}-\frac{1}{m}\nabla f({\boldsymbol x}_{T-1}))}$, we have

$\displaystyle \label{eq:gd_conv_eq2} f({\boldsymbol x}_{T-1}) - f({\boldsymbol x}^\star) \leq \frac{1}{2m} \| \nabla f({\boldsymbol x}_{T-1}) \|^2_2~. \ \ \ \ \ (2)$

Putting (1) and (2) together and subtracting ${f({\boldsymbol x}^\star)}$ to both sides, we have

\displaystyle \begin{aligned} f({\boldsymbol x}_{T}) - f({\boldsymbol x}^\star) \leq \left(1-\frac{m}{M}\right)(f({\boldsymbol x}_{T-1}) - f({\boldsymbol x}^\star)) \leq \left(1-\frac{m}{M}\right)^{T}(f({\boldsymbol x}_0) - f({\boldsymbol x}^\star))~. \end{aligned}

$\Box$

So, we just proved that in GD the suboptimality gap shrinks exponentially fast! And the exponent depends on the condition number. Also, this rate is almost optimal, using acceleration we can depend on the square root of the condition number.

A curiosity: This rate of convergence is called linear, because when you plot the logarithm of the suboptimality with respect to the number of iterations you get a straight line.

How to Choose the Stepsizes for GD in Practice? In the above theorem we assumed the stepsize to be constant and equal to ${\frac{1}{M}}$. I want to remind you that the constant step size can be used because GD will automatically slow down approaching the minimum due to the smoothness of the function. Remember that this does not happen in subgradient descent on non-differentiable functions, see previous blog post.

So, in order to set the stepsize, the theorem assumes that you know the maximum eigenvalue of the Hessian of the function that you are minimizing. Also, it implicitly assumes the maximum eigenvalue to be bounded. However, the boundedness assumption can be removed, observing that GD strictly decreases the function value, so we have to worry only about the maximum eigenvalue of the Hessian inside the set ${\{{\boldsymbol x}: f({\boldsymbol x})\leq f({\boldsymbol x}_0)\}}$. But, even with this trick, we still don’t know ${M}$.

So, is this one another useless theorem? No, because the above theorem also holds if at each time step we select ${\eta_t}$ by exact line search in order to guarantee the maximum decrease in the direction of the negative gradient. We leave the minor change to the proof as an exercise. Also, we don’t have to use exact line search, you can prove that a similar guarantee holds for an inexact line search that gives a sufficient decrease of the function value: google “Backtracking line search”.

So, we just discovered that GD with a good line search method is a parameter-free algorithm! This is not so surprising: “batch” optimization is a very mature field, where the culture equally weights theoretical and empirical performance. Do you think that optimization people would be happy with algorithms that were not fully automatic? Parameter-free optimization algorithms are actually the norm and nobody would call them “parameter-free”. Instead, they would just say that this is the right way to do. And if you think about it, why would you want an optimization algorithm to have parameters? When is the last time you had to set a learning rate to invert a matrix in MATLAB/Octave/NumPy?

If you wonder why line search algorithms are not common in the machine learning literature, the reason is that we tend to prefer stochastic optimization methods, where the line search becomes non-trivial. Indeed, there are papers on this issue, but it is still an open problem. So, in the stochastic setting, we will have to use different techniques.

Now, we know what is the condition number and we know how it influences the convergence speed. Can we do something to converge faster in the case of functions with a high condition number? Yes, but we have to pay a cost. Intuitively, we could just do a change of coordinate in the examples above, to go from the difficult case to the easy case. What is the optimal transformation? It is the one that makes the new Hessian as close as possible to an identity matrix. And this is exactly what a Netwon’s algorithm does! So, under some smoothness conditions on the objective function, we can expect a Netwon algorithm to have a convergence rate that is independent of the condition number. Indeed, on convex quadratic functions, the Newton’s algorithm will always converge to the minimum in one step. However, we have to pay the computational price of running the Netwon algorithm: calculating the Hessian, inverting it, etc. Of course, we could use quasi-Netwon’s algorithms that approximate the Netwon update to keep the computation complexity low, but that is another story.