Optimal Learning Rate from the Hessian: Examples
If D is the number of variables over which the loss function is to be minimized, then calculating the Hessian is O(D²), and inverting a matrix is nearly O(D³) — one and two orders greater, respectively, than simply calculating the gradient, which is O(D). Is there any value in doing this? Let’s look at some examples.
Another motivation for looking at examples is that most people (myself included) have an intuition that turns out to be wrong: unless the Hessian is a multiple of the identity, the best step is generally not in the “direction” of the gradient. How can it be that taking the steepest step is not locally the best thing to do to descend a hillside? How can it be that the gradient doesn’t define the best direction?
We’ll need to calculate the gradient of a function and then do various tensor calculations. The package ‘numdifftools’ has well-documented Gradient and Hessian functions, and ‘tensorflow’ has all the tensor operations we will need. Plus numpy, matplotlib, etc. as usual.
Example 1: Let’s start with a really simple function, a paraboloid of revolution:
def f(x):
    return x[0]**2 + x[1]**2
Here is its contour plot:
Here is the code for using nd.Gradient:
point = [-3, -3]
dfoo = nd.Gradient(f)
grad_foo = tf.reshape(tf.Variable(dfoo(point)), shape=[2,])
Result: The minimum is at the origin (0, 0). At the point (-3, -3), the gradient is (-6, -6), so the negative gradient is (6, 6); with a learning rate of 1 this would cause us to overshoot and oscillate indefinitely between (-3, -3) and (3, 3). At least the negative gradient points in the direction of the minimum. Since the “learning rate” is a free parameter, we can use anything less than 1 and we will converge (and a learning rate of 0.5 lands on the minimum in a single step).
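To make the oscillation concrete, here is a tiny sketch of plain gradient descent on this paraboloid, using the analytic gradient (2x, 2y); the helper name is mine, not from the article’s code:

```python
# Plain gradient descent on f(x, y) = x^2 + y^2, whose gradient is (2x, 2y).
def gradient_descent(start, lr, steps):
    x, y = start
    for _ in range(steps):
        x, y = x - lr * 2 * x, y - lr * 2 * y
    return x, y

# Learning rate 1 overshoots: each step maps (x, y) to (-x, -y),
# so we oscillate between (-3, -3) and (3, 3) forever.
print(gradient_descent((-3.0, -3.0), 1.0, 11))   # -> (3.0, 3.0)

# Learning rate 0.5 lands exactly on the minimum in a single step.
print(gradient_descent((-3.0, -3.0), 0.5, 1))    # -> (0.0, 0.0)
```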
This is the main part of the code for calculating the optimal step using the Hessian:
# gradient of foo
dfoo = nd.Gradient(f)

def optimal_delta_foo(point):
    grad_foo = tf.reshape(tf.Variable(dfoo(point)), shape=[2, 1])
    # Hessian (matrix of mixed partial derivatives) of foo
    hess_foo = tf.Variable(nd.Hessian(f)(point))
    # inverse of the Hessian
    inv_hess_foo = tf.matrix_inverse(hess_foo, adjoint=False, name=None)
    # optimal vector change in the variables, note the '-'
    optimal_delta = -tf.matmul(inv_hess_foo, grad_foo)
    # reshape and return optimal_delta
    optimal_delta = tf.reshape(optimal_delta, shape=[1, 2])
    return optimal_delta
The results (which can be checked by hand) are

inverse of the Hessian:
[[ 0.5  0. ]
 [ 0.   0.5]]
optimal vector change in variables:
[[ 3.  3.]]
So for this case at least, using the Hessian nails the minimum in one step! Further, this “Hessian” step points along the same direction as the negative gradient — as it must, since the Hessian here is a multiple of the identity. So we’ve gained the correct scaling, but not much more.
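The one-step result is easy to verify without TensorFlow. Here is a minimal sketch of the same computation in pure Python, with the 2×2 matrix inverse written out explicitly (the helper name is mine):

```python
def newton_step(grad, hess):
    """Return the optimal step -H^{-1} g for a 2x2 Hessian H and gradient g."""
    (a, b), (c, d) = hess
    det = a * d - b * c
    gx, gy = grad
    # inverse of [[a, b], [c, d]] is [[d, -b], [-c, a]] / det
    return (-(d * gx - b * gy) / det, -(-c * gx + a * gy) / det)

# Paraboloid at (-3, -3): gradient (-6, -6), Hessian [[2, 0], [0, 2]]
print(newton_step((-6.0, -6.0), [[2.0, 0.0], [0.0, 2.0]]))  # -> (3.0, 3.0)
```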
Example 2: Squished paraboloid (yes, this is a technical term)
def f(x):
    return x[0]**2 - x[0] * x[1]/2. + (x[1]/3.)**2
and its contour plot
Note that the minimum is still at the origin, but already we see that the negative gradient at the point (-3, -3), which is (4.5, -0.83), doesn’t even “point” in the correct direction! So our intuition about the gradient being the correct direction is not right.
Calculating the Hessian step, we get:

Hessian:
[[ 2.         -0.5       ]
 [-0.5         0.22222222]]
inverse of the Hessian:
[[ 1.14285714  2.57142857]
 [ 2.57142857 10.28571429]]
optimal vector change in variables:
[[ 3.  3.]]
Again, using the Hessian nails the minimum in one step! And to emphasize the point: in this case the gradient and the ideal step are not in the same “direction”.
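This, too, can be checked by hand, or with a short pure-Python sketch using the analytic gradient and Hessian of the squished paraboloid (variable names are mine):

```python
# Squished paraboloid f(x, y) = x^2 - x*y/2 + (y/3)^2, starting at (-3, -3).
x, y = -3.0, -3.0
gx, gy = 2 * x - y / 2.0, -x / 2.0 + 2 * y / 9.0   # analytic gradient
a, b, c, d = 2.0, -0.5, -0.5, 2.0 / 9.0            # analytic Hessian entries
det = a * d - b * c
# Newton step: -H^{-1} g, with the 2x2 inverse written out
dx = -(d * gx - b * gy) / det
dy = -(-c * gx + a * gy) / det
# The step lands (numerically) on the minimum at the origin, even though
# the negative gradient (4.5, -0.83) points somewhere else entirely.
print(x + dx, y + dy)
```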
There are many physics optimization problems that minimize a time or distance along a path on a curved surface, and here we are, loosely speaking, descending along the curved surface representing the function. The situation is similar to great circles on the surface of the earth: the shortest path between two points at the same latitude (except on the equator) does not run along the line of latitude, even though that is the “obvious” direction to head.
So far so good. But now, let’s try it on a more complicated function, similar to the function whose contour map we showed at the beginning of this article.
Example 3: Why doesn’t this work?
The 2D function we will try is:
def f(x):
    return np.sin(x[0]) ** 2 + np.sin(10 + x[0] * x[1]/2.) * np.cos(x[0])
and its contour map is
Let’s zoom in near the obvious central minimum:
Let’s do gradient descent with a scalar learning rate of 0.5, starting from the point (3.1, 1.3)
So it is sort of in the well, bouncing around a bit since the learning rate is too high.
Now, what about our “magical” Hessian ideal step? Starting at the same point as before, (3.1, 1.3), the Hessian ideal step is (-0.11, -0.44)! This is completely in the wrong direction! That step increases the value of the function to 0.97. While the steepest step may not be the best step to take (as we’ve seen with the squished paraboloid), it certainly cannot be the case that stepping uphill is best! Perhaps this method should be called “Gradient indecent”.
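It is easy to confirm that this step really does go uphill. A quick sketch (the function above rewritten with the math module for two scalar arguments; the reported step is plugged in directly):

```python
import math

def f(x, y):
    return math.sin(x) ** 2 + math.sin(10 + x * y / 2.0) * math.cos(x)

start = (3.1, 1.3)
step = (-0.11, -0.44)     # the "ideal" Hessian step reported above
after = (start[0] + step[0], start[1] + step[1])
# The function value increases: roughly 0.53 before, 0.97 after the step.
print(round(f(*start), 2), round(f(*after), 2))
```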
So what went wrong? Let’s take a look at the Hessian. Our derivation assumed that the extremum we were looking for (gradient = 0) was a minimum; that is, we implicitly assumed that the Hessian is positive definite (the slope increases in every direction). But at the point (3.1, 1.3), the Hessian is
[[ 1.20250029 -1.007762 ]
[-1.007762 -1.2574737 ]]
This has a negative determinant, implying (in 2D) that one eigenvalue is positive and one is negative, so we are likely to be heading towards a saddle point extremum.
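The Hessian entries and the sign of the determinant can be checked with a small central-difference sketch (math module only; the step size h and the helper name are my choices, not from the article’s code):

```python
import math

def f(x, y):
    return math.sin(x) ** 2 + math.sin(10 + x * y / 2.0) * math.cos(x)

def hessian_2d(f, x, y, h=1e-4):
    """Central finite-difference approximation of the 2x2 Hessian."""
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h**2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h**2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h**2)
    return fxx, fxy, fyy

fxx, fxy, fyy = hessian_2d(f, 3.1, 1.3)
det = fxx * fyy - fxy ** 2
# Matches the matrix above (~1.20, ~-1.01, ~-1.26); the determinant is negative.
print(round(fxx, 2), round(fxy, 2), round(fyy, 2), det < 0)
```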
Let’s look at a contour plot of the determinant of the Hessian:
Comparing to the plot of the function itself
we see that the region around the central minimum with a positive determinant is much smaller than what we perceive as the valley itself. The point (3.1, 1.3) we started at is squarely in the middle of a negative determinant region, quite close to a saddle point extremum at about (3, 1) and the surrounding peaks.
Another way to see what goes wrong is to look at the optimal step Hessian vector field:
The vectors at the starting point (3.1, 1.3) all point away from the central minimum; they are separated from it by a contour where the vector field has length 0, so the integral curves cannot reach the desired minimum.
What can we do?
Example 4: Start within the positive eigenvalues region
Let’s do gradient descent with a scalar learning rate of 0.5, starting from the point (3.1, 1.8)
So it is sort of in the well, but bouncing around a lot and possibly climbing out of the valley.
How does Hessian step do, starting from the point (3.1, 1.8)?
which has pretty much converged by the third iteration.
So it “works”, but does it work? Is this already part of Gradient Descent or another optimization technique?
Is it worth trying to figure out practical implementations?