Learning Rate in Gradient Descent and Second Derivatives: 1

What is a Gradient of a function, geometrically?

Why do we need to know that?

Brief summary of gradient descent: if you are on a hillside and you want to get to the bottom of the valley as quickly as possible, you don’t follow the contour, since that would keep your height the same. The iterative procedure is to go to the next lower contour, taking the shortest and steepest step possible, or equivalently, to go to the lowest possible contour while keeping your step size the same. Now you can imagine the contours as (locally) parallel sets of curves; see the figure below. But a step has a direction and a length. Intuitively, we want to step “perpendicular” to the contour: to the closest point on the lower contour. How do we show this, and how do we get “mathematically” from a set of almost parallel curves representing the function to a direction in which to reduce the function?
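
To have something concrete in hand, here is a minimal sketch of that iterative procedure in Python (the bowl-shaped function, its hand-coded gradient, the starting point and the step size are all my own illustrative choices, not anything from a particular library):

```python
import numpy as np

def f(x, y):
    # an illustrative bowl-shaped function; its contours are ellipses
    return x**2 + 3 * y**2

def grad_f(x, y):
    # hand-coded partial derivatives of f
    return np.array([2 * x, 6 * y])

point = np.array([2.0, 1.0])   # start somewhere on the "hillside"
eta = 0.1                      # step size (learning rate), chosen by hand

for _ in range(50):
    point = point - eta * grad_f(*point)   # step "against" the gradient

print(point, f(*point))   # ends up close to the minimum at (0, 0)
```

Notice that the update line simply subtracts the gradient from the point, i.e., it treats the gradient as if it were a step vector; whether that identification is justified is exactly the question below.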

Some places to read about Gradient Descent, in ascending order: https://medium.com/deep-math-machine-learning-ai/chapter-1-2-gradient-descent-with-math-d4f2871af402, …

In many of the introductory lectures on Machine Learning, for extremizing a function of one variable, it is suggested that the learning rate be set to the inverse of the second derivative. Generalizing this to multiple dimensions involves “…the Hessian…”, the matrix of second derivatives. In order to get to the business about second derivatives and Hessians, it helps (me) to understand the following.
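
As a concrete reading of “learning rate = inverse of the second derivative”, here is a hedged 1-dimensional sketch (the quadratic and its derivatives are my own toy example; in higher dimensions the division by f''(x) becomes multiplication by the inverse Hessian):

```python
def f(x):
    return (x - 3.0) ** 2 + 1.0    # illustrative quadratic, minimum at x = 3

def f_prime(x):
    return 2.0 * (x - 3.0)         # first derivative

def f_double_prime(x):
    return 2.0                     # second derivative (constant here)

x = 0.0
for _ in range(5):
    # gradient step with learning rate = 1 / f''(x), i.e., a Newton step
    x = x - f_prime(x) / f_double_prime(x)

print(x)   # 3.0 -- for a quadratic this lands on the minimum in one step
```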

A (real) function is a mapping from a space into the Reals. If it is a function of 2 variables, then given any point (x,y) the function gives you back a real number. This can be done in many ways (a short code illustration follows this list):

  • as an array of values (x, y, f(x,y)),
  • a numerical means of calculation or
  • in so-called “closed” functional form e.g. f(x, y) = y*sin(3*x) (as a shorthand for the numerical computation).
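
As a small illustration of the last two items (my own toy code, nothing canonical), the same function can live either as a closed-form recipe or as a tabulated array of values:

```python
import numpy as np

# "closed" functional form: a recipe for computing f at any (x, y)
def f(x, y):
    return y * np.sin(3 * x)

# array-of-values form: sample f on a grid and store (x, y, f(x, y)) triples
xs, ys = np.meshgrid(np.linspace(-1, 1, 5), np.linspace(-1, 1, 5))
table = np.stack([xs.ravel(), ys.ravel(), f(xs, ys).ravel()], axis=1)
print(table[:3])   # the first few (x, y, f(x, y)) rows
```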

This is a good time for me to apologize for the sucky math formatting. Any suggestions?

What does a function look like? You can show it as a surface in 3 dimensions where z = f(x,y), or as a topographic map in 2D:

where I’ve left off the contour values.
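
For the curious, here is a hedged matplotlib sketch of how one might draw both views (the plotting choices are mine; any contour-capable plotting library would do):

```python
import numpy as np
import matplotlib.pyplot as plt

def f(x, y):
    return y * np.sin(3 * x)   # the example function from above

xs, ys = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-2, 2, 200))
zs = f(xs, ys)

fig = plt.figure(figsize=(10, 4))

# left: the function as a surface z = f(x, y) in 3 dimensions
ax1 = fig.add_subplot(1, 2, 1, projection="3d")
ax1.plot_surface(xs, ys, zs, cmap="viridis")

# right: the same function as a topographic (contour) map in 2D
ax2 = fig.add_subplot(1, 2, 2)
ax2.contour(xs, ys, zs, levels=15)

plt.show()
```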

What is its gradient? The gradient of the function is the contour plot! In this case it is not a constant, just as a fluid flow may have vectors pointing in different directions at different points. But the gradient is not a vector; it is a covector.

Great, so what is a covector? Given a function, its gradient can be constructed naturally; we just did it above! A covector is a sum of gradients (but it may not itself be a gradient).

We know what a vector v^a looks like: it is an arrow with direction and magnitude. We even know how to add vectors: lay them head to tail and join the first tail to the last head:

A vector field is an arrow at every point; if you have flow lines, then it is the set of tangent vectors at every point on the lines: https://en.wikipedia.org/wiki/Vector_field

We’ve already visualized the gradient field: the contour plot above. What does a covector w_a (at a point) look like? It looks like two parallel planes: the tangent plane to the contour through the given point and the tangent plane on the next contour. In 2D, it looks like

Why is it called a covector, and what is it? A covector is a (linear) map from vectors to (in this case) the reals. This map is written as w_a v^a = w(v) in R. How do we calculate it? It is defined as the number of w’s needed to span the vector:

In this case w(v) = 2, since you have to place two w’s “head-to-toe” to span v. Two points here: first, two w’s “head-to-toe” is NOT 2*w, and second, w(v) depends on the magnitudes of v and w, as well as their relative direction.
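
In coordinates (once a basis is chosen), this counting comes out as the contraction w_a v^a: a row matrix times a column matrix. A tiny numpy sketch, with components I made up so that w(v) = 2:

```python
import numpy as np

w = np.array([[1.0, 0.0]])     # covector w_a as a row matrix
v = np.array([[2.0], [0.0]])   # vector v^a as a column matrix

# the covector "eats" the vector and returns a real number
print(w @ v)   # [[2.]] -- v crosses two of w's parallel lines/planes
```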

Covector addition:

As the covector becomes larger, the lines (or planes) get closer! Note that this also makes sense intuitively in a contour map: the steepest parts are where the contours are closely spaced.

So the main points so far are the following, for minimizing a function using gradient descent:

  • We want to know in which direction and by how much to step, which is a vector, with an “upper” index, or a tensor of type (1,0), or a column matrix
  • We know the contours of the function and hence its gradient
  • But the gradient is a covector, which has a “lower” index, or a tensor of type (0,1), or a row matrix
  • How do we get from a row matrix to a column matrix?
  • We need a row of columns, or a tensor of type (2,0), i.e., something like T^{ab}, so that v^a = T^{ab} w_b (see the sketch after this list).
  • Geometrically, we need to convert contour lines into arrows.
  • We think this is obvious, because we move in (to a very good approximation) Euclidean space.
  • But in the space of weights, where the loss is a function of the weights which we are trying to minimize, there is no natural metric, even after normalization.
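
Here is a hedged numpy sketch of that conversion: the very same gradient components w_b produce different step directions v^a = T^{ab} w_b depending on which T^{ab} is chosen, and in the space of weights nothing singles out the identity (the implicit Euclidean choice). All the numbers below are made up for illustration:

```python
import numpy as np

w = np.array([0.4, 1.2])           # components w_b of the gradient covector

T_euclid = np.eye(2)               # the implicit Euclidean choice of T^{ab}
T_other = np.array([[2.0, 0.0],
                    [0.0, 0.5]])   # an equally admissible alternative

print(T_euclid @ w)   # step direction if we assume the Euclidean metric
print(T_other @ w)    # a different direction from the very same gradient
```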

Let’s figure this out first in 1 dimension.

