Learning Rate in Gradient Descent and Second Derivatives: 2

Ideal Learning Rate for a 1-Dimensional Quadratic

(See this visualization of gradients etc.)

Consider a quadratic as an approximation to the loss function to be minimized (in one particular valley), in a one-dimensional space {x}, and assume for now that we already know its functional form.

The loss, the function to be minimized, is

y = f(x) = a*x**2 + b*x + c

its gradient or derivative at any point x is

dy(x) = f'(x) = 2*a*x + b

and its second derivative is

ddy(x) = f''(x) = 2*a

The minimum of this function is at

X = argmin(y)

found by setting the first derivative of y to zero: dy(X) = 0.

So we get X = -b/(2*a).
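
As a minimal sketch of this setup in Python (the coefficient values below are arbitrary, chosen only for illustration):

# 1-D quadratic loss, its derivatives, and the analytic minimum.
# The coefficients a, b, c are arbitrary illustrative values.
a, b, c = 2.0, -3.0, 1.0

def f(x):
    return a * x**2 + b * x + c      # loss: y = a*x**2 + b*x + c

def dy(x):
    return 2 * a * x + b             # first derivative f'(x)

def ddy(x):
    return 2 * a                     # second derivative f''(x), constant for a quadratic

X = -b / (2 * a)                     # analytic minimizer, argmin(y)
print(X, dy(X))                      # dy(X) is 0 at the minimum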

Now imagine that we are starting at some arbitrary initial point x0 and we want to get to X. Gradient descent says:

  • evaluate the derivative at x0: dy(x0)
  • step in the opposite direction to x1 = x0 + δx, where δx = -α * dy(x0)
  • But by how much? This is the learning rate α (see the sketch just after this list)
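
As a sketch, the update rule from the list above looks like this in Python (α is left as a free parameter, since choosing it is exactly the open question; dy() is the derivative from the earlier snippet):

def gd_step(x0, alpha, dy):
    return x0 - alpha * dy(x0)             # x1 = x0 + δx with δx = -α * dy(x0)

x1 = gd_step(x0=5.0, alpha=0.1, dy=dy)     # one step from an arbitrary starting point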

How do we choose α so that we reach the bottom in the minimum number of steps? We are starting at x0 and we want to get to X, so we could do it in a single step if we chose x1 = X, i.e.

δx = x1 - x0 = X - x0 = -(x0 + b/(2*a)).

The problem is that when we start at some arbitrary point x0, and all we have is a way of calculating the value of the loss function at any point, we have no idea what a, b and c are, let alone X = -b/(2*a). This is akin to walking around on a hillside, in a dense fog, with only an altimeter, and facing the problem of reaching the valley bottom.

What we do know is how to compute y(x) for any x. So we can compute y(x0). We can also compute its derivative(s). How? We can compute y(x0 + ε) and, by choosing ε “small enough”, approximate

dy(x0) ~= (y(x0 + ε) - y(x0)) / ε
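
A sketch of this forward-difference estimate (continuing the Python snippet above; the value of ε is an arbitrary “small enough” choice):

def dy_numeric(f, x0, eps=1e-6):
    return (f(x0 + eps) - f(x0)) / eps    # forward-difference estimate of f'(x0)

print(dy_numeric(f, 1.0), dy(1.0))        # numeric estimate vs. the exact 2*a*x + b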

Is this enough for us to figure out how to get to the bottom? No, because if we only know the first derivative, we are approximating the loss as a straight line and the tangent line has no information about the location of the minimum:

Both the blue and the red curves have the same derivative at x = 1, but they have minima at -0.5 and +0.25 respectively. Clearly, to get to the minimum, we need to know how the curve curves.
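
The plot itself is not reproduced here, but a pair of quadratics with exactly this property (my own choice of coefficients, not necessarily the curves that were plotted) is easy to check with the snippet above:

f_blue = lambda x: (x + 0.5)**2           # minimum at x = -0.5
f_red  = lambda x: 2.0 * (x - 0.25)**2    # minimum at x = +0.25
print(dy_numeric(f_blue, 1.0))            # ~3.0
print(dy_numeric(f_red, 1.0))             # ~3.0: same slope at x = 1, different minima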

So back to the quadratic approximation to the loss function. We know the step is related to the gradient via:

δx = -α * dy(x0)

we also know the gradient:

dy(x0) = f'(x0) = 2*a*x0 + b

and we already figured out the ideal step:

δx = x1 - x0 = X - x0 = -(x0 + b/(2*a))

Putting these together (and noting that x0 + b/(2*a) = (2*a*x0 + b)/(2*a)), we find that the learning rate

α = -δx/dy(x0) = 1/(2*a) = 1/ddy(x0), the inverse of the second derivative.

So the ideal step is given by

δx = -dy(x0) / ddy(x0)
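
Continuing the sketch, this step can be taken purely numerically, estimating both derivatives by finite differences; for an exactly quadratic loss it lands on the minimum in a single step (this is just an illustration of the formula above, not a recipe for general loss functions):

def ddy_numeric(f, x0, eps=1e-4):
    return (f(x0 + eps) - 2 * f(x0) + f(x0 - eps)) / eps**2   # central-difference f''(x0)

def ideal_step(f, x0):
    return x0 - dy_numeric(f, x0) / ddy_numeric(f, x0)        # x1 = x0 - dy(x0)/ddy(x0)

x0 = 5.0
print(ideal_step(f, x0), -b / (2 * a))    # numerically lands at the analytic minimum X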

Let’s generalize and prove this heuristic.

