Learning Rate in Gradient Descent and Second Derivatives: 3

Approximate Ideal Step in 1 Dimension

In the previous post we showed that if we want to find the minimum of a quadratic in one dimension using “Machine Learning”, specifically gradient descent, the ideal learning rate is the inverse of the second derivative at the current point. In this post let’s try to generalize this to finding a local minimum of almost any curve y = y(x) in 1 dimension.
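To make that recap concrete, here is a minimal Python sketch; the quadratic and its coefficients are my own illustrative choice, not taken from the previous post. For y(x) = a * (x - m)**2 + c the second derivative is the constant 2 * a, and a single gradient step with learning rate equal to the inverse of that second derivative lands exactly on the minimum at x = m:

```python
# Hypothetical quadratic y(x) = a * (x - m)**2 + c, minimum at x = m.
# The coefficient values below are arbitrary example choices.
a, m, c = 3.0, 1.5, -2.0

def dy(x):  return 2.0 * a * (x - m)   # first derivative
def ddy(x): return 2.0 * a             # second derivative (constant for a quadratic)

x0 = 10.0                              # arbitrary starting point
x1 = x0 - dy(x0) / ddy(x0)             # learning rate = 1 / second derivative

print(x1)                              # prints 1.5: exactly the minimum m, after a single step
```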

Expand y(x) in a Taylor series around the initial point x0, writing dy and ddy for the first and second derivatives:

y(x) ~= y(x0) + (x - x0) * dy(x0) + (x - x0)**2 * ddy(x0) / 2 + …

The minimum X = argmin(y(x)) of this quadratic approximation is found by setting its first derivative to zero:

dy(x0) + (X - x0) * ddy(x0) ~= 0

=> δx = (X - x0) = -dy(x0) / ddy(x0).

The step is in the opposite direction to the gradient (the first derivative), and the learning rate, i.e. the factor by which the gradient is multiplied to obtain the ideal step, is the inverse of the second derivative.
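As an illustration of this rule away from the quadratic case, here is a short sketch; the curve y(x) = x**2 + exp(x) and the starting point are arbitrary choices of mine, not taken from the post. Each update multiplies the negative gradient by 1 / ddy(x) at the current point:

```python
import math

# Assumed non-quadratic example: y(x) = x**2 + exp(x), convex with a single
# minimum near x ~ -0.3517 (where 2x + exp(x) = 0).
def y(x):   return x * x + math.exp(x)
def dy(x):  return 2.0 * x + math.exp(x)   # first derivative
def ddy(x): return 2.0 + math.exp(x)       # second derivative, always positive here

x = 2.0                                    # arbitrary starting point
for i in range(6):
    x -= dy(x) / ddy(x)                    # step = -(1 / ddy(x)) * gradient
    print(f"iter {i}: x = {x:.10f}  dy(x) = {dy(x):+.2e}")
```

Because the curve is not exactly quadratic, a single step no longer lands exactly on the minimum, but the gradient shrinks rapidly from one iteration to the next; this adaptive-learning-rate update is just Newton's method applied to dy.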

The form of this result does not depend on the choice of coordinate, but a non-linear coordinate transformation will lead to a different step, since the quadratic approximation to the “valley” in y(x) is different in different coordinates.
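A quick numerical check of that claim, again with assumed example functions: take the same y(x) = x**2 + exp(x) as above, re-parameterize it with the non-linear, monotone change of coordinate x = phi(u) = u + u**3 (my choice), and compare the single ideal step computed in x with the one computed in u and mapped back:

```python
import math

# Assumed objective, as above: y(x) = x**2 + exp(x).
def dy(x):  return 2.0 * x + math.exp(x)
def ddy(x): return 2.0 + math.exp(x)

# Assumed non-linear, monotone change of coordinate x = phi(u) = u + u**3.
def phi(u):   return u + u ** 3
def dphi(u):  return 1.0 + 3.0 * u ** 2
def ddphi(u): return 6.0 * u

u0 = 1.0
x0 = phi(u0)                                # both versions start from x0 = 2.0

# Ideal step computed directly in the x coordinate.
x_after_step_in_x = x0 - dy(x0) / ddy(x0)

# Ideal step computed in the u coordinate for z(u) = y(phi(u)) via the chain rule,
# then mapped back to x.
dz  = dy(x0) * dphi(u0)                             # dz/du
ddz = ddy(x0) * dphi(u0) ** 2 + dy(x0) * ddphi(u0)  # d2z/du2
u1  = u0 - dz / ddz
x_after_step_in_u = phi(u1)

print(f"step taken in x lands at x = {x_after_step_in_x:.4f}")  # roughly 0.79
print(f"step taken in u lands at x = {x_after_step_in_u:.4f}")  # roughly 1.29
```

Starting from the same point x0 = 2, the step computed in x lands near x = 0.79 while the step computed in u lands near x = 1.29, so the ideal step really does depend on the coordinate in which the quadratic approximation is made.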

Finally, let’s generalize this result to multiple dimensions.

