Learning Rate in Gradient Descent and Second Derivatives: 3
In the previous post we showed that if we want to find the minimum of a quadratic in one dimension using “Machine Learning”, specifically gradient descent, the ideal learning rate is the inverse of the second derivative at the current point. In this post let’s try to generalize this to finding a local minimum of almost any curve y = y(x) in one dimension.
Expand y(x) in a Taylor series around the initial point x0:
y(x) ~= y(x0) + (x - x0) * dy(x0) + (x - x0)**2 * ddy(x0) / 2 + …
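As a quick numerical sanity check (the curve y(x) = cos(x) and the point x0 = 1.0 are illustrative choices, not from the post), the quadratic Taylor approximation tracks the curve closely near x0:

```python
import math

# Quadratic Taylor approximation of y(x) = cos(x) around x0 = 1.0.
y = math.cos
x0 = 1.0
dy0 = -math.sin(x0)   # first derivative of cos at x0
ddy0 = -math.cos(x0)  # second derivative of cos at x0

def taylor2(x):
    """y(x0) + (x - x0) * dy(x0) + (x - x0)**2 * ddy(x0) / 2"""
    return y(x0) + (x - x0) * dy0 + (x - x0) ** 2 * ddy0 / 2
```

The approximation error shrinks like |x - x0|**3, which is why the analysis below only needs the terms shown.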
The minimizer X = argmin(y(x)) is found by setting the first derivative to zero:
dy(x0) + (X - x0) * ddy(x0) ~= 0
=> δx = (X - x0) = -dy(x0) / ddy(x0).
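To make this concrete, here is a minimal sketch (the example curve is a hypothetical choice, not from the post) of gradient descent where each step multiplies the gradient by the inverse of the second derivative:

```python
# Gradient descent with learning rate = 1 / ddy at the current point,
# i.e. step = -dy(x) / ddy(x), as derived above.
def newton_descent(dy, ddy, x0, steps=20):
    """Minimize a 1-D curve given its first and second derivatives."""
    x = x0
    for _ in range(steps):
        x = x - dy(x) / ddy(x)  # the ideal step from the quadratic approximation
    return x

# Illustrative non-quadratic curve: y(x) = x**4 - 3*x**2 + x.
dy = lambda x: 4 * x**3 - 6 * x + 1   # first derivative
ddy = lambda x: 12 * x**2 - 6         # second derivative
x_min = newton_descent(dy, ddy, x0=1.0)
```

Because the curve is not exactly quadratic, one step no longer lands on the minimum, but iterating the step converges very quickly to a point where dy vanishes and ddy is positive.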
The step is in the opposite direction to the gradient (the first derivative), and the learning rate, that is, the factor by which the gradient is multiplied to obtain the ideal step, is the inverse of the second derivative.
The form of this result does not depend on the coordinate used, but non-linear coordinate transformations will lead to different steps, since the quadratic approximation to the “valley” in y(x) is different in different coordinates.
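To see the coordinate dependence in action, the sketch below (the curve y(x) = (x - 2)**2 and the transformation u = x**3 are illustrative assumptions) computes the step once in x and once in u, using finite differences for the derivatives:

```python
# Central finite differences for the first and second derivatives.
def d(f, t, h=1e-5):
    return (f(t + h) - f(t - h)) / (2 * h)

def dd(f, t, h=1e-5):
    return (f(t + h) - 2 * f(t) + f(t - h)) / h**2

y = lambda x: (x - 2) ** 2
x0 = 1.0

# Step in the x coordinate: the curve is exactly quadratic in x,
# so one step lands on the minimum at x = 2.
x1 = x0 - d(y, x0) / dd(y, x0)

# Same curve viewed in u = x**3: the quadratic approximation differs,
# so the step (mapped back to x) lands somewhere else.
f = lambda u: y(u ** (1 / 3))
u0 = x0 ** 3
u1 = u0 - d(f, u0) / dd(f, u0)
x1_via_u = u1 ** (1 / 3)
```

In the x coordinate the step reaches x = 2 exactly (up to finite-difference error), while the step taken in u corresponds to a different point in x, illustrating that the steps disagree under a non-linear change of coordinates.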
Finally, let’s generalize to multiple dimensions.