# Learning Rate in Gradient Descent and Second Derivatives: 4

## Generalize to Multiple Dimensions

In the previous post we showed that the inverse of the second derivative approximates the ideal learning rate for finding the minimum of almost any function in one dimension using Gradient Descent. In this post we generalize this result to multiple dimensions, and show how to get from a gradient co-vector to a vector step via the mysterious and missing tensor of type (2,0): the inverse Hessian.

We have multiple weights that we use to minimize some Loss function. So the weight space is multi-dimensional, and a point in it is given by the vector **x** or x^a. The Loss function y(**x**) can be expanded in a Taylor series:

y(**x**) ~= y(**x**0) + δ**x** **d**y(**x**0) + δ**x** δ**x** **dd**y(**x**0) / 2 + …

where **d**y = (∂y/∂x^i) **d**x^i (summing over the repeated index), the gradient, is a co-vector, and δ**x** is the increment in the weights. In terms of components or indices:

y(**x**) ~= y(**x**0) + δx^i d_i y(**x**0) + δx^i δx^j H_{ij}(**x**0) / 2 + …

where H_{ij} = (∂/∂x^i) (∂/∂x^j) y

is the Hessian, the matrix of second partial derivatives of y.
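As a numerical sanity check, here is a minimal NumPy sketch (the loss function and evaluation point are hypothetical, chosen for illustration) that estimates the gradient d_i y and the Hessian H_{ij} by central differences, and confirms the Hessian is symmetric, since mixed partials commute:

```python
import numpy as np

# Hypothetical toy loss in D = 2 weight dimensions.
def loss(x):
    return 3 * x[0]**2 + x[0] * x[1] + 2 * x[1]**2 + 0.1 * x[0]**4

def gradient(x, eps=1e-5):
    """Central-difference estimate of the gradient d_i y."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (loss(x + e) - loss(x - e)) / (2 * eps)
    return g

def hessian(x, eps=1e-4):
    """Central-difference estimate of the Hessian H_ij = d_i d_j y."""
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        e = np.zeros_like(x); e[i] = eps
        H[i] = (gradient(x + e) - gradient(x - e)) / (2 * eps)
    return H

x0 = np.array([1.0, -0.5])        # hypothetical current point in weight space
H = hessian(x0)
# Mixed partials commute, so H should be (numerically) symmetric.
assert np.allclose(H, H.T, atol=1e-3)
```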

To find the extremum, we set the gradient of y(**x**) to zero:

0 = d_i y(**x**) = d_i y(**x**0) + δx^j H_{ij}(**x**0)

which can be solved for the increment at the point **x**0:

δx^i = −H^{ij}(**x**0) d_j y(**x**0)

where H^{ij}, with the upper indices, is a (2,0) tensor: the inverse of the Hessian:

H^{ik} H_{kj} = δ^i_j (the Kronecker delta, i.e. the identity)

**In words, the ideal increment is minus the inverse of the Hessian applied to the gradient.**
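A minimal NumPy sketch of that step, using a hypothetical exactly quadratic loss y = x^T A x / 2 − b^T x (so A is the constant Hessian and the exact minimum is A^{-1} b); note we solve the linear system rather than explicitly inverting the Hessian:

```python
import numpy as np

# Hypothetical quadratic loss: Hessian A, gradient A x - b, minimum at A^{-1} b.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

x0 = np.zeros(2)
grad = A @ x0 - b                    # d_i y at x0

# Ideal increment: delta_x^i = -H^{ij} d_j y.
# Solving H delta = -grad is cheaper and more stable than inverting H.
delta = -np.linalg.solve(A, grad)
x1 = x0 + delta

# For an exactly quadratic loss, one such step lands on the minimum.
assert np.allclose(x1, np.linalg.solve(A, b))
```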

**The “learning rate” is actually a tensor.** This is sometimes approximated by ignoring the difference between vectors and co-vectors and setting the scalar learning rate to det(H)^(−1/D), the inverse of the geometric mean of the Hessian’s eigenvalues; the step is then minus this scalar times the gradient.
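A sketch comparing the scalar approximation with the full tensor step, using a hypothetical diagonal Hessian and gradient so the numbers are easy to check by hand:

```python
import numpy as np

H = np.array([[4.0, 0.0],
              [0.0, 1.0]])           # hypothetical Hessian at the current point
grad = np.array([2.0, 2.0])          # hypothetical gradient at the current point
D = H.shape[0]

# Scalar learning rate: det(H)^(-1/D), the inverse geometric mean of eigenvalues.
lr = np.linalg.det(H) ** (-1.0 / D)  # det = 4, D = 2, so lr = 0.5

scalar_step = -lr * grad                  # treats gradient components as a vector
tensor_step = -np.linalg.solve(H, grad)   # the full inverse-Hessian step

# The scalar step shrinks both components equally; the tensor step
# takes a smaller step along the stiff direction and a larger one
# along the flat direction.
```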

Of course it is computationally a pain to calculate the O(D²)-entry Hessian (and invert it, at O(D³) cost) at every step, but one approach might be to recompute it only every few epochs and assume that the same quadratic approximation holds good for that duration.
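One way that amortization might look, sketched in NumPy; the loss here is a hypothetical fixed quadratic, so the periodic “Hessian refresh” is a stand-in for a real recomputation:

```python
import numpy as np

# Hypothetical quadratic loss with constant Hessian A and gradient A x - b.
A = np.array([[5.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad(x):
    return A @ x - b

x = np.array([10.0, -10.0])
refresh = 5                               # recompute the Hessian every 5 steps
H = None
for step in range(20):
    if step % refresh == 0:
        H = A.copy()                      # stand-in for a fresh O(D^2) Hessian pass
    # Reuse the cached Hessian in between, assuming the quadratic
    # approximation is still good at the current point.
    x = x - np.linalg.solve(H, grad(x))

# For this quadratic loss the iteration converges to the true minimum A^{-1} b.
assert np.allclose(x, np.linalg.solve(A, b))
```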