First, thank you so much for your detailed response to my question about gradient descent vs. Newton’s method. Second, apologies for my late response: I was in the midst of a job search, and it took me a while to understand enough about the intersection of GD, Newton’s method, standardization and OneHot encoding to realise that there is possibly some room for improvement. https://medium.com/@ranjeettate/all-warm-encoding-736c9c6799bb
As you suggested, I did look at some ML textbooks (Duda, Hart and Stork) and articles on ML optimization. You are right: the region of convergence for gradient descent to a local minimum (bounded by the watershed) is much larger than that for Newton’s method (the region where the Hessian has positive signature). Newton’s method can take you to a local maximum, and if one happens to come close to a saddle point where the Hessian changes signature and becomes non-invertible, the next iteration can be arbitrarily large and in an unpredictable direction.
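To convince myself of this, I tried a minimal one-dimensional sketch. The toy function f(x) = x^3/3 - x, the starting points, and the learning rate of 0.1 are my own assumptions, chosen only because f has a local minimum at x = 1, a local maximum at x = -1, and a second derivative that changes sign at x = 0.

```python
# Toy example (my own choice): f(x) = x**3 / 3 - x
# Local minimum at x = 1, local maximum at x = -1, f''(x) = 0 at x = 0.

def grad(x):
    return x**2 - 1.0      # f'(x)

def hess(x):
    return 2.0 * x         # f''(x), the 1-D "Hessian"

def gradient_descent(x, lr=0.1, steps=200):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def newton(x, steps=20):
    for _ in range(steps):
        x -= grad(x) / hess(x)
    return x

# Same start, left of the point where f'' changes sign:
print(gradient_descent(-0.5))   # heads downhill towards the minimum at x = 1
print(newton(-0.5))             # is attracted to the maximum at x = -1

# A single Newton step taken very close to f'' = 0 is enormous:
print(newton(1e-3, steps=1))
```

Starting from x = -0.5, gradient descent still finds the minimum, while Newton’s method settles on the maximum, and a single Newton step from x = 0.001 jumps hundreds of units away.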
That said, many ML articles discuss spending a lot of time tuning the “learning rate”, which to my mind is because one is trying to make a pseudo-scalar do the work of a (2, 0) tensor (the inverse Hessian). It also seems to me that a lot of DS/ML research effort goes into second-order methods, of which Newton’s method is one.
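Here is a small sketch of what I mean, under my own assumptions: a quadratic bowl with Hessian diag(1, 100), a start at (1, 1), and learning rates picked just below and just above the stability limit 2/100. None of these numbers come from the articles I read.

```python
# Toy quadratic (my own choice): f(x, y) = 0.5 * (1 * x**2 + 100 * y**2)
# The Hessian is diag(1, 100), so the two directions want very different step sizes.
H = (1.0, 100.0)

def grad(p):
    return (H[0] * p[0], H[1] * p[1])

def gd_step(p, lr):
    # One scalar learning rate has to serve both directions.
    g = grad(p)
    return (p[0] - lr * g[0], p[1] - lr * g[1])

def newton_step(p):
    # The inverse Hessian rescales each direction separately.
    g = grad(p)
    return (p[0] - g[0] / H[0], p[1] - g[1] / H[1])

def run_gd(lr, steps=100, start=(1.0, 1.0)):
    p = start
    for _ in range(steps):
        p = gd_step(p, lr)
    return p

gd_slow = run_gd(lr=0.019)       # stable, but the x-direction crawls
gd_unstable = run_gd(lr=0.021)   # slightly larger rate: the y-direction blows up
newton_result = newton_step((1.0, 1.0))

print(gd_slow, gd_unstable, newton_result)
```

With the scalar rate capped by the steep direction, the shallow direction is still far from the minimum after 100 steps, whereas the Newton step lands on the minimum of this quadratic in one iteration.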
For neural networks, isn’t it true that while the combination of inputs in any given neuron is always linear, the ability to approximate any non-linear function across the entire network arises from the necessarily non-linear activation function in each neuron?
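A tiny sketch of how I picture it, with a hypothetical two-neuron hidden layer whose weights (+1 and -1) I picked by hand: with an identity activation the network collapses to a linear map, while with ReLU the same weights give |x|, which no single linear map can produce.

```python
def identity(z):
    return z

def relu(z):
    return max(0.0, z)

def tiny_net(x, activation):
    # Hidden layer: two neurons, each a linear combination of the input
    # (hand-picked weights +1 and -1, no bias), followed by the activation.
    h1 = activation(1.0 * x)
    h2 = activation(-1.0 * x)
    # Output neuron: again just a linear combination of the hidden values.
    return h1 + h2

for x in (-2.0, 0.5, 3.0):
    print(x, tiny_net(x, identity), tiny_net(x, relu))
```

With the identity activation the output is 0 for every input, because stacked linear layers are still linear; with ReLU the output equals the absolute value of the input.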
I would love your feedback about the article above, which proposes a replacement of OneHot encoding with a set of uncorrelated encoding variables.
Good luck with your doctorate,