Hi Zheng, sorry for the delay and thanks for your warm note.
Good conjecture in your question. Newton's method converges to a local minimum only in the region where the signature of the Hessian is all positive, i.e., where the Hessian is positive definite. Recall that Hessian^(-1) * gradient is the Hessian vector field (HVF), whose negative is the step taken at each iteration of Newton's method. Where the Hessian has negative eigenvalues, the dot product of the HVF with the gradient can be negative, so the Newton step points uphill and takes you toward a maximum (or a saddle) instead. Flipping the sign of the HVF in that case is meant to expand the region of convergence of Newton's method to minima, but I'll admit that it is a hack and I haven't proven anything about it.
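In case it helps, here is a minimal NumPy sketch of that sign-flip hack as I understand it (the function names and the 1-D example are mine, just for illustration): take the Newton step, and if it would ascend along the gradient, negate it.

```python
import numpy as np

def newton_step_with_flip(grad, hess, x):
    """One Newton iteration, flipping the step's sign whenever the
    plain Newton step would be an ascent direction.
    (A sketch of the sign-flip hack; names here are hypothetical.)"""
    g = grad(x)
    H = hess(x)
    hvf = np.linalg.solve(H, g)   # Hessian^{-1} * gradient (the "HVF")
    if g @ hvf < 0:               # -hvf would point uphill along g
        hvf = -hvf                # flip so the step descends instead
    return x - hvf                # Newton step is the negative of the HVF

# Toy example: f(x) = x^4 - 2 x^2 has minima at x = +/-1 and a maximum at 0.
grad = lambda x: 4 * x**3 - 4 * x
hess = lambda x: np.array([[12 * x[0]**2 - 4]])

# Start at x = 0.3, inside the region where the Hessian is negative;
# plain Newton would head to the maximum at 0, the flipped version escapes.
x = np.array([0.3])
for _ in range(30):
    x = newton_step_with_flip(grad, hess, x)
```

With the flip, the iterate above lands on the minimum at x = 1 rather than the maximum at 0.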
That said, I came to understand some things about ML while teaching a class on Linear Regression, and I think there is less reason to use Newton's method if you do AllWarm instead of OneHot encoding: https://medium.com/@ranjeettate/all-warm-encoding-736c9c6799bb
Let me know what you think.
Ranjeet (I recall we connected on LinkedIn?)