I am reading the paper Sutskever, Ilya, James Martens, George Dahl, and Geoffrey Hinton, “On the importance of initialization and momentum in deep learning,” in Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 1139–1147, 2013, which redirected me to Prof. Yann LeCun’s paper. It contains several pictures that illustrate the intuition behind the optimization process.
The following gradient-based-method references are also good for understanding the NAG (Nesterov’s Accelerated Gradient) process:
 Y. LeCun, L. Bottou, G. Orr and K.-R. Müller, “Efficient BackProp,” in G. Orr and K.-R. Müller (Eds.), Neural Networks: Tricks of the Trade, Springer, 1998.
 Sébastien Bubeck, “Nesterov’s Accelerated Gradient Descent for Smooth and Strongly Convex Optimization,” https://blogs.princeton.edu/imabandit/2014/03/06/nesterovs-accelerated-gradient-descent-for-smooth-and-strongly-convex-optimization/
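To make the difference concrete, here is a minimal sketch of the two update rules as written in the Sutskever et al. paper: classical momentum evaluates the gradient at the current point, while NAG evaluates it at the “look-ahead” point theta + mu * v. The objective (a toy 1-D quadratic) and the values of mu and eps are my own illustrative assumptions, not from the paper.

```python
def grad(theta):
    # Gradient of the toy objective f(theta) = 0.5 * theta**2 (illustrative assumption)
    return theta

def momentum_step(theta, v, mu=0.9, eps=0.1):
    # Classical momentum: v_{t+1} = mu * v_t - eps * grad(theta_t)
    v_new = mu * v - eps * grad(theta)
    return theta + v_new, v_new

def nag_step(theta, v, mu=0.9, eps=0.1):
    # NAG: v_{t+1} = mu * v_t - eps * grad(theta_t + mu * v_t)
    v_new = mu * v - eps * grad(theta + mu * v)
    return theta + v_new, v_new

theta_m, v_m = 10.0, 0.0  # classical momentum trajectory
theta_n, v_n = 10.0, 0.0  # NAG trajectory
for _ in range(100):
    theta_m, v_m = momentum_step(theta_m, v_m)
    theta_n, v_n = nag_step(theta_n, v_n)
print(abs(theta_m), abs(theta_n))  # both approach the minimum at 0
```

The only difference between the two functions is where the gradient is evaluated; the paper’s argument is that this look-ahead correction lets NAG react to the gradient before the momentum term overshoots.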