Adam optimizer
Implementations
Papers & videos
- https://arxiv.org/pdf/1412.6980
- https://www.youtube.com/watch?v=NE88eqLngkg
- https://www.youtube.com/watch?v=k8fTYJPd3_I
Notes
- Nice way to think about what gradient updates cost: a gradient step costs about the same as just running the function. Very efficient compared to combinatorial search!
- But running plain SGD in a high-dimensional parameter space could still be better
- Adam basically leverages momentum to skip down the loss surface
- Exponential moving averages over the gradient and its square
- These moving averages are estimates of the mean and (uncentered) variance of the gradient over time (see the Adam sketch after this list)
- Adding momentum to SGD is $v = \beta v + (1-\beta)\,dw$ and $w = w - \alpha v$, i.e., the update uses a moving average of $dw$ instead of $dw$ itself. Goal is to damp the oscillations of SGD (see the momentum sketch after this list).
- Usually $\beta = 0.9$ is used, which is roughly like averaging over the last $1/(1-\beta) = 10$ gradients
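
A minimal sketch of the momentum update above, assuming NumPy arrays for the parameters and gradient; the function name and the default $\alpha$, $\beta$ values are just for illustration.

```python
import numpy as np

def sgd_momentum_step(w, dw, v, alpha=0.01, beta=0.9):
    """One SGD-with-momentum step: v = beta*v + (1-beta)*dw, then w = w - alpha*v.
    alpha and beta defaults are illustrative assumptions."""
    v = beta * v + (1 - beta) * dw   # exponential moving average of the gradient
    w = w - alpha * v                # step along the smoothed direction
    return w, v
```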
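
And a sketch of the full Adam update from the paper linked above: exponential moving averages of the gradient and its square, bias-corrected because both start at zero. The hyperparameter defaults ($\alpha=0.001$, $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$) follow the paper's suggested values; the function layout and the toy loop are just for illustration.

```python
import numpy as np

def adam_step(w, dw, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step; t is the 1-based iteration count used for bias correction."""
    m = beta1 * m + (1 - beta1) * dw                # 1st moment: EMA of the gradient
    v = beta2 * v + (1 - beta2) * dw ** 2           # 2nd moment: EMA of the squared gradient
    m_hat = m / (1 - beta1 ** t)                    # bias correction (m, v start at zero)
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return w, m, v

# Toy usage: minimize f(w) = ||w||^2 (gradient is 2w)
w = np.ones(3)
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 201):
    w, m, v = adam_step(w, 2 * w, m, v, t)
```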