Optimizers and norms
Not going to lie, didn’t really get the math in this paper. Probably revisit this later…
Main punchline though is that different optimizers correspond to steepest descent under different norms, i.e. different choices of step direction. Useful if I'm trying to concoct a new optimizer for a new situation.
Paper
- https://arxiv.org/pdf/2409.20325
Notes
- So the actual claim is that if we remove the momentum/moving-average terms, each optimizer is just doing steepest descent under a different norm (rough math of what that means in the sketch after these notes)
- The claim/hope is that we can then optimize each tensor with a norm that makes sense for the operation it performs.
- Convex function: any local minimum is the global minimum, and the line segment between any two points on the graph lies on or above the graph. So like a bowl (inequality written out in the sketch below).
- Didn't even realize it, but these optimizers are theoretically grounded in convex functions, and it's somewhat surprising that the analysis carries over to non-convex loss surfaces.
- Different norms give different steepest-descent step directions on the same loss function
- Punchline: the Euclidean norm gives SGD, the infinity (max) norm gives Adam (sign descent), and the spectral norm gives Shampoo (toy numpy version at the end of these notes)
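A rough sketch, in my own notation rather than the paper's, of what "steepest descent under a norm" means, plus the convexity inequality from the notes above:

```latex
% Convexity: the chord between any two points lies on or above the graph
\[
  f(\lambda x + (1-\lambda) y) \;\le\; \lambda f(x) + (1-\lambda) f(y),
  \qquad \lambda \in [0, 1].
\]

% Steepest descent under a norm \(\|\cdot\|\): linearize the loss at w using
% the gradient g = \nabla L(w), then take the step that minimizes the
% linearization inside a norm ball of radius \(\eta\):
\[
  \Delta w^{\star} \;=\; \arg\min_{\|\Delta w\| \le \eta} \; g^{\top} \Delta w .
\]

% Swapping the norm changes which step wins, which is how different
% optimizers fall out of the same recipe.
```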
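And a toy numpy sketch (my own code, not from the paper; the function names are made up) of the three resulting step rules for a weight matrix with gradient G: the Euclidean norm keeps the raw gradient direction (SGD), the infinity/max norm moves every entry by the same amount (sign descent, i.e. Adam with the moving averages removed), and the spectral norm keeps the gradient's singular vectors but flattens its singular values (Shampoo with the accumulators removed):

```python
import numpy as np

def euclidean_step(G, lr):
    # Steepest descent under the Euclidean (Frobenius) norm:
    # step along the raw negative gradient -> plain SGD,
    # here normalized so the step has norm lr.
    return -lr * G / (np.linalg.norm(G) + 1e-12)

def infinity_step(G, lr):
    # Steepest descent under the infinity (max) norm:
    # every entry moves by the same magnitude -> sign descent,
    # i.e. Adam with the moving averages stripped out.
    return -lr * np.sign(G)

def spectral_step(G, lr):
    # Steepest descent under the spectral norm:
    # keep the gradient's singular directions but set all singular
    # values to 1 -> Shampoo with the accumulators stripped out.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return -lr * U @ Vt

# Same gradient, three different step directions.
rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))
for step in (euclidean_step, infinity_step, spectral_step):
    print(step.__name__)
    print(step(G, lr=0.1))
```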