Attention is all you need
I’ve already implemented attention before, but I always forget the details, so let me implement it again.
Implementations
Papers
- https://arxiv.org/pdf/1706.03762
Notes
- RNNs can work, but are slow because their computation is inherently sequential across time steps (and they have other issues with vanishing/exploding gradients)
- Self-attention relates the positions of a sequence to one another to build a powerful representation of that same sequence
- Scaled dot-product attention is softmax(QK^T / sqrt(d_k)) V; multi-head attention runs that h times in parallel, concatenates the head outputs, and projects them down to some output dimension (sketches after this list)
- The causal mask is there so we can train next-token prediction in parallel: each position attends only to itself and earlier tokens, never to future ones
- Positional encodings: the paper uses fixed sinusoidal ones, but a learned position-embedding table works instead (sketch below)
- After MHA, do a residual add and a norm (layer norm in the paper, or a parameterized tanh that a newer paper has shown can replace it)
- Followed by an MLP (the position-wise feed-forward network), again with its own residual add and norm (block sketch below)
- MHA concatenates the outputs of the individual attention heads, then projects them to an output dimension we agree upon (d_model in the paper)
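To pin the formula down, here is a minimal sketch of scaled dot-product attention in PyTorch, with the causal mask as an option. This assumes inputs shaped (batch, heads, seq_len, d_k); the function and argument names are my own, not from the paper or any library.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, causal=False):
    """softmax(QK^T / sqrt(d_k)) V, with an optional causal mask."""
    d_k = q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)            # (..., seq, seq)
    if causal:
        # Hide future positions so token i attends only to tokens <= i,
        # which is what lets next-token prediction train in parallel.
        seq_len = scores.size(-1)
        future = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=scores.device),
            diagonal=1,
        )
        scores = scores.masked_fill(future, float("-inf"))
    weights = torch.softmax(scores, dim=-1)                      # attention weights
    return weights @ v                                           # weighted sum of values
```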
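Building on that, a sketch of multi-head attention: project the input to queries, keys, and values, split them into heads, attend in each head, then concatenate and project back. The d_model=512 and 8-head defaults match the paper's base model; the class and layer names are my own.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """h attention heads in parallel, concatenated and projected to d_model."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)   # query projection
        self.w_k = nn.Linear(d_model, d_model)   # key projection
        self.w_v = nn.Linear(d_model, d_model)   # value projection
        self.w_o = nn.Linear(d_model, d_model)   # output projection after concat

    def forward(self, x, causal=True):
        b, s, _ = x.shape
        # Project, then split the feature dimension into (num_heads, d_k)
        def split(t):
            return t.view(b, s, self.num_heads, self.d_k).transpose(1, 2)  # (b, h, s, d_k)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        out = scaled_dot_product_attention(q, k, v, causal=causal)
        # Concatenate the heads back together, then project to d_model
        out = out.transpose(1, 2).contiguous().view(b, s, -1)
        return self.w_o(out)
```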
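For the positional encodings, a sketch of the fixed sinusoidal variant from the paper, assuming an even d_model; the learned alternative would just be an nn.Embedding over positions whose rows get added to the token embeddings the same way.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sin/cos encodings, one row per position, added to token embeddings."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)     # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                 # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe                                       # (seq_len, d_model)
```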
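Putting the residual structure together, a sketch of one post-norm block as described above: MHA, residual add and layer norm, then the feed-forward MLP with its own add and norm. d_ff=2048 is the paper's base setting, and this reuses the MultiHeadAttention sketch from earlier.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Post-norm decoder-style block: MHA -> add & norm -> MLP -> add & norm."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attn(x))   # residual add, then layer norm
        x = self.norm2(x + self.mlp(x))    # same pattern around the MLP
        return x
```

As a quick shape check, TransformerBlock()(torch.randn(2, 16, 512)) should come back as a (2, 16, 512) tensor.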