It's weirdly difficult to find good sources that explain how transformers and attention actually work, considering how important they are. Everybody just seems to repeat a variant of this diagram from the original paper, as though it actually explains anything.

