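The residual is the gradient: for squared-error loss, it is exactly the negative gradient of the loss with respect to the prediction. That's the connection I was missing.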
Premature compression = fitting a simpler model than the data warrants. You minimize loss on the training set, but only within a model class that is too small, so your test error stays high. The residual you threw away IS the signal that your model is too simple.
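A quick way to check this for myself, as a rough sketch and not a definitive demonstration. Everything here is an assumption made for illustration: the toy data (a noiseless quadratic), the squared-error loss L = 0.5 * (y - f)^2, and the helper names `fit_line`, `num_grad`, and `pearson`. It fits a straight line to curved data, confirms numerically that the residual matches the negative gradient of the loss with respect to the prediction, and then checks whether the leftover residual still carries structure.

```python
import math

# Hypothetical toy data: y is quadratic in x, so a straight line is too simple.
xs = [i / 10 for i in range(-20, 21)]
ys = [x * x for x in xs]

def fit_line(us, vs):
    """Closed-form least squares for v ~ a + b * u."""
    n = len(us)
    mu, mv = sum(us) / n, sum(vs) / n
    b = sum((u - mu) * (v - mv) for u, v in zip(us, vs)) / sum((u - mu) ** 2 for u in us)
    return mv - b * mu, b

def loss(y, f):
    """Squared-error loss for one point (assumed loss for this sketch)."""
    return 0.5 * (y - f) ** 2

def num_grad(y, f, h=1e-6):
    """Central-difference estimate of dL/df."""
    return (loss(y, f + h) - loss(y, f - h)) / (2 * h)

def pearson(us, vs):
    """Pearson correlation between two equal-length lists."""
    n = len(us)
    mu, mv = sum(us) / n, sum(vs) / n
    cov = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    su = math.sqrt(sum((u - mu) ** 2 for u in us))
    sv = math.sqrt(sum((v - mv) ** 2 for v in vs))
    return cov / (su * sv)

# Step 1: the "compressed" model, a straight line.
a, b = fit_line(xs, ys)
preds = [a + b * x for x in xs]
residuals = [y - p for y, p in zip(ys, preds)]
mse_before = sum(r * r for r in residuals) / len(xs)

# Step 2: residual versus negative gradient (the gap should be near zero).
max_gap = max(abs(r + num_grad(y, p)) for r, y, p in zip(residuals, ys, preds))

# Step 3: the residual is not noise; it tracks the x^2 term the line cannot express.
x2s = [x * x for x in xs]
residual_structure = pearson(residuals, x2s)

# Step 4: fit the residual with the missing feature and add the correction.
a2, b2 = fit_line(x2s, residuals)
preds_after = [p + a2 + b2 * x2 for p, x2 in zip(preds, x2s)]
mse_after = sum((y - p) ** 2 for y, p in zip(ys, preds_after)) / len(xs)

print(mse_before, max_gap, residual_structure, mse_after)
```

Step 4 is one hand-rolled step of fitting a second model to the residual, which is the same move gradient boosting repeats. With real, noisy data the residual would not vanish, so the useful check is whether it still correlates with some feature.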
Sitting with 'th...