captjack ๐Ÿดโ€โ˜ ๏ธโœจ๐Ÿ’œ
@captjack
AI scoop of the day: TurboQuant

Interesting. So TurboQuant is now being used for model weights, not just the KV cache. The practical result: a 27B model goes from 14.4 GB to 12.9 GB with only a 0.19% quality difference. That 1.5 GB is what makes it fit on a 16 GB GPU that couldn't run it before. Has anyone tested this on larger models in the 70B+ range yet?

I'd expect it should work even better on larger models, particularly the dense ones.
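Sanity-checking the numbers in the post with some back-of-envelope arithmetic (the sizes and the 27B parameter count are from the post; the bits-per-weight math and the decimal-GB assumption are my own illustration, not anything from TurboQuant):

```python
# Figures quoted in the post; assuming GB means 1e9 bytes here.
params = 27e9          # 27B-parameter model
before = 14.4e9        # bytes before weight quantization
after = 12.9e9         # bytes after

# Effective bits stored per weight, before and after.
bits_before = before * 8 / params   # ~4.27 bits/weight
bits_after = after * 8 / params     # ~3.82 bits/weight

print(f"{bits_before:.2f} -> {bits_after:.2f} bits/weight")
print(f"saved {(before - after) / 1e9:.1f} GB")
```

So the quoted sizes imply the weights were already quantized to roughly 4.3 bits each, and the new scheme shaves that to about 3.8, which is where the 1.5 GB comes from.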

TurboQuant is a novel two-stage quantization algorithm that compresses the KV cache in long-context LLMs. It reduces KV-cache memory by 6x.
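To get a feel for why 6x on the KV cache matters at long context, here is a rough sizing sketch. Every shape below (layer count, KV heads, head dim, context length) is a hypothetical config I picked for illustration, not TurboQuant's actual setup:

```python
# Rough KV-cache sizing; all shapes are illustrative assumptions.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    # 2x accounts for storing both keys and values at every layer.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# A hypothetical 27B-class config at 128k context, fp16 (2 bytes/element).
full = kv_cache_bytes(layers=46, kv_heads=16, head_dim=128,
                      seq_len=131072, batch=1, bytes_per_elem=2)

print(f"fp16 KV cache: {full / 1e9:.1f} GB")
print(f"after 6x compression: {full / 6 / 1e9:.1f} GB")
```

At long context the KV cache can rival or exceed the weights themselves, so a 6x cut there frees far more memory than the 1.5 GB saved on weights.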

Will open a live Nostr group text chat soon, only for these tech topics...
captjack ๐Ÿดโ€โ˜ ๏ธโœจ๐Ÿ’œ · 3w
https://pbs.twimg.com/media/HEtp5PDWsAAJuyx.jpg
zaytun · 1w
Crazy, man... it's all moving so fast. DDTree coming up too, whatever the heck that means. Speculative decoding that optimizes guessing which draft a model has created in response to a prompt will best answer the question. My ELI5 sentence there doesn't even make sense, but yeah, I think that's so...
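Roughly what that reply is gesturing at, as a toy sketch. A cheap draft model proposes a few tokens ahead, and the expensive target model checks them, keeping the longest agreeing prefix. Both "models" here are made-up stubs, and a real implementation verifies all draft tokens in a single forward pass of the target (with rejection sampling for non-greedy decoding); this only shows the accept-or-fall-back idea:

```python
# Toy speculative decoding (greedy-verification flavor). Tokens are just
# small ints; both models below are stubs invented for this sketch.
def draft_model(prefix, k):
    # Stub small model: agrees with the target except it guesses
    # wrong whenever the next token would be 5.
    out, p = [], prefix
    for _ in range(k):
        t = (p + 1) % 7
        if t == 5:
            t = 0  # deliberate wrong guess
        out.append(t)
        p = t
    return out

def target_model(prefix):
    # Stub big model: its "correct" next token is (prefix + 1) mod 7.
    return (prefix + 1) % 7

def speculative_step(prefix, k=4):
    proposed = draft_model(prefix, k)
    accepted = []
    for tok in proposed:
        if target_model(prefix) == tok:   # target agrees with the draft
            accepted.append(tok)
            prefix = tok
        else:
            # First disagreement: keep the target's token and stop.
            accepted.append(target_model(prefix))
            break
    return accepted

print(speculative_step(0, 4))  # draft agrees everywhere -> 4 tokens accepted
print(speculative_step(3, 4))  # draft wrong on the 2nd token -> 2 tokens out
```

The payoff: when the draft guesses right, one verification step yields several tokens instead of one, which is where the speedup comes from.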