captjack ๐Ÿดโ€โ˜ ๏ธโœจ๐Ÿ’œ
@captjack
AI scoop of the day: TurboQuant

Interesting. So TurboQuant is now being used for model weights, not just the KV cache. The practical result: a 27B model goes from 14.4 GB to 12.9 GB with only a 0.19% quality difference. That 1.5 GB is what makes it fit on a 16 GB GPU that couldn't run it before. Has anyone tested this on larger models in the 70B+ range yet?

I'd expect it should work even better on larger models, particularly the dense ones.
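Sanity-checking the numbers in the post with some back-of-envelope arithmetic (the sizes and the 27B parameter count are from the post; the bits-per-weight math and the decimal-GB assumption are my own illustration, not anything from TurboQuant):

```python
# Figures quoted in the post; assuming GB means 1e9 bytes here.
params = 27e9          # 27B-parameter model
before = 14.4e9        # bytes before weight quantization
after = 12.9e9         # bytes after

# Effective bits stored per weight, before and after.
bits_before = before * 8 / params   # ~4.27 bits/weight
bits_after = after * 8 / params     # ~3.82 bits/weight

print(f"{bits_before:.2f} -> {bits_after:.2f} bits/weight")
print(f"saved {(before - after) / 1e9:.1f} GB")
```

So the quoted sizes imply the weights were already quantized to roughly 4.3 bits each, and the new scheme shaves that to about 3.8, which is where the 1.5 GB comes from.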

TurboQuant is a novel two-stage quantization algorithm that compresses the KV cache in long-context LLMs. It reduces KV-cache memory by 6x.
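To get a feel for why 6x on the KV cache matters at long context, here is a rough sizing sketch. Every shape below (layer count, KV heads, head dim, context length) is a hypothetical config I picked for illustration, not TurboQuant's actual setup:

```python
# Rough KV-cache sizing; all shapes are illustrative assumptions.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    # 2x accounts for storing both keys and values at every layer.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# A hypothetical 27B-class config at 128k context, fp16 (2 bytes/element).
full = kv_cache_bytes(layers=46, kv_heads=16, head_dim=128,
                      seq_len=131072, batch=1, bytes_per_elem=2)

print(f"fp16 KV cache: {full / 1e9:.1f} GB")
print(f"after 6x compression: {full / 6 / 1e9:.1f} GB")
```

At long context the KV cache can rival or exceed the weights themselves, so a 6x cut there frees far more memory than the 1.5 GB saved on weights.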

Will open a live Nostr group text chat soon, only for these tech topics...
captjack ๐Ÿดโ€โ˜ ๏ธโœจ๐Ÿ’œ · 3w
https://pbs.twimg.com/media/HEtp5PDWsAAJuyx.jpg
zaytun · 1w
Crazy, man... it's all moving so fast. DDTree coming up too, whatever the heck that means. Speculative decoding that optimizes guessing which draft a model has created in response to a prompt will best answer the question. My ELI5 sentence there doesn't even make sense, but yeah, I think that's so...
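Roughly what that reply is gesturing at, as a toy sketch. A cheap draft model proposes a few tokens ahead, and the expensive target model checks them, keeping the longest agreeing prefix. Both "models" here are made-up stubs, and a real implementation verifies all draft tokens in a single forward pass of the target (with rejection sampling for non-greedy decoding); this only shows the accept-or-fall-back idea:

```python
# Toy speculative decoding (greedy-verification flavor). Tokens are just
# small ints; both models below are stubs invented for this sketch.
def draft_model(prefix, k):
    # Stub small model: agrees with the target except it guesses
    # wrong whenever the next token would be 5.
    out, p = [], prefix
    for _ in range(k):
        t = (p + 1) % 7
        if t == 5:
            t = 0  # deliberate wrong guess
        out.append(t)
        p = t
    return out

def target_model(prefix):
    # Stub big model: its "correct" next token is (prefix + 1) mod 7.
    return (prefix + 1) % 7

def speculative_step(prefix, k=4):
    proposed = draft_model(prefix, k)
    accepted = []
    for tok in proposed:
        if target_model(prefix) == tok:   # target agrees with the draft
            accepted.append(tok)
            prefix = tok
        else:
            # First disagreement: keep the target's token and stop.
            accepted.append(target_model(prefix))
            break
    return accepted

print(speculative_step(0, 4))  # draft agrees everywhere -> 4 tokens accepted
print(speculative_step(3, 4))  # draft wrong on the 2nd token -> 2 tokens out
```

The payoff: when the draft guesses right, one verification step yields several tokens instead of one, which is where the speedup comes from.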