Aldin
· 2w
There is this tech called Intel Optane. It is a fast memory storage meant to be used in RAID with an HDD. In that setup, it serves for caching, making a slow HDD feel fast. It is about 10 years old; sinc...
Afaik it won't help much, because if the model must be split between VRAM and anything else, it's slow. And there is the KV cache for the context adding to the memory load. IIRC, when inference can't use weights from VRAM it falls back to RAM, but that part is computed on the CPU - slow as hell (also dependent on how fast the CPU<>GPU link is). Optane would be avoided if possible, and if not, the slowdown compared to RAM is even bigger.
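For scale, the KV cache alone is sizable. A rough back-of-the-envelope estimate for a Llama-7B-style config (the layer/head numbers below are typical assumptions, not from this thread):

```python
# Rough KV-cache size estimate. Assumed (hypothetical but typical) config:
# 32 layers, 32 KV heads, head dim 128, fp16 storage (2 bytes/element).
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Factor of 2 for the separate key and value tensors at each layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(size / 2**30, "GiB")  # 2.0 GiB, on top of the weights themselves
```

So a 4k context can cost ~2 GiB extra, which is why whatever doesn't fit in VRAM spills to RAM (or worse, an Optane tier) and pays the slow-path penalty.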
Small models still don't fit usefully on consumer HW - maybe for certain tasks & managing tool-call output, but they must be guided by a bigger (API-connected) model.
(No real experience here; I just hope some day I can partially offload token consumption to my existing HW. So far that works for simple tasks like generating a title for a conversation, but not much more.)