librekitty
· 1d
you can also go the CPU route with tons of RAM, but inference speed will be terrible compared to GPU-accelerated
This is true: GPUs are faster for inference. But you'll also be pulling 1500 watts, dealing with the thermal issues that come with it, and still struggling to fit a model larger than 32B at a decent quantization.
Alternatively, the 395 chips and their NPU are doing pretty well. Combine 2 of them and you're looking at low-end GPU-level inference speeds AND 256GB of unified memory for a larger model plus plenty of context, STILL under 1000 watts.
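For anyone wondering where the "larger than 32B" cutoff comes from, here's a rough back-of-envelope sketch of weight memory vs. quantization. The bits-per-weight figures are my own assumptions for typical GGUF-style quants, and it ignores KV cache and runtime overhead, so treat the numbers as ballpark only.

```python
# Rough estimate of memory needed just for model weights at a given quantization.
# Assumed bits-per-weight values are illustrative, not exact for any specific quant.

def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate GB for the weights alone (no KV cache, no runtime overhead)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, params in [("32B", 32), ("70B", 70), ("120B", 120)]:
    q4 = weight_memory_gb(params, 4.5)   # roughly a 4-bit quant with overhead
    q8 = weight_memory_gb(params, 8.5)   # roughly an 8-bit quant with overhead
    print(f"{name}: ~{q4:.0f} GB at 4-bit, ~{q8:.0f} GB at 8-bit")
```

A 32B model at 4-bit comes out around 18 GB, which squeezes onto a single 24 GB card, while 70B+ needs far more than any consumer GPU has but fits comfortably in a large pool of unified memory.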