Joe Resident
· 4d
r/LocalLlama is the best resource for this question
tldr: For agentic-type tasks in the background, probably an Apple M series with lots of unified memory. And Qwen 3.5 27b has reached a level of agentic effec...
I neglected to mention, the most practical path for many people is to use their existing gaming rig and maybe add some more RAM (not VRAM).
With the preponderance of MoE (mixture-of-experts) models, it actually makes a lot of sense to offload the expert tensors to system RAM and run only part of the model on the GPU. llama.cpp supports this natively and it's not hard to configure. It slows things down, but not nearly as much as running everything on the CPU alone. And since you can install far more ordinary RAM than VRAM, you can run very large models at very slow speeds if you want to.
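As a rough sketch of what that setup looks like with llama.cpp (the model filename here is just a placeholder, and the tensor-override syntax has changed across versions — check `llama-server --help` on your build):

```shell
# Sketch: run a MoE model with attention/shared weights on the GPU
# while the per-expert FFN tensors stay in system RAM.
# Model path is illustrative — substitute your own GGUF file.

# -ngl 99 offloads all layers to the GPU; -ot then overrides the
# expert tensors (names matching .ffn_.*_exps.) back onto the CPU:
llama-server -m qwen3-30b-a3b-q4_k_m.gguf \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU"
```

The net effect: the small, frequently used parts of the model (attention, router, shared experts) run fast on the GPU, while the large but sparsely activated expert weights sit in cheap system RAM.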