Damus
Gigi · 3d
considering buying hardware to run everything locally. What should I buy? #asknostr
Joe Resident
r/LocalLlama is the best resource for this question

tldr: For agentic-type tasks in the background, probably an Apple M series with lots of unified memory. And Qwen 3.5 27b, which runs on a single 3090, has reached a level of agentic effectiveness that is kinda staggering (something like Opus 4.1)

rough breakdown:
For most cost-effective + fastest: stack 3090s (24GB each) until you have enough VRAM to run the model size class you want.
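Rough sizing math for "enough VRAM" (my own back-of-envelope, not from the thread — the bits-per-weight and overhead numbers are ballpark assumptions for a ~Q4 quant):

```python
# Back-of-envelope VRAM estimate for a quantized model.
# bits_per_weight ~4.5 approximates a Q4-ish quant; overhead covers
# KV cache and activations at modest context lengths. All approximate.
def vram_needed_gb(params_billions: float,
                   bits_per_weight: float = 4.5,
                   overhead: float = 1.2) -> float:
    weight_gb = params_billions * bits_per_weight / 8  # 1B params @ 8 bits ~= 1 GB
    return weight_gb * overhead

# A 27B model at ~4.5 bits per weight fits a single 24GB 3090:
print(round(vram_needed_gb(27), 1))
# A 70B-class model needs two:
print(round(vram_needed_gb(70), 1))
```

Long contexts blow past this quickly (KV cache grows with context), so treat it as a floor, not a ceiling.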

For most cost-effective/easiest/most power-efficient: buy an M1 Max with as much unified memory as you need for the models you want. (I recently got a 64GB M1 Max for $1200 that runs Qwen 3.5 122b at about 200 t/s prompt processing, 20 t/s generation. Running continuous openclaw cron jobs in the background, sipping power, not heating up the room or making any noise. Love it.)

For a bigger budget + fastest: stack 5090s (32GB), or if you don't want to physically manage that many GPUs, an NVIDIA RTX Pro 6000 Blackwell (96GB).

For a bigger budget + easiest: M3 Ultra or M4 Max with lots of unified memory, or wait for the M5 Ultra/Max. Performance comparison: https://github.com/ggml-org/llama.cpp/discussions/4167

AMD is where you go to increase cost effectiveness in exchange for more software-management headache. The Ryzen AI Max+ 395 (128GB) is an interesting alternative to large-memory Mac setups for running big models with minimal hardware and maximum power efficiency.
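One heuristic that explains the tradeoffs above (my rule of thumb, not from the post): token generation is usually memory-bandwidth bound, so a ceiling on t/s is roughly bandwidth divided by the bytes of active weights streamed per token. Bandwidth figures below are approximate spec-sheet numbers; the 13 GB model size is a hypothetical ~Q4 dense model:

```python
# Memory-bandwidth-bound ceiling on generation speed.
# Each generated token must stream the active weights from memory once,
# so tokens/sec can't exceed bandwidth / active_weight_bytes.
def gen_tokens_per_sec(bandwidth_gb_s: float, active_weight_gb: float) -> float:
    return bandwidth_gb_s / active_weight_gb

# Hypothetical comparison at ~13 GB of active weights:
for name, bw in [("M1 Max (~400 GB/s)", 400),
                 ("RTX 3090 (~936 GB/s)", 936),
                 ("RTX 5090 (~1792 GB/s)", 1792)]:
    print(f"{name}: ~{gen_tokens_per_sec(bw, 13):.0f} t/s ceiling")
```

Real numbers come in below the ceiling, but the ordering holds, which is why stacked 3090s beat Macs on speed and why MoE models (small active weights) run so fast everywhere.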


comparison of relevant models: https://artificialanalysis.ai/models?models=gpt-5-4-mini%2Cgpt-oss-120b%2Cgpt-oss-20b%2Cgpt-5-4-pro%2Cgpt-5-4%2Cgemini-3-flash-reasoning%2Cgemma-4-31b%2Cgemini-3-1-pro-preview%2Cgemini-3-1-flash-lite-preview%2Cclaude-4-5-haiku-reasoning%2Cclaude-opus-4-6-adaptive%2Cclaude-sonnet-4-6-adaptive%2Cdeepseek-v3-2-reasoning%2Cgrok-4-20%2Cminimax-m2-7%2Cglm-5%2Cqwen3-5-397b-a17b%2Cqwen3-5-122b-a10b%2Cqwen3-5-35b-a3b%2Cqwen3-5-27b%2Cclaude-4-opus-thinking%2Cclaude-4-1-opus-thinking
Joe Resident · 2d
I neglected to mention: the most practical path for many people is to use their existing gaming rig and maybe add more RAM (not VRAM). With the preponderance of MoE (mixture of experts) models, it actually makes a lot of sense to offload experts to CPU RAM and only run part of the model on the...
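The arithmetic behind that offload idea can be sketched like this (illustrative numbers of my own — the 120B-total / 10B-active shape and ~0.55 bytes/param quant are hypothetical, not any specific model):

```python
# Why MoE CPU offload works: only a few experts fire per token, so the
# hot path (attention + active experts) fits in VRAM while the cold
# expert pool sits in ordinary system RAM. Illustrative shapes only.
def moe_split_gb(total_params_b: float,
                 active_params_b: float,
                 bytes_per_param: float = 0.55) -> tuple[float, float]:
    total_gb = total_params_b * bytes_per_param
    gpu_gb = active_params_b * bytes_per_param   # streamed every token -> keep on GPU
    cpu_gb = total_gb - gpu_gb                   # inactive experts -> system RAM
    return gpu_gb, cpu_gb

# e.g. a hypothetical 120B-total / 10B-active MoE:
gpu, cpu = moe_split_gb(120, 10)
print(f"~{gpu:.1f} GB on GPU, ~{cpu:.1f} GB in system RAM")
```

So a gaming GPU with 8-12GB of VRAM plus 64GB of cheap DDR5 can run a model class that would otherwise need a multi-GPU rig, at the cost of some speed when cold experts get pulled from RAM.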