utxo the webmaster 🧑‍💻 · 3w
For any local AI maxis, here is my current setup and models:

4x 3090s
2x - qwen3.5-35b q4 256k - 60-80 t/s
2x - gemma4-27b q4 256k - 50-70 t/s

Running on vLLM via docker
Working mint openclaw, Ge...
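For anyone who wants to copy this, a minimal sketch of the vLLM docker launch for one of the 2-GPU pairs. The model repo name is a placeholder (swap in whatever q4 weights you actually run), and context/GPU flags are illustrative, not my exact command:

  # sketch only — model repo is a placeholder for your q4 quant
  docker run --runtime nvidia --gpus '"device=0,1"' \
    -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:latest \
    --model Qwen/Qwen3-32B-AWQ \
    --tensor-parallel-size 2 \
    --max-model-len 262144

--tensor-parallel-size 2 is what splits the model across a pair of 3090s, and --max-model-len 262144 corresponds to the 256k context above.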
Machu Pikacchu
Haven’t run qwen in a minute, but it’s surprising you’re not getting higher throughput for gemma4 on your 3090s.

For what it’s worth, if you use llama.cpp and disable reasoning, you should see a faster time to first token at the cost of a slight degradation in quality. Haven’t used vLLM so can’t comment there. For comparison, I get 70-75 tok/s on a MacBook M3, and that’s with only 40 GPU cores.
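Something like this is what I mean, as a rough sketch. The model path and context size are placeholders, and --reasoning-budget 0 assumes a recent llama.cpp build that has that flag:

  # sketch only — path/sizes are placeholders
  llama-server \
    -m ./qwen3-q4.gguf \
    -c 32768 \
    -ngl 99 \
    --reasoning-budget 0

With reasoning off, the model starts emitting the answer immediately instead of spending tokens on a hidden thinking block first, which is where the time-to-first-token win comes from.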
utxo the webmaster 🧑‍💻 · 3w
What params/quant are you running?