utxo the webmaster 🧑‍💻
· 1w
For any local AI maxis, here is my current setup and models:
4x 3090s
2x - qwen3.5-35b q4 256k - 60-80 t/s
2x - gemma4-27b q4 256k - 50-70 t/s
Running on vLLM via docker
Working mint openclaw, Ge...
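For anyone wanting to replicate a setup like this, a vLLM Docker launch along these lines is a reasonable starting point. This is a sketch, not the poster's exact config: the model repo ID is a placeholder, and the context length / port are illustrative.

```shell
# Sketch: vLLM OpenAI-compatible server in Docker, tensor-parallel across 2 GPUs.
# <your-qwen-35b-repo> is a placeholder -- substitute the actual HF model ID.
docker run --gpus '"device=0,1"' --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model <your-qwen-35b-repo> \
  --tensor-parallel-size 2 \
  --max-model-len 262144
```

The second pair of GPUs would run an identical container with `--gpus '"device=2,3"'` and a different host port.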
Are those MoE models? That's the only way I can make those tok/s make sense given the experience I've had.
I tried the 35b MoE and just didn't find it intelligent enough to substitute for cloud models. I even tried the qwen 3.5 122b-a10b, which activates only 10B parameters per token, and still found it not strong enough. Speed was fine, though.
Have now moved back to testing the dense 27b instead. It's not fast, but not unusable.
I will be trying out the NVFP4-quantized model with Multi-Token Prediction to see if that fares any better.
Should have prefaced with my setup: I'm running a single DGX Spark with 128 GB of unified memory.
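Napkin math on why the MoE was fast enough but the dense 27b drags on this box: single-batch decode is roughly memory-bandwidth-bound, so tok/s is about bandwidth divided by the bytes of weights read per token, i.e. the *active* parameter count. The numbers below are assumptions: ~273 GB/s for the DGX Spark's LPDDR5x, and ~0.5 bytes per parameter at q4.

```python
# Rough decode-speed ceiling for a bandwidth-bound system.
# Assumptions (not measured): ~273 GB/s DGX Spark bandwidth, ~0.5 B/param at q4.
BANDWIDTH_GBS = 273
BYTES_PER_PARAM = 0.5

def rough_tok_per_s(active_params_b: float) -> float:
    """Upper-bound tok/s from reading all active weights once per token."""
    weights_gb = active_params_b * BYTES_PER_PARAM
    return BANDWIDTH_GBS / weights_gb

print(f"dense 27B    : ~{rough_tok_per_s(27):.0f} t/s")
print(f"MoE, 10B act : ~{rough_tok_per_s(10):.0f} t/s")
```

This ignores KV-cache reads and compute, so real numbers land below these ceilings, but it matches the pattern in the thread: the a10b MoE decodes several times faster than the dense 27b on the same unified memory.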