utxo the webmaster 🧑‍💻 · 1w
For any local AI maxis, here is my current setup and models:

4x 3090s:
- 2x: qwen3.5-35b q4, 256k context - 60-80 t/s
- 2x: gemma4-27b q4, 256k context - 50-70 t/s

Running on vLLM via Docker. Working mint openclaw, Ge...
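A minimal sketch of what one of those 2-GPU instances could look like through vLLM's Python API, assuming tensor parallelism across a GPU pair; the model ID is a placeholder, not the poster's exact checkpoint:

```python
# Sketch: one vLLM instance spanning 2 of the 4 3090s via tensor parallelism.
# The model ID is a placeholder for whatever q4 checkpoint is actually served.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/placeholder-35b-q4",  # placeholder repo, not a real checkpoint
    tensor_parallel_size=2,           # shard weights across a pair of 3090s
    max_model_len=262144,             # the "256k" context window
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```

In the Docker setup, each instance would be pinned to its GPU pair (e.g. via CUDA_VISIBLE_DEVICES) and exposed through vLLM's OpenAI-compatible server rather than the offline API.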
zaytun
Are those MoE models? That's the only way I can make those tok/s make sense given the experience I've had.

I tried the 35b MoE and just didn't find it intelligent enough to substitute for cloud models. I even tried the qwen 3.5 122b-a10b, which activates only 10b parameters per token, and still found it not strong enough. Speed was fine though.
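A rough sanity check on those impressions: single-stream decode is mostly memory-bandwidth-bound, so throughput tracks active parameters, not total. The figures below (~273 GB/s for a DGX Spark, ~0.55 bytes per q4 parameter) are ballpark assumptions that ignore KV-cache traffic:

```python
# Back-of-envelope decode speed: tokens/s ~= bandwidth / bytes read per token.
# Assumptions: DGX Spark ~273 GB/s unified-memory bandwidth, ~0.55 bytes per
# parameter at q4 (weights plus scales), KV-cache reads ignored.
BANDWIDTH_GBPS = 273
BYTES_PER_PARAM_Q4 = 0.55

def est_tokens_per_s(active_params_billions: float) -> float:
    gb_per_token = active_params_billions * BYTES_PER_PARAM_Q4
    return BANDWIDTH_GBPS / gb_per_token

print(f"122b-a10b MoE: ~{est_tokens_per_s(10):.0f} t/s")  # ~50: "speed was fine"
print(f"dense 27b:     ~{est_tokens_per_s(27):.0f} t/s")  # ~18: slow but usable
```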

Have now moved back to testing the dense 27b instead. It's not fast, but not unusable.

I will be trying out the NVFP4-quantized model with multi-token prediction (MTP) to see if that fares any better (sketch below).

Should have prefaced with my setup: I'm running a single DGX Spark, 128GB unified memory.
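For the planned NVFP4 + MTP test, a sketch of what that could look like in vLLM's Python API. The model ID is a placeholder, NVFP4 quantization is normally auto-detected from the checkpoint's config, and the speculative_config keys follow recent vLLM releases but may differ on other versions:

```python
# Sketch: serving an NVFP4 checkpoint with MTP speculative decoding in vLLM.
# Placeholder model ID; speculative_config keys are version-dependent, so
# check the docs for the vLLM release you are running.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Placeholder-Model-NVFP4",  # placeholder ModelOpt NVFP4 checkpoint
    speculative_config={
        "method": "mtp",              # draft tokens from the model's MTP head
        "num_speculative_tokens": 1,  # MTP heads typically draft one extra token
    },
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```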
1โค๏ธ1
utxo the webmaster 🧑‍💻 · 1w
Definitely a huge step down from cloud models no matter how you spin it. I'm running MoE models for sure, and with MTP, to get these t/s.