Damus
utxo the webmaster 🧑‍💻 · 6d
For any local AI maxis, here is my current setup and models:

4x 3090s
2x - qwen3.5-35b q4 256k - 60-80 t/s
2x - gemma4-27b q4 256k - 50-70 t/s

Running on vLLM via docker
Working mint: openclaw, Ge...
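A rough sketch of that kind of two-GPU vLLM launch using the Python API rather than the docker entrypoint; the model repo id, quant format, and memory settings here are placeholders, not the poster's exact config:

```python
# Hypothetical vLLM launch mirroring the setup above: one model pinned to
# two of the four 3090s (tensor parallel), 4-bit quant, long context.
# The repo id and "awq" quant format are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/qwen3.5-35b-awq",   # placeholder repo id for the q4 build
    quantization="awq",              # assumed 4-bit format
    tensor_parallel_size=2,          # split weights + KV cache across 2 GPUs
    max_model_len=262144,            # 256k context window
    gpu_memory_utilization=0.90,
)

out = llm.generate(
    ["Summarize the trade-offs of very long context windows."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(out[0].outputs[0].text)
```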
davide
I run qwen3.5-35b on a 3090 (via llama.cpp) and it's blazing fast with a ctx-size of 32K, but the context fills up too early. I'm experimenting with larger sizes; the trade-off is speed, since more RAM gets used. Any optimal context size in your experience?
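For comparison, a minimal sketch of the single-3090 llama.cpp setup davide describes, via the llama-cpp-python bindings; the GGUF path is a placeholder, and n_ctx is the context-size knob that trades VRAM for headroom:

```python
# Hypothetical single-GPU llama.cpp run: full GPU offload, 32K context.
# Raising n_ctx grows the KV cache linearly, which is the RAM-vs-context
# trade-off mentioned above.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/qwen3.5-35b-q4_k_m.gguf",  # placeholder path
    n_ctx=32768,        # context window; raise to 49152/65536 to experiment
    n_gpu_layers=-1,    # offload all layers to the 3090
)

resp = llm("Q: What limits context length on a 24 GB GPU?\nA:", max_tokens=64)
print(resp["choices"][0]["text"])
```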
utxo the webmaster 🧑‍💻 · 6d
I couldn't get it going on a single GPU, at least not via vLLM, which I think is a bigger VRAM hog. 256k context is kinda useless anyway because prompt processing is quite slow at that size. Still experimenting.
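A back-of-envelope estimate of why 256k context is so heavy on VRAM from the KV cache alone; the layer count, KV heads, and head dim below are assumed illustrative figures for a ~35B GQA model, not the real model spec:

```python
# Rough KV-cache size estimate; all architecture numbers are assumptions.
layers, kv_heads, head_dim = 64, 8, 128   # hypothetical ~35B GQA shape
bytes_per_elem = 2                        # fp16/bf16 KV cache

per_token = 2 * kv_heads * head_dim * bytes_per_elem * layers  # K and V
for ctx in (32_768, 131_072, 262_144):
    gib = per_token * ctx / 2**30
    print(f"{ctx:>7} tokens -> ~{gib:.0f} GiB KV cache")

# ~8 GiB at 32k vs ~64 GiB at 256k under these assumptions: far beyond a
# single 24 GB 3090, and prompt processing time grows with context on top.
```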