utxo the webmaster 🧑‍💻
· 2d
Definitely a huge step down from cloud models, no matter how you spin it. I'm running MoE models for sure, and with MTP (multi-token prediction) to get these t/s.
Cool, thanks for sharing.
I would've assumed you'd benefit from running llama.cpp to better utilize your available CPU now that you're running dedicated VRAM. My understanding might be wrong, but I think you have some options on hand when running those GPUs + CPU.
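To make the idea concrete, here's a rough sketch of that kind of GPU + CPU split using the llama-cpp-python bindings (the model path, layer count, and thread count are placeholders, not your actual config):

```python
# Rough sketch: split a GGUF model between VRAM and CPU with llama-cpp-python.
# model_path, n_gpu_layers, and n_threads are placeholders -- tune for your hardware.
from llama_cpp import Llama

llm = Llama(
    model_path="./model.gguf",  # placeholder path to your GGUF file
    n_gpu_layers=35,            # layers offloaded to VRAM; the rest stay on CPU
    n_ctx=4096,                 # context window
    n_threads=8,                # CPU threads for the layers left on CPU
)

out = llm("Q: Which layers run where? A:", max_tokens=64)
print(out["choices"][0]["text"])
```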
For me, on unified memory, I've tested both, but currently vLLM seems to work best.
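For reference, my vLLM test is roughly this (minimal sketch; the model name is just an example, not what I actually ran):

```python
# Minimal vLLM sketch -- the model id below is only an example.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # example Hugging Face model id
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Why is unified memory nice for local inference?"], params)
print(outputs[0].outputs[0].text)
```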