Damus
utxo the webmaster 🧑‍💻 · 2d
Definitely a huge step down from cloud models no matter how you spin it. I'm running MoE models for sure, and with MTP, to get these t/s.
zaytun
Cool, thanks for sharing.

I would've assumed you'd benefit from running llama.cpp to better utilize your available CPU now that you're running dedicated VRAM. My understanding might be wrong, but I think you have some options on hand while running those GPUs + CPU.
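For what it's worth, a minimal sketch of the kind of hybrid GPU+CPU split I mean with llama.cpp: the model path, layer count, and thread count below are placeholders, not anything from this thread, so they'd need tuning for the actual hardware.

```shell
# Hypothetical llama.cpp invocation (placeholder values):
#   -ngl N      offloads N transformer layers to the GPU (VRAM);
#               the remaining layers run on the CPU
#   --threads T sets the CPU thread count for the non-offloaded layers
./llama-server -m ./models/model.gguf -ngl 30 --threads 16
```

The usual approach is to raise `-ngl` until VRAM is nearly full, so the GPU takes as many layers as it can hold and the CPU handles the rest.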

For me, on unified memory, I've tested both, but currently vLLM seems to be the best fit for unified memory.