utxo the webmaster 🧑‍💻
· 2d
Definitely a huge step down from cloud models, no matter how you spin it. I'm running MoE models for sure, and with MTP (multi-token prediction) to get these t/s.
Cool, thanks for sharing.
I would've assumed you'd benefit from running llama.cpp to better utilize your available CPU now that you're running dedicated VRAM. My understanding might be wrong, but I think you have some options on hand when running those GPUs + CPU.
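To make the idea concrete, here's a rough sketch of that kind of GPU + CPU split using the llama-cpp-python bindings (the model path, layer count, and thread count are placeholders, not your actual config):

```python
# Rough sketch: split a GGUF model between VRAM and CPU with llama-cpp-python.
# model_path, n_gpu_layers, and n_threads are placeholders -- tune for your hardware.
from llama_cpp import Llama

llm = Llama(
    model_path="./model.gguf",  # placeholder path to your GGUF file
    n_gpu_layers=35,            # layers offloaded to VRAM; the rest stay on CPU
    n_ctx=4096,                 # context window
    n_threads=8,                # CPU threads for the layers left on CPU
)

out = llm("Q: Which layers run where? A:", max_tokens=64)
print(out["choices"][0]["text"])
```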
For me, on unified memory, I've tested both, but currently vLLM seems to work best.
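For reference, my vLLM test is roughly this (minimal sketch; the model name is just an example, not what I actually ran):

```python
# Minimal vLLM sketch -- the model id below is only an example.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # example Hugging Face model id
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Why is unified memory nice for local inference?"], params)
print(outputs[0].outputs[0].text)
```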