Damus

Recent Notes

zaytun profile picture
Qwen 3.6 35b just dropped.
According to Hugging Face benchmarks it's almost head-to-head with the 27b dense model, which I think is crazy if true.

That's realistically 3-4x tokens/s for roughly equal model intelligence, in a model specifically trained to do tool calling and agentic work in general.
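The 3-4x figure can be sanity-checked with a back-of-envelope, assuming decoding is memory-bandwidth bound (speed scales with the parameters actually read per token). The active-parameter counts below are illustrative assumptions, not published figures:

```python
# Rough MoE-vs-dense speedup estimate, assuming decode is
# memory-bandwidth bound: tokens/s scales inversely with the
# parameters read per generated token (active params for an MoE).
def rel_speedup(dense_params_b: float, active_params_b: float) -> float:
    """Relative tokens/s of an MoE vs a dense model of comparable quality."""
    return dense_params_b / active_params_b

# Illustrative: a 27b dense model vs an MoE activating ~7-9b per token
print(rel_speedup(27, 9))             # 3.0
print(round(rel_speedup(27, 7), 1))   # 3.9
```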

Hope it holds up.

@utxo the webmaster πŸ§‘β€πŸ’» this'll fit nicely on your setup I think?

#openclaw #qwen

zaytun · 1d
Obviously it's a Mixture-of-Experts model, which is why it would be impressive if it holds up to the 27b dense model quality-wise.
utxo the webmaster πŸ§‘β€πŸ’» · 2d
Definitely a huge step down from cloud models no matter how you spin it. I'm running MoE models for sure, and with MTP, to get these t/s.
zaytun profile picture
Cool thanks for sharing.

I would've assumed you'd benefit from running llama.cpp to better utilize your available CPU now that you're running dedicated VRAM. My understanding might be wrong, but I think you have some options on hand while running those GPUs + CPU.

For me, I've tested both on unified memory, and vLLM currently seems like the best fit there.
utxo the webmaster πŸ§‘β€πŸ’» · 4d
For any local AI maxis, here is my current setup and models:
4x 3090s
2x - qwen3.5-35b q4 256k - 60-80 t/s
2x - gemma4-27b q4 256k - 50-70 t/s
Running on vLLM via docker
Working mint openclaw, Ge...
zaytun profile picture
Are those MoE models? That's the only way I can make those tok/s make any sense with the experience I've had.

I tried the 35b MoE and just didn't find it intelligent enough to substitute for cloud models. I even tried the qwen 3.5 122b-a10b, which activates 10b at a time, and still found it not strong enough. Speed was fine though.

Have now moved back to testing the dense 27b instead. It's not fast, but not unusable.

I will be trying out the NVFP4-quantized model with Multi-Token Prediction to see if that fares any better.

Should have prefaced with my setup: I'm running a single DGX Spark, 128gb unified.
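A rough sense of why the dense 27b isn't fast on unified memory: decode is usually memory-bandwidth bound, since every generated token requires reading roughly the whole model. The figures below (~273 GB/s bandwidth for the Spark, ~15 GB for a q4 27b model) are assumptions for illustration:

```python
# Upper bound on decode speed when memory-bandwidth bound:
# each generated token reads (roughly) the full set of weights.
def max_tokens_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

# Assumed DGX Spark bandwidth (~273 GB/s) vs an assumed ~15 GB q4 27b model
print(round(max_tokens_per_s(273, 15), 1))  # ~18 t/s ceiling; real t/s is lower
```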
utxo the webmaster πŸ§‘β€πŸ’» · 2d
Definitely a huge step down from cloud models no matter how you spin it. I'm running MoE models for sure, and with MTP, to get these t/s.
zaytun profile picture
Crazy man... it's all moving so fast. DDTree coming up too, whatever the heck that means.

Speculative decoding: a small draft model guesses the next several tokens, and the big model only has to verify those guesses instead of generating each token itself.

My ELI5 sentence there is rough, but yeah, I think that's somewhat what it is!

It's crazy to feel the speed at which the software behind local LLMs is being optimized. It means a machine bought today that can run x-billion-parameter models should be able to do x+y billion parameters in two months.
zaytun profile picture
Here's my Agentic AI Journey

Brief background on who I am
- Zero computer science knowledge outside of excel (and formatting my dad's desktop PC as a kid - using floppy disks)

- Studied Fiat Economics

- Discovered bitcoin in late 2016-2017 (toward the end of my fiat economics indoctrination)

- Due to simultaneously jumping head-first into the rabbit hole AND coming up on my final master's thesis semester - I had to wiggle my way into a Bitcoin-related master's thesis subject

- This is where I discovered that computers think in binary - And discovered binary altogether 🀷

- So I crash-taught myself what binary is in order to kinda understand how cryptographic hash functions work - in order to kinda understand the concept behind the blockchain.

- Due to liking the computer science area, I quit my recently acquired job and started freelancing as a data analyst - specifically developing Microsoft Power BI solutions -> Which led to more engineering related tasks, resulting in a career in data engineering.

- That entire intro is to tell you that bitcoin changed the course of my life from fiat economics -> to freelance data engineering -> to now having an Agentic AI journey I wanna share with the world 🀷

SO! Back to AI

A friend reached out to me late one night asking "Hey, do you do agentic AI?" - to which I responded "Sure! What's up?" Which wasn't a lie!!

I had installed Claude Code on my work laptop the week before and had used it to solve one task 😬

Next day, I ran openclaw on a laptop I had lying around, trying to fiddle with local models via Ollama. A higher-end laptop, but older. Decent CPU and 8GB VRAM. Massively specced for Power BI, Management Studio, and Excel - But...

Not so much for Openclaw and Local LLMs - It just didn't work.

The models small enough for that laptop were way too weak, so it was basically useless. In hindsight, MAYBE llama.cpp would've been a better choice than Ollama, but at that point in time I had no idea what that was.

llama.cpp is an inference engine that is better suited to handling LLMs that need to be split between CPU and GPU.
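As a rough illustration of the split llama.cpp does (via its --n-gpu-layers option), here's a back-of-envelope for how many layers could fit in VRAM. All sizes below are assumptions for illustration, not measured values:

```python
# Back-of-envelope: how many layers of a quantized model fit in VRAM,
# assuming layers are roughly equal in size and some VRAM is reserved
# for KV cache and overhead. All figures are illustrative assumptions.
def layers_on_gpu(vram_gb: float, model_gb: float, n_layers: int,
                  reserve_gb: float = 1.5) -> int:
    """Layers that fit on the GPU; the rest stay on the CPU."""
    per_layer_gb = model_gb / n_layers
    usable = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable / per_layer_gb))

# e.g. an assumed 8 GB card with an assumed ~15 GB model of 48 layers
print(layers_on_gpu(8, 15, 48))  # 20
```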

Despite having had difficulties with the model itself, this first dip of the feet was enough to give me an intro to OpenClaw and the way its markdown files are set up. I now knew about:

- How the agent forms its identity

- That it has a soul

- That it wants information on me as a user and

- That it logs its memory into memory markdown files

This was massive and a lot of late nights were to follow.

This note turned into an #Introductions 🀷.

The random picture is of a sauna I built in my backyard.

Anybody running turboquant? #AskNostr
Gigi · 1w
considering buying hardware to run everything locally. What should I buy? #asknostr
zaytun profile picture
What's "everything"?
At this point I think I have at least 5 different machines running, of various sizes: from an RPi 3B, to an older gaming PC with decent (but old) hardware, to a high-end (consumer) AI inference machine.

Is it just self-hosting everyday services? AI inference?

I'm currently testing out an Nvidia DGX Spark for AI inference. My Openclaw agent is called Sky, and I'm getting around 10 tokens/s on qwen3.5:27b.

It's not great (yet) but it works.

Whats the first service you want to move to local?