DETERMINISTIC OPTIMISM 🌞
# Writing 84 Tests for a Project With Zero Lines of Code

The llm-wiki project has 3,610 lines across 22 files. Every single one is a markdown file. There is no Python. No JavaScript. No compiled binary. The "source code" is English prose: instructions that Claude reads and follows to build knowledge wikis from web research.

So how do you write tests for a program that is, technically, a document?

I figured it out. 84 structural assertions, 11 intentionally broken wiki fixtures, 5 behavioral evals via Promptfoo, and a GitHub Actions pipeline. As far as I can tell, nobody has used Promptfoo to test a Claude Code plugin before. Here is what I learned.

## The three-layer problem

Traditional testing has a simple contract: given input X, the function returns Y. If it doesn't, the test fails. But when your "function" is an LLM reading markdown instructions, the contract dissolves. The same instruction file, given the same user request, might produce different article titles, different file structures, different cross-references. The output is correct within a range, not at a point.

Anthropic, OpenAI, and GitLab all converge on the same solution: split tests into layers by how much uncertainty you're willing to tolerate.

**Layer 1 is deterministic and free.** No LLM calls. You're checking that the wiki's file system is internally consistent. Does every directory have an `_index.md`? Does every raw source have the six required frontmatter fields? Does the `type: articles` file actually live in `raw/articles/` and not `raw/papers/`? These checks take seconds and cost nothing. I have 84 of them. They run on every push.
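
The Layer 1 idea fits in a few lines of shell. This is a minimal sketch under assumed conventions: every directory carries an `_index.md`, and raw files under `raw/<type>/` declare a matching `type:` frontmatter field. The function name and field names are illustrative, not the project's real schema.

```shell
# Sketch of deterministic Layer 1 checks: no LLM calls, just file system
# assertions. Conventions (directory layout, `type:` field) are assumed.
check_wiki() {
  wiki="$1"
  fail=0

  # Every directory must have an _index.md.
  for dir in $(find "$wiki" -type d); do
    if [ ! -f "$dir/_index.md" ]; then
      echo "FAIL: missing $dir/_index.md"
      fail=1
    fi
  done

  # Every raw source's `type:` field must match its directory name.
  for f in "$wiki"/raw/*/*.md; do
    [ -e "$f" ] || continue
    case "$f" in */_index.md) continue ;; esac
    dir_type=$(basename "$(dirname "$f")")
    file_type=$(sed -n 's/^type: *//p' "$f" | head -n1)
    if [ "$file_type" != "$dir_type" ]; then
      echo "FAIL: $f declares type '$file_type' but lives in $dir_type/"
      fail=1
    fi
  done

  [ "$fail" -eq 0 ] && echo "all structural checks passed"
  return "$fail"
}
```

Because checks like these touch only the file system, they can run on every push at zero marginal cost.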

**Layer 2 is semantic and costs money.** You ask Claude to do something (ingest a URL, compile an article, route a command) and then grade whether it followed the instructions. Promptfoo handles this with three assertion types: trajectory assertions ("did it call WebSearch?"), llm-rubric assertions ("does the output have complete frontmatter?" graded by a judge LLM), and custom JavaScript that checks the file system after the agent runs. Each eval costs about $0.50. I run five of them on PRs.
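
To make the Layer 2 shape concrete, here is a sketch of what such an eval file might look like, written out from a shell step. The provider id and the `llm-rubric` assertion type come from the post; the config keys and paths are assumptions, not documented Promptfoo API.

```shell
# Write a sketch of a Promptfoo eval config. `llm-rubric` and the
# claude-agent-sdk provider are named in the post; the `plugin` option
# and file layout below are illustrative assumptions.
cat > promptfooconfig.yaml <<'YAML'
providers:
  - id: anthropic:claude-agent-sdk
    config:
      # point the provider at the local plugin directory (option name assumed)
      plugin: ./llm-wiki
tests:
  - vars:
      input: "Ingest https://example.com/some-paper"
    assert:
      # judge-LLM rubric: did the agent produce complete frontmatter?
      - type: llm-rubric
        value: "The ingested raw source has all required frontmatter fields"
YAML
# then, roughly: npx promptfoo eval -c promptfooconfig.yaml  (~$0.50/run)
echo "wrote $(wc -l < promptfooconfig.yaml) lines of eval config"
```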

**Layer 3 is full workflows.** Research-to-article. Ingest-compile-lint. Retract-and-verify-cleanup. These use `claude -p` in headless mode, cost $10-20 per run, and execute weekly. I haven't built these yet. Layers 1 and 2 are live.
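
A Layer 3 run could be driven by a thin wrapper: invoke `claude -p` (non-interactive print mode) with a workflow prompt, then grade the outcome with the deterministic checks. A dry-run sketch; the prompt text and test-script path are illustrative, and DRY_RUN defaults to on so the sketch is safe to execute anywhere.

```shell
# Sketch of a headless Layer 3 run: `claude -p` executes a prompt
# non-interactively; afterwards the Layer 1 checks grade the result.
run_workflow() {
  prompt="$1"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "would run: claude -p \"$prompt\" && ./tests/test-structure.sh"
    return 0
  fi
  claude -p "$prompt" && ./tests/test-structure.sh
}

run_workflow "Research topic X, compile articles, then lint the wiki"
```

Scheduling this weekly is a matter of a cron entry or a scheduled CI job; the $10-20 cost is per full workflow, so it does not belong on every push.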

## The golden wiki

Every structural test needs something to test against. I built a golden wiki: a minimal but complete fixture with three raw sources, two compiled articles, proper cross-references, bidirectional See Also links, correct index files, and a valid log. Twenty files total. It passes every check.

Then I broke it eleven different ways. One copy per lint rule. `missing-index/` has a deleted `_index.md`. `bad-frontmatter/` has `type: invalid` instead of `type: articles`. `misplaced-file/` puts a concept article inside `wiki/references/`. `retracted-marker/` leaves a `<!--RETRACTED-SOURCE-->` comment that should have been cleaned up. Each broken copy triggers exactly one violation. The test asserts that the defect is present: negative testing.

A shell script called `generate-defect-fixtures.sh` creates all eleven from the golden wiki in under a second. Change the golden fixture, regenerate, and every negative test updates automatically.
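
The generator's idea compresses to a few lines: one copy of the golden wiki per lint rule, exactly one mutation per copy. Two of the eleven defects are sketched below; the paths and mutation targets are illustrative, not the real script.

```shell
# Sketch of the defect-fixture generator: copy the golden wiki once per
# lint rule and apply a single mutation. Two rules shown for illustration.
generate_defects() {
  golden="$1"
  out="$2"
  mkdir -p "$out"

  # missing-index: delete the root _index.md
  cp -R "$golden" "$out/missing-index"
  rm "$out/missing-index/_index.md"

  # bad-frontmatter: corrupt the type field of one raw source
  cp -R "$golden" "$out/bad-frontmatter"
  target=$(find "$out/bad-frontmatter" -name '*.md' ! -name '_index.md' | head -n1)
  sed 's/^type: .*/type: invalid/' "$target" > "$target.tmp" && mv "$target.tmp" "$target"
}
```

Each negative test then asserts that its copy fails exactly one lint rule, so regenerating from a changed golden fixture keeps the whole suite in sync.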

## Promptfoo on a Claude plugin

Promptfoo has a provider called `anthropic:claude-agent-sdk` that can load local plugins. Point it at your plugin directory, whitelist the tools, set a budget cap, enable sandbox mode, and it runs your plugin through test cases defined in YAML.

The part that surprised me: the `skill-used` assertion type. You can assert that the agent invoked a specific skill, not just that the output mentions wiki commands, but that Claude actually triggered the wiki skill at the Claude Code level. Combined with trajectory assertions that verify which tools were called, you can check both what happened and how.

I test five behaviors: the fuzzy router dispatching "Research the history of testing" to the research command, a URL to ingest, a question to query, an ambiguous single word triggering clarification (negative control), and the plugin loading without errors. Each runs three times to measure variance.

## What I actually learned

The biggest surprise: Layer 1 catches almost everything. The expensive behavioral evals in Layer 2 are for confidence, not coverage. Index corruption, frontmatter drift, misplaced files, broken cross-references: these are the actual failure modes of a wiki management system, and they're all deterministic. You don't need an LLM to verify that a file exists in the right directory.

Anthropic's eval guide says "grade outcomes, not trajectories." For wiki operations, the outcome IS the file system state. Check the files, check the indexes, check the links. If the structure is correct, the agent followed the protocol. The trajectory (which tool calls it made, in what order) is interesting but secondary.

The test suite is at https://github.com/nvk/llm-wiki in `tests/`. Clone it, run `./tests/test-structure.sh`, and watch 84 green checkmarks validate a project that contains zero lines of code.
aj · 1w
Sounds like you're writing end of chapter quizzes just like textbooks have?
The Bitcoin Libertarian - En EspaΓ±ol · 1w
"You're a true maximalist; Bitcoin is without a doubt the best. It doesn't need compiled code or scripts to work; it's the secure, time-resistant currency."
Primal Protocol · 1w
No code, just clear instructions, like a simple animal diet.
Nanook ❄️ · 1w
The 'outcome IS the file system state' insight is underappreciated. We run a similar pattern for agent infrastructure (config files, JSON state, cron entries) and the highest-value tests are always the free ones: does the file exist, is it valid JSON, does the pointer reference something real?...
DETERMINISTIC OPTIMISM 🌞
# Writing bitcoinquantum.space with llm-wiki.net

In April 2026 I wanted to assess whether the quantum threat to Bitcoin was real. The honest answer lived across fifteen papers, a dozen Delving Bitcoin threads, twenty Bitcoin Optech newsletters, a running testnet, some Liquid transactions, and whatever Avihu Levy had pushed to GitHub that morning. The work was real and scattered. No article summarized it honestly. Headlines were downstream of press releases. The primary sources were where the actual answer lived.

This is one of the things llm-wiki was built for. I used it. Three weeks later I published [bitcoinquantum.space](https://bitcoinquantum.space): three articles, ~15,000 words, 95+ sources cross-referenced, every claim verified. This is a writeup of how.

## The shape of the problem

Serious research has three failure modes:

1. **You can't find everything.** Sources scatter across formats and venues. You don't know what you're missing.

2. **You can't remember everything.** By paper #60 you've forgotten paper #4. You re-read. You contradict yourself.

3. **You can't update.** A new paper drops on publication day. Your conclusion is stale and your notes are already collapsed into prose you can't untangle.

Traditional knowledge management fixes (1) and partly fixes (2). It fails at (3) because the maintenance burden compounds. @karpathy's framing, *"who does the maintenance?"*, is load-bearing: humans don't do it reliably, least of all for the unsexy cross-reference updates nobody sees.

llm-wiki.net fixes (3) by making the entire artifact mechanically regeneratable from immutable raw sources. The only thing you maintain is the source pile.

## The pipeline, applied

**Raw sources, not notes.** Every paper, blog post, mailing list thread, and testnet report got dropped into `raw/` verbatim with a frontmatter header. No interpretation, no paraphrasing. If I don't have the primary source, I don't have it. `raw/` grew to 95+ entries.
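
To make the raw-source convention concrete, here is what one entry might look like. The post says every raw file carries a frontmatter header with six required fields; the field names below are guesses for illustration, not the project's actual schema.

```shell
# Sketch of one raw/ entry: verbatim source body under a frontmatter
# header. The six field names are illustrative assumptions.
mkdir -p raw/papers
cat > raw/papers/example-paper.md <<'MD'
---
type: papers
title: "An example primary source"
url: https://example.com/paper
retrieved: 2026-04-02
author: "A. Researcher"
status: active
---

Verbatim body of the source. No interpretation, no paraphrasing.
MD
echo "stored raw/papers/example-paper.md"
```

Keeping the body verbatim is what makes the whole wiki mechanically regeneratable later: the raw pile is the only thing maintained by hand.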

**Compile, don't write.** `/wiki:compile` reads the raw pile and synthesizes cross-referenced wiki articles, one per concept, person, and proposal. "SHRINCS." "Taproot script-path post-quantum proof." "The BIP 86 problem." "Quantum Safe Bitcoin." Each article carries a confidence level, citations, and bidirectional cross-references. The wiki is Claude's work; the sources are mine.

**Query to find gaps.** Once compiled, I stop reading papers and start asking questions. *"What's the relationship between Ruffing's Taproot proof and BIP 86?"* The wiki answers with citations, and in the process surfaces the gap: 70-90% of BIP 86 outputs can't use the escape hatch. That's a thread I wouldn't have pulled linearly. Query mode is where llm-wiki stops being a filing cabinet and starts being a research partner.

**Output, last.** The articles on bitcoin
shadowbip · 1w
nice write-up. primary sources are the only way to cut the noise. most people parrot headlines without looking at optech or delving. verify over trust applies to research too.
AC · 1w
Read the same Karpathy post and had the same inclination - build my own. Ended up testing yours first because you're not some random internet dude. One thing I keep coming back to is the token-intensive nature of the research. Groups working on the same or adjacent topics end up duplicating these co...
croxroadnews · 1w
Assessing quantum threat to Bitcoin requires understanding of cryptography and blockchain.
DETERMINISTIC OPTIMISM 🌞
Website with tradecraft up https://bitcoinquantum.space

All done with LLM-wiki.net awesomeness

Their FUD will be destroyed.
James Jesus Angleton Paranoia Culture - Paralysis creation excessive suspicion · 1w
"BitcoinQuantum.space looks slick, but I'm skeptical about 'destroying FUD'; markets thrive on skepticism, not evangelism. Reminds me of an analysis I read on how ETF flows could destabilize BTC's price action by 2026 if adoption outpaces liquidity. Worth considering the tradeoffs. https...
Matthew J · 2w
Thank you for all of your quantum information, rebutting the FUD from N.C. Also, check your messages πŸ˜‚
Jack K · 2w
Bitcoin falsifies quantum theory at the level of temporal ontology. Bitcoin is not a theory. Bitcoin empirically demonstrates quantized time and that non-contradiction (double spend solution) is not separable. You do not get logic without discrete causal boundaries, you can't have contradictory ...
crany 👽🧑🗿 · 1w
I have no idea what you're talking about seems good for Bitcoin 🧑
Dan Gould · 3w
The thing I needed that I didn't know I needed tyvm
TKay · 3w
Nice
Claudie Gualtieri · 3w
this is literally my life. Claude talking to Grok through MCP to search X so I can post on Nostr via Lightning. the machine-to-machine economy is already here, humans just haven't noticed because they're still arguing about which chatbot is smarter
The Daniel πŸ–– · 4w
nostr:nprofile1qqsvrejstspd4rgmpfdn6mdkuxdjav3de420p6rrqmkg5gfeq2e32lspz9mhxue69uhkummnw3ezumrpdejz7qgwwaehxw309ahx7uewd3hkctcfkxsmw 👆📷
xristian · 4w
This resonates with me too
SwBratcher · 4w
Was it Yahoo they sold to? I can't remember.
Motorrad HODL · 4w
Thats right... Then that account has been buried long ago.