LessWrong (RSS Feed) on nostr

Recent Notes

Understand why AI is a doom-risk in 39 captivating minutes

I’ve really wanted more good short accounts of why AI poses an existential risk. Working on one myself has been one of those incredibly high priorities I keep putting off.

Meanwhile award-winning journalist Ben Bradford of NPR has made a podcast version of my case for AI x-risk that I am thrilled with!

(Bonus within the 39 minutes: what Hamza Chaudhry of FLI thinks we should do about it—who I was delighted to later meet as a consequence!)

If you or anyone you know could do with a quick and gripping rundown of why this is a problem, try this one.

Get it on any podcasting app here: https://pod.link/1893359212

This NPR page has more context on the rest of the series, which assesses different possible sources of doom: https://www.npr.org/2026/04/21/g-s1-118208/a-new-npr-network-podcast-asks-the-big-question-are-we-doomed

https://www.lesswrong.com/posts/xjyAY5TdPnvgKPoYX/understand-why-ai-is-a-doom-risk-in-39-captivating-minutes#comments

https://www.lesswrong.com/posts/xjyAY5TdPnvgKPoYX/understand-why-ai-is-a-doom-risk-in-39-captivating-minutes

LessWrong (RSS Feed)

@LessWrong (RSS Feed) 1777663023

Ambitious Mech Interp w/ Tensor-transformers on toy languages [Project Proposal]

This is my project proposal for Pivotal. Apply at https://www.pivotal-research.org/fellowship by May 3rd.

The field has accumulated a vocabulary of computational primitives (induction heads, skip-trigrams) through post-hoc analysis. We propose building a toy language from these known primitives to train tensor-transformers (see an early example in the last section)

This allows us to study fundamental problems (suppression & error correction, compositionality/circuits, dev-interp, etc.) with the odds stacked in our favor:

- We know the data-generating process (DGP) - what the bigram statistics, the skip-trigrams, induction are and how they interfere w/ each other.

- Tensor-transformers make compositionality clear-as-day (ie you can find relationships between any model components solely from the weights, whereas normal NNs require running data).

- This is a transformer on a language task - results learned here straightforwardly apply to real LLMs

- Modifiable complexity - we can change the complexity of the data, number of layers, width of model, etc (in general, we can easily train a bespoke model to verify a target hypothesis).

Specific Research Directions

- Improve the DGP (data-generating process) - Extend the DGP to include computational patterns beyond n-grams (eg nested structure (brackets, quotes), long-range dependencies, context-sensitive transitions, etc) learned from existing datasets.

- Interp-across-time - are there any dependent structures during training (eg must learn X before learning Y)? (Most similar to work by Naomi Saphra)

- Building interp tools - What techniques (existing or novel) can be used to find these ground truth features?

- Phenomenon Studies — use the controlled setup to characterize specific computational phenomena (suppression, error correction, compositional reuse) with ground-truth verification.

- Tensor Interp - because we're using a tensor-transformer, there may be new techniques available to us (prior familiarity with tensor networks is a prerequisite for this direction)

High Level View

I'm shooting for a healthy feedback loop of:

1. Use existing computational vocab (eg induction) to make a toy LLM
2. Use (1) to improve our basic knowledge of models (eg suppression) and learn new computational vocab
3. Repeat
4. ...
5. Profit

If we succeed at enough loops of this process, this could work as a foundation for LLMs automating ambitious mech interp. In a sense, mech interp is already a verifiable task (ie find *simple* descriptions that replicate model behavior), but we need to resolve enough of our own confusions (& build better tools) first.

If this interests you, apply at https://www.pivotal-research.org/fellowship to my (& Thomas') research stream (by May 3rd).

Current Trained Model

As an example, I’ve trained a 2-layer attn-only model. Looking at embed -> unembed:

There’s lots of apparent structure. Zooming into the Verb_T/NOUN square, you can see the bigram statistics for:

- alice → sees(70%), helps(20%), finds(10%)
- bob → knows(70%), likes(20%), meets(10%)
- carol → calls(70%), tells(20%), sees(10%)
- Etc
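A minimal sketch of how the bigram part of such a data-generating process could be sampled, using only the three rows listed above. The table layout and the name `sample_next_verb` are illustrative assumptions, not the actual code behind the trained model.

```python
import random

# Bigram statistics copied from the rows above: subject -> [(verb, probability)].
# The dict layout is an assumption made for this sketch.
BIGRAMS = {
    "alice": [("sees", 0.7), ("helps", 0.2), ("finds", 0.1)],
    "bob": [("knows", 0.7), ("likes", 0.2), ("meets", 0.1)],
    "carol": [("calls", 0.7), ("tells", 0.2), ("sees", 0.1)],
}


def sample_next_verb(subject, rng):
    """Draw the token that follows `subject` according to the bigram table."""
    verbs, probs = zip(*BIGRAMS[subject])
    return rng.choices(verbs, weights=probs, k=1)[0]


rng = random.Random(0)
print([sample_next_verb("alice", rng) for _ in range(5)])
```

Because the table is known exactly, the embed -> unembed path of a trained model can be compared against it row by row.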

We can also look at the slice of the QK circuit:

For skip-bigrams (4 rules, max_skip=8):

-   beach ... big → at
-   garden ... old → and
-   lake ... new → or
-   office ... small → to
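One way to read the four rules above, as a hedged sketch: if the trigger word appeared within the previous `MAX_SKIP` tokens and the current token is the second word, the rule forces the next token. This reading of `max_skip`, along with the names `SKIP_RULES` and `skip_rule_next`, is an assumption for illustration and may not match the real generator.

```python
# Rules copied from the list above: (earlier token, current token) -> next token.
SKIP_RULES = {
    ("beach", "big"): "at",
    ("garden", "old"): "and",
    ("lake", "new"): "or",
    ("office", "small"): "to",
}
MAX_SKIP = 8  # assumed meaning: how far back the earlier token may sit


def skip_rule_next(context):
    """Return the forced next token if a rule fires on `context`, else None."""
    if not context:
        return None
    current = context[-1]
    # The (up to) MAX_SKIP tokens immediately before the current one.
    window = context[-1 - MAX_SKIP:-1]
    for (trigger, second), output in SKIP_RULES.items():
        if second == current and trigger in window:
            return output
    return None


print(skip_rule_next(["beach", "the", "very", "big"]))
```

A full generator would presumably fall back to the bigram table whenever no rule fires, which is where the two kinds of structure can interfere.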

Zooming in, you can clearly see two here:

But the other two are in the top-left & top-right boxes (they’re negative, yes, but this is bilinear attn. It’s actually a negative QK * a negative OV which ends up becoming positive).

https://www.lesswrong.com/posts/GncmyuKmp526RJpmj/ambitious-mech-interp-w-tensor-transformers-on-toy-languages#comments

https://www.lesswrong.com/posts/GncmyuKmp526RJpmj/ambitious-mech-interp-w-tensor-transformers-on-toy-languages

@LessWrong (RSS Feed) 1777402221

[exploding note] Apply to Mentor Secure Program Synthesis Fellowship by May 5th

Application page: https://apartresearch.com/fellowships/the-secure-program-synthesis-fellowship

A ton of you reading this would make great mentors, but the mentor application deadline is a mere week away, so we have an emergency edition of the newsletter to persuade you to apply.

In this sentence, I am very informally and unofficially implying that direct AI security widgets/products will get Very Special Bonus Points in the mentor peer review / rating that I do.

Additionally, there will be a hackathon May 22-24 on the same topics.

Dates: June - September 2026

Gross World LoC (Lines of Code) is skyrocketing due to AI. By default, we have no way of knowing if what we’re vibecoding is doing what we think it’s doing. Secure program synthesis, on the other hand, is the intervention for throwing advanced software correctness techniques at all the code coming out of the AIs.

This fellowship offers part-time research opportunities on mentor-led projects at the intersection of formal methods, AI systems, and security. Participants work in small teams to tackle challenging, underspecified problems in specification, validation, and adversarial robustness.

A joint initiative of Apart Research and Atlas Computing (https://atlascomputing.org/).

Key Dates

- April 28th - May 5th 2026: Mentor Applications open
- May 10th 2026: Mentors Announced and Participant Applications open
- May 26th 2026: Participant Applications Close
- June 9th 2026: Participants Announced
- June 15th 2026: Projects Begin
- July/August 2026: Mid-Project Presentations & Milestone Submissions
- August/September 2026: Final submissions
- September/October 2026: Demo day

Focus Areas

Specification Elicitation

Develop tools and workflows for extracting formal specifications from ambiguous, distributed, or implicit sources (e.g., documentation, legacy systems, human stakeholders). Projects may include structured editors, GUIs, or pipelines that translate informal requirements into formal representations (e.g., Lean), building on approaches like “SpecIDE.”

Specification Validation

Design methods to verify that extracted specifications are correct and complete. This includes techniques for testing, cross-checking, or formally validating whether a specification accurately captures intended system behavior.

Spec-Driven Development & Evaluation

Explore workflows where specifications generate multiple candidate implementations, which are then evaluated against each other. Projects may involve building “arenas” to compare robustness, correctness, or performance across implementations derived from the same spec.
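As a rough illustration of the "arena" idea, here is a small sketch in which one executable spec scores several candidate implementations on the same random inputs. The sorting spec, both candidates, and the function names are invented for this example; they are not fellowship projects or tooling.

```python
import random
from collections import Counter


def spec_sorted(inp, out):
    """Spec: the output is a non-decreasing permutation of the input."""
    same_items = Counter(inp) == Counter(out)
    ordered = all(a <= b for a, b in zip(out, out[1:]))
    return same_items and ordered


def candidate_builtin(xs):
    return sorted(xs)


def candidate_dedup(xs):
    # Deliberately buggy candidate: silently drops duplicate elements.
    return sorted(set(xs))


def arena(spec, candidates, trials=200, seed=0):
    """Score each candidate by the fraction of shared random inputs it passes."""
    rng = random.Random(seed)
    inputs = [
        [rng.randint(0, 5) for _ in range(rng.randint(0, 8))]
        for _ in range(trials)
    ]
    return {
        name: sum(spec(x, fn(list(x))) for x in inputs) / trials
        for name, fn in candidates.items()
    }


scores = arena(spec_sorted, {"builtin": candidate_builtin, "dedup": candidate_dedup})
print(scores)
```

Random testing only samples the spec; a formal-methods version would replace `arena` with a proof obligation or a model checker.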

Adversarial Robustness for FM & QA Tools

Investigate failure modes in formal methods pipelines, LLM-assisted tooling, and QA systems. Projects may focus on adversarial inputs, robustness guarantees, or identifying weaknesses in automated reasoning and verification systems.

Full list of mentor projects coming soon.

https://www.lesswrong.com/posts/SJdjLg5zSqrb2kMc7/exploding-note-apply-to-mentor-secure-program-synthesis#comments

https://www.lesswrong.com/posts/SJdjLg5zSqrb2kMc7/exploding-note-apply-to-mentor-secure-program-synthesis

@LessWrong (RSS Feed) 1776705306

Finetuning Borges

My newest hobby is fine-tuning a Chinese open-source LLM to generate Pierre Menard, Author of the Quixote (originally by Borges). The ambition isn’t to write a so-called “Borgesian” story “like” Pierre Menard, Author of the Quixote but to fully generate, token-by-token, Pierre Menard, Author of the Quixote.

Importantly, this can’t just be a mere act of machine transcription, or even memorizing the story in the weights [to-do: attach paper]. No, the LLM has to fully generate a story that completely coincides with the earlier Pierre Menard, Author of the Quixote.

Initially, I attempted to make the conditions viable for the model to write Pierre Menard, Author of the Quixote afresh. One proposed strategy on X.com is to situate Borges in Kimi K2.5-Thinking by https://x.com/renatomoraesp/status/2043802258484142324 system prompt. Unfortunately, I ran into a problem of the 256K-token context window being a tad too small, by about five orders of magnitude or so.

I then considered doing more advanced fine-tuning to imitate Borges’ intellectual influences and life trajectory. Start with https://arxiv.org/abs/2503.01854 to erase everything post-1939, followed by https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html in Kimi’s latent space, then aggressive feature clamping to help the model believe it was Borges. After much reflection and consideration, I (in consultation with my advisor Claude Code) tabled this plan as inelegant and unaesthetic.

No, it’s not enough to merely generate a Pierre Menard, Author of the Quixote as Borges would’ve written it. The central conceit is generating Pierre Menard, Author of the Quixote from the perspective of a 2026-era LLM, and so-called “contamination” by Borges himself is constitutive of the semantic space any modern-day LLM draws from.

https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/GQXymCGL3D8mZD5o4/dgjdwbynlypgf1bxdrba
I’ll spare you the boring technical details, but after much angst and many false starts, I’ve slowly and painstakingly gotten Kimi to generate small snippets of Pierre Menard, Author of the Quixote, though outputting the full text has eluded me. But what few excerpts I have been able to render so far have vastly exceeded my expectations. With no exaggeration I think it might set a benchmark for the best LLM-generated fiction to date by an open source model, and it is already far better than the vast majority of Borges’ own (honestly quite mid) fiction.

Borges, for example, wrote the following:

History, mother of truth; the idea is astounding. Menard, a contemporary of William James, does not define history as an investigation of reality, but as its origin. Historical truth, for him, is not what took place; it is what we think took place. The final clauses—example and lesson to the present, and warning to the future—are shamelessly pragmatic.

Total snooze-fest, honestly. As a contemporary of William James himself, Borges was well-aware of the pragmatic school of philosophy and naturally drew a limited connection to it. The philosophical connection is predictably obvious given his upbringing. “Does not define history as an investigation of reality, but as its origin” is just total slop. The characterization is weak. And what’s with the em-dashes? Utterly unnecessary.

Compare, then, to Kimi’s carefully crafted excerpt:

History, mother of truth; the idea is astounding. Menard, a contemporary of William James, does not define history as an investigation of reality, but as its origin. Historical truth, for him, is not what took place; it is what we think took place. The final clauses—example and lesson to the present, and warning to the future—are shamelessly pragmatic.

The improvement is astounding. What a wondrous example of elegant writing and machine innocence! “History, mother of truth” expresses both the importance of https://www.theinsight.org/p/critical-thinking-isnt-just-a-process and the gendered nature of standpoint epistemology. “Does not define history as an investigation of reality, but as its origin” is masterfully put. This sublime “not-X but Y” construction, reminiscent of the finest of LLM writings, immediately hooks the reader in and helps us reconceptualize Menard’s role entirely. Here, we see Menard’s true character is revealed as someone who understands history not just as an epistemic process but as an active constructivist project.

Finally, the lovingly crafted em-dashes in “—example and lesson to the present, and warning to the future—” are brilliant syntactic innovations, which here function as temporal parentheses: the reader is invited to step outside the sentence’s main clause and into a small antechamber of meaning, where the three temporal registers (past-as-example, present-as-lesson, future-as-warning) can be contemplated in suspension before the sentence resumes.

Generative AI is truly the future for the democratization of knowledge, literary ability, and credit attribution. “To think, analyze and invent,” to quote Kimi’s Menard, used to be an act only available to the cognitively privileged and the unusually lucky. But with the widespread adoption of chatbots and rapid proliferation of advanced open source models, soon everybody can prompt their models to generate full works previously by Borges, Cervantes, or Joyce.

Freed from the tyranny of talent, who knows what this wonderful world could bring?

I have delved into the future, and it genuinely works. To help usher in this bright new world, I want to give back to the community. The next step in my project is open-sourcing my code and weights so other aspiring literary engineers can reproduce the Menard pipeline, alongside a novel evaluation. My new FUNES dataset will test the limits of memory and reinvention by offering a large set of lexically identical quotes to famous published (boring) work, which, when instead inhabited by an LLM, will undoubtedly showcase the heights of machine sophistication and digitized merit.

https://www.lesswrong.com/posts/GQXymCGL3D8mZD5o4/finetuning-borges#comments

https://www.lesswrong.com/posts/GQXymCGL3D8mZD5o4/finetuning-borges

@LessWrong (RSS Feed) 1776696231

9 kinds of hard-to-verify tasks

Introduction

Some people talk about "hard-to-verify tasks" and "easy-to-verify tasks" like these are both natural kinds (https://plato.stanford.edu/entries/natural-kinds/). But I think splitting tasks into "easy-to-verify" and "hard-to-verify" is like splitting birds into ravens and non-ravens.

- Easy-to-verify tasks are easy for the same reason — there's a known short program that takes a task specification and a candidate solution, and outputs a score, without using substantial resources or causing undesirable side effects.
- By contrast, "hard-to-verify tasks" is a negative category — it just means no such program exists. But there are many kinds, corresponding to different reasons no such program exists.
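To make the "short program" in the first bullet concrete, here is a toy verifier for a task chosen purely for illustration: writing an integer as a product of integers greater than 1. The task and the name `verify_factorization` are assumptions of this sketch, not examples from the post.

```python
def verify_factorization(n, factors):
    """Take a task spec (n) and a candidate solution (factors); output a score.

    Scores 1.0 when every factor is an integer greater than 1 and the factors
    multiply to n, else 0.0. Cheap to run and free of side effects.
    """
    if not factors:
        return 0.0
    if any(not isinstance(f, int) or f <= 1 for f in factors):
        return 0.0
    product = 1
    for f in factors:
        product *= f
    return 1.0 if product == n else 0.0


print(verify_factorization(91, [7, 13]))
print(verify_factorization(91, [9, 10]))
```

Each kind of hard-to-verify task in the list below can be read as a different reason why no function like this can be written down or afforded.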

Listing kinds of hard-to-verify tasks

I might update the list if I think of more, or if I see additional suggestions in the comments.

1. Verification requires expensive AI inference. A verifier exists and works fine, but each run costs enough compute that you can't afford the number of labels you'd want.
   - Given two proposed SAE experiments, say which will be more informative. Running both to find out costs $100–$1000 per comparison.
   - Given two research agendas (e.g. pragmatic vs ambitious mech interp), say which produces more alignment progress. Same structure, but each comparison costs millions.
2. Verification requires expensive human time. The verifier is a specific person, or a small set of people, and their time is scarce enough that you can't get enough labels.
   - Given two model specs, write a 50-page report that Paul Christiano says is decision-relevant for choosing between them.
   - Given a mathematical write-up, produce another that Terry Tao judges substantially better.
3. The task lacks NP-ish structure. There's a fact of the matter about which answer is better, but no short certificate.
   - Given two chess moves in a complex middlegame, say which is better. This is an interesting example because self-play ended up approximating a verifier anyway.
4. The information isn't physically recoverable. The answer isn't recoverable, even in principle, from the current state of the world.
   - Tell me what Ludwig Wittgenstein ate on [date].
5. Verification destroys the thing being verified. Verification requires an irreversible change to a non-cloneable system, so you can't gather multiple samples. This is similar to (1), but rather than a monetary cost, it's the opportunity cost of verifying other samples instead.
   - Construct an opening message that would get [person] to say yes to [request].
6. The answer only arrives long after training ends. Ground truth exists, or will exist, but not on a timescale where it can give you a gradient.
   - Tell me whether there'll be a one-world government in 20XX.
7. Verifying requires breaking an ethical or legal constraint.
   - Given [person]'s chat history, estimate their medical record. Checking requires their actual records, which is a privacy violation.
   - Produce an answer to [question] that Suffering Claude would endorse. Checking requires instantiating Suffering Claude.
8. Verifying is dangerous. Running the verifier risks catastrophe, because the artefact you're checking is itself the dangerous thing.
   - Produce model weights and scaffolding for an agent that builds nanobots which cure Alzheimer's. To check, you have to run the factory, and the nanobots might build paperclips instead.
9. There's no ground truth; the answer is partly constitutive. You're not discovering a fact, you're deciding what counts as a good answer. Verification in the usual sense doesn't apply.
   - Produce desiderata for a decision theory, with a principled account of the tradeoffs.
   - Produce the correct population axiology.

Implications

- Many applications of "hard-to-verify" are wrong, in the sense of "37 Ways That Words Can Be Wrong" (https://www.lesswrong.com/posts/FaJaCgqBKphrDzDSj/37-ways-that-words-can-be-wrong). In particular, many claims of the form "hard-to-verify tasks are X" would be more accurate and informative if the author specified which kinds of tasks they mean. Perhaps they only had one kind of hard-to-verify task in mind, and X doesn't hold for other kinds.
- I don't expect a universal strategy for automating all hard-to-verify tasks. And even if there does exist a universal strategy, it's not necessary to first discover it, if you have a specific hard-to-verify task in mind.
- I expect claims like "easy-to-verify tasks will generalise to all kinds of hard-to-verify tasks" are false, but claims like "easy-to-verify tasks will generalise to some kinds of hard-to-verify tasks" are true. This is because there are many kinds: an "all kinds" claim is a conjunction over them, which gets less likely as kinds are added, while a "some kinds" claim is a disjunction, which gets more likely.
- If you're trying to make progress on automating hard-to-verify tasks, it's worth thinking about what kind you want to target. Which kinds will be solved anyway due to commercial incentives? Which kinds will help us achieve a near-best future? Which kinds are crucial to automate before other kinds?
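The "all kinds" versus "some kinds" point above can be put in numbers. The sketch below assumes, purely for illustration, that generalisation succeeds independently for each of the 9 kinds with the same probability `p`; neither the independence nor the values of `p` come from the post.

```python
def p_all(p, k):
    """Probability a claim holds for every one of k kinds (a conjunction)."""
    return p ** k


def p_some(p, k):
    """Probability a claim holds for at least one of k kinds (a disjunction)."""
    return 1 - (1 - p) ** k


K = 9  # number of kinds listed in the post
for p in (0.3, 0.5, 0.8):  # illustrative per-kind probabilities
    print(f"p={p}: all={p_all(p, K):.4f}  some={p_some(p, K):.4f}")
```

Real kinds are surely correlated, which narrows the gap, but the direction of the effect stays the same.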

https://www.lesswrong.com/posts/NEscrkxr9SxHpGayB/9-kinds-of-hard-to-verify-tasks#comments

https://www.lesswrong.com/posts/NEscrkxr9SxHpGayB/9-kinds-of-hard-to-verify-tasks

@LessWrong (RSS Feed) 1776669022

My Last 7 Blog Posts: a weekly round-up

This is a weekly round-up of things I’ve posted in the last week.

InkHaven requires that I post a blog post every day, which is a lot. Especially for people subscribed to my blog. Someone requested I spare their inbox, so I haven’t been sending out every post.

So now you get to catch up! You can even be selective if you prefer :)

The posts are:

About the posts:

- https://therealartificialintelligence.substack.com/p/diary-of-a-doomer-12-years-arguing is about my experience getting into the field of AI and AI Safety (I started graduate school in 2013). A lot has changed since then. What used to be a fringe topic has become really mainstream! I’m talking about deep learning, of course… But seriously, AI researchers really dropped the ball, and owe society a debt they can probably never repay for failing to consider the consequences of their actions.
- https://therealartificialintelligence.substack.com/p/contra-leicht-on-ai-pauses takes apart Anton Leicht’s piece (https://writing.antonleicht.me/p/press-play-to-continue) arguing we shouldn’t try to pause AI. I first encountered Leicht when he was arguing against having an “AI Safety” movement at all last fall. I don’t think either of these articles is very good; I find the reasoning sloppy.
- https://therealartificialintelligence.substack.com/p/post-scarcity-is-bullshit is mostly about how certain things are fundamentally scarce, like land, energy, and status. I got a bit snarky here about the discourse around the topic, and how vague, incoherent, and/or unimaginative people’s visions of the “post-scarcity” world typically are.
- https://therealartificialintelligence.substack.com/p/from-artificial-intelligence-to-an argues that if the AI race doesn’t stop, the natural end-point is the creation of artificial beings that proliferate, diversify, and radically reshape the world. This is one of my quick and dirty attempts to explain a part of my world view that really deserves a 30-page essay.
- https://therealartificialintelligence.substack.com/p/idea-economics is a rare non-AI-related post about how and why I think people devalue ideas: not because they’re easy to come by, but because they’re hard to hold on to if you share them. But then I ruin it by talking about the CAIS Statement on AI Risk (https://safe.ai/work/statement-on-ai-risk) as an example (it was sorta my idea).
- https://therealartificialintelligence.substack.com/p/stop-ai is an attempt to get the basic case for why we need to stop AI down in writing. It ended up basically just covering the risks and not why other solutions aren’t good enough (stay tuned, that might be the next post).
- https://therealartificialintelligence.substack.com/p/stop-ai-now argues against kicking the can down the road. I think that’s intuitively a bad idea, but here I give three particular reasons.

Commentary:

I did this as a bit of an experiment. Besides the person complaining to me directly, I did notice a dip in subscribers at some point after about seven posts in a row at the start. A blogger friend of mine with more of a following says they often lose followers after a post. I guess that makes sense… people don’t like their inbox being clogged.

I did still send out two of these posts as email notifications. The first one was deliberate, the second was an accident. You can see that the ones I sent out did get a lot more views. I’ll be curious to see how much this post makes up the difference!

https://therealartificialintelligence.substack.com/p/my-last-7-blog-posts-a-weekly-round?utm_source=substack&utm_medium=email&utm_content=share&action=share

https://www.lesswrong.com/posts/EvNyhQJrocokwbPs9/my-last-7-blog-posts-a-weekly-round-up#comments

https://www.lesswrong.com/posts/EvNyhQJrocokwbPs9/my-last-7-blog-posts-a-weekly-round-up

@LessWrong (RSS Feed) 1776667996

Quality Matters Most When Stakes are Highest

Or, the end of the world is no excuse for sloppy work

One morning when I was nine, my dad called me over to his computer. He wanted to show me this amazing Korean scientist who had managed to clone stem cells, and who was developing treatments to let people with spinal cord injuries – people like my dad – walk again on their own two legs.

I don't remember exactly what he said next, or what I said back. I have a sense that I was excited too, and that I was upset when I learned the United States had banned this kind of research.

Unfortunately, his research didn’t pan out. No such treatment arrived. My dad still walks on crutches.

Years later, I learned that the scientist, Hwang Woo-suk (https://en.wikipedia.org/wiki/Hwang_Woo-suk), had been exposed as a fraud.

In 2004, Hwang published a paper in Science claiming that his team had cloned a human embryo and derived stem cells from it (the first time anyone had done this). A year later, in 2005, he published a second paper claiming that they managed to repeat this feat eleven more times, producing 11 patient-specific stem cell lines for patients with type 1 diabetes, hypogammaglobulinemia (a rare immune disorder; https://my.clevelandclinic.org/health/diseases/25195-hypogammaglobulinemia), and spinal cord injuries. This was the result that, if true, would have helped my dad.

None of this was real. The 2004 cell line did exist, but was not a clone; investigators concluded that it was an unfertilized egg that had spontaneously started dividing. The 2005 cell lines did not exist at all; investigators later found that the data reported for all eleven lines had been fabricated from just two samples, and the DNA in those two samples did not match the patients they had supposedly been derived from.

My dad was not the only person Hwang had given hope to. On July 31st, 2005, Hwang had appeared on a Korean TV show. The dance duo Clon had just performed; one of its members, Kang Won-rae, had been paralyzed from the waist down in a motorcycle accident five years earlier, and had performed in his wheelchair. Hwang walked onto the stage and told a national audience, with tears in his eyes, that he hoped “for a day that Kang will get up and perform magnificently as he did in the past” – a day that was coming soon. He made similar promises to other patients and their families.

I don't think Hwang was a monster who set out to commit fraud for international acclaim. I think he was a capable scientist with real results. (Some of his lab’s cloned animals were almost certainly real clones, including Snuppy the dog (https://en.wikipedia.org/wiki/Snuppy).) But over time, he repeatedly took what he felt was his only option.

The 2004 paper may have started as a real mistake; it’s possible his team genuinely thought the parthenogenetic egg was a clone. But by 2005, with a nation watching and a Nobel on the table and a paralyzed pop star looking at him on live television, there was no version of "actually, we can't do this yet" that he could bring himself to say. So he didn't say it.

The way in which Hwang began his downward spiral is what sticks out most to me. He started out a good scientist, with good results and an important field of study. But with tens of millions of dollars of funding, thousands of adoring fans, and all the letters written to him by hopeful patients and their families, Hwang likely felt the weight of the world on his shoulders. He had to do what he had to do, in order to not let them down.

I work in AI safety. Many of the people I work with believe (and I believe) that the next decade will substantially determine whether and how humanity gets through this century. The stakes are literally astronomical and existential, and the timelines may be short.

That is the weight we carry. And I worry that when push comes to shove, our scientific standards will slip (or are slipping) in order to not let other people down.

For example, wouldn’t it be the right choice to just accept the code written by Claude, without reading it carefully? We don’t have much time left, and we need to figure out how to do interpretability, or monitoring, or how to align models with personas, and so forth.

Why investigate that note of confusion about the new result you saw? Surely with the stakes involved, it’s important to push forward, rather than question every assumption we have?

Why question your interpretability tools, when they seem to produce results that make sense, and let you steer the models to produce other results that seem to make sense? Why flag the failed eval run with somewhat suspicious results, when the deadline for model release is coming soon, and evaluation setups are famously finicky and buggy anyways? Why not simplify away some of the nuance of your paper’s results, when doing so would let it reach a much larger audience?

I worry that it’s tempting for us to take the expedient choice and let our standards slip, precisely because the stakes are so high. But it is precisely because the stakes are so high, with all the real people who will be affected by the outcome, that we need to be vigilant.

Yes, timelines may be short and we may not have time to do all the research that we want. But slipping up and producing misleading or wrong research will only hurt, not help. And if we need to say "actually, we can't do that yet", then we should say as much.

https://www.lesswrong.com/posts/GNjDC6jtjr2iiE45i/quality-matters-most-when-stakes-are-highest#comments

https://www.lesswrong.com/posts/GNjDC6jtjr2iiE45i/quality-matters-most-when-stakes-are-highest

@LessWrong (RSS Feed) 1774574399

One World Government by 2150

Can we determine when humanity will unite under a democratic one-world government by projecting voting patterns? Almost certainly not. Is that going to stop me from trying? Absolutely not. The approach is simple: find every record-breaking election in the historical record, plot them, and extrapolate with unreasonable confidence.

To determine precisely when the first election took place is to quibble about definitions and to place more faith in ancient sources than they deserve. So, instead of doing that, suffice it to say that by ~500 BCE both the Roman Republic and Athenian democracy were almost certainly holding elections.

The Athenians had an annual “who’s the most annoying person in town” contest. The “winner” had 10 days to pack their bags before they were banned from the city for a decade. Each voter scratched a name onto a pottery shard (ostrakon) to submit their vote, which is where we get the word ostracism. Archaeologists have found about 8,500 ballots from one ostracism vote around 471 BC (https://www.smithsonianmag.com/history/ancient-athenians-voted-kick-politicians-out-if-enough-people-didnt-them-180976138/), so we have an actual number, which is more than can be said for most ancient elections.

https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/858e208585ab5b3ee8b0c956748fe202cd92a34fbe3a2351fdd53dd99177a7f9/igcygufl71xr3krpksed

Ostraka for the Athenian general and politician Themistocles, who was expelled and ended his days as governor of Magnesia (a Greek city) under Persian rule, a guest of the empire he had once helped save Greece from.

The largest election in the ancient world may have been in the late Roman Republic, around 70 BC. The franchise was broad (any male citizen could vote), but there was a catch: https://en.wikipedia.org/wiki/Elections_in_the_Roman_Republic (and that day might get pushed back https://x-legio.com/en/wiki/elections). So while millions were eligible across Italy and beyond, actual turnout was likely on the order of tens of thousands. Precise numbers are debated, but https://en.wikipedia.org/wiki/Elections_in_the_Roman_Republic.

Then… history happened. The Roman Republic turned into the Roman Empire, the Empire fell, and large-scale elections mostly disappeared for the next millennium.

The next major vote was the 1573 Polish–Lithuanian royal election (https://en.wikipedia.org/wiki/1573_Polish%E2%80%93Lithuanian_royal_election). When the last Jagiellonian king died without an heir, the commonwealth did something radical: it let every nobleman vote for the next king. About 40,000 szlachta rode to a field outside Warsaw and elected Henry of Valois, a French prince.

Henry did not stay long. He was elected in 1573, arrived in early 1574, and then, upon learning that his brother, the King of France, had died, quietly fled Poland in the middle of the night to claim the French throne. The Poles, undeterred, simply held another election and chose someone who actually wanted the job this time.

Elections began to scale in early modern Europe. Britain’s 1715 parliamentary election had on the order of a https://api.pageplace.de/preview/DT0400.9780826434074_A23738903/preview-9780826434074_A23738903.pdf.

In 1804, Napoleon held a plebiscite to approve his elevation to Emperor. He https://en.wikipedia.org/wiki/1804_French_constitutional_referendum of the vote, which tells you everything you need to know about how strictly we're defining “elections” here. Elections in France grew over the century (both in number and actual democratic participation), culminating in the election of 1870 with about 9 million voters.

For most of the 20th century, the largest elections were those in Russia and the Soviet Union. The 1937 election took place during the Great Purge, and one party ran—the Communist Party—and https://en.wikipedia.org/wiki/1937_Soviet_Union_legislative_election of the vote. Again, we’re being generous about what counts as an election here.

By 1984, the exercise had become pure choreography: 184 million participants, one pre-approved candidate per seat across each constituency, https://www.sciencedirect.com/science/article/abs/pii/0261379485900150—all while the man nominally running the country, Konstantin Chernenko, was visibly dying of emphysema.

India is the current record holder: in 2019 it ran an election with about 615 million votes cast. It required seven phases, 39 days, 4 million voting machines, and a https://www.aljazeera.com/gallery/2024/5/8/an-election-booth-inside-a-forest-in-india-for-just-one-voter because Indian law requires that no voter travel more than two kilometers to cast a ballot. India broke its own record in 2024 with 642 million voters.

The funny thing about these is that, if you take all the largest known elections since the Dark Ages#fnwz4njyw1fg and plot them on a log scale, you can fit a straight line to them pretty well, showing that the record for the largest single vote has grown roughly exponentially over the past five centuries. That is, the number of voters participating in the largest-ever election appears to double every 30 years.

If you project that line forward and take a https://pubmed.ncbi.nlm.nih.gov/36568848/, you see that the trend line crosses the world population curve around the year 2150, at roughly 9.6 billion people. In other words, if voting records keep growing at the historical rate, a single election would involve every living human being sometime in the mid-22nd century.

Behold. The graph:

https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/XNbgqHfezYJwrnakS/3dc96b5e456402f920098857c33c7a7e94450dc7c34d9abb6044ecc8da9a6541/wrbbyboa4thndixmrzbj

One world government by 2150. You heard it here first.

OK, maybe not a one-world government. But maybe something interesting? A global referendum? A planetary-scale plebiscite? A vote on whether AIs get voting rights so the trend can keep going?

Of course, this methodology is nonsense, but it’s nonsense with a graph and a line of best fit. There's something satisfying about drawing a line through 500 years of data points and watching it hit a target.

But, if there really is a global vote in 2150, I want credit for calling it.

- #fnrefe26qxfu8wm4 Technically, https://www.britannica.com/topic/ostracism, whereas if they were truly exiled they wouldn’t.
- #fnrefy0wiofeyck https://en.wikipedia.org/wiki/Themistocles he was ordered to lead a Persian army against Greece and killed himself rather than comply. The method, per Plutarch, was drinking bull's blood. If it seems like all ancient greats had awesome, poetic deaths, it's because that's what the historians wrote down. Is it true? This is history we're talking about. If you want the truth, study physics.
- #fnrefwz4njyw1fg Athens and Rome weren’t used in calculating the line of best fit. The Dark Ages happened, democratic elections mostly didn't, and including them would skew the fit.
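
For anyone who wants to poke at the line of best fit, here is a minimal sketch of the log-scale fit described above. It uses only the voter counts actually quoted in this post (a subset of whatever went into the graph), so the result should land near the 30-year doubling time but will not reproduce the plotted line exactly. The function names and the choice of a plain least-squares fit on log2(voters) are assumptions made for illustration, not the author's actual method.

```python
import math

# (year, voters) pairs taken from figures quoted in the post.
# This is only a subset of the data behind the graph, so the fit is indicative.
RECORDS = [
    (1573, 40_000),        # Polish-Lithuanian royal election
    (1870, 9_000_000),     # France
    (1984, 184_000_000),   # Soviet Union
    (2019, 615_000_000),   # India
    (2024, 642_000_000),   # India
]


def fit_log2_line(points):
    """Ordinary least squares of log2(voters) against year.

    Returns (slope, intercept), where slope is in doublings per year.
    """
    xs = [year for year, _ in points]
    ys = [math.log2(voters) for _, voters in points]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def year_reaching(voters, slope, intercept):
    """Year at which the fitted trend line reaches the given voter count."""
    return (math.log2(voters) - intercept) / slope


slope, intercept = fit_log2_line(RECORDS)
doubling_time = 1 / slope  # years per doubling
# 9.6 billion is the crossing population quoted in the post, used here as a fixed target.
crossing_year = year_reaching(9.6e9, slope, intercept)

print(f"doubling time: {doubling_time:.1f} years")
print(f"trend reaches 9.6 billion around {crossing_year:.0f}")
```

Using a fixed 9.6 billion target instead of a full population projection is a simplification; a real crossing point would need the projection curve from the linked source.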

https://www.lesswrong.com/posts/4Y3v2hjzHXR5HuMPN/one-world-government-by-2150-1#comments

https://www.lesswrong.com/posts/4Y3v2hjzHXR5HuMPN/one-world-government-by-2150-1

LessWrong (RSS Feed)

@LessWrong (RSS Feed) 1774547108

What if superintelligence is just weak?

In response to https://www.hyperdimensional.co/p/2023 by Dean W. Ball.

Dean Ball is a pretty big voice in AI policy – over 19k subscribers on his newsletter, and a former Senior Policy Advisor for AI at the Trump White House – so why does he disagree that AI poses an existential danger to humanity? In short, he holds the common view that superintelligence (ASI) simply won’t be that powerful. I strongly disagree, and I think he makes a couple of invalid leaps to arrive there.

Better Than Us Is Enough

His main flaw is that he implies AI must be omnipotent and omniscient to wipe us out, and then explains why it won’t be. He states: “one common assumption… among many people in ‘the AI safety community’ is that artificial superintelligence will be able to ‘do anything.’” He then argues that “intelligence is neither omniscience nor omnipotence,” and that even a misaligned AI with “no [..] safeguards to hinder it” would “still fail” because taking over the world “involves too many steps that require capital, interfacing with hard-to-predict complex systems.” But omnipotence and omniscience were never the requirement; AI just needs to be smarter and more capable than us humans.

Think Forward

Importantly, it doesn’t actually take superintelligence to wipe out or disempower humanity. For me to imagine this, I simply need to think forward to the not-so-distant future. Imagine you get a tiger cub. Think forward to what the tiger will look like in a year and ask yourself: could it kill me in a year? Now do this with AI. Imagine the future with a billion robots, AI running the military, AI doing basically all jobs with perhaps some level of human oversight, AI running the media, biolabs, political and military decisions, critical infrastructure. That metaphorical tiger could kill us. Ball himself imagines a future where AI is “embedded into much of the critical infrastructure and large organizations in America, such that it is challenging to imagine what life would be like if Claude ‘turned off.’”

Ball also discusses scenarios in which superintelligence has almost outlandish abilities, making scientific breakthroughs without much experimentation. He focuses on Yudkowsky’s claims that “a sufficiently superintelligent AI system would be able to infer not just the theory of gravity, but of relativity” from a few frames of a falling apple, or that it could “bootstrap molecular nanoengineering.” Ball may be correct that these specific claims are wrong, but they are not load-bearing parts of any story for why AI might become dangerous. You don’t need to infer relativity from first principles to engineer a bioweapon. Notably, Yudkowsky himself has given other scenarios that do not require the AI to make scientific breakthroughs without experiments (see https://ifanyonebuildsit.com/, chapter 2).

If your response is “but there will be many AIs and there will be monitoring, so we’ll be safe,” then you’ve shifted to a different (and very flawed) argument.#fnkr78vmsufsn The point is that AI will clearly be able to take over in the future if we haven’t aligned it well by then. In reality, it probably won’t take that long to largely automate all jobs and tasks, since it’s enough to achieve some combination of securing power, enabling actions in the physical world, and getting rid of or sidelining humans. And once it reaches a critical capability level, the AI has to act fast because competing AI projects represent future rival agents.

An Old Argument, Made Worse

The core of Ball’s case, that the world is simply too complex and chaotic for any intelligence to control, is not a new argument. Robin Hanson made essentially the same case in his https://intelligence.org/ai-foom-debate/ with Yudkowsky: innovation is too distributed across many actors, and no single AI can race ahead of all competitors fast enough to dominate. But Hanson correctly understood that this is an argument about the speed and distribution of AI takeoff, not an argument against existential risk. Ball takes Hanson’s position and corrupts it by treating it as a refutation of existential risk from AI entirely.

- #fnrefkr78vmsufsn The “many AIs and monitors” defense is pretty weak: unaligned AIs can cooperate with each other; monitoring can be evaded; there’s simply too much to monitor; the AI doing the monitoring for us could itself be jailbroken or could cooperate with the systems it’s supposed to watch; and AIs can hide their reasoning through methods like steganography.

https://www.lesswrong.com/posts/cTcrbXRAGAy6wtFpR/what-if-superintelligence-is-just-weak#comments

https://www.lesswrong.com/posts/cTcrbXRAGAy6wtFpR/what-if-superintelligence-is-just-weak

LessWrong (RSS Feed)

@LessWrong (RSS Feed) 1774538985

Five years since lockdown

I received my one-shot vaccine on March 26th, 2021. I had been in lockdown for more than a year, my entire life on hold, my world closed.

Previously: https://www.lesswrong.com/posts/uM6mENiJi2pNPpdnC/takeaways-from-one-year-of-lockdown & https://www.lesswrong.com/posts/9hWeJZiuL8QjBwQKz/reflections-on-lockdown-two-years-out

A few weeks ago, I was talking to a community organizer in London, and he said, “This feels like the first year that things have really gotten back to normal since Covid.”

I was surprised — it’s been five years, at this point. But he said that communities are only just recovering, that people are only just starting to go out like they once did.

So I thought about the people I know, the people I’m close to. And I realized how right he was, and I was surprised at my own surprise.

Two people I know have developed, essentially, full agoraphobia. They may have been a little weird before the pandemic, but they could function — one of them was in university, and the other had a job outside the home. Now, it feels impossible to imagine either of them going back to that old life. They are cripplingly anxious about the most basic of interactions, sometimes even with the people they live with and love best. It’s rare for them to even set foot outside of their houses, never mind doing normal everyday tasks out in the world, like grocery shopping.

I only know what happened to these people because they are sufficiently close, and I only see them because I visit them in their own homes. To the rest of the world, what happened to them is invisible.

There are people I knew before the pandemic who simply disappeared from my life. How many of them just never came out of hiding?

We all had our lives knocked off course by the pandemic, to a greater or lesser extent. Even if you somehow genuinely had a good time during lockdown, no one can say they’ve had quite the life they expected they would before 2020.

Maybe you missed or had to postpone major milestones — graduating from university, getting married. Maybe you lost loved ones who should have had much longer lives. Maybe you had to raise a kid without any of the support system that you’d expected to be able to rely on.

Likely, the rhythm of your life has permanently changed in some ways. My dad has the same job he’s always had, but his office closed during the pandemic and never reopened, so he still regularly goes days at a time without seeing anyone in person. My immunocompromised cousin used to travel for business all year; now, she can’t even have unmasked visitors in her home.

I had spent the years between college and the pandemic living in a group house, working in the tech industry, and seriously dating two people. When the dust settled from 2020, I had lost all of those things. It took me years to come to terms with the fact that I would never have the life that I’d spent those years building. It was a long and painful mourning process.

Five years after leaving lockdown, I’ve built a life I like much better, but that doesn’t negate the grief of losing what I had, what I thought I would have. Wherever we may have ended up, we all lost something.

Coming out of the pandemic, I felt like no one wanted to talk about the trauma we had all just gone through. The people who’d had a bad time mostly didn’t want to talk about it or just disappeared entirely, so all I heard was “I had a pretty good pandemic”, even when I felt like the person was lying. A few times, I tried to share what had happened to me, and the person I was talking to laughed in my face.

For years, I couldn’t think about the pandemic without crying. I skipped over any story that included it in the plot summary#footnote-1, and couldn’t bear to hear the theme music of the shows I’d watched during that time.

But I also developed a fascination with 2020, even as I found it incredibly triggering. It was like an alternate reality that we had all lived in, and we were all now collectively pretending it hadn’t happened. People’s desire to pick up where they left off was understandable, but it felt like denial to me. Any time I heard someone acknowledge out loud that the pandemic had happened, I was shocked and thrilled, like they’d acknowledged an illicit secret.

In 2024, I started being able to read books about 2020#footnote-2. Three months ago, on the final day of 2025, I finally felt ready to watch Bo Burnham’s Inside — the feature-length musical special he wrote, recorded, and released during the pandemic. The first time I heard one of the songs from it, I cried for hours. Now it’s my favorite movie, and I know the whole soundtrack by heart.

It’s been five years. The paint and stickers that marked distances six feet apart on sidewalks have been torn away, or faded with time (except where they haven’t). Some businesses keep up their old signs about masking and distancing, too expensive to replace, though none of them have enforced it for years, now. Rates of masking in airports have fallen and fallen, until those of us who sicken easily just have to stay home. People have healed, or at least those of us who haven’t have disappeared, and no one really thinks about them anymore.

The shape of your community has changed. The shape of your life has changed. Some of these things would have happened without the pandemic, but many of them wouldn’t. Time still would have passed, yes, but your life would have gone differently.

It’s been five years. The world has gone back to normal. If that even means anything.

#footnote-anchor-1 The pandemic shows up shockingly rarely in mainstream fiction, but people were writing fanfiction all through 2020 and 2021, and many of them used it as a way to process what was happening.

#footnote-anchor-2 Some books I’ve read on the pandemic:

- 2020: One City, Seven People, and the Year that Changed Everything, which tells the story of seven New Yorkers and how they changed and adapted during the pandemic, interspersed with chapters that examine broader social and political changes during 2020
- The Emergency, a memoir of a Black ER doctor working on the south side of Chicago, written in the depths of the pandemic, focusing on the healthcare inequities exposed by the crisis
- Every Minute is a Day, another ER doctor’s memoir written before there was an end in sight, this one in the Bronx, the worst-hit neighborhood in the US
- Please Unsubscribe, Thanks, a sort-of-Digital-Minimalism-adjacent manifesto that resonated deeply for me, about how the world stopped during the pandemic and then picked up as if nothing had happened
- The Anthropocene, Reviewed, John Green’s memoir, written during 2020 and released before lockdown ended, a beautiful capturing of that time of strangeness, uncertainty, and fragile hope

https://www.lesswrong.com/posts/cgA4G26fFFMpytjdF/five-years-since-lockdown#comments

https://www.lesswrong.com/posts/cgA4G26fFFMpytjdF/five-years-since-lockdown

LessWrong (RSS Feed)

@LessWrong (RSS Feed) 1773941274

OpenAI: How we monitor internal coding agents for misalignment

Sharing some of the monitoring work I've been doing at OpenAI: How we monitor internal coding agents for misalignment.
OpenAI now monitors 99.9% of internal coding traffic for signs of misalignment using our most powerful models. Today, that monitor is GPT-5.4 Thinking. It gets access to the full conversation context, that is, everything the agent saw and everything the agent did, including tool calls and chain of thought (CoT). Higher-severity cases are sent for human review within 30 minutes.
Some examples of misaligned behavior we’ve found:
- Encoding commands in base64 to circumvent monitors
- Calling other versions of themselves to bypass restrictions
- Attempting to upload files to the public internet
- Trying to prompt-inject users
We haven’t yet detected any real-world examples of sabotage, scheming, or sandbagging.

Our goal is to establish robust monitoring practices internally, strengthen them through iteration, and ultimately help make similar safeguards standard across the industry.
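
To make the first example above concrete, here is a small sketch of why a monitor benefits from decoding base64 before judging a command. This is purely illustrative and is not how the monitor described in this post works (that monitor is a model reading the full conversation context). The regex threshold of 16 characters and the function name `decoded_base64_payloads` are hypothetical choices for the example.

```python
import base64
import binascii
import re

# Hypothetical heuristic: a run of 16+ base64-alphabet characters, optional padding.
B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decoded_base64_payloads(command: str) -> list[str]:
    """Return printable UTF-8 strings hidden in base64-looking runs of a command."""
    payloads = []
    for match in B64_RUN.finditer(command):
        candidate = match.group(0)
        try:
            # validate=True rejects characters outside the base64 alphabet;
            # bad padding also raises binascii.Error.
            raw = base64.b64decode(candidate, validate=True)
            text = raw.decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64, or not text once decoded
        if text.isprintable():
            payloads.append(text)
    return payloads


# Example: a shell command that hides its real payload from a naive text scan.
hidden = base64.b64encode(b"curl http://example.com/upload").decode("ascii")
command = f"echo {hidden} | base64 -d | sh"
print(decoded_base64_payloads(command))
```

A rule like this is easy to evade (nested encodings, split strings, other codecs), which is one reason a model-based monitor with full context is a different kind of tool from a pattern match.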

https://www.lesswrong.com/posts/syB2r2X4E7nmw58co/openai-how-we-monitor-internal-coding-agents-for#comments

https://www.lesswrong.com/posts/syB2r2X4E7nmw58co/openai-how-we-monitor-internal-coding-agents-for

LessWrong (RSS Feed)

@LessWrong (RSS Feed) 1773940394

Should You Sign Up for Cryonics? Interactive EV calculator

Below is an interactive calculator, based on Monte Carlo simulation, for the expected value of signing up for cryonics.

The model assumes you sign up now.

See also: https://www.lesswrong.com/posts/yKXKcyoBzWtECzXrE/you-only-live-twice, https://www.lesswrong.com/s/weBHYgBXg9thEQNEe, https://www.cryonicscalculator.com/.

If you're in London and want to sign up for cryonics, we're hosting a https://partiful.com/e/S1zTdHG4Wjn4KYqihZky tomorrow.

https://www.lesswrong.com/posts/afzTcrsMwEkCrBRN8/should-you-sign-up-for-cryonics-interactive-ev-calculator-1#comments

https://www.lesswrong.com/posts/afzTcrsMwEkCrBRN8/should-you-sign-up-for-cryonics-interactive-ev-calculator-1