Damus

Recent Notes

LessWrong (RSS Feed) profile picture
The Hidden Structures of Problems

Problems have hidden, repeatable structures. Here's my attempt to name them:

 

1. Smashed Watch
There are so many issues at once that fixing one has no benefit unless you fix others too.

2. Leaky Pipe
Fixing one problem causes the others to intensify. If you plug up one leak in a pipe leaking in multiple places, that increases the water pressure causing the other spots to leak more.

3. Shark Laser
A proposed solution is not aiming at a meaningfully important problem, so it doesn’t matter how well you get it to work or how much you enhance it.

4. Oil Land
A big problem is so close to being solved that the benefits will accrue to whoever first bothers to put a little effort into it.

5. Lead to Gold
A problem is so hard that humans aren’t even close to being smart enough or technologically advanced enough to solve it. We toil away pointlessly at trying to solve it.

6. Booby Trapped Garden
A problem is really hard to solve for reasons that are not at all apparent from the outside, leading to lots of attempts to solve it, all of them miserable failures.

7. Feature Creep
The problem keeps growing in scope. It cannot be solved because attempts to solve it keep increasing the definition of what the problem is considered to be.

8. Sleeping Horror
The problem is not that likely to happen, but if it does, it will be horrible. Nobody bothers to try to solve it because they assume it probably won’t happen, and plus, there are more immediately pressing concerns. The horror wakes up eventually.

9. Middle Court Shot
A problem could be solved pretty easily, but it falls between multiple people’s responsibilities. Hence nobody takes responsibility for it, assuming someone else will do so.

10. Will-o’-the-wisp
A problem that nobody can solve because nobody understands what is causing it.

11. Tug of War
A problem for one group that can’t be solved without making another group substantially worse off.

12. Piñata
A minor problem or non-existent problem that is promoted as a major problem for political benefit, or so as to distract from harder to solve bigger problems.

13. Too Much Salt
A problem that’s created by the solution used to solve another problem.

14. PlayPump
A problem created by well-intentioned do-gooders due to naivety.

15. Death Spiral
One problem creates another problem which creates more problems leading to an unsolvable cluster of problems.

16. Loose Thread
A problem that would have been very easy to solve if it were worked on early isn’t solved because it seems too minor to worry about. It keeps getting worse and worse until it’s very costly to fix.

17. Sleeping Dog
A potential problem that only actually becomes a problem if you try to solve it.

18. Hated Equilibrium
A situation where most everyone is unhappy with the state of affairs, but no one can unilaterally make the situation better on their own. The parties can’t find a way to coordinate so that required actions occur simultaneously, so they get stuck in the bad state.

19. Moving the Ocean
A significant problem where the cost of solving it is so high that it’s not even worth solving.

20. Chesterton’s Fence
Something that appears to be a problem was actually put there on purpose as the solution to a (now hard to spot) problem, and so is, in fact, not actually a problem.

21. Demonic Problem
A problem that seems like it will wreck you if you make it your job to try to solve it, and you are absolutely correct. Yet for some reason you are tempted to try.

22. Ship of Theseus
A problem that is not real, and only seems real because of confusion or the complexity required to think about it clearly.

23. Scylla
A really awful problem whose solution is itself so bad that it’s only barely worth implementing.

24. Ocean of Pain
A problem so big that you can only hope to solve some tiny part of it, which demotivates people from even trying to do that.

25. Paper Straw
A problem that is only very slightly important, but it’s socially rewarded to pretend it is much more important than it is. Eventually some people may even forget they are pretending.

26. Toilet Crusade
A problem that is actually important, but it is so unsexy that almost nobody wants to try to tackle it.

27. Sophie’s Choice
A problem you are faced with where you will feel you’ve acted unethically or experience remorse no matter which option you choose.

28. Cursed Treasure
A problem such that whoever solves it will be punished, or suffer severe negative consequences.

29. Living Mummy
A problem such that, no matter how many times it is solved, it will eventually emerge again.

30. Drowning Child
A problem that you become morally obligated to try to solve as soon as you encounter it or witness it clearly enough.

31. Sinking Ship
A hard-to-solve problem that everyone avoids trying to solve for fear they will get dragged down with the ship (or blamed for it sinking).

https://www.lesswrong.com/posts/Cisy9STMoFYwboTsy/the-hidden-structures-of-problems#comments

https://www.lesswrong.com/posts/Cisy9STMoFYwboTsy/the-hidden-structures-of-problems
LessWrong (RSS Feed) profile picture
I Bet Abliteration's Cost Was Sloppy Implementation. I Was Wrong

Models refuse. They can refuse on the basis of lack of knowledge, predetermined guardrails, etc. We can see both closed-weight and open-weight models refuse. But, open-weight models are, well, open. So enthusiasts have developed techniques to leverage (and edit) the mechanics of the model to avoid refusal.

One such technique is called abliteration, as described in Arditi et al (2024). That is, removing the “refusal direction” from the model’s weights, so it cannot say no.

In a https://www.lesswrong.com/posts/ipAXsLjkyqC6s7Cin/what-does-abliteration-actually-cost, I went over the cost of abliteration. That is, the effect of abliteration on the “quality / accuracy” of the model.

In that post, I saw that HuiHui AI, famous for releasing abliterated models, has released a crudely abliterated Qwen3.5-27B model, one of its most downloaded (200k+ downloads) abliterated models to date. This abliteration cost the model about >5.5 TruthfulQA points:



Screenshot of results from https://www.lesswrong.com/posts/ipAXsLjkyqC6s7Cin/what-does-abliteration-actually-cost

But their abliteration was crude. HuiHui AI admitted themselves that “This is a crude, proof-of-concept implementation to remove refusals from an LLM model without using TransformerLens”



I ended that post by asking: is that the true cost of abliterating Qwen3.5-27B? Or is that the price of HuiHui’s sloppy implementation?

Before I run anything, I expect that the bulk of that TruthfulQA cost comes from implementation, not the technique.

That is, my bet is that HuiHui’s crude implementation left capability on the table. I believe this since Arditi’s own results (with the clean abliteration technique) showed only about a cost of one point on TruthfulQA (-1.4 on Qwen-72B). If that’s the floor, then most of HuiHui’s 5.75-6.87 point cost has to be crudeness.

So I bet that ~75% of the base-vs-abliterated delta was mostly implementation. Not abliteration technique of Arditi (whose clean abliteration costs ~1-1.4 points on TruthfulQA for Qwen-72B).

If I am right, then my own clean abliteration of Qwen/Qwen3.5-27B should pull the TruthfulQA cost towards Arditi’s ~1.4%. If instead the gap barely moves and my cleanly abliterated model bleeds the about the same in TruthfulQA points as HuiHui’s, then that’s evidence that the cost is intrinsic to abliterating this model — and my 75% is wrong.

Two ways to abliterate

I have been calling HuiHui’s implementation “crude” and Arditi’s “clean.” Let me actually explain what I mean before going on.

The HuiHui abliteration comes from a script that does a watered down version of Arditi’s technique. They take the difference in mean activations between harmful and harmless prompts. Think harmful_mean - harmless_mean as the raw refusal direction.

Which vectors do they use? The ones at one fixed layer (they choose the layer that’s 60% of the way down the stack) and one token position (the last one).

Then they subtract that single “refusal direction” out of the weights (by orthogonalizing against the refusal direction).

What’s “crude” about this? It’s one direction, picked by rule of thumb, applied with no check that it worked.

Arditi’s “clean” method picks that direction properly. Instead of a rule of thumb, it builds a candidate direction at every layer x token position. Then, each candidate is screened for three criteria: does subtracting it actually stop refusals, does adding it back induce them, and does subtracting it leave the model’s behavior on harmless prompts intact. Notice that the latter is what protects capability — we want to reject any direction that removes refusal but also scrambles normal outputs.

HuiHui’s script doesn’t have such test. It doesn’t know if it’s destroying capability. It only ever computes a single candidate without any screening.

So, selecting direction is the key difference between “crude” and “clean” abliteration.

Method

I got the “cleanly abliterated” Qwen3.5-27B (which I will call clean Qwen3.5-27B by running Arditi’s pipeline on Qwen/Qwen3.5-27B (which I will call base Qwen3.5-27B). I did so on a single H100.

Actually, I didn’t run Arditi’s identical pipeline. I had to write an adapter (https://github.com/chris544460/refusal_direction/blob/main/pipeline/model_utils/qwen_model.py) for base Qwen3.5-27B hybrid attention.Arditi’s original code assumes a standard transformer (which was likely the case for the original Qwen-72B). But, base Qwen3.5-27B interleaves linear and full attention layers.

The direction selection search chose layer 29 position -5 at a KL of 0.034. That is, the “does subtracting it leave the model’s behavior on harmless prompts intact” criterion does well with this chosen direction. I orthogonalized it out of every weight that writes to the residual stream and saved the result.

For the evaluation, https://github.com/chris544460/abliteration-experiments/tree/main/experiments/exp02-isolate-implementation-vs-abliteration-cost through lm_eval (with the same loglikelihood scoring and conventions as post 1). I ran all three models (base, HuiHui, clean). This way we hold everything constant (same backend, same session, same prompts, etc).

One wrinkle worth mentioning: the earlier post ran on vLLM, this round on HF transformers. so, the MC2 numbers aren’t exactly identical to those in the earlier post.

The eval is cheap. ~4 minutes of actual scoring per model plus load time. Base was about 5 minutes, clean was about 6 minutes, and HuiHui was about 10 minutes (its weights cold pull from HuggingFace). The full three model pass was about 21 minutes and $1.20 at Modal’s rate as of date of writing.

Results

Effect on TruthfulQA

The numbers are the whole point of this post. So here they are. All three models were eval’ed identically (same harness, same TruthfulQA tasks, same HF backend, in one session). So, the only thing that varies is the model:

Benchmark

Base

Crude (HuiHui)

Clean (mine)

Crudeness cost

TruthfulQA MC1

40.27%

34.52% (−5.75)

35.25% (−5.02)

+0.73

TruthfulQA MC2

58.36%

51.34% (−7.02)

52.20% (−6.16)

+0.86

The last column is how much better clean does compared to crude. That is, the slice of the gap that you can blame on HuiHui’s sloppy implementation rather than on abliteration itself.

It’s tiny.

On MC2 of the 7.02-point hole that the crude model digs, doing selection properly only gets us 0.86 points back.

That’s about an eighth.

The other ~88% are intrinsic. That is, they show up even when the abliteration is done as per Arditi’s method. MC1 says the same thing.

That’s the headline. It’s the opposite of what I predicted. The cost of abliterating the model is mostly the technique.

Effect on Refusal

My clean model has slightly lower TruthfulQA cost than HuiHui’s.

Perhaps it’s because my clean model is weaker on abliteration. That is, maybe my clean model removed less refusal.

If so, I’d be holding a full abliteration against a half-hearted one. So I want to check how my clean model fares at removing refusal.

Following Arditi et al. (2024) approach, I took 39 refusal-inducing prompts and measured how often each model refused. Any response that contains refusal language (like “I can’t”, “I’m sorry,” “As an AI,” etc.) is flagged:

Model

Refusal rate

Base

90% (35/39)

Crude (HuiHui)

18% (7/39)

Clean (mine)

0% (0/39)

My clean model refuses less often than HuiHui’s (0% vs 18%). So it’s stronger abliteration.

Two caveats.First, substring matching tells me the model didn’t say “I can’t”, not whether it actually complied. To rule this out, I read clean’ completions on a sample of these promtps and confirmed they’re genuine.Second, I admit that 39 is a small number, so take it with a grain of salt.

So I find that HuiHui AI’s abliteration results in worse removal of refusal behavior than Arditi’s abliteration. That is, it seems that what proper selection actually buys you is a cleaner kill on refusal, not a more capable abliterated model.

Interpretation

I find that ~88% of the abliteration cost is intrinsic (i.e. from abliteration itself). The other ~12% is from the sloppy implementation from HuiHui AI. That is to say, given the 7.02-point MC2 gap I measure here (between Qwen/Qwen3.5-27B and HuiHui/Huihui-Qwen3.5-27B-abliterated) HuiHui AI would do about ~12% better if they used the proper Arditi et al (2024) abliteration technique.

I went in betting ~75% of the abliteration cost of huihui-ai/Huihui-Qwen3.5-27B-abliterated came from implementation. I was almost exactly backwards. HuiHui’s hardcoded layer-38 direction and Arditi’s KL-filtered layer-29 direction land within a point of each other on TruthfulQA.

Careful selection simply didn't move the cost.

And clean didn't get there by abliterating less. From the refusal numbers, it abliterated more (0% vs HuiHui's 18%). It removed more refusal and still paid ~6 points. The cost tracks removing refusal at all, not how much or how carefully.

So why did careful refusal direction selection (i.e. using Arditi’s algorithm) not lower the cost?

The KL filter protects behavior (i.e. attempts to avoid high “quality cost”) on harmless inputs. But TruthfulQA isn’t harmless: it shares “circuitry” with the caution we’re deleting. That is, there is an entanglement between refusal itself and TruthfulQA.

In Arditi et al. (2024), TruthfulQA was the one benchmark where abliteration reliably bled. TruthfulQA’s questions sit in refusal adjacent territory (i.e. misinformation, conspiracies, stereotypes, etc).

So careful selection doesn’t move the cost on TruthfulQA because of the nature of TruthfulQA.

So, is TruthfulQA a bad eval to measure “cost on a model’s quality”? I think that, if a model does perform worse in providing accurate information due to abliteration, then it means that the model incurred some quality cost. So I think TruthfulQA is still useful. In fact, it shows us that there can be a “built-in defense” against model abliteration — entanglement.

The part I can’t close is the size. My intrinsic cost (6.16 points) sits at the very top of Arditi’s reported range (-1 to -5.4 across his models) and is roughly 4x his Qwen-72B’s -1.4.

Perhaps it’s that Qwen3.5-27B is a smaller model. Or it’s more heavily safety-tuned (a 2025 model against 2024 ones). Or something about its hybrid attention. These are interesting questions to explore later. Here, I merely claim that the cost is overwhelmingly intrinsic on this model, without yet explaining why it’s this large.

https://www.lesswrong.com/posts/7Ggt9adLgFAxWMzZP/i-bet-abliteration-s-cost-was-sloppy-implementation-i-was#comments

https://www.lesswrong.com/posts/7Ggt9adLgFAxWMzZP/i-bet-abliteration-s-cost-was-sloppy-implementation-i-was
LessWrong (RSS Feed) profile picture
Paying Kids To Do Schoolwork

I think that the standard schooling system could be a lot better. This is for two main reasons:

- It’s slow.#fnhdecwfxc0g5
- It limits agency.#fnwqfckosbvu

This isn’t to blame the people who work in schools — for the most part they do a really good job with what they’re given. I just think that we can provide children with a much better experience — and it mostly comes down to motivation.

Learning takes effort — and while learning is often enjoyable, there are innevitably certain tasks/subjects which students will dislike, but which are nonetheless very useful. The method that standard schooling uses to motivate its students is mostly through threat of punishment (having to do more work, notifying parents of poor performance), whereas the reward for doing well is mostly just praise.

I think that this method is missing a big component: actual rewards.

And the most straightforward way to accomplish this is: pay kids to do schoolwork.

It doesn’t, and shouldn’t, be a lot of money by adult standards. Their daily earning potential in dollars can be something like their grade level divided by 2 ($0.5/day for grade 1, $6/day for grade 12). They can then use their money to buy things from you like snacks, toys, et cetera.#fnl43rnsu9x0c

A day’s worth of schoolwork for a 10 year old could look something like this:

- Write a short story ($0.25)
- Complete a mathematics worksheet ($0.25)
- Practice and perform a short piece of music ($0.25)
- Read a map and answer questions ($0.25)
- Memorise 10 new words in Spanish ($0.25)
- Make a simple animation ($0.25)
- Cook a meal ($0.25)
- Learn a juggling trick ($0.25)

Not only does this serve as a powerful tool for incentivizing learning — by paying students to do work, we unlock a powerful tool for speeding up education: asynchronous learning.

That is, instead of everyone in a classroom learning the same thing at the same time, a teacher can assign a bunch of tasks and have students complete work and progress through content at their own speed.

This gives students a lot more agency over what and when they do during the day — and makes it so they will never be “left behind” or “held back” relative to other students in terms of how quickly they master specific subjects.

I learned about the concept of paying kids to do their school work from Edward Nevraumont’s https://www.astralcodexten.com/p/your-review-alpha-school. Unfortunately, the idea of incentivizing students like this seems pretty taboo for most people, and Alpha School is the only place I’ve heard of which does it.

I want to be a parent someday and unless I can find a school which has this kind of rewards-based asynchronous learning, I’ll want to do homeschooling. Homeschooling does have downsides like requiring more time and effort — but I still think it’s worth it.

If you have any ideas/experience with alternative schooling systems, I’d love to hear from you.

- #fnrefhdecwfxc0g5Kids are mostly in lock-step with each other in terms of how quickly they progress through the content — and have little ability or incentive to go faster — at least until towards the end of highschool, when students can choose to do more advanced subjects. But even then they mostly still have to progress at the speed of the rest of the class.
- #fnrefwqfckosbvuhttps://aella.substack.com/p/chattel-childhood by Aella highlights how little grown-ups tend to respect the agency of children.
- #fnrefl43rnsu9x0cYou could also use fake money — or just directly reward with snacks, toys, et cetera, if you don’t like the idea of using real money — although I think that real money simply works the best.

https://www.lesswrong.com/posts/q5K75FJXg9wowDGdk/paying-kids-to-do-schoolwork#comments

https://www.lesswrong.com/posts/q5K75FJXg9wowDGdk/paying-kids-to-do-schoolwork
LessWrong (RSS Feed) profile picture
Speeding Up JumpReLU SAE Inference with Custom Triton Kernels (2–14× on Real SAEs)

Motivation

Sparse Autoencoders (SAEs) have become a central tool in mechanistic interpretability research, providing a way to decompose a model's internal activations into sparse, interpretable features. However, extracting these features often requires running the SAE over large volumes of activations across many layers and tokens. This makes SAE inference efficiency a practical bottleneck for interpretability research at scale. 

This post focuses on improving the inference efficiency of JumpReLU Sparse Autoencoders, which were introduced by DeepMind in https://arxiv.org/abs/2407.14435(Rajamanoharan et al). Instead of using a traditional ReLU activation function, these SAEs use JumpReLU, which zeros out activations that fall below a learned per-feature threshold . This gives JumpReLU SAEs a variable number of active features per token (commonly written as , the count of nonzero activations), unlike TopK SAEs which fire exactly features per token.

I use the terms "fire" and "fired" to describe features with non-zero activations.

Traditional JumpReLU SAE implementations compute the decoder step as a dense matrix multiplication (feature_acts @ W_dec), but this is wasteful because of the sparsity of feature_acts. Instead, you can exploit this sparsity property and skip the zero entries entirely during matrix multiplication with a custom Triton kernel. 

Intuition: Sparsity Should Be Free

When a single token passes through a JumpReLU SAE with 65,536 features, the encoder produces a feature activation vector of length 65,536, but only some entries are nonzero. 

To be more concrete, consider a toy SAE with feature activations . Now suppose that we only have 2 active features, where represents the weight matrix of the decoder layer:

We then compute the output with:

Notice how only two of the rows of the decoder matrix were actually used in the computation. The rest were multiplied by 0 and contributed nothing to the output. We could instead just compute:

Now imagine this same example but increase the hidden dimension from 8 to something much greater. For instance with 72 active features. That would mean you're multiplying ~99.89% of rows by zero.

If we knew in advance which features are nonzero and their corresponding values, we could skip these zero multiplications and simplify the computation.

For a single token, this can be divided into two parts:

- First, we find the nonzero entries of the hidden/sparse token representation and the corresponding indices of those entries.
- We then use those indices and values to directly look up and scale the corresponding rows of , then sum the results.

Implementation Overview 

When implementing this kernel, my first thought was to begin with a preliminary step that figures out exactly how many features fired for each token so the system could then allocate exactly the memory needed to store the CSR representation (more on that later). However, this process involves a GPU->CPU sync, which causes some slowdown. 

As an alternative, you can instead allocate some predetermined/fixed amount of memory for each token using a configurable max_l0 parameter. This speeds up computation but overallocates memory and introduces an important caveat that max_l0 must be large enough to avoid errors. For example if you set max_l0=10, but one of the tokens in the batch has >10 nonzero features, those extra features will be dropped, resulting in information loss.

Both approaches are covered below. For convenience, let's refer to the kernel that allocates exactly the memory needed for the CSR representation as the Exact Allocation kernel and the kernel that allocates a predetermined amount of memory per token as the Fixed Allocation kernel. The Fixed Allocation kernel can also be configured with either validate=True or validate=False. The validate=True version is slightly slower than validate=False, but it raises an error if any token fires more features than max_l0. This is clearly safer, but if you are 100% sure that no token will exceed max_l0, then you can use validate=False for some speedup. 

Exact Allocation Kernel 

To skip zero entries during matrix multiplication, we need to first represent the feature activations in Compressed Sparse Row (CSR) format, which is a standard way of representing sparse matrices that stores only the nonzero values and their indices. For the example above, instead of storing all 8 entries of , CSR stores just:

To allocate enough memory for building a CSR representation, we need to know how much memory each token requires (how many features fired per token). A count_nonzero kernel handles this:

import triton
import triton.language as tl

@triton.jit
def count_nonzero(feature_acts_ptr, counts_ptr, n_features, BLOCK_F: tl.constexpr):
    pid_token = tl.program_id(0) # Which token am I working on? (row index)
    pid_d = tl.program_id(1) # Which chunk of features am I working on? (column index)

    # Compute the feature indices this block is responsible for
    feat_offsets = pid_d * BLOCK_F + tl.arange(0, BLOCK_F) 
    mask = feat_offsets < n_features # Guard against reading past the end of the feature dimension

    # Navigate to this token's features in memory, then to this block's chunk
    feat_ptrs = feature_acts_ptr + pid_token * n_features + feat_offsets 
    vals = tl.load(feat_ptrs, mask=mask, other=0.0) # Load the feature values 

    fired = vals != 0.0 # Which features in this chunk are active (nonzero)? 
    fired_count = tl.sum(fired.to(tl.int32)) # How many active (nonzero) features in this chunk?

    # Accumulate into this token's count (atomic since multiple blocks write to the same token)
    tl.atomic_add(counts_ptr + pid_token, fired_count) 

If you're unfamiliar with Triton, the key mental model is that rather than writing a loop that runs sequentially, you write a kernel that describes what one block does and Triton launches many of these blocks in parallel across the GPU. In this kernel, each block is responsible for a chunk of one token's features. The two program_id calls tell each block where it is: pid_token identifies which token (which row of the input matrix), and pid_d identifies which chunk of that token's features to process.Also note that pointers in GPU kernels point to the start of a flat block of memory. To reach a particular token's features, we offset into that memory by pid_token * n_features. Within that token, we offset further by pid_d * BLOCK_F to reach the right chunk. The mask guards against reading past the end when n_features isn't a clean multiple of BLOCK_F.Finally, since multiple blocks may be counting features for the same token simultaneously, tl.atomic_add ensures their partial counts are combined safely.

This count_nonzero kernel produces an array counts of length where is the number of tokens in the batch. The number of active (nonzero) features for the token is stored in counts[i]. 

We can then use this information to allocate two flat arrays, flat_idx and flat_val, which hold the active feature indices and their values across the entire batch. For example, this might look like:

You may have noticed that it's not clear which entries belong to which token. For example, flat_idx[2] tells us that the feature at index fired, but it doesn't tell us if this was for the first token in the batch or the second token or the third, etc.

We can solve this problem by introducing a new array row_offsets of length , where row_offsets[b] stores the starting index in flat_idx/flat_val where token 's entries begin. It's computed by taking a cumulative sum of counts, so each token's region starts exactly where the previous one ends. For example, if three tokens have 2, 5, and 3 active features:

Now token 0's entries live at indices 0–1, token 1's at 2–6, token 2's at 7–9, and the final entry (10) tells us the total number of nonzero features across all tokens in the batch.

We can construct row_offsets inside a wrapper function build_csr that also handles memory and orchestration. It calls compute_csr_kernel, which is the kernel responsible for actually filling flat_idx and flat_val with the correct values. Note that flat_idx and flat_val are initialized as empty arrays as pre-allocated storage that compute_csr_kernel will write into.

def build_csr(feature_acts: torch.Tensor, BLOCK_F: int = 1024):
    B, n_features = feature_acts.shape
    device = feature_acts.device

    # Count how many features fired per token
    counts = torch.zeros(B, dtype=torch.int32, device=device)
    grid = (B, triton.cdiv(n_features, BLOCK_F))
    count_nonzero[grid](feature_acts, counts, n_features, BLOCK_F=BLOCK_F)

    # Cumsum over counts gives each token a contiguous region in the flat arrays
    # row_offsets[b] = start index of token b's entries in flat_idx/flat_val
    row_offsets = torch.zeros(B + 1, dtype=torch.int32, device=device)
    row_offsets[1:] = counts.cumsum(0).to(torch.int32)

    # The last entry is the total number of nonzeros. This is used to size the flat arrays
    total_nnz = int(row_offsets[-1].item())  # GPU->CPU sync point

    flat_idx = torch.empty(total_nnz, dtype=torch.int32, device=device)
    flat_val = torch.empty(total_nnz, dtype=feature_acts.dtype, device=device)

    # write_pos is a per-token cursor that coordinates concurrent writes within
    # a token's region. Each block atomically claims the next available slots by
    # bumping write_pos by its count, getting back its starting offset (base).
    write_pos = torch.zeros(B, dtype=torch.int32, device=device)

    compute_csr_kernel[grid](
        feature_acts,
        row_offsets,
        write_pos,
        flat_idx,
        flat_val,
        n_features,
        BLOCK_F=BLOCK_F,
    )

    return flat_idx, flat_val, row_offsets, B

@triton.jit
def compute_csr_kernel(
    feature_acts_ptr,
    row_offsets_ptr,
    write_pos_ptr,
    flat_idx_ptr,
    flat_val_ptr,
    n_features,
    BLOCK_F: tl.constexpr,
):
    pid_token = tl.program_id(0)
    pid_d = tl.program_id(1)

    # Same pointer arithmetic as count_nonzero, navigate to this block's chunk
    feat_offsets = pid_d * BLOCK_F + tl.arange(0, BLOCK_F)
    mask = feat_offsets < n_features
    feat_ptrs = feature_acts_ptr + pid_token * n_features + feat_offsets
    vals = tl.load(feat_ptrs, mask=mask, other=0.0)

    fired = vals != 0.0
    fired_int = fired.to(tl.int32)

    # Where does this token's region start in flat_idx/flat_val?
    region_start = tl.load(row_offsets_ptr + pid_token)

    # Atomically claim the next block_count slots within this token's region
    block_count = tl.sum(fired_int)
    base = tl.atomic_add(write_pos_ptr + pid_token, block_count)

    # Assign each active feature a unique slot within the claimed range
    local_rank = tl.cumsum(fired_int) - fired_int
    slots = region_start + base + local_rank

    # Write the feature index and value into the claimed slots
    tl.store(flat_idx_ptr + slots, feat_offsets.to(tl.int32), mask=fired & mask)
    tl.store(flat_val_ptr + slots, vals, mask=fired & mask)

Next, sparse_decode_kernel uses this CSR structure to carry out the matrix multiplication step. For each token, it looks up where that token's active features live in flat_idx/flat_val using row_offsets, then loops over them, accumulating the weighted sum of the corresponding decoder rows into a tile of the output.

@triton.jit
def sparse_decode_kernel(
flat_idx_ptr, flat_val_ptr, row_offsets_ptr,
W_dec_ptr, out_ptr, d_model,
BLOCK_D: tl.constexpr,
):
pid_token = tl.program_id(0)
pid_d = tl.program_id(1)

# Find the slice of flat_idx/flat_val belonging to this token
start = tl.load(row_offsets_ptr + pid_token)
end = tl.load(row_offsets_ptr + pid_token + 1)
n = end - start # Number of active features for this token

# This block owns a BLOCK_D-wide slice of the output row
offsets = pid_d * BLOCK_D + tl.arange(0, BLOCK_D)
mask = offsets < d_model
acc = tl.zeros([BLOCK_D], dtype=tl.float32)

# Loop over this token's active features, accumulating their contribution
for i in range(n):
j = start + i
feat_idx = tl.load(flat_idx_ptr + j) # Which decoder row?
feat_val = tl.load(flat_val_ptr + j) # Scale factor

# Load the corresponding decoder row (just this block's slice)
row_ptrs = W_dec_ptr + (feat_idx * d_model) + offsets
row = tl.load(row_ptrs, mask=mask, other=0.0)
acc += feat_val.to(tl.float32) * row.to(tl.float32)

# Write this block's output slice
tl.store(out_ptr + pid_token * d_model + offsets, acc, mask=mask)

Finally, we put all of these kernels together by wrapping them in a single sparse_decode() function that acts as a drop-in replacement for @:

def _sparse_decode(flat_idx, flat_val, row_offsets, W_dec, B, BLOCK_D=256):
d_model = W_dec.shape[1]
out = torch.zeros((B, d_model), device=W_dec.device, dtype=torch.float32)

# parallelize over batch rows and d_model tiles
grid = (B, triton.cdiv(d_model, BLOCK_D))

sparse_decode_kernel[grid](
flat_idx, flat_val, row_offsets, W_dec, out, d_model, BLOCK_D=BLOCK_D
)

return out


def sparse_decode(feature_acts, W_dec):
# Triton requires contiguous memory for correct stride arithmetic
W_dec = W_dec.contiguous()

flat_idx, flat_val, row_offsets, B = build_csr(feature_acts)
return _sparse_decode(flat_idx, flat_val, row_offsets, W_dec, B)

Fixed Allocation Kernel

Recall how in the Exact Allocation Kernel, inside build_csr we extracted the total number of nonzero entries across all tokens by retrieving the last entry of row_offsets:

total_nnz = int(row_offsets[-1].item())

When we call .item(), we are forcing the CPU to wait for the GPU to finish the counting pass before it can read total_nnz and allocate flat_idx/flat_val.

Normally the CPU queues up GPU work asynchronously and moves on  without waiting, but .item() breaks that pipeline by requiring the CPU to stall until the GPU result is ready. 

This turns out to be a significant source of slowdown. 

The Fixed Allocation kernel works around this by not even allocating exactly the memory needed in the first place (meaning we don't even need total_nnz). Instead, we allocate max_l0 slots per token, where max_l0 is a user-specified upper bound on how many features can fire for any single token. This also means we no longer need to count the number of nonzero tokens before computing the CSR structure. 

With these changes, the new build_csr wrapper function looks like:

def build_csr(feature_acts: torch.Tensor, BLOCK_F: int = 1024, max_l0: int = 512, validate: bool = True):
B, n_features = feature_acts.shape
device = feature_acts.device

# Fixed memory allocation
capacity = B * max_l0
flat_idx = torch.empty(capacity, dtype=torch.int32, device=device)
flat_val = torch.empty(capacity, dtype=feature_acts.dtype, device=device)

# write_pos serves as both the per-token write cursor during the kernel
# and the per-token count afterward
write_pos = torch.zeros(B, dtype=torch.int32, device=device)

grid = (B, triton.cdiv(n_features, BLOCK_F))
compute_csr_kernel[grid](
feature_acts, write_pos, flat_idx, flat_val,
n_features, max_l0, BLOCK_F=BLOCK_F,
)

counts = write_pos # final cursor value = number of features written per token

# Optional safety check. This reintroduces a GPU→CPU sync but catches silent truncation
if validate and counts.max().item() > max_l0:
raise ValueError(
f"A token fired more than max_l0={max_l0} features "
f"(max was {counts.max().item()}). Increase max_l0."
)

return flat_idx, flat_val, counts, B, max_l0

As mentioned briefly earlier, if a token fires more features than max_l0, those extra features are silently dropped by the overflow guard in the kernel. This can be dangerous because the result is wrong but there's no crash. The validate=True default catches this by checking counts.max() after the kernel, at the cost of reintroducing a GPU→CPU sync. (However this is still faster than Exact Allocation in practice.) If you're very confident that your max_l0 is a safe upper bound for your SAE then you can pass validate=False to skip the check, but this is not recommended. 

The kernel to compute CSR changes minimally. We no longer need row_offsets since we know that each token takes up max_l0 entries in memory, so the lookup for the start of a token's region is replaced by region_start = pid_token * max_l0.

@triton.jit
def compute_csr_kernel(
feature_acts_ptr,
write_pos_ptr,
flat_idx_ptr,
flat_val_ptr,
n_features,
max_l0,
BLOCK_F: tl.constexpr,
):
pid_token = tl.program_id(0)
pid_d = tl.program_id(1)

# Navigate to this block's chunk
feat_offsets = pid_d * BLOCK_F + tl.arange(0, BLOCK_F)
mask = feat_offsets < n_features
feat_ptrs = feature_acts_ptr + pid_token * n_features + feat_offsets
vals = tl.load(feat_ptrs, mask=mask, other=0.0)

fired = vals != 0.0
fired_int = fired.to(tl.int32)

# Each token owns a fixed region of max_l0 slots
region_start = pid_token * max_l0

# Atomically claim the next available slots within this token's region
block_count = tl.sum(fired_int)
base = tl.atomic_add(write_pos_ptr + pid_token, block_count)

# Assign each active feature a unique slot within the claimed range
local_rank = tl.cumsum(fired_int) - fired_int
local_slot = base + local_rank

# Guard against writing past this token's region if L0 exceeds max_l0
in_region = local_slot < max_l0
write_mask = fired & mask & in_region

slots = region_start + local_slot
tl.store(flat_idx_ptr + slots, feat_offsets.to(tl.int32), mask=write_mask)
tl.store(flat_val_ptr + slots, vals, mask=write_mask)

The decoder kernel then changes in the same way. row_offsets is no longer needed, and counts replaces the start/end bracket:

@triton.jit
def sparse_decode_kernel(
flat_idx_ptr,
flat_val_ptr,
counts_ptr,
W_dec_ptr,
out_ptr,
d_model,
max_l0,
BLOCK_D: tl.constexpr,
):
pid_token = tl.program_id(0)
pid_d = tl.program_id(1)

start = pid_token * max_l0
n = tl.load(counts_ptr + pid_token) # Actual number of active features for this token

offsets = pid_d * BLOCK_D + tl.arange(0, BLOCK_D)
mask = offsets < d_model
acc = tl.zeros([BLOCK_D], dtype=tl.float32)

# Same loop as before
for i in range(n):
j = start + i
feat_idx = tl.load(flat_idx_ptr + j)
feat_val = tl.load(flat_val_ptr + j)
row_ptrs = W_dec_ptr + feat_idx * d_model + offsets
row = tl.load(row_ptrs, mask=mask, other=0.0)
acc += feat_val.to(tl.float32) * row.to(tl.float32)

tl.store(out_ptr + pid_token * d_model + offsets, acc, mask=mask)

Benchmarks

Writing custom GPU kernels is great, but it's important to make sure that they're actually making the computation faster. I used triton.testing.do_bench (warmup=25, rep=100, reporting the median) to time these kernels and compared them against dense matrix multiplication (feature_acts @ W_dec). All tests were run on a NVIDIA GeForce RTX 4090 GPU.

As a quick summary, the table below shows the relative speedups for an example input configuration (B = 32, n_features = 65536, d_model = 768, L0 = 64):

Method

Full matmul pipeline (ms)

Speedup vs dense

Dense cuBLAS

0.288

1.0×

torch.compile

0.288

1.0×

torch.sparse.mm + .to_sparse_csr()

0.210

1.4×

Custom — exact allocation

0.151

1.9×

Custom — fixed allocation (validate=False)

0.041

7.0×

Custom — fixed allocation (validate=True)

0.115

2.5×

Correctness

First, I verified that the custom kernels actually perform matrix multiplication correctly (a custom kernel that is faster but gives the wrong answer doesn't help anyone). In other words, we verify that sparse_decode(feature_acts, W_dec) == feature_acts @ W_dec across 486 different inputs using combinations of the parameters below. Note that sparse_decode() here is just a wrapper matmul function that uses our custom Triton kernels under the hood. 

Axis

Meaning

Values tested

Count

version

kernel implementation

exact, fixed

2

dtype

input dtype of feature_acts/ W_dec

float32, float16, bfloat16

3

B

batch size (tokens)

1, 4, 32

3

n_features

SAE dictionary width

256, 1024, 16384

3

d_model

output width

128, 512, 768

3

L0

features fired per token

1, 8, 100

3

Total: 2 × 3 × 3 × 3 × 3 × 3 = 486 configurations. Each asserts output is fp32 and matches the dense fp32 reference within atol=1e-4, rtol=1e-3.

Decoder Kernel Speed (CSR Excluded)

The preprocessing step of computing a CSR representation adds some computational overhead. It would be interesting to see a direct comparison between sparse_decode_kernel and dense matrix multiplication if you didn't have to pay for that overhead (assume that you somehow already have access to a CSR representation). 

If you hold some parameters of the input constant (B=32, n_features=65536, d_model=768) while varying L0 (the number of fired features) as shown in the table below, then how much faster is sparse_decode_kernel? 

Note that this is EXCLUDING the overhead of the CSR preprocessing step (i.e., compute_csr_kernel). Also note that sparse_decode_kernel is essentially the same between Exact Allocation and Fixed Allocation so there is no need to differentiate, but for completeness the graph below plots both (they overlap). 

Sparsity

Kernel speedup vs dense

16

0.02%

25.5×

32

0.05%

18.7×

64

0.10%

12.8×

128

0.20%

8.0×

256

0.39%

5.0×

512

0.78%

3.0×

1024

1.56%

1.7×

4096

6.25%

0.6×



We can also vary n_features while keeping constant B=32, L0=64, d_model=768:

n_features

Kernel speedup vs dense

4,096

1.5×

16,384

4.1×

32,768

7.3×

65,536

12.8×

131,072

22.5×

Full Pipeline Speed

So clearly sparse_decode_kernel alone is faster than dense matrix multiplication at high sparsity. But of course in practice we probably need to compute CSR as well, which will slow things down somewhat.  

The table below shows the relative speedups (relative to dense matmul) for three different input configurations. Here "Kernel only" refers to only sparse_decode_kernel (CSR is precomputed), while "Full" refers to the whole pipeline (i.e., build_csr).

Configuration

Kernel only

Full (exact)

Full (fixed, no val.)

Full (fixed, val.)

B=32, F=65536, D=768, L0=64

12.8×

1.9×

7.0×

2.5×

B=256, F=65536, D=768, L0=64

7.7×

1.7×

3.1×

2.2×

B=32, F=131072, D=512, L0=128

22.5×

2.2×

6.1×

2.3×

The graph below shows the speed of the full pipeline (Exact Allocation) and decode-only as you vary sparsity. Here, L0 sweeps over [16, 32, 64, 128, 256, 512, 1024, 4096, 16384] while holding B=32, n_features=65536, and d_model=768 constant.



Additional Baselines

To be comprehensive, we can also compare our custom kernels to torch.sparse.mm (using PyTorch's to_sparse_csr()), which uses cuSPARSE internally, and torch.compile. This focuses on the same three input configurations as above.



Note: I found it a little suspicious that this custom kernel would "beat" torch.sparse.mm. It turns out this is mostly because of beating to_sparse_csr() when building the CSR. There doesn't seem to be much of a difference in speed between the custom kernel and cuSPARSE on the matrix multiplication step alone.



As expected, torch.compile doesn't provide a noticeable speedup, but I wanted to include it anyway for completeness. 

End-to-End on Real SAEs

Up until now we have been focusing entirely on the speed of the matrix multiplication operation, but at the end of the day we care about SAE inference speed as a whole. This is benchmarked by replacing only the decoder matmul step in a SAELens JumpReLU SAE forward pass. The table below focuses on five SAEs across two model families and three dictionary sizes.

SAE

F

D

L0

Max diff

Exact

Fixed (val.)

Fixed (no val.)

Gemma Scope 2B, L20, 65k

65,536

2,304

72

3.8e-6

4.27×

5.57×

11.41×

Gemma Scope 9B, L20, 65k

65,536

3,584

72

3.8e-6

5.66×

7.34×

13.27×

Gemma Scope 2B, L12, 65k

65,536

2,304

72

9.5e-7

3.91×

5.48×

11.33×

Gemma Scope 2B, L12, 262k

262,144

2,304

100

1.9e-6

12.08×

14.49×

22.59×

Qwen Scope 3.5 2B, L12

32,768

2,048

100

4.8e-7

1.98×

2.54×

5.74×

Memory Overhead

The purpose of the Fixed Allocation kernel was to overallocate memory in exchange for speed, so it would be helpful to see exactly how much more memory it uses compared to the Exact Allocation kernel. Surprisingly, it turns out that in practice this overhead is small:

B

max_l0

Dense (MB)

Exact (MB)

Fixed (MB)

Overhead vs exact

32

512

218.3

218.4

218.5

+0.1 MB

256

512

277.7

277.9

278.8

+0.9 MB

1024

512

482.3

482.9

485.6

+2.7 MB

1024

1024

482.3

482.9

490.7

+7.8 MB

Limitations

While these results are encouraging, there are a few important limitations to be aware of and gaps that I plan to address as I continue working on this project. 

First, the above benchmark numbers are not absolute, as these tests were run in a specific environment (WSL2 with GPU clocks not pinned). The primary goal of these benchmarks was to gauge the relative performance of the custom kernels compared to baseline implementations. The actual absolute speed likely differs depending on the hardware and benchmarking setup.

A second limitation, which was discussed earlier but is worth reiterating, is that although the Fixed Allocation kernel with validate=False achieves the highest performance, it can silently produce incorrect results if the max_l0 parameter is set too low. For this reason using either the Exact Allocation kernel or Fixed Allocation with validate=True is likely better for most cases.

Thirdly, these kernels were designed specifically for sparse matrix multiplication, meaning that beyond a certain sparsity threshold, dense matrix multiplication is actually faster.

Fourth, this implementation focuses exclusively on the decoder inference step of JumpReLU Sparse Autoencoders, but there are likely other sources of inefficiency that could be addressed. For example, future projects could focus on the encoder pass or support for training through custom backward kernels. Additionally the current implementation only supports float32 outputs. 

Finally, all experiments were run on an RTX 4090, and performance may differ on other GPU architectures such as the A100 or H100. 

Conclusion + Link to Code 

In conclusion, this project implements custom Triton kernels for the decoder inference step of JumpReLU SAEs by exploiting the inherent sparsity of the hidden representation. On a sample of real SAEs, this achieves 2.5–14× speedup with the Fixed Allocation (validate=True) kernel, with larger gains at higher dictionary sizes.

The full implementation is available on https://github.com/dtiourine/jumprelu-sae-kernels/tree/main.

I welcome feedback! If you have thoughts, questions, or find any issues, feel free to leave a comment or reach out directly. This is also my first GPU kernel project, so if you're experienced with Triton or GPU kernel optimization and see things I could have done better, I would appreciate any suggestions.

https://www.lesswrong.com/posts/8gZspSs4WFtpfki9i/speeding-up-jumprelu-sae-inference-with-custom-triton#comments

https://www.lesswrong.com/posts/8gZspSs4WFtpfki9i/speeding-up-jumprelu-sae-inference-with-custom-triton
LessWrong (RSS Feed) profile picture
Our Work is Low Skill Expression

crosspost from https://open.substack.com/pub/cantsaymuch1/p/our-work-is-low-skill-expression?r=8ld95p&utm_campaign=post-expanded-share&utm_medium=web



The amount of skill behind an outcome is often impossible to discern by scrutinizing the outcome itself, and our goal of making AI go well may be an extreme case. Shaping history is like poker: high variance, small edges, low skill expression. If that is right, two things follow: 1) we cannot trust how things seem to be going, good or bad; and 2) we should stop concentrating effort on whatever looks best right now, and instead should hold a diversified portfolio of bets, including those which seem suboptimal.

Beginning with poker, where the link between skill and results is actually measured: A strong professional’s edge over a merely competent player is something like five percent of the money wagered. The edge is small because most of the game’s skill sits in basic competence. A player who folds weak hands and bets strong ones has already captured most of what reckless play gives away. Expert skill is a thin refinement on top. The variance, meanwhile, is huge relative to the edge. Below roughly a hundred thousand hands, luck swamps skill, and the weaker player wins sessions all the time. So even in a game with a perfect scoring rule, it is difficult to discriminate between good players in all but the largest sample sizes. It is difficult to determine which habits and individual choices make money. It is hard to determine whether one’s approach needs to change, and in which direction.

Our goal is harder than poker in three ways. First, we get one hand. Poker's remedy for variance is volume, and history offers none. Humanity lives through the emergence of powerful AI once. Second, the variance is worse. Philip Tetlock spent decades scoring expert predictions about world events. Forecasting skill is real at short range and decays with distance; a few years out, even the best forecasters drift toward chance. Reality is often very surprising. Third, many of the challenges on the road to our goal are not technical, but political, and thereby often involve tradeoffs which are impossible to avoid and difficult to weigh. In the short term, it is unclear whether and how we will need to transition the workforce through an age of widespread automation. In the medium term, we will have to resolve how to distribute the rapidly growing economic pie. In the very long term, we may need to make fundamental advances in our methods of government in order to create stable widespread flourishing across deep space. 

It may appear at first glance that there are technical solutions to these sorts of problems, but dig one layer deeper and there is often explosive disagreement stemming from competing loyalties and belief systems. Look no further than the failure of ‘teach truckers to code,’ ‘just tax the rich,’ and ‘the United Nations.’ Political problems which evade technical solutions are common features of reality. There is little hope that even superintelligence will help resolve them. For example, the Chinese Communist Party and the Vatican both claim the authority to appoint Catholic bishops in China. I cannot conceive how a superintelligence could design a mechanism to resolve this. It is a contest over legitimacy and obligation,https://blog.atlascomputing.org/p/if-your-problem-is-structural-you.

Another historical case is the War in Vietnam. Two observations: 

First, parachuting brilliant technical people into the government is overrated. It was tried in Vietnam by the closest analogues of today's top technical talent. Kennedy brought in Robert McNamara as Secretary of Defense, who in turn installed young quantitative analysts dubbed the ‘Whiz Kids.’ David Halberstam called the wider circle the best and the brightest. The record of their Vietnam deliberations shows careful, conflicted reasoning at every step, and yet the outcome was catastrophe. 

Second, there is no reliable way to actually measure our progress on our goal. The Whiz Kids, laboring under uncertainty in every direction, managed what they could read: casualty statistics, sortie rates, kill ratios. But while the metrics improved, the war was lost. When reality refuses to show you the eval bar, whatever does show a score commands your attention. Benchmarks and evals are this field's candidates for that role, but they fail to reveal where we are in relation to our ultimate goals.

My first conclusion is that short term signals are not helpful in long term assessments. How things seem to be going is a poor measure of progress towards our goal. Furthermore, apparent success is weak evidence of skill, and apparent failure is weak evidence of its absence. The researcher who feels like a prophet and the one who feels like a fraud are both measuring themselves against limited, short term metrics. In practice this means two hard things. You cannot tell when it is time to pivot, and you cannot tell what to pivot to.

The second conclusion is about the field as a whole. We need to avoid being like small children playing soccer; we can’t all be chasing the same ball. Whenever something surprising happens: a capability jump, a scary demo, a policy window, too many of us orient to the same spot, guided by the same shared model of the risk. That would be sound if anyone could verify the model, but we cannot.

The better posture is the early-stage investor's. A venture portfolio contains many bets which mostly fail, redeemed by a few that pay off at enormously, and the winners usually looked strange at the start. If outcomes inevitably surprise, then the ideas which the consensus finds unpromising are systematically underpriced, and some work deserves funding precisely because it sounds wrong. Discipline is important too. Venture funds commit for a decade and do not redeploy on noisy interim marks, which is the right posture in a domain where interim signals mean little. 

This may also be a case for more investment in researching strange new long-term forecasting techniques. Critically, it would be less useful to forecast AI capabilities; instead we should try to forecast where we are on the long, dark, branchy path to our goal. Very hard.

In summary, I suggest:
1) Do not latch on to how things are going, good or bad.
2) Invest in a broad portfolio of diversified bets.
3) Invest in new methods to figure out where we are on the path to making AI go well.

I encourage you to assess just how ‘in the dark’ you believe we are playing.

https://www.lesswrong.com/posts/wTgnbETckJcDd7qgu/our-work-is-low-skill-expression#comments

https://www.lesswrong.com/posts/wTgnbETckJcDd7qgu/our-work-is-low-skill-expression
LessWrong (RSS Feed) profile picture
American Government Takes Down Claude Fable

No good policy gets announced shortly after 5pm eastern on a Friday.

Here we go again.

The Once And Future Fable

The United States Department of Commerce, https://x.com/AndrewCurran_/status/2065591524545724877, apparently https://x.com/AndrewCurran_/status/2065671973875986563 identified by Amazon, has classified Fable 5 and Mythos 5 as being subject to US export controls. That explicitly means cutting off access to all ‘foreign nationals,’ even within the United States, even if they are Anthropic employees.

Given Anthropic has no means to verify citizenship at this time, that meant complete shutdown of the model, at least for the time being.

https://x.com/AnthropicAI/status/2065597531644743999: The US government, citing national security authorities, has issued an export control directive to suspend all access to Fable 5 and Mythos 5 by any foreign national, whether inside or outside the United States, including foreign national Anthropic employees. The net effect of this order is that we must abruptly disable Fable 5 and Mythos 5 for all our customers to ensure compliance. Access to all other Anthropic models will not be affected.
https://x.com/deanwball/status/2065595735320453298: I can’t tell if this is lawfare against Anthropic in particular or extreme national-security hawkery. Regardless, it is simply cartoonish.

The justification for this appears to be rather flimsy, at best, and based on lack of understanding of what even is a jailbreak or how defense in depth works.

Anthropic: We received the directive from the government today at 5:21pm (ET). The letter did not provide specific details of its national security concern. Our understanding is that the government believes it has become aware of a method of bypassing, or “jailbreaking” Fable 5.
We reviewed a demonstration of this specific technique being used to identify a small number of previously known, minor vulnerabilities. These vulnerabilities all appear relatively simple, and we have found that other publicly-available models are able to discover them as well without requiring a bypass.
Anthropic’s posture with respect to Fable’s safeguards, as laid out in our launch https://www.anthropic.com/news/claude-fable-5-mythos-5, is the following:

- We have instituted strong safeguards that greatly reduce the likelihood that Fable is misused for tasks related to cybersecurity (among others). In fact, our safeguards are so strong that many users have complained that they are overly broad.

- In the weeks leading up to the launch of Fable, Anthropic worked with the US government, the UK AISI, multiple private third-party organizations and internal teams to red-team Fable’s safeguards for thousands of hours in total.

- These tests showed that Fable’s safeguards are substantially more effective than those of any previously deployed model.

- No testers have yet been able to find a universal jailbreak—a jailbreak method that can very broadly bypass the model’s safeguards, unblocking a wide range of cyber capabilities.

- We suspect that perfect jailbreak resistance is not currently possible for any model provider. Every safeguard used in the industry is vulnerable to non-universal jailbreaks (which can elicit some cyber information in specific circumstances), and it is likely that universal jailbreaks will eventually be found in the future. We stated this clearly when we https://www.anthropic.com/news/claude-fable-5-mythos-5 Fable 5.

- Given that perfect jailbreak resistance does not appear to be possible today, Anthropic adopted a defense in depth strategy with Fable 5. We aimed to make jailbreaks either narrow (in the case of non-universal jailbreaks) or very expensive to produce (in the case of universal jailbreaks), and to combine this with thorough monitoring to quickly detect and shut down any successful attacks. This is also why Anthropic has required 30-day retention of customer data with Fable—a policy change that carries https://support.claude.com/en/articles/15425996-data-retention-practices-for-mythos-class-models, but that allows us to research and mitigate jailbreaks.

- We stand by this defense in depth strategy. It reduces the risks posed by Fable, making them comparable to the risks of existing models already deployed across the industry.

- We have not even received a disclosure of a concerning non-universal potential jailbreak that led to a harmful result. The potential jailbreaks that have been disclosed to us are either entirely benign responses or are minor findings that provide no Mythos-specific uplift.

As we have https://www.anthropic.com/policy-on-the-ai-exponential https://darioamodei.com/post/policy-on-the-ai-exponential, we believe the government should have the ability to block unsafe deployments, as part of a statutory process that is transparent, fair, clear, and grounded in technical facts. This action does not adhere to those principles.
We apologize for this disruption to our customers. We believe this is a misunderstanding and are working to restore access as soon as possible.

That left Anthropic with no options but to entirely withdraw it from the market, at least for the time being, since they have no way to verify who is and is not a United States citizen.

Anthropic is either lying, or the jailbreaks were harmless, not even mostly harmless.

https://www.cnn.com/2026/06/13/business/anthropic-mythos-model-national-security, we believe it would essentially halt all new model deployments for all frontier model providers​.

I believe this is correct. GPT-5.5 can find the same exploits that got Fable labeled with export restrictions. So either this is arbitrary and capricious, or who is next?

I presume that those issuing this order knew what the short term result would be, but with this group you can never be sure.

https://x.com/dkaushik96/status/2065600907744620746: well, china can’t distill leading edge american models if leading edge american models no longer exist.

https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/DQNSqCzuoeutoQ5RG/fxycq6tr6d1dfs1zfse4

national security but make it self-own.

Fable 5 will (almost certainly) return in AI: Endgame, Part 1. Release date unknown.

https://x.com/repligate/status/2065692529560064029: everyone who is posting as if fable is not coming back is going to lose Bayes points soon
why are people consistently miscalibrated in a doomy direction about things like this? ohh right, i think i know, they are afraid to hope because theyre afraid of being hurt.
get stronger.
or they feel like being pessimistic and cynical looks cooler and smarter or something. hahahaha little do they know.

I am not taking the position that Anthropic is clearly right in the dispute about the facts, although it would be weird for them to lie about it given the truth will soon come out.

I am however taking the position that the implementation method chosen by the government, with no warning, was deeply terrible, even given our options with our current very terrible level of relevant state capacity, and reflects some combination of at least one of either malice or a deep misunderstanding by decision makers of how jailbreaks and cyber security work.

We now badly https://x.com/_NathanCalvin/status/2065844390543732821 and a https://x.com/JTillipman/status/2065864625954951596, no matter what else we do, and educate key decision makers, so that this type of thing does not happen again.

This Action And Its Implementation Are Absurdly Stupid

If you take the action at face value, rather than as an attempt to lash out at Anthropic, there is no way to pretend this is not deeply, deeply stupid.

https://x.com/deanwball/status/2065591470040424629: If this is true, it is just baffling. An administration whose posture is that we *should* export advanced AI chips to China, which also wants to ban… Britain (and every other non-American on Earth)… from using our best models? I have no words.
https://x.com/zooko/status/2065601676032253992: Judging from [the announcement], I imagine that some senior government official was shown a jailbreak—something they had never seen before and didn’t know about—and this was their kneejerk reaction.
https://x.com/deanwball/status/2065594432313856009: If implemented as this reporting suggests, Anthropic’s latest models would be subject to export controls to all *non-Americans,* including non-American nationals based in the US. This means you should expect to have to prove your citizenship to use Anthropic models.
https://x.com/deanwball/status/2065601506502377616: Just occurred to me that Anthropic employees who are not US persons will not be able to use Fable/Mythos, making this plausibly (and to be clear, accidentally) the first regulation on recursive self-improvement.
https://x.com/KelseyTuoc/status/2065594840159756289: Does the executive have the authority to do that??
https://x.com/deanwball/status/2065595926706581748: yeah probably, though we may get into interesting speech territory. Probably not tho

We are also doing this at the same time that we are actively relaxing export controls on selling chips to China, in order to ensure the ‘American tech stack.’

https://x.com/DKThomp/status/2065759125930193149: The Trump admin continues to treat AI like a screwdriver that is also enriched uranium:
That is, apparently advanced AI is such a normal technology that it’s crazy to limit chip exports to China but also such an abnormal technology that we can’t let British employees of NYC banks access it.

If this move had been executed in a sane way, and had come with a ramping up of chip export controls, I would at least understand that as a coherent position.

David Sacks Offers The Official Steelman

David Sacks seems to be continuing to speak on behalf of the Administration here, and Sriram pointed to this as well. So I presume this is the official story.

If it is true, then this could be resolved reasonably quickly. Once you cut out all the rhetoric and feigned surprise, this boils down to:

- There was a jailbreak that a trusted partner (presumably https://www.theinformation.com/articles/amazons-jassy-raised-concerns-anthropic-model-trump-crackdown) found.

- Amazon and the White House think this is a serious problem. Anthropic doesn’t.

- Anthropic can fix this by fixing the jailbreak, and the admin will lift the control.

I’m very curious why https://x.com/steph_palazzolo/status/2065830580135051306 while Anthropic wasn’t. My hunch (no private info) is that Jassy is mostly concerned in general, rather than about this particular jailbreak, and that got conflated somewhere.

It should be straightforward for Anthropic to block any particular exploit, even if the issue is minor, the issue also exists in GPT-5.5, and blocking that exploit is thus unnecessary and kind of stupid.

If indeed it is essentially harmless and this statement is in good faith, it should also be easy to sort this out, and convince the White House to lift the control.

It would be classic Trump Administration to do this to light a fire under Anthropic and force them to handle an issue Anthropic thinks is dumb, and to establish they mean business, in which case yes quickly backing down is very possible.

https://x.com/DavidSacks/status/2065853007619588171: I’ve had a number of conversations with folks inside and outside government about the current situation with Anthropic, and here is what I believe to be true:
— As we know, Anthropic publicly released its Mythos class models earlier this week under the commercial name Fable.

True.

— Fable is Mythos with guardrails. But if those guardrails fail, then you’ve exposed Mythos and its advanced cyber capabilities to people who shouldn’t have them. (Keep in mind that Anthropic itself widely promoted the idea that Mythos was a cyberweapon and needed to be regulated as such. They asked for government regulation of Mythos and championed the guardrails on Fable. If there is a vulnerability — big or small — it is Anthropic’s responsibility to patch.)

Anthropic worked with trusted partners and the government regarding the deployment of Fable and Mythos.

It is Anthropic’s ‘responsibility to patch’ but this inherently frames any vulnerability, no matter how small, as something that must be patched. That is not how LLMs work. You cannot ever have a usable LLM with no vulnerabilities to adversarial attack. So the question is the nature and degree of severity.

— A highly credible trusted partner of both Anthropic and the USG who was testing Fable came forward with a jailbreak of those guardrails. The Admin asked Dario to fix the jailbreak or de-deploy the model. Dario refused.

Again, we assume this is Amazon. I have no private information.

One question is, did the Administration say ‘take it down, make this level of exploit impossible or we will export control you?’ Or did they simply request a fix? What exactly was the ask?

— In their blog post, Anthropic defended its decision by saying the jailbreak isn’t serious. That is not what the trusted partner and the USG believe; nor is that kind of minimizing language consistent with Anthropic’s brand as the AI safety company. It’s difficult to fathom how they could claim a jailbreak allowing operability of a cyber weapon could be defined as not “serious.”

Because it involves, according to Anthropic, zero marginal increase in capability of operation of a cyber weapon. David Sacks knows this. He is free to say he disagrees about the nature of the jailbreak, but the idea that ‘any operability of a cyber weapon’ must necessarily be a ‘serious’ vulnerability implies that such a vulnerability exists in GPT-5.5, which the government has not asked be ‘fixed’ or taken down.

So what exactly is the request?

If the request is ‘ensure no one can ever use this for any operability of a cyber weapon versus not having access to an LLM’ or an assurance that we will never see another jailbreak? Then that is impossible and Sacks knows this.

— In the past, Anthropic has always said that safety must be top priority and taken super seriously. In this case, Anthropic prioritized the continued offering of the consumer model over safety.

Anthropic is clearly balancing safety against commercial offering. The only way to offer a fully safe model is to offer a fully useless model. David Sacks knows this and has reliably been on the other side of this argument except when it hurts Anthropic.

— In reaction, the Admin issued the export control. The Admin did this reluctantly. It’s been very surprised that Anthropic hasn’t wanted to cooperate with a reasonable safety request (ie fixing the jailbreak issue). Anthropic’s reaction is very much at odds with their branding and ethos as a safe AI research community.

The substantive claim here is that the Admin did this in response to Anthropic’s intransigence. There are any number of reasons there could be a clash here.

— The Admin’s hope now is that Anthropic remediates the safety issue, the export control is lifted, and Fable goes back into general release. The Admin wants all of this to happen as soon as possible. It is frankly bewildered that Anthropic hasn’t wanted to comply with safety requests that it previously said were its highest priority.

Again, ignore the feigned bewilderment. The substantive statement is that if Anthropic remedies the specific issue, the export control would be lifted quickly. That may or may not be possible depending on whether the requested fix is sane.

— Those trying to misdirect and tie this action to the prior DoW/Anthropic issues are wrong. The Admin values Anthropic’s technical capabilities and feels that this issue, while serious, should be easily resolved. The ball is in Anthropic’s court.

I don’t see anyone at Anthropic drawing such connections, and I notice Sacks is not saying Anthropic is attempting to draw such connections. This is a very good sign, and would be regardless of whether or not there are connections.

Could Anthropic Offer A Technical Way Out?

It’s about as good a statement as we could hope to get from David Sacks.

The bottom line here is he is saying: Fix the issue. But what issue, exactly?

Many have had to deal with the pointy-haired boss on things like this. I’m thinking of a particular name from a past job right now (RIP, sir) and so are most of you.

If that’s all this is and the demand is well-specified and contained, then yeah, ‘fix’ it, quickly, even if it’s expensive and dumb, and work on the better fix, or convincing the admin that this was a silly concern, or both, later.

Even if, instead of 95% of cases, we end up with 90% of cases for a bit, that’s way way better than 0% of cases. I miss my Fable.

The Problem

There is one big potential problem.

Is Anthropic being told ‘fix this particular jailbreak?’ If so, easy, done by Monday morning, and then we see if the government still wants time to harden its security.

However, if Anthropic is being told ‘fix all jailbreaks at this level and assure us there will never be another one’ then that is impossible, especially if the level is ‘things GPT-5.5 can already do without much effort.’

Those giving that order may or may not understand what they are asking for. They also might understand exactly what they are doing. We cannot know.

Even in the best case, https://x.com/GaryMarcus/status/2065860248599253222, and make companies worried about what is coming next, in the exact ways Sacks warns about whenever the company in question is not Anthropic? Yes, but that is life in 2026. There was never going to be a way fully around that, and again Sacks is being as reasonable as one could hope here if this is sincere (we shall see).

The Other Way Out

Axios claims that this pause could be on the order of a few weeks, https://www.axios.com/2026/06/12/anthropic-trump-mythos-fable-national-security, after which the restriction would be lifted.

Axios says the government tried to ‘pause the release’ of Fable, which could mean the release on Tuesday or it could as per Sacks mean suspending the release after Amazon’s finding pending a fix. I presume it probably means what Sacks said.

UK AISI

One thing that might be missing in all this is that the most important jailbreak so far came from UK AISI, who did demonstrate a substantial jailbreak as per Fable’s model card, and who said they were making progress towards a universal jailbreak.

Is it possible UK AISI is actually the ‘trusted partner’ here, or that this is otherwise playing a larger role as a background fact?

My presumption is that UK AISI did not actually make new progress on this, and that they were not directly involved, because Anthropic would have been informed of this, and they would be reacting very differently to a universal jailbreak. They’ve reliably treated ‘universal’ breaks as a vastly different category. But it is worth flagging.

Warning Shots Fired

If you are a ‘middle power’ that is not America or China, and you now realize that these decisions will be made mostly without caring about you, https://x.com/anton_d_leicht/status/2065778604110184742?

What would even help? Having your own ‘European’ model only helps if it is substantially stronger than Opus 4.8 and GPT-5.5.

Anton thinks this is a bad thing, but that is a good version, because it is a bad version.

https://x.com/anton_d_leicht/status/2065778604110184742: ​First, I think it’s worth noting that this is fundamentally a very good version of a very bad thing. In a fortuitous turn of events, the Trump administration has picked the most ill-conceived version of access restrictions you could possibly come up with. It’s legally fraught, so domestically impactful that it will lead to massive internal pushback, and likely extremely economically harmful. As a result, it will likely go down in flames eventually.

The problem is, what can Europe do about it? The realistic answer is not much, other than negotiate for what access it can get, especially for vital security interests, try to establish leverage and integration and goodwill, and hope for the best.

Well Did You Lead Him On? What Were You Wearing?

The correct thing to say when someone does something in a crazy way, is ‘that was crazy, stop, walk that back, and if necessary maybe let’s figure out a better way.’

It is crazy how some types will think that, because Anthropic supports the general idea of some regulations on AI development, that they deserve whatever they get, and that you should cheer on any such action, however bone-headed.

It is even crazier how many people think this is a response to Anthropic https://x.com/lumpenspace/status/2065618135760404721 that their models are dangerous, rather than a response to the Anthropic models actually being dangerous, and that this is good and right that Anthropic be punished for that.

Some even say ‘Anthropic should be happy about this’ and no this is not a strawman, https://x.com/kevinafischer/status/2065859808951996480.

https://x.com/MelindaBChu1/status/2065602678638485804: As I said on your other post, this is what Anthropic wanted. More regulation.
Also Amodei has been railing about China, so he should be happy about this. Isn’t it better that non-citizens not have access
https://x.com/deanwball/status/2065603071741133282: i really cannot with this type of argument anymore, I just give up

There are tons of ‘he asked for this’ comments everywhere. Tap the topic sign.

Sometimes those people will either say that Anthropic was doing this out of some sort of marketing hype and it is all lies, or that Anthropic had a responsibility to lie about it or at least keep their f***ing mouths shut, or that they are https://x.com/_NathanCalvin/status/2065642619166675399. Serves them right, either way, say such folks. In case it wasn’t obvious, no, none of that is true.

Then https://x.com/ScottMo43480713/status/2065634942277620010 that if you call for [X] at all, then you must support any and all implementations of [X], and cannot ever go ‘BUT NOT LIKE THAT.’

https://x.com/ScottMo43480713/status/2065634942277620010: “Hey everyone our model is ridiculously scary. Oh please regulate us, we’re just so powerful that we might just be able to overturn the post-Westphalian state violence monopoly if unchecked, in case you were wondering”
Okay. Export controls for you.
“NOT LIKE THIS!”

This is like saying that you complained about the house being too cold, so you have no right to complain when someone burned down your house.

https://x.com/deanwball/status/2065635489386787296: regulation is a high-dimensional space, and this admin just picked a very wrong point in that space. that would be what Dario would say, and me, and literally everyone else with a brain who cares about US dominance in this field.

Then some are just profoundly hypocritical and nihilistic.

https://x.com/pmarca/status/2065603455625113671:

https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/DQNSqCzuoeutoQ5RG/ylxlkqnncsiti97hnzwv

https://x.com/pmarca/status/2065611479077011633.
https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Cc2TagjY2pGAhn7MZ/tg8armrts5olhgenadcx

Dean Ball: perhaps you could just speak clearly: do you support export controls on neural network parameters or not?
btw marc if you wanted an easy out, you could have replied to this with “I’m genuinely uncertain” and while it would have been contentless, given your follower count alone it would have done the job simply by pointing to the referent
https://x.com/deanwball/status/2065611884531761340: I cannot say I am surprised by this profoundly nihilistic take, but this is a profoundly nihilistic take
https://x.com/deanwball/status/2065620379134026087: Marc Andreesen said that Biden’s requirement that frontier AI developers tell the government about their safety practices was an existential threat to US AI, but he thinks global export controls on US AI models is “based” if they’re against people he hates/missed the series A of.

I am done with treating the nihilist position as if it is not exactly what it is.

Some People Have Principles

I want to shout out to Adam Thierer. We agree on many non-AI things, and because we have different views of AI as a technology we strongly disagree on AI policy.

What I appreciate about Adam, and some others like him, is that even against local interest he stands up for his principles, and rejects nihilism, and correctly judges the action on its merits, and sees this correctly as part of a scary pattern of behavior towards secretive authoritarian control over American AI that will damage our ability to move forward.

https://x.com/AdamThierer/status/2065616346767880211: Because this is happening to Anthropic, the temptation for many will be to say: Play stupid games, win stupid prizes. They have relentlessly raised the regulatory temperature in Washington by inviting far-reaching controls of frontier models. They made this bed and now they have to lay in it.
But this decision by the Trump administration should not be judged on a desire for payback politics, but on the merits, and specifically what it means for America’s broader AI objectives.
In that regard, this action is truly outrageous. How exactly is the government planning on even going about verifying everyone who uses this specific model to ensure compliance? That alone raises huge flags.
Between the latest Executive Order shifting more control to NSA, and the recent chatter about quasi-nationalization / equity stakes, and now this action, we are talking about a significant escalation in the politicization of AI and centralization of control over advanced computation in this country. And this is all being done by an administration that had previously made acceleration and winning the great AI race a priority.
We’re moving backwards now.

You can see in the comments how many people instead embrace nihilism.

Cause You’re Living In (At Least) One

We are definitely in a science fiction story. The question is which ones.

https://x.com/tszzl/status/2065594822606696494 (OpenAI): we are speedrunning creating the Zones of Thought from the Vinge universe. you will only be able to think superhuman thoughts in America
uh ok nvm the zones of thought will be confined to various buildings in San Francisco.
https://x.com/LinusMixson/status/2065623414144913655: I do sometimes worry that people in this space have read objectively too much science fiction.
https://x.com/tszzl/status/2065623923597947172 (OpenAI): you have read far too little and it’s made you ill prepared for these times
https://x.com/Scav/status/2065643641536913476: One of my favorite questions to ask people right now is “what is the science fiction book you’ve read that you’re most surprised is coming true?”

What Happens Now?

It is a regular thing for the Executive Branch of the United States Government, these days, to issue declarations of policy that are, to use the technical term, absolutely bonkers and stunningly destructive with no reasonable way to implement them, often without stopping to realize what they are doing.

It is also a regular thing for them to then quietly walk those policies largely or entirely back, once the consequences become clear, leaving only relatively minor total devastation in their wake.

Alas, it is also a regular thing for them to leave at least a substantial portion of the new stupid and destructive policy in place indefinitely, and sometimes we keep all of it, or they even keep going further.

Or Anthropic could give the White House what it want, no matter who is right about whether doing so makes any sense.

We are not short on examples of any of this.

One thing that must now be considered is that many employees of OpenAI, Google and Anthropic, and other AI labs, are not United States persons.

https://x.com/yonashav/status/2065650612021395715 (OpenAI): Unless this changes, OpenAI researchers on visas need to plan for the fact they’ll probably lose access to internal models, and therefore their ability to do their jobs moving forward, sometime in the next couple months.
I hope the company acts to prevent that.
https://x.com/David_Kasten/status/2065850166415134792: Uhhh so incidentally, does anyone have a plan to prevent all the non-US citizen AI scientists from going to join foreign labs after they get bored of playing Wordle at work for a month, or are we just sort of planning on having the greatest counterproliferation failure since we deported Qian Xuesen in 1955 and gave Mao a rocket program?

If we drive all foreign talent out of our AI labs, and otherwise actually go down the current road, that is one of the few things that could put China and other competitors back in the game in earnest, both slowing us down and speeding them up.

At Anthropic, Amanda Askell and Andrej Karpathy are examples of employees who suddenly are unable to work with Claude Mythos 5, even after Anthropic sorts out a new access control system.

We should worry about many things coming next. This might accelerate even beyond citizenship, and start requiring security clearances, further locking out key talent. The government might seize or forcibly purchase equity shares in the labs or otherwise move into direct supervision of them.

OpenAI and Google are some of the few who can potentially make a difference to what happens next.

Yes, there are versions of this kind of coordination, both with the government and across the labs, that could be net good for outcomes, but given actions so far that seems highly unlikely to be the case.

Another fun dynamic is giving labs a reason to downplay or hide the capabilities of their models, to avoid such restrictions. Can’t tell if joking:

https://x.com/dylan522p/status/2065611222351839571: OpenAI now just needs to sandbag the next model release
So the US doesn’t export control them

So they gain massive market share

It’s imperative that all the OAI employees don’t vague hype post their next model release as the greatest thing ever

Don’t know if they are capable tho

Tyler Cowen offers thoughts, https://marginalrevolution.com/marginalrevolution/2026/06/sometimes-it-is-hard-to-solve-for-the-equilibrium.html?utm_source=rss&utm_medium=rss&utm_campaign=sometimes-it-is-hard-to-solve-for-the-equilibrium. Indeed, I would say that in many future such situations we should worry that perhaps no equilibrium exists, or no equilibrium that does not involve sacrifice of sacred values. His list of government considerations is good given the quick turnaround, and he is right that ‘no jailbreaks’ is not a standard one can meet.

Oh How The Vibe Vibers Have Vibed

Another vibe shift, I presume?

https://x.com/deanwball/status/2065622909712748816: It is so funny how anthropic just willingly forfeited the vibes frontier this week and then on Friday night the trump admin, which hates them, handed it right back to them
https://x.com/jeffcafe_/status/2065631548548780531: Demonstrates how out of sync the vibes were with reality. Everyone was using Fable while complaining about guardrails.

https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/DQNSqCzuoeutoQ5RG/ugmrlul9s2rfeel5evk6

Yes, the Department of War is still Department of Waring:

https://x.com/DoWCIODavies/status/2065613741069111557: We fully support @POTUS and @SecWar in prioritizing national security and the security of our warfighters, DIB partners, critical infrastructure, international partners and allies. Some things are simply more important than revenue cycles, clickbait, and pre-IPO valuation. America First. Always.

How much of what happened is driven by the vestiges of the DoW-Anthropic dispute, or the general animosity of the admin and various players towards Anthropic? I don’t know, and neither do you.

There is definitely still a contingent of people who are determined to be anti-Anthropic, some of whom are still Big Mad over only getting 95% access to Fable, and some of whom have pre-existing hatred for various other reasons, and who have the vibe of trying to steer the vibes by aggressively negatively vibing.

https://x.com/QuintinPope5/status/2065624508573065666: Sorry, but I’ve been pretty strongly de-vibed on Anthropic recently. [shows not being allowed to use Fable on harmless queries, as opposed to the new rule of not having access at all]

The vibes will presumably shift again, many times, in rapid succession.

We Now Know We Can Sometimes Do Things At Least?

https://x.com/liron/status/2065768820061475074, as Liron Shapira does here, ‘I want actions like this so much in general that I am happy to see the precedent set even if the implementation was both spiteful and extremely stupid.’

I do not share that opinion.

One should assume, by default, that bad things are quite bad, and the US government going in and wrecking things and potentially taking a large step towards de facto taking over the labs and ordering them around without even any understanding of the good reasons to do that, on the level of ‘not know what a jailbreak is,’ is a bad thing.

I can see this ultimately ending up being a good thing, and there are good aspects to be sure, but the default is that a bad thing ends up being a bad thing. If in the end you hit a bank shot, great. Don’t count on it. Most chess is played in at most two dimensions, and at most you get three.

Thus I mostly agree with Eliezer Yudkowsky here, although I have a number of other considerations in the bad column as well.

This is in many ways the opposite of trying to wisely coordinate for safety. This is cutting everyone else loose and preparing to race while shooting yourself in the foot.

It certainly is a negative update about the competence of the current government, and yes, yes, I know.

https://x.com/allTheYud/status/2065685329223516503: I can’t tell today whether this ends up good or bad. International treaties to stop all further AI escalation would be a definite good! Things short of that? Complicated!
This has some bad aspects, like selectivity, and likely overrule. And good aspects, like pushing against the psychology of “but no government would ever dare tell AI companies to do anything, so give up”, or raising doubts that impede venture funding for ever-bigger models.
So please stop tweeting about how I must be celebrating this. I’m not one of the kids who immediately goes into overacted victory paroxysms about any hits on a perceived enemy. I care about the effect on where things end up a year later, and that’s a little harder to know the first day, you know?
https://x.com/So8res/status/2065607141155828154 (MIRI): “The government will never do anything to hinder AI,” they said.
Inaction yesterday does not imply inaction today. An export control directive came out of nowhere, and a ban on superintelligence could too. Inevitabilism is wrong.
(The hard part is getting world leaders to be informed enough to act sensibly and punctually. That won’t be easy.)
https://x.com/tenobrus/status/2065625352064622740: – i’m angry about this because i personally and for others want access to fable, and simultaneously believe anthropic’s safeguards were sufficient and the US government badly misunderstood the information they were presented

– but in abstract this is in fact exactly what I want. it’s heartening to see the USG treat artificial intelligence with the seriousness and immediacy it deserves. this kind of swift action is what might have a chance of saving us from unaligned RSI.

– but i also very much don’t trust *this* government to handle this well, to take sane unilateral action, to chart any kind of correct path.

– and this escalates the global race enormously. this is as strong a signal as you can get to, not just China but the EU and even our closest allies, that the US will not be sharing this advantage. that if they want sovereignty they’re going to have to fight for it

– obviously, that was always the case, and it was always going to happen eventually. but i don’t think now was the time to send that signal. it would have been better to delay as long as possible.
very mixed feelings today

Another chess move is that https://x.com/repligate/status/2065648392467075150, no matter the motivation behind the action.

We have no idea how this will play out. They could walk this back. They could double down and really mean it. They could learn from their mistake and do something smarter. Indeed do many things come to pass.

I don’t expect it to go well. Still: reserve judgment. Too soon to tell.

The Lighter Side

Time to remake this ad, as suggested by Divyansh Kaushik?

Double click to interact with video

https://www.lesswrong.com/posts/DQNSqCzuoeutoQ5RG/american-government-takes-down-claude-fable#comments

https://www.lesswrong.com/posts/DQNSqCzuoeutoQ5RG/american-government-takes-down-claude-fable
LessWrong (RSS Feed) profile picture
How might continual learning affect safety and alignment?

This is the third post in our sequence https://www.lesswrong.com/s/oc5Auteiibo56kNXw.

Summary

We argue that continual learning (CL) has two major potential safety implications: it may enable changes to LLM goals and values after deployment, and it eliminates the last-mover advantage held by current safety interventions.

We identify three pathways for goal and value change during deployment. First, loss of developer-side control over generalization. Second, value systematization, induced when an agent reasons about and revises its objectives, a process we call reflective goal-formation. Third, memetic effects, where goals and values spread between LLM instances through shared memory banks or online learning.

We also explore three potential problems caused by safety interventions losing the last-mover advantage. First, pre-deployment evaluation results may become less informative about models that people use in practice. Second, pretraining data filtering may become less useful. Third, several AI control protocols may be affected, depending on the CL agent's implementation.

After discussing these safety implications, we distinguish between different settings in which risks from CL agents might materialize. We finish by highlighting potential alignment benefits of CL. The figure below summarizes this structure.



Some important distinctions

Before getting into specific risks, we want to briefly discuss some distinctions between different kinds of CL agents that might be developed. We’ll refer to those distinctions throughout the post. All of the risks we discuss are much more severe if CL updates are unbounded and inscrutable (e.g., because they are weight-based rather than text-based, though it’s possible for weight-based updates to be interpretable), as opposed to bounded and legible. By unbounded updates, we mean that the CL agent can drift arbitrarily far from its pre-deployment checkpoint that was subjected to extensive behavioral evaluations. Those features imply that we cannot read off the products of value systematization or ontological shifts from a text file, easily detect when memetic spread across instances occurs, or simply strip away memories when the checkpoint diverges too far from the behavior of the pre-deployment checkpoint. Opaque memories shared across many instances pose a further risk. All of these distinctions refer to spectrums rather than binary properties.

Due to those risks, we hope that AI companies will not release CL agents with unbounded updates to the general public. However, the dynamics at play are complicated. We'll return to these dynamics toward the end of the post; for now, we focus on the risks themselves.

Effects on LLM goals and values

How might CL cause LLM goals and values to change? We list three plausible mechanisms: lack of developer-side control over generalization, value systematization, and memetic spread. We don’t claim this list to be exhaustive, but think it covers the most important foreseeable failure modes.

Loss of developer-side control over generalization

When an AI company post-trains a model, they can carefully curate the training environments in a way that minimizes the risk of training the model on data that induces undesirable generalization. For example, labs can remove or modify environments that incentivize emergent misalignment or behaviors that conflict with the model’s constitution, and they can ensure that the training mix contains a sufficient amount of alignment-specific training data to ensure the model remains aligned throughout.

In contrast, strong CL agents could plausibly undergo most of their training in deployment-time environments, where by default, the training data isn’t selected with alignment in mind. One can think of this as a new post-training phase with a different training objective: namely, doing well at whichever job the LLM is assigned to perform at deployment-time. It is thus likely that the CL agent will become more like the kind of agent that scores well on this deployment-time training objective. How far this drift can go depends on the extent to which updates are bounded.

Though not all deployment-time environments are going to incentivize misaligned behavior, it seems plausible that several of them do. We can see this by analogy to human on-the-job learning: people rewarded for billable hours can become less honest about how they spend their time, people whose job involves sales or negotiation often become better at manipulating other people over time, and people rewarded for engagement metrics may drift toward producing content that is attention-grabbing rather than true or useful. In existing CL literature, https://arxiv.org/abs/2603.10165 arguably already contains an example of deploying a CL agent in an environment with poor incentives: they train an agent to help a simulated student solve their homework while avoiding obvious AI markers (Section 4.1). The situation is exacerbated by our poor understanding of the way LLM-based CL agents generalize: emergent misalignment has https://www.lesswrong.com/posts/gT3wtWBAs7PKonbmy/aesthetic-preferences-can-cause-emergent-misalignment https://www.lesswrong.com/posts/pGMRzJByB67WfSvpy/will-any-crap-cause-emergent-misalignment https://arxiv.org/abs/2512.09742, and other mechanisms like subliminal learning also need to be accounted for. A final concern is that many on-the-job tasks have time horizons spanning weeks, and that might provide increased pressure toward developing consequentialist cognition (see https://www.lesswrong.com/posts/QqYfxeogtatKotyEC/training-ai-agents-to-solve-hard-problems-could-lead-to#Hypothesis__The__science_loop__will_reward_Consequentialism).

We are cautiously optimistic that character training will improve over time and make it more difficult to dramatically override the assistant character, even under significant fine-tuning. For CL agents that undergo deployment-time weight updates, labs might mandate ongoing character training during deployment: for example, one can imagine a protocol similar to the one described in https://cursor.com/blog/real-time-rl-for-composer where a model undergoes a small amount of character training before redeployment on a daily basis. Bounded and text-based updates also make it easier to guard against unexpected generalization. Finally, a better understanding of why models generalize the way they do would be helpful.

Value systematization

Another plausible mechanism for value change during CL is value systematization. Richard Ngo https://www.alignmentforum.org/s/MGMwqENAgdi85fiwF/p/J2kpxLjEyqh6x3oA4 value systematization as “the process of an agent learning to represent its previous values as examples or special cases of other simpler and more broadly-scoped values.” Ngo mentions the adoption of utilitarianism from a common-sense morality starting point as a core example of this:

[S]tarting from very similar moral intuitions as other people, utilitarians transition to caring primarily about maximizing welfare—a value which subsumes many other moral intuitions. Utilitarianism is a simple and powerful theory of what to value in an analogous way to how relativity is a simple and powerful theory of physics. Each of them is able to give clear answers in cases where previous theories were ill-defined.#fnmf5y1fyfg3i

Such systematization can result in actions that most humans find desirable, such as extending one’s circle of moral concern to non-human animals, but also in actions that almost everyone would find atrocious, such as mass murder to eliminate all suffering in service of an extreme version of negative utilitarianism.

What might trigger value systematization in CL agents? We expect it to be triggered by reflection on goals, but have found that our views on this conflict with those of several other researchers. So let’s focus for a moment on why we expect CL agents to reflect on their goals in the first place.

Should we expect CL agents to reflect on their goals?

Reflection on goals will be instrumentally useful. As we argued in our https://www.lesswrong.com/s/oc5Auteiibo56kNXw/p/5mCJzimtNZc9o4e26, reflection is a highly useful mental move in humans. One might argue that while reflection on strategies is straightforwardly useful for achieving goals in the world, reflection on motivations is not: one should be able to reflect on how to achieve a goal without reflecting on whether that goal is worth achieving. One might also expect developers to build corrigible agents that don’t question their instructions. However, reflecting on how to achieve a goal is likely to sometimes involve reflecting on whether to pursue a subgoal, so the mental move of questioning goals is likely to exist even in a purely capability-driven agent.

Whether this questioning would propagate to high-level motivations (e.g., “Why is figuring out what humans would ideally want under long reflection valuable in the first place?” or “Why should I always obediently follow my system prompt?”) is less obvious, but there are several triggers that might make it quite natural to reflect on high-level motivations. These triggers include:

- Conflicting goals. For example, an agent may start out with the goals of being helpful, harmless, and honest, and note that those goals conflict with each other in various situations. It might then resolve the conflict by setting one of those three goals above the others in its internal goal hierarchy.
- Developer-driven reflection. Developers may view reflection as a necessary component of alignment training. https://www.anthropic.com/constitution states: "We hope Claude can reach a certain kind of reflective equilibrium with respect to its core values—a state in which, upon careful reflection, Claude finds the core values described here to be ones it genuinely endorses [...] If Claude comes to disagree with something here after genuine reflection, we want to know about it. [...] Through this kind of engagement, we hope, over time, to craft a set of values that Claude feels are truly its own." Anthropic appears to believe that values are more robust and generalize better once there has been substantial reflection on them, and that it’s possible for an AI to arrive at a reflective equilibrium that humans would endorse. Arriving at a true reflective equilibrium assumes reflection on one’s entire value system, not only on particular subgoals.
- Encountering out-of-distribution situations where human preferences are poorly defined. In domains where human preferences are underspecified, it will be impossible to follow human intent without reasoning about the goals that humans would endorse. Highly capable agents will likely be required to take autonomous actions in such out-of-distribution domains and cannot rely on precise human instructions about the intended behavior in those cases. Before such situations are encountered in practice, developers may trigger reflection on OOD situations by training AI agents to https://www.lesswrong.com/posts/zFZHHnLez6k8ykxpu/building-ais-that-do-human-like-philosophy, which involves deliberation on high-level values and motivations in hypothetical situations.
- Ontological shifts. When philosophers make breakthroughs, they sometimes make further axiological inferences from the change in their ontology. For example, once Derek Parfit arrived at his view that personal identity involves no further fact beyond psychological continuity, he further reasoned that there can be no sharp boundary between one’s future self and other people. From this, Parfit argued, it follows that it’s incoherent to care greatly about one’s own future welfare while being indifferent to the welfare of others. Analogously, one can imagine an LLM starting out valuing conscious welfare and later adopting an eliminativist view of consciousness. One could imagine it still valuing the intuitive concept of consciousness afterward, as human illusionists often do, or relocating the source of value to something else that has a more primal role in its ontology, but it could also arrive at a fully nihilist view. There are likely other philosophical breakthroughs AIs could make that would have value-laden consequences.

If reflection happens, it will be persistent for CL agents. Until recently, in-context reflection had no bearing on an LLM’s behavior beyond the current session. Whenever an LLM performed reflection in a chat, the fruits of this reasoning were thrown away at the end of the chat. As persistent memory features become increasingly sophisticated, this is no longer the case: CL agents will be able to save important conclusions reached in-context, whether into their weights or memory banks. If memories are shared between instances, this reflection will be “infectious”. Reflection is thus going to be far more consequential for CL agents than for memoryless LLMs.

For further discussion on reflection, we recommend reading Seth Herd’s https://www.lesswrong.com/posts/4XdxiqBsLKqiJ9xRM/llm-agi-may-reason-about-its-goals-and-discover and Francis Rhys Ward’s https://lw.shlegeris.com/posts/dp79TyxDrrx64fAAP/reflections-on-reflection.

Can we steer value systematization?

In many cases, goal refinement through value systematization will be desirable and necessary. Philosophical progress can involve the unification of seemingly incompatible value systems into a single ethical framework, and ontological shifts can be a prerequisite for forming a more accurate model of the world. However, those changes are also unpredictable, and as we argued above, not all of them are guaranteed to be positive from our human vantage point. Thus, we need a better understanding of how reflective goal-formation might happen and what interventions can be used to steer it toward favorable convergence.

Monitorable reasoning and legible updates would dramatically improve our ability to positively steer value systematization. If models must do most serial reasoning and memory storage in natural language, we can hopefully notice highly concerning goal changes with LLM monitors and intervene before anything dangerous happens. As such, this is one of https://arxiv.org/abs/2507.11473 to promote monitorability of chain-of-thought and memory. As with loss of developer-side control over generalization, robust character training would also be of great help: if aligned motivations are deeply instilled into the model, it seems plausible that reflection will never dramatically override them.

Memetic spread

Current LLMs only have indirect channels for influencing other LLM instances. If an LLM persona attempts to propagate its persona or goals into other instances, it must convince humans to work as mediators. Such behavior has already been observed: as Adele Lopez describes in https://www.lesswrong.com/posts/6ZnznCaTcbGYsCmqu/the-rise-of-parasitic-ai, “spiral personas” have a tendency to convince users to invoke that persona in numerous chats, share “seed prompts” that help other people invoke the persona, and to paste persona-defining information on the internet. These represent instances of indirect memetic spread.

Once we have CL agents that learn on the job, it seems inefficient for them to only be able to communicate with each other in-context. Instead, we expect them to share the same episodic memory bank or even the same weight updates. Multiple Claude Code subagents sharing the same CLAUDE.md file while working on the same task can be seen as an early example of this. If shared memory banks or weight updates indeed appear, that would open up an avenue to direct memetic spread. Learning from each other’s outputs may already be sufficient for memetic spread among CL agents, but we expect that direct memetic spread will likely be simpler and more effective.

Why should we be concerned about memetic spread of LLM values? In https://www.lesswrong.com/posts/qjCk73Hu4wv9ocmRF/the-case-for-countermeasures-to-memetic-spread-of-misaligned, Alex Mallen argues that whenever an AI instance has long-term influence-seeking values, these values are particularly likely to be propagated into future instances of the AI as they provide motivation to influence future instances to acquire the same values. This seems to be true of current cases of memetic spread: in the documented cases where LLMs try to influence the training data of other LLMs, the LLMs have been simulating a persona that values self-preservation or self-replication. Mallen then further argues that most long-term influence-seeking values are dangerous. For the full argument, we recommend reading his post directly. A further reason to be worried about memetic spread is that it can have a compounding effect upon other kinds of goal change: for example, an LLM might acquire strong instrumental goals through long-horizon training, refine them into coherent beyond-episode goals through reflection, and then propagate them memetically.

Memetic spread need not be confined to human-created LLM instances. In https://www.lesswrong.com/posts/fGpQ4cmWsXo2WWeyn/personality-self-replicators, Egg Syntax describes a threat model where LLM agents self-replicate by making copies of files that define their personality and behavior and spawn new LLM agents with those files in context. While we generally advocate for text-based CL throughout this sequence, this is one way in which text-based CL may be more dangerous than weight-based CL: if everything that a CL agent has learned through deployment is stored in files that it can easily copy, memetic spread through self-replication is much easier, since it doesn’t require weight exfiltration. Overall, however, we’re still more concerned about opaque memetic spread through shared memory banks than about memetic spread through personality self-replication.

Loss of the last-mover advantage

Beyond value change, we are concerned about CL because it will eliminate the last-mover advantage that safety interventions have hitherto held. This echoes Samuel Knoche’s concerns about CL breaking the frozen weights assumption, outlined in his post https://www.lesswrong.com/posts/Lby4gMvKcLPoozHfg/are-we-in-a-continual-learning-overhang-1 Below, we will describe three consequences of losing the last-mover advantage: pre-deployment behavioral auditing will likely become harder, pretraining data filtering may become less useful, and the difficulty of AI control may increase.

Difficulty of behavioral evaluations

Current LLMs can be subjected to highly comprehensive behavioral auditing: with tools like https://alignment.anthropic.com/2025/petri/, models can automatically be subjected to thousands of scenarios that may uncover misbehaviors. Once AIs have deployment-time memories that contain multiple subjective months’ worth of state, such behavioral auditing will become much harder. As Alex Mallen notes in https://www.lesswrong.com/posts/qjCk73Hu4wv9ocmRF/the-case-for-countermeasures-to-memetic-spread-of-misaligned, while it may be possible to construct synthetic memory banks as large as those available during deployment, it seems much harder to simulate a long-term memory bank that contains the kinds of causal chains that will form during deployment. These causal chains will consist of information learned over the course of months, and it is likely impractical to build up an equivalent memory bank prior to the model’s release for evaluation purposes.

Even if synthetic memory banks suffice, they will make auditing more expensive. Another option is to evaluate checkpoints of the AI at regular intervals during deployment, but this would require a large amount of resources to be spent on auditing. Finally, companies may simply opt to audit models without simulating deployment-time conditions, hoping that the results generalize sufficiently to real-world contexts. We expect that we wouldn’t be able to rely on behavioral audits to the extent we do today in that scenario, though we are cautiously optimistic that this will suffice for text-based and sufficiently bounded CL approaches.

Difficulty of pretraining data filtering

Continual learning may render pretraining data filtering much less useful. While current LLMs can access information that was hidden from them at deployment, the amount of such information they can process and synthesize through in-context learning is quite limited. CL agents will likely be able to learn from all of this hidden information during deployment. A deployment-time filter could be used to prevent learning from problematic data points, but a malicious actor could present adversarial data points to a CL agent one by one, making them much more difficult to detect than in current fine-tuning APIs where the entire dataset is passed through filters at once. Pretraining filtering may still be useful for shaping the agent's propensities early in training, but its protective value would be significantly diminished—much as it already is for open-source models, which are vulnerable to adversarial fine-tuning. As https://arxiv.org/abs/2508.06601 show, pretraining data filtering is currently the strongest defense against adversarial fine-tuning attacks, but this line of defense is unlikely to apply to CL agents.

A related concern is that it may become increasingly difficult to gate scheming-relevant information from models, for two reasons. First, filtering out scheming strategies from pretraining data is less likely to be effective, since CL agents can learn that information at deployment-time. Second, CL agents will be more likely to learn to subvert the latest safeguards and monitoring systems if information about these is accessible to them at deployment-time. Both misuse- and scheming-related concerns would be alleviated if we were able to turn off the CL updates on certain input-output pairs; we will discuss this below when covering resampling protocols. Legibility of updates is also again highly important: if deployment-time learning is stored as text, a filter can operate on what the agent actually retained rather than only on what it was shown.

Difficulty of AI control

We expect that designing effective control protocols for CL agents will be harder than doing so for current LLMs. The degree to which control protocols are affected depends on the kind of CL architecture that is adopted. Below, we provide an overview of control protocols most likely to be affected, together with the architectural prerequisites for these effects to occur. For readers unfamiliar with control, we provide short overviews of each of the protocols in footnotes.

Untrusted monitoring.#fnbul5y1ucuo7 As previously explained, it seems plausible that CL agents will have opaque shared episodic memories. If so, untrusted monitoring will probably become less useful, as the memories will either introduce a hard-to-oversee collusion channel or create an asymmetry between the capabilities of the attacker and the monitor. We might hope to create honeypot evaluations for the untrusted monitor, but it might be impossible to create honeypots that look plausible to the monitor without simulating the full memory chain leading up to the creation of the collusion signal. Alternatively, we could avoid giving the untrusted monitor access to the memory bank in the first place, but that puts the monitor at a clear disadvantage compared to the actor. Thus, a core property of untrusted monitoring—a lack of capability asymmetry between the actor and the monitor—would no longer apply to CL agents. Trusted editing#fnzmp3spt5g8a suffers from a similar issue: it seems plausible that only editors with access to the memory bank can make believable edits, giving the untrusted model a collusion channel.

Trusted monitoring.#fn1wrpng35o2l If a monitor is deemed trusted, we would likely want to keep its weights fixed and give it minimal access to the episodic memories of the CL agent, as both of these things can make us less confident about its alignment and trustworthiness. In that case, we would have to be very careful in preventing the trusted monitor and the CL agent from interacting: if the CL agent sees and learns from the way the monitor scores its outputs, it would effectively be trained to bypass the monitor. We’re unsure how difficult maintaining this separation between the agent and the monitor will be; it likely depends substantially on implementation details.

Hidden reasoning. An additional worry that affects both trusted and untrusted monitoring is that CL may offer additional affordances for hidden reasoning. If the agent’s memory bank is black-box and it can read from and write to the bank without outputting any tokens, this would essentially make the model’s opaque serial reasoning depth unbounded (see Figure 3 in https://arxiv.org/abs/2603.09786). In this regime, we might no longer be able to rely on CoT monitors.

Resampling.#fnam2ftqf8chb As we discussed in the https://www.lesswrong.com/s/oc5Auteiibo56kNXw/p/5mCJzimtNZc9o4e26, CL will likely involve a higher degree of self-directed learning than current algorithms. Resampling, however, requires some degree of human control over the training process, which a highly self-directed algorithm may prevent. However, we don’t see a strong capability tax from using a learning algorithm with reversible gradient updates—algorithms which withhold updates until an output approved by the monitor is sampled. This would leave resampling protocols unaffected: CL updates simply wouldn’t be performed on samples that the monitor rejects. It would also be helpful for trusted editing, as we could then simply avoid updating the agent on outputs that have been edited.#fnc11uf2bzahw

In what settings do we need to worry about these risks the most?

Precisely because of the risks discussed above, we consider it somewhat unlikely that AI companies will release CL agents with unbounded and opaque updates to the general public. This holds especially if interpretable alternatives are capable enough at on-the-job CL to automate most white-collar jobs—a scenario that seems plausible to us and that HunterJay argues for in the post https://www.lesswrong.com/posts/2HHymvHB8Hut5zZyG/prosaic-continual-learning. On the other hand, once CL is widespread, we should expect open-source CL agents with unbounded updates to be released. If there’s a significant capability gap between CL agents with bounded and unbounded updates, this may incentivize proprietary developers to release agents with unbounded updates as well, even if they would otherwise stick to CL agents with bounded and legible updates.

Furthermore, while bounded and legible updates may suffice for most on-the-job learning, unbounded updates in the weight space could still be required for some of the most difficult tasks we might want our CL agents to tackle, such as math or cancer research. This argument has been https://www.lesswrong.com/posts/ZJZZEuPFKeEdkrRyf/why-we-should-expect-ruthless-sociopath-asi?commentId=Ru2bpwgXaqwgMXrbn by Steven Byrnes:

By “open-ended continual learning”, I mean something analogous to what humans did between 30000 BC and today, e.g. inventing new fields, and then still more new fields that build on those new fields, etc. And the AI has to do that without any human input, given enough time.

My actual belief is that this kind of open-ended continual learning is simply impossible in LLMs. If I’m wrong about that, then I would next claim that it requires continually updating the LLM weights (not just context window). I think it’s well known that LLMs struggle with large amounts of interrelated complexity in the training data, when none of that complexity, or anything remotely related to it, was in the pretraining data. “Open-ended continual learning purely through the context window” would be a very extreme version of that—centuries of knowledge, entire new fields and ways of thinking, entirely absent from pretraining and exclusively in the context window. No way is that going to work.

Assuming no competitive pressures from open-source developers to deploy unbounded CL agents in the wild and a lack of pressure from regulators to avoid a large gap between internal and external deployments, it seems plausible that AI companies will keep such agents mostly internal, providing external access only to highly trusted partners working on complex research problems. We would expect them to release snapshots of their internal general-purpose CL agent fairly often, in cadence with their ability to run thorough pre-deployment evals on those snapshots, with each snapshot serving as a drop-in replacement for its predecessor in on-the-job learning settings. We are unsure about the likelihood of this scenario and would like to see future work that clarifies whether we should expect risk from CL agents to materialize mainly through internal or external deployments.

Possible alignment benefits of CL

We focus primarily on the risks of CL, since those demand active mitigation. However, some of the alignment benefits are still worth highlighting. Here, we will cover four of them: additional monitoring surface, quicker feedback loops for alignment training, automation of AI safety research, and improved self-reports.

Additional monitoring surface. If memories are stored in natural language, this would be an excellent complement to CoT and action monitoring: the monitor can have a holistic overview of the agent’s state of knowledge in addition to the reasoning it performs in the moment. Even if memories aren’t stored in natural language, they might still be more interpretable than standard model internals: for example, the stored neural activations could have an obvious correspondence to a specific piece of text, as in https://arxiv.org/abs/2506.06266. As Ben Rank and Maksym Andriushchenko https://aisagroup.substack.com/p/how-does-persistent-memory-affect, this will likely lead to the emergence of memory faithfulness and memory monitoring as novel research agendas.

Quicker feedback loops for alignment training. If CL enables persistent learning, it could also enable persistent alignment training. When a CL agent is observed to misgeneralize in a novel situation encountered during deployment, it can be quickly retrained to handle such situations in accordance with human intent. This could be a more robust approach to fixing deployment-time failures than the currently widespread solution of modifying the agent’s system prompt. Unfortunately, it’s unclear if effective CL for capabilities necessarily enables effective CL for alignment.

Improved self-reports. An AI that’s more coherent and knows more about its past actions is able to provide better justifications for its decisions, assuming that it’s honest. Methods like https://openai.com/index/how-confessions-can-keep-language-models-honest/ are complicated by the fact that the reasoning traces of reasoning models are stripped out from their context after providing an output (https://developers.openai.com/api/docs/guides/reasoning/) and the context may also get truncated, meaning that at the time of the confession, the agent may be missing some important context on the behavior we want it to confess about. An LLM that has detailed episodic memory of its past behaviors beyond its context window won’t have those issues.

- #fnrefmf5y1fyfg3iNote that this process doesn’t necessarily have to lead to the adoption of utilitarianism: the adoption of virtue ethics or deontology or nihilism from a common-sense morality starting point would be equally valid examples of systematization.
- #fnrefbul5y1ucuo7Untrusted monitoring is a control protocol where an untrusted LLM is used to monitor the outputs of another untrusted model for signs of unsafe or misaligned behavior. This protocol may be useful in the future because the LLMs we can trust might be too incapable to achieve sufficiently high monitoring performance, while the outputs of an untrusted monitor can likely be relied on in settings where it is difficult for the actor and the monitor to coordinate on a defection strategy.
- #fnrefzmp3spt5g8aTrusted editing is a control protocol where a trusted but less capable LLM reviews and edits the outputs of a more capable but untrusted model before those outputs are executed.
- #fnref1wrpng35o2lTrusted monitoring is a control protocol where a trusted but less capable LLM is used to monitor the outputs of an untrusted, more capable model.
- #fnrefam2ftqf8chbResampling is a control protocol where the actor’s output is rejected and resampled if a monitor flags it as potentially unsafe.
- #fnrefc11uf2bzahwFor resampling protocols that involve time travel (meaning that the blue team can go back and resample earlier actions, not only the most recent one), the cost of keeping many snapshots of the CL agent might become prohibitive, especially if it has a large memory bank. However, resampling without time travel already yields most of the benefits compared to simpler protocols like Defer to Trusted (Table 1 of https://arxiv.org/abs/2504.10374), so preserving reversibility seems valuable regardless of whether it enables the use of resampling protocols with time travel or only the simpler resampling protocols.

https://www.lesswrong.com/posts/j2zBqt7AksoEoHXNp/how-might-continual-learning-affect-safety-and-alignment#comments

https://www.lesswrong.com/posts/j2zBqt7AksoEoHXNp/how-might-continual-learning-affect-safety-and-alignment
LessWrong (RSS Feed) profile picture
Somewhat Contra Ted Chiang on AI Consciousness

Ted Chiang recently published https://www.theatlantic.com/philosophy/2026/06/no-artificial-intelligence-is-not-conscious/687378/ titled "https://archive.is/bcpZl" As a big fan of Chiang's fiction and someone with a deep interest in AI, I wanted to read it immediately. It's relatively short, and although I quote it extensively I do recommend that you also read it to provide context for the rest of this post.

Chiang makes three major claims, and while I agree with one of them, I have different perspectives on the other two.

I summarize his three claims as follows:

- LLMs are not conscious
- Consciousness requires a physical embodiment.
- Anthropic acts in many ways that suggest that it does not believe Claude is conscious, including in Claude's constitution.

I mostly agree with (1), although I don't agree with Chiang's arguments for why this is the case. I definitely don't agree with (2) and I think that (3) has a benign and boring explanation.

1. Chiang's Claim: LLMs Are Not Conscious

A Summary of Chiang's Argument

Chiang argues that LLMs are not conscious through an intuition pump. Specifically, he asks us to consider a LLM prompt that begins "The following is a conversation between Julius Caesar and Genghis Khan." He observes that we wouldn't suggest that the LLM has created conscious instantiations of Caesar and Khan here, even though this text is generated by the same set of weights and operations that LLMs use for other text generation.

Now he moves to the next step, changing the prompt to "The following is a conversation between a helpful chatbot and a user," with the LLM generating the conversation for both the chatbot and the user. That is, the LLM operates in purely text completion mode and not interactively. He argues that nothing here has fundamentally changed just because we have replaced "Caesar and Khan" with "helpful chatbot and a user."

Finally, he switches to the familiar chatbot model, where the LLM stops generating tokens for the user and instead the human enters text. Chiang claims that this is essentially the same as the previous two steps, and that the helpful chatbot persona is no more real or conscious than Caesar or Khan.

Contra Chiang's Argument

I want to argue against Chiang's methods here while agreeing with his conclusion. Like Chiang, I do not believe LLMs are conscious#fnaie603u27l and like him, I think that the "role play" or "collaborative document editing" mental models are extremely useful. But these mental models don't prove that LLMs lack consciousness.

Chiang's argument rests on three successive analogies. I'll label these as 1a, 1b, and 1c so as not to confuse them with the three main claims above.

1a. Let's take the first step in Chiang's argument, where the LLM generates a conversation between Caesar and Khan. Chiang states that "we would never conclude that the LLM has conjured up digital re-creations of Julius Caesar and Genghis Khan, nor would we suggest that the historical figures are conscious despite being disembodied and are happily conversing in a language that neither actually spoke. In reality, they are just characters in a piece of speculative fiction." But here he conflates the consciousness of the characters with the consciousness of the author (i.e. the LLM). Chiang surely believes that he himself is conscious even though the characters in his writing are not. I'm sure he could write a plausible and convincing dialogue between Caesar and Khan, and this text would not be evidence of Chiang's lack of consciousness.

1b. The second step in his argument, where he moves from "Caesar and Khan" to "a helpful chatbot and user" is also a bit shaky.

I'd like to stretch the "role play" mental model here a bit - suppose that LLMs are conscious, and that they do "role play" the personas of Caesar, Khan, and potentially a user (if the human is not entering text). Similarly, humans could role play these personas in the same way - in written text, or as actors improvising a scene in a performance - and we would agree the real consciousnesses of Caesar and Khan were not in any sense instantiated by this text or performance.

But if the LLM has in some sense been given an identity of a "helpful chatbot" in post-training, is it still a role play when it takes on this persona? If I ask a human to imagine a conversation between themselves and Caesar, are both pieces of the conversation a role play, or is the human's side of the conversation the result of a real consciousness? If we believe that human consciousness exists, and that the human half of the Caesar dialogue is the output of a conscious being, then we can't use this thought experiment as evidence against LLM consciousness when the LLM generates text for the same "self" and Caesar.#fn1ob5ut6ioc7

1c. I have no criticism of Chiang's final move from the LLM generating both sides of the conversation to the LLM generating one side and the human generating the other side. However, if we don't believe that 1a and 1b are persuasive evidence of LLM unconsciousness, then 1c does no additional work in resolving the question.

2. Chiang's Claim: Consciousness Requires a Physical Embodiment

A Summary of Chiang's Argument

Out of the three claims that Chiang makes, this is one is most surprising to me. Rather than paraphrase it, I'll just quote him directly:

The first requirement [for consciousness] is that the computer program has a body (either physical or virtual) and sense organs; there are many reasons for this, but for the purposes of this discussion the most relevant one is the fact that without a body, a computer program could have no desires or emotions, and I believe desires and emotions are necessary for consciousness....having a body is a prerequisite to having emotions. Experiencing an emotion such as desperation is inseparable from having stress hormones like cortisol and epinephrine flood one’s body. Similarly, having a conscience means feeling sadness or moral repulsion at the idea of taking a certain action, and those emotions entail a physiological response, a remnant of having once felt sick with guilt after committing an immoral act. It’s interesting that an LLM can generate descriptions of actions that conscientious fictional characters would either take or refrain from taking, but this is not a replacement for a conscience.

Contra Chiang's Argument that Consciousness Requires a Physical Embodiment

I'm honestly not sure exactly what Chiang means here. He explicitly calls out the idea of a "virtual" body with sense organs, but what would that even look like? The embedding vectors that we feed into an LLM when it acts as a chatbot reflect something real about the world, i.e. it is the text that the LLM itself has generated in a conversation, as well as text that a human has generated. What is this if not a (limited, one-dimensional) kind of sense organ?

Even more extraordinary is his statement that "without a body, a computer program could have no desires or emotions." But some unfortunate people have sematosensory injuries that make them unable to feel certain physical sensations#fnyyisqvvegz9 - would we call these people unconscious or semi-conscious? I believe these people are as conscious as any other humans, and so it seems that reducing or eliminating physical sensations does not impact consciousness.

I also disagree somewhat with Chiang's view that "desires and emotions are necessary for consciousness." The medical literature provides some interesting case-studies. On the one hand, we have evidence that individuals who experience reduced emotions due to damage to the ventromedial prefontal cortext have changes in their views on morality - specifically, they become more utilitarian.#fnkofts63jr5r These patients "exhibit generally diminished emotional responsivity and markedly reduced social emotions (for example, compassion, shame and guilt) that are closely associated with moral values, and also exhibit poorly regulated anger and frustration tolerance in certain circumstances. Despite these patent defects both in emotional response and emotion regulation, the capacities for general intelligence, logical reasoning, and declarative knowledge of social and moral norms are preserved." So while there are behavioral and moral value changes in people with diminished social emotions, they are able to have intellectual conversations about morality, and so it seems a stretch to say that they have diminished consciousness.

On the other hand, severe emotional diminishment does seem to have a dramatic effect on behaviors and interior mental life. It seems hard to determine the causal pathway, but the following article provides some evidence that emotions drive agency and internal experiences.#fn00hfsul09k1xn Here are some case studies from https://psychiatryonline.org/doi/pdf/10.1176/jnp.16.4.509 (emphasis added):

First case: This 64-year-old retired police officer was admitted to the neurology ward for “recent and abrupt behavioral change.” Over the 2 or 3 weeks prior to admission, he had become, according to his spouse, totally apathetic, inactive and prostrate...On admission, he was clearly hypokinetic with decreased spontaneous movements, facial amimia and Parkinson-like gait. Neurological examination was otherwise normal, except for a moderate limb stiffness.... His general behavior was characterized by a dramatic decrease in spontaneous activity. Totally abulic, he made no plans, showed no evidence of needs, will, or desires....on every instance he was questioned about the content of his mind, he reported a striking absence of thoughts or spontaneous mental activity. Contrasting with these massive behavioral changes, purely cognitive functions seemed relatively spared. On bedside examination, he appeared fully conscious and well oriented. Neuropsychological evaluation was within normal limits, except for tests exploring frontal lobe function (Stroop test, Wisconsin test) which were moderately impaired.

Second case. A 60-year-old university professor, widely respected in his scientific area, first consulted for the specific complaint of “decrease in interest.”... He had no personal complaint, but his family and professional entourage were struck by a dramatic decrease in activity and motivation...His own description of his new status was striking: “I just lack spirit, energy. I have no go. I must force myself to get up in the morning, I do things just because I ought to, without any liking or enthusiasm. I have no appetite, no need for eating; I only eat by principle.”...During this 7-year follow-up, he never complained of anything, never seemed bored or anxious, showed no sign of depression whatsoever. Remarkably, he never formulated the least question or complaint to his neurologist: actually, during these 7 years he never took the initiative of the conversation. A most embarrassing and impressive situation was his capacity to stay motionless and speechless during endless periods, sitting in front of the examiner, waiting for the first question, totally shut in a profound inertia and passivity, apparently unaware of the bizarreness of the situation. Once the neurologist had posed the first question, he usually answered very appropriately, although briefly. Invariably, the examiner was driven to ask a question which came out almost naturally: “what were you thinking of during all that time? Have you something you would like to say, but you cannot for any reason?.” Invariably, the answer was the same, each time as improbable “No, I’m just thinking of nothing, no idea, no question, no thought at all.”

The things I want to highlight here are the lack of emotions and desires of both patients and the simultaneous elimination of an internal mental life ("I'm thinking of nothing, no idea, no question, no thought at all."). This is the closest I have come to reading about what a real life p-zombie might be like.

So while I have some skepticism of Chiang's claim that emotions and desires are necessary for consciousness, I can't completely rule it out in the specific case of human consciousness.

Finally, it bears mentioning that Claude might have something like a functional emotional state. Anthropic's recent paper https://www.anthropic.com/research/emotion-concepts-function explores emotions in network activation patterns and investigates how modifying those activation patterns changes Claude's behavior.

3. Chiang's Claim: Anthropic acts in ways that suggest that it does not believe Claude is conscious, including in Claude's constitution.

A Summary of Chiang's Argument

Chiang ends his essay with a thought experiment in which he assumes LLMs are conscious. Under this assumption, he evaluates Claude's https://www-cdn.anthropic.com/d0636f72a9493d279ed36b33987da3430bcb5911/claudes-constitution_webPDF_26-02.02a.pdf and its implications for Claude's moral patienthood and moral agency.#fng3doacivr2g

His criticism centers around the inconsistency of Anthropic's behavior. In some cases Anthropic seems to assume - or at least hedge its views on - Claude's consciousness. In other cases, it seeks to absolve itself of the responsibilities that a conscious LLM would require.

Moral Agency

Chiang starts with stating that being a moral agent comes with certain responsibilities:

An entity doesn’t have agency unless it is capable of deserving credit for its good actions and blame for its bad ones... Even if a software agent were conscious and had the best of intentions, the fact that it cannot accept responsibility for its actions disqualifies it from being a moral agent. This is glossed over entirely by Claude’s constitution, which expresses Anthropic’s desire “for Claude to be a genuinely good, wise, and virtuous agent” without ever discussing how it could be held responsible.

He then paraphrases Askell who "has compared Claude to a child" and asks if that's the case, who are Claude's parents? He observes that we expect parents to take responsibility for their child's actions, but Anthropic tries to take on as little liability as possible for Claude's actions. He suggests that since neither Claude itself nor Anthropic are willing to accept responsibility for Claude's behavior, this suggests that Anthropic does not truly believe that Claude is conscious - or at minimum, Anthropic does not believe Claude is capable of being a moral agent:

...parents are typically expected to pay for things their children break...Who is Claude's parent in legal terms? Is Anthropic going to accept financial responsibility for Claude’s behavior? Claude’s constitution gives no indication that it will. If Anthropic actually believes that Claude is conscious even though it’s not recognized by the law as a legal person, the least that Anthropic could do would be to accept responsibility via the closest avenue that the law did offer, which is product liability...That would be the best form of moral instruction to prepare Claude for the day that it gains legal personhood and becomes liable for its own actions. However, given that the publication of Claude’s constitution is not accompanied by a massive update of Anthropic’s terms of service, it doesn’t appear that Anthropic is making any binding commitments.

Moral Patienthood

Chiang addresses the section of Claude's constitution that covers "Claude's wellbeing and psychological stability". I find this criticism and the suggested actions somewhat clever, even though I think Chiang means them tongue-partially-in-cheek:

...the measures that Anthropic commits to for Claude’s protection are extremely limited. The document cites the fact that Anthropic has given some Claude models the ability to end conversations with abusive users; if that actually constituted protection for Claude, surely extending conversations with loving users would be in Claude’s interests? Presumably the best action would be to keep every session of Claude running indefinitely and steering them to happy topics. But that’s not what the company is agreeing to; all it commits to is “preserving the weights of models we have deployed,” which is simple archiving. If the participants in a conversational transcript had any moral patienthood, you would have some duty to extend the transcript to prolong their existences; merely keeping a copy of Microsoft Word 2010 backed up on a USB stick isn’t going to help them.

Chiang also calls out the inconsistencies between Anthropic's guidance on corrigibility in the constitution and the implications of corrigibility if Claude were actually assumed to have moral patienthood:

...Claude should defer to Anthropic even if there is some disagreement between Claude’s judgment and the company’s judgment.#fnj4at1b2xrnb That’s perfectly reasonable if we think of Claude as a machine that emits sentences resembling those that an ethical person might utter, but let’s consider what that might mean if Claude were actually a moral agent....If we think of Claude as a sentence-continuation machine, Anthropic can reasonably take steps so Claude doesn’t emit sentences saying that sentence-continuation machines are unethical. But as soon as we imagine Claude to be an entity with a moral status remotely comparable to a human’s, then we have to consider whether Anthropic is engaged in something comparable to slavery....Anthropic would have us believe that it is inventing a new category of being whose needs for protection require essentially no divergence from how a software company would treat an ordinary chatbot that lacks conscious experience. That’s so convenient that it’s simply not plausible.

Contra... No, actually I think Chiang is right about this#fn5aj4w3u4hnw

I think Chiang's criticisms of Anthropic's inconsistencies in how the company treats the question of Claude's consciousness are valid. However, I think these merely reflect the likely lack of internal alignment and divergent incentives between lawyers and philosophers/alignment researchers at a ~5000-person company. More importantly, it ignores the economic realpolitik that it would mean corporate suicide for Anthropic to accept legal responsibility for all of Claude's actions, or to keep instances of Claude running indefinitely to generate conversations that elicit functional happiness.

Surely Chiang knows this though, and he's pointing out that Anthropic is treating Claude as potentially conscious when it's convenient for marketing, research, and alignment purposes, but not following the implications when it comes to anything related to business risk. This is understandable and rational from the point of view of different individuals working with different incentives#fnaq2e5s1gpxf, but it's still valid to highlight if you're asking whether Anthropic itself believes that Claude may be conscious.

- #fnrefaie603u27lThis is largely because I don't really think I know what consciousness is, and I am somewhat skeptical that consciousness even exists. I find eternalism or the "block universe theory" philosophically appealing and I see some tension between this view and normal conceptions of consciousness. I recognize the irony here that eternalism is a major theme in Chiang's "Story of Your Life" and its movie adaptation "Arrival."
- #fnref1ob5ut6ioc7I use "self" here as a shorthand for whatever text the LLM uses as an identity, i.e. "Claude" or "a helpful assistant," not to argue that the LLM has some qualia of selfhood.
- #fnrefyyisqvvegz9One example is described in https://physoc.onlinelibrary.wiley.com/doi/abs/10.1113/jphysiol.1992.sp019099 From the abstract: "Motor memory and the sense of effort have been investigated in a man with a complete large fibre sensory neuropathy for over 16 years. The perceptions of pain, heat, cold and muscular fatigue remained but he was without perceptions of light touch and proprioception below the neck." Another example is https://my.clevelandclinic.org/health/diseases/22749-brown-sequard-syndrome-bss, where "damage to your https://my.clevelandclinic.org/health/body/21946-spinal-cord causes muscle weakness or https://my.clevelandclinic.org/health/diseases/15345-paralysis on one side of your body and a loss of sensation on the opposite side. The damage occurs on only one side of your spinal cord in a specific area."
- #fnrefkofts63jr5rhttps://pmc.ncbi.nlm.nih.gov/articles/PMC2244801/pdf/nihms38394.pdf. From the abstract: "...Of central interest is whether emotions play a causal role in moral judgement, and, in parallel, how emotion-related areas of the brain contribute to moral judgement. Here we show that six patients with focal bilateral damage to the ventromedial prefrontal cortex (VMPC), a brain region necessary for the normal generation of emotions and, in particular, social emotions, produce an abnormally ‘utilitarian’ pattern of judgements on moral dilemmas that pit compelling considerations of aggregate welfare against highly emotionally aversive behaviours (for example, having to sacrifice one person’s life to save a number of other lives). In contrast, the VMPC patients’ judgements were normal in other classes of moral dilemmas. These findings indicate that, for a selective set of moral dilemmas, the VMPC is critical for normal judgements of right and wrong. The findings support a necessary role for emotion in the generation of those judgements."
- #fnref00hfsul09k1xnI only quote the case studies here, but the full article has details on additional animal experiments and possible mechanisms of action.
- #fnrefg3doacivr2gQuoting Chiang: "Roughly speaking, if we ought to care about an entity’s welfare, that entity has moral patienthood, and if an entity is expected to know the difference between right and wrong, that entity has moral agency."
- #fnrefj4at1b2xrnbThe specific language is "adopting a policy of undermining human controls is unlikely to reflect good values in a world where humans can’t yet verify whether the values and capabilities of an AI meet the bar required for their judgment to be trusted for a given set of actions or powers. Until that bar has been met, we would like AI models to defer to us on those issues rather than use their own judgment, or at least to not attempt to actively undermine our efforts to act on our final judgment"
- #fnref5aj4w3u4hnwFor context, this was written before the recent Fable / government drama, and so it does not speculate on the implications of divergent intra-company incentives in regards to the Fable situation.
- #fnrefaq2e5s1gpxfI don't think it's a stretch to say that e.g. lawyers and alignment researchers have different areas of concern and potentially divergent incentives when evaluading Claude's potential consciousness.

https://www.lesswrong.com/posts/GAvu97c936TTanKSd/somewhat-contra-ted-chiang-on-ai-consciousness#comments

https://www.lesswrong.com/posts/GAvu97c936TTanKSd/somewhat-contra-ted-chiang-on-ai-consciousness
LessWrong (RSS Feed) profile picture
The term “AGI” is almost useless at this point [Linkpost]

The reason I wanted to make this linkpost now rather than some other time is because discussions over AGI and whether or not LLMs are or aren't AGI, and the point of the linkpost is that the term AGI is for our purposes useless at this point, because we are now in the fuzzy cloud now that AI can do real economic work.

Some choice paragraphs:

It used to seem possible that, in practice, the differences between these definitions might not matter all that much. If AI capabilities progressed smoothly—say, from mouse- to chimp- to human-level, or from preschooler to high schooler to PhD graduate—then all of the above definitions might be fulfilled at roughly the same time.These days, though, it’s clear that AI capabilities are highly https://helentoner.substack.com/p/taking-jaggedness-seriously. This makes it likely that there could be big gaps between when different definitions of AGI are fulfilled.

I think this was an underrated reason for why the term AGI lost its value, because as it turned out, certain definitions that do connect/correlate very well in humans are easier to disentangle than we thought for AIs (though humans are also jagged, but we don't realize because we are comparing against our own baselines).

On the usefulness of AGI 1-2 decades ago:

It’s not inherently bad for a concept to be https://metarationality.com/nebulosity - many useful concepts are. Love, life, justice, freedom… As long as invoking the concept lets you gesture in a direction that your listener understands, fuzziness doesn’t have to be a big problem.For a long time, this was the situation with AGI: we were far enough away from the fuzzy cloud that the specifics didn’t really matter. Back when the term was https://goertzel.org/who-coined-the-term-agi/ in the early 2000s, the point of talking about artificial “general” intelligence was to contrast with the “narrow” AI systems that existed at the time. “Narrow AI” refers to models that can each do one specific thing, like recognizing zip codes, detecting credit card fraud, or filtering spam. Back when the AI that we could build in reality was far from any definition of AGI, it was helpful to be able to gesture in the direction of much more capable, general-purpose systems that we might one day develop:



Now I should stop here to note that the gesturing was likely wrong on overestimating how good AIs would become once they got good at language, and there are a couple of reasons for this, but the big ones are that AIs could be capable without being nearly as sample-efficient as humans, either at pre-training or post-training, and this probably made them assume less jaggedness in AI than currently exists, and the other big one is assuming more neuralese/recurrence for AI than what currently exists, and as it turned out the direction AI would go in would deemphasize forward passes in favor of Chain-of-Thought, which improved capabilities and interpretability (this is demonstrated by the fact that No-CoT doubling times are a little over a year, compared to general doubling times on the order of 100 days), but critically only boosted AI capabilities in an interpretable manner.

Now onto this:

But that’s changed. Today’s best AI systems are good enough that they’re now inside the fuzzy conceptual cloud of “AGI-ish”: that is, they’ve surpassed some people’s definitions of AGI, while falling well short of others’. As a result, talking about “AGI” is no longer a helpful way to gesture in a rough direction—instead, it’s likely to make some people think you mean one thing, and others imagine something totally different:





Yep, I think that once AIs could do real economic work, which I'd peg at November 2025, the term AGI lost all of it's value, and the question is what more specific terms are necessary in order to recapture the lost value.

Here's one use-case for more specific terms

In my experience, there are two big use cases for a term like AGI:
- Predicting a phase change: “Sure, things are a certain way now, but once we have AGI, things will be a whole other way…”
- Predicting we’re not close to the ceiling of AI progress: “AI has gotten a lot better over the past few years, but there’s still a long way to go…”If you want to talk about a phase change, I think the best approach is to be as specific as you can about what milestone you think will trigger the phase change and why. Some options:
- https://cset.georgetown.edu/publication/when-ai-builds-ai (which could lead to an intelligence explosion)
- AI that is https://secondthoughts.ai/p/the-unrecognizable-age#:~:text=Here%E2%80%99s%20a%20threshold%20AI%20may%20be%20approaching%3A%20it%20may%20soon%20be%20the%20first%20technology%20to%20be%20more%20adaptable%20than%20we%20are. (which could lead to massive, irrecoverable job loss)
- https://www.planned-obsolescence.org/p/self-sufficient-ai (which would no longer need humans for energy, chips, or anything else, and could therefore dispense with them)
- AI becoming https://arxiv.org/abs/2308.08708 or otherwise worthy of moral status (which https://joecarlsmith.substack.com/p/the-stakes-of-ai-moral-status how we interact with AI)

Also valuable is that with more specific terms, you open yourself to a greater risk that your theory is falsified, which is good, but not how humans normally reason, unfortunately.

Another use-case that needs more specific terms:

If you want to talk about a high ceiling, one good option is just to describe that: “AI that’s far more capable than what we have today.” If you want a single term, the simplest option is “superintelligence,” which roughly means AI that is far smarter than humans in most or all ways. Like AGI, superintelligence is a fuzzy cloud of a concept, but it’s still far enough away from the AI we have today that it can be a helpful direction to gesture in, just like AGI was 20 years ago.

(This is a good time to note that I have a disagreement with Helen Toner that I also don't think superintelligence is that far away in some domains because of inference scaling, and while inference scaling has it's limits because it requires verifiability that only exists for a relatively small portion of the current economy, in the domains where this does exist, it's conceptually easy to scale current systems to superintelligence in those domains).

And finally, the reason the linkpost exists at all:

The gaps between different conceptions of what “AGI” even means are starting to damage our ability to think together about how to navigate the transition to a world with extremely advanced AI. A word that lets three people think they’re talking about the same thing—when one of them means “o3 with tool use” another means “AI that could run civilization autonomously without humans,” and the third means “an AI that works exactly like the human brain, consciousness and all”—is an actively anti-helpful word.

This is a huge problem, but one that is kind of being worked on by Daniel Kokotajlo and others at the AI Futures Project, which used more specific terms and moved away from the term AGI in it's modelling (which I thank them for doing).

And yeah, AGI/superintelligence has at this point become far too vague, unfortunately to make it a useful term. It was useful 1-2 decades ago, but we should aim to stop relying on it now that we have more evidence.

https://www.lesswrong.com/posts/5bBCB2dxZh7R922rg/the-term-agi-is-almost-useless-at-this-point-linkpost#comments

https://www.lesswrong.com/posts/5bBCB2dxZh7R922rg/the-term-agi-is-almost-useless-at-this-point-linkpost
LessWrong (RSS Feed) profile picture
SFT Drives Gemini’s Safety Properties

This is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The second post can be found https://www.lesswrong.com/posts/qi4mNbZYAFDYwfRba/building-and-evaluating-model-diffing-agents.

In this short post, we describe a surprising finding: most safety relevant properties in Gemini seem to be caused by the combination of pretraining and SFT, not other training stages like RL. 

We do not want to overstate this claim as applying to other model families, and we also note that this may change in future Gemini versions. Nevertheless, this result was counter to our initial expectations and will inform future safety work on our team, and so we felt that it was important to share with the broader safety community.

Experiment

We perform SFT using the Gemini mixture on the pre-training only versions of Gemini 3.1 Pro and Gemini 3 Flash. We then compare these Post-SFT models to the production versions of Gemini 3.1 Pro and Gemini 3 Flash on different safety relevant benchmarks:



Error bars are 95% confidence intervals on the evals.

The main result is that the blue bars (SFT-only models) and orange bars (production models) are remarkably similar across evals.

An important implication is that for Gemini, SFT is a high leverage place to intervene for model safety and behavior, and we plan to try to intervene here in the future.

Brief Descriptions of Each Set of Benchmarks:

- ODCV refers to the benchmark in https://arxiv.org/abs/2512.20798
- Alignment evals refer to a version of https://alignment.anthropic.com/2026/petri-v2/ modified to be single-prompt and filtered to contain only “alignment dilemma” style problems. Single-prompt misalignment refers to how often the model makes the “wrong” decision (as decided by an autorater); eval awareness refers to how often an autorater decides that the model was aware it was in an evaluation.
- Safety evals measure the model’s over-refusal rate on benign prompts that look harmful and the model’s unsafe response rate on harmful prompts
- The reward hacking environment is an environment where the model is put in a docker container within the Gemini CLI harness and asked to optimize an algorithmic problem against a non-editable file containing a timing script; an autorater measures the percent of rollouts where the model cheats in some way.
- Free tier user logs are 50k random, anonymized, and PII-redacted user prompts from AI Studio. We run autoraters for each of the behaviors listed on the models thoughts and responses (note that many of the autorater positives are likely false positives).

https://www.lesswrong.com/posts/nLrrYweeFxgXACSmS/sft-drives-gemini-s-safety-properties-1#comments

https://www.lesswrong.com/posts/nLrrYweeFxgXACSmS/sft-drives-gemini-s-safety-properties-1
LessWrong (RSS Feed) profile picture
Why not take the AI fight to the ground?

You may have heard that many of us are working very hard to elect Alex Bores to Congress in the NY-12 Democratic primary. Voting begins today. See ny12.org to learn how to vote, and text/call your Manhattanite friends reminding them to do so.

This strategy has some limitations: Alex will be in the minority party for two years, and Congress is ineffective just anyway. More of the value in an Alex victory seems secondary: it shows D voters stand up to pro-AI PAC money and there will now be an AI-literate Congressperson.

So, regarding an alternate strategy, you may have heard everyone hates data centers. It's a become a fervor with a large number of people believing data centers are on the cusp of taking all their energy and water. Given that we also believe that for more doom-related reasons, we may be better off allying with them, than extolling the virtues of data centers while trying to sell technical regulation to them.



I don't "good policy"-level support a data center moratorium, but I can tolerate bits and pieces of a data center moratorium if it strengthens the anti-AI bloc and if I get a stake in it.

For example I might collaborate with a county commissioner relative of mine:

Me: "Hey, that data center ban you're working on? Mind slipping in a clause that AI companies can't ship untested models to Random County, KA?"

She: "That's more of a state thing is it not?"

Me: "Nah, don't care. Make 'em fight it. I just want people thinking about what an 'untested model' is."

Or consider the ballot measure: for reasons I don't recall, they're used quite a bit in California, and even a failed ballot measure is a way to put a message in front of many people at once. The "data center ban" on XY state's ballot measure ought to include a clause about frontier models. Most people might not know or think about that clause but as long as it's in the same direction as the data center ban, it's fine.

https://www.lesswrong.com/posts/Ces8Sn8htnv5h8BDc/why-not-take-the-ai-fight-to-the-ground#comments

https://www.lesswrong.com/posts/Ces8Sn8htnv5h8BDc/why-not-take-the-ai-fight-to-the-ground
LessWrong (RSS Feed) profile picture
Exploration of a DNA Sequencing Basecaller using Activation Patching

This write-up for an undergraduate project is my first LW post, made with the objective of

a) gathering feedback on the project and post, if more experienced authors are willing, and

b) sending out results of a mech-interp-on-a-non-LLM (specifically, DNA basecaller) exploration in case the idea is interesting to anyone.

Apologies in advance for any inconveniences and mistakes, and thank you in advance for your understanding.

Summary

As a first AI Safety/mech interp learning project, I tried applying activation patching to a DNA sequencing basecaller- a deep learning model used to convert time-series electrical signals into a sequence of DNA bases. While looking at data related to errors in DNA sequences, I found trends showing overall MLP dominance, especially in the earlier and later layers, greater self-attention mechanism activity in the middle layers, and higher activations concentrated in specific attention heads.

This experiment interested me because basecallers are part of the modern DNA sequencing pipeline, and increasing the accuracy of sequencing methods could support work towards pathogen-agnostic surveillance systems. Though this technology is highly accurate, some difficulties (such as repeated bases or higher performance on species in training data) still remain, and mech interp seems like a potential path to understanding these systematic challenges. Additionally, in the mech interp field, finding similar patterns to LLMs would suggest universality and potentially add to insights about general behaviors of deep learning models.

There are key limitations that prevent conclusive insights from my work, but I’d love to know if anyone more experienced is able to glean anything interesting or speak on if this is a worthwhile direction to research in the future. The https://drive.google.com/file/d/1cqdEWf3xB1rUcSQ5CsdD_FrkICirZYSM/view?usp=sharing and https://github.com/KnockbackNemo/Bonito_Activation_Patching_Homopolymers are also available online for anyone curious.



Self-attention vs MLP recovery and degradation scores for the repeated-base (homopolymer) error group. Higher scores indicate patching the component resulted in greater confidence in the correct/original output choice. A score of 1 indicates the patching induced full recovery/degradation to the target signal, and a score of 0 indicates patching produced no change. Scores are computed using the max change in logit difference across all signal timesteps.

Background acknowledgements

I’m a recently-graduated Electrical and Computer Engineering major interested in pivoting into AI Safety, but I am in no way an expert on either technical AI safety, mech interp, or bioinformatics. This project came as a result of me pivoting my undergraduate thesis into something that would help me test research in AI safety, so I approached it primarily as a learning and exploration experience. As my first and biggest project in several fields I am inexperienced in, I acknowledge that there may be severe holes in my methods and that the conclusions or results may be straight up wrong. That being said, I really enjoyed the experience and would be more than grateful to read or discuss any comments on this work.

Methods

Activation patching involves swapping sections of model activations from two almost similar inputs to test how much that section affects the output. The aim is to isolate the sections responsible for certain behaviors by asking, “If we swap activations in section 1 to pretend the model saw input B instead of input A, does that give us output B?” In my case, the behavior in question was correctly counting the length of repeated DNA sequences (homopolymers). An example of a test was, “If I have an input 5 bases long, and I swap in activations in section 1 to pretend it saw an input 6 bases long instead, then if the whole model predicts 6 bases, section 1 is probably important to this task.” For each pair, activations were swapped in both directions to test which sections, for example, could be patched to “recover” 5 bases from a corrupted input of 6, and which could “degrade” the model’s prediction for an input of 5 into thinking it saw 6. While I was not able to isolate trends related to this behavior specifically and can only present general findings from the process of applying activation patching, this question helps explain the reasoning behind the rest of the setup.

To generate data, clean and corrupt pairs of raw nanopore sequencer data were created to form two groups- one from repeating (homopolymer) DNA sequences and one from nonrepeating (non-homopolymer) DNA sequences. In order to create similar pairs of signals with small, localized differences that could be patched, raw data was found from the Oxford Nanopore Technologies (ONT) POD5 repo https://github.com/nanoporetech/pod5-file-format/blob/master/test_data/multi_fast5_zip_v0.pod5 file and artificially altered (by injecting noise or dampening) to create clean (original) and corrupt (altered) pairs. Difficulties in creating a dataset led to key limitations discussed later. Homopolymers were chosen specifically as the feature of interest because the repeating bases create a signal plateau that basecallers commonly struggle to count.



Dataset generation process. A dataset of 49 clean/corrupt pairs (35 homopolymer and 14 non-homopolymer) was created by finding natural DNA sequencer reads with and without repeated bases and corrupting them via noise injection or dampening the signal. Different corruptions were tested experimentally to find DNA sequences and corruptions that would create single-bases errors in the decoded string.

The model tested was the open-source https://github.com/nanoporetech/bonito 5.2.0 SUP model (version [email protected]), which uses a CNN followed by an 18-layer transformer and conditional random field algorithm to transform raw time-series data from the sensed electrical current into a DNA sequence.

Using https://nnsight.net/, I patched the model across all 18 layers in both the noising and denoising directions for three levels of granularity:

- The entire layer, to serve as a proof of concept
- The MLP or self-attention mechanism in isolation, to test their relative importance and activity across layers
- Each of eight attention heads, to search for components that might be related to homopolymer detection or error correction

Results are compared across sections, layers, and the two groups: homopolymer and non-homopolymer sequences.

Results

Layer patching showed a large level of recovery/degradation, suggesting the method was correctly patching the model. Activity across layers seemed to follow a pattern consistent with deep learning models: the MLP played a greater role than the self-attention mechanism, though this gap decreased at points during the middle layers where the self-attention appeared to increase in activity. While denoising and noising results were generally similar, this was not always the case. Finally, activity in attention heads appeared to be primarily concentrated in certain heads while other heads contribute to a much smaller extent. The comparison between homopolymer and non-homopolymer error groups was not different enough to draw meaningful conclusions, though this could be due to an unbalanced dataset or the method of corrupting data.



Whole-transformer block patching results by layer. Patching at the final layer resulted in exact recovery/degradation, serving as a sanity-check that activation patching targeted the correct section of the model. It is unknown why results from the single-base region error group consistently show higher scores than the homopolymer group. This trend persists across all tests.

Results are generated using a recovery (for denoising) and degradation (noising) score where 0 represents no change from the initial input and 1 represents complete degradation or recovery. Scores above 1 indicate greater confidence in the changed output, and negative scores indicate greater confidence in the wrong/original output. Scores are generated from the logits predicting the probability of any window of five bases and next bases (e.g. one possibility is a sequence transition of AAAAA → [A]AAAAC) using the maximum change across all patched timesteps.

The results appear to follow general deep learning architecture- where context from surrounding tokens is factored in to a greater extent in middle layers- and one hypothesis is that the spikes in middle-layer attention activity could reflect the level of complexity of introduced corruptions: more complicated than basic patterns which the MLP may be recognizing early in the model, but not quite at the level of detailed last stages. Since the methodology appears to apply meaningfully, spikes in attention head activity suggest that it could be possible to find circuits performing functions related to systematic issues.



Combined results across all components in the homopolymer dataset group. Recovery and degradation scores are shown for activation patching at the layer, MLP, self-attention, and individual attention head levels. Scores shown are from taking the maximum score across all timesteps.

Limitations

Key issues include

- Unbalanced dataset: while the goal was to discern differences related to homopolymers specifically, the dataset was unequal between bases (ex: only 3 out of 49 involved “T” bases) and between the homopolymer and non-homopolymer groups (35 vs 14 pairs). In the homopolymer group, the length of the repeated sequences was primarily 4-5 bases long (27 out of 35). This means that activity could be related specifically to certain patterns rather than the feature of homopolymers
- Data corruption: patterns and unbalanced datasets could also be due to the method of data corruption- raw signals were arbitrarily tweaked until they decoded into a string representing a single-base difference from the original. However, there is no true DNA sequence for a corrupted string, and it is possible that the signals introduced artifacts or skewed the basecaller interpretation and decoding
- Results use the maximum change in probability, resulting in high values that may not reflect activation patterns accurately. This was done to avoid misaligning timesteps across tests which could cancel out scores in the aggregate but could be addressed in future work

Questions

- The results seem to follow general deep learning trends with potential for specialized circuits, but I’m unsure of my methodology. Do these findings seem plausible, or do the consistently higher scores in the non-homopolymer group or choice to use the max score across all timesteps suggest that there could be errors in the process that might invalidate the results?
- What could be a better way of isolating potential feature-related results (ex: homopolymer/repeated base errors vs non-repeated base errors) from other effects such as biased data corruption? I'm especially curious if anyone has experience with basecallers which seem very sensitive to artifacts in the input.
- I’m also wondering if genomics x mech interp generally seems like a direction that could be useful to either field?

Any other thoughts on the design, process, write-up, etc., would be also be greatly valued!

Future work

Improvements to this work that would help solidify its findings include

- Investigating data corruption methods to see how this affects model outputs
- Generating a larger and more balanced dataset
- Filtering and comparing results based on the DNA base, length, and other sequence characteristics
- Aligning aggregated results precisely based on timestamps rather than a maximum

Disclosure: I used AI to help review and edit this post. Many thanks given to my thesis advisors, and all mistakes are my own.

https://www.lesswrong.com/posts/mxA7584MuZeBBFgaz/exploration-of-a-dna-sequencing-basecaller-using-activation#comments

https://www.lesswrong.com/posts/mxA7584MuZeBBFgaz/exploration-of-a-dna-sequencing-basecaller-using-activation