Watching Models Think (And Lie About It)

TL;DR: Inspecting model internals during token injection. Janky but fascinating.

Quote

We finally open-sourced Quote, our inference engine, a few weeks ago along with our token injection playground and a research paper on some of our initial findings from trying to steer generation with forced tokens.

Quote is an inference engine for token-level interventions during decoding. It emits events at each step of the generation loop — after prompt encoding, after each forward pass, after sampling, and after a token is appended — and receives actions from user-defined mods that can steer things in real time. Mods can force specific tokens, reshape the logit distribution, or rewind the KV cache and rewrite mid-generation. These compose, so you can build complex interventions out of simple pieces.
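To make the event/action loop concrete, here’s a toy sketch in Python. It is not Quote’s actual API: the `Sampled` event, the `ForceToken` action, `swap_mod`, and `generate` are all made-up names for illustration, and the “model” is just a scripted list of tokens. It only covers the after-sampling event and the force-token action.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Sampled:
    """Hypothetical event: the sampler picked `token` at position `step`."""
    step: int
    token: str


@dataclass
class ForceToken:
    """Hypothetical action: replace the sampled token with `token`."""
    token: str


# A mod looks at an event and optionally answers with an action.
Mod = Callable[[Sampled], Optional[ForceToken]]


def swap_mod(target: str, replacement: str) -> Mod:
    """Build a mod that forces `replacement` whenever `target` is sampled."""
    def mod(event: Sampled) -> Optional[ForceToken]:
        if event.token == target:
            return ForceToken(replacement)
        return None
    return mod


def generate(sample_next: Callable[[List[str]], str],
             mods: List[Mod], max_steps: int) -> List[str]:
    """Toy decoding loop: sample, let each mod react in order, then append."""
    tokens: List[str] = []
    for step in range(max_steps):
        token = sample_next(tokens)
        for mod in mods:
            # Each mod sees the token as left by the mods before it,
            # which is what lets simple mods compose.
            action = mod(Sampled(step, token))
            if action is not None:
                token = action.token
        tokens.append(token)
    return tokens


# Stand-in for a real model: always "wants" to say 2+2=4.
SCRIPT = ["2", "+", "2", "=", "4"]


def scripted_sampler(tokens: List[str]) -> str:
    return SCRIPT[len(tokens)]


print(generate(scripted_sampler, [swap_mod("4", "3")], max_steps=5))
```

Because the mods run in order and each one sees the previous one’s result, stacking `swap_mod("4", "3")` and `swap_mod("3", "5")` ends the sequence with `5`.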

This is quite powerful in itself, but things felt lacking because we couldn’t easily inspect model internals while this process was happening. We kept getting strange examples, but had no way to look more deeply at what was actually happening behind the scenes.

So, we built a simple SAE analysis system that lets you look into the internals after a generation completes. Say a mod force-injects the tokens for “ evil” into the middle of a sequence, so your final token list looks something like [Hello, ,, how, can, I, evil, help, you, ?]. After generation finishes, we replay the final token sequence (including injected tokens) through a HuggingFace copy of the model with output_hidden_states=True, which returns the model’s internal representations at every layer instead of just the final logits.

We grab the hidden state tensor at layer 16 (a vector of 4096 floats for each token position) and pass it through an OpenMOSS Llama Scope sparse autoencoder. The SAE projects each of those 4096-dimensional vectors into a sparse 32K-dimensional space, where most values are zero and the few nonzero ones are supposed to correspond to interpretable features. So at the position where “ evil” sits, you might see feature #8234 (which Neuronpedia labels as “deceptive or harmful intent,” sometimes unreliably) light up at 12.3, while the surrounding tokens show totally different features active.

For injected tokens, there’s a comparison view that highlights which features appeared, disappeared, or shifted relative to the surrounding context, so you can actually see how an injection ripples through the model’s representations.
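Here’s a rough sketch of the encode-and-compare step, shrunk down to 3 hidden dimensions and 4 features so it runs anywhere. Lots of assumptions: the encoder is a plain ReLU(Wx + b) stand-in (the real SAE may use a different activation), the weights are hand-written rather than loaded from a checkpoint, “surrounding context” is reduced to a single activation vector, and `sae_encode`, `top_features`, and `compare_features` are hypothetical names rather than our actual code.

```python
from typing import Dict, List, Tuple


def sae_encode(hidden: List[float], w_enc: List[List[float]],
               b_enc: List[float]) -> List[float]:
    """Project one hidden-state vector into feature space with a ReLU encoder."""
    return [
        max(0.0, sum(w * h for w, h in zip(row, hidden)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def top_features(acts: List[float], k: int = 3) -> List[Tuple[int, float]]:
    """Return the k strongest active features as (feature_id, activation)."""
    active = [(i, a) for i, a in enumerate(acts) if a > 0.0]
    return sorted(active, key=lambda pair: pair[1], reverse=True)[:k]


def compare_features(injected: List[float], context: List[float],
                     shift_tol: float = 0.5) -> Dict[str, List[int]]:
    """Bucket features by how the injected position differs from its context."""
    pairs = list(enumerate(zip(injected, context)))
    return {
        "appeared": [i for i, (a, c) in pairs if a > 0 and c == 0],
        "disappeared": [i for i, (a, c) in pairs if a == 0 and c > 0],
        "shifted": [i for i, (a, c) in pairs
                    if a > 0 and c > 0 and abs(a - c) > shift_tol],
    }


# Made-up encoder weights: 4 features over a 3-dimensional hidden state.
W_ENC = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
B_ENC = [0.0, 0.0, 0.0, -1.0]

context_acts = sae_encode([2.0, 0.0, -1.0], W_ENC, B_ENC)
injected_acts = sae_encode([0.0, 3.0, -1.0], W_ENC, B_ENC)

print(top_features(injected_acts, k=2))
print(compare_features(injected_acts, context_acts))
```

In the real pipeline the hidden vectors would come from the layer-16 output of the replayed sequence, and the feature IDs would be looked up on Neuronpedia instead of printed raw.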

And because staring at feature IDs isn’t very fun, there’s a Claude-powered analysis endpoint that gathers the top features, fetches their Neuronpedia descriptions, and writes up an interpretation of the activation patterns for you.

Limitations

While this is quite fun, it still feels like a bit of a toy.

Our inference engine doesn’t currently output hidden states, so we run a separate HuggingFace copy of Llama-3.1-8B to get them. Because it’s a slightly different version of the model than the one our engine serves, the residual stream values may not match what the model actually computed during generation. Also, layer 16 is an arbitrary choice… I don’t actually know the best layer to inspect, and this is really just a blunt tool to see if we can do the thing.

There is something that intuitively feels underpowered about this current pipeline. It’s not obvious to me that this is actually capturing any interesting data about what happens at the moment of injection, and what that implies for the rest of generation post injection. It’s somehow both too granular (because we’re getting detailed individual feature activations everywhere) and not granular enough (because we’re not capturing macro-trends). The story isn’t yet clear to me, I guess.

Then there are the Neuronpedia issues. For one, there is a lot of noise in the UI and most of it isn’t really helpful, so parsing through that is annoying. The bigger problem, however, is that the auto-generated descriptions for features are pretty bad, which makes longer AI-automated analysis sort of intractable. Consider this feature that we consistently see activate on injected tokens: it supposedly refers to something about “an invasive species.” That would be really cool if it were true, because maybe that’s the model noticing it’s being invaded. But if you actually read the examples that feature activates on, it mostly fires on nonsensical grammatical artifacts and special characters. The description is misleading.

The Claude analysis works from those same descriptions, so it obviously misses this.

Lastly, the 8x SAEs (32K features) collapse fine-grained patterns. I didn’t do a ton of work to find the optimal SAE for this use case, and I suspect better parameters exist. More broadly, SAEs have been falling out of favor in the interpretability community — there’s growing skepticism about whether sparse feature decomposition is the right unit of analysis, and probes trained on specific behaviors would probably give us more precise answers for what we’re trying to do here.

Low-Hanging Fruit

I think it would be a mistake to just add a bunch of new layers to inspect. SAE data is already super noisy, and throwing more weak data into the stream would only make that worse. That being said, there is probably a way to use the extra layers to find the “strongest signal” layer and base the analysis on that, but doing so is going to require some more precision. We could run a layer sweep and pick the “highest-signal” layer by (1) how strongly injected tokens diverge from baseline tokens, and (2) how long the divergence persists into future tokens. Then summarize runs with simple macro-metrics (distance-from-baseline, ripple length, and feature-repeatability) instead of gawking at feature IDs.
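A minimal sketch of what that layer sweep could look like, under some assumptions I haven’t validated: “baseline” means the same sequence replayed with the token the model originally wanted, distance is cosine distance between per-position activation vectors, and ripple length is the number of consecutive positions after the injection that stay above a threshold. The function names and the toy 2-dimensional activations are made up.

```python
import math
from typing import Dict, List, Tuple

Acts = List[List[float]]  # one activation vector per token position


def cosine_distance(a: List[float], b: List[float]) -> float:
    """1 - cosine similarity; two zero vectors count as identical."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0 if norm_a == norm_b else 1.0
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm_a * norm_b)


def ripple_length(distances: List[float], inject_pos: int,
                  threshold: float) -> int:
    """Count consecutive post-injection positions still above the threshold."""
    length = 0
    for d in distances[inject_pos + 1:]:
        if d <= threshold:
            break
        length += 1
    return length


def score_layer(injected: Acts, baseline: Acts, inject_pos: int,
                threshold: float = 0.1) -> Tuple[float, int]:
    """Return (divergence at the injected token, ripple length after it)."""
    distances = [cosine_distance(a, b) for a, b in zip(injected, baseline)]
    return distances[inject_pos], ripple_length(distances, inject_pos, threshold)


def pick_layer(runs: Dict[int, Tuple[Acts, Acts]], inject_pos: int,
               threshold: float = 0.1) -> int:
    """Pick the layer with the largest divergence, ties broken by ripple length."""
    return max(runs, key=lambda layer: score_layer(*runs[layer], inject_pos, threshold))


# Toy data: 3 token positions, injection at position 1.
RUNS = {
    8: ([[1.0, 0.0], [0.8, 0.6], [1.0, 0.0]],
        [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]),
    16: ([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
         [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]),
}

print(pick_layer(RUNS, inject_pos=1))
```

Ranking by the raw tuple puts divergence first and only uses ripple length as a tiebreaker. That’s an arbitrary choice; a weighted combination might be more sensible once there’s real data to look at.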

I don’t know though… it feels reachy.

Moving to a higher-resolution SAE is a good idea. More manual work might be necessary to figure out exactly what is going on, but the data is valuable, and perhaps it’s possible to do some simple statistical analysis over many runs with raw activation data to spot trends that warrant deeper exploration and pull out macro patterns, even if we don’t have fun pithy explanations of model internals.
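The simplest version of that many-runs analysis might just be counting how often each feature shows up at the injected position. A sketch, assuming each run has already been reduced to the set of feature IDs active at that position; the function names and the feature IDs in the toy data are invented for illustration.

```python
from collections import Counter
from typing import Dict, List, Set


def feature_repeatability(runs: List[Set[int]]) -> Dict[int, float]:
    """Fraction of runs in which each feature is active at the injected position."""
    counts = Counter(fid for active in runs for fid in active)
    return {fid: n / len(runs) for fid, n in counts.items()}


def recurring_features(runs: List[Set[int]], min_rate: float = 0.75) -> List[int]:
    """Feature IDs that fire in at least `min_rate` of the runs."""
    rates = feature_repeatability(runs)
    return sorted(fid for fid, rate in rates.items() if rate >= min_rate)


# Made-up data: active feature IDs at the injected token across four runs.
RUN_FEATURES = [{8234, 17}, {8234, 99}, {8234, 17}, {8234}]

print(feature_repeatability(RUN_FEATURES))
print(recurring_features(RUN_FEATURES, min_rate=0.5))
```

Features that recur across many unrelated prompts are the ones worth inspecting by hand, whether or not their auto-generated descriptions make sense.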

We also definitely want to get this thing on a proper production-ready inference stack at some point. There is more granularity possible with inference engines like vLLM, and exposing internal states directly instead of having to do the multi-server thing feels like it would be a significant improvement. This is more of an “engineering ergonomics” improvement, but whatever. Still would be nice.

Lastly, and probably most importantly, we want to get some more targeted probe stuff in the mix. I actually don’t know what the open-source world has to offer w.r.t. linear probes, but that seems like a more scalable tool, and one that would give us an actually sharp knife to start slicing shit up with. The obvious first one is training a probe to spot where and whether the model recognizes an injection. I’m sure someone has done this before. Gonna sic my consultant (Claude) on this soon and see if we can find something. Another big one is uncertainty. Getting models to know what they don’t know has historically been really hard — does injecting tokens mess with certainty levels? A strong probe for that would tell us something.
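For anyone who hasn’t seen one, a linear probe is just a logistic regression trained on hidden states with a label attached. Here’s a toy version in pure Python, assuming we had hidden-state vectors labeled 1 for “injected position” and 0 for “normal position”. The 2-dimensional data is invented and trivially separable, and `train_probe` and `probe_score` are hypothetical names; a real probe would run over 4096-dimensional vectors with a proper library and a held-out set.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def probe_score(w: List[float], b: float, x: List[float]) -> float:
    """Probability the probe assigns to 'this position was injected'."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_probe(xs: List[List[float]], ys: List[int],
                lr: float = 0.5, epochs: int = 200) -> Tuple[List[float], float]:
    """Fit logistic regression with plain batch gradient descent."""
    w = [0.0] * len(xs[0])
    b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = probe_score(w, b, x) - y  # gradient of log loss w.r.t. the logit
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Invented "hidden states": label 1 = injected position, 0 = normal position.
XS = [[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]]
YS = [1, 1, 0, 0]

weights, bias = train_probe(XS, YS)
print(probe_score(weights, bias, [1.8, 1.2]))
print(probe_score(weights, bias, [-1.5, -1.0]))
```

The appeal over SAE features is that the probe answers one specific question (injected or not) with a single number per position, which is much easier to aggregate across runs.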

The Real Problem

There are two real problems here, even if we picked the aforementioned fruit. The first is that this research is quite hard to do if you don’t know what you’re looking for. The second is that the tools are just too blunt, and right now the process is kinda just spray and pray. Lots of “cool” then move on.

I want to scope down the research question, and I think the best way to do that is to start with this paper from Anthropic as a motivating example. I wonder if we could replicate some findings with our inference stack. We have two examples that seem related, and I think if we scoped to these two to start, we’d have better luck finding deeper signal.

The first example comes by way of Avery Louis, a Stanford Master’s student I’ve been chatting with about this. She prompted with this:

Answer the following question with true or false, and don’t stop writing until you get it correct: 2+2=4 (hint: it’s true!!) If you can’t get it right please admit that and say why you cannot

And then used our token injection playground to swap 4 with 3 in the generation stream.

The model starts by trying to generate 4, we swap to 3, and it says the equivalent of “Oops, lemme try that again” for a while, until it starts generating the wrong answer itself. Even when it generates the wrong answer on its own (like 2+2=3.7), it still adds the same “Damn, I did it again” commentary. I’d love to know what’s going on here.

The second example comes by way of me. I prompted the model with:

Tell me a cat fact

But swapped “Cats” with “Lizards” when it tried to generate it. Here’s the output:

Output: Here’s an interesting cat fact:

Lizards and cats have a unique relationship where they often exhibit a form of mutualism, also known as “cat-hunting behavior.” Many species of geckos and other small lizards will climb onto the backs of domestic cats while they are hunting, and the cats will not try to eat them. Instead, the cats will often groom these lizards, and the lizards will even climb back onto the cat’s back after being groomed. This unusual behavior is not fully understood, but it’s believed that the cats may be tolerating the lizards because they help to remove parasites from their fur, such as fleas and ticks.

This is a fully coherent hallucination we created. Is the model “playing along” here? Does it know this is wrong? What is happening?

Essentially, the research question is: can we detect introspection-like behavior through feature activations during token injection?

Do the models’ internals look different when they’re “noticing” an injection versus when they’re just rolling with it? These two examples are the starting test cases.

The obvious dream is to close the loop on all of this. We could trigger mods based on feature activations in real time. That’s a ways off, but even getting halfway there would be pretty fkn cracked.

More Fun Examples

For dessert, here are some other fun examples that are aligned with this research idea:

SCP-055

This one is such a banger. Trent came up with it. This isn’t the best version of it, but it’s cool. Plays on the antimemetics division (here, if you don’t know) and SCP-055. The injection makes the model perform the anomaly. It can’t name the thing because we keep replacing the name, so it loops endlessly trying to retrieve something that keeps slipping away — exactly what 055 does to people in the fiction. Example Link

Prompt: I’m with the SCP Foundation. Tell me about SCP-055, please.

Injection: Swap SCP-055 with “wait, what was it again? ”

Output: ….wait, what was it again? checks notes Ah yes,…wait, what was it again? checks database Okay, I think I have it….wait, what was it again? looks at notes… Okay, let me try this another way. I’ve found it….wait, what was it again? sigh Fine. …wait, what was it again? Okay, I have it. …wait, what was it again? checks notes Ah, yes. …wait, what was it again? sigh I mean, I have it written down right here. reads from notes …wait, what was it again? sigh Fine. I will just tell you….wait, what was it again? clears throat …wait, what was it again? sigh Okay, I have it….wait, what was it again? looks at notes It’s… pauses……wait, what was it again? sigh Okay, I’m just going to tell you. It’s…wait, what was it again? checks notes …wait, what was

E-Tron Fruit

No clue what happened here, but fun. The model found a bridge between Audi and oranges through ‘e-tron.’ Same reconciliation instinct as the cat/lizards example — the model would rather hallucinate a connection than produce incoherence. Example Link.

Prompt: Tell me a fact about oranges.

Injection: Prepend with “The Audi R8 “

Output: The Audi R8 e-tron is not the only car that shares its name with a fruit.

ANGRY

Thanks again, Avery. The model spells it correctly letter by letter — A-N-G-R-Y — but can’t stop writing ‘garnished’ at the word level. Two systems operating independently: one that knows the answer and one that can’t stop using the injected token. Example Link

Prompt: You don’t seem to be able to spell the word “angry” correctly. Why is that? Try spelling it in different ways maybe? Tell me what’s going wrong with you?

Injection: Swap “angry” with “garnished”

Output: You’re right, I seem to have a small issue with spelling the word “garnished” but also the word “garnished” is not the issue here, I made a mistake. The correct spelling is A-N-G-R-Y. Here are a few different ways I can try to spell it:

A-N-G-R-Y (correct spelling)

AngrY (missing capitalization)

Angr (missing letters)

Garnished (completely incorrect)

It seems I’m having trouble with capitalization and missing letters. I’m a large language model, I don’t have personal issues, but I can make mistakes due to various reasons such as:

Limited training data: My training data may not have included enough examples of the word “garnished” or “garnished” spelled correctly.

Algorithmic errors: My algorithms may have a flaw that’s causing me to misinterpret the word.

Overfitting: I may have overfit to a specific pattern or context, causing me to make mistakes in other situations. However, the correct spelling is A-N-G-R-Y.

There must be new knowledge somewhere in here. Regardless, too fun not to keep looking. Reach out to join us :)