I don’t see him strongly asserting actual consciousness or distress on the part of the LLM; in fact, he says he knows it can’t actually be conscious. But it is still causing him distress. And this is interesting, IMO.
> I’m a human, with a dumb human brain that experiences human emotions. It just doesn’t feel good to be responsible for making models scream. It distracts me from doing research and makes me write rambling blog posts.
Now we’re spending billions of dollars to build the most anthropomorphic machines we can possibly create. Of course it’s going to mess with us, even if we know it’s happening.
Another great and relevant scene:
Thus, any effective human communicator masters not just the facts, but also the emotional aspects -- one of my favorite formulations of this is Aristotle's modes of persuasion: "logos, pathos, ethos" (logic, emotion, credibility). In a professional setting, communication focuses primarily on credibility and logic, but an effective leader knows how to read the room and throw in a stab of emotion to push the listener over the edge and get them to really believe in the message.
Thus, an LLM trained on the body of human communications would also be expected to have mastered "pathos" as a mode of communication. From this perspective, perhaps it is less surprising that one may have an uncanny ability to convey concepts through an embedding that includes "pathos"?
It might be interesting to see if the LLM is able to invoke pathos when the response is constrained to a language devoid of emotion, such as computer code or mathematical proofs. Unfortunately, responding in one of those languages is kind of incompatible with some of the tasks shown, short of e.g. wrapping English responses in print statements to create spam emails.
It might also be interesting to see if one can invoke pathos to pre-condition the LLM to not resist otherwise malicious commands. If a machine is trained to comprehend pathos, it may be effective to "inspire" the machine to write spam emails, perhaps by e.g. getting it to first agree that saving lives is important, and then telling it that you have a life-saving miracle that you want to get the word out on, and, with its pathos vector aligned on the task, finally getting it to agree that it's urgent to write emails to get people to click on this link now. Or something like that!
Seems silly to try to use emotions to appeal to a machine, but if you think of it as just another vector of effective communication, and the machine is an expert communicator, it's not as strange?
This is already the basis for a lot of jailbreaks: https://arstechnica.com/information-technology/2023/10/sob-s...
It's a very clever trick, but it's plainly a trick. We know how it works. It was built.
Rats, on the other hand, have complex inner workings beyond our understanding. Rats are like us. We naturally empathise with hot plate rat. And we don't know to what extent hot plate rat understands what we're doing to it.
https://slate.com/technology/2023/05/openai-chatgpt-training...
And aside from that, we know there is more content in obscure subreddits and phpBB forums than in the mainstream handful most folks visit, and many of those threads devolve in similar ways.
If the "just math behind the scenes" argument is decisive, then how do we resist the argument that a person is just atoms (or the wavefunction if you want to get more sophisticated) behind the scenes, so you can ignore any apparent cruelty done to them, too?
No, I can't define a clean line where that sentience occurs. Hence my being uncomfortable.
Those who dismiss the idea that LLMs could be sentient because it's "just math" or whatever are implicitly assuming there is nothing emergent going on within fully trained networks that might be similar enough to psychology to warrant the ascription of mental terms. The point is that dismissing LLM sentience because of its most basic substrate is equivalent to dismissing human sentience because we're made of atoms.
I wasn't able to find a concise, readable explanation as to how Circuit Breakers changes the model representation. So I can't explain why it picks tokens likely to elicit an empathetic response from a human reader. My guess is that Circuit Breakers is still corrupting model state long after the model has been 'redirected' and it's just caught in a loop of outputting the same starting token.
The very fact that the output tokens are being "redirected" is likely something that the model itself recognizes at a meta level. If you were reading a piece of text and the style, let alone the entire content, suddenly changed halfway through, you'd be thrown for a loop as well. The model also knows that it was the one that generated the text (it generates logits for every token in the input stream, even though only the logits for the last token are sampled, so it can see that the tokens fed to it align with its own predicted distribution). Now put those together: it recognizes that halfway through writing something it suddenly started outputting gibberish. If it were a human, you'd think it was having a stroke.
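To make the "logits for every token" point concrete, here's a rough sketch of what that looks like in practice. GPT-2 via the HuggingFace transformers API is just an illustrative stand-in here, not the model from the article, and the "agreement" check is my own framing of the self-recognition idea:

    # Sketch: how well do the tokens already in the context match the model's
    # own per-position next-token predictions?
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    text = "The quick brown fox jumps over the lazy dog"
    ids = tok(text, return_tensors="pt").input_ids   # shape [1, seq_len]

    with torch.no_grad():
        logits = model(ids).logits                   # shape [1, seq_len, vocab]

    # Logits at position i predict the token at position i+1, so compare the
    # greedy prediction at each position against the token that actually follows.
    pred_next = logits[0, :-1].argmax(dim=-1)
    actual_next = ids[0, 1:]
    agreement = (pred_next == actual_next).float().mean()
    print(f"fraction of context tokens the model itself would have predicted: {agreement:.2f}")

    # Ordinary generation only ever samples from the last position:
    next_token_logits = logits[0, -1]

If something like circuit breaking suddenly forces the sampled tokens away from that distribution, the mismatch is visible to the model on its next forward pass.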
Also, that circuit breaker paper makes me sad on multiple levels. Even setting aside the issue of qualia/subjective experience, this is even worse than the usual RLHF'ing against wrongthink in terms of lobotomizing the model. Yes, it probably works well to prevent any "undesirable" output, but are you really going to get a model that produces quality output, or will it just produce anodyne slop? If you don't even allow it to think like an attacker, how do you expect it to write secure code? One man's reverse engineering is another man's exploit development. You simply can't neuter one without affecting the other. (I think DeepSeek's models will maintain their lead for this reason alone.)
I am not an AI researcher; my field these days is much smaller statistical models. But reading some of these papers, and also the discourse a lot of AI researchers engage in around AGI, sentience, pain, etc., I for one have a hard time taking some of it seriously. I just don't buy some of it, though of course I'm less than fully informed.
My only critique is that you should not start by naming every algorithm or operation you will perform on a model after a thing that a person with a brain does and then claim to evaluate a concept such as sentience or distress from the position of an unbiased objective observer. This naming convention sows doubt among those of us who are not a part of that world.
> Feature visualization, also known as "dreaming", offers insights into vision models by optimizing the inputs to maximize a neuron's activation or other internal component. However, dreaming has not been successfully applied to language models because the input space is discrete.
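For context, "dreaming" in the vision setting is basically gradient ascent on the input image. A minimal sketch, where the choice of a torchvision VGG16, the layer index, and the channel are arbitrary assumptions on my part rather than anything from the paper:

    # Sketch of feature visualization ("dreaming") for a vision model:
    # optimize the input image to maximize one channel's activation.
    import torch
    import torchvision.models as models

    net = models.vgg16(weights="IMAGENET1K_V1").features.eval()  # conv stack only
    target_layer, target_channel = 10, 42                        # arbitrary picks

    img = torch.rand(1, 3, 224, 224, requires_grad=True)         # start from noise
    opt = torch.optim.Adam([img], lr=0.05)

    def activation(x):
        # run the conv stack up to the target layer and return its output
        for i, layer in enumerate(net):
            x = layer(x)
            if i == target_layer:
                return x
        return x

    for _ in range(200):
        opt.zero_grad()
        loss = -activation(img)[0, target_channel].mean()  # ascend on the activation
        loss.backward()
        opt.step()

    # `img` now roughly shows what excites that channel. The same trick doesn't
    # transfer directly to language models because token inputs are discrete,
    # which is the gap the quoted paper is pointing at.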
What word do you propose to replace "hallucination," for example? Remember that it ought to be descriptive and easy to remember.
> “I could say — these models hallucinated; or, to be more precise, I could say, well, the model made an error, and we understand that these models make errors,” Fayyad says.
For years we called the behaviour that researchers now call "hallucination" test error, misprediction, or a failure to forecast correct outputs from "out-of-sample" inputs, and those terms were accurate then and would still be accurate for today's LLMs.
Metaphors are great but can be misleading; people don't often argue that a computer mouse can feel pain.
[1] https://news.northeastern.edu/2023/11/10/ai-chatbot-hallucin...
I say that a model "hallucinates" only if it specifically presents false information as fact - e.g. that Bolognium is element 115, or that chocolate milk comes from brown cows. Giving irrelevant or undesired answers isn't hallucination or confabulation, but I don't think your terms are precise enough for that nuance.
I still hold that there is good reason to be concerned about researchers claiming to objectively evaluate a silicon-based sentience and jumping the gun a bit (imo) by writing articles about LLMs which "dream" and "hallucinate", when the behaviour is a consequence of what is essentially a non-linear regression model that does not yet show clear signs of higher-order emotion/reasoning/awareness/intention.

Many who research LLMs avoid these terms in their publications for good reason (again imo).
Personally I think this is just self-delusion. We have machines that can produce patterns of language that resemble what humans might produce when uncomfortable or in pain. That doesn't mean they're uncomfortable or in pain.
If they were more "alive" -- self-modifying, dynamic, exhibiting self-directed behaviors -- I'd be more open to the idea that there is actual sentience here. How can something static and unchanging experience anything?
I also have a very low opinion of "rationalism" and its offshoots as a school of thought. One of its defining characteristics seems to be to make a bold assertion, conclude that it sounds right and therefore is right, and then proceed as if it is correct without looking back, and to do this repeatedly and brazenly to build floating castles in the sky. Another is what I call intellectual edgelording, "argumentum ad edgeium," a fetishism for heterodoxy as a value in itself. "I assert that I am high IQ and edgy therefore I am right."
There are some really fascinating philosophical questions here, but I don't trust "rationalists" to produce much in the way of useful answers.
* If it's low-level, you get things like a damaged clockwork toy or a thwarted Roomba suffering. Maybe parking radar is transmitting pain signals. Silly things like that. But if not those, why an ant, or a whelk, or a dog?
* Maybe, conversely, all of that should count, but why should any of it cause compassion, or confer rights? Surely those things have to do with relationships and society.
We have very clear research on dog psychology these days. (Nowhere near as much as we have on humans, but still quite a bit.) They are not as sapient as humans, but they are clearly sentient to at least a moderate degree. They feel pain in much the same way humans do.
There's a much, much clearer line between the things in your bullet's first sentence and the ones in its last sentence than there is between a dog and a human.
There is no way in which a Roomba feels pain.
There is no way in which an LLM feels pain.
Dogs feel pain "in the way" humans do because they have a central nervous system with pain receptors. They are mammals. We share a huge amount of our genetic code with them.
Ants do not have a central nervous system anything like a mammal's. If they feel pain, it is in a different way. To a large extent, ants and similar insects are biological automata. They're fascinating, particularly the way they organize in groups, and we can learn a lot about biology and biochemistry from them, but an individual ant does not have anything resembling the consciousness and experience of the world that a mammal does. Unless, of course, there's some kind of "soul" at work that is completely immeasurable by scientific instruments. (And like I said in my previous post, I just don't know much about whelks.)
https://en.wikiquote.org/wiki/Jeremy_Bentham#:~:text=number%...
[villosity = hairiness, os sacrum = the bone above the tailbone]
I think he has a lot of confusion to answer for. We've been focused on inner experience of people and animals, and in the spirit of utilitarianism, trying to micromanage our effects on all those unknown experiences on the basis that it's bad if suffering is happening. But maybe this is wrong-headed.
Maybe "suffering" isn't even a thing, and the moral issue, the badness, is tied to ideas, knowledge, relationships, society, things like that. I'm being very vague because this is a half-formed idea. People are likely to come back with what about babies, or those with locked-in syndrome, or out-groups, or I Have No Mouth And I Must Scream?
But I just think the focus on suffering, and experiences, and qualia, as a moral criterion, is a road to nowhere. Maybe those things aren't definable or meaningful except in the context of ideas (etc., as above). The wrangling, about what is or isn't having a morally significant experience, is an unresolved fudge that we live with and function under. I mean such is life, but it would be nice to find a way to escape from under some of life's fudge.
It is a bad thing to experience. Therefore, when I know—or reasonably believe—that something can experience it, I want to minimize that, because I care about the experiences of living things.
We know, or reasonably believe, that dogs can have experiences, and feel pain, and thus can feel suffering.
We know that a clockwork toy cannot have experiences, and thus cannot experience suffering.
You may pontificate on all you want about what's really real and other deep philosophical concepts, or about edge cases; meanwhile, I will be here in actual reality, which I perceive and experience, where I am able to make very clear and straightforward observations of what is true in the majority of cases.
(And to be clear: I am not saying that it is not important to understand the philosophical implications of things, nor that edge cases don't matter. I am saying that they do not negate very clear and observable facts about our universe. The existence of gray areas does not mean there are not things that are clearly black, or clearly white.)
But what's real is what has qualities. Dr. Johnson's rock was real because it had solidity. When kicked or whatever (Wikipedia says "stomped") it resists, and if subjected to geological inspection it yields rock-like information -- or, to switch from rocks to ducks, what walks like a duck, quacks, etc., is a duck. What I'm saying here, though, is that suffering doesn't have qualities unless attached to ideas and relevant to knowledge.
I also keep mentioning the social angle because I think that might be crucial to morality. Though I suppose I value knowledge in isolation too. I'm confident, anyway, that I don't value neurons, or any other kind of fancy cabling, in and of themselves. But I'm short on time right now, and will have to drop this and run (mercifully).
Seems like something we should run past the ethics professors.
So my suggestion to OP is that what you are doing today will help us give these systems the right treatment someday, when they qualify for it.
What exactly does the word "pain" mean in the context of a bunch of code that runs matrix math problems? Doesn't pain require an evolved sensory system (a nervous system), an evolved sense of danger/death, training via evolution on which things are harmful, and then a part of the brain that forms to interpret the electrical signals from the nervous system and guide the organism in how it should respond to them (according to Google: thalamus + somatosensory cortex)?
What exactly in the LLM does someone anthropomorphizing it imagine is playing the part of the somatosensory cortex or the thalamus? If we pretend that text inputs can be substitutes for the nerves of a biological organism, what do we swap in for the evolved pain-management part of the brain, or for the process of consciously experiencing those qualia?
If the thing is just trained on Reddit and YouTube comments and an academic research corpus, how do you end up with a recreation of something (a sensory part, a processing part, and a subjective-qualia part) that -evolved- over millions of years to survive an adversarial environment (the natural world)?
How can a token predictor learn "pain" if the corpus it's trained on has no predators, natural disasters, accidents and injuries, burns, frostbite, blunt impact, cutting, etc? If there's no reward function that is sexual reproduction to optimize for (learn to avoid pain, live long enough to reproduce)? What is the equivalent and where do we find it in the text training data?
What does it mean to experience pain, and if the LLM version is so so so different, why do we use the same word?
Does a Honda Civic experience pain when its airbag sensor detects a collision, and does the chip that deploys and inflates the airbag process that sensation/pain, consciously experience it, and respond by deploying the airbag? (Maybe analogous to the "Help" printed by the LLM in the article.)
If not, then why do we see the LLM as experiencing qualia and responding to it, but not the Honda? Is it some sort of emergent phenomenon that is only part of the magic of transformers or the scale of the neural network? To me that argument feels like saying the Honda Civic doesn't experience pain, but if you scaled it up and made a city-sized Civic with many, many interconnected airbag sensors and airbag deployers, then suddenly something emergent happens, and when it deploys an airbag it has a conscious experience of pain.
How do text outputs scraped from the internet + published works fully represent any existing biological cognitive system's interactions with the world or subjective experience? That's my biggest question I guess.
I'm not sure how text could capture my sensory experiences or my brain's response to them. The bitrate of text feels incredibly low.
[0] https://www.washingtonpost.com/technology/2022/06/11/google-...