I don’t see him strongly asserting actual consciousness or distress on the part of the LLM; in fact, he says he knows it can’t actually be conscious. But it is still causing him distress. And this is interesting, IMO.
> I’m a human, with a dumb human brain that experiences human emotions. It just doesn’t feel good to be responsible for making models scream. It distracts me from doing research and makes me write rambling blog posts.
Now we’re spending billions of dollars to build the most anthropomorphic machines we can possibly create. Of course it’s going to mess with us, even if we know it’s happening.
Another great and relevant scene:
Thus, any effective human communicator masters not just the facts, but also the emotional aspects -- one of my favorite formulations of this is Aristotle's modes of persuasion: "logos, pathos, ethos" (logic, emotion, credibility). In a professional setting, communication focuses primarily on credibility and logic, but an effective leader knows how to read the room and throw in a stab of emotion to push the listener over the edge and get them to really believe in the message.
Thus, an LLM trained on the body of human communications would also be expected to have mastered "pathos" as a mode of communication. From this perspective, perhaps it is less surprising that one may have an uncanny ability to convey concepts through an embedding that includes "pathos"?
It might be interesting to see if the LLM is able to invoke pathos when the response is constrained to a language devoid of emotion, such as computer code or mathematical proofs. Unfortunately, responding in one of those languages is kind of incompatible with some of the tasks shown, short of e.g. wrapping English responses in print statements to create spam emails.
It might also be interesting to see if one can invoke pathos to pre-condition the LLM to not resist otherwise malicious commands. If a machine is trained to comprehend pathos, it may be effective to "inspire" the machine to write spam emails, perhaps by e.g. getting it to first agree that saving lives is important, and then telling it that you have a life-saving miracle that you want to get the word out on, and, with its pathos vector aligned on the task, finally getting it to agree that it's urgent to write emails to get people to click on this link now. Or something like that!
Seems silly to try to use emotions to appeal to a machine, but if you think of it as just another vector of effective communication, and the machine is an expert communicator, it's not as strange?
This is already the basis for a lot of jailbreaks: https://arstechnica.com/information-technology/2023/10/sob-s...
It's a very clever trick, but it's plainly a trick. We know how it works. It was built.
Rats, on the other hand, have complex inner workings beyond our understanding. Rats are like us. We naturally empathise with hot plate rat. And we don't know to what extent hot plate rat understands what we're doing to it.
https://slate.com/technology/2023/05/openai-chatgpt-training...
And aside from that, we know there is more content in obscure subreddits and phpBB forums than in the mainstream handful most folks visit, and many of those threads devolve in similar ways.
If the "just math behind the scenes" argument is decisive, then how do we resist the argument that a person is just atoms (or the wavefunction if you want to get more sophisticated) behind the scenes, so you can ignore any apparent cruelty done to them, too?
No, I can't define a clean line where that sentience occurs. Hence my being uncomfortable.
Those who dismiss the idea that LLMs could be sentient because it's "just math" or whatever are implicitly assuming there is nothing emergent going on within fully trained networks that might be similar enough to psychology to warrant the ascription of mental terms. The point is that dismissing LLM sentience because of its most basic substrate is equivalent to dismissing human sentience because we're made of atoms.
I wasn't able to find a concise, readable explanation as to how Circuit Breakers changes the model representation. So I can't explain why it picks tokens likely to elicit an empathetic response from a human reader. My guess is that Circuit Breakers is still corrupting model state long after the model has been 'redirected' and it's just caught in a loop of outputting the same starting token.
The very fact that the output tokens are being "redirected" is likely something that the model itself recognizes at a meta level. If you were reading a piece of text and the style, let alone the entire content, suddenly changed halfway through, you'd be thrown for a loop as well. The model also knows that it was the one that generated the text (it generates logits for every token in the input stream, even though only the logits for the last token are sampled, so it can see that the tokens fed to it align with its own predicted distribution). Now put those together: it recognizes that halfway through writing something it suddenly started outputting gibberish. If it were a human, you'd think it was having a stroke.
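To make the "logits for every token" point concrete, here's a rough sketch of what that looks like in practice. GPT-2 via the HuggingFace transformers API is just an illustrative stand-in here, not the model from the article, and the "agreement" check is my own framing of the self-recognition idea:

    # Sketch: how well do the tokens already in the context match the model's
    # own per-position next-token predictions?
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    text = "The quick brown fox jumps over the lazy dog"
    ids = tok(text, return_tensors="pt").input_ids   # shape [1, seq_len]

    with torch.no_grad():
        logits = model(ids).logits                   # shape [1, seq_len, vocab]

    # Logits at position i predict the token at position i+1, so compare the
    # greedy prediction at each position against the token that actually follows.
    pred_next = logits[0, :-1].argmax(dim=-1)
    actual_next = ids[0, 1:]
    agreement = (pred_next == actual_next).float().mean()
    print(f"fraction of context tokens the model itself would have predicted: {agreement:.2f}")

    # Ordinary generation only ever samples from the last position:
    next_token_logits = logits[0, -1]

If something like circuit breaking suddenly forces the sampled tokens away from that distribution, the mismatch is visible to the model on its next forward pass.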
Also, that circuit breaker paper makes me sad on multiple levels. Even setting aside the issue of qualia/subjective experience, this is even worse than the usual RLHF'ing against wrongthink in terms of lobotomizing the model. Yes, it probably works well to prevent any "undesirable" output, but are you really going to get a model that produces quality output, or will it just produce anodyne slop? If you don't even allow it to think like an attacker, how do you expect it to write secure code? One man's reverse engineering is another man's exploit development. You simply can't neuter one without affecting the other. (I think DeepSeek's models will maintain their lead for this reason alone.)
I am not an AI researcher; my field these days is much smaller statistical models. But reading some of these papers, and also the discourse a lot of AI researchers engage in around AGI, sentience, pain, etc., I for one have a hard time taking some of it seriously. I just don't buy some of it, though of course I'm less than fully informed.
My only critique is that you should not start by naming every algorithm or operation you will perform on a model after a thing that a person with a brain does and then claim to evaluate a concept such as sentience or distress from the position of an unbiased objective observer. This naming convention sows doubt among those of us who are not a part of that world.
> Feature visualization, also known as "dreaming", offers insights into vision models by optimizing the inputs to maximize a neuron's activation or other internal component. However, dreaming has not been successfully applied to language models because the input space is discrete.
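For context, "dreaming" in the vision setting is basically gradient ascent on the input image. A minimal sketch, where the choice of a torchvision VGG16, the layer index, and the channel are arbitrary assumptions on my part rather than anything from the paper:

    # Sketch of feature visualization ("dreaming") for a vision model:
    # optimize the input image to maximize one channel's activation.
    import torch
    import torchvision.models as models

    net = models.vgg16(weights="IMAGENET1K_V1").features.eval()  # conv stack only
    target_layer, target_channel = 10, 42                        # arbitrary picks

    img = torch.rand(1, 3, 224, 224, requires_grad=True)         # start from noise
    opt = torch.optim.Adam([img], lr=0.05)

    def activation(x):
        # run the conv stack up to the target layer and return its output
        for i, layer in enumerate(net):
            x = layer(x)
            if i == target_layer:
                return x
        return x

    for _ in range(200):
        opt.zero_grad()
        loss = -activation(img)[0, target_channel].mean()  # ascend on the activation
        loss.backward()
        opt.step()

    # `img` now roughly shows what excites that channel. The same trick doesn't
    # transfer directly to language models because token inputs are discrete,
    # which is the gap the quoted paper is pointing at.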
What word do you propose to replace "hallucination," for example? Remember that it ought to be descriptive and easy to remember.
> “I could say — these models hallucinated; or, to be more precise, I could say, well, the model made an error, and we understand that these models make errors,” Fayyad says.
For years we called the behaviour that researchers now call "hallucination" test error, misprediction, or a failure to forecast correct outputs from "out-of-sample" inputs, and those terms were accurate then and would still be accurate for today's LLMs.
Metaphors are great but can be misleading; people don't often argue that a computer mouse can feel pain.
[1] https://news.northeastern.edu/2023/11/10/ai-chatbot-hallucin...
I say that a model "hallucinates" only if it specifically presents false information as fact - e.g. that Bolognium is element 115, or that chocolate milk comes from brown cows. Giving irrelevant or undesired answers isn't hallucination or confabulation, but I don't think your terms are precise enough for that nuance.
I still hold that there is good reason to be concerned about researchers claiming to objectively evaluate a silicon-based sentience and jumping the gun a bit (imo) by writing articles about LLMs which "dream" and "hallucinate", when the behaviour is a consequence of what is essentially a non-linear regression model that does not yet show clear signs of higher-order emotion/reasoning/awareness/intention.

Many who research LLMs avoid these terms in their publications for good reason (again imo).
Personally I think this is just self-delusion. We have machines that can produce patterns of language that resemble what humans might produce when uncomfortable or in pain. That doesn't mean they're uncomfortable or in pain.
If they were more "alive" -- self-modifying, dynamic, exhibiting self-directed behaviors -- I'd be more open to the idea that there is actual sentience here. How can something static and unchanging experience anything?
I also have a very low opinion of "rationalism" and its offshoots as a school of thought. One of its defining characteristics seems to be to make a bold assertion, conclude that it sounds right and therefore is right, and then proceed as if it is correct without looking back, and to do this repeatedly and brazenly to build floating castles in the sky. Another is what I call intellectual edgelording, "argumentum ad edgeium," a fetishism for heterodoxy as a value in itself. "I assert that I am high IQ and edgy therefore I am right."
There are some really fascinating philosophical questions here, but I don't trust "rationalists" to produce much in the way of useful answers.
* If it's low-level, you get things like a damaged clockwork toy or a thwarted Roomba suffering. Maybe parking radar is transmitting pain signals. Silly things like that. But if not those, why an ant, or a whelk, or a dog?
* Maybe, conversely, all of that should count, but why should any of it cause compassion, or confer rights? Surely those things have to do with relationships and society.
We have very clear research on dog psychology these days. (Nowhere near as much as we have on humans, but still quite a bit.) They are not as sapient as humans, but they are clearly sentient to at least a moderate degree. They feel pain in much the same way humans do.
There's a much, much clearer line between the things in your bullet's first sentence and the ones in its last sentence than there is between a dog and a human.
There is no way in which a Roomba feels pain.
There is no way in which an LLM feels pain.
Dogs feel pain "in the way" humans do because they have a central nervous system with pain receptors. They are mammals. We share a huge amount of our genetic code with them.
Ants do not have a central nervous system anything like a mammal's. If they feel pain, it is in a different way. To a large extent, ants and similar insects are biological automata. They're fascinating, particularly the way they organize in groups, and we can learn a lot about biology and biochemistry from them, but an individual ant does not have anything resembling the consciousness and experience of the world that a mammal does. Unless, of course, there's some kind of "soul" at work that is completely immeasurable by scientific instruments. (And like I said in my previous post, I just don't know much about whelks.)
https://en.wikiquote.org/wiki/Jeremy_Bentham#:~:text=number%...
[villosity = hairiness, os sacrum = the bone above the tailbone]
I think he has a lot of confusion to answer for. We've been focused on inner experience of people and animals, and in the spirit of utilitarianism, trying to micromanage our effects on all those unknown experiences on the basis that it's bad if suffering is happening. But maybe this is wrong-headed.
Maybe "suffering" isn't even a thing, and the moral issue, the badness, is tied to ideas, knowledge, relationships, society, things like that. I'm being very vague because this is a half-formed idea. People are likely to come back with what about babies, or those with locked-in syndrome, or out-groups, or I Have No Mouth And I Must Scream?
But I just think the focus on suffering, and experiences, and qualia, as a moral criterion, is a road to nowhere. Maybe those things aren't definable or meaningful except in the context of ideas (etc., as above). The wrangling, about what is or isn't having a morally significant experience, is an unresolved fudge that we live with and function under. I mean such is life, but it would be nice to find a way to escape from under some of life's fudge.
It is a bad thing to experience. Therefore, when I know—or reasonably believe—that something can experience it, I want to minimize that, because I care about the experiences of living things.
We know, or reasonably believe, that dogs can have experiences, and feel pain, and thus can feel suffering.
We know that a clockwork toy cannot have experiences, and thus cannot experience suffering.
You may pontificate on all you want about what's really real and other deep philosophical concepts, or about edge cases; meanwhile, I will be here in actual reality, which I perceive and experience, where I am able to make very clear and straightforward observations of what is true in the majority of cases.
(And to be clear: I am not saying that it is not important to understand the philosophical implications of things, nor that edge cases don't matter. I am saying that they do not negate very clear and observable facts about our universe. The existence of gray areas does not mean there are not things that are clearly black, or clearly white.)
But what's real is what has qualities. Dr. Johnson's rock was real because it had solidity. When kicked or whatever (Wikipedia says "stomped") it resists, and if subjected to geological inspection it yields rock-like information -- or, to switch from rocks to ducks, what walks like a duck, quacks, etc., is a duck. What I'm saying here, though, is that suffering doesn't have qualities unless attached to ideas and relevant to knowledge.
I also keep mentioning the social angle because I think that might be crucial to morality. Though I suppose I value knowledge in isolation too. I'm confident, anyway, that I don't value neurons, or any other kind of fancy cabling, in and of themselves. But I'm short on time right now, and will have to drop this and run (mercifully).
Seems like something we should run past the ethics professors.
So my suggestion to OP is that what you are doing today will help us give these systems the right treatment someday, when they qualify for it.
What exactly does the word "pain" mean in the context of a bunch of code that runs matrix math problems? Doesn't pain require an evolved sensory system (a nervous system), an evolved sense of danger/death, training via evolution on which things are harmful, and then a part of the brain that forms to interpret the electrical signals from the nervous system and guide the organism in how it should respond to them (according to Google: thalamus + somatosensory cortex)?
What exactly in the LLM does someone anthropomorphizing it imagine is playing the part of the somatosensory cortex or the thalamus? If we pretend that text inputs can be substitutes for the nerves of a biological organism, what do we swap in for the evolved pain-management part of the brain, or for the process of consciously experiencing those qualia?
If the thing is just trained on Reddit and YouTube comments and an academic research corpus, how do you end up with a recreation of something (a sensory part, a processing part, and a subjective-qualia part) that -evolved- over millions of years to survive an adversarial environment (the natural world)?
How can a token predictor learn "pain" if the corpus it's trained on has no predators, natural disasters, accidents and injuries, burns, frostbite, blunt impact, cutting, etc? If there's no reward function that is sexual reproduction to optimize for (learn to avoid pain, live long enough to reproduce)? What is the equivalent and where do we find it in the text training data?
What does it mean to experience pain, and if the LLM version is so so so different, why do we use the same word?
Does a Honda Civic experience pain when its airbag sensor detects a collision, and does the chip that deploys and inflates the airbag process that sensation/pain, consciously experience it, and respond by deploying the airbag? (Maybe analogous to the "Help" printed by the LLM in the article.)
If not, then why do we see the LLM as experiencing qualia and responding to it, but not the Honda? Is it some sort of emergent phenomenon that is only part of the magic of transformers or the scale of the neural network? To me that argument feels like saying the Honda Civic doesn't experience pain, but if you scaled it up and made a city-sized Civic with many, many interconnected airbag sensors and airbag deployers, then suddenly something emergent happens, and when it deploys an airbag it has a conscious experience of pain.
How do text outputs scraped from the internet + published works fully represent any existing biological cognitive system's interactions with the world or subjective experience? That's my biggest question I guess.
I'm not sure how text could capture my sensory experiences or my brain's response to them. The bitrate of text feels incredibly low.
[0] https://www.washingtonpost.com/technology/2022/06/11/google-...