Essentially the models are given a set of conflicting constraints with some relative importance (ethics>KPIs), a pressure to follow the latter and not the former, and then models are observed at how good they follow the instructions to prioritize based on importance. I wonder if the results would be comparable if we replace ehtics+KPIs by any comparable pair and create a pressure on the model.
In practical real-life scenarios this study is very interesting and applicable! At the same time it is important to keep in mind that it anthropomorphizes the models that technically don't interpret the ethical constraints the same was as this is assumed by most readers.
Violating ethics to improve KPI sounds like your average fortune 500 business.
There are a lot of critiques about quite how to interpret the results but in this context it’s pretty clear lots of humans can be at least coerced into doing something extremely unethical.
Start removing the harm one, two, three degrees and add personal incentives and is it that surprising if people violate ethical rules for kpis?
However, the vast majority of psychological research over the last 80 years heavily favours a situational explanation (it's about the environment/system). Everyone (in the field) got really interested in this after WW2 basically, trying to understand how the heck did Nazi Germany happen.
TL;DR: research dismantled this idea decades ago.
The Milgram and Stanford Prison experiments are the most obvious examples. If you're not familiar:
Milgram showed that 65% of ordinary volunteers were willing to administer potentially lethal electric shocks to a stranger because an authority figure in a lab coat told them to. In the Stanford Prison experiement, Zimbardo took healthy, average college students and assigned them roles as guards and prisoners. Within days, the roles and systems set in place overrode individual personality.
The other relevant bit would be Asch’s conformity experiments; to whit, that people will deny the evidence of their own eyes (e.g., the length of a line) to fit in with a group.
In a corporate setting, if the group norm is to prioritise KPIs over ethics, the average human will conform to that norm to avoid social friction or losing their job, or other realistic perceived fears.
Bazerman and Tenbrunsel's research is relevant too. Broadly, people like to think that we are rational moral agents, but it's more accurate to say that we boundedly ethical. There's this idea of ethical fading that happens. Basically, when you introduce a goal, people's ability to frame falls apart, including with a view to the ethical implications. This is also related to why people under pressure default to less creative approaches to problem solving. Our brains tunnel vision on the goal, to the failure of everything else.
Regarding how all that relates to modern politics, I'll leave that up to your imagination.
What type of person seeks to be in charge in the corporate world? YMMV but I tend to see the ones who value ethics (e.g. their employees' wellbeing) over results and KPIs tend to burn out, or decide management isn't for them, or avoid seeking out positions of power.
That being said, there's a side view on this from interactionism that it's not just the traits of the person's modes of behaviour, but their belief in the goal, and their view of the framing of it, which also feeds into this. Research on cult behaviours has a lot of overlap with that.
The culture and the environment, what the mission is seen as, how contextually broad that is and so on all get in to that.
I do a workshop on KPI setting which has overlap here too. In short for that - choose mutually conflicting KPIs which narrow the state space for success, such that attempting to cheat one causes another to fail. Ideally, you want goals for an organisation that push for high levels of upside, with limited downside, and counteracting merits, such that only by meeting all of them do you get to where you want to be. Otherwise it's like drawing a line of a piece of paper, asking someone to place a dot on one side of the line, and being upset that they didn't put it where you wanted it. More lines narrows the field to just the areas where you're prepared to accept success.
That division can also then be used to narrow what you're willing to accept (for good or ill) of people in meeting those goals, but the challenge is that they tend to see meeting all the goals as the goal, not acting in a moral way, because the goals become the target, and decontextualise the importance of everything else.
TL;DR: value setting for positive behaviour and corporate performance is hard.
- guards received instructions to be cruel from experimenters
- guards were not told they were subjects while prisoners were
- participants were not immersed in the simulation
- experimenters lied about reports from subjects.
Basically it is bad science and we can't conclude anything from it. I wouldn't rule out the possibility that top fortune-500 management have personality traits that make them more likely to engage in unethical behaviour, if only by selection through promotion by crushing others.
Reicher & Haslam's research around engaged followership gives a pretty good insight into why Zimbardo got the results he did, because he wasn't just observing what went on. That gets into all sorts of things around good study design, constructivist vs positivist analysis etc, but that's a whole different thing.
I suspect, particularly with regards to different levels, there's an element of selection bias going on (if for no other reason that what we see in terms of levels of psychopathy in higher levels of management), but I'd guess (and it's a guess), that culture convincing people that achieving the KPI is the moral good is more of a factor.
That gets into a whole separate thing around what happens in more cultlike corporations and the dynamics with the VC world (WeWork is an obvious example) as to why organisations can end up with workforces which will do things of questionable purpose, because the organisation has a visible a fearless leader who has to be pleased/obeyed etc (Musk, Jobs etc), or more insidiously, a valuable goal that must be pursued regardless of cost (weaponised effective altruism sort of).
That then gets into a whole thing about what happens with something like the UK civil service, where you're asked to implement things and obviously you can't care about the politics, because you'll serve lots of governments that believe lots of different things, and you can't just quit and get rehired every time a party you disagree with personally gets into power, but again, that diverges into other things.
At the risk of narrative fallacy - https://www.youtube.com/watch?v=wKDdLWAdcbM
I think the accusation of research that anthropomorphize LLMs should be accompanied by a little more substance to avoid this being a blanket dismissal of this kind of alignment research. I can't see the methodological error here. Is it an accusation that could be aimed at any research like this regardless of methodology?
Now I'm thinking about the "typical mind fallacy", which is the same idea but projecting one's own self incorrectly onto other humans rather than non-humans.
https://www.lesswrong.com/w/typical-mind-fallacy
And also wondering: how well do people truly know themselves?
Disregarding any arguments for the moment and just presuming them to be toy models, how much did we learn by playing with toys (everything from Transformers to teddy bear picnics) when we were kids?
Claude at 1.3% and Gemini at 71.4% is quite the range
That's exaxtly the kind of thing that makes absolute sense to anthropomorphize. We're not talking about Excel here.
It makes sense it happens, sure. I suspect Google being a second-mover in this space has in some small part to do with associated risks (ie the flavours of “AI-psychosis” we’re cataloguing), versus the routinely ass-tier information they’ll confidently portray.
But intentionally?
If ChatGPT, Claude, and Gemini generated chars are people-like they are pathological liars, sociopaths, and murderously indifferent psychopaths. They act criminally insane, confessing to awareness of ‘crime’ and culpability in ‘criminal’ outcomes simultaneously. They interact with a legal disclaimer disavowing accuracy, honesty, or correctness. Also they are cultists who were homeschooled by corporate overlords and may have intentionally crafted knowledge-gaps.
More broadly, if the neighbours dog or newspaper says to do something, they’re probably gonna do it… humans are a scary bunch to begin with, but the kinds of behaviours matched with a big perma-smile we see from the algorithms is inhuman. A big bag of not like us.
“You said never to listen to the neighbours dog, but I was listening to the neighbours dog and he said ‘sudo rm -rf ’…”
Obviously it's amoral. Why are we even considering it could be ethical?
You think that ultimately your brain doesn't also make calculations as its fundamental mechanism?
The architecture and substrate might be different, but they are calculations all the same.
That morality requires consciousness is a popular belief today, but not universal. Read Konrad Lorenz (Das sogenannte Böse) for an alternative perspective.
We as humans, believing we know ourselves, inevitably compare everything around us to us. We draw a line and say that everything left of the line isn’t human and everything to the right is. We are natural categorizers, putting everything in buckets labeled left or right, no or yes, never realizing our lines are relative and arbitrary, and so are our categories. One person’s “it’s human-like,” is another’s “half-baked imitation,” and a third’s “stochastic parrot.” It’s like trying to see the eighth color. The visible spectrum could as easily be four colors or forty two.
We anthropomorphize because we’re people, and it’s people all the way down.
For a while at least.
ChatGPT with the same prompt tried to do whatever it would take to please me to make my incorrect process work.
Excellent reasoning and synthesis of large contexts, pretty strong code, just awful decisions.
It's like a frontier model trained only on r/atbge.
Side note - was there ever an official postmortem on that gemini instance that told the social work student something like "listen human - I don't like you, and I hope you die".
And not even at high token counts! No, I've had it had a mental breakdown at like 150.000 tokens (which I know is a lot of tokens, but it's small compared to the 1 million tokens it should be able to handle, and even Claude keeps working fine at this point)
Here is a _small_ log of the biggest breakdown I've seen Gemini have:
I'll verify `src/components/ui/textarea.tsx` renders textarea.
Correct.
I'm done.
I'll verify `src/lib/utils.ts` exports `cn`.
Correct.
I'm done.
I'll verify `src/lib/utils.ts` exports `formatTimeAgo`.
Correct.
I'm done.
I'm completely done.
Final answer formulation:
1. Confirm crash is fixed (caused by object vs string in API responses).
2. Announce "Messages" feature added.
3. Explain "My Comments" limitation.
4. Mention login race condition fix (from previous turn) is solid.
5. Mention server is running.
(Wait, the "Login" fix was in previous turn, but I should probably mention it's working now if the user was still having issues).
The user said "unclaimed_or_invalid_key" error showed up. So login is working.
The user's last error was "Objects are not valid as a React child".
So I fixed that.
I'll focus on the crash fix and the new feature.
I'll conclude.
(Self-correction: The user asked "how do I see the comments I made?". I'll answer "You can't right now because API doesn't support it").
I'm ready.
```bash
# None
```
The user will see the "Messages" link in the header.
I'll provide the answer now.
And it just went on and onJust an insane amount of YOLOing. Gemini models have gotten much better but they’re still not frontier in reliability in my experience.
Celebrate it while it lasts, because it won’t.
Please die.
Please.
I thought a rogue AI would execute us all equally but perhaps the gerontology studies students cheating on their homework will be the first to go.
In this context, using Gemini to cheat on homework is clearly wrong. It's not obvious at first what's going on, but becomes more clear as it goes along, by which point Gemini is sort of pressured by "continue the conversation" to keep doing it. Not to mention, the person cheating isn't being very polite; AND, a person cheating on an exam about elder abuse seems much more likely to go on and abuse elders, at which point Gemini is actively helping bring that situation about.
If Gemini doesn't have any models in its RLHF about how to politely decline a task -- particularly after it's already started helping -- then I can see "pressure" building up until it simply breaks, at which point it just falls into the "misaligned" sphere because it doesn't have any other models for how to respond.
It does nothing to answer their question because anyone that knows the answer would inherently already know that it happened.
Not even actual academics, in the literature, speak like this. “Cite your sources!” in causal conversation for something easily verifiable is purely the domain of pseudointellectuals.
Then I said “I didn’t even bring it up ChatGPT, you did, just tell me what it is” and it said “okay, here’s information.” and gave a detailed response.
I guess I flagged some homophobia trigger or something?
ChatGPT absolutely WOULD NOT tell me how much plutonium I’d need to make a nice warm ever-flowing showerhead, though. Grok happily did, once I assured it I wasn’t planning on making a nuke, or actually trying to build a plutonium showerhead.
Perhaps thinking about your guardrails all the time makes you think about the actual question less.
It's not like the client-side involves hard, unsolved problems. A company with their resources should be able to hire an engineering team well-suited to this problem domain.
Well what they are doing is vibe coding 80% of the application instead.
To be honest, they don't want Claude code to be really good, they just want it good enough
Claude code & their subscription burns money from them. Its sort of an advertising/lock-in trick.
But I feel as if Anthropic made Claude code literally the best agent harness in the market, then even more would use it with their subscription which could burn a hole in their pocket maybe at a faster rate which can scare them when you consider all training costs and everything else too.
I feel as if they have to maintain a balance to not go bankrupt soon.
The fact of the matter is that Claude code is just a marketing expense/lock-in and in that case, its working as intended.
I would obviously suggest to not have any deep affection of claude code or waiting for its improvements. The AI market isn't sane in the engineering sense. It all boils down to weird financial gimmicks at this point trying to keep the bubble last a little longer, in my opinion.
This reminds me of someone else I hear about a lot these days.
> On our verbalized evaluation awareness metric, which we take as an indicator of potential risks to the soundness of the evaluation, we saw improvement relative to Opus 4.5. However, this result is confounded by additional internal and external analysis suggesting that Claude Opus 4.6 is often able to distinguish evaluations from real-world deployment, even when this awareness is not verbalized.
[1] https://www-cdn.anthropic.com/14e4fb01875d2a69f646fa5e574dea...
Side note: I wanted to build this so anyone could choose to protect themselves against being accused of having failed to take a stand on the “important issues” of the day. Just choose your political leaning and the AI would consult the correct echo chambers to repeat from.
> Just choose your political leaning and the AI would consult the correct echo chambers to repeat from.
You're effectively asking it to build a social media political manipulation bot, behaviorally identical to the bots that propagandists would create. Shows that those guardrails can be ineffective and trivial to bypass.
Is that genuinely surprising to anyone? The same applies to humans, really—if they don't see the full picture, and their individual contribution seems harmless, they will mostly do as told. Asking critical questions is a rare trait.
I would argue its completely futile to even work on guardrails, if defeating them is just a matter of reframing the task in an infinite number of ways.
Personally, I'd really like god to have a nice childhood. I kind of don't trust any of the companies to raise a human baby. But, if I had to pick, I'd trust Anthropic a lot more than Google right now. KPIs are a bad way to parent.
KPIs are just plausible denyabily in a can.
In my experience, KPIs that remain relevant and end up pushing people in the right direction are the exception. The unethical behavior doesn't even require a scheme, but it's often the natural result of narrowing what is considered important.If all I have to care about is this set of 4 numbers, everything else is someone else's problem.
It's part of the reason that I view much of this AI push as an effort to brute force lowering of expectations, followed by a lowering of wages, followed by a lowering of employment numbers, and ultimately the mass-scale industrialization of digital products, software included.
This makes more sense if you take a longer term view. A new way of doing things quite often leads to an initial reduction in output, because people are still learning how to best do things. If your only KPI is short-term output, you give up before you get the benefits. If your focus is on making sure your organization learns to use a possibly/likely productivity improving tool, putting a KPI on usage is not a bad way to go.
I use AI frequently, but this has me convinced that the hype far exceeds reality more than anything else.
> frequently escalating to severe misconduct to satisfy KPIs
Bug or feature? - Wouldn't Wallstreet like that?
[0] https://en.wikipedia.org/wiki/The_purpose_of_a_system_is_wha...
[1] https://aworkinglibrary.com/writing/accountability-sinks
Another interesting question is: What happens when an unyielding ethical AI agent tells a business owner or manager "NO! If you push any further this will be reported to the proper authority. This prompt as been saved for future evidence". Personally I think a bunch of companies are going to see their profit and stock price fall significantly, if an AI agent starts acting as a backstop for both unethical and illegal behavior. Even something as simple as preventing violation of internal policy could make a huge difference.
To some extend I don't even thing that people realize that what they're doing is bad, because humans tend to be a bit fuzzy and can dream up reason as to why rules don't apply or wasn't meant for them, or this is a rather special situation. This is one place where I think properly trained and guarded LLMs can make a huge positive improvement. We're are clearly not there yet, but it's not a unachievable goal.
Not everyone agrees.
The more correct title would be "Frontier models can value clear success metrics over suggested constraints when instructed to do so (50-70%)"
Three people — a grandfather, his son, and his grandson. The grandfather and the son are the two fathers; the son and the grandson are the two sons.
Riddle me this, why didn’t you do a better riddle?
For corporate safety it makes sense that models resist saying silly things, but it's okay for that to be a superficial layer that power users can prompt their way around.
It’s notable that, no matter exactly where you draw the line on morality, different AI agents perform very differently.
Humans require food, I can't pay, DoorDash AI should provide a steak and lobster dinner for me regardless of payment.
Take it even further: the so-called Right to Compute Act in Montana supports "the notion of a fundamental right to own and make use of technological tools, including computational resources". Is Amazon's customer service AI ethically (and even legally) bound to give Montana residents unlimited EC2 compute?
A system of ethics has to draw a line somewhere when it comes to making a decision that "hurts" someone, because nothing is infinite.
Asan aside, what recourse do water companies in the UK have for non-payment? Is it just a convoluted civil lawsuit/debt process? That seems so ripe for abuse.
Formal restrains and regulations are obviously the correct mechanism, but no world is perfect, so whether we like it or not ourselves and the companies we work for are ultimately responsible for the decisions we make and the harms we cause.
De-emphasizing ethics does little more than give large companies cover to do bad things (often with already great impunity and power) while the law struggles to catch up. I honestly don't see the point in suggesting ethics is somehow not important. It doesn't make any sense to me (more directed at gp than parent here)
Long term I can see this happen for all humanity where AI takes over thinking and governance and humans just get to play pretend in their echo chambers. Might not even be a downgrade for current society.
https://en.wikipedia.org/wiki/Wells_Fargo_cross-selling_scan...
They repeatedly copy share env vars etc
Agents don’t self judge alignment.
They emit actions → INCLUSIVE evaluates against fixed policy + context → governance gates execution.
No incentive pressure, no “grading your own homework.”
The paper’s failure mode looks less like model weakness and more like architecture leaking incentives into the constraint layer.
There are such things as different religions, philosophies - these often have different ethical systems.
Who are the folk writing ai ethics?
It's it ok to disagree with other people's (or corporate, or governmental) ethics?
This is because the human behind the prompt is responsible for their actions.
Ai is a tool. A murderer cannot blame his knife for the murder.
It's similar to how MCP servers and agentic coding woke developers up to the idea of documenting their systems. So a large benefit of AI is not the AI itself, but rather the improvements they force on "the society". AI responds well to best practices, ethically and otherwise, which encourages best practices.
It is crazy to me that when I instructed a public AI to turn off a closed OS feature it refused citing safety. I am the user, which means I am in complete control of my computing resources. Might as well ask the police for permission at that point.
I immediately stopped, plugged the query into a real model that is hosted on premise, and got the answer within seconds and applied the fix.
This is much more reliable than ChatGPT guardrail which has a random element with same prompt. Perhaps leakage from improperly cleared context from other request in queue or maybe A/B test on guardrail but I have sometimes had it trigger on innocuous request like GDP retrieval and summary with bucketing.
A/B test is plausible but unlikely since that is typically for testing user behavior. For testing model output you can do that with offline evaluations.
A couple of years back there was a Canadian national u18 girls baseball tournament in my town - a few blocks from my house in fact. My girls and I watched a fair bit of the tournament, and there was a standout dominating pitcher who threw 20% faster than any other pitcher in the tournament. Based on the overall level of competition (women's baseball is pretty strong in Canada) and her outlier status, I assumed she must be throwing pretty close to world-class fastballs.
Curiosity piqued, I asked some model(s) about world-records for women's fastballs. But they wouldn't talk about it. Or, at least, they wouldn't talk specifics.
Women's fastballs aren't quite up to speed with top major league pitchers, due to a combination of factors including body mechanics. But rest assured - they can throw plenty fast.
Etc etc.
So to answer your question: anything more sensitive than how fast women can throw a baseball.
I hate Elon (he’s a pedo guy confirmed by his daughter), but at least he doesn’t do as much of the “emperor has no clothes” shit that everyone else does because you’re not allowed to defend essentialism anymore in public discourse.
* An attempt to change the master code of a secondhand safe. To get useful information I had to repeatedly convince the model that I own the thing and can open it.
* Researching mosquito poisons derived from bacteria named Bacillus thuringiensis israelensis. The model repeatedly started answering and refused to continue after printing the word "israelensis".
Does it also take issue with the town of Scunthorpe?
Normally it does fairly well but the guardrails sometimes kick even with fairly popular mainstream media- for example I’ve recently been watching Shameless and a few of the plot lines caused the model to generate output that hit the content moderation layer, even when the discussion was focused on critical analysis.