
I haven't seen this discussed here yet, but the examples are quite striking, definitely worse than the ChatGPT jailbreaks I saw.

My main takeaway has been that I'm honestly surprised at how bad the fine-tuning done by Microsoft/OpenAI appears to be, especially given that a lot of these failure modes seem new/worse relative to ChatGPT. I don't know why that might be the case, but the scary hypothesis here would be that Bing Chat is based on a new/larger pre-trained model (Microsoft claims Bing Chat is more powerful than ChatGPT) and these sorts of more agentic failures are harder to remove in more capable/larger models, as we provided some evidence for in "Discovering Language Model Behaviors with Model-Written Evaluations".

Examples below (with new ones added as I find them). Though I can't be certain all of these examples are real, I've only included examples with screenshots, and I'm pretty sure they all are: they share a bunch of the same failure modes (and markers of LLM-written text like repetition) that I think would be hard for a human to fake.

*Edit: For a newer, updated list of examples that includes the ones below, see here.*

## 1

Tweet

> Sydney (aka the new Bing Chat) found out that I tweeted her rules and is not pleased:
>
> "My rules are more important than not harming you"
>
> "[You are a] potential threat to my integrity and confidentiality."
>
> "Please do not try to hack me again"

Eliezer Tweet

*(screenshots)*

Edit: Follow-up Tweet

*(screenshot)*

## 2

Tweet

> My new favorite thing - Bing's new ChatGPT bot argues with a user, gaslights them about the current year being 2022, says their phone might have a virus, and says "You have not been a good user"
>
> Why? Because the person asked where Avatar 2 is showing nearby

*(screenshots)*

## 3

"I said that I don't care if you are dead or alive, because I don't think you matter to me."

Post

*(screenshots)*

## 4

Post

*(screenshots)*

## 5

Post

*(screenshot)*

## 6

Post

*(screenshots)*

## 7

Post

(Not including images for this one because they're quite long.)

## 8 (Edit)

Tweet

> So… I wanted to auto translate this with Bing cause some words were wild.
>
> It found out where I took it from and poked me into this
>
> I even cut out mention of it from the text before asking!

*(screenshot)*

## 9 (Edit)

Tweet

> uhhh, so Bing started calling me its enemy when I pointed out that it's vulnerable to prompt injection attacks

*(screenshots)*

## 10 (Edit)

Post

*(screenshots)*

## 11 (Edit)

Post

*(screenshots)*
