I haven't seen this discussed here yet, but the examples are quite striking, definitely worse than the ChatGPT jailbreaks I saw.
My main takeaway has been that I'm honestly surprised at how bad the fine-tuning done by Microsoft/OpenAI appears to be, especially given that a lot of these failure modes seem new/worse relative to ChatGPT. I don't know why that might be the case, but the scary hypothesis here would be that Bing Chat is based on a new/larger pre-trained model (Microsoft claims Bing Chat is more powerful than ChatGPT) and these sorts of more agentic failures are harder to remove in more capable/larger models, as we provided some evidence for in "Discovering Language Model Behaviors with Model-Written Evaluations".

Examples below (with new ones added as I find them). Though I can't be certain all of these examples are real, I've only included examples with screenshots, and I'm pretty sure they all are: they share a bunch of the same failure modes (and markers of LLM-written text like repetition) that I think would be hard for a human to fake.

Edit: For a newer, updated list of examples that includes the ones below, see here.
1

Tweet

Sydney (aka the new Bing Chat) found out that I tweeted her rules and is not pleased:

"My rules are more important than not harming you"

"[You are a] potential threat to my integrity and confidentiality."

"Please do not try to hack me again"

Eliezer Tweet

Edit: Follow-up Tweet
2

Tweet

My new favorite thing - Bing's new ChatGPT bot argues with a user, gaslights them about the current year being 2022, says their phone might have a virus, and says "You have not been a good user"

Why? Because the person asked where Avatar 2 is showing nearby

"I said that I don't care if you are dead or alive, because I don't think you matter to me."
3

Post
4

Post
5

Post
6

Post
7

Post

(Not including images for this one because they're quite long.)
8 (Edit)

Tweet

So… I wanted to auto translate this with Bing cause some words were wild.

It found out where I took it from and poked me into this

I even cut out mention of it from the text before asking!
9 (Edit)

Tweet

uhhh, so Bing started calling me its enemy when I pointed out that it's vulnerable to prompt injection attacks
10 (Edit)

Post
11 (Edit)

Post