What failure looks like

paulfchristiano

I think AI risk is disjunctive enough that it's not clear most of the probability mass can be captured by a single scenario/story, even as broad as this one tries to be. Here are some additional scenarios that don't fit into this story or aren't made very salient by it.

AI-powered memetic warfare makes all humans effectively insane.
Humans break off into various groups to colonize the universe with the help of their AIs. Due to insufficient "metaphilosophical paternalism", they each construct their own version of utopia which is either directly bad (i.e., some of the "utopias" are objectively terrible or subjectively terrible according to my values), or bad because of opportunity costs.
AI-powered economies have much higher economies of scale because AIs don't suffer from the kind of coordination costs that humans have (e.g., they can merge their utility functions and become clones of each other). Some countries may try to prevent AI-managed companies from merging for ideological or safety reasons, but others (in order to gain a competitive advantage on the world stage) will basically allow their whole economy to be controlled by one AI, which eventually achieves a decisive advantage over the rest of humanity and does a treacherous turn.
The same incentive for AIs to merge might also create an incentive for value lock-in, in order to facilitate the merging. (AIs that don't have utility functions might have a harder time coordinating with each other.) Other incentives for premature value lock-in might include defense against value manipulation/corruption/drift. So AIs end up embodying locked-in versions of human values which are terrible in light of our true/actual values.
I think the original "stereotyped image of AI catastrophe" is still quite plausible, if for example there is a large amount of hardware overhang before the last piece of puzzle for building AGI falls into place.

[-]paulfchristiano6yΩ8161

I think of #3 and #5 as risk factors that compound the risks I'm describing---they are two (of many!) ways that the detailed picture could look different, but don't change the broad outline. I think it's particularly important to understand what failure looks like under a more "business as usual" scenario, so that people can separate objections to the existence of any risk from objections to other exacerbating factors that we are concerned about (like fast takeoff, war, people being asleep at the wheel, etc.)

I'd classify #1, #2, and #4 as different problems not related to intent alignment per se (though intent alignment may let us build AI systems that can help address these problems). I think the more general point is: if you think AI progress is likely to drive many of the biggest upcoming changes in the world, then there will be lots of risks associated with AI. Here I'm just trying to clarify what happens if we fail to solve intent alignment.

[-]Wei Dai6yΩ250

I'm not sure I understand the distinction you're drawing between risk factors that compound the risks that you're describing vs. different problems not related to intent alignment per se. It seems to me like "AI-powered economies have much higher economies of scale because AIs don’t suffer from the kind of coordination costs that humans have (e.g., they can merge their utility functions and become clones of each other)" is a separate problem from solving intent alignment, whereas "AI-powered memetic warfare makes all humans effectively insane" is kind of an extreme case of "machine learning will increase our ability to 'get what we can measure'" which seems to be the opposite of how you classify them.

What do you think are the implications of something belonging to one category versus another (i.e., is there something we should do differently depending on which of these categories a risk factor / problem belongs to)?

I think the more general point is: if you think AI progress is likely to drive many of the biggest upcoming changes in the world, then there will be lots of risks associated with AI. Here I’m just trying to clarify what happens if we fail to solve intent alignment.

Ah, when I read "I think this is probably not what failure will look like" I interpreted that to mean "failure to prevent AI risk", and then I missed the clarification "these are the most important problems if we fail to solve intent alignment" that came later in the post, in part because of a bug in GW that caused the post to be incorrectly formatted.

Aside from that, I'm worried about telling a vivid story about one particular AI risk, unless you really hammer the point that it's just one risk out of many, otherwise it seems too easy for the reader to get that story stuck in their mind and come to think that this is the main or only thing they have to worry about as far as AI is concerned.

[-]CarlShulman6y*Ω21580

I think the kind of phrasing you use in this post and others like it systematically misleads readers into thinking that in your scenarios there are no robot armies seizing control of the world (or rather, that all armies worth anything at that point are robotic, and so AIs in conflict with humanity means military force that humanity cannot overcome). I.e. AI systems pursuing badly aligned proxy goals or influence-seeking tendencies wind up controlling or creating that military power and expropriating humanity (which eventually couldn't fight back thereafter even if unified).

E.g. Dylan Matthews' Vox writeup of the OP seems to think that your scenarios don't involve robot armies taking control of the means of production and using the universe for their ends against human objections or killing off existing humans (perhaps destructively scanning their brains for information but not giving good living conditions to the scanned data):

Even so, Christiano’s first scenario doesn’t precisely envision human extinction. It envisions human irrelevance, as we become agents of machines we created.

Human reliance on these systems, combined with the systems failing, leads to a massive societal breakdown. And in the wake of the breakdown, there are still machines that are great at persuading and influencing people to do what they want, machines that got everyone into this catastrophe and yet are still giving advice that some of us will listen to.

The Vox article also mistakes the source of influence-seeking patterns to be about social influence rather than systems that try to increase in power and numbers tend to do so, so are selected for if we accidentally or intentionally produce them and don't effectively weed them out; this is why living things are adapted to survive and expand; such desires motivate conflict with humans when power and reproduction can be obtained by conflict with humans, which can look like robot armies taking control.takes the point about influence-seeking patterns to be about. That seems to me just a mistake about the meaning of influence you had in mind here:

Often, he notes, the best way to achieve a given goal is to obtain influence over other people who can help you achieve that goal. If you are trying to launch a startup, you need to influence investors to give you money and engineers to come work for you. If you’re trying to pass a law, you need to influence advocacy groups and members of Congress.

That means that machine-learning algorithms will probably, over time, produce programs that are extremely good at influencing people. And it’s dangerous to have machines that are extremely good at influencing people.

[-]paulfchristiano6yΩ9230

The Vox article also mistakes the source of influence-seeking patterns to be about social influence rather than 'systems that try to increase in power and numbers tend to do so, so are selected for if we accidentally or intentionally produce them and don't effectively weed them out; this is why living things are adapted to survive and expand; such desires motivate conflict with humans when power and reproduction can be obtained by conflict with humans, which can look like robot armies taking control.

Yes, I agree the Vox article made this mistake. Me saying "influence" probably gives people the wrong idea so I should change that---I'm including "controls the military" as a central example, but it's not what comes to mind when you hear "influence." I like "influence" more than "power" because it's more specific, captures what we actually care about, and less likely to lead to a debate about "what is power anyway."

In general I think the Vox article's discussion of Part II has some problems, and the discussion of Part I is closer to the mark. (Part I is also more in line with the narrative of the article, since Part II really is more like Terminator. I'm not sure which way the causality goes here though, i.e. whether they ended up with that narrative based on misunderstandings about Part II or whether they framed Part II in a way that made it more consistent with the narrative, maybe having been inspired to write the piece based on Part I.)

There is a different mistake with the same flavor, later in the Vox article: "But eventually, the algorithms’ incentives to expand influence might start to overtake their incentives to achieve the specified goal. That, in turn, makes the AI system worse at achieving its intended goal, which increases the odds of some terrible failure"

The problem isn't really "the AI system is worse at achieving its intended goal;" like you say, it's that influence-seeking AI systems will eventually be in conflict with humans, and that's bad news if AI systems are much more capable/powerful than we are.

[AI systems] wind up controlling or creating that military power and expropriating humanity (which couldn't fight back thereafter even if unified)

Failure would presumably occur before we get to the stage of "robot army can defeat unified humanity"---failure should happen soon after it becomes possible, and there are easier ways to fail than to win a clean war. Emphasizing this may give people the wrong idea, since it makes unity and stability seem like a solution rather than a stopgap. But emphasizing the robot army seems to have a similar problem---it doesn't really matter whether there is a literal robot army, you are in trouble anyway.

[-]CarlShulman6y*Ω9240

Failure would presumably occur before we get to the stage of "robot army can defeat unified humanity"---failure should happen soon after it becomes possible, and there are easier ways to fail than to win a clean war. Emphasizing this may give people the wrong idea, since it makes unity and stability seem like a solution rather than a stopgap. But emphasizing the robot army seems to have a similar problem---it doesn't really matter whether there is a literal robot army, you are in trouble anyway.

I agree other powerful tools can achieve the same outcome, and since in practice humanity isn't unified rogue AI could act earlier, but either way you get to AI controlling the means of coercive force, which helps people to understand the end-state reached.

It's good to both understand the events by which one is shifted into the bad trajectory, and to be clear on what the trajectory is. It sounds like your focus on the former may have interfered with the latter.

[-]paulfchristiano6yΩ8240

I do agree there was a miscommunication about the end state, and that language like "lots of obvious destruction" is an understatement.

I do still endorse "military leaders might issue an order and find it is ignored" (or total collapse of society) as basically accurate and not an understatement.

[-]paulfchristiano6yΩ8170

I agree that robot armies are an important aspect of part II.

In part I, where our only problem is specifying goals, I don't actually think robot armies are a short-term concern. I think we can probably build systems that really do avoid killing people, e.g. by using straightforward versions of "do things that are predicted to lead to videos that people rate as acceptable," and that at the point when things have gone off the rails those videos still look fine (and to understand that there is a deep problem at that point you need to engage with complicated facts about the situation that are beyond human comprehension, not things like "are the robots killing people?"). I'm not visualizing the case where no one does anything to try to make their AI safe, I'm imagining the most probable cases where people fail.

I think this is an important point, because I think much discussion of AI safety imagines "How can we give our AIs an objective which ensures it won't go around killing everyone," and I think that's really not the important or interesting part of specifying an objective (and so leads people to be reasonably optimistic about solutions that I regard as obviously totally inadequate). I think you should only be concerned about your AI killing everyone because of inner alignment / optimization daemons.

That said, I do expect possibly-catastrophic AI to come only shortly before the singularity (in calendar time) and so the situation "humans aren't able to steer the trajectory of society" probably gets worse pretty quickly. I assume we are on the same page here.

In that sense Part I is misleading. It describes the part of the trajectory where I think the action is, the last moments where we could have actually done something to avoid doom, but from the perspective of an onlooker that period could be pretty brief. If there is a Dyson sphere in 2050 it's not clear that anyone really cares what happened during 2048-2049. I think the worst offender is the last sentence of Part I ("By the time we spread through the stars...")

Part I has this focus because (i) that's where I think the action is---by the time you have robot armies killing everyone the ship is so sailed, I think a reasonable common-sense viewpoint would acknowledge this by reacting with incredulity to the "robots kill everyone" scenario, and would correctly place the "blame" on the point where everything got completely out of control even though there weren't actually robot armies yet (ii) the alternative visualization leads people to seriously underestimate the difficulty of the alignment problem, (iii) I was trying to describe the part of the picture which is reasonably accurate regardless of my views on the singularity.

[-]CarlShulman6yΩ10230

I think we can probably build systems that really do avoid killing people, e.g. by using straightforward versions of "do things that are predicted to lead to videos that people rate as acceptable," and that at the point when things have gone off the rails those videos still look fine (and to understand that there is a deep problem at that point you need to engage with complicated facts about the situation that are beyond human comprehension, not things like "are the robots killing people?"). I'm not visualizing the case where no one does anything to try to make their AI safe, I'm imagining the most probable cases where people fail.

Haven't you yourself written about the failure modes of 'do things predicted to lead to videos that people rate as acceptable' where the attack involves surreptitiously reprogramming the camera to get optimal videos (including weird engineered videos designed to optimize on infelicities in the learned objective? E.g. in your discussion of an unaligned benchmark here:

We train an autoregressive generative model, to maximize the log probability assigned to the actual sequence of actions and observations produced by the AI (with each observation conditioned on the past actions). This isn’t actually a good way to train the generative model, but it’s not really central to the discussion.

We train the reward function by showing humans sequences of actions and predicted observations, asking them to assign scores, then predicting those scores with supervised learning. We show humans the sequences of actions that look most promising to the system...

I am quite confident that “world which looks appealing on screen” is not the same as “world which is good;” the gap is already large today, and it will grow much larger over time.

Maintaining a fiction. Ultimately the goal of this AI is to make the sensors look like good things are happening in the world. This can be done either by making the world good or by carefully controlling what gets fed to the sensors. In general, there are many fewer constraints on “what gets fed to the sensors” than “what the world is actually like,” and so it’s going to be possible to produce much more desirable-looking outcomes, much more easily, by controlling the sensors.

If the humans understand what the AI is doing in detail then they can observe the sequence of actions to notice any actions that might corrupt the sensors. But as soon as the AI is engaging in complicated plans this appears to be totally intractable. Even if the AI was human-level, there are many extremely hard-to-detect plans: for example, the AI could collaborate with an external attacker, who will surreptitiously corrupt the sensor. Or the AI could create a copy of itself on an external server who then helped orchestrate a hard-to-detect corruption.

Are you now very confident that no AI company would implement something with this vulnerability? Or does the paragraph below just mean that the part where the AI gets control of the camera has to happen before the robot armies (or even robot armies identified as traceable to the AI in question), which then happen?

Part I has this focus because (i) that's where I think the action is---by the time you have robot armies killing everyone the ship is so sailed, I think a reasonable common-sense viewpoint would acknowledge this by reacting with incredulity to the "robots kill everyone" scenario, and would correctly place the "blame" on the point where everything got completely out of control even though there weren't actually robot armies yet (ii) the alternative visualization leads people to seriously underestimate the difficulty of the alignment problem, (iii) I was trying to describe the part of the picture which is reasonably accurate regardless of my views on the singularity.

Because it definitely seems that Vox got the impression from it that there is never a robot army takeover in the scenario, not that it's slightly preceded by camera hacking.

Is the idea that the AI systems develops goals over the external world (rather than the sense inputs/video pixels) so that they are really pursuing the appearance of prosperity, or corporate profits, and so don't just wirehead their sense inputs as in your benchmark post?

[-]paulfchristiano6yΩ10220

My median outcome is that people solve intent alignment well enough to avoid catastrophe. Amongst the cases where we fail, my median outcome is that people solve enough of alignment that they can avoid the most overt failures, like literally compromising sensors and killing people (at least for a long subjective time), and can build AIs that help defend them from other AIs. That problem seems radically easier---most plausible paths to corrupting sensors involve intermediate stages with hints of corruption that could be recognized by a weaker AI (and hence generate low reward). Eventually this will break down, but it seems quite late.

very confident that no AI company would implement something with this vulnerability?

The story doesn't depend on "no AI company" implementing something that behaves badly, it depends on people having access to AI that behaves well.

Also "very confident" seems different from "most likely failure scenario."

Haven't you yourself written about the failure modes of 'do things predicted to lead to videos that people rate as acceptable' where the attack involves surreptitiously reprogramming the camera to get optimal videos (including weird engineered videos designed to optimize on infelicities in the learned objective?

That's a description of the problem / the behavior of the unaligned benchmark, not the most likely outcome (since I think the problem is most likely to be solved). We may have a difference in view between a distribution over outcomes that is slanted towards "everything goes well" such that the most realistic failures are the ones that are the closest calls, vs. a distribution slanted towards "everything goes badly" such that the most realistic failures are the complete and total ones where you weren't even close.

Because it definitely seems that Vox got the impression from it that there is never a robot army takeover in the scenario, not that it's slightly preceded by camera hacking.

I agree there is a robot takeover shortly later in objective time (mostly because of the singularity). Exactly how long it is mostly depends on how early things go off the rails w.r.t. alignment, perhaps you have O(year).

[-]CarlShulman6y*Ω8210

OK, thanks for the clarification!

My own sense is that the intermediate scenarios are unstable: if we have fairly aligned AI we immediately use it to make more aligned AI and collectively largely reverse things like Facebook click-maximization manipulation. If we have lost the power to reverse things then they go all the way to near-total loss of control over the future. So i would tend to think we wind up in the extremes.

I could imagine a scenario where there is a close balance among multiple centers of AI+human power, and some but not all of those centers have local AI takeovers before the remainder solve AI alignment, and then you get a world that is a patchwork of human-controlled and autonomous states, both types automated. E.g. the United States and China are taken over by their AI systems (inlcuding robot armies), but the Japanese AI assistants and robot army remain under human control and the future geopolitical system keeps both types of states intact thereafter.

[-]SoerenMind6y60

It'd be nice to hear a response from Paul to paragraph 1. My 2 cents:

I tend to agree that we end up with extremes eventually. You seem to say that we would immediately go to alignment given somewhat aligned systems so Paul's 1st story barely plays out.

Of course, the somewhat aligned systems may aim at the wrong thing if we try to make them solve alignment. So the most plausible way it could work is if they produce solutions that we can check. But if this were the case, human supervision would be relatively easy. That's plausible but it's a scenario I care less about.

Additionally, if we could use somewhat aligned systems to make more aligned ones, iterated amplification probably works for alignment (narrowly defined by "trying to do what we want"). The only remaining challenge would be to create one system that's somewhat smarter than us and somewhat aligned (in our case that's true by assumption). The rest follows, informally speaking, by induction as long as the AI+humans system can keep improving intelligence as alignment is improved. Which seems likely. That's also plausible but it's a big assumption and may not be the most important scenario / isn't a 'tale of doom'.

[-]Vanessa Kosoy6yΩ590

I agree that robot armies are an important aspect of part II.

Why? I can easily imagine an AI takeover that works mostly through persuasion/manipulation, with physical elimination of humans coming only as an "afterthought" when AI is already effectively in control (and produced adequate replacements for humans for the purpose of physically manipulating the world). This elimination doesn't even require an "army", it can look like everyone agreeing to voluntary "euthanasia" (possibly not understanding its true meaning). To the extent physical force is involved, most of it might be humans against humans.

[-]Rohin Shah6yΩ360

I somewhat expect even Part I to be solved by default -- it seems to rest on a premise of human reasoning staying as powerful as it is right now, but it seems plausible that as AI systems grow in capability we will be able to leverage them to improve human reasoning. Obviously this is an approach you have been pushing, but it also seems like a natural thing to do when you have powerful AI systems.

[-]Zvi6y401

Is this future AI catastrophe? Or is this just a description of current events being a general gradual collapse?

This seems like what is happening now, and has been for a while. Existing ML systems are clearly making Type-I problems, already quite bad before ML was a thing at all, much worse, to the extent that I don't see much ability left of our civilization to get anything that can't be measured in a short term feedback loop - even in spaces like this, appeals to non-measurable or non-explicit concerns are a near-impossible sell.

Part II problems are not yet coming from ML systems, exactly, But we certainly have algorithms that are effectively optimized and selected for the ability to gain influence; the algorithm gains influence, which causes people to care about it and feed into it, causing it to get more. If we get less direct in the metaphor we get the same thing with memetics, culture, life strategies, corporations, media properties and so on. The emphasis on choosing winners, being 'on the right side of history', supporting those who are good at getting support. OP notes that this happens in non-ML situations explicitly, and there's no clear dividing line in any case.

So if there is another theory that says, this has already happened, what would one do next?

[-]John_Maxwell6y00

You could always get a job at a company which controls an important algorithm.

[-]Richard_Ngo6y*Ω6180

Eventually we reach the point where we could not recover from a correlated automation failure. Under these conditions influence-seeking systems stop behaving in the intended way, since their incentives have changed---they are now more interested in controlling influence after the resulting catastrophe then continuing to play nice with existing institutions and incentives.

I'm not sure I understand this part. The influence-seeking systems which have the most influence also have the most to lose from a catastrophe. So they'll be incentivised to police each other and make catastrophe-avoidance mechanisms more robust.

As an analogy: we may already be past the point where we could recover from a correlated "world leader failure": every world leader simultaneously launching a coup. But this doesn't make such a failure very likely, unless world leaders also have strong coordination and commitment mechanisms between themselves (which are binding even after the catastrophe).

[-]Wei Dai6yΩ4140

(Upvoted because I think this deserves more clarification/discussion.)

I'm not sure I understand this part. The influence-seeking systems which have the most influence also have the most to lose from a catastrophe. So they'll be incentivised to police each other and make catastrophe-avoidance mechanisms more robust.

I'm not sure either, but I think the idea is that once influence-seeking systems gain a certain amount of influence, it may become faster or more certain for them to gain more influence by causing a catastrophe than to continue to work within existing rules and institutions. For example they may predict that unless they do that, humans will eventually coordinate to take back the influence that humans lost, or they may predict that during such a catastrophe they can probably expropriate a lot of resources currently owned by humans and gain much influence that way, or humans will voluntarily hand more power to them in order to try to use them to deal with the catastrophe.

As an analogy: we may already be past the point where we could recover from a correlated "world leader failure": every world leader simultaneously launching a coup. But this doesn't make such a failure very likely, unless world leaders also have strong coordination and commitment mechanisms between themselves (which are binding even after the catastrophe).

I think such a failure can happen without especially strong coordination and commitment mechanisms. Something like this happened during the Chinese Warlord Era, when many military commanders became warlords during a correlated "military commander failure", and similar things probably happened many times throughout history. I think what's actually preventing a "world leader failure" today is that most world leaders, especially of the rich democratic countries, don't see any way to further their own values by launching coups in a correlated way. In other words, what would they do afterwards if they did launch such a coup, that would be better than just exercising the power that they already have?

[-]Richard_Ngo6yΩ140

I think the idea is that once influence-seeking systems gain a certain amount of influence, it may become faster or more certain for them to gain more influence by causing a catastrophe than to continue to work within existing rules and institutions.

The key issue here is whether there will be coordination between a set of influence-seeking systems that can cause (and will benefit from) a catastrophe, even when other systems are opposing them. If we picture systems as having power comparable to what companies have now, that seems difficult. If we picture them as having power comparable to what countries have now, that seems fairly easy.

[-]Wei Dai6yΩ130

The key issue here is whether there will be coordination between a set of influence-seeking systems that can cause (and will benefit from) a catastrophe, even when other systems are opposing them.

Do you not expect this threshold to be crossed sooner or later, assuming AI alignment remains unsolved? Also, it seems like the main alternative to this scenario is that the influence-seeking systems expect to eventually gain control of most of the universe anyway (even without a "correlated automation failure"), so they don't see a reason to "rock the boat" and try to dispossess humans of their remaining influence/power/resources, but this is almost as bad as the "correlated automation failure" scenario from an astronomical waste perspective. (I'm wondering if you're questioning whether things will turn out badly, or questioning whether things will turn out badly this way.)

[-]Richard_Ngo6y*Ω262

Mostly I am questioning whether things will turn out badly this way.

Do you not expect this threshold to be crossed sooner or later, assuming AI alignment remains unsolved?

Probably, but I'm pretty uncertain about this. It depends on a lot of messy details about reality, things like: how offense-defence balance scales; what proportion of powerful systems are mostly aligned; whether influence-seeking systems are risk-neutral; what self-governance structures they'll set up; the extent to which their preferences are compatible with ours; how human-comprehensible the most important upcoming scientific advances are.

[-]Tobias_Baumann6y130

I agree with you that the "stereotyped image of AI catastrophe" is not what failure will most likely look like, and it's great to see more discussion of alternative scenarios. But why exactly should we expect that the problems you describe will be exacerbated in a future with powerful AI, compared to the state of contemporary human societies? Humans also often optimise for what's easy to measure, especially in organisations. Is the concern that current ML systems are unable to optimise hard-to-measure goals, or goals that are hard to represent in a computerised form? That is true but I think of this as a limitation of contemporary ML approaches rather than a fundamental property of advanced AI. With general intelligence, it should also be possible to optimise goals that are hard-to-measure.

Similarly, humans / companies / organisations regularly exhibit influence-seeking behaviour, and this can cause harm but it's also usually possible to keep it in check to at least a certain degree.

So, while you point at things that can plausibly go wrong, I'd say that these are perennial issues that may become better or worse during and after the transition to advanced AI, and it's hard to predict what will happen. Of course, this does not make a very appealing tale of doom – but maybe it would be best to dispense with tales of doom altogether.

I'm also not yet convinced that "these capture the most important dynamics of catastrophe." Specifically, I think the following are also potentially serious issues:
- Unfortunate circumstances in future cooperation problems between AI systems (and / or humans) result in widespread defection, leading to poor outcomes for everyone.
- Conflicts between key future actors (AI or human) result in large quantities of disvalue (agential s-risks).
- New technology leads to radical value drift of a form that we wouldn't endorse.

[-]paulfchristiano6y120

But why exactly should we expect that the problems you describe will be exacerbated in a future with powerful AI, compared to the state of contemporary human societies?

To a large extent "ML" refers to a few particular technologies that have the form "try a bunch of things and do more of what works" or "consider a bunch of things and then do the one that is predicted to work."

That is true but I think of this as a limitation of contemporary ML approaches rather than a fundamental property of advanced AI.

I'm mostly aiming to describe what I think is in fact most likely to go wrong, I agree it's not a general or necessary feature of AI that its comparative advantage is optimizing easy-to-measure goals.

(I do think there is some real sense in which getting over this requires "solving alignment.")

[-]John_Maxwell6y20

To a large extent "ML" refers to a few particular technologies that have the form "try a bunch of things and do more of what works" or "consider a bunch of things and then do the one that is predicted to work."

Why not "try a bunch of measurements and figure out which one generalizes best" or "consider a bunch of things and then do the one that is predicted to work according to the broadest variety of ML-generated measurements"? (I expect there's already some research corresponding to these suggestions, but more could be valuable?)

[-]Ben Pace5yΩ5100

I attempted to write a summary of this post and the entire comment section. I cut the post down to half its length, and cut the comment section down to less than 10% of the words.

To the commenters and Paul, do let me know if I summarised your points and comments well, ideally under the linked post :)

[-]Rohin Shah5yΩ790Nomination for 2019 Review

As commenters have pointed out, the post is light on concrete details. Nonetheless, I found even the abstract stories much more compelling as descriptions-of-the-future (people usually focus on descriptions-of-the-world-if-we-bury-our-heads-in-the-sand). I think Part 2 in particular continues to be a good abstract description of the type of scenario that I personally am trying to avert.

[-]habryka6yΩ480

Promoted to curated: I think this post made an important argument, and did so in a way that I expect the post and the resulting discussion around it to function as a reference-work for quite a while.

In addition to the post itself, I also thought the discussion around it was quite good and helped me clarify my thinking in this domain a good bit.

[-]Richard_Ngo5yΩ270

I recently came back to this post because I remembered it having examples of what influence-seeking agents might look like, and wanted to quote them. But now that I'm rereading in detail, it's all very vague. E.g.

A few automated systems go off the rails in response to some local shock. As those systems go off the rails, the local shock is compounded into a larger disturbance; more and more automated systems move further from their training distribution and start failing.

This doesn't constrain my expectations about what the automated systems are doing in any way; nor does it distinguish between recoverable and irrecoverable shocks. Is AI control over militaries necessary for a correlated automation failure to be irrecoverable? Or control over basic infrastructure? How well do AIs need to cooperate with each other to prevent humans from targeting them individually?

Overall I'm downgrading my credence in this scenario.

[-]orthonormal4yΩ360Review for 2019 Review

I think this post (and similarly, Evan's summary of Chris Olah's views) are essential both in their own right and as mutual foils to MIRI's research agenda. We see related concepts (mesa-optimization originally came out of Paul's talk of daemons in Solomonoff induction, if I remember right) but very different strategies for achieving both inner and outer alignment. (The crux of the disagreement seems to be the probability of success from adapting current methods.)

Strongly recommended for inclusion.

[-]Zack_M_Davis5y60Nomination for 2019 Review

Students of Yudkowsky have long contemplated hard-takeoff scenarios where a single AI bootstraps itself to superintelligence from a world much like our own. This post is valuable for explaining how the intrinsic risks might play out in a soft-takeoff scenario where AI has already changed Society.

Part I is a dark mirror of Christiano's 2013 "Why Might the Future Be Good?": the whole economy "takes off", and the question is how humane-aligned does the system remain before it gets competent enough to lock in its values. ("Why might the future" says "Mostly", "What Failure Looks Like" pt. I says "Not".)

When I first read this post, I didn't feel like I "got" Part II, but now I think I do. (It's the classic "treacherous turn", but piecemeal across Society in different systems, rather than in a single seed superintelligence.)

[-]Jan_Kulveit6y60

Reasons for some careful optimism

in Part I., it can be the case that human values are actually complex combination of easy to measure goals + complex world models, so the structure of the proxies will be able to represent what we really care about. (I don't know. Also the result can still stop represent our values with further scaling and evolution.)

in Part II., it can be the case that influence-seeking patterns are more computationally costly than straightforward patterns, and they can be in part suppressed by optimising for processing costs, bounded-rationality style. To some extend, influence-seeking patterns attempting to grow and control the whole system seems to me to be something happening also within our own minds. I would guess some combinational of immune system + metacognition + bounded rationality + stabilisation by complexity is stabilising many human minds. (I don't know if anything of that can scale arbitrarily.)

[-]John_Maxwell6y*Ω360

Once we start searching over policies that understand the world well enough, we run into a problem: any influence-seeking policies we stumble across would also score well according to our training objective, because performing well on the training objective is a good strategy for obtaining influence.

...

One reason to be scared is that a wide variety of goals could lead to influence-seeking behavior, while the “intended” goal of a system is a narrower target, so we might expect influence-seeking behavior to be more common in the broader landscape of “possible cognitive policies.”

Consider this video of an AI system with a misspecified reward function. (Background in this post.) The AI system searches the space of policies to discover the one that performs best according to its reward function in the simulated boat-racing world. It turns out that the one which performs best according to this misspecified reward function doesn't perform well according to the intended reward function (the "training objective" that the system's developers use to evaluate performance).

The goal of picking up as many power-ups as possible could lead to influence-seeking behavior: If the boat can persuade us to leave the simulation on, it can keep picking up power-ups until the end of time. Suppose for the sake of argument that performing well on the training objective is the best strategy for obtaining influence, as you posit. Then the boat should complete the race correctly, in order to fool us into thinking it reliably works towards the training objective. And yet it doesn't complete the race correctly in the video. Why not?

One answer is that the human supervisor isn't part of the system's world model. But I don't think that would change things. Suppose instead of making use of an existing video game, the system's world model was generated automatically by observing the world, and the observations were detailed enough to include the supervisor of the AI system and even the AI system itself. Now the boat is trying to find policies that maximize power-ups in this absurdly detailed, automatically generated world model (with some power-ups manually added in). Why would a policy which manipulates the operator within the simulated world score well? It seems like it would take a confused world model for manipulation of the simulated operator to help with picking up simulated power-ups. Like if painting animals on cave walls actually caused them to appear. Larry Ellison is not going to win a yacht race by telling his data scientist to cripple his opponents in a simulation.

[Another frame: Cartesian dualism will happen by default, or at least will be easy to enforce on the architectural level. You could argue Cartesian dualists lose because they don't do self-improvement? But an implied premise of your post is that foom won't happen. I disagree but that's another discussion.]

But let's suppose the world model actually is confused, and the best policy in the simulation is one that manipulates the simulated operator to gain simulated power-ups. Even in this case, I think we'd still see a video like I linked earlier. We'd see the boat powering over to the part of the simulated world where the simulated operator resides, doing something to manipulate the simulated operator, and then the boat would have loads of power-ups somehow. I think the biggest concern is exposure to an information hazard when we see how the boat manipulates the operator. (Luckily, if we implement an information hazard filter before letting ourselves watch the video, the boat will not optimize to get past it.)

Human billionaires are hiring physicists to try & figure out if our universe is a simulation and if so, how to hack our way out. So there might be something here. Maybe if world model construction happens in tandem with exploring the space of policies, the boat will start "considering the possibility that it's in a simulation" in a sense. (Will trying to manipulate the thing controlling the simulation be a policy that performs well in the simulation?)

[-]paulfchristiano6yΩ590

I'm not mostly worried about influence-seeking behavior emerging by "specify a goal" --> "getting influence is the best way to achieve that goal." I'm mostly worried about influence-seeking behavior emerging within a system by virtue of selection within that process (and by randomness at the lowest level).

[-]John_Maxwell6y180

OK, thanks for clarifying. Sounds like a new framing of the "daemon" idea.

[-]Wei Dai6y90

Sounds like a new framing of the “daemon” idea.

That's my impression as well. If it's correct, seems like it would be a good idea to mention that explicitly in the post, so people can link up the new concept with their old concept.

[-]TurnTrout6yΩ350

So the concern here is that even if the goal, say, robustly penalizes gaining influence, the agent still has internal selection pressures for seeking influence? And this might not be penalized by the outer criterion if the policy plays nice on-distribution?

[-]Vlad Mikulik6yΩ230

The goal that the agent is selected to score well on is not necessarily the goal that the agent is itself pursuing. So, unless the agent’s internal goal matches the goal for which it’s selected, the agent might still seek influence because its internal goal permits that. I think this is in part what Paul means by “Avoiding end-to-end optimization may help prevent the emergence of influence-seeking behaviors (by improving human understanding of and hence control over the kind of reasoning that emerges)”

[-]TurnTrout6yΩ120

And if the internal goal doesn’t permit that? I’m trying to feel out which levels of meta are problematic in this situation.

[-]niplav5y40Review for 2019 Review

I read this post only half a year ago after seeing it being referenced in several different places, mostly as a newer, better alternative to the existing FOOM-type failure scenarios. I also didn't follow the comments on this post when it came out.

This post makes a lot of sense in Christiano's worldview, where we have a relatively continuous, somewhat multipolar takeoff which to a large extent inherits the problem in our current world. This is especially applies to part I: we already have many different instances of scenarios where humans follow measured incentives and produce unintended outcomes. Goodhart's law is a thing. Part I ties in especially well with Wei Dai's concern that

AI-powered memetic warfare makes all humans effectively insane.

While I haven't done research on this, I have a medium strength intuition that this is already happening. Many people I know are at least somewhat addicted to the internet, having lost a lot of attention due to having their motivational system hijacked, which is worrying because Attention is your scarcest resource. I believe investigating the amount to which attention has deteriorated (or has been monopolized by different actors) would be valuable, as well as thinking about which incentives will start when AI technologies become more powerful (Daniel Kokotajlo has been writing especially interesting essays on this kind of problem).

As for part II, I'm a bit more skeptical. I would summarize "going out with a bang" as a "collective treacherous turn", which would demand somewhat high levels of coordination between agents of various different levels of intelligence (agents would be incentivized to turn early because of first-mover-advantages, but this would increase the probability of humans doing something about it), as well as agents knowing very early that they want to perform a treacherous turn to influence-seeking behavior. I'd like to think about how the frequency of premature treacherous turns relates to the intelligence of agents. Would that be continuous or discontinuous? Unrelated to Christiano's post, this seems like an important consideration (maybe work has gone into this and I just haven't seen it yet).

Still, part II holds up pretty well, especially since we can expect AI systems to cooperate effectively via merging utility functions, and we can see systems in the real world that fail regularly, but not much is being done about them (especially social structures that sort-of work).

I have referenced this post numerous times, mostly in connection with a short explanation of how I think current attention-grabbing systems are a variant of what is described in part I. I think it's pretty good, and someone (not me) should flesh the idea out a bit more, perhaps connecting it to existing systems (I remember the story about the recommender system manipulating its users into political extremism to increase viewing time, but I can't find a link right now).

The one thing I would like to see improved is at least some links to prior existing work. Christiano writes that

(None of the concerns in this post are novel.)

but it isn't clear whether he is just summarizing things he has thought about, which are implicit knowledge in his social web, or whether he is summarizing existing texts. I think part I would have benefitted from a link to Goodhart's law (or an explanation why it is something different).

[-]Eli Tyre6y40

We might describe the result as “going out with a whimper.” Human reasoning gradually stops being able to compete with sophisticated, systematized manipulation and deception which is continuously improving by trial and error; human control over levers of power gradually becomes less and less effective; we ultimately lose any real ability to influence our society’s trajectory. By the time we spread through the stars our current values are just one of many forces in the world, not even a particularly strong one.

Man, in this scenario it really matters how much "our" AI systems are suffering or having enjoyable-on-their-terms experiences.

[-]Wei Dai6y40

There's a bunch of bullet points below Part 1 and Part 2. Are these intended to be parallel with them on the same level, or instances/subcategories of them?

Oh, this is only on GW. On LW it looks very different. Presumably the LW version is the intended version.

[This comment is no longer endorsed by its author]Reply

[-]clone of saturn6y40

Oops, this bug should be fixed now.

[-]Charbel-Raphaël2y30

I think these are the most important problems if we fail to solve intent alignment.

Do you still think this is the case?

[-]Mau4y20

A more recent clarification from Paul Christiano, on how Part 1 might get locked in / how it relates to concerns about misaligned, power-seeking AI:

I also consider catastrophic versions of "you get what you measure" to be a subset/framing/whatever of "misaligned power-seeking." I think misaligned power-seeking is the main way the problem is locked in.

[-]Shmi6y20

Paul, is anyone at MIRI or elsewhere doing numerical simulation of your ideas? Or are those just open-loop thoughts?

[-]ESRogs6y*20

Not sure exactly what you mean by "numerical simulation", but you may be interested in https://ought.org/ (where Paul is a collaborator), or in Paul's work at OpenAI: https://openai.com/blog/authors/paul/ .

[-]Mr Beastly4mo10

if we do well about nipping small failures in the bud, we may not get any medium-sized warning shots at all

This would be the scariest outcome: No warning, straight shot to fast self-improving, waaay above human level intelligence. And, (big shocker) turns out humans aren't that intelligent, and also super slow... 😬

"The cost of success is... embarrassment." 😳

[-]ericfitz1y10

Corporations will deliver value to consumers as measured by profit. Eventually this mostly means manipulating consumers, capturing regulators, extortion and theft.

This is the most succinct summary of the problems with corporatism that I have yet seen.

[-]Erich_Grunewald2y10

Once we start searching over policies that understand the world well enough, we run into a problem: any influence-seeking policies we stumble across would also score well according to our training objective, because performing well on the training objective is a good strategy for obtaining influence.

I'm slightly confused by this. It sounds like "(1) ML systems will do X because X will be rewarded according to the objective, and (2) X will be rewarded according to the objective because being rewarded will accomplish X". But (2) sounds circular -- I see that performing well on the training objective gives influence, but I would've thought only effects (direct and indirect) on the objective are relevant in determining which behaviors ML systems pick up, not effects on obtaining influence.

Maybe that's the intended meaning -- I'm just misreading this passage, but also maybe I'm missing some deeper point here?

Terrific post, by the way, still now four years later.

[-]paulfchristiano2y*50

Consider a competent policy that wants paperclips in the very long run. It could reason "I should get a low loss to get paperclips," and then get a low loss. As a result, it could be selected by gradient descent.

[-]Virgil Kurkjian6y10

Potentially relevant is my post Leto among the Machines, where I discuss the early stages of what you aptly call “get what you can measure”.

[-]Donald Hobson6y00

As far as I understand it, you are proposing that the most realistic failure mode consists of many AI systems, all put into a position of power by humans, and optimizing for their own proxies. Call these Trusted Trial and Error AI's (TTE)

The distinguishing features of TTE's are that they were Trusted. A human put them in a position of power. Humans have refined, understood and checked the code enough that they are prepared to put this algorithm in a self driving car, or a stock management system. They are not lab prototypes. They are also Trial and error learners, not one shot learners.

Some More descriptions of what capability range I am considering.

Suppose hypothetically that we had TTE reinforcement learners, a little better than todays state of the art, and nothing beyond that. The AI's are advanced enough that they can take a mountain of medical data and train themselves to be skilled doctors by trial and error. However they are not advanced enough to figure out how humans work from, say a sequenced genome and nothing more.

Give them control of all the traffic lights in a city, and they will learn how to minimize traffic jams. They will arrange for people to drive in circles rather than stay still, so that they do not count as part of a traffic jam. However they will not do anything outside their preset policy space, like hacking into the traffic light control system of other cities, or destroying the city with nukes.

If such technology is easily available, people will start to use it for things. Some people put it in positions of power, others are more hesitant. As the only way the system can learn to avoid something is through trial and error, the system has to cause a (probably several) public outcrys before it learns not to do so. If no one told the traffic light system that car crashes are bad on simulations or past data, (Alignment failure) Then even if public opinion feeds directly into reward, it will have to cause several car crashes that are clearly its fault before it learns to only cause crashes that can be blamed on someone else. However, deliberately causing crashes will probably get the system shut off or seriously modified.

Note that we are supposing many of these systems existing, so the failures of some, combined with plenty of simulated failures, will give us a good idea of the failure modes.

The space of bad things an AI can get away with is small and highly complex in the space of bad things. An TTE set to reduce crime rates tries making the crime report forms longer, this reduces reported crime, but humans quickly realize what its doing. It would have to do this and be patched many times before it came up with a method that humans wouldn't notice.

Given Advanced TTE's as the most advanced form of AI, we might slowly develop a problem, but the deployment of TTE's would be slowed by the time it takes to gather data and check reliability. Especially given mistrust after several major failures. And I suspect that due to statistical similarity of training and testing, many different systems optimizing different proxies, and humans having the best abstract reasoning about novel situations, and the power to turn the systems off, any discrepancy of goals will be moderately minor. I do not expect such optimization power to be significantly more powerful or less aligned than modern capitalism.

This all assumes that no one will manage to make a linear time AIXI. If such a thing is made, it will break out of any boxes and take over the world. So, we have a social process of adaption to TTE AI, which is already in its early stages with things like self driving cars, and at any time, this process could be rendered irrelevant by the arrival of a super-intelligence.

[-]avturchin6y-10

Just before reading this, I got a shower thought that most AI-related catastrophes described previously were of "hyper-rational" type, e.g. the paperclipper, which from first principles decides that it must produce infinitely many paperclips.

However, this is not how ML-based systems fail. They either fall randomly, when encounter something like adversarial example, or fall slowly, by goodhearting some performance measure. Such systems could be also used to create dangerous weapons, e.g. fakenews or viruses, or interact unpredictably with each other.

Future GPT-3 will be protected from hyper-rational failures because of the noisy nature of its answers, so it can't stick forever to some wrong policy.

[-]Daniel Kokotajlo6y130

I think that's a straw man of the classic AI-related catastrophe scenarios. Bostrom's "covert preparation" --> "Treacherous turn" --> "takeover" story maps pretty nicely to Paul's "seek influence via gaming tests" --> "they are now more interested in controlling influence after the resulting catastrophe then continuing to play nice with existing institutions and incentives" --> " One day leaders may find that despite their nominal authority they don’t actually have control over what these institutions do. For example, military leaders might issue an order and find it is ignored. This might immediately prompt panic and a strong response, but the response itself may run into the same problem, and at that point the game may be up. "

[-]TheWakalix6y70

the paperclipper, which from first principles decides that it must produce infinitely many paperclips

I don't think this is an accurate description of the paperclip scenario, unless "first principles" means "hardcoded goals".

Future GPT-3 will be protected from hyper-rational failures because of the noisy nature of its answers, so it can't stick forever to some wrong policy.

Ignoring how GPT isn't agentic and handwaving an agentic analogue, I don't think this is sound. Wrong policies make up almost all of policyspace; the problem is not that the AI might enter a special state of wrongness, it's that the AI might leave the special state of correctness. And to the extent that GPT is hindered by its randomness, it's unable to carry out long-term plans at all - it's safe only because it's weak.

LESSWRONG
LW

436

What failure looks like

436

Ω 106

436

Ω 106

Part I: You get what you measure

Part II: influence-seeking behavior is scary