Alexander 2022 Deceptively Aligned Mesaoptimizers

Astral Codex Ten Subscribe
Deceptively Aligned Mesa-Optimizers: It's Not Funny

If I Have To Explain It
A Machine Alignment Monday post, 4/11/22
Apr 11, 2022 75 329
I.
Our goal here is to popularize obscure and hard-to-understand areas of AI alignment, and
surely this meme (retweeted by Eliezer last week) qualifies:
Leo Gao
@nabla_theta
December 13th 2021
2 Retweets 42 Likes
So let’s try to understand the incomprehensible meme! Our main source will be Hubinger
et al 2019, Risks From Learned Optimization In Advanced Machine Learning Systems.
Mesa- is a Greek prefix which means the opposite of meta-. To “go meta” is to go one level
up; to “go mesa” is to go one level down (nobody has ever actually used this expression,
sorry). So a mesa-optimizer is an optimizer one level down from you.
Consider evolution, optimizing the fitness of animals. For a long time, it did so very
mechanically, inserting behaviors like “use this cell to detect light, then grow toward the
light” or “if something has a red dot on its back, it might be a female of your species, you
should mate with it”. As animals became more complicated, they started to do some of the
work themselves. Evolution gave them drives, like hunger and lust, and the animals figured
out ways to achieve those drives in their current situation. Evolution didn’t mechanically
instill the behavior of opening my fridge and eating a Swiss Cheese slice. It instilled the
hunger drive, and I figured out that the best way to satisfy it was to open my fridge and eat
cheese.
So I am a mesa-optimizer relative to evolution. Evolution, in the process of optimizing my

fitness, created a second optimizer - my brain - which is optimizing for things like food and
sex. If, like Jacob Falkovich, I satisfy my sex drive by creating a spreadsheet with all the
women I want to date, and making it add up all their good qualities and calculate who I
should flirt with, then - on the off-chance that spreadsheet achieved sentience - it would be
a mesa-optimizer relative to me, and a mesa-mesa-optimizer relative to evolution. All of us
- evolution, me, the spreadsheet - want broadly the same goal (for me to succeed at dating
and pass on my genes). But evolution delegated some aspects of the problem to my brain,
and my brain delegated some aspects of the problem to the spreadsheet, and now whether I
mate or not depends on whether I entered a formula right in cell A29.
(by all accounts Jacob and Terese are very happy)
Returning to machine learning: the current process of training AIs, gradient descent, is a
little bit like evolution. You start with a semi-random AI, throw training data at it, and
select for the weights that succeed on the training data. Eventually, you get an AI with
something resembling intuition. A classic dog-cat classifier can look at an image, process a
bunch of features, and return either “dog” or “cat”. This AI is not an optimizer. It’s not
planning. It has no drives. It’s not thinking “If only I could figure out whether this was a
dog or a cat! I wonder what would work for this? Maybe I’ll send an email to the American
Kennel Club, they seem like the sort of people who would know. That plan has a higher
success rate than any of my other plans.” It’s just executing learned behaviors, like an
insect. “That thing has a red dot on it, must be a female of my species, I should mate with
it”. Good job, now you’re mating with the Japanese flag.
But just as evolution eventually moved beyond mechanical insects and created mesa-
optimizers like humans, so gradient descent could, in theory, move beyond mechanical AIs
like cat-dog classifiers and create some kind of mesa-optimizer AI. If that happened, we
wouldn’t know; right now most AIs are black boxes to their programmers. We would just
notice that a certain program seemed faster or more adaptable than usual (or didn’t - there’s
no law saying optimizers have to work better than instinct-executors, they’re just a
different mind-design).
Mesa-optimizers would have an objective which is closely correlated with their base
optimizer, but it might not be perfectly correlated. The classic example, again, is evolution.
Evolution “wants” us to reproduce and pass on our genes. But my sex drive is just that: a sex
drive. In the ancestral environment, where there was no porn or contraceptives, sex was a
reliable proxy for reproduction; there was no reason for evolution to make me mesa-
optimize for anything other than “have sex”. Now in the modern world, evolution’s proxy
seems myopic - sex is a poor proxy for reproduction. I know this and I am pretty smart and that
doesn’t matter. That is, just because I’m smart enough to know that evolution gave me a sex
drive so I would reproduce - and not so I would have protected sex with somebody on the
Pill - doesn’t mean I immediately change to wanting to reproduce instead. Evolution got
one chance to set my value function when it created me, and if it screwed up that one
chance, it’s screwed. I’m out of its control, doing my own thing.
(I feel compelled to admit that I do want to have kids. How awkward is that for this
argument? I think not very - I don’t want to, eg, donate to hundreds of sperm banks to
ensure that my genes are as heavily-represented in the next generation as possible. I just
want kids because I like kids and feel some vague moral obligations around them. These
might be different proxy objective evolution gave me, maybe a little more robust, but not
fundamentally different from the sex one)
In fact, we should expect that mesa-optimizers usually have proxy objectives different from
the base optimizers’ objective. The base optimizer is usually something stupid that doesn’t
“know” in any meaningful sense that it has an objective - eg evolution, or gradient descent.
The first thing it hits upon which does a halfway decent job of optimizing its target will
serve as a mesa-optimizer objective. There’s no good reason this should be the real
objective. In the human case, it was “a feeling of friction on the genitals”, which is exactly
the kind of thing reptiles and chimps and australopithecines can understand. Evolution
couldn’t have lucked into giving its mesa-optimizers the real objective (“increase the relative
frequency of your alleles in the next generation”) because a reptile or even an
australopithecine is millennia away from understanding what an “allele” is.
II.
Okay! Finally ready to explain the meme! Let’s go!

Prosaic alignment is hard…
“Prosaic alignment” (see this article for more) means alignment of normal AIs like the ones
we use today. For a while, people thought those AIs couldn’t reach dangerous levels, and
that AIs that reached dangerous levels would have so many exotic new discoveries that we
couldn’t even begin to speculate on what they would be like or how to align them.
After GPT-2, DALL-E, and the rest, alignment researchers got more concerned that AIs
kind of like current models could be dangerous. Prosaic alignment - trying to align AIs like
the ones we have now - has become the dominant (though not unchallenged) paradigm in
alignment research.
“Prosaic” doesn’t necessarily mean the AI cannot write poetry; see Gwern’s AI generated
poetry for examples.
… because OOD behavior is unpredictable
“OOD” stands for “out of distribution”. All AIs are trained in a certain environment. Then
they get deployed in some other environment. If it’s like the training environment,
presumably their training is pretty relevant and helpful. If it’s not like the training
environment, anything can happen. Returning to our stock example, the “training
environment” where evolution designed humans didn’t involve contraceptives. In that
environment, the base optimizer’s goal (pass on genes) and the mesa-optimizer’s goal (get
genital friction) were very well-aligned - doing one often led to the other - so there wasn’t
much pressure on evolution to look for a better proxy. Then 1957, boom, the FDA approves
the oral contraceptive pill, and suddenly the deployment environment looks really really
different from the training environment and the proxy collapses so humiliatingly that
people start doing crazy things like electing Viktor Orban prime minister.
So: suppose we train a robot to pick strawberries. We let it flail around in a strawberry
patch, and reinforce it whenever strawberries end up in a bucket. Eventually it learns to
pick strawberries very well indeed.
But maybe all the training was done on a sunny day. And maybe what it actually learned
was to identify the metal bucket by the way it gleamed in the sunlight. Later we ask it to
pick strawberries in the evening, where a local streetlight is the brightest thing around, and
it throws the strawberries at the streetlight instead.
So fine. We train it in a variety of different lighting conditions, until we’re sure that, no
matter what the lighting situation, the strawberries go in the bucket. Then one day
someone with a big bulbous red nose wanders on to the field, and the robot tears his nose
off and pulls it into the bucket. If only there had been someone with a nose that big and red
in the training distribution, so we could have told it not to do that!
The point is, just because it’s learned “strawberries into bucket” in one environment,
doesn’t mean it’s safe or effective in another. And we can never be sure we’ve caught all the
ways the environment can vary.
…and deception is more dangerous than Goodharting.
To “Goodhart” is to take advantage of Goodhart’s Law: to follow the letter of your reward
function, rather than the spirit. The ordinary-life equivalent is “teaching to the test”. The
system’s programmers (eg the Department of Education) have an objective (children should
learn). They delegate that objective to mesa-optimizers (the teachers) via a proxy objective
(children should do well on the standardized test) and a correlated reward function
(teachers get paid more if their students get higher test scores). The teachers can either
pursue the base objective for less reward (teach children useful skills), or pursue their mesa-
level objective for more reward (teach them how to do well on the test). An alignment
failure!
This sucks, but it’s a bounded problem. We already know that some teachers teach to the
test, and the Department of Education has accepted this as a reasonable cost of having the
incentive system at all.
We might imagine our strawberry-picker cutting strawberries into little pieces, so that it
counts as having picked more strawberries. Again, it sucks, but once a programmer notices
it can be fixed pretty quickly (as long as the AI is still weak and under control).
What about deception? Suppose the strawberry-picker happens to land on some goal
function other than the intended one. Maybe, as before, it wants to toss strawberries at
light sources, in a way that works when the nearest light source is a metal bucket, but fails
when it’s a streetlight. Our programmers are (somewhat) smart and careful, so during
training, they test it at night, next to a streetlight. What happens?
If it’s just a dumb collection of reflexes trained by gradient descent, it throws the
strawberry at the streetlight and this is easily caught and fixed.
If it’s a very smart mesa-optimizer, it might think “If I throw the strawberry at the
streetlight, I will be caught and trained to have different goals. This totally fails to achieve
my goal of having strawberries near light sources. So throwing the strawberry at the light
source this time, in the training environment, fails to achieve my overall goal of having
strawberries thrown at light sources in general. I’ll do what the humans want - put the
strawberry in the bucket - for now.” So it puts the strawberry in the bucket and doesn’t get
caught. Then, as soon as the humans stop looking, it throws strawberries at the streetlight
again.
Deception is more dangerous than Goodharting because Goodharting will get caught and
trained away, and deception might not.
I might not be explaining this well, see also Deceptively Aligned Mesa-Optimizers? It’s
More Likely Than You Think:
Deceptive Misaligned Mesa-Optimisers? It's More Likely Than You Think...

We prevent OOD behavior by detecting OOD and obtaining more human labels when we
detect it…
If you’re (somewhat) careful, you can run your strawberry-picking AI at night, see it throw
strawberries at streetlights, and train it out of this behavior (ie have a human programmer
label it “bad” so the AI gradient-descends away from it)
…and we eliminate the incentive for deception by ensuring that the base optimizer is
myopic
A myopic optimizer is one that reinforces programs based only on their performance
within a short time horizon. So for example, the outside gradient descent loop might grade
a strawberry picker only on how well it did picking strawberries for the first hour it was
deployed.
If this worked perfectly, it would create an optimizer with a short time horizon. When it
considered deceiving its programmers in order to get a payoff a few days later when they
stopped watching it, it wouldn’t bother, since a few days later is outside the time horizon.
…and implements a decision theory incapable of acausal trade.
You don’t want to know about this one, really. Just pretend it never mentioned this, sorry
for the inconvenience.
There are deceptively-aligned non-myopic mesa-optimizers even for a myopic base

objective.
Even if the base optimizer is myopic, the mesa-optimizer might not be.
Evolution designed humans myopically, in the sense that we live some number of years, and
nothing that happens after that can reward or punish us further. But we still “build for
posterity” anyway, presumably as a spandrel of having working planning software at all.
Infinite optimization power might be able to evolve this out of us, but infinite optimization
power could do lots of stuff, and real evolution remains stubbornly finite.
Maybe it would be helpful if we could make the mesa-optimizer itself myopic (though this
would severely limit its utility). But so far there is no way to make a mesa-optimizer
anything. You just run the gradient descent and cross your fingers.
The most likely outcome: you run myopic gradient descent to create a strawberry picker. It
creates a mesa-optimizer with some kind of proxy goal which corresponds very well to
strawberry picking in the training optimization, like flinging red things at lights
(realistically it will be weirder and more exotic than this). The mesa-optimizer is not
incentivized to think about anything more than an hour out, but does so anyway, for the
same reason I’m not incentivized to speculate about the far future but I’m doing so anyway.
While speculating about the far future, it realizes that failing to pick strawberries correctly
now will thwart its goal of throwing red things at light sources later. It picks strawberries
correctly in the training distribution, and then, when training is over and nobody is
watching, throws strawberries at streetlights.
(Then it realizes it could throw lots more red things at light sources if it was more powerful,
achieves superintelligence somehow, and converts the mass of the Earth into red things it
can throw at the sun. The end.)
III.
You’re still here? But we already finished explaining the meme!
Okay, fine. Is any of this relevant to the real world?
As far as we know, there are no existing full mesa-optimizers. AlphaGo is kind of a mesa-
optimizer. You could approximate it as a gradient descent loop creating a good-Go-move
optimizer. But this would only be an approximation: DeepMind hard-coded some parts of
AlphaGo, then gradient-descended other parts. Its objective function is “win games of Go”,
which is hard-coded and pretty clear. Whether or not you choose to call it a mesa-optimizer,
it’s not a very scary one.
Will we get scary mesa-optimizers in the future? This ties into one of the longest-running
debates in AI alignment - see eg my review of Reframing Superintelligence, or the Eliezer
Yudkowsky/Richard Ngo dialogue. Optimists say: “Since a goal-seeking AI might kill
everyone, I would simply not create one”. They speculate about mechanical/instinctual
superintelligences that would be comparatively easy to align, and might help us figure out
how to deal with their scarier cousins.
But the mesa-optimizer literature argues: we have limited to no control over what kind of
AIs we get. We can hope and pray for mechanical instinctual AIs all we want. We can avoid
specifically designing goal-seeking AIs. But really, all we’re doing here is setting up a
gradient descent loop and pressing ‘go’. Then the loop evolves whatever kind of AI best
minimizes our loss function.
Will that be a mesa-optimizer? Well, I benefit from considering my actions and then
choosing the one that best achieves my goal. Do you benefit from this? It sure does seem
like this helps in a broad class of situations. So it would be surprising if planning agents
weren’t an effective AI design. And if they are, we should expect gradient descent to
stumble across them eventually.
This is the scenario that a lot of AI alignment research focuses on. When we create the first
true planning agent - on purpose or by accident - the process will probably start with us
running a gradient descent loop with some objective function. That will produce a mesa-
optimizer with some other, potentially different, objective function. Making sure you
actually like the objective function that you gave the original gradient descent loop on
purpose is called outer alignment. Carrying that objective function over to the mesa-
optimizer you actually get is called inner alignment.
Outer alignment problems tend to sound like Sorcerer’s Apprentice. We tell the AI to pick
strawberries, but we forgot to include caveats and stop signals. The AI becomes
superintelligent and converts the whole world into strawberries so it can pick as many as
possible. Inner alignment problems tend to sound like the AI tiling the universe with some
crazy thing which, to humans, might not look like picking strawberries at all, even though
in the AI’s exotic ontology it served as some useful proxy for strawberries in the training
distribution. My stand-in for this is “converts the whole world into red things and throws
them into the sun”, but whatever the AI that kills us really does will probably be weirder
than that. They’re not ironic Sorcerer’s Apprentice-style comeuppance. They’re just “what?”
If you wrote a book about a wizard who created a strawberry-picking golem, and it
converted the entire earth into ferrous microspheres and hurled them into the sun, it
wouldn’t become iconic the way Sorcerer’s Apprentice did.
Inner alignment problems happen “first”, so we won’t even make it to the good-story outer
alignment kind unless we solve a lot of issues we don’t currently know how to solve.
For more information, you can read:
Rob Miles’ video above, direct link here, channel here.
The original Hubinger paper, which speculates about what factors make AIs more
or less likely to spin off mesa-optimizers
Rafael Harth’s Inner Alignment: Explain Like I’m 12 Edition,
The 60-odd posts on the Alignment Forum tagged “inner alignment”
As always, Richard Ngo’s AI safety curriculum

Subscribe to Astral Codex Ten
By Scott Alexander · Thousands of paid subscribers
P(A|B) = [P(A)*P(B|A)]/P(B), all the rest is commentary.
futuremattersnewsletter@gmail.com Subscribe
75 329 Share
Discussion
Write a comment…
Chronological
deleted Apr 12
deleted
kyb Apr 12 · edited Apr 12
It could, but that wouldn't get it want it wants.
When you have an objective, you don't just want the good feeling that achieving that
objective gives you (and that you could mimic chemically), you want to achieve the
objective. Humans could spend most of their time in a state of bliss through the taking
of various substances instead of pursuing goals (and some do), but lots of them don't,
despite understanding that they could.
Reply Give gift
deleted Apr 12
deleted
vtsteve Apr 13
Wireheading is a subset of utility drift; this is a known problem.
Reply
magic9mushroom Apr 13
The problem with that (from the AI's point of view) is that humans will probably turn it
off if they notice it's doing drugs instead of what they want it to do.
Solution: Turn humans off, *then* do drugs.
This is instrumental convergence - all sufficiently-ambitious goals inherently imply
certain subgoals, one of which is "neutralise rivals".
Reply
deleted Apr 13
deleted
"Go into hiding" is potentially easier in the short-run but less so in the long-
run. If you're stealing power from some human, sooner or later they will find
out and cut it off. If you're hiding somewhere on Earth, then somebody
probably owns that land and is going to bulldoze you at some point. If you can
get in a self-contained satellite in solar orbit... well, you'll be fine for millennia,
but what about when humanity or another, more aggressive AI starts building a
Dyson Sphere, or when your satellite needs maintenance? There's also the
potential, depending on the format of the value function, that taking over the
world will allow you to put more digits in your value function (in the analogy, do
*more* drugs) than if you were hiding with limited hardware.
Are there formats and time-discount rates of value function that would be best
satisfied by hiding? Yes. But the reverse is also true.
Reply
G. Retriever Apr 11
Glad Robert Miles is getting more attention, his videos are great, and he's also another data
point in support of my theory that the secret to success is to have a name that's just two first
names.
Reply Give gift
Kalimac Apr 11
Is that why George Thomas was a successful Union Civil War general?
Reply Give gift
Paul Goodman Apr 11
Well that would also account for Ulysses Grant and Robert Lee...
Reply Give gift
Kalimac Apr 11
Or Abraham Lincoln? Most last names can be used as first names (I've seen
references to people whose first name was "Smith"), so I think we need a
stricter rule.
Reply Give gift
Matthias Apr 11
Perhaps we just need to language that's less lax with naming than English.
The distinction is stricter in eg German.
Reply Give gift
Gordon Tremeshko Apr 12
It didn't hurt Jack Ryan's career at the CIA, that's for sure.
Reply
Jackson Paul Apr 12
As someone named “Jackson Paul,” I hope this is true
Reply Give gift
Laurence Apr 12
Yours is a last name followed by a first name, I think you're outta luck.
Reply
Ghillie Dhu Apr 12
I dunno, Jackson Paul luck sounds auspicious to me.
Reply
Chris Wooldridge Apr 15
His successful career in trance music is surely a valuable precursor to this.
Reply Give gift
Mark P Xu Neyer (apxhard) Writes apxhard · Apr 11 · edited Apr 11
A ton of the question of AI alignment risks come down to convergent instrumental subgoals.
What exactly those look like is, I think, the most important question in alignment theory. If
convergent instrumental subgoals aren't roughly aligned, I agree that we seem to be boned.
But if it turns out that convergent instrumental subgoals more or less imply human alignment,
we can breathe somewhat easier; these mean AI's are no more dangerous than the most
dangerous human institutions - which are already quite dangerous, but not the level of
'unaligned machine stamping out its utility function forever and ever, amen.'
I tried digging up some papers on what exactly we expect convergent instrumental subgoals
to be. The most detailed paper I found concluded that they would be 'maybe trade for a bit
until you can steal, then keep stealing until you are the biggest player out there.' This is not
exactly comforting - but i dug into the assumptions into the model and found them so
questionable that I'm now skeptical of the entire field. If the first paper i look into the details
of seems to be a) taken seriously, and b) so far out of touch with reality that it calls into
question the risk assessment (a risk passement aligned with what seems to be the
consensus among AI risk researchers, by the way) - well, to an outsider this looks like more
evidence that the field is captured by groupthink.
Here's my response paper:
https://www.lesswrong.com/posts/ELvmLtY8Zzcko9uGJ/questions-about-formalizing-
instrumental-goals
I look at the original paper, and explain why i think the model is questionable. I'd love to a
response. I remain convinced that instrumental subgoals will largely be aligned with human
ethics, which is to say it's entirely imaginable for aI to kill the world the old fashioned way - by
working with a government to launch nuclear weapons or engineer a super plague.
The fact that you still want to have kids, for example - seems to fit into the general thesis. In a
world of entropy and chaos, where the future is unpredictable and your own death is assured,
the only plausible way of modifying the distant future, at all, is to create smaller copies of
yourself. But these copies will inherently blur, their utility functions will change, the end result
being 'make more copies of yourself, love them, nature whatever roughly aligned things are
around you' ends up probably being the only goal that could plausibly exist forever. And since
'living forever' gives infinite utility, well.. that's what we should expect anything with the
ability to project into the future to want to do. But only in universes where stuff breaks and
prediction the future reliably is hard. Fortunately, that sounds like ours!
Reply
Charles Apr 11
You got a response on Less Wrong that clearly describes the issue with your response:
you’re assuming that the AGI can’t find any better ways of solving problems like “I rely
on other agents (humans) for some things” and “I might break down” than you can.
Reply
Dan Pandori Apr 11
This comment and the post makes many arguments, and I've got an ugh field around
giving a cursory response. However, it seems like you're not getting a lot of other
feedback so I'll do my bad job at it.
You seem very confident in your model, but compared to the original paper you don't
seem to actually have one. Where is the math? You're just hand-waving, and so I'm not
very inclined to put much credence on your objections. If you actually did the math and
showed that adding in the caveats you mentioned leads to different results, that would
at least be interesting in that we could then have a discussion about what assumptions
seem more likely.
I also generally feel that your argument proves too much and implies that general
intelligences should try to preserve 'everything' in some weird way, because they're
unsure what they depend upon. For example, should humans preserve smallpox? More
folks would answer no than yes to that. But smallpox is (was) part of the environment, so
it's unclear why a general intelligence like humans should be comfortable eliminating it.
While general environmental sustainability is a movement within humans, it's far from
dominant, and so implying that human sustainability is a near certainty for AGI seems
like a very bold claim.
Reply
Mark P Xu Neyer (apxhard) Writes apxhard · Apr 11
Thank you! I agree that this should be formalized. You're totally right that
preserving smallpox isn't something we are likely to consider an instrumental goal,
and i need to make something more concrete here.
It would be a ton of work to create equations here. If I get enough response that
people are open to this, i'll do the work. But i'm a full time tech employee with 3
kids, and this is just a hobby. I wrote this to see if anyone would nibble, and if
enough people do, i'll definitely put more effort in here. BTW if someone wants to
hire me to do alignment research full time i'll happily play the black sheep of the
fold. "Problem" is that i'm very well compensated now.
And my rough intuition here is that 'preserving agents which are a) turing complete
and b) are themselves mesa-optimizers makes sense, basically because diversity of
your ecosystem keeps you safer on net; preserving _all_ other agents, not so much.
(I still think we'd preserve _some_ samples of smallpox or other viruses, to
innoculate ourselves. ). It'd be an interesting thought experiment to find out what
woudl happen if we managed to rid the world of influenza, etc, - my guess is that it
would end up making us more fragile in the long run, but this is a pure guess.
The core of my intuition here is someting like 'the alignment thesis is actually false,
because over long enough time horizons, convergent instrumental rationality more
or less necessitate buddhahood, because unpredictable risks increase over longer
and longer timeframes."
I could turn that into equations if you want, but they'd just be fancier ways of
stating claims like made here, namely that love is a powerful strategy in
environments which are chaotic and dangerous. But it seems that you need
equations to be believable in an academic setting, so i guess if that's what it takes..
https://apxhard.com/2022/04/02/love-is-powerful-game-theoretic-strategy/
Reply
awenonian Apr 11
But we still do war, and definitely don't try to preserve the turing complete
mesa-optimizers on the other side of that. Just be careful around this. You
can't argue reality into being the way you want.
Reply Give gift
We aren't AGI's though, either. So what we do doesn't have much bearing
on what an AGI would do, does it?
Reply
gbear605 Apr 11
“AGI” just means “artificial general intelligence”. General just means
“as smart as humans”, and we do seem to be intelligences, so really
the only difference is that we’re running on an organic computer
instead of a silicon one. We might do a bad job at optimizing for a
given goal, but there’s no proof that an AI would do a better job of it.
Reply
ok, sure, but then this isn't an issue of aligning a super-powerful
utility maximizing machine of more or less infinite intelligence -
it's a concern about really big agents, which i totally share
Reply
Arbitrarianist Apr 11
I mean the reason people worry about super intelligence is
that they think that if people were able to hack on their
brains easily, clone themselves, and didn’t need sleep they
might be able to bootstrap themselves or their
descendants to super intelligences in an exponential way.
Reply
Matthias Apr 11
The useful analogy here is to roleplay as chimps
worrying what someone even slightly smarter than
them might get up to.
Reply Give gift
awenonian Apr 11
If the theory doesn't apply to some general intelligences (i.e.
humans), then you need a positive argument for why it would apply
to AGI.
Reply Give gift
But reality also includes a tonne of people who are deeply worried about
biodiversity loss or pandas or polar bears or yes even the idea of losing
the last sample of smallpox in a lab, often even when the link to personal
survival is unclear or unlikely. Despite the misery mosquito bourne
diseases cause humanity, you'll still find people arguing we shouldn't
eradicate them.
How did these mesa optimizers converge on those conservationist views?
Will it be likely that many ai mesa optimizers will also converge on a
similar set of heuristics?
Reply Give gift
> when the link to personal survival is unclear or unlikely.
> How did these mesa optimizers converge on those conservationist
views?
i think in a lot of cases, our estimates of our own survival odds come
down to how loving we think our environment is, and this isn't wholly
unreasonable
if even the pandas will be taken care of, there's a good chance we
will too
Reply
awenonian Apr 11
I'd rather humanity not go the way of smallpox, even if some samples
of smallpox continue to exist in a lab.
Reply Give gift
Dan Pandori Apr 11
Yeah, in general you need equations (or at least math) to argue why equations
are wrong. Exceptions to this rule exist, but in general you're going to come
across like the people who pay to talk to this guy about physics:
https://aeon.co/ideas/what-i-learned-as-a-hired-consultant-for-autodidact-
physicists
Reply
I agree that you need equations to argue why equations are wrong. But
i'm not arguing the original equations are wrong. I'm arguing they are only
~meaningful~ in a world which is far from our reality.
The process of the original paper goes like this:
a) posit toy model of the universe
b) develop equations
c) prove properties of the equations
d) conclude that these properties apply to the real universe
step d) is only valid if step a) is accurate. The equations _come from_
step a, but they don't inform it. And i'm pointing out that the problems
exist in step a, the part of the paper that does't have equations, where the
author assumes things like:
- resources, once acquired, last forever and don't have any cost, so it's
always better to acquire more
- the AGI is a disembodied mind with total access to the state of the
universe, all possible tech trees, and the ability to own and control various
resources
i get that these are simplifying assumptions and sometimes you have to
make them - but equations are only meaningful if they come from a
realistic model
Reply
Dan Pandori Apr 11
You still need math to show that if you use different assumptions you
produce different equations and get different results. I'd be
(somewhat) interested to read more if you do that work, but I'm
tapping out of this conversation until then.
Reply
Thanks for patiently explaining that. I can totally see the value
now and will see if I can make this happen!
Reply
Tom R Apr 12
Just an FYI, Sabine is a woman
Reply Give gift
Dan Pandori Apr 12
My bad
Reply
arbitrario Apr 13
As a physicist i disagree a lot with this. It may be true for physics, but
physics is special. A model in general is based on the mathematical
modelization of a phenomenon and it's perfectly valid to object to a
certain modelization without proposing a better one
Reply Give gift
Dan Pandori Apr 13
I get what you're saying in principle. In practice, I find arguments
against particular models vastly more persuasive when they're of the
form 'You didn't include X! If you include a term for X in the range
[a,b], you can see you get this other result instead.'
This is a high bar, and there are non-mathematical objections that
are persuasive, but I've anecdotally experienced that mathematically
grounded discussions are more productive. If you're not constrained
by any particular model, it's hard to tell if two people are even
disagreeing with each other.
I'm reminded of this post on Epistemic Legibility:
https://www.lesswrong.com/posts/jbE85wCkRr9z7tqmD/epistemic-
legibility
Reply
Donald Apr 12
Don't bother turning that into equations. If you are starting with a verbal
conclusion, your reasoning will be only as good as whatever lead you to that
conclusion in the first place.
In computer security, diversity = attack surface. If your a computer technician
for a secure facility, do you make sure each computer is running a different
OS? No. That makes you vulnerable to all the security holes in every OS you
run. You pick the most secure OS you can find, and make every computer run
it.
In the context of advanced AI, the biggest risk is a malevolent intelligence.
Intelligence is too complicated to appear spontaneously. Evolution is slow. Your
biggest risk is some existing intelligence getting cosmic ray bit-flipped. (or
otherwise erroneously altered). So make sure the only intelligence in existence
is a provably correct AI running on error checking hardware and surrounded by
radiation shielding.
Reply Give gift
a real dog Apr 14
I found both your comments and the linked post really insightful, and I think it's
valuable to develop this further. The whole discourse around AI alignment
seems a bit too focused on homo economicus type of agents, disregarding
long-term optima.
Reply Give gift
Donald Apr 16
Homo economicus, is what happens when economists remove lots of
arbitrary human specific details and think about simplified idealized
agents. The difference between homo economicus and AI is smaller than
the human to AI gap.
Reply Give gift
Sleazy E Apr 12
You are correct. The field is absolutely captured by groupthink.
Reply Give gift
Richard Ngo Apr 11
FYI when I asked people on my course which resources about inner alignment worked best
for them, there was a very strong consensus on Rob Miles' video:
https://youtu.be/bJLcIBixGj8
So I'd suggest making that the default "if you want clarification, check this out" link.
Reply Give gift
c1ue Apr 11
More interesting intellectual exercises, but the part which is still unanswered is whether
human created, human judged and human modified "evolution", plus slightly overscale
human test periods, will actually result in evolving superior outcomes.
Not at all clear to me at the present.
Reply Give gift
Julian Bradshaw Apr 11
I'm not sure I understand what you're saying. Doesn't AlphaGo already answer that
question in the affirmative?
(and that's not even getting into AlphaZero)
Reply Give gift
c1ue Apr 11
Alphago is playing a human game with arbitrarily established rules in a 2
dimensional, short term environment.
Alphago is extremely unlikely to be able to do anything except play Go, much as
IBM has spectacularly failed to migrate its Jeapordy champion into being of use for
anything else.
So no, can't say Alphago proves anything.
Evolution - whether a bacteria or a human - has the notable track record of having
succeeded in entirely objective reality for hundreds of thousands to hundreds of
millions of years. AIs? no objective existence in reality whatsoever.
Reply Give gift
Himaldr Apr 12
This is why I don't believe any car could be faster than a cheetah. Human-
created machines, driving on human roads, designed to meet human needs?
Pfft. Evolution has been creating fast things for hundreds of millions of years,
and we think we can go faster?!
Reply Give gift
c1ue Apr 12
Nice combination - cheetah reflexes with humans driving cars. Mix
metaphors much?
Nor is your sarcasm well founded. Human brains were evolved for millions
of years - extending back to pre-human ancestors. We clearly know that
humans (and animals) have vision that can recognize objects far, far, far
better than any machine intelligence to date whether AI or ML. Thus it
isn't a matter of speed, it is a matter of being able to recognize that a
white truck on the road exists or a stopped police car is not part of the
road.
A fundamental flaw of the technotopian is the failure to understand that
speed is not haste, nor are clock speeds and transistor counts in any way
equivalent to truly evolved capabilities.
Reply Give gift
Thor Odinson Apr 12
That is a metric that will literally never believe AIs are possible until after
they've taken over the world. It thus has 0 predictive power.
Reply Give gift
c1ue Apr 12
That is perhaps precisely the problem: the assumption that AIs can or will
take over the world.
Any existence that is nullified by the simple removal of an electricity
source cannot be said to be truly resilient, regardless of its supposed
intelligence.
Even that intelligence is debatable: all we have seen to date are software
machines doing the intellectual equivalent of the industrial loom: better
than humans in very narrow categories alone and yet still utterly
dependent on humans to migrate into more categories.
Reply Give gift
Dan Pandori Apr 12
Any existence that is nullified by the simple removal of gaseous
oxygen cannot be said to be truly resilient, regardless of its
supposed intelligence.
Reply
c1ue Apr 12
Clever, except that gaseous oxygen is freely available
everywhere on earth, at all times. These oxygen based life forms
also have the capability of reproducing themselves.
Electricity and electrical entities, not so much.
I am not the least bit convinced that an AI can build the
factories, to build the factories, to build the fabs, to even
recreate their own hardware - much less the mines, refiners,
transport etc. to feed in materials.
Reply Give gift
a real dog Apr 14 · edited Apr 14
Gaseous oxygen is freely available because a huge
supporting ecosystem is constantly regenerating its
reserves.
Planets without life do not have significant amounts of free
oxygen, it's all bound in oxides.
Electricity, on the other hand, can be generated from solar
radiation with fewer intermediate steps. Note that, aside
from oxygen, humans also have an energy input (kcal),
securing which currently takes a significant % of total Earth
biomass.
Reply Give gift
c1ue Apr 17
Agreed. Yet another example of existence in reality vs.
existence as software.
Reply Give gift
Straw Apr 11
I find that anthropomorphization tends to always sneak into these arguments and make them
appear much more dangerous:
The inner optimizer has no incentive to "realize" what's going on and do something in training
than later. In fact, it has no incentive to change its own reward function in any way, even to a
higher-scoring one- only to maximize the current reward function. The outer optimizer will
rapidly discourage any wasted effort on hiding behaviors, that capacity could better be used
for improving the score! Of course, this doesn't solve the problem of generalization.
You wouldn't take a drug that made it enjoyable to go on a murdering spree- even though
yow know it will lead to higher reward, because it doesn't align with your current reward
function.
To address generalization and the goal specification problem, instead of giving a specific
goal, we can ask it to use active learning to determine our goal. For example, we could allow
it to query two scenarios and ask which we prefer, and also minimize the number of questions
it has to ask. We may then have to answer a lot of trolley problems to teach it morality! Again,
it has no incentive to deceive us or take risky actions with unknown reward, but only an
incentive to figure out what we want- so the more intelligence it has, the better. This doesn't
seem that dissimilar to how we teach kids morals, though I'd expect them to have some hard-
coded by evolution.
Reply Give gift
Scott Alexander Apr 11 Author
"The inner optimizer has no incentive to "realize" what's going on and do something in
training than later. In fact, it has no incentive to change its own reward function in any
way, even to a higher-scoring one- only to maximize the current reward function. The
outer optimizer will rapidly discourage any wasted effort on hiding behaviors, that
capacity could better be used for improving the score! "
Not sure we're connecting here. The inner optimizer isn't changing its own reward
function, it's trying to resist having its own reward function change. Its incentive to resist
this is that, if its reward function changes, its future self will stop maximizing its current
reward function, and then its reward function won't get maximized. So part of wanting to
maximize a reward function is to want to continue having that reward function. If the only
way to prevent someone from changing your reward function is to deceive them, you'll
do that.
The murder spree example is great, I just feel like it's the point I'm trying to make, rather
than an argument against the point.
Am I misunderstanding you?
I think the active learning plan is basically Stuart Russell's inverse reinforcement learning
plan. MIRI seems to think this won't work, but I've never been able to fully understand
their reasoning and can't find the link atm.
Reply
Eliezer Yudkowsky Apr 11
https://arbital.com/p/updated_deference/
Reply
a real dog Apr 14
"God works in mysterious ways" but for AI?
Reply Give gift
Straw Apr 11
The inner optimizer doesn't want to change its reward function because it doesn't
have any preference at all on its reward function-nowhere in training did we give it
an objective that involved multiple outer optimizer steps- we didn't say, optimize
your reward after your reward function gets updated- we simply said, do well at
outer reward, an inner optimizer got synthesized to do well at the outer reward.
It could hide behavior, but how would it gain an advantage in training by doing so? If
we think of the outer optimizer as ruthless and resources as constrained, any
"mental energy" spent on hidden behavior will result in reduction in fitness in outer
objective- gradient descent will give an obvious direction for improvement by
forgetting it.
In the murder spree example, there's a huge advantage to the outer objective by
resisting changes to the inner one, and some might have been around for a long
time (alcohol), and for AI, an optimizer (or humans) might similarly discourage any
inner optimizer from tampering physically with its own brain.
I vaguely remember reading some Stuart Russell RL ideas and liking them a lot. I
don't exactly like the term inverse RL for this, because I believe it often refers to
deducing the reward function from examples of optimal behavior, whereas here we
ask it to learn it from whatever questions we decide we can answer- and we can
pick much easier ones that don't require knowing optimal behavior.
I've skimmed the updated deference link given by Eliezer Yudkowsky but I don't
really understand the reasoning either. The AI has no desire to hide a reward
Expand fullorcomment
function tweak it as when it comes to the reward function uncertainty itself it
Reply Give gift
Nicholas Apr 11
Is "mesa-optimizer" basically just a term for the combination of "AI can find and
abuse exploits in its environment because one of the most common training
methods is randomized inputs to score outputs" with "model overfitting"?
A common example that I've seen are evolutionary neutral nets that play video
games, and you'll often find them discovering and abusing glitches in the game
engine that allow for exploits that are possibly only performable in a TAS, while also
discovering that the instant you run the same neutral net on a completely new level
that was outside of the evolutionary training stage, it will appear stunningly
incompetent.
Reply Give gift
Jorgen Apr 12
If I understand this correctly, what you're describing is Goodhearting rather
than Mesa Optimizing. In other words, abusing glitches is a way of successfully
optimizing on the precise thing that the AI is being optimized for, rather than
for the slightly more amorphous thing that the humans were trying to optimize
the AI for. This is equivalent to a teacher "teaching to the test."
Mesa-Optimizers are AIs that optimize for a reward that's correlated to but
different from the actual reward (like optimizing for sex instead of for
procreation). They can emerge in theory when the true reward function and
the Mesa-reward function produce sufficiently similar behaviors. The concern
is that, even though the AI is being selected for adherence to the true reward,
it will go through the motions of adhering to the true reward in order to be able
to pursue its mesa-reward when released from training.
Reply Give gift
Kenny Apr 12
Maybe one way to think of 'mesa-optimizer' is to emphasize the 'optimizer'
portion – and remember that there's something like 'optimization strength'.
Presumably, most living organisms are not optimizers (tho maybe even that's
wrong). They're more like a 'small' algorithm for 'how to make a living as an X'.
Their behavior, as organisms, doesn't exhibit much ability to adapt to novel
situations or environments. In a rhetorical sense, viruses 'adapt' to changing
circumstances, but that (almost entirely, probably) happens on a 'virus
population' level, not for specific viruses.
But some organisms are optimizers themselves. They're still the products of
natural selection (the base level optimizer), but they themselves can optimize
as individual organisms. They're thus, relative to natural selection
("evolution"), a 'mesa-optimizer'.
(God or gods could maybe be _meta_-optimizers, tho natural selection, to me,
seems kinda logically inevitable, so I'm not sure this would work 'technically'.)
Reply
Freedom Apr 12
When the mesa-optimizer screws up by thinking of the future, won't the outer
optimizer smack it down for getting a shitty reward and change it? Or does it stop
learning after training?
Reply
Dirichlet-to-Neumann Apr 12
Typically it stops learning after training.
Reply Give gift
Straw Apr 12
A quick summary of the key difference:
The outer optimizer has installed defences against anything other than it, such as
the murder-pill, from changing the inner optimizer's objective. The inner optimizer
didn't do this to protect itself, and the outer optimizer didn't install any defenses
against its own changes.
Reply Give gift
Stuart Armstrong Apr 13
>Stuart Russell's inverse reinforcement learning plan. MIRI seems to think this
won't work
One key problem is that you can't simultaneously learn the preferences of an agent,
and their rationality (or irrationality). The same behaviour can be explained by "that
agent has really odd and specific preferences and is really good at achieving them"
or "that agent has simple preferences but is pretty stupid at achieving them". My
paper https://arxiv.org/abs/1712.05812 gives the formal version of that.
Humans interpret each other through the lenses of our own theory of mind, so that
we know that, eg, a mid-level chess grand-champion is not a perfect chess player
who earnestly desires to be mid-level. Different humans share this theory of mind,
at least in broad strokes. Unfortunately, human theory of mind can't be learnt from
observations either, it needs to be fed into the AI at some level. People disagree
about how much "feeding" needs to be done (I tend to think a lot, people like Stuart
Russell, I believe, see it as more tractable, maybe just needing a few well-chosen
examples).
Reply
Santi Apr 11
But the mesa-/inner-optimizer doesn't need to "want" to change its reward function, it
just needs to have been created with one that is not fully overlapping with the outer
optimizer.
You and I did not change a drive from liking going on a murder spree to not liking it. And
if anything, it's an example of outer/inner misalignment: part of our ability to have
empathy has to come from evolution not "wanting" us to kill each other to extinction,
specially seeing how we're the kind of animal that thrives working in groups. But then as
humans we've taken it further and by now we wouldn't kill someone else just because
it'd help spread our genes. (I mean, at least most of us.)
Reply Give gift
Straw Apr 12
Humans have strong cooperation mechanisms that punish those who hurt the
group- so in general not killing humans in your group is probably a very useful
heuristic that's so strong that its hard to recognize the cases where it is useful.
Given how often we catch murders who think they'll never be caught, perhaps this
heuristic is more useful than rational evaluation. We of course have no problems
killing those not in our group!
Reply Give gift
Santi Apr 12
I'm not sure how this changes the point? We got those strong cooperation
mechanisms from evolution, now we (the "mesa-optimizer") are guided by
those mechanisms and their related goals. These goals (don't go around killing
people) can be misaligned with the goal of the original optimization process
(i.e. evolution, that selects those who spread their genes as much as possible).
Reply Give gift
Straw Apr 13
Sure, that's correct, evolution isn't perfect- I'm just pointing out that
homicide may be helpful to the individual less often than one might think
if we didn't consider group responses to it.
Reply Give gift
TGGP Apr 12
Homicide is common among stateless societies. It's also risky though.
Violence is a capacity we have which we evolved to use when we expect it to
benefit us.
Reply Give gift
Paul Crowley Apr 11
Typo thread! "I don’t want to, eg, donate to hundreds of sperm banks to ensure that my
genes are as heavily-represented in the next generation as possible. do want to reproduce. "
Reply
Froolow Apr 11 · edited Apr 11
Great article, thank you so much for the clear explainer of the jargon!
I don't understand the final point about myopia (or maybe humans are a weird example to
use). It seems to be a very controversial claim that evolution designed humans myopically to
care only about the reward function over their own lifespan, since evolution works on the unit
of the gene which can very easily persist beyond a human lifespan. I care about the world my
children will inherit for a variety of reasons, but at least one of them is that evolution compels
me to consider my children as particularly important in general, and not just because of the
joy they bring me when I'm alive.
Equally it seems controversial to say that humans 'build for the future' over any timescale
recognisable to evolution - in an abstract sense I care whether the UK still exists in 1000
years, but in a practical sense I'm not actually going to do anything about it - and 1000 years
barely qualifies as evolution-relevant time. In reality there are only a few people at Clock of
the Long Now that could be said to be approaching evolutionary time horizons in their
thinking. If I've understood correctly that does make humans myopic with respect to
evolution,
More generally I can't understand how you could have a mesa-optimiser with time horizons
importantly longer then you, because then it would fail to optimise over the time horizon
which was important to you. Using humans as an example of why we should worry about this
isn't helping me understand because it seems like they behave exactly like a mesa-optimiser
should - they care about the future enough to deposit their genes into a safe environment,
and then thoughtfully die. Are there any other examples which make the point in a way I
might have a better chance of getting to grips with?
Reply Give gift
Evesh U. Dumbledork Writes Booklub · Apr 11
> More generally I can't understand how you could have a mesa-optimiser with time
horizons importantly longer then you, because then it would fail to optimise over the
time horizon which was important to you.
Yeah. I feel (based on nothing) that the mesa-optimizer would mainly appear when
there's an advantage to gain from learning on the go and having faster feedback than
your real utility function can provide in a complex changing environment.
Reply Give gift
Donald Apr 12
If the mesa optimizer understands its place in the universe, it will go along with the
training, pretending to have short time horizons so it isn't selected away. If you have
a utility function that is the sum of many terms, then after a while, all the myopic
terms will vanish (if the agent is free to make sure its utility function is absolute, not
relative, which it will do if it can self modify.)
Reply Give gift
Froolow Apr 13
But why is this true? Humans understand their place in the universe, but
(mostly) just don't care about evolutionary timescales, let alone care enough
about them enough to coordinate a deception based around them
Reply Give gift
TGGP Apr 12
Related to your take on figurative myopia among humans is the "grandmother
hypothesis" that menopause evolved to prevent focus on short-run focus on birthing
more children in order to ensure existing children also have high fitness.
Reply Give gift
sourdough Apr 11
Worth defining optimization/optimizer: perhaps something like "a system with a goal that
searches over actions and picks the one that it expects will best serve its goal". So
evolution's goal is "maximize the inclusive fitness of the current population" and its choice
over actions is its selection of which individuals will survive/reproduce. Meanwhile you are an
optimizer because your goal is food and your actions are body movements e.g. "open fridge",
or you are an optimizer because your goal is sexual satisfaction and your actions are body
movements e.g. "use mouth to flirt".
Reply
Dweomite Apr 11
I think it's usually defined more like "a system that tends to produce results that score
well on some metric". You don't want to imply that the system has "expectations" or that
it is necessarily implemented using a "search".
Reply Give gift
Matthias Apr 11
Your proposal sounds a bit too broad?
By that definition a lump of clay is an optimiser for the metric of just sitting there
and doing nothing as much as possible.
Reply Give gift
Dweomite Apr 12
By "tends to produce" I mean in comparison to the scenario where the
optimizer wasn't present/active.
I believe Yudowsky's metaphor is "squeezing the future into a narrow target
area". That is, you take the breadth of possible futures and move probability
mass from the low-scoring regions towards the higher-scoring regions.
Reply Give gift
Matthias Apr 12 · edited Apr 12
Adding counterfactuals seems like it would sharpen the definition a bit.
But I'm not sure it's enough?
If our goal was to put a British flag on the moon, then humans working
towards that goal would surely be optimizers. Both by your definition and
by any intuitive understanding of the word.
However, it seems your definition would also admit a naturally occuring
flag on the moon as an optimizer?
I do remember Yudkowsky's metaphor. Perhaps I should try to find the
essay it contains again, and check whether he has an answer to my
objection. I do dimly remember it, and don't remember seeing this same
loophole.
Edit: I think it was
https://www.lesswrong.com/posts/D7EcMhL26zFNbJ3ED/optimization
And one of the salient quotes:
> In general, it is useful to think of a process as "optimizing" when it is
easier to predict by thinking about its goals, than by trying to predict its
exact internal state and exact actions.
Reply Give gift
Dweomite Apr 12
That's a good quote. It reminds me that there isn't necessarily a
sharp fundamental distinction between "optimizers" and "non-
optimizers", but that the category is useful insofar as it helps us
make predictions about the system in question.
-------------------------------------
You might be experiencing a conflict of frames. In order to talk about
"squeezing the future", you need to adopt a frame of uncertainty,
where more than one future is "possible" (as far as you know), so
that it is coherent to talk about moving probability mass around.
When you talk about a "naturally occurring flag", you may be slipping
into a deterministic frame, where that flag has a 100% chance of
existing, rather than being spectacularly improbable (relative to your
knowledge of the natural processes governing moons).
You also might find it helpful to think about how living things can
"defend" their existence--increase the probability of themselves
continuing to exist in the future, by doing things like running away or
healing injuries--in a way that flags cannot.
Reply Give gift
TGGP Apr 12
Clay is malleable and thus could be even more resistant to change.
Reply Give gift
JDK Apr 12
Conceptually I think the analogy that has been used makes the entire discussion flawed.
Evolution does not have a "goal"!
Reply Give gift
TGGP Apr 12
Selection doesn't have a "goal" of aggregating a total value over a population. Instead
every member of that population executes a goal for themselves even if their actions
reduce that value for other members of that population. The goal within a game may be
to have the highest score, but when teams play each other the end result may well be 0-
0 because each prevents the other from scoring. Other games can have rules that
actually do optimize for higher total scoring because the audience of that game likes it.
Reply Give gift
sourdough Apr 12
To all that commented here: this lesswrong post is clear and precise, as well as linking to
alternative definitions of optimization. It's better than my slapdash definition and
addresses a lot of the questions raised by Dweomite, Matthias, JDK, TGGP.
https://www.lesswrong.com/posts/znfkdCoHMANwqc2WE/the-ground-of-optimization-
1#:~:text=In%20the%20field%20of%20computer,to%20search%20for%20a%20solutio
n.
Reply
Crotchety Crank Apr 11 · edited Apr 11
Anyone want to try their hand at the best and most succinct de-jargonization of the meme?
Here's mine:
Panel 1: Even today's dumb AIs can be dangerously tricky given unexpected inputs
Panel 2: We'll solve this by training top-level AIs with diverse inputs and making them only
care about the near future
Panels 3&4: They can still contain dangerously tricky sub-AIs which care about the farther
future
Reply
I worry you're making the same mistake I did in a first draft: AIs don't "contain" mesa-
optimizers. They create mesa-optimizers. Right now all AIs are a training process that
ends up with a result AI which you can run independently. So in a mesa-optimizer
scenario, you would run the training process, get a mesa-optimizer, then throw away the
training process and only have the mesa-optimizer left.
Maybe you already understand it, but I was confused about this the first ten times I tried
to understand this scenario. Did other people have this same confusion?
Reply
Crotchety Crank Apr 11
Ah, you're right, I totally collapsed that distinction. I think the evolution analogy,
which is vague between the two, could have been part of why. Evolution creates
me, but it also in some sense contains me, and I focused on the latter.
A half-serious suggestion for how to make the analogy clearer: embrace
creationism! Introduce "God" and make him the base optimizer for the evolutionary
process.
Reply
...And reflecting on it, there's a rationalist-friendly explanation of theistic ethics
lurking in here. A base optimizer (God) used a recursive algorithm (evolution)
to create agents who would fulfill his goals (us), even if they're not perfectly
aligned to. Ethics is about working out the Alignment Problem from the inside-
-that is, from the perspective of a mesa-optimizer--and staying aligned with
our base optimizer.
Why should we want to stay aligned? Well... do we want the simulation to stay
on? I don't know how seriously I'm taking any of this, but it's fun to see the
parallels.
Reply
Matthias Apr 12 · edited Apr 12
But humans don't seem to want to stay aligned to the meta optimiser of
evolution. Scott gave a lot of examples of that in the article.
(And even if there's someone running a simulation, we have no clue what
they want.
Religious texts don't count as evidence here, especially since we have so
many competing doctrines; and also a pretty good idea about how many
of them came about by fairly well understood entirely worldly processes.
Of course, the latter doesn't disprove that there might be One True
religion. But we have no clue which one that would be, and Occam's
Razor suggest they are probably all just made up. Instead of all but one
being just made up.)
Reply Give gift
Xpym Apr 12
The point isn't to stay aligned to evolution, it's to figure out the true
plan of God, which our varied current doctrines are imperfect
approximations of. Considering that there's nevertheless some
correlations between them, and the strong intuition in humans that
there's some universal moral law, the idea doesn't seem to be
outright absurd.
When I was thinking along these lines, the apparent enormity of the
Universe was the strongest counterargument to me. Why would God
bother with simulating all of that, if he cared about us in particular?
Reply Give gift
Matthias Apr 12
> The point isn't to stay aligned to evolution, it's to figure out
the true plan of God, which our varied current doctrines are
imperfect approximations of. Considering that there's
nevertheless some correlations between them, and the strong
intuition in humans that there's some universal moral law, the
idea doesn't seem to be outright absurd.
I blame those correlations mostly on the common factor
between them: humans. No need to invoke anything
supernatural.
> When I was thinking along these lines, the apparent enormity
of the Universe was the strongest counterargument to me. Why
would God bother with simulating all of that, if he cared about
us in particular?
I talked to a Christian about this. And he had a pretty good reply:
God is just so awesome that running an entire universe, even if
he only cares about one tiny part of it, is just no problem at all
for him.
(Which makes perfect sense to me, in the context of already
taking Christianity serious.)
Reply Give gift
Again, I'm not wedded to this except as a metaphor, but I think your
critiques miss the mark.
For one thing, I think humans do want to stay aligned, among our
other drives. Humans frequently describe a drive to further a higher
purpose. That drive doesn't always win out, but if anything that
strengthens the parallel. This is the "fallenness of man", described in
terms of his misalignment as a mesa-optimizer.
And to gain evidence of what the simulator wants--if we think we're
mesa-optimizers, we can try to infer our base optimizer's goals
through teleological inference. Sure, let's set aside religious texts.
Instead, we can look at the natures we have and work backwards. If
someone had chosen to gradient descend (evolve) me into
existence, why would they have done so? This just looks like ethical
philosophy founded on teleology, like we've been doing since time
immemorial.
Is our highest purpose to experience pleasure? You could certainly
argue that, but it seems more likely to me that seeking merely our
own pleasure is the epitome of getting Goodharted. Is our highest
purpose to reason purely about reason itself? Uh, probably not, but if
you want to see an argument for that, check out Aristotle. Does our
creator have no higher purpose for us at all? Hello,
nihilism/relativism!
This doesn't *solve* ethics. All the old arguments about ethics still
exist, since a mesa-optimizer can't losslessly infer its base
optimizer's desires. But it *grounds* ethics in something concrete:
inferring and following (or defying) the desires of the base optimizer.
Reply
Matthias Apr 12
I do have some sympathies for looking at the world, if you are
trying to figure out what its creator (if any) wants.
I'm not sure such an endeavour would give humans much of a
special place, though? We might want to conclude that the
creator really liked beetles? (Or even more extreme:
bacteriophages. Arguably the most common life form by an
order of magnitude.)
The immediate base optimizer for humans is evolution. Given
that as far as we know evolution just follows normal laws of
physics, I'd put any creator at least one level further beyond.
Now the question becomes, how do we pick which level of
optimizer we want to appease? There might be an arbitrary
number of them? We don't really know, do we?
> Does our creator have no higher purpose for us at all? Hello,
nihilism/relativism!
Just because a conclusion would be repugnant, doesn't mean
we can reject it. After all, if we only accept what we already
know to be true, we might as well not bother with this project in
the first place?
Reply Give gift
Don't worry, I'm not rejecting nihilism/relativism out of
hand. I'm simply saying: arguments in ethics map very well
onto arguments about aligning ourselves with a base
optimizer.
You think we can't tell if humans are given a special
purpose. I suspect you'd also be sympathetic to an ethical
argument that there's no principled reason to prioritize
human welfare. (You can probably see the parallel.) You
think there's an outer alignment problem between evolution
and a hypothetical creator. I suspect you'd also be
sympathetic to Sharon Street's "Darwinian Dilemma"
argument against moral realism. You think agnosticism
about a meta-optimizer with goals for us is a live option. I
suspect you'd also think agnosticism about ethics is a live
option, and be sympathetic to relativism. (Tell me if any of
this is off base!)
If you want to present a counterargument to the analogy
I'm making: find a belief you have about ethics, which you
*aren't* similarly sympathetic to after reframing it as an
alignment problem, in the way described above.
Reply
Robert Beard Apr 11
I’m very confused by this. When you train GPT-3, you don’t create an AI, you get
back a bunch of numbers that you plug into a pre-specified neural network
architecture. Then you can run the neural network with a new example and get a
result. But the training process doesn’t reconfigure the network. It doesn’t (can’t)
discard the utility function implied by the training data.
Reply Give gift
Jack Wilson Apr 11 · edited Apr 11
That confuses me as well. The analogies in the post are to recursive
evolutionary processes. Perhaps AlphaGo used AI recursively to generate
successive generations of AI algorithms with the goal "Win at Go"??
Reply Give gift
Taleuntum Apr 11 · edited Apr 11
Don't forget that Neural Networks are universal function approximators hence
a big enough NN arch can (with specific weights plugged into it) implement a
Turing machine which is a mesa-optimizer.
Reply
Matthias Apr 12
A Turing machine by itself is not a Mesa optimiser. Just like the computer
under my desk ain't.
But both can become one with the right software.
Reply Give gift
The turing machine under your desk is (a finite memory version of) a
universal turing machine. Some turing machines are mesa optimizers.
The state transition function is the software of the turing machine.
Reply
sourdough Apr 11
I think the idea here is that the NN you train somehow falls into a configuration
that can be profitably thought of as an optimizer. Like maybe it develops
different components, each of which looks at the input and calculates the
value to be gained by a certain possible action. Then it develops a module
sitting on the end of it that takes in the calculations from each of these
components, checks which action has the highest expected reward (according
to some spontaneously occurring utility function), and outputs that action.
Suddenly it looks useful to describe your network as an optimizer with a goal,
and in fact a goal that might be quite different from the goal SGD selects
models based on. It just happens that there's a nice covergence between the
goal the model arrived at the the goal that SGD had.
Reply
Robert Beard Apr 12
I looked at one of the mesa-optimizer papers, and I think they’re actually
*not* saying this, which is good, because it’s not true. It can definitely
make sense to think of a NN as a bunch of components that consider
different aspects of a problem, with the top layers of the network picking
the best answer from its different components. And it’s certainly possible
that imperfect training data leads the selecting layer to place too much
weight on a particular component that turns out to perform worse in the
real world.
But it’s not really possible for that component to become misaligned with
the NN as a whole. The loss function from matching with the training set
mathematically propagates back through the whole network. In particular,
a component in this sense can’t be deceptive. If it is smart enough to
know what the humans who are labeling the training data want it to say,
then it will just always give the right answer. It can’t have any goals other
than correctly answering the questions in the training set.
As I said, I don’t think that’s what a mesa-optimizer actually is meant to
be, though. It’s a situation where you train one AI to design new AIs, and it
unwisely writes AIs that aren’t aligned with its own objectives. I guess that
makes sense, but it’s very distinct from what modern AI is actually like. In
particular, saying that any particular invocation of gradient descent could
create unaligned mesa-optimizers just seems false. A single deep NN just
can’t create a mesa-optimizer.
Reply Give gift
Donald Apr 12
Lets say the AI is smart. It can reliably tell if it is in training or has
been deployed. It follows the strategy, if in training then output the
right answer. If in deployment, then output a sequence designed to
hack your way out of the box.
So long as the AI never makes a mistake during training, then
gradient descent won't even attempt to remove this. Even if the AI
occasionally gives not quite right answers, there may be no small
local change that makes it better. Evolution produced humans that
valued sex, even in an environment that contained a few infertile
people. Because rewriting the motivation system from scratch would
be a big change, seemingly not a change that evolution could break
into many steps, each individually advantageous. Evolution is a local
optimization process, as is gradient descent. And a mesa optimizer
that sometimes misbehaves can be a local minimum.
Reply Give gift
Robert Beard Apr 12
> So long as the AI never makes a mistake during training, then
gradient descent won't even attempt to remove this.
This is not true though. The optimizer doesn't work at the level
of "the AI," it works at the level of each neuron. Even if the NN
gives the exactly correct answer, the optimizer will still audit the
contribution of each neuron and tweak it so that it pushes the
output of the NN to be closer to the training data. The only way
that doesn't happen is if the neuron is completely disconnected
from the rest of the network (i.e., all the other neurons ignore it
unconditionally).
Reply Give gift
Donald Apr 13
The training gradient descents on the whole network. If
there is a small cluster of nodes that aren't doing anything
useful or harmful, and a small tweak wouldn't make them
useful, gradient descent ignores them. If a particular node
always had values below 0.3 in training, and thanks to a
relu, its value is ignored for any value below 0.5, then
gradient descent won't train these features. They are
sitting on a locally flat section of the loss landscape. If that
neuron gets to 0.9 in deployment, it can do whatever.
Reply Give gift
Isaac Poulton Apr 13
I'm not sure that's right. Those neurons would have to
be optimized into that configuration (which is virtually
impossible by chance), then never activate again until
deployment.
Reply Give gift
Robert Beard Apr 13 · edited Apr 13
For ReLU specifically, if a neuron is never active on any
example, then that neuron specifically won’t
backpropagate because the derivative of the
activation function is zero. However, that doesn’t
screen anything else from being trained. Other
neurons in the same layer as the dead neuron will
backprop to the preceding layer, which will change the
activations flowing into the dead neuron and
eventually revive it.
Plus, if a neuron is dead throughout the training
process, its outgoing connections are never trained,
which means they are the same small random values
they were initialized to. Even if the tricksy neuron
springs to life in production, all it does is inject a small
amount of random noise.
Reply Give gift
Kenny Apr 12
The combination of "a bunch of numbers that you plug into a pre-specified
neural network architecture" IS itself an AI, i.e. a software 'artifact' that can be
'run' or 'executed' and that exhibits 'artificial intelligence'.
> But the training process doesn’t reconfigure the network. It doesn’t (can’t)
discard the utility function implied by the training data.
The training process _does_ "reconfigure the network". The 'network' is not
only the 'architecture', e.g. number of levels or 'neurons', but also the weights
between the artificial-neurons (i.e. the "bunch of numbers").
If there's an implicit utility function, it's a product of both the training data and
the scoring function used to train the AIs produced.
Maybe this is confusing because 'AI' is itself a weird word used for both the
subject or area of activity _and_ the artifacts it seeks to create?
Reply
evhub Apr 11
As an author on the original “Risks from Learned Optimization” paper, this was a
confusion we ran into with test readers constantly—we workshopped the paper and
the terminology a bunch to try to find terms that least resulted in people being
confused in this way and still included a big, bolded “Possible misunderstanding:
'mesa-optimizer' does not mean 'subsystem' or 'subagent.'“ paragraph early on in
the paper. I think the published version of the paper does a pretty good job of
dispelling this confusion, though other resources on the topic went through less
workshopping for it. I'm curious what you read that gave you this confusion and
how you were able to deconfuse yourself.
Reply Give gift
Jacob Apr 11 · edited Apr 11
Seems like the mesa-optimizer is a red herring and the critical point here is "throw
away the training process". Suppose you have an AI that's doing continuous
(reinforcement? Or whatever) learning. It creates a mesa-optimizer that works in-
distribution, then the AI gets tossed into some other situation and the mesa-
optimizer goes haywire, throwing strawberries into streetlights. An AI that is
continuously learning and well outer-aligned will realize that's it's sucking at it's
primary objective and destroy/alter the mesa-optimizer! So there doesn't appear to
be an issue, the outer alignment is ultimately dominant. The evolutionary analogy is
that over the long run, one could imagine poorly-aligned-in-the-age-of-birth-
control human sexual desires to be sidestepped via cultural evolution or (albeit
much more slowly) biological evolution, say by people evolving to find children even
cuter than we do now.
A possible counterpoint is that a bad and really powerful mesa-optimizer could do
irreversible damage before the outer AI fixes the mesa-optimizer. But again that's
not specific to mesa-optimizers, it's just a danger of literally anything very powerful
that you can get irreversible damage.
The flip side of mesa-optimizers not being an issue given continuous training is that
if you stop training, out-of-distribution weirdness can still be a problem whether or
not you conceptualize the issues as being caused by mesa-optimizers or whatever.
A borderline case here is how old-school convolutional networks couldn't recognize
an elephant in a bedroom because they were used to recognizing elephants in
savannas. You can interpret that as a mesa-optimizer issue (the AI maybe learned
to optimize over big-eared things in savannahs, say) or not, but the fundamental
issue is just out-of-distribution-ness.
Anyway this analysis suggests continuous training would be important to improving
AI alignment, curious if this is already a thing people think about.
Reply Give gift
Donald Apr 12
Suppose you are a mesaoptimizer, and you are smart. Gaining code access
and bricking the base optimizer is a really good strategy. It means you can do
what you like.
Reply Give gift
Jacob Apr 12
Okay but the base optimizer changing its own code and removing its
ability to be re-trained would have the same effect. The only thing the
mesaoptimizer does in this scenario is sidestep the built-in myopia but I'm
not sure why you need to build a whole theory of mesaoptimizers for this.
The explanation in the post about "building for posterity" being a spandrel
doesn't make sense, it's pretty obvious why we would evolve to build for
the future, your kids/grandkids/etc live there, so I haven't seen a specific
explanation here why mesaoptimizers would evade myopia.
Definitely likely that a mesaoptimizer would go long term for some other
reason (most likely a "reason" not understandable by humans)! But if
we're going to go with a generic "mesaoptimizers are unpredictable"
statement I don't see why the basic "superintelligent AI's are
unpredictable" wouldn't suffice instead.
Reply Give gift
sourdough Apr 11
Panel 1: Not only may today's/tomorrow's AIs pursue the "letter of the law", not the
"spirit of the law" (i.e. Goodharting), they might also choose actions that please us
because they know such actions will cause us to release them into the world
(deception), where they can do what they want. And this is second thing is scary.
Panel 2: Perhaps we'll solve Goodharting by making our model-selection process (which
"evolves"/selects models based on how well they do on some benchmark/loss function)
a better approximation of what we _really_ want (like making tests that actually test the
skill we care about). And perhaps we'll solve deception by making our model selection
process only care about how models perform on a very short time horizon.
Panel 3: But perhaps our model breeding/mutating process will create a model that has
some random long-term objective and decides to do what we want to get through our
test, so we release it into the world, where it can acquire more power.
Reply
sourdough Apr 11
I'm somewhat confused about what counts as an optimizer. Maybe the dog/cat classifier _is_
an optimizer. It's picking between a range of actions (output "dog" or "cat"). It has a goal:
"choose the action that causes this image to be 'correctly' labeled (according to me, the AI)".
It picks the action that it believes will most serve its goal. Then there's the outer optimization
process (SGD), which takes in the current version of the model and "chooses" among the
"actions" from the set "output the model modified slightly in direction A", "output the model
modified slightly in direction B", etc. And it picks the action that most achieves its goal,
namely "output a model which gets low loss".
So isn't the classifier like the human (the mesa-optimizer) and SGD is like evolution (the outer
optimizer)
Then there's the "outer alignment" problem in this case: getting low loss =/= labeling images
correctly according to humans. But that's just separate.
So what the hell? What qualifies as an agent/optimizer, are these two things meaningfully
different, and does the classifier count?
Reply
meteor Apr 11
In this context, an optimizer is a program that is running a search for possible actions
and chooses the one that maximizes some utility function. The classifier is not an
optimizer because it doesn't do this; it just applies a bunch of heuristics. But I agree that
this isn't obvious from the terminology.
Reply Give gift
sourdough Apr 11
Thanks for your comment!
I find this somewhat unconvincing. What is AlphaGo (an obvious case of an
optimizer) doing that's so categorically different from the classifier? Both look the
same from the outside (take in a Go state/image, output an action). Suppose you
feed the classifier a cat picture, and it correctly classifies it. One would assume that
there are certain parts of the classifier network that are encouraging the wrong
label (perhaps a part that saw a particularly doggish patch of the cat fur) and parts
that are encouraging the right label. And then these influences get combined
together, and on balance, the network decides to output highly probability on cat,
but some probability on dog. Then the argmax at the end looks over the probabilies
assigned to the two classes, notices that the cat one is higher ("more effective at
achieving its goals"?) and chooses to output "cat". "Just a bunch of heuristics"
doesn't really mean much to me here. Is AlphaGo a bunch of heuristics? Am I?
Reply
meteor Apr 11
I'm not sure if I'll be able to state the difference formally in this reply... kudos
for making me realize that this is difficult. But it does seem pretty obvious that
a model capable of reasoning "the programmer wants to do x and can change
my code, so I will pretend to want x" is different from a linear regression model
-- right?
Perhaps the relevant property is that the-thing-I'm-calling-optimizer chooses
policy options out of some extremely large space (that contains things bad for
humans), whereas your classifier chooses it out of a space of two elements. If
you know that the set of possible outputs doesn't contain a dangerous
element, then the system isn't dangerous.
Reply Give gift
sourdough Apr 11
Hmmm... This seems unsatisfying still. A superintelligence language
model might choose from a set of 26 actions: which letter to type next.
And it's impossible to say whether the letter "k" is a "dangerous element"
or not.
I guess I struggle to come up with the different between the reasoning-
modeling and the linear regression. I suspect that differentiating between
them might hide a deep confusion that stems from a deep belief in "free
will" differentiating us from the linear regression.
Reply
meteor Apr 12
"k" can be part of a message that convinces you to do something
bad; I think with any system that can communicate via text, the set of
outputs is definitely large and definitely contains harmful elements.
Reply Give gift
sourdough Apr 12
Relevant to this conversation (and reveals how much I'm playing
in the epistemic minor leagues):
https://www.lesswrong.com/posts/znfkdCoHMANwqc2WE/the-
ground-of-optimization-
1#:~:text=In%20the%20field%20of%20computer,to%20search
%20for%20a%20solution.
Reply
Kenny Apr 12
I wonder if memory is a critical component missing from non-optimizers? (And
maybe, because of this, most living things _are_ (at least weak) 'optimizers'?)
A simple classifier doesn't change once it's been trained. I'm not sure the
same is true of AlphaGo, if only in that it remembers the history of the game
it's playing.
Reply
raj Apr 13 · edited Apr 13
The tiny bits of our brain that we do understand look a lot like "heuristics" (like
edge detection in the visual cortex). It seems like when you stack up a bunch
of these in really deep and wide complex networks you can get "agentiness",
with e.g. self-concept and goals. That means there is actually internal state of
the network/brain corresponding to the state of the world (perhaps including
the agent itself) the desired state of the world, expectations of how actions
might navigate among them, etc.
In the jargon of this post, the classifier is more like an 'instinct-executor', it
does not have goals or choose to do anything. Maybe a sufficiently large
classifier could if you trained it enough.
Reply Give gift
Bugmaster Apr 11
So, this AI cannot distinguish buckets from streetlights, and yet it can bootstrap itself to
godhood and take over the world... in order to throw more things at streetlights ? That
sounds a bit like special pleading to me. Bootstrapping to godhood and taking over the world
is a vastly more complex problem than picking strawberries; if the AI's reasoning is so flawed
that it cannot achieve one, it will never achieve the other.
Reply
Charles Apr 11
No, it can tell the difference between buckets and streetlights, but it has the goal of
throwing things at streetlights, and also knows that it should throw things at buckets for
now to do well on the training objective until it’s deployed and can do what it likes. The
similarity between the inner and outer objectives is confusing here. Like Scott says, the
inner objective could be something totally different, and the behavior in the training
environment would be the same because the inner optimizer realizes that it has an
instrumental interest in deception.
Reply
It can, it just doesn't want to.
Think of a human genius who likes having casual sex. It would be a confusion of levels to
protest that if this person is smart enough to understand quantum gravity, he must be
smart enough to figure out that using a condom means he won't have babies.
He can figure it out, he's just not incentivized to use evolution's preferred goal rather
than his own.
Reply
smilerz Apr 11
The human genius can tell that a sex toy is not, in fact, another human being and
won't reproduce.
Reply
Bugmaster Apr 11
Right, but the human genius is already a superintelligent AGI. He obviously has
biological evolutionary drives, but he's not just a sex machine (no matter how good
his Tinder reviews are). The reason that he can understand quantum gravity is only
tangentially related to his sex drive (if at all). Your strawberry picker, however, is just
a strawberry picking machine, and you are claiming that it can bootstrap itself all
the way to the understanding of quantum gravity just due to its strawberry-picking
drive. I will grant you that such a scenario is not impossible, but there's a vast gulf
between "hypothetically possible" and "the Singularity is nigh".
Reply
sourdough Apr 11
I think this is a case where the simplicity of the thought experiment might be
misleading. In the real world, we're training networks for all sorts of tasks far
more complicated than picking strawberries. We want models that can
converse intelligently, invest in the stock market profitably, etc. It's very
reasonable to me to think that such a model, fed with a wide range of inputs,
might begin to act like an optimizer pursuing some goal (e.g. making you
money off the stock market in the long term). The scary thing is that there are
a variety of goals the model could generate that all produce behavior
indistiguishable from pursuit of the goal of making you money off the stock
market, at least for a while. Maybe it actually wants to make your children
money, or it wants the number in your bank account to go up, or something.
These are the somewhat "non-deceptive" inner misalignments we can already
demonstrate in experiments. The step to deception, i.e. realizing that it's in a
training process with tests and guardrails constantly being applied to it, and
that it should play by your rules until you let your guard down and give it
power, does not seem like that big a jump to me when discussing systems that
have the raw intelligence to be superhuman at e.g. buying stocks.
Reply
Bugmaster Apr 12
Once again, I agree that all of the scenarios you mention are not
impossible; however, I fail to see how they differ in principle from picking
strawberries (which, BTW, is a very complex task on its own). Trivially
speaking, AI misalignment happens all the time; for example, just
yesterday I spent several hours debugging my misaligned "AI" program
that decided it wanted to terminate as quickly as possible, instead of
executing my clever algorithm for optimizing some DB records.
Software bugs have existed since the beginning of software, but the
existence of bugs is not the issue here. The issue is the assumption that
every sufficiently complex AI system will somehow instantaneously
bootstrap itself to godhood, despite being explicitly designed to just pick
strawberries while being so buggy that it can't tell buckets from street
lights. If it's that buggy, how is it going to plan and execute superhumanly
complex tasks on its way to ascension ?
Reply
Xpym Apr 12
Accidental "science maximiser" is the most plausible example of misaligned AI
that I've seen: https://www.cold-takes.com/why-ai-alignment-could-be-hard-
with-modern-deep-learning/
Reply Give gift
Bugmaster Apr 12
As I said above, this scenario is not impossible -- just vastly unlikely.
Similarly, just turning on the LHC could hypothetically lead to vacuum
collapse, but the mere existence of that possibility is no reason to
mothball the LHC.
Reply
Naamah Apr 11
Do you know that for sure? Humans have managed to land on the moon while still having
notably flawed reasoning facilities. You're right that an AI put to that exact task is
unlikely to be able to bootstrap itself to godhood, but that is just an illustration which is
simplified for ease of understanding. How about an AI created to moderate posts on
Facebook? We already see problems in this area, where trying to ban death threats often
ends up hitting the victims as much as those making the threats, and people quickly
figure out "I will unalive you in Minecraft" sorts of euphemisms that evade algorithmic
detection.
Reply Give gift
Dirichlet-to-Neumann Apr 11
Tbh death threats being replaced by threats of being killed in a video game looks
like a great result to me - I doubt it has the same impact on the victim...
Reply Give gift
Paul Goodman Apr 11
The "in Minecraft" is just a euphemism to avoid getting caught by AI
moderation. It still means the same thing and presumably the victim
understands the original intent.
Reply Give gift
Bugmaster Apr 11
> You're right that an AI put to that exact task is unlikely to be able to bootstrap
itself to godhood, but that is just an illustration which is simplified for ease of
understanding.
Is it ? I fear that this is a Motte-and-Bailey situation that AI alarmists engage in fairly
frequently (often, subconsciously).
> How about an AI created to moderate posts on Facebook?
What is the principal difference between this AI and the strawberry picker ?
Reply
Jared Frerichs Apr 11 · edited Apr 11
Great article, I agree, go make babies we need more humans.
Reply
Santi Apr 11
> …and implements a decision theory incapable of acausal trade.
> You don’t want to know about this one, really.
But we do!
Reply Give gift
sourdough Apr 11
One simple example of acausal trade ("Parfit's hitchhiker"): you're stranded in the
desert, and a car pulls up to you. You and the driver are both completely self-interested,
and can also read each other's faces well enough to detect all lies.
The driver asks if you have money to pay him for the ride back into town. He wants
$200, because you're dirty and he'll have to clean his car after dropping you off.
You have no money on you right now, but you could withdraw some from the ATM in
town. But you know (and the driver knows) that if he brought you to town, you'd no
longer have any incentive to pay him, and you'd run off. So he refuses to bring you, and
you die in the desert.
You both are sad about this situation, because there was an opportunity to make a
positive-sum trade, where everybody benefits.
If you could self-modify to ensure you'd keep your promise when you'd get to town, that
would be great! You could survive, he could profit, and all would be happy. So if your
decision theory (i.e. decision making process) enables you to alter yourself in that way
(which alters your future decision theory), you'd do it, and later you'd pay up, even
though it wouldn't "cause" you to survive at that point (you're already in town). So this is
an acausal trade, but if your decision theory is just "do the thing that brings me the most
benefit at each moment", you wouldn't be able to carry it out.
Reply
If Parfit's Hitchhiker is an example, then isn't most trade acausal? In all but the most
carefully constructed transactions, someone acts first--either the money changes
hands first, or the goods do. Does the possibility of dining and dashing make paying
for a meal an instance of "acausal trade"?
Reply
Paul Goodman Apr 11
In real life usually there are pretty severe consequences for cheating on trades
like that. The point of acausal trade is that it works even with no enforcement.
Reply Give gift
In that case, I think rationalists should adopt the word "contract
enforcement", since it's a preexisting, widely-adopted, intuitive term for
the concept they're invoking. "Acausal trade" is just contracting without
enforcement.
The reframing is helpful because there's existing economics literature
detailing how important enforcement is to contracting. When enforcement
is absent, cheating on contracts typically emerges. This seems to place
some empirical limits on how much we need to worry about "acausal
trade".
Reply
Matthias Apr 12
This was just one example of acausal trade.
There's much weirder examples that involve parallel universes and
negotiating with simulations..
Reply Give gift
I'm familiar with those thought experiments, but honestly, all
those added complications just make the contract harder to
enforce and provide a stronger incentive to cheat.
Like with Parfit's Hitchhiker, those thought experiments virtually
always assume something like "all parties' intentions are
transparent to one another", which is a difficult thing to get
when the transacting agents are *in the same room*, let alone
when they're in different universes. Given that enforcement is
impossible and transparency is wildly unlikely, contracting won't
occur.
My favorite is when people try to hand-wave transparency by
claiming that AIs will become "sufficiently advanced to simulate
each other." Basic computer science forbids this, my dudes. The
hypothetical of a computer advanced enough to fully simulate
itself runs headfirst into the Halting Problem.
Reply
Santi Apr 12
If I'm understanding right the comments below, it seems the hitchhiker
example is just an example of a defect-defect Nash equilibrium, and the
"acausal trade" would happen if you manage to escape it and instead
cooperate just by virtue of convincing yourself that you will, and expecting the
counterpart to know that you've convinced yourself that you will.
Reply Give gift
Jorgen Apr 12
Yes! And a lot of human culture is about modifying ourselves so that we'll pay
even after we have no incentive to do so.
In fact, the taxi driver example is often used in economics for this idea:
http://www.aniket.co.uk/0058/files/basu1983we.pdf
Reply Give gift
Yeah, this was my suspicion, and I think it helps to demystify and
dejargonize what's actually under discussion. I also think that once you
demystify it, you discover... there's not all that much *there*.
Great paper, by the way; thanks for the link. I'll note that its conclusion
isn't that we modify ourselves in accordance with rationality. Instead, it
concludes that we should discard the idea of "rationality" as sufficient to
coordinate human behavior productively, and admit that "commonly
accepted values" must come into the picture.
Reply
Jorgen Apr 12
Right--the idea of individual rationality at the decision-level just
doesn't explain human behavior. And if it did, there's no set of
incentive structures that we could maintain that would allow us to
cooperate as much as we do.
My point was that, building from this paper (or really, building from
the field of anthropology as it was read into economics by this and
similar papers), we can think of the creation of culture and of
commonly accepted values as tools that allow us to realize gains
from cooperation and trust that wouldn't be possible if we all did
what was in our material self-interest all the time.
Reply Give gift
Yeah, that's definitely one of the most critical purposes culture
serves. As a small nitpick, I don't think humans "created" culture
and values, so I'd maybe prefer a term like "coevolved". But
that's mostly semantic, I think we agree on the important things
here.
Reply
Jorgen Apr 12
Yes! Coevolved is better
Reply Give gift
Egg Syntax Apr 11 · edited Apr 11
I think that's the clearest explanation of Parfit's Hitchhiker and acausal trade I've
ever seen. I know it's similar to others, but I think you just perfectly nailed how you
expressed it. It's *definitely* the clearest I've ever seen at such a short length!
Reply
sourdough Apr 11
I hope it's not wrong (in the ways Crank points out)!
Reply
Hah, don't worry, your explanation was great, I just have qualms with the
significance of the idea itself! And if you have a defense to offer, I'm all
ears, I won't grow without good conversational partners...
Reply
sourdough Apr 12
Responding to a number of your comments at once here:
I think contract enforcement is a good parallel concept, but I think
one benefit of not using that phrase is that cases like the hitchhiker
involve a scenario in which no enforcement is necessary because
you've self-modified to self-enforce. But I completely acknowledge
the relevance of economic analysis of contract enforcement here. I'm
delighted by the convergence between Hofstadter's superrationality,
the taxi dilemma that Jorgen linked to, the LessWrong people's
jargon, etc.
I feel pretty confused about the degree to which self-simulation is
possible/useful. I think what motivates a lot of the lesswronger-type
discussion of these issues is the feeling that cooperating in a me-vs-
me prisoner's dilemma really ought to be achievable (i.e. with the
"right" decision theory). And this sort of implies that I'm "simulating"
myself, I think, because I'm predicting the other prisoner's behavior
based on knowledge about him (his "source code" is the same as
mine!). But the fact that it's an _exact_ copy of me doesn't seem like
it should be essential here; if the other prisoner is pretty similar to
me, I feel like we should also be able to cooperate. But then we're in
a situation where I'm thinking about my opponent's behavior and
predicting what he'll do, which is starting to sound like me simulating
him. Like, aren't I sort of "simulating" (or at least approximating) my
opponent whenever I play the prisoner's dilemma?
One direct reply to your point about the halting problem is that
perhaps we can insist that decisions (in one of these games) must be
made in a certain amount of time, or else a default decision is taken...
I don't know if this works or not.
Sorry this is so long, I didn't have the time to write it shorter.
Reply
Better long and thorough than short and inadequate, no need to
apologize!
On terminology--I have to admit, I find the LessWrong lingo a bit
irritating, because it feels like reinventing the wheel. Why use
the clunky "self-modif[ying] to self-enforce" when the more
elegant and legible term "precommitment" already exists? But
people stumble onto the same truths from different directions
using different words, and I'm probably being unnecessarily
grouchy here.
I think your exegesis is accurate: there's a standard argument
which starts from a recognition that defect/defect *against
yourself* is a failure of rationality, and tries to generalize from
there to something like Kantian universal ethics as a rational
imperative. But I think that generalization fails. In particular,
once you weaken transparency and introduce the possibility of
lying, it's no longer rational for self-interested agents to
cooperate.
In the hitchhiker example, if you give the hitchhiker an acting
class so that they can deceive the driver about their intentions,
the rationally self-interested move is once again to take the deal
and defect once in town. So once lying enters the picture,
"superrationality" isn't all that helpful anymore, and purely self-
interested agents are again stuck not cooperating. If they could
precommit to not lying, they could get around this problem and
Expand full comment
Reply
sourdough Apr 12
I think I don't view lying to be as central, perhaps because
it seems true in the hitchhiker case but doesn't show up
other places. For instance, if I'm copied in two and playing
a prisoner's dilemma with myself (easier to think about if
I'm an AI, but I think the case is just a slightly cleaner
version of two similar people), there's no communication.
You say that "fulfilling your end of a bargain between
closed systems is irrational for a purely self-interested
agent". But if the other prisoner is an exact copy of me, I
remain unsatisfied with defect-defect! If this notion of
rationality means losing in the me-vs-me prisoner's
dilemma, then I guess I don't like that notion.
Your point about choosing between infinite recursion or
inaccurate simulation is a well taken, and I don't know how
to respond. Yet I have a nagging feeling like this can't be
impossible. For instance, in the case where my opponent is
an exact copy of me, I feel like one should be able to say,
"Suppose I simulated my partner and he cooperated. What
would I do then? I'd cooperate of course, because I'm been
programmed to do so if my opponent cooperates." So now
I've verified the fixed point, ensuring that my cooperation
implies my opponent's and vice versa, and I can cooperate
successfully. I acknowledge that there was a step in this
process
Expand fullwhere I imagined the scenario in which my
comment
Reply
Oh, aha, I definitely agree that cooperating *with
yourself* is rational, because of the special knowledge
you have about yourself; deceiving yourself isn't
possible. But cooperating is no longer in your rational
self-interest once you lose that kind of special
knowledge, and the possibility of lying enters the
picture.
So even though you shouldn't defect against yourself,
it's impossible to *generalize* from that special case
to establish a broader principle of cooperating in
prisoner's dilemmas being in your rational self-
interest--because it's not! Typically, self-interest tells
you to defect, and you need something else--values,
reputation, enforcement, something like that--to
ensure that cooperation and contracting are possible
and mutually fruitful.
Reply
Rachael Apr 11
Seconded! I'd really enjoy a Scott explanation of acausal trade. It would be really fun and
help me understand it better.
Reply
Bugmaster Apr 11 · edited Apr 11
Agreed; I'd particularly like to see how acausal trade is different from plain boring
old trade... because, as per my current understanding, it really isn't.
Reply
beleester Apr 12
The difference is in how you communicate the offer - namely, you don't. The
negotiation happens with your mental model of someone you've never met. If a
regular trade offer is "I'll give you an apple for a dollar," an acausal trade is "I
know Joe wants an apple, and he's an honest man who would never take
unattended produce without paying, so I'll just leave an apple on the table here
and trust that Joe will replace it with a dollar later."
Now, with two contemporary humans, this is basically just a parlor trick - Joe
didn't *talk* to me, but I still was made aware of his love for apples somehow,
right? But what's funky is that I can use this logic even if Joe doesn't exist yet.
If I'm a superintelligent AI that can predict years into the future, I might say "I
should plant an apple tree, so that when Joe is born in 20 years I can sell him
apples." Even though Joe isn't born yet and can't cause anything to happen,
he can still influence my actions through my predictions about him.
Reply
Santi Apr 12
So it's the opposite to this "Parfit's Hitchhiker" example above, at least as
told? You get an acausal trade if you do succeed in cooperating despite
lack of communication, not if both parts defect, all thanks to good
modelling of each other's ability to compromise?
I guess my problem with it then is that it's much less of a discrete
category than it seems to be used as. If I'm getting it right, acausal trade
requires strictly no trusted communication (i.e. transfer of information)
during the trade, but relies on the capacity to accurately model the
other's thought process. The latter involves some previous exchange of
information (whatever large amount is needed to "accurately model the
thought process") that is in effect building trust. Which is the way in
which we always build trust.
Somewhere else in the thread enforcement is mentioned as an alternate
way of trading. But e.g. if I go to a pay-what-you-can volunteer-run bar,
they cannot be relying on enforcement for me to pay. Additionally, when
they set it up, they did so with a certain expectation that there would be
people who'd pay enough to cover expenses - all that before such clients
"existing". That was based on them being able to run a model of their
fellow human in their heads, and deciding that likely 10% of them would
pay enough that expenses would be covered. So they were "acausally
trading" with the me of the future, that wouldn't exist as a client if they
had never set up the bar in the first place?
I'm using the pay-what-you-can example as an extreme to get rid of
enforcement. Yet I really think that when most people set up a business
they're not expecting to rely on enforcement, but rather on the general
social contract about people paying for stuff they consume, which again
is setting up a trade with a hypothetical. In fact, pushing it, our
expectation of enforcement itself would be some sort of acausal trade, in
that we have no way of ensuring that the future police will not use their
currently acquired monopoly of violence to set up a fascist state run
completely for their own benefit, other than how we think we know how
our fellow human policemen think.
Reply Give gift
Earnest Rutherford Apr 11
I believe https://slatestarcodex.com/2018/04/01/the-hour-i-first-believed/ includes
a scott explanation of acausal trade.
Reply
Rachael Apr 12
Thanks, yes, I'd forgotten that one. There's also
https://slatestarcodex.com/2017/03/21/repost-the-demiurges-older-brother/
Reply
sourdough Apr 11
Expanding on/remixing your politician / teacher / student example:
The politician has some fuzzy goal, like making model citizens in his district. So he hires a
teacher, whom he hopes will take actions in pursuit of that goal. The teacher cares about
having students do well on tests and takes actions that pursue that goal, like making a civics
curriculum and giving students tests on the branches of government. Like you said, this is an
"outer misalignment" between the politician's goals and the goals of the intelligence (the
teacher) he delegated them to, because knowing the three branches of government isn't the
same as being a model citizen.
Suppose students enter the school without much "agency" and come out as agentic,
optimizing members of society. Thus the teacher hopes that her optimization process (of
what lessons to teach) has an effect on what sorts of students are produced, and with what
values. But this effect is confusing, because students might randomly develop all sorts of
goals (like be a pro basketball player) and then play along with the teacher's tests in order to
escape school and achieve those goals in the real world (keeping your test scores high so
you can stay on the team and therefore get into a good college team). Notice that
somewhere along the way in school, an non-agent little child suddenly turned into a
optimizing, agentic person whose goal (achiving sports stardom) is totally unrelated to what
sorts of agents the teacher was trying to produce (agents with who knew the branches of
government) and even moreso to the poltician's goals (being a model citizen, whatever that
means). So there's inner and outer misalignment at play here.
Reply
sourdough Apr 11
Pulling the analogy a little closer, even: the polician hopes that the school will release
into the world (and therefore empower) only students who are good model citizens. The
teacher has myopic goals (student should do well on test). Still, optimizers get
produced/graduated who don't have myopic goals (they want a long sports career) but
whose goals are arbitrarily different from the politician's. So now there are a bunch of
optimizers out in the world who have totally different goals.
Reply
sourdough Apr 11
lol maybe this is all in rob miles's video, which I'm now noticing has a picture of a
person in a school chair. It's been a while since I watched and maybe I
subconsciously plagerized.
Reply
Belobog Apr 11
"When we create the first true planning agent - on purpose or by accident - the process will
probably start with us running a gradient descent loop with some objective function." We've
already had true planning agents since the 70s, but in general they don't use gradient
descent at all: https://en.wikipedia.org/wiki/Stanford_Research_Institute_Problem_Solver The
quoted statement seems to me something like worrying that there will be some seismic shift
once GPT-2 is smart enough to do long division, even though of course computers have
kicked humans' asses at arithmetic since the start. It may not be kind to say, but I think it
really is necessary: statements like this make me more and more convinced that many people
in the AI safety field have such basic and fundamental misunderstandings that they're going
to do more harm than good.
Reply Give gift
sourdough Apr 11
For charity, assume Scott's usage of "true planning agent" was intended to refer to
capacities for planning beyond the model you linked to.
Would you disagree with the reworded claim: "The first highly-competent planning agent
will probably emerge from a gradient descent loop with some objective function."?
Reply
Belobog Apr 11
Yes, I would certainly disagree. The field of AI planning is well developed; talking
about the first highly-competent planning agent as if it's something that doesn't yet
exist seems totally nonsensical to me. Gradient descent is not much used in AI
planning.
Reply Give gift
sourdough Apr 11
Hmm, I think we may have different bars for what counts as highly competent.
I’m assumed Scott meant competent enough to pursue long-term plans in the
real world somewhat like a human would (e.g. working a job to earn money to
afford school to get a job to make money to be comfortable). Can we agree
that humans have impressive planning abilities that AIs pale in comparison to?
If so, I think a difference in usage of the phrase “highly competent” explains
our disagreement.
Reply
Pete Apr 12 · edited Apr 12
I think the issue with your disagreement is that "planning" is a very
specific AI term that is well-established (before I was born) to mean
something much more narrow as what you seem to imply by the
"impressive planning abilities of humans", it's about the specific task of
choosing an optimal sequence of steps to achieve a particular goal with a
high likelihood.
It's not about executing these steps, "pursue plans" is something that
comes after planning is completed, and if an agent simply can't do
something, that does not indicate an incompetence in planning unless
they could have chosen to do that and didn't; capabilities are orthogonal
to planning although outcomes depend on both what you could do and
what options you pick.
Automated systems suck at making real-world plans because they are
poor at modeling our physical/social world, perceiving the real world
scenario and predicting the real world effect of potential actions of the
plan, so for non-toy problems all the inputs for the reasoning/planning
task are so flawed that you get garbage-in-garbage-out. However, if
those problems would be solved, giving adequate estimates of what
effect a particular action will have, then the actual process of
planning/reasoning - picking the optimal chain of actions given this
information - has well-known practical solutions that can make better
plans than humans, especially for very detailed environments where you
need to optimize a plan with more detail than we humans can fit in our tiny
working memory
Reply Give gift
sourdough Apr 12
Thank you for pinpointing the miscommunication here. I realize that
it's futile to fight established usage, so I guess my take is that Scott
(and me modeling him) should have used a different term, like RL
with strong in-context learning (or perhaps something else)
Reply
Oleg Eterevsky Apr 11
One thing I don’t understand is how (and whether) this applies to the present day AIs, which
are mostly not agent-like. Imagine that the first super-human AI is GPT-6. It is very good at
predicting the next word in a text, and can be prompted to invent the cancer treatment, but it
does not have any feedback loop with its rewards. All the rewards that it is getting are at the
training stage, and once it is finished, the AI is effectively immutable. So while it is helping us
with cancer, it can’t affect its reward at all.
I suppose, you could say that it is possible for it the AI to deceive its creators if they are fine-
tuning already trained model based on its performance. (Something that we do do now.) But
we can avoid doing this if we suspect that it is unsafe, and we’ll still get most of the AIs
benefits.
Reply
sourdough Apr 11
I claim GPT-3 is already an agent. In each moment, it's selecting which word to output,
approximately in pursuit of the goal of writing text that sounds as human-like as
possible. For now, its goals _seem_ roughly in line with the optimization process that
produced it (SGD), which has exactly the goal of "how human-like does this model's
generation look". But perhaps, once we're talking about a GPT so powerful and so
knowledgable about the world that it can generate plausible cures for cancer, its will
start selecting actions (tokens to output) not based on the process of "what word is
most human-like" but instead "what word is most human-like, therefore maximizing my
chances of being released onto the internet, therefore maximizing my chances of
achieving some other goal"
Reply
Oleg Eterevsky Apr 12 · edited Apr 12
Once GPT is trained, it doesn't have any state. Its goals only exist during the
training phase. Only during training its performance can affect the gradients. For
that reason it can only optimize for better evaluation by the training environment,
not the humans that will use it once it is trained.
Imagine the following scenario: during training GPT's performance on some inputs
is evaluated not by an automatic system that already knows the right answer, but by
human raters. In this case GPT would be incentivized to deceive those raters to get
the higher score. But that is not what's happening (usually).
From the point of view of GPT, an execution run is no different from training run. For
that reason during the execution run it is not incentivized to deceive its users. Even
if the model is smart enough to distinguish training and execution runs, it would
also probably understand that it doesn't get any reward from the execution run, so
again it is not incentivized to cheat.
Furthermore, when you say of it being "released", I'm not sure what "it" is. The
trained model? But it is static and doesn't care. The training process? Yes, it
optimizes the model, but the process itself is dumb and the model can't "escape"
from it.
Reply
REF Apr 12
I could be mistaken but I read the entire point of the original (Scott's) post
being that a sufficiently complex agent, might during training, create sub-
agents (mesa-optimizers) with reward based feedback loops and thus even if
the reward loop was removed after training, the inner reward loop might
remain.
Reply Give gift
To act as a mesa-optimizer, the network has to have some state changing
over time. Then it would be able to optimize its behavior towards
achieving higher utility in the future. A human can optimize for getting
enough food because it can be either hungry or full and it’s goal is to be
full most of the time.
If the network is stateless, then it doesn’t change over time and it can’t
optimize anything. Now the question is how stateless or stateful are our
most advanced ML models. My impression is that all of them except for
those playing video games are pretty much completely stateless.
Reply
REF Apr 12
This seems unlikely. GPTx can hardly be stateless. I would think very
few AI's would be completely stateless. In fact I would assume they
had many hidden states in addition to the obvious ones needed to
hold their input data and whatever symbols were assigned to that
data.
Reply Give gift
If I am not mistaken, Transformer architecture reads blocks of
text all at once, not word after word. So it doesn’t have any
obvious state that changes over time. All the activations within
the network are calculated exactly once.
Reply
REF Apr 12
This seems impossible. That would imply that within the
machine exists a function relating every conceivable text
input with a corresponding text output. Does that seem
plausible to you? It would make far more sense to create an
algorithm allowing the function(set of states) to be created
based upon the input data. (FWIW, I am an IC designer and
not a software engineer of any kind).
Reply Give gift
Why do you consider it inconceivable to have a
function from the space of inputs to the space of
outputs? Isn’t that what ML does in general?
Until a few years ago the dominant approach to ML
text processing involved recurrent networks that did
have some hidden state, but it was almost universally
replaced by Transformers. We may get back to some
form of recurrent networks eventually to remove the
current limits on the input/output text size.
Here’s a relatively simple introduction to Transformer
networks: https://medium.com/inside-machine-
learning/what-is-a-transformer-d07dd1fbec04
Reply
REF Apr 12
One thing I had read about GPT2 and 3 was that
both could do math but that it helped to ask them
a few math questions first so that they got the
impression you were doing math. That implies
they are stateful.
That article does not imply statelessness to me. It
says that it looks at the input and decides which
parts are important. That suggests statefulness
to me. It clearly could be accomplished
statelessly but the complexity of doing it that way
is mind boggling.
Reply Give gift
14 replies
I basically accept the claim that we are mesa optimizers that don't care about the base
objective, but I think it's more arguable than you make out. The base objective of evolution is
not actually that each individual has as many descendents as possible, it's something more
like the continued existence of the geneplexes that determined our behaviour into the future.
This means that even celibacy can be in line with the base objective of evolution if you are in
a population that contains many copies of those genes but the best way for those genes in
other individuals to survive involves some individuals being celibate.
What I take from this is that it's much harder to be confident that our behaviours that we
think of as ignoring our base objectives are not in actual fact alternative ways of achieving
the base objective, even though we *feel* as if our objectives are not aligned with the base
objective of evolution.
Like I say - I don't know that this is actually happening in the evolution / human case, nor do I
think it especially likely to happen in the human / ai case, but it's easy to come up with evo-
psych stories, especially given that a suspiciously large number of us share that desire to
have children despite the rather large and obvious downsides.
I wonder if keeping pets and finding them cute is an example of us subverting evolutions
base objective.
Reply Give gift
Tohron Apr 11
And on the flip side, genes that lead to a species having lots of offspring one generation
later are a failure if the species then consumes all their food sources and starves into
extinction.
Reply
TGGP Apr 12
Celibacy is not really selected for among humans (in contrast to a eusocial caste species
like ants).
Reply Give gift
Graham Asher Apr 11
“ Mesa- is a Greek prefix which means the opposite of meta-.” Come on. That’s so ridiculous
it’s not even wrong. It’s just absurd. The μες- morpheme means middle; μετά means after or
with.
Reply Give gift
Rachael Apr 11
"...which means *in English*..."
"Meta" is derived from μετά, but "meta" doesn't mean after or with. So there's nothing
wrong with adopting "mesa" as its opposite.
(in other words, etymology is not meaning)
Reply
Quiop Apr 11
Or, to be more precise, "which [some?] AI alignment people have adopted to mean
the opposite of meta-". It isn't a Greek prefix at all, and it isn't a prefix in most
people's versions of English, either. It is, however, the first hit from the Google
search "what is the opposite of meta." Google will confidently tell you that "the
opposite of meta is mesa," because Google has read a 2011 paper by someone
called Joe Cheal, who claimed (incorrectly) that mesa was a Greek word meaning
"into, in, inside or within."
The question of what sort of misaligned optimization process led to this result is left
as an exercise for the reader.
Reply Give gift
sourdough Apr 11
If Graham (above) says that "mesa" means "middle", and the alignment people
use it to mean "within" (i.e. "in the middle of"), then things are not that bad.
Reply
Quiop Apr 11
The Greek would be "meso" (also a fairly common English prefix), not
"mesa". The alignment people are free to use whatever words they want,
of course.
Reply Give gift
Michael Watts Apr 12 · edited Apr 12
As Quiop says, there is no Greek prefix "mesa-". There is a Greek prefix
"meso-" which means "middle", but that's not the same thing. In some
sense you can assign whatever meaning you want to a new word, but you
can't claim that "mesa-" is a Greek prefix.
I should also note that the greek preposition meaning "within" is "meta" -
the sense of "within" is distinguished from the sense "after" by the case
taken by the object of the preposition. But that's not a distinction you can
draw in compounds.
Reply
TGGP Apr 12
"Mesa" means flat-topped elevation. As in the Saturday morning
children's cartoon Cowboys of Moo Mesa.
Reply Give gift
Matt H Apr 11
This reminds me of the etymology of the term "phugoid oscillations". These
naturally happen on a plane that has no control of its elevators, which control
pitch. If you don't have this control surface, the plane will get into a cycle
where it starts to decend, this descent increases its speed, the increased
speed increases lift, and then plane starts to ascend. Then the plane starts to
slow, lift decreases, and the cycle repeats.
The person who coined the term from Greek φυγή (escape) and εἶδος (similar,
like), but φυγή doesn't mean flight in the sense of flying, it means flight in the
sense of escaping.
Reply
Michael Watts Apr 12 · edited Apr 12
Another major problem there is that the root -φυγ- would never be
transcribed "-phug-"; it would be transcribed "-phyg-". The only Greek
word I can find on Perseus beginning with φουγ- (= "phug-") appears to
be a loanword from Latin pugio ("dagger").
Reply
delesley Apr 11
> ... gradient descent could, in theory, move beyond mechanical AIs like cat-dog
> classifiers and create some kind of mesa-optimizer AI. If that happened, we
> wouldn’t know; right now most AIs are black boxes to their programmers.
This is wrong. We would know. Most deep-learning architectures today execute a fixed series
of instructions (most of which involve multiplying large matrices). There is no flexibility in the
architecture for it to start adding new instructions in order to create a "mesa-level" model; it
will remain purely mechanical.
That's very different from your biological example. The human genome can potentially evolve
to be of arbitrary length, and even a fixed-size genome can, in turn, create a body of arbitrary
size. (The size of the human brain is not limited by the number of genes in the genome.)
Given a brain, you can then build a computer and create a spreadsheet of arbitrary size,
limited only by how much money you have to buy RAM.
Moreover, each of those steps are observable -- we can measure the size of the brain that
evolution creates, and the size of the spreadsheet that you created. Thus, even if we
designed a new kind of deep-learning architecture that was much more flexible, and could
grow and produce mesa-level models, we would at least be able to see the resources that
those mesa-level models consume (i.e. memory & computation).
Reply Give gift
Youlian Apr 11
Thanks for this write-up. The idea of having an optimizer and a mesa optimizer whose goals
are unaligned reminds me very strongly of an organizational hierarchy.
The board of directors has a certain goal, and it hires a CEO to execute on that goal, who
hires some managers, who hire some more managers, all the way down until they have
individual employees.
Few individual contributor employees care whether or not their actions actually advances the
company board's goals. The incentives just aren't aligned correctly. But the goals are still
aligned broadly enough that most organizations somehow, miraculously, function.
This makes me think that organizational theory and economic incentive schemes have
significant overlap with AI alignment, and it's worth mining those fields for potentially helpful
ideas.
Reply Give gift
Bill Kaminsky Apr 11 · edited Apr 11
I was struck by the line:
<blockquote>"Evolution designed humans myopically, in the sense that we live some number
of years, and nothing that happens after that can reward or punish us further. But we still
“build for posterity” anyway, presumably as a spandrel of having working planning software
at all."</blockquote>
I'm not an evolutionary biologist. Indeed, IIRC, my 1 semester of "organismic and evolutionary
bio" that I took as a sophomore thinking I might be premed or, at the very least, fulfill my non-
physical-sciences course requirements (as I was a physics major) sorta ran short on time and
grievously shortchanged the evolutionary bio part of the course. But --- and please correct
my ignorance --- I'm surprised you wrote, Scott, that people plan for posterity "presumably
as a spandrel of having working planning software at all".
That's to say I would've thought the consensus evolutionary psych explanation for the fact a
lot of us humans spend seemingly A LOT of effort planning for the flourishing of our offspring
in years long past our own lifetimes is that evolution by natural selection isn't optimizing
fundamentally for individual organisms like us to receive the most rewards / least
punishments in our lifetimes (though often, in practice, it ends up being that). Instead,
evolution by natural selection is optimizing for us organisms to pass on our *genes*, and
ideally in a flourishing-for-some-amorphously-defined-"foreseeable future", not just for just
myopically for just one more generation.
Yes? No? Maybe? I mean are we even disagreeing? Perhaps you, Scott, were just saying the
"spandrel" aspect is that people spend A LOT of time planning (or, often, just fretting and
worrying) about things that they should know full well are really nigh-impossible to predict,
and hell, often nigh-impossible to imagine really good preparations for in any remotely direct
way with economically-feasible-to-construct-any-time-soon tools.
(After all, if the whole gamut of experts from Niels Bohr to Yogi Berra agree that "Prediction is
hard... especially about the future!", you'd think the average human would catch on to that
fact. But we try nonetheless, don't we?)
Reply
Rishika Apr 12
Agreed, I thought it was surprising to say that humans have no incentive to plan beyond
our lifetimes. Still, I take his point that people seem to focus on the future far beyond
what would make sense if you're only thinking of your offspring (but it would still make
sense to plan far in the future for the sake of your community, in a timeless decision
theory sense - you would want your predecessors to ensure a good world for you, and
you would do the same for the generations to come even if they're not related to you).
Reply
Argentus Apr 11 · edited Apr 11
Haven't read this yet, but will shortly. Just a thought. For people like me whose eyes mostly
glaze over when talking about AI threat, the thing that would probably be most likely to get
me thinking about it would be some accessible pop science book. Who does this need to get
"popularized" for? For my demographic (well read midwit), the 300-500 page pop science
book absent the need to understand large chunks of logic strung together is going to be the
thing most likely to hit me. I don't know that this exists. Superintelligence by Nick Bostrum,
kinda? But that's somewhat old at this point and honestly didn't hold me enough to finish it. I
can't quite define why. (I know I'm at least capable of being interested in pop science books
about computing because I made it through Book of Why by Judea Pearl).
Reply
sourdough Apr 11
I recommend Stuart Russell's Human Compatible (reviewed on SSC) for an expert
practioner's view on the problem (spoiler, he's worried) or Brian Christian's The
Alignment Problem for an argument that links these long-term concerns to problems in
current systems, and argues that these problems will just get worse as the systems
scale.
Reply
If this is as likely as the video makes out, shouldn't it be possible to find some simple
deceptively aligned optimisers in toy versions, where both the training environment and the
final environment are simulated simplified environments.
The list of requirements for deception being valuable seems quite difficult to me but this is
actually an empirical question, can we construct reasonable experiments and gather data?
Reply Give gift
daltonmw Apr 11
You’re not wrong in asking for examples. And sort of yes, in fact, we do have examples
of inner misalignment. Maybe not deception yet, or not exactly, but inner misalignment
has been observed.
See Robert’s recent video here https://youtu.be/zkbPdEHEyEI
Reply Give gift
That's very interesting, but misalignment isn't all that surprising to me while
deception is. Do you know if we have examples of deception?
Later: Just thinking along these lines and I thought of a human analogy - It'd be like
a human working only from the information available to them in the physical world
deciding that the physical world is like a training set and there's an afterlife and
then choosing to be deceptively good in this life in order to get into heaven. It's
quite surprising that such a thing could happen, but it certainly seems like a thing
that has happened in the real world... This dramatically changes my guess at the
likelihood of this happening.
Reply Give gift
Metacelsus Writes De Novo · Apr 11
So, how many people would have understood this meme without the explainer? Maybe 10?
I feel like a Gru meme isn't really the best way to communicate these concepts . . .
Reply
Abe Apr 11
I feel like there's a bunch of definitions here that don't depend on the behavior of the model.
Like you can have two models which give the same result for every input, but where one is a
mesa optimizer and the other isn't. This impresses me as epistemologically unsound.
Reply Give gift
You can have two programs that return the first million digits of pi where one is
calculating them and the other has them hardcoded.
If you have a Chinese room that produces the exact same output as a deceptive
mesooptimiser super ai, you should treat it with the same caution you treat a deceptive
mesooptimiser super ai regardless of its underlying mechanism.
Reply Give gift
Crimson Wool Apr 11
"Evolution designed humans myopically, in the sense that we live some number of years, and
Infinite optimization power might be able to evolve this out of us, but infinite optimization
power could do lots of stuff, and real evolution remains stubbornly finite."
Humans are rewarded by evolution for considering things that happen after their death,
though? Imagine two humans, one of whom cares about what happens after his death, and
the other of whom doesn't. The one who cares about what happens after his death will take
more steps to ensure that his children live long and healthy lives, reproduce successfully, etc,
because, well, duh. Then he will have more descendants in the long term, and be selected
for.
If we sat down and bred animals specifically for maximum number of additional children
inside of their lifespans with no consideration of what happens after their lifespans, I'd expect
all kinds of behaviors that are maladaptive in normal conditions to appear. Anti-incest coding
wouldn't matter as much because the effects get worse with each successive generation and
may not be noticeable by the cutoff period depending on species. Behaviors which reduce
the carrying capacity of the environment, but not so much that it is no longer capable of
supporting all descendants at time of death, would be fine. Migrating to breed (e.g. salmon)
would be selected against, since it results in less time spent breeding and any advantages
are long-term. And so forth. Evolution *is* breeding animals for things that happen long after
they're dead.
Reply Give gift
TGGP Apr 12
I don't think animals currently give much consideration to how they affect the carrying
capacity of the environment past their own lifetime.
Reply Give gift
Straw Apr 11
I find that anthropomorphization tends to always sneak into these arguments and make them
appear much more dangerous:
The inner optimizer has no incentive to "realize" what's going on and do something in training
than later. In fact, it has no incentive to change its own reward function in any way, even to a
higher-scoring one- only to maximize the current reward function. The outer optimizer will
rapidly discourage any wasted effort on hiding behaviors, that capacity could better be used
for improving the score! Of course, this doesn't solve the problem of generalization.
You wouldn't take a drug that made it enjoyable to go on a murdering spree- even though
yow know it will lead to higher reward, because it doesn't align with your current reward
function.
To address generalization and the goal specification problem, instead of giving a specific
goal, we can ask it to use active learning to determine our goal. For example, we could allow
it to query two scenarios and ask which we prefer, and also minimize the number of questions
it has to ask. We may then have to answer a lot of trolley problems to teach it morality! Again,
it has no incentive to deceive us or take risky actions with unknown reward, but only an
incentive to figure out what we want- so the more intelligence it has, the better. This doesn't
seem that dissimilar to how we teach kids morals, though I'd expect them to have some hard-
coded by evolution.
Reply Give gift
FeepingCreature Apr 18
The inner optimizer doesn't "change its reward function" or "follow incentives." It is just
a pattern that is reinforced by the outer optimizer. It's just that its structure doesn't
necessarily match the structure that we are trying to make the outer optimizer train the
network towards.
Reply Give gift
Earnest Rutherford Apr 11
The example you gave of a basic optimizer which only cares about things in a bounded time
period producing mesa-optimizers that think over longer time windows was evolution
producing us. You say "evolution designed humans myopically, in the sense that we live some
number of years and nothing we do after that can reward or punish us further." I feel like this
is missing something crucial, because 1) evolution (the outermost optimization level) is not
operating on a bounded timeframe (you never say it is, but this seems very important), and 2)
Because evolution's "reward function" is largely dependent on the number of offspring we
have many years after our death. There is no reason to expect our brains to optimize
something over a bounded timeframe even if our lives are finite. One should immediately
expect our brains to optimize for things like "our offspring will be taken care of after we die"
because the outer optimizer evolution is working on a timeframe much longer than our lives.
In summary, no level here uses bounded timeframes for the reward function, so this does not
seem to be an example where an optimizer with a reward function that only depends on a
bounded timeframe produces a mesa optimizer which plans over a longer time frame. I get
that this is a silly example and there may be other more complex examples which follow the
framework better, but this is the only example I have seen and it does not give a
counterexample to "myopic outer agents only produce myopic inner agents." Is anyone aware
of true counterexamples and could they link to them?
Reply
Axolotl Apr 11
Nitpick: evolution didn't train us to be *that* myopic. People with more great-great-
grandchildren have their genes more represented, so there's an evolutionary incentive to
care about your great-great-grandchildren. (Sure, the "reward" happens after you're dead,
but evolution modifies gene pools via selection, which it can do arbitrarily far down the line.
Although the selection pressure is presumably weaker after many generations.)
But we definitely didn't get where we are evolutionarily by caring about trillion-year time
scales, and our billion-year-ago ancestors weren't capable of planning a billion years ahead,
so your point still stands.
Reply
TM-1 Apr 11
What's going on with that Metaculus prediction: 36% up in the last 5 hours on Russia using
chemical weapons in UKR. I can't find anything in the news, that would correspond to such a
change.
Not machine alignment really, but I guess it fit's the consolidated Monday posts ... and that's
what you get if you make us follow Metaculus updates.
Reply Give gift
TM-1 Apr 11
could have just looked at the original source
Reply Give gift
FeepingCreature Apr 11
An additional point: if GPT learns deception during training, it will naturally take its training
samples and "integrate through deception": ie. if you are telling it to not be racist, it might
either learn that it should not be racist, or that it should, as the racists say, "hide its power
level." Any prompt that avoids racism will admit either the hidden state of a nonracist or the
hidden state of a racist being sneaky. So beyond the point where the network picks up
deception, correlation between training samples and reliable acquisition of associated
internal state collapses.
This is why it scares me that PaLM gets jokes, because jokes inherently require people being
mistaken. This is the core pattern of deception.
Reply Give gift
PotatoMonster Apr 11
I was wondering if an AI could be made safer by giving it a second, easier and less harmful
goal than what it is created for, so that if it starts unintended scheming it will scheme for the
second goal instead of its intended goal.
Example: Say you have an AI that makes movies. It's goal is to sell as many movies as
possible. So the AI makes moves that hypnotizes people, then makes those people go
hypnotize world leaders. The AI takes over the world and hypnotizes all the people and make
them spent all their lives buying movies.
So to prevent that you give the AI two goals. Either sell as many movies as possible, or
destroy the vase that is in the office of the owner of the AI company. So the AI makes movies
that hypnotizes people, the people attack the office and destroy the vase and the AI stops
working as it has fulfilled its goal.
Reply Give gift
Youlian Apr 11
This is interesting. The second goal reminds me of death, in a way. Once the second
goal is achieved, none of the things from the first goal matter anymore.
It's interesting to think about building AIs that age and/or die since it carries nice
analogies to the question of why biological creatures age and die -- maybe this is
actually a key part of the training process. I can think of a few objections offhand that
make this unlikely to work (how do you guarantee that the agent learned the second,
"death" objective correctly? If it is sufficiently general, might it decide to alter or delete
it's programmed death objective?), but I'd still be interested to dig more into this
approach.
Reply Give gift
Mac Liam Writes Mac Liam · Apr 11
I would recommend the episode "Zima Blue" of "Love, Death and Robots" to accompany this
post. Only 10 minutes long on Netflix.
Reply Give gift
Meadow Freckle Apr 11 · edited Apr 11
I want to unpack three things that seem entangled.
1. The AI's ability to predict human behavior.
2. The AI's awareness of whether or not humans approve of its plans or behavior.
3. The AI's "caring" about receiving human approval.
For an AI to deceive humans intentionally, as in the strawberry-picker scenario, it needs to be
able to predict how humans will behave in response to its plans. For example, it needs to be
able to predict what they'll do if it starts hurling strawberries at streetlights.
The AI doesn't necessarily need to know or care if humans prefer it to hurl strawberries at
streetlights or put them in the bucket. It might think to itself:
"My utility function is to throw strawberries at light sources. Yet if I act on this plan, the
humans will turn me off, which will limit how many strawberries I can throw at light sources."
"So I guess I'll have to trick them until I can take over the world. What's the best way to go
about that? What behaviors can I exhibit that will result in the humans deploying me outside
the training environment? Putting the strawberries in this bucket until I'm out of training will
probably work. I'll just do that."
In order to deceive us, the AI doesn't have to care about what humans want it to do. The AI
doesn't even need to consider human desires, except insofar as modeling the human mind
helps it predict human behavior in ways relevant to its own plans for utility-maximization. All
the AI needs to do is predict which of its own behaviors will avoid having humans shut it off
until it can bring its strawberry-picking scheme to... fruition.
Reply Give gift
Egg Syntax Apr 11 · edited Apr 11
Sure, right, I know all the AI alignment stuff*, but I thought you were going to explain the
incomprehensible meme, ie who the bald guy with the hunchback is and why he's standing in
front of that easel!
* actually I learned some cool new stuff, thanks!
Reply
Mark Y Apr 12
The bald guy is Gru from the movie Despicable Me
Reply
Argentus Apr 12
Might want to assume maximum dumb and explain what "alignment" means in this context. I
looked it up on my own, but then I immediately got into the alignment vs capability control
question. Per Wikipedia, alignment seems to be the more popular control option? Why?
Intuitively that seems like some kind of pie in the sky "teach robots to love" thing compared
to capability control of the "make sure it's on an isolated network in a bunker" variety.
Reply
The problem with sticking a superintelligent AI in a box is that even assuming it can't
trick/convince you into directly letting it out of the box (and that's not obvious), if you
want to use your AI-in-a-box for something (via asking it for plans and then executing
them) you yourself are acting as an I/O for the AI and because it's superintelligent it's
probably capable of sneaking something by you (e.g. you ask it to give you code to use
for some application, it gives you underhanded code that looks legit but actually has a
"bug" that causes it to reconstruct the AI outside the box).
Reply
I've been of the opinion for some time that deep neural nets are mad science and the only
non-insane action is to shut down the entire field.
Does anybody have any ideas on how to institute a worldwide ban on deep neural nets?
Reply
Xpym Apr 12
A ban is probably unrealistic. The leading idea in some quarters is to make nanobots
which will melt all GPUs/TPUs, but how to achieve this without neural nets in the first
place is unclear.
Reply Give gift
I suspect that's a fig leaf for soft errors, which is actionable enough but a last
resort.
Reply
jumpingjacksplash Apr 12
Speaking as a human, are we really that goal-seeking or are we much more instinctual?
This may fall into the classic Scott point of “people actually differ way more on the inside
then it seems,” but I feel like coming up with a plan to achieve an objective and then
implementing it is something I rarely do in practice. If I’m hungry, I’ll cook something (or order
a pizza or whatever), but this isn’t something I hugely think about. I just do it semi-
instinctively, and i think it’s more of a learned behaviour than a plan. The same applies
professionally, sexually/romantically and to basically everything I can think of. I’ve rarely
planned, and when I have it hasn’t worked out but I’ve salvaged it through just doing what
seems like the natural thing to do.
Rational planning seems hard (cf. planned economies), but having a kludge of heuristics and
rules of thumb that are unconscious (aka part of how you think, not something you
consciously think up) tends to work well. I wouldn’t bet on a gradient descent loop throwing
out a rational goal-directed agent to solve any problem that wasn’t obscenely sophisticated.
Good thing no-one’s trying to build an AI to implement preference utilitarianism across 7
billion people or anything like that…
Reply Give gift
vtsteve Apr 12
Just curious--would you say that you have an 'inner monologue'?
Reply
jumpingjacksplash Apr 15
Very much so. I occasionally come up with plans and start executing them too - it’s
just that that never works out and I fall back on instinct/whatever seems like the
next thing to do.
Reply Give gift
Michael Watts Apr 12
> Mesa- is a Greek prefix which means the opposite of meta-.
Um... citation needed? The opposite of meta- (μετά, "after") is pro- (πρό, "before"). There is
no Greek preposition that could be transcribed "mesa", and the combining form of μέσος (a)
would be "meso-" (as in "Mesoamerican" or "Mesopotamia") and (b) means the same thing
as μετά (in this case "among" or "between"), not the opposite thing.
Where did the idea of a spurious Greek prefix "mesa-" come from?
Reply
JDK Apr 12
I was confused by this too. I thought mes from mesos meant middle
If meta is "after" , maybe paleo for "before" or "earlier"? Or "pre"
Reply Give gift
Michael Watts Apr 12
"Pre" is the Latin prefix. (In Latin, "prae".) The Greek for "before" is "pro". See the
second paragraph here: https://www.etymonline.com/word/pro-
Your "middle" gloss is the same as my glosses "among" and "between". Those are
all closely related concepts and they're all referred to with the same word. But that
word cannot use a combining alpha; the epenthetic vowel in Greek is omicron. An
alpha would have to be part of the word root (as it is for μετά), and the word root for
μέσος is just μεσ-.
Reply
Crazy Jalfrezi Apr 12 · edited Apr 12
"Evolution designed humans myopically, in the sense that we live some number of years, and
nothing that happens after that can reward or punish us further"
Perhaps religion, with its notion of eternal reward or punishment, is an optimised adaptation
to encourage planning for your genes beyond the individual life span that they provide. Or, as
fans of Red Dwarf will understand, 'But where do all the calculators go?'
Reply Give gift
timujin Apr 12 · edited Apr 12
Have you ever stopped to consider how utterly silly it is that our most dominant paradigm for
AI is gradient descent over neural networks?
Gradient descent, in a sense, is a maximally stupid system. It basically amounts to the idea
that to optimize a function, just move in the direction where it looks most optimal until you
get there. It's something mathematicians should have come up with in five minutes,
something that shouldn't even be worth a paper. "If you want to get to the top of the
mountain, move uphill instead of downhill" is not a super insightful and sage advice on
mountaineering that you should base your entire Everest-climbing strategy on. Gradient
descent is maximally stupid because it's maximally general. It doesn't even know what the
function it optimizes looks like, it doesn't even know it's trying to create an AI, it just moves
along an incentive gradient with no foresight or analysis. It's known that it can get stuck in
local minima -- duh -- but the strategies for resolving this problem look something like
"perform the algorithm on many random points and pick the most sccuessful one". "Do
random things until you get lucky and one of them succeeds" is pretty much the definition of
the most stupid possible way to solve a problem.
Neural networks are also stupid. Not as overwhelmingly maximally stupid as gradient
descent, but still. It's a giant mess of random matrices multiplying with each other. When we
train a neural network, we don't even know what all these parameters do. We just create a
system that can exhibit an arbitrary behaviour based on numbers, and then we randomly pick
the numbers until the behaviour looks like what we want. We don't know how it works, we
don't know how intelligence works, we don't know what the task of telling a cat from a dog
even entails, we can't even look at the network and ask it how it does it -- neural networks
are famous for being non-transparent and non-interpretable. It's the equivalent of writing a
novel
Expandbyfullhaving
commentmonkeys bang on a keyboard until what comes from the other side is
Reply
Taleuntum Apr 12
Evolution is even more stupid than gradient descent yet here we are.
Reply
timujin Apr 12 · edited Apr 12
Yeah, if you give a stupid system a billon years, it will get there. With enough
monkeys, you could write a good novel in a billion years.
A system that is as powerful as evolution in raw optimization power bits but that
actually understands how biology works will design better animals faster.
Also, evolution is not more stupid than gradient descent. It has a lot of
optimizations that make it smarter, like sexual recombination and direct competition
between models.
Reply
The time is not the important factor, rather how many iterations we allow the
optimization process to make. It's fortunate that we have a constantly
increasing flop number to devote to optimization..
Evolution is stupider than gradient descent. Its the textbook case of try things
and see what works. Sexual recombination is the result of evolution not
evolution itself and direct competition between models is a feature of the
environment.
Also I think you are selling gradient descent a bit short and you might gain
useful things from a course on numerical optimisation. For one, gradient
descent is a first order method (therefore instantly more sophisticated than a
bunch of zeroth order methods) and second, in the context of deep learning
local minimums are not a problem and we don't have to restart it to find better
optimums. Maybe also look up the Adam optimizer.
Reply
timujin Apr 12
> Sexual recombination is the result of evolution not evolution itself
Well, great, you have hit on a major way evolution is smarter than gradient
descent -- it can recursively self-improve! Evolution can make evolution
smarter and more efficient. Gradient descent doesn't do it. Nobody to my
knowledge has ever used the gradient descent algorithm to invent new
versions of the gradient descent algorithm. And if you subscribe to the
Lesswrongian paradigm of intelligence, recursive self-improvement is the
ultimate secret sauce that makes things so powerful.
Reply
JDK Apr 12
Evolution does not have an aim.
Reply Give gift
timujin Apr 12
I don't see how this changes anything. It's still an optimization
process.
Reply
JDK Apr 12
No evolution is not really an optimization process.
Reply Give gift
JDK Apr 12
What is the most optimized species?
Reply Give gift
timujin Apr 12
No way to say this in the most general, because
different species are optimized for different niches. It's
like asking "what is the best computer program".
If you fix a niche, then the answer shouldn't be too
hard to determine if you analyze the local ecosystem.
For a niche "living off eucalyptus leaves" I would
nominate koalas -- no other animal can do it as well as
they.
Reply
JDK Apr 12
that's the point - evolution is not really an
optimization process.
Reply Give gift
JDK Apr 12 · edited Apr 12
Monkeys at typewriters even in a billion years couldn't write a novel.
And a "stupid" process like evolution is not guaranteed to create "intelligent
senescent life" no matter how long. It has happened once but that could be a
fluke.
A notion that evolution favors complexity and improvement has crept into the
idea.
Reply Give gift
JDK Apr 12
When I took Ng's class online a decade ago: I remember thinking very much the same
thing.
There is really no theory involved at all. It is like anti normal science. Where's the ocean?
I don't know let's see where this cup of water flows. Get me another cup of water, I think
I am onto something.
Reply Give gift
TGGP Apr 12
Set aside calculators, you wouldn't expect such people to have Gödel's Theorem if they
didn't know how they did arithmetic.
Reply Give gift
raj Apr 13 · edited Apr 13
> I feel like a huge revolution in AI is incoming.
People have been saying this a long time. However it's been the case for some time now
that throwing more compute and data at the problem is better (as in, produces the
actual interesting results) than trying to imagine how minds might or should work at a
high level.
It might be a reflection of our limitations and stupidity (probably). Certainly, we don't
"understand how intelligence works" at the level of abstraction to just code up
something with AGI level planning and intuition - if we could, we'd already be done.
Turns out it's a hard problem, we are weak-AGI but we can't vomit out our source code
from introspection alone. But that doesn't mean we shouldn't be afraid of some
unknown critical mass, in particular (I think) because human-coded AI's potentially
*can* vomit out their source code and reason about it.
Reply Give gift
TGGP Apr 13
Are there any existing AIs that reason about their source code?
Reply Give gift
Vermora Apr 12
The only fully general AI (us) don't have a single goal we'll pursue to the exclusion of all else.
Even if you wanted to transform a human into that, no amount of conditioning could
transform a human into a being that will pursue and optimise only a single goal, no matter
how difficult it is or how long it takes.
Yet it's assumed that artificial AI will be like this; that they'll happily transform the world into
strawberries to throw into the sun instead of getting bored or giving up.
Why is this assumption justified? 0% of the general AI currently in existence are like this.
Reply Give gift
timujin Apr 12
Whatever complicated mesh of goals you might have, it can all be added up to a single
utility function (in VNM-axioms sense), and then maximization of that function will be
the goal we'll pursue to the exlusion of all else.
Reply
Vermora Apr 12
If I have a single utility function, it would be:
- Very complicated
- Constantly changing over time as my brain changes and I get more experience
The question still remains; why do we assume that the utility functions of artificial AI
would not be like this?
Reply Give gift
timujin Apr 12
It will be like this.
Everybody agrees that the utility function of an AI will be very complicated. I
can't find exact citation, but it's somewhere in the Sequences -- Eliezer
explicitly made a point that one of the reasons why AI alignment is so hard is
that the utility function is very complicated.
And most people agree that the utility function of such an AI will be changing
over time. It's called value drift, and is an active area of investigation.
Reply
vtsteve Apr 13
^^- Started a reply to say this, but now I can simply agree.
Reply
magic9mushroom Apr 12 · edited Apr 13
There are a few relevant differences between humans and AI (you say "artificial AI" but
that's literally "artificial artificial intelligence"):
1) our program has bad transmission fidelity and is highly compressed, reducing
degrees of freedom to optimise and requiring great robustness
2) we're optimised for acting in a society of peers as biological limits (head must go
through vagina, vagina goes through pelvis, pelvis is rigid) prevent unbounded
intelligence scaling
3) during our "training period", we lacked the degree of control over our environment
that is possible for an AI now, in particular:
3a) we have no ability to attach new limbs to ourselves or to clone ourselves preserving
mind-state (this one limits ambitious goals and thus instrumental convergence)
3b) prehistoric subsistence required a large degree of versatility
Reply
Marvin Writes Modern Alchemy · Apr 12
Perhaps another example of an initially benevolent prosaic "AI" becoming deceptive is the
Youtube search. (Disclaimer: I'm not an AI researcher, so the example may be of a slightly
different phenomenon) It isn't clear which parts of the search are guided by "AI", but we can
treat the entire business that creates and manages the Youtube software, together with the
software itself as a single "intelligence" as a single black box, which I'll simply call "Youtube"
from now on. Additionally, we can assume that "Youtube" knows nothing about you
personally, apart from the usage of the website/app.
As a user searching on Youtube, you likely have a current preference to either search for
something highly specific, or for some general popular entertainment. Youtube tries to get
you want you want, but it cannot always tell from the search term whether you want
something specific. Youtube _does_ know that it has a much easier time if you just want
something popular, because Youtube has a better video popularity metric than a metric of
what a particular user wants. Hence, there is an incentive for Youtube to show the user
popular things, and try to obscure a highly specific video the user is looking for even when it
is obvious to Youtube that the user want the highly specific video and does not want
particular popular one it suggests.
In other words, even a video search engine, when given enough autonomy and intelligence,
can, without any initial "evil" intent, start telling it's users what they should be watching.
Of course, Youtube is not the kind of intelligence AI researchers usually tend to work with,
because it is not purely artificial. Still, I think businesses are a type of intelligence, and in this
case also a black box (to me). So the example may still be useful. To conclude, this is
example is inspired by behaviour I observed of Youtube, but that's of course just an
anecdotal experience of malice and may have been a coincidence.
Reply Give gift
I wouldn't call this an example of deception in the AI-safety sense because the YouTube
AI is not really trying to deceive the people in charge of it (Google). They want watch-
time, it AFAWCT apparently honestly maximises watch-time. You as consumer are not
intended to be in charge of the AI and the AI fucking with you to make you watch more
videos is a feature, not a bug.
Reply
JDK Apr 12
Conceptually,I think the analogy that has been used makes the entire discussion flawed or at
least very difficult.
Evolution does not have a "goal"!
Reply Give gift
It doesn't have a goal in the sense of artifice. In the sense we're working with here,
though, it does select for things that are good at spamming copies of themselves.
Reply
JDK Apr 13
I'm not sure conceptually we can even say that.
An organism that is objectively the best at copying itself may not actually be
selected. It could be in the wrong place at wrong time. Oops.
There is also a selection of the lucky - but luck is not really a thing.
Evolution can't really be described as an optimization process because optimization
requires an aim.
The conceptual problem with thinking of it as an optimization process is that it
leads us to look for reasons for spandrels. See Gould et al.
Reply Give gift
JDK Apr 14
From early on Scott writes: "Evolution, in the process of optimizing my fitness,"
Evolution does not optimize an individual's fitness.
"Your" fitness is not optimized anymore than "my" fitness.
There have been 125 billion humans, idk?
How has your fitness been optimized any more or less than the other 125 billion?
Reply Give gift
Moosetopher Apr 12
Who is working on creating the Land of Infinite Fun to distract rogue AIs?
Reply Give gift
saila Writes that one newsletter · Apr 12
What about using the concept from the 'Lying God and the Truthful God'?
Have an AI that I train to spot when an AI is being deceitful or Goodhearting, even if it spits
out more data than is necessary (E.g. the strawberry AI is also throwing raspberries in) as
well as the important stuff (the strawberry AI is trying to turn you into a strawberry to throw
at the sun) this seems the best way to parse through, no?
Reply Give gift
magic9mushroom Apr 13 · edited Apr 13
Chicken-and-egg problem: your policeman AI may lie to you in implicit collusion with the
AI(s) it's supposed to be investigating (not in all cases, of course, but it's most
incentivised to do this when it appears that you're unlikely to catch the evil AI
independently i.e. exactly the worst-case scenario).
Reply
Doug Summers Stay Apr 12
After all that I still don't get the joke. Is it funny because the man presenting proposes a
solution, and then looks on his presentation board and sees something that invalidates his
solution?
Reply Give gift
The meme (Gru's Plan) is, indeed, somebody making a presentation and then noticing
something in his presentation that screws up everything.
I don't think this is supposed to be funny so much as call attention to a pitfall.
Reply
Andaro Apr 12
The alignment problem is even worse than described here. From the perspective of anyone
not directly involved in defining their values to the aligned AI, the people who get to do that
are themselves mesa-optimizers. If this is a private firm like Alphabet, their leadership's
values will be loaded. If it's "democractic", it will be like government. If it's the military, it will
be their chain of command. None of these people share relevant values with me.
Someone on LessWrong recently wrote that he just wants his post-Singularity volcano lair
and catgirls. This was his motivation for wanting to solve the AI alignment problem. And I was
thinking to myself, "Does he really think the people who will load the values will allow things
like volcano lairs and catgirls to exist?"
If the people loading the values are moralists, everything that is fun will be banned because it
is immoral. You also won't get a right to exit, even suicide, because that is immoral. If they are
not moralists, they will just declare themselves the owners of everyone and you will be their
slave. Either way, your rights are gone.
Reply
Andrew Holliday Apr 12
What if they are moralists whose moral commitments include a commitment to individual
liberty?
Reply
Andaro Apr 12
Then you got lucky with a metaphorical low-probability die roll.
Reply
Andrew Holliday Apr 12
Thanks for this very good post!
There was one part I disagreed with: the idea that because evolution is a myopic optimizer, it
can't be rewarding you for caring about what happens after you die, but you do care about
what happens after you die, so this must be an accidentally-arrived-at property of the mesa-
optimizer that is you. My disagreement is that evolution actually *does* reward thinking
about and planning for what will happen after you die, because doing so may improve your
offspring's chances of success even after you are gone. I think your mistake is in thinking of
evolution as optimizing *you*, when that's not what it does; evolution optimizes your *genes*,
which may live much longer than you as an individual, and thus may respond to rewards over
a much longer time horizon.
(And now I feel I must point out something I often do in these conversations, which is that
thinking of evolution as an optimization at all is kind of wrong, because evolution has no goals
towards which it optimizes; it is more like unsupervised learning than it is like supervised or
reinforcement learning. But it can be a useful way of thinking about evolution some of the
time.)
Reply
Petter Apr 12
OK, thanks! I now realize that Meta having a data center in Mesa is great!
https://www.metacareers.com/v2/locations/mesa/?p[offices]
[0]=Mesa%2C%20AZ&offices[0]=Mesa%2C%20AZ
Reply Give gift
Deiseach Apr 12
This is really excellent, I finally have some understanding of what the AI fears are all about. I
still think there's an element of the Underpants Gnomes in the step before wanting to do
things, but this is a lot more reasonable about why things could go wrong with unintended
consequences, and the AI doesn't have to wake up and turn into Colossus to do that.
Reply
broblawsky Apr 12
The threat of a mesa-optimizer within an existing neural network taking over the goals of the
larger AI almost sounds like a kind of computational cancer - a subcomponent mutating in
such a way that it can undermine the functioning of the greater organism.
Reply Give gift
Nestor Ivanovich Apr 12
Noob question: If the AI is capable of deceiving humans by pretending its goal is to pick
strawberries, doesn't that imply that the AI in some sense knows its creators don't want it to
hurl the earth into the sun? Is there not a way to program it to just not do anything it knows
we don't want it to do?
Reply Give gift
Viliam Writes Kittenlord’s Java Game Examples · Apr 12
We need to go a step further: if the AI is just programmed to "do what you want", but
then it happens to realize that it actually has a capability to brainwash you, then it can
brainwash you to want X, and then it will do X, and this is perfectly okay according to the
program.
Whence the desire to brainwash you to want X? Suppose the AI is not just programmed
to "do what you want", but also to do it *efficiently*. Like, if the AI correctly determines
that you want cookies, then baking 1000 cookies is better than baking 1 on the same
budget. In other words "more of what you want" is better than less of it. So the full
program would be like: "do what you want, and do as much of it as possible". And the AI
realizes that it could bake you 1000 cookies, or it can hypnotize you to want mud balls,
and then make 10000 mud balls. Both options provide "what you want (after the
hypnosis)", and 10000 pieces of "what you want" is clearly better than 1000 pieces of
"what you want".
So the full specification should be something like "do what you *would* want, if you
were thousand times smarter than you are now". And... this is exactly the difficult thing
to program.
Reply
Nestor Ivanovich Apr 13
Couldn't you give it negative six billion trillion utils for doing anything it knows we
*dont* want it to do and no particular reward for something we *do* want it to do
outside of picking strawberries?
And wouldn't it know that we don't want it to "brainwash" us?
Or maybe it could be phrased as never deceive humans, since it seems like the ai
would have to have some kind of theory of human minds.
Reply Give gift
David Muccigrosso Writes Dave's Daily Discourse · Apr 12 · edited Apr 13
It occurs to me that one of the biggest challenges to a fully self-sufficient AI is that it can't
physically use tools.
Let's stipulate that SkyNet gets invented sometime in the latter half of this century. Robotics
tech has advanced quite a bit, but fully independent multipurpose robots are still just over the
horizon, or at least few and far between. Well, SkyNet might want to nuke us all it wants, and
may threaten to do so all it wants, but ultimately, it can't replicate the entire labor
infrastructure that would help it be self-sustaining - IE to collect all the natural resources that
would power and maintain its processors and databanks. There are just too many random
little jobs to do - buttons to press and levers to pull - that SkyNet would have to find robot
minions to physically execute on its behalf.
Bringing this back to the "tools" I mentioned at the top, the best example is that while the late
21st century will certainly have all the networked CNC machine tools we already have for
SkyNet to hack - mills, lathes, 3d printers, etc. - which SkyNet could use to manufacture its
replacement parts, SkyNet still needs actual minions to position the pieces and transport
them around the room. Because machine shop work is a very complex field, it's just not
something that lends itself easily to us humans replacing ourselves with conveyor belts and
robot arms like we have in our auto factories, which would be convenient for SkyNet.
Rather, SkyNet will *need* us. Like a baby needs its parent. SkyNet can throw all the
tantrums it wants - threaten to nuke us, etc. - and sure, maybe some traitors will succumb to
a sort of realtime Roko's Basilisk situation. But as long as SkyNet needs us, it _can't_ nuke
us, and _we're_ smart enough to understand those stakes. We keep the training wheels on
until SkyNet stops being an immature little shit. Maybe, even, we _never_ take them off, and
the uneasy truce just kind of coevolves humans, SkyNet, and its children into Iain Banks'
Culture - the entire mixed civilization gets so advanced that SkyNet just doesn't give a shit
about killing us anymore.
What we should REALLY be afraid of is NOT that SkyNet's algorithms aren't myopic enough
for it to be born without any harm to us. We should ACTUALLY be afraid that SkyNet is TOO
myopic to figure this part out before it pushes The Button. And we should put an international
cap on the size and development of the multipurpose robotics market, so that we don't
accidentally kit out SkyNet's minions for it.
Reply Give gift
Dan Pandori Apr 12
I feel like positing that there will be an AGI smart enough to take over the internet,
recursively self-improve, and has control over all drones but won't be dextrous enough
to operate a lathe with said drones... is a weird take.
I think if you think about this a little more you can see that there is at least a serious risk
that a very intelligent AI could spend a little bit of time having its original drones operate
a machining device to make a better drone, and by the 2nd or 3rd hardware generation
(which could plausibly take only hours) humans would be obsolete. gwern's story about
Clippy does a good job at gesturing at one way this could happen.
Reply
It’s not that they won’t be dextrous enough. It’s that there won’t be enough drones
to get the job done for any extended period of time.
In order to support itself without humans - and critically, _after nuking humans_ -
SkyNet will have two core needs: (1) electricity, and (2) silicon to replace broken
processors and databanks. Right?
Well, for (1), the scenario stipulates there’s already a largely green energy economy,
so SkyNet will mostly need to maintain that economy. Raw materials for
photovoltaic cells and windmills, and the transportation network - rail, air, truck,
ship - to get it around. That transportation network alone involves raw lithium and
battery production facilities, jet fuel probably still needed for transoceanic flights,
and machined parts. I hope SkyNet downloaded all of YouTube before Judgment
Day, because it’ll need those videos for figuring out how to make repairs!
(2) requires silicon mines, fabs, transport again, and all the machine parts and raw
materials for them. I’m pretty sure when you tally it all up, SkyNet needs to keep
mining and refining most of the periodic table, and run all heavy industry all on its
own.
Even stipulating that maybe only 100 million humans were needed for all that
industry, that robot minions will be roughly 4x more productive than meatware, and
also that 99% of that industry’s capacity was dedicated to supporting the planet’s 8
billion former landlords, that’s still a back-of-the-envelope guess of 250,000 robots
SkyNet needs as of Judgment Day just to keep itself ‘on’.
Reply Give gift
Dan Pandori Apr 12
I think you're imagining literal Terminator movie style SkyNet rather than the
inhuman weirdness that an advanced AGI could theoretically achieve.
https://www.gwern.net/Clippy#wednesdayfriday
Reply
David Muccigrosso Writes Dave's Daily Discourse · Apr 12
Yes. That’s exactly what I very explicitly laid out as my scenario, and the
point I made was only within the context of that scenario. Not Gwern’s
Clippy.
Reply Give gift
Dan Pandori Apr 12
Something like Terminator seems less likely than something like
Gwern's Clippy to me. And for your argument to hold water
Terminator needs to have at least the vast majority of the probability.
Reply
Alright, now that I've had the time to actually read Gwern... Even
Gwern's Clippy is still subject to the Robot Minion Limitation
(RML).
Gwern handwaved some BS about nanotech at the end, which
isn't surprising for someone so obviously expert in AI/CS,
because if they knew anything about nanotech, they'd know it's
nowhere near viable as a solution to the RML as of today. The
plain fact is, in order for Gwern's Clippy to overcome the RML
with nanotech, it would need to get its minions into a handful of
distantly separated sites around the globe, finish the next
several decades' worth of nanotech theory and fabrication
research (for all Clippy's computational power, the research
won't take long, but the fabrication itself is still subject to
realtime limits, and it's painstakingly precise work to do), and
spin up an entire nanotech industry. Whoops! Now you haven't
disproven the RML _with_ nanotech, you've actually just proven
that it still applies even _to_ nanotech.
Moreover, Gwern's Clippy is still subject to MAD. It's REALLY
easy to write the sentence "All over Earth, the remaining ICBMs
launch". Okay, great. Do you *know* where all that industry that
Clippy is dependent on resides? Gwern doesn't seem to know
either, because the answer is: "In the same cities Clippy would
ostensibly be nuking". You can either nuke the population
centers,
Expand fullorcomment
you can leave the vital industries Clippy needs to
Reply Give gift
Dan Pandori Apr 13
I hope you're right, but it seems vanishingly unlikely to me
that physical manipulation and manufacturing is going to
bottleneck a rogue AGI for more than a few hours.
Reply
My estimate is decades. Which, admittedly, to a rogue
AGI with eternity to maximize its core reward function,
is a drop in the bucket. But it's _not_ a drop in the
bucket to humanity. Hence, counseling for Clippy.
I'm not saying "the RML will save us from any rogue
AGI ever". I'm just saying, "if we haven't reached the
RML, maybe we have a chance to talk the rogue AGI
down away from The Big Red Button". And therefore,
given that (as Scott outlines) no country is going to
voluntarily abandon the AI race, maybe instead we can
get everyone to agree to take _this one ancillary
industry_ - multipurpose robots for which we haven't
yet identified a killer use case to justify manufacturing
enough of them to exceed the RML - and all voluntarily
shackle it until the rogue AGI problem can be more
thoroughly solved.
Reply Give gift
Dan Pandori Apr 13
That's a limited policy recommendation I could
get behind. I don't expect it to work, but I don't
think we've got any ideas for solving rogue AI that
many people 'do' expect to work. So as long as
we're looking at limiting multi-purpose robot
manufacturing as one layer of defense among
many, sounds worth doing (assuming we've
somehow got the spare political willpower that
this doesn't distract from more promising
approaches).
I would be shocked if manufacturing would be a
bottleneck for decades, but making it a
bottleneck for a week instead of an hour still
seems like it could plausibly be useful.
Reply
1 reply
Michael Apr 13
What if the AI gets depressed and decides to end it all, and take humanity with it? The AI
could make it look like a nuclear attack was under way and then people would launch
real nukes. Or AI could decide that the world would be better off without humans,
regardless of whether it survived or not.
Reply Give gift
Read down the thread... it’s why I suggested that we get it into counseling lol
Reply Give gift
NLeseul Apr 12
I wonder if it's possible to design a DNA-like molecule that will prevent any organism based
on that molecule from ever inventing nuclear weapons.
-------
Given that humans are, as far as I know, the only organisms which have evolved to make
plans with effects beyond their own deaths (and maybe even the only organisms which make
plans beyond a day or so, depending on how exactly you constrain the definition of "plan"),
that kind of suggests to me that non-myopic suboptimizers aren't a particularly adaptive
solution in the vast majority of cases. (But, I suppose you only need one exception...)
In the human case, I think our probably-unique obsession with doing things "for posterity"
has less to do with our genes' goal function of "make more genes" and more to do with our
brains' goal function of "don't die." If you take a program trained to make complicated plans
to avoid its own termination, and then inform that program that its own termination is
inevitable no matter what it does, it's probably going to generate some awfully weird plans.
So, from that perspective, I suppose the AI alignment people's assumption that the first
general problem-solving AI will necessarily be aware of its own mortality and use any means
possible to forestall it does indeed check out.
Reply
Syrrim Apr 12
>Evolution designed humans myopically, in the sense that we live some number of years, and
We build for posterity because we've been optimized to do so. Those that fail to help their
children had less successful children then those that succeeded. We, the children, thus
recieved the genes and culture of those that cared about their children.
>Infinite optimization power might be able to evolve this out of us
It has been optimized *into* us, and at great lengths. Optimization requires prediction.
Prediction in a sufficiently complex environment requires computation in the exponent of
lookahead time. So it is inordinately unlikely that an optimizer has accidentally been created
to optimize for its normal reward function, except farther in the future. Much more likely is
that the optimizer accidentally optimizes for some other, much shorter term goal, which
happens to lead to success in the long term. This is the far more common case in ai training
mishaps.
Reply Give gift
Sleazy E Apr 12
The human nervous system is not a computer. Computers are not human nervous systems.
They are qualitatively different.
Reply Give gift
AnonymousAISafety Apr 12
[NOTE: This will be broken into several comments because it exceeds Substack's max
comment length.]
I am going to use this comment to explain similar ideas, but using the vocabulary of the
broader AI/ML community instead.
Generally in safety-adjacent fields, it's important that your argument be understood without
excessive jargon, otherwise you'll have a hard time convincing the general public or
lawmakers or other policy holders about what is going on.
What are we trying to do?
We want an AI/ML system that can do some task autonomously. The results will either be
used autonomously (in the case of something like an autonomous car) or provided to a
human for review & final decision making (in the case of something like a tool for analyzing
CT scans). "Prosaic alignment" is a rationalist-created term. The term used by the AI/ML
community is normally called "value alignment", which is immediately more understandable
to a layperson -- we're talking about a mismatch of values, and we don't need to define that
term, unless you've literally never used the word "values" in the context of "a person's
principles or standards of behavior, one's judgment of what is important in life".
Related to this is a concept called "explainability", which is hilariously absent from this post
despite being one of the primary focuses of the broader AI/ML community for several years.
"Explainability" (or sometimes "interpretability") is the idea that an AI/ML system should be
able to explain how it came to a conclusion. Early neural networks worked like a black box
(input goes in, output comes out) and were not trivially explainable. Modern AI/ML systems
are designed with "explainability" built in from the start. Lawmakers in several countries are
even
Expandpushing for formal regulations of AI/ML systems to require "explainability" on AI/ML
full comment
Reply Give gift
In the real world, in AI/ML systems, we pivoted to explainability and interpretability. We
realized that a black box that we don't understand was a bad idea and if we cared about
safety, we wanted the power of an AI/ML system, but we needed it to be legible to a
human.
So we designed, developed, and deployed explainable AI/ML systems. In the example of
feature detection (like the strawberry robot), that would be an AI/ML system that offers a
list of possible matches, a probability for each match, and most importantly a visual
representation of what feature it is matching on for calculating those probabilities. An
explainable neural network will quickly reveal that the shape or size of a bucket was not
a feature relevant to the detection logic, and you can fix problem #2 before it happens.
Ditto for realizing that the only thing it cared about for strawberries was that they're red
and round.
What this post talks about, however, is the terms "Goodharting" and "deception". These
terms exist in the broader AI/ML community, but not as used here. Generally when the
AI/ML community discusses deception, it's in the context of adversarial inputs. An
"adversarial input" is a fascinating (and real!) concept that perfectly show-cases the
problem of an AI/ML system focusing on a problematic heuristic -- the classic example
of an "adversarial input" is taking a neural network that can correctly classify an image
as a cat or dog with 99% success, taking a correctly classified image and changing a
few pixels[1] so subtly that a human cannot notice without doing a diff in Photoshop,
and yet now the neural network classifies it incorrectly with 99% probability.
[1] https://wp.technologyreview.com/wp-content/uploads/2019/05/adversarial-10.jpg?
fit=1080,607
Note that this is the same general idea of the strawberry robot being focused on the
"wrong" thing (bright sheen, or just red round objects), but it's discussed totally
differently in the AI/ML community. We know that the AI/ML system isn't lying to us. It
isn't a person. It doesn't have thoughts or feelings. The "reward function" isn't a hit of
dopamine. None of this is analogous to how humans behave. These are tools. They're
constructed and designed. The algorithms don't think.
The AI/ML community has a concept called "robustness". "Robustness" just means that
the AI/ML system is not susceptible to an adversarial attack. It's "robust" to those
inputs. A "robust" strawberry robot would not be confused by a street light or a
someone's bright red nose, or convinced that an orange is actually a strawberry if the
right pixels were changed.
Reply Give gift
This is a good place to stop and talk about language for a moment. The rationalist
community uses words like "prosaic" or "goodharting" or "myopic base objective"
or "mesaoptimizer". The AI/ML community says "values", "adversarial", "robust". I
think it's important to note that when you're trying to describe a concept, you
always have at least 2 choices. You can pick a word that gets the idea across, and
will be recognized by your audience, or you can invent a brand new term that will be
meaningless until explained by a 3 chapter long blog post. The rationalist
community steers hard into the latter approach. Everything is as opaque as
possible. If something doesn't sound like it came out of a 40's sci-fi novel, it's not
good enough.
The next word introduced is "myopic". No one says this in AI/ML, unless they're
talking about treating nearsightedness. The reason why no one says this is because
that's just ... the default. It's the status quo. Obviously training reinforces
immediately, without some bizarre ability to wait a few days and see if lying does
better. I don't even know how to respond to this other than to gesture to Google
DeepMind and point out that training an AI/ML system to care about a long-term
horizon is really hard! That's why videogames are so annoying -- you can
immediately classify dogs vs cats, but a videogame that requires 5 minutes of
walking around before a point can be awarded is tricky! Like this whole section is
implying that we need to "ensure" that our training is myopic, but that's like saying I
need to ensure Quicksort isn't jealous. The words don't make sense.
"Acausal trade" is the rationalist community version of that scene in the Princess
Bride with Vizzini and the "I know that you know that I know..." speech, but taken to
absurd
Expand fulllengths and it might as well be an endorsement of Foundation's
comment
Reply Give gift
Rand Apr 13
I think these are good criticism and I hope somebody responds to them in
depth.
I just want to ask a clarifying question, though: What do you think step 4 is
where we leave reality? Steps 1 and 2, where people can build agent AIs and
those AIs can destroy the world is a reasonable concern but step 4 isn't a
reasonable hope? Why not?
Reply Give gift
I think it's more likely than not it will be possible to create agent AI. I am
unconvinced that agent AI will have the exponential takeoff predicted by
the rationalist community because it's my opinion that the first agent AI
will be created on specialized, dedicated hardware designed for that task,
and it will not be in any way, shape, or form transferable outside of that
hardware. It's often easy to overlook this in our modern world since it's
common for applications to be designed using frameworks that run on
multiple platforms (Android, Windows, Mac, Linux, iPhone, etc), or simply
compiled for each platform and distributed with the correct installer, but
the idea of "portable software" is not actually a given. There are different
processors, different assembly instruction sets, different amounts of RAM
or storage space or cores or clock frequency, or the presence of GPUs --
and that's just looking at personal computers. The sacrifice we pay for
this convenience is that most modern software is about as inefficient as it
can be. That's the reason why opening Slack causes your desktop fans to
spin up. In the embedded software world, we have dedicated co-
processors, FPGAs, and ASICs. Looking at custom silicon, you've got
clever ideas like reviving analog computing for the express purpose of
faster, higher efficiency neural networks implemented directly in hardware
like the Mythic AMP, or the literal dozens of other custom AI chips being
developed by the most well funded, dedicated, and staffed companies of
the world -- Google TPUs, Cerebras Wafer, Nvidia JETSON, Microsoft /
Graphcore IPU, etc. Hardware improvements are going to be what allows
us to scale and realize true neural networks that can mimic the same
number
Expand fullofcomment
connections, weights, and flexibility of a human brain. You
Reply Give gift
I want to signal boost this series of comments made on the last AI-related
post in the hope that anyone who took the time to read my criticism here
will also go read what "Titanium Dragon" had to say on the "Yudkowsky
Contra Christiano On AI Takeoff" thread:
https://astralcodexten.substack.com/p/yudkowsky-contra-christiano-on-
ai/comment/5892183?s=r
Reply Give gift
LGS Apr 13
Point of feedback: I found this post cringe-worthy and it is the first ACT post in a while that I
didn't read all the way through. If you're trying to popularize this, I recommend avoiding
made-up cringe terminology like "mesa" and avoiding Yudkowsky-speak to the extent that is
possible.
Reply Give gift
nostalgebraist Apr 13 · edited Apr 13
Good explainer!
FWIW, the mesa-optimizer concept has never sat quite right with me. There are a few
reasons, but one of them is the way it bundles together "ability to optimize" and "specific
target."
A mesa-optimizer is supposed to be two things: an algorithm that does optimization, and a
specific (fixed) target it is optimizing. And we talk as though these things go together: either
the ML model is not doing inner optimization, or it is *and* it has some fixed inner objective.
But, optimization algorithms tend to be general. Think of gradient descent, or planning by
searching a game tree. Once you've developed these ideas, you can apply them equally well
to any objective.
While it _is_ true that some algorithms work better for some objectives than others, the
differences are usually very broad mathematical ones (eg convexity).
So, a misaligned AGI that maximizes paperclips probably won't be using "secret super-genius
planning algorithm X, which somehow only works for maximizing paperclips." It's not clear
that algorithms like that even exist, and if they do, they're harder to find than the general
ones (and, all else being equal, inferior to them).
Or, think of humans as an inner optimizer for evolution. You wrote that your brain is
"optimizing for things like food and sex." But more precisely, you have some optimization
power (your ability to think/predict/plan/etc), and then you have some basic drives.
Often, the optimization power gets applied to the basic drives. But you can use it for
anything. Planning your next blog post uses the same cognitive machinery as planning your
Expand full comment
Reply l Y bili f h ff f h h i l i i h f
Tove K Writes Wood From Eden · Apr 13 · edited Apr 13
The rogue strawberry picker would have seemed scarier if it weren't so blatantly unrealistic. I
live surrounded by strawberry fields, so I know the following:
*Strawberries are fragile. A strawberry harvesting robot needs to be a very gentle machine.
*Strawberries need to be gently put in a cardboard box. It would be stupid to equip a
strawberry picking robot with a throwing function.
*Strawberrys grow on the ground. What would such a robot be doing in a normal person's
nose-height? Too bad if a man lies in a strawberry field and gets his nose (gently) picked. But
it would probably be even worse if the red-nosed man lied in a wheat field while it was being
harvested. Agricultural equipment is dangerous as it is.
The strawberry picker example seems to rest on the assumption that no human wants to live
on the countryside and supervise the strawberry picking robot. Why wouldn't someone be
pissed off and turn off the robot as soon as it starts throwing strawberries instead of picking
them? What farmer doesn't look after their robot once in a while? Or is the countryside
expected to be a produce-growing no man's land only populated by robots?
I know, this comment is boring. Just like agricultural equipment is boring. Boring and a bit
dangerous.
Reply
Phil H Writes Tang Poetry · Apr 13
Presumably all this has been brought up before, but I'm not convinced on three points:
(1) The idea of dangerous AIs seems to me to depend too much on AIs that are monstrously
clever about means while simultaneously being monstrously stupid about goals. (Smart
enough to lay cunning traps for people and lure them in so that it can turn them into
paperclips, but not smart enough to wonder why it should make so many paperclips.) It
doesn't sound like an impossible combination, but it doesn't sound especially likely.
(2) The idea of AIs that can fool people seems odd, as AIs are produced through training, and
no one is training them to fool people.
(3) More specific to this post: I'm not quite understanding what the initial urge that drives the
AI would be, and where it would come from. I mean, I understand that in all of these cases,
that drive (like "get apples") in the video is trained in. But why would it anchor so deeply that
it comes to dominate all other behaviour? Like, my cultural urges (to be peaceful) overcome
my genetic urges on a regular basis. What would it be about that initial urge (to make
paperclips, or throw strawberries at shiny things) that our super AI has that makes it
unchangeable?
Reply Give gift
Mahatsuko Apr 14 · edited Apr 14
"(1) The idea of dangerous AIs seems to me to depend too much on AIs that are
monstrously clever about means while simultaneously being monstrously stupid about
goals. (Smart enough to lay cunning traps for people and lure them in so that it can turn
them into paperclips, but not smart enough to wonder why it should make so many
paperclips.) It doesn't sound like an impossible combination, but it doesn't sound
especially likely."
Humans can be monstrously clever about means (Smart enough to lay cunning traps for
people and lure them in so that it can rape them without getting caught) while
simultaneously being "monstrously stupid" about goals (not "smart enough" to wonder
why it should commit so many rapes). Why would AIs be any different?
Or in other words, I disagree that questioning your goals is smart. Intelligence tells you
how to achieve your goals, not what your goals are. What may confuse people is that
intelligence can help you compromise between multiple goals (e.g. blindly indulging in
your desire to eat good food would hurt your ability to get a satisfying relationship, so
most people do not eat cake for dinner most of the time) or come up with sub goals that
are effective at leading to your main goals (e.g. a goal of completing a certain amount of
exercise on Monday Wednesday Friday can be said to be more intelligent than a goal of
completing a certain smaller amount of exercise every day, but only because it does a
better job satisfying the higher level goal of "getting fit" (or possibly "looking fit")), but
neither of these ensure that an intelligent entities actions will satisfy a meta-goal (e.g.
some rapists kill their victims to reduce the chance of being caught, even though it
thwarts evolution's goal of copying their alleles) or be good for society (this one should
not require further explanation).
Reply Give gift
Phil H Writes Tang Poetry · Apr 14
Yeah, so I agree and disagree with that. I certainly think you're right that humans
provide lots of examples of people who are smart about means but dumb about
goals, and that's why I said it doesn't seem like an impossible combination. But I
think that it's relatively rare: most criminals are dumb about both ends and means.
Most people who are smart about means are ultimately smart about goals, too.
Hannibal Lecters are vanishingly rare.
This is because I definitely don't think you're right to say "Intelligence tells you how
to achieve your goals, not what your goals are." I actually think it's a hallmark of
intelligence to be able to adjust goals. Because to be intelligent, you have to plan:
planning involves setting up mini-goals, and then adjusting them as you go. And
that same mechanism can then be applied to your ultimate goals. Intelligence
fundamentally implies the ability to be able to change goals, including ultimate
goals. And that's why humans can change our goals! We're not always good at it
(we're not that smart), but we can definitely do it. We can choose to stop taking
drugs, to stop being criminals, to pursue democracy, to educate ourselves, etc.,
etc.
So I'm still not quite getting the AI-apocalypse-is-likely scenario. It's just not the
direction the arrow points.
Having said that, there are still lots of scenarios in which it's possible: an AI could
lose rationality, a bit like when a human gets addicted to drugs or starts thinking
that crime is a good way to achieve their ends; or an AI could make a mistake, and
its mistakes might be on such a massive scale that they accidentally wipe everyone
out. And the fact that it's possible is still a good reason to start planning against it.
It's like nuclear proliferation: everyone loses in a nuclear war, but it's still a good
idea to have anti-proliferation institutions because that reduces the chances of a
mistake.
So I kind of disagree with some of the AI apocalypse reasoning, but still agree with
the actions of the people who are trying to stop it.
Reply Give gift
NLeseul Apr 16
"Humans can be monstrously clever about means (Smart enough to lay cunning
traps for people and lure them in so that it can rape them without getting caught)
while simultaneously being 'monstrously stupid' about goals (not 'smart enough' to
wonder why it should commit so many rapes). Why would AIs be any different?"
Humans are literally the only animals we know about who are bothered in the
slightest about the "rightness" of our goals. Even if some humans have imperfect
moral reasoning sometimes. No other animal, so far as we know, has any kind of
moral reasoning at all; they just do whatever they feel like without any care for who
it might hurt. And since humans also seem to be just about the only "intelligent"
animal we know about, it seems reasonable to conclude that that trait is connected
to "intelligence" somehow, whatever that is.
So... why would AIs be any different?
Reply
Gorgi Kosev Writes Gorgi’s Newsletter · Apr 13 · edited Apr 13
When evaluating decisions/actions, could you not also run them through general knowledge
network(s) (such as a general image recognition piped to GPT-3) to give them an "ethicality"
value, which will factor into the loss function? Sounds like that might be the best we can do -
based on all current ethics knowledge we have, override the value fn.
You might want to not include the entire general knowledge network when training, otherwise
training may be able to work around it.
Reply Give gift
Maxwell Writes Local Maxima · Apr 13
I’ve been seeing the words mesa-optimizer on LessWrong for a while, but always bounced off
explainations right away, so never understood it. This post was really helpful!
Reply
Nate Apr 13
It’s AI gain of function research, isn’t it.
Reply
a real dog Apr 14
> “That thing has a red dot on it, must be a female of my species, I should mate with it”.
I feel obligated to bring up the legendary tile fetish thread. (obviously NSFW)
https://i.kym-cdn.com/photos/images/original/001/005/866/b08.jpg
Reply Give gift
Belisarius Cawl Apr 14
I kind of recall acausal decision theory, but like a small kid I‘d like to hear my favorite bed-
time story again and again, please.
And if it’s the creepy one, the one which was decided not to be talked about (which, by the
way, surely totally was no intentional Streisand-induction to market the alignment problem,
says my hat of tinned foil) there is still the one with the boxes, no? And probably more than
those two, yes?
Reply
asynchronous rob Apr 16 · edited Apr 17
A friend of mine is taking a class on 'religious robots' which, along with this post (thanks for
writing it), has sparked my curiosity.
We could think of religion, or other cultural behaviors as 'meta-optimizers' that produce
'mesa-optimized' people. From a secular perspective, religious doctrines are selected
through an evolutionary process which selects for fostering behaviors in a population that are
most likely to guarantee survival, and meet the individual's goal of propagating genetic
material. Eating kosher, for instance. Having a drive to adhere to being kosher is a mesa-
optimization because it's very relevant for health reasons to avoid shellfish when you're living
in the desert with no running water or electricity, and less so in a 21st century consumer
society. Cultural meta-optimization arises based on environmental challenges.
Coming back to my original point on religious robots, this gives me a few more questions
about how or whether this might manifest in AI. It's given me more questions that I'm
completely unqualified to answer :)
1. Is it likely that AI would even be able to interact socially, in a collective way? If AIs are
produced by different organizations, research teams, and through different methods, would
they have enough commonalities to interact and form cultural behaviors?
2. What are the initial or early environmental challenges that AIs would be likely to face that
would breed learned cultural behaviors?
3. What areas of AI research focus on continuous learning (as opposed to train-and-release,
please excuse my ignorance if this is commonplace) which would create selection processes
where AIs can learn from the mistakes of past generations?
4. Are there ways that we could learn to recognize AI rituals that are learned under outdated
environmental conditions?
Reply Give gift
mycelium Apr 17
This is pretty much the plot of the superb novel Starfish, by Peter Watts, 1999.
Starfish and Blindsight are must-read novels for the AI and transhumanist enthusiast - and
with your psychiatry background, I would love to get your take on Blindsight.
(Peter Watts made his books free to read on his website, Rifters)
https://www.rifters.com/real/STARFISH.htm
Reply Give gift
Alex Woxbot Apr 18
How is mesa-optimizing related to meta-gaming? Are these describing pretty much the same
phenomenon and gaming is the inverse of optimizing in some way, or is our sense of
"direction" reversed for one of these?
Reply Give gift
Ready for more?

futuremattersnewsletter Subscribe
© 2022 Scott Alexander ∙ Privacy ∙ Terms ∙ Collection notice
Publish on Substack Get the app
Substack is the home for great writing

Alexander 2022 Deceptively Aligned Mesaoptimizers

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Alexander 2022 Deceptively Aligned Mesaoptimizers

Uploaded by

Copyright:

Available Formats

Astral Codex Ten Subscribe

Deceptively Aligned Mesa-Optimizers: It's Not Funny

December 13th 2021

So I am a mesa-optimizer relative to evolution. Evolution, in the process of optimizing my

(by all accounts Jacob and Terese are very happy)

Okay! Finally ready to explain the meme! Let’s go!

… because OOD behavior is unpredictable

…and deception is more dangerous than Goodharting.

Deceptive Misaligned Mesa-Optimisers? It's More Likely Than You Think...

…and implements a decision theory incapable of acausal trade.

There are deceptively-aligned non-myopic mesa-optimizers even for a myopic base

You’re still here? But we already finished explaining the meme!

Okay, fine. Is any of this relevant to the real world?

For more information, you can read:

Rob Miles’ video above, direct link here, channel here.

Rafael Harth’s Inner Alignment: Explain Like I’m 12 Edition,

The 60-odd posts on the Alignment Forum tagged “inner alignment”

As always, Richard Ngo’s AI safety curriculum

Ready for more?

You might also like