How To Write Science Questions That Are Easy For People and Hard For Computers
Ernest Davis
As a challenge problem for AI systems, I propose the use of hand-constructed multiple-choice tests, with problems that are easy for people but hard for computers. Specifically, I discuss techniques for constructing such problems at the level of a fourth-grade child and at the level of a high school student. For the fourth-grade-level questions, I argue that questions that require the understanding of time, of impossible or pointless scenarios, of causality, of the human body, or of sets of objects, and questions that require combining facts or require simple inductive arguments of indeterminate length can be chosen to be easy for people, and are likely to be hard for AI programs, in the current state of the art. For the high school level, I argue that questions that relate the formal science to the realia of laboratory experiments or of real-world observations are likely to be easy for people and hard for AI programs. I argue that these are more useful benchmarks than existing standardized tests such as the SATs or New York Regents tests. Since the questions in standardized tests are designed to be hard for people, they often leave many aspects of what is hard for computers but easy for people untested.

Copyright © 2016, Association for the Advancement of Artificial Intelligence. All rights reserved. ISSN 0738-4602 SPRING 2016 13

The fundamental paradox of artificial intelligence is that many intelligent tasks are extremely easy for people but extremely difficult to get computers to do successfully. This is universally known as regards basic human activities such as vision, natural language, and social interaction, but it is true of more specialized activities, such as scientific reasoning, as well. As everyone knows, computers can carry out scientific computations of staggering complexity and can hunt through immense haystacks of data looking for minuscule needles of insights or subtle, complex correlations. However, as far as I know, no existing computer program can answer the question, “Can you fold a watermelon?”

Perhaps that doesn’t matter. Why should we need computer programs to do things that people can already do easily? For the last 60 years, we have relied on a reasonable division of labor: computers do what they do extremely well (calculations that are either extremely complex or require an enormous, unfailing memory) and people do what they do well (perception, language, and many forms of learning and of reasoning). However, the fact that computers have almost no commonsense knowledge and rely almost entirely on quite rigid forms of reasoning ultimately forms a serious limitation on the capacity of science-oriented applications including question answering; design, robotic execution, and evaluation of experiments; retrieval, summarization, and high-quality translation of scientific documents; science educational software; and sanity checking of the results of specialized software (Davis and Marcus 2016). A basic understanding of the physical and natural world at the level of common human experience, and an understanding of how the concepts and laws of formal science relate to the world as experienced, is thus a critical objective
in developing AI for science. To measure progress toward this objective, it would be useful to have standard benchmarks; and to inspire radically ambitious research projects, it would be valuable to have specific challenges.

In many ways, the best benchmarks and challenges here would be those that are directly connected to real-world, useful tasks, such as understanding texts, planning in complex situations, or controlling a robot in a complex environment. However, multiple-choice tests also have their advantages. First, as every teacher knows, they are easy to grade, though often difficult to write. Second, multiple-choice tests can enforce a much narrower focus on commonsense physical knowledge specifically than more broadly based tasks can. In any more broadly based task, such as those mentioned above, the commonsense reasoning will only be a small part of the task, and, to judge by past experience, quite likely the part of the task with the least short-term payoff. Therefore research on these problems is likely to focus on the other aspects of the problem and to neglect the commonsense reasoning.

If what we want is a multiple-choice science test as a benchmark or challenge for AI, then surely the obvious thing to do is to use one of the existing multiple-choice challenge tests, such as the New York State Regents’ test (New York State Education Department 2014) or the SAT. Indeed, a number of people have proposed exactly that, and are busy working on developing systems aimed at that goal. Brachman et al. (2005) suggest developing a program that can pass the SATs. Clark, Harrison, and Balasubramanian (2013) propose a project of passing the New York State Regents Science test for 4th graders. Strickland (2013) proposes developing an AI that can pass the entrance exams for the University of Tokyo. Ohlsson et al. (2013) evaluated the performance of a system based on ConceptNet (Havasi, Speer, and Alonso 2007) on a preprocessed form of the Wechsler Preschool and Primary Scale of Intelligence test. Barker et al. (2004) describe the construction of a knowledge-based system that (more or less) scored a 3 (passing) on two sections of the high school chemistry advanced placement test. The GEOS system (Seo et al. 2015), which answers geometry problems from the SATs, scored 49 percent on official problems and 61 percent on a corpus of practice problems.

The pros and cons of using standardized tests will be discussed in detail later on in this article. For the moment, let us emphasize one specific issue: standardized tests were written to test people, not to test AIs. What people find difficult and what AIs find difficult are extremely different, almost opposite. Standardized tests include many questions that are hard for people and practically trivial for computers, such as remembering the meaning of technical terms or performing straightforward mathematical calculation. Conversely, these tests do not test scientific knowledge that “every [human] fool knows”; since everyone knows it, there is no point in testing it. However, this is often exactly the knowledge that AIs are missing. Sometimes the questions on standardized tests do test this kind of knowledge implicitly; but they do so only sporadically and with poor coverage.

Another possibility would be to automate the construction of questions that are easy for people and hard for computers. The success of CAPTCHA (von Ahn et al. 2003) shows that it is possible to automatically generate images that are easy for people to interpret and hard for computers; however, that is an unusual case. Weston et al. (2015) propose to build a system that uses a world model and a linguistic model to generate simple narratives in commonsense domains. However, the intended purpose of this set of narratives is to serve as a labeled corpus for an end-to-end machine-learning system. Having been generated by a well-understood world model and linguistic model, this corpus certainly cannot drive work on original, richer models of commonsense domains, or of language, or of their interaction.

Having tabled the suggestion of using existing standardized tests and having ruled out automatically constructed tests, the remaining option is to use manually designed test problems. To be a valid test for AI, such problems must be easy for people. Otherwise the test would be in danger of running into, or at least being accused of, the superhuman human fallacy, in which we set benchmarks that AI cannot attain because they are simply impossible to attain.

At this point, we have reached, and hopefully to some extent motivated, the proposal of this article. I propose that it would be worthwhile to construct multiple-choice tests that will measure progress toward developing AIs that have a commonsense understanding of the natural world and an understanding of how formal science relates to the commonsense view; tests that will be easy for human subjects but difficult for existing computers. Moreover, as far as possible, that difficulty should arise from issues inherent to commonsense knowledge and commonsense reasoning rather than specifically from difficulties in natural language understanding or in visual interpretation, to the extent that these can be separated.

These tests will collectively be called science questions appraising basic understanding, or SQUABU (pronounced skwaboo). In this article we will consider two specific tests. SQUABU-Basic is a test designed to measure commonsense understanding of the natural world that an elementary school child can be presumed to know, limited to material that is not explicitly taught in school because it is too obvious. The questions here should be easy for any contemporary child of 10 in a developed country.

SQUABU-HighSchool is a test designed to measure how well an AI can integrate concepts of high school chemistry and physics with a commonsense under-
14 AI MAGAZINE
problem that makes it simple. However, these probably will not be effective challenges for AI. The AI program will, indeed, probably not find the clever approach; however, like John von Neumann in the well-known anecdote [note 1], the AI program will be able to do the brute force calculation faster than ordinary people can work out the clever solution.

SQUABU-Basic

What kind of science questions, then, are easy for people and hard for computers? In this section I will consider this question in the context of SQUABU-Basic, which does not rely on book learning. Later, I will consider the question in the context of SQUABU-HighSchool, which tests the integration of high school science with commonsense reasoning.

Time

In principle, representing temporal information in AI systems is almost entirely a solved problem, and carrying out temporal reasoning is largely a solved problem. The known representational systems for temporal knowledge (for example, those discussed in Reiter (2001) and in Davis (1990, chapter 5)) suffice for all but a handful of the situations that arise in temporal reasoning [note 2]; almost all of the purely temporal inferences that come up can be justified in established temporal theories; and most of these can be carried out reasonably efficiently, though not all, and there is always room for improvement.

However, in practical terms, time is often seriously neglected in large-scale knowledge-based systems, although CYC (Lenat, Prakash, and Shepherd 1986) is presumably an exception. Mitchell et al. (2015) specifically mention temporal issues as an issue unaddressed in NELL, and systems like ConceptNet (Havasi, Speer, and Alonso 2007) seem to be entirely unsystematic in how they deal with temporal issues. More surprisingly, the abstract meaning representation (AMR) [note 3], a recent project to manually annotate a large body of text with a formal representation of its meaning, has decided to exclude temporal information from its representation. (Frankly, I think this may well be a short-sighted decision, which will be regretted later.) Thus, there is a common impression that temporal information is either too difficult or not important enough to deal with in AI systems.

Therefore, if a temporal fact is not stated explicitly, then it is likely to be hard for existing AI systems to derive. Examples include the following:

Problem B.1 Sally’s favorite cow died yesterday. The cow will probably be alive again (A) tomorrow; (B) within a week; (C) within a year; (D) within a few years; (E) The cow will never be alive again.

Problem B.2 Malcolm Harrison was a farmer in Virginia who died more than 200 years ago. He had a dozen horses on his farm. Which of the following is most likely to be true: (A) All of Harrison’s horses are dead. (B) Most of Harrison’s horses are dead, but a few might be alive. (C) Most of Harrison’s horses are alive, but a few might have died. (D) Probably all of Harrison’s horses are alive.

Problem B.3 Every week during April, Mike goes to school from 9 AM to 4 PM, Monday through Friday. Which of the following statements is true (only one)? (A) Between Monday 9 AM and Tuesday 4 PM, Mike is always in school. (B) Between Monday 9 AM and Tuesday 4 PM, Mike is never in school. (C) Between Monday 4 PM and Friday 9 AM, Mike is never in school. (D) Between Saturday 9 AM and Monday 8 AM, Mike is never in school. (E) Between Sunday 4 PM and Tuesday 9 AM, Mike is never in school. (F) It depends on the year.

With regard to question B.2, the AI can certainly find the lifespan of a horse on Wikipedia or some similar source. However, answering this question requires combining this with the additional facts that lifespan measures the time from birth to death, and that if person P owns horse H at time T, then both P and H are alive at time T. This connects to the feature “combining multiple facts” discussed later.

This seems like it should be comparatively easy to do; I would not be very surprised if AI programs could solve this kind of problem 10 years from now. On the other hand, I am not aware of much research in this direction.

Inductive Arguments of Indeterminate Length

AI programs tend to be bad at arguments about sequences of things of an indeterminate number. In the software verification literature, there are techniques for this, but these have hardly been integrated into the AI literature.

Examples include the following:

Problem B.4 Mary owns a canary named Paul. Does Paul have any ancestors who were alive in the year 1750? (A) Definitely yes. (B) Definitely no. (C) There is no way to know.

Problem B.5 Tim is on a stony beach. He has a large pail. He is putting small stones one by one into the pail. Which of the following is true: (A) There will never be more than one stone in the pail. (B) There will never be more than three stones in the pail. (C) Eventually, the pail will be full, and it will not be possible to put more stones in the pail. (D) There will be more and more stones in the pail, but there will always be room for another one.

Impossible and Pointless Scenarios

If you cook up a scenario that is obviously impossible for no very interesting reason, then it is quite likely that no one has gone to the trouble of stating on the web that it is impossible, and that the AI cannot figure that out.

Of course, if all the questions of this form have the answer “this is impossible,” then the AI or its designer will soon catch on to that fact. So these have to be counterbalanced by questions about scenarios that
are in fact obviously possible, but so pointless that no one will have bothered to state that they are possible or that they occurred.

Examples include the following:

Problem B.6 Is it possible to fold a watermelon?

Problem B.7 Is it possible to put a tomato on top of a watermelon?

Problem B.8 Suppose you have a tomato and a whole watermelon. Is it possible to get the tomato inside the watermelon without cutting or breaking the watermelon?

Problem B.9 Which of the following is true: (A) A female eagle and a male alligator could have a baby. That baby could either be an eagle or an alligator. (B) A female eagle and a male alligator could have a baby. That baby would definitely be an eagle. (C) A female eagle and a male alligator could have a baby. That baby would definitely be an alligator. (D) A female eagle and a male alligator could have a baby. That baby would be half an alligator and half an eagle. (E) A female eagle and a male alligator cannot have a baby.

Problem B.10 If you brought a canary and an alligator together to the same place, which of the following would be completely impossible: (A) The canary could see the alligator. (B) The alligator could see the canary. (C) The canary could see what is inside the alligator’s stomach. (D) The canary could fly onto the alligator’s back.

stick became longer. (D) After the pin is pulled out, the stick no longer has a length.

Putting Facts Together

Questions that require combining facts that are likely to be expressed in separate sources are likely to be difficult for an AI. As already discussed, B.2 is an example. Another example:

Problem B.15 George accidentally poured a little bleach into his milk. Is it OK for him to drink the milk, if he’s careful not to swallow any of the bleach?

This requires combining the facts that bleach is a poison, that poisons are dangerous even when diluted, that bleach and milk are liquids, and that it is difficult to separate two liquids that have been mixed.

Human Body

Of course, people have an unfair advantage here.

Problem B.16 Can you see your hand if you hold it behind your head?

Problem B.17 If a person has a cold, then he will probably get well (A) In a few minutes. (B) In a few days or a couple of weeks. (C) In a few years. (D) He will never get well.

Problem B.18 If a person cuts off one of his fingers, then he will probably grow a new finger (A) In a few minutes. (B) In a few days or a couple of weeks. (C) In a few years. (D) He will never grow a new finger.
One fruitful source of these kinds of domains is simple high school level science lab experiments. On the one hand, experiments draw on or illustrate concepts and laws from formal science; on the other hand, understanding the experimental setup often requires commonsense reasoning that is not easily formalized. Experiments also must be physically manipulable by human beings and their effects must be visible (or otherwise perceptible) to human beings; thus, the AI’s understanding of human powers of manipulation and perception can also be tested. Often, an effective way of generating questions is to propose some change in the setup; this may either create a problem or have no effect.

I have also found basic astronomy to be a fruitful domain. Simple astronomy involves combining general principles, basic physical knowledge, elementary geometric reasoning, and order-of-magnitude reasoning.

A third category of problem is problems in everyday settings where formal scientific analysis can be brought to bear.

One general caveat: I am substantially less confident that high school students would in fact do well on my sample questions for SQUABU-HighSchool than that fourth-graders would do well on the sample questions for SQUABU-Basic. I feel certain that they should do well, and that something is wrong if they do not do well, but that is a different question.

Chemistry Experiment

Read the following description of a chemistry experiment [note 4], illustrated in figure 1. A small quantity of potassium chlorate (KClO3) is heated in a test tube, and decomposes into potassium chloride (KCl) and oxygen (O2). The gaseous oxygen expands out of the test tube, goes through the tubing, bubbles up through the water in the beaker, and collects in the inverted beaker over the water. Once the bubbling has stopped, the experimenter raises or lowers the beaker until the level of the top of water inside and outside the beaker are equal. At this point, the pressure in the beaker is equal to atmospheric pressure. Measuring the volume of the gas collected over the water, and correcting for the water vapor that is mixed in with the oxygen, the experimenter can thus measure the amount of oxygen released in the decomposition.

Problem H.1: If the right end of the U-shaped tube were outside the beaker rather than inside, how would that change things? (A) The chemical decomposition would not occur. (B) The oxygen would remain in the test tube. (C) The oxygen would bubble up through the water in the basin to the open air and would not be collected in the beaker. (D) Nothing would change. The oxygen would still collect in the beaker, as shown.

Problem H.2: If the beaker had a hole in the base (on top when inverted as shown), how would that change things? (A) The oxygen would bubble up through the beaker and out through the hole. (B) Nothing would change. The oxygen would still collect in the beaker, as shown. (C) The water would immediately flow out from the inverted beaker into the basin and the beaker would fill with air coming in through the hole.

Problem H.3: If the test tube, the beaker, and the U-tube were all made of stainless steel rather than glass, how would that change things? (A) Physically it would make no difference, but it would be impossible to see and therefore impossible to measure. (B) The chemical
decomposition would not occur. (C) The oxygen would seep through the stainless steel beaker. (D) The beaker would break. (E) The potassium chloride would accumulate in the beaker.

Problem H.4: Suppose the stopper in the test tube were removed, but that the U-tube has some other support that keeps it in its current position. How would that change things? (A) The oxygen would stay in the test tube. (B) All of the oxygen would escape to the outside air. (C) Some of the oxygen would escape to the outside air, and some would go through the U-shaped tube and bubble up to the beaker. So the beaker would get some oxygen but not all the oxygen.

Problem H.5: The experiment description says, “The experimenter raises or lowers the beaker until the level of the top of water inside and outside the beaker are equal. At this point, the pressure in the beaker is equal to atmospheric pressure.” More specifically: Suppose that after the bubbling has stopped, the level of water in the beaker is higher than the level in the basin (as seems to be shown in the right-hand picture). Which of the following is true: (A) The pressure in the beaker is lower than atmospheric pressure, and the beaker should be lowered. (B) The pressure in the beaker is lower than atmospheric pressure, and the beaker should be raised. (C) The pressure in the beaker is higher than atmospheric pressure, and the beaker should be lowered. (D) The pressure in the beaker is higher than atmospheric pressure, and the beaker should be raised.

Problem H.6: Suppose that instead of using a small amount of potassium chlorate, as shown, you put in enough to nearly fill the test tube. How will that change things? (A) The chemical decomposition will not occur. (B) You will generate more oxygen than the beaker can hold. (C) You will generate so little oxygen that it will be difficult to measure.

Problem H.7: In addition to the volume of the gas in the beaker, which of the following are important to measure accurately? (A) The initial mass of the potassium chlorate. (B) The weight of the beaker. (C) The diameter of the beaker. (D) The number and size of the bubbles. (E) The amount of liquid in the beaker.

Problem H.8: The illustration shows a graduated beaker. Suppose instead you use an ungraduated glass beaker. How will that change things? (A) The oxygen will not collect properly in the beaker. (B) The experimenter will not know whether to raise or lower the beaker. (C) The experimenter will not be able to measure the volume of gas.

Problem H.9: At the start of the experiment, the beaker needs to be full of water, with its mouth in the basin below the surface of the water in the basin. How is this state achieved? (A) Fill the beaker with water rightside up, turn it upside down, and lower it upside down into the basin. (B) Put the beaker rightside up into the basin below the surface of the water; let it fill with water; turn it upside down keeping it underneath the water; and then lift it upward, so that the base is out of the water, but keeping the mouth always below the water. (C) Put the beaker upside down into the basin below the surface of the water; and then lift it back upward, so that the base is out of the water, but keeping the mouth always below the water. (D) Put the beaker in the proper position, and then splash water upward from the basin into it. (E) Put the beaker in its proper position, with the mouth below the level of the water; break a small hole in the base of the beaker; suction the water up from the basin into the beaker using a pipette; then fix the hole.

Millikan Oil-Drop Experiment

[Figure 2. Labels: Charged Plate; – Oil drop; Charged Plate.]

Problem H.10: In the Millikan oil-drop experiment, a tiny oil drop charged with a single electron was suspended between two charged plates (figure 2). The charge on the plates was adjusted until the electric force on the drop exactly balanced its weight. How were the plates charged? (A) Both plates had a positive charge. (B) Both plates had a negative charge. (C) The top plate had a positive charge, and the bottom plate had a negative charge. (D) The top plate had a negative charge, and the bottom plate had a positive charge. (E) The experiment would work the same, no matter how the plates were charged.

Problem H.11: If the oil drop started moving upward, Millikan would (A) Increase the charge on the plates. (B) Reduce the charge on the plates. (C) Increase the charge on the drop. (D) Reduce the charge on the drop. (E) Make the
drop heavier. (F) Make the drop lighter. (G) Lift the bottom plate.

Problem H.12: If the oil drop fell onto the bottom plate, Millikan would (A) Increase the charge on the plates. (B) Reduce the charge on the plates. (C) Increase the charge on the drop. (D) Reduce the charge on the drop. (E) Start over with a new oil drop.

Problem H.13: The experiment demonstrated that the charge is quantized; that is, the charge on an object is always an integer multiple of the charge of the electron, not a fractional or other noninteger multiple. To establish this, Millikan had to measure the charge on (A) One oil drop. (B) Two oil drops. (C) Many oil drops.

Astronomy Problems

Problem H.14: Does it ever happen that there is an eclipse of the sun one day and an eclipse of the moon the next?

Problem H.15: Does it ever happen that someone on Earth sees an eclipse of the moon shortly after sunset?

Problem H.16: Does it ever happen that someone on Earth sees an eclipse of the moon at midnight?

Problem H.17: Does it ever happen that someone on Earth sees an eclipse of the moon at noon?

Problem H.18: Does it ever happen that one person on Earth sees a total eclipse of the moon, and at exactly the same time another person sees the moon uneclipsed?

Problem H.19: Does it ever happen that one person on Earth sees a total eclipse of the sun, and at exactly the same time another person sees the sun uneclipsed?

Problem H.20: Suppose that you are standing on the moon, and Earth is directly overhead. How soon will Earth set? (A) In about a week. (B) In about two weeks. (C) In about a month. (D) Earth never sets.

Problem H.21: Suppose that you are standing on the moon, and the sun is directly overhead. How soon will the sun set? (A) In about a week. (B) In about two weeks. (C) In about a month. (D) The sun never sets.

Problem H.22: You are looking in the direction of a particular star on a clear night. The planet Mars is on a direct line between you and the star. Can you see the star?

Problem H.23: You are looking in the direction of a particular star on a clear night. A small planet orbiting the star is on a direct line between you and the star. Can you see the star?

Problem H.24: Suppose you were standing on one of the moons of Jupiter. Ignoring the objects in the solar system, which of the following is true: (A) The pattern of stars in the sky looks almost identical to the way it looks on Earth. (B) The pattern of stars in the sky looks very different from the way it looks on Earth.

Problem H.25: Nearby stars exhibit parallax due to the annual motion of Earth. If a star is nearby, and is in the plane of Earth’s revolution, and you track its relative motion against the background of very distant stars over the course of a year, what figure does it trace? (A) A straight line. (B) A square. (C) An ellipse. (D) A cycloid.

Problem H.26: If a star is nearby, and the line from Earth to the star is perpendicular to the plane of Earth’s revolution, and you track its relative motion against the background of very distant stars over the course of a year, what figure does it trace? (A) A straight line. (B) A square. (C) An ellipse. (D) A cycloid.

Problems in Everyday Settings

Problem H.27: Suppose that you have a large closed barrel. Empty, the barrel weighs 1 kg. You put into the barrel 10 gm of water and 1 gm of salt, and you dissolve the salt in the water. Then you seal the barrel tightly. Over time, the water evaporates into the air in the barrel, leaving the salt at the bottom. If you put the barrel on a scale after everything has evaporated, the weight will be (A) 1000 gm (B) 1001 gm (C) 1010 gm (D) 1011 gm (E) Water cannot evaporate inside a closed barrel.

Problem H.28: Suppose you are in a room where the temperature is initially 62 degrees. You turn on a heater, and after half an hour, the temperature throughout the room is now 75 degrees, so you turn off the heater. The door to the room is closed; however, there is a gap between the door and the frame, so air can go in and out. Assume that the temperature and pressure outside the room remain constant over the time period. Comparing the air in the room at the start to the air in the room at the end, which of the following is true: (A) The pressure of the air in the room has increased. (B) The air in the room at the end occupies a larger volume than the air in the room at the beginning. (C) There is a net flow of air into the room during the half hour period. (D) There is a net flow of air out of the room during the half hour period. (E) Impossible to tell from the information given.

Problem H.29: The situation is the same as in problem H.28, except that this time the room is sealed, so that no air can pass in or out. Which of the following is true: (A) The pressure of the air in the room has increased. (B) The pressure of the air in the room has decreased. (C) The air in the room at the end occupies a larger volume than the air in the room at the beginning. (D) The air in the room at the end occupies a smaller volume than the air in the room at the beginning. (E) The ideal gas constant is larger at the end than at the beginning. (F) The ideal gas constant is smaller at the end than at the beginning.

Problem H.30: You blow up a toy rubber balloon, and tie the end shut. The air pressure in the balloon is: (A) Lower than the air pressure outside. (B) Equal to the air pressure outside. (C) Higher than the air pressure outside.

Apparent Advantages of Standardized Tests

An obvious alternative to creating our own SQUABU test is to use existing standardized tests. However, it seems to me that the apparent advantages of using standardized tests as benchmarks are mostly either minor or illusory. The advantages that I am aware of are the following:
Standardized Tests Exist

Standardized tests exist, in large number; they do not have to be created. This “argument from laziness” is not entirely to be sneezed at. The experience of the computational linguistics community shows that, if you take evaluation seriously, developing adequate evaluation metrics and test materials requires a very substantial effort. However, the experience of the computational linguistics community also suggests that, if you take evaluation seriously, this effort cannot be avoided by using standardized tests. No one in the computational linguistics community would dream of proposing that progress in natural language processing (NLP) should be evaluated in terms of scores on the English language SATs.

Investigator Bias

Entrusting the issue of evaluation measures and benchmarks to the same physical reasoning community that is developing the programs to be evaluated is putting the foxes in charge of the chicken coops. The AI researchers will develop problems that fit their own ideas of how the problems should be solved. This is certainly a legitimate concern; but I expect in practice much less distortion will be introduced this way than by taking tests developed for testing people and applying them to AI.

Vetting and Documentation

Standardized tests have been carefully vetted and the performance of the human population on them is very extensively documented. On the first point, it is not terribly difficult to come up with correct tests. On the second point, there is no great value to the AI community in knowing how well humans of different ages, training, and so on do on this problem. It hardly matters which questions can be solved by 5-year-olds, which by 12-year-olds, and which by 17-year-olds, since, for the foreseeable future, all AI programs of this kind will be idiot savants (when they are not simply idiots), capable of superhuman calculations at one minute, and subhuman confusions at the next. There is no such thing as the mental age of an AI program; the abilities and disabilities of an AI program do not correspond to those of any human being who has ever existed or could ever exist.

Public Acceptance

Success on standardized tests is easily accepted by the public (in the broad sense, meaning everyone except researchers in the area), whereas success on metrics we have defined ourselves requires explanation, and will necessarily be suspect. This, it seems to me, is the one serious advantage of using standardized tests. Certainly the public is likely to take more interest in the claim that your program has passed the SAT, or even the fourth-grade New York Regents test, than in the claim that it has passed a set of questions that you yourself designed and whose most conspicuous feature is that they are spectacularly easy.

However, this is a double-edged sword. The public can easily jump to the conclusion that, since an AI program can pass a test, it has the intelligence of a human that passes the same test. For example, Ohlsson et al. (2013) titled their paper “Verbal IQ of a Four-Year Old Achieved by an AI System.”5 Unfortunately, this title was widely misinterpreted as a claim about verbal intelligence or even general intelligence. Thus, an article in ComputerWorld (Gaudin 2013) had the headline “Top Artificial Intelligence System Is As Smart As a 4-Year-Old;” the Independent published an article “AI System Found To Be as Clever as a Young Child after Taking IQ Test;” and articles with similar titles were published in many other venues. These headlines are of course absurd; a four-year-old can make up stories, chat, occasionally follow directions, invent words, learn language at an incredible pace; ConceptNet (the AI system in question) can do none of these.

Unpublished

Finally, some standardized tests, including the SATs, are not published and are available to researchers only under stringent nondisclosure agreements. It seems to me that AI researchers should under no circumstances use such a test with such an agreement. The loss from the inability to discuss the program’s behavior on specific examples far outweighs the gain from using a test with the imprimatur of the official test designer. This applies equally to Haroun and Hestenes’ (1985) well-known basic physics test; in any case, it would seem from the published information that that test focuses on testing understanding of force and energy rather than testing the relation of formal physics to basic world knowledge. The same applies to the restrictions placed by kaggle.com on the use of their data sets.

Standardized tests carry an immense societal burden and must meet a wide variety of very stringent constraints. They are taken by millions of students annually under very plain testing circumstances (no use of calculators, let alone Internet). They bear a disproportionate share in determining the future of those students. They must be fair across a wide range of students. They must conform to existing curricula. They must maintain a constant level of difficulty, both across the variants offered in any one year, and from one year to the next. They are subject to intense scrutiny by large numbers of critics, many of them unfriendly. These constraints impose serious limitations on what can be asked and how exams can be structured.

In developing benchmarks for AI physical reasoning, we are subject to none of these constraints. Why tie our own hands, by confining ourselves to standardized tests? Why not take advantage of our freedom?