
REPORTS

A Bayesian Truth Serum for Subjective Data

Dražen Prelec

Massachusetts Institute of Technology, Sloan School of Management, E56-320, 38 Memorial Drive, Cambridge, MA 02139, USA. E-mail: dprelec@mit.edu

Subjective judgments, an essential information source for science and policy, are problematic because there are no public criteria for assessing judgmental truthfulness. I present a scoring method for eliciting truthful subjective data in situations where objective truth is unknowable. The method assigns high scores not to the most common answers but to the answers that are more common than collectively predicted, with predictions drawn from the same population. This simple adjustment in the scoring criterion removes all bias in favor of consensus: Truthful answers maximize expected score even for respondents who believe that their answer represents a minority view.

Subjective judgment from expert and lay sources is woven into all human knowledge. Surveys of behaviors, attitudes, and intentions are a research staple in political science, psychology, sociology, and economics (1). Subjective expert judgment drives environmental risk analysis, business forecasts, historical inferences, and artistic and legal interpretations (2).

The value of subjective data is limited by its quality at the source: the thought process of an individual respondent or expert. Quality would plausibly be enhanced if respondents felt as if their answers were being evaluated by an omniscient scorer who knew the truth (3). This is the situation with tests of objective knowledge, where success is defined as agreement with the scorer's answer key, or in the case of forecasts, an observable outcome (4). Such evaluations are rarely appropriate in social science, because the scientist is reluctant to impose a particular definition of truth, even if one were available (5).

Here, I present a method of eliciting subjective information, designed for situations where objective truth is intrinsically or practically unknowable (6). The method consists of an "information-scoring" system that induces truthful answers from a sample of rational (i.e., Bayesian) expected value–maximizing respondents. Unlike other Bayesian elicitation mechanisms (7–9), the method does not assume that the researcher knows the probabilistic relationship between different responses. Hence, it can be applied to previously unasked questions, by a researcher who is a complete outsider for the domain. Unlike earlier approaches to "test theory without an answer key" (5), or the Delphi method (10), it does not privilege the consensus answer. Hence, there is no reason for respondents to bias their answer toward the likely group mean. Truthful responding remains the correct strategy even for someone who is sure that their answer represents a minority view.

Instead of using consensus as a truth criterion, my method assigns high scores to answers that are more common than collectively predicted, with predictions drawn from the same population that generates the answers. Such responses are "surprisingly common," and the associated numerical index is called an information score. This adjustment in the target criterion removes the bias inherent in consensus-based methods and levels the playing field between typical and unusual opinions.

The scoring works at the level of a single question. For example, we might ask: (i) What is your probability estimate that humanity will survive past the year 2100 (100-point probability scale)? (ii) Will you vote in the next presidential election (Definitely/Probably/Probably Not/Definitely Not)? (iii) Have you had more than 20 sexual partners over the past year (Yes/No)? (iv) Is Picasso your favorite 20th-century painter (Yes/No)?

Each respondent provides a personal answer and also a prediction of the empirical distribution of answers (i.e., the fraction of people endorsing each answer). Predictions are scored for accuracy, that is, for how well they match the empirical frequencies. The personal answers, which are the main object of interest, are scored for being surprisingly common. An answer endorsed by 10% of the population against a predicted frequency of 5% would be surprisingly common and would receive a high information score; if predictions averaged 25%, it would be a surprisingly uncommon answer, and hence receive a low score.

The surprisingly common criterion exploits an overlooked implication of Bayesian reasoning about population frequencies. Namely, in most situations, one should expect that others will underestimate the true frequency of one's own opinion or personal characteristic. This implication is a corollary to the more usual Bayesian argument that the highest predictions of the frequency of a given opinion or characteristic in the population should come from individuals who hold that opinion or characteristic, because holding the opinion constitutes a valid and favorable signal about its general popularity (11, 12). People who, for example, rate Picasso as their favorite should, and usually do (13), give higher estimates of the percentage of the population who shares that opinion, because their own feelings are an informative "sample of one" (14). It follows, then, that Picasso lovers, who have reason to believe that their best estimate of Picasso popularity is high compared with others' estimates, should conclude that the true popularity of Picasso is underestimated by the population. Hence, one's true opinion is also the opinion that has the best chance of being surprisingly common.

The validity of this conclusion does not depend on whether the personally truthful answer is believed to be rare or widely shared. For example, a male who has had more than 20 sexual partners [answering question (iii)] may feel that few people fall in this promiscuous category. Nevertheless, according to Bayesian reasoning, he should expect that his personal estimate of the percentage (e.g., 5%) will be somewhat higher than the average of estimates collected from the population as a whole (e.g., 2%). The fact that he has had more than 20 sexual partners is evidence that the general population, which includes persons with fewer partners, will underestimate the prevalence of this profile.

Truth-telling is individually rational in the sense that a truthful answer maximizes expected information score, assuming that everyone is responding truthfully [hence, it is a Bayesian Nash equilibrium (15)]. It is also collectively rational in the sense that no other equilibrium provides a higher expected information score, for any respondent. In actual applications of the method, one would not teach respondents the mathematics of scoring or explain the notion of equilibrium. Rather, one would like to be able to tell them that truthful answers will maximize their expected scores, and that in arriving at their personal true answer they are free to ignore what other respondents might say. The equilibrium analysis confirms that under certain conditions one can make such a claim honestly.
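
The "sample of one" argument can be checked with a small calculation. The sketch below is illustrative only: it assumes a uniform prior over the unknown population frequency (chosen here for concreteness; the argument is stated for general priors), so that one personal observation moves the posterior mean to 2/3 or 1/3. It also uses a plain arithmetic average of others' predictions, and the function names are hypothetical.

```python
from fractions import Fraction

def posterior_mean(holds_opinion: bool) -> Fraction:
    """Posterior mean of the population frequency after one observation,
    assuming a uniform Beta(1, 1) prior (an illustrative assumption)."""
    # Beta(1, 1) prior plus one observation gives Beta(2, 1) or Beta(1, 2).
    return Fraction(2, 3) if holds_opinion else Fraction(1, 3)

def expected_average_prediction(holds_opinion: bool) -> Fraction:
    """Average prediction a respondent expects the population to give."""
    own = posterior_mean(holds_opinion)  # own estimate of the true frequency
    # A share `own` of others is expected to hold the opinion and predict 2/3;
    # the remaining share is expected to predict 1/3.
    return own * posterior_mean(True) + (1 - own) * posterior_mean(False)

own_estimate = posterior_mean(True)
others_average = expected_average_prediction(True)
print(own_estimate, others_average)  # 2/3 5/9
```

Under these assumptions a holder of the opinion estimates its frequency at 2/3 but expects the population's average prediction to be only 5/9, which is the sense in which one's own opinion is expected to be surprisingly common.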

462 15 OCTOBER 2004 VOL 306 SCIENCE www.sciencemag.org


The equilibrium results rest on two assumptions. First, the sample of respondents must be sufficiently large so that a single answer cannot appreciably affect empirical frequencies (16). The results do hold for large finite populations but are simpler to state for a countably infinite population, as is done here. Respondents are indexed by $r \in \{1, 2, \ldots\}$, and their truthful answer to a multiple-choice question with $m$ possible answers by $t^r = (t^r_1, .., t^r_m)$ ($t^r_k \in \{0, 1\}$, $\sum_k t^r_k = 1$). $t^r_k$ is thus an indicator variable that has a value of one or zero depending on whether answer $k$ is or is not the truthful answer of respondent $r$. The truthful answer is also called a personal opinion or characteristic.

Second, respondents treat personal opinions as an "impersonally informative" signal about the population distribution, which is an unknown parameter, $\omega = (\omega_1, .., \omega_m) \in \Omega$ (17). Formally, I assume common knowledge (18) by respondents that all posterior beliefs, $p(\omega \mid t^r)$, are consistent with Bayesian updating from a single distribution over $\omega$, also called a common prior, $p(\omega)$, and that: $p(\omega \mid t^r) = p(\omega \mid t^s)$ if and only if $t^r = t^s$. Opinions thus provide evidence about $\omega$, but the inference is impersonal: Respondents believe that others sharing their opinion will draw the same inference about population frequencies (19). One can therefore denote a generic respondent with opinion $j$ by $t_j$ and suppress the respondent superscript from joint and conditional probabilities: $\mathrm{Prob}\{t^r_j = 1 \mid t^s_i = 1\}$ becomes $p(t_j \mid t_i)$, and so on.

For a binary question, one may interpret the model as follows. Each respondent privately and independently conducts one toss of a biased coin, with unknown probability $\omega_H$ of heads. The result of the toss represents his opinion. Using this datum, he forms a posterior distribution, $p(\omega_H \mid t^r)$, whose expectation is the predicted frequency of heads. For example, if the prior is uniform, then the posterior distribution following the toss will be triangular on [0,1], skewed toward heads or tails depending on the result of the toss, with an expected value of one-third or two-thirds. However, if the prior is not uniform but strongly biased toward the opposite result (i.e., tails), then the expected frequency of heads following a heads toss might still be quite low. This would correspond to a prima facie unusual characteristic, such as having more than 20 sexual partners within the previous year.

An important simplification in the method is that I never elicit prior or posterior distributions, only answers and predicted frequencies. Denoting answers and predictions by $x^r = (x^r_1, .., x^r_m)$ ($x^r_k \in \{0, 1\}$, $\sum_k x^r_k = 1$) and $y^r = (y^r_1, .., y^r_m)$ ($y^r_k \ge 0$, $\sum_k y^r_k = 1$), respectively, I calculate the population endorsement frequencies, $\bar{x}_k$, and the (geometric) average, $\bar{y}_k$, of predicted frequencies,

$$\bar{x}_k = \lim_{n \to \infty} \frac{1}{n} \sum_{r=1}^{n} x^r_k, \qquad \log \bar{y}_k = \lim_{n \to \infty} \frac{1}{n} \sum_{r=1}^{n} \log y^r_k$$

Instead of applying a preset answer key, we evaluate answers according to their information score, which is the log-ratio of actual-to-predicted endorsement frequencies. The information score for answer $k$ is

$$\log \frac{\bar{x}_k}{\bar{y}_k} \tag{1}$$

At least one answer will have a nonnegative information score. Variance in predictions tends to lower all $\bar{y}_k$ values and hence raises information scores.

The total score for a respondent combines the information score with a separate score for the accuracy of predictions (20):

$$\text{score for respondent } r = \text{information score} + \text{prediction score} = \sum_k x^r_k \log \frac{\bar{x}_k}{\bar{y}_k} + \alpha \sum_k \bar{x}_k \log \frac{y^r_k}{\bar{x}_k}, \qquad 0 < \alpha \tag{2}$$

Equation 2 is the complete payoff equation for the game. It is symmetric, and zero-sum if $\alpha = 1$. The first part of the equation selects a single information-score value, given that $x^r_k = 0$ for all answers except the one endorsed by $r$. The second part is a penalty proportional to the relative entropy (or Kullback-Leibler divergence) between the empirical distribution and $r$'s prediction of that distribution (21, 22). The best prediction score is zero, attained when prediction exactly matches reality, $y^r_k = \bar{x}_k$. Expected prediction score is maximized by reporting expected frequencies, $y^r_k = E\{\bar{x}_k \mid t^r\}$ (2). The constant $\alpha$ fine-tunes the weight given to prediction error.

To see how this works in the simple coin toss setting, imagine that there are only two equally likely possibilities: Either the coin is fair, or it is unfair, in which case it always comes up heads. A respondent who privately observes a single toss of tails knows that the coin is fair, and predicts a 50-50 split of observations. A respondent observing heads lowers the probability of fairness from the prior 1/2 to a posterior of 1/3, in accord with Bayes' rule, which in turn yields a predicted (i.e., expected) frequency of 1/6 for tails (multiplying 1/3 by 1/2). From the perspective of someone observing tails, the expectation of others' predictions of the frequency of tails will be a mix of predictions of 1/2 (from those tossing tails) and 1/6 (from those tossing heads), yielding a geometric mean clearly lower than his or her predicted frequency of 1/2. Hence, he or she expects that tails will prove to be more common than predicted and receive a positive information score. By contrast, heads is expected to be a surprisingly uncommon toss, because the predicted frequency of 1/2 is lower than the expectation of others' predictions, which is a mix of 1/2 and 5/6 predictions. A similar argument would show that those who draw heads should expect that heads will prove to be the answer with the high information score.

The example illustrates a general property of information scores. Namely, a truthful answer constitutes the best guess about the most surprisingly common answer, if "best" is defined precisely by expected information score and if other respondents are answering truthfully and giving truthful predicted frequencies. This property does not depend on the number of possible answers or on the prior (23). It leads directly to the equilibrium result [proof in the supporting online material (SOM) text].

For this theorem, assume that (i) every respondent $r$ with opinion $t^r$ forms a posterior over the population distribution of opinions, $p(\omega \mid t^r)$, by applying Bayes' rule to a common prior $p(\omega)$; (ii) $p(\omega \mid t^r) = p(\omega \mid t^s)$ if and only if $t^r = t^s$; and (iii) scores are computed according to Eq. 2. Then, (T1) truth-telling is a Nash equilibrium for any $\alpha > 0$: Truth-telling maximizes expected total score of every respondent who believes that others are responding truthfully; (T2) expected equilibrium information scores are nonnegative and attain a maximum for all respondents in the truth-telling equilibrium; (T3) for $\alpha = 1$, the game is zero-sum, and the total scores in the truth-telling equilibrium equal $\log p(\omega \mid t^r) + K$, with $K$ set by the zero-sum constraint.

Truth-telling is defined as truthful answers, $x^r = t^r$, and truthful predictions, $y^r = E\{\omega \mid t^r\}$. T2 states that although there are other equilibria, constructed by mapping multiple true opinions into a single response category or by randomization, these less revealing equilibria result in lower information scores for all respondents. If needed, one can enhance the strategic advantage of truth-telling by giving relatively more weight to information score in Eq. 2 (24). For sufficiently small $\alpha$, the expected total scores in the truth-telling equilibrium will Pareto-dominate expected scores in any other equilibrium. T3 shows that by setting $\alpha = 1$ we also have the option of presenting the survey as a purely competitive, zero-sum contest. Total scores then rank respondents according to how well they anticipate the true distribution of answers.
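
Equations 1 and 2 can be sketched for a finite sample as follows. This is a minimal illustration rather than the paper's finite-player formula (note 20 states a separate formula for that case): it simply replaces the infinite-population limits with sample means over all respondents, and it assumes every predicted frequency is strictly positive. The function names `information_scores` and `total_score` are hypothetical. The usage example encodes the fair-coin world of the coin toss setting, where tails tossers predict 1/2 tails and heads tossers predict 1/6 tails.

```python
import math

def information_scores(answers, predictions):
    """Return (info, x_bar): Eq. 1 log-ratios and endorsement frequencies.

    answers[r] is the index of the answer endorsed by respondent r;
    predictions[r][k] is r's predicted frequency of answer k (must be > 0).
    """
    n, m = len(answers), len(predictions[0])
    # Endorsement frequencies (a sample mean stands in for the limit).
    x_bar = [sum(1 for a in answers if a == k) / n for k in range(m)]
    # Log of the geometric mean of predicted frequencies.
    log_y_bar = [sum(math.log(p[k]) for p in predictions) / n for k in range(m)]
    info = [
        math.log(x_bar[k]) - log_y_bar[k] if x_bar[k] > 0 else float("-inf")
        for k in range(m)
    ]
    return info, x_bar

def total_score(r, answers, predictions, alpha=1.0):
    """Eq. 2: information score of r's answer plus alpha times prediction score."""
    info, x_bar = information_scores(answers, predictions)
    prediction_score = sum(
        x_bar[k] * math.log(predictions[r][k] / x_bar[k])
        for k in range(len(x_bar))
        if x_bar[k] > 0  # answers nobody endorsed contribute nothing
    )
    return info[answers[r]] + alpha * prediction_score

# Fair-coin world: answer 0 = heads, 1 = tails; predictions are [heads, tails].
answers = [1, 1, 0, 0]
predictions = [[1/2, 1/2], [1/2, 1/2], [5/6, 1/6], [5/6, 1/6]]
info, x_bar = information_scores(answers, predictions)
totals = [total_score(r, answers, predictions) for r in range(len(answers))]
print(info)    # tails scores positive (surprisingly common), heads negative
print(totals)  # with alpha = 1 the totals sum to zero up to rounding
```

In this sample the tails answer receives a positive information score and heads a negative one, matching the verbal argument, and the total scores cancel when `alpha` is 1, consistent with the zero-sum property of Eq. 2.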

Note that the scoring system asks only for the expected distribution of true answers, $E\{\omega \mid t^r\}$, and not for the posterior distribution $p(\omega \mid t^r)$, which is an $m$-dimensional probability density function. Remarkably, one can infer which respondents assign more probability to the actual value of $\omega$ by means of a procedure that does not elicit these probabilities directly.

In previous economic research on incentive mechanisms, it has been standard to assume that the scorer (or the "center") knows the prior and posteriors and incorporates this knowledge into the scoring function (7–9, 25). In principle, any change in the prior, whether caused by a change in question wording, in the composition of the sample, or by new public information, would require a recalculation of the scoring functions. By contrast, my method employs a universal "one-size-fits-all" scoring equation, which makes no mention of prior or posterior probabilities. This has three benefits for practical application. First, questions do not need to be limited to some pretested set for which empirically estimated base rates and conditional probabilities are available; instead, one can use the full resources of natural language to tailor a new set of questions for each application. Second, it is possible to apply the same survey to different populations, or in a dynamic setting (which is relevant to political polling). Third, one can honestly instruct respondents to refrain from speculating about the answers of others while formulating their own answer. Truthful answers are optimal for any prior, and there are no posted probabilities for them to consider, and perhaps reject.

These are decisive advantages when it comes to scoring complex, unique questions. In particular, one can apply the method to elicit honest probabilistic judgments about the truth value of any clearly stated proposition, even if actual truth is beyond reach and no prior is available. For example, a recent book, Our Final Century, by a noted British astronomer, gives the chances of human survival beyond the year 2100 at no better than 50:50 (26). It is a provocative assessment, which will not be put to the test anytime soon. With the present method, one could take the question: "Is this our final century?" and submit it to a sample of experts, who would each provide a subjective probability and also estimate probability distributions over others' probabilities. T1 implies that honest reporting of subjective probabilities would maximize expected information score. Experts would face comparable truth-telling incentives as if they were betting on the actual outcome [e.g., as in a futures market (27)] and that outcome could be determined in time for scoring.

I illustrate this with a discrete computation, which assumes that probabilities are elicited at 1% precision by means of a 100-point multiple-choice question (in practice, one would have fewer categories and smooth out the empirical frequencies). The population vector $\omega = (\omega_{00}, .., \omega_{99})$ indexes the unknown distribution of such probabilities among experts. Given any prior, $p(\omega)$, it is a laborious but straightforward exercise to calculate expected information score as function of true personal probability and endorsed probability. Figure 1, lines A90 and B90, present the result of such calculations, with two different priors, $p_A(\omega)$ and $p_B(\omega)$, for experts who happen to agree that the probability of disaster striking before 2100 is 90%. The experts thus share the same assessment but have different theories about how their assessment is related to the assessment of others. Although lines A90 and B90 differ, the expected information score is in both cases maximized by a truthful endorsement of 90%. This confirms T1. In both cases, each expert believes that his subjective probability is pessimistic relative to the population: The expectation of others' probabilities, conditioned on a personal estimate of 90%, is only 65% with $p_A(\omega)$ and 54% with $p_B(\omega)$.

If the subjective probability shifts to 50%, the lines move to A50, B50, and the optimum, in both cases, relocates to 50%. Hence, the optimum automatically tracks changes in subjective belief, in this case the subjective probability of an unknown future event, but is invariant with respect to assumptions about how that belief is related to beliefs of other individuals. Changing these assumptions will simply lead back to the same recommendation: Truthfully report subjective probability.

Fig. 1. The expected information score is maximized by a truthful report of subjective belief in a proposition (i.e., "this is our final century"), irrespective of priors (A or B) or subjective probability values (50% or 90%). Line A90 gives expected score for different reported probabilities when true personal estimate of catastrophe is 90% and prior probability is 50%. It is optimal to report 90% even though that is expected to be an unusually pessimistic estimate. Changing the prior to 20% (line B90) increases expected scores but does not displace the optimum. Changing subjective probability to 50% shifts the optimum to 50% (A50 assumes a 50% prior, B50 a 20% prior). Standard proper scoring (expectation of Eq. 3, displayed as line PS90) also maximally rewards a truthful report (90%). However, proper scoring requires knowledge of the true outcome, which may remain moot until 2100.

Respondents are thus free to concentrate on their personal answer and need not worry about formulating an adequate prior. Any model of the prior is likely to be complex and involve strong assumptions. For example, in the calculations in Fig. 1, I assumed that experts' estimates are based on a private signal, distributed between zero and one, representing a personal assessment of the credibility of evidence supporting the bad outcome. The "credibility signal" is a valid but stochastic indicator of the true state of affairs: On the bad scenario, credibility signals are independent draws from a uniform distribution, so that some experts "get the message" and some do not; on the good scenario, they are independent draws from a triangular distribution, peaking at zero (no credibility) and declining linearly to one (full credibility). A prior probability of catastrophe then induces a monotonic mapping from credibility signals to posterior probabilities of catastrophe, as well as a prior over experts' probability estimates, $p(\omega)$.

Lines A and B differ in that the prior probability of catastrophe is presumed to be 50% for line A and 20% for line B. Expected scores are higher for B, because the 90% estimate is more surprising in that case.

One could question any of the assumptions of this model (28). However, changing the assumptions would not move the optimum, as long as the impersonally informative requirement is preserved. (The impersonally informative requirement means that two experts will estimate the same probability of catastrophe if and only if they share the same posterior distribution over other experts' probabilities). Thus, even though information scoring conditions success on the answers of other people, the respondent does not need to develop a theory of other people's answers; the most popular answer has no advantage of "winning," and the entire structure of mutual beliefs, as embodied in the prior, is irrelevant.
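
The mapping from credibility signals to posterior probabilities of catastrophe in the model above can be sketched with Bayes' rule. This is an interpretation of the verbal description, not the computation behind Fig. 1: it assumes the uniform distribution has density 1 on [0, 1], that the triangular distribution has density 2(1 - s), and the function name `posterior_catastrophe` is hypothetical.

```python
def posterior_catastrophe(signal: float, prior: float) -> float:
    """P(catastrophe | credibility signal) by Bayes' rule.

    Assumed signal densities (an interpretation of the verbal model):
    bad scenario:  uniform on [0, 1], f(s) = 1
    good scenario: triangular peaking at zero, f(s) = 2 * (1 - s)
    """
    likelihood_bad = 1.0
    likelihood_good = 2.0 * (1.0 - signal)
    return prior * likelihood_bad / (
        prior * likelihood_bad + (1.0 - prior) * likelihood_good
    )

signals = [i / 10 for i in range(11)]
line_a = [posterior_catastrophe(s, prior=0.5) for s in signals]  # 50% prior
line_b = [posterior_catastrophe(s, prior=0.2) for s in signals]  # 20% prior
for s, a, b in zip(signals, line_a, line_b):
    print(f"signal={s:.1f}  posterior(A)={a:.3f}  posterior(B)={b:.3f}")
```

Under these assumed densities the posterior rises monotonically with the signal for either prior, and with a 50% prior a signal of 17/18 corresponds to a 90% personal estimate.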

It is instructive to compare information scores with scores that would be computed if the scorer had a crystal ball and could score estimates for accuracy. The standard instrument for eliciting honest probabilities about publicly verifiable events is the logarithmic proper scoring rule (2, 4, 29). With the rule, an expert who announces a probability distribution $z = (z_1, .., z_n)$ over $n$ mutually exclusive events would receive a score of

$$K + \log z_i \tag{3}$$

if event $i$ is realized. For instance, an expert whose true subjective probability estimate that humanity will perish by 2100 is 90%, but who announced a possibly different probability $z$, would calculate an expected score of $0.9 \log z + 0.1 \log(1 - z)$, assuming, again, that there was some way to establish the true outcome. This expectation is maximized at the true value, $z = 0.90$, as shown by line PS90 in Fig. 1 (elevation is arbitrary). It is hard to distinguish proper scoring, which requires knowledge of the true outcome, from information scoring, which does not require such knowledge (30).

There are two generic ways in which the assumption of an impersonally informative prior might fail. First, a true answer might not be informative about population frequencies in the presence of public information about these frequencies (inducing a sharp prior). For instance, a person's gender would have minimal impact on their judgment of the proportion of men and women in the population. This would be a case of $t^r \ne t^s$ but $p(\omega \mid t^r) \approx p(\omega \mid t^s)$, and the difference between expected information scores for honest and deceptive answers would be virtually zero (though still positive). As shown below, the remedy is to combine the gender question with an opinion question that interacts with gender.

Second, respondents with different tastes or characteristics might choose the same answer for different reasons and hence form different posteriors. For example, someone with nonstandard political views might treat his or her liking for a candidate as evidence that most people will prefer someone else. This would be a case of: $p(\omega \mid t^r) \ne p(\omega \mid t^s)$ although $t^r = t^s$. Here, too, the remedy is to expand the questionnaire, allowing the person to reveal both the opinion and characteristic.

A last example, an art evaluation, illustrates both remedies. The example assumes existence of experts and laymen, and a binary state-of-nature: a question of whether a particular artist either does or does not represent an original talent. By hypothesis, art experts recognize this distinction quite well, but laymen discriminate poorly and, indeed, have a higher chance of enjoying a derivative artist than an original one. The fraction of experts is common knowledge, as are the other probabilities (Table 1).

In the short version of the survey, respondents only state their opinion; in the long version, they also report their expertise. Table 1 displays expected information scores for all possible answers, as a function of opinion and expertise. With the short version, truth-telling is optimal for experts but not for laymen, who do have a slight incentive to deceive if they happen to like the exhibition. With the long version, however, the diagonal, truth-telling entries have highest expected score. In particular, respondents will do better if they reveal their true expertise even though the distribution of expertise in the surveyed population is common knowledge.

Expected information scores in this and other examples reflect the amount of information associated with a particular opinion or characteristic. In Table 1, experts have a clear advantage even though they comprise a minority of the sample, because their opinion is more informative about population frequencies. In general, the expected information score for opinion $i$ equals the expected relative entropy between distribution $p(\omega \mid t_k, t_i)$ and $p(\omega \mid t_k)$, averaged over all $t_k$. In words, the expected score for $i$ is the information-theoretic measure of how much endorsing opinion $i$ shifts others' posterior beliefs about the population distribution. An expert endorsement will cause greater shift in beliefs, because it is more informative about the underlying variables that drive opinions for both segments (31). This measure of impact is quite insensitive to the size of the expert segment or to the direction of association between expert and nonexpert opinion.

By establishing truth-telling incentives, I do not suggest that people are deceitful or unwilling to provide information without explicit financial payoffs. The concern, rather, is that the absence of external criteria can promote self-deception and false confidence even among the well-intentioned. A futurist, or an art critic, can comfortably spend a lifetime making judgments without the reality checks that confront a doctor, scientist, or business investor. In the absence of reality checks, it is tempting to grant special status to the prevailing consensus. The benefit of explicit scoring is precisely to counteract informal pressures to agree (or perhaps to "stand out" and disagree). Indeed, the mere existence of a truth-inducing scoring system provides methodological reassurance for social science, showing that subjective data can, if needed, be elicited by means of a process that is neither faith-based ("all answers are equally good") nor biased against the exceptional view.

Table 1. An incomplete question can create incentives for misrepresentation. The first pair of columns gives the conditional probabilities of liking the exhibition as function of originality (so that, for example, experts have a 70% chance of liking an original artist). It is common knowledge that 25% of the sample are experts, and that the prior probability of an original exhibition is 25%. The remaining columns display expected information scores. Answers with highest expected information score are shown by bold numbers. Truth-telling is optimal in the long version but not in the short version of the survey.

            Probability of opinion   Expected score
            conditional on quality
            of exhibition            Long version                             Short version
                                     Expert claim        Layman claim
Opinion     Original   Derivative    Like      Dislike   Like      Dislike    Like      Dislike
Expert
  Like      70%        10%           +575      -776      -462      +67        +191      -57
  Dislike   30%        90%           -934      +95       +84       -24        -86       +18
Layman
  Like      10%        20%           -826      +32       +45       -18        -66       +12
  Dislike   90%        80%           -499      -156      -73       +2         -6        -4

References and Notes
1. C. F. Turner, E. Martin, Eds., Surveying Subjective Phenomena (Russell Sage Foundation, New York, 1984), vols. I and II.
2. R. M. Cooke, Experts in Uncertainty (Oxford Univ. Press, New York, 1991).
3. A formalized scoring rule has diverse uses: training, as in psychophysical experiments (32); communicating desired performance (33); enhancing motivation and effort (34); encouraging advance preparation, as in educational testing; attracting a larger and more representative pool of respondents; diagnosing suboptimal judgments (4); and identifying superior respondents.
4. R. Winkler, J. Am. Stat. Assoc. 64, 1073 (1969).
5. W. H. Batchelder, A. K. Romney, Psychometrika 53, 71 (1988).
6. In particular, this precludes the application of a futures market (27) or a proper scoring rule (29).
7. C. d'Aspremont, L.-A. Gerard-Varet, J. Public Econ. 11, 25 (1979).
8. S. J. Johnson, J. Pratt, R. J. Zeckhauser, Econometrica 58, 873 (1990).
9. P. McAfee, P. Reny, Econometrica 60, 395 (1992).
10. H. A. Linstone, M. Turoff, The Delphi Method: Techniques and Applications (Addison-Wesley, Reading, MA, 1975).
11. R. M. Dawes, in Insights in Decision Making, R. Hogarth, Ed. (Univ. of Chicago Press, Chicago, IL, 1990), pp. 179–199.
12. S. J. Hoch, J. Pers. Soc. Psychol. 53, 221 (1987).
13. It is one of the most robust findings in experimental psychology that participants' self-reported characteristics (behavioral intentions, preferences, and beliefs) are positively correlated with their estimates of the relative frequency of these characteristics (35). The psychological literature initially regarded this as an egocentric error of judgment (a "false consensus") (36) and did not consider the Bayesian
explanation, as was pointed out by Dawes (11, 14). There is still some dispute over whether the relationship is entirely consistent with Bayesian updating (37).
14. R. M. Dawes, J. Exp. Soc. Psychol. 25, 1 (1989).
15. D. Fudenberg, J. Tirole, Game Theory (MIT Press, Cambridge, MA, 2000).
16. With finite players, the truth-telling result holds provided that the number of players exceeds some finite n, which in turn depends on $p(\omega)$.
17. J. M. Bernardo, A. F. M. Smith, Bayesian Theory, Wiley Series in Probability and Statistics (Wiley, New York, 2000).
18. R. J. Aumann, Econometrica 55, 1 (1987).
19. More precisely, I assume in the SOM text that for any finite subset of respondents, there is a common and exchangeable prior over their opinions (hence, the prior is invariant under permutation of respondents). By de Finetti's representation theorem (25), this implies the existence of a probability distribution, $p(\omega)$, such that opinions are independent conditional on $\omega$. Conditional independence ensures $t^r = t^s \Rightarrow p(\omega \mid t^r) = p(\omega \mid t^s)$. The reverse implication (i.e., that different opinions imply different posteriors) is also called stochastic relevance (8).
20. The finite n-player scoring formula ($n \ge 3$), for respondent r, is
\[
\sum_{s \ne r} \sum_k x_k^r \log \frac{\bar{x}_k^{-rs}}{\bar{y}_k^{-rs}} + \alpha \sum_{s \ne r} \sum_k \bar{x}_k^{-rs} \log \frac{y_k^r}{\bar{x}_k^{-rs}}
\]
where $\bar{x}_k^{-rs} = \bigl(\sum_{q \ne r,s} x_k^q + 1\bigr)/(n + m - 2)$ and $\log \bar{y}_k^{-rs} = \sum_{q \ne r,s} \log y_k^q/(n - 2)$. The score for r is built up from pairwise comparisons of r against all other respondents s, excluding from the pairwise calculations the answers and predictions of respondents r and s. To prevent infinite scores associated with zero frequencies, I replace the empirical frequencies with Laplace estimates derived from these frequencies. This is equivalent to "seeding" the empirical sample with one extra answer for each possible choice. Any distortion in incentives can be made arbitrarily small by increasing the number of respondents, n. The scoring is zero-sum when $\alpha = 1$.
21. T. M. Cover, J. A. Thomas, Elements of Information Theory (Wiley, New York, 1991).
22. S. Kullback, Information Theory and Statistics (Wiley, New York, 1954).
23. The key step in the proof involves calculation of expected information score for someone with personal opinion i but endorsing a possibly different answer j,
\[
E\Bigl\{\log \frac{\bar{x}_j}{\bar{y}_j} \Bigm| t_i\Bigr\} = \int_\Omega p(\omega \mid t_i)\, E\Bigl\{\log \frac{\bar{x}_j}{\bar{y}_j} \Bigm| \omega\Bigr\}\, d\omega \quad \text{(a)}
\]
\[
= \int_\Omega p(\omega \mid t_i) \sum_{k=1}^{m} \omega_k \log \frac{\omega_j}{p(t_j \mid t_k)}\, d\omega \quad \text{(b)}
\]
\[
= \sum_{k=1}^{m} p(t_k \mid t_i) \int_\Omega p(\omega \mid t_k, t_i) \log \frac{p(t_j \mid \omega)\, p(t_k \mid t_j, \omega)}{p(t_j \mid t_k)\, p(t_k \mid \omega)}\, d\omega \quad \text{(c)}
\]
\[
= \sum_{k=1}^{m} p(t_k \mid t_i) \int_\Omega p(\omega \mid t_k, t_i) \log \frac{p(\omega \mid t_k, t_j)}{p(\omega \mid t_k)}\, d\omega. \quad \text{(d)}
\]
Once we reach (d), we can use the fact that the integral,
\[
\int_\Omega p(\omega \mid t_k, t_i) \log p(\omega \mid t_k, t_j)\, d\omega,
\]
is maximized when $p(\omega \mid t_k, t_j) = p(\omega \mid t_k, t_i)$, to conclude that a truthful answer, i, will have higher expected information score than any other answer j. To derive (d), we first compute expected information score (a) with respect to the posterior distribution, $p(\omega \mid t_i)$, and use the assumption that others are responding truthfully to derive (b). For an infinite sample, truthful answers imply: $\bar{x}_j = \omega_j$, and truthful predictions: $\log \bar{y}_j = \sum_k \omega_k \log p(t_j \mid t_k)$, because the fraction $\omega_k$ of respondents who draw k will predict $p(t_j \mid t_k)$ for answer j. To derive (c) from (b), we apply conditional independence to write $\omega_k\, p(\omega \mid t_i)$ as $p(t_k \mid t_i)\, p(\omega \mid t_k, t_i)$, $\omega_j$ as $p(t_j \mid \omega)$, and 1 as $p(t_k \mid t_j, \omega)/p(t_k \mid \omega)$, which is inserted into the fraction. (d) follows from (c) by Bayes' rule.
24. The scoring system is not easy to circumvent by collective collusion, because if everyone agrees to give the same response then that response will no longer be surprisingly common, and will receive a zero information score. The prediction scores will also be zero in that case.
25. J. Cremer, R. P. McLean, Econometrica 56, 1247 (1988).
26. M. Rees, Our Final Century (Heinemann, London, 2003).
27. J. Berg, R. Forsythe, T. A. Rietz, in Understanding Strategic Interaction: Essays in the Honor of Reinhard Selten, W. Albers, W. Guth, B. Hammerstein, B. Moldovanu, E. van Damme, Eds. (Springer, New York, 1997), pp. 441–463.
28. Certainly, the assumption of independent credibility signals is unrealistic in that it implies that expert opinion can in aggregate predict the true outcome perfectly; a more realistic model would have to interpose some uncertainty between the outcome and the totality of expert opinion.
29. L. J. Savage, J. Am. Stat. Assoc. 66, 783 (1971).
30. Information scoring is nonmetric, and the 100 probability levels are treated as simply 100 distinct response categories. The smooth lines in Fig. 1 reflect smooth underlying priors, $p_A(\omega)$ and $p_B(\omega)$. Unlike proper scoring, information scoring could be applied to verbal expressions of probability ("likely," "impossible," etc.).
31. Precisely, if the opinions of one type of respondent are a statistical "garbling" of the opinions of a second type, then the first type will receive a lower score in the truth-telling equilibrium. Garbling means that the more informed individual could replicate the statistical properties of the signal received by the less informed individual, simply by applying a randomization device to his own signal (38).
32. D. M. Green, J. A. Swets, Signal Detection Theory and Psychophysics (Peninsula Publishing, Los Altos, CA, 1989).
33. W. Edwards, Psychol. Rev. 68, 275 (1961).
34. C. F. Camerer, R. Hogarth, J. Risk Uncert. 18, 7 (1999).
35. G. Marks, N. Miller, Psychol. Bull. 102, 72 (1987).
36. L. Ross, D. Greene, P. House, J. Exp. Soc. Psychol. 13, 279 (1977).
37. J. Krueger, R. W. Clement, J. Pers. Soc. Psychol. 67, 596 (1994).
38. D. Blackwell, Ann. Math. Stat. 24, 265 (1953).
39. I thank D. Mijovic-Prelec, S. Frederick, D. Fudenberg, J. R. Hauser, M. Kearns, E. Kugelberg, R. D. Luce, D. McAdams, S. Seung, R. Weaver, and B. Wernerfelt for comments and criticism. I acknowledge early support for this research direction by Harvard Society of Fellows, MIT E-Business Center, and the MIT Center for Innovation in Product Development.

Supporting Online Material
www.sciencemag.org/cgi/content/full/306/5695/462/DC1
SOM Text

28 June 2004; accepted 15 September 2004
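
The finite n-player scoring formula in note 20 may be easier to follow as code. The following is a minimal sketch, not the author's implementation. It assumes that each respondent endorses exactly one of m answers (so $x_k^r$ is a 0/1 indicator) and that every prediction $y_k^r$ is strictly positive. The function name `bts_scores` and its argument names are hypothetical.

```python
import math


def bts_scores(answers, predictions, alpha=1.0):
    """Sketch of the finite n-player score described in note 20.

    answers[r]      -- index (0..m-1) of the answer endorsed by respondent r
    predictions[r]  -- list of m strictly positive predicted frequencies
    alpha           -- weight on the prediction score
    """
    n = len(answers)
    m = len(predictions[0])
    if n < 3:
        raise ValueError("the formula requires at least 3 respondents")

    scores = [0.0] * n
    for r in range(n):
        for s in range(n):
            if s == r:
                continue
            # All respondents except the pair (r, s).
            others = [q for q in range(n) if q != r and q != s]

            # Laplace estimate of answer frequencies: one extra "seed"
            # answer per choice, hence the denominator n + m - 2.
            x_bar = [
                (sum(1 for q in others if answers[q] == k) + 1) / (n + m - 2)
                for k in range(m)
            ]
            # Log of the geometric mean of the others' predictions.
            log_y_bar = [
                sum(math.log(predictions[q][k]) for q in others) / (n - 2)
                for k in range(m)
            ]

            # Information score: only the endorsed answer contributes,
            # because x_k^r is 1 for that answer and 0 otherwise.
            k_r = answers[r]
            info = math.log(x_bar[k_r]) - log_y_bar[k_r]

            # Prediction score: how well r's predictions match x_bar.
            pred = sum(
                x_bar[k] * math.log(predictions[r][k] / x_bar[k])
                for k in range(m)
            )
            scores[r] += info + alpha * pred
    return scores
```

For example, `bts_scores([0, 1, 0, 0], [[0.6, 0.4], [0.3, 0.7], [0.5, 0.5], [0.8, 0.2]])` returns one score per respondent. The two inner quantities are recomputed for every pair (r, s) to mirror the pairwise exclusion in the formula; a production version would likely cache the sums.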

Single-Atom Spin-Flip Spectroscopy

A. J. Heinrich,* J. A. Gupta, C. P. Lutz, D. M. Eigler

We demonstrate the ability to measure the energy required to flip the spin of single adsorbed atoms. A low-temperature, high–magnetic field scanning tunneling microscope was used to measure the spin excitation spectra of individual manganese atoms adsorbed on Al2O3 islands on a NiAl surface. We find pronounced variations of the spin-flip spectra for manganese atoms in different local environments.

The magnetic properties of nanometer-scale structures are of fundamental interest and may play a role in future technologies, including classical and quantum computation. Such magnetic structures are composed of magnetic atoms in precise arrangements. The magnetic properties of each atom are profoundly influenced by its local environment. Magnetic properties of atoms in a solid can be probed by placing the atoms in tunnel junctions. Early experiments with planar metal-oxide-metal tunnel junctions doped with paramagnetic impurities exhibited surprisingly complex conductance spectra described as "zero-bias anomalies" (1–4). Such anomalies were shown to reflect both spin-flips driven by inelastic electron scattering and Kondo interactions of magnetic impurities with tunneling electrons (5–7). Single, albeit unknown, magnetic impurities were later studied in nanoscopic tunnel junctions (8, 9). Recently, magnetic properties of single-molecule transistors that incorporated either one or two magnetic atoms were probed by means of their elastic conductance spectra (10, 11). These measurements determined g values and showed field-split Kondo resonances due to the embedded magnetic atoms.

The scanning tunneling microscope (STM) offers the ability to study single magnetic moments in a precisely characterized local environment and to probe the variations in magnetic properties with atomic-scale spatial resolution. Previous STM studies of atomic-scale magnetism include Kondo resonances of magnetic atoms on metal surfaces (12, 13), increased noise at the Larmor frequency (14, 15),

IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, CA 95120, USA.
*To whom correspondence should be addressed. E-mail: heinrich@almaden.ibm.com

