Professional Documents
Culture Documents
Fact-Free Learning
Fact-Free Learning
People may be surprised to notice certain regularities that hold in existing knowl-
edge they have had for some time. That is, they may learn without getting new
factual information. We argue that this can be partly explained by computational
complexity. We show that, given a knowledge base, finding a small set of variables
that obtain a certain value of R2 is computationally hard, in the sense that this term
is used in computer science. We discuss some of the implications of this result and
of fact-free learning in general. (JEL D83)
The process of induction is the process of regularities in that information. Many theoreti-
assuming the simplest law that can be cal models of learning focus on learning new
made to harmonize with our experience. facts, on their integration in an existing knowl-
Ludwig Wittgenstein (1922) edge base, and on the way they modify beliefs.
Within the Bayesian framework, the integration
Understanding one’s social environment re- of new facts and the modification of beliefs is
quires accumulating information and finding done mechanically according to Bayes’s rule.
Much of human learning, however, has to do
with making observations and finding regulari-
* Aragones: Institut d’Analisi Economica, CSIC, Cam-
ties that, in principle, could have been deter-
pus UAB, 08193 Bellaterra, Spain 08193 (e-mail: enriqueta. mined using existing knowledge, rather than
aragones@uab.es); Gilboa: School of Economics, Tel Aviv with the acquisition of new facts.
University, Tel Aviv, Israel 69978 and Cowles Foundation, Consider technological innovations. In many
Yale University (e-mail: igilboa@tau.ac.il); Postlewaite: cases, the main idea of an innovation involves
Department of Economics, University of Pennsylvania,
3718 Locust Walk, Philadelphia, PA 19104 (e-mail: combining well-known facts. For instance, put-
apostlew@econ.sas.upenn.edu); Schmeidler: Department of ting wheels at the bottom of a suitcase allows it
Statistics, Tel Aviv University, Ramat Aviv, Tel Aviv, to roll easily. This idea was quite original when
Israel 69978 and The Ohio State University (e-mail: it was first introduced. But, since it only se-
schmeid@tau.ac.il.). Earlier versions of this paper circu-
lated under the titles “From Cases to Rules: Induction and
lected and combined facts that everyone had
Regression” and “Accuracy versus Simplicity: A Complex already known, it appears obvious in hindsight.
Trade-Off.” We have benefited greatly from comments and It takes originality to come up with such an
references by the editor, an anonymous referee, Yoav Ben- idea, but no particular expertise is needed to
jamini, Joe Halpern, Offer Lieberman, Bart Lipman, Yishay judge its value. This phenomenon is so perva-
Mansour, Nimrod Megiddo, Dov Samet, Petra Todd, and
Ken Wolpin, as well as the participants of the SITE con- sive that it has been canonized in literature:
ference on Behavioral Economics at Stanford, August 2003, Sherlock Holmes regularly explains how the
and the Cowles Foundation workshop on Complexity in combination of a variety of clues leads inexo-
Economic Theory at Yale, September 2003. Aragones ac- rably to a particular conclusion, following
knowledges financial support from the Spanish Ministry of
Science and Technology, Grant SEC2003-01961, and CREA-
which Watson exclaims, “Of course!”
Barcelona Economics. Aragones, Gilboa, and Schmeidler To consider an even more extreme case, as-
gratefully acknowledge support from the Polarization and sume that an individual follows a mathematical
Conflict Project CIT-2-CT-2004-506084 funded by the Eu- proof of a theorem. In order to check the proof,
ropean Commission-DG Research Sixth Framework Pro- one need not resort to the knowledge of facts.
gramme. Gilboa and Schmeidler gratefully acknowledge
support from the Israel Science Foundation (Grant 790/00 The knowledge that the agent acquires in the
and 975/03). Postlewaite gratefully acknowledges support process has always been, in principle, available
from National Science Foundation Grant SBR-9810693. to her. Yet, mathematics has to be studied. In
1355
1356 THE AMERICAN ECONOMIC REVIEW DECEMBER 2005
fact, it is an entire discipline based solely on she had been unaware: democratic countries
fact-free learning. have seldom waged war on each other.2
In this paper we focus on a particular type of It is likely that Ann failed to notice that the
fact-free learning. We consider an agent who democratic peace phenomenon holds in her own
has access to a database, involving many vari- knowledge base simply because it had not oc-
ables and many observations. The agent at- curred to her to categorize wars by the type of
tempts to find regularities in the database. We regime of the countries involved. For most peo-
model this learning problem and explain the ple, wars are categorized, or “indexed,” by
difficulty in solving it optimally.1 chronology and geography, but not by regime.
The immediate consequence of this difficulty Once the variable “type of regime” is intro-
is that individuals typically will not discover all duced, Ann will be able to reorganize her
the regularities in their knowledge base, and knowledge base and observe the regularity she
may overlook the most useful regularities. Two had failed to notice earlier.
people with the same knowledge base may no- Fact-free learning is not always due to the
tice different regularities, and may consequently introduction of a new variable, or a categoriza-
hold different views about a particular issue. tion that the individual has not been aware of.
One person may change the beliefs and actions Often, one may be aware of all variables in-
of another without communicating new facts, volved, and yet fail to see a regularity that
but simply by pointing to a regularity over- involves a combination of such variables. Con-
looked by the other person. On the other hand, sider an econometrician who wants to under-
people may agree to disagree even if they have stand the determinants of the rate of economic
the same knowledge base and are communicat- growth. She has access to a large database of
ing. We elaborate on these consequences in realized growth rates for particular economies
Section III. that includes a plethora of variables describing
For illustration, consider the following example. these economies in detail.3 Assume that the
econometrician prefers fewer explanatory vari-
Ann: “Russia is a dangerous country.” ables to more. Her main difficulty is to deter-
Bob: “Nonsense.” mine what set of variables to use in her
Ann: “Don’t you think that Russia might initiate regression.
a war against a Western country?” We can formalize her problem as determining
Bob: “Not a chance.” whether there exists a set of k regressors that
Ann: “Well, I believe it very well might.” gives a particular level of R2. This is a well-
Bob: “Can you come up with examples of defined problem that can be relegated to com-
wars that erupted between two democratic puter software. However, testing all subsets of k
countries?” regressors out of, say, m variables involves run-
ning (m k ) ⫽ O(m ) regressions. When m and k
k
Ann: “I guess so. Let me see ... How about
England and the United States in 1812?” are of realistic magnitude, it is impractical to
Bob: “OK, save colonial wars.” perform this exhaustive search. For instance,
Ann: “Well, then, let’s see. OK, maybe you choosing the best set of k ⫽ 13 regressors out of
have a point. Perhaps Russia is not so m ⫽ 100 potentially relevant variables involves
dangerous.”
13 ) ⬇ 7 ⴱ 10
(100 15
regressions. On a computer that some idea about which variables may be con-
can perform 10 million regression analyses per ducive to growth. She therefore need not ex-
second, this task would take more than twenty- haust all subsets of k regressors in her quest for
two years. the “best” regression. Our model does not cap-
Linear regression is a structured and rela- ture the development of and selection among
tively well-understood problem, and one may causal theories, but even the set of variables
hope that, using clever algorithms that employ potentially relevant to our econometrician’s the-
statistical analysis, the best set of k regressors ory is typically large enough to raise computa-
can be found without actually testing all (m k) tional difficulties. More importantly, if the
subsets. Our main result is that this is not the econometrician wants to test her scientific par-
case. Formally, we prove that finding whether k adigm, and if she wants to guarantee that she is
regressors can obtain a prespecified value of not missing some important regularities that lie
R2, r, is, in the language of computer science, outside her paradigm, she cannot restrict atten-
NP-Complete.4 Moreover, we show that this tion to the regressors she has already focused
problem is hard (NP-Complete) for every posi- on.
tive value of r. Thus, our regression problem While computational complexity is not the
belongs to a large family of combinatorial prob- only reason why individuals may be surprised to
lems for which no efficient (polynomial) algo- discover regularities in their own knowledge
rithm is known. An implication of this result is bases, it is one of the reasons that knowledge of
that, even for moderate size datasets, it will facts does not imply knowledge of all their
generally be impossible to know the trade-off implications. Hence, computational complexity,
between increasing the number of regressors alongside unawareness, makes fact-free learn-
and increasing the explanatory power of those ing a common phenomenon.
regressors.5 In Section I, we lay out our model and dis-
Our interest is not in the problem econome- cuss several notions of regularities and the cri-
tricians face, but in the problems encountered teria to choose among them. The difficulty of
by nonspecialists attempting to understand their discovering satisfactory sets of regressors is
environment. That is, we wish to model the proven in Section II. In the last section, we
reasoning of standard economic agents, rather discuss the results, their implications, and re-
than of economists analyzing data. We contend, lated literature.
however, that a problem that is difficult to solve
for a working economist will also be difficult
for an economic agent. If an econometrician I. Regularities in a Knowledge Base
cannot be guaranteed to find the “best” set of
regressors, many economic agents may also fail An individual’s knowledge base consists of
to identify important relationships in their per- her observations and past experiences, as well
sonal knowledge base.6 as observations related to her by others. We will
Neither economic agents nor social scientists assume that observations are represented as vec-
typically look for the best set of regressors tors of numbers. An entry in the vector might be
without any guiding principle. Rather than en- the value of a certain numerical variable, or a
gaging in data mining, they espouse and de- measure of the degree to which the observation
velop various theories that guide their search for has a particular attribute. Thus, we model the
regularities. Our econometrician will often have information available to an individual as a
knowledge base consisting of a matrix of num-
bers where rows correspond to observations
4
In Section II, we explain the concept of NP-Complete- (distinct pieces of information) and columns to
ness and provide references to formal definitions.
5
attributes.
In particular, principle components analysis, which We show below a fraction of a conceivable
finds a set of orthogonal components, is not guaranteed to
find the best combination of predictors (with unconstrained knowledge base pertinent to the democratic
correlations). peace example. The value in a given entry rep-
6
We discuss this further in Section III. resents the degree to which the attribute
1358 THE AMERICAN ECONOMIC REVIEW DECEMBER 2005
past instances in which women with breast can- simple as possible. Of course, these properties
cer were given different treatments. Each obser- present one with nontrivial trade-offs. In the
vation would consist of the treatment, a number next section, we discuss functional rules for a
of diagnostic tests such as blood chemistry, given knowledge base. We will show that the
location of the tumor, size of the tumor in feasible set in the accuracy-simplicity space
X-rays, etc., and the degree to which the treat- cannot be easily computed. A similar result can
ment was successful. These observations would be shown for association rules. We choose to
be used to determine a linear relationship be- focus on linear regression for two reasons. First,
tween the diagnostic tests and the degree of in economics it is a more common technique for
success for each treatment. The resulting rela- uncovering rules. Second, our main result is less
tionship is then used to predict the success of straightforward in the case of linear regression.
future cases.
When faced with a problem such as this, a II. The Complexity of Linear Regression
scientist need not automatically prefer fewer
explanatory variables to more. The literature in In this section, we study the trade-off be-
statistics and machine learning provides criteria tween simplicity and accuracy of functional
for “model selection,” and, in particular, for the rules in the case of linear regression. While
inclusion of explanatory variables, in such a regression analysis is a basic tool of scientific
way as to avoid spurious correlations and “over- research, here we view it as an idealized model
fitting.” Our interest, however, is not in the way of nonprofessional human reasoning.10 For a
a scientist or an econometrician would use a given variable, one attempts to find those vari-
database to predict future outcomes, but rather ables that predict the variable of interest. A
in the way an ordinary person might find rela- common measure of amount of variation in the
tionships in his or her personal knowledge base. variable of interest that is explained by the
We maintain that, other things being equal, peo- predicting variables is the coefficient of deter-
ple tend to have more faith in the robustness of mination, R2. A reasonable measure of com-
relationships that use fewer variables than in plexity is the number of explanatory variables
those that use more. That is, we suggest that the one uses. The “adjusted R2” is frequently used
preference for parsimony and simplicity, as as a measure of the quality of a regression,
measured by the number of variables employed, trading off accuracy and simplicity. Adjusted R2
is a natural tendency of the human mind. essentially levies a multiplicative penalty for
Individuals may prefer fewer explanatory additional variables to offset the spurious in-
variables because of availability of data. Having crease in R2 which results from an increase in
a rule that involves more variables implies that the number of predicting variables. In recent
more variables need to be gathered and main- years statisticians and econometricians for the
tained in order to use it. Importantly, it also most part have used additive penalty functions
makes it less likely that all the variables needed in model specification (choosing the predicting
for the application of the rule will indeed be variables) for a regression problem.11 The dif-
available in a related problem. ferent penalties are associated with different
When fewer variables are involved, people criteria determining the trade-off between par-
will find it easier to make up explanations for a simony and precision. Each penalty function
regularity in the data. This may be another can be viewed as defining preferences over the
reason for the preference for fewer variables. Be number of included variables and R2, reflecting
that as it may, the (normative) claim that people the trade-off between simplicity and accuracy.
should prefer simpler theories to more complex Rather than choose a specific penalty function,
ones goes back to William of Occam, and the
(descriptive) claim that this is how the human 10
mind works can also be found in Wittgenstein See Margaret Bray and Nathan Savin (1986), who
used regression analysis to model the learning of economic
(1922). agents.
In this paper, we assume that people gener- 11
See, e.g., Trevor Hastie et al. (2001) for a discussion
ally prefer rules that are as accurate and as of model specification and penalty functions.
1360 THE AMERICAN ECONOMIC REVIEW DECEMBER 2005
we assume more generally that an individual How does one select a set of explanatory
can be ascribed a function u : ⺢⫹ ⫻ [0, 1] 3 ⺢, variables? First, consider the feasible set of
which represents her preferences for simplicity rules projected onto the accuracy-complexity
and accuracy, where u(k, r) is her utility for a space. For a set of explanatory variables K, let
regression that attains R2 ⫽ r with k explanatory the degree of complexity be k ⫽ 兩K兩 and a
variables. Thus, if u(䡠 , 䡠) is decreasing in its first degree of accuracy ⫺ r ⫽ RK 2
. Consider the k-r
argument and increasing in the second, a person space and, for a given knowledge base X ⫽
who chooses a rule so as to maximize u may be (X1, ... , Xm) and a variable Y, denote by F(X, Y)
viewed as though she prefers both simplicity the set of pairs (k, r) for which there exists a rule
and accuracy, and trades them off as described with these parameters. Because the set F(X) is
by u. defined only for integer values of k, and for
Our aim is to demonstrate that finding “good” certain values of r, it is more convenient to
rules is a difficult computational task. We use visualize its comprehensive closure defined by:
the concept of NP-Completeness from com-
puter science to formalize the notion of diffi-
F⬘共X, Y兲 ⬅ 兵共k, r兲 僆 ⺢⫹
culty of solving problems. A yes/no problem is
NP if it is easy (can be performed in polynomial
⫻ 关0, 1兴兩 ? 共k⬘, r⬘兲 僆 F共X, Y兲, k ⱖ k⬘, r ⱕ r⬘其.
worst-case time complexity) to verify that a
suggested solution is indeed a solution to it. If
an NP problem is also NP-Complete, then there The set F⬘(X, Y) is schematically illustrated in
is at present no known algorithm, whose (worst- Figure 1. Note that it need not be convex.
case time) complexity is polynomial, that can The optimization problem that such a person
solve it. NP-Completeness, however, means with utility function u(䡠 , 䡠) faces is depicted in
more than that there is no such known algo- Figure 2.
rithm. The nonexistence of such an algorithm is This optimization problem is hard to solve,
not due to the fact that the problem is new or because one generally cannot know its feasible
unstudied. For NP-Complete problems, it is set. In fact, for every r ⬎ 0, given X, Y, k,
known that if a polynomial algorithm were determining whether (k, r) 僆 F⬘(X, Y) is com-
found for one of them, such an algorithm could putationally hard:
be translated into polynomial algorithms for all
other problems in NP. Thus, a problem that is
NP-Complete is at least as hard as many prob- THEOREM 1: For every r 僆 (0, 1], the fol-
lems that have been extensively studied for lowing problem is NP-Complete: Given explan-
years and for which no polynomial algorithm atory variables X ⫽ (X1, ... , Xm), a variable Y,
has yet been found. and an integer k ⱖ 1, is there a subset K of
We emphasize again that the rules we discuss {X1, ... , Xm} such that 兩K兩 ⱕ k and RK 2
ⱖ r?
do not necessarily offer complete theories, iden-
tify causal relationships, provide predictions, or Theorem 1 explains why people may be sur-
suggest courses of action. Rules are regularities prised to learn of simple regularities that exist in
that hold in a given knowledge base, and they a knowledge base they have access to. A person
may be purely coincidental. Rules may be as- who has access to the data should, in principle,
sociated with theories, but we do not purport to be able to assess the veracity of all linear theo-
model the process of developing and choosing ries pertaining to these data. Yet, due to com-
among theories. putational complexity, this capability remains
Assume that we are trying to predict a vari- theoretical. In practice, one may often find that
able Y given the explanatory variables X ⫽ one has overlooked a simple linear regularity
(X1, ... , Xm). For a subset K of {X1, ... , Xm}, let that, once pointed out, seems evident.
2
RK be the value of the coefficient of determina- We show that, for any positive value of r, it
tion R2 when we regress ( yi)i ⱕ n on (xij)i ⱕ n, j 僆 K. is hard to determine whether a given k is in the
We assume that the data are given in their r-cut of F⬘(X, Y) when the input is (X, Y, k). By
entirety, that is, that there are no missing values. contrast, for a given k, computing the k-cut of
VOL. 95 NO. 5 ARAGONES ET AL.: FACT-FREE LEARNING 1361
FIGURE 1
FIGURE 2
F⬘(X, Y) is a polynomial problem (when the ily the simplest (for a given degree of accuracy)
input is (X, Y, r)), bounded by a polynomial of or the most accurate (for a given degree of
degree k. Recall, however, that k is bounded complexity).
only by the number of columns in X. Moreover, We conclude this section with the observa-
even if k is small, a polynomial of degree k may tion that one may prove theorems similar to
assume large values if m is large. We conclude Theorem 1, which would make explicit refer-
that, in general, finding the frontier of the set ence to a certain function u(k, r). The following
F⬘(X, Y), as a function of X and Y, is a hard is an example of such a theorem.
problem. The optimization problem depicted in
Figure 2 has a fuzzy feasible set, as described in THEOREM 2: For every r 僆 (0, 1], the fol-
Figure 3. lowing problem is NP-Complete: given explan-
A decision maker may choose a functional atory variables X ⫽ (X1, ... , Xm) and a
rule that maximizes u(k, r) out of all the rules variable Y, is there a subset K of {X1, ... , Xm}
she is aware of, but the latter are likely to that obtains an adjusted R2 of at least r?
constitute only a subset of the set of rules de-
fining the actual set F⬘(X, Y). Hence, many of As is clear from the proof of Theorem 2, this
the rules that people formulate are not necessar- result does not depend on the specific measure
1362 THE AMERICAN ECONOMIC REVIEW DECEMBER 2005
FIGURE 3
of the accuracy-simplicity trade-off, and similar whose union equals S? This is the yes/no prob-
results can be proven for a variety of functions lem we have used in the proof of Theorem 1.13
u(k, r).12 To define the notion of approximation, one de-
fines an optimization problem that corresponds
III. Discussion to the yes/no problem. For instance, the Mini-
mum Exact Cover problem can be viewed as
A. Approximation corresponding to the following optimization
problem: minimize the sum of the cardinalities
We posed a particular question— does there of the sets in a collection that covers S. If the
exist a set of k explanatory variables for which solution is the cardinality of S, an exact cover
the adjusted R2 is at least r?—and showed that has been identified.
it is NP-Complete. We argue that an implication How good an approximation can one get to
of the result is that people will generally not the problem—minimize the sum of the cardi-
know the regularities that exist in their knowl- nalities of the sets that cover S—with an algo-
edge base. But it is possible that, while it may rithm that is polynomial in the size of the
be extremely difficult to get an exact answer to problem? Suppose, for example, that one
the question “What is the maximum R2 possible wanted an algorithm that had the property that,
with k variables?,” it may be dramatically easier for all problems in this class, if the minimum
to obtain a very good approximation to such a possible sum for the problem were n, the algo-
question. If there are fast heuristics that do rithm would find a set of subsets with total
reasonably well on the regression problem, the cardinality n for some ⬎ 1. ( might be
scope of fact-free learning may be quite limited. thought of as the accuracy of the approxima-
However, it is generally not the case that tion.) It is known that there does not exist such
NP-Complete problems admit polynomial ap- a polynomial algorithm, no matter how large
proximations. Consider, for instance, the NP- is, unless P ⫽ NP (Carston Lund and Mihalis
Complete problem Minimum Exact Cover, Yannakakis, 1994; Ran Raz and Shmuel Safra,
which can be described as follows. Given a set 1997). In other words, finding an algorithm that
S and a set of subsets of S, ᑭ, is there a assures any degree of reliability for large prob-
collection of pairwise disjoint subsets of S in ᑭ
13
That is, our proof consists of showing that any in-
12
There are, however, functions v for which the result stance of the Minimum Exact Cover problem can be trans-
does not hold. For example, consider v(k, r) ⫽ min(r, 2 ⫺ lated, via a polynomial algorithm, to an instance of the
k). This function obtains its maximum at k ⫽ 1 and it is problem defined in Theorem 1, such that the answer to the
therefore easy to maximize it. latter is “yes” iff “so” is the answer to the former.
VOL. 95 NO. 5 ARAGONES ET AL.: FACT-FREE LEARNING 1363
in a Bayesian model with perfect rationality, to find the “optimal” rules that a given knowl-
people cannot change their opinions unless new edge base induces.
information has been received. It follows that
the example we started out with cannot be ex-
plained by such models. APPENDIX: PROOFS
Relaxing the perfect rationality assumption,
one may attempt to provide a pseudo-Bayesian
account of the phenomena discussed here. For PROOF OF THEOREM 1:
instance, one can use a space of states of the Let there be given r ⬎ 0. It is easy to see that
world to describe the subjective uncertainty that the problem is in NP: given a suggested set K 傺
2
a decision maker has regarding the result of a {1, ... , m}, one may calculate RK in polynomial
computation, before this computation is carried time in 兩K兩n (which is bounded by the size of the
out (see Luca Anderlini and Leonardo Felli, input, (m ⫹ 1)n).16 To show that the problem is
1994; Nabil Al-Najjar et al., 1999). In such a NP-Complete, we use a reduction of the follow-
model, one would be described as if one en- ing problem, which is known to be NP-Complete
tertained a prior probability of, say, p, that (see Michael Gary and David Johnson, 1979; or
“democratic peace” holds. Upon hearing the Papadimitriou, 1994):
rhetorical question as in our dialogue, the
decision maker performs the computation of Problem Exact Cover: Given a set S, a set of
the accuracy of this rule, and is described as subsets of S, ᑭ, are there pairwise disjoint sub-
if the result of this computation were new sets in ᑭ whose union equals S?
information. (That is, does a subset of ᑭ constitute a
A related approach employs a subjective state partition of S?)
space to provide a Bayesian account of unfore- Given a set S, a set of subsets of S, ᑭ, we will
seen contingencies (see David Kreps, 1979, generate n observations of (m ⫹ 1) variables,
1992; Eddie Dekel et al., 1997, 1998). Should (xij)i ⱕ n, j ⱕ m and ( yi)i ⱕ n , and a natural number
this approach be applied to the problem of in- k, such that S has an exact cover in ᑭ iff there
duction, each regularity that might hold in the is a subset K of {1, ... , m} with 兩K兩 ⱕ k and
knowledge base would be viewed as an unfore- RK2
ⱖ r.
seen contingency that might arise. A decision Let there be given, then, S and ᑭ. Assume
maker’s behavior will then be viewed as arising without loss of generality that S ⫽ {1, ... , s},
from Bayesian optimization with respect to a and that ᑭ ⫽ {S1, ... , Sl} (where s, l ⱖ 1 are
subjective state space that reflects her subjective natural numbers). We construct n ⫽ 2(s ⫹ l ⫹
uncertainty. 1) observations of m ⫽ 2l predicting variables.
Our approach is compatible with these pseudo- It will be convenient to denote the 2l predicting
Bayesian models. Its relative strength is that it variables by X1, ... , Xl and Z1, ... , Zl and the
models the process of induction more explicitly, predicted variable— by Y. Their corresponding
allowing a better understanding of why and values will be denoted (xij)i ⱕ n, j ⱕ l, (zij)i ⱕ n, j ⱕ l,
when induction is likely to be a hard problem. and ( yi)i ⱕ n. We will use Xj , Zj , and Y also to
Gilboa and Schmeidler (2001) offer a theory denote the column vectors (xij)i ⱕ n , (zij)i ⱕ n , and
of case-based decision making. They argue that ( yi)i ⱕ n , respectively. Let M ⱖ 0 be a constant to
cases are the primitive objects of knowledge, be specified later. We now specify the vectors
and that rules and probabilities are derived from X1, ... , Xl, Z1, ... , Zl, and Y as a function of M.
cases. Moreover, rules and probabilities cannot
be known in the same sense, and to the same
degree of certitude, that cases can. Yet, rules
16
and probabilities may be efficient and insightful Here and in the sequel, we assume that reading an
ways of succinctly summarizing many cases. entry in the matrix X or in the vector Y, as well as any
algebraic computation, requires a single time unit. Our
The present paper suggests that summarizing results hold also if one assumes that xij and yi are all rational
knowledge bases by rules may involve loss of and take into account the time it takes to read and manip-
information, because one cannot be guaranteed ulate these numbers.
1366 THE AMERICAN ECONOMIC REVIEW DECEMBER 2005
For i ⱕ s and j ⱕ l, xij ⫽ 1 if i 僆 Sj , and xij ⫽ follows from our construction (assigning pre-
0 if i ⰻ Sj; cisely one of {j , ␥j} to 1 and the other—to 0).
For i ⱕ s and j ⱕ l, zij ⫽ 0; Obviously, ␣ ⫹ ¥j ⱕ l j xnj ⫹ ¥j ⱕ l ␥j znj ⫽ 0 ⫽
For s ⬍ i ⱕ s ⫹ l and j ⱕ l, xij ⫽ zij ⫽ 1 if yi ⫽ 0. The number of variables used in this
i ⫽ s ⫹ j and xij ⫽ zij ⫽ 0 if i ⫽ s ⫹ j; regression is l. Specifically, choose K ⫽ {Xj兩j 僆
For j ⱕ l, xs ⫹ l⫹1, j ⫽ zs ⫹ l⫹1, j ⫽ 0; J} 艛 {Zj兩j ⰻ J}, with 兩K兩 ⫽ l, and observe that
For i ⱕ s ⫹ l, yi ⫽ 1 and ys ⫹ l⫹1 ⫽ M; 2
R̂K ⫽ 1.
For i ⬎ s ⫹ l ⫹ 1, yi ⫽ ⫺yi ⫺ (s ⫹ l⫹1) and for We now turn to the converse direction. As-
all j ⱕ l, xij ⫽ ⫺xi ⫺ (s ⫹ l⫹1), j and zij ⫽ sume, then, that there is a subset K of {X1, ... ,
⫺zi ⫺ (s ⫹ l⫹1), j. Xl} 艛 {Z1, ... , Zl} with 兩K兩 ⱕ l for which R̂K 2
⫽
Observe that the bottom half of the matrix X, 1. Since all variables have zero means, this
as well as the bottom half of the vector Y, are regression has an intercept of zero (␣ ⫽ 0 in the
the negatives of the respective top halves. This notation above). Let J 傺 {1, ... , l} be the set of
implies that each of the variables X1, ... , Xl, indices of the X variables in K, i.e., {Xj}j 僆 J ⫽
Z1, ... , Zl, and Y has a mean of zero. This, in K 艚 {X1, ... , Xl}. We will show that {Sj}j 僆 J
turn, implies that for any set of variables K, constitutes a partition of S. Let L 傺 {1, ... , l} be
when we regress Y on K, we get a regression the set of indices of the Z variables in K, i.e.,
equation with a zero intercept. {Zj}j 僆 L ⫽ K 艚 {Z1, ... , Zl}. Consider the
Consider the matrix X and the vector Y ob- coefficients of the variables in K used in the
tained above by the construction for different regression obtaining R̂K 2
⫽ 1. Denote them by
values of M. Observe that the collection of sets (j)j 僆 J and (␥j)j 僆 L. Define j ⫽ 0 if j ⰻ J and
K that maximize RK 2
is independent of M. ␥j ⫽ 0 if j ⰻ L. Thus, we have
2
Hence, it is useful to define R̂K as the R2 ob-
tained from regressing Y on K, ignoring obser-
vations s ⫹ l ⫹ 1 and 2(s ⫹ l ⫹ 1). Obviously,
2 2
冘  X ⫹ 冘 ␥ Z ⫽ Y.
jⱕl
j j
jⱕl
j j
l ⫹ 1 and 2(s ⫹ l ⫹ 1), where SSR and SST We claim that there exists a set of regressors
denote the variances of the regression with all that obtains an adjusted R2 of r iff there exists a
2
observations. Thus, RK ⫽ SSR/SST and R̂K 2
⫽ set of l regressors that obtains an R2 of r⬘
ˆ /SST
SSR ˆ . Observe that SST
ˆ ⫽ 2共s ⫹ l 兲 and (hence, iff there exists an exact cover in the
ˆ is original problem). The “if” part is obvious from
SST ⫽ 2(s ⫹ l ) ⫹ 2M2. Also, SSR ⫽ SSR our construction. Consider the “only if” part.
independent of M. Assume, then, that a set of regressors obtains an
Note that if K is such that R̂ K 2
⫽ 1, then adjusted R2 of r. If it has l regressors, the same
ˆ ˆ
(SSR ⫽)SSR ⫽ SST⫽ 2(s ⫹ l). In this case, calculation shows that it obtains the desired R2.
RK2
⫽ [2(s ⫹ l )]/[2(s ⫹ l ) ⫹ 2M2]. If, however, We now argue that if no set of l regressors
2
K is such that R̂K ⬍ 1, then we argue that (SSR ⫽) obtains an adjusted R2 of r, then no set of
ˆ ˆ regressors (of any cardinality) obtains an ad-
SSR ⱕ SST ⫺ ⁄9 . Assume not. That is, assume
1
justed R2 of r.
ˆ ⬎ SST ˆ ⫺ 1⁄9 . This
that K is such that SSR Consider first a set K0 with 兩K0兩 ⫽ k0 ⬎ l
implies that on each of the observations 1, ... , regressors. Observe that, by the choice of M, r⬘
s ⫹ l, s ⫹ l ⫹ 2, ... , 2(s ⫹ l ) ⫹ 1, the fit is the upper bound on all RK 2
for all K with 兩K兩
produced by K is at most 1⁄3 away from yi. Then ⫽ l, as r⬘ was computed assuming that an exact
for every j ⱕ l, 兩j ⫹ ␥j ⫺ 1兩 ⬍ 1⁄3 . Hence for cover exists, and that, therefore, there are l
every j ⱕ l either j ⫽ 0 or ␥j ⫽ 0, but not both, variables that perfectly match all the observa-
and the non-zero coefficient out of {j , ␥j} has tions but s ⫹ l ⫹ 1 and 2(s ⫹ l ⫹ 1). Due to the
to be in (2⁄3 , 4⁄3). But then, considering the first structure of the problem, r⬘ is also an upper
s observations, we find that K is an exact cover. bound on RK 2
for all K with 兩K兩 ⱖ l. This is so
It follows that, if R̂K2
⬍ 1, then RK2
ⱕ [2(s ⫹ l ) ⫺ because the only observations that are not per-
1⁄9]/[2(s ⫹ l ) ⫹ 2M2].
fectly matched (in the hypothesized l-regressor
Choose a rational M in the interval set) correspond to zero values of the regressors.
共冑 关共1 ⫺ r兲共s ⫹ l 兲 ⫺ 1⁄18兴/r, 冑 关共1 ⫺ r兲共s ⫹ l 兲兴/r) so It follows that the adjusted R2 for K0 is lower
that [2(s ⫹ l ) ⫺ 1⁄9]/[2(s ⫹ l ) ⫹ 2M2] ⬍ r ⬍ than r.
[2(s ⫹ l )]/[2(s ⫹ l ) ⫹ 2M2], and observe that Next, consider a set K0 with 兩K0兩 ⫽ k0 ⬍ l
for this M, there exists a K such that RK 2
ⱖ r iff regressors. For such a set there exists a j ⱕ l
there exists a K for which R̂K ⫽ 1, that is, iff K
2
such that neither Xj nor Zj are in K0. Hence,
is an exact cover. observations s ⫹ j and 2s ⫹ l ⫹ j ⫹ 1 cannot be
To conclude the proof, it remains to observe matched by the regression on K0. The lowest
that the construction of the variables (Xj)j ⱕ l, possible SSE in this problem, corresponding to
(Zj)j ⱕ l, and Y can be done in polynomial time the hypothesized set of l regressors, is 2M2. This
in the size of the input. means that the SSE of K0 is at least 2M2 ⫹ 2.
That is, the SSE of the set K0 is at least (M2 ⫹
1)/M2 larger than the SSE used for the calcula-
PROOF OF THEOREM 2: tion of r. On the other hand, K0 uses fewer
Let there be given r ⬎ 0. The proof follows variables. But if (t ⫹ 2s ⫹ l ⫹ 1)/(t ⫹ 2s ⫹ k ⫹
that of Theorem 1, with the following modifi- 1) ⬍ (M2 ⫹ 1)/M2, the reduction in the number
cation. For an integer t ⱖ 1, to be specified later, of variables cannot pay off, and K0 has an
we add t observations for which all the variables adjusted R2 lower than r. It remains to choose t
((Xj)j ⱕ l, (Zj)j ⱕ l, and Y) assume the value 0. large enough so that the above inequality holds,
These observations do not change the R2 ob- and to observe that this t is bounded by a
tained by any set of regressors, as both SST and polynomial of the input size.
SSR remain the same. Assuming that t has been
fixed (and that it is polynomial in the data), let
r⬘ be the R2 corresponding to an adjusted R2 of REFERENCES
r, with l regressors. That is, (1 ⫺ r⬘) ⫽ (1 ⫺
r)(t ⫹ 2s ⫹ 2l ⫹ 1)/(t ⫹ 2s ⫹ l ⫹ 1). Define M Al-Najjar, Nabil I.; Casadesus-Masanell, Ramon
as in the proof of Theorem 1 for r⬘. and Ozdenoren, Emre. “Probabilistic Repre-
1368 THE AMERICAN ECONOMIC REVIEW DECEMBER 2005
sentation of Complexity.” Journal of Eco- and games: Essays in honor of Frank Hahn.
nomic Theory, 2003, 111(1), pp. 49 – 87. Cambridge, MA: MIT Press, 1992, pp. 258 –
Anderlini, Luca and Felli, Leonardo. “Incomplete 81.
Written Contracts: Indescribable States of La Porta, Rafael; Lopez-de-Silanes, Florencio;
Nature.” Quarterly Journal of Economics, Shleifer, Andrei and Vishny, Robert. “The
1994, 109(4), pp. 1085–1124. Quality of Government.” Unpublished paper,
Bray, Margaret M. and Savin, Nathan E. “Ratio- 1998.
nal Expectations Equilibria, Learning, and Lund, Carston and Yannakakis, Mihalis. “On the
Model Specification.” Econometrica, 1986, Hardness of Approximating Minimization
54(5), pp. 1129 – 60. Problems.” Journal of the ACM, 1994, 41(5),
Dekel, Eddie; Lipman, Barton and Rustichini, pp. 960 – 81.
Aldo. “A Unique Subjective State Space for Mallows, Colin. “Some Comments on Cp.” Tech-
Unforeseen Contingencies.” Unpublished Pa- nometrics, 1973, 15(4), pp. 661–75.
per, 1997. Maoz, Zeev and Russett, Bruce. “Normative and
Dekel, Eddie; Lipman, Barton L. and Rustichini, Structural Causes of Democratic Peace,
Aldo. “Recent Developments in Modeling 1946 –1986.” American Political Science Re-
Unforeseen Contingencies.” European Eco- view, 1993, 87(3), pp. 624 –38.
nomic Review, 1998, 42(3–5), pp. 523– 42. Papadimitriou, Christos H. Computational com-
Gary, Michael R. and Johnson, David S. Comput- plexity. Reading, MA: Addison-Wesley,
ers and intractability: A guide to the theory of 1994.
NP-completeness. San Francisco: W. Free- Raz, Ran and Safra, Shmuel. “A Sub-constant
man and Co., 1979. Error-Probability Low-Degree Test, and Sub-
Gilboa, Itzhak and Schmeidler, David. A Theory constant Error-Probability PCP Characteriza-
of case-based decisions. Cambridge: Cam- tion of NP,” in Proceedings of the 29th
bridge University Press, 2001. annual ACM Symposium on Theory of Com-
Goodman, Nelson. Fact, fiction and forecast, 2nd putation. New York, ACM Press, 1997, pp.
ed. Indianapolis: Bobbs-Merrill, 1965. 475– 84.
Hastie, Trevor; Tibshirani, Robert and Friedman, Simon, Herbert. “A Behavioral Model of Ratio-
Jerome. The elements of statistical learning. nal Choice.” Quarterly Review of Economics,
New York: Springer, 2001. 1955, 69(1), pp. 99 –118.
Kreps, David M. “A Representation Theorem for Tibshirani, Robert J. “Regression Shrinkage and
‘Preference for Flexibility’.” Econometrica, Selection via the Lasso.” Journal of the Royal
1979, 47(3), pp. 565–77. Statistical Society, Series B, 1996, 58(1), pp.
Kreps, David M. “Static Choice in the Presence 267– 88.
of Unforeseen Contingencies,” in Partha Das- Wittgenstein, Ludwig. Tractatus Logico-
gupta, Douglas Gale, Oliver Hart and Eric Philosophicus. 1922. Reprint, London:
Maskin, eds., Economic analysis of markets Routledge, 1961.