SSRN Id2061029
Lewis Davis
Department of Economics
Union College
807 Union Street
Schenectady NY 12308
davisl@union.edu
Abstract
This note describes an extension of the Kashima and Kashima (1998) linguistic dataset. The
variables in the dataset reflect grammatical rules for pronoun use in a country’s primary language
and are commonly employed as instruments for key dimensions of national culture, such as
individualism and egalitarianism. This extension adds linguistic data for thirteen countries and
increases the number of countries in three commonly used cultural datasets for which linguistic
data is available by seven to fifteen percent.
JEL Codes: Z1
1 Introduction
This note describes an extension of the Kashima and Kashima (1998, 2005) linguistic dataset,
hereafter referred to as KK. Drawing on the Linguistic Relativity hypothesis of Whorf (1956)
and Sapir (1970), which holds that the structure of language influences cultural development,
KK provide data on two grammatical rules and relate these variables to important dimensions of
cultural variation. Since they were first introduced to economists by Licht, Goldschmidt and
Schwartz (2007), the KK linguistic variables have been used increasingly as instrumental
variables to address the identification issues that arise when cultural variables are used as
regressors, e.g. Tabellini (2008), Gorodnichenko and Roland (2010, 2011) and Davis (2012).
1
I wish to thank the Faculty Resource Network at New York University, where I was a Scholar-in-Residence during
initial work on this project. I am also grateful to Emiko Kashima for comments on an earlier draft of this paper.
Any remaining errors are my own.
The first variable, drop, is a dummy variable that takes on a value of one if a country’s primary language permits the
speaker to drop a pronoun when it is used as the subject of a sentence, and is zero otherwise. For
example, pronoun drop is permitted in Spanish, such that the English sentence “I love” may be
translated as either “Amo” or “Yo amo,” but it is not permitted in English, as the pronoun “I” is
required to make sense of the sentence. In languages that permit pronoun drop, the identity of
the subject is understood in the context of the rest of the sentence. In contrast, in languages that
do not permit pronoun drop, the subject stands apart from the context. As a result, pronoun drop
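is associated with more collectivist cultures and may be used as an instrument for variables that
measure individualism. The second variable, second, is a dummy variable that takes on a value of one if a country’s
primary language has both an informal and a formal form of the second person singular pronoun.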
In sociolinguistics this property is known as the T-V distinction, after French, in which the
second person singular may take two forms, the informal tu and the formal vous. Languages that
employ both formal and informal forms of second person address explicitly recognize social
hierarchy. Thus, having multiple second person singular pronouns is associated with an
emphasis on social stratification and may be used as an instrument for cultural variables that
measure hierarchy and egalitarianism.
An important practical limitation on the use of these instruments is that the sample of 71
countries in the original KK dataset overlaps modestly with the primary sources of cultural data
used by economists, such as the Hofstede (2001) and Schwartz (2006) datasets and the World
Values Survey, which is an international collaboration led by Ronald Inglehart et al. (2000). In
this note, I describe an extension to the original KK data that increases the number of countries
for which both linguistic and cultural data are available.
Construction
The sample consists of the union of the countries covered by the Hofstede, Schwartz and five
waves of the World Values Survey datasets. It includes countries from the Hofstede dataset
that belong to three regional groups, each of which was covered by a common survey spanning several countries. The variable
hof_group identifies the countries that were subject to a common survey and the name of the
regional group. This variable permits researchers to test the sensitivity of their results to the
inclusion of these countries.
For countries that are not in the KK dataset, we use Mayer and Zignago (2011), hereafter MZ, to
identify up to three popular languages defined as languages spoken by over 20% of the
population. If a country has more than one popular language, we compute the share of the
observed population that speaks each language, referring first to the language data in the CIA
World Factbook (CIA) and, if language shares are not listed there, to the website
www.ethnologue.com (ETH). For each language, the langshare variable is the population that
speaks that language as a share of the total population that speaks one of the popular languages.
The source of the language information used is recorded in the variable langsource: KK, MZ,
CIA, or ETH. Some language names differ across datasets. In assigning grammatical rules, we
match the following language pairs: Rumanian and Romanian, Pilipino and Filipino, Mandarin
and Chinese Mandarin, Farsi and Persian, and Hindi and Hindustani Caribbean. Bosnian,
The variables drop and second are averages of the values of
pronoun drop and the T-V distinction for up to three languages weighted by their population
shares. In computing the values of the grammatical rules, information regarding a particular
language was used if it was the most popular language or if information on more popular
languages was also available. Thus, in the case of Bolivia, with languages Spanish (74%) and
Quechua (26%), the grammar variables are coded using information for Spanish, as the KK
dataset does not provide grammatical information on Quechua. However, in the case of
Namibia, with languages Afrikaans (65%) and German (35%), the grammar variables are coded
as missing because the KK dataset does not include grammatical information on Afrikaans. As a
practical matter, no country in the sample had three popular languages for which grammatical
data was available, so the published dataset only contains the two most popular languages, lang1
and lang2, along with their respective shares of the observed population, langshare1 and
langshare2. The dataset also includes the linguistic variables from the original KK dataset,
which are distinguished by the prefix “KK_”, e.g. KK_lang, KK_drop and KK_second.
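The selection and coding rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to build the dataset: the function names, the constant names and the renormalization of weights over the languages actually used are hypothetical, and the T-V values are illustrative (Spanish tú/usted, German du/Sie) rather than copied from the KK data.

```python
POPULAR_THRESHOLD = 0.20  # a "popular" language is spoken by over 20% of the population
MAX_LANGUAGES = 3         # at most three popular languages are considered


def popular_languages(speaker_shares):
    """Return [(language, langshare), ...] for the popular languages, most popular first.

    speaker_shares maps language -> share of the total population speaking it.
    langshare is the language's share of the population that speaks one of
    the popular languages, so the returned shares sum to one.
    """
    popular = sorted(
        ((lang, share) for lang, share in speaker_shares.items()
         if share > POPULAR_THRESHOLD),
        key=lambda item: item[1],
        reverse=True,
    )[:MAX_LANGUAGES]
    total = sum(share for _, share in popular)
    return [(lang, share / total) for lang, share in popular]


def code_rule(languages, rule_values):
    """Code one grammatical rule for a country.

    languages: [(language, langshare), ...], most popular first.
    rule_values: language -> 0/1 for languages with grammatical data.
    A language is used only if data exist for it and for every more popular
    language. Returns None (missing) if the most popular language has no data.
    ASSUMPTION: weights are renormalized over the languages actually used.
    """
    usable = []
    for lang, share in languages:
        if lang not in rule_values:
            break  # less popular languages are ignored from here on
        usable.append((lang, share))
    if not usable:
        return None
    weight = sum(share for _, share in usable)
    return sum(share * rule_values[lang] for lang, share in usable) / weight


# Illustrative T-V values only; not taken from the KK dataset.
second_values = {"Spanish": 1, "German": 1}

bolivia = [("Spanish", 0.74), ("Quechua", 0.26)]
namibia = [("Afrikaans", 0.65), ("German", 0.35)]

bolivia_second = code_rule(bolivia, second_values)  # Spanish only
namibia_second = code_rule(namibia, second_values)  # missing
```

In this sketch Bolivia is coded from Spanish alone, while Namibia is coded as missing even though data for German are available, because the more popular language, Afrikaans, has no grammatical data.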
Table 1 shows summary statistics for the original and extended data. These statistics indicate
that the extension has little impact on the mean or standard deviation of the two grammar
variables. Table 2 lists the countries with new and updated linguistic information along with
their region, primary and secondary languages and the values of their linguistic variables. The
first 14 countries listed in Table 2 are new, while the final entry, Singapore, has updated
linguistic variables based on the use of grammatical information for multiple languages. Table 2
also indicates that new countries are concentrated in three regions, Latin America and the
Caribbean, Middle East and North Africa, and Eastern Europe and Central Asia, an outcome that
largely reflects the roles of Spanish, English and Arabic as international languages.
The regional concentration of the new countries raises the question of whether the
repeated sampling of countries that share a language significantly alters the relationship between
the linguistic variables and their cultural counterparts. To address this question, we compare
regressions of cultural variables on their associated linguistic rule using the original and
extended datasets. Table 3 presents regression results for measures of individualism, including
a measure of responsibility (indresp) from Davis (2012) that is derived from the WVS. For each dependent variable,
both drop and KK_drop are significant at the 1% level, and the point estimates
of the coefficient on pronoun drop using the two samples are generally similar in magnitude.
Table 4 presents the corresponding regressions for second and KK_second. Regression results using Hofstede’s Power Distance Index as the
dependent variable are nearly identical. Both coefficients are significant at the 1% level and the
point estimates are also quite similar. The remaining columns show regressions using measures
of hierarchy and egalitarianism from Schwartz. These results are also similar, though in this case
it is because the linguistic variables are not significant in any of these regressions.
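The comparison described above amounts to running the same bivariate regression on two samples. The following is a minimal sketch of such a regression of a cultural variable on a linguistic dummy, with heteroskedasticity-robust t-statistics. It assumes the HC1 small-sample correction, which the note does not specify, and the function name and the toy data are hypothetical, chosen only so that the arithmetic is easy to check.

```python
from math import sqrt


def dummy_regression(x, y):
    """OLS of y on a constant and a single regressor x, with a robust t-statistic.

    Returns the slope, its HC1-robust t-statistic, the number of
    observations and the R-squared.
    ASSUMPTION: HC1 (White's estimator scaled by n / (n - 2)).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # Sandwich estimator for the slope: sum of (x_i - x_bar)^2 e_i^2 over Sxx^2
    meat = sum((xi - x_bar) ** 2 * e ** 2 for xi, e in zip(x, resid))
    var_hc1 = (n / (n - 2)) * meat / sxx ** 2
    sst = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1.0 - sum(e ** 2 for e in resid) / sst
    return {"coef": slope, "t": slope / sqrt(var_hc1), "n": n, "r2": r2}


# Toy data, invented for illustration; these are not values from any dataset.
drop_extended = [0, 0, 0, 1, 1, 1]
idv_extended = [5.0, 6.0, 7.0, 1.0, 2.0, 3.0]

# A hypothetical "original" sample that is a subset of the extended one.
drop_original = [0, 0, 1, 1]
idv_original = [5.0, 6.0, 1.0, 2.0]

extended_fit = dummy_regression(drop_extended, idv_extended)
original_fit = dummy_regression(drop_original, idv_original)
```

With a dummy regressor the slope equals the difference between the two group means, so comparing `extended_fit["coef"]` with `original_fit["coef"]` shows directly whether the added countries shift the estimated gap between the two language groups.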
The results presented in Tables 3 and 4 do not suggest that the increased coverage of
Spanish, English and Arabic speaking countries in the extended dataset significantly alters the
relationships between linguistic and cultural variables. While the results for pronoun drop are
generally stronger than those for the T-V distinction, this is true for regressions using both the
original and extended samples. As indicated in Tables 3 and 4, the use of the extended linguistic
dataset increases coverage by 7% for the Hofstede data, 5% for the Schwartz data and 15% for
the WVS data. It also results in a mild reduction in fit for some regressions, especially those
using the Hofstede cultural variables, an outcome that may reflect the addition of observations with
“atypical” language-country pairings, such as English with Sierra Leone and Jamaica. By
comparison, the English speaking countries in the KK sample include the UK, Ireland and the
four neo-Europes. We view the inclusion of atypical pairs as a strength of the extended dataset,
as it may serve as a partial control for omitted regional and historical variables. The extended
References
Davis, Lewis S. (2012), “Individualism and Economic Development: Evidence from Rainfall
Data,” manuscript, Union College.
Gorodnichenko, Yuriy and Gerard Roland (2010) “Culture, Institutions, and the Wealth of Nations,”
NBER Working Paper #16368.
Gorodnichenko, Yuriy and Gerard Roland (2011) “Which dimensions of culture matter for long run
growth?” American Economic Review Papers and Proceedings 101, 492-498.
Hofstede, Geert H. (2001) Culture’s Consequences: Comparing Values, Behaviors, Institutions, and
Organizations across Nations. Second ed. Sage, Thousand Oaks, CA.
Inglehart, Ronald, European Values Survey Group, and World Values Survey Group, (2000) “World
Values Surveys and European Values Surveys 1981-84, 1990-93, 1995-97.”
http://www.stanford.edu/group/ssds/dewidocs/icpsr2790_superseded/cb2790.pdf.
Kashima, Emiko S. and Yoshihisa Kashima (1998) “Culture and language: The case of cultural
dimensions and personal pronoun use,” Journal of Cross-Cultural Psychology 29, 461–487.
Kashima, Emiko S. and Yoshihisa Kashima (2005), “Erratum to Kashima and Kashima (1998) and
Reiteration,” Journal of Cross-Cultural Psychology 36(3), 396-400.
Licht, Amir, Chanan Goldschmidt and Shalom H. Schwartz (2007) “Culture rules: The foundations of
the rule of law and other norms of governance,” Journal of Comparative Economics 35, 659–688.
Mayer, Thierry and Soledad Zignago (2011) “Notes on CEPII’s distances measures: The
GeoDist database,” CEPII, WP No 2011 – 25.
Sapir, Edward (1970) “Language.” In: Mandelbaum, D.G. (Ed.), Culture, Language and Personality:
Selected Essays. Univ. of California Press, Berkeley, 1–44.
Schwartz, Shalom H. (2006) “A Theory of Cultural Value Orientations: Explication and Applications,”
Comparative Sociology 5(2-3), 137-182.
Tabellini, Guido (2008) “Institutions and Culture,” Journal of the European Economic Association
6(2-3), 255-294.
Whorf, Benjamin L. (1956) Language, Thought and Reality. MIT Press, Cambridge, MA.
World Values Survey 1981-2008 Official Aggregate, v. v20090914, (2009). World Values Survey
Association (www.worldvaluessurvey.org). Aggregate File Producer: ASEP/JDS,
Madrid.
Table 1: Summary Statistics
(1) (2) (3) (4) (5) (6) (7) (8) (9) (10)
VARIABLES hof_idv hof_idv schw_embed schw_embed schw_af_auto schw_af_auto schw_in_auto schw_in_auto indresp indresp
Observations 72 67 43 41 43 41 43 41 68 59
R-squared 0.367 0.499 0.302 0.304 0.288 0.279 0.205 0.217 0.367 0.366
Robust t-statistics in parentheses
*** p<0.01, ** p<0.05, * p<0.1
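The coverage gains reported in the text can be checked against the observation counts in the rows above. The short sketch below assumes that, within each pair of columns, the larger count is the extended sample and the smaller one is the original KK sample. The dictionary and function names are hypothetical.

```python
# Observation counts from the table above: (extended sample, original KK sample).
# ASSUMPTION: the larger count in each pair of columns is the extended sample.
observations = {
    "Hofstede": (72, 67),
    "Schwartz": (43, 41),
    "WVS": (68, 59),
}


def coverage_gain(extended, original):
    """Percentage increase in the number of countries with linguistic data."""
    return 100.0 * (extended - original) / original


gains = {name: coverage_gain(ext, orig) for name, (ext, orig) in observations.items()}

for name, gain in gains.items():
    print(f"{name}: {gain:.1f}% more countries")
```

Rounded to the nearest percent, these gains are 7% for Hofstede, 5% for Schwartz and 15% for the WVS, matching the figures quoted in the text.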
Table 4: Hierarchy and the Second Person Singular
Observations 72 67 43 41 43 41
R-squared 0.088 0.099 0.015 0.006 0.004 0.001
Robust t-statistics in parentheses
*** p<0.01, ** p<0.05, * p<0.1