SSRN Id2061029

An Extension of the Kashima and Kashima (1998) Linguistic Dataset 1
Lewis Davis
Department of Economics
Union College
807 Union Street
Schenectady NY 12308
davisl@union.edu
May 16, 2012
Abstract
This note describes an extension of the Kashima and Kashima (1998) linguistic dataset. The
variables in the dataset reflect grammatical rules for pronoun use in a country’s primary language
and are commonly employed as instruments for key dimensions of national culture, such as
individualism and egalitarianism. This extension adds linguistic data for fourteen countries and
increases the number of countries in three commonly used cultural datasets for which linguistic
data are available by five to fifteen percent.
JEL Codes: Z1
Key Words: Culture, Values, Instrumental Variables, Individualism, Egalitarianism.
1 Introduction
This note describes an extension of the Kashima and Kashima (1998, 2005) linguistic dataset,
hereafter referred to as KK. Drawing on the Linguistic Relativity hypothesis of Whorf (1956)
and Sapir (1970), which holds that the structure of language influences cultural development,
KK provide data on two grammatical rules and relate these variables to important dimensions of
cultural variation. Since they were first introduced to economists by Licht, Goldschmidt and
Schwartz (2007), the KK linguistic variables have been used increasingly as instrumental
variables to address the identification issues that arise when cultural variables are used as
regressors, e.g. Tabellini (2008), Gorodnichenko and Roland (2010, 2011) and Davis (2012).
1 I wish to thank the Faculty Resource Network at New York University, where I was a Scholar-in-Residence during
initial work on this project. I am also grateful to Emiko Kashima for comments on an earlier draft of this paper.
Any remaining errors are my own.
Electronic copy available at: http://ssrn.com/abstract=2061029

The linguistic variables that KK encode concern grammatical rules of pronoun use. The first,
drop, is a dummy variable that takes on a value of one if a country’s primary language permits the
speaker to drop a pronoun when it is used as the subject of a sentence, and is zero otherwise. For
example, pronoun drop is permitted in Spanish, such that the English sentence “I love” may be
translated as either “Amo” or “Yo amo,” but it is not permitted in English, as the pronoun “I” is
required to make sense of the sentence. In languages that permit pronoun drop, the identity of
the subject is understood in the context of the rest of the sentence. In contrast, in languages that
do not permit pronoun drop, the subject stands apart from the context. As a result, pronoun drop
is associated with more collectivist cultures and may be used as an instrument for variables that
measure individualism or collectivism.
The second linguistic dummy variable, second, takes on a value of 1 if a country’s
primary language has both an informal and a formal form of the second person singular pronoun.
In sociolinguistics this property is known as the T-V distinction, after French, in which the
second person singular may take two forms, the informal tu and the formal vous. Languages that
employ both formal and informal forms of second person address explicitly recognize social
hierarchy. Thus, having multiple second person singular pronouns is associated with an
emphasis on social stratification and may be used as an instrument for cultural variables that
measure hierarchy or egalitarianism.
An important practical limitation on the use of these instruments is that the sample of 71
countries in the original KK dataset overlaps only modestly with the primary sources of cultural
data used by economists, such as the Hofstede (2001) and Schwartz (2006) datasets and the World
Values Survey, an international collaboration led by Ronald Inglehart (Inglehart et al., 2000). In
this note, I describe an extension to the original KK data that increases the number of countries
for which both linguistic and cultural data are available by between five and fifteen percent,
depending on the variable in question.
2 Construction
The sample consists of the union of the countries covered by the Hofstede, Schwartz and five
waves of the World Values Survey datasets. It includes countries from the Hofstede dataset
belonging to three regions, each of which was covered by a common survey spanning several
countries. The variable hof_group identifies the countries that were subject to a common survey
and the name of the regional group. This variable permits researchers to test the sensitivity of
their results to the
inclusion of countries from Hofstede’s group surveys.
If a country is in the KK dataset, we use the KK designation of its language or languages.
For countries that are not in the KK dataset, we use Mayer and Zignago (2011), hereafter MZ, to
identify up to three popular languages defined as languages spoken by over 20% of the
population. If a country has more than one popular language, we compute the share of the
observed population that speaks each language, referring first to the language data in the CIA
World Factbook (CIA) and, if language shares are not listed there, to the website
www.ethnologue.com (ETH). For each language, the langshare variable is the population that
speaks that language as a share of the total population that speaks one of the popular languages.
The source of the language information used is recorded in the variable langsource: KK, MZ,
CIA, or ETH. Some language names differ across datasets. In assigning grammatical rules, we
match the following language pairs: Rumanian and Romanian, Pilipino and Filipino, Mandarin
and Chinese Mandarin, Farsi and Persian, and Hindi and Hindustani Caribbean. Bosnian,
Serbian and Croatian are all matched to their macro-language, Serbo-Croatian.
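
The name matching and the langshare calculation described above can be sketched as follows. This is a minimal illustration, not the code used to build the dataset: the helper names canonical and langshares are hypothetical, the choice of which spelling counts as canonical is an assumption, and so is the decision to pool speakers of variants that map to the same language. The Bolivian shares are those reported in this note.

```python
# Alias map built from the language pairs named in the text. Which spelling is
# treated as canonical is an assumption made for this sketch.
ALIASES = {
    "Rumanian": "Romanian",
    "Pilipino": "Filipino",
    "Mandarin": "Chinese Mandarin",
    "Farsi": "Persian",
    "Hindustani Caribbean": "Hindi",
    "Bosnian": "Serbo-Croatian",
    "Serbian": "Serbo-Croatian",
    "Croatian": "Serbo-Croatian",
}


def canonical(name):
    """Map a language name to its canonical spelling (hypothetical helper)."""
    name = name.strip()
    return ALIASES.get(name, name)


def langshares(speakers):
    """Return each popular language's share of the observed population.

    `speakers` maps language name -> speakers (counts or percentages) for the
    popular languages only, so the returned shares sum to one. Variants that
    map to the same canonical language are pooled (an assumption).
    """
    merged = {}
    for lang, n in speakers.items():
        key = canonical(lang)
        merged[key] = merged.get(key, 0.0) + n
    total = sum(merged.values())
    if total <= 0:
        raise ValueError("no speakers recorded")
    return {lang: n / total for lang, n in merged.items()}


# Bolivia, using the shares reported in this note: Spanish 74%, Quechua 26%.
bolivia = langshares({"Spanish": 74, "Quechua": 26})
```

Because the denominator is the population speaking one of the popular languages rather than the total population, the shares of a country's listed languages always sum to one.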

The linguistic variables drop and second are the averages of the grammatical rules for
pronoun drop and the T-V distinction for up to three languages weighted by their population
shares. In computing the values of the grammatical rules, information regarding a particular
language was used if it was the most popular language or if information on more popular
languages was also available. Thus, in the case of Bolivia, with languages Spanish (74%) and
Quechua (26%), the grammar variables are coded using information for Spanish, as the KK
dataset does not provide grammatical information on Quechua. However, in the case of
Namibia, with languages Afrikaans (65%) and German (35%), the grammar variables are coded
as missing because the KK dataset does not include grammatical information on Afrikaans. As a
practical matter, no country in the sample had three popular languages for which grammatical
data was available, so the published dataset only contains the two most popular languages, lang1
and lang2, along with their respective shares of the observed population, langshare1 and
langshare2. The dataset also includes the linguistic variables from the original KK dataset,
which are distinguished by the prefix “KK_”, e.g. KK_lang, KK_drop and KK_second.
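
The weighting rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name weighted_rule is hypothetical, the rule values in DROP are read off Table 2 rather than taken from the KK source, shares are renormalized over the languages actually used (which is what Bolivia's value of 1 in Table 2 implies), and the Suriname shares are backed out from its reported drop value of 0.429.

```python
# Pronoun-drop codes for a few languages, as implied by Table 2 (assumption:
# these are not copied from the KK source itself).
DROP = {"Spanish": 1, "Dutch": 0, "Hindi": 1, "English": 0, "Chinese Mandarin": 1}


def weighted_rule(languages, rule):
    """Share-weighted average of a grammatical rule, or None if missing.

    `languages` is a list of (name, share) pairs; `rule` maps language -> 0/1.
    A language is used only if it and every more popular language have data.
    """
    ranked = sorted(languages, key=lambda pair: pair[1], reverse=True)
    used = []
    for name, share in ranked:
        if name not in rule:
            break  # less popular languages are ignored once a gap is hit
        used.append((share, rule[name]))
    if not used:
        return None  # the most popular language has no grammatical data
    total = sum(share for share, _ in used)
    return sum(share * value for share, value in used) / total


# Bolivia: Quechua has no data, so only Spanish is used.
bolivia = weighted_rule([("Spanish", 0.74), ("Quechua", 0.26)], DROP)
# Namibia: Afrikaans, the most popular language, has no data, so the result
# is missing regardless of how German is coded.
namibia = weighted_rule([("Afrikaans", 0.65), ("German", 0.35)], DROP)
# Suriname: shares backed out from the drop value reported in Table 2.
suriname = weighted_rule([("Dutch", 0.571), ("Hindi", 0.429)], DROP)
```

The early exit from the loop is what separates the Bolivian case from the Namibian one: a gap in a less popular language is tolerated, while a gap in the most popular language makes the variable missing.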
3 Overview of the Data
Table 1 shows summary statistics for the original and extended data. These statistics indicate
that the extension has little impact on the mean or standard deviation of the two grammar
variables. Table 2 lists the countries with new and updated linguistic information along with
their region, primary and secondary languages and the values of their linguistic variables. The
first 14 countries listed in Table 2 are new, while the final entry, Singapore, has updated
linguistic variables based on the use of grammatical information for multiple languages. Table 2
also indicates that new countries are concentrated in three regions, Latin America and the
Caribbean, Middle East and North Africa, and Eastern Europe and Central Asia, an outcome that
largely reflects the roles of Spanish, English and Arabic as international languages.
The regional concentration of the new countries raises the question of whether the
repeated sampling of countries that share a language significantly alters the relationship between
the linguistic variables and their cultural counterparts. To address this question, we compare
regressions of cultural variables on their associated linguistic rule using the original and
extended datasets. Table 3 presents regression results for measures of individualism, including
Hofstede’s measure of individualism, three variables from Schwartz, embeddedness (a measure
of collectivism), intellectual autonomy and affective autonomy, and a measure of individual
responsibility from Davis (2012) that is derived from the WVS. For each dependent variable,
both drop and KK_drop are significant at the 1% level in each regression, and the point estimates
of the coefficient on pronoun drop using the two samples are generally similar in magnitude.
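
As a reading aid, each column of Table 3 appears to correspond to a bivariate regression of a cultural measure on the relevant pronoun drop variable and a constant. The notation below ($y_c$ for the cultural measure in country $c$, $\alpha$, $\beta$, $\varepsilon_c$) is introduced here for illustration and does not appear in the original; the numbers are the column (1) estimates for hof_idv.

```latex
\begin{align*}
y_c &= \alpha + \beta\,\mathit{drop}_c + \varepsilon_c \\
\hat{y}_c \big|_{\mathit{drop}_c = 0} &= \hat{\alpha} = 61.72 \\
\hat{y}_c \big|_{\mathit{drop}_c = 1} &= \hat{\alpha} + \hat{\beta} = 61.72 - 30.23 = 31.49
\end{align*}
```

Under this reading, a country whose languages all permit pronoun drop is predicted to score about 30 points lower on Hofstede's individualism index than a country whose languages do not, using the extended sample.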
Table 4 presents results from regressing measures of egalitarianism and hierarchy on
second and KK_second. Regression results using Hofstede’s Power Distance Index as the
dependent variable are nearly identical. Both coefficients are significant at the 1% level and the
point estimates are also quite similar. The remaining columns show regressions using measures
of hierarchy and egalitarianism from Schwartz. These results are also similar, though in this case
it is because the linguistic variables are not significant in any of these regressions.
The results presented in Tables 3 and 4 do not suggest that the increased coverage of
Spanish, English and Arabic speaking countries in the extended dataset significantly alters the
relationships between linguistic and cultural variables. While the results for pronoun drop are
generally stronger than those for the T-V distinction, this is true for regressions using both the
original and extended samples. As indicated in Tables 3 and 4, the use of the extended linguistic
dataset increases coverage by 7% for the Hofstede data, 5% for the Schwartz data and 15% for
the WVS data. It also results in a mild reduction in fit for some regressions, especially those
using the Hofstede cultural variables, an outcome that may reflect the addition of observations with
“atypical” language-country pairings, such as English with Sierra Leone and Jamaica. By
comparison, the English speaking countries in the KK sample include the UK, Ireland and the
four neo-Europes. We view the inclusion of atypical pairs as a strength of the extended dataset,
as it may serve as a partial control for omitted regional and historical variables. The extended
linguistic dataset is available at http://minerva.union.edu/davisl/.
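
The coverage gains quoted above can be checked against the observation counts in Tables 3 and 4. The sketch below is only an arithmetic check; the names OBS and coverage_gain are hypothetical, and the WVS counts are taken from the indresp columns of Table 3.

```python
# (extended sample, original KK sample) observation counts from Tables 3 and 4.
OBS = {
    "Hofstede": (72, 67),
    "Schwartz": (43, 41),
    "WVS": (68, 59),
}


def coverage_gain(extended, original):
    """Percentage increase in the number of countries with linguistic data."""
    return 100.0 * (extended - original) / original


# Rounded to the nearest whole percent, as in the text.
gains = {name: round(coverage_gain(ext, orig)) for name, (ext, orig) in OBS.items()}
```

Rounded to whole percentages, these counts give 7% for Hofstede, 5% for Schwartz and 15% for the WVS.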
References
CIA World Factbook, available at https://www.cia.gov/library/publications/the-world-factbook/,
accessed May 10, 2012.
Davis, Lewis S. (2012), “Individualism and Economic Development: Evidence from Rainfall
Data,” manuscript, Union College.
Ethnologue, available at http://www.ethnologue.com/home.asp, accessed May 10, 2012.
Gorodnichenko, Yuriy and Gerard Roland (2010) “Culture, Institutions, and the Wealth of Nations,”
NBER Working Paper #16368.
Gorodnichenko, Yuriy and Gerard Roland (2011) "Which dimensions of culture matter for long run
growth?" American Economic Review Papers and Proceedings 101, 492-498.
Hofstede, Geert H. (2001) Culture’s Consequences: Comparing Values, Behaviors, Institutions, and
Organizations across Nations. Second ed. Sage, Thousand Oaks, CA.
Inglehart, Ronald, European Values Survey Group, and World Values Survey Group, (2000) “World
Values Surveys and European Values Surveys 1981-84, 1990-93, 1995-97.”
http://www.stanford.edu/group/ssds/dewidocs/icpsr2790_superseded/cb2790.pdf.
Kashima, Emiko S. and Yoshihisa Kashima (1998) “Culture and language: The case of cultural
dimensions and personal pronoun use,” Journal of Cross-Cultural Psychology 29, 461–487.
Kashima, Emiko S. and Yoshihisa Kashima (2005), “Erratum to Kashima and Kashima (1998) and
Reiteration,” Journal of Cross-Cultural Psychology 36(3), 396-400.
Licht, Amir, Chanan Goldschmidt and Shalom H. Schwartz (2007) “Culture rules: The foundations of
the rule of law and other norms of governance,” Journal of Comparative Economics 35, 659–688.
Mayer, Thierry and Soledad Zignago (2011) “Notes on CEPII’s distances measures: The
GeoDist database,” CEPII, WP No 2011 – 25.
Sapir, Edward (1970) “Language.” In: Mandelbaum, D.G. (Ed.), Culture, Language and Personality:
Selected Essays. Univ. of California Press, Berkeley, 1–44.
Schwartz, Shalom H. (2006) “A Theory of Cultural Value Orientations: Explication and Applications,”
Comparative Sociology 5(2-3), 137-182.
Tabellini, Guido (2008) “Institutions and Culture,” Journal of the European Economic Association
6(2-3), 255-294.
Whorf, Benjamin L. (1956) Language, Thought and Reality. MIT Press, Cambridge, MA.
World Values Survey 1981-2008 Official Aggregate, v. v20090914, (2009). World Values Survey
Association (www.worldvaluessurvey.org). Aggregate File Producer: ASEP/JDS,
Madrid.
Table 1: Summary Statistics
Variable    Obs.   Mean     Std. Dev.   Min   Max
drop        83     0.7112   0.4494      0     1
KK_drop     69     0.7101   0.4570      0     1
second      83     0.7349   0.4440      0     1
KK_second   69     0.7391   0.4423      0     1
Table 2: New and Updated Observations in the Extended Linguistic Dataset
Name                    wbcode  region  lang1             lang2      drop   second
Algeria                 DZA     MENA    Arabic                       1      1
Bolivia                 BOL     LAC     Spanish           Quechua    1      1
Bosnia And Herzegovina  BIH     ECA     Serbo-Croatian               1      1
Croatia                 HRV     ECA     Serbo-Croatian               1      1
Cyprus                  CYP     WE      Greek             Turkish    1      1
Dominican Republic      DOM     LAC     Spanish                      1      1
Jamaica                 JAM     LAC     English                      0      0
Jordan                  JOR     MENA    Arabic                       1      1
Moldova                 MDA     ECA     Romanian                     1      1
Morocco                 MAR     MENA    Arabic            Tamazight  1      1
Puerto Rico             PRI     LAC     Spanish                      1      1
Sierra Leone            SLE     SSA     English                      0      0
Suriname                SUR     LAC     Dutch             Hindi      0.429  1
Trinidad And Tobago     TTO     LAC     English                      0      0
Singapore               SGP     EAP     Chinese Mandarin  English    0.603  0
Table 3: Individualism and Pronoun Drop
              (1)           (2)           (3)           (4)           (5)           (6)           (7)           (8)           (9)           (10)
VARIABLES     hof_idv       hof_idv       schw_embed    schw_embed    schw_af_auto  schw_af_auto  schw_in_auto  schw_in_auto  indresp       indresp
drop          -30.23***                   0.398***                    -0.522***                   -0.358***                   -1.206***
              (-5.668)                    (4.191)                     (-3.806)                    (-3.178)                    (-6.085)
kkdrop                      -36.49***                   0.385***                    -0.478***                   -0.359***                   -1.153***
                            (-7.563)                    (4.032)                     (-3.586)                    (-3.135)                    (-5.637)
Constant      61.72***      67.75***      3.499***      3.488***      3.631***      3.639***      4.666***      4.682***      6.043***      6.093***
              (13.02)       (16.09)       (45.45)       (45.34)       (30.79)       (30.64)       (48.06)       (48.40)       (36.60)       (35.95)
Observations  72            67            43            41            43            41            43            41            68            59
R-squared     0.367         0.499         0.302         0.304         0.288         0.279         0.205         0.217         0.367         0.366
Robust t-statistics in parentheses
*** p<0.01, ** p<0.05, * p<0.1
Table 4: Hierarchy and the Second Person Singular
              (1)         (2)         (3)         (4)         (5)         (6)
VARIABLES     hof_pdi     hof_pdi     schw_hier   schw_hier   schw_egal   schw_egal
second        13.35***                -0.128                  0.0383
              (2.741)                 (-0.732)                (0.423)
kksecond                  14.90***                -0.0850                 0.0153
                          (2.826)                 (-0.463)                (0.160)
Constant      51.55***    49.67***    2.318***    2.284***    4.827***    4.838***
              (13.30)     (11.43)     (15.11)     (14.07)     (68.96)     (64.55)
Observations  72          67          43          41          43          41
R-squared     0.088       0.099       0.015       0.006       0.004       0.001
Robust t-statistics in parentheses
*** p<0.01, ** p<0.05, * p<0.1