An Extension of the Kashima and Kashima (1998) Linguistic Dataset 1

Lewis Davis
Department of Economics
Union College
807 Union Street
Schenectady NY 12308
davisl@union.edu

May 16, 2012

Abstract

This note describes an extension of the Kashima and Kashima (1998) linguistic dataset. The
variables in the dataset reflect grammatical rules for pronoun use in a country’s primary language
and are commonly employed as instruments for key dimensions of national culture, such as
individualism and egalitarianism. This extension adds linguistic data for fourteen countries and
increases the number of countries in three commonly used cultural datasets for which linguistic
data are available by five to fifteen percent.

JEL Codes: Z1

Key Words: Culture, Values, Instrumental Variables, Individualism, Egalitarianism.

1 Introduction

This note describes an extension of the Kashima and Kashima (1998, 2005) linguistic dataset,

hereafter referred to as KK. Drawing on the Linguistic Relativity hypothesis of Whorf (1956)

and Sapir (1970), which holds that the structure of language influences cultural development,

KK provide data on two grammatical rules and relate these variables to important dimensions of

cultural variation. Since they were first introduced to economists by Licht, Goldschmidt and

Schwartz (2007), the KK linguistic variables have been used increasingly as instrumental

variables to address the identification issues that arise when cultural variables are used as

regressors, e.g. Tabellini (2008), Gorodnichenko and Roland (2010, 2011) and Davis (2012).

1
I wish to thank the Faculty Resource Network at New York University, where I was a Scholar-in-Residence during
initial work on this project. I am also grateful to Emiko Kashima for comments on an earlier draft of this paper.
Any remaining errors are my own.

Electronic copy available at: http://ssrn.com/abstract=2061029


The linguistic variables that KK encode regard grammatical rules of pronoun use. Drop

is a dummy variable that takes on a value of one if a country’s primary language permits the

speaker to drop a pronoun when it is used as the subject of a sentence, and is zero otherwise. For

example, pronoun drop is permitted in Spanish, such that the English sentence “I love” may be

translated as either “Amo” or “Yo amo,” but it is not permitted in English, as the pronoun “I” is

required to make sense of the sentence. In languages that permit pronoun drop, the identity of

the subject is understood in the context of the rest of the sentence. In contrast, in languages that

do not permit pronoun drop, the subject stands apart from the context. As a result, pronoun drop

is associated with more collectivist cultures and may be used as an instrument for variables that

measure individualism or collectivism.

The second linguistic dummy variable, second, takes on a value of 1 if a country’s

primary language has both an informal and a formal form of the second person singular pronoun.

In sociolinguistics this property is known as the T-V distinction, after French, in which the

second person singular may take two forms, the informal Tu and the formal Vous. Languages that

employ both formal and informal forms of second person address explicitly recognize social

hierarchy. Thus, having multiple second person singular pronouns is associated with an

emphasis on social stratification and may be used as an instrument for cultural variables that

measure hierarchy or egalitarianism.

An important practical limitation on the use of these instruments is that the sample of 71

countries in the original KK dataset overlaps modestly with the primary sources of cultural data

used by economists, such as the Hofstede (2001) and Schwartz (2006) datasets and the World

Values Survey, which is an international collaboration led by Ronald Inglehart et al. (2000). In

this note, I describe an extension to the original KK data that increases the number of countries

for which both linguistic and cultural data are available by between five and fifteen percent,

depending on the variable in question.

Construction

The sample consists of the union of the countries covered by the Hofstede, Schwartz and five

waves of the World Values Survey datasets. It includes countries from the Hofstede dataset

belonging to three regions, each of which was covered by a common survey spanning several countries. The variable

hof_group identifies the countries that were subject to a common survey and the name of the

regional group. This variable permits researchers to test the sensitivity of their results to the

inclusion of countries from Hofstede’s group surveys.

If a country is in the KK dataset, we use the KK designation of its language or languages.

For countries that are not in the KK dataset, we use Mayer and Zignago (2011), hereafter MZ, to

identify up to three popular languages defined as languages spoken by over 20% of the

population. If a country has more than one popular language, we compute the share of the

observed population that speaks each language, referring first to the language data in the CIA

World Factbook (CIA) and, if language shares are not listed there, to the website

www.ethnologue.com (ETH). For each language, the langshare variable is the population that

speaks that language as a share of the total population that speaks one of the popular languages.

The source of the language information used is recorded in the variable langsource: KK, MZ,

CIA, or ETH. Some language names differ across datasets. In assigning grammatical rules, we

match the following language pairs: Rumanian and Romanian, Pilipino and Filipino, Mandarin

and Chinese Mandarin, Farsi and Persian, and Hindi and Hindustani Caribbean. Bosnian,

Serbian and Croatian are all matched to their macro-language, Serbo-Croatian.
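
The name matching and share calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used to build the dataset: the names `ALIASES`, `canonical` and `langshares`, the choice of which spelling in each pair is treated as canonical, and the example shares are all assumptions made for the sketch. For simplicity it applies the 20% threshold and the share calculation to a single set of population shares, whereas the note takes the list of popular languages from MZ and the shares from CIA or ETH.

```python
# Alias -> canonical name, following the language pairs listed in the text.
# Which member of each pair is canonical is an arbitrary choice in this sketch.
ALIASES = {
    "Rumanian": "Romanian",
    "Pilipino": "Filipino",
    "Chinese Mandarin": "Mandarin",
    "Persian": "Farsi",
    "Hindustani Caribbean": "Hindi",
    "Bosnian": "Serbo-Croatian",
    "Serbian": "Serbo-Croatian",
    "Croatian": "Serbo-Croatian",
}


def canonical(name):
    """Map a language name to its canonical spelling (unchanged if unknown)."""
    return ALIASES.get(name, name)


def langshares(pop_shares, threshold=0.20, max_langs=3):
    """Share of each popular language among speakers of popular languages.

    pop_shares maps language name -> share of the total population.
    Aliases are merged first, then languages above the threshold are kept.
    """
    merged = {}
    for name, share in pop_shares.items():
        key = canonical(name)
        merged[key] = merged.get(key, 0.0) + share
    popular = sorted(
        ((n, s) for n, s in merged.items() if s > threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )[:max_langs]
    total = sum(s for _, s in popular)
    return {n: s / total for n, s in popular}


# Hypothetical country with made-up shares (not taken from any source).
example = langshares({"Rumanian": 0.60, "Russian": 0.30, "Other": 0.10})
print(example)
```

In this made-up case the third language falls below the threshold, so the two remaining shares are rescaled to sum to one, which is how langshare is defined in the text.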


The linguistic variables drop and second are the averages of the grammatical rules for

pronoun drop and the T-V distinction for up to three languages weighted by their population

shares. In computing the values of the grammatical rules, information regarding a particular

language was used if it was the most popular language or if information on more popular

languages was also available. Thus, in the case of Bolivia, with languages Spanish (74%) and

Quechua (26%), the grammar variables are coded using information for Spanish, as the KK

dataset does not provide grammatical information on Quechua. However, in the case of

Namibia, with languages Afrikaans (65%) and German (35%), the grammar variables are coded

as missing because the KK dataset does not include grammatical information on Afrikaans. As a

practical matter, no country in the sample had three popular languages for which grammatical

data was available, so the published dataset only contains the two most popular languages, lang1

and lang2, along with their respective shares of the observed population, langshare1 and

langshare2. The dataset also includes the linguistic variables from the original KK dataset,

which are distinguished by the prefix “KK_”, e.g. KK_lang, KK_drop and KK_second.
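
The coding rule above can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the function name `weighted_rule` and the dictionary `DROP_RULE` are hypothetical, the Spanish and English values are read off Table 2, the German value is assumed for the example, and rescaling the weights to the languages actually used is an assumption that is consistent with Bolivia being coded as 1.

```python
# Pronoun-drop rule for a few languages. Spanish and English follow Table 2;
# the German value is an assumption made for this example. Languages missing
# from the dictionary stand in for languages without grammatical information.
DROP_RULE = {"Spanish": 1, "English": 0, "German": 0}


def weighted_rule(languages, rule):
    """Population-weighted average of a grammatical rule.

    languages: list of (name, share) pairs for a country's popular languages.
    rule: dict mapping language name -> 0/1 value of the rule.
    A language is used only if every more popular language also has data.
    Returns None when the most popular language has no data.
    """
    used = []
    for name, share in sorted(languages, key=lambda pair: pair[1], reverse=True):
        if name not in rule:
            break  # no data for this language, so less popular ones are ignored
        used.append((rule[name], share))
    if not used:
        return None
    total = sum(share for _, share in used)
    return sum(value * share for value, share in used) / total


# The two cases discussed in the text.
bolivia = weighted_rule([("Spanish", 0.74), ("Quechua", 0.26)], DROP_RULE)
namibia = weighted_rule([("Afrikaans", 0.65), ("German", 0.35)], DROP_RULE)

# A hypothetical two-language country, to show a fractional value.
mixed = weighted_rule([("Spanish", 0.6), ("English", 0.4)], DROP_RULE)

print(bolivia, namibia, mixed)
```

Bolivia is coded from Spanish alone, Namibia is missing because its most popular language has no data, and the hypothetical mixed case yields a value strictly between zero and one, as for Suriname and Singapore in Table 2.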

Overview of the Data

Table 1 shows summary statistics for the original and extended data. These statistics indicate

that the extension has little impact on the mean or standard deviation of the two grammar

variables. Table 2 lists the countries with new and updated linguistic information along with

their region, primary and secondary languages and the values of their linguistic variables. The

first 14 countries listed in Table 2 are new, while the final entry, Singapore, has updated

linguistic variables based on the use of grammatical information for multiple languages. Table 2

also indicates that new countries are concentrated in three regions, Latin America and the
Caribbean, Middle East and North Africa, and Eastern Europe and Central Asia, an outcome that

largely reflects the roles of Spanish, English and Arabic as international languages.

The regional concentration of the new countries raises the question of whether the

repeated sampling of countries that share a language significantly alters the relationship between

the linguistic variables and their cultural counterparts. To address this question, we compare

regressions of cultural variables on their associated linguistic rule using the original and

extended datasets. Table 3 presents regression results for measures of individualism, including

Hofstede’s measure of individualism, three variables from Schwartz (embeddedness, a measure

of collectivism; intellectual autonomy; and affective autonomy), and a measure of individual

responsibility from Davis (2012) that is derived from the WVS. For each dependent variable,

both drop and KK_drop are significant at the 1% level, and the point estimates

of the coefficient on pronoun drop using the two samples are generally similar in magnitude.
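
When the regressor is an exact 0/1 dummy, a regression of this kind has a simple reading: the constant is the mean of the dependent variable in the group coded 0, and the slope is the difference in means between the two groups. The sketch below illustrates this with pure Python on made-up numbers. The function name `ols_dummy` and the toy data are hypothetical, and the HC1 form of the robust variance is an assumption, since the tables report robust t-statistics without naming the estimator.

```python
from math import sqrt


def ols_dummy(y, d):
    """OLS of y on a constant and a regressor d.

    Returns (intercept, slope, robust t-statistic of the slope), where the
    robust variance is the HC0 sandwich scaled by n / (n - 2), i.e. HC1.
    """
    n = len(y)
    d_bar = sum(d) / n
    y_bar = sum(y) / n
    sxx = sum((di - d_bar) ** 2 for di in d)
    slope = sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y)) / sxx
    intercept = y_bar - slope * d_bar
    resid = [yi - intercept - slope * di for yi, di in zip(y, d)]
    meat = sum(((di - d_bar) ** 2) * (e ** 2) for di, e in zip(d, resid))
    var_hc1 = (n / (n - 2)) * meat / sxx ** 2
    return intercept, slope, slope / sqrt(var_hc1)


# Toy data, invented for illustration only (not from any cultural dataset).
y = [70.0, 60.0, 65.0, 30.0, 25.0, 35.0, 40.0]
d = [0, 0, 0, 1, 1, 1, 1]

intercept, slope, t_stat = ols_dummy(y, d)

mean0 = sum(yi for yi, di in zip(y, d) if di == 0) / d.count(0)
mean1 = sum(yi for yi, di in zip(y, d) if di == 1) / d.count(1)
print(intercept, slope, t_stat, mean0, mean1 - mean0)
```

Read this way, the constant and drop coefficient in column (1) of Table 3 are approximately the mean of hof_idv for countries without pronoun drop and the gap between the two groups. The reading is only approximate for the extended data because drop takes fractional values for two countries.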

Table 4 presents results from regressing measures of egalitarianism and hierarchy on

second and KK_second. Regression results using Hofstede’s Power Distance Index as the

dependent variable are nearly identical. Both coefficients are significant at the 1% level and the

point estimates are also quite similar. The remaining columns show regressions using measures

of hierarchy and egalitarianism from Schwartz. These results are also similar, though in this case

it is because the linguistic variables are not significant in any of these regressions.

The results presented in Tables 3 and 4 do not suggest that the increased coverage of

Spanish, English and Arabic speaking countries in the extended dataset significantly alters the

relationships between linguistic and cultural variables. While the results for pronoun drop are

generally stronger than those for the T-V distinction, this is true for regressions using both the

original and extended samples. As indicated in Tables 3 and 4, the use of the extended linguistic
dataset increases coverage by 7% for the Hofstede data, 5% for the Schwartz data and 15% for

the WVS data. It also results in a mild reduction in fit for some regressions, especially those

using the Hofstede cultural variables, an outcome that may reflect the addition of observations with

“atypical” language-country pairings, such as English with Sierra Leone and Jamaica. By

comparison, the English speaking countries in the KK sample include the UK, Ireland and the

four neo-Europes. We view the inclusion of atypical pairs as a strength of the extended dataset,

as it may serve as a partial control for omitted regional and historical variables. The extended

linguistic dataset is available at http://minerva.union.edu/davisl/.
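
The coverage gains quoted above can be checked against the observation counts in Tables 3 and 4. The short sketch below does the arithmetic. It assumes that the indresp columns represent the WVS sample, and the dictionary layout and variable names are illustrative.

```python
# (extended sample, original KK sample) observation counts from Tables 3 and 4.
obs = {
    "Hofstede": (72, 67),
    "Schwartz": (43, 41),
    "WVS": (68, 59),  # assumed to correspond to the indresp regressions
}

# Percentage increase in the number of countries with both kinds of data.
gains = {name: 100 * (ext - orig) / orig for name, (ext, orig) in obs.items()}

for name, gain in gains.items():
    print(f"{name}: {gain:.1f}%")
```

Rounded to whole numbers, these match the 7%, 5% and 15% figures in the text.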

References

CIA World Factbook, available at https://www.cia.gov/library/publications/the-world-factbook/,
accessed May 10, 2012.

Davis, Lewis S. (2012), “Individualism and Economic Development: Evidence from Rainfall
Data,” manuscript, Union College.

Ethnologue, available at http://www.ethnologue.com/home.asp, accessed May 10, 2012.

Gorodnichenko, Yuriy and Gerard Roland (2010) “Culture, Institutions, and the Wealth of Nations,”
NBER Working Paper #16368.

Gorodnichenko, Yuriy and Gerard Roland (2011) "Which dimensions of culture matter for long run
growth?" American Economic Review Papers and Proceedings 101, 492-498.

Hofstede, Geert H. (2001) Culture’s Consequences: Comparing Values, Behaviors, Institutions, and
Organizations across Nations. Second ed. Sage, Thousand Oaks, CA.

Inglehart, Ronald, European Values Survey Group, and World Values Survey Group, (2000) “World
Values Surveys and European Values Surveys 1981-84, 1990-93, 1995-97.”
http://www.stanford.edu/group/ssds/dewidocs/icpsr2790_superseded/cb2790.pdf.

Kashima, Emiko S. and Yoshihisa Kashima (1998) “Culture and language: The case of cultural
dimensions and personal pronoun use,” Journal of Cross-Cultural Psychology 29, 461–487.

Kashima, Emiko S. and Yoshihisa Kashima (2005), “Erratum to Kashima and Kashima (1998) and
Reiteration,” Journal of Cross-Cultural Psychology 36(3), 396-400.

Licht, Amir, Chanan Goldschmidt and Shalom H. Schwartz (2007) “Culture rules: The foundations of
the rule of law and other norms of governance,” Journal of Comparative Economics 35, 659–688.

Mayer, Thierry and Soledad Zignago (2011) “Notes on CEPII’s distances measures: The
GeoDist database,” CEPII, WP No 2011 – 25.

Sapir, Edward (1970) “Language.” In: Mandelbaum, D.G. (Ed.), Culture, Language and Personality:
Selected Essays. Univ. of California Press, Berkeley, 1–44.

Schwartz, Shalom H. (2006) “A Theory of Cultural Value Orientations: Explication and Applications,”
Comparative Sociology 5(2-3), 137-182.

Tabellini, Guido (2008) “Institutions and Culture,” Journal of the European Economic Association 6(2-
3), 255-294.

Whorf, Benjamin L. (1956) Language, Thought and Reality. MIT Press, Cambridge, MA.

World Values Survey 1981-2008 Official Aggregate, v. v20090914, (2009). World Values Survey
Association (www.worldvaluessurvey.org). Aggregate File Producer: ASEP/JDS,
Madrid.

Table 1: Summary Statistics

Variable     Obs.   Mean     Std. Dev.   Min   Max

drop         83     0.7112   0.4494      0     1
KK_drop      69     0.7101   0.4570      0     1
second       83     0.7349   0.4440      0     1
KK_second    69     0.7391   0.4423      0     1

Table 2: New and Updated Observations in the Extended Linguistic Dataset

Name                     wbcode   region   lang1              lang2       drop    second

Algeria                  DZA      MENA     Arabic                         1       1
Bolivia                  BOL      LAC      Spanish            Quechua     1       1
Bosnia and Herzegovina   BIH      ECA      Serbo-Croatian                 1       1
Croatia                  HRV      ECA      Serbo-Croatian                 1       1
Cyprus                   CYP      WE       Greek              Turkish     1       1
Dominican Republic       DOM      LAC      Spanish                        1       1
Jamaica                  JAM      LAC      English                        0       0
Jordan                   JOR      MENA     Arabic                         1       1
Moldova                  MDA      ECA      Romanian                       1       1
Morocco                  MAR      MENA     Arabic             Tamazight   1       1
Puerto Rico              PRI      LAC      Spanish                        1       1
Sierra Leone             SLE      SSA      English                        0       0
Suriname                 SUR      LAC      Dutch              Hindi       0.429   1
Trinidad and Tobago      TTO      LAC      English                        0       0
Singapore                SGP      EAP      Chinese Mandarin   English     0.603   0

Table 3: Individualism and Pronoun Drop

              (1)            (2)            (3)            (4)            (5)            (6)            (7)            (8)            (9)            (10)
VARIABLES     hof_idv        hof_idv        schw_embed     schw_embed     schw_af_auto   schw_af_auto   schw_in_auto   schw_in_auto   indresp        indresp

drop          -30.23***                     0.398***                      -0.522***                     -0.358***                     -1.206***
              (-5.668)                      (4.191)                       (-3.806)                      (-3.178)                      (-6.085)
kkdrop                       -36.49***                     0.385***                      -0.478***                     -0.359***                     -1.153***
                             (-7.563)                      (4.032)                       (-3.586)                      (-3.135)                      (-5.637)
Constant      61.72***       67.75***       3.499***       3.488***       3.631***       3.639***       4.666***       4.682***       6.043***       6.093***
              (13.02)        (16.09)        (45.45)        (45.34)        (30.79)        (30.64)        (48.06)        (48.40)        (36.60)        (35.95)

Observations  72             67             43             41             43             41             43             41             68             59
R-squared     0.367          0.499          0.302          0.304          0.288          0.279          0.205          0.217          0.367          0.366

Robust t-statistics in parentheses
*** p<0.01, ** p<0.05, * p<0.1

Table 4: Hierarchy and the Second Person Singular

              (1)          (2)          (3)          (4)          (5)          (6)
VARIABLES     hof_pdi      hof_pdi      schw_hier    schw_hier    schw_egal    schw_egal

second        13.35***                  -0.128                    0.0383
              (2.741)                   (-0.732)                  (0.423)
kksecond                   14.90***                  -0.0850                   0.0153
                           (2.826)                   (-0.463)                  (0.160)
Constant      51.55***     49.67***     2.318***     2.284***     4.827***     4.838***
              (13.30)      (11.43)      (15.11)      (14.07)      (68.96)      (64.55)

Observations  72           67           43           41           43           41
R-squared     0.088        0.099        0.015        0.006        0.004        0.001

Robust t-statistics in parentheses
*** p<0.01, ** p<0.05, * p<0.1
