Unit 3 Chapter 3: Discovering and Describing Patterns of Variation and Change
ETHICAL LINGUISTICS
Sociolinguists have always been concerned with the ethical treatment of the people who
agree to provide data. Ethics begins with the permission to record someone's speech. We
always obtain informed consent before recordings are made, usually in written form, and
ethics panels or institutional review boards will usually need to approve the study as
well. In speech communities with low rates of literacy, verbal consent is acceptable.
Informed consent means that speakers need to know more than the simple fact that they
are being recorded: they should have some idea of what we will do with that speech, for
example, "we will use the recordings to describe how people use language in this
community." It is best not to be too specific: speakers are probably not interested in the
intricate details of the project, and one does not want to sound too academic. Moreover,
if we tell them very specific things, like which variable we are interested in, we call
attention to it and speakers may adjust their use of that variable.
Wolfram outlines three principles:
1. Principle of error correction: 'A scientist who becomes aware of a widespread
idea or social practice with important consequences that is invalidated by his
own data is obligated to bring this error to the attention of the widest possible
audience'.
2. Principle of debt incurred: 'an investigator who has obtained linguistic data
from members of a speech community has the obligation to use the knowledge
based on that data for the benefit of that community, when it has need of it'.
3. Principle of linguistic gratuity: 'Investigators who have obtained linguistic data
from members of a speech community should actively pursue ways in which
they can return linguistic favours to the community'.
Many variationist sociolinguists argue for going beyond research reports in order to
explain findings to the public. Labov and Wolfram, for example, have written popular
books and magazine articles to educate the public about the systematicity of vernacular
dialects and the importance of local ways of speaking to the community. When designing
a research project, we should keep in mind ways of benefiting the community and
involving its members in our research.
FINDING LANGUAGE TO MEASURE
The basic outline of a variationist study is:
1. A speech community to study is identified, and at least one linguistic variable.
2. If the researcher is not already familiar with the community, preliminary
ethnographic study is performed to understand its boundaries, identities and
social practices, which may later be correlated with variation. Exploratory
interviews may be performed.
3. The parameters of the speech community are defined, and a strategy for
recording speech samples, attitudes, and social practices of members of the
community is designed.
4. The data are collected.
5. Linguistic variables and factors are coded.
6. Significant factors affecting the linguistic variables are discovered using
statistical tests.
7. Other data are brought to bear to explain the correlational patterns that are
found.
CODING VARIABLES
Coding the speech means measuring and recording each token of the linguistic variable,
along with any social or linguistic factors that go with it. Ex: monophthongisation of
the English diphthong /aʊ/ in Pittsburgh, such that the word house is pronounced [ha:s].
This vowel varies from a completely monophthongal, open [a:] to a more standard [aʊ].
The first decision in coding is to determine what counts as the variable. Ex: vowels only
from monosyllabic words, or at least only in stressed positions. Make sure that one kind
of token does not dominate the coding; in this example, I might want to limit the
number of tokens of 'our' that can be used, or not count it at all. This is called
determining or defining the variable context, and it is an important decision to make
in performing a variation study.
We also need to decide the best way to measure the linguistic variable. This could be a
numeric measurement, ex: using a phonetic analysis program to measure the formants in
hertz at two points of each token of (aʊ), and using the difference between the two
measurements for each formant as the variable. This type of measurement provides a
fairly fine-grained view of the variation, with theoretically an infinite number of
possible values.
I could listen to each token and auditorily decide whether to categorise the variable as
monophthong or diphthong (or 'in-between' category), so I would have a categorical or
discrete variable. There are advantages to each approach: the interval variable can
presumably capture more subtle levels of variation, and is more objective and not
susceptible to uncontrolled biases of the analyst, as might be the case for the categorical
version. It will probably take longer to perform the coding for the continuous
measurement. Maybe only clearly monophthongal variants make a difference in the
speech community, and not subtle changes in the diphthong, which argues in favour of
using a categorical coding.
We are looking for measures that are reliable and valid. Reliability refers to the extent
to which the measures will be the same if repeated at another time or by another rater.
Validity refers to whether the measurement is a 'true' or accurate representation of the
thing being measured. One could run an experiment in which listeners are asked to say
whether they hear a difference between pronunciations of words, while controlling
exactly how different those pronunciations are. If listeners only notice extreme
differences, then a categorical variable might be more faithful to the speech community.
To check the coding, the formants of a subset of the vowels could be measured to
determine whether the auditory categories are objectively different, and other raters
could code a subset to determine whether they put most vowels in the same category.
The goal is to balance validity and reliability against practicality, and any study
should be explicit about how all of these decisions were made.
How does one set up the record for coding? The easiest way is to use a spreadsheet
program such as Microsoft Excel, or a data sheet in a statistics program such as SPSS or
SAS. It is best to have each row represent one token; the first item should be the
variant of the linguistic variable, followed by any factors such as linguistic
environment or speaker characteristics.
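As a minimal sketch (with invented field names, speakers, and token values, not data from any real study), such a one-token-per-row coding sheet can be built and exported with Python's standard csv module:

```python
import csv
import io

# Each row is one token: the variant of the linguistic variable first,
# then the linguistic and social factors coded alongside it.
# All field names and values here are illustrative.
tokens = [
    {"variant": "monophthong", "word": "house", "preceding": "h",
     "following": "s", "speaker": "S01", "gender": "F", "age_group": "old"},
    {"variant": "diphthong", "word": "out", "preceding": "#",
     "following": "t", "speaker": "S02", "gender": "M", "age_group": "young"},
    {"variant": "monophthong", "word": "downtown", "preceding": "d",
     "following": "n", "speaker": "S02", "gender": "M", "age_group": "young"},
]

# Write the coding sheet in CSV form, which Excel, SPSS, or R can all read.
buffer = io.StringIO()
fields = ["variant", "word", "preceding", "following",
          "speaker", "gender", "age_group"]
writer = csv.DictWriter(buffer, fieldnames=fields)
writer.writeheader()
writer.writerows(tokens)
sheet = buffer.getvalue()
print(sheet)
```

Keeping one token per row, with the variant in the first column, makes every later counting and statistical step a matter of filtering and tabulating rows.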
The information in this spreadsheet should be kept as independent as possible. Ex: not
only the word that the vowel appears in is recorded, but also the preceding and following
phonological environments. This way, we can explore the effect of the following
environment separately from the word, and the preceding environment separately from
the following. Coding a factor in more than one way (for example, age as both an exact
number and an age group) likewise makes it possible to explore which coding is more
robust.
How many factors should you code for? The rule is to code for as much as is practical
and to make as many distinctions as you can. We always code for more factors than we
think we need, and we always make as many distinctions or categories as we can, since
categories can be collapsed later but cannot be recovered.
DESCRIBING PATTERNS
Once we have finished our coding, we count the frequency of one variant for specific
factors. Ex: how often (aʊ) is monophthongal in the Pittsburgh example. To address the
constraints, transition and embedding problems, we will want to count in such a way
that we can find patterns: first describing the patterns, then testing whether these
patterns are significantly different from data that have no discernible pattern. One of
the primary patterns we are looking for is a difference in variant use by age group,
indicating a change in apparent time.
Tables are the most useful form for displaying a pattern that involves a single factor
group. First, everything should be labelled transparently: abbreviations may be used for
the (aʊ) variants, but the table must spell out what is in each column and row. Because
we are interested in relative frequencies, the data are represented as percentages for
each factor, and a 'Total' column shows that it is the rows, and not the columns, that
add up to 100 per cent. In tables of percentages, one should always show how the
percentages total to 100, as well as the number of tokens each percentage represents.
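To illustrate (with invented counts, not the book's data), the row-percentage layout described above can be computed directly from coded tokens:

```python
from collections import Counter

# Hypothetical tokens of (aw) by age group; the numbers are invented
# purely to illustrate the table layout described above.
tokens = (
    [("old", "monophthong")] * 30 + [("old", "diphthong")] * 70 +
    [("middle", "monophthong")] * 55 + [("middle", "diphthong")] * 45 +
    [("young", "monophthong")] * 20 + [("young", "diphthong")] * 80
)
counts = Counter(tokens)

print(f"{'Age group':<10}{'% monoph.':>12}{'% diph.':>10}{'Total N':>10}")
for group in ("old", "middle", "young"):
    mono = counts[(group, "monophthong")]
    diph = counts[(group, "diphthong")]
    n = mono + diph
    # Rows, not columns, sum to 100 per cent; N is reported per row.
    print(f"{group:<10}{100 * mono / n:>11.1f}%{100 * diph / n:>9.1f}%{n:>10}")
```

Reporting the N alongside each percentage is what lets a reader judge how much weight each row can bear.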
Tables are useful for relatively straightforward patterns when only one or two factors
are involved. A simple graph is often the best way to show patterns when two or more
factors are involved. In Figure 3.2 (p. 41), each graph represents a separate comparison
for each age category, which makes it clear that the oldest group does not show the
gender differentiation that is exhibited by the younger two categories. This
representation makes the point more forcefully than the numbers by themselves.
In Figure 3.3 (p. 42), the men and women of the same generation are matched by line
type, with all the female lines solid and all the male lines dashed. The oldest
generation's lines do not cross, but this is more difficult to see than with three
separate graphs. We could simplify the chart by plotting the frequency of only one
variant. If a variable has only two variants (a binary variable) then a graph showing the
frequency of one of them is the best way to present our data. It is usually the 'new' or
'vernacular' variant, or the output of the variable rule, that is graphed. Figure 3.4
(p. 44) still shows that men usually use more of the monophthongal variant, while at
the same time showing that the gender difference is less in the oldest category.
Tables and graphs are not dispassionate displays of objective data, but in many ways
create the patterns that you 'find' in your data. When presenting your data in written
work, choose the display that contributes best to the patterns you see in the data. If I
wanted to emphasise men's dominance of the use of the monophthongal variant, I would
use the bar graph. If I wanted to point out the changes in the gender difference from the
oldest to youngest age group, I would use the series of line graphs. You must explicate
the table or graph so that it fits into the patterns you want to point out in your narrative.
FINDING STRUCTURE IN VARIABILITY
We need to ask ourselves whether the patterns we see make a difference: are they
significant? There are two ways to think about significance. The first is statistical
significance, which means 'this pattern is very unlikely to have come about by chance;
there must be some reason why the linguistic variable patterns this way'. The other kind
of significance is practical significance. In statistics, given enough tokens or
speakers, it becomes very unlikely that a test will tell us that a difference is not
significant, however small it is. Ex: a hundred tokens of (aʊ) from each of a thousand
speakers, or 50,000 tokens per gender. If women produce 25,330 monophthongal tokens and
men produce 26,220, we will end up with a statistically significant outcome. But these
figures are 50.66 per cent and 52.44 per cent, which means that if we actually hear two
people each utter 100 tokens of (aʊ), the man will utter only about two more
monophthongal variants than the woman.
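Using the hypothetical counts above, a simple two-proportion z-test (a large-sample approximation, sketched here with only the standard library) shows how a practically tiny difference can still be overwhelmingly significant statistically:

```python
import math

# Hypothetical counts from the text: 50,000 tokens per gender.
women_mono, women_n = 25_330, 50_000
men_mono, men_n = 26_220, 50_000

p_women = women_mono / women_n   # 0.5066
p_men = men_mono / men_n         # 0.5244

# Two-proportion z-test under the large-sample normal approximation.
pooled = (women_mono + men_mono) / (women_n + men_n)
se = math.sqrt(pooled * (1 - pooled) * (1 / women_n + 1 / men_n))
z = (p_men - p_women) / se
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-tailed p

print(f"difference = {100 * (p_men - p_women):.2f} percentage points")
print(f"z = {z:.2f}, p = {p_value:.2e}")
```

The difference is under two percentage points, yet the p-value is microscopic: statistical significance says nothing, by itself, about whether the difference matters in the community.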
This section explains what statistical significance is, and how we model variation using
statistical procedures to discover robust patterns in the orderly heterogeneity of a
variable. Statistical tests tell us only how the variable patterns, what constraints
there are on it, and how it is embedded in the speech community. Before we run
statistics, we need to have coded for the factors most plausibly having a large effect
on the linguistic variable, and once the statistics have provided us with a numerical
model of the variation, we will need to add a qualitative, possibly theoretical,
interpretation to those patterns.
There are several ways that statistics can be used to help describe and verify patterns.
If we find a difference between or among groups, is that difference likely to have come
about by chance or not? We need to know how likely it is that the mean of our sample is
the mean we would measure if we measured the entire population. Given a certain number
of observations (tokens), we can estimate how likely our pattern is to be that of the
actual population. In general, we will always want to report frequencies.
Two of the most important ways to describe patterns are the mean and the median. The
mean is the average (all tokens added up and then divided by the number of tokens),
while the median is the 'middle' of the data (there are as many tokens larger than the
median as there are smaller). We use these measures to compare groups and describe
patterns.
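For example (with invented measurements), the median resists an outlier that drags the mean upward:

```python
import statistics

# Hypothetical F2 offsets (Hz) for tokens of (aw) from one speaker;
# 650 is a deliberate outlier, perhaps a mismeasured token.
offsets = [120, 135, 150, 160, 180, 200, 650]

mean = statistics.mean(offsets)      # pulled upward by the outlier
median = statistics.median(offsets)  # the middle value is unaffected
print(f"mean = {mean:.1f} Hz, median = {median} Hz")
```

Comparing the two for the same data is a quick check for skew or measurement errors before running any test.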
So we want to know whether the difference between men and women in the (aʊ) data is one
that is just because we were 'lucky', or because there is a real pattern. We do this by
determining the range of values that are the most likely for the 'real' mean for each
group (called the confidence interval), and then determining how much those two
ranges overlap. If they don't overlap, or don't overlap much, then we can be fairly
confident that the
rates of the two groups are different. We need to know how many tokens we have for
each group, because if we have more tokens, then the likelihood that we were just lucky
is lower. The t-test is one of the statistical tests that make these comparisons, which is
how most statistical tests work.
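A rough sketch of the confidence-interval logic, using invented per-speaker rates and a large-sample normal approximation in place of a proper t-test:

```python
import math
import statistics

def mean_ci(values, z=1.96):
    """Approximate 95% confidence interval for the mean.
    (A real t-test would use the t distribution, especially for small n.)"""
    m = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return m - z * se, m + z * se

# Hypothetical per-speaker rates of monophthongal (aw), in per cent.
women = [40, 45, 48, 50, 52, 47, 44, 49, 51, 46]
men   = [58, 62, 55, 60, 65, 57, 63, 59, 61, 64]

ci_women = mean_ci(women)
ci_men = mean_ci(men)
print(f"women: {ci_women[0]:.1f}-{ci_women[1]:.1f}")
print(f"men:   {ci_men[0]:.1f}-{ci_men[1]:.1f}")

# If the intervals do not overlap, the group means are very likely different.
print("intervals overlap:", ci_women[1] >= ci_men[0])
```

Notice that the standard error shrinks with the square root of the number of tokens, which is exactly why more tokens make a lucky difference less plausible.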
Another way of testing data is to look at how they are distributed, and to see whether
that distribution is different from what would be expected if everything were left
entirely to chance: this is the chi-square test.
In a table like 3.5 on p. 47, to decide whether a distribution has come about by chance,
you work out what distribution would be expected if the data were distributed randomly
(shown in 3.6) and find the difference between the observed and expected distributions.
You can then look up the probability that the difference is significant (such
differences are distributed in a way that statisticians can predict). A difference with
a 'p value' of 0.0181 means that there is roughly a 2 in 100 chance that our
distribution could have come about by chance. Note that this tells you whether the
distribution is significant, not what makes it significant.
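The observed-versus-expected computation can be sketched with invented counts; the expected count for each cell is the row total times the column total divided by the grand total:

```python
import math

# Hypothetical observed counts (variant by gender) in a 2x2 table.
observed = {("mono", "M"): 60, ("diph", "M"): 40,
            ("mono", "F"): 40, ("diph", "F"): 60}

rows = ("mono", "diph")
cols = ("M", "F")
grand = sum(observed.values())

row_tot = {r: sum(observed[(r, c)] for c in cols) for r in rows}
col_tot = {c: sum(observed[(r, c)] for r in rows) for c in cols}

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi2 = 0.0
for r in rows:
    for c in cols:
        expected = row_tot[r] * col_tot[c] / grand
        chi2 += (observed[(r, c)] - expected) ** 2 / expected

# With one degree of freedom the p-value has a closed form;
# for bigger tables one would consult a chi-square table or a stats package.
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

Here every expected cell is 50, so the statistic measures how far each observed cell strays from what independence of variant and gender would predict.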
A similar procedure for a different kind of data is the Analysis of Variance, or ANOVA.
This is like the chi-square test in that you are trying to decide whether a pattern is
due to chance, but it is used when we have interval data (real numbers that cannot be
reduced to integers or counts of things). It is similar to the t-test, but it can handle
more than two groups. The test can say whether the different categories are distributed
similarly or differently, and then you need to decide where the difference lies.
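A hand-computed F statistic illustrates the between-versus-within logic of ANOVA (the measurements are invented; in practice one would use a statistics package to obtain the p-value):

```python
import statistics

# Hypothetical vowel-offset measurements (Hz) for three age groups.
groups = {
    "old":    [210, 230, 205, 220, 215],
    "middle": [180, 175, 190, 185, 170],
    "young":  [150, 145, 160, 155, 140],
}

all_values = [v for g in groups.values() for v in g]
grand_mean = statistics.mean(all_values)

# Between-group variability: how far each group mean sits from the grand mean.
ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                 for g in groups.values())
# Within-group variability: how spread out tokens are inside each group.
ss_within = sum((v - statistics.mean(g)) ** 2
                for g in groups.values() for v in g)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f"F({df_between}, {df_within}) = {f_stat:.1f}")
# A large F says the groups differ somewhere; post-hoc tests say where.
```

The F ratio is large when the groups are far apart relative to the spread inside each group, which is the same intuition as the non-overlapping confidence intervals above, generalised to three or more groups.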
These tests are useful tools for variationist analysis, but they lack the ability to
decide which factors are having effects on the linguistic variable. In Figure 3.1, each
factor could be something that makes a difference in frequency with which (aʊ) is
monophthongal. With chi-square, we could determine whether or not the distribution
was significant, but not the importance of each effect. This is where the idea of
modelling comes in.
The idea is to write an equation that uses the factors we have coded for to predict the
rate of our linguistic variable. We then compare the prediction and the real data. The
best such equation will give us the size of the effect of each factor. We want to be able
to predict how frequently (aʊ) is manifest as [a:].
If the speaker is a woman, the gender value will be negative, and it will be positive if
the speaker is a man. If the linguistic variable is binary, we use probabilities
instead, and we obtain an overall probability after multiplying all the individual
factor probabilities together. This is the
essence of Varbrul, the dominant statistical procedure used by variationists. It works by
creating a model and comparing it to the real data. It then changes the model slightly
to see whether the new model fits better; if it does, it keeps the new model and tweaks
it again. This comparison goes on until the program can't do anything to make the
model fit the data better. The output is a series of probability weights that tell us how
important different factors are, and in what direction they change the rate of our
linguistic variable.
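Since Varbrul is in essence a logistic-regression model fitted by exactly this kind of fit-compare-tweak loop, the idea can be sketched with a toy gradient-descent fit. The data and factor names below are invented, and this is an illustration of the principle rather than Varbrul's actual algorithm:

```python
import math

# Each token: (is_male, is_prevoiceless, outcome), outcome 1 = [a:].
tokens = (
    [(1, 1, 1)] * 8 + [(1, 1, 0)] * 2 +   # men, pre-voiceless: mostly [a:]
    [(1, 0, 1)] * 5 + [(1, 0, 0)] * 5 +
    [(0, 1, 1)] * 4 + [(0, 1, 0)] * 6 +
    [(0, 0, 1)] * 1 + [(0, 0, 0)] * 9     # women, elsewhere: rarely [a:]
)

def predict(w, x1, x2):
    """Model: probability of [a:] from factor weights (logistic curve)."""
    return 1 / (1 + math.exp(-(w[0] + w[1] * x1 + w[2] * x2)))

# Repeatedly tweak the weights so the model fits the data a little better,
# stopping after a fixed number of rounds (plain gradient descent).
w = [0.0, 0.0, 0.0]
for _ in range(3000):
    grad = [0.0, 0.0, 0.0]
    for x1, x2, y in tokens:
        err = predict(w, x1, x2) - y
        grad[0] += err
        grad[1] += err * x1
        grad[2] += err * x2
    w = [wi - 0.05 * gi for wi, gi in zip(w, grad)]

# Positive weights favour the monophthong; with this toy data,
# being male and a pre-voiceless environment should both come out positive.
print(f"intercept={w[0]:.2f}, male={w[1]:.2f}, pre-voiceless={w[2]:.2f}")
```

The fitted weights play the role of Varbrul's probability weights: their sign gives the direction of each factor's effect, and their magnitude its relative importance.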
It can be argued that variationists should be using a newer technique known as mixed-
effects modelling, a technique like multiple regression, but one that models factors
differently depending on their status as so-called fixed or random effects. 'Fixed' and
'random'
come from how they are seen in experimental methods. A fixed effect is essentially a
factor that the experimenter controls, and suspects has an effect on the linguistic
variable. A random effect is one that might have an effect on the variable, but the effect
is not predictable in advance. Our main random effect will be the individual speaker
(and possibly the lexical item), while our fixed effects will be linguistic factors and
social grouping factors such
as age, gender, and class. What the mixed effects model does is factor out the effect of
the individual speaker from the effect of the speaker's group. We use these statistics to
explore which of the patterns that we see in the data are significant. Their meaning
depends on understanding how the patterns we found relate to those that have been found
in other variation studies.