
Module 7:

Data Collection and Analysis

C. DATA ANALYSIS
Descriptive Research
It generally uses different types of descriptive
statistics: frequencies, central tendencies (averages),
and variability.
Descriptive Statistics
• Describe what is and what the data show.
• Used to present quantitative descriptions in manageable form.
• Help to simplify data collected from a large number of people on any measure.
• Each descriptive statistic reduces lots of data into a simpler summary.
• Typically distinguished from inferential statistics.
The Distribution
A distribution is a summary of the frequency of individual
values or ranges of values for a variable. The simplest
distribution would list every value of a variable and
the number of persons who had each value.
Frequencies
Frequencies indicate the number of occurrences of a language
phenomenon, and provide impressions of, and a better
understanding of, the learners' proficiency in the
language elements.
The Distribution

Category             Percent
Under 35 years old   9%
36-45                21%
46-55                45%
56-65                19%
66+                  6%

Figure 1. Frequency distribution bar chart
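A frequency distribution like the one above can be tabulated in a few lines of Python. This is a rough sketch: the list of age-category responses here is hypothetical, not the data behind Figure 1.

```python
from collections import Counter

# hypothetical list of respondents' age categories
ages = ["46-55", "36-45", "Under 35", "46-55", "56-65", "46-55", "66+", "36-45"]

counts = Counter(ages)  # absolute frequency of each category
total = sum(counts.values())
# relative frequencies, expressed as percentages
percents = {category: 100 * n / total for category, n in counts.items()}
```

Plotting `percents` as a bar chart gives exactly the kind of figure shown above.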
[Figure: histogram/bar chart of the frequency distribution above, with Frequency (0%–50%) on the vertical axis and the age categories (Under 35, 36-45, 46-55, 56-65, 66+) on the horizontal axis.]
Central tendencies
These include the mean, the mode, and the median. These measures
provide information about the average and typical behavior of
the language learners as regards the linguistic elements being
investigated.

The mean is the measure obtained by adding all scores of the
respondents and dividing the sum by the number of subjects.

The mode refers to the score which occurs most frequently in the
large group of respondents.

The median is the score which divides the
population into two, in which half of the scores are
above and half are below it. (Seliger and Shohamy;
Catane, 2000)
For example, the mean or average quiz score is determined by summing all the scores
and dividing by the number of students taking the exam. Consider the test
score values:

15 20 21 20 36 15 25 15

The sum of these 8 values is 167, so the mean is 167/8 = 20.875.


For example, if there are 500 scores in the list, the median lies halfway between
scores #250 and #251. If we order the 8 scores shown above, we get:

15 15 15 20 20 21 25 36

There are 8 scores, so scores #4 and #5 represent the halfway
point. Since both of these scores are 20, the median is 20. If the
two middle scores had different values, you would have to
interpolate (average the two middle scores) to determine the median.
In our example, the value 15 occurs three times and is the mode.

15 15 15 20 20 21 25 36

Notice that for the same set of 8 scores we got three different values
(20.875, 20, and 15) for the mean, median and mode respectively. If the
distribution is truly normal (i.e., bell-shaped), the mean, median and mode are
all equal to each other.
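These three measures can be checked with Python's standard statistics module, using the same eight test scores:

```python
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]

mean = statistics.mean(scores)      # 167 / 8 = 20.875
median = statistics.median(scores)  # averages the two middle ordered values (both 20)
mode = statistics.mode(scores)      # 15, which occurs three times
```

For an even number of scores, statistics.median performs the interpolation automatically by averaging the two middle values.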
Dispersion
Dispersion refers to the spread of the values around the central tendency. There are
two common measures of dispersion: the range and the standard deviation.

The range is simply the highest value minus the lowest value.

In our example, the high value is 36 and the low is 15, so the range is 36 − 15 = 21.

15 15 15 20 20 21 25 36

The standard deviation is a more accurate and detailed estimate of
dispersion, because an outlier can greatly exaggerate the range (as was true
in this example, where the single outlier value of 36 stands apart from the
rest of the values). The standard deviation shows the relation that a set of
scores has to the mean of the sample.
Again, let's take the set of scores:
15 15 15 20 20 21 25 36

to compute the standard deviation, we first find the distance between each value and the mean.
We know from above that the mean is 20.875.
Notice that values that are
below the mean have negative
discrepancies and values above
it have positive ones. Next, we
square each discrepancy and sum the squares;
here, the result is 350.875. After that, divide it by 7
(the number of scores minus 1): 350.875/7 = 50.125.
This value is known as the variance. To get the standard deviation, we take the square root of
the variance (remember that we squared the deviations earlier). This gives SQRT(50.125)
≈ 7.0799.
Although this computation may seem convoluted, it's actually quite simple.
To see this, consider the formula for the standard deviation:

s = √( Σ(x − x̄)² / (n − 1) )

where x is each score, x̄ is the mean, and n is the number of scores.

For instance, since the mean in our example is 20.875 and the standard deviation is 7.0799, and
roughly 95% of the values in a normal distribution fall within two standard deviations of the
mean, we can estimate that approximately 95% of the scores will fall in the range
of 20.875 − (2 × 7.0799) to 20.875 + (2 × 7.0799), or between 6.7152 and 35.0348. This kind of information
is a critical stepping stone to enabling us to compare the performance of an individual on one
variable with their performance on another, even when the variables are measured on entirely
different scales.
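The dispersion computations above can be reproduced with Python's statistics module, using the same eight scores:

```python
import statistics

scores = [15, 15, 15, 20, 20, 21, 25, 36]

value_range = max(scores) - min(scores)   # 36 - 15 = 21
variance = statistics.variance(scores)    # 350.875 / 7 = 50.125 (sample variance, n - 1)
sd = statistics.stdev(scores)             # sqrt(50.125) ≈ 7.0799
mean = statistics.mean(scores)
# roughly 95% of a normal distribution falls within two standard deviations of the mean
low, high = mean - 2 * sd, mean + 2 * sd  # ≈ 6.7152 to 35.0348
```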
Different types of experimental designs call for different
methods of analysis. When comparing two groups, such as
experimental and control, the t-test is used; whereas, if more than
two groups are compared, one-way analysis of variance is an
appropriate statistical measure. Factorial analysis of
variance is used for more complex experimental designs.
What Is a T-Test?
A t-test is a statistical test that is used to compare the means of two
groups. It is often used in hypothesis testing to determine whether a
process or treatment actually has an effect on the population of interest, or
whether two groups are different from one another.
When to use a t-test

A t-test can only be used when comparing the means of two groups (a.k.a. a pairwise
comparison). If you want to compare more than two groups, or if you want to do multiple
pairwise comparisons, use an ANOVA test or a post-hoc test.

The t-test is a parametric test of difference, meaning that it makes the same assumptions
about your data as other parametric tests. The t-test assumes your data:

1. are independent,
2. are (approximately) normally distributed, and
3. have a similar amount of variance within each group being compared (a.k.a.
homogeneity of variance).
What type of t-test should I use?

When choosing a t-test, you will need to consider two things:
whether the groups being compared come from a single
population or two different populations, and whether you
want to test the difference in a specific direction.

One-tailed or two-tailed t-test?
• If you only care whether the two populations are
different from one another, perform a two-tailed t-test.
• If you want to know whether one population mean is
greater than or less than the other, perform a one-tailed
t-test.

One-sample, two-sample, or paired t-test?
• If the groups come from a single population (e.g. measuring
before and after an experimental treatment), perform a paired t-test.
• If the groups come from two different populations (e.g. two
different species, or people from two separate cities), perform
a two-sample t-test (a.k.a. independent t-test).
• If there is one group being compared against a standard value (e.g.
comparing the acidity of a liquid to a neutral pH of 7), perform
a one-sample t-test.
Performing a t-test

The t-test estimates the true difference between
two group means using the ratio of the
difference in group means over the pooled
standard error of both groups. You can calculate
it manually using a formula, or use statistical
analysis software.
T-test formula

The formula for the two-sample t-test (a.k.a. the Student's t-test) is:

t = (x̄1 − x̄2) / √( s² (1/n1 + 1/n2) )

In this formula, t is the t-value, x̄1 and x̄2 are the means of
the two groups being compared, s² is the pooled standard
error of the two groups, and n1 and n2 are the number of
observations in each of the groups.

A larger t-value shows that the difference between
group means is greater than the pooled standard
error, indicating a more significant difference
between the groups.

You can compare your calculated t-value against the
values in a critical value chart to determine whether
your t-value is greater than what would be expected
by chance. If so, you can reject the null hypothesis
and conclude that the two groups are in fact
different.
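A minimal Python sketch of this formula, using the pooled variance of the two groups; both score lists below are hypothetical, standing in for a control and an experimental group:

```python
import math
import statistics

def two_sample_t(x1, x2):
    """Two-sample (Student's) t-value: difference in means over the
    pooled standard error of the two groups."""
    n1, n2 = len(x1), len(x2)
    # pooled variance: weighted average of the two sample variances
    s2 = ((n1 - 1) * statistics.variance(x1) +
          (n2 - 1) * statistics.variance(x2)) / (n1 + n2 - 2)
    return (statistics.mean(x1) - statistics.mean(x2)) / math.sqrt(s2 * (1 / n1 + 1 / n2))

# hypothetical scores for a control group and an experimental group
t_value = two_sample_t([15, 15, 15, 20, 20, 21, 25, 36], [22, 25, 24, 28, 30, 26, 29, 27])
```

The sign of the t-value only reflects which mean is larger; for a two-tailed test, its absolute value is compared against the critical value.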
T-test function in statistical software

Most statistical software (R, SPSS, etc.)
includes a t-test function. This built-in
function will take your raw data and
calculate the t-value. It will then compare it
to the critical value, and calculate a p-value.
This way you can quickly see whether your
groups are statistically different.
Interpreting test results

If you perform the t-test for your flower hypothesis in R, you
will receive the following output. The output provides:
1. An explanation of what is being compared, called data in the output table.
2. The t-value: -33.719. Note that it's negative; this
is fine! In most cases, we only care about the
absolute value of the difference, or the distance
from 0. It doesn't matter which direction.
3. The degrees of freedom: 30.196. Degrees of
freedom is related to your sample size, and
shows how many 'free' data points are available
in your test for making comparisons. The
greater the degrees of freedom, the better your
statistical test will work.
4. The p-value: 2.2e-16 (i.e. 2.2 × 10⁻¹⁶, a decimal
point followed by 15 zeros and then 22). This describes
the probability that you would see a t-value as large
as this one by chance.
5. A statement of the alternate hypothesis (Ha).
In this test, the Ha is that the difference is not 0.
6. The 95% confidence interval. This is the
range of numbers within which the true
difference in means will be 95% of the time.
This can be changed from 95% if you want a
larger or smaller interval, but 95% is very
commonly used.
7. The mean petal length for each group.

From the output table, we can see that the
difference in means for our sample data is
-4.084 (i.e. 1.456 − 5.540), and the confidence
interval shows that the true difference in
means is between -3.836 and -4.331. So,
95% of the time, the true difference in
means will be different from 0. Our p-value
of 2.2e-16 is much smaller than 0.05, so we
can reject the null hypothesis of no
difference and say with a high degree of
confidence that the true difference in
means is not equal to zero.
An introduction to the one-way ANOVA

ANOVA, which stands for Analysis of Variance, is a statistical test used to
analyze the difference between the means of more than two groups.

One-way ANOVA example: As a crop researcher, you want to test the effect of
three different fertilizer mixtures on crop yield. You can use a one-way
ANOVA to find out if there is a difference in crop yields between the three
groups.
When to use a one-way ANOVA
Use a one-way ANOVA when you have collected data about one categorical
independent variable and one quantitative dependent variable. The independent
variable should have at least three levels (i.e. at least three different groups or
categories).
ANOVA tells you if the dependent variable changes according to the level
of the independent variable. For example:
• Your independent variable is social media use, and you assign groups
to low, medium, and high levels of social media use to find out if there is a
difference in hours of sleep per night.
• Your independent variable is brand of soda, and you collect data
on Coke, Pepsi, Sprite, and Fanta to find out if there is a difference in the price
per 100ml.
• Your independent variable is type of fertilizer, and you treat crop fields with
mixtures 1, 2 and 3 to find out if there is a difference in crop yield.
The null hypothesis (H0) of ANOVA is that there is no difference among
group means. The alternate hypothesis (Ha) is that at least one group
differs significantly from the overall mean of the dependent variable.
How does an ANOVA test work?

ANOVA determines whether the groups created by the
levels of the independent variable are statistically
different by calculating whether the means of the
treatment levels are different from the overall mean of
the dependent variable.

If any of the group means is significantly different
from the overall mean, then the null hypothesis is
rejected.

ANOVA uses the F-test for statistical significance. This
allows for comparison of multiple means at once,
because the error is calculated for the whole set of
comparisons rather than for each individual two-way
comparison (which would happen with a t-test).

The F-test compares the variance in each
group mean from the overall group
variance. If the variance within groups is
smaller than the variance between groups,
the F-test will find a higher F-value, and
therefore a higher likelihood that the
difference observed is real and not due to
chance.
Performing a one-way ANOVA

While you can perform an ANOVA by hand, it is difficult
to do so with more than a few observations. We will
perform our analysis in the R statistical program
because it is free, powerful, and widely available. For a
full walkthrough of this ANOVA example, see our guide
to performing ANOVA in R.

The sample dataset from our imaginary crop yield
experiment contains data about:
• fertilizer type (type 1, 2, or 3)
• planting density (1 = low density, 2 = high density)
• planting location in the field (blocks 1, 2, 3, or 4)
• final crop yield (in bushels per acre).

After loading the dataset into our R
environment, we can use the command
aov() to run an ANOVA. In this example we
will model the differences in the mean of
the response variable, crop yield, as a
function of type of fertilizer.

One-way ANOVA R code:
one.way <- aov(yield ~ fertilizer, data = crop.data)
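For intuition about what aov() computes, the F-value can also be derived from scratch. A minimal Python sketch follows; the three yield lists are hypothetical, standing in for the three fertilizer groups (the real crop.data file is not shown here):

```python
import statistics

def one_way_anova_f(*groups):
    """F-value for a one-way ANOVA: between-group mean square
    divided by within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)
    k, n = len(groups), len(all_values)
    # between-group sum of squares: variation of group means around the grand mean
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # within-group sum of squares: variation of values around their own group mean
    ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# hypothetical yields (bushels per acre) for fertilizer types 1, 2, and 3
f_value = one_way_anova_f([176, 177, 176], [176, 177, 178], [177, 178, 179])
```

The larger this ratio, the more of the total variation is explained by group membership rather than by noise within groups.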
Interpreting the results

The summary of an ANOVA test (in R) looks like this:

One-way ANOVA model summary R code:
summary(one.way)

The ANOVA output provides an estimate of how
much variation in the dependent variable
can be explained by the independent variable.

The first column lists the independent
variable along with the model residuals (a.k.a.
the model error).

The Df column displays the degrees of freedom
for the independent variable (calculated by
taking the number of levels within the variable
and subtracting 1), and the degrees of freedom
for the residuals (calculated by taking the total
number of observations minus 1, then
subtracting the number of levels in each of the
independent variables).
The Sum Sq column displays the sum of squares
(a.k.a. the total variation) between the group
means and the overall mean explained by that
variable. The sum of squares for the fertilizer
variable is 6.07, while the sum of squares of the
residuals is 35.89.

The Mean Sq column is the mean of the sum of
squares, which is calculated by dividing the sum
of squares by the degrees of freedom.

The F-value column is the test statistic from the
F test: the mean square of each independent
variable divided by the mean square of the
residuals. The larger the F-value, the more likely
it is that the variation associated with the
independent variable is real and not due to
chance.
The Pr(>F) column is the p-value of the F-statistic.
This shows how likely it is that the F-value
calculated from the test would have
occurred if the null hypothesis of no difference
among group means were true.
Using the Chi-Square Statistic in Research

The Chi-Square statistic is commonly used for testing relationships between
categorical variables. The null hypothesis of the Chi-Square test is that no
relationship exists between the categorical variables in the population; they are
independent. An example research question that could be answered using a Chi-
Square analysis would be:

Is there a significant relationship between voter intent and political party
membership?
The calculation of the Chi-Square statistic is quite straightforward and intuitive:

χ² = Σ (fo − fe)² / fe

where fo = the observed frequency (the observed counts in the cells)
and fe = the expected frequency if NO relationship existed between the
variables.

The Test of Independence assesses whether an
association exists between the two variables by
comparing the observed pattern of responses in
the cells to the pattern that would be expected
if the variables were truly independent of each
other.
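A minimal Python sketch of this calculation, using a hypothetical 2×2 voter-intent-by-party table (the numbers are invented for illustration):

```python
def chi_square(observed):
    """Chi-square statistic for a cross tabulation: the sum over all
    cells of (fo - fe)**2 / fe."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, fo in enumerate(row):
            # expected count in this cell if the two variables were independent
            fe = row_totals[i] * col_totals[j] / grand_total
            chi2 += (fo - fe) ** 2 / fe
    return chi2

# hypothetical 2x2 cross tabulation: voter intent (rows) by party membership (columns)
stat = chi_square([[30, 20], [10, 40]])
```

Each expected count fe is the product of its row and column totals divided by the grand total, which is exactly the pattern the cells would show under independence.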
How does the Chi-Square statistic work?

The Chi-Square statistic is most
commonly used to evaluate Tests of
Independence when using a cross
tabulation (also known as a bivariate
table). A cross tabulation presents the
distributions of two categorical variables
simultaneously, with the intersections of
the categories of the variables appearing
in the cells of the table.
How is the Chi-Square statistic run in SPSS and how is the output interpreted?

This statistic can be evaluated by comparing the actual value against a critical value
found in a Chi-Square distribution (where degrees of freedom is calculated as
(# of rows − 1) × (# of columns − 1)), but it is easier to simply examine the p-value
provided by SPSS. To make a conclusion about the hypothesis with 95% confidence,
the value labeled Asymp. Sig. (which is the p-value of the Chi-Square statistic)
should be less than .05 (which is the alpha level associated with a 95% confidence
level).

Is the p-value (labeled Asymp. Sig.) less than .05? If so, we can conclude that the
variables are not independent of each other and that there is a statistical
relationship between the categorical variables.

The p-value indicates that these variables are not independent of each other
and that there is a statistically significant relationship between the categorical
variables.
What are special concerns with regard to the Chi-Square statistic?

There are a number of important considerations when using the Chi-Square
statistic to evaluate a cross tabulation. Because of how the Chi-Square value
is calculated, it is extremely sensitive to sample size: when the sample size
is too large (~500), almost any small difference will appear statistically
significant.

It is also sensitive to the distribution within the
cells, and SPSS gives a warning message if cells
have fewer than 5 cases. This can be addressed
by always using categorical variables with a
limited number of categories (e.g., by
combining categories if necessary to produce a
smaller table).
