
Soc 2003

Statistical Methods and Computer Applications in Social Sciences
18/19 (1)
Lecture 4
Dr. Cemre Erciyes
Lecture 1 – Topics covered
 Key concepts: What is statistics? Data?
How is data collected?
 What is population and sample?
Parameter and statistic?
 What methods does statistics provide?
Design, Description, Inference
 Sampling and Measurement: Variable,
Types of Variables- Categorical (Nominal,
Ordinal); Quantitative-Interval Scale
(Continuous, Discrete)
Example MT questions from this
lecture
1. Please choose which of the following the above
definition belongs to? (5 points)
a) Data b) Statistic c) Parameter d) Variable
2. Explain the difference between parameter and statistic;
and give a formula of how the calculation of measures
of variance change for each. (15 points)
3. Give an example of each type of variable listed below.
(10 points)
4. Explain the methods that statistics provides, or explain
the difference between descriptive and inferential
statistics. (5 points)
Lecture 1-2- Topics covered
 Frequency distribution
 Frequency, Proportion, Percentage
 Cumulative Frequency, Cumulative
Percentage
 Histogram, Bar chart, Pie chart
 Stem and Leaf Graph
Lecture 2 – Topics Covered
 Measures of Central Tendency
 Mean
 Median (50th percentile, 2nd quartile or Q2)
 Percentile
 Lower and upper quartile (Q1 and Q3 or 25th
and 75th percentile)
 Mode
Example questions from this
lecture
1. For the below data prepare 1 frequency table (5
points), one bar chart (5 points), one histogram (5
points), 1 stem and leaf graph (5 points).
2. Calculate the suitable measures of central tendency for each
of the given three variables (mean, median, mode,
lower and upper quartiles). If a measure cannot be
calculated, write down why. (10 points)
Lecture 3 – Topics Covered
 When to use which measure of central
tendency?
 Normal distribution (introduction)
 Measures of variance (Range, Variance,
Std. Dev., Empirical Rule)
Example questions from this
lecture
1. Above is data ... from 4 different groups.
For this data, calculate the suitable
measures of variance and compare the 4
groups. In your comparison, state which
group is performing better. (20 points)
2. What is a normal distribution? Explain
and draw a graph for the data given
above. (10 points)
Today
 Grouped data
 Empirical Rule
 Properties of distributions
 Weighted mean
Examples of questions from today

1. Compare the distributions of the four groups.
Draw the distribution graph and comment on
the spread of data. (20 points)
2. What is the Empirical Rule? Calculate the
percentage of the data between x and y in the
graph below according to the rule. (10 points)
3. Bonus Questions (10 points)
Let's remember some topics
we covered...
Difference Between Descriptive
and Inferential Statistics
• Descriptive statistics: describes and summarizes data. You are
just describing what the data shows: a trend, a specific feature, or a
certain statistic (like a mean or median).
• Inferential statistics: uses statistics to make predictions.
• For example, descriptive statistics about a college could include:
the average SAT score for incoming freshmen; the median income
of parents; racial makeup of the student body. It says nothing
about why the data might exist, or what trends you might be able to
see from the data.
• When you take your data and start to make predictions about future
behavior or trends, that’s inferential statistics. Inferential statistics
also allows you to take sample data (e.g. from one university) and
apply it to a larger population (e.g. all universities in the country).
Measures of Central Tendency
 The best way to reduce a set of data and
still retain its information is to summarize it
with a single value.
 Measures of central tendency—mean,
median, and mode—can help you capture,
with a single number, what is typical of a
data set.
Properties of the mean
 Mean is appropriate only for quantitative variables.
 The mean can be highly influenced by an observation
that falls well above or well below the bulk of the data,
called an outlier.
 The mean is pulled in the direction of the longer tail of a
skewed distribution, relative to most of the data.
 The mean is the point of balance on the number line
when an equal weight is at each observation point.
Properties of the median
 If there is an even number of scores, you can either find
the arithmetic average of the two middle scores or
calculate the 50th percentile.
 The median, like the mean, is appropriate for quantitative
variables.
 For symmetric distributions, the median and the mean
are identical.
 For skewed distributions, the mean lies toward the
direction of skew (the longer tail) relative to the median.
 The median is insensitive to the distances of the
observations from the middle, since it uses only the
ordinal characteristics of the data.
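A small Python sketch of how differently the mean and the median react to an outlier. The income values are made up for illustration and are not from the lecture.

```python
from statistics import mean, median

# Hypothetical incomes (in thousands); invented for illustration only.
incomes = [20, 22, 25, 27, 30]
incomes_with_outlier = incomes + [300]  # one unusually large observation

mean_before, median_before = mean(incomes), median(incomes)
mean_after, median_after = mean(incomes_with_outlier), median(incomes_with_outlier)

# The mean is pulled strongly toward the outlier; the median barely moves.
print("mean:  ", mean_before, "->", round(mean_after, 2))
print("median:", median_before, "->", median_after)
```

The median changes only because the middle position shifts, not because of how far away the new value is.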
Grouped data
• Grouped data is data that has been bundled together in categories.
• Histograms and frequency tables can be used to show this type of
data.
• When you have a frequency table or other group of data, the original
set of data is lost — replaced with statistics for the group. You can’t
find the exact sample mean (as you don’t have the original data) but
you can find an estimate.
• The formula for estimating the sample mean for grouped data is:
x̄ = ∑(f·x) / ∑f, where
x̄ is the sample mean,
x is the class (or category) midpoint,
f is the class frequency.
Relative frequency histogram showing book
sales for a certain day, sorted by price.
Grouped data example
Score          Frequency (f)   Midpoint (x)            fx
5 <= x < 10    1               5 + (10 - 5)/2 = 7.5    1 * 7.5 = 7.5
10 <= x < 15   4               12.5                    4 * 12.5 = 50
15 <= x < 20   6               17.5                    105
20 <= x < 25   4               22.5                    90
25 <= x < 30   2               27.5                    55
30 <= x < 35   3               32.5                    97.5
Total          20                                      405
x̄ = 405 / 20 = 20.25
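The same calculation as a short Python sketch. The class limits and frequencies are those of the example table; the function name grouped_mean is chosen here for illustration.

```python
# Class intervals (lower limit, upper limit) and frequencies from the example table.
classes = [(5, 10), (10, 15), (15, 20), (20, 25), (25, 30), (30, 35)]
frequencies = [1, 4, 6, 4, 2, 3]

def grouped_mean(classes, frequencies):
    """Estimate the mean from class midpoints weighted by class frequencies."""
    midpoints = [lower + (upper - lower) / 2 for lower, upper in classes]
    total_fx = sum(f * x for f, x in zip(frequencies, midpoints))
    return total_fx / sum(frequencies)

estimate = grouped_mean(classes, frequencies)
print(estimate)  # 20.25
```

This is only an estimate of the mean, because every score in a class is treated as if it were equal to the class midpoint.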
Let's draw the histogram for this
grouped data
Score      Frequency   Midpoint (x)   fx
5 to 9     1           7.5            7.5
10 to 14   4           12.5           50
15 to 19   6           17.5           105
20 to 24   4           22.5           90
25 to 29   2           27.5           55
30 to 34   3           32.5           97.5
Total      20                         405
The mean of grouped data (x̄) = 405 / 20 = 20.25.
Normal distribution

In a normal distribution, mean,
median and mode are
identical in value.
Measures of Variance
Measures of variability
describe the spread of the
data
Measures of Variance
 The range of a set of data is the difference between the highest and
lowest values in the set. To find the range, first order the data from
least to greatest. Then subtract the smallest value from the largest
value in the set.
 The variance averages the squared deviations about the mean.
 Its square root, the standard deviation, is easier to interpret,
describing a typical distance from the mean.
 The Empirical Rule states that for a bell-shaped distribution, about
68% of the observations fall within one standard deviation of the
mean, about 95% fall within two standard deviations, and nearly all,
if not all, fall within three standard deviations.
Range
 Range = difference between highest
and lowest observed values
 The range value of a data set is greatly
influenced by the presence of just one
unusually large or small value (outlier).
Variance
 The variance averages the squared
deviations about the mean.
 The variance of a data set tells you how
spread out the data points are. The closer
the variance is to zero, the more closely
the data points are clustered together.
The expression ∑ (xi - x̄)² in these formulas
is called a sum of squares.
Why do we subtract one when
calculating sample variance?
 Remember that a sample is only part of
the population and isn't actually the whole
picture.
 Deviations are measured from the sample mean x̄
rather than the unknown population mean, so the
sum of squares tends to come out too small.
Statisticians compensate by dividing by n - 1
instead of n, that is, by subtracting one from
the total number of observations in the data
set.
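One way to see the effect is to list every possible sample from a tiny population and average the two versions of the variance. The sketch below is an illustration under assumptions: the population of four values is invented, and samples of size 2 are drawn with replacement.

```python
from itertools import product

def sum_of_squares(values):
    """Sum of squared deviations about the mean of the given values."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)

population = [2, 4, 6, 8]  # hypothetical tiny population
population_variance = sum_of_squares(population) / len(population)

n = 2  # sample size
samples = list(product(population, repeat=n))  # all 16 samples, with replacement

# Average of the sample variances over all possible samples.
avg_dividing_by_n = sum(sum_of_squares(s) / n for s in samples) / len(samples)
avg_dividing_by_n_minus_1 = sum(sum_of_squares(s) / (n - 1) for s in samples) / len(samples)

print(population_variance)        # 5.0
print(avg_dividing_by_n)          # 2.5
print(avg_dividing_by_n_minus_1)  # 5.0
```

Dividing by n gives values that are too small on average, while dividing by n - 1 matches the population variance on average.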
Spread of data for different values of s²
Standard deviation
 The square root of the variance, the standard
deviation, is easier to interpret, describing the
typical distance of an observation from the
mean.
 The larger the standard deviation s, the
greater the spread of the data
Properties of the Standard Deviation

 s ≥ 0.
 s = 0 only when all observations have the same value.
 For instance, if the ages for a sample of five students are
19,19,19,19,19, then the sample mean equals 19, each of the
five deviations equals 0, and s = 0. This is the minimum possible
variability.
 The greater the variability about the mean, the larger is
the value of s.
 If the data are rescaled, the standard deviation is also
rescaled.
 For instance, if we change annual incomes from dollars (such as
34,000) to thousands of dollars (such as 34.0), the standard
deviation also changes by a factor of 1000 (such as from 11,800
to 11.8)
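The rescaling property can be checked with a few lines of Python. The four incomes below are hypothetical and serve only as an illustration.

```python
import math
from statistics import stdev

# Hypothetical annual incomes in dollars; not data from the lecture.
incomes_dollars = [22_000, 34_000, 41_000, 57_000]
incomes_thousands = [x / 1000 for x in incomes_dollars]

s_dollars = stdev(incomes_dollars)      # sample standard deviation (divides by n - 1)
s_thousands = stdev(incomes_thousands)

# Dividing every observation by 1000 divides s by 1000 as well.
print(round(s_dollars, 1), round(s_thousands, 4))
print(math.isclose(s_dollars / 1000, s_thousands))  # True
```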
Same mean different variability
Same variability different mean
 Variability provides a quantitative
measure of the degree to which
scores in a distribution are spread out
or clustered together.
In other words, variability refers to the
degree of “differentness” of the scores in
the distribution.
Empirical rule
 A distribution with s = 5.1 has greater variability than one with s =
3.3, but how do we interpret how large s = 5.1 is?
 We’ve seen that a rough answer is that s is a typical distance of an
observation from the mean.
 To illustrate, suppose the first exam in your course, graded on a
scale of 0 to 100, has a sample mean of 77. A value of s = 0 is
unlikely, since every student must then score 77. A value such as s =
50 seems implausibly large for a typical distance from the mean.
Values of s such as 8 or 12
seem much more realistic.
 More precise ways to interpret s require further knowledge of the
shape of the frequency distribution.
 The following rule provides an interpretation for many data sets.
Empirical rule
 If the histogram of the data is approximately
bell shaped, then
 1. About 68% of the observations fall between x̄ - s
and x̄ + s.
 2. About 95% of the observations fall between x̄ - 2s
and x̄ + 2s.
 3. All or nearly all observations fall between x̄ - 3s and
x̄ + 3s.
 The rule is called the Empirical Rule because
many distributions seen in practice (that is,
empirically) are approximately bell shaped.
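A minimal Python sketch of the three intervals. It reuses the exam mean of 77 from the earlier illustration and assumes s = 8, one of the values described there as realistic. The function name is chosen here for illustration.

```python
def empirical_rule_intervals(mean, s):
    """Intervals expected to hold about 68%, about 95% and nearly all
    observations when the histogram is approximately bell shaped."""
    return {
        "about 68%": (mean - s, mean + s),
        "about 95%": (mean - 2 * s, mean + 2 * s),
        "nearly all": (mean - 3 * s, mean + 3 * s),
    }

# Exam example: sample mean 77, with s = 8 assumed for illustration.
intervals = empirical_rule_intervals(77, 8)
for label, (low, high) in intervals.items():
    print(f"{label}: {low} to {high}")
```

The upper limit of the widest interval (101) lies above the maximum score of 100, a reminder that the rule is only an approximation.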
Empirical rule visual
Properties of Empirical Rule
 The Empirical Rule applies only to distributions that are
approximately bell shaped.
 For other shapes, the percentage falling within two
standard deviations of the mean need not be near 95%.
 It could be as low as 75% or as high as 100%.
 Empirical Rule may not work well if the distribution is
highly skewed or if it is highly discrete, with the variable
taking few values.
 The exact percentages depend on the form of the
distribution.
Quiz - 2
Answers:
#1 below 60 => below the median => 50%
#2 between 50 and 70 => 95% (see the picture)
#3 between 60 and 65 => 34% (see the picture)
#4 greater than 65 => 13.5 + 2.35 + 0.15 = 16%
#5 between 70 and 75 => 2.35% (see the picture)
#6 between 55 and 75 => 68 + 13.5 + 2.35 = 83.85%
#7 below 70 => 100 - (2.35 + 0.15) = 97.50%
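The answers can be reproduced by adding up the bands of the bell curve. The sketch below assumes a mean of 60 and s = 5; these values are implied by the answers but are not stated in the surviving slide text. It only works for cut points that lie a whole number of standard deviations from the mean.

```python
MEAN, S = 60, 5  # assumed: inferred from the answers, not stated on the slide

# Percentage of observations in each band of width s, from "below -3s"
# up to "above +3s", as used in the answers.
BANDS = [0.15, 2.35, 13.5, 34.0, 34.0, 13.5, 2.35, 0.15]

def percent_between(low=None, high=None):
    """Percentage of observations between two scores (None = open end).
    Scores must be whole multiples of S away from MEAN, within 3 * S."""
    start = 0 if low is None else round((low - MEAN) / S) + 4
    stop = len(BANDS) if high is None else round((high - MEAN) / S) + 4
    return round(sum(BANDS[start:stop]), 2)

print(percent_between(high=60))    # 1: below 60
print(percent_between(50, 70))     # 2: between 50 and 70
print(percent_between(60, 65))     # 3: between 60 and 65
print(percent_between(low=65))     # 4: greater than 65
print(percent_between(70, 75))     # 5: between 70 and 75
print(percent_between(55, 75))     # 6: between 55 and 75
print(percent_between(high=70))    # 7: below 70
```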
Properties of distributions
 Distributions are typically described with
three properties:
 Center: mean, median, mode
 Spread (variability): standard deviation,
variance
 Shape: unimodal, symmetric, skewed, etc.
Shapes of Frequency Distributions
 Unimodal, bimodal, and
rectangular
 Data distributions in statistics
can have one peak, or they
can have several peaks.
 The normal distribution,
or bell curve, has one peak.
 The bimodal distribution has
two peaks.
Bimodal distribution
• The “bi” in bimodal distribution refers to “two”
and modal refers to the peaks.
• In statistics, the term “mode” refers to the most
common number.
• The peaks in any distribution are the most
common number(s).
• The two peaks in a bimodal distribution also
represent two local maxima; these are points
where the frequencies stop increasing and start
decreasing.
What does a bimodal distribution
tell you?
• You’ve got two peaks of data, which usually indicates
you’ve got two different groups.
• For example, exam scores tend to be normally
distributed with a single peak.
• However, grades sometimes fall into a bimodal
distribution with a lot of students getting A grades and a
lot getting F grades.
• This can tell you that you are looking at two different
groups of students. It could be that one group is
underprepared for the class (perhaps because of a lack
of previous classes). The other group may have
overprepared.
Shapes of Frequency Distributions
 Symmetrical and skewed distributions

 Normal and kurtotic distributions


Skewness
• A measure of symmetry in a distribution.
• Or a measure of lack of symmetry in a distribution.
• A distribution, or data set, is symmetric if it looks the
same to the left and right of the center point.
• A standard normal distribution is perfectly symmetrical
and has zero skew.
Kurtosis
• A measure of whether the data are heavy-tailed
or light-tailed relative to a normal distribution.
• That is, data sets with high kurtosis tend to have
heavy tails, or outliers.
• Data sets with low kurtosis tend to have light
tails, or lack of outliers.
• A uniform distribution would be the extreme
case of light tails.
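Skewness and kurtosis can be computed from the deviations about the mean. The sketch below uses simple moment-based definitions, which are only one of several conventions in use (software packages often apply small-sample corrections). Kurtosis is reported as excess kurtosis, so a normal distribution scores about 0. The data sets are made up for illustration.

```python
def central_moment(values, k):
    """Average of the k-th powers of the deviations about the mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** k for x in values) / len(values)

def skewness(values):
    """Moment-based skewness: 0 for symmetric data, positive for a longer right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3: negative means lighter tails than a normal curve."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]        # hypothetical, evenly spread
right_skewed = [1, 1, 1, 1, 10]    # hypothetical, one large value

print(skewness(symmetric))                   # 0.0
print(round(excess_kurtosis(symmetric), 2))  # -1.3
print(skewness(right_skewed) > 0)            # True
```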
What can you say about these?
Variability of a distribution
• High variability means that the scores differ by a lot
• Low variability means that the scores are all similar
Which center when?
 Depends on a number of factors, like scale of
measurement and shape.
 The mean is the most preferred measure and it is
closely related to measures of variability
 However, there are times when the mean isn’t the
appropriate measure.
 Use the median if:
 The distribution is skewed
 The distribution is ‘open-ended’
 (e.g. your top answer on your questionnaire is ‘5 or more’)
 Data are on an ordinal scale (rankings)
 Use the mode if the data are on a nominal scale
Weighted mean

Denote the sample means for two sets of data with sample sizes n1
and n2 by x̄1 and x̄2. The overall sample mean for the combined set
of (n1 + n2) observations is the weighted average
x̄ = (n1·x̄1 + n2·x̄2) / (n1 + n2)
The numerator n1·x̄1 + n2·x̄2 is the sum of all the observations, since
n·x̄ = ∑x for each set of observations.
The denominator is the total sample size.
Example of two groups of students
asked to state their number of shoes.
 Number of shoes:
 30, 11, 12, 20, 14, 12, 15, 8, 6, 8, 10, 15, 25, 6, 35, 20, 20, 20, 5,
7, 5, 5, 5, 25, 15
Group 1: X̄1 = ∑X / n = (5 + 7 + 5 + 5 + 5) / 5 = 27 / 5 = 5.4
Group 2: X̄2 = ∑X / n = 327 / 20 = 16.35
 Suppose we want the mean of the entire group. Can
we simply add the two means together and divide by
2?
 NO. Why not?
The Weighted Mean
 Number of shoes:
 30, 11, 12, 20, 14, 12, 15, 8, 6, 8, 10, 15, 25, 6, 35, 20, 20, 20, 5,
7, 5, 5, 5, 25, 15
X̄1 = 5.4, X̄2 = 16.35
X̄N = (X̄1·n1 + X̄2·n2) / (n1 + n2) = (5.4 * 5 + 16.35 * 20) / (5 + 20) = 14.16
 Suppose we want the mean of the entire group. Can
we simply add the two means together and divide by
2?
 NO. Why not? We need to take into account the number
of scores in each mean.
The Weighted Mean
 Number of shoes:
 30, 11, 12, 20, 14, 12, 15, 8, 6, 8, 10, 15, 25, 6, 35, 20, 20, 20, 5,
7, 5, 5, 5, 25, 15
X̄N = (X̄1·n1 + X̄2·n2) / (n1 + n2) = (5.4 * 5 + 16.35 * 20) / (5 + 20) = 14.16
Let’s check: X̄ = ∑X / n = 354 / 25 = 14.16
 Both ways give the same answer
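The shoe example as a short Python sketch. The split into groups is an assumption read off the group sums: the five values 5, 7, 5, 5, 5 are taken as the small group and the remaining twenty values as the large group.

```python
import math

# Assumed split of the 25 observations, based on the group sums 27 and 327.
group_1 = [5, 7, 5, 5, 5]
group_2 = [30, 11, 12, 20, 14, 12, 15, 8, 6, 8,
           10, 15, 25, 6, 35, 20, 20, 20, 25, 15]

n1, n2 = len(group_1), len(group_2)
mean_1 = sum(group_1) / n1   # 5.4
mean_2 = sum(group_2) / n2   # 16.35

# Weighted mean: each group mean counts in proportion to its group size.
weighted_mean = (mean_1 * n1 + mean_2 * n2) / (n1 + n2)

# Check against the mean of all 25 observations taken together.
overall_mean = sum(group_1 + group_2) / (n1 + n2)

# Simply averaging the two means ignores the group sizes and gives a different value.
naive_average = (mean_1 + mean_2) / 2

print(round(weighted_mean, 2), round(overall_mean, 2), round(naive_average, 3))
```

The naive average treats the group of 5 as if it were as large as the group of 20, which is why it falls well below 14.16.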



Assignment 3
 With your own data
 Calculate the measures of variance where
appropriate
 Tell the properties of the distribution
 Divide your data into groups of 4 and 6
respectively, and compare means of each group
 Calculate the weighted mean for the two groups
Correction for the Variance example in
Lecture3-4.ppt:
The data was listed as 7, 15, 23, 7, 9, 13 but
it should have been 17, 15, 23, 7, 9, 13
The corrected version is below!
Variance – let's calculate
 You want to analyze the number of muffins
sold each day at a cafeteria. You sample
six days at random and get these
results: 17, 15, 23, 7, 9, 13.
 What can you say about this data
immediately?
Example
 17, 15, 23, 7, 9, 13
 Mode: There is no
mode
 Range: 23 - 7 = 16
Which one to use?
Sample variance 1
 xi => x̄=
 x1=
 x2=
 x3=
 x4=
 x5=
 x6=

 n= n-1=
Sample variance 1 – solution:
 xi => x̄ = (17+7+9+13+15+23)/6 = 14
 x1 = 17
 x2 = 7
 x3 = 9
 x4 = 13
 x5 = 15
 x6 = 23

 n = 6 n-1 = 5
Sample variance 2:
 (xi - x̄) => (xi - x̄)² =>
 x1 - x̄ = (x1 - x̄)² =
 x2 - x̄ = (x2 - x̄)² =
 x3 - x̄ = (x3 - x̄)² =
 x4 - x̄ = (x4 - x̄)² =
 x5 - x̄ = (x5 - x̄)² =
 x6 - x̄ = (x6 - x̄)² =

∑ (xi - x̄)² =
Sample variance 2 – solution:
 (xi - x̄) => (xi - x̄)² =>
 x1 - x̄ = 17-14 = 3 (x1 - x̄)² = 9
 x2 - x̄ = 7-14 = -7 (x2 - x̄)² = 49
 x3 - x̄ = 9-14 = -5 (x3 - x̄)² = 25
 x4 - x̄ = 13-14 = -1 (x4 - x̄)² = 1
 x5 - x̄ = 15-14 = 1 (x5 - x̄)² = 1
 x6 - x̄ = 23-14 = 9 (x6 - x̄)² = 81

∑ (xi - x̄)² = 166
 Sum of squares represents
squaring each deviation and then
adding those squares. It is
incorrect to first add the deviations
and then square that sum; this
gives a value of 0.
 The larger the deviations, the
larger the sum of squares and the
larger s tends to be.
Sample variance 3:
 s² = ∑ (xi - x̄)² / (n - 1)
 s² =
Sample variance 3 – solution:
 s² = ∑ (xi - x̄)² / (n - 1)
 s² = 166/5 = 33.2
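The three steps of the muffin example can be sketched in Python as follows. The variable names are chosen here for illustration. The last line goes one step further than the slides and takes the square root to get the standard deviation.

```python
import math

muffins = [17, 7, 9, 13, 15, 23]   # corrected data, in the order x1..x6

# Step 1: mean and sample size
n = len(muffins)
mean = sum(muffins) / n                              # 14.0

# Step 2: deviations from the mean and their squares
deviations = [x - mean for x in muffins]
sum_of_deviations = sum(deviations)                  # 0.0, so it cannot measure spread
sum_of_squares = sum(d ** 2 for d in deviations)     # 166.0

# Step 3: sample variance divides the sum of squares by n - 1
sample_variance = sum_of_squares / (n - 1)           # 33.2
sample_sd = math.sqrt(sample_variance)

print(mean, sum_of_squares, sample_variance, round(sample_sd, 2))
```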
