Lect 1 Descriptive Statistics

Descriptive Statistics

By
Dr. Jupiter Simbeye
Preliminary issues
STA121: Descriptive Statistics
• Course outline uploaded on the classroom
• Classroom code:

• Dates for tests:
Test 1: 21st April, 2023
Test 2: 26th May, 2023
2020 STA121 Performance
[Figure: bar chart of the percentage of students obtaining each grade (A+, A, B+, B, C+, C, C-, D, E, F); vertical axis: Percent, 0 to 25.]
Module Aims and Learning Outcomes

Aim:
• To introduce students to basic descriptive statistical analysis

Learning outcomes:
On successful completion of this module, students should be able to:
• Summarise data in the form of measures of central tendency, frequencies, tables and graphs,
• Interpret summary statistics,
• Apply descriptive statistics to answer practical questions.
Indicative Content

• Review of statistical concepts: definition of statistics, types of statistics, data and types, scales of measurement.

• Tables and graphs for frequencies and other statistics: use and interpretation of multi-way tables.

• Numerical summaries for quantitative data: percentile, quartile, deciles, mean, median, mode, range, variance and standard deviation, relative variation, coefficient of variation, skewness and kurtosis.
Indicative Content

• Processing single and multiple variables: concepts and calculations applied on real data, effect of outliers on calculation of standard deviation, summarising single columns of data.
• Risk and return periods: cumulative frequency distributions and their
interpretations.
• Introducing a statistics package: working with SPSS and Stata.
• Common complications when analysing survey data: analysis of
multiple response questions, presence of missing values in the data,
need to produce weighted tables, presence of zero values.
Descriptive Statistics
Terms and definitions
• Statistics: the discipline that concerns the collection, organization,
analysis, interpretation and presentation of data
• Two types of statistical methods are used in analyzing
data: descriptive statistics and inferential statistics.
• Descriptive statistics are used to summarize data from a sample e.g.
in form of mean or standard deviation.
• Inferential statistics are used when the data are viewed as a sample drawn from a specific population, about which conclusions are to be made.
Terms and definitions
• Population: In statistics, a population is a set of similar items or events which is of interest for some question or experiment.
• Sample: A sample is a set of individuals or objects collected or selected from a statistical population.
More terms and definitions

1. Variable: A variable is any quantity or attribute whose value varies from one unit of investigation to another.
Examples
a) Age: if you record the ages of students in this class, you are likely to get a different value each time you ask the next student's age.
b) Sex of babies at birth: a baby is born either male or female. In this case, “male” and “female” are the two possible values of the variable sex.
c) GPA: as you progress with your studies, your end-of-semester GPA is likely to differ from semester to semester.
More terms and definitions

2. Observation: An observation is the value taken by a variable for a particular unit of investigation.

Example
Below are the percentage grades obtained by 10 students in STA121:
67, 70, 55, 62, 40, 81, 90, 60, 69, 56

Each value in this list is an observation.
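As a rough illustration of how a set of observations is summarised, the sketch below stores the ten grades above in a list and computes their mean and standard deviation with Python's standard `statistics` module. The variable names are chosen here for illustration only, and the sample standard deviation (dividing by n - 1) is assumed.

```python
from statistics import mean, stdev

# Percentage grades of the 10 students listed above (one observation per student).
grades = [67, 70, 55, 62, 40, 81, 90, 60, 69, 56]

n = len(grades)          # number of observations
average = mean(grades)   # arithmetic mean of the observations
spread = stdev(grades)   # sample standard deviation (divides by n - 1)

print(f"n = {n}, mean = {average}, standard deviation = {spread:.2f}")
```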
More terms and definitions

3. Quantitative variable: A quantitative variable is a variable whose values are numerical.

Examples
a) Age in years (25, 15, 74, etc.)
b) Birth-weight of babies in kg (3.1, 2.5, 2.9, 3.5, 4.2, etc.)
c) Number of antenatal care (ANC) visits by a pregnant mother (0, 1, 4, 7, etc.)
More terms and definitions

Quantitative variables can be divided into two types: continuous or discrete.

4. Continuous variable: A continuous variable is a variable which may take all values within a given range.

5. Discrete variable: A discrete variable is a variable whose values change by steps or jumps.
More terms and definitions

Thus age and birth-weight are continuous, because they can take any value, such as 25.5237873244 years or 2.93927634529 kg, respectively, even if we may not have scales that could measure this accurately!

However, the number of antenatal care (ANC) visits by a pregnant mother is discrete, since its values must be whole numbers (0, 1, 2, …, 9); decimal values cannot be accommodated.
More terms and definitions

6. Qualitative variable or attribute: A qualitative variable or attribute is a variable whose values are not numerical.

Examples
a) Names of countries (Malawi, Zambia, Egypt, Mozambique)
b) Answer to an opinion question (strongly disagree, disagree, agree, strongly agree)
c) Sex of an individual (male, female)

Note: In most analyses, qualitative variables that take a limited number of values are discretized by assigning them codes (e.g. 1=strongly disagree, …, 4=strongly agree)
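A minimal sketch of such coding is shown below. The codes 1 and 4 come from the note above; the codes 2 and 3 are assumed from the order in which the answers are listed, and the responses are hypothetical.

```python
# Coding scheme: 1 and 4 are given in the note above; 2 and 3 are assumed
# from the order of the answer options.
CODES = {"strongly disagree": 1, "disagree": 2, "agree": 3, "strongly agree": 4}

# Hypothetical answers from five respondents (not real data).
responses = ["agree", "strongly disagree", "agree", "disagree", "strongly agree"]

# Replace each text answer by its numeric code.
coded = [CODES[answer] for answer in responses]
print(coded)
```

The codes are only labels: for a variable such as country, arithmetic on the codes has no meaning.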
More terms and definitions
7. Frequency Distribution: A frequency distribution is an overview of all
distinct values in some variable and the number of times they occur

• Frequency distributions are mostly used for summarizing discrete / categorical variables. Metric (continuous) variables tend to have many distinct values. These result in huge tables and charts that don't give insight into your data.

Example:
• A sample of 183 students was asked to state which study major they were following. The table below shows part of these data.
Study majors
SN     Name of student     Sex       Major
1      Andrew Gondwe       Male      Biology
2      John Samale         Male      Mathematics
3      Pempho Yasini       Female    Other
4      Felix Wadabwa       Male      Mathematics
:      :                   :         :
:      :                   :         :
182    Maren Dickson       Female    Physics
183    Jack Filipo         Male      Chemistry
Observations
• Just looking at our 183 values cannot provide any useful information about majoring subjects.
• A more viable approach is to simply tabulate each distinct study major in our data and its frequency (the number of times it occurs).
• The resulting table (below) shows how frequencies are distributed over values (majoring subjects in this example) and hence is a frequency distribution.
Frequency distribution table
(frequencies are distributed over values)

What is currently your majoring subject?    N      Percent
Mathematics                                 62     33.9%
Biology                                     35     19.1%
Chemistry                                   33     18.0%
Physics                                     37     20.2%
Others                                      16     8.7%
Total                                       183    100%
Observations
• The most popular study major is mathematics (n = 62).
• “Other” is the least popular major (n = 16).
• The remaining majors are roughly equally popular (n between 33 and
37).
• Note that the frequencies add up to our sample size of 183 students.
This is always the case unless a variable contains missing values:
respondents can sometimes skip a question or answer “no answer”
or something similar.
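Tabulating a categorical variable amounts to counting how often each distinct value occurs. The sketch below does this with `collections.Counter`. Since the full list of 183 answers is not reproduced here, the list is rebuilt from the counts in the frequency table, which is an assumption made purely for illustration.

```python
from collections import Counter

# Rebuild the 183 answers from the counts in the frequency table above;
# a real data set would hold one answer per student instead.
table_counts = {"Mathematics": 62, "Biology": 35, "Chemistry": 33,
                "Physics": 37, "Others": 16}
majors = [major for major, n in table_counts.items() for _ in range(n)]

# Tally each distinct value of the variable.
frequencies = Counter(majors)

# Print the distribution from most to least frequent, then the total.
for major, n in frequencies.most_common():
    print(f"{major:<12}{n:>4}")
print(f"{'Total':<12}{sum(frequencies.values()):>4}")
```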
Relative frequencies

Optionally, a frequency distribution may contain relative frequencies: frequencies relative to (divided by) the total number of values. Relative frequencies are often shown as percentages or proportions.

What is currently your majoring subject?    N      Percent (relative frequency)
Mathematics                                 62     33.9%
Biology                                     35     19.1%
Chemistry                                   33     18.0%
Physics                                     37     20.2%
Others                                      16     8.7%
Total                                       183    100%
Relative frequencies

• Relative frequencies provide easy insight into frequency distributions. Besides, they facilitate comparisons.
• For example,
“33.9% of students major in mathematics”
is much easier to interpret than
“62 out of 183 students major in mathematics”.
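The percentages in the table can be checked by dividing each frequency by the total. The following is a small sketch using the counts from the table; rounding to one decimal place is assumed to match the table.

```python
# Frequencies from the study-major table above.
counts = {"Mathematics": 62, "Biology": 35, "Chemistry": 33,
          "Physics": 37, "Others": 16}

total = sum(counts.values())  # sample size

# Relative frequency as a percentage, rounded to one decimal place.
percent = {major: round(100 * n / total, 1) for major, n in counts.items()}

for major, p in percent.items():
    print(f"{major:<12}{counts[major]:>4}{p:>7.1f}%")
```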
Frequency distributions for continuous variables

• Sometimes, we are interested in summarizing continuous variables into frequency tables.

• However, as we noted, metric variables tend to have many distinct values. These result in huge tables and charts that don't give insight into your data.

• Therefore, instead of looking at the frequency of each value of the variable that occurs, we can first group the values of the variable into intervals, that is, subdivisions of the total range of possible values of the variable.
Example: Malawi birth-weights (MDHS, 2010)

• The Demographic and Health Survey of 2010 collected birth-weights from 13,079 babies. These values are so many that a frequency table of the individual values may not be a reasonable way of summarizing birth-weight.

• However, if we decide to do it anyway, we get the results tabulated below.


Frequency table: birth-weight in grams
Children's birth-weight in grams   Frequency   Percent   Cumulative percent
200 1 0.01 0.01
300 6 0.05 0.05
400 2 0.02 0.07
500 1 0.01 0.08
600 1 0.01 0.08
700 2 0.02 0.1
1000 52 0.4 0.5
1100 10 0.08 0.57
1200 16 0.12 0.7
1300 12 0.09 0.79
1400 4 0.03 0.82
1500 53 0.41 1.22
1600 20 0.15 1.38
1700 20 0.15 1.53
1800 38 0.29 1.82
1900 30 0.23 2.05
2000 584 4.47 6.51
2100 187 1.43 7.94
2200 111 0.85 8.79
2300 233 1.78 10.57
2400 144 1.1 11.68
2500 788 6.02 17.7
2600 199 1.52 19.22
2700 182 1.39 20.61
2800 439 3.36 23.97
2900 315 2.41 26.38
3000 2,428 18.56 44.94 (most common value)
3100 454 3.47 48.41
3200 1,029 7.87 56.28
3300 288 2.2 58.48
3400 538 4.11 62.6
3500 1,267 9.69 72.28
3600 250 1.91 74.2
3700 204 1.56 75.76
3800 326 2.49 78.25
3900 208 1.59 79.84
4000 1,074 8.21 88.05
4100 113 0.86 88.91
4200 224 1.71 90.63
4300 155 1.19 91.81
4400 53 0.41 92.22
4500 320 2.45 94.66
4600 44 0.34 95
4700 23 0.18 95.18
4800 41 0.31 95.49
4900 30 0.23 95.72
5000 302 2.31 98.03
5100 27 0.21 98.23
5200 36 0.28 98.51
5300 25 0.19 98.7
5400 15 0.11 98.81
5500 36 0.28 99.09
5600 14 0.11 99.2
5700 1 0.01 99.2
6000 48 0.37 99.63
6100 2 0.02 99.64
6200 4 0.03 99.67
6300 4 0.03 99.7
6400 4 0.03 99.73
6500 7 0.05 99.79
6600 2 0.02 99.8
6700 4 0.03 99.83
7000 10 0.08 99.91
7100 1 0.01 99.92
7200 1 0.01 99.92
7300 2 0.02 99.94
7500 1 0.01 99.95
8000 1 0.01 99.95
8500 2 0.02 99.97
9000 3 0.02 99.99
9100 1 0.01 100
Total 13,079 100
Observations
• The most common recorded birth-weight is 3000 grams (2,428 babies, or 18.56%).

• One important feature we observe is heaping at 2000, 2500, 3000, 3200, 3500, 4000, 4500 and 5000 grams. This could be due to recording errors by birth attendants, or to mothers rounding the figures when recalling birth-weights.

• Since the summary is not very informative, it is a good idea to group the birth-weights into some sensible groups before tabulating, say: 1-1000, 1001-2000, 2001-3000, 3001-4000, 4001-5000, 5001-6000, 6001-7000, 7001-8000, 8001-9000, 9001-10000 grams.

• The table below gives the frequency distribution for the ten groups that we have created.
Grouped birth-weight in grams   Frequency   Percent   Cumulative percent
1-1000                          65          0.5       0.5
1001-2000                       787         6.02      6.51
2001-3000                       5,026       38.43     44.94
3001-4000                       5,638       43.11     88.05
4001-5000                       1,305       9.98      98.03
5001-6000                       209         1.6       99.63
6001-7000                       37          0.28      99.91
7001-8000                       6           0.05      99.95
8001-9000                       5           0.04      99.99
9001-10000                      1           0.01      100
Total                           13,079      100

Observations
• The largest group of babies (5,638) were born weighing between 3001 and 4000 grams. This represents 43.11% of all 13,079 babies whose birth-weight was recorded in the survey.
• The second largest group of babies were born weighing between 2001 and 3000 grams.
• Overall, over 80% of the babies were born weighing between 2001 and 4000 grams.
• The least likely birth-weights are those over 6000 grams.
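The Percent and Cumulative columns of the grouped table can be recomputed from the class frequencies alone. The sketch below assumes the cumulative percentage is obtained by accumulating the frequencies first and dividing by the total afterwards, which avoids adding up rounding errors.

```python
from itertools import accumulate

# Class-intervals (grams) and frequencies from the grouped table above.
classes = ["1-1000", "1001-2000", "2001-3000", "3001-4000", "4001-5000",
           "5001-6000", "6001-7000", "7001-8000", "8001-9000", "9001-10000"]
frequencies = [65, 787, 5026, 5638, 1305, 209, 37, 6, 5, 1]

total = sum(frequencies)

# Percent of babies in each class.
percent = [100 * f / total for f in frequencies]

# Cumulative percent: running total of frequencies divided by the grand total.
cumulative = [100 * c / total for c in accumulate(frequencies)]

for label, f, p, c in zip(classes, frequencies, percent, cumulative):
    print(f"{label:<12}{f:>7}{p:>8.2f}{c:>8.2f}")
```

The cumulative column answers questions such as "what percentage of babies weighed 3000 grams or less?" directly from the row for 2001-3000.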
More terms and definitions
8. Class-interval: A class interval is a subdivision of the total range of
values which a (continuous) variable may take

In our example above, the birth-weight variable is reported in class-intervals of 1-1000, 1001-2000, … , 8001-9000, 9001-10000 grams.
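One way to assign each value to its class-interval is sketched below. The function name `class_interval` and the list of weights are hypothetical, introduced only for illustration; weights are assumed to be whole numbers of grams of at least 1.

```python
from collections import Counter

WIDTH = 1000  # class width in grams, as in the intervals above


def class_interval(weight_g: int) -> str:
    """Return the label of the class-interval (e.g. '2001-3000') containing weight_g."""
    lower = ((weight_g - 1) // WIDTH) * WIDTH + 1
    return f"{lower}-{lower + WIDTH - 1}"


# Hypothetical birth-weights in grams (illustration only, not MDHS data).
weights = [3000, 3200, 2500, 4100, 1000, 3500, 3001, 2000]

# Count how many observations fall in each class-interval.
class_frequencies = Counter(class_interval(w) for w in weights)

# Print the classes in increasing order of their lower limit.
for label in sorted(class_frequencies, key=lambda s: int(s.split("-")[0])):
    print(label, class_frequencies[label])
```

Subtracting 1 before the integer division places a boundary value such as 3000 in 2001-3000 rather than in 3001-4000, in line with the class limits used above.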
9. Class-frequency: A class-frequency is the number of observations of the variable which fall in a given interval.

Therefore,
10. The frequency distribution of a (continuous) variable is the set of class-intervals for the variable, together with the associated class-frequencies.

In the table below, the Frequency column contains the class-frequencies.

Grouped birth-weight in grams   Frequency   Percent   Cumulative percent
1-1000                          65          0.5       0.5
1001-2000                       787         6.02      6.51
2001-3000                       5,026       38.43     44.94
3001-4000                       5,638       43.11     88.05
4001-5000                       1,305       9.98      98.03
5001-6000                       209         1.6       99.63
6001-7000                       37          0.28      99.91
7001-8000                       6           0.05      99.95
8001-9000                       5           0.04      99.99
9001-10000                      1           0.01      100
Total                           13,079      100
Disaggregated frequency distributions

• At times it is important to disaggregate frequency distributions by factors that are known or thought to affect them.

• For example, in our DHS data, if we suspect that birth-weights may differ between babies born in rural areas and those born in urban areas, a frequency distribution disaggregated by rural/urban residence may prove to be useful.

• Since the total frequencies in the two groups differ, it becomes difficult to compare the frequencies directly. In this case, relative frequencies become useful.
Frequency table of birth-weights distributed by rural and urban residence

                                Rural                   Urban
Grouped birth-weight in grams   Frequency   Percent     Frequency   Percent
1-1000                          65          0.56        0           0
1001-2000                       702         6.09        85          5.45
2001-3000                       4,461       38.73       565         36.19
3001-4000                       4,850       42.11       788         50.48
4001-5000                       1,200       10.42       105         6.73
5001-6000                       195         1.69        14          0.9
6001-7000                       34          0.3         3           0.19
7001-8000                       6           0.05        0           0
8001-9000                       4           0.03        1           0.06
9001-10000                      1           0.01        0           0
Total                           11,518      100         1,561       100
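Because the rural and urban totals differ (11,518 against 1,561), each column is converted to percentages of its own total before comparing. A minimal sketch, with a helper name `to_percent` chosen here for illustration:

```python
# Class-frequencies from the table above, in order of class-interval
# (1-1000, 1001-2000, ..., 9001-10000 grams).
rural = [65, 702, 4461, 4850, 1200, 195, 34, 6, 4, 1]
urban = [0, 85, 565, 788, 105, 14, 3, 0, 1, 0]


def to_percent(freqs):
    """Convert frequencies to percentages of their own total, rounded to 2 decimals."""
    total = sum(freqs)
    return [round(100 * f / total, 2) for f in freqs]


rural_pct = to_percent(rural)
urban_pct = to_percent(urban)

# Compare the 3001-4000 gram class (index 3) in the two groups.
print(f"Rural: {rural[3]} babies = {rural_pct[3]}%")
print(f"Urban: {urban[3]} babies = {urban_pct[3]}%")
```

The rural frequency for this class is about six times the urban one, yet the urban percentage is the larger of the two, which is why the raw frequencies are misleading here.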
Frequency polygons

• Comparisons between the two groups can be made more visible using graphs.
• A frequency polygon is obtained by plotting class-frequencies or relative frequencies (percentages) as ordinates against the centre-points of the class-intervals as abscissae. The plotted points are then joined by straight lines.
• The figure below contains frequency polygons of birth-weight for rural and urban babies.
[Figure: Frequency polygons of birth-weight of babies born in rural versus urban areas. Horizontal axis: class centre-points in grams (500, 1500, 2,500, …, 9,500); vertical axis: percent of babies (0 to 60); series: Rural, Urban.]
Observations
• The two distributions have similar shapes and largely overlap. It appears that urban and rural mothers are about equally likely to have heavier or lighter babies.

• A big gap is observed around 3,500 grams, where the percentage of babies is higher in urban areas (50.48%) than in rural areas (42.11%).
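The points of the two frequency polygons can be built from the rural/urban table as sketched below. Class centre-points are taken as 500, 1500, …, 9500 to match the figure's axis, which assumes each class is treated as running from the previous upper limit to its own. The function `plot_polygons` is illustrative and needs the third-party matplotlib library; the rest runs without it.

```python
# Class limits in grams: (1, 1000), (1001, 2000), ..., (9001, 10000).
limits = [(lo + 1, lo + 1000) for lo in range(0, 10000, 1000)]

# Percentages from the rural/urban table above.
rural_pct = [0.56, 6.09, 38.73, 42.11, 10.42, 1.69, 0.30, 0.05, 0.03, 0.01]
urban_pct = [0.00, 5.45, 36.19, 50.48, 6.73, 0.90, 0.19, 0.00, 0.06, 0.00]

# Centre-point of each class: (lower - 1 + upper) / 2 gives 500, 1500, ...
centres = [(lower - 1 + upper) / 2 for lower, upper in limits]


def plot_polygons():
    """Draw both polygons; matplotlib is imported here so it stays optional."""
    import matplotlib.pyplot as plt

    plt.plot(centres, rural_pct, marker="o", label="Rural")
    plt.plot(centres, urban_pct, marker="o", label="Urban")
    plt.xlabel("Birth-weight in grams (class centre-point)")
    plt.ylabel("Percent of babies")
    plt.legend()
    plt.show()


# Largest rural/urban difference and the class centre-point where it occurs.
gap, centre = max((abs(u - r), c)
                  for r, u, c in zip(rural_pct, urban_pct, centres))
print(f"Largest gap: {gap:.2f} percentage points at {centre:.0f} grams")
```

Calling `plot_polygons()` draws the two lines on the same axes, so the gap near 3,500 grams is visible at a glance.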
