Research Paper
Myla Sustituido
BS Criminology, Batch 1
Prepared for Mr. Cagas
Concept of Statistics
The science of statistics deals with the collection, analysis, interpretation, and
presentation of data. We see and use data in our everyday lives.
Example:
1. If you wished to compute the overall grade point average at your school, it would make sense to
select a sample of students who attend the school. The data collected from the sample would be
the students’ grade point averages.
2. In presidential elections, opinion poll samples of 1,000–2,000 people are taken. The opinion
poll is supposed to represent the views of the people in the entire country.
From the sample data, we can calculate a statistic. A statistic is a number that represents
a property of the sample. For example, if we consider one math class to be a sample of
the population of all math classes, then the average number of points earned by students
in that one math class at the end of the term is an example of a statistic. The statistic is
an estimate of a population parameter. A parameter is a number that is a property of the
population. Since we considered all math classes to be the population, then the average
number of points earned per student over all the math classes is an example of a
parameter.
Parameter: Average number of points earned per student over all math classes
One of the main concerns in the field of statistics is how accurately a statistic estimates
a parameter. The accuracy really depends on how well the sample represents the
population. The sample must contain the characteristics of the population in order to be
a representative sample. We are interested in both the sample statistic and the
population parameter in inferential statistics. In a later chapter, we will use the sample
statistic to test the validity of the established population parameter.
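The relationship between a statistic and a parameter can be sketched in a few lines of Python. This is only an illustration: the population of 2,000 point totals is randomly generated (hypothetical data, not from any real class), and the sample size of 30 is an assumption.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: end-of-term points for 2,000 students in all math classes.
population = [random.randint(0, 100) for _ in range(2000)]

# One class of 30 students, drawn at random, plays the role of the sample.
sample = random.sample(population, 30)

parameter = mean(population)  # a property of the whole population
statistic = mean(sample)      # a property of the sample; it estimates the parameter

print(f"parameter = {parameter:.2f}, statistic = {statistic:.2f}")
```

The two numbers will usually be close but not equal, which is why the accuracy of the estimate depends on how well the sample represents the population.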
A study was conducted at a local college to analyze the average cumulative GPAs of
students who graduated last year. Fill in the letter of the phrase that best describes
each of the items below.
Population
Sample
Data
Statistics
Variable
Parameter:
TRY IT
Determine what the key terms refer to in the following study. We want to know the average (mean)
amount of money first year college students spend at ABC College on school supplies that do not include
books. We randomly survey 100 first year students at the college. Three of those students spent $150,
$200, and $225, respectively.
Show Answer:
Sample: 100 first year students who are surveyed at the college.
Parameter: Average amount of money a first year college student spent on school supplies that do not
include books.
Statistic: Average amount of money these 100 first year college students spent on school supplies that
do not include books.
Variable: The amount of money a first year ABC College student spends on school supplies that do not
include books.
Data: The amounts that we collected from these 100 students, such as $150, $200, and $225.
Statistics
Mathematical statistics is the application of probability theory, a branch of mathematics,
to statistics, as opposed to techniques for collecting statistical data. Specific mathematical
techniques which are used for this include mathematical analysis, linear
algebra, stochastic analysis, differential equations, and measure theory.
Introduction
Statistical data collection is concerned with the planning of studies, especially with
the design of randomized experiments and with the planning of surveys using random
sampling. The initial analysis of the data often follows the study protocol specified prior to
the study being conducted. The data from a study can also be analyzed to consider
secondary hypotheses inspired by the initial results, or to suggest new studies. A
secondary analysis of the data from a planned study uses tools from data analysis, and
the process of doing this is mathematical statistics.
descriptive statistics - the part of statistics that describes data, i.e. summarises the data
and their typical properties.
inferential statistics - the part of statistics that draws conclusions from data (using some
model for the data). For example, inferential statistics involves selecting a model for the
data, checking whether the data fulfill the conditions of a particular model, and
quantifying the involved uncertainty (e.g. using confidence intervals).
While the tools of data analysis work best on data from randomized studies, they are also
applied to other kinds of data, for example data from natural experiments and observational
studies. In those cases the inference depends on the model chosen by the statistician,
and so is subjective.
Example of Statistics
A survey was conducted to find the favorite fruit of 100 people. The circle graph below
shows the results of the survey.
Ques: What is the probability of getting two tails and one head, when 3 coins are tossed
at a time?
Choices:
A. 1/3
B. 1/4
C. 3/8
D. 1/7
Correct Answer: C
Solution:
Step 1: Number of possible outcomes when one coin is tossed = 2. [Outcomes are
Head (H) and Tail (T).]
Step 2: The possible outcomes when 3 coins are tossed are {TTT, THT, TTH, THH, HHT,
HTH, HTT, HHH}, so the total number of possible outcomes = 8.
Step 3: Number of favorable outcomes = 3. [Favorable outcomes are THT, TTH, HTT.]
Step 4: Probability = number of favorable outcomes / total number of possible outcomes = 3/8. [Substitute.]
Step 5: So, the probability of getting two tails and one head is 3/8.
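The counting in the solution can be checked by listing every outcome. The short Python sketch below does this with the standard library; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

# All sequences of H and T for three coins: 2 * 2 * 2 = 8 outcomes.
outcomes = ["".join(toss) for toss in product("HT", repeat=3)]

# Favorable outcomes have exactly two tails and one head.
favorable = [o for o in outcomes if o.count("T") == 2 and o.count("H") == 1]

probability = Fraction(len(favorable), len(outcomes))
print(favorable, probability)
```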
Descriptive Statistics
Descriptive statistics are brief descriptive coefficients that summarize a given data set,
which can represent either an entire population or a sample of it. Descriptive
statistics are broken down into measures of central tendency and measures of variability
(spread). Measures of central tendency include the mean, median, and mode, while
measures of variability include the standard deviation, variance, the minimum and
maximum values, and the kurtosis and skewness.
Descriptive statistics, in short, help describe and understand the features of a specific
data set by giving short summaries about the sample and measures of the data. The most
recognized types of descriptive statistics are measures of center: the mean, median, and
mode, which are used at almost all levels of math and statistics. The mean, or the
average, is calculated by adding all the figures within the data set and then dividing by
the number of figures within the set. For example, the sum of the following data set is 20:
(2, 3, 4, 5, 6). The mean is 4 (20/5). The mode of a data set is the value appearing most
often, and the median is the figure situated in the middle of the data set. It is the figure
separating the higher figures from the lower figures within a data set. However, there are
less-common types of descriptive statistics that are still very important.
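The three measures of center described above can be computed for the data set (2, 3, 4, 5, 6) with Python's standard statistics module. This is a minimal sketch; the variable names are illustrative.

```python
from statistics import mean, median, multimode

data = [2, 3, 4, 5, 6]

total = sum(data)             # sum of all the figures
data_mean = mean(data)        # total divided by the number of figures
data_median = median(data)    # middle figure of the sorted data
data_modes = multimode(data)  # every value that occurs most often

print(total, data_mean, data_median, data_modes)
```

In this data set every value appears exactly once, so multimode returns all five values and there is no single mode.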
KEY TAKEAWAYS
You also need to know which data type you are dealing with to choose the right
visualization method. Think of data types as a way to categorize different types of
variables. We will discuss the main types of variables and look at an example for each.
We will sometimes refer to them as measurement scales.
Categorical Data
Nominal Data
Nominal values represent discrete units and are used to label variables that have no
quantitative value. Just think of them as "labels". Note that nominal data has no order.
Therefore, if you change the order of its values, the meaning does not change.
You can see two examples of nominal features below:
The left feature, which describes a person's gender, would be called "dichotomous", which is
a type of nominal scale that contains only two categories.
Ordinal Data
Ordinal values represent discrete and ordered units. Ordinal data is therefore nearly the same as
nominal data, except that its ordering matters. You can see an example below:
Note that the difference between Elementary and High School is different from the
difference between High School and College. This is the main limitation of ordinal data:
the differences between the values are not really known. Because of that, ordinal scales
are usually used to measure non-numeric features like happiness, customer satisfaction
and so on.
Numerical Data
1. Discrete Data
We speak of discrete data if its values are distinct and separate. In other words: We speak
of discrete data if the data can only take on certain values. This type of data can’t be
measured but it can be counted. It basically represents information that can be
categorized into a classification. An example is the number of heads in 100 coin flips.
You can check whether you are dealing with discrete data by asking two questions:
Can you count it, and can it be divided up into smaller and smaller parts? If you can
count it but cannot meaningfully divide it further, the data are discrete.
2. Continuous Data
Continuous Data represents measurements and therefore their values can’t be counted
but they can be measured. An example would be the height of a person, which you can
describe by using intervals on the real number line.
Interval Data
Interval values represent ordered units that have the same difference. Therefore we
speak of interval data when we have a variable that contains numeric values that are
ordered and where we know the exact differences between the values. An example would
be a feature that contains temperature of a given place like you can see below:
The problem with interval data is that it doesn't have a "true zero". In
our example, that means there is no such thing as "no temperature". With interval data,
we can add and subtract, but we cannot multiply, divide or calculate ratios. Because there
is no true zero, a lot of descriptive and inferential statistics can't be applied.
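A short Python sketch shows why ratios of interval values are not meaningful. The two temperatures are hypothetical values chosen for illustration, and the conversion to kelvin (a scale with an absolute zero) is added here only for comparison.

```python
def celsius_to_kelvin(celsius):
    """Convert degrees Celsius to kelvin, a scale with a true zero."""
    return celsius + 273.15

morning_c = 10.0    # hypothetical temperature in degrees Celsius
afternoon_c = 20.0  # hypothetical temperature in degrees Celsius

# Differences on an interval scale are meaningful: it warmed up by 10 degrees.
difference = afternoon_c - morning_c

# The naive ratio suggests "twice as hot", but it depends on where zero was placed.
naive_ratio = afternoon_c / morning_c

# On the kelvin scale the same two temperatures differ by only a few percent.
kelvin_ratio = celsius_to_kelvin(afternoon_c) / celsius_to_kelvin(morning_c)

print(difference, naive_ratio, round(kelvin_ratio, 3))
```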
Ratio Data
Ratio values are also ordered units that have the same difference. Ratio values are the
same as interval values, with the difference that they do have an absolute zero. Good
examples are height, weight, length etc.
Summarizing data
When you want to measure something in the natural world you usually have to take
several measurements. This is because things are variable, so you need several results
to get an idea of the situation. Once you have these measurements you need to
summarize them in some way because sets of raw numbers are not easily interpreted by
most people.
There are four key areas to consider when summarizing a set of numbers:
Dispersion – how spread out the values are from the average.
Shape – the data distribution, which relates to how "evenly" the values are spread either
side of the average.
You need to present the first three summary statistics in order to summarize a set of
numbers adequately. There are different measures of centrality and dispersion – the
measures you select are based on the last item, shape (or data distribution).
Summary
You should always summarize a sample of data values to make them more easily
understood (by you and others). At the very least you need to show:
Dispersion – how spread out the data are around the average.
The shape of the data (its distribution) is also important because the shape determines
which summary statistics are most appropriate to describe the sample. Your data may be
normally distributed (i.e. with a symmetrical, bell-shaped curve) and so parametric, or
they may be skewed and therefore non-parametric.
The shape of the data also leads you towards the most appropriate ways of analyzing the
data, that is, which statistical tests you can use.
Frequency Distribution
Frequency
Saturday Morning, Saturday Afternoon, Thursday Afternoon
The frequency was 2 on Saturday, 1 on Thursday and 3 for the whole week.
Frequency Distribution
Example: Goals
Sam's team has scored the following numbers of goals in recent games
2, 3, 1, 2, 1, 3, 2, 3, 4, 5, 4, 2, 2, 3
Frequency Distribution: values and their frequency (how often each value occurs).
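The frequency distribution for the goals example can be tallied with Python's collections.Counter. This is a minimal sketch; the variable names are illustrative.

```python
from collections import Counter

goals = [2, 3, 1, 2, 1, 3, 2, 3, 4, 5, 4, 2, 2, 3]

# Map each value to how often it occurs.
frequency = Counter(goals)

for value in sorted(frequency):
    print(f"{value} goal(s): {frequency[value]} game(s)")
```

The frequencies add up to 14, the number of games in the list.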
Example: Newspapers
These are the numbers of newspapers sold at a local shop over the last 10 days:
It is also possible to group the values. Here they are grouped in 5s:
Measures of dispersion
The central tendency is not the only interesting or useful information about a data set.
The two data sets illustrated below have the same mean (0), but have different spreads
around the mean. Each circle represents one value from the data set (or one datum).
Dispersion is a general term for different statistics that describe how values are distributed
around the centre. In this section we will look at measures of dispersion.
Range
DEFINITION
Range
The range of a data set is the difference between the maximum and minimum values in
the set.
The most straightforward measure of dispersion is the range. The range simply tells us
how far apart the largest and smallest values in a data set are. The range is very sensitive
to outliers.
EXAMPLE
QUESTION
Find the range of the following data set:
{1;4;5;8;6;7;5;6;7;4;10;9;10}
What would happen if we removed the first value from the set?
SOLUTION
The smallest value in the data set is 1 and the largest value is 10. The range is 10−1=9.
If the first value, 1, were to be removed from the set, the minimum value would be 4.
This means that the range would change to 10−4=6. The value 1 is not typical of the other
values. It is an outlier and has a big influence on the range.
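The effect of the outlier on the range can be checked with a small Python sketch. The helper name data_range is illustrative, not a library function.

```python
def data_range(values):
    """Return the difference between the maximum and minimum values."""
    return max(values) - min(values)

data = [1, 4, 5, 8, 6, 7, 5, 6, 7, 4, 10, 9, 10]

full_range = data_range(data)         # includes the outlier 1
trimmed_range = data_range(data[1:])  # first value removed

print(full_range, trimmed_range)
```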
Percentiles
DEFINITION
Percentile
The pth percentile is the value, v, that divides a data set into two parts, such that p percent
of the values in the data set are less than v and 100 − p percent of the values are
greater than v. Percentiles can lie in the range 0 ≤ p ≤ 100.
The rank of a datum is its position in the sorted data set (for example, first, second,
third, and so on).
The percentile at which a particular datum is, tells us what percentage of the values
in the full data set are less than this datum.
The table below summarises the value, rank and percentile of the data set:
Value   Rank   Percentile
10,3    1      0
11,1    2      20
13,0    3      40
13,9    4      60
14,2    5      80
19,8    6      100
As an example, 13,0 is at the 40th percentile since there are 2 values less than 13,0
and 3 values greater than 13,0:
2 / (2 + 3) = 0,4 = 40%
In general, the formula for finding the pth percentile in an ordered data set with n values
is
r = (p / 100)(n − 1) + 1
This gives us the rank, r, of the pth percentile. To find the value of the pth percentile, we
have to count from the first value in the ordered data set up to the rth value.
Sometimes the rank will not be an integer. This means that the percentile lies between
two values in the data set. The convention is to take the value halfway between the two
values indicated by the rank.
The figure below shows the relationship between rank and percentile graphically. We
have already encountered three percentiles in this chapter: the median
(50th percentile), the minimum (0th percentile) and the maximum (100th). The median is
defined as the value halfway in a sorted data set.
EXAMPLE
QUESTION
Determine the minimum, maximum and median values of the following data set using the
percentile formula.
{14;17;45;20;19;36;7;30;8}
SOLUTION
Step 1: Order the data set
Before we can use the rank to find values in the data set, we always have to order the
values from the smallest to the largest. The sorted data set is
{7;8;14;17;19;20;30;36;45}
Step 2: Find the minimum
We already know that the minimum value is the first value in the ordered data set. We will
now confirm that the percentile formula gives the same answer. The minimum is
equivalent to the 0th percentile. According to the percentile formula the rank, r, of the p=0
percentile in a data set with n=9 values is:
r = (p / 100)(n − 1) + 1
  = (0 / 100)(9 − 1) + 1
  = 1
This confirms that the minimum value is the first value in the list, namely 7.
Step 3: Find the maximum
We already know that the maximum value is the last value in the ordered data set. The
maximum is also equivalent to the 100th percentile. Using the percentile formula
with p=100 and n=9, we find the rank of the maximum value is:
r = (p / 100)(n − 1) + 1
  = (100 / 100)(9 − 1) + 1
  = 9
This confirms that the maximum value is the last (the ninth) value in the list, namely 45.
Step 4: Find the median
The median is equivalent to the 50th percentile. Using the percentile formula with p=50
and n=9, we find the rank of the median value is:
r = (50 / 100)(n − 1) + 1
  = (50 / 100)(9 − 1) + 1
  = 5
This shows that the median is in the middle (at the fifth position) of the ordered data set.
Therefore the median value is 19.
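The rank calculations in this example can be reproduced with a short Python sketch. The function name percentile_rank is illustrative; exact fractions are used so that whole-number ranks come out exactly.

```python
from fractions import Fraction

def percentile_rank(p, n):
    """Rank r of the pth percentile in an ordered data set with n values."""
    return Fraction(p, 100) * (n - 1) + 1

data = sorted([14, 17, 45, 20, 19, 36, 7, 30, 8])
n = len(data)

results = {}
for p in (0, 50, 100):
    r = percentile_rank(p, n)
    # For this data set all three ranks are whole numbers, so we can index directly.
    results[p] = data[int(r) - 1]

print(results)
```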
DEFINITION
Quartiles
The quartiles are the three data values that divide an ordered data set into four groups,
where each group contains an equal number of data values. The median (50th percentile)
is the second quartile (Q2). The 25th percentile is also called the first or lower quartile
(Q1). The 75th percentile is also called the third or upper quartile (Q3).
QUARTILES
QUESTION
Determine the quartiles of the following data set:
{7;45;11;3;9;35;31;7;16;40;12;6}
SOLUTION
Step 1: Order the data set
{3;6;7;7;9;11;12;16;31;35;40;45}
Step 2: Find the ranks of the quartiles
Using the percentile formula with n=12, we can find the rank of
the 25th, 50th and 75th percentiles:
r25 = (25 / 100)(12 − 1) + 1 = 3,75
r50 = (50 / 100)(12 − 1) + 1 = 6,5
r75 = (75 / 100)(12 − 1) + 1 = 9,25
Step 3: Find the values of the quartiles
Note that each of these ranks is a fraction, meaning that the value for each percentile is
somewhere in between two values from the data set.
For the 25th percentile the rank is 3,75, which is between the third and fourth values.
Since both these values are equal to 7, the 25th percentile is 7.
For the 50th percentile (the median) the rank is 6,5, meaning halfway between the sixth
and seventh values. The sixth value is 11 and the seventh value is 12, which means that
the median is (11 + 12) / 2 = 11,5.
For the 75th percentile the rank is 9,25, meaning between the ninth and tenth values.
Therefore the 75th percentile is (31 + 35) / 2 = 33.
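The quartile calculation can be sketched in Python using the convention stated earlier: when the rank is not a whole number, take the value halfway between the two neighbouring values. The function name percentile_value is illustrative, and other software may use a different interpolation rule and give slightly different answers.

```python
from fractions import Fraction
from math import floor

def percentile_value(sorted_values, p):
    """Value of the pth percentile, taking the halfway point for fractional ranks."""
    n = len(sorted_values)
    rank = Fraction(p, 100) * (n - 1) + 1
    lower = floor(rank)
    if rank == lower:
        # Whole-number rank: the percentile is a value in the data set.
        return sorted_values[lower - 1]
    # Fractional rank: halfway between the two neighbouring values.
    return (sorted_values[lower - 1] + sorted_values[lower]) / 2

data = sorted([7, 45, 11, 3, 9, 35, 31, 7, 16, 40, 12, 6])

q1 = percentile_value(data, 25)
q2 = percentile_value(data, 50)
q3 = percentile_value(data, 75)

print(q1, q2, q3)
```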
Deciles
The deciles are the nine data values that divide an ordered data set into ten groups,
where each group contains an equal number of data
values.
28;33;35;45;57;59;61;68;69;72;75;78;80;83;86;91;92;95;101;105;111;117;118;125;
127;131;137;139
In grouped data, the percentiles will lie somewhere inside a range, rather than at a specific
value. To find the range in which a percentile lies, we still use the percentile formula to
determine the rank of the percentile and then find the range within which that rank is.
EXAMPLE
QUESTION
The mathematics marks of 100 grade 10 learners at a school have been collected. The
data are presented in the following table:
Percentage mark   Number of learners
0≤x<20            2
20≤x<30           5
30≤x<40           18
40≤x<50           22
50≤x<60           18
60≤x<70           13
70≤x<80           12
80≤x<100          10
SOLUTION
Step 1: Calculate the mean
Since we are given grouped data rather than the original ungrouped data, the best we
can do is approximate the mean as if all the learners in each interval were located at the
central value of the interval.
Mean = [2(10) + 5(25) + 18(35) + 22(45) + 18(55) + 13(65) + 12(75) + 10(90)] / 100
     = 5400 / 100
     = 54%
Step 2: Find the quartiles
Since the data have been grouped, they have also already been sorted. Using the
percentile formula and the fact that there are 100 learners, we can find the rank of
the 25th, 50th and 75th percentiles as
r25 = (25 / 100)(100 − 1) + 1 = 25,75
r50 = (50 / 100)(100 − 1) + 1 = 50,5
r75 = (75 / 100)(100 − 1) + 1 = 75,25
For the lower quartile, we have that there are 2+5=7 learners in the first two ranges
combined and 2+5+18=25 learners in the first three ranges combined. Since r25=25,75 falls
between the 25th learner (the last one in the third range) and the 26th learner (the first
one in the fourth range), the lower quartile lies at the boundary between the third range,
30≤x<40, and the fourth range, 40≤x<50.
For the second quartile (the median), we have that there are 2+5+18+22=47 learners in
the first four ranges combined and 47+18=65 learners in the first five ranges combined.
Since 47<r50<65, this means that the median lies somewhere in the fifth range: 50≤x<60.
For the upper quartile, we have that there are 65 learners in the first five ranges combined
and 65+13=78 learners in the first six ranges combined. Since 65<r75<78, this means
that the upper quartile lies somewhere in the sixth range: 60≤x<70.
Step 3: Find the 30th percentile
Using the same method as for the quartiles, we first find the rank of the 30th percentile.
r = (30 / 100)(100 − 1) + 1 = 30,7
Now we have to find the range in which this rank lies. Since there are 25 learners in the
first 3 ranges combined and 47 learners in the first 4 ranges combined,
the 30th percentile lies in the fourth range: 40≤x<50.
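The grouped-data calculations can be sketched in Python as follows. The names intervals and range_for_percentile are illustrative. The lookup walks the cumulative counts and returns the first range whose cumulative count reaches the rank, which is one reasonable reading of the method described above.

```python
from fractions import Fraction

# (lower bound, upper bound) of each range and the number of learners in it.
intervals = [((0, 20), 2), ((20, 30), 5), ((30, 40), 18), ((40, 50), 22),
             ((50, 60), 18), ((60, 70), 13), ((70, 80), 12), ((80, 100), 10)]

total = sum(count for _, count in intervals)

# Approximate mean: treat every learner as sitting at the midpoint of their range.
approx_mean = sum((low + high) / 2 * count for (low, high), count in intervals) / total

def range_for_percentile(p):
    """Return the range in which the rank of the pth percentile lies."""
    rank = Fraction(p, 100) * (total - 1) + 1
    cumulative = 0
    for bounds, count in intervals:
        cumulative += count
        if rank <= cumulative:
            return bounds
    return intervals[-1][0]

median_range = range_for_percentile(50)
upper_quartile_range = range_for_percentile(75)
p30_range = range_for_percentile(30)

print(approx_mean, median_range, upper_quartile_range, p30_range)
```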
Ranges
We define data ranges in terms of percentiles. We have already encountered the full data
range, which is simply the difference between the 100th and the 0th percentile (that is,
between the maximum and minimum values in the data set).
Interquartile range
Choosing the best measure of central tendency depends on the type of data you have. In
this post, I explore these measures of central tendency, show you how to calculate them,
and how to determine which one is best for your data.
The three distributions below represent different data conditions. In each distribution, look
for the region where the most common values fall. Even though the shapes and type of
data are different, you can find that central location. That’s the area in the distribution
where the most common values are located.
As the graphs highlight, you can see where most values tend to occur. That’s the concept.
Measures of central tendency represent this idea with a value. Coming up, you’ll learn
that as the distribution and kind of data changes, so does the best measure of central
tendency. Consequently, you need to know the type of data you have, and graph it, before
choosing a measure of central tendency!
Mean
The mean is the arithmetic average, and it is probably the measure of central tendency
that you are most familiar with. Calculating the mean is very simple. You just add up all of
the values and divide by the number of observations in your dataset.
The calculation of the mean incorporates all values in the data. If you change any value,
the mean changes. However, the mean doesn’t always locate the center of the data
accurately. Observe the histograms below where I display the mean in the distributions.
However, in a skewed distribution, the mean can miss the mark. In the histogram above,
it is starting to fall outside the central area. This problem occurs because outliers have a
substantial impact on the mean. Extreme values in an extended tail pull the mean away
from the center. As the distribution becomes more skewed, the mean is drawn further
away from the center.
Consequently, it’s best to use the mean as a measure of the central tendency when you
have a symmetric distribution.
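A small Python sketch illustrates how a single extreme value pulls the mean away from the center while the median stays put. Both data sets are hypothetical and chosen only for illustration.

```python
from statistics import mean, median

symmetric = [2, 3, 4, 5, 6]  # values spread evenly around the center
skewed = [2, 3, 4, 5, 60]    # same values, but the largest is an extreme outlier

symmetric_mean, symmetric_median = mean(symmetric), median(symmetric)
skewed_mean, skewed_median = mean(skewed), median(skewed)

print(symmetric_mean, symmetric_median)  # mean and median agree
print(skewed_mean, skewed_median)        # mean is pulled toward the long tail
```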
Assume that the elements in a data set are rank ordered from the smallest to the largest.
The values that divide a rank-ordered set of elements into 100 equal parts are
called percentiles.
An element having a percentile rank of Pi would have a greater value than i percent of all
the elements in the set. Thus, the observation at the 50th percentile would be denoted
P50, and it would be greater than 50 percent of the observations in the set. An observation
at the 50th percentile would correspond to the median value in the set.
Quartiles
Quartiles divide a rank-ordered data set into four equal parts. The values that divide
each part are called the first, second, and third quartiles; and they are denoted by Q1,
Q2, and Q3, respectively. The chart below shows a set of four numbers divided into
quartiles.
A standard score (aka, a z-score) indicates how many standard deviations an element is from
the mean. A standard score can be calculated from the following formula.
z = (X - μ) / σ
where z is the z-score, X is the value of the element, μ is the mean of the population, and σ is the
standard deviation.
Example
A national achievement test is administered annually to 3rd graders. The test has a mean
score of 100 and a standard deviation of 15. If Jane's z-score is 1.20, what was her score on
the test?
(A) 82
(B) 88
(C) 100
(D) 112
(E) 118
Solution
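The correct answer is (E). From the z-score formula
z = (X - μ) / σ
we have 1.20 = (X - 100) / 15, so X = 100 + (1.20)(15) = 118,
where z is the z-score, X is the value of the element, μ is the mean of the population, and σ
is the standard deviation.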
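The z-score formula and its rearrangement for X can be sketched in Python. The function names z_score and score_from_z are illustrative, not from any library.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma

def score_from_z(z, mu, sigma):
    """Rearranged formula: recover the raw score X from a z-score."""
    return mu + z * sigma

# Jane's test: mean 100, standard deviation 15, z-score 1.20.
jane_score = score_from_z(1.20, 100, 15)

print(round(jane_score, 2))
```

Feeding the result back into z_score returns the original z-score, which is a quick way to check the rearrangement.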