Research Paper

Mathematics in the Modern World
GE4

Research Paper

Myla Sustituido
BS Criminology, Batch 1

Prepared for
Mr. Cagas

 Concept of Statistics
The science of statistics deals with the collection, analysis, interpretation, and
presentation of data. We see and use data in our everyday lives.

Important Terms in Statistics

In statistics, we generally want to study a population. You can think of a population as a
collection of persons, things, or objects under study. To study the population, we select
a sample. The idea of sampling is to select a portion (or subset) of the larger population
and study that portion (the sample) to gain information about the population. Data are the
result of sampling from a population.

Example:

1. If you wished to compute the overall grade point average at your school, it would make sense to
select a sample of students who attend the school. The data collected from the sample would be
the students’ grade point averages.

Population: all the students at your school

Sample: a sample of 50 students

2. In presidential elections, opinion poll samples of 1,000–2,000 people are taken. The opinion
poll is supposed to represent the views of the people in the entire country.

Population: all the people in the entire country

Sample: a sample of 1,000–2,000 people

From the sample data, we can calculate a statistic. A statistic is a number that represents
a property of the sample. For example, if we consider one math class to be a sample of
the population of all math classes, then the average number of points earned by students
in that one math class at the end of the term is an example of a statistic. The statistic is
an estimate of a population parameter. A parameter is a number that is a property of the
population. Since we considered all math classes to be the population, then the average
number of points earned per student over all the math classes is an example of a
parameter.

Population: all math classes

Sample: one of the math classes

Statistic: average number of points earned by students in that one math class

Parameter: average number of points earned per student over all math classes

One of the main concerns in the field of statistics is how accurately a statistic estimates
a parameter. The accuracy really depends on how well the sample represents the
population. The sample must contain the characteristics of the population in order to be
a representative sample. We are interested in both the sample statistic and the
population parameter in inferential statistics. In a later chapter, we will use the sample
statistic to test the validity of the established population parameter.

Determine what the key terms refer to in the following study.

A study was conducted at a local college to analyze the average cumulative GPAs of
students who graduated last year. Fill in the letter of the phrase that best describes
each of the items below.

Population

Sample

Data

Statistics

Variable

Parameter:

TRY IT

Determine what the key terms refer to in the following study. We want to know the average (mean)
amount of money first year college students spend at ABC College on school supplies that do not include
books. We randomly survey 100 first year students at the college. Three of those students spent $150,
$200, and $225, respectively.

Show Answer:

Population: All the first year college students at ABC College.

Sample: The 100 first year students who are surveyed at the college.

Parameter: The average amount of money a first year ABC College student spends on school supplies
that do not include books.

Statistic: The average amount of money these 100 first year college students spent on school supplies
that do not include books.

Variable: The amount of money a first year ABC College student spends on school supplies that do not
include books.

Data: The amounts that we collected from these 100 students, such as $150, $200, and $225.

 Statistics
Mathematical statistics is the application of probability theory, a branch of mathematics,
to statistics, as opposed to techniques for collecting statistical data. Specific mathematical
techniques which are used for this include mathematical analysis, linear
algebra, stochastic analysis, differential equations, and measure theory.

Introduction

Statistical data collection is concerned with the planning of studies, especially with
the design of randomized experiments and with the planning of surveys using random
sampling. The initial analysis of the data often follows the study protocol specified prior to
the study being conducted. The data from a study can also be analyzed to consider
secondary hypotheses inspired by the initial results, or to suggest new studies. A
secondary analysis of the data from a planned study uses tools from data analysis, and
the process of doing this is mathematical statistics.

Data analysis is divided into:

descriptive statistics - the part of statistics that describes data, i.e. summarises the data
and their typical properties.

inferential statistics - the part of statistics that draws conclusions from data (using some
model for the data). For example, inferential statistics involves selecting a model for the
data, checking whether the data fulfill the conditions of a particular model, and
quantifying the involved uncertainty (e.g. using confidence intervals).

While the tools of data analysis work best on data from randomized studies, they are also
applied to other kinds of data, for example data from natural experiments and observational
studies. In that case the inference depends on the model chosen by the statistician,
and so is subjective.

Example of Statistics
A survey was conducted to find the favorite fruit of 100 people. The circle graph below
shows the results of the survey.

Solved Example on Statistics

Ques: What is the probability of getting two tails and one head, when 3 coins are tossed
at a time?

Choices:

A. 1/3
B. 1/4
C. 3/8
D. 1/7

Correct Answer: C

Solution:

Step 1: Number of possible outcomes when one coin is tossed = 2. [Outcomes are
Head (H) and Tail (T).]

Step 2: The possible outcomes when 3 coins are tossed are {TTT, THT, TTH, THH, HHT,
HTH, HTT, HHH}, so the total number of possible outcomes = 8.

Step 3: Number of favorable outcomes = 3. [Favorable outcomes are THT, TTH, HTT.]

Step 4: Probability = number of favorable outcomes / total number of possible outcomes
= 3/8. [Substitute.]

Step 5: So, the probability of getting two tails and one head is 3/8.
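
The counting in the solved example can be checked by listing every outcome. The following is a minimal sketch in Python; the variable names are illustrative and are not part of the original example.

```python
from fractions import Fraction
from itertools import product

# Enumerate every outcome of tossing three coins: HHH, HHT, ..., TTT.
outcomes = ["".join(toss) for toss in product("HT", repeat=3)]

# Favorable outcomes contain exactly two tails (and therefore one head).
favorable = [outcome for outcome in outcomes if outcome.count("T") == 2]

# Probability = favorable outcomes / all possible outcomes.
probability = Fraction(len(favorable), len(outcomes))
print(len(outcomes), sorted(favorable), probability)
```

Using Fraction keeps the answer in the same form as the answer choices instead of a rounded decimal.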
 Descriptive Statistics

What is Descriptive Statistics?

Descriptive statistics are brief descriptive coefficients that summarize a given data set,
which can represent either an entire population or a sample of it. Descriptive
statistics are broken down into measures of central tendency and measures of variability
(spread). Measures of central tendency include the mean, median, and mode, while
measures of variability include the standard deviation, variance, the minimum and
maximum values, and the kurtosis and skewness.

Understanding Descriptive Statistics

Descriptive statistics, in short, help describe and understand the features of a specific
data set by giving short summaries about the sample and measures of the data. The most
recognized types of descriptive statistics are measures of center: the mean, median, and
mode, which are used at almost all levels of math and statistics. The mean, or the
average, is calculated by adding all the figures within the data set and then dividing by
the number of figures within the set. For example, the sum of the following data set is 20:
(2, 3, 4, 5, 6). The mean is 4 (20/5). The mode of a data set is the value appearing most
often, and the median is the figure situated in the middle of the data set. It is the figure
separating the higher figures from the lower figures within a data set. However, there are
less-common types of descriptive statistics that are still very important.
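
The three measures of center described above can be computed with Python's standard statistics module. This is a small sketch: the first data set is the one from the text, while the second data set is a hypothetical one added only because the original set has no repeated value to serve as a mode.

```python
import statistics

# Data set from the text: the sum is 20 and there are 5 figures.
figures = [2, 3, 4, 5, 6]
mean_value = statistics.mean(figures)      # 20 / 5
median_value = statistics.median(figures)  # middle figure of the sorted set

# Hypothetical data set with a repeated value, used only to show the mode.
repeated = [2, 3, 3, 5, 6]
modes = statistics.multimode(repeated)     # list of the most frequent values

print(mean_value, median_value, modes)
```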

People use descriptive statistics to repurpose hard-to-understand quantitative insights
across a large data set into bite-sized descriptions. A student's grade point average
(GPA), for example, provides a good understanding of descriptive statistics. The idea of
a GPA is that it takes data points from a wide range of exams, classes, and grades, and
averages them together to provide a general understanding of a student's overall
academic abilities. A student's personal GPA reflects their mean academic performance.

KEY TAKEAWAYS

 Descriptive statistics summarizes or describes characteristics of a data set.


 Descriptive statistics consists of two basic categories of measures: measures of
central tendency and measures of variability or spread.
 Measures of central tendency describe the center of a data set.
 Measures of variability or spread describe the dispersion of data within the
set.
The 2 Main Types of Descriptive Statistics

Descriptive statistics has 2 main types:

 Measures of Central Tendency (Mean, Median, and Mode).


 Measures of Dispersion or Variation (Variance, Standard Deviation, Range).

 Data Types in Statistics


Data types are an important concept in statistics. You need to understand them to apply
statistical measurements to your data correctly and therefore to draw correct conclusions
from it. This blog post will introduce you to the different data types you need to know
to do proper exploratory data analysis (EDA), which is one of the most underestimated
parts of a machine learning project.

Introduction to Data Types
Having a good understanding of the different data types, also called measurement scales,
is a crucial prerequisite for doing Exploratory Data Analysis (EDA), since you can use
certain statistical measurements only for specific data types.

You also need to know which data type you are dealing with to choose the right
visualization method. Think of data types as a way to categorize different types of
variables. We will discuss the main types of variables and look at an example for each.
We will sometimes refer to them as measurement scales.

Categorical Data

Categorical data represents characteristics. Therefore it can represent things like a
person's gender, language etc. Categorical data can also take on numerical values
(Example: 1 for female and 0 for male). Note that those numbers don't have mathematical
meaning.

Nominal Data

Nominal values represent discrete units and are used to label variables that have no
quantitative value. Just think of them as "labels". Note that nominal data has no order.
Therefore, if you changed the order of its values, the meaning would not change.
You can see two examples of nominal features below:

The left feature, which describes a person's gender, would be called "dichotomous", which is
a type of nominal scale that contains only two categories.

Ordinal Data

Ordinal values represent discrete and ordered units. Ordinal data is therefore nearly the
same as nominal data, except that its ordering matters. You can see an example below:

Note that the difference between Elementary and High School is different from the
difference between High School and College. This is the main limitation of ordinal data:
the differences between the values are not really known. Because of that, ordinal scales
are usually used to measure non-numeric features like happiness, customer satisfaction
and so on.

Numerical Data

1. Discrete Data

We speak of discrete data if its values are distinct and separate. In other words: We speak
of discrete data if the data can only take on certain values. This type of data can’t be
measured but it can be counted. It basically represents information that can be
categorized into a classification. An example is the number of heads in 100 coin flips.

You can check whether you are dealing with discrete data by asking two questions: Can you
count it? Can it be divided up into smaller and smaller parts? If you can count it and it
cannot be divided up further, the data are discrete.

2. Continuous Data

Continuous Data represents measurements and therefore their values can’t be counted
but they can be measured. An example would be the height of a person, which you can
describe by using intervals on the real number line.
Interval Data

Interval values represent ordered units that have the same difference. Therefore we
speak of interval data when we have a variable that contains numeric values that are
ordered and where we know the exact differences between the values. An example would
be a feature that contains the temperature of a given place, as you can see below:

The problem with interval data is that they don't have a "true zero". In our example,
that means there is no such thing as "no temperature". With interval data,
we can add and subtract, but we cannot multiply, divide or calculate ratios. Because there
is no true zero, a lot of descriptive and inferential statistics can't be applied.

Ratio Data

Ratio values are also ordered units that have the same difference. Ratio values are the
same as interval values, with the difference that they do have an absolute zero. Good
examples are height, weight, length etc.
 Summarizing data
When you want to measure something in the natural world you usually have to take
several measurements. This is because things are variable, so you need several results
to get an idea of the situation. Once you have these measurements you need to
summarize them in some way because sets of raw numbers are not easily interpreted by
most people.

Basics of summarizing data

There are four key areas to consider when summarizing a set of numbers:

Centrality – the middle value or average.

Dispersion – how spread out the values are from the average.

Replication – how many values there are in the sample.

Shape – the data distribution, which relates to how "evenly" the values are spread either
side of the average.

You need to present the first three summary statistics in order to summarize a set of
numbers adequately. There are different measures of centrality and dispersion – the
measures you select are based on the last item, shape (or data distribution).
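
As a brief illustration of the first three summary statistics, the sketch below reports centrality, dispersion and replication for a small sample. The measurements are hypothetical values made up for this example, and the mean and standard deviation are used on the assumption that the data are roughly symmetric.

```python
import statistics

# Hypothetical sample of repeated measurements (not from the text).
measurements = [4, 8, 6, 5, 7]

centrality = statistics.mean(measurements)    # average
dispersion = statistics.stdev(measurements)   # sample standard deviation
replication = len(measurements)               # number of values in the sample

print(f"mean = {centrality}, sd = {dispersion:.2f}, n = {replication}")
```

For skewed data the median and inter-quartile range would usually be reported in place of the mean and standard deviation.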

Summary

You should always summarize a sample of data values to make them more easily
understood (by you and others). At the very least you need to show:

Middle value – centrality, that is, an average.

Dispersion – how spread out the data are around the average.

Replication – how large the sample is.

The shape of the data (its distribution) is also important because the shape determines
which summary statistics are most appropriate to describe the sample. Your data may be
normally distributed (i.e. with a symmetrical, bell-shaped curve) and so parametric, or
they may be skewed and therefore non-parametric.

You can draw the distribution using:

Tally plots – a simple frequency plot.

Histograms – a frequency plot like a bar chart.

You can also use shape statistics:

Skewness – how central the average is.

Kurtosis – how pointed the distribution is.

The shape of the data also leads you towards the most appropriate ways of analyzing the
data, that is, which statistical tests you can use.
 Frequency Distribution

Frequency

Frequency is how often something occurs.

Example: Sam played football on:

 Saturday Morning,
 Saturday Afternoon
 Thursday Afternoon

The frequency was 2 on Saturday, 1 on Thursday and 3 for the whole week.

Frequency Distribution

By counting frequencies we can make a Frequency Distribution table.

Example: Goals

Sam's team has scored the following numbers of goals in recent games

2, 3, 1, 2, 1, 3, 2, 3, 4, 5, 4, 2, 2, 3

Sam put the numbers in order, then added up:

 how often 1 occurs (2 times),


 how often 2 occurs (5 times),
 etc,

and wrote them down as a Frequency Distribution table.

From the table we can see interesting things such as


 getting 2 goals happens most often
 only once did they get 5 goals
This is the definition:

Frequency Distribution: values and their frequency (how often each value occurs).
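
The goals example can be tallied in a few lines. This is a minimal sketch using Python's collections.Counter; the variable names are illustrative.

```python
from collections import Counter

# Goals scored by Sam's team in recent games (from the example above).
goals = [2, 3, 1, 2, 1, 3, 2, 3, 4, 5, 4, 2, 2, 3]

# Counter maps each value to how often it occurs.
frequency = Counter(goals)

print("Goals  Frequency")
for value in sorted(frequency):
    print(f"{value:<6} {frequency[value]}")
```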

Here is another example :

Example: Newspapers

These are the numbers of newspapers sold at a local shop over the last 10 days:

22, 20, 18, 23, 20, 25, 22, 20, 18, 20

Let us count how many of each number there is:

Papers Sold    Frequency
18             2
19             0
20             4
21             0
22             2
23             1
24             0
25             1

It is also possible to group the values. Here they are grouped in 5s:

Papers Sold    Frequency
15-19          2
20-24          7
25-29          1
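
Grouping the newspaper sales in 5s can be sketched as follows. The approach of rounding each value down to the start of its group is one possible way to do it, chosen here for brevity.

```python
from collections import Counter

# Newspapers sold over the last 10 days (from the example above).
papers_sold = [22, 20, 18, 23, 20, 25, 22, 20, 18, 20]
width = 5  # size of each group

# Map every value to the lower bound of its group, e.g. 18 -> 15, 23 -> 20.
grouped = Counter((value // width) * width for value in papers_sold)

for start in sorted(grouped):
    print(f"{start}-{start + width - 1}: {grouped[start]}")
```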
 Measures Of Dispersion

Measures of dispersion

The central tendency is not the only interesting or useful information about a data set.
The two data sets illustrated below have the same mean (0), but have different spreads
around the mean. Each circle represents one value from the data set (or one datum).

Dispersion is a general term for different statistics that describe how values are distributed
around the centre. In this section we will look at measures of dispersion.

Range
DEFINITION

Range

The range of a data set is the difference between the maximum and minimum values in
the set.

The most straightforward measure of dispersion is the range. The range simply tells us
how far apart the largest and smallest values in a data set are. The range is very sensitive
to outliers.

EXAMPLE

QUESTION

FIND THE RANGE OF THE FOLLOWING DATA SET:

{1;4;5;8;6;7;5;6;7;4;10;9;10}

WHAT WOULD HAPPEN IF WE REMOVED THE FIRST VALUE FROM THE SET?
SOLUTION

Step 1: Determine the range

The smallest value in the data set is 1 and the largest value is 10.

The range is 10 − 1 = 9

Step 2: Remove the first value

If the first value, 1, were to be removed from the set, the minimum value would be 4.
This means that the range would change to 10 − 4 = 6. The value 1 is not typical of the
other values. It is an outlier and has a big influence on the range.
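
The effect of the outlier on the range can be reproduced with a short sketch. The helper name data_range is an illustrative choice, not a standard function.

```python
def data_range(values):
    """Difference between the maximum and minimum values."""
    return max(values) - min(values)

# Data set from the example above.
data = [1, 4, 5, 8, 6, 7, 5, 6, 7, 4, 10, 9, 10]

full_range = data_range(data)          # includes the outlier 1
trimmed_range = data_range(data[1:])   # first value removed

print(full_range, trimmed_range)
```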

Percentiles
DEFINITION

Percentile

The pth percentile is the value, v, that divides a data set into two parts, such that
p percent of the values in the data set are less than v and 100 − p percent of the values
are greater than v. Percentiles can lie in the range 0 ≤ p ≤ 100.

To understand percentiles properly, we need to distinguish between 3 different
aspects of a datum: its value, its rank and its percentile:

The value of a datum is what we measured and recorded during an experiment or
survey.

The rank of a datum is its position in the sorted data set (for example, first, second,
third, and so on).

The percentile at which a particular datum is, tells us what percentage of the values
in the full data set are less than this datum.

The table below summarises the value, rank and percentile of the data set:

{14,2; 13,9; 19,8; 10,3; 13,0; 11,1}

Value    Rank    Percentile
10,3     1       0
11,1     2       20
13,0     3       40
13,9     4       60
14,2     5       80
19,8     6       100

As an example, 13,0 is at the 40th percentile since there are 2 values less than 13,0
and 3 values greater than 13,0:

2 / (2 + 3) = 0,4 = 40%

In general, the formula for finding the pth percentile in an ordered data set with n values
is

r = (p / 100)(n − 1) + 1

This gives us the rank, r, of the pth percentile. To find the value of the pth percentile, we
have to count from the first value in the ordered data set up to the rth value.

Sometimes the rank will not be an integer. This means that the percentile lies between
two values in the data set. The convention is to take the value halfway between the two
values indicated by the rank.
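
The rank formula r = (p / 100)(n − 1) + 1 and the halfway convention can be sketched in Python as below. The function names are illustrative, p is assumed to be a whole number between 0 and 100, and Python writes decimals with a point where the text uses a comma.

```python
import math
from fractions import Fraction

def percentile_rank(p, n):
    """Rank r of the p-th percentile in an ordered data set of n values."""
    return Fraction(p, 100) * (n - 1) + 1

def percentile_value(data, p):
    """Value at the p-th percentile; a fractional rank gives the value
    halfway between the two neighbouring data values."""
    ordered = sorted(data)
    r = percentile_rank(p, len(ordered))
    lower, upper = math.floor(r), math.ceil(r)
    # Ranks count from 1, list positions count from 0.
    return (ordered[lower - 1] + ordered[upper - 1]) / 2

# Data set from the table above.
values = [14.2, 13.9, 19.8, 10.3, 13.0, 11.1]
rank_40 = percentile_rank(40, len(values))
value_40 = percentile_value(values, 40)
print(rank_40, value_40)
```

Exact fractions are used for the rank so that a whole-number rank such as 3 is never turned into something like 3.0000001 by floating-point rounding.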

The figure below shows the relationship between rank and percentile graphically. We
have already encountered three percentiles in this chapter: the median
(50th percentile), the minimum (0th percentile) and the maximum (100th). The median is
defined as the value halfway in a sorted data set.
EXAMPLE

QUESTION

Determine the minimum, maximum and median values of the following data set using the
percentile formula.

{14;17;45;20;19;36;7;30;8}

SOLUTION

Step 1: Sort the values in the data set

Before we can use the rank to find values in the data set, we always have to order the
values from the smallest to the largest. The sorted data set is

{7;8;14;17;19;20;30;36;45}

Step 2: Find the minimum

We already know that the minimum value is the first value in the ordered data set. We will
now confirm that the percentile formula gives the same answer. The minimum is
equivalent to the 0th percentile. According to the percentile formula the rank, r, of the p=0
percentile in a data set with n=9 values is:

r = (p / 100)(n − 1) + 1
  = (0 / 100)(9 − 1) + 1
  = 1

This confirms that the minimum value is the first value in the list, namely 7.

Step 3: Find the maximum

We already know that the maximum value is the last value in the ordered data set. The
maximum is also equivalent to the 100th percentile. Using the percentile formula
with p=100 and n=9, we find the rank of the maximum value is:

r = (p / 100)(n − 1) + 1
  = (100 / 100)(9 − 1) + 1
  = 9

This confirms that the maximum value is the last (the ninth) value in the list, namely 45.

Step 4: Find the median

The median is equivalent to the 50th percentile. Using the percentile formula with p=50
and n=9, we find the rank of the median value is:

r = (50 / 100)(n − 1) + 1
  = (50 / 100)(9 − 1) + 1
  = (1 / 2)(8) + 1
  = 5

This shows that the median is in the middle (at the fifth position) of the ordered data set.
Therefore the median value is 19.

DEFINITION

Quartiles

The quartiles are the three data values that divide an ordered data set into four groups,
where each group contains an equal number of data values. The median (50th percentile)
is the second quartile (Q2). The 25th percentile is also called the first or lower quartile
(Q1). The 75th percentile is also called the third or upper quartile (Q3).
QUARTILES

QUESTION

Determine the quartiles of the following data set:

{7;45;11;3;9;35;31;7;16;40;12;6}
SOLUTION

Step 1: Sort the data set

{3;6;7;7;9;11;12;16;31;35;40;45}

Step 2: Find the ranks of the quartiles

Using the percentile formula with n=12, we can find the rank of
the 25th, 50th and 75th percentiles:

r25 = (25 / 100)(12 − 1) + 1 = 3,75

r50 = (50 / 100)(12 − 1) + 1 = 6,5

r75 = (75 / 100)(12 − 1) + 1 = 9,25

Step 3: Find the values of the quartiles

Note that each of these ranks is a fraction, meaning that the value for each percentile is
somewhere in between two values from the data set.

For the 25th percentile the rank is 3,75, which is between the third and fourth values.
Since both these values are equal to 7, the 25th percentile is 7.

For the 50th percentile (the median) the rank is 6,5, meaning halfway between the sixth
and seventh values. The sixth value is 11 and the seventh value is 12, which means that
the median is (11 + 12) / 2 = 11,5.

For the 75th percentile the rank is 9,25, meaning between the ninth and tenth values.
Therefore the 75th percentile is (31 + 35) / 2 = 33.

Deciles

The deciles are the nine data values that divide an ordered data set into ten groups,
where each group contains an equal number of data values.

For example, consider the ordered data set:

28;33;35;45;57;59;61;68;69;72;75;78;80;83;86;91;92;95;101;105;111;117;118;125;127;
131;137;139;

The nine deciles are: 35;59;69;78;86;95;111;125;137

Percentiles for grouped data

In grouped data, the percentiles will lie somewhere inside a range, rather than at a specific
value. To find the range in which a percentile lies, we still use the percentile formula to
determine the rank of the percentile and then find the range within which that rank is.
EXAMPLE

QUESTION

The mathematics marks of 100 grade 10 learners at a school have been collected. The
data are presented in the following table:

Percentage Mark    Number of Learners
0≤x<20             2
20≤x<30            5
30≤x<40            18
40≤x<50            22
50≤x<60            18
60≤x<70            13
70≤x<80            12
80≤x<100           10

 Calculate the mean of this grouped data set.


 In which intervals are the quartiles of the data set?
 In which interval is the 30th percentile of the data set?

SOLUTION

Step 1: Calculate the mean

Since we are given grouped data rather than the original ungrouped data, the best we
can do is approximate the mean as if all the learners in each interval were located at the
central value of the interval.

Mean = [2(10) + 5(25) + 18(35) + 22(45) + 18(55) + 13(65) + 12(75) + 10(90)] / 100

= 5400 / 100

= 54%
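
The midpoint approximation for the grouped mean can be sketched as follows. The layout of the groups list is an illustrative choice; the intervals and learner counts come from the table in the question.

```python
# (lower bound, upper bound, number of learners) for each mark interval.
groups = [(0, 20, 2), (20, 30, 5), (30, 40, 18), (40, 50, 22),
          (50, 60, 18), (60, 70, 13), (70, 80, 12), (80, 100, 10)]

total_learners = sum(count for _, _, count in groups)

# Treat every learner as if they scored the central value of their interval.
weighted_sum = sum((low + high) / 2 * count for low, high, count in groups)

grouped_mean = weighted_sum / total_learners
print(total_learners, weighted_sum, grouped_mean)
```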

Step 2: Find the quartiles

Since the data have been grouped, they have also already been sorted. Using the
percentile formula and the fact that there are 100 learners, we can find the ranks of
the 25th, 50th and 75th percentiles as

r25 = (25 / 100)(100 − 1) + 1 = 25,75

r50 = (50 / 100)(100 − 1) + 1 = 50,5

r75 = (75 / 100)(100 − 1) + 1 = 75,25

Now we need to find in which ranges each of these ranks lies.

For the lower quartile, there are 2+5=7 learners in the first two ranges combined and
2+5+18=25 learners in the first three ranges combined. Since r25 = 25,75 lies between the
25th and 26th learners, the lower quartile falls on the boundary between the third range
(30≤x<40), which holds the 25th learner, and the fourth range (40≤x<50), which holds the
26th learner.

For the second quartile (the median), there are 2+5+18+22=47 learners in the first four
ranges combined and 47+18=65 learners in the first five ranges combined. Since
47<r50<65, this means that the median lies somewhere in the fifth range: 50≤x<60.

For the upper quartile, we have that there are 65 learners in the first five ranges combined
and 65+13=78 learners in the first six ranges combined. Since 65<r75<78, this means
that the upper quartile lies somewhere in the sixth range: 60≤x<70.

Step 3: Find the 30th percentile

Using the same method as for the quartiles, we first find the rank of the 30th percentile.

r = (30 / 100)(100 − 1) + 1 = 30,7

Now we have to find the range in which this rank lies. Since there are 25 learners in the
first 3 ranges combined and 47 learners in the first 4 ranges combined,
the 30th percentile lies in the fourth range: 40≤x<50.
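
Locating the range that holds a percentile can be sketched with a running total of learners. The function name is illustrative, p is assumed to be a whole number, and the sketch reports the range that contains the learner at the rank rounded up to the next whole position, which is a simplifying assumption for ranks that fall between two learners.

```python
import math
from fractions import Fraction

# (lower bound, upper bound, number of learners) for each mark interval.
groups = [(0, 20, 2), (20, 30, 5), (30, 40, 18), (40, 50, 22),
          (50, 60, 18), (60, 70, 13), (70, 80, 12), (80, 100, 10)]

def interval_for_percentile(groups, p):
    """Return the (low, high) interval containing the p-th percentile."""
    n = sum(count for _, _, count in groups)
    rank = Fraction(p, 100) * (n - 1) + 1   # r = (p / 100)(n - 1) + 1
    position = math.ceil(rank)              # learner position to look for
    cumulative = 0
    for low, high, count in groups:
        cumulative += count                 # learners counted so far
        if position <= cumulative:
            return (low, high)
    raise ValueError("p must be between 0 and 100")

print(interval_for_percentile(groups, 30))
print(interval_for_percentile(groups, 50))
print(interval_for_percentile(groups, 75))
```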

Ranges

We define data ranges in terms of percentiles. We have already encountered the full data
range, which is simply the difference between the 100th and the 0th percentile (that is,
between the maximum and minimum values in the data set).

Interquartile range

The interquartile range is a measure of dispersion, which is calculated by subtracting the
first quartile (Q1) from the third quartile (Q3). This gives the range of the middle half of
the data set.

Semi interquartile range

The semi interquartile range is half of the interquartile range.
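
As an illustration, the interquartile and semi-interquartile ranges of the data set from the quartiles example can be computed as below. This is a sketch that reuses the rank formula r = (p / 100)(n − 1) + 1 and the halfway convention from earlier in this section; the function name is illustrative.

```python
import math
from fractions import Fraction

def percentile_value(data, p):
    """Value at the p-th percentile, halfway between neighbours if needed."""
    ordered = sorted(data)
    rank = Fraction(p, 100) * (len(ordered) - 1) + 1
    lower, upper = math.floor(rank), math.ceil(rank)
    return (ordered[lower - 1] + ordered[upper - 1]) / 2

# Data set from the quartiles example.
data = [7, 45, 11, 3, 9, 35, 31, 7, 16, 40, 12, 6]

q1 = percentile_value(data, 25)   # first (lower) quartile
q3 = percentile_value(data, 75)   # third (upper) quartile
iqr = q3 - q1                     # interquartile range
semi_iqr = iqr / 2                # semi-interquartile range

print(q1, q3, iqr, semi_iqr)
```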

 Measures Of Central Tendency


A measure of central tendency is a summary statistic that represents the center point or
typical value of a dataset. These measures indicate where most values in a distribution
fall and are also referred to as the central location of a distribution. You can think of it as
the tendency of data to cluster around a middle value. In statistics, the three most
common measures of central tendency are the mean, median, and mode. Each of these
measures calculates the location of the central point using a different method.

Choosing the best measure of central tendency depends on the type of data you have. In
this post, I explore these measures of central tendency, show you how to calculate them,
and how to determine which one is best for your data.

Locating the Center of Your Data

The three distributions below represent different data conditions. In each distribution, look
for the region where the most common values fall. Even though the shapes and type of
data are different, you can find that central location. That’s the area in the distribution
where the most common values are located.
As the graphs highlight, you can see where most values tend to occur. That’s the concept.
Measures of central tendency represent this idea with a value. Coming up, you’ll learn
that as the distribution and kind of data changes, so does the best measure of central
tendency. Consequently, you need to know the type of data you have, and graph it, before
choosing a measure of central tendency!

The central tendency of a distribution represents one characteristic of a distribution.
Another aspect is the variability around that central value. Measures of variability
are a separate topic, but this property describes how far away the data
points tend to fall from the center. The graph below shows how distributions with the same
central tendency (mean = 100) can actually be quite different. The panel on the left
displays a distribution that is tightly clustered around the mean, while the distribution on
the right is more spread out. It is crucial to understand that the central
tendency summarizes only one aspect of a distribution and that it provides an incomplete
picture by itself.

Mean

The mean is the arithmetic average, and it is probably the measure of central tendency
that you are most familiar with. Calculating the mean is very simple. You just add up all of
the values and divide by the number of observations in your dataset.

The calculation of the mean incorporates all values in the data. If you change any value,
the mean changes. However, the mean doesn’t always locate the center of the data
accurately. Observe the histograms below where I display the mean in the distributions.

In a symmetric distribution, the mean locates the center accurately.

However, in a skewed distribution, the mean can miss the mark. In the histogram above,
it is starting to fall outside the central area. This problem occurs because outliers have a
substantial impact on the mean. Extreme values in an extended tail pull the mean away
from the center. As the distribution becomes more skewed, the mean is drawn further
away from the center.

Consequently, it’s best to use the mean as a measure of the central tendency when you
have a symmetric distribution.
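
The pull that an extreme value has on the mean can be seen in a small sketch. Both data sets are hypothetical and chosen only for illustration; the median is shown alongside the mean for comparison.

```python
import statistics

# Hypothetical symmetric data set.
symmetric = [2, 3, 4, 5, 6]

# The same data with one extreme value added to the upper tail.
skewed = symmetric + [60]

mean_symmetric = statistics.mean(symmetric)
median_symmetric = statistics.median(symmetric)
mean_skewed = statistics.mean(skewed)
median_skewed = statistics.median(skewed)

# In the symmetric set the mean and median agree; the outlier moves the
# mean far more than it moves the median.
print(mean_symmetric, median_symmetric)
print(round(mean_skewed, 2), median_skewed)
```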

 Measures Of Relative Position


Percentiles

Assume that the elements in a data set are rank ordered from the smallest to the largest.
The values that divide a rank-ordered set of elements into 100 equal parts are
called percentiles.

An element having a percentile rank of Pi would have a greater value than i percent of all
the elements in the set. Thus, the observation at the 50th percentile would be denoted
P50, and it would be greater than 50 percent of the observations in the set. An observation
at the 50th percentile would correspond to the median value in the set.

Quartiles

Quartiles divide a rank-ordered data set into four equal parts. The values that divide
each part are called the first, second, and third quartiles; and they are denoted by Q1,
Q2, and Q3, respectively. The chart below shows a set of four numbers divided into
quartiles.

Note the relationship between quartiles and percentiles. Q1 corresponds to P25,
Q2 corresponds to P50, Q3 corresponds to P75. Q2 is the median value in the set.

Standard Scores (z-Scores)

A standard score (aka, a z-score) indicates how many standard deviations an element is from
the mean. A standard score can be calculated from the following formula.

 z = (X - μ) / σ

where z is the z-score, X is the value of the element, μ is the mean of the population, and σ is the
standard deviation.

Here is how to interpret z-scores.

 A z-score less than 0 represents an element less than the mean.


 A z-score greater than 0 represents an element greater than the mean.
 A z-score equal to 0 represents an element equal to the mean.
 A z-score equal to 1 represents an element that is 1 standard deviation greater than
the mean; a z-score equal to 2, 2 standard deviations greater than the mean; etc.
 A z-score equal to -1 represents an element that is 1 standard deviation less than
the mean; a z-score equal to -2, 2 standard deviations less than the mean; etc.

Example

A national achievement test is administered annually to 3rd graders. The test has a mean
score of 100 and a standard deviation of 15. If Jane's z-score is 1.20, what was her score on
the test?

(A) 82
(B) 88
(C) 100
(D) 112
(E) 118

Solution

The correct answer is (E). From the z-score equation, we know


 z = (X - μ) / σ

where z is the z-score, X is the value of the element, μ is the mean of the population, and σ
is the standard deviation.

Solving for Jane's test score (X), we get

 X = (z * σ) + μ = (1.20 * 15) + 100 = 18 + 100 = 118
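
The z-score formula and its rearrangement for X can be sketched in Python as below. The function names are illustrative; the mean, standard deviation and z-score are the ones given in the example.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

def score_from_z(z, mean, std_dev):
    """Rearranged formula: X = (z * sigma) + mu."""
    return z * std_dev + mean

# Jane's test: mean 100, standard deviation 15, z-score 1.20.
jane_score = score_from_z(1.20, 100, 15)

# Converting the score back should recover the original z-score.
jane_z = z_score(jane_score, 100, 15)

print(jane_score, jane_z)
```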
