Biostat - Descriptive Statistics - Lec 2
TAGUIBAO | BSMT 2D
used to organize data into a meaningful form.

o DESCRIPTIVE DATA

EXAMPLE 1: A poll found that 49% of the people in a survey knew the name of the first book of the Bible. The statistic 49 describes the percentage (proportion) of persons who knew the first book of the Bible.
- In this example, the statistic 49 is simply the percentage of persons. We are not drawing conclusions, just describing the percentage.

EXAMPLE 2: According to consumer reports, Sharp washing machine owners reported 9 problems per 100 machines in 2011. The statistic 9 describes the number of problems out of every 100 machines reported.
- The statistic used here simply describes a rate of 9 out of every 100. No conclusion is being made; only the actual data is being described.

“Sometimes, we must make decisions with a limited set of data. For example, we would like to know the operating characteristics, such as fuel efficiency measured in km/L, of SUVs currently in use. If we spent a lot of time, money, and effort, all the owners of SUVs could be surveyed. In this case, our goal would be to survey the population of SUV owners. However, based on inferential statistics, we can survey a limited number of SUV owners and collect a sample from the population. Samples are often used to obtain reliable estimates of population parameters. In the process, we can save time, money, and effort in collecting the data.”

TYPES OF DATA
- “Datum” is the singular form; it is a single piece of information.
- “Data” is the plural form; it is a group of information, facts, observations, statistics, records, and reports.
- Data can be classified into two:
➢ Ungrouped data
• Raw and unorganized information
EXAMPLE: Scores of 20 students on a 20-item quiz: 15, 8, 9, 14, 18, 7, 17, 3, 15, 19, 7, 20, 13, 15, 18, 15, 2, 9, 2, 5
➢ Grouped data
• Organized or processed data that is presented in a frequency table.
EXAMPLE: Rating of 70 students on their performance task.
RATING      FREQUENCY
Superior    6
Good        28
Average     21
Poor        12
Inferior    3

ORGANIZING DATA
- We use statistics and analytics to summarize and analyze data and information to support our decisions. Graphical representation of data is therefore important for easily visualizing the amount of data that we process.
- Descriptive statistics organize data to show the general pattern of the data, to identify where values tend to concentrate, and to expose extreme data values.

CONSTRUCTING FREQUENCY TABLES
- A frequency table is a grouping of qualitative data into mutually exclusive and collectively exhaustive classes showing the number of observations in each class.
EXAMPLE: Suppose that you are a car sales agent, and you want to summarize last month's sales by location.
➢ The first step is to sort the vehicles sold last month according to their location.
➢ Then tally or count the vehicles that were sold in each of the four locations. The four locations are used to develop a frequency table with four mutually exclusive or distinctive classes. Mutually exclusive classes means that a particular vehicle can be assigned to only one class. In addition, the frequency table must be collectively exhaustive, that is, every vehicle sold last month is accounted for in the table. If every vehicle is included in the table, then the table is collectively exhaustive, and the total number of vehicles in the table will equal the number of vehicles sold last month.

What is a relative class frequency?
- You can convert class frequencies into relative class frequencies to show the fraction of the total number of observations in each class.
- A relative frequency captures the relationship between a class frequency and the total number of observations.
EXAMPLE:

we can describe profit by using frequency distribution.

What is a frequency distribution?
- It is a grouping of quantitative data into mutually exclusive and collectively exhaustive classes showing the number of observations in each class.

How do we develop a frequency distribution?
- The following examples show the steps for developing a frequency distribution.
- Our goal is to summarize the quantitative variable profit with a frequency distribution and display it using charts and graphs. With this information, we can easily answer the following questions:
➢ What is the typical profit for each sale?
➢ What is the largest or maximum profit for any sale?
➢ What is the smallest or minimum profit for any sale?
➢ Around what value do the profits tend to cluster?
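As a rough illustration of relative class frequencies, the Python sketch below reuses the rating table of 70 students given earlier in these notes. Each relative frequency is the class frequency divided by the total number of observations. The variable names are illustrative choices, not part of the lecture.

```python
# Class frequencies from the rating table of 70 students in these notes.
rating_frequency = {
    "Superior": 6,
    "Good": 28,
    "Average": 21,
    "Poor": 12,
    "Inferior": 3,
}

# Total number of observations (the table covers 70 students).
total = sum(rating_frequency.values())

# Relative class frequency = class frequency / total number of observations.
relative_frequency = {
    rating: count / total for rating, count in rating_frequency.items()
}

# Print rating, frequency, and relative frequency side by side.
for rating, share in relative_frequency.items():
    print(f"{rating:<10}{rating_frequency[rating]:>4}{share:>8.3f}")
```

The relative frequencies always add up to 1, which is a quick check that the table is collectively exhaustive.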
➢ It is possible to search the list to find the smallest or minimum profit, $294, and the largest or maximum profit, $3292. Those are just the two values we can find by searching the table.
➢ It is difficult to determine the typical profit or to visualize where the profits tend to cluster.
➢ To do this, the raw data must be summarized with a frequency distribution table so that it can be easily interpreted.

STEPS ON HOW TO MAKE A DISTRIBUTION TABLE:

STEP 1: Decide on the number of classes.
To do this, we use the 2^k rule: choose the smallest number of classes k such that 2^k is greater than the number of observations n.
In the example, there were 180 vehicles sold.
n = 180
k = 7 (7 classes)
2^7 = 128 (less than 180)
*We cannot use 7 as our k because it gives a result less than the total vehicles sold, which is 180. We try a number greater than 7 to get a result greater than 180.
k = 8 (8 classes)
2^8 = 256 (greater than 180)

STEP 2: Determine the class interval.
Generally, the class interval is the same for all classes. The classes, taken altogether, must cover at least the distance from the minimum value in the data up to the maximum value. We have the formula:
\[ i \ge \frac{\text{Maximum value} - \text{Minimum value}}{k} \]
i = class interval
k = number of classes
Minimum value = 294
Maximum value = 3292
k = 8 classes
\[ i \ge \frac{3292 - 294}{8} = 374.75 \]
*In practice, the interval size is rounded up to some convenient number such as a multiple of 10 or 100. For the example, we round the class interval up to 400 because a multiple of 100 is a reasonable choice.

STEP 3: Set the individual class limits.
We state clear class limits so that we can put each observation into only one category. This means we must avoid overlapping or unclear class limits. We must always remember that the classes must be mutually exclusive, which means that each observation can belong to only one class.
Number of classes = 8
Class interval = 400
*We base the lower limit of the first class on the minimum value in our data set. Since our minimum value is 294, we use 200 as our lower limit. Since our class interval is 400, we add this to the lower limit, which gives 600 as the upper limit of the first class. For the rest of the classes, continue to add 400: from 600, add 400 to get 1000 for the second class; from 1000, add 400 to get 1400 for the third class; and so on. Remember to check that the maximum value fits in a class. Since we have decided on 8 classes, check that the maximum value falls in the 8th class.

STEP 4: Tally.
Tally the vehicle profits into classes and determine the number of observations in each class. This is simply tallying each value into the class it falls under.
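Steps 1 to 3 can be sketched in Python using the figures from the vehicle profit example (180 observations, minimum 294, maximum 3292). This is only a sketch under stated assumptions: rounding the interval up to a multiple of 100 and rounding the minimum down to a multiple of 100 are the choices made in the notes for this data set, not general rules, and the variable names are illustrative.

```python
import math

n = 180                       # number of vehicles sold (observations)
minimum, maximum = 294, 3292  # smallest and largest profit in the data

# Step 1: find the smallest k such that 2**k is greater than n.
k = 1
while 2 ** k <= n:
    k += 1

# Step 2: the class interval must be at least (max - min) / k ...
raw_interval = (maximum - minimum) / k
# ... and is rounded up to a convenient multiple (100 here, as in the notes).
step = 100
interval = math.ceil(raw_interval / step) * step

# Step 3: start the first class at a convenient value at or below the minimum.
# Assumption: round the minimum down to a multiple of 100, which gives 200.
lower = (minimum // step) * step
class_limits = [
    (lower + j * interval, lower + (j + 1) * interval) for j in range(k)
]

print(k, raw_interval, interval)
print(class_limits)
```

The last class ends at 3400, so the maximum profit of 3292 fits inside the 8th class, which is the check the notes ask for.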
When the profits are tallied, the table will appear like this:

*The number of observations in each class is called the frequency. For the first class, from 200 up to 600, the frequency is 8. For the second class, from 600 up to 1000, the frequency is 11. Since the total number of profits is 180, the frequencies should sum to 180.

FREQUENCY DISTRIBUTION TABLE

the lower limit and the upper limit of each class.
→ The class midpoint is determined by adding the lower limit and the upper limit and dividing the sum by 2. For Class 1, we add 200 and 600 to get 800; dividing 800 by 2 gives a class midpoint of 400. The class midpoint of 400 best represents the profit for the eight vehicles in Class 1. The largest concentration, or highest frequency, of vehicles sold is in the class from $1800 to $2200, the 5th class, with a frequency of 45. The class midpoint of the 5th class is found by adding $1800 and $2200, which gives $4000; dividing this by two gives a class midpoint of $2000. Therefore, the typical profit, taken from the class that has the highest frequency, is $2000.
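The class midpoint arithmetic above can be checked with a short Python sketch. The function name `class_midpoint` is a hypothetical helper introduced for illustration only.

```python
def class_midpoint(lower_limit, upper_limit):
    """Return the value halfway between a class's lower and upper limits."""
    return (lower_limit + upper_limit) / 2


# Class 1 runs from 200 up to 600.
print(class_midpoint(200, 600))

# Class 5 runs from 1800 up to 2200 and has the highest frequency (45).
print(class_midpoint(1800, 2200))
```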
NUMERICAL MEASURES
- This section is concerned with two numerical ways of describing quantitative variables, namely:
➢ Measures of location
• These are often referred to as “averages”.
• The purpose is to pinpoint the center of the distribution of data.
• An average is a measure of location. It shows the central value of the data. If we consider only the measures of location in the data, we may draw an erroneous conclusion.
➢ Measures of dispersion
• This is often called “the variation or the spread in the data.”
• To describe the dispersion, we consider the range, the variance, and the standard deviation.

o MEASURES OF LOCATION
- There is not just one measure of location; in fact, there are many. We will consider four measures of location, as follows:
1. Arithmetic mean
• This is commonly called the “mean” or “average”.
• This is the most widely used and widely reported measure of location.
• We study the mean as both a population parameter and a sample statistic.
➢ Population mean
- Many studies involve all the values in a population. For example, there are 1000 medical technology students enrolled at Velez College. This is a population value because we considered the enrollment of all medical technology students at Velez College.
- For raw data (that is, data that have not been grouped in a frequency distribution), the population mean is the sum of all the values in the population divided by the number of values in the population.

To find the population mean, we use the formula:
\[ \mu = \frac{\Sigma x}{N} \]
where:
μ = the population mean. It is the Greek lowercase letter “mu.”
N = the number of values in the population.
x = any particular value.
Σ = the Greek capital letter “sigma”; it indicates the operation of adding.
Σx = the sum of the x values in the population.
- Any measurable characteristic of a population is called a parameter. The mean of a population is an example of a parameter. A PARAMETER is a characteristic of a population.

EXERCISE:
➢ This problem involves a population because we are considering all the exits on I-75.
➢ We add the distances between the 42 exits, which gives a total of 192 miles.
➢ To find the arithmetic mean, we divide this total by 42.
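For the I-75 exercise, the population mean is the total of 192 miles divided by the 42 values. A minimal Python check of that arithmetic, using only the two figures quoted in the notes (variable names are illustrative):

```python
total_distance = 192   # miles, the sum of the distances between exits
number_of_values = 42  # N, the number of values in the population

# Population mean: sum of all values divided by the number of values.
population_mean = total_distance / number_of_values

print(round(population_mean, 2))
```

Rounded to two decimals, the mean distance between exits is about 4.57 miles.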
➢ Sample mean
- We often select a sample of the population to estimate a specific characteristic of the population.
EXAMPLE: A quality assurance department needs to be assured that a jar of orange marmalade labeled 12 ounces actually contains that amount. But it would be very expensive and time consuming to check the weight of each jar. Therefore, a sample of 20 jars is selected, the mean of the sample is determined, and that value is used to estimate the amount in each jar.
- For raw data (that is, data that have not been grouped in a frequency distribution), the mean is the sum of all the sample values divided by the total number of sample values.
- To find the mean of a sample, we use the formula:
\[ \bar{x} = \frac{\Sigma x}{n} \]
where:
x̄ = the sample mean.
n = the number of values in the sample.
x = any particular value.
Σ = the Greek capital letter “sigma”; it indicates the operation of adding.
Σx = the summation of the x values.
- Any measurable characteristic of a sample is called a “statistic.”

EXERCISE:

→ To compute a mean, the data must be at the interval or ratio level. Ratio level data include such data as ages, income, and weight.
→ All the values are included in computing the mean.
→ The mean is unique. That is, there is only one mean in a set of data.
→ The sum of the deviations of each value from the mean is 0.
e.g.
\[ (3 - 5) + (8 - 5) + (4 - 5) = -2 + 3 - 1 = 0 \]
*The mean of 3, 8, and 4 is 5. If we subtract the mean from each value, 3 minus 5 gives -2, 8 minus 5 gives 3, and 4 minus 5 gives -1. Adding these together gives 0, so the sum of the deviations of each value from the mean is equal to 0.
• We can now consider the mean as the balance point of the data. However, the mean does have a weakness. Because the mean uses all the values in a set of data, if one or two values are extremely large or extremely small compared to the majority of the data, the mean might not be an appropriate average to represent the data.

2. Median
• We have stressed that when one or two values are extremely large or extremely small, the arithmetic mean might not be representative. The center for such data is better described by a measure of location called the median.
• The median is the midpoint of the values after they have been ordered from the minimum to the maximum values.
• The data must be at least at the ordinal level of measurement.
• It is not affected by extremely low or extremely high values.
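The property that deviations from the mean sum to zero can be verified for the values 3, 8, and 4 used in the notes. A short Python sketch (variable names are illustrative):

```python
values = [3, 8, 4]

# Arithmetic mean: sum of the values divided by how many there are.
mean = sum(values) / len(values)

# Deviation of each value from the mean.
deviations = [x - mean for x in values]

print(mean)
print(deviations)
print(sum(deviations))
```

The deviations are -2, 3, and -1, which is why the mean can be thought of as the balance point of the data.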
EXAMPLE: Suppose that you are seeking to buy a condominium in California.

Your real estate agent says that the typical price of the units currently available is $110,000. Would you still want to look if your budgeted price is $75,000? This may seem out of your price range. However, looking at the prices of the individual units, you find that they are $60,000, $65,000, $70,000, $80,000, and a super deluxe penthouse unit worth $275,000. The arithmetic mean price is $110,000, as your real estate agent reported. But the $275,000 price is pulling the arithmetic mean upward, causing it to be an unrepresentative average. It does seem that a price around $70,000 is a more typical or representative average. In cases like this, the median gives a more accurate measure of location.
• For larger data sets, finding the median by manually listing the values in ascending or descending order would prove to be difficult. In such a case, a formula for the median is given:

➢ The arithmetic mean of the two middle observations gives us the median hours.
➢ The median is found by averaging the two middle values. The middle values are 5 hours and 7 hours, and the mean of these two values is 6.
➢ So, we conclude that the median time adults spend on Facebook is 6 hours every month.

2.A. PROPERTIES OF MEDIAN
→ It is not affected by extremely large or extremely small values. Therefore, the median is a valuable measure of location when such values do occur.
→ It can be computed for ordinal level data or higher. Ordinal level data can be ranked from low to high.

3. Mode
• The mode is especially useful for summarizing nominal level data.
EXAMPLE:
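As a check on the condominium example and on the rule of averaging the two middle values, here is a small Python sketch using the standard-library `statistics` module. The `hours` list is hypothetical: only its two middle values, 5 and 7, come from the notes.

```python
from statistics import mean, median

# Condominium prices from the example in the notes.
prices = [60_000, 65_000, 70_000, 80_000, 275_000]

# The mean is pulled upward by the $275,000 penthouse.
print(mean(prices))

# The median is the middle value of the ordered prices.
print(median(prices))

# With an even number of values, the median is the mean of the two middle
# values. Hypothetical data: only the middle values 5 and 7 are from the notes.
hours = [1, 3, 5, 7, 9, 11]
print(median(hours))
```

The gap between the mean and the median is what signals that one extreme value is distorting the average.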
EXERCISE:
What is the modal distance of the following values shown below:

Conversely, for some data sets, there is more than one mode.
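A data set can have one mode or several. The sketch below uses Python's standard-library `statistics.mode` and `statistics.multimode` (available from Python 3.8). Both lists are hypothetical values made up for illustration, since the exercise data is not reproduced in these notes.

```python
from statistics import mode, multimode

# Hypothetical distances: 7 occurs most often, so there is a single mode.
distances = [4, 7, 7, 9, 12, 7, 9]
print(mode(distances))

# Hypothetical data where 7 and 9 are tied for the highest frequency.
two_modes = [4, 7, 7, 9, 9, 12]
print(multimode(two_modes))
```

`multimode` returns every value tied for the highest frequency, which makes it the safer choice when more than one mode is possible.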
SYMMETRY AND SKEWNESS
- It has two types:
➢ Normal or Symmetrical
Distribution
median we could have the next largest value and the mode is the smallest of the three measures.
- If the distribution is highly skewed, the mean would not be a good measure to use. The median and mode would be more representative.
• Negatively skewed

➢ The formula of the weighted mean is:
\[ \bar{x}_w = \frac{\Sigma (w x)}{\Sigma w} \]
Note: the denominator of the weighted mean is always the sum of the weights.
➢ Using the formula in the problem:

EXERCISE:
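A weighted mean divides the sum of each value times its weight by the sum of the weights. The Python sketch below is an illustration only: `weighted_mean` is a hypothetical helper, and the grades and weights are made-up numbers, not the data of the exercise.

```python
def weighted_mean(values, weights):
    """Sum of (weight * value) divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Hypothetical example: a grade of 80 with weight 1 and a grade of 90 with weight 3.
print(weighted_mean([80, 90], [1, 3]))
```

Here the result is (80 + 270) / 4 = 87.5, closer to 90 because that value carries more weight.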
production varies from 48 to 52 assemblies per hour, while production at the Tucson plant is more erratic, ranging from 40 to 60 per hour. Therefore, the hourly output for Baton Rouge is clustered near the mean of 50, while for Tucson it is more dispersed.
- We will consider several measures of dispersion.
- The variance and the standard deviation use all the values in the data set and are based on deviations from the arithmetic mean.

o MEASURES OF DISPERSION
1. Range
• It is the difference between the maximum and minimum values in a data set.
• Sometimes, the range is interpreted as an interval.
• It is widely used in production management and control applications because it is very easy to calculate and understand.

EXERCISE:
Find the range of the number of computer monitors produced per hour for the Baton Rouge and the Tucson plant and interpret the two ranges.

by the difference between the maximum number of 60 and the minimum number of 40.
➢ Therefore, there is less dispersion in the hourly production of the Baton Rouge plant than in the Tucson plant because the range of 4 computer monitors is less than the range of 20.
• A limitation of the range is that it is based on only 2 values. It does not take into consideration all the values.

2. Variance
• It measures the mean amount by which the values in the population or sample vary from their mean.
• It is the arithmetic mean of the squared deviations from the mean.
EXAMPLE:
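For the range exercise on the Baton Rouge and Tucson plants, only the hourly extremes are quoted in the notes (48 to 52 and 40 to 60), and they are enough to compute each range. A minimal Python sketch, with `value_range` as a hypothetical helper name:

```python
def value_range(values):
    """Range = maximum value minus minimum value."""
    return max(values) - min(values)


# Hourly production extremes quoted in the notes (full data not reproduced).
baton_rouge = [48, 52]
tucson = [40, 60]

print(value_range(baton_rouge))
print(value_range(tucson))
```

Because the function looks only at the largest and smallest values, it also shows the limitation of the range: every value in between is ignored.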
➢ If we compare the variance for Orange County and Ontario, we conclude that the sales of the coffee distribution in Ontario are more concentrated, which means they are nearer to the mean of 50.
• The variance shows the closeness or clustering of the data relative to the mean or center of the distribution.
• The variance has an important advantage over the range because it uses all the values in the computation.

➢ Population Variance
- It is the mean of the squared differences between each value and the mean.
- The formula is as follows:
\[ \sigma^2 = \frac{\Sigma (x - \mu)^2}{N} \]
where:
σ² = the population variance. The symbol is the Greek letter sigma, squared, and is read as “sigma squared.”
x = the value of a particular observation in the population.
μ = the arithmetic mean of the population.
N = the number of observations in the population.
- For populations whose values are near the mean, the population variance will be small.
- For populations whose values are dispersed from the mean, the population variance will be large.
- The variance overcomes the weakness of the range by using all the values in the population.

EXERCISE:
The number of traffic citations issued last year is posted. Determine the population variance.

- The variance can be used to compare the dispersion in two or more sets of observations.
- The smaller the variance, the more closely the data are clustered around the mean. The larger the variance, the more widely the data are scattered from the mean.

➢ Population Standard Deviation
- By taking the square root of the variance, we can transform it to the same unit of measurement used for the original data.
- This is the square root of the population variance.
- The formula for this is as follows:
\[ \sigma = \sqrt{\frac{\Sigma (x - \mu)^2}{N}} \]
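The population variance and population standard deviation can be sketched in Python directly from their definitions. The data list is hypothetical and chosen only so the arithmetic is easy to follow; it is not the traffic citation data from the exercise. The function names are illustrative.

```python
from math import sqrt


def population_variance(values):
    """Mean of the squared deviations from the population mean."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def population_std_dev(values):
    """Square root of the population variance."""
    return sqrt(population_variance(values))


# Hypothetical population with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_variance(data))
print(population_std_dev(data))
```

For this list the squared deviations sum to 32, so the variance is 32 / 8 = 4 and the standard deviation is 2, expressed in the same unit as the original data.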
➢ Sample Variance
\[ s^2 = \frac{\Sigma (x - \bar{x})^2}{n - 1} \]
where:
s² = the sample variance.
x = the value of a particular observation in the sample.
x̄ = the sample mean.
n = the number of observations in the sample.

➢ Sample Standard Deviation
- This is simply the square root of the sample variance.
- The formula is as follows:
\[ s = \sqrt{\frac{\Sigma (x - \bar{x})^2}{n - 1}} \]
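For a sample, the squared deviations are divided by n - 1 rather than by the number of observations. The Python sketch below illustrates this with a hypothetical sample; the function names and data are made up for illustration.

```python
from math import sqrt


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_std_dev(values):
    """Square root of the sample variance."""
    return sqrt(sample_variance(values))


# Hypothetical sample of 8 observations with mean 5.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

print(sample_variance(sample))
print(sample_std_dev(sample))
```

The squared deviations sum to 32, so the sample variance is 32 / 7, slightly larger than the value of 4 that dividing by all 8 observations would give.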
shaped frequency distribution, we can be more precise in explaining the dispersion about the mean. These relationships involving the standard deviation and the mean are described by the Empirical Rule, sometimes called the “Normal Rule.”
- The empirical rule states that for a symmetrical, bell-shaped frequency distribution, approximately 68% of the observations lie within plus and minus one standard deviation of the mean; about 95% of the observations lie within plus and minus two standard deviations of the mean; and practically all (99.7%) lie within plus and minus three standard deviations of the mean.
- The relationships are found in the picture above with a mean of 100 and a standard deviation of 10.
- Applying the empirical rule to a symmetrical, bell-shaped distribution, practically all the observations lie between the mean plus and minus three standard deviations. Thus, if the mean is equal to 100 and the standard deviation is equal to 10, practically all the observations lie between 100 - 3(10) = 70 and 100 + 3(10) = 130.

EXERCISE:
The monthly apartment rental rates near a university approximate a symmetrical, bell-shaped distribution. The sample mean is $500; the standard deviation is $20.
Using the Empirical Rule, answer these questions:
1. About 68% of the monthly rentals are between what two amounts?
- Between $480 and $520
2. About 95% of the monthly rentals are between what two amounts?
- Between $460 and $540
3. Almost all of the monthly rentals are between what two amounts?
- Between $440 and $560

o MEASURES OF POSITION
- There are other ways to describe the variation and spread in a set of data.
- One method is to determine the location of values that divide a set of observations into equal parts.
- These measures include:
➢ Quartiles
• These are values of an ordered data set (min. to max.) that divide the data into four intervals.
• The first quartile, usually labeled Q1, is the value below which 25% of the observations occur.
• The third quartile, usually labeled Q3, is the value below which 75% of the observations occur.
➢ Deciles
• These are values of an ordered data set that divide the data into 10 intervals or equal parts.
➢ Percentiles
• These are values of an ordered data set that divide the data into 100 intervals or equal parts.
• Percentile scores are frequently used to report results of national standardized tests such as the NMAT and LSAT.
- To formalize the computational procedure, let L_P be the location of the desired percentile P in a set of n ordered observations:
\[ L_P = (n + 1)\frac{P}{100} \]

EXERCISE:
Morgan Stanley is an investment company with offices located throughout the United States. Listed below are the commissions earned last month by a sample of 15 brokers at the Morgan Stanley office in Oakland, CA. Locate the median, the first quartile, and the third quartile for the commissions earned.
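Returning to the Empirical Rule exercise on monthly rentals (mean $500, standard deviation $20), the three intervals can be reproduced with a short Python sketch. `empirical_interval` is a hypothetical helper name, and the rule is only an approximation for symmetrical, bell-shaped distributions.

```python
def empirical_interval(mean, sd, k):
    """Interval from mean - k standard deviations to mean + k standard deviations."""
    return (mean - k * sd, mean + k * sd)


# Monthly rentals: mean $500, standard deviation $20.
for k, share in [(1, "about 68%"), (2, "about 95%"), (3, "practically all")]:
    low, high = empirical_interval(500, 20, k)
    print(f"{share}: between ${low} and ${high}")

# Mean 100 and standard deviation 10: practically all observations.
print(empirical_interval(100, 10, 3))
```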
➢ The first step is to arrange the data from the smallest commission to the biggest one.
➢ The median value is the observation in the center, which is the same as the 50th percentile. So, P is equal to 50. Substituting the values, L50 = (15 + 1)(50/100), which results in 8. The median is therefore the value in the 8th position of the ordered list, which gives us $2,038 as our median.
➢ To locate the 1st quartile:

We know that 5.25 is between the 5th and 6th position. What we should do is calculate the distance between the values in the 5th and 6th positions; the result is 29. We multiply 29 by 0.25 to get 7.25. We then add 7.25 to the value in the 5th position, which is $1758. Therefore, the value at position 5.25 is $1765.25.

PRACTICE EXERCISES
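The location and interpolation arithmetic used in the Morgan Stanley exercise can be sketched in Python as follows. Both function names are hypothetical helpers; the numbers passed in (15 observations, the values $1758 and $1787 in the 5th and 6th positions, and the fraction 0.25) are the ones quoted in the notes.

```python
def percentile_location(n, p):
    """Location (n + 1) * p / 100 of percentile p among n ordered observations."""
    return (n + 1) * p / 100


def value_at_fraction(lower_value, upper_value, fraction):
    """Move `fraction` of the distance from lower_value toward upper_value."""
    return lower_value + fraction * (upper_value - lower_value)


# Median (50th percentile) of 15 ordered commissions: the 8th position.
print(percentile_location(15, 50))

# Position 5.25: a quarter of the way from the 5th value to the 6th value.
print(value_at_fraction(1758, 1787, 0.25))
```

When the location is a whole number, the percentile is simply the value in that position; when it is fractional, the second helper interpolates between the two neighboring values.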