Chapter 1 BKU2032
Statistics
CHAPTER 1
BKU2032
CONTENT
1.1 Overview
1.2 Statistical Problem-Solving Methodology
1.3 Review on Descriptive Statistics
1.3.1 Measures of Central Tendency
1.3.2 Measures of Variation
1.3.3 Concept of Variance
1.3.3.1 Chebyshev’s Theorem
1.3.3.2 Game of Dart
OBJECTIVES
By the end of this chapter, you should be able to
1.1 OVERVIEW
• Define the meaning of statistics, population, sample,
parameter, statistic, descriptive statistics and
inferential statistics.
What is Statistics?
Most people become familiar with probability and statistics through
radio, television, newspapers, and magazines. For example, the
following statements were found in newspapers:
• Tens of thousands of parents in Malaysia have chosen StemLife as their trusted
stem cell bank.
• The death rate from lung cancer was 10 times higher for smokers than for
nonsmokers.
• The average cost of a wedding is nearly RM10,000.
• In the USA, the average salary for men with a bachelor’s degree is $49,982, while
the average salary for women with a bachelor’s degree is $35,408.
• Globally, an estimated 500,000 children under the age of 15 live with Type 1
diabetes.
• Women who eat fish once a week are 29% less likely to develop heart disease.
Statistics
The science of conducting studies to collect, organize, summarize, and analyze data.
Tangible population
Always finite: the total number of members is finite and consists of actual physical objects.
After an item is sampled, the population size decreases by 1.
Conceptual population
A population that consists of all the values that might possibly have been observed;
it does not consist of actual objects.
Sample
A subset of a population, containing the objects or outcomes that are actually observed.
Statistic
A number that describes a sample characteristic.
EXERCISE 1.1
1. The freshman class at Engineering College has 317 students and
an IQ pre-test is given to all of them in their first week. The dean
of admission collected data on 27 of them and found their mean
score on the IQ pre-test was 51. The mean for the entire
freshman class was therefore estimated to be approximately 51 on
this test. A subsequent computer analysis of all freshmen showed
the true mean to be 52.
Based on the above problem,
Descriptive & Inferential Statistics
Descriptive statistics
consists of the collection, organization, classification, summarization, and
presentation of data obtained from the sample.
Used to describe the characteristics of the sample.
Inferential statistics
consists of generalizing from samples to populations, performing estimations and
hypothesis testing, determining relationships among variables, and making
predictions.
Used to describe, infer,
3. Suppose UMP wants to estimate the average time a student takes to find a
proper parking spot. During semester 1 2009/2010, an administrator randomly
asked 200 students and recorded their parking times and found that it takes on
average 10 minutes to find a parking spot.
Flowchart (descriptive statistics and statistical inference):
Gathering of data
-> Classification, summarization, and processing of data
-> Presentation and communication of summarized information
-> Is the information from a sample?
   Yes: Use sample information to make inferences about the population, then
        draw conclusions about the population characteristic (parameter) under study.
   No:  Use census data to analyze the population characteristic under study.
-> STOP
Need for Statistics
You need a knowledge of statistics to help you:
1.2 STATISTICAL PROBLEM-SOLVING METHODOLOGY
• Outline the 6 basic steps in the statistical problem
solving methodology.
STATISTICAL PROBLEM-SOLVING METHODOLOGY
6 Basic Steps
1. Identifying the problem or opportunity
2. Deciding on the method of data collection
3. Collecting the data
4. Classifying and summarizing the data
5. Presenting and analyzing the data
6. Making the decision
STEP 1
Identifying the problem or opportunity
Characteristics of Sample Size
The larger the sample, the smaller the magnitude of sampling errors.
Survey studies need large samples because returning a survey is voluntary.
Large samples are easy to divide into subgroups.
In mail surveys the response rate may be as low as 20%-30%, so a larger
sample is required.
Subject availability and cost factors are legitimate
considerations in determining appropriate sample size.
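The effect of a low mail response rate on the number of questionnaires to send can be sketched with simple arithmetic. The target of 100 completed questionnaires and the helper name `mailings_needed` are assumptions made for illustration only; the 20% and 30% rates are the ones quoted above.

```python
def mailings_needed(target_responses: int, response_rate_percent: int) -> int:
    """Smallest number of questionnaires to mail so that the expected
    number of returns reaches the target at the given response rate."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-target_responses * 100 // response_rate_percent)


# Hypothetical target: 100 completed questionnaires.
at_20_percent = mailings_needed(100, 20)
at_30_percent = mailings_needed(100, 30)
print(at_20_percent, at_30_percent)
```

Under these assumptions, the lower the response rate, the more questionnaires must be mailed to end up with the same number of usable responses.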
STEP 2
Deciding on the Method of Data Collection
The data gathered must be accurate, as complete as possible,
and relevant to the problem.
B. Probability data samples
A sample in which the chance of selection of each item in
the population is known before the sample is picked.
4 basic methods: random, systematic, stratified, and
cluster.
A) Nonprobability Data Samples
1. Judgment samples
Based on the opinion of one or more experts
Ex: A political campaign manager intuitively picks certain voting
districts as reliable places to measure public opinion of his
candidate
2. Voluntary samples
Questions are posed to the public by broadcasting them over radio or
TV (phone or SMS)
3. Convenience samples
Take an ‘easy sample’ (most conveniently available)
Ex: A surveyor stands in one location & asks passersby their
questions
B) Probability Data Samples
1. Random samples
Selected using chance methods or random methods
Example:
A lecturer wants to study the physical fitness levels
of students at her university. There are 5,000
students enrolled at the university, and she wants to
draw a sample of size 100 to take a physical fitness
test. She obtains a list of all 5,000 students,
numbers them from 1 to 5,000, then randomly selects
100 of those numbers and invites the corresponding
students to participate in the study.
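The random-sample example can be sketched in a few lines. This is a minimal illustration, assuming students are identified only by their list number; the seed is arbitrary and is there only to make the run repeatable.

```python
import random

POPULATION_SIZE = 5000
SAMPLE_SIZE = 100

rng = random.Random(2032)  # arbitrary seed, used only for repeatability

# Students are numbered 1..5000; draw 100 distinct numbers without replacement.
student_numbers = range(1, POPULATION_SIZE + 1)
invited = sorted(rng.sample(student_numbers, SAMPLE_SIZE))
print(invited[:10])
```

Sampling without replacement guarantees that no student is invited twice, and every student has the same chance of being selected.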
2. Systematic samples
Each subject of the population is numbered and every kth subject is
selected.
- The first subject is selected at random from 1 through k; every kth
subject after it is then selected.
Example:
A lecturer wants to study the physical fitness levels of
students at her university. There are 5,000 students
enrolled at the university, and she wants to draw a sample
of size 100 to take a physical fitness test. She obtains a list
of all 5,000 students, numbers them from 1 to 5,000 and
randomly picks one of the first 50 students (5000/100 = 50)
on the list. If the number picked is 30, then the 30th student on
the list is invited first. She then invites every 50th name
on the list after this random start (the 80th student, the
130th student, etc.) to produce a sample of 100 students to
participate in the study.
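The systematic-sample example can be sketched as follows. The function name `systematic_sample` is a hypothetical helper, not something from the slides; the population size, sample size and random start of 30 are taken from the example.

```python
import random

POPULATION_SIZE = 5000
SAMPLE_SIZE = 100
k = POPULATION_SIZE // SAMPLE_SIZE  # sampling interval: 5000 / 100 = 50


def systematic_sample(start: int, interval: int, population_size: int) -> list:
    """Return list positions start, start + interval, start + 2*interval, ..."""
    return list(range(start, population_size + 1, interval))


# The example in the text: the random start happens to be 30.
example = systematic_sample(30, k, POPULATION_SIZE)
print(example[:3], len(example))

# In practice the start is itself drawn at random from 1..k.
random_start = random.randint(1, k)
drawn = systematic_sample(random_start, k, POPULATION_SIZE)
```

Whatever start between 1 and 50 is drawn, stepping by 50 through 5,000 numbered students yields exactly 100 positions.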
3. Stratified samples
Dividing the population into groups according to some
characteristic that is important to the study, then sampling
from each group
- Subjects are selected by dividing up the population into groups (strata),
and subjects within groups are randomly selected.
Example:
A lecturer wants to study the physical fitness levels of students at
her university. There are 5,000 students enrolled at the
university, and she wants to draw a sample of size 100 to take a
physical fitness test. Assume that, because of different lifestyles,
the level of physical fitness is different between male and female
students. To account for this variation in lifestyle, the population
of students can easily be stratified into male and female students.
Then she can use either random or systematic methods
to select the participants. For example, she can use a random
sample to choose 50 male students and a systematic method to
choose another 50 female students, or vice versa.
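A stratified draw along the lines of this example might look like the sketch below. The even 2,500 / 2,500 split between male and female students and the `M…`/`F…` identifiers are assumptions for illustration, since the slides do not give the sizes of the strata. For simplicity, a simple random sample is used inside both strata.

```python
import random

rng = random.Random(1)  # arbitrary seed, used only for repeatability

# Hypothetical strata: an even split of the 5,000 students is assumed.
strata = {
    "male": [f"M{i}" for i in range(1, 2501)],
    "female": [f"F{i}" for i in range(1, 2501)],
}
PER_STRATUM = 50

# Draw a simple random sample inside each stratum, then combine the draws.
sample = {name: rng.sample(members, PER_STRATUM) for name, members in strata.items()}
participants = sample["male"] + sample["female"]
print(len(participants))
```

Because each stratum is sampled separately, both groups are guaranteed to be represented in the final sample of 100.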
4. Cluster samples
Dividing the population into sections/clusters, then
randomly selecting some of those clusters and choosing
all members from the selected clusters
- Subjects are selected by using an intact group that is representative
of the population.
Cluster sampling can reduce cost and time.
Example:
A lecturer wants to study the physical fitness levels of students at
her university. There are 5,000 students enrolled at the university,
and she wants to draw a sample to take a physical fitness test.
Assume that, because of different lifestyles, the level of physical
fitness is different between freshmen, juniors and seniors.
To account for this variation in lifestyle, the population of
students can easily be clustered into freshmen, juniors and seniors.
Then she can choose any one cluster, such as freshmen,
and take all the freshmen as the participants.
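The cluster-sample example can be sketched as below. The cluster sizes (2,000 / 1,800 / 1,200) and the student identifiers are hypothetical; the slides give only the total of 5,000 students.

```python
import random

rng = random.Random(7)  # arbitrary seed, used only for repeatability

# Hypothetical cluster sizes that add up to the 5,000 students in the example.
clusters = {
    "freshmen": [f"FR{i}" for i in range(1, 2001)],
    "juniors": [f"JR{i}" for i in range(1, 1801)],
    "seniors": [f"SR{i}" for i in range(1, 1201)],
}

# Step 1: pick one whole cluster at random.
chosen = rng.choice(sorted(clusters))
# Step 2: every member of the chosen cluster takes part.
participants = list(clusters[chosen])
print(chosen, len(participants))
```

Unlike stratified sampling, nothing is drawn inside the chosen cluster: the whole group is taken, which is what keeps the cost and time low.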
STEP 4
Classifying and Summarizing the Data
Summarization
Graphical & descriptive statistics (tables, charts, measures of
central tendency, measures of variation, measures of position)
Variables & Data Classification
Data are the values that variables can assume.
A variable is a characteristic or attribute that can assume different
values.
Variables whose values are determined by chance are called
random variables.
Variables can be classified
Examples
EXERCISE 1.2
5. The chart shows the number of job-related injuries for each of the
transportation industries for 1998.
STEP 5
Presenting and Analyzing the data
Summarized and analyzed information is given by:
graphical statistics (graphs and charts)
hypothesis testing
ANOVA
regression analysis
Types of Graph & Chart
Distribution Shapes for Histogram
Bell Shaped
Has a single peak & tapers off at either end.
Approximately symmetric: it is roughly the same on both sides of a line
running through the center.
Uniform
Basically flat/rectangular.
J-Shaped
Has a few data values on the left side & increases as one moves to the right.
Reverse J-Shaped
Opposite of J-shaped: has a few data values on the right side & increases as
one moves to the left.
Right Skewed
The peak is to the left; the data values taper off to the right.
Left Skewed
The peak is to the right; the data values taper off to the left.
Bimodal
Has 2 peaks of the same height.
U-Shaped
The shape is U.
STEP 6
Making the decision
Flowchart (problem-solving methodology):
START
-> Present and communicate summarized information in the form of tables,
   charts and descriptive measures
-> Is the information from a sample?
   Yes: Use sample information to
        1. Estimate the value of a parameter
        2. Test assumptions about a parameter
        then interpret the results, draw conclusions, and make decisions.
   No:  Use census information to evaluate alternative courses of
        action and make decisions.
-> STOP
Role of the Computer in Statistics
2. Statistical Packages
MINITAB, SAS, SPSS and SPlus
Data Analysis Application in EXCEL
• Graphs and charts
• Formulas
• Add-in: Analysis ToolPak (Data Analysis)
1.3 REVIEW ON DESCRIPTIVE STATISTICS
• Summarize data using measures of central tendency, such as
the mean, median, mode.
Measures of Central Tendency and Measures of Variation
Mean
Population mean: $\mu = \dfrac{\sum_{i=1}^{N} x_i}{N}$, N population size
Sample mean: $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$, n sample size
Example: 9 2 1 4 3 3 7 5 8 6, $\bar{x} = 4.8$
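As a quick check of the mean example, the sample mean can be computed directly from its definition: the sum of the values divided by their count. The variable names are illustrative.

```python
data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]

# Sample mean: sum of the observations divided by the number of observations.
total = sum(data)        # 48
n = len(data)            # 10
mean = total / n
print(mean)              # 4.8
```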
Properties of Mean
The mean is computed using all the values of the data.
The mean varies less than the median or mode when samples are taken
from the same population and all three measures are computed for
these samples.
The mean is used in computing other statistics, such as variance.
The mean for the data set is unique, and not necessarily one of the data
values.
The mean cannot be computed for an open-ended frequency
distribution.
The mean is affected by extremely high or low values and may not be
the appropriate average to use in these situations
Note: Assume the data are obtained from samples unless otherwise specified.
1.3.1 Measures of Central Tendency
Median
the middle number of n ordered data (smallest to largest)
If n is odd: $\mathrm{MD} = x_{\frac{n+1}{2}}$
If n is even: $\mathrm{MD} = \dfrac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}$
Example (n odd): 9 2 1 3 3 7 5 8 6
Step 1: 1 2 3 3 5 6 7 8 9
Step 2: MD = 5
Example (n even): 9 2 1 4 3 3 7 5 8 6
Step 1: 1 2 3 3 4 5 6 7 8 9
Step 2: MD = (4 + 5)/2 = 4.5
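The odd/even rule for the median can be sketched as a small function. This is a minimal illustration with a hypothetical function name; Python lists are 0-based, so the 1-based positions in the rule are shifted by one.

```python
def median(values):
    """Middle value of the ordered data (odd n), or the mean of the two
    middle values (even n)."""
    ordered = sorted(values)          # Step 1: order smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # the ((n + 1)/2)-th value, 1-based
    # Average of the (n/2)-th and (n/2 + 1)-th values, 1-based.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_md = median([9, 2, 1, 3, 3, 7, 5, 8, 6])
even_md = median([9, 2, 1, 4, 3, 3, 7, 5, 8, 6])
print(odd_md, even_md)  # 5 4.5
```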
Properties of Median
The median is used when one must find the center or middle value
of a data set.
The median is used when one must determine whether the data
values fall into the upper half or lower half of the distribution.
Mode
the most commonly occurring value in a data series
The mode can be used when the data are nominal, such as
religious preference, gender, or political affiliation.
The mode is not always unique. A data set can have more than
one mode, or the mode may not exist for a data set.
Example: 9 2 1 4 3 3 7 5 8 6 Mode = 3
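The remarks that the mode need not be unique and may not exist can be illustrated with a short sketch. The helper `modes` is hypothetical, and it follows the convention used here that a data set in which no value repeats has no mode; other references and libraries may use a different convention.

```python
from collections import Counter


def modes(values):
    """Return the most frequently occurring value(s), or an empty list
    when no value occurs more than once (no mode)."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []                      # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == top)


single = modes([9, 2, 1, 4, 3, 3, 7, 5, 8, 6])   # one mode
double = modes([1, 1, 2, 2, 3])                  # two modes
none = modes([1, 2, 3])                          # no mode
print(single, double, none)  # [3] [1, 2] []
```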
Types of Distribution
Symmetric
1.3.2 Measures of Variation / Dispersion
Range
Example: 9 2 1 4 3 3 7 5 8 6
R=9-1=8
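Variance
is the average of the squares of the distance each value is from the mean.
Population variance: $\sigma^2 = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$, N population size
Sample variance: $s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$, n sample size
Example: 9 2 1 4 3 3 7 5 8 6
$\sigma^2 = 6.4$, $s^2 = 7.1$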
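Standard Deviation
is the square root of the variance
Population standard deviation: $\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$, N population size
Sample standard deviation: $s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$, n sample size
Example: 9 2 1 4 3 3 7 5 8 6
$\sigma = 2.5$, $s = 2.7$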
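The range, variance and standard deviation examples can be checked with Python's standard `statistics` module. This is only a sketch: `pvariance`/`pstdev` divide by N (population) and `variance`/`stdev` divide by n - 1 (sample), and the results are rounded to one decimal place to match the figures quoted in the examples.

```python
import statistics

data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]

data_range = max(data) - min(data)        # highest value minus lowest value

pop_var = statistics.pvariance(data)      # divides by N
samp_var = statistics.variance(data)      # divides by n - 1
pop_sd = statistics.pstdev(data)          # square root of pop_var
samp_sd = statistics.stdev(data)          # square root of samp_var

print(data_range)
print(round(pop_var, 1), round(samp_var, 1))
print(round(pop_sd, 1), round(samp_sd, 1))
```

The sample figures are slightly larger than the population figures because the same sum of squared deviations is divided by n - 1 instead of N.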
Properties of Variance & Standard Deviation
Variances and standard deviations can be used to determine the
spread of the data. If the variance or standard deviation is large, the
data are more dispersed. The information is useful in comparing two
or more data sets to determine which is more variable.
The measures of variance and standard deviation are used to
determine the consistency of a variable.
The variance and standard deviation are used to determine the
number of data values that fall within a specified interval in a
distribution.
The variance and standard deviation are used quite often in
inferential statistics.
The standard deviation is used to estimate the amount of spread in the
population from which the sample was drawn.
EXERCISE 1.3.2
7. A testing lab wishes to test two experimental brands of outdoor paint
to see how long each will last before fading. The testing lab makes 6
gallons of each paint to test. Since different chemical agents are
added to each group and only 6 cans are involved, these two groups
constitute two small populations. The results (in months) are shown.
A 10 20 30 40 50 60
B 25 30 35 40 45 35
1.3.3 Concept of Variance:
1.3.3.1 Chebyshev’s Theorem
Chebyshev’s Theorem states that the proportion of values from any data set that
lie within k standard deviations of the mean is at least $1 - 1/k^2$, where k is any
number greater than 1.
In other words, if a data set has mean μ and standard deviation σ, then at least
a proportion $1 - 1/k^2$ of the data lie within (μ – kσ, μ + kσ).
Chebyshev’s theorem applies to any distribution regardless of its shape. When a
distribution is bell-shaped, however, other rules apply to the distribution.
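The bound in Chebyshev’s Theorem can be evaluated with a short sketch. The function names are hypothetical, and the mean of 50 and standard deviation of 5 in the interval example are made-up values for illustration.

```python
def chebyshev_lower_bound(k: float) -> float:
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's Theorem requires k > 1")
    return 1 - 1 / k**2


def chebyshev_interval(mean: float, sd: float, k: float) -> tuple:
    """The interval (mean - k*sd, mean + k*sd) that the bound refers to."""
    return (mean - k * sd, mean + k * sd)


bound_k2 = chebyshev_lower_bound(2)   # at least 75% of the data
bound_k3 = chebyshev_lower_bound(3)   # at least about 88.9% of the data

# Hypothetical data set with mean 50 and standard deviation 5, k = 2.
interval = chebyshev_interval(50, 5, 2)
print(bound_k2, round(bound_k3, 3), interval)
```

To apply the theorem to a given interval, first express the distance from the mean to each endpoint as a multiple k of the standard deviation, then evaluate the bound at that k.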
Exercise 1.3.3.1: A machine produces bullet shells with a mean length of 2 cm and a
standard deviation of 0.01 cm. Using Chebyshev’s Theorem, what percentage of
bullet shells have a length between 1.98 cm and 2.02 cm?
1.3.3.2 Games of Dart