Lecture 1-Statistics Introduction-Defining, Displaying and Summarizing Data

Data

The measurements obtained in a research study are called the data.

The goal of statistics is to help researchers organize and interpret the data.

What is statistics?

◼ A branch of mathematics taking and transforming numbers into useful information for decision makers

◼ Methods for processing & analyzing numbers

◼ Methods for helping reduce the uncertainty inherent in decision making
Why Study Statistics?
Decision Makers Use Statistics To:

▪ Present and describe business data and information properly

▪ Draw conclusions about large groups of individuals or items, using information collected from subsets of the individuals or items

▪ Make reliable forecasts about a business activity

▪ Improve business processes


Key Definitions
◼ A population (universe) is the collection of all
members of a group
◼ A sample is a portion of the population selected for
analysis
◼ A parameter is a numerical measure that describes a
characteristic of a population
◼ A statistic is a numerical measure that describes a
characteristic of a sample
A Sample Is

◼ Less time consuming than selecting every item in the population.

◼ Less costly than selecting every item in the population.

◼ Less cumbersome and more practical than analyzing the entire population.
Population vs. Sample

[Figure: a population of items labeled a through z, and a sample of selected items (b, c, g, i, n, o, r, u, y) drawn from it.]

Measures used to describe a population are called parameters.

Measures computed from sample data are called statistics.
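The distinction between a parameter and a statistic can be sketched in a few lines of Python. The population here is hypothetical (1,000 made-up values generated with a fixed seed) and the sample size of 30 is an arbitrary choice; neither comes from the lecture.

```python
import random
import statistics

# Hypothetical population: 1,000 made-up measurements (not from the lecture).
rng = random.Random(42)  # fixed seed so the sketch is repeatable
population = [rng.gauss(70, 10) for _ in range(1000)]

# Parameter: a numerical measure computed from the whole population.
population_mean = statistics.fmean(population)

# Statistic: the same measure computed from a portion of the population.
sample = rng.sample(population, 30)
sample_mean = statistics.fmean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The two numbers will usually be close but not equal, which is why a statistic is treated as an estimate of the corresponding parameter.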
Types of Statistics

◼ Statistics
◼ The branch of mathematics that transforms data into
useful information for decision makers.

Descriptive Statistics: collecting, summarizing, and describing data.

Inferential Statistics: drawing conclusions and/or making decisions concerning a population based only on sample data.
Descriptive Statistics

◼ Collect data
◼ e.g., Survey

◼ Present data
◼ e.g., Tables and graphs

◼ Characterize data
◼ e.g., Sample mean = $\bar{X} = \frac{\sum X_i}{n}$
Inferential Statistics
◼ Estimation
◼ e.g., estimate the population mean weight using the sample mean weight
◼ Hypothesis testing
◼ e.g., test the claim that the population mean weight is 70 kg

Drawing conclusions and/or making decisions concerning a population based on sample results.
Basic Vocabulary of Statistics

VARIABLE
A variable is a characteristic of an item or individual that can change or take on
different values. Most research begins with a general question about the relationship
between two variables.

DATA
Data are the different values associated with a variable.

OPERATIONAL DEFINITIONS
Data values are meaningless unless their variables have operational definitions,
universally accepted meanings that are clear to all associated with an analysis.
Collecting Data
Primary (Data Collection):
▪ Observation
▪ Survey
▪ Experimentation

Secondary (Data Compilation):
▪ Print or Electronic
Collecting Data Correctly Is A Critical Task

▪ Need to avoid data flawed by biases, ambiguities, or other types of errors.

▪ Results from flawed data will be suspect or in error.

▪ Even the most sophisticated statistical methods are not very useful when the data are flawed.
Types of Variables

▪ Categorical (qualitative) variables have values that can only be placed into categories, such as “yes” and “no.”

▪ Numerical (quantitative) variables have values that represent quantities.
Types of Variables

Variables

◼ Categorical (defined categories). Examples: Marital Status, Political Party, Eye Color
◼ Numerical
◼ Discrete (counted items). Examples: Number of Children, Defects per hour
◼ Continuous (measured characteristics). Examples: Weight, Voltage
Examples of Types of Variables

Question                                                      Responses        Variable Type

Do you have a Facebook profile?                               Yes or No        Categorical (qualitative)
How many text messages have you sent in the past three days?  ---------------  Numerical (discrete)
How long did the mobile app update take to download?          ---------------  Numerical (continuous)
Measuring Variables
• To establish relationships between variables,
researchers must observe the variables and
record their observations. This requires that the
variables be measured.

Levels of measurement
• There are four levels of measurement: Nominal, Ordinal,
Interval and Ratio. These go from lowest level to highest
level.
• Data is classified according to the highest level which it
fits. Each additional level adds something the previous
level did not have.
– Nominal is the lowest level. Only names are
meaningful here;
– Ordinal adds an order to the names;
– Interval adds meaningful differences;
– Ratio adds a zero so that ratios are meaningful.
Data Types
Levels of Measurement

▪ A nominal scale classifies data into distinct categories in which no ranking is implied.

Categorical Variables          Categories

Personal Computer Ownership    Yes / No
Type of Stocks Owned           Growth / Value / Other
Internet Provider              Microsoft Network / AOL / Other
Levels of Measurement

▪ An ordinal scale classifies data into distinct categories in which ranking is implied.

Categorical Variable             Ordered Categories

Student class designation        Level 1, 2, 3, 4
Product satisfaction             Satisfied, Neutral, Unsatisfied
Faculty rank                     Professor, Associate Professor, Assistant Professor, Instructor
Standard & Poor’s bond ratings   AAA, AA, A, BBB, BB, B, CCC, CC, C, DDD, DD, D
Student Grades                   A, B, C, D, F
Levels of Measurement (cont.)

▪ An interval scale is an ordered scale in which the difference between measurements is a meaningful quantity but the measurements do not have a true zero point.

▪ A ratio scale is an ordered scale in which the difference between the measurements is a meaningful quantity and the measurements have a true zero point.
Levels of Measurement
and Measurement Scales

From the highest level (strongest form of measurement) to the lowest level (weakest form of measurement):

▪ Ratio Data: differences between measurements, true zero exists
▪ Interval Data: differences between measurements but no true zero
▪ Ordinal Data: ordered categories (rankings, order, or scaling)
▪ Nominal Data: categories (no ordering or direction)
Levels of Measurement
and Measurement Scales

Data Type       Description                                          Examples

Ratio Data      Differences between measurements, true zero exists   Height, Age, Weekly Food Spending
Interval Data   Differences between measurements but no true zero    Temperature in Centigrade
Ordinal Data    Ordered categories (rankings, order, or scaling)     Service quality rating, Student letter grades
Nominal Data    Categories (no ordering or direction)                Marital status, Type of car owned
Descriptive Statistics
Descriptive statistics utilizes numerical and graphical
methods to look for patterns in a data set, to summarize
the information revealed in a data set, and to represent
that information in a convenient form.

• Graphical Representation of Data
• Measures of Central Tendency
• Measures of Dispersion
Organizing Categorical Data:
Summary Table

▪ A summary table indicates the frequency, amount, or percentage of items in a set of categories so that you can see differences between categories.

Banking Preference?               Percent

ATM                               16%
Automated or live telephone       2%
Drive-through service at branch   17%
In person at branch               41%
Internet                          24%
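A summary table is built by tallying raw responses per category. The sketch below assumes a hypothetical list of 100 responses constructed to match the percentages in the banking-preference table, since the lecture gives only the percentages and not the underlying responses.

```python
from collections import Counter

# Hypothetical raw responses: 100 made-up answers chosen to match the
# percentages in the summary table (the raw survey data are not given).
responses = (
    ["ATM"] * 16
    + ["Automated or live telephone"] * 2
    + ["Drive-through service at branch"] * 17
    + ["In person at branch"] * 41
    + ["Internet"] * 24
)

counts = Counter(responses)    # frequency of each category
total = sum(counts.values())   # number of responses

# Percentage of responses falling into each category.
summary = {category: 100 * count / total for category, count in counts.items()}

for category, percent in summary.items():
    print(f"{category:<32} {percent:.0f}%")
```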
Visualizing Categorical Data:
The Bar Chart
▪ In a bar chart, each category is shown as a bar whose length represents the amount, frequency, or percentage of values falling into that category. The values come from the summary table of the variable.

[Figure: horizontal bar chart "Banking Preference" with one bar per category (ATM 16%, Automated or live telephone 2%, Drive-through service at branch 17%, In person at branch 41%, Internet 24%); horizontal axis runs from 0% to 45%.]
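As a rough illustration of how bar length encodes the percentage, here is a minimal text-only sketch. The function name `text_bar_chart` and the one-character-per-percentage-point scale are assumptions made for this example; the percentages are those of the banking-preference table.

```python
# Percentages copied from the banking-preference summary table.
banking_preference = {
    "ATM": 16,
    "Automated or live telephone": 2,
    "Drive-through service at branch": 17,
    "In person at branch": 41,
    "Internet": 24,
}

def text_bar_chart(percentages, scale=1):
    """Return one line per category; bar length is proportional to the percentage."""
    width = max(len(name) for name in percentages)
    lines = []
    for name, percent in percentages.items():
        bar = "#" * round(percent * scale)
        lines.append(f"{name:<{width}} | {bar} {percent}%")
    return lines

chart = text_bar_chart(banking_preference)
print("\n".join(chart))
```

A plotting library would normally be used for a real chart; plain text keeps the sketch free of dependencies.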


Visualizing Categorical Data:
The Pie Chart
▪ The pie chart is a circle broken up into slices that represent categories. The size of each slice of the pie varies according to the percentage in each category.

[Figure: pie chart "Banking Preference" with slices for ATM (16%), Automated or live telephone (2%), Drive-through service at branch (17%), In person at branch (41%), and Internet (24%).]
Cross Tabulations

◼ Used to study patterns that may exist between two or more categorical variables.

◼ Cross tabulations can be presented in:
◼ Tabular form -- Contingency Tables
◼ Graphical form -- Side by Side Charts
Cross Tabulations:
The Contingency Table
▪ A cross-classification (or contingency) table presents the
results of two categorical variables. The joint responses are
classified so that the categories of one variable are located in
the rows and the categories of the other variable are located
in the columns.

▪ The cell is the intersection of the row and column and the
value in the cell represents the data corresponding to that
specific pairing of row and column categories.

▪ A useful way to visually display the results of cross-classification data is by constructing a side-by-side bar chart.
Cross Tabulations:
The Contingency Table

A survey was conducted to study the importance of brand name to consumers as compared to a few years ago. The results, classified by gender, were as follows:

Importance of Brand Name   Male   Female   Total

More                       450    300      750
Equal or Less              3300   3450     6750
Total                      3750   3750     7500
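One common way to read a contingency table is to convert the counts in each column to percentages of that column's total. The sketch below does this for the brand-name survey; the variable names are illustrative and the choice of column percentages (rather than row or overall percentages) is an assumption about what the reader wants to compare.

```python
# Counts copied from the contingency table.
table = {
    "More":          {"Male": 450,  "Female": 300},
    "Equal or Less": {"Male": 3300, "Female": 3450},
}

# Column totals (one per gender) and the grand total.
column_totals = {
    gender: sum(row[gender] for row in table.values())
    for gender in ("Male", "Female")
}
grand_total = sum(column_totals.values())

# Percentage of each gender giving each response (column percentages).
column_percent = {
    response: {g: 100 * n / column_totals[g] for g, n in row.items()}
    for response, row in table.items()
}

for response, row in column_percent.items():
    print(response, {g: f"{p:.0f}%" for g, p in row.items()})
```

Expressed this way, the shares of males and females answering "More" can be compared directly even when the group sizes differ.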


Visualizing Numerical Data:
The Histogram
Class                 Frequency   Relative Frequency   Percentage

10 but less than 20   3           .15                  15
20 but less than 30   6           .30                  30
30 but less than 40   5           .25                  25
40 but less than 50   4           .20                  20
50 but less than 60   2           .10                  10
Total                 20          1.00                 100

[Figure: "Histogram: Age Of Students"; vertical axis Frequency (0 to 8), horizontal axis labeled 5, 15, 25, 35, 45, 55, More.]

(In a percentage histogram the vertical axis would be defined to show the percentage of observations per class.)
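The relative frequency and percentage columns follow directly from the class frequencies. A minimal sketch, using the frequencies from the table; the helper `class_of` is a hypothetical function added to show how a single value would be assigned to a "lower but less than upper" class.

```python
# Class frequencies copied from the table; each class is (lower, upper)
# and contains values with lower <= value < upper.
frequencies = {
    (10, 20): 3,
    (20, 30): 6,
    (30, 40): 5,
    (40, 50): 4,
    (50, 60): 2,
}

total = sum(frequencies.values())
relative_frequency = {cls: count / total for cls, count in frequencies.items()}
percentage = {cls: 100 * count / total for cls, count in frequencies.items()}

def class_of(value, classes):
    """Return the (lower, upper) class containing value, or None if none does."""
    for lower, upper in classes:
        if lower <= value < upper:
            return (lower, upper)
    return None

for (lower, upper), count in frequencies.items():
    print(f"{lower} but less than {upper}: {count}  "
          f"{relative_frequency[(lower, upper)]:.2f}  {percentage[(lower, upper)]:.0f}")
```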
Histograms
A histogram shows three general types of information:
▪ It provides a visual indication of where the approximate center of the data is.
▪ We can gain an understanding of the degree of spread, or variation, in the data.
▪ We can observe the shape of the distribution.
Organizing Numerical Data:
Stem and Leaf Display
▪ A stem-and-leaf display organizes data into groups (called stems) so that the values within each group (the leaves) branch out to the right on each row.

Age of Surveyed College Students

Day Students:     16 17 17 18 18 18 19 19 20 20 21 22 22 25 27 32 38 42
Night Students:   18 18 19 19 20 21 23 28 32 33 41 45

Age of College Students

Day Students        Night Students
Stem  Leaf          Stem  Leaf
1     67788899      1     8899
2     0012257       2     0138
3     28            3     23
4     2             4     15
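The construction of a stem-and-leaf display can be sketched as follows, using the day and night student ages from this slide. The function name `stem_and_leaf` is illustrative, and the sketch assumes two-digit values so that the tens digit is the stem and the units digit is the leaf.

```python
from collections import defaultdict

# Ages copied from the slide.
day_students = [16, 17, 17, 18, 18, 18, 19, 19, 20, 20, 21, 22,
                22, 25, 27, 32, 38, 42]
night_students = [18, 18, 19, 19, 20, 21, 23, 28, 32, 33, 41, 45]

def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaves (units digits) as a string."""
    display = defaultdict(str)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        display[stem] += str(leaf)
    return dict(display)

day_display = stem_and_leaf(day_students)
night_display = stem_and_leaf(night_students)

for stem, leaves in day_display.items():
    print(f"{stem} | {leaves}")
```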
Visualizing Two Numerical Variables:
Scatter Plot

Volume per day   Cost per day

23               125
26               140
29               146
33               160
38               167
42               170
50               188
55               195
60               200

[Figure: scatter plot "Cost per Day vs. Production Volume"; horizontal axis Volume per Day, vertical axis Cost per Day.]
Visualizing Two Numerical Variables:
Time Series Plot

Year   Number of Franchises

1996   43
1997   54
1998   60
1999   73
2000   82
2001   95
2002   107
2003   99
2004   95

[Figure: time series plot "Number of Franchises, 1996-2004"; horizontal axis Year, vertical axis Number of Franchises.]
Measures of Central Tendency
Central tendency is the extent to which all the data values group around a typical or central value.

The Mean

Sample mean $\bar{x}$:

$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$

The Median

The Mode
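The three measures can be computed with Python's standard `statistics` module. As a small sketch, the data below are the day-student ages from the stem-and-leaf slide in this lecture; using them here is a choice made for illustration.

```python
import statistics

# Ages of the day students from the stem-and-leaf slide.
ages = [16, 17, 17, 18, 18, 18, 19, 19, 20, 20, 21, 22,
        22, 25, 27, 32, 38, 42]

mean_age = sum(ages) / len(ages)       # sum of the values divided by n
median_age = statistics.median(ages)   # middle value of the sorted data
mode_age = statistics.mode(ages)       # most frequently occurring value

print(f"mean = {mean_age:.2f}, median = {median_age}, mode = {mode_age}")
```

For an even number of observations, `statistics.median` averages the two middle values.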
Measures of Dispersion
The variation is the amount of dispersion, or scattering, of values

Variance

The Standard Deviation

The Range

Interquartile Range
Measures of Variation:
The Variance
◼ Average (approximately) of squared deviations of values from the mean
◼ Sample variance:

$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$

Where $\bar{X}$ = arithmetic mean
n = sample size
$X_i$ = ith value of the variable X
Measures of Variation:
The Standard Deviation

◼ Most commonly used measure of variation
◼ Shows variation about the mean
◼ Is the square root of the variance
◼ Has the same units as the original data
◼ Sample standard deviation:

$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
Measures of Variation:
Comparing Standard Deviations

[Figure: two distributions compared, one with a smaller standard deviation and one with a larger standard deviation.]


Measures of Variation:
Sample Standard Deviation:
Calculation Example

Sample Data ($X_i$): 10 12 14 15 17 18 18 24

n = 8, Mean = $\bar{X}$ = 16

$S = \sqrt{\frac{(10 - \bar{X})^2 + (12 - \bar{X})^2 + (14 - \bar{X})^2 + \dots + (24 - \bar{X})^2}{n - 1}}$

$= \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \dots + (24 - 16)^2}{8 - 1}}$

$= \sqrt{\frac{130}{7}} = 4.3095$

A measure of the “average” scatter around the mean.
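The same calculation can be followed step by step in Python. This is a minimal sketch that mirrors the formula for the sample standard deviation and then compares the result with `statistics.stdev`, which also divides by n - 1.

```python
import math
import statistics

# Sample data from the calculation example.
data = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(data)
mean = sum(data) / n

# Squared deviation of each value from the mean.
squared_deviations = [(x - mean) ** 2 for x in data]
sum_of_squares = sum(squared_deviations)

variance = sum_of_squares / (n - 1)   # sample variance, divisor n - 1
std_dev = math.sqrt(variance)         # sample standard deviation

print(f"n = {n}, mean = {mean}, sum of squares = {sum_of_squares}")
print(f"S^2 = {variance:.4f}, S = {std_dev:.4f}")

# Cross-check against the standard library.
assert math.isclose(std_dev, statistics.stdev(data))
```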
Short-cut formula
Shape of a Distribution

◼ Describes how data are distributed
◼ Measures of shape
◼ Symmetric or skewed

Left-Skewed: Mean < Median
Symmetric: Mean = Median
Right-Skewed: Median < Mean
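The mean-versus-median comparison can be turned into a rough check. The function `describe_shape` and its `tolerance` parameter are assumptions made for this sketch; the comparison is only the rule of thumb stated on the slide, not a formal test of skewness. The first data set is the day-student ages from the stem-and-leaf slide; the two short lists are made-up values.

```python
import math
import statistics

def describe_shape(values, tolerance=1e-9):
    """Classify shape by comparing the mean with the median (rule of thumb)."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    if math.isclose(mean, median, abs_tol=tolerance):
        return "symmetric"
    return "left-skewed" if mean < median else "right-skewed"

# Day-student ages from the stem-and-leaf slide.
day_students = [16, 17, 17, 18, 18, 18, 19, 19, 20, 20, 21, 22,
                22, 25, 27, 32, 38, 42]

print(describe_shape(day_students))    # mean above median
print(describe_shape([1, 2, 3, 4, 5])) # hypothetical values, mean equals median
print(describe_shape([1, 8, 9, 10]))   # hypothetical values, mean below median
```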
Sample statistics versus
population parameters

Measure              Population Parameter   Sample Statistic

Mean                 $\mu$                  $\bar{X}$
Variance             $\sigma^2$             $S^2$
Standard Deviation   $\sigma$               $S$
The Five Number Summary

The five numbers that help describe the center, spread and shape of data are:
▪ Xsmallest
▪ First Quartile (Q1)
▪ Median (Q2)
▪ Third Quartile (Q3)
▪ Xlargest
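A five-number summary can be sketched with the standard library. The data are the eight values from the standard deviation calculation example. Quartile conventions differ between textbooks and software, so the quartiles below reflect the default "exclusive" method of `statistics.quantiles`, which may not match the convention used in this course.

```python
import statistics

# Sample data from the standard deviation calculation example.
data = [10, 12, 14, 15, 17, 18, 18, 24]

# Quartiles with the default method ("exclusive"); other conventions
# can give slightly different Q1 and Q3 for the same data.
q1, q2, q3 = statistics.quantiles(data, n=4)

five_number_summary = (min(data), q1, q2, q3, max(data))
data_range = max(data) - min(data)   # Xlargest - Xsmallest
iqr = q3 - q1                        # interquartile range

print("five-number summary:", five_number_summary)
print("range:", data_range, "interquartile range:", iqr)
```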
Five Number Summary and
The Boxplot

◼ The Boxplot: A graphical display of the data based on the five-number summary:
Xsmallest -- Q1 -- Median -- Q3 -- Xlargest

Example:

[Figure: boxplot marking Xsmallest, Q1, Median, Q3, and Xlargest; each of the four segments between consecutive markers contains 25% of the data.]

Five Number Summary:
Shape of Boxplots
◼ If data are symmetric around the median then the box and central line are centered between the endpoints

[Figure: symmetric boxplot marking Xsmallest, Q1, Median, Q3, and Xlargest.]

◼ A Boxplot can be shown in either a vertical or horizontal orientation
Distribution Shape and
The Boxplot

[Figure: three boxplots, each marking Q1, Q2, and Q3, for Left-Skewed, Symmetric, and Right-Skewed distributions.]
Personal Computer Programs
Used For Statistics

◼ Minitab
◼ A statistical package to perform statistical analysis
◼ Designed to perform analysis as accurately as possible

◼ Microsoft Excel
◼ A multi-functional data analysis tool
◼ Can perform many functions but none as well as programs that
are dedicated to a single function.

◼ Both Minitab and Excel use worksheets to store data


Minitab & Microsoft Excel Terms
▪ When you use Minitab or Microsoft Excel, you place the data you have collected in worksheets.

▪ The intersections of the columns and rows of worksheets form boxes called cells.

▪ If you want to refer to a group of cells that forms a contiguous rectangular area, you can use a cell range.

▪ Worksheets exist inside a workbook in Excel and inside a Project in Minitab.

▪ Both worksheets and projects can contain data, summaries, and charts.
Example

◼ A random sample of size 9 yields the following observations on the random variable X, the coal consumption in millions of tons by electric utilities for a given year:

406 395 400 450 390 410 415 401 408

Calculate the sample mean, sample variance and the sample standard deviation.
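A hand calculation for this exercise can be checked with a short script. This is a sketch using Python's standard `statistics` module, whose `variance` and `stdev` functions use the n - 1 divisor, matching the sample formulas in this lecture; the variable names are illustrative.

```python
import statistics

# Observations from the exercise (coal consumption, millions of tons).
coal = [406, 395, 400, 450, 390, 410, 415, 401, 408]

sample_mean = statistics.mean(coal)
sample_variance = statistics.variance(coal)   # divisor n - 1
sample_std_dev = statistics.stdev(coal)       # square root of the variance

print(f"n = {len(coal)}")
print(f"sample mean = {sample_mean:.3f}")
print(f"sample variance = {sample_variance:.3f}")
print(f"sample standard deviation = {sample_std_dev:.3f}")
```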
