SEE5211 Chapter1 P2017

Data Analysis in Envir Application
(SEE5211/SEE8212)
Dr. Wen Zhou

School of Energy and Environment
Email: wenzhou@cityu.edu.hk ; Office: B5425, AC1

Outline
• The role of statistics and the data analysis process

• Numerical methods of describing data
• Summarizing bivariate data
• Population distributions
• Sampling variability and Confidence interval
• Hypothesis Testing Using a Single Sample
• Comparing Two populations
• Regression Analysis
• Analysis of Variance
• Wavelet analysis
Course Assessment (100%)
• In-class Assignment #1: 20%, weekly
• Group project #2: 20%, Week 12
• Computer-based Exam: 20%, Week 13
• Written Exam: 40%
Reading material
Statistics: The Exploration and Analysis of Data, (2011)
Roxy Peck, Jay L DeVore | ISBN-10: 0840058012 | ISBN-13: 9780840058010
The role of statistics and the data analysis process
Week 1
What is statistics?
• the science of collecting, analyzing, and drawing conclusions from data
Why should one study statistics?
1. To be informed . . .
a) Extract information from tables, charts and graphs
b) Follow numerical arguments
c) Understand the basics of how data should be gathered, summarized,
and analyzed to draw statistical conclusions
2. To make informed judgments
3. To evaluate decisions that affect your life
If you choose a particular major, what are your chances of finding a job when you graduate?
What is variability?
Suppose you went into a convenience store to purchase a soft drink. Does every can on the shelf contain exactly 12 ounces?
NO – there may be a little more or less in the various cans due to the variability that is inherent in the filling process.
It is variability that makes life interesting!!
The Role of Statistics
Data Analysis Process: Data Collecting → Graphical Methods for Describing Data → Results
The Data Analysis Process
1. Understand the nature of the problem
2. Decide what to measure and how to measure it
3. Collect data
4. Summarize data and perform preliminary analysis
5. Perform formal analysis
6. Interpret results
Variable
• Any characteristic whose value may change from one individual to another
Two types of variables
• categorical
• numerical (discrete or continuous)
Identify the following variables:
1. the color of cars in a parking lot
Categorical
2. the number of calculators owned by students
Discrete numerical
3. the zip code of an individual
Categorical
4. the amount of time it takes students to get to school
Continuous numerical
5. the appraised value of homes in your city
Discrete numerical
Classifying variables by the number of
variables in a data set
Suppose that the PE coach records the height of each student in his class.
This is an example of univariate data.
Univariate - data that describes a single characteristic of the population
Suppose that the PE coach records the height and weight of each student in his class.
This is an example of bivariate data.
Bivariate - data that describes two characteristics of the population
Suppose that the PE coach records the height, weight, number of sit-ups, and number of push-ups for each student in his class.
This is an example of multivariate data.
Multivariate - data that describes more than two characteristics
Observational Study & Experiment Study
Observational Study
• Observational study – a study in which the researcher observes characteristics of a sample selected from one or more populations.
• Observational studies CAN be generalized to the population if the sample is randomly selected from the population of interest, but CANNOT show cause-effect relationships.
Experiment Study
• Experiment – a study in which the researcher observes how a response variable behaves when one or more explanatory variables (factors) are manipulated.
• Well-designed experiments CAN show cause-effect relationships, but CANNOT be generalized to the population if the groups are volunteers or are not randomly assigned.
Sources of bias
Selection bias
• Occurs when the way the sample is selected systematically excludes some part of the population of interest (called undercoverage)
• May also occur if only volunteers or self-selected individuals
are used in a study
Sources of bias
Nonresponse
• Occurs when responses are not obtained from all individuals selected for inclusion in the sample
• To minimize nonresponse bias, it is critical that a serious effort be made to follow up with individuals who did not respond to the initial request for information
Example
Consider Anna, a waitress. She decides to perform an
experiment to determine if writing “Thank you” on the receipt
increases her tip percentage.
She plans on having two groups. On one group she will write
“Thank you” on the receipt and on the other group she will not
write “Thank you” on the receipt.
Which of these is the control group?

Control experiment
Control group is an experimental group that does NOT receive any treatment.
The use of a control group allows the experimenter to assess how the response variable behaves when the treatment is not used.
This provides a baseline against which the treatment groups can be compared to determine whether the treatment had an effect.
Experimental Designs
1. Completely randomized design – experimental units are assigned at random to treatments, or treatments are assigned at random to trials
The ONLY way to show a cause-effect relationship is with a well-designed, well-controlled experiment!!!
[Diagram: experimental units → random assignment → Treatment A or Treatment B → measure response for A and for B → compare treatments]
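The random-assignment step of a completely randomized design can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ten numbered units, the treatment labels "A" and "B", the seed, and the function name `completely_randomized` are all hypothetical and do not come from the slides.

```python
import random

def completely_randomized(units, treatments, seed=None):
    """Shuffle the units, then deal them out to the treatments in turn."""
    rng = random.Random(seed)          # seeded so the assignment is reproducible
    shuffled = list(units)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

# Hypothetical experimental units, numbered 1 to 10.
units = list(range(1, 11))
groups = completely_randomized(units, ["A", "B"], seed=0)
for treatment, members in groups.items():
    print(treatment, sorted(members))
```

Shuffling first and dealing out afterwards keeps the group sizes equal (or within one unit of each other) while leaving the membership of each group to chance.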
Experimental Designs
2. Randomized block – units are blocked into (homogeneous) groups and then randomly assigned to treatments
[Diagram: experimental units → create blocks (Block 1, Block 2) → random assignment within each block to Treatment A or Treatment B → measure response for A and for B → compare treatments for block 1 and for block 2 → compare the results from the 2 blocks]
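A randomized block design repeats the random assignment separately inside each block. The sketch below assumes two hypothetical blocks of four units each and treatment labels "A" and "B"; the unit names, the seed, and the function name `randomized_block` are illustrative only.

```python
import random

def randomized_block(blocks, treatments, seed=None):
    """Randomly assign units to treatments within each block separately."""
    rng = random.Random(seed)
    design = {}
    for block_name, members in blocks.items():
        shuffled = list(members)
        rng.shuffle(shuffled)          # randomization happens inside the block
        design[block_name] = {t: [] for t in treatments}
        for i, unit in enumerate(shuffled):
            design[block_name][treatments[i % len(treatments)]].append(unit)
    return design

# Hypothetical blocks of similar (homogeneous) units.
blocks = {
    "block 1": ["u1", "u2", "u3", "u4"],
    "block 2": ["u5", "u6", "u7", "u8"],
}
design = randomized_block(blocks, ["A", "B"], seed=1)
for block_name, groups in design.items():
    print(block_name, groups)
```

Because every block contributes units to both treatments, treatment comparisons are made between units that are alike with respect to the blocking characteristic.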
Bar Chart
What to Look For
Frequently or infrequently occurring categories
Collect the following data and then display the data in a bar chart:
What is your favorite ice cream flavor?
Vanilla, chocolate, strawberry, or other
Comparative Bar Charts
When to Use: Categorical data
How to construct
• Constructed like bar charts, but with two (or more) groups being compared
• MUST use relative frequencies on the vertical axis
• MUST include a key to denote the different bars
Example
A survey of students applying to college and of parents of college applicants:
In 2009, 12,715 high school students responded to the question “Ideally how
far from home would you like the college you attend to be?” Also, 3007
parents of students applying to college responded to the question “how far
from home would you like the college your child attends to be?” Data is
displayed in the frequency table below.
Frequency
Ideal Distance          Students   Parents
Less than 250 miles     4450       1594
250 to 500 miles        3942       902
500 to 1000 miles       2416       331
More than 1000 miles    1907       180
Example
Relative Frequency
Ideal Distance          Students   Parents
Less than 250 miles     .35        .53
250 to 500 miles        .31        .30
500 to 1000 miles       .19        .11
More than 1000 miles    .15        .06
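The relative frequencies follow from dividing each count by its group total (12,715 students, 3007 parents). A minimal Python sketch of that arithmetic, with illustrative variable names:

```python
categories = [
    "Less than 250 miles",
    "250 to 500 miles",
    "500 to 1000 miles",
    "More than 1000 miles",
]
students = [4450, 3942, 2416, 1907]
parents = [1594, 902, 331, 180]

def relative_frequencies(counts):
    """Divide each count by the group total and round to two decimals."""
    total = sum(counts)
    return [round(c / total, 2) for c in counts]

rel_students = relative_frequencies(students)
rel_parents = relative_frequencies(parents)
for name, s, p in zip(categories, rel_students, rel_parents):
    print(f"{name:<22} {s:.2f} {p:.2f}")
```

Relative frequencies are needed here because the two groups differ greatly in size, so raw counts cannot be compared directly.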
What does this graph show about the ideal distance college should be from home?
Example
First draw a bar that represents 100% of the students who answered the survey.
Do the same thing for parents – don’t forget a key denoting each category.
[Figure: segmented bar chart with one bar for Students and one for Parents; vertical axis: relative frequency (0 to 1.0); key: Less than 250 miles, 250 to 500 miles, 500 to 1000 miles, More than 1000 miles]
Segmented (or Stacked) Bar Charts
How to construct
• MUST first calculate relative frequencies
• Draw a bar representing 100% of the group
• Divide the bar into segments corresponding to the relative
frequencies of the categories
Pie (Circle) Chart
How to construct
• Draw a circle to represent the entire data set
• Calculate the size of each “slice”:
Relative frequency × 360°
• Using a protractor, mark off each slice
To describe
– comment on which category had the largest
proportion or smallest proportion
Example
Typos on a résumé do not make a very good impression when
applying for a job. Senior executives were asked how many typos
in a résumé would make them not consider a job candidate. The
resulting data are summarized in the table below.
Number of Typos   Frequency   Relative Frequency
1                 60          .40
2                 54          .36
3                 21          .14
4 or more         10          .07
Don’t know        5           .03
What does this pie chart tell us about the number of typos occurring
in résumés before the applicant would not be considered for a job?
First draw a circle to represent the entire data set.
Next, calculate the size of the slice for “1 typo”:
.40 × 360° = 144°
Draw that slice.
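The same calculation can be repeated for every category. The sketch below works from the frequencies in the table (150 responses in total) rather than the rounded relative frequencies, so the angles add up to exactly 360°; the variable names are illustrative.

```python
frequencies = {
    "1": 60,
    "2": 54,
    "3": 21,
    "4 or more": 10,
    "Don't know": 5,
}
total = sum(frequencies.values())

# Slice angle = relative frequency x 360 degrees.
angles = {label: count * 360 / total for label, count in frequencies.items()}
for label, angle in angles.items():
    print(f"{label:<11} {angle:6.1f} degrees")
```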
Numerical / Univariate Graph: Center
What strikes you as the most distinctive difference among the
distributions of exam scores in classes A, B, & C ?
1. Center
• discuss where the middle of the data falls
• three measures of central tendency describe where the data are centred or clustered
• mean, median, & mode
• The mean is useful for predicting future results when there are no extreme
values in the data set.
• The median may be more useful than the mean when there are extreme
values in the data set as it is not affected by the extreme values.
• The mode is useful when the most common item, characteristic or value of
a data set is required.
Example: 2,3,5,6,7,8,8,13,15,17,17,17,17,19,22,33
Mean=209/16=13.1; Median=(13+15)/2=14; Mode=17
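The three measures of center for this example data set can be checked with Python's standard `statistics` module. This is only a sketch; the variable names are illustrative.

```python
import statistics

data = [2, 3, 5, 6, 7, 8, 8, 13, 15, 17, 17, 17, 17, 19, 22, 33]

total = sum(data)                  # numerator of the mean
mean = statistics.mean(data)       # sum divided by the number of values
median = statistics.median(data)   # average of the two middle values (n is even)
mode = statistics.mode(data)       # most frequently occurring value

print("n =", len(data), "sum =", total)
print("mean =", mean, "median =", median, "mode =", mode)
```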
Numerical / Univariate Graph: Spread
What strikes you as the most distinctive difference among the distributions of scores in classes D, E, & F?
2. Spread
• discuss how spread out the data is
• refers to the variability in the data
• Measures of spread are the range, standard deviation, and IQR
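As a sketch, the three measures of spread can be computed for the example data set used for the measures of center (2, 3, 5, ..., 33). Two assumptions are made here: the standard deviation is the sample version (divisor n − 1), and the quartiles are taken as the medians of the lower and upper halves of the sorted data, which is one common textbook convention among several.

```python
import statistics

data = sorted([2, 3, 5, 6, 7, 8, 8, 13, 15, 17, 17, 17, 17, 19, 22, 33])
n = len(data)

data_range = max(data) - min(data)

# Sample standard deviation (divides by n - 1).
std_dev = statistics.stdev(data)

# Quartiles as medians of the lower and upper halves; the middle value is
# left out of both halves when n is odd.
lower_half = data[: n // 2]
upper_half = data[(n + 1) // 2 :]
q1 = statistics.median(lower_half)
q3 = statistics.median(upper_half)
iqr = q3 - q1

print("range =", data_range)
print("standard deviation =", round(std_dev, 2))
print("Q1 =", q1, "Q3 =", q3, "IQR =", iqr)
```

Other quartile conventions (for example those used by spreadsheet software) can give slightly different IQR values for the same data.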
Numerical / Univariate Graph: Shape
What strikes you as the most distinctive difference among the distributions of exam scores in classes G, H, & I ?
3. Shape
• refers to the overall shape of the distribution
• symmetrical, uniform, skewed, or bimodal
Symmetrical
• refers to data in which both sides
are (more or less) the same when
the graph is folded vertically down
the middle
• bell-shaped is a special type
• has a center mound with two
sloping tails
Uniform
• refers to data in which every class
has equal or approximately equal
frequency
Skewed
• refers to data in which one side (tail) is longer than the other side
• the direction of skewness is on the side of the longer tail
The directions are positively (or right) skewed or negatively (or left) skewed.
4. Unusual occurrences
• Outlier - value that lies away from the rest of the data
• Gaps
• Clusters
Stem-and-Leaf Displays
When to Use: Univariate numerical data
How to construct
• Select one or more of the leading digits for the
stem
• List the possible stem values in a vertical column
• Record the leaf for each observation beside each
corresponding stem value
• Indicate the units for stems and leaves in a key
or legend
To describe
– comment on the center, spread, and shape of the
distribution and if there are any unusual features
The following data are price per ounce for various brands of dandruff shampoo at a local grocery store.
0.32 0.21 0.29 0.54 0.17 0.28 0.36 0.23
Create a stem-and-leaf display with this data.
Stem | Leaf
1 | 7
2 | 1 9 8 3
3 | 2 6
4 |
5 | 4
Key: stem = tenths, leaf = hundredths (1 | 7 represents 0.17)
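The construction steps can be sketched in Python for the shampoo prices. The function name `stem_and_leaf` and the `scale` parameter are illustrative; the sketch assumes two-decimal data, with the tenths digit as the stem and the hundredths digit as the leaf.

```python
from collections import defaultdict

prices = [0.32, 0.21, 0.29, 0.54, 0.17, 0.28, 0.36, 0.23]

def stem_and_leaf(values, scale=100):
    """Split each value into a stem (tenths) and a leaf (hundredths)."""
    leaves = defaultdict(list)
    for v in values:
        # round() guards against floating-point noise such as 28.999999...
        stem, leaf = divmod(round(v * scale), 10)
        leaves[stem].append(leaf)
    # List every possible stem, including stems with no leaves.
    return {s: leaves.get(s, []) for s in range(min(leaves), max(leaves) + 1)}

display = stem_and_leaf(prices)
for stem, stem_leaves in display.items():
    print(stem, "|", " ".join(str(leaf) for leaf in stem_leaves))
```

Leaves are recorded in the order the observations appear; sorting each row is an optional extra step.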
Histograms
When to Use: Univariate numerical data
How to construct (discrete data)
―Draw a horizontal scale and mark it with the possible
values for the variable
―Draw a vertical scale and mark it with frequency or
relative frequency
―Above each possible value, draw a rectangle centered
at that value with a height corresponding to its
frequency or relative frequency
To describe
– comment on the center, spread, and shape of the
distribution and if there are any unusual features
Example
Queen honey bees mate shortly after they become adults. During a
mating flight, the queen usually takes several partners, collecting
sperm that she will store and use throughout the rest of her life. A
study on honey bees provided the following data on the number of
partners for 30 queen bees.
12 2 4 6 6 7 8 7 8 11
8 3 5 6 7 10 1 9 7 6
9 7 5 4 7 4 6 7 8 10
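Before drawing the histogram, the frequency of each possible value has to be tallied. A minimal sketch using Python's `collections.Counter`, printing a rough text version of the histogram (the variable names are illustrative):

```python
from collections import Counter

partners = [
    12, 2, 4, 6, 6, 7, 8, 7, 8, 11,
    8, 3, 5, 6, 7, 10, 1, 9, 7, 6,
    9, 7, 5, 4, 7, 4, 6, 7, 8, 10,
]

counts = Counter(partners)
n = len(partners)

# One row per possible value; the bar length is the frequency.
for value in range(min(partners), max(partners) + 1):
    freq = counts[value]
    print(f"{value:>2} {'*' * freq:<8} {freq} ({freq / n:.3f})")
```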
Example
A study examined the number of hours spent watching TV per day for a sample of children age 1 and for a sample of children age 3. Below are comparative histograms.
[Figure: comparative histograms, panels "Children Age 1" and "Children Age 3"]

Histograms with unequal intervals
When to use
- when you have a concentration of data in the
middle with some extreme values
How to construct
- construct similar to histograms with continuous data, but with density on the vertical axis
$$\text{density} = \frac{\text{relative frequency for interval}}{\text{width of interval}}$$
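A short sketch of the density calculation follows. The interval boundaries and relative frequencies below are hypothetical values invented for illustration; they are not data from the slides.

```python
# Hypothetical intervals: (lower bound, upper bound, relative frequency).
intervals = [
    (0, 5, 0.10),
    (5, 10, 0.30),
    (10, 15, 0.40),
    (15, 30, 0.20),   # wider interval covering the extreme values
]

densities = []
for lower, upper, rel_freq in intervals:
    width = upper - lower
    density = rel_freq / width      # bar height on the density scale
    densities.append(density)
    print(f"{lower:>2} to <{upper:<2}  width={width:<2}  density={density:.4f}")

# With density on the vertical axis, bar AREA (height x width) equals
# the relative frequency, so the areas sum to 1.
total_area = sum(d * (u - l) for d, (l, u, _) in zip(densities, intervals))
```

Using density keeps a wide interval from looking more important than it is: its bar is wide but correspondingly short.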
Cumulative Relative Frequency Plot
When to use
- used to answer questions about percentiles (a value with a given percent of observations at or below that value)
How to construct
- Mark the boundaries of the intervals on the horizontal axis
- Draw a vertical scale and mark it with cumulative relative frequency
- Plot the point corresponding to the upper end of each interval with its cumulative relative frequency, including the beginning point
- Connect the points.
The National Climatic Center has been collecting weather data for many
years. The annual rainfall amounts for Albuquerque, New Mexico from
1950 to 2008 were used to create the frequency distribution below.
Annual Rainfall   Relative    Cumulative relative
(in inches)       frequency   frequency
4 to <5           0.052       0.052
5 to <6           0.103       0.155 (= 0.052 + 0.103)
6 to <7           0.086       0.241 (= 0.155 + 0.086)
7 to <8           0.103
8 to <9           0.172
9 to <10          0.069
10 to <11         0.207
11 to <12         0.103
12 to <13         0.052
13 to <14         0.052
Continue this pattern to complete the table.
Annual Rainfall   Relative    Cumulative relative
(in inches)       frequency   frequency
4 to <5           0.052       0.052
5 to <6           0.103       0.155
6 to <7           0.086       0.241
7 to <8           0.103       0.344
8 to <9           0.172       0.516
9 to <10          0.069       0.585
10 to <11         0.207       0.792
11 to <12         0.103       0.895
12 to <13         0.052       0.947
13 to <14         0.052       0.999
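The cumulative column is a running total of the relative frequencies. A minimal sketch with `itertools.accumulate` (variable names are illustrative):

```python
from itertools import accumulate

relative = [0.052, 0.103, 0.086, 0.103, 0.172,
            0.069, 0.207, 0.103, 0.052, 0.052]

# Running total, rounded to three decimals to match the table.
cumulative = [round(c, 3) for c in accumulate(relative)]

for lower, rel, cum in zip(range(4, 14), relative, cumulative):
    print(f"{lower} to <{lower + 1}: {rel:.3f} {cum:.3f}")
```

The final value is 0.999 rather than 1 only because the relative frequencies in the table were themselves rounded.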
[Figure: cumulative relative frequency plot; horizontal axis: rainfall (inches), 2 to 14; vertical axis: cumulative relative frequency, 0 to 1.0]
What proportion of years had rainfall amounts that were 9.5 inches or less?
Approximately 0.55
Approximately 30% of the years had annual rainfall less than what amount?
Approximately 7.5 inches
Which interval of rainfall amounts had a larger proportion of years – 9 to 10 inches or 10 to 11 inches? Explain.
The interval 10 to 11 inches, because its slope is steeper, indicating a larger proportion occurred.
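Reading values off the plot amounts to linear interpolation between the plotted points. The sketch below reproduces the three answers numerically from the cumulative relative frequency table; the helper name `interpolate` is illustrative, and straight-line segments between interval boundaries are assumed, as in the plot.

```python
boundaries = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
cumulative = [0.0, 0.052, 0.155, 0.241, 0.344, 0.516,
              0.585, 0.792, 0.895, 0.947, 0.999]

def interpolate(x, xs, ys):
    """Piecewise-linear interpolation of y at x; xs must be increasing."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    points = list(zip(xs, ys))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Proportion of years with rainfall of 9.5 inches or less.
prop_9_5 = interpolate(9.5, boundaries, cumulative)

# Rainfall amount below which about 30% of the years fall (inverse lookup).
amount_30 = interpolate(0.30, cumulative, boundaries)

# Slope over a one-inch interval = proportion of years in that interval.
slope_9_10 = cumulative[6] - cumulative[5]
slope_10_11 = cumulative[7] - cumulative[6]

print(round(prop_9_5, 2), round(amount_30, 1))
print(round(slope_9_10, 3), round(slope_10_11, 3))
```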
Scatterplots
When to Use: Bivariate numerical data
How to construct
- Draw a horizontal scale and mark it with appropriate values of the independent variable
- Draw a vertical scale and mark it with appropriate values of the dependent variable
- Plot each point corresponding to the observations
To describe
- comment on the relationship between the variables
Time Series Plots
When to Use
- measurements collected over time at regular intervals
How to construct
- Draw a horizontal scale and mark it with appropriate values
of time
- Draw a vertical scale and mark it with appropriate values of the observed variable
- Plot each point corresponding to the observations and connect the points
To describe
- comment on any trends or patterns over time
Group project
Group Project: 20% (group presentation 10% and term paper 10%; individual participation 2%)
Students will first be divided into 10-12 small groups (4-6 students per group).
Each small group will conduct a forum on a topic of its choice. Your group will select one type of dataset (such as air pollutant concentration, weather data, power data, or others). Group members will work together to prepare a 15-minute presentation and a term paper (1500 words + 4 figures) about data analysis. Each project should first introduce the environmental datasets or historical events and discuss the types of datasets, with a particular focus on collecting, analyzing, and drawing conclusions from data.
Topics
Defining the problem: Reducing the threat of acid rain to our environment
- Cause of acid rain: sulfuric and nitric acids
- Sources of acidic components of rain: hydrocarbon fuels, which spew sulfur and nitric oxide into the atmosphere when burned
- Solutions to the problem: strive for a ~50% reduction in sulfur-oxide emissions; develop new technology to allow us to use available energy sources; develop alternative cleaner energy
In China, high-sulfur coal is a major source of these emissions, but because the country depends on coal for energy, a shift to lower-sulfur coal is not always possible.
Statistics will play a key role in monitoring atmospheric conditions and testing the effectiveness of proposed emission control devices…
Topics
Ozone exposure and population density
- Ambient ozone pollution: causes damage to the human respiratory system, agricultural crops, and trees
Ozone (O3): higher concentrations, longer exposure and greater activity levels cause greater effects.
O3 impact on the respiratory system: irritates mucous membranes of the nose, throat and airways; symptoms are cough, chest pain, and throat and eye irritation.
O3 increases susceptibility to respiratory infection, impairs normal lung functioning and induces respiratory inflammation.
Healthy individuals who exercise heavily for 1 to 2 hours may experience respiratory symptoms at levels exceeding 240 µg/m3; they may experience these symptoms at a lower concentration over 6 to 8 hours of moderate exercise.
Individuals with sensitive respiratory systems (asthma or respiratory disease) are more susceptible to the effects of O3.
Topics
Electricity Supply in Hong Kong
In 2003, the electricity supply of CLP Power was 28,035 million kWh (CLP Holdings Annual Report 2003, Ten-year Summary: Scheme of Control Financial and Operating Statistics), while that of HEC was 10,413 million kWh (Hong Kong Electric Holdings Ltd., Units Sold); the total supply was therefore 38,448 million kWh, or on average 3.8 × 10^14 J every day.
https://www.clpgroup.com/en
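The unit conversion behind the daily figure can be checked with a few lines of arithmetic. The sketch assumes 1 kWh = 3.6 × 10^6 J and a 365-day year (2003 was not a leap year); the variable names are illustrative.

```python
KWH_TO_J = 3.6e6          # 1 kWh = 1000 W x 3600 s = 3.6e6 J
DAYS_IN_2003 = 365        # assumption: annual total spread evenly over the year

clp_kwh = 28_035e6        # CLP Power, million kWh converted to kWh
hec_kwh = 10_413e6        # HEC, million kWh converted to kWh

total_kwh = clp_kwh + hec_kwh
joules_per_year = total_kwh * KWH_TO_J
joules_per_day = joules_per_year / DAYS_IN_2003

print(f"total supply: {total_kwh / 1e6:,.0f} million kWh")
print(f"average per day: {joules_per_day:.2e} J")
```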
Topics
Climatological characteristics of TC rainfall in Hong Kong
The day when a TC has entered the 800 km region of Hong Kong is defined as a TC day, and the associated precipitation is treated as TC-related precipitation.
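The definition can be applied to daily records as sketched below. Everything except the 800 km threshold is an assumption made for illustration: the dates, distances, and rainfall amounts are invented, and treating "entered the 800 km region" as "TC centre within 800 km at some time that day" is one possible reading.

```python
TC_RADIUS_KM = 800   # threshold from the definition above

# Hypothetical daily records: (date, distance of nearest TC centre from
# Hong Kong in km, or None if no TC exists; daily rainfall in mm).
daily_records = [
    ("day 1", None, 0.0),
    ("day 2", 1150.0, 2.5),
    ("day 3", 760.0, 18.0),
    ("day 4", 310.0, 96.5),
    ("day 5", 905.0, 4.0),
]

def is_tc_day(distance_km):
    """A day counts as a TC day when a TC is within the 800 km region."""
    return distance_km is not None and distance_km <= TC_RADIUS_KM

tc_days = [date for date, dist, _ in daily_records if is_tc_day(dist)]
tc_rainfall = sum(rain for _, dist, rain in daily_records if is_tc_day(dist))
total_rainfall = sum(rain for _, _, rain in daily_records)

print("TC days:", tc_days)
print("TC-related rainfall (mm):", tc_rainfall)
print("share of total rainfall:", round(tc_rainfall / total_rainfall, 3))
```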
