Data Analysis in Envir Application

(SEE5211/SEE8212)

Dr. Wen Zhou


School of Energy and Environment

Email: wenzhou@cityu.edu.hk ; Office: B5425, AC1


Outline

• The role of statistics and the data analysis process


• Numerical method of describing data
• Summarizing bivariate data
• Population distributions
• Sampling variability and Confidence interval
• Hypothesis Testing Using a Single Sample
• Comparing Two populations
• Regression Analysis
• Analysis of Variance
• Wavelet analysis
Course Assessment

Component                 Weight   Timing
In-class Assignment #1    20%      Weekly
Group project #2          20%      Week 12
Computer-based Exam       20%      Week 13
Written Exam              40%
Total                     100%

Reading material
Statistics: The Exploration and Analysis of Data, (2011)
Roxy Peck, Jay L DeVore | ISBN-10: 0840058012 | ISBN-13: 9780840058010
The role of statistics and the data analysis process

Week 1
What is statistics?

• the science of collecting, analyzing, and drawing conclusions from data
Why should one study statistics?

1. To be informed . . .
a) Extract information from tables, charts and graphs
b) Follow numerical arguments
c) Understand the basics of how data should be gathered, summarized,
and analyzed to draw statistical conclusions
2. To make informed judgments
3. To evaluate decisions that affect your life

If you choose a particular major, what are your chances of finding a job
when you graduate?
What is variability?

Suppose you went into a convenience store to purchase a soft drink.
Does every can on the shelf contain exactly 12 ounces?

NO – there may be a little more or less in the various cans due to
the variability that is inherent in the filling process.

It is variability that makes life interesting!!
The Role of Statistics

Data Analysis Process:
Data Collecting → Graphical Methods for Describing Data → Results
The Data Analysis Process

1. Understand the nature of the problem
2. Decide what to measure and how to measure it
3. Collect data
4. Summarize data and perform preliminary analysis
5. Perform formal analysis
6. Interpret results
Variable

• Any characteristic whose value may change from one individual to another
Two types of variables

• categorical
• numerical: discrete or continuous
Identify the following variables:

1. the color of cars in a parking lot
Categorical
2. the number of calculators owned by students
Discrete numerical
3. the zip code of an individual
Categorical
4. the amount of time it takes students to get to school
Continuous numerical
5. the appraised value of homes in your city
Discrete numerical
Classifying variables by the number of
variables in a data set

Suppose that the PE coach records the height of each student in his class.

This is an example of univariate data.

Univariate - data that describes a single characteristic of the population
Classifying variables by the number of
variables in a data set

Suppose that the PE coach records the height and weight of each student
in his class.

This is an example of bivariate data.

Bivariate - data that describes two characteristics of the population
Classifying variables by the number of
variables in a data set
Suppose that the PE coach records the height, weight, number
of sit-ups, and number of push-ups for each student in his
class.

This is an example of multivariate data.

Multivariate - data that describes more than two characteristics
Observational Study & Experiment Study

Observational Study
• Observational study – a study in which the researcher observes
characteristics of a sample selected from one or more populations.
• Observational studies CAN be generalized to the population if the
sample is randomly selected from the population of interest, but
CANNOT show cause-effect relationships.

Experiment Study
• Experiment - a study in which the researcher observes how a response
variable behaves when one or more explanatory variables (factors) are
manipulated.
• Well-designed experiments CAN show cause-effect relationships, but
CANNOT be generalized to the population if the groups are volunteers
or are not randomly assigned.
Sources of bias
Selection bias
• Occurs when the way the sample is selected systematically
excludes some part of the population of interest; this is called
undercoverage
• May also occur if only volunteers or self-selected individuals
are used in a study
Sources of bias
Nonresponse

• occurs when responses are not obtained from all individuals selected
for inclusion in the sample

• To minimize nonresponse bias, it is critical that a serious effort be
made to follow up with individuals who did not respond to the initial
request for information
Example
Consider Anna, a waitress. She decides to perform an
experiment to determine if writing “Thank you” on the receipt
increases her tip percentage.

She plans on having two groups. On one group she will write
“Thank you” on the receipt and on the other group she will not
write “Thank you” on the receipt.

Which of these is the control group?


Control experiment

A control group is an experimental group that does NOT receive any
treatment.

The use of a control group allows the experimenter to assess how the
response variable behaves when the treatment is not used.
This provides a baseline against which the treatment groups can be
compared to determine whether the treatment had an effect.
Experimental Designs

1. Completely randomized design – experimental units are assigned at
random to treatments or treatments are assigned at random to trials

The ONLY way to show a cause-effect relationship is with a
well-designed, well-controlled experiment!!!

[Diagram: experimental units → random assignment → Treatment A →
measure response for A; → Treatment B → measure response for B;
then compare treatments]
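
The random assignment step of a completely randomized design can be sketched in a few lines of Python. This is only an illustration: the function name `completely_randomized`, the 20 hypothetical tables, and the fixed seed are assumptions made for the example, loosely following the "Thank you" tipping example, and are not part of the course material.

```python
import random

def completely_randomized(units, treatments, seed=None):
    """Shuffle the units, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

# Hypothetical experimental units: 20 restaurant tables.
tables = [f"table_{i}" for i in range(1, 21)]
assignment = completely_randomized(tables, ["Thank you", "No message"], seed=1)
for treatment, group in assignment.items():
    print(treatment, len(group), group[:3], "...")
```

Every unit has the same chance of landing in either group, which is what allows a later difference in tips to be attributed to the treatment rather than to how the groups were formed.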
Experimental Designs

2. Randomized block – units are blocked into groups (homogeneous) and
then randomly assigned to treatments
[Diagram: experimental units → create blocks → Block 1 and Block 2;
within each block: random assignment → Treatment A / Treatment B →
measure response for A and for B → compare treatments for that block;
finally, compare the results from the 2 blocks]
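
A randomized block design repeats the random assignment separately inside each block. The sketch below assumes two hypothetical blocks (lunch and dinner tables); the block names, unit labels, and the function name `randomized_block` are made up for illustration.

```python
import random

def randomized_block(blocks, treatments, seed=None):
    """Randomly assign units to treatments separately within each block."""
    rng = random.Random(seed)
    design = {}
    for block_name, units in blocks.items():
        shuffled = list(units)
        rng.shuffle(shuffled)
        groups = {t: [] for t in treatments}
        for i, unit in enumerate(shuffled):
            groups[treatments[i % len(treatments)]].append(unit)
        design[block_name] = groups
    return design

# Hypothetical blocks of similar (homogeneous) units.
blocks = {
    "lunch": [f"lunch_{i}" for i in range(1, 9)],
    "dinner": [f"dinner_{i}" for i in range(1, 9)],
}
design = randomized_block(blocks, ["A", "B"], seed=7)
for block_name, groups in design.items():
    print(block_name, groups)
```

Treatments are then compared within each block before the block results are compared with each other.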
The Role of Statistics

Data Analysis Process:
Data Collecting → Graphical Methods for Describing Data → Results
Bar Chart

What to Look For
Frequently or infrequently occurring categories

Collect the following data and then display the data in a bar chart:
What is your favorite ice cream flavor?
Vanilla, chocolate, strawberry, or other

Comparative Bar Charts

When to Use: Categorical data

How to construct
• Constructed like bar charts, but with two (or more) groups
being compared
• MUST use relative frequencies on the vertical axis
• MUST include a key to denote the different bars
Example
A survey of students applying to college and of parents of college applicants:
In 2009, 12,715 high school students responded to the question “Ideally how
far from home would you like the college you attend to be?” Also, 3007
parents of students applying to college responded to the question “How far
from home would you like the college your child attends to be?” The data are
displayed in the frequency table below.

                         Frequency
Ideal Distance           Students   Parents
Less than 250 miles      4450       1594
250 to 500 miles         3942       902
500 to 1000 miles        2416       331
More than 1000 miles     1907       180
Example
                         Relative Frequency
Ideal Distance           Students   Parents
Less than 250 miles      .35        .53
250 to 500 miles         .31        .30
500 to 1000 miles        .19        .11
More than 1000 miles     .15        .06

What does this graph show about the ideal distance college should be
from home?
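
The relative frequencies in the table can be reproduced from the raw counts. The sketch below uses the survey counts given in the example; the helper name `relative_frequencies` is only illustrative.

```python
students = {
    "Less than 250 miles": 4450,
    "250 to 500 miles": 3942,
    "500 to 1000 miles": 2416,
    "More than 1000 miles": 1907,
}
parents = {
    "Less than 250 miles": 1594,
    "250 to 500 miles": 902,
    "500 to 1000 miles": 331,
    "More than 1000 miles": 180,
}

def relative_frequencies(counts):
    """Divide each category count by the group total."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

rel_students = relative_frequencies(students)
rel_parents = relative_frequencies(parents)
for category in students:
    print(f"{category:22s} {rel_students[category]:.2f} {rel_parents[category]:.2f}")
```

Because the two groups differ greatly in size (12,715 students versus 3007 parents), comparing raw counts would be misleading; relative frequencies put both groups on the same scale.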
Example
First draw a bar that represents 100% of the students who answered the
survey.

Do the same thing for parents – don’t forget a key denoting each category.

[Segmented bar chart: one bar each for Students and Parents; vertical
axis: relative frequency (0 to 1.0); key: Less than 250 miles, 250 to
500 miles, 500 to 1000 miles, More than 1000 miles]
Segmented (or Stacked) Bar Charts

When to Use: Categorical data

How to construct
• MUST first calculate relative frequencies
• Draw a bar representing 100% of the group
• Divide the bar into segments corresponding to the relative
frequencies of the categories
Pie (Circle) Chart

When to Use: Categorical data

How to construct
• Draw a circle to represent the entire data set
• Calculate the size of each “slice”:
Relative frequency × 360°
• Using a protractor, mark off each slice

To describe
– comment on which category had the largest
proportion or smallest proportion
Example
Typos on a résumé do not make a very good impression when
applying for a job. Senior executives were asked how many typos
in a résumé would make them not consider a job candidate. The
resulting data are summarized in the table below.

Number of Typos   Frequency   Relative Frequency
1                 60          .40
2                 54          .36
3                 21          .14
4 or more         10          .07
Don’t know        5           .03
What does this pie chart tell us about the number of typos occurring
in résumés before the applicant would not be considered for a job?

First draw a circle to represent the entire data set.

Next, calculate the size of the slice for “1 typo”:
.40 × 360° = 144°
Draw that slice.
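
The same calculation can be carried out for every slice. The sketch below works from the frequencies in the table rather than the rounded relative frequencies, so the angles for the two smallest categories differ slightly from what the rounded values would give.

```python
typos = {"1": 60, "2": 54, "3": 21, "4 or more": 10, "Don't know": 5}

total = sum(typos.values())  # 150 executives in the table
# Slice angle = relative frequency x 360 degrees.
angles = {category: n / total * 360 for category, n in typos.items()}

for category, angle in angles.items():
    print(f"{category:12s} {angle:6.1f} degrees")
```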
Numerical / Univariate Graph: Center
What strikes you as the most distinctive difference among the
distributions of exam scores in classes A, B, & C ?
1. Center
• discuss where the middle of the data falls
• three measures of central tendency describe where the data are
centred or clustered
• mean, median, & mode
• The mean is useful for predicting future results when there are no extreme
values in the data set. 
• The median may be more useful than the mean when there are extreme
values in the data set as it is not affected by the extreme values.
• The mode is useful when the most common item, characteristic or value of
a data set is required.

Example: 2,3,5,6,7,8,8,13,15,17,17,17,17,19,22,33
Mean=209/16≈13.1; Median=(13+15)/2=14; Mode=17
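
The three measures for this example can be checked with Python's standard `statistics` module. The 16 values sum to 209, so the mean works out to 209/16 ≈ 13.1.

```python
import statistics

data = [2, 3, 5, 6, 7, 8, 8, 13, 15, 17, 17, 17, 17, 19, 22, 33]

mean = statistics.mean(data)      # sum of the values divided by their count
median = statistics.median(data)  # average of the two middle values (n is even)
mode = statistics.mode(data)      # most frequently occurring value

print(sum(data), len(data))
print(mean, median, mode)
```

The single large value 33 pulls the mean up but leaves the median unchanged, which is the behaviour described in the bullets above.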
Numerical / Univariate Graph: Spread
What strikes you as the most distinctive difference among the
distributions of scores in classes D, E, & F?
2. Spread
• discuss how spread out the data is
• refers to the variability in the data
• Measures of spread are the range, standard deviation, and IQR
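
As a rough illustration, the three measures of spread can be computed for the data set 2, 3, 5, ..., 33 from the centre example. Two choices here are assumptions: the sample standard deviation is used (divisor n − 1), and the quartiles come from `statistics.quantiles` with its default "exclusive" method. Other quartile conventions give slightly different IQR values.

```python
import statistics

data = [2, 3, 5, 6, 7, 8, 8, 13, 15, 17, 17, 17, 17, 19, 22, 33]

data_range = max(data) - min(data)              # largest minus smallest value
std_dev = statistics.stdev(data)                # sample standard deviation
q1, q2, q3 = statistics.quantiles(data, n=4)    # quartiles, default method
iqr = q3 - q1                                   # interquartile range

print(data_range, round(std_dev, 2), q1, q3, iqr)
```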
Numerical / Univariate Graph: Shape
What strikes you as the most distinctive difference among the
distributions of exam scores in classes G, H, & I ?
3. Shape
• refers to the overall shape of the distribution
• symmetrical, uniform, skewed, or bimodal
Symmetrical
• refers to data in which both sides
are (more or less) the same when
the graph is folded vertically down
the middle
• bell-shaped is a special type
• has a center mound with two
sloping tails
Uniform
• refers to data in which every class
has equal or approximately equal
frequency
Skewed
• refers to data in which one side (tail) is longer than the other side
• the direction of skewness is on the side of the longer tail
• the directions are positively (or right) skewed or negatively (or
left) skewed
4. Unusual occurrences

• Outlier - value that lies away from the rest of the data
• Gaps
• Clusters
Stem-and-Leaf Displays

When to Use: Univariate numerical data

How to construct
• Select one or more of the leading digits for the
stem
• List the possible stem values in a vertical column
• Record the leaf for each observation beside each
corresponding stem value
• Indicate the units for stems and leaves in a key
or legend
To describe
– comment on the center, spread, and shape of the
distribution and if there are any unusual features
The following data are price per ounce for various brands of dandruff
shampoo at a local grocery store.
0.32 0.21 0.29 0.54 0.17 0.28 0.36 0.23

Create a stem-and-leaf display with this data.

Stem | Leaf
1    | 7
2    | 1 9 8 3
3    | 2 6
4    |
5    | 4
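
The construction steps can be sketched in code for the shampoo prices. Here the stem is taken to be the tenths digit and the leaf the hundredths digit, so a stem of 2 with a leaf of 1 stands for 0.21; the function name `stem_and_leaf` and the `leaf_unit` parameter are assumptions made for this example.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=0.01):
    """Group values by stem; leaves keep the order in which they occur."""
    stems = defaultdict(list)
    for v in values:
        scaled = round(v / leaf_unit)    # e.g. 0.32 -> 32
        stem, leaf = divmod(scaled, 10)  # 32 -> stem 3, leaf 2
        stems[stem].append(leaf)
    # List every possible stem between the smallest and largest,
    # including stems that have no leaves.
    return {s: stems.get(s, []) for s in range(min(stems), max(stems) + 1)}

prices = [0.32, 0.21, 0.29, 0.54, 0.17, 0.28, 0.36, 0.23]
display = stem_and_leaf(prices)
for stem, leaves in display.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

The empty stem 4 is kept on purpose: it makes the gap between the bulk of the prices and the value 0.54 visible.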
Histograms

When to Use: Univariate numerical data

How to construct (discrete data)
― Draw a horizontal scale and mark it with the possible
values for the variable
― Draw a vertical scale and mark it with frequency or
relative frequency
― Above each possible value, draw a rectangle centered
at that value with a height corresponding to its
frequency or relative frequency
To describe
– comment on the center, spread, and shape of the
distribution and if there are any unusual features
Example
Queen honey bees mate shortly after they become adults. During a
mating flight, the queen usually takes several partners, collecting
sperm that she will store and use throughout the rest of her life. A
study on honey bees provided the following data on the number of
partners for 30 queen bees.
12 2 4 6 6 7 8 7 8 11
8 3 5 6 7 10 1 9 7 6
9 7 5 4 7 4 6 7 8 10
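
Before drawing the histogram, the frequency of each possible value has to be tallied. A minimal sketch with `collections.Counter`, using the 30 observations above and printing a rough text version of the histogram:

```python
from collections import Counter

partners = [
    12, 2, 4, 6, 6, 7, 8, 7, 8, 11,
    8, 3, 5, 6, 7, 10, 1, 9, 7, 6,
    9, 7, 5, 4, 7, 4, 6, 7, 8, 10,
]

counts = Counter(partners)
n = len(partners)

# One row per possible value; the bar length is the frequency.
for value in range(min(partners), max(partners) + 1):
    freq = counts[value]
    print(f"{value:2d} {'#' * freq:8s} {freq / n:.3f}")
```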
Example
A study examined the number of hours spent watching TV per day
for a sample of children age 1 and for a sample of children age 3.
Below are comparative histograms.

[Comparative histograms: Children Age 1 and Children Age 3]

Histograms with unequal intervals

When to use
- when you have a concentration of data in the
middle with some extreme values
How to construct
- construct similar to histograms with continuous
data, but with density on the vertical axis

$$\text{density} = \frac{\text{relative frequency for interval}}{\text{width of interval}}$$
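
A short sketch of the density calculation follows. The intervals and relative frequencies are made-up values chosen only to show unequal widths; they do not come from the course data.

```python
# Hypothetical intervals: (lower bound, upper bound, relative frequency).
intervals = [
    (0, 5, 0.10),
    (5, 10, 0.30),
    (10, 20, 0.40),
    (20, 50, 0.20),
]

# density = relative frequency for interval / width of interval
densities = [rel / (upper - lower) for lower, upper, rel in intervals]

for (lower, upper, rel), d in zip(intervals, densities):
    print(f"{lower:2d} to <{upper:2d}  rel. freq. {rel:.2f}  density {d:.4f}")
```

With density on the vertical axis, the area of each rectangle (density × width) equals the interval's relative frequency, so wide intervals are not visually exaggerated.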
Cumulative Relative Frequency Plot

When to use
- used to answer questions about percentiles (a value with a given percent of
observations at or below that value)

How to construct
- Mark the boundaries of the intervals on the horizontal axis
- Draw a vertical scale and mark it with cumulative relative frequency
- Plot the point corresponding to the upper end of each interval with its
cumulative relative frequency, including the beginning point
- Connect the points.
The National Climatic Center has been collecting weather data for many
years. The annual rainfall amounts for Albuquerque, New Mexico from
1950 to 2008 were used to create the frequency distribution below.
Annual Rainfall   Relative    Cumulative relative
(in inches)       frequency   frequency
4 to <5           0.052       0.052
5 to <6           0.103       0.155  (= 0.052 + 0.103)
6 to <7           0.086       0.241  (= 0.155 + 0.086)
7 to <8           0.103
8 to <9           0.172
9 to <10          0.069
10 to <11         0.207
11 to <12         0.103
12 to <13         0.052
13 to <14         0.052

Continue this pattern to complete the table.
Annual Rainfall   Relative    Cumulative relative
(in inches)       frequency   frequency
4 to <5           0.052       0.052
5 to <6           0.103       0.155
6 to <7           0.086       0.241
7 to <8           0.103       0.344
8 to <9           0.172       0.516
9 to <10          0.069       0.585
10 to <11         0.207       0.792
11 to <12         0.103       0.895
12 to <13         0.052       0.947
13 to <14         0.052       0.999
What proportion of years had rainfall amounts that were 9.5 inches or less?

Approximately 0.55

[Plot: cumulative relative frequency (0 to 1.0) against rainfall (2 to
14 inches)]
Approximately 30% of the years had annual rainfall less than what amount?

Approximately 7.5 inches

[Plot: cumulative relative frequency (0 to 1.0) against rainfall (2 to
14 inches)]
Which interval of rainfall amounts had a larger proportion of years –
9 to 10 inches or 10 to 11 inches? Explain.

The interval 10 to 11 inches, because its slope is steeper, indicating a
larger proportion occurred.

[Plot: cumulative relative frequency (0 to 1.0) against rainfall (2 to
14 inches)]
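
The three readings from the cumulative relative frequency plot can be approximated numerically. The sketch below rebuilds the cumulative column from the relative frequencies and interpolates linearly between the plotted points, which is what connecting the points with straight lines amounts to. The function names are illustrative only.

```python
from itertools import accumulate

upper_bounds = [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
rel_freq = [0.052, 0.103, 0.086, 0.103, 0.172, 0.069, 0.207, 0.103, 0.052, 0.052]

# Plotted points: the beginning point (4, 0), then each upper bound
# paired with its cumulative relative frequency.
cumulative = list(accumulate(rel_freq))
points = list(zip([4] + upper_bounds, [0.0] + cumulative))

def proportion_at_or_below(x):
    """Read the cumulative relative frequency at rainfall x."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x is outside the plotted range")

def rainfall_at_proportion(p):
    """Read the rainfall amount at cumulative relative frequency p."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= p <= y1:
            return x0 + (x1 - x0) * (p - y0) / (y1 - y0)
    raise ValueError("p is outside the plotted range")

print(round(proportion_at_or_below(9.5), 2))   # proportion at or below 9.5 inches
print(round(rainfall_at_proportion(0.30), 1))  # rainfall at the 30th percentile
print(rel_freq[5], rel_freq[6])                # 9 to <10 versus 10 to <11
```

The interpolated values agree with the approximate answers read from the plot (about 0.55 and about 7.5 inches), and the larger relative frequency for 10 to <11 inches corresponds to the steeper segment.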
Scatterplots

When to Use: Bivariate numerical data

How to construct
- Draw a horizontal scale and mark it with appropriate values of the
independent variable
- Draw a vertical scale and mark it with appropriate values of the
dependent variable
- Plot each point corresponding to the observations
To describe
- comment on the relationship between the variables
Time Series Plots

When to Use
- measurements collected over time at regular intervals
How to construct
- Draw a horizontal scale and mark it with appropriate values
of time
- Draw a vertical scale and mark it with appropriate values of the
observed variable
- Plot each point corresponding to the observations and
connect the points
To describe
- comment on any trends or patterns over time
Group project

Group Project: 20% (group presentation 10% and term paper 10%;
individual participation 2%). Students will first be divided into 10-12
small groups (4-6 students per group).

Each small group will conduct a forum on a topic of its choice. Your group
will select one type of dataset (such as air pollutant concentration,
weather data, power data, or others). Group members will work together to
prepare a 15-minute presentation and a term paper (1500 words + 4 figures)
about data analysis. Each project should first introduce the environmental
datasets or historical events and discuss the types of datasets, focusing
especially on collecting, analyzing, and drawing conclusions from data.
Topics

Defining the problem:

Reducing the threat of acid rain to our environment

- Cause of acid rain: sulfuric and nitric acids
- Sources of acidic components of rain: hydrocarbon fuels, which spew
sulfur and nitric oxide into the atmosphere when burned.
- Solutions to the problem: strive for a ~50% reduction in sulfur oxide
emissions; develop new technology to allow us to use available energy
sources; develop alternative cleaner energy.

In China, high-sulfur coal is a major source of these emissions, but
because the country depends on coal for energy, a shift to lower-sulfur
coal is not always possible.

Statistics will play a key role in monitoring atmospheric conditions,
testing the effectiveness of proposed emission control devices…
Topics

Defining the problem:

Ozone exposure and population density

- Ambient ozone pollution causes damage to the human respiratory system,
agricultural crops, and trees.

Ozone (O3): higher concentrations, longer exposure and greater activity
levels cause greater effects.

O3 affects the respiratory system: it irritates the mucous membranes of
the nose, throat and airways. Symptoms are cough, chest pain, and throat
and eye irritation.

O3 increases susceptibility to respiratory infection, impairs normal lung
functioning and induces respiratory inflammation.

Healthy individuals who exercise heavily for 1 to 2 hours may experience
respiratory symptoms at levels exceeding 240 µg/m³; they may experience
these symptoms at a lower concentration after 6 to 8 hours of moderate
exercise.

Individuals with sensitive respiratory systems (asthma or respiratory
disease) are more susceptible to the effects of O3.
Topics

Defining the problem:

Electricity Supply in Hong Kong

In 2003, the electricity supply of CLP Power was 28,035 million kWh
(CLP Holdings Annual Report 2003, Ten-year Summary: Scheme of
Control Financial and Operating Statistics), while that of HEC was
10,413 million kWh (Hong Kong Electric Holdings Ltd., Units Sold); the
total supply was therefore 38,448 million kWh, or on average about
3.8 × 10^14 J every day.
https://www.clpgroup.com/en
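
The unit conversion behind the daily figure can be checked quickly. The sketch assumes a 365-day year and uses the standard conversion 1 kWh = 3.6 × 10^6 J; the variable names are illustrative.

```python
clp_million_kwh = 28_035   # CLP Power, 2003
hec_million_kwh = 10_413   # HEC, 2003

total_million_kwh = clp_million_kwh + hec_million_kwh

JOULES_PER_KWH = 3.6e6     # 1 kWh = 1000 W x 3600 s
DAYS_PER_YEAR = 365        # assumption: non-leap year

joules_per_year = total_million_kwh * 1e6 * JOULES_PER_KWH
joules_per_day = joules_per_year / DAYS_PER_YEAR

print(total_million_kwh)          # total supply in million kWh
print(f"{joules_per_day:.2e}")    # average energy supplied per day, in J
```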
Topics

Climatological characteristics of TC rainfall in Hong Kong

The day on which a TC has entered the region within 800 km of Hong Kong
is defined as a TC day, and the associated precipitation is treated as
TC-related precipitation.