Lecture 1 introduction

ENVR 5320
Environmental Data Analysis
Lecture 1
Dr. Zhi NING

Division of Environment and Sustainability
The Hong Kong University of Science and Technology
Agenda
• Environmental problems and statistics

• Brief review of statistics
• Get familiar with the Excel tools
• Statistical distribution measures
• Probability distributions
Environmental Problems and Statistics
• The goal of statistics

– to make discovery process efficient.
• Environmental laws and regulations:
– toxic chemicals;
– water quality criteria;
– air quality criteria.
• Environmental data
– the limit of detection;
– acute and chronic toxicity criteria;
– cancer potency factors.
Environmental Problems and Statistics
• Use statistical tools to understand nature
Structure of course teaching
• Introduction
– engineering problem and statistical method.
• Case study
– introduce a specific environmental example with
real world data
• Method
– give a brief explanation of statistical method that is
used to prepare the solution.
• Analysis
– show how the data suggest and influence the
method of analysis and give the solution.
Brief Review of Statistics
• Population and sample

– A population is a very large set of N observations
(or data values) from which the sample of n
observations can be imagined to have come.
• Two types of statistics:
– DESCRIPTIVE
• a way of summarizing the complexity of data with a
single number.
– INFERENTIAL
• answer the question, "To what extent can these findings
be GENERALIZED?"
• DESCRIPTIVE Statistics
– For one variable ("univariate analysis"):
– Measures of "CENTRAL TENDENCY" (averages) and
of DISPERSION or variance around that average.
– Examples: means, modes, medians, standard
deviation, quartiles, etc.
– For multiple variables:

– The strength of relationship between two variables
(bivariate analysis) or among a set of variables
(multivariate analysis)
– Examples: correlation coefficient
• INFERENTIAL Statistics
– Measures of the SIGNIFICANCE of the relationship
between two or more variables. Significance refers to
the probability that the findings could be attributed to
sampling error.
– Appropriate statistics depend on the LEVEL OF
MEASUREMENT OF THE DEPENDENT VARIABLE
(and of the independent variable).
– Example: t-Test, ANOVA (F-ratio)
Let’s get Familiar with Excel Advanced
Tools
• Formula in Excel
• Hidden Developer functions in Excel.
• Practice calculations in Excel data example
• Good practice in using Excel
Excel basics I
• Use of formula
• Use of $
• Use of shortcut to go to cells
• Note the black and white cross
• Plot
• Use of Ctrl + Shift + Enter for array calculation
• Developer tool
• ActiveX
Statistical distribution measures
• Central values
– Arithmetic mean, Geometric mean
– Mode, Median
• Measures of spread
– The range
– The interquartile range (IQR)
– Standard deviation, variance
– Coefficient of variation (CV)
• Quartiles, Quantiles and percentiles
• Central values
– Arithmetic mean: AVERAGE(a,b,c)
– Geometric mean: GEOMEAN(a,b,c)
– Mode: value with highest probability of occurrence
– The median: central value of the ordered data
MEDIAN(a,b,c)
• Trimmed mean:
– e.g. 5 percent trimmed mean is the average of the
data between 5th and 95th percentiles
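The central values listed above can be sketched in Python with the standard `statistics` module. The data values and the helper name `trimmed_mean` are invented for illustration; the trimmed mean follows the slide's description (average of the data between two percentiles), which is only one of several conventions in use.

```python
import statistics

# Hypothetical right-skewed concentrations (mg/L), invented for illustration
data = [1.2, 1.5, 1.5, 1.9, 2.3, 2.8, 3.6, 5.1, 9.4]

arithmetic_mean = statistics.mean(data)            # Excel: AVERAGE
geometric_mean = statistics.geometric_mean(data)   # Excel: GEOMEAN
median = statistics.median(data)                   # Excel: MEDIAN
mode = statistics.mode(data)                       # most frequent value


def trimmed_mean(values, lower_pct=5, upper_pct=95):
    """Average of the values lying between two percentiles (inclusive)."""
    # 99 cut points: cuts[k - 1] is the k-th percentile
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    kept = [v for v in values if low <= v <= high]
    return statistics.mean(kept)


print("arithmetic mean:", round(arithmetic_mean, 3))
print("geometric mean: ", round(geometric_mean, 3))
print("median:         ", median)
print("mode:           ", mode)
print("5% trimmed mean:", round(trimmed_mean(data), 3))
```

For right-skewed data like this, the single high value pulls the arithmetic mean above both the geometric mean and the median.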
• Influence of the shape of the data distribution
– Right skewed distribution:
• Arithmetic mean is influenced by high values;
• G is same as median
• G best represents
– Distribution with “heavy tails”:
• The “heaviness” of the tails depends on degrees of freedom (df)
• Higher df leads to normal dist.
• Bimodal distribution in nature
• The implications
– The range (MIN and MAX)
– The interquartile range (IQR)
PERCENTILE(array, k)
QUARTILE(array, 0/1/2/3/4)
IQR = Q3 - Q1
Normalized IQR = 0.7413*(Q3-Q1), a robust estimate of σ for normally distributed data
– The standard deviation
STDEV(array)
– Variance
VAR(array)
– Coefficient of variation (CV)
CV = standard deviation / mean
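As a rough illustration of the measures of spread, the sketch below computes them in Python. The data are hypothetical, and `method="inclusive"` is chosen on the assumption that it matches the interpolation used by Excel's QUARTILE function.

```python
import statistics

# Hypothetical concentrations (mg/L), invented for illustration
data = [6.9, 7.4, 7.8, 7.9, 8.0, 8.1, 8.3, 8.6, 9.2]

data_range = max(data) - min(data)                  # MAX - MIN

# Quartiles with inclusive interpolation
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                                       # interquartile range
normalized_iqr = 0.7413 * iqr                       # robust estimate of sigma

std_dev = statistics.stdev(data)                    # sample standard deviation
variance = statistics.variance(data)                # sample variance
cv = std_dev / statistics.mean(data)                # coefficient of variation

print("range:", round(data_range, 3))
print("Q1, Q3, IQR:", q1, q3, round(iqr, 3))
print("normalized IQR:", round(normalized_iqr, 3))
print("standard deviation:", round(std_dev, 3))
print("variance:", round(variance, 3))
print("CV:", round(cv, 3))
```

The factor 0.7413 is 1/1.349, because the IQR of a normal distribution spans about 1.349 standard deviations.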
– Quartiles, quantiles and percentiles
QUARTILE(array, 0/1/2/3/4)
PERCENTILE(array, 0.05/0.10/0.95)
– Skewness:
• measure of symmetry of data distribution
SKEW(array)
0 is symmetric; <0, left skewed; >0, right skewed.
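The sign convention for skewness can be checked with a short sketch. The function name `excel_style_skew` is hypothetical; it implements the bias-corrected sample skewness formula that Excel documents for SKEW, and the three data sets are invented.

```python
import statistics


def excel_style_skew(values):
    """Bias-corrected sample skewness: n/((n-1)(n-2)) * sum(((x - mean)/s)^3)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)          # sample standard deviation
    cubed = sum(((v - mean) / s) ** 3 for v in values)
    return n / ((n - 1) * (n - 2)) * cubed


# Hypothetical data sets, invented for illustration
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]            # long tail towards high values
left_skewed = [-v for v in right_skewed]      # mirror image

for name, values in [("symmetric", symmetric),
                     ("right skewed", right_skewed),
                     ("left skewed", left_skewed)]:
    print(name, round(excel_style_skew(values), 3))
```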
• Frequency distributions
– Identify cutting points to divide the data into
categories. The cutoff points should be chosen to
divide the data fairly evenly.
FREQUENCY(data_array, bins_array)
Press Ctrl + Shift + Enter (array formula)

No.   Bin   Frequency
1     10    2
2     20    0
3     30    2
4     40    3
5     50    5
6     60    4
7     70    2
8     80    0
9     90    1
10    100   1
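The binning rule behind a frequency table can be sketched as follows. The function name `frequency` and the data are hypothetical; the rule assumed here is the one Excel documents for FREQUENCY, where each bin counts values above the previous limit and up to (including) its own limit, with one extra count for values above the last bin.

```python
from bisect import bisect_left


def frequency(data, bins):
    """Count values per bin; the final extra count holds values above the top bin."""
    edges = sorted(bins)
    counts = [0] * (len(edges) + 1)
    for value in data:
        # bisect_left gives the first edge that is >= value,
        # so a value equal to an edge falls into that edge's bin
        counts[bisect_left(edges, value)] += 1
    return counts


# Hypothetical data and bin limits, invented for illustration
data = [5, 10, 25, 30, 31, 40, 95, 120]
bins = [10, 20, 30, 40]

for limit, count in zip(bins + ["above"], frequency(data, bins)):
    print(limit, count)
```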
• Accuracy, Bias and Precision

– Bias measures systematic errors
– Precision measures the degree of scatter in the
data
– Accuracy is a function of both bias and precision.
A known concentration of 8.00 mg/L.
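A minimal sketch of how bias and precision could be quantified against the known concentration of 8.00 mg/L. The two sets of measurements are invented, and using the root mean squared error as a combined accuracy measure is an assumption made here, not something stated on the slide.

```python
import math
import statistics

true_value = 8.00  # known concentration in mg/L, from the slide

# Hypothetical repeated measurements, invented for illustration
analyst_a = [8.4, 8.5, 8.6, 8.5, 8.5]   # tight scatter, but offset from 8.00
analyst_b = [7.2, 8.9, 8.1, 7.6, 8.2]   # centred on 8.00, widely scattered


def bias_and_precision(measurements, reference):
    """Bias = mean minus reference; precision = sample standard deviation."""
    bias = statistics.mean(measurements) - reference
    precision = statistics.stdev(measurements)
    return bias, precision


def rmse(measurements, reference):
    """Root mean squared error, one way to combine bias and scatter."""
    return math.sqrt(sum((m - reference) ** 2 for m in measurements)
                     / len(measurements))


for name, values in [("A", analyst_a), ("B", analyst_b)]:
    bias, precision = bias_and_precision(values, true_value)
    print(name, "bias:", round(bias, 3), "precision:", round(precision, 3),
          "RMSE:", round(rmse(values, true_value), 3))
```

Analyst A is precise but biased; analyst B is unbiased but imprecise. Neither is fully accurate.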
Probability distributions
• The Normal Distribution

– Often called Gaussian distribution
– Characterized completely by N(η, σ²), "a normal
distribution with mean η and variance σ²."
Read and type Greek letters correctly
• Alt 956 (μ)
• Alt 963 (σ)
• Alt 961 (ρ)
• Alt 960 (π)
https://www.thespruceeats.com/the-greek-alphabet-1705558
• The Normal Distribution

1. The vertical axis (probability density) is scaled
such that area under the curve is unity (1.0).
2. The standard deviation σ measures the distance
from the mean to the point of inflection.
3. The probability that a positive deviation from the
mean will exceed one σ is 0.1587.
4. Because of symmetry, the probabilities are the
same for negative deviations.
5. The chance that a deviation in either direction will
exceed 2σ is 2(0.0228) = 0.0456
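The tail probabilities quoted in points 3 to 5 can be checked with Python's `statistics.NormalDist`; this is only a verification sketch, and the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Point 3: probability that a positive deviation exceeds one sigma
p_above_1_sigma = 1 - standard_normal.cdf(1.0)

# Point 4: by symmetry, the same probability applies below minus one sigma
p_below_minus_1_sigma = standard_normal.cdf(-1.0)

# Point 5: probability that a deviation in either direction exceeds two sigma
p_outside_2_sigma = 2 * (1 - standard_normal.cdf(2.0))

print(round(p_above_1_sigma, 4))
print(round(p_below_minus_1_sigma, 4))
print(round(p_outside_2_sigma, 4))
```

The two-sided value comes out near 0.0455; the 0.0456 on the slide results from doubling the rounded table value 0.0228.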
• NORM.DIST(x, mean, standard_dev, cumulative)
– Returns the normal distribution at x for the specified η and σ: the cumulative probability if cumulative is TRUE, the probability density if FALSE.
– Returns the α value for given x, η and σ values.
• NORM.INV(probability, mean, standard_dev)
– Returns the inverse of the normal cumulative distribution for η and σ.
– Returns the x value for given α, η and σ values.
• NORM.S.DIST(z, cumulative)
– Returns the standard normal distribution (η=0 and σ=1) at z
• NORM.S.INV(probability)
– Returns the inverse of the standard normal cumulative distribution (η=0 and σ=1)
• Cumulative or not?
• Left tailed or right tailed?
• How to generate a normal distribution in Excel?
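To make the "cumulative or not" and "left or right tailed" questions concrete, here is a Python sketch of rough analogues of the Excel functions. The names `norm_dist` and `norm_inv` are hypothetical wrappers around `statistics.NormalDist`, not Excel or library APIs.

```python
from statistics import NormalDist


def norm_dist(x, mean, standard_dev, cumulative):
    """Analogue of NORM.DIST: left-tail probability if cumulative, else density."""
    dist = NormalDist(mean, standard_dev)
    return dist.cdf(x) if cumulative else dist.pdf(x)


def norm_inv(probability, mean, standard_dev):
    """Analogue of NORM.INV: the x value with the given left-tail probability."""
    return NormalDist(mean, standard_dev).inv_cdf(probability)


left_tail = norm_dist(1.96, 0, 1, True)     # P(Z <= 1.96), cumulative
right_tail = 1 - left_tail                  # P(Z > 1.96), right tail
density = norm_dist(0, 0, 1, False)         # height of the curve at z = 0
z_975 = norm_inv(0.975, 0, 1)               # inverse of the left-tail probability

print(round(left_tail, 4), round(right_tail, 4))
print(round(density, 4))
print(round(z_975, 3))
```

The cumulative functions always report the left tail, so a right-tail probability is obtained as one minus the returned value.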
• Examples
– A normal distribution with η = 8 mg/L and σ = 1 mg/L;
– Which value has 95% of the data below it?
– What is the probability that a value is read
below 6.4 mg/L?
– How to draw a normal distribution in Excel?
– Use function: NORM.INV(RAND(),8,1)
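The example questions can be worked through in Python as a cross-check. The seed value, the sample size and the helper name `draw` are arbitrary choices made for this sketch; the random draw mirrors the inverse-transform idea of NORM.INV(RAND(), 8, 1).

```python
import random
from statistics import NormalDist, mean, stdev

concentration = NormalDist(mu=8.0, sigma=1.0)    # eta = 8 mg/L, sigma = 1 mg/L

value_95 = concentration.inv_cdf(0.95)   # value with 95% of the data below it
p_below_6_4 = concentration.cdf(6.4)     # probability of reading below 6.4 mg/L

rng = random.Random(5320)                # fixed seed so the run is repeatable


def draw(dist, generator):
    """One random value: a uniform number pushed through the inverse CDF."""
    u = generator.random()
    while u == 0.0:                      # inv_cdf requires 0 < u < 1
        u = generator.random()
    return dist.inv_cdf(u)


samples = [draw(concentration, rng) for _ in range(10_000)]

print("95th percentile:", round(value_95, 3))
print("P(x < 6.4):", round(p_below_6_4, 4))
print("sample mean and sd:", round(mean(samples), 2), round(stdev(samples), 2))
```

A histogram of `samples` would approximate the bell curve, which is one way to "draw" the distribution.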
• t distribution
– For the normal distribution, both η and σ are assumed known;
– In practice, σ is often not known and we use Se to
replace σ: t = (ȳ − η) / Se
– Bell shaped and symmetric but tails are wider.
– Width of the t distribution depends on degrees of
freedom.
– W. S. Gosset, a Guinness brewer, published it in 1908 with "Student" as pen name.
• Part of the t table as a function of ν and α
• T.INV(probability, deg_freedom)
– Returns the inverse of the left tailed Student t distribution
• T.INV.2T(probability, deg_freedom)
– Returns the inverse of the two tailed Student t distribution
• T.DIST(x, deg_freedom, cumulative)
– Returns the left tailed Student t distribution
• T.DIST.RT(x, deg_freedom)
– Returns the right tailed Student t distribution
• T.DIST.2T(x, deg_freedom)
– Returns the two tailed Student t distribution
• If we enter α as probability and n-1 as Deg_freedom, then T.INV.2T
outputs t(n-1, 1-α/2), the (1-α/2)th percentile of a t distribution with n-1
degrees of freedom.
• Example
– What is the 97.5th percentile of a t distribution with
24 degrees of freedom?
– T.INV.2T(0.05, 24) = 2.064
OR -T.INV(0.025, 24)
– What is the probability of a t value larger than 2.064
in a t distribution with 24 degrees of freedom?
– T.DIST.RT(2.064, 24) = 0.025
(T.DIST.2T(2.064, 24) = 0.05 is the probability that |t| exceeds 2.064)
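The example can be cross-checked in Python. The standard library has no t distribution, so this sketch integrates the t density numerically with Simpson's rule and inverts it by bisection. The function names `t_pdf`, `t_cdf` and `t_inv` are hypothetical, and this is not how Excel computes its values.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """Left-tail probability P(T <= x), by Simpson's rule on [0, |x|]."""
    a = abs(x)
    h = a / steps                          # steps must be even
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3                   # P(0 <= T <= |x|)
    return 0.5 + area if x >= 0 else 0.5 - area


def t_inv(probability, df):
    """Percentile of the t distribution, found by bisection on t_cdf."""
    low, high = -100.0, 100.0
    for _ in range(60):
        mid = (low + high) / 2
        if t_cdf(mid, df) < probability:
            low = mid
        else:
            high = mid
    return (low + high) / 2


t_975 = t_inv(0.975, 24)                   # 97.5th percentile, df = 24
right_tail = 1 - t_cdf(2.064, 24)          # P(T > 2.064)
two_tail = 2 * right_tail                  # P(|T| > 2.064)

print(round(t_975, 3))
print(round(right_tail, 4), round(two_tail, 4))
```

The one-tailed and two-tailed probabilities differ by a factor of two, which is why the choice between T.DIST.RT and T.DIST.2T matters.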
Distribution of average and variance
• Consider the sampling distribution of the
average, when many random samples of size n
are collected from a population
• Sample standard deviation: Sd = √( Σ(yi − ȳ)² / (n − 1) )
• Standard error of the mean is: Se = Sd / √n
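A short sketch of the two quantities, assuming the usual definitions (sample standard deviation with n - 1 in the denominator, standard error equal to the standard deviation divided by the square root of n). The sample values are invented for illustration.

```python
import math
import statistics

# Hypothetical sample of n measurements (mg/L), invented for illustration
sample = [7.6, 8.3, 7.9, 8.4, 7.8, 8.1, 8.5, 7.4]

n = len(sample)
y_bar = statistics.mean(sample)

# Sample standard deviation: squared deviations divided by n - 1
sd = math.sqrt(sum((y - y_bar) ** 2 for y in sample) / (n - 1))

# Standard error of the mean: standard deviation of the average of n values
se = sd / math.sqrt(n)

print("mean:", round(y_bar, 3))
print("Sd:", round(sd, 3))
print("Se:", round(se, 3))
```

The standard error is always smaller than the standard deviation for n > 1, because averaging reduces scatter.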
• Central limit effect:

– If parent distribution where the samples come
from is normal, the distribution of average is
normal
– If the parent distribution is not normal, the
distribution of average will be more nearly normal
than the parent one.
– With increasing sample size n, the
distribution of the average becomes increasingly more normal.
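The central limit effect can be illustrated with a small simulation. The exponential parent distribution, the seed, the sample sizes and the number of repeats are all arbitrary choices for this sketch; skewness is used as a simple indicator of departure from normality.

```python
import random
import statistics


def skewness(values):
    """Population skewness: mean cubed standardized deviation."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def sample_means(n, repeats, rng):
    """Averages of `repeats` random samples of size n from a skewed parent."""
    return [statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
            for _ in range(repeats)]


rng = random.Random(1)                     # fixed seed for a repeatable run
results = {}
for n in (1, 5, 30):
    results[n] = skewness(sample_means(n, 4000, rng))
    print("n =", n, "skewness of the averages:", round(results[n], 2))
```

The parent distribution is strongly right skewed, and the skewness of the averages shrinks towards zero as n grows.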
• How to estimate the t statistic?

– From a normal parent population to samples with a t
distribution with df = n-1: t = (ȳ − η) / (s/√n)
– The sample variance s² follows a scaled chi-square
distribution: (n − 1)s²/σ² is chi-square distributed with n-1 degrees of freedom
• Example:
– From Sd to Se: Se = 0.266
– Normal distribution N(8, 0.27): NORM.DIST(7.51,8,0.27,1)
– With t = -1.842 and ν = 26: T.DIST(-1.842,26,1)
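The numbers in this example can be reproduced approximately in Python. The sample average of 7.51, η = 8, Se = 0.266 and ν = 26 are read off the slide; the function name `t_cdf` and the numerical integration are assumptions of this sketch, since the standard library has no t distribution.

```python
import math
from statistics import NormalDist

eta = 8.0       # reference mean (mg/L)
y_bar = 7.51    # sample average (mg/L)
se = 0.266      # standard error of the mean, from the slide
df = 26         # degrees of freedom

t_stat = (y_bar - eta) / se                      # about -1.842

# Normal approximation, as in NORM.DIST(7.51, 8, 0.27, 1)
p_normal = NormalDist(eta, 0.27).cdf(y_bar)


def t_cdf(x, dof, steps=2000):
    """Left-tail t probability by Simpson's rule (numerical sketch)."""
    log_c = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
             - 0.5 * math.log(dof * math.pi))

    def pdf(u):
        return math.exp(log_c - (dof + 1) / 2 * math.log1p(u * u / dof))

    h = abs(x) / steps
    total = pdf(0.0) + pdf(abs(x))
    total += sum((4 if i % 2 else 2) * pdf(i * h) for i in range(1, steps))
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


# t distribution, as in T.DIST(-1.842, 26, 1)
p_t = t_cdf(t_stat, df)

print("t statistic:", round(t_stat, 3))
print("normal left tail:", round(p_normal, 4))
print("t left tail:", round(p_t, 4))
```

The t distribution gives a somewhat larger tail probability than the normal, reflecting its wider tails at 26 degrees of freedom.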
Tutorial session
