Lecture 1 introduction

ENVR 5320

Environmental Data Analysis

Lecture 1

Dr. Zhi NING


Division of Environment and Sustainability
The Hong Kong University of Science and Technology
Agenda

• Environmental problems and statistics


• Brief review of statistics
• Get familiar with the Excel tools
• Statistical distribution measures
• Probability distributions

Environmental Problems and Statistics

• The goal of statistics


– to make the discovery process efficient.
• Environmental laws and regulations:
– toxic chemicals;
– water quality criteria;
– air quality criteria.
• Environmental data
– the limit of detection;
– acute and chronic toxicity criteria;
– cancer potency factors.

Environmental Problems and Statistics

• Use statistical tools to understand nature

Structure of course teaching

• Introduction
– engineering problem and statistical method.
• Case study
– introduce a specific environmental example with
real world data
• Method
– give a brief explanation of the statistical method that is
used to prepare the solution.
• Analysis
– show how the data suggest and influence the
method of analysis and give the solution.
Brief Review of Statistics

• Population and sample


– A population is a very large set of N observations
(or data values) from which the sample of n
observations can be imagined to have come.
• Two types of statistics:
– DESCRIPTIVE
• a way of summarizing the complexity of data with a
single number.
– INFERENTIAL
• answer the question, "To what extent can these findings
be GENERALIZED?"

Brief Review of Statistics

• DESCRIPTIVE Statistics
– For one variable ("univariate analysis"):
• Measures of "CENTRAL TENDENCY" (averages) and
of DISPERSION or variance around that average.
• Examples: means, modes, medians, standard
deviation, quartiles, etc.
– For multiple variables:
• The strength of the relationship between two variables
(bivariate analysis) or among a set of variables
(multivariate analysis)
• Example: correlation coefficient
Brief Review of Statistics

• INFERENTIAL Statistics
– Measures of the SIGNIFICANCE of the relationship
between two or more variables. Significance refers to
the probability that the findings could be attributed to
sampling error.
– Appropriate statistics depend on the LEVEL OF
MEASUREMENT OF THE DEPENDENT VARIABLE
(and of the independent variable).
– Examples: t-test, ANOVA (F-ratio)

Let’s Get Familiar with Advanced Excel Tools
• Formula in Excel
• Hidden Developer functions in Excel.
• Practice calculations in Excel data example
• Good practice in using Excel

Excel basics I

• Use of formula
• Use of $ (absolute cell references)
• Use of shortcut to go to cells
• Note the black and white cross
• Plot
• Use of Ctrl + Shift + Enter for array calculation
• Developer tool
• ActiveX

Statistical distribution measures

• Central values
– Arithmetic mean, Geometric mean
– Mode, Median
• Measures of spread
– The range
– The interquartile range (IQR)
– Standard deviation, variance
– Coefficient of variation (CV)
• Quartiles, Quantiles and percentiles

Statistical distribution measures

• Central values
– Arithmetic mean: AVERAGE(a,b,c)
– Geometric mean: GEOMEAN(a,b,c)
– Mode: value with the highest probability of occurrence
– Median: central value of the ordered data: MEDIAN(a,b,c)
• Trimmed mean:
– e.g. the 5 percent trimmed mean is the average of the
data between the 5th and 95th percentiles
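The central values above can be sketched in Python with the standard `statistics` module. The data values are hypothetical, and `trimmed_mean` is an illustrative helper (not an Excel function) that follows the slide's definition of averaging the data between two percentiles.

```python
from statistics import mean, geometric_mean, median, mode, quantiles

# Hypothetical concentrations (mg/L), right skewed by one high value.
data = [1.0, 2.0, 2.0, 4.0, 8.0, 16.0, 64.0]

arith = mean(data)             # counterpart of AVERAGE
geo = geometric_mean(data)     # counterpart of GEOMEAN
med = median(data)             # counterpart of MEDIAN
most_common = mode(data)       # most frequent value


def trimmed_mean(values, percent=5):
    """Average of the values between the p-th and (100 - p)-th percentiles."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 percentile cut points
    low, high = cuts[percent - 1], cuts[100 - percent - 1]
    kept = [v for v in values if low <= v <= high]
    return mean(kept)


print(arith, geo, med, most_common, trimmed_mean(data))
```

With skewed data like this, the arithmetic mean is pulled up by the single high value, while the geometric mean and the median stay close to each other.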
Statistical distribution measures

• Influence of the shape of the data distribution
– Right skewed
• Arithmetic mean is influenced by high values;
• G (the geometric mean) is the same as the median
• G best represents
– “Heavy tails”
• The “heaviness” of the tails depends on the degrees of freedom (df)
• Higher df leads to normal dist.

• Bimodal distribution in nature
• The implications

Statistical distribution measures

• Measures of spread
– The range (MIN and MAX)
– The interquartile range (IQR)
Percentile (array, k)
Quartile (array, 0/1/2/3/4)
IQR = Q3 - Q1; for normal data, 0.7413*(Q3-Q1) estimates σ
– The standard deviation
STDEV(array)

Statistical distribution measures

• Measures of spread
– Variance
VAR(array)

– Coefficient of variation (CV): standard deviation divided by the mean
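A minimal Python sketch of the measures of spread, using hypothetical data. The `method="inclusive"` option is chosen on the assumption that it mirrors the interpolation used by Excel's QUARTILE; the factor 0.7413 rescales the interquartile range so that it estimates σ when the data are normal.

```python
from statistics import mean, stdev, variance, quantiles

# Hypothetical observations.
data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0, 14.0]

data_range = max(data) - min(data)                      # MAX - MIN
q1, q2, q3 = quantiles(data, n=4, method="inclusive")   # quartile cut points
iqr = q3 - q1                                           # interquartile range
robust_sd = 0.7413 * iqr                                # IQR rescaled to estimate sigma
s = stdev(data)                                         # sample standard deviation (n - 1)
var = variance(data)                                    # sample variance, counterpart of VAR
cv = s / mean(data)                                     # coefficient of variation

print(data_range, (q1, q2, q3), iqr, robust_sd, s, var, cv)
```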

Statistical distribution measures

• Measures of spread
– Quartiles, quantiles and percentiles
Quartile (array, 0/1/2/3/4)
Percentile (array, 0.05/0.10/0.95)
– Skewness:
• measure of symmetry of data distribution
Skew (array)
0: symmetric; < 0: left skewed; > 0: right skewed.
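The sign convention for skewness can be checked with a short Python sketch. The function below uses the bias-adjusted sample skewness formula, n / ((n-1)(n-2)) · Σ((x - mean)/s)³, which is the formula documented for Excel's SKEW; the function name and the three data sets are hypothetical.

```python
from statistics import mean, stdev


def sample_skewness(values):
    """Bias-adjusted sample skewness (needs at least 3 values)."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least three values")
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1)
    return n / ((n - 1) * (n - 2)) * sum(((v - m) / s) ** 3 for v in values)


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.0, 2.0, 2.0, 3.0, 10.0]   # long tail of high values
left_skewed = [-v for v in right_skewed]          # mirror image

for name, values in [("symmetric", symmetric),
                     ("right", right_skewed),
                     ("left", left_skewed)]:
    print(name, sample_skewness(values))
```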

Statistical distribution measures

• Frequency distributions
– Identify cutting points to divide the data into
categories. The cutoff points should be chosen to
divide the data fairly evenly.
Frequency (data_array,bin_array)
Press Ctrl + Shift + Enter

No.   Bin   Frequency
1     10    2
2     20    0
3     30    2
4     40    3
5     50    5
6     60    4
7     70    2
8     80    0
9     90    1
10    100   1
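The counting rule behind a frequency table can be sketched in Python. This assumes the usual FREQUENCY convention: each bin value is an upper limit, a value is counted in the first bin whose limit it does not exceed, and one extra count collects values above the last bin. The function name and data are hypothetical and are not the data behind the table above.

```python
from bisect import bisect_left


def frequency(data, bins):
    """Counts per bin: <= bins[0], then (bins[i-1], bins[i]], then > bins[-1]."""
    edges = sorted(bins)
    counts = [0] * (len(edges) + 1)   # one extra slot for values above the last bin
    for value in data:
        # bisect_left gives the index of the first edge that is >= value
        counts[bisect_left(edges, value)] += 1
    return counts


# Hypothetical data and cutting points.
data = [5, 10, 25, 30, 31, 45, 99, 120]
bins = [10, 20, 30, 40, 50]

print(frequency(data, bins))
```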
Statistical distribution measures

• Accuracy, Bias and Precision


– Bias measures systematic errors
– Precision measures the degree of scatter in the
data
– Accuracy is a function of both bias and precision.

A known concentration of 8.00 mg/L.
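A small Python sketch of the three ideas, using the known concentration of 8.00 mg/L and hypothetical repeat measurements. Bias is taken as the average minus the known value and precision as the sample standard deviation. The root mean square error is shown as one common way (an assumption here, not necessarily the lecture's) to combine both into a single accuracy figure.

```python
import math
from statistics import mean, stdev, pvariance

true_value = 8.00                            # known concentration, mg/L
measurements = [7.9, 8.3, 8.1, 8.4, 8.3]     # hypothetical repeat measurements, mg/L

bias = mean(measurements) - true_value       # systematic error
precision = stdev(measurements)              # scatter around the sample average

# Mean square error about the true value equals bias^2 plus the (population) variance.
mse = mean((x - true_value) ** 2 for x in measurements)
rmse = math.sqrt(mse)

print(bias, precision, rmse)
```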

Probability distributions

• The Normal Distribution


– Often called Gaussian distribution
– Characterized completely by N(η, σ²), “a normal
distribution with mean η and variance σ²”.

Read and type Greek letters correctly

• Alt 956 (μ)
• Alt 963 (σ)
• Alt 961 (ρ)
• Alt 960 (π)

https://www.thespruceeats.com/the-greek-alphabet-1705558
Probability distributions

• The Normal Distribution


1. The vertical axis (probability density) is scaled
such that area under the curve is unity (1.0).
2. The standard deviation σ measures the distance
from the mean to the point of inflection.
3. The probability that a positive deviation from the
mean will exceed one σ is 0.1587.
4. Because of symmetry, the probabilities are the
same for negative deviations
5. The chance that a deviation in either direction will
exceed 2σ is 2(0.0228) = 0.0456
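The tail probabilities quoted above can be checked with Python's standard `statistics.NormalDist`. This is only a numerical check of the listed properties; the grid used for the area is an arbitrary choice.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

p_above_1sd = 1 - z.cdf(1)        # positive deviation exceeding one sigma
p_below_minus_1sd = z.cdf(-1)     # same probability for negative deviations
p_above_2sd = 1 - z.cdf(2)        # positive deviation exceeding two sigma
p_outside_2sd = 2 * p_above_2sd   # deviation in either direction exceeding two sigma

# Area under the density, approximated by a simple sum on a fine grid from -8 to 8.
step = 0.01
area = sum(z.pdf(-8 + i * step) for i in range(1601)) * step

print(round(p_above_1sd, 4), round(p_above_2sd, 4), round(p_outside_2sd, 4), area)
```

Doubling the unrounded tail gives about 0.0455; the slide's 0.0456 comes from doubling the rounded value 0.0228.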

Probability distributions

• NORM.DIST(x, mean, standard_dev, cumulative)
– Returns the normal distribution at x for the given η and σ (cumulative if the last argument is TRUE).
– Returns the α value for a given x, η and σ.
• NORM.INV(probability, mean, standard_dev)
– Returns the inverse of the normal cumulative distribution for η and σ.
– Returns the x value for a given α, η and σ.
• NORM.S.DIST(z, cumulative)
– Returns the standard normal distribution at z, with η=0 and σ=1
• NORM.S.INV(probability)
– Returns the inverse of the standard normal cumulative distribution, with η=0 and σ=1

• Cumulative or not?
• Left tailed or right tailed?
• How to generate a normal distribution in Excel?

Probability distributions

• Examples
– A normal distribution with η = 8 mg/L and σ = 1 mg/L;
– Which value has 95% of the data below it?
– What is the probability that a reading is
below 6.4 mg/L?

– How to draw a normal distribution in Excel?
– Use function: NORM.INV(RAND(),8,1)
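The same three questions can be sketched in Python with `statistics.NormalDist`, as a cross-check on the Excel functions. The seed and the number of draws are arbitrary choices; the random draws use the same inverse-CDF idea as NORM.INV(RAND(),8,1).

```python
import random
from statistics import NormalDist, mean, stdev

dist = NormalDist(mu=8.0, sigma=1.0)   # eta = 8 mg/L, sigma = 1 mg/L

x95 = dist.inv_cdf(0.95)    # value with 95% of the data below it
p_below = dist.cdf(6.4)     # probability of a reading below 6.4 mg/L

# Random normal values by pushing uniform random numbers through the inverse CDF.
rng = random.Random(0)      # fixed seed so the run is repeatable
# random() can return exactly 0.0, which inv_cdf rejects, so clamp it slightly above 0.
draws = [dist.inv_cdf(max(rng.random(), 1e-12)) for _ in range(5000)]

print(round(x95, 3), round(p_below, 4), mean(draws), stdev(draws))
```

A histogram of `draws` (for example with FREQUENCY-style bins) gives the familiar bell shape centred near 8 mg/L.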

Probability distributions

• t distribution
– In normal distributions, both η and σ are known;
– In practice, σ is often not known and we use Se to
replace σ:

– Bell shaped and symmetric, but the tails are wider than those of the normal distribution.
– The width of the t distribution depends on the degrees of
freedom.

Gosset, Guinness brewer, 1908; “Student” as pen name
Probability distributions

• Part of the t table as a function of ν and α

Probability distributions

• T.INV(probability, deg_freedom)
– Returns the left-tailed inverse of the Student t distribution
• T.INV.2T(probability, deg_freedom)
– Returns the two-tailed inverse of the Student t distribution

• T.DIST(x, deg_freedom, cumulative)
– Returns the left-tailed Student t distribution
• T.DIST.RT(x, deg_freedom)
– Returns the right-tailed Student t distribution
• T.DIST.2T(x, deg_freedom)
– Returns the two-tailed Student t distribution

• If we enter α as probability and n-1 as deg_freedom, then T.INV.2T
outputs t(n-1, 1-α/2), the (1-α/2)th percentile of a t distribution with n-1
degrees of freedom.
Probability distributions

• Example
– What is the 97.5th percentile of a t distribution with
24 degrees of freedom?
– T.INV.2T(0.05,24) = 2.064
OR -T.INV(0.025,24), OR T.INV(0.975,24)

– What is the probability of a t value larger than 2.064
in a t distribution with 24 degrees of freedom?
– T.DIST.RT(2.064,24) = 0.025
(T.DIST.2T(2.064,24) = 0.05 is the probability that |t| exceeds 2.064)
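Python's standard library has no Student t distribution, so the sketch below swaps in numerical integration of the t density (Simpson's rule) and a bisection search for the inverse. The function names `t_pdf`, `t_cdf` and `t_inv` are hypothetical helpers, and this is not how Excel computes its values. It is only meant to reproduce the two example answers.

```python
import math


def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """Left-tailed probability P(T <= x), integrating the density over [0, |x|]."""
    a = abs(x)
    h = a / steps                      # steps is even, as Simpson's rule requires
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half = total * h / 3               # area between 0 and |x|
    return 0.5 + half if x >= 0 else 0.5 - half


def t_inv(p, df):
    """Left-tailed inverse: the t value with probability p below it (bisection)."""
    lo, hi = -100.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


t975 = t_inv(0.975, 24)                # 97.5th percentile, df = 24
p_right = 1 - t_cdf(2.064, 24)         # right-tailed probability of t > 2.064
p_two_tailed = 2 * p_right             # probability that |t| > 2.064

print(round(t975, 3), round(p_right, 4), round(p_two_tailed, 4))
```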

Distribution of average and variance

• Consider the sampling distribution of the
average when many random samples of size n
are collected from a population
• Sample standard deviation:
s = √[ Σ(yᵢ − ȳ)² / (n − 1) ]
• Standard error of the mean is:
Se = s / √n
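A minimal Python sketch of the two quantities, using the standard definitions (sample standard deviation with n - 1 in the denominator, standard error as s divided by √n). The sample values are hypothetical.

```python
import math
from statistics import mean, stdev

# Hypothetical sample of n = 9 measurements (mg/L).
sample = [7.2, 8.1, 7.9, 8.4, 7.6, 8.8, 7.5, 8.0, 8.5]

n = len(sample)
ybar = mean(sample)

# Sample standard deviation written out from its definition.
s = math.sqrt(sum((y - ybar) ** 2 for y in sample) / (n - 1))

# Standard error of the mean: the standard deviation of the average of n values.
se = s / math.sqrt(n)

print(ybar, s, se)
```

The standard error is always smaller than s for n > 1, which is why an average is known more precisely than a single observation.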

Distribution of average and variance

• Central limit effect:
– If the parent distribution from which the samples come
is normal, the distribution of the average is
normal.
– If the parent distribution is not normal, the
distribution of the average will be more nearly normal
than the parent distribution.
– With increasing sample size n, the
distribution of the average becomes increasingly normal.
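The central limit effect can be illustrated with a small simulation. The sketch below assumes an exponential parent distribution (strongly right skewed) purely as an example; the seed, sample sizes and number of repeats are arbitrary. Skewness near zero is used as a rough indicator of how close to normal the distribution of the average has become.

```python
import random
from statistics import pstdev

rng = random.Random(42)  # fixed seed so the run is repeatable


def sample_means(n, repeats=4000):
    """Averages of `repeats` random samples of size n from an exponential parent."""
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(repeats)]


def skewness(values):
    """Moment-based skewness: 0 for a symmetric distribution."""
    m = sum(values) / len(values)
    m2 = sum((v - m) ** 2 for v in values) / len(values)
    m3 = sum((v - m) ** 3 for v in values) / len(values)
    return m3 / m2 ** 1.5


# Skewness of the distribution of the average for increasing sample size n.
skew_by_n = {n: skewness(sample_means(n)) for n in (1, 5, 30)}

# Spread of the averages for n = 30, to compare with sigma / sqrt(n) = 1 / sqrt(30).
sd_means_30 = pstdev(sample_means(30))

print(skew_by_n, sd_means_30)
```

In theory the skewness of the average of n exponential values is 2/√n, so it shrinks toward 0 as n grows, while the spread of the averages shrinks like σ/√n.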

Distribution of average and variance

• How to estimate the t statistic?
– From a normal parent population to samples with a t
distribution with df = n-1:
t = (ȳ − η) / (s/√n)
– The sample variance s² is distributed as a scaled chi-square
distribution with df = n-1:
(n − 1)s² / σ² ~ χ²(n − 1)

Distribution of average and variance

• Example:
– From Sd to Se: Se = 0.266
– Normal distribution N(8,0.27): NORM.DIST(7.51,8,0.27,1)
– With t = -1.842 and ν = 26: T.DIST(-1.842,26,1)
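The arithmetic behind this example can be sketched in Python from the numbers on the slide: average 7.51, mean 8, standard error 0.266 and 26 degrees of freedom. Reading 26 as ν = n - 1 (so n = 27) and back-calculating the standard deviation are assumptions made here for illustration. The T.DIST step is not reproduced, because the Python standard library has no t distribution.

```python
import math
from statistics import NormalDist

eta = 8.0      # population mean, mg/L
ybar = 7.51    # observed average, mg/L
se = 0.266     # standard error of the mean
df = 26        # degrees of freedom (assumed to be n - 1)

# t statistic: distance of the average from eta in units of the standard error.
t = (ybar - eta) / se

# Normal approximation with the rounded standard error 0.27,
# counterpart of NORM.DIST(7.51, 8, 0.27, 1).
p_normal = NormalDist(mu=eta, sigma=0.27).cdf(ybar)

# Assumption: n = df + 1, so the sample standard deviation would be Se * sqrt(n).
n = df + 1
sd = se * math.sqrt(n)

print(round(t, 3), round(p_normal, 4), round(sd, 3))
```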

Tutorial session
