Lecture (1) - Statistics

Applied Statistics

By

Dr. Hany Gomaa Ahmed

Associate professor
Irrigation and Hydraulics Department
Faculty of Engineering, Cairo University

Academic Year 2023-2024

Course Outline
Chapter 1: Introduction
Chapter 2: Organizing and Graphing Data
Chapter 3: Basic Probability Concepts
Chapter 4: Random Variables, Probability Distributions
Chapter 5: Common Discrete Probability Distributions
Chapter 6: Common Continuous Probability Distributions
Chapter 7: Sampling Distributions
Chapter 8: Confidence Intervals
Chapter 9: Fundamentals of Hypothesis: Part I
Chapter 10: Fundamentals of Hypothesis: Part II
Chapter 11: Linear Regression Analysis
Chapter 12: Linear Regression and Correlation Analysis

1- Introduction

Why an Engineer Needs to Know about Statistics
• To know how to properly present information
• To know how to properly interpret information
• To know how to draw conclusions about
populations based on sample information
• To know how to optimize the use of limited
resources (sampling)
• To know how to obtain reliable forecasts

Key Definitions
• A population (universe) is the collection of things under consideration (e.g., grades of 100 students)
• A sample is a portion of the population selected for analysis (e.g., grades of 10 students out of the 100)
• A parameter is a summary measure computed to describe a characteristic of the population (e.g., average grade of all 100 students); it is a constant
• A statistic is a summary measure computed to describe a characteristic of the sample (e.g., mean of grades of a sample of 10 students); it varies from sample to sample


Population and Sample

• Population: use parameters to summarize features
• Sample: use statistics to summarize features
• Descriptive statistics: summarizing features
• Inferential statistics: inference on the population from the sample

Statistical Methods
• Descriptive statistics
– Collecting and describing data
• Inferential statistics
– Drawing conclusions and/or making decisions
concerning a population based only on sample
data


Descriptive Statistics
• Collect data
– e.g., rain depth, temperature, river flow,
compressive strength, … etc.
• Present data
– e.g., Tables and graphs
• Characterize data
– e.g., sample mean: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$

Inferential Statistics
• Estimation
– e.g.: Estimate the population mean
weight using the sample mean
weight
• Hypothesis testing
– e.g.: Test the claim that the
population mean weight is 120
pounds
Drawing conclusions and/or making decisions concerning
a population based on sample results

2. Sampling Concepts

Definitions
Population: the total set of elements of interest for a given problem

1) Finite population: described by the actual distribution of its values
2) Infinite population: described by the corresponding probability distribution or probability density

Sample

• A subset of the population's elements that gives a sense of the population, or from which inferences about the population can be drawn
OR
• A group of units selected from a larger group (the population). By studying the sample, it is hoped to draw valid conclusions about the larger group

Population and Sample

[Figure: a population from which Sample 1, Sample 2, and Sample 3 are drawn]

Random sample: a sample in which all population elements have an equal probability (chance) of being included in the sample.

Thus, samples 1, 2, and 3 have the same chance of being extracted.

Reasons for Sampling


• More economical
• Time saving
• Inaccessible population
• Infinite population

3. Presentation and Analysis of Data

Presentation of Data
• Topics
– Organizing numerical data
• The ordered array
– Tabulating and graphing numerical data
• Grouping of Data
• Frequency distributions: tables, histograms, polygons
• Cumulative distributions: tables, diagrams
– Graphing bivariate numerical data
• Scatter plots
– Numerical descriptive measures

Organizing Numerical Data


Numerical Data 41, 24, 32, 26, 27, 27, 30, 24, 38, 21

Ordered Array
21, 24, 24, 26, 27, 27, 30, 32, 38, 41

Organizing Numerical Data (continued)
• Data in raw form (as collected):
24, 26, 24, 21, 27, 27, 30, 41, 32, 38
• Data in ordered array from smallest to largest:
21, 24, 24, 26, 27, 27, 30, 32, 38, 41
Tabulating and Graphing Numerical Data: Grouping of Data

• When the number of data points is very large, it may be advantageous to group or classify the data
• Grouping condenses the data and makes it easier to extract information (though some information will be lost)

Tabulating and Graphing Numerical Data

Numerical data: 41, 24, 32, 26, 27, 27, 30, 24, 38, 21
Ordered array: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41

[Diagram: numerical data are organized as an ordered array, frequency distributions, or cumulative distributions, and presented as tables, histograms, or polygons]

Describing Numerical Data with Tables

• Frequency Tables
– Simple
– Multiple
• Relative Frequency Tables
– Fraction
– Percentage
• Cumulative Frequency Tables
– More than
– Less than

Steps to Create Frequency Tables


• Sort raw data in ascending order:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

• Find range: 58 - 12 = 46
• Select number of classes: 5 (usually between 5 and 15)
– The smaller the number of classes, the greater the loss
of information
• Compute class interval (width): 10 (46/5 then round up)
• Determine class boundaries (limits): 10, 20, 30, 40, 50, 60
• Compute class midpoints: 15, 25, 35, 45, 55

• Count observations & assign to classes
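
The steps above can be sketched in Python as follows. This is only an illustrative sketch: the function name `frequency_table` is hypothetical, and two choices are assumptions rather than rules from the slide, namely rounding the width up to the next whole number and starting the first class at the largest multiple of the width that does not exceed the minimum.

```python
import math

# Ordered data from the slide above
data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]


def frequency_table(values, n_classes):
    """Return (boundaries, midpoints, counts) for equal-width classes.

    Each class holds values >= its lower boundary and < its upper boundary.
    """
    ordered = sorted(values)
    data_range = ordered[-1] - ordered[0]        # 58 - 12 = 46
    width = math.ceil(data_range / n_classes)    # 46/5 = 9.2, rounded up to 10 (assumed rule)
    start = (ordered[0] // width) * width        # 10 (assumed convenient starting boundary)
    boundaries = [start + j * width for j in range(n_classes + 1)]
    midpoints = [(lo + hi) / 2 for lo, hi in zip(boundaries, boundaries[1:])]
    counts = [0] * n_classes
    for v in ordered:
        j = (v - start) // width                 # index of the class containing v
        if not 0 <= j < n_classes:
            raise ValueError(f"{v} falls outside the class boundaries")
        counts[j] += 1
    return boundaries, midpoints, counts


boundaries, midpoints, counts = frequency_table(data, n_classes=5)
print(boundaries)  # class boundaries
print(midpoints)   # class midpoints
print(counts)      # frequency of each class
```

The guard inside the loop matters because rounding the width does not always guarantee that the chosen number of classes covers the maximum; in that case one more class (or a wider interval) is needed.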



Example
The following are the midterm exam grades for a certain section of 50 students. Arrange the data using tables.
18 19 9 3 12 13 8 17 19 15 7 16 13
13 4 14 18 17 12 11 16 15 17 12 11
12 12 15 16 14 5 17 15 18 19 13 11
9 13 17 12 13 9 18 19 11 6 15 12 9

Data in ordered array:
3 4 5 6 7 8 9 9 9 9 11 11 11 11
12 12 12 12 12 12 12 13 13 13 13 13 13 14
14 15 15 15 15 15 16 16 16 17 17 17 17 17
18 18 18 18 19 19 19 19

1) Range = Max. – Min. = 19 – 3 = 16
2) Select the number of groups (5 to 15). Let it be 5 groups
3) Class interval = Range / number of classes = 16/5 = 3.2, rounded up to 4

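Simple Frequency Table

Each class holds values more than or equal to its lower limit but less than its upper limit (e.g., class 0 – 4: more than or equal to 0 but less than 4).

Class      Bars                   Frequency (f)
0 – 4      |                      1
4 – 8      ||||                   4
8 – 12     |||| ||||              9
12 – 16    |||| |||| |||| ||||    20
16 – 20    |||| |||| |||| |       16
Σ          50                     50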

Relative Frequency Table

Class      Midpoint   Frequency   Frequency as a fraction   % frequency
0 – 4      2          1           0.02                      2
4 – 8      6          4           0.08                      8
8 – 12     10         9           0.18                      18
12 – 16    14         20          0.40                      40
16 – 20    18         16          0.32                      32
Σ                     50          1.0                       100

Cumulative Frequency Table (More than) (more than the lower limit)

Class      Frequency (f)   Class lower limit   Cumulative frequency   Cumulative % frequency
0 – 4      1               >0                  50                     100
4 – 8      4               >4                  49                     98
8 – 12     9               >8                  45                     90
12 – 16    20              >12                 36                     72
16 – 20    16              >16                 16                     32
20 – 24    0               >20                 0                      0
Σ          50

Cumulative Frequency Table (Less than) (less than the upper limit)

Class      Frequency (f)   Class upper limit   Cumulative frequency   Cumulative % frequency
0 – 4      1               <4                  1                      2
4 – 8      4               <8                  5                      10
8 – 12     9               <12                 14                     28
12 – 16    20              <16                 34                     68
16 – 20    16              <20                 50                     100
Σ          50
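
As a rough check on the two cumulative tables, both can be derived from the class frequencies with a running sum. The sketch below uses the frequencies of the grades example; the variable names are illustrative only.

```python
from itertools import accumulate

# Class frequencies of the grades example (classes 0-4, 4-8, 8-12, 12-16, 16-20)
upper_limits = [4, 8, 12, 16, 20]
lower_limits = [0, 4, 8, 12, 16, 20]   # 20 is the lower limit of the empty class 20-24
freq = [1, 4, 9, 20, 16]
n = sum(freq)

# "Less than" table: running total of the frequencies up to each upper limit
less_than = list(accumulate(freq))
less_than_pct = [100 * c / n for c in less_than]

# "More than" table: observations remaining at or above each lower limit
more_than = [n - c for c in [0] + less_than]
more_than_pct = [100 * c / n for c in more_than]

for limit, c, p in zip(upper_limits, less_than, less_than_pct):
    print(f"<{limit}: {c} ({p:.0f}%)")
for limit, c, p in zip(lower_limits, more_than, more_than_pct):
    print(f">{limit}: {c} ({p:.0f}%)")
```

At every class limit the two cumulative frequencies add up to n, which is a quick consistency check on hand-built tables.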
2- Organizing and Graphing Data

Describing Numerical Data With Graphs

• Histogram
• Frequency Polygon
• Frequency Curve
• Less-than and more-than Ogive (Cumulative Frequency Polygon)

Frequency Table

Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class j            Frequency fj   Relative frequency rfj   Percentage
10 but under 20    3              0.15                     15
20 but under 30    6              0.30                     30
30 but under 40    5              0.25                     25
40 but under 50    4              0.20                     20
50 but under 60    2              0.10                     10
Total              20             1                        100

Graphing Numerical Data: The Histogram

[Figure: histogram of the data in the frequency table. Vertical axis: frequency; horizontal axis: class midpoints 15, 25, 35, 45, 55, with bars spanning the class boundaries. There are no gaps between bars.]


Graphing Numerical Data: The Frequency Polygon

[Figure: frequency polygon of the same data. Vertical axis: frequency; horizontal axis: class midpoints.]

Graphing Numerical Data: The Frequency Curve

[Figure: frequency curve of the same data. Vertical axis: frequency; horizontal axis: class midpoints.]

Cumulative Frequency Curve

Create the cumulative frequency table first.
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class   Cumulative frequency   Cumulative relative frequency   Cumulative % frequency
< 20    3                      0.15                            15
< 30    9                      0.45                            45
< 40    14                     0.70                            70
< 50    18                     0.90                            90
< 60    20                     1.00                            100

Can we do "more than" (>)?

Tabulating and Graphing Numerical Data: Example
• Consider the data for the mean flow of a river for the month of May during the period from 1922 to 1971 (see table below). Discharge values are in m3/s.

Year  Discharge    Year  Discharge    Year  Discharge    Year  Discharge    Year  Discharge
1922  3532         1932  2338         1942  1608         1952  1949         1962  2568
1923  2071         1933  1873         1943  1456         1953  1396         1963  1944
1924  4188         1934  1243         1944  1570         1954  1344         1964  2062
1925  2080         1935  2849         1945  2301         1955  1886         1965  3919
1926  2036         1936  2359         1946  1460         1956  1786         1966  2944
1927  2685         1937  3070         1947  1584         1957  1455         1967  2175
1928  1832         1938  1222         1948  1410         1958  3025         1968  2877
1929  1500         1939  2841         1949  1490         1959  1828         1969  3208
1930  2856         1940  2110         1950  1959         1960  1401         1970  4750
1931  3043         1941  2058         1951  1981         1961  2427         1971  1475


Tabulating Numerical Data: Example (Continued)
• Sort raw data in ascending order:
1222, 1243, …, 4750
• Number of observations n = 50
• Minimum discharge is 1222 m3/s
• Maximum discharge is 4750 m3/s
• Find range: 4750 - 1222 = 3528 m3/s
• Select number of classes: 6 (usually between 5 and 15)
• Compute class interval (width): 600 (3528/6 = 588, then round up)
• Determine class boundaries (limits): 1200, 1800, 2400, 3000, 3600, 4200, 4800
• Compute class midpoints: 1500, 2100, 2700, 3300, 3900, 4500
• Count observations & assign to classes

Example (Continued)

Class No. j   Class interval Ij (m3/s)   Description            Frequency fj   Relative frequency rfj
1             (1200, 1800)               1200 but under 1800    16             0.32
2             (1800, 2400)               1800 but under 2400    18             0.36
3             (2400, 3000)               2400 but under 3000    8              0.16
4             (3000, 3600)               3000 but under 3600    5              0.10
5             (3600, 4200)               3600 but under 4200    2              0.04
6             (4200, 4800)               4200 but under 4800    1              0.02
Total                                                           50             1.00
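
The table above can be reproduced from the raw discharges with a short script. This is a sketch under two stated assumptions: the width is rounded up to the next multiple of 100 m3/s, and the first boundary is the largest multiple of the width not exceeding the minimum. Both happen to give the slide's values (600 and 1200); other rounding choices are equally valid.

```python
import math

# Mean May discharge (m3/s), 1922 to 1971, listed decade by decade
discharge = [
    3532, 2071, 4188, 2080, 2036, 2685, 1832, 1500, 2856, 3043,  # 1922-1931
    2338, 1873, 1243, 2849, 2359, 3070, 1222, 2841, 2110, 2058,  # 1932-1941
    1608, 1456, 1570, 2301, 1460, 1584, 1410, 1490, 1959, 1981,  # 1942-1951
    1949, 1396, 1344, 1886, 1786, 1455, 3025, 1828, 1401, 2427,  # 1952-1961
    2568, 1944, 2062, 3919, 2944, 2175, 2877, 3208, 4750, 1475,  # 1962-1971
]

n = len(discharge)
q_min, q_max = min(discharge), max(discharge)
q_range = q_max - q_min                          # 4750 - 1222 = 3528
n_classes = 6
width = math.ceil(q_range / n_classes / 100) * 100   # 588 rounded up to 600 (assumed rule)
start = (q_min // width) * width                      # 1200 (assumed starting boundary)

boundaries = [start + j * width for j in range(n_classes + 1)]
midpoints = [(lo + hi) // 2 for lo, hi in zip(boundaries, boundaries[1:])]

# Count observations: class j holds values >= lower boundary and < upper boundary
freq = [0] * n_classes
for q in discharge:
    freq[(q - start) // width] += 1
rel_freq = [f / n for f in freq]

for (lo, hi), f, rf in zip(zip(boundaries, boundaries[1:]), freq, rel_freq):
    print(f"{lo} but under {hi}: f = {f}, rf = {rf:.2f}")
```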

Example (Continued)

[Figures: frequency histogram and relative frequency histogram of the river discharge data]

Example (Continued)

Area under polygon = Area under histogram

Example (Continued)

Boundary value (m3/s)   Description        Cumulative frequency   Cumulative relative frequency
1,200                   Less than 1,200    0                      0
1,800                   Less than 1,800    16                     0.32
2,400                   Less than 2,400    34                     0.68
3,000                   Less than 3,000    42                     0.84
3,600                   Less than 3,600    47                     0.94
4,200                   Less than 4,200    49                     0.98
4,800                   Less than 4,800    50                     1.00

[Figure: less-than cumulative frequency polygon]

This is called an Ogive: the cumulative frequency polygon and the cumulative frequency curve (smooth Ogive).

What does the "more than" Ogive look like?

More-Than Curve (Ogive)

[Figure: more-than curve. Vertical axis: cumulative % frequency (%F), from 0 to 100; horizontal axis: classes (2, 6, 10, 14, 18, 22).]

Graphing Bivariate Numerical Data

[Figure: scatter plot of bivariate numerical data]

Describing Numerical Data with Numbers: Numerical Descriptive Measures
• Measures of central tendency
– Mean, median, mode
• Measures of variation (or dispersion)
– Range, variance and standard deviation, coefficient of variation
• Measure of shape
– Skewness coefficient
• Measure of association
– Coefficient of correlation

Measures of Central Tendency

Central tendency: average or arithmetic mean

Population mean (parameter): $\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$
Sample mean (statistic): $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$

Mean (Arithmetic Mean)

• Mean (arithmetic mean)
– Sample mean (n = sample size):
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$
– Population mean (N = population size):
$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$

Median
• The value of the variable that divides the ordered data into two equal halves
– Example: 1 3 5 7 9 → Median = 5
– Example: 1 3 5 7 9 24 → Median = (5+7)/2 = 6
• In an ordered array, the median is the “middle” number
– If n or N is odd, the median is the middle number, at position (n+1)/2.
– If n or N is even, the median is the average of the two middle numbers, at positions (n/2) and ((n/2)+1).
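
The odd and even rules can be written as a small function. This is a minimal sketch (the function name `median` is chosen for illustration; Python's standard `statistics.median` does the same job), applied to the two examples from this slide.

```python
def median(values):
    """Median of a list: middle value (odd n) or mean of the two middle values (even n)."""
    if not values:
        raise ValueError("median of empty data is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: 1-based position (n+1)/2 is 0-based index n//2
        return ordered[mid]
    # Even n: 1-based positions n/2 and n/2 + 1 are 0-based indices mid-1 and mid
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([1, 3, 5, 7, 9]))        # odd number of values
print(median([1, 3, 5, 7, 9, 24]))    # even number of values
```

Note that the extreme value 24 in the second example shifts the median only from 5 to 6, whereas it would pull the mean much further.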

Mode
• A measure of central tendency
• Value that occurs most often
• Not affected by extreme values
• There may be no mode
• There may be several modes

[Figure: three dot plots illustrating no mode, one mode (mode = 3), and three modes (modes = 5, 9, 12)]
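
The three cases (no mode, one mode, several modes) can be illustrated with a frequency count. The data sets below are hypothetical, since the values plotted in the original figure are not recoverable, and the convention that data with no repeated value has no mode is the one used on this slide.

```python
from collections import Counter


def modes(values):
    """Return the list of modes; an empty list means there is no mode."""
    if not values:
        raise ValueError("mode of empty data is undefined")
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                      # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)


# Hypothetical data sets chosen to mirror the three cases on the slide
print(modes([0, 1, 2, 3, 4, 5, 6]))            # no mode
print(modes([1, 2, 3, 3, 3, 4, 6]))            # one mode
print(modes([1, 5, 5, 7, 9, 9, 12, 12, 14]))   # three modes
```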


Measures of Variation

• Range
– Relative range
• Variance
– Population variance
– Sample variance
• Standard deviation
– Population standard deviation
– Sample standard deviation
• Coefficient of variation

Range
• Measure of variation
• Difference between the largest and the smallest observations:

$\text{Range} = X_{\text{Largest}} - X_{\text{Smallest}}$

• Ignores the way in which data are distributed

[Figure: two dot plots on the scale 7 to 12 with different distributions; both have Range = 12 - 7 = 5]

Relative Range

$\text{Relative Range} = \frac{\text{Range}}{\text{Mean}} = \frac{X_{\text{Largest}} - X_{\text{Smallest}}}{\text{Mean}}$

Variance

• Important measure of variation
• Shows variation about the mean
– Sample variance:
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
– Population variance:
$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$

Standard Deviation
• Most important measure of variation
• Shows variation about the mean
• Has the same units as the original data
– Sample standard deviation:
$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
– Population standard deviation:
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$
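
The only difference between the sample and population forms is the divisor (n - 1 versus N). The sketch below computes both by hand for the ten-value data set used earlier in the lecture and compares the results with Python's standard `statistics` module; treating the same ten values once as a sample and once as a whole population is purely for illustration.

```python
import math
import statistics

data = [21, 24, 24, 26, 27, 27, 30, 32, 38, 41]

n = len(data)
mean = sum(data) / n                                # 290 / 10 = 29
sum_sq_dev = sum((x - mean) ** 2 for x in data)     # sum of squared deviations = 366

sample_var = sum_sq_dev / (n - 1)    # divisor n - 1 (data treated as a sample)
pop_var = sum_sq_dev / n             # divisor N (data treated as the whole population)
sample_std = math.sqrt(sample_var)
pop_std = math.sqrt(pop_var)

print(f"sample variance = {sample_var:.3f}, sample std = {sample_std:.3f}")
print(f"population variance = {pop_var:.3f}, population std = {pop_std:.3f}")

# The standard library provides the same four measures
print(statistics.variance(data), statistics.pvariance(data))
print(statistics.stdev(data), statistics.pstdev(data))
```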

Coefficient of Variation

$CV = \left(\frac{S}{\bar{X}}\right) \times 100\%$

• Measures relative variation
• Always in percentage (%)
• Shows variation relative to mean
• Is used to compare two or more sets of data measured in different units
• Not suitable if mean is close to zero
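
Because CV is a ratio of two quantities with the same units, it does not change when the data are converted to other units. The sketch below illustrates this with hypothetical lengths expressed in metres and in centimetres; the function name and the data are assumptions for illustration.

```python
import statistics


def coefficient_of_variation(values):
    """CV = (sample standard deviation / sample mean) * 100, in percent."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean * 100


# Hypothetical measurements: the same lengths in metres and in centimetres
lengths_m = [21, 24, 24, 26, 27, 27, 30, 32, 38, 41]
lengths_cm = [100 * x for x in lengths_m]

cv_m = coefficient_of_variation(lengths_m)
cv_cm = coefficient_of_variation(lengths_cm)
print(f"CV (m)  = {cv_m:.2f}%")
print(f"CV (cm) = {cv_cm:.2f}%")   # same relative variation despite different units
```

The standard deviation itself is 100 times larger in centimetres, which is why it cannot be used directly to compare data sets measured in different units.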


Shape of a Distribution
• Skewness Coefficient
– Describes how data is distributed
– Measure of shape

For a population or large sample:
$C_S = \frac{\sqrt{n}\,\sum_{i=1}^{n} (X_i - \bar{X})^3}{\left[\sum_{i=1}^{n} (X_i - \bar{X})^2\right]^{3/2}}$

• Corrected form of CS

For a small sample:
$C_S = \frac{n^2 \sqrt{n}\,\sum_{i=1}^{n} (X_i - \bar{X})^3}{(n - 1)(n - 2)\left[\sum_{i=1}^{n} (X_i - \bar{X})^2\right]^{3/2}}$

Shape of a Distribution
• Symmetric or skewed
– Left-skewed: CS < 0, Mean < Median < Mode
– Symmetric: CS = 0, Mean = Median = Mode
– Right-skewed: CS > 0, Mode < Median < Mean
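
The sign rule can be checked numerically. The sketch below assumes the large-sample form of the coefficient, written as sqrt(n) times the sum of cubed deviations divided by the sum of squared deviations raised to the power 3/2 (equivalent to the third central moment over the 3/2 power of the second); the small-sample correction is not applied, and the data sets are hypothetical.

```python
import math


def skewness(values):
    """Large-sample skewness coefficient: sqrt(n) * sum(d^3) / (sum(d^2))^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)
    sum_cube = sum((x - mean) ** 3 for x in values)
    if sum_sq == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return math.sqrt(n) * sum_cube / sum_sq ** 1.5


def shape(cs, tol=1e-9):
    """Classify a distribution from the sign of its skewness coefficient."""
    if cs > tol:
        return "right-skewed"
    if cs < -tol:
        return "left-skewed"
    return "symmetric"


# Hypothetical data sets
print(shape(skewness([1, 2, 3, 4, 5])))     # deviations cancel in the cubed sum
print(shape(skewness([1, 2, 3, 4, 20])))    # long tail to the right
print(shape(skewness([1, 17, 18, 19, 20]))) # long tail to the left
```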



Descriptive Measures using Grouped Data (Frequency Distribution)
• Sample Mean
$\bar{X} = \frac{\sum_{j=1}^{k} f_j X_{Class_j}}{\sum_{j=1}^{k} f_j} = \frac{1}{n}\sum_{j=1}^{k} f_j X_{Class_j}$
• Sample Variance
$S^2 = \frac{1}{n - 1}\sum_{j=1}^{k} f_j (X_{Class_j} - \bar{X})^2$

$X_{Class_j}$ is the midpoint of class j
$f_j$ is the frequency of class j

Class      Midpoint   f
0 – 4      2          1
4 – 8      6          4
8 – 12     10         9
12 – 16    14         20
16 – 20    18         16
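
Applied to the grades frequency table (midpoints 2, 6, 10, 14, 18 with frequencies 1, 4, 9, 20, 16), the two grouped-data formulas can be evaluated as in the following sketch; the variable names are illustrative.

```python
# Grades example: class midpoints and frequencies
midpoints = [2, 6, 10, 14, 18]
freq = [1, 4, 9, 20, 16]

n = sum(freq)                                                    # 50 observations

# Grouped sample mean: frequency-weighted average of the midpoints
grouped_mean = sum(f * x for f, x in zip(freq, midpoints)) / n   # 684 / 50

# Grouped sample variance: frequency-weighted squared deviations over n - 1
grouped_var = sum(f * (x - grouped_mean) ** 2
                  for f, x in zip(freq, midpoints)) / (n - 1)

print(f"grouped mean = {grouped_mean:.2f}")
print(f"grouped variance = {grouped_var:.3f}")
print(f"grouped standard deviation = {grouped_var ** 0.5:.3f}")
```

These values are approximations: every observation in a class is treated as if it sat at the class midpoint, so they generally differ slightly from the mean and variance of the raw data.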


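Descriptive Measures using Grouped Data (Frequency Distribution)
• Sample Mode
$\text{Mode} = L_1 + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) C$
$C = L_2 - L_1$

L1 and L2 are the lower and upper limits of the modal class (the class with the highest frequency)
Δ1 is the difference between the frequency of the modal class and that of the preceding class
Δ2 is the difference between the frequency of the modal class and that of the following class

[Figure: histogram marking the modal class between L1 and L2]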
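
A sketch of the grouped-mode formula, applied to the grades frequency table, is shown below. It assumes the usual reading of the symbols: Δ1 and Δ2 are the differences between the modal-class frequency and the frequencies of the preceding and following classes, with a missing neighbour counted as frequency 0. The function name is hypothetical.

```python
def grouped_mode(boundaries, freq):
    """Mode = L1 + d1 / (d1 + d2) * C for the class with the highest frequency."""
    j = freq.index(max(freq))                          # modal class (first one if tied)
    prev_f = freq[j - 1] if j > 0 else 0               # assumed 0 if there is no class before
    next_f = freq[j + 1] if j < len(freq) - 1 else 0   # assumed 0 if there is no class after
    d1 = freq[j] - prev_f
    d2 = freq[j] - next_f
    lower, upper = boundaries[j], boundaries[j + 1]    # L1 and L2
    if d1 + d2 == 0:
        return (lower + upper) / 2                     # flat neighbourhood: use the midpoint
    return lower + d1 / (d1 + d2) * (upper - lower)


# Grades example: classes 0-4, 4-8, 8-12, 12-16, 16-20
boundaries = [0, 4, 8, 12, 16, 20]
freq = [1, 4, 9, 20, 16]

# Modal class is 12-16: d1 = 20 - 9 = 11, d2 = 20 - 16 = 4, C = 4
mode = grouped_mode(boundaries, freq)
print(f"grouped mode = {mode:.2f}")
```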



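Descriptive Measures using Grouped Data (Frequency Distribution)
• Sample Median
$\text{Median} = L_1 + \left(\frac{N/2 - \sum f_i}{f_{median}}\right) C$
$\sum f_i$ = cumulative frequency up to L1
$f_{median}$ = frequency of the class containing the median
$C = L_2 - L_1$

[Figure: histogram marking the class that contains the median, between L1 and L2]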
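
The grouped-median formula can be sketched as below, again using the grades frequency table. The median class is taken as the first class whose cumulative frequency reaches N/2; the function name is hypothetical.

```python
def grouped_median(boundaries, freq):
    """Median = L1 + (N/2 - cumulative frequency before L1) / f_median * C."""
    total = sum(freq)
    if total == 0:
        raise ValueError("empty frequency table")
    half = total / 2
    cum_before = 0                                   # cumulative frequency up to L1
    for j, f in enumerate(freq):
        if f > 0 and cum_before + f >= half:         # class j contains the median
            lower, upper = boundaries[j], boundaries[j + 1]
            return lower + (half - cum_before) / f * (upper - lower)
        cum_before += f
    raise ValueError("median class not found")


# Grades example: classes 0-4, 4-8, 8-12, 12-16, 16-20
boundaries = [0, 4, 8, 12, 16, 20]
freq = [1, 4, 9, 20, 16]

# N/2 = 25; 14 observations lie below 12, so the median class is 12-16
median = grouped_median(boundaries, freq)
print(f"grouped median = {median:.2f}")
```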

Coefficient of Correlation
• Measures the strength of the linear relationship between two quantitative variables

$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$
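
A direct translation of the correlation formula is sketched below. The function name and the small data sets are hypothetical; they are chosen so that the perfect positive and perfect negative cases are easy to verify by hand.

```python
import math


def pearson_r(x, y):
    """Coefficient of correlation between two equally long lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when a variable is constant")
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data
x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))    # y increases linearly with x
print(pearson_r(x, [10, 8, 6, 4, 2]))    # y decreases linearly with x
print(pearson_r(x, [2, 1, 4, 3, 5]))     # positive but imperfect relationship
```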

Features of Correlation Coefficient
• Unit free
• Ranges between –1 and 1
• The closer to –1, the stronger the negative linear
relationship
• The closer to 1, the stronger the positive linear relationship
• The closer to 0, the weaker any linear relationship
Scatter Plots of Data with Various Correlation Coefficients

[Figure: five scatter plots of Y against X with r = -1, r = -0.6, r = 0, r = 0.6, and r = 1]
