
Chapter 1: Data Analysis - Descriptive Statistics

Pr. Abdesselam Megnounif Course: ITS1.3- Statistics Calculation and Probability Academic Year 2023-2024
Descriptive statistics: Understanding the basics

Descriptive statistics involve extracting relevant information from a dataset. There are various methods to interpret the data, such as using graphs, tables, or parameters (central tendency, variability, form, scale, etc.).

1.1. Types of Data in Descriptive Statistics

1.2. Presenting Data Using Tables
a. Dataset
b. Frequency distribution
c. Cumulative frequency distribution

1.3. Graphical Representation of Data
a. Bar graph
b. Pie chart
c. Histogram graph
d. Cumulative function

1.4. Measures of Central Tendency
a. Mean
b. Median
c. Mode

1.5. Measures of Dispersion
a. Range
b. Variance
c. Standard Deviation
d. Interquartile range
• Percentile/quantile
• Quartiles

1.1. Types of data

Data type
• Categorical or qualitative data
  • Nominal data
  • Ordinal data
• Numerical or quantitative data
  • Discrete data
  • Continuous data

1.1.1 Types of Categorical Data

a. Nominal Data: there is no natural order between the categories. Examples:
• Eye color,
• Gender (male, female), etc.

b. Ordinal Data: the categories have a natural order, but the size of the difference between consecutive categories is not meaningful. Examples:
• Military ranks (Private, Corporal, Sergeant, Lieutenant, Captain, Colonel),
• Socio-economic status (poor, middle class, rich).

1.1.2 Types of Numerical Data

c. Discrete Data: the measurements are integers. They represent counts, or items that can be counted. Examples:
• Number of people in a family,
• Number of kids in a class.

d. Continuous Data: the measurements can take any value, usually within some range. Examples:
• Height,
• Weight.

1.2. Frequency distribution Table

The purpose of descriptive statistics is to make sense of a collection of data, that is, to summarize a set of data.

One way to start understanding the collected data is to create a frequency table.

1.2.1 Frequency Distribution Table

Example 1.1: Frequency distribution of pH levels in a river water sample.

pH         [6.0, 6.5[  [6.5, 7.0[  [7.0, 7.5[  [7.5, 8.0[  [8.0, 8.5[  [8.5, 9.0[  [9.0, 9.5[  Total
Frequency  5           12          18          25          20          8           2           90

Example 1.2: Frequency distribution of dissolved oxygen concentrations (DOC) in a pond ecosystem.

Create a frequency distribution table based on the following chronological measurements of DOC, taken from 1 to 20 September:

Day (Sept.)  1    2    3    4    5    6    7    8    9    10
DOC (mg/L)   7.8  7.2  6.5  7.1  8.2  7.3  6.9  7.8  7.6  6.8

Day (Sept.)  11   12   13   14   15   16   17   18   19   20
DOC (mg/L)   6.9  7.5  7.0  6.7  7.4  8.0  6.8  7.2  7.3  7.6

Example 1.3:
To understand the distribution of electricity consumption, the electricity usage of a sample of 1000 households is analyzed. The data is presented in the following table:

Electricity consumption (kWh)  Number of households  Relative frequency
100 - 200                      150                   15%
200 - 300                      300                   30%
300 - 400                      250                   25%
400 - 500                      200                   20%
500 - 600                      50                    5%
600 - 700                      30                    3%
700 - 800                      20                    2%
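
As a minimal sketch (Python is used here only for illustration), the relative frequencies of Example 1.3 can be recomputed by dividing each class count by the total number of households. The variable names are illustrative assumptions, not part of the course material.

```python
# Class counts from Example 1.3: households per consumption class (kWh).
counts = {
    "100 - 200": 150,
    "200 - 300": 300,
    "300 - 400": 250,
    "400 - 500": 200,
    "500 - 600": 50,
    "600 - 700": 30,
    "700 - 800": 20,
}

total = sum(counts.values())  # sample size

# Relative frequency = class count / total count.
relative = {label: count / total for label, count in counts.items()}

for label, share in relative.items():
    print(f"{label:>9} kWh  {counts[label]:>4}  {share:.0%}")
```

The relative frequencies always add up to 1 (100%), which is a quick check that no class was forgotten.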

Example 1.4: Frequency distribution of energy consumption in the United States across different sectors:

Sector          Frequency
Residential     21%
Commercial      18%
Transportation  28%
Industrial      22%
Electric Power  11%

Example 1.5: A survey of 2,500 families focuses on the number of children per family in a city.

Number of Kids  Frequency
0               8%
1               12%
2               25%
3               30%
4               15%
5               6%
6               1.5%
7               2%
8               0%
9               0.5%

Example 1.6: Water quality of 50 water samples.

Water Clarity      Frequency
Clear              12
Slightly Cloudy    20
Moderately Cloudy  10
Turbid             8

Example 1.7: Energy sources used in 200 households in a particular city.

Energy Source     Frequency
Electricity       100
Natural Gas       60
Propane           20
Heating Oil       10
Renewable Energy  10

1.2.2. How to construct a frequency table for a series of observations?

Example 1.8: Measurements of the annual flow of the River Nile at Aswan, in 10^8 m^3 (1871-1970).

Yr    Q     Yr    Q     Yr    Q     Yr    Q     Yr    Q
1871  1120  1891  1100  1911  831   1931  781   1951  744
1872  1160  1892  1210  1912  726   1932  865   1952  749
1873  963   1893  1150  1913  456   1933  845   1953  838
1874  1210  1894  1250  1914  824   1934  944   1954  1050
1875  1160  1895  1260  1915  702   1935  984   1955  918
1876  1160  1896  1220  1916  1120  1936  897   1956  986
1877  813   1897  1030  1917  1100  1937  822   1957  797
1878  1230  1898  1100  1918  832   1938  1010  1958  923
1879  1370  1899  774   1919  764   1939  771   1959  975
1880  1140  1900  840   1920  821   1940  676   1960  815
1881  995   1901  874   1921  768   1941  649   1961  1020
1882  935   1902  694   1922  845   1942  846   1962  906
1883  1110  1903  940   1923  864   1943  812   1963  901
1884  994   1904  833   1924  862   1944  742   1964  1170
1885  1020  1905  701   1925  698   1945  801   1965  912
1886  960   1906  916   1926  845   1946  1040  1966  746
1887  1180  1907  692   1927  744   1947  860   1967  919
1888  799   1908  1020  1928  796   1948  874   1968  718
1889  958   1909  1050  1929  1040  1949  848   1969  714
1890  1140  1910  969   1930  759   1950  890   1970  740

Example 1.9: How to construct the frequency distribution of dissolved oxygen concentrations in a pond ecosystem, based on the following chronological measurements:

7.8, 7.2, 6.5, 7.1, 8.2, 7.3, 6.9, 7.8, 7.6, 6.8, 6.9, 7.5, 7.0, 6.7, 7.4, 8.0, 6.8, 7.2, 7.3, 7.6.
The following table shows the Nile River flow measurements sorted in ascending order (read down each column):

456 771 845 944 1100
649 774 846 958 1100
676 781 848 960 1110
692 796 860 963 1120
694 797 862 969 1120
698 799 864 975 1140
701 801 865 984 1140
702 812 874 986 1150
714 813 874 994 1160
718 815 890 995 1160
726 821 897 1010 1160
740 822 901 1020 1170
742 824 906 1020 1180
744 831 912 1020 1210
744 832 916 1030 1210
746 833 918 1040 1220
749 838 919 1040 1230
759 840 923 1050 1250
764 845 935 1050 1260
768 845 940 1100 1370

1.2.2.1 Guidelines for constructing a frequency table

1. Sort the data in ascending order.

2. Determine the range of the observed values (maximum minus minimum).

3. Choose the number of intervals. Intervals should be non-overlapping and of equal length. (The objective is to use enough classes to display the variation in the data, while avoiding many classes that contain too few data points.)

4. The class width should be slightly larger than the ratio: range / number of classes.

5. The first interval should begin a little below the minimum value, and the last interval should end a little above the maximum value.

6. The intervals are called class intervals and their boundaries are called class boundaries.

7. The class mark is the midpoint of a class.
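
The guidelines above can be sketched in a few lines of Python. The example uses only the extremes of the Nile flow series (minimum 456, maximum 1370). The number of classes (8), the rounded width (115) and the starting point (455) are illustrative choices made for this sketch, not values prescribed by the course.

```python
def class_boundaries(lowest, width, k):
    """Return the k + 1 boundaries of k equal-width classes starting at `lowest`."""
    return [lowest + i * width for i in range(k + 1)]


def class_marks(boundaries):
    """Return the midpoint (class mark) of each class."""
    return [(a + b) / 2 for a, b in zip(boundaries, boundaries[1:])]


minimum, maximum = 456, 1370      # extremes of the Nile flow series
k = 8                             # illustrative number of classes
ratio = (maximum - minimum) / k   # range / number of classes = 114.25
width = 115                       # illustrative: slightly larger than the ratio
lowest = 455                      # illustrative: a little below the minimum

bounds = class_boundaries(lowest, width, k)
marks = class_marks(bounds)

print("boundaries:", bounds)
print("class marks:", marks)
```

With these choices the last boundary (1375) lies just above the maximum, so every observation falls in exactly one class.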

1.2.2.2 Some rules commonly used to determine the number of classes

• Sturges' Formula: $K = 1 + \log_2(n)$
• Square Root Rule: $K = \sqrt{n}$
• Rice's Rule: $K = 2\,n^{1/3}$
• Scott's Rule: Bin Width $= 3.49\,\sigma / n^{1/3}$, Number of Classes = (max - min) / Bin Width
• Freedman's Rule (Freedman-Diaconis): Bin Width $= 2 \cdot \mathrm{IQR} \cdot n^{-1/3}$, Number of Classes = (max - min) / Bin Width

Where:
• K: The number of classes.
• n: The total number of data points in the dataset.
• σ: The standard deviation of the dataset.
• Bin Width: The width of each class (bin).
• IQR (Interquartile Range): The range between the 75th percentile and the 25th percentile of the dataset.
• max: The maximum value in the dataset.
• min: The minimum value in the dataset.
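
To compare these rules on real numbers, here is a hedged sketch that applies them to the 50 student weights of Example 1.8. Two choices are assumptions of this sketch rather than part of the course: the sample standard deviation is used for σ, and the quartiles come from Python's `statistics.quantiles` with its default method (other quartile conventions give slightly different IQR values).

```python
import math
import statistics

# Weights (kg) of the 50 students from Example 1.8.
weights = [
    85, 67, 85, 107, 59, 76, 82, 79, 95, 74,
    59, 65, 78, 114, 63, 56, 115, 96, 99, 80,
    95, 55, 75, 65, 63, 66, 105, 65, 66, 54,
    57, 87, 110, 50, 58, 89, 64, 61, 64, 77,
    74, 84, 105, 82, 68, 85, 66, 68, 73, 105,
]
n = len(weights)
data_range = max(weights) - min(weights)

# Rules that give the number of classes directly.
k_sturges = 1 + math.log2(n)
k_sqrt = math.sqrt(n)
k_rice = 2 * n ** (1 / 3)

# Scott's rule gives a bin width; assumption: sample standard deviation.
sigma = statistics.stdev(weights)
width_scott = 3.49 * sigma / n ** (1 / 3)
k_scott = data_range / width_scott

# Freedman-Diaconis rule gives a bin width based on the IQR.
q1, _, q3 = statistics.quantiles(weights, n=4)
iqr = q3 - q1
width_fd = 2 * iqr * n ** (-1 / 3)
k_fd = data_range / width_fd

for name, k in [("Sturges", k_sturges), ("Square root", k_sqrt),
                ("Rice", k_rice), ("Scott", k_scott),
                ("Freedman-Diaconis", k_fd)]:
    # Rounding up to a whole number of classes is a convention, not a rule.
    print(f"{name:<18} K = {k:.2f} -> {math.ceil(k)} classes")
```

The rules rarely agree exactly; they give a starting point, and the final number of classes remains a judgment call.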

1.2.2.3 Creating a frequency table for the data in Example 1.8

Example 1.8: Weight of 50 students in kilograms.

85 67 85 107 59 76 82 79 95 74
59 65 78 114 63 56 115 96 99 80
95 55 75 65 63 66 105 65 66 54
57 87 110 50 58 89 64 61 64 77
74 84 105 82 68 85 66 68 73 105

1st: We sort the dataset in ascending order.

Value  Rank   Value  Rank   Value  Rank
50     1      66     18     85     35
54     2      66     19     85     36
55     3      67     20     85     37
56     4      68     21     87     38
57     5      68     22     89     39
58     6      73     23     95     40
59     7      74     24     95     41
59     8      74     25     96     42
61     9      75     26     99     43
63     10     76     27     105    44
63     11     77     28     105    45
64     12     78     29     105    46
64     13     79     30     107    47
65     14     80     31     110    48
65     15     82     32     114    49
65     16     82     33     115    50
66     17     84     34

2nd: We determine the number of classes (here we choose to apply Sturges' rule):

$K = 1 + \log_2(50) = 1 + \frac{\ln(50)}{\ln(2)} \approx 6.6$

We choose to take the number of classes as k = 7.

3rd: We calculate the range of the dataset:

$\text{Range} = \text{Max} - \text{Min} = 115 - 50 = 65$

4th: We approximate the width of the classes:

$\frac{\text{Range}}{k} = \frac{65}{7} \approx 9.286$

We set the width of the classes to L = 9.3.

5th: We identify the class boundaries $a_0, a_1, \dots, a_6, a_7$:

$a_0 = 50$; $a_1 = 50 + 9.3 = 59.3$; $a_2 = 59.3 + 9.3 = 68.6$; ...

The last boundary will be: $a_7 = 115.1$

6th: We complete the frequency table. As commonly agreed upon in statistical studies, the lower bound of a class interval is included, while the upper bound is excluded.

Weight  50    59.3   68.6   77.9   87.2   96.5   105.8   115.1
Freq       8     14      6     10      4      4       4
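
The six steps can be checked with a short Python sketch. It takes k = 7 classes of width 9.3 starting at the minimum, as in the worked example, and counts each weight in the class whose lower bound is included and upper bound excluded. The variable names are illustrative.

```python
# Weights (kg) of the 50 students from Example 1.8.
weights = [
    85, 67, 85, 107, 59, 76, 82, 79, 95, 74,
    59, 65, 78, 114, 63, 56, 115, 96, 99, 80,
    95, 55, 75, 65, 63, 66, 105, 65, 66, 54,
    57, 87, 110, 50, 58, 89, 64, 61, 64, 77,
    74, 84, 105, 82, 68, 85, 66, 68, 73, 105,
]

k = 7                   # number of classes (Sturges' rule, rounded up)
width = 9.3             # class width, slightly larger than 65 / 7
lowest = min(weights)   # first boundary a_0 = 50

# Boundaries a_0 ... a_7, rounded to one decimal to avoid float noise.
bounds = [round(lowest + i * width, 1) for i in range(k + 1)]

# Count each weight in the class [a_i, a_{i+1}[.
freq = [0] * k
for w in weights:
    for i in range(k):
        if bounds[i] <= w < bounds[i + 1]:
            freq[i] += 1
            break

for i in range(k):
    print(f"[{bounds[i]:>5}, {bounds[i + 1]:>5}[  {freq[i]:>2}")
print("total:", sum(freq))
```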

1.3. Cumulative frequencies
1.3.1. Increasing & Decreasing Cumulative Frequencies
Cumulative frequencies are associated with the class boundaries.
The increasing cumulative frequency (ICF) corresponding to the class boundary $a_i$ is the number of measurements strictly less than that boundary.
Conversely, the decreasing cumulative frequency (DCF) corresponding to the class boundary $a_i$ is the number of measurements greater than or equal to that boundary.

Weight     50    59.3   68.6   77.9   87.2   96.5   105.8   115.1
Frequency     8     14      6     10      4      4       4
ICF        0     8      22     28     38     42     46      50
DCF        50    42     28     22     12     8      4       0

1.3.2. Relative Increasing & Decreasing Cumulative Frequencies (frequencies are in percentages %)

Weight     50    59.3   68.6   77.9   87.2   96.5   105.8   115.1
Frequency     8     14      6     10      4      4       4
RICF       0%    16%    44%    56%    76%    84%    92%     100%
RDCF       100%  84%    56%    44%    24%    16%    8%      0%
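
As a small sketch, the four cumulative rows can be derived from the class frequencies with a running sum. The list names are illustrative; the frequencies are those of the weight table.

```python
from itertools import accumulate

freq = [8, 14, 6, 10, 4, 4, 4]   # class frequencies of the weight table
n = sum(freq)                    # total number of measurements

# ICF at boundary a_i: number of measurements strictly below a_i.
icf = [0] + list(accumulate(freq))

# DCF at boundary a_i: number of measurements at or above a_i.
dcf = [n - c for c in icf]

# Relative versions, as fractions of the total.
ricf = [c / n for c in icf]
rdcf = [c / n for c in dcf]

print("ICF :", icf)
print("DCF :", dcf)
print("RICF:", [f"{r:.0%}" for r in ricf])
print("RDCF:", [f"{r:.0%}" for r in rdcf])
```

At every boundary, ICF + DCF equals the total number of measurements, which gives a simple consistency check.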
1.3. GRAPHICAL DISPLAY

Time series plots
Histograms
Bar plots
Pie charts
Cumulative distribution functions

1.3.1 Time series plot or chronological plot

Example: Nile River Flow

1.3.2 Time series plot: anomalies display

Example: Nile River flow

[Figure: time series plot of the Nile River flow anomalies. Vertical axis: flow discharge in 10^8 m^3, from -600 to 600. Horizontal axis: year, from 1860 to 1980.]
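
The slide does not state how the anomalies are computed. A common convention, assumed in this sketch, is to take each annual flow minus the mean of the whole series, so that positive values mark wetter-than-average years and negative values drier ones. The data are the 100 annual flows of the Nile table, in chronological order.

```python
# Annual Nile flow at Aswan, 1871-1970, in 10^8 m^3 (chronological order).
flows = [
    1120, 1160, 963, 1210, 1160, 1160, 813, 1230, 1370, 1140,
    995, 935, 1110, 994, 1020, 960, 1180, 799, 958, 1140,
    1100, 1210, 1150, 1250, 1260, 1220, 1030, 1100, 774, 840,
    874, 694, 940, 833, 701, 916, 692, 1020, 1050, 969,
    831, 726, 456, 824, 702, 1120, 1100, 832, 764, 821,
    768, 845, 864, 862, 698, 845, 744, 796, 1040, 759,
    781, 865, 845, 944, 984, 897, 822, 1010, 771, 676,
    649, 846, 812, 742, 801, 1040, 860, 874, 848, 890,
    744, 749, 838, 1050, 918, 986, 797, 923, 975, 815,
    1020, 906, 901, 1170, 912, 746, 919, 718, 714, 740,
]
years = list(range(1871, 1971))

# Assumption: anomaly = observed flow minus the mean of the series.
mean_flow = sum(flows) / len(flows)
anomalies = [q - mean_flow for q in flows]

# Show the first few years; a plotting library could draw the full series.
for year, a in list(zip(years, anomalies))[:5]:
    print(f"{year}: {a:+.1f}")
```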

1.3.3 Bar plot or Bar chart

Example 1.11: Two series of data represent the numbers of boys and girls that have a smartphone at the Secondary School from 2012 to 2019. The blue bars represent the number of boys, and the pink bars represent the number of girls.

Year   Number of boys  Number of girls
2012   110             85
2013   185             175
2014   240             225
2015   285             295
2016   305             280
2017   310             315
2018   315             305
2019   315             320
Total  2065            2000

[Figure: bar chart representing the numbers of girls with smartphones, 2012 to 2019. Vertical axis: frequency, 0 to 350.]

[Figure: bar chart representing the numbers of boys with smartphones, 2012 to 2019. Vertical axis: frequency, 0 to 350.]

[Figure: number of girls per year, shown once as frequencies (0 to 350) and once as relative frequencies (0% to 18%).]

[Figure: number of boys per year, shown once as frequencies (0 to 350) and once as relative frequencies (0% to 18%).]
1.3.3.1 Generating a bar plot that displays several variables

[Figure: grouped bar chart representing the numbers of boys and girls with smartphones in both series, 2012 to 2019. Vertical axis: frequency, 0 to 350.]

[Figure: grouped bar chart of the number of boys and the number of girls as relative frequencies, 2012 to 2019. Vertical axis: frequency, 0% to 18%.]
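
The percentage axes in these charts suggest that each yearly count is divided by the total of its own series. The sketch below makes that assumption explicit and recomputes the relative frequencies for both series of Example 1.11. The function and variable names are illustrative.

```python
years = list(range(2012, 2020))
boys = [110, 185, 240, 285, 305, 310, 315, 315]
girls = [85, 175, 225, 295, 280, 315, 305, 320]


def relative(series):
    """Return each value as a fraction of the series total."""
    total = sum(series)
    return [value / total for value in series]


# Assumption: each series is normalized by its own total (2065 and 2000).
rel_boys = relative(boys)
rel_girls = relative(girls)

print("Year   Boys    Girls")
for year, b, g in zip(years, rel_boys, rel_girls):
    print(f"{year}   {b:6.1%}  {g:6.1%}")
```

Working with relative frequencies lets two series with different totals be compared on the same scale.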

1.4 Histogram

Example 1.9: The data refer to a certain type of chemical impurity measured in parts per million in 25 drinking-water samples randomly collected from different areas of a county.

[Figure: two histograms of water impurity over the classes 10.8-15.7, 15.7-20.6, 20.6-25.5, 25.5-30.4 and 30.4-35.3. Left panel: histogram using frequencies (vertical axis from 0 to 8). Right panel: histogram using relative frequencies (vertical axis from 0% to 30%, bars labeled between 12% and 28%).]
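
The bar heights of such a histogram are simply class counts. As a hedged sketch, the code below assumes that the 25 values listed in Exercise 1.6 are the data behind this example, and it takes the class boundaries from the axis labels of the histogram (10.8 to 35.3 in steps of 4.9).

```python
# Impurity measurements (ppm); assumed to be the 25 values of Exercise 1.6.
impurity = [
    11, 19, 24, 12, 20, 29, 15, 31, 21,
    24, 31, 16, 23, 26, 26, 32, 25, 17,
    22, 26, 35, 18, 24, 18, 27,
]

# Class boundaries read from the histogram axis labels.
bounds = [10.8, 15.7, 20.6, 25.5, 30.4, 35.3]

# Count the values in each class [lower, upper[.
freq = [
    sum(1 for x in impurity if lower <= x < upper)
    for lower, upper in zip(bounds, bounds[1:])
]
rel_freq = [f / len(impurity) for f in freq]

for lower, upper, f, r in zip(bounds, bounds[1:], freq, rel_freq):
    print(f"{lower}-{upper}: {f}  ({r:.0%})")
```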

1.5 Cumulative Function

1.5.1 CUMULATIVE FREQUENCY POLYGON

Creating a cumulative frequency polygon for the data in Example 1.8:

Weight     50    59.3   68.6   77.9   87.2   96.5   105.8   115.1
Frequency     8     14      6     10      4      4       4
ICF        0     8      22     28     38     42     46      50
RICF       0%    16%    44%    56%    76%    84%    92%     100%

[Figure: two panels. Left: increasing cumulative frequency polygon (vertical axis: frequency, 0 to 50). Right: increasing cumulative relative frequency polygon (vertical axis: frequency, 0% to 100%). Horizontal axis in both panels: weight in kg, 40 to 120.]

Weight  50    59.3   68.6   77.9   87.2   96.5   105.8   115.1
Freq.      8     14      6     10      4      4       4
ICF     0     8      22     28     38     42     46      50
DCF     50    42     28     22     12     8      4       0
RICF    0%    16%    44%    56%    76%    84%    92%     100%
RDCF    100%  84%    56%    44%    24%    16%    8%      0%

[Figure: two panels. Left: increasing and decreasing cumulative frequency polygons (vertical axis: frequency, 0 to 50). Right: increasing and decreasing relative cumulative frequency polygons (vertical axis: frequency, 0% to 100%). Horizontal axis in both panels: weight in kg, 40 to 120.]

1.5.2 CUMULATIVE FUNCTION

The cumulative function is an extension (red line) of the cumulative relative frequency polygon, starting at 0% for the lowest boundary and reaching 100% for the highest boundary.

[Figure: cumulative function. Vertical axis: frequency, 0% to 100%. Horizontal axis: weight in kg, 40 to 130.]
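
One way to evaluate the cumulative function at any weight is sketched below. It assumes, as the polygon suggests, straight-line interpolation between consecutive class boundaries, a value of 0 below the lowest boundary and 1 (100%) above the highest. The function name `cumulative` is illustrative.

```python
# Class boundaries and increasing cumulative frequencies of the weight table.
bounds = [50, 59.3, 68.6, 77.9, 87.2, 96.5, 105.8, 115.1]
icf = [0, 8, 22, 28, 38, 42, 46, 50]
n = icf[-1]  # total number of students


def cumulative(x):
    """Fraction of students weighing less than x, by linear interpolation."""
    if x <= bounds[0]:
        return 0.0
    if x >= bounds[-1]:
        return 1.0
    for a, b, ca, cb in zip(bounds, bounds[1:], icf, icf[1:]):
        if a <= x < b:
            # Straight line between the points (a, ca) and (b, cb).
            return (ca + (x - a) / (b - a) * (cb - ca)) / n


for weight in (40, 59.3, 80, 130):
    print(f"F({weight}) = {cumulative(weight):.1%}")
```

At a class boundary the function returns the tabulated RICF value, for example 16% at 59.3 kg. Between boundaries the result is only an estimate, since the table no longer records where the individual weights lie inside a class.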

EXERCISES
Test: For each variable, fill in the data type (the first one is completed as an example).

Variable: Type
• Room temperature: continuous
• Time
• The total number of players who participated in a competition
• Opinion on something (agree, disagree, or neutral)
• Nationality (Indian, German, American)
• "Time taken" to finish the work
• Education level (Higher, Secondary, Primary)
• Speed of a vehicle
• Economic status (High, Medium, and Low)
• Market share price
• Favorite holiday destination
• Number of employees in a company
• Scores and marks (e.g. 59, 80, 60, etc.)
• Marital status (Single, Widowed, Married)
• What language do you speak
• Colour of hair (Blonde, Red, Brown, Black, etc.)
• Gender (Male, Female)
• Wi-Fi frequency
• Ranking of people in a competition (First, Second, Third, etc.)
• Letter grades in the exam (A, B, C, D, etc.)
• Cost of a cell phone
• Eye color (Black, Brown, etc.)
• Total number of students present in a class
• Weight of an object
• Height of a person

Exercises Chapter 1

Exercise 1.1
a) Create a frequency table for the data in example 1.2. Include two rows for increasing and decreasing cumulative frequencies.
b) Generate a frequency histogram.
c) Plot the polygon of increasing cumulative frequencies and the polygon of decreasing cumulative frequencies.
d) Deduce the coordinates of the intersection point of the two polygons and provide a commentary on the graph.

Exercise 1.2. For the Nile River flow dataset, create a frequency table and draw the histogram.

Exercise 1.3. Plot the chronological evolution of the DOC from example 1.2.

Exercise 1.4. For the data in Examples 1.4 and 1.5:
A. What type of data is it?
B. What is the total headcount?
C. Create bar charts.

Exercise 1.5. For the data in Examples 1.6 and 1.7:
A. What type of data is it?
B. What is the total frequency?
C. Create pie charts.

Exercise 1.6. The following data refer to a certain type of chemical impurity measured in parts per million in 25 drinking-water samples randomly collected from different areas of a county:

11 19 24 12 20 29 15 31 21
24 31 16 23 26 26 32 25 17
22 26 35 18 24 18 27

What type of data is it?
Make a frequency table displaying class intervals, frequency, relative frequency, and percentages.

Exercise 1.7. For the data in Example 1.8 (weight of 50 students in kilograms):
1. Draw the cumulative function.
2. Deduce the median and the interquartile range.
3. Determine the relative frequency of the students having a weight below 50 kg.
4. Determine the percentage of the students having a weight greater than 50 kg.
5. Determine the percentage of the students having a weight equal to 50 kg.
6. Determine the percentage of the students having a weight between 50 and 75 kg.
7. Determine the percentage of the students having a weight greater than 100 kg.
