Unit 1 - Descriptive Statistics

STAT 210

Probability and Statistics

Unit 1: Descriptive Statistics
 Introduction to Statistics:

 Graphical method:
Bar and pie charts, Histogram

 Summary Statistics:
Measures of location, measures of variability,

Why Statistics?
 Statistics deals with collecting, processing, summarizing,
analyzing and interpreting data. On the other hand,
engineering and industrial management deal with such
diverse issues as solving production problems, effective use
of materials and labor, development of new products,
quality improvement and reliability and, of course, basic
 The field of statistics involves methods for:
1. Designing and carrying out research studies.
2. Describing collected data.
3. Making decisions, predictions, or inferences about
phenomena represented by the data by designing valid
experiments and drawing reliable conclusions.
 Branches of Statistics
1. Descriptive statistics: statistical
methods that summarize and describe the
prominent features of data.
2. Inferential statistics: statistical methods that
generalize results from a sample to a
population.

 UAEU ( average grade of students)
 Compare the grades over the last 10 years

 Collect the grade of students

 Put the grades into table
 Calculate the average
 Graphs

 100 students
 Grades , average = 17

 Estimate the value of the grade average in


 Average is around 17

Sample                   Population
Average                  Average
Percentage               Percentage
Proportion               Proportion
Statistic (estimator)    Parameter

 As it is generally impossible or impractical to find out something about
the entire population, we examine a part of it to make inferences.
 A population is the entire collection of objects or outcomes about
which information is sought.
 A sample is a subset of a population, containing the objects or
outcomes that are actually observed.
 A parameter is a numerical characteristic of a population, which is
usually unknown.
 A statistic is a numerical characteristic computed from the sample. It
varies from sample to sample and is used as an estimate of the
population parameter.

Example: A researcher is interested in measuring the satisfaction of
customers with the internet connection in a certain city. He randomly
sampled 50 customers from a list of subscribers. The population of
interest is all customers in the city, while the sample is the 50 selected
customers.
Data Collection
Besides organizing and analyzing data, statistics
deals with the development of techniques for
collecting the data. If data is not properly collected,
an investigator may not be able to answer the
questions under consideration with a reasonable
degree of confidence.
Observational Studies: The engineer simply observes the
process without disturbing it and records quantities
of interest. Such a study may reveal a relationship between
input and output, but it cannot establish the relationship
between all factors because appropriate changes
were not made.
 Code python
 Observe and Record the result

 Provide the code of python and recode the


Controlled (Designed) Experiments: Measurements
are recorded while controlling some factors that
might influence the results of the study. Measures
the response or output variable of interest.

Surveys: Questionnaires designed to solicit
information from people. Data may be collected by
face-to-face interview, telephone interview, postal
mail, email, or fax.

Simple Random Sampling (SRS)
A simple random sample (SRS) of size n is a sample chosen by a
method in which each collection of n population items is equally
likely to comprise the sample.
 An SRS is not guaranteed to reflect the population perfectly;

 SRSs always differ in some ways from each other;

 Two samples from the same population may vary from each
other. This is known as sampling variation;
 Items in an SRS may be treated as independent in most
cases encountered in practice. The exception occurs when
the population is finite and the sample comprises a
substantial fraction (more than 5%) of the population.
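
A minimal Python sketch of drawing an SRS, using only the standard library. The subscriber list, the seed, and the sample size of 50 are illustrative assumptions, not data from the course. `random.sample` draws without replacement, so every set of 50 items is equally likely; `random.choices` is shown for contrast because it draws with replacement.

```python
import random

# Hypothetical sampling frame: subscriber IDs 1..1000 (not from the slides).
subscribers = list(range(1, 1001))

rng = random.Random(210)  # fixed seed so the sketch is reproducible

# SRS without replacement: every set of 50 subscribers is equally likely.
srs = rng.sample(subscribers, k=50)

# Sampling with replacement: the same subscriber may be drawn more than once.
with_replacement = rng.choices(subscribers, k=50)

print("first ten IDs in the SRS:", sorted(srs)[:10])
print("distinct IDs without replacement:", len(set(srs)))
print("distinct IDs with replacement:", len(set(with_replacement)))
```

Here the sample is 5% of the hypothetical population, which sits right at the threshold mentioned above for treating the items as independent.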

Simple Random Sampling
Sampling with replacement: Replace each item after it is
sampled.
 The population remains the same on every draw. The
sampled units are truly independent.
 In the sample the researcher collected, 80% of users were
satisfied with their internet connection.
 In the population of customers, it is unlikely there will be
exactly 80% who are satisfied with their internet connection.
 It is more realistic to think that there will be somewhere
around 80% of the customers who are satisfied with their
internet connection.
 Another researcher repeats the study with a different SRS of
50 customers. She finds 90% are satisfied with their internet
connection.
 Did she do something wrong or did the first researcher do
something wrong?
 Neither: this is sampling variation at work. Two different samples from the
same population will differ from each other and from the
population.
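
Sampling variation can be seen in a small simulation. The sketch below assumes a hypothetical population of 10,000 customers in which 85% are satisfied; that figure is invented for illustration, since the true population proportion in the example is unknown. Five researchers each draw their own SRS of 50.

```python
import random

rng = random.Random(1)  # fixed seed so the run is reproducible

# Hypothetical population: 1 = satisfied, 0 = not satisfied.
# The 85% satisfaction rate is an assumption chosen for illustration.
population = [1] * 8500 + [0] * 1500


def sample_proportion(pop, n, rng):
    """Proportion of satisfied customers in one SRS of size n."""
    sample = rng.sample(pop, n)
    return sum(sample) / n


# Five independent studies, each with its own SRS of 50 customers.
proportions = [sample_proportion(population, 50, rng) for _ in range(5)]

print("population proportion:", sum(population) / len(population))
print("sample proportions   :", proportions)
```

The sample proportions scatter around the population value without matching it exactly, which is the behaviour described in the bullets above.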

Stratified Sampling
Sometimes alternative sampling methods can be used to
make the selection process easier, to obtain extra information,
or to increase the degree of confidence in conclusions.
 One such method, stratified sampling, entails separating
the population units into non-overlapping groups and
taking a sample from each one.
 For example, a TV manufacturer might want
information about customer satisfaction for units produced
during the previous year. If three different models were
manufactured and sold, a separate sample could be
selected from each of the three corresponding strata.
 This would result in information on all three models and
ensure that no one model was over- or underrepresented
in the entire sample.
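
One way to sketch stratified sampling in Python: group the units by stratum, then draw a separate SRS inside each group. The model names, stratum sizes, and the 10% sampling fraction are hypothetical choices for illustration.

```python
import random

rng = random.Random(7)  # fixed seed so the run is reproducible

# Hypothetical sampling frame: (unit_id, model) pairs for three TV models.
frame = (
    [(i, "Model A") for i in range(0, 500)]
    + [(i, "Model B") for i in range(500, 800)]
    + [(i, "Model C") for i in range(800, 1000)]
)

# Separate the units into non-overlapping strata, one per model.
strata = {}
for unit_id, model in frame:
    strata.setdefault(model, []).append(unit_id)

# Draw a separate SRS from each stratum (here 10% of each stratum).
stratified_sample = {
    model: rng.sample(units, k=len(units) // 10)
    for model, units in strata.items()
}

for model, units in stratified_sample.items():
    print(model, len(units))
```

Because each model contributes its own sample, no model can be left out or swamped by the others, which a single SRS of 100 units would not guarantee.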
Convenience Sampling
Frequently a convenience sample is obtained by selecting
individuals or objects without systematic randomization. Such
a sample is not drawn by a well-defined random method.
Example: A computer engineer received a shipment of
1000 monitors in a huge container. He wants to test the
brightness of the monitors by testing a sample of 10.
The engineer takes 10 monitors from the top of the
container as the sample.
 Things to consider with convenience samples:
 They may differ systematically in some way from the population.
 Only use them when it is not feasible to draw a random
sample.

Types of Variable
 A variable is any characteristic whose value may change from one object to
another. The variables can be classified as either quantitative or qualitative.
 Quantitative (Numerical) variables: A numerical quantity is assigned to
each item in the sample. Quantitative variables can be classified as either
discrete or continuous:
 A discrete variable is a variable whose possible values can be listed,
even though the list may continue indefinitely. For example, the
number of visits to a particular Web site during a specified period, the
number of PCs owned by a family, or the number of students in an
introductory statistics class.
 A continuous variable is a variable whose possible values form some
interval of numbers. Typically, a continuous variable involves a
measurement of something, such as the price of a laptop, the CPU time
of a certain task (in seconds), or the length of time a PC battery lasts.
 Quantitative (quantity)
 Continuous: age, weight (e.g., 75.8 kg), temperature,
time, …; values in an interval such as [-1, 1]
 Discrete: number of students, number of
laptops (0, 1, …)

 Qualitative (quality)
 Nominal: names, color, gender, nationality, …
 Ordinal: level of education (school, high
school, college)

Qualitative (Categorical) variables: The sample items are
placed into categories, groups or levels.
Examples: brand of laptop owned by a student, the defective
status (defective or not), computer knowledge (beginner,
intermediate, expert), education level (less than high school,
high school, etc.).
Values of a qualitative variable are sometimes coded with numbers.
We cannot do arithmetic with such numbers, in contrast to those of a
quantitative variable.
Qualitative data can be classified as either nominal or ordinal. The
categories of ordinal data can be ranked or meaningfully ordered,
but the categories of nominal data cannot. Of the four
qualitative examples listed above, brand of laptop and defective
status are nominal, while computer knowledge and education level
are ordinal.

(1) An IT student, working on his thesis, plans a survey to determine the
proportion of all computer users who regularly scan flash disks before
using them. He decides to interview his classmates in the three classes
in which he is currently enrolled.
a) What is the population of interest?
Answer: All computer users.
b) What is the sample?
Answer: The classmates in the three classes.
c) What is the parameter and the statistic?
Parameter: the proportion of all computer users who regularly scan flash
disks before using them.
Statistic: the proportion of classmates in the three classes who regularly
scan flash disks before using them.

(2) Are the following data quantitative or qualitative?
a) Number of hard drives a PC has. Quantitative: discrete (0, 1, 2, …)

b) Employment status (employed, unemployed). Qualitative: nominal

c) The price of a laptop. Quantitative: continuous (e.g., in [300, 4000])

d) Quality of an item (low, medium, high). Qualitative: ordinal

Sizes of clothes:
32, 36, 37, 38, 40: quantitative
S, M, L, XL: qualitative
Graphical Methods
Descriptive statistics can be divided into two general areas;
graphical and numerical. In this part, we consider
representing a data set using graphical techniques.
Appropriate graphs are:
 For qualitative data: Bar chart and Pie chart
 For quantitative data: Histogram; Boxplot

Bar and Pie Charts
 Bar chart: A vertical or horizontal rectangle represents the
frequency for each category.
The height can be the frequency, relative frequency, or percent.
In some cases there will be a natural ordering of groups (for
example, freshmen, sophomores, juniors, seniors, graduate
students), whereas in other cases the order will be arbitrary (for
example, Dell, HP, etc.).
What to look for: frequently and infrequently occurring
categories.
In Minitab: Graph - Bar Chart
 Pie chart: A circle divided into slices, where the size of each slice
represents its relative frequency or percent frequency.
What to look for: categories that form large and small
proportions of the data set.
In Minitab: Graph - Pie Chart
A quality manager uses a questionnaire to ask customers how
they rate the customer support services offered by the IT
Services center. The services are rated on a scale of
outstanding (O), very good (V), good (G), average (A), and
poor (P). The responses of 50 customers were:
The data are summarized in the following frequency table:
Rating Frequency
Outstanding 19
Very good 13
Good 10
Average 6
Poor 2
Rating        Frequency   Relative frequency   Percent
Outstanding   19          19/50 = 0.38         38%
Very good     13          13/50 = 0.26         26%
Good          10          10/50 = 0.20         20%
Average        6           6/50 = 0.12         12%
Poor           2           2/50 = 0.04          4%

Relative frequency of each category = frequency / n, where n is the
sample size (the sum of the frequencies).

Percent frequency = relative frequency × 100%.
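
The table can be reproduced in a few lines of Python. This is a sketch that starts from the frequency counts given above (the 50 raw responses are not listed in these notes) and prints a crude text bar chart instead of a Minitab graph.

```python
# Frequency counts from the customer-support rating example.
frequency = {
    "Outstanding": 19,
    "Very good": 13,
    "Good": 10,
    "Average": 6,
    "Poor": 2,
}

n = sum(frequency.values())  # sample size = sum of the frequencies

# Relative frequency = frequency / n; percent = relative frequency * 100.
relative = {rating: count / n for rating, count in frequency.items()}
percent = {rating: 100 * rel for rating, rel in relative.items()}

# One '#' per response gives a simple horizontal bar chart.
for rating, count in frequency.items():
    print(f"{rating:<12}{count:>3}  {relative[rating]:.2f}  "
          f"{percent[rating]:>5.1f}%  {'#' * count}")
```

The relative frequencies always sum to 1 (and the percents to 100%), which is a quick check on any hand-built frequency table.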

The top three internet browsers in 2011 were Internet Explorer (IE),
Firefox (FF) and Chrome (GC) besides others (OT). Data indicating
the preferred browser for a sample of 60 internet users follow.






(a) Are these data categorical or quantitative?

(b) Provide frequency and percent frequency distributions.
(c) Construct a bar chart and a pie chart.
(d) On the basis of the sample, which browser has the largest
share? Which one is second?

Histogram
A histogram is a graphical display that gives an idea of the shape of
the data distribution.
The bars of the histogram touch each other. A space
indicates that there are no observations in that
interval.
What to look for: central or typical value, extent
of spread or variation, general shape, location and
number of peaks, presence of gaps and outliers.

In Minitab: Graph - Histogram

Shapes of Histogram
 A histogram is perfectly symmetric if its right half is a
mirror image of its left half.
 Histograms that are not symmetric are referred to as
skewed.
 A histogram with a long right-hand tail is said to be
skewed to the right, or positively skewed.
 A histogram with a long left-hand tail is said to be skewed
to the left, or negatively skewed.

A histogram is unimodal if it has only one peak, or mode, and
bimodal if it has two clearly distinct modes. Bimodality can occur
when the data set consists of observations on two quite different
kinds of individuals or objects. In principle, a histogram can have
more than two modes, but this does not happen often in practice.

To evaluate the effectiveness of a processor for a certain type
of tasks, a researcher recorded the CPU time for n = 30
randomly chosen jobs (in seconds),
70 36 43 69 82 48
34 62 35 15 59 139
46 37 42 30 55 56
36 82 38 89 54 25
35 24 22 9 56 19

Construct a histogram and describe the distribution of the CPU
times.



The distribution of the CPU times is skewed to the right with one potential outlier.
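
A text histogram of the CPU times can be sketched in Python as follows. The class width of 20 seconds is an assumption; the slides do not state the bins used for the plotted histogram, and a different width would change the picture somewhat.

```python
cpu_times = [
    70, 36, 43, 69, 82, 48, 34, 62, 35, 15,
    59, 139, 46, 37, 42, 30, 55, 56, 36, 82,
    38, 89, 54, 25, 35, 24, 22, 9, 56, 19,
]

bin_width = 20  # assumed class width, not taken from the slides
n_bins = max(cpu_times) // bin_width + 1

# counts[k] = number of observations in the interval [k*width, (k+1)*width)
counts = [0] * n_bins
for x in cpu_times:
    counts[x // bin_width] += 1

for k, count in enumerate(counts):
    low, high = k * bin_width, (k + 1) * bin_width
    print(f"[{low:>3}, {high:>3})  {'*' * count}")
```

With these bins the counts are 3, 11, 9, 3, 3, 0, 1: a peak at 20 to 40 seconds, a tail stretching to the right, and an empty class separating the value 139 from the rest.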

For each of the following data sets, draw a histogram and
determine whether the distribution is right-skewed, left-skewed,
or symmetric.
(1) 19, 24, 12, 19, 18, 24, 8, 5, 9, 20, 13, 11, 1, 12, 11, 10,
22, 21, 7, 16, 15, 15, 26, 16, 1, 13, 21, 21, 20, 19

(2) 17, 24, 21, 22, 26, 22, 19, 21, 23, 11, 19, 14, 23, 25,
26, 15, 17, 26, 21, 18, 19, 21, 24, 18, 16, 20, 21, 20, 23, 33

(3) 56, 52, 13, 34, 33, 18, 44, 41, 48, 75, 24, 19, 35, 27, 46,
62, 71, 24, 66, 94, 40, 18, 15, 39, 53, 23, 41, 78, 15, 35

Descriptive Statistics
Visual summaries of data are excellent tools for obtaining preliminary
impressions and insights. More formal data analysis often requires
the calculation and interpretation of numerical summary measures.
In practice, the entire population is never observed, so the
population parameters cannot be calculated directly. However,
sample statistics are often used to estimate parameters.
Percentages and proportions are used to summarize the distribution
of qualitative variables. For quantitative data, we will look at:
 Measures of location (center): mean, median, trimmed mean,
percentiles and quartiles.
 Measures of variability (spread): variance, standard deviation
(SD), range, interquartile range (IQR).

In Minitab, all summary statistics can be produced using:

Stat - Basic Statistics - Display Descriptive Statistics

Let x1, x2, …, xn be the values of the sample data; then the
mean is the average of these values.
The sample mean, denoted by x̄, is given by

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

Similarly, the population mean, denoted by µ, is given by

\[ \mu = \frac{1}{N} \sum_{i=1}^{N} x_i \]

where N is the population size.
Sometimes a sample may contain a few points that are much
larger or smaller than the rest. Such points are called outliers
and may affect the mean.
The median is the value in the middle when the data are
arranged in ascending order (smallest value to largest value).
To find the median the values in the sample are ordered from
smallest to largest, then
 If n is odd, the sample median is the number in position
(n+1)/2.
 If n is even, the sample median is the average of the
numbers in positions n/2 and (n/2)+1.
Although the mean is the more commonly used measure of
central location, in some situations the median is preferred.
The mean is influenced by extremely small and large data
values. In such case, the median is often the preferred
measure of central location.
Mean vs. Median
 The mean tends to be drawn in the direction of the tail of a
skewed distribution. The median is more appropriate when
the distribution is highly skewed.
 The mean can be greatly affected by the presence of outliers,
whereas the median is not.
 For symmetric distributions, the mean and median are the
same.
 For skewed distributions, the mean lies towards the longer
tail relative to the median.
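
A quick numerical check of these points, using the CPU-time data from the histogram example and Python's standard `statistics` module. Dropping the largest value (139 seconds) is done here only to show how each measure reacts; it is not a recommendation to discard data.

```python
import statistics

cpu_times = [
    70, 36, 43, 69, 82, 48, 34, 62, 35, 15,
    59, 139, 46, 37, 42, 30, 55, 56, 36, 82,
    38, 89, 54, 25, 35, 24, 22, 9, 56, 19,
]

mean_all = statistics.mean(cpu_times)
median_all = statistics.median(cpu_times)

# Remove the largest observation and recompute both measures.
without_max = sorted(cpu_times)[:-1]
mean_reduced = statistics.mean(without_max)
median_reduced = statistics.median(without_max)

print(f"all 30 values: mean = {mean_all:.2f}, median = {median_all:.2f}")
print(f"without 139  : mean = {mean_reduced:.2f}, median = {median_reduced:.2f}")
# all 30 values: mean = 48.23, median = 42.50
# without 139  : mean = 45.10, median = 42.00
```

The mean exceeds the median, as expected for a right-skewed sample, and one observation shifts the mean by about 3 seconds while the median moves by only 0.5.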

Mode and Trimmed Mean
 The mode is the value which occurs most frequently in the
sample. There may be no mode or there may be several modes.
 The mode is not affected by extreme values.
 It is mainly used for grouped numerical data or categorical data.

Trimmed Mean:
 The trimmed mean is a measure of center that is not affected by
outliers.
 With the trimmed mean, p% of the data is trimmed from each
end of the data set.
 First, arrange the sample values in (ascending or descending)
order. Then, trim an equal number of them (np/100 points)
from each end. Finally, compute the sample mean of the
remaining points.
Note: Minitab prints the 5% trimmed mean.
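
The three steps translate directly into a short Python function. This is a sketch, not Minitab's algorithm: when np/100 is not a whole number, the count is rounded down here, and that rounding rule is an assumption (statistical packages may round differently). A 10% trim is used in the example because it removes exactly 3 of the 30 CPU times from each end.

```python
def trimmed_mean(values, p):
    """Mean after trimming p% of the points from each end.

    Assumes 0 <= p < 50. The number trimmed per end is n*p/100,
    rounded down (an assumed rule for non-integer counts).
    """
    ordered = sorted(values)           # step 1: order the values
    n = len(ordered)
    k = int(n * p / 100)               # step 2: points to drop per end
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)       # step 3: mean of what remains


cpu_times = [
    70, 36, 43, 69, 82, 48, 34, 62, 35, 15,
    59, 139, 46, 37, 42, 30, 55, 56, 36, 82,
    38, 89, 54, 25, 35, 24, 22, 9, 56, 19,
]

print(round(trimmed_mean(cpu_times, 10), 2))  # 45.58
```

The 10% trimmed mean (45.58) lies between the median (42.5) and the ordinary mean (48.23), because the large value 139 no longer contributes.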

Percentile and Quartile
The pth percentile of a sample, for a number p between 0 and 100,
divides the sample so that as nearly as possible p% of the sample
values are less than the pth percentile.

To find the percentiles, order the sample values from smallest to
largest. Then compute the quantity i = (n+1)p/100, where n is the
sample size. If this quantity is an integer, the sample value in this
position is the pth percentile. Otherwise, average the two sample
values on either side.

The first quartile, Q1, is the value that has approximately 25% of
the observations below it. It represents the median of the lower half
of the data and corresponds to the 25th percentile.
The second quartile or median is the 50th percentile.
The third quartile, Q3, has approximately 75% of the observations
below it and corresponds to the 75th percentile.
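
The position rule can be sketched in Python as below, applied to the CPU-time data. The function `percentile` is a hypothetical helper written for this note. For comparison, `statistics.quantiles` (standard library, default method) uses the same (n+1)p/100 position but interpolates between the two neighbouring values instead of averaging them, so the two approaches can differ slightly; the interpolated values match the Q1 = 33.00 and Q3 = 59.75 shown in the Minitab output for these data.

```python
import math
import statistics


def percentile(values, p):
    """pth percentile using the position rule i = (n + 1) * p / 100.

    Assumes the position falls inside the sample (1 <= i <= n).
    """
    ordered = sorted(values)
    n = len(ordered)
    i = (n + 1) * p / 100
    if i == int(i):
        return ordered[int(i) - 1]          # positions are 1-based
    lower = ordered[math.floor(i) - 1]
    upper = ordered[math.ceil(i) - 1]
    return (lower + upper) / 2              # average the two neighbours


cpu_times = [
    70, 36, 43, 69, 82, 48, 34, 62, 35, 15,
    59, 139, 46, 37, 42, 30, 55, 56, 36, 82,
    38, 89, 54, 25, 35, 24, 22, 9, 56, 19,
]

q1, median, q3 = (percentile(cpu_times, p) for p in (25, 50, 75))
print(q1, median, q3)                        # 32.0 42.5 60.5

# Interpolating between the neighbours instead of averaging them:
print(statistics.quantiles(cpu_times, n=4))  # [33.0, 42.5, 59.75]
```

For Q1 the position is 7.75, between the 7th value (30) and the 8th (34): averaging gives 32, while interpolating three quarters of the way gives 33.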
Measures of Variability: Variance and Standard Deviation
The variance is the average of squared deviations of values from the
mean. The population variance (σ²) is given by

\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \]

while the sample variance (s²) is given by

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

The sample variance is a reasonable estimate of the population
variance.

The standard deviation is the square root of the variance.
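
As a worked check, the sample variance and standard deviation of the CPU times can be computed step by step in Python and compared with the standard library's `statistics.variance` and `statistics.stdev`, which also divide by n - 1.

```python
import math
import statistics

cpu_times = [
    70, 36, 43, 69, 82, 48, 34, 62, 35, 15,
    59, 139, 46, 37, 42, 30, 55, 56, 36, 82,
    38, 89, 54, 25, 35, 24, 22, 9, 56, 19,
]

n = len(cpu_times)
x_bar = sum(cpu_times) / n

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
s2 = sum((x - x_bar) ** 2 for x in cpu_times) / (n - 1)
s = math.sqrt(s2)  # standard deviation = square root of the variance

print(f"s^2 = {s2:.2f}, s = {s:.2f}")  # s^2 = 703.15, s = 26.52

# The library functions use the same n - 1 divisor.
print(statistics.variance(cpu_times), statistics.stdev(cpu_times))
```

The standard deviation is in the same units as the data (seconds), which is why it is usually reported instead of the variance.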

Range and Inter Quartile Range
The range (R) is the simplest measure of variation but is of limited use.
It is the difference between the largest and the smallest observations:

R = max(xi) - min(xi)

It is not commonly used, as it is based on only two observations and
is highly influenced by extreme values.

The interquartile range (IQR) is the range of the middle 50% of
the data:
IQR = Q3 - Q1
It is not influenced by outliers but is used to detect them.

Detection of outliers: Measure 1.5×(IQR) down from the first
quartile and up from the third quartile. All data points observed
outside of this interval are classified as outliers.
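
A sketch of the 1.5×IQR rule in Python, applied to the CPU-time data from the histogram example. The quartiles come from `statistics.quantiles`, which interpolates at position (n+1)p/100; a different quartile convention would move the fences slightly.

```python
import statistics

cpu_times = [
    70, 36, 43, 69, 82, 48, 34, 62, 35, 15,
    59, 139, 46, 37, 42, 30, 55, 56, 36, 82,
    38, 89, 54, 25, 35, 24, 22, 9, 56, 19,
]

data_range = max(cpu_times) - min(cpu_times)

q1, _, q3 = statistics.quantiles(cpu_times, n=4)
iqr = q3 - q1

# Fences: 1.5 * IQR below Q1 and above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in cpu_times if x < lower_fence or x > upper_fence]

print(f"range = {data_range}, IQR = {iqr}")
print(f"fences = ({lower_fence}, {upper_fence}), outliers = {outliers}")
# range = 130, IQR = 26.75
# fences = (-7.125, 99.875), outliers = [139]
```

The single value 139 accounts for much of the range of 130, while the IQR of 26.75 is unaffected by it.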
To evaluate the effectiveness of a processor for a certain type of
tasks, a researcher recorded the CPU time for n = 30 randomly
chosen jobs (in seconds),
70 36 43 69 82 48 34 62 35 15
59 139 46 37 42 30 55 56 36 82
38 89 54 25 35 24 22 9 56 19

Minitab Output:
Descriptive Statistics: CPU Time

Variable   N  N*   Mean  SE Mean  StDev  Minimum     Q1  Median     Q3  Maximum
CPU Time  30   0  48.23     4.84  26.52     9.00  33.00   42.50  59.75   139.00
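
The same summary can be reproduced with Python's standard library, as a cross-check on the Minitab output. Two assumptions are made: SE Mean is taken to be the standard error of the mean, s/√n, and the quartiles use the default interpolation of `statistics.quantiles`. With these choices, the rounded values agree with every column above.

```python
import math
import statistics

cpu_times = [
    70, 36, 43, 69, 82, 48, 34, 62, 35, 15,
    59, 139, 46, 37, 42, 30, 55, 56, 36, 82,
    38, 89, 54, 25, 35, 24, 22, 9, 56, 19,
]

n = len(cpu_times)
stdev = statistics.stdev(cpu_times)
q1, median, q3 = statistics.quantiles(cpu_times, n=4)

summary = {
    "N": n,
    "Mean": statistics.mean(cpu_times),
    "SE Mean": stdev / math.sqrt(n),   # standard error of the mean
    "StDev": stdev,
    "Minimum": min(cpu_times),
    "Q1": q1,
    "Median": median,
    "Q3": q3,
    "Maximum": max(cpu_times),
}

for name, value in summary.items():
    print(f"{name:<8} {value:>8.2f}")
```

The five-number summary used for a boxplot is the subset (Minimum, Q1, Median, Q3, Maximum) of this table.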

 The boxplot is a graphical display that simultaneously
describes several important features of a data set, such as
center, spread, departure from symmetry, and identification
of outliers.
 The plot is based on the five number summary:
(minimum; Q1; median; Q3; maximum)
 Comparative or side-by-side boxplots are a very effective
way of comparing two or more data sets consisting of
observations on the same variable, for example fuel efficiency
observations for four different types of automobiles, prices
for three different brands of notebooks, and so on.

In Minitab: Graph - Boxplot

Distribution shape and Boxplot



The distribution of the CPU times is skewed to the right with
one outlier.

Comparative or side-by-side boxplots
The following comparative boxplots represent the amount of internet traffic
handled by a certain center during a week. What we can see:
 Traffic is heaviest on Fridays and lightest on Saturdays and Sundays.
 The greatest spread occurs on Fridays and the least on Saturdays and
Sundays.
 The distributions all appear to be slightly right-skewed, although there is
little skew in the distributions on Saturday and Sunday. There are large
outliers on Monday, Thursday, and Friday.

(1) The following data set represents the number of new computer
accounts registered during ten consecutive days:
43 37 50 51 58 105 52 45 45 10
a) Compute the mean, median, quartiles, and standard deviation.
b) Delete the outliers and redo part (a).
c) Draw a conclusion about the effect of outliers.

(2) The numbers of blocked intrusion attempts on each day during
the first two weeks of the month were
56 47 49 37 38 60 50 43 43 59 50 56 54 58
After the change of firewall settings, the numbers of intrusions during the next
20 days were
53 21 32 49 45 38 44 33 32 43
53 46 36 48 39 35 37 36 39 45
To compare the number of intrusions before and after the change, construct
parallel boxplots and comment on your findings.
(3) Match each histogram to the boxplot that represents the
same data set.

(4) A network provider investigates the load of its network. The
number of concurrent users is recorded at 50 locations (in thousands):
17.2 22.1 18.5 17.2 18.6 14.8 21.7 15.8 16.3 22.8

24.1 13.3 16.2 17.5 19.0 23.9 14.8 22.2 21.7 20.7

13.5 15.8 13.1 16.1 21.9 23.9 19.3 12.0 19.9 19.4

15.4 16.7 19.5 16.2 16.9 17.1 20.2 13.4 19.8 17.7

19.7 18.7 17.6 15.9 15.2 17.1 15.0 18.8 21.6 11.9

a) Compute the sample mean, variance, and standard deviation of
the number of concurrent users.
b) Compute the five-number summary and construct a boxplot.
c) Compute the interquartile range. Are there any outliers?
d) It is reported that the number of concurrent users approximately
follows a normal distribution. Does the histogram support
this claim?
