
Statistics

Module 1

1
Table of Contents
1. Statistical Modeling
2. Measures of Central
Tendency
3. Measures of Variability
4. Correlation & Regression
5. The Normal Distribution

2
STATISTICAL
MODELING
What is Statistical
Modeling?
What are the advantages?
What are the
disadvantages?

3
STATISTICAL
MODELING :
DEFINITION
✔ A statistical model is a simplification of a real-world situation,
usually describing it with mathematical equations. It
can be used to make predictions about a real-world problem.
By analyzing and refining the model, an improved
understanding may be obtained.

4
STATISTICAL
MODELING :
ADVANTAGES
• The model is quick and easy
to produce
• Helps our understanding of
the real-world problem
• Helps us to make predictions
• Helps us to control a situation –
e.g., railway timetables, air
traffic control, etc.

5
STATISTICAL MODELING :
DISADVANTAGES
✔ The model simplifies the
situation and only describes a
part of the real-world problem.
✔ The model may only work in
certain situations, or for a
particular range of values.

6
REPRESENTATION
OF SAMPLE DATA

Variables:
1. Qualitative variables: Non-numerical –
e.g., red, blue, or long, short, etc.
2. Quantitative variables: Numerical – e.g.,
length, age, time, number of coins in
pocket, etc.
3. Continuous variables: Can take any value
within a given range – e.g., height, time,
age, etc.
4. Discrete variables: Can only take certain
values – e.g., shoe size, cost in $ and p,
number of coins

7
FREQUENCY
DISTRIBUTIONS

Frequency Distribution:
✔ A distribution is best thought of as a table. Thus, a
frequency distribution can be thought of as a
frequency table, i.e., a list of discrete values and their
frequencies.

Example:
The number of M&Ms is counted in several bags, and
recorded in the frequency distribution/table below:

Number of M&Ms:  37  38  39  40  41  42  43
Frequency:        3   8  11  19  13   7   2

8
FREQUENCY
DISTRIBUTIONS

Cumulative Frequency:
✔ Add up the frequencies as you go down/along the list

Example:
The number of M&Ms is counted in several bags, and
recorded in the frequency distribution/table below:

Number of M&Ms:        37  38  39  40  41  42  43
Frequency:              3   8  11  19  13   7   2
Cumulative Frequency:   3  11  22  41  54  61  63

9
MEASURES OF
CENTRAL TENDENCY

What is a central tendency?


What measures can be used to measure a central
tendency?

10
WHAT IS A
CENTRAL
TENDENCY?

✔ Central tendency is defined as “the statistical measure
that identifies a single value as representative of an
entire distribution”

✔ It aims to provide an accurate description of the entire
data. It is the single value that is most
typical/representative of the collected data

✔ The term “number crunching” is used to illustrate this
aspect of data description

✔ The mean, median and mode are the three commonly
used measures of central tendency

11
CENTRAL
TENDENCY:
THE MEAN

✔ Is the most used measure of central tendency

✔ There are different types of mean: viz., arithmetic
mean, weighted mean, geometric mean (GM) and
harmonic mean (HM)

✔ If mentioned without an adjective (as "mean"), it
generally refers to the Arithmetic Mean!

12
CENTRAL
TENDENCY :
THE MEAN
The Arithmetic Mean:
✔ Arithmetic Mean (or simply,
“mean”) is the average. It is
computed by adding all the
values in the data set (+) and
dividing ( ÷ ) by the number
of observations in it.
✔ If we have the raw data, the
mean is given by the
formula:

M = ΣX / N

13
CENTRAL TENDENCY : THE MEAN

Example:
We have the numbers of friends of 11 Facebook users. According to the
formula, calculate the mean:
ΣX = 22 + 40 + 53 + 57 + 93 + 98 + 103 + 108 + 116 + 121 + 252 = 1063

M = 1063 / 11 ≈ 96.64
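The arithmetic here can be checked with a few lines of Python (the eleven values are the Facebook-friends data used throughout this module):

```python
# Number of friends of 11 Facebook users (data from the slides).
friends = [22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252]

total = sum(friends)           # ΣX = 1063
mean = total / len(friends)    # M = ΣX / N
print(total, round(mean, 2))   # 1063 96.64
```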

14
CENTRAL
TENDENCY:
THE MEDIAN

✔ The median is the middle number in a sorted,
ascending or descending, list of numbers and can be
more descriptive of that data set than the average.

✔ The median is sometimes used instead of the
mean when there are outliers in the sequence that
might skew the average of the values

15
CENTRAL
TENDENCY:
THE MEDIAN

Example:
✔ The median of 4, 1, and 7 is 4 because when
the numbers are put in order (1, 4, 7), the
number 4 is in the middle.

16
CENTRAL
TENDENCY:
THE MODE

✔ Is the value that appears most often in a set of data
values

Example:
If X is a discrete random variable, the mode is the
value x (i.e., X = x) at which the probability mass
function takes its maximum value.
- In other words, it is the value that is most likely to be
sampled.

17
CENTRAL
TENDENCY:
THE MODE

✔ The mode is not necessarily unique to a
given discrete distribution, since the
probability mass function may take the same
maximum value at several points x1, x2, etc.

o Bimodal:
- Having two modes

o Multimodal:
- Having several modes

18
All the Distributions:

CENTRAL
TENDENCY :
THE MODE

19
MEASURES
OF CENTRAL
TENDENCY –
When to use
mode, median
and mean:

Mode:
You should use the mode if the data is qualitative (colour,
etc.) or if it is quantitative (numbers) with a clearly defined
mode (or bi-modal). It is not much use if the distribution is
fairly even.

Median:
You should use this for quantitative data (numbers), when
the data is skewed, i.e., when the median, mean and mode
are probably not equal, and when there might be extreme
values (outliers).

Mean:
This is for quantitative data (numbers) and uses all pieces
of data. It gives a true measure, but should only be used if
the data is fairly symmetrical (not skewed), i.e., when the
mean is not distorted by extreme values (outliers).
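The advice above can be illustrated with Python's `statistics` module on a small, hypothetical skewed data set containing one extreme value:

```python
import statistics

# Hypothetical skewed data with one outlier (40).
data = [2, 3, 3, 4, 5, 6, 40]

mean = statistics.mean(data)      # pulled upward by the outlier
median = statistics.median(data)  # resistant to the outlier
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)
```

Here the mean (9) is larger than every value except the outlier, while the median (4) and mode (3) stay representative of the bulk of the data — exactly the situation where the slide recommends the median over the mean.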

20
MEASURES OF
VARIABILITY

21
MEASURES
OF
VARIABILITY

• Range & Interquartile Range
• Variance & Standard Deviation
• Measurement of relationships between
variables

22
MEASURES
OF
VARIABILITY:
RANGE

✔ Range:
The range is the largest number minus the
smallest (including outliers)

Example:
Number of friends of 11 Facebook users:
22, 40, 53, 57, 93, 98, 103,
108, 116, 121, 252

Range = 252 – 22 = 230

– Very biased by outliers

23
MEASURES
OF
VARIABILITY:
INTERQUARTILE
RANGE

Example:
The difference between the score representing
the 75th percentile and the score representing
the 25th percentile is the interquartile range.
This value gives us the range of the middle 50%
of the values in the data set.
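A sketch of the interquartile range in Python, using the Facebook-friends data from the Range slide. Note that several quartile conventions exist, so hand-computed quartiles from a textbook may differ slightly from a library's default:

```python
import statistics

data = [22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252]

# statistics.quantiles with n=4 returns the three quartile cut points
# (default "exclusive" method; other conventions give slightly different values).
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
print(q1, q3, iqr)
```

For this data set the quartile cut points land on 53 and 116, so the IQR is 63 — much less affected by the outlier 252 than the range of 230.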

24
MEASURES OF VARIABILITY: VARIANCE AND
STANDARD DEVIATION
✔ The standard deviation is the square root of the average squared deviation from the mean.

The average squared deviation from the mean is also known as the variance

25
MEASURES OF
VARIABILITY:
UNDERSTANDING
AND
CALCULATING
THE STANDARD
DEVIATION

✔ Computers are used
extensively for
calculating the
standard deviation and
other statistics.
✔ However, calculating
the standard deviation
by hand once or twice
can be helpful in
developing an
understanding of its
meaning!

26
MEASURES OF
VARIABILITY:
UNDERSTANDING
AND
CALCULATING
THE STANDARD
DEVIATION

Example: Calculating the variance and standard deviation:

Consider the observations 8, 25, 7, 5, 8, 3, 10, 12, 9
1. Determine n, which is the number of data values
2. Calculate the arithmetic mean, which is the sum of scores
divided by n.
3. Calculate the mean = (8+25+7+5+8+3+10+12+9)/9 or 9.67
4. Subtract the mean from each individual score to find the
individual deviations
5. Square the individual deviations
6. Find the sum of the squares of the deviations… can you see
why we squared them before adding the values?
7. Divide the sum of the squares of the deviations by n-1. Well
done, this is the Variance!
8. Take the square root of the variance to obtain the standard
deviation, which has the same units as the original data
27
MEASURES OF
VARIABILITY:
UNDERSTANDING
AND
CALCULATING THE
STANDARD
DEVIATION

28
MEASURES OF VARIABILITY: UNDERSTANDING
AND CALCULATING THE STANDARD DEVIATION

Raw score method for calculating standard deviation


Again, consider the observations 8, 25, 7, 5, 8, 3, 10, 12, 9
1. Square each of the scores
2. Determine N, which is the number of scores
3. Compute the sum of X and the sum of X-squared
4. Calculate the standard deviation as illustrated below
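The raw score method uses the algebraically equivalent formula s² = (ΣX² − (ΣX)²/N) / (N − 1), which avoids computing individual deviations. A sketch:

```python
import math

data = [8, 25, 7, 5, 8, 3, 10, 12, 9]

n = len(data)                        # step 2: N = 9
sum_x = sum(data)                    # step 3: ΣX  = 87
sum_x2 = sum(x * x for x in data)    # steps 1 & 3: ΣX² = 1161

# Raw score formula: s² = (ΣX² − (ΣX)²/N) / (N − 1)
variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
sd = math.sqrt(variance)             # same ≈ 6.32 as the deviation method
print(variance, sd)
```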

29
MEASURES OF
VARIABILITY:
UNDERSTANDING
AND CALCULATING
THE STANDARD
DEVIATION

30
MEASUREMENTS
OF
RELATIONSHIPS
BETWEEN
VARIABLES

31
Graph of Two Measurement Variables
/EXAMPLE 1

1. What is your height (inches)?
2. What is your weight (lbs)?

Scatterplot: Notice we have two different measurement
variables. It would be inappropriate to put
these two variables on side-by-side boxplots
because they do not have the same units of
measurement. Comparing height to weight is
like comparing apples to oranges. However,
we do want to put both of these variables on
one graph so that we can determine if there
is an association (relationship) between them.

Notice: as height increases, weight also tends
to increase. These two variables have
a positive association because as the values
of one measurement variable tend to
increase, the values of the other variable also
increase
32
Graph of Two Measurement Variables
/EXAMPLE 2
1. How far do you live from campus (miles)?
2. How much is your monthly rent ($)?

The further an unfurnished one-bedroom
apartment is away from campus, the less
it costs to rent.

We say that two variables have a negative
association when the values of one
measurement variable tend to decrease
as the values of the other variable
increase.

33
Graph of Two Measurement Variables
/EXAMPLE 3
1. About how many hours do you typically study each week?
2. About how many hours do you typically exercise each week?

As the number of hours spent exercising
each week increases, there is really no
pattern to the behavior of hours spent
studying, including visible increases or
decreases in values.

Consequently, we say that there is
essentially no association between the
two variables.

34
CORRELATION &
REGRESSION

35
Correlation &
Significance

Remember!

✔ Statistical methods are of one of two
types:

✔ Descriptive methods (that describe attributes of a
data set) and

✔ Inferential methods (that try to draw conclusions
about a population based on sample data).

36
✔ Many relationships between two measurement
variables tend to fall close to a straight line.
✔ In other words, the two variables exhibit a linear
relationship.
✔ It is also helpful to have a single number that will
measure the strength of the linear relationship
Correlation between the two variables. This number is
the correlation.
✔ The correlation is a single number that indicates
how close the values fall to a straight line.
✔ Correlation quantifies both the strength and
direction of the linear relationship between the two
measurement variables.
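The definition above — correlation as a summary of how closely standardized values track each other — can be sketched directly in Python. The two toy data sets here are hypothetical, chosen only to show the extremes of r:

```python
import math

def correlation(xs, ys):
    """Pearson r: average product of standard scores (using n-1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return sum((x - mx) * (y - my) / (sx * sy)
               for x, y in zip(xs, ys)) / (n - 1)

# A perfect positive line gives r ≈ 1; reversing it gives r ≈ -1.
r_pos = correlation([1, 2, 3, 4], [2, 4, 6, 8])
r_neg = correlation([1, 2, 3, 4], [8, 6, 4, 2])
print(r_pos, r_neg)
```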

37
CORRELATION FOR
EXAMPLES 1-3

EXAMPLE     VARIABLES                  CORRELATION (r)
Example 1   Height and Weight          r = 0.541
Example 2   Distance and Monthly Rent  r = −0.903
Example 3   Study Hours and
            Exercise Hours             r = 0.109

• https://www.youtube.com/watch/cwc2JWqI6aQ

38
FEATURES
OF
CORRELATION

✔ The correlation of a sample is represented by the
letter r

✔ The range of possible values for a correlation is
between -1 and +1

✔ A positive correlation indicates a positive linear
association like the one in Example 1. The
strength of the positive linear association increases
as the correlation becomes closer to +1

39
FEATURES
OF
CORRELATION

✔ A negative correlation indicates a negative linear
association. The strength of the negative linear
association increases as the correlation becomes
closer to -1

✔ A correlation of either +1 or -1 indicates a perfect
linear relationship. This is hard to find with real data

✔ A correlation of 0 indicates either that:
✔ there is no linear relationship between the two
variables, and/or
✔ the best straight line through the data is
horizontal.

40
FEATURES
OF
CORRELATION

✔ The correlation is independent of the original units
of the two variables. This is because the correlation
depends only on the relationship between the
standard scores of each variable

✔ The correlation is calculated using every
observation in the data set

✔ The correlation is a descriptive result.

41
FEATURES
OF
CORRELATION

✔ As you compare the scatterplots of the data from the
three examples with their actual correlations, you
should notice that the findings are consistent for each
example.

✔ In Example 1, the scatterplot shows a positive
association between weight and height. However,
there is still quite a bit of scatter around the pattern.
Consequently, a correlation of 0.541 is reasonable. It is
common for a correlation to decrease as sample size
increases.

✔ In Example 2, the scatterplot shows a negative
association between monthly rent and distance from
campus. Since the data points are very close to a
straight line, it is not surprising the correlation is -0.903.

✔ In Example 3, the scatterplot does not show any
strong association between exercise hours/week and
study hours/week. This lack of association is
supported by a correlation of 0.109.

42
STATISTICAL
SIGNIFICANCE

✔ A statistically significant relationship is one that is
large enough to be unlikely to have occurred in the
sample if there's no relationship in the population.

✔ The issue of whether a result is unlikely to happen
by chance is an important one in establishing
cause-and-effect relationships from experimental
data

✔ If an experiment is well planned, randomization
makes the various treatment groups similar to each
other at the beginning of the experiment, except for
the luck of the draw that determines who gets into
which group.

✔ Then, if subjects are treated the same during the
experiment (e.g., via double blinding), there can be
two possible explanations for differences seen: 1)
the treatment(s) had an effect, or 2) differences are
due to the luck of the draw.

43
KEY
CAVEATS
WITH
CORRELATIONS

✔ There are three key caveats that must be
recognized with regard to correlation.

✔ It is impossible to prove causal relationships with
correlation. However, the strength of the evidence
for such a relationship can be evaluated by
examining and eliminating important alternate
explanations for the correlation seen.

✔ Outliers can substantially inflate or deflate the
correlation.

✔ Correlation describes the strength and direction of
the linear association between variables. It does not
describe non-linear relationships

44
CORRELATION
AND
CAUSATION

✔ It is often tempting to suggest that, when the
correlation is statistically significant, the change in
one variable causes the change in the other
variable

✔ However, outside of randomized experiments,
there are numerous other possible reasons that
might underlie the correlation

45
CORRELATION
AND
CAUSATION

Check for the possibility that the response (y) might
be directly affecting the explanatory (x) variable
(rather than the other way around).

For example, you might suspect that the number of
times children wash their hands might be causally
related to the number of cases of the common cold
amongst the children at a pre-school.

✔ However, it is also possible that children who have
colds are made to wash their hands more often. In
this example, it would also be important to evaluate
the timing of the measured variables – does an
increase in the amount of hand washing precede a
decrease in colds, or did it happen at the same
time?

46
CORRELATION
AND
CAUSATION

Check whether changes in the explanatory (x)
variable contribute, along with other variables, to
changes in the response (y).

For example, the amount of dry brush in a forest
does not cause a forest fire; but it will contribute to
it if a fire is ignited.

47
CORRELATION
AND
CAUSATION

Check for confounders or common causes that may
affect both the explanatory and response variables.

For example, there is a moderate association
between whether a baby is breastfed or bottle-fed
and the number of incidences of gastroenteritis
recorded on medical charts (with the breastfed
babies showing more cases).

But it turns out that breastfed babies also have, on
average, more routine medical visits to
pediatricians. Thus, the number of opportunities for
mild cases of gastroenteritis to be recorded on
medical charts is greater for the breastfed babies,
providing a clear confounder.

48
CORRELATION
AND
CAUSATION

Check whether the association between the variables
might be just a matter of coincidence.

This is where a check for the degree of statistical
significance would be important.

However, it is also important to consider whether the
search for significance was a priori or a posteriori.

For example, a story in the national news one year
reported that at a hospital in Potsdam, New York, 15
babies in a row were all boys.

Does that indicate that something at that hospital was causing
more male than female births? Clearly, the answer is no, even if
the chance of having 15 boys in a row is quite low (about 1
chance in 33,000). But there are over 5000 hospitals in the
United States, and the story would be just as newsworthy if it
happened at any one of them at any time of the year, and for
either 15 boys in a row or 15 girls in a row. Thus, it turns out
that we actually expect a story like this to happen once or twice
a year somewhere in the United States.

49
CORRELATION
AND
CAUSATION

Check whether both variables may have changed
together over time or space.

For example, data on the number of cases of
internet fraud and on the amount spent on election
campaigns in the United States taken over the last
30 years would have a strong association merely
because they have both increased over time.

As another example, if you examine the percent of
the population with home computers and the life
expectancy for every country in the world, there will
be a positive association merely because richer
countries have both higher life expectancy and
greater computer use. Tyler Vigen's website lists
thousands of spurious correlations that result from
variables that coincidentally change the same way
over time.

50
EFFECT OF OUTLIERS ON CORRELATION

Scatterplot of the relationship between the Infant Mortality
Rate and the Percent of Juveniles Not Enrolled in School for
each of the 50 states plus the District of Columbia:

The correlation is 0.73, but
looking at the plot it can be
observed that for the 50
states alone the relationship
is not nearly as strong as a
0.73 correlation would
suggest. Here, the District of
Columbia (identified by the X)
is a clear outlier in the scatter
plot, being several standard
deviations higher than the
other values for both the
explanatory (x) variable and
the response (y) variable.
Without Washington D.C. in
the data, the correlation
drops to about 0.5.
51
CORRELATION
AND
OUTLIERS

Correlations measure linear association:
the degree to which relative standing on the x list of
numbers (as measured by standard scores) is
associated with the relative standing on the y list.
Since means and standard deviations, and hence
standard scores, are very sensitive to outliers, the
correlation will be as well.

In general, the correlation will either increase or
decrease, based on where the outlier is relative to
the other points remaining in the data set. An
outlier in the upper right or lower left of a
scatterplot will tend to increase the correlation,
while outliers in the upper left or lower right will
tend to decrease a correlation.

https://www.youtube.com/watch/XZ9vrVmvbj8
https://www.youtube.com/watch/7YsQ9xwjryo
52
REGRESSION

Regression is a descriptive method used
with two different measurement
variables to find the best straight line
(equation) to fit the data points on the
scatterplot.

A key feature of the regression
equation is that it can be used to make
predictions. In order to carry out a
regression analysis, the variables need
to be designated as either the:
53
REGRESSION

Explanatory or Predictor Variable = x (on horizontal axis)

Response or Outcome Variable = y (on vertical axis)

The explanatory variable can be used to predict (estimate) a
typical value for the response variable. (Note: It is not
necessary to indicate which variable is the explanatory
variable and which is the response with correlation.)
54
REVIEW: EQUATION OF A LINE

• Let's review the basics of the equation of a line:
y = a + bx, where a is the intercept (the value of y when x = 0)
and b is the slope (the change in y per unit change in x).

55
REVIEW:
EQUATION
OF A LINE

56
EXAMPLE 1:
EXAMPLE
OF
REGRESSION
EQUATION

Consider the following
two variables for a
sample of ten Stat 100
students:
x = quiz score
y = exam score

✔ The scatterplot depicts
the data, whose
correlation
is 0.883
57
EXAMPLE 1:
EXAMPLE
OF
REGRESSION
EQUATION

Can we predict the exam score based on the quiz score for
students who come from this same population?

To make that prediction we notice that the points
generally fall in a linear pattern, so we can use the
equation of a line that will allow us to put in a specific
value for x (quiz) and determine the best estimate of the
corresponding y (exam).

The line represents our best guess at the average value
of y for a given x value, and the best line would be one
that has the least variability of the points around it (i.e.,
we want the points to come as close to the line as
possible).

Remembering that the standard deviation measures the
deviations of the numbers on a list about their average,
we find the line that has the smallest standard deviation
for the distance from the points to the line. That line is
called the regression line or the least squares line.

58
EXAMPLE 1: EXAMPLE OF REGRESSION
EQUATION
• Least squares essentially finds the line that will be closer to all the data points
than any other possible line.

This figure displays the least squares
regression for the data in Example 1.

As we can see from the plot of the
regression, some of the points lie above
the line, while other points lie below the
line. In fact, the total distance from the
line to the points above it is exactly
equal to the total distance from the
line to the points that fall below it.

59
EXAMPLE 1: EXAMPLE OF REGRESSION EQUATION

The least squares regression equation used to plot the
line in the figure is:

60
The Normal Distribution

61
THE NORMAL DISTRIBUTION

Bell Shaped
Symmetrical
Mean (μ), Median and Mode are EQUAL

Location is determined by the mean, μ
Spread is determined by the standard deviation, σ
The random variable has an infinite theoretical range:
−∞ < X < +∞
62
THE NORMAL
DISTRIBUTION

63
THE NORMAL
DISTRIBUTION

64
THE STANDARDIZED NORMAL
DISTRIBUTION

Any normal distribution (with any mean and standard deviation combination) can be transformed
into the standardized normal distribution (Z)

Need to transform X units (NORMAL DISTRIBUTION, WITH UNITS) into Z (standardized normal
distribution, NO UNITS). E.g.:
• NORMAL DISTRIBUTION: “X-units” -> 3 km, 2 m, 1 year, 10 min, 400 g, 45 km/h, etc.
• STANDARDIZED NORMAL DISTRIBUTION: No units at all!!!
Recipe:
• 1st step: calculate μ and σ of the data sample (values X),
• 2nd step: subtract from every value (X) its mean value (μ): X − μ
• 3rd step: divide by σ (note that σ must NOT be equal to 0)

The standardized normal distribution (Z) has a mean of 0 and a standard deviation of 1

65
TRANSLATION TO THE
STANDARDIZED NORMAL
DISTRIBUTION

• Translate from X to the standardized normal (the
Z distribution) by subtracting the mean of X and
dividing by its standard deviation:

Z = (X − μ) / σ

• The Z-distribution always has mean = 0 and


standard deviation = 1

66
THE STANDARDIZED
NORMAL DISTRIBUTION

Also known as the Z distribution


Mean is 0
Standard Deviation is 1

Values above the mean correspond to positive


Z-values, values below the mean correspond to
negative Z-values

67
THE
STANDARDIZED
NORMAL
DISTRIBUTION

EXAMPLE:
If X is distributed normally
with mean of $100 and
standard deviation of $50,
the Z value for X = $200 is:

Z = (X − μ) / σ = (200 − 100) / 50 = 2

This says that X = $200 is
two standard deviations (2
increments of $50 units)
above the mean of $100.

68
COMPARING X AND Z UNITS

• Note that the shape of


the distribution is the
same, only the scale
has changed. We can
express the problem in
the original units (X in
dollars) or in
standardized units (Z)

69
FINDING NORMAL
PROBABILITIES

• Probability is
measured by the
area under the curve

70
PROBABILITY AS AREA UNDER
THE CURVE

The total area under
the curve is 1.0, and
the curve is
symmetric, so half is
above the mean, half
is below

At 1σ: LEFT: P(X < μ+σ) = 84%    RIGHT: P(X ≥ μ+σ) = 16%
At 2σ: LEFT: P(X < μ+2σ) = 97.72%  RIGHT: P(X ≥ μ+2σ) = 2.28%

71
The general
normal
distribution

Exercise:
The results of an examination were
normally distributed. 10% of the
candidates had more than 70 marks
and 20% had fewer than 35 marks.

Find the mean and standard deviation
of the marks.

72
The general normal distribution

• Solution:

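The worked solution is not reproduced in this copy, but one way to solve it is with `statistics.NormalDist`, using standard normal percentiles as the z cut-offs. Sketch: P(X > 70) = 0.10 means 70 sits at the 90th percentile, and P(X < 35) = 0.20 means 35 sits at the 20th percentile; that gives two linear equations in μ and σ:

```python
from statistics import NormalDist

z_hi = NormalDist().inv_cdf(0.90)   # ≈  1.2816  (70-mark cut point)
z_lo = NormalDist().inv_cdf(0.20)   # ≈ -0.8416  (35-mark cut point)

# 70 = μ + z_hi·σ  and  35 = μ + z_lo·σ  ->  solve the two equations:
sigma = (70 - 35) / (z_hi - z_lo)   # ≈ 16.5 marks
mu = 70 - z_hi * sigma              # ≈ 48.9 marks
print(round(mu, 1), round(sigma, 1))
```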
73
The general
normal
distribution

• Exercise:
The weights of chocolate bars are
normally distributed with mean 205 g
and standard deviation 2.6 g. The
stated weight of each bar is 200 g.
(a) Find the probability that a single bar is
underweight.
(b) Four bars are chosen at random.
Find the probability that fewer than two
bars are underweight.

74
The general normal distribution

• Solution:

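The worked solution is not reproduced in this copy, but the exercise can be sketched in Python: part (a) is a normal probability, and part (b) treats the count of underweight bars as Binomial(4, p):

```python
from statistics import NormalDist
from math import comb

bar = NormalDist(mu=205, sigma=2.6)

# (a) probability a single bar weighs less than the stated 200 g
p = bar.cdf(200)          # z = (200-205)/2.6 ≈ -1.92, so p ≈ 0.027

# (b) number underweight ~ Binomial(4, p): P(fewer than 2) = P(0) + P(1)
answer = sum(comb(4, k) * p**k * (1 - p)**(4 - k) for k in range(2))
print(round(p, 4), round(answer, 4))   # answer ≈ 0.996
```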
75
Z-SCORE: What
is a Z-Score?

Simply put, a z-score (also called a standard
score) gives you an idea of how far from
the mean a data point is.
But more technically, it’s a measure of
how many standard deviations below or
above the population mean a raw
score is.
A z-score can be placed on a normal
distribution curve. Z-scores range from -3
standard deviations (which would fall to the
far left of the normal distribution curve) up
to +3 standard deviations (which would fall
to the far right of the normal distribution
curve).
In order to use a z-score, you need to know
the mean μ and the population standard
deviation σ.

76
The basic z score formula for a sample is:

z = (x – μ) / σ

For example, let’s say you have a test score of


190. The test has a mean (μ) of 150 and
Z-SCORE a standard deviation (σ) of 25. Assuming
a normal distribution, your z score would be:
FORMULA: ONE z = (x – μ) / σ
SAMPLE = (190 – 150) / 25 = 1.6

The z score tells you how many standard


deviations from the mean your score is. In
this example, your score is 1.6 standard
deviations above the mean.

77
Z-SCORE
FORMULA: ONE
SAMPLE

• You may also see the z
score formula shown to the
left. This is the same
formula as z = (x – μ) / σ,
except that x̄ (the sample
mean) is used instead of μ
(the population mean) and
s (the sample standard
deviation) is used instead of
σ (the population standard
deviation). However, the
steps for solving it are the
same.

78
How to Calculate
a Z-Score

Example question: You take
the SAT and score 1100. The
mean score for the SAT is
1026 and the standard
deviation is 209. How well
did you score on the test
compared to
the average test taker?

Step 1: Write your X-value
into the z-score equation.
For this example question,
the X-value is your SAT score,
1100.
79
How to Calculate a Z-Score

80
How to ✔ Step 4: Find the answer using a calculator:
Calculate (1100 – 1026) / 209 = .354. This means that your
a Z-Score score was .354 std devs above the mean.

✔ Step 5: (Optional) Look up your z-value in


the z-table to see what percentage of test-takers
scored below you. A z-score of .354 is .1368 +
.5000* = .6368 or 63.68%.

81
Z-SCORE FORMULA:
STANDARD ERROR
OF THE MEAN

When you have multiple samples and want to describe the standard deviation of
those sample means (the standard error), you would use this z score formula:
z = (x̄ – μ) / (σ / √n)

This z-score will tell you how many standard errors there are between the
sample mean and the population mean.

Example problem: In general, the mean height of women is 65″ with a standard
deviation of 3.5″. What is the probability of finding a random sample of 50
women with a mean height of 70″, assuming the heights are normally
distributed?
z = (x̄ – μ) / (σ / √n)
= (70 – 65) / (3.5/√50) = 5 / 0.495 = 10.1
The key here is that we’re dealing with a sampling distribution of means, so we
know we have to include the standard error in the formula. We also know that
99.7% of values fall within 3 standard deviations of the mean in a normal
probability distribution (see the 68-95-99.7 rule). Therefore, there’s less than 1%
probability that any sample of women will have a mean height of 70″.
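The standard-error calculation above, step by step:

```python
import math

mu, sigma = 65, 3.5     # population mean and sd (inches)
n, sample_mean = 50, 70  # sample size and observed sample mean

standard_error = sigma / math.sqrt(n)     # σ/√n ≈ 0.495
z = (sample_mean - mu) / standard_error   # ≈ 10.1 standard errors
print(round(standard_error, 3), round(z, 1))
```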

82
THE STANDARDIZED
NORMAL TABLE

• The Cumulative Standardized Normal table gives
the probability less than a desired value of Z (i.e.,
from negative infinity to Z)
• P(z < a), a: given value
• If you are given P(z > a): P(z > a) = 1 − P(z < a). Why?
Because the value of the entire area is
equal to one: P(z < a) + P(z > a) = 1!

83
THE
STANDARDIZED
NORMAL TABLE

84
GENERAL
PROCEDURE To find P(a < X < b) when X is distributed
normally:
FOR FINDING
NORMAL
PROBABILITI Draw the normal curve for the problem in
ES terms of X

Translate X-values to Z-values

Use the Standardized Normal Table

85
FINDING
NORMAL
PROBABILITIES

Let X represent the
time it takes, in
seconds, to
download an image
file from the internet.
Suppose X is normal
with a mean of 18.0
seconds and a
standard deviation of
5.0 seconds. Find
P(X < 18.6)

86
SOLUTION:
FINDING
P(Z < 0.12)

Z = (18.6 − 18.0) / 5.0 = 0.12
P(X < 18.6) = P(Z < 0.12) = 0.5478
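The table lookup can be checked with `statistics.NormalDist`:

```python
from statistics import NormalDist

# X ~ N(18.0, 5.0); standardize, then look up the cumulative probability.
x_dist = NormalDist(mu=18.0, sigma=5.0)

z = (18.6 - 18.0) / 5.0      # 0.12
p = NormalDist().cdf(z)      # table value for P(Z < 0.12): 0.5478

# equivalently, without standardizing by hand:
p_direct = x_dist.cdf(18.6)
print(round(p, 4))
```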

87
FINDING
NORMAL
UPPER TAIL
PROBABILITIES

✔ Suppose X is
normal with mean
18.0 and standard
deviation 5.0
✔ Now find P(X >
18.6)

88
FINDING NORMAL UPPER TAIL
PROBABILITIES

• DON’T FORGET:
P(Z < a) + P(Z ≥ a) = 1, so
P(Z ≥ a) = 1 − P(Z < a)

• Now find P(X > 18.6):
P(X > 18.6) = P(Z > 0.12) = 1 − P(Z < 0.12) = 1 − 0.5478 = 0.4522

89
FINDING
NORMAL
PROBABILITIES
BETWEEN
TWO VALUES

• Suppose X is
normal with mean
18.0 and standard
deviation 5.0. Find
P(18 < X < 18.6):

90
SOLUTION: FINDING P(0 < Z < 0.12)

P(18 < X < 18.6) = P(0 < Z < 0.12) = P(Z < 0.12) − P(Z < 0)
= 0.5478 − 0.5000 = 0.0478

Standardized Normal Probability Table (Portion)

91
PROBABILITIES IN THE LOWER
TAIL

✔ Suppose X is normal with


mean 18.0 and standard
deviation 5.0.
✔ Now Find P(17.4 < X < 18)

92
PROBABILITIES IN
THE LOWER TAIL

Now Find P(17.4 < X < 18)…..


P(17.4 < X < 18) = P(-0.12 < Z < 0) = P(Z < 0) – P(Z
≤ -0.12)
= 0.5000 - 0.4522 = 0.0478

93
EMPIRICAL
RULES
What can we say about
the distribution of values
around the mean? For any
normal distribution:

μ ± 1σ covers about 68% of the values
μ ± 2σ covers about 95% of the values
μ ± 3σ covers about 99.7% of the values

94
THE EMPIRICAL
RULE

95
Introduction to
Probabilities
Module 1

96
97
Basic probability
concepts
• Probability: the chance that an uncertain
event will occur (always between zero and
one)
• Impossible event: an event that has no
chance of occurring (probability = 0)
• Certain event : an event that is sure to occur
(probability = 1)

98
Assessing probability

99
Example of empirical probability

100
Events
Each possible outcome of a variable is an event.
• Simple event
• An event described by a single characteristic
• E.g. A day in January from all days in 2013
• Joint event
• An event described by two or more characteristics
• E.g. A day in January that is also a Wednesday from all days in 2013
• Complement of an event A (denoted as A’)
• All events that are not part of event A
• E.g. All days from 2013 that are not in January

101
Sample space
• The sample space is the collection of all possible
events
• E.g. All 6 faces of a die:

• E.g. All 52 cards of a bridge deck:

102
Organizing & Visualizing Events

103
Definition: Simple
Probability

• Simple probability refers to the probability of a simple
event.
e.g. P(Jan.)
e.g. P(Wed.)

104
Definition:
Joint
probability
• Joint probability refers
to the probability of an
occurrence of two or
more events (joint
event).
• E.g. P(Jan. and
Wed.)
• E.g. P(Not Jan. and
Not Wed.)

105
Mutually exclusive events

• Mutually exclusive events


• Events that cannot occur simultaneously

• Example: Randomly choosing a day from 2013

• Events A and B are mutually exclusive

106
Collectively exhaustive events

• Collectively exhaustive events


• one of the events must occur
• the set of events covers the entire sample space
Example: randomly choose a day from 2013

• Events A , B ,C , D are collectively exhaustive (but not mutually exclusive - a weekday can
be in January or in spring)
• events A and B are collectively exhaustive and also mutually exclusive

107
Computing joint and marginal probabilities

• The probability of a joint event, A and B:
P(A and B) = (number of outcomes satisfying A and B) / (total number of outcomes)

• Computing a marginal (or simple) probability:
P(A) = P(A and B1) + P(A and B2) + … + P(A and Bk)

• Where B1, B2, …, Bk are k mutually exclusive and collectively exhaustive
events

108
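As a check on the joint and marginal definitions, the "days of 2013" sample space from the earlier slides can be enumerated directly. This is a quick sketch, not part of the original slides; 2013 is not a leap year, so the loop covers all 365 days:

```python
from datetime import date, timedelta

# Build the sample space: every day of 2013 (365 days, not a leap year).
days = [date(2013, 1, 1) + timedelta(n) for n in range(365)]

jan = [d for d in days if d.month == 1]
wed = [d for d in days if d.weekday() == 2]          # Monday=0, so Wednesday=2
jan_and_wed = [d for d in days if d.month == 1 and d.weekday() == 2]

p_jan = len(jan) / len(days)                  # marginal probability P(Jan.)
p_wed = len(wed) / len(days)                  # marginal probability P(Wed.)
p_jan_and_wed = len(jan_and_wed) / len(days)  # joint probability P(Jan. and Wed.)

print(len(jan), len(wed), len(jan_and_wed))   # 31 52 5
```

Counting events and dividing by the size of the sample space is exactly the empirical-probability recipe from the earlier slides.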
Joint
Probability
Example

109
Marginal
probabilit
y example

110
Marginal and
Joint
probabilities in
a contingency
table

111
Probability Summary so far

• Probability is the numerical measure of the likelihood that an event will
occur
• The probability of an event must be between 0 and 1, inclusive
(0 = impossible, 1 = certain)
• The sum of the probabilities of all mutually exclusive
and collectively exhaustive events is 1.

112
General Addition Rule
• General addition rule:

P(A or B) = P(A) + P(B) – P(A and B)

• If A and B are mutually exclusive, then P(A and B) = 0, so the rule
can be simplified:

P(A or B) = P(A) + P(B)
For mutually exclusive events A and B

113
General
addition
rule
example

114
Computing conditional probabilities

• A conditional probability is the probability of one event, given that another
event has occurred

P(A | B) = P(A and B) / P(B)
The conditional probability of A given that B has occurred

P(B | A) = P(A and B) / P(A)
The conditional probability of B given that A has occurred

• Where P(A and B) = joint probability of A and B
• P(A) = marginal or simple probability of A
• P(B) = marginal or simple probability of B

115
Conditional probability example
• Of the cars on a used car lot, 70% have air conditioning (AC) and 40%
have a GPS. 20% of the cars have both.
• What is the probability that a car has a GPS, given that it has AC?
• E.g. we want to find P(GPS | AC)

116
Conditional Probability Example (Continued)

• Of the cars on a used car lot, 70% have air conditioning (AC) and 40%
have a GPS and 20% of the cars have both.

117
118
119
Conditional
Probability Example
(Continued)

• Given AC, we only consider the top row
(70% of the cars). Of these, the 20% with
both AC and GPS represent
0.20 / 0.70 ≈ 28.57%.

120
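The car-lot example above can be verified with a few lines of Python; the percentages are the slide's own numbers:

```python
# Slide numbers: 70% of cars have AC, 40% have GPS, 20% have both.
p_ac, p_gps, p_both = 0.70, 0.40, 0.20

# P(GPS | AC) = P(GPS and AC) / P(AC)
p_gps_given_ac = p_both / p_ac
print(round(p_gps_given_ac, 4))   # 0.2857
```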
Independence
• Two events are independent if and only if:

P(A | B) = P(A)

• Events A and B are independent when the
probability of one event is not affected by the
fact that the other event has occurred.

121
Multiplications
rules
• Multiplication rule for two events A and
B:

P(A and B) = P(A | B) P(B)

• Note: If A and B are independent, then
P(A | B) = P(A) and the multiplication
rule simplifies to:

P(A and B) = P(A)P(B)

122
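Using the same car-lot numbers as the conditional-probability example, a short sketch shows the multiplication rule in action and why AC and GPS are not independent:

```python
p_ac, p_gps, p_both = 0.70, 0.40, 0.20   # same car-lot numbers as before

# Multiplication rule: P(A and B) = P(A | B) * P(B)
p_ac_given_gps = p_both / p_gps           # P(AC | GPS) = 0.20 / 0.40 = 0.5
assert abs(p_ac_given_gps * p_gps - p_both) < 1e-12

# Independence check: A and B are independent iff P(A and B) == P(A) * P(B)
independent = abs(p_both - p_ac * p_gps) < 1e-12
print(independent)   # False, since 0.70 * 0.40 = 0.28, not 0.20
```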
Statistics
Practical Examples
Module 1

123
Statistical Modeling
• A statistical model is our attempt to describe a real-world problem using
mathematical equations. This model can be used to make
predictions, while in the future we aim to analyze and refine this model
to produce even better predictions.
• However: The model simplifies the situation and only describes a part
of the real-world problem and sometimes may only work for a particular
range of values.

• Example of a Statistical Model:


• x+5 = y (Let’s say this statistical model describes the real-world problem of
temperature prediction)

124
Types of variables
• Discrete: Can only take certain values
• E.g. shoe size (a size is just a single number; we jump straight to 40 (EU point
system) without passing through every value in between), cost in $ and euros
(say that 10$ is 11.5 euros, so it's just a conversion), number of coins (having 2
euros in a pocket doesn't require building up from 1 euro + 10 cents + ...),
number of people entering a store in a 5-minute period (it might be 0, 1, 2, 4,
6, ... people; if 2 people enter together we record the number 2 directly, not 1 + 1)
• Continuous: Can take any value within a given range
• E.g. age, height, time, temperature. With age, for example, we can't simply "be"
25 years old: life begins (let's just say) at 0 years old and we MUST pass
through every value in between until we reach our current age of 25, then 26,
27, and so on.
• Quantitative: Numerical values
• E.g. the length of a trouser, our age, the time, the number of coins in a pocket, the
temperature, score within a game, etc.
• Qualitative: Non-numerical values
• E.g. red, blue, yellow, short, long, cold, hot, light, dark, etc.

125
Terms
• Values = Data Points = Numbers
When you hear these terms, you should always consider something like
this:

1, 2, 5, 8, 7, 3, 10, …
Or red, blue, green, yellow, etc.

So, we will refer to a set of numbers (quantitative) or things (qualitative)!

126
Frequency Distribution
• Frequency distribution (or frequency table) is how many times you see
each value within a set of numbers.

Exercise: We gather the Instagram followers for 3 people:
100, 100, 90
Find the frequency distribution (or else called frequency table):

Value     100  90
Frequency 2    1

• Example:
• The number of M&Ms:
127
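A frequency table is one line of Python with the stdlib `Counter`, shown here on the followers example:

```python
from collections import Counter

followers = [100, 100, 90]      # Instagram followers of the 3 people
freq = Counter(followers)       # frequency table: value -> how many times it appears

print(freq)                     # Counter({100: 2, 90: 1})
```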
Cumulative Frequency
• Add up the frequencies as you go down/along the list:
Example:
• The number of Instagram followers:

Value       100   90        10               120
Frequency   3     2         1                5
Cumulative  3     3+2 = 5   5+1 = 6          6+5 = 11
                            (or 3+2+1 = 6)   (or 3+2+1+5 = 11)

So, the point is that we start with the first value (its cumulative frequency is
just its own frequency, since nothing comes before it), and each next entry adds
that value's frequency to the running total found so far

128
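The running totals in the table above are exactly what `itertools.accumulate` produces:

```python
from itertools import accumulate

frequencies = [3, 2, 1, 5]                  # the frequency row from the table
cumulative = list(accumulate(frequencies))  # running totals: 3, 3+2, 3+2+1, ...

print(cumulative)   # [3, 5, 6, 11]
```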
Measures of Central Tendency

129
Mean
• Calculate the sum of the
numbers divided by how many
numbers there are:

• Remember!
• Σ (Greek letter) tells you to sum
up all the numbers seen

130
Mean - Example
• I have the following numbers:
•1 2 5 6 2 3 1

Step 1: Sum up all numbers: 1 + 2 + 5 + 6 + 2 + 3 + 1 = 20


Step 2: Find n (how many numbers you see): 7
Step 3: Calculate: 20 / n = 20 / 7 = 2.85…

131
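The three steps above collapse into one line of Python, using the slide's own data:

```python
values = [1, 2, 5, 6, 2, 3, 1]

mean = sum(values) / len(values)   # Steps 1-3: sum = 20, n = 7, mean = 20/7
print(round(mean, 2))              # 2.86
```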
Median
• The median is the middle number in a sorted (ascending or descending) list of
numbers.
• The median is sometimes used instead of the mean when there
are outliers in the sequence that might skew the average of the values

Example:
• Step 1: Put the numbers in order
• I have the following numbers: 1 3 2 5 10 5 4
• Ascending order: 1 2 3 4 5 5 10
• Descending order: 10 5 5 4 3 2 1

132
Median
• Step 2: Question yourself. Is the middle number obvious? Can you find a number that has
equal number of values from both sides?
• 1 2 3 4 5 5 10

Example 2:
I have the following numbers: 1 4 2 3 5 2
Sorted: 1 2 2 3 4 5 (the two middle values are 2 and 3)
This time the middle number is not obvious, but a pair of middle numbers is! So, in
this case:
Take the sum of these two values and divide it by 2. This is our median: (2 + 3) / 2 = 2.5.

133
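The stdlib handles both cases (odd and even counts) automatically, using the two slide examples:

```python
import statistics

odd = [1, 3, 2, 5, 10, 5, 4]    # 7 values: median is the sorted middle value
even = [1, 4, 2, 3, 5, 2]       # 6 values: median averages the two middle values

print(statistics.median(odd))   # 4
print(statistics.median(even))  # 2.5
```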
Mode
It’s the value that appears most often within a set of values. To find it, count
how many times you see each unique number.

134
Mode – Example
Example:
143256785
• Step 1: Unique values:
• 12345678
• Step 2: How many times you see each one?

1 2 3 4 5 6 7 8
1 1 1 1 2 1 1 1
• Step 3: What numbers are most frequently seen?

1 2 3 4 5 6 7 8
1 1 1 1 2 1 1 1
So, 5 is the mode.

135
Mode – Example (Bimodal)
Example:
1433256785
• Step 1: Unique values:
• 12345678
• Step 2: How many times you see each one?

1 2 3 4 5 6 7 8
1 1 2 1 2 1 1 1
• Step 3: What numbers are most frequently seen?

1 2 3 4 5 6 7 8
1 1 2 1 2 1 1 1
So, in this case, we have two values (3 and 5), so the mode is bimodal (two modes)

136
Mode – Example (Multimodal)
Example:
14332567785
• Step 1: Unique values:
• 12345678
• Step 2: How many times you see each one?

1 2 3 4 5 6 7 8
1 1 2 1 2 1 2 1
• Step 3: What numbers are most frequently seen?

1 2 3 4 5 6 7 8
1 1 2 1 2 1 2 1
So, in this case, we have multiple values (3, 5, and 7), so the mode is multimodal
(multiple modes)

137
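All three cases (one mode, bimodal, multimodal) are handled by `statistics.multimode`, shown here on the slides' own sequences:

```python
import statistics

unimodal   = [1, 4, 3, 2, 5, 6, 7, 8, 5]        # 5 appears twice: one mode
bimodal    = [1, 4, 3, 3, 2, 5, 6, 7, 8, 5]     # 3 and 5 tie: two modes
multimodal = [1, 4, 3, 3, 2, 5, 6, 7, 7, 8, 5]  # 3, 5 and 7 tie: three modes

print(statistics.multimode(unimodal))            # [5]
print(sorted(statistics.multimode(bimodal)))     # [3, 5]
print(sorted(statistics.multimode(multimodal)))  # [3, 5, 7]
```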
Terms

Extreme
• Some values that are far from
values = others
Outliers

• 1, 2, 1, 5, 10, 100, 5
For • 100 is far from the other
numbers, so we call it an extreme
example: value or an outlier.

138
When you should use each measure
Mode:
You should use the mode if the data is qualitative (colour etc.) or if quantitative
(numbers) with a clearly defined mode (or bi-modal). It is not much use if the distribution
is fairly even

Median:
You should use this for quantitative data (numbers), when the data is skewed, i.e., when
the median, mean and mode are probably not equal, and when there might be extreme
values (outliers)

Mean:
This is for quantitative data (numbers) and uses all pieces of data. It gives a true
measure, but should only be used if the data is fairly symmetrical (not skewed), i.e.,
when the mean is not distorted by extreme values (outliers)

139
Measures of Variability

• Range
• Variance and Standard Deviation

140
Range
• The range is the largest number minus the smallest (including outliers)

• Example:

Number of friends of 11 Facebook users:


22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252

Range = 252 – 22 = 230

– Very biased by outliers – In more detail, the resulting range is affected


by extreme values. So, in this case, range is not a preferable measure to
use in order to describe the set of values.

141
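The range calculation above is a one-liner on the slide's Facebook data:

```python
friends = [22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252]  # the slide's data

rng = max(friends) - min(friends)   # range = largest minus smallest
print(rng)                          # 230
```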
Variance

142
Standard
deviatio
n

143
Explanation of each term

144
Explanation of each term

145
Explanation of each term

• When we finish calculating the squared differences, we need to find the
sum of all these results.
• So, in our example: 23.61 + 8.17 + 0.73 + 17.13 + 0.73 + 0.01 + 26.41 = 76.79

146
Explanation of each term

147
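The squared-difference data on these slides comes from a worked example not reproduced in the text, so this sketch reuses the values from the earlier mean example; `pvariance`/`pstdev` implement exactly the mean-of-squared-deviations recipe described above:

```python
import statistics

values = [1, 2, 5, 6, 2, 3, 1]        # reusing the earlier mean example

var = statistics.pvariance(values)    # population variance: mean squared deviation
std = statistics.pstdev(values)       # population standard deviation = sqrt(var)

print(round(var, 2), round(std, 2))   # 3.27 1.81
```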
Skewed
Distribution

• Positive Skew: Mode < Median < Mean


• Symmetrical Distribution (e.g. Normal Distribution): Mode = Median =
Mean
• Negative Skew: Mode > Median > Mean or Mean < Median < Mode

148
Normal Distribution

When the values of a variable X:


1) Draw a bell-shaped curve
2) Have mean= median = mode
3) They are symmetrical with each
other
Then we call the distribution of
the data points of variable X as
Normal.
Also, Location of a data point is
determined by the mean, while
spread of different data points
are determined by the standard
deviation.

149
Normal Distribution

• Normal Distribution appears only when we deal with a Continuous
variable, e.g. Temperature.
• A Normal Distribution can take values from −∞ to +∞.
• Continuous variables that can’t take negative values (e.g. height, age)
can therefore never be exactly Normal, although in practice they are often
approximately so.

150
Standardized Normal Distribution
• We transform the X unit (e.g. $,
people, etc.) of the values of
variable X into Z units.

• Standardized Normal Distribution


(from now on Z-Distribution) has
the following characteristics:
• Mean = 0
• Standard deviation = 1
• While the mathematical formula
for the transformation is:

151
A typical exercise - Examples

A) Find the probability that variable X < 200 – P (X < 200)


B) Find the probability that variable X >= 100 – P (X >= 100)
C) Find the probability that variable X is between 100 < X < 200 – P(100
< X < 200)

152
What we need to know for these exercises

1) Mean
2) Population’s Standard Deviation

153
Example for
Case A
Find the probability
that variable X < 200
1) We transform the X value to a Z-value (which
we will call a Z-score from now on).
2) In this example the Z-score is negative
(X = 200 lies below the mean), so with a
positive-Z table we look up 1.92 instead of
–1.92 and subtract the result from 1, using the
symmetry Φ(–z) = 1 – Φ(z).

154
Example for
Case A (2) Let’s say we have a data set which is normally distributed with a
mean of 150 and a standard deviation of 20. Find the
probability that a value is less than 200:
Find the probability
that variable X < 200 1) Z = (200-150)/20 = 50/20 = 2.5
2) P(X<200) = P(Z<2.5) = Φ(2.5) = 0.9938
1) We transform the X
value to a Z-value (which
we will call a Z-score
from now on)
2) 2.5 will be the value
we are going to look in
the Z-Table
3) This is the probability
that this initial case will
happen

155
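The Z-table lookup above can be checked with the stdlib `NormalDist` (Python 3.8+), whose `cdf` plays the role of Φ:

```python
from statistics import NormalDist

# X ~ Normal(mean=150, std=20): find P(X < 200), as in the worked example.
dist = NormalDist(mu=150, sigma=20)

z = (200 - dist.mean) / dist.stdev   # Z-score: (200 - 150) / 20 = 2.5
p = dist.cdf(200)                    # same as looking up Φ(2.5) in the Z-table

print(z, round(p, 4))                # 2.5 0.9938
```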
Example for
Case B
Find the probability
that variable X >= 100
P(X >= 100) where μ = 55 and σ = 20.
Z = (100 – 55)/20 = 45/20 = 2.25
P(X >= 100) = P(Z >= 2.25) = 1 – Φ(2.25) = 1 – 0.9878 = 0.0122 = 1.22%

1) We transform the X value to a Z-value (which
we will call a Z-score from now on).
2) Because we have >= (greater than or equal),
we locate the value for 2.25 in the Z-table and
subtract it from 1.

156
Example for
Case C
Find the probability that
variable 50 < X < 100
Where μ = 55 and σ = 20:
Z for X = 50: (50 – 55)/20 = –0.25
Z for X = 100: (100 – 55)/20 = 2.25
P(50 < X < 100) = P(–0.25 < Z < 2.25) = Φ(2.25) – Φ(–0.25)
= 0.9878 – 0.4013 (Z-table for negative values gives Φ(–0.25) = 0.4013)
= 0.5865 = 58.65%

1) In this case we need two Z-scores, one for each bound.
2) We transform each X value to a Z-score.
3) The probability of the interval is the difference of the two CDF values:
Φ(2.25) – Φ(–0.25).

157
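For an interval, P(a < X < b) = Φ(z_b) − Φ(z_a); `NormalDist` evaluates both CDF values directly, using the same μ = 55 and σ = 20:

```python
from statistics import NormalDist

# X ~ Normal(mean=55, std=20): find P(50 < X < 100).
dist = NormalDist(mu=55, sigma=20)

# P(a < X < b) is the difference of the two CDF values: Φ(2.25) - Φ(-0.25)
p = dist.cdf(100) - dist.cdf(50)

print(round(p, 4))   # 0.5865
```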
Correlation
✔ Many relationships between two measurement variables tend to fall close to a straight line.
✔ In other words, the two variables exhibit a linear relationship.
✔ It is also helpful to have a single number that will measure the strength of the linear
relationship between the two variables. This number is the correlation.
✔ The correlation is a single number that indicates how close the values fall to a straight line.
✔ Correlation quantifies both the strength and direction of the linear relationship between the
two measurement variables.
✔ The range of possible values for a correlation is between -1 and +1.
✔ Correlation describes the strength and direction of the linear association between variables.
It does not describe non-linear relationships
✔ Correlation is heavily impacted by outliers!

158
Type of Association (or Relationship)

• A positive correlation indicates a positive linear association. The


strength of the positive linear association increases as the correlation
becomes closer to +1.
• A negative correlation indicates a negative linear association. The
strength of the negative linear association increases as the correlation
becomes closer to -1.
• A correlation of 0 indicates either that:
• there is no linear relationship between the two variables, and/or
• the best straight line through the data is horizontal.

159
Positive correlation

The following two questions were asked on a


survey of 220 STAT 100 students:
1. What is your height (inches)?
2. What is your weight (lbs)?
Notice we have two different measurement
variables.
Finally, we notice that as height increases,
weight also tends to increase. These two
variables have a positive association
because as the values of one measurement
variable tend to increase, the values of the
other variable also increase.
The scatterplot shows a positive association
between weight and height. However, there
is still quite a bit of scatter around the
pattern. Consequently, a correlation of 0.541
is reasonable: with this much scatter, a
correlation close to +1 would not be expected.
160
Negative correlation

The following two questions were asked on


a survey of ten PSU students who live
off-campus in unfurnished one-bedroom
apartments.
1. How far do you live from campus (miles)?
2. How much is your monthly rent ($)?
Finally, we notice that the further an
unfurnished one-bedroom apartment is
away from campus, the less it costs to rent.
We say that two variables have a negative
association when the values of one
measurement variable tend to decrease as
the values of the other variable increase.
The scatterplot shows a negative
association between monthly rent and
distance from campus. Since the data points
are very close to a straight line it is not
surprising the correlation is -0.903.

161
No association
The following two questions were asked on a
survey of 220 Stat 100 students:
1. About how many hours do you typically
study each week?
2. About how many hours do you typically
exercise each week?
Finally, we notice that as the number of
hours spent exercising each week increases
there is really no pattern to the behavior of
hours spent studying including visible
increases or decreases in values.
Consequently, we say that that there is
essentially no association between the two
variables.
The scatterplot does not show any strong
association between exercise hours/week
and study hours/week. This lack of
association is supported by a correlation of
0.109.

162
Scatterplot of the relationship between the Infant
Mortality Rate and the Percent of Juveniles Not Enrolled
Outliers & in School for each of the 50 states plus the District of
Columbia:
Correlation
The correlation is 0.73 but looking
at the plot it can be observed that
for the 50 states alone the
relationship is not nearly as strong
as a 0.73 correlation would suggest.
Here, the District of Columbia
(identified by the X) is a clear
outlier in the scatter plot being
several standard deviations higher
than the other values for both the
explanatory (x) variable and the
response (y) variable. Without
Washington D.C. in the data, the
correlation drops to about 0.5.

163
Causation vs Correlation

• Causation is when we have two measurement variables and the one cause the other
to happen. So, we have a cause-and-effect relationship between those two variables.
• Correlation is when we have two measurement variables and the one is somehow
related to the other. So, this means that they simply have a relationship, but one event
doesn’t necessarily cause the other event to happen.
• Causation explicitly applies to cases where action A causes outcome B. On the other
hand, correlation is simply a relationship. Action A relates to Action B—but one
event doesn’t necessarily cause the other event to happen.

164
Causation vs correlation
When there is causation, there is correlation, but when there is correlation that
doesn’t imply causation!!
There are two main reasons why correlation isn’t causation. These problems are
important to identify for drawing sound scientific conclusions from research:
• The third variable problem means that a confounding variable affects both variables
to make them seem causally related when they are not. For example, ice cream sales
and violent crime rates are closely correlated, but they are not causally linked with
each other. Instead, hot temperatures, a third variable, affects both variables
separately.
• The directionality problem is when two variables correlate and might actually have a
causal relationship, but it’s impossible to conclude which variable causes changes in
the other. For example, vitamin D levels are correlated with depression, but it’s not
clear whether low vitamin D causes depression, or whether depression causes
reduced vitamin D intake.
• You’ll need to use an appropriate research design to distinguish between correlational
and causal relationships.

165
Regression

• Regression is a descriptive method used with two different numerical variables to find
the best straight line (which in fact is a mathematical equation) to fit the data points
when looking at a scatterplot.
• Regression is used to make predictions, where X is the
explanatory/predictor/independent variable, and Y is the
response/outcome/dependent variable.

166
Regression

• y = a + bx where:

• a = y-intercept (the value of y when x = 0)


• b = slope of the line. The slope is the change in the variable (y) as the
other variable (x) increases by one unit. When b is positive there is a
positive association, when b is negative there is a negative association.

167
Regression

When can we use Regression to predict?


• When we notice that the points generally fall in a linear pattern so we can use the
equation of a line that will allow us to put in a specific value for X and determine the best
estimate of the corresponding Y.
• The best line would be the one that has the least variability of the points around it,
meaning that we want the data points to come as close to the line as possible.
• The best line will have the smallest standard deviation for the distance from all the data
points to the line. This line is called the regression line or the least squares line.

168
Regression

169
Regression -
Example
• Least squares essentially
finds the line that is
closer to all the data
points than any other
possible line.
• Based on the plot, some
of the points lie above
the line, where other
points lie below the line.
In fact, the total distance
for the points above the
line, is exactly equal to
the total distance from
the line to the points that
fall below it.

170
Statistical Distributions

• A statistical distribution is a parameterized mathematical function that gives the


probabilities of different outcomes for a random variable X.
• Random Variable: A variable that can take whatever value and this value is chosen
by random
• Parameterized mathematical function:
e.g. f(x) = p^x (1-p)^(1-x), for x in {0, 1} (the Bernoulli PMF)

where x is the random variable, f(x) is the function, and p or 1-p are the
probabilities of different outcomes of function f(x) for the variable x.

171
Types of Statistical Distributions

Discrete

• Refers to Discrete variables meaning that we don’t have values within a given
range, but we only refer to specific numbers (1,4,6) without having to go
through 2,3,5, etc., or even text (e.g. red, white, tall, slim, etc.)

Continuous

• Refers to Continuous variables meaning that we deal with values that are within
a specific range. For example, the temperature is a continuous variable because
when the temperatures is 37 Celsius, in order to get there, we have been also in
36.5, 36, 34, etc. Celsius.

172
Most important Distributions

• Bernoulli Distribution
• Binomial Distribution
• Geometric Distribution
• Uniform Distribution
• Normal Distribution
• Poisson Distribution

173
Binomial Distribution

• Binomial distribution is also a discrete distribution, and it describes the random


variable x as the number of successes in n Bernoulli trials. You can think of the binomial
distribution as the outcome distribution of n identical Bernoulli-distributed random
variables. The assumptions of the Binomial distribution are:
• 1) each trial only has two outcomes (like tossing a coin);
• 2) there are n identical trials in total (tossing the same coin for n times);
• 3) each trial is independent of other trials (getting “Head” at the first trial wouldn’t
affect the chance of getting “Head” at the second trial);
• 4) p, and 1-p are the same for all trials (the chance of getting “Head” is the same
across all trials);

174
There are two parameters in the distribution, the success probability p and the number of trials n. The PMF
is defined using the combination formula:

Binomial
Distribution

175
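The PMF using the combination formula, P(X = k) = C(n, k) p^k (1-p)^(n-k), is easy to evaluate with `math.comb`:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k): k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 3 Heads in 5 tosses of a fair coin:
p3 = binomial_pmf(3, 5, 0.5)
print(p3)   # 0.3125
```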
A binomial distribution graph where the probability of success does
not equal the probability of failure looks like:

Binomial
distribution Now, when the probability of success = probability of failure, the
graph of the binomial distribution looks like:

176
Bernoulli Distribution
• Bernoulli distribution is a discrete distribution. The assumptions of Bernoulli distribution
include:
• Only two outcomes (e.g. 0 and 1)
• Only one trial
• Bernoulli distribution describes a random variable that only contains two outcomes. For
example, when tossing a coin one time, you can only get “Head” or “Tail”. We can also
generalize it by defining the outcomes as “success” and “failure”.
• Example of Bernoulli distribution:
• We toss a coin so we have ½ chances to get “head” and ½ chances to get “tail”. So
two outcomes (“Head”, “Tail”) and each one (which is a single outcome) has the
probability of ½.
• We roll a dice and if I got six then for me it’s “Success” and everything else is “Failure”.
So, we have two outcomes and so Bernoulli Distribution can be used.

177
Bernoulli Distribution
• The probability mass function of a random variable x
that follows the Bernoulli Distribution is:

• p is the probability that this random variable x equals


“success”, which is defined based on different
scenarios. Sometimes we have p = 1-p (i.e. p = 0.5), like
when tossing a fair coin.

178
Bernoulli
Distribution

179
Geometric Distribution

• Geometric distribution is a discrete distribution that models the number of failures (x


failures) before the first success in repeated, independent Bernoulli trials. For
example, the random variable can be how many “Tails” you would get before you get
your first “Head.” It can also model the number of trials to get the first success (x-1
failures), like how many times you have to toss until you get the first “Head.” The only
difference between these two random variables is how the failures are counted. The
Geometric distribution assumptions are the same as the Binomial distribution because
they both derive from identical, independent Bernoulli trials.

180
Geometric distribution • When the random variable x is the number of
failures before the first success, the PMF is:

• When the random variable x is the number of


trials to get the first success, the PMF is:

181
Geometric
Distribution

182
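Both versions of the PMF described above are one-line functions; the coin example shows they describe the same event:

```python
def geometric_failures_pmf(x, p):
    """P(exactly x failures before the first success)."""
    return (1 - p) ** x * p

def geometric_trials_pmf(x, p):
    """P(the first success occurs on trial x), for x = 1, 2, 3, ..."""
    return (1 - p) ** (x - 1) * p

# Fair coin: probability the first Head appears on the 3rd toss
# (equivalently, 2 Tails before the first Head):
print(geometric_trials_pmf(3, 0.5))    # 0.125
print(geometric_failures_pmf(2, 0.5))  # 0.125
```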
Uniform distribution

• Uniform distribution models a random variable whose outcomes are equally likely to
happen. The outcomes can be discrete, like the outcomes from rolling a die,
or continuous, like the waiting time for a bus to arrive. Thus, Uniform distribution can
be a discrete or continuous distribution depending on the random variable. The
assumptions are:
1) there are n outcomes (discrete), or a range for the outcomes to be at
(continuous);
2) All values in the outcome set or the range are equally likely to occur.

183
• The discrete uniform distribution is straightforward, and it
is easy to calculate. For a continuous Uniform distribution
that is uniformly distributed at [a, b], the probability
density function (PDF) is:

Uniform
distribution

184
Uniform distribution

185
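For the continuous case, the density is flat, so an interval's probability is just its length times 1/(b-a). The 10-minute bus wait below is an assumed example, not from the slides:

```python
# Continuous Uniform on [a, b]: the density is flat, f(x) = 1 / (b - a).
a, b = 0, 10        # assumed example: bus waiting time between 0 and 10 minutes
density = 1 / (b - a)

# The probability of an interval is its length times the density:
p = (5 - 2) * density   # P(2 <= X <= 5)
print(round(p, 2))      # 0.3
```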
Normal Distribution

• Normal (Gaussian) distribution is the most widely used continuous distribution


because it describes a great many real-world situations. Many random variables are
normally distributed because of the Central Limit Theorem, or they are assumed to be
normally distributed before fitting them into a statistical model. Normal distribution has
some unique characteristics:
some unique characteristics:
1) mean=mode=median=µ ;
2) the PDF is bell-shaped and symmetric at x=µ;
3) the values between [µ-σ, µ+σ] take roughly 68% of the data, where σ is the
standard deviation, and:

186
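The 68% claim in point 3 can be verified directly with the standard normal CDF:

```python
from statistics import NormalDist

z = NormalDist()                        # standard normal: mean 0, std 1
within_one_sigma = z.cdf(1) - z.cdf(-1)  # P(µ-σ < X < µ+σ) in Z units
print(round(within_one_sigma, 4))        # 0.6827
```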
Probability function

187
Normal
distribution

188
Poisson distribution

• Poisson distribution is a discrete distribution that models the probability of a number of


events occurring in a fixed interval of time or space. Poisson distribution can be used to
model the number of customers arriving in a store in an hour, or the number of phone
calls a company receives in one day, etc. Poisson distribution is closely related to the
binomial distribution if you measure the number of event occurrences as the number of
successes. For example, when measuring how many cars will pass a particular street in an
hour, the number of cars passing is a random variable that follows a Poisson distribution.
To understand the car-passing event, you can break one hour down into 60 minutes and
see how many cars pass in a minute, then generalize it to an hour. In a minute, maybe
more than one car passes the street, so it is not a binary random variable. However, if we
break one minute down into 60 seconds, it is likely that only one car passes, or no car
passes, in a single second. We can keep breaking a second down to make a more
confident claim. We can then consider a car passing as a success event and no car
passing as a failure event, and model how many success events (how many cars passing)
occur in 3600 trials (3600 seconds in an hour), which is a Binomial distribution with
success probability p and the number of trials equal to 3600

189
• The assumptions Poisson distribution are:
1) any successful event should not influence the outcome of other
successful events (observing one car at the first second doesn’t affect the
chance of observing another car the next second);
2) the probability of success p, is the same across all intervals (there
is no difference between this hour with other hours to observe cars
passing by);
3) the probability of success p in an interval goes to zero as the
interval gets smaller (if we are discussing how many cars will pass in a
millisecond, the probability is close to zero because the time is too short);
Poisson
distribution
• The PMF of the Poisson distribution can be derived from the PMF of the
binomial distribution:

Where x is the number of successes in the Bernoulli trials.
190


Poisson distribution

191
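The Poisson PMF, P(X = k) = λ^k e^(−λ) / k!, needs only `math`. The 4-cars-per-minute rate below is an assumed example in the spirit of the car-passing discussion:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k events in an interval) when events average lam per interval."""
    return lam ** k * exp(-lam) / factorial(k)

# Assumed example: on average 4 cars pass per minute.
p2 = poisson_pmf(2, 4)        # P(exactly 2 cars in one minute)
print(round(p2, 4))           # 0.1465
```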
Useful Content
192
Z-Table
for Positive
Z-Values

193
Z-Table
for Negative
Z-Values

194
Excel
Module 1

195
Week 1

• Intro
• Insert/Delete
• Format cells/Conditional Formatting

196
What is Excel?

• Based on Techopedia.com, Microsoft


Excel is a software program produced
by Microsoft that allows users to
organize, format and calculate data with
formulas using a spreadsheet system.
• In a spreadsheet we can have a lot of
worksheets.
• Excel provides tables, charts and
generally analytical tools e.g. pivot
tables.

197
Basic Understanding

• In Excel we deal with three things:


• Cells
• Rows
• Columns

198
Key Features

• Spreadsheet applications such as


Microsoft (MS) Excel use a collection
of cells arranged into rows and
columns to organize and manipulate
data. They can also display data as
charts, histograms and line graphs.
• We are referring to the very first cell
by saying A1 (Column-Row), B2, etc.

199
Quick access toolbar and ribbon menu

This is the Quick access toolbar:

And this is the Ribbon Menu:

These are the tabs:

200
Mini Toolbar

• Mini Toolbar can be found if you press


right-click anywhere in the worksheet.
• It offers a lot of important commands
from the ribbon menu that the user uses
the most.

201
Insert/Delete a new Column/Row

When you have some data, and you want to insert/delete a new row/column in a
specific cell then you should do the following:
• Right-click on the cell that you want
• Select Insert/Delete

202
Workbook vs Worksheet

• A workbook is a file (.xlsx is the extension of the file) while a


worksheet is a blank/filled sheet within the workbook.
• Every workbook has at least one worksheet.

203
Shortcuts for Start/End of a
row/column
• If we have a table with data in a worksheet, then we might need to go to the first
(last) cell that is filled with data. In order to do this, we have the following shortcuts:
• Ctrl + Home (Start of the table – First Cell)
• Ctrl + End (End of the table – Last Cell)
• Ctrl + Up Arrow (First cell of the column)
• Ctrl + Down Arrow (Last cell of the column)
• Ctrl + Right (Last cell of the row)
• Ctrl + Left (First cell of the row)
• Cases:
• If we want to go to the very first cell of the table that is filled with data (most of
the time A1), just press Ctrl + Home
• If we want to go to the last cell of the table that is filled with data (bottom right
corner), just press Ctrl + End

204
Questions

• When should we use Excel for a task?


• When writing an article about a new product
• When creating a presentation to give to a team
• When writing an instruction guide for a team
• When creating a list of employees with their ID numbers
• A user wants to quickly format a cell and right-clicks the cell to use the
features seen in the image. Which Excel tool is being used?
• Ribbon
• Quick access toolbar
• Mini toolbar

205
Questions
• You would like to add some frequently used buttons to the Quick Access
Toolbar. How can you do this?
• Drag-and-drop the button to the Quick Access Toolbar
• Right-click the button on the ribbon and select Add
• Double-click the button on the ribbon and select Move
• Left-click the button on the ribbon and select Add
• You want to add an average in the worksheet shown below. You want to
learn more about the average function in Excel and think that a video will
assist you. How can you access training from the workbook?
• Click help from the menu bar
• Click Featured help
• Click tell me what you want to do
• Click tell me more

206
• Ctrl + C = Copy
• Ctrl + V = Paste
• Ctrl + X = Cut
• Ctrl + A = Select All
• Ctrl + Z = Undo last(s) edit
• Ctrl + Y = Redo last edit (if you pressed Undo
before)
• Ctrl + B = Bold text
• Ctrl + I = Italic text
• Ctrl + U = Underlined text
• Ctrl + N = New Workbook
• Ctrl + F = Find text (Ctrl + H opens Find and Replace)
• Ctrl + S = Save

Handy shortcuts
- Windows
• Ctrl + D = Fill down (copies the cell above)
• Ctrl + P = Print

207
Notes

• Notes are useful when you want to give further explanation to a cell.
In order to create a note:
• Right-click on the cell where you want to create the note
• Select New Note
• Write your note (it is auto saved)

208
Workbook Statistics

• Workbook Statistics is a nice way to


understand the various components of
the Workbook you have just created.
• In Review Tab 🡪 Proofing 🡪 Workbook
Statistics

209
Data Types and Formatting

• Date
• Text
• General
• Currency

210
Handy tip for understanding the data type
• Let’s say we create a column that holds several dates. In order to
understand whether we wrote each date in the correct format, we should pay
attention to this:
• Do you see that some “dates” sit on the right side of the cell (with a
blank space at the beginning of the cell) while some others sit on
the left side of the cell (with a blank space at the end of the cell)?
• The right-aligned cells are in the correct format as far as the
Date data type is concerned, while the left-aligned cells are
not of data type Date, but of data type Text!
• The same happens for Number: when Excel recognizes a value
as a number, it appears on the right side of the cell.

211
Format Cells

• Right-click on the cell you want to


configure the data type and select
Format Cells…
• The data type of the cell is automatically
selected and in the section Type you
can select how you want the cell to be
formatted.
• In our example, we deal with a date
data type and we can find several
types of formats.

212
Format cells

• Most of the time, when you right-click


in a cell and select Format Cells.. you
will notice that the data type is
General.

• However, if for example, you deal with


a decimal number but the data type is
General, just select Number from the
Category and then you can choose
between some options.
• SOS Tip! The decimal part of a
number is denoted with a point (.),
while if we want to separate
thousands, we use the comma (,)

213
Smart replacing
Drag one or two cells and Excel will automatically generate the series you want.
Case 1: You need to generate the numbers from 1-100.
Solution 1:
• Type the numbers 1 and 2
• Select both cells
• Drag the cell to the bottom until you reach the row 100 (see on the left)
Case 2: You need to generate the number 1 multiple times
Solution 2:
• Type the preferable number (1 in our example)
• Select the cell
• Drag the cell to the bottom until you reach the number of values that you want to generate

214
Conditional Formatting – Highlight
Cells Rules
Conditional Formatting enables you to highlight cells with a certain color,
depending on the cell’s value.
Steps
• Select a range of cells
• On the Home tab, in the Styles group, click Conditional Formatting
• Click highlight cells rules, e.g. Greater than
• Enter a random value, e.g. 80, and select a formatting style.

If you change a cell value in the column where you have applied Conditional
Formatting, the cell is automatically colored, or not, based on the condition
you have applied.

215
Conditional Formatting -
Top/Bottom Rules

To highlight cells that are above average:


Steps
• Select a range of cells
• On the Home tab, in the Styles group, click Conditional Formatting
• Click Top/Bottom Rules, e.g. Above Average
• Select a formatting style.

216
Conditional Formatting –
With Formulas

Use a formula to determine which cells to format. Formulas that apply conditional
formatting must evaluate to TRUE or FALSE.
Steps
• Select a range of cells
• On the Home tab, in the Styles group, click Conditional Formatting
• Click New Rule
• Select Use formula to determine which cells to format
• Enter the formula e.g. ISODD(A1) (Always write the formula for the upper-left cell
in the selected range because Excel automatically applies the formula to the rest of
the cells)
• Select a formatting style and press OK

217
Freeze a row or column heading in Excel

You can freeze the row and column headings. Therefore, no matter how
much you scroll you will still see these rows and columns. This trick is
rather simple to implement.
Steps
• Click on the View tab and select one of the following options:
• Freeze Panes: Freezes the rows above and the columns to the left of the selected cell
• Freeze Top Row: Freezes the top row
• Freeze First Column: Freezes the first column
• Your first row and/or column will be visible as you scroll.

218
Clear Format

When you want to clear the content or styling of a specific cell (or range of
cells), use Clear from the Editing section on the Home tab.
• Remove cell styling: Clear Formats
• Remove content: Clear Contents

219
Sort Data

• Sort One column:


• Click any cell in the column you want
to sort
• To sort in ascending order, on the Data
tab, in the Sort & Filter group, click AZ
• Sort multiple columns:
• On the Data tab, in the Sort & Filter
group, click Sort
• The sort dialog box appears. Select the
column you prefer from the Sort by
drop-down list
• Click on Add Level
• Select the 2nd column from the Then
by drop-down list
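The multi-level sort above (Sort by one column, Then by another) boils down to sorting with a two-part key. A minimal Python sketch with made-up rows:

```python
# Assumed sample rows: (Country, Salesperson). "Sort by" Country,
# "Then by" Salesperson, both ascending, maps to a two-part key.
rows = [("USA", "Mary"), ("Greece", "Nick"),
        ("USA", "Alex"), ("France", "Eva")]
rows.sort(key=lambda r: (r[0], r[1]))  # A-to-Z on both levels
print(rows)
```

For a descending second level you could sort on that key first and rely on the stability of the main sort, or negate a numeric key.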

220
Filter Data

• Click any single cell inside a data set


• On the Data tab, in the Sort & Filter group, click
Filter
• Arrows in the column headers appear
• Click the arrow next to the preferable column
• Click on Select All to clear all the check boxes,
and click the check box you prefer
• On the Data tab, in the Sort & Filter group, click
Clear in order to remove the filter and the
arrows
• Tip! You can omit showing Blank data when
filtering data!

221
Remove Duplicate Data

Duplicate Data
Duplicate data occurs when the same record appears more than once even
though each record should appear only once. For example:
- We gather data for each of our customers
Alexandra Ath, Mary Tze, Nick Gre, Mary Tze, etc.
Mary Tze appears twice, which is wrong because we need to have each
customer once.
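Remove Duplicates keeps the first occurrence of each value and drops the rest. A sketch of that behavior on the customer list from the slide:

```python
# The customer column from the slide, with Mary Tze duplicated.
customers = ["Alexandra Ath", "Mary Tze", "Nick Gre", "Mary Tze"]

# dict.fromkeys preserves insertion order, so each customer keeps
# its first occurrence and later duplicates are dropped.
unique = list(dict.fromkeys(customers))
print(unique)  # ['Alexandra Ath', 'Mary Tze', 'Nick Gre']
```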

222
Week 2

• Intro to Functions
• Count/CountIF
• Logical Operators

223
Remove Duplicate
Data (2)
In Data tab 🡪 Data Tools 🡪 Remove
Duplicates

224
Formulas/Functions

• When we want to calculate something
(e.g. the sum of some numbers) we
use formulas.
• Formulas start with the equal sign (=)
• Then you write the function that you
want to use (e.g. SUM). Helpfully,
Excel offers real-time tips to help
you find the correct syntax for the
function.

225
Formulas/Functions (2)
• In the parentheses, you need to write the cells whose sum you want to calculate:

• Select the first cell and drag your cursor until you have selected all the cells you want.
• Close the parentheses and then press Enter.
• This is the result!

226
Formulas/Functions – Operator Precedence

1) Parentheses
2) Multiplication / Division
3) Addition / Subtraction

e.g. (A1 – A3) * 2 /B1


At first, we calculate A1-A3 because it is in parentheses, then we multiply the result by 2
and finally, we divide by B1. So, when we have two operators of the same precedence
(e.g. multiplication/division) we do the calculations from left to right.
e.g. 2: (A1*A2 – A3) + B3*B1/B2
First we calculate the formula inside the parentheses. Because multiplication goes first,
we calculate A1*A2 and then subtract A3 from the result. After finishing with the
parentheses, we have + B3*B1/B2. Multiplication and division share the same precedence,
so we work from left to right: first B3*B1, then divide that result by B2, and lastly we add
this to the result we found inside the parentheses.
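Python follows the same order of operations, so the two example formulas can be checked directly. A sketch with assumed values for the cells:

```python
# Assumed cell values; Python applies the same precedence rules
# (parentheses, then * and / left to right, then + and -).
A1, A2, A3, B1, B2, B3 = 10, 3, 4, 2, 5, 6

result1 = (A1 - A3) * 2 / B1             # (10 - 4) * 2 / 2
result2 = (A1 * A2 - A3) + B3 * B1 / B2  # (30 - 4) + 12 / 5
print(result1, result2)
```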

227
All Functions
• In order to see the available functions of Excel
with detailed descriptions:
1. Go to tab Formulas
2. Insert Function
3. Select the preferable category
4. Press ok when choosing the preferable
function

• In the Function Arguments dialogue box:


1. Based on the function we chose, the most
usual case is to select a range of cells for
the function to apply to. So, while this
dialogue box is open, go to the sheet, click
on the first cell you want to choose, and
drag your cursor until all the cells you
need are selected.

228
Count and Sum Functions

• Count
• Countif/Countifs
• Sum/sumif/sumifs

229
COUNTIF -
Example
Use the COUNTIF function
in Excel to count cells that
are equal to a value, count
cells that are greater than or
equal to a value, etc.
• The COUNTIF function
below counts the number
of cells that are equal to
20.

230
COUNTIF –
Example 2
The COUNTIF function
below counts the number of
cells that are greater than or
equal to the value in cell C1.

231
COUNTIF – Text Tricks

Use the COUNTIF function in Excel and a few


tricks to count cells that contain specific text.
Always enclose text in double quotation marks.
• The COUNTIF function below counts the number
of cells that contain exactly star.
• The COUNTIF function below counts the number
of cells that contain exactly star + 1 character. A
question mark (?) matches exactly one
character.
• The COUNTIF function below counts the number
of cells that contain exactly star + a series of
zero or more characters. An asterisk (*) matches
a series of zero or more characters.

232
COUNTIFS –
Example
To count cells between two
numbers, use the
COUNTIFS function (with
the letter S at the end).
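The COUNTIF/COUNTIFS behaviors above (equality, comparison, wildcard, and between-two-numbers criteria) can be sketched in Python. The cell values below are made up, not taken from the slides' workbook:

```python
import fnmatch

# Made-up cell values standing in for a worksheet range.
cells = [20, 20, 35, 12, "star", "stars", "starship"]
nums = [v for v in cells if isinstance(v, (int, float))]

count_equal_20 = sum(1 for v in nums if v == 20)       # COUNTIF(range, 20)
count_ge_12 = sum(1 for v in nums if v >= 12)          # COUNTIF(range, ">=12")
count_star_q = sum(1 for v in cells                    # COUNTIF(range, "star?")
                   if isinstance(v, str) and fnmatch.fnmatchcase(v, "star?"))
count_between = sum(1 for v in nums if 15 <= v <= 30)  # COUNTIFS: two conditions
print(count_equal_20, count_ge_12, count_star_q, count_between)
```

As in Excel, `?` matches exactly one character, so only "stars" matches "star?".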

233
Logical Operators and Round

• If • Round(Cell, 3) = Round a number to three


• And decimal places
• Or • Round(Cell,0) = Round a number to the
nearest integer
• Not
• Round(Cell,-1) = Round a number to the
nearest 10

234
IF - Examples

The IF function checks whether a


condition is met, and returns one value if
true and another value if false.
1. For example, take a look at the IF
function in cell B2 below.
2. Always enclose text in double
quotation marks.

235
AND/OR - Examples

OR AND

236
Nested IF

The IF function in Excel can be


nested, when you have multiple
conditions to meet. The FALSE
value is being replaced by
another IF function to make a
further test.
If the score equals 1, the nested
IF formula returns Bad, if the
score equals 2, the nested IF
formula returns Good, if the score
equals 3, the nested IF formula
returns Excellent, else it returns
Not Valid. If you have Excel 2016
or later, simply use the IFS
function.
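The nested IF above reads naturally as a chain of conditions; a Python sketch of the same logic:

```python
def rating(score):
    # Mirrors the nested IF: each FALSE branch is replaced by a
    # further IF, with "Not Valid" as the final fallback.
    if score == 1:
        return "Bad"
    elif score == 2:
        return "Good"
    elif score == 3:
        return "Excellent"
    else:
        return "Not Valid"

print([rating(s) for s in (1, 2, 3, 7)])
```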

237
Week 3

• Functions (2)
• VLookUp/Index-Match
• Correlation

238
Most important Statistical Functions

• Average
• Averageif
• Median
• Mode
• Stdev
• Min/max

239
Generate Random values

• RANDBETWEEN(x,y) where x and y is the range of numbers you


want Excel to generate random values

240
VLOOKUP – Exact Match

• Looking for an exact match:

=VLOOKUP(C1, X0…Xn, C2, B) where
C1 is the cell that holds the value we
want to find, X0…Xn is the range whose
leftmost column we search for that
value, C2 is the number of the column
whose value is returned IF we find the
value in the X0…Xn range, and B
(FALSE for an exact match) controls
what happens IF we don’t find the
value of cell C1

241
VLOOKUP – Approximate Match

• The VLOOKUP function below looks up the value


85 (first argument) in the leftmost column of the red
table (second argument). There's just one problem.
There's no value 85 in the first column.
• Fortunately, the Boolean TRUE (fourth argument)
tells the VLOOKUP function to return an
approximate match. If the VLOOKUP function
cannot find the value 85 in the first column, it will
return the largest value smaller than 85. In this
example, this will be the value 80.
• The value 2 (third argument) tells the VLOOKUP
function to return the value in the same row from
the second column of the red table.
• Tip! Always sort the leftmost column of the red table in
ascending order if you use the VLOOKUP function in
approximate match mode (fourth argument set to TRUE).
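The approximate-match rule ("return the largest value smaller than the lookup value") can be sketched in Python. The grade table below is an assumed stand-in for the slide's red table:

```python
def vlookup_approx(value, table, col_index):
    # table must be sorted ascending by its first column (as the
    # tip above requires). Finds the row with the largest key that
    # is <= value and returns its 1-based column col_index.
    best = None
    for row in table:
        if row[0] <= value:
            best = row
        else:
            break
    if best is None:
        raise LookupError("#N/A")  # value below the smallest key
    return best[col_index - 1]

# Assumed stand-in for the slide's red table.
grades = [(0, "F"), (60, "D"), (70, "C"), (80, "B"), (90, "A")]
print(vlookup_approx(85, grades, 2))  # no key 85, falls back to 80
```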

242
VLOOKUP –
First Match
• If the leftmost column of
the table contains
duplicates, the VLOOKUP
function matches the first
instance.
• Tip! The VLOOKUP
function is case-insensitive
so it looks up MIA or Mia
or mia or miA, etc. As a
result, the VLOOKUP
function returns the salary
of Mia Clark (first
instance).

243
Concatenate text

• Type = and select the first cell. Add the


& symbol. This tells Excel that there will
be another string being added. If you
want a space between the strings you
are combining, use " " to add a space.
• An example of combining three strings
with a space between each string would
look like the formula below:
=A5&" "&B5&" "&C5
• A5, B5, and C5 are the cell references
for each string.

244
MATCH and INDEX

The MATCH function returns the position of a value


in a given range. For example, the MATCH function
below looks up the value 53 in the range B3:B9.
53 (first argument) found at position 5 in the range
B3:B9 (second argument). In this example, we use
the MATCH function to return an exact match so we
set the third argument to 0.

The INDEX function below returns a specific value in a


one-dimensional range.
The INDEX function returns the 5th value (second
argument) in the range E3:E9 (first argument).

245
MATCH and INDEX (2)

Replace the value 5 in the INDEX function


(see previous example) with the MATCH
function (see first example) to lookup the
salary of ID 53.
The MATCH function returns position 5.
The INDEX function needs position 5. It's
a perfect combination. If you like, you can
also use the VLOOKUP function. It's up to
you. However, you'll need INDEX and
MATCH to perform advanced lookups, as
we will see next.
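MATCH finds a 1-based position and INDEX reads a value at a 1-based position; combining them is the lookup. A Python sketch with assumed stand-ins for the slide's ranges:

```python
def match(value, rng):
    # MATCH(value, rng, 0): 1-based position of an exact match.
    return rng.index(value) + 1

def index(rng, position):
    # INDEX(rng, position): value at a 1-based position.
    return rng[position - 1]

ids = [61, 22, 98, 45, 53, 10, 77]       # stand-in for B3:B9
salaries = [40, 52, 47, 61, 58, 39, 44]  # stand-in for E3:E9

pos = match(53, ids)         # 53 is found at position 5
print(index(salaries, pos))  # INDEX + MATCH: salary of ID 53
```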

246
Correlation
The correlation coefficient (a value
between -1 and +1) tells you how
strongly two variables are related to
each other. We can use the CORREL
function (1st way) or the Analysis
ToolPak add-in (2nd way; we are going
to talk about it in the next slide) in
Excel to find the correlation coefficient
between two variables.
Steps (2nd way)
On the Data tab, in the Analysis
group, click Data Analysis
Select Correlation and click OK
For example, select the range A1:C6
as the Input Range
Check Labels in first row
Select cell A8 as the Output Range
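The number CORREL returns is the Pearson correlation coefficient; a stdlib-only sketch with two assumed variables:

```python
import math

# Two assumed variables; Excel's CORREL(xs, ys) computes the same
# Pearson coefficient.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 5, 4, 5, 7]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
var_x = sum((x - mx) ** 2 for x in xs)
var_y = sum((y - my) ** 2 for y in ys)
r = cov / math.sqrt(var_x * var_y)
print(round(r, 3))  # always between -1 and +1
```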

247
Week 4

• Charts
• Linear Regression
• File Types

248
Data
Analysis

249
Analysis ToolPak

The Analysis ToolPak is an Excel add-in


program that provides data analysis tools for
financial, statistical and engineering data analysis.
To load the Analysis ToolPak add-in, execute the
following steps.
• On the File tab, click Options.
• Under Add-ins, select Analysis ToolPak and click
on the Go button.
• Check Analysis ToolPak and click on OK.
• On the Data tab, in the Analysis group, you can
now click on Data Analysis
The following dialog box below appears.
• For example, select Histogram and click OK to
create a Histogram in Excel.

250
Create a chart

• Select the range A1:A19.


• On the Insert tab, in the
Charts group, click on
Recommended Charts.
• Tip: Each chart has
another type of settings

251
Create
Histogram
• First, enter the bin numbers (upper levels) in
the range C4:C8.
• On the Data tab, in the Analysis group, click
Data Analysis.
• Select Histogram and click OK
• Select the range A2:A19
• Click in the Bin Range box and select the
range C4:C8.
• Click the Output Range option button, click
in the Output Range box and select cell F3.
• Check Chart Output.
• Click OK
• Click the legend on the right side and press
Delete.
• Properly label your bins.
• To remove the space between the bars, right
click a bar, click Format Data Series and
change the Gap Width to 0%.
• To add borders, right click a bar, click Format
Data Series, click the Fill & Line icon, click
Border and select a color.

252
Line charts are used to display trends over time. Use a line chart if you have text labels, dates or a few numeric labels on the horizontal axis. Use a scatter plot (XY chart) to show scientific XY
data.

Create a Line Chart

Steps
• Select the range A1:D7.
• On the Insert tab, in the Charts group, click the Line symbol
• Click Line with Markers

Tip! Only if you have numeric labels, empty cell A1 before you create the line chart. By doing this, Excel does not recognize the numbers in column A as a data series and automatically places
these numbers on the horizontal (category) axis. After creating the chart, you can enter the text Year into cell A1 if you like.

253
Create a Combination
Chart
A combination chart is a chart that combines two
or more chart types in a single chart.
Steps
Select the range A1:C13
On the Insert tab, in the Charts group, click the
Combo symbol.
Click Create Custom Combo Chart
For the Rainy Days series choose Clustered
Column as the chart type
For the Profit series, choose Line as the chart type
Plot the Profit series on the secondary axis
Click OK

254
Add Sparklines!

Sparklines are a minimal way to visualize the
data per row, mainly either as a line or as a column.
In order to do so, we should go to Insert tab, in
the Sparklines section and select either Line or
Column.

After selecting the data (per row) in the Data


Range, select also the Location Range, which
is going to be, for example, next to the data
(within the same row).
The result will be like the picture on the right!

255
Linear Regression Analysis

Below you can find our data. The big question is: is
there a relation between Quantity Sold (Output) and
Price and Advertising (Input). In other words: can we
predict Quantity Sold if we know Price and
Advertising?
Steps
• On the Data tab, in the Analysis group, click Data Analysis.
• Select Regression and click OK
• Select the Y Range (A1:A8). This is the response variable (also called the
dependent variable).
• Select the X Range(B1:C8). These are the explanatory variables (also
called independent variables). These columns must be adjacent to each
other.
• Check Labels.
• Click in the Output Range box and select cell A11.
• Check Residuals.
• Click OK.

256
Linear Regression Analysis –
Summary Output
• R Square
R Square equals 0.962, which is a very good fit. 96% of the variation in Quantity Sold
is explained by the independent variables Price and Advertising. The closer to 1, the
better the regression line (read on) fits the data.
• Significance F and P-values
To check if your results are reliable (statistically significant), look at Significance F
(0.001). If this value is less than 0.05, you're OK. If Significance F is greater than
0.05, it's probably better to stop using this set of independent variables. Delete a
variable with a high P-value (greater than 0.05) and rerun the regression until
Significance F drops below 0.05.
Most or all P-values should be below 0.05. In our example this is the case.
(0.000, 0.001 and 0.005).
• Coefficients
The regression line is: y = Quantity Sold = 8536.214 -835.722 * Price + 0.592 *
Advertising. In other words, for each unit increase in price, Quantity Sold decreases
by 835.722 units. For each unit increase in Advertising, Quantity Sold increases
by 0.592 units. This is valuable information.
You can also use these coefficients to do a forecast. For example, if price equals $4
and Advertising equals $3000, you might be able to achieve a Quantity Sold of
8536.214 -835.722 * 4 + 0.592 * 3000 = 6970.
• Residuals
The residuals show you how far away the actual data points are from the predicted
data points (using the equation). For example, the first data point equals 8500. Using
the equation, the predicted data point equals 8536.214 -835.722 * 2 + 0.592 * 2800 =
8523.009, giving a residual of 8500 - 8523.009 = -23.009.
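The forecast and residual arithmetic above can be checked directly with the printed coefficients:

```python
# Coefficients taken from the summary output above.
intercept, b_price, b_adv = 8536.214, -835.722, 0.592

def predict(price, advertising):
    # Quantity Sold = intercept + b_price*Price + b_adv*Advertising
    return intercept + b_price * price + b_adv * advertising

forecast = predict(4, 3000)         # the slide's forecast, about 6970
residual = 8500 - predict(2, 2800)  # first data point's residual
print(round(forecast, 3), round(residual, 3))
```

The residual comes out near -22.4 rather than the slide's -23.009 because the printed coefficients are rounded; the full-precision coefficients from the output table give the slide's value.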

257
File types - CSV

A CSV (Comma Separated Values) file is a special type of file that you
can create or edit in Excel. Rather than storing information in columns,
CSV files store information separated by commas. When text and
numbers are saved in a CSV file, it's easy to move them from one
program to another.
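Because CSV is just comma-separated text, most languages parse it with a few lines; a Python sketch with an assumed two-column file:

```python
import csv
import io

# A small CSV document (assumed content) parsed with the csv module.
raw = "name,age\nAlexandra,34\nNick,28\n"
rows = list(csv.DictReader(io.StringIO(raw)))

print(rows[0]["name"], rows[1]["age"])  # values come back as text
```

Note that everything is read as text; numbers must be converted explicitly (e.g. `int(rows[0]["age"])`).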

258
File types - Database

We will work with:


• Azure SQL Database
• SQL Server Database

259
Week 5

• Pivot Tables
• Pivot Charts
• Tables

260
Pivot Tables

A pivot table allows you to extract the significance from a large, detailed
data set.
Our data set consists of 213 records and 6 fields. Order ID, Product,
Category, Amount, Date and Country.

261
Insert a Pivot Table

• Click any single cell inside the data set


• On the Insert tab, in the Tables group,
click PivotTable
The following dialog box appears. Excel
automatically selects the data for you. The
default location for a new pivot table is New
Worksheet.

262
Drag fields

The PivotTable Fields pane appears. To get the


total amount exported of each product, drag the
following fields to the different areas.
• Product field to the Rows area.
• Amount field to the Values area.
• Country field to the Filters area.

263
Drag Fields (2)

• Below you can find the pivot table. Bananas


are our main export product. That's how easy
pivot tables can be!
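Under the hood, Product in the Rows area plus Sum of Amount in the Values area is a group-and-sum over the records. A sketch with assumed order data (not the slides' 213-record set):

```python
from collections import defaultdict

# Assumed export records as (Product, Amount) pairs.
orders = [("Banana", 900), ("Apple", 340), ("Banana", 1250),
          ("Orange", 410), ("Apple", 280)]

# Group by product and sum the amounts, like the pivot table does.
totals = defaultdict(int)
for product, amount in orders:
    totals[product] += amount

print(dict(totals), "top:", max(totals, key=totals.get))
```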

264
Change Summary
Calculation

By default, Excel summarizes


your data by either summing or
counting the items. To change
the type of calculation that you
want to use, execute the
following steps.
• Click any cell inside the Sum
of Amount column.
• Right click and click on Value
Field Settings.
• Choose the type of
calculation you want to use.
For example, click Count.

265
Two-dimensional Pivot Table

If you drag a field to the Rows area and Columns


area, you can create a two-dimensional pivot
table. First, insert a pivot table. Next, to get the
total amount exported to each country, of each
product, drag the following fields to the different
areas.
• Country field to the Rows area.
• Product field to the Columns area.
• Amount field to the Values area.
• Category field to the Filters area.

266
Pivot Tables - Slicers
Use slicers in Excel to quickly and easily filter pivot tables.
Steps
• Click any cell inside the pivot table.
• On the Analyze tab, in the Filter group, click Insert Slicer.
• Check Country and click OK.
• Click United States to find out which products we export the most to the United
States.
Let’s insert a second slicer:
Click any cell inside the pivot table.
On the Analyze tab, in the Filter group, click Insert Slicer.
Check Product and click OK.
Click the Multi-Select button to select multiple products.

267
A pivot chart is the visual representation of a pivot table
in Excel. Pivot charts and pivot tables are connected with
Pivot Chart each other. Below you can find a two-dimensional pivot
table.

268
Insert & Filter Pivot
Chart
Insert
• Click any cell inside the pivot table.
• On the Analyze tab, in the Tools group, click PivotChart.
• The Insert Chart dialog box appears, Click OK

Filter
• Use the standard filters (triangles next to Product and
Country). For example, use the Country filter to only show
the total amount of each product exported to the United
States.
• Remove the Country filter.
• Because we added the Category field to the Filters area, we
can filter this pivot chart (and pivot table) by Category. For
example, use the Category filter to only show the vegetables
exported to each country.

269
Tables - Insert
a Table
• Click any single cell inside the
data set.
• On the Insert tab, in the
Tables group, click Table.
• Excel automatically selects
the data for you. Check 'My
table has headers' and click
on OK.

270
Tables – Total Row

To display a total row at the end of the table:


• First, select a cell inside the table. Next, on the
Design tab, in the Table Style Options group,
check Total Row.

• Click any cell in the last row to calculate the Total


(Average, Count, Max, Min, Sum etc.) of a
column. For example, calculate the sum of the
Sales column.

271
Quick Analysis Tool

Use the Quick Analysis tool in Excel to


quickly analyze your data. Quickly calculate
totals, quickly insert tables, quickly apply
conditional formatting and more.
• Select a range of cells and click the Quick
Analysis button.

272
Power Query

Power Query is a business intelligence tool available in Excel that


allows you to import data from many different sources and then clean,
transform and reshape your data as needed.
It allows you to set up a query once and then reuse it with a simple
refresh.

273
Guide for Power Query

https://www.howtoexcel.org/power-query/the-complete-guide-to-power-que
ry/

274
References

• Excel Essential Training (Office 365/Microsoft 365) on LinkedIn


Learning
• https://www.excel-easy.com/

275
SQL
How to access SQL Server Management Studio
Module 2

276
Open application – SQL Sever Management
Studio

This element is going to be


shown in your screen as
welcome one.

277
Interface This is the screen that you are going to see.

278
• Connect with:
SQL Server Authentication
• Server Type:
Database Engine
• Server name:

Credentials curiousiq-78d054-sqlsvr.dat
abase.windows.net/
• Username (one of the following):
sqluser1, sqluser2, …,
sqluser50
• Password:
password!1

279
SQL
Module 2

280
Table of Contents
1.1 | Introduction to Data
1.2 | Database systems
1.3 | Relational model
1.4 | Normalisation
1.5 | Introduction to SQL
1.6 | Basic Query Structure
1.7 | Database Operations
1.8 | Aggregate Functions
1.9 | Window Functions
1.10 | Subqueries
1.11 | Database modification

281
Week 1

1. Intro
2. Relational Model

282
INTRODUCTION TO
DATA

283
Introduction to Data

Data is a valuable asset which plays pivotal role in informing


critical business decisions

Data is a collection of facts such as


numbers, descriptions, and
observations used in decision
making and can be classified as:
• structured
• semi-structured
• unstructured

284
Introduction to Data

Analytical systems perform 4 main activities with data

1. Data Ingestion
Process of capturing the raw data. To process and analyse this
data, you must first store the data in a repository of some sort.
The repository could be a file store, a document database, or
even a relational database.

2. Data Transformation/Processing
After data is ingested into a data repository, we may want to
do some cleaning operations and remove any questionable or
invalid data, or perform some aggregations such as calculating
certain KPIs.

3. Data Querying

285
Data Classification

We can classify data in three categories: structured,


semi-structured, or unstructured

Structured data
Structured data is typically tabular data that is
represented by rows and columns in a
database.

Databases that hold tables in this form are


called relational databases. Each row in a table
has the same set of columns.

Example: Datawarehouse, ERP, CRM

286
Data Classification

We can classify data in three categories: structured,


semi-structured, or unstructured
Example of JSON file

Semi-structured data
Semi-structured data is information that
doesn't reside in a relational database but still
has some structure to it.

>JSON – Example: Documents held in


JavaScript Object Notation (JSON) format

Example of key-value data


>Key-value stores - A key-value store is similar
to a relational table, except that each row can
have any number of columns.
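A JSON document shows why this data is called semi-structured: it has named fields, but each document can carry a different set of fields and nested lists, unlike the fixed columns of a relational row. A sketch with an assumed example document:

```python
import json

# A JSON document (assumed example) with a nested list of orders.
doc = '{"id": 1, "name": "Mary Tze", "orders": [{"sku": "A1", "qty": 2}]}'
record = json.loads(doc)

print(record["name"], record["orders"][0]["qty"])
```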

287
Data Classification

We can classify data in three categories: structured,


semi-structured, or unstructured

Semi-structured data
>Graph DB – Stores and queries information
about complex relationships.

A graph contains
-nodes (information about objects), and
-edges (information about the relationships
between objects).

288
Data Classification

We can classify data in three categories: structured,


semi-structured, or unstructured

Unstructured data
Unstructured data is data which is not
organized in any predefined manner.

Example: Audio and video files, and binary


data files might not have a specific structure.

289
History of Database Systems
▪ 1950s and early 1960s
• Data processing using magnetic tapes for storage
▪ Late 1960s and 1970s
• Hard disks allowed direct access to data
• Network and hierarchical data models in widespread use
• High-performance (for the era) transaction processing
▪ 1980s
• SQL becomes industrial standard
• Parallel and distributed database systems
▪ 1990s
• Large multi-terabyte data warehouses & emergence of Web commerce
▪ 2000s
• Big data storage systems (Google BigTable, Amazon, “NoSQL” systems)
• Big data analysis: beyond SQL
▪ 2010s

290
Latest trend: Big Data
The challenges of big data management result from the
expansion of all three properties.

Volume:
The quantity of generated and stored
data.

Variety:
The type and nature of the data. Big data
does not only draw from text but also
images, audio, video.

Velocity:
The speed at which the data is generated and processed.

291
Big Data

292
DATABASE SYSTEMS

293
Database Systems

▪ DBMS contains information about a particular enterprise


• Collection of interrelated data
• Set of programs to access the data
• An environment that is both convenient and efficient to use

▪ A modern database system is a complex software system


whose task is to manage a large, complex collection of
data, accessible by multiple users.

294
Database Applications Examples 1/2
▪ Enterprise Information
• Sales: customers, products, purchases
• Accounting: payments, receipts, assets
• Human Resources: Information about employees, salaries, payroll taxes.
▪ Manufacturing: management of production, inventory, orders, supply chain.
▪ Banking and finance
• Customer information, accounts, loans, and banking transactions.
• Credit card transactions
• Finance: sales and purchases of financial instruments (e.g., stocks and bonds;
storing real-time market data
▪ Universities: registration, grades

295
Database Applications Examples 2/2

▪ Airlines: reservations, schedules


▪ Telecommunication: records of calls, texts, and data usage, generating monthly bills,
maintaining balances on prepaid calling cards
▪ Web-based services
• Online retailers: order tracking, customized recommendations
• Online advertisements
▪ Document databases
▪ Navigation systems: For maintaining the locations of various places of interest
along with the exact routes of roads, train systems, buses, etc.

296
RELATIONAL MODEL

297
Relational Model

▪ Data model is a collection of tools for describing data, data


relationships, data semantics, data constraints
▪ In a relational model all the data is stored in various tables.
Columns

Rows

298
Characteristics of Relational Data

Relational data
>Data is stored in a table consisting of rows &
columns
>Row: Each row represents a single instance of
entity
>Column: Define the properties of the entity
>Each column is defined by a datatype
>All rows have the same number of columns

299
Characteristics of Relational Data

Relational data
> Some columns are used to maintain
relationships between tables
> Model shows the structure of the entities
> Primary key uniquely identifies each row
> Foreign key references the primary key of
another table and is used to maintain
relationships between tables

300
Characteristics of Relational Data


301
Why don’t we have all the data in one big table?
Week 2

1. Relational Database
2. Basic query structure

302
A Sample Relational Database

303
SQL Query Language

▪ SQL query language is nonprocedural. A query takes as input several tables


(possibly only one) and always returns a single table.
▪ Example to find all instructors in Computer Science dept
select name
from instructor
where dept_name = 'Comp. Sci.'

▪ Application programs generally access databases through one of


• Language extensions to allow embedded SQL
• Application program interface (e.g., ODBC/JDBC) which allow SQL queries to
be sent to a database
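An API such as Python's built-in sqlite3 module illustrates both points: the query takes a table as input and returns a table of results, and the application sends SQL text to the database. The rows below are an assumed stand-in for the instructor table:

```python
import sqlite3

# In-memory stand-in for the instructor table (assumed rows).
con = sqlite3.connect(":memory:")
con.execute("create table instructor (ID char(5), name varchar(20), "
            "dept_name varchar(20), salary numeric(8,2))")
con.executemany("insert into instructor values (?, ?, ?, ?)",
                [("10101", "Srinivasan", "Comp. Sci.", 65000),
                 ("12121", "Wu", "Finance", 90000),
                 ("45565", "Katz", "Comp. Sci.", 75000)])

# The slide's query: one table in, one table (of names) out.
names = [row[0] for row in con.execute(
    "select name from instructor where dept_name = 'Comp. Sci.'")]
print(names)
```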

304
Query Processing

1. Parsing and translation


2. Optimization
3. Evaluation

305
History

▪ IBM Sequel language developed as part of System R project at the IBM San Jose
Research Laboratory
▪ Renamed Structured Query Language (SQL)
▪ SQL syntax is similar to the English language, which makes
it relatively easy to write, read, and interpret.
▪ Many RDBMSs use SQL (and variations of SQL) to access
the data in tables. Tables can have hundreds, thousands,
sometimes even millions of rows of data. These rows are
often called records.
▪ Tables can also have many columns of data. Columns are
labeled with a descriptive name (say, age for example) and
have a specific data type. Columns are often called fields.
306
Data Definition Language

The SQL data-definition language (DDL) allows the specification of information


about relations, including:

▪ The schema for each relation.


▪ The type of values associated with each attribute.
▪ The Integrity constraints
▪ The set of indices to be maintained for each relation.
▪ Security and authorization information for each relation.
▪ The physical storage structure of each relation on disk.

307
Data Types in SQL

▪ char(n). Fixed length character string, with user-specified length n.


▪ varchar(n). Variable length character strings, with user-specified maximum
length n.
▪ int. Integer (a finite subset of the integers that is machine-dependent).
▪ smallint. Small integer (a machine-dependent subset of the integer domain
type).
▪ numeric(p,d). Fixed point number, with user-specified precision of p digits,
with d digits to the right of the decimal point. (e.g., numeric(3,1) allows 44.5 to be
stored exactly, but not 444.5 or 0.32)
▪ real, double precision. Floating point and double-precision floating point
numbers, with machine-dependent precision.
▪ float(n). Floating point number, with user-specified precision of at least n
digits.

308
Create Table Construct

▪ An SQL relation is defined using the CREATE table command:


create table r (A1 D1, A2 D2, ..., An Dn,
(integrity-constraint1), ..., (integrity-constraintk))

▪ Example:
create table instructor (
ID char(5),
name varchar(20),
dept_name varchar(20),
salary numeric(8,2))

309
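As a hands-on illustration (not part of the original slides), the same CREATE TABLE statement can be executed from Python with the built-in sqlite3 module; the in-memory database and the sample row are invented for the demo:

```python
import sqlite3

# Scratch in-memory database for experimenting with DDL
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The instructor relation from the slide; SQLite accepts the standard
# type names but maps them onto its own type affinities
cur.execute("""
    create table instructor (
        ID char(5),
        name varchar(20),
        dept_name varchar(20),
        salary numeric(8,2))
""")

cur.execute("insert into instructor values ('10211', 'Smith', 'Biology', 66000)")
rows = cur.execute("select name, salary from instructor").fetchall()
print(rows)  # [('Smith', 66000)]
```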
Integrity Constraints in Create Table

▪ Types of integrity constraints


• primary key
• foreign key
• not null
▪ SQL prevents any update to the database that violates an integrity
constraint.
▪ Example:
create table instructor (
ID char(5),
name varchar(20) not null,
dept_name varchar(20),
salary numeric(8,2),
primary key (ID),
foreign key (dept_name) references department);

310
And a Few More Relation Definitions

▪ create table student (


ID varchar(5),
name varchar(20) not null,
dept_name varchar(20),
tot_cred numeric(3,0),
primary key (ID),
foreign key (dept_name) references department);

▪ create table takes (


ID varchar(5),
course_id varchar(8),
sec_id varchar(8),
semester varchar(6),
year numeric(4,0),
grade varchar(2),
primary key (ID, course_id, sec_id, semester, year) ,
foreign key (ID) references student,
foreign key (course_id, sec_id, semester, year) references section);

311
...and more
▪ create table course (
course_id varchar(8),
title varchar(50),
dept_name varchar(20),
credits numeric(2,0),
primary key (course_id),
foreign key (dept_name) references department);

312
Updates to tables

▪ INSERT
• insert into instructor values ('10211', 'Smith', 'Biology', 66000);
▪ DELETE
• Remove all tuples from the student relation
▪ delete from student

▪ DROP Table
• drop table r
▪ ALTER
• alter table r add A D
▪ where A is the name of the attribute to be added to relation r and D its type
• alter table r drop A
▪ where A is the name of an attribute of relation r
▪ Dropping of attributes not supported by many databases.

313
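A sketch of the update statements above using sqlite3 (the relation r and its attributes are made up for the demo; note that SQLite requires a type when adding an attribute):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table r (a int)")
cur.execute("insert into r values (1)")  # INSERT

# ALTER: add attribute b to relation r (SQLite needs the type too)
cur.execute("alter table r add b varchar(10)")

# Existing tuples get null for the newly added attribute
row = cur.execute("select a, b from r").fetchone()
print(row)  # (1, None)

# DELETE removes all tuples but keeps the relation ...
cur.execute("delete from r")
count = cur.execute("select count(*) from r").fetchone()[0]

# ... while DROP removes the relation itself
cur.execute("drop table r")
```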
BASIC QUERY
STRUCTURE

314
Basic Query Structure

▪ A typical SQL query has the form:

select column_name
from schema_name.table_name
where condition

• Column_name represents an attribute


• table_name represents a relation
• schema_name represents the schema of the relation e.g. dbo
• condition is a predicate.
▪ The result of an SQL query is a relation.

315
The SELECT [...] FROM Clause 1/4
▪ The SELECT clause lists the attributes desired in the result of a query
▪ The FROM command is used to specify which table to
select data from.
▪ Example: find the names of all instructors:
select name
from instructor
▪ NOTE: SQL names are case insensitive (i.e., you may use upper- or
lower-case letters.)
• E.g., Name ≡ NAME ≡ name
• Some people use upper case wherever we use bold font.

316
The SELECT [...] FROM Clause 2/4
▪ SQL allows duplicates in relations as well as in query results.
▪ To force the elimination of duplicates, insert the keyword DISTINCT after select.
▪ Find the department names of all instructors, and remove duplicates
select distinct dept_name
from instructor
▪ The keyword all specifies that duplicates should not be removed.
select all dept_name
from instructor

317
The SELECT [...] FROM Clause 3/4
▪ An asterisk in the select clause denotes “all attributes”
select *
from instructor
▪ An attribute can be a literal with no from clause
select '437'
• Result is a table with one column and a single row with value “437”
• Can give the column a name using:
select '437' as FOO
▪ An attribute can be a literal with from clause
select 'A'
from instructor
• Result is a table with one column and N rows (number of tuples in the instructors
table), each row with value “A”

318
The SELECT [...] FROM Clause 4/4
▪ The select clause can contain arithmetic expressions involving the operations +, –, *,
and /, operating on constants or attributes of tuples.
• The query:
select ID, name, salary/12
from instructor
would return a relation that is the same as the instructor relation, except that the
value of the attribute salary is divided by 12.
• Can rename “salary/12” using the as clause:
select ID, name, salary/12 as monthly_salary

319
The WHERE Clause

▪ The where clause specifies conditions that the result must satisfy
• Corresponds to the selection predicate of the relational algebra.
▪ To find all instructors in Comp. Sci. dept
select name
from instructor
where dept_name = 'Comp. Sci.'
▪ SQL allows the use of the logical connectives and, or, and not
▪ The operands of the logical connectives can be expressions involving the
comparison operators <, <=, >, >=, =, and <>.
▪ Comparisons can be applied to results of arithmetic expressions
▪ To find all instructors in Comp. Sci. dept with salary > 70000
select name
from instructor
where dept_name = 'Comp. Sci.' and salary > 70000

320
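The last query can be run end to end with sqlite3 (a sketch; the sample instructors are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table instructor "
            "(name varchar(20), dept_name varchar(20), salary numeric(8,2))")
cur.executemany("insert into instructor values (?, ?, ?)", [
    ("Katz",       "Comp. Sci.", 75000),
    ("Srinivasan", "Comp. Sci.", 65000),
    ("Brandt",     "History",    80000),
])

# The logical connective 'and' combines two comparison predicates
names = [r[0] for r in cur.execute(
    "select name from instructor "
    "where dept_name = 'Comp. Sci.' and salary > 70000")]
print(names)  # ['Katz']
```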
The Rename Operation

▪ The SQL allows renaming relations and attributes using the as clause:
old-name as new-name

▪ Find the names of all instructors who have a higher salary than
some instructor in 'Comp. Sci'.
• select distinct T.name
from instructor as T, instructor as S
where T.salary > S.salary and S.dept_name = 'Comp. Sci.’

▪ Keyword as is optional and may be omitted


instructor as T ≡ instructor T

321
Week 3

1. Joins
2. Set Operations
3. Null Values

322
DATABASE
OPERATIONS

323
SQL JOINS
▪ A JOIN clause is used to combine rows from two or
more tables, based on a related column between them

324
INNER JOIN OPERATION
▪ The INNER JOIN will combine the records and will
show only the common records between the 2 tables.
▪ Also, the JOIN clause in SQL is equivalent to INNER
JOIN

325
LEFT JOIN
▪ The SQL LEFT JOIN returns all rows from the left table,
even if there are no matches in the right table.
▪ This means that if the USING clause matches 0 (zero)
records in the right table, the join will still return a row in
the result, but with NULL in each column from the right
table.

326
RIGHT JOIN
▪ The SQL RIGHT JOIN does the same as the LEFT
JOIN but with the RIGHT side table.

327
String Operations
▪ SQL includes a string-matching operator for comparisons on character strings. The operator
like uses patterns that are described using two special characters:
• percent ( % ). The % character matches any substring.
• underscore ( _ ). The _ character matches any character.
▪ Find the names of all instructors whose name includes the substring “dar”.
select name
from instructor
where name like '%dar%’
▪ Find the names of all instructors whose name starts with “Ma”
select name
from instructor
where name like 'Ma%'
▪ Find the names of all instructors whose name starts with “Mar” followed by exactly
one more character
select name
from instructor
where name like 'Mar_'
▪ Match the string “100%”
like '100\%' escape '\'
In the above we use backslash (\) as the escape character.
328
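The % and _ patterns can be tried with sqlite3 (sample names invented; note that SQLite's like is case-insensitive for ASCII by default):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table instructor (name varchar(20))")
cur.executemany("insert into instructor values (?)",
                [("Mara",), ("Mark",), ("Mozart",), ("Eldar",)])

# % matches any substring; _ matches exactly one character
dar = [r[0] for r in cur.execute(
    "select name from instructor where name like '%dar%'")]
mar_ = [r[0] for r in cur.execute(
    "select name from instructor where name like 'Mar_'")]
print(dar, mar_)  # ['Eldar'] ['Mara', 'Mark']
```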
Ordering the Display of Tuples

▪ List in alphabetic order the names of all instructors


select distinct name
from instructor
order by name
▪ We may specify DESC for descending order or ASC for ascending
order, for each attribute; ascending order is the default.
• Example: order by name desc
▪ Can sort on multiple attributes
• Example: order by dept_name, name

329
Where Clause Predicates

▪ SQL includes a BETWEEN comparison operator


▪ Example: Find the names of all instructors with salary between $90,000
and $100,000 (that is, ≥ $90,000 and ≤ $100,000)
• select name
from instructor
where salary between 90000 and 100000

330
Set Operations

▪ Find courses that ran in Fall 2017 or in Spring 2018


(select course_id from section where sem = 'Fall' and year = 2017)
union
(select course_id from section where sem = 'Spring' and year = 2018)
▪ Find courses that ran in Fall 2017 and in Spring 2018
(select course_id from section where sem = 'Fall' and year = 2017)
intersect
(select course_id from section where sem = 'Spring' and year = 2018)
▪ Find courses that ran in Fall 2017 but not in Spring 2018
(select course_id from section where sem = 'Fall' and year = 2017)
except
(select course_id from section where sem = 'Spring' and year = 2018)

331
Set Operations (Cont.)

▪ Set operations union, intersect, and except


• Each of the above operations automatically eliminates duplicates
▪ To retain all duplicates use the
• union all,
• intersect all
• except all.

332
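All three set operations are supported by SQLite, so the Fall/Spring queries above can be sketched like this (the section rows are invented sample data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table section (course_id varchar(8), sem varchar(6), year int)")
cur.executemany("insert into section values (?, ?, ?)", [
    ("CS-101", "Fall",   2017),
    ("CS-347", "Fall",   2017),
    ("CS-101", "Spring", 2018),
])

fall   = "select course_id from section where sem = 'Fall' and year = 2017"
spring = "select course_id from section where sem = 'Spring' and year = 2018"

union_     = sorted(r[0] for r in cur.execute(f"{fall} union {spring}"))
intersect_ = sorted(r[0] for r in cur.execute(f"{fall} intersect {spring}"))
except_    = sorted(r[0] for r in cur.execute(f"{fall} except {spring}"))
print(union_, intersect_, except_)
# ['CS-101', 'CS-347'] ['CS-101'] ['CS-347']
```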
Null Values

▪ It is possible for tuples to have a null value, denoted by null, for some of their
attributes
▪ null signifies an unknown value or that a value does not exist.
▪ The result of any arithmetic expression involving null is null
• Example: 5 + null returns null
▪ The predicate is null can be used to check for null values.
• Example: Find all instructors whose salary is null.
select name
from instructor
where salary is null
▪ The predicate is not null succeeds if the value on which it is applied is not null.

333
Null Values (Cont.)

▪ SQL treats as unknown the result of any comparison involving a null value (other
than predicates is null and is not null).
• Example: 5 < null or null <> null or null = null
▪ The predicate in a where clause can involve Boolean operations (and, or, not); thus
the definitions of the Boolean operations need to be extended to deal with the value
unknown.
• and : (true and unknown) = unknown,
(false and unknown) = false,
(unknown and unknown) = unknown
• or: (unknown or true) = true,
(unknown or false) = unknown
(unknown or unknown) = unknown
▪ Result of where clause predicate is treated as false if it evaluates to unknown

334
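The null rules above can be observed directly (a sqlite3 sketch with invented rows): arithmetic with null gives null, and a comparison whose result is unknown excludes the row from the where clause, while is null catches it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table instructor (name varchar(20), salary numeric(8,2))")
cur.executemany("insert into instructor values (?, ?)",
                [("Gold", 87000), ("Verdi", None)])

# Arithmetic involving null yields null
val = cur.execute("select 5 + null").fetchone()[0]

# salary < 90000 is unknown for Verdi, and unknown is treated
# as false in the where clause, so only Gold is returned
low = [r[0] for r in cur.execute(
    "select name from instructor where salary < 90000")]
nul = [r[0] for r in cur.execute(
    "select name from instructor where salary is null")]
print(val, low, nul)  # None ['Gold'] ['Verdi']
```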
Week 4

1. Avg
2. Min/Max
3. Sum/Count

335
AGGREGATE
FUNCTIONS

336
Aggregate Functions

▪ These functions operate on the multiset of values of a column of a relation, and return
a value
avg: average value
min: minimum value
max: maximum value
sum: sum of values
count: number of values

337
Aggregate Functions Examples

▪ Find the average salary of instructors in the Computer Science department


• select avg (salary)
from instructor
where dept_name= 'Comp. Sci.';
▪ Find the total number of (unique) instructors who teach a course in the Spring 2018
semester
• select count (distinct ID)
from teaches
where semester = 'Spring' and year = 2018;
▪ Find the number of tuples in the course relation
• select count (*)
from course;

338
Aggregate Functions – Group By

▪ The SQL GROUP BY clause is used in collaboration with


the SELECT statement to arrange identical data into
groups. It is also mandatory when we use an
aggregate function along with a column on which we want to
form groups.
▪ Find the average salary of instructors in each department
• select dept_name, avg (salary) as avg_salary
from instructor
group by dept_name;

339
Aggregation (Cont.)

▪ Attributes in select clause outside of aggregate functions must appear in


group by list
• /* erroneous query */
select dept_name, ID, avg (salary)
from instructor
group by dept_name;

340
Aggregate Functions – Having Clause

▪ Find the names and average salaries of all departments whose average salary is
greater than 42000.

select dept_name, avg (salary) as avg_salary


from instructor
group by dept_name
having avg (salary) > 42000;

▪ Note: predicates in the having clause are applied after the formation of groups
whereas predicates in the where clause are applied before forming groups.
▪ Also, having clause is the only way to filter the result of
an aggregate function!

341
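The group by / having query can be verified with sqlite3 (instructors and salaries invented); a plain where clause could not refer to avg(salary) because groups do not exist yet when where is evaluated:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table instructor "
            "(name varchar(20), dept_name varchar(20), salary numeric(8,2))")
cur.executemany("insert into instructor values (?, ?, ?)", [
    ("Katz",   "Comp. Sci.", 75000),
    ("Brandt", "Comp. Sci.", 65000),
    ("Crick",  "Biology",    40000),
])

# having filters whole groups after aggregation
rows = cur.execute("""
    select dept_name, avg(salary) as avg_salary
    from instructor
    group by dept_name
    having avg(salary) > 42000
""").fetchall()
print(rows)  # [('Comp. Sci.', 70000.0)]
```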
Week 5

1. Set comparison
2. Exists
3. With

342
SUBQUERIES

343
Nested Subqueries
▪ SQL provides a mechanism for the nesting of subqueries. A subquery is a
select-from-where expression that is nested within another query.
▪ The nesting can be done in the following SQL query

select column_name
from table_name
where condition

as follows:
• From clause: table_name can be replaced by any valid subquery
• Where clause: condition can be replaced with an expression of the form:
B <operation> (subquery)
B is an attribute and <operation> to be defined later.
• Select clause:
column_name can be replaced be a subquery that generates a single value.

344
Set Membership

▪ Find courses offered in Fall 2017 and in Spring 2018


select distinct course_id
from section
where semester = 'Fall' and year= 2017 and
course_id in (select course_id
from section
where semester = 'Spring' and year= 2018);

▪ Find courses offered in Fall 2017 but not in Spring 2018


select distinct course_id
from section
where semester = 'Fall' and year= 2017 and
course_id not in (select course_id
from section
where semester = 'Spring' and year= 2018);

345
Set Membership (Cont.)

▪ Name all instructors whose name is neither “Mozart” nor “Einstein”


select distinct name
from instructor
where name not in ('Mozart', 'Einstein')

▪ Find the total number of (distinct) students who have taken course sections
taught by the instructor with ID 10101

select count (distinct ID)


from takes
where (course_id, sec_id, semester, year) in
(select course_id, sec_id, semester, year
from teaches
where teaches.ID= 10101);

▪ Note: Above query can be written in a much simpler manner.


The formulation above is simply to illustrate SQL features

346
Set Comparison – “some” Clause

▪ Find names of instructors with salary greater than that of some (at least one)
instructor in the Biology department.
select distinct T.name
from instructor as T, instructor as S
where T.salary > S.salary and S.dept_name = 'Biology';

▪ Same query using the > some clause
select name
from instructor
where salary > some (select salary
from instructor
where dept_name = 'Biology');

347
Set Comparison – “all” Clause

▪ Find the names of all instructors whose salary is greater than the salary of all
instructors in the Biology department.
select name
from instructor
where salary > all (select salary
from instructor
where dept_name = 'Biology');

348
Use of “exists” Clause

▪ Yet another way of specifying the query “Find all courses taught in both the Fall 2017 semester and in
the Spring 2018 semester”
select course_id
from section as S
where semester = 'Fall' and year = 2017 and
exists (select *
from section as T
where semester = 'Spring' and year= 2018
and S.course_id = T.course_id);
▪ Also, we can apply not exists. So, for example, to “Find all courses taught in the Fall 2017
semester but not in the Spring 2018 semester”:
select course_id
from section as S
where semester = 'Fall' and year = 2017 and
not exists (select *
from section as T
where semester = 'Spring' and year= 2018
and S.course_id = T.course_id);
▪ Correlation name – variable S in the outer query

▪ Correlated subquery – the inner query

349
Test for Absence of Duplicates

▪ The unique construct tests whether a subquery has any duplicate tuples in its result.
▪ The unique construct evaluates to “true” if a given subquery contains no duplicates.
▪ Find all courses that were offered at most once in 2017
select T.course_id
from course as T
where unique ( select R.course_id
from section as R
where T.course_id= R.course_id
and R.year = 2017);

350
Subqueries in the From Clause

351
Subqueries in the From Clause

▪ SQL allows a subquery expression to be used in the from clause


▪ Find the average instructors’ salaries of those departments where the average salary
is greater than $42,000.”
select dept_name, avg_salary
from ( select dept_name, avg (salary) as avg_salary
from instructor
group by dept_name)
where avg_salary > 42000;
▪ Note that we do not need to use the having clause
▪ Another way to write above query
select dept_name, avg_salary
from ( select dept_name, avg (salary)
from instructor
group by dept_name)
as dept_avg (dept_name, avg_salary)
where avg_salary > 42000;

352
WITH Clause
▪ The WITH clause provides a way of defining a temporary relation whose definition
is available only to the query in which the with clause occurs.
▪ Find all departments with the maximum budget
with max_budget as
(select max(budget) as value
from department
)
select department.name
from department, max_budget
where department.budget = max_budget.value;

353
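The WITH query runs unchanged on SQLite (a sketch; the department rows are invented). The temporary relation max_budget exists only for the duration of this one query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table department (name varchar(20), budget int)")
cur.executemany("insert into department values (?, ?)",
                [("Physics", 70000), ("Music", 80000), ("History", 50000)])

rows = cur.execute("""
    with max_budget as
        (select max(budget) as value from department)
    select department.name
    from department, max_budget
    where department.budget = max_budget.value
""").fetchall()
print(rows)  # [('Music',)]
```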
Complex Queries using With Clause

▪ Find all departments where the total salary is greater than the average of the total
salary at all departments
with dept_total (dept_name, value) as
(select dept_name, sum(salary)
from instructor
group by dept_name),
dept_total_avg(value) as
(select avg(value)
from dept_total)
select dept_name
from dept_total, dept_total_avg
where dept_total.value > dept_total_avg.value;

354
Scalar Subquery

▪ Scalar subquery is one which is used where a single value is expected


▪ List all departments along with the number of instructors in each department
select dept_name,
( select count(*)
from instructor
where department.dept_name = instructor.dept_name)
as num_instructors
from department;
▪ Runtime error if subquery returns more than one result tuple

355
WINDOW / ANALYTIC
FUNCTIONS

356
Window / Analytic Functions
▪ Analytic functions calculate an aggregate value based on a
group of rows.

▪ Unlike aggregate functions, however, analytic functions


can return multiple rows for each group. Use analytic
functions to compute moving averages, running totals,
percentages or top-N results within a group.

Analytic functions include


▪ AVG and SUM (returns avg & sum for specific group of
rows)
▪ FIRST_VALUE & LAST_VALUE (returns first & last value)
▪ LEAD & LAG (returns subsequent & previous row)
▪ RANK & DENSE_RANK (returns ranks for specific group of
rows)
357
Window / Analytic Functions
There are three parts to this syntax, (a) function, (b) partition
by and (c) order by.
Below is a brief description of those, along with an example:

▪ analytic_function_name: name of the function — like


RANK(), SUM(), FIRST(), etc

▪ partition_expression: column/expression on the basis of


which the partition or window frames have to be created

▪ sort_expression: column/expression on the basis of which


the rows in the partition will be sorted

Query example
SELECT ord_date, sales_agent,
SUM (ord_amount) OVER (
PARTITION BY agent_code
358
LEAD() and LAG() example
Query
SELECT Sales_Agent_ID, Order_Amount, LAG(Order_Amount, 1) OVER (
PARTITION BY Sales_Agent_ID
ORDER BY Order_Amount DESC
) Last_Amount
FROM orders
ORDER BY Sales_Agent_ID

Results

359
CUME_DIST() example
Query
SELECT month(sales_date), Sales_Agent_ID, Order_Amount, CUME_DIST() OVER(
PARTITION BY month(sales_date)
ORDER BY Order_Amount
)
FROM orders

Results

360
DATABASE
MODIFICATION

361
Modification of the Database

▪ Deletion from a given relation.


▪ Insertion into a given relation
▪ Updating of values in a given relation

362
Deletion

▪ Delete all instructors


delete from instructor
▪ Delete the instructors table
drop table instructor

▪ Delete all instructors from the Finance department


delete from instructor
where dept_name= 'Finance’;

▪ Delete all tuples in the instructor relation for those instructors associated with a
department located in the Watson building.
delete from instructor
where dept_name in (select dept_name
from department
where building = 'Watson');

363
Deletion (Cont.)

▪ Delete all instructors whose salary is less than the average salary of instructors

delete from instructor


where salary < (select avg (salary)
from instructor);

• Problem: as we delete tuples from instructor, the average salary


changes
• Solution used in SQL:
1. First, compute avg (salary) and find all tuples to delete
2. Next, delete all tuples found above (without recomputing avg or
retesting the tuples)

364
Insertion

▪ Add a new tuple to course


insert into course
values ('CS-437', 'Database Systems', 'Comp. Sci.', 4);

▪ or equivalently
insert into course (course_id, title, dept_name, credits)
values ('CS-437', 'Database Systems', 'Comp. Sci.', 4);

▪ Add a new tuple to student with tot_creds set to null


insert into student
values ('3003', 'Green', 'Finance', null);

365
Insertion (Cont.)

▪ Make each student in the Music department who has earned more than 144 credit
hours an instructor in the Music department with a salary of $18,000.
insert into instructor
select ID, name, dept_name, 18000
from student
where dept_name = 'Music' and tot_cred > 144;

▪ The select from where statement is evaluated fully before any of its results are
inserted into the relation.
Otherwise queries like
insert into table1 select * from table1
would cause problem

366
Updates

▪ Give a 5% salary raise to all instructors


update instructor
set salary = salary * 1.05
▪ Give a 5% salary raise to those instructors who earn less than 70000
update instructor
set salary = salary * 1.05
where salary < 70000;
▪ Give a 5% salary raise to instructors whose salary is less than average
update instructor
set salary = salary * 1.05
where salary < (select avg (salary)
from instructor);

367
Updates (Cont.)

▪ Increase salaries of instructors whose salary is over $100,000 by 3%, and all others
by a 5%
• Write two update statements:
update instructor
set salary = salary * 1.03
where salary > 100000;
update instructor
set salary = salary * 1.05
where salary <= 100000;
• The order is important
• Can be done better using the case statement (next slide)

368
Case Statement for Conditional Updates

▪ Same query as before but with case statement


update instructor
set salary = case
when salary <= 100000 then salary * 1.05
else salary * 1.03
end

369
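The case update can be checked with sqlite3 (a sketch with invented instructors): one statement applies the 3% and 5% raises, and unlike two sequential updates, no row can be raised twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table instructor (name varchar(20), salary numeric(8,2))")
cur.executemany("insert into instructor values (?, ?)",
                [("Gold", 120000), ("Katz", 80000)])

# 3% raise above 100000, 5% otherwise, in a single statement
cur.execute("""
    update instructor
    set salary = case
        when salary <= 100000 then salary * 1.05
        else salary * 1.03
    end
""")
rows = cur.execute("select name, salary from instructor order by name").fetchall()
print(rows)  # Gold ≈ 123600, Katz ≈ 84000
```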
Updates with Scalar Subqueries

▪ Recompute and update tot_creds value for all students


update student S
set tot_cred = (select sum(credits)
from takes, course
where takes.course_id = course.course_id and
S.ID = takes.ID and
takes.grade <> 'F' and
takes.grade is not null);
▪ Sets tot_creds to null for students who have not taken any course
▪ Instead of sum(credits), use:
case
when sum(credits) is not null then sum(credits)
else 0
end

370
NORMALISATION

371
Characteristics of Relational Data

Elements of normalisation
> Primary/Foreign keys define the relationships
> Data is retrieved by joining tables in queries

Benefits of normalisation
> Reduce storage
> Avoid data duplication
> Improve data quality
372
Normalisation
The process of structuring a relational database in accordance
with a series of so-called normal forms in order to reduce data
redundancy and improve data integrity. In simple words, is the
process of splitting an entity into more than one tables.

▪ Database normalization is a process used to organize a
database into tables and columns. The idea is that a table
should be about a specific topic, with only supporting
topics included.
▪ There are three main reasons to normalize a database:
• minimize duplicate data,
• minimize or avoid data modification issues (anomalies), and
• simplify queries
▪ In order to normalize our database we use normal
forms. The most common are 1NF, 2NF and 3NF
373
Normalisation - 1NF
▪ Each table cell should contain a single value.
▪ Each record needs to be unique.

374
Normalisation - 1NF
▪ Each table cell should contain a single value.
▪ Each record needs to be unique.

Not normalised

Normalised

Source
375
Normalisation - 2NF
▪ Be in 1NF
▪ All of the attributes that are not part of the candidate key
depend on Title, but only Price also depends on Format. To
conform to 2NF and remove duplication, every
non-candidate-key attribute must depend on the whole
candidate key, not just part of it.

Not normalised

Normalised 376
Normalisation - 3NF
The Book table still has a transitive functional dependency
({Nationality} is dependent on {Author}, which is dependent on
{Title}). A similar violation exists for genre ({Genre Name} is
dependent on {Genre ID}, which is dependent on {Title}).

Hence, the Book table is not in 3NF, so we need to eliminate


the dependencies by placing {Nationality} and {Genre Name}
in their own respective tables:

Normalised

377
Python
Introduction to Programming
Module 3

378
What is Python?

Python is a high-level, general-purpose programming language


It was developed by Guido van Rossum and first released in 1991
It’s one of the most popular programming languages

379
Python Download and Installation

You can download Python from the official


website https://www.python.org/downloads/ and
follow the installer instructions to install it
You can also write Python code on
https://colab.research.google.com/ where there
you can find a notebook-like environment ready
to use
Python programs end in .py and can be
executed either from Python IDLE (Integrated
Development and Learning Environment) or
directly from the command line

380
Python Basics

• Comments start with #


• Variables are containers for storing data values

Data Types

Text Type String (str)

Numeric Types Integer (int), Float (float)

Sequence Types List (list), Tuple (tuple)

Mapping Type Dictionary (dict)

Set Type Set (set)

Boolean Type Boolean (bool)

381
Python Basics
• Strings
• “strings can be between double quotes”
• ‘strings can also be between single quotes’
• Numbers
• Integers: 1, 2, 3454, -5
• Floats: 1.3, 5.0, 2343.456
• List: ["this", "is", "a", "list", 2, 5, True, 'Hello', 42.0]
• Tuple: (4, 50, 'test')
• Dictionary: {'John': 22, 'Jack': 27}
• Set: {1, 10, 2, 7}
• Boolean: True, False
(List, tuple, dictionary and set are the 4 basic data structures)

382
Python operators

• Arithmetic operators (+, -, *, /, %, **, //)


• Assignment operators (=, +=, etc) e.g. a = b 🡪 it means that you give the
value of b to a
• Comparison operators (==, !=, >, <, >=, <=) e.g. a == b 🡪 it checks whether
the value of a is equal to the value of b
• Logical operators (and, or, not)

https://www.w3schools.com/python/python_operators.asp

383
Python Libraries

• A Python library is simply a collection of codes or modules of codes that


we can use in a program for specific operations

Common Python Libraries

Standard Library Built-in modules

Numpy Support for large,


multi-dimensional arrays and
matrices and high-level
mathematical functions
Pandas Data analysis and manipulation
tool. Data structures and
operations for manipulating
numerical tables and time
series
Matplotlib Plotting library for the Python
programming language and its
numerical mathematics
extension NumPy
384
Logical Operators
With and, once one operand is False, the result is False
• True and False = False
• True and True = True
• False and False = False
• False and True = False

With or, once one operand is True, the result is True
• True or False = True
• True or True = True
• False or False = False
• False or True = True

not gives the opposite of the expression next to it


• not True = False
• not False = True

385
Lists

• mylist = ["apple", "banana", "cherry"]

• Lists are ordered, allow duplicates and are changeable


• Lists can be indexed, take any data type and have length

386
Lists

• To access elements of list we use indexing


• Lists start from element in position 0

• Last element is element in position -1

387
List Methods

• append(): adds an item to the end of the list


• insert(): inserts item at a specified index
• remove(): removes specified item
• pop(): removes specified index
• sort(): sorts the list (numerically or alphabetically)

https://www.w3schools.com/python/python_lists_methods.asp

388
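The five methods above in action (the fruit list is just sample data):

```python
fruits = ["banana", "apple"]

fruits.append("cherry")     # add to the end -> ['banana', 'apple', 'cherry']
fruits.insert(0, "orange")  # insert at index 0
fruits.remove("banana")     # remove by value
popped = fruits.pop(1)      # remove (and return) the item at index 1
fruits.sort()               # alphabetical sort, in place

print(popped, fruits)  # apple ['cherry', 'orange']
```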
Dictionaries

• cardict = {
"brand": "Ford",
"model": "Mustang",
"year": 1964
}
• Dictionaries are ordered (Python 3.7+), changeable and have key:value
pairs
• They don’t allow duplicates (items with the same key)

389
Dictionaries

• To access a dictionary item, we refer to its key name

• Add items by using a new index key and assigning a value to it

390
Dictionary Methods

• keys(): Returns list of dictionary keys


• values(): Returns list of dictionary values
• items(): Returns list of dictionary items (tuple for each key:value pair)

391
Loops

• Python has two types of loops: For and While loops


For loops While Loops

392
For Loop

We start from 1 and loop up to 20, adding 2 on each iteration. Note
that i never takes the value 21, as it is outside the range.

393
While Loop

The above does exactly the same as the for loop but it has a different
structure.

394
List Comprehensions

list1 and list2 have the same elements but were created with different
methods

395
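Since the code from the slide image is not reproduced here, a minimal sketch of the two construction methods looks like this:

```python
# Loop-based construction
list1 = []
for x in range(1, 6):
    list1.append(x ** 2)

# The same list built with a comprehension, in a single expression
list2 = [x ** 2 for x in range(1, 6)]

print(list1, list2)  # [1, 4, 9, 16, 25] [1, 4, 9, 16, 25]
```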
Conditional Logic

We have a test expression that is evaluated
to either True or False. If it is True, the body
of the if statement is executed; otherwise
it is skipped.

396
Conditional Logic

In this example we start by evaluating b>a.


This is false so we move to the elif a==b.
This is false as well so we are left with only the
else that has now to be executed.

397
Conditional Logic

Opposite operators

Operator Opposite operator


< >=
> <=
<= >
>= <

398
Exception Handling

When an error occurs, or exception as we call it, Python will normally stop
and generate an error message.

These exceptions can be handled using the try statement:

399
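A minimal try/except sketch (the safe_divide helper is invented for illustration): instead of Python stopping with an error message, the exception is caught and handled.

```python
def safe_divide(a, b):
    try:
        return a / b            # may raise ZeroDivisionError
    except ZeroDivisionError:
        return None             # handle the exception instead of crashing

ok  = safe_divide(10, 2)
bad = safe_divide(10, 0)
print(ok, bad)  # 5.0 None
```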
Functions

Functions are defined using the “def” keyword.

In the above function we simply add two numbers. The function takes two
arguments (a, b), adds them together and returns their sum

400
Lambda Functions

A faster way to write functions is using lambdas. A
lambda function is a small anonymous function.
Similar to the previous example, here we have a lambda function that takes
two arguments a and b and returns their sum (a+b)

401
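Side by side, the def version and its lambda equivalent (plus a common use of a lambda as a throwaway sort key; the sample data is invented):

```python
# Ordinary function
def add(a, b):
    return a + b

# Equivalent lambda: a small anonymous function
add_lambda = lambda a, b: a + b

print(add(2, 3), add_lambda(2, 3))  # 5 5

# Lambdas are handy as short one-off arguments, e.g. a sort key
pairs = [("b", 2), ("a", 1)]
pairs.sort(key=lambda p: p[1])
print(pairs)  # [('a', 1), ('b', 2)]
```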
Python
Pandas & Matplotlib Libraries
Module 3 & Module 4

402
What is Pandas?

Pandas is a library in python


It is used for data manipulation and analysis

It has data structures and operations for manipulating


Numerical tables
Time series

It is built based on the NumPy library


Generally rows are observations and columns are variables

Source:https://pandas.pydata.org/

403
Data Structures: Pandas

▪ Heart of pandas: two primary data structures


▪ Series: data structure designed for a sequence of one-dimensional data
▪ DataFrame: data structure designed to contain cases with several
dimensions
▪ Although these data structures are not the universal solution to all the
problems, they do provide a valid and robust tool for most apps
▪ A particular feature, integration in their structure of
▪ Index objects and Labels

404
Series

405
Series

▪ 1D data structure, similar to an array, but with some additional features


▪ Composition: two arrays associated with each other
▪ main array: hold the data (any NumPy type) to which each element is associated
with
▪ a label contained within the other array, called the Index

406
Series
Series is an one dimensional labeled array

It can hold any type of data (int, float, string)

407
Declaring a Series
▪ Call the Series() constructor and pass as an argument an array containing the
values

>>> import pandas as pd


>>> s = pd.Series([12,-4,7,9])

>>> s
0 12
1 -4
2 7
3 9
dtype: int64

▪ Left: the values in the Index,


▪ Right: the corresponding values
▪ If you do not specify any index 🡪 pandas assigns numeric labels increasing from 0
408
Series: Access

Access Elements It supports slicing

409
Series

include the index option assigning an array of strings containing the labels

>>> s = pd.Series([12,-4,7,9],
index=['a','b','c','d'])
>>> s
a 12
b -4
c 7
d 9
dtype: int64

410
Series
▪ You call the two attributes of the Series
▪ index & values

>>> s.values
array([12, -4, 7, 9], dtype=int64)
>>> s.index
Index([u'a', u'b', u'c', u'd'], dtype='object')
>>> s[['b','c']]
b -4
c 7
dtype: int64

411
Series: Selecting Elements
▪ individual elements, specifying the key
>>> s[2]
7

▪ Or you can specify the label corresponding to the position of


the index
>>> s['b']
-4

▪ select multiple items


>>> s[0:2]
a 12
b -4
dtype: int64

>>> s[['b','c']]
b -4
c 7

412
NumPy to Series
Define new Series starting with NumPy arrays or existing Series
>>> arr = np.array([1,2,3,4])
>>> s3 = pd.Series(arr)
>>> s3
0 1
1 2
2 3
3 4
dtype: int32
▪ Note: values contained within the NumPy array or the original
Series are not copied
▪ But passed by reference
▪ the object is inserted dynamically within the new Series object
▪ If it changes, for example an internal element varies in value, the
change will also be visible in the new Series object

413
Filtering Value

▪ NumPy library: the base for the development of pandas


▪ Many NumPy arrays operations are extended to Series, e.g.
▪ filtering of the values contained within the data structure through
conditions
▪ which elements within the series have value greater than 8:

>>> s[s > 8]


a 12
d 9
dtype: int64

414
Operations and Mathematical Functions
▪ operators (+, -, *, /) or mathematical functions that are applicable to NumPy
array can be extended to objects Series
>>> s / 2
a 6.0
b -2.0
c 3.5
d 4.5
dtype: float64
▪ NumPy math. functions: must specify the function referenced with np & the
instance of the Series passed as argument
>>> np.log(s)
a 2.484907
b NaN
c 1.945910
d 2.197225
dtype: float64
415
Operations: Unique
>>> serd = pd.Series([1,0,2,1,2,3],
index=['white','white','blue','green','green','yellow'])
>>> serd
white 1
white 0
blue 2
green 1
green 2
yellow 3
dtype: int64

▪ exclude duplicates: use unique() function


▪ return value: an array containing the unique values in the Series

>>> serd.unique()
array([1, 0, 2, 3], dtype=int64)

416
Operations: Value count

>>> serd = pd.Series([1,0,2,1,2,3],


index=['white','white','blue','green','green','yellow'])
► value_counts() function, returns the unique values & calculates occurrences within a
Series
>>> serd.value_counts()
2 2
1 2
3 1
0 1
dtype: int64

417
Membership
▪ isin() evaluates membership
▪ Boolean values returned, useful during the filtering of data within a
series or in a column of a DataFrame
>>> serd.isin([0,3])
white False
white True
blue False
green False
green False
yellow True
dtype: bool
▪ Where is it stored?
>>> serd[serd.isin([0,3])]
white 0
yellow 3
dtype: int64
418
NaN value

▪ NaN (Not a Number): in pandas it indicates the


presence of an empty field or a value not definable
numerically
▪ NaN generated:
▪ when extracting data from some source gave some
trouble, or when the source has missing data
▪ such as calculations of logarithms of negative values, or
exceptions during execution of some calculation or
function

419
…NaN value

>>> s2 = pd.Series([5,-3,np.NaN,14])
>>> s2
0 5
1 -3
2 NaN
3 14
► isnull() and notnull() functions are useful to identify the indexes without a value
>>> s2.isnull( )
0 False
1 False
2 True
3 False
dtype: bool
>>> s2.notnull( )
0 True
1 True
2 False
3 True
dtype: bool

420
…NaN value
These functions are useful when placed inside a filter, as a
condition
>>> s2 = pd.Series([5,-3,np.NaN,14])
>>> s2[s2.notnull( )]
0 5
1 -3
3 14
dtype: float64
>>> s2[s2.isnull( )]
2 NaN
dtype: float64

421
Series as Dictionaries
▪ Alternative way to see a Series: think of them as dict (dictionary)
▪ You can create a series from a dict previously defined

>>> mydict = {'red': 2000, 'blue': 1000, 'yellow': 500, 'orange':


1000}
>>> myseries = pd.Series(mydict)
blue 1000
orange 1000
red 2000
yellow 500
dtype: int64

422
Operations on Series

#append series s2 to series s1 (in newer pandas versions, use pd.concat([s1, s2]))


>>>s1.append(s2)

#to delete an element, or multiple elements


>>>s1.drop('index')
>>>s1.drop(['index1','index2'])

423
Dataframes

424
Dataframes
▪ Dataframes are two dimensional arrays
▪ Columns can be different types (int, float, string)

Rows and columns are labeled


Row labels are called the index (usually 0, 1, 2, ...)
Columns have names, usually of the variable each holds

The size of the dataframe can change


There are built-in operations that can be performed on the
dataframe (that's why it is used so widely in data science)

425
Dataframes
We can define a dataframe by passing a dictionary
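The slide's screenshot is not reproduced here; a minimal sketch of building a dataframe from a dictionary:

```python
import pandas as pd

# Each dictionary key becomes a column name, each list a column of values
data = {'color': ['blue', 'green', 'yellow', 'red', 'white'],
        'object': ['ball', 'pen', 'pencil', 'paper', 'mug'],
        'price': [1.2, 1.0, 0.6, 0.9, 1.7]}
frame = pd.DataFrame(data)
print(frame)
```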

426
Dataframes
Another way of defining a dataframe
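The screenshot is not reproduced; one common alternative is to pass a 2D array together with explicit index and column labels:

```python
import numpy as np
import pandas as pd

# From a 2D NumPy array, supplying row and column labels explicitly
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['red', 'blue', 'yellow'],
                     columns=['ball', 'pen', 'pencil'])
print(frame)
```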

427
Dataframes: Access
Access the elements of a column

Access a particular element in a column
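The slide's screenshots are not reproduced; a minimal sketch of both kinds of access:

```python
import pandas as pd

frame = pd.DataFrame({'color': ['blue', 'green', 'yellow'],
                      'price': [1.2, 1.0, 0.6]})

# A whole column, as a Series (bracket or attribute syntax)
print(frame['color'])
print(frame.price)

# A particular element: column first, then row index
print(frame['price'][2])  # 0.6
```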

428
Dataframes: Access with loc and iloc
loc: Access elements with the particular 'label' as index
iloc: Access elements with the position given (only takes integers)
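The slide's screenshot is not reproduced; a minimal sketch of loc (label-based) versus iloc (position-based):

```python
import pandas as pd

s = pd.Series([12, -4, 7], index=['a', 'b', 'c'])

print(s.loc['b'])   # -4  (label-based)
print(s.iloc[1])    # -4  (position-based)

frame = pd.DataFrame({'price': [1.2, 1.0]}, index=['ball', 'pen'])
print(frame.loc['pen', 'price'])   # 1.0
print(frame.iloc[1, 0])            # 1.0
```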

429
Dataframe

>>> data = {'color' : ['blue','green','yellow','red','white'],


'object' : ['ball','pen','pencil','paper','mug'],
'price' : [1.2,1.0,0.6,0.9,1.7]}

frame = pd.DataFrame(data)

>>> frame
color object price
0 blue ball 1.2
1 green pen 1.0
2 yellow pencil 0.6
3 red paper 0.9
4 white mug 1.7

• data frame: two index arrays
• 1st: rows: similar functions to the index array in Series. Each label is
associated with all the values in the row
• 2nd: contains a series of labels, each associated with a column

430
Dataframe
► create quickly a matrix of values you can use
► np.arange(16).reshape((4,4)) generates 4x4 matrix of increasing numbers from 0-15
>>> frame3 = pd.DataFrame(np.arange(16).reshape((4,4)),
... index=['red','blue','yellow','white'],
... columns=['ball','pen','pencil','paper'])
>>> frame3
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15

431
Selecting Elements
>>> frame.columns
Index([u'color', u'object', u'price'], dtype='object')

to get the list of indexes: use the index attribute


>>> frame.index
Int64Index([0, 1, 2, 3, 4], dtype='int64')

can get the entire set of data: use the values attribute
>>> frame.values
array([['blue', 'ball', 1.2],
['green', 'pen', 1.0],
['yellow', 'pencil', 3.3],
['red', 'paper', 0.9],
['white', 'mug', 1.7]], dtype=object)

432
Assigning values
>>> frame['new'] = [3.0,1.3,2.2,0.8,1.1]

>>> frame.index.name = 'id'; frame.columns.name = 'item'
>>> frame
item color object price
id
0 blue ball 1.2
1 green pen 1.0
2 yellow pencil 3.3
3 red paper 0.9
4 white mug 1.7

>>> frame['new'] = 12
>>> frame
colors object price new
0 blue ball 1.2 12
1 green pen 1.0 12
2 yellow pencil 0.6 12
3 red paper 0.9 12
4 white mug 1.7 12

433
Assigning values
>>> ser = pd.Series(np.arange(5))
>>> ser
0 0
1 1
2 2
3 3
4 4
dtype: int32

>>> frame['new'] = ser


>>> frame
color object price new
0 blue ball 1.2 0
1 green pen 1.0 1
2 yellow pencil 0.6 2
3 red paper 0.9 3
4 white mug 1.7 4
434
Membership check

▪ function isin() applied to the Series to decide the membership of a set of


values; also applicable on DataFrame objects
>>> frame.isin([1.0,'pen'])

color object price


0 False False False
1 False True True
2 False False False
3 False False False
4 False False False

435
Membership check

>>> frame[frame.isin([1.0,'pen'])]
color object price
0 NaN NaN NaN
1 NaN pen 1
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN

436
Deleting a column

to delete an entire column with all its contents, use the del command

>>> del frame['new']


>>> frame
colors object price
0 blue ball 1.2
1 green pen 1.0
2 yellow pencil 0.6
3 red paper 0.9
4 white mug 1.7

437
Filtering

you can apply the filtering through the application of certain conditions,
e.g. to get all values smaller than a certain number (e.g. 12)

>>> frame[frame < 12]


ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white NaN NaN NaN NaN

438
Transpose a dataframe

>>> frame2.T
2011 2012 2013
blue 17 27 18
red NaN 22 33
white 13 22 16

439
Reading & Writing Data

440
Reading CSV or Text data

If the file is comma-delimited, you can use the read_csv() function to
▪ read its content
▪ convert it into a DataFrame object

File: myCSV_01.csv
white,red,blue,green,animal
1,5,2,3,cat
2,7,8,5,dog
3,3,6,7,horse
2,2,8,3,duck
4,4,2,1,mouse

>>> csvframe = pd.read_csv('myCSV_01.csv')
>>> csvframe
white red blue green animal
0 1 5 2 3 cat
1 2 7 8 5 dog
2 3 3 6 7 horse
3 2 2 8 3 duck
4 4 4 2 1 mouse

441
read_table

▪ CSV files are tabulated data in which the values on the same column are separated by commas.
▪ CSV files are considered text files 🡪 use the read_table() function, but specifying the delimiter

>>> pd.read_table('ch05_01.csv',sep=',')
white red blue green animal
0 1 5 2 3 cat
1 2 7 8 5 dog
2 3 3 6 7 horse
3 2 2 8 3 duck
4 4 4 2 1 mouse

442
read_csv
>>> pd.read_csv('ch05_02.csv')
  1 5 2 3 cat
0 2 7 8 5 dog
1 3 3 6 7 horse
2 2 2 8 3 duck
3 4 4 2 1 mouse

File: ch05_02.csv
1,5,2,3,cat
2,7,8,5,dog
3,3,6,7,horse
2,2,8,3,duck
4,4,2,1,mouse

>>> pd.read_csv('ch05_02.csv', header=None)
  0 1 2 3 4
0 1 5 2 3 cat
1 2 7 8 5 dog
2 3 3 6 7 horse
3 2 2 8 3 duck
4 4 4 2 1 mouse

443
Specify column names

>>> pd.read_csv('ch05_02.csv', names=['white','red','blue','green','animal'])


white red blue green animal
0 1 5 2 3 cat
1 2 7 8 5 dog
2 3 3 6 7 horse
3 2 2 8 3 duck
4 4 4 2 1 mouse

444
Writing CSV data to a file

#save all
>>> frame2.to_csv('ch05_07.csv')

#do not save indexes and headers


>>> frame2.to_csv('ch05_07b.csv', index=False, header=False)

Listing 5-7: ch05_07.csv

ball,pen,pencil,paper
0,1,2,3
4,5,6,7
8,9,10,11
12,13,14,15

445
Example

The data set will consist of 5 baby names and the number
of births recorded for that year (1880).

446
Create a data frame

• use the pandas library to export this data set into a csv file
• df will be a DataFrame object
• holds the contents of the BabyDataSet
• format similar to a sql table or an excel spreadsheet
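The slide's screenshot is not reproduced; a sketch of building the BabyDataSet as a dataframe (the specific names and birth counts here are illustrative assumptions):

```python
import pandas as pd

# Assumed sample values for the 1880 baby-name example
names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
births = [968, 155, 77, 578, 973]

# Pair each name with its birth count, then build the DataFrame
BabyDataSet = list(zip(names, births))
df = pd.DataFrame(data=BabyDataSet, columns=['Names', 'Births'])
print(df)
```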

447
Save / load a dataframe

• Save to disk
• Indexes are not saved: leftmost column
• Header is not saved
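The slide's screenshot is not reproduced; a minimal sketch of the save/load round trip (the filename is an assumption):

```python
import pandas as pd

df = pd.DataFrame({'Names': ['Bob', 'Mary'], 'Births': [968, 77]})

# Save without the index column (leftmost) and without the header row
df.to_csv('births.csv', index=False, header=False)

# Load it back; header=None because the file has no header row
df2 = pd.read_csv('births.csv', header=None, names=['Names', 'Births'])
print(df2)
```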

448
Observing/Checking data

► 1st level of check (more to follow)


► Ignore names
► Births should be an integer

449
A Quick Look at the Data

▪ csvframe.head(3)
▪ csvframe.tail(Number)

▪ The above commands allow you to see the first / last rows of the dataframe

450
A view of the memory requirements

csvframe.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
white 5 non-null int64
red 5 non-null int64
blue 5 non-null int64
green 5 non-null int64
animal 5 non-null object
dtypes: int64(4), object(1)
memory usage: 550.0 bytes
451
JSON: Read a file
import json
from pandas.io.json import json_normalize

file = open('books.json','r')
text = file.read()
text = json.loads(text)

books.json
[{'writer': 'Mark Ross',
'nationality': 'USA',
'books': [{'title': 'XML Cookbook', 'price': 23.56},
{'title': 'Python Fundamentals', 'price': 50.7},
{'title': 'The NumPy library', 'price': 12.3}]},

{'writer': 'Barbara Bracket',
'nationality': 'UK',
'books': [{'title': 'Java Enterprise', 'price': 28.6},
{'title': 'HTML5', 'price': 31.35},
{'title': 'Python for Dummies', 'price': 28.0}]}]

452
JSON: data
>>> text[0]
{'writer': 'Mark Ross',
'nationality': 'USA',
'books': [{'title': 'XML Cookbook', 'price': 23.56},
{'title': 'Python Fundamentals', 'price': 50.7},
{'title': 'The NumPy library', 'price': 12.3}]}

>>> text[0]['writer']
'Mark Ross'

>>> text[0]['books'][1]['price']
What do you get?
453
JSON data

Flatten the data 🡪 make it a data frame


>>>data=json_normalize(text)
>>>json_normalize(text)
books ... writer
0 [{'title': 'XML Cookbook', 'price': 23.56}, {'... ... Mark Ross
1 [{'title': 'Java Enterprise', 'price': 28.6}, ... ... Barbara Bracket

>>>data.columns
Index(['books', 'nationality', 'writer'], dtype='object')

454
Basic Analysis Operations

455
Basic Analysis Operations

•loading
•assembling
•merging
•concatenating
•combining
•reshaping (pivoting)
•removing

456
Merge

▪ Merge: corresponds to the JOIN operation in SQL


▪ A combination of data through the connection of rows using one or more keys
▪ Relational databases use JOIN query with SQL

▪ to get data from different tables


▪ using some reference values (keys) shared between them

457
Merging
>>> import numpy as np
>>> import pandas as pd
>>> frame1 = pd.DataFrame({'id':['ball','pencil','pen','mug','ashtray'], 'price':
[12.33,11.44,33.21,13.23,33.62]})
>>> frame1
id price
0 ball 12.33
1 pencil 11.44
2 pen 33.21
3 mug 13.23
4 ashtray 33.62
>>> frame2 = pd.DataFrame( {'id':['pencil','pencil','ball','pen'], 'color':
['white','red','red','black']})
>>> frame2
color id
0 white pencil
1 red pencil
2 red ball
3 black pen
458
Merging…

applying the merge( ) function to the two DataFrame objects
>>> pd.merge(frame1,frame2)
id price color
0 ball 12.33 red
1 pencil 11.44 white
2 pencil 11.44 red
3 pen 33.21 black

>>> frame1
id price
0 ball 12.33
1 pencil 11.44
2 pen 33.21
3 mug 13.23
4 ashtray 33.62

>>> frame2
color id
0 white pencil
1 red pencil
2 red ball
3 black pen

If the field does not have a common name across the two tables, then:
pd.merge(frame1, frame3,left_on='id',right_on='id2')
459
Concatenate: Create 1st data frame

raw_data = {
'subject_id': ['1', '2', '3', '4', '5'],
'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}

df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name'])

subject_id first_name last_name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 4 Alice Aoni
4 5 Ayoung Atiches
460
Concatenate: create 2nd Data frame

raw_data = {
'subject_id': ['4', '5', '6', '7', '8'],
'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}

df_b = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name'])

subject_id first_name last_name
0 4 Billy Bonder
1 5 Brian Black
2 6 Bran Balwner
3 7 Bryce Brice
4 8 Betty Btisan

461
Create 3rd Data frame

raw_data = {
'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}

df_n = pd.DataFrame(raw_data, columns = ['subject_id','test_id'])

462
Join 2 data frames

#join along rows


df_new = pd.concat([df_a, df_b])

#join along columns


pd.concat([df_a, df_b], axis=1)


463
Removing Duplicates
▪ Duplicate rows might be present in a DataFrame for various reasons
▪ First, create a simple DataFrame with some duplicate rows

>>> dframe = pd.DataFrame({ 'color':


['white','white','red','red','white'],
'value': [2,1,3,3,2]})
>>> dframe
color value
0 white 2
1 white 1
2 red 3
3 red 3
4 white 2

>>> dframe.duplicated()

>>> dframe[dframe.duplicated()]
color value
3 red 3
4 white 2

464


Discretization and Binning

▪ Discretization: sometimes it is necessary to transform


data into discrete categories,
▪ E.g. by dividing the range of values of such readings in smaller intervals
and counting the occurrence or statistics within each of them
▪ Another case might be to have a huge amount of samples due to
precise readings on a population. Even here, to facilitate analysis of the
data it is necessary to divide the range of values into categories and
then analyze the occurrences and statistics related to each of them

465
Binning

>>> results = [12,34,67,55,28,90,99,12,3,56,74,44,87,23,49,89,87]


>>> bins = [0,25,50,75,100]

>>> cat = pd.cut(results, bins)

#the first interval


>>>cat[0]

#number of values per interval


>>>pd.value_counts(cat)

466
Detecting & Filtering Outliers

>>> randframe =
pd.DataFrame(np.random.randn(1000,3))

With the describe() function you can see the statistics


for each column.

>>> randframe.describe()
0 1 2
count 1000.000000 1000.000000 1000.000000
mean 0.021609 -0.022926 -0.019577
std 1.045777 0.998493 1.056961
min -2.981600 -2.828229 -3.735046
25% -0.675005 -0.729834 -0.737677
...

467
Outlier detection
► E.g. outliers: values greater than three times the standard deviation std()
>>> randframe.std()
0 1.045777
1 0.998493
2 1.056961
dtype: float64
► Apply filtering of all the values of the DataFrame, by applying the corresponding standard
deviation for each column. Use any() function, to apply the filter on each column
► 1 in any: means columns
>>> randframe[(np.abs(randframe) > (3*randframe.std())).any(1)]
0 1 2
69 -0.442411 -1.099404 3.206832
576 -0.154413 -1.108671 3.458472
907 2.296649 1.129156 -3.735046

468
Group By

469
Group by
► Its internal mechanism a process called: SPLIT-APPLY-COMBINE
► splitting: division into groups of datasets
► Applying: application of a function on each group
► combining: combination of all the results obtained by different groups
► Example:
► Splitting: the data contained within a data structure, such as a Series or a
DataFrame, are divided into several groups, according to a given criterion, which is
often linked to indexes or just certain values in a column.
► In SQL, values contained in this column are reported as keys
► if you are working with two-dimensional objects e.g. DataFrame, the grouping
criterion may be applied both to the rows (axis = 0) and to the columns (axis = 1)
► Applying: a function, which will produce a new and single value, specific to that
group
► Combining: collects all the results obtained from each group & combine them to a
new object

470
Example

>>> frame = pd.DataFrame({ 'color':


['white','red','green','red','green'],
'object':
['pen','pencil','pencil','ashtray','pen'],
'price1' : [5.56,4.20,1.30,0.56,2.75],
'price2' : [4.75,4.12,1.60,0.75,3.15]})

>>> frame
color object price1 price2
0 white pen 5.56 4.75
1 red pencil 4.20 4.12
471
2 green pencil 1.30 1.60
Example…
► Calculate the average price1 column using group labels listed in the column color
► E.g. access the price1 column and call the groupby() function with the column color

>>> group = frame['price1'].groupby(frame['color'])

>>> group
<pandas.core.groupby.SeriesGroupBy object at 0x00000000098A2A20>

>>> group.groups
{'white': [0L], 'green': [2L, 4L], 'red': [1L, 3L]}

>>> group.mean()
color
green 2.025
red 2.380
white 5.560
Name: price1, dtype: float64

>>> group.sum()
color
green 4.05
red 4.76
white 5.56
472
Name: price1, dtype: float64
Hierarchical grouping
▪ You have seen how to group the data according to the values of a column as a key choice.
▪ Can be extended to multiple columns, i.e., make a grouping of multiple keys hierarchical

>>> ggroup = frame['price1'].groupby([frame['color'],frame['object']])


>>> ggroup.groups
{('red', 'ashtray'): [3L], ('red', 'pencil'): [1L], ('green', 'pen'): [4L], ('green', 'pencil'): [2L], ('white', 'pen'): [0L]}
>>> ggroup.sum()
color object
green pen 2.75
pencil 1.30
red ashtray 0.56
pencil 4.20
white pen 5.56

Name: price1, dtype: float64

473
Aggregations

474
Aggregations
Aggregations are functions that can be applied efficiently on dataframes
They can be applied to whole columns or to groups that come from groupby

475
Aggregations

476
Aggregations
We can directly create a new column from our existing dataframe using the agg
command

477
Operations: Count vs Sum

▪ In this example we can't use the count method we saw before to find how many
planets were discovered per method, because the number column can hold values
greater than one
▪ So we use sum instead
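A sketch of the idea with a tiny stand-in for the planets dataset (the values below are illustrative, not the real data):

```python
import pandas as pd

# Each row records one discovery event; 'number' is planets found in it
planets = pd.DataFrame({
    'method': ['Transit', 'Transit', 'Radial Velocity'],
    'number': [1, 3, 2]})

# count() would just count rows; sum() adds up planets per method
per_method = planets.groupby('method')['number'].sum()
print(per_method)
```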

478
Operations

479
Operations

480
Operations

Slicing

481
Operations
Get the shape of the dataframe (shape); general information (info)

482
Operations

Another piece of information we can get about our data is with dtypes

▪ Notice that dtypes doesn't use parentheses


▪ Because it isn't a function, it is an attribute
▪ Returns the type for each column

483
Operations
Describe is a very useful operation that gives a general overview of all the
numerical data
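The slide's screenshot is not reproduced; a minimal sketch:

```python
import pandas as pd

frame = pd.DataFrame({'price': [1.2, 1.0, 0.6, 0.9, 1.7]})

# count, mean, std, min, quartiles, max for every numeric column
stats = frame.describe()
print(stats)
```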

484
Operations
We can also sort by a certain column using the sort_values method
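The slide's screenshot is not reproduced; a minimal sketch of sort_values:

```python
import pandas as pd

frame = pd.DataFrame({'object': ['ball', 'pen', 'pencil'],
                      'price': [1.2, 1.0, 0.6]})

print(frame.sort_values(by='price'))                   # ascending (default)
print(frame.sort_values(by='price', ascending=False))  # descending
```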

485
Data Cleaning: Replace
Replace is a pandas function that finds a value and replaces it with another
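The slide's screenshot is not reproduced; a minimal sketch of replace:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 99, 4])

# Treat 99 as a sentinel for "missing" and replace it with NaN
cleaned = s.replace(99, np.nan)
print(cleaned)
```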

486
Data Cleaning: Replace
It also accepts regular expressions as an input
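The slide's screenshot is not reproduced; a minimal sketch of replace with a regular expression:

```python
import pandas as pd

s = pd.Series(['  Mary', 'Bob  ', 'Anne'])

# With regex=True the first argument is treated as a pattern;
# here we strip all whitespace runs from each string
cleaned = s.replace(r'\s+', '', regex=True)
print(cleaned)
```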

487
Visualizations

488
Visualizations

▪ Pandas offers a direct plotting tool (it uses matplotlib as a


backend)
▪ It is a powerful plotting tool that supports all the common forms of
visualization (histogram, bar graph, pie chart, heatmap, etc.)

▪ We will see 4 examples in the planets dataset we explored


before

- Histogram
- Pie Chart
- Bar Graph
- Boxplot
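The slides' plot screenshots are not reproduced; a sketch of the pandas plotting interface on illustrative stand-in data (filenames and values are assumptions):

```python
import matplotlib
matplotlib.use('Agg')   # non-interactive backend: render to files, no window
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'method': ['Transit', 'Radial Velocity'],
                   'count': [10, 5]})

# Bar graph straight from the DataFrame (pandas uses matplotlib underneath)
df.plot(kind='bar', x='method', y='count')
plt.savefig('bar.png')

# Histogram of a numeric column, on a fresh figure
plt.figure()
pd.Series([1, 2, 2, 3, 3, 3]).plot(kind='hist')
plt.savefig('hist.png')
```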

489
Visualizations: Basic Components in
matplotlib library

490
Visualizations: histograms

491
Visualizations: Pie Charts

492
Visualizations: Bar Plots

493
Visualizations: Box Plot

source 494
Visualizations: Box Plot

source 495
Data Visualization
Module 4

496
Data Visualization

•What is Data Visualization?


• Data Visualization is the representation of data or information in a graph, chart, or other visual
format.
•Why Data Visualization?
• A visual summary of information makes it easier to identify patterns, trends and outliers than
looking through thousands of rows on a spreadsheet.

497
Examples

Google Map Traffic
The live traffic status in Google map makes humans' life simple and easy. You can
see the red colour indicating slow-traffic areas in the city on the map and blue
colour indicating fast traffic, etc.

Geo Spatial Data
Amongst the various types of geospatial visualisations, heat maps are useful when
you have to represent large sets of continuous data on a map using a color
spectrum.

Hard Disk Drives
The blue line and empty space in a horizontal bar and a pie chart is an example
to inform about the consumed and free space in a hard disk.

498
Variable Types

499
Four levels of
measurement

500
Basic Plots

Histogram Scatter Plot

Basic
Bar Plot Box Plot
Plots

Pie Plot Line Plot

501
Scatter Plot

• A scatter plot (aka scatter chart,


scatter graph) uses dots to
represent values for different
numeric variables.
• The position of each dot on the
horizontal and vertical axis
indicates values for an individual
data point
• Scatter plots are used to observe
relationships between variables

502
Scatter Plot
Applications

• A scatter plot can be useful for identifying other patterns in data


• We can divide data points into groups based on how closely sets of points cluster together
• Scatter plots can also show if there are any unexpected gaps in the data and if there are any
outlier points (For example, look at the two points away from rest of the data in the scatter plot.
Those are outliers)
• This can be useful if we want to segment the data into different parts, like categorizing users into
different groups

503
Line Plot

A line chart is used to represent


data over a continuous time span. It
is generally used to show trend of a
variable over time. Data values are
plotted as points that are connected
using line segments
Using a line chart one can see the
pattern of any dependent variable
over time, like share price, weather
recordings (like temperature,
precipitation or humidity), etc.

504
Histogram

A histogram is a graphical display of data using


bars (rectangles) of different heights.

Parts of Histogram:
Title: The title describes the information
included in the histogram
X-axis: The X-axis are intervals that show the
scale of values which the measurements fall
under. These intervals are also called bins.
Y-axis: The Y-axis shows the number of times
that the values occurred for each interval on
the X-axis
Bars: The height of the bar shows the number
of times that the values occurred within the
interval, while the width of the bar shows the
interval that is covered.

505
Histogram
Applications

• Histograms are a very common type of plots when we are looking at data like height and weight,
stock prices, waiting time for a customer, etc which are continuous in nature.
• Histograms are good for showing general distributional features of dataset variables. You can see
roughly where the peaks of the distribution are, whether the distribution is skewed or symmetric,
and if there are any outliers.

506
Bar plot

Bar charts are one of the most common types of graphs and are used to show data
associated with categorical variables.
Bar graphs are used to compare things between different groups or to trace
changes over time.

507
Types of Bar Plot

Bar plot with multiple quantities


When comparing several
quantities and when changing
one variable, we might want a
bar chart where we have bars of
one color for one quantity value.

Stacked Bar plot


The stacked bar chart stacks
bars that represent different
groups on top of each other. The
height of the resulting bar shows
the combined result of the
groups.

508
Bar Plot vs Histogram

Histograms are a great way to show


results of continuous data, such as:
• Weight
• Height
• How much time, etc.

But when the data is in categories (such


as country or favorite movie), we
should use a bar plot.

509
Pie Plot

A pie chart is a circular statistical


graphic, which is divided into slices to
illustrate numerical proportion.
Parameters of a pie chart:
X: The wedge sizes
Labels: A sequence of strings providing
the labels for each wedge
Colors: A sequence of colors through
which the pie chart will cycle. If None,
will use the colors in the currently
active cycle
A pie chart is best used when trying to
work out the composition of something.
If you have categorical data then using
a pie chart would work well, as each
slice can represent a different category.

510
