Stats Week 1 To 9
Descriptive statistics - The part of statistics concerned with the description and
summarization of data is called descriptive statistics.
Inferential Statistics - The part of statistics concerned with the drawing of conclusions from
data is called inferential statistics.
Definition –
Population - The total collection of all the elements that we are interested in is called
a population.
Data - Data are the facts and figures collected, analyzed, and summarized for
presentation and interpretation.
Classification of data –
• Categorical
• Numerical
Categorical data
• Also called qualitative variables.
• Identify group membership.
Numerical data
• Also called quantitative variables.
• Describe numerical properties of cases.
• Have measurement units.
Measurement units: A scale that defines the meaning of numerical data, such as weights measured in kilograms, prices in rupees, heights in centimeters, etc.
• The data that make up a numerical variable in a data table must share a common unit.
Scales of measurement
Nominal scale - When the data for a variable consist of labels or names used to identify the
characteristic of an observation, the scale of measurement is considered a nominal scale. Examples:
Name, Board, Gender, Blood group etc.
Ordinal scale - If the data exhibit the properties of nominal data and the order or rank of the data is meaningful, the
scale of measurement is considered an ordinal scale.
Interval scale - If the data have all the properties of ordinal data and the
interval between values is expressed in terms of a fixed unit of
measure, then the scale of measurement is interval scale.
Ratio Scale - If the data have all the properties of interval data and the
ratio of two values is meaningful, then the scale of measurement is
ratio scale.
Ex –
Construct a frequency table for the given data
1. A,A,B,C,A,D,A,B,D,C
2. A,A,B,C,A,D,A,B,D,C,A,B,C,D,A
3. A,A,B,C,A,A,B,B,D,C,A,B,C,D,B
4. A,A,B,C,A,D,A,B,D,C,A,B,C,D,A,C,D,D
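As a minimal sketch of how such a frequency table could be built, the snippet below counts each category in the first dataset and also reports relative frequencies. The helper name `frequency_table` is hypothetical, and relative frequency is assumed to mean frequency divided by the number of observations.

```python
from collections import Counter

def frequency_table(data):
    """Return {category: (frequency, relative frequency)}, sorted by category."""
    counts = Counter(data)
    n = len(data)
    return {cat: (counts[cat], counts[cat] / n) for cat in sorted(counts)}

# Dataset 1 from the exercise: A,A,B,C,A,D,A,B,D,C
data1 = "A,A,B,C,A,D,A,B,D,C".split(",")
table1 = frequency_table(data1)

for cat, (freq, rel) in table1.items():
    print(f"{cat}  {freq}  {rel:.2f}")
```

The same function can be applied to the other three datasets by changing the input string.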
Pie chart - A pie chart is a circle divided into pieces proportional to the
relative frequencies of the qualitative data.
Bar chart - A bar chart displays the distinct values of the qualitative data on a horizontal axis and the
relative frequencies (or frequencies or percents) of those values on a vertical axis. The
frequency/relative frequency of each distinct value is represented by a vertical bar whose height is
equal to the frequency/relative frequency of that value. The bars should be positioned so that they do
not touch each other.
If the categorical variable is ordinal, then the bar chart must preserve the ordering.
• Another common violation is a bar chart whose baseline is not at zero.
• In the accompanying figure, the left graph exaggerates the number coming from the South and North; the graph on the right shows the
same data with the baseline at zero.
Mode - The mode of a categorical variable is the most common category, the category with the
highest frequency. The mode labels
• The longest bar in a bar chart
• The widest slice in a pie chart.
• In a Pareto chart, the mode is the first category shown.
• Consider the example A,A,B,C,A,D,A,B,C,C,A,B,C,D,A
• The most common category is "A", which appears 6 times out of 15, so the mode is "A".
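The mode of a categorical variable can be found by counting, as in this small sketch. The function name `categorical_modes` is hypothetical; it returns every category tied for the highest frequency, so a dataset with ties yields more than one mode.

```python
from collections import Counter

def categorical_modes(data):
    """Return all categories that share the highest frequency."""
    counts = Counter(data)
    top = max(counts.values())
    return sorted(cat for cat, freq in counts.items() if freq == top)

# The example from the notes.
example = "A,A,B,C,A,D,A,B,C,C,A,B,C,D,A".split(",")
modes = categorical_modes(example)
print(modes, Counter(example))
```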
Statistics Week 3
Types of variables-
1) Categorical
2) Numerical
I) Discrete
II) Continuous
Example
• Suppose the dataset reports the number of people in a household. The following data are the responses from 15 individuals.
• 2,1,3,4,5,2,3,3,3,4,4,1,2,3,4
• The distinct values that the variable, number of people in each household, takes are 1,2,3,4,5.
Descriptive measures –
• The objective is to develop measures that can be used to summarize a data set.
• These descriptive measures are quantities whose values are determined by the data.
Measures of central tendency: These are measures that indicate the most typical value or
center of a data set.
Measures of dispersion: These measures indicate the variability or spread of a dataset.
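The mean - The mean of a data set is the sum of the observations divided by
the number of observations: mean = (x1 + x2 + ... + xn) / n.
For grouped data, the mean is approximated by mean = (f1 m1 + f2 m2 + ... + fk mk) / n.
Here mi = midpoint of the i-th class, fi = frequency of that class, and n = f1 + f2 + ... + fk.
Adding a constant
• Let yi = xi + c where c is a constant then new mean = old mean + c.
Multiplying a constant
• Let yi = c xi where c is a constant then new mean = old mean * c.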
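The following sketch checks numerically how the mean responds to adding or multiplying by a constant, using the household-size data from the earlier example. The constant `c = 10` is an arbitrary value chosen for illustration.

```python
def mean(data):
    """Sum of the observations divided by the number of observations."""
    return sum(data) / len(data)

# Household sizes reported by 15 individuals (example data from the notes).
household = [2, 1, 3, 4, 5, 2, 3, 3, 3, 4, 4, 1, 2, 3, 4]
c = 10  # arbitrary constant, chosen only for illustration

old_mean = mean(household)
shifted_mean = mean([x + c for x in household])  # yi = xi + c
scaled_mean = mean([c * x for x in household])   # yi = c * xi

print(old_mean, shifted_mean, scaled_mean)
```

Under these assumptions the shifted mean equals the old mean plus c, and the scaled mean equals c times the old mean.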
Median - The median of a data set is the middle value in its ordered list.
Steps to obtain median
Arrange the data in increasing order. Let n be the total number of observations in the dataset.
1. If the number of observations is odd, then the median is the observation exactly in the middle
of the ordered list, i.e. the ((n + 1)/2)-th observation.
2. If the number of observations is even, then the median is the mean of the two middle
observations in the ordered list, i.e. the mean of the (n/2)-th and (n/2 + 1)-th observations.
Adding a constant
• Let yi = xi + c where c is a constant then new median = old median + c.
Multiplying a constant
• Let yi = c xi where c is a constant then new median = old median * c.
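The steps for the median translate fairly directly into code. The sketch below follows the odd/even rule stated above (positions are 1-based in the notes, 0-based in Python) and then checks the two constant rules on the household-size data; `c = 10` is an arbitrary illustrative constant.

```python
def median(data):
    """Median via the odd/even rule on the ordered list."""
    ordered = sorted(data)  # arrange the data in increasing order
    n = len(ordered)
    if n % 2 == 1:
        # odd n: the ((n + 1)/2)-th observation (1-based)
        return ordered[(n + 1) // 2 - 1]
    # even n: mean of the (n/2)-th and (n/2 + 1)-th observations (1-based)
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

household = [2, 1, 3, 4, 5, 2, 3, 3, 3, 4, 4, 1, 2, 3, 4]
c = 10  # arbitrary constant for illustration

old_median = median(household)
shifted_median = median([x + c for x in household])  # yi = xi + c
scaled_median = median([c * x for x in household])   # yi = c * xi
even_case = median([2, 1, 3, 4])                     # n = 4 is even

print(old_median, shifted_median, scaled_median, even_case)
```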
Mode - The mode of a data set is its most frequently occurring value.
Adding a constant
• Let yi = xi + c where c is a constant then new mode = old mode + c
Multiplying a constant
• Let yi = c xi where c is a constant then new mode = old mode * c.
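A quick numerical check of the mode and its behaviour under the two transformations, sketched with the standard-library function `statistics.multimode` (Python 3.8 or later), which returns every value tied for the highest frequency. The data are the household sizes from the earlier example and `c = 10` is an arbitrary constant.

```python
from statistics import multimode

household = [2, 1, 3, 4, 5, 2, 3, 3, 3, 4, 4, 1, 2, 3, 4]
c = 10  # arbitrary constant for illustration

old_modes = multimode(household)
shifted_modes = multimode([x + c for x in household])  # yi = xi + c
scaled_modes = multimode([c * x for x in household])   # yi = c * xi

print(old_modes, shifted_modes, scaled_modes)
```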
Measures of dispersion
• To describe differences in spread quantitatively, we use a descriptive
measure that indicates the amount of variation, or spread, in a data set.
The range of a data set is given by the formula Range = Max - Min where Max and
Min denote the maximum and minimum observations, respectively.
o Though the two datasets differ only in one datapoint, we can see that this contributes to the
value of Range significantly. This happens because the range takes into consideration only the
Min and Max of the dataset.
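To illustrate how sensitive the range is to a single observation, here is a small sketch. The two datasets `a` and `b` are hypothetical (they are not the ones from the original example) and differ in only one datapoint.

```python
def data_range(data):
    """Range = Max - Min."""
    return max(data) - min(data)

# Hypothetical datasets that differ only in the last observation.
a = [3, 3, 4, 5, 5]
b = [3, 3, 4, 5, 50]

range_a = data_range(a)
range_b = data_range(b)
print(range_a, range_b)
```

Changing one value moves the range from 2 to 47 here, because only the minimum and maximum enter the formula.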
Variance –
o In contrast to the Range, the variance takes into account all the observations.
o One way of measuring the variability of a data set is to consider the deviations of the data
values from a central value.
Adding a constant
• Let yi = xi + c where c is a constant then new variance = old variance.
Multiplying a constant
• Let yi = c xi where c is a constant then new variance = c^2 * old variance.
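The notes do not reproduce the variance formula, so the sketch below assumes the usual sample variance: the sum of squared deviations from the mean divided by n - 1. The data `x` and the constants are hypothetical values chosen so the arithmetic is easy to follow.

```python
def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1 (assumed definition)."""
    n = len(data)
    m = sum(data) / n
    return sum((v - m) ** 2 for v in data) / (n - 1)

x = [2, 4, 4, 6, 9]  # hypothetical data; mean is 5
c = 3                # arbitrary constant for illustration

old_var = sample_variance(x)
shifted_var = sample_variance([v + c for v in x])  # yi = xi + c
scaled_var = sample_variance([c * v for v in x])   # yi = c * xi

print(old_var, shifted_var, scaled_var)
```

With this definition, shifting leaves the variance unchanged, while scaling multiplies it by c squared.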
Standard deviation – The quantity which is the square root of the sample variance is the sample
standard deviation.
Adding a constant
• Let yi = xi + c where c is a constant then new standard deviation = old standard deviation.
Multiplying a constant
• Let yi = c xi where c is a constant then new variance = c^2 * old variance, so
new standard deviation = |c| * old standard deviation.
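Since the standard deviation is the square root of the variance, a variance scaled by c squared corresponds to a standard deviation scaled by |c|. The sketch below checks this with `statistics.stdev` (the sample standard deviation, with n - 1 in the denominator); the data and the negative constant are hypothetical.

```python
from statistics import stdev

x = [2, 4, 4, 6, 9]  # hypothetical data; sample variance is 7
c = -3               # a negative constant, to show that |c| is what matters

old_sd = stdev(x)
shifted_sd = stdev([v + 10 for v in x])  # adding a constant
scaled_sd = stdev([c * v for v in x])    # multiplying by a constant

print(old_sd, shifted_sd, scaled_sd)
```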
Percentiles –
• The sample 100p percentile is that data value having the property that at least 100p percent of
the data are less than or equal to it and at least 100(1 - p) percent of the data values are
greater than or equal to it.
• If two data values satisfy this condition, then the sample 100p percentile is the arithmetic
average of these values.
• Median is the 50th percentile.
Computing Percentile
To find the sample 100p percentile of a data set of size n
1. Arrange the data in increasing order.
2. If np is not an integer, determine the smallest integer greater than np. The data value in that
position is the sample 100p percentile.
3. If np is an integer, then the average of the values in positions np and np + 1 is the sample 100p
percentile.
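The three steps above can be sketched as follows. To avoid floating-point trouble when testing whether np is an integer, this version takes the percentile as a whole number `percent` (so p = percent / 100) and assumes 0 < percent < 100; the function name `sample_percentile` is hypothetical.

```python
def sample_percentile(data, percent):
    """Sample 100p percentile, with p = percent / 100 and 0 < percent < 100."""
    ordered = sorted(data)                   # step 1: increasing order
    n = len(ordered)
    whole, rest = divmod(n * percent, 100)   # np = whole + rest / 100
    if rest != 0:
        # step 2: np is not an integer; take position ceil(np) = whole + 1 (1-based)
        return ordered[whole]
    # step 3: np is an integer; average positions np and np + 1 (1-based)
    return (ordered[whole - 1] + ordered[whole]) / 2

household = [2, 1, 3, 4, 5, 2, 3, 3, 3, 4, 4, 1, 2, 3, 4]

p25 = sample_percentile(household, 25)           # np = 3.75, so position 4
p50 = sample_percentile(household, 50)           # np = 7.5, so position 8
p50_even = sample_percentile([2, 1, 3, 4], 50)   # np = 2, average positions 2 and 3

print(p25, p50, p50_even)
```

Note that library routines for quantiles often use other interpolation rules, so their results may differ slightly from this definition.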
Quartiles
Definition
The sample 25th percentile is called the first quartile. The sample 50th percentile is called the
median or the second quartile. The sample 75th percentile is called the third quartile. In other
words, the quartiles break up a data set into four parts with about 25 percent of the data values
being less than the first (lower) quartile, about 25 percent being between the first and second
quartiles, about 25 percent being between the second and third (upper) quartiles, and about 25
percent being larger than the third quartile.
• Minimum
• Maximum
The Interquartile Range (IQR) - The interquartile range, IQR, is the difference between
the first and third quartiles; that is,
IQR = Q3 - Q1
• IQR = Q3 - Q1 = 18.25
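As a rough sketch, the quartiles and the IQR can be computed with the percentile rule given earlier (position rounded up when np is not an integer, average of two positions when it is). The helper `quartile` is hypothetical and handles only k = 1, 2, 3; the data are the household sizes from the earlier example, not the dataset that produced the value 18.25.

```python
def quartile(data, k):
    """k-th quartile (k = 1, 2, 3) as the sample 25k-th percentile."""
    ordered = sorted(data)
    n = len(ordered)
    whole, rest = divmod(n * k, 4)  # np with p = k / 4
    if rest != 0:
        return ordered[whole]       # position ceil(np), 1-based
    return (ordered[whole - 1] + ordered[whole]) / 2

household = [2, 1, 3, 4, 5, 2, 3, 3, 3, 4, 4, 1, 2, 3, 4]

q1 = quartile(household, 1)
q3 = quartile(household, 3)
iqr = q3 - q1

print(min(household), q1, quartile(household, 2), q3, max(household), iqr)
```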
Contingency table –
Column relative frequency: Divide each cell frequency in a column by its column total.
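A minimal sketch of column relative frequencies for a two-way table. The table below is entirely hypothetical (row labels, column labels, and counts are made up for illustration); each cell is divided by the total of its column, so every column sums to 1.

```python
# Hypothetical contingency table: table[row][column] = cell frequency.
table = {
    "Yes": {"Male": 10, "Female": 30},
    "No":  {"Male": 30, "Female": 30},
}

columns = ["Male", "Female"]

# Column totals: sum each column over all rows.
col_totals = {col: sum(row[col] for row in table.values()) for col in columns}

# Column relative frequency: cell frequency / column total.
col_rel = {
    r: {col: table[r][col] / col_totals[col] for col in columns}
    for r in table
}

for r, cells in col_rel.items():
    print(r, cells)
```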
• To decide which variable to put on the x-axis and which to put on the y-axis, display the
variable you would like to explain along the y-axis (referred to as the response variable) and the
variable which explains it along the x-axis (referred to as the explanatory variable).
Scatter plot -
Describing association
When describing association between variables in a scatter plot, there are four key questions that
need to be answered
i) Up
ii) Down
iii) Variable
Covariance
Covariance quantifies the strength of the linear association between two numerical variables.
Key observation
• When large (small) values of x tend to be associated with large (small) values of y, the signs of the
deviations (xi - x̄) and (yi - ȳ) will tend to be the same.
• When large (small) values of x tend to be associated with small (large) values of y, the signs of the
deviations (xi - x̄) and (yi - ȳ) will tend to be different.
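Definition
Let xi denote the i-th observation of variable x, and yi denote the i-th observation of variable y. Let (xi, yi)
be the i-th paired observation of a population (sample) dataset having N (n) observations. The covariance
between the variables x and y is given by
Population covariance: Cov(x, y) = Σ (xi - x̄)(yi - ȳ) / N
Sample covariance: Cov(x, y) = Σ (xi - x̄)(yi - ȳ) / (n - 1)
where x̄ and ȳ are the means of x and y, and the sum runs over all paired observations.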
Covariance: Example 1
Units of Covariance
• The size of the covariance, however, is difficult to interpret because the covariance has units.
• The units of the covariance are those of the x-variable times those of the y-variable.
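The sketch below computes covariance as the sum of products of deviations, divided by n - 1 for a sample or N for a population; the n - 1 denominator is the usual convention and is assumed here. The paired data are hypothetical. Rescaling x by 100 (for example, metres to centimetres) rescales the covariance by the same factor, which is why its size is hard to interpret.

```python
def covariance(x, y, sample=True):
    """Sum of (xi - mean_x)(yi - mean_y), divided by n - 1 (sample) or N (population)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    total = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return total / (n - 1 if sample else n)

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_sample = covariance(x, y)                  # deviations mostly share signs
cov_population = covariance(x, y, sample=False)
cov_reversed = covariance(x, y[::-1])          # large x now paired with small y
cov_rescaled = covariance([100 * v for v in x], y)  # x in different units

print(cov_sample, cov_population, cov_reversed, cov_rescaled)
```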
Correlation
• A more easily interpreted measure of linear association between two numerical variables is
correlation.
• It is derived from covariance.
• To find the correlation between two numerical variables x and y, divide the covariance between
x and y by the product of the standard deviations of x and y. The Pearson correlation
coefficient, r, between x and y is given by
r = Cov(x, y) / (sx sy)
where sx and sy are the standard deviations of x and y.
Remark
The units of the standard deviations cancel out the units of covariance.
Remark
It can be shown that the correlation measure always lies between -1 and +1.
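Both remarks can be checked numerically. The sketch below computes r as the sample covariance divided by the product of the sample standard deviations (n - 1 denominators are assumed throughout), using hypothetical paired data. Rescaling x by 100 leaves r unchanged, which reflects the cancellation of units.

```python
import math

def pearson_r(x, y):
    """Sample covariance divided by the product of the sample standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))
    return cov / (sx * sy)

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_rescaled = pearson_r([100 * v for v in x], y)  # same data, x in different units

print(r, r_rescaled)
```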
Correlation: Example 1