Statistics


10. Statistics
Introduction to Statistics
Meaning and Definition of Statistics
Statistics is the practice or science of collecting and analyzing numerical data in large quantities,
especially for the purpose of inferring proportions in a whole from those in a representative sample.
(Reference : Oxford dictionary)
There are two categories in statistics:
1) Descriptive statistics
2) Inferential statistics.
From a GMAT perspective, our focus would be on some of the measures of descriptive statistics.
Descriptive statistics
Descriptive statistics is a summary of certain data, whose purpose is to give an overview.
The most commonly used example of this is the average, like the average marks obtained by a
student in math in a period of 3 years.
Categories of Descriptive statistics
Measures of Central tendency identify a single value that represents the center of a data set.
Measures of Dispersion describe how far the values in a data set are spread out, typically around
the mean value.
Measures of Central Tendency
The most common measures of central tendency are mean, median and mode.
Mean
The Mean, also called the Arithmetic Mean or the Average, of a set of numbers is obtained by
calculating the sum of all elements in the set divided by the number of elements in the set.
Formula for calculating the Mean / Average
Mean = Sum of the elements/Number of elements in the set.
Consider the following example:
Tom scored 88 in English, 97 in Math, 90 in Science and 85 in Social Studies. Calculate his
average marks.
Solution: Average marks obtained by Tom = (88+97+90+85)/4 = 90.
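The formula above can be sketched in a few lines of Python. This is only an illustration; the function name mean and the variable names are assumptions made for this example, not part of the original notes.

```python
def mean(values):
    """Arithmetic mean: sum of the elements divided by the number of elements."""
    if not values:
        raise ValueError("the mean of an empty set is undefined")
    return sum(values) / len(values)

# Tom's marks in English, Math, Science and Social Studies
tom_marks = [88, 97, 90, 85]
tom_average = mean(tom_marks)
print(tom_average)  # 90.0
```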
Median
The Median of a data set is the middle value of the set when the elements are arranged in
ascending or descending order.
When a data set has an odd number of elements, we choose the middle value.
When the data set has an even number of elements, the average or the mean of the two middle
values becomes the median.
Consider the following examples:
1) Find the median of the set {4, 7, 1, 0, 9}.
Solution: Arranging the set in an order = {0, 1, 4, 7, 9}
4 is the median of this set.
2) Find the median of the set {3, 2, 5, 10, 8, 7}
Solution: Arranging the set in an order = {2, 3, 5, 7, 8, 10}
Median = (5+7)/2 = 6 is the median of this set.
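The odd and even cases can be handled by one small Python function. This is a minimal sketch; the name median is chosen for this example only.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average the two middle values

print(median([4, 7, 1, 0, 9]))      # 4
print(median([3, 2, 5, 10, 8, 7]))  # 6.0
```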

Note:
When a set is evenly spaced, which means the difference between consecutive elements of the
ordered set is equal, the median and mean of the set are equal.
This can be verified with the help of an example.
Find the mean and median of the set {4,8,10,6}
Mean = (4+8+10+6)/4 = 28/4 = 7
Arranging the set in an order = {4,6,8,10}
Median = (6+8)/2 = 14/2 = 7
Mode
The Mode of a data set is the most frequently occurring value in the set.
A set may have more than one mode or no mode at all.
Consider the following examples:
1) 3, 4, 7, 3, 1, 2, 3, 9, 13
3 is the mode in this set.
2) 21, 34, 9, 57, 64, 34, 90, 9, 12, 2, 34, 9
This is a bimodal set. 34 and 9 are its modes.
3) 6, 7, 36, 2, 1, 41
This set has no mode.
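The three cases above (one mode, two modes, no mode) can be reproduced with a short Python sketch. The function name modes is an assumption for this example, and it follows the convention used in these notes that a set in which every value occurs once has no mode.

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the highest frequency (empty list if none repeats)."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs exactly once: no mode
    return sorted(value for value, count in counts.items() if count == highest)

print(modes([3, 4, 7, 3, 1, 2, 3, 9, 13]))                  # [3]
print(modes([21, 34, 9, 57, 64, 34, 90, 9, 12, 2, 34, 9]))  # [9, 34]
print(modes([6, 7, 36, 2, 1, 41]))                          # []
```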
Take a look at this example:
The mean of 2,6,9,13,x is 9. Find the median of {22,x,38,11,5,9}.
Solution:
(2+6+9+13+x)/5 = 9
30+x = 5*9
x = 45-30 = 15.
To find the median, substitute x = 15 and arrange the numbers in order:
{5,9,11,15,22,38}
The median is (11+15)/2 = 26/2 = 13
Measures of Dispersion
We will be focusing on two measures of Dispersion: Range and Standard deviation.
Range
This is probably the simplest measure of dispersion. It is obtained by calculating the difference
between the highest and the lowest values of a set.
The range of the data set {3, 4, 10, 14, 8} is 14-3 = 11.
Standard deviation
The Standard deviation of a set is calculated in five steps.
1. Calculate the arithmetic mean of the set.
2. Find the difference between each value and the arithmetic mean.
3. Square each of the differences.
4. Find the average of the squared differences. The value obtained at this step is called the
variance.
5. Find the square root of the variance. This is the standard deviation.

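Consider the following example:
Calculate the standard deviation for the set {3,4,8,10}.
Solution:
Arithmetic Mean of the set = (3+4+8+10)/4 = 25/4 = 6.25
Differences from the mean = -3.25, -2.25, 1.75, 3.75
Squared differences = 10.5625, 5.0625, 3.0625, 14.0625
Average of the squared differences = (10.5625+5.0625+3.0625+14.0625)/4 = 32.75/4 = 8.1875 is the variance
SD = √variance = √8.1875 ≈ 2.861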
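The five steps can be checked with a short Python sketch applied to the set {3, 4, 8, 10}. The function name population_std_dev is an assumption for this example; it divides by the number of elements, as in step 4 (the population form of the standard deviation).

```python
import math

def population_std_dev(values):
    """Standard deviation following the five steps: mean, differences, squares, average, root."""
    n = len(values)
    mean = sum(values) / n                       # step 1: arithmetic mean
    differences = [x - mean for x in values]     # step 2: difference from the mean
    squared = [d ** 2 for d in differences]      # step 3: square each difference
    variance = sum(squared) / n                  # step 4: average of the squares (variance)
    return math.sqrt(variance)                   # step 5: square root of the variance

sd = population_std_dev([3, 4, 8, 10])
print(round(sd, 3))  # 2.861
```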

Statistics and Probability (Part - 1)


STATISTICS
FUNDAMENTAL CHARACTERISTICS OF STATISTICS
Statistics have the following important characteristics:
(i)   Statistics are aggregates of facts and not a single observation.
(ii)  Statistics are expressed quantitatively.
(iii) In an experiment, statistics are related to each other and comparable. They can be classified
into various groups.
(iv) Statistics are collected for a pre-determined purpose.
(v)  In the collection of statistics, a reasonable standard of accuracy must be maintained.
LIMITATIONS OF STATISTICS
Statistics have the following limitations:
(i)   Statistics is not suited to the study of qualitative phenomena like honesty, intelligence, poverty etc.
(ii)  Statistics deals with groups and does not study individuals.
(iii) Laws of statistics are not exact. They are true only on average.
(iv) Data collected for a definite purpose may not be suitable for another purpose.
STATISTICAL DATA
Statistical data are the facts which are collected for the purpose of investigation. There are two
types of statistical data:
(i)   Primary data: The data collected by an investigator for the first time for his own purpose are
called primary data. As primary data are collected by the user of the data, they are more reliable
and relevant.
(ii)  Secondary data: The data collected by another source and used by the investigator for his
purpose are called secondary data. For example, the score of a cricket match noted from newspapers is
secondary data.
Thus data which are primary in the hands of one become secondary in the hands of another.
Data collected from any source can also be divided into the following two types:
(i)   Raw Data: Raw data are those data which are obtained from the original source but not
arranged numerically. These are also called ‘ungrouped data’. For example, the marks of 10 students in
maths are given as:
75, 96, 25, 32, 89, 62, 40, 79, 35, 55
An ‘array’ is an arrangement of raw numerical data in ascending or descending order of
magnitude. The above data can be written as
25, 32, 35, 40, 55, 62, 75, 79, 89, 96
(ii)  Grouped data: An array can be placed systematically in groups or categories. For example, the
above data can be grouped in the following manner.

GROUPS        MARKS             TOTAL NUMBER OF STUDENTS
0 to 20       -                 0
21 to 40      25, 32, 35, 40    4
41 to 60      55                1
61 to 80      62, 75, 79        3
81 to 100     89, 96            2
Total                           10
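The grouping shown in the table can be reproduced programmatically. The following Python sketch is one possible way to do it; the function name group_marks and the variable names are assumptions for this example.

```python
marks = [75, 96, 25, 32, 89, 62, 40, 79, 35, 55]
classes = [(0, 20), (21, 40), (41, 60), (61, 80), (81, 100)]

def group_marks(values, class_limits):
    """Place each value of the array into the class whose limits contain it."""
    table = {limits: [] for limits in class_limits}
    for value in sorted(values):              # sorted(values) is the 'array'
        for lower, upper in class_limits:
            if lower <= value <= upper:
                table[(lower, upper)].append(value)
                break
    return table

grouped = group_marks(marks, classes)
for (lower, upper), members in grouped.items():
    print(f"{lower} to {upper}: {members} -> {len(members)} student(s)")
```

Each printed row lists the class, the marks that fall in it, and their count, matching the columns of the table.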
SOME BASIC DEFINITIONS
(i) Variate: A variate is a quantity that may vary from observation to observation.
(ii) Range: Range is the difference between the maximum and minimum observations.
(iii) Class Interval: When data are divided into groups, each group is called a class interval.
(iv) Class Limit: Every class interval has two limits. The smallest observation of the interval is
called the lower limit and the largest observation of the interval is called the upper limit.
(v) Class Mark: The mid value of any class is called its class mark.
Class Mark = (Upper limit + Lower limit)/2
(vi) Class Size: Class size is defined as the difference between two successive class marks. It is
also the difference between the true upper and true lower limits of any class interval.
(vii) Frequency: The number of observations in a particular class is called its frequency.
The frequency corresponding to a class is called its class frequency.
(viii) Cumulative Frequency: The cumulative frequency of any class is obtained by adding its
frequency to the frequencies of all the classes before it, i.e. it is the sum of all frequencies up to and including that class.
Inclusive and Exclusive distributions:
Inclusive Distribution: When, in a distribution, the upper limit of a class does not coincide with the
lower limit of the next class, the distribution is called an inclusive distribution. e.g.
Inclusive Form
Height (in cm)    No. of Students
150-152           4
153-156           10
157-169           6
170-173           3
Exclusive Distribution: An exclusive distribution is a distribution in which the upper limit of one
class coincides with the lower limit of the next class. e.g.
Exclusive Form
Age (in years)    No. of Students
10-20             10
20-30             8
30-40             15
40-50             4
True Class Limit: In the case of exclusive classes, the upper and lower limits of a class are
themselves its true upper limit and true lower limit.
In the case of inclusive classes, the true lower and upper limits are obtained by subtracting 0.5 from
the lower limit and adding 0.5 to the upper limit (half of the gap of 1 between consecutive classes).
True upper limits and true lower limits are also known as the boundaries of the class.
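The conversion of inclusive class limits to true class limits can be sketched in Python. The function name true_class_limits and the gap parameter are assumptions for this example; gap=1 corresponds to the 0.5 adjustment described above.

```python
def true_class_limits(lower, upper, gap=1):
    """Extend an inclusive class by half the gap between consecutive classes on each side."""
    half_gap = gap / 2
    return lower - half_gap, upper + half_gap

# Inclusive height classes 150-152 and 153-156
print(true_class_limits(150, 152))  # (149.5, 152.5)
print(true_class_limits(153, 156))  # (152.5, 156.5)
```

After the adjustment, the true upper limit of one class coincides with the true lower limit of the next, so the classes are in exclusive form.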
Tally: The tally method is used to keep the chance of error in counting to a minimum. A bar (|), called
a tally mark, is put against an item each time it occurs. The fifth occurrence of an item is represented by
a cross tally drawn diagonally across the first four tallies.
FREQUENCY DISTRIBUTION TABLE
The tabular arrangement of data showing the frequency of each item is called a frequency
distribution table. It is a method to present raw data in the form from which one can easily
understand the information contained in the raw data.
Frequency distributions are of two types:
(i) Discrete frequency distribution: In this type of frequency distribution, in the first column of
the frequency table we write all possible values of the variable from the lowest to the highest, in the
second column we write tally marks and in the third column we show the frequency of each item. In this
method data are not divided into groups or classes.
(ii) Continuous or Grouped Frequency Distribution: In this type of frequency distribution, data are
divided  into groups or classes. This method is used only where the values in the raw data are
largely repeating and the difference between the greatest and the smallest observations is not very
large.
PREPARATION OF A FREQUENCY DISTRIBUTION TABLE:
The following steps are taken to prepare a frequency distribution table:
(i)   First of all we arrange the data in an array.
(ii)  Then draw a table consisting of 3 columns. First column is used for class, the second column
for tally and the third column for frequency.
(iii) Then in the first column we write the classes keeping the lowest and the highest scores in view.
(iv) In second column we put tally marks against each class according to the scores.
(v)  Then we write frequency of each class in the third column after counting the tally.
(vi) Figures in first column and third column taken together represent the frequency table.
CUMULATIVE FREQUENCY TABLE
Cumulative frequency table is obtained from the ordinary frequency table by successively adding
the several frequencies. Thus to form a cumulative frequency table we add a column of cumulative
frequency in the frequency distribution table. It is obvious that the cumulative frequency of the last
class is the sum of the frequencies of all the classes.
Cumulative frequency series are of two types:
(i)   Less than series
(ii)  More than series
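Both cumulative series can be computed from the ordinary frequencies by running sums. The sketch below is illustrative only; the function names are assumptions, and the frequencies are those of the marks classes 0 to 20, 21 to 40, 41 to 60, 61 to 80 and 81 to 100.

```python
from itertools import accumulate

def less_than_series(frequencies):
    """Running total from the first class: frequency up to and including each class."""
    return list(accumulate(frequencies))

def more_than_series(frequencies):
    """Running total from the last class: frequency of each class and all classes after it."""
    return list(accumulate(reversed(frequencies)))[::-1]

frequencies = [0, 4, 1, 3, 2]
print(less_than_series(frequencies))  # [0, 4, 5, 8, 10]
print(more_than_series(frequencies))  # [10, 10, 6, 5, 2]
```

The last entry of the less than series and the first entry of the more than series both equal the total frequency, 10.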
GRAPHICAL REPRESENTATION OF DATA:
Given data can be represented graphically. There are various methods of graphical
representation of a frequency distribution. Here we shall study only four of them:
1. Bar Graphs
2. Histogram
3. Frequency Polygon
4. Cumulative frequency curve or ogive

BAR GRAPH
The frequency distribution of a discrete variable is best represented by a bar graph. The height of each
bar is proportional to the frequency of the corresponding variate-value. In a bar graph the bars must be kept
distinct to show that the variate-values are distinct. The bars are of equal width and are drawn with
equal spacing between them along the x-axis, which depicts the variable. The frequencies are
shown on the y-axis.
HISTOGRAM
Histogram is a graphical representation of a grouped frequency distribution with continuous classes.
It consists of a set of rectangles where heights of rectangles are proportional to their class
frequencies, for equal class intervals. There is no gap between two successive rectangles. The
rectangles are constructed with base as the class size and their heights representing the
frequencies.

Statistics and Probability (Part - 2)


FREQUENCY POLYGON
A frequency polygon is a graph of frequency distribution. It is a line graph of class frequency which
is plotted against class mark. A frequency polygon can be obtained by two methods:
(1) By using Histogram: A frequency polygon can be obtained by joining the mid points of the tops of
the rectangles of a histogram. For this we obtain the mid points of the upper horizontal sides of
each rectangle and then join these mid points by dotted lines to get the frequency polygon. The ends of
the frequency polygon are preferably extended to the mid points of imagined class intervals adjacent to the
first and last class intervals.

(2) Frequency polygon without using Histogram: The following procedure is used to make a
frequency polygon without using a histogram.
(i)  Calculate the class marks x1, x2, ...., xn of the given class intervals.
(ii) Mark the class marks x1, x2, ...., xn along the X-axis and the frequencies f1, f2, ...., fn along the Y-axis.
(iii) Plot the points (x1, f1), (x2, f2), ...., (xn, fn).
(iv) Obtain the mid-points of two imagined class intervals of zero frequency, one before the first
interval and one after the last interval.
(v) Join the points (x1, f1), (x2, f2), ...., (xn, fn) by line segments and complete the frequency
polygon by joining the mid points of the first and last intervals to the mid points of the imagined
classes adjacent to them.
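The points to be plotted in the procedure above can be computed as follows. This Python sketch assumes equal-width classes in exclusive form; the function name frequency_polygon_points is hypothetical, and the data are the exclusive-form classes 10-20, 20-30, 30-40 and 40-50 with frequencies 10, 8, 15 and 4.

```python
def frequency_polygon_points(class_intervals, frequencies):
    """Return (class mark, frequency) points, closed by two imagined zero-frequency classes."""
    lower, upper = class_intervals[0]
    width = upper - lower                                      # assumes all classes share this width
    class_marks = [(lo + hi) / 2 for lo, hi in class_intervals]
    points = list(zip(class_marks, frequencies))
    start = (class_marks[0] - width, 0)                        # mid point of imagined class before the first
    end = (class_marks[-1] + width, 0)                         # mid point of imagined class after the last
    return [start] + points + [end]

intervals = [(10, 20), (20, 30), (30, 40), (40, 50)]
frequencies = [10, 8, 15, 4]
polygon = frequency_polygon_points(intervals, frequencies)
print(polygon)
# [(5.0, 0), (15.0, 10), (25.0, 8), (35.0, 15), (45.0, 4), (55.0, 0)]
```

Joining these points in order by line segments gives the frequency polygon.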
CUMULATIVE FREQUENCY CURVE OR OGIVE
The graphical representation of a cumulative frequency distribution is known as cumulative
frequency curve or an ogive.
An ogive can be constructed by the following two methods:
(1) Less than method: A less than ogive can be constructed by the following steps:
(i) First of all we put the class intervals in exclusive form if they are given in inclusive form.
(ii) Then we construct a less than type cumulative frequency distribution by adding the frequency of
each class to the sum of the frequencies of its prior classes.
(iii) Now we mark upper class limits along the X-axis and cumulative frequencies along the Y-axis.
(iv) We plot the points (upper class limit, corresponding cumulative frequency) and join them by a
free hand curve.
(v) The lower limit of the first class interval becomes the upper limit of the imagined class with
frequency 0. We join the imagined point (lower limit of first class, 0) with the first point of the curve.
In this way we get the required curve called an Ogive by less than type method.
More than Type:
We apply the following steps to construct a more than type ogive:
Step (1) : First of all we put the class intervals in exclusive form if they are given in inclusive form.
Step (2) : Then we construct a more than type cumulative frequency distribution.
Step (3) : Now we mark lower class limits along the x-axis and cumulative frequencies along the y-axis.
Step (4) : We plot the points (lower class limit, corresponding cumulative frequency) and join
them by a free hand curve.
Step (5) : The upper limit of the last class interval becomes the lower limit of the imagined class
interval with frequency 0. We join the imagined point (upper limit of last class, 0) with the last point of
the curve to end the ogive.
In this way we get the required curve called an ogive by more than type method.
MEASURES OF CENTRAL TENDENCY
An average of a distribution is a single expression which represents a group of variables in a simple
and concise manner. It is the representative of entire distribution. Averages are generally in the
central parts of the distribution and therefore they are called Measures of Central Tendency.
An ideal measure of central tendency should have the following properties:
(i) It should be rigidly defined.
(ii) It should be based on all observations.
(iii) It should be easy to calculate and readily comprehensible.
(iv) It should be affected as little as possible by fluctuations of sampling.
(v) It should not be affected much by extreme values.
The following three measures of central tendency are used for analysing data:
(i)   Arithmetic mean
(ii)  Median
(iii) Mode
Arithmetic mean for ungrouped data (A. M.)
The arithmetic mean is the most commonly used measure of central tendency. It is obtained by
dividing the sum of the observations by the number of observations. The A. M. of n observations x1, x2,
x3, ...., xn is given by
A.M. = x̄ = (x1 + x2 + x3 + .... + xn)/n
Properties of Arithmetic Mean
(1) If x̄ is the mean of n observations x1, x2, ...., xn, then the mean of the observations x1 + a, x2 +
a, ...., xn + a is x̄ + a, i.e. if each observation is increased by a, then the mean is also increased by a.
(2) If x̄ is the mean of n observations x1, x2, ...., xn, then the mean of the observations x1 – a, x2 – a, ....,
xn – a is x̄ – a, i.e. if each observation is decreased by a, then the mean is also decreased by a.
(3) If x̄ is the mean of x1, x2, ...., xn, then the mean of ax1, ax2, ...., axn is ax̄, where a is any number different
from zero, i.e. if each observation is multiplied by a non-zero number a, then the mean is also
multiplied by a.
(4) If x̄ is the mean of n observations x1, x2, ...., xn, then the mean of x1/a, x2/a, ...., xn/a is x̄/a, where
a ≠ 0, i.e. if each observation is divided by a non-zero number, then the mean is also divided by it.
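The four properties can be checked numerically. The Python sketch below uses an illustrative data set and an illustrative constant a = 2; both are assumptions chosen only for this example.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

data = [4, 8, 10, 6]
a = 2                     # any non-zero constant works for properties (3) and (4)
x_bar = mean(data)        # 7.0

shifted_up = mean([x + a for x in data])    # property (1): mean increases by a
shifted_down = mean([x - a for x in data])  # property (2): mean decreases by a
scaled = mean([a * x for x in data])        # property (3): mean is multiplied by a
divided = mean([x / a for x in data])       # property (4): mean is divided by a

print(x_bar, shifted_up, shifted_down, scaled, divided)  # 7.0 9.0 5.0 14.0 3.5
```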
Arithmetic mean of Grouped Data:
Let x1, x2, x3, ...., xn be n observations whose frequencies are f1, f2, f3, ...., fn respectively, then the
arithmetic mean of this distribution is given by
x̄ = (f1x1 + f2x2 + f3x3 + .... + fnxn)/(f1 + f2 + f3 + .... + fn)
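In other words, the mean of grouped data is the frequency-weighted average: the sum of each observation multiplied by its frequency, divided by the sum of the frequencies. A minimal Python sketch follows; the function name grouped_mean and the sample values are hypothetical.

```python
def grouped_mean(observations, frequencies):
    """Mean of grouped data: sum(f_i * x_i) / sum(f_i)."""
    if len(observations) != len(frequencies):
        raise ValueError("each observation needs exactly one frequency")
    total_frequency = sum(frequencies)
    weighted_sum = sum(f * x for x, f in zip(observations, frequencies))
    return weighted_sum / total_frequency

# Hypothetical data: the value 1 occurs twice, 2 occurs three times, 3 occurs five times
result = grouped_mean([1, 2, 3], [2, 3, 5])
print(result)  # 2.3
```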

Combined Mean
Let x̄1 and x̄2 be the means of two groups of observations with number of observations n1 and n2
respectively, then the combined mean of the two groups is given by
x̄ = (n1x̄1 + n2x̄2)/(n1 + n2)
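The combined mean weights each group mean by the size of its group, that is, (n1 times the first mean plus n2 times the second mean) divided by (n1 + n2). The Python sketch below illustrates this; the function name combined_mean and the group sizes and means are hypothetical.

```python
def combined_mean(mean1, n1, mean2, n2):
    """Mean of two groups taken together, weighting each group mean by its size."""
    return (n1 * mean1 + n2 * mean2) / (n1 + n2)

# Hypothetical groups: 20 observations with mean 60 and 30 observations with mean 70
overall = combined_mean(60, 20, 70, 30)
print(overall)  # 66.0
```

The result lies closer to 70 than to 60 because the second group is larger.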

Merits of Arithmetic Mean
(i) A. M. is rigidly defined.
(ii) It is very simple. One can easily understand and calculate it.
(iii) It is uniquely defined.
(iv) It is based upon all the observations.
(v)  A. M. is least affected by sampling fluctuations.
(vi) The mean is capable of further mathematical treatment.
(vii) A. M. is relatively reliable.
Demerits of Arithmetic Mean
(i) A. M. cannot be used for qualitative characteristics like richness, beauty, poverty etc.
(ii)  A. M. of given data cannot be determined by inspection, nor can it be located
graphically.
(iii) If any observation is missing then A. M. cannot be calculated.
(iv) A. M. is very much affected by extreme values. In case of extreme items, A. M. gives a distorted
picture of the distribution and no longer remains representative of the distribution.
(v)  If an extreme class is open, e.g. below 10 or above 100, then A. M. cannot be calculated.
(vi) If the data from which the mean was calculated are not given, then A. M. may lead to
wrong conclusions.
(vii) A. M. cannot be used in the study of ratios, rates etc.
Uses of Arithmetic Mean
(i) A. M. is extensively used in practical statistics.
(ii)  Estimates can be obtained using A. M.
(iii) A. M. is used for different purposes by different persons like it is used for calculating average
marks of the students. It is also used by businessmen to find out profit per unit article, output per
machine, average monthly income and expenditure etc.
=  16 – 6
=  10
Hence, f1 = 6 and     f2 = 10
MEDIAN
Median is defined as the value of that item of the arrayed data which divides the whole data into two
equal parts. Hence we have the following definition of median:
The middle item of the arrayed data is called its median.
Calculation of median of raw data:
(i)   If the number of observations ‘n’ is odd, then the median is the value of the ((n + 1)/2)th observation.
(ii) If n is even, then there are two middle terms, i.e. the (n/2)th observation and the (n/2 + 1)th observation.
The median of the given data is the mean of these two middle observations.

Statistics and Probability (Part - 3)


Merits of Median
(i) Median is rigidly defined.
(ii) It can be easily understood and calculated.
(iii) The median is not much affected by extreme values and therefore it is a better representative as
an average of given data.
(iv) The median can be located graphically, while the mean cannot be.
(v)  In some cases, the median can be determined even by inspection.
(vi) The median can be calculated even if the class intervals are unequal.
Demerits of Median
(i) Median is not based on all the observations.
(ii) If the number of observations is even, median cannot be determined exactly.
(iii) If there is fluctuation of sampling then the median would be much affected by it.
(iv) It is not subject to algebraic treatment.
Uses of Median
(1) Since the median is middle term of an arrayed data, therefore it is the only average which is
used while dealing with qualitative data which can be arrayed but cannot be measured
quantitatively.
(2) Median is used for determining the typical value in problems concerning wages, distribution of
wealth etc.
Mode
Mode of a given data is the value of that observation which occurs maximum number of times i.e.
the observation which occurs with the highest frequency.
According to Croxton and Cowden, “The mode of a distribution is the value at the point around
which, the items tend to be most heavily concentrated.”
Mode of ungrouped data
For given ungrouped data, the mode can be located simply by inspection. It is the variate which
has the maximum frequency.
Empirical Formula: If two or more observations occur the same number of times with the highest
frequency, then the mode can be determined by the following formula
Mode = 3 median – 2 mean
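The empirical formula is straightforward to evaluate once the mean and median are known. The Python sketch below uses the standard library statistics module; the function name empirical_mode and the data set are assumptions for this example, and the result is an estimate rather than an observed value.

```python
import statistics

def empirical_mode(values):
    """Estimate the mode with the empirical formula: mode = 3 * median - 2 * mean."""
    return 3 * statistics.median(values) - 2 * statistics.mean(values)

# Hypothetical data in which 2 and 3 both occur twice, so no single mode stands out
data = [1, 2, 2, 3, 3, 4]
estimate = empirical_mode(data)
print(estimate)  # 2.5
```

Here the median and the mean are both 2.5, so the formula gives 3 * 2.5 - 2 * 2.5 = 2.5, a value between the two tied observations.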
Merits of Mode:
(i) Mode can be easily understood and calculated.
(ii) It can be calculated graphically.
(iii) It is not affected by extreme values.
(iv) In some cases, it can be found by inspection also.
(v) It can be used for open ended distribution and qualitative data.
Demerits of Mode:
(i) Mode is not based upon all the observations.
(ii) Mode is ill-defined. It is not always possible to find a clearly defined mode.
(iii) Mode is affected to a greater extent by fluctuations of sampling.
(iv) Mode is not capable of further mathematical treatment. It is often indeterminate.
Uses of Mode
Mode is the average which is used to find the ideal size, e.g. in business forecasting, in
manufacturing of ready-made garments, shoes etc.
PROBABILITY
HISTORY OF PROBABILITY
The theory of probability originated from games of chance related to gambling. An Italian
mathematician, Jerome Cardan (1501–1576), was the first to write a book on the subject, “Book on Games
of Chance”, which was published posthumously in 1663. Notable contributions were also made by the
mathematicians J. Bernoulli, P. Laplace and A. A. Markov. In the twentieth century, the book “Foundation of
Probability”, published by the Russian mathematician Kolmogorov in 1933, was the first book to
introduce probability as a set function.
SOME IMPORTANT OBJECTS
(i) Coin : A coin is a well known object. It has two faces; one is Head and the other is Tail.
(ii) Die : A die is a well balanced solid cube having six faces marked with numbers (dots) from 1 to
6, one number on each face. The plural of die is dice.
(iii) Playing Cards : A pack of playing cards contains 52 cards, of which 26 are red cards and 26 are
black cards. These 52 cards are divided into four groups. Each group is called a suit and has 13 cards.
The names of the suits are:
(i) Diamond (♦)
(ii) Heart (♥)
(iii) Spade (♠)
(iv) Club (♣)
Out of these four suits, Diamond and Heart are red cards and Spade and Club are black cards.
Each suit has 13 cards, which are 1, 2, 3, ...., 10, Jack, Queen and King. The card having 1 is also
called an ace. Jack, Queen and King are known as face cards. Therefore there are 12 face cards in
total in a pack of 52 playing cards.
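The card counts quoted above (52 cards, 26 red, 12 face cards) can be verified by building the pack explicitly. The Python sketch below is illustrative; the names SUITS, RANKS and FACE_RANKS are assumptions for this example.

```python
from itertools import product

SUITS = {"Diamond": "red", "Heart": "red", "Spade": "black", "Club": "black"}
RANKS = ["Ace"] + [str(n) for n in range(2, 11)] + ["Jack", "Queen", "King"]
FACE_RANKS = {"Jack", "Queen", "King"}

# Every card is a (rank, suit) pair: 13 ranks in each of 4 suits
deck = [(rank, suit) for suit, rank in product(SUITS, RANKS)]

red_cards = [card for card in deck if SUITS[card[1]] == "red"]
black_cards = [card for card in deck if SUITS[card[1]] == "black"]
face_cards = [card for card in deck if card[0] in FACE_RANKS]

print(len(deck), len(red_cards), len(black_cards), len(face_cards))  # 52 26 26 12
```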
SOME DEFINITIONS:
Experiment: An activity which ends in some well defined results is called an experiment. These
results are called outcomes. There are two types of experiments:
(i) Deterministic experiment
Those experiments which, when repeated under identical conditions, produce the same results or
outcomes are known as deterministic experiments.
Example: Formation of Methane in laboratory.
(ii) Random Experiment:
An experiment which, when repeated under identical conditions, does not produce the same outcome every
time, but whose outcome is always one of several possible outcomes, is known as a random experiment.
Trial:
Performing an experiment once is called a trial.
Sample Space:
The collection of all the possible outcomes of a random experiment is called a sample space. It is
usually denoted by S.
Example: After tossing a coin, possible outcomes are head and tail so sample space for tossing a
coin consists of head and tail.
Event:
An outcome, or a collection of outcomes, of a trial is known as an event. It is generally denoted by E.
It is of two types:
(i) Simple Event: If an event E contains only one outcome of the sample space then it is known as a
simple event. In this way each outcome of the sample space related to any experiment is a simple
event.
Example: The experiment of throwing a die once consists of 6 simple events, viz. the face
showing up being 1 or 2 or 3 or 4 or 5 or 6.
(ii) Compound Event: If an event contains more than one outcome of the sample space, then it is
known as a compound event.
Example: After throwing a die the outcome is an even number i.e. 2 or 4 or 6.
DIFFERENT APPROACHES TO PROBABILITY
There are the following approaches to the theory of probability:
(1) Empirical Approach
(2) Classical Approach
(3) Axiomatic Approach
Here we study only the Empirical Approach to Probability.
DEFINITION OF PROBABILITY
Let E be any event related to a random experiment whose sample space has n outcomes. If m of
these n outcomes are favorable to the event E, then the probability of occurrence of
event E is
P(E) = m/n = (Number of outcomes favorable to E)/(Total number of possible outcomes)
Since 0 ≤ m ≤ n, we have 0 ≤ P(E) ≤ 1, i.e. the probability of any event lies between 0 and 1.


Note:
(i) Probability of any event cannot be less than 0 and cannot be more than 1. So it can be any
fraction from 0 to 1.
(ii) If p is the probability of occurrence of an event E and q is the probability of non-occurrence of
that event then
p+q=1
q=1–p
(iii) The sum of the probabilities of all the possible outcomes of a trial is 1.
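The notes above can be illustrated with a small Python sketch that computes the probability of an event as the number of favorable outcomes divided by the total number of outcomes. It assumes all outcomes of the sample space are equally likely; the function name probability and the die example are chosen for this illustration.

```python
from fractions import Fraction

def probability(event, sample_space):
    """P(E) = m/n, where m outcomes of the n equally likely outcomes are favorable to E."""
    favorable = len(event & sample_space)          # m: outcomes of E that lie in the sample space
    return Fraction(favorable, len(sample_space))  # exact fraction m/n

die = {1, 2, 3, 4, 5, 6}   # sample space for one throw of a die
even = {2, 4, 6}           # compound event: an even number shows up

p = probability(even, die)  # probability of occurrence
q = 1 - p                   # probability of non-occurrence
print(p, q)                 # 1/2 1/2

# The probabilities of all simple events of the trial add up to 1
total = sum(probability({face}, die) for face in die)
print(total)                # 1
```

Using Fraction keeps the results exact, so checks such as p + q = 1 hold without rounding error.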
IMPOSSIBLE EVENT
If the number of favorable outcomes for an event is zero then the probability of occurrence of that
event will be zero and such type of event is known as Impossible Event.
Sure Event or Certain Event
If the number of favorable outcomes for an event is equal to the total number of possible outcomes
then the probability of occurrence of that event will be one and such type of event is known as a sure
or certain event.
