CL202: Introduction To Data Analysis: Mani Bhushan, Sachin Patwardhan

CL202: Introduction to Data Analysis

Mani Bhushan, Sachin Patwardhan

Department of Chemical Engineering,


Indian Institute of Technology Bombay
Mumbai, India - 400076
mbhushan,sachinp@iitb.ac.in

Acknowledgements: Santosh Noronha (most of the material taken directly from his notes)

Spring 2015

Mani Bhushan, Sachin Patwardhan (IIT Bombay) CL202 Spring 2015 1 / 17


Introduction to Statistics

Statistics involves
collection of data,
description of data,
analysis of data,
identification of logical conclusions.
Think of these in the context of the following examples:
Census data.
Market survey.
Reactor yield data.
Identification of genetic defects leading to cancer.
Chemistry lab experiment.



Probability and Statistics

Probability: “Pure Math”, is axiomatic in approach.
  - For a given process (a chemical reactor), model the uncertainty by random
    variables (yield of the reactor), and then predict what will happen (probability
    of yield greater than 90%).
Statistics: “Applied Math”, deals with observed data.
  - Given observed data (yield from a few batches), infer about the process
    generating the data (is the catalyst degraded?).
  - Can loosely be thought of as the inverse of probability.
  - Data analysis is integral to statistics.
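The two directions can be sketched in a few lines of Python. This is only an illustration: the normal model for the yield, its mean of 88% and standard deviation of 2%, and the five batch yields are all assumed values, not data from the course.

```python
from statistics import NormalDist

# Probability direction: assume a model, then predict what will happen.
# Hypothetical model: yield (%) ~ Normal(mean=88, sd=2).
assumed_model = NormalDist(mu=88.0, sigma=2.0)
p_yield_above_90 = 1.0 - assumed_model.cdf(90.0)

# Statistics direction: observe data, then infer the process behind it.
# Hypothetical yields (%) from five batches.
observed_yields = [87.1, 89.4, 88.0, 90.2, 86.5]
fitted_model = NormalDist.from_samples(observed_yields)

print(f"P(yield > 90%) under the assumed model: {p_yield_above_90:.4f}")
print(f"Mean yield estimated from data: {fitted_model.mean:.2f}")
print(f"Standard deviation estimated from data: {fitted_model.stdev:.2f}")
```

The first half starts from known parameters and produces a probability; the second half starts from data and produces parameter estimates, which is the sense in which statistics is loosely the inverse of probability.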



Why learn statistics?

There must be a hypothesis which needs analysis!

Statistics without a relevant question to be answered can be pointless.


“There are lies, damned lies, and statistics.” [Benjamin Disraeli]
You will learn a variety of techniques in this course, but the hypothesis and
the subsequent use of the conclusions are NOT within the statistician’s domain!



A Reactor Yield Example

Yield data (in %) with two catalysts (Ogunnaike, 2009):

   YA       YB
  74.04    75.75
  75.29    68.41
  75.62    74.19
  75.91    68.10
  77.21    68.10
  75.07    69.23
  74.23    70.14
  74.92    69.22
  76.57    74.17
  77.77    70.23

Intrinsic variability or randomness in data.


Question: Is YA > YB ?
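A first look at the question is to summarize each column. The sketch below uses the yield values exactly as listed in the table; the variable names are illustrative. It computes only descriptive measures and does not by itself settle whether the difference could have arisen by chance.

```python
from statistics import mean, stdev

# Yield data (%) for the two catalysts, as listed in the table.
yield_a = [74.04, 75.29, 75.62, 75.91, 77.21, 75.07, 74.23, 74.92, 76.57, 77.77]
yield_b = [75.75, 68.41, 74.19, 68.10, 68.10, 69.23, 70.14, 69.22, 74.17, 70.23]

mean_a, mean_b = mean(yield_a), mean(yield_b)
sd_a, sd_b = stdev(yield_a), stdev(yield_b)  # sample standard deviations

print(f"YA: mean = {mean_a:.3f}, sample sd = {sd_a:.3f}")
print(f"YB: mean = {mean_b:.3f}, sample sd = {sd_b:.3f}")
print(f"Difference in means (YA - YB) = {mean_a - mean_b:.3f}")
```

The sample mean of YA exceeds that of YB, but the batch-to-batch scatter in each column is why the comparison needs a statistical argument rather than a glance at the averages.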



Descriptive and Inferential statistics

Descriptive statistics: Summarize the collected data, in the form of various
measures, to allow subsequent analysis.
Inferential statistics: The subsequent analysis and drawing of conclusions.
  - We need to worry about the data being observed by chance.
      - Think of a coin toss experiment.
      - Think of your chemistry experiment.
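The coin toss makes the "observed by chance" concern concrete. As a minimal sketch, the function below computes how often a fair coin would give at least a given number of heads; the choice of 8 heads in 10 tosses is an illustrative assumption, and `prob_at_least` is a hypothetical helper name.

```python
from math import comb


def prob_at_least(heads: int, tosses: int) -> float:
    """Probability of at least `heads` heads in `tosses` tosses of a fair coin."""
    favourable = sum(comb(tosses, k) for k in range(heads, tosses + 1))
    return favourable / 2 ** tosses


# How surprising is seeing 8 or more heads in 10 tosses if the coin is fair?
p = prob_at_least(8, 10)
print(f"P(at least 8 heads in 10 tosses) = {p:.4f}")
```

A fair coin produces such a lopsided result about 5% of the time, so a single run of 10 tosses is weak evidence that the coin is biased.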



Some Historical Aspects

Statistics = collection of facts relevant to the governance and economy of
the city states of Venice and Florence, in Italy (∼1400s).
Churches in Europe started maintaining birth, marriage, and death registries.
Plagues were common. In England, identifying spurts in the death rate was
important towards saving the King!
The causes and places of death started to be stored.



John Graunt, ∼1662

The number of deaths in England:


Year    Burials    Plague Deaths
1592    25,886     11,503
1593    17,844     10,662
1603    37,294     30,561
1625    51,758     35,417
1636    23,359     10,400
There were ∼13,200 deaths in London in 1660.
That year, in one London neighborhood, there were ∼3 deaths for every 88
people.
The London population, he reasoned, was
\[ 13{,}200 \times \frac{88}{3} = 387{,}200 \]
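Graunt's reasoning is a simple scaling from a sample to the whole city, and can be checked directly. The variable names below are illustrative; the numbers are the ones quoted above.

```python
# Figures quoted for London in 1660.
deaths_in_london = 13_200   # total deaths in the city that year
people_per_sample = 88      # people in the sampled neighborhood ...
deaths_per_sample = 3       # ... for every 3 deaths among them

# Scale the city-wide death count by the people-per-death ratio of the sample.
estimated_population = deaths_in_london * people_per_sample / deaths_per_sample
print(f"Estimated London population: {estimated_population:,.0f}")
```

The estimate rests on the assumption that the death rate in that one neighborhood is representative of London as a whole.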



Graunt and Mortality Tables

Graunt next inferred ages at death


Age at Death      Deaths per 100 Births
0-6               36
6-16              24
16-26             15
26-36             9
36-46             6
46-56             4
56-66             3
66-76             2
76 and greater    1



Graunt and Mortality Tables

This was useful to businessmen selling annuities!


Invest a lump sum, receive annuity once a year, profit if the insured does not
live long!
Edmund Halley next computed the odds of a person of any age living to any
other age.
Useful for computing life insurance premiums!
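The idea of computing the odds of living from one age to another can be illustrated with Graunt's table of deaths per 100 births. This is a sketch of the idea only, not a reproduction of Halley's own calculation; `survival_probability` is a hypothetical helper, and it works only for ages that start an interval in the table.

```python
# Graunt's table: (start age of interval, deaths per 100 births in that interval).
# The last interval (76 and greater) is open-ended.
deaths_per_100 = [(0, 36), (6, 24), (16, 15), (26, 9), (36, 6),
                  (46, 4), (56, 3), (66, 2), (76, 1)]

# Number still alive (out of 100 births) at the start of each interval.
survivors = {}
alive = 100
for start_age, deaths in deaths_per_100:
    survivors[start_age] = alive
    alive -= deaths


def survival_probability(from_age: int, to_age: int) -> float:
    """Chance that someone alive at `from_age` is still alive at `to_age`."""
    return survivors[to_age] / survivors[from_age]


print(f"Survivors per 100 births at age 16: {survivors[16]}")
print(f"P(alive at 36 | alive at 16) = {survival_probability(16, 36):.2f}")
```

An annuity seller or insurer would combine such conditional survival probabilities with the payment schedule to price a contract.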



Statistics in the 1800s

Probability theory had been developed by Bernoulli, Gauss, and Laplace. The
connection to statistics had not been made!
Francis Galton, a cousin of Darwin, worked on the inheritance of intelligence.
Regression and correlation are due to him.
Florence Nightingale used a pie chart to represent data on the Crimean war!



Statistics in the 1900s

Karl Pearson next developed several test procedures.

W. S. Gosset, a chemist at a brewery, trained at Pearson’s Lab.
He published under the pseudonym “Student”: a Student of statistics!



Statistics in the 1900s

Ronald Fisher (and Pearson) investigated applications in population biology
and agriculture.

Analysis of variance
Design of experiments
Maximum likelihood
Discriminant analysis and classification theory



Statistics in India

Prasanta Chandra Mahalanobis (1893-1972) founded the Indian Statistical
Institute in Calcutta; he is known for the Mahalanobis distance, among other
contributions.

C. R. Rao (1920-) provided the famous Cramér-Rao bound, which
characterizes the best accuracy that an estimated parameter can have given
sampled data.



Next class:
Describing and visualizing data



THANK YOU
