
Statistics helps solve problems using numerical evidence

Statistical methods

1. Classification - segment into appropriate groups based on key characteristics

Eg. Long-term customers, short-term customers, brand switchers

2. Pattern recognition
Histogram, scatter plot
Eg. A scatter plot shows the relationship between two variables
Box plot - finding outliers
3. Association analysis - finding related items
4. Predictive models
Regression: y = a + bx (the output is a number)
a) Logistic regression - Why
b) Neural networks - multiple outputs
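The regression line y = a + bx can be fitted by ordinary least squares. The following is a minimal sketch, assuming made-up data; the function name `fit_line` and the sample values are illustrative and not from the notes.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through (x_bar, y_bar).
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
a, b = fit_line(xs, ys)
print(f"y = {a} + {b}x")      # y = 1.0 + 2.0x
print("prediction at x=6:", a + b * 6)
```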

Classical definition and types of statistics

Types

1. Descriptive statistics

Data summarization, graphs, charts, tables

2. Inferential statistics
Draws conclusions about the population from a sample

Vital Terms in Statistics

Population - all possible data

Parameter - numerical value associated with the population

Sample - selection of observations from the population

Statistic - numerical value associated with a sample
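The difference between a parameter and a statistic can be seen by computing the same measure on a whole population and on a sample drawn from it. This is a small sketch, assuming a toy population of the integers 1 to 1000 and a fixed random seed; both are illustrative choices.

```python
import random
from statistics import mean

# Toy population (assumption for illustration): the integers 1..1000.
population = list(range(1, 1001))
parameter = mean(population)          # population mean = 500.5

# Draw a sample of 50 observations without replacement.
rng = random.Random(0)                # fixed seed so the run is repeatable
sample = rng.sample(population, 50)
statistic = mean(sample)              # sample mean, an estimate of the parameter

print("parameter (population mean):", parameter)
print("statistic (sample mean):   ", statistic)
```

The sample mean changes from sample to sample, while the population mean is fixed.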

Data now -> data in the future

In-sample (training data) -> out-of-sample (test data)

If I know the patterns in the population, I can know the patterns in future data

But how do we find the patterns in the population? Through sampling and sample statistics, which estimate the population parameters
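The in-sample / out-of-sample idea is usually put into practice by holding back part of the data. Below is a minimal sketch, assuming an 80/20 split and a fixed seed; the ratio and the toy data are assumptions, not something the notes specify.

```python
import random

# Toy data set (assumption): ten observations.
data = list(range(1, 11))

rng = random.Random(42)       # fixed seed so the split is repeatable
shuffled = data[:]            # copy, so the original order is kept
rng.shuffle(shuffled)

cut = int(0.8 * len(shuffled))
train = shuffled[:cut]        # in-sample: used to find the pattern
test = shuffled[cut:]         # out-of-sample: stands in for future data

print("train:", train)
print("test: ", test)
```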

Data sources

Primary - data collected by the organisation itself

Secondary - data collected from other sources

Types of data

1. Qualitative data -> non-numeric - gender, age group, place

2. Quantitative data -> numeric

Discrete - no continuity between values (countable)

Continuous - can take any value within a range

Data objects

Rows -> data objects
Columns -> attributes

Attribute -> data field

Nominal -> names, colors, places, zip codes, IDs, etc.

Binary -> nominal with only two values (0 or 1)
Symmetric binary - both outcomes are equally important, e.g. gender
Asymmetric binary - outcomes are not equally important (e.g. a medical test result, positive or negative)
Ordinal -> qualitative with ranking

Numeric - quantitative
Interval - measured on a scale of equal-sized units - no true zero point
Ratio - true zero point

DATA AND HISTOGRAM

Raw data is not informative

Data->information->knowledge->wisdom

Data->information( Descriptive)

Frequency distribution -> summarises the data by arranging it into classes and their frequencies

X axis -> data values
Y axis -> how many observations (frequency)
Data is split into bins (each bin sets the width of a bar)

Cumulative distribution function: how many observations are <= some value
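A frequency distribution and its cumulative counts can be built by hand. This is a minimal sketch, assuming made-up integer data and a bin width of 5; both are illustrative.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical raw data and bin width.
data = [2, 3, 3, 5, 7, 8, 8, 8, 9, 12]
bin_width = 5


def bin_start(x):
    """Lower edge of the bin that x falls into: 0, 5, 10, ..."""
    return (x // bin_width) * bin_width


freq = Counter(bin_start(x) for x in data)
bins = sorted(freq)
counts = [freq[b] for b in bins]          # frequency per bin
cumulative = list(accumulate(counts))     # observations up to each bin's upper edge

# Text histogram: one row per bin, one '#' per observation.
for b, c, cum in zip(bins, counts, cumulative):
    print(f"{b:>2}-{b + bin_width - 1:<2} {'#' * c:<8} cumulative={cum}")
```

The last cumulative count always equals the number of observations.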

Central tendency - clustering of the data around a central value

Arithmetic mean: $\bar{x} = \frac{\sum x}{n}$

Cons: affected by outliers

Median:

Sort the data in ascending order and take the middle value (the 50% point)
Odd n -> median = the ((n+1)/2)-th value
Cons: even n -> median = average of the two middle values
Pros: not affected by outliers

Mode:
Value with the maximum frequency
Cons: several values can share the maximum frequency (no unique mode)
Pros: not affected by outliers
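The effect of an outlier on the three measures can be checked with Python's `statistics` module. This is a small sketch with made-up numbers; the value 100 is an artificial outlier added for illustration.

```python
from statistics import mean, median, multimode

data = [2, 3, 3, 4, 5]
with_outlier = data + [100]   # same data plus one extreme value

for label, values in [("original", data), ("with outlier", with_outlier)]:
    print(label)
    print("  mean  :", mean(values))       # pulled strongly toward the outlier
    print("  median:", median(values))     # barely moves
    print("  mode  :", multimode(values))  # list of all most-frequent values

# multimode returns every value tied for the maximum frequency,
# which shows the case where the mode is not unique.
print(multimode([1, 1, 2, 2, 3]))          # [1, 2]
```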

MEASURE OF DISPERSION

Once you know the central tendency, the next step is to find how the data is spread around it

1. Range of the distribution = Xmax - Xmin

No range (range = 0) when Xmax = Xmin
What if Xmax is an extreme value? Then use the IQR

IQR
Interquartile range -> remove the top and bottom 25%, consider only the middle 50%

IQR = Q3 - Q1
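The range and the IQR react very differently to an extreme maximum. The sketch below assumes made-up data and uses `statistics.quantiles` with `method="inclusive"`; quartile conventions differ between textbooks and tools, so other methods can give slightly different Q1 and Q3.

```python
from statistics import quantiles


def value_range(values):
    """Range = Xmax - Xmin."""
    return max(values) - min(values)


def iqr(values):
    """IQR = Q3 - Q1, using the 'inclusive' quartile convention."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


data = [1, 2, 3, 4, 5, 6, 7]
extreme = [1, 2, 3, 4, 5, 6, 70]   # same data, but Xmax is an extreme value

print("range:", value_range(data), "->", value_range(extreme))  # 6 -> 69
print("IQR  :", iqr(data), "->", iqr(extreme))                  # 3.0 -> 3.0
```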

Standard deviation

Typical deviation of the data from the mean

Variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

SD: $s = \sqrt{s^2}$

Coefficient of variation
$CV = \frac{s}{\bar{x}}$ gives the variation as a proportion of the mean
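The sample variance, standard deviation and coefficient of variation can be computed step by step and compared with the `statistics` module. This is a minimal sketch with made-up data.

```python
from math import sqrt
from statistics import mean, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
x_bar = mean(data)                                    # 5

# Sample variance: squared deviations from the mean, divided by n - 1.
var_manual = sum((x - x_bar) ** 2 for x in data) / (n - 1)
sd_manual = sqrt(var_manual)

# Coefficient of variation: SD as a proportion of the mean.
cv = sd_manual / x_bar

print("variance:", var_manual, "(library:", variance(data), ")")
print("SD      :", sd_manual, "(library:", stdev(data), ")")
print("CV      :", cv)
```

Because the CV has no units, it can be used to compare the spread of data sets measured on different scales.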

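Empirical rule and Chebyshev's rule

Empirical rule (bell-shaped, approximately normal data):
68% -> within ±1 SD of the mean
95% -> within ±2 SD
99.7% -> within ±3 SD

Chebyshev's rule (any distribution): at least (1 - 1/k^2)*100 % of the data lies within k standard deviations of the mean, for k > 1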
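The (1 - 1/k^2) bound holds for any data set, so it can be checked directly. The sketch below assumes made-up data and uses the population standard deviation (`pstdev`); the function names are illustrative.

```python
from statistics import mean, pstdev


def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations (k > 1)."""
    return 1 - 1 / k ** 2


def fraction_within(values, k):
    """Observed proportion of values within k SDs of the mean."""
    m, s = mean(values), pstdev(values)
    inside = [x for x in values if abs(x - m) <= k * s]
    return len(inside) / len(values)


data = [2, 4, 4, 4, 5, 5, 7, 9]
for k in (2, 3):
    print(f"k={k}: bound={chebyshev_bound(k):.4f}, "
          f"observed={fraction_within(data, k):.4f}")
```

The observed proportion is never below the bound, and for bell-shaped data it is usually well above it, which is why the 68/95/99.7 figures are larger than the guaranteed 0%, 75% and 88.9%.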


Five-number summary: describes the center, spread, and shape

Xmin
Q1
Q2(Median)
Q3
Xmax
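The five values can be collected in one helper. This is a sketch, assuming the "inclusive" quartile method of `statistics.quantiles`; other quartile conventions may give slightly different Q1 and Q3. The function name and data are illustrative.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (Xmin, Q1, Q2, Q3, Xmax)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


data = [7, 1, 4, 2, 6, 3, 5]   # order does not matter; quantiles() sorts internally
summary = five_number_summary(data)
print(summary)                  # (1, 2.5, 4.0, 5.5, 7)
```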

Distribution

Left skewed

Symmetric

Right skewed

Outliers: values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
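The 1.5*IQR multiplier is the usual box plot convention for flagging outliers: anything below Q1 - 1.5*IQR or above Q3 + 1.5*IQR. Below is a minimal sketch, assuming that convention, the "inclusive" quartile method and made-up data.

```python
from statistics import mean, median, quantiles


def iqr_outliers(values, factor=1.5):
    """Return (lower_fence, upper_fence, outliers) using the IQR rule."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    lower, upper = q1 - factor * spread, q3 + factor * spread
    outliers = [x for x in values if x < lower or x > upper]
    return lower, upper, outliers


data = [1, 2, 3, 4, 5, 6, 70]
lower, upper, outliers = iqr_outliers(data)
print("fences  :", lower, upper)   # -2.0 10.0
print("outliers:", outliers)       # [70]

# A mean well above the median is a rough hint of right skew.
print("mean > median:", mean(data) > median(data))
```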
