
Data Mining

Handling Missing Data


• Replace the missing value with some constant, specified by the
analyst.
• Replace the missing value with the field mean (for numeric variables)
or the mode (for categorical variables).
• Replace the missing values with a value generated at random from
the observed distribution of the variable.
• Replace the missing values with imputed values based on the other
characteristics of the record.
Missing data
ID  Age  Income  Marital Status  Credit Score  Class
1   34   18      Married         ?             churner
2   28   14      single          ?             nonchurner
3   22   10      ?               730           churner
4   50   ?       single          ?             churner
5   48   25      widowed         670           nonchurner
6   30   17      single          650           nonchurner
7   27   14      single          ?             churner
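The first three replacement strategies listed above can be sketched on the table's own records. This is a minimal illustration using only the Python standard library; the helper names `observed` and `impute` are hypothetical, `None` stands in for "?", and the fixed random seed is an assumption made for repeatability. The fourth strategy (imputing from the other characteristics of the record) needs a predictive model and is not shown.

```python
import random
import statistics

# Records copied from the table; None marks a missing value ("?").
records = [
    {"id": 1, "age": 34, "income": 18, "marital": "Married", "credit": None},
    {"id": 2, "age": 28, "income": 14, "marital": "single", "credit": None},
    {"id": 3, "age": 22, "income": 10, "marital": None, "credit": 730},
    {"id": 4, "age": 50, "income": None, "marital": "single", "credit": None},
    {"id": 5, "age": 48, "income": 25, "marital": "widowed", "credit": 670},
    {"id": 6, "age": 30, "income": 17, "marital": "single", "credit": 650},
    {"id": 7, "age": 27, "income": 14, "marital": "single", "credit": None},
]


def observed(rows, field):
    """Return the non-missing values of one field."""
    return [r[field] for r in rows if r[field] is not None]


def impute(rows, field, fill):
    """Return copies of the rows with missing `field` values replaced by fill()."""
    return [dict(r, **{field: fill()}) if r[field] is None else dict(r) for r in rows]


# Field mean for a numeric variable, mode for a categorical variable.
income_mean = statistics.mean(observed(records, "income"))
marital_mode = statistics.mode(observed(records, "marital"))
filled = impute(records, "income", lambda: income_mean)
filled = impute(filled, "marital", lambda: marital_mode)

# Random draw from the observed distribution of the variable.
rng = random.Random(0)  # fixed seed (assumption) so the run is repeatable
credit_pool = observed(records, "credit")
filled = impute(filled, "credit", lambda: rng.choice(credit_pool))
```

Replacing with a constant uses the same helper, for example `impute(records, "credit", lambda: 0)`, where the constant 0 is a placeholder the analyst would choose.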
Outlier Detection and treatment
• Graphical Methods for Identifying outliers
• Measures of center and spread
• Data transformation
• MIN-MAX Normalization
• Z-score standardization
• Numerical Methods for Identifying outliers
A value is flagged as an outlier
if it is lower than the first quartile (Q1) – 1.5 * IQR or
if it is higher than the third quartile (Q3) + 1.5 * IQR,
where IQR is the interquartile range (Q3 – Q1), a measure of variability.
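The two transformations and the IQR rule can be written in a few lines. This is a sketch, not a reference implementation: the function names and the sample data are hypothetical, the z-score uses the sample standard deviation, and the quartiles come from `statistics.quantiles` with its default method. Other quartile conventions can move the fences slightly.

```python
import statistics


def min_max(values):
    """Min-max normalization: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_scores(values):
    """Z-score standardization: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    return [(v - mean) / sd for v in values]


def iqr_outliers(values):
    """Return values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


data = [10, 12, 11, 13, 12, 11, 95]  # hypothetical sample with one extreme value
scaled = min_max(data)
standardized = z_scores(data)
outliers = iqr_outliers(data)
```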
Simple Linear Regression
• Assumptions:
• Outlier
• z-score = $(\hat{y} - y)/\sigma$
• An observation whose z-score exceeds 3 in absolute value is treated as an outlier
• High Leverage point
• An unusual x-value. Such a point will certainly affect model summary
statistics such as R² and the standard errors of the regression coefficients.

• $h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}$
• An observation with leverage greater than 2(m+1)/n or 3(m+1)/n may be considered
a high leverage point (m is the number of predictors).
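The leverage formula and the cutoffs can be checked numerically. This is a small sketch for the single-predictor case (m = 1): leverage is one over n plus the squared deviation of x from its mean, divided by the sum of squared deviations. The function names and the x-values are hypothetical.

```python
def leverages(x):
    """Leverage of each observation in a simple linear regression on x."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]


def high_leverage(x, m=1, factor=2):
    """Indices whose leverage exceeds factor * (m + 1) / n."""
    cutoff = factor * (m + 1) / len(x)
    return [i for i, h in enumerate(leverages(x)) if h > cutoff]


x = [1, 2, 3, 4, 5, 20]  # hypothetical predictor values; 20 is far from the rest
h = leverages(x)
flagged = high_leverage(x)              # cutoff 2(m+1)/n
flagged_strict = high_leverage(x, factor=3)  # cutoff 3(m+1)/n
```

The leverages of a simple linear regression always sum to m + 1 = 2, which gives a quick sanity check on the computation.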
Simple Linear Regression
• Influential observation:
• Omitting this point changes the regression equation noticeably.
• One way to measure influence is Cook's distance, given by
$D_i = \frac{\sum_j (\hat{y}_j - \hat{y}_{j(i)})^2}{(k + 1) \cdot MSE}$
where $D_i$ is the Cook's distance of the ith observation and k is the number
of predictors in the model. $\hat{y}_j$ is the predicted value of the jth observation
from the model fitted with the ith observation included, and $\hat{y}_{j(i)}$ is the
predicted value of the jth observation after excluding the ith observation.
• A Cook's distance greater than 1 indicates a highly influential observation.
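Cook's distance can be computed directly from its definition by refitting the line once per omitted observation. The sketch below does this for a single predictor (k = 1) with an ordinary least-squares fit; the helper names and the data are hypothetical, and MSE is taken as SSE/(n - k - 1), which is an assumption about the slide's notation.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def cooks_distances(x, y, k=1):
    """Cook's distance of every observation, via leave-one-out refits."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    mse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - k - 1)
    distances = []
    for i in range(n):
        # Refit without observation i, then predict all n observations.
        a0, a1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        shift = sum((fj - (a0 + a1 * xj)) ** 2 for fj, xj in zip(fitted, x))
        distances.append(shift / ((k + 1) * mse))
    return distances


x = [1, 2, 3, 4, 5, 6]      # hypothetical data: the last point leaves the trend
y = [2, 4, 6, 8, 10, 30]
d = cooks_distances(x, y)
influential = [i for i, di in enumerate(d) if di > 1]
```

For larger data sets the refits are usually avoided by using residuals and leverages instead, but the loop keeps the correspondence with the definition visible.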
Assumption
• Normality of errors
• E(e) = 0
• Var(e) = σ²
• Breusch-Pagan test (bptest)
• Independence
• Model evaluation
• AIC (Akaike information criterion)
• BIC (Bayesian information criterion)
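The slide names AIC and BIC without giving their formulas. The sketch below uses the standard definitions, AIC = 2p - 2 ln L and BIC = p ln n - 2 ln L, where p is the number of estimated parameters and L the maximized likelihood. The function names are hypothetical, and whether p includes the error variance differs between software packages, so that choice is an assumption left to the caller. The Breusch-Pagan test is not shown here.

```python
import math


def aic(log_likelihood, p):
    """Akaike information criterion: lower is better."""
    return 2 * p - 2 * log_likelihood


def bic(log_likelihood, p, n):
    """Bayesian information criterion: penalizes parameters more as n grows."""
    return p * math.log(n) - 2 * log_likelihood


def gaussian_log_likelihood(sse, n):
    """Maximized log-likelihood of a linear model with normal errors."""
    sigma2 = sse / n  # maximum-likelihood estimate of the error variance
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)


# Hypothetical comparison: same data (n = 50), two candidate models.
small = aic(gaussian_log_likelihood(sse=120.0, n=50), p=2)
large = aic(gaussian_log_likelihood(sse=118.0, n=50), p=5)
```

Both criteria trade goodness of fit against model size, so a model with more predictors is preferred only if its fit improves enough to offset the penalty.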
Interpretation
• Multiple R
• R² = SSR/SST
• Coefficient of determination
• Adjusted R² = 1 – (1 – R²) * ((n – 1)/(n – k – 1)) = 1 – MSE/MST
• Model building (variable selection)
• Standard Error
• Variable selection & comparisons of models
• Precision
• F-test
• T-test
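The two R² formulas above can be verified on a small fit. In this sketch the data are hypothetical, and the fitted values come from the least-squares line y = -6 + (32/7)x for those data. SSR is obtained as SST - SSE, which holds for a least-squares fit with an intercept. MSE = SSE/(n - k - 1) and MST = SST/(n - 1) are assumptions about the slide's notation.

```python
def r_squared_summary(y, fitted, k):
    """R², adjusted R², and the equivalent 1 - MSE/MST for a fitted model."""
    n = len(y)
    y_bar = sum(y) / n
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # error sum of squares
    ssr = sst - sse                                         # regression sum of squares
    r2 = ssr / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    mse = sse / (n - k - 1)
    mst = sst / (n - 1)
    return {"r2": r2, "adj_r2": adj_r2, "one_minus_mse_over_mst": 1 - mse / mst}


x = [1, 2, 3, 4, 5, 6]      # hypothetical data
y = [2, 4, 6, 8, 10, 30]
fitted = [-6 + 32 / 7 * xi for xi in x]  # least-squares line for these data
summary = r_squared_summary(y, fitted, k=1)
```

Adjusted R² is never larger than R² when k is at least 1, because the factor (n - 1)/(n - k - 1) is then greater than one.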
