ML CH 1


Go, change the world

RV College of
Engineering

Introduction to Machine Learning

Improvi
UNIT I
Jyoti Shetty

4/26/2022

Chapter 2
Preparing to Model

Four Step Process of Machine Learning

Types of data

Nominal
A nominal scale describes a variable with categories that do not have a natural order or ranking. You can code nominal
variables with numbers if you want, but the order is arbitrary and any calculations, such as computing a mean, median, or
standard deviation, would be meaningless.

Examples of nominal variables include:

genotype, blood type, zip code, gender, race, eye color, political party

Ordinal
An ordinal scale is one where the order matters but not the difference between values.

Examples of ordinal variables include:

socioeconomic status (“low income”, “middle income”, “high income”), education level (“high school”, “BS”, “MS”, “PhD”),
income level (“less than 50K”, “50K-100K”, “over 100K”), satisfaction rating (“extremely dislike”, “dislike”, “neutral”, “like”,
“extremely like”).

Note that the differences between adjacent categories do not necessarily have the same meaning. For example, the difference
between the two income levels “less than 50K” and “50K-100K” does not have the same meaning as the difference between
the two income levels “50K-100K” and “over 100K”.

Interval
An interval scale is one where there is order and the difference between two values is meaningful.

Examples of interval variables include:

temperature (Fahrenheit), temperature (Celsius), pH, SAT score (200-800), credit score (300-850).

Ratio
A ratio variable has all the properties of an interval variable and also has a clear definition of 0.0. When the variable
equals 0.0, there is none of that variable.

Examples of ratio variables include:

enzyme activity, dose amount, reaction rate, flow rate, concentration, pulse, weight, length, temperature in Kelvin (0.0
Kelvin really does mean “no heat”), survival time.

When working with ratio variables, but not interval variables, the ratio of two measurements has a meaningful
interpretation. For example, because weight is a ratio variable, a weight of 4 grams is twice as heavy as a weight of 2
grams. However, a temperature of 10 degrees C should not be considered twice as hot as 5 degrees C. If it were, a conflict
would be created because 10 degrees C is 50 degrees F and 5 degrees C is 41 degrees F. Clearly, 50 degrees is not twice 41
degrees. As another example, a pH of 3 is not twice as acidic as a pH of 6, because pH is not a ratio variable.
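
The temperature argument above can be checked numerically. The following is a minimal sketch; the helper names `c_to_f` and `c_to_k` are illustrative and do not come from the slides.

```python
# Illustrative helpers (names are assumptions, not from the slides).
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def c_to_k(celsius):
    """Convert degrees Celsius to kelvins."""
    return celsius + 273.15


warm_c, cool_c = 10, 5

# Interval scale: the ratio changes with the unit, so it carries no meaning.
ratio_celsius = warm_c / cool_c
ratio_fahrenheit = c_to_f(warm_c) / c_to_f(cool_c)

# Ratio scale: Kelvin has a true zero, so this ratio is meaningful.
ratio_kelvin = c_to_k(warm_c) / c_to_k(cool_c)

print(ratio_celsius, ratio_fahrenheit, ratio_kelvin)
```

The Celsius and Fahrenheit ratios disagree with each other even though they describe the same two temperatures, which is why ratios on an interval scale should not be interpreted.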

The difference between interval and ratio scales lies in the zero point. An interval scale has no true zero,
so its values can fall below zero. For example, you can measure a temperature below 0 degrees Celsius,
such as -10 degrees.

Ratio variables, on the other hand, never fall below zero. Height and weight measure from 0 and above, but
never fall below it.

An interval scale is quantitative. Any measurement on an interval scale can be ranked, counted, subtracted,
or added, and equal intervals separate adjacent values on the scale. However, the ratio between two such
measurements has no meaningful interpretation.

A ratio scale has all the properties of an interval scale. You can use it to add, subtract, or count
measurements. A ratio scale differs by having a true origin, which is the starting or zero point of the
scale.

Exploring numerical data

• Central tendency measures – help us understand the central point of the data.
    • Mean, median and mode
    • Mean – numerical data, average
    • Median – numerical data, middle data point of the ordered data set
    • Mode – categorical data, most frequently appearing data value

• Dispersion of data (spread) – variance, standard deviation. The difference between mean &
median.
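
The measures listed above can be computed with Python's standard `statistics` module. This is a minimal sketch: the numbers are the first sample list from the box plot code in these slides, the `colours` list is made up for illustration, and using the sample (rather than population) variance is an assumption, since the slides do not say which one is meant.

```python
import statistics

values = [-7, 1, 2, 5, 5, 7, 8, 10, 11, 12, 12, 18, 25]  # numerical data
colours = ["red", "blue", "red", "green", "red"]          # made-up categorical data

# Central tendency
mean_value = statistics.mean(values)
median_value = statistics.median(values)
mode_value = statistics.mode(colours)

# Dispersion (sample variance and sample standard deviation: an assumption)
variance_value = statistics.variance(values)
std_dev_value = statistics.stdev(values)

# A gap between mean and median hints at skew or extreme values.
mean_median_gap = mean_value - median_value

print(mean_value, median_value, mode_value)
print(variance_value, std_dev_value, mean_median_gap)
```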
• 5 statistical values – minimum, Q1, median, Q3, maximum
• Box plot, histogram and scatter plot are the most commonly used plots for numerical data.
• A box plot gives all 5 values
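
The five values that a box plot summarises can also be computed directly. The sketch below uses `statistics.quantiles` from the standard library; the function name `five_number_summary` is illustrative, and the "exclusive" quartile method is one convention among several, so other tools may report slightly different quartiles for the same data.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # n=4 splits the data into quartiles and returns the three cut points.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="exclusive")
    return ordered[0], q1, median, q3, ordered[-1]


x = [-7, 1, 2, 5, 5, 7, 8, 10, 11, 12, 12, 18, 25]
summary = five_number_summary(x)
print(summary)  # (-7, 3.5, 8.0, 12.0, 25)
```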

import matplotlib.pyplot as plt

# First data set
x = [-7, 1, 2, 5, 5, 7, 8, 10, 11, 12, 12, 18, 25]

# Creating plot
fig = plt.figure(figsize=(10, 7))
plt.boxplot(x)

# show plot
plt.show()

# Second data set
x = [199, 201, 236, 269, 271, 278, 283, 291, 301, 303, 341]

# Creating plot
fig = plt.figure(figsize=(10, 7))
plt.boxplot(x)

# show plot
plt.show()

• Histogram plots the data into ranges or bins
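
To make the idea of bins concrete, the sketch below counts values into equal-width ranges by hand. It is only an illustration: the function name `histogram_counts` and the choice of 4 bins are assumptions, and plotting libraries may choose bin edges differently.

```python
def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width ranges."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    if width == 0:
        # All values are identical, so they all land in the first bin.
        counts[0] = len(values)
        return counts
    for value in values:
        # The maximum value belongs to the last bin rather than a new one.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


x = [199, 201, 236, 269, 271, 278, 283, 291, 301, 303, 341]
counts = histogram_counts(x, bins=4)
print(counts)  # [2, 2, 6, 1]
```

Each count corresponds to the height of one bar in the histogram.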


# Import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Load the auto-mpg dataset and inspect it
data = pd.read_csv('/content/auto-mpg.csv')
data.head()
data.describe()

# Histogram of the 'origin' column
plt.hist(data['origin'])
plt.show()

• Scatter plots – relationship between data


# Import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

data = pd.read_csv('/content/auto-mpg.csv')
data.head()
data.describe()

# Plot displacement on the x-axis and mpg on the y-axis so the data matches the axis labels
plt.scatter(data['displacement'], data['mpg'])
plt.xlabel('displacement')
plt.ylabel('mpg')
plt.show()
Exploring categorical data

• Mode – categorical data, most frequently appearing data value

• Cross-tabulations are the most commonly used summaries for categorical data
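
A cross-tabulation counts how often each combination of two categorical attributes occurs. The sketch below builds one with the standard library; the records are made up for illustration and are not taken from the auto-mpg dataset. In pandas, `pd.crosstab` produces the same kind of table.

```python
from collections import Counter

# Made-up records: (fuel type, body style) for eight cars.
records = [
    ("petrol", "sedan"), ("diesel", "suv"), ("petrol", "hatchback"),
    ("petrol", "sedan"), ("diesel", "sedan"), ("petrol", "suv"),
    ("diesel", "suv"), ("petrol", "hatchback"),
]

# Mode of one categorical attribute: the most frequent fuel type.
fuel_counts = Counter(fuel for fuel, _ in records)
fuel_mode = fuel_counts.most_common(1)[0][0]

# Cross-tabulation: one row per fuel type, one column per body style.
pair_counts = Counter(records)
fuels = sorted({fuel for fuel, _ in records})
bodies = sorted({body for _, body in records})
crosstab = {
    fuel: {body: pair_counts[(fuel, body)] for body in bodies}
    for fuel in fuels
}

print("mode:", fuel_mode)
for fuel in fuels:
    print(fuel, crosstab[fuel])
```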

BOX PLOT

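Data: 199, 201, 236, 269, 271, 278, 283, 291, 301, 303 and 341

Median = 278
Q1 = 236
Q3 = 301
IQR = Q3 - Q1 = 301 - 236 = 65
Lower whisker limit = Q1 - 1.5*65 = 236 - 97.5 = 138.5
Upper whisker limit = Q3 + 1.5*65 = 301 + 97.5 = 398.5
All values lie between 138.5 and 398.5, so this data set has no outliers.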
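
The worked example above can be reproduced with a short sketch of the 1.5 × IQR rule. The function names `iqr_limits` and `find_outliers` are illustrative, and the "exclusive" quartile method is chosen here because it matches the hand-computed Q1 and Q3; other quartile conventions can give different limits.

```python
import statistics


def iqr_limits(values, k=1.5):
    """Return the (lower, upper) limits Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(sorted(values), n=4, method="exclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return the values that lie outside the IQR limits."""
    low, high = iqr_limits(values, k)
    return [v for v in values if v < low or v > high]


data_a = [199, 201, 236, 269, 271, 278, 283, 291, 301, 303, 341]
data_b = [-7, 1, 2, 5, 5, 7, 8, 10, 11, 12, 12, 18, 25]

print(iqr_limits(data_a), find_outliers(data_a))  # (138.5, 398.5) []
print(iqr_limits(data_b), find_outliers(data_b))  # (-9.25, 24.75) [25]
```

The second list is the other sample from the box plot code; under this quartile convention only the value 25 falls outside the limits.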

https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/identifying-outliers-iqr-rule
data quality & remediation

• Major data quality issues are
    • Missing data
    • Outliers

• Factors that lead to data quality issues are
    • Errors in data collection
    • Incorrect sample set selection

• Data remediation
    1. Handling missing data
        a) Eliminate records with missing data
        b) Impute records with missing data
        c) Estimate
    2. Handling outlier data
        a) Eliminate records
        b) Impute
        c) Capping
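
Two of the remediation options above, imputing missing values and capping outliers, can be sketched on a plain Python list. This is an illustration under stated assumptions: `None` marks a missing value, the median is used as the imputed value (the slides do not prescribe a particular statistic), and the cap is the 1.5 × IQR limit. The function names are hypothetical.

```python
import statistics


def impute_missing(values):
    """Replace None entries with the median of the known values."""
    known = [v for v in values if v is not None]
    fill = statistics.median(known)
    return [fill if v is None else v for v in values]


def cap_outliers(values, k=1.5):
    """Clamp values outside the IQR limits to the nearest limit."""
    q1, _, q3 = statistics.quantiles(sorted(values), n=4, method="exclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


raw = [-7, 1, 2, None, 5, 7, 8, 10, 11, 12, 12, 18, 25]  # None = missing
imputed = impute_missing(raw)
capped = cap_outliers(imputed)

print(imputed)
print(capped)
```

Unlike eliminating records, both steps keep the number of records unchanged.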

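# This removes the outliers in the auto-mpg dataset

Q1 = data['mpg'].quantile(0.25)
Q3 = data['mpg'].quantile(0.75)
IQR = Q3 - Q1
low_val = Q1 - 1.5 * IQR
high_val = Q3 + 1.5 * IQR
outlier_rem = data.loc[(data['mpg'] > low_val) & (data['mpg'] < high_val)]
print(outlier_rem)

# Capping the outliers in the auto-mpg dataset using outlier and outlier_rem
# outlier holds the rows whose mpg lies outside the IQR limits
outlier = data.loc[(data['mpg'] < low_val) | (data['mpg'] > high_val)].copy()
# Cap each outlier at the nearest limit instead of dropping the row
outlier['mpg'] = outlier['mpg'].clip(lower=low_val, upper=high_val)
# DataFrame.append was removed in pandas 2.0, so use pd.concat
data_outlier_capped = pd.concat([outlier_rem, outlier], ignore_index=True)
data_outlier_capped.shape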

# This removes the rows with '?' in the auto-mpg dataset
data_clean = data.replace('?', np.nan).dropna()

# This returns the outliers in the auto-mpg dataset
Q1 = data['mpg'].quantile(0.25)
Q3 = data['mpg'].quantile(0.75)
IQR = Q3 - Q1
low_val = Q1 - 1.5 * IQR
high_val = Q3 + 1.5 * IQR
outlier = data.loc[(data['mpg'] < low_val) | (data['mpg'] > high_val)]
print(outlier)
data pre-processing

# PCA EXAMPLE

from sklearn.decomposition import PCA

# Drop the columns that are not used for PCA (column names as given in the slides).
# drop() with inplace=True returns None, so assign the result instead.
data = data.drop(columns=['horsepower', 'carname'])
print(data)
pca = PCA(n_components=3)
princomp = pca.fit_transform(data)
princomp_ds = pd.DataFrame(data=princomp, columns=['PC1', 'PC2', 'PC3'])
print(princomp_ds)

# FEATURE SUBSET SELECTION EXAMPLE
from sklearn.feature_selection import SelectKBest, chi2

# Exclude the target column from the candidate features
X = data.drop(columns=['cylinders'])
print(X.shape)
test = SelectKBest(chi2, k=2).fit_transform(X, data['cylinders'])
print(test.shape)
