ML CH 1

Go, change the world
RV College of
Engineering
Introduction to Machine Learning
UNIT I
Jyoti Shetty
4/26/2022
Chapter 2
Preparing to Model
Four Step Process of Machine Learning
Types of data
Nominal
A nominal scale describes a variable with categories that do not have a natural order or ranking. You can code nominal
variables with numbers if you want, but the order is arbitrary and any calculations, such as computing a mean, median, or
standard deviation, would be meaningless.
Examples of nominal variables include:
genotype, blood type, zip code, gender, race, eye color, political party
Ordinal
An ordinal scale is one where the order matters but not the difference between values.
Examples of ordinal variables include:
socioeconomic status (“low income”, “middle income”, “high income”), education level (“high school”, “BS”, “MS”, “PhD”),
income level (“less than 50K”, “50K-100K”, “over 100K”), satisfaction rating (“extremely dislike”, “dislike”, “neutral”, “like”,
“extremely like”).
Note that the differences between adjacent categories do not necessarily have the same meaning. For example, the difference
between the two income levels “less than 50K” and “50K-100K” does not have the same meaning as the difference between
the two income levels “50K-100K” and “over 100K”.
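The nominal/ordinal distinction can be sketched in pandas. This is a minimal illustration, assuming pandas is available; the sample values and the variable names (`blood_type`, `education_ord`) are made up for the example and are not part of the course material.

```python
import pandas as pd

# Hypothetical sample values (illustrative only)
education = pd.Series(["BS", "high school", "PhD", "MS", "BS"])

# Nominal: categories without order; only equality and counts are meaningful
blood_type = pd.Categorical(["A", "O", "B", "O", "AB"])

# Ordinal: declare the order explicitly so sorting and min/max respect it
levels = ["high school", "BS", "MS", "PhD"]
education_ord = pd.Categorical(education, categories=levels, ordered=True)

print(blood_type.ordered)                         # False
print(education_ord.min(), education_ord.max())   # high school PhD
print(list(education_ord.codes))                  # [1, 0, 3, 2, 1]
```

The integer codes only record rank; the gap between code 1 and code 2 says nothing about how far apart “BS” and “MS” are.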
Interval
An interval scale is one where there is order and the difference between two values is meaningful.
Examples of interval variables include:
temperature (Fahrenheit), temperature (Celsius), pH, SAT score (200-800), credit score (300-850).
Ratio
A ratio variable has all the properties of an interval variable, and also has a clear definition of 0.0. When the variable
equals 0.0, there is none of that variable.
Examples of ratio variables include:
enzyme activity, dose amount, reaction rate, flow rate, concentration, pulse, weight, length, temperature in Kelvin (0.0
Kelvin really does mean “no heat”), survival time.
When working with ratio variables, but not interval variables, the ratio of two measurements has a meaningful
interpretation. For example, because weight is a ratio variable, a weight of 4 grams is twice as heavy as a weight of 2
grams. However, a temperature of 10 degrees C should not be considered twice as hot as 5 degrees C. If it were, a conflict
would be created because 10 degrees C is 50 degrees F and 5 degrees C is 41 degrees F. Clearly, 50 degrees is not twice 41
degrees. As another example, a pH of 3 is not twice as acidic as a pH of 6, because pH is not a ratio variable.
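The temperature argument above can be checked with a few lines of arithmetic. This is a small sketch; the helper names `c_to_f` and `c_to_k` are introduced here for illustration.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def c_to_k(celsius):
    """Convert degrees Celsius to Kelvin."""
    return celsius + 273.15


# The same two temperatures give a different "ratio" on each scale
ratio_c = 10 / 5                      # 2.0
ratio_f = c_to_f(10) / c_to_f(5)      # 50 / 41, about 1.22
ratio_k = c_to_k(10) / c_to_k(5)      # about 1.018, the physically meaningful ratio

# Differences, by contrast, are consistent up to the unit conversion factor
diff_c = 10 - 5                       # 5 degrees C
diff_f = c_to_f(10) - c_to_f(5)       # 9 degrees F = 5 * 9/5

print(ratio_c, round(ratio_f, 3), round(ratio_k, 3))
print(diff_c, diff_f)
```

Only the Kelvin ratio is meaningful, because Kelvin is the one scale of the three with a true zero.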
The difference between interval and ratio scales comes from their ability to dip below zero. Interval scales
hold no true zero and can represent values below zero. For example, you can measure temperature below 0
degrees Celsius, such as -10 degrees.
Ratio variables, on the other hand, never fall below zero. Height and weight measure from 0 and above, but
never fall below it.
An interval scale allows you to measure all quantitative attributes. Any measurement of interval scale can be
ranked, counted, subtracted, or added, and equal intervals separate each number on the scale. However,
these measurements don’t provide any sense of ratio between one another.
A ratio scale has all the properties of an interval scale: you can use it to add, subtract, or count
measurements. It differs by having a true origin, the starting or zero point of the scale, so ratios
between measurements are also meaningful.
Exploring numerical data
• Central tendency measures – help us understand the central point of the data.
• mean, median and mode
• Mean – numerical data, average
• Median – numerical data, middle data point of the ordered data set
• Mode – categorical data, most frequently appearing data value
• Dispersion of data (spread) – variance, standard deviation. The difference between mean &
median.
• 5 statistical values – minimum, Q1, median (Q2), Q3, maximum
• Box plots, histograms and scatter plots are the most commonly used plots for numerical data.
• A box plot shows all 5 values
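The measures listed above can be computed with Python's standard `statistics` module. This is a minimal sketch using the sample values from the deck's first box plot. The choice of population variance and of the `"inclusive"` quantile method are assumptions made for the example; other conventions give slightly different numbers.

```python
import statistics

# Sample values from the first box plot example in these slides
x = [-7, 1, 2, 5, 5, 7, 8, 10, 11, 12, 12, 18, 25]

# Central tendency
mean = statistics.mean(x)
median = statistics.median(x)
modes = statistics.multimode(x)     # every value tied for most frequent

# Dispersion (population formulas, an assumption for this sketch)
variance = statistics.pvariance(x)
std_dev = statistics.pstdev(x)

# Five-number summary: minimum, Q1, median (Q2), Q3, maximum
q1, q2, q3 = statistics.quantiles(x, n=4, method="inclusive")
five_number = (min(x), q1, q2, q3, max(x))

print(median, modes)    # 8 [5, 12]
print(five_number)      # (-7, 5.0, 8.0, 12.0, 25)
```

The mean (about 8.38) sits slightly above the median (8) because the value 25 pulls it upward. That gap between mean and median is a quick signal of skew.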
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 7))
x = [-7, 1, 2, 5, 5, 7, 8, 10, 11, 12, 12, 18, 25]
# Creating plot
plt.boxplot(x)
# show plot
plt.show()
fig = plt.figure(figsize=(10, 7))
x = [199, 201, 236, 269, 271, 278, 283, 291, 301, 303, 341]
# Creating plot
plt.boxplot(x)
# show plot
plt.show()
• A histogram plots the data into ranges or bins
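To make the idea of bins concrete, the sketch below counts how many values fall into each of four equal-width ranges, using NumPy and the sample values from the box plot example. The choice of four bins is arbitrary and only for illustration.

```python
import numpy as np

# Sample values from the first box plot example in these slides
x = [-7, 1, 2, 5, 5, 7, 8, 10, 11, 12, 12, 18, 25]

# Split the range [min, max] into 4 equal-width bins and count values per bin
counts, edges = np.histogram(x, bins=4)

print(edges.tolist())    # [-7.0, 1.0, 9.0, 17.0, 25.0]
print(counts.tolist())   # [1, 6, 4, 2]
```

Each bin is closed on the left and open on the right, except the last, which also includes the maximum value. `plt.hist` performs the same binning before drawing the bars.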
# Import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = pd.read_csv('/content/auto-mpg.csv')
data.head()
data.describe()
# Histogram of the 'origin' column
plt.hist(data['origin'])
plt.show()
• Scatter plots – show the relationship between two numerical variables
# Import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = pd.read_csv('/content/auto-mpg.csv')
data.head()
data.describe()
# x = displacement, y = mpg, matching the axis labels below
plt.scatter(data['displacement'], data['mpg'])
plt.xlabel('displacement')
plt.ylabel('mpg')
plt.show()
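A scatter plot shows a relationship visually; one common way to put a number on a roughly linear relationship is the Pearson correlation coefficient. The sketch below uses made-up displacement and mpg values (not rows from auto-mpg.csv) purely to illustrate the call.

```python
import numpy as np

# Hypothetical values for illustration only (not taken from auto-mpg.csv)
displacement = np.array([100, 150, 200, 300, 400])
mpg = np.array([32, 28, 24, 18, 14])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(displacement, mpg)[0, 1]
print(round(r, 3))
```

A value of r close to -1 corresponds to points that fall along a downward-sloping line in the scatter plot.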
Exploring categorical data
• Mode – categorical data, most frequently appearing data value
• Cross-tabulations are the most commonly used summaries for categorical data
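A cross-tabulation counts how often each combination of two categorical values occurs. The following is a minimal sketch with `pandas.crosstab`; the small `cars` table is hypothetical and only mimics the column names of the auto-mpg data.

```python
import pandas as pd

# Hypothetical records for illustration (not rows from auto-mpg.csv)
cars = pd.DataFrame({
    "origin": ["USA", "USA", "Japan", "Europe", "Japan", "USA"],
    "cylinders": [8, 4, 4, 4, 4, 6],
})

# Mode: the most frequently appearing category
mode_origin = cars["origin"].mode()[0]

# Cross-tabulation: rows are origins, columns are cylinder counts
table = pd.crosstab(cars["origin"], cars["cylinders"])

print(mode_origin)   # USA
print(table)
```

Each cell of `table` holds the number of records with that origin and that cylinder count, so the cells sum to the number of rows in `cars`.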
BOX PLOT
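Data: 199, 201, 236, 269, 271, 278, 283, 291, 301, 303, and 341
Median = 278
Q1 = 236
Q3 = 301
IQR = Q3 - Q1 = 301 - 236 = 65
Lower whisker limit = Q1 - 1.5*65 = 236 - 97.5 = 138.5
Upper whisker limit = Q3 + 1.5*65 = 301 + 97.5 = 398.5
All values lie between 138.5 and 398.5, so this data set has no outliers.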
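The hand calculation above can be reproduced in a few lines. This sketch assumes the "median of halves" convention for quartiles (the middle value is excluded from both halves when the count is odd), which is what yields Q1 = 236 and Q3 = 301 here. The variable names are illustrative.

```python
import statistics

values = sorted([199, 201, 236, 269, 271, 278, 283, 291, 301, 303, 341])
n = len(values)

median = statistics.median(values)

# Median-of-halves quartiles: drop the middle value when n is odd
lower_half = values[: n // 2]
upper_half = values[(n + 1) // 2:]
q1 = statistics.median(lower_half)
q3 = statistics.median(upper_half)

iqr = q3 - q1
low_limit = q1 - 1.5 * iqr
high_limit = q3 + 1.5 * iqr

# Anything outside the limits is flagged as an outlier
outliers = [v for v in values if v < low_limit or v > high_limit]

print(median, q1, q3, iqr)       # 278 236 301 65
print(low_limit, high_limit)     # 138.5 398.5
print(outliers)                  # []
```

Library defaults can differ: quantile functions that interpolate linearly between data points give a slightly different Q1 and Q3 for the same data, so a plotted box may not match the hand-computed quartiles exactly.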
https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/identifying-outliers-iqr-rule
data quality & remediation
• Major data quality issues are
• Missing data
• Outliers
• Factors that lead to data quality issues are
• Errors in data collection
• Incorrect sample set selection
• Data remediation
1. Handling missing data
a) Eliminate records with missing data
b) Impute records with missing data
c) Estimate
2. Handling outliers
a) Eliminate records
b) Impute
c) Capping
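The first two options for missing data, eliminating and imputing, can be sketched as follows. The small `df` table is hypothetical; it imitates a column where missing entries are marked with '?'. Imputing with the median is one possible choice among several (mean and mode are common alternatives).

```python
import pandas as pd

# Hypothetical column with '?' marking missing entries (illustrative data)
df = pd.DataFrame({"horsepower": ["130", "165", "?", "150", "?", "140"]})

# Convert to numbers; anything unparseable such as '?' becomes NaN
hp = pd.to_numeric(df["horsepower"], errors="coerce")

# Option a) eliminate records with missing data
dropped = hp.dropna()

# Option b) impute missing data, here with the column median
imputed = hp.fillna(hp.median())

print(int(hp.isna().sum()))   # 2
print(len(dropped))           # 4
print(imputed.tolist())       # [130.0, 165.0, 145.0, 150.0, 145.0, 140.0]
```

Eliminating rows is simplest but shrinks the data set; imputing keeps every record at the cost of introducing values that were never observed.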
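# Remove the outliers from the 'mpg' column of the auto-mpg dataset (IQR rule)
Q1 = data['mpg'].quantile(0.25)
Q3 = data['mpg'].quantile(0.75)
IQR = Q3 - Q1
low_val = Q1 - 1.5 * IQR
high_val = Q3 + 1.5 * IQR
outlier_rem = data.loc[(data['mpg'] >= low_val) & (data['mpg'] <= high_val)]
print(outlier_rem)
# Cap the outliers instead of dropping them: pull each outlier back to the nearest limit
outlier = data.loc[(data['mpg'] < low_val) | (data['mpg'] > high_val)].copy()
outlier['mpg'] = outlier['mpg'].clip(lower=low_val, upper=high_val)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
data_outlier_capped = pd.concat([outlier_rem, outlier], ignore_index=True)
data_outlier_capped.shape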
# Remove the rows that contain '?' in the auto-mpg dataset
data_clean = data.replace('?', np.nan).dropna()
# Return the outliers in the 'mpg' column of the auto-mpg dataset (IQR rule)
Q1 = data['mpg'].quantile(0.25)
Q3 = data['mpg'].quantile(0.75)
IQR = Q3 - Q1
low_val = Q1 - 1.5 * IQR
high_val = Q3 + 1.5 * IQR
outlier = data.loc[(data['mpg'] < low_val) | (data['mpg'] > high_val)]
print(outlier)
data pre-processing
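# PCA example
from sklearn.decomposition import PCA
# Drop the non-numeric columns before PCA ('horsepower' contains '?' entries).
# The name column may be spelled 'car name' in your copy of the CSV.
data = data.drop(columns=['horsepower', 'carname'])
print(data)
# Project the remaining numeric columns onto 3 principal components
pca = PCA(n_components=3)
princomp = pca.fit_transform(data)
princomp_ds = pd.DataFrame(data=princomp, columns=['PC1', 'PC2', 'PC3'])
print(princomp_ds)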
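# Feature subset selection example
from sklearn.feature_selection import SelectKBest, chi2
print(data.shape)
# Keep the 2 features most associated with 'cylinders' (chi2 needs non-negative features);
# the target column is left out of the feature matrix
X = data.drop(columns='cylinders')
test = SelectKBest(chi2, k=2).fit_transform(X, data['cylinders'])
print(test.shape)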
