Data Preprocessing

Data Preprocessing Operations

The following operations need to be performed on the
dataset:
• Handling missing data
• Managing categorical data
• Dataset distribution (training, testing)
• Scaling the features
Preprocessing Steps in Python
• How to import libraries?
import pandas as pd
import numpy as np
• How to load datasets?

• If the dataset is in Excel format, it can be loaded with pd.read_excel.


• If the dataset is in CSV format, then

df = pd.read_csv("dataset.csv", sep=" ")
Index  Student Name  Subject            Marks  Grade
0      Ramu          Maths              70     A
1      Somu          Maths              55     B
2      Lilly         Technical English  NaN    O
3      Rose          Python             80     A+
4      Nisha         Java               NaN    O
5      Seetha        Compiler           50     B
6      Patrick       Big Data           40     Fail
7      Peter         E-Commerce         75     A
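To make the loading step concrete, here is a minimal sketch that reads the sample table from in-memory CSV text and counts the missing values. The comma separator, the exact column names, and the empty fields standing in for NaN are assumptions made for illustration; the Index column is assumed to be the DataFrame's own row index.

```python
import io

import pandas as pd

# Hypothetical CSV text mirroring the sample table; empty Marks fields become NaN.
csv_text = """Student Name,Subject,Marks,Grade
Ramu,Maths,70,A
Somu,Maths,55,B
Lilly,Technical English,,O
Rose,Python,80,A+
Nisha,Java,,O
Seetha,Compiler,50,B
Patrick,Big Data,40,Fail
Peter,E-Commerce,75,A
"""

# read_csv accepts any file-like object, so StringIO stands in for a file on disk.
df = pd.read_csv(io.StringIO(csv_text))

# Count the missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Show only the rows that contain at least one missing value.
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```

Checking `isna()` before any other step shows which columns need imputation.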
Process Continues
• The dataset above contains NaN entries in the
Marks column. These missing values must be
handled before training. A missing value can be
replaced by the column's:
• Mean
• Median
• Mode
• Constant value
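The four options listed above map onto the `strategy` parameter of scikit-learn's `SimpleImputer` ("mean", "median", "most_frequent" for the mode, and "constant"). The sketch below applies each one to the Marks column of the sample table; the `marks` array and the fill value of 0 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Marks column from the sample table, as a 2-D array (one feature).
marks = np.array([[70.0], [55.0], [np.nan], [80.0], [np.nan], [50.0], [40.0], [75.0]])

filled = {}

# Mean, median, and mode ("most_frequent") are computed from the observed values.
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled[strategy] = imputer.fit_transform(marks)

# A constant replacement needs an explicit fill_value (0 is an arbitrary choice here).
constant_imputer = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0)
filled["constant"] = constant_imputer.fit_transform(marks)

for name, values in filled.items():
    print(name, values.ravel())
```

Which strategy is appropriate depends on the data: the median is less sensitive to outliers than the mean, and the mode also works for categorical columns.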
Handling Missing Values
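from sklearn.impute import SimpleImputer
# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
# X is the feature array; the slice 1:3 must cover only numeric columns
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])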
Classifying Dependent and Independent Variables
• In our dataset, Marks and Grade are the dependent
variables; the remaining columns are independent.
X = df.iloc[:, [0, 1, 2]].values
y = df.iloc[:, 3].values
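Positional indices are easy to get wrong when columns are reordered, so selecting by column name can be a safer alternative. The sketch below assumes that Index is the DataFrame's row index, that Grade is the target, and that a three-row excerpt of the table is enough to show the idea; none of these choices are confirmed by the slides.

```python
import numpy as np
import pandas as pd

# Small excerpt of the sample table (assumed column names).
df = pd.DataFrame({
    "Student Name": ["Ramu", "Somu", "Lilly"],
    "Subject": ["Maths", "Maths", "Technical English"],
    "Marks": [70.0, 55.0, np.nan],
    "Grade": ["A", "B", "O"],
})

# Independent variables (features), selected by name.
X = df[["Student Name", "Subject", "Marks"]].values

# Dependent variable (target), assumed here to be Grade.
y = df["Grade"].values

print(X.shape, y.shape)
```

With this column layout, the name-based selection returns the same arrays as `df.iloc[:, [0, 1, 2]].values` and `df.iloc[:, 3].values`.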
Feature Encoding
• In our dataset, the Subject feature holds
string values. String-based features cannot be
processed directly when training a model, so
they must first be converted to numeric values.
One common method for this conversion is
one-hot encoding.
How to encode?
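from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# [0] is the index of the column to encode; other columns pass through unchanged
ct = ColumnTransformer(transformers=[("encoder", OneHotEncoder(), [0])],
                       remainder="passthrough")
X = ct.fit_transform(X)
print(X)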
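As an illustration of what the encoder produces, the sketch below one-hot encodes a small two-column array in which column 0 holds the subject and column 1 the marks. The array contents and the `sparse_threshold=0` setting (used so that the result is always a dense array) are assumptions added for this example.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Column 0: Subject (categorical), column 1: Marks (numeric).
X = np.array(
    [["Maths", 70.0], ["Python", 80.0], ["Maths", 55.0], ["Java", 50.0]],
    dtype=object,
)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],  # encode column 0 only
    remainder="passthrough",                           # keep Marks as-is
    sparse_threshold=0,                                # always return a dense array
)

# The encoded columns come first, followed by the passthrough columns.
X_encoded = ct.fit_transform(X)
print(X_encoded)
```

Each distinct subject becomes its own 0/1 column (categories are sorted, so the order here is Java, Maths, Python), and the Marks column is appended unchanged.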
Dataset Distribution
# Split the data into training and test sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
print('X_train.shape: ', X_train.shape)
print('X_test.shape: ', X_test.shape)
print('y_train.shape: ', y_train.shape)
print('y_test.shape: ', y_test.shape)
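The effect of `test_size=0.2` can be checked on a small synthetic dataset. The sketch below uses made-up arrays of ten samples (an assumption for illustration only) and confirms that each feature row stays paired with its label after shuffling.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten synthetic samples with two features each; row i is [2*i, 2*i + 1].
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 20% of the samples go to the test set; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print('X_train.shape: ', X_train.shape)
print('X_test.shape: ', X_test.shape)

# Features and labels are shuffled together, so the pairing is preserved.
pairs_intact = bool((X_train[:, 0] == 2 * y_train).all())
print('pairs intact: ', pairs_intact)
```

Fixing `random_state` makes the split reproducible between runs, which helps when comparing models.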
Scaling the Features
# Feature scaling: fit on the training set, then apply to both sets

from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
print(X_train)
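`StandardScaler` subtracts the training mean from each feature and divides by the training standard deviation. The sketch below verifies this on a tiny made-up marks column (the values are assumptions for illustration) and shows that the test set is scaled with the statistics learned from the training set, not its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature training and test sets.
X_train = np.array([[40.0], [50.0], [60.0], [70.0], [80.0]])
X_test = np.array([[55.0], [90.0]])

sc_X = StandardScaler()

# fit_transform learns the mean and standard deviation from the training data.
X_train_scaled = sc_X.fit_transform(X_train)

# transform reuses those training statistics for the test data.
X_test_scaled = sc_X.transform(X_test)

print('training mean: ', sc_X.mean_)
print('training std:  ', sc_X.scale_)
print(X_train_scaled.ravel())
print(X_test_scaled.ravel())
```

Fitting the scaler only on the training data avoids leaking information from the test set into the preprocessing step.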

