
Data Preprocessing

Data Preprocessing Operations


The set of operations to perform on the
dataset is:
• Handling missing data
• Managing categorical data
• Dataset distribution (training, testing)
• Scaling the features
Preprocessing Steps in Python
• How to import libraries?
import pandas as pd
import numpy as np
Continued…
• How to load datasets?

If the dataset is in Excel format:

dataset = pd.read_excel("dataset")

• If the dataset is in CSV format:

df = pd.read_csv("dataset.csv", sep=",")
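As a runnable sketch of the load step, the snippet below first writes a tiny comma-separated file so the example is self-contained; the file name and rows are illustrative, not part of the original dataset:

```python
import pandas as pd

# Create a small illustrative CSV so the example is self-contained.
with open("dataset.csv", "w") as f:
    f.write("Student Name,Subject,Marks,Grade\n")
    f.write("Ramu,Maths,70,A\n")
    f.write("Somu,Maths,55,B\n")

df = pd.read_csv("dataset.csv", sep=",")  # "," is also the default separator
print(df.shape)          # two rows, four columns
print(list(df.columns))
```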
Dataset
Index  Student Name  Subject            Marks  Grade
0      Ramu          Maths              70     A
1      Somu          Maths              55     B
2      Lilly         Technical English  NaN    O
3      Rose          Python             80     A+
4      Nisha         Java               NaN    O
5      Seetha        Compiler           50     B
6      Patrick       Big Data           40     Fail
7      Peter         E-Commerce         75     A
Process Continues
• The above dataset contains NaN values,
which must be handled as missing data.
The missing values can be replaced by:
• Mean
• Median
• Mode
• Constant value
Handling Missing Values
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(X[:, 2:3])  # Marks (column 2) is the only numeric column
X[:, 2:3] = imputer.transform(X[:, 2:3])
print(X)
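A minimal end-to-end sketch of mean imputation on a toy Marks column (the values are illustrative, not the dataset above):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy Marks column with two missing entries (illustrative values).
X = np.array([[70.0], [np.nan], [80.0], [np.nan], [50.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)  # each NaN -> mean of 70, 80, 50

print(X_filled.ravel())
```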
Classifying Dependent and
Independent Variables
• In our dataset, Grade is the dependent
variable; the remaining columns are the
independent variables.
x = df.iloc[:, [0, 1, 2]].values
y = df.iloc[:, 3].values
print(x)
print(y)
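The same column selection can be checked on a two-row toy frame (the rows are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Student Name": ["Ramu", "Somu"],
    "Subject": ["Maths", "Maths"],
    "Marks": [70, 55],
    "Grade": ["A", "B"],
})

x = df.iloc[:, [0, 1, 2]].values  # independent variables (columns 0-2)
y = df.iloc[:, 3].values          # dependent variable (Grade)
print(x.shape, y.shape)
```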
Feature Encoding
• In our dataset the Subject feature has
string values. String-based features cannot
be processed directly by a training model,
so a method is required to convert the
strings to numeric values. One such method
is one-hot encoding.
How to encode?
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[("encoder",
OneHotEncoder(), [1])], remainder="passthrough")  # Subject is column 1 of x
X = ct.fit_transform(X)
X = np.array(X)
print(X)
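A self-contained sketch of the same transformer on a toy matrix, where the categorical subject sits in column 0 of a two-column array (the values are illustrative):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Rows of [subject, marks]; the subject column is categorical.
X = np.array([["Maths", 70], ["Python", 80], ["Java", 60]], dtype=object)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",  # keep the marks column unchanged
)
X_enc = np.array(ct.fit_transform(X))

print(X_enc.shape)  # three one-hot columns plus the marks column
```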
Dataset Distribution
#split data into train and test dataset

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=0)
print('X_train.shape: ', X_train.shape)
print('X_test.shape: ', X_test.shape)
print('y_train.shape: ', y_train.shape)
print('y_test.shape: ', y_test.shape)
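On a synthetic 10-sample dataset the 80/20 split can be verified directly (the data is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(30).reshape(10, 3)  # 10 samples, 3 features (synthetic)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)  # reproducible 80/20 split

print(X_train.shape, X_test.shape)
```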
Scaling the Features
#feature scaling of training dataset

from sklearn.preprocessing import StandardScaler


sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
print(X_train)
print(X_test)
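The fit-on-train / transform-on-test pattern can be sketched on toy numbers (illustrative values): the scaler learns the mean and standard deviation from the training data only, then reuses them for the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[10.0], [20.0], [30.0], [40.0]])  # toy training column
X_test = np.array([[25.0]])                           # toy test value

sc = StandardScaler()
X_train_s = sc.fit_transform(X_train)  # learn mean/std from training data
X_test_s = sc.transform(X_test)        # reuse the training mean/std

print(X_train_s.mean(), X_train_s.std())  # ~0.0, ~1.0
print(X_test_s)  # 25 equals the training mean, so it scales to 0.0
```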
