EDA Pipeline Final


In [30]: # importing libraries

# for analysis
import pandas as pd
import numpy as np

# for preprocessing
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

# for pipelines
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# to create custom estimators
from sklearn.base import BaseEstimator, TransformerMixin

# for serialization and deserialization
import pickle

In [2]: # display pipelines as diagrams


from sklearn import set_config
set_config(display='diagram')

Reference: YouTube - ColumnTransformer -- Sklearn Pipelines -- Custom Estimators

In [3]: # the issue
## different features have different issues.
## e.g. one feature has missing values, another needs to be scaled.
## handling each of them separately is a hectic task.
## different processes will produce their respective numpy arrays,
## and at the end we need to combine them.
## if the number of columns and processes is large, this becomes a hectic task.
## whatever transformation we do on the train side needs to be replicated on the test side.
## whenever new data comes in for prediction -> we need to transform it with the exact same steps.

##### SOLUTION #######

## COLUMN TRANSFORMER
## a class of sklearn
## using this, the whole preprocessing can be done in a few lines of code

Read and Inspect CSV


In [4]: # reading csv
df = pd.read_csv('titanic.csv')
df.head()


Out[4]:
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN      ...
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85      ...
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN      ...
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123      ...
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN      ...

In [5]: # checking shape

df.shape

Out[5]: (891, 12)

In [6]: # checking info


df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

In [7]: # checking the summary statistics


df.describe()


Out[7]:        PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count           891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean            446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std             257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min               1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%             223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%             446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%             668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max             891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

In [8]: # train/test split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop('Survived', axis=1), df['Survived'],
    test_size=0.2, random_state=42)  # the random_state value was cut off in the export; 42 is assumed

Plan
Basic pipeline structure:

Dropping unnecessary columns ->

Missing values (2 cols: Age, Embarked) ->

One-hot encoding (2 cols: Sex, Embarked) ->

Scaling (numerical cols)

Creating Column Transformers


In [9]: # to create a pipeline we have to create estimators
# an estimator is an object with methods such as fit, fit_transform, transform
# a pipeline is composed of estimators
# to create custom estimators we extend the BaseEstimator class -> and define fit and transform
# when we define fit and transform in our class, TransformerMixin will automatically create fit_transform;
# we don't need to write it separately.

# fit - to learn something from data - usually on train data
# transform - to apply the learning done in fit to data - on test data
# fit_transform - to do both - on train data
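A quick sketch of that contract (toy column and values, not from the original notebook):

imp = SimpleImputer(strategy='mean')

train = pd.DataFrame({'Age': [20.0, 30.0, np.nan]})
test = pd.DataFrame({'Age': [np.nan, 40.0]})

imp.fit(train)              # learns the train mean: 25.0
print(imp.transform(test))  # fills the test NaN with 25.0, not with the test mean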

In [10]: # for dropping unnecessary columns we need a custom estimator

# custom estimator
# in this estimator we don't want to learn anything from the data,
# we just want to apply an action,
# so nothing is written under fit;
# the code goes under transform.
# X is a DataFrame

class FeaturesDropper(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None):
        # nothing to learn; fit just returns self
        return self

    def transform(self, X):
        # drop the columns we don't need
        columns = ["PassengerId", "Name", "Ticket", "Cabin"]
        return X.drop(columns=columns)

    def get_feature_names_out(self, input_features=None):
        pass

drop_column_tx = FeaturesDropper()

# built-in estimators generally work with arrays;
# our custom transformer works with DataFrames -
# that's why we write a custom class here even though sklearn
# already ships estimators such as SimpleImputer.
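As a quick sanity check (a sketch, not a cell from the original run), the custom transformer can be used on its own:

drop_column_tx.fit_transform(X_train).head()  # X_train without PassengerId, Name, Ticket, Cabin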

In [11]: # HOW?
# tx = transformer

# 1. initialize a ColumnTransformer object
# 2. pass a list of transformers and the remainder option
# remainder: sometimes we do not tx all the columns, so what happens to those columns?
# two options: 1. drop them  2. keep them as they are -> 'drop' or 'passthrough'

In [12]: # here we are not transforming some of the columns, so using the remainder option we keep them as-is
# we pass each tx in the form of a tuple (see the sketch below)
# the tuple consists of: the name of the tx, the tx object, and the list of columns
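A minimal sketch of that tuple format and the remainder option (toy frame and names, not from the original notebook):

demo = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 4.0], 'c': ['x', 'y']})
ct = ColumnTransformer(transformers=[
    ('impute_a', SimpleImputer(strategy='mean'), ['a'])  # (name, tx object, columns)
], remainder='passthrough')  # 'drop' would discard b and c instead
ct.fit_transform(demo)       # a imputed; b and c passed through unchanged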

In [13]: # creating a column tx for the imputation step

impute_tx = ColumnTransformer(transformers=[

    ('impute_age', SimpleImputer(strategy='mean'), ['Age']),

    ('impute_embarked', SimpleImputer(strategy='most_frequent'), ['Embarked'])

], remainder='passthrough', verbose_feature_names_out=False).set_output(transform='pandas')

In [14]: # creating a tx for encoding

encode_tx = ColumnTransformer(transformers=[

    ('ohe_sex_embarked', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), ['Sex', 'Embarked'])

], remainder='passthrough', verbose_feature_names_out=False).set_output(transform='pandas')
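A small sketch of what handle_unknown='ignore' buys (toy data, not from the original notebook): a category unseen during fit encodes as an all-zero row instead of raising an error.

ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
ohe.fit(pd.DataFrame({'Embarked': ['S', 'C', 'Q']}))
ohe.transform(pd.DataFrame({'Embarked': ['Z']}))  # -> [[0., 0., 0.]]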

In [15]: # scaling
scaling_tx = ColumnTransformer(transformers=[

    ('scale', MinMaxScaler(), slice(0, 1000))

], remainder='passthrough', verbose_feature_names_out=False).set_output(transform='pandas')

# slice(0, 1000) selects columns by position; here it covers every column

# Why not column names?
# -->> one-hot encoding Sex and Embarked generates new columns whose names are not known up front;
#      e.g. Embarked becomes Embarked_C, Embarked_Q, Embarked_S after one-hot encoding,
#      so it is not feasible to hard-code column names in the scaling step.
# also, inside a pipeline data can flow as a numpy array, and arrays have no column names.
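As an alternative sketch, sklearn's make_column_selector can pick columns by dtype at fit time instead of by position (after one-hot encoding every column here is numeric, so it would select them all, just like the slice):

from sklearn.compose import make_column_selector

scaling_tx_alt = ColumnTransformer(transformers=[
    ('scale', MinMaxScaler(), make_column_selector(dtype_include='number'))
], remainder='passthrough', verbose_feature_names_out=False)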

Create Pipeline
In [16]: # what is a pipeline?
# it is a mechanism in sklearn
# which chains together multiple steps so that the o/p of each step is used as the i/p of the next

# pipelines make it easy to apply the same preprocessing to train and test

# if we want to build an ML model, the first step is to clean the data.
# suppose the data has multiple issues -> missing values, outliers, encoding, scaling etc.
# e.g. missing values -> outliers -> encode cat cols -> scale num cols -> train model
# using a pipeline we can chain the above steps
# we just give the i/p at the start of the pipeline; at the end the pipeline produces the o/p

# performing steps individually vs performing them as a chain (see the sketch below):

# when we deploy our model on a server,
# whenever input data comes in, we need to apply the same preprocessing steps;
# without a pipeline, doing this step by step by hand is quite hectic ->
# a pipeline replays all the steps in one call
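A tiny sketch of that chaining idea (illustration only, reusing the transformers defined above):

manual  = impute_tx.fit_transform(drop_column_tx.fit_transform(X_train))
chained = Pipeline([('drop', drop_column_tx), ('impute', impute_tx)]).fit_transform(X_train)
# 'manual' and 'chained' hold the same frame: each step's output feeds the next step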

In [17]: # pass a list of tuples to the Pipeline

# each tuple consists of two things:
# 1. the step name  2. the tx object

pipe = Pipeline([
    ('drop_column_tx', drop_column_tx),
    ('impute_tx', impute_tx),
    ('encode_tx', encode_tx),
    ('scaling_tx', scaling_tx)
], verbose=True).set_output(transform='pandas')

In [18]: pipe.fit_transform(X_train)

[Pipeline] .... (step 1 of 4) Processing drop_column_tx, total= 0.0s
[Pipeline] ......... (step 2 of 4) Processing impute_tx, total= 0.0s
[Pipeline] ......... (step 3 of 4) Processing encode_tx, total= 0.0s
[Pipeline] ........ (step 4 of 4) Processing scaling_tx, total= 0.0s


Out[18]:     Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S       Age  Pclass  SibSp     Parch  Fare
331                 0.0       1.0         0.0         0.0         1.0  0.566474     0.0  0.000  0.000000   ...
733                 0.0       1.0         0.0         0.0         1.0  0.283740     0.5  0.000  0.000000   ...
382                 0.0       1.0         0.0         0.0         1.0  0.396833     1.0  0.000  0.000000   ...
704                 0.0       1.0         0.0         0.0         1.0  0.321438     1.0  0.125  0.000000   ...
813                 1.0       0.0         0.0         0.0         1.0  0.070118     1.0  0.500  0.333333   ...
...                 ...       ...         ...         ...         ...       ...     ...    ...       ...   ...
106                 1.0       0.0         0.0         0.0         1.0  0.258608     1.0  0.000  0.000000   ...
270                 0.0       1.0         0.0         0.0         1.0  0.365404     0.0  0.000  0.000000   ...
860                 0.0       1.0         0.0         0.0         1.0  0.509927     1.0  0.250  0.000000   ...
435                 1.0       0.0         0.0         0.0         1.0  0.170646     0.0  0.125  0.333333   ...
102                 0.0       1.0         0.0         0.0         1.0  0.258608     0.0  0.000  0.166667   ...

712 rows × 10 columns

In [19]: # saving the pipeline as a pickle file

with open('titanic_EDA.pkl', 'wb') as f:
    pickle.dump(pipe, f)

In [20]: # loading the pickle file

with open('titanic_EDA.pkl', 'rb') as f:
    titanic_eda = pickle.load(f)

In [21]: # transforming the test data

titanic_eda.transform(X_test)

Out[21]:     Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S       Age  Pclass  SibSp     Parch  Fare
709                 0.0       1.0         1.0         0.0         0.0  0.365404     1.0  0.125  0.166667   ...
439                 0.0       1.0         0.0         0.0         1.0  0.384267     0.5  0.000  0.000000   ...
840                 0.0       1.0         0.0         0.0         1.0  0.246042     1.0  0.000  0.000000   ...
720                 1.0       0.0         0.0         0.0         1.0  0.070118     0.5  0.000  0.166667   ...
39                  1.0       0.0         1.0         0.0         0.0  0.170646     1.0  0.125  0.000000   ...
...                 ...       ...         ...         ...         ...       ...     ...    ...       ...   ...
433                 0.0       1.0         0.0         0.0         1.0  0.208344     1.0  0.000  0.000000   ...
773                 0.0       1.0         1.0         0.0         0.0  0.365404     1.0  0.000  0.000000   ...
25                  1.0       0.0         0.0         0.0         1.0  0.472229     1.0  0.125  0.833333   ...
84                  1.0       0.0         0.0         0.0         1.0  0.208344     0.5  0.000  0.000000   ...
10                  1.0       0.0         0.0         0.0         1.0  0.044986     1.0  0.125  0.166667   ...

179 rows × 10 columns


Additional information from pipe object


In [22]: # we can inspect the pipe object

# named_steps lists all the steps that this pipeline follows

pipe.named_steps

Out[22]:
{'drop_column_tx': FeaturesDropper(),
'impute_tx': ColumnTransformer(remainder='passthrough',
transformers=[('impute_age', SimpleImputer(), ['Age']),
('impute_embarked',
SimpleImputer(strategy='most_frequent'),
['Embarked'])],
verbose_feature_names_out=False),
'encode_tx': ColumnTransformer(remainder='passthrough',
transformers=[('ohe_sex_embarked',
OneHotEncoder(handle_unknown='ignore',
sparse_output=False),
['Sex', 'Embarked'])],
verbose_feature_names_out=False),
'scaling_tx': ColumnTransformer(remainder='passthrough',
transformers=[('scale', MinMaxScaler(),
slice(0, 1000, None))],
verbose_feature_names_out=False)}

In [23]: # if we want to see how this pipeline works and what was learned
pipe.named_steps['impute_tx'].transformers_

Out[23]:
[('impute_age', SimpleImputer(), ['Age']),
 ('impute_embarked', SimpleImputer(strategy='most_frequent'), ['Embarked']),
 ('remainder', 'passthrough', [0, 1, 3, 4, 5])]

In [24]: # we can see above the transformers in impute_tx -> impute_age, impute_embarked
# let's check what the tx learned
pipe.named_steps['impute_tx'].transformers_[0][1].statistics_

Out[24]: array([29.49884615])

In [25]: # the learning of impute_embarked

pipe.named_steps['impute_tx'].transformers_[1][1].statistics_

Out[25]: array(['S'], dtype=object)

In [26]: # using the methods above we can inspect the learning and the parameters

# we can run cross-validation on the pipeline
# we can also do hyperparameter tuning on the pipeline (see the sketch below)
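A sketch of both ideas (the classifier and parameter grid below are assumptions, not part of the original notebook):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV

# append an (assumed) model so the pipeline goes from raw frame to prediction
model_pipe = Pipeline([('preprocess', pipe), ('clf', LogisticRegression(max_iter=1000))])

# cross-validation on the whole pipeline: preprocessing is re-fit inside each fold,
# so no statistics leak from the validation fold into the preprocessing steps
print(cross_val_score(model_pipe, X_train, y_train, cv=5).mean())

# hyperparameter tuning: step names double as parameter prefixes ('clf__C' targets the model)
grid = GridSearchCV(model_pipe, param_grid={'clf__C': [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)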
