EDA Pipeline Final


In [30]: # importing libraries

# for analysis
import pandas as pd
import numpy as np

# for preprocessing
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

# for pipelines
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# to create custom estimators
from sklearn.base import BaseEstimator, TransformerMixin

# for serialization and deserialization
import pickle

In [2]: # display pipelines as diagrams


from sklearn import set_config
set_config(display='diagram')

Reference: YouTube - ColumnTransformer -- Sklearn Pipelines -- Custom Estimators

In [3]: # the issue
## different features have different issues.
## e.g. one feature has missing values, another needs to be scaled.
## handling each of them separately is a hectic task.
## different processes will produce their respective numpy arrays,
## and at the end we need to combine them.
## if the number of columns and processes is large, this becomes a hectic task.
## whatever transformation we do on the train side needs to be replicated on the test side.
## whenever new data comes in for prediction -> we need to transform it with the exact same steps.

##### SOLUTION #######

## COLUMN TRANSFORMER
## a class of sklearn
## using this, the whole preprocessing can be done in a few lines of code

Read and Inspect CSV


In [4]: # reading csv
df = pd.read_csv('titanic.csv')
df.head()


Out[4]:
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN      ...
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85      ...
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN      ...
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123      ...
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN      ...

In [5]: # checking shape

df.shape

Out[5]: (891, 12)

In [6]: # checking info


df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

In [7]: # checking the summary statistics


df.describe()


Out[7]:        PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count           891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean            446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std             257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min               1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%             223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%             446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%             668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max             891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

In [8]: # train/test split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop('Survived', axis=1), df['Survived'],
    test_size=0.2, random_state=42)  # the random_state value was cut off in the export; 42 is assumed

Plan
Basic pipeline structure:

Dropping unnecessary columns ->

Missing values (2 cols: Age, Embarked) ->

One-hot encoding (2 cols: Sex, Embarked) ->

Scaling (numerical cols)

Creating Column Transformers


In [9]: # to create a pipeline we have to create estimators
# an estimator is an object with methods such as fit, fit_transform, transform
# a pipeline is composed of estimators
# to create custom estimators we extend the BaseEstimator class -> and define fit and transform
# when we define fit and transform in our class, TransformerMixin will automatically create fit_transform;
# we don't need to write it separately.

# fit - to learn something from data - usually on train data
# transform - to apply the learning done in fit to data - on test data
# fit_transform - to do both - on train data
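A quick sketch of that contract (toy column and values, not from the original notebook):

imp = SimpleImputer(strategy='mean')

train = pd.DataFrame({'Age': [20.0, 30.0, np.nan]})
test = pd.DataFrame({'Age': [np.nan, 40.0]})

imp.fit(train)              # learns the train mean: 25.0
print(imp.transform(test))  # fills the test NaN with 25.0, not with the test mean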

In [10]: # for dropping unnecessary columns we need a custom estimator

# custom estimator
# in this estimator we don't want to learn anything from the data,
# we just want to apply an action,
# so nothing is written under fit;
# the code goes under transform.
# X is a DataFrame

class FeaturesDropper(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None):
        # nothing to learn; fit just returns self
        return self

    def transform(self, X):
        # drop the columns we don't need
        columns = ["PassengerId", "Name", "Ticket", "Cabin"]
        return X.drop(columns=columns)

    def get_feature_names_out(self, input_features=None):
        pass

drop_column_tx = FeaturesDropper()

# built-in estimators generally work with arrays;
# our custom transformer works with DataFrames -
# that's why we write a custom class here even though sklearn
# already ships estimators such as SimpleImputer.
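As a quick sanity check (a sketch, not a cell from the original run), the custom transformer can be used on its own:

drop_column_tx.fit_transform(X_train).head()  # X_train without PassengerId, Name, Ticket, Cabin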

In [11]: # HOW?
# tx = transformer

# 1. initialize a ColumnTransformer object
# 2. pass a list of transformers and the remainder option
# remainder: sometimes we do not tx all the columns, so what happens to those columns?
# two options: 1. drop them  2. keep them as they are -> 'drop' or 'passthrough'

In [12]: # here we are not transforming some of the columns, so using the remainder option we keep them as-is
# we pass each tx in the form of a tuple (see the sketch below)
# the tuple consists of: the name of the tx, the tx object, and the list of columns
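A minimal sketch of that tuple format and the remainder option (toy frame and names, not from the original notebook):

demo = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 4.0], 'c': ['x', 'y']})
ct = ColumnTransformer(transformers=[
    ('impute_a', SimpleImputer(strategy='mean'), ['a'])  # (name, tx object, columns)
], remainder='passthrough')  # 'drop' would discard b and c instead
ct.fit_transform(demo)       # a imputed; b and c passed through unchanged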

In [13]: # creating a column tx for the imputation step

impute_tx = ColumnTransformer(transformers=[

    ('impute_age', SimpleImputer(strategy='mean'), ['Age']),

    ('impute_embarked', SimpleImputer(strategy='most_frequent'), ['Embarked'])

], remainder='passthrough', verbose_feature_names_out=False).set_output(transform='pandas')

In [14]: # creating a tx for encoding

encode_tx = ColumnTransformer(transformers=[

    ('ohe_sex_embarked', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), ['Sex', 'Embarked'])

], remainder='passthrough', verbose_feature_names_out=False).set_output(transform='pandas')
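A small sketch of what handle_unknown='ignore' buys (toy data, not from the original notebook): a category unseen during fit encodes as an all-zero row instead of raising an error.

ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
ohe.fit(pd.DataFrame({'Embarked': ['S', 'C', 'Q']}))
ohe.transform(pd.DataFrame({'Embarked': ['Z']}))  # -> [[0., 0., 0.]]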

In [15]: # scaling
scaling_tx = ColumnTransformer(transformers=[

    ('scale', MinMaxScaler(), slice(0, 1000))

], remainder='passthrough', verbose_feature_names_out=False).set_output(transform='pandas')

# slice(0, 1000) selects columns by position; here it covers every column

# Why not column names?
# -->> one-hot encoding Sex and Embarked generates new columns whose names are not known up front;
#      e.g. Embarked becomes Embarked_C, Embarked_Q, Embarked_S after one-hot encoding,
#      so it is not feasible to hard-code column names in the scaling step.
# also, inside a pipeline data can flow as a numpy array, and arrays have no column names.
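As an alternative sketch, sklearn's make_column_selector can pick columns by dtype at fit time instead of by position (after one-hot encoding every column here is numeric, so it would select them all, just like the slice):

from sklearn.compose import make_column_selector

scaling_tx_alt = ColumnTransformer(transformers=[
    ('scale', MinMaxScaler(), make_column_selector(dtype_include='number'))
], remainder='passthrough', verbose_feature_names_out=False)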

Create Pipeline
In [16]: # what is a pipeline?
# it is a mechanism in sklearn
# which chains together multiple steps so that the o/p of each step is used as the i/p of the next

# pipelines make it easy to apply the same preprocessing to train and test

# if we want to build an ML model, the first step is to clean the data.
# suppose the data has multiple issues -> missing values, outliers, encoding, scaling etc.
# e.g. missing values -> outliers -> encode cat cols -> scale num cols -> train model
# using a pipeline we can chain the above steps
# we just give the i/p at the start of the pipeline; at the end the pipeline produces the o/p

# performing steps individually vs performing them as a chain (see the sketch below):

# when we deploy our model on a server,
# whenever input data comes in, we need to apply the same preprocessing steps;
# without a pipeline, doing this step by step by hand is quite hectic ->
# a pipeline replays all the steps in one call
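A tiny sketch of that chaining idea (illustration only, reusing the transformers defined above):

manual  = impute_tx.fit_transform(drop_column_tx.fit_transform(X_train))
chained = Pipeline([('drop', drop_column_tx), ('impute', impute_tx)]).fit_transform(X_train)
# 'manual' and 'chained' hold the same frame: each step's output feeds the next step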

In [17]: # pass a list of tuples to the Pipeline

# each tuple consists of two things:
# 1. the step name  2. the tx object

pipe = Pipeline([
    ('drop_column_tx', drop_column_tx),
    ('impute_tx', impute_tx),
    ('encode_tx', encode_tx),
    ('scaling_tx', scaling_tx)
], verbose=True).set_output(transform='pandas')

In [18]: pipe.fit_transform(X_train)

[Pipeline] .... (step 1 of 4) Processing drop_column_tx, total= 0.0s
[Pipeline] ......... (step 2 of 4) Processing impute_tx, total= 0.0s
[Pipeline] ......... (step 3 of 4) Processing encode_tx, total= 0.0s
[Pipeline] ........ (step 4 of 4) Processing scaling_tx, total= 0.0s


Out[18]:     Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S       Age  Pclass  SibSp     Parch  Fare
331                 0.0       1.0         0.0         0.0         1.0  0.566474     0.0  0.000  0.000000   ...
733                 0.0       1.0         0.0         0.0         1.0  0.283740     0.5  0.000  0.000000   ...
382                 0.0       1.0         0.0         0.0         1.0  0.396833     1.0  0.000  0.000000   ...
704                 0.0       1.0         0.0         0.0         1.0  0.321438     1.0  0.125  0.000000   ...
813                 1.0       0.0         0.0         0.0         1.0  0.070118     1.0  0.500  0.333333   ...
...                 ...       ...         ...         ...         ...       ...     ...    ...       ...   ...
106                 1.0       0.0         0.0         0.0         1.0  0.258608     1.0  0.000  0.000000   ...
270                 0.0       1.0         0.0         0.0         1.0  0.365404     0.0  0.000  0.000000   ...
860                 0.0       1.0         0.0         0.0         1.0  0.509927     1.0  0.250  0.000000   ...
435                 1.0       0.0         0.0         0.0         1.0  0.170646     0.0  0.125  0.333333   ...
102                 0.0       1.0         0.0         0.0         1.0  0.258608     0.0  0.000  0.166667   ...

712 rows × 10 columns

In [19]: # saving the pipeline as a pickle file

with open('titanic_EDA.pkl', 'wb') as f:
    pickle.dump(pipe, f)

In [20]: # loading the pickle file

with open('titanic_EDA.pkl', 'rb') as f:
    titanic_eda = pickle.load(f)

In [21]: # transforming the test data

titanic_eda.transform(X_test)

Out[21]:     Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S       Age  Pclass  SibSp     Parch  Fare
709                 0.0       1.0         1.0         0.0         0.0  0.365404     1.0  0.125  0.166667   ...
439                 0.0       1.0         0.0         0.0         1.0  0.384267     0.5  0.000  0.000000   ...
840                 0.0       1.0         0.0         0.0         1.0  0.246042     1.0  0.000  0.000000   ...
720                 1.0       0.0         0.0         0.0         1.0  0.070118     0.5  0.000  0.166667   ...
39                  1.0       0.0         1.0         0.0         0.0  0.170646     1.0  0.125  0.000000   ...
...                 ...       ...         ...         ...         ...       ...     ...    ...       ...   ...
433                 0.0       1.0         0.0         0.0         1.0  0.208344     1.0  0.000  0.000000   ...
773                 0.0       1.0         1.0         0.0         0.0  0.365404     1.0  0.000  0.000000   ...
25                  1.0       0.0         0.0         0.0         1.0  0.472229     1.0  0.125  0.833333   ...
84                  1.0       0.0         0.0         0.0         1.0  0.208344     0.5  0.000  0.000000   ...
10                  1.0       0.0         0.0         0.0         1.0  0.044986     1.0  0.125  0.166667   ...

179 rows × 10 columns


Additional information from pipe object


In [22]: # we can inspect the pipe object

# named_steps lists all the steps that this pipeline follows

pipe.named_steps

Out[22]:
{'drop_column_tx': FeaturesDropper(),
'impute_tx': ColumnTransformer(remainder='passthrough',
transformers=[('impute_age', SimpleImputer(), ['Age']),
('impute_embarked',
SimpleImputer(strategy='most_frequent'),
['Embarked'])],
verbose_feature_names_out=False),
'encode_tx': ColumnTransformer(remainder='passthrough',
transformers=[('ohe_sex_embarked',
OneHotEncoder(handle_unknown='ignore',
sparse_output=False),
['Sex', 'Embarked'])],
verbose_feature_names_out=False),
'scaling_tx': ColumnTransformer(remainder='passthrough',
transformers=[('scale', MinMaxScaler(),
slice(0, 1000, None))],
verbose_feature_names_out=False)}

In [23]: # if we want to see how this pipeline works and what was learned
pipe.named_steps['impute_tx'].transformers_

Out[23]:
[('impute_age', SimpleImputer(), ['Age']),
 ('impute_embarked', SimpleImputer(strategy='most_frequent'), ['Embarked']),
 ('remainder', 'passthrough', [0, 1, 3, 4, 5])]

In [24]: # we can see above the transformers in impute_tx -> impute_age, impute_embarked
# let's check what the tx learned
pipe.named_steps['impute_tx'].transformers_[0][1].statistics_

Out[24]: array([29.49884615])

In [25]: # the learning of impute_embarked

pipe.named_steps['impute_tx'].transformers_[1][1].statistics_

Out[25]: array(['S'], dtype=object)

In [26]: # using the methods above we can inspect the learning and the parameters

# we can run cross-validation on the pipeline
# we can also do hyperparameter tuning on the pipeline (see the sketch below)
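A sketch of both ideas (the classifier and parameter grid below are assumptions, not part of the original notebook):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV

# append an (assumed) model so the pipeline goes from raw frame to prediction
model_pipe = Pipeline([('preprocess', pipe), ('clf', LogisticRegression(max_iter=1000))])

# cross-validation on the whole pipeline: preprocessing is re-fit inside each fold,
# so no statistics leak from the validation fold into the preprocessing steps
print(cross_val_score(model_pipe, X_train, y_train, cv=5).mean())

# hyperparameter tuning: step names double as parameter prefixes ('clf__C' targets the model)
grid = GridSearchCV(model_pipe, param_grid={'clf__C': [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)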
