EDA Pipeline Final
# for analysis
import pandas as pd
import numpy as np
# for preprocessing
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
# for pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
In [3]: # issue
## different features have different issues.
## e.g. one feature has missing values, another needs to be scaled.
## handling each of them separately is a hectic task.
## different processes produce their own numpy arrays.
## at the end we need to combine them.
## if the number of columns and processes is large, this becomes a hectic task.
## whatever transformation we do on the train side needs to be replicated on the test side.
## whenever new data comes in for prediction -> we need to modify the incoming data with exactly the same transformations
## COLUMN TRANSFORMER
## a class in sklearn
## using this, the whole preprocessing can be done in a few lines of code
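To make the point above concrete, here is a minimal sketch of a `ColumnTransformer` handling two different issues in one object. The toy DataFrame and the names `toy`, `toy_tx`, and `toy_out` are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: "age" has a missing value, "fare" needs scaling.
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "fare": [10.0, 20.0, 30.0],
})

# Each entry is (name, transformer, columns). The per-column results are
# combined side by side, so no manual stacking of arrays is needed.
toy_tx = ColumnTransformer(transformers=[
    ("impute_age", SimpleImputer(strategy="mean"), ["age"]),
    ("scale_fare", MinMaxScaler(), ["fare"]),
])

toy_out = toy_tx.fit_transform(toy)  # NumPy array with one column per transformer
print(toy_out)
```

The same fitted `toy_tx` can then be applied to unseen data with `toy_tx.transform(...)`, which reuses the mean and the min/max learned during `fit_transform`.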
file:///C:/Users/User/Downloads/EDA_Pipeline_Final.html 1/7
1/31/24, 6:56 PM EDA_Pipeline_Final
Out[4]:
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Em
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN
Out[5]: (891, 12)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
plan
Basic Pipeline structure
Dropping unnecessary columns -> Imputing missing values (Age, Embarked) -> One-hot encoding (Sex, Embarked) -> Scaling (numerical cols)
    # code will come under transform
    # X is dataframe
    def get_feature_names_out(self):
        pass

drop_column_tx = FeaturesDropper()
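The full `FeaturesDropper` class is not visible in this export, so the following is only a sketch of what a column-dropping transformer can look like. The default column list, the `kept_columns_` attribute, and the toy frame are assumptions. Unlike the `pass` stub above, this version returns real names from `get_feature_names_out`; in recent scikit-learn versions the presence of that method is what lets `set_output` be configured on a custom transformer.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FeaturesDropper(BaseEstimator, TransformerMixin):
    """Drop a fixed list of columns from a DataFrame (illustrative sketch)."""

    def __init__(self, columns=("PassengerId", "Name", "Ticket", "Cabin")):
        # Assumed default: the notebook does not show which columns it drops.
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing is learned from the data; just remember the surviving columns.
        self.kept_columns_ = [c for c in X.columns if c not in self.columns]
        return self

    def transform(self, X):
        # X is a DataFrame; listed columns that are absent are ignored.
        return X.drop(columns=list(self.columns), errors="ignore")

    def get_feature_names_out(self, input_features=None):
        # Names of the columns that remain after dropping.
        return np.asarray(self.kept_columns_, dtype=object)


# Hypothetical toy frame to exercise the transformer.
toy = pd.DataFrame({"PassengerId": [1, 2], "Name": ["a", "b"], "Age": [22.0, 38.0]})
drop_column_tx = FeaturesDropper()
dropped = drop_column_tx.fit_transform(toy)
print(list(dropped.columns))
```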
In [11]: # HOW?
# tx - transformer
In [12]: # here we are doing nothing with age, so using the remainder option we are going to keep it
# we pass each tx in the form of a tuple
# the tuple consists of the name of the tx, the tx object, and the list of columns
],remainder="passthrough", verbose_feature_names_out=False).set_output(transform='pand
], remainder="passthrough",verbose_feature_names_out=False).set_output(transform='pand
In [15]: # Scaling
scaling_tx = ColumnTransformer(transformers=[
# Why not column names?
# -->> during one-hot encoding of Embarked and Sex, new columns are generated with unknown names
# e.g. Embarked will become Embarked_C Embarked_Q Embarked_S after one-hot encoding
# so it is not feasible to use column names in scaling
# by default, data flows through a pipeline as numpy arrays, and arrays have no column names
Create Pipeline
In [16]: # what is a pipeline?
# it is a mechanism in sklearn
# which chains together multiple steps so that the o/p of each step is used as the i/p to the next step
# pipelines make it easy to apply the same preprocessing to train and test
pipe = Pipeline([
('drop_column_tx',drop_column_tx),
('impute_tx', impute_tx),
('encode_tx', encode_tx),
('scaling_tx', scaling_tx)
], verbose=True,).set_output(transform='pandas')
In [18]: pipe.fit_transform(X_train)
Out[18]: Sex_female Sex_male Embarked_C Embarked_Q Embarked_S Age Pclass SibSp Parch
331 0.0 1.0 0.0 0.0 1.0 0.566474 0.0 0.000 0.000000
733 0.0 1.0 0.0 0.0 1.0 0.283740 0.5 0.000 0.000000
382 0.0 1.0 0.0 0.0 1.0 0.396833 1.0 0.000 0.000000
704 0.0 1.0 0.0 0.0 1.0 0.321438 1.0 0.125 0.000000
813 1.0 0.0 0.0 0.0 1.0 0.070118 1.0 0.500 0.333333
... ... ... ... ... ... ... ... ... ...
106 1.0 0.0 0.0 0.0 1.0 0.258608 1.0 0.000 0.000000
270 0.0 1.0 0.0 0.0 1.0 0.365404 0.0 0.000 0.000000
860 0.0 1.0 0.0 0.0 1.0 0.509927 1.0 0.250 0.000000
435 1.0 0.0 0.0 0.0 1.0 0.170646 0.0 0.125 0.333333
102 0.0 1.0 0.0 0.0 1.0 0.258608 0.0 0.000 0.166667
Out[21]: Sex_female Sex_male Embarked_C Embarked_Q Embarked_S Age Pclass SibSp Parch
709 0.0 1.0 1.0 0.0 0.0 0.365404 1.0 0.125 0.166667
439 0.0 1.0 0.0 0.0 1.0 0.384267 0.5 0.000 0.000000
840 0.0 1.0 0.0 0.0 1.0 0.246042 1.0 0.000 0.000000
720 1.0 0.0 0.0 0.0 1.0 0.070118 0.5 0.000 0.166667
... ... ... ... ... ... ... ... ... ...
433 0.0 1.0 0.0 0.0 1.0 0.208344 1.0 0.000 0.000000
773 0.0 1.0 1.0 0.0 0.0 0.365404 1.0 0.000 0.000000
Out[22]:
{'drop_column_tx': FeaturesDropper(),
 'impute_tx': ColumnTransformer(remainder='passthrough',
                   transformers=[('impute_age', SimpleImputer(), ['Age']),
                                 ('impute_embarked',
                                  SimpleImputer(strategy='most_frequent'),
                                  ['Embarked'])],
                   verbose_feature_names_out=False),
 'encode_tx': ColumnTransformer(remainder='passthrough',
                   transformers=[('ohe_sex_embarked',
                                  OneHotEncoder(handle_unknown='ignore',
                                                sparse_output=False),
                                  ['Sex', 'Embarked'])],
                   verbose_feature_names_out=False),
 'scaling_tx': ColumnTransformer(remainder='passthrough',
                   transformers=[('scale', MinMaxScaler(),
                                  slice(0, 1000, None))],
                   verbose_feature_names_out=False)}
In [23]: # to see how this pipeline works and what it learned during fit:
pipe.named_steps['impute_tx'].transformers_
In [24]: # above we can see the transformers in impute_tx -> impute_age, impute_embarked
# let's check what each tx learned
pipe.named_steps['impute_tx'].transformers_[0][1].statistics_
Out[24]: array([29.49884615])
Out[25]: array(['S'], dtype=object)
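The same inspection can be tried on a small fitted pipeline. This is a sketch on hypothetical toy data; `toy_pipe`, `age_fill`, and `embarked_fill` are names introduced here. It also shows `named_transformers_`, which looks up a fitted transformer by name instead of by position.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Hypothetical toy data with one gap per column.
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Embarked": ["S", np.nan, "S"],
})

toy_pipe = Pipeline([
    ("impute_tx", ColumnTransformer(transformers=[
        ("impute_age", SimpleImputer(), ["Age"]),
        ("impute_embarked", SimpleImputer(strategy="most_frequent"), ["Embarked"]),
    ])),
])
toy_pipe.fit(toy)

# transformers_ holds (name, fitted transformer, columns) tuples.
fitted = toy_pipe.named_steps["impute_tx"].transformers_
age_fill = fitted[0][1].statistics_       # mean of the observed ages
embarked_fill = fitted[1][1].statistics_  # most frequent port

# Lookup by name avoids depending on the order of the transformers.
age_imputer = toy_pipe.named_steps["impute_tx"].named_transformers_["impute_age"]
print(age_fill, embarked_fill, age_imputer.statistics_)
```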