Data Analysis: Data Preparation
Data Mining
ITERA
Semester II 2019/2020

Data Preparation
The data preparation stage comprises all activities used to construct the data set that will be used in the modeling stage.
These include
• Cleansing data: data types
• Selecting data (sampling, feature selection)
• Constructing (derived) data (feature engineering)
• Combining data from multiple sources
• Formatting data (encoding categorical data)
4/2/20
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
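As a minimal sketch of how `DataFrame.dropna` can be used to remove missing values, assuming a small made-up table (the column names and values below are illustrative, not the course data set):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN / None mark missing values.
df = pd.DataFrame({
    "doors": [2, 4, 4, 4],
    "price": [13495.0, np.nan, 16500.0, 13950.0],
    "horsepower": [111.0, 111.0, np.nan, 102.0],
    "make": ["alfa", "alfa", "audi", None],
})

# Drop every row that contains at least one missing value.
rows_dropped = df.dropna()

# Drop a row only when a specific column is missing.
price_known = df.dropna(subset=["price"])

# Drop every column that contains at least one missing value.
cols_dropped = df.dropna(axis=1)

print(rows_dropped)
print(price_known)
print(cols_dropped)
```

`dropna` returns a new DataFrame by default, so `df` itself is left unchanged.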
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
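A short sketch of filling missing values with `fillna`, assuming a made-up table. Using the mean for a numeric column and the most frequent value for a categorical column is one common choice, shown here only as an illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
df = pd.DataFrame({
    "horsepower": [100.0, np.nan, 140.0],
    "fuel": ["gas", None, "gas"],
})

# Numeric column: replace NaN with the column mean (NaN is skipped by mean()).
mean_hp = df["horsepower"].mean()
df["horsepower"] = df["horsepower"].fillna(mean_hp)

# Categorical column: replace missing entries with the most frequent value.
most_common = df["fuel"].mode()[0]
df["fuel"] = df["fuel"].fillna(most_common)

print(df)
```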
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
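A minimal sketch of `DataFrame.drop`, which removes rows or columns by label. The table and its column names are assumptions made for the example:

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "price": [10.0, 12.5, 9.0],
    "note": ["a", "b", "c"],
})

# Remove a column that is not needed for modeling.
without_note = df.drop(columns=["note"])

# Remove a row by its index label.
without_first = df.drop(index=[0])

print(without_note)
print(without_first)
```

Like `dropna`, `drop` returns a new DataFrame unless `inplace=True` is passed.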
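MinMaxScaler:
from sklearn import preprocessing
import pandas as pd
# df is the DataFrame being prepared; scale each float column to [0, 1].
min_max_scaler = preprocessing.MinMaxScaler()
newdf = df.select_dtypes(include=['float64'])
newdf_scaled = min_max_scaler.fit_transform(newdf)
# fit_transform returns a NumPy array; wrap it again and keep the column names.
newdf = pd.DataFrame(newdf_scaled, columns=newdf.columns)
newdf.head()
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

Standardize: z = (x - μ) / σ, where μ is the mean and σ the standard deviation of the attribute
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html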
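The standardization formula z = (x - μ) / σ can also be computed by hand, which makes it easier to see what the scaler does. This is a sketch on a made-up column. It uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses, whereas pandas' `std()` defaults to `ddof=1`:

```python
import pandas as pd

# Hypothetical numeric attribute.
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()             # mean of the attribute
sigma = x.std(ddof=0)     # population standard deviation
z = (x - mu) / sigma      # standardized values

print(mu, sigma)
print(z.tolist())
```

After standardization the column has mean 0 and (population) standard deviation 1.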
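MinMaxScaler
Example, using (x - min) / (max - min): Z = (3.47 - 2.54) / (3.94 - 2.54) = 0.664

StandardScaler
from sklearn.preprocessing import StandardScaler
import pandas as pd
sc = StandardScaler()
df_std = df.select_dtypes(include=['float64'])
# Drop rows with missing values before standardizing.
df_std = df_std.dropna()
# fit_transform returns a NumPy array; keep the column names and row index.
df_std = pd.DataFrame(sc.fit_transform(df_std), columns=df_std.columns, index=df_std.index)
df_std.head()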
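The worked value 0.664 can be checked with a few lines of pandas. This sketch assumes 2.54 and 3.94 are the minimum and maximum of the attribute, as the form of the calculation suggests. The value 3.19 is a made-up filler so the column has more than three entries:

```python
import pandas as pd

# 2.54 (min), 3.47 and 3.94 (max) come from the worked example; 3.19 is hypothetical.
x = pd.Series([2.54, 3.19, 3.47, 3.94])

# Min-max scaling: (x - min) / (max - min) maps the column onto [0, 1].
scaled = (x - x.min()) / (x.max() - x.min())

print(round(scaled[2], 3))
```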
01 Direct encoding
● For attributes that represent numeric values
● Example: num_doors: two, four → {2,4}

02 Label encoding
● For ordinal attributes
● Example: size {S,M,L} → {1,2,3}.

Original data:
atr1    atr2    atr3
two     S       BDO
four    M       JKT
one     L       DPS
four    M       BDO

Encoded data:
atr1    atr2    atr3_BDO    atr3_JKT    atr3_DPS
2       1       1           0           0
4       2       0           1           0
1       3       0           0           1
4       2       1           0           0

Atr1: direct encoding
Atr2: label encoding
Atr3: one-hot encoding
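The three encodings in the example table can be reproduced with pandas. This is one possible sketch, using `Series.map` with explicit dictionaries for the direct and label encodings and `pd.get_dummies` for the one-hot encoding. The mapping dictionaries are written out by hand here as an assumption about how the values were assigned:

```python
import pandas as pd

# The four example rows from the table.
df = pd.DataFrame({
    "atr1": ["two", "four", "one", "four"],
    "atr2": ["S", "M", "L", "M"],
    "atr3": ["BDO", "JKT", "DPS", "BDO"],
})

# Direct encoding: the words already denote numbers.
df["atr1"] = df["atr1"].map({"one": 1, "two": 2, "four": 4})

# Label encoding: ordinal categories mapped to increasing integers.
df["atr2"] = df["atr2"].map({"S": 1, "M": 2, "L": 3})

# One-hot encoding: one 0/1 column per category of atr3.
encoded = pd.get_dummies(df, columns=["atr3"], dtype=int)

# get_dummies sorts the new columns alphabetically; reorder to match the table.
encoded = encoded[["atr1", "atr2", "atr3_BDO", "atr3_JKT", "atr3_DPS"]]

print(encoded)
```

An explicit dictionary keeps the ordinal order S < M < L under control, which an automatic label encoder that sorts categories alphabetically would not guarantee.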
atr1    atr2_S    atr2_M    atr2_L    atr3_BDO    atr3_JKT    atr3_DPS
2       1         0         0         1           0           0
4       0         1         0         0           1           0
1       0         0         1         0           0           1
4       0         1         0         1           0           0
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.cut.html
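A brief sketch of binning a numeric attribute into categories with `pandas.cut`. The values, bin edges and labels are made up for illustration:

```python
import pandas as pd

# Hypothetical numeric attribute.
hp = pd.Series([48, 70, 95, 110, 160, 262])

# Three equal-width bins between the minimum and maximum value.
equal_width = pd.cut(hp, bins=3, labels=["low", "medium", "high"])

# Explicit bin edges; each interval is open on the left and closed on the right.
custom = pd.cut(hp, bins=[0, 100, 200, 300], labels=["low", "medium", "high"])

print(equal_width.tolist())
print(custom.tolist())
```

With `bins=3` the edges depend on the data range, while explicit edges keep the categories stable when new data arrives.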