Data Analysis: Data Preparation

4/2/20
Data Mining
ITERA
Semester II 2019/2020

Data Preparation
The data preparation stage comprises all activities used to construct the data set that will be used in the modeling stage.
These include
• Cleansing data: data types
• Selecting data (sampling, feature selection)
• Constructing (derived) data (feature engineering)
• Combining data from multiple sources
• Formatting data (encoding categorical data)
Cleansing Data with Python

1. Determine correct types for incorrect type assignment
df[col] = df[col].astype(types)
2. Handle missing values (drop or replace values)
df[col].fillna() or df.dropna()
3. Exclude outliers data (see filtering data)
df[filtering outlier conditions]

Handling Incorrect Type Assignment

df[col] = df[col].astype(types)
• If missing values are represented by “?”, numeric attributes are assigned as object. We should handle missing values before converting to the correct types.
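A minimal sketch of steps 1 and 2, assuming a small made-up frame (the column names and values below are illustrative, not taken from the real dataset): the “?” markers are first turned into NaN, and only then is the column converted with astype.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): "?" stands for a missing value, so the
# numeric columns are loaded as strings (object dtype).
df = pd.DataFrame({
    "bore": ["3.47", "?", "2.68"],
    "horsepower": ["111", "154", "?"],
    "fuel_type": ["gas", "gas", "diesel"],
})

numeric_cols = ["bore", "horsepower"]  # columns we know should be numeric
for col in numeric_cols:
    # Handle the "?" marker first, then convert to the correct type.
    df[col] = df[col].replace("?", np.nan).astype("float64")

print(df.dtypes)
```

Calling astype("float64") directly on a column that still contains “?” raises a ValueError, which is why the replacement comes first.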
Automobile Dataset

- 6 continuous attributes are loaded as object attribute (not float64), because they have “?”
...
1. normalized-losses: continuous from 65 to 256.
...
18. bore: continuous from 2.54 to 3.94.
19. stroke: continuous from 2.07 to 4.17.
...
21. horsepower: continuous from 48 to 288.
22. peak-rpm: continuous from 4150 to 6600.
...
25. price: continuous from 5118 to 45400.

Exercise 1: Incorrect Type

Write a program to assign the right type for each incorrect type assignment.
Hints: df[col] = df[col].astype(types)
Drop Missing Values

DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
df.dropna() #Drop the rows where at least one element is missing.
df.dropna(inplace=True) #Keep the DataFrame with valid entries in the same variable.
df.dropna(axis='columns') #Drop the columns where at least 1 element is missing.
df.dropna(how='all') #Drop the rows where all elements are missing.
df.dropna(thresh=2) #Keep only the rows with at least 2 non-NA values.
df.dropna(subset=[col1, col2]) #Define in which columns to look for missing values.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

Drop Missing Values

df[5] = df[5].replace('?', np.nan)
df_drop=df.dropna()
df_drop.shape[0]
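To see how each dropna option changes the shape, here is a small sketch on a made-up 4 x 3 frame (values chosen only for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: 4 rows, 3 columns.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, 3.0, np.nan],
})

rows_any = df.dropna()                  # only fully observed rows remain
cols_any = df.dropna(axis="columns")    # only fully observed columns remain
rows_all = df.dropna(how="all")         # drops rows that are entirely NaN
rows_thresh = df.dropna(thresh=2)       # keeps rows with >= 2 non-NA values
rows_subset = df.dropna(subset=["a"])   # only column "a" is checked

for name, result in [("any", rows_any), ("columns", cols_any),
                     ("all", rows_all), ("thresh=2", rows_thresh),
                     ("subset", rows_subset)]:
    print(name, result.shape)
```

Every column here contains at least one NaN, so axis="columns" removes all of them; comparing shapes like this is a quick way to judge how much data each option would discard.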
Exercise 2: Drop Missing Values

Investigate dataset shape from these commands
df.dropna(axis='columns') #Drop the columns where at least 1 element is missing.
df.dropna(how='all') #Drop the rows where all elements are missing.
df.dropna(thresh=24) #Keep only the rows with at least 24 non-NA values.

Replace Missing Values

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
df[col] = df[col].fillna(value) #Replace null with value
df.fillna(0) #Replace all NaN elements with 0s.
df.fillna(value={'A': 0, 'B': 1, 'C': 2, 'D': 3}) #Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1, 2, and 3 respectively.
df.fillna(value=values, limit=1) #Only replace the first NaN element.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
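The fillna variants above can be tried on a tiny made-up frame (column names and values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values.
df = pd.DataFrame({
    "A": [np.nan, 2.0, np.nan],
    "B": [1.0, np.nan, np.nan],
})

filled_zero = df.fillna(0)                        # every NaN becomes 0
filled_dict = df.fillna(value={"A": 0, "B": 1})   # a different value per column
filled_once = df.fillna(value={"A": 0, "B": 1}, limit=1)  # first NaN per column only

# Assigning the result back is a dependable way to update one column.
df["A"] = df["A"].fillna(df["A"].mean())
print(df)
```

Assigning back (df[col] = df[col].fillna(...)) is used here instead of inplace=True on a selected column, because the in-place form on a column selection may not update the parent frame in recent pandas versions.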
Exercise 3: Replace Missing Values

A. For the automobile dataset with correct type assignment, write a program to replace all missing values with the mean (for numeric attributes) and the mode (for categorical attributes).
B. Compare the descriptive statistics of each column before and after replacing missing values, and report the differences.

Exercise 3B: Answer
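One possible approach, sketched on a made-up two-column frame rather than the automobile data (so this is not the original answer slide): fill numeric columns with the mean and other columns with the mode, then compare describe() before and after.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: one numeric and one categorical column.
df = pd.DataFrame({
    "price": [10.0, 20.0, np.nan, 30.0],
    "body": ["sedan", None, "sedan", "wagon"],
})

before = df.describe()  # numeric statistics before filling

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())     # numeric: mean
    else:
        df[col] = df[col].fillna(df[col].mode()[0])  # categorical: most frequent value

after = df.describe()
print(before.loc[["count", "mean", "std"]])
print(after.loc[["count", "mean", "std"]])
```

In this toy case the count goes up and the mean stays the same, while the standard deviation shrinks, because the filled value sits exactly at the mean.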
Selecting Data with Python

DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
Drop specified labels from rows or columns.
df.drop(colList, axis=1) #Drop columns
df.drop(columns=colList) #Drop columns
df.drop(rowList) #Drop rows by index list
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html

Selecting Data : Feature Selection

In selecting data, feature selection methods will be discussed next week.
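A short sketch of the drop calls on a made-up frame (names and values are illustrative only):

```python
import pandas as pd

# Toy frame with made-up values.
df = pd.DataFrame({
    "make": ["audi", "bmw", "volvo"],
    "doors": [4, 2, 4],
    "price": [13950, 16430, 12940],
})

no_price = df.drop(columns=["price"])   # same result as df.drop(["price"], axis=1)
last_row_only = df.drop([0, 1])         # drop rows by index label
unchanged = df.drop(columns=["not_there"], errors="ignore")  # unknown label is skipped

print(no_price.columns.tolist(), last_row_only.index.tolist(), unchanged.shape)
```

drop returns a new DataFrame by default, so df itself still has all three rows and columns afterwards.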
Feature Extraction

isLuxury is a new feature that will be true if a car has front-engine (attribute 8 is front) and rear-wheel drive (attribute 7 is rwd).
df['isLuxury'] = df.apply(lambda row: 1 if ((row[7]=='rwd') and (row[8]=='front')) else 0, axis=1)
df['isLuxury'] = df['isLuxury'].astype("category")
df.head()

Exercise 4: Add New Feature

1. Add new feature 'fueleconomy' that has values 0.55*citympg (i.e. attribute 23) + 0.45*highwaympg (i.e. attribute 24)
2. Identify one new feature, and add it to our dataframe. Write as a comment why we need this new feature.
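The same kind of derived feature can also be built without apply. The sketch below assumes named columns (drive_wheels, engine_location, city_mpg, highway_mpg are hypothetical names; the slides address columns by position) and made-up values:

```python
import pandas as pd

# Toy frame with made-up values and hypothetical column names.
df = pd.DataFrame({
    "drive_wheels": ["rwd", "fwd", "rwd"],
    "engine_location": ["front", "front", "rear"],
    "city_mpg": [21, 30, 17],
    "highway_mpg": [27, 38, 25],
})

# Vectorized version of the row-wise rule: rwd AND front-engine -> 1, else 0.
is_luxury = (df["drive_wheels"] == "rwd") & (df["engine_location"] == "front")
df["isLuxury"] = is_luxury.astype(int).astype("category")

# Weighted combination of two numeric columns, as in the fueleconomy formula.
df["fueleconomy"] = 0.55 * df["city_mpg"] + 0.45 * df["highway_mpg"]
print(df)
```

Column-wise expressions like these usually run faster than apply with axis=1 on large frames and give the same values.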
Data Normalization

MinMaxScaler: z = (x - min) / (max - min)
Standardize: z = (x - μ) / σ
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

MinMaxScaler

from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
newdf = df.select_dtypes(include=['float64'])
newdf_scaled = min_max_scaler.fit_transform(newdf)
newdf = pd.DataFrame(newdf_scaled)
newdf.head()
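A small self-contained check of min-max scaling, assuming a made-up one-column frame that holds only a minimum, an interior value and a maximum. Unlike the snippet above, this sketch also passes the original column names back to the new DataFrame:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up column: minimum, an interior value, and maximum.
newdf = pd.DataFrame({"bore": [2.54, 3.47, 3.94]})

scaler = MinMaxScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(newdf),   # returns a NumPy array
    columns=newdf.columns,         # keep the original column names
    index=newdf.index,
)
print(scaled["bore"].round(3).tolist())
```

The interior value 3.47 maps to (3.47 - 2.54) / (3.94 - 2.54), which is about 0.664, while the minimum and maximum map to 0 and 1.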
MinMaxScaler

z = (3.47 - 2.54) / (3.94 - 2.54) = 0.664

StandardScaler

from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
df_std = df.select_dtypes(include=['float64'])
df_std = df_std.dropna()
df_std=sc.fit_transform(df_std)
df_std = pd.DataFrame(df_std)
df_std.head()
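To connect StandardScaler with the formula z = (x - μ) / σ, the sketch below compares its output with a manual computation on a made-up column. One detail worth knowing: StandardScaler divides by the population standard deviation (ddof=0), whereas pandas' std() and describe() default to the sample version (ddof=1), so hand calculations based on describe() can differ slightly.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values, for illustration only.
df_std = pd.DataFrame({"bore": [2.54, 3.47, 3.94]})

sc = StandardScaler()
scaled = sc.fit_transform(df_std)

# Manual z-score using the population standard deviation (ddof=0).
manual = (df_std - df_std.mean()) / df_std.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
```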
StandardScaler

z = (x - μ) / σ
z = (3.47 - 3.329751) / 0.273539 = 0.513

Data Formatting

• Categorical encoding: converts a categorical attribute into a numeric attribute
• Binning / discretization: converts a numeric attribute into a categorical attribute
Categorical Encoding

01 Direct encoding
● For attributes that represent numeric values
● Example: num_doors: two, four → {2,4}
02 Label encoding
● For ordinal attributes
● Example: size {S,M,L} → {1,2,3}.
03 One-hot encoding
● For multivalues attributes
● Example: city {BDO, JKT, ...} will be represented by city_BDO, city_JKT, etc

Categorical Encoding: Examples

Atr1: direct encoding
Atr2: label encoding
Atr3: one-hot encoding

atr1  atr2  atr3
two   S     BDO
four  M     JKT
one   L     DPS
four  M     BDO

atr1  atr2  atr3_BDO  atr3_JKT  atr3_DPS
2     1     1         0         0
4     2     0         1         0
1     3     0         0         1
4     2     1         0         0
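The three encodings can be reproduced on the small example table with plain pandas. This is a sketch under the assumption that explicit mapping dictionaries are acceptable for the direct and label encodings:

```python
import pandas as pd

# The four-row example table from the slide.
df = pd.DataFrame({
    "atr1": ["two", "four", "one", "four"],
    "atr2": ["S", "M", "L", "M"],
    "atr3": ["BDO", "JKT", "DPS", "BDO"],
})

df["atr1"] = df["atr1"].map({"one": 1, "two": 2, "four": 4})  # direct encoding
df["atr2"] = df["atr2"].map({"S": 1, "M": 2, "L": 3})         # label encoding (ordinal)
df = pd.get_dummies(df, columns=["atr3"], dtype=int)          # one-hot encoding

# get_dummies orders the new columns alphabetically; reorder to match the slide.
df = df[["atr1", "atr2", "atr3_BDO", "atr3_JKT", "atr3_DPS"]]
print(df)
```

Writing the mapping for atr2 by hand keeps the order S < M < L, which an automatic alphabetical encoder would not preserve.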
Atr1: direct encoding
Atr2, Atr3: one-hot encoding

atr1  atr2  atr3
two   S     BDO
four  M     JKT
one   L     DPS
four  M     BDO

atr1  atr2_S  atr2_M  atr2_L  atr3_BDO  atr3_JKT  atr3_DPS
2     1       0       0       1         0         0
4     0       1       0       0         1         0
1     0       0       1       0         0         1
4     0       1       0       1         0         0

Direct Encoding

df[5]=df[5].replace({'two': 2, 'four': 4})
Label Encoding in Python

for i in range(len(df.columns)):
    if (df[i].dtypes=='object'):
        df[i] = df[i].astype('category')
        df[i] = df[i].cat.codes
df.head()

Label Encoding: Label Encoder

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
for i in range(len(df.columns)):
    if (df[i].dtypes=='object'):
        le.fit(df[i])
        df[i] = le.transform(df[i])
df.head()
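Both approaches above assign codes in sorted (alphabetical) order of the labels, which does not necessarily match the intended ordinal order. A minimal sketch on a made-up size column, with an explicit category order as one possible remedy:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

size = pd.Series(["S", "M", "L", "M"])  # made-up ordinal attribute

codes_sklearn = LabelEncoder().fit_transform(size)   # classes sorted: L, M, S
codes_pandas = size.astype("category").cat.codes     # same alphabetical order

# State the order S < M < L explicitly so the codes follow it.
ordered = pd.Categorical(size, categories=["S", "M", "L"], ordered=True)
codes_ordered = ordered.codes + 1                     # shift to 1, 2, 3

print(list(codes_sklearn), codes_pandas.tolist(), list(codes_ordered))
```

With alphabetical codes, L gets the smallest number and S the largest, so the explicit order matters whenever the numeric codes are meant to reflect size.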
One-hot Encoding

pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
Convert categorical variable into dummy/indicator variables
pd.get_dummies(df, columns=['col1', 'col2'])

One-hot Encoding

df_encode = pd.get_dummies(data=df, columns=[2])
df_encode.head()
Exercise 6: Categorical Encoding

A. For each categorical attribute, identify the most appropriate categorical encoding.
B. Write a program to encode those categorical attributes.

Binning / Discretization

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')
Bin values into discrete intervals.
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.cut.html
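The effect of the right parameter is easiest to see on values that fall exactly on a bin edge. A sketch with made-up values, bins and labels:

```python
import pandas as pd

values = pd.Series([1, 5, 10, 15])  # made-up values; 5 and 10 sit on bin edges
bins = [0, 5, 10]
labels = ["low", "high"]

# Default right=True: intervals are (0, 5] and (5, 10].
right_closed = pd.cut(values, bins, labels=labels)

# right=False: intervals are [0, 5) and [5, 10).
left_closed = pd.cut(values, bins, right=False, labels=labels)

print(right_closed.tolist())
print(left_closed.tolist())
```

Values outside every interval (15 in both cases, and 10 when right=False) come back as NaN rather than raising an error, so it is worth checking for new missing values after binning.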
Exercise 7: Binning

Execute this program for the Automobile dataset, and explain its output.
bins = [0,10000,20000,40000]
car_bin=['Budget','Medium','Highend']
df['carsrange'] = pd.cut(df[25],bins,right=False,labels=car_bin)
df.head(10)

Export Data to CSV

DataFrame.to_csv(path_or_buf=None, sep=',', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, mode='w', encoding=None, compression='infer', quoting=None, quotechar='"', line_terminator=None, chunksize=None, tupleize_cols=None, date_format=None, doublequote=True, escapechar=None, decimal='.')
Write object to a comma-separated values (csv) file.
df.to_csv(index=False)
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
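A minimal round-trip sketch, assuming a made-up frame and using an in-memory buffer instead of a file on disk. When no path is given, to_csv returns the CSV text as a string:

```python
import io
import pandas as pd

# Toy frame with made-up values; one price is missing.
df = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950.0, None]})

# index=False leaves out the row index; na_rep controls how NaN is written.
csv_text = df.to_csv(index=False, na_rep="?")
print(csv_text)

# Read it back, telling pandas that "?" marks a missing value.
restored = pd.read_csv(io.StringIO(csv_text), na_values="?")
print(restored.dtypes)
```

Passing a file name as the first argument, for example df.to_csv("prepared.csv", index=False) with a hypothetical file name, writes the same text to disk.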
Exercise 8: Data Preparation

1. Load white Wine Quality dataset
(https://archive.ics.uci.edu/ml/datasets/wine+quality)
df = pd.read_csv("winequality-white.csv",sep=';')
2. Based on analysis in data understanding phase, conduct
data preparation to improve its data quality.
3. Save dataframe from data preparation into new csv file