Data Analysis: Data Preparation


4/2/20

Data Mining
ITERA
Semester II 2019/2020

Data Preparation
The data preparation stage comprises all activities used
to construct the data set that will be used in the
modeling stage.
These include
• Cleansing data: data types
• Selecting data (sampling, feature selection)
• Constructing (derived) data (feature engineering)
• Combining data from multiple sources
• Formatting data (encoding categorical data)


Cleansing Data with Python

1. Determine correct types for incorrect type assignment
   df[col] = df[col].astype(types)
2. Handle missing values (drop or replace values)
   df[col].fillna() or df.dropna()
3. Exclude outlier data (see filtering data)
   df[filtering outlier conditions]

Handling Incorrect Type Assignment

df[col] = df[col].astype(types)
• If missing values are represented by “?”, numeric
  attributes are loaded as object. We should
  handle missing values before converting to the
  correct types.
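
A minimal sketch of that order of operations, using a small made-up frame rather than the automobile file (the column names and values are illustrative assumptions). The "?" placeholder is first turned into NaN, after which the column converts to float64.

```python
import numpy as np
import pandas as pd

# Toy data standing in for two numeric attributes that contain "?" (values are made up).
df = pd.DataFrame({"bore": ["3.47", "?", "2.68"],
                   "horsepower": ["111", "154", "?"]})

for col in ["bore", "horsepower"]:
    # Step 1: mark the placeholder as a real missing value.
    df[col] = df[col].replace("?", np.nan)
    # Step 2: the conversion now succeeds, because NaN is a valid float.
    df[col] = df[col].astype("float64")

print(df.dtypes)
```

Calling astype("float64") while the "?" strings are still present would raise a ValueError, which is why the replacement comes first.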


Automobile Dataset

- 6 continuous attributes are loaded as object
  attributes (not float64), because they contain “?”
...
1. normalized-losses: continuous from 65 to 256.
...
18. bore: continuous from 2.54 to 3.94.
19. stroke: continuous from 2.07 to 4.17.
...
21. horsepower: continuous from 48 to 288.
22. peak-rpm: continuous from 4150 to 6600.
...
25. price: continuous from 5118 to 45400.

Exercise 1: Incorrect Type

Write a program to assign the right type to each
attribute with an incorrect type assignment.
Hint: df[col] = df[col].astype(types)


Drop Missing Values

DataFrame.dropna(axis=0, how='any', thresh=None,
subset=None, inplace=False)
df.dropna() #Drop the rows where at least one element is missing.
df.dropna(inplace=True) #Keep the DataFrame with valid entries in the same variable.
df.dropna(axis='columns') #Drop the columns where at least 1 element is missing.
df.dropna(how='all') #Drop the rows where all elements are missing.
df.dropna(thresh=2) #Keep only the rows with at least 2 non-NA values.
df.dropna(subset=[col1, col2]) #Define in which columns to look for missing values.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

Drop Missing Values

df[5] = df[5].replace('?', np.nan)
df_drop=df.dropna()
df_drop.shape[0]
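
To see how the dropna options differ, the following sketch applies each of them to a small made-up frame (the data and the column names a, b, c are assumptions for illustration) and prints the resulting shapes.

```python
import numpy as np
import pandas as pd

# Made-up 4x3 frame with missing entries in different patterns.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, 3.0, np.nan],
})

rows_any = df.dropna()                  # keeps only fully populated rows
rows_all = df.dropna(how="all")         # drops only the all-NaN last row
cols_any = df.dropna(axis="columns")    # every column has a NaN, so none survive
rows_thresh = df.dropna(thresh=2)       # rows with at least 2 non-NA values
rows_subset = df.dropna(subset=["c"])   # only column "c" is checked

for name, result in [("any", rows_any), ("all", rows_all), ("columns", cols_any),
                     ("thresh=2", rows_thresh), ("subset", rows_subset)]:
    print(name, result.shape)
```

None of these calls modifies df itself, since inplace defaults to False.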

Exercise 2: Drop Missing Values

Investigate the dataset shape after each of these commands
df.dropna(axis='columns') #Drop the columns where at least 1 element is missing.
df.dropna(how='all') #Drop the rows where all elements are missing.
df.dropna(thresh=24) #Keep only the rows with at least 24 non-NA values.

Replace Missing Values

DataFrame.fillna(value=None, method=None, axis=None,
inplace=False, limit=None, downcast=None, **kwargs)
df[col] = df[col].fillna(value) #Replace nulls in one column with value.
df.fillna(0) #Replace all NaN elements with 0s.
df.fillna(value={'A': 0, 'B': 1, 'C': 2, 'D': 3}) #Replace all NaN elements in columns 'A', 'B', 'C', and 'D' with 0, 1, 2, and 3 respectively.
df.fillna(value=values, limit=1) #Only replace the first NaN element in each column.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
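
The fillna variants can be compared on a small made-up frame; the sketch below assumes two columns named A and B and is not tied to the automobile data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [np.nan, 2.0, np.nan],
                   "B": [1.0, np.nan, np.nan]})

all_zero = df.fillna(0)                          # every NaN becomes 0
per_column = df.fillna(value={"A": 0, "B": 1})   # a different fill value per column
first_only = df.fillna(value={"A": 0, "B": 1}, limit=1)  # at most one fill per column

# Assigning the result back avoids relying on inplace=True for a single column.
df["A"] = df["A"].fillna(df["A"].mean())
print(df)
```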


Exercise 3: Replace Missing Values

A. For the automobile dataset with correct type
   assignment, write a program to replace all
   missing values with the mean (for numeric attributes)
   and the mode (for categorical attributes).
B. Compare the descriptive statistics of each column
   before and after replacing missing values, and
   report the differences.

Exercise 3B: Answer
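
One possible approach to parts A and B, offered as a sketch rather than the course's own answer. The toy frame and its column names are assumptions; with the real data the loop would run over the loaded automobile columns.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the automobile data: one numeric and one categorical attribute.
df = pd.DataFrame({
    "price": [10000.0, np.nan, 14000.0, 12000.0],
    "num_doors": ["four", "two", None, "four"],
})

before = df.describe(include="all")

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        fill_value = df[col].mean()           # numeric attribute: mean
    else:
        fill_value = df[col].mode().iloc[0]   # categorical attribute: most frequent value
    df[col] = df[col].fillna(fill_value)

after = df.describe(include="all")
print(before)
print(after)
```

Filling with the mean leaves a column's mean unchanged but increases its count and shrinks its standard deviation, which is the kind of difference part B asks about.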


Selecting Data with Python

DataFrame.drop(labels=None, axis=0, index=None,
columns=None, level=None, inplace=False,
errors='raise')
Drop specified labels from rows or columns.

df.drop(colList, axis=1) #Drop columns
df.drop(columns=colList) #Drop columns
df.drop(rowList) #Drop rows by index list

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html

Selecting Data : Feature Selection

In selecting data, feature selection methods will
be discussed next week.
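
Manual selection with DataFrame.drop can be sketched as follows. The frame, col_list and row_list are hypothetical stand-ins for colList and rowList.

```python
import pandas as pd

df = pd.DataFrame({"make": ["audi", "bmw", "volvo"],
                   "fuel": ["gas", "gas", "diesel"],
                   "price": [13950, 16430, 22470]})

col_list = ["fuel"]   # hypothetical columns to discard
row_list = [0, 2]     # hypothetical index labels to discard

without_cols_a = df.drop(col_list, axis=1)   # labels plus axis=1
without_cols_b = df.drop(columns=col_list)   # same result, keyword form
without_rows = df.drop(row_list)             # axis=0 is the default

print(without_cols_b.columns.tolist())
print(without_rows.index.tolist())
```

Each call returns a new DataFrame; df keeps all of its rows and columns unless inplace=True is passed.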

Feature Extraction

isLuxury is a new feature that will be true if a car has a
front engine (attribute 8 is front) and rear-wheel drive
(attribute 7 is rwd).
df['isLuxury'] = df.apply(lambda row: 1 if
    ((row[7]=='rwd') and (row[8]=='front')) else 0,
    axis=1)
df['isLuxury'] = df['isLuxury'].astype("category")
df.head()

Exercise 4: Add New Feature

1. Add a new feature 'fueleconomy' with the value
   0.55*citympg (i.e. attribute 23) +
   0.45*highwaympg (i.e. attribute 24)
2. Identify one new feature, and add it to our
   dataframe. Write as a comment why we need this
   new feature.
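
Both derived features can also be built with column-wise operations instead of a row-by-row apply. The sketch below uses a toy frame with hypothetical column names in place of the positional labels 7, 8, 23 and 24.

```python
import pandas as pd

# Toy rows; the column names are assumptions standing in for attributes 7, 8, 23, 24.
df = pd.DataFrame({
    "drive_wheels": ["rwd", "fwd", "rwd"],
    "engine_location": ["front", "front", "rear"],
    "city_mpg": [21, 24, 17],
    "highway_mpg": [27, 30, 25],
})

# Same rule as the apply/lambda version: 1 only when both conditions hold.
is_luxury = (df["drive_wheels"] == "rwd") & (df["engine_location"] == "front")
df["isLuxury"] = is_luxury.astype(int).astype("category")

# Weighted average of the two mileage attributes.
df["fueleconomy"] = 0.55 * df["city_mpg"] + 0.45 * df["highway_mpg"]
print(df)
```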

Data Normalization

MinMaxScaler: z = (x - min) / (max - min)
Standardize: z = (x - μ) / σ

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

MinMaxScaler

from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()
newdf = df.select_dtypes(include=['float64'])
newdf_scaled = min_max_scaler.fit_transform(newdf)
newdf = pd.DataFrame(newdf_scaled)
newdf.head()

MinMaxScaler

Z = (3.47-2.54)/(3.94-2.54)=0.664

StandardScaler

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
df_std = df.select_dtypes(include=['float64'])
df_std = df_std.dropna()
df_std = sc.fit_transform(df_std)
df_std = pd.DataFrame(df_std)
df_std.head()
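
Both rescalings can be checked by hand with plain pandas. This is a sketch on made-up values for a single attribute, not a replacement for the scikit-learn scalers.

```python
import pandas as pd

# Made-up values for a single numeric attribute.
x = pd.Series([2.0, 4.0, 6.0, 8.0])

# Min-max scaling: the smallest value maps to 0 and the largest to 1.
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: subtract the mean, divide by the standard deviation.
# ddof=0 gives the population standard deviation, which is what StandardScaler uses.
x_std = (x - x.mean()) / x.std(ddof=0)

print(x_minmax.tolist())
print(x_std.round(3).tolist())
```

After standardization the values have mean 0 and standard deviation 1, while min-max scaling only guarantees the range [0, 1].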


StandardScaler

z = (x - μ) / σ
Z = (3.47-3.329751)/0.273539=0.513

Data Formatting

• Categorical encoding: converts a categorical
  attribute into a numeric attribute
• Binning / discretization: converts a numeric
  attribute into a categorical attribute

Categorical Encoding

01 Direct encoding
● For attributes that represent numeric values
● Example: num_doors: two, four → {2,4}

02 Label encoding
● For ordinal attributes
● Example: size {S,M,L} → {1,2,3}.

03 One-hot encoding
● For multivalued attributes
● Example: city {BDO, JKT, ...} will be
  represented by city_BDO, city_JKT, etc.

Categorical Encoding: Examples

Atr1: direct encoding
Atr2: label encoding
Atr3: one-hot encoding

Before encoding:
atr1   atr2   atr3
two    S      BDO
four   M      JKT
one    L      DPS
four   M      BDO

After encoding:
atr1   atr2   atr3_BDO   atr3_JKT   atr3_DPS
2      1      1          0          0
4      2      0          1          0
1      3      0          0          1
4      2      1          0          0

Atr1: direct encoding
Atr2, Atr3: one-hot encoding

Before encoding:
atr1   atr2   atr3
two    S      BDO
four   M      JKT
one    L      DPS
four   M      BDO

After encoding:
atr1   atr2_S   atr2_M   atr2_L   atr3_BDO   atr3_JKT   atr3_DPS
2      1        0        0        1          0          0
4      0        1        0        0          1          0
1      0        0        1        0          0          1
4      0        1        0        1          0          0

Direct Encoding

df[5]=df[5].replace({'two': 2, 'four': 4})


Label Encoding in Python

for i in range(len(df.columns)):
    if (df[i].dtypes=='object'):
        df[i] = df[i].astype('category')
        df[i] = df[i].cat.codes
df.head()

Label Encoding: Label Encoder

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
for i in range(len(df.columns)):
    if (df[i].dtypes=='object'):
        le.fit(df[i])
        df[i] = le.transform(df[i])
df.head()
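
Both cat.codes and LabelEncoder number the categories in sorted order, so an ordinal attribute such as size {S, M, L} would come out as L=0, M=1, S=2. A minimal sketch of keeping the intended order, assuming a toy column named size:

```python
import pandas as pd

df = pd.DataFrame({"size": ["S", "M", "L", "M"]})

# Default behaviour: categories are sorted alphabetically (L, M, S).
default_codes = df["size"].astype("category").cat.codes

# Explicit order, so the codes follow S < M < L; +1 gives the {1, 2, 3} labels.
size_type = pd.CategoricalDtype(categories=["S", "M", "L"], ordered=True)
ordered_codes = df["size"].astype(size_type).cat.codes + 1

print(default_codes.tolist())   # [2, 1, 0, 1]
print(ordered_codes.tolist())   # [1, 2, 3, 2]
```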


One-hot Encoding

pandas.get_dummies(data, prefix=None, prefix_sep='_',
dummy_na=False, columns=None, sparse=False,
drop_first=False, dtype=None)
Convert categorical variable into dummy/indicator variables

pd.get_dummies(df, columns=['col1', 'col2'])

One-hot Encoding

df_encode = pd.get_dummies(data=df, columns=[2])
df_encode.head()
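
The three encodings from the earlier atr1/atr2/atr3 example can be reproduced in a few lines. This sketch uses Series.map with hand-written dictionaries for the direct and label encodings (a choice made here for illustration, where the slides use replace) and get_dummies for the one-hot step.

```python
import pandas as pd

df = pd.DataFrame({
    "atr1": ["two", "four", "one", "four"],
    "atr2": ["S", "M", "L", "M"],
    "atr3": ["BDO", "JKT", "DPS", "BDO"],
})

# atr1: direct encoding of number words.
df["atr1"] = df["atr1"].map({"one": 1, "two": 2, "four": 4})
# atr2: label encoding with a hand-written ordinal mapping.
df["atr2"] = df["atr2"].map({"S": 1, "M": 2, "L": 3})
# atr3: one-hot encoding; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(df, columns=["atr3"], dtype=int)
print(encoded)
```

get_dummies orders the new columns by sorted category name, so atr3_DPS appears before atr3_JKT in the output.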


Exercise 6: Categorical Encoding

A. For each categorical attribute, identify the
   most appropriate categorical encoding.
B. Write a program to encode those categorical
   attributes.

Binning / Discretization

pandas.cut(x, bins, right=True, labels=None,
retbins=False, precision=3, include_lowest=False,
duplicates='raise')
Bin values into discrete intervals.

https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.cut.html
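
A small sketch of how pandas.cut assigns values to bins, using made-up prices and illustrative bin edges and labels. Note what happens to a value that falls outside the outermost edge.

```python
import pandas as pd

# Made-up prices; the last one lies outside the outermost bin edge.
prices = pd.Series([5000, 10000, 19999, 45000])

bins = [0, 10000, 20000, 40000]
labels = ["Budget", "Medium", "Highend"]

# right=False makes each interval closed on the left: [0, 10000), [10000, 20000), ...
binned = pd.cut(prices, bins, right=False, labels=labels)
print(binned.tolist())
```

With right=False a value equal to an edge, such as 10000, belongs to the bin that starts there, and a value beyond the last edge becomes NaN instead of raising an error.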

Exercise 7: Binning

Execute this program for the Automobile dataset, and
explain its output.
bins = [0,10000,20000,40000]
car_bin=['Budget','Medium','Highend']
df['carsrange'] = pd.cut(df[25],bins,right=False,labels=car_bin)
df.head(10)

Export Data to CSV

DataFrame.to_csv(path_or_buf=None, sep=',', na_rep='',
float_format=None, columns=None, header=True, index=True,
index_label=None, mode='w', encoding=None, compression='infer',
quoting=None, quotechar='"', line_terminator=None, chunksize=None,
tupleize_cols=None, date_format=None, doublequote=True,
escapechar=None, decimal='.')
Write object to a comma-separated values (csv) file.

df.to_csv(index=False)

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
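
When no path is given, to_csv returns the CSV text instead of writing a file. The sketch below, on a made-up two-row frame, uses that to show the effect of index=False and to check the round trip through read_csv.

```python
import io
import pandas as pd

df = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950, 16430]})

# With no path, to_csv returns the CSV text; index=False omits the row labels.
csv_text = df.to_csv(index=False)
print(csv_text)

# Round trip through an in-memory buffer to confirm nothing was lost.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

Passing a file name as the first argument, for example a hypothetical "prepared.csv", writes the same text to disk.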

Exercise 8: Data Preparation


1. Load white Wine Quality dataset
(https://archive.ics.uci.edu/ml/datasets/wine+quality)
df = pd.read_csv("winequality-white.csv",sep=';')
2. Based on analysis in data understanding phase, conduct
data preparation to improve its data quality.
3. Save dataframe from data preparation into new csv file
