
DATA MINING USING PYTHON LAB (R20-IV Sem) Page |1

Cycle-1
Aim: Demonstrate the following data preprocessing tasks using python
libraries.
a) Loading the dataset
b) Identifying the dependent and independent variables.
c) Dealing with missing data
Solution:
a) Loading the dataset
There are 4 different ways to load data in Python:
• Manual function
• loadtxt function
• genfromtxt function
• read_csv function

Imports
We will use the NumPy and Pandas packages, so import them first.
import numpy as np
import pandas as pd

1. Manual Function

We design a custom function that can load the data, using Python’s normal file-handling
concepts to read a .csv file.

def load_csv(filepath):
    data = []
    col = []
    checkcol = False
    with open(filepath) as f:
        for val in f.readlines():
            val = val.replace("\n", "")
            val = val.split(',')
            if checkcol is False:
                col = val
                checkcol = True
            else:
                data.append(val)
    df = pd.DataFrame(data=data, columns=col)
    return df

To view the output, call the above function and print the head of the returned DataFrame:
myData = load_csv('Sales.csv')
print(myData.head())
Sample Output:
                              Region                Country        Item Type  \
0              Australia and Oceania                 Tuvalu        Baby Food
1  Central America and the Caribbean                Grenada           Cereal
2                             Europe                 Russia  Office Supplies
3                 Sub-Saharan Africa  Sao Tome and Principe           Fruits
4                 Sub-Saharan Africa                 Rwanda  Office Supplies

  Sales Channel Order Priority Order Date   Order ID  Ship Date  Units Sold  \
0 Offline H 5/28/2010 669165933 6/27/2010 9925
1 Online C 8/22/2012 963881480 9/15/2012 2804
2 Offline L 5/2/2014 341417157 5/8/2014 1779
3 Online C 6/20/2014 514321792 7/5/2014 8102
4 Offline L 2/1/2013 115456712 2/6/2013 5062

Unit Price Unit Cost Total Revenue Total Cost Total Profit
0 255.28 159.42 2533654.00 1582243.50 951410.50
1 205.70 117.11 576782.80 328376.44 248406.36
2 651.21 524.96 1158502.59 933903.84 224598.75
3 9.33 6.92 75591.66 56065.84 19525.82
4 651.21 524.96 3296425.02 2657347.52 639077.50

2. Numpy.loadtxt function

This is a built-in function in NumPy, a famous numerical library in Python. It is a really simple
function for loading data and is very useful for reading data in which every value has the same
datatype. When the data is more complex it is hard to read using this function, but for simple
and uniform files it is really powerful.
df = np.loadtxt('demo.csv', delimiter=',', usecols=[0, 3])
print(df[:5, :])

Data in demo.csv file:


255,Akhil,Male,8.3
299,Kavya,Female,7.6
216,Meghana,Female,9.1
247,Sai,Male,8.8
263,Prateek,Male,7.6
238,Thanmai,Female,9.3

Output:
[[255.   8.3]
 [299.   7.6]
 [216.   9.1]
 [247.   8.8]
 [263.   7.6]]

Pros and Cons

An important aspect of this function is that you can quickly load data from a file into NumPy
arrays.
Its drawback is that you cannot have different data types or missing values in your data.
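To make that limitation concrete, here is a small sketch using an in-memory copy of the demo.csv rows instead of a file on disk: loadtxt succeeds when only the numeric columns are selected, while loading every column forces a single dtype such as str.

```python
import io
import numpy as np

# In-memory stand-in for a mixed-type file like demo.csv (same illustrative rows).
csv_text = "255,Akhil,Male,8.3\n299,Kavya,Female,7.6\n"

# Selecting only the numeric columns works, exactly as in the example above.
nums = np.loadtxt(io.StringIO(csv_text), delimiter=',', usecols=[0, 3])

# Loading every column forces one dtype for everything; dtype=str keeps all
# values as text, so the numbers are no longer usable for arithmetic directly.
allcols = np.loadtxt(io.StringIO(csv_text), delimiter=',', dtype=str)

print(nums)        # two rows of floats: 255/8.3 and 299/7.6
print(allcols[0])  # ['255' 'Akhil' 'Male' '8.3']
```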

3. Numpy.genfromtxt()

This function is similar to loadtxt, but it can also handle missing values and mixed datatypes:
with dtype=None it infers an appropriate type for each column, and names=True reads the
column names from the header row.

data = np.genfromtxt('Sales.csv', delimiter=',', dtype=None,
                     names=True, encoding='utf-8')
pd.DataFrame(data).head()

Sample Output:

4. Pandas.read_csv()
Pandas is a very popular data manipulation library, and it is very commonly used. One of its
most important and mature functions is read_csv(), which can read any .csv file very easily and
help us manipulate it.
pdDf = pd.read_csv('Sales.csv')
pdDf.head()

Sample Output:

b) Identifying the dependent and independent variables.


The dataset Demo.csv contains the following data:

Country   Age   Salary   Purchased

France    44    72000    No
Spain     27    48000    Yes
Germany   30    54000    No
Spain     38    61000    No
Germany   40             Yes
France    35    58000    Yes
Spain           52000    No
France    48    79000    Yes
Germany   50    83000    No
France    37    67000    Yes

Load the dataset using read_csv() of pandas module:


import pandas as pd
dataset = pd.read_csv('Demo.csv')
The variables here can be classified as independent and dependent variables.
The independent variables are used to determine the dependent variable.
In our dataset, the first three columns (Country, Age, Salary) are independent variables which
will be used to determine the dependent variable (Purchased), which is the fourth column.

Now, we need to differentiate the matrix of features containing the independent variables from
the dependent variable ‘purchased’.

(i) Creating the matrix of features

The matrix of features will contain the variables ‘Country’, ‘Age’ and ‘Salary’.
The code to declare the matrix of features will be as follows:

x= dataset.iloc[:,:-1].values
print(x)

Output:

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]

In the code above, the first ‘:’ stands for the rows we want to include, and the second one
stands for the columns we want to include. By default, if only the ‘:’ (colon) is used, it means
that all the rows/columns are to be included. In the case of our dataset, we need to include all
the rows (:) and all the columns but the last one (:-1).
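The same slicing can be tried on a small throwaway DataFrame (column names and values invented for the sketch):

```python
import pandas as pd

# Toy frame, made-up values, just to show how the iloc slices behave.
demo = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

x = demo.iloc[:, :-1].values   # all rows, every column except the last
y = demo.iloc[:, -1].values    # all rows, only the last column

print(x)  # [[1 3]
          #  [2 4]]
print(y)  # [5 6]
```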

(ii)Creating the dependent variable vector

We’ll be following the exact same procedure to create the dependent variable vector ‘y’. The
only change here is the columns which we want in y. As in the matrix of features, we’ll be
including all the rows. But from the columns, we need only the 4th one (index 3, keeping in
mind that Python indexing starts at 0). Therefore, the code for the same will look as follows:

y= dataset.iloc[:,3].values
print(y)

Output:
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']

c) Dealing with missing data


What Is a Missing Value?
Missing data is defined as the values or data that are not stored (or not present) for some
variable(s) in the given dataset. Below is a sample of the missing data from the Titanic dataset.
You can see the columns ‘Age’ and ‘Cabin’ have some missing values.

In the dataset, the blanks show the missing values.

In Pandas, missing values are usually represented by NaN, which stands for Not a Number.

Types of Missing Values


1. Missing Completely At Random (MCAR)
In MCAR, the probability of data being missing is the same for all the observations. In this case,
there is no relationship between the missing data and any other values observed or unobserved
(the data which is not recorded) within the given dataset. That is, missing values are completely
independent of other data. There is no pattern.
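MCAR can be simulated by deleting values with a fixed probability that ignores the data itself. A minimal sketch (with invented salary values):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# A complete toy column of 1000 values (illustrative data).
df = pd.DataFrame({'salary': rng.normal(60000, 8000, size=1000)})

# MCAR: every observation gets the same 10% chance of being missing,
# independent of its own value and of every other column.
mask = rng.random(len(df)) < 0.10
df.loc[mask, 'salary'] = np.nan

print(df['salary'].isnull().mean())  # close to 0.10
```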
2. Missing At Random (MAR)
MAR data means that the reason for missing values can be explained by variables on which you
have complete information, as there is some relationship between the missing data and other
values/data. In this case, the data is not missing for all the observations. It is missing only within
sub-samples of the data, and there is some pattern in the missing values.

3. Missing Not At Random (MNAR)


Missing values depend on the unobserved data. If there is some structure/pattern in the missing
data and other observed data cannot explain it, then it is considered to be Missing Not At
Random (MNAR).
If the missing data does not fall under MCAR or MAR, it can be categorized as MNAR.

Checking for Missing Values in Python

The first step in handling missing values is to carefully look at the complete data and find all the
missing values. The following code shows the total number of missing values in each column. It
also shows the total number of missing values in the entire data set.

import pandas as pd
train_df = pd.read_csv("train_loan.csv")
#Find the missing values from each column
print(train_df.isnull().sum())

Loan_ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64

IN:
#Find the total number of missing values from the entire dataset
train_df.isnull().sum().sum()

OUT:
149

There are 149 missing values in total.

Handling Missing Values


There are 2 primary ways of handling missing values:
1. Deleting the Missing values
2. Imputing the Missing Values

1. Deleting the Missing value

Generally, this approach is not recommended. It is one of the quick and dirty techniques one can
use to deal with missing values. If the missing value is of the type Missing Not At Random
(MNAR), then it should not be deleted.

If the missing value is of type Missing At Random (MAR) or Missing Completely At Random
(MCAR), then it can be deleted. (In such an analysis, all cases with available data are used;
missing observations are assumed to be completely at random and handled through pairwise
deletion.)
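The difference between listwise and pairwise deletion can be sketched on a toy frame (values invented): pandas’ dropna() is listwise, while corr() performs pairwise deletion by default.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps scattered across both columns (values invented).
df = pd.DataFrame({'a': [1.0, 2.0, np.nan, 4.0, 5.0],
                   'b': [2.0, np.nan, 6.0, 8.0, 10.0]})

# Listwise deletion: dropna() keeps only the rows complete in every column.
print(len(df.dropna()))        # 3

# Pairwise deletion: corr() uses every row where this particular pair of
# columns is present, ignoring missingness elsewhere.
print(df['a'].corr(df['b']))   # ~1.0, since b is exactly 2*a in the complete pairs
```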

The disadvantage of this method is one might end up deleting some useful data from the dataset.

There are 2 ways one can delete the missing data values:

(i) Deleting the entire row (listwise deletion)

If a row has many missing values, you can drop the entire row. However, if every row has some
value missing, you might end up deleting the whole dataset. The code to drop every row that
contains a missing value is as follows:

IN:
df = train_df.dropna(axis=0)
df.isnull().sum()

OUT:
Loan_ID 0
Gender 0
Married 0
Dependents 0
Education 0
Self_Employed 0
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 0
Credit_History 0
Property_Area 0
Loan_Status 0
dtype: int64

(ii) Deleting the entire column

If a certain column has many missing values, then you can choose to drop the entire column. The
code to drop the entire column is as follows:

IN:
df = train_df.drop(['Dependents'],axis=1)
df.isnull().sum()

OUT:
Loan_ID 0
Gender 13
Married 3
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64

2. Imputing the Missing Value

There are many imputation methods for replacing the missing values. You can use different
python libraries such as Pandas, and Sci-kit Learn to do this.
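As a sketch of the Sci-kit Learn route, SimpleImputer applies one strategy to several columns at once. The values below are invented; the column names merely mirror the loan data used in this section.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric frame with gaps (illustrative values only).
df = pd.DataFrame({'LoanAmount': [100.0, np.nan, 150.0],
                   'Credit_History': [1.0, 0.0, np.nan]})

# strategy may be 'mean', 'median', 'most_frequent', or 'constant'.
imputer = SimpleImputer(strategy='mean')
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(filled['LoanAmount'].tolist())  # [100.0, 125.0, 150.0]
```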

(i) Replacing with an arbitrary value

E.g., in the following code, we are replacing the missing values of the ‘Dependents’ column with
‘0’.

IN:
#Replace the missing value with '0' using the 'fillna' method
train_df['Dependents'] = train_df['Dependents'].fillna(0)
train_df['Dependents'].isnull().sum()

OUT:
0

(ii) Replacing with the mean

This is the most common method of imputing missing values of numeric columns. If there are
outliers, then the mean will not be appropriate. In such cases, outliers need to be treated first.
You can use the ‘fillna’ method for imputing the columns ‘LoanAmount’ and ‘Credit_History’
with the mean of the respective column values.

IN:
#Replace the missing values for numerical columns with mean
train_df['LoanAmount'] = train_df['LoanAmount'].fillna(train_df['LoanAmount'].mean())
train_df['Credit_History'] = train_df['Credit_History'].fillna(train_df['Credit_History'].mean())
train_df.isnull().sum()

OUT:
Loan_ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 0
Credit_History 0
Property_Area 0
Loan_Status 0
dtype: int64
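A quick sketch of why outliers make the mean a poor imputation value (all values invented): one extreme observation drags the mean far away from the typical level, while the median barely moves.

```python
import pandas as pd

# Four typical values plus one made-up extreme outlier.
s = pd.Series([10.0, 12.0, 11.0, 13.0, 9999.0])

print(s.mean())    # 2009.0 -- heavily pulled by the outlier
print(s.median())  # 12.0   -- barely affected
```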

(iii) Replacing with the mode

Mode is the most frequently occurring value. It is used in the case of categorical features. You
can use the ‘fillna’ method for imputing the categorical columns ‘Gender,’ ‘Married,’ and
‘Self_Employed.’

IN:
#Replace the missing values for categorical columns with mode
train_df['Gender'] = train_df['Gender'].fillna(train_df['Gender'].mode()[0])
train_df['Married'] = train_df['Married'].fillna(train_df['Married'].mode()[0])
train_df['Self_Employed'] = train_df['Self_Employed'].fillna(train_df['Self_Employed'].mode()[0])
train_df.isnull().sum()

OUT:
Loan_ID 0
Gender 0
Married 0
Dependents 0
Education 0
Self_Employed 0
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 0
Credit_History 0
Property_Area 0
Loan_Status 0
dtype: int64

(iv) Replacing with the median

The median is the middlemost value. It’s better to use the median value for imputation in the
case of outliers. You can use the ‘fillna’ method for imputing the column ‘Loan_Amount_Term’
with the median value.

train_df['Loan_Amount_Term'] = train_df['Loan_Amount_Term'].fillna(train_df['Loan_Amount_Term'].median())

(v) Replacing with the previous value – forward fill

In some cases, imputing the values with the previous value instead of the mean, mode, or median
is more appropriate. This is called forward fill. It is mostly used in time series data. You can use
the 'fillna' method with the parameter method='ffill'.

IN:
import pandas as pd
import numpy as np
test = pd.Series(range(6))
test.loc[2:4] = np.nan
test

OUT:
0 0.0
1 1.0
2 NaN
3 NaN
4 NaN
5 5.0
dtype: float64

IN:
# Forward-Fill
test.fillna(method='ffill')

OUT:
0 0.0
1 1.0
2 1.0
3 1.0
4 1.0
5 5.0
dtype: float64

(vi) Replacing with the next value – backward fill

In backward fill, the missing value is imputed using the next value.

IN:
# Backward-Fill
test.fillna(method='bfill')

OUT:
0 0.0
1 1.0
2 5.0
3 5.0
4 5.0
5 5.0
dtype: float64
