DM Lab Cycle 1
Cycle-1
Aim: Demonstrate the following data preprocessing tasks using Python
libraries.
a) Loading the dataset
b) Identifying the dependent and independent variables.
c) Dealing with missing data
Solution:
a) Loading the dataset
There are four different ways to load data in Python:
Manual function
loadtxt function
genfromtxt function
read_csv function
Imports
We will use the NumPy and Pandas packages, so import them first.
import numpy as np
import pandas as pd
1. Manual Function
We can design a custom function to load the data ourselves, using Python's normal
file-handling concepts to read a .csv file line by line.
def load_csv(filepath):
    data = []
    col = []
    checkcol = False
    with open(filepath) as f:
        for val in f.readlines():
            val = val.replace("\n", "")
            val = val.split(',')
            if checkcol is False:
                col = val
                checkcol = True
            else:
                data.append(val)
    df = pd.DataFrame(data=data, columns=col)
    return df
DATA MINING USING PYTHON LAB (R20-IV Sem) Page |2
To view the output, call the function above and print the head of the returned DataFrame:
myData = load_csv('Sales.csv')
print(myData.head())
Sample Output:
                              Region                Country        Item Type  \
0              Australia and Oceania                 Tuvalu        Baby Food
1  Central America and the Caribbean                Grenada           Cereal
2                             Europe                 Russia  Office Supplies
3                 Sub-Saharan Africa  Sao Tome and Principe           Fruits
4                 Sub-Saharan Africa                 Rwanda  Office Supplies

  Sales Channel Order Priority Order Date   Order ID  Ship Date  Units Sold  \
0       Offline              H  5/28/2010  669165933  6/27/2010        9925
1        Online              C  8/22/2012  963881480  9/15/2012        2804
2       Offline              L   5/2/2014  341417157   5/8/2014        1779
3        Online              C  6/20/2014  514321792   7/5/2014        8102
4       Offline              L   2/1/2013  115456712   2/6/2013        5062

   Unit Price  Unit Cost  Total Revenue  Total Cost  Total Profit
0      255.28     159.42     2533654.00  1582243.50     951410.50
1      205.70     117.11      576782.80   328376.44     248406.36
2      651.21     524.96     1158502.59   933903.84     224598.75
3        9.33       6.92       75591.66    56065.84      19525.82
4      651.21     524.96     3296425.02  2657347.52     639077.50
2. Numpy.loadtxt function
loadtxt() is a built-in function of NumPy, the well-known numerical library for Python. It is a
really simple function for loading data, and it works best when every column has the same
datatype. When the data is more complex it becomes hard to read with this function, but for
simple, uniform files it is really effective.
df = np.loadtxt('demo.csv', delimiter=',', usecols=[0, 3])
print(df[:5, :])
Output:
[[255. 8.3]
[299. 7.6]
[216. 9.1]
[247. 8.8]]
3. Numpy.genfromtxt()
genfromtxt() works like loadtxt(), but it is more robust: it can skip headers, handle mixed
datatypes, and fill missing fields with NaN instead of failing.
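A minimal sketch of genfromtxt(); the file name demo_missing.csv and its contents are made up for illustration. Note the missing value in the second data row.

```python
import numpy as np

# Write a small made-up demo file; the second row has a missing rating
with open('demo_missing.csv', 'w') as f:
    f.write("price,rating\n255,8.3\n299,\n216,9.1\n")

# genfromtxt skips the header row and fills the missing field with nan
arr = np.genfromtxt('demo_missing.csv', delimiter=',', skip_header=1)
print(arr)
```

On the same file, loadtxt() would raise an error for the empty field, while genfromtxt() quietly returns nan for it.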
4. Pandas.read_csv()
Pandas is a very popular and commonly used data-manipulation library. One of its most
important and mature functions is read_csv(), which can read any .csv file very easily and
lets us manipulate the result.
pdDf = pd.read_csv('Sales.csv')
pdDf.head()
Sample Output:
b) Identifying the dependent and independent variables.
Now we need to separate the matrix of features containing the independent variables from
the dependent variable 'Purchased'. (Here, dataset is assumed to be a DataFrame, loaded
with pd.read_csv(), whose columns are 'Country', 'Age', 'Salary', and 'Purchased'; this is a
different file from Sales.csv.)
The matrix of features will contain the variables ‘Country’, ‘Age’ and ‘Salary’.
The code to declare the matrix of features will be as follows:
x= dataset.iloc[:,:-1].values
print(x)
Output:
In the code above, the first ':' stands for the rows we want to include, and the second one
stands for the columns. By default, a bare ':' (colon) means that all rows/columns are
included. For our dataset, we need all the rows (:) and all the columns but the last one (:-1).
We follow the exact same procedure to create the dependent variable vector 'y'. The only
change is the columns we want in y: as in the matrix of features, we include all the rows, but
from the columns we need only the 4th (index 3, keeping in mind that Python indexes start
at 0). Therefore, the code for this looks as follows:
y= dataset.iloc[:,3].values
print(y)
Output:
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
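The steps above can be tried end to end on a small made-up DataFrame that mirrors the columns described ('Country', 'Age', 'Salary', 'Purchased'); the values below are invented for illustration only.

```python
import pandas as pd

# Made-up stand-in for the dataset described above
dataset = pd.DataFrame({
    'Country':   ['France', 'Spain', 'Germany'],
    'Age':       [44, 27, 30],
    'Salary':    [72000, 48000, 54000],
    'Purchased': ['No', 'Yes', 'No'],
})

x = dataset.iloc[:, :-1].values  # all rows, every column except the last
y = dataset.iloc[:, 3].values    # all rows, only the 4th column (index 3)
print(x)
print(y)
```

The .values attribute converts the sliced DataFrame into a NumPy array, which is the form most machine-learning libraries expect.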
c) Dealing with missing data
The first step in handling missing values is to carefully look at the complete data and find all the
missing values. The following code shows the total number of missing values in each column. It
also shows the total number of missing values in the entire data set.
import pandas as pd
train_df = pd.read_csv("train_loan.csv")
#Find the missing values from each column
print(train_df.isnull().sum())
Loan_ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64
IN:
#Find the total number of missing values from the entire dataset
train_df.isnull().sum().sum()
OUT:
149
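If the train_loan.csv file is not at hand, the same calls can be tried on a tiny made-up frame (column names borrowed from the dataset above, values invented):

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for train_loan.csv
df = pd.DataFrame({
    'Gender':      ['Male', np.nan, 'Female'],
    'LoanAmount':  [120.0, 60.0, np.nan],
    'Loan_Status': ['Y', 'N', 'Y'],
})

print(df.isnull().sum())        # missing values per column
print(df.isnull().sum().sum())  # total missing values in the frame
```

isnull() returns a Boolean frame of the same shape; the first sum() collapses it per column, the second collapses those counts into a single total.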
Deleting the missing values
Generally, this approach is not recommended; it is a quick-and-dirty technique for dealing
with missing values. If a missing value is of the type Missing Not At Random (MNAR), it
should not be deleted. If it is of type Missing At Random (MAR) or Missing Completely At
Random (MCAR), it can be deleted: the analysis then uses all cases with available data,
treating the missing observations as completely random (MCAR) and handling them through
pairwise deletion.
The disadvantage of this method is that one might end up deleting useful data from the
dataset. There are two ways to delete the missing values:
If a row has many missing values, you can drop the entire row. But if every row has some
value missing, you might end up deleting the whole dataset. The code to drop the entire row
is as follows:
IN:
df = train_df.dropna(axis=0)
df.isnull().sum()
OUT:
Loan_ID 0
Gender 0
Married 0
Dependents 0
Education 0
Self_Employed 0
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 0
Credit_History 0
Property_Area 0
Loan_Status 0
dtype: int64
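dropna() also takes some useful optional parameters; the toy frame below (made up for illustration) shows how subset= restricts which columns are checked for missing values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [4.0, 5.0, np.nan]})

# axis=0 drops every row that has at least one NaN anywhere
print(df.dropna(axis=0))                # only row 0 survives
# subset= checks only the listed columns for NaN
print(df.dropna(axis=0, subset=['A']))  # rows 0 and 2 survive
```

With subset=['A'], row 2 is kept even though its 'B' value is missing, because only column 'A' is inspected.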
If a certain column has many missing values, then you can choose to drop the entire column. The
code to drop the entire column is as follows:
IN:
df = train_df.drop(['Dependents'],axis=1)
df.isnull().sum()
OUT:
Loan_ID 0
Gender 13
Married 3
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64
Imputing the missing values
There are many imputation methods for replacing missing values. You can use different
Python libraries, such as Pandas and scikit-learn, to do this. For example, in the following
code we replace the missing values of the 'Dependents' column with '0'.
IN:
#Replace the missing value with '0' using the 'fillna' method
train_df['Dependents'] = train_df['Dependents'].fillna(0)
train_df['Dependents'].isnull().sum()
OUT:
0
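scikit-learn (mentioned above) offers SimpleImputer for the same job. A minimal sketch on made-up numeric data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numeric data with one missing value per column
X = np.array([[1.0, 2.0],
              [np.nan, 6.0],
              [5.0, np.nan]])

# strategy may be 'mean', 'median', 'most_frequent', or 'constant'
imputer = SimpleImputer(strategy='mean')
X_filled = imputer.fit_transform(X)
print(X_filled)  # each NaN replaced by its column mean
```

Unlike column-by-column fillna calls, a fitted imputer remembers the training-set statistics, so the same fill values can later be applied to unseen test data with transform().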
Replacing with the mean is the most common method of imputing missing values in numeric
columns. If there are outliers, the mean will not be appropriate; in such cases, the outliers
need to be treated first. You can use the 'fillna' method to impute the columns 'LoanAmount'
and 'Credit_History' with the mean of the respective column values.
IN:
#Replace the missing values for numerical columns with mean
train_df['LoanAmount'] = train_df['LoanAmount'].fillna(train_df['LoanAmount'].mean())
train_df['Credit_History'] = train_df['Credit_History'].fillna(train_df['Credit_History'].mean())
OUT:
Loan_ID 0
Gender 13
Married 3
Dependents 0
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 14
Credit_History 0
Property_Area 0
Loan_Status 0
dtype: int64
Mode is the most frequently occurring value. It is used in the case of categorical features. You
can use the ‘fillna’ method for imputing the categorical columns ‘Gender,’ ‘Married,’ and
‘Self_Employed.’
IN:
#Replace the missing values for categorical columns with the mode
train_df['Gender'] = train_df['Gender'].fillna(train_df['Gender'].mode()[0])
train_df['Married'] = train_df['Married'].fillna(train_df['Married'].mode()[0])
train_df['Self_Employed'] = train_df['Self_Employed'].fillna(train_df['Self_Employed'].mode()[0])
OUT:
Loan_ID 0
Gender 0
Married 0
Dependents 0
Education 0
Self_Employed 0
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 14
Credit_History 0
Property_Area 0
Loan_Status 0
dtype: int64
The median is the middle value. It is better to use the median for imputation when there are
outliers. You can use the 'fillna' method to impute the column 'Loan_Amount_Term' with the
median value:
train_df['Loan_Amount_Term'] = train_df['Loan_Amount_Term'].fillna(train_df['Loan_Amount_Term'].median())
In some cases, imputing the values with the previous value instead of the mean, mode, or
median is more appropriate. This is called forward fill. It is mostly used on time-series data.
You can use the 'fillna' function with the parameter method='ffill':
IN:
import pandas as pd
import numpy as np
test = pd.Series(range(6))
test.loc[2:4] = np.nan
test
OUT:
0 0.0
1 1.0
2 NaN
3 NaN
4 NaN
5 5.0
dtype: float64
IN:
# Forward-Fill
test.fillna(method='ffill')
OUT:
0 0.0
1 1.0
2 1.0
3 1.0
4 1.0
5 5.0
dtype: float64
In backward fill, the missing value is imputed using the next value.
IN:
# Backward-Fill
test.fillna(method='bfill')
OUT:
0 0.0
1 1.0
2 5.0
3 5.0
4 5.0
5 5.0
dtype: float64