
GROUP WORK ASSIGNMENT

UNIT: BCSC 4227 DATA SCIENCE


TITLE: SUPERVISED AND UNSUPERVISED LEARNING
GROUP MEMBERS
1) Wambui Dennis Wainaina: BCSC01-0008-2018
2) Otieno Daren Wallace: BCSC01-0026-2018
3) Njuguna Kelvin Njoroge: BCSC01-0170-2018
QUESTION
Using data sets of your own choice, demonstrate Supervised and Unsupervised Learning
Algorithms.
Exclude the use of KNN and K-Clustering
You May use Kaggle data sets
Write a report to explain your findings explaining the benefits and challenges of chosen
algorithm

SUPERVISED LEARNING
Type: Linear Regression
Language: Python
Dataset: Medical Cost Personal Datasets
Linear regression models the relationship between a dependent variable and one or more
independent variables by fitting a linear equation to the observed data.
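In its general form the model expresses the target as a weighted sum of the features plus an
error term:

y = θ₀ + θ₁x₁ + θ₂x₂ + … + θₚxₚ + ε

where θ₀ is the intercept and θ₁ … θₚ are the coefficients estimated from the training data.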
IMPLEMENTATION OF THE LINEAR REGRESSION MODEL
# Import library
import pandas as pd #Data manipulation
import numpy as np #Data manipulation
import matplotlib.pyplot as plt # Visualization
import seaborn as sns


path = "C:\\Users\\25470\\Desktop\\Data Science Group Work Project\\"


dataFrame = pd.read_csv(path+'insurance.csv')

#Check the top five rows to see how the data is structured
dataFrame.head()

age sex bmi children smoker region charges


0 19 female 27.900 0 yes southwest 16884.92400
1 18 male 33.770 1 no southeast 1725.55230
2 28 male 33.000 3 no southeast 4449.46200
3 33 male 22.705 0 no northwest 21984.47061
4 32 male 28.880 0 no northwest 3866.85520

sns.lmplot(x='bmi',y='charges',data=dataFrame,aspect=2,height=6)
plt.xlabel('Body Mass Index$(kg/m^2)$: as Independent variable')
plt.ylabel('Insurance Charges: as Dependent variable')
plt.title('Charge Vs BMI');

#Exploratory data analysis


dataFrame.describe()

age bmi children charges


count 1338.000000 1338.000000 1338.000000 1338.000000
mean 39.207025 30.663397 1.094918 13270.422265
std 14.049960 6.098187 1.205493 12110.011237
min 18.000000 15.960000 0.000000 1121.873900
25% 27.000000 26.296250 0.000000 4740.287150
50% 39.000000 30.400000 1.000000 9382.033000
75% 51.000000 34.693750 2.000000 16639.912515
max 64.000000 53.130000 5.000000 63770.428010

#Data Preprocessing
categorical_columns = ['sex', 'children', 'smoker', 'region']
df_encode = pd.get_dummies(data=dataFrame, prefix='OHE', prefix_sep='_',
                           columns=categorical_columns,
                           drop_first=True,
                           dtype='int8')
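For illustration, a minimal sketch (on a made-up toy column, not the insurance data) of what
pd.get_dummies with drop_first=True produces:

toy = pd.DataFrame({'smoker': ['yes', 'no', 'no']})
pd.get_dummies(toy, prefix='OHE', drop_first=True, dtype='int8')
# Produces a single column OHE_yes with values 1, 0, 0;
# drop_first=True removes the redundant OHE_no column to avoid multicollinearity.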

from scipy.stats import boxcox

# Box-Cox suggests an appropriate power transformation for the skewed target
y_bc, lam, ci = boxcox(df_encode['charges'], alpha=0.05)
ci, lam

## A log transform is applied to reduce the skew in charges and improve model performance
df_encode['charges'] = np.log(df_encode['charges'])

#Train Test Split


from sklearn.model_selection import train_test_split
X = df_encode.drop('charges', axis=1)  # Independent variables
y = df_encode['charges']               # Dependent variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=23)

#Model building
# Step 1: add x0 = 1 (bias column) to the dataset
X_train_0 = np.c_[np.ones((X_train.shape[0], 1)), X_train]
X_test_0 = np.c_[np.ones((X_test.shape[0], 1)), X_test]

# Step 2: build the model using the normal equation

theta = np.matmul(np.linalg.inv(np.matmul(X_train_0.T, X_train_0)), np.matmul(X_train_0.T, y_train))
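The closed-form solution above is the normal equation, which (assuming XᵀX is invertible) gives
the least-squares estimate directly:

θ = (XᵀX)⁻¹ Xᵀ y

The scikit-learn model fitted below should recover essentially the same parameters, which the
comparison table confirms.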

# The parameters of the linear regression model

parameter = ['theta_'+str(i) for i in range(X_train_0.shape[1])]
columns = ['intercept:x_0=1'] + list(X.columns.values)
parameter_df = pd.DataFrame({'Parameter': parameter, 'Columns': columns, 'theta': theta})

# Scikit Learn module

from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)  # Note: no need to add x_0 = 1; sklearn takes care of the intercept.

#Parameters
sk_theta = [lin_reg.intercept_] + list(lin_reg.coef_)
parameter_df = parameter_df.join(pd.Series(sk_theta, name='Sklearn_theta'))
parameter_df

Parameter Columns theta Sklearn_theta


0 theta_0 intercept:x_0=1 7.059171 7.059171
1 theta_1 age 0.033134 0.033134
2 theta_2 bmi 0.013517 0.013517
3 theta_3 OHE_male -0.067767 -0.067767
4 theta_4 OHE_1 0.149457 0.149457
5 theta_5 OHE_2 0.272919 0.272919
6 theta_6 OHE_3 0.244095 0.244095
7 theta_7 OHE_4 0.523339 0.523339
8 theta_8 OHE_5 0.466030 0.466030
9 theta_9 OHE_yes 1.550481 1.550481
10 theta_10 OHE_northwest -0.055845 -0.055845
11 theta_11 OHE_southeast -0.146578 -0.146578
12 theta_12 OHE_southwest -0.133508 -0.133508

#Model evaluation
# sklearn regression module
y_pred_sk = lin_reg.predict(X_test)

# Normal equation
y_pred_norm = np.matmul(X_test_0,theta)
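As a quick sanity check (a suggestion, not part of the original notebook), the predictions from
the normal equation and from scikit-learn can be compared directly; assuming X_test_0 uses the
same column order as X_test, they should agree:

np.allclose(y_pred_sk, y_pred_norm)  # expected to return True when both approaches match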

#Evaluation: MSE
mse = np.sum((y_pred_norm - y_test)**2) / X_test_0.shape[0]

# R_square
sse = np.sum((y_pred_norm - y_test)**2)
sst = np.sum((y_test - y_test.mean())**2)
R_square = 1 - (sse/sst)
print('The Mean Square Error(MSE): ', mse)
print('R square::',R_square)

The Mean Square Error(MSE): 0.1872962232298195


R square:: 0.7795687545055312
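Because the target was log-transformed, the MSE and R square above are on the log scale. A
minimal sketch (an illustration, not part of the original notebook) of reporting predictions
back in the original currency units:

charges_pred = np.exp(y_pred_sk)   # invert the earlier np.log transform
charges_true = np.exp(y_test)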

#Model Validation
# Check for linearity
f = plt.figure(figsize=(14, 5))
ax = f.add_subplot(121)
sns.scatterplot(x=y_test, y=y_pred_sk, ax=ax, color='r')
ax.set_title('Check for Linearity:\n Actual Vs Predicted value')

# Check for residual normality & mean

ax = f.add_subplot(122)
sns.distplot((y_test - y_pred_sk), ax=ax, color='b')
ax.axvline((y_test - y_pred_sk).mean(), color='k', linestyle='--')
ax.set_title('Check for Residual normality & mean: \n Residual error');

BENEFITS OF LINEAR REGRESSION

1) Easy to implement, interpret and efficient to train.
2) Linear regression performs exceptionally well when the relationship between features and
target is approximately linear.
3) With dimensionality reduction techniques, regularization, and cross-validation, it
manages overfitting fairly well.
4) The ability to extrapolate beyond the training data is a further benefit.
CHALLENGES OF LINEAR REGRESSION
1) It assumes a linear relationship between the dependent and independent variables, which
becomes a challenge since not all datasets are linearly related.
2) It is often quite prone to noise and overfitting.
3) Linear regression is quite sensitive to outliers.
4) It is prone to multicollinearity.

UNSUPERVISED LEARNING
Type: Hierarchical clustering
Language: Python
Dataset: Mall Customer Segmentation Data
Hierarchical clustering, also known as hierarchical cluster analysis or HCA, is an
unsupervised machine learning method for clustering unlabeled datasets.
In this technique, the hierarchy of clusters is built up in the form of a tree, and this
tree-shaped structure is known as the dendrogram.
Simply put, hierarchical clustering is the process of grouping data based on some measure of
similarity, choosing a technique to quantify how alike or different the observations are, and
successively merging (or splitting) groups to narrow the data down into clusters.
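As an illustration of the idea (a minimal sketch on a made-up toy array, not the Mall Customers
data), SciPy's linkage builds the merge tree and fcluster cuts it into flat clusters:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

toy = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.2, 7.9]])  # two obvious groups
Z = linkage(toy, method='ward')                   # agglomerative merge history
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree into 2 flat clusters
print(labels)                                     # e.g. [1 1 2 2]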
IMPLEMENTATION OF THE HIERARCHICAL MODEL
#Import Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import plotly as py
import plotly.graph_objs as go

import warnings
warnings.filterwarnings('ignore')

from sklearn import preprocessing


import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering

#Data Exploration
dataFrame = pd.read_csv('C:\\Users\\25470\\Desktop\\Data Science Group Work Project\\segmented_customers.csv')
dataFrame.head()

   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)  cluster
0           1       1   19                  15                      39        3
1           2       1   21                  15                      81        4
2           3       0   20                  16                       6        3
3           4       0   23                  16                      77        4
4           5       0   31                  17                      40        3

dataFrame.isnull().sum()

CustomerID 0
Gender 0
Age 0
Annual Income (k$) 0
Spending Score (1-100) 0
cluster 0
dtype: int64

dataFrame.describe()

       CustomerID      Gender         Age  Annual Income (k$)  Spending Score (1-100)     cluster
count  200.000000  200.000000  200.000000          200.000000              200.000000  200.000000
mean   100.500000    0.440000   38.850000           60.560000               50.200000    1.760000
std     57.879185    0.497633   13.969007           26.264721               25.823522    1.191427
min      1.000000    0.000000   18.000000           15.000000                1.000000    0.000000
25%     50.750000    0.000000   28.750000           41.500000               34.750000    1.000000
50%    100.500000    0.000000   36.000000           61.500000               50.000000    2.000000
75%    150.250000    1.000000   49.000000           78.000000               73.000000    2.000000
max    200.000000    1.000000   70.000000          137.000000               99.000000    4.000000

plt.figure(1, figsize=(15, 6))

n = 0
for x in ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']:
    n += 1
    plt.subplot(1, 3, n)
    plt.subplots_adjust(hspace=0.5, wspace=0.5)
    sns.distplot(dataFrame[x], bins=15)
    plt.title('Distplot of {}'.format(x))
plt.show()
#Label Encoding
label_encoder = preprocessing.LabelEncoder()

dataFrame['Gender'] = label_encoder.fit_transform(dataFrame['Gender'])
dataFrame.head()

   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)  cluster
0           1       1   19                  15                      39        3
1           2       1   21                  15                      81        4
2           3       0   20                  16                       6        3
3           4       0   23                  16                      77        4
4           5       0   31                  17                      40        3

#Dendrogram
plt.figure(1, figsize = (16 ,8))
dendrogram = sch.dendrogram(sch.linkage(dataFrame, method = "ward"))

plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
plt.show()
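If one wanted flat cluster labels directly from this tree (a sketch using SciPy, separate from
the scikit-learn model used below), the dendrogram could be cut at a chosen number of clusters:

from scipy.cluster.hierarchy import fcluster
Z = sch.linkage(dataFrame, method="ward")
labels_from_tree = fcluster(Z, t=5, criterion='maxclust')  # cut the tree into 5 clusters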
#Agglomerative Clustering
# Note: in scikit-learn 1.2+ the 'affinity' argument has been renamed to 'metric'.
hc = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='average')

y_hc = hc.fit_predict(dataFrame)
y_hc

array([3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4,
3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 2,
3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 0, 1, 2, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1], dtype=int64)

dataFrame['cluster'] = y_hc  # attach the cluster labels to the original data
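To get a rough quantitative sense of cluster quality (a minimal sketch, not part of the original
notebook), the silhouette score could be computed:

from sklearn.metrics import silhouette_score
# Rough check of cluster separation; the newly assigned label column is excluded from the features.
score = silhouette_score(dataFrame.drop('cluster', axis=1), y_hc)
print(score)  # values closer to 1 indicate better-separated clusters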

#Graphing clusters from agglomerative clustering


trace1 = go.Scatter3d(
x= dataFrame['Age'],
y= dataFrame['Spending Score (1-100)'],
z= dataFrame['Annual Income (k$)'],
mode='markers',
marker=dict(
color = dataFrame['cluster'],
size= 10,
line=dict(
color= dataFrame['cluster'],
width= 12
),
opacity=0.8
)
)
data = [trace1]
layout = go.Layout(
title= 'Clusters using Agglomerative Clustering',
scene = dict(
xaxis = dict(title = 'Age'),
yaxis = dict(title = 'Spending Score'),
zaxis = dict(title = 'Annual Income')
)
)
fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig)

X = dataFrame.iloc[:, [3, 4]].values
plt.scatter(X[y_hc==0, 0], X[y_hc==0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(X[y_hc==1, 0], X[y_hc==1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(X[y_hc==2, 0], X[y_hc==2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(X[y_hc==3, 0], X[y_hc==3, 1], s=100, c='purple', label='Cluster 4')
plt.scatter(X[y_hc==4, 0], X[y_hc==4, 1], s=100, c='orange', label='Cluster 5')
plt.title('Clusters of Customers (Hierarchical Clustering Model)')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()  # show the cluster labels defined above
plt.show()
BENEFITS OF HIERARCHICAL CLUSTERING MODEL
1) Easy to understand
2) Easy to implement
3) The dendrogram output of the algorithm can be used to understand the big picture
4) Easily identifies groups in a dataset
CHALLENGES OF HIERARCHICAL CLUSTERING MODEL
1) It does not always provide the best possible solution: poor solutions may be difficult
to detect and resolve when clustering multidimensional retail data that cannot always
be visualized on a plot.
2) If there is missing data, the algorithm cannot run: to ensure that the algorithm can
run, you must remove these rows or estimate the missing values.
3) The dendrogram can be interpreted incorrectly: cluster descriptors and cluster
composition may be difficult to interpret for all stakeholders involved in clustering.
4) The algorithm struggles with mixed data types: it becomes difficult to compute a
distance matrix when many different data types are used, as there is no simple formula
that can deal with both qualitative and numerical data.
