Group Work Assignment Supervised and Unsupervised Learning
SUPERVISED LEARNING
Type: Linear Regression
Language: Python
Dataset: Medical Cost Personal Datasets
Linear regression models the relationship between a dependent variable and one or
more independent variables. To describe the data, a linear equation is fitted to the
observed values.
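As a minimal sketch of this idea (the BMI/charge values below are invented and perfectly linear, purely for illustration), fitting such an equation with scikit-learn looks like:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical, perfectly linear BMI/charge pairs for illustration only
bmi = np.array([[18.0], [22.0], [26.0], [30.0], [34.0]])
charges = 100.0 * bmi.ravel() + 200.0  # charges = 100*bmi + 200

model = LinearRegression().fit(bmi, charges)
print(model.coef_[0], model.intercept_)  # recovers slope 100 and intercept 200
```

On real data the fit is not exact, but the model still returns the slope and intercept of the best-fitting line.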
IMPLEMENTATION OF THE LINEAR REGRESSION MODEL
# Import libraries
import pandas as pd               # Data manipulation
import numpy as np                # Numerical computation
import matplotlib.pyplot as plt   # Visualization
import seaborn as sns             # Statistical visualization

# Load the Medical Cost Personal dataset (file name assumed)
dataFrame = pd.read_csv('insurance.csv')

# Check the top five rows to see how the data is structured
dataFrame.head()
sns.lmplot(x='bmi', y='charges', data=dataFrame, aspect=2, height=6)
plt.xlabel('Body Mass Index $(kg/m^2)$: as Independent variable')
plt.ylabel('Insurance Charges: as Dependent variable')
plt.title('Charge Vs BMI');
# Data Preprocessing: one-hot encode the categorical columns
categorical_columns = ['sex', 'children', 'smoker', 'region']
df_encode = pd.get_dummies(data=dataFrame, prefix='OHE', prefix_sep='_',
                           columns=categorical_columns,
                           drop_first=True,
                           dtype='int8')
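To see what `get_dummies` does with these arguments, here is a tiny hypothetical frame (the rows and values are invented):

```python
import pandas as pd

# Invented three-row frame with one categorical column
toy = pd.DataFrame({'smoker': ['yes', 'no', 'yes'], 'charges': [100, 50, 120]})
encoded = pd.get_dummies(toy, prefix='OHE', prefix_sep='_',
                         columns=['smoker'], drop_first=True, dtype='int8')
print(list(encoded.columns))        # ['charges', 'OHE_yes'] — 'no' is dropped as the baseline
print(encoded['OHE_yes'].tolist())  # [1, 0, 1]
```

`drop_first=True` removes one dummy per category to avoid redundant (perfectly collinear) columns in the regression.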
# Log-transform the skewed charges for better model performance
df_encode['charges'] = np.log(df_encode['charges'])
# Model building
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Split features and (log-transformed) target into train/test sets (split ratio assumed)
X = df_encode.drop('charges', axis=1)
y = df_encode['charges']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=23)

# Step 1: add x0 = 1 (bias column) to the dataset
X_train_0 = np.c_[np.ones((X_train.shape[0], 1)), X_train]
X_test_0 = np.c_[np.ones((X_test.shape[0], 1)), X_test]

# Parameters: fit the sklearn model and collect its intercept and coefficients
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
sk_theta = [lin_reg.intercept_] + list(lin_reg.coef_)
parameter_df = parameter_df.join(pd.Series(sk_theta, name='Sklearn_theta'))
parameter_df
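The `theta` vector used in the evaluation below comes from the normal equation, which is not shown being computed; a self-contained sketch on synthetic data (all values invented) looks like:

```python
import numpy as np

# Synthetic regression problem with a known parameter vector (values invented)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
X_0 = np.c_[np.ones((X.shape[0], 1)), X]   # prepend the bias column x0 = 1
true_theta = np.array([1.0, 2.0, -3.0])
y = X_0 @ true_theta

# Normal equation: theta = (X^T X)^(-1) X^T y
theta = np.linalg.inv(X_0.T @ X_0) @ (X_0.T @ y)
print(theta)  # recovers [1, 2, -3] on noiseless data
```

On noiseless data the closed-form solution recovers the true parameters exactly; with noise it gives the least-squares estimate, which is what `LinearRegression` also computes.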
# Model evaluation
# sklearn regression module
y_pred_sk = lin_reg.predict(X_test)
# Normal equation
y_pred_norm = np.matmul(X_test_0, theta)
# Evaluation: MSE
mse = np.sum((y_pred_norm - y_test)**2) / X_test_0.shape[0]
# R squared
sse = np.sum((y_pred_norm - y_test)**2)
sst = np.sum((y_test - y_test.mean())**2)
R_square = 1 - (sse/sst)
print('The Mean Squared Error (MSE):', mse)
print('R squared:', R_square)
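The hand-written MSE and R² formulas above can be cross-checked against `sklearn.metrics` on toy predictions (the values below are invented):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 7.5, 9.0])

# Same formulas as above, written out by hand
mse_manual = np.sum((y_hat - y_true) ** 2) / y_true.shape[0]
r2_manual = 1 - np.sum((y_hat - y_true) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse_manual, mean_squared_error(y_true, y_hat))  # 0.125 0.125
print(r2_manual, r2_score(y_true, y_hat))             # 0.975 0.975
```

Both routes give identical numbers, confirming the manual formulas match the library implementations.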
# Model Validation
# Check for linearity
f = plt.figure(figsize=(14, 5))
ax = f.add_subplot(121)
sns.scatterplot(x=y_test, y=y_pred_sk, ax=ax, color='r')
ax.set_title('Check for Linearity:\n Actual Vs Predicted value')
UNSUPERVISED LEARNING
Type: Hierarchical clustering
Language: Python
Dataset: Mall Customer Segmentation Data
Hierarchical clustering, also known as hierarchical cluster analysis (HCA), is an
unsupervised machine learning method for clustering unlabeled datasets.
In this technique, the hierarchy of clusters is built up in the form of a tree, and this
tree-shaped structure is known as a dendrogram.
Simply put, hierarchical clustering divides data into groups by choosing a measure of
similarity, quantifying how alike or different the observations are, and then successively
merging (or splitting) clusters based on that measure.
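This build-a-tree-then-cut-it idea can be sketched on two invented point clouds with SciPy:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated toy groups (coordinates invented)
pts = np.array([[0, 0], [0, 1], [1, 0],
                [10, 10], [10, 11], [11, 10]], dtype=float)

Z = linkage(pts, method='ward')                   # build the merge tree behind a dendrogram
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree into two clusters
print(labels)
```

Cutting the tree at two clusters assigns the first three points to one group and the last three to the other, matching the obvious separation.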
IMPLEMENTATION OF THE HIERARCHICAL MODEL
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as py
import plotly.graph_objs as go
import scipy.cluster.hierarchy as sch    # dendrogram and linkage
from sklearn.cluster import AgglomerativeClustering
import warnings
warnings.filterwarnings('ignore')
# Data Exploration
dataFrame = pd.read_csv('C:\\Users\\25470\\Desktop\\Data Science Group Work Project\\segmented_customers.csv')
dataFrame.head()
dataFrame.isnull().sum()
CustomerID 0
Gender 0
Age 0
Annual Income (k$) 0
Spending Score (1-100) 0
cluster 0
dtype: int64
dataFrame.describe()
# Encode Gender as numeric labels
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
dataFrame['Gender'] = label_encoder.fit_transform(dataFrame['Gender'])
dataFrame.head()
# Dendrogram
plt.figure(1, figsize = (16 ,8))
dendrogram = sch.dendrogram(sch.linkage(dataFrame, method = "ward"))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
plt.show()
# Agglomerative Clustering
hc = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='average')
y_hc = hc.fit_predict(dataFrame)
y_hc
array([3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4,
3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 2,
3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 0, 1, 2, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1], dtype=int64)
dataFrame['cluster'] = pd.DataFrame(y_hc)
X = dataFrame.iloc[:, [3,4]].values
plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(X[y_hc == 2, 0], X[y_hc == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(X[y_hc == 3, 0], X[y_hc == 3, 1], s=100, c='purple', label='Cluster 4')
plt.scatter(X[y_hc == 4, 0], X[y_hc == 4, 1], s=100, c='orange', label='Cluster 5')
plt.title('Clusters of Customers (Hierarchical Clustering Model)')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
BENEFITS OF THE HIERARCHICAL CLUSTERING MODEL
1) Easy to understand
2) Easy to implement
3) Dendrogram output of the algorithm can be used to understand the big picture
4) Easily identify groups in a dataset
CHALLENGES OF THE HIERARCHICAL CLUSTERING MODEL
1) It does not always find the best possible solution: poor solutions may be difficult to
detect and resolve when clustering multidimensional retail data that cannot always
be visualized on a plot.
2) It cannot run when data is missing: rows with missing values must be removed or
their values estimated before the algorithm can run.
3) The dendrogram can be misinterpreted: cluster descriptors and cluster composition
may be difficult to interpret for all stakeholders involved in clustering.
4) It struggles with mixed data types: computing a distance matrix becomes difficult
when many different data types are present, since no simple formula handles both
qualitative and numerical data.
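Challenge 2, for instance, means incomplete rows have to be dropped or imputed before clustering; a small sketch with invented values shows both options:

```python
import numpy as np
import pandas as pd

# Invented frame with one missing value per column
df = pd.DataFrame({'income': [15.0, np.nan, 80.0],
                   'score': [39.0, 81.0, np.nan]})

dropped = df.dropna()           # option 1: discard incomplete rows
imputed = df.fillna(df.mean())  # option 2: estimate with the column mean
print(len(dropped), int(imputed.isna().sum().sum()))  # 1 0
```

Dropping loses data (only one complete row survives here), while imputation keeps every row at the cost of introducing estimated values.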