BANASTHALI VIDYAPITH

Artificial Intelligence and Machine Learning


LAB RECORD

SUBMITTED TO:- DR. URVASHI PRAKASH SHUKLA


SUBMITTED BY:- MANSI SINGHAL
ROLL NO. :- 2016776
CLASS:- B.TECH (IT-A)
SMART ID:- BTBTI20050

Aim- Filling Missing Data with Pandas.


Ques 1 - Filling NaN values with constant value.

Pseudo-code:

import pandas as pd

df = pd.DataFrame({
    'date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
             '2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04'],
    'fruit': ['apple', 'apple', 'apple', 'apple',
              'mango', 'mango', 'mango', 'mango'],
    'price': [0.80, None, None, 1.20, None, 2.10, 2.00, 1.80]
})
df['date'] = pd.to_datetime(df['date'])
print("ORIGINAL DATA\n")
print(df)

df['price'].fillna(value=0.85, inplace=True)
print("\n\nDATA FILLED WITH CONSTANT VALUE IN PLACE OF NaN\n")
print(df)

Result-
Ques 2- Fill with the Mean of Column.
Pseudo-code:
# mean
df['price'].fillna(value = df.price.mean(), inplace = True)

Result-
Ques 3- Fill with Median of Column.
Pseudo-code:
df['price'].fillna(value = df.price.median(), inplace = True)

Result-
Ques 4- Fill with Mean of Group.

Pseudo-code:
# mean
df['price'].fillna(df.groupby('fruit')['price'].transform('mean'),
inplace = True)

Result-
Ques 5- Fill with Median of Group.

Pseudo-code:
# median
df['price'].fillna(df.groupby('fruit')['price'].transform('median'), inplace = True)

Result-
Ques 6- Fill using Forward Fill.

Pseudo-code:
df['price'].fillna(method = 'ffill', inplace = True)

Result-
Ques 7- Fill using Forward Fill with Limit=1.
Pseudo-code:
df['price'].fillna(method = 'ffill', limit = 1, inplace = True)

Result-
Ques 8- Fill with Forward Fill within Group.

Pseudo-code:
df['price'] = df.groupby('fruit')['price'].ffill()
Result-
Ques 9- Fill using Forward Fill within Group with Limit=1.

Pseudo-code:
df['price'] = df.groupby('fruit')['price'].ffill(limit = 1)

Result-
Ques 10- Fill using Back Fill.

Pseudo-code:
df['price'].fillna(method = 'bfill', inplace = True)

Result-
Ques 11- Fill using Back Fill with Limit=1.
Pseudo-code:
df['price'].fillna(method = 'bfill', limit = 1, inplace = True)

Result-
Ques 12- Fill using Back Fill within Group.
Pseudo-code:
# backfill without propagation limit
df['price'] = df.groupby('fruit')['price'].bfill()

Result-
Ques 13- Fill using Back Fill within Group with Limit=1.

Pseudo-code:
df['price'] = df.groupby('fruit')['price'].bfill(limit = 1)

Result-
Ques 14- Fill by Combining both Forward Fill and Back Fill.
Pseudo-code:
df['price'] = df.groupby('fruit')['price'].ffill().bfill()

Result-
Ques 15- Fill by Combining both Back Fill and Forward Fill (apply Back Fill first, then Forward Fill).
Pseudo-code:
df['price'] = df.groupby('fruit')['price'].bfill().ffill()

Result-
Ques 16- Fill using Interpolation.

Pseudo-code:
df['price'].interpolate(method = 'linear', inplace = True)

Result-
Ques 17- Fill using Interpolation within Group.

Pseudo-code:
df['price'] = df.groupby('fruit')['price'].apply(lambda x:
x.interpolate(method='linear'))

Result-
Ques 18- Fill using both Interpolation and Back Fill.

Pseudo-code:
df['price'] = df.groupby('fruit')['price'].apply(lambda x:
x.interpolate(method='linear')).bfill()

Result-
Ques 19- Fill value based on Conditions.

Pseudo-code:

# FILL NaN VALUES WITH CONDITIONS

# Flag weekdays: True for Monday-Friday, False for Saturday/Sunday
df['weekday'] = df['date'].apply(
    lambda x: x.day_name() not in ['Saturday', 'Sunday'])

# transform() returns a per-group mean aligned with the original rows
mean_price = df.groupby('fruit')['price'].transform('mean')
print("\nMEAN PRICE\n")
print(mean_price)

print("\nFINALLY WE FILL THE MISSING VALUES BASED ON THE GIVEN CONDITIONS USING THE PANDAS .where METHOD\n")
df['price'].fillna(mean_price.where(cond=df.weekday, other=mean_price * 1.25),
                   inplace=True)

Result-
Conclusion

In this exercise we examined the following methods for filling missing values using Pandas.
● Fillna

● Forward Fill

● Back Fill

● Interpolation

The choice of the filling method depends on the assumptions and the
context of the problem. For example, filling the missing values of
mangoes with mean price of apples and mangoes may not be a good
idea as apples and mangoes have rather different prices in our toy
dataset.
We also saw how to use each of these methods in conjunction with the pandas groupby() method to fill missing values for each group separately.
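
As a minimal sketch of this point (re-creating a trimmed version of the toy fruit data above; the comparison DataFrame is only for illustration), the global-mean fill and the group-wise mean fill can be compared side by side:

import pandas as pd

df = pd.DataFrame({
    'fruit': ['apple', 'apple', 'apple', 'apple',
              'mango', 'mango', 'mango', 'mango'],
    'price': [0.80, None, None, 1.20, None, 2.10, 2.00, 1.80]
})

global_fill = df['price'].fillna(df['price'].mean())           # one mean for all fruits
group_fill = df['price'].fillna(
    df.groupby('fruit')['price'].transform('mean'))            # one mean per fruit

comparison = pd.DataFrame({'fruit': df['fruit'],
                           'global_fill': global_fill,
                           'group_fill': group_fill})
print(comparison)   # the missing mango row gets a higher fill value with the group-wise mean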

Aim- Dealing with Categorical Data

Ques 1- Creating a new categorical dataframe and performing ordinal feature mapping.
Pseudo-code as well as Result included:

# Creating a new categorical dataframe.


import pandas as pd
df=pd.DataFrame([
['green','M',10.1,'class1'],
['red','L',13.5,'class2'],
['blue','XL',15.3,'class1']
])
print(df)

df.columns

df.columns=['color','size','price','classlabel']
print(df);

#Ordinal Feature Mapping


size_mapping={'XL':3,'L':2,'M':1}
df['size']=df['size'].map(size_mapping)
df

size_mapping.items()
inv_size_mapping={v:k for k,v in size_mapping.items()}
inv_size_mapping

df['size']=df['size'].map(inv_size_mapping)
df

Conclusion

When we created our DataFrame, we could see from the output that it contains a nominal feature (color), an ordinal feature (size), as well as a numerical feature (price).
In order for our learning algorithm to interpret the ordinal feature correctly, we should convert the categorical string values into integers.
However, since there is no convenient function that can automatically derive the correct order of the labels of our size feature, we have to define the mapping manually.
If we want to transform the integer values back to the original string representation, we can simply define a reverse-mapping dictionary "inv_size_mapping" that can then be used via the pandas map method on the transformed feature column, similar to the "size_mapping" dictionary that we used previously.
Therefore, we learnt how to create a categorical DataFrame and then perform mapping and inverse mapping.
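
A minimal sketch of the round trip described above, assuming the same size_mapping dictionary; the small Series below is only illustrative:

import pandas as pd

size_mapping = {'XL': 3, 'L': 2, 'M': 1}
inv_size_mapping = {v: k for k, v in size_mapping.items()}

sizes = pd.Series(['M', 'L', 'XL'])
encoded = sizes.map(size_mapping)          # strings -> integers
decoded = encoded.map(inv_size_mapping)    # integers -> strings

print(encoded.tolist())                    # [1, 2, 3]
assert decoded.tolist() == sizes.tolist()  # the round trip recovers the original labels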
11-January-2023

Aim- Working with Pandas Dataframe.

Ques 1- Create a CSV dataframe and then read it using the pandas library.

Pseudo-code:
import pandas as pd
from io import StringIO
csv_data='''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
0.0,11.0,12.0,'''
csv_data = str(csv_data)
df=pd.read_csv(StringIO(csv_data))
df

Result-

Ques 2- Check null values in dataframe.


Pseudo-code:
df.isnull()

Result-

Ques 3- Give the total number of null values in each column.
Pseudo-code:
df.isnull().sum()

Result-

Ques 4- Show the values in dataframe.

Pseudo-code:
df.values

Result-

Ques 5- Eliminating samples/features with missing cells via pandas.DataFrame.dropna()

Pseudo-code as well as Result:

# Drop rows having NaN values.
df.dropna()

# Drop columns having NaN values.
df.dropna(axis=1)

# Only drop rows where all columns are NaN.
df.dropna(how='all')

# Drop rows that do not have at least 4 non-NaN values.
df.dropna(thresh=4)

# Only drop rows where NaN appears in specific columns (here: 'C').
df.dropna(subset=['C'])
Ques 6- Estimating missing values via interpolation
Pseudo-code:
from sklearn.impute import SimpleImputer
import numpy as np
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(df)
imputed_data = imputer.transform(df)
imputed_data

Result-

Conclusion

This exercise shows how to work with a Pandas DataFrame. We created a simple example to get a better grasp of the problem: after building a CSV string, we read it with read_csv just as we would read a file from outside Python. We can use the isnull() method to check whether a cell contains a numeric value (False) or whether the data is missing (True). For a larger DataFrame, we may want to use the sum() method, which returns the number of missing values per column. We also learnt about the various ways to eliminate samples/features with missing cells via pandas.DataFrame.dropna(), i.e. removing the corresponding features (columns) or samples (rows) from the dataset. The removal of missing data appears to be a convenient approach; however, it also comes with certain disadvantages:
1. We may end up removing too many samples, which will make our analysis unreliable.
2. By eliminating too many feature columns, we may run the risk of losing valuable information for our classifier.
Then we learnt about estimating missing values via interpolation. Mean imputation replaces the missing values with the mean value of the entire feature column. While this method maintains the sample size and is easy to use, the variability in the data is reduced, so the standard deviations and the variance estimates tend to be underestimated. We used the sklearn.impute.SimpleImputer class: each NaN value was replaced by the corresponding mean of its feature column.
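
A small sketch of the claim that mean imputation understates the spread; the single column of numbers below is made up for illustration and is not taken from any lab dataset:

import numpy as np
from sklearn.impute import SimpleImputer

col = np.array([[1.0], [2.0], [np.nan], [4.0], [np.nan], [9.0]])
imputed = SimpleImputer(missing_values=np.nan, strategy='mean').fit_transform(col)

observed = col[~np.isnan(col)]
print("std of observed values  :", observed.std())   # spread of the real data
print("std after mean imputation:", imputed.std())   # smaller, since NaNs now sit at the mean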

12-January-2023

Aim- Create a DataFrame with ClassLabel and do ClassLabel Encoding.

Ques 1- Create a dataframe.

Pseudo-code as well as Result:

import pandas as pd
df=pd.DataFrame([
['green','M',10.1,'class1'],
['red','L',13.5,'class2'],
['blue','XL',15.3,'class1']
])
df
df.columns

df.columns=['color','size','price','classlabel']
df
Ques 2- Do Class Labels Encoding.
Pseudo-code with Result:
import numpy as np
np.unique(df['classlabel'])

class_mapping = {label: idx for idx, label in
                 enumerate(np.unique(df['classlabel']))}
class_mapping
df['classlabel']=df['classlabel'].map(class_mapping)
df

inv_class_mapping={v:k for k, v in class_mapping.items()}


df['classlabel']=df['classlabel'].map(inv_class_mapping)
df
Conclusion

So, we learnt how to do class label encoding. It was similar
to ordinal feature mapping. Since class labels are not
ordinal, it doesn’t matter which integer number we assign
to a particular string-label. We can simply enumerate the
class labels starting at 0. Then we use mapping dictionary
to transform the class labels into integers. As we did for
“size” in the previous class, we can reverse the key-value
pairs in the mapping dictionary to map the converted class
labels back to the original string representation.
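
As an alternative sketch, scikit-learn's LabelEncoder performs the same enumeration and inverse transformation automatically; the label list below simply mirrors the class labels used in this lab:

from sklearn.preprocessing import LabelEncoder

labels = ['class1', 'class2', 'class1']
le = LabelEncoder()
encoded = le.fit_transform(labels)          # e.g. array([0, 1, 0])
decoded = le.inverse_transform(encoded)     # back to the original strings

print(encoded)
print(list(decoded))                        # ['class1', 'class2', 'class1']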

Aim- Use of Standard Scaler Utility Class

Pseudo-code AS WELL AS Result:

from sklearn import preprocessing


import numpy as np
X_train = np.array([[ 1., -1., 2.],
[ 2., 0., 0.],
[ 0., 1., -1.]])
scaler = preprocessing.StandardScaler().fit(X_train)
scaler

scaler.mean_
scaler.scale_

X_scaled=scaler.transform(X_train)
X_scaled

X_scaled.mean(axis=0)

X_scaled.std(axis=0)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y,
random_state=42)
pipe = make_pipeline(StandardScaler(),
LogisticRegression())
pipe.fit(X_train, y_train) # apply scaling on training data

pipe.score(X_test, y_test)  # apply scaling on testing data, without leaking training data
Conclusion

Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.
In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.
For instance, many elements in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) may assume that all features are centered around zero or have variance of the same order. If a feature has a variance that is orders of magnitude larger than the others, it might dominate the objective function and make the estimator unable to learn from the other features correctly.
The preprocessing module provides the StandardScaler utility class, which is a quick and easy way to perform this operation on an array-like dataset.
This class implements the Transformer API to compute the mean and standard deviation on a training set so as to be able to later re-apply the same transformation on the testing set. This class is hence suitable for use in the early steps of a Pipeline.
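
A minimal sketch, using the same X_train array as above, checking that StandardScaler reproduces the manual z-score (x - mean) / std column by column:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1., 2.],
                    [2.,  0., 0.],
                    [0.,  1., -1.]])

scaler = StandardScaler().fit(X_train)                        # learns mean_ and scale_
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

assert np.allclose(scaler.transform(X_train), manual)
print(scaler.mean_)    # column means
print(scaler.scale_)   # column standard deviations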
18-January-2023

Aim- Plot Independent Variable X and Dependent Variable Y

Pseudo code as well as result

import numpy as np

import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error, r2_score

import statsmodels.api as sm

x=np.array([1,2,3,4,5])

y=np.array([7,14,15,18,19])

n=np.size(x)

x_mean=np.mean(x)

y_mean=np.mean(y)

x_mean,y_mean
Sxy=np.sum(x*y)-n*x_mean*y_mean

Sxx=np.sum(x*x)-n*x_mean*x_mean

b1=Sxy/Sxx

b0=y_mean-b1*x_mean

print('slope b1 is',b1)

print('intercept b0 is',b0)

plt.scatter(x,y)

plt.xlabel('Independent Variable x')

plt.ylabel('Dependent Variable y')


y_predict=b0+b1*x

plt.scatter(x,y,color='red')

plt.plot(x,y_predict,color='green')

plt.ylabel('Y')

plt.xlabel('X')
error=y-y_predict

se=np.sum(error**2)

print('squared error is',se)

mse=se/n

print('mean squared error is',mse)

rmse=np.sqrt(mse)

print('root mean square error is',rmse)

SSt=np.sum((y-y_mean)**2)

R2=1-(se/SSt)

print('R Square is',R2)



Conclusion
We can observe the points distributed in the graph fitting the regression line. We got SE as 10.80, MSE as 2.16, RMSE as 1.4696 and the R2 score as 0.8789.
The high value of R square shows that linear regression fits the data well.
MinMaxScaler transforms features by scaling each feature to a given range: the estimator scales and translates each feature individually such that it lies in the given range on the training set, e.g. between zero and one.
The result of the corr() method is a table of numbers that represents how strong the relationship between two columns is. A value of 1 means a one-to-one relationship (a perfect correlation), and for this data set, each time a value went up in the first column, the other one went up as well. A value of 0.9 also indicates a good relationship; if you increase one value, the other will probably increase as well.
The adjusted R-squared is positive, not negative, and it is always lower than the R-squared. The idea of SVR is to consider the points that lie within the decision boundary line; the best-fit line is the hyperplane that contains the maximum number of points.
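
A short verification sketch on the same five-point toy data as above: the closed-form slope and intercept agree with scikit-learn's LinearRegression, and the adjusted R2 (with k = 1 predictor) comes out slightly below R2:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([7, 14, 15, 18, 19], dtype=float)
n, k = x.size, 1

# Closed-form estimates
b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x * x) - n * x.mean() ** 2)
b0 = y.mean() - b1 * x.mean()

# sklearn estimates agree with the closed form
model = LinearRegression().fit(x.reshape(-1, 1), y)
assert np.isclose(model.coef_[0], b1) and np.isclose(model.intercept_, b0)

r2 = r2_score(y, b0 + b1 * x)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print('R2:', r2, 'Adjusted R2:', adj_r2)   # adjusted R2 is slightly lower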

Aim- Diabetes Pedigree Function

Pseudo code as well as result
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv('diabetes.csv')
print(df)

x = df[['DiabetesPedigreeFunction', 'Age']].to_numpy()
y = df['Glucose'].to_numpy()
y = y.reshape(-1, 1)
n = np.size(x)
print("Maximum y: ", np.max(y))
print("Minimum y: ", np.min(y))

# Fit a linear regression of Glucose on DiabetesPedigreeFunction and Age
regression_model = LinearRegression()
regression_model.fit(x, y)
y_predicted = regression_model.predict(x)

mse = mean_squared_error(y, y_predicted)
rmse = np.sqrt(mse)
r2 = r2_score(y, y_predicted)
print('Slope: ', regression_model.coef_)
print('Intercept:', regression_model.intercept_)
print('MSE:', mse)
print('Root Mean Squared Error:', rmse)
print('R2 score:', r2)

Conclusion

25-January-2023
Aim- Correlation Coefficient
Pseudo code with result-
import pandas as pd
import numpy as np
df=pd.read_csv('diabetes.csv')
df
x=df['DiabetesPedigreeFunction']
y=df['Glucose']
print(x,y)
corr_result=np.corrcoef(x,y)
corr_result

p=df['Insulin']
corr_result=np.corrcoef(p,y)
corr_result

q=df['Pregnancies']
corr_result=np.corrcoef(q,y)
print(corr_result)
from sklearn.linear_model import LinearRegression
# df2 is assumed to hold the predictor columns examined above
df2 = df[['DiabetesPedigreeFunction', 'Insulin', 'Pregnancies']]
regression_model = LinearRegression()
regression_model.fit(df2, y)
y_predicted = regression_model.predict(df2)
print(y_predicted)

from sklearn.metrics import mean_squared_error, r2_score
mse=mean_squared_error(y,y_predicted)
rmse=np.sqrt(mse)
r2=r2_score(y,y_predicted)
print('Slope: ',regression_model.coef_)
print('Intercept:',regression_model.intercept_)
print('MSE:',mse)
print('Root mean Squared Error:',rmse)
print('R2 score:',r2)
x=df['DiabetesPedigreeFunction'].to_numpy()
y=df['Glucose'].to_numpy()
x=x.reshape(-1,1)
y=y.reshape(-1,1)
n=np.size(x)
print("Maximum x: ",np.max(x))
print("Minimum x: ",np.min(x))
print("Maximum y: ",np.max(y))
print("Minimum y: ",np.min(y))\
from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
print(scaler.fit(x))
print(scaler.data_max_)
print(scaler.transform(x))
x=scaler.transform(x)
print(scaler.fit(y))
print(scaler.data_max_)
y=scaler.transform(y)
from sklearn.metrics import mean_squared_error, r2_score
mse=mean_squared_error(y,y_predicted)
rmse=np.sqrt(mse)
r2=r2_score(y,y_predicted)
print('Slope: ',regression_model.coef_)
print('Intercept:',regression_model.intercept_)
print('MSE:',mse)
print('Root mean Squared Error:',rmse)
print('R2 score:',r2)
k=1
print(n)
adjustedr2=1 - ((1-r2)*(n-1)/(n-k-1))
print('Adjusted R2: ',adjustedr2)

08 and 09 February,2023
Aim- Preprocessing Normalization
Pseudo code and result-
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing
import pandas as pd

df = pd.read_csv('diabetes.csv')
print(df)
df.to_numpy()

x = df[['Insulin', 'Age', 'BMI', 'DiabetesPedigreeFunction']].to_numpy()
y = df.Glucose.to_numpy()

# Normalize each sample (row) to unit norm
normalized_arr = preprocessing.normalize(x, axis=1)

# Fit a linear support vector regressor and predict
regressor = SVR(kernel='linear')
regressor.fit(x, y)
y_pred = regressor.predict(x)

import matplotlib.pyplot as plt
plt.scatter(df['Age'], y, color='tab:blue')
plt.scatter(df['Age'], y_pred, color='tab:red')
plt.show()

from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Absolute prediction error for each sample
error = np.abs(np.subtract(y, y_pred))
plt.plot(error)
plt.show()

plt.scatter(df['Age'], y_pred)
plt.scatter(df['Age'], y)
15-February-2023
Aim- Decision Tree Regression and Classification on the Diabetes Dataset
Pseudo code as well as result-
import pandas as pd
import numpy as np
import matplotlib.pyplot as mtp
df=pd.read_csv('diabetes.csv')
print(df)
Result-

x= df.iloc[:, [3,4,5,6,7]].values
y= df.iloc[:, 1].values
print(x);
Result-
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)   # fit the scaler on the training data only
x_test = st_x.transform(x_test)         # reuse the training statistics on the test data

from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor(criterion='squared_error', max_depth=3, random_state=0)
model.fit(x_train, y_train)
result-

y_pred=model.predict(x_test)
y_pred
model.score(x_test, y_test)
result-

from sklearn import tree


mtp.figure(figsize=(15,10))
tree.plot_tree(model,filled=True)
result-
x= df.iloc[:, [3,4,5,6,7]].values
y= df.iloc[:, 8].values
print(y);
result-

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=0)

from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)

from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', splitter='best',
                                    max_depth=3, random_state=0)
classifier.fit(x_train, y_train)
result-

y_pred= classifier.predict(x_test)
y_pred
result-
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)
result-

classifier.score(x_test, y_test)
result-

from sklearn import tree
mtp.figure(figsize=(15,10))
tree.plot_tree(classifier, filled=True)
result-
KMeans Clustering
Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import load_iris
Loading the Iris Dataset
Pseudocode:
iris = load_iris()
iris = sns.load_dataset('iris')
iris.head()
x = iris.iloc[:, [0, 1, 2, 3]].values
print(x)

Finding the optimum number of clusters for k-means classification

Pseudocode:
from sklearn.cluster import KMeans
wcss = []

for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300,
                    n_init=10, random_state=0)
    kmeans.fit(x)
    print(kmeans.cluster_centers_)
    wcss.append(kmeans.inertia_)

score = silhouette_score(x, kmeans.labels_, metric='euclidean')
print("The score is: ", score)

Plotting the Graph


Pseudocode:
plt.plot(range(1, 11), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') #within cluster sum of squares
plt.show()
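
As a complementary sketch (assuming x is the iris feature array built above), the silhouette score can also be computed for each candidate k and read alongside the elbow plot; the value of k where the score peaks is usually a reasonable choice:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 11):                      # silhouette needs at least 2 clusters
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0).fit(x)
    print(k, silhouette_score(x, km.labels_, metric='euclidean'))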

Hierarchical Clustering
Importing the Libraries
Pseudocode:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import load_iris

Loading the Dataset


Pseudocode:
from scipy.cluster.hierarchy import dendrogram, linkage
iris = load_iris()
iris = sns.load_dataset('iris')
iris.head()

Plotting the Dendrogram


Pseudocode:
dist_sin = linkage(iris.loc[:,["sepal_length", "sepal_width",
"petal_length", "petal_width"]],method="single")
#print(dist_sin)
plt.figure(figsize=(18,6))
dendrogram(dist_sin, leaf_rotation=90)
plt.xlabel('Index')
plt.ylabel('Distance')
plt.suptitle("DENDROGRAM SINGLE METHOD",fontsize=18)
plt.show()

Creating the cluster:


Pseudocode:
from scipy.cluster.hierarchy import fcluster
iris_SM=iris.copy()

iris_SM['2_clust']=fcluster(dist_sin,2, criterion='maxclust')
iris_SM['3_clust']=fcluster(dist_sin,3, criterion='maxclust')
iris_SM.head()
Plotting Different Graphs:
Pseudocode:
plt.figure(figsize=(24,4))

plt.suptitle("Hierarchical Clustering Single Method",fontsize=18)

plt.subplot(1,3,1)
plt.title("K = 2",fontsize=14)
sns.scatterplot(x="petal_length",y="petal_width", data=iris_SM,
hue="2_clust")

plt.subplot(1,3,2)
plt.title("K = 3",fontsize=14)
sns.scatterplot(x="petal_length",y="petal_width", data=iris_SM,
hue="3_clust")

plt.subplot(1,3,3)
plt.title("Species",fontsize=14)
sns.scatterplot(x="petal_length",y="petal_width", data=iris_SM,
hue="species")
Pseudocode:
plt.figure(figsize=(24,4))
plt.subplot(1,2,1)
plt.title("K = 2",fontsize=14)
sns.swarmplot(x="species",y="2_clust", data=iris_SM, hue="species")

plt.subplot(1,2,2)
plt.title("K = 3",fontsize=14)
sns.swarmplot(x="species",y="3_clust", data=iris_SM, hue="species")

Pseudocode:
dist_comp = linkage(iris.loc[:,["sepal_length", "sepal_width",
"petal_length", "petal_width"]],method="complete")
plt.figure(figsize=(18,6))
dendrogram(dist_comp, leaf_rotation=90)
plt.xlabel('Index')
plt.ylabel('Distance')
plt.suptitle("DENDROGRAM COMPLETE METHOD",fontsize=18)
plt.show()

Pseudocode:
iris_CM=iris.copy()
iris_CM['2_clust']=fcluster(dist_comp,2, criterion='maxclust')
iris_CM['3_clust']=fcluster(dist_comp,3, criterion='maxclust')
iris_CM.head()
Pseudocode:
plt.figure(figsize=(24,4))

plt.suptitle("Hierarchical Clustering Complete Method",fontsize=18)

plt.subplot(1,3,1)
plt.title("K = 2",fontsize=14)
sns.scatterplot(x="sepal_length",y="sepal_width", data=iris_CM,
hue="2_clust")

plt.subplot(1,3,2)
plt.title("K = 3",fontsize=14)
sns.scatterplot(x="sepal_length",y="sepal_width", data=iris_CM,
hue="3_clust")

plt.subplot(1,3,3)
plt.title("Species",fontsize=14)
sns.scatterplot(x="sepal_length",y="sepal_width", data=iris_CM,
hue="species")
Pseudocode:
plt.figure(figsize=(24,4))
plt.subplot(1,2,1)
plt.title("K = 2",fontsize=14)
sns.swarmplot(x="species",y="2_clust", data=iris_CM, hue="species")

plt.subplot(1,2,2)
plt.title("K = 3",fontsize=14)
sns.swarmplot(x="species",y="3_clust", data=iris_CM, hue="species")

Thompson Sampling
Importing the Essential Libraries
Pseudocode:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Importing the Dataset


Pseudocode:
dataset = pd.read_csv('ads_CTR_Optimisation.csv')
dataset.head()

Implementing Thompson Sampling


Pseudocode:
import random
N = 10000
d = 10
ads_selected = []
numbers_of_rewards_1 = [0] * d
numbers_of_rewards_0 = [0] * d
total_reward = 0
for n in range(0, N):
    ad = 0
    max_random = 0
    for i in range(0, d):
        random_beta = random.betavariate(numbers_of_rewards_1[i] + 1,
                                         numbers_of_rewards_0[i] + 1)
        if random_beta > max_random:
            max_random = random_beta
            ad = i
    ads_selected.append(ad)
    reward = dataset.values[n, ad]
    if reward == 1:
        numbers_of_rewards_1[ad] = numbers_of_rewards_1[ad] + 1
    else:
        numbers_of_rewards_0[ad] = numbers_of_rewards_0[ad] + 1
    total_reward = total_reward + reward

Visualising the results – Histogram


Pseudocode:
plt.hist(ads_selected)
plt.title('Histogram of ads selections')
plt.xlabel('Ads')
plt.ylabel('Number of times each ad was selected')
plt.show()

Upper Confidence Bound


Importing the Essential Libraries
Pseudocode:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Importing the Dataset


dataset = pd.read_csv('ads_CTR_Optimisation.csv')
dataset.head()

Implementing UCB
Pseudocode:
import math
N = 10000
d = 10
ads_selected = []
numbers_of_selections = [0] * d
sums_of_rewards = [0] * d
total_reward = 0
for n in range(0, N):
    ad = 0
    max_upper_bound = 0
    for i in range(0, d):
        if numbers_of_selections[i] > 0:
            average_reward = sums_of_rewards[i] / numbers_of_selections[i]
            delta_i = math.sqrt(3/2 * math.log(n + 1) / numbers_of_selections[i])
            upper_bound = average_reward + delta_i
        else:
            upper_bound = 1e400
        if upper_bound > max_upper_bound:
            max_upper_bound = upper_bound
            ad = i
    ads_selected.append(ad)
    numbers_of_selections[ad] = numbers_of_selections[ad] + 1
    reward = dataset.values[n, ad]
    sums_of_rewards[ad] = sums_of_rewards[ad] + reward
    total_reward = total_reward + reward

Visualising the Results

Pseudocode:
plt.hist(ads_selected)
plt.title('Histogram of ads selections')
plt.xlabel('Ads')
plt.ylabel('Number of times each ad was selected')
plt.show()
