Lab File
Ques 1- Fill with a Constant Value.
Pseudo-code:
import pandas as pd
df = pd.DataFrame({
    'date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
             '2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04'],
    'fruit': ['apple', 'apple', 'apple', 'apple',
              'mango', 'mango', 'mango', 'mango'],
    'price': [0.80, None, None, 1.20, None, 2.10, 2.00, 1.80]})
df['date'] = pd.to_datetime(df['date'])
print("ORIGINAL DATA\n")
print(df)
# replace every NaN with the constant 0.85
df['price'] = df['price'].fillna(0.85)
print("\n\nDATA FILLED WITH CONSTANT VALUE IN PLACE OF NaN\n")
print(df)
Result-
Ques 2- Fill with the Mean of Column.
Pseudo-code:
# column-wide mean
df['price'] = df['price'].fillna(df['price'].mean())
Result-
Ques 3- Fill with Median of Column.
Pseudo-code:
df['price'] = df['price'].fillna(df['price'].median())
Result-
Ques 4- Fill with Mean of Group.
Pseudo-code:
# mean of each fruit group
df['price'] = df['price'].fillna(df.groupby('fruit')['price'].transform('mean'))
Result-
Ques 5- Fill with Median of Group.
Pseudo-code:
# median of each fruit group
df['price'] = df['price'].fillna(df.groupby('fruit')['price'].transform('median'))
Result-
Ques 6- Fill using Forward Fill.
Pseudo-code:
df['price'] = df['price'].ffill()
Result-
Ques 7- Fill using Forward Fill with Limit=1.
Pseudo-code:
df['price'] = df['price'].ffill(limit=1)
Result-
Ques 8- Fill using Forward Fill within Group.
Pseudo-code:
df['price'] = df.groupby('fruit')['price'].ffill()
Result-
Ques 9- Fill using Forward Fill within Group with Limit=1.
Pseudo-code:
df['price'] = df.groupby('fruit')['price'].ffill(limit = 1)
Result-
Ques 10- Fill using Back Fill.
Pseudo-code:
df['price'] = df['price'].bfill()
Result-
Ques 11- Fill using Back Fill with Limit=1.
Pseudo-code:
df['price'] = df['price'].bfill(limit=1)
Result-
Ques 12- Fill using Back Fill within Group.
Pseudo-code:
# backfill without propagation limit
df['price'] = df.groupby('fruit')['price'].bfill()
Result-
Ques 13- Fill using Back Fill within Group with Limit=1.
Pseudo-code:
df['price'] = df.groupby('fruit')['price'].bfill(limit = 1)
Result-
Ques 14- Fill by Combining both Forward Fill and Back Fill.
Pseudo-code:
df['price'] = df.groupby('fruit')['price'].ffill().bfill()
Result-
Ques 15- Fill by Combining both Back Fill and Forward Fill, applying
Back Fill first.
Pseudo-code:
df['price'] = df.groupby('fruit')['price'].bfill().ffill()
Result-
Ques 16- Fill using Interpolation.
Pseudo-code:
df['price'] = df['price'].interpolate(method='linear')
Result-
Ques 17- Fill using Interpolation within Group.
Pseudo-code:
# transform keeps the result aligned with df's original index
df['price'] = df.groupby('fruit')['price'].transform(
    lambda x: x.interpolate(method='linear'))
Result-
Ques 18- Fill using both Interpolation and Back Fill.
Pseudo-code:
df['price'] = df.groupby('fruit')['price'].transform(
    lambda x: x.interpolate(method='linear')).bfill()
Result-
Ques 19- Fill value based on Conditions.
Pseudo-code:
# transform returns a Series of group means aligned row-by-row with df
mean_price = df.groupby('fruit')['price'].transform('mean')
print("\nMEAN PRICE\n")
print(mean_price)
# fill conditionally: use the group mean only where price is missing
df['price'] = df['price'].where(df['price'].notna(), mean_price)
Result-
Conclusion
In this lab we filled missing values with a constant, with the column
mean/median, with the group mean/median, and with:
● Forward Fill
● Back Fill
● Interpolation
The choice of filling method depends on the assumptions and the
context of the problem. For example, filling the missing prices of
mangoes with the mean price of apples and mangoes together may not be
a good idea, as apples and mangoes have rather different prices in our
toy dataset; the small sketch below makes this concrete.
We also saw how to use each of these methods in conjunction with the
pandas groupby() method to fill missing values for each group
separately.
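A minimal sketch contrasting a global-mean fill with a group-mean fill
(the reduced toy frame here is our own, chosen for illustration):
import pandas as pd
toy = pd.DataFrame({'fruit': ['apple', 'apple', 'mango', 'mango'],
                    'price': [0.80, None, None, 2.00]})
# the global mean (1.40) ignores the fruit and overprices the apple gap
print(toy['price'].fillna(toy['price'].mean()))
# group means (0.80 and 2.00) respect each fruit's own price level
print(toy['price'].fillna(toy.groupby('fruit')['price'].transform('mean')))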
df.columns
df.columns = ['color', 'size', 'price', 'classlabel']
print(df)
# ordinal mapping for sizes (assumed, as in the class example)
size_mapping = {'XL': 3, 'L': 2, 'M': 1}
df['size'] = df['size'].map(size_mapping)
size_mapping.items()
# invert the mapping to recover the original string labels
inv_size_mapping = {v: k for k, v in size_mapping.items()}
inv_size_mapping
df['size'] = df['size'].map(inv_size_mapping)
df
Conclusion
Pseudo-code:
import pandas as pd
from io import StringIO
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
0.0,11.0,12.0,'''
# StringIO lets read_csv parse the string as if it were a file
df = pd.read_csv(StringIO(csv_data))
df
Result-
Ques 4- Show the Values in the DataFrame.
Pseudo-code:
df.values
Result-
Conclusion
12-January-2023
Aim- Encoding Categorical Data.
Ques 1- Create the DataFrame and Name the Columns.
Pseudo-code:
import pandas as pd
df=pd.DataFrame([
['green','M',10.1,'class1'],
['red','L',13.5,'class2'],
['blue','XL',15.3,'class1']
])
df
df.columns
df.columns=['color','size','price','classlabel']
df
Ques 2- Encode the Class Labels.
Pseudo-code with Result:
import numpy as np
np.unique(df['classlabel'])
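np.unique only lists the distinct labels; the encoding step itself was
not shown. A minimal sketch, assuming scikit-learn's LabelEncoder:
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
# fit_transform maps 'class1'/'class2' to the integers 0/1
df['classlabel'] = class_le.fit_transform(df['classlabel'])
print(df)
# the integer codes can be reversed with inverse_transform
df['classlabel'] = class_le.inverse_transform(df['classlabel'])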
Aim- Standardizing Features with StandardScaler.
Pseudo-code with Result:
from sklearn import preprocessing
import numpy as np
# example training matrix (assumed; the original definition was not shown)
X_train = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
scaler = preprocessing.StandardScaler().fit(X_train)
scaler.mean_
scaler.scale_
X_scaled = scaler.transform(X_train)
X_scaled
X_scaled.mean(axis=0)
X_scaled.std(axis=0)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
X, y = make_classification(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y,
random_state=42)
pipe = make_pipeline(StandardScaler(),
LogisticRegression())
pipe.fit(X_train, y_train)   # scaling is fit on the training data only
pipe.score(X_test, y_test)   # the same scaling is applied to the test data
import numpy as np
import matplotlib.pyplot as plt
x=np.array([1,2,3,4,5])
y=np.array([7,14,15,18,19])
n=np.size(x)
x_mean=np.mean(x)
y_mean=np.mean(y)
x_mean,y_mean
Sxy=np.sum(x*y)-n*x_mean*y_mean
Sxx=np.sum(x*x)-n*x_mean*x_mean
b1=Sxy/Sxx
b0=y_mean-b1*x_mean
print('slope b1 is',b1)
print('intercept b0 is',b0)
# predicted values from the fitted line
y_predict = b0 + b1*x
plt.scatter(x, y, color='red')
plt.plot(x, y_predict, color='green')
plt.ylabel('Y')
plt.xlabel('X')
plt.show()
error = y - y_predict
se = np.sum(error**2)           # sum of squared errors
mse = se/n                      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error
SSt = np.sum((y - y_mean)**2)   # total sum of squares
R2 = 1 - (se/SSt)
print('SE:', se, 'MSE:', mse, 'RMSE:', rmse, 'R2:', R2)
Conclusion
We can observe the points distributed in the graph fitting the
regression line. We got SE as 10.80, MSE as 2.16, RMSE as 1.4697 and
an R2 score of 0.8789. The high value of R-squared shows that linear
regression fits the data well.
MinMaxScaler, by contrast, transforms features by scaling each feature
to a given range: the estimator scales and translates each feature
individually so that it lies within the given range on the training
set, e.g. between zero and one.
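That description corresponds to scikit-learn's MinMaxScaler; a minimal
sketch on a small matrix of our own choosing:
from sklearn.preprocessing import MinMaxScaler
import numpy as np
X = np.array([[1., 10.], [2., 20.], [4., 40.]])
scaler = MinMaxScaler(feature_range=(0, 1))
# each column is scaled and translated to span exactly [0, 1]
print(scaler.fit_transform(X))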
The result of the corr() method is a table of numbers that represents
how strong the relationship between two columns is. A value of 1 means
a one-to-one relationship (a perfect correlation): for such a data
set, each time a value went up in the first column, the other went up
as well. A value of 0.9 is also a strong relationship; if you increase
one value, the other will probably increase as well.
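A concrete illustration on the diabetes data used above (the column
selection here is ours, for illustration):
import pandas as pd
df = pd.read_csv('diabetes.csv')
# diagonal entries are 1.0: every column correlates perfectly with itself
print(df[['Glucose', 'Insulin', 'BMI']].corr())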
The adjusted R-squared is positive, not negative, and it is always
lower than the R-squared. The idea of SVR is to consider only the
points that lie within the decision boundary (the ε-tube around the
regression line); the best-fit line is the hyperplane that contains
the maximum number of points.
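A quick check of that claim with the numbers from the regression lab
above (n = 5 samples, k = 1 predictor, both taken from that lab):
# adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - k - 1)
n, k = 5, 1
R2 = 0.8789
R2_adj = 1 - (1 - R2) * (n - 1) / (n - k - 1)
print(R2_adj)   # ~0.8385: positive, and lower than R2, as stated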
Conclusion
25-January-2023
Aim- Correlation Coefficient.
Pseudo-code with Result:
import pandas as pd
import numpy as np
df=pd.read_csv('diabetes.csv')
df
x=df['DiabetesPedigreeFunction']
y=df['Glucose']
print(x,y)
corr_result=np.corrcoef(x,y)
corr_result
p=df['Insulin']
corr_result=np.corrcoef(p,y)
corr_result
q=df['Pregnancies']
corr_result=np.corrcoef(q,y)
print(corr_result)
from sklearn.linear_model import LinearRegression
# df2 is assumed to hold the predictor columns examined above
df2 = df[['DiabetesPedigreeFunction', 'Insulin', 'Pregnancies']]
regression_model = LinearRegression()
regression_model.fit(df2, y)
y_predicted = regression_model.predict(df2)
print(y_predicted)
08 and 09 February, 2023
Aim- Preprocessing: Normalization.
Pseudo-code with Result:
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing
import pandas as pd
df = pd.read_csv('diabetes.csv')
print(df)
x = df[['Insulin', 'Age', 'BMI', 'DiabetesPedigreeFunction']].to_numpy()
y = df.Glucose.to_numpy()
# L2-normalize each row (sample) of the feature matrix
normalized_arr = preprocessing.normalize(x, axis=1)
regressor = SVR(kernel='linear')
regressor.fit(x, y)
y_pred = regressor.predict(x)
import matplotlib.pyplot as plt
plt.scatter(df['Age'], y, color='tab:blue')
plt.scatter(df['Age'], y_pred, color='tab:red')
plt.show()
15 February, 2023
Aim- Decision Tree Regression.
Pseudo-code with Result:
import pandas as pd
import numpy as np
import matplotlib.pyplot as mtp
df=pd.read_csv('diabetes.csv')
print(df)
Result-
x = df.iloc[:, [3, 4, 5, 6, 7]].values
y = df.iloc[:, 1].values
print(x)
Result-
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y,
    test_size=0.25, random_state=0)
# scale after splitting: fit on the training set, transform both sets
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor(criterion='squared_error',
    max_depth=3, random_state=0)
model.fit(x_train, y_train)
Result-
y_pred=model.predict(x_test)
y_pred
model.score(x_test, y_test)
Result-
# 'classifier' is assumed to be a classification model fitted in an
# earlier step (a confusion matrix applies to class predictions, not
# to the regressor's continuous output)
y_pred = classifier.predict(x_test)
y_pred
Result-
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm
Result-
model.score(x_test, y_test)
Result-
Hierarchical Clustering
Importing the Libraries
Pseudocode:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
# the iris frame with the column names used below (loading it via
# seaborn is an assumption; the original loading step was not shown)
iris = sns.load_dataset('iris')
# single-linkage clustering on the four numeric features
dist_sin = linkage(iris.loc[:, ["sepal_length", "sepal_width",
    "petal_length", "petal_width"]], method="single")
iris_SM = iris.copy()
iris_SM['2_clust'] = fcluster(dist_sin, 2, criterion='maxclust')
iris_SM['3_clust'] = fcluster(dist_sin, 3, criterion='maxclust')
iris_SM.head()
Plotting Different Graphs:
Pseudocode:
plt.figure(figsize=(24,4))
plt.subplot(1,3,1)
plt.title("K = 2",fontsize=14)
sns.scatterplot(x="petal_length",y="petal_width", data=iris_SM,
hue="2_clust")
plt.subplot(1,3,2)
plt.title("K = 3",fontsize=14)
sns.scatterplot(x="petal_length",y="petal_width", data=iris_SM,
hue="3_clust")
plt.subplot(1,3,3)
plt.title("Species",fontsize=14)
sns.scatterplot(x="petal_length",y="petal_width", data=iris_SM,
hue="species")
Pseudocode:
plt.figure(figsize=(24,4))
plt.subplot(1,2,1)
plt.title("K = 2",fontsize=14)
sns.swarmplot(x="species",y="2_clust", data=iris_SM, hue="species")
plt.subplot(1,2,2)
plt.title("K = 3",fontsize=14)
sns.swarmplot(x="species",y="3_clust", data=iris_SM, hue="species")
Pseudocode:
dist_comp = linkage(iris.loc[:,["sepal_length", "sepal_width",
"petal_length", "petal_width"]],method="complete")
plt.figure(figsize=(18,6))
dendrogram(dist_comp, leaf_rotation=90)
plt.xlabel('Index')
plt.ylabel('Distance')
plt.suptitle("DENDROGRAM COMPLETE METHOD",fontsize=18)
plt.show()
Pseudocode:
iris_CM=iris.copy()
iris_CM['2_clust']=fcluster(dist_comp,2, criterion='maxclust')
iris_CM['3_clust']=fcluster(dist_comp,3, criterion='maxclust')
iris_CM.head()
Pseudocode:
plt.figure(figsize=(24,4))
plt.subplot(1,3,1)
plt.title("K = 2",fontsize=14)
sns.scatterplot(x="sepal_length",y="sepal_width", data=iris_CM,
hue="2_clust")
plt.subplot(1,3,2)
plt.title("K = 3",fontsize=14)
sns.scatterplot(x="sepal_length",y="sepal_width", data=iris_CM,
hue="3_clust")
plt.subplot(1,3,3)
plt.title("Species",fontsize=14)
sns.scatterplot(x="sepal_length",y="sepal_width", data=iris_CM,
hue="species")
Pseudocode:
plt.figure(figsize=(24,4))
plt.subplot(1,2,1)
plt.title("K = 2",fontsize=14)
sns.swarmplot(x="species",y="2_clust", data=iris_CM, hue="species")
plt.subplot(1,2,2)
plt.title("K = 3",fontsize=14)
sns.swarmplot(x="species",y="3_clust", data=iris_CM, hue="species")
Thompson Sampling
Importing the Essential Libraries
Pseudocode:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Implementing UCB
Pseudocode:
import math
# 'dataset' is assumed to be the ad-clicks table used in class, e.g.:
# dataset = pd.read_csv('Ads_CTR_Optimisation.csv')
N = 10000                        # number of rounds
d = 10                           # number of ads (arms)
ads_selected = []
numbers_of_selections = [0] * d
sums_of_rewards = [0] * d
total_reward = 0
for n in range(0, N):
    ad = 0
    max_upper_bound = 0
    for i in range(0, d):
        if numbers_of_selections[i] > 0:
            average_reward = sums_of_rewards[i] / numbers_of_selections[i]
            # confidence width of arm i at round n
            delta_i = math.sqrt(3/2 * math.log(n + 1) / numbers_of_selections[i])
            upper_bound = average_reward + delta_i
        else:
            # an unselected arm gets an effectively infinite bound,
            # so every arm is tried at least once
            upper_bound = 1e400
        if upper_bound > max_upper_bound:
            max_upper_bound = upper_bound
            ad = i
    ads_selected.append(ad)
    numbers_of_selections[ad] += 1
    reward = dataset.values[n, ad]
    sums_of_rewards[ad] += reward
    total_reward += reward
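The aim above names Thompson Sampling, while the loop shown implements
UCB. A minimal sketch of Thompson Sampling on the same assumed dataset,
keeping Beta posteriors over the binary click rewards:
import random
N = 10000
d = 10
ads_selected = []
numbers_of_rewards_1 = [0] * d   # rounds in which each ad returned reward 1
numbers_of_rewards_0 = [0] * d   # rounds in which each ad returned reward 0
total_reward = 0
for n in range(0, N):
    ad = 0
    max_random = 0
    for i in range(0, d):
        # draw from the Beta posterior of arm i
        random_beta = random.betavariate(numbers_of_rewards_1[i] + 1,
                                         numbers_of_rewards_0[i] + 1)
        if random_beta > max_random:
            max_random = random_beta
            ad = i
    ads_selected.append(ad)
    reward = dataset.values[n, ad]  # same assumed dataset as the UCB loop
    if reward == 1:
        numbers_of_rewards_1[ad] += 1
    else:
        numbers_of_rewards_0[ad] += 1
    total_reward += reward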