Record 5

71762108005

Ex. No: 1
DATA WAREHOUSING USING POSTGRESQL
Date: 23.8.23

Aim:
To implement a Data Warehouse in PostgreSQL using Python.

Program Code:
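A minimal sketch of a Streamlit front-end that performs read, write, and update operations on a PostgreSQL table is given below; the connection settings, the sales table, and its columns are illustrative assumptions rather than the exact contents of warehouse.py.

# Illustrative sketch of warehouse.py: a Streamlit app over PostgreSQL.
# The connection details, the "sales" table and its columns are assumptions.
import psycopg2
import streamlit as st

def get_connection():
    # Adjust host/dbname/user/password to the local PostgreSQL setup.
    return psycopg2.connect(host="localhost", dbname="warehouse",
                            user="postgres", password="postgres")

def read_rows(conn):
    with conn.cursor() as cur:
        cur.execute("SELECT id, product, quantity FROM sales ORDER BY id")
        return cur.fetchall()

def insert_row(conn, product, quantity):
    with conn, conn.cursor() as cur:
        cur.execute("INSERT INTO sales (product, quantity) VALUES (%s, %s)",
                    (product, quantity))

def update_row(conn, row_id, quantity):
    with conn, conn.cursor() as cur:
        cur.execute("UPDATE sales SET quantity = %s WHERE id = %s",
                    (quantity, row_id))

conn = get_connection()
with conn, conn.cursor() as cur:
    # Create the warehouse table if it does not exist yet.
    cur.execute("""CREATE TABLE IF NOT EXISTS sales (
                       id SERIAL PRIMARY KEY,
                       product TEXT NOT NULL,
                       quantity INTEGER NOT NULL)""")

st.title("Data Warehouse - PostgreSQL")
st.table(read_rows(conn))                                  # read

product = st.text_input("Product name")
quantity = st.number_input("Quantity", min_value=0, step=1)
if st.button("Insert record"):                             # write
    insert_row(conn, product, int(quantity))

row_id = st.number_input("Record id to update", min_value=1, step=1)
if st.button("Update quantity"):                           # update
    update_row(conn, int(row_id), int(quantity))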


Command to run the file: streamlit run warehouse.py


Output:


Result:
The Python program to create a data warehouse in PostgreSQL with read, write, and update
operations has been implemented successfully.


Ex. No: 2
APRIORI BASED ALGORITHM
Date: 06.9.23

Aim:
To implement an Apriori Based Algorithm in Python.

Dataset:


Program Code:
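A minimal sketch of the Apriori step using the apyori package is given below; the transaction list and the support/confidence thresholds are illustrative assumptions, and the actual program and dataset may differ.

# Illustrative Apriori sketch using the apyori package; the transactions and
# thresholds below are assumptions, not the exact dataset used above.
from apyori import apriori

transactions = [
    ['milk', 'bread', 'butter'],
    ['milk', 'bread'],
    ['milk', 'butter'],
    ['bread', 'butter'],
    ['milk', 'bread', 'butter'],
]

# min_support and min_confidence control which association rules are kept.
rules = list(apriori(transactions, min_support=0.4, min_confidence=0.7))

for rule in rules:
    print(rule)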

Output:

[RelationRecord(items=frozenset({'milk', 'butter', 'bread'}), support=0.5,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'butter'}),
items_add=frozenset({'milk', 'bread'}), confidence=0.7333333333333334, lift=1.241025641025641),
OrderedStatistic(items_base=frozenset({'milk', 'bread'}), items_add=frozenset({'butter'}),
confidence=0.8461538461538461, lift=1.241025641025641)])]

Inference:

The support value for the first rule is 0.5. This number is calculated by dividing the number of transactions
containing 'Milk', 'Bread', and 'Butter' by the total number of transactions.
The confidence level for the rule is 0.846, which shows that out of all the transactions that contain both
'Milk' and 'Bread', 84.6% contain 'Butter' too.
The lift of 1.241 tells us that 'Butter' is 1.241 times more likely to be bought by customers who buy both
'Milk' and 'Bread' than it is to be bought in general.

Result:
The Python program to execute an Apriori Based Algorithm has been implemented successfully.


Ex. No: 3
FP - GROWTH ALGORITHM
Date: 12.9.23

Aim:
To implement the FP - Growth Algorithm in Python.

Transactions:
Flavours of Ice Cream taken by each individual:
transactions = [
['vanilla', 'chocolate'],
['strawberry', 'chocolate', 'vanilla'],
['chocolate', 'mint'],
['vanilla', 'strawberry', 'chocolate', 'mint'],
['chocolate'],
['vanilla', 'strawberry', 'chocolate'],
['strawberry', 'mint', 'chocolate'],
['vanilla', 'strawberry', 'chocolate', 'mint'],
]

Program Code:
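A minimal sketch using the mlxtend implementation of FP-Growth is given below; the minimum-support threshold (5 of the 8 transactions) is an assumption chosen to match the reported counts, and the actual program may use a different FP-Growth library.

# Illustrative FP-Growth sketch using mlxtend; the min_support threshold
# (5 out of 8 transactions) is an assumption.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ['vanilla', 'chocolate'],
    ['strawberry', 'chocolate', 'vanilla'],
    ['chocolate', 'mint'],
    ['vanilla', 'strawberry', 'chocolate', 'mint'],
    ['chocolate'],
    ['vanilla', 'strawberry', 'chocolate'],
    ['strawberry', 'mint', 'chocolate'],
    ['vanilla', 'strawberry', 'chocolate', 'mint'],
]

# One-hot encode the transactions into a boolean DataFrame.
encoder = TransactionEncoder()
onehot = encoder.fit_transform(transactions)
df = pd.DataFrame(onehot, columns=encoder.columns_)

# Mine itemsets that appear in at least 5 of the 8 transactions.
itemsets = fpgrowth(df, min_support=5 / 8, use_colnames=True)

print("Frequent Itemsets:")
for _, row in itemsets.iterrows():
    count = int(round(row['support'] * len(transactions)))
    print(f"{sorted(row['itemsets'])}: {count}")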


Output:

Frequent Itemsets:
['vanilla']: 5
['chocolate']: 8
['strawberry']: 5

Inference:

The code prints the frequent itemsets along with their support counts. From the output, we can see that it has
found frequent itemsets for the ice cream flavors based on the given minimum support threshold.

'vanilla' is chosen 5 times.
'chocolate' is chosen 8 times.
'strawberry' is chosen 5 times.

These results provide insights into which ice cream flavors are popular among customers and can be used
for various purposes, such as product recommendations.

Result:
The Python program to execute an FP - Growth Algorithm has been implemented successfully.


Ex. No: 4
K-MEANS & HIERARCHICAL CLUSTERING IN WEKA TOOL
Date: 26.9.23

Aim:
To implement the K-Means Clustering and Hierarchical Clustering Algorithm in Weka Tool.

K-Means Clustering Algorithm:

function k_means(dataset, k):
    # Randomly initialize cluster centroids
    centroids = initialize_random_centroids(dataset, k)

    while true:
        # Assign each data point to the nearest centroid
        clusters = assign_to_nearest_centroid(dataset, centroids)

        # Update the centroids based on the assigned data points
        new_centroids = update_centroids(clusters)

        # Check for convergence
        if centroids_converged(centroids, new_centroids):
            break

        # Update centroids for the next iteration
        centroids = new_centroids

    return clusters
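
For reference, a compact NumPy version of the same loop is sketched below; the experiment itself uses Weka's SimpleKMeans, and the sample data here are made up for illustration.

# Compact NumPy K-Means matching the pseudocode above (illustrative only).
import numpy as np

def k_means(dataset, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly pick k data points as the initial centroids.
    centroids = dataset[rng.choice(len(dataset), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each data point to the nearest centroid.
        distances = np.linalg.norm(dataset[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update each centroid as the mean of its assigned points.
        new_centroids = np.array([dataset[labels == j].mean(axis=0) for j in range(k)])
        # Check for convergence.
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two obvious groups of 2-D points (made-up sample data).
data = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                 [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
print(k_means(data, k=2))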

Hierarchical Clustering Algorithm:

function hierarchical_clustering(dataset):
    # Initialize each data point as a separate cluster
    clusters = initialize_clusters(dataset)

    while len(clusters) > 1:
        # Compute pairwise distances between clusters
        distances = compute_pairwise_distances(clusters)

        # Find the closest pair of clusters
        closest_pair = find_closest_clusters(distances)

        # Merge the closest pair of clusters into a new cluster
        new_cluster = merge_clusters(closest_pair, clusters)

        # Remove the merged clusters from the list of clusters
        clusters.remove(closest_pair[0])
        clusters.remove(closest_pair[1])

        # Add the new cluster to the list of clusters
        clusters.append(new_cluster)

    return clusters
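
Similarly, a short SciPy sketch of single-linkage agglomerative clustering is given below; the experiment itself uses Weka's HierarchicalClusterer, and the sample data here are made up for illustration.

# Illustrative single-linkage hierarchical clustering with SciPy
# (mirrors Weka's "-N 2 -L SINGLE" options with Euclidean distance).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

data = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                 [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# Build the merge tree, then cut it into 2 flat clusters.
merge_tree = linkage(data, method='single', metric='euclidean')
labels = fcluster(merge_tree, t=2, criterion='maxclust')
print(labels)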

Dataset:
https://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/supermarket.arff
https://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/weather.nominal.arff

Output:


=== Run information ===

Scheme: weka.clusterers.SimpleKMeans -init 0 -max-candidates 100 -periodic-pruning 10000 -min-density 2.0 -t1 -1.25 -t2 -1.0 -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -num-slots 1 -S 10
Relation: supermarket
Instances: 4627
Attributes: 217
[list of attributes omitted]
Test mode: evaluate on training data

=== Clustering model (full training set) ===

kMeans
======
Number of iterations: 2
Within cluster sum of squared errors: 0.0

Initial starting points (random):


Cluster 0:
t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,
t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,
t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,high
Cluster 1:
t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,
t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,
t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,low

Missing values globally replaced with mean/mode

Time taken to build model (full training data) : 0.1 seconds

=== Model and evaluation on training set ===

Clustered Instances

0 1679 ( 36%)
1 2948 ( 64%)


=== Run information ===

Scheme: weka.clusterers.HierarchicalClusterer -N 2 -L SINGLE -P -A "weka.core.EuclideanDistance -R first-last"
Relation: weather.symbolic
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
Test mode: evaluate on training data

=== Clustering model (full training set) ===

Cluster 0
((1.0:1,1.0:1):0,1.0:1)

Cluster 1
(((((0.0:1,0.0:1):0.41421,((((0.0:1,0.0:1):0,
(0.0:1,0.0:1):0):0.41421,1.0:1.41421):0,0.0:1.41421):0):0,0.0:1.41421):0,0.0:1.41421):0,1.0:1.41421)

Time taken to build model (full training data) : 0 seconds

=== Model and evaluation on training set ===

Clustered Instances

0 3 ( 21%)
1 11 ( 79%)

K-Means Clustering Inference:

• Number of Clusters: The K-Means algorithm has created two clusters, Cluster 0 and Cluster 1.
• Cluster Sizes: Cluster 0 contains 36% of the instances (1679), while Cluster 1 contains 64% (2948).
These percentages describe how the data points are distributed among the clusters.
• Within-Cluster Sum of Squared Errors (WSS): The WSS is reported as 0.0, which is unusual. The WSS
measures the total squared distance of the data points in each cluster from their centroid and is
normally positive; a value of 0.0 suggests either that the points coincide with the centroids or that
there is an issue with how the clustering was evaluated.
• Initial Starting Points: The randomly chosen starting instances end in the values "high" (Cluster 0)
and "low" (Cluster 1); these appear to be the values of the last attribute of the two starting
instances rather than labels assigned to the clusters.


Hierarchical Clustering Inference:

• Cluster Structures:
Cluster 0: This cluster contains 3 instances, which represent 21% of the total instances in the
dataset.
Cluster 1: Cluster 1 contains 11 instances, representing 79% of the total instances.
• Hierarchical Structure: The hierarchical structure of the clusters is represented in a nested format,
with sub-clusters within Cluster 1.
• Interpretation:
Cluster 0 and Cluster 1 are the top-level clusters.
Within Cluster 1, there are sub-clusters with further nested structures.
• Distance Measures:
The distances between data points within clusters are represented using numerical values.
The specific meaning of these distances depends on the distance metric and linkage method used for
hierarchical clustering (in this case, the "EuclideanDistance" and "SINGLE" linkage method).

Result:
Successfully implemented the K-Means Clustering and Hierarchical Clustering algorithms in the Weka Tool
using the Cluster option.


Ex. No: 5
BAYESIAN CLASSIFIER IN WEKA TOOL
Date: 17.10.23

Aim:
To implement the Bayesian Classifier Algorithm in Python and in Weka Tool.

Bayesian Classifier Algorithm:

import numpy as np
import matplotlib.pyplot as plt

# Generate a synthetic dataset with two features
np.random.seed(0)
X = np.random.rand(200, 2)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# Split the dataset into training and testing sets
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Implement Gaussian Naive Bayesian Classifier
class GaussianNaiveBayes:
    def fit(self, X, y):
        self.class_probs = {}
        self.mean = {}
        self.variance = {}
        self.classes = np.unique(y)

        for c in self.classes:
            X_c = X[y == c]
            self.class_probs[c] = len(X_c) / len(X)
            self.mean[c] = X_c.mean(axis=0)
            self.variance[c] = X_c.var(axis=0)

    def predict(self, X):
        predictions = [self._predict(x) for x in X]
        return np.array(predictions)

    def _predict(self, x):
        posteriors = []

        for c in self.classes:
            class_prob = np.log(self.class_probs[c])
            mean = self.mean[c]
            variance = self.variance[c]
            likelihood = -0.5 * np.sum(np.log(2 * np.pi * variance) + (x - mean) ** 2 / variance)
            posterior = class_prob + likelihood
            posteriors.append(posterior)

        return self.classes[np.argmax(posteriors)]

# Fit the Gaussian Naive Bayes Classifier
gnb = GaussianNaiveBayes()
gnb.fit(X_train, y_train)

# Make predictions on the test set
y_pred = gnb.predict(X_test)

# Visualize the decision boundaries
x_min, x_max = X[:, 0].min() - 0.1, X[:, 0].max() + 0.1
y_min, y_max = X[:, 1].min() - 0.1, X[:, 1].max() + 0.1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))
Z = gnb.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, s=20, edgecolor='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Gaussian Naive Bayes Classifier Decision Boundaries')
plt.show()


Dataset:
https://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/diabetes.arff

Output:


=== Run information ===

Scheme: weka.classifiers.bayes.NaiveBayes
Relation: pima_diabetes
Instances: 768
Attributes: 9
preg
plas
pres
skin
insu
mass
pedi
age
class
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

Naive Bayes Classifier


Class
Attribute tested_negative tested_positive
(0.65) (0.35)
===============================================
preg
mean 3.4234 4.9795
std. dev. 3.0166 3.6827
weight sum 500 268
precision 1.0625 1.0625

plas
mean 109.9541 141.2581
std. dev. 26.1114 31.8728
weight sum 500 268
precision 1.4741 1.4741

pres
mean 68.1397 70.718
std. dev. 17.9834 21.4094
weight sum 500 268
precision 2.6522 2.6522

skin
mean 19.8356 22.2824
std. dev. 14.8974 17.6992
weight sum 500 268
precision 1.98 1.98

insu
mean 68.8507 100.2812
std. dev. 98.828 138.4883
weight sum 500 268
precision 4.573 4.573

mass
mean 30.3009 35.1475
std. dev. 7.6833 7.2537
weight sum 500 268
precision 0.2717 0.2717

pedi
mean 0.4297 0.5504
std. dev. 0.2986 0.3715
weight sum 500 268
precision 0.0045 0.0045


age
mean 31.2494 37.0808
std. dev. 11.6059 10.9146
weight sum 500 268
precision 1.1765 1.1765

Time taken to build model: 0.06 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances 586 76.3021 %
Incorrectly Classified Instances 182 23.6979 %
Kappa statistic 0.4664
Mean absolute error 0.2841
Root mean squared error 0.4168
Relative absolute error 62.5028 %
Root relative squared error 87.4349 %
Total Number of Instances 768

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.844 0.388 0.802 0.844 0.823 0.468 0.819 0.892 tested_negative
0.612 0.156 0.678 0.612 0.643 0.468 0.819 0.671 tested_positive
Weighted Avg. 0.763 0.307 0.759 0.763 0.760 0.468 0.819 0.815

=== Confusion Matrix ===

a b <-- classified as
422 78 | a = tested_negative
104 164 | b = tested_positive

Inference:
• The Naive Bayes classifier has been applied to the "pima_diabetes" dataset.
• Based on the provided information, the classifier achieved an accuracy of approximately 76.30%.
• The dataset contains two classes, tested_negative and tested_positive, which indicate whether a
patient tested negative or positive for diabetes.
• The classifier seems to perform better in classifying tested_negative instances compared to
tested_positive instances, as indicated by higher precision, recall, and F-Measure for
tested_negative.
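
As a quick illustrative check, the headline figures in the summary can be recomputed from the confusion matrix above:

# Recompute the summary metrics from Weka's confusion matrix
# (rows = actual class, columns = predicted class; a = tested_negative, b = tested_positive).
conf = [[422, 78],     # actual tested_negative
        [104, 164]]    # actual tested_positive

total = sum(sum(row) for row in conf)       # 768 instances
correct = conf[0][0] + conf[1][1]           # 586 correctly classified
accuracy = correct / total                  # ~0.7630 (76.30 %)

# Precision and recall for the tested_negative class.
precision_neg = conf[0][0] / (conf[0][0] + conf[1][0])   # 422 / 526 ~ 0.802
recall_neg = conf[0][0] / (conf[0][0] + conf[0][1])      # 422 / 500 ~ 0.844

print(accuracy, precision_neg, recall_neg)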

Result:
Successfully implemented the Bayesian Classifier algorithm in Python and in the Weka Tool using the Classify option.
