LP1 1
INDEX
Sr. No
CONTENT
2. Artificial Intelligence
2.1 Solve 8-puzzle problem using A* algorithm. Assume any initial configuration
and define goal configuration clearly.
2.3 Use heuristic search techniques to implement Best-First Search (gives a good
solution but not always an optimal one) and the A* algorithm (always gives an optimal solution).
SIT, LONAVALA 1
LABORATORY PRACTICE-I BE COMPUTER
3. Data Analytics
3.1 Download the Iris flower dataset or any other dataset into a DataFrame. (eg
https://archive.ics.uci.edu/ml/datasets/Iris ) Use Python/R and Perform following –
How many features are there and what are their types (e.g., numeric,
nominal)?
Compute and display summary statistics for each feature available in the
dataset (e.g. minimum value, maximum value, mean, range, standard
deviation, variance and percentiles).
Data Visualization-Create a histogram for each feature in the dataset to
illustrate the feature distributions. Plot each histogram.
Create a boxplot for each feature in the dataset. All of the boxplots should be
combined into a single plot. Compare distributions and identify outliers.
3.2 Download the Pima Indians Diabetes dataset. Use the Naive Bayes algorithm for
classification.
Load the data from CSV file and split it into training and test datasets.
Summarize the properties in the training dataset so that we can calculate
probabilities and make predictions.
Classify samples from a test dataset and a summarized training dataset.
3.3 Trip History Analysis: Use trip history dataset that is from a bike sharing service
in the United States. The data is provided quarter-wise from 2010 (Q4) onwards.
Each file has 7 columns. Predict the class of user. Sample Test data set available
here https://www.capitalbikeshare.com/trip-history-data.
3.4 Twitter Data Analysis: Use Twitter data for sentiment analysis. The dataset is
3MB in size and has 31,962 tweets. Identify the tweets which are hate tweets and
which are not. Sample Test data set available here
https://datahack.analyticsvidhya.com/contest/practice-problem-twitter-sentiment-
analysis/
Assignment No 1
Aim:
Vector and Matrix Operations-
Design parallel algorithm to
1. Add two large vectors
2. Multiply Vector and Matrix
3. Multiply two N × N arrays using n² processors
// Multiply M * N on the device
MatrixMulOnDevice(M, N, P);
// Free matrices
FreeMatrix(M);
FreeMatrix(N);
FreeMatrix(P);
return 0;
}
Host-side code
// Matrix multiplication on the device
void MatrixMulOnDevice(const Matrix M, const Matrix N, Matrix P)
{
    // Load M and N to the device
    Matrix Md = AllocateDeviceMatrix(M);
    CopyToDeviceMatrix(Md, M);
    Matrix Nd = AllocateDeviceMatrix(N);
    CopyToDeviceMatrix(Nd, N);
    // Allocate P on the device
    Matrix Pd = AllocateDeviceMatrix(P);
    // Set up the execution configuration (WIDTH = matrix dimension,
    // assumed defined elsewhere), launch the kernel, and copy P back
    dim3 dimBlock(WIDTH, WIDTH);
    dim3 dimGrid(1, 1);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd);
    CopyFromDeviceMatrix(P, Pd);
    // Free device matrices
    FreeDeviceMatrix(Md);
    FreeDeviceMatrix(Nd);
    FreeDeviceMatrix(Pd);
}
Facilities:
Latest version of a 64-bit operating system and a CUDA-enabled NVIDIA graphics card
Input:
Two matrices
Output:
Product of the two matrices
Software Engg.:
Mathematical Model:
Conclusion:
We learned parallel programming with the help of CUDA architecture.
Questions:
1. What is CUDA?
2. Explain Processing flow of CUDA programming.
3. Explain advantages and limitations of CUDA.
4. Compare the GPU and the CPU.
5. Explain various alternatives to CUDA.
6. Explain CUDA hardware architecture in detail.
Program:
#include<stdio.h>
#include<iostream>
#include<cstdlib>
//****important to include the following header to allow a programmer to use parallel paradigms*****
#include<omp.h>
using namespace std;
#define MAX 100
int main()
{
    int a[MAX], b[MAX], c[MAX], i;
    printf("\n First Vector:\t");
    //Instruct the master thread to fork and generate more threads to process the following loop
    #pragma omp parallel for
    for(i=0; i<MAX; i++)
    {
        a[i] = rand()%1000;
        b[i] = rand()%1000;
    }
    //Note the issue with the printing loops below: if we make them parallel, the values
    //printed may not be in sequence, as we have no control over the order of thread execution
    for(i=0; i<MAX; i++)
    {
        printf("%d\t", a[i]);
    }
    printf("\n Second Vector:\t");
    for(i=0; i<MAX; i++)
    {
        printf("%d\t", b[i]);
    }
    //Parallel vector addition: each element sum is independent
    #pragma omp parallel for
    for(i=0; i<MAX; i++)
    {
        c[i] = a[i] + b[i];
    }
    for(i=0; i<MAX; i++)
    {
        printf("\n%d\t%d\t%d", a[i], b[i], c[i]);
    }
    return 0;
}
1) Output:
guest-bvoaff@C04L0809:~$
#include<iostream>
#include<cstdlib>
#include<omp.h>
using namespace std;
int main()
{
    int m=3, n=2;
    int mat[m][n], vec[n], out[m];
    //fill matrix and vector with random values
    for(int row=0; row<m; row++)
        for(int col=0; col<n; col++)
            mat[row][col] = rand()%10;
    for(int row=0; row<n; row++)
        vec[row] = rand()%10;
    //display vector
    cout<<"Input Col-Vector"<<endl;
    for(int row=0; row<n; row++)
    {
        cout<<vec[row]<<endl;
    }
    //parallel matrix-vector multiplication: each row's dot product is independent
    #pragma omp parallel for
    for(int row=0; row<m; row++)
    {
        out[row] = 0;
        for(int col=0; col<n; col++)
            out[row] += mat[row][col] * vec[col];
    }
    for(int row=0; row<m; row++)
    {
        cout<<"\nvec["<<row<<"]:"<<out[row]<<endl;
    }
    return 0;
}
2) Output:
vec[0]:4
vec[1]:4
vec[2]:4
// Matrix-Matrix Multiplication
#include<iostream>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include<omp.h>
using namespace std;
#define N 4
float A[N][N], B[N][N], C[N][N]; // declaring matrices of NxN size
int main ()
{
    /* DECLARING VARIABLES */
    int i, j, m; // indices for matrix multiplication
    /* FILLING MATRICES WITH RANDOM NUMBERS */
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
        {
            A[i][j]= (rand()%5);
            B[i][j]= (rand()%5);
        }
    }
    /* MATRIX MULTIPLICATION - rows of C are computed in parallel */
    #pragma omp parallel for private(j,m)
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
        {
            C[i][j]=0.; // set initial value of resulting matrix C = 0
            for(m=0;m<N;m++)
            {
                C[i][j]=A[i][m]*B[m][j]+C[i][j];
            }
        }
    }
    /* TERMINATE PROGRAM */
    return 0;
}
3) Output:
Conclusion:
We designed parallel algorithms to add two large vectors, multiply a vector by a matrix, and multiply two N × N arrays using n² processors.
Assignment No 2
Aim:
Parallel Sorting Algorithms-
For Bubble Sort and Merge Sort, based on the existing sequential algorithms,
design and implement parallel algorithms utilizing all resources available.
Prerequisites:
Students should know the basic concepts of Bubble Sort and Merge Sort.
Objective: Study of parallel sorting algorithms: Bubble Sort and Merge Sort.
Theory:
i) What is Sorting?
Sorting is a process of arranging elements in a group in a particular order, i.e.,
ascending order, descending order, alphabetic order, etc.
Bubble Sort
The idea of bubble sort is to compare two adjacent elements. If they are not in
the right order, swap them. Do this comparing and swapping (if necessary) until the
end of the array is reached. Repeat this process from the beginning of the array n
times.
Parallel bubble sort can be implemented as a pipeline:
Let local_size = n / no_proc. We divide the array into no_proc parts, and each
process executes bubble sort on its part, including comparing its last element
with the first element belonging to the next thread.
Implement the inner loop as for (j=0; j<n-1; j++) instead of j<i.
For every iteration of i, each thread needs to wait until the previous thread
has finished that iteration before starting.
We coordinate the threads using a barrier.
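The phase-synchronized idea above can be sketched with OpenMP as odd-even transposition sort, a standard parallel variant of bubble sort (the function name and use of std::vector are our own; the implicit barrier at the end of each parallel for plays the role of the explicit barrier described above):

```cpp
#include <algorithm>
#include <vector>
#include <omp.h>

// Odd-even transposition sort: on alternate phases, disjoint pairs
// (even,odd) or (odd,even) are compared and swapped. Pairs within one
// phase are independent, so each phase can be a parallel for.
void oddEvenSort(std::vector<int>& a) {
    int n = a.size();
    for (int phase = 0; phase < n; phase++) {
        int start = phase % 2;  // even phase: (0,1),(2,3)...; odd: (1,2),(3,4)...
        #pragma omp parallel for
        for (int i = start; i < n - 1; i += 2) {
            if (a[i] > a[i + 1])
                std::swap(a[i], a[i + 1]);
        }
        // the implicit barrier at the end of the parallel for separates phases
    }
}
```

After n phases the array is sorted, because each element can move at most one position per phase and n phases suffice for any element to reach its final place.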
Merge Sort
• Collects sorted list onto one processor
• Merges elements as they come together
• Simple tree structure
• Parallelism is limited when near the root
Theory:
To sort A[p .. r]:
1. Divide Step
If a given array A has zero or one element, simply return; it is already sorted.
Otherwise, split A[p .. r] into two subarrays A[p .. q] and A[q + 1 .. r], each containing
about half of the elements of A[p .. r]. That is, q is the halfway point of A[p .. r].
2. Conquer Step
Conquer by recursively sorting the two subarrays A[p .. q] and A[q + 1 .. r].
3. Combine Step
Combine the elements back in A[p .. r] by merging the two sorted subarrays A[p .. q] and
A[q + 1 .. r] into a sorted sequence. To accomplish this step, we will define a procedure
MERGE(A, p, q, r).
1. Procedure parallelMergeSort
2. Begin
3. Create processors Pi where i = 1 to n
4. if i > 0 then receive size and parent from the root
5. receive the list, size and parent from the root
6. endif
7. midvalue = listsize/2
8. if both children are present in the tree then
9. send midvalue, first child
10. send listsize-mid, second child
11. send list, midvalue, first child
12. send list from midvalue, listsize-midvalue, second child
13. call mergelist(list,0,midvalue,list, midvalue+1,listsize,temp,0,listsize)
14. store temp in another array list2
15. else
16. call parallelMergeSort(list,0,listsize)
17. endif
18. if i >0 then
19. send list, listsize,parent
20. endif
21. end
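The divide, conquer, and combine steps can also be sketched with OpenMP tasks (a shared-memory alternative to the message-passing pseudocode above; the function names and the cutoff value are our own):

```cpp
#include <vector>
#include <algorithm>
#include <omp.h>

// Combine step: merge the two sorted halves a[lo..mid] and a[mid+1..hi]
void mergeHalves(std::vector<int>& a, int lo, int mid, int hi) {
    std::vector<int> tmp;
    tmp.reserve(hi - lo + 1);
    int i = lo, j = mid + 1;
    while (i <= mid && j <= hi)          // take the smaller front element
        tmp.push_back(a[i] <= a[j] ? a[i++] : a[j++]);
    while (i <= mid) tmp.push_back(a[i++]);
    while (j <= hi)  tmp.push_back(a[j++]);
    std::copy(tmp.begin(), tmp.end(), a.begin() + lo);
}

// Divide and conquer: the two recursive calls are independent, so each
// can run as an OpenMP task; taskwait synchronizes before the merge.
void mergeSortTasks(std::vector<int>& a, int lo, int hi) {
    if (lo >= hi) return;                // zero or one element: already sorted
    int mid = lo + (hi - lo) / 2;        // divide step
    #pragma omp task shared(a) if(hi - lo > 1000)
    mergeSortTasks(a, lo, mid);
    #pragma omp task shared(a) if(hi - lo > 1000)
    mergeSortTasks(a, mid + 1, hi);
    #pragma omp taskwait
    mergeHalves(a, lo, mid, hi);         // combine step
}
```

Call it from inside `#pragma omp parallel` followed by `#pragma omp single` so one thread seeds the recursion and the team executes the tasks; the `if` clause stops task creation for small subarrays, where task overhead would dominate.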
INPUT:
1. Array of integer numbers.
OUTPUT:
1. Sorted array of numbers
FAQ
1. What is sorting?
2. What is parallel sort?
3. How to sort the element using Bubble Sort?
4. How to sort the element using Parallel Bubble Sort?
5. How to sort the element using Parallel Merge Sort?
6. How to sort the element using Merge Sort?
7. What is searching?
8. Different types of searching methods.
9. Time complexities of sorting and searching methods.
10. How to calculate time complexity?
11. What are the space complexities of all sorting and searching methods?
12. Explain what is best, worst and average case for each method of
searching and sorting.
ALGORITHM ANALYSIS
Program:
1) Bubble Sort
#include<iostream>
#include<stdlib.h>
#include<omp.h>
using namespace std;
void swap(int &a, int &b)
{
    int test;
    test=a;
    a=b;
    b=test;
}
//Parallel bubble sort using odd-even phases: the pairs compared in one
//phase are disjoint, so the inner loop can be a parallel for
void bubble(int *a, int n)
{
    for(int i=0; i<n; i++)
    {
        int first = i%2;
        #pragma omp parallel for
        for(int j=first; j<n-1; j+=2)
        {
            if(a[j] > a[j+1])
                swap(a[j], a[j+1]);
        }
    }
}
int main()
{
    int *a,n;
    cout<<"\n enter total no of elements=>";
    cin>>n;
    a=new int[n];
    cout<<"\n enter elements=>";
    for(int i=0;i<n;i++)
    {
        cin>>a[i];
    }
    bubble(a,n);
    cout<<"\n sorted array=>";
    for(int i=0;i<n;i++)
    {
        cout<<a[i]<<" ";
    }
    return 0;
}
Output:
enter elements=>2
6
8
3
2) Merge Sort
#include<iostream>
#include<stdlib.h>
#include<omp.h>
using namespace std;
void merge(int a[], int i1, int j1, int i2, int j2);
//Recursive merge sort; the two halves are sorted in parallel sections
void mergesort(int a[], int i, int j)
{
    if(i < j)
    {
        int mid = (i+j)/2;
        #pragma omp parallel sections
        {
            #pragma omp section
            {
                mergesort(a, i, mid);
            }
            #pragma omp section
            {
                mergesort(a, mid+1, j);
            }
        }
        merge(a, i, mid, mid+1, j);
    }
}
void merge(int a[], int i1, int j1, int i2, int j2)
{
    int temp[1000];
    int i, j, k;
    i=i1;
    j=i2;
    k=0;
    //merge the two sorted runs by always taking the smaller front element
    while(i<=j1 && j<=j2)
    {
        if(a[i] < a[j])
            temp[k++]=a[i++];
        else
            temp[k++]=a[j++];
    }
    while(i<=j1)
    {
        temp[k++]=a[i++];
    }
    while(j<=j2)
    {
        temp[k++]=a[j++];
    }
    for(i=i1,j=0;i<=j2;i++,j++)
    {
        a[i]=temp[j];
    }
}
int main()
{
    int *a,n,i;
    cout<<"\n enter total no of elements=>";
    cin>>n;
    a= new int[n];
    cout<<"\n enter elements=>";
    for(i=0;i<n;i++)
    {
        cin>>a[i];
    }
    mergesort(a, 0, n-1);
    cout<<"\n sorted array=>";
    for(i=0;i<n;i++)
    {
        cout<<a[i]<<" ";
    }
    return 0;
}
Output:
SIT@SIT-ThinkCentre-E73:~$ g++ mergesort.cpp
SIT@SIT-ThinkCentre-E73:~$ ./a.out
enter elements=>2
5
8
1
Conclusion:
We designed and implemented parallel versions of Bubble Sort and Merge Sort
based on the existing sequential algorithms, utilizing all resources available.
Assignment No 3
Aim:
Parallel Search Algorithms-
Design and implement parallel algorithms utilizing all resources available for
Binary Search, BFS and DFS.
Outcome: Students will understand the implementation of Binary Search, BFS and
DFS.
Pre-requisites:
Theory:
Binary Search:
Binary search runs in logarithmic time, making O(log n) comparisons in the worst
case, where n is the number of elements in the array, O is Big O notation, and
log is the logarithm. Binary search takes constant (O(1)) space, meaning that
the space taken by the algorithm is the same for any number of elements in the
array. Binary search is faster than linear search except for small arrays, but the
array must be sorted first. Although specialized data structures designed for fast
searching, such as hash tables, can be searched more efficiently, binary search
applies to a wider range of problems.
How Binary Search Works?
For a binary search to work, it is mandatory for the target array to be sorted. We
shall learn the process of binary search with an example. The following is our
sorted array of 10 elements (indices 0 to 9), and let us assume that we need to
search for the location of the value 31 using binary search.
First we find the mid of the array: 0 + (9 - 0) / 2 = 4 (integer value of 4.5). So, 4 is the mid of the array.
Now we compare the value stored at location 4, with the value being searched,
i.e. 31. We find that the value at location 4 is 27, which is not a match. As the
value is greater than 27 and we have a sorted array, so we also know that the
target value must be in the upper portion of the array.
We change our low to mid + 1 and find the new mid value again.
low = mid + 1
mid = low + (high - low) / 2
Our new mid is 7 now. We compare the value stored at location 7 with our
target value 31.
SIT, LONAVALA 29
LABORATORY PRACTICE-I BE COMPUTER
The value stored at location 7 is not a match; rather, it is more than what we are
looking for. So, the value must be in the lower part from this location. We change
high to mid - 1 and compute the new mid, which is 5.
We compare the value stored at location 5 with our target value. We find
that it is a match.
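The steps above can be sketched as an iterative function (the function name is ours, and the sample array in the usage below is an assumed 10-element example consistent with the walkthrough, with 31 at index 5):

```cpp
#include <vector>

// Iterative binary search on a sorted array: repeatedly halve the range
// [low, high] until the key is found or the range is empty.
// Returns the index of key, or -1 if it is absent.
int binarySearch(const std::vector<int>& a, int key) {
    int low = 0, high = (int)a.size() - 1;
    while (low <= high) {
        int mid = low + (high - low) / 2;   // avoids overflow of (low+high)/2
        if (a[mid] == key)
            return mid;
        else if (a[mid] < key)
            low = mid + 1;                  // key lies in the upper half
        else
            high = mid - 1;                 // key lies in the lower half
    }
    return -1;
}
```

On the walkthrough's array, searching for 31 probes indices 4, 7, then 5, exactly as described above.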
Breadth-First Search :
Graph traversals
Graph traversal means visiting every vertex and edge exactly once in a well-defined
order. While using certain graph algorithms, you must ensure that each vertex of the
graph is visited exactly once. The order in which the vertices are visited are important
and may depend upon the algorithm or question that you are solving.
BFS is a traversing algorithm where you should start traversing from a selected node
(source or starting node) and traverse the graph layerwise thus exploring the
neighbour nodes (nodes which are directly connected to source node). You must then
move towards the next-level neighbour nodes.
As the name BFS suggests, you are required to traverse the graph breadthwise as follows:
1. First move horizontally and visit all the nodes of the current layer
2. Move to the next layer
Consider the following diagram.
The distance between the nodes in layer 1 is comparatively smaller than the distance
between the nodes in layer 2. Therefore, in BFS, you must traverse all the nodes in
layer 1 before you move to the nodes in layer 2.
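The layer-by-layer traversal described above can be sketched on an adjacency list as follows (the function name and the graph representation are our own; the document's program below instead traverses a binary tree):

```cpp
#include <vector>
#include <queue>

// Breadth-first search: visit all neighbours of the current layer before
// moving to the next layer. Returns vertices in the order visited.
std::vector<int> bfs(const std::vector<std::vector<int>>& adj, int source) {
    std::vector<bool> visited(adj.size(), false);
    std::vector<int> order;
    std::queue<int> q;
    visited[source] = true;
    q.push(source);
    while (!q.empty()) {
        int u = q.front();
        q.pop();
        order.push_back(u);
        for (int v : adj[u]) {          // explore direct neighbours (next layer)
            if (!visited[v]) {
                visited[v] = true;      // mark on enqueue so each vertex enters once
                q.push(v);
            }
        }
    }
    return order;
}
```

Marking a vertex as visited when it is enqueued, rather than when it is dequeued, guarantees each vertex is processed exactly once even when several vertices of one layer share a neighbour.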
Program:
Binary Search:
#include<iostream>
#include<stdlib.h>
#include<omp.h>
using namespace std;
//Parallel binary search: split the range at mid and search the two halves
//concurrently in two OpenMP sections
int binary(int *a, int low, int high, int key)
{
    int mid;
    mid=(low+high)/2;
    int low1,low2,high1,high2,mid1,mid2,found=0,loc=-1;
    low1=low;
    high1=mid;
    low2=mid+1;
    high2=high;
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            while(low1<=high1)
            {
                cout<<"here1";
                mid1=(low1+high1)/2;
                if(key==a[mid1])
                {
                    found=1;
                    loc=mid1;
                    low1=high1+1; //terminate this section's loop
                }
                else if(key>a[mid1])
                {
                    low1=mid1+1;
                }
                else if(key<a[mid1])
                {
                    high1=mid1-1;
                }
            }
        }
        #pragma omp section
        {
            while(low2<=high2)
            {
                cout<<"here2";
                mid2=(low2+high2)/2;
                if(key==a[mid2])
                {
                    found=1;
                    loc=mid2;
                    low2=high2+1; //terminate this section's loop
                }
                else if(key>a[mid2])
                {
                    low2=mid2+1;
                }
                else if(key<a[mid2])
                {
                    high2=mid2-1;
                }
            }
        }
    }
    return loc;
}
int main()
{
    int *a,i,n,key,loc=-1;
    cout<<"\n enter total no of elements=>";
    cin>>n;
    a=new int[n];
    cout<<"\n enter elements in sorted order=>";
    for(i=0;i<n;i++)
    {
        cin>>a[i];
    }
    cout<<"\n enter key to search=>";
    cin>>key;
    loc=binary(a,0,n-1,key);
    if(loc==-1)
        cout<<"\n Key not found.";
    else
        cout<<"\n Key found at position=>"<<loc+1;
    return 0;
}
Output:
2) Breadth-First Search
#include<iostream>
#include<stdlib.h>
#include<queue>
#include<omp.h>
using namespace std;
class node
{
public:
    node *left, *right;
    int data;
};
class Breadthfs
{
public:
    node *insert(node *, int);
    void bfs(node *);
};
//Level-order insertion into the binary tree
node *insert(node *root, int data)
{
    if(!root)
    {
        root=new node;
        root->left=NULL;
        root->right=NULL;
        root->data=data;
        return root;
    }
    queue<node *> q;
    q.push(root);
    while(!q.empty())
    {
        node *temp=q.front();
        q.pop();
        if(temp->left==NULL)
        {
            temp->left=new node;
            temp->left->left=NULL;
            temp->left->right=NULL;
            temp->left->data=data;
            return root;
        }
        else
        {
            q.push(temp->left);
            if(temp->right==NULL)
            {
                temp->right=new node;
                temp->right->left=NULL;
                temp->right->right=NULL;
                temp->right->data=data;
                return root;
            }
            else
            {
                q.push(temp->right);
            }
        }
    }
    return root;
}
//Parallel BFS: all nodes of the current level are processed by an OpenMP
//parallel for; queue accesses are protected with critical sections
void bfs(node *head)
{
    queue<node*> q;
    q.push(head);
    int qSize;
    while (!q.empty())
    {
        qSize = q.size();
        #pragma omp parallel for
        for (int i = 0; i < qSize; i++)
        {
            node* currNode;
            #pragma omp critical
            {
                currNode = q.front();
                q.pop();
                cout<<"\t"<<currNode->data;
            }
            #pragma omp critical
            {
                if(currNode->left)
                    q.push(currNode->left);
                if(currNode->right)
                    q.push(currNode->right);
            }
        }
    }
}
int main(){
    node *root=NULL;
    int data;
    char ans;
    do
    {
        cout<<"\n enter data=>";
        cin>>data;
        root=insert(root,data);
        cout<<" do you want insert one more node?";
        cin>>ans;
    }while(ans=='y'||ans=='Y');
    bfs(root);
    return 0;
}
[SIT@localhost ~]$ vi brfs.cpp
[SIT@localhost ~]$ g++ brfs.cpp -fopenmp
[SIT@localhost ~]$ ./a.out
enter data=>10
do you want insert one more node?y
enter data=>5
do you want insert one more node?y
enter data=>15
do you want insert one more node?y
enter data=>25
do you want insert one more node?y
enter data=>20
do you want insert one more node?n
10 5 15 25 20
Assignment No 4
Aim:
Parallel Implementation of the K Nearest Neighbors Classifier
Objective:
To implement a parallel version of the K Nearest Neighbors classifier.
Theory:
In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method
used for classification and regression.[1] In both cases, the input consists of the k closest
training examples in the feature space. The output depends on whether k-NN is used for
classification or regression:
In k-NN classification, the output is a class membership. An object is classified
by a majority vote of its neighbors, being assigned to the class most common
among its k nearest neighbors.
In k-NN regression, the output is the property value for the object. This value is
the average of the values of its k nearest neighbors.
k-NN is a type of instance-based learning, or lazy learning, where the function is only
approximated locally and all computation is deferred until classification. The k-NN
algorithm is among the simplest of all machine learning algorithms.
Both for classification and regression, a useful technique can be used to assign weight
to the contributions of the neighbors, so that the nearer neighbors contribute more to
the average than the more distant ones. For example, a common weighting scheme
consists of giving each neighbor a weight of 1/d, where d is the distance to the
neighbor.
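The 1/d weighting scheme can be sketched as follows for RGB samples like those in the program below (the struct, function name, and the small epsilon guard against zero distance are our own, and this sequential sketch omits the MPI distribution used in the document's program):

```cpp
#include <vector>
#include <cmath>
#include <map>
#include <algorithm>

struct Sample { double r, g, b; int label; };

// Distance-weighted k-NN classification: each of the k nearest neighbours
// votes for its class with weight 1/d, so nearer neighbours contribute more.
int knnClassify(const std::vector<Sample>& train,
                double r, double g, double b, int k) {
    // compute (distance, label) pairs from the query point to every sample
    std::vector<std::pair<double, int>> dist;
    for (const Sample& s : train) {
        double d = std::sqrt((s.r - r) * (s.r - r) +
                             (s.g - g) * (s.g - g) +
                             (s.b - b) * (s.b - b));
        dist.push_back({d, s.label});
    }
    // keep only the k closest training samples
    size_t kk = std::min<size_t>(k, dist.size());
    std::partial_sort(dist.begin(), dist.begin() + kk, dist.end());
    // accumulate 1/d votes per class (epsilon guards against d == 0)
    std::map<int, double> votes;
    for (size_t i = 0; i < kk; i++)
        votes[dist[i].second] += 1.0 / (dist[i].first + 1e-9);
    int best = -1;
    double bestW = -1;
    for (auto& kv : votes)
        if (kv.second > bestW) { bestW = kv.second; best = kv.first; }
    return best;
}
```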
Program:
#include <iostream>
#include <vector>
#include <fstream>
#include <string>
#include <sstream>
#include <cmath>
#include <set>
#include <map>
#include <ctime>
#include<mpi.h>
using namespace std;
class Instance{
private:
    double R;
    double G;
    double B;
    double isSkin;
public:
    Instance(double R, double G, double B, int isSkin){
        this->R = R;
        this->G = G;
        this->B = B;
        this->isSkin = isSkin;
    }
    double getR(){
        return R;
    }
    double getG(){
        return G;
    }
    double getB(){
        return B;
    }
    void setG(double G){
        this->G = G;
    }
    void setB(double B){
        this->B = B;
    }
    int skin(){
        return isSkin;
    }
    //Calculate Euclidean distance to another RGB point
    double calculateDistance(double otherR, double otherG, double otherB){
        return sqrt((R - otherR) * (R - otherR) + (G - otherG) * (G - otherG) + (B - otherB) * (B - otherB));
    }
};
class TestInstance{
private:
    double R;
    double G;
    double B;
public:
    TestInstance(double R, double G, double B){
        this->R = R;
        this->G = G;
        this->B = B;
    }
    double getR(){
        return R;
    }
    double getG(){
        return G;
    }
    double getB(){
        return B;
    }
};
return rez;
}
vector<Instance> instances;
int k;
int countFirstClass = 0;
int countSecondClass = 0;
set<double>::iterator it = distances.begin();
int world_size;
int rank;
MPI_Comm_size(MPI_COMM_WORLD, &world_size); //total number of processes
MPI_Comm_rank(MPI_COMM_WORLD, &rank);       //rank of this process
string line;
ifstream myfile("training.txt");
//init
if (myfile.is_open())
{
while (getline(myfile,line))
{
vector<string> parts = split(line, ' ');
curr = instances[i].getG();
res = (curr - minG) / (maxG - minG);
instances[i].setG(res);
curr = instances[i].getB();
res = (curr - minB) / (maxB - minB);
instances[i].setB(res);
ifstream new_file("test.txt");
string new_line;
vector<TestInstance>test_instances;
//if Process 0
if(rank == 0) {
if (new_file.is_open())
{
while (getline(new_file,new_line))
{
    vector<string> parts = split(new_line, ' ');
    double r = std::stod(parts[0]);
    double g = std::stod(parts[1]);
    double b = std::stod(parts[2]);
    test_instances.push_back(TestInstance(r, g, b));
}
//Get current system time
start = MPI_Wtime();
int index = 1;
for(int i = 1; i < test_instances.size(); i++){
double r = test_instances[i].getR();
MPI_Isend(&r, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, requests +
index);
index ++;
double g = test_instances[i].getG();
MPI_Isend(&g, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, requests
+ index);
index ++;
double b = test_instances[i].getB();
MPI_Isend(&b, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, requests
+ index);
index ++;
}
double r = test_instances[0].getR();
double g = test_instances[0].getG();
double b = test_instances[0].getB();
map<double, int> distanceToClass;
set<double> distances;
int class_predicted = returnClassForObject(r, g, b, distances,
distanceToClass);
printf("Class for %d object is: %d\n", rank + 1, class_predicted);
}
else{
double r;
double g;
double b;
MPI_Irecv(&r, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, requests +
rank + 1);
Conclusion:
We implemented a parallel version of the K Nearest Neighbors classifier using MPI.
ARTIFICIAL INTELLIGENCE
Assignment No 1
Aim:
Solve the 8-puzzle problem using the A* algorithm. Assume any initial configuration
and define the goal configuration clearly.
Objective:
To study the A* algorithm and apply it to solve the 8-puzzle.
Theory:
Introduction:
A* is a computer algorithm widely used in pathfinding and graph traversal: the
process of plotting an efficiently traversable path between multiple points, called
nodes. The A* algorithm combines features of uniform-cost search and pure heuristic
search to efficiently compute optimal solutions.
A* algorithm is a best-first search algorithm in which the cost associated with a node is
f(n) = g(n) + h(n), where g(n) is the cost of the path from the initial state to node n and
h(n) is the heuristic estimate or the cost or a path from node n to a goal.
Thus, f(n) estimates the lowest total cost of any solution path going through node n. At
each point a node with lowest f value is chosen for expansion. Ties among nodes of equal
f value should be broken in favor of nodes with lower h values. The algorithm terminates
when a goal is chosen for expansion.
The A* algorithm finds an optimal path to a goal if the heuristic function h(n) is
admissible, meaning it never overestimates the actual cost. For example, airline
distance never overestimates actual highway distance, and Manhattan distance never
overestimates the actual number of moves in a sliding-tile puzzle.
For the 8-puzzle, the A* algorithm, using these evaluation functions, can find optimal
solutions. In addition, A* makes the most efficient use of the given heuristic
function in the following sense: among all shortest-path algorithms using the given
heuristic function h(n), the A* algorithm expands the fewest nodes.
The main drawback of the A* algorithm, and indeed of any best-first search, is its
memory requirement. Since at least the entire open list must be saved, the A*
algorithm is severely space-limited in practice, and is no more practical than the
best-first search algorithm on current machines. For example, while it can be run
successfully on the eight puzzle, it exhausts available memory in a matter of minutes
on the fifteen puzzle. A* is thus a very good search method, but it has complexity
problems.
To implement such a graph-search procedure, we will need to use two lists of nodes:
1) OPEN: nodes that have been generated and have had the heuristic function applied to
them but which have not yet been examined (i.e., had their successors generated). OPEN
is actually a priority queue in which the elements with the highest priority are those with
the most promising value of the heuristic function.
2) CLOSED: nodes that have already been examined. We need to keep these nodes in
memory if we want to search a graph rather than a tree, since whenever a node is
generated, we need to check whether it has been generated before.
A* Algorithm:
1. Put the start node s on OPEN, with f(s) = h(s).
2. If OPEN is empty, exit with failure.
3. Remove from OPEN, and place on CLOSED, a node n for which f(n) is minimum.
4. If n is a goal node, exit successfully with the solution obtained by tracing the
pointers from n back to s.
5. Otherwise, expand n, generating its children, and direct pointers from each child
node to n.
For every child node n' do:
evaluate h(n') and compute f(n') = g(n') + h(n') = g(n) + c(n,n') + h(n')
If n' is already on OPEN or CLOSED, compare its new f
with the old f and attach the lowest f to n'.
Put n' with its f value in the right order in OPEN.
6. Go to step 2.
Two common admissible heuristics for the 8-puzzle are h1, the number of misplaced
tiles, and h2, the sum of the Manhattan distances of the tiles from their goal
positions. For a sample start state S:
• h1(S) = 8
• h2(S) = 3+1+2+2+2+3+3+2 = 18
The evaluation function is f(n) = g(n) + h(n).
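Both heuristics can be sketched as follows (the board representation with 0 as the blank and the tile numbering are assumptions; the document's Java program uses letters for tiles instead):

```cpp
#include <array>
#include <cstdlib>

using Board = std::array<int, 9>;   // 3x3 board in row-major order, 0 = blank

// h1: number of misplaced tiles (the blank is not counted)
int misplacedTiles(const Board& s, const Board& goal) {
    int h = 0;
    for (int i = 0; i < 9; i++)
        if (s[i] != 0 && s[i] != goal[i])
            h++;
    return h;
}

// h2: sum of Manhattan distances of each tile from its goal position
int manhattan(const Board& s, const Board& goal) {
    int pos[9];                      // goal index of each tile value
    for (int i = 0; i < 9; i++)
        pos[goal[i]] = i;
    int h = 0;
    for (int i = 0; i < 9; i++) {
        if (s[i] == 0) continue;     // skip the blank
        int j = pos[s[i]];
        h += std::abs(i / 3 - j / 3) + std::abs(i % 3 - j % 3);
    }
    return h;
}
```

Both are admissible: every misplaced tile needs at least one move (h1), and each tile needs at least its Manhattan distance in moves (h2), so neither overestimates the true cost.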
A* is commonly used for the common path finding problem in applications such as
games, but was originally designed as a general graph traversal algorithm.
Program:
PuzzelBoard.java
package ai_practical.assno3;
import java.util.Scanner;
import javax.swing.JOptionPane;
public class PuzzelBoard
{
    private String[][] board;
    private int blankX, blankY;
    public PuzzelBoard()
{
this.board = new String[3][3];
}
public PuzzelBoard(PuzzelBoard b)
{
this.board = b.board;
this.blankX = b.blankX;
this.blankY = b.blankY;
}
    //Copy the given board and locate the blank tile "-"
    public void setBoard(String[][] b)
    {
        for(int i=0; i<3; i++)
        {
            for(int j=0; j<3; j++)
            {
                board[i][j] = b[i][j];
                if(board[i][j].equals("-"))
                {
                    blankX=i;
                    blankY=j;
                }
            }
        }
    }
}
if(blankY<2)
{
temp.setBoard(board);
temp.swap(blankX, blankY, blankX, blankY+1);
int fn = (temp.getHn(goal)+gn);
System.out.println("\nFor Fn = "+fn+" : ");
temp.display();
if(fn < minFn)
{
minFn = fn;
next.setBoard(temp.board);
next.setBlankX(blankX);
next.setBlankY(blankY+1);
}
}
if(blankX>0)
{
temp.setBoard(board);
temp.swap(blankX, blankY, blankX-1, blankY);
int fn = (temp.getHn(goal)+gn);
System.out.println("\nFor Fn = "+fn+" : ");
temp.display();
if(fn < minFn)
{
minFn = fn;
next.setBoard(temp.board);
next.setBlankX(blankX-1);
next.setBlankY(blankY);
}
}
if(blankX<2)
{
temp.setBoard(board);
temp.swap(blankX, blankY, blankX+1, blankY);
int fn = (temp.getHn(goal)+gn);
System.out.println("\nFor Fn = "+fn+" : ");
temp.display();
if(fn < minFn)
{
minFn = fn;
next.setBoard(temp.board);
next.setBlankX(blankX+1);
next.setBlankY(blankY);
}
}
return next;
}
public void swap(int i1, int j1, int i2, int j2)
{
String temp = board[i1][j1];
board[i1][j1] = board[i2][j2];
board[i2][j2] = temp;
}
return true;
}
return hn;
}
}
Output:
run:
For Fn = 5 :
a b c
- d f
g e h
For Fn = 5 :
a b c
d f -
g e h
For Fn = 5 :
a - c
d b f
g e h
For Fn = 3 :
a b c
d e f
g - h
For Fn = 5 :
a b c
d e f
- g h
For Fn = 2 :
a b c
d e f
g h -
For Fn = 5 :
a b c
d - f
g e h
Assignment No 2
Aim:
Implement any one of the following Expert System ,
Medical Diagnosis of 10 diseases based on adequate symptoms
Identifying birds of India based on characteristics
Software Requirements:
SWI-Prolog for Windows, Editor.
Theory:
An expert system is a system that uses human expertise to make complicated
decisions. It simulates reasoning by applying knowledge and inference, uses the
expert's knowledge as rules and data within the system, and models the
problem-solving ability of a human expert.
Components of an ES:
1. Knowledge Base
i. Represents all the data and information input by experts in the field.
ii. Stores the data as a set of rules that the system must follow to
make decisions.
2. Reasoning or Inference Engine
i. Asks the user questions about what they are looking for.
ii. Applies the knowledge and the rules held in the knowledge base.
iii. Appropriately uses this information to arrive at a decision.
3. User Interface
i. Allows the expert system and the user to communicate.
ii. Finds out what it is that the system needs to answer.
iii. Sends the user questions or answers and receives their response.
4. Explanation Facility
i. Explains the system's reasoning and justifies its conclusions.
PROGRAM-
go:-
hypothesis(Disease),
write('It is suggested that the patient has '),
write(Disease),
nl,
undo;
write('Sorry, the system is unable to identify the disease'),nl,undo.
hypothesis(cold) :-
symptom(headache),
symptom(runny_nose),
symptom(sneezing),
symptom(sore_throat),
nl,
write('Advice and Suggestions:'),
nl,
write('1: Tylenol'),
nl,
write('2: Panadol'),
nl,
write('3: Nasal spray'),
nl,
write('Please wear warm clothes because'),
nl,!.
hypothesis(influenza) :-
symptom(sore_throat),
symptom(fever),
symptom(headache),
symptom(chills),
symptom(body_ache),
nl,
write('Advice and Suggestions:'),
nl,
write('1: Tamiflu'),
nl,
write('2: Panadol'),
nl,
write('3: Zanamivir'),
nl,
write('Please take a warm bath and do salt gargling because'),
nl,!.
hypothesis(typhoid) :-
symptom(headache),
symptom(abdominal_pain),
symptom(poor_appetite),
symptom(fever),
nl,
write('Advice and Suggestions:'),
nl,
write('1: Chloramphenicol'),
nl,
write('2: Amoxicillin'),
nl,
write('3: Ciprofloxacin'),
nl,
write('4: Azithromycin'),
nl,
write('Please do complete bed rest and take soft diet because'),
nl,!.
hypothesis(chicken_pox) :-
symptom(rash),
symptom(body_ache),
symptom(fever),
nl,
write('Advice and Suggestions:'),
nl,
write('1: Varicella vaccine'),
nl,
write('2: Immunoglobulin'),
nl,
write('3: Acetaminophen'),
nl,
write('4: Acyclovir'),
nl,
write('Please do have oatmeal bath and stay at home because'),
nl,!.
hypothesis(measles) :-
symptom(fever),
symptom(runny_nose),
symptom(rash),
symptom(conjunctivitis),
nl,
write('Advice and Suggestions:'),
nl,
write('1: Tylenol'),
nl,
write('2: Aleve'),
nl,
write('3: Advil'),
nl,
write('4: Vitamin A'),
nl,
write('Please get rest and use more liquid because'),
nl,!.
hypothesis(malaria) :-
symptom(fever),
symptom(sweating),
symptom(headache),
symptom(nausea),
symptom(vomiting),
symptom(diarrhea),
nl,
write('Advice and Suggestions:'),
nl,
write('1: Aralen'),
nl,
write('2: Qualaquin'),
nl,
write('3: Plaquenil'),
nl,
write('4: Mefloquine'),
nl,
write('Please do not sleep in open air and cover your full skin because'),
nl,!.
ask(Question) :-
write('Does the patient has the symptom '),
write(Question),
write('? : '),
read(Response),
nl,
( (Response == yes ; Response == y)
->
assert(yes(Question)) ;
assert(no(Question)), fail).
:- dynamic yes/1,no/1.
symptom(S) :-
(yes(S)
->
true ;
(no(S)
->
fail ;
ask(S))).
undo :- retract(yes(_)),fail.
undo :- retract(no(_)),fail.
undo.
OUTPUT-
/*
SIT@SIT-ThinkCentre-E73:~$ swipl -s medicalExpert.pl
Welcome to SWI-Prolog (threaded, 64 bits, version 7.6.4)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free
software.
Please run ?- license. for legal details.
?- go.
|: go.
Does the patient has the symptom headache? :
|: yes.
Does the patient has the symptom sore_throat? :
|: n.
Does the patient has the symptom fever? :
|: y.
Does the patient has the symptom rash? :
|: y.
Does the patient has the symptom body_ache? :
Sorry, the system is unable to identify the disease
true.
?- go.
go.
Does the patient has the symptom headache? :
|: yes.
Does the patient has the symptom sore_throat? :
|: yes.
Does the patient has the symptom fever? :
|: yes.
Does the patient has the symptom rash? :
|: yes.
Does the patient has the symptom body_ache? :
?- go.
go.
Does the patient has the symptom headache? :
|: yes.
Does the patient has the symptom sore_throat? :
|: no.
Does the patient has the symptom fever? :
|: no.
Does the patient has the symptom rash? :
Sorry, the system is unable to identify the disease
true.
?- go.
go
|:
|: go.
Does the patient has the symptom headache? :
ERROR: Stream user_input:56:0 Syntax error: Operator expected
Exception: (9) hypothesis(_2070) ? creep
?- go.
|: go.
Does the patient has the symptom headache? :
|: n.
Does the patient has the symptom sore_throat? :
|: yes.
Does the patient has the symptom rash? :
|: yes.
Does the patient has the symptom body_ache? :
|: yes.
Does the patient has the symptom fever? :
?- go.
|: go.
Does the patient has the symptom headache? :
|: y.
Does the patient has the symptom sore_throat? :
|: n.
Does the patient has the symptom fever? :
|: y.
Does the patient has the symptom rash? :
|: n.
Does the patient has the symptom body_ache? :
Sorry, the system is unable to identify the disease
true.
?- go.
|: go.
Does the patient has the symptom headache? :
|: n.
Does the patient has the symptom sore_throat? :
|: n.
Does the patient has the symptom rash? :
|: y.
Does the patient has the symptom fever? :
|: y.
Does the patient has the symptom runny_nose? :
|: y.
Does the patient has the symptom sweating? :
Sorry, the system is unable to identify the disease
true.
?-
[1]+ Stopped swipl -s medicalExpert.pl
SIT@SIT-ThinkCentre-E73:~$ swipl -s medicalExpert.pl
Welcome to SWI-Prolog (threaded, 64 bits, version 7.6.4)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free
software.
Please run ?- license. for legal details.
?- go.
go.
Does the patient has the symptom headache? :
|: n.
Does the patient has the symptom sore_throat? :
|: n.
Does the patient has the symptom rash? :
|: y.
Does the patient has the symptom fever? :
|: y.
Does the patient has the symptom runny_nose? :
|: y.
Does the patient has the symptom sweating? :
Sorry, the system is unable to identify the disease
true.
?-
| go.
|: go.
Does the patient has the symptom headache? :
|: y
|: y.
Does the patient has the symptom sore_throat? :
ERROR: Stream user_input:28:2 Syntax error: Operator expected
Exception: (9) hypothesis(_2070) ? creep
?- go.
|: go.
Does the patient has the symptom sore_throat? :
|: y.
Does the patient has the symptom rash? :
|: n.
Does the patient has the symptom body_ache? :
|: y.
Does the patient has the symptom fever? :
|: y.
Does the patient has the symptom runny_nose? :
|: y.
Does the patient has the symptom conjunctivitis? :
?-
*/
Conclusion:
Thus we have implemented an expert system for medical diagnosis that identifies a disease from adequate symptoms.
Assignment No 3
Aim:
Use Heuristic Search Techniques to Implement Best first search (Best-
Solution but not always optimal) and A* algorithm (Always gives optimal
solution).
Program:
BFS:
DistanceComparator.java
package bfs;
import java.util.Comparator;
@Override
public int compare(Node o1, Node o2) {
if(o1.getDistance() > o2.getDistance())
return 1;
else if(o1.getDistance() < o2.getDistance())
return -1;
return 0;
}
Graph.java
package bfs;
import java.util.ArrayList;
import java.util.Scanner;
import javax.swing.JOptionPane;
ArrayList<HeadNode> headNodesList; // ArrayList to hold head nodes
int n;
}
for(int i=0;i<n;i++)
{
HeadNode tempHeadNode = headNodesList.get(i);
{
String name = tempHeadNode.getName();
// sc.skip("\n");
String ans = JOptionPane.showInputDialog("\nDo you want to add any adjacent node to node " + name + "? (y/n) : ");
if(ans.equals("n") || ans.equals("N"))
break;
// sc.skip("\n");
String tempName = JOptionPane.showInputDialog("Enter the name of adjacent node of " + name + " : ");
//sc.skip("\n");
int tempDistance = Integer.parseInt(JOptionPane.showInputDialog("Enter distance between nodes " + name + " and " + tempName + " :"));
tempHeadNode.setNodeInfo(tempName, tempDistance);
headNodesList.set(i, tempHeadNode);
}
}
}
HeadNode.java
package bfs;
import java.util.ArrayList;
import java.util.Iterator;
public ArrayList getNodeList()
{
return adjnodes;
}
while(i.hasNext())
{
Node temp= (Node)i.next();
System.out.print(", ("+temp.getName()+","+temp.getDistance()+")");
}
}
Node.java
package bfs;
BFS.java
package bfs;
import java.util.ArrayList;
import java.util.PriorityQueue;
import java.util.Scanner;
import javax.swing.JOptionPane;
/**
* @param args the command line arguments
*/
public static void main(String[] args)
{
int n;
n = Integer.parseInt(JOptionPane.showInputDialog("Enter No of nodes")); // Enter no. of nodes
PriorityQueue<Node> pq = new PriorityQueue<>(new DistanceComparator()); // Initialize priority queue
ArrayList<Boolean> visited = new ArrayList<>(n);
ArrayList<String> parent = new ArrayList<>(n); // Store parent nodes
for(int i=0;i<n;i++)
{
{
ArrayList<Node> neighbours = graph.getNeighbours(temp.getName()); // Get the neighbours of the retrieved node that are not visited
for(Node n1 : neighbours) // For all adjacent nodes
{
if(!visited.get(graph.getIndex(n1.getName())))
{
visited.set(graph.getIndex(n1.getName()), Boolean.TRUE); // Mark visited if not marked
pq.add(n1); // Add them to the queue
parent.set(graph.getIndex(n1.getName()), temp.getName()); // Set parent of neighbour node
}
}
displayQueue(pq); // Display the Queue
}
}
tracePath(parent,graph,goal);
}
for(Node n:pq)
{
System.out.print(n.getName()+"\t");
}
System.out.println("");
}
while(!parent.get(graph.getIndex(temp)).equals("NIL")) // Follow parents back until NIL
{
temp = parent.get(graph.getIndex(temp));
path = temp + ", " + path;
}
System.out.println(path);
}
}
/*
OUTPUT :
run:
A : (B,3), (C,1)
B : (D,3), (E,2)
C:
D:
E:
C B
B
B
E D
D
D
Path :
A, B, D
BUILD SUCCESSFUL (total time: 1 minute 8 seconds)
*/
A* Algorithm:
FixComparator.java
package astargraph;
import java.util.Comparator;
@Override
public int compare(HeadNode o1, HeadNode o2) {
if(o1.getFx()> o2.getFx())
return 1;
else if(o1.getFx() < o2.getFx())
return -1;
return 0;
}
Graph.java
package astargraph;
import java.util.ArrayList;
import java.util.Scanner;
import javax.swing.JOptionPane;
ArrayList<HeadNode> headNodesList;
int n;
}
for(int i=0;i<n;i++)
{
HeadNode tempHeadNode = headNodesList.get(i);
tempHeadNode.setNodeInfo(tempName,tempDistance);
headNodesList.set(i, tempHeadNode);
}
}
public void setGx(String name, int gx) // Set gx for a node and update the adjacency list
{
int index = getIndex(name);
HeadNode node = headNodesList.get(index);
node.setGx(gx);
headNodesList.set(index, node);
}
return headNodesList.get(getIndex(name));
}
HeadNode.java
package astargraph;
import java.util.ArrayList;
import java.util.Iterator;
return gx;
}
public ArrayList getNodeList()
{
return adjnodes;
}
Iterator i = adjnodes.iterator();
if(i.hasNext())
{
Node temp= (Node)i.next();
System.out.print("("+temp.getName()+","+temp.getDistance()+")");
}
while(i.hasNext())
{
Node temp= (Node)i.next();
System.out.print(", ("+temp.getName()+","+temp.getDistance()+")");
}
}
Node.java
package astargraph;
return distance;
}
AStarGraph.java
package astargraph;
import java.util.ArrayList;
import java.util.PriorityQueue;
import javax.swing.JOptionPane;
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
// TODO code application logic here
int n;
n = Integer.parseInt(JOptionPane.showInputDialog("Enter No of nodes")); // Enter no. of nodes
displayQueue(open);
displayClosed(closed);
System.out.println("Empty");
return;
}
for(HeadNode n: open)
{
System.out.print(n.getName()+"\t");
}
System.out.println("");
}
if(n.getName().equals(name))
return true;
}
return false;
}
System.out.println(path);
}
}
/*
OUTPUT:
run:
Fx of node A = 6
Open List : A
Closed List : A
Fx of node B = 5
Fx of node C = 6
Open List : B C
Open List : C
Closed List : A B
Fx of node D = 4
Open List : D C
Open List : C
Closed List : A B D
Path :
A, B, D
*/
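The Java sources above are excerpted across page breaks, so for reference the loop both programs share can be sketched compactly in Python. The graph below is the one from the BFS run above; the heuristic values in h are illustrative assumptions (with an admissible h, A* returns the optimal path, while ordering by h alone instead of g + h would give greedy best-first search):

```python
import heapq

def a_star(graph, h, start, goal):
    """A* search: expand nodes in order of f(n) = g(n) + h(n)."""
    # Each queue entry is (f, g, node, path-so-far).
    open_list = [(h[start], 0, start, [start])]
    closed = set()
    while open_list:
        f, g, node, path = heapq.heappop(open_list)
        if node == goal:
            return path, g                      # least-cost path found
        if node in closed:
            continue
        closed.add(node)
        for neighbour, cost in graph.get(node, []):
            if neighbour not in closed:
                g2 = g + cost
                heapq.heappush(open_list, (g2 + h[neighbour], g2, neighbour, path + [neighbour]))
    return None, float("inf")                   # goal unreachable

# Adjacency lists from the run above: A -> (B,3), (C,1); B -> (D,3), (E,2).
graph = {"A": [("B", 3), ("C", 1)], "B": [("D", 3), ("E", 2)], "C": [], "D": [], "E": []}
h = {"A": 5, "B": 2, "C": 6, "D": 0, "E": 4}    # assumed heuristic estimates to D
path, cost = a_star(graph, h, "A", "D")         # path == ["A", "B", "D"]
```

The recovered path matches the printed run (A, B, D); the same function degenerates to uniform-cost search when every h value is 0.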
Conclusion:
Thus we have studied heuristic search techniques by implementing Best First Search and the A* algorithm.
Assignment No 4
Aim:
Constraint Satisfaction Problem:
Implement crypt-arithmetic problem or n-queens or graph coloring problem
( Branch and Bound and Backtracking)
Objective:
Student will learn:
1. The basic concept of constraint satisfaction problem and backtracking.
2. General structure of N Queens problem.
Theory:
The N Queen is the problem of placing N chess queens on an N×N chessboard so that no
two queens attack each other. For example, following is a solution for 4 Queen problem.
The expected output is a binary matrix which has 1s for the blocks where queens are
placed.
For example, following is the output matrix for above 4 queen solution.
{ 0, 1, 0, 0 }
{ 0, 0, 0, 1 }
{ 1, 0, 0, 0 }
{ 0, 0, 1, 0 }
Naive approach: Generate all possible configurations of queens on the board and print a configuration that satisfies the given constraints:

while there are untried configurations {
generate the next configuration
if queens don't attack in this configuration then print this configuration
}
Backtracking Algorithm
Backtracking is finding the solution of a problem whereby the solution depends on the
previous steps taken.
In backtracking, we first take a step and then we see if this step taken is correct or not i.e.,
whether it will give a correct answer or not. And if it doesn’t, then we just come back and
change our first step. In general, this is accomplished by recursion. Thus, in backtracking,
we first start with a partial sub-solution of the problem (which may or may not lead us to
the solution) and then check if we can proceed further with this sub-solution or not. If not,
then we just come back and change it.
Thus, the general steps of backtracking are:
• start with a sub-solution
• check if this sub-solution will lead to the solution or not
• If not, then come back and change the sub-solution and continue again
The idea is to place queens one by one in different columns, starting from the leftmost
column. When we place a queen in a column, we check for clashes with already placed
queens. In the current column, if we find a row for which there is no clash, we mark this
row and column as part of the solution. If we do not find such a row due to clashes then
we backtrack and return false.
Algorithm:
3) Try all rows in the current column. Do the following for every tried row:
a) If the queen can be placed safely in this row then mark this [row, column]
as part of the solution and recursively check if placing queen here leads
to a solution.
b) If placing the queen in [row, column] leads to a solution then return true.
4) If all rows have been tried and nothing worked, return false to trigger backtracking.
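The steps above can be sketched as a short backtracking function (a minimal Python version, independent of the Java program below; board[col] stores the row chosen for the queen in column col):

```python
def solve_n_queens(n):
    """Place n queens column by column; returns board with board[col] = row, or None."""
    board = []

    def safe(row):
        col = len(board)
        for c, r in enumerate(board):
            # Clash: same row, or shared diagonal (equal row/column offsets).
            if r == row or abs(r - row) == abs(c - col):
                return False
        return True

    def place(col):
        if col == n:                  # all queens placed
            return True
        for row in range(n):          # try all rows in the current column
            if safe(row):
                board.append(row)     # mark [row, col] as part of the solution
                if place(col + 1):
                    return True
                board.pop()           # clash downstream: backtrack
        return False                  # nothing worked in this column

    return board if place(0) else None

solution = solve_n_queens(4)          # one valid 4-queens placement
```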
Program:
package ai_practical.assno12;
if(isAllQueensPlaced){
return true;
}
}
return false;
}
if(board[row] == board[i]){
return false;
}
return true;
}
/*
run:
_Q__
___Q
Q___
__Q_
*/
DATA ANALYTICS
Assignment No 1
Aim:
Download the Iris flower dataset or any other dataset into a DataFrame.
(eg https://archive.ics.uci.edu/ml/datasets/Iris ) Use Python/R and Perform
following –
How many features are there and what are their types (e.g., numeric,
nominal)?
Compute and display summary statistics for each feature available in the
dataset (e.g. minimum value, maximum value, mean, range, standard
deviation, variance and percentiles).
Data Visualization-Create a histogram for each feature in the dataset to
illustrate the feature distributions. Plot each histogram.
Create a boxplot for each feature in the dataset. All of the boxplots should
be combined into a single plot. Compare distributions and identify
outliers.
Theory:
R:
R is a powerful language used widely for data analysis and statistical computing. It
was developed in the early 1990s. Since then, continuous efforts have been made to improve R's user
interface. The journey of R language from a rudimentary text editor to interactive R Studio
and more recently Jupyter Notebooks has engaged many data science communities across the
world.
This was possible only because of generous contributions by R users globally. Inclusion of
powerful packages in R has made it more and more powerful with time. Packages such as
dplyr, tidyr, readr, data.table, SparkR, ggplot2 have made data manipulation, visualization
and computation much faster.
Components of R Studio:
1. R Console: This area shows the output of the code you run. You can also write code directly in the console, but code entered there cannot be traced later; this is where the R script comes into use.
2. R Script: As the name suggests, this is where you write code. To run it, select the line(s) of code and press Ctrl + Enter, or click the little 'Run' button at the top-right corner of the R Script pane.
3. R Environment: This space displays the set of external elements added, including data sets, variables, vectors and functions. To check whether data has been loaded properly in R, always look at this area.
4. Graphical Output: This space displays the graphs created during exploratory data analysis. Here you can also select packages and seek help from R's embedded official documentation.
Train Data: The predictive model is always built on the train data set. An intuitive way to identify the train data is that it always has the 'response variable' included.
Test Data: Once the model is built, its accuracy is 'tested' on the test data. This data always contains fewer observations than the train data set, and it does not include the 'response variable'.
data(): Loads specified data sets, or lists the available data sets.
Percentile:
The nth percentile of an observation variable is the value that cuts off the first n percent of the
data values when it is sorted in ascending order.
Histogram:
R creates histogram using hist() function. This function takes a vector as an input and uses
some more parameters to plot histograms.
Syntax
The basic syntax for creating a histogram using R is –
hist(v,main,xlab,xlim,ylim,breaks,col,border)
Following is the description of the parameters used:
v is a vector containing numeric values used in histogram.
main indicates title of the chart.
col is used to set color of the bars.
Summary
A very useful multipurpose function in R is summary(X), where X can be one of any number of objects, including datasets, variables, and linear models, just to name a few.
Response Variable (a.k.a. Dependent Variable): In a data set, the response variable (y) is the one on which we make predictions.
Predictor Variable (a.k.a. Independent Variable): In a data set, predictor variables (Xi) are those used to make the prediction on the response variable.
Boxplots
Boxplots are great for comparing groups of data. Let's compare the sepal widths to the species. The key is that the first variable is an ordered vector of quantitative data, Sepal.Width, and the second variable is a vector of categorical data, Species. We model the relationship as Sepal.Width ~ Species, meaning that Sepal.Width depends on the type of Species.
Program:
library(datasets)
data("iris")
names(iris)
dim(iris)
#view a dataset
View(iris)
#internal structure
min(iris$Sepal.Length)
max(iris$Sepal.Length)
mean(iris$Sepal.Length)
range(iris$Sepal.Length)
#standard deviation
sd(iris$Sepal.Length)
#variance
var(iris$Sepal.Length)
#percentile
quantile(iris$Sepal.Length)
#to display specific value
quantile(iris$Sepal.Length,c(0.3,0.6))
#histo
h <- hist(iris$Sepal.Length, main="sepal length frequencies-histogram", xlab="sepal length", xlim=c(3.5,8.5), col="blue")
h
#using breaks
h <- hist(iris$Sepal.Length, main="sepal length frequencies-histogram", xlab="sepal length", xlim=c(3.5,8.5), col="blue", labels=TRUE, breaks=3, border="green", las=2)
Output:
> library(datasets)
> data("iris")
> names(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
> dim(iris)
[1] 150 5
> View(iris)
> min(iris$Sepal.Length)
[1] 4.3
> max(iris$Sepal.Length)
[1] 7.9
> mean(iris$Sepal.Length)
[1] 5.843333
> range(iris$Sepal.Length)
[1] 4.3 7.9
> sd(iris$Sepal.Length)
[1] 0.8280661
> var(iris$Sepal.Length)
[1] 0.6856935
> quantile(iris$Sepal.Length)
0% 25% 50% 75% 100%
4.3 5.1 5.8 6.4 7.9
> quantile(iris$Sepal.Length,c(0.3,0.6))
30% 60%
5.27 6.10
> h <- hist(iris$Sepal.Length, main="sepal length frequencies-histogram", xlab="sepal length", xlim=c(3.5,8.5), col="blue")
> h
$breaks
[1] 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0
$counts
[1] 5 27 27 30 31 18 6 6
$density
[1] 0.06666667 0.36000000 0.36000000 0.40000000 0.41333333 0.24000000 0.08000000 0.08000000
$mids
[1] 4.25 4.75 5.25 5.75 6.25 6.75 7.25 7.75
$xname
[1] "iris$Sepal.Length"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
> h <- hist(iris$Sepal.Length, main="sepal length frequencies-histogram", xlab="sepal length", xlim=c(3.5,8.5), col="red", labels=TRUE, breaks=3, border="green", las=3)
> H <- hist(iris$Sepal.Length, breaks=c(4.3,4.6,4.9,5.2,5.5,5.8,6.1,6.4,6.7,7.0,7.3,7.6,7.9))
> boxplot(iris$Sepal.Length)
> summary(iris$Sepal.Length)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.300 5.100 5.800 5.843 6.400 7.900
> myboxplot<-boxplot(iris[,-5])
> myboxplot$out
[1] 4.4 4.1 4.2 2.0
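The same workflow can also be sketched in Python with pandas; the few inline rows below are illustrative stand-ins for the downloaded Iris file, with column names matching the R session above:

```python
import io
import pandas as pd

# A handful of rows in the Iris layout, inlined so the sketch runs without a download.
csv = io.StringIO(
    "Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "4.9,3.0,1.4,0.2,setosa\n"
    "7.0,3.2,4.7,1.4,versicolor\n"
    "6.3,3.3,6.0,2.5,virginica\n"
)
iris = pd.read_csv(csv)

# Feature count and types: four numeric features plus the nominal Species column.
n_features = iris.shape[1]
feature_types = iris.dtypes

# Summary statistics: min, max, mean, sd, variance, range, percentiles.
numeric = iris.select_dtypes("number")
stats = numeric.agg(["min", "max", "mean", "std", "var"])
stats.loc["range"] = stats.loc["max"] - stats.loc["min"]
percentiles = numeric.quantile([0.25, 0.50, 0.75])

# With matplotlib installed, the plots mirror hist() and boxplot() in R:
#   iris.hist()           # one histogram per feature
#   numeric.boxplot()     # all boxplots combined into a single plot
```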
Assignment No 2
Aim:
Download Pima Indians Diabetes dataset. Use Naive Bayes' Algorithm
for classification.
Load the data from CSV file and split it into training and test datasets.
Summarize the properties in the training dataset so that we can calculate
probabilities and make predictions.
Classify samples from a test dataset and a summarized training dataset.
Problem Statement:
Use the Naive Bayes algorithm for classification. Load the data from a CSV file and split it into training and test datasets. Summarize the properties of the training dataset so that we can calculate probabilities and make predictions, and classify samples from the test dataset using the summarized training dataset.
Objective:
Load the data from CSV file and split it into training and test datasets.
Summarize the properties in the training dataset so that we can calculate probabilities
and make predictions.
Classify samples from a test dataset and a summarized training dataset.
Theory:
R:
R is a powerful language used widely for data analysis and statistical computing. It
was developed in the early 1990s. Since then, continuous efforts have been made to improve R's user
interface. The journey of R language from a rudimentary text editor to interactive R Studio
and more recently Jupyter Notebooks has engaged many data science communities across the
world.
This was possible only because of generous contributions by R users globally. Inclusion of
powerful packages in R has made it more and more powerful with time. Packages such as
dplyr, tidyr, readr, data.table, SparkR, ggplot2 have made data manipulation, visualization
and computation much faster.
Components of R Studio:
1. R Console: This area shows the output of the code you run. You can also write code directly in the console, but code entered there cannot be traced later; this is where the R script comes into use.
2. R Script: As the name suggests, this is where you write code. To run it, select the line(s) of code and press Ctrl + Enter, or click the little 'Run' button at the top-right corner of the R Script pane.
3. R Environment: This space displays the set of external elements added, including data sets, variables, vectors and functions. To check whether data has been loaded properly in R, always look at this area.
4. Graphical Output: This space displays the graphs created during exploratory data analysis. Here you can also select packages and seek help from R's embedded official documentation.
Library:
1. caTools: moving window statistics, GIF, Base64, ROC AUC, etc. Contains several basic utility functions including: moving (rolling, running) window statistic functions, read/write for GIF and ENVI binary files, fast calculation of AUC, LogitBoost classifier, base64 encoder/decoder, round-off-error-free sum and cumsum, etc.
2. e1071: Functions for latent class analysis, short time Fourier transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, naive Bayes classifier, etc.
CSV (comma-separated values)
mydata<-read.csv(file="F:/SEM7/DA/KomalN/Ass2/diabetes.csv",header=TRUE,sep=",")
The above reads the file diabetes.csv into a data frame that it creates called mydata.
header=TRUE specifies that this data includes a header row, and sep="," specifies that the data is separated by commas (read.csv implies the same, but it is safer to be explicit).
sample.split: Splits the data from vector Y into two sets in a predefined ratio while preserving the relative ratios of the different labels in Y. It is used to split the data into train and test subsets for classification. SplitRatio is the splitting ratio.
Train Data: The predictive model is always built on the train data set. An intuitive way to identify the train data is that it always has the 'response variable' included.
Test Data: Once the model is built, its accuracy is 'tested' on the test data. This data always contains fewer observations than the train data set, and it does not include the 'response variable'.
Naive Bayes:
Naïve Bayes classification is a simple probabilistic classification method based on Bayes' theorem with the assumption of independence between features. The model is trained on the training dataset and then makes predictions through the predict() function. Two functions, naiveBayes() and train(), can be used to perform Naïve Bayes classification.
Predict:
The predict() function makes predictions from the model on new data. The new dataset must have all of the columns from the training data, but they can be in a different order with different values.
table(pred1,test$Outcome,dnn = c("predicted","Actual"))
table uses the cross-classifying factors to build a contingency table of the counts at each
combination of factor levels.
dnn: the names to be given to the dimensions in the result (the dimnames names).
cbind/rbind: Take a sequence of vector, matrix or data-frame arguments and combine them by columns or rows, respectively. These are generic functions with methods for other R classes.
Program:
#library(datasets)
library(caTools)
library(e1071)
mydata <- read.csv(file="F:/SEM7/DA/KomalN/Ass2/diabetes.csv", header=TRUE, sep=",")
View(mydata)
temp_field <- sample.split(mydata,SplitRatio=0.7)
train <- subset(mydata, temp_field==TRUE)
test <- subset(mydata, temp_field==FALSE)
head(train)
head(test)
my_model <- naiveBayes(as.factor(train$Outcome)~.,train)
my_model
pred1<-predict(my_model,test[,-9])
pred1
pred1<-predict(my_model,test[,-9],type="raw")
pred1
pred1<-predict(my_model,test[,-9])
pred1
table(pred1,test$Outcome,dnn = c("predicted","Actual"))
output<- cbind(test,pred1)
View(output)
Output:
#library(datasets)
library(caTools)
library(e1071)
mydata <- read.csv(file="F:/SEM7/DA/KomalN/Ass2/diabetes.csv", header=TRUE, sep=",")
View(mydata)
> head(train)
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
DiabetesPedigreeFunction Age Outcome
2 1 85 66 29 0 26.6 0.351 31 0
3 8 183 64 0 0 23.3 0.672 32 1
4 1 89 66 23 94 28.1 0.167 21 0
7 3 78 50 32 88 31.0 0.248 26 1
8 10 115 0 0 0 35.3 0.134 29 0
9 2 197 70 45 543 30.5 0.158 53 1
> head(test)
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
DiabetesPedigreeFunction Age Outcome
1 6 148 72 35 0 33.6 0.627 50 1
5 0 137 40 35 168 43.1 2.288 33 1
6 5 116 74 0 0 25.6 0.201 30 0
10 8 125 96 0 0 0.0 0.232 54 1
14 1 189 60 23 846 30.1 0.398 59 1
15 5 166 72 19 175 25.8 0.587 51 1
Call:
A-priori probabilities:
Y
0 1
0.6269531 0.3730469
Conditional probabilities:
Pregnancies
Y [,1] [,2]
0 3.264798 3.073319
1 4.712042 3.771892
Glucose
Y [,1] [,2]
0 110.1277 26.59334
1 138.8272 33.08691
BloodPressure
Y [,1] [,2]
0 68.51402 17.91265
1 71.36126 20.30531
SkinThickness
Y [,1] [,2]
0 19.46106 14.81635
1 21.72251 17.42568
Insulin
Y [,1] [,2]
0 65.71963 92.92128
1 99.55497 134.75274
BMI
Y [,1] [,2]
0 30.39564 7.462149
1 35.18325 6.494494
DiabetesPedigreeFunction
Y [,1] [,2]
0 0.4289221 0.3089013
1 0.5271518 0.3344238
Age
Y [,1] [,2]
0 31.4486 12.09977
1 37.1466 10.94577
> pred1<-predict(my_model,test[,-9])
> pred1
[1] 1 1 0 0 1 1 0 1 0 0 1 0 1 1 1 1 0 0 1 1 0 0 1 0 1 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 1 1 0 1 1 0 0 0 0 1 0 0 1 1 0
[86] 0 0 0 0 1 1 1 0 0 1 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 1 0 0 0 0 1 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0
[171] 0 0 1 1 0 0 0 0 0 0 1 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 1 1 0 1 0 1 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 1 1 0 1 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1
[256] 0
Levels: 0 1
> pred1<-predict(my_model,test[,-9],type="raw")
> pred1
0 1
[1,] 3.023617e-01 0.6976383049
[2,] 1.042643e-02 0.9895735696
[3,] 9.363969e-01 0.0636031459
[4,] 9.977055e-01 0.0022944542
[5,] 3.710766e-10 0.9999999996
[6,] 2.908498e-01 0.7091502399
[7,] 7.638351e-01 0.2361648756
[8,] 1.867342e-02 0.9813265778
[9,] 7.268751e-01 0.2731249027
[10,] 9.818539e-01 0.0181461175
[11,] 1.743120e-01 0.8256879958
[12,] 9.867358e-01 0.0132641733
[13,] 2.678128e-01 0.7321871927
[14,] 3.426135e-01 0.6573865416
[15,] 3.484119e-01 0.6515880608
[16,] 1.182130e-02 0.9881787022
[17,] 9.992828e-01 0.0007172357
[18,] 9.901948e-01 0.0098051522
…..
……
[255,] 5.173472e-02 0.9482652816
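The pipeline above (summarize per-class feature means and standard deviations, then combine Gaussian likelihoods with the class prior) can be sketched from scratch in Python. The six training rows are illustrative stand-ins for diabetes.csv, using only two of its columns (glucose and BMI):

```python
import math
from collections import defaultdict

def gaussian(x, mean, sd):
    # Per-feature likelihood under a normal distribution.
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (math.sqrt(2 * math.pi) * sd)

def summarize(rows, labels):
    """Per class, the (mean, sd) of each feature: the 'summary' used for prediction."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    summaries = {}
    for label, group in groups.items():
        summaries[label] = []
        for col in zip(*group):
            mean = sum(col) / len(col)
            sd = math.sqrt(sum((x - mean) ** 2 for x in col) / len(col)) or 1e-6
            summaries[label].append((mean, sd))
    return summaries

def predict(summaries, priors, row):
    # Naive assumption: features are independent, so likelihoods multiply.
    best, best_p = None, -1.0
    for label, feats in summaries.items():
        p = priors[label]
        for x, (mean, sd) in zip(row, feats):
            p *= gaussian(x, mean, sd)
        if p > best_p:
            best, best_p = label, p
    return best

# Toy training set: (glucose, BMI) -> Outcome, echoing the shape of the real data.
train_x = [(85, 26.6), (89, 28.1), (90, 25.0), (183, 23.3), (197, 30.5), (166, 29.0)]
train_y = [0, 0, 0, 1, 1, 1]
summaries = summarize(train_x, train_y)
priors = {0: 0.5, 1: 0.5}
prediction = predict(summaries, priors, (95, 27.0))   # classifies as 0
```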
Assignment No 3
Aim:
Trip History Analysis: Use trip history dataset that is from a bike sharing
service in the United States. The data is provided quarter-wise from 2010 (Q4)
onwards. Each file has 7 columns. Predict the class of user. Sample Test data set
available here https://www.capitalbikeshare.com/trip-history-data.
Problem Statement:
Analyse trip history using a dataset from a bike sharing service in the United States. The data is provided quarter-wise from 2010 (Q4) onwards. Each file has 7 columns. Predict the class of user.
Objective:
Predict the result from previous data.
Theory:
R:
R is a powerful language used widely for data analysis and statistical computing. It
was developed in the early 1990s. Since then, continuous efforts have been made to improve R's user
interface. The journey of R language from a rudimentary text editor to interactive R Studio
and more recently Jupyter Notebooks has engaged many data science communities across the
world.
This was possible only because of generous contributions by R users globally. Inclusion of
powerful packages in R has made it more and more powerful with time. Packages such as
dplyr, tidyr, readr, data.table, SparkR, ggplot2 have made data manipulation, visualization
and computation much faster.
Components of R Studio:
1. R Console: This area shows the output of the code you run. You can also write code directly in the console, but code entered there cannot be traced later; this is where the R script comes into use.
2. R Script: As the name suggests, this is where you write code. To run it, select the line(s) of code and press Ctrl + Enter, or click the little 'Run' button at the top-right corner of the R Script pane.
3. R Environment: This space displays the set of external elements added, including data sets, variables, vectors and functions. To check whether data has been loaded properly in R, always look at this area.
4. Graphical Output: This space displays the graphs created during exploratory data analysis. Here you can also select packages and seek help from R's embedded official documentation.
Library:
1. caTools: moving window statistics, GIF, Base64, ROC AUC, etc. Contains several basic utility functions including: moving (rolling, running) window statistic functions, read/write for GIF and ENVI binary files, fast calculation of AUC, LogitBoost classifier, base64 encoder/decoder, round-off-error-free sum and cumsum, etc.
2. e1071: Functions for latent class analysis, short time Fourier transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, naive Bayes classifier, etc.
Data: an optional data frame in which to interpret the variables named in the formula.
CSV (comma-separated values)
mydata<-read.csv(file="F:/SEM7/DA/KomalN/Ass2/tripdata.csv",header=TRUE,sep=",")
The above reads the file tripdata.csv into a data frame that it creates called mydata.
header=TRUE specifies that this data includes a header row, and sep="," specifies that the data is separated by commas (read.csv implies the same, but it is safer to be explicit).
sample.split: Splits the data from vector Y into two sets in a predefined ratio while preserving the relative ratios of the different labels in Y. It is used to split the data into train and test subsets for classification. SplitRatio is the splitting ratio.
Train Data: The predictive model is always built on the train data set. An intuitive way to identify the train data is that it always has the 'response variable' included.
Test Data: Once the model is built, its accuracy is 'tested' on the test data. This data always contains fewer observations than the train data set, and it does not include the 'response variable'.
Summary
A very useful multipurpose function in R is summary(X), where X can be one of any number
of objects, including datasets, variables, and linear models, just to name a few. When used,
the command provides summary data related to the individual object that was fed into it.
Thus, the summary function has different outputs depending on what kind of object it takes as
an argument.
head()/tail(): Return the first or last parts of a vector, matrix, table, data frame or function. Since head() and tail() are generic functions, they may also have been extended to other classes.
Program:
library(e1071)
library(caTools)
library(rpart)
mydata <- read.csv(file="/home/SIT/Desktop/tripdata.csv", header=TRUE, sep=",")
View(mydata)
#consider column1,4,6,9 - output class
subset_mydata <- mydata[,c(1,4,6,9)]
temp_field <- sample.split(subset_mydata,SplitRatio=0.9)
train <- subset(subset_mydata, temp_field==TRUE)
test <- subset(subset_mydata, temp_field==FALSE)
summary(train)
summary(test)
head(train)
head(test)
fit <- rpart(train$Member.type~.,data=train,method="class")
plot(fit)
text(fit)
#test excluding last colm
pred<- predict(fit,newdata=test[,-4],type=("class"))
mean(pred==test$Member.type)
output <- cbind(test,pred)
View(output)
plot(fit)
text(fit)
Output:
> library(e1071)
> library(caTools)
> library(rpart)
> mydata <- read.csv(file="/home/SIT/Desktop/tripdata.csv", header=TRUE, sep=",")
> View(mydata)
> summary(train)
Duration Start.station.number End.station.number Member.type
Min. : 60 Min. :31000 Min. :31000 Casual: 76741
> printcp(fit)
Classification tree:
rpart(formula = train$Member.type ~ ., data = train, method = "class")
n= 280586
> text(fit)
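rpart grows a full classification tree; its first split can be mimicked in Python by a one-split decision stump on Duration (assumed here to be the informative column; the rows below are illustrative, not taken from the Capital Bikeshare files):

```python
def best_stump(durations, labels):
    """Try every Duration threshold; keep the one that best separates Casual from Member."""
    best_threshold, best_acc = None, -1.0
    for threshold in sorted(set(durations)):
        # Rule: trips at or above the threshold are predicted 'Casual'.
        preds = ["Casual" if d >= threshold else "Member" for d in durations]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_threshold, best_acc = threshold, acc
    return best_threshold, best_acc

# Toy trips: (Duration in seconds, Member.type), echoing the columns used above.
trips = [(300, "Member"), (420, "Member"), (500, "Member"),
         (1800, "Casual"), (2400, "Casual"), (3100, "Casual")]
durations = [d for d, _ in trips]
labels = [m for _, m in trips]
threshold, accuracy = best_stump(durations, labels)   # splits cleanly at 1800 here
```

rpart evaluates candidate splits with an impurity measure (Gini) rather than raw accuracy and recurses on each side; the stump shows only the single-split core of that idea.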
Assignment No 4
Aim:
Twitter Data Analysis: Use Twitter data for sentiment analysis. The dataset is 3MB in
size and has 31,962 tweets. Identify the tweets which are hate tweets and which are not.
Sample Test data set available here https://datahack.analyticsvidhya.com/contest/practice-
problem-twitter-sentiment-analysis/
Objective: To learn the concept of natural language processing (NLP) tasks such as
part-of-speech tagging, noun phrase extraction, sentiment analysis, and
classification.
Theory: I. Python regular expression library: Regular expressions are used
to identify whether a pattern exists in a given sequence of characters (a
string). They help in manipulating textual data, which is often a
prerequisite for data science projects that involve text mining. Common
applications include validating the format of email addresses or
passwords on the server side during registration, and parsing text data
files to find, replace or delete certain strings.
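The uses mentioned above can be illustrated with Python's re module. The email pattern below is a deliberately simplified sketch for teaching purposes, not a complete RFC-compliant validator:

```python
import re

# Simplified email-format check, as used in server-side registration forms
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")

print(bool(EMAIL_RE.match("student@sit.ac.in")))  # a valid-looking address
print(bool(EMAIL_RE.match("not-an-email")))       # rejected: no @ or domain

# Find, replace or delete strings in text data
text = "Contact admin@example.com or support@example.com"
print(re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text))
print(re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "<email>", text))
```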
Input:
Structured Dataset : Twitter Dataset
File: Twitter.csv
Output:
1. Sentiment analysis of the Twitter dataset.
2. Categorization of tweets as positive and negative.
Program:
library(dplyr)
library(tibble)
library(twitteR)
library(graphics)
library(purrr)
library(stringr)
library(tm)
library(syuzhet)
library(gapminder)
library(httpuv)
library(openssl)
library(RCurl)
library(RInside)
library(Rcpp)
library(textclean)
library(SnowballC)
# Connect to the Twitter API (OAuth setup with setup_twitter_oauth() not shown):
api_key<- "CRzxTe08UF5Mrl7nFxovwmAhN"
# prat_tweets, oprah_tweets, etc. are lists of status objects fetched
# earlier with userTimeline(); combine them into a single data frame
tweets<- tbl_df(map_df(c(prat_tweets,oprah_tweets,neil_tweets,
mar_tweets,kutch_tweets),as.data.frame))
#Read in data:
setwd("C:/Users/mateo/Documents/Repo/text-analysis")
tweets<-read.csv("tweets.csv")
#Clean up data:
twitterCorpus <-Corpus(VectorSource(tweets$text))
inspect(twitterCorpus[1:10])
# remove non-ASCII characters (e.g. curly quotes and ellipses)
removeNonAscii<-function(x) textclean::replace_non_ascii(x)
twitterCorpus<-tm_map(twitterCorpus,content_transformer(removeNonAscii))
twitterCorpus<- tm_map(twitterCorpus,removeWords,c("amp","ufef",
"ufeft","uufefuufefuufef","uufef","s"))
twitterCorpus<- tm_map(twitterCorpus,stripWhitespace)
inspect(twitterCorpus[1:10])
# stem the corpus after sentiment analysis (given the sentiment dictionary
# used), but before cluster analysis
#Sentiment analysis:
emotions<-get_nrc_sentiment(twitterCorpus$content)
barplot(colSums(emotions),cex.names = .7,
col = rainbow(10),
main = "Sentiment scores for tweets"
)
get_sentiment(twitterCorpus$content[1:10])
sent<-get_sentiment(twitterCorpus$content)
sentimentTweets<-dplyr::bind_cols(tweets,data.frame(sent))
meanSent<-function(i,n){
mean(sentimentTweets$sent[i:n])
}
(scores<-c(prat=meanSent(1,250),
oprah=meanSent(251,500),
neil=meanSent(501,750),
maher=meanSent(751,849),
astk=meanSent(850,1002)))
#Cluster analysis:
dtm<-DocumentTermMatrix(twitterCorpus)
dtm
mat<-as.matrix(dtm)
d<-dist(mat)
groups<-hclust(d,method="ward.D")
plot(groups,hang=-1)
cut<-cutree(groups,k=6)
newMat<-dplyr::bind_cols(tweets,data.frame(cut))
table(newMat$screenName,newMat$cut)
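For comparison, the two stages above (lexicon-based sentiment scoring, as syuzhet does with the NRC dictionary, and clustering over a document-term matrix, as tm and hclust do) can be sketched in pure Python. The tiny word lists and tweets below are made-up illustrations, not the NRC lexicon or the real dataset:

```python
from collections import Counter
from itertools import combinations

# Made-up mini lexicon (a stand-in for syuzhet's NRC dictionary)
POSITIVE = {"love", "great", "happy"}
NEGATIVE = {"hate", "awful", "sad"}

tweets = [
    "i love this great day",
    "i hate this awful service",
    "happy happy day",
]

def sentiment(text):
    """Score = positive word count minus negative word count,
    the same idea as syuzhet::get_sentiment()."""
    words = text.split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

scores = [sentiment(t) for t in tweets]
print(scores)  # prints [2, -2, 2]

# Document-term matrix, as DocumentTermMatrix() builds in tm
vocab = sorted({w for t in tweets for w in t.split()})
dtm = [[Counter(t.split())[w] for w in vocab] for t in tweets]

# Euclidean distances between documents, like dist() in R
def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

dists = {(i, j): euclid(dtm[i], dtm[j])
         for i, j in combinations(range(len(dtm)), 2)}

# hclust() would merge the closest pair of documents first
print(min(dists, key=dists.get))
```

A negative score marks a tweet as hateful/negative and a positive score as non-hateful, which is the categorization the Output section asks for; hierarchical clustering then groups tweets with similar vocabulary, as hclust() does on the R side.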