DMDW Lab Report: Data Analytics Branch

DMDW LAB REPORT
Data Analytics Branch
SUBMITTED BY:
<< Ahmad Alsharef >>
<< 1864012 >>
SCHOOL OF COMPUTER ENGINEERING

Kalinga Institute of Industrial Technology
Deemed to be University
Bhubaneswar-751024
Contents
R Language Basics
Installing R
R (programming language)
RStudio
Objects & Functions
Data sets
R Examples
Find out the second largest number
Using switch case
Convert from Decimal to Binary and vice versa
All arithmetic operations
Recursive functions
Matrices
Measures of variability
DMDW Lab Progress Report
Collecting Data
Drawing Plots
QPLOT
GGPLOT
Merging two datasets
Cleaning and filtering
Using Scatter Plots
Multiple Regression
WEKA
Introduction to Weka
Main features of Weka
Installation Procedure of Weka
Start the Weka
Weka Application Interfaces
Weka data formats
Data Preprocessing
Discretization
K Means Clustering Example with Weka Explorer
Python Programming
What is Python?
Install and Run Python in Windows
Python - Basic Operators
Types of Operator
Python Arithmetic Operators
Python Comparison Operators
BASICS OF R LANGUAGE
R Language Basics
Installing R:
R is open-source and is freely available for macOS, Linux, and
Windows. You can download compiled versions of R (called binaries,
or precompiled binary distributions) by going to the home page for
R (http://www.r-project.org), and following the link to CRAN (the
Comprehensive R Archive Network). You will be asked to select a
mirror; pick one that is geographically nearby. On the CRAN site,
each operating system has a FAQ page and there is also a more
general FAQ. Both are worth reading.
R (programming language):
R is a programming language and free software environment for
statistical computing and graphics supported by the R Foundation
for Statistical Computing. The R language is widely used among
statisticians and data miners for developing statistical software
and for data analysis.
R provides a wide variety of statistical (linear and nonlinear
modelling, classical statistical tests, time-series analysis,
classification, clustering, …) and graphical techniques, and is
highly extensible.
One of R’s strengths is the ease with which well-designed
publication-quality plots can be produced, including mathematical
symbols and formulae where needed. Great care has been taken over
the defaults for the minor design choices in graphics, but the
user retains full control.
The R environment:
R is an integrated suite of software facilities for data
manipulation, calculation and graphical display. It includes:
• an effective data handling and storage facility.
• a suite of operators for calculations on arrays, in particular
matrices.
• a large, coherent, integrated collection of intermediate tools
for data analysis.
• graphical facilities for data analysis and display either
on-screen or on hardcopy.
• a well-developed, simple and effective programming language
which includes conditionals, loops, user-defined recursive
functions and input and output facilities.
RStudio:
RStudio is a free and open-source integrated development
environment (IDE) for the R language. RStudio was founded by JJ
Allaire, creator of the programming language ColdFusion. Hadley
Wickham is the Chief Scientist at RStudio.
RStudio is available in two editions: RStudio Desktop, where the
program runs locally as a regular desktop application; and RStudio
Server, which allows accessing RStudio using a web browser while
it is running on a remote Linux server. Prepackaged distributions
of RStudio Desktop are available for Windows, macOS, and Linux.
Objects & Functions:

When you launch R, you will be greeted with a prompt (>) and a
blinking cursor.
R works with objects, and there are many types. Objects store data,
and they have commands that can operate on them, which depend on
the type and structure of the data that is stored. A single number
or string of text is the simplest object and is known as a scalar.
To give a value to an object use one of the two assignment
operators. Although the equals sign may be more familiar, the arrow
(less-than sign, followed by a dash: <-) is more common, and you
should use it.
> x = 3
> x <- 3
To display the value of any object, type its name at the prompt.
> x
Arithmetic operators follow standard symbols for addition,
subtraction, multiplication, division, and exponents:
> x <- 3 + 4
> x <- 2 - 9
> x <- 12 * 8
> x <- 16 / 4
> x <- 2 ^ 3
Comments are always preceded by a pound sign (#), and what follows
the pound sign on that line will not be executed.
> # x <- 2 * 5; everything on this line is a comment
> x <- 2 * 5 # comments can be placed after a statement
Spaces are generally ignored, but include them to make code easier
to read. Both of these lines produce the same result.
> x<-3+7/14
> x <- 3 + 7 / 14
Capitalization generally matters, as R is case-sensitive. This is
especially important when calling functions, discussed below.
> x <- 3
> x # correctly displays 3
> X # produces an error, as X doesn’t exist, but x does
Functions:
R has a rich set of functions, which are called with any arguments
in parentheses and which generally return a value. Functions are
also objects. The maximum of two values is calculated with the
max() function:
> max(9, 5)
> max(x, 4) # objects can be arguments to functions
Vectors:
A vector is a series of values, which may be numeric or text, where
all values have the same type.
Vectors are common data structures, and you will use them
frequently. Vectors are created most easily with the c() function
(c as in concatenate).
> x <- c(3, 7, -8, 10, 15, -9, 8, 2, -5, 7, 8, 9, -2, -4, -1)
> sort(x)
Sorting can be done in descending order by specifying decreasing
= TRUE or by reversing the sorted vector with rev(). The first way
is preferred.
> sort(x, decreasing=TRUE)
> rev(sort(x))
Lists:
Lists hold a series of values, like vectors, but the values in a
list can be of different types, unlike a vector.
Lists are made with the list() function:
> x <- list('Hueston Woods', 42, TRUE)
Lists are commonly produced by statistical functions, such as
regressions or ordinations. To access an element of a list, use
double brackets instead of single brackets.
> x[[1]] # returns 'Hueston Woods'
Matrices:
Another type of R data structure is a matrix, which has multiple
rows and columns of values, all of the same type, like a vector.
You can think of a matrix as a collection of vectors, all of the
same type.
> x <- matrix(c(3, 1, 4, 2, 5, 8), nrow=3)
This will generate a matrix of 3 rows and therefore 2 columns,
since there are six elements in the matrix. By default, the matrix
is filled by columns, such that column 1 is filled first from top
to bottom, then column 2, etc. Thus, the x matrix would be:
3 2
1 5
4 8
The matrix can be filled by rows instead, with row 1 filled
first, from left to right, then row 2, etc., by including the
argument byrow=TRUE:
> x <- matrix(c(3,1,4,2,5,8), nrow=3, byrow=TRUE)
This gives:
3 1
4 2
5 8
Large matrices can also be entered this way, but importing them
is easier.
Data Frames:
A data frame is like a matrix, but the columns can be of different
types.
In this way vectors are to lists as matrices are to data frames.
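The difference between a matrix and a data frame can be sketched with a small example. The names and values below are hypothetical and are not taken from any dataset used in this report; stringsAsFactors = FALSE is set only so that the text column stays a character vector on older R versions.

```r
# A matrix holds a single type: here every cell is numeric.
m <- matrix(c(1, 2, 3, 4), nrow = 2)

# A data frame holds one type per column (hypothetical values).
df <- data.frame(
  name   = c("Ana", "Ben"),    # character column
  age    = c(31, 27),          # numeric column
  member = c(TRUE, FALSE),     # logical column
  stringsAsFactors = FALSE
)

str(df)     # lists each column with its own type
df$age      # a column is an ordinary vector
df[[2]]     # the same column, accessed list-style with double brackets
```

Because a data frame is a list of equal-length vectors, the double-bracket access shown for lists also works on its columns.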
Data sets:
A data set (or dataset) is a collection of data. Most commonly, a
data set corresponds to the contents of a single database table, or
a single statistical data matrix, where every column of the table
represents a particular variable, and each row corresponds to a
given member of the data set in question. The data set lists values
for each of the variables, such as height and weight of an object,
for each member of the data set. Each value is known as a datum.
The data set may comprise data for one or more members,
corresponding to the number of rows.
The term data set may also be used more loosely, to refer to the
data in a collection of closely related tables, corresponding to
a particular experiment or event. Less used names for this kind of
data sets are data corpus and data stock. An example of this type
is the data sets collected by space agencies performing experiments
with instruments aboard space probes. Data sets that are so large
that traditional data processing applications are inadequate to
deal with them are known as big data.
R Examples
1. Input three numbers and find out the greatest.
a <- as.integer(readline(prompt="Enter the value of a "))
b <- as.integer(readline(prompt="Enter the value of b "))
c <- as.integer(readline(prompt="Enter the value of c "))
# "} else" is kept on one line so the chain also parses at top level
if (a > b && a > c) {
  print("a is the largest")
} else if (b > a && b > c) {
  print("b is the largest")
} else {
  print("c is the largest")
}
output :
Enter the value of a 2
Enter the value of b 1
"c is the largest"
2. Input three numbers and find out the second largest number.
a <- as.integer(readline(prompt="Enter the value of a "))
b <- as.integer(readline(prompt="Enter the value of b "))
c <- as.integer(readline(prompt="Enter the value of c "))
# a value is second largest when it lies strictly between the other two
if ((a > b && a < c) || (a > c && a < b)) {
  print("a is the second largest number")
} else if ((b > a && b < c) || (b > c && b < a)) {
  print("b is the second largest number")
} else {
  print("c is the second largest number")
}
output :
Enter the value of a 22
"b is the second largest number"
3. Input one number and check whether it is odd or even and
whether it is positive or negative.
a <- as.integer(readline(prompt="Insert a "))
if (a %% 2 == 0) {
  print("a is even")
} else {
  print("a is odd")
}
# note: a value of 0 falls into the "negative" branch
if (a > 0) {
  print("a is positive")
} else {
  print("a is negative")
}
output :
Insert a -5
a is odd
a is negative
4. Write R program to draw
    1
   121
  12321
 1234321
123454321
for (i in 1:5) {
  for (j in seq_len(5 - i)) cat(" ")   # leading spaces
  for (k in 1:i) cat(k)                # ascending part 1..i
  k <- i
  while (k != 1) {                     # descending part i-1..1
    cat(k - 1)
    k <- k - 1
  }
  cat("\n")
}
Output :
    1
   121
  12321
 1234321
123454321
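5. Write R program to check whether a number is odd or even using
switch case.
a <- as.integer(readline("Insert a number: "))
# a numeric switch() selects by position starting at 1, so the
# remainder (0 or 1) is shifted up by one
c <- (a %% 2) + 1
switch(c, print("Even"), print("Odd"))
output :
Insert a number : 4
Even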
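6. Find out the sum and average of prime numbers between 2 and 50.
count <- 0
sum <- 0
for (n in 2:50) {
  is_prime <- TRUE
  # trial division by 2..floor(sqrt(n)); 2 and 3 need no test
  if (n > 3) {
    for (i in 2:floor(sqrt(n))) {
      if (n %% i == 0) {
        is_prime <- FALSE
        break
      }
    }
  }
  if (is_prime) {
    sum <- sum + n
    count <- count + 1
  }
}
avg <- sum / count
cat("count : ", count, "\n")
cat("sum : ", sum, "\n")
cat("avg : ", avg, "\n")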
Output:
count : 15
sum : 328
avg : 21.86667
7. Write a program to input a number to convert it from:
1. Decimal to Binary.
2. Binary to Decimal.
Binary <- function(d) {
  bsum <- 0
  bexp <- 1
  while (d > 0) {
    digit <- d %% 2
    d <- floor(d / 2)            # drop the bit just processed
    bsum <- bsum + digit * bexp
    bexp <- bexp * 10
  }
  return(bsum)
}
Decimal <- function(b) {
  dsum <- 0
  dexp <- 1
  while (b > 0) {
    digit <- b %% 10
    b <- floor(b / 10)           # drop the digit just processed
    dsum <- dsum + digit * dexp
    dexp <- dexp * 2
  }
  return(dsum)
}
d <- as.numeric(readline("Insert Decimal : "))
b <- Binary(d)
cat("Binary:", b, "\n")
d <- Decimal(b)
cat("Decimal:", d, "\n")
Output :
Insert Decimal : 6
Binary: 110
Decimal: 6
8. Write R program to display all arithmetic operations like
Addition, Subtraction, Multiplication and Division in 4 separate
functions (function1, function2, function3, function4),
illustrating the methods of passing arguments and returning values.
addition <- function(a, b) {
  c <- a + b
  return(c)
}
subtract <- function(a, b) {
  c <- a - b
  return(c)
}
multiply <- function(a, b) {
  c <- a * b
  return(c)
}
division <- function(a, b) {
  if (b == 0) {
    print("error")
    return(NA)   # no meaningful quotient when dividing by zero
  }
  return(a / b)
}
a <- as.integer(readline(prompt="a: "))
b <- as.integer(readline(prompt="b: "))
cat("addition:\n")
print(addition(a,b))
cat("subtract:\n")
print(subtract(a,b))
cat("multiply:\n")
print(multiply(a,b))
cat("division:\n")
print(division(a,b))
Output:
a: 3
b: 3
addition:
6
subtract:
0
multiply:
9
division:
1
9. Write a program to find out the sum of n natural numbers using
a recursive function.
sum<-function(a)
{
if(a>0)
{
return(a+sum(a-1))
}
else { return(a) }
}
print(sum(100))
10. Write a program to find out the factorial of n (the product
of the first n natural numbers) using a recursive function.
product<-function(a)
{
if(a>1)
{
return(a*product(a-1))
}
else { return(1) }
}
print(product(100))
11. Write a program to find out the factorial of a number using
a recursive function.
fact<-function(a)
{
if(a>1) {return(a*fact(a-1))}
else {return(1)}
}
print(fact(6))
Output :
720
12. Write R program to check whether a matrix is square or not.
cat("Random Dimensions (Between 1 and 4) of the Matrix will be generated\n")
a <- sample(1:4, 1)
b <- sample(1:4, 1)
cat("Rows Count : ",a,"\n")
cat("Columns Count : ",b,"\n")
# 1:(a*b) yields a*b values; 1:a*b would be (1:a)*b, only a values
M <- matrix(1:(a*b), nrow=a, ncol=b)
if (nrow(M) == ncol(M)) {
  cat("Square")
} else {
  cat("Not Square")
}
Output :
Random Dimensions (Between 1 and 4) of the Matrix will be
generated
Rows Count : 4
Columns Count : 1
Not Square
13. Input two matrices and find out their sum and product.
cat("Random Dimensions (1..4) Matrices will be generated\n")
d <- sample(1:4, 1)
# replace = TRUE because a 4 x 4 matrix needs 16 values drawn from 1..15
M1 <- matrix(sample.int(15, size = d*d, replace = TRUE), nrow = d, ncol = d)
M2 <- matrix(sample.int(15, size = d*d, replace = TRUE), nrow = d, ncol = d)
cat("Matrix1 is :\n")
print(M1)
cat("Matrix2 is :\n")
print(M2)
Summation <- M1 + M2
# "*" multiplies element by element; the matrix product is M1 %*% M2
Multiplication <- M1 * M2
cat("Summation is :\n")
print(Summation)
cat("Multiplication is :\n")
print(Multiplication)
Output :
Random Dimensions (1..4) Matrices will be generated
Matrix1 is :
     [,1] [,2] [,3]
[1,]    6   10    1
[2,]    2    4   15
[3,]   12    8   14
Matrix2 is :
     [,1] [,2] [,3]
[1,]   14    9    5
[2,]    1   10    6
[3,]   15    2    3
Summation is :
     [,1] [,2] [,3]
[1,]   20   19    6
[2,]    3   14   21
[3,]   27   10   17
Multiplication is :
     [,1] [,2] [,3]
[1,]   84   90    5
[2,]    2   40   90
[3,]  180   16   42
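The products in the output above are element-wise (for example 6 * 14 = 84). As a small sketch of how this differs from the linear-algebra matrix product, the two operators can be compared on fixed 2 x 2 matrices; the values here are chosen only for illustration.

```r
M1 <- matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)
M2 <- matrix(5:8, nrow = 2)   # columns are (5, 6) and (7, 8)

elementwise <- M1 * M2        # multiplies matching cells
product     <- M1 %*% M2      # rows of M1 times columns of M2

print(elementwise)
print(product)
```

The element-wise form requires both matrices to have the same dimensions, while %*% requires the column count of the first to equal the row count of the second.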
14. Input a matrix and find out the mid element and its
neighbors.
cat("Random Dimensions (Between 3 and 5) Matrix will be generated\n")
roww <- sample(3:5, 1)
coll <- sample(3:5, 1)
M <- matrix(sample.int(25, size = roww*coll), nrow = roww, ncol = coll)
M
# the middle cell is at row ceiling(roww/2), column ceiling(coll/2)
m <- M[ceiling(roww/2), ceiling(coll/2)]
l <- M[ceiling(roww/2), ceiling(coll/2)-1]
r <- M[ceiling(roww/2), ceiling(coll/2)+1]
t <- M[ceiling(roww/2)-1, ceiling(coll/2)]
b <- M[ceiling(roww/2)+1, ceiling(coll/2)]
cat("\nMiddle is :",m)
cat("\nLeft is :",l)
cat("\nRight is :",r)
cat("\nTop is :",t)
cat("\nBottom is :",b)
Output:
Random Dimensions (Between 3 and 5) Matrix will be generated
     [,1] [,2] [,3] [,4] [,5]
[1,]   14   25    1    6   22
[2,]   10   11   12   20   21
[3,]    2    3   24    5    4
[4,]    9   15   18   19    7
[5,]    8   13   17   16   23
Middle is : 24
Left is : 3
Right is : 5
Top is : 12
Bottom is : 18
15. Input one matrix and find out its min and max number.
cat("Random Dimensions (Between 1 and 3) Matrix will be generated\n")
roww <- sample(1:3, 1)
coll <- sample(1:3, 1)
M <- matrix(sample.int(25, size = roww*coll), nrow = roww, ncol = coll)
cat("Matrix is :\n")
M
cat("\nMax is : ",max(M))
cat("\nMin is : ",min(M))
Output :
Random Dimensions (Between 1 and 3) Matrix will be generated
Matrix is :
     [,1] [,2] [,3]
[1,]   14    4   23
[2,]    3   17   22
[3,]    5   19   15
Max is : 23
Min is : 3
16. Apply different measures of variability over a common
dataset (variance, standard deviation, mean, max, median,
absolute deviation).
> data=read.csv("C:/Users/ALM/Downloads/LungCapData.csv")
> var(data$Age)
[1] 16.03802
> sd(data$Age)
[1] 4.00475
> mean(data$Age)
[1] 12.3269
> median(data$Age)
[1] 13
> max(data$Age)
[1] 19
> mad(data$Age)
[1] 4.4478
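To make the built-in measures less of a black box, the following sketch recomputes var() and mad() by hand on a small made-up vector (not the LungCapData file). It relies on two documented defaults: var() divides by n - 1, and mad() scales the median absolute deviation by 1.4826.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical sample

# sample variance: squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)

# median absolute deviation, scaled by R's default constant 1.4826
manual_mad <- 1.4826 * median(abs(x - median(x)))

var(x)             # matches manual_var
sd(x)              # square root of the variance
mad(x)             # matches manual_mad
```

The scaling constant makes mad() comparable to sd() for normally distributed data, which is why the two values reported for Age above are of similar size.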
DMDW LAB PROGRESS REPORT
DMDW Lab Progress Report
Applying different R statistical operations and analysis methods
on a dataset that contains information about football players from
all over the world, including their names, nationalities, overall
power, club, etc.
1. First I collected the data and imported it into RStudio.
> players=read.csv ('C:/Users/KIIT/DMDW Lab/players.csv')
2. Then I extracted the top 100 players in the world into a new
dataset called top100players to simplify simulations.
> top100players<-head(players,100)
3. I used the ggplot2 library to draw plots of top100players:
> library(ggplot2)
• The following instruction shows a qplot of the top 100 players
in the world with their nationalities:
> qplot(ID, Club, data=top100players, color=Nationality)
• Using ggplot to visualize how many of the top100players each
club includes:
> ggplot(top100players, aes(x=Club)) + geom_bar(color="black",
fill="lightblue",linetype="dashed",alpha=0.5) + theme(axis.text.x =
element_text(angle = 90, hjust = 1))
The x axis represents the club
The line color is black
The fill color is light blue
The line type is dashed
• Visualizing how many professional players from each country are
in each club:
> qplot(top100players$ID,top100players$Club,
color=top100players$Nationality)
4. Adding the continent of the player's national team to the data
set by merging two datasets:
I imported a dataset called continent, which lists each country in
the world and the continent that contains it.
> continent=read.csv('C:/Users/KIIT/DMDW Lab/UNSD.csv')
I merged the continent dataset with the top100players dataset using
an inner join.
> m=merge(top100players,continent,by="Nationality")
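What the inner join keeps and drops can be sketched on two tiny data frames. The rows below are invented for illustration (the real players.csv and UNSD.csv are not reproduced here); only the key column name Nationality is taken from the report.

```r
# Hypothetical miniature versions of the two datasets.
players_small <- data.frame(
  Name        = c("P1", "P2", "P3"),
  Nationality = c("Brazil", "France", "Atlantis"),
  stringsAsFactors = FALSE
)
continents_small <- data.frame(
  Nationality = c("Brazil", "France", "Japan"),
  Continent   = c("South America", "Europe", "Asia"),
  stringsAsFactors = FALSE
)

# Default merge() is an inner join: only keys present in both survive.
inner <- merge(players_small, continents_small, by = "Nationality")

# all.x = TRUE keeps every player and fills missing continents with NA.
left <- merge(players_small, continents_small, by = "Nationality",
              all.x = TRUE)

print(inner)
print(left)
```

With an inner join, a player whose Nationality spelling does not match any row of the continent table is silently dropped, so comparing nrow() before and after the merge is a quick sanity check.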
5. Cleaning and filtering the dataset to make it useful for
prediction:
• First I installed the package that allows selecting a subset of
attributes among the dataset columns:
> install.packages("tidyverse")
> library(tidyverse)
• Selecting useful attributes for the top 1000 players:
> top1000players <- head(players,1000) %>% select(Name,
Club,Overall,Position,LS,ST,RS,LW,LF,CF,RF,RW,LAM,CAM,RAM,LM,LCM,
CM,RCM,RM,LWB,LDM,CDM,RDM,RWB,LB,LCB,CB,RCB,RB,Value,Wage,Joined,
Contract.Valid.Until,Release.Clause)
• Removing k from wage and m from value.
• Removing lbs from weight.
• Eliminating rows with missing values from the entire data
frame:
> top1000players <- na.omit(top1000players)
• Converting feet and inches to cm using the following function:
convert_to_cm <-function(Height) {
feets<-as.integer(substr(Height,1,1))
inches<-as.integer(substr(Height,3,4))
cm<-round(30.48*feets+2.54*inches)
return(cm)
}
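The report does not show the code used to remove the k, m and lbs suffixes. One possible sketch is below; the helper name to_number and the sample strings are assumptions made for illustration, and the real columns may contain other characters (such as a currency symbol), which this pattern would also strip.

```r
# Hypothetical helper: keep only digits and the decimal point, then
# convert the remaining text to a number.
to_number <- function(x) {
  as.numeric(gsub("[^0-9.]", "", x))
}

# Made-up sample values in the assumed raw formats.
wage   <- c("565K", "405K")
value  <- c("110.5M", "77M")
weight <- c("159lbs", "183lbs")

to_number(wage)     # 565 405
to_number(value)    # 110.5 77.0
to_number(weight)   # 159 183
```

This keeps the unit implicit (thousands for wage, millions for value), which matches the plot limits used later in the report only under the assumption that Value is expressed in millions.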
6. Using scatter plots to see whether there is a relationship
between the player price and each variable:
> top1000players$Height<-convert_to_cm(top1000players$Height)
> plot(x=top1000players$Overall, y=top1000players$Value,ylim =
c(2.5,100))
• The relationship between the price and the Overall.
> plot(x=top1000players$Potential, y=top1000players$Value,ylim =
c(2.5,100))
• The relationship between the Value and the Potential.
> plot(x=top1000players$International.Reputation,
y=top1000players$Value,ylim = c(2.5,100))
No relationship between the price and the Reputation.
> plot(x=top1000players$Weak.Foot, y=top1000players$Value,ylim =
c(2.5,100))
No relationship.
> plot(x=top1000players$Skill.Moves, y=top1000players$Value,ylim
= c(2.5,100))
No relationship.
> plot(x=top1000players$Age, y=top1000players$Value,ylim =
c(2.5,100))
• The relationship between Value and Age.
> plot(x=top1000players$Contract.Valid.Until,
y=top1000players$Value,ylim = c(2.5,100))
No relationship between the price and the Contract validity.
• The attributes that affect the total price are:
Age, Overall, Potential
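Judging a relationship from a scatter plot alone is subjective, so a correlation coefficient can back up the visual impression. The sketch below uses simulated data, not the players dataset: the variable names mirror the report's columns, but the numbers and the linear relationship are invented.

```r
set.seed(1)   # reproducible simulated data

overall   <- sample(60:94, 200, replace = TRUE)
weak_foot <- sample(1:5, 200, replace = TRUE)
# value is constructed to depend on overall only, plus noise
value <- 3.5 * overall - 200 + rnorm(200, sd = 5)

r_overall   <- cor(overall, value)     # close to 1: strong relationship
r_weak_foot <- cor(weak_foot, value)   # close to 0: no linear relationship

r_overall
r_weak_foot
```

On the real data, the same cor() calls on columns such as top1000players$Overall and top1000players$Value would give a number to accompany each plot, provided both columns are numeric.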
7. Multiple Regression:
model <- lm(Value ~ Age +Overall+Potential,data=top1000players)
# Show the model.
print(model)
Model:
Call:
lm(formula = Value ~ Age + Overall + Potential, data =
top1000players)
Coefficients:
(Intercept)          Age      Overall    Potential
  -282.7408      -0.5814       3.5326       0.4033
• The formula that represents the relation between the price
(Value) and the attributes Age, Overall and Potential is:
PREDICTEDPRICE = - 282.7408 - 0.5814*AGE + 3.5326*OVERALL + 0.4033*POTENTIAL
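The fitted formula can also be applied without copying coefficients by hand. The following sketch fits the same kind of model on simulated data (the data frame toy and its generating coefficients are assumptions, not the report's data) and checks that predict() agrees with the manually expanded formula.

```r
set.seed(42)
n <- 100

# Simulated stand-in for the players data.
toy <- data.frame(
  Age       = sample(18:35, n, replace = TRUE),
  Overall   = sample(70:94, n, replace = TRUE),
  Potential = sample(75:95, n, replace = TRUE)
)
toy$Value <- -280 - 0.6 * toy$Age + 3.5 * toy$Overall +
  0.4 * toy$Potential + rnorm(n)

model <- lm(Value ~ Age + Overall + Potential, data = toy)
b <- coef(model)

# Manual expansion of the regression formula.
manual <- b["(Intercept)"] + b["Age"] * toy$Age +
  b["Overall"] * toy$Overall + b["Potential"] * toy$Potential

# Same numbers, taken straight from the model object.
toy$PredictedPrice <- predict(model, newdata = toy)
toy$Difference <- toy$Value - toy$PredictedPrice
```

Using predict() avoids the rounding that comes from retyping four-decimal coefficients, and it keeps working if the model is refitted with different attributes.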
8. We add 2 new attributes to the dataset:
# PredictedPrice contains the predicted price of the player:
> top1000players$PredictedPrice <- with(top1000players,- 282.7408
- 0.5814*Age + 3.5326*Overall + 0.4033*Potential)
# Difference contains the real price minus the predicted price:
# If it is <= 0, the player is worth his price; the real price is
# not more than the predicted price.
# If it is > 0, the player is not worth his price; the real price
# is more than the predicted price.
> top1000players$Difference <- with (top1000players, Value -
PredictedPrice)
We can inspect the Difference column to see who is worth his price
and who is not.
WEKA
Introduction to Weka:
Weka is open-source software released under the GNU General Public
License.
“Weka” stands for the Waikato Environment for Knowledge Analysis.
It is freely available at http://www.cs.waikato.ac.nz/ml/weka.
The system is written in Java, an object-oriented language.
There are several different levels at which Weka can be used.
Weka provides implementations of state-of-the-art data mining and
machine learning algorithms.
Weka contains modules for data preprocessing, classification,
clustering and association rule extraction.
Main features of Weka include:
• 49 data preprocessing tools

• 76 classification/regression algorithms
• 8 clustering algorithms
• 15 attribute/subset evaluators + 10 search algorithms for
feature selection.
• 3 algorithms for finding association rules
• 3 graphical user interfaces:
“The Explorer” (exploratory data analysis)
“The Experimenter” (experimental environment)
“The Knowledge Flow”(new process model inspired interface)
2. Installation Procedure of Weka:
• Download Weka from http://www.cs.waikato.ac.nz/ml/weka/
– Choose a self-extracting executable (including Java VM)
– (If you are interested in modifying/extending Weka,
there is a developer version that includes the source code)
• After download is completed, run the self-extracting file
to install Weka, and use the default set-ups.
3. Start the Weka
• From the Windows desktop,
– click “Start”, choose “All programs”, choose “Weka 3.6”.
– Then the first interface window appears:
Weka GUI Chooser
4. Weka Application Interfaces

• Explorer
– preprocessing, attribute selection, learning, visualization
• Experimenter
– testing and evaluating machine learning algorithms
• Knowledge Flow
– visual design of KDD process
• Simple Command-line
– A simple interface for typing commands
5. Weka data formats:
Attribute Relation File Format (ARFF) is the default file type for
data analysis in Weka, but data can also be imported from various
other formats.
• ARFF (Attribute Relation File Format) has two sections:
– the Header information defines attribute name, type and relations.
– the Data section lists the data records.
• CSV: Comma Separated Values (text file)
• Data can also be read from a database using ODBC connectivity.
Attribute Relation File Format (ARFF):
The ARFF version of the weather dataset from the sample data in Weka
is presented here.
The attribute type is specified in the header section.
A nominal attribute lists its distinct values in curly brackets
after the attribute name.
A numeric attribute is specified by the keyword real after the
attribute name.
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
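An ARFF file like the one above can also be loaded into R, which ties this section back to the earlier R chapters. The sketch below uses read.arff() from the foreign package (normally installed with R as a recommended package); the temporary file and the three data rows are a shortened copy made only for this illustration.

```r
library(foreign)   # provides read.arff() and write.arff()

# A shortened copy of the weather file, written to a temporary path.
arff_lines <- c(
  "@relation weather",
  "@attribute outlook {sunny, overcast, rainy}",
  "@attribute temperature real",
  "@attribute humidity real",
  "@attribute windy {TRUE, FALSE}",
  "@attribute play {yes, no}",
  "@data",
  "sunny,85,85,FALSE,no",
  "overcast,83,86,FALSE,yes",
  "rainy,70,96,FALSE,yes"
)
path <- tempfile(fileext = ".arff")
writeLines(arff_lines, path)

weather <- read.arff(path)
str(weather)   # nominal attributes become factors, real becomes numeric
```

The header section drives the column types: attributes declared with curly brackets arrive as factors, and attributes declared real arrive as numeric columns.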
WEKA Explorer
• Click the Explorer on Weka GUI Chooser
• On the Explorer window, click the “Open File” button to open a
data file from the folder where your data files are stored.
• Then select the desired module (Preprocess, Classify,
Cluster, Association, etc.) from the upper tabs.
Data Preprocessing:
Some attributes may not be required in the analysis, and those
attributes can be removed from the dataset before analysis.
For example, the instance number attribute of the iris dataset is
not required in analysis. This attribute can be removed by selecting
it in the Attributes check box, and clicking Remove (Fig. 3).
The resulting dataset can then be stored in ARFF file format.
Selecting or Filtering Attributes
In case some attributes need to be removed before the data mining
step, this can be done using the Attribute filters in WEKA. In
the "Filter" panel, click on the "Choose" button. This will show
a popup window with a list of available filters. Scroll down the
list and select the "weka.filters.unsupervised.attribute.Remove"
filter as shown in Fig. 4. Next, click on the text box immediately
to the right of the "Choose" button. In the resulting dialog box
enter the index of the attribute to be filtered out (this can be
a range or a list separated by commas). In this case, we enter 1
which is the index of the "id" attribute (see the left panel).
Make sure that the "invertSelection" option is set to false
(otherwise everything except attribute 1 will be filtered) (Fig.
5). Then click "OK".
Fig. 4: Filter an attribute
Fig. 5: Options for filtering an attribute
Discretization:
Some techniques require performing discretization on numeric or
continuous attributes before applying a data mining task. The WEKA
discretization filter can divide the ranges blindly, or use
various statistical techniques to automatically determine the
best way of partitioning the data. Discretization is illustrated
here with the simple binning method.
Click the filter dialog box and select
"weka.filters.unsupervised.attribute.Discretize" from the list.
Enter the index for the attributes to be discretized. In this case we
enter 1, corresponding to attribute "age". We also enter 3 as the
number of bins (note that it is possible to discretize more than one
attribute at the same time by using a list of attribute indices).
Since we are doing simple binning, all of the other available options
are set to "false" (Fig. 7).
You can observe that WEKA has assigned its own labels to each of
the value ranges for the discretized attribute. For example, the
lower range in the "age" attribute is labeled "(-inf-34.333333]"
(enclosed in single quotes and escape characters), while the
middle range is labeled "(34.333333-50.666667]", and so on. These
labels now also appear in the data records where the original age
value was in the corresponding range.
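The simple (equal-width) binning described above can be reproduced in R to see where the cut points 34.333333 and 50.666667 come from. This is only a sketch of the idea, not Weka's implementation: the age values are invented, chosen so that the minimum is 18 and the maximum is 67, which is what the labels above imply.

```r
# Hypothetical ages with minimum 18 and maximum 67.
age <- c(18, 23, 35, 40, 48, 52, 60, 67)

# Three equal-width bins need four edges between min and max.
edges <- seq(min(age), max(age), length.out = 4)
edges   # 18, 34.33333, 50.66667, 67

# include.lowest = TRUE puts the minimum value into the first bin.
age_bins <- cut(age, breaks = edges, include.lowest = TRUE)
table(age_bins)
```

R labels the outer bins with the observed minimum and maximum, whereas Weka labels them with -inf and inf; the inner cut points are the same.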
Fig. 6: Discretization Filter
Fig. 7: Discretization options for attribute
MP5 Classifier Example with Weka Explorer

Decision Tree is a “divide-and-conquer” approach to the problem of
learning from a set of independent instances and leads naturally to a
style of representation called a decision tree. Nodes in a decision
tree involve testing a particular attribute. Usually, the test at a
node compares an attribute value with a constant. However, some trees
compare two attributes with each other, or use some function of one
or more attributes. Leaf nodes give a classification that applies to
all instances that reach the leaf, or a set of classifications, or a
probability distribution over all possible classifications. To
classify an unknown instance, it is routed down the tree according to
the values of the attributes tested in successive nodes, and when a
leaf is reached the instance is classified according to the class
assigned to the leaf. MP5 is the basic decision tree classifier.
Following is the example of MP5 on weather data from sample datasets
of weka (Fig. 8).
• Select the Classify tab from the upper tabs.
• There are many classifiers available in Weka
• Select MP5 from the tree class
• You can select cross-validation or a percentage split of the
data
• Other options, such as the selection of variables for analysis,
are also available
• By default, the algorithm considers the last attribute as the
class attribute; the user can define any other attribute as the
class attribute too.
• Click on start to run the algorithm.
Interpretation of obtained results:
The first two columns are the TP Rate (True Positive Rate) and the FP
Rate (False Positive Rate). For the first class, ‘play=yes’, the TP
Rate is the ratio of play=yes cases predicted correctly to the total
number of play=yes cases (8 out of 9 are predicted correctly, so
8/9 = 0.89). The FP Rate is the ratio of play=no cases incorrectly
predicted as play=yes to the total number of play=no cases. One
play=no case was wrongly predicted as play=yes, so the FP Rate is
1/5 = 0.2.
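The two rates can be checked with a few lines of Python. This is only a sketch of the arithmetic: the counts are the ones quoted in the text for the play=yes class, and the helper name `rate` is hypothetical.

```python
# Counts for the class of interest (play=yes), as quoted in the text.
true_positives = 8   # play=yes cases predicted as yes
false_negatives = 1  # play=yes cases predicted as no
false_positives = 1  # play=no cases predicted as yes
true_negatives = 4   # play=no cases predicted as no


def rate(hits, total):
    """Return hits / total, or 0.0 when the class has no instances."""
    return hits / total if total else 0.0


tp_rate = rate(true_positives, true_positives + false_negatives)
fp_rate = rate(false_positives, false_positives + true_negatives)

print(f"TP rate: {tp_rate:.3f}")
print(f"FP rate: {fp_rate:.3f}")
```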
The next two columns are terms related to information retrieval
theory. When one is conducting a search for relevant documents, it is
often not possible to get to the relevant documents easily or
directly. In many cases, a search will yield lots of results, many of
which will be irrelevant. Under these circumstances, it is often
impractical to get all results at once but only a portion of them at a
time. In such cases, the terms recall and precision are important to
consider.
Recall is the ratio of relevant documents found in the search result
to the total of all relevant documents. Thus, higher recall values
imply that relevant documents are returned more quickly. A recall of
30% at 10% means that 30% of the relevant documents were found with
only 10% of the results examined. Precision is the proportion of
relevant documents in the results returned. Thus a precision of 0.75
means that 75% of the returned documents were relevant.
In our example, such measures are not very applicable: the recall in
this case just corresponds to the TP Rate, as we are always looking at
100% of the test sample, and precision is just the proportion of the
cases predicted as a class that actually belong to that class.
The F-measure is a way of combining recall and precision scores
into a single measure of performance. The formula for it is:
F = (2 * recall * precision) / (recall + precision)
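As a small worked example of the formula, the sketch below computes precision, recall and the F-measure for the play=yes class. The counts (8 correct, 1 missed, 1 wrongly predicted as yes) come from the weather example; the function name `f_measure` and the zero-division guard are assumptions added for the illustration.

```python
def f_measure(precision, recall):
    """Combine precision and recall; return 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# play=yes in the weather example: 8 correct, 1 missed, 1 false alarm.
tp, fn, fp = 8, 1, 1
precision = tp / (tp + fp)
recall = tp / (tp + fn)
score = f_measure(precision, recall)

print(f"precision={precision:.3f} recall={recall:.3f} F={score:.3f}")
```

Because precision and recall happen to be equal here, the F-measure equals both of them.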
Fig. 8: ID3 algorithm in weka

The confusion matrix shows how the instances of each class were
classified. For example, class a contains a majority of objects (8)
from the yes category, hence a is treated as the class of the “yes”
group. Similarly, b contains a majority of objects (4) from the no
category, hence b is treated as the class of the “no” group. One
object from each class is misclassified, which gives 2 misclassified
instances in total. The user can also view a plot of the tree.
K Means Clustering Example with Weka Explorer:

K-means is the most popularly used algorithm for clustering. The user
needs to specify the number of clusters (k) in advance. The algorithm
randomly selects k objects as the initial cluster means (centers). It
then works towards optimizing the squared-error criterion function,
defined as:
$$\sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - m_i \rVert^{2},$$
where $m_i$ is the mean of cluster $C_i$.
Main steps of k-means algorithm are:

1) Assign initial means mi
2) Assign each data object x to the cluster Ci for the closest mean
3) Compute new mean for each cluster
4) Iterate until criteria function converges, that is, there are no
newer assignments.
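The four steps above can be sketched in plain Python. This is a minimal illustration, not Weka's implementation: the sample points, the names `k_means` and `squared_distance`, the fixed random seed and the iteration cap are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iter=100, seed=0):
    """Cluster points (tuples of floats) into k groups."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data objects as the initial means.
    means = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster with the closest mean.
        new_assignment = [
            min(range(k), key=lambda i: squared_distance(p, means[i]))
            for p in points
        ]
        # Step 4: stop when no object changed its cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute the mean of each cluster from its members.
        for i in range(k):
            members = [p for p, c in zip(points, assignment) if c == i]
            if members:  # keep the old mean if a cluster ends up empty
                means[i] = tuple(sum(col) / len(members) for col in zip(*members))
    return means, assignment


# Hypothetical sample data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
means, assignment = k_means(points, k=2)
print("means:", means)
print("assignment:", assignment)
```

The cluster numbers themselves are arbitrary; only which points end up together is meaningful.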
The following example runs k-means on the weather data from the
sample datasets of Weka (Fig.).
• Select the Cluster tab from the upper tabs.
• Select K-means (SimpleKMeans) using the Choose button.
• You can select the attributes for clustering.
• If the class attribute is known, the user can select that attribute
for “classes to clusters evaluation” to check the accuracy of the
results.
• In order to store the results, select “Store clusters for
visualization”.
• Click Start to run the algorithm.
• Right-click on the result and select “Visualize cluster
assignments”.
• Click the Save button to store the results in ARFF file format.
Fig.: K means clustering in weka

The figure shows the results of k-means on the weather data. The
confusion matrix specifies the classes of the obtained results,
because we selected the classes to clusters evaluation. For example,
cluster0 has a total of 9 objects, of which the majority (6) are from
the yes category, hence this cluster is treated as the cluster of
“yes”. Similarly, cluster1 has a total of 5 objects, of which 3 are
from the “no” category, hence it is considered the cluster of the no
category.
PYTHON PROGRAMMING
Python Programming
Python is a powerful multi-purpose programming language created by
Guido van Rossum.
It has a simple, easy-to-use syntax, which makes it a good language
for someone trying to learn computer programming for the first
time.
What is Python?
Python is a general-purpose language. It has a wide range of
applications, from Web development (for example Django and Bottle)
and scientific and mathematical computing (Orange, SymPy, NumPy) to
desktop graphical user interfaces (Pygame, Panda3D).
The syntax of the language is clean and the length of the code is
relatively short. It's fun to work in Python because it allows you
to think about the problem rather than focusing on the syntax.
Release Dates of Different Versions

Version                                                                Release Date
Python 1.0 (first standard release)                                    January 1994
Python 1.6 (last minor version)                                        September 5, 2000
Python 2.0 (introduced list comprehensions)                            October 16, 2000
Python 2.7 (last minor version)                                        July 3, 2010
Python 3.0 (emphasis on removing duplicative constructs and modules)   December 3, 2008
Python 3.5 (last updated version)                                      September 13, 2015
Web Applications
You can create scalable Web Apps using frameworks and CMS (Content
Management System) that are built on Python. Some of the popular
platforms for creating Web Apps are: Django, Flask, Pyramid, Plone,
Django CMS.
Sites like Mozilla, Reddit, Instagram and PBS are written in Python.
Scientific and Numeric Computing
There are numerous libraries available in Python for scientific and
numeric computing. Libraries such as SciPy and NumPy are used in
general-purpose computing, and there are domain-specific libraries
such as EarthPy for earth science, AstroPy for astronomy and so on.
Also, the language is heavily used in machine learning, data mining
and deep learning.
Creating Software Prototypes
Python is slow compared to compiled languages like C++ and Java. It
might not be a good choice if resources are limited and efficiency is
a must.
However, Python is a great language for creating prototypes. For
example, you can use Pygame (a library for creating games) to create
your game's prototype first. If you like the prototype, you can use a
language like C++ to create the actual game.
Good Language to Teach Programming
Python is used by many companies to teach programming to kids and
newbies.
It is a good language with a lot of features and capabilities. Yet,
it's one of the easiest languages to learn because of its simple,
easy-to-use syntax.
Install and Run Python in Windows
Go to the Download Python page on the official site and click
Download Python 3.6.0 (you may see a different version name).
When the download is completed, double-click the file and follow the
instructions to install it.
When Python is installed, a program called IDLE is also installed
along with it. It provides a graphical user interface to work with
Python.
Open IDLE, copy the code below and press Enter.
print("Hello, World!")
To create a file in IDLE, go to File > New Window (Shortcut: Ctrl+N).
Write Python code (you can copy the code below for now) and save it
(Shortcut: Ctrl+S) with the .py file extension, for example hello.py
or your-first-program.py
print("Hello, World!")
Go to Run > Run module (Shortcut: F5) and you can see the output.
Congratulations, you've successfully run your first Python program.
Python - Basic Operators:

Operators are the constructs which can manipulate the values of
operands. Consider the expression 4 + 5 = 9. Here, 4 and 5 are called
operands and + is called the operator.
Types of Operator
The Python language supports the following types of operators.
• Arithmetic Operators
• Comparison (Relational) Operators
• Assignment Operators
• Logical Operators
• Bitwise Operators
• Membership Operators
• Identity Operators
Let us have a look at all operators one by one.
Python Arithmetic Operators:
Assume variable a holds 10 and variable b holds 20, then −
Operator           Description                                                                  Example
+ Addition         Adds values on either side of the operator.                                  a + b = 30
- Subtraction      Subtracts right hand operand from left hand operand.                         a - b = -10
* Multiplication   Multiplies values on either side of the operator.                            a * b = 200
/ Division         Divides left hand operand by right hand operand.                             b / a = 2.0
% Modulus          Divides left hand operand by right hand operand and returns the remainder.   b % a = 0
** Exponent        Performs exponential (power) calculation on the operands.                    a ** b = 10 to the power 20
Python Comparison Operators:

These operators compare the values on either side of them and decide
the relation among them. They are also called relational operators.
Assume variable a holds 10 and variable b holds 20, then −
Operator   Description                                                               Example
==         True if the values of the two operands are equal.                         (a == b) is not true.
!=         True if the values of the two operands are not equal.                     (a != b) is true.
<>         Same as !=; Python 2 only, removed in Python 3.                           (a <> b) is true.
>          True if the left operand is greater than the right operand.               (a > b) is not true.
<          True if the left operand is less than the right operand.                  (a < b) is true.
>=         True if the left operand is greater than or equal to the right operand.   (a >= b) is not true.
<=         True if the left operand is less than or equal to the right operand.      (a <= b) is true.
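The comparisons can be tried directly in Python 3. The snippet below is a small sketch using the same values of a and b as the table; the dictionary name `results` is only for illustration, and `<>` is left out because it is not valid syntax in Python 3.

```python
a, b = 10, 20  # the values assumed in the table

# Each comparison evaluates to the Boolean True or False.
results = {
    "a == b": a == b,
    "a != b": a != b,
    "a > b": a > b,
    "a < b": a < b,
    "a >= b": a >= b,
    "a <= b": a <= b,
}

for expression, value in results.items():
    print(f"{expression:7} -> {value}")
```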
Python Assignment Operators:
Each compound operator below also assigns the result to the left
operand.
Operator             Description                                                   Example
=                    Assigns the value of the right operand to the left operand.   c = a + b assigns the value of a + b to c
+= Add AND           Adds the right operand to the left operand.                   c += a is equivalent to c = c + a
-= Subtract AND      Subtracts the right operand from the left operand.            c -= a is equivalent to c = c - a
*= Multiply AND      Multiplies the left operand by the right operand.             c *= a is equivalent to c = c * a
/= Divide AND        Divides the left operand by the right operand.                c /= a is equivalent to c = c / a
%= Modulus AND       Takes the modulus of the left operand by the right operand.   c %= a is equivalent to c = c % a
**= Exponent AND     Raises the left operand to the power of the right operand.    c **= a is equivalent to c = c ** a
//= Floor Division   Floor-divides the left operand by the right operand.          c //= a is equivalent to c = c // a
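A short sketch of the compound assignment operators follows. The starting values (a = 10, c = 20) and the variable names are assumptions for the example; c is reset before each operator so that every line starts from the same value.

```python
a = 10

c = 20
c += a          # same as c = c + a
added = c

c = 20
c -= a          # same as c = c - a
subtracted = c

c = 20
c *= a          # same as c = c * a
multiplied = c

c = 20
c /= a          # same as c = c / a (the result is a float in Python 3)
divided = c

c = 20
c %= a          # same as c = c % a
remainder = c

c = 20
c **= a         # same as c = c ** a
powered = c

c = 20
c //= a         # same as c = c // a
floored = c

print(added, subtracted, multiplied, divided, remainder, powered, floored)
```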
Python Logical Operators:

The following logical operators are supported by the Python
language. Assume variable a holds 10 and variable b holds 20, then −
Operator          Description                                       Example
and Logical AND   True if both operands are true.                   (a and b) is true.
or Logical OR     True if either of the two operands is non-zero.   (a or b) is true.
not Logical NOT   Reverses the logical state of its operand.        not (a and b) is false.
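The following sketch tries the logical operators with a = 10 and b = 20. One detail worth seeing in practice is that `and` and `or` return one of their operands rather than a strict Boolean, while `not` always returns True or False; the variable names are chosen only for this illustration.

```python
a, b = 10, 20

both = a and b            # both operands are truthy, so the last one (20) is returned
either = a or b           # the first truthy operand (10) is returned
negated = not (a and b)   # `not` always yields a real Boolean

print(both, bool(both))      # 20 True
print(either, bool(either))  # 10 True
print(negated)               # False
```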

Python Membership Operators:
Python’s membership operators test for membership in a sequence, such
as strings, lists, or tuples. There are two membership operators, as
explained below −
Operator   Description                                                                  Example
in         True if the value is found in the specified sequence, False otherwise.       x in y is True if x is a member of sequence y.
not in     True if the value is not found in the specified sequence, False otherwise.   x not in y is True if x is not a member of sequence y.
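A brief sketch of the membership operators on a list, a string and a tuple follows; the sample values and variable names are made up for the illustration.

```python
primes = [2, 3, 5, 7]

found = 3 in primes              # True: 3 is an element of the list
missing = 4 not in primes        # True: 4 is not an element of the list
substring = "th" in "Python"     # True: for strings, `in` tests for a substring
in_tuple = "a" in ("x", "y")     # False: "a" is not an element of the tuple

print(found, missing, substring, in_tuple)
```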
Python Identity Operators:

Identity operators compare the memory locations of two objects. There are
two identity operators, explained below −
Operator   Description                                                         Example
is         True if both operands refer to the same object, False otherwise.    x is y is True if id(x) equals id(y).
is not     True if the operands refer to different objects, False otherwise.   x is not y is True if id(x) is not equal to id(y).
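The difference between identity and equality is easiest to see with two lists that hold the same values. The sketch below uses made-up names (`first`, `alias`, `duplicate`) for illustration.

```python
first = [1, 2, 3]
alias = first            # a second name for the same list object
duplicate = list(first)  # a new list with equal contents

same_object = first is alias         # True: both names refer to one object
equal_values = first == duplicate    # True: the contents are equal
different = first is not duplicate   # True: they are two separate objects
same_id = id(first) == id(alias)     # True: `is` compares identities

print(same_object, equal_values, different, same_id)
```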
Python Operators Precedence:

The following table lists the operators from highest precedence to lowest.
Sr.No.   Operator                                 Description
1        **                                       Exponentiation (raise to the power)
2        ~ + -                                    Complement, unary plus and minus
3        * / % //                                 Multiply, divide, modulo and floor division
4        + -                                      Addition and subtraction
5        >> <<                                    Right and left bitwise shift
6        &                                        Bitwise AND
7        ^                                        Bitwise exclusive OR
8        |                                        Bitwise OR
9        == != > < >= <= is, is not, in, not in   Comparison, identity and membership operators
10       not                                      Logical NOT
11       and                                      Logical AND
12       or                                       Logical OR
Assignment (= %= /= //= -= += *= **=) is a statement rather than an
expression operator, so it is applied only after the expression on
its right-hand side has been fully evaluated.
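A few expressions make the effect of precedence visible. This is only an illustrative sketch; the expressions and variable names are chosen for the example.

```python
product_first = 2 + 3 * 4       # 14: multiplication before addition
grouped = (2 + 3) * 4           # 20: parentheses override precedence
power_first = -2 ** 2           # -4: exponentiation before unary minus
shift_last = 1 << 2 + 1         # 8: addition before the shift
and_before_or = 6 & 3 | 8       # 10: bitwise AND before bitwise OR
not_last = not 1 + 1 == 2       # False: arithmetic, then comparison, then not

print(product_first, grouped, power_first, shift_last, and_before_or, not_last)
```

When in doubt, adding parentheses makes the intended order explicit.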
Example in Python:
# Read two integers from the user (num2 must not be 0 for /, // and %)
num1 = int(input('Enter first number: '))
num2 = int(input('Enter second number: '))
# Apply each arithmetic operator
add = num1 + num2
dif = num1 - num2
mul = num1 * num2
div = num1 / num2
floor_div = num1 // num2
power = num1 ** num2
modulus = num1 % num2
# Print the results; print() already separates its arguments with a space
print('Sum of', num1, 'and', num2, 'is:', add)
print('Difference of', num1, 'and', num2, 'is:', dif)
print('Product of', num1, 'and', num2, 'is:', mul)
print('Division of', num1, 'and', num2, 'is:', div)
print('Floor Division of', num1, 'and', num2, 'is:', floor_div)
print('Exponent of', num1, 'and', num2, 'is:', power)
print('Modulus of', num1, 'and', num2, 'is:', modulus)
Output :