DMDW Lab Report: Data Analytics Branch
SUBMITTED BY:
<< Ahmad Alsharef >>
<< 1864012 >>
GGPLOT
Merging two datasets
Cleaning and filtering
Using Scatter Plots
Multiple Regression
WEKA
Introduction to Weka
Main features of Weka
Installation Procedure of Weka
Starting Weka
Weka Application Interfaces
Weka data formats
Data Preprocessing
Discretization
K Means Clustering Example with Weka Explorer
Python Programming
What is Python?
Install and Run Python in Windows
Python - Basic Operators
Types of Operator
Python Arithmetic Operators
Python Comparison Operators
BASICS OF R LANGUAGE
R Language Basics
Installing R:
R is open-source and is freely available for macOS, Linux, and
Windows. You can download compiled versions of R (called binaries,
or precompiled binary distributions) by going to the home page for
R (http://www.r-project.org), and following the link to CRAN (the
Comprehensive R Archive Network). You will be asked to select a
mirror; pick one that is geographically nearby. On the CRAN site,
each operating system has a FAQ page and there is also a more
general FAQ. Both are worth reading.
R (programming language):
R is a programming language and free software environment for
statistical computing and graphics supported by the R Foundation
for Statistical Computing. The R language is widely used among
statisticians and data miners for developing statistical software
and for data analysis.
The R environment:
R is an integrated suite of software facilities for data
manipulation, calculation and graphical display. It includes:
• an effective data handling and storage facility.
• a suite of operators for calculations on arrays, in particular
matrices.
• a large, coherent, integrated collection of intermediate tools
for data analysis.
• graphical facilities for data analysis and display either
on-screen or on hardcopy, and
• a well-developed, simple and effective programming language
which includes conditionals, loops, user-defined recursive
functions and input and output facilities.
RStudio:
RStudio is a free and open-source integrated development
environment (IDE) for the R language. RStudio was founded by JJ
Allaire, creator of the programming language ColdFusion. Hadley
Wickham is the Chief Scientist at RStudio.
RStudio is available in two editions: RStudio Desktop, where the
program runs locally as a regular desktop application; and RStudio
Server, which allows accessing RStudio using a web browser while
it is running on a remote Linux server. Prepackaged distributions
of RStudio Desktop are available for Windows, macOS, and Linux.
Data sets:
A data set (or dataset) is a collection of data. Most commonly, a
data set corresponds to the contents of a single database table,
or a single statistical data matrix, where every column of the
table represents a particular variable, and each row corresponds
to a given member of the data set in question. The data set lists
values for each of the variables, such as height and weight of an
object, for each member of the data set. Each value is known as a
datum. The data set may comprise data for one or more members,
corresponding to the number of rows.
The term data set may also be used more loosely, to refer to the
data in a collection of closely related tables, corresponding to
a particular experiment or event. Less used names for this kind of
data sets are data corpus and data stock. An example of this type
is the data sets collected by space agencies performing experiments
with instruments aboard space probes. Data sets that are so large
that traditional data processing applications are inadequate to
deal with them are known as big data.
R Examples
1. Input three numbers and find out the greatest.
output :
Enter the value of a 2
Enter the value of b 1
Enter the value of b 13
"c is the largest"
2. Input three numbers and find out the second largest number.
output :
Enter the value of a 22
Enter the value of b 20
Enter the value of b 18
"b is the second largest number"
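for(i in 1:5)
{
# Read a number and classify it by its remainder modulo 2
a=as.integer(readline("Insert a number:"))
c<-(a%%2)
# switch() on a number is 1-based, so shift the remainder 0/1 to 1/2
switch(c+1,print("Even"),print("Odd"))
}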
output :
Insert a number : 4
Even
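6. Find the sum and average of the prime numbers between 2 and 50.
count<-0
sum<-0
for(n in 2:50)
{
# Trial division: n is prime if no i in 2..floor(n/2) divides it
isPrime<-TRUE
i<-2
while(i<=n%/%2)
{
if((n%%i)==0)
{
isPrime<-FALSE
break
}
i<-i+1
}
# Accumulate the prime itself, not its divisors
if(isPrime)
{
sum=sum+n
count<-count+1
}
}
avg=sum/count
# cat() accepts several arguments; print() takes a single object
cat("count :",count,"\n")
cat("sum :",sum,"\n")
cat("avg :",avg,"\n")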
Output:
count : 15
sum : 328
avg : 21.86667
binary<-function(d)
{
# Build the binary digits of d as a base-10 number, e.g. 6 -> 110
bsum<-0
bexp<-1
while(d>0)
{digit<-d%%2
d<-floor(d/2)
bsum<-bsum+digit*bexp
bexp<-bexp*10}
return(bsum)
}
decimal<-function(b)
{
# Read the base-10 digits of b as binary digits, e.g. 110 -> 6
dsum<-0
dexp<-1
while(b>0)
{digit<-b%%10
b<-floor(b/10)
dsum<-dsum+digit*dexp
dexp<-dexp*2}
return(dsum)
}
d<-readline("Insert Decimal : ")
d<-as.numeric(d)
b<-binary(d)
cat("Binary:",b,"\n")
d<-decimal(b)
cat("Decimal:",d,"\n")
Output :
Insert Decimal : 6
Binary: 110
Decimal: 6
addition<-function(a,b)
{
c=a+b
return(c)
}
subtract<-function(a,b)
{
c=a-b
return(c)
}
multiply<-function(a,b)
{
c=a*b
return (c)
}
division<-function(a,b)
{
# Guard against division by zero and return NA in that case
if(b!=0)
c=a/b
else
{
print("error")
c=NA
}
return(c)
}
a<-as.integer(readline(prompt="a: "))
b<-as.integer(readline(prompt="b: "))
# cat() interprets "\n" as a newline; print() would show it literally
cat("addition:\n")
print(addition(a,b))
cat("subtract:\n")
print(subtract(a,b))
cat("multiply:\n")
print(multiply(a,b))
cat("division:\n")
print(division(a,b))
Output:
a: 3
b: 3
addition:
6
subtract:
0
multiply:
9
division:
1
sum<-function(a)
{
if(a>0)
{
return(a+sum(a-1))
}
else { return(a) }
}
print(sum(100))
fact<-function(a)
{
# Base case returns 1 so that fact(0) and fact(1) are both 1
if(a>1) {return(a*fact(a-1))}
else {return(1)}
}
print(fact(6))
Output :
720
# a and b hold the randomly generated row and column counts
M<-matrix(1:(a*b), nrow=a, ncol=b)
if ( nrow(M)==ncol(M))
{
cat("Square")
} else
{
cat("Not Square")
}
Output :
Random Dimensions (Between 1 and 4) of the Matrix will be
generated
Rows Count : 4
Columns Count : 1
Not Square
13. Input two matrices and find their sum and product.
cat("Random Dimensions (1..4) Matrices will be generated\n")
d <- sample(1:4, 1)
# Draw from 1..16 so that sampling without replacement also works for d = 4
M1=matrix(sample.int(16, size = d*d), nrow = d, ncol = d)
M2=matrix(sample.int(16, size = d*d), nrow = d, ncol = d)
cat("Matrix1 is :\n")
M1
cat("Matrix2 is :\n")
M2
Summation=M1+M2
# * multiplies element by element; M1 %*% M2 would give the matrix product
Multiplication=M1*M2
cat("Summation is :\n")
Summation
cat("Multiplication is :\n")
Multiplication
Output :
Random Dimensions (1..4)Matrices will be generated
Matrix1 is :
[,1] [,2] [,3]
[1,] 6 10 1
[2,] 2 4 15
[3,] 12 8 14
Matrix2 is :
[,1] [,2] [,3]
[1,] 14 9 5
[2,] 1 10 6
[3,] 15 2 3
Summation is :
[,1] [,2] [,3]
[1,] 20 19 6
[2,] 3 14 21
[3,] 27 10 17
Multiplication is :
[,1] [,2] [,3]
[1,] 84 90 5
[2,] 2 40 90
[3,] 180 16 42
14. Input a matrix and find out the mid element and its
neighbors.
M=matrix(sample.int(25, size = roww*coll), nrow = roww, ncol =
coll)
M
m=M[ceiling(roww/2),ceiling(coll/2)]
l=M[ceiling(roww/2),ceiling(coll/2)-1]
r=M[ceiling(roww/2),ceiling(coll/2)+1]
t=M[ceiling(roww/2)-1,ceiling(coll/2)]
b=M[ceiling(roww/2)+1,ceiling(coll/2)]
cat("\nMiddle is :",m)
cat("\nLeft is :",l)
cat("\nRight is :",r)
cat("\nTop is :",t)
cat("\nBottom is :",b)
Output:
Random Dimensions (Between 3 and 5) Matrix will be generated
[,1] [,2] [,3] [,4] [,5]
[1,] 14 25 1 6 22
[2,] 10 11 12 20 21
[3,] 2 3 24 5 4
[4,] 9 15 18 19 7
[5,] 8 13 17 16 23
Middle is : 24
Left is : 3
Right is : 5
Top is : 12
Bottom is : 18
15. Input one matrix and find out its min and max number.
Output :
Random Dimensions (Between 1 and 3) Matrix will be generated
Matrix is :
[,1] [,2] [,3]
[1,] 14 4 23
[2,] 3 17 22
[3,] 5 19 15
Max is : 23
Min is : 3
DMDW LAB PROGRESS REPORT
2. I then extracted the top 100 players in the world into a new
dataset called top100players to simplify the simulations.
> top100players<-head(players,100)
Using ggplot to visualize how many professional players among
top100players each club includes:
4. Adding the continent of each player's national team to the data
set by merging two datasets:
I imported a dataset called continents, which lists each country in
the world and the continent that contains it.
> continent=read.csv('C:/Users/KIIT/DMDW Lab/UNSD.csv')
I merged the continents dataset with the top100players dataset using
an inner join.
> m=merge(top100players,continent,by="Nationality")
install.packages("tidyverse")
library(tidyverse)
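To make the inner-join behaviour of merge() concrete, here is a minimal sketch in Python (the other language used later in this report). The sample rows and the helper name inner_join are hypothetical; only the key column name Nationality comes from the commands above.

```python
# Hypothetical sample rows; the real datasets are much larger.
players = [
    {"Name": "Player A", "Nationality": "Argentina"},
    {"Name": "Player B", "Nationality": "Portugal"},
    {"Name": "Player C", "Nationality": "Atlantis"},  # no match in continents
]
continents = [
    {"Nationality": "Argentina", "Continent": "South America"},
    {"Nationality": "Portugal", "Continent": "Europe"},
]


def inner_join(left, right, key):
    """Keep only rows whose key value appears in both tables."""
    # Index the right table by key so each left row is matched quickly.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined


merged = inner_join(players, continents, "Nationality")
for row in merged:
    print(row["Name"], row["Nationality"], row["Continent"])
```

Rows without a matching Nationality in both tables (Player C here) are dropped, which is why an inner join can return fewer rows than the original dataset.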
Relationship between the price and the Overall rating.
> plot(x=top1000players$International.Reputation,
y=top1000players$Value,ylim = c(2.5,100))
No relationship between the price and the reputation.
> plot(x=top1000players$Weak.Foot, y=top1000players$Value,ylim =
c(2.5,100))
No Relationship
> plot(x=top1000players$Skill.Moves, y=top1000players$Value,ylim
= c(2.5,100))
No Relationship
> plot(x=top1000players$Contract.Valid.Until,
y=top1000players$Value,ylim = c(2.5,100))
WEKA
Introduction to Weka:
Weka is open-source software released under the GNU General Public
License.
“Weka” stands for the Waikato Environment for Knowledge Analysis.
It is freely available at http://www.cs.waikato.ac.nz/ml/weka.
The system is written in Java, an object-oriented language.
Weka can be used at several different levels.
Weka provides implementations of state-of-the-art data mining and
machine learning algorithms.
Weka contains modules for data preprocessing, classification,
clustering and association rule extraction.
Attribute Relation File Format (arff):
The ARFF version of the weather dataset from Weka's sample data is
shown below.
Each attribute's type is specified in the header section.
A nominal attribute lists its distinct values in curly brackets
after the attribute name.
A numeric attribute is specified by the keyword real after the
attribute name.
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
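As a rough illustration of how the header and data sections fit together, the following Python sketch reads a simple ARFF string. The function name parse_arff is hypothetical, and the sketch deliberately ignores quoted names, sparse data and other parts of the format; it is not how Weka itself loads files.

```python
def parse_arff(text):
    """Parse a simple ARFF string into (relation, attributes, rows)."""
    relation = None
    attributes = []  # (name, type); a nominal type becomes a list of values
    rows = []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
            continue
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            if kind.startswith("{"):
                kind = [value.strip() for value in kind.strip("{}").split(",")]
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


sample = """@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
"""

relation, attributes, rows = parse_arff(sample)
print(relation)
print(attributes[0])
print(rows[0])
```

Each data row has one value per declared attribute, in the order the @attribute lines appear.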
WEKA Explorer
• Click Explorer in the Weka GUI Chooser.
• In the Explorer window, click the “Open File” button to open a
data file from the folder where your data files are stored.
• Then select the desired module (Preprocess, Classify,
Cluster, Associate, etc.) from the upper tabs.
Data Preprocessing:
Some attributes may not be required in the analysis; those
attributes can be removed from the dataset beforehand.
For example, the instance number attribute of the iris dataset is
not required in the analysis. This attribute can be removed by
selecting its check box in the Attributes panel and clicking
Remove (Fig. 3). The resulting dataset can then be stored in the
ARFF file format.
Selecting or Filtering Attributes
In case some attributes need to be removed before the data mining
step, this can be done using the Attribute filters in WEKA. In
the "Filter" panel, click on the "Choose" button. This will show
a popup window with a list of available filters. Scroll down the
list and select the "weka.filters.unsupervised.attribute.Remove"
filter as shown in Figure 4. Next, click on the text box
immediately to the right of the "Choose" button. In the resulting
dialog box
enter the index of the attribute to be filtered out (this can be
a range or a list separated by commas). In this case, we enter 1
which is the index of the "id" attribute (see the left panel).
Make sure that the "invertSelection" option is set to false
(otherwise everything except attribute 1 will be filtered) (Fig
5). Then click "OK".
Filter an attribute
Options for filtering an attribute
Discretization:
Some techniques require discretizing numeric or continuous
attributes before a data mining task is applied. The WEKA
discretization filter can divide the ranges blindly, or use
various statistical techniques to automatically determine the
best way of partitioning the data. Discretization is illustrated
here with the simple binning method.
Click the filter dialog box and select
"weka.filters.unsupervised.attribute.Discretize" from the list.
Enter the index of the attributes to be discretized. In this case we
enter 1, corresponding to the attribute "age". We also enter 3 as the
number of bins (note that it is possible to discretize more than one
attribute at the same time by using a list of attribute indices).
Since we are doing simple binning, all of the other available options
are set to "false" (Fig. 7).
You can observe that WEKA has assigned its own labels to each of
the value ranges for the discretized attribute. For example, the
lower range in the "age" attribute is labeled "(-inf-34.333333]"
(enclosed in single quotes and escape characters), while the
middle range is labeled "(34.333333-50.666667]", and so on. These
labels now also appear in the data records where the original age
value was in the corresponding range.
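The simple (equal-width) binning described above can be sketched in Python as follows. The age values are hypothetical: a minimum of 18 and a maximum of 67 were chosen only because they reproduce the cut points 34.333333 and 50.666667 seen in the labels. The function names and the form of the last label are likewise assumptions, not WEKA's implementation.

```python
def equal_width_cuts(values, bins):
    """Return the interior cut points for equal-width binning."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + width * k for k in range(1, bins)]


def bin_label(value, cuts):
    """Return an interval label of the form (low-high] for a value."""
    edges = ["-inf"] + [f"{cut:.6f}" for cut in cuts] + ["inf"]
    for k, cut in enumerate(cuts):
        if value <= cut:
            return f"({edges[k]}-{edges[k + 1]}]"
    # Values above the last cut point fall into the open-ended top bin.
    return f"({edges[-2]}-{edges[-1]})"


ages = [18, 23, 35, 40, 52, 67]  # hypothetical ages
cuts = equal_width_cuts(ages, 3)
for age in ages:
    print(age, bin_label(age, cuts))
```

With three bins, the range 18 to 67 is split into intervals of width 49/3, so every value is replaced by the label of the interval that contains it.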
Fig. 6: Discretization Filter
In our example, such measures are not very applicable: the recall
simply corresponds to the TP Rate, since we always look at 100% of
the test sample, and the precision is just the proportion of low and
normal weight cases in the test sample.
The F-measure is a way of combining recall and precision scores
into a single measure of performance. The formula for it is:
2 * recall * precision / (recall + precision)
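A small Python sketch of this formula follows. Note that the whole sum recall + precision is the denominator. The function name f_measure, the sample scores and the choice to return 0.0 when both scores are zero are assumptions made for illustration.

```python
def f_measure(recall, precision):
    """Combine recall and precision into a single score (harmonic mean)."""
    if recall + precision == 0:
        return 0.0  # convention chosen here to avoid dividing by zero
    return 2 * recall * precision / (recall + precision)


score = f_measure(0.8, 0.6)  # hypothetical recall and precision
print(round(score, 4))
```

Because it is a harmonic mean, the F-measure stays low unless both recall and precision are reasonably high.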
$\sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - m_i \rVert^2$, where $m_i$ is the mean of cluster $C_i$.
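The quantity above, read as the sum over all K clusters of the squared distances between each point x and the mean of its cluster, can be computed with the short Python sketch below. The use of squared Euclidean distance, the function name sse and the sample points are assumptions for illustration.

```python
def sse(clusters):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    total = 0.0
    for points in clusters:
        dims = len(points[0])
        # Mean of the cluster, computed coordinate by coordinate.
        mean = [sum(p[d] for p in points) / len(points) for d in range(dims)]
        for p in points:
            total += sum((p[d] - mean[d]) ** 2 for d in range(dims))
    return total


clusters = [
    [(1.0, 1.0), (3.0, 1.0)],   # hypothetical cluster with mean (2, 1)
    [(8.0, 8.0), (8.0, 10.0)],  # hypothetical cluster with mean (8, 9)
]
print(sse(clusters))
```

A lower value means the points lie closer to their cluster means, which is what k-means tries to achieve when it reassigns points and recomputes the means.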
Python Programming
Python is a powerful multi-purpose programming language created by
Guido van Rossum.
It has a simple, easy-to-use syntax, which makes it a good choice
for someone learning computer programming for the first time.
What is Python?
Python is a general-purpose language. It has a wide range of
applications, from web development (e.g. Django and Bottle) and
scientific and mathematical computing (Orange, SymPy, NumPy) to
desktop graphical user interfaces (Pygame, Panda3D).
Python 3.0 (emphasis on removing duplicative constructs and modules): December 3, 2008
Python 3.5 (latest version at the time of writing): September 13, 2015
Web Applications
You can create scalable Web Apps using frameworks and CMS (Content
Management System) that are built on Python. Some of the popular
platforms for creating Web Apps are: Django, Flask, Pyramid, Plone,
Django CMS.
Sites like Mozilla, Reddit, Instagram and PBS are written in Python.
Scientific and Numeric Computing
When the download is completed, double-click the file and follow the
instructions to install it.
When Python is installed, a program called IDLE is also installed
along with it. It provides graphical user interface to work with
Python.
Open IDLE, copy the following code and press Enter.
print("Hello, World!")
To create a file in IDLE, go to File > New Window (Shortcut: Ctrl+N).
Write Python code (you can copy the code below for now) and save
(Shortcut: Ctrl+S) with .py file extension like: hello.py or your-
first-program.py
print("Hello, World!")
Go to Run > Run module (Shortcut: F5) and you can see the output.
Congratulations, you've successfully run your first Python program.
Types of Operator
Python language supports the following types of operators.
Arithmetic Operators
Comparison (Relational) Operators
Assignment Operators
Logical Operators
Bitwise Operators
Membership Operators
Identity Operators
Let us look at the operators one by one.
Python Arithmetic Operators:
The examples below take a = 10 and b = 20.
Operator | Description | Example
** (Exponent) | Performs exponential (power) calculation on the operands. | a**b = 10 to the power 20

Operator | Description | Example
<> | If the values of the two operands are not equal, then the condition becomes true. Python 2 only; Python 3 accepts only !=. | (a <> b) is true. This is similar to the != operator.
> | If the value of the left operand is greater than the value of the right operand, then the condition becomes true. | (a > b) is not true.
< | If the value of the left operand is less than the value of the right operand, then the condition becomes true. | (a < b) is true.
>= | If the value of the left operand is greater than or equal to the value of the right operand, then the condition becomes true. | (a >= b) is not true.
<= | If the value of the left operand is less than or equal to the value of the right operand, then the condition becomes true. | (a <= b) is true.
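The comparisons in the table can be checked directly in Python 3. The values a = 10 and b = 20 are assumed here to match the "is true" and "is not true" remarks in the examples. The <> spelling is left out because Python 3 rejects it; != is used instead.

```python
a, b = 10, 20  # assumed example values

results = {
    "a != b": a != b,
    "a > b": a > b,
    "a < b": a < b,
    "a >= b": a >= b,
    "a <= b": a <= b,
}
for expression, value in results.items():
    print(expression, "is", value)
```

Each comparison evaluates to one of the Boolean values True or False.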
-= (Subtract AND) | Subtracts the right operand from the left operand and assigns the result to the left operand. | c -= a is equivalent to c = c - a
and (Logical AND) | If both operands are true, then the condition becomes true. | (a and b) is true.
not (Logical NOT) | Reverses the logical state of its operand. | not (a and b) is false.
Operator | Description | Example
not in | Evaluates to true if it does not find a variable in the specified sequence, and false otherwise. | x not in y evaluates to True if x is not a member of sequence y.
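The logical and membership operators can be tried out with a few lines of Python. The values of x, y, a and b are hypothetical. Note that membership tests return True or False rather than 1 or 0, and that "and" returns one of its operands rather than a Boolean.

```python
x = 3
y = [1, 2, 3, 4]  # hypothetical sequence

is_member = x in y          # 3 is an element of y
is_missing = 7 not in y     # 7 is not an element of y
print(is_member, is_missing)

a, b = 10, 20               # assumed example values, both truthy
both = a and b              # "and" returns its last operand when all are truthy
negated = not (a and b)     # "not" always returns a Boolean
print(both, negated)
```

Since 20 is a non-zero number it counts as true in a condition, so (a and b) behaves as true and not (a and b) is False.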
1 | ** | Exponentiation (raise to the power)
2 | ~ + - | Complement, unary plus and minus (the special methods for the last two are __pos__ and __neg__)
3 | * / % // | Multiply, divide, modulo and floor division
4 | + - | Addition and subtraction
5 | >> << | Right and left bitwise shift
6 | & | Bitwise 'AND'
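The effect of these precedence levels can be seen by evaluating a few expressions without parentheses. The expressions below are illustrative choices, each picked so that a different pair of levels from the table decides the result.

```python
checks = {
    "-2 ** 2": -2 ** 2,        # ** binds tighter than unary minus: -(2 ** 2)
    "2 + 3 * 4": 2 + 3 * 4,    # * is applied before +: 2 + 12
    "1 << 2 + 1": 1 << 2 + 1,  # + is applied before <<: 1 << 3
    "6 & 1 << 1": 6 & 1 << 1,  # << is applied before &: 6 & 2
    "7 // 2 * 2": 7 // 2 * 2,  # same level, left to right: (7 // 2) * 2
}
for expression, value in checks.items():
    print(expression, "=", value)
```

Parentheses override these rules, so writing (-2) ** 2 or (2 + 3) * 4 is the simplest way to make the intended order explicit.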
Example in python:
num1 = int(input('Enter First number: '))
num2 = int(input('Enter Second number '))
add = num1 + num2
dif = num1 - num2
mul = num1 * num2
div = num1 / num2
floor_div = num1 // num2
power = num1 ** num2
modulus = num1 % num2
print('Sum of ',num1 ,'and' ,num2 ,'is :',add)
print('Difference of ',num1 ,'and' ,num2 ,'is :',dif)
print('Product of' ,num1 ,'and' ,num2 ,'is :',mul)
print('Division of ',num1 ,'and' ,num2 ,'is :',div)
print('Floor Division of ',num1 ,'and' ,num2 ,'is :',floor_div)
print('Exponent of ',num1 ,'and' ,num2 ,'is :',power)
print('Modulus of ',num1 ,'and' ,num2 ,'is :',modulus)
Output :