Exp 2


1.
Weka is data mining software that provides a collection of machine
learning algorithms. These algorithms can be applied directly to a
data set or called from Java code.
Weka provides a collection of tools for regression, clustering,
association, data pre-processing, classification, and visualization.
By default, Weka uses the Attribute-Relation File Format (ARFF) for
data analysis, but data can also be imported from the formats listed
below:
- CSV
- ARFF
- Databases, via ODBC
The Weka Explorer is illustrated in Figure 4 and contains six tabs, as
follows.
1) Preprocess: This allows us to choose the data file.
2) Classify: This allows us to apply and experiment with different
algorithms on preprocessed data files.
3) Cluster: This allows us to apply different clustering tools, which
identify clusters within the data file.
4) Association: This allows us to apply association rules, which identify
the association within the data.
5) Select attributes: This allows us to see the changes on the inclusion
and exclusion of attributes from the experiment.
6) Visualize: This allows us to see the possible visualization produced
on the data set in a 2D format, in scatter plot and bar graph output.
The user cannot move between the different tabs until the initial
preprocessing of the data set has been completed.
1.1

1.2
2.1
N-fold cross-validation, as I understand it, partitions the data into
N random, equal-sized subsamples. A single subsample is retained as the
validation set for testing, and the remaining N-1 subsamples are used
for training. The process is repeated N times so that each subsample
serves as the validation set exactly once, and the final result is the
average of the N test results.
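The procedure above can be sketched in plain Python. The
`train_and_score` callback is a placeholder for whatever model-fitting
and scoring routine is being evaluated (it is an assumption of this
sketch, not part of any particular library):

```python
import random

def n_fold_cross_validation(data, labels, n, train_and_score):
    """Partition the data into n random, equal-sized folds; each fold
    serves once as the validation set while the remaining n-1 folds
    are used for training. Returns the average of the n test scores."""
    indices = list(range(len(data)))
    random.shuffle(indices)
    folds = [indices[i::n] for i in range(n)]  # n roughly equal subsamples
    scores = []
    for i in range(n):
        test_idx = folds[i]
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        score = train_and_score(
            [data[j] for j in train_idx], [labels[j] for j in train_idx],
            [data[j] for j in test_idx], [labels[j] for j in test_idx])
        scores.append(score)
    return sum(scores) / n
```

With 10 samples and n = 5, each validation fold holds 2 samples and each
training set holds the remaining 8.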
2.2
A true positive is an outcome where the model correctly predicts the
positive class. Similarly, a true negative is an outcome where the model
correctly predicts the negative class.
A false positive is an outcome where the model incorrectly predicts the
positive class, while a false negative is an outcome where the model
incorrectly predicts the negative class.
sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)
accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
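The four formulas can be checked with a small Python helper (the counts
used in the example are made-up values):

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute the four metrics above from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)      # true positive rate (recall)
    specificity = tn / (tn + fp)      # true negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)        # positive predictive value
    return sensitivity, specificity, accuracy, precision
```

For example, with TP = 40, TN = 30, FP = 20, FN = 10 the helper gives a
sensitivity of 0.8, a specificity of 0.6, and an accuracy of 0.7.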

2.3
Mean absolute error (MAE) is a measure of the errors between paired
observations expressing the same phenomenon, for example predicted
versus observed values, a later measurement versus an initial one, or
one measurement technique versus an alternative technique. It has the
same units as the original data, so it can only be compared between
models whose errors are measured in the same units. It is usually
similar in magnitude to RMSE, but slightly smaller, because RMSE
weights large errors more heavily.
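A minimal Python sketch of MAE, with RMSE alongside for comparison
(MAE is never larger than RMSE for the same pairs):

```python
import math

def mean_absolute_error(predicted, observed):
    """MAE: average absolute difference between paired values."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)

def root_mean_squared_error(predicted, observed):
    """RMSE, shown for comparison: squaring weights large errors more."""
    return math.sqrt(
        sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))
```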

The mean relative error (MRE) is the ratio between the absolute error
and the magnitude of the reference (true) value. It is typically
computed only where the reference value is non-zero, to avoid a
singularity when computing the ratio.
Covariance is a statistical term that refers to a systematic
relationship between two random variables, in which a change in one
variable is reflected by a corresponding change in the other.
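The sample covariance can be sketched in a few lines of Python (using
the usual n-1 denominator for a sample):

```python
def sample_covariance(x, y):
    """Sample covariance: average product of the paired deviations
    from the two means, divided by n - 1."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
```

A positive result means the variables tend to move together; for
example, y = 2x yields a positive covariance.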
3) Classification is a technique used by data scientists to categorize
data into a given number of classes. It can be performed on structured
or unstructured data, and its main goal is to identify the category or
class to which a new data point belongs.
Classification algorithms also enable text-analysis software to perform
tasks such as aspect-based sentiment analysis and categorizing
unstructured text by topic and polarity of opinion. Five classification
algorithms that are most commonly used in data science are discussed
below.

Neural Network
First, there is the neural network: a set of algorithms that attempt to
identify the underlying relationships in a data set through a process
that mimics how the human brain operates. In data science, neural
networks help to cluster and classify complex relationships. They can
group unlabeled data according to similarities among the example
inputs, and classify data when a labelled data set is available to
train on.
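To illustrate the "mimics the brain" idea, here is a minimal sketch of
a single artificial neuron (a perceptron), not a full multi-layer
network; a real neural network stacks many such units and trains them
with backpropagation:

```python
def train_perceptron(samples, labels, epochs=20, lr=0.1):
    """Train one artificial neuron with the classic perceptron rule:
    nudge the weights whenever a sample is misclassified."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):  # y is 0 or 1
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred                 # 0 when correct, +/-1 otherwise
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def predict(w, b, x):
    """Fire (output 1) when the weighted input exceeds the threshold."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
```

On a linearly separable problem such as the logical AND of two inputs,
this converges to a correct classifier within a few epochs.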
K-Nearest Neighbors
KNN (K-Nearest Neighbors) is one of many algorithms used in data mining
and machine learning. It is a classifier algorithm in which learning is
based on the similarity of a data point (a vector) to the others: it
stores all available cases and classifies new cases based on a
similarity measure (e.g., a distance function).
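A minimal KNN sketch using Euclidean distance as the similarity measure
and a majority vote among the k nearest stored cases (k = 3 here is an
arbitrary choice):

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points, ranked by Euclidean distance."""
    dists = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels))
    top_k = [label for _, label in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]
```

Note that KNN does no training step at all: the "model" is simply the
stored data set, which is why it is often called a lazy learner.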
Decision Tree
The decision tree algorithm belongs to the supervised learning
algorithms and can be used to solve both regression and classification
problems. It builds classification or regression models in the form of
a tree structure: the data set is broken down into smaller and smaller
subsets while, at the same time, an associated decision tree is
incrementally developed. The purpose of the algorithm is to predict the
class or value of the target variable by learning simple decision rules
inferred from prior data.
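The splitting step can be illustrated with a one-level tree (a decision
stump); a full decision tree simply repeats this split recursively on
each resulting subset. The misclassification count used here is one
possible split criterion (real implementations typically use Gini
impurity or entropy):

```python
from collections import Counter

def majority(items):
    """Most common label in a subset (None for an empty subset)."""
    return Counter(items).most_common(1)[0][0] if items else None

def best_stump(samples, labels):
    """Try every feature and threshold; keep the split that
    misclassifies the fewest training samples."""
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in range(len(samples[0])):
        for t in sorted({x[f] for x in samples}):
            left = [y for x, y in zip(samples, labels) if x[f] <= t]
            right = [y for x, y in zip(samples, labels) if x[f] > t]
            ll, rl = majority(left), majority(right)
            errors = sum(y != ll for y in left) + sum(y != rl for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, ll, rl)
    return best[1:]

def stump_predict(stump, x):
    """Follow the single decision rule learned by the stump."""
    f, t, ll, rl = stump
    return ll if x[f] <= t else rl
```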
Random Forest
Random forests are an ensemble learning method for classification,
regression and other tasks that operates by constructing multiple
decision trees at training time. For a classification task, the output
of the random forest is the class selected by most trees; for a
regression task, the mean prediction of the individual trees is
returned. Random forests generally outperform single decision trees,
but typically have lower accuracy than gradient-boosted trees; the
characteristics of the data, however, can affect their relative
performance.
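The bootstrap-and-vote idea can be sketched as follows, using one-level
stumps as the "trees" to keep the code short (real random forests grow
full trees and also sample a random subset of features at each split):

```python
import random
from collections import Counter

def train_stump(samples, labels):
    """One-level tree used as the base learner in this sketch."""
    best = None
    for f in range(len(samples[0])):
        for t in sorted({x[f] for x in samples}):
            left = [y for x, y in zip(samples, labels) if x[f] <= t]
            right = [y for x, y in zip(samples, labels) if x[f] > t]
            ll = Counter(left).most_common(1)[0][0] if left else None
            rl = Counter(right).most_common(1)[0][0] if right else None
            errs = sum(y != ll for y in left) + sum(y != rl for y in right)
            if best is None or errs < best[0]:
                best = (errs, f, t, ll, rl)
    return best[1:]

def train_forest(samples, labels, n_trees=25):
    """Train each tree on a bootstrap sample (drawn with replacement)."""
    forest = []
    for _ in range(n_trees):
        idx = [random.randrange(len(samples)) for _ in range(len(samples))]
        forest.append(train_stump([samples[i] for i in idx],
                                  [labels[i] for i in idx]))
    return forest

def forest_predict(forest, x):
    """Classification output: the class selected by most trees."""
    votes = [(ll if x[f] <= t else rl) for f, t, ll, rl in forest]
    return Counter(votes).most_common(1)[0][0]
```

Because each tree sees a different bootstrap sample, individual trees
may err, but the majority vote is far more stable than any single tree.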
Naïve Bayes
Naive Bayes is a classification technique based on Bayes' theorem, with
an assumption of independence between predictors. In simple terms, a
Naive Bayes classifier assumes that the presence of a particular
feature in a class is unrelated to the presence of any other feature.
Bayes' theorem lets the classifier update its belief about the class
step by step as each new piece of evidence (feature value) is observed.
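A minimal categorical Naive Bayes sketch with add-one (Laplace)
smoothing; the weather-style example data in the usage note is made up
for illustration:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(samples, labels):
    """Count class priors and per-class feature-value frequencies;
    features are treated as independent given the class."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, feature_idx) -> value counts
    for x, y in zip(samples, labels):
        for f, value in enumerate(x):
            feature_counts[(y, f)][value] += 1
    return class_counts, feature_counts

def nb_predict(model, x):
    """Pick the class maximising log P(c) + sum_f log P(x_f | c),
    with add-one smoothing so unseen values never zero out a class."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    best_class, best_score = None, float("-inf")
    for c, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for f, value in enumerate(x):
            counts = feature_counts[(c, f)]
            score += math.log((counts[value] + 1) / (count + len(counts) + 1))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

For example, trained on four (outlook, temperature) -> play/don't-play
records, the classifier predicts the label whose per-feature
frequencies best match the query.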
