
UNIT-IV

PART-B
1. Explain about classification and prediction with an example?
Ans: There are two forms of data analysis that can be used for
extracting models describing important classes or to predict
future data trends. These two forms are as follows −

 Classification
 Prediction

Classification models predict categorical class labels; and


prediction models predict continuous valued functions. For
example, we can build a classification model to categorize bank
loan applications as either safe or risky, or a prediction model to
predict the expenditures in dollars of potential customers on
computer equipment given their income and occupation.
Following are the examples of cases where the data analysis
task is Classification −
 A bank loan officer wants to analyze the data in order to
know which customers (loan applicants) are risky and which
are safe.
 A marketing manager at a company needs to analyze
whether a customer with a given profile will buy a new computer.
In both of the above examples, a model or classifier is
constructed to predict the categorical labels. These labels are
risky or safe for loan application data and yes or no for marketing
data.

Following are the examples of cases where the data analysis


task is Prediction −
Suppose the marketing manager needs to predict how much a
given customer will spend during a sale at his company. In this
example we are asked to predict a numeric value. Therefore
the data analysis task is an example of numeric prediction. In
this case, a model or a predictor will be constructed that predicts
a continuous-valued function or ordered value.
2. Discuss about basic decision tree induction algorithm?
Ans: Decide which attribute (and splitting point) to test at
node N by determining the “best” way to separate or partition
the tuples in D into individual classes.
• The splitting criterion is determined so that, ideally, the
resulting partitions at each branch are as “pure” as possible.
• A partition is pure if all of the tuples in it belong to the same
class
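A minimal sketch of this induction step in Python is shown below. It assumes a tiny in-memory data set of (attribute-dictionary, class-label) pairs and uses information gain as the splitting criterion; the attribute names and helper functions are illustrative, not the textbook's exact pseudocode.

import math
from collections import Counter

def entropy(labels):
    # Entropy of a list of class labels.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, attr):
    # Expected reduction in entropy when the tuples are partitioned on attr.
    labels = [label for _, label in rows]
    before = entropy(labels)
    after = 0.0
    for v in set(r[attr] for r, _ in rows):
        subset = [label for r, label in rows if r[attr] == v]
        after += (len(subset) / len(rows)) * entropy(subset)
    return before - after

def build_tree(rows, attributes):
    labels = [label for _, label in rows]
    # Stop when the partition is pure (all tuples in the same class) or no attributes remain.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Choose the "best" splitting attribute for node N (highest information gain here).
    best = max(attributes, key=lambda a: info_gain(rows, a))
    node = {"attribute": best, "branches": {}}
    for v in set(r[best] for r, _ in rows):
        subset = [(r, label) for r, label in rows if r[best] == v]
        node["branches"][v] = build_tree(subset, [a for a in attributes if a != best])
    return node

# Hypothetical training tuples: (attribute values, class label).
data = [({"income": "high", "student": "no"}, "risky"),
        ({"income": "high", "student": "yes"}, "safe"),
        ({"income": "low", "student": "no"}, "risky"),
        ({"income": "low", "student": "yes"}, "safe")]
print(build_tree(data, ["income", "student"]))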

3. Explain briefly various measures associated with attribute


selection
Ans: There are different attribute selection measures. Some of
them are as follows;
1. Information Gain
2. Gain ratio
3. Gini index
Information gain is biased toward multivalued attributes.
Although the gain ratio adjusts for this bias, it tends to prefer
unbalanced splits in which one partition is much smaller than
the others.
The Gini index is biased toward multivalued attributes and has
difficulty when the number of classes is large. It also tends to
favor tests that result in equal-size partitions and purity in both
partitions. Although biased, these measures give reasonably
good results in practice.
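As a small illustration of how the three measures behave on one candidate split, here is a sketch assuming a toy two-class partition; the class counts are made up for the example.

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Hypothetical data set D with 9 positive and 5 negative tuples,
# split by some attribute A into two partitions.
parent = [9, 5]
partitions = [[6, 2], [3, 3]]
weights = [sum(p) / sum(parent) for p in partitions]

info_D = entropy(parent)
info_A = sum(w * entropy(p) for w, p in zip(weights, partitions))
gain = info_D - info_A                              # information gain of A

split_info = -sum(w * math.log2(w) for w in weights)
gain_ratio = gain / split_info                      # gain ratio normalizes gain by split information

gini_reduction = gini(parent) - sum(w * gini(p) for w, p in zip(weights, partitions))
print(gain, gain_ratio, gini_reduction)             # the attribute scoring highest is chosen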
4. Summarize how tree pruning works. What are some
enhancements to basic decision tree induction?
Ans: Pruning is a technique in machine learning that reduces
the size of decision trees by removing sections of the tree that
provide little power to classify instances. Pruning reduces the
complexity of the final classifier, and hence improves predictive
accuracy by the reduction of overfitting.
One of the questions that arises in a decision tree algorithm is
the optimal size of the final tree. A tree that is too large
risks overfitting the training data and poorly generalizing to new
samples. A small tree might not capture important structural
information about the sample space. However, it is hard to tell
when a tree algorithm should stop because it is impossible to
tell if the addition of a single extra node will dramatically
decrease error. This problem is known as the horizon effect. A
common strategy is to grow the tree until each node contains a
small number of instances then use pruning to remove nodes
that do not provide additional information.
Pruning should reduce the size of a learning tree without
reducing predictive accuracy as measured by a cross-
validation set. There are many techniques for tree pruning that
differ in the measurement that is used to optimize performance.
Pruning can occur in a top down or bottom up fashion. A top
down pruning will traverse nodes and trim subtrees starting at
the root, while a bottom up pruning will start at the leaf nodes.
Below are two popular pruning algorithms.
Reduced error pruning
One of the simplest forms of pruning is reduced error pruning.
Starting at the leaves, each node is replaced with its most
popular class. If the prediction accuracy is not affected then the
change is kept. While somewhat naive, reduced error pruning
has the advantage of simplicity and speed.
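A minimal sketch of reduced error pruning over a dictionary-based tree, assuming a separate validation set and that every node stores the majority ("most popular") class seen during training; the representation and names are illustrative.

# Tree nodes are dicts: internal nodes have "attribute", "branches" and "majority";
# a pruned node keeps only "majority" and acts as a leaf.
def classify(node, row):
    while "attribute" in node:
        node = node["branches"].get(row[node["attribute"]], {"majority": node["majority"]})
    return node["majority"]

def accuracy(tree, validation):
    return sum(classify(tree, r) == y for r, y in validation) / len(validation)

def reduced_error_prune(tree, validation, node=None):
    # Bottom-up: prune the children first, then try replacing this node with a leaf
    # labelled with its most popular class; keep the change only if accuracy does not drop.
    node = tree if node is None else node
    if "attribute" not in node:
        return
    for child in node["branches"].values():
        reduced_error_prune(tree, validation, child)
    before = accuracy(tree, validation)
    saved_attribute, saved_branches = node.pop("attribute"), node.pop("branches")
    if accuracy(tree, validation) < before:          # pruning hurt accuracy: undo it
        node["attribute"], node["branches"] = saved_attribute, saved_branches

A tree built as in the earlier induction sketch would only need each node to additionally record its majority class for this to apply.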
Cost complexity pruning
Cost complexity pruning (used in CART) trades off the tree's error against its size: a sequence of progressively smaller trees is generated by repeatedly removing the subtree whose removal gives the smallest increase in error per pruned leaf, and the tree with the lowest cost complexity, estimated on a separate validation set or by cross-validation, is selected.

Enhancement of decision tree induction:


https://slideplayer.com/slide/3429699/
5. Explain how scalable decision tree induction is?

Ans: Let us consider how scalable decision tree induction is
when the training set D is larger than the main memory. The
well-known decision tree algorithms such as ID3, C4.5 and CART
are well established for relatively small data sets.

When these algorithms are applied to the mining of very large


real-world databases, then the issue of efficiency will be raised.
In practice, data mining applications perform on huge datasets
with millions of tuples. So in this case the decision tree
induction becomes inefficient because of the swapping of
training tuples in and out of the main memory. More scalable
approaches are required in order to handle the large amount of
data to be fit in to main memory.

Currently in order to meet this requirement, there are few


algorithms which employ techniques of presorting on the disk-
resident data that are too large to fit in the memory by using
new data structures to facilitate the tree construction. Some
other algorithms use bootstrapping, which is a statistical
sampling technique. Hence, scalability is being improved.
6. Describe the working procedures of simple Bayesian
classifier?

Ans: Bayes’ Theorem finds the probability of an event occurring


given the probability of another event that has already
occurred. Bayes’ theorem is stated mathematically as the
following equation:

P(A|B) = P(B|A) · P(A) / P(B)

where A and B are events and P(B) ≠ 0.


 Basically, we are trying to find probability of event A, given
the event B is true. Event B is also termed as evidence.
 P(A) is the priori of A (the prior probability, i.e. Probability
of event before evidence is seen). The evidence is an
attribute value of an unknown instance(here, it is event B).
 P(A|B) is the posterior probability of A given B, i.e. the
probability of the event after the evidence is seen.
Now, with regard to our dataset, we can apply Bayes’ theorem
in the following way:

P(y|X) = P(X|y) · P(y) / P(X)

where y is the class variable and X = (x1, x2, …, xn) is a
dependent feature vector of size n.

Just to clear, an example of a feature vector and corresponding


class variable can be: (refer 1st row of dataset)
X = (Rainy, Hot, High, False)

y = No

So basically, P(X|y) here means, the probability of “Not playing


golf” given that the weather conditions are “Rainy outlook”,
“Temperature is hot”, “high humidity” and “no wind”.
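A minimal naive Bayesian classifier sketch for this weather example; the handful of training tuples below is hypothetical (the full play-golf table is not reproduced here), P(y|X) is compared only up to the common factor P(X), and no Laplace smoothing is applied.

from collections import Counter, defaultdict

# Hypothetical training tuples: (Outlook, Temperature, Humidity, Windy) -> PlayGolf
data = [(("Rainy", "Hot", "High", "False"), "No"),
        (("Rainy", "Hot", "High", "True"), "No"),
        (("Overcast", "Hot", "High", "False"), "Yes"),
        (("Sunny", "Mild", "High", "False"), "Yes"),
        (("Sunny", "Cool", "Normal", "False"), "Yes"),
        (("Sunny", "Cool", "Normal", "True"), "No"),
        (("Overcast", "Cool", "Normal", "True"), "Yes")]

class_counts = Counter(y for _, y in data)
cond_counts = defaultdict(int)                 # (feature index, value, class) -> count
for x, y in data:
    for i, v in enumerate(x):
        cond_counts[(i, v, y)] += 1

def score(x, y):
    # P(y) * product over features of P(x_i | y); proportional to P(y | X).
    s = class_counts[y] / len(data)
    for i, v in enumerate(x):
        s *= cond_counts[(i, v, y)] / class_counts[y]
    return s

x = ("Rainy", "Hot", "High", "False")
print({y: score(x, y) for y in class_counts})  # the class with the larger score is predicted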
7. Explain Bayesian Belief Networks?

Ans: • A BBN is a special type of diagram (called a directed


graph) together with an associated set of probability tables.

• The graph consists of nodes and arcs.

• The nodes represent variables, which can be discrete or


continuous.

• The arcs represent causal relationships between variables

Key features:

• BBNs enable us to model and reason about uncertainty.

• BBNs accommodate both subjective probabilities and


probabilities based on objective data.

• The most important use of BBNs is in revising probabilities in


the light of actual observations of events
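A minimal sketch of this idea with a hypothetical two-node network Rain → WetGrass and made-up probability tables; it revises P(Rain) after observing WetGrass = true, i.e. it updates a probability in the light of an observed event.

# Hypothetical prior and conditional probability tables for Rain -> WetGrass.
p_rain = {True: 0.2, False: 0.8}
p_wet_given_rain = {True: {True: 0.9, False: 0.1},    # P(WetGrass | Rain = True)
                    False: {True: 0.2, False: 0.8}}   # P(WetGrass | Rain = False)

def posterior_rain(wet_observed):
    # Bayes' rule: P(Rain | WetGrass) = P(WetGrass | Rain) P(Rain) / P(WetGrass)
    joint = {r: p_rain[r] * p_wet_given_rain[r][wet_observed] for r in (True, False)}
    evidence = sum(joint.values())
    return {r: joint[r] / evidence for r in joint}

print(posterior_rain(True))   # P(Rain = True | WetGrass = True) rises from 0.2 to about 0.53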
8. Discuss about k-nearest neighbor classifier and case-based
reasoning?
Ans: K-Nearest Neighbors is one of the most basic yet
essential classification algorithms in Machine Learning. It
belongs to the supervised learning domain and finds intense
application in pattern recognition, data mining and intrusion
detection.
It is widely disposable in real-life scenarios since it is non-
parametric, meaning, it does not make any underlying
assumptions about the distribution of data (as opposed to other
algorithms such as GMM, which assume a Gaussian
distribution of the given data).
We are given some prior data (also called training data), which
classifies coordinates into groups identified by an attribute.
Algorithm:
Let m be the number of training data samples. Let p be an
unknown point.
1. Store the training samples in an array of data points arr[].
This means each element of this array represents a tuple
(x, y).
2. for i=0 to m:
3. Calculate Euclidean distance d(arr[i], p).
4. Make set S of K smallest distances obtained. Each of these
distances correspond to an already classified data point.
5. Return the majority label among S.
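A direct Python translation of the steps above; the toy 2-D points and the value of k are illustrative.

import math
from collections import Counter

def knn_classify(training, p, k):
    # training: list of (point, label) pairs; p: the unknown point.
    distances = [(math.dist(x, p), label) for x, label in training]   # step 3: Euclidean distances
    distances.sort(key=lambda t: t[0])
    nearest = [label for _, label in distances[:k]]                   # step 4: set S of the K smallest
    return Counter(nearest).most_common(1)[0][0]                      # step 5: majority label among S

# Hypothetical training points in two classes.
training = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
            ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B")]
print(knn_classify(training, (2, 2), k=3))    # -> "A"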
Case-based reasoning (CBR), broadly construed, is the
process of solving new problems based on the solutions of
similar past problems. An auto mechanic who fixes
an engine by recalling another car that exhibited similar
symptoms is using case-based reasoning. A lawyer who
advocates a particular outcome in a trial based
on legal precedents or a judge who creates case law is
using case-based reasoning. So, too, an engineer copying
working elements of nature (practicing biomimicry), is
treating nature as a database of solutions to problems.
Case-based reasoning is a prominent type
of analogy solution making.

It has been argued that case-based reasoning is not only


a powerful method for computer reasoning, but also a
pervasive behavior in everyday human problem solving;
or, more radically, that all reasoning is based on past
cases personally experienced. This view is related
to prototype theory, which is most deeply explored
in cognitive science.
9. Explain about classifier accuracy? Explain the process of
measuring the accuracy of a classifier?
Ans: Accuracy is one metric for evaluating classification
models. Informally, accuracy is the fraction of predictions our
model got right. Formally, accuracy has the following definition:

Accuracy = Number of correct predictions / Total number of predictions
For binary classification, accuracy can also be calculated in
terms of positives and negatives as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP = True Positives, TN = True Negatives, FP = False
Positives, and FN = False Negatives.
For example, suppose a tumor classifier is evaluated on 100
examples and accuracy comes out to 0.91, or 91% (91 correct
predictions out of 100 total examples). That means our tumor
classifier is doing a great job of identifying malignancies, right?
Actually, let's do a closer analysis of positives and negatives to
gain more insight into our model's performance.
Of the 100 tumor examples, 91 are benign (90 TNs and 1 FP)
and 9 are malignant (1 TP and 8 FNs).
Of the 91 benign tumors, the model correctly identifies 90 as
benign. That's good. However, of the 9 malignant tumors, the
model only correctly identifies 1 as malignant—a terrible
outcome, as 8 out of 9 malignancies go undiagnosed!
While 91% accuracy may seem good at first glance, another
tumor-classifier model that always predicts benign would
achieve the exact same accuracy (91/100 correct predictions)
on our examples. In other words, our model is no better than
one that has zero predictive ability to distinguish malignant
tumors from benign tumors.
Accuracy alone doesn't tell the full story when you're working
with a class-imbalanced data set, like this one, where there is
a significant disparity between the number of positive and
negative labels.
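The arithmetic can be checked directly from the counts in the example above; recall is printed as one extra number to make the class imbalance visible.

TP, TN, FP, FN = 1, 90, 1, 8                   # counts from the tumor example above
accuracy = (TP + TN) / (TP + TN + FP + FN)
recall = TP / (TP + FN)                        # fraction of actual malignancies that are found
print(accuracy, recall)                        # 0.91, but recall is only about 0.11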
10. Describe how ideas from association rule mining can be
applied to classification?
Ans: Association rule mining can be applied to classification
through associative classification, in which the rules mined are
class association rules of the form condset ⇒ class-label. A
classifier is then built from a selected, pruned subset of these
rules; CBA (Classification Based on Associations), CMAR and
CPAR are methods based on this idea. Because such rules
examine combinations of several attributes with high confidence,
associative classification can overcome the limitation of decision
tree induction, which considers only one attribute at a time.
11. Explain about the major issues regarding classifications and
predictions?
Ans: The major issue is preparing the data for Classification and
Prediction. Preparing the data involves the following activities −
 Data Cleaning − Data cleaning involves removing the
noise and treatment of missing values. The noise is
removed by applying smoothing techniques and the
problem of missing values is solved by replacing a missing
value with the most commonly occurring value for that
attribute.
 Relevance Analysis − Database may also have the
irrelevant attributes. Correlation analysis is used to know
whether any two given attributes are related.
 Data Transformation and reduction − The data can be
transformed by any of the following methods.
o Normalization − The data is transformed using
normalization. Normalization involves scaling all
values for given attribute in order to make them fall
within a small specified range. Normalization is used
when in the learning step, the neural networks or the
methods involving measurements are used.
o Generalization − The data can also be transformed
by generalizing it to the higher concept. For this
purpose we can use the concept hierarchies.
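A small sketch of min-max normalization, one common way of scaling all values of an attribute into a small specified range; the age values are hypothetical.

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # Scale every value of the attribute into [new_min, new_max].
    old_min, old_max = min(values), max(values)
    return [(v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
            for v in values]

ages = [23, 35, 45, 52, 61]            # hypothetical attribute values
print(min_max_normalize(ages))         # every value now falls within [0, 1]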

12. Differentiate classification and prediction methods?

Ans: Here is the criteria for comparing the methods of


Classification and Prediction −
 Accuracy − Accuracy of a classifier refers to its ability to
predict the class label correctly, and the accuracy of a
predictor refers to how well a given predictor can guess the
value of the predicted attribute for new data.
 Speed − This refers to the computational cost in generating
and using the classifier or predictor.
 Robustness − It refers to the ability of classifier or
predictor to make correct predictions from given noisy
data.
 Scalability − Scalability refers to the ability to construct the
classifier or predictor efficiently, given a large amount of
data.
 Interpretability − It refers to the extent to which the classifier
or predictor is understood, i.e. the level of understanding and
insight it provides.
13. Explain briefly various measures associated with attribute
selection?
Ans: refer 3
14. Explain training of Bayesian belief networks?
Ans: refer 7
15. Explain how tree pruning useful in decision tree induction?
What is a drawback of using a separate set of tuples to
evaluate pruning?
Ans: refer part c 1
16. Explain: given a decision tree, you have the option of
(a) converting the decision tree to rules and then pruning the
resulting rules, or (b) pruning the Decision tree and then
converting the pruned tree to rules. What advantage does (a)
have over (b)?
Ans: refer part c 2
17. Compare the advantages and disadvantages of eager
classification (e.g.,
decision tree, Bayesian, neural network) versus lazy
classification (e.g., k- nearest neighbor, case- based
reasoning).
Ans: lazy learning advantages:
• Incremental (online) learning: The problem-solving ability is
increased with each newly presented case.
• Suitability for complex and incomplete problem domains: A
complex target function can be described as a collection of less
complex local approximations and unknown classes can be
learned.
• Suitability for simultaneous application to multiple problems:
Examples are simply stored and can be used for multiple
problem-solving purposes.
• Ease of maintenance: A lazy learner adapts automatically to
changes in the problem domain
Lazy learning disadvantages:
• Handling very large problem domains: This implies high
memory / storage requirements and time-consuming search for
similar examples.
• Handling highly dynamic problem domains: In CBR, this
involves continuous reorganisation of the case base, which
may introduce errors in the case base. Overall, the set
previously encountered examples may become outdated if a
sudden large shift in the problem domain occurs.
• Handling overly noisy data: Such data may result in storing
same problems numerous times because of the differences in
cases due to noise. In turn, this implies high memory / storage
requirements and time-consuming search for similar examples.
• Achieving fully automatic operation: Only for complete
problem domains a fully automatic operation of a lazy learner
can be expected. Otherwise, user feedback is needed for
situations for which the learner has no solution.
Advantages and disadvantages of eager classification:
Eager classification is much faster at classification time than lazy
classification, because it constructs a generalization model
before receiving any new tuples to classify. Classification
accuracy is generally good because weights can be assigned to
attributes during training. However, since this kind of
classification commits to a single hypothesis for the entire
instance space, it is less flexible, and training the model can
take a long time. Lazy classification, on the other hand,
effectively uses many local hypotheses, which can improve the
accuracy of classification, but the training tuples have to be
stored, which increases cost and requires efficient indexing
structures. No classifier is built until a tuple needs to be
classified, and irrelevant attributes in the data can make the
classification inefficient.

18. Write an algorithm for k-nearest-neighbor classification


given k and n, the number of attributes describing each tuple.
Ans: K nearest neighbors or KNN Algorithm is a simple algorithm
which uses the entire dataset in its training phase. Whenever a
prediction is required for an unseen data instance, it searches
through the entire training dataset for k-most similar instances
and the data with the most similar instance is finally returned as
the prediction.
kNN is often used in search applications where you are
looking for similar items, like find items similar to this one.
Algorithm suggests that if you’re similar to your neighbours,
then you are one of them. For example, if an apple looks more
similar to a peach, pear, and cherry (fruits) than to a monkey,
cat or rat (animals), then most likely the apple is a fruit.
Algorithm
Let m be the number of training data samples. Let p be an
unknown point.
1. Store the training samples in an array of data points arr[].
This means each element of this array represents a tuple
(x, y).
2. for i=0 to m:
3. Calculate Euclidean distance d(arr[i], p).
4. Make set S of K smallest distances obtained. Each of these
distances correspond to an already classified data point.
5. Return the majority label among S.
19. Describe each of the following clustering algorithms in
terms of the following criteria:
(i) shapes of clusters that can be determined;
(ii) input parameters that must be specified; and (iii) limitations.
(a)k-means (b)k-medoids
Ans: (i) Shapes of clusters:
The k-means algorithm is good at capturing the structure of
the data if the clusters have a spherical-like shape. It always
tries to construct a roughly spherical region around each
centroid, so the moment the clusters have complicated
geometric shapes, k-means does a poor job of clustering the
data.

Input parameters for k-means:


The algorithm inputs are the number of
clusters Κ and the data set. The data set is a
collection of features for each data point. The
algorithms starts with initial estimates for
the Κ centroids, which can either be randomly
generated or randomly selected from the data
set
Input parameters for k medoids:
 The input data contains an ID column and the other columns
are of integer, varchar, nvarchar, or double data type.
 The input data does not contain null values; the algorithm will
issue errors when encountering null values. The number of
clusters k must also be specified.
Limitations of k-means:
• Difficult to predict the number of clusters (K-Value)
• Initial seeds have a strong impact on the final results
• The order of the data has an impact on the final results
• Sensitive to scale: rescaling your datasets (normalization or
standardization) will completely change the results. While this
itself is not bad, not realizing that you have to pay extra attention
to scaling your data might be bad.
Limitations of k-medoids:
1) K-Medoids is more costly than the K-Means method because of
its time complexity.
2) It does not scale well for large datasets.
3) Results and total run time depend upon the initial partitions.

UNIT-V
PART-B

1. Discuss the various types of data in cluster analysis?


Ans: Clustering algorithms should be capable of being applied to
any kind of data, such as interval-scaled (numerical) data,
binary data, categorical (nominal) and ordinal data, and
mixtures of these types.

2. Explain the categories of major clustering methods?


Ans: Clustering methods can be classified into the following
categories −

 Partitioning Method
 Hierarchical Method

 Density-based Method

 Grid-Based Method

 Model-Based Method

 Constraint-based Method

Partitioning Method
Suppose we are given a database of ‘n’ objects and the
partitioning method constructs ‘k’ partitions of the data. Each
partition will represent a cluster and k ≤ n. It means that it will classify the
data into k groups, which satisfy the following requirements −
 Each group contains at least one object.
 Each object must belong to exactly one group.
Points to remember −
 For a given number of partitions (say k), the partitioning
method will create an initial partitioning.
 Then it uses the iterative relocation technique to improve
the partitioning by moving objects from one group to other.
Hierarchical Methods
This method creates a hierarchical decomposition of the given
set of data objects. We can classify hierarchical methods on the
basis of how the hierarchical decomposition is formed. There
are two approaches here −

 Agglomerative Approach
 Divisive Approach

Agglomerative Approach
This approach is also known as the bottom-up approach. In this,
we start with each object forming a separate group. It keeps on
merging the objects or groups that are close to one another. It
keeps on doing so until all of the groups are merged into one or
until the termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this,
we start with all of the objects in the same cluster. In the
continuous iteration, a cluster is split up into smaller clusters.
This is done until each object is in its own cluster or the
termination condition holds. This method is rigid, i.e., once a
merging or splitting is done, it can never be undone.
3. Write algorithms for k-means and k-medoids? Explain?
Ans: k-medoids:

The most common realisation of k-medoid clustering is


the Partitioning Around Medoids (PAM) algorithm and is as
follows:

1. Initialize: randomly select k of the n data points as the


medoids
2. Assignment step: Associate each data point to the
closest medoid.
3. Update step: For each medoid m and each data
point o associated to m swap m and o and compute the
total cost of the configuration (that is, the average
dissimilarity of o to all the data points associated to m).
Select the configuration (i.e., the swap) with the lowest
cost.
Repeat alternating steps 2 and 3 until there is no change in the
assignments.
K means:
The algorithm works as follows:
1. First we initialize k points, called means, randomly.
2. We categorize each item to its closest mean and we update
the mean’s coordinates, which are the averages of the
items categorized in that mean so far.
3. We repeat the process for a given number of iterations and
at the end, we have our clusters.
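A minimal k-means sketch following the three steps above; the toy points are illustrative. For PAM, the mean-update step would instead try medoid/non-medoid swaps and keep the lowest-cost configuration.

import math
import random

def k_means(points, k, iterations=20):
    means = random.sample(points, k)                   # step 1: initialize k means randomly
    for _ in range(iterations):                        # step 3: repeat for a given number of iterations
        clusters = [[] for _ in range(k)]
        for p in points:                               # step 2: assign each item to its closest mean
            i = min(range(k), key=lambda j: math.dist(p, means[j]))
            clusters[i].append(p)
        for j, cluster in enumerate(clusters):         # update each mean to the average of its items
            if cluster:
                means[j] = tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
    return means, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]   # hypothetical 2-D data
print(k_means(points, k=2))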
4. Describe the different types of hierarchical methods?
Ans: Hierarchical clustering algorithm is of two types:

 Agglomerative: This is a "bottom-up" approach: each


observation starts in its own cluster, and pairs of clusters are
merged as one moves up the hierarchy.
 Divisive: This is a "top-down" approach: all observations
start in one cluster, and splits are performed recursively as
one moves down the hierarchy.

i) Agglomerative Hierarchical clustering algorithm or AGNES


(agglomerative nesting) and

ii) Divisive Hierarchical clustering algorithm or DIANA (divisive


analysis).
Both these algorithms are exactly the reverse of each other. So
we will be covering the Agglomerative Hierarchical clustering
algorithm in detail.

Agglomerative Hierarchical clustering -This algorithm works


by grouping the data one by one on the basis of the nearest
distance measure over all the pairwise distances between the
data points. The distance between groups is then recalculated,
but which distance should be considered once the groups have
been formed? For this there are many available methods. Some
of them are:

1) single-nearest distance or single linkage.

2) complete-farthest distance or complete linkage.

3) average-average distance or average linkage.

4) centroid distance.

5) Ward's method - the sum of squared Euclidean distances is
minimized.

This way we go on grouping the data until one cluster is
formed. Now, on the basis of the dendrogram, we can decide
how many clusters should actually be present.
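A small AGNES-style example, assuming SciPy is available; the points, the single-linkage choice and the cut at two clusters are all illustrative.

from scipy.cluster.hierarchy import linkage, fcluster

points = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]   # hypothetical 2-D data
Z = linkage(points, method="single")            # single-nearest distance (single linkage)
# Other linkage choices: "complete", "average", "centroid", "ward".
print(fcluster(Z, t=2, criterion="maxclust"))   # cut the dendrogram into 2 clusters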

5. Demonstrate about the following hierarchical methods


a) BIRCH
b) Chameleon
Ans: BIRCH (Balanced Iterative Reducing and Clustering using
Hierarchies)

 It is a scalable clustering method.


 Designed for very large data sets
 Only one scan of data is necessary
 It is based on the notion of a CF (Clustering Feature) and a
CF tree.
 CF tree is a height balanced tree that stores the clustering
features for a hierarchical clustering.
 A cluster of data points is represented by a triple of
numbers (N, LS, SS), where
N = number of items in the sub-cluster
LS = linear sum of the points
SS = sum of the squares of the points
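A small sketch of the CF triple and its additivity, which is what lets BIRCH summarize sub-clusters and merge them without rescanning the data; the points are illustrative.

import math

def cf(points):
    # Clustering Feature of a set of d-dimensional points: (N, LS, SS).
    n = len(points)
    ls = [sum(p[i] for p in points) for i in range(len(points[0]))]   # linear sum
    ss = sum(x * x for p in points for x in p)                        # sum of squares
    return n, ls, ss

def merge(cf1, cf2):
    # CF entries are additive, so two sub-clusters can be merged from their summaries alone.
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, [a + b for a, b in zip(ls1, ls2)], ss1 + ss2

def centroid_and_radius(entry):
    n, ls, ss = entry
    centroid = [x / n for x in ls]
    radius = math.sqrt(max(ss / n - sum(c * c for c in centroid), 0.0))
    return centroid, radius

a = cf([(1, 1), (2, 2)])                       # hypothetical sub-clusters
b = cf([(3, 3)])
print(centroid_and_radius(merge(a, b)))        # centroid [2.0, 2.0], radius ≈ 1.15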
Chameleon:
Chameleon combines an initial graph partitioning of the data with
agglomerative hierarchical clustering and models clusters
dynamically. It works in the following steps:
Step1:

 Generate a KNN graph


 because it's local, it reduces influence of noise and outliers
 provides automatic adjustment for densities

Step2:
 use METIS: a graph partitioning algorithm
 get equally-sized groups of well-connected vertices
 this produces "sub-clusters" - something that is a part of true
clusters
Step3:

 recombine sub-clusters
 combine two clusters if
o they are relatively close
o they are relatively interconnected
 so they are merged only if the new cluster will be similar to
the original ones
 i.e. when "self-similarity" is preserved (similar to the join
operation in Scatter/Gather)

6. Explain about semi-supervised cluster analysis


Ans: In the real world, there is more unlabelled data than
labelled data, but there is still some labelled data. We want to
use all available information to build the most robust and best
performing model. Semi-supervised clustering helps us there.

Some examples of semi-supervised clustering can be news


category classification, as you see on Google News. There
may be some information about a news item being related to
‘politics’ or ‘sports’, but nobody can sift through hundreds of
thousands of items every day to create fully labelled data.
Similarly, image recognition uses a similar method, as you
may experience on Google Photos.

Generating labels may not be easy or cheap, and hence due


to limited resources, we may have labels for only a few
observations. For example, investigating fraud is expensive so
we may know about confirmed fraud or confirmed non-fraud
only for limited insurance claims. But not knowing doesn’t mean
that those claims cannot be fraud.

7. Explain about the outlier analysis


Ans: Outlier Analysis − Outliers may be defined as the data
objects that do not comply with the general behavior or model
of the data available.
 A database may contain data objects that do not comply
with the general behavior or
 model of the data. Such data objects, which are grossly
different from or inconsistent with the remaining set of
data, are called outliers.
 Most data mining methods discard outliers as noise or
exceptions.
 However, in some applications such as fraud detection,
the rare events can be more interesting than the more
regularly occurring ones.
 The analysis of outlier data is referred to as outlier mining.
 Outliers may be detected using statistical tests that
assume a distribution or probability model for the data, or
using distance measures where objects that are a
substantial distance from any other cluster are considered
outliers.
 Rather than using statistical or distance measures,
deviation-based methods identify outliers by examining
differences in the main characteristics of objects in a
group.
 Outliers can be caused by measurement or execution
error.
 Outliers may be the result of inherent data variability.
 Many data mining algorithms try to minimize the influence
of outliers or eliminate them all together.
 This, however, could result in the loss of important hidden
information because one person’s noise could be another
person’s signal.
 Thus, outlier detection and analysis is an interesting data
mining task, referred to as outlier mining.
 Approaches:
 1. Statistical
 2. Distance-based
 3. The density-based local outlier approach
 4. The deviation-based approach
8. Define the distance-based outlier? Illustrate the efficient
algorithms for mining distance-based outliers?

Ans: Distance-based outlier detection method consults the


neighbourhood of an object, which is defined by a given radius.
An object is then considered an outlier if its neighborhood does
not have enough other points.

A distance threshold can be defined as a reasonable
neighbourhood of the object. For each object o, we expect to
find a reasonable number of neighbours within that
neighbourhood; an object with too few such neighbours is an
outlier.

Formally, let r (r > 0) be a distance threshold and 𝜋 (0 < 𝜋 < 1)
be a fraction threshold. An object o is a DB(r, 𝜋)-outlier if

||{o' | dist(o, o') ≤ r}|| / ||D|| ≤ 𝜋

where dist is the distance measure and D is the data set.

The straightforward (nested-loop) approach takes O(n²) time.

Algorithms for mining distance-based outliers:

 Index-based algorithm
 Nested-loop algorithm
 Cell-based algorithm
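A minimal nested-loop sketch of the DB(r, 𝜋) test above; the points, r and 𝜋 are illustrative, and the object itself is not counted as its own neighbour here. The index-based and cell-based algorithms accelerate the same test.

import math

def db_outliers(data, r, pi):
    # o is reported as a DB(r, pi)-outlier if at most a fraction pi of the
    # other objects lie within distance r of o.
    outliers = []
    for i, o in enumerate(data):
        neighbours = sum(1 for j, o2 in enumerate(data) if j != i and math.dist(o, o2) <= r)
        if neighbours / len(data) <= pi:
            outliers.append(o)
    return outliers

data = [(1, 1), (1, 2), (2, 1), (2, 2), (9, 9)]   # hypothetical points; (9, 9) is isolated
print(db_outliers(data, r=2.0, pi=0.1))           # -> [(9, 9)]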

9. Explain about the Statistical-based outlier detection?

Ans: Statistical methods (also known as model-based methods)


assume that the normal data follow some statistical model (a
stochastic model). The idea is to learn a generative model fitting
the given data set, and then identify the objects in low
probability regions of the model as outliers.

Parametric methods
A parametric method assumes that the normal data objects are
generated by a parametric distribution with parameter 𝜃. The
probability density function of the parametric
distribution f(x,𝜃) gives a probability that object x is generated
by the distribution. The smaller this value, the more likely x is an
outlier.

Normal objects occurs in region of high probability for the


stochastic model and objects in the region of low probability are
outliers.

The effectiveness of statistical methods highly depends on the


assumption for the distribution.

For example, consider the univariate data with a normal


distribution. We will detect the outlier using methods of
maximum likelihood.

Maximum likelihood methods can be used to estimate the
parameters 𝜇 and 𝜎. Taking the derivatives of the log-likelihood
with respect to 𝜇 and 𝜎² and solving the resulting system of
first-order conditions leads to the following maximum likelihood
estimates:

𝜇 = (1/n) Σ xi,   𝜎² = (1/n) Σ (xi − 𝜇)²
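A small sketch of the univariate Gaussian case: the parameters are estimated by maximum likelihood and values far out in the low-probability tail are flagged; the measurements and the 2σ cut-off are illustrative (the cut-off is a modeling choice).

import math

data = [24.1, 25.0, 24.7, 25.3, 24.9, 25.1, 24.8, 39.0]        # hypothetical measurements

n = len(data)
mu = sum(data) / n                                             # MLE of the mean
sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)        # MLE of the standard deviation

# Values far from the mean fall in a low-probability region of the fitted model.
outliers = [x for x in data if abs(x - mu) > 2 * sigma]
print(mu, sigma, outliers)                                     # only 39.0 is flagged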
10. Describe about the distance-based outlier detection?
Ans: refer 8
11. Discuss about the density-based outlier detection?
Ans: Density-based outlier detection method investigates the
density of an object and that of its neighbors. Here, an object is
identified as an outlier if its density is relatively much lower than
that of its neighbors.

Many real-world data sets demonstrate a more complex


structure, where objects may be considered outliers with
respect to their local neighborhoods, rather than with respect to
the global data distribution.

The idea of density-based is that we need to compare the


density around an object with the density around its local
neighbors. The basic assumption of density-based outlier
detection methods is that the density around a nonoutlier object
is similar to the density around its neighbors, while the density
around an outlier object is significantly different from the density
around its neighbors.

dist_k(o) is the distance between object o and its k-th nearest
neighbor. The k-distance neighborhood of o, N_k(o), contains all
objects whose distance to o is not greater than dist_k(o), the
k-th distance of o.

We can use the average distance from the objects in Nk(o)


to o as the measure of the local density of o. If o has very close
neighbors o’ such that dist(o,o’) is very small, the statistical
fluctuations of the distance measure can be undesirably high.
To overcome this problem, we can switch to the following
reachability distance measure, which adds a smoothing effect:

reachdist_k(o ← o') = max{dist_k(o), dist(o, o')}

k is a user-specified parameter that controls the smoothing


effect. Essentially, k specifies the minimum neighborhood to be
examined to determine the local density of an object.
Reachability distance is not symmetric.

The local reachability density of an object o is

lrd_k(o) = ||N_k(o)|| / Σ_{o' ∈ N_k(o)} reachdist_k(o' ← o)
We calculate the local reachability density for an object and


compare it with that of its neighbors to quantify the degree to
which the object is considered an outlier.
The local outlier factor is the average of the ratios of the local
reachability densities of o’s k-nearest neighbours to the local
reachability density of o:

LOF_k(o) = ( Σ_{o' ∈ N_k(o)} lrd_k(o') / lrd_k(o) ) / ||N_k(o)||
The lower local reachability density of o and the higher the local
reachability densities of the k-nearest neighbors of o, the higher
the LOF value is. This exactly captures a local outlier of which
the local density is relatively low compared to the local densities
of its k-nearest neighbors.
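A minimal LOF example, assuming scikit-learn is available; the points and n_neighbors (the k above) are illustrative.

from sklearn.neighbors import LocalOutlierFactor

X = [[1, 1], [1, 2], [2, 1], [2, 2], [9, 9]]     # hypothetical points; [9, 9] is locally sparse
lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(X)                      # -1 marks outliers, 1 marks inliers
print(labels)                                    # e.g. [ 1  1  1  1 -1]
print(-lof.negative_outlier_factor_)             # the LOF scores themselves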
12. Demonstrate about the deviation-based outlier detection
techniques?

Ans:
• Identifies outliers by examining the main characteristics of
objects in a group
• Objects that “deviate” from this description are considered
outliers
• sequential exception technique – simulates the way in which
humans can distinguish unusual objects from among a series of
supposedly like objects
• OLAP data cube technique – uses data cubes to identify
regions of anomalies in large multidimensional data
Outlier (also called deviation or exception) detection is an
important function in data mining. In identifying outliers, the
deviation-based approach has many advantages and draws
much attention. Although a linear algorithm for sequential
deviation detection is proposed, it is not stable and always
loses many deviation points.
One proposed approach presents three algorithms for detecting
deviations. The first algorithm's running time is proportional to
the square of the dataset length, and the second's is
proportional to the square of the number of distinct data values.
These two algorithms lead to the same result, while the latter is
much more efficient than the former. In the third algorithm, a
deviation factor is defined to help find deviation points. Although
it leads to approximate results, it is the most efficient of the
three, especially for large datasets with many distinct values.
13. Demonstrate about the BIRCH hierarchical methods?
Ans: refer 5(a)
14. Standardize the variable by the following:
(a) Compute the mean absolute deviation of age.
(b) Compute the z-score for the first four measurements.
Ans: no data given
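Since the question's data is not given, the sketch below uses hypothetical age values just to show the computations: the mean absolute deviation s = (1/n) Σ |x_i − mean| and the z-score z_i = (x_i − mean) / s.

ages = [23, 35, 45, 52]                                        # hypothetical values only

mean_age = sum(ages) / len(ages)
mad = sum(abs(x - mean_age) for x in ages) / len(ages)         # (a) mean absolute deviation
z_scores = [(x - mean_age) / mad for x in ages]                # (b) z-scores of the measurements
print(mean_age, mad, z_scores)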
15. Illustrate the strength and weakness of k-means in
comparison with the k- medoids algorithm. Also, illustrate the
strength and weakness of these schemes in comparison with a
hierarchical clustering scheme (such as AGNES).

Ans: K-means is relatively efficient: its time complexity is
O(nkt), where n is the number of objects, k the number of
clusters and t the number of iterations, so it scales to fairly large
data sets. Its weaknesses are that the mean must be defined
(so it does not directly handle categorical data), the number of
clusters k must be specified in advance, and it is sensitive to
noise and outliers, since a single extreme value can pull a mean
far away.

K-medoids uses actual objects (medoids) as cluster
representatives, which makes it more robust to noise and
outliers than k-means. Its weakness is cost: each iteration
considers swapping medoids with non-medoids, so it is much
more expensive, does not scale well to large data sets, and its
results depend on the initial selection of medoids.

In Hierarchical clustering, clusters have a tree like structure or a


parent child relationship. Here, the two most similar clusters are
combined together and continue to combine until all objects are
in the same cluster.

K-means produces clusters, i.e. collections of objects that are
“similar” to each other and “dissimilar” to the objects belonging
to other clusters. It is a flat division of objects into clusters such
that each object is in exactly one cluster, not several.

16. Explain why is outlier mining important? Briefly describe the


different approaches behind statistical-based outlier detection,
distanced-based outlier detection, density-based local outlier
detection, and deviation-based outlier detection.
Ans :refer 8,9,10,11,12
17. Explain briefly mining of multimedia databases and time
series databases
Ans: A multimedia database system stores and manages a
large collection of multimedia objects, such as audio data,
image data, video data, sequence data, and hypertext data,
which contain text, text markups and linkages. Multimedia
database systems are increasingly common owing to the
popular use of audio-video equipment, CD-ROMs and the
Internet. Typical multimedia database systems include NASA's
EOS (Earth Observation System), various kinds of image and
audio-video databases, human genome databases, and
Internet databases
we consider two main families of multimedia indexing and
retrieval systems:
(1) description-based retrieval systems, which build indices and
perform object retrieval based on image descriptions, such as
keywords, captions, size, and time of creation; and
(2) Content-based retrieval systems, which support retrieval
based on the image content, such as color histogram, texture,
shape, objects, and wavelet transforms. Description-based
retrieval is labor-intensive if performed manually. If automated,
the results are typically of poor quality; for example, the
assignment of keywords to images can be a tricky and arbitrary
task. Content-based retrieval uses visual features to index
images and promotes object retrieval based on feature
similarity, which is highly desirable in many applications.

Time series database:


A time series database allows users to create, enumerate,
update and destroy various time series and organize them. The
server often supports a number of basic calculations that work
on a series as a whole, such as multiplying, adding, or
otherwise combining various time series into a new time
series. They can also filter on arbitrary patterns such
as time ranges, low value filters, high value filters, or even have
the values of one series filter another. Some TSDBs
also build in additional statistical functions that are targeted to
time series data.
For example, for the following expression:

select gold_price * gold_volume

the TSDB would join the two series 'gold_price' and


'gold_volume' based on the overlapping areas of time for each,
multiply the values where they intersect, and then output a
single composite time series.
TSDBs often allow users to manage a repository of filters or
masks that specify in some way a pattern. In this way, one can
readily assemble time series data. Assuming such a filter
exists, one might hypothetically write

select onpeak( cellphoneusage )

which would extract out the time series of cellphoneusage that


only intersects that of 'onpeak'.
This syntactical simplicity drives the appeal of the TSDB. For
example, a simple utility bill might be implemented using a
query such as:

select max( onpeak( powerusagekw ) ) * demand_charge;

select sum( onpeak( powerusagekwh ) ) * energy_charge;
