
Assignment 3

1) What are the different ways of handling Categorical and Continuous
Attributes?
Categorical variables are also known as discrete or qualitative variables.
Different ways of handling Categorical attributes are:
• Nominal: Nominal variables are variables that have two or more categories, but
which do not have an intrinsic order. The different categories of a nominal variable
can also be referred to as groups or levels of the nominal variable.
• Dichotomous variables: Dichotomous variables are categorical variables with two
categories or levels. Levels are different groups within the same independent
variable.
• Ordinal variables: Ordinal variables are variables that have two or more categories,
just like nominal variables, except that the categories can also be ordered or ranked.
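As a minimal sketch of how these categorical types are typically handled in practice (with hypothetical example data): nominal attributes are one-hot encoded because their categories have no order, while ordinal attributes are mapped to integers that preserve their ranking.

```python
# Nominal attribute: one-hot encode, since categories have no intrinsic order.
colors = ["red", "green", "blue", "green"]          # hypothetical data
categories = sorted(set(colors))                    # ['blue', 'green', 'red']
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]

# Ordinal attribute: map categories to integers that preserve their order.
sizes = ["small", "large", "medium", "small"]       # hypothetical data
order = {"small": 0, "medium": 1, "large": 2}
encoded_sizes = [order[s] for s in sizes]           # [0, 2, 1, 0]
```

Dichotomous variables are a special case: with only two levels, a single 0/1 column suffices.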
Different ways of handling Continuous attributes are:
• Interval: Interval variables are variables for which their central characteristic
is that they can be measured along a continuum and they have a numerical
value.
• Ratio: Ratio variables are a type of interval variable. The name "ratio" reflects the
fact that you can take the ratio of measurements. A ratio variable has all the
properties of an interval variable and also has a clear definition of 0.0. When
the variable equals 0.0, there is none of that variable.
• Discretize once at the beginning
• Binary Decision: (A < v) or (A ≥ v)
o If A is a continuous attribute: consider all possible splits and find the best
cut. This can be computationally intensive.
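The binary-decision approach above can be sketched as follows, assuming a hypothetical two-class data set: every midpoint between consecutive distinct values is tried as a cut v, and the cut with the lowest weighted entropy of (A < v) and (A ≥ v) wins.

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_split(values, labels):
    """Try every midpoint between sorted distinct values as a cut v,
    splitting records into (A < v) and (A >= v); return the cut with
    the lowest weighted entropy."""
    pairs = list(zip(values, labels))
    xs = sorted(set(values))
    best_v, best_e = None, float("inf")
    for lo, hi in zip(xs, xs[1:]):
        v = (lo + hi) / 2
        left = [y for x, y in pairs if x < v]
        right = [y for x, y in pairs if x >= v]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if e < best_e:
            best_v, best_e = v, e
    return best_v

# Hypothetical data: the best cut separates the two classes perfectly.
ages = [22, 25, 30, 35, 40, 45]
buys = ["no", "no", "no", "yes", "yes", "yes"]
cut = best_split(ages, buys)   # midpoint between 30 and 35
```

Trying all candidate cuts is what makes this more expensive than handling an attribute that is discretized once at the beginning.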
2) Explain concept hierarchy
Concept hierarchies reduce the data by collecting and replacing low-level concepts
(such as numeric values for the attribute age) with higher-level concepts. Concept
hierarchies can be used in the processing of many data mining tasks.
Different type of concept hierarchies:
• Schema hierarchy: A schema hierarchy is a total or partial order among the attributes
in the database schema. This hierarchy may formally express semantic relations
between attributes.
• Set-grouping hierarchy: It organizes the values of a given attribute into groups,
sets, or ranges of values. A total or partial order can be defined among the groups. It
is also used to refine or enrich schema-defined hierarchies.
• Operation-derived hierarchy: An operation-derived hierarchy is based on
operations specified by a user, by experts, or by the mining system. Operations can
include the decoding of information-encoded strings, the extraction of information
from complex data objects, and data clustering.
• Rule-based hierarchy: In a rule-based hierarchy, either the whole concept hierarchy
or a portion of it is defined by a set of rules and is evaluated dynamically based on
the current data and the rule definitions.
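A set-grouping hierarchy can be sketched as a simple lookup that replaces a numeric value with the higher-level group whose range contains it (the age ranges and group names here are hypothetical):

```python
# Hypothetical set-grouping hierarchy over the attribute "age":
# each half-open range [lo, hi) maps to a higher-level concept.
hierarchy = [
    ((0, 18), "minor"),
    ((18, 65), "adult"),
    ((65, 150), "senior"),
]

def generalize(age):
    """Replace a low-level numeric age with its higher-level concept."""
    for (lo, hi), concept in hierarchy:
        if lo <= age < hi:
            return concept
    return "unknown"

ages = [12, 30, 70]
concepts = [generalize(a) for a in ages]   # ['minor', 'adult', 'senior']
```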
Concept hierarchy generation for numeric data:
• Binning: In binning, first sort the data and partition it into bins; then one can smooth
by bin means, by bin medians, or by bin boundaries.
• Histogram analysis: The histogram is a popular data reduction technique. It divides
the data into buckets and stores the average for each bucket. Histograms can also be
constructed optimally in one dimension using dynamic programming.
• Clustering analysis: Partition the data set into clusters, and store only the cluster
representations. Clustering can be hierarchical, and the clusters can be stored in
multi-dimensional index tree structures.
• Entropy-based discretization: Entropy-based discretization is a supervised, top
down splitting technique. It explores class distribution information in its calculation
and determination of split-points.
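The binning step above can be sketched with a small hypothetical data set: the sorted values are partitioned into equal-depth bins, and each value is then replaced by its bin mean (smoothing by bin means).

```python
# Hypothetical data, sorted for equal-depth (equal-frequency) binning.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
depth = 3
bins = [data[i:i + depth] for i in range(0, len(data), depth)]
# bins: [4, 8, 15], [21, 21, 24], [25, 28, 34]

# Smoothing by bin means: every value is replaced by its bin's mean.
smoothed = []
for b in bins:
    mean = sum(b) / len(b)
    smoothed.extend([mean] * len(b))
# bin means: 9.0, 22.0, 29.0
```

Smoothing by bin medians or bin boundaries follows the same pattern, replacing each value with the bin median or the nearer of the two bin boundaries instead.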
3) What is an infrequent pattern and how do you identify one?
Often when considering data mining, the focus is on frequent patterns. Although
the majority of the most interesting patterns lie within the frequent ones, there are
important patterns that are ignored by this approach. These are called infrequent
patterns. Infrequent pattern mining discovers itemsets whose frequency of
occurrence in the analyzed data is less than or equal to a maximum threshold. An infrequent
pattern is an itemset or a rule whose support is less than the minsup threshold.
The key issues in mining infrequent patterns are:
• How to identify interesting infrequent patterns, and
• How to efficiently discover them in large data sets.

Techniques for Mining Interesting Infrequent Patterns are:

• Negative Patterns: Transaction data can be binarized by augmenting it with negative items.
By applying an existing frequent itemset generation algorithm such as Apriori to the
augmented transactions, all the negative itemsets can be derived. Such an approach is
feasible only if a few variables are treated as symmetric binary.
• Support Expectation: Another class of techniques considers an infrequent pattern to be
interesting only if its actual support is considerably smaller than its expected support. For
negatively correlated patterns, the expected support is computed based on the statistical
independence assumption.
• Support Expectation Based on Concept Hierarchy: Objective measures alone may not
be sufficient to eliminate uninteresting infrequent patterns. A subjective approach for
determining expected support is needed to avoid generating such infrequent patterns.
• Support Expectation Based on Indirect Association: Indirect associations can be
generated in the following way. First, the set of frequent itemsets is generated using a
standard algorithm such as Apriori or FP-growth. Each pair of frequent k-itemsets is then
merged to obtain a candidate indirect association. Once the candidates have been
generated, it is necessary to verify that they satisfy the itempair support condition.
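The negative-pattern technique above can be sketched on a hypothetical market-basket data set: each transaction is augmented with a negated item ("!x", meaning x is absent) for every item it does not contain, and support is then counted over the augmented transactions.

```python
from itertools import combinations

# Hypothetical transactions and item universe.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers"},
    {"milk", "diapers"},
    {"bread", "milk", "diapers"},
]
items = {"bread", "milk", "diapers"}

# Augment each transaction with negative items for everything absent.
augmented = [t | {"!" + i for i in items - t} for t in transactions]

# Count support of all 2-itemsets over the augmented transactions;
# pairs containing a "!" item are the negative patterns.
support = {}
for t in augmented:
    for pair in combinations(sorted(t), 2):
        support[pair] = support.get(pair, 0) + 1

# e.g. support of {bread, !milk}: baskets with bread but without milk.
neg_support = support[("!milk", "bread")]
```

Because every transaction gains a negated item for each absent item, this blows up quickly, which is why the approach is feasible only when few variables are treated as symmetric binary.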
