Mining Class Comparisons and Mining Descriptive Statistical Measures


MINING CLASS COMPARISONS AND MINING DESCRIPTIVE STATISTICAL MEASURES

Neha Sharma
ME (I.T.), 3rd Sem
Roll No. 463
MINING CLASS COMPARISONS: DISCRIMINATING BETWEEN DIFFERENT CLASSES

• In many applications, users may not be interested in having a
single class (or concept) described or characterized, but would
rather mine a description that compares or distinguishes one
class (or concept) from other comparable classes (or concepts).
• Class discrimination or comparison (hereafter referred to as
class comparison) mines descriptions that distinguish a target
class from its contrasting classes.
• Notice that the target and contrasting classes must be
comparable in the sense that they share similar dimensions
and attributes.
• For example, the three classes person, address, and item are
not comparable. However, the sales in the last three years are
comparable classes, and so are computer science students
versus physics students.
• The attribute generalization process described for class
characterization can be modified so that generalization is
performed synchronously among all the classes compared.
• This allows the attributes in all of the classes to be generalized
to the same level of abstraction.
EXAMPLE
• Suppose we are given the AllElectronics data for sales in 2003
and 2004 and we would like to compare these two classes.
Consider the dimension location, with abstractions at the city,
province_or_state, and country levels. Each class of data should
be generalized to the same location level. That is, all classes
are synchronously generalized to either the city level, the
province_or_state level, or the country level.
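The synchronous generalization in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the city names, the two mapping tables, and the helper names `generalize_location` and `generalize_classes` are hypothetical and do not come from the AllElectronics data.

```python
from collections import Counter

# Hypothetical concept hierarchy: city -> province_or_state -> country.
CITY_TO_PROVINCE = {
    "Vancouver": "British Columbia",
    "Toronto": "Ontario",
    "Chicago": "Illinois",
}
PROVINCE_TO_COUNTRY = {
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "Illinois": "USA",
}

def generalize_location(city, level):
    """Climb the location hierarchy from a city to the requested level."""
    if level == "city":
        return city
    province = CITY_TO_PROVINCE[city]
    if level == "province_or_state":
        return province
    if level == "country":
        return PROVINCE_TO_COUNTRY[province]
    raise ValueError(f"unknown level: {level}")

def generalize_classes(classes, level):
    """Generalize every class to the SAME level and count tuples per value."""
    return {
        name: Counter(generalize_location(city, level) for city in cities)
        for name, cities in classes.items()
    }

# Made-up sales locations for the two classes being compared.
sales = {
    "sales_2003": ["Vancouver", "Toronto", "Chicago"],
    "sales_2004": ["Vancouver", "Vancouver", "Chicago"],
}
by_country = generalize_classes(sales, "country")
```

Because a single `level` argument is applied to every class, the two classes cannot end up at different abstraction levels, which is the point of generalizing them synchronously.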
HOW IS CLASS COMPARISON PERFORMED?

In general, the procedure is as follows:

1. Data collection: The set of relevant data in the database is collected by
query processing and is partitioned into a target class and one or a set of
contrasting class(es).
2. Data relevance analysis: If there are many dimensions, then dimension
relevance analysis should be performed on these classes to select only
the highly relevant dimensions for further analysis. Correlation or
entropy-based measures can be used for this step.
3. Synchronous generalization: Generalization is performed on the target
class to the level controlled by a user- or expert-specified dimension
threshold, which results in a prime target class relation. The concepts in
the contrasting class(es) are generalized to the same level as those in the
prime target class relation, forming the prime contrasting class(es) relation.
4. Presentation of the derived comparison: The resulting class
comparison description can be visualized in the form of
tables, graphs, and rules. This presentation usually includes a
"contrasting" measure such as count% that reflects the
comparison between the target and contrasting classes. The
user can adjust the comparison description by applying
drill-down, roll-up, and other OLAP operations to the target and
contrasting classes as desired.
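A minimal sketch of steps 1, 3, and 4 follows (step 2, relevance analysis, is skipped). Everything in it is an assumption made for illustration: the four student tuples, the `MAJOR_TO_FIELD` mapping, and the age and GPA thresholds are hypothetical, not values taken from a real database.

```python
from collections import Counter

# Hypothetical student tuples: (status, major, age, gpa).
students = [
    ("graduate", "CS", 24, 3.67),
    ("graduate", "Physics", 27, 3.83),
    ("undergraduate", "Chemistry", 19, 2.96),
    ("undergraduate", "Biology", 19, 3.52),
]

# Hypothetical concept hierarchy for the major attribute.
MAJOR_TO_FIELD = {"CS": "Science", "Physics": "Science",
                  "Chemistry": "Science", "Biology": "Science"}

def generalize(major, age, gpa):
    """Map raw values to higher-level concepts (thresholds are illustrative)."""
    if age <= 20:
        age_range = "16-20"
    elif age <= 25:
        age_range = "21-25"
    elif age <= 30:
        age_range = "26-30"
    else:
        age_range = "Over 30"
    grade = "Excellent" if gpa >= 3.75 else "Good" if gpa >= 3.0 else "Fair"
    return (MAJOR_TO_FIELD[major], age_range, grade)

def prime_relation(tuples):
    """Merge identical generalized tuples and report count% within the class."""
    counts = Counter(generalize(m, a, g) for _, m, a, g in tuples)
    total = sum(counts.values())
    return {t: 100.0 * c / total for t, c in counts.items()}

# Step 1: partition into target and contrasting classes.
target = [s for s in students if s[0] == "graduate"]
contrast = [s for s in students if s[0] == "undergraduate"]

# Step 3: the same generalize() is applied to both classes.
target_rel = prime_relation(target)
contrast_rel = prime_relation(contrast)

# Step 4: present the comparison as rows with a count% column.
for name, rel in (("graduate", target_rel), ("undergraduate", contrast_rel)):
    for row, pct in rel.items():
        print(name, row, f"{pct:.2f}%")
```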
EXAMPLE
• Task
– Compare graduate and undergraduate students using a discriminant rule.
– DMQL query

use Big_University_DB
mine comparison as "grad_vs_undergrad_students"
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
for "graduate_students"
where status in "graduate"
versus "undergraduate_students"
where status in "undergraduate"
analyze count%
from student
Name            Gender  Major    Birth_place            Birth_date  Residence                 Phone#    GPA
Jim Woodman     M       CS       Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
Scott Lachance  M       CS       Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
Laura Lee       F       Physics  Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
…               …       …        …                      …           …                         …         …

Initial working relation: the target class (graduate students)


Name          Gender  Major      Birth_place           Birth_date  Residence                    Phone#    GPA
Bob Schumann  M       Chemistry  Calgary, Alt, Canada  10-1-78     2642 Halifax St., Burnaby    294-4291  2.96
Amy Eau       F       Bio        Golden, BC, Canada    30-3-76     463 Sunset Cres., Vancouver  681-5417  3.52

Initial working relation: the contrasting class (undergraduate students)


HOW CAN CLASS COMPARISON DESCRIPTIONS BE PRESENTED?

Prime generalized relation for the target class (graduate students)

major     age_range  gpa        count%
Science   21-25      Good       5.53%
Science   26-30      Good       5.02%
Science   Over 30    Very good  5.86%
…         …          …          …
Business  Over 30    Excellent  4.68%

Prime generalized relation for the contrasting class (undergraduate students)

major     age_range  gpa        count%
Science   16-20      Fair       5.53%
Science   16-20      Good       4.53%
…         …          …          …
Science   26-30      Good       2.32%
…         …          …          …
Business  Over 30    Excellent  0.68%
QUANTITATIVE DISCRIMINANT RULE
• C_j = target class
• q_a = a generalized tuple that covers some tuples of the target class
– but can also cover some tuples of the contrasting classes
• d-weight
– range: [0, 1]
– defined over all m classes (the target class and the contrasting classes) as

$$ d\text{-weight} = \frac{\operatorname{count}(q_a \in C_j)}{\sum_{i=1}^{m} \operatorname{count}(q_a \in C_i)} $$

• Quantitative discriminant rule form

$$ \forall X,\ \text{target\_class}(X) \Leftarrow \text{condition}(X) \quad [d : d\_weight] $$
Example: Quantitative Discriminant Rule

Status         Birth_country  Age_range  Gpa   Count
Graduate       Canada         25-30      Good  90
Undergraduate  Canada         25-30      Good  210

Count distribution between graduate and undergraduate students for a generalized tuple

• Quantitative discriminant rule

$$ \forall X,\ \text{graduate\_student}(X) \Leftarrow \text{birth\_country}(X) = \text{"Canada"} \land \text{age\_range}(X) = \text{"25-30"} \land \text{gpa}(X) = \text{"good"} \quad [d : 30\%] $$

– where 90/(90+210) = 30%
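The arithmetic behind this d-weight can be checked with a short Python sketch. The function name `d_weight` and the dictionary layout are assumptions chosen for illustration; the counts are the ones from the table above.

```python
def d_weight(counts, target):
    """Share of the tuples covered by a generalized tuple that fall in the target class."""
    total = sum(counts.values())
    return counts[target] / total

# Counts of the generalized tuple (Canada, 25-30, Good) in each class.
counts = {"graduate": 90, "undergraduate": 210}
w = d_weight(counts, "graduate")
print(f"d-weight for graduate students: {w:.0%}")
```

A d-weight close to 1 would mean the generalized tuple occurs almost only in the target class; here most matching students are undergraduates, so the tuple discriminates weakly in favor of graduate students.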

12/7/21 Data Mining: Concepts and Techniques 11


Class Description
• Quantitative characteristic rule

$$ \forall X,\ \text{target\_class}(X) \Rightarrow \text{condition}(X) \quad [t : t\_weight] $$

– necessary
• Quantitative discriminant rule

$$ \forall X,\ \text{target\_class}(X) \Leftarrow \text{condition}(X) \quad [d : d\_weight] $$

– sufficient
• Quantitative description rule

$$ \forall X,\ \text{target\_class}(X) \Leftrightarrow \text{condition}_1(X)\ [t : w_1, d : w'_1] \lor \cdots \lor \text{condition}_n(X)\ [t : w_n, d : w'_n] $$

– necessary and sufficient
Example: Quantitative Description Rule

Location/item  TV                     Computer               Both_items
               Count  t-wt    d-wt    Count  t-wt    d-wt    Count  t-wt  d-wt
Europe         80     25%     40%     240    75%     30%     320    100%  32%
N_Am           120    17.65%  60%     560    82.35%  70%     680    100%  68%
Both_regions   200    20%     100%    800    80%     100%    1000   100%  100%

Crosstab showing associated t-weight, d-weight values and total number (in thousands) of TVs and
computers sold at AllElectronics in 1998

• Quantitative description rule for target class Europe

$$ \forall X,\ \text{Europe}(X) \Leftrightarrow (\text{item}(X) = \text{"TV"})\ [t : 25\%, d : 40\%] \lor (\text{item}(X) = \text{"computer"})\ [t : 75\%, d : 30\%] $$
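The weights in the crosstab can be recomputed from the raw counts alone. The sketch below assumes a nested-dictionary layout and the helper names `t_weight` and `d_weight`, both chosen only for illustration: the t-weight divides a cell by its row total (within the target class), and the d-weight divides it by its column total (across classes).

```python
# Counts (in thousands) from the crosstab: region -> item -> count.
counts = {
    "Europe": {"TV": 80, "computer": 240},
    "N_Am": {"TV": 120, "computer": 560},
}

def t_weight(counts, target, item):
    """Fraction of the target class's tuples that have this item (row share)."""
    return counts[target][item] / sum(counts[target].values())

def d_weight(counts, target, item):
    """Fraction of this item's tuples that belong to the target class (column share)."""
    return counts[target][item] / sum(row[item] for row in counts.values())

for item in ("TV", "computer"):
    t = t_weight(counts, "Europe", item)
    d = d_weight(counts, "Europe", item)
    print(f'item(X) = "{item}"  [t : {t:.0%}, d : {d:.0%}]')
```

The two printed lines correspond to the two disjuncts of the description rule for Europe.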



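Measuring the Central Tendency
• Mean

$$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i $$

– Weighted arithmetic mean

$$ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} $$

• Median: A holistic measure
– Middle value if odd number of values, or average of the middle two
values otherwise
– For grouped data, estimated by interpolation

$$ \text{median} = L_1 + \left( \frac{n/2 - \left(\sum f\right)_l}{f_{\text{median}}} \right) c $$

where L_1 is the lower boundary of the median interval, n is the number of values,
(Σ f)_l is the sum of the frequencies of all intervals below the median interval,
f_median is the frequency of the median interval, and c is its width
• Mode
– Value that occurs most frequently in the data
– Unimodal, bimodal, trimodal
– Empirical formula: $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$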
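As a rough illustration of these measures, the following sketch uses Python's standard `statistics` module. The data values, the weights, and the grouped-data numbers are made up for the example, and `weighted_mean` and `grouped_median` are hypothetical helper names that simply transcribe the formulas above.

```python
import statistics

# Illustrative data only.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(values)
median = statistics.median(values)      # average of the two middle values (n is even)
modes = statistics.multimode(values)    # every value with the highest frequency

def weighted_mean(xs, ws):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)

def grouped_median(lower_bound, n, freq_below, freq_median, width):
    """Interpolated median for grouped data: L1 + ((n/2 - (sum f)_l) / f_median) * c."""
    return lower_bound + (n / 2 - freq_below) / freq_median * width

print(mean, median, modes)
print(weighted_mean([1, 2, 3], [1, 1, 2]))
# Hypothetical grouping: median interval starts at 20, is 10 wide, holds 40 of
# 100 values, and 30 values lie in the intervals below it.
print(grouped_median(lower_bound=20, n=100, freq_below=30, freq_median=40, width=10))
```

The sample data has two modes, so it is bimodal, and the mean lies above the median because of the single large value.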
Measuring the Dispersion of Data

• Quartiles, outliers and boxplots
– Quartiles: Q1 (25th percentile), Q3 (75th percentile)
– Inter-quartile range: IQR = Q3 – Q1
– Five number summary: min, Q1, M, Q3, max
– Boxplot: the ends of the box are the quartiles, the median is marked,
whiskers are added, and outliers are plotted individually
– Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
• Variance and standard deviation
– Variance s^2 (algebraic, scalable computation):

$$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right] $$

– Standard deviation s is the square root of the variance s^2
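The quartile-based measures can be sketched with the standard `statistics` module. The data values are illustrative, and the quartiles come from `statistics.quantiles` with its default method; other quartile conventions can give slightly different values for small samples.

```python
import statistics

# Illustrative data only.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# n=4 returns the three cut points Q1, median, Q3.
q1, median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Common rule of thumb: flag values more than 1.5 * IQR beyond the quartiles.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in values if x < low_fence or x > high_fence]

five_number_summary = (min(values), q1, median, q3, max(values))
print(five_number_summary, iqr, outliers)
```

In this sample only the largest value falls outside the fences, so a boxplot would draw it as an individual point.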


Boxplot Analysis

• Five-number summary of a distribution:


Minimum, Q1, M, Q3, Maximum
• Boxplot
– Data is represented with a box
– The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
(interquartile range)
– The median is marked by a line within the box
– Whiskers: two lines outside the box extend to
Minimum and Maximum
A Boxplot
Mining Descriptive Statistical Measures in Large Databases

• Variance

$$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right] $$

• Standard deviation: the square root of the variance
– Measures spread about the mean
– It is zero if and only if all the values are equal
– Both the standard deviation and the variance are algebraic
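The reason the variance is called algebraic is that it can be derived from a few distributive aggregates (n, the sum, and the sum of squares) that are computed per partition and then added together. The sketch below assumes a database split into two chunks; the helper names and the data are hypothetical.

```python
import math
import statistics

def partial_sums(chunk):
    """Distributive aggregates for one partition: (n, sum of x, sum of x^2)."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))

def merge(a, b):
    """Combine the aggregates of two partitions by component-wise addition."""
    return tuple(p + q for p, q in zip(a, b))

def variance_from_sums(n, s, ss):
    """Sample variance from the aggregates: (sum x^2 - (sum x)^2 / n) / (n - 1)."""
    return (ss - s * s / n) / (n - 1)

# Illustrative data split into two partitions.
chunk_1 = [30, 36, 47, 50, 52, 52]
chunk_2 = [56, 60, 63, 70, 70, 110]

n, s, ss = merge(partial_sums(chunk_1), partial_sums(chunk_2))
variance = variance_from_sums(n, s, ss)
std_dev = math.sqrt(variance)

# Cross-check against the two-pass definition on the full data.
reference = statistics.variance(chunk_1 + chunk_2)
```

Each partition is scanned once and only three numbers per partition are exchanged. One caveat: with floating-point data whose mean is large relative to its spread, the subtraction in this formula can lose precision, so numerically careful implementations often use other update schemes.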
Histogram Analysis

• Graph displays of basic statistical class descriptions


– Frequency histograms
• A univariate graphical method
• Consists of a set of rectangles that reflect the counts or frequencies of the
classes present in the given data
Quantile Plot
• Displays all of the data (allowing the user to assess both the
overall behavior and unusual occurrences)
• Plots quantile information
– For data x_i sorted in increasing order, f_i indicates that
approximately 100 f_i% of the data are below or equal to the value x_i
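The points of a quantile plot can be computed without any plotting library. The slide does not say how f_i is obtained; the sketch below assumes the common convention f_i = (i - 0.5) / n for the i-th smallest value, and the function name and data are hypothetical.

```python
def quantile_points(data):
    """Return (f_i, x_i) pairs for a quantile plot, assuming f_i = (i - 0.5) / n."""
    xs = sorted(data)
    n = len(xs)
    return [((i - 0.5) / n, x) for i, x in enumerate(xs, start=1)]

# Illustrative data only.
points = quantile_points([8, 3, 5, 12])
for f, x in points:
    print(f"{f:.3f}  {x}")
```

Plotting x_i against f_i then shows every observation, so both the overall shape and any unusual values remain visible.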
Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against
the corresponding quantiles of another
• Allows the user to view whether there is a shift in going from
one distribution to another
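A minimal sketch of the data behind a Q-Q plot, assuming two made-up samples and using `statistics.quantiles` to obtain matching quantiles from each. The names `qq_points`, `branch_1`, and `branch_2` are hypothetical.

```python
import statistics

def qq_points(sample_a, sample_b, n=4):
    """Pair the corresponding quantiles of two univariate samples."""
    qa = statistics.quantiles(sample_a, n=n)
    qb = statistics.quantiles(sample_b, n=n)
    return list(zip(qa, qb))

# Illustrative data: the second sample is the first shifted up by 5.
branch_1 = [40, 45, 50, 55, 60, 65, 70, 80]
branch_2 = [x + 5 for x in branch_1]

points = qq_points(branch_1, branch_2)
shifts = [b - a for a, b in points]
print(points, shifts)
```

If the two distributions were identical, the points would lie on the line y = x; here every point sits the same distance above it, which is how a constant shift shows up in the plot.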
Scatter plot
• Provides a first look at bivariate data to see clusters of points,
outliers, etc.
• Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
Loess Curve
• Adds a smooth curve to a scatter plot in order to provide
better perception of the pattern of dependence
• Loess curve is fitted by setting two parameters: a smoothing
parameter, and the degree of the polynomials that are fitted
by the regression
THANKS
ANY QUERIES??
