Attribute Oriented Analysis
• Attribute generalization
• Attribute relevance
• Class comparison
• Statistical measures
• Experiments with Weka – using filters and statistics
Data Objects
• Represents an entity
• Example: in a sales database, the objects may be customers, store
items, and sales.
• Data objects are typically described by attributes.
• If the data objects are stored in a database, they are data tuples.
• That is, the rows of a database correspond to the data objects, and the
columns correspond to the attributes.
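The row/column correspondence can be sketched in a few lines of Python. The `Customer` class, its attribute names, and the two sample rows are hypothetical illustrations, not taken from any real sales database.

```python
from dataclasses import dataclass, fields


@dataclass
class Customer:
    # Each field is one attribute (one column of the table).
    customer_id: int
    name: str
    city: str


# Each instance is one data object (one row, i.e. one data tuple).
# The values below are made-up placeholders.
rows = [
    Customer(1, "Alice", "Pune"),
    Customer(2, "Bob", "Delhi"),
]

attribute_names = [f.name for f in fields(Customer)]
print(attribute_names)   # the columns
print(len(rows))         # the number of data objects
```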
What is an Attribute?
1. Nominal
2. Binary
3. Ordinal
4. Numeric
Nominal Attributes
Example:
Example:
0: very dissatisfied,
1: somewhat dissatisfied,
2: neutral,
3: satisfied, and
4: very satisfied.
Numeric Attributes
Interval-Scaled Attributes
• measured on a scale of equal-size units
• values have order
• allow us to compare and quantify the difference between values.
Example:
• temperature of 20°C and 15°C
• Calendar dates 2010 and 2022
Numeric Attributes (cont’d)
Ratio-Scaled Attributes
• a numeric attribute with an inherent zero-point
• values are ordered, and we can also compute the difference between values,
as well as the mean, median, and mode
Example:
• count attributes such as years_of_experience (e.g., the objects are employees)
• number_of_words (e.g., the objects are documents)
• Additional examples include attributes to measure weight, height, latitude
and longitude coordinates (e.g., when clustering houses)
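A small sketch of what the inherent zero-point buys us. It reuses the 20°C and 15°C example for the interval-scaled case. The `years_of_experience` values and the helper name `celsius_to_kelvin` are assumptions made for illustration. The point it tries to show is that differences are meaningful on both scales, while "x times as much" statements are only stable on a ratio scale.

```python
import math


def celsius_to_kelvin(c):
    """Shift the zero point; the unit size stays the same."""
    return c + 273.15


a_c, b_c = 20.0, 15.0

# Interval-scaled: the difference does not depend on where zero sits.
diff_c = a_c - b_c
diff_k = celsius_to_kelvin(a_c) - celsius_to_kelvin(b_c)

# ...but the ratio does, so "20°C is 1.33 times 15°C" is not meaningful.
ratio_c = a_c / b_c
ratio_k = celsius_to_kelvin(a_c) / celsius_to_kelvin(b_c)

# Ratio-scaled: zero years really means "no experience", so the ratio
# is the same whether we measure in years or in months.
years_a, years_b = 10, 5          # hypothetical employees
ratio_years = years_a / years_b
ratio_months = (years_a * 12) / (years_b * 12)

print(math.isclose(diff_c, diff_k))    # differences agree
print(math.isclose(ratio_c, ratio_k))  # ratios do not
print(ratio_years == ratio_months)     # ratios agree on a ratio scale
```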
Attribute Generalization
Example:
Set representation
Generalization
Y1 = {x2 = hot, x3 = high, x4 = weak} (X1 with the first and last attributes dropped)
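A minimal sketch of generalization by dropping attributes, assuming an instance is stored as an attribute-to-value mapping. Only x2, x3, and x4 appear in the example above; the values used here for x1 and x5, and the helper names `drop_attributes` and `covers`, are hypothetical placeholders.

```python
# x2..x4 come from the example; x1 and x5 are made-up placeholder values.
X1 = {"x1": "sunny", "x2": "hot", "x3": "high", "x4": "weak", "x5": "no"}


def drop_attributes(instance, to_drop):
    """Generalize an instance by removing the named attributes."""
    return {k: v for k, v in instance.items() if k not in to_drop}


def covers(general, instance):
    """True if the instance agrees with every attribute kept in `general`."""
    return all(instance.get(k) == v for k, v in general.items())


keys = list(X1)
Y1 = drop_attributes(X1, {keys[0], keys[-1]})   # drop first and last
print(Y1)

# Y1 is more general: it also covers an instance that differs from X1
# only in a dropped attribute.
other = dict(X1, x1="rain")
print(covers(Y1, X1), covers(Y1, other))
```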
Attribute Relevance
Mining a class comparison. Suppose that you would like to compare the
general properties of the graduate and undergraduate students at Big
University, given the attributes name, gender, major, birth_place,
birth_date, residence, phone#, and gpa.
Class Comparison (cont’d)
What
1. Measures of central tendency
• Mean, median, mode
• Locate the center of a data distribution
• Where do most of the attribute values fall?
2. Dispersion measures
• Range, quartiles, interquartile range, five-number summary and box plots, variance and standard deviation
• Describe how the data are spread out
Why
• Basic statistical descriptions are used in data analysis to get an overall picture of the data.
• The statistical metrics can tell us whether there are issues such as extreme outliers and large deviations in the attribute values.
What are Outliers?
• Data values that differ significantly from the other values
• They affect the mean of the data but have little effect on the median or mode.
Measures of Central Tendency
Example: We have the following values for salary (in thousands of dollars): 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
Mean – Average value of a numeric attribute. Mean salary is $58,000.
Median – Middle value of a numeric attribute. Median is $54,000.
Mode – Most common value of a numeric attribute. Modes are $52,000 and $70,000.
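The three measures can be checked on the salary values with Python's standard `statistics` module. The second half of the sketch drops the largest value (110) to illustrate the earlier remark that an outlier pulls the mean more than the median. Treating 110 as the outlier is an assumption made for the illustration.

```python
from statistics import mean, median, multimode

# Salary values in thousands of dollars, from the example.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

salary_mean = mean(salaries)        # 696 / 12 = 58
salary_median = median(salaries)    # average of 52 and 56 = 54
salary_modes = multimode(salaries)  # [52, 70]: the data are bimodal

print(salary_mean, salary_median, salary_modes)

# Remove the largest value and compare how far each measure moves.
trimmed = salaries[:-1]
mean_shift = salary_mean - mean(trimmed)
median_shift = salary_median - median(trimmed)
print(mean_shift > median_shift)    # the mean moves further
```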
Dispersion Measures
Example: Take the data for an attribute X, sorted in increasing numeric order.
Range – The difference between the largest and smallest values of the attribute.
Quantiles – Points taken at regular intervals, dividing the data into equal-size consecutive sets.
2-quantile – The data point dividing the lower and upper halves of the data (the median).
4-quantiles – The three data points that divide the data into four equal parts (the quartiles).
100-quantiles – The data points that divide the data values into 100 equal parts (the percentiles).
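As a rough illustration, the range and the q-quantile cut points can be computed with the standard library. Reusing the salary values from the central-tendency example is an assumption. `statistics.quantiles` interpolates between data points, so its quartiles can differ slightly from values read off by hand with another convention.

```python
from statistics import quantiles

# Salary values (thousands of dollars), sorted in increasing order.
x = sorted([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

value_range = x[-1] - x[0]              # largest minus smallest

median_cut = quantiles(x, n=2)          # 1 cut point: the median
quartile_cuts = quantiles(x, n=4)       # 3 cut points: the quartiles
percentile_cuts = quantiles(x, n=100)   # 99 cut points: the percentiles

print(value_range)
print(median_cut)
print(quartile_cuts)
print(len(percentile_cuts))
```

In general, q-quantiles are the q - 1 cut points that split the sorted data into q parts.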
Dispersion Measures (Quartile)
Second quartile Q2 – The 50th percentile. As the median, it gives the center of the data distribution.
The distance between Q1 and Q3 gives the range covered by the middle half of the data. This distance is called the interquartile range: IQR = Q3 - Q1.
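A short sketch of the IQR on the salary values, which are assumed here to be the data of interest. The quartiles come from `statistics.quantiles` with its default interpolation, so Q1 and Q3 may not match a hand calculation that picks actual data points. The final lines check that the interval from Q1 to Q3 does hold the middle half of the values.

```python
from statistics import quantiles

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Three cut points: first quartile, median, third quartile.
q1, q2, q3 = quantiles(salaries, n=4)
iqr = q3 - q1                       # interquartile range

# Values lying between Q1 and Q3: the middle half of the data.
middle = [v for v in salaries if q1 <= v <= q3]

print(q1, q2, q3, iqr)
print(len(middle), "of", len(salaries), "values fall inside [Q1, Q3]")
```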
Experiments with Weka using Filters