
Attribute-Oriented Analysis

Presenter: Joveliza A. Trongcoso


Topic Outline

• Attribute generalization
• Attribute relevance
• Class comparison
• Statistical measures
• Experiments with Weka – using filters and statistics
Data Objects

• Represents an entity
• Example: in a sales database, the objects may be customers, store
items, and sales.
• Data objects are typically described by attributes.
• If the data objects are stored in a database, they are data tuples.
• That is, the rows of a database correspond to the data objects, and the
columns correspond to the attributes.
What is an Attribute?

• A data field representing a characteristic or feature of a data object.


• The nouns attribute, dimension, feature, and variable are often used
interchangeably in the literature.
• Attributes describing a customer object can include, for example,
customer_ID, name, and address
What is an Attribute? (cont’d)

• Observations – the observed values for a given attribute
• Attribute vector (or feature vector) – the set of attributes used to describe
a given object
• Univariate – a distribution of data involving one attribute (or variable)
• Bivariate – a distribution of data involving two attributes
Types of Attributes

1. Nominal
2. Binary
3. Ordinal
4. Numeric
Nominal Attributes

• The values of a nominal attribute are symbols or names of things.


• Nominal attributes are also referred to as categorical.

Example: hair_color and marital_status


Binary Attributes

• A nominal attribute with only two categories or states: 0 or 1


• 0 means absent; 1 means present
• Binary attributes are referred to as Boolean if the two states correspond
to true and false.

Example:

Attribute smoker describing a patient object


1 indicates that the patient smokes, while 0 indicates that the patient does not
Binary Attributes (cont’d)

• A binary attribute is symmetric if both of its states are equally
valuable and carry the same weight.
• A binary attribute is asymmetric if the outcomes of the states are not
equally important.
Ordinal Attributes

• An attribute with possible values that have a meaningful order or
ranking among them.

Example:

drink_size (small, medium, and large)


professional_rank (private, private first class, specialist, corporal, and sergeant)
Ordinal Attributes (cont’d)

• Ordinal attributes are often used in surveys for ratings.

Example: customer satisfaction with the following ordinal categories:

0: very dissatisfied,
1: somewhat dissatisfied,
2: neutral,
3: satisfied, and
4: very satisfied.
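
Because the categories are ordered, comparisons and the median are meaningful even though the numeric gap between two codes is not assumed to be. A minimal sketch of that idea, assuming a made-up list of survey responses (the `responses` data and the `rank` helper are illustrative, not from the slides):

```python
# Ordinal categories from the slide, listed in rank order (codes 0 to 4).
SATISFACTION_LEVELS = [
    "very dissatisfied",
    "somewhat dissatisfied",
    "neutral",
    "satisfied",
    "very satisfied",
]


def rank(level: str) -> int:
    """Return the ordinal code of a satisfaction level."""
    return SATISFACTION_LEVELS.index(level)


# Hypothetical survey responses, for illustration only.
responses = [
    "neutral",
    "very satisfied",
    "satisfied",
    "neutral",
    "somewhat dissatisfied",
]

# Ordering is meaningful, so responses can be sorted and the median taken.
ranks = sorted(rank(r) for r in responses)
median_level = SATISFACTION_LEVELS[ranks[len(ranks) // 2]]
print(median_level)  # neutral
```

The mean of the codes is deliberately not computed here, since it would treat the distance between adjacent categories as equal.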
Numeric Attributes

• A numeric attribute is quantitative; that is, it is a measurable quantity,
represented in integer or real values.
• Numeric attributes can be interval-scaled or ratio-scaled.
Numeric Attributes (cont’d)

Interval-Scaled Attributes
• measured on a scale of equal-size units
• values have order
• allow us to compare and quantify the difference between values

Example:
• temperature of 20°C and 15°C
• Calendar dates 2010 and 2022
Numeric Attributes (cont’d)

Ratio-Scaled Attributes
• a numeric attribute with an inherent zero-point
• values are ordered, and we can also compute the difference between values,
as well as the mean, median, and mode

Example:
• count attributes such as years_of_experience (e.g., the objects are employees)
• number_of_words (e.g., the objects are documents)
• Additional examples include attributes to measure weight, height, latitude
and longitude coordinates (e.g., when clustering houses)
Attribute Generalization

• Attribute generalization is based on the following rule: “if there is a
large set of distinct values for an attribute, then a generalization
operator should be selected and applied to the attribute”

• Nominal attributes: the operation defines a sub-cube by performing a
selection on two or more dimensions (dropping condition).
• Structured attributes: climbing up a concept hierarchy is used, replacing a
value in an attribute-value pair with a more general one. The operation
performs aggregation on a data cube, either by climbing up a concept hierarchy
for a dimension or by dimension reduction.
Attribute Generalization (cont’d)

Example:

Set representation

Generalization
Y1 = {x2 = hot, x3 = high, x4 = weak} (X1 with the first and last attributes dropped)
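
The two generalization operators can be sketched as small functions over an attribute-value mapping. This is only an illustration under stated assumptions: the values given to x1 and x5, the function names, and the `temperature_hierarchy` mapping are hypothetical, since the slide does not show the full instance X1 or any concept hierarchy.

```python
def drop_conditions(instance: dict, attributes_to_drop: set) -> dict:
    """Generalize by removing attribute-value pairs (dropping condition)."""
    return {k: v for k, v in instance.items() if k not in attributes_to_drop}


def climb_hierarchy(instance: dict, attribute: str, hierarchy: dict) -> dict:
    """Generalize by replacing a value with its parent in a concept hierarchy."""
    generalized = dict(instance)
    value = generalized[attribute]
    # Values without a parent in the hierarchy are left unchanged.
    generalized[attribute] = hierarchy.get(value, value)
    return generalized


# Hypothetical instance X1: the x1 and x5 values are placeholders.
x1_instance = {"x1": "sunny", "x2": "hot", "x3": "high", "x4": "weak", "x5": "no"}

# Dropping the first and last attributes yields Y1 from the slide.
y1 = drop_conditions(x1_instance, {"x1", "x5"})
print(y1)  # {'x2': 'hot', 'x3': 'high', 'x4': 'weak'}

# Hypothetical one-level concept hierarchy for x2.
temperature_hierarchy = {"hot": "not_cool", "mild": "not_cool"}
y2 = climb_hierarchy(y1, "x2", temperature_hierarchy)
print(y2)  # {'x2': 'not_cool', 'x3': 'high', 'x4': 'weak'}
```

Both functions return a new mapping, so the original instance stays available for comparison with its generalization.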
Attribute Relevance

Attribute relevance analysis is done in order to filter out statistically
irrelevant or weakly relevant attributes, and retain or even rank the
most relevant attributes for the descriptive mining task at hand.
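
The slides do not name a specific relevance measure. One commonly used choice is information gain, which scores an attribute by how much it reduces the entropy of the class label. The sketch below assumes that measure and uses a tiny made-up data set (the rows, the attribute names, and the `status` class attribute are illustrative only):

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, class_attribute):
    """Entropy of the class minus the expected entropy after splitting on attribute."""
    labels = [row[class_attribute] for row in rows]
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[class_attribute])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


# Hypothetical records, for illustration only.
rows = [
    {"major": "cs", "gender": "f", "status": "graduate"},
    {"major": "cs", "gender": "m", "status": "graduate"},
    {"major": "biology", "gender": "f", "status": "undergraduate"},
    {"major": "biology", "gender": "m", "status": "undergraduate"},
]

gains = {a: information_gain(rows, a, "status") for a in ("major", "gender")}
ranking = sorted(gains, key=gains.get, reverse=True)
print(ranking)  # ['major', 'gender']
```

In this toy data, major separates the classes completely while gender carries no information about them, so a relevance filter would retain major and could drop gender.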
Class Comparison

• Class discrimination or comparison (hereafter referred to as class
comparison) mines descriptions that distinguish a target class from its
contrasting classes.
• The target and contrasting classes must be comparable and share similar
dimensions and attributes.
Class Comparison (cont’d)

Example: a class comparison describing the graduate and
undergraduate students at Big University.

Mining a class comparison. Suppose that you would like to compare the
general properties of the graduate and undergraduate students at Big
University, given the attributes name, gender, major, birth_place,
birth_date, residence, phone#, and gpa.
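
One simple way to picture such a comparison is to place the distribution of a generalized attribute in the target class next to its distribution in the contrasting class. The following is a rough sketch under stated assumptions, not the query from the slides: the student records, the `status` attribute, and the generalized `gpa_range` values are all made up for illustration.

```python
from collections import Counter


def class_distribution(rows, class_attribute, class_value, attribute):
    """Relative frequency of each attribute value within one class."""
    values = [r[attribute] for r in rows if r[class_attribute] == class_value]
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


# Hypothetical student records with gpa already generalized to a range label.
students = [
    {"status": "graduate", "gpa_range": "excellent"},
    {"status": "graduate", "gpa_range": "excellent"},
    {"status": "graduate", "gpa_range": "excellent"},
    {"status": "graduate", "gpa_range": "good"},
    {"status": "undergraduate", "gpa_range": "good"},
    {"status": "undergraduate", "gpa_range": "good"},
    {"status": "undergraduate", "gpa_range": "fair"},
    {"status": "undergraduate", "gpa_range": "excellent"},
]

target = class_distribution(students, "status", "graduate", "gpa_range")
contrast = class_distribution(students, "status", "undergraduate", "gpa_range")

# Print the two class descriptions side by side.
for value in sorted(set(target) | set(contrast)):
    print(value, target.get(value, 0.0), contrast.get(value, 0.0))
```

Both classes are described over the same attribute, which reflects the requirement that target and contrasting classes share comparable dimensions.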
Class Comparison (cont’d)

This data mining task can be expressed in DMQL as follows:


Statistical Description of Data

What
1. Measures of central tendency
• Mean, median, mode
• Location of the center of a data distribution
• Where do most of the attribute values fall?
2. Dispersion measures
• Range, quartiles, interquartile range, five-number summary and box plots,
variance and standard deviation
• Describe how the data are spread out

Why
• To get an overall picture of the data, basic statistical descriptions are
used in data analysis.
• The statistical metrics can tell us whether issues exist, such as extreme
outliers and large deviations in the values of attributes.

What Are Outliers?
• Data values that differ significantly from the other values
• They affect the mean value of the data but have little effect on the
median or mode.
Measures of Central Tendency

Example: We have the values for salary (in thousands of dollars): 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.

Mean – average value of a numeric attribute
• Mean salary is $58,000.

Median – middle value of a numeric attribute
• Sort values in increasing order.
• If N is odd, the median is the middle value of the ordered set.
• If N is even, the median is the average of the two middlemost values.
• Median is $54,000.

Mode – most common value of an attribute
• It can be determined for qualitative and quantitative attributes.
• The data from the example are bimodal.
• Modes are $52,000 and $70,000.
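
The three measures for the salary values can be checked with Python's standard `statistics` module. This is a small sketch; the comparison with the largest value removed is an added illustration of how an extreme value pulls the mean more than the median.

```python
from statistics import mean, median, multimode

# Salary values from the example, in thousands of dollars.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

salary_mean = mean(salaries)        # sum is 696, N is 12
salary_median = median(salaries)    # average of the 6th and 7th values
salary_modes = multimode(salaries)  # every value with the highest count

print(salary_mean, salary_median, salary_modes)

# Illustration: drop the extreme value 110 and recompute.
trimmed = [s for s in salaries if s != 110]
print(round(mean(trimmed), 2), median(trimmed))
```

Removing the single value 110 lowers the mean from 58 to about 53.27, while the median moves only from 54 to 52.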
Dispersion Measures

Example: We take data for any attribute X sorted in increasing numeric order

Range – The difference between the largest and smallest values of the attribute.

Quantiles – points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets.
2-Quantile – the data point dividing the lower and upper halves of the data – Median
4-Quantiles – three data points that divide the data into four equal parts – Quartiles
100-Quantiles – divide the data values into 100 equal parts – Percentiles
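
As an illustration, the range and the quantile cut points can be computed with the standard `statistics.quantiles` function. The sketch reuses the salary values from the central tendency example as the attribute X, which is an assumption made here for concreteness. Note that several conventions exist for interpolating quantiles, so the exact cut points may differ slightly between tools and textbooks.

```python
from statistics import quantiles

# Attribute values sorted in increasing order (salary, in thousands of dollars).
salaries = sorted([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

# Range: difference between the largest and smallest values.
value_range = salaries[-1] - salaries[0]

# q-quantiles are the q - 1 cut points that split the data into q parts.
median_cut = quantiles(salaries, n=2)         # 1 cut point: the median
quartile_cuts = quantiles(salaries, n=4)      # 3 cut points: Q1, Q2, Q3
percentile_cuts = quantiles(salaries, n=100)  # 99 cut points

print(value_range)
print(median_cut, quartile_cuts, len(percentile_cuts))
```

The second quartile returned for `n=4` coincides with the single cut point returned for `n=2`, since both are the median.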
Dispersion Measures (Quartile)

A plot of the data distribution for an attribute X.

First quartile Q1 – 25th percentile
Cuts off the lowest 25% of the data.

Second quartile Q2 – 50th percentile
The median gives the center of the data distribution.

Third quartile Q3 – 75th percentile
Cuts off the lowest 75% (or highest 25%) of the data.

The distance between Q1 and Q3 gives the range covered by the middle half of
the data. This distance is called the interquartile range: IQR = Q3 − Q1.
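
The quartiles combine with the minimum and maximum into the five-number summary, from which the IQR follows directly. A minimal sketch, assuming the salary values from the earlier example and the default interpolation of `statistics.quantiles` (other quartile conventions give slightly different Q1 and Q3); the `five_number_summary` helper is an illustrative name:

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    q1, q2, q3 = quantiles(ordered, n=4)
    return ordered[0], q1, q2, q3, ordered[-1]


# Salary values in thousands of dollars.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

minimum, q1, q2, q3, maximum = five_number_summary(salaries)

# Interquartile range: spread of the middle half of the data.
iqr = q3 - q1

print(minimum, q1, q2, q3, maximum)
print(iqr)
```

Unlike the range, the IQR is not affected by the extreme value 110, because it depends only on the middle half of the ordered data.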
Experiments with Weka using Filters
