
BITS Pilani, Hyderabad Campus

Dr. Aruna Malapati
Asst. Professor, Department of CSIS

Data
Today’s Learning Objectives

• Describe data

• List various data types

• List the issues in data quality

• List and identify the right preprocessing techniques for given data

BITS Pilani, Hyderabad Campus


What is Data?

• Data is a collection of data objects and their attributes

• An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – Other names: variable, field, characteristic, feature, predictor, etc.

• A collection of attributes describes an object
  – Other names: record, point, case, sample, entity, or instance

  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

BITS Pilani, Hyderabad Campus


Attribute Values

• Each attribute has a set of values that an object draws from.

• The same attribute can be mapped to different attribute values
  – Example: temperature can be measured in Celsius or Fahrenheit

• Different attributes can be mapped to the same set of values
  – Example: attribute values for ID and age are both integers

BITS Pilani, Hyderabad Campus


Types of Attributes

• Discrete: has a finite or countably infinite set of values
  – Examples: zip codes, terms in a document

• Continuous: has real numbers as values
  – Examples: length, weight, temperature, etc.

BITS Pilani, Hyderabad Campus


Properties of Attribute Values

• The type of an attribute depends on which of the following properties it possesses:
  – Distinctness: =, ≠
  – Order: <, >
  – Addition: +, −
  – Multiplication: *, /

• Attribute types by properties:
  – Nominal attribute: distinctness
  – Ordinal attribute: distinctness & order
  – Interval attribute: distinctness, order & addition
  – Ratio attribute: all 4 properties

BITS Pilani, Hyderabad Campus


Attribute Type | Description | Examples | Operations

Nominal   The values of a nominal attribute are just different names, i.e., nominal
          attributes provide only enough information to distinguish one object
          from another. (=, ≠)
          Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
          Operations: mode, entropy, contingency correlation, chi-square (χ²) test

Ordinal   The values of an ordinal attribute provide enough information to order
          objects. (<, >)
          Examples: hardness of minerals, {good, better, best}, grades, street numbers
          Operations: median, percentiles, rank correlation, run tests, sign tests

Interval  For interval attributes, the differences between values are meaningful,
          i.e., a unit of measurement exists. (+, −)
          Examples: calendar dates, temperature in Celsius or Fahrenheit
          Operations: mean, standard deviation, Pearson's correlation, t and F tests

Ratio     For ratio variables, both differences and ratios are meaningful. (*, /)
          Examples: temperature in Kelvin, monetary quantities, counts, age, mass,
          length, electrical current
          Operations: geometric mean, harmonic mean, percent variation
BITS Pilani, Hyderabad Campus
Discrete and Continuous
Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of
documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes

• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented
using a finite number of digits.
– Continuous attributes are typically represented as floating-point
variables.

BITS Pilani, Hyderabad Campus


Types of data sets

• Record
– Data Matrix
– Document Data
– Transaction Data

• Graph
– World Wide Web
– Molecular Structures

• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data

BITS Pilani, Hyderabad Campus


Important Characteristics of
Structured Data

– Dimensionality

• Curse of Dimensionality

– Sparsity

• Only presence counts

– Resolution

• Patterns depend on the scale

BITS Pilani, Hyderabad Campus


Record Data

• Data that consists of a collection of records, each of which consists of a fixed set of attributes

  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

BITS Pilani, Hyderabad Campus


Data Matrix
• If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of as
points in a multi-dimensional space, where each
dimension represents a distinct attribute

• Such data set can be represented by an m by n matrix,


where there are m rows, one for each object, and n
columns, one for each attribute

  Projection of x Load  Projection of y Load  Distance  Load  Thickness
  10.23                 5.27                  15.22     2.7   1.2
  12.65                 6.25                  16.22     2.2   1.1

BITS Pilani, Hyderabad Campus


Document Data

• Each document becomes a `term' vector,


– each term is a component (attribute) of the vector,
– the value of each component is the number of times the corresponding term
occurs in the document.

              team  coach  play  ball  score  game  win  lost  timeout  season
  Document 1   3     0      5     0     2      6     0    2     0        2
  Document 2   0     7      0     2     1      0     0    3     0        0
  Document 3   0     1      0     0     1      2     2    0     3        0
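As an illustration (not from the slides), here is a minimal scikit-learn sketch that builds such term-count vectors; the toy documents are made up for the example:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, not the ones behind the table above
docs = [
    "team play play game game score lost",
    "coach ball score lost lost",
    "coach score game game win win timeout timeout timeout",
]

# Each document becomes a vector of raw term counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse document-term matrix
print(vectorizer.get_feature_names_out())   # the terms (attributes)
print(X.toarray())                          # one row of counts per document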

BITS Pilani, Hyderabad Campus


Transaction Data
• A special type of record data, where
– each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of
products purchased by a customer during one
shopping trip constitute a transaction, while the
individual products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk

BITS Pilani, Hyderabad Campus


Graph Data
Examples: Generic graph and HTML Links
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers

BITS Pilani, Hyderabad Campus


Chemical Data

Benzene Molecule: C6H6

BITS Pilani, Hyderabad Campus


Ordered Data
Sequences of transactions

[Figure: customers' transaction sequences over time; an element of the sequence is one transaction]

BITS Pilani, Hyderabad Campus


Ordered Data
Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

BITS Pilani, Hyderabad Campus


Ordered Data

Spatio-Temporal Data

Average Monthly
Temperature of
land and ocean

BITS Pilani, Hyderabad Campus


Data Quality

• What kinds of data quality problems?

• How can we detect problems with the data?

• What can we do about these problems?

• Examples of data quality problems:


– Noise and outliers
– missing values
– duplicate data

BITS Pilani, Hyderabad Campus


Noise
• Noise: an invalid signal that overlaps and distorts the valid data
  – Examples: distortion of a person's voice when talking on a poor phone connection, or "snow" on a television screen

[Figures: two sine waves; the same two sine waves with noise added]


BITS Pilani, Hyderabad Campus
Outliers
• Outliers are data objects with characteristics that are considerably different from those of most other data objects in the data set

BITS Pilani, Hyderabad Campus


Missing Values
• Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

• Handling missing values


– Eliminate Data Objects
– Estimate Missing Values
– Ignore the Missing Value During Analysis
– Replace with all possible values (weighted by their
probabilities)
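A minimal pandas sketch of these handling options (illustrative, not from the slides; the DataFrame and its columns are assumptions):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 33],
                   "income": [50_000, 62_000, np.nan, 48_000]})

# Eliminate data objects (rows) that contain missing values
dropped = df.dropna()

# Estimate missing values, e.g. fill each column with its mean
imputed = df.fillna(df.mean(numeric_only=True))

# Ignore missing values during analysis: pandas skips NaNs here by default
column_means = df.mean(numeric_only=True)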

BITS Pilani, Hyderabad Campus


Duplicate Data

• Data set may include data objects that are duplicates, or


almost duplicates of one another
– Major issue when merging data from heterogeneous
sources

• Examples:
– Same person with multiple email addresses

• Data cleaning
– Process of dealing with duplicate data issues
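A minimal pandas sketch of de-duplication (illustrative, not from the slides; the column values are made up):

import pandas as pd

people = pd.DataFrame({
    "name":  ["A. Rao", "A. Rao", "B. Iyer"],
    "email": ["arao@x.com", "arao@y.com", "biyer@z.com"],
})

# Exact duplicate rows are easy to drop
deduped = people.drop_duplicates()

# Near-duplicates (same person, multiple email addresses) need a matching key
deduped_by_name = people.drop_duplicates(subset=["name"])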

BITS Pilani, Hyderabad Campus


Data Preprocessing

• Aggregation

• Sampling

• Dimensionality Reduction

• Feature subset selection

• Feature creation

• Discretization and Binarization

• Attribute Transformation

BITS Pilani, Hyderabad Campus


Aggregation

• Combining two or more attributes (or objects) into a single


attribute (or object)

• Purpose
– Data reduction
• Reduce the number of attributes or objects
– Change of scale
• Cities aggregated into regions, states, countries, etc
– More “stable” data
• Aggregated data tends to have less variability
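For instance, a change of scale from cities to states might look like this pandas sketch (illustrative, not from the slides; the data and column names are assumptions):

import pandas as pd

sales = pd.DataFrame({
    "city":   ["Hyderabad", "Warangal", "Mumbai", "Pune"],
    "state":  ["Telangana", "Telangana", "Maharashtra", "Maharashtra"],
    "amount": [120.0, 80.0, 300.0, 150.0],
})

# Aggregate cities into states: fewer objects, and the totals are more "stable"
by_state = sales.groupby("state", as_index=False)["amount"].sum()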

BITS Pilani, Hyderabad Campus


Aggregation
Variation of Precipitation in Australia

[Figures: standard deviation of average monthly precipitation vs. standard deviation of average yearly precipitation]
BITS Pilani, Hyderabad Campus
Sampling

• Sampling is the main technique employed for data


selection.
– It is often used for both the preliminary investigation
of the data and the final data analysis.

• Statisticians sample because obtaining the entire set of


data of interest is too expensive or time consuming.

• Sampling is used in data mining because processing the


entire set of data of interest is too expensive or time
consuming.

BITS Pilani, Hyderabad Campus


Sampling

• The key principle for effective sampling:
  – Using a sample will work almost as well as using the entire data set, provided the sample is representative (what makes a sample representative differs from data set to data set).

• Sampling may remove outliers and, if done improperly, can introduce noise.

BITS Pilani, Hyderabad Campus


Types of Sampling
• Simple Random Sampling
– There is an equal probability of selecting any particular item

• Sampling without replacement


– As each item is selected, it is removed from the population

• Sampling with replacement


– Objects are not removed from the population as they are selected for the
sample.
• In sampling with replacement, the same object can be picked up more than
once

• Stratified sampling
– Split the data into several partitions; then draw random samples from
each partition
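A small pandas sketch of these sampling variants (illustrative, not from the slides; the DataFrame and the 'label' column used for stratification are assumptions):

import pandas as pd

df = pd.DataFrame({"x": range(10), "label": ["a"] * 6 + ["b"] * 4})

# Simple random sampling without replacement
without_repl = df.sample(n=5, replace=False, random_state=0)

# Simple random sampling with replacement: an object may be picked more than once
with_repl = df.sample(n=5, replace=True, random_state=0)

# Stratified sampling: draw a 50% random sample from each partition of 'label'
stratified = df.groupby("label").sample(frac=0.5, random_state=0)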

BITS Pilani, Hyderabad Campus


Sample Size

[Figures: the same data set shown with 8000 points, 2000 points, and 500 points]

BITS Pilani, Hyderabad Campus


Curse of Dimensionality

• When dimensionality increases, data becomes increasingly sparse in the space that it occupies

• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful

• Illustration: randomly generate 500 points and compute the difference between the maximum and minimum distance between any pair of points; as the number of dimensions grows, this difference shrinks relative to the minimum distance
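A NumPy/SciPy sketch of that experiment (illustrative, not from the slides):

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for dims in [2, 10, 50, 200]:
    points = rng.random((500, dims))        # 500 random points in the unit hypercube
    d = pdist(points)                       # all pairwise distances
    spread = (d.max() - d.min()) / d.min()  # relative spread of distances
    print(dims, round(spread, 3))           # the spread shrinks as dims grows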

BITS Pilani, Hyderabad Campus


Dimensionality Reduction

• Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data
mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce noise

• Techniques
– Principal Component Analysis
– Singular Value Decomposition
– Others: supervised and non-linear techniques

BITS Pilani, Hyderabad Campus


Dimensionality Reduction:
PCA

• Goal is to find a projection that captures the largest amount of


variation in data

[Figure: data points in the (x1, x2) plane with the first principal direction u1 along the axis of greatest variation]

BITS Pilani, Hyderabad Campus


PCA

• Reduce high-dimensional data into something that can be explained in fewer dimensions.
• We need PCA when we suspect that the measures in our data set are not all independent, i.e., there exist correlations among them.

[Figure: a scatter of points in the (x1, x2) plane; assume the data set represents the height and weight of people in a region]
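An illustrative scikit-learn sketch of PCA on such two-dimensional data (not from the slides; the height/weight numbers are simulated, and the scaling step follows the normalization advice given on a later slide):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, size=200)
weight_kg = 0.9 * (height_cm - 100) + rng.normal(0, 5, size=200)  # correlated with height
X = np.column_stack([height_cm, weight_kg])

X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance per dimension
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X_scaled)        # project onto the top principal component
print(pca.explained_variance_ratio_)           # fraction of the total variance captured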
BITS Pilani, Hyderabad Campus
PCA in a nutshell

BITS Pilani, Hyderabad Campus


Dimensionality Reduction:
PCA
[Figure: an image reconstructed from its principal components; original dimensions = 206, then reduced to 160, 120, 80, 40, and 10 dimensions]

BITS Pilani, Hyderabad Campus


PCA Limitations

• Covariance is extremely sensitive to large values
  – Multiply some dimension by 1000 and it dominates the covariance and becomes the principal component
• Remedy: normalize each dimension to zero mean and unit variance

  X' = (X − mean) / standard deviation

BITS Pilani, Hyderabad Campus


PCA Limitations

• PCA assumes the underlying subspace is linear
  – in 1D, a line
  – in 2D, a plane

BITS Pilani, Hyderabad Campus


PCA and Classification

• PCA is unsupervised
• It maximizes the overall variance of the data along a small set of directions
• It does not know anything about class labels
• It can therefore pick a direction that makes it hard to separate the classes

BITS Pilani, Hyderabad Campus


Feature Subset Selection
• Another way to reduce dimensionality of data

• Redundant features
– duplicate much or all of the information contained in one or
more other attributes
– Example: purchase price of a product and the amount of
sales tax paid

• Irrelevant features
– contain no information that is useful for the data mining task
at hand
– Example: students' ID is often irrelevant to the task of
predicting students' GPA

BITS Pilani, Hyderabad Campus


Subset Selection

• D initial features

• There are 2^D possible subsets.

• Criterion to decide which subset is best:
  – classifier performance when trained on that feature subset

• The search space of 2^D subsets is huge, so we need heuristics.

BITS Pilani, Hyderabad Campus


Feature Subset Selection
Techniques

• Brute-force approach: Try all possible feature subsets as input


to data mining algorithm
• Filter approaches: Compute a score for each feature and then
select features according to the score.
• Wrapper approaches: score feature subsets by seeing their
performance on a dataset using a classification algorithm.
• Embedded approaches: Select features during the process of
training.

BITS Pilani, Hyderabad Campus


Filter Methods

1. Measure how well each attribute individually helps to discriminate between the classes
   • Which measure to use?
     – Pearson correlation coefficient
     – Information gain
     – Chi-square statistic
2. Rank all attributes
3. The user can then discard all attributes that do not meet a specified criterion, for example retaining only the best 10 attributes (see the sketch below)
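An illustrative scikit-learn sketch of a filter approach (not from the slides; the iris data set and the mutual-information score are assumptions):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Score every feature individually against the class label, keep the best 2
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)        # per-feature scores
print(selector.get_support())  # mask of the retained features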

BITS Pilani, Hyderabad Campus


Filter Methods

• Features are selected on the basis of their scores in various statistical tests of their correlation with the outcome variable.

BITS Pilani, Hyderabad Campus


Linear Discriminant Analysis (LDA)

• Finds the features (directions) that best separate two classes

• Assumes that the variables are normally distributed

• Also known as Fisher's LDA

BITS Pilani, Hyderabad Campus


LDA

• The basic idea is to look for an interesting direction such that, when the data points are projected onto this new dimension, the difference between the class means is maximized, normalized by their variances.

BITS Pilani, Hyderabad Campus


LDA

• Let {(Xi, yi), i = 1, 2, …, n} be the data

• Let yi ∈ {1, 2}

• Let C1 and C2 denote the two classes. Thus, if yi = 1 then Xi ∈ C1, and if yi = 2 then Xi ∈ C2.

• Let n1 and n2 denote the number of examples of each class (n = n1 + n2)

• For any W, let zi = W^T Xi
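An illustrative scikit-learn sketch of LDA used this way (not from the slides; the two-class data set is an assumption):

from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_breast_cancer(return_X_y=True)   # two classes, y in {0, 1}

# Project the data onto the single direction W that best separates the classes
lda = LinearDiscriminantAnalysis(n_components=1)
z = lda.fit_transform(X, y)                  # roughly z_i = W^T X_i (after centering)

print(z.shape)                               # (n_samples, 1)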

BITS Pilani, Hyderabad Campus


Linear Discriminant Analysis (LDA)

BITS Pilani, Hyderabad Campus


Wrapper Methods

BITS Pilani, Hyderabad Campus


Sequential Forward
Selection (SFS)
1. Start with an empty feature set

2. Try each remaining feature

3. Estimate classifier performance for adding each feature

4. Select feature that gives max improvement

5. Stop when there is no significant improvement

Disadvantage: once a feature is retained, it cannot be discarded (the "nesting" problem)
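A minimal sketch of SFS (illustrative, not from the slides; scikit-learn's SequentialFeatureSelector and a k-NN classifier are assumptions standing in for the manual loop above):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Start from the empty set and greedily add the feature that most improves
# cross-validated classifier performance, stopping at 2 features
sfs = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=2,
    direction="forward",
)
sfs.fit(X, y)
print(sfs.get_support())   # mask of the selected features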
BITS Pilani, Hyderabad Campus
Sequential Backward
Selection (SBS)

1. Start with the full feature set

2. Try removing each feature in turn

3. Drop the feature whose removal has the smallest impact on classifier performance

Disadvantage: SBS requires more computation than SFS

BITS Pilani, Hyderabad Campus


Search space for feature
selection
Forward Feature subset selection

Backward Feature subset selection

BITS Pilani, Hyderabad Campus


Feature Creation
• Create new attributes that can capture the important
information in a data set much more efficiently than the
original attributes
• Three general methodologies:

  – Feature extraction: multimedia features (low-, middle-, and high-level features)
    • domain-specific

  – Mapping data to a new space

  – Feature construction: combining features


BITS Pilani, Hyderabad Campus
Mapping Data to a New
Space
 Fourier transform (maps a time-domain signal to the frequency domain)
 Wavelet transform (gives a combined time-frequency representation)

[Figures: two sine waves; the same two sine waves with noise added; the resulting frequency spectrum]

BITS Pilani, Hyderabad Campus


Attribute Transformation

• A function that maps the entire set of values of a given


attribute to a new set of replacement values such that
each old value can be identified with one of the new
values
– Simple functions: x^k, log(x), e^x, |x|
– Standardization and Normalization

BITS Pilani, Hyderabad Campus


Discretization and Binarization

• Many data mining algorithms require the data to be discrete, often binary.
• Discretization is the process of converting
  • continuous-valued attributes, and
  • ordinal attributes with a large number of distinct values
  into discrete variables (nominal or numeric) with a small number of values.
• Discretization is performed by
  • choosing one or more threshold values from the range of the attribute to create intervals of the original value range, and
  • then putting the values inside each interval into a common bin.
• Choosing the best number of bins is an open problem, typically a trial-and-error process.
BITS Pilani, Hyderabad Campus
Binarization

• Can be used to convert both continuous and categorical attributes into a binary representation.
• If m categorical values exist, assign each value a unique integer in [0, m−1] and then convert these integers to a binary representation.
• If the attribute is ordinal, the order of the values needs to be preserved by the integer assignment.
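A small pandas sketch (illustrative, not from the slides; the attribute values are made up):

import pandas as pd

colour = pd.Series(["red", "green", "blue", "green"])

# m = 3 categorical values -> integers in [0, m-1], then one binary column per value
codes = colour.astype("category").cat.codes         # integer encoding 0..2
one_hot = pd.get_dummies(colour, prefix="colour")    # binary (one-hot) representation

# Ordinal attribute: the integer assignment must preserve the order of the values
quality = pd.Series(["good", "best", "better"])
order = ["good", "better", "best"]
quality_codes = pd.Categorical(quality, categories=order, ordered=True).codes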

BITS Pilani, Hyderabad Campus


Discretization

• Discretization of continuous attributes is most often performed one attribute at a time, independently of the other attributes.
• This approach is known as static attribute discretization. At the other end of the spectrum is dynamic attribute discretization, where all attributes are discretized simultaneously while taking into account the interdependencies among them.

BITS Pilani, Hyderabad Campus


Discretization

• Unsupervised discretization
  • Class labels are ignored
  • Equal-interval (equal-width) binning
  • Equal-frequency binning
  • The best number of bins k is determined experimentally

• Supervised discretization
  • Entropy-based discretization
  • Tries to maximize the "purity" of the intervals (i.e., each interval should contain as little a mixture of class labels as possible)
BITS Pilani, Hyderabad Campus
Unsupervised Discretization

• Requires the user to specify the number of intervals and/or how many data points should be included in any given interval.
• The following heuristics are often used to choose intervals:
  • The number of intervals for each attribute should not be smaller than the number of classes (if known).
  • Another popular heuristic is to choose the number of intervals n_Fi for each attribute Fi (i = 1, …, n, where n is the number of attributes) as n_Fi = M / (3 · C), where M is the number of training examples and C is the number of known categories.

BITS Pilani, Hyderabad Campus


Unsupervised
Discretization

BITS Pilani, Hyderabad Campus


Unsupervised Discretization

• Equal-frequency binning
  • An equal number of values is placed in each of the k bins.
  • Disadvantage: many occurrences of the same continuous value could cause that value to be assigned to different bins.
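An illustrative pandas sketch of both unsupervised schemes (not from the slides; the values and the choice of k = 3 bins are assumptions):

import pandas as pd

values = pd.Series([4, 5, 8, 12, 15, 18, 21, 30, 45])

# Equal-interval (equal-width) binning: 3 intervals of equal width
equal_width = pd.cut(values, bins=3)

# Equal-frequency binning: 3 bins with (roughly) the same number of values each
equal_freq = pd.qcut(values, q=3)

print(equal_width.value_counts())
print(equal_freq.value_counts())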

BITS Pilani, Hyderabad Campus


Example

BITS Pilani, Hyderabad Campus


Supervised Discretization

• Suppose you are analyzing the risk of Alzheimer's disease and you split the age data at age 16, age 24, and age 30. Your bins look something like this:

  <= 16
  16–24
  24–30
  > 30

• Now you have a giant bin of people older than 30, where most Alzheimer's patients are, and multiple bins split at lower values, where you are not really getting much information.

• Because of this issue, we want to make meaningful splits in our continuous variables.

BITS Pilani, Hyderabad Campus


Supervised Discretization

• Entropy-based discretization
• The main idea is to split the attribute’s value in a way that
generates bins as “pure” as possible
• We need a measure of “impurity of a bin” such that
• A bin with uniform class distribution has the highest
impurity
• A bin with all items belonging to the same class has
zero impurity
• The more skewed the class distribution in a bin, the smaller its impurity
• Entropy can be such measure of impurity

BITS Pilani, Hyderabad Campus


Entropy
• If a random variable Y takes on values in a set Y = {y1, y2, ..., yn}, and is defined by a probability distribution P(Y), then we write the entropy of the random variable as

  H(Y) = − Σ_{i=1..n} P(yi) · log2 P(yi)

BITS Pilani, Hyderabad Campus


Example of Entropy

• Entropy of a discrete variable Y taking values in Ω = {0, 1}, with P[Y=1] = p and P[Y=0] = 1 − p:

  H(Y) = − p · log2 p − (1 − p) · log2 (1 − p)

• Uncertainty is maximal (H(Y) = 1 bit) when both events have the same probability, p = 1/2

BITS Pilani, Hyderabad Campus


Algorithm

• Calculate the entropy of your data.
• For each potential split in your data:
  – calculate the entropy in each potential bin
  – find the net entropy for the split
  – calculate the entropy gain
• Select the split with the highest entropy gain.
• Recursively (or iteratively, in some cases) perform the partition on each split until a termination criterion is met:
  – terminate once you reach a specified number of bins, or
  – terminate once entropy gain falls below a certain threshold.

BITS Pilani, Hyderabad Campus


• Net entropy: the entropy across the bins of a split, computed as the sum over bins of each bin's entropy weighted by the proportion of records falling into that bin.

• Entropy gain: Gain = E_initial − E_net, i.e., the entropy before the split minus the net entropy after the split.

BITS Pilani, Hyderabad Campus


Example

  Hours Studied   Scored A in Test?
  4               N
  5               Y
  8               N
  12              Y
  15              Y

Let's start by calculating the entropy of the data set itself:
  Scored A in Test = Y: 3,  Scored A in Test = N: 2  (overall 5 examples)
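A small Python sketch that carries this example through one level of entropy-based splitting (illustrative, not from the slides; the candidate thresholds are taken as midpoints between consecutive values of Hours Studied):

import math

hours  = [4, 5, 8, 12, 15]
scored = ["N", "Y", "N", "Y", "Y"]

def entropy(labels):
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

e_initial = entropy(scored)   # entropy of the whole data set, about 0.971 bits

best = None
for i in range(len(hours) - 1):
    threshold = (hours[i] + hours[i + 1]) / 2
    left  = [c for h, c in zip(hours, scored) if h <= threshold]
    right = [c for h, c in zip(hours, scored) if h > threshold]
    # net entropy: bin entropies weighted by the proportion of records in each bin
    e_net = (len(left) * entropy(left) + len(right) * entropy(right)) / len(scored)
    gain = e_initial - e_net
    if best is None or gain > best[1]:
        best = (threshold, gain)

print("initial entropy:", round(e_initial, 3))
print("best split at Hours Studied <=", best[0], "with gain", round(best[1], 3))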

BITS Pilani, Hyderabad Campus


When to Terminate

• There are two popular options for stopping the algorithm:

• Terminate when a specified number of bins has been reached.

• Terminate when information gain falls below a certain threshold.

BITS Pilani, Hyderabad Campus


Normalization / Standardization

• In many data mining tasks, variables that have vastly


different scales from each other may cause problems.
• e.g. height in inches vs cms or weight in pounds vs kilos

• Expressing an attribute in smaller units will lead to a larger


range for that attribute, and thus tend to give such an
attribute greater effect.
• To avoid dependence on the choice of measurement units,
the data should be normalized or standardized.

BITS Pilani, Hyderabad Campus


Normalization / Standardization
• Normalization is particularly useful for classification
algorithms.
• min-max normalization

• z-score normalization

• Normalization by decimal scaling

BITS Pilani, Hyderabad Campus


Min-max Normalization

• Transforms the data from the measured units to a new interval [new_min_F, new_max_F] for feature F:

  v' = (v − min_F) / (max_F − min_F) · (new_max_F − new_min_F) + new_min_F

  where v is the current value of feature F.

• Example: suppose the minimum and maximum values for the feature income are $12,000 and $98,000, respectively, and we would like to map income to the range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is transformed to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 ≈ 0.716.

BITS Pilani, Hyderabad Campus


z-score (zero-mean)
Normalization
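The formula itself is an image in the original slide; a standard definition (assumed here from the usual convention) is:

  v' = (v − mean_F) / std_F

where mean_F and std_F are the mean and standard deviation of feature F. For illustration, if income had mean $54,000 and standard deviation $16,000, a value of $73,600 would be transformed to (73,600 − 54,000) / 16,000 ≈ 1.225.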

BITS Pilani, Hyderabad Campus


Decimal Scaling
Normalization
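The formula here is likewise an image in the original slide; a standard definition (assumed here from the usual convention) is:

  v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1. For illustration, if a feature's values range from −986 to 917, then j = 3 and −986 normalizes to −0.986.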

BITS Pilani, Hyderabad Campus


Take home message

• Top reasons to use feature selection are:

• It enables the machine learning algorithm to train faster.

• It reduces the complexity of a model and makes it easier


to interpret.
• It improves the accuracy of a model if the right subset is
chosen.
• It reduces overfitting.

BITS Pilani, Hyderabad Campus
