
BITS Pilani, Hyderabad Campus

Dr. Aruna Malapati
Asst. Professor, Department of CSIS

Data
Today’s Learning Objectives

• Describe data

• List various data types

• List the issues in data quality

• List and identify the right preprocessing techniques for given data

BITS Pilani, Hyderabad Campus


What is Data?

• Data is a collection of data objects and their attributes

• An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – Other names: variable, field, characteristic, feature, predictor, etc.

• A collection of attributes describes an object
  – Other names: record, point, case, sample, entity, or instance

  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

BITS Pilani, Hyderabad Campus


Attribute Values

• Each attribute has a set of values that an object draws from.

• The same attribute can be mapped to different attribute values
  – Example: temperature can be measured in Celsius or Fahrenheit

• Different attributes can be mapped to the same set of values
  – Example: attribute values for ID and age are both integers

BITS Pilani, Hyderabad Campus


Types of Attributes

• Discrete: has a finite or countably infinite set of values
  – Examples: zip codes, terms in a document

• Continuous: has real numbers as values
  – Examples: length, weight, temperature, etc.

BITS Pilani, Hyderabad Campus


Properties of Attribute Values

• The type of an attribute depends on which of the following properties it possesses:
  – Distinctness: =, ≠
  – Order: <, >
  – Addition: +, −
  – Multiplication: *, /

• Attribute types by properties:
  – Nominal attribute: distinctness
  – Ordinal attribute: distinctness & order
  – Interval attribute: distinctness, order & addition
  – Ratio attribute: all 4 properties

BITS Pilani, Hyderabad Campus


Attribute Type | Description | Examples | Operations

Nominal   The values of a nominal attribute are just different names, i.e., nominal
          attributes provide only enough information to distinguish one object
          from another. (=, ≠)
          Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
          Operations: mode, entropy, contingency correlation, chi-square (χ²) test

Ordinal   The values of an ordinal attribute provide enough information to order
          objects. (<, >)
          Examples: hardness of minerals, {good, better, best}, grades, street numbers
          Operations: median, percentiles, rank correlation, run tests, sign tests

Interval  For interval attributes, the differences between values are meaningful,
          i.e., a unit of measurement exists. (+, −)
          Examples: calendar dates, temperature in Celsius or Fahrenheit
          Operations: mean, standard deviation, Pearson's correlation, t and F tests

Ratio     For ratio variables, both differences and ratios are meaningful. (*, /)
          Examples: temperature in Kelvin, monetary quantities, counts, age, mass,
          length, electrical current
          Operations: geometric mean, harmonic mean, percent variation
BITS Pilani, Hyderabad Campus
Discrete and Continuous
Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of
documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes

• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented
using a finite number of digits.
– Continuous attributes are typically represented as floating-point
variables.

BITS Pilani, Hyderabad Campus


Types of data sets

• Record
– Data Matrix
– Document Data
– Transaction Data

• Graph
– World Wide Web
– Molecular Structures

• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data

BITS Pilani, Hyderabad Campus


Important Characteristics of
Structured Data

– Dimensionality

• Curse of Dimensionality

– Sparsity

• Only presence counts

– Resolution

• Patterns depend on the scale

BITS Pilani, Hyderabad Campus


Record Data

• Data that consists of a collection of records, each of which consists of a fixed set of attributes

  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

BITS Pilani, Hyderabad Campus


Data Matrix
• If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of as
points in a multi-dimensional space, where each
dimension represents a distinct attribute

• Such data set can be represented by an m by n matrix,


where there are m rows, one for each object, and n
columns, one for each attribute

  Projection of x Load  Projection of y Load  Distance  Load  Thickness
  10.23                 5.27                  15.22     2.7   1.2
  12.65                 6.25                  16.22     2.2   1.1

BITS Pilani, Hyderabad Campus


Document Data

• Each document becomes a `term' vector,


– each term is a component (attribute) of the vector,
– the value of each component is the number of times the corresponding term
occurs in the document.

              team  coach  play  ball  score  game  win  lost  timeout  season
  Document 1   3     0      5     0     2      6     0    2     0        2
  Document 2   0     7      0     2     1      0     0    3     0        0
  Document 3   0     1      0     0     1      2     2    0     3        0
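As an illustration (not from the slides), here is a minimal scikit-learn sketch that builds such term-count vectors; the toy documents are made up for the example:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, not the ones behind the table above
docs = [
    "team play play game game score lost",
    "coach ball score lost lost",
    "coach score game game win win timeout timeout timeout",
]

# Each document becomes a vector of raw term counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse document-term matrix
print(vectorizer.get_feature_names_out())   # the terms (attributes)
print(X.toarray())                          # one row of counts per document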

BITS Pilani, Hyderabad Campus


Transaction Data
• A special type of record data, where
– each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of
products purchased by a customer during one
shopping trip constitute a transaction, while the
individual products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk

BITS Pilani, Hyderabad Campus


Graph Data
Examples: Generic graph and HTML Links
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers

BITS Pilani, Hyderabad Campus


Chemical Data

Benzene Molecule: C6H6

BITS Pilani, Hyderabad Campus


Ordered Data
Sequences of transactions

[Figure: customers' transaction sequences over time; an element of the sequence is one transaction]

BITS Pilani, Hyderabad Campus


Ordered Data
Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

BITS Pilani, Hyderabad Campus


Ordered Data

Spatio-Temporal Data

Average Monthly
Temperature of
land and ocean

BITS Pilani, Hyderabad Campus


Data Quality

• What kinds of data quality problems?

• How can we detect problems with the data?

• What can we do about these problems?

• Examples of data quality problems:


– Noise and outliers
– missing values
– duplicate data

BITS Pilani, Hyderabad Campus


Noise
• Noise: an invalid signal that overlaps and distorts the valid data
  – Examples: distortion of a person's voice when talking on a poor phone connection, or "snow" on a television screen

[Figures: two sine waves; the same two sine waves with noise added]


BITS Pilani, Hyderabad Campus
Outliers
• Outliers are data objects with characteristics that are considerably different from those of most other data objects in the data set

BITS Pilani, Hyderabad Campus


Missing Values
• Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

• Handling missing values


– Eliminate Data Objects
– Estimate Missing Values
– Ignore the Missing Value During Analysis
– Replace with all possible values (weighted by their
probabilities)
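A minimal pandas sketch of these handling options (illustrative, not from the slides; the DataFrame and its columns are assumptions):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 33],
                   "income": [50_000, 62_000, np.nan, 48_000]})

# Eliminate data objects (rows) that contain missing values
dropped = df.dropna()

# Estimate missing values, e.g. fill each column with its mean
imputed = df.fillna(df.mean(numeric_only=True))

# Ignore missing values during analysis: pandas skips NaNs here by default
column_means = df.mean(numeric_only=True)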

BITS Pilani, Hyderabad Campus


Duplicate Data

• Data set may include data objects that are duplicates, or


almost duplicates of one another
– Major issue when merging data from heterogeneous
sources

• Examples:
– Same person with multiple email addresses

• Data cleaning
– Process of dealing with duplicate data issues
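A minimal pandas sketch of de-duplication (illustrative, not from the slides; the column values are made up):

import pandas as pd

people = pd.DataFrame({
    "name":  ["A. Rao", "A. Rao", "B. Iyer"],
    "email": ["arao@x.com", "arao@y.com", "biyer@z.com"],
})

# Exact duplicate rows are easy to drop
deduped = people.drop_duplicates()

# Near-duplicates (same person, multiple email addresses) need a matching key
deduped_by_name = people.drop_duplicates(subset=["name"])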

BITS Pilani, Hyderabad Campus


Data Preprocessing

• Aggregation

• Sampling

• Dimensionality Reduction

• Feature subset selection

• Feature creation

• Discretization and Binarization

• Attribute Transformation

BITS Pilani, Hyderabad Campus


Aggregation

• Combining two or more attributes (or objects) into a single


attribute (or object)

• Purpose
– Data reduction
• Reduce the number of attributes or objects
– Change of scale
• Cities aggregated into regions, states, countries, etc
– More “stable” data
• Aggregated data tends to have less variability
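For instance, a change of scale from cities to states might look like this pandas sketch (illustrative, not from the slides; the data and column names are assumptions):

import pandas as pd

sales = pd.DataFrame({
    "city":   ["Hyderabad", "Warangal", "Mumbai", "Pune"],
    "state":  ["Telangana", "Telangana", "Maharashtra", "Maharashtra"],
    "amount": [120.0, 80.0, 300.0, 150.0],
})

# Aggregate cities into states: fewer objects, and the totals are more "stable"
by_state = sales.groupby("state", as_index=False)["amount"].sum()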

BITS Pilani, Hyderabad Campus


Aggregation
Variation of Precipitation in Australia

[Figures: standard deviation of average monthly precipitation vs. standard deviation of average yearly precipitation]
BITS Pilani, Hyderabad Campus
Sampling

• Sampling is the main technique employed for data


selection.
– It is often used for both the preliminary investigation
of the data and the final data analysis.

• Statisticians sample because obtaining the entire set of


data of interest is too expensive or time consuming.

• Sampling is used in data mining because processing the


entire set of data of interest is too expensive or time
consuming.

BITS Pilani, Hyderabad Campus


Sampling

• The key principle for effective sampling:
  – Using a sample will work almost as well as using the entire data set, provided the sample is representative (what makes a sample representative differs from data set to data set).

• Sampling may remove outliers and, if done improperly, can introduce noise.

BITS Pilani, Hyderabad Campus


Types of Sampling
• Simple Random Sampling
– There is an equal probability of selecting any particular item

• Sampling without replacement


– As each item is selected, it is removed from the population

• Sampling with replacement


– Objects are not removed from the population as they are selected for the
sample.
• In sampling with replacement, the same object can be picked up more than
once

• Stratified sampling
– Split the data into several partitions; then draw random samples from
each partition
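A small pandas sketch of these sampling variants (illustrative, not from the slides; the DataFrame and the 'label' column used for stratification are assumptions):

import pandas as pd

df = pd.DataFrame({"x": range(10), "label": ["a"] * 6 + ["b"] * 4})

# Simple random sampling without replacement
without_repl = df.sample(n=5, replace=False, random_state=0)

# Simple random sampling with replacement: an object may be picked more than once
with_repl = df.sample(n=5, replace=True, random_state=0)

# Stratified sampling: draw a 50% random sample from each partition of 'label'
stratified = df.groupby("label").sample(frac=0.5, random_state=0)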

BITS Pilani, Hyderabad Campus


Sample Size

[Figures: the same data set shown with 8000 points, 2000 points, and 500 points]

BITS Pilani, Hyderabad Campus


Curse of Dimensionality

• When dimensionality increases, data becomes increasingly sparse in the space that it occupies

• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful

• Illustration: randomly generate 500 points and compute the difference between the maximum and minimum distance between any pair of points; as the number of dimensions grows, this difference shrinks relative to the minimum distance
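A NumPy/SciPy sketch of that experiment (illustrative, not from the slides):

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for dims in [2, 10, 50, 200]:
    points = rng.random((500, dims))        # 500 random points in the unit hypercube
    d = pdist(points)                       # all pairwise distances
    spread = (d.max() - d.min()) / d.min()  # relative spread of distances
    print(dims, round(spread, 3))           # the spread shrinks as dims grows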

BITS Pilani, Hyderabad Campus


Dimensionality Reduction

• Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data
mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce noise

• Techniques
– Principal Component Analysis
– Singular Value Decomposition
– Others: supervised and non-linear techniques

BITS Pilani, Hyderabad Campus


Dimensionality Reduction:
PCA

• Goal is to find a projection that captures the largest amount of


variation in data

[Figure: data points in the (x1, x2) plane with the first principal direction u1 along the axis of greatest variation]

BITS Pilani, Hyderabad Campus


PCA

• Reduce high-dimensional data into something that can be explained in fewer dimensions.
• We need PCA when we suspect that the measures in our data set are not all independent, i.e., there exist correlations among them.

[Figure: a scatter of points in the (x1, x2) plane; assume the data set represents the height and weight of people in a region]
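An illustrative scikit-learn sketch of PCA on such two-dimensional data (not from the slides; the height/weight numbers are simulated, and the scaling step follows the normalization advice given on a later slide):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, size=200)
weight_kg = 0.9 * (height_cm - 100) + rng.normal(0, 5, size=200)  # correlated with height
X = np.column_stack([height_cm, weight_kg])

X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance per dimension
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X_scaled)        # project onto the top principal component
print(pca.explained_variance_ratio_)           # fraction of the total variance captured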
BITS Pilani, Hyderabad Campus
PCA in a nutshell

BITS Pilani, Hyderabad Campus


Dimensionality Reduction:
PCA
[Figure: an image reconstructed from its principal components; original dimensions = 206, then reduced to 160, 120, 80, 40, and 10 dimensions]

BITS Pilani, Hyderabad Campus


PCA Limitations

• Covariance is extremely sensitive to large values
  – Multiply some dimension by 1000 and it dominates the covariance and becomes the principal component
• Remedy: normalize each dimension to zero mean and unit variance

  X' = (X − mean) / standard deviation

BITS Pilani, Hyderabad Campus


PCA Limitations

• PCA assumes the underlying subspace is linear
  – in 1D, a line
  – in 2D, a plane

BITS Pilani, Hyderabad Campus


PCA and Classification

• PCA is unsupervised
• It maximizes the overall variance of the data along a small set of directions
• It does not know anything about class labels
• It can therefore pick a direction that makes it hard to separate the classes

BITS Pilani, Hyderabad Campus


Feature Subset Selection
• Another way to reduce dimensionality of data

• Redundant features
– duplicate much or all of the information contained in one or
more other attributes
– Example: purchase price of a product and the amount of
sales tax paid

• Irrelevant features
– contain no information that is useful for the data mining task
at hand
– Example: students' ID is often irrelevant to the task of
predicting students' GPA

BITS Pilani, Hyderabad Campus


Subset Selection

• D initial features

• There are 2^D possible subsets.

• Criterion to decide which subset is best:
  – classifier performance when trained on that feature subset

• The search space of 2^D subsets is huge, so we need heuristics.

BITS Pilani, Hyderabad Campus


Feature Subset Selection
Techniques

• Brute-force approach: Try all possible feature subsets as input


to data mining algorithm
• Filter approaches: Compute a score for each feature and then
select features according to the score.
• Wrapper approaches: score feature subsets by seeing their
performance on a dataset using a classification algorithm.
• Embedded approaches: Select features during the process of
training.

BITS Pilani, Hyderabad Campus


Filter Methods

1. Measure how well each attribute individually helps to discriminate between the classes
   • Which measure to use?
     – Pearson correlation coefficient
     – Information gain
     – Chi-square statistic
2. Rank all attributes
3. The user can then discard all attributes that do not meet a specified criterion, for example retaining only the best 10 attributes (see the sketch below)
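An illustrative scikit-learn sketch of a filter approach (not from the slides; the iris data set and the mutual-information score are assumptions):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Score every feature individually against the class label, keep the best 2
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)        # per-feature scores
print(selector.get_support())  # mask of the retained features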

BITS Pilani, Hyderabad Campus


Filter Methods

• Features are selected on the basis of their scores in various statistical tests of their correlation with the outcome variable.

BITS Pilani, Hyderabad Campus


Linear Discriminant Analysis (LDA)

• Finds the features (directions) that best separate two classes

• Assumes that the variables are normally distributed

• Also known as Fisher's LDA

BITS Pilani, Hyderabad Campus


LDA

• The basic idea is to look for an interesting direction such that, when the data points are projected onto this new dimension, the difference between the class means is maximized, normalized by their variances.

BITS Pilani, Hyderabad Campus


LDA

• Let {(Xi, yi), i = 1, 2, …, n} be the data

• Let yi ∈ {1, 2}

• Let C1 and C2 denote the two classes. Thus, if yi = 1 then Xi ∈ C1, and if yi = 2 then Xi ∈ C2.

• Let n1 and n2 denote the number of examples of each class (n = n1 + n2)

• For any W, let zi = W^T Xi
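An illustrative scikit-learn sketch of LDA used this way (not from the slides; the two-class data set is an assumption):

from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_breast_cancer(return_X_y=True)   # two classes, y in {0, 1}

# Project the data onto the single direction W that best separates the classes
lda = LinearDiscriminantAnalysis(n_components=1)
z = lda.fit_transform(X, y)                  # roughly z_i = W^T X_i (after centering)

print(z.shape)                               # (n_samples, 1)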

BITS Pilani, Hyderabad Campus


Linear Discriminant Analysis (LDA)

BITS Pilani, Hyderabad Campus


Wrapper Methods

BITS Pilani, Hyderabad Campus


Sequential Forward
Selection (SFS)
1. Start with an empty feature set

2. Try each remaining feature

3. Estimate classifier performance for adding each feature

4. Select feature that gives max improvement

5. Stop when there is no significant improvement

Disadvantage: once a feature is retained, it cannot be discarded (the "nesting" problem)
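A minimal sketch of SFS (illustrative, not from the slides; scikit-learn's SequentialFeatureSelector and a k-NN classifier are assumptions standing in for the manual loop above):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Start from the empty set and greedily add the feature that most improves
# cross-validated classifier performance, stopping at 2 features
sfs = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=2,
    direction="forward",
)
sfs.fit(X, y)
print(sfs.get_support())   # mask of the selected features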
BITS Pilani, Hyderabad Campus
Sequential Backward
Selection (SBS)

1. Start with the full feature set

2. Try removing each feature in turn

3. Drop the feature whose removal has the smallest impact on classifier performance

Disadvantage: SBS requires more computation than SFS

BITS Pilani, Hyderabad Campus


Search space for feature
selection
Forward Feature subset selection

Backward Feature subset selection

BITS Pilani, Hyderabad Campus


Feature Creation
• Create new attributes that can capture the important
information in a data set much more efficiently than the
original attributes
• Three general methodologies:

  – Feature extraction: multimedia features (low-, middle-, and high-level features)
    • domain-specific

  – Mapping data to a new space

  – Feature construction: combining features


BITS Pilani, Hyderabad Campus
Mapping Data to a New
Space
 Fourier transform (maps a time-domain signal to the frequency domain)
 Wavelet transform (gives a combined time-frequency representation)

[Figures: two sine waves; the same two sine waves with noise added; the resulting frequency spectrum]

BITS Pilani, Hyderabad Campus


Attribute Transformation

• A function that maps the entire set of values of a given


attribute to a new set of replacement values such that
each old value can be identified with one of the new
values
– Simple functions: x^k, log(x), e^x, |x|
– Standardization and Normalization

BITS Pilani, Hyderabad Campus


Discretization and Binarization

• Many data mining algorithms require the data to be discrete, often binary.
• Discretization is the process of converting
  • continuous-valued attributes, and
  • ordinal attributes with a large number of distinct values
  into discrete variables (nominal or numeric) with a small number of values.
• Discretization is performed by
  • choosing one or more threshold values from the range of the attribute to create intervals of the original value range, and
  • then putting the values inside each interval into a common bin.
• Choosing the best number of bins is an open problem, typically a trial-and-error process.
BITS Pilani, Hyderabad Campus
Binarization

• Can be used to convert both continuous and categorical attributes into a binary representation.
• If m categorical values exist, assign each value a unique integer in [0, m−1] and then convert these integers to a binary representation.
• If the attribute is ordinal, the order of the values needs to be preserved by the integer assignment.
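A small pandas sketch (illustrative, not from the slides; the attribute values are made up):

import pandas as pd

colour = pd.Series(["red", "green", "blue", "green"])

# m = 3 categorical values -> integers in [0, m-1], then one binary column per value
codes = colour.astype("category").cat.codes         # integer encoding 0..2
one_hot = pd.get_dummies(colour, prefix="colour")    # binary (one-hot) representation

# Ordinal attribute: the integer assignment must preserve the order of the values
quality = pd.Series(["good", "best", "better"])
order = ["good", "better", "best"]
quality_codes = pd.Categorical(quality, categories=order, ordered=True).codes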

BITS Pilani, Hyderabad Campus


Discretization

• Discretization of continuous attributes is most often performed one attribute at a time, independently of the other attributes.
• This approach is known as static attribute discretization. At the other end of the spectrum is dynamic attribute discretization, where all attributes are discretized simultaneously while taking into account the interdependencies among them.

BITS Pilani, Hyderabad Campus


Discretization

• Unsupervised discretization
  • Class labels are ignored
  • Equal-interval (equal-width) binning
  • Equal-frequency binning
  • The best number of bins k is determined experimentally

• Supervised discretization
  • Entropy-based discretization
  • Tries to maximize the "purity" of the intervals (i.e., each interval should contain as little a mixture of class labels as possible)
BITS Pilani, Hyderabad Campus
Unsupervised Discretization

• Requires the user to specify the number of intervals and/or how many data points should be included in any given interval.
• The following heuristics are often used to choose intervals:
  • The number of intervals for each attribute should not be smaller than the number of classes (if known).
  • Another popular heuristic is to choose the number of intervals n_Fi for each attribute Fi (i = 1, …, n, where n is the number of attributes) as n_Fi = M / (3 · C), where M is the number of training examples and C is the number of known categories.

BITS Pilani, Hyderabad Campus


Unsupervised
Discretization

BITS Pilani, Hyderabad Campus


Unsupervised Discretization

• Equal-frequency binning
  • An equal number of values is placed in each of the k bins.
  • Disadvantage: many occurrences of the same continuous value could cause that value to be assigned to different bins.
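An illustrative pandas sketch of both unsupervised schemes (not from the slides; the values and the choice of k = 3 bins are assumptions):

import pandas as pd

values = pd.Series([4, 5, 8, 12, 15, 18, 21, 30, 45])

# Equal-interval (equal-width) binning: 3 intervals of equal width
equal_width = pd.cut(values, bins=3)

# Equal-frequency binning: 3 bins with (roughly) the same number of values each
equal_freq = pd.qcut(values, q=3)

print(equal_width.value_counts())
print(equal_freq.value_counts())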

BITS Pilani, Hyderabad Campus


Example

BITS Pilani, Hyderabad Campus


Supervised Discretization

• Suppose you are analyzing the risk of Alzheimer's disease and you split the age data at age 16, age 24, and age 30. Your bins look something like this:

  <= 16
  16–24
  24–30
  > 30

• Now you have a giant bin of people older than 30, where most Alzheimer's patients are, and multiple bins split at lower values, where you are not really getting much information.

• Because of this issue, we want to make meaningful splits in our continuous variables.

BITS Pilani, Hyderabad Campus


Supervised Discretization

• Entropy-based discretization
• The main idea is to split the attribute’s value in a way that
generates bins as “pure” as possible
• We need a measure of “impurity of a bin” such that
• A bin with uniform class distribution has the highest
impurity
• A bin with all items belonging to the same class has
zero impurity
• The more skewed the class distribution in a bin, the smaller its impurity
• Entropy can be such measure of impurity

BITS Pilani, Hyderabad Campus


Entropy
• If a random variable Y takes on values in a set Y = {y1, y2, ..., yn}, and is defined by a probability distribution P(Y), then we write the entropy of the random variable as

  H(Y) = − Σ_{i=1..n} P(yi) · log2 P(yi)

BITS Pilani, Hyderabad Campus


Example of Entropy

• Entropy of a discrete variable Y taking values in Ω = {0, 1}, with P[Y=1] = p and P[Y=0] = 1 − p:

  H(Y) = − p · log2 p − (1 − p) · log2 (1 − p)

• Uncertainty is maximal (H(Y) = 1 bit) when both events have the same probability, p = 1/2

BITS Pilani, Hyderabad Campus


Algorithm

• Calculate the entropy of your data.
• For each potential split in your data:
  – calculate the entropy in each potential bin
  – find the net entropy for the split
  – calculate the entropy gain
• Select the split with the highest entropy gain.
• Recursively (or iteratively, in some cases) perform the partition on each split until a termination criterion is met:
  – terminate once you reach a specified number of bins, or
  – terminate once entropy gain falls below a certain threshold.

BITS Pilani, Hyderabad Campus


• Net entropy: the entropy across the bins of a split, computed as the sum over bins of each bin's entropy weighted by the proportion of records falling into that bin.

• Entropy gain: Gain = E_initial − E_net, i.e., the entropy before the split minus the net entropy after the split.

BITS Pilani, Hyderabad Campus


Example

  Hours Studied   Scored A in Test?
  4               N
  5               Y
  8               N
  12              Y
  15              Y

Let's start by calculating the entropy of the data set itself:
  Scored A in Test = Y: 3,  Scored A in Test = N: 2  (overall 5 examples)
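A small Python sketch that carries this example through one level of entropy-based splitting (illustrative, not from the slides; the candidate thresholds are taken as midpoints between consecutive values of Hours Studied):

import math

hours  = [4, 5, 8, 12, 15]
scored = ["N", "Y", "N", "Y", "Y"]

def entropy(labels):
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

e_initial = entropy(scored)   # entropy of the whole data set, about 0.971 bits

best = None
for i in range(len(hours) - 1):
    threshold = (hours[i] + hours[i + 1]) / 2
    left  = [c for h, c in zip(hours, scored) if h <= threshold]
    right = [c for h, c in zip(hours, scored) if h > threshold]
    # net entropy: bin entropies weighted by the proportion of records in each bin
    e_net = (len(left) * entropy(left) + len(right) * entropy(right)) / len(scored)
    gain = e_initial - e_net
    if best is None or gain > best[1]:
        best = (threshold, gain)

print("initial entropy:", round(e_initial, 3))
print("best split at Hours Studied <=", best[0], "with gain", round(best[1], 3))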

BITS Pilani, Hyderabad Campus


When to Terminate

• There are two popular options for stopping the algorithm:

• Terminate when a specified number of bins has been reached.

• Terminate when information gain falls below a certain threshold.

BITS Pilani, Hyderabad Campus


Normalization / Standardization

• In many data mining tasks, variables that have vastly


different scales from each other may cause problems.
• e.g. height in inches vs cms or weight in pounds vs kilos

• Expressing an attribute in smaller units will lead to a larger


range for that attribute, and thus tend to give such an
attribute greater effect.
• To avoid dependence on the choice of measurement units,
the data should be normalized or standardized.

BITS Pilani, Hyderabad Campus


Normalization / Standardization
• Normalization is particularly useful for classification
algorithms.
• min-max normalization

• z-score normalization

• Normalization by decimal scaling

BITS Pilani, Hyderabad Campus


Min-max Normalization

• Transforms the data from the measured units to a new interval [new_min_F, new_max_F] for feature F:

  v' = (v − min_F) / (max_F − min_F) · (new_max_F − new_min_F) + new_min_F

  where v is the current value of feature F.

• Example: suppose the minimum and maximum values for the feature income are $12,000 and $98,000, respectively, and we would like to map income to the range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is transformed to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 ≈ 0.716.

BITS Pilani, Hyderabad Campus


z-score (zero-mean)
Normalization
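The formula itself is an image in the original slide; a standard definition (assumed here from the usual convention) is:

  v' = (v − mean_F) / std_F

where mean_F and std_F are the mean and standard deviation of feature F. For illustration, if income had mean $54,000 and standard deviation $16,000, a value of $73,600 would be transformed to (73,600 − 54,000) / 16,000 ≈ 1.225.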

BITS Pilani, Hyderabad Campus


Decimal Scaling
Normalization
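The formula here is likewise an image in the original slide; a standard definition (assumed here from the usual convention) is:

  v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1. For illustration, if a feature's values range from −986 to 917, then j = 3 and −986 normalizes to −0.986.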

BITS Pilani, Hyderabad Campus


Take home message

• Top reasons to use feature selection are:

• It enables the machine learning algorithm to train faster.

• It reduces the complexity of a model and makes it easier


to interpret.
• It improves the accuracy of a model if the right subset is
chosen.
• It reduces overfitting.

BITS Pilani, Hyderabad Campus
