UNIT-3 DATA MINING - Part1


DATA WAREHOUSING

AND MINING
UNIT-3_PART1
BTECH- IT421

KAVITA SETHIA
Sources of Data
Businesses generate gigantic data sets, including worldwide sales transactions, stock trading records, product descriptions, sales promotions, company profiles and performance, and customer feedback.
Scientific and engineering practices generate high
orders of petabytes of data in a continuous manner,
from remote sensing, process measuring, scientific
experiments, system performance, engineering
observations, and environment surveillance.

Billions of Web searches supported by search engines
process tens of petabytes of data daily. Communities
and social media have become increasingly important
data sources, producing digital pictures and videos,
blogs, Web communities, and various kinds of social
networks.

The medical and health industry generates tremendous amounts of data from medical records, patient monitoring, and medical imaging.

The list of sources that generate huge amounts of data is endless.
Powerful and versatile tools are needed to
automatically uncover valuable information from the
tremendous amounts of data and to transform such
data into organized knowledge.

This necessity has led to the birth of data mining.
Knowledge Discovery Process (previous year paper)
 Knowledge discovery concerns the entire knowledge extraction process, including how data are stored and accessed, how to use efficient and scalable algorithms to analyze massive datasets, how to interpret and visualize the results, and how to model and support the interaction between human and machine.
 It also concerns support for learning and analyzing the application
domain.

The knowledge discovery process is shown in Figure.

Steps in Knowledge Discovery Process
KDD is an iterative sequence of the following steps:

1. Data cleaning - to remove noise and inconsistent data.
2. Data integration - where multiple data sources may be combined.
3. Data selection - where data relevant to the analysis task are retrieved from the database.
4. Data transformation - where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations.

5. Data mining - an essential process where intelligent methods are applied to extract data patterns.
6. Pattern evaluation - to identify the truly interesting patterns representing knowledge based on interestingness measures.
7. Knowledge presentation - where visualization
and knowledge representation techniques are used to
present mined knowledge to users.
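A minimal sketch of these steps as a Python pipeline (the file names, column names, and the choice of a decision-tree miner are illustrative assumptions, not part of the KDD definition):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

# 1-2. Data cleaning and integration: load two hypothetical sources and merge them
sales = pd.read_csv("sales.csv")            # assumed file
customers = pd.read_csv("customers.csv")    # assumed file
data = sales.merge(customers, on="customer_id").drop_duplicates()
data = data.fillna(data.median(numeric_only=True))       # crude missing-value handling

# 3-4. Data selection and transformation: keep only task-relevant attributes
features = data[["age", "income", "num_purchases"]]      # assumed columns
target = data["credit_risk"]                              # assumed class label

# 5. Data mining: apply an intelligent method to extract patterns
model = DecisionTreeClassifier(max_depth=3).fit(features, target)

# 6. Pattern evaluation: judge whether the extracted patterns look interesting/accurate
print("training accuracy:", accuracy_score(target, model.predict(features)))

# 7. Knowledge presentation: show the mined knowledge (here, the tree rules) to the user
print(export_text(model, feature_names=list(features.columns)))
```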

Contents:
Introduction to Data Mining
Data Mining
On What Kind of Data
What kind of patterns to be mined
Classification of data mining systems
Data Mining Task primitives
Major issues in Data Mining

What is Data Mining
Data mining is the process of discovering
knowledge in form of interesting patterns and
relationships from large amounts of data.
“Data mining,” writes Joseph P. Bigus in his book Data Mining with Neural Networks, “is the efficient discovery of valuable, non-obvious information from a large collection of data.”
 Data mining centers around the automated
discovery of new facts and relationships in data.

It is a multi-disciplinary skill that uses machine
learning, statistics, AI and database technology.

With traditional query tools, you search for known information. Data mining tools enable you to uncover hidden information.

Difference between KDD and Data Mining
Although the two terms KDD and Data Mining are often used interchangeably, they refer to two related
yet slightly different concepts. KDD is the overall
process of extracting knowledge from data while
Data Mining is a step inside the KDD process,
which deals with identifying patterns in data. In
other words, Data Mining is only the application of a
specific algorithm based on the overall goal of the
KDD process.

Classification of Data Mining Systems (previous year paper)
A data mining system can be classified based on the kind
of :

(a) Databases mined - what kind of data can be mined?
 database data
 data warehouse data
 transactional data
(b) Knowledge mined - what kind of patterns are extracted?
 the mining of frequent patterns, associations, and correlations
 classification and regression
 clustering analysis
 outlier analysis
 characterization and discrimination

(c) Applications adapted
 Finance
 Retail and Telecommunication

 Science and Engineering

 Intrusion Detection and Prevention

 Recommender System

What kinds of data can be mined?
As a general technology, data mining can be applied
to any kind of data as long as the data are meaningful
for a target application. The most basic forms of data
for mining applications are :
database data
data warehouse data
transactional data

Database Data
 Relational data can be accessed by database queries written
in a relational query language (e.g., SQL) or with the assistance
of graphical user interfaces.

 Suppose that your job is to analyze the AllElectronics store data.
Through the use of relational queries, you can ask things like:

 “Show me a list of all items that were sold in the last quarter.”
 “Show me the total sales of the last month, grouped by branch.”
 “How many sales transactions occurred in the month of December?”
 “Which salesperson had the highest sales?”

When mining relational databases, we can go
further by searching for trends or data patterns. For
example, data mining systems can analyse customer
data to predict the credit risk of new customers based
on their income, age, and previous credit information.

Data Warehouse
 A Data Warehouse is a repository of information collected from
multiple sources, stored under a unified schema, and usually
residing at a single site.

 Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing.

 The data are stored to provide information from a historical perspective, such as the past 6 to 12 months, and are typically summarized.

 For example, rather than storing the details of each sales transaction, the data warehouse may store a summary of the transactions per item_type for each store or, summarized to a higher level, for each sales region.
A data warehouse is usually modelled by a
multidimensional data structure, called a data cube,
in which each dimension corresponds to an attribute
or a set of attributes in the schema, and each cell
stores the value of some aggregate measure such as
count or sum(sales_amount).

A data cube provides a multidimensional view of data and allows the precomputation and fast access of summarized data.

It allows the exploration of multiple combinations of dimensions at
varying levels of granularity in data mining, and thus has
greater potential for discovering interesting patterns representing
knowledge.
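The kind of aggregation stored in a data cube cell, such as sum(sales_amount), can be sketched with a pandas pivot table (the item_type, region, and sales_amount columns and the numbers are illustrative assumptions):

```python
import pandas as pd

sales = pd.DataFrame({
    "item_type":    ["computer", "printer", "computer", "printer"],
    "region":       ["North",    "North",   "South",    "South"],
    "sales_amount": [1200, 300, 900, 250],
})

# Each cell of this two-dimensional "cube" stores sum(sales_amount)
cube = pd.pivot_table(sales, values="sales_amount",
                      index="item_type", columns="region", aggfunc="sum")
print(cube)
```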
Transactional Data
 Each record in a transactional database captures a transaction,
such as a customer’s purchase, a flight booking, or a user’s clicks
on a web page.

 A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the transaction, such as the items purchased in the transaction.

 Transactions can be stored in a table, with one record per transaction.

 Because most relational database systems do not support nested relational structures, the transactional database is usually stored in a flat file.
Suppose you want to know, “Which items sold well
together?”
This kind of market basket data analysis would enable
you to bundle groups of items together as a strategy
for boosting sales.
For example, given the knowledge that printers are
commonly purchased together with computers, you
could offer certain printers at a discounted price.

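A small sketch of how such co-occurring items could be counted from transactional data (the toy in-memory transaction list below is an assumption, not real AllElectronics data):

```python
from itertools import combinations
from collections import Counter

# Each transaction is the set of items purchased together
transactions = [
    {"computer", "printer", "software"},
    {"computer", "printer"},
    {"computer", "software"},
    {"printer", "paper"},
]

# Count how often each pair of items appears in the same transaction
pair_counts = Counter()
for items in transactions:
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))   # e.g. ('computer', 'printer') occurs twice
```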
Other Kinds of Data
By mining user comments on products (which are
often submitted as short text messages), we can assess
customer sentiments and understand how well a
product is accepted by a market.
By mining video data of a hockey game, we can detect
video sequences corresponding to goals.
Stock exchange data can be mined to uncover trends
that could help you plan investment strategies.
With spatial data, we may look for patterns that
describe changes in metropolitan poverty rates based
on city distances from major highways.

What kind of patterns can be mined?
There are a number of data mining
functionalities. These include:
the mining of frequent patterns, associations, and
correlations
classification and regression
clustering analysis
outlier analysis
characterization and discrimination

Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks.
In general, such tasks can be classified into two categories:
descriptive and predictive.

Predictive data mining tasks come up with a model from the available data set that is helpful in predicting unknown or future values of another data set of interest. A medical practitioner trying to diagnose a disease based on the medical test results of a patient is performing a predictive data mining task.

Descriptive data mining tasks find patterns that describe the data and come up with new, significant information from the available data set. A retailer trying to identify products that are frequently purchased together is performing a descriptive data mining task.
Mining frequent patterns, associations and
correlations
 Frequent patterns, as the name suggests, are patterns that occur frequently in data.

 There are many kinds of frequent patterns, including frequent itemsets and
frequent subsequences.

 A frequent itemset typically refers to a set of items that often appear together in a transactional data set—for example, milk and bread, which are frequently bought together in grocery stores by many customers.

 A frequently occurring subsequence, such as the pattern that customers tend to purchase first a laptop, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern.
Association analysis
Suppose that, as a marketing manager at AllElectronics, you
want to know which items are frequently purchased together (i.e.,
within the same transaction).
An example of such a rule, mined from the AllElectronics transactional database, is

buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]

where X is a variable representing a customer.

• A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well.
• A 1% support means that 1% of all the transactions under analysis show that computer and software are purchased together.
The rule

age(X, “20..29”) ∧ income(X, “40K..49K”) ⇒ buys(X, “laptop”) [support = 2%, confidence = 60%]

indicates that of the AllElectronics customers under study, 2% are 20 to 29 years old with an income of $40,000 to $49,000 and have purchased a laptop (computer) at AllElectronics. There is a 60% probability that a customer in this age and income group will purchase a laptop.

As it involves more than one attribute or predicate (age, income, buys), it is known as a multidimensional association rule.

Typically, association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold.
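A minimal sketch of computing support and confidence for the rule computer ⇒ software over a toy transaction list (the four transactions below are invented, so the resulting numbers differ from the 1%/50% in the example above):

```python
transactions = [
    {"computer", "software"},
    {"computer"},
    {"computer", "software", "printer"},
    {"printer"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "software"} <= t)
antecedent = sum(1 for t in transactions if "computer" in t)

support = both / n               # fraction of all transactions containing both items
confidence = both / antecedent   # fraction of computer-buyers who also buy software
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```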
Classification and Regression for Predictive Analysis
Classification is the process of finding a model
(or function) that describes and distinguishes
data classes or concepts.

The model is derived based on the analysis of a set of training data (i.e., data objects for which the class labels are known). The model is used to predict the class label of objects for which the class label is unknown.

The derived model may be represented in various forms, such as classification rules (i.e., IF-THEN rules) or decision trees.
A decision tree is a flowchart-like tree structure,
where each node denotes a test on an attribute value,
each branch represents an outcome of the test, and
tree leaves represent classes or class distributions.

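A brief sketch of deriving such a model from class-labeled training data and using it to label a new object, with scikit-learn (the credit-risk attributes and labels below are made up for illustration):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Training data: [age, income]; the class labels are known
X_train = [[25, 30000], [40, 80000], [35, 45000], [50, 90000]]
y_train = ["high_risk", "low_risk", "high_risk", "low_risk"]

tree = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)

# The derived model shown as IF-THEN style rules (a flowchart-like tree)
print(export_text(tree, feature_names=["age", "income"]))

# Predict the class label of an object whose label is unknown
print(tree.predict([[30, 85000]]))
```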
Whereas classification predicts categorical (discrete, unordered) labels, regression models continuous-valued functions. That is, regression is used to predict missing or unavailable numerical data values rather than (discrete) class labels.

The term prediction refers to both numeric prediction and class label prediction:

PREDICTION
 Classification (nominal or categorical labels)
 Regression (numerical values)
Regression Analysis
Regression analysis is a statistical methodology that is
most often used for numeric prediction.

Regression analysis is used in statistics to find trends in data.

For example, you might guess that there’s a connection between how much you eat and how much you weigh; regression analysis can help you quantify that.

Regression analysis will provide you with an equation for a graph so that you can make predictions about your data. For example, if you’ve been putting on weight over the last few years, it can predict how much you’ll weigh in ten years’ time if you continue to put on weight at the same rate.
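A minimal sketch of that weight-over-time example as a least-squares line fit (the yearly weights below are invented numbers):

```python
import numpy as np

year = np.array([2015, 2016, 2017, 2018, 2019])
weight = np.array([70.0, 71.5, 73.0, 74.2, 76.0])   # kg, made-up measurements

# Fit a straight line: weight ≈ a * year + b
a, b = np.polyfit(year, weight, deg=1)
print(f"trend: {a:.2f} kg per year")

# Use the fitted equation to predict a future (numeric) value
print("predicted weight in 2029:", round(a * 2029 + b, 1))
```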
Clustering
Unlike classification and regression, which analyze class-labeled (training) data sets, clustering analyzes data objects without consulting class labels.

The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. That is, clusters of objects are formed so that objects within a cluster have high similarity in comparison to one another, but are rather dissimilar to objects in other clusters.
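A small sketch of grouping unlabeled objects with k-means clustering (toy two-dimensional points; the choice of k = 2 clusters is arbitrary):

```python
from sklearn.cluster import KMeans

# Unlabeled data objects, e.g. customers described by (age, annual spend)
points = [[22, 500], [25, 550], [24, 520],
          [48, 3000], [52, 3200], [50, 3100]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(km.labels_)            # cluster assignment of each object
print(km.cluster_centers_)   # cluster prototypes (high intra-cluster similarity)
```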

[Figure: formation of clusters with similar properties]
Outlier Analysis
A data set may contain objects that do not comply
with the general behavior or model of the data.
These data objects are outliers. Many data mining
methods discard outliers as noise or exceptions.

However, in some applications (e.g., fraud detection) the rare events can be more interesting than the more regularly occurring ones. The analysis of outlier data is referred to as outlier analysis or anomaly mining.

For example: Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of unusually large amounts for a given account number in comparison to regular charges incurred by the same account. Outlier values may also be detected with respect to the locations and types of purchase, or the purchase frequency.

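A minimal sketch of flagging an unusually large charge on one account with a z-score test (the amounts are invented; the cutoff of 2 standard deviations is a common but arbitrary choice):

```python
import numpy as np

# Charges on a single credit card account (the last one is suspicious)
amounts = np.array([35, 60, 42, 55, 48, 39, 51, 4500])

z = (amounts - amounts.mean()) / amounts.std()
print(amounts[np.abs(z) > 2])   # -> [4500], an outlier w.r.t. the regular charges
```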
Data Characterization
Data characterization is a summarization of the
general characteristics or features of a target class
of data. The data corresponding to the user-specified
class are typically collected by a query.

For example, to study the characteristics of software products with sales that increased by 10% in the previous year, the data related to such products can be collected by executing an SQL query on the sales database.
The output of data characterization can be presented
in various forms. Examples include pie charts, bar
charts, curves, multidimensional data cubes, and
multidimensional tables, including crosstabs.

Data Discrimination
 Data discrimination is a comparison of the general features of
the target class data objects against the general features of
objects from one or multiple contrasting classes.

 The target and contrasting classes can be specified by a user, and the
corresponding data objects can be retrieved through database queries.

 For example, a user may want to compare the general features of software products with sales that increased by 10% last year against those with sales that decreased by at least 30% during the same period.

 The methods used for data discrimination are similar to those used for
data characterization.

Data Mining Life Cycle / Implementation Process
The life cycle of a data mining project consists of six
phases. The sequence of the phases is not rigid.
Moving back and forth between different phases is
always required depending upon the outcome of each
phase. The main phases are:

Business Understanding
In this phase, business and data-mining goals are established.
First, you need to understand business and client objectives.
You need to define what your client wants (which many times
even they do not know themselves)
Take stock of the current data mining scenario. Factor resources, assumptions, constraints, and other significant considerations into your assessment.
Using business objectives and current scenario, define your
data mining goals.
A good data mining plan is very detailed and should be
developed to accomplish both business and data mining goals.

Data Understanding
 Performed to check whether Data is appropriate for the data mining
goals.
 First, data is collected from multiple data sources available in the
organization.
 These data sources may include multiple databases, flat files or data
cubes. There are issues like object matching and schema
integration which can arise during Data Integration process. It is a
quite complex and tricky process, as data from various sources are unlikely to match easily.
 Next, the step is to search for properties of acquired data. A good
way to explore the data is to answer the data mining questions
(decided in business phase) using the query, reporting, and
visualization tools.
 Based on the results of query, the data quality should be ascertained.
Missing data if any should be acquired.
Data Preparation/Preprocessing
 In this phase, data is made production ready.
 The data preparation process consumes about 90% of the time of the
project.
 The data from different sources should be selected,
cleaned, transformed, formatted, anonymized, and constructed (if
required).
 Data cleaning is a process to "clean" the data by smoothing noisy
data and filling in missing values.
 For example, for a customer demographics profile, age data is
missing. The data is incomplete and should be filled. In some cases,
there could be data outliers. For instance, age has a value 100. Data
could be inconsistent. For instance, name of the customer is different
in different tables.
 Data transformation operations change the data to make it useful in data mining. The following transformations can be applied:
Data Transformation
Data transformation operations would contribute toward the success of
the mining process.
 Smoothing: It helps to remove noise from the data, e.g., using binning.
 Aggregation: Summary or aggregation operations are applied to the data, e.g., weekly sales data is aggregated to calculate the monthly and yearly totals.
 Generalization: In this step, low-level data is replaced by higher-level concepts with the help of concept hierarchies. For example, the city is replaced by the county.
 Normalization: Normalization is performed when the attribute data are scaled up or scaled down. Example: data should fall in the range -2.0 to 2.0 post-normalization.

 The result of this process is a final data set that can be used in
modeling.
Modelling
In this phase, mathematical models are used to determine
data patterns.
Based on the business objectives, suitable
modeling techniques should be selected for the
prepared dataset.
Create a scenario to test and check the quality and validity of the model.
Run the model on the prepared dataset.
Results should be assessed by all stakeholders to
make sure that model can meet data mining objectives.
Evaluation
In this phase, patterns identified are evaluated against the business objectives.

Results generated by the data mining model should be evaluated against the business objectives.

Gaining business understanding is an iterative process; in fact, new business requirements may be raised because of data mining findings.

A go or no-go decision is taken to move the model into the deployment phase.
Deployment
In the deployment phase, you ship your data mining discoveries to everyday business operations.
The knowledge or information discovered during data mining
process should be made easy to understand for non-technical
stakeholders.
A detailed deployment plan, for shipping, maintenance, and
monitoring of data mining discoveries is created.
A final project report is created with lessons learned and key
experiences during the project. This helps to improve the
organization's business policy.
Challenges to Implementation
of Data Mining
Skilled experts are needed to formulate the data mining queries.
Due to small training databases, a model may not fit future states.
Data mining needs large databases which are sometimes difficult to manage.
Business practices may need to be modified to make use of the information uncovered.
If the data set is not diverse, data mining results may not be accurate.
Integration of information from heterogeneous databases and global information systems could be complex.
Data Mining Tools
Following are 2 popular Data Mining Tools widely used in
Industry
R-language:
R language is an open source tool for statistical computing and graphics. R has a wide variety of statistical techniques (classical statistical tests, time-series analysis, classification) and graphical techniques. It offers effective data handling and storage facilities.
Oracle Data Mining:
Oracle Data Mining popularly known as ODM is a module of
the Oracle Advanced Analytics Database. This Data mining
tool allows data analysts to generate detailed insights and make predictions. It helps predict customer behavior, develop customer profiles, and identify cross-selling opportunities.

Benefits of Data Mining
Data mining techniques help companies to get knowledge-based information.
Data mining helps organizations to make profitable adjustments in operation and production.
Data mining is a cost-effective and efficient solution compared to other statistical data applications.
Data mining helps with the decision-making process.
It facilitates automated prediction of trends and behaviors as well as automated discovery of hidden patterns.
It can be implemented in new systems as well as existing platforms.
It is a speedy process which makes it easy for users to analyze huge amounts of data in less time.
Limitations of Data Mining
There are chances that companies may sell useful information about their customers to other companies for money. For example, American Express has sold credit card purchases of their customers to other companies.
Many data mining analytics tools are difficult to operate and require advanced training to work with.
Different data mining tools work in different manners due to the different algorithms employed in their design. Therefore, the selection of the correct data mining tool is a very difficult task.
If the data mining techniques are not accurate, they can cause serious consequences in certain conditions.
Applications of Data Mining / Where is it used?
Data mining is used in many domains, including finance, retail and telecommunication, science and engineering, intrusion detection and prevention, and recommender systems.
Data Mining Task Primitives
Each user will have a data mining task in mind, that is,
some form of data analysis that he or she would like to
have performed.

 A data mining task can be specified in the form of a data mining query, which is input to the data mining system. A data mining query is defined in terms of data mining task primitives.

These primitives allow the user to interactively communicate with the data mining system during discovery in order to direct the mining process, or examine the findings from different angles or depths.
The five data mining task primitives are:

1) The set of task-relevant data to be mined


2) The kind of knowledge to be mined
3) The background knowledge to be used in the discovery
process
4) The interestingness measures and
thresholds for pattern evaluation
5) The expected representation for
visualizing the discovered patterns

The set of task-relevant data to be mined: This
specifies the portions of the database or the set of data
in which the user is interested. This includes the
database attributes or data warehouse dimensions of
interest.

The kind of knowledge to be mined: This specifies


the data mining functions to be performed, such as
characterization, discrimination, association or
correlation analysis, classification, prediction,
clustering, outlier analysis, or evolution analysis.

The background knowledge to be used in the discovery process: Background knowledge allows data to be mined at multiple levels of abstraction. For example, concept hierarchies are one kind of background knowledge that allows data to be mined at multiple levels of abstraction.

The interestingness measures and thresholds for pattern evaluation: They may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have different interestingness measures. For example, interestingness measures for association rules include support and confidence. Rules whose support and confidence values are below user-specified thresholds are considered uninteresting.
The expected representation for visualizing the
discovered patterns: This refers to the form in which
discovered patterns are to be displayed, which may
include rules, tables, charts, graphs, decision trees,
and cubes.

A data mining query language can be designed to incorporate these primitives, allowing users to flexibly interact with data mining systems. Having a data mining query language provides a foundation on which user-friendly graphical interfaces can be built.
Major Issues in Data Mining (previous year paper)
Data mining is not an easy task, as the algorithms
used can get very complex and data is not always
available at one place. It needs to be integrated from
various heterogeneous data sources. These factors
also create some issues. The major issues are:

Mining Methodology and User Interaction


Performance Issues
Diverse Data Types Issues
Mining Methodology and User Interaction Issues
 It refers to the following kinds of issues :
 Mining different kinds of knowledge in databases : Different users
may be interested in different kinds of knowledge. Therefore it is
necessary for data mining to cover a broad range of knowledge
discovery task.
 Interactive mining of knowledge at multiple levels of abstraction:
The data mining process needs to be interactive because it allows
users to focus the search for patterns, providing and refining data
mining requests based on the returned results.
 Incorporation of background knowledge: To guide discovery
process and to express the discovered patterns, the background
knowledge can be used. Background knowledge may be used to
express the discovered patterns not only in concise terms but at
multiple levels of abstraction.
Data mining query languages and ad hoc data mining: Data
Mining Query language that allows the user to describe ad hoc
mining tasks, should be integrated with a data warehouse query
language and optimized for efficient and flexible data mining.
Presentation and visualization of data mining results: Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
Handling noisy or incomplete data: The data cleaning
methods are required to handle the noise and incomplete
objects while mining the data regularities. If the data cleaning
methods are not there then the accuracy of the discovered
patterns will be poor.
Pattern evaluation: The patterns discovered should be interesting. So techniques are needed to assess the interestingness of discovered patterns.
Performance Issues
There can be performance-related issues such as follows :
Efficiency and scalability of data mining algorithms: In
order to effectively extract the information from huge amount
of data in databases, data mining algorithm must be efficient
and scalable. The running time should be predictable, short, and acceptable by applications.
Parallel, distributed, and incremental mining algorithms:
The factors such as huge size of databases, wide distribution of
data, and complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms.
These algorithms divide the data into partitions, which are processed in parallel. Then the results from the partitions are merged. Incremental algorithms incorporate database updates without mining the entire data again from scratch.
Diverse Data Type Issues
Handling of relational and complex types of data: The
database may contain complex data objects, multimedia
data objects, spatial data, temporal data etc. It is not
possible for one system to mine all these kinds of data.

Mining information from heterogeneous databases and global information systems: The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured or unstructured. Therefore mining the knowledge from them adds challenges to data mining.

[Figure: scatter plots when there is no correlation between attributes]
Data Preprocessing (previous year paper)
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Data Discretization
Concept Hierarchy Generation

Data Preprocessing
Today’s real-world databases are highly susceptible to
noisy, missing, and inconsistent data due to their
typically huge size (often several gigabytes or more) and
their likely origin from multiple, heterogeneous sources.

“Low-quality data will lead to low-quality mining results.”

Data preprocessing is a very crucial step of the knowledge discovery process that is required to prepare the data by performing cleaning, integration, reduction and transformation before applying data mining techniques to discover knowledge.
There are many factors comprising data quality,
including accuracy, completeness, consistency,
timeliness, believability, and interpretability.

Inaccurate, incomplete, and inconsistent data are commonplace properties of large real-world databases and data warehouses.

Preprocessing techniques can help identify erroneous values and outliers.

Data preprocessing techniques, when applied before mining, can substantially improve the overall quality of the patterns mined and/or the time required for the actual mining.
Data Cleaning
Data to be analyzed by data mining can be incomplete (missing attribute values or certain attributes of interest, or containing only aggregate data), noisy (containing errors, or outlier values which deviate from the expected), and inconsistent (containing discrepancies in the department codes used to categorize items).

Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.

Handling Missing Data

What is missing data?

Missing data are defined as values that are not available. Missing data can be anything from a missing sequence, incomplete features, missing files, incomplete information, data entry errors, etc.
Most datasets in the real world contain missing data.
Why is handling missing data important?

The problem of missing data is prevalent in most research areas. Missing data produces various problems:

1. The missingness of data reduces the power of statistical methods.
2. The missing data can cause bias in the model.
3. Many machine learning packages in Python do not accept missing data; the missing data need to be treated first (a preprocessing step).
Missing data mechanisms

Missing Completely At Random (MCAR):
• The values are Missing Completely At Random if the missing data is completely unrelated to both observed and missing instances.
• An example of MCAR is a weighing scale that ran out of batteries.

Missing At Random (MAR):
• Missing At Random is when the missing data is related to the observed data but not to the missing data.
• Example: if women are less likely to tell you their weight than men, weight is MAR.

Missing Not At Random (MNAR):
• Missing Not At Random is data that is neither MAR nor MCAR. This implies that the missing data is related to both observed and missing instances.
• Example: people with the lowest education are missing on education, or the sickest people are most likely to drop out of the study.
Ways of handling missing values

1. Deleting rows (listwise deletion)
Pros: This is a better approach when the data size is small. If missingness is MCAR, then listwise deletion may be a reasonable strategy.
Cons: Loss of information and data. Works poorly if the percentage of missing values is high (say 30%) compared to the whole dataset.

2. Replacing with mean/median/mode
Pros: This is a better approach when the data size is small. It can prevent data loss which results in removal of the rows and columns.
Cons: Adds variance and bias. Considered only as a proxy for the true values.

3. Predicting the missing values
Pros: Imputation is good as long as the bias from it is smaller than the omitted variable bias. Yields unbiased estimates of the model parameters.
Cons: Bias also arises when an incomplete conditioning set is used for a categorical variable.

4. Using algorithms which support missing values
Pros: Does not require creation of a predictive model for each attribute with missing data in the dataset.
Cons: Correlation of the data is neglected. It is a very time-consuming process and it can be critical in data mining where large databases are being extracted.
Handling Missing Values
Suppose there are many tuples that have no recorded values for several attributes. Various methods to address missing values for the attribute are:

1. Ignore the tuple: This is usually done when the class label is
missing (assuming the mining task involves classification). This
method is not very effective, unless the tuple contains several
attributes with missing values. It is especially poor when the
percentage of missing values per attribute varies considerably. By
ignoring the tuple, we do not make use of the remaining
attributes’ values in the tuple. Such data could have been useful
to the task at hand.

2. Fill in the missing value manually: In general, this approach is time consuming and may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value:
Replace all missing attribute values by the same
constant such as a label like “Unknown” or - ∞. If
missing values are replaced by, say, “Unknown,” then the
mining program may mistakenly think that they form an
interesting concept, since they all have a value in
common—that of “Unknown.” Hence, although this method is simple, it is not foolproof.

4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value: For normal (symmetric) data distributions, the mean can be used to replace the missing values.
5. Use the attribute mean or median for all samples
belonging to the same class as the given tuple: For
example, if classifying customers according to credit
risk, we may replace the missing value with the mean
income value for customers in the same credit risk
category as that of the given tuple. If the data distribution
for a given class is skewed, the median value is a better
choice.

6. Use the most probable value to fill in the missing


value: This may be determined with regression,
inference-based tools using a Bayesian formalism, or
decision tree induction.
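A short pandas sketch of methods 4 and 5 above, mean imputation overall and per class (the small income table is invented for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income": [50000, np.nan, 28000, np.nan, 62000],
})

# Method 4: fill with the attribute mean over all tuples
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Method 5: fill with the mean income of tuples in the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean"))

print(df)
```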
Noisy Data
“What is noise?” Noise is a random error or variance in a
measured variable.
Noisy data is meaningless data. The term has often been used
as a synonym for corrupt data.
Various data visualization techniques like Boxplot and
Scatter plots can be used to identify outliers which may
represent noise.

Let’s look at the following data smoothing techniques:

Binning
Regression
Outlier analysis
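A brief sketch of the first of these, smoothing by (equal-frequency) bin means; the sorted price list is a toy example:

```python
import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])   # sorted data values

# Partition into 3 equal-frequency bins and replace each value by its bin mean
bins = np.split(prices, 3)
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)   # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```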

Data Cleaning as a process
Missing values, noise, and inconsistencies contribute
to inaccurate data. So far, we have looked at techniques
for handling missing data and for smoothing data.
“But data cleaning is a big job. What about data
cleaning as a process? How exactly does one proceed in
tackling this task? Are there any tools out there to
help?”

The two step data cleaning process includes:


Discrepancy Detection
Data Transformation
Discrepancy detection
Discrepancies can be caused by several factors,
including poorly designed data entry forms that have
many optional fields, human error in data entry,
deliberate errors (e.g., respondents not wanting to
divulge information about themselves), and data
decay (e.g., outdated addresses).

Metadata that describes the data type and domain of each attribute, the acceptable values for each attribute, and the range of values is used for discrepancy detection.
Data Transformation
Once we find discrepancies, we typically need to define and apply (a series of) transformations to correct them.

Data migration tools allow simple transformations to be specified, such as replacing the string “gender” by “sex.”
Data Integration
Data mining often requires data integration—
the merging of data from multiple data stores.

One of the most well-known implementations of data integration is building an enterprise's data warehouse.

A data warehouse enables a business to perform analysis based on the data in the data warehouse.
Careful integration can help reduce and avoid
redundancies and inconsistencies in the resulting data
set. This can help improve the accuracy and speed of
the subsequent data mining process.

The semantic heterogeneity and structure of data pose great challenges in data integration. Various challenges in data integration are:
Entity Identification Problem
Redundancy and Correlation Analysis
Tuple Duplication
Data Value Conflict Detection and Resolution

Redundancy and Correlation Analysis
Redundancy is another important issue in data integration.
An attribute (such as annual revenue, for instance) may be
redundant if it can be “derived” from another attribute or
set of attributes.

Some redundancies can be detected by correlation analysis. Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data.

For nominal (categorical) data, we use the chi-square test.
For numeric attributes, we can use the correlation coefficient and covariance, both of which assess how one attribute's values vary from those of another.
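A quick sketch of spotting a redundant numeric attribute via the correlation coefficient (the revenue figures are invented; a value near +1 or -1 suggests one attribute can be derived from the other):

```python
import numpy as np

monthly_revenue = np.array([10, 12, 15, 20, 22])
annual_revenue = monthly_revenue * 12        # perfectly derivable -> redundant

r = np.corrcoef(monthly_revenue, annual_revenue)[0, 1]
print("correlation coefficient:", r)         # 1.0 here, a strong redundancy signal
```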
Tuple Duplication
 In addition to detecting redundancies between attributes, duplication
should also be detected at the tuple level (e.g., where there are two or
more identical tuples for a given unique data entry case).

 The use of denormalized tables (often done to improve performance by avoiding joins) is another source of data redundancy.

 Inconsistencies often arise between various duplicates, due to inaccurate data entry or updating some but not all data occurrences.

 For example, if a purchase order database contains attributes for the purchaser’s name and address instead of a key to this information in a purchaser database, discrepancies can occur, such as the same purchaser’s name appearing with different addresses within the purchase order database.
Data Transformation
In this preprocessing step, the data are transformed or
consolidated so that the resulting mining process may
be more efficient, and the patterns found may be
easier to understand.

Data transformation example: the values -2, 36, 100, 58, 95 can be transformed (e.g., normalized) into -0.02, 0.36, 1.00, 0.58, 0.95.
Data Transformation Strategies
In data transformation, the data are transformed or consolidated
into forms appropriate for mining. Strategies for data
transformation include the following:
 Smoothing: which works to remove noise from the data.
Techniques include binning, regression, and clustering.
Aggregation: where summary or aggregation operations are
applied to the data. For example, the daily sales data may be
aggregated so as to compute monthly and annual total amounts.
Normalization: where the attribute data are scaled so as to fall
within a smaller range, such as -1.0 to 1.0, or 0.0 to 1.0.

Discretization: where the raw values of a numeric
attribute (e.g., age) are replaced by interval labels (e.g., 0–
10, 11–20, etc.) or conceptual labels (e.g., youth, adult,
senior). The labels, in turn, can be recursively organized
into higher-level concepts, resulting in a concept hierarchy
for the numeric attribute.

Concept hierarchy generation for nominal data: where


attributes such as street can be generalized to higher-level
concepts, like city or country. Many hierarchies for
nominal attributes are implicit within the database schema
and can be automatically defined at the schema definition
level.

Data Transformation by Normalization
The measurement unit used can affect the data analysis.

For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to very different results.

In general, expressing an attribute in smaller units will lead to a larger range for that attribute, and thus tend to give such an attribute greater effect or “weight.”

To help avoid dependence on the choice of measurement units, the data should be normalized or standardized.
Methods for Data Normalization
There are many methods for data normalization like:

min-max normalization,
z-score normalization, and
normalization by decimal scaling.

Z-score normalization
In z-score (zero-mean) normalization, the values of an attribute A are normalized based on the mean and standard deviation of A: a value v is mapped to v' = (v - mean(A)) / std(A).
Normalization by decimal scaling
Decimal Scaling is a data normalization technique.
In this technique, we move the decimal point of values of
the attribute.
This movement of decimal points totally depends on the
maximum value among all values in the attribute.
Formula for decimal scaling: a value v of attribute A can be normalized by v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.

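A compact sketch of the three normalization methods listed above, applied to one small array (the marks data is invented; new ranges and conventions vary in practice):

```python
import numpy as np

marks = np.array([8.0, 10.0, 15.0, 20.0])

# Min-max normalization to the new range [0, 1]
min_max = (marks - marks.min()) / (marks.max() - marks.min())

# Z-score normalization: subtract the mean, divide by the standard deviation
z_score = (marks - marks.mean()) / marks.std()

# Decimal scaling: divide by 10^j so that, for these values, all results fall below 1
j = int(np.ceil(np.log10(np.abs(marks).max())))
decimal_scaled = marks / (10 ** j)

print(min_max, z_score, decimal_scaled, sep="\n")
```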
Data Transformation by Discretization
Data discretization converts a large number of data values into a smaller number of intervals or labels, so that data evaluation and data management become much easier.
Data Discretization Techniques:
Histogram Analysis
Binning
Correlation Analysis
Clustering Analysis
Decision Tree Analysis
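A tiny sketch of discretizing a numeric attribute into conceptual labels with pandas (the age cut-points for youth/adult/senior are arbitrary choices):

```python
import pandas as pd

ages = pd.Series([5, 17, 23, 41, 58, 67, 80])

labels = pd.cut(ages, bins=[0, 20, 60, 120], labels=["youth", "adult", "senior"])
print(labels.tolist())   # ['youth', 'youth', 'adult', 'adult', 'adult', 'senior', 'senior']
```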
Data Transformation by Concept Hierarchy Generation
Manual definition of concept hierarchies can be a tedious and
time-consuming task for a user or a domain expert.
Fortunately, many hierarchies are implicit within the database
schema and can be automatically defined at the schema
definition level.
It defines a sequence of mappings from a set of low level
concepts to high level. The concept hierarchies can be used
to transform the data into multiple levels of granularity.
For example, data mining patterns regarding sales may be
found relating to specific regions or countries, in addition to
individual branch locations.

Four methods for the generation of concept hierarchies for
nominal data:
1. Specification of a partial ordering of attributes explicitly at
the schema level by users or experts: Concept hierarchies
for nominal attributes or dimensions typically involve a
group of attributes.
 A user or expert can easily define a concept hierarchy by
specifying a partial or total ordering of the attributes at the
schema level.
 For example, suppose that a relational database contains the
following group of attributes: street, city, province or state,
and country. Similarly, a data warehouse location dimension
may contain the same attributes.
 A hierarchy can be defined by specifying the total
ordering among these attributes at the schema level such
as street < city < province or state < country.

2. Specification of a portion of a hierarchy by explicit data grouping: This is essentially the manual definition
of a portion of a concept hierarchy. In a large database, it
is unrealistic to define an entire concept hierarchy by
explicit value enumeration.
On the contrary, we can easily specify explicit groupings
for a small portion of intermediate-level data.
For example, after specifying that province and country
form a hierarchy at the schema level, a user could define
some intermediate levels manually such as:
“{Janakpuri, Dwarka} ⊂ West Delhi”

3. Specification of a set of attributes, but not of their
partial ordering: A user may specify a set of attributes
forming a concept hierarchy, but omit to explicitly state
their partial ordering.
The system can then try to automatically generate the
attribute ordering so as to construct a meaningful concept
hierarchy based on the number of distinct values.
Note that this heuristic rule is not foolproof. For
example, a time dimension in a database may contain
20 distinct years, 12 distinct months, and 7 distinct
days of the week. However, this does not suggest that
the time hierarchy should be “year < month < days of the week,” with days of the week at the top of the
hierarchy.
[Figure: concept hierarchy generation by the system on the basis of distinct values]
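A minimal sketch of that heuristic: order the attributes by their number of distinct values, with the attribute having the most distinct values at the bottom of the hierarchy (the small location table is invented):

```python
import pandas as pd

locations = pd.DataFrame({
    "country": ["India", "India", "India", "India"],
    "state":   ["Delhi", "Delhi", "Punjab", "Punjab"],
    "city":    ["New Delhi", "New Delhi", "Amritsar", "Ludhiana"],
    "street":  ["MG Road", "Ring Road", "Mall Road", "Link Road"],
})

# Fewer distinct values -> higher level in the generated hierarchy
order = locations.nunique().sort_values().index.tolist()
print(" < ".join(reversed(order)))   # street < city < state < country
```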

4. Specification of only a partial set of attributes:
Sometimes a user can be careless when defining a
hierarchy, or have only a vague idea about what should be
included in a hierarchy.
 Consequently, the user may have included only a small
subset of the relevant attributes in the hierarchy
specification.
 For example, instead of including all of the hierarchically relevant attributes for location, the user may have specified only street and city; the remaining attributes, state and country, are then automatically pinned together according to the data semantics defined in the metadata.
