
Complete Notes of DATA MINING

By Sami Sikander
Topic: 01

What is Data Mining?


Data mining is the search for hidden, valid, and potentially useful patterns in huge
data sets. Data mining is all about discovering unsuspected, previously unknown
relationships in the data.

 It is a multi-disciplinary skill that uses machine learning, statistics, AI, and
database technology.
 The insights derived via data mining can be used for marketing, fraud
detection, scientific discovery, etc.
 Data mining is also called knowledge discovery, knowledge extraction,
data/pattern analysis, information harvesting, etc.

Why do we need data mining?


 Really, really huge amounts of raw data!
 Mobile devices, digital photographs, web documents.
 Queries, clicks, browsing
 Cheap storage has made it possible to maintain this data
 Need to analyze the raw data to extract knowledge

Types of Data
Data mining can be performed on the following types of data:

 Relational databases
 Data warehouses
 Advanced DB and information repositories
 Object-oriented and object-relational databases
 Transactional and Spatial databases
 Heterogeneous and legacy databases
 Multimedia and streaming database
 Text databases
 Text mining and Web mining

Data Mining Implementation Process

Data Mining Techniques


1. Classification:

This analysis is used to retrieve important and relevant information about data and
metadata. This data mining method helps to classify data into different classes.
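For illustration, a minimal classification sketch with scikit-learn; the toy records and the buy/not-buy labels are invented for this example, not taken from the notes:

# Minimal classification sketch (scikit-learn); the data below is invented for illustration.
from sklearn.tree import DecisionTreeClassifier

X_train = [[25, 30000], [40, 80000], [35, 60000], [22, 20000]]  # [age, income]
y_train = [0, 1, 1, 0]                                          # class labels: 0 = no, 1 = yes

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)              # learn rules that separate the classes
print(clf.predict([[30, 55000]]))      # assign a class to a new, unseen record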

2. Clustering:

Clustering analysis is a data mining technique for identifying data objects that are
similar to each other. This process helps to understand the differences and
similarities between the data.

3. Regression:

Regression analysis is the data mining method of identifying and analyzing the
relationship between variables. It is used to estimate the likely value of a specific
variable, given the values of other variables.
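A minimal regression sketch along the same lines; the numbers are made up:

# Minimal regression sketch (scikit-learn); values are made up for illustration.
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]     # e.g., years of experience
y = [30, 35, 40, 45]         # e.g., salary in thousands

model = LinearRegression()
model.fit(X, y)                        # analyze the relationship between the variables
print(model.predict([[5]]))            # estimate the target for a new value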

4. Association Rules:

This data mining technique helps to find associations between two or more
items. It discovers hidden patterns in the data set.

5. Outlier Detection:

This type of data mining technique refers to the observation of data items in the
dataset which do not match an expected pattern or expected behavior. This
technique can be used in a variety of domains, such as intrusion detection, fraud
detection, fault detection, etc.

6. Sequential Patterns:

This data mining technique helps to discover or identify similar patterns or trends
in transaction data over a certain period.

7. Prediction:
Prediction uses a combination of the other data mining techniques, such as trend
analysis, sequential patterns, clustering, and classification. It analyzes past events
or instances in the right sequence to predict a future event.

Challenges of Implementing Data Mining:


 Skilled Experts are needed to formulate the data mining queries.
 Overfitting: due to the small size of the training database, a model may not
fit future states.
 Data mining needs large databases, which are sometimes difficult to manage.
 If the data set is not diverse, data mining results may not be accurate.

Benefits of Data Mining:


 Data mining techniques help companies obtain knowledge-based
information.
 Data mining helps organizations make profitable adjustments in
operations and production.
 Data mining helps with the decision-making process.
 It can be implemented in new systems as well as existing platforms

Disadvantages of Data Mining


 Much data mining analytics software is difficult to operate and requires
advanced training to work with.
 Different data mining tools work in different manners due to the different
algorithms employed in their design. Therefore, selecting the correct data
mining tool is a difficult task.
 Data mining techniques are not always accurate, which can cause serious
consequences in certain conditions.

So, what is Data?


 Collection of data objects and their attributes
 An attribute is a property or characteristic of an object
o Examples: eye color of a person, temperature, etc.
o Attribute is also known as variable, field, characteristic, or feature
 A collection of attributes describes an object
o Object is also known as record, point, case, sample, entity, or instance

Types of data:
• Numeric data: Each object is a point in a multidimensional space

• Categorical data: Each object is a vector of categorical values

• Set data: Each object is a set of values (with or without counts)


Sets can also be represented as binary vectors, or vectors of counts

• Ordered sequences: Each object is an ordered sequence of values.

• Graph data: Each object is a collection of nodes and edges (a graph).


Topic: 02
Data Reduction
Data mining is a technique used to handle huge amounts of data. When working
with such large volumes, analysis becomes harder. To cope with this, we use data
reduction techniques, which aim to increase storage efficiency and reduce data
storage and analysis costs.

The various steps to data reduction are:


Data Cube Aggregation:
• Aggregation operation is applied to data for the construction of the data
cube.
Attribute Subset Selection:
• Only the highly relevant attributes should be used; the rest can be discarded.
For attribute selection, one can use the level of significance and the p-value
of the attribute: an attribute whose p-value is greater than the significance
level can be discarded.
Numerosity Reduction:
• This enables storing a model of the data instead of the whole data, for
example: regression models.
Dimensionality Reduction:
• This reduces the size of the data using encoding mechanisms. It can be lossy
or lossless: if the original data can be retrieved after reconstruction from the
compressed data, the reduction is called lossless; otherwise it is called lossy.
Two effective methods of dimensionality reduction are wavelet transforms
and PCA (Principal Component Analysis).
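A minimal sketch of (lossy) dimensionality reduction with PCA in scikit-learn; the random matrix stands in for a real attribute table:

# Sketch of dimensionality reduction with PCA (scikit-learn); random data for illustration only.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)                 # 100 objects described by 10 attributes

pca = PCA(n_components=2)                   # keep only the 2 strongest components
X_reduced = pca.fit_transform(X)            # reduced representation used instead of the raw data
X_approx = pca.inverse_transform(X_reduced) # reconstruction is approximate, hence "lossy"
print(X_reduced.shape, pca.explained_variance_ratio_)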

Topic: 03
Data Preprocessing (Preparing of Data)
Why Data Preprocessing?
Data in the real world is dirty

 Incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
o e.g., occupation = “ ”
 Noisy: containing errors or outliers
o e.g., Salary = “-10”
 Inconsistent: containing discrepancies in codes or names
o e.g., Age = “42”, Birthday = “03/07/1997”

Multi-Dimensional Measure of Data Quality


A well-accepted multidimensional view:

 Accuracy

 Completeness

 Consistency

 Timeliness

 Believability

 Value added

 Interpretability

 Accessibility

Major Tasks in Data Preprocessing.


Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies.
Data integration
Integration of multiple databases, data cubes, or files

Data transformation
Normalization and aggregation

Data reduction
Obtains reduced representation in volume but produces the same or similar
analytical results
Data discretization
Part of data reduction but with particular importance, especially for numerical data
Data Integration
Combines data from multiple sources into a logical store.

Schema integration:
Integrate metadata from different sources, e.g., A.cust-id = B.cust-id

Entity identification problem:


Identify real-world entities from multiple data sources, e.g., Bill Clinton =
William Clinton

Detecting and resolving data value conflicts:


For the same real world entity, attribute values from different sources are
different
Possible reasons: different representations, different scales, e.g., metric vs. British
units (meter vs. feet)

Handling Redundancy in Data Integration

Redundant data often occur when integrating multiple databases


Object identification: The same attribute or object may have different names
in different databases
Derivable data: One attribute may be a “derived” attribute in another table
e.g., annual revenue
Redundant attributes may be detected by correlation analysis.
Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality.
Data Cleaning

Importance:
“Data cleaning is one of the three biggest problems in data warehousing”—Ralph
Kimball
“Data cleaning is the number one problem in data warehousing”—DCI survey
Data cleaning tasks:
 Fill in missing values, identify outliers and smooth out noisy data, correct
inconsistent data.
 Resolve redundancy caused by data integration

Missing Data

Missing data may be due to:
 equipment malfunction
 data deleted because it was inconsistent with other recorded data
 data not entered due to misunderstanding
 certain data not being considered important at the time of entry
 the history or changes of the data not being registered
Missing data may need to be inferred.

How to Handle Missing Data?


 Ignore the tuple: usually done when the class label is missing (assuming a
classification task); not effective when the percentage of missing values per
attribute varies considerably.
 Fill in the missing value manually: tedious and often infeasible.
 Fill it in automatically with a global constant, e.g., “unknown” (effectively a
new class).
 Use a measure of central tendency for the attribute (e.g., the mean or
median) to fill in the missing value, or use the attribute mean for all samples
belonging to the same class.
 Smarter: use the most probable value, inferred with methods such as a
Bayesian formula or a decision tree.
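A small pandas sketch of two of these options; the column names and values are hypothetical:

# Sketch of handling missing values with pandas; column names and values are hypothetical.
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 35],
                   "occupation": ["clerk", None, "engineer", "clerk"]})

df["age"] = df["age"].fillna(df["age"].mean())          # fill with a measure of central tendency
df["occupation"] = df["occupation"].fillna("unknown")   # fill with a global constant
print(df)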
Noisy Data

Noise: random error or variance in a measured variable. Incorrect attribute values
may be due to:
 Faulty data collection instruments
 Data entry problems
 Data transmission problems
 Technology limitation
 Inconsistency in naming convention
How to Handle Noisy Data?
Binning
First sort the data and partition it into (equal-frequency) bins; then smooth by bin
means, bin medians, bin boundaries, etc. (a minimal sketch follows after this list).
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g., deal with possible outliers)
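The sketch referred to above: equal-frequency binning and smoothing by bin means. The sorted values are a textbook-style toy example, not data from these notes:

# Sketch of equal-frequency binning and smoothing by bin means; toy values for illustration.
import numpy as np

data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = np.array_split(data, 3)              # 3 equal-frequency bins of 4 values each

smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)                             # each value replaced by the mean of its bin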
Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
 min-max normalization
 z-score normalization
 normalization by decimal scaling
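A short sketch of the three normalization methods above, applied to a toy column of values:

# Sketch of min-max, z-score, and decimal-scaling normalization; values are illustrative.
import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

minmax = (x - x.min()) / (x.max() - x.min())        # min-max normalization to [0, 1]
zscore = (x - x.mean()) / x.std()                   # z-score normalization

j = int(np.floor(np.log10(np.abs(x).max()))) + 1    # smallest j with max(|x| / 10^j) < 1
decimal = x / (10 ** j)                             # normalization by decimal scaling

print(minmax, zscore, decimal, sep="\n")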

Data Compression
String compression:
There are extensive theories and well-tuned algorithms; typically lossless, but
only limited manipulation is possible without expansion.
Audio/video compression:
Typically, lossy compression, with progressive refinement. Sometimes small
fragments of signal can be reconstructed without reconstructing the whole.

Topic: 04
Statistical Methods
Read this topic from the slides (Lecture No. 02)
Topic: 05
Decision Tree and Decision Rules
Read this topic from the slides (Lecture No. 05)

Topic: 06
Neural Network
Neurons in the brain: there are many neurons in our brain.

 Dendrite: receives input
 Axon: produces output
 When a neuron sends a message through its axon to another neuron, the
message arrives at the other neuron’s dendrites.
Neuron model
 Input wires: dendrites
 Output wire: axon
Logistic Unit:

Neural Network

3 layers:
 Layer 1: input layer
 Layer 2: hidden layer
 Layer 3: output layer

Feed-Forward

The flow of signals in a neural network can be in only one direction.
In that case, we call the neural network architecture feed-forward, since the
input signals are fed into the input layer and, after being processed, are
forwarded to the next layer, as shown in the following figure.

Feed-Backward

 When the neural network has some kind of internal recurrence,
 meaning that the signals are fed back to a neuron or layer that has
already received and processed that signal,
 the network is of the feedback type, as shown in the following image:

Single Layer Perceptron VS Multi-Layer Perceptron


 A single-layer perceptron has no hidden layers; it consists of only an input
and an output layer.
 A Multi-Layer Perceptron (MLP) contains one or more hidden layers (in
addition to one input and one output layer).
 Input Layer: The Input layer has three nodes.
 The Bias node has a value of 1
Neural Network Applications
 AND function
 OR function
 NOT function
 NOR function
 NAND function

AND function

OR Function:
NOT Function:

NAND Function:
NOR Function:

XNOR Function:
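As an illustration, one logistic unit can realize the AND function; the weights (-30, 20, 20) below are a commonly used teaching choice, assumed here rather than taken from the notes:

# Sketch of a single logistic unit computing AND; the weights are an illustrative choice.
import numpy as np

def logistic_unit(x1, x2, bias_w=-30.0, w1=20.0, w2=20.0):
    z = bias_w * 1 + w1 * x1 + w2 * x2     # the bias node has a value of 1
    return 1 / (1 + np.exp(-z))            # sigmoid activation

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(logistic_unit(x1, x2)))   # prints the AND truth table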

Topic: 07
Ensemble learning
An ensemble of classifiers is a set of classifiers whose individual decisions are
combined in some way to classify new examples. Simplest approach:

1. Generate multiple classifiers


2. Each votes on test instance
3. Take majority as classification

Data set: labeled images of hot dogs and other objects.

 Images of hot dogs are labelled 1
 Images of other objects are labelled 0
Why We Use the Ensemble Model?
 Because it provides better accuracy (lower error)
 Higher consistency (avoids overfitting)
 Reduces bias and variance errors

Why Use This Ensemble Model...?


 A single model may overfit
 The results are worth the extra training
 Can be used for classification as well as regression

Popular Ensemble Models


1. BOOTSTRAP AGGREGATION (BAGGING)

2. BOOSTING
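A minimal bagging sketch with scikit-learn; the bundled iris data set is used only so the example runs end to end:

# Sketch of bootstrap aggregation (bagging) with scikit-learn; iris is a stand-in data set.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Many trees, each trained on a bootstrap sample; predictions combined by majority vote.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50)
print(cross_val_score(bag, X, y, cv=5).mean())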

Topic: 08
Cluster Analysis
What is Cluster Analysis?
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis: Grouping a set of data objects into clusters
 Clustering is unsupervised classification: no predefined classes
 Clustering is used:
 As a stand-alone tool to get insight into data distribution
 Visualization of clusters may expose important information
General Applications of Clustering
 Pattern Recognition
 Spatial Data Analysis
 create thematic maps in GIS by clustering feature spaces
 detect spatial clusters and explain them in spatial data mining
 Image Processing
 cluster images based on their visual content
 Economic Science (especially market research)
 WWW
 document classification
 cluster Weblog data to discover groups of similar access patterns

Outliers
 Outliers are objects that do not belong to any cluster or form clusters of very
small cardinality

In some applications we are interested in discovering outliers, not clusters (outlier
analysis).
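One simple way to flag such objects is a z-score rule; the numbers and the threshold of 2 below are purely illustrative, not a method described in these notes:

# Simple z-score outlier sketch; data and threshold are illustrative.
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95])   # 95 clearly does not fit the rest
z = (data - data.mean()) / data.std()
print(data[np.abs(z) > 2])                       # flag points far from the mean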

Types of Clustering in Cluster Analysis


 Centroid Clustering. This is one of the more common methodologies used in
cluster analysis. ...
 Density Clustering. Density clustering groups data points by how densely
populated they are. ...
 Distribution Clustering. Distribution clustering identifies the probability that
a point belongs to a cluster. ...
 Connectivity Clustering. The primary premise of this technique is that points
closer to each other are more related.

Major Clustering Approaches


 Partitioning algorithms: construct random partitions and then iteratively
refine them by some criterion (e.g., the K-means algorithm)
 Hierarchical algorithms: Create a hierarchical decomposition of the set of
data (or objects) using some criterion
 Density-based: based on connectivity and density functions
 Grid-based: based on a multiple-level granularity structure
 Model-based: a model is hypothesized for each of the clusters and the idea
is to find the best fit of the data to the given model
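A minimal K-means sketch (scikit-learn) for the partitioning approach; the six points are made up to form two obvious groups:

# Sketch of partitioning clustering with K-means (scikit-learn); the points are made up.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [1, 1.5],    # one dense group
              [8, 8], [8.5, 9], [9, 8]])     # another dense group

km = KMeans(n_clusters=2, n_init=10).fit(X)  # iteratively refine 2 partitions
print(km.labels_)                            # cluster id assigned to each object
print(km.cluster_centers_)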

Topic: 09
Association Rules
Association rule mining finds all sets of items (itemsets) that have support greater
than the minimum support and then uses these large itemsets to generate the
desired rules that have confidence greater than the minimum confidence.

The lift of a rule is the ratio of the observed support to that expected if X and Y
were independent. A typical and widely used example of association rule
application is market basket analysis.

Patterns and Item-Sets


 Itemset: a set of one or more items
 k-itemset: B = {b1, …, bk}
 (Absolute) support is the frequency or occurrence count of an itemset B
 Milk: 4, Diaper: 3
 (Relative) support is the fraction of transactions that contain B (i.e., the
probability that a transaction contains B)
 Milk: 0.8, Diaper: 0.6
Frequent Patterns
 An itemset B is frequent if B’s support is no less than a minsup
threshold
 Let minsup = 0.5
 Freq. 1-itemsets
 Milk: 4/5=0.8
 Diaper: 3/5=0.6
 Cereal: 3/5=0.6
 Freq. 2-itemsets
 {Milk, Diaper}: 3/5=0.6

Association Rules
 For two item-sets A and B:
 Support S: the probability that a transaction contains A ∪ B
 Confidence C: the conditional probability that a transaction containing A
also contains B
 C = Sup(A ∪ B) / Sup(A)
 Association rule: A ⇒ B (S, C)
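A worked sketch of these formulas on five toy transactions, chosen to reproduce the Milk/Diaper counts used above:

# Support, confidence, and lift on 5 toy transactions (chosen to match the counts above).
transactions = [
    {"Milk", "Diaper", "Cereal"},
    {"Milk", "Diaper"},
    {"Milk", "Cereal"},
    {"Milk", "Diaper"},
    {"Cereal", "Bread"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

A, B = {"Milk"}, {"Diaper"}
s = support(A | B)               # S = Sup(A ∪ B) = 3/5 = 0.6
c = s / support(A)               # C = Sup(A ∪ B) / Sup(A) = 0.6 / 0.8 = 0.75
lift = c / support(B)            # lift > 1 suggests A and B are not independent
print(s, c, lift)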
Topic: 10
Web & Text Mining
 Web Mining is the application of data mining techniques to extract knowledge
from web data such as Web content, Web structure and Web usage data.
 It is the process of discovering useful and previously unknown information
from web data.
 Web data is:
Web content: - text, images, records, etc.
Web structure: - hyperlinks, tags, etc.
Web usage: - http logs, app server logs, etc.

Text Mining
 The objective of Text Mining is to exploit information contained in textual
documents in various ways, including discovery of patterns and trends in data,
associations among entities, predictive rules, etc.

Text Mining Workflow:


Data Mining vs. Text Mining
 Both seek novel and useful patterns.
 Both are semi-automated processes.
 Difference is the nature of the data:
Structured VS Unstructured data
 Structured data: databases
 Unstructured data: word docs, pdf files, xml files, and so on
 Text mining – first impose structure on the data, then mine the structured data.

Technology premise of Text Mining


 Summarization: the process of producing a summary of a document that
contains a large amount of information while the theme or main idea of the
document is maintained.
 Information Extraction: extracts relations within the text, using pattern
matching.
 Categorization: a supervised learning technique that places documents into
categories according to their content. Document categorization is widely
used in libraries.
 Visualization: uses computer graphics to represent information and reveal
relationships.
 Clustering: an unsupervised technique based on textual similarity, used in
data analysis to divide texts into mutually exclusive groups.
 Question Answering: answers natural-language queries by deciding how to
find the most suitable answer to a particular question.
 Sentiment Analysis: also known as opinion mining, it classifies users’
emotions, mostly into classes such as positive, negative, neutral, and mixed.
It is mainly used to gauge people’s views or attitudes towards anything,
including services and products.
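A minimal sketch of the overall idea: impose structure on raw text with TF-IDF, then mine the resulting matrix by clustering; the four documents are invented:

# Sketch of text mining: structure text with TF-IDF, then cluster it; documents are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the patient was screened for breast cancer",
    "cancer screening and early treatment",
    "the football match ended in a draw",
    "the team won the football league",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)  # unstructured text -> numeric matrix
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(labels)                                                  # medical vs. sports documents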

Topic: 11
Genetic Algorithm
 A genetic algorithm is a search heuristic that is inspired by Charles
Darwin's theory of natural evolution.
 This algorithm reflects the process of natural selection where the fittest
individuals are selected for reproduction in order to produce offspring of the
next generation.

Where use?
Optimization − Genetic Algorithms are most commonly used in optimization
problems wherein we have to maximize or minimize a given objective function
value under a given set of constraints.
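A tiny genetic-algorithm sketch maximizing an objective function; the objective, population size, and mutation settings are all illustrative assumptions:

# Tiny genetic algorithm maximizing f(x) = -(x - 3)^2 on [-10, 10]; all settings are illustrative.
import random

def fitness(x):
    return -(x - 3) ** 2                  # objective to maximize (optimum at x = 3)

population = [random.uniform(-10, 10) for _ in range(20)]

for generation in range(50):
    population.sort(key=fitness, reverse=True)   # selection: keep the fittest half
    parents = population[:10]
    offspring = [(random.choice(parents) + random.choice(parents)) / 2  # crossover
                 + random.gauss(0, 0.1)                                 # mutation
                 for _ in range(10)]
    population = parents + offspring

print(max(population, key=fitness))       # approaches 3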

Topic: 12

Fuzzy logic 
Fuzzy logic is an approach to data mining that computes with probable predictions
and clusterings, as opposed to the traditional “true or false”. Algorithms that use
fuzzy logic are increasingly being applied in several disciplines to help mine
databases.

For example, when using a fuzzy algorithm for the prediction and clustering of
breast cancer data, human experience and knowledge related to breast cancer
risks can be expressed as a set of inference rules that are then attached to the
fuzzy logic system.

Use of Fuzzy logic 


Fuzzy logic is used in Natural language processing and various intensive
applications in Artificial Intelligence. Fuzzy logic is extensively used in modern
control systems such as expert systems. Fuzzy Logic is used with Neural
Networks as it mimics how a person would make decisions, only much faster.

Fuzzy sets
Fuzzy sets are somewhat sets like sets whose elements have degree of
membership. By contrast fuzzy set theory permits the gradual assessment of the
membership of elements in a set, that is describes with the aid of membership
function valued in the real unit interval.

More formally, suppose that X is some (universal) set, x an element of X, and P
some property. A usual (crisp) subset A of X whose elements satisfy the property P
can be defined as a set of ordered pairs (x, μA(x)), where μA is the characteristic
(membership) function, which takes the value 1 if x satisfies P and 0 otherwise; in a
fuzzy set, μA(x) may take any value in [0, 1].
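A minimal sketch of such a membership function: a fuzzy set "tall" over heights, with cut-off values chosen purely for illustration:

# Sketch of a fuzzy membership function for the set "tall"; the cut-offs are illustrative.
def membership_tall(height_cm):
    # degree of membership: 0 below 160 cm, 1 above 190 cm, gradual in between
    if height_cm <= 160:
        return 0.0
    if height_cm >= 190:
        return 1.0
    return (height_cm - 160) / 30.0

for h in (150, 165, 175, 185, 195):
    print(h, round(membership_tall(h), 2))   # degrees of membership in [0, 1]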

Topic: 14
Visualization Method
Data mining visualization combines data mining with data visualization and makes
use of a number of technique areas, including:

 Geometric (projection) Visualization
 Pixel-oriented Visualization
 Hierarchical Visualization
 Graph- Based Visualization
 Distortion Visualization
 User Interaction Visualization

Topic: 15
Data Mining Tools
1) Rapid Miner.
2) Orange.
3) Weka.
4) KNIME.
5) Sisense.
6) SSDT (SQL Server Data Tools)
7) Apache Mahout.
8) Oracle Data Mining.
