DATA MINING Notes (Update)
By Sami Sikander
Topic: 01
Types of Data
Data mining can be performed on the following types of data:
Relational databases
Data warehouses
Advanced DB and information repositories
Object-oriented and object-relational databases
Transactional and Spatial databases
Heterogeneous and legacy databases
Multimedia and streaming databases
Text databases
Text mining and Web mining
1. Classification:
This analysis is used to retrieve important and relevant information about data and
metadata. This data mining method helps to classify data into different classes.
2. Clustering:
Clustering analysis is a data mining technique to identify data items that are similar
to each other. This process helps to understand the differences and similarities
between the data.
3. Regression:
Regression analysis is the data mining method of identifying and analyzing the
relationship between variables. It is used to estimate the likely value of a specific
variable, given the values of other variables.
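As an illustration of regression, here is a minimal sketch of simple linear regression fitted by ordinary least squares. The function name `fit_line` and the data (hours studied against exam score) are made up for this example and are not part of the notes.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data chosen to lie exactly on the line y = 2x + 1.
hours = [1, 2, 3, 4, 5]
scores = [3, 5, 7, 9, 11]

slope, intercept = fit_line(hours, scores)
predicted = slope * 6 + intercept  # estimate the score for 6 hours
print(slope, intercept, predicted)
```

The fitted line describes the relationship between the two variables, and evaluating it at a new input gives the estimate for the dependent variable.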
4. Association Rules:
This data mining technique helps to find the association between two or more
items. It discovers hidden patterns in the data set.
5. Outlier Detection:
This type of data mining technique refers to the observation of data items in the
dataset which do not match an expected pattern or expected behavior. This technique
can be used in a variety of domains, such as intrusion detection, fraud detection,
or fault detection.
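One simple way to flag items that do not match the expected pattern is a z-score test. The sketch below assumes roughly bell-shaped numeric data. The threshold of 2.0 and the sample readings are illustrative choices, not values from the notes.

```python
import statistics


def z_score_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical sensor readings with one obviously unusual value.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
outliers = z_score_outliers(readings)
print(outliers)
```

With very small samples a single extreme value inflates the standard deviation, so robust alternatives (for example, rules based on the median or interquartile range) are often preferred in practice.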
6. Sequential Patterns:
This data mining technique helps to discover or identify similar patterns or trends
in transaction data over a certain period.
7. Prediction:
Prediction uses a combination of the other data mining techniques, such as trend
analysis, sequential patterns, clustering, and classification. It analyzes past
events or instances in the right sequence to predict a future event.
Types of data:
• Numeric data: Each object is a point in a multidimensional space
Topic: 03
Data Preprocessing (Preparing of Data)
Why Data Preprocessing?
Data in the real world is dirty
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Data transformation
Normalization and aggregation
Data reduction
Obtains a representation that is reduced in volume but produces the same or similar
analytical results
Data discretization
Part of data reduction but with particular importance, especially for numerical data
Data Integration
Combines data from multiple sources into a logical store.
Schema integration:
Integrate metadata from different sources, e.g., A.cust-id = B.cust-id
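Two of the tasks listed above, normalization (under data transformation) and data discretization, can be sketched in a few lines. Min-max scaling and equal-width binning are assumed here as representative methods. The notes do not name specific methods, and the function names and the `ages` data are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]  # no spread to rescale
    return [
        new_min + (v - lo) / (hi - lo) * (new_max - new_min)
        for v in values
    ]


def equal_width_bins(values, n_bins):
    """Discretize numeric values into n_bins intervals of equal width.
    Returns the bin index (0 .. n_bins - 1) for each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [20, 30, 40, 50, 60]
normalized = min_max_normalize(ages)
bins = equal_width_bins(ages, 2)
print(normalized, bins)
```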
Importance:
“Data cleaning is one of the three biggest problems in data warehousing”—Ralph
Kimball
“Data cleaning is the number one problem in data warehousing”—DCI survey
Data cleaning tasks:
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Missing Data
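Filling in missing values is the first cleaning task listed above. A minimal sketch, assuming missing entries are represented as `None` and are replaced by the mean of the known values (one common strategy among several; the data is made up):

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the non-missing values."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("cannot impute: every value is missing")
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


# Hypothetical attribute with two missing entries.
incomes = [30.0, None, 50.0, 40.0, None]
filled = fill_missing_with_mean(incomes)
print(filled)
```

Other options include ignoring the record, using a global constant, or using the mean of records in the same class. Which is appropriate depends on the data.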
Data Compression
String compression:
There are extensive theories and well-tuned algorithms. Typically lossless, but
only limited manipulation is possible without expansion.
Audio/video compression:
Typically lossy compression, with progressive refinement. Sometimes small
fragments of the signal can be reconstructed without reconstructing the whole.
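The lossless property of string compression can be checked with a round trip. This sketch uses Python's standard `zlib` module as one example of a lossless compressor. The sample text is arbitrary.

```python
import zlib

# Highly repetitive text compresses well.
text = ("data mining " * 100).encode("utf-8")

compressed = zlib.compress(text)
restored = zlib.decompress(compressed)

# Lossless: decompression returns exactly the original bytes.
is_lossless = restored == text
print(len(text), len(compressed), is_lossless)
```

The compressed bytes cannot be searched or edited meaningfully without decompressing them first, which is the "limited manipulation without expansion" point made above.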
Topic: 04
Statistical Methods
Read this topic from the slides (Lecture No: 02)
Topic: 05
Decision Tree and Decision Rules
Read this topic from the slides (Lecture No: 05)
Topic: 06
Neural Network
Neuron in the brain, Many neurons in our brain
Neural Network
3 Layers
Layer 1: input layer
Layer 2: hidden layer
Layer 3: output layer
Feed-Forward
Feed-Backward
AND Function:
OR Function:
NOT Function:
NAND Function:
NOR Function:
XNOR Function:
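The logic functions listed above can each be realised by a small feed-forward network of threshold neurons. The weights and biases below are one workable choice, assumed for illustration. The values used in the lecture are not given here and may differ.

```python
def step(z):
    """Threshold activation: fire (1) when the weighted sum is >= 0."""
    return 1 if z >= 0 else 0


def neuron(inputs, weights, bias):
    """A single neuron: weighted sum of the inputs plus bias, then step."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def and_gate(a, b):
    return neuron([a, b], [1, 1], -1.5)


def or_gate(a, b):
    return neuron([a, b], [1, 1], -0.5)


def not_gate(a):
    return neuron([a], [-1], 0.5)


def nand_gate(a, b):
    return neuron([a, b], [-1, -1], 1.5)


def nor_gate(a, b):
    return neuron([a, b], [-1, -1], 0.5)


def xnor_gate(a, b):
    # Not linearly separable, so a hidden layer is needed:
    # hidden neurons compute AND and NOR, the output neuron ORs them.
    return or_gate(and_gate(a, b), nor_gate(a, b))


for a in (0, 1):
    for b in (0, 1):
        print(a, b, and_gate(a, b), or_gate(a, b), xnor_gate(a, b))
```

AND, OR, NOT, NAND, and NOR each need only one neuron. XNOR is the case that shows why a hidden layer is useful.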
Topic: 07
Ensemble learning
An ensemble of classifiers is a set of classifiers whose individual decisions are
combined in some way to classify new examples. Simplest approach:
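A minimal sketch of one simple way to combine individual decisions: majority voting. Choosing majority voting is an assumption made for this example, and the three threshold classifiers are hypothetical stand-ins for trained models.

```python
from collections import Counter


def majority_vote(classifiers, example):
    """Ask every classifier for a label and return the most common one."""
    votes = [classify(example) for classify in classifiers]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical weak classifiers: each labels a number with its own threshold.
classifiers = [
    lambda x: "high" if x > 5 else "low",
    lambda x: "high" if x > 10 else "low",
    lambda x: "high" if x > 3 else "low",
]

label_for_7 = majority_vote(classifiers, 7)  # votes: high, low, high
label_for_2 = majority_vote(classifiers, 2)  # votes: low, low, low
print(label_for_7, label_for_2)
```

Using an odd number of classifiers avoids ties in a two-class problem.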
2. BOOSTING
Topic: 08
Cluster Analysis
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis: Grouping a set of data objects into clusters
Clustering is unsupervised classification: no predefined classes
Clustering is used:
As a stand-alone tool to get insight into data distribution
Visualization of clusters may expose important information
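To make the idea of grouping similar objects concrete, here is a minimal sketch of k-means, one widely used clustering algorithm. It is chosen here as an assumption; the notes do not single out an algorithm. The points and starting centroids are made up.

```python
import math


def k_means(points, initial_centroids, max_iter=100):
    """Cluster points by repeatedly assigning each point to its nearest
    centroid and moving each centroid to the mean of its members."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return labels, centroids


# Hypothetical 2-D data with two visually separate groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
labels, centroids = k_means(points, [points[0], points[3]])
print(labels, centroids)
```

No class labels are supplied. The groups emerge only from the distances between objects, which is what makes clustering unsupervised.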
General Applications of Clustering
Pattern Recognition
Spatial Data Analysis
create thematic maps in GIS by clustering feature spaces
detect spatial clusters and explain them in spatial data mining
Image Processing
cluster images based on their visual content
Economic Science (especially market research)
WWW
document classification
cluster Weblog data to discover groups of similar access patterns
Outliers
Outliers are objects that do not belong to any cluster or form clusters of very
small cardinality
Topic: 09
Association Rules
Association rule mining first finds all sets of items (itemsets) that have support
greater than the minimum support, and then uses these large itemsets to generate the
desired rules that have confidence greater than the minimum confidence.
The lift of a rule X ⇒ Y is the ratio of the observed support to that expected if X
and Y were independent. A typical and widely used example of an association rule
application is market basket analysis.
Association Rules
For two itemsets, A and B:
Support S: probability that a transaction contains A∪B
Confidence C: conditional probability that a transaction containing A also contains B
C = Sup(A∪B) / Sup(A)
Association rule: A ⇒ B (S, C)
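The support, confidence, and lift definitions above can be computed directly from a list of transactions. The following is a small sketch with a made-up market basket. The helper names are illustrative.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, a, b):
    """C = Sup(A ∪ B) / Sup(A) for the rule A => B."""
    return support(transactions, set(a) | set(b)) / support(transactions, a)


def lift(transactions, a, b):
    """Observed support of A ∪ B divided by the support expected
    if A and B were independent."""
    return support(transactions, set(a) | set(b)) / (
        support(transactions, a) * support(transactions, b)
    )


# Hypothetical market basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

s = support(baskets, {"bread", "milk"})       # 3 of 5 baskets
c = confidence(baskets, {"bread"}, {"milk"})  # about 0.75
l = lift(baskets, {"bread"}, {"milk"})        # about 0.94
print(s, c, l)
```

A lift below 1, as in this toy data, suggests the two items appear together slightly less often than independence would predict.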
Topic: 10
Web & Text Mining
Web Mining is the application of data mining techniques to extract knowledge
from web data such as Web content, Web structure and Web usage data.
It is the process of discovering useful and previously unknown information
from web data.
Web data is:
Web content: text, images, records, etc.
Web structure: hyperlinks, tags, etc.
Web usage: HTTP logs, app server logs, etc.
Text Mining
The objective of Text Mining is to exploit information contained in textual
documents in various ways, including discovery of patterns and trends in data,
associations among entities, predictive rules, etc.
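A common first step toward discovering patterns in text is counting how often each term occurs. This is a minimal sketch, assuming simple lowercase word tokenization. The two documents are invented for the example.

```python
import re
from collections import Counter


def term_frequencies(documents):
    """Count how many times each lowercase word occurs across documents."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return counts


docs = [
    "Data mining finds patterns in data.",
    "Text mining finds patterns in text.",
]
tf = term_frequencies(docs)
print(tf.most_common(3))
```

Real text mining pipelines usually add steps such as stop-word removal and stemming before counting.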
Topic: 11
Genetic Algorithm
A genetic algorithm is a search heuristic that is inspired by Charles
Darwin's theory of natural evolution.
This algorithm reflects the process of natural selection where the fittest
individuals are selected for reproduction in order to produce offspring of the
next generation.
Where is it used?
Optimization − Genetic Algorithms are most commonly used in optimization
problems wherein we have to maximize or minimize a given objective function
value under a given set of constraints.
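The selection and reproduction cycle can be sketched on a toy optimization problem: maximizing the number of 1-bits in a bit string, with no constraints. The population size, mutation rate, and the use of elitism are illustrative assumptions, not settings from the notes.

```python
import random


def fitness(bits):
    """Objective to maximize: the number of 1s in the bit string."""
    return sum(bits)


def evolve(pop_size=20, length=16, generations=40, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    population = [
        [rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)
    ]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]  # selection: the fitter half
        children = [population[0][:]]          # elitism: keep the best as-is
        while len(children) < pop_size:
            mom, dad = rng.sample(parents, 2)
            cut = rng.randint(1, length - 1)   # single-point crossover
            child = mom[:cut] + dad[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        population = children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = evolve()
print(fitness(best), history[0], history[-1])
```

Because the best individual is always carried over unchanged, the best fitness never decreases from one generation to the next.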
Topic: 12
Fuzzy logic
Fuzzy logic is an approach to data mining that computes on data using degrees of
truth, supporting approximate prediction and clustering, as opposed to the
traditional “true or false” logic. Algorithms that use fuzzy logic are increasingly
being applied in several disciplines to help in the mining of databases.
Fuzzy sets
Fuzzy sets are sets whose elements have degrees of membership. In classical set
theory an element either belongs to a set or does not. By contrast, fuzzy set theory
permits the gradual assessment of the membership of elements in a set, which is
described with the aid of a membership function valued in the real unit interval
[0, 1].
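A membership function maps each element to a degree in [0, 1]. The sketch below uses a triangular shape, one common choice assumed here for illustration. The fuzzy set "warm" and its temperature breakpoints are hypothetical.

```python
def triangular(x, a, b, c):
    """Triangular membership function with a < b < c:
    0 outside (a, c), rising linearly to 1 at the peak b, then falling."""
    if x <= a or x >= c:
        return 0.0
    if x == b:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def warm(temperature):
    # Hypothetical fuzzy set "warm": starts at 15, peaks at 25, ends at 35.
    return triangular(temperature, 15, 25, 35)


for t in (10, 20, 25, 30):
    print(t, warm(t))
```

A temperature of 20 is neither fully warm nor fully not warm. It belongs to the set with degree 0.5, which a crisp true-or-false set cannot express.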
Topic: 14
Visualization Method
Data mining visualization combines data mining with data visualization and makes
use of a number of technique areas, including geometric techniques as well as:
Pixel-oriented Visualization
Hierarchical Visualization
Graph-Based Visualization
Distortion Visualization
User Interaction Visualization
Topic: 15
Data Mining Tools
1) RapidMiner.
2) Orange.
3) Weka.
4) KNIME.
5) Sisense.
6) SSDT (SQL Server Data Tools)
7) Apache Mahout.
8) Oracle Data Mining.