Data Mining and Data Warehousing
UNIT-1 DATA WAREHOUSING
Key points
1. A data warehouse is a database that is kept separate from the
organization's operational database.
2. A data warehouse is not updated frequently.
3. A data warehouse helps executives take strategic decisions by
helping them organize and understand the data.
2. Data Mart:
- Definition: A Data Mart is a subset of an Enterprise Data
Warehouse that focuses on specific business areas or user
groups.
- Purpose: Data Marts are designed to meet the needs of a
particular department, business unit, or group of users. They
provide a more targeted and specialized view of data, making it
easier for specific teams to access and analyze information
relevant to their requirements.
Schema design: an operational database is ER-model based, while a data
warehouse uses a star or snowflake schema.
2. Regression:
Here data can be made smooth by fitting it to a regression
function. The regression used may be linear (having one
independent variable) or multiple (having multiple
independent variables). A minimal code sketch follows this list.
3. Clustering:
This approach groups similar data into clusters. Outliers may
go undetected, or they will fall outside the clusters.
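As a rough illustration of smoothing by regression, the following Python
sketch (with made-up values) fits a line to noisy observations and replaces
the raw values with the fitted ones:

    # Minimal sketch: smoothing noisy data with a linear regression fit
    # (hypothetical x and y values, one independent variable).
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.3, 11.9])   # noisy raw values

    slope, intercept = np.polyfit(x, y, deg=1)   # fit y ~ slope*x + intercept
    y_smooth = slope * x + intercept             # smoothed replacement values
    print(y_smooth)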
3. Discretization:
This is done to replace the raw values of a numeric attribute with
interval levels or conceptual levels.
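A minimal sketch of discretization, assuming made-up ages and interval
boundaries; the raw numeric values are replaced by conceptual labels:

    # Minimal sketch: replace raw numeric values with interval/conceptual levels.
    import numpy as np

    ages = np.array([5, 17, 23, 34, 41, 58, 66, 79])
    bins = [0, 18, 40, 60, 100]                            # interval boundaries (assumed)
    labels = ["child", "young", "middle-aged", "senior"]   # conceptual levels

    idx = np.digitize(ages, bins) - 1    # index of the interval each value falls in
    print([labels[i] for i in idx])      # ['child', 'child', 'young', ...]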
Applications of KDD
Some of the crucial applications of KDD are as follows:
Business and Marketing: user analysis, market prediction,
client segmentation, and targeted marketing are all examples of
business and marketing applications.
Manufacturing: Predictive system analysis, process
improvement, and quality control.
Key Features: Data Mining vs. KDD
Scope: In the KDD method, the fourth phase is called "data
mining"; KDD as a whole is a broad method that includes data
mining as one of its steps.
DBMS vs DM:
2. Clustering:
Clustering is the division of information into groups of related
objects. Describing the data by a few clusters inevitably loses some
fine detail, but achieves simplification: the data are modeled by their
clusters. Historically, clustering is rooted in statistics, mathematics,
and numerical analysis. From a machine learning point of view, clusters
correspond to hidden patterns, the search for clusters is unsupervised
learning, and the resulting framework represents a data concept. From a
practical point of view, clustering plays an important role in data
mining applications such as scientific data exploration, text mining,
information retrieval, spatial database applications, CRM, web analysis,
computational biology, and medical diagnostics.
In other words, cluster analysis is a data mining technique for
identifying similar data; it helps to recognize the differences and
similarities between data items. Clustering is similar to classification,
but it groups chunks of data together based on their similarities rather
than on predefined class labels.
3. Regression:
Regression analysis is a data mining process used to identify and
analyze the relationship between variables in the presence of other
factors. It is used to estimate the likely value of one variable from
the others. Regression is primarily a form of planning and modeling;
for example, we might use it to project certain costs, depending on
other factors such as availability, consumer demand, and competition.
Primarily, it characterizes the relationship between two or more
variables in the given data set.
4. Association Rules:
This data mining technique helps to discover a link between two or
more items. It finds a hidden pattern in the data set.
Association rules are if-then statements that help to show the
probability of interactions between data items within large data sets
in different types of databases. Association rule mining has several
applications and is commonly used to discover sales correlations in
transactional data or patterns in medical data sets.
The algorithm works on transaction data. For example, given lists of
the grocery items you have bought over the last six months, it
calculates the percentage of items being purchased together.
There are three major measurement techniques:
o Support:
This measures how often items A and B are purchased together,
relative to the entire dataset.
Support(A→B) = (transactions with both A and B) / (entire dataset)
o Confidence:
This measures how often item B is purchased when item A is
purchased as well.
Confidence(A→B) = (transactions with both A and B) / (transactions with A)
o Lift:
This measures the strength of the confidence relative to how often
item B is purchased on its own.
Lift(A→B) = Confidence(A→B) / Support(B)
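A minimal Python sketch of the three measures on hypothetical grocery
transactions, with A = "bread" and B = "butter":

    # Minimal sketch: support, confidence, and lift for the rule bread -> butter.
    transactions = [
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"bread", "jam"},
        {"butter", "milk"},
        {"bread", "butter", "jam"},
    ]
    n = len(transactions)

    both = sum(1 for t in transactions if {"bread", "butter"} <= t)
    has_a = sum(1 for t in transactions if "bread" in t)
    has_b = sum(1 for t in transactions if "butter" in t)

    support = both / n                   # (A and B) / entire dataset
    confidence = both / has_a            # (A and B) / A
    lift = confidence / (has_b / n)      # confidence / support(B)
    print(support, confidence, lift)     # 0.6 0.75 0.9375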
5. Outlier Detection:
This type of data mining technique relates to observing data items in
the data set that do not match an expected pattern or expected
behavior. It may be used in various domains such as intrusion
detection, fraud detection, etc. It is also known as outlier analysis
or outlier mining. An outlier is a data point that diverges greatly
from the rest of the dataset, and the majority of real-world datasets
contain outliers. Outlier detection plays a significant role in the
data mining field and is valuable in numerous areas such as network
intrusion identification, credit or debit card fraud detection, and
detecting outliers in wireless sensor network data.
6. Sequential Patterns:
Sequential pattern mining is a data mining technique specialized
for evaluating sequential data to discover sequential patterns. It
involves finding interesting subsequences in a set of sequences,
where the interestingness of a sequence can be measured in terms of
different criteria such as length, occurrence frequency, etc.
In other words, this technique of data mining helps to discover or
recognize similar patterns in transaction data over some period of time.
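A minimal sketch of measuring a sequential pattern's support: counting how
many (hypothetical) customer purchase sequences contain a given subsequence
in order, though not necessarily contiguously:

    # Minimal sketch: support of a subsequence across purchase sequences.
    def contains(sequence, pattern):
        it = iter(sequence)
        return all(item in it for item in pattern)   # order-preserving scan

    sequences = [
        ["laptop", "mouse", "keyboard"],
        ["laptop", "bag", "mouse"],
        ["phone", "charger"],
    ]
    pattern = ["laptop", "mouse"]
    support = sum(contains(s, pattern) for s in sequences) / len(sequences)
    print(support)   # 2 of 3 sequences contain the pattern -> 0.67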
These are the following areas where data mining is widely used:
Apprehending a criminal is not a big deal, but bringing out the truth
from him is a very challenging task. Law enforcement may use data
mining techniques to investigate offenses, monitor suspected terrorist
communications, etc. This technique includes text mining also, and it
seeks meaningful patterns in data, which is usually unstructured text. The
information collected from the previous investigations is compared, and
a model for lie detection is constructed.
- Characteristics:
- Examples:
- Characteristics:
- Examples:
Comparison:
- Flexibility:
Formula Derivation
By the definition of conditional probability:
P(E|H) = P(E∩H) / P(H)
P(H|E) = P(H∩E) / P(E)
Since P(H∩E) = P(E∩H), the first equation gives P(E∩H) = P(E|H) * P(H).
Substituting this into the second equation, we obtain:
P(H|E) = P(E|H) * P(H) / P(E)
This is the formula for Bayes' theorem for hypothesis H and event E.
It states that the probability of hypothesis H given event E is
proportional to the likelihood of the event given the hypothesis,
multiplied by the prior probability of the hypothesis, and divided by
the probability of the event.
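As a quick numeric check, assume a hypothetical diagnostic scenario where
H = "patient has the disease" and E = "test is positive", with assumed
probabilities:

    # Minimal sketch: Bayes' theorem with assumed (illustrative) probabilities.
    p_h = 0.01           # prior P(H)
    p_e_given_h = 0.95   # likelihood P(E|H)
    p_e = 0.05           # evidence P(E)

    p_h_given_e = p_e_given_h * p_h / p_e   # P(H|E) = P(E|H)*P(H)/P(E)
    print(p_h_given_e)                      # 0.19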
1. Two-Class Classification:
Classification Error = (Number of Misclassified Instances) / (Total Number of Instances)
In simpler terms, it is the ratio of the number of instances that the
model predicted incorrectly to the total number of instances in the
dataset.
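A minimal sketch with hypothetical true and predicted labels:

    # Minimal sketch: two-class classification error.
    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

    errors = sum(t != p for t, p in zip(y_true, y_pred))
    error_rate = errors / len(y_true)    # misclassified / total
    print(error_rate)                    # 2 of 8 instances wrong -> 0.25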
1. Transaction Partitioning:
2. Itemset Partitioning:
- Itemsets are combinations of items that frequently occur
together in transactions. Partitioning itemsets involves grouping
them based on common characteristics or properties. For
example, you might partition itemsets based on the type of
products involved.
3. Rule Partitioning:
4. Support-Confidence Partitioning:
Correlation analysis:
Correlation analysis is a statistical method used to evaluate the strength and
direction of the linear relationship between two quantitative variables. It
measures the degree to which changes in one variable are associated with
changes in another variable. The most common measure of correlation is the
Pearson correlation coefficient, denoted by r. The value of r ranges from
-1 to 1, where values near -1 indicate a strong negative linear relationship,
values near 1 indicate a strong positive linear relationship, and values near
0 indicate little or no linear relationship.
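A minimal sketch computing r with numpy on made-up data:

    # Minimal sketch: Pearson correlation coefficient r.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

    r = np.corrcoef(x, y)[0, 1]   # off-diagonal of the 2x2 correlation matrix
    print(r)                      # near 1: strong positive linear relationship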
It helps users to understand the structure or natural grouping in a data set, and it is
used either as a stand-alone instrument to get better insight into the data
distribution or as a pre-processing step for other algorithms.
Important points:
o Data objects of a cluster can be considered as one group.
o While doing cluster analysis, we first partition the data set into groups based on
data similarities and then assign labels to the groups.
o The main advantage of clustering over classification is that it is adaptable to
changes and helps single out useful features that distinguish different groups.
1. Scalability:
Scalability in clustering implies that as we increase the number of data objects, the
running time should grow roughly according to the complexity order of the
algorithm. For example, K-means clustering is linear in the number of objects n: if
we raise the number of data objects tenfold, the time taken to cluster them should
also increase approximately ten times. If that linear relationship does not hold,
there is likely some error in our implementation.
The algorithm should be scalable; if it is not, we cannot get appropriate results on
large data sets.
2. Interpretability:
The outcomes of clustering should be interpretable, comprehensible, and usable.
3. Discovery of clusters with arbitrary shape:
The clustering algorithm should be able to find clusters of arbitrary shape. It
should not be limited to distance measures that tend to discover only small,
spherical clusters.
4. Ability to deal with different types of attributes:
Algorithms should be capable of being applied to any kind of data, such as
interval-based (numeric) data, binary data, and categorical data.
5. Ability to deal with noisy data:
Databases contain data that is noisy, missing, or erroneous. Some algorithms are
sensitive to such data and may produce poor-quality clusters.
6. High dimensionality:
The clustering tools should be able to handle not only low-dimensional data but
also high-dimensional data spaces.
Algorithm
1. Initialize K cluster centroids randomly.
2. Assign each data point to the nearest centroid.
3. Recalculate the centroids' coordinates by computing the mean of the
data points assigned to each cluster.
4. Repeat steps 2 and 3 until the cluster assignments no longer change or a
maximum number of iterations is reached.
5. Return the K clusters and their respective centroids.
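A minimal numpy sketch of these five steps on made-up 2-D data; edge cases
such as empty clusters are not handled:

    # Minimal sketch: K-means following the steps listed above.
    import numpy as np

    def kmeans(X, k, max_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), k, replace=False)]       # step 1
        for _ in range(max_iters):                                # step 4
            dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
            labels = dists.argmin(axis=1)                         # step 2
            new = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # step 3
            if np.allclose(new, centroids):                       # assignments stable
                break
            centroids = new
        return labels, centroids                                  # step 5

    X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
    labels, centroids = kmeans(X, k=2)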
Advantages of K-Means
Scalability
Speed
Simplicity
Interpretability
Disadvantages of K-Means
Curse of dimensionality
User-defined K
Non-convex shape clusters
Unable to handle noisy data
Disadvantages of K-Medoid:
1. Computational Complexity
2. Dependency on Initial Medoid Selection
3. Sensitivity to the Number of Clusters (k)
4. Limited to Single Linkage
5. Not Suitable for Large Datasets
Sensitivity to Outliers: k-means is sensitive to outliers, as they can
significantly affect the mean (centroid); k-medoid is less sensitive, as
medoids are less influenced by extreme values.
Computational Complexity: k-means is generally more computationally
efficient; k-medoid tends to be more expensive, especially for large
datasets, due to pairwise dissimilarity computations.
Hierarchical method:
Hierarchical clustering is a method of cluster analysis in data mining that
creates a hierarchical representation of the clusters in a dataset. The method
starts by treating each data point as a separate cluster and then iteratively
combines the closest clusters until a stopping criterion is reached. The result of
hierarchical clustering is a tree-like structure, called a dendrogram, which
illustrates the hierarchical relationships among the clusters.
Hierarchical clustering has several advantages over other clustering methods:
The ability to handle non-convex clusters and clusters of different
sizes and densities.
The ability to handle missing data and noisy data.
The ability to reveal the hierarchical structure of the data, which can
be useful for understanding the relationships among the clusters.
Drawbacks of Hierarchical Clustering
The need for a criterion to stop the clustering process and determine
the final number of clusters.
The computational cost and memory requirements of the method can
be high, especially for large datasets.
The results can be sensitive to the initial conditions, linkage criterion,
and distance metric used.
In summary, Hierarchical clustering is a method of data mining that
groups similar data points into clusters by creating a hierarchical
structure of the clusters.
This method can handle different types of data and reveal the
relationships among the clusters. However, it can have high
computational cost and results can be sensitive to some conditions.
Types of Hierarchical Clustering
Basically, there are two types of hierarchical Clustering:
1. Agglomerative Clustering
2. Divisive clustering
1. Agglomerative Clustering
Initially, consider every data point as an individual cluster and, at every
step, merge the nearest pair of clusters (it is a bottom-up method). At every
iteration, clusters merge with other clusters until only one cluster remains.
The algorithm for Agglomerative Hierarchical Clustering is:
1. Consider every data point as an individual cluster.
2. Calculate the similarity of each cluster with all the other clusters
(compute the proximity matrix).
3. Merge the clusters that are most similar or closest to each other.
4. Recalculate the proximity matrix for the merged clusters.
5. Repeat steps 3 and 4 until only a single cluster remains.
Let’s see the graphical representation of this algorithm using a dendrogram.
Note: This is just a demonstration of how the actual algorithm works; no
calculation has been performed below, and all proximities among the
clusters are assumed.
Let’s say we have six data points A, B, C, D, E, and F.
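A minimal SciPy sketch of this demonstration; the coordinates of A-F are
assumed purely for illustration:

    # Minimal sketch: agglomerative clustering of six points and its dendrogram.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    points = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5], [9, 1], [9.5, 1.5]])
    Z = linkage(points, method="single")   # merge the closest pair at each step

    dendrogram(Z, labels=["A", "B", "C", "D", "E", "F"])
    plt.show()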
4. Mean-Shift Clustering:
- Mean-shift is a density-based method that iteratively shifts data points
towards the mode (peak) of the data distribution. It automatically determines
the number of clusters and is capable of detecting clusters with irregular
shapes.
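A minimal scikit-learn sketch on made-up 2-D data; note that, unlike K-means,
no cluster count is passed in:

    # Minimal sketch: mean-shift clustering; the number of clusters is discovered.
    import numpy as np
    from sklearn.cluster import MeanShift

    X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 6])
    ms = MeanShift()              # bandwidth is estimated automatically if omitted
    labels = ms.fit_predict(X)
    print(ms.cluster_centers_)    # one center per discovered mode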
Structured Pruning:
Structured pruning involves eliminating whole structures or groups of
parameters from the model, such as entire neurons, channels, or filters.
This type of pruning preserves the underlying structure of the model,
meaning that the pruned model will have the same overall architecture
as the original model, but with fewer parameters.
Unstructured Pruning:
Unstructured pruning involves eliminating individual parameters from
the model without regard for their location in the model. This type of
pruning does not preserve the underlying structure of the model,
meaning that the pruned model may have an irregular architecture
compared to the original model. Unstructured pruning is suitable for
models without a structured architecture, such as fully connected
neural networks, where the parameters are organized into a single matrix.
It tends to be more effective than structured pruning since it allows for
more fine-grained pruning; however, it can also be more difficult to
implement.
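As a rough illustration of the unstructured case, the sketch below applies
magnitude pruning to a plain numpy weight matrix; it sketches the idea only
and is not any particular framework's pruning API:

    # Minimal sketch: unstructured (magnitude) pruning of a weight matrix.
    import numpy as np

    weights = np.random.randn(4, 4)
    sparsity = 0.5                                # fraction of weights to prune

    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold           # keep only larger-magnitude weights
    pruned = weights * mask
    print(f"zeroed {np.mean(pruned == 0):.0%} of the parameters")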
Advantages
o Decreased model size and complexity. Pruning can significantly
reduce the number of parameters in a machine learning model,
leading to a smaller and simpler model that is easier to train
and deploy.
o Faster inference. Pruning can decrease the computational cost of
making predictions, leading to faster and more efficient
predictions.
o Improved generalization. Pruning can prevent overfitting and
improve the generalization ability of the model by reducing the
model's complexity.
o Increased interpretability. Pruning can result in a simpler and more
interpretable model, making it easier to understand and explain
the model's decisions.
Disadvantages
o Possible loss of accuracy. Pruning can sometimes result in a loss of
accuracy, especially if too many parameters are pruned or if
pruning is not done carefully.
o Increased training time. Pruning can increase the training time of
the model, especially if it is done iteratively during training.
o Difficulty in choosing the right pruning technique. Choosing the
right pruning technique can be challenging and may require domain
expertise and experimentation.
o Risk of over-pruning. Over-pruning can lead to an overly simplified
model that is not accurate enough for the task.
Extracting classification rules from decision tree:
- Trace the paths from the root to each leaf node by following the
conditions at each decision node. Each path represents a unique
combination of conditions.
- Associate each rule with the class label assigned to the corresponding
leaf node. This indicates the predicted outcome for instances that satisfy
the conditions of the rule.
- If the tree deals with missing values, include rules for how the model
handles them at decision nodes.
- Optionally, you can simplify rules to make them more concise and
easier to understand. This may involve combining similar rules or
expressing them in a more compact form.
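As one concrete illustration, scikit-learn's export_text prints the
root-to-leaf paths of a fitted tree, which correspond to the if-then rules
described above (the bundled iris data is used only for demonstration):

    # Minimal sketch: extracting rules from a decision tree with export_text.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    clf = DecisionTreeClassifier(max_depth=2, random_state=0)
    clf.fit(iris.data, iris.target)

    # Each printed root-to-leaf path is one classification rule.
    print(export_text(clf, feature_names=list(iris.feature_names)))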
1. Entropy Calculation:
2. Attribute Selection:
3. Node Creation:
4. Recursion:
5. Stopping Conditions:
6. Tree Construction:
- Continue building the tree until the stopping conditions are met for
all branches.
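A minimal sketch of the entropy and attribute-selection computations
(steps 1-2), with hypothetical yes/no labels and an assumed candidate split:

    # Minimal sketch: entropy and information gain as used by ID3.
    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    parent = ["yes", "yes", "no", "no", "yes", "no"]        # labels before the split
    splits = [["yes", "yes", "yes"], ["no", "no", "no"]]    # subsets after the split

    gain = entropy(parent) - sum(len(s) / len(parent) * entropy(s) for s in splits)
    print(gain)   # 1.0: this (perfect) split removes all uncertainty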
Limitations of ID3:
1. Overfitting:
- ID3 can grow deep trees that fit the training data too closely, which
hurts generalization.
2. Missing Values:
- ID3 does not handle missing values well, and it may exclude instances
with missing values during attribute selection.
1. Presorting:
- This presorting step is performed once for each attribute, and the
sorted order is maintained throughout the tree construction process.
2. Split Evaluation:
- For each attribute, traverse the sorted values and evaluate potential
split points.
- Choose the attribute and split point that result in the maximum
impurity reduction.
3. Node Creation:
- Create a decision node in the tree based on the selected attribute
and split point.
4. Recursion:
5. Stopping Conditions:
6. Tree Construction:
- Continue building the tree until the stopping conditions are met for
all branches.
Benefits of Presorting:
1. Efficiency Improvement:
Considerations:
- Memory Usage:
- Applicability:
1. RapidMiner:
- Features: RapidMiner provides a user-friendly interface for designing and
executing data workflows without extensive coding. It supports various data
preprocessing tasks, modeling techniques, and evaluation methods. It also
allows integration with other data science tools and languages like R and
Python.
2. Weka:
- Features: Weka is a Java-based software that offers a vast collection of
machine learning algorithms for data mining. It provides tools for data
preprocessing, classification, regression, clustering, association rule mining,
and feature selection. Weka is popular for its ease of use and is widely used for
educational purposes.
3. KNIME:
- Features: KNIME is known for its modular and visual approach to data
analysis. Users can create workflows by connecting pre-built nodes that
perform specific tasks. KNIME supports integration with various data sources
and offers a range of analytics and reporting capabilities.
4. IBM SPSS Modeler:
- Features: SPSS Modeler is part of the IBM SPSS Statistics suite. It provides a
visual interface for building predictive models using machine learning
algorithms. The software supports data preparation, model building,
evaluation, and deployment. It is widely used in industries for tasks such as
customer segmentation and predictive maintenance.
5. SAS Enterprise Miner:
- Features: SAS Enterprise Miner is a comprehensive data mining and
predictive analytics tool. It includes a variety of statistical and machine learning
algorithms for tasks like regression, clustering, and decision trees. The software
is often used in industries such as finance, healthcare, and marketing.
6. TensorFlow and scikit-learn:
- Features: TensorFlow is an open-source machine learning library developed
by Google. While it is more popular for deep learning, it also includes tools for
traditional machine learning tasks. Scikit-learn, on the other hand, is a Python
library that provides simple and efficient tools for data mining and data
analysis. Both are widely used in the Python data science ecosystem.
7. Tableau:
- Features: Tableau is primarily a data visualization tool that connects to
various data sources. While not a traditional data mining tool, it allows users to
explore and analyze data visually, uncovering patterns and trends. Tableau can
be integrated with other data science tools to enhance its analytical
capabilities.
8. Sisense:
- Features: Sisense is a business intelligence platform that goes beyond
traditional data mining. It includes data preparation, analysis, and visualization
features. Sisense allows users to create interactive dashboards and reports,
enabling organizations to make informed decisions based on data insights.
By using text mining, the unstructured text data can be transformed into
structured data that can be used for data mining tasks such as
classification, clustering, and association rule mining. This allows
organizations to gain insights from a wide range of data sources, such as
customer feedback, social media posts, and news articles.
1. TF-IDF Scoring:
- Calculate the Term Frequency (TF) for each word in the document.
- Calculate the Inverse Document Frequency (IDF) for each word across
the document collection.
- Multiply TF and IDF to obtain the TF-IDF score for each word. (A
minimal code sketch appears after this list.)
2. TextRank Algorithm:
3. Frequency-Based Methods:
4. Noun Phrase Extraction:
- Identifying and extracting noun phrases from the text can yield
meaningful keywords. This can be done using part-of-speech tagging to
identify nouns and noun phrases in the text.
7. Topic Modeling:
8. Domain-Specific Methods:
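A minimal sketch of the TF-IDF method (item 1 above) using scikit-learn's
TfidfVectorizer on toy documents; the top-scoring words of a document serve
as its keywords:

    # Minimal sketch: TF-IDF keyword extraction.
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "data mining extracts patterns from large data sets",
        "data warehousing stores integrated historical data",
        "text mining analyzes unstructured text documents",
    ]
    vec = TfidfVectorizer()
    scores = vec.fit_transform(docs).toarray()[0]    # TF-IDF row for document 0

    words = vec.get_feature_names_out()
    top = sorted(zip(scores, words), reverse=True)[:3]
    print([w for _, w in top])                       # top keywords of document 0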
1. Parsing:
- Types of Parsing:
- Applications:
- Semantic Role Labeling (SRL): Parsing helps identify the roles that
different words play in a sentence, such as the subject, object, or
predicate, which is crucial for understanding the semantic structure.
2. Soft Parsing:
- Probabilistic Models:
- Applications:
- Advantages:
- Techniques:
- Applications:
- Techniques:
- Applications:
- Techniques:
- Applications:
- Data Volume and Diversity: The web generates vast and diverse data,
making it challenging to handle and analyze.
Classifying web pages: Classifying web pages involves categorizing them into
predefined classes or topics based on their content, structure, or other relevant
features. This process is essential for various applications, including information
retrieval, content organization, and user experience improvement. Here are
common techniques and approaches for classifying web pages:
1. Text-Based Classification:
- Technique: Analyzing the textual content of web pages to determine their
category (a minimal code sketch of this approach appears at the end of this list).
- Methods:
- Natural Language Processing (NLP): Using techniques such as tokenization,
stemming, and sentiment analysis to process and understand the text.
- Machine Learning Algorithms: Employing supervised learning algorithms,
such as Naive Bayes, Support Vector Machines (SVM), or neural networks, to
train models on labeled data.
2. Content-Based Classification:
- Technique: Examining the features and attributes of web page content,
including text, images, and multimedia elements.
- Methods:
- Keyword Extraction: Identifying key terms in the text content.
- Image Analysis: Analyzing images or multimedia content for classification.
- Text and Image Fusion: Combining information from both textual and visual
elements.
3. Link-Based Classification:
- Technique: Analyzing the link structure and relationships between web
pages.
- Methods:
- Link Analysis Algorithms: Using algorithms like PageRank to determine the
importance of pages based on their links.
- Community Detection: Identifying clusters or groups of interlinked pages.
4. Web Structure-Based Classification:
- Technique: Examining the HTML structure, tags, and other structural
elements of web pages.
- Methods:
- HTML DOM Parsing: Analyzing the Document Object Model (DOM) of web
pages.
- Pattern Recognition: Identifying structural patterns in the HTML code.
5. Domain-Specific Classification:
- Technique: Considering domain-specific characteristics or features for
classification.
- Methods:
- Custom Features: Defining and extracting features relevant to the specific
domain.
- Supervised Learning: Training models on domain-specific labeled data.
6. Web Usage-Based Classification:
- Technique: Analyzing user interactions and behavior on web pages for
classification.
- Methods:
- Clickstream Analysis: Considering user clicks, navigation paths, and session
data.
- Behavioral Pattern Recognition: Identifying recurring patterns in user
behavior.
7. Machine Learning Models for Web Page Classification:
- Technique: Employing various machine learning models to classify web
pages.
- Methods:
- Decision Trees, Random Forests: Suitable for handling categorical features.
- Support Vector Machines (SVM): Effective for binary and multiclass
classification.
- Neural Networks: Deep learning models for complex patterns and
representations.
8. Ensemble Methods:
- Technique: Combining predictions from multiple classifiers to improve
overall accuracy.
- Methods:
- Voting Systems: Combining results through majority voting.
- Bagging (Bootstrap Aggregating): Training multiple models on different
subsets of the data.
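As a hedged illustration of text-based classification (technique 1 above),
the sketch below trains a small TF-IDF + Naive Bayes pipeline on hypothetical
page texts and labels:

    # Minimal sketch: classifying web pages by their textual content.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    pages = [
        "latest football scores and match highlights",
        "stock market rallies as tech shares climb",
        "new smartphone review: camera and battery life",
        "championship finals draw record crowds",
    ]
    labels = ["sports", "finance", "technology", "sports"]

    clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
    clf.fit(pages, labels)
    print(clf.predict(["quarterly earnings beat analyst expectations"]))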
The end