
DATA MINING AND DATA WAREHOUSING

UNIT-1 DATA WAREHOUSING

Introduction to Data Warehousing: A data warehouse is like a super-organized,
giant storage space where a company keeps all its
important information. It collects data from different places and
stores it in a way that is easy for everyone in the company to
understand. It not only shows what is happening now but also keeps a
record of what happened before. This helps people make smart
decisions and understand how things have changed over time. It is
basically a big, well-organized hub for all the data a company needs
to work well and make good choices.
The concept was introduced by Bill Inmon in 1990.
Formally, a data warehouse is a subject-oriented, integrated, time-variant,
and non-volatile collection of data.

Key points
1. A data warehouse is a database, kept separate from the
organization's operational database.
2. A data warehouse is not updated frequently.
3. A data warehouse helps executives take strategic decisions
by organizing and summarizing the data.

Data warehouse features:-


1. Subject-oriented:- it provides information organized around
subjects (e.g., products, customers) rather than around
ongoing operations.
2. Integrated:- the data warehouse is constructed by
integrating data from multiple heterogeneous sources.
3. Time-variant:- the data collected in the warehouse is
identified with a particular time period, so the data
warehouse provides information from a historical
point of view.
4. Non-volatile:- previous data is not erased when new
data is added.

Why we need a data warehouse: We need a data warehouse for
several important reasons:

1. Centralized Data: Organizations deal with data from various
sources like sales, finance, and operations. A data warehouse
centralizes this data, providing a single, unified source of truth. This
makes it easier for everyone in the organization to access and work
with consistent information.

2. Historical Analysis: Data warehouses store historical data, allowing
organizations to analyze trends and changes over time. This historical
perspective is crucial for making informed decisions, understanding
business patterns, and planning for the future.

3. Improved Decision-Making: By having a comprehensive and
organized view of data, decision-makers can make more informed
and strategic choices. Data warehouses support business intelligence,
helping users extract valuable insights and identify opportunities or
areas for improvement.
4. Data Quality and Consistency: Data warehouses often involve
Extract, Transform, Load (ETL) processes, which help ensure data
quality by cleaning and transforming information before it is stored.
This contributes to more reliable and accurate reporting and analysis.
5. Efficient Reporting and Analysis: Data warehouses are optimized
for analytical queries, making it faster and more efficient to retrieve
and analyze large volumes of data. This facilitates reporting and
business analytics, enabling users to derive meaningful insights from
the data.
6. Support for Business Growth: As organizations grow, so does the
volume and complexity of their data. A data warehouse provides a
scalable solution to handle this growth, ensuring that the
infrastructure can support increasing data demands without
sacrificing performance.
7. Consolidation of Data Sources: Companies often have data stored
in different systems and formats. A data warehouse consolidates this
diverse data into a standardized format, simplifying access and
analysis. This integration enhances data consistency and reduces the
risk of errors.
8. Facilitation of Business Intelligence (BI): Data warehouses are a
cornerstone of business intelligence initiatives. They provide a
foundation for BI tools and applications, empowering users to
explore and visualize data for better decision-making.

Advantages of Data Warehousing


 Intelligent Decision-Making: With centralized data in
warehouses, decisions may be made more quickly and
intelligently.
 Business Intelligence: Provides strong operational insights
through business intelligence.
 Historical Analysis: Predictions and trend analysis are made
easier by storing past data.
 Data Quality: Guarantees data quality and consistency for
trustworthy reporting.
 Scalability: Capable of managing massive data volumes and
expanding to meet changing requirements.
 Effective Queries: Fast and effective data retrieval is made
possible by an optimized structure.
 Cost reductions: Although there are setup costs initially, data
warehousing can result in cost savings over time by streamlining
data management procedures and increasing overall efficiency.
 Data security: Data warehouses employ security protocols to
safeguard confidential information, guaranteeing that only
authorized personnel are granted access to certain data.
Disadvantages of Data Warehousing
 Cost: Building a data warehouse can be expensive, requiring
significant investments in hardware, software, and personnel.
 Complexity: Data warehousing can be complex, and businesses
may need to hire specialized personnel to manage the system.
 Time-consuming: Building a data warehouse can take a
significant amount of time, requiring businesses to be patient
and committed to the process.
 Data integration challenges: Data from different sources can be
challenging to integrate, requiring significant effort to ensure
consistency and accuracy.
 Data security: Data warehousing can pose data security risks,
and businesses must take measures to protect sensitive data
from unauthorized access.
Types Of DATA WAREHOUSE:- There are three main types of
data warehouses, each serving specific needs and purposes
within an organization. These types are:

1. Enterprise Data Warehouse (EDW):


- Definition: An Enterprise Data Warehouse is a centralized
repository that integrates data from various sources across the
entire organization.
- Purpose: EDWs provide a comprehensive and unified view of
an organization's data, supporting enterprise-wide reporting,
analytics, and decision-making. They are typically used for
strategic planning and high-level analysis.

2. Data Mart:
- Definition: A Data Mart is a subset of an Enterprise Data
Warehouse that focuses on specific business areas or user
groups.
- Purpose: Data Marts are designed to meet the needs of a
particular department, business unit, or group of users. They
provide a more targeted and specialized view of data, making it
easier for specific teams to access and analyze information
relevant to their requirements.

3. Operational Data Store (ODS):


- Definition: An Operational Data Store is a database that
provides real-time or near-real-time operational data from
various transactional systems.
- Purpose: ODS serves as an intermediate layer between
source systems and the data warehouse. It is designed to
support operational reporting and provide a snapshot of
current, frequently changing data. ODS is particularly useful
when organizations require quick access to the latest
operational information.

OLAP:- OLAP stands for Online Analytical Processing. It is a category
of software tools that allows users to interactively analyze and
explore multidimensional data from different perspectives. OLAP
systems are designed for complex queries and reporting, providing a
fast and efficient way to retrieve and analyze aggregated data.

Key features of OLAP include:


1. Multidimensional Data Model: OLAP systems organize data into
multidimensional structures, often referred to as "cubes." These
cubes represent data in a way that allows users to easily navigate and
analyze information across multiple dimensions, such as time,
geography, and product categories.
2. Dimensions and Measures: In OLAP, dimensions represent the
different ways data can be analyzed (e.g., time, geography), while
measures are the numerical data points being analyzed (e.g., sales,
revenue). Users can "slice" and "dice" the data cube along different
dimensions to view specific subsets of data.
3. Aggregation and Drill-Down: OLAP systems support aggregation,
allowing users to view data at different levels of granularity. Users
can drill down into more detailed data or roll up to view summarized
data, enabling a hierarchical exploration of information.
4. Fast Query Performance: OLAP databases are optimized for fast
query performance, making it efficient to retrieve and analyze large
volumes of data. This is crucial for interactive and ad-hoc analysis,
where users need quick responses to their queries.
5. Flexibility in Analysis: OLAP provides a flexible environment for
users to perform ad-hoc analysis and explore data interactively. Users
can change dimensions, apply filters, and manipulate the data to gain
insights and answer specific business questions.
6. Business Intelligence (BI) Integration: OLAP is often integrated with
business intelligence tools, reporting systems, and dashboards. This
integration enhances the presentation and visualization of data,
making it easier for users to interpret and communicate insights.

There are two main types of OLAP systems:

1. ROLAP (Relational OLAP): ROLAP systems store data in relational
databases and generate multidimensional views on the fly. They
leverage the existing relational database infrastructure for storage
and query processing.
2. MOLAP (Multidimensional OLAP): MOLAP systems store data in a
specialized multidimensional database format. Examples include
Microsoft Analysis Services and IBM Cognos TM1. MOLAP systems
often provide faster query performance but may require additional
storage considerations.

Multidimensional Data Model: A multi-dimensional data model is a
type of database model that organizes data into multiple dimensions
for more efficient and intuitive retrieval. This model is particularly
useful for analytical and business intelligence applications where
users need to analyze and explore data from various perspectives.

Key concepts in a multi-dimensional data model include:


1. Dimensions: Dimensions represent categories by which data is
organized. For example, in a sales database, dimensions could
include time, geography, product, and customer.
2. Hierarchies: Each dimension can have multiple levels of hierarchy.
For instance, the time dimension might have levels such as year,
quarter, month, and day.
3. Measures: Measures are the numeric values or metrics that users
want to analyze. In a sales database, measures might include
revenue, quantity sold, and profit.
4. Cubes: A cube is a multi-dimensional array that contains the actual
data values. It is formed by the intersection of dimensions. Each cell
in the cube contains a specific measure for a particular combination
of dimension values.
5. Slicing and Dicing: Slicing involves taking a "slice" of the cube to
view specific values along one dimension. Dicing involves viewing a
smaller, subcube of data by specifying values for two or more
dimensions.
6. Drill Down and Roll Up: Drill down involves moving from a higher
level of detail to a lower level (e.g., going from yearly to monthly
data), while roll up involves moving from a lower level to a higher
level (e.g., going from monthly to yearly data).
7. OLAP (Online Analytical Processing): OLAP tools are commonly
used to interact with multi-dimensional data models. These tools
provide a user-friendly interface for exploring and analyzing data.
Benefits of multi-dimensional data models include improved
performance for complex queries, enhanced data analysis
capabilities, and a more intuitive representation of data relationships.
They are commonly used in data warehouses and decision support
systems to support business intelligence and reporting requirements.

Diff. b/w DATA WAREHOUSE and DBMS:

Database System | Data Warehouse
Supports operational processes. | Supports analysis and performance reporting.
Captures and maintains the data. | Explores the data.
Current data. | Multiple years of history.
Data is balanced within the scope of this one system. | Data must be integrated and balanced from multiple systems.
Data is updated when a transaction occurs. | Data is updated by scheduled processes.
Data verification occurs when entry is done. | Data verification occurs after the fact.
100 MB to GB. | 100 GB to TB.
ER based. | Star/Snowflake schema.
Application oriented. | Subject oriented.
Primitive and highly detailed. | Summarized and consolidated.
Flat relational. | Multidimensional.

OLAP operations: OLAP stands for Online Analytical Processing. An
OLAP server is a software technology that allows users to analyze
information from multiple database systems at the same time. It is
based on the multidimensional data model and allows the user to
query multi-dimensional data (e.g., Delhi -> 2018 -> Sales data).
OLAP databases are divided into one or more cubes, and these cubes
are known as hyper-cubes.
OLAP operations:
There are five basic analytical operations that can be performed on
an OLAP cube:
1. Drill down: In the drill-down operation, less detailed data is
converted into more detailed data. It can be done by:
 Moving down in the concept hierarchy
 Adding a new dimension
For example, in a sales cube the drill-down operation can be
performed by moving down in the concept hierarchy of the
Time dimension (Quarter -> Month).

2. Roll up: It is just the opposite of the drill-down operation. It
performs aggregation on the OLAP cube. It can be done by:
 Climbing up in the concept hierarchy
 Reducing the dimensions
For example, the roll-up operation can be performed by climbing
up in the concept hierarchy of the Location dimension
(City -> Country).

3. Dice: It selects a sub-cube from the OLAP cube by selecting two
or more dimensions. For example, a sub-cube can be selected by
applying the following criteria to three dimensions:
 Location = "Delhi" or "Kolkata"
 Time = "Q1" or "Q2"
 Item = "Car" or "Bus"

4. Slice: It selects a single dimension from the OLAP cube, which
results in the creation of a new sub-cube. For example, a slice can
be performed on the dimension Time = "Q1".

5. Pivot: It is also known as the rotation operation, as it rotates the
current view to get a new view of the representation. For example,
performing a pivot on the sub-cube obtained after the slice
operation gives a new (rotated) view of it.
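To make these operations concrete, below is a minimal Python sketch (using
pandas) that imitates the five operations on a small, made-up sales table. The
column names, values, and the use of pandas are illustrative assumptions, not
part of any particular OLAP product.

```python
# A minimal pandas sketch of the five OLAP operations on a toy sales "cube".
# The data, column names, and values are illustrative assumptions.
import pandas as pd

sales = pd.DataFrame({
    "Country": ["India", "India", "India", "India", "Canada", "Canada"],
    "City":    ["Delhi", "Delhi", "Kolkata", "Kolkata", "Toronto", "Toronto"],
    "Quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "Item":    ["Car", "Bus", "Car", "Bus", "Car", "Bus"],
    "Sales":   [120, 80, 95, 60, 70, 40],
})

# Roll up: aggregate City -> Country (climb the Location hierarchy).
roll_up = sales.groupby(["Country", "Quarter"])["Sales"].sum()

# Drill down: move to a finer level of detail (break totals out by Item).
drill_down = sales.groupby(["Country", "City", "Quarter", "Item"])["Sales"].sum()

# Slice: fix a single dimension value (Time = "Q1").
slice_q1 = sales[sales["Quarter"] == "Q1"]

# Dice: select a sub-cube by restricting two or more dimensions.
dice = sales[sales["City"].isin(["Delhi", "Kolkata"]) &
             sales["Quarter"].isin(["Q1", "Q2"]) &
             sales["Item"].isin(["Car", "Bus"])]

# Pivot: rotate the view (rows become columns) for a new presentation.
pivot = slice_q1.pivot_table(index="City", columns="Item",
                             values="Sales", aggfunc="sum")

print(roll_up, drill_down, slice_q1, dice, pivot, sep="\n\n")
```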

DATA PREPROCESSING: Data preprocessing is an important step in
the data mining process. It refers to the cleaning, transforming, and
integrating of data in order to make it ready for analysis. The goal of
data preprocessing is to improve the quality of the data and to make
it more suitable for the specific data mining task.
Some common steps in data preprocessing include:
Data Cleaning: This involves identifying and correcting errors or
inconsistencies in the data, such as missing values, outliers, and
duplicates. Various techniques can be used for data cleaning, such as
imputation, removal, and transformation.
Data Reduction: This involves reducing the size of the dataset while
preserving the important information. Data reduction can be
achieved through techniques such as feature selection and feature
extraction. Feature selection involves selecting a subset of relevant
features from the dataset, while feature extraction involves
transforming the data into a lower-dimensional space while
preserving the important information.
Data Integration: This involves combining data from multiple sources
to create a unified dataset. Data integration can be challenging as it
requires handling data with different formats, structures, and
semantics. Techniques such as record linkage and data fusion can be
used for data integration.
Data Transformation: This involves converting the data into a
suitable format for analysis. Common techniques used in data
transformation include normalization, standardization, and
discretization. Normalization is used to scale the data to a common
range, while standardization is used to transform the data to have
zero mean and unit variance. Discretization is used to convert
continuous data into discrete categories.
Data Discretization: This involves dividing continuous data into
discrete categories or intervals. Discretization is often used in data
mining and machine learning algorithms that require categorical
data. Discretization can be achieved through techniques such as
equal width binning, equal frequency binning, and clustering.
DATA CLEANING: Data Cleaning: This involves identifying and
correcting errors or inconsistencies in the data, such as missing
values, outliers, and duplicates. Various techniques can be used for
data cleaning, such as imputation, removal, and transformation.

steps of Data Cleaning:


The data can have many irrelevant and missing parts. To handle this
part, data cleaning is done. It involves handling of missing data, noisy
data etc.

 (a). Missing Data:


This situation arises when some data is missing in the data. It
can be handled in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is
quite large and multiple values are missing within a tuple.

2. Fill the Missing values:


There are various ways to do this task. You can choose to
fill the missing values manually, by attribute mean or the
most probable value.

 (b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by
machines. It can be generated due to faulty data collection, data
entry errors, etc. It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it.
The whole data is divided into segments of equal size, and
various methods are then applied to each segment. Each
segment is handled separately: for example, all values in a
segment can be replaced by the segment mean, or the
segment boundary values can be used instead.

2. Regression:
Here data can be made smooth by fitting it to a regression
function. The regression used may be linear (having one
independent variable) or multiple (having multiple
independent variables).

3. Clustering:
This approach groups the similar data in a cluster. The
outliers may be undetected or it will fall outside the
clusters.
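As a quick illustration of the cleaning steps above, here is a small Python
sketch (using pandas) that fills missing values with the attribute mean and
then smooths the values by bin means. The sample numbers are invented for
illustration.

```python
# A small sketch of two cleaning steps described above: filling missing values
# with the attribute mean, and smoothing noisy values by bin means (binning).
import pandas as pd

prices = pd.Series([4.0, 8.0, None, 15.0, 21.0, None, 24.0, 25.0, 28.0])

# (a) Missing data: fill missing entries with the attribute mean.
filled = prices.fillna(prices.mean())

# (b) Noisy data: equal-width binning, then replace each value by its bin mean.
bins = pd.cut(filled, bins=3)                       # 3 equal-width segments
smoothed = filled.groupby(bins).transform("mean")   # smoothing by bin means

print(pd.DataFrame({"original": prices, "filled": filled,
                    "bin": bins, "smoothed": smoothed}))
```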

DATA TRANSFORMATION: This involves converting the data into a
suitable format for analysis. Common techniques used in data
transformation include normalization, standardization, and
discretization. Normalization is used to scale the data to a common
range, while standardization is used to transform the data to have
zero mean and unit variance. Discretization is used to convert
continuous data into discrete categories.

Steps of Data Transformation:


1. Normalization:
It is done in order to scale the data values in a specified range (-
1.0 to 1.0 or 0.0 to 1.0)
2. Attribute Selection:
In this strategy, new attributes are constructed from the given
set of attributes to help the mining process.

3. Discretization:
This is done to replace the raw values of numeric attribute by
interval levels or conceptual levels.

4. Concept Hierarchy Generation:


Here attributes are converted from lower level to higher level in
hierarchy. For Example-The attribute “city” can be converted to
“country”.
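The following short Python sketch illustrates the normalization,
standardization, and discretization steps described above; the sample values
and the chosen interval boundaries are assumptions made only for demonstration.

```python
# A minimal sketch of the transformation steps above: min-max normalization
# to [0, 1], z-score standardization, and discretization into interval levels.
import numpy as np

x = np.array([20.0, 35.0, 50.0, 65.0, 80.0, 95.0])

# Normalization: rescale values into the range 0.0 to 1.0.
normalized = (x - x.min()) / (x.max() - x.min())

# Standardization: zero mean, unit variance.
standardized = (x - x.mean()) / x.std()

# Discretization: replace raw numeric values by interval (conceptual) levels.
levels = np.digitize(x, bins=[40.0, 70.0])          # 0 = low, 1 = medium, 2 = high
labels = np.array(["low", "medium", "high"])[levels]

print(normalized, standardized, labels, sep="\n")
```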

DATA REDUCTION: Data reduction is a technique used in data mining
to reduce the size of a dataset while still preserving the most
important information. This can be beneficial in situations where the
dataset is too large to be processed efficiently, or where the dataset
contains a large amount of irrelevant or redundant information.
There are several different data reduction techniques that can be
used in data mining, including:
1. Data Sampling: This technique involves selecting a subset of
the data to work with, rather than using the entire dataset. This
can be useful for reducing the size of a dataset while still
preserving the overall trends and patterns in the data.
2. Dimensionality Reduction: This technique involves reducing the
number of features in the dataset, either by removing features
that are not relevant or by combining multiple features into a
single feature.
3. Data Compression: This technique involves using techniques
such as lossy or lossless compression to reduce the size of a
dataset.
4. Data Discretization: This technique involves converting
continuous data into discrete data by partitioning the range of
possible values into intervals or bins.
5. Feature Selection: This technique involves selecting a subset of
features from the dataset that are most relevant to the task at
hand.
Note that data reduction involves a trade-off between accuracy and
the size of the data: the more the data is reduced, the more
information may be lost, and the less accurate and generalizable
the resulting model may be. A short sketch of sampling,
dimensionality reduction, and feature selection follows.
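Below is a brief Python sketch (using NumPy and scikit-learn) of three of the
reduction techniques listed above: sampling, dimensionality reduction with PCA,
and a simple variance-based feature selection. The random data, the number of
components, and the variance threshold are illustrative assumptions.

```python
# A short sketch of three data reduction techniques on random synthetic data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))   # 1000 rows, 10 numeric features
X[:, :3] *= 0.1                   # make the first 3 features nearly constant

# 1. Data sampling: keep a random 10% subset of the rows.
sample_idx = rng.choice(len(X), size=100, replace=False)
X_sample = X[sample_idx]

# 2. Dimensionality reduction (feature extraction): project onto 3 components.
X_pca = PCA(n_components=3).fit_transform(X)

# 3. Feature selection: drop features whose variance is below a threshold,
#    which removes the three low-variance columns created above.
X_selected = VarianceThreshold(threshold=0.5).fit_transform(X)

print(X_sample.shape, X_pca.shape, X_selected.shape)  # (100, 10) (1000, 3) (1000, 7)
```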

DISCRETIZATION: Data discretization refers to a method of converting
a huge number of data values into smaller ones so that the
evaluation and management of data become easy. In other words,
data discretization is a method of converting the values of a
continuous attribute into a finite set of intervals with minimum data
loss. There are two forms of data discretization: supervised
discretization and unsupervised discretization. Supervised
discretization is a method in which the class information is used.
Unsupervised discretization does not use class information and is
characterized by the direction in which it proceeds: a top-down
splitting strategy or a bottom-up merging strategy.
Now, we can understand this concept with the help of an example.
Suppose we have an attribute Age with the given values:

Table before discretization:

Attribute | Values
Age | 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77

Table after discretization:

Age values | Label after discretization
1, 5, 4, 9, 7 | Child
11, 14, 17, 13, 18, 19 | Young
31, 33, 36, 42, 44, 46 | Mature
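The same Age discretization can be sketched in Python with pandas as shown
below. The exact bin edges, and the extra "Old" label used for the remaining
values (70-78), are assumptions made for illustration.

```python
# Discretizing the Age attribute above into interval labels with pandas.
# Bin edges and the "Old" label are assumptions, not given in the notes.
import pandas as pd

age = pd.Series([1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19,
                 31, 33, 36, 42, 44, 46, 70, 74, 78, 77])

labels = pd.cut(age,
                bins=[0, 10, 30, 60, 100],
                labels=["Child", "Young", "Mature", "Old"])

print(pd.DataFrame({"Age": age, "Group": labels}))
```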

Some Famous techniques of data discretization


Histogram analysis
Histogram refers to a plot used to represent the underlying frequency
distribution of a continuous data set. Histogram assists the data
inspection for data distribution. For example, Outliers, skewness
representation, normal distribution representation, etc.
Binning
Binning refers to a data smoothing technique that groups a huge
number of continuous values into a smaller number of bins. This
technique can also be used for data discretization and for the
development of concept hierarchies.
Cluster Analysis
Cluster analysis is a form of data discretization. A clustering algorithm
can be applied to a numeric attribute x by partitioning its values into
clusters or groups, which then serve as the discrete categories.

Data discretization and concept hierarchy generation


The term hierarchy represents an organizational structure or
mapping in which items are ranked according to their levels of
importance. In other words, we can say that a hierarchy concept
refers to a sequence of mappings with a set of more general concepts
to complex concepts. It means mapping is done from low-level
concepts to high-level concepts. For example, in computer science,
there are different types of hierarchical systems. The way a document
is placed in a folder at a specific place in the Windows directory tree
is a familiar example of a hierarchical tree model. There are
two types of hierarchy: top-down mapping and bottom-up mapping.
Let's understand this concept hierarchy for the dimension location
with the help of an example.
A particular city can map with the belonging country. For example,
New Delhi can be mapped to India, and India can be mapped to Asia.
Top-down mapping
Top-down mapping generally starts with the top with some general
information and ends with the bottom to the specialized information.
Bottom-up mapping
Bottom-up mapping generally starts with the bottom with some
specialized information and ends with the top to the generalized
information.
UNIT-2 DATA MINING

DATA MINING: Data mining is like digging for valuable information in


a vast digital field. It involves using computer algorithms to sift
through large amounts of data, seeking patterns, trends, or hidden
insights that might not be immediately apparent. Imagine you have a
huge pile of information, and data mining is the process of carefully
examining and analyzing that pile to discover meaningful
connections, associations, or predictions. It helps businesses and
researchers make sense of complex data sets, revealing valuable
knowledge that can be used to improve decision-making and uncover
new opportunities.
KDD: The most common full form of "KDD" is "Knowledge Discovery
in Databases." KDD refers to the process of discovering useful
knowledge from large volumes of data. It involves various steps such
as data cleaning, data preprocessing, data mining, and interpretation
of the results. Knowledge Discovery in Databases is often associated
with the broader field of data mining and is used to extract valuable
patterns, trends, and insights from complex datasets

Applications of KDD
Some of the crucial applications of KDD are as follows:
 Business and Marketing: User analysis, market prediction,
Segmenting clients, and focused marketing are all examples of
business and marketing databases.
 Manufacturing: Predictive system analysis, process
improvement, and quality control.

 Finance: Fraud investigation, evaluation of credit risk, and stock


market research in the finance sector can be analysed using the
KDD method.

 Healthcare: Drug progress, patient monitoring, and disease


diagnosis from a large set of patient data.

 Scientific research: Identifying patterns in massive scientific


databases, such as genetics, astronomy, and climate.

KDD VS DM(DATA MINING):

Key Features | Data Mining | KDD
Basic Definition | Data mining is the process of identifying patterns and extracting details about big data sets using intelligent methods. | The KDD method is a complex and iterative approach to knowledge extraction from big data.
Goal | To extract patterns from datasets. | To discover knowledge from datasets.
Scope | In the KDD method, the fourth phase is called "data mining." | KDD is a broad method that includes data mining as one of its steps.
Techniques Used | Classification, Clustering, Decision Trees, Dimensionality Reduction, Neural Networks, Regression. | Data cleaning, Data integration, Data selection, Data transformation, Data mining, Pattern evaluation, Knowledge presentation.
Example | Clustering groups of data elements based on how similar they are. | Data analysis to find patterns and links.

DBMS vs DM:

Aspect | DBMS | Data Mining
Purpose | Manages and organizes structured data for efficient storage, retrieval, and manipulation. | Extracts patterns, knowledge, and insights from large datasets to discover hidden information.
Primary Function | Storage and retrieval of data. | Analysis and discovery of patterns in data.
Data Type | Structured data (tables and relationships). | Structured, semi-structured, or unstructured data.
Focus | Operational data for day-to-day transactions and business applications. | Analytical data for decision support and strategic planning.
Querying | SQL (Structured Query Language) is commonly used for querying and manipulation. | Various algorithms and techniques are employed for pattern discovery and analysis.
Tasks | CRUD operations (Create, Read, Update, Delete), transaction management. | Classification, regression, clustering, association rule mining, anomaly detection, etc.
Processing Speed | Optimized for quick retrieval and manipulation of small sets of records. | Involves complex analysis, potentially slower than DBMS operations, especially for large datasets.
Examples | MySQL, Oracle, SQL Server. | WEKA, RapidMiner, KNIME.
Usage | Integral part of day-to-day business operations. | Used for decision support, business intelligence, and research purposes.
Normalization | Normalizes data to eliminate redundancy and ensure consistency. | May involve normalization but also includes denormalization for efficient analysis.
Integration | Manages and integrates data from different sources for operational purposes. | Integrates data from diverse sources to discover hidden patterns and relationships.
User Interface | Typically has a user-friendly interface for managing and querying data. | Specialized tools or programming interfaces for designing, running, and interpreting mining tasks.
Examples in Practice | Managing customer orders, employee records, inventory in an e-commerce system. | Identifying customer purchasing patterns, fraud detection, predicting market trends.

DATA MINING techniques:


1. Classification:
This technique is used to obtain
important and relevant
information about data and
metadata. This data mining
technique helps to classify data
in different classes.
i. Classification of Data
mining frameworks as per
the type of data sources
mined:
This classification is as per
the type of data handled. For example, multimedia, spatial
data, text data, time-series data, World Wide Web, and so on..
ii. Classification of data mining frameworks as per the database
involved:
This classification based on the data model involved. For
example. Object-oriented database, transactional database,
relational database, and so on..
iii. Classification of data mining frameworks as per the kind of
knowledge discovered:
This classification depends on the types of knowledge
discovered or data mining functionalities. For example,
discrimination, classification, clustering, characterization, etc.
Some frameworks are comprehensive and offer several data
mining functionalities together.
iv. Classification of data mining frameworks according to data
mining techniques used:
This classification is as per the data analysis approach utilized,
such as neural networks, machine learning, genetic algorithms,
visualization, statistics, data warehouse-oriented or database-
oriented, etc.
The classification can also take into account, the level of user
interaction involved in the data mining procedure, such as
query-driven systems, autonomous systems, or interactive
exploratory systems.

2. Clustering:
Clustering is a division of information into groups of connected
objects. Describing the data by a few clusters loses certain fine
details but achieves simplification; the data is modeled by its
clusters. Historically, clustering as a data modeling technique is
rooted in statistics, mathematics, and numerical analysis. From a
machine learning point of view, clusters correspond to hidden
patterns, the search for clusters is unsupervised learning, and the
resulting framework represents a data concept. From a practical
point of view, clustering plays an important role in data mining
applications, for example scientific data exploration, text mining,
information retrieval, spatial database applications, CRM, web
analysis, computational biology, medical diagnostics, and much more.
In other words, we can say that Clustering analysis is a data mining
technique to identify similar data. This technique helps to recognize
the differences and similarities between the data. Clustering is very
similar to the classification, but it involves grouping chunks of data
together based on their similarities.
3. Regression:
Regression analysis is a data mining technique used to identify and
analyze the relationship between variables in the presence of other
factors. It is used to estimate the value of a dependent variable and is
primarily a form of planning and modeling. For example, we might
use it to project certain costs, depending on other factors such as
availability, consumer demand, and competition. Primarily, it gives
the exact relationship between two or more variables in the given
data set.
4. Association Rules:
This data mining technique helps to discover a link between two or
more items. It finds hidden patterns in the data set.
Association rules are if-then statements that help to show the
probability of interactions between data items within large data sets
in different types of databases. Association rule mining has several
applications and is commonly used to discover sales correlations in
transactional data or in medical data sets.
The way the algorithm works is that you have various data, for
example a list of grocery items that you have been buying for the last
six months. It calculates the percentage of items being purchased
together.
These are the three major measures, written here for a rule A -> B:
o Support:
This measure tells how often items A and B are purchased
together, relative to the overall dataset.
Support = (transactions containing both A and B) / (total transactions)
o Confidence:
This measure tells how often item B is purchased when item A
is purchased as well.
Confidence = (transactions containing both A and B) / (transactions containing A)
o Lift:
This measure tells how much more likely A and B are to be
purchased together than would be expected if the two items
were independent.
Lift = Confidence(A -> B) / Support(B)
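To illustrate how these measures are computed, here is a small Python sketch
for a rule A -> B over a handful of made-up grocery transactions; the items and
transactions are invented for illustration.

```python
# Computing support, confidence, and lift for the rule "A -> B" from a small
# list of invented grocery transactions.
itemsets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

A, B = "bread", "butter"
n = len(itemsets)
count_A = sum(1 for t in itemsets if A in t)
count_B = sum(1 for t in itemsets if B in t)
count_AB = sum(1 for t in itemsets if A in t and B in t)

support_AB = count_AB / n             # (A and B) / all transactions
confidence = count_AB / count_A       # (A and B) / transactions with A
lift = confidence / (count_B / n)     # confidence / support(B)

print(f"support={support_AB:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```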
5. Outlier detection:
This type of data mining technique relates to the observation of data
items in the data set which do not match an expected pattern or
expected behavior. It may be used in various domains such as
intrusion detection, fraud detection, etc. It is also known as outlier
analysis or outlier mining. An outlier is a data point that diverges too
much from the rest of the dataset, and most real-world datasets
contain outliers. Outlier detection plays a significant role in the data
mining field and is valuable in numerous areas such as network
intrusion identification, credit or debit card fraud detection, and
detecting outlying readings in wireless sensor network data.
6. Sequential Patterns:
Sequential pattern mining is a data mining technique specialized for
evaluating sequential data to discover sequential patterns. It
comprises finding interesting subsequences in a set of sequences,
where the interestingness of a subsequence can be measured in
terms of different criteria such as length, occurrence frequency, etc.
In other words, this technique helps to discover or recognize similar
patterns in transactional data over time.

issues and challenges in data mining:

Incomplete and noisy data: The process of extracting useful data from
large volumes of data is data mining. Data in the real world is
heterogeneous, incomplete, and noisy; data in huge quantities will
often be inaccurate or unreliable. These problems may occur due to
faulty measuring instruments or because of human errors. Suppose a
retail chain collects the phone numbers of customers who spend more
than $500, and the accounting employees put the information into their
system. A person may make a digit mistake when entering a phone
number, which results in incorrect data. Some customers may not be
willing to disclose their phone numbers, which results in incomplete
data. The data could also get changed due to human or system error.
All these consequences (noisy and incomplete data) make data mining
challenging.

Data Distribution: Real-world data is usually stored on various
platforms in a distributed computing environment. It might be in
databases, individual systems, or even on the internet. In practice, it is
quite a tough task to bring all the data into a centralized data repository,
mainly due to organizational and technical concerns. For example,
various regional offices may have their own servers to store their data,
and it may not be feasible to store all the data from all the offices on a
central server. Therefore, data mining requires the development of tools
and algorithms that allow the mining of distributed data.

Complex Data: Real-world data is heterogeneous, and it could be
multimedia data (including audio, video, and images), complex data,
spatial data, time series, and so on. Managing these various types of data
and extracting useful information is a tough task. Most of the time, new
technologies, tools, and methodologies have to be developed or refined
to obtain the required information.

Performance: The data mining system's performance relies primarily on
the efficiency of the algorithms and techniques used. If the designed
algorithms and techniques are not up to the mark, the efficiency of the
data mining process will be affected adversely.

Data Privacy and Security: Data mining usually leads to serious issues in
terms of data security, governance, and privacy. For example, if a retailer
analyzes the details of purchased items, it reveals data about the buying
habits and preferences of customers without their permission.

Data Visualization: In data mining, data visualization is a very important
process because it is the primary method of showing the output to the
user in a presentable way. The extracted data should convey the exact
meaning of what it intends to express. But many times, representing the
information to the end-user in a precise and easy way is difficult. Because
the input data and the output information can be complicated, very
efficient and effective data visualization techniques need to be applied to
present them successfully.

Data Mining Application:

These are the following areas where data mining is widely used:

Data Mining in Healthcare:

Data mining in healthcare has excellent potential to improve the health


system. It uses data and analytics for better insights and to identify best
practices that will enhance health care services and reduce costs.
Analysts use data mining approaches such as Machine learning, Multi-
dimensional database, Data visualization, Soft computing, and statistics.
Data Mining can be used to forecast patients in each category. The
procedures ensure that the patients get intensive care at the right place
and at the right time. Data mining also enables healthcare insurers to
recognize fraud and abuse.
Data Mining in Market Basket Analysis:

Market basket analysis is a modeling technique based on the hypothesis
that if you buy a specific group of products, then you are more likely to
buy another group of products. This technique may enable the retailer to
understand the purchase behavior of a buyer. This data may assist the
retailer in understanding the requirements of the buyer and altering the
store's layout accordingly. Analytical comparisons of results can also be
made between various stores and between customers in different
demographic groups.

Data mining in Education:

Educational data mining is a newly emerging field, concerned with
developing techniques that explore knowledge from the data generated
in educational environments. EDM objectives include predicting
students' future learning behavior, studying the impact of educational
support, and promoting learning science. An organization can use data
mining to make precise decisions and also to predict student results.
With these results, the institution can concentrate on what to teach and
how to teach it.

Data Mining in Manufacturing Engineering:

Knowledge is the best asset possessed by a manufacturing company.


Data mining tools can be beneficial to find patterns in a complex
manufacturing process. Data mining can be used in system-level
designing to obtain the relationships between product architecture,
product portfolio, and data needs of the customers. It can also be used
to forecast the product development period, cost, and expectations
among the other tasks.

Data Mining in CRM (Customer Relationship Management):

Customer Relationship Management (CRM) is all about obtaining and


holding Customers, also enhancing customer loyalty and implementing
customer-oriented strategies. To get a decent relationship with the
customer, a business organization needs to collect data and analyze the
data. With data mining technologies, the collected data can be used for
analytics.

Data Mining in Fraud detection:

Billions of dollars are lost to fraud. Traditional methods of fraud
detection are somewhat time-consuming and complicated. Data mining
provides meaningful patterns and turns data into information. An ideal
fraud detection system should protect the data of all users. Supervised
methods use a collection of sample records classified as fraudulent or
non-fraudulent; a model is constructed from this data, and the technique
is then used to identify whether a new record is fraudulent or not.

Data Mining in Lie Detection:

Apprehending a criminal is not a big deal, but bringing out the truth
from him is a very challenging task. Law enforcement may use data
mining techniques to investigate offenses, monitor suspected terrorist
communications, etc. This technique includes text mining also, and it
seeks meaningful patterns in data, which is usually unstructured text. The
information collected from the previous investigations is compared, and
a model for lie detection is constructed.

Data Mining Financial Banking:

The digitalization of the banking system generates an enormous amount
of data with every new transaction. Data mining techniques can help
bankers solve business-related problems in banking and finance by
identifying trends, causalities, and correlations in business information
and market costs that are not immediately evident to managers or
executives, because the data volume is too large or the data is produced
too rapidly for experts to inspect. Managers may use these findings for
better targeting, acquiring, retaining, segmenting, and maintaining
profitable customers.

DM algorithm: Classification Algorithms:


 Decision Trees
 Random Forest
 Support Vector Machines (SVM)
 k-Nearest Neighbors (k-NN)
 Naive Bayes

-Decision Trees:- In data mining, the Decision Tree algorithm is


a versatile and widely used method for both classification and
regression tasks. It operates by recursively partitioning the
dataset based on the most informative features, creating a tree-
like structure where each internal node represents a decision
based on a specific attribute, and each leaf node corresponds
to a predicted outcome. The algorithm selects the best attribute
for splitting at each node, typically based on criteria such as
information gain or Gini impurity. Decision Trees are renowned
for their interpretability, as the resulting tree structure is easily
understandable, making them valuable in explaining complex
decision-making processes. However, they can be susceptible
to overfitting, capturing noise in the data, and various
techniques, such as pruning, are employed to mitigate this
challenge. Decision Trees find applications in diverse fields,
including finance for credit scoring, healthcare for disease
diagnosis, and marketing for customer segmentation. Their
simplicity and effectiveness make them a foundational tool in
the broader landscape of data mining and machine learning.
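As a brief illustration (not a prescribed implementation), the following Python
sketch trains a Decision Tree with scikit-learn on the bundled Iris dataset and
prints the readable if/then structure of the tree; the dataset choice and the
max_depth setting are assumptions made for demonstration.

```python
# Training and inspecting a Decision Tree classifier with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Limiting depth is one simple way to curb overfitting (a form of pre-pruning).
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))   # the interpretable if/then structure of the tree
```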

- Random Forest:- Random Forest is a versatile ensemble learning


algorithm employed in data mining and machine learning for
classification and regression tasks. It constructs a "forest" of decision
trees by leveraging bootstrap sampling, building each tree on a
subset of the training data, and considering random subsets of
features at each node. The final prediction is then determined
through a voting mechanism in classification or an averaging process
in regression. Known for its robustness and resistance to overfitting,
Random Forest excels in handling diverse types of data, providing
accurate predictions, and offering insights into feature importance.
Widely applied across various domains, from finance to healthcare,
its ability to mitigate individual tree weaknesses while capitalizing on
their collective strength makes it a popular and powerful tool in
predictive modeling and data analysis.
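The short Python sketch below illustrates a Random Forest with scikit-learn,
including the feature-importance scores mentioned above. The dataset and
hyperparameter values are only illustrative assumptions.

```python
# A Random Forest: many bootstrap-sampled trees voting on the final class.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 200 trees; each split considers a random subset of the features.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                random_state=0)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
print("largest feature importances:", sorted(forest.feature_importances_)[-3:])
```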

- Support Vector Machines (SVM): Support Vector Machines


(SVM) is a powerful supervised learning algorithm used for both
classification and regression tasks in machine learning and data
mining. The fundamental concept behind SVM is to find a
hyperplane that best separates data points of different classes
in a high-dimensional space. The "support vectors" are the data
points that lie closest to the decision boundary, and the optimal
hyperplane maximizes the margin, which is the distance
between the support vectors and the decision boundary. SVM is
effective in handling complex relationships and is particularly
well-suited for high-dimensional data. It can also handle non-
linear decision boundaries through the use of kernel functions.
SVM has found applications in diverse fields, including image
recognition, text classification, and bioinformatics. Its ability to
handle both linear and non-linear patterns, along with its
robust performance in high-dimensional spaces, contributes to
its popularity in various real-world applications.

- k-Nearest Neighbors (k-NN): k-Nearest Neighbors (k-NN)


is a simple yet powerful supervised learning algorithm used for
both classification and regression tasks. The core idea of k-NN
is to classify or predict a data point based on the majority class
or average value of its k nearest neighbors in the feature space.
The choice of the parameter k determines the number of
neighbors considered, influencing the algorithm's sensitivity to
local variations. In classification, the class label is assigned
based on a majority vote, while in regression, it involves
averaging the values of the nearest neighbors. k-NN is a non-
parametric algorithm, meaning it doesn't make explicit
assumptions about the underlying data distribution. It is
particularly effective in scenarios where the decision boundaries
are not well-defined or when the data exhibits local patterns.
However, its computational complexity can be a challenge with
large datasets, and appropriate preprocessing, such as feature
scaling, is often necessary for optimal performance. k-NN is
widely applied in areas like pattern recognition, image
processing, and recommendation systems due to its simplicity
and flexibility.
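To show the idea without any library classifier, here is a tiny from-scratch
Python sketch of k-NN: it computes distances to all training points and takes a
majority vote among the k nearest. The toy 2-D points and k = 3 are assumptions
chosen only for illustration.

```python
# A minimal from-scratch k-NN classifier: majority vote of the k closest points.
import numpy as np
from collections import Counter

X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],   # class 0
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])  # class 1
y_train = np.array([0, 0, 0, 1, 1, 1])

def knn_predict(x, k=3):
    # Euclidean distance from the query point to every training point.
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(distances)[:k]]
    return Counter(nearest).most_common(1)[0][0]   # majority vote

print(knn_predict(np.array([2.0, 2.0])))   # expected: 0
print(knn_predict(np.array([6.2, 6.8])))   # expected: 1
```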

- Naive Bayes:- Naive Bayes is a probabilistic machine learning


algorithm commonly used for classification tasks, especially in
natural language processing and text categorization. Based on
Bayes' theorem, it assumes that features are conditionally
independent given the class label, which simplifies the
calculation of probabilities. Despite the "naive" assumption of
feature independence, Naive Bayes often performs surprisingly
well in practice and is computationally efficient. It calculates the
probability of a given instance belonging to each class and
assigns the class with the highest probability as the predicted
class. Naive Bayes is particularly effective in situations with a
large number of features and limited training data. Its
simplicity, speed, and effectiveness make it a popular choice for
applications like spam filtering, sentiment analysis, and
document categorization. Additionally, Naive Bayes can be
easily adapted to handle continuous or categorical features,
contributing to its versatility across various domains.
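As an illustration of Naive Bayes applied to text (the spam-filtering setting
mentioned above), the following Python sketch uses scikit-learn's
CountVectorizer and MultinomialNB; the tiny training messages and labels are
invented for demonstration.

```python
# Naive Bayes text classification: word-count features + MultinomialNB.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["win a free prize now", "free cash offer", "meeting at noon",
            "lunch tomorrow with the team", "claim your free prize"]
labels = ["spam", "spam", "ham", "ham", "spam"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)   # word-count features per message

model = MultinomialNB()                  # assumes conditional feature independence
model.fit(X, labels)

test = vectorizer.transform(["free prize meeting"])
print(model.predict(test), model.predict_proba(test))
```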

Prediction - parametric and non-parametric:- Parametric and
non-parametric methods represent two different approaches to
building classification and prediction models:

1. Parametric Classification and Prediction:

- Characteristics:

- Assumes a specific form or structure for the underlying


model.

- Typically involves estimating parameters of a predefined


model.

- Examples:

- Linear Regression: Assumes a linear relationship between


input features and the output.

- Logistic Regression: Models the probability of belonging


to a particular class using a logistic function.
- Parametric Bayesian Models: Incorporate Bayesian
methods to estimate parameters and make predictions.

2. Non-parametric Classification and Prediction:

- Characteristics:

- Does not assume a specific form for the underlying model.

- More flexible in capturing complex relationships in the


data.

- Examples:

- k-Nearest Neighbors (k-NN): Classifies data points based


on the majority class among their k nearest neighbors.

- Decision Trees: Builds a tree structure to make decisions


based on features.

- Support Vector Machines (SVM): Constructs hyperplanes


to separate classes without assuming a specific distribution.

Comparison:

- Flexibility:

- Parametric: Limited by the assumed model structure.

- Non-parametric: More flexible as it doesn't assume a


specific model form, making it suitable for complex
relationships.

- Assumption:- Parametric: Assumes a specific form of the


underlying distribution.
- Non-parametric: Makes fewer assumptions about the
underlying data distribution.

- Sample Size: - Parametric: May require a larger sample size to


accurately estimate parameters.

- Non-parametric: Can be more robust with smaller sample


sizes, especially in cases with complex relationships.

- Interpretability:- Parametric: Model parameters have clear


interpretations.

- Non-parametric: Models might be less interpretable due to


their flexibility.

-Computational Complexity:- Parametric: Often


computationally less expensive.

- Non-parametric: May be computationally more intensive,


especially with large datasets.

The choice between parametric and non-parametric methods


depends on the characteristics of the data and the goals of the
analysis. Parametric methods are useful when the underlying
model structure is reasonably well-known and assumptions are
met. Non-parametric methods are more suitable when the
relationship between variables is complex, and making fewer
assumptions about the data is desirable.

Bayesian classification: Bayesian classification in data mining is a


statistical approach to data classification that uses Bayes' theorem to
make predictions about a class of a data point based on observed
data. It is a popular data mining and machine learning technique for
modelling the probability of certain outcomes and making
predictions based on that probability.

The basic idea behind Bayesian classification in data mining is to


assign a class label to a new data instance based on the probability
that it belongs to a particular class, given the observed data. Bayes'
theorem provides a way to compute this probability by multiplying
the prior probability of the class (based on previous knowledge or
assumptions) by the likelihood of the observed data given that class
(conditional probability).

Several types of Bayesian classifiers exist, such as naive Bayes,


Bayesian network classifiers, Bayesian logistic regression, etc.
Bayesian classification is preferred in many applications because it
allows for the incorporation of new data (just by updating the prior
probabilities) and can update the probabilities of class labels
accordingly.

Bayesian classification is a powerful tool for data mining and


machine learning and is widely used in many applications, such as
spam filtering, text classification, and medical diagnosis. Its ability to
incorporate prior knowledge and uncertainty makes it well-suited for
real-world problems where data is incomplete or noisy and accurate
predictions are critical.

Formula Derivation

Bayes' theorem is derived from the definition of conditional
probability. The conditional probability of an event E given a
hypothesis H is defined as the joint probability of E and H, divided by
the probability of H, as shown below:

P(E∣H) = P(E∩H) / P(H)

We can rearrange this equation to solve for the joint probability of E
and H:

P(E∩H) = P(E∣H) * P(H)

Similarly, we can use the definition of conditional probability to write
the conditional probability of H given E:

P(H∣E) = P(H∩E) / P(E)

Based on the commutative property of joint probability,
P(H∩E) = P(E∩H).

We can substitute the expression for P(E∩H) from the first equation
into the second equation to obtain:

P(H∣E) = P(E∣H) * P(H) / P(E)

This is the formula for Bayes' theorem for hypothesis H and event E.
It states that the probability of hypothesis H given event E is
proportional to the likelihood of the event given the hypothesis,
multiplied by the prior probability of the hypothesis, and divided by
the probability of the event.
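A small worked example helps here. The Python sketch below applies the formula
above to a hypothetical diagnostic test; all of the probabilities are assumed
values chosen only for illustration.

```python
# A worked application of P(H|E) = P(E|H) * P(H) / P(E) for a hypothetical test.
p_h = 0.01              # prior: 1% of patients have the disease (hypothesis H)
p_e_given_h = 0.95      # likelihood: test is positive in 95% of true cases
p_e_given_not_h = 0.05  # false-positive rate among healthy patients

# Total probability of a positive test (event E).
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Posterior probability of the disease given a positive test.
p_h_given_e = p_e_given_h * p_h / p_e
print(round(p_h_given_e, 3))   # about 0.161
```

Note how the posterior (about 16 percent) stays far below the test's 95 percent
sensitivity because the prior probability of the disease is low.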

APPLICATIONS OF BAYES' THEOREM:

Spam filtering - Bayes' theorem is commonly used in email spam


filtering, where it helps to identify emails that are likely to be spam
based on the text content and other features.

Medical diagnosis - Bayes' theorem can be used to diagnose medical


conditions based on the observed symptoms, test results, and prior
knowledge about the prevalence and characteristics of the disease.

Risk assessment - Bayes' theorem can be used to assess the risk of


events such as accidents, natural disasters, or financial market
fluctuations based on historical data and other relevant factors.
Natural language processing - Bayes' theorem can be used to classify
documents, sentiment analysis, and topic modeling in natural
language processing applications.

Recommendation systems - Bayes' theorem can be used in


recommendation systems like e-commerce websites to suggest
products or services to users based on their previous behavior and
preferences.

Fraud detection - Bayes' theorem can be used to detect fraudulent


behavior, such as credit card or insurance fraud, by analyzing
patterns of transactions and other data.

TWO class and GENERALIZED class classification:-

In data mining and machine learning, the terms "two-class


classification" and "multiclass classification" refer to different types
of classification tasks based on the number of classes or categories
involved. Let's delve into these concepts:

1. Two-Class Classification:

- Definition: In a two-class classification task, the algorithm is


designed to predict between two possible classes or outcomes. It's a
binary classification problem where each instance is categorized into
one of two mutually exclusive classes.

- Examples: Spam detection (spam or not spam), medical diagnosis


(disease or no disease), sentiment analysis (positive or negative
sentiment), and fraud detection (fraudulent or non-fraudulent
transactions) are common examples of two-class classification
problems.
- Algorithms: Algorithms suitable for two-class classification include
logistic regression, support vector machines, decision trees, and
random forests.

2. Multiclass Classification (Generalized Classification):

- Definition: In multiclass classification, the algorithm is trained to


classify instances into one of several possible classes. The problem
involves predicting among more than two classes, and each class
represents a distinct category or label.

- Examples: Handwritten digit recognition (classifying digits 0 to 9),


image classification (identifying objects in images from multiple
classes), and language identification (detecting the language of a text
among several possibilities) are examples of multiclass classification
tasks.

- Algorithms: Multiclass classification algorithms include


multinomial logistic regression, decision trees, k-nearest neighbors,
and neural networks.

CLASSIFICATION ERROR:- Classification error, also known as the
misclassification rate, is a metric used to evaluate the performance of
a classification model in data mining and machine learning. It
measures the proportion of instances that are incorrectly classified
by the model and is the complement of classification accuracy.

The classification error is calculated using the following formula:

Classification Error = (Number of Misclassified Instances) / (Total Number of Instances)

In simpler terms, it is the ratio of the number of instances that the
model predicted incorrectly to the total number of instances in the
dataset.
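The formula can be computed directly, as in the short Python sketch below; the
true labels and predictions are made up for illustration.

```python
# Computing classification error (and its complement, accuracy) from labels.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

misclassified = sum(t != p for t, p in zip(y_true, y_pred))
error = misclassified / len(y_true)   # misclassification rate
accuracy = 1 - error                  # its complement

print(f"classification error = {error:.2f}, accuracy = {accuracy:.2f}")
```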

A low classification error indicates better model performance, while


a high classification error suggests that the model is not accurately
predicting the class labels. It is important to note that classification
error may not be the most appropriate metric for all situations,
especially when dealing with imbalanced datasets. In such cases,
other metrics like precision, recall, F1 score, or area under the
receiver operating characteristic (ROC) curve might provide a more
comprehensive evaluation of the model's performance, taking into
account false positives, false negatives, and the trade-off between
precision and recall. The choice of evaluation metric depends on the
specific goals and characteristics of the classification problem at
hand.

UNIT-3 ASSOCIATION RULES & CLUSTERING

Association rules: Association rules in data mining are patterns or


relationships discovered within datasets, revealing interesting
connections between different variables. These rules are derived
using algorithms like Apriori, typically applied to transactional
databases. The rules consist of three key measures: support,
indicating the frequency of occurrence of a set of items; confidence,
measuring the strength of the association between items; and lift,
comparing the likelihood of co-occurrence to the independence of
the items. Commonly used in market basket analysis, association
rules unveil connections such as "if X is purchased, then Y is likely to
be purchased as well." This has applications in various domains, from
retail and e-commerce to healthcare and web usage analysis,
providing valuable insights into consumer behavior, product
recommendations, and pattern recognition within large datasets.

Apriori Algorithm: Apriori algorithm refers to an algorithm that is


used in mining frequent itemsets and the relevant association rules.
Generally, the Apriori algorithm operates on a database containing a
huge number of transactions, for example, the items customers buy
at a Big Bazar.
Apriori algorithm helps the customers to buy their products with
ease and increases the sales performance of the particular store.
Components of Apriori algorithm
The given three components comprise the apriori algorithm.
1. Support
2. Confidence
3. Lift
Let's take an example to understand this concept.
Suppose you have 4000 customers transactions in a Big Bazar. You
have to calculate the Support, Confidence, and Lift for two products,
and you may say Biscuits and Chocolate. This is because customers
frequently buy these two items together.
Out of 4000 transactions, 400 contain Biscuits, 600 contain
Chocolate, and 200 contain both Biscuits and Chocolate. Using this
data, we will find out the support, confidence, and lift.
Support
Support refers to the default popularity of any product. You find the
support as a quotient of the division of the number of transactions
comprising that product by the total number of transactions. Hence,
we get
Support (Biscuits) = (Transactions relating biscuits) / (Total
transactions)
= 400/4000 = 10 percent.
Confidence
Confidence refers to the likelihood that customers who bought
biscuits also bought chocolates. You calculate it by dividing the number
of transactions that contain both biscuits and chocolates by the
number of transactions that contain biscuits.
Hence,
Confidence = (Transactions relating both biscuits and Chocolate) /
(Total transactions involving Biscuits)
= 200/400
= 50 percent.
It means that 50 percent of customers who bought biscuits bought
chocolates also.
Lift
Continuing the example, lift measures how much more likely customers are
to buy chocolates when they buy biscuits, compared with buying chocolates
in general. The formula is:
Lift (Biscuits -> Chocolate) = Confidence (Biscuits -> Chocolate) / Support (Chocolate)
Here Support (Chocolate) = 600/4000 = 15 percent, so
Lift = 50/15 ≈ 3.33
It means that customers who buy biscuits are about 3.3 times more likely to
buy chocolates than the average customer. A lift value below one indicates
that the two items are unlikely to be bought together; the larger the value,
the stronger the association.
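
These support, confidence, and lift figures can be reproduced with a few lines of Python; the sketch below simply re-does the arithmetic using the counts assumed in the example above:

# Sketch: support, confidence and lift for the rule Biscuits -> Chocolate,
# using the transaction counts assumed in the Big Bazar example.
total_transactions = 4000
biscuit_count = 400        # transactions containing Biscuits
chocolate_count = 600      # transactions containing Chocolate
both_count = 200           # transactions containing both items

support_biscuits = biscuit_count / total_transactions       # 0.10
support_chocolate = chocolate_count / total_transactions    # 0.15
support_both = both_count / total_transactions              # 0.05

confidence = support_both / support_biscuits                # 0.50
lift = confidence / support_chocolate                       # about 3.33

print(f"Support(Biscuits)  = {support_biscuits:.2f}")
print(f"Confidence(B -> C) = {confidence:.2f}")
print(f"Lift(B -> C)       = {lift:.2f}")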

Advantages of Apriori Algorithm


o It is used to calculate large itemsets.
o Simple to understand and apply.

Disadvantages of Apriori Algorithms


o The Apriori algorithm is expensive for finding support,
since each support count requires a pass through the whole
database.
o It can generate a huge number of candidate itemsets and rules,
which makes it computationally expensive.

Partition- In the context of association rules and data mining,


partitioning refers to the process of dividing a dataset into
distinct subsets or partitions based on certain criteria. This
partitioning is often done to analyze the data more effectively
and extract meaningful patterns or associations.

Partitioning in association rules can be applied in various ways:

1. Transaction Partitioning:

- Data is often organized into transactions, where each


transaction represents a set of items purchased or observed
together. Partitioning transactions can involve dividing them
based on specific criteria, such as time periods, customer
segments, or any other relevant factors.

2. Itemset Partitioning:
- Itemsets are combinations of items that frequently occur
together in transactions. Partitioning itemsets involves grouping
them based on common characteristics or properties. For
example, you might partition itemsets based on the type of
products involved.

3. Rule Partitioning:

- Association rules consist of antecedents (the "if" part) and


consequents (the "then" part). Partitioning rules involves
categorizing them into subsets based on certain criteria,
making it easier to analyze specific types of associations.

4. Support-Confidence Partitioning:

- Association rules are often evaluated based on metrics like


support and confidence. Partitioning based on these metrics
involves grouping rules that share similar support or confidence
values. This can help identify patterns that meet certain criteria.

Partitioning in association rules allows analysts to focus on


specific subsets of the data, making it more manageable and
facilitating a more in-depth analysis of patterns within those
subsets. It can also aid in the extraction of rules that are
relevant to particular conditions or characteristics within the
dataset.

FP-Growth algorithm: The FP Growth algorithm is a popular


method for frequent pattern mining in data mining. It works by
constructing a frequent pattern tree (FP-tree) from the input dataset.
The FP-tree is a compressed representation of the dataset that
captures the frequency and association information of the items in
the data.
The algorithm first scans the dataset and maps each transaction to a
path in the tree. Items are ordered in each transaction based on their
frequency, with the most frequent items appearing first. Once the FP
tree is constructed, frequent itemsets can be generated by
recursively mining the tree. This is done by starting at the bottom of
the tree and working upwards, finding all combinations of itemsets
that satisfy the minimum support threshold.

Working of the FP Growth Algorithm


The working of the FP Growth algorithm in data mining can be
summarized in the following steps:
 Scan the database:
In this step, the algorithm scans the input dataset to determine
the frequency of each item. This determines the order in which
items are added to the FP tree, with the most frequent items
added first.
 Sort items:
In this step, the items in the dataset are sorted in descending
order of frequency. The infrequent items that do not meet the
minimum support threshold are removed from the dataset. This
helps to reduce the dataset's size and improve the algorithm's
efficiency.
 Construct the FP-tree:
In this step, the FP-tree is constructed. The FP-tree is a compact
data structure that stores the frequent itemsets and their
support counts.
 Generate frequent itemsets:
Once the FP-tree has been constructed, frequent itemsets can
be generated by recursively mining the tree. Starting at the
bottom of the tree, the algorithm finds all combinations of
frequent item sets that satisfy the minimum support threshold.
 Generate association rules:
Once all frequent item sets have been generated, the algorithm
post-processes the generated frequent item sets to generate
association rules, which can be used to identify interesting
relationships between the items in the dataset.
FP Tree
The FP-tree (Frequent Pattern tree) is a data structure used in the FP
Growth algorithm for frequent pattern mining. It represents the
frequent itemsets in the input dataset compactly and efficiently. The
FP tree consists of the following components:
 Root Node:
The root node of the FP-tree represents an empty set. It has no
associated item but a pointer to the first node of each item in
the tree.
 Item Node:
Each item node in the FP-tree represents a unique item in the
dataset. It stores the item name and the frequency count of the
item in the dataset.
 Header Table:
The header table lists all the unique items in the dataset, along
with their frequency count. It is used to track each item's
location in the FP tree.
 Child Node:
Each child node of an item node represents an item that co-
occurs with the item the parent node represents in at least one
transaction in the dataset.
 Node Link:
The node-link is a pointer that connects each item in the header
table to the first node of that item in the FP-tree. It is used to
traverse the conditional pattern base of each item during the
mining process.
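
A minimal sketch of running FP-Growth in Python, assuming the third-party mlxtend library is installed (pip install mlxtend); the transactions below are invented for illustration:

# Sketch: frequent itemsets with FP-Growth via mlxtend (assumed installed).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["biscuits", "chocolate", "milk"],
    ["biscuits", "chocolate"],
    ["biscuits", "bread"],
    ["chocolate", "milk"],
    ["biscuits", "chocolate", "bread"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = te.fit(transactions).transform(transactions)
df = pd.DataFrame(onehot, columns=te.columns_)

# Mine all itemsets appearing in at least 40% of the transactions.
frequent_itemsets = fpgrowth(df, min_support=0.4, use_colnames=True)
print(frequent_itemsets.sort_values("support", ascending=False))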

Advantages of FP Growth Algorithm


 Efficiency:
FP Growth algorithm is faster and more memory-efficient than
other frequent itemset mining algorithms such as Apriori,
especially on large datasets with high dimensionality.
 Scalability:
FP Growth algorithm scales well with increasing database size
and itemset dimensionality, making it suitable for mining
frequent itemsets in large datasets.
 Resistant to noise:
FP Growth algorithm is more resistant to noise in the data than
other frequent itemset mining algorithms, as it generates only
frequent itemsets and ignores infrequent itemsets that may be
caused by noise.
 Parallelization:
FP Growth algorithm can be easily parallelized, making it
suitable for distributed computing environments and allowing it
to take advantage of multi-core processors.

Disadvantages of FP Growth Algorithm


 Memory consumption:
Although the FP Growth algorithm is more memory-efficient
than other frequent itemset mining algorithms, storing the FP-
Tree and the conditional pattern bases can still require a
significant amount of memory, especially for large datasets.
 Complex implementation:
The FP Growth algorithm is more complex than other frequent
itemset mining algorithms, making it more difficult to
understand and implement.

Generalized association rules: association rule extraction is a


powerful tool for getting a rough idea of interesting patterns hidden
in data. However, since patterns are extracted at each level of
abstraction, the mined rule sets may be too large to be used
effectively for decision-making. Therefore, in order to discover
valuable and interesting knowledge, post-processing steps are often
required. Generalized association rules should have categorical
(nominal or discrete) properties on both the left and right sides of
the rule.

Basic terminology in association rules:


1. Itemset: A collection of one or more items. In the context of
association rules, an itemset represents a set of items that are
analyzed to discover associations.
2. Support: The support of an itemset is the proportion of
transactions in the dataset that contain the itemset. It measures the
frequency with which the itemset appears in the dataset.
3. Confidence: Confidence measures the reliability of the association
rule. It is the conditional probability that a transaction containing the
antecedent of the rule also contains the consequent.
4. Antecedent and Consequent: In an association rule "if X, then Y," X
is the antecedent (the condition) and Y is the consequent (the result
or outcome).
5. Support Threshold: A user-defined minimum support level.
Itemsets with support values below this threshold are typically
considered uninteresting and are often excluded from the results.
6. Confidence Threshold: A user-defined minimum confidence level.
Rules with confidence values below this threshold may be considered
less reliable and may be filtered out.
7. Lift: A measure of how much more likely the consequent is given
the antecedent compared to its likelihood without the antecedent.
Lift values greater than 1 indicate a positive association.

Correlation analysis:
Correlation analysis is a statistical method used to evaluate the strength and
direction of the linear relationship between two quantitative variables. It
measures the degree to which changes in one variable are associated with
changes in another variable. The most common measure of correlation is the
Pearson correlation coefficient, commonly denoted by r. The value of r ranges from
-1 to 1, where:

- r = +1 indicates a perfect positive linear relationship.
- r = 0 indicates no linear relationship.
- r = -1 indicates a perfect negative linear relationship.
Types of Correlation
There are three types of correlation:
1. Positive Correlation: Positive correlation indicates that two variables
have a direct relationship. As one variable increases, the other variable
also increases. For example, there is a positive correlation between
height and weight. As people get taller, they also tend to weigh more.
2. Negative Correlation: Negative correlation indicates that two variables
have an inverse relationship. As one variable increases, the other
variable decreases. For example, there is a negative correlation between
price and demand. As the price of a product increases, the demand for
that product decreases.
3. Zero Correlation: Zero correlation indicates that there is no relationship
between two variables. The changes in one variable do not affect the
other variable. For example, there is zero correlation between shoe size
and intelligence.
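A minimal sketch of computing the Pearson coefficient in Python with NumPy; the height and weight values are made up for illustration:

# Sketch: Pearson correlation coefficient with NumPy (made-up height/weight data).
import numpy as np

height_cm = np.array([150, 160, 165, 170, 180, 185])
weight_kg = np.array([50, 56, 61, 66, 75, 82])

r = np.corrcoef(height_cm, weight_kg)[0, 1]   # off-diagonal entry of the 2x2 matrix
print(f"Pearson r = {r:.3f}")                 # close to +1: strong positive correlation
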
Clustering: Clustering is the method of converting a group of abstract objects into
classes of similar objects.

Clustering is a method of partitioning a set of data or objects into a set of significant


subclasses called clusters.

It helps users to understand the structure or natural grouping in a data set and used either
as a stand-alone instrument to get a better insight into data distribution or as a pre-
processing step for other algorithms

Important points:
o Data objects of a cluster can be considered as one group.
o We first partition the information set into groups while doing cluster analysis. It is
based on data similarities and then assigns the levels to the groups.
o The main advantage of clustering over classification is that it is adaptable to changes, and it
helps single out important characteristics that differentiate between distinct groups.

Applications of cluster analysis in data mining:


o In many applications, clustering analysis is widely used, such as data analysis, market
research, pattern recognition, and image processing.
o It assists marketers to find different groups in their client base and based on the
purchasing patterns. They can characterize their customer groups.
o It helps in allocating documents on the internet for data discovery.
o Clustering is also used in tracking applications such as detection of credit card fraud.
o As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data to analyze the characteristics of each cluster.
o In terms of biology, It can be used to determine plant and animal taxonomies,
categorization of genes with the same functionalities and gain insight into structure
inherent to populations.
o It helps in the identification of areas of similar land that are used in an earth
observation database and the identification of house groups in a city according to
house type, value, and geographical location.
Why is clustering used in data mining?
Clustering analysis has been an evolving problem in data mining due to its variety of
applications. The advent of various data clustering tools in the last few years and
their comprehensive use in a broad range of applications, including image
processing, computational biology, mobile communication, medicine, and
economics, have contributed to the popularity of these algorithms. The main issue
with data clustering algorithms is that they cannot be standardized: an
algorithm may give the best results with one type of data set but may fail or
perform poorly with other kinds of data sets. Although many efforts have been made
to design algorithms that perform well in all situations, no significant
breakthrough has been achieved so far. Many clustering tools have been proposed,
but each algorithm has its own advantages and disadvantages and cannot work in
all real situations.

1. Scalability:

Scalability in clustering implies that as we boost the amount of data objects, the time
to perform clustering should approximately scale to the complexity order of the
algorithm. For example, if we perform K-means clustering, we know it is O(n), where
n is the number of objects in the data. If we raise the number of data objects 10
folds, then the time taken to cluster them should also approximately increase 10
times. It means there should be a linear relationship. If that is not the case, then there
is some error with our implementation process.

The clustering algorithm should be scalable; if it is not, it may fail to produce appropriate
results on large data sets in a reasonable time.

2. Interpretability:
The outcomes of clustering should be interpretable, comprehensible, and usable.

3. Discovery of clusters with attribute shape:

The clustering algorithm should be able to find arbitrary shape clusters. They should
not be limited to only distance measurements that tend to discover a spherical
cluster of small sizes.

4. Ability to deal with different types of attributes:

Algorithms should be capable of being applied to any data such as data based on
intervals (numeric), binary data, and categorical data.

5. Ability to deal with noisy data:

Databases contain data that is noisy, missing, or incorrect. Few algorithms are
sensitive to such data and may result in poor quality clusters.

6. High dimensionality:

The clustering algorithm should be able to handle not only low-dimensional data but
also high-dimensional data spaces.

Basic issues in clustering: Clustering is a fundamental task in data


mining, but it comes with several challenges and issues that
researchers and practitioners need to address.
1. Ambiguity and Subjectivity:
- Solution Ambiguity: Clustering results may vary based on the
choice of algorithms, distance measures, and parameters, leading to
different clusterings for the same dataset.
- Subjectivity: Interpreting and validating clusters can be subjective,
as there may be multiple valid perspectives on what constitutes a
meaningful grouping.
2. Cluster Validity:
- Evaluation Metrics: Determining the quality of clusters is non-
trivial, and selecting appropriate evaluation metrics depends on the
nature of the data and the clustering goals.
- Internal vs. External Validity: Internal validity measures how well
the clusters are formed within the dataset, while external validity
assesses how well the clusters align with external criteria or ground
truth.
3. Scalability:
- Computational Complexity: Some clustering algorithms may
become computationally expensive as the size of the dataset
increases, making them impractical for large-scale data.
4. Noise and Outliers:
- Handling Noise: Clustering algorithms may be sensitive to noise or
irrelevant features in the data, leading to the formation of spurious
clusters.
- Outlier Detection: Identifying and handling outliers is crucial for
ensuring the robustness of clustering results.
5. Cluster Shape and Density:
- Assumption of Spherical Clusters: Many algorithms assume
spherical clusters, making them less effective for data with non-
spherical or irregularly shaped clusters.
- Varying Cluster Density: Clusters with varying densities pose
challenges for algorithms that assume uniform density.
6. Choosing the Right Algorithm:
- Algorithm Selection: Different clustering algorithms have different
strengths and weaknesses. Selecting the most suitable algorithm for
a particular dataset and problem is a non-trivial task.
7. Handling High-Dimensional Data:
- Curse of Dimensionality: Clustering high-dimensional data can be
challenging due to the curse of dimensionality, where distances
between data points become less meaningful in high-dimensional
spaces.
8. Interpreting and Representing Results:
- Cluster Interpretability: Interpreting and presenting the results of
clustering in a meaningful way, especially when dealing with high-
dimensional or complex data.

Partitioning methods: Partitioning methods are a widely used family of


clustering algorithms in data mining that aim to partition a dataset into K
clusters. These algorithms attempt to group similar data points together while
maximizing the differences between the clusters. Partitioning methods work by
iteratively refining the cluster centroids until convergence is reached. These
algorithms are popular for their speed and scalability in handling large
datasets.
The most widely used partitioning method is the K-means algorithm. Other
popular partitioning methods include K-medoids, Fuzzy C-means,
and Hierarchical K-means. The K-medoids are similar to K-means but use
medoids instead of centroids as cluster representatives.
Partitioning methods offer several benefits, including speed, scalability, and
simplicity. They are relatively easy to implement and can handle large datasets.
Partitioning methods are also effective in identifying natural clusters within
data and can be used for various applications, such as customer segmentation,
image segmentation, and anomaly detection.

K-means method: K-means is the most popular algorithm in partitioning


methods for clustering. It partitions a dataset into K clusters, where K is a user-
defined parameter.
How does K-Means Work?
The K-Means algorithm begins by randomly assigning each data point to a
cluster. It then iteratively refines the clusters' centroids until convergence. The
refinement process involves calculating the mean of the data points assigned to
each cluster and updating the cluster centroids' coordinates accordingly. The
algorithm continues to iterate until convergence, meaning the cluster
assignments no longer change. K-means clustering aims to minimize the sum of
squared distances between each data point and its assigned cluster centroid. K-
means is widely used in various applications, such as customer segmentation,
image segmentation, and anomaly detection, due to its simplicity and
efficiency in handling large datasets. For example, the K-Means algorithm can
group data points into two clusters

Algorithm
1. Initialize K cluster centroids randomly.
2. Assign each data point to the nearest centroid.
3. Recalculate the centroids' coordinates by computing the mean of the
data points assigned to each cluster.
4. Repeat steps 2 and 3 until the cluster assignments no longer change or a
maximum number of iterations is reached.
5. Return the K clusters and their respective centroids.
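A minimal sketch of these steps using scikit-learn; the sample points are invented and K = 2 is an illustrative choice:

# Sketch: K-means clustering with scikit-learn on a tiny invented dataset.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],    # one rough group
              [8.0, 8.0], [8.5, 7.8], [7.8, 8.3]])   # another rough group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("Cluster labels:", labels)                 # e.g. [0 0 0 1 1 1]
print("Centroids:\n", kmeans.cluster_centers_)
print("Inertia (sum of squared distances):", kmeans.inertia_)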
Advantages of K-Means
 Scalability
 Speed
 Simplicity
 Interpretability
Disadvantages of K-Means
 Curse of dimensionality
 User-defined K
 Non-convex shape clusters
 Unable to handle noisy data

K-MEDOID: K-medoid is a clustering algorithm that is similar to k-means but


uses medoids as representatives of clusters instead of centroids. The medoid is
the data point within a cluster whose average dissimilarity to all the other
points in the cluster is minimized. Unlike the centroid, which is the mean of the
data points, the medoid is an actual data point in the dataset
1. Initialization: Randomly select k data points as the initial medoids.
2. Assignment: Assign each data point to the nearest medoid, based on a
dissimilarity or distance metric (commonly Euclidean distance or other
similarity measures).
3. Update Medoids: For each cluster, evaluate the total dissimilarity of all data
points to each data point in the cluster. Choose the data point with the
minimum total dissimilarity as the new medoid for that cluster.
4. Repeat: Steps 2 and 3 are repeated iteratively until convergence, meaning
that the medoids do not change significantly or a predefined number of
iterations is reached.
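Core scikit-learn does not ship a k-medoids estimator, so the sketch below implements the assign/update loop above directly with NumPy. It is a simplified illustration (not a full PAM implementation), and the sample points are invented:

# Sketch: a simplified k-medoids loop (assign to nearest medoid, then pick the
# point with the smallest total within-cluster distance as the new medoid).
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = rng.choice(len(X), size=k, replace=False)   # step 1: initialization
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)      # step 2: assignment
        new_medoids = medoids.copy()
        for c in range(k):                                # step 3: update medoids
            members = np.where(labels == c)[0]
            if len(members) == 0:
                continue
            costs = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):          # step 4: convergence check
            break
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.5, 7.8], [30.0, 30.0]])      # last point is an outlier
medoids, labels = k_medoids(X, k=2)
print("Medoid indices:", medoids, "labels:", labels)
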
Advantages of K-Medoid:
1. Robust to Outliers
2. Handles Non-Euclidean Distances
3. Interpretability
4. Applicability to Arbitrary Shapes of Clusters

Disadvantages of K-Medoid:
1. Computational Complexity
2. Dependency on Initial Medoid Selection
3. Sensitivity to the Number of Clusters (k)
4. Limited to Single Linkage
5. Not Suitable for Large Datasets

Comparison of K-Means and K-Medoid clustering, aspect by aspect:

1. Centroids vs. Medoids
   - K-Means: Centroids are used to represent cluster centers; they are the mean of the data points in the cluster and may not be actual data points.
   - K-Medoid: Medoids are used as representatives of clusters; they are actual data points that minimize dissimilarity to the other points.
2. Sensitivity to Outliers
   - K-Means: Sensitive to outliers, as they can significantly affect the mean (centroid).
   - K-Medoid: Less sensitive to outliers, as medoids are less influenced by extreme values.
3. Distance Metric
   - K-Means: Typically uses Euclidean distance, but can be adapted to other distance metrics.
   - K-Medoid: Can use any dissimilarity or distance metric, providing flexibility in handling different data types.
4. Computational Complexity
   - K-Means: Generally more computationally efficient than K-medoid.
   - K-Medoid: Tends to be computationally more expensive, especially for large datasets, due to pairwise dissimilarity computations.
5. Cluster Shape Assumption
   - K-Means: Assumes spherical clusters due to the use of centroids.
   - K-Medoid: Makes no assumptions about cluster shapes, making it more suitable for clusters of arbitrary shapes.
6. Cluster Representativeness
   - K-Means: Centroids may not correspond to actual data points.
   - K-Medoid: Medoids are actual data points, providing a more interpretable representation of clusters.
7. Initialization Dependency
   - K-Means: Sensitive to the choice of initial centroids, impacting the final clustering results.
   - K-Medoid: Sensitive to the choice of initial medoids, affecting the final clustering results.
8. Application
   - K-Means: Widely used in various domains and well suited for large datasets.
   - K-Medoid: Effective where medoids make more sense than centroids, such as with categorical data or non-Euclidean distances.
9. Usage
   - K-Means: Commonly used in practice and implemented in many libraries and tools.
   - K-Medoid: Used when a more robust clustering method is needed, particularly in the presence of outliers or non-Euclidean data.
10. Examples
   - K-Means: Image segmentation, customer segmentation, data compression.
   - K-Medoid: Genomic data clustering, categorical data clustering, outlier detection.

Hierarchical method:
Hierarchical clustering is a method of cluster analysis in data mining that
creates a hierarchical representation of the clusters in a dataset. The method
starts by treating each data point as a separate cluster and then iteratively
combines the closest clusters until a stopping criterion is reached. The result of
hierarchical clustering is a tree-like structure, called a dendrogram, which
illustrates the hierarchical relationships among the clusters.
Hierarchical clustering has several advantages over other clustering methods
 The ability to handle non-convex clusters and clusters of different
sizes and densities.
 The ability to handle missing data and noisy data.
 The ability to reveal the hierarchical structure of the data, which can
be useful for understanding the relationships among the clusters.
Drawbacks of Hierarchical Clustering
 The need for a criterion to stop the clustering process and determine
the final number of clusters.
 The computational cost and memory requirements of the method can
be high, especially for large datasets.
 The results can be sensitive to the initial conditions, linkage criterion,
and distance metric used.
In summary, Hierarchical clustering is a method of data mining that
groups similar data points into clusters by creating a hierarchical
structure of the clusters.
 This method can handle different types of data and reveal the
relationships among the clusters. However, it can have high
computational cost and results can be sensitive to some conditions.
Types of Hierarchical Clustering
Basically, there are two types of hierarchical Clustering:
1. Agglomerative Clustering
2. Divisive clustering
1. Agglomerative Clustering
Initially consider every data point as an individual Cluster and at every step,
merge the nearest pairs of clusters (it is a bottom-up method). At first,
every data point is considered an individual entity or cluster. At every iteration,
the closest clusters are merged until only one cluster remains.
The algorithm for Agglomerative Hierarchical Clustering is:
 Consider every data point as an individual cluster.
 Calculate the similarity of each cluster with all the other clusters
(compute the proximity matrix).
 Merge the clusters which are most similar or closest to each other.
 Recalculate the proximity matrix for the new set of clusters.
 Repeat the previous two steps until only a single cluster remains.
Let’s see the graphical representation of this algorithm using a dendrogram.
Note: This is just a demonstration of how the algorithm works; no calculation
has been performed below, and all the proximities among the clusters are
assumed.
Let’s say we have six data points A, B, C, D, E, and F.

Agglomerative Hierarchical clustering


 Step-1: Consider each alphabet as a single cluster and calculate the
distance of one cluster from all the other clusters.
 Step-2: In the second step comparable clusters are merged together
to form a single cluster. Let’s say cluster (B) and cluster (C) are very
similar to each other therefore we merge them in the second step
similarly to cluster (D) and (E) and at last, we get the clusters [(A),
(BC), (DE), (F)]
 Step-3: We recalculate the proximity according to the algorithm and
merge the two nearest clusters([(DE), (F)]) together to form new
clusters as [(A), (BC), (DEF)]
 Step-4: Repeating the same process; The clusters DEF and BC are
comparable and merged together to form a new cluster. We’re now
left with clusters [(A), (BCDEF)].
 Step-5: At last, the two remaining clusters are merged together to
form a single cluster [(ABCDEF)].
2. Divisive Hierarchical clustering
We can say that Divisive Hierarchical clustering is precisely the opposite of
Agglomerative Hierarchical clustering. In Divisive Hierarchical clustering, we
take into account all of the data points as a single cluster and in every iteration,
we separate the data points from the clusters which aren’t comparable. In the
end, we are left with N clusters.
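
A minimal sketch of agglomerative clustering in Python with SciPy, which also produces the dendrogram data described above; the six 2-D points loosely stand in for A to F:

# Sketch: agglomerative hierarchical clustering with SciPy on invented points.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.array([[0, 0], [1, 0], [1.2, 0.3], [5, 5], [5.3, 5.1], [9, 9]])

Z = linkage(X, method="average")                   # merge history (dendrogram data)
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
print("Cluster labels:", labels)

# dendrogram(Z) can be plotted with matplotlib to visualize the merge hierarchy.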

NON-HIERARCHICAL TECHNIQUES: Non-hierarchical clustering techniques


involve partitioning the dataset into a predetermined number of clusters
without forming a hierarchical structure. Some popular non-hierarchical
clustering methods include:
1. K-Means Clustering:
- K-means is a widely used partitioning method that separates the dataset
into a predefined number of clusters (k). It minimizes the sum of squared
distances between data points and the centroid of their assigned cluster. K-
means is computationally efficient but sensitive to initializations and outliers.

2. K-Medoids (PAM - Partitioning Around Medoids):


- Similar to k-means, but instead of using centroids, k-medoids uses medoids
(actual data points) as representatives of clusters. It is more robust to outliers
but computationally more expensive due to the use of pairwise dissimilarity
computations.

3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise):


- DBSCAN groups data points based on their density in the feature space. It
identifies dense regions separated by sparser areas and can find clusters of
arbitrary shapes. It is particularly effective in identifying outliers and handling
noise in the data.

4. Mean-Shift Clustering:
- Mean-shift is a density-based method that iteratively shifts data points
towards the mode (peak) of the data distribution. It automatically determines
the number of clusters and is capable of detecting clusters with irregular
shapes.

5. Fuzzy C-Means Clustering:


- Fuzzy C-means is an extension of k-means that allows data points to belong
to multiple clusters with varying degrees of membership. It assigns fuzzy
membership values to each point, representing the likelihood of belonging to
different clusters.
6. Spectral Clustering:
- Spectral clustering treats data points as nodes in a graph and uses the
spectral properties of the graph to group similar points into clusters. It is
effective in capturing complex relationships and works well for data with
nonlinear structures.

7. Hierarchical Density-Based Spatial Clustering (HDBSCAN):


- HDBSCAN is an extension of DBSCAN that builds a hierarchy of clusters
based on density. It can identify clusters of varying shapes and sizes and is
robust to varying densities within clusters.

8. OPTICS (Ordering Points To Identify the Clustering Structure):


- OPTICS is a density-based method that produces an ordering of data points
based on their reachability density. It can identify clusters of varying shapes
and sizes and provides a visualization of the density-based clustering structure.
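As one concrete example from this list, DBSCAN is available in scikit-learn; the sketch below uses invented 2-D points and illustrative eps/min_samples values:

# Sketch: DBSCAN clustering with scikit-learn on invented 2-D data.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.0],   # dense region 1
              [8.0, 8.0], [8.1, 8.2], [7.9, 8.0],   # dense region 2
              [50.0, 50.0]])                        # isolated point

db = DBSCAN(eps=0.5, min_samples=2).fit(X)
print("Labels:", db.labels_)    # points labelled -1 are treated as noise/outliers
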
UNIT 4- DECISION TREES

Introduction: Decision Tree is a Supervised learning technique that can be used


for both classification and Regression problems, but mostly it is preferred for
solving Classification problems. It is a tree-structured classifier, where internal
nodes represent the features of a dataset, branches represent the decision
rules and each leaf node represents the outcome.
In a Decision tree, there are two types of nodes: the Decision Node and the Leaf
Node. Decision nodes are used to make decisions and have multiple
branches, whereas Leaf nodes are the output of those decisions and do not
contain any further branches.
The decisions or the test are performed on the basis of features of the given
dataset.
It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
It is called a decision tree because, similar to a tree, it starts with the root
node, which expands on further branches and constructs a tree-like structure.
In order to build a tree, we use the CART algorithm, which stands
for Classification and Regression Tree algorithm.
Why use Decision Trees?
1.Decision Trees usually mimic human thinking ability while making a decision,
so it is easy to understand.
2.The logic behind the decision tree can be easily understood because it shows
a tree-like structure.
Decision Tree Terminologies
Root Node: Root node is from where the decision tree starts. It represents the
entire dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be
segregated further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into
sub-nodes according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the
tree.
Parent/Child node: The root node of the tree is called the parent node, and
other nodes are called the child nodes.
How does the Decision Tree algorithm Work?
In a decision tree, for predicting the class of the given dataset, the algorithm
starts from the root node of the tree. This algorithm compares the values of
root attribute with the record (real dataset) attribute and, based on the
comparison, follows the branch and jumps to the next node.
Step-1: Begin the tree with the root node, say S, which contains the
complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure
(ASM).
Step-3: Divide S into subsets that contain the possible values of the best
attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset
created in Step-3. Continue this process until a stage is reached where you
cannot further classify the nodes; such final nodes are called leaf nodes.
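
A minimal sketch of these steps with scikit-learn's DecisionTreeClassifier (a CART implementation), using the bundled iris dataset; the hyperparameters are illustrative choices:

# Sketch: training and inspecting a decision tree with scikit-learn (CART).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X_train, y_train)

print("Test accuracy:", clf.score(X_test, y_test))
print("Tree depth:", clf.get_depth(), "| leaves:", clf.get_n_leaves())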

Advantages of the Decision Tree


o It is simple to understand as it follows the same process which a
human follows while making any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other
algorithms.

Disadvantages of the Decision Tree


o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using
the Random Forest algorithm.
o For more class labels, the computational complexity of the decision
tree may increase.
Tree pruning: In decision trees, pruning is the process of removing branches
(sub-trees) that contribute little to the model's predictive power, in order to
reduce the size and complexity of the tree and to avoid overfitting the
training data. A fully grown tree often memorizes noise in the training set;
pruning trades a small amount of training accuracy for better generalization
on unseen data. Pruning is broadly of two kinds: pre-pruning (early stopping),
where tree growth is halted during construction using criteria such as a
maximum depth, a minimum number of samples per node, or a minimum impurity
decrease; and post-pruning, where the full tree is built first and branches
are then removed (for example by reduced-error pruning or cost-complexity
pruning) if removing them does not significantly hurt accuracy on a validation
set. A properly pruned tree is smaller, faster to evaluate, and easier to
interpret, while over-pruning can discard useful structure and reduce accuracy.

Types of pruning techniques (model pruning, in the broader machine learning sense):

Structured Pruning:
Structured pruning involves eliminating whole structures or groups of
parameters from the model, such as whole neurons, channels, or filters.
This type of pruning preserves the underlying structure of the model,
meaning that the pruned model has the same overall architecture as the
original model, but with fewer parameters.

Structured pruning is suitable for models with a structured architecture,
such as convolutional neural networks (CNNs), where the parameters are
organized into filters, channels, and layers. It is also easier to carry out
than unstructured pruning since it preserves the structure of the model.

Unstructured Pruning:
Unstructured pruning involves eliminating individual parameters from
the model without regard to their location in the model. This type of
pruning does not preserve the underlying structure of the model, meaning
that the pruned model has a different architecture from the original
model. Unstructured pruning is suitable for models without a structured
architecture, such as fully connected neural networks, where the
parameters are organized into a single weight matrix. It tends to be more
effective than structured pruning since it allows for more fine-grained
pruning; however, it can also be more difficult to implement.

Advantages
o Decreased model size and complexity. Pruning can significantly
reduce the number of parameters in a model, leading to a smaller
and simpler model that is easier to train and deploy.
o Faster inference. Pruning can decrease the computational cost of
making predictions, leading to faster and more efficient
predictions.
o Improved generalization. Pruning can prevent overfitting and
improve the generalization ability of the model by reducing its
complexity.
o Increased interpretability. Pruning can result in a simpler and more
interpretable model, making it easier to understand and make
sense of the model's decisions.

Disadvantages
o Possible loss of accuracy. Pruning can sometimes result in a loss of
accuracy, especially if too many parameters are pruned or if
pruning is not done carefully.
o Increased training time. Pruning can increase the training time of
the model, especially if it is done iteratively during training.
o Difficulty in choosing the right pruning technique. Choosing the
right pruning technique can be challenging and may require domain
expertise and experimentation.
o Risk of over-pruning. Over-pruning can lead to an overly simplified
model that is not accurate enough for the task.
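
For decision trees specifically, scikit-learn supports post-pruning through cost-complexity pruning (the ccp_alpha parameter). The sketch below is illustrative, and the alpha value is an arbitrary choice:

# Sketch: post-pruning a decision tree via cost-complexity pruning in scikit-learn.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# ccp_alpha > 0 removes branches whose complexity cost outweighs their benefit;
# the value 0.02 here is only an illustrative choice.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

print("Leaves before pruning:", full_tree.get_n_leaves())
print("Leaves after pruning: ", pruned_tree.get_n_leaves())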
Extracting classification rules from decision tree:

Extracting classification rules from a decision tree involves translating the


tree structure into a set of human-readable rules that describe how the
model makes predictions. Decision trees are inherently interpretable, and
each path from the root to a leaf node in the tree corresponds to a rule.
Here's a general process for extracting classification rules from a decision
tree:

1. Understand Tree Structure:

- Review the structure of the decision tree, which consists of nodes


(decision points) and branches (edges) leading to leaf nodes (final
predictions).

2. Identify Decision Nodes:

- Each decision node represents a condition based on a feature.


Identify these conditions and the features they are associated with.

3. Follow Decision Paths:

- Trace the paths from the root to each leaf node by following the
conditions at each decision node. Each path represents a unique
combination of conditions.

4. Translate Conditions into Rules:

- Convert the conditions on each path into human-readable rules. Each


rule should state the conditions that lead to a specific classification.

5. Include Class Labels:

- Associate each rule with the class label assigned to the corresponding
leaf node. This indicates the predicted outcome for instances that satisfy
the conditions of the rule.

6. Handle Continuous Variables:


- For decision nodes based on continuous variables, specify ranges or
thresholds in the rules.

7. Address Missing Values:

- If the tree deals with missing values, include rules for how the model
handles them at decision nodes.

8. Simplify Rules (Optional):

- Optionally, you can simplify rules to make them more concise and
easier to understand. This may involve combining similar rules or
expressing them in a more compact form.
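
When the tree comes from scikit-learn, the rules can be printed directly from the fitted model with export_text; a minimal sketch:

# Sketch: extracting human-readable classification rules from a fitted tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Each root-to-leaf path in the printout corresponds to one classification rule.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)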

Decision tree construction algorithms:

The ID3 (Iterative Dichotomiser 3) algorithm is one of the earliest and


most influential decision tree algorithms. It was developed by Ross
Quinlan and is designed for building decision trees for classification
tasks.

Key Steps of the ID3 Algorithm:

1. Entropy Calculation:

- Measure the impurity of the current dataset using a metric called


entropy.

- Entropy quantifies the amount of disorder or uncertainty in a set of


data.

2. Attribute Selection:

- For each attribute in the dataset, calculate the Information Gain.

- Information Gain is the reduction in entropy that results from splitting


the dataset based on a particular attribute.
- The attribute with the highest Information Gain is selected as the
node for splitting.

3. Node Creation:

- Create a decision node in the tree based on the selected attribute.

- Branch out to child nodes corresponding to the possible values of the


selected attribute.

4. Recursion:

- Recursively apply the algorithm to each subset of data created by the


split.

- Continue this process until one of the stopping conditions is met.

5. Stopping Conditions:

- Stop splitting if all instances in a subset belong to the same class.

- Stop if no more attributes are left for splitting.

- Stop if a predefined depth limit is reached.

6. Tree Construction:

- Continue building the tree until the stopping conditions are met for
all branches.
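
A minimal sketch of the entropy and information-gain calculations that drive ID3; the tiny weather/play records below are invented for illustration:

# Sketch: entropy and information gain, the quantities ID3 uses to pick attributes.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute_index, labels):
    base = entropy(labels)
    # Partition the labels by the value of the chosen attribute.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute_index], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return base - remainder

rows = [("sunny",), ("sunny",), ("overcast",), ("rain",), ("rain",), ("overcast",)]
labels = ["no", "no", "yes", "yes", "no", "yes"]

print("Dataset entropy:", round(entropy(labels), 3))                    # 1.0
print("Gain(outlook):  ", round(information_gain(rows, 0, labels), 3))  # about 0.667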

Limitations of ID3:

1. Categorical Attributes Only:

- ID3 is designed for categorical (discrete) attributes and cannot handle

continuous attributes directly; they must be discretized first.

2. Overfitting:

- ID3 tends to create deep trees, leading to overfitting. This can be


addressed through pruning techniques.
3. Handling Missing Data:

- ID3 does not handle missing values well, and it may exclude instances
with missing values during attribute selection.

4. Bias Toward Features with Many Values:

- Features with a large number of values may have an advantage in the


Information Gain calculation, potentially leading to a bias.

Decision tree construction with presorting:

Presorting is a technique used in decision tree construction to improve


the efficiency of the algorithm. The basic idea is to sort the data based
on attribute values before making decisions at each node of the tree.
This sorting process allows for quicker identification of splitting points,
reducing the overall time complexity of the algorithm.

1. Presorting:

- Sort the training dataset based on the values of each attribute.

- This presorting step is performed once for each attribute, and the
sorted order is maintained throughout the tree construction process.

2. Selecting the Best Splitting Point:

- For each attribute, traverse the sorted values and evaluate potential
split points.

- Calculate the impurity measure (e.g., Gini impurity, Information Gain)


at each potential split point.

- Choose the attribute and split point that result in the maximum
impurity reduction.

3. Node Creation:
- Create a decision node in the tree based on the selected attribute
and split point.

- Branch out to child nodes corresponding to the values below and


above the chosen split point.

4. Recursion:

- Recursively apply the presorting and splitting process to each subset


of data created by the split.

- Continue this process until one of the stopping conditions is met.

5. Stopping Conditions:

- Stop splitting if all instances in a subset belong to the same class.

- Stop if no more attributes are left for splitting.

- Stop if a predefined depth limit is reached.

6. Tree Construction:

- Continue building the tree until the stopping conditions are met for
all branches.
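
A minimal sketch of the core step: scanning one presorted numeric attribute for the best split point, here using Gini impurity. The values and labels are invented for illustration:

# Sketch: evaluating candidate split points on a presorted numeric attribute.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

values = [2.1, 3.5, 4.0, 5.2, 6.8, 7.1]     # attribute values, presorted once
labels = ["A", "A", "A", "B", "B", "B"]     # class label of each instance

parent = gini(labels)
best_gain, best_threshold = -1.0, None
for i in range(1, len(values)):             # candidate split between positions i-1 and i
    threshold = (values[i - 1] + values[i]) / 2
    left, right = labels[:i], labels[i:]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    gain = parent - weighted                # impurity reduction for this threshold
    if gain > best_gain:
        best_gain, best_threshold = gain, threshold

print(f"Best threshold: {best_threshold}, impurity reduction: {best_gain:.3f}")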

Benefits of Presorting:

1. Efficiency Improvement:

- By presorting the data, the algorithm can quickly identify optimal


split points, leading to faster tree construction.

2. Reduced Time Complexity:

- The time complexity of decision tree construction is reduced,


especially when evaluating potential split points.
3. Better Handling of Categorical Data:- Presorting is particularly
beneficial when dealing with categorical attributes, as it simplifies the
identification of optimal split points.

Considerations:

- Memory Usage:

- Presorting may require additional memory to store the sorted data.

- Initial Overhead: - The initial presorting step adds some overhead,


but this cost is amortized over the multiple decisions made during tree
construction.

- Applicability:

- Presorting is more effective in scenarios where the dataset is relatively


large and the number of attributes is not too high.

UNIT-5 Techniques of data mining

# DATA MINING SOFTWARE AND APPLICATION:

1. RapidMiner:
- Features: RapidMiner provides a user-friendly interface for designing and
executing data workflows without extensive coding. It supports various data
preprocessing tasks, modeling techniques, and evaluation methods. It also
allows integration with other data science tools and languages like R and
Python.
2. Weka:
- Features: Weka is a Java-based software that offers a vast collection of
machine learning algorithms for data mining. It provides tools for data
preprocessing, classification, regression, clustering, association rule mining,
and feature selection. Weka is popular for its ease of use and is widely used for
educational purposes.
3. Knime:
- Features: KNIME is known for its modular and visual approach to data
analysis. Users can create workflows by connecting pre-built nodes that
perform specific tasks. KNIME supports integration with various data sources
and offers a range of analytics and reporting capabilities.
4. IBM SPSS Modeler:
- Features: SPSS Modeler is part of the IBM SPSS Statistics suite. It provides a
visual interface for building predictive models using machine learning
algorithms. The software supports data preparation, model building,
evaluation, and deployment. It is widely used in industries for tasks such as
customer segmentation and predictive maintenance.
5. SAS Enterprise Miner:
- Features: SAS Enterprise Miner is a comprehensive data mining and
predictive analytics tool. It includes a variety of statistical and machine learning
algorithms for tasks like regression, clustering, and decision trees. The software
is often used in industries such as finance, healthcare, and marketing.
6. TensorFlow and scikit-learn:
- Features: TensorFlow is an open-source machine learning library developed
by Google. While it is more popular for deep learning, it also includes tools for
traditional machine learning tasks. Scikit-learn, on the other hand, is a Python
library that provides simple and efficient tools for data mining and data
analysis. Both are widely used in the Python data science ecosystem.
7. Tableau:
- Features: Tableau is primarily a data visualization tool that connects to
various data sources. While not a traditional data mining tool, it allows users to
explore and analyze data visually, uncovering patterns and trends. Tableau can
be integrated with other data science tools to enhance its analytical
capabilities.
8. Sisense:
- Features: Sisense is a business intelligence platform that goes beyond
traditional data mining. It includes data preparation, analysis, and visualization
features. Sisense allows users to create interactive dashboards and reports,
enabling organizations to make informed decisions based on data insights.

#Text mining: Text mining is a component of data mining that deals


specifically with unstructured text data. It involves the use of natural
language processing (NLP) techniques to extract useful information and
insights from large amounts of unstructured text data. Text mining can
be used as a preprocessing step for data mining or as a standalone
process for specific tasks.

By using text mining, the unstructured text data can be transformed into
structured data that can be used for data mining tasks such as
classification, clustering, and association rule mining. This allows
organizations to gain insights from a wide range of data sources, such as
customer feedback, social media posts, and news articles.

Advantages of Text Mining

1. Large Amounts of Data: Text mining allows organizations to


extract insights from large amounts of unstructured text data. This
can include customer feedback, social media posts, and news
articles.
2. Variety of Applications: Text mining has a wide range of
applications, including sentiment analysis, named entity
recognition, and topic modeling. This makes it a versatile tool for
organizations to gain insights from unstructured text data.
3. Improved Decision Making: Text mining can be used to extract
insights from unstructured text data, which can be used to make
data-driven decisions.
4. Cost-effective: Text mining can be a cost-effective way to extract
insights from unstructured text data, as it eliminates the need for
manual data entry.
5. Broader benefits: Cost reductions, productivity increases, the
creation of novel services, and new business models are just a
few of the broader economic advantages attributed to text
mining.

Disadvantages of Text Mining

1. Complexity: Text mining can be a complex process that requires


advanced skills in natural language processing and machine
learning.
2. Quality of Data: The quality of text data can vary, which can affect
the accuracy of the insights extracted from text mining.
3. High Computational Cost: Text mining requires high
computational resources, and it may be difficult for smaller
organizations to afford the technology.
4. Limited to Text Data: Text mining is limited to extracting insights
from unstructured text data and cannot be used with other data
types.
5. Noise in text mining results: Text mining of documents may
result in mistakes; it is possible to find false associations or to
miss real ones. In most situations, if the noise (error rate) is
sufficiently low, the benefits of automation outweigh the risk of
errors, which may be no greater than those a human reader would make.
6. Lack of transparency: Text mining is frequently viewed as a
mysterious process where large corpora of text documents are
input and new information is produced. Text mining is in fact
opaque when researchers lack the technical know-how or expertise
to comprehend how it operates, or when they lack access to
corpora or text mining tools.

#Extracting attributes: Extracting keywords in text mining involves


identifying and isolating the most important and representative words or
phrases from a given text. These keywords can provide a condensed
representation of the main topics or themes within the text.
1. Term Frequency-Inverse Document Frequency (TF-IDF):

- TF-IDF is a numerical statistic that reflects the importance of a word


in a document relative to a collection of documents (corpus). Words with
higher TF-IDF scores are considered more important. The process
involves the following steps:

- Calculate the Term Frequency (TF) for each word in the document.

- Calculate the Inverse Document Frequency (IDF) for each word in


the corpus.

- Multiply TF and IDF to obtain the TF-IDF score for each word.

- Select words with the highest TF-IDF scores as keywords.
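
A minimal sketch of TF-IDF-based keyword selection with scikit-learn; the three toy documents are invented for illustration:

# Sketch: picking per-document keywords by TF-IDF score with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "data mining discovers patterns in large data sets",
    "text mining extracts keywords from unstructured text",
    "clustering groups similar data points into clusters",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)        # shape: (n_docs, n_terms)
terms = vectorizer.get_feature_names_out()

for i in range(len(docs)):
    row = tfidf[i].toarray().ravel()
    top = row.argsort()[::-1][:3]             # indices of the 3 highest scores
    print(f"Doc {i} keywords:", [terms[j] for j in top])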

2. TextRank Algorithm:

- TextRank is an unsupervised algorithm inspired by PageRank, which is


used by Google for ranking web pages. It assigns importance scores to
words based on their co-occurrence and relationships within the text.
The algorithm involves:

- Constructing a graph where nodes represent words and edges


represent relationships.

- Applying an iterative algorithm to update the importance scores.

- Selecting words with the highest scores as keywords.

3. Frequency-Based Methods:

- Simple frequency-based methods involve counting the occurrence of


each word in the document and selecting the most frequently occurring
words as keywords. This method is straightforward but may not capture
the importance of less common but meaningful words.
4. Noun Phrase Extraction:

- Identifying and extracting noun phrases from the text can yield
meaningful keywords. This can be done using part-of-speech tagging to
identify nouns and noun phrases in the text.

5. Rapid Automatic Keyword Extraction (RAKE):

- RAKE is a keyword extraction algorithm that considers word co-


occurrence and frequency. It involves the following steps:

- Tokenizing the text into words and phrases.

- Calculating word and phrase scores based on co-occurrence and


frequency.

- Selecting words and phrases with the highest scores as keywords.

6. Machine Learning Approaches:

- Machine learning models, such as supervised classifiers or deep


learning models, can be trained to identify keywords. Training data with
labeled examples of keywords and non-keywords is used to build a
predictive model.

7. Topic Modeling:

- Topic modeling techniques, such as Latent Dirichlet Allocation (LDA),


can help identify topics within a document. The most representative
words in each topic can be considered as keywords.

8. Domain-Specific Methods:

- In certain domains, custom methods or dictionaries can be created to


extract domain-specific keywords. For example, in medical texts, specific
medical terms may be identified as keywords.
#Structural approaches: In the context of text mining and natural
language processing, structural approaches, including parsing and soft
parsing, play a crucial role in extracting meaningful information from
unstructured text data. Let's delve into each of these approaches:

1. Parsing:

- Definition: Parsing is the process of analyzing the grammatical


structure of a sentence to understand its syntactic components and
relationships. It involves breaking down a sentence into its constituent
parts, such as nouns, verbs, adjectives, and determining how these parts
are grammatically related to one another.

- Types of Parsing:

- Dependency Parsing: Identifies the relationships between words in a


sentence, representing them as a dependency tree. Each word is a node,
and the edges indicate grammatical dependencies.

- Constituency (Syntactic) Parsing: Analyzes the sentence's hierarchical


structure, identifying phrases and their grammatical categories.

- Applications:

- Semantic Role Labeling (SRL): Parsing helps identify the roles that
different words play in a sentence, such as the subject, object, or
predicate, which is crucial for understanding the semantic structure.

- Named Entity Recognition (NER): Parsing can contribute to the


extraction of named entities and their relationships within a sentence.
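
A minimal sketch of dependency parsing in Python, assuming spaCy and its small English model are installed (python -m spacy download en_core_web_sm):

# Sketch: dependency parsing with spaCy (library and en_core_web_sm model assumed).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The customer bought biscuits and chocolate.")

# Each token reports its grammatical role (dep_) and the head word it depends on.
for token in doc:
    print(f"{token.text:10} {token.dep_:10} head = {token.head.text}")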

2. Soft Parsing:

- Definition: Soft parsing is a more relaxed or probabilistic approach to


parsing compared to traditional parsing methods. It often involves using
statistical models or machine learning algorithms to assign probabilities
to different syntactic structures rather than selecting a single, definite
structure.

- Probabilistic Models:

- Probabilistic Context-Free Grammar (PCFG): Assigns probabilities to


different grammar rules, allowing for a more flexible interpretation of
sentence structure.

- Stochastic Dependency Parsing: Similar to dependency parsing, but


with probabilities associated with the dependencies between words.

- Applications:

- Machine Translation: Soft parsing is useful in machine translation systems where the translation model needs to consider multiple possible sentence structures.

- Speech Recognition: In speech recognition, where variations in pronunciation can lead to different parses, a soft parsing approach can be beneficial.

- Advantages:

- Flexibility: Soft parsing accommodates linguistic variability and ambiguity in natural language, allowing for more nuanced analysis.

- Probabilistic Output: Instead of committing to a single interpretation, soft parsing provides probabilities, which can be useful in uncertain or ambiguous language contexts.

Both parsing and soft parsing contribute to understanding the syntactic structure of text, enabling more advanced analysis and extraction of information. These approaches are fundamental in various natural language processing tasks, such as information extraction, sentiment analysis, and machine translation.
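
As a concrete illustration of parsing, the sketch below prints a dependency parse using spaCy. It assumes spaCy and its small English model en_core_web_sm are installed; other parsers (e.g., NLTK or Stanza) could be used instead.

```python
import spacy

nlp = spacy.load("en_core_web_sm")                  # pretrained English pipeline
doc = nlp("The warehouse stores historical sales data.")

for token in doc:
    # token.dep_ is the grammatical relation, token.head is the word it depends on
    print(f"{token.text:12} {token.dep_:10} -> {token.head.text}")
```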
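For soft (probabilistic) parsing, here is a toy sketch using NLTK's PCFG and Viterbi parser. The grammar and its probabilities are invented for illustration; a real system would estimate them from a treebank.

```python
import nltk

# Hand-written toy grammar; rule probabilities for each left-hand side sum to 1.
grammar = nltk.PCFG.fromstring("""
    S    -> NP VP     [1.0]
    NP   -> Det N     [0.6] | N [0.4]
    VP   -> V NP      [1.0]
    Det  -> 'the'     [1.0]
    N    -> 'user'    [0.5] | 'page' [0.5]
    V    -> 'visits'  [1.0]
""")

parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("the user visits the page".split()):
    print(tree)                          # most probable parse tree
    print("probability:", tree.prob())   # its estimated probability
```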

#Web mining: Web mining is the process of discovering and extracting valuable information from the vast amount of data available on the World Wide Web. It involves the application of data mining techniques to analyze and understand patterns, trends, and knowledge from web data. Web mining can be broadly categorized into three main types: web content mining, web structure mining, and web usage mining.

1. Web Content Mining:

- Definition: Web content mining focuses on extracting information from the textual content of web pages. It involves analyzing the text, images, and multimedia content present on websites.

- Techniques:

- Text Mining: Extracting information from text using techniques such as natural language processing, keyword extraction, and sentiment analysis.

- Image and Multimedia Mining: Analyzing and extracting information from images, videos, and other multimedia content.

- Applications:

- Information Retrieval: Improving search engine results by understanding and indexing web page content.

- Content Categorization: Categorizing web pages into predefined topics or themes.

- Sentiment Analysis: Analyzing user sentiments expressed in reviews, comments, and social media.
2. Web Structure Mining:

- Definition: Web structure mining focuses on analyzing the relationships and link structures among web pages. It aims to understand the organization and connectivity of the web.

- Techniques:

- Link Analysis: Examining the links between web pages to determine page importance and relationships.

- Graph Theory: Applying graph algorithms to analyze the structure of the web.

- Applications:

- Page Ranking: Assigning importance scores to web pages (e.g., Google's PageRank algorithm); a small PageRank sketch appears at the end of this section.

- Community Detection: Identifying groups of interconnected web pages.

- Web Navigation Analysis: Understanding user navigation patterns.

3. Web Usage Mining:

- Definition: Web usage mining involves analyzing user interactions and behavior on the web. It focuses on understanding how users navigate through websites and interact with web content.

- Techniques:

- Clickstream Analysis: Analyzing user clicks and navigation paths.

- Sessionization: Grouping user interactions into sessions (see the sessionization sketch at the end of this section).

- Pattern Discovery: Identifying recurring patterns in user behavior.

- Applications:

- Personalization: Customizing content or recommendations based on


user behavior.

- User Profiling: Creating profiles of user preferences and interests.


- E-commerce Recommendations: Suggesting products based on user
browsing and purchase history.

4. Challenges in Web Mining:

- Data Volume and Diversity: The web generates vast and diverse data,
making it challenging to handle and analyze.

- Dynamic Nature: Web content and structure change dynamically, requiring continuous updates.

- Privacy Concerns: User privacy is a significant concern when analyzing web usage data.
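
To illustrate the link analysis used in web structure mining, the sketch below computes PageRank over a tiny, made-up link graph with the networkx library (assumed to be installed). It is a toy example of the idea, not Google's production algorithm.

```python
import networkx as nx

# Directed edges mean "page A links to page B" (an invented mini web graph).
G = nx.DiGraph([
    ("home", "products"), ("home", "blog"),
    ("products", "home"),
    ("blog", "products"), ("blog", "home"),
])

scores = nx.pagerank(G, alpha=0.85)          # alpha is the damping factor
for page, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{page:10} {score:.3f}")          # pages ranked by importance
```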
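For web usage mining, here is a simple sessionization sketch: one user's clicks are split into sessions whenever the gap between consecutive clicks exceeds 30 minutes. The timestamps, page paths, and the 30-minute threshold are illustrative assumptions.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)          # assumed inactivity threshold

clicks = [                                   # (timestamp, page), sorted by time
    (datetime(2024, 1, 1, 9, 0), "/home"),
    (datetime(2024, 1, 1, 9, 5), "/products"),
    (datetime(2024, 1, 1, 10, 0), "/home"),      # 55-minute gap -> new session
    (datetime(2024, 1, 1, 10, 2), "/checkout"),
]

sessions, current = [], [clicks[0]]
for prev, curr in zip(clicks, clicks[1:]):
    if curr[0] - prev[0] > SESSION_GAP:      # gap too long: close the session
        sessions.append(current)
        current = []
    current.append(curr)
sessions.append(current)

for i, session in enumerate(sessions, 1):
    print(f"Session {i}: {[page for _, page in session]}")
# Session 1: ['/home', '/products']
# Session 2: ['/home', '/checkout']
```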

#Classifying web pages: Classifying web pages involves categorizing them into
predefined classes or topics based on their content, structure, or other relevant
features. This process is essential for various applications, including information
retrieval, content organization, and user experience improvement. Here are
common techniques and approaches for classifying web pages:
1. Text-Based Classification:
- Technique: Analyzing the textual content of web pages to determine their
category.
- Methods:
- Natural Language Processing (NLP): Using techniques such as tokenization,
stemming, and sentiment analysis to process and understand the text.
- Machine Learning Algorithms: Employing supervised learning algorithms,
such as Naive Bayes, Support Vector Machines (SVM), or neural networks, to
train models on labeled data (a small Naive Bayes pipeline sketch follows this list).
2. Content-Based Classification:
- Technique: Examining the features and attributes of web page content,
including text, images, and multimedia elements.
- Methods:
- Keyword Extraction: Identifying key terms in the text content.
- Image Analysis: Analyzing images or multimedia content for classification.
- Text and Image Fusion: Combining information from both textual and visual
elements.
3. Link-Based Classification:
- Technique: Analyzing the link structure and relationships between web
pages.
- Methods:
- Link Analysis Algorithms: Using algorithms like PageRank to determine the
importance of pages based on their links.
- Community Detection: Identifying clusters or groups of interlinked pages.
4. Web Structure-Based Classification:
- Technique: Examining the HTML structure, tags, and other structural
elements of web pages.
- Methods:
- HTML DOM Parsing: Analyzing the Document Object Model (DOM) of web
pages.
- Pattern Recognition: Identifying structural patterns in the HTML code.
5. Domain-Specific Classification:
- Technique: Considering domain-specific characteristics or features for
classification.
- Methods:
- Custom Features: Defining and extracting features relevant to the specific
domain.
- Supervised Learning: Training models on domain-specific labeled data.
6. Web Usage-Based Classification:
- Technique: Analyzing user interactions and behavior on web pages for
classification.
- Methods:
- Clickstream Analysis: Considering user clicks, navigation paths, and session
data.
- Behavioral Pattern Recognition: Identifying recurring patterns in user
behavior.
7. Machine Learning Models for Web Page Classification:
- Technique: Employing various machine learning models to classify web
pages.
- Methods:
- Decision Trees, Random Forests: Suitable for handling categorical features.
- Support Vector Machines (SVM): Effective for binary and multiclass
classification.
- Neural Networks: Deep learning models for complex patterns and
representations.
8. Ensemble Methods:
- Technique: Combining predictions from multiple classifiers to improve
overall accuracy.
- Methods:
- Voting Systems: Combining results through majority voting (see the ensemble voting sketch after this list).
- Bagging (Bootstrap Aggregating): Training multiple models on different
subsets of the data.
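
To make text-based classification concrete, here is a hedged scikit-learn sketch: TF-IDF features feeding a Naive Bayes classifier. The four labeled "pages" are invented; in practice a much larger labeled set and a held-out evaluation would be needed.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented training set: page text -> category label.
pages = [
    "latest football scores and match highlights",
    "stock market closes higher on tech earnings",
    "champions league fixtures announced",
    "central bank raises interest rates",
]
labels = ["sports", "finance", "sports", "finance"]

model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(pages, labels)

print(model.predict(["quarterly earnings beat analyst expectations"]))
# likely ['finance'] on this toy data
```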
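And for ensemble methods, a brief majority-voting sketch that combines three different classifiers over the same TF-IDF features, again on an invented toy dataset and assuming scikit-learn is available.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Invented toy training set.
pages = [
    "football match tonight", "interest rates rise",
    "league standings update", "markets rally on earnings",
]
labels = ["sports", "finance", "sports", "finance"]

ensemble = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("lr", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ],
        voting="hard",                       # simple majority vote on labels
    ),
)
ensemble.fit(pages, labels)
print(ensemble.predict(["football league final tonight"]))
# most likely ['sports'] on this toy data
```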

The end
