
SEMESTER-II

IT-421 Data Warehousing and Data Mining


Clock Hours: 60
Total Marks: 100
UNIT-I: Fundamentals of Data Warehousing
[10L] Max Marks: 12
1.1 Failure of Past Decision Support System

The marketing department in your company has been concerned about the performance of the
West Coast Region and the sales numbers from the monthly report this month are drastically low.
The marketing Vice President is agitated and wants to get some reports from the IT department to
analyze the performance over the past two years, product by product, and compared to monthly
targets. He wants to make quick strategic decisions to rectify the situation. The CIO wants your
boss to deliver the reports as soon as possible. Your boss runs to you and asks you to stop
everything and work on the reports. There are no regular reports from any system to give the
marketing department what they want. You have to gather the data from multiple applications and
start from scratch. Does this sound familiar?

At one time or another in your career in information technology, you must have been exposed
to situations like this. Sometimes, you may be able to get the information required for such ad hoc
reports from the databases or files of one application. Usually this is not so. You may have to go to
several applications, perhaps running on different platforms in your company environment, to get
the information. What happens next? The marketing department likes the ad hoc reports you have
produced. But now they would like reports in a different form, containing more information that
they did not think of originally. After the second round, they find that …

1.2 Operational V/S Decision Support Systems

Characteristic        | Operational System                          | Decision Support System
Data currency         | Current operations; real-time data          | Historic data; snapshots of company data; time component (week/month/year)
Granularity           | Atomic, detailed data                       | Summarized data
Summarization level   | Low: some aggregate yields                  | High: many aggregation levels
Data model            | Highly normalized, mostly relational DBMS   | Non-normalized, complex structures; some relational, but mostly multidimensional DBMS
Transaction type      | Mostly updates                              | Mostly queries
Transaction volumes   | High update volumes                         | Periodic loads and summary calculations
Transaction speed     | Updates are critical                        | Retrievals are critical
Query activity        | Low to medium                               | High
Query scope           | Narrow range                                | Broad range
Query complexity      | Simple to medium                            | Very complex

​ 1.3 Data Warehousing Lifecycle



​ The Data Warehousing Lifecycle refers to the process of designing, implementing, and
maintaining a data warehouse. A data warehouse is a centralized repository that stores
data from various sources in a structured format, enabling organizations to analyze and
make informed decisions. The lifecycle involves several key stages:
​ Planning:
● Define Objectives: Clearly define the goals and objectives of the data warehouse.
Understand the business requirements and how the data warehouse will support
decision-making processes.
● Assess Feasibility: Evaluate the technical and financial feasibility of
implementing a data warehouse. Consider factors such as data sources,
technology infrastructure, and organizational readiness.
​ Requirements Analysis:
● Gather Requirements: Work with stakeholders to identify and document data
requirements. Understand the types of queries and analyses that users will
perform to ensure the data warehouse meets their needs.
● Data Source Analysis: Identify and analyze potential data sources. Assess the
quality and compatibility of data from different systems.
​ Design:
● Data Model Design: Create a data model that represents the structure of the data
warehouse. This includes defining dimensions, facts, and relationships between
data elements.
● ETL (Extract, Transform, Load) Design: Plan the processes for extracting data
from source systems, transforming it into the desired format, and loading it into
the data warehouse.
● Infrastructure Design: Define the hardware and software infrastructure required
to support the data warehouse. Consider factors such as storage, processing
power, and data integration tools.
​ Implementation:
● ETL Development: Implement the ETL processes designed during the previous
stage. This involves extracting data from source systems, transforming it, and
loading it into the data warehouse.
● Data Warehouse Construction: Build the data warehouse based on the designed
data model. Populate the warehouse with data from the ETL processes.
​ Testing:
● Data Quality Assurance: Perform data quality checks to ensure the accuracy and
completeness of the data in the data warehouse.
● Performance Testing: Test the performance of queries and data retrieval
processes to ensure the data warehouse meets performance requirements.
​ Deployment:
● Data Warehouse Deployment: Deploy the data warehouse to production, making
it available for users and applications.
● User Training: Train end-users and relevant stakeholders on how to access and
use the data warehouse.
​ Maintenance and Evolution:
● Monitoring and Optimization: Monitor the performance of the data warehouse
and optimize queries or processes as needed.
● Data Refresh and Updates: Regularly update the data warehouse with new data
from source systems.
● Evolution and Expansion: Adapt the data warehouse to changing business
requirements. Consider expanding the data warehouse to include new data
sources or additional functionality.
​ Retirement (Optional):
● If the data warehouse becomes obsolete or is replaced by a newer system, plan
for its retirement. Migrate or archive relevant data and inform stakeholders about
the transition.

1.4 Architecture

Data warehouse architecture describes a complex system of information containing historical and cumulative data from various sources. Data in several databases are organized according to a data warehouse architecture. A contemporary data warehouse layout determines the most efficient method of obtaining information from raw data, because the data must be sorted and cleaned to be valuable. Three modes – single-tier, two-tier, and three-tier – are available for building data warehouse layers.

1. Single-tier architecture: A single layer's goal is to store as little data as possible by eliminating data redundancy. In reality, single-tier architecture is not frequently employed. The way a single-tier data warehouse is built reduces the amount of data that is stored while producing a dense data set.
Even though this design style is suitable for eliminating redundancies, it is not right for companies with complex data needs and multiple data streams. Multi-tier data warehouse architectures can help in this situation since they can handle more complicated data streams.
In a single-tier arrangement, the relational database system that serves as the data warehouse is also the operational store. This architecture is vulnerable since it does not separate analytical and transactional processing as required. After interpretation by the middleware, analysis queries are run against the operational data. This is how queries come to affect transactional workloads.

2. Two-layer architecture:The data structure of a two-tier data warehouse architecture maintains


a clear separation between the actual data sources and the warehouse itself. In contrast to a
single layer, the two-tier model uses a system and a database server. This style of data warehouse
architecture is generally utilized by small businesses that use servers as data marts. The two-tier
structure is not scalable even though it is better at data management and storage. Additionally, it
only accommodates a small number of users. It consists of four consecutive stages of data flow:

● Source layer: A data warehouse system makes use of several data sources. The information
may originate from an information system beyond the company’s boundaries or be initially
housed in legacy or internal relational databases.
● Data staging: It entails extracting the data from the source, cleaning it to remove
discrepancies and fill in any gaps, and integrating it to combine data from several sources
into a single standard schema. The Extraction, Transformation, and Loading Tools (ETL)
process can combine data schemas that are different from one another, besides enabling
data extraction, transformation, cleaning, validation, and filtration to be loaded into a data
warehouse.
● Data warehouse layer: A data warehouse is where one can store information in a way that
makes sense as per centralization logic. Users can access data warehouses directly but can
also use them to make data marts for specific departments within the company and partly
copy the contents from the data warehouse. Data staging, users, sources, access processes,
data mart schema, and other information are all stored in meta-data repositories.
● Analysis: This layer allows for rapid and flexible access to integrated data to generate
reports, analyze data in real-time, and model fictitious business scenarios. It should have
customer-friendly GUIs, advanced query optimizers, and aggregate information
navigators.

3. Three-Tier Architecture:The three-tier architecture comprises the source layer – many


source systems, the reconciliation layer, and the data warehouse layer. The reconciliation
layer sits between the data warehouse and the source data. The primary positive of the
reconciled layer is that it creates a uniform reference data model for the whole company,
besides setting out the difference between problems with filling the data warehouse and
those with getting source data and putting it all together. The top, middle, and bottom tiers
make up this hierarchy:

● Bottom tier: A relational database system is typically used. Data is cleaned,


changed, and loaded into this layer using back-end tools.
● Middle tier: An online analytical processing (OLAP) server developed using either
the ROLAP or MOLAP paradigm makes up the middle tier of a data warehouse.
This layer serves as a liaison between the database and the end user.
● Top tier: Front-end client layer makes up the top tier. The tools and application
programming interfaces (APIs) you connect to extract data from the data warehouse
are considered top tier.

1.5 Building Block


Architecture is the proper arrangement of the elements. We build a data warehouse with software and hardware components. To suit the requirements of our organization, we arrange these building blocks in a certain way, and we may want to boost one part with extra tools and services. All of this depends on our circumstances.
The figure shows the essential elements of a typical warehouse. The Source Data component is shown on the left. The Data Staging element serves as the next building block. In the middle, we see the Data Storage component that manages the data warehouse data. This element not only stores and manages the data; it also keeps track of the data using the metadata repository. The Information Delivery component, shown on the right, consists of all the different ways of making the information from the data warehouse available to the users.

Source Data Component

Source data coming into the data warehouses may be grouped into four broad categories:

Production Data: This type of data comes from the different operational systems of the enterprise. Based on the data requirements in the data warehouse, we choose segments of the data from the various operational systems.

Internal Data: In each organization, the client keeps their "private" spreadsheets, reports,
customer profiles, and sometimes even department databases. This is the internal data, part of
which could be useful in a data warehouse.

Archived Data: Operational systems are mainly intended to run the current business. In every operational system, we periodically take the old data and store it in archived files.
External Data: Most executives depend on information from external sources for a large percentage of the information they use. They use statistics relating to their industry produced by external agencies.

Data Staging Component

After we have extracted data from various operational systems and external sources, we have to prepare the files for storing in the data warehouse. The extracted data coming from several different sources needs to be changed, converted, and made ready in a format that is suitable to be saved for querying and analysis.

We will now discuss the three primary functions that take place in the staging area.

1) Data Extraction: This method has to deal with numerous data sources. We have to employ
the appropriate techniques for each data source.

2) Data Transformation: As we know, data for a data warehouse comes from many different sources. If data extraction for a data warehouse poses big challenges, data transformation presents even more significant challenges. We perform several individual tasks as part of data transformation.

First, we clean the data extracted from each source. Cleaning may be the correction of
misspellings or may deal with providing default values for missing data elements, or elimination
of duplicates when we bring in the same data from various source systems.

Standardization of data components forms a large part of data transformation. Data transformation also includes many forms of combining pieces of data from different sources: we combine data from a single source record or related data elements from many source records. On the other hand, data transformation also involves purging source data that is not useful and separating out source records into new combinations. Sorting and merging of data take place on a large scale in the data staging area. When the data transformation function ends, we have a collection of integrated data that is cleaned, standardized, and summarized.

3) Data Loading: Two distinct categories of tasks form the data loading function. When we complete the structure and construction of the data warehouse and go live for the first time, we do the initial loading of the data into the data warehouse storage. The initial load moves high volumes of data and uses up a substantial amount of time. After that, we keep the data warehouse current with ongoing incremental loads that apply the changes captured from the source systems.
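The three staging functions can be illustrated with a small, hypothetical sketch in Python using pandas. The file names, column names, and cleaning rules below are assumptions made for demonstration only, not part of any particular product or system.

```python
import pandas as pd

# --- Extract: pull data from two hypothetical source systems ---
orders_east = pd.read_csv("orders_east.csv")      # assumed extract file from source system A
orders_west = pd.read_csv("orders_west.csv")      # assumed extract file from source system B

# --- Transform: clean, standardize, and integrate the extracted data ---
staged = pd.concat([orders_east, orders_west], ignore_index=True)
staged["region"] = staged["region"].str.strip().str.upper()   # standardize region codes
staged["amount"] = staged["amount"].fillna(0.0)                # default value for missing amounts
staged = staged.drop_duplicates(subset=["order_id"])           # eliminate duplicate records

# Summarize to the grain required by the warehouse (monthly sales per region)
monthly_sales = (staged
                 .assign(month=pd.to_datetime(staged["order_date"]).dt.to_period("M"))
                 .groupby(["region", "month"], as_index=False)["amount"].sum())

# --- Load: write the prepared data into the warehouse staging area ---
monthly_sales.to_csv("warehouse_monthly_sales.csv", index=False)
```

In a real environment the same flow would be carried out by an ETL tool and would load a database table rather than a file; the sketch only mirrors the extract, transform, and load steps described above.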

Data Storage Components

Data storage for the data warehouse is a separate repository. The data repositories for the operational systems generally include only the current data. Also, these data repositories hold the data structured in a highly normalized form for fast and efficient processing.
Information Delivery Component

The information delivery element enables the process of subscribing to data warehouse files and having them transferred to one or more destinations according to some user-specified scheduling algorithm.

Metadata Component

Metadata in a data warehouse is equal to the data dictionary or the data catalog in a database
management system. In the data dictionary, we keep the data about the logical data structures,
the data about the records and addresses, the information about the indexes, and so on.

1.6 Data Marts

A data mart includes a subset of corporate-wide data that is of value to a specific group of users. The scope is confined to particular selected subjects. Data in a data warehouse should be fairly current, but not necessarily up to the minute, although developments in the data warehouse industry have made standard and incremental data loads more achievable. Data marts are smaller than data warehouses and usually contain data for a single department or subject area. The current trend in data warehousing is to develop a data warehouse with several smaller related data marts for particular kinds of queries and reports.

Management and Control Component


The management and control elements coordinate the services and functions within the data warehouse. These components control the data transformation and the data transfer into the data warehouse storage. They also moderate the information delivery to the users. They work with the database management systems and ensure that data is correctly stored in the repositories. They monitor the movement of data into the staging area and from there into the data warehouse storage itself.
UNIT-II: Data Pre-processing
[10L] Max Marks: 12

2.1 Need For Pre-Processing Of The Data

Data preprocessing is an important step in the data mining process. It refers to the cleaning,
transforming, and integrating of data in order to make it ready for analysis. The goal of data
preprocessing is to improve the quality of the data and to make it more suitable for the specific
data mining task.

Some common steps in data preprocessing include:
Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data,
such as missing values, outliers, and duplicates. Various techniques can be used for data
cleaning, such as imputation, removal, and transformation.
Data Integration: This involves combining data from multiple sources to create a unified dataset.
Data integration can be challenging as it requires handling data with different formats,
structures, and semantics. Techniques such as record linkage and data fusion can be used for
data integration.
Data Transformation: This involves converting the data into a suitable format for analysis.
Common techniques used in data transformation include normalization, standardization, and
discretization. Normalization is used to scale the data to a common range, while standardization
is used to transform the data to have zero mean and unit variance. Discretization is used to
convert continuous data into discrete categories.
Data Reduction: This involves reducing the size of the dataset while preserving the important
information. Data reduction can be achieved through techniques such as feature selection and
feature extraction. Feature selection involves selecting a subset of relevant features from the
dataset, while feature extraction involves transforming the data into a lower-dimensional space
while preserving the important information.
Data Discretization: This involves dividing continuous data into discrete categories or intervals.
Discretization is often used in data mining and machine learning algorithms that require
categorical data. Discretization can be achieved through techniques such as equal width binning,
equal frequency binning, and clustering.
Data Normalization: This involves scaling the data to a common range, such as between 0 and 1
or -1 and 1. Normalization is often used to handle data with different units and scales. Common
normalization techniques include min-max normalization, z-score normalization, and decimal
scaling.
Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of the
analysis results. The specific steps involved in data preprocessing may vary depending on the
nature of the data and the analysis goals.
By performing these steps, the data mining process becomes more efficient and the results
become more accurate.
Preprocessing in Data Mining:
Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format.
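As a small illustration of the normalization techniques mentioned above, the following sketch applies min-max normalization, z-score standardization, and decimal scaling to a toy list of values; the numbers are made up purely for demonstration.

```python
import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 950.0])   # toy attribute values

# Min-max normalization: rescale the values to the range [0, 1]
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score standardization: zero mean and unit variance
z_score = (values - values.mean()) / values.std()

# Decimal scaling: divide by 10^j, where j is chosen so that the
# scaled absolute values fall below 1
j = int(np.ceil(np.log10(np.abs(values).max())))
decimal_scaled = values / (10 ** j)

print(min_max)
print(z_score)
print(decimal_scaled)   # [0.2 0.3 0.4 0.6 0.95]
```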

2.2 Descriptive Data Summarization


Descriptive data summarization in data mining involves the process of extracting key insights
and characteristics from a dataset to provide a concise and understandable summary of its
contents. This summary aids in understanding the underlying patterns, trends, and distributions
within the data. Descriptive summarization techniques are particularly useful for exploratory
data analysis and for gaining an initial understanding of the dataset before applying more
advanced analytical methods. Here are some common techniques used for descriptive data
summarization in data mining:

● Summary Statistics: Calculating basic statistics such as mean, median, mode, standard
deviation, minimum, maximum, and quartiles provides a general overview of the
distribution of numerical variables within the dataset.
● Frequency Distribution: Creating frequency tables or histograms allows you to visualize
the distribution of categorical variables and identify the most common categories or
levels within each variable
● Data Visualization: Techniques such as scatter plots, bar charts, pie charts, box plots, and
heatmaps are used to visually represent the relationships and patterns within the data.
Visualization aids in identifying outliers, clusters, trends, and correlations.
● Correlation Analysis: Computing correlation coefficients (e.g., Pearson correlation)
between pairs of numerical variables helps identify relationships and dependencies
between variables. Correlation analysis is useful for understanding how variables are
related to each other.
● Cross-Tabulation: Cross-tabulating categorical variables allows you to examine the
relationships between different categories and identify any associations or dependencies
between them.
● Data Profiling: Data profiling involves analyzing the structure and quality of the dataset,
including identifying missing values, outliers, data types, and unique values for each
variable. Data profiling helps in understanding the completeness and integrity of the
dataset.
● Cluster Analysis: Cluster analysis techniques such as k-means clustering or hierarchical
clustering group similar data points together based on their characteristics. Cluster
analysis helps identify natural groupings or patterns within the data.
● Dimensionality Reduction: Techniques such as principal component analysis (PCA) or
t-distributed stochastic neighbor embedding (t-SNE) can be used to reduce the
dimensionality of the dataset while preserving important information. Dimensionality
reduction helps in visualizing high-dimensional data and identifying underlying structures.
● Data Summarization Techniques: Techniques such as data discretization, binning, or
summarization by aggregation (e.g., grouping data by time periods) can be used to
reduce the complexity of the dataset while preserving key insights.
● Text Summarization: In cases where the dataset contains textual data, techniques such
as text summarization, keyword extraction, or sentiment analysis can be used to extract
key themes, topics, or sentiments from the text.
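To illustrate the summary-statistics, frequency-distribution, correlation, and data-profiling techniques listed above, here is a minimal pandas sketch over a small made-up dataset; the column names and values are assumptions chosen only for demonstration.

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [23, 35, 31, 45, 52, 29, 41],
    "income": [28_000, 52_000, 47_000, 61_000, 75_000, 39_000, 58_000],
    "region": ["East", "West", "East", "North", "West", "East", "North"],
})

# Summary statistics: mean, std, min, quartiles, max for numeric columns
print(df.describe())

# Frequency distribution of a categorical variable
print(df["region"].value_counts())

# Correlation between numeric variables (Pearson correlation by default)
print(df[["age", "income"]].corr())

# Simple data profiling: data types and missing values per column
print(df.dtypes)
print(df.isna().sum())
```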
2.3 Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data cleaning is
done. It involves handling of missing data, noisy data etc.

● (a). Missing Data:
This situation arises when some values are missing in the data. It can be handled in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple values are missing within a tuple.
2. Fill the Missing values:
There are various ways to do this task. You can choose to fill the missing values manually, by the attribute mean, or by the most probable value.
● (b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways (a short sketch follows this list):
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size and then various methods are performed to complete the task. Each segment is handled separately. One can replace all data in a segment by its mean, or boundary values can be used to complete the task.
2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. The outliers may go undetected, or they will fall outside the clusters.
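The sketch below, using made-up numbers, shows two of the cleaning ideas described above: filling missing values with the attribute mean, and smoothing sorted data by equal-size bins whose members are replaced with the bin mean.

```python
import numpy as np
import pandas as pd

# Handling missing data: fill missing values with the attribute mean
prices = pd.Series([4.0, 8.0, None, 15.0, 21.0, None, 24.0, 25.0, 28.0])
prices = prices.fillna(prices.mean())

# Binning method for noisy data: sort the values, split them into
# equal-size segments, and replace every value in a segment with the
# segment mean (smoothing by bin means)
sorted_prices = np.sort(prices.to_numpy())
bins = np.array_split(sorted_prices, 3)                      # three equal-size segments
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])

print(smoothed)
```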

2.4 Data Integration and Transformation


Data Integration:
Data integration is a crucial step in data mining where data from multiple heterogeneous sources are combined, unified, and transformed into a single, consistent format for analysis. This process involves resolving inconsistencies, handling redundancies, and integrating data from different formats, structures, and schemas. Here are some key aspects of data integration in data mining (a small illustrative sketch follows this list):
● Data Source Identification: Identify and gather data from various internal and external
sources, including databases, flat files, spreadsheets, APIs, web services, and data
warehouses.
● Schema Integration: Resolve differences in data schemas (structures) between different
data sources. This may involve mapping attributes from different sources to a common
schema, resolving naming conflicts, and standardizing data types.
● Data Cleaning and Preprocessing: Cleanse and preprocess the data to ensure
consistency, accuracy, and completeness. This includes handling missing values, outliers,
duplicates, and errors, as well as standardizing formats and resolving inconsistencies.
● Data Transformation: Transform and standardize the data into a common format suitable
for analysis. This may involve converting data types, normalizing or standardizing
numeric values, and encoding categorical variables.
● Data Quality Assessment: Assess the quality of the integrated data by evaluating factors
such as accuracy, completeness, consistency, and timeliness. Identify and address any
data quality issues that may impact the analysis.
● Entity Resolution: Resolve redundancies and inconsistencies in data records by
identifying and merging duplicate or related entities. This may involve techniques such as
record linkage and deduplication to identify matching records across different datasets.
● Data Fusion: Combine data from multiple sources to create a unified dataset that
integrates information from various domains or perspectives. This may involve
aggregating data, joining datasets based on common attributes, and enriching data with
additional information.
● Metadata Management: Maintain metadata to provide documentation and context for the
integrated data, including information about the data sources, transformations applied,
and business rules. Metadata management facilitates understanding, querying, and reuse
of the integrated data.
● Data Governance: Establish policies, processes, and controls to ensure the integrity,
security, and privacy of the integrated data. Data governance frameworks define roles and
responsibilities for managing data assets and ensure compliance with regulations and
standards.
● Data Warehousing and Integration Tools: Utilize data integration tools and platforms to
automate and streamline the process of integrating data from disparate sources. These
tools provide capabilities for data extraction, transformation, loading (ETL), and
orchestration of data integration workflows.
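As a small, hypothetical illustration of schema integration, entity resolution, and data fusion, the sketch below merges two toy customer tables whose column names differ; all table names, column names, and values are assumptions for demonstration only.

```python
import pandas as pd

# Two heterogeneous sources describing the same customers
crm = pd.DataFrame({"cust_id": [1, 2, 3],
                    "customer_name": ["Asha", "Ben", "Carla"]})
billing = pd.DataFrame({"CustomerID": [2, 3, 3, 4],
                        "total_billed": [120.0, 80.0, 80.0, 45.0]})

# Schema integration: map source attributes onto a common schema
billing = billing.rename(columns={"CustomerID": "cust_id"})

# Entity resolution / deduplication: remove duplicate billing records
billing = billing.drop_duplicates()

# Data fusion: join the sources on the common key into one unified dataset
unified = crm.merge(billing, on="cust_id", how="outer")
print(unified)
```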

2. Data Transformation:
Data transformation is a crucial step in the data mining process where raw data is manipulated
or modified to prepare it for analysis. This step involves converting data into a format that is
suitable for the chosen data mining technique or algorithm. Data transformation helps improve
the quality of the data, reduces noise, and enhances the performance of data mining algorithms.
Here are some common techniques used in data transformation in data mining:

❖ Normalization: Normalization is the process of scaling numeric data to a standard range,


typically between 0 and 1, or -1 and 1. This helps in ensuring that all numeric attributes
contribute equally to the analysis and prevents attributes with larger scales from
dominating the analysis.
❖ Standardization: Standardization, also known as z-score normalization, involves scaling
numeric data to have a mean of 0 and a standard deviation of 1. This technique is useful
when the distribution of the data is Gaussian (bell-shaped curve).
❖ Attribute Discretization: Attribute discretization involves converting continuous numeric
attributes into discrete intervals or bins. This simplifies the data and makes it easier to
analyze, especially for algorithms that work better with categorical data.
❖ Missing Value Handling: Dealing with missing values is an important aspect of data
transformation. Techniques such as imputation (replacing missing values with estimated
values), deletion (removing records with missing values), or using predictive models to
estimate missing values can be employed.
❖ Data Aggregation: Data aggregation involves combining multiple data points into
summary statistics, such as averages, counts, sums, or other statistical measures. This is
useful for reducing the size of the dataset while retaining important information.
❖ Feature Engineering: Feature engineering involves creating new features or attributes
from existing ones to enhance the predictive power of the data. This may include creating
interaction terms, polynomial features, or transforming variables using mathematical
functions.
❖ Feature Scaling: Feature scaling ensures that all features have a similar scale, preventing
attributes with larger magnitudes from dominating the analysis. Techniques such as
min-max scaling or z-score normalization can be used for feature scaling.
❖ Data Encoding: Data encoding involves converting categorical variables into a numerical
format that can be processed by data mining algorithms. Techniques such as one-hot
encoding, label encoding, or ordinal encoding are commonly used for this purpose.
❖ Text Preprocessing: In cases where the data contains textual information, text
preprocessing techniques such as tokenization, stemming, lemmatization, and removal of
stop words are used to prepare the text data for analysis.
❖ Dimensionality Reduction: Dimensionality reduction techniques such as principal
component analysis (PCA) or t-distributed stochastic neighbor embedding (t-SNE) are
used to reduce the number of features in the dataset while preserving important
information and reducing computational complexity.
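The following sketch, built on made-up data, shows two of the transformation techniques above: encoding a categorical variable with one-hot encoding and aggregating detailed records into summary statistics.

```python
import pandas as pd

sales = pd.DataFrame({
    "store":   ["A", "A", "B", "B", "B"],
    "channel": ["online", "retail", "online", "online", "retail"],
    "amount":  [100.0, 250.0, 80.0, 120.0, 300.0],
})

# Data encoding: one-hot encode the categorical 'channel' attribute
encoded = pd.get_dummies(sales, columns=["channel"])

# Data aggregation: collapse detailed rows into per-store summaries
summary = sales.groupby("store")["amount"].agg(["count", "sum", "mean"])

print(encoded)
print(summary)
```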

2.5 Data Reduction:


Data reduction in data mining refers to the process of reducing the volume and complexity of
data while preserving its essential characteristics and patterns. This step is essential for
handling large datasets efficiently and improving the performance of data mining algorithms.
Data reduction techniques aim to minimize computational requirements, reduce storage costs,
and enhance the quality of analysis results. Here are some common data reduction techniques
used in data mining:

1. Sampling: Sampling involves selecting a representative subset of the original dataset


for analysis. Random sampling, systematic sampling, and stratified sampling are
common sampling methods. By analyzing a smaller sample of the data, computational
resources can be conserved while still capturing essential patterns and trends.
2. Dimensionality Reduction: Dimensionality reduction techniques aim to reduce the
number of features or attributes in the dataset while preserving as much information as
possible. Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and
t-Distributed Stochastic Neighbor Embedding (t-SNE) are examples of dimensionality
reduction techniques. By reducing the dimensionality of the dataset, computational
complexity is reduced, and the risk of overfitting is mitigated.
3. Feature Selection: Feature selection involves selecting a subset of the most relevant
features or attributes from the original dataset. This process eliminates redundant or
irrelevant features, reducing the dimensionality of the data while maintaining its
discriminative power. Feature selection techniques include filter methods, wrapper
methods, and embedded methods.
4. Attribute Transformation: Attribute transformation involves transforming the original
attributes or features into a new set of attributes using mathematical functions or
transformations. Common transformations include normalization, standardization, log
transformation, and power transformation. Attribute transformation can improve the
performance of certain data mining algorithms and make the data more suitable for
analysis.
5. Discretization: Discretization involves converting continuous numerical attributes into
discrete intervals or bins. This reduces the complexity of the data by replacing a large
number of distinct values with a smaller number of discrete categories. Discretization
facilitates the analysis of numerical data using techniques that are designed for
categorical data.
6. Data Compression: Data compression techniques reduce the storage requirements of
the dataset by encoding it in a more compact form. Compression algorithms such as
run-length encoding, delta encoding, and Huffman coding can be applied to reduce the
size of the dataset without losing important information.
7. Clustering: Clustering techniques group similar data points together based on their
characteristics. By clustering the data, redundant information can be summarized, and
the number of data points can be reduced while preserving the underlying patterns and
relationships. Clustering algorithms such as k-means clustering and hierarchical
clustering are commonly used for data reduction.
8. Binning: Binning involves dividing continuous numerical attributes into a small number
of discrete bins or intervals. Binning reduces the granularity of the data and simplifies
its representation, making it easier to analyze and interpret. Binning can also help in
handling outliers and noise in the data.
9. Sampling in Time or Space: For data that has a temporal or spatial component,
sampling can be performed based on time intervals or geographic regions. This reduces
the volume of data while retaining the temporal or spatial patterns present in the
original dataset.
10. Data Cube Aggregation: In data warehousing and OLAP (Online Analytical Processing),
data cube aggregation involves aggregating data across multiple dimensions to create
summary data cubes. Aggregated data cubes provide a higher-level view of the data and
can significantly reduce the volume of data while preserving key insights.
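As a minimal sketch of two of the reduction techniques above, the code below draws a random sample from a synthetic dataset and then reduces its dimensionality with a small principal component analysis implemented directly with NumPy; the data and sizes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # synthetic dataset: 1000 records, 5 attributes

# Sampling: keep a random 10% subset of the records
sample_idx = rng.choice(len(X), size=100, replace=False)
X_sample = X[sample_idx]

# Dimensionality reduction with PCA: center the data and project it onto
# the top-2 principal components obtained from the singular value decomposition
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_reduced = X_centered @ Vt[:2].T

print(X_sample.shape)    # (100, 5)
print(X_reduced.shape)   # (1000, 2)
```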

2.6 Data Discretization And Concept Hierarchy Generation.

Data discretization refers to a method of converting a huge number of data values into smaller ones so that the evaluation and management of data become easy. In other words, data discretization is a method of converting the attribute values of continuous data into a finite set of intervals with minimum data loss. There are two forms of data discretization: the first is supervised discretization, and the second is unsupervised discretization. Supervised discretization refers to a method in which the class data is used. Unsupervised discretization refers to a method that depends on the way the operation proceeds, that is, whether it uses a top-down splitting strategy or a bottom-up merging strategy.

Now, we can understand this concept with the help of an example

Suppose we have an attribute Age with the given values:

Age: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77

Table before and after discretization:

Attribute              | Age            | Age                        | Age                        | Age
Values                 | 1, 5, 4, 9, 7  | 11, 14, 17, 13, 18, 19     | 31, 33, 36, 42, 44, 46     | 70, 74, 77, 78
After discretization   | Child          | Young                      | Mature                     | Old
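The discretization shown in the table above can be reproduced with a short pandas sketch; the bin boundaries (10, 30, 50, 100) are assumptions chosen to match the Child/Young/Mature/Old groups.

```python
import pandas as pd

ages = [1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19,
        31, 33, 36, 42, 44, 46, 70, 74, 78, 77]

# Convert the continuous Age attribute into four discrete categories
labels = pd.cut(ages,
                bins=[0, 10, 30, 50, 100],
                labels=["Child", "Young", "Mature", "Old"])

print(pd.Series(labels).value_counts().sort_index())
# Child 5, Young 6, Mature 6, Old 4
```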

Another example is analytics, where we gather the static data of website visitors. For example,
all visitors who visit the site with the IP address of India are shown under country level.

Some Famous techniques of data discretization

Histogram analysis

Histogram refers to a plot used to represent the underlying frequency distribution of a


continuous data set. Histogram assists the data inspection for data distribution. For example,
Outliers, skewness representation, normal distribution representation, etc.

Binning

Binning refers to a data smoothing technique that helps to group a huge number of continuous values into a smaller number of values. This technique can also be used for data discretization and the development of concept hierarchies.

Cluster Analysis

Cluster analysis is a form of data discretization. A clustering algorithm can be applied by dividing the values of a numeric attribute x into clusters or groups, which then serve as the discrete intervals for x.

Data discretization using decision tree analysis

This form of discretization uses a decision tree analysis in which a top-down splitting technique is applied. It is a supervised procedure. To discretize a numeric attribute, you first select the split point that gives the least entropy and then run the procedure recursively. The recursive process divides the attribute into discretized disjoint intervals, from top to bottom, using the same splitting criterion.

Data discretization using correlation analysis

In discretization by correlation analysis, the best neighboring intervals are identified first, and then the large intervals are combined recursively into larger intervals to form the final set of intervals. It is a supervised, bottom-up procedure.

2.6.1 Data discretization and concept hierarchy generation


The term hierarchy represents an organizational structure or mapping in which items are ranked according to their levels of importance. In other words, a concept hierarchy refers to a sequence of mappings from a set of low-level concepts to higher-level, more general concepts. For example, in computer science there are different types of hierarchical systems: a document placed in a folder in Windows, at a specific place in the tree structure, is a good example of a hierarchical tree model. There are two types of mapping: top-down mapping and bottom-up mapping.

Let's understand this concept hierarchy for the dimension location with the help of an example.

A particular city can map with the belonging country. For example, New Delhi can be mapped to India,
and India can be mapped to Asia.
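A concept hierarchy such as city -> country -> continent can be represented with simple mappings and used to roll data up to a higher level; the sketch below uses made-up sales figures for illustration.

```python
# Concept hierarchy for the 'location' dimension (low level -> high level)
city_to_country = {"New Delhi": "India", "Mumbai": "India", "Tokyo": "Japan"}
country_to_continent = {"India": "Asia", "Japan": "Asia"}

# City-level measures (hypothetical sales figures)
sales_by_city = {"New Delhi": 120, "Mumbai": 90, "Tokyo": 150}

# Roll up from the city level to the country level using the hierarchy
sales_by_country = {}
for city, amount in sales_by_city.items():
    country = city_to_country[city]
    sales_by_country[country] = sales_by_country.get(country, 0) + amount

print(sales_by_country)                                       # {'India': 210, 'Japan': 150}
print({country_to_continent[c] for c in sales_by_country})    # {'Asia'}
```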

Top-down mapping

Top-down mapping generally starts with the top with some general information and ends with the
bottom to the specialized information.

Bottom-up mapping

Bottom-up mapping generally starts with the bottom with some specialized information and ends with
the top to the generalized information.

Data discretization and binarization in data mining

Data discretization is a method of converting the attribute values of continuous data into a finite set of intervals with minimum data loss. In contrast, data binarization is used to transform continuous and discrete attributes into binary attributes.

Why is Discretization important?

As we know, continuous data poses a mathematical problem with an infinite number of degrees of freedom. For many purposes, data scientists need to implement discretization. It is also used to improve the signal-to-noise ratio.
UNIT-III: OLAP
[10L] Max Marks: 14

3.1 OLAP In Data Warehouse, Demand For Online Analytical Processing


What is OLAP (Online Analytical Processing)?

OLAP stands for On-Line Analytical Processing. OLAP is a category of software technology which enables analysts, managers, and executives to gain insight into information through fast, consistent, interactive access to a wide variety of possible views of data that has been transformed from raw information to reflect the real dimensionality of the enterprise as understood by the user.

OLAP implements the multidimensional analysis of business information and supports the capability for complex estimations, trend analysis, and sophisticated data modeling. It is rapidly becoming the essential foundation for intelligent solutions including Business Performance Management, Planning, Budgeting, Forecasting, Financial Reporting, Analysis, Simulation Models, Knowledge Discovery, and Data Warehouse Reporting. OLAP enables end users to perform ad hoc analysis of data in multiple dimensions, providing the insight and understanding they require for better decision making.

Who uses OLAP and Why?

OLAP applications are used by a variety of the functions of an organization.

Finance and accounting:

○ Budgeting
○ Activity-based costing
○ Financial performance analysis
○ And financial modeling

Sales and Marketing

○ Sales analysis and forecasting


○ Market research analysis
○ Promotion analysis
○ Customer analysis
○ Market and customer segmentation

Production
○ Production planning
○ Defect analysis

OLAP cubes have two main purposes. The first is to provide business users with a data model more intuitive to them than a tabular model; this model is called a dimensional model. The second is to enable fast query performance, since summarized data can be pre-computed and stored in the cube.

3.2 Need For Multidimensional

Multidimensional OLAP (MOLAP) uses array-based multidimensional storage engines for


multidimensional views of data. With multidimensional data stores, the storage utilization may
be low if the dataset is sparse. Therefore, many MOLAP servers use two levels of data storage
representation to handle dense and sparse datasets.

Points to Remember −

​ MOLAP tools process information with consistent response time regardless of level of
summarizing or calculations selected.
​ MOLAP tools need to avoid many of the complexities of creating a relational database
to store data for analysis.
​ MOLAP tools need fastest possible performance.
​ MOLAP server adopts two level of storage representation to handle dense and sparse
data sets.
​ Denser sub-cubes are identified and stored as array structure.
​ Sparse sub-cubes employ compression technology.

Advantages

​ MOLAP allows the fastest indexing to the pre-computed summarized data.
​ Helps users connected to a network who need to analyze larger, less-defined data.
​ Easier to use, therefore MOLAP is suitable for inexperienced users.

Disadvantages

​ MOLAP is not capable of containing detailed data.
​ The storage utilization may be low if the data set is sparse.

OLAP Definitions And Rules

On-line Analytical Processing (OLAP) is a category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of information that has been transformed from the raw data to reflect the real dimensionality of the enterprise as understood by the user.
OLAP was introduced by Dr. E. F. Codd in 1993, and he presented 12 rules regarding OLAP:
1. Multidimensional Conceptual View:
Multidimensional data model is provided that is intuitively analytical and
easy to use. A multidimensional data model decides how the users perceive
business problems.

2. Transparency:
It makes the technology, underlying data repository, computing architecture,
and the diverse nature of source data totally transparent to users.

3. Accessibility:
Access should be provided only to the data that is actually needed to perform the specific analysis, presenting a single, coherent and consistent view to the users.

4. Consistent Reporting Performance:


Users should not experience any significant degradation in reporting
performance as the number of dimensions or the size of the database
increases. It also ensures users must perceive consistent run time, response
time or machine utilization every time a given query is run.

5. Client/Server Architecture:
It conforms the system to the principles of client/server architecture for
optimum performance, flexibility, adaptability, and interoperability.

6. Generic Dimensionality:
It should be ensured that every data dimension is equivalent in both structure and operational capabilities. There should be one logical structure for all dimensions.

7. Dynamic Sparse Matrix Handling:


The physical schema should adapt to the specific analytical model being created and loaded, so that sparse matrix handling is optimized.

8. Multi-user Support:
Support should be provided for end users to work concurrently with either
the same analytical model or to create different models from the same data.

9. Unrestricted Cross-dimensional Operations:

The system should have the ability to recognize dimensional hierarchies and automatically perform roll-up and drill-down operations within a dimension or across dimensions.

10. Intuitive Data Manipulation:

Consolidation path reorientation, drill-down, roll-up, and other manipulations should be accomplished intuitively and directly via point-and-click actions.

11. Flexible Reporting:

The business user is provided with capabilities to arrange columns, rows, and cells in a manner that facilitates easy manipulation, analysis and synthesis of information.

12. Unlimited Dimensions and Aggregation Levels:


There should be at least fifteen or twenty data dimensions within a common
analytical model.

3.5 Characteristics of OLAP

The FASMI characteristics of OLAP methods form a term derived from the first letters of the following characteristics:

Fast

It means that the system is targeted to deliver most responses to the user within about five seconds, with the most elementary analysis taking no more than one second and very few taking more than 20 seconds.

Analysis

It means that the system can cope with any business logic and statistical analysis that is relevant for the application and the user, and keep it easy enough for the target user. Although some pre-programming may be needed, we do not think it acceptable if all application definitions have to be programmed. The user must be allowed to define new ad hoc calculations as part of the analysis and to report on the data in any desired way, without having to program, so products (like Oracle Discoverer) that do not allow adequate end-user-oriented calculation flexibility are excluded.

Share

It means that the system implements all the security requirements for confidentiality and, if multiple write access is needed, concurrent update locking at an appropriate level. Not all applications need users to write data back, but for the increasing number that do, the system should be able to handle multiple updates in a timely, secure manner.

Multidimensional

This is the basic requirement. OLAP system must provide a multidimensional conceptual view
of the data, including full support for hierarchies, as this is certainly the most logical method to
analyze business and organizations.

Information

The system should be able to hold all the data needed by the applications. Data sparsity
should be handled in an efficient manner.

The main characteristics of OLAP are as follows:

1. Multidimensional conceptual view: OLAP systems let business users have a dimensional and logical view of the data in the data warehouse. It helps in carrying out slice and dice operations.
2. Multi-User Support: Since OLAP systems are shared, the OLAP operations should provide normal database operations, including retrieval, update, concurrency control, integrity, and security.
3. Accessibility: OLAP acts as a mediator between data warehouses and front-end. The
OLAP operations should be sitting between data sources (e.g., data warehouses) and an
OLAP front-end.
4. Storing OLAP results: OLAP results are kept separate from data sources.
5. Uniform reporting performance: Increasing the number of dimensions or database size should not significantly degrade the reporting performance of the OLAP system.
6. OLAP provides for distinguishing between zero values and missing values so that
aggregates are computed correctly.
7. OLAP system should ignore all missing values and compute correct aggregate values.
8. OLAP facilitate interactive query and complex analysis for the users.
9. OLAP allows users to drill down for greater details or roll up for aggregations of metrics along a single business dimension or across multiple dimensions.
10. OLAP provides the ability to perform intricate calculations and comparisons.
11. OLAP presents results in a number of meaningful ways, including charts and graphs.

3.6 Major Features And Functions

There are various features of OLAP Servers which are as follows −

Multidimensional conceptual view − A user view of the enterprise data is multidimensional. The
conceptual view of OLAP models should be multidimensional. The multidimensional models
can be manipulated more easily and intuitively than in the case of single-dimensional models.

Transparency − A user should be able to get full value from an OLAP engine without regarding
the source of the data. The OLAP system’s technology, underlying database, and computing
architecture, and the heterogeneity of input data sources should be transparent to users to
preserve their productivity and proficiency with familiar front-end environments and tools.

It should also be transparent to the user as to whether or not the enterprise data input to the
OLAP tool comes from homogeneous or heterogeneous database environments.

Accessibility − The OLAP user must be able to perform analysis based upon a common conceptual schema composed of enterprise data in a relational DBMS. The OLAP tool should map its logical schema to the heterogeneous physical data stores, access the data, and perform any conversions necessary to present a single, coherent and consistent user view.

Consistent performance − As the number of dimensions or the size of the database increases, there should not be any significant degradation in reporting performance. Consistent reporting performance is essential to supporting the ease of use and lack of complexity needed in bringing OLAP to the end user.
Client-server architecture − Most of the data which require online analytical processing are
stored on mainframe systems and accessed via personal computers. The OLAP servers must
be capable of operating in a client-server environment.

The server component of OLAP tools should be sufficiently intelligent such that various clients
can be connected with minimum effort and integration programming. This server may be
capable of performing the mapping and consolidation between disparate logical and physical
enterprise database schema necessary to affect transparency and to build common
conceptual, logical, and physical schemas.

Generic dimensionality − Each data dimension should be similar in both its architecture and
operational capabilities. Additional operational capabilities can be granted to selected
dimensions but since dimensions are symmetric a given additional function can be granted to
any dimension.

Multi-user support − There can be a need to work concurrently with either the same analytical
model or to create different models from the same enterprise data.

Intuitive data manipulation − Consolidation by drilling down across columns or rows, zooming out, and other manipulations inherent in the consolidation outlines should be accomplished via direct action upon the cells of the analytical model, and should require neither the use of a menu nor multiple trips across the user interface.

3.7 Types of OLAP

There are three main types of OLAP servers, as follows:

ROLAP stands for Relational OLAP, an application based on relational DBMSs.

MOLAP stands for Multidimensional OLAP, an application based on multidimensional DBMSs.

HOLAP stands for Hybrid OLAP, an application using both relational and multidimensional techniques.

Relational OLAP (ROLAP) Server


These are intermediate servers which stand in between a relational back-end server and user
frontend tools.

They use a relational or extended-relational DBMS to save and handle warehouse data, and
OLAP middleware to provide missing pieces.

ROLAP servers contain optimization for each DBMS back end, implementation of aggregation
navigation logic, and additional tools and services.

ROLAP technology tends to have higher scalability than MOLAP technology.

ROLAP systems work primarily from the data that resides in a relational database, where the
base data and dimension tables are stored as relational tables. This model permits the
multidimensional analysis of data.

This technique relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each method of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
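The point that ROLAP slicing and dicing reduce to WHERE clauses can be illustrated with a tiny relational example using Python's built-in sqlite3 module; the table and column names here are made up for demonstration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("East", 2022, 100.0), ("East", 2023, 140.0),
                  ("West", 2022, 90.0),  ("West", 2023, 120.0)])

# Slice: fix one dimension (region = 'East') -- simply a WHERE clause
slice_rows = conn.execute(
    "SELECT year, SUM(amount) FROM sales WHERE region = 'East' GROUP BY year").fetchall()

# Dice: restrict several dimensions at once -- a compound WHERE clause
dice_rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales "
    "WHERE region IN ('East', 'West') AND year = 2023 GROUP BY region").fetchall()

print(slice_rows)   # [(2022, 100.0), (2023, 140.0)]
print(dice_rows)    # [('East', 140.0), ('West', 120.0)]
```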


MOLAP stores data on disks in a specialized multidimensional array structure. OLAP is performed on it relying on the random access capability of the arrays. Array elements are determined by dimension instances, and the fact data or measured value associated with each cell is usually stored in the corresponding array element. In MOLAP, the multidimensional array is usually stored in a linear allocation according to nested traversal of the axes in some predetermined order.

But unlike ROLAP, where only records with non-zero facts are stored, all array elements are defined in MOLAP and, as a result, the arrays generally tend to be sparse, with empty elements occupying a greater part of them. Since both storage and retrieval costs are important when assessing online performance efficiency, MOLAP systems typically include provisions such as advanced indexing and hashing to locate data while performing queries on sparse arrays. MOLAP cubes offer fast data retrieval, are optimal for slicing and dicing, and can perform complex calculations. All calculations are pre-generated when the cube is created.

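The idea that a MOLAP engine stores the cube as a multidimensional array laid out linearly can be sketched with NumPy; the dimensions and values below are invented for illustration and are not tied to any particular MOLAP product.

```python
import numpy as np

products = ["P1", "P2"]
regions  = ["East", "West", "North"]
months   = ["Jan", "Feb", "Mar", "Apr"]

# The cube is a dense 3-D array indexed by (product, region, month);
# cells with no fact simply hold zero, which is why MOLAP arrays tend to be sparse.
cube = np.zeros((len(products), len(regions), len(months)))
cube[0, 1, 2] = 500.0        # sales of P1 in West during Mar
cube[1, 0, 0] = 320.0        # sales of P2 in East during Jan

# Linear allocation: the array occupies one flat block, and a cell's position
# follows a nested traversal of the axes in a fixed order.
flat_index = np.ravel_multi_index((0, 1, 2), cube.shape)
print(flat_index, cube.ravel()[flat_index])   # 6 500.0

# A roll-up over the month dimension is just a sum along that axis.
print(cube.sum(axis=2))
```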

Hybrid OLAP (HOLAP)


HOLAP is a combination of ROLAP and MOLAP. HOLAP servers allow for storing large data
volumes of detailed data. On the one hand, HOLAP leverages the greater scalability of
ROLAP. On the other hand, HOLAP leverages cube technology for faster performance and
summary-type information. Cubes are smaller than MOLAP since detailed data is kept in the
relational database. The database is used to store data in the most functional way possible.

Transparent OLAP (TOLAP)

TOLAP systems are designed to work transparently with existing RDBMS systems, allowing
users to access OLAP features without needing to transfer data to a separate OLAP system.
This allows for more seamless integration between OLAP and traditional RDBMS systems.

Other Types of OLAP

There are some other types of OLAP Systems that are used in analyzing databases. Some of
them are mentioned below.

● Web OLAP (WOLAP): It is a Web browser-based technology. A traditional OLAP application is accessed through client/server tools, whereas this OLAP application is accessible through the web browser. It is a three-tier architecture that consists of a client, middleware,
and database server. The most appealing features of this style of OLAP were (past
tense intended, since few products categorize themselves this way) the considerably
lower investment involved on the client side (“all that’s needed is a browser”) and
enhanced accessibility to connect to the data. A Web-based application requires no
deployment on the client machine. All that is needed is a Web browser and a network
connection to the intranet or Internet.
● Desktop OLAP (DOLAP): DOLAP stands for desktop analytical processing. Users can download the data from the source and work with the dataset locally on their desktop. Functionality is limited compared to other OLAP applications. It has a lower cost.
● Mobile OLAP (MOLAP): Mobile OLAP provides OLAP functionality for wireless and mobile devices. Users work with and access the data through their mobile devices.
● Spatial OLAP (SOLAP): SOLAP emerged from merging the capabilities of both Geographic Information Systems (GIS) and OLAP into a single user interface. SOLAP was created because data can come in the form of alphanumeric values, images, and vectors. This provides easy and quick exploration of data that resides in a spatial database.
● Real-time OLAP (ROLAP): ROLAP technology combines the features of both OLTP and
OLAP. It allows users to view data in real-time and perform analysis on data as it is
being updated in the system. ROLAP also provides a single, unified view of data from
different sources and supports advanced analytics like predictive modeling and data
mining.
● Cloud OLAP (COLAP): COLAP is a cloud-based OLAP solution that allows users to
access data from anywhere and anytime. It eliminates the need for on-premise
hardware and software installations, making it a cost-effective and scalable solution
for businesses of all sizes. COLAP also offers high availability and disaster recovery
capabilities, ensuring business continuity in the event of a disaster.
● Big Data OLAP (BOLAP): BOLAP is an OLAP solution that can handle large amounts of data, such as data from Hadoop or other big data sources. It provides high-performance analytics on large datasets and supports complex queries that are impossible with traditional OLAP tools. BOLAP also supports real-time analysis of big data, allowing users to make informed decisions based on up-to-date information.
● In-memory OLAP (IOLAP): IOLAP is an OLAP solution that stores data in memory for faster access and processing. It provides real-time analysis on large datasets and supports complex queries, making it an ideal solution for businesses that require fast and accurate analytics. IOLAP also supports advanced analytics like predictive modeling and data mining, allowing users to gain insights into their data and make informed decisions.

3.8 Difference between OLAP and OLTP in DBMS

S.No. | OLAP | OLTP
1 | OLAP stands for Online Analytical Processing. | OLTP stands for Online Transaction Processing.
2 | It includes software tools that help in analyzing data, mainly for business decisions. | It helps in managing online database modification.
3 | It utilizes the data warehouse. | It utilizes traditional approaches of DBMS.
4 | It is popular as an online database query management system. | It is popular as an online database modifying system.
5 | OLAP employs the data warehouse. | OLTP employs a traditional DBMS.
6 | It holds old data from various databases. | It holds current operational data.
7 | Here the tables are not normalized. | Here the tables are normalized.
8 | It allows only read and hardly any write operations. | It allows both read and write operations.
9 | Here, complex queries are involved. | Here, the queries are simple.

3.9 Data Cubes and Operations on Cubes

Data cube operations

Data cube operations are used to manipulate data to meet the needs of users. These operations help to select particular data for the analysis purpose. There are mainly five operations, listed below:

● Roll-up: this operation aggregates similar data attributes that share the same dimension. For example, if the data cube displays the daily income of a customer, we can use a roll-up operation to find his monthly income.

● Drill-down: this operation is the reverse of the roll-up operation. It allows us to take particular information and subdivide it further for finer-grained analysis; it zooms into more detail. For example, if India is a value in a country column and we wish to see villages in India, the drill-down operation splits India into states, districts, towns, cities and villages and then displays the required information.

● Slicing: this operation filters out the unnecessary portions of the cube. Suppose the user does not need everything in a particular dimension for analysis, but only a particular attribute value. For example, with country = "Jamaica", only data about Jamaica is displayed and the other countries in the country list are not shown.

● Dicing: this operation performs a multidimensional cut: it does not cut just one dimension but can also move to another dimension and cut a certain range of it. As a result, it looks more like a sub-cube carved out of the whole cube (as depicted in the figure). For example, the user wants to see the annual salary of Jharkhand state employees.

● Pivot: this operation is very important from a viewing point of view. It basically
transforms the data cube in terms of view. It doesn’t change the data present in the
data cube. For example, if the user is comparing year versus branch, using the pivot
operation, the user can change the viewpoint and now compare branch versus item
type.
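The following short Python sketch (using pandas) makes these operations concrete. The sales table, its column names and its values are made up purely for illustration and are not taken from the text above.

import pandas as pd

# Hypothetical fact table: one row per (year, quarter, branch, item) with a sales measure
df = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2", "Q2"],
    "branch":  ["A", "A", "B", "A", "B", "B"],
    "item":    ["phone", "phone", "laptop", "laptop", "phone", "laptop"],
    "sales":   [100, 150, 200, 120, 180, 220],
})

# Roll-up: aggregate away the quarter level, keeping the coarser year level
rollup = df.groupby(["year", "branch"])["sales"].sum()

# Drill-down: the reverse, going back to the finer quarter level
drilldown = df.groupby(["year", "quarter", "branch"])["sales"].sum()

# Slice: fix a single dimension value (branch = "A")
slice_a = df[df["branch"] == "A"]

# Dice: select a sub-cube over values of two or more dimensions
dice = df[(df["branch"] == "A") & (df["item"] == "phone")]

# Pivot: change the viewpoint (branch versus item) without changing the data
pivot = df.pivot_table(index="branch", columns="item", values="sales", aggfunc="sum")

print(rollup, drilldown, slice_a, dice, pivot, sep="\n\n")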
UNIT-IV: Data Mining

4.1 Introduction-Data Mining functionalities,


Introduction to Data Mining
Data mining is the process of extracting useful information from large sets of data. It involves
using various techniques from statistics, machine learning, and database systems to identify
patterns, relationships, and trends in the data. This information can then be used to make
data-driven decisions, solve business problems, and uncover hidden insights. Applications of
data mining include customer profiling and segmentation, market basket analysis, anomaly
detection, and predictive modeling. Data mining tools and technologies are widely used in
various industries, including finance, healthcare, retail, and telecommunications.

In general terms, “Mining” is the process of extraction of some valuable material from the
earth e.g. coal mining, diamond mining, etc. In the context of computer science, “Data Mining”
can be referred to as knowledge mining from data, knowledge extraction, data/pattern
analysis, data archaeology, and data dredging. It is basically the process carried out for the
extraction of useful information from a bulk of data or data warehouses. One can see that the
term itself is a little confusing. In the case of coal or diamond mining, the result of the
extraction process is coal or diamond. But in the case of Data Mining, the result of the
extraction process is not data!! Instead, data mining results are the patterns and knowledge
that we gain at the end of the extraction process. In that sense, we can think of Data Mining
as a step in the process of Knowledge Discovery or Knowledge Extraction.

Gregory Piatetsky-Shapiro coined the term “Knowledge Discovery in Databases” in 1989.


However, the term ‘data mining’ became more popular in the business and press
communities. Currently, Data Mining and Knowledge Discovery are used interchangeably.

Nowadays, data mining is used in almost all places where a large amount of data is stored
and processed. For example, banks typically use ‘data mining’ to find out their prospective
customers who could be interested in credit cards, personal loans, or insurance as well. Since
banks have the transaction details and detailed profiles of their customers, they analyze all
this data and try to find out patterns that help them predict that certain customers could be
interested in personal loans, etc.

Data Mining Functionalities

Functionalities of Data Mining

Data mining functionalities are used to represent the types of patterns that have to be discovered in data mining tasks. Data mining tasks can be classified into two types: descriptive and predictive. Descriptive mining tasks characterize the general features of the data in the database, while predictive mining tasks perform inference on the current data in order to make predictions.

Data mining is extensively used in many areas or sectors. It is used to predict and characterize
data. But the ultimate objective in Data Mining Functionalities is to observe the various trends
in data mining. There are several data mining functionalities that the organized and scientific
methods offer, such as:

1. Class/Concept Descriptions

A class or concept implies there is a data set or set of features that define the class or a
concept. A class can be a category of items on a shop floor, and a concept could be the
abstract idea on which data may be categorized like products to be put on clearance sale and
non-sale products. There are two concepts here, one that helps with grouping and the other
that helps in differentiating.

○ Data Characterization: This refers to the summary of general characteristics or features of the class, resulting in specific rules that define a target class. A data analysis technique called attribute-oriented induction is employed on the data set for achieving characterization.
○ Data Discrimination: Discrimination is used to separate distinct data sets based on the disparity in attribute values. It compares features of a class with features of one or more contrasting classes, and the output can be presented in forms such as bar charts, curves and pie charts.

2. Mining Frequent Patterns

One of the functions of data mining is finding data patterns. Frequent patterns are things that
are discovered to be most common in data. Various types of frequency can be found in the
dataset.

○ Frequent item set: This term refers to a group of items that are commonly found together, such as milk and sugar.
○ Frequent substructure: It refers to the various types of data structures that can be
combined with an item set or subsequences, such as trees and graphs.
○ Frequent Subsequence: A regular pattern series, such as buying a phone followed by a
cover.

3. Association Analysis

It analyses the set of items that generally occur together in a transactional dataset. It is also
known as Market Basket Analysis for its wide use in retail sales. Two parameters are used for
determining the association rules:
○ Support, which identifies the common item sets in the database.
○ Confidence, which is the conditional probability that an item occurs in a transaction when another item occurs.

4. Classification

Classification is a data mining technique that categorizes items in a collection based on some
predefined properties. It uses methods like if-then, decision trees or neural networks to predict
a class or essentially classify a collection of items. A training set containing items whose
properties are known is used to train the system to predict the category of items from an
unknown collection of items.

5. Prediction

Prediction is used to estimate unavailable data values or future trends such as spending. A value for an object can be anticipated based on the attribute values of the object and the attribute values of the classes. The prediction may concern missing numerical values or increasing or decreasing trends in time-related information. There are primarily two types of predictions in data mining: numeric and class predictions.

○ Numeric predictions are made by creating a linear regression model that is based on
historical data. Prediction of numeric values helps businesses ramp up for a future
event that might impact the business positively or negatively.
○ Class predictions are used to fill in missing class information for products using a
training data set where the class for products is known.
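As a minimal, illustrative sketch of numeric prediction (assuming scikit-learn is available), the code below fits a linear regression model to a made-up table of work experience versus income and then predicts an unseen value; the numbers and variable names are hypothetical.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical data: years of experience -> annual income
experience = np.array([[1], [3], [5], [7], [10]])            # predictor attribute
income = np.array([30000, 42000, 55000, 66000, 85000])       # known numeric target

model = LinearRegression().fit(experience, income)

# Numeric prediction for an employee with 6 years of experience
print(model.predict(np.array([[6]])))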

6. Cluster Analysis

In image processing, pattern recognition and bioinformatics, clustering is a popular data mining functionality. It is similar to classification, but the classes are not predefined; data attributes represent the classes. Similar data are grouped together, with the difference being that a class label is not known. Clustering algorithms group data based on similar features and dissimilarities.

7. Outlier Analysis

Outlier analysis is important to understand the quality of data. If there are too many outliers, you cannot trust the data or draw patterns from it. An outlier analysis determines whether there is something out of the ordinary in the data and whether it indicates a situation that a business needs to consider and take measures to mitigate. Data that cannot be grouped into any class by the algorithms is flagged by outlier analysis.

8. Evolution and Deviation Analysis


Evolution Analysis pertains to the study of data sets that change over time. Evolution analysis
models are designed to capture evolutionary trends in data helping to characterize, classify,
cluster or discriminate time-related data.

9. Correlation Analysis

Correlation is a mathematical technique for determining whether and how strongly two attributes are related to one another. It determines how well two numerically measured continuous variables are linked. Researchers can use this type of analysis to see if there are any possible correlations between variables in their study.
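A small Python sketch of correlation analysis is given below; the two columns and their values are invented purely to show how a Pearson correlation coefficient between two continuous attributes can be computed with pandas.

import pandas as pd

# Hypothetical continuous measurements of two attributes
df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50],
    "revenue":  [15, 24, 38, 41, 58],
})

# Pearson correlation: values close to +1 or -1 indicate a strong linear relationship
print(df["ad_spend"].corr(df["revenue"]))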

4.2 Classification of Data Mining Systems

Data mining refers to the process of extracting important information from raw data. It analyses the data patterns in huge data sets with the help of software. Ever since its development, data mining has been adopted by researchers in research and development work.

With data mining, businesses are able to gain more profit. It has helped not only in understanding customer demand but also in developing effective strategies to improve overall business turnover, and in determining business objectives for making clear decisions.

Data collection and data warehousing, and computer processing are some of the strongest
pillars of data mining. Data mining utilizes the concept of mathematical algorithms to segment
the data and assess the possibility of occurrence of future events.

To understand the system and meet the desired requirements, data mining can be classified
into the following systems:

○ Classification based on the mined Databases


○ Classification based on the type of mined knowledge
○ Classification based on statistics
○ Classification based on Machine Learning
○ Classification based on visualization
○ Classification based on Information Science
○ Classification based on utilized techniques
○ Classification based on adapted applications

Classification Based on the mined Databases


A data mining system can be classified based on the types of databases that have been
mined. A database system can be further segmented based on distinct principles, such as
data models, types of data, etc., which further assist in classifying a data mining system.

For example, if we want to classify a database based on the data model, we need to select
either relational, transactional, object-relational or data warehouse mining systems.

Classification Based on the type of Knowledge Mined

A data mining system categorized based on the kind of knowledge mined may have the following functionalities:

1. Characterization
2. Discrimination
3. Association and Correlation Analysis
4. Classification
5. Prediction
6. Outlier Analysis
7. Evolution Analysis

Classification Based on the Techniques Utilized

A data mining system can also be classified based on the type of techniques that are incorporated. These techniques can be assessed based on the degree of user interaction involved or the methods of analysis employed.

Classification Based on the Applications Adapted

Data mining systems classified based on the applications adapted are as follows:

1. Finance
2. Telecommunications
3. DNA
4. Stock Markets
5. E-mail

Examples of Classification Task

Following are some of the main examples of classification tasks:


○ Classification helps in determining tumor cells as benign or malignant.
○ Classification of credit card transactions as fraudulent or legitimate.
○ Classification of secondary structures of protein as alpha-helix, beta-sheet, or random
coil.
○ Classification of news stories into distinct categories such as finance, weather,
entertainment, sports, etc.

Integration schemes of Database and Data warehouse systems

No Coupling

In the no coupling schema, the data mining system does not use any database or data warehouse system functions.

Loose Coupling

In loose coupling, data mining utilizes some of the database or data warehouse system functionalities. It mainly fetches the data from the data repository managed by these systems and then performs data mining. The results are kept either in a file or in a designated place in the database or data warehouse.

Semi-Tight Coupling

In semi-tight coupling, data mining is linked to either the DB or DW system and provides an
efficient implementation of data mining primitives within the database.

Tight Coupling

A data mining system can be effortlessly combined with a database or data warehouse
system in tight coupling.

4.3 Basic Data Mining task,

Introduction to Data Mining Tasks

The data mining tasks can be classified generally into two types based on what a specific task
tries to achieve. Those two categories are descriptive tasks and predictive tasks. The
descriptive data mining tasks characterize the general properties of data whereas predictive
data mining tasks perform inference on the available data set to predict how a new data set will
behave.
Different Data Mining Tasks

There are a number of data mining tasks such as classification, prediction, time-series analysis, association, clustering, summarization etc. All these tasks are either predictive data mining tasks or descriptive data mining tasks. A data mining system can execute one or more of the above specified tasks as part of data mining.

Predictive data mining tasks come up with a model from the available data set that is helpful in predicting unknown or future values of another data set of interest. A medical practitioner trying to diagnose a disease based on the medical test results of a patient can be considered as a predictive data mining task. Descriptive data mining tasks usually find data-describing patterns and come up with new, significant information from the available data set. A retailer trying to identify products that are purchased together can be considered as a descriptive data mining task.

a) Classification
Classification derives a model to determine the class of an object based on its attributes. A collection of records will be available, each record with a set of attributes. One of the attributes will be the class attribute, and the goal of the classification task is to assign a class attribute to a new set of records as accurately as possible.
Classification can be used in direct marketing, that is to reduce marketing costs by targeting a
set of customers who are likely to buy a new product. Using the available data, it is possible to
know which customers purchased similar products and who did not purchase in the past.
Hence, {purchase, don’t purchase} decision forms the class attribute in this case. Once the
class attribute is assigned, demographic and lifestyle information of customers who purchased
similar products can be collected and promotion mails can be sent to them directly.

b) Prediction
Prediction task predicts the possible values of missing or future data. Prediction involves
developing a model based on the available data and this model is used in predicting future
values of a new data set of interest. For example, a model can predict the income of an
employee based on education, experience and other demographic factors like place of stay,
gender etc. Also prediction analysis is used in different areas including medical diagnosis,
fraud detection etc.

c) Time - Series Analysis


Time series is a sequence of events where the next event is determined by one or more of the
preceding events. Time series reflects the process being measured and there are certain
components that affect the behavior of a process. Time series analysis includes methods to
analyze time-series data in order to extract useful patterns, trends, rules and statistics. Stock
market prediction is an important application of time- series analysis.

d) Association
Association discovers the association or connection among a set of items. Association
identifies the relationships between objects. Association analysis is used for commodity
management, advertising, catalog design, direct marketing etc. A retailer can identify the
products that normally customers purchase together or even find the customers who respond
to the promotion of the same kind of products. If a retailer finds that beer and nappies are bought together frequently, he can put nappies on sale to promote the sale of beer.

e) Clustering
Clustering is used to identify data objects that are similar to one another. The similarity can be
decided based on a number of factors like purchase behavior, responsiveness to certain
actions, geographical locations and so on. For example, an insurance company can cluster its
customers based on age, residence, income etc. This group information will be helpful to
understand the customers better and hence provide better customized services.
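Continuing the insurance example in spirit, the sketch below groups made-up customers by age and income with k-means, assuming scikit-learn is available; the data, the attribute names and the choice of two clusters are illustrative assumptions rather than part of the original example.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers described by (age, annual income in thousands)
customers = np.array([
    [22, 25], [25, 30], [27, 28],    # younger, lower-income customers
    [45, 80], [50, 90], [48, 85],    # older, higher-income customers
])

# Group similar customers together; no class labels are given in advance
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)            # cluster assigned to each customer
print(kmeans.cluster_centers_)   # centre of each customer group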

f) Summarization
Summarization is the generalization of data. A set of relevant data is summarized which result
in a smaller set that gives aggregated information of the data. For example, the shopping done
by a customer can be summarized into total products, total spending, offers used, etc. Such
high level summarized information can be useful for sales or customer relationship team for
detailed customer and purchase behavior analysis. Data can be summarized in different
abstraction levels and from different angles.
Summary

Different data mining tasks are the core of data mining process. Different prediction and
classification data mining tasks actually extract the required information from the available data
sets.

Data Mining Issues

Data mining is not an easy task, as the algorithms used can get very complex and data is not
always available at one place. It needs to be integrated from various heterogeneous data
sources. These factors also create some issues. Here in this tutorial, we will discuss the major
issues regarding −

​ Mining Methodology and User Interaction


​ Performance Issues
​ Diverse Data Types Issues

The following diagram describes the major issues.

Mining Methodology and User Interaction Issues

It refers to the following kinds of issues −

​ Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.
​ Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to be interactive because it allows users to focus the search for patterns, providing and refining data mining requests based on the returned results.
​ Incorporation of background knowledge − Background knowledge can be used to guide the discovery process and to express the discovered patterns. It may be used to express the discovered patterns not only in concise terms but at multiple levels of abstraction.
​ Data mining query languages and ad hoc data mining − Data Mining Query language
that allows the user to describe ad hoc mining tasks, should be integrated with a data
warehouse query language and optimized for efficient and flexible data mining.
​ Presentation and visualization of data mining results − Once the patterns are discovered
it needs to be expressed in high level languages, and visual representations. These
representations should be easily understandable.
​ Handling noisy or incomplete data − The data cleaning methods are required to handle
the noise and incomplete objects while mining the data regularities. If the data cleaning
methods are not there then the accuracy of the discovered patterns will be poor.
​ Pattern evaluation − The patterns discovered should be interesting; patterns are uninteresting if they either represent common knowledge or lack novelty.

Performance Issues
There can be performance-related issues, such as the following −

​ Efficiency and scalability of data mining algorithms − In order to effectively extract information from the huge amount of data in databases, data mining algorithms must be efficient and scalable.
​ Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel, and the results from the partitions are then merged. Incremental algorithms update databases without mining the data again from scratch.

Diverse Data Types Issues

​ Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data etc. It is not possible for one system to mine all these kinds of data.
​ Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured or unstructured. Therefore mining knowledge from them adds challenges to data mining.
Unit-V: Association Rule Mining

5.1 Efficient and Scalable Frequent Item set Mining Methods

INTRODUCTION:

1. Frequent item sets are a fundamental concept in association rule mining, which is a technique used in data mining to discover relationships between items in a dataset. The goal of association rule mining is to identify relationships between items in a dataset that frequently occur together.
2. A frequent item set is a set of items that occur together frequently in a dataset. The
frequency of an item set is measured by the support count, which is the number of
transactions or records in the dataset that contain the item set. For example, if a
dataset contains 100 transactions and the item set {milk, bread} appears in 20 of
those transactions, the support count for {milk, bread} is 20.
3. Association rule mining algorithms, such as Apriori or FP-Growth, are used to find
frequent item sets and generate association rules. These algorithms work by
iteratively generating candidate item sets and pruning those that do not meet the
minimum support threshold. Once the frequent item sets are found, association rules
can be generated by using the concept of confidence, which is the ratio of the
number of transactions that contain the item set and the number of transactions that
contain the antecedent (left-hand side) of the rule.
4. Frequent item sets and association rules can be used for a variety of tasks such as
market basket analysis, cross-selling and recommendation systems. However, it
should be noted that association rule mining can generate a large number of rules,
many of which may be irrelevant or uninteresting. Therefore, it is important to use
appropriate measures such as lift and conviction to evaluate the interestingness of
the generated rules.

Association Mining searches for frequent items in the data set. In frequent mining usually,
interesting associations and correlations between item sets in transactional and relational
databases are found. In short, Frequent Mining shows which items appear together in a
transaction or relationship.
Need of Association Mining: Frequent mining is the generation of association rules from a transactional dataset. If two items X and Y are purchased together frequently, then it is good to put them together in stores or to provide a discount offer on one item on purchase of the other. This can really increase sales. For example, it is likely that if a customer buys milk and bread he/she also buys butter, giving the association rule {milk, bread} => {butter}. So the seller can suggest that the customer buy butter if he/she buys milk and bread.

Important Definitions :
● Support : It is one of the measures of interestingness. This tells about the usefulness
and certainty of rules. 5% Support means total 5% of transactions in the database
follow the rule.

Support(A -> B) = Support_count(A ∪ B) / Total number of transactions


● Confidence: A confidence of 60% means that 60% of the customers who purchased
a milk and bread also bought butter.

Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)


If a rule satisfies both minimum support and minimum confidence, it is a strong rule.
● Support_count(X): Number of transactions in which X appears. If X is A union B then
it is the number of transactions in which A and B both are present.
● Maximal Itemset: An itemset is maximal frequent if none of its supersets are
frequent.
● Closed Itemset: An itemset is closed if none of its immediate supersets has the same support count as the itemset.
● K- Itemset: Itemset which contains K items is a K-itemset. So it can be said that an
itemset is frequent if the corresponding support count is greater than the minimum
support count.
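The following minimal Python sketch computes the support and confidence defined above for the rule {milk, bread} => {butter}; the five transactions are invented for illustration only.

# Made-up transactional dataset
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
]

def support_count(itemset):
    # number of transactions that contain every item of the itemset
    return sum(1 for t in transactions if itemset <= t)

antecedent = {"milk", "bread"}
rule_items = antecedent | {"butter"}          # A ∪ B

support = support_count(rule_items) / len(transactions)             # Support(A -> B)
confidence = support_count(rule_items) / support_count(antecedent)  # Confidence(A -> B)
print(support, confidence)    # 0.4 and about 0.67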

Example on finding frequent itemsets – Consider the given dataset with the given transactions.

● Let's say the minimum support count is 3
● The relation that holds is: maximal frequent => closed => frequent

1-frequent itemsets:
{A} = 3  // not closed due to {A, C}, and not maximal
{B} = 4  // not closed due to {B, D}, and not maximal
{C} = 4  // not closed due to {C, D}, and not maximal
{D} = 5  // closed, since no immediate superset has the same count; not maximal

2-frequent itemsets:
{A, B} = 2  // not frequent because support count < minimum support count, so ignore
{A, C} = 3  // not closed due to {A, C, D}
{A, D} = 3  // not closed due to {A, C, D}
{B, C} = 3  // not closed due to {B, C, D}
{B, D} = 4  // closed, but not maximal due to {B, C, D}
{C, D} = 4  // closed, but not maximal due to {B, C, D}

3-frequent itemsets:
{A, B, C} = 2  // ignore, not frequent because support count < minimum support count
{A, B, D} = 2  // ignore, not frequent because support count < minimum support count
{A, C, D} = 3  // maximal frequent
{B, C, D} = 3  // maximal frequent

4-itemsets:
{A, B, C, D} = 2  // ignore, not frequent
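The transaction table referred to in this example is not reproduced above, so the sketch below uses one hypothetical set of five transactions that is consistent with the support counts listed; it then enumerates the frequent, closed and maximal frequent itemsets for a minimum support count of 3.

from itertools import combinations

# Hypothetical transactions chosen to reproduce the support counts in the example
transactions = [
    {"A", "B", "C", "D"},
    {"A", "B", "C", "D"},
    {"A", "C", "D"},
    {"B", "C", "D"},
    {"B", "D"},
]
items = sorted(set().union(*transactions))
min_support = 3

def count(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Support count of every candidate itemset
support = {frozenset(c): count(set(c))
           for k in range(1, len(items) + 1)
           for c in combinations(items, k)}

frequent = {s: c for s, c in support.items() if c >= min_support}
# closed: no superset has the same support count
closed = {s: c for s, c in frequent.items()
          if not any(s < t and support[t] == c for t in support)}
# maximal: no superset is frequent
maximal = {s: c for s, c in frequent.items()
           if not any(s < t for t in frequent)}

print("frequent:", {tuple(sorted(s)): c for s, c in frequent.items()})
print("closed:  ", {tuple(sorted(s)): c for s, c in closed.items()})
print("maximal: ", {tuple(sorted(s)): c for s, c in maximal.items()})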

ADVANTAGES OR DISADVANTAGES:

Advantages of using frequent item sets and association rule mining include:

1. Efficient discovery of patterns: Association rule mining algorithms are efficient at


discovering patterns in large datasets, making them useful for tasks such as market
basket analysis and recommendation systems.
2. Easy to interpret: The results of association rule mining are easy to understand and
interpret, making it possible to explain the patterns found in the data.
3. Can be used in a wide range of applications: Association rule mining can be used in
a wide range of applications such as retail, finance, and healthcare, which can help
to improve decision-making and increase revenue.
4. Handling large datasets: These algorithms can handle large datasets with many
items and transactions, which makes them suitable for big-data scenarios.

Disadvantages of using frequent item sets and association rule mining include:

1. Large number of generated rules: Association rule mining can generate a large
number of rules, many of which may be irrelevant or uninteresting, which can make
it difficult to identify the most important patterns.
2. Limited in detecting complex relationships: Association rule mining is limited in its
ability to detect complex relationships between items, and it only considers the
co-occurrence of items in the same transaction.
3. Can be computationally expensive: As the number of items and transactions
increases, the number of candidate item sets also increases, which can make the
algorithm computationally expensive.
4. Need to define the minimum support and confidence threshold: The minimum
support and confidence threshold must be set before the association rule mining
process, which can be difficult and requires a good understanding of the data.

5.2 Mining Various Kinds of Association Rules - Mining multilevel association rules, Mining multidimensional association rules (Association Mining to Correlation Analysis, Constraint-Based Association Mining)

In this article, we will discuss concepts of Multilevel Association Rule mining and its algorithms,
applications, and challenges.

Data mining is the process of extracting hidden patterns from large data sets. One of the
fundamental techniques in data mining is association rule mining. To identify relationships
between items in a dataset, Association rule mining is used. These relationships can then be
used to make predictions about future occurrences of those items.

Multilevel Association Rule mining is an extension of Association Rule mining. Multilevel Association Rule mining is a powerful tool that can be used to discover patterns and trends.

Association Rule in data mining

Association rule mining is used to discover relationships between items in a dataset. An association rule is a statement of the form "If A, then B," where A and B are sets of items. The strength of an association rule is measured using two measures: support and confidence. Support measures the frequency of the occurrence of the items in the rule, and confidence measures the reliability of the rule.
Apriori algorithm is a popular algorithm for mining association rules. It is an iterative algorithm
that works by generating candidate itemsets and pruning those that do not meet the support
and confidence thresholds.

Multilevel Association Rule in data mining

Multilevel Association Rule mining is a technique that extends Association Rule mining to
discover relationships between items at different levels of granularity. Multilevel Association
Rule mining can be classified into two types: multi-dimensional Association Rule and
multi-level Association Rule.

Multi-dimensional Association Rule mining

This is used to find relationships between items in different dimensions of a dataset. For
example, in a sales dataset, multi-dimensional Association Rule mining can be used to find
relationships between products, regions, and time.

Multi-level Association Rule mining

This is used to find relationships between items at different levels of granularity. For example,
in a retail dataset, multi-level Association Rule mining can be used to find relationships
between individual items and categories of items.

Needs of Multidimensional Rule

Multidimensional rule mining is important because data at lower levels may not exhibit any
meaningful patterns, yet it can contain valuable insights. The goal is to find such hidden
information within and across levels of abstraction.

Algorithms for Multilevel Association Rule Mining

There are several algorithms for Multilevel Association Rule mining, including partition-based,
agglomerative, and hybrid approaches.

Partition-based algorithms divide the data into partitions based on some criteria, such as the
level of granularity, and then mine Association Rules within each partition. Agglomerative
algorithms start with the smallest itemsets and then gradually merge them into larger itemsets,
until a set of rules is obtained. Hybrid algorithms combine the strengths of partition-based and
agglomerative approaches.

Approaches to Multilevel Association rule mining

Multilevel Association Rule mining has different approaches to finding relationships between
items at different levels of granularity. There are three approaches: Uniform Support, Reduced
Support, and Group-based Support. These are explained as follows below in brief.
Uniform Support (using uniform minimum support for all levels)

where only one minimum support threshold is used for all levels. This approach is simple but
may miss meaningful associations at low levels.

Reduced Support (using reduced minimum support at lower levels)

where the minimum support threshold is lowered at lower levels to avoid missing important
associations. This approach uses different search techniques, such as Level-by-Level
independence and Level-cross separating by single item or K-itemset.

Group-based Support (using item or group based support)

where the user or expert sets the support and confidence threshold based on a specific group
or product category.

For example, if an expert wants to study the purchase patterns of laptops and clothes in the
non-electronic category, a low support threshold can be set for this group to give attention to
these items' purchase patterns.

Applications of Multilevel Association Rule in data mining

These are some application as follows

Retail Sales Analysis

Multilevel Association Rule mining helps retailers gain insights into customer buying behavior
and preferences, optimize product placement and pricing, and improve supply chain
management.

Healthcare Management

Multilevel Association Rule mining helps healthcare providers identify patterns in patient
behavior, diagnose diseases, identify high-risk patients, and optimize treatment plans.

Fraud Detection

Multilevel Association Rule mining helps companies identify fraudulent patterns, detect
anomalies, and prevent fraud in various industries such as finance, insurance, and
telecommunications.

Web Usage Mining

Multilevel Association Rule mining helps web-based companies gain insights into user
preferences, optimize website design and layout, and personalize content for individual users
by analyzing data at different levels of abstraction.
Social Network Analysis

Multilevel Association Rule mining helps social network providers identify influential users,
detect communities, and optimize network structure and design by analyzing social network
data at different levels of abstraction.

Challenges in Multilevel Association Rule Mining

Multilevel Association Rule mining poses several challenges, including high dimensionality,
large data set size, and scalability issues.

High dimensionality

It is the problem of dealing with data sets that have a large number of attributes.

Large data set size

It is the problem of dealing with data sets that have a large number of records.

Scalability

It is the problem of dealing with data sets that are too large to fit into memory.

Conclusion

Multilevel Association Rule mining is a powerful technique that can be used to identify
relationships between items at different levels of granularity. It is an extension of Association
Rule mining that can discover patterns and trends that would otherwise be missed. Multilevel
Association Rule mining has several applications, including market basket analysis, medical
data analysis, and web usage mining.

However, Multilevel Association Rule mining also poses several challenges, including high
dimensionality, large data set size, and scalability issues. Future research directions in
Multilevel Association Rule mining include developing more efficient algorithms and addressing
these challenges.

In conclusion, Multilevel Association Rule mining is a powerful technique that can be used to
discover relationships between items at different levels of granularity. It has several
applications in various fields, but it also poses several challenges. As data sets continue to
grow in size and complexity, Multilevel Association Rule mining will become an increasingly
important tool for discovering hidden patterns in large data sets.
Unit-VI: Classification and Prediction

6.1 Issues Regarding Classification and Prediction:

There are two forms of data analysis that can be used for extracting models describing
important classes or to predict future data trends. These two forms are as follows −

​ Classification
​ Prediction

Classification models predict categorical class labels; and prediction models predict continuous
valued functions. For example, we can build a classification model to categorize bank loan
applications as either safe or risky, or a prediction model to predict the expenditures in dollars
of potential customers on computer equipment given their income and occupation.

What is classification?

Following are the examples of cases where the data analysis task is Classification −

​ A bank loan officer wants to analyze the data in order to know which customer (loan
applicant) are risky or which are safe.
​ A marketing manager at a company needs to analyze a customer with a given profile,
who will buy a new computer.

In both of the above examples, a model or classifier is constructed to predict the categorical
labels. These labels are risky or safe for loan application data and yes or no for marketing
data.

What is prediction?

Following are the examples of cases where the data analysis task is Prediction −

Suppose the marketing manager needs to predict how much a given customer will spend
during a sale at his company. In this example we are bothered to predict a numeric value.
Therefore the data analysis task is an example of numeric prediction. In this case, a model or a
predictor will be constructed that predicts a continuous-valued-function or ordered value.

Note − Regression analysis is a statistical methodology that is most often used for numeric
prediction.

How Does Classification Works?

With the help of the bank loan application that we have discussed above, let us understand the
working of classification. The Data Classification process includes two steps −

​ Building the Classifier or Model


​ Using Classifier for Classification

Building the Classifier or Model

​ This step is the learning step or the learning phase.
​ In this step the classification algorithms build the classifier.
​ The classifier is built from the training set made up of database tuples and their associated class labels.
​ Each tuple that constitutes the training set belongs to a predefined category or class. These tuples can also be referred to as samples, objects or data points.

Using Classifier for Classification

In this step, the classifier is used for classification. Here the test data is used to estimate the accuracy of the classification rules. The classification rules can be applied to the new data tuples if the accuracy is considered acceptable.

Classification and Prediction Issues

The major issue is preparing the data for Classification and Prediction. Preparing the data involves the following activities −

​ Data Cleaning − Data cleaning involves removing the noise and treatment of missing values. The noise is removed by applying smoothing techniques and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
​ Relevance Analysis − Database may also have the irrelevant attributes. Correlation
analysis is used to know whether any two given attributes are related.
​ Data Transformation and reduction − The data can be transformed by any of the
following methods.
​ Normalization − The data is transformed using normalization. Normalization
involves scaling all values for given attribute in order to make them fall within a
small specified range. Normalization is used when in the learning step, the neural
networks or the methods involving measurements are used.
​ Generalization − The data can also be transformed by generalizing it to the
higher concept. For this purpose we can use the concept hierarchies.

Note − Data can also be reduced by some other methods such as wavelet transformation,
binning, histogram analysis, and clustering.

Comparison of Classification and Prediction Methods

Here are the criteria for comparing the methods of Classification and Prediction −

​ Accuracy − Accuracy of a classifier refers to its ability to predict the class label correctly, and the accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new data.
​ Speed − This refers to the computational cost in generating and using the classifier or predictor.
​ Robustness − It refers to the ability of the classifier or predictor to make correct predictions from given noisy data.
​ Scalability − Scalability refers to the ability to construct the classifier or predictor efficiently given a large amount of data.
​ Interpretability − It refers to the extent to which the classifier or predictor can be understood.

6.2 Classification by Decision Tree Introduction,

○ Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems. It
is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
○ In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node.
Decision nodes are used to make any decision and have multiple branches, whereas Leaf
nodes are the output of those decisions and do not contain any further branches.
○ The decisions or the test are performed on the basis of features of the given dataset.
○ It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
○ It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
○ In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
○ A decision tree simply asks a question, and based on the answer (Yes/No), it further split
the tree into subtrees.
○ Below diagram explains the general structure of a decision tree:

Why use Decision Trees?

There are various algorithms in Machine learning, so choosing the best algorithm for the given dataset and problem is the main point to remember while creating a machine learning model. Below are the two reasons for using the Decision tree:

○ Decision Trees usually mimic human thinking ability while making a decision, so it is easy to understand.
○ The logic behind the decision tree can be easily understood because it shows a tree-like structure.

Decision Tree Terminologies


Root Node: Root node is from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.

Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further
after getting a leaf node.

Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.

Branch/Sub Tree: A tree formed by splitting the tree.

Pruning: Pruning is the process of removing the unwanted branches from the tree.

Parent/Child node: The root node of the tree is called the parent node, and other nodes are
called the child nodes.

How does the Decision Tree algorithm Work?

In a decision tree, for predicting the class of the given dataset, the algorithm starts from the
root node of the tree. This algorithm compares the values of root attribute with the record (real
dataset) attribute and, based on the comparison, follows the branch and jumps to the next
node.
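As an illustrative sketch only (assuming scikit-learn, whose DecisionTreeClassifier implements an optimized CART-style algorithm), a decision tree can be trained and its learned if/else rules inspected as follows; the dataset, depth limit and train/test split are arbitrary choices.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Train a small decision tree on the iris dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print(tree.score(X_test, y_test))   # accuracy on unseen records
print(export_text(tree, feature_names=load_iris().feature_names))   # the learned decision rules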

Bayesian Classification:

Data Mining Bayesian Classifiers

In numerous applications, the connection between the attribute set and the class variable is non-deterministic. In other words, the class label of a test record cannot be assumed with certainty even though its attribute set is the same as in some of the training examples. These circumstances may emerge due to noisy data or the presence of certain confounding factors that influence classification but are not included in the analysis. For example, consider the task of predicting whether an individual is at risk of liver illness based on the individual's eating habits and working efficiency. Although most people who eat healthily and exercise consistently have a lower probability of liver disease, they may still develop it due to other factors, for example the consumption of high-calorie street food or alcohol abuse. Determining whether an individual's eating routine is healthy or the workout efficiency is sufficient is also subject to interpretation, which in turn may introduce uncertainties into the learning problem.

Bayesian classification uses Bayes' theorem to predict the probability of an event. Bayesian classifiers are statistical classifiers based on the Bayesian understanding of probability: the theorem expresses how a level of belief, expressed as a probability, should be updated in the light of evidence.

Bayes' theorem is named after Thomas Bayes, who first used conditional probability to provide an algorithm that uses evidence to calculate limits on an unknown parameter.
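In its simplest form, Bayes' theorem can be written as P(H | X) = P(X | H) * P(H) / P(X), where H is a class hypothesis and X is the observed attribute set. As a brief, illustrative sketch (assuming scikit-learn), a naive Bayesian classifier, which applies this theorem under the simplifying assumption that attributes are conditionally independent given the class, can be built as follows; the dataset and split are arbitrary choices.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Gaussian naive Bayes: estimates P(X | H) per class and applies Bayes' theorem
nb = GaussianNB().fit(X_train, y_train)

print(nb.score(X_test, y_test))        # classification accuracy
print(nb.predict_proba(X_test[:1]))    # posterior probabilities P(class | attributes)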

6.3 Rule Based Classification,

Rule-based classifiers are just another type of classifier which makes the class decision
depending by using various “if..else” rules. These rules are easily interpretable and thus
these classifiers are generally used to generate descriptive models. The condition used
with “if” is called the antecedent and the predicted class of each rule is called the
consequent. Properties of rule-based classifiers:

● Coverage: The percentage of records which satisfy the antecedent conditions of a particular rule.
● The rules generated by the rule-based classifiers are generally not mutually exclusive, i.e. many rules can cover the same record.
● The rules generated by the rule-based classifiers may not be exhaustive, i.e. there may be some records which are not covered by any of the rules.
● The decision boundaries created by them are linear, but they can be much more complex than those of a decision tree because many rules can be triggered for the same record.

An obvious question that comes to mind, after knowing that the rules are not mutually exclusive, is how the class should be decided when different rules with different consequents cover the same record. There are two solutions to this problem:

● Either the rules can be ordered, i.e. the class corresponding to the highest-priority rule triggered is taken as the final class.
● Otherwise, we can assign votes for each class depending on their weights, i.e. the rules remain unordered.

Example: Below is the dataset to classify mushrooms as edible or poisonous:

Class | Cap Shape | Cap Surface | Bruises | Odour | Stalk Shape | Population | Habitat
edible | flat | scaly | yes | anise | tapering | scattered | grasses
poisonous | convex | scaly | yes | pungent | enlargening | several | grasses
edible | convex | smooth | yes | almond | enlargening | numerous | grasses
edible | convex | scaly | yes | almond | tapering | scattered | meadows
edible | flat | fibrous | yes | anise | enlargening | several | woods
edible | flat | fibrous | no | none | enlargening | several | urban
poisonous | conical | scaly | yes | pungent | enlargening | scattered | urban
edible | flat | smooth | yes | anise | enlargening | numerous | meadows
poisonous | convex | smooth | yes | pungent | enlargening | several | urban

Rules:

● Odour = pungent and Habitat = urban -> Class = poisonous
● Bruises = yes -> Class = edible (this rule covers both negative and positive records)

The given rules are not mutually exclusive.

How to generate a rule:

Sequential Rule Generation

Rules can be generated using either a general-to-specific approach or a specific-to-general approach. In the general-to-specific approach, we start with a rule with no antecedent and keep on adding conditions to it until we see major improvements in our evaluation metric. In the other approach, we keep on removing conditions from a rule covering a very specific case. The evaluation metric can be accuracy, information gain, likelihood ratio, etc.

Algorithm for generating the model incrementally: The algorithm given below generates a model with unordered rules and ordered classes, i.e. we can decide which class to give priority to while generating the rules.

A <- Set of attributes
T <- Set of training records
Y <- Set of classes
Y' <- Y ordered according to relevance
R <- Set of rules generated, initially an empty list
for each class y in Y'
    while the majority of class y records are not covered
        generate a new rule for class y, using the methods given above
        Add this rule to R
        Remove the records covered by this rule from T
    end while
end for
Add rule {} -> y' where y' is the default class

Classifying a record: The classification algorithm described below assumes that the rules are unordered and the classes are weighted.

R <- Set of rules generated using the training set
T <- Test record
W <- class name to weight mapping, predefined, given as input
F <- class name to vote mapping, generated for each test record, to be calculated
for each rule r in R
    check if r covers T
    if so then add W of predicted_class to F of predicted_class
end for
Output the class with the highest calculated vote in F
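A minimal Python sketch of this weighted-vote scheme, built around the two mushroom rules above, is given below; the class weights, the default class and the test record are hypothetical values chosen only to illustrate the algorithm.

# Unordered rules: (antecedent conditions, consequent class)
rules = [
    ({"odour": "pungent", "habitat": "urban"}, "poisonous"),
    ({"bruises": "yes"}, "edible"),
]
weights = {"poisonous": 2.0, "edible": 1.0}   # hypothetical class-to-weight mapping (W)
default_class = "edible"                      # used when no rule covers the record

def classify(record):
    votes = {}                                # class-to-vote mapping (F)
    for antecedent, consequent in rules:
        # a rule covers the record if every antecedent condition is satisfied
        if all(record.get(attr) == value for attr, value in antecedent.items()):
            votes[consequent] = votes.get(consequent, 0.0) + weights[consequent]
    # fall back to the default rule {} -> default_class when no rule fires
    return max(votes, key=votes.get) if votes else default_class

record = {"odour": "pungent", "habitat": "urban", "bruises": "yes"}
print(classify(record))   # both rules fire; the class with the higher total weight wins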

6.4 Support Vector Machines:

Introduction to SVM

Support vector machines (SVMs) are powerful yet flexible supervised machine learning algorithms used for both classification and regression, but generally they are applied to classification problems. SVMs were first introduced in the 1960s and later refined in the 1990s. SVMs have their own unique way of implementation compared to other machine learning algorithms. Lately, they have become extremely popular because of their ability to handle multiple continuous and categorical variables.

Working of SVM

An SVM model is basically a representation of different classes in a hyperplane in multidimensional space. The hyperplane will be generated in an iterative manner by SVM so that the error can be minimized. The goal of SVM is to divide the datasets into classes to find a maximum marginal hyperplane (MMH).

The following are important concepts in SVM −

​ Support Vectors − Data points that are closest to the hyperplane are called support vectors. The separating line will be defined with the help of these data points.
​ Hyperplane − As we can see in the above diagram, it is a decision plane or space which is divided between a set of objects having different classes.
​ Margin − It may be defined as the gap between two lines on the closest data points of different classes. It can be calculated as the perpendicular distance from the line to the support vectors. A large margin is considered a good margin and a small margin is considered a bad margin.

The main goal of SVM is to divide the datasets into classes to find a maximum marginal
hyperplane (MMH) and it can be done in the following two steps −

​ First, SVM will generate hyperplanes iteratively that segregates the classes in best way.
​ Then, it will choose the hyperplane that separates the classes correctly.

Implementing SVM in Python

For implementing SVM in Python we start with the standard library imports; a complete example using scikit-learn is given in the Example section below.

SVM Kernels

In practice, SVM algorithm is implemented with kernel that transforms an input data space into
the required form. SVM uses a technique called the kernel trick in which kernel takes a low
dimensional input space and transforms it into a higher dimensional space. In simple words,
kernel converts non-separable problems into separable problems by adding more dimensions
to it. It makes SVM more powerful, flexible and accurate. The following are some of the types
of kernels used by SVM.

Linear Kernel

It can be used as a dot product between any two observations. The formula of the linear kernel is as below −

K(x, xi) = sum(x * xi)

From the formula, we can see that the product between two vectors, say x and xi, is the sum of the multiplication of each pair of input values.

Polynomial Kernel

It is a more generalized form of the linear kernel and can distinguish curved or nonlinear input spaces. Following is the formula for the polynomial kernel −

K(x, xi) = 1 + sum(x * xi)^d

Here d is the degree of the polynomial, which we need to specify manually in the learning algorithm.

Radial Basis Function (RBF) Kernel

RBF kernel, mostly used in SVM classification, maps the input space into an infinite-dimensional space. The following formula explains it mathematically −

K(x, xi) = exp(-gamma * sum((x - xi)^2))

Here, gamma ranges from 0 to 1. We need to manually specify it in the learning algorithm. A good default value of gamma is 0.1.

As we implemented SVM for linearly separable data, we can implement it in Python for the
data that is not linearly separable. It can be done by using kernels.

Example

The following is an example for creating an SVM classifier by using kernels. We will be using
iris dataset from scikit-learn −

We will start by importing following packages −

import pandas as pd

import numpy as np

from sklearn import svm, datasets

import matplotlib.pyplot as plt

Now, we need to load the input data −

iris = datasets.load_iris()

From this dataset, we are taking first two features as follows −

X = iris.data[:, :2]

y = iris.target

Next, we will plot the SVM boundaries with original data as follows −

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1

y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

h = (x_max - x_min) / 100   # step size of the mesh used to plot the decision regions

xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

X_plot = np.c_[xx.ravel(), yy.ravel()]

Now, we need to provide the value of regularization parameter as follows −


C = 1.0

Next, SVM classifier object can be created as follows −

svc_classifier = svm.SVC(kernel='linear', C=C).fit(X, y)

Z = svc_classifier.predict(X_plot)

Z = Z.reshape(xx.shape)

plt.figure(figsize=(15, 5))

plt.subplot(121)

plt.contourf(xx, yy, Z, cmap=plt.cm.tab10, alpha=0.3)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1)

plt.xlabel('Sepal length')

plt.ylabel('Sepal width')

plt.xlim(xx.min(), xx.max())

plt.title('Support Vector Classifier with linear kernel')

Output

Text(0.5, 1.0, 'Support Vector Classifier with linear kernel')

For creating SVM classifier with rbf kernel, we can change the kernel to rbf as follows −
svc_classifier = svm.SVC(kernel='rbf', gamma='auto', C=C).fit(X, y)

Z = svc_classifier.predict(X_plot)

Z = Z.reshape(xx.shape)

plt.figure(figsize=(15, 5))

plt.subplot(121)

plt.contourf(xx, yy, Z, cmap = plt.cm.tab10, alpha = 0.3)

plt.scatter(X[:, 0], X[:, 1], c = y, cmap = plt.cm.Set1)

plt.xlabel('Sepal length')

plt.ylabel('Sepal width')

plt.xlim(xx.min(), xx.max())

plt.title('Support Vector Classifier with rbf kernel')

Output

Text(0.5, 1.0, 'Support Vector Classifier with rbf kernel')

6.5 Associative Classification:

Data mining is the process of discovering and extracting hidden patterns from different types of
data to help decision-makers make decisions. Associative classification is a common
classification learning method in data mining, which applies association rule detection methods
and classification to create classification models.

Association Rule learning in Data Mining:

Association rule learning is a machine learning method for discovering interesting relationships
between variables in large databases. It is designed to detect strong rules in the database
based on some interesting metrics. For any given multi-item transaction, association rules aim
to obtain rules that determine how or why certain items are linked.
Association rules are created by searching the data for frequent if-then patterns and using two criteria, support and confidence, to identify the most important relationships. Support indicates how frequently an itemset appears in the data, while confidence is the proportion of times the if-then statement is found to be true. A third criterion, called lift, is often used to compare the actual confidence with the confidence expected if the two sides of the rule were independent; a lift greater than 1 suggests a genuine association. Association rules are computed over itemsets built from two or more items, and the rules that are kept are those well supported by the data. (A short sketch of computing these measures follows the list below.)
There are different types of data mining techniques that can be used for specific kinds of analysis, such as classification analysis, clustering analysis, and regression analysis. Association rules are mainly used to analyze and predict customer behavior.
● In classification analysis, they are mostly used to answer questions, make decisions, and predict
behavior.
● In clustering analysis, they are mainly used when no assumptions are made about
possible relationships in the data.
● In regression analysis, they are used when we want to predict a continuous (numeric)
dependent value from a set of independent variables.
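A minimal sketch of computing support, confidence, and lift for a single rule over a toy transaction set (the items and transactions below are entirely hypothetical):

# Toy transactions; the items are hypothetical
transactions = [
    {'bread', 'milk'},
    {'bread', 'butter'},
    {'bread', 'milk', 'butter'},
    {'milk'},
    {'bread', 'milk'},
]

def support(itemset):
    # fraction of transactions that contain the whole itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Rule: {bread} -> {milk}
antecedent, consequent = {'bread'}, {'milk'}
rule_support = support(antecedent | consequent)
confidence = rule_support / support(antecedent)
lift = confidence / support(consequent)

print(rule_support, confidence, lift)   # 0.6, 0.75, 0.9375 for this toy data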


6.6 Lazy Learners:

Lazy learning is a type of machine learning that doesn't process training data until it needs to
make a prediction. Instead of building models during training, lazy learning algorithms wait until
they encounter a new query. This method stores and compares training examples when
making predictions. It's also called instance-based or memory-based learning.

Lazy learning algorithms work by memorizing the training data rather than constructing a
general model. When a new query is received, lazy learning retrieves similar instances from
the training set and uses them to generate a prediction. The similarity between instances is
usually calculated using distance metrics, such as Euclidean distance or cosine similarity.

One of the most popular lazy learning algorithms is the k-nearest neighbors (k-NN) algorithm.
In k-NN, the k closest training instances to the query point are considered, and their class
labels are used to determine the class of the query. Lazy learning methods excel in situations
where the underlying data distribution is complex or where the training data is noisy.
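For illustration, here is a minimal k-NN classifier (the canonical lazy learner) on the iris data used earlier, assuming scikit-learn is available; the choice of k = 5 and the train/test split are illustrative only.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# "Training" only stores the instances; the real work happens at prediction time
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))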
What are the Benefits of Lazy Learning?

Lazy learning offers several advantages:

● Adaptability. Lazy learning algorithms can adapt quickly to new or changing data. Since
the learning process happens at prediction time, they can incorporate new instances
without requiring complete retraining of the model.
● Robustness to outliers. Lazy learning algorithms are less affected by outliers compared
to eager learning methods. Outliers have less influence on predictions because they are
not used during the learning phase.
● Flexibility. When it comes to handling complex data distributions and nonlinear
relationships, lazy learning algorithms are effective. They can capture intricate decision
boundaries by leveraging the information stored in the training instances.

6.7 Other Classification Methods

There are many classification techniques in machine learning beyond the ones already discussed. Here are a few more:

K-Nearest Neighbors (KNN): KNN is a simple and intuitive classification algorithm. It works by
finding the k-nearest data points in the feature space and assigning the class label based on
the majority class among its neighbors.

Gaussian Naive Bayes: Gaussian Naive Bayes is an extension of the Naive Bayes algorithm. It
assumes that the features follow a Gaussian (normal) distribution, and it estimates the
parameters (mean and variance) of each class to make predictions.
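A hedged sketch of Gaussian Naive Bayes on the iris data, assuming scikit-learn; scoring on the training data here is only to show the API, not a proper evaluation:

from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
# Fit estimates a per-class mean and variance for each feature
gnb = GaussianNB().fit(iris.data, iris.target)
print(gnb.predict(iris.data[:5]))
print(gnb.score(iris.data, iris.target))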

Quadratic Discriminant Analysis (QDA): QDA models each class with a multivariate Gaussian, much like Gaussian Naive Bayes, but it does not assume the features are independent within a class. Instead, it allows each class to have its own full covariance matrix, which produces quadratic decision boundaries.

Kernel Methods: Kernel methods, such as Kernel Support Vector Machines (SVMs) and Kernel
Logistic Regression, are used for nonlinear classification tasks. They transform the input data
into a higher-dimensional space using a kernel function, allowing them to learn complex
decision boundaries.
Ensemble Methods: Ensemble methods combine multiple base classifiers to improve the
overall classification performance. Examples include AdaBoost, Gradient Boosting Machines
(GBM), Random Forests, and Stacked Generalization (Stacking).
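A short sketch comparing two of these ensemble methods with scikit-learn; the estimator settings and 5-fold cross-validation are illustrative choices, not a recommendation:

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)
for model in (RandomForestClassifier(n_estimators=100, random_state=0),
              GradientBoostingClassifier(random_state=0)):
    # Mean cross-validated accuracy of each ensemble
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean())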

Neural Networks Variants: Besides traditional feedforward neural networks, there are various
neural network architectures specifically designed for classification tasks. These include
Convolutional Neural Networks (CNNs) for image classification, Recurrent Neural Networks
(RNNs) for sequential data, and Transformer models for natural language processing tasks.

Decision Boundary Estimation Methods: Techniques such as Gaussian Processes (GP) and
Generative Adversarial Networks (GANs) can be used for estimating decision boundaries in
classification tasks. GANs, for example, can generate synthetic data points to help better
understand the distribution of classes in the feature space.

Anomaly Detection Methods: While primarily used for anomaly detection, methods like
One-Class SVMs and Isolation Forests can also be adapted for binary classification tasks
where one class is significantly smaller than the other.

Prediction:

Prediction is a fundamental concept in machine learning and data science, where the goal is to
make informed guesses about unknown or future outcomes based on available data.
Predictive modeling involves building a mathematical model that captures patterns and
relationships in the data, which can then be used to make predictions on new or unseen data
points.

Here's an overview of the prediction process in machine learning:

​ Problem Definition: The first step in prediction is defining the problem you want to solve.
This includes determining what you want to predict (the target variable) and what
features or predictors are available to make those predictions.
​ Data Collection and Preprocessing: Next, you collect relevant data that contains
information about both the predictors and the target variable. This data may come from
various sources such as databases, APIs, or manual data collection. Once collected,
the data needs to be preprocessed, which may involve tasks like cleaning missing
values, encoding categorical variables, and scaling numerical features.
​ Feature Engineering: Feature engineering is the process of creating new features or
transforming existing ones to improve the performance of the predictive model. This
may include techniques like feature scaling, dimensionality reduction, or creating
interaction terms between features.
​ Model Selection: Once the data is prepared, you choose an appropriate machine
learning algorithm or model to train on the data. The choice of model depends on
factors such as the nature of the problem (classification, regression, etc.), the size and
complexity of the data, and the interpretability requirements.
​ Model Training: In this step, you use the prepared data to train the selected model.
During training, the model learns the underlying patterns and relationships in the data
by adjusting its parameters to minimize a predefined loss function.
​ Model Evaluation: After training, the model's performance is evaluated using evaluation
metrics appropriate for the problem at hand. For classification tasks, metrics like
accuracy, precision, recall, and F1-score are commonly used, while for regression tasks,
metrics like mean squared error (MSE) or R-squared are used.
​ Prediction: Once the model is trained and evaluated, it can be used to make predictions
on new or unseen data points. These predictions provide insights into the likely
outcomes or values of the target variable based on the available information.
​ Model Deployment: Finally, if the model performs well and meets the desired criteria, it
can be deployed into production to make real-time predictions. This involves integrating
the model into the existing software infrastructure and monitoring its performance over
time.

Throughout this process, it's essential to iterate and refine the model based on feedback and
new data to ensure that it continues to make accurate predictions. Additionally, ethical
considerations such as fairness, transparency, and privacy should be taken into account when
making predictions that impact individuals or society.
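The steps above can be condensed into a short scikit-learn sketch; the dataset, model, and metric chosen here are our own illustrative assumptions:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data collection and a simple train/test split
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Feature scaling, model selection, and training in one pipeline
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X_train, y_train)

# Evaluation and prediction on unseen data
print(accuracy_score(y_test, model.predict(X_test)))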

– Accuracy and Error Measures:

Out of the nine data mining models in SQL Server, three can be considered classification models: Naïve Bayes, Decision Trees, and Neural Network. Although logistic regression is a regression technique, it can be used for classification problems as well. Since four models are therefore candidates for a classification problem, we need to decide which algorithm to use. Obviously, you want to select the most accurate data mining model, so an accuracy test should be carried out to evaluate the alternatives.

Let us create four simple models using the Naïve Bayes, Decision Trees, Logistic Regression, and Neural Network algorithms for measuring accuracy in data mining.
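While SQL Server has its own accuracy charts, the same idea can be illustrated in Python with scikit-learn's metrics; this is our own sketch (it uses a Gaussian Naive Bayes model on iris, not the SQL Server models mentioned above):

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

y_pred = GaussianNB().fit(X_train, y_train).predict(X_test)
# Common accuracy and error measures for a classification model
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average='macro'))
print("Recall   :", recall_score(y_test, y_pred, average='macro'))
print("F1-score :", f1_score(y_test, y_pred, average='macro'))
print(confusion_matrix(y_test, y_pred))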
Unit-VII: Cluster Analysis

[08L] Max Marks:12

INTRODUCTION:

Cluster analysis, also known as clustering, is a method of data mining that groups
similar data points together. The goal of cluster analysis is to divide a dataset into
groups (or clusters) such that the data points within each group are more similar to
each other than to data points in other groups. This process is often used for
exploratory data analysis and can help identify patterns or relationships within the
data that may not be immediately obvious. There are many different algorithms used
for cluster analysis, such as k-means, hierarchical clustering, and density-based
clustering. The choice of algorithm will depend on the specific requirements of the
analysis and the nature of the data being analyzed.

7.1 Types of Data in Cluster Analysis

The clustering algorithm usually needs to be chosen experimentally, unless there is a mathematical reason to prefer one clustering method over another. It should be noted that an algorithm that works well on a particular set of data may not work on another set of data. There are a number of different methods to perform cluster analysis. Some of them are:

Hierarchical Cluster Analysis

In this method, each object first forms its own cluster, and clusters are then merged with the most similar and closest cluster. This process is repeated until all objects are in one single cluster. This particular method is known as the agglomerative method: agglomerative clustering starts with single objects and progressively groups them into larger clusters.

The divisive method is the other kind of hierarchical method, in which clustering starts with the complete data set and then divides it into partitions.

Centroid-based Clustering

In this type of clustering, clusters are represented by a central entity (a centroid), which may or may not be a part of the given data set. The k-means method is the classic example, where k cluster centres are chosen and objects are assigned to the nearest cluster centre.

Distribution-based Clustering

It is a type of clustering model closely related to statistics, based on models of distribution. Objects that are likely to belong to the same statistical distribution are put into a single cluster. This type of clustering can capture some complex properties of objects, such as correlation and dependence between attributes.

Density-based Clustering
In this type of clustering, clusters are defined as areas whose density is higher than that of the rest of the data set. Objects in sparse areas are usually required to separate clusters; the objects in these sparse regions are typically treated as noise or border points. The most popular method in this type of clustering is DBSCAN.
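A minimal DBSCAN sketch with scikit-learn; the synthetic two-moons data and the eps and min_samples values are illustrative choices only:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped clusters that density-based methods handle well
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
# Points labelled -1 are treated as noise
print(np.unique(labels, return_counts=True))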

–Partitioning Methods:

7.2 Data Mining Clustering Methods


Let’s take a look at different types of clustering in data mining!

1. Partitioning Clustering Method

In this method, let us say that m partitions are created from the p objects of the database. Each partition represents a cluster, and m < p; m (often written as k) is the number of groups after the objects have been classified. There are some requirements which need to be satisfied by this partitioning clustering method, and they are: –

1. Each object should belong to exactly one group.

2. No group should be empty; every group must contain at least one object.

There are some points which should be remembered about this type of partitioning clustering method:

1. An initial partitioning is created once the number of partitions (say m) is given.
2. There is a technique called iterative relocation, in which objects are moved from one group to another to improve the partitioning (a short k-means sketch illustrating this is given after this list).
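A hedged k-means sketch of partitioning by iterative relocation, using scikit-learn on synthetic data; the number of clusters and random seeds are illustrative assumptions:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
# k-means repeatedly relocates objects to the nearest centre and updates the centres
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
# Each object belongs to exactly one group, and no group is empty
print(kmeans.labels_[:10])
print(kmeans.cluster_centers_)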


2. Hierarchical Clustering Methods

Among the many types of clustering in data mining, the hierarchical clustering method organizes the given set of data objects into a hierarchical decomposition, and the way this decomposition is formed determines how the objects end up grouped. There are two approaches for creating the hierarchical decomposition, which are: –

1. Divisive Approach

Another name for the divisive approach is the top-down approach. At the beginning of this method, all the data objects are kept in the same cluster. Smaller clusters are then created by repeatedly splitting the groups, and this iteration continues until the termination condition is met. A split or merge cannot be undone, which is why this method is not very flexible.
2. Agglomerative Approach

Another name for this approach is the bottom-up approach. Each object starts in its own group, and groups keep merging until all of them are merged into one cluster or the termination condition is met.

There are two approaches which can be used to improve the quality of hierarchical clustering in data mining: –

1. Carefully analyze the object linkages at every partitioning of the hierarchical clustering.
2. Integrate hierarchical agglomeration with other clustering techniques: first, the objects are grouped into micro-clusters, and then macro-clustering is performed on the micro-clusters (a short agglomerative clustering sketch is given after this list).
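A minimal bottom-up (agglomerative) clustering sketch with scikit-learn; the synthetic data, the choice of three clusters, and Ward linkage are illustrative assumptions:

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=7)
# Bottom-up merging continues until only 3 clusters remain
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
print(agg.fit_predict(X)[:10])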

3. Density-Based Clustering Method

In this method of clustering in data mining, density is the main focus: the notion of density is used as the basis for forming clusters. A cluster keeps growing as long as, for each data point, the neighbourhood within a given radius contains at least a minimum number of points.

4. Grid-Based Clustering Method

In the grid-based clustering method, a grid structure is formed by quantizing the object space into a finite number of cells, and the clustering operations are performed on this grid.

Advantages of the grid-based clustering method: –

1. Faster processing time: the processing time of this method is much quicker than that of most other methods.
2. The processing time depends only on the number of cells in each dimension of the quantized space, not on the number of data objects.

5. Model-Based Clustering Methods

In this type of clustering method, a model is hypothesized for every cluster, and the data that best fits each model is found. The clusters are located by clustering the density function.

6. Constraint-Based Clustering Method


Application- or user-oriented constraints are incorporated into the clustering. The expectations of the user are expressed as constraints, which makes this process of grouping highly interactive.

7.3 Clustering High-Dimensional Data in Data Mining


Clustering is basically a type of unsupervised learning method. An unsupervised
learning method is a method in which we draw references from datasets consisting of
input data without labeled responses.
Clustering is the task of dividing the population or data points into a number of groups
such that data points in the same groups are more similar to other data points in the
same group and dissimilar to the data points in other groups.

Challenges of Clustering High-Dimensional Data:

Clustering high-dimensional data returns groups of objects as clusters. Similar objects must be grouped together to perform cluster analysis of high-dimensional data, but the high-dimensional data space is huge and contains complex data types and attributes. A major challenge is that we need to find the set of attributes (the subspace) that is relevant to each cluster, since a cluster is defined and characterized by the attributes present in it. When clustering high-dimensional data, we therefore need to search both for the clusters and for the subspaces in which those clusters exist.
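One simple (though by no means the only) way to cope with high dimensionality is to project the data into a lower-dimensional space before clustering; a hedged sketch, where the digits dataset, the number of components, and the number of clusters are all illustrative assumptions:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# 64-dimensional digit images reduced to 10 principal components before clustering
X, _ = load_digits(return_X_y=True)
X_reduced = PCA(n_components=10, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X_reduced)
print(labels[:20])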
7.4 Outliers in Data Mining
An outlier is a data object that deviates significantly from the rest of the data objects and behaves in a different manner. Outliers can be caused by measurement or execution errors. The analysis of outlier data is referred to as outlier analysis or outlier mining.
An outlier cannot simply be dismissed as noise or error. Instead, outliers are suspected of not being generated by the same mechanism as the rest of the data objects.
Outliers are of three types, namely –
1. Global (or Point) Outliers
2. Collective Outliers
3. Contextual (or Conditional) Outliers

1. Global Outliers

1. Definition: Global outliers are data points that deviate significantly from the overall
distribution of a dataset.

2. Collective Outliers

1. Definition: Collective outliers are groups of data points that collectively deviate significantly
from the overall distribution of a dataset.

3. Contextual Outliers

1. Definition: Contextual outliers are data points that deviate significantly from the expected
behavior within a specific context or subgroup.
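A small sketch of detecting a global outlier with a simple z-score rule; the synthetic readings and the threshold of 3 standard deviations are our own illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
# 200 "normal" readings around 10, plus one value that clearly deviates
data = np.concatenate([rng.normal(10, 0.5, 200), [25.0]])

z_scores = (data - data.mean()) / data.std()
# Flag points more than 3 standard deviations from the mean as global outliers
print(data[np.abs(z_scores) > 3])   # -> [25.]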
