
Introduction to Data Warehouse

What is Data Warehouse


► A data warehouse is a subject-oriented, integrated, time-variant and non-volatile
collection of data in support of management's decision making process.
► Subject-Oriented: A data warehouse can be used to analyze a particular subject
area. For example, "sales" can be a particular subject.
► Integrated: A data warehouse integrates data from multiple data sources. For
example, source A and source B may have different ways of identifying a product,
but in a data warehouse, there will be only a single way of identifying a product.
► Time-Variant: Historical data is kept in a data warehouse. For example, one can
retrieve data from 3 months, 6 months, 12 months, or even older data from a data
warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.
► Non-volatile: Once data is in the data warehouse, it will not change. So, historical
data in a data warehouse should never be altered.
What is Data Warehouse
► A data warehouse is a system that stores data from a company’s operational
databases as well as external sources. Data warehouse platforms are different
from operational databases because they store historical information, making it
easier for business leaders to analyze data over a specific period of time. Data
warehouse platforms also sort data based on different subject matter, such as
customers, products or business activities.
Why is data warehousing important?
► Data warehousing is an increasingly important business intelligence tool, allowing
organizations to:
Ensure consistency. Data warehouses are programmed to apply a uniform format to
all collected data, which makes it easier for corporate decision-makers to analyze and
share data insights with their colleagues around the globe. Standardizing data from
different sources also reduces the risk of error in interpretation and improves overall
accuracy.
Make better business decisions. Successful business leaders develop data-driven
strategies and rarely make decisions without consulting the facts. Data warehousing
improves the speed and efficiency of accessing different data sets and makes it easier
for corporate decision-makers to derive insights that will guide the business and
marketing strategies that set them apart from their competitors.
Improve their bottom line. Data warehouse platforms allow business leaders to
quickly access their organization's historical activities and evaluate initiatives that
have been successful — or unsuccessful — in the past. This allows executives to see
where they can adjust their strategy to decrease costs, maximize efficiency and
increase sales to improve their bottom line.
Framework of the data warehouse
Source Data Component
Source data coming into the data warehouses may be grouped into four broad
categories:
Production Data: This type of data comes from the different operational systems of the enterprise. Based on the data requirements in the data warehouse, we choose segments of the data from the various operational systems.
Internal Data: In each organization, users keep their "private" spreadsheets, reports, customer profiles, and sometimes even departmental databases. This is the internal data, part of which could be useful in a data warehouse.
Archived Data: Operational systems are mainly intended to run the current business. In every operational system, we periodically take the old data and store it in archived files.
External Data: Most executives depend on information from external sources for a large percentage of the information they use. They use statistics relating to their industry produced by external agencies.
Framework of the data warehouse
A data warehouse architecture is a method of defining the overall architecture of data communication, processing, and presentation that exists for end-client computing within the enterprise. Each data warehouse is different, but all are characterized by
standard vital components.
Data warehouse applications are designed to support users' ad-hoc data requirements, an activity recently dubbed online analytical processing (OLAP). These
include applications such as forecasting, profiling, summary reporting, and trend
analysis.
Data Warehouse Architecture: Basic
Data Warehouse Architecture: With Staging Area
Data Warehouse Architecture: With Staging Area and Data Marts
Operational System
An operational system is a term used in data warehousing for a system that processes the day-to-day transactions of an organization.
Flat Files: A flat file system is a system of files in which transactional data is stored, and every file in the system must have a different name.
Framework of the data warehouse
Meta Data
A set of data that defines and gives information about other data.
Metadata is used in a data warehouse for a variety of purposes, including:
Metadata summarizes necessary information about data, which can make finding and working with particular instances of data easier. For example, author, date created, date modified, and file size are examples of very basic document metadata.
Meta data is used to direct a query to the most appropriate data source.
Data Warehouse Architecture: With Staging Area
► We can do this programmatically, although most data warehouses use a staging area (a place where data is processed before entering the warehouse).
► A staging area simplifies data cleansing and consolidation for operational data coming from multiple source systems, especially for enterprise data warehouses, where all relevant data of an enterprise is consolidated.
Framework of the data warehouse
Types of Data Warehouse Architectures
► Single-Tier Architecture
Single-tier architecture is not frequently used in practice. Its purpose is to minimize the amount of data stored; to reach this goal, it removes data redundancies.
The figure shows the only layer physically available is the source layer. In this method, data
warehouses are virtual. This means that the data warehouse is implemented as a
multidimensional view of operational data created by specific middleware, or an
intermediate processing layer.
Framework of the data warehouse
► Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier architecture for a data warehouse system.
Source layer: A data warehouse system uses heterogeneous sources of data. That data is stored initially in corporate relational databases or legacy databases, or it may come from an information system outside the corporate walls.
Data Staging: The data stored in the sources should be extracted, cleansed to remove
inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one
standard schema.
Data Warehouse layer: Information is saved in one logically centralized repository: the data warehouse. The data warehouse can be accessed directly, but it can also
be used as a source for creating data marts, which partially replicate data warehouse
contents and are designed for specific enterprise departments.
Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports,
dynamically analyze information, and simulate hypothetical business scenarios. It should
feature aggregate information navigators, complex query optimizers, and customer-friendly
GUIs.
Framework of the data warehouse
Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled layer, and the data warehouse layer (containing both data warehouses and data marts). The reconciled layer sits between the source data and the data warehouse.
The main advantage of the reconciled layer is that it
creates a standard reference data model for a whole
enterprise. At the same time, it separates the problems
of source data extraction and integration from those of
data warehouse population.
Framework of the data warehouse
► Three-Tier Data Warehouse Architecture
Data Warehouses usually have a three-level (tier) architecture that includes:
1. Bottom Tier (Data Warehouse Server)
2. Middle Tier (OLAP Server)
3. Top Tier (Front end Tools).
Developing data warehouses
Data warehousing is a process used to gather and manage structured and unstructured data from multiple sources in a centralized repository in order to drive actionable business decisions. With all of your data in one place, it becomes easier to perform analysis and reporting and to discover meaningful insights at different levels of aggregation.
A data warehouse environment includes an extraction, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
Developing data warehouses
Requirement Specification: It is the first step in the development of the Data Warehouse
and is done by business analysts. In this step, Business Analysts prepare business
requirement specification documents.
Data Modelling: This is the second step in the development of the Data Warehouse. Data
Modelling is the process of visualizing data distribution and designing databases by
fulfilling the requirements to transform the data into a format that can be stored in the data
warehouse.
ETL Design and Development: This is the third step in the development of the Data Warehouse. An ETL (Extract, Transform, Load) tool may extract data from various source systems and store it in a data lake.
OLAP Cubes: This is the fourth step in the development of the Data Warehouse. An
OLAP cube, also known as a multidimensional cube or hypercube, is a data structure that
allows fast analysis of data according to the multiple dimensions that define a business
problem
UI Development: This is the fifth step in the development of the Data Warehouse. So far,
the processes discussed have taken place at the backend.
Developing data warehouses
Maintenance: This is the sixth step in the development of the Data Warehouse. In this phase, we can update or make changes to the schema and the data warehouse's application domain or requirements. Data warehouse maintenance systems must also provide means to keep track of schema modifications.
Test and Deployment: This is often the final step in the Data Warehouse development cycle. Businesses and organizations test data warehouses to verify whether the required business requirements have been implemented successfully. Warehouse testing involves the scrutiny of enormous volumes of data.
Differences between OLAP and OLTP

► OLAP is used for data analysis; OLTP is used to manage a very large number of short online transactions.
► OLAP uses a data warehouse; OLTP uses a traditional DBMS.
► OLAP is mainly used for reading data; OLTP manages all insert, update, and delete transactions.
► OLAP processing is a little slow; OLTP responds in milliseconds.
► Tables in an OLAP database are not normalized; tables in an OLTP database are normalized.
OLAP Operations in the Multidimensional Data Model

In the multidimensional model, the records are organized into various dimensions, and each dimension includes multiple levels of abstraction described by concept hierarchies. This organization supports users with the flexibility to view data from various perspectives. A number of OLAP data cube operations exist to demonstrate these different views, allowing interactive querying and searching of the records at hand. Hence, OLAP supports a user-friendly environment for interactive data analysis.
Consider the OLAP operations to be performed on multidimensional data. The figure shows a data cube for the sales of a shop. The cube contains the dimensions location, time, and item, where location is aggregated with respect to city values, time is aggregated with respect to quarters, and items are aggregated with respect to item types.
► Roll-Up
The roll-up operation (also known as the drill-up or aggregation operation) performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction. Roll-up is like zooming out on the data cube. The figure shows the result of a roll-up operation performed on the dimension location.
OLAP Operations in the Multidimensional Data Model

Drill-Down
The drill-down operation (also called roll-down) is the reverse of roll-up. Drill-down is like zooming in on the data cube. It navigates from less detailed data to more detailed data. Drill-down can be performed by either stepping down a concept hierarchy for a dimension or adding additional dimensions.
Slice
A slice is a subset of the cube corresponding to a single value for one or more members of a dimension. For example, a slice operation is executed when the customer wants a selection on one dimension of a three-dimensional cube, resulting in a two-dimensional slice. So, the slice operation performs a selection on one dimension of the given cube, thus resulting in a subcube.
Dice
The dice operation defines a subcube by performing a selection on two or more dimensions.
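These four operations can be illustrated with a tiny, hypothetical sales cube built in pandas; this is a minimal sketch, assuming pandas is installed, and the cities, items, and amounts below are made up only for illustration.

# Minimal sketch of roll-up, drill-down, slice, and dice using pandas
# (hypothetical sales cube; pandas is assumed to be installed).
import pandas as pd

sales = pd.DataFrame({
    "city":    ["Chennai", "Chennai", "Delhi", "Delhi", "New York", "New York"],
    "country": ["India",   "India",   "India", "India", "USA",      "USA"],
    "quarter": ["Q1",      "Q2",      "Q1",    "Q2",    "Q1",       "Q2"],
    "item":    ["Mobile",  "Laptop",  "Mobile", "Laptop", "Mobile",  "Laptop"],
    "amount":  [100, 150, 120, 180, 90, 200],
})

# Roll-up: climb the location hierarchy from city up to country,
# producing fewer, more aggregated rows.
rollup = sales.groupby(["country", "quarter", "item"])["amount"].sum()

# Drill-down: move back to the more detailed city level.
drilldown = sales.groupby(["city", "quarter", "item"])["amount"].sum()

# Slice: fix a single value of one dimension (quarter = Q1), leaving a
# two-dimensional view over city and item.
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions at once, producing a subcube.
dice = sales[(sales["item"] == "Mobile") & (sales["city"].isin(["Chennai", "Delhi"]))]

print(rollup, drilldown, slice_q1, dice, sep="\n\n")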
What is OLAP

OLAP stands for On-Line Analytical Processing. OLAP is a category of software technology which enables analysts, managers, and executives to gain insight into information through fast, consistent, interactive access to a wide variety of possible views of data that has been transformed from raw information to reflect the real dimensionality of the enterprise as understood by the clients.
OLAP implements multidimensional analysis of business information and supports the capability for complex calculations, trend analysis, and sophisticated data modeling. It is rapidly becoming the essential foundation for intelligent solutions, including business performance management, planning, budgeting, forecasting, financial reporting, analysis, simulation models, knowledge discovery, and data warehouse reporting.
Who uses OLAP and Why?
• Budgeting
• Activity-based costing
• Financial performance analysis
• Financial modeling
Types of OLAP

There are three main types of OLAP servers, as follows:


► ROLAP stands for Relational OLAP, an application based on relational DBMSs.
► MOLAP stands for Multidimensional OLAP, an application based on
multidimensional DBMSs.
► HOLAP stands for Hybrid OLAP, an application using both relational and
multidimensional techniques.
Relational OLAP (ROLAP) Server
► These are intermediate servers which stand in between a relational back-end
server and user frontend tools.
► They use a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to provide missing pieces.
► ROLAP servers contain optimization for each DBMS back end, implementation of
aggregation navigation logic, and additional tools and services.
► ROLAP technology tends to have higher scalability than MOLAP technology.
Types of OLAP

Multidimensional OLAP (MOLAP) Server


A MOLAP system is based on a native logical model that directly supports multidimensional data and operations. Data are stored physically in multidimensional arrays, and positional techniques are used to access them.
One of the significant distinctions of MOLAP from ROLAP is that data are summarized and stored in an optimized format in a multidimensional cube, instead of in a relational database. In the MOLAP model, data are structured into proprietary formats according to clients' reporting requirements, with the calculations pre-generated on the cubes.
Types of OLAP

Hybrid OLAP (HOLAP) Server


HOLAP incorporates the best features of MOLAP and ROLAP into a single architecture. HOLAP systems store the more substantial quantities of detailed data in relational tables, while the aggregations are stored in pre-calculated cubes. HOLAP can also drill through from the cube down to the relational tables for detailed data. Microsoft SQL Server 2000 provides a hybrid OLAP server.
Types of OLAP

Other Types
► There are also less popular OLAP styles upon which one could stumble every so often. Some of the less popular variants in the OLAP industry are listed below.
Web-Enabled OLAP (WOLAP) Server
► WOLAP pertains to OLAP application which is accessible via the web browser.
Unlike traditional client/server OLAP applications, WOLAP is considered to have a
three-tiered architecture which consists of three components: a client, a
middleware, and a database server.
Desktop OLAP (DOLAP) Server
► DOLAP permits a user to download a section of the data from the database or
source, and work with that dataset locally, or on their desktop.
Mobile OLAP (MOLAP) Server
► Mobile OLAP enables users to access and work on OLAP data and applications
remotely through the use of their mobile devices.
Back-End Tools and Utilities

Data warehousing is a technique that is mainly used to collect and manage data from various sources to give the business meaningful insights. A data warehouse is specifically designed to support management decisions.
Data extraction:
get data from multiple, heterogeneous, and external sources
Data cleaning:
detect errors in the data and rectify them when possible
Data transformation:
convert data from legacy or host format to warehouse format
Load:
sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
Refresh
propagate the updates from the data sources to the warehouse
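A highly simplified sketch of the back-end steps listed above (extract, clean, transform, load), using Python with an in-memory SQLite database standing in for the warehouse; the table name, column names, and sample rows are assumptions made only for illustration, and a refresh would simply re-run the same load against new source rows.

# Minimal ETL sketch: extract -> clean -> transform -> load into a SQLite
# "warehouse" table (all names and values are illustrative assumptions).
import sqlite3

def extract():
    # A real system would pull from multiple heterogeneous source systems.
    return [
        {"order_id": 1, "amount": "100.5", "region": " north "},
        {"order_id": 2, "amount": None,    "region": "South"},
        {"order_id": 3, "amount": "250.0", "region": "south"},
    ]

def clean(rows):
    # Detect simple errors and rectify them: drop rows with missing measures,
    # trim and standardize the region text.
    cleaned = []
    for row in rows:
        if row["amount"] is None:
            continue
        row["region"] = row["region"].strip().title()
        cleaned.append(row)
    return cleaned

def transform(rows):
    # Convert from the source (host) format to the warehouse format.
    return [(r["order_id"], float(r["amount"]), r["region"]) for r in rows]

def load(rows, conn):
    conn.execute("""CREATE TABLE IF NOT EXISTS sales_fact
                    (order_id INTEGER PRIMARY KEY, amount REAL, region TEXT)""")
    conn.executemany("INSERT OR REPLACE INTO sales_fact VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(clean(extract())), conn)   # re-running this acts as a refresh
print(conn.execute(
    "SELECT region, SUM(amount) FROM sales_fact GROUP BY region").fetchall())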
Metadata Repository
Metadata is data that describes and contextualizes other data. It provides information
about the content, format, structure, and other characteristics of data, and can be used
to improve the organization, discoverability, and accessibility of data.
Metadata can be stored in various forms, such as text, XML, or RDF, and can be
organized using metadata standards and schemas.
Metadata is data that provides information about other data.
1. File metadata: This includes information about a file, such as its name, size, type,
and creation date.
2. Image metadata: This includes information about an image, such as its resolution,
color depth, and camera settings.
3. Music metadata: This includes information about a piece of music, such as its title,
artist, album, and genre.
4. Video metadata: This includes information about a video, such as its length,
resolution, and frame rate.
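As a small illustration of file metadata, Python's standard library can report a file's name, size, and modification date; the sketch below inspects the script itself so no particular file path has to be assumed.

# Reading basic file metadata (name, size, last-modified time) with the
# Python standard library.
import os
from datetime import datetime

path = __file__                     # inspect this script itself
info = os.stat(path)

file_metadata = {
    "name": os.path.basename(path),
    "size_bytes": info.st_size,
    "modified": datetime.fromtimestamp(info.st_mtime).isoformat(),
}
print(file_metadata)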
Metadata Repository
Document metadata: This includes information about a document, such as its author,
title, and creation date.
Database metadata: This includes information about a database, such as its
structure, tables, and fields.
Web metadata: This includes information about a web page, such as its title,
keywords, and description.
Types of Metadata:
Descriptive metadata: This type of metadata provides information about the
content, structure, and format of data, and may include elements such as title, author,
subject, and keywords.
Administrative metadata: This type of metadata provides information about the
management and technical characteristics of data, and may include elements such as
file format, size, and creation date.
Structural metadata: This type of metadata provides information about the
relationships and organization of data, and may include elements such as links, tables
of contents, and indices.
Metadata Repository
Rights metadata: This type of metadata provides information about the ownership,
licensing, and access controls of data, and may include elements such as copyright,
permissions, and terms of use. Rights metadata helps to manage and protect the
intellectual property rights of data and can be used to support data governance and
compliance.
Educational metadata: This type of metadata provides information about the
educational value and learning objectives of data, and may include elements such as
learning outcomes, educational levels, and competencies
Metadata Repository
A metadata repository is a database or other storage mechanism that is used to store
metadata about data. A metadata repository can be used to manage, organize, and
maintain metadata in a consistent and structured manner, and can facilitate the
discovery, access, and use of data.
A metadata repository may contain metadata about a variety of types of data, such as
documents, images, audio and video files, and other types of digital content. The
metadata in a metadata repository may include information about the content, format,
structure, and other characteristics of data, and may be organized using metadata
standards and schemas.
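A minimal sketch of such a repository, modelled here as a single SQLite table of (asset, key, value) rows; the schema and the sample entries are assumptions chosen only to show the discovery idea.

# Toy metadata repository: one table of (asset, key, value) entries that can
# be queried to discover assets by their metadata (illustrative only).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (asset TEXT, key TEXT, value TEXT)")

entries = [
    ("sales_fact", "source",  "ERP"),
    ("sales_fact", "format",  "table"),
    ("logo.png",   "format",  "image"),
    ("logo.png",   "creator", "design team"),
]
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?)", entries)

# Discovery query: find every asset whose recorded format is 'table'.
rows = conn.execute(
    "SELECT asset FROM metadata WHERE key = 'format' AND value = 'table'"
).fetchall()
print(rows)   # [('sales_fact',)]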
Metadata Repository
There are many types of metadata repositories, ranging from simple file systems or
spreadsheets to complex database systems. The choice of metadata repository will
depend on the needs and requirements of the organization, as well as the size and
complexity of the data that is being managed.
Benefits of Metadata Repository
Improved data quality: A metadata repository can help ensure that metadata is
consistently structured and accurate, which can improve the overall quality of the
data.
Increased data accessibility: A metadata repository can make it easier for users to
access and understand the data, by providing context and information about the data.
Enhanced data integration: A metadata repository can facilitate data integration by
providing a common place to store and manage metadata from multiple sources.
Metadata Repository
Improved data governance: A metadata repository can help enforce metadata
standards and policies, making it easier to ensure that data is being used and managed
appropriately.
Enhanced data security: A metadata repository can help protect the privacy and
security of metadata, by providing controls to restrict access to sensitive or
confidential information.
► Challenges for Metadata Management
There are several challenges that can arise when managing metadata:
Lack of standardization: Different organizations or systems may use different
standards or conventions for metadata, which can make it difficult to effectively
manage metadata across different sources.
Data quality: Poorly structured or incorrect metadata can lead to problems with data
quality, making it more difficult to use and understand the data.
Data integration: When integrating data from multiple sources, it can be challenging
to ensure that the metadata is consistent and aligned across the different sources.
Metadata Repository
Data governance: Establishing and enforcing metadata standards and policies can be
difficult, especially in large organizations with multiple stakeholders.
Data security: Ensuring the security and privacy of metadata can be a challenge,
especially when working with sensitive or confidential information.
► Metadata Management Software:
Software for managing metadata makes it easier to assess, curate, collect, and store
metadata. In order to enable data monitoring and accountability, organizations should
automate data management. Examples of this kind of software include the following:
SAP Power Designer by SAP: This data management system has a good level of
stability. It is recognized for its ability to serve as a platform for model testing.
SAP Information Steward by SAP: This solution’s data insights make it valuable.
IBM Info Sphere Information Governance Catalog by IBM: The ability to use Open
IGC to build unique assets and data lineages is a key feature of this system.
Informatica Enterprise Data Catalog by Informatica: The technology used by this
solution, which can both scan and gather information from diverse sources, is highly
respected.
Difference between Database and Data Warehouse

Database:
► 1. It is used for Online Transactional Processing (OLTP), but can also be used for other objectives such as data warehousing. It records data from the clients for history.
► 2. The tables and joins are complicated since they are normalized for the RDBMS. This is done to reduce redundant data and to save storage space.
► 3. Data is dynamic.
► 4. Optimized for write operations.
► 5. Performance is low for analytical queries.
Data Warehouse:
► 1. It is used for Online Analytical Processing (OLAP). It reads historical information for business decisions.
► 2. The tables and joins are simple since they are de-normalized. This is done to minimize the response time for analytical queries.
► 3. Data is largely static.
► 4. Optimized for read operations.
► 5. High performance for analytical queries.
DW Design Consideration and
Dimensional Modeling
Dimensional Data Modeling
► Dimensional Data Modeling is one of the data modeling techniques used in data warehouse design. The concept of Dimensional Modeling was developed by Ralph Kimball and comprises fact and dimension tables. Since the main goal of this modeling is to improve data retrieval, it is optimized for SELECT operations.
► The advantage of using this model is that we can store data in such a way that it is easier to store and retrieve the data once it is in the data warehouse. The dimensional model is the data model used by many OLAP systems.
Elements of Dimensional Data Model
Facts
► Facts are the measurable data elements that represent the business metrics of
interest. For example, in a sales data warehouse, the facts might include sales
revenue, units sold, and profit margins. Each fact is associated with one or more
dimensions, creating a relationship between the fact and the descriptive data.
Dimensional Data Modeling
Dimension
► Dimensions are the descriptive data elements that are used to categorize or
classify the data. For example, in a sales data warehouse, the dimensions might
include product, customer, time, and location. Each dimension is made up of a set
of attributes that describe the dimension. For example, the product dimension
might include attributes such as product name, product category, and product
price.
Attributes
► Characteristics of a dimension in data modeling are known as attributes. These are used to filter, search facts, etc. For a dimension of location, attributes can be State, Country, Zipcode, etc.
Fact Table
► In a dimensional data model, the fact table is the central table that contains the
measures or metrics of interest, surrounded by the dimension tables that describe
the attributes of the measures. The dimension tables are related to the fact table
through foreign key relationships
Dimensional Data Modeling
Dimension Table
► The dimensions of a fact are described by dimension tables, which are joined to the fact table by foreign keys. Dimension tables are simply de-normalized tables. A dimension can have one or more relationships.
Steps to Create Dimensional Data Modeling
Step-1: Identifying the business objective: The first step is to identify the business
objective. Sales, HR, Marketing, etc. are some examples of the need of the
organization. Since it is the most important step of Data Modelling the selection of
business objectives also depends on the quality of data available for that process.
Step-2: Identifying Granularity: Granularity is the lowest level of information stored
in the table. The level of detail for business problems and its solution is described
by Grain.
Dimensional Data Modeling
Step-3: Identifying Dimensions and their Attributes: Dimensions are objects or
things. Dimensions categorize and describe data warehouse facts and measures in
a way that supports meaningful answers to business questions. A data warehouse
organizes descriptive attributes as columns in dimension tables. For example, the date dimension may contain data like year, month, and weekday.
Step-4: Identifying the Fact: The measurable data is held by the fact table. Most of
the fact table rows are numerical values like price or cost per unit, etc.

Step-5: Building of Schema: We implement the Dimension Model in this step. A schema is a database structure. There are two popular schemas: the Star Schema and the Snowflake Schema (a minimal star-schema sketch follows below).
Dimensional data modeling is a technique used in data warehousing to organize and
structure data in a way that makes it easy to analyze and understand. In a
dimensional data model, data is organized into dimensions and facts.
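A minimal star-schema sketch for a sales subject area, expressed as SQLite DDL issued from Python; the table and column names are illustrative assumptions, not a prescribed design.

# Star schema sketch: one central fact table referencing de-normalized
# dimension tables through foreign keys (names are illustrative).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,
    day      INTEGER,
    month    INTEGER,
    year     INTEGER
);
CREATE TABLE dim_location (
    location_key INTEGER PRIMARY KEY,
    city         TEXT,
    country      TEXT
);
-- Fact table at transaction grain: one row per individual sale.
CREATE TABLE fact_sales (
    sale_id      INTEGER PRIMARY KEY,
    product_key  INTEGER REFERENCES dim_product(product_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    units_sold   INTEGER,
    sales_amount REAL
);
""")
print("star schema created")

A snowflake schema would differ only in that the dimension tables themselves are further normalized, for example by splitting category out of dim_product into its own table.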
Dimensional Data Modeling
Advantages of Dimensional Data Modeling
► Simplified Data Access: Dimensional data modeling enables users to easily access
data through simple queries, reducing the time and effort required to retrieve and
analyze data.
► Enhanced Query Performance: The simple structure of dimensional data
modeling allows for faster query performance, particularly when compared to
relational data models.
► Increased Flexibility: Dimensional data modeling allows for more flexible data
analysis, as users can quickly and easily explore relationships between data.
► Improved Data Quality: Dimensional data modeling can improve data quality by
reducing redundancy and inconsistencies in the data.
► Easy to Understand: Dimensional data modeling uses simple, intuitive structures
that are easy to understand, even for non-technical users.
Dimensional Data Modeling
Disadvantages of Dimensional Data Modeling
► Limited Complexity: Dimensional data modeling may not be suitable for very
complex data relationships, as it relies on simple structures to organize data.
► Limited Integration: Dimensional data modeling may not integrate well with
other data models, particularly those that rely on normalization techniques.
► Limited Scalability: Dimensional data modeling may not be as scalable as other
data modeling techniques, particularly for very large datasets.
► Limited History Tracking: Dimensional data modeling may not be able to track
changes to historical data, as it typically focuses on current data.
Granularity of Facts
Granularity in Data Warehousing refers to the level of detail or resolution at which
data is stored and analyzed. It determines the "grain" of data, which is the smallest
unit of data that can be accessed or manipulated. Granularity can vary based on the
specific business requirements and objectives. It affects the level of detail in
reporting, analysis, and decision-making processes.
How Granularity in Data Warehousing Works
In a data warehousing environment, data is organized and stored in a structured
manner that facilitates efficient querying, processing, and analysis. Granularity
determines the level of detail captured during data collection and storage. It can
range from high to low, depending on the specific needs of the business. Higher
granularity means more detailed data, while lower granularity means aggregated
or summarized data.
For example, in a retail business, high granularity might involve storing data at the
individual transaction level, including details such as the customer, product,
quantity, and price. Low granularity, on the other hand, could involve storing data
at the monthly sales total level, summarizing the transactions for each month
without capturing individual details.
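The retail example above can be expressed in a few lines of pandas: the same data kept at transaction grain versus summarized to monthly grain; this is a sketch with made-up figures, assuming pandas is installed.

# High vs. low granularity: individual transactions vs. monthly totals
# (hypothetical figures, for illustration only).
import pandas as pd

transactions = pd.DataFrame({
    "date":     pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03"]),
    "customer": ["C1", "C2", "C1"],
    "product":  ["P1", "P2", "P1"],
    "quantity": [2, 1, 5],
    "price":    [10.0, 25.0, 10.0],
})
transactions["amount"] = transactions["quantity"] * transactions["price"]

# Low granularity: monthly sales totals; the individual details are lost.
monthly = transactions.set_index("date")["amount"].resample("MS").sum()
print(monthly)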
Granularity of Facts
Granularity in Data Warehousing is Important
Accurate Analysis: Higher granularity allows for more detailed analysis, providing
insights into specific trends, patterns, and outliers. It enables businesses to make
accurate and informed decisions based on comprehensive data.
Flexibility: Data stored at different granularities caters to various reporting and
analysis requirements. It offers the flexibility to drill down into detailed data or
view summarized information, depending on the specific needs of different
stakeholders.
Data Integration : Granularity affects the integration of data from multiple sources.
Aligning the granularities of different data sets enables businesses to combine and
analyze them effectively, ensuring consistency and accuracy in reporting and
analysis.
Data Governance: Granularity supports data governance initiatives, allowing
businesses to define and enforce data quality, security, and privacy measures at
the appropriate level of detail.
Additivity of Facts
Additive facts support mathematical operations like addition, subtraction,
multiplication, and division, allowing aggregation across all associated
dimensions.
Example
Consider a sales fact table containing each transaction's total sales amount. This
additive fact can be aggregated across dimensions such as time (e.g., monthly,
quarterly, yearly), product category, or sales region.
This sales fact table includes transaction details such as the transaction date,
product, category, sales amount, and quantity sold. It represents individual
sales transactions that can be aggregated across dimensions for analysis and
decision-making.
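Additivity can be seen in a few lines of pandas: the same sales_amount fact can be summed along any of its dimensions and the grand total stays consistent (made-up figures, pandas assumed).

# Additive fact: sales_amount can be summed across any dimension and the
# grand total is always the same (hypothetical data).
import pandas as pd

fact = pd.DataFrame({
    "quarter":      ["Q1",   "Q1",   "Q2",   "Q2"],
    "category":     ["A",    "B",    "A",    "B"],
    "region":       ["East", "West", "East", "West"],
    "sales_amount": [100.0,  200.0,  150.0,  250.0],
})

by_quarter  = fact.groupby("quarter")["sales_amount"].sum()
by_category = fact.groupby("category")["sales_amount"].sum()
by_region   = fact.groupby("region")["sales_amount"].sum()

# All three aggregations roll up to the same grand total (700.0).
assert by_quarter.sum() == by_category.sum() == by_region.sum() == 700.0
print(by_quarter, by_category, by_region, sep="\n\n")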
Functional dependency of the data
In relational database management, a functional dependency is a concept that specifies the relationship between two sets of attributes, where one attribute determines the value of another attribute.
It is denoted as X → Y, where the attribute set on the left side of the arrow, X, is called the Determinant, and Y is called the Dependent.
Functional dependencies are used to mathematically express relations among database entities and are very important for understanding advanced concepts in relational database systems and for solving problems in competitive exams like GATE.
Here, the Emp_Id attribute can uniquely identify the Emp_Name attribute of the employee table, because if we know the Emp_Id, we can tell the employee name associated with it.
This functional dependency can be written as: Emp_Id → Emp_Name.
We can say that Emp_Name is functionally dependent on Emp_Id.
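A small sketch of checking whether a functional dependency X → Y holds over a set of rows; the column names follow the Emp_Id / Emp_Name example above, and the sample rows are hypothetical.

# Check whether X -> Y holds: every value of X must map to exactly one Y value.
def fd_holds(rows, x, y):
    seen = {}
    for row in rows:
        if row[x] in seen and seen[row[x]] != row[y]:
            return False          # same determinant, different dependent
        seen[row[x]] = row[y]
    return True

employees = [
    {"Emp_Id": 1, "Emp_Name": "Asha",  "Dept": "Sales"},
    {"Emp_Id": 2, "Emp_Name": "Ravi",  "Dept": "Sales"},
    {"Emp_Id": 3, "Emp_Name": "Meena", "Dept": "HR"},
]

print(fd_holds(employees, "Emp_Id", "Emp_Name"))   # True: Emp_Id -> Emp_Name
print(fd_holds(employees, "Dept", "Emp_Name"))     # False: Dept does not determine Emp_Name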
Helper Data
To support a many-to-many relationship between fact and dimension records, a
helper table is inserted between the fact and dimension tables. The helper
table can have multiple records for each fact and dimension key combination.
This allows queries to retrieve facts for any given dimension value.
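A minimal sketch of such a helper (bridge) table in SQLite, placed between a sales fact and a salesperson dimension so that one sale can relate to several salespeople and vice versa; all table and column names are illustrative assumptions.

# Helper (bridge) table resolving a many-to-many relationship between a fact
# and a dimension (illustrative schema only).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_salesperson (
    salesperson_key INTEGER PRIMARY KEY,
    name            TEXT
);
CREATE TABLE fact_sale (
    sale_key INTEGER PRIMARY KEY,
    amount   REAL
);
-- One row per (sale, salesperson) pair: many salespeople per sale and
-- many sales per salesperson.
CREATE TABLE bridge_sale_salesperson (
    sale_key        INTEGER REFERENCES fact_sale(sale_key),
    salesperson_key INTEGER REFERENCES dim_salesperson(salesperson_key),
    PRIMARY KEY (sale_key, salesperson_key)
);
""")

# Query pattern: facts for a given dimension value go through the bridge.
query = """
SELECT f.sale_key, f.amount
FROM fact_sale f
JOIN bridge_sale_salesperson b ON b.sale_key = f.sale_key
JOIN dim_salesperson d ON d.salesperson_key = b.salesperson_key
WHERE d.name = ?
"""
print(conn.execute(query, ("Asha",)).fetchall())   # [] until rows are loaded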
Data Warehouse Models
Data Warehouse Modeling
Data warehouse modeling is the process of designing the schemas of the detailed and summarized information of the data warehouse. The goal of data warehouse modeling is to develop a schema describing reality, or at least a part of it, that the data warehouse is needed to support.
Data warehouse modeling is an essential stage of building a data warehouse for two main
reasons. Firstly, through the schema, data warehouse clients can visualize the relationships
among the warehouse data, to use them with greater ease. Secondly, a well-designed
schema allows an effective data warehouse structure to emerge, to help decrease the cost of
implementing the warehouse and improve the efficiency of using it.
Data modeling in data warehouses is different from data modeling in operational database
systems. The primary function of data warehouses is to support DSS processes. Thus, the
objective of data warehouse modeling is to make the data warehouse efficiently support
complex queries on long term information.
Data Warehouse Modeling
► Data Modeling Life Cycle
It is a straightforward process of transforming the business requirements to fulfill the goals for storing, maintaining, and accessing the data within IT systems. The result is a logical and physical data model for an enterprise data warehouse.
The objective of the data modeling life cycle is primarily the creation of a storage area
for business information. That area comes from the logical and physical data modeling
stages.
Data Warehouse Modeling
Conceptual Data Model
A conceptual data model recognizes the highest-level relationships between
the different entities.
Characteristics of the conceptual data model
It contains the essential entities and the relationships among them.
No attribute is specified.
No primary key is specified.
Data Warehouse Modeling
Logical Data Model
A logical data model defines the information in as much structure as possible, without regard to how it will be physically implemented in the database.
Features of a logical data model
• It involves all entities and relationships among them.
• All attributes for each entity are specified.
• The primary key for each entity is stated.
• Referential Integrity is specified (FK Relation).
► The phases for designing the logical data model are as follows:
• Specify primary keys for all entities.
• List the relationships between different entities.
• List all attributes for each entity.
• Normalization.
• No data types are listed
Data Warehouse Modeling
Physical Data Model
The physical data model describes how the model will be built in the database. A physical database model demonstrates all table structures, column names, data types, constraints, primary keys, foreign keys, and relationships between tables.
► Characteristics of a physical data model
• Specification of all tables and columns.
• Foreign keys are used to recognize relationships between tables.
► The steps for physical data model design which are as follows:
• Convert entities to tables.
• Convert relationships to foreign keys.
• Convert attributes to columns.
Data Warehouse Modeling
Types of Data Warehouse Models
Enterprise Data Warehouse (EDW)
► An enterprise data warehouse (EDW) is a database, or collection of databases, that
centralizes a business’s information from multiple sources and applications, and
makes it available for analytics and use across the organization. EDWs can be
housed in an on-premise server or in the cloud.
► The data stored in this type of digital warehouse can be one of a business’s most
valuable assets, as it represents much of what is known about the business, its
employees, its customers, and more.
► Enterprise data contains insights on customer behavior, spending, and revenue.
Modern data analysis and business intelligence (BI) involves integrating data from
disparate sources, and harnessing it for analysis and BI, usually with the aid of an
enterprise data warehouse (EDW).
► EDWs contain current data, such as real-time feeds or the latest snapshots from
source systems, as well as historical data.
Enterprise Data Warehouse (EDW)

Types of enterprise data warehouses


1) On-premises data warehouse (on-prem)
An on-premises (often called “on-prem”) data warehouse is a
data center using onsite systems and servers at a physical
location.
2) Cloud data warehouse
Rather than running on a local network using hardware, a cloud
data warehouse stores data in the cloud, using a distributed
network that the user doesn’t have to manage. The cloud service
provider manages the data infrastructure.
Enterprise Data Warehouse (EDW)

Enterprise data warehouse benefits


Marketing use cases for EDWs
Enterprise data warehouses offer marketers a wealth of actionable insights to help
them understand customers, observe their behavior, and improve marketing across
every channel.
Sales use cases for EDWs
Customer data from an EDW sets up sales teams for success by showing them
accounts that are primed for outreach and ready to buy. When reps are doing
outbound sales, data offers a full picture of who an individual is and their role in the
decision-making process.
Customer service use cases for EDWs
For instance, when a customer reaches out via chat, email, or phone with a concern or challenge, an EDW offers a unified data view for the support team.
Enterprise Data Warehouse (EDW)
Components of an enterprise data warehouse
Data sources. These include all of the sources that send data to the EDW, including
enterprise resource planning (ERP) systems (such as Oracle), customer relationship
management platforms (such as Salesforce), marketing channels like social media and
websites, SQL databases (such as MySQL), product data, and others.
Staging area. In the staging area, data is transformed before being loaded into the
EDW. The data is aggregated and cleaned, ensuring it’s in the correct format — for
most warehouses, this is a tabular format (or in rows and columns) so that SQL can
query the data. Then, it’s ready for storage and analysis.
Storage layer. From the staging area, data is loaded into the storage layer. The
storage layer includes a metadata module that tells administrators details about the
data, such as its source. The EDW itself and any smaller data marts each receive
metadata within the storage layer.
Enterprise Data Warehouse (EDW)
Components of an enterprise data warehouse
Presentation space. For teams to be able to access, analyze, and activate the data
they need a unified interface or dashboard. Also known as the access space, the
presentation space lets business users run reports and analytics, enables data sharing,
provides a dashboard for data visualization, and offers in-depth insights that teams
need to make decisions.
Enterprise data warehouse architecture
One-tier architecture. This structure has a data source layer, warehouse layer, and
presentation layer. Although it can improve data quality, it limits the amount of data
an organization can store.
Two-tier architecture. This structure includes the layers of one-tier architecture, but
it adds a staging area between the source layer and data warehouse to clean data and
ensure it’s in the correct format
Enterprise Data Warehouse (EDW)
Components of an enterprise data warehouse
Three-tier architecture. Most organizations use this structure, which includes both a
staging area and data marts. Three-tier architecture offers scalability and faster
development than other structures.
Benefits of an enterprise data warehouse
• An EDW improves an organization’s ability to deliver insights faster and confers a
competitive advantage to businesses.
• Business units across an organization can replicate data from multiple sources into
the repository for analysis and BI. This removes communication bottlenecks and
helps users access the information they need faster.
• The careful data modeling required to put together an EDW results in data that is
better suited for analytics.
Data Mart
A data mart includes a subset of corporate-wide data that is of value to a specific collection of
users. The scope is confined to particular selected subjects.
For example, a marketing data mart may restrict its subjects to the customer, items, and
sales. The data contained in the data marts tend to be summarized.
Data Marts is divided into two parts:
► Independent Data Mart: An independent data mart is sourced from data captured from one or more operational systems or external data providers, or from data generated locally within a particular department or geographic area.
► Dependent Data Mart: Dependent data marts are sourced directly from enterprise data warehouses.
Data Marts are analytical record stores designed to focus on particular business functions for
a specific community within an organization.
Data marts are derived from subsets of data in a data warehouse, though in the bottom-up
data warehouse design methodology, the data warehouse is created from the union of
organizational data marts.
The fundamental use of a data mart is Business Intelligence (BI) applications.
Data Mart
BI is used to gather, store, access, and analyze records. It can be used by smaller businesses to utilize the data they have accumulated, since it is less expensive than implementing a data warehouse.
Data Mart
Reasons for creating a data mart
• Creates collective data by a group of users
• Easy access to frequently needed data
• Ease of creation
• Improves end-user response time
• Lower cost than implementing a complete data warehouses
• Potential clients are more clearly defined than in a comprehensive data warehouse
• It contains only essential business data and is less cluttered.
Dependent Data Marts
A dependent data mart is a logical subset or a physical subset of a larger data warehouse. According to this technique, the data marts are treated as subsets of a data warehouse.
Data Mart
In this technique, firstly a data warehouse is created, from which further various data marts can be created. These data marts are dependent on the data warehouse and extract the essential records from it. In this technique, as the data warehouse creates the data marts, there is no need for data mart integration. It is also known as a top-down approach.
Data Mart
Independent Data Marts
The second approach is independent data marts (IDM). Here, firstly independent data marts are created, and then a data warehouse is designed using these multiple independent data marts.
In this approach, as all the data marts are designed independently; therefore, the
integration of data marts is required.
It is also termed as a bottom-up approach as the data marts are integrated to
develop a data warehouse.
Data Mart
Steps in Implementing a Data Mart
Designing
The design step is the first in the data mart process. This phase covers all of the
functions from initiating the request for a data mart through gathering data about the
requirements and developing the logical and physical design of the data mart.
1. Gathering the business and technical requirements
2. Identifying data sources
3. Selecting the appropriate subset of data
4. Designing the logical and physical architecture of the data mart.
Constructing
This step contains creating the physical database and logical structures associated with the
data mart to provide fast and efficient access to the data.
Data Mart
1. Creating the physical database and logical structures such as tablespaces
associated with the data mart.
2. Creating the schema objects such as tables and indexes described in the design step.
3. Determining how best to set up the tables and access structures.
Populating
► This step includes all of the tasks related to the getting data from the source,
cleaning it up, modifying it to the right format and level of detail, and moving it
into the data mart.
It involves the following tasks:
1. Mapping data sources to target data sources
2. Extracting data
3. Cleansing and transforming the information.
4. Loading data into the data mart
5. Creating and storing metadata
Data Mart
Accessing
This step involves putting the data to use: querying the data, analyzing it, creating
reports, charts and graphs and publishing them.
It involves the following tasks:
1. Set up an intermediate layer (meta layer) for the front-end tools to use. This layer translates database operations and object names into business terms so that end clients can interact with the data mart using words that relate to the business functions.
2. Set up and manage database structures, like summarized tables, which help queries submitted through the front-end tools execute rapidly and efficiently.
Managing
This step contains managing the data mart over its lifetime. In this step, management functions are performed, such as:
Data Mart
► Providing secure access to the data.
► Managing the growth of the data.
► Optimizing the system for better performance.
► Ensuring the availability of data even with system failures.
Difference between Data Warehouse and Data Mart
► A Data Warehouse is a vast repository of information collected from various organizations or departments within a corporation; a data mart is only a subtype of a data warehouse, architected to meet the requirements of a specific user group.
► A data warehouse may hold multiple subject areas; a data mart holds only one subject area, for example Finance or Sales.
► A data warehouse holds very detailed information; a data mart may hold more summarized data.
► A data warehouse works to integrate all data sources; a data mart concentrates on integrating data from a given subject area or set of source systems.
► In a data warehouse, a fact constellation schema is used; in a data mart, the Star Schema and Snowflake Schema are used.
► A data warehouse is a centralized system; a data mart is a decentralized system.
► Data warehousing is data-oriented; a data mart is project-oriented.

Virtual Data Warehouse

► Virtual warehousing involves creating a digital replica of physical storage facilities that accurately represents all essential information about the products housed within them.
► A virtual warehouse on Snowflake comprises database servers for the smooth execution of user queries. You can provision them based on your business requirements. If you have an on-premise database, the hardware deployment will remain fixed.
► A virtual warehouse is a process for collecting and managing data from different
sources. You can use it typically to connect and analyze data from heterogeneous
sources.
Virtual Data Warehouse

► Virtual warehousing is utilized by businesses across various industries to streamline their inventory management processes.
► Organizations focusing on disaster relief operations also recognize the value
offered by virtual warehousing implementation
BENEFITS OF VIRTUAL WAREHOUSING
• Automation: Virtual warehousing automates various inventory management
tasks, such as stock replenishment and picking. This reduces the need for manual
labor and eliminates human error.
• Improved Data Analytics: Monitoring the performance of multiple physical
warehouses from one source can provide valuable insights into the overall
inventory health, allowing businesses to analyze trends and optimize stocks
accordingly.
Virtual Data Warehouse

• Reduced Storage Costs: By managing stocks more efficiently through an automated system, companies can reduce storage costs associated with traditional warehousing.
• Increased Agility: Virtual warehousing enables businesses to easily adapt to
changing customer needs, enabling them to deliver products quickly and
efficiently while providing better customer service.
CHALLENGES OF USING A VIRTUAL WAREHOUSE
• Data Security: Virtual warehouses contain highly sensitive data. Thus, companies
need to deploy robust security measures to ensure this information remains
secure.
• Integration: Businesses may find it difficult to integrate the necessary systems
into their operations, and many companies need to hire skilled personnel or
consultancies to guide them through the process.
• Initial Setup Costs: Setting up a virtual warehouse requires an initial investment
of time and resources, which can be costly for some businesses.
Hybrid Data Warehouse.

Data warehouse architecture continues to evolve as users modernize their warehouses and as vendor, open source, and cloud provider communities roll out innovative new platforms.
There are good reasons for creating a hybrid architecture for your modern data
warehouse. However, modern data warehouse teams and their colleagues need some
rules and strategies to guide the design and use of such architectures. To that end, the
modern data warehouse can be compartmentalized in various ways to group data sets
and workloads for platform assignment and distribution in hybrid architectures.
Lake use cases typically involve many forms of advanced analytics based on mining,
statistics, natural language processing, machine learning, and predictive analytics.
Data Mining
► Data mining is the process of extracting knowledge or insights from large amounts
of data using various statistical and computational techniques.
► The data can be structured, semi-structured or unstructured, and can be stored in
various forms such as databases, data warehouses, and data lakes.
► The primary goal of data mining is to discover hidden patterns and relationships in
the data that can be used to make informed decisions or predictions.
► This involves exploring the data using various techniques such as clustering,
classification, regression analysis, association rule mining, and anomaly detection.
► Data mining has a wide range of applications across various industries, including
marketing, finance, healthcare, and telecommunications.
► For example, in marketing, data mining can be used to identify customer segments
and target marketing campaigns, while in healthcare, it can be used to identify risk
factors for diseases and develop personalized treatment plans.
Data Mining
Types of Data Mining
Relational Database: A relational database is a collection of multiple data sets formally organized by tables, records, and columns from which data can be accessed in various ways without having to reorganize the database tables. Tables convey and share information, which facilitates data searchability, reporting, and organization.
Data warehouses: A Data Warehouse is the technology that collects data from various sources within the organization to provide meaningful business insights. The huge amount of data comes from multiple places such as Marketing and Finance. The extracted data is utilized for analytical purposes and helps in decision-making for a business organization. The data warehouse is designed for the analysis of data rather than transaction processing.
Data Repositories: The Data Repository generally refers to a destination for data storage. However, many IT professionals utilize the term more specifically to refer to a particular kind of setup within an IT structure, for example, a group of databases where an organization has kept various kinds of information.
Types of Data Mining
Object-Relational Database: A combination of an object-oriented database model and a relational database model is called an object-relational model. It supports classes, objects, inheritance, etc.
Transactional Database: A transactional database refers to a database management system (DBMS) that has the ability to undo a database transaction if it is not performed appropriately. Although this was once a unique capability, today most relational database systems support transactional database activities.
Advantages of Data Mining
► The Data Mining technique enables organizations to obtain knowledge-based data.
► Data mining enables organizations to make lucrative modifications in operation
and production.
► Compared with other statistical data applications, data mining is cost-efficient.
► Data mining helps the decision-making process of an organization.
► It facilitates the automated discovery of hidden patterns as well as the prediction of trends and behaviors.
► It can be introduced into new systems as well as existing platforms.
► It is a quick process that makes it easy for new users to analyze enormous amounts of data in a short time.
Disadvantages of Data Mining
► There is a risk that organizations may sell useful customer data to other organizations for money. According to some reports, American Express has sold credit card purchase data of its customers to other organizations.
► Much data mining analytics software is difficult to operate and requires advanced training to work with.
► Different data mining instruments operate in distinct ways due to the different algorithms used in their design. Therefore, selecting the right data mining tool is a very challenging task.
► Data mining techniques are not always precise, which may lead to serious consequences in certain conditions.
Data Mining Applications
► Data mining is primarily used by organizations with intense consumer demands, such as retail, communication, financial, and marketing companies, to determine price, consumer preferences, product positioning, impact on sales, customer satisfaction, and corporate profits.
► Data mining enables a retailer to use point-of-sale records of customer purchases
to develop products and promotions that help the organization to attract the
customer.
Data Mining Applications
Data Mining in Healthcare: Data mining in healthcare has excellent potential to improve the health system. It uses data and analytics for better insights and to identify best practices that will enhance health care services and reduce costs.
Data Mining in Market Basket Analysis: Market basket analysis is a modeling method based on the hypothesis that if you buy a specific group of products, you are more likely to buy another group of products.
Data Mining in Education: Educational data mining is a newly emerging field concerned with developing techniques that explore knowledge from the data generated in educational environments.
Data Mining in CRM (Customer Relationship Management): Customer Relationship Management (CRM) is all about acquiring and retaining customers, enhancing customer loyalty, and implementing customer-oriented strategies.
Challenges of Implementation in Data mining
Although data mining is very powerful, it faces many challenges during its execution. These challenges can relate to performance, data, methods, techniques, and so on. The data mining process becomes effective when the challenges or problems are correctly recognized and adequately resolved.
Challenges of Implementation in Data mining
Incomplete and noisy data: Data mining is the process of extracting useful information from large volumes of data, and real-world data is heterogeneous, incomplete, and noisy.
Data Distribution: Real-world data is usually stored on various platforms in a distributed computing environment. It might be in databases, individual systems, or even on the internet. In practice, it is quite a tough task to bring all the data into a centralized data repository, mainly due to organizational and technical concerns.
Complex Data: Real-world data is heterogeneous, and it could be multimedia data,
including audio and video, images, complex data, spatial data, time series, and so on.
Managing these various types of data and extracting useful information is a tough task.
Most of the time, new technologies, new tools, and methodologies would have to be
refined to obtain specific information.
Data Privacy and Security: Data mining usually leads to serious issues in terms of
data security, governance, and privacy. For example, if a retailer analyzes the details of
the purchased items, then it reveals data about buying habits and preferences of the
customers without their permission.
Knowledge Discovery in Data (KDD)
► The term KDD stands for Knowledge Discovery in Databases. It refers to the broad
procedure of discovering knowledge in data and emphasizes the high-level
applications of specific Data Mining techniques.
► It is a field of interest to researchers in various fields, including artificial
intelligence, machine learning, pattern recognition, databases, statistics,
knowledge acquisition for expert systems, and data visualization.
► The main objective of the KDD process is to extract information from data in the
context of large databases. It does this by using Data Mining algorithms to identify
what is deemed knowledge.
► Knowledge Discovery in Databases is considered an automated, exploratory analysis and modeling of vast data repositories. KDD is the organized procedure of recognizing valid, useful, and understandable patterns from huge and complex data sets.
► Data Mining is the core of the KDD process, involving the inference of algorithms that explore the data, develop the model, and discover previously unknown patterns.
The KDD Process
► The knowledge discovery process (illustrated in the given figure) is iterative and interactive and comprises nine steps. The process is iterative at each stage, implying that moving back to previous steps might be required.
► The process has many imaginative aspects in the sense that one cannot present a single formula or a complete scientific categorization of the correct decisions for each step and application type.
► The process begins with determining the KDD objectives and ends with the
implementation of the discovered knowledge.
The KDD Process
1. Building up an understanding of the application domain:
This is the initial preliminary step. It develops the scene for understanding what
should be done with the various decisions like transformation, algorithms,
representation, etc.
2. Choosing and creating a data set on which discovery will be performed:
Once the objectives are defined, the data that will be utilized for the knowledge discovery process should be determined. This involves finding out what data is accessible, obtaining additional important data, and then integrating all the data for knowledge discovery into one data set, including the attributes that will be considered for the process.
3. Preprocessing and cleansing:
In this step, data reliability is improved. It includes data cleansing, for example, handling missing values and removing noise or outliers.
4. Data Transformation:
In this stage, appropriate data for Data Mining is prepared and developed. Techniques here include dimension reduction (for example, feature selection, feature extraction, and record sampling) and attribute transformation (for example, discretization of numerical attributes and functional transformations).
The KDD Process

5. Prediction and description:
We are now ready to decide which kind of Data Mining to use, for example, classification, regression, clustering, etc. This mainly depends on the KDD objectives and on the previous steps.
6. Selecting the Data Mining algorithm:
This stage involves choosing the specific technique to be used for searching patterns, which may include multiple inducers. For example, considering precision versus understandability, the former is better with neural networks, while the latter is better with decision trees.
7. Utilizing the Data Mining algorithm:
At last, the implementation of the Data Mining algorithm is reached. In this stage, we
may need to utilize the algorithm several times until a satisfying outcome is obtained.
8. Evaluation:
In this step, we assess and interpret the mined patterns and rules with respect to the objectives defined in the first step. Here we also consider the preprocessing steps with regard to their impact on the Data Mining algorithm results.
The KDD Process
9. Using the discovered knowledge:
We are now ready to incorporate the knowledge into another system for further action. The knowledge becomes effective in the sense that we may make changes to the system and measure the effects.
Data Cleaning - In this step the noise and inconsistent data is removed.
Data Integration - In this step multiple data sources are combined.
Data Selection - In this step, data relevant to the analysis task are retrieved from the database.
Data Transformation - In this step data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations.
Data Mining - In this step intelligent methods are applied in order to extract data
patterns.
Pattern Evaluation - In this step, data patterns are evaluated.
Knowledge Presentation - In this step, knowledge is represented.
Kinds of Database
► Hierarchical databases
► Network databases
► Object-oriented databases
► Relational databases
► Cloud Database
► Centralized Database
► Operational Database
► NoSQL databases
Basic mining techniques
► Data mining includes the utilization of refined data analysis tools to find
previously unknown, valid patterns and relationships in huge data sets.
► These tools can incorporate statistical models, machine learning techniques, and
mathematical algorithms, such as neural networks or decision trees. Thus, data
mining incorporates analysis and prediction.
► Depending on various methods and technologies from the intersection of machine
learning, database management, and statistics, professionals in data mining have
devoted their careers to better understanding how to process and make
conclusions from the huge amount of data, but what are the methods they use to
make it happen?
Basic mining techniques
1. Classification:
This technique is used to obtain important and relevant information about data and
metadata. This data mining technique helps to classify data in different classes.
Classification of data mining frameworks as per the type of data sources mined:
This classification is as per the type of data handled, for example, multimedia, spatial data, text data, time-series data, World Wide Web data, and so on.
Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved, for example, object-oriented database, transactional database, relational database, and so on.
Classification of data mining frameworks as per the kind of knowledge discovered:
This classification depends on the types of knowledge discovered or data mining functionalities, for example, discrimination, classification, clustering, characterization, etc. Some frameworks tend to be comprehensive, offering several data mining functionalities together.
Basic mining techniques
Classification of data mining frameworks according to data mining techniques
used:
This classification is as per the data analysis approach utilized, such as neural
networks, machine learning, genetic algorithms, visualization, statistics, data
warehouse-oriented or database-oriented, etc.
The classification can also take into account the level of user interaction involved in the data mining procedure, such as query-driven systems, autonomous systems, or interactive exploratory systems.
2. Clustering:
Clustering is the division of information into groups of connected objects. Describing the data by a few clusters inevitably loses certain fine details but achieves simplification.
It models data by its clusters. Data modeling places clustering in a historical perspective rooted in statistics, mathematics, and numerical analysis.
Basic mining techniques
From a practical point of view, clustering plays an outstanding role in data mining applications, for example, scientific data exploration, text mining, information retrieval, spatial database applications, CRM, Web analysis, computational biology, medical diagnostics, and much more.
3. Regression:
Regression analysis is the data mining process used to identify and analyze the relationship between variables in the presence of other factors. It is used to estimate the likelihood of a specific variable. Regression is primarily a form of planning and modeling.
For example, we might use it to project certain costs, depending on other factors
such as availability, consumer demand, and competition. Primarily it gives the
exact relationship between two or more variables in the given data set.
Basic mining techniques
4. Association Rules:
This data mining technique helps to discover a link between two or more items. It
finds a hidden pattern in the data set.
Association rules are if-then statements that help show the probability of interactions between data items within large data sets in different types of databases. Association rule mining has several applications and is commonly used to help discover sales correlations in transactional data or in medical data sets.
5. Outlier detection:
This type of data mining technique relates to the observation of data items in the data set which do not match an expected pattern or expected behavior (a short sketch of this idea follows this list).
6. Sequential Patterns:
The sequential pattern is a data mining technique specialized for evaluating sequential
data to discover sequential patterns.
7. Prediction:
Prediction uses a combination of other data mining techniques such as trend analysis, clustering, classification, etc. It analyzes past events or instances in the right sequence to predict a future event.
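As a minimal illustration of outlier detection (technique 5 above), the sketch below flags values that fall outside the interquartile-range fence. The sample list and the 1.5 × IQR rule are assumptions chosen only for illustration, not a prescribed method.

import numpy as np

# Hypothetical numeric attribute with one unusually large value
values = np.array([10, 12, 11, 13, 12, 14, 11, 95])

# Interquartile-range (IQR) fence: a simple, common outlier rule
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print("Outliers:", outliers)  # flags 95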
Data Mining Issues
► Data mining is not an easy task, as the algorithms used can get very complex and data is not always available in one place.
► The data needs to be integrated from various heterogeneous data sources. These factors also create some issues.
1) Mining Methodology and User Interaction
2) Performance Issues
3) Diverse Data Types Issues
Data Mining Issues
Mining Methodology and User Interaction Issues.
Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore, it is necessary for data mining to cover a broad range of knowledge discovery tasks.
Data mining query languages and ad hoc data mining − A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
Pattern evaluation − The patterns discovered may not be interesting if they represent common knowledge or lack novelty, so measures of pattern interestingness are needed.
Data Mining Issues
Performance Issues
Efficiency and scalability of data mining algorithms − In order to effectively extract information from the huge amount of data in databases, data mining algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, wide distribution of data, and complexity of data mining methods motivate the development of parallel and distributed data mining algorithms.
Data Mining Issues
Diverse Data Types Issues
Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore, mining knowledge from them adds challenges to data mining.
Data Mining Metrics
Data mining is one of the forms of artificial intelligence that uses perception models, analytical models, and multiple algorithms to simulate the techniques of the human brain. Data mining helps machines make human-like decisions and choices.
Usefulness − Usefulness involves several metrics that tell us whether the model
provides useful data.
Return on Investment (ROI) − Data mining tools will find interesting patterns buried
inside the data and develop predictive models.
Access Financial Information during Data Mining − The simplest way to frame
decisions in financial terms is to augment the raw information that is generally mined
to also contain financial data.
Converting Data Mining Metrics into Financial Terms − A general data mining
metric is the measure of "Lift". Lift is a measure of what is achieved by using the
specific model or pattern relative to a base rate in which the model is not used.
Data Mining Metrics
Accuracy − Accuracy is a measure of how well the model correlates an outcome with the attributes in the data that has been provided.
Social Implications of Data Mining
Data mining is the process of finding useful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies including statistical and mathematical techniques.
Data mining systems are designed to promote the identification and classification of individuals into different groups or segments.
Privacy − Privacy is a loaded issue. In recent years privacy concerns have taken on a more important role in American society as merchants, insurance companies, and government agencies amass warehouses containing personal records.
Profiling − Data mining and profiling is a developing field that attempts to organize, understand, analyze, reason about, and use the explosion of data in this information age.
Unauthorized Use − Trends obtained through data mining that are intended for marketing or other ethical purposes can be misused.
Data Preprocessing
Data Mining Architecture
Data Source:
► The actual source of data is the Database, data warehouse, World Wide Web
(WWW), text files, and other documents. You need a huge amount of historical
data for data mining to be successful.
Different processes:
► Before passing the data to the database or data warehouse server, the data must
be cleaned, integrated, and selected. As the information comes from various
sources and in different formats, it can't be used directly for the data mining
procedure because the data may not be complete and accurate.
Database or Data Warehouse Server:
► The database or data warehouse server consists of the original data that is ready to be processed. The server is responsible for retrieving the relevant data based on the user's data mining request.
Data Mining Architecture
Data Mining Engine:
► The data mining engine is a major component of any data mining system. It
contains several modules for operating data mining tasks, including association,
characterization, classification, clustering, prediction, time-series analysis, etc.
Pattern Evaluation Module:
► The pattern evaluation module is primarily responsible for measuring how interesting a pattern is, using a threshold value. It collaborates with the data mining engine to focus the search on interesting patterns.
Graphical User Interface:
► The graphical user interface (GUI) module communicates between the data mining
system and the user. This module helps the user to easily and efficiently use the
system without knowing the complexity of the process.
Knowledge Base:
► The knowledge base is helpful in the entire process of data mining. It may be used to guide the search or to evaluate the interestingness of the resulting patterns.
Data Processing prerequisites
In data mining, data processing is a crucial step that involves transforming raw data
into a format suitable for analysis. Several prerequisites in data processing are
essential for effective data mining:
Data Cleaning: This involves detecting and correcting errors or inconsistencies in the
data. It may include handling missing values, dealing with outliers, and resolving
inconsistencies in data formats or units.
Data Integration: Data integration involves combining data from multiple sources
into a unified format.
Data Transformation: Data transformation involves converting data into a suitable
format for analysis.
Dimensionality Reduction: Dimensionality reduction techniques are used to reduce
the number of features or variables in the dataset while preserving the essential
information. This helps in simplifying the analysis, improving algorithm efficiency, and
avoiding the curse of dimensionality.
Data Processing prerequisites
Data Discretization: Data discretization involves converting continuous attributes
into discrete intervals or categories.
Data Sampling: Data sampling involves selecting a representative subset of the data
for analysis. This is particularly useful when dealing with large datasets or when
resources are limited.
Data Reduction: Data reduction techniques aim to reduce the size of the dataset
while preserving its essential characteristics. This may involve techniques such as
feature selection, instance selection, and clustering-based data reduction methods.
Data Normalization: Data normalization involves scaling the data to a standard range
to eliminate the effects of different scales or units in the data.
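To make the normalization step concrete, the short sketch below applies min-max scaling and z-score standardization to a small hypothetical column; the values are assumptions chosen only for illustration.

import numpy as np

# Hypothetical attribute values on a wide scale
x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization: rescales the values into the range [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: zero mean and unit standard deviation
z_score = (x - x.mean()) / x.std()

print("Min-max :", min_max)
print("Z-score :", z_score)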
Attributes and Data types
In data mining, understanding the different types of attributes or data types is
essential as it helps to determine the appropriate data analysis techniques to use.
1]Nominal Data:
This type of data is also referred to as categorical data. Nominal data represents data
that is qualitative and cannot be measured or compared with numbers. In nominal
data, the values represent a category, and there is no inherent order or hierarchy.
Examples of nominal data include gender, race, religion, and occupation. Nominal data
is used in data mining for classification and clustering tasks.
2]Ordinal Data:
This type of data is also categorical, but with an inherent order or hierarchy. Ordinal
data represents qualitative data that can be ranked in a particular order. For instance,
education level can be ranked from primary to tertiary, and social status can be
ranked from low to high. In ordinal data, the distance between values is not uniform.
Ordinal data is used in data mining for ranking and classification tasks.
Attributes and Data types
3]Binary Data:
This type of data has only two possible values, often represented as 0 or 1. Binary data
is commonly used in classification tasks, where the target variable has only two
possible outcomes. Examples of binary data include yes/no, true/false, and pass/fail.
Binary data is used in data mining for classification and association rule mining tasks.
4]Interval Data:
This type of data represents quantitative data with equal intervals between
consecutive values. Interval data has no absolute zero point, and therefore, ratios
cannot be computed. Examples of interval data include temperature, IQ scores, and
time. Interval data is used in data mining for clustering and prediction tasks.
5]Ratio Data:
This type of data is similar to interval data, but with an absolute zero point. In ratio
data, it is possible to compute ratios of two values, and this makes it possible to make
meaningful comparisons. Examples of ratio data include height, weight, and income.
Ratio data is used in data mining for prediction and association rule mining tasks.
Attributes and Data types
6]Text Data:
This type of data represents unstructured data in the form of text. Text data can be
found in social media posts, customer reviews, and news articles. Text data is used in
data mining for sentiment analysis, text classification, and topic modeling tasks.
Data Quality:
1.Incompleteness:
This refers to missing data or information in the dataset. Missing data can result from
various factors, such as errors during data entry or data loss during transmission.
2.Inconsistency:
This refers to conflicting or contradictory data in the dataset. Inconsistent data can
result from errors in data entry, data integration, or data storage.
3.Noise:
This refers to random or irrelevant data in the dataset. Noise can result from errors
during data collection or data entry.
Attributes and Data types
4.Outliers:
Outliers are data points that are significantly different from the other data points in
the dataset. Outliers can result from errors in data collection, data entry, or data
transmission.
5.Redundancy:
Redundancy refers to the presence of duplicate or overlapping data in the dataset.
Redundant data can result from data integration or data storage.
6.Data format:
This refers to the structure and format of the data in the dataset. Data may be in
different formats, such as text, numerical, or categorical. Preprocessing techniques,
such as data transformation and normalization, can be used to convert data into a
consistent format for analysis.
Statistical descriptions of data
Descriptive statistics act like a translator for data, helping to reveal the patterns and trends in the data.
Descriptive statistics use various tools, such as measures of central tendency and
variability and graphical representations, to present data in a meaningful and
accessible way.
Central tendency measures describe the typical value of a dataset, while variability measures show how spread out the data is. Graphical representations add a visual element, making it easier to spot patterns and trends.
Descriptive statistics are essential for analyzing and understanding data, providing a
foundation for complex statistical analysis in various fields, including business,
healthcare, and social sciences.
Statistical descriptions of data
Types of Descriptive Statistics
Descriptive statistics can be divided, in terms of data types, patterns, or characteristics, into Distribution (or frequency distribution), Central Tendency, and Variability (or Dispersion).
Frequency Distribution
► The number of times a specific value appears in the data is its frequency (f). A
variable's distribution is its frequency pattern or the collection of all conceivable
values and the frequencies corresponding to those values.
► Frequency tables or charts are used to represent frequency distributions. Both
categorical and numerical variables can be employed using frequency distribution
tables. Only class intervals, which will be detailed momentarily, should be used
with continuous variables. Bar graphs, pie charts, and line graphs may be found in
most statistical or graphical data presentation software programs. Quantile plots,
quantile-quantile plots, histograms, and scatter plots are more common methods
for displaying data summaries and distributions.
Statistical descriptions of data
Central Tendency Indicators: Mean, Median, and Mode
Mean
It is the sum of observations divided by the total number of observations. It is also
defined as average which is the sum divided by count.
Statistical descriptions of data
Mean = (Σx) / n, where x = observations and n = number of terms
import numpy as np
# Sample Data
arr = [5, 6, 11]
# Mean
mean = np.mean(arr)
print("Mean = ", mean)
Output :

Mean = 7.333333333333333
Statistical descriptions of data
Mode
It is the value that has the highest frequency in the given data set. The data set may
have no mode if the frequency of all data points is the same.
from scipy import stats
# sample Data
arr = [1, 2, 2, 3]
# Mode
mode = stats.mode(arr)
print("Mode = ", mode)
Output:
Mode = ModeResult(mode=array([2]), count=array([2]))
Statistical descriptions of data
Median
It is the middle value of the data set. It splits the data into two halves. If the number of
elements in the data set is odd then the center element is the median and if it is even
then the median would be the average of two central elements.
import numpy as np
# sample Data
arr = [1, 2, 3, 4]
# Median
median = np.median(arr)
print("Median = ", median)
Output:

Median = 2.5
Statistical descriptions of data
Measure of Variability.
Measures of variability are also termed measures of dispersion as it helps to gain
insights about the dispersion or the spread of the observations at hand.
Range
Variance
Standard deviation
Range
The range describes the difference between the largest and smallest data point in our
data set. The bigger the range, the more the spread of data and vice versa.
Range = Largest data value – smallest data value
Statistical descriptions of data
import numpy as np
# Sample Data
arr = [1, 2, 3, 4, 5]
# Finding Max
Maximum = max(arr)
# Finding Min
Minimum = min(arr)
# Difference Of Max and Min
Range = Maximum-Minimum
print("Maximum = {}, Minimum = {} and Range = {}".format(
Maximum, Minimum, Range))
Output:
Maximum = 5, Minimum = 1 and Range = 4
Statistical descriptions of data
Variance
It is defined as an average squared deviation from the mean. It is calculated by finding
the difference between every data point and the average which is also known as the
mean, squaring them, adding all of them, and then dividing by the number of data
points present in our data set.
Variance = Σ(x − mu)² / N, where x = observation under consideration, N = number of terms, and mu = mean. (Note that statistics.variance in the example below computes the sample variance, dividing by N − 1, which is why the result for [1, 2, 3, 4, 5] is 2.5 rather than 2.0.)
Statistical descriptions of data
import statistics
# sample data
arr = [1, 2, 3, 4, 5]
# variance
print("Var = ", (statistics.variance(arr)))
Output:

Var = 2.5
Statistical descriptions of data
Standard Deviation
It is defined as the square root of the variance. It is calculated by finding the Mean,
then subtracting each number from the Mean which is also known as the average, and
squaring the result.

Std = sqrt(Σ(x − mu)² / N), where x = observation under consideration, N = number of terms, and mu = mean
Statistical descriptions of data
import statistics
# sample data
arr = [1, 2, 3, 4, 5]

# Standard Deviation
print("Std = ", (statistics.stdev(arr)))
Output:
Std = 1.5811388300841898
Statistical descriptions of data
Inferential Statistics: In inferential statistics, predictions are made by taking any group
of data in which you are interested. It can be defined as a random sample of data taken
from a population to describe and make inferences about the population.
Statistical descriptions of data
► It makes inferences about the population using data drawn from the population.
► It allows us to compare data, and make hypotheses and predictions.
► It is used to explain the chance of occurrence of an event.
► It attempts to reach the conclusion about the population.
► It can be achieved by probability.
Distance and similarity measures
► Clustering consists of grouping certain objects that are similar to each other, it can
be used to decide if two items are similar or dissimilar in their properties.
► In a Data Mining sense, the similarity measure is a distance with dimensions
describing object features. That means if the distance among two data points is
small then there is a high degree of similarity among the objects and vice versa.
► The similarity is subjective and depends heavily on the context and application.
1. Euclidean Distance: Euclidean distance is considered the traditional metric for problems with geometry. It can be simply explained as the ordinary straight-line distance between two points. It is one of the most used measures in cluster analysis; one of the algorithms that uses this formula is K-means. Mathematically, it computes the square root of the sum of the squared differences between the coordinates of two objects.
Distance and similarity measures
2. Manhattan Distance: This determines the absolute difference among the pair of
the coordinates. Suppose we have two points P and Q to determine the distance
between these points we simply have to calculate the perpendicular distance of the
points from X-Axis and Y-Axis. In a plane with P at coordinate (x1, y1) and Q at (x2,
y2). Manhattan distance between P and Q = |x1 – x2| + |y1 – y2|
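A small sketch of both measures for two hypothetical points, using plain NumPy (the coordinates are made up for illustration):

import numpy as np

# Two hypothetical data points P and Q
p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(np.sum((p - q) ** 2))

# Manhattan distance: sum of the absolute coordinate differences
manhattan = np.sum(np.abs(p - q))

print("Euclidean =", euclidean)  # 5.0
print("Manhattan =", manhattan)  # 7.0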
Need for Preprocessing
► Data preprocessing is an important process of data mining.
► In this process, raw data is converted into an understandable format and made
ready for further analysis. The motive is to improve data quality and make it up to
mark for specific tasks.
Data cleaning: Data cleaning helps us remove inaccurate, incomplete, and incorrect data from the dataset.
An attribute's mean and median values can be used to replace missing values for normally and non-normally distributed data, respectively.
Tuples can be ignored if the dataset is quite large and many values are missing within a tuple.
Binning − This method handles noisy data to make it smooth. The data is divided into equal-sized bins, and smoothing methods are then applied within each bin.
Regression − Regression functions are used to smooth the data. Regression can be linear (one independent variable) or multiple (several independent variables).
Need for Preprocessing
Clustering − It is used for grouping similar data into clusters and for finding outliers.
Data integration: The process of combining data from multiple sources (databases, spreadsheets, text files) into a single dataset.
Data transformation:
In this part, the format or structure of the data is changed to make it suitable for the mining process. Methods for data transformation are −
Normalization − A method of scaling data to represent it in a specific smaller range (for example, -1.0 to 1.0).
Discretization − It helps reduce the data size by dividing continuous data into intervals (see the sketch at the end of this section).
Attribute Selection − To help the mining process, new attributes are derived from the given attributes.
Concept Hierarchy Generation − In this, attributes are generalized from a lower level to a higher level in the concept hierarchy.
Aggregation − In this, a summary of the data is stored, which depends upon the quality and quantity of the data, to make the result more optimal.
Need for Preprocessing
Data reduction: It helps increase storage efficiency and reduce data storage costs, making the analysis easier while producing almost the same results.
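As an illustration of the discretization and binning steps described in this section, the sketch below splits a hypothetical age column into equal-width intervals and into labelled concept-level categories using pandas; the bin edges and labels are assumptions chosen for illustration.

import pandas as pd

# Hypothetical continuous attribute
ages = pd.Series([22, 25, 31, 38, 45, 52, 61, 70])

# Equal-width binning into 3 intervals (binning/smoothing style)
equal_width = pd.cut(ages, bins=3)

# Discretization into labelled categories (concept-hierarchy style)
labelled = pd.cut(ages, bins=[0, 30, 55, 100],
                  labels=["young", "middle age", "senior"])

print(equal_width.value_counts())
print(labelled.value_counts())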
Handling Missing data
Missing values are data points that are absent for a specific variable in a dataset. They
can be represented in various ways, such as blank cells, null values, or special symbols
like “NA” or “unknown.”
Handling Missing data
Missing values can cause several problems. They can:
Reduce the sample size: This can decrease the accuracy and reliability of your analysis.
Introduce bias: If the missing data is not handled properly, it can bias the results of your analysis.
Make it difficult to perform certain analyses: Some statistical techniques require complete data for all variables, making them inapplicable when missing values are present.
Why Is Data Missing From the Dataset?
Data can be missing for many reasons like technical issues, human errors, privacy
concerns, data processing issues, or the nature of the variable itself. Understanding
the cause of missing data helps choose appropriate handling strategies and ensure the
quality of your analysis.
Handling Missing data
Types of Missing Values
Missing Completely at Random (MCAR): MCAR is a specific type of missing data in
which the probability of a data point being missing is entirely random and
independent of any other variable in the dataset.
Missing at Random (MAR): MAR is a type of missing data where the probability of a
data point missing depends on the values of other variables in the dataset, but not on
the missing variable itself.
Missing Not at Random (MNAR): MNAR is the most challenging type of missing data
to deal with. It occurs when the probability of a data point being missing is related to
the missing value itself.
Handling Missing data
Methods for Identifying Missing Data
.isnull() – Identifies missing values in a Series or DataFrame.
.notnull() – Checks for missing values in a pandas Series or DataFrame. It returns a boolean Series or DataFrame, where True indicates non-missing values and False indicates missing values.
.info() – Displays information about the DataFrame, including data types, memory usage, and the presence of missing values.
Handling Missing data

.isna() – Similar to .isnull(); returns True for missing values and False for non-missing values.
dropna() – Drops rows or columns containing missing values, based on custom criteria.
fillna() – Fills missing values with specific values, means, medians, or other calculated values.
replace() – Replaces specific values with other values, facilitating data correction and standardization.
drop_duplicates() – Removes duplicate rows based on specified columns.
unique() – Finds unique values in a Series or DataFrame.
Data Cleaning
► Data cleaning is an essential step in the data mining process. It is crucial to the
construction of a model. The step that is required, but frequently overlooked by
everyone, is data cleaning.
► The major problem with quality information management is data quality.
Problems with data quality can happen at any place in an information system. Data
cleansing offers a solution to these issues.
► Data cleaning is the process of correcting or deleting inaccurate, damaged,
improperly formatted, duplicated, or insufficient data from a dataset.
Steps for Cleaning Data
1. Remove duplicate or irrelevant observations
Remove duplicate and irrelevant or otherwise undesirable observations from your dataset. The majority of duplicate observations will occur during data gathering (a short pandas sketch covering steps 1–4 appears after step 5 below).
Data Cleaning
2. Fix structural errors
Structural errors are odd naming conventions, typos, or incorrect capitalization found when you measure or transfer data. These inconsistencies may result in mislabelled categories or classes. For instance, "N/A" and "Not Applicable" might both be present on any given sheet, but they ought to be analyzed under the same heading.
3. Filter unwanted outliers
There will frequently be isolated findings that, at first glance, do not seem to fit the
data you are analyzing. Removing an outlier if you have a good reason to, such as
incorrect data entry, will improve the performance of the data you are working with.
4. Handle missing data
Because many algorithms won't tolerate missing values, you can't overlook missing
data. There are a few options for handling missing data. While neither is ideal, both
can be taken into account, for example:
Although you can remove observations with missing values, doing so will result in the
loss of information, so proceed with caution.
Data Cleaning
5. Validate and QA
As part of fundamental validation, you ought to be able to respond to the following
queries once the data cleansing procedure is complete:
► Are the data coherent?
► Does the data abide by the regulations that apply to its particular field?
► Does it support or refute your working theory? Does it offer any new information?
► To support your next theory, can you identify any trends in the data?
► If not, is there a problem with the data's quality?
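The following sketch walks through steps 1–4 on a small hypothetical DataFrame: dropping duplicates, fixing a structural inconsistency, filtering an outlier, and filling a missing value. The column names, the injected problems, and the "amount below 1000" rule are all assumptions made for illustration.

import numpy as np
import pandas as pd

# Hypothetical raw records with typical quality problems
df = pd.DataFrame({
    "status": ["N/A", "Not Applicable", "ok", "ok", "ok"],
    "amount": [120.0, 115.0, 9000.0, np.nan, 110.0],
})
df = pd.concat([df, df.iloc[[2]]])  # inject a duplicate row

# 1. Remove duplicate observations
df = df.drop_duplicates()

# 2. Fix structural errors: unify inconsistent labels
df["status"] = df["status"].replace({"Not Applicable": "N/A"})

# 3. Filter an obvious outlier (assumed domain rule: amount below 1000)
df = df[df["amount"].isna() | (df["amount"] < 1000)]

# 4. Handle missing data: fill with the column median
df["amount"] = df["amount"].fillna(df["amount"].median())

print(df)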
Data Cleaning
Techniques for Cleaning Data
Ignore the tuples: This approach is not very practical, and it is suitable mainly when a tuple has many attributes with missing values.
Fill in the missing value: This strategy is also not very practical or effective, and it can be time-consuming. The missing value must be supplied, most commonly manually, but other options include using the attribute mean or the most likely value.
Binning method: This strategy is fairly easy to comprehend. The sorted data is smoothed using the values around it. The data is split into several equal-sized bins, and various smoothing techniques are then applied to finish the task.
Regression: With the use of a regression function, the data is smoothed out. Regression may be linear or multiple; multiple regression uses several independent variables, while linear regression uses only one.
Clustering: This technique groups the data into clusters; comparable values end up in the same "group" or "cluster", and clustering is then used to find the outliers.
Data Cleaning
import pandas as pd
import numpy as np
# Load the dataset
df = pd.read_csv('titanic.csv')
df.head()
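Continuing the same (assumed) titanic.csv example, a brief follow-up sketch shows how the inspection and imputation functions listed earlier are typically applied. The column names 'Age' and 'Cabin' are assumptions about the file's contents.

# Count missing values in each column
print(df.isnull().sum())

# Fill a numeric column's missing values with its median (assumed column name)
df['Age'] = df['Age'].fillna(df['Age'].median())

# Drop a column that is mostly missing (assumed column name)
df = df.drop(columns=['Cabin'])

# Drop any remaining rows that still contain missing values
df = df.dropna()
print(df.isnull().sum())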
Process of Data Cleaning
Monitoring the errors: Keep track of the areas where errors seem to occur most
frequently. It will be simpler to identify and maintain inaccurate or corrupt
information.
Standardize the mining process: To help lower the likelihood of duplication, standardize the point of data entry.
Validate data accuracy: Analyze the data and invest in data cleaning software; artificial intelligence-based tools can be used to thoroughly check for accuracy.
Data Cleaning
Scrub for duplicate data: To save time when analyzing data, identify duplicates. Analyzing and investing in separate data-cleaning tools that can process imperfect data in bulk and automate the operation makes it possible to avoid processing the same data repeatedly.
Research on data: Our data needs to be vetted, standardized, and duplicate-checked
before this action. There are numerous third-party sources, and these vetted and
approved sources can extract data straight from our databases.
Communicate with the team: Keeping the group informed will help with client
development and strengthening as well as giving more focused information to
potential clients.
Data Cleaning
Usage of Data Cleaning in Data Mining.
Data Cleaning
Characteristics of Data Cleaning
Data Cleaning
Accuracy: The business's database must contain only extremely accurate data.
Coherence: To ensure that the information on a person or body is the same
throughout all types of storage, the data must be consistent with one another.
Validity: There must be rules or limitations in place for the stored data. The
information must also be confirmed to support its veracity.
Uniformity: A database's data must all share the same units or values. Keeping the data uniform avoids complicating the process, which makes it a crucial component of the data cleansing process.
Data Verification: Every step of the process, including its appropriateness and
effectiveness, must be checked.
Clean Data Backflow: After quality issues are addressed, the cleaned data should flow back to replace the dirty data in the original sources, so that legacy applications can also benefit from it and a subsequent data-cleaning effort is avoided.
Data Cleaning
Tools for Data Cleaning in Data Mining
► OpenRefine
► Trifacta Wrangler
► Drake
► Data Ladder
► Data Cleaner
► Cloudingo
► Reifier
► IBM Infosphere Quality Stage
► TIBCO Clarity
► Winpure
Data Cleaning
Benefits of Data Cleaning
► Removal of inaccuracies when several data sources are involved.
► Clients are happier and employees are less annoyed when there are fewer
mistakes.
► The capacity to map out the many functions and the planned uses of your data.
► Monitoring mistakes and improving reporting make it easier to resolve inaccurate
or damaged data for future applications by allowing users to identify where issues
are coming from.
► Making decisions more quickly and with greater efficiency will be possible with
the use of data cleansing tools.
Data Transformation and Data Discretization
Data transformation in data mining refers to the process of converting raw data into a
format that is suitable for analysis and modeling.
Data cleaning: Removing or correcting errors, inconsistencies, and missing values in
the data.
Data integration: Combining data from multiple sources, such as databases and
spreadsheets, into a single format.
Data normalization: Scaling the data to a common range of values, such as between 0
and 1, to facilitate comparison and analysis.
Data reduction: Reducing the dimensionality of the data by selecting a subset of
relevant features or attributes.
Data discretization: Converting continuous data into discrete categories or bins.
Data Transformation and Data Discretization
The data transformation involves steps that are:
Smoothing: It is a process used to remove noise from the dataset using certain algorithms. It allows important features present in the dataset to be highlighted.
Aggregation: Data aggregation is the method of storing and presenting data in a summary format. The data may be obtained from multiple data sources and integrated into a single data analysis description.
Discretization: It is a process of transforming continuous data into a set of small intervals, for example (1-10, 11-20) or (age: young, middle age, senior). Most real-world Data Mining activities work better with such discretized attributes.
Generalization: It converts low-level data attributes to high-level data attributes using a concept hierarchy. For example, an age attribute initially in numerical form (22, 25) is converted into a categorical value (young, old).
Decimal Scaling: It normalizes the values of an attribute by changing the position of their decimal points: v' = v / 10^j, where j is the smallest integer such that the largest absolute scaled value is less than 1. The number of positions by which the decimal point is moved is therefore determined by the absolute maximum value of attribute A.
Data Transformation and Data Discretization
ADVANTAGES OR DISADVANTAGES:
Advantages of Data Transformation in Data Mining:
Improves Data Quality: Data transformation helps to improve the quality of data by
removing errors, inconsistencies, and missing values.
Facilitates Data Integration: Data transformation enables the integration of data from
multiple sources, which can improve the accuracy and completeness of the data.
Improves Data Analysis: Data transformation helps to prepare the data for analysis
and modeling by normalizing, reducing dimensionality, and discretizing the data.
Disadvantages of Data Transformation in Data Mining:
Time-consuming: Data transformation can be a time-consuming process, especially
when dealing with large datasets.
Complexity: Data transformation can be a complex process, requiring specialized
skills and knowledge to implement and interpret the results.
Data Transformation and Data Discretization
Data Loss: Data transformation can result in data loss, such as when discretizing
continuous data, or when removing attributes or features from the data.
Biased transformation: Data transformation can result in bias, if the data is not
properly understood or used.
High cost: Data transformation can be an expensive process, requiring significant
investments in hardware, software, and personnel.
Data Reduction in Data Mining
Data reduction techniques ensure the integrity of data while reducing the data.
Data reduction is a process that reduces the volume of original data and represents it
in a much smaller volume. Data reduction techniques are used to obtain a reduced
representation of the dataset that is much smaller in volume by maintaining the
integrity of the original data.
Techniques of Data Reduction
Association Rules Mining
Problem Definition
Problem Definition is probably one of the most complex and heavily neglected stages
in the big data analytics pipeline.
In order to define the problem a data product would solve, experience is mandatory.
Frequent item set generation
► Frequent item sets are a fundamental concept in association rule mining, a technique used in data mining to discover relationships between items in a dataset. The goal of association rule mining is to identify relationships between items in a dataset that frequently occur together.
► A frequent item set is a set of items that occur together frequently in a dataset. The
frequency of an item set is measured by the support count, which is the number of
transactions or records in the dataset that contain the item set. For example, if a
dataset contains 100 transactions and the item set {milk, bread} appears in 20 of
those transactions, the support count for {milk, bread} is 20.
Problem Definition
Frequent item sets and association rules can be used for a variety of tasks such as
market basket analysis, cross-selling and recommendation systems.
Association Mining searches for frequent items in the data set. In frequent mining
usually, interesting associations and correlations between item sets in transactional
and relational databases are found. In short, Frequent Mining shows which items
appear together in a transaction or relationship.
Need of Association Mining: Frequent mining is the generation of association rules
from a Transactional Dataset.
If there are 2 items X and Y purchased frequently then it’s good to put them together
in stores or provide some discount offer on one item on purchase of another item.
This can really increase sales. For example, it is likely to find that if a customer buys
Milk and bread he/she also buys Butter.
So the association rule is {milk, bread} => {butter}, and the seller can suggest that the customer buy butter if he/she buys Milk and Bread.
Problem Definition
Important Definitions :
Support : It is one of the measures of interestingness. This tells about the usefulness
and certainty of rules. 5% Support means total 5% of transactions in the database
follow the rule.
Support(A -> B) = Support_count(A ∪ B) / Total number of transactions
Confidence: A confidence of 60% means that 60% of the customers who purchased a
milk and bread also bought butter.
Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
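A minimal sketch, assuming a small hypothetical list of market-basket transactions, that counts support and confidence for the rule {milk, bread} -> {butter}:

# Hypothetical transactions, each a set of purchased items
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]

antecedent = {"milk", "bread"}
consequent = {"butter"}

n = len(transactions)
count_a = sum(1 for t in transactions if antecedent <= t)
count_ab = sum(1 for t in transactions if (antecedent | consequent) <= t)

support = count_ab / n            # fraction of all transactions containing every item
confidence = count_ab / count_a   # of those buying milk and bread, how many also buy butter
print(f"support = {support:.2f}, confidence = {confidence:.2f}")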
Problem Definition
Support_count(X): Number of transactions in which X appears. If X is A union B then
it is the number of transactions in which A and B both are present.
Maximal Itemset: An itemset is maximal frequent if none of its supersets are
frequent.
Closed Itemset: An itemset is closed if none of its immediate supersets has the same support count as the itemset.
K- Itemset: Itemset which contains K items is a K-itemset. So it can be said that an
itemset is frequent if the corresponding support count is greater than the minimum
support count.
Example On finding Frequent Itemsets – Consider the given dataset with given
transactions.
Problem Definition
► Let's say the minimum support count is 3.
► The relation that holds is: maximal frequent => closed => frequent (every maximal frequent itemset is closed, and every closed frequent itemset is frequent).
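Since the example's transaction table is not reproduced here, the sketch below uses a hypothetical set of transactions and a minimum support count of 3 to enumerate the frequent itemsets by brute force:

from itertools import combinations

# Hypothetical transactions; the original example's table is not shown here
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]
min_support_count = 3

items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    for candidate in combinations(items, k):
        count = sum(1 for t in transactions if set(candidate) <= t)
        if count >= min_support_count:
            frequent[candidate] = count

for itemset, count in frequent.items():
    print(itemset, "support count =", count)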
ADVANTAGES OR DISADVANTAGES:
Efficient discovery of patterns: Association rule mining algorithms are efficient at
discovering patterns in large datasets, making them useful for tasks such as market
basket analysis and recommendation systems.
Easy to interpret: The results of association rule mining are easy to understand and
interpret, making it possible to explain the patterns found in the data.
Disadvantages of using frequent item sets and association rule mining include:
Large number of generated rules: Association rule mining can generate a large
number of rules, many of which may be irrelevant or uninteresting, which can make it
difficult to identify the most important patterns.
Limited in detecting complex relationships: Association rule mining is limited in its
ability to detect complex relationships between items, and it only considers the
co-occurrence of items in the same transaction.
The APRIORI Principle
Apriori algorithm refers to the algorithm which is used to calculate the association
rules between objects.
It means how two or more objects are related to one another.
In other words, we can say that the apriori algorithm is an association rule learning
method that analyzes the fact that people who bought product A also bought product B.
Let us take an example to understand the concept better. You must have noticed that the
pizza shop seller offers a pizza, soft drink, and breadstick combo together, and gives a
discount to customers who buy this combo. Have you ever wondered why he does so? He knows
that customers who buy pizza also tend to buy soft drinks and breadsticks. By making combos,
he makes it easy for the customers and, at the same time, increases his sales.
The APRIORI Principle
What is Apriori Algorithm?
► Apriori algorithm refers to an algorithm that is used in mining frequent itemsets
and relevant association rules. Generally, the apriori algorithm operates on a
database containing a huge number of transactions, for example, the items
customers buy at a Big Bazar.
► Apriori algorithm helps the customers to buy their products with ease and
increases the sales performance of the particular store.
Components of Apriori algorithm
The given three components comprise the apriori algorithm.
► Support
► Confidence
► Lift
The APRIORI Principle
You need a huge database containing a large number of transactions. Suppose you have
4000 customer transactions in a Big Bazar. You have to calculate the Support,
Confidence, and Lift for two products, say Biscuits and Chocolates, because customers
frequently buy these two items together. Out of 4000 transactions, 400 contain Biscuits
and 600 contain Chocolates, and 200 transactions contain both Biscuits and Chocolates.
Using this data, we will find the support, confidence, and lift.
Support
Support refers to the default popularity of any product. You find the support as a
quotient of the division of the number of transactions comprising that product by the
total number of transactions. Hence, we get

Support (Biscuits) = (Transactions relating biscuits) / (Total transactions)

= 400/4000 = 10 percent.
The APRIORI Principle
Confidence
Confidence refers to the likelihood that customers who bought biscuits also bought
chocolates. So, you need to divide the number of transactions that contain both biscuits
and chocolates by the number of transactions that involve biscuits to get the confidence.
Hence,
Confidence (Biscuits -> Chocolates) = (Transactions containing both Biscuits and Chocolates) / (Total transactions
involving Biscuits)
= 200/400

= 50 percent.

It means that 50 percent of customers who bought biscuits bought chocolates also
The APRIORI Principle
Lift
Consider the above example: lift refers to the increase in the ratio of the sale of
chocolates when you sell biscuits. The mathematical equation of lift is given below.
Lift (Biscuits -> Chocolates) = Confidence (Biscuits -> Chocolates) / Support (Biscuits)
= 50/10 = 5
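The same calculation can be written out in a few lines of Python; a minimal sketch using only the counts given above:

# Counts from the Big Bazar example above
total_transactions = 4000
biscuit_count = 400      # transactions containing Biscuits
chocolate_count = 600    # transactions containing Chocolates
both_count = 200         # transactions containing both items

support_biscuits = biscuit_count / total_transactions   # 0.10 -> 10 percent
confidence = both_count / biscuit_count                  # 0.50 -> 50 percent
lift = confidence / support_biscuits                     # 5.0, as computed above
# Note: many references define lift with the support of the consequent
# (here, Chocolates) in the denominator; this slide uses Support(Biscuits).
print(support_biscuits, confidence, lift)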
Advantages of Apriori Algorithm
It can be used to find large (frequent) itemsets.
It is simple to understand and apply.
The APRIORI Principle
Disadvantages of Apriori Algorithms
The Apriori algorithm is an expensive method for finding support, since the calculation has
to scan the whole database repeatedly.
It can generate a huge number of candidate itemsets and rules, which makes it
computationally expensive.
FP Growth Algorithm in Data Mining
The FP-Growth Algorithm is an alternative way to find frequent item sets without
using candidate generations, thus improving performance.
To do so, it uses a divide-and-conquer strategy.
The core of this method is the usage of a special data structure named
frequent-pattern tree (FP-tree), which retains the item set association information.
This algorithm works as follows:
► First, it compresses the input database creating an FP-tree instance to represent
frequent items.
► After this first step, it divides the compressed database into a set of conditional
databases, each associated with one frequent pattern.
► Finally, each such conditional database is mined separately (a code sketch follows below).
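A minimal sketch of running FP-Growth in Python, assuming the third-party mlxtend library is installed; the transactions are hypothetical:

# pip install mlxtend   (assumed to be available)
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Hypothetical transactions
transactions = [
    ["milk", "bread", "butter"],
    ["milk", "bread"],
    ["bread", "butter"],
    ["milk", "bread", "butter"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Mine frequent itemsets with minimum support 50% using the FP-tree based algorithm
frequent_itemsets = fpgrowth(onehot, min_support=0.5, use_colnames=True)
print(frequent_itemsets)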
FP Growth Algorithm in Data Mining
Maximal Frequent item set, closed frequent item set

A maximal frequent itemset is a frequent itemset for which none of its immediate
supersets are frequent.
In the accompanying itemset-lattice figure, the support counts are shown at the top left of
each node. Assume the minimum support threshold is 50%, that is, each itemset must occur in
2 or more transactions. Based on that threshold, the frequent itemsets are a, b, c, d, ab,
ac, and ad (shaded nodes).
Out of these 7 frequent itemsets, 3 are identified as maximal frequent (having a red
outline):
ab: Immediate supersets abc and abd are infrequent.
ac: Immediate supersets abc and acd are infrequent.
ad: Immediate supersets abd and acd are infrequent.
Maximal Frequent item set, closed frequent item set
Classification And Prediction
We use classification and prediction to extract a model, representing the data classes
to predict future data trends.
Classification predicts categorical class labels for the data using models built from it.
This analysis helps us understand the data at a large scale.
Classification models predict categorical class labels, and prediction models predict
continuous-valued functions.
For example, we can build a classification model to categorize bank loan applications
as either safe or risky or a prediction model to predict the expenditures in dollars of
potential customers on computer equipment given their income and occupation.
Classification And Prediction
What is Classification?
Classification is to identify the category or the class label of a new observation.
First, a set of data is used as training data.
The set of input data and the corresponding outputs are given to the algorithm.
So, the training data set includes the input data and their associated class labels.
Using the training dataset, the algorithm derives a model or the classifier.
The derived model can be a decision tree, mathematical formula, or a neural network.
In classification, when unlabeled data is given to the model, it should find the class to
which it belongs.
The new data provided to the model is the test data set.
Classification And Prediction
Classification is the process of assigning a record to one of a predefined set of classes.
One simple example of classification is checking whether it is raining or not.
The answer can either be yes or no.
So, there is a fixed number of choices. Sometimes there can be more than two
classes to classify; that is called multiclass classification.
How does Classification Works?
Developing the Classifier or model creation: This level is the learning stage or the
learning process. The classification algorithms construct the classifier in this stage. A
classifier is constructed from a training set composed of the records of databases and
their corresponding class names.
Applying classifier for classification: The classifier is used for classification at this
level. The test data are used here to estimate the accuracy of the classification
algorithm. If the accuracy is deemed sufficient, the classification rules can be
applied to new data records.
Classification And Prediction
Data Classification Process: The data classification process can be categorized into
five steps:
Define the goals, strategy, workflows, and architecture of data classification.
Classify the confidential data that we store.
Label the data using data labelling (marking).
Use the results to improve security and compliance.
Data keeps changing, so classification is a continuous process.
What is Data Classification Lifecycle?
Origin: Sensitive data is produced in various formats, such as emails, Excel, Word, Google
documents, social media, and websites.
Role-based practice: Role-based security restrictions are applied to all sensitive data by
tagging it based on in-house protection policies and compliance rules.
Storage: The collected data is stored here, with access controls and encryption.
Classification And Prediction
Sharing: Data is continually distributed among agents, consumers, and co-workers
from various devices and platforms.
Archive: Here, data is eventually archived within an industry's storage systems.
Publication: Through publication, data can reach customers, who can then view and download
it in the form of dashboards.
What is Prediction?
Another process of data analysis is prediction. It is used to find a numerical output.
Same as in classification, the training dataset contains the inputs and corresponding
numerical output values.
Model construction
Data mining uses raw data to extract information and present it uniquely.
The data mining process is usually found in the most diverse range of applications,
including business intelligence studies, political model forecasting, web ranking
forecasting, weather pattern model forecasting, etc.
In business operation intelligence studies, business experts mine huge data sets
related to a business operation or a market and try to discover previously
unrecognized trends and relationships.
Data mining is also used in organizations that utilize big data as a raw data source to
extract the required data.
What are data mining models?
A data mining model refers to a method that is used to present information and the various
ways in which that information can be applied to specific questions and problems.
As per the specialists, the data mining regression model is the most commonly used
data mining model. In this process, a mining expert first analyzes the data sets and
creates a formula that defines them. Various Financial market analysts use this model
to make predictions related to prices and market trends.
Model construction
Types of data mining models
Predictive data mining models
A predictive data mining model predicts the values of data using known results
gathered from the different data sets.
Predictive modeling cannot be classified as a separate discipline; it occurs in all
organizations and industries across all disciplines.
The main objective of predictive data mining models is to predict the future based on
past data, generally, but not always, using statistical modeling.
Model construction
Descriptive model
A descriptive model differentiates the patterns and relationships in data.
A descriptive model does not attempt to generalize to a statistical population or
random process.
A predictive model attempts to generalize to a population or random process.
Predictive models should give prediction intervals and must be cross-validated; that
is, they must prove that they can be used to make predictions with data that was not
used in constructing the model.
Choosing algorithm
Machine Learning is the field of study that gives computers the capability to learn
without being explicitly programmed.
ML is one of the most exciting technologies that one would have ever come across.
A machine-learning algorithm is a program that has a particular way of altering its
own parameters, given feedback on its past predictions on the data set.
Simple Steps to Choose Best Machine Learning Algorithm
Understand Your Problem: Begin by gaining a deep understanding of the problem
you are trying to solve. What is your goal? Is the problem about classification,
regression, clustering, or something else? What kind of data are you working with?
Process the Data: Ensure that your data is in the right format for your chosen
algorithm. Process and prepare your data by cleaning it, handling missing values, and
encoding features.
Exploration of Data: Conduct exploratory data analysis to gain insights into your data.
Visualizations and statistics help you understand the relationships within your data.
Choosing algorithm
Metrics Evaluation: Decide on the metrics that will measure the success of the model.
Choose metrics that align with your problem.
Simple Models: Begin with simple, easy-to-interpret algorithms. For
classification, try logistic regression or a decision tree. Simple models provide a baseline
for comparison.
Use Multiple Algorithms: Try multiple algorithms to check which one performs best
on your dataset.
Decision tree Induction
Decision Tree is a supervised learning method used in data mining for classification
and regression tasks.
It is a tree-like structure that helps us in decision-making.
The decision tree creates classification or regression models as a tree structure.
It separates a data set into smaller subsets, and at the same time, the decision tree is
steadily developed.
The final tree is a tree with the decision nodes and leaf nodes.
Key factors:
Entropy:
Entropy refers to a common way to measure impurity. In the decision tree, it
measures the randomness or impurity in data sets.
Information Gain:
Information Gain refers to the decline in entropy after the dataset is split on an
attribute. It is also called Entropy Reduction. Building a decision tree is all about
discovering the attributes that return the highest information gain.
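As a small illustration of these two measures, the sketch below computes the entropy of a node and the information gain of a candidate split; the class counts used are hypothetical:

from math import log2

def entropy(class_counts):
    # Entropy of a node given the count of examples in each class
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

# Hypothetical parent node: 9 positive and 5 negative examples
parent = [9, 5]
# Hypothetical split of the parent into two child nodes
children = [[6, 1], [3, 4]]

n = sum(parent)
weighted_child_entropy = sum(sum(c) / n * entropy(c) for c in children)
information_gain = entropy(parent) - weighted_child_entropy
print(round(entropy(parent), 3), round(information_gain, 3))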
Decision tree Induction
Why are decision trees useful?
It enables us to analyze the possible consequences of a decision thoroughly.
It provides us a framework to measure the values of outcomes and the probability of
accomplishing them.
It helps us to make the best decisions based on existing data and best speculations.
Decision tree Algorithm:
The decision tree algorithm may appear long, but it is quite simple. The basic algorithm
technique is as follows:
The algorithm is based on three parameters: D, attribute_list, and
Attribute_selection_method.
Generally, we refer to D as a data partition.
Initially, D is the entire set of training tuples and their associated class labels (the
input training data).
Decision tree Induction
The parameter attribute_list is a set of attributes describing the tuples.
Attribute_selection_method specifies a heuristic procedure for selecting the attribute
that "best" discriminates the given tuples according to class.
The Attribute_selection_method procedure applies an attribute selection measure such as
information gain; a scikit-learn sketch is shown below.
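In practice, a decision tree classifier can be induced in a few lines with scikit-learn; a minimal sketch on the library's built-in Iris dataset (not tied to the parameter names used above):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# criterion='entropy' makes information gain the attribute selection measure
tree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))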
Advantages of using decision trees:
A decision tree does not need scaling of the data.
Missing values in the data also do not influence the process of building a decision tree to
any considerable extent.
A decision tree model is intuitive and simple to explain to the technical team as well
as to stakeholders.
Compared to other algorithms, decision trees need less effort for data preparation
during pre-processing.
A decision tree does not require standardization of the data.
Bayesian Classification
In numerous applications, the connection between the attribute set and the class
variable is non-deterministic.
In other words, the class label of a test record cannot be predicted with certainty even
though its attribute set is the same as that of some of the training examples.
Bayesian classifiers are statistical classifiers based on Bayesian probability.
The theory expresses how a degree of belief, expressed as a probability, should be updated
in the light of evidence.
Bayesian classification uses Bayes' theorem to predict the occurrence of an event.
Bayesian Classification
P(X|Y) = P(Y|X) * P(X) / P(Y)
where X and Y are events and P(Y) ≠ 0.
P(X|Y) is the conditional probability that describes the occurrence of event X given
that Y is true.
P(Y|X) is the conditional probability that describes the occurrence of event Y given
that X is true.
P(X) and P(Y) are the probabilities of observing X and Y independently of each other;
these are known as the marginal probabilities.
Bayesian interpretation:
In the Bayesian interpretation, probability determines a "degree of belief.“
Bayes theorem connects the degree of belief in a hypothesis before and after
accounting for evidence.
For example, let us consider a coin. If we toss a coin, we get either heads or tails, and
the chance of occurrence of either heads or tails is 50%. If the coin is flipped a number
of times and the outcomes are observed, the degree of belief may rise, fall, or remain the
same depending on the outcomes.
Bayesian Classification
P(X), the prior, is the initial degree of belief in X.
P(X|Y), the posterior, is the degree of belief after having accounted for Y.
The quotient P(Y|X) / P(Y) represents the support Y provides for X.
Bayes' theorem can be derived from conditional probability:
P(X|Y) = P(X⋂Y) / P(Y) and P(Y|X) = P(X⋂Y) / P(X),
where P(X⋂Y) is the joint probability of both X and Y being true. Combining the two gives
P(X|Y) = P(Y|X) * P(X) / P(Y).
Naïve Bayes classifier
Naive Bayes classifiers are a collection of classification algorithms based on Bayes’
Theorem.
It is not a single algorithm but a family of algorithms where all of them share a
common principle, i.e. every pair of features being classified is independent of each
other.
One of the simplest and most effective classification algorithms, the Naïve Bayes
classifier aids in the rapid development of machine learning models with fast
prediction capabilities.
The Naïve Bayes algorithm is used for classification problems and is widely used in text
classification.
In text classification tasks, the data has high dimensionality (each word represents one
feature of the data). It is used in spam filtering, sentiment detection, rating
classification, etc.
Naïve Bayes classifier
Why it is Called Naive Bayes?
The “Naive” part of the name indicates the simplifying assumption made by the Naïve Bayes
classifier.
The classifier assumes that the features used to describe an observation are conditionally
independent, given the class label.
The “Bayes” part of the name refers to Reverend Thomas Bayes, an 18th-century statistician
and theologian who formulated Bayes’ theorem.
Assumption of Naive Bayes
The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome. More specifically:
Feature independence: The features of the data are conditionally independent of each
other, given the class label.
Continuous features are normally distributed: If a feature is continuous, then it is
assumed to be normally distributed within each class.
Discrete features have multinomial distributions: If a feature is discrete, then it is
assumed to have a multinomial distribution within each class.
Features are equally important: All features are assumed to contribute equally to the
prediction of the class label.
No missing data: The data should not contain any missing values.
Naïve Bayes classifier
Bayes' Theorem:
Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine
the probability of a hypothesis with prior knowledge. It depends on the conditional
probability.
The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the probability
of a hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
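A tiny worked sketch of the theorem in Python; the prior, likelihood, and evidence values are hypothetical:

# Hypothetical values
prior = 0.3          # P(A): probability of hypothesis A before seeing the evidence
likelihood = 0.8     # P(B|A): probability of the evidence given that A is true
evidence = 0.5       # P(B): overall probability of the evidence

posterior = likelihood * prior / evidence   # P(A|B) by Bayes' theorem
print(posterior)                            # 0.48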
Naïve Bayes classifier
Working of Naïve Bayes' Classifier:
Working of Naïve Bayes' Classifier can be understood with the help of the below
example:
Suppose we have a dataset of weather conditions and a corresponding target variable
"Play". Using this dataset, we need to decide whether we should play or not on
a particular day according to the weather conditions. To solve this problem, we
need to follow the steps below:
Convert the given dataset into frequency tables.
Generate Likelihood table by finding the probabilities of given features.
Now, use Bayes theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the player play or not?

Solution: To solve this, first consider the below dataset:


Naïve Bayes classifier
Advantages of Naïve Bayes Classifier:
Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
It can be used for Binary as well as Multi-class Classifications.
It performs well in Multi-class predictions as compared to the other Algorithms.
It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
Naive Bayes assumes that all features are independent or unrelated, so it cannot learn
the relationship between features.
Applications of Naïve Bayes Classifier:
It is used for Credit Scoring.
It is used in medical data classification.
It can be used in real-time predictions because Naïve Bayes Classifier is an eager
learner.
It is used in Text classification such as Spam filtering and Sentiment analysis.
Naïve Bayes classifier
Python Implementation of the Naïve Bayes algorithm:
► Steps to implement:
► Data Pre-processing step
► Fitting Naive Bayes to the Training set
► Predicting the test result
► Test accuracy of the result(Creation of Confusion matrix)
► Visualizing the test set result.
Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
Naïve Bayes classifier
Importing the dataset
dataset = pd.read_csv('user_data.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state
= 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
Naïve Bayes classifier
2) Fitting Naive Bayes to the Training Set:
After the pre-processing step, now we will fit the Naive Bayes model to the Training
set. Below is the code for it:
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)
3) Prediction of the test set result:
Now we will predict the test set result. For this, we will create a new predictor
variable y_pred, and will use the predict function to make the predictions.
# Predicting the Test set results
y_pred = classifier.predict(x_test)
Naïve Bayes classifier
4) Creating Confusion Matrix:
Now we will check the accuracy of the Naive Bayes classifier using the Confusion
matrix. Below is the code for it:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Linear Regression
What is regression?
Regression refers to a type of supervised machine learning technique that is used to
predict any continuous-valued attribute.
Regression helps any business organization to analyze the target variable and
predictor variable relationships.
It is a significant tool for analyzing data and can be used for financial
forecasting and time series modeling.
Linear Regression
Linear regression is the type of regression that forms a relationship between the
target variable and one or more independent variables utilizing a straight line.
The given equation represents the equation of linear regression
Y = a + b*X + e
Where,
a represents the intercept
Linear Regression
b represents the slope of the regression line
e represents the error
X and Y represent the predictor and target variables, respectively.
If X is made up of more than one variable, the model is termed multiple linear regression.
Classification: It predicts the class of the data based on the independent input
variables. A class is a categorical or discrete value, for example, whether the image of an
animal is of a cat or a dog.
Regression: It predicts continuous output variables based on the independent
input variables, for example, the prediction of house prices based on parameters such as
house age, distance from the main road, location, and area.
Linear Regression
Types of Linear Regression
There are two main types of linear regression:
Simple Linear Regression
This is the simplest form of linear regression, and it involves only one independent
variable and one dependent variable. The equation for simple linear regression is:
Y = β0 + β1X
where:
Y is the dependent variable
X is the independent variable
β0 is the intercept
β1 is the slope
Linear Regression
Multiple Linear Regression
This involves more than one independent variable and one dependent variable. The
equation for multiple linear regression is:
Y = β0 + β1X1 + β2X2 + ... + βnXn
where:
Y is the dependent variable
X1, X2, …, Xn are the independent variables
β0 is the intercept
β1, β2, …, βn are the slopes
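A minimal sketch of fitting a simple linear regression with scikit-learn; the data points are hypothetical:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical one-dimensional data roughly following Y = 2 + 3X
X = np.array([[1], [2], [3], [4], [5]])
Y = np.array([5.1, 7.9, 11.2, 14.1, 16.8])

model = LinearRegression()
model.fit(X, Y)
print("intercept (a):", model.intercept_)
print("slope (b):", model.coef_[0])
print("prediction for X = 6:", model.predict([[6]])[0])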
Logistic Regression
What is Logistic Regression?
Logistic regression is used for binary classification, where we use the sigmoid function,
which takes the independent variables as input and produces a probability value between 0
and 1.
Key Points:
Logistic regression predicts the output of a categorical dependent variable. Therefore,
the outcome must be a categorical or discrete value.
It can be either Yes or No, 0 or 1, True or False, etc., but instead of giving exact values
of 0 and 1, it gives probabilistic values that lie between 0 and 1.
In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic
function, which predicts two maximum values (0 or 1).
Logistic Regression
Types of Logistic Regression
On the basis of the categories, Logistic Regression can be classified into three types:
Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as “cat”, “dogs”, or “sheep”
Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types
of dependent variables, such as “low”, “Medium”, or “High”.
Logistic Regression
Logistic Function (Sigmoid Function):
The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
It maps any real value into another value within a range of 0 and 1.
The value of the logistic regression must be between 0 and 1, which cannot go beyond
this limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid
function or the logistic function.
In logistic regression, we use the concept of a threshold value, which decides between
0 and 1: values above the threshold tend to 1, and values below the threshold tend to 0.
Logistic Regression
Steps in Logistic Regression: To implement the Logistic Regression using
Python, we will use the same steps as we have done in previous topics of
Regression. Below are the steps:
Data Pre-processing step
Fitting Logistic Regression to the Training set
Predicting the test result
Test accuracy of the result(Creation of Confusion matrix)
Visualizing the test set result.
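Since these steps mirror the Naïve Bayes implementation shown earlier, only the model-fitting step changes; a minimal sketch, assuming the same pre-processed x_train, y_train, x_test, and y_test from that example:

# Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)

# Predicting the Test set results and building the Confusion Matrix
from sklearn.metrics import confusion_matrix
y_pred = classifier.predict(x_test)
cm = confusion_matrix(y_test, y_pred)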
Non-linear Regression
Nonlinear regression refers to a broader category of regression models where the
relationship between the dependent variable and the independent variables is not assumed
to be linear.
If the underlying pattern in the data exhibits a curve, whether it’s exponential growth, decay,
logarithmic, or any other non-linear form, fitting a nonlinear regression model can provide a
more accurate representation of the relationship.
This is because in linear regression it is pre-assumed that the data is linear.
Assumptions in NonLinear Regression
These assumptions are similar to those in linear regression but may have nuanced
interpretations due to the nonlinearity of the model. Here are the key assumptions in
nonlinear regression:
Functional Form: The chosen nonlinear model correctly represents the true relationship
between the dependent and independent variables.
Independence: Observations are assumed to be independent of each other.
Homoscedasticity: The variance of the residuals (the differences between observed and
predicted values) is constant across all levels of the independent variable.
Normality: Residuals are assumed to be normally distributed.
Multicollinearity: Independent variables are not perfectly correlated.
Non-linear Regression
Types of Non-Linear Regression
Parametric non-linear regression :assumes that the relationship between the
dependent and independent variables can be modeled using a specific mathematical
function. For example, the relationship between the population of a country and time
can be modeled using an exponential function. Some common parametric non-linear
regression models include: Polynomial regression, Logistic regression, Exponential
regression, Power regression etc.
Non-parametric non-linear regression :does not assume that the relationship
between the dependent and independent variables can be modeled using a specific
mathematical function. Instead, it uses machine learning algorithms to learn the
relationship from the data. Some common non-parametric non-linear regression
algorithms include: Kernel smoothing, Local polynomial regression, Nearest neighbor
regression etc.
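A minimal sketch of parametric non-linear regression using SciPy's curve_fit, assuming an exponential growth model y = a·e^(b·x); the data points are hypothetical:

import numpy as np
from scipy.optimize import curve_fit

def exponential(x, a, b):
    # Exponential growth model y = a * exp(b * x)
    return a * np.exp(b * x)

# Hypothetical observations roughly following exponential growth
x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([2.0, 2.7, 3.7, 5.0, 6.9, 9.4])

params, _ = curve_fit(exponential, x, y, p0=[1.0, 0.5])
a, b = params
print("fitted a =", a, "fitted b =", b)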
Validating Model
Confusion Matrix in Machine Learning
► The confusion matrix is a matrix used to determine the performance of the
classification models for a given set of test data. It can only be determined if the
true values for the test data are known. The matrix itself can be easily understood, but
the related terminologies may be confusing. Since it shows the errors in the model's
performance in the form of a matrix, it is also known as an error matrix.
Some features of Confusion matrix are given below:
► For a classifier with 2 prediction classes, the matrix is a 2×2 table; for 3 classes, it
is a 3×3 table, and so on.
► The matrix is divided into two dimensions, that are predicted values and actual
values along with the total number of predictions.
► Predicted values are those values, which are predicted by the model, and actual
values are the true values for the given observations.
Confusion Matrix in Machine Learning
Need for Confusion Matrix in Machine learning
► It evaluates the performance of the classification models, when they make
predictions on test data, and tells how good our classification model is.
► It not only tells the error made by the classifiers but also the type of errors such as
it is either type-I or type-II error.
► With the help of the confusion matrix, we can calculate the different parameters
for the model, such as accuracy, precision, etc.
Confusion Matrix in Machine Learning
Calculations using Confusion Matrix:
► Classification Accuracy: It is one of the important parameters for evaluating
classification problems. It defines how often the model predicts the correct output. It is
calculated as the ratio of the number of correct predictions made by the classifier to the
total number of predictions made by the classifier.
► Precision: Out of all the instances the model predicted as positive, how many were
actually positive.
► Recall: Out of all the actual positive instances, how many the model predicted
correctly.
► F-measure: If one model has low precision and high recall, or vice versa, it is
difficult to compare the models. For this purpose, we can use the F-score, which
evaluates recall and precision at the same time. The F-score is maximum when recall is
equal to precision (a computation sketch follows below).
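These four quantities can be computed directly from the cells of a 2×2 confusion matrix; a minimal sketch with hypothetical counts:

# Hypothetical 2x2 confusion matrix cells
TP, FP, FN, TN = 40, 10, 5, 45

accuracy  = (TP + TN) / (TP + TN + FP + FN)  # correct predictions / all predictions
precision = TP / (TP + FP)                   # of the predicted positives, how many are truly positive
recall    = TP / (TP + FN)                   # of the actual positives, how many were found
f_measure = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f_measure)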
Cross-Validation in Machine Learning
► Cross-validation is a technique for validating the model efficiency by training it on
the subset of input data and testing on previously unseen subset of the input
data. We can also say that it is a technique to check how a statistical model
generalizes to an independent dataset.
Hence the basic steps of cross-validations are:
► Reserve a subset of the dataset as a validation set.
► Provide the training to the model using the training dataset.
► Now, evaluate model performance using the validation set. If the model performs
well on the validation set, proceed to the next step; otherwise, check for issues.
Cross-Validation in Machine Learning
Methods used for Cross-Validation
► Validation Set Approach
► Leave-P-out cross-validation
► Leave one out cross-validation
► K-fold cross-validation
► Stratified k-fold cross-validation
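A minimal sketch of k-fold cross-validation with scikit-learn, using the library's built-in Iris data and an assumed 5 folds:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Stratified 5-fold cross-validation: each fold preserves the class proportions
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())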
Applications of Cross-Validation
► This technique can be used to compare the performance of different predictive
modeling methods.
► It has great scope in the medical research field.
► It can also be used for the meta-analysis, as it is already being used by the data
scientists in the field of medical statistics.
LECTURE NOTES ON
DATA MINING & DATA WAREHOUSING
COURSE CODE: BCS-403

DEPT OF CSE & IT


VSSUT, Burla
SYLLABUS:
Module – I

Data Mining overview, Data Warehouse and OLAP Technology, Data Warehouse Architecture,
Steps for the Design and Construction of Data Warehouses, A Three-Tier Data Warehouse
Architecture, OLAP, OLAP queries, metadata repository, Data Preprocessing – Data
Integration and Transformation, Data Reduction, Data Mining Primitives: What Defines a Data
Mining Task? Task-Relevant Data, The Kind of Knowledge to be Mined, KDD

Module – II

Mining Association Rules in Large Databases, Association Rule Mining, Market Basket
Analysis: A Road Map, The Apriori Algorithm: Finding Frequent Itemsets Using
Candidate Generation, Generating Association Rules from Frequent Itemsets, Improving the
Efficiency of Apriori, Mining Frequent Itemsets without Candidate Generation, Multilevel
Association Rules, Approaches to Mining Multilevel Association Rules, Mining
Multidimensional Association Rules for Relational Databases and Data
Warehouses, Multidimensional Association Rules, Mining Quantitative Association Rules,
Mining Distance-Based Association Rules, From Association Mining to Correlation Analysis

Module – III

What is Classification? What Is Prediction? Issues Regarding Classification and Prediction,
Classification by Decision Tree Induction, Bayesian Classification, Bayes Theorem, Naïve
Bayesian Classification, Classification by Backpropagation, A Multilayer Feed-Forward Neural
Network, Defining a Network Topology, Classification Based on Concepts from Association Rule
Mining, Other Classification Methods, k-Nearest Neighbor Classifiers, Genetic Algorithms,
Rough Set Approach, Fuzzy Set Approaches, Prediction, Linear and Multiple Regression,
Nonlinear Regression, Other Regression Models, Classifier Accuracy

Module – IV

What Is Cluster Analysis, Types of Data in Cluster Analysis, A Categorization of Major
Clustering Methods, Classical Partitioning Methods: k-Means and k-Medoids, Partitioning
Methods in Large Databases: From k-Medoids to CLARANS, Hierarchical Methods,
Agglomerative and Divisive Hierarchical Clustering, Density-Based Methods, Wave Cluster:
Clustering Using Wavelet Transformation, CLIQUE: Clustering High-Dimensional Space,
Model-Based Clustering Methods, Statistical Approach, Neural Network Approach.

Chapter-1

1.1 What Is Data Mining?

Data mining refers to extracting or mining knowledge from large amounts of data. The term is
actually a misnomer; data mining should more appropriately have been named
knowledge mining, which emphasizes mining knowledge from large amounts of data.

It is the computational process of discovering patterns in large data sets involving methods at the
intersection of artificial intelligence, machine learning, statistics, and database systems.
The overall goal of the data mining process is to extract information from a data set and
transform it into an understandable structure for further use.

The key properties of data mining are


Automatic discovery of patterns
Prediction of likely outcomes
Creation of actionable information
Focus on large datasets and databases

1.2 The Scope of Data Mining

Data mining derives its name from the similarities between searching for valuable business
information in a large database — for example, finding linked products in gigabytes of store
scanner data — and mining a mountain for a vein of valuable ore. Both processes require either
sifting through an immense amount of material, or intelligently probing it to find exactly where
the value resides. Given databases of sufficient size and quality, data mining technology can
generate new business opportunities by providing these capabilities:

Automated prediction of trends and behaviors. Data mining automates the process of finding
predictive information in large databases. Questions that traditionally required extensive hands-
on analysis can now be answered directly from the data — quickly. A typical example of a
predictive problem is targeted marketing. Data mining uses data on past promotional mailings to
identify the targets most likely to maximize return on investment in future mailings. Other
predictive problems include forecasting bankruptcy and other forms of default, and identifying
segments of a population likely to respond similarly to given events.

Automated discovery of previously unknown patterns. Data mining tools sweep through
databases and identify previously hidden patterns in one step. An example of pattern discovery is
the analysis of retail sales data to identify seemingly unrelated products that are often purchased
together. Other pattern discovery problems include detecting fraudulent credit card transactions
and identifying anomalous data that could represent data entry keying errors.

1.3 Tasks of Data Mining


Data mining involves six common classes of tasks:
Anomaly detection (Outlier/change/deviation detection) – The identification of
unusual data records, that might be interesting or data errors that require further
investigation.
Association rule learning (Dependency modelling) – Searches for relationships
between variables. For example a supermarket might gather data on customer purchasing
habits. Using association rule learning, the supermarket can determine which products are
frequently bought together and use this information for marketing purposes. This is
sometimes referred to as market basket analysis.
Clustering – is the task of discovering groups and structures in the data that are in some
way or another "similar", without using known structures in the data.

Classification – is the task of generalizing known structure to apply to new data. For
example, an e-mail program might attempt to classify an e-mail as "legitimate" or as
"spam".
Regression – attempts to find a function which models the data with the least error.
Summarization – providing a more compact representation of the data set, including
visualization and report generation.

1.4 Architecture of Data Mining

A typical data mining system may have the following major components.

1. Knowledge Base:

This is the domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns. Such knowledge can include concept hierarchies,

used to organize attributes or attribute values into different levels of abstraction.
Knowledge such as user beliefs, which can be used to assess a pattern’s
interestingness based on its unexpectedness, may also be included. Other examples of
domain knowledge are additional interestingness constraints or thresholds, and
metadata (e.g., describing data from multiple heterogeneous sources).

2. Data Mining Engine:

This is essential to the data mining system and ideally consists of a set of functional
modules for tasks such as characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier analysis, and evolution analysis.

3. Pattern Evaluation Module:

This component typically employs interestingness measures and interacts with the data
mining modules so as to focus the search toward interesting patterns. It may use
interestingness thresholds to filter out discovered patterns. Alternatively, the pattern
evaluation module may be integrated with the mining module, depending on the
implementation of the data mining method used. For efficient data mining, it is highly
recommended to push the evaluation of pattern interestingness as deep as possible into
the mining process so as to confine the search to only the interesting patterns.

4. User interface:

This module communicates between users and the data mining system, allowing the
user to interact with the system by specifying a data mining query or task, providing
information to help focus the search, and performing exploratory data mining based on
the intermediate data mining results. In addition, this component allows the user to
browse database and data warehouse schemas or data structures, evaluate mined
patterns, and visualize the patterns in different forms.

1.5 Data Mining Process:

Data Mining is a process of discovering various models, summaries, and derived values from a
given collection of data.
The general experimental procedure adapted to data-mining problems involves the following
steps:
1. State the problem and formulate the hypothesis

Most data-based modeling studies are performed in a particular application domain.


Hence, domain-specific knowledge and experience are usually necessary in order to come
up with a meaningful problem statement. Unfortunately, many application studies tend to
focus on the data-mining technique at the expense of a clear problem statement. In this
step, a modeler usually specifies a set of variables for the unknown dependency and, if
possible, a general form of this dependency as an initial hypothesis. There may be several
hypotheses formulated for a single problem at this stage. The first step requires the
combined expertise of an application domain and a data-mining model. In practice, it
usually means a close interaction between the data-mining expert and the application
expert. In successful data-mining applications, this cooperation does not stop in the initial
phase; it continues during the entire data-mining process.

2. Collect the data

This step is concerned with how the data are generated and collected. In general, there are
two distinct possibilities. The first is when the data-generation process is under the
control of an expert (modeler): this approach is known as a designed experiment. The
second possibility is when the expert cannot influence the data-generation process: this is
known as the observational approach. An observational setting, namely, random data
generation, is assumed in most data-mining applications. Typically, the sampling
distribution is completely unknown after data are collected, or it is partially and implicitly
given in the data-collection procedure. It is very important, however, to understand how
data collection affects its theoretical distribution, since such a priori knowledge can be
very useful for modeling and, later, for the final interpretation of results. Also, it is
important to make sure that the data used for estimating a model and the data used later
for testing and applying a model come from the same, unknown, sampling distribution. If
this is not the case, the estimated model cannot be successfully used in a final application
of the results.

3. Preprocessing the data

In the observational setting, data are usually "collected" from the existing databases, data
warehouses, and data marts. Data preprocessing usually includes at least two common
tasks:

1. Outlier detection (and removal) – Outliers are unusual data values that are not
consistent with most observations. Commonly, outliers result from measurement
errors, coding and recording errors, and, sometimes, are natural, abnormal values.
Such nonrepresentative samples can seriously affect the model produced later. There
are two strategies for dealing with outliers:

a. Detect and eventually remove outliers as a part of the preprocessing phase, or


b. Develop robust modeling methods that are insensitive to outliers.

2. Scaling, encoding, and selecting features – Data preprocessing includes several steps
such as variable scaling and different types of encoding. For example, one feature with
the range [0, 1] and the other with the range [−100, 1000] will not have the same weights
in the applied technique; they will also influence the final data-mining results differently.
Therefore, it is recommended to scale them and bring both features to the same weight
for further analysis. Also, application-specific encoding methods usually achieve

dimensionality reduction by providing a smaller number of informative features for
subsequent data modeling.
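As a brief illustration of the scaling step described above, the sketch below rescales two features with very different ranges to [0, 1] using scikit-learn; the feature values are hypothetical:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: feature 1 in [0, 1], feature 2 in [-100, 1000]
X = np.array([[0.2, -100.0],
              [0.5,  250.0],
              [0.9, 1000.0]])

scaler = MinMaxScaler()          # maps each feature independently to [0, 1]
X_scaled = scaler.fit_transform(X)
print(X_scaled)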
These two classes of preprocessing tasks are only illustrative examples of a large
spectrum of preprocessing activities in a data-mining process.
Data-preprocessing steps should not be considered completely independent from other
data-mining phases. In every iteration of the data-mining process, all activities, together,
could define new and improved data sets for subsequent iterations. Generally, a good
preprocessing method provides an optimal representation for a data-mining technique by
incorporating a priori knowledge in the form of application-specific scaling and
encoding.

4. Estimate the model

The selection and implementation of the appropriate data-mining technique is the main
task in this phase. This process is not straightforward; usually, in practice, the
implementation is based on several models, and selecting the best one is an additional
task. The basic principles of learning and discovery from data are given in Chapter 4 of
this book. Later, Chapter 5 through 13 explain and analyze specific techniques that are
applied to perform a successful learning process from data and to develop an appropriate
model.

5. Interpret the model and draw conclusions

In most cases, data-mining models should help in decision making. Hence, such models
need to be interpretable in order to be useful because humans are not likely to base their
decisions on complex "black-box" models. Note that the goals of accuracy of the model
and accuracy of its interpretation are somewhat contradictory. Usually, simple models are
more interpretable, but they are also less accurate. Modern data-mining methods are
expected to yield highly accurate results using high-dimensional models. The problem of
interpreting these models, also very important, is considered a separate task, with specific

techniques to validate the results. A user does not want hundreds of pages of numeric
results. He does not understand them; he cannot summarize, interpret, and use them for
successful decision making.

The Data mining Process

1.6 Classification of Data mining Systems:

The data mining system can be classified according to the following criteria:

Database Technology
Statistics
Machine Learning
Information Science
Visualization
Other Disciplines

Some Other Classification Criteria:

Classification according to kind of databases mined


Classification according to kind of knowledge mined
Classification according to kinds of techniques utilized
Classification according to applications adapted

Classification according to kind of databases mined

We can classify data mining systems according to the kind of databases mined. Database
systems can be classified according to different criteria such as data models, types of
data, etc., and the data mining system can be classified accordingly. For example, if we
classify the database according to the data model, then we may have a relational,
transactional, object-relational, or data warehouse mining system.

Classification according to kind of knowledge mined

We can classify data mining systems according to the kind of knowledge mined. This means
data mining systems are classified on the basis of functionalities such as:

Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis

Classification according to kinds of techniques utilized

We can classify data mining systems according to the kind of techniques used. We can
describe these techniques according to the degree of user interaction involved or the
methods of analysis employed.

Classification according to applications adapted

We can classify data mining systems according to the applications adapted. These
applications are as follows:

Finance
Telecommunications
DNA
Stock Markets
E-mail

1.7 Major Issues In Data Mining:

Mining different kinds of knowledge in databases. - The needs of different users are not
the same, and different users may be interested in different kinds of knowledge. Therefore
it is necessary for data mining to cover a broad range of knowledge discovery tasks.

Interactive mining of knowledge at multiple levels of abstraction. - The data mining process
needs to be interactive because it allows users to focus the search for patterns, providing and
refining data mining requests based on returned results.

Incorporation of background knowledge. - Background knowledge can be used to guide the
discovery process and to express the discovered patterns. Background knowledge may be
used to express the discovered patterns not only in concise terms but also at multiple
levels of abstraction.
Data mining query languages and ad hoc data mining. - Data Mining Query language that
allows the user to describe ad hoc mining tasks, should be integrated with a data warehouse
query language and optimized for efficient and flexible data mining.

Presentation and visualization of data mining results. - Once the patterns are discovered,
they need to be expressed in high-level languages and visual representations. These
representations should be easily understandable by the users.

Handling noisy or incomplete data. - Data cleaning methods are required that can handle
noise and incomplete objects while mining the data regularities. If data cleaning methods
are not available, the accuracy of the discovered patterns will be poor.

Pattern evaluation. - It refers to the interestingness of the discovered patterns. A
discovered pattern may be uninteresting if it represents common knowledge or lacks novelty.

Efficiency and scalability of data mining algorithms. - In order to effectively extract
information from huge amounts of data in databases, data mining algorithms must be
efficient and scalable.

Parallel, distributed, and incremental mining algorithms. - Factors such as the huge size
of databases, wide distribution of data, and the complexity of data mining methods motivate
the development of parallel and distributed data mining algorithms. These algorithms divide
the data into partitions which are processed in parallel, and then the results from the
partitions are merged. Incremental algorithms incorporate database updates without having
to mine the data again from scratch.

1.8 Knowledge Discovery in Databases(KDD)

Some people treat data mining as the same as knowledge discovery, while others view data
mining as an essential step in the process of knowledge discovery. Here is the list of
steps involved in the knowledge discovery process:

Data Cleaning - In this step, the noise and inconsistent data are removed.
Data Integration - In this step, multiple data sources are combined.
Data Selection - In this step, data relevant to the analysis task are retrieved from the database.
Data Transformation - In this step, data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations.
Data Mining - In this step, intelligent methods are applied in order to extract data
patterns.
Pattern Evaluation - In this step, data patterns are evaluated.
Knowledge Presentation - In this step, knowledge is represented.

The following diagram shows the knowledge discovery process:

Architecture of KDD

1.9 Data Warehouse:

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of


data in support of management's decision making process.

Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For
example, "sales" can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources. For example, source A
and source B may have different ways of identifying a product, but in a data warehouse, there
will be only a single way of identifying a product.

Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data
from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts
with a transactions system, where often only the most recent data is kept. For example, a
transaction system may hold the most recent address of a customer, where a data warehouse can
hold all addresses associated with a customer.

Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data
warehouse should never be altered.

1.9.1 Data Warehouse Design Process:

A data warehouse can be built using a top-down approach, a bottom-up approach, or a


combination of both.

The top-down approach starts with the overall design and planning. It is useful in cases
where the technology is mature and well known, and where the business problems that must
be solved are clear and well understood.

The bottom-up approach starts with experiments and prototypes. This is useful in the early
stage of business modeling and technology development. It allows an organization to move
forward at considerably less expense and to evaluate the benefits of the technology before
making significant commitments.

In the combined approach, an organization can exploit the planned and strategic nature of
the top-down approach while retaining the rapid implementation and opportunistic
application of the bottom-up approach.
The warehouse design process consists of the following steps:

Choose a business process to model, for example, orders, invoices, shipments, inventory,
account administration, sales, or the general ledger. If the business process is organizational
and involves multiple complex object collections, a data warehouse model should be
followed. However, if the process is departmental and focuses on the analysis of one kind of
business process, a data mart model should be chosen.
Choose the grain of the business process. The grain is the fundamental, atomic level of data
to be represented in the fact table for this process, for example, individual transactions,
individual daily snapshots, and so on.
Choose the dimensions that will apply to each fact table record. Typical dimensions are
time, item, customer, supplier, warehouse, transaction type, and status.
Choose the measures that will populate each fact table record. Typical measures are numeric
additive quantities like dollars sold and units sold.

1.9.2 A Three Tier Data Warehouse Architecture:

Tier-1:

The bottom tier is a warehouse database server that is almost always a relational database
system. Back-end tools and utilities are used to feed data into the bottom tier from
operational databases or other external sources (such as customer profile information
provided by external consultants). These tools and utilities perform data extraction,
cleaning, and transformation (e.g., to merge similar data from different sources into a
unified format), as well as load and refresh functions to update the data warehouse. The
data are extracted using application program interfaces known as gateways. A gateway is
supported by the underlying DBMS and allows client programs to generate SQL code to
be executed at a server.
Examples of gateways include ODBC (Open Database Connection) and OLEDB (Open
Linking and Embedding for Databases) by Microsoft, and JDBC (Java Database
Connection).
This tier also contains a metadata repository, which stores information about the data
warehouse and its contents.

Tier-2:

The middle tier is an OLAP server that is typically implemented using either a relational
OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model.

A ROLAP model is an extended relational DBMS that maps operations on multidimensional
data to standard relational operations.
A MOLAP model is a special-purpose server that directly implements multidimensional
data and operations.

Tier-3:

The top tier is a front-end client layer, which contains query and reporting tools,
analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).

1.9.3 Data Warehouse Models:

There are three data warehouse models.

1. Enterprise warehouse:
An enterprise warehouse collects all of the information about subjects spanning the entire
organization.
It provides corporate-wide data integration, usually from one or more operational systems
or external information providers, and is cross-functional in scope.
It typically contains detailed data as well as summarized data, and can range in size from a
few gigabytes to hundreds of gigabytes, terabytes, or beyond.
An enterprise data warehouse may be implemented on traditional mainframes, computer
superservers, or parallel architecture platforms. It requires extensive business modeling
and may take years to design and build.

2. Data mart:

A data mart contains a subset of corporate-wide data that is of value to a specific group of
users. The scope is confined to specific selected subjects. For example, a marketing data
mart may confine its subjects to customer, item, and sales. The data contained in data
marts tend to be summarized.

Data marts are usually implemented on low-cost departmental servers that are
UNIX/LINUX- or Windows-based. The implementation cycle of a data mart is more
likely to be measured in weeks rather than months or years. However, it may involve
complex integration in the long run if its design and planning were not enterprise-wide.

Depending on the source of data, data marts can be categorized as independent or
dependent. Independent data marts are sourced from data captured from one or more
operational systems or external information providers, or from data generated locally
within a particular department or geographic area. Dependent data marts are sourced
directly from enterprise data warehouses.

3. Virtual warehouse:

A virtual warehouse is a set of views over operational databases. For efficient query
processing, only some of the possible summary views may be materialized.
A virtual warehouse is easy to build but requires excess capacity on operational database
servers.

1.9.4 Meta Data Repository:

Metadata are data about data. When used in a data warehouse, metadata are the data that define
warehouse objects. Metadata are created for the data names and definitions of the given
warehouse. Additional metadata are created and captured for timestamping any extracted data,
the source of the extracted data, and missing fields that have been added by data cleaning or
integration processes.

A metadata repository should contain the following:

A description of the structure of the data warehouse, which includes the warehouse
schema, view, dimensions, hierarchies, and derived data definitions, as well as data mart
locations and contents.

Operational metadata, which include data lineage (history of migrated data and the
sequence of transformations applied to it), currency of data (active, archived, or purged),
and monitoring information (warehouse usage statistics, error reports, and audit trails).
The algorithms used for summarization, which include measure and dimension
definition algorithms, data on granularity, partitions, subject areas, aggregation,
summarization, and predefined queries and reports.

The mapping from the operational environment to the data warehouse, which
includes source databases and their contents, gateway descriptions, data partitions, data
extraction, cleaning, transformation rules and defaults, data refresh and purging rules,
and security (user authorization and access control).

Data related to system performance, which include indices and profiles that improve data
access and retrieval performance, in addition to rules for the timing and scheduling of
refresh, update, and replication cycles.

Business metadata, which include business terms and definitions, data ownership
information, and charging policies.

1.10 OLAP(Online analytical Processing):

OLAP is an approach to answering multi-dimensional analytical (MDA) queries swiftly.


OLAP is part of the broader category of business intelligence, which also encompasses
relational database, report writing and data mining.
OLAP tools enable users to analyze multidimensional data interactively from multiple
perspectives.

OLAP consists of three basic analytical operations:

 Consolidation (Roll-Up)
 Drill-Down

 Slicing And Dicing

Consolidation involves the aggregation of data that can be accumulated and computed in
one or more dimensions. For example, all sales offices are rolled up to the sales
department or sales division to anticipate sales trends.

The drill-down is a technique that allows users to navigate through the details. For
instance, users can view the sales by individual products that make up a region’s sales.

Slicing and dicing is a feature whereby users can take out (slicing) a specific set of data
of the OLAP cube and view (dicing) the slices from different viewpoints.
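The three operations can be sketched with a small, hypothetical sales table. The following Python example, using the pandas library with made-up column names (division, office, product, quarter), is only an illustration of what roll-up, drill-down, and slice/dice compute, not a full OLAP engine.

    import pandas as pd

    # hypothetical sales data at the (division, office, product, quarter) grain
    sales = pd.DataFrame({
        "division": ["East", "East",   "West", "West"],
        "office":   ["NYC",  "Boston", "LA",   "Seattle"],
        "product":  ["TV",   "TV",     "PC",   "TV"],
        "quarter":  ["Q1",   "Q1",     "Q1",   "Q2"],
        "amount":   [100,    150,      200,    120],
    })

    # consolidation (roll-up): aggregate offices up to the division level
    rollup = sales.groupby(["division", "quarter"])["amount"].sum()

    # drill-down: navigate back to the more detailed office/product level
    drilldown = sales.groupby(["division", "office", "product"])["amount"].sum()

    # slice: fix one dimension (quarter = Q1); dice: select a sub-cube on several dimensions
    slice_q1 = sales[sales["quarter"] == "Q1"]
    dice = sales[(sales["quarter"] == "Q1") & (sales["division"] == "East")]

    print(rollup)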

1.10.1 Types of OLAP:

1. Relational OLAP (ROLAP):

ROLAP works directly with relational databases. The base data and the dimension
tables are stored as relational tables and new tables are created to hold the aggregated
information. It depends on a specialized schema design.
This methodology relies on manipulating the data stored in the relational database to
give the appearance of traditional OLAP's slicing and dicing functionality. In essence,
each action of slicing and dicing is equivalent to adding a "WHERE" clause in the
SQL statement.
ROLAP tools do not use pre-calculated data cubes but instead pose the query to the
standard relational database and its tables in order to bring back the data required to
answer the question.
ROLAP tools feature the ability to ask any question because the methodology is not
limited to the contents of a cube. ROLAP also has the ability to drill down to the
lowest level of detail in the database.
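As a minimal sketch of the point made above, the query below slices a hypothetical relational sales table on one dimension simply by adding a WHERE clause; the table and column names are invented for the example, and Python's built-in sqlite3 module stands in for the warehouse database.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT, quarter TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
        ("East", "TV", "Q1", 100), ("West", "TV", "Q1", 200), ("East", "PC", "Q2", 150)])

    # slicing the cube on quarter = 'Q1' is just an extra WHERE condition on the base table
    cur = conn.execute("""
        SELECT region, product, SUM(amount)
        FROM sales
        WHERE quarter = 'Q1'          -- the "slice"
        GROUP BY region, product      -- the remaining dimensions
    """)
    print(cur.fetchall())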

2. Multidimensional OLAP (MOLAP):

MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP.

MOLAP stores data in an optimized multi-dimensional array storage, rather than
in a relational database. It therefore requires the pre-computation and storage of
information in the cube - the operation known as processing.

MOLAP tools generally utilize a pre-calculated data set referred to as a data cube.
The data cube contains all the possible answers to a given range of questions.

MOLAP tools have a very fast response time and the ability to quickly write back
data into the data set.

3. Hybrid OLAP (HOLAP):

There is no clear agreement across the industry as to what constitutes Hybrid OLAP,
except that a database will divide data between relational and specialized storage.
For example, for some vendors, a HOLAP database will use relational tables to hold
the larger quantities of detailed data, and use specialized storage for at least some
aspects of the smaller quantities of more-aggregate or less-detailed data.
HOLAP addresses the shortcomings of MOLAP and ROLAP by combining the
capabilities of both approaches.
HOLAP tools can utilize both pre-calculated cubes and relational data sources.

1.11 Data Preprocessing:

1.11.1 Data Integration:

It combines data from multiple sources into a coherent data store, as in data warehousing. These
sources may include multiple databases, data cubes, or flat files.

A data integration system is formally defined as a triple <G, S, M>

where G: the global schema

S: the heterogeneous source schemas

M: the mapping between queries over the source schemas and the global schema
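To make the triple concrete, here is a small, purely illustrative Python sketch; the schemas and attribute names (source_A, client, cust_id, and so on) are invented, and M is shown as a simple attribute-to-attribute mapping.

    # G: the global schema exposed to users
    G = {"customer": ["cust_id", "name", "annual_revenue"]}

    # S: the heterogeneous source schemas
    S = {
        "source_A": {"client":   ["customer_id", "client_name", "revenue"]},
        "source_B": {"cust_tbl": ["cust_number", "full_name",   "turnover"]},
    }

    # M: mappings from (source, table, attribute) to (global table, global attribute)
    M = {
        ("source_A", "client",   "customer_id"): ("customer", "cust_id"),
        ("source_A", "client",   "client_name"): ("customer", "name"),
        ("source_A", "client",   "revenue"):     ("customer", "annual_revenue"),
        ("source_B", "cust_tbl", "cust_number"): ("customer", "cust_id"),
        ("source_B", "cust_tbl", "full_name"):   ("customer", "name"),
        ("source_B", "cust_tbl", "turnover"):    ("customer", "annual_revenue"),
    }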

1.11.2 Issues in Data integration:

1. Schema integration and object matching:

How can the data analyst or the computer be sure that customer_id in one database and
customer_number in another refer to the same attribute?

2. Redundancy:

An attribute (such as annual revenue, for instance) may be redundant if it can be derived
from another attribute or set of attributes. Inconsistencies in attribute or dimension naming
can also cause redundancies in the resulting data set.

3. Detection and resolution of data value conflicts:

For the same real-world entity, attribute values from different sources may differ.

1.11.3 Data Transformation:

In data transformation, the data are transformed or consolidated into forms appropriate for
mining.

Data transformation can involve the following:

Smoothing, which works to remove noise from the data. Such techniques include binning,
regression, and clustering.
Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and
annual total amounts. This step is typically used in constructing a data cube for analysis of
the data at multiple granularities.

Generalization of the data, where low-level or "primitive" (raw) data are replaced by
higher-level concepts through the use of concept hierarchies. For example, categorical
attributes, like street, can be generalized to higher-level concepts, like city or country.
Normalization, where the attribute data are scaled so as to fall within a small specified
range, such as -1.0 to 1.0, or 0.0 to 1.0.
Attribute construction (or feature construction), where new attributes are constructed and
added from the given set of attributes to help the mining process.
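A few of these transformations are simple enough to sketch directly. The following Python snippet, using made-up toy values, illustrates smoothing by bin means, aggregation, min-max normalization to [0.0, 1.0], and attribute construction.

    # toy daily sales values (one is an obvious outlier / noise)
    daily = [12.0, 15.0, 14.0, 300.0, 16.0, 13.0]

    # smoothing by binning: sort, split into equal-frequency bins, replace by bin means
    sorted_vals = sorted(daily)
    bins = [sorted_vals[i:i + 3] for i in range(0, len(sorted_vals), 3)]
    smoothed = [sum(b) / len(b) for b in bins for _ in b]

    # aggregation: roll the daily values up to a single (e.g., monthly) total
    monthly_total = sum(daily)

    # min-max normalization of a value v of attribute A into [0.0, 1.0]
    def min_max(v, min_a, max_a):
        return (v - min_a) / (max_a - min_a)

    normalized = [min_max(v, min(daily), max(daily)) for v in daily]

    # attribute construction: derive a new attribute (area) from existing ones
    width, height = 3.0, 4.0
    area = width * height

    print(smoothed, monthly_total, normalized, area)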

1.11.4 Data Reduction:

Data reduction techniques can be applied to obtain a reduced representation of the data set that
is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining
on the reduced data set should be more efficient yet produce the same (or almost the same)
analytical results.
Strategies for data reduction include the following:
Data cube aggregation, where aggregation operations are applied to the data in the
construction of a data cube.
Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or
dimensions may be detected and removed.
Dimensionality reduction, where encoding mechanisms are used to reduce the data set
size.
Numerosity reduction, where the data are replaced or estimated by alternative, smaller
data representations such as parametric models (which need store only the model
parameters instead of the actual data) or nonparametric methods such as clustering,
sampling, and the use of histograms.
Discretization and concept hierarchy generation, where raw data values for attributes
are replaced by ranges or higher conceptual levels. Data discretization is a form of
numerosity reduction that is very useful for the automatic generation of concept
hierarchies. Discretization and concept hierarchy generation are powerful tools for
data mining, in that they allow the mining of data at multiple levels of abstraction.

Chapter-2

2.1 Association Rule Mining:


Association rule mining is a popular and well researched method for discovering interesting
relations between variables in large databases.
It is intended to identify strong rules discovered in databases using different measures of
interestingness.
Based on the concept of strong rules, Rakesh Agrawal et al. introduced association rules.
Problem Definition:
The problem of association rule mining is defined as:

Let I = {i1, i2, ..., in} be a set of n binary attributes called items.

Let D = {t1, t2, ..., tm} be a set of transactions called the database.

Each transaction in D has a unique transaction ID and contains a subset of the items in I.
A rule is defined as an implication of the form X => Y,
where X, Y ⊆ I and X ∩ Y = ∅.
The sets of items (for short, itemsets) X and Y are called the antecedent (left-hand-side or LHS) and
consequent (right-hand-side or RHS) of the rule respectively.
Example:
To illustrate the concepts, we use a small example from the supermarket domain. The set of
items is I = {milk, bread, butter, beer}, and a small database containing the items (1
codes presence and 0 absence of an item in a transaction) is shown in the table below.

An example rule for the supermarket could be {butter, bread} => {milk}, meaning that if
butter and bread are bought, customers also buy milk.

Example database with 4 items and 5 transactions

Transaction ID milk bread butter beer

1 1 1 0 0

2 0 0 1 0

3 0 0 0 1

4 1 1 1 0

5 0 1 0 0

2.1.1 Important concepts of Association Rule Mining:

The support supp(X) of an itemset X is defined as the proportion of transactions in the
data set which contain the itemset. In the example database, the itemset
{milk, bread, butter} has a support of 1/5 = 0.2, since it occurs in 20% of all
transactions (1 out of 5 transactions).

The confidence of a rule is defined as conf(X => Y) = supp(X ∪ Y) / supp(X).

For example, the rule {butter, bread} => {milk} has a confidence of 0.2 / 0.2 = 1.0
in the database, which means that for 100% of the transactions
containing butter and bread the rule is correct (100% of the times a customer buys butter
and bread, milk is bought as well). Confidence can be interpreted as an estimate of the
probability P(Y | X), the probability of finding the RHS of the rule in transactions
under the condition that these transactions also contain the LHS.

The lift of a rule is defined as lift(X => Y) = supp(X ∪ Y) / (supp(X) × supp(Y)),
or the ratio of the observed support to that expected if X and Y were independent. The
rule {milk, bread} => {butter} has a lift of 0.2 / (0.4 × 0.4) = 1.25.

The conviction of a rule is defined as conv(X => Y) = (1 − supp(Y)) / (1 − conf(X => Y)).

The rule {milk, bread} => {butter} has a conviction of (1 − 0.4) / (1 − 0.5) = 1.2,
and can be interpreted as the ratio of the expected frequency that X occurs without Y
(that is to say, the frequency that the rule makes an incorrect prediction) if X and Y were
independent divided by the observed frequency of incorrect predictions.
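These measures are easy to verify in code. The short Python sketch below recomputes support, confidence, lift, and conviction directly from the five transactions of the example database; the helper names (supp, conf, lift, conviction) are just local functions written for this example.

    # the five transactions of the example database above (1 = item present)
    D = [{'milk', 'bread'}, {'butter'}, {'beer'}, {'milk', 'bread', 'butter'}, {'bread'}]

    def supp(itemset):
        return sum(1 for t in D if itemset <= t) / len(D)

    def conf(lhs, rhs):
        return supp(lhs | rhs) / supp(lhs)

    def lift(lhs, rhs):
        return supp(lhs | rhs) / (supp(lhs) * supp(rhs))

    def conviction(lhs, rhs):
        return (1 - supp(rhs)) / (1 - conf(lhs, rhs))

    print(supp({'milk', 'bread', 'butter'}))           # 0.2
    print(conf({'butter', 'bread'}, {'milk'}))         # 1.0
    print(lift({'milk', 'bread'}, {'butter'}))         # 1.25
    print(conviction({'milk', 'bread'}, {'butter'}))   # 1.2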

2.2 Market basket analysis:

This process analyzes customer buying habits by finding associations between the different items
that customers place in their shopping baskets. The discovery of such associations can help
retailers develop marketing strategies by gaining insight into which items are frequently
purchased together by customers. For instance, if customers are buying milk, how likely are they
to also buy bread (and what kind of bread) on the same trip to the supermarket? Such information
can lead to increased sales by helping retailers do selective marketing and plan their shelf space.

Example:

If customers who purchase computers also tend to buy antivirus software at the same time, then
placing the hardware display close to the software display may help increase the sales of both
items. In an alternative strategy, placing hardware and software at opposite ends of the store may
entice customers who purchase such items to pick up other items along the way. For instance,
after deciding on an expensive computer, a customer may observe security systems for sale while
heading toward the software display to purchase antivirus software and may decide to purchase a
home security system as well. Market basket analysis can also help retailers plan which items to
put on sale at reduced prices. If customers tend to purchase computers and printers together,
then having a sale on printers may encourage the sale of printers as well as computers.

2.3 Frequent Pattern Mining:


Frequent pattern mining can be classified in various ways, based on the following criteria:

1. Based on the completeness of patterns to be mined:

We can mine the complete set of frequent itemsets, the closed frequent itemsets, and the
maximal frequent itemsets, given a minimum support threshold.
We can also mine constrained frequent itemsets, approximate frequent itemsets, near-match
frequent itemsets, top-k frequent itemsets, and so on.

2. Based on the levels of abstraction involved in the rule set:

Some methods for association rule mining can find rules at differing levels of abstraction.

For example, suppose that a set of association rules mined includes the following rules,
where X is a variable representing a customer:

buys(X, "computer") => buys(X, "HP printer") (1)

buys(X, "laptop computer") => buys(X, "HP printer") (2)

In rules (1) and (2), the items bought are referenced at different levels of abstraction (e.g.,
"computer" is a higher-level abstraction of "laptop computer").
3. Based on the number of data dimensions involved in the rule:

If the items or attributes in an association rule reference only one dimension, then it is a
single-dimensional association rule.
buys(X, "computer") => buys(X, "antivirus software")

If a rule references two or more dimensions, such as the dimensions age, income, and buys,
then it is a multidimensional association rule. The following rule is an example of a
multidimensional rule:
age(X, "30…39") ^ income(X, "42K…48K") => buys(X, "high resolution TV")

4. Based on the types of values handled in the rule:
If a rule involves associations between the presence or absence of items, it is a Boolean
association rule.
If a rule describes associations between quantitative items or attributes, then it is a
quantitative association rule.

5. Based on the kinds of rules to be mined:


Frequent pattern analysis can generate various kinds of rules and other interesting
relationships.
Association rule mining can generate a large number of rules, many of which are
redundant or do not indicate a correlation relationship among itemsets.
The discovered associations can be further analyzed to uncover statistical correlations,
leading to correlation rules.

6. Based on the kinds of patterns to be mined:


Many kinds of frequent patterns can be mined from different kinds of data sets.
Sequential pattern mining searches for frequent subsequences in a sequence data set,
where a sequence records an ordering of events.
For example, with sequential pattern mining, we can study the order in which items are
frequently purchased. For instance, customers may tend to first buy a PC, followed by a
digital camera, and then a memory card.
Structured pattern mining searches for frequent substructures in a structured data set.
Single items are the simplest form of structure.
Each element of an itemset may contain a subsequence, a subtree, and so on.
Therefore, structured pattern mining can be considered as the most general form of
frequent pattern mining.

2.4 Efficient Frequent Itemset Mining Methods:
2.4.1 Finding Frequent Itemsets Using Candidate Generation: The Apriori Algorithm

Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining
frequent itemsets for Boolean association rules.
The name of the algorithm is based on the fact that the algorithm uses prior knowledge of
frequent itemset properties.
Apriori employs an iterative approach known as a level-wise search, where k-itemsets are
used to explore (k+1)-itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate the
count for each item, and collecting those items that satisfy minimum support. The
resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets,
which is used to find L3, and so on, until no more frequent k-itemsets can be found.
The finding of each Lk requires one full scan of the database.
A two-step process is followed in Apriori, consisting of a join action and a prune action.

Example:

TID List of item IDs


T100 I1, I2, I5
T200 I2, I4
T300 I2, I3
T400 I1, I2, I4
T500 I1, I3
T600 I2, I3
T700 I1, I3
T800 I1, I2, I3, I5
T900 I1, I2, I3

There are nine transactions in this database, that is, |D| = 9.

Steps:
1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets,
C1. The algorithm simply scans all of the transactions in order to count the number of
occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min_sup = 2. The set of
frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets
satisfying minimum support. In our example, all of the candidates in C1 satisfy minimum
support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 x L1 to
generate a candidate set of 2-itemsets, C2. No candidates are removed from C2 during the
prune step because each subset of the candidates is also frequent.
4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2
is accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets
in C2 having minimum support.
6. For the generation of the set of candidate 3-itemsets, C3, from the join step we first get C3 =
L2 x L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the
Apriori property that all subsets of a frequent itemset must also be frequent, we can determine
that the four latter candidates cannot possibly be frequent.

7. The transactions in D are scanned in order to determine L3, consisting of those candidate
3-itemsets in C3 having minimum support.
8. The algorithm uses L3 x L3 to generate a candidate set of 4-itemsets, C4.
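The level-wise join/prune idea can be summarized in a short program. The following is a minimal Python sketch (not an optimized implementation); run on the nine-transaction database D above with min_sup = 2, it returns the frequent itemsets described in the steps, up to L3 = {{I1, I2, I3}, {I1, I2, I5}}.

    from itertools import combinations

    def apriori(transactions, min_sup):
        """Level-wise search: find L1, then iterate join -> prune -> count."""
        transactions = [frozenset(t) for t in transactions]

        def support_count(itemset):
            return sum(1 for t in transactions if itemset <= t)

        items = {i for t in transactions for i in t}
        L = {frozenset([i]) for i in items if support_count(frozenset([i])) >= min_sup}
        all_frequent, k = set(L), 2
        while L:
            # join step: candidate k-itemsets built from frequent (k-1)-itemsets
            candidates = {a | b for a in L for b in L if len(a | b) == k}
            # prune step: every (k-1)-subset of a candidate must itself be frequent
            candidates = {c for c in candidates
                          if all(frozenset(s) in L for s in combinations(c, k - 1))}
            # one full scan of the database to count the surviving candidates
            L = {c for c in candidates if support_count(c) >= min_sup}
            all_frequent |= L
            k += 1
        return all_frequent

    # the nine transactions of the example above, min_sup = 2
    D = [{'I1', 'I2', 'I5'}, {'I2', 'I4'}, {'I2', 'I3'}, {'I1', 'I2', 'I4'}, {'I1', 'I3'},
         {'I2', 'I3'}, {'I1', 'I3'}, {'I1', 'I2', 'I3', 'I5'}, {'I1', 'I2', 'I3'}]
    for itemset in sorted(apriori(D, 2), key=len):
        print(sorted(itemset))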

2.4.2 Generating Association Rules from Frequent Itemsets:
Once the frequent itemsets from transactions in a database D have been found, it is
straightforward to generate strong association rules from them (strong rules satisfy both
minimum support and minimum confidence). For each frequent itemset l, generate all
nonempty proper subsets of l; for every such subset s, output the rule s => (l − s) if
support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum
confidence threshold.

Example:
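Since the worked example here was a figure in the original notes, a minimal Python sketch is given instead. It applies the procedure above to the frequent itemset {I1, I2, I5} of the Apriori example, using support counts taken from that nine-transaction database; with min_conf = 0.7, the rules I5 => {I1, I2}, {I1, I5} => I2, and {I2, I5} => I1 survive.

    from itertools import combinations

    def generate_rules(frequent_itemsets, support_count, min_conf):
        """Emit s => (l - s) for every nonempty proper subset s of each frequent itemset l,
        keeping the rule only if support_count(l) / support_count(s) >= min_conf."""
        rules = []
        for l in frequent_itemsets:
            if len(l) < 2:
                continue
            for r in range(1, len(l)):
                for s in map(frozenset, combinations(l, r)):
                    confidence = support_count[l] / support_count[s]
                    if confidence >= min_conf:
                        rules.append((set(s), set(l - s), confidence))
        return rules

    # support counts taken from the nine-transaction database of the Apriori example
    support_count = {frozenset(k): v for k, v in {
        ('I1',): 6, ('I2',): 7, ('I5',): 2,
        ('I1', 'I2'): 4, ('I1', 'I5'): 2, ('I2', 'I5'): 2, ('I1', 'I2', 'I5'): 2}.items()}

    for lhs, rhs, c in generate_rules([frozenset({'I1', 'I2', 'I5'})], support_count, 0.7):
        print(lhs, "=>", rhs, f"(confidence {c:.0%})")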

2.5 Mining Multilevel Association Rules:

For many applications, it is difficult to find strong associations among data items at low
or primitive levels of abstraction due to the sparsity of data at those levels.
Strong associations discovered at high levels of abstraction may represent commonsense
knowledge.
Therefore, data mining systems should provide capabilities for mining association rules
at multiple levels of abstraction, with sufficient flexibility for easy traversal among
different abstraction spaces.

Association rules generated from mining data at multiple levels of abstraction are called
multiple-level or multilevel association rules.
Multilevel association rules can be mined efficiently using concept hierarchies under a
support-confidence framework.
In general, a top-down strategy is employed, where counts are accumulated for the
calculation of frequent itemsets at each concept level, starting at the concept level 1 and
working downward in the hierarchy toward the more specific concept levels,until no
more frequent itemsets can be found.

A concept hierarchy defines a sequence of mappings from a set of low-level concepts to
higher-level, more general concepts. Data can be generalized by replacing low-level
concepts within the data by their higher-level concepts, or ancestors, from a concept hierarchy.

Consider, for example, a concept hierarchy for electronics items with five levels, respectively
referred to as levels 0 to 4, starting with level 0 at the root node for all.

Here, Level 1 includes computer, software, printer&camera, and computer accessory.


Level 2 includes laptop computer, desktop computer, office software, antivirus software
Level 3 includes IBM desktop computer, . . . , Microsoft office software, and so on.
Level 4 is the most specific abstraction level of this hierarchy.

2.5.1 Approaches ForMining Multilevel Association Rules:


1. Uniform Minimum Support:
The same minimum support threshold is used when mining at each level of abstraction.
When a uniform minimum support threshold is used, the search procedure is simplified.
The method is also simple in that users are required to specify only one minimum support
threshold.
The uniform support approach, however, has some difficulties. It is unlikely that items at
lower levels of abstraction will occur as frequently as those at higher levels of abstraction.
If the minimum support threshold is set too high, it could miss some meaningful associations
occurring at low abstraction levels. If the threshold is set too low, it may generate many
uninteresting associations occurring at high abstraction levels.

2.Reduced Minimum Support:


Each level of abstraction has its own minimum support threshold.

The deeper the level of abstraction, the smaller the corresponding threshold is.
For example, the minimum support thresholds for levels 1 and 2 are 5% and 3%, respectively.
In this way, "computer," "laptop computer," and "desktop computer" are all considered
frequent.

3.Group-Based Minimum Support:


Because users or experts often have insight as to which groups are more important than
others, it is sometimes more desirable to set up user-specific, item, or group based minimal
support thresholds when mining multilevel rules.
For example, a user could set up the minimum support thresholds based on product price, or
on items of interest, such as by setting particularly low support thresholds for laptop
computers and flash drives in order to pay particular attention to the association patterns
containing items in these categories.

2.6 Mining Multidimensional Association Rules from Relational Databases and Data Warehouses:

A single-dimensional or intradimensional association rule contains a single distinct
predicate (e.g., buys) with multiple occurrences, i.e., the predicate occurs more than once
within the rule. For example:

buys(X, "digital camera") => buys(X, "HP printer")

Association rules that involve two or more dimensions or predicates can be referred to
as multidimensional association rules.

age(X, “20…29”)^occupation(X, “student”)=>buys(X, “laptop”)
The above rule contains three predicates (age, occupation, and buys), each of which occurs
only once in the rule. Hence, we say that it has no repeated predicates.
Multidimensional association rules with no repeated predicates are called
interdimensional association rules.
We can also mine multidimensional association rules with repeated predicates, which
contain multiple occurrences of some predicates. These rules are called hybrid-dimensional
association rules. An example of such a rule is the following, where the predicate buys is repeated:
age(X, "20…29") ^ buys(X, "laptop") => buys(X, "HP printer")

2.7 Mining Quantitative Association Rules:


Quantitative association rules are multidimensional association rules in which the numeric
attributes are dynamically discretized during the mining process so as to satisfy some mining
criteria, such as maximizing the confidence or compactness of the rules mined.
In this section, we focus specifically on how to mine quantitative association rules having
two quantitative attributes on the left-hand side of the rule and one categorical attribute on
the right-hand side of the rule. That is
Aquan1 ^ Aquan2 => Acat
where Aquan1 and Aquan2 are tests on quantitative attribute intervals, and
Acat tests a categorical attribute from the task-relevant data.
Such rules have been referred to as two-dimensional quantitative association rules,
because they contain two quantitative dimensions.
For instance, suppose you are curious about the association relationship between pairs of
quantitative attributes, like customer age and income, and the type of television (such as
high-definition TV, i.e., HDTV) that customers like to buy.
An example of such a 2-D quantitative association rule is
age(X, "30…39") ^ income(X, "42K…48K") => buys(X, "HDTV")

2.8 From Association Mining to Correlation Analysis:
A correlation measure can be used to augment the support-confidence framework for
association rules. This leads to correlation rules of the form
A=>B [support, confidence, correlation]
That is, a correlation rule is measured not only by its support and confidence but also by
the correlation between itemsets A and B. There are many different correlation measures
from which to choose. In this section, we study various correlation measures to
determine which would be good for mining large data sets.

Lift is a simple correlation measure that is given as follows. The occurrence of itemset
A is independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise,
itemsets A and B are dependent and correlated as events. This definition can easily be
extended to more than two itemsets.

The lift between the occurrence of A and B can be measured by computing

lift(A, B) = P(A ∪ B) / (P(A) P(B))
If the lift(A,B) is less than 1, then the occurrence of A is negativelycorrelated with the
occurrence of B.
If the resulting value is greater than 1, then A and B are positively correlated, meaning that
the occurrence of one implies the occurrence of the other.
If the resulting value is equal to 1, then A and B are independent and there is no correlation
between them.

Chapter-3

3.1 Classification and Prediction:

Classification and prediction are two forms of data analysis that can be used to extract models
describing important data classes or to predict future data trends.
Classification predicts categorical (discrete, unordered) labels, while prediction models
continuous-valued functions.
For example, we can build a classification model to categorize bank loan applications as
either safe or risky, or a prediction model to predict the expenditures of potential customers
on computer equipment given their income and occupation.
A predictor is constructed that predicts a continuous-valued function, or ordered value, as
opposed to a categorical label.
Regression analysis is a statistical methodology that is most often used for numeric
prediction.
Many classification and prediction methods have been proposed by researchers in machine
learning, pattern recognition, and statistics.
Most algorithms are memory resident, typically assuming a small data size. Recent data
mining research has built on such work, developing scalable classification and prediction
techniques capable of handling large disk-resident data.

3.1.1 Issues Regarding Classification and Prediction:


1.Preparing the Data for Classification and Prediction:
The following preprocessing steps may be applied to the data to help improve the accuracy,
efficiency, and scalability of the classification or prediction process.

(i)Data cleaning:
This refers to the preprocessing of data in order to remove or reduce noise (by applying
smoothing techniques) and the treatment of missing values (e.g., by replacing a missing
value with the most commonly occurring value for that attribute, or with the most probable
value based on statistics).
Although most classification algorithms have some mechanisms for handling noisy or
missing data, this step can help reduce confusion during learning.
(ii)Relevance analysis:
Many of the attributes in the data may be redundant.
Correlation analysis can be used to identify whether any two given attributes are
statistically related.
For example, a strong correlation between attributes A1 and A2 would suggest that one of
the two could be removed from further analysis.
A database may also contain irrelevant attributes. Attribute subset selection can be used
in these cases to find a reduced set of attributes such that the resulting probability
distribution of the data classes is as close as possible to the original distribution obtained
using all attributes.
Hence, relevance analysis, in the form of correlation analysis and attribute subset
selection, can be used to detect attributes that do not contribute to the classification or
prediction task.
Such analysis can help improve classification efficiency and scalability.
(iii)Data Transformation And Reduction
The data may be transformed by normalization, particularly when neural networks or
methods involving distance measurements are used in the learning step.
Normalization involves scaling all values for a given attribute so that they fall within a
small specified range, such as -1 to +1 or 0 to 1.
The data can also be transformed by generalizing it to higher-level concepts. Concept
hierarchies may be used for this purpose. This is particularly useful for continuous-valued
attributes.
For example, numeric values for the attribute income can be generalized to discrete
ranges, such as low, medium, and high. Similarly, categorical attributes, like street, can
be generalized to higher-level concepts, like city.
Data can also be reduced by applying many other methods, ranging from wavelet
transformation and principle components analysis to discretization techniques, such
as binning, histogram analysis, and clustering.

3.1.2 Comparing Classification and Prediction Methods:


 Accuracy:
The accuracy of a classifier refers to the ability of a given classifier to correctly predict
the class label of new or previously unseen data (i.e., tuples without class label
information).
The accuracy of a predictor refers to how well a given predictor can guess the value of
the predicted attribute for new or previously unseen data.
 Speed:
This refers to the computational costs involved in generating and using the
given classifier or predictor.
 Robustness:
This is the ability of the classifier or predictor to make correct predictions
given noisy data or data with missing values.
 Scalability:
This refers to the ability to construct the classifier or predictor efficiently
given large amounts of data.
 Interpretability:
This refers to the level of understanding and insight that is provided by the classifier or
predictor.
Interpretability is subjective and therefore more difficult to assess.

3.2 Classification by Decision Tree Induction:
Decision tree induction is the learning of decision trees from class-labeled training tuples.
A decision tree is a flowchart-like tree structure,where
 Each internal nodedenotes a test on an attribute.
 Each branch represents an outcome of the test.
 Each leaf node holds a class label.
 The topmost node in a tree is the root node.

The construction of decision tree classifiers does not require any domain knowledge or
parameter setting, and is therefore appropriate for exploratory knowledge discovery.
Decision trees can handle high-dimensional data.
Their representation of acquired knowledge in tree form is intuitive and generally easy to
assimilate by humans.
The learning and classification steps of decision tree induction are simple and fast.
In general, decision tree classifiers have good accuracy.
Decision tree induction algorithms have been used for classification in many application
areas, such as medicine, manufacturing and production, financial analysis, astronomy, and
molecular biology.

3.2.1 Algorithm For Decision Tree Induction:

The algorithm is called with three parameters:


 Data partition
 Attribute list
 Attribute selection method

The parameter attribute list is a list of attributes describing the tuples.

Attribute selection method specifies a heuristic procedure for selecting the attribute that
"best" discriminates the given tuples according to class.
The tree starts as a single node, N, representing the training tuples in D.

If the tuples in D are all of the same class, then node N becomes a leaf and is labeled with
that class.
All of the terminating conditions are explained at the end of the algorithm.
Otherwise, the algorithm calls Attribute selection method to determine the splitting
criterion.
The splitting criterion tells us which attribute to test at node N by determining the "best"
way to separate or partition the tuples in D into individual classes.

There are three possible scenarios.Let A be the splitting attribute. A has v distinct values,
{a1, a2, … ,av}, based on the training data.

1 A is discrete-valued:

In this case, the outcomes of the test at node N correspond directly to the known
values of A.
A branch is created for each known value, aj, of A and labeled with that value.
A need not be considered in any future partitioning of the tuples.

2 A is continuous-valued:

In this case, the test at node N has two possible outcomes, corresponding to the conditions
A <= split_point and A > split_point, respectively,
where split_point is the split point returned by Attribute selection method as part of the
splitting criterion.

3 A is discrete-valued and a binary tree must be produced:

The test at node N is of the form "A ∈ SA?".

SA is the splitting subset for A, returned by Attribute selection method as part of the splitting
criterion. It is a subset of the known values of A.

(a) A is discrete-valued; (b) A is continuous-valued; (c) A is discrete-valued and a binary tree
must be produced.
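The "Attribute selection method" mentioned above is typically an information-gain style heuristic. As a hedged illustration (not the exact algorithm of these notes), the Python sketch below computes entropy and the information gain of a categorical attribute over a small, made-up training set; the attribute that maximizes this gain would be chosen as the splitting attribute.

    from math import log2
    from collections import Counter

    def entropy(labels):
        total = len(labels)
        return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

    def information_gain(rows, attr_index):
        """Expected reduction in entropy from partitioning the tuples on one attribute."""
        labels = [r[-1] for r in rows]
        partitions = {}
        for r in rows:
            partitions.setdefault(r[attr_index], []).append(r[-1])
        expected = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
        return entropy(labels) - expected

    # hypothetical training tuples: (age_group, income, class = buys_computer)
    D = [("youth", "high", "no"), ("youth", "low", "yes"), ("middle", "high", "yes"),
         ("senior", "low", "yes"), ("senior", "high", "no"), ("middle", "low", "yes")]
    print(information_gain(D, 0), information_gain(D, 1))   # gain of age_group vs. income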

3.3 Bayesian Classification:

Bayesian classifiers are statistical classifiers.


They can predict class membership probabilities, such as the probability that a given tuple
belongs to a particular class.
Bayesian classification is based on Bayes’ theorem.

3.3.1 Bayes’ Theorem:


Let X be a data tuple. In Bayesian terms, X is considered "evidence", and it is described by
measurements made on a set of n attributes.

Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.
For classification problems, we want to determine P(H|X), the probability that the hypothesis
H holds given the "evidence" or observed data tuple X.
P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.
Bayes' theorem is useful in that it provides a way of calculating the posterior probability,
P(H|X), from P(H), P(X|H), and P(X):

P(H|X) = P(X|H) P(H) / P(X)

3.3.2 Naïve Bayesian Classification:

The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:

1.Let D be a training set of tuples and their associated class labels. As usual, each tuple is
represented by an n-dimensional attribute vector, X = (x1, x2, …,xn), depicting n
measurements made on the tuple from n attributes, respectively, A1, A2, …, An.

2. Suppose that there are m classes, C1, C2, …, Cm. Given a tuple, X, the classifier will
predict that X belongs to the class having the highest posterior probability, conditioned on X.

That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only
if

P(Ci|X) > P(Cj|X)   for 1 ≤ j ≤ m, j ≠ i.

Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the
maximum posteriori hypothesis. By Bayes' theorem,

P(Ci|X) = P(X|Ci) P(Ci) / P(X)

3.As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the class
prior probabilities are not known, then it is commonly assumed that the classes are equally
likely, that is, P(C1) = P(C2) = …= P(Cm), and we would therefore maximize P(X|Ci).
Otherwise, we maximize P(X|Ci)P(Ci).

4. Given data sets with many attributes, it would be extremely computationally expensive to
compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci), the naive assumption
of class conditional independence is made. This presumes that the values of the attributes
are conditionally independent of one another, given the class label of the tuple. Thus,

P(X|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)
We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), ..., P(xn|Ci) from the training
tuples. For each attribute, we look at whether the attribute is categorical or
continuous-valued. For instance, to compute P(X|Ci), we consider the following:
 If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the value
xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
 If Ak is continuous-valued, then we need to do a bit more work, but the calculation is pretty
straightforward.
A continuous-valued attribute is typically assumed to have a Gaussian distribution with a
mean μ and standard deviation σ, defined by

g(x, μ, σ) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²)),   so that P(xk|Ci) = g(xk, μCi, σCi)

5. In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci.
The classifier predicts that the class label of tuple X is the class Ci if and only if

P(X|Ci) P(Ci) > P(X|Cj) P(Cj)   for 1 ≤ j ≤ m, j ≠ i.
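A minimal sketch of steps 1-5 for categorical attributes is given below in Python. The training tuples, attribute values, and helper names (train_nb, predict_nb) are invented for illustration; the code simply counts frequencies to estimate P(Ci) and P(xk|Ci) and picks the class that maximizes their product.

    from collections import Counter, defaultdict

    def train_nb(rows):
        """rows: tuples (x1, ..., xn, class). Estimate P(Ci) and P(xk | Ci) by counting."""
        class_counts = Counter(r[-1] for r in rows)
        cond_counts = defaultdict(Counter)      # (attribute index, class) -> value counts
        for r in rows:
            for k, v in enumerate(r[:-1]):
                cond_counts[(k, r[-1])][v] += 1
        return class_counts, cond_counts, len(rows)

    def predict_nb(x, class_counts, cond_counts, n):
        best, best_score = None, -1.0
        for c, cc in class_counts.items():
            score = cc / n                               # P(Ci)
            for k, v in enumerate(x):                    # class conditional independence:
                score *= cond_counts[(k, c)][v] / cc     # P(X|Ci) = product of P(xk|Ci)
            if score > best_score:
                best, best_score = c, score
        return best, best_score

    D = [("youth", "high", "no"), ("youth", "low", "yes"), ("middle", "high", "yes"),
         ("senior", "low", "yes"), ("senior", "high", "no"), ("middle", "low", "yes")]
    print(predict_nb(("youth", "low"), *train_nb(D)))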

3.4 A Multilayer Feed-Forward Neural Network:


The backpropagation algorithm performs learning on a multilayer feed-
forward neural network.
It iteratively learns a set of weights for prediction of the class label of tuples.
A multilayer feed-forward neural network consists of an input layer, one or more hidden
layers, and an output layer.
Example:

The inputs to the network correspond to the attributes measured for each training tuple. The
inputs are fed simultaneously into the units making up the input layer. These inputs pass
through the input layer and are then weighted and fed simultaneously to a second layer
known as a hidden layer.
The outputs of the hidden layer units can be input to another hidden layer, and so on. The
number of hidden layers is arbitrary.
The weighted outputs of the last hidden layer are input to units making up the output layer,
which emits the network’s prediction for given tuples

3.4.1 Classification by Backpropagation:


Backpropagation is a neural network learning algorithm.
A neural network is a set of connected input/output units in which each connection has a
weight associated with it.
During the learning phase, the network learns by adjusting the weights so as to be able to
predict the correct class label of the input tuples.
Neural network learning is also referred to as connectionist learning due to the connections
between units.

Neural networks involve long training times and are therefore more suitable for
applications where this is feasible.
Backpropagation learns by iteratively processing a data set of training tuples, comparing
the network’s prediction for each tuple with the actual known target value.
The target value may be the known class label of the training tuple (for classification
problems) or a continuous value (for prediction).
For each training tuple, the weights are modified so as to minimize the mean squared
error between the network's prediction and the actual target value. These modifications
are made in the "backwards" direction, that is, from the output layer, through each hidden
layer down to the first hidden layer; hence the name backpropagation.
Although it is not guaranteed, in general the weights will eventually converge, and the
learning process stops.
Advantages:
They include a high tolerance of noisy data as well as the ability to classify patterns on
which the network has not been trained.
They can be used when you may have little knowledge of the relationships between
attributes and classes.
They are well-suited for continuous-valued inputs and outputs, unlike most decision tree
algorithms.
They have been successful on a wide array of real-world data, including handwritten
character recognition, pathology and laboratory medicine, and training a computer to
pronounce English text.
Neural network algorithms are inherently parallel; parallelization techniques can be used
to speed up the computation process.
Process:
Initialize the weights:

The weights in the network are initialized to small random numbers ranging from -1.0 to 1.0,
or -0.5 to 0.5. Each unit has a bias associated with it. The biases are similarly initialized to
small random numbers.

Each training tuple, X, is processed by the following steps.

Propagate the inputs forward:

First, the training tuple is fed to the input layer of the network. The inputs pass through the input
units, unchanged. That is, for an input unit j, its output, Oj, is equal to its input value, Ij. Next, the
net input and output of each unit in the hidden and output layers are computed. The net input to a
unit in the hidden or output layers is computed as a linear combination of its inputs.
Each such unit has a number of inputs to it that are, in fact, the outputs of the units connected to it
in the previous layer. Each connection has a weight. To compute the net input to the unit, each
input connected to the unit is multiplied by its corresponding weight, and this is summed:

Ij = Σi wij Oi + θj

where wij is the weight of the connection from unit i in the previous layer to unit j;
Oi is the output of unit i from the previous layer;
θj is the bias of the unit, and it acts as a threshold in that it serves to vary the activity of the unit.

Each unit in the hidden and output layers takes its net input and then applies an activation
function to it, typically the logistic (sigmoid) function:

Oj = 1 / (1 + e^(−Ij))

Backpropagate the error:

The error is propagated backward by updating the weights and biases to reflect the error of
the network's prediction. For a unit j in the output layer, the error Errj is computed by

Errj = Oj (1 − Oj) (Tj − Oj)

where Oj is the actual output of unit j, and Tj is the known target value of the given training
tuple.
The error of a hidden-layer unit j is

Errj = Oj (1 − Oj) Σk Errk wjk

where wjk is the weight of the connection from unit j to a unit k in the next higher layer,
and Errk is the error of unit k.
Weights are updated by the following equations, where Δwij is the change in weight wij and
l is the learning rate:

Δwij = (l) Errj Oi,    wij = wij + Δwij

Biases are updated by the following equations:

Δθj = (l) Errj,    θj = θj + Δθj
Algorithm:
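The algorithm figure from the original notes is not reproduced here; instead, the following is a minimal NumPy sketch of the same process for a single hidden layer with sigmoid units. The toy data, layer sizes, learning rate, and epoch count are arbitrary choices made only for illustration.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)

    # toy training tuples (XOR-like) and their target values
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    T = np.array([[0], [1], [1], [0]], dtype=float)

    n_in, n_hidden, n_out, l = 2, 4, 1, 0.5           # l is the learning rate

    # initialize weights and biases to small random numbers
    W1 = rng.uniform(-0.5, 0.5, (n_in, n_hidden));  b1 = rng.uniform(-0.5, 0.5, n_hidden)
    W2 = rng.uniform(-0.5, 0.5, (n_hidden, n_out)); b2 = rng.uniform(-0.5, 0.5, n_out)

    for epoch in range(10000):
        for x, t in zip(X, T):
            # propagate the inputs forward
            O1 = sigmoid(x @ W1 + b1)                    # hidden-layer outputs
            O2 = sigmoid(O1 @ W2 + b2)                   # output-layer outputs
            # backpropagate the error
            err_out = O2 * (1 - O2) * (t - O2)           # Errj for output units
            err_hid = O1 * (1 - O1) * (err_out @ W2.T)   # Errj for hidden units
            # update weights and biases after each tuple (case updating)
            W2 += l * np.outer(O1, err_out); b2 += l * err_out
            W1 += l * np.outer(x, err_hid);  b1 += l * err_hid

    print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))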

3.5 k-Nearest-Neighbor Classifier:


Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a
given test tuple with training tuples that are similar to it.
The training tuples are described by n attributes. Each tuple represents a point in an n-dimensional
space. In this way, all of the training tuples are stored in an n-dimensional pattern space. When
given an unknown tuple, a k-nearest-neighbor classifier searches the pattern space for the k
training tuples that are closest to the unknown tuple. These k training tuples are the k nearest
neighbors of the unknown tuple.
Closeness is defined in terms of a distance metric, such as Euclidean distance.
The Euclidean distance between two points or tuples, say, X1 = (x11, x12, …, x1n) and
X2 = (x21, x22, …, x2n), is

dist(X1, X2) = √( Σi (x1i − x2i)² ),  where the sum runs over the n attributes.

In other words, for each numeric attribute, we take the difference between the corresponding
values of that attribute in tuple X1 and in tuple X2, square this difference, and accumulate it.
The square root is taken of the total accumulated distance count.
Min-max normalization can be used to transform a value v of a numeric attribute A to v' in
the range [0, 1] by computing

v' = (v − minA) / (maxA − minA)

where minA and maxA are the minimum and maximum values of attribute A.

For k-nearest-neighbor classification, the unknown tuple is assigned the most common
class among its k nearest neighbors.
When k = 1, the unknown tuple is assigned the class of the training tuple that is closest to
it in pattern space.
Nearest-neighbor classifiers can also be used for prediction, that is, to return a real-valued
prediction for a given unknown tuple.
In this case, the classifier returns the average value of the real-valued labels associated
with the k nearest neighbors of the unknown tuple.
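The whole classifier fits in a few lines. The following Python sketch uses the Euclidean distance and majority voting described above on a small invented training set whose attribute values are assumed to be already min-max normalized.

    from math import sqrt
    from collections import Counter

    def euclidean(x1, x2):
        return sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

    def knn_classify(train, query, k=3):
        """train: list of (attribute_tuple, class_label); majority vote among the k nearest."""
        neighbors = sorted(train, key=lambda t: euclidean(t[0], query))[:k]
        return Counter(label for _, label in neighbors).most_common(1)[0][0]

    # hypothetical, already normalized training tuples
    train = [((0.10, 0.20), "safe"), ((0.20, 0.10), "safe"), ((0.15, 0.25), "safe"),
             ((0.90, 0.80), "risky"), ((0.80, 0.90), "risky")]
    print(knn_classify(train, (0.85, 0.85), k=3))   # -> "risky"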

3.6 Other Classification Methods:

3.6.1 Genetic Algorithms:

Genetic algorithms attempt to incorporate ideas of natural evolution. In general, genetic learning
starts as follows.
An initial population is created consisting of randomly generated rules. Each rule can be
represented by a string of bits. As a simple example, suppose that samples in a given
training set are described by two Boolean attributes, A1 and A2, and that there are two
classes, C1 and C2.
The rule "IF A1 AND NOT A2 THEN C2" can be encoded as the bit string "100", where
the two leftmost bits represent attributes A1 and A2, respectively, and the rightmost bit
represents the class.
Similarly, the rule "IF NOT A1 AND NOT A2 THEN C1" can be encoded as "001".
If an attribute has k values, where k > 2, then k bits may be used to encode the attribute's
values.
Classes can be encoded in a similar fashion.
Based on the notion of survival of the fittest, a new population is formed to consist of
the fittest rules in the current population, as well as offspring of these rules.
Typically, the fitness of a rule is assessed by its classification accuracy on a set of training
samples.
Offspring are created by applying genetic operators such as crossover and mutation.
In crossover, substrings from pairs of rules are swapped to form new pairs of rules.
In mutation, randomly selected bits in a rule's string are inverted.
The process of generating new populations based on prior populations of rules continues
until a population, P, evolves where each rule in P satisfies a pre specified fitness
threshold.
Genetic algorithms are easily parallelizable and have been used for classification as
well as other optimization problems. In data mining, they may be used to evaluate the
fitness of other algorithms.

3.6.2 Fuzzy Set Approaches:


Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership
that a certain value has in a given category. Each category then represents a fuzzy set.
Fuzzy logic systems typically provide graphical tools to assist users in converting attribute
values to fuzzy truth values.
Fuzzy set theory is also known as possibility theory.

It was proposed by Lotfi Zadeh in 1965 as an alternative to traditional two-value logic and
probability theory.
It lets us work at a high level of abstraction and offers a means for dealing with imprecise
measurement of data.
Most important, fuzzy set theory allows us to deal with vague or inexact facts.
Unlike the notion of traditional "crisp" sets, where an element either belongs to a set S or its
complement, in fuzzy set theory, elements can belong to more than one fuzzy set.
Fuzzy set theory is useful for data mining systems performing rule-based classification.
It provides operations for combining fuzzy measurements.
Several procedures exist for translating the resulting fuzzy output into a defuzzified or crisp
value that is returned by the system.
Fuzzy logic systems have been used in numerous areas for classification, including
market research, finance, health care, and environmental engineering.


3.7 Regression Analysis:


Regression analysis can be used to model the relationship between one or more independent
or predictor variables and a dependent or response variable, which is continuous-valued.
In the context of data mining, the predictor variables are the attributes of interest describing
the tuple (i.e., making up the attribute vector).
In general, the values of the predictor variables are known.

The response variable is what we want to predict.

3.7.1 Linear Regression:


Straight-line regression analysis involves a response variable, y, and a single predictor
variable x.
It is the simplest form of regression, and models y as a linear function of x.
That is, y = b + wx,
where the variance of y is assumed to be constant, and
b and w are regression coefficients specifying the Y-intercept and slope of the line.
The regression coefficients, w and b, can also be thought of as weights, so that we can
equivalently write y = w0 + w1x.
These coefficients can be solved for by the method of least squares, which estimates the
best-fitting straight line as the one that minimizes the error between the actual data and
the estimate of the line.
Let D be a training set consisting of values of the predictor variable, x, for some population and
their associated values for the response variable, y. The training set contains |D| data points of
the form (x1, y1), (x2, y2), …, (x|D|, y|D|).
The regression coefficients can be estimated using the method of least squares with the
following equations:

w1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²
w0 = ȳ − w1 x̄

where x̄ is the mean value of x1, x2, …, x|D|, and ȳ is the mean value of y1, y2, …, y|D|.
The coefficients w0 and w1 often provide good approximations to otherwise complicated
regression equations.
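The least-squares estimates above are straightforward to compute directly. The short Python sketch below fits w0 and w1 on a small set of made-up (x, y) pairs.

    # hypothetical training set D of (x, y) pairs
    D = [(1, 2.1), (2, 2.9), (3, 4.2), (4, 4.8), (5, 6.1)]

    x_bar = sum(x for x, _ in D) / len(D)
    y_bar = sum(y for _, y in D) / len(D)

    # least-squares estimates of the slope w1 and intercept w0
    w1 = sum((x - x_bar) * (y - y_bar) for x, y in D) / sum((x - x_bar) ** 2 for x, _ in D)
    w0 = y_bar - w1 * x_bar

    print(f"y = {w0:.3f} + {w1:.3f} x")   # prediction for a new x is simply w0 + w1 * x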
3.7.2 Multiple Linear Regression:
It is an extension of straight-line regression so as to involve more than one predictor
variable.

It allows response variable y to be modeled as a linear function of, say, n predictor
variables or attributes, A1, A2, …, An, describing a tuple, X.
An example of a multiple linear regression model based on two predictor attributes or
variables, A1 and A2, is y = w0 + w1x1 + w2x2,
where x1 and x2 are the values of attributes A1 and A2, respectively, in X.
Multiple regression problems are instead commonly solved with the use of statistical
software packages, such as SAS, SPSS, and S-Plus.

3.7.3 Nonlinear Regression:


It can be modeled by adding polynomial terms to the basic linear model.
By applying transformations to the variables, we can convert the nonlinear model into a
linear one that can then be solved by the method of least squares.
Polynomial regression is a special case of multiple regression. That is, the addition of
high-order terms like x², x³, and so on, which are simple functions of the single variable, x,
can be considered equivalent to adding new independent variables.
Transformation of a polynomial regression model to a linear regression model:
Consider a cubic polynomial relationship given by
y = w0 + w1x + w2x² + w3x³
To convert this equation to linear form, we define new variables:
x1 = x, x2 = x², x3 = x³
The equation can then be converted to linear form by applying the above assignments,
resulting in the equation y = w0 + w1x1 + w2x2 + w3x3,
which is easily solved by the method of least squares using software for regression analysis.

3.8 Classifier Accuracy:


The accuracy of a classifier on a given test set is the percentage of test set tuples that are
correctly classified by the classifier.
In the pattern recognition literature, this is also referred to as the overall recognition rate of
the classifier, that is, it reflects how well the classifier recognizes tuples of the various
classes.
The error rate or misclassification rate of a classifier, M, is simply 1 − Acc(M),
where Acc(M) is the accuracy of M.
The confusion matrix is a useful tool for analyzing how well your classifier can recognize
tuples of different classes.
True positives refer to the positive tuples that were correctly labeled by the classifier.
True negatives are the negative tuples that were correctly labeled by the classifier.
False positives are the negative tuples that were incorrectly labeled as positive.
The sensitivity and specificity measures can be used to assess how well the classifier
recognizes the positive and the negative tuples, respectively:

sensitivity = t_pos / pos
specificity = t_neg / neg
precision = t_pos / (t_pos + f_pos)

Accuracy is a function of sensitivity and specificity:

accuracy = sensitivity · pos / (pos + neg) + specificity · neg / (pos + neg)

where t_pos is the number of true positives,
pos is the number of positive tuples,
t_neg is the number of true negatives,
neg is the number of negative tuples, and
f_pos is the number of false positives.
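As a quick numeric check of these definitions, the following Python sketch computes the measures from a hypothetical confusion matrix; all counts are invented for the example.

    # hypothetical counts from a confusion matrix
    t_pos, pos = 90, 100          # true positives / total positive tuples
    t_neg, neg = 950, 1000        # true negatives / total negative tuples
    f_pos = neg - t_neg           # false positives

    sensitivity = t_pos / pos
    specificity = t_neg / neg
    precision   = t_pos / (t_pos + f_pos)
    accuracy    = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)

    print(sensitivity, specificity, precision, round(accuracy, 4))   # 0.9 0.95 0.642... 0.9455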

Chapter-4

4.1 Cluster Analysis:


The process of grouping a set of physical or abstract objects into classes of similar objects
is called clustering.
A cluster is a collection of data objects that are similar to one another within the same
cluster and are dissimilar to the objects in other clusters.
A cluster of data objects can be treated collectively as one group and so may be considered
as a form of data compression.
Cluster analysis tools based on k-means, k-medoids, and several other methods have also been
built into many statistical analysis software packages or systems, such as S-Plus, SPSS, and
SAS.

4.1.1 Applications:
Cluster analysis has been widely used in numerous applications, including market research,
pattern recognition, data analysis, and image processing.
In business, clustering can help marketers discover distinct groups in their customer bases
and characterize customer groups based on purchasing patterns.
In biology, it can be used to derive plant and animal taxonomies, categorize genes with
similar functionality, and gain insight into structures inherent in populations.
Clustering may also help in the identification of areas of similar land use in an earth
observation database and in the identification of groups of houses in a city according to
house type, value,and geographic location, as well as the identification of groups of
automobile insurance policy holders with a high average claim cost.
Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity.

Clustering can also be used for outlier detection. Applications of outlier detection include
the detection of credit card fraud and the monitoring of criminal activities in electronic
commerce.

4.1.2 Typical Requirements Of Clustering InData Mining:


 Scalability:
Many clustering algorithms work well on small data sets containing fewer than several
hundred data objects; however, a large database may contain millions of objects. Clustering
on a sample of a given large data set may lead to biased results.
Highly scalable clustering algorithms are needed.
 Ability to deal with different types of attributes:
Many algorithms are designed to cluster interval-based (numerical) data. However,
applications may require clustering other types of data, such as binary, categorical
(nominal), and ordinal data, or mixtures of these data types.
 Discovery of clusters with arbitrary shape:
Many clustering algorithms determine clusters based on Euclidean or Manhattan distance
measures. Algorithms based on such distance measures tend to find spherical clusters with
similar size and density.
However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.
 Minimal requirements for domain knowledge to determine input parameters:
Many clustering algorithms require users to input certain parameters in cluster analysis
(such as the number of desired clusters). The clustering results can be quite sensitive to
input parameters. Parameters are often difficult to determine, especially for data sets
containing high-dimensional objects. This not only burdens users, but it also makes the
quality of clustering difficult to control.
 Ability to deal with noisy data:
Most real-world databases contain outliers or missing, unknown, or erroneous data.
Some clustering algorithms are sensitive to such data and may lead to clusters of poor
quality.
 Incremental clustering and insensitivity to the order of input records:
Some clustering algorithms cannot incorporate newly inserted data (i.e., database updates)
into existing clustering structures and, instead, must determine a new clustering from
scratch. Some clustering algorithms are sensitive to the order of input data.
That is, given a set of data objects, such an algorithm may return dramatically different
clusterings depending on the order of presentation of the input objects.
It is important to develop incremental clustering algorithms and algorithms that are insensitive to the order of input.
 High dimensionality:
A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data, involving only two to three dimensions. Human eyes are good at judging the quality of clustering for up to three dimensions. Finding clusters of data objects in high-dimensional space is challenging, especially considering that such data can be sparse and highly skewed.
 Constraint-based clustering:
Real-world applications may need to perform clustering under various kinds of constraints.
Suppose that your job is to choose the locations for a given number of new automatic
banking machines (ATMs) in a city. To decide upon this, you may cluster households
while considering constraints such as the city’s rivers and highway networks, and the type
and number of customers per cluster. A challenging task is to find groups of data with good
clustering behavior that satisfy specified constraints.
 Interpretability and usability:
Users expect clustering results to be interpretable, comprehensible, and usable. That is,
clustering may need to be tied to specific semantic interpretations and applications. It is
important to study how an application goal may influence the selection of clustering
features and methods.

4.2 Major Clustering Methods:


 Partitioning Methods
 Hierarchical Methods

 Density-Based Methods
 Grid-Based Methods
 Model-Based Methods

4.2.1 Partitioning Methods:


A partitioning method constructs k partitions of the data, where each partition represents a
cluster and k <= n. That is, it classifies the data into k groups, which together satisfy the
following requirements:
Each group must contain at least one object, and
Each object must belong to exactly one group.

A partitioning method creates an initial partitioning. It then uses an iterative relocation


technique that attempts to improve the partitioning by moving objects from one group to
another.

The general criterion of a good partitioning is that objects in the same cluster are close or
related to each other, whereas objects of different clusters are far apart or very different.

4.2.2 Hierarchical Methods:


A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed.

 The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one or until a termination condition holds.
 The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object is in one cluster, or until a termination condition holds.

Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone. This rigidity is useful in that it leads to smaller computation costs by not having to worry about a combinatorial number of different choices.

There are two approaches to improving the quality of hierarchical clustering:

 Perform careful analysis of object "linkages" at each hierarchical partitioning, such as in Chameleon, or
 Integrate hierarchical agglomeration and other approaches by first using a hierarchical agglomerative algorithm to group objects into microclusters, and then performing macroclustering on the microclusters using another clustering method such as iterative relocation.
4.2.3 Density-based methods:
 Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty at discovering clusters of arbitrary shapes.
 Other clustering methods have been developed based on the notion of density. Their general idea is to continue growing the given cluster as long as the density in the neighborhood exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape.
 DBSCAN and its extension, OPTICS, are typical density-based methods that grow clusters according to a density-based connectivity analysis. DENCLUE is a method that clusters objects based on the analysis of the value distributions of density functions. A minimal DBSCAN sketch follows this list.
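The sketch below is a minimal density-based clustering example using scikit-learn's DBSCAN, assuming scikit-learn is available; the data and the eps/min_samples values are illustrative choices, not taken from the notes.

# A minimal DBSCAN sketch: two dense blobs plus scattered noise points.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
blob1 = rng.normal(loc=(0, 0), scale=0.3, size=(50, 2))
blob2 = rng.normal(loc=(5, 5), scale=0.3, size=(50, 2))
noise = rng.uniform(low=-2, high=7, size=(10, 2))
X = np.vstack([blob1, blob2, noise])

# eps is the neighborhood radius, min_samples the density threshold.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("cluster labels found:", set(labels))   # -1 marks noise/outliers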
4.2.4 Grid-Based Methods:
 Grid-based methods quantize the object space into a finite number of cells that form a
grid structure.
 All of the clustering operations are performed on the grid structure i.e., on the quantized
space. The main advantage of this approach is its fast processing time, which is
typically independent of the number of data objects and dependent only on the number
of cells in each dimension in the quantized space.
 STING is a typical example of a grid-based method. Wave Cluster applies wavelet
transformation for clustering analysis and is both grid-based and density-based.

4.2.5 Model-Based Methods:


 Model-based methods hypothesize a model for each of the clusters and find the best fit
of the data to the given model.
 A model-based algorithm may locate clusters by constructing a density function that
reflects the spatial distribution of the data points.
 It also leads to a way of automatically determining the number of clusters based on standard statistics, taking "noise" or outliers into account and thus yielding robust clustering methods.

4.3 Tasks in Data Mining:


 Clustering High-Dimensional Data
 Constraint-Based Clustering
4.3.1 Clustering High-Dimensional Data:
It is a particularly important task in cluster analysis because many applications
require the analysis of objects containing a large number of features or dimensions.
For example, text documents may contain thousands of terms or keywords as features, and DNA microarray data may provide information on the expression levels of thousands of genes under hundreds of conditions.
Clustering high-dimensional data is challenging due to the curse of dimensionality. Many dimensions may not be relevant. As the number of dimensions increases, the data become increasingly sparse, so that the distance measurement between pairs of points becomes meaningless and the average density of points anywhere in the data is likely to be low. Therefore, a different clustering methodology needs to be developed for high-dimensional data.
CLIQUE and PROCLUS are two influential subspace clustering methods, which search for clusters in subspaces of the data, rather than over the entire data space.
Frequent pattern–based clustering, another clustering methodology, extracts distinct frequent patterns among subsets of dimensions that occur frequently. It uses such patterns to group objects and generate meaningful clusters.

4.3.2 Constraint-Based Clustering:


It is a clustering approach that performs clustering by incorporation of user-specified
or application-oriented constraints.
A constraint expresses a user’s expectation or describes properties of the desired
clustering results, and provides an effective means for communicating with the
clustering process.
Various kinds of constraints can be specified, either by a user or as per application
requirements.
Spatial clustering can be performed in the presence of obstacles and under user-specified constraints. In addition, semi-supervised clustering employs pairwise constraints in order to improve the quality of the resulting clustering.

4.4 Classical Partitioning Methods:


The most well-known and commonly used partitioning methods are
 The k-Means Method
 k-Medoids Method
4.4.1 Centroid-Based Technique: The K-Means Method:
The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low.
Cluster similarity is measured in regard to the mean value of the objects in a cluster, which
can be viewed as the cluster’s centroid or center of gravity.
The k-means algorithm proceeds as follows.

First, it randomly selects k of the objects, each of which initially represents a cluster
mean or center.
For each of the remaining objects, an object is assigned to the cluster to which it is the
most similar, based on the distance between the object and the cluster mean.
It then computes the new mean for each cluster.
This process iterates until the criterion function converges.

Typically, the square-error criterion is used, defined as

E = Σ_{i=1..k} Σ_{p ∈ Ci} |p − mi|²

where E is the sum of the square error for all objects in the data set, p is the point in space representing a given object, and mi is the mean of cluster Ci.

4.4.1 The k-means partitioning algorithm:


The k-means algorithm for partitioning, where each cluster’s center is represented by the mean
value of the objects in the cluster.

Clustering of a set of objects based on the k-means method
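To make the iterative relocation concrete, the sketch below implements the k-means loop described above with NumPy; the toy data, the value of k, and the convergence check are assumptions for the example, and the sketch assumes no cluster ever becomes empty.

# A minimal k-means sketch following the steps above: pick k initial means,
# assign each object to its nearest mean, recompute means, repeat until stable.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly select k objects as the initial cluster means.
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each object to the cluster with the nearest mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each cluster mean (assumes no cluster is empty).
        new_means = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop when the means no longer change (criterion function converged).
        if np.allclose(new_means, means):
            break
        means = new_means
    return labels, means

X = np.vstack([np.random.default_rng(1).normal(c, 0.4, size=(30, 2))
               for c in [(0, 0), (4, 4), (0, 5)]])
labels, means = kmeans(X, k=3)
print(means)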

4.4.2 The k-Medoids Method:

The k-means algorithm is sensitive to outliers because an object with an extremely large
value may substantially distort the distribution of data. This effect is particularly
exacerbated due to the use of the square-error function.
Instead of taking the mean value of the objects in a cluster as a reference point, we can pick
actual objects to represent the clusters, using one representative object per cluster. Each
remaining object is clustered with the representative object to which it is the most similar.
The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. That is, an absolute-error criterion is used, defined as

E = Σ_{j=1..k} Σ_{p ∈ Cj} |p − oj|

where E is the sum of the absolute error for all objects in the data set, p is the point in space representing a given object in cluster Cj, and oj is the representative object of Cj.

The initial representative objects are chosen arbitrarily. The iterative process of replacing representative objects by nonrepresentative objects continues as long as the quality of the resulting clustering is improved.
This quality is estimated using a cost function that measures the average dissimilarity between an object and the representative object of its cluster.
To determine whether a nonrepresentative object, orandom, is a good replacement for a current representative object, oj, the following four cases are examined for each of the nonrepresentative objects.

Case 1:

p currently belongs to representative object, oj. If oj is replaced by orandom as a representative object and p is closest to one of the other representative objects, oi, i≠j, then p is reassigned to oi.

Case 2:

p currently belongs to representative object, oj. If oj is replaced by orandom as a representative object and p is closest to orandom, then p is reassigned to orandom.

Case 3:

p currently belongs to representative object, oi, i≠j. If oj is replaced by orandom as a representative object and p is still closest to oi, then the assignment does not change.

Case 4:

p currently belongs to representative object, oi, i≠j. If oj is replaced by orandom as a representative object and p is closest to orandom, then p is reassigned to orandom.

Four cases of the cost function for k-medoids clustering

4.4.2 The k-Medoids Algorithm:

The k-medoids algorithm for partitioning based on medoid or central objects.

The k-medoids method is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean. However, its processing is more costly than the k-means method.
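The swap-based reasoning of the four cases can be illustrated with a simplified PAM-style sketch: repeatedly try replacing a representative object with a nonrepresentative one and keep the swap only if the absolute error E decreases. The toy data and the greedy full-swap search are simplifying assumptions, not the exact algorithm box referenced above.

# A simplified k-medoids sketch: swap a medoid with a non-medoid and keep the
# swap if it lowers the absolute-error criterion E.
import numpy as np

def k_medoids(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))

    def total_error(med_idx):
        # E = sum over objects of the distance to the nearest medoid.
        d = np.linalg.norm(X[:, None, :] - X[med_idx][None, :, :], axis=2)
        return d.min(axis=1).sum()

    best = total_error(medoids)
    improved = True
    while improved:
        improved = False
        for mi in range(k):
            for o in range(len(X)):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[mi] = o
                err = total_error(candidate)
                if err < best:              # keep the swap only if quality improves
                    best, medoids, improved = err, candidate, True
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return d.argmin(axis=1), X[medoids]

X = np.vstack([np.random.default_rng(2).normal(c, 0.5, size=(25, 2))
               for c in [(0, 0), (6, 6)]])
labels, medoids = k_medoids(X, k=2)
print(medoids)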

4.5 Hierarchical Clustering Methods:

A hierarchical clustering method works by grouping data objects into a tree of clusters.
The quality of a pure hierarchical clustering method suffers from its inability to perform adjustment once a merge or split decision has been executed. That is, if a particular merge or split decision later turns out to have been a poor choice, the method cannot backtrack and correct it.

Hierarchical clustering methods can be further classified as either agglomerative or divisive,


depending on whether the hierarchical decomposition is formed in a bottom-up or top-down
fashion.

4.5.1 Agglomerative hierarchical clustering:


This bottom-up strategy starts by placing each object in its own cluster and then merges
these atomic clusters into larger and larger clusters, until all of the objects are in a single
cluster or until certain termination conditions are satisfied.
Most hierarchical clustering methods belong to this category. They differ only in their
definition of intercluster similarity.

4.5.2 Divisive hierarchical clustering:


This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster.
It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until it satisfies certain termination conditions, such as a desired number of clusters is obtained or the diameter of each cluster is within a certain threshold.
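As a concrete illustration of the bottom-up strategy, the sketch below uses SciPy's hierarchical clustering routines; the toy data, the choice of average linkage, and the two-cluster cut are assumptions for the example.

# A minimal agglomerative-clustering sketch: start with singleton clusters and
# successively merge the closest pair until a termination condition (here, a
# desired number of clusters) is reached.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
X = np.vstack([rng.normal((0, 0), 0.3, size=(20, 2)),
               rng.normal((4, 4), 0.3, size=(20, 2))])

# 'average' linkage defines intercluster similarity as the mean pairwise distance.
Z = linkage(X, method="average")

# Cut the merge tree so that two clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(set(labels))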

4.6 Constraint-Based Cluster Analysis:
Constraint-based clustering finds clusters that satisfy user-specified preferences or constraints. Depending on the nature of the constraints, constraint-based clustering may adopt rather different approaches.
There are a few categories of constraints.
 Constraints on individual objects:

We can specify constraints on the objects to be clustered. In a real estate application, for example, one may like to spatially cluster only those luxury mansions worth over a million dollars. This constraint confines the set of objects to be clustered. It can easily be handled by preprocessing, after which the problem reduces to an instance of unconstrained clustering.

 Constraints on the selection of clustering parameters:

A user may like to set a desired range for each clustering parameter. Clustering parameters are usually quite specific to the given clustering algorithm. Examples of parameters include k, the desired number of clusters in a k-means algorithm, or ε, the radius, and the minimum number of points in the DBSCAN algorithm. Although such user-specified parameters may strongly influence the clustering results, they are usually confined to the algorithm itself. Thus, their fine-tuning and processing are usually not considered a form of constraint-based clustering.
 Constraints on distance or similarity functions:

We can specify different distance or similarity functions for specific attributes of the objects to be clustered, or different distance measures for specific pairs of objects. When clustering sportsmen, for example, we may use different weighting schemes for height, body weight, age, and skill level. Although this will likely change the mining results, it may not alter the clustering process per se. However, in some cases, such changes may make the evaluation of the distance function nontrivial, especially when it is tightly intertwined with the clustering process.
 User-specified constraints on the properties of individual clusters:
A user may like to specify desired characteristics of the resulting clusters, which may strongly influence the clustering process.
 Semi-supervised clustering based on partial supervision:
The quality of unsupervised clustering can be significantly improved using some weak form of supervision. This may be in the form of pairwise constraints (i.e., pairs of objects labeled as belonging to the same or different cluster). Such a constrained clustering process is called semi-supervised clustering.

4.7 Outlier Analysis:

There exist data objects that do not comply with the general behavior or model of the data.
Such data objects, which are grossly different from or inconsistent with the remaining set
of data, are called outliers.
Many data mining algorithms try to minimize the influence of outliers or eliminate them altogether. This, however, could result in the loss of important hidden information because
one person’s noise could be another person’s signal. In other words, the outliers may be of
particular interest, such as in the case of fraud detection, where outliers may indicate
fraudulent activity. Thus, outlier detection and analysis is an interesting data mining task,
referred to as outlier mining.
It can be used in fraud detection, for example, by detecting unusual usage of credit cards or
telecommunication services. In addition, it is useful in customized marketing for
identifying the spending behavior of customers with extremely low or extremely high
incomes, or in medical analysis for finding unusual responses to various medical treatments.

Outlier mining can be described as follows: Given a set of n data points or objects and k, the expected number of outliers, find the top k objects that are considerably dissimilar, exceptional, or inconsistent with respect to the remaining data. The outlier mining problem can be viewed as two subproblems:
Define what data can be considered as inconsistent in a given data set, and
Find an efficient method to mine the outliers so defined.
Types of outlier detection:
 Statistical Distribution-Based Outlier Detection
 Distance-Based Outlier Detection
 Density-Based Local Outlier Detection
 Deviation-Based Outlier Detection

4.7.1 Statistical Distribution-Based Outlier Detection:


The statistical distribution-based approach to outlier detection assumes a distribution or probability model for the given data set (e.g., a normal or Poisson distribution) and then identifies outliers with respect to the model using a discordancy test. Application of the test requires knowledge of the data set parameters, knowledge of distribution parameters such as the mean and variance, and the expected number of outliers.
A statistical discordancy test examines two hypotheses:
A working hypothesis
An alternative hypothesis
A working hypothesis, H, is a statement that the entire data set of n objects comes from an initial distribution model, F, that is,

H : oi ∈ F, where i = 1, 2, …, n
The hypothesis is retained if there is no statistically significant evidence supporting its


rejection. A discordancy test verifies whether an object, oi, is significantly large (or
small) in relation to the distribution F. Different test statistics have been proposed for use
as a discordancy test, depending on the available knowledge of the data. Assuming that
some statistic, T, has been chosen for discordancy testing, and the value of the statistic
for object oi is vi, then the distribution of T is constructed. Significance probability,
SP(vi)=Prob(T > vi), is evaluated. If SP(vi) is sufficiently small, then oi is discordant and
the working hypothesis is rejected.
An alternative hypothesis, H, which states that oi comes from another distribution model,
G, is adopted. The result is very much dependent on which model F is chosen because
oi may be an outlier under one model and a perfectly valid value under another. The

alternative distribution is very important in determining the power of the test, that is, the
probability that the working hypothesis is rejected when oi is really an outlier.
There are different kinds of alternative distributions.
Inherent alternative distribution:
In this case, the working hypothesis that all of the objects come from distribution F is
rejected in favor of the alternative hypothesis that all of the objects arise from another
distribution, G:
H : oi ∈ G, where i = 1, 2, …, n
F and G may be different distributions or differ only in parameters of the same
distribution.
There are constraints on the form of the G distribution in that it must have potential to
produce outliers. For example, it may have a different mean or dispersion, or a longer
tail.
Mixture alternative distribution:
The mixture alternative states that discordant values are not outliers in the F population,
but contaminants from some other population,
G. In this case, the alternative hypothesis is

H : oi ∈ (1 − λ)F + λG, where i = 1, 2, …, n
Slippage alternative distribution:


This alternative states that all of the objects (apart from some prescribed small number)
arise independently from the initial model, F, with its given parameters, whereas the
remaining objects are independent observations from a modified version of F in which
the parameters have been shifted.
There are two basic types of procedures for detecting outliers:
Block procedures:
In this case, either all of the suspect objects are treated as outliers or all of them are accepted as consistent.
Consecutive procedures:
An example of such a procedure is the inside-out procedure. Its main idea is that the object that is least likely to be an outlier is tested first. If it is found to be an outlier, then all of the
more extreme values are also considered outliers; otherwise, the next most extreme object is tested, and so on. This procedure tends to be more effective than block procedures.
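As a deliberately simple illustration of discordancy testing, the sketch below assumes a normal working distribution F, uses the standardized distance from the mean as the test statistic T, and treats the upper-tail probability as SP(v); the data and the 0.01 threshold are assumptions for the example, not values from the notes.

# A minimal discordancy-test sketch under an assumed normal model F:
# T is the standardized distance from the mean, SP(v) = Prob(T > v),
# and an object is flagged as discordant when SP(v) is sufficiently small.
import math
import statistics

data = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2, 12.1, 11.7, 12.3, 12.0, 40.0]
mu = statistics.mean(data)
sigma = statistics.stdev(data)

def significance_probability(x):
    z = abs(x - mu) / sigma
    # Upper-tail probability of the standard normal via the error function.
    return 0.5 * math.erfc(z / math.sqrt(2))

threshold = 0.01
for x in data:
    if significance_probability(x) < threshold:
        # Only the extreme value (40.0) should be flagged with this toy data.
        print(f"{x} is discordant: working hypothesis rejected for this object")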

4.7.2 Distance-Based Outlier Detection:


The notion of distance-based outliers was introduced to counter the main limitations imposed by statistical methods. An object, o, in a data set, D, is a distance-based (DB) outlier with parameters pct and dmin, that is, a DB(pct, dmin)-outlier, if at least a fraction, pct, of the objects in D lie at a distance greater than dmin from o. In other words, rather than relying on statistical tests, we can think of distance-based outliers as those objects that do not have enough neighbors, where neighbors are defined based on distance from the given object. In comparison with statistical-based methods, distance-based outlier detection generalizes the ideas behind discordancy testing for various standard distributions. Distance-based outlier detection avoids the excessive computation that can be associated with fitting the observed distribution into some standard distribution and in selecting discordancy tests.
For many discordancy tests, it can be shown that if an object, o, is an outlier according to the given test, then o is also a DB(pct, dmin)-outlier for some suitably defined pct and dmin. For example, if objects that lie three or more standard deviations from the mean are considered to be outliers, assuming a normal distribution, then this definition can be generalized by a DB(0.9988, 0.13σ)-outlier.
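The DB(pct, dmin) definition can be checked directly with a brute-force scan, as in the sketch below; the toy data and parameter values are assumptions, and the efficiency concerns addressed by the index-based, nested-loop, and cell-based algorithms that follow are deliberately ignored.

# A minimal DB(pct, dmin)-outlier check: o is an outlier if at least a fraction
# pct of the other objects lie at a distance greater than dmin from o.
import numpy as np

def db_outliers(X, pct, dmin):
    n = len(X)
    outliers = []
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)
        far = np.sum(dists > dmin) / (n - 1)   # fraction of other objects far from o
        if far >= pct:
            outliers.append(i)
    return outliers

rng = np.random.default_rng(4)
X = np.vstack([rng.normal((0, 0), 0.5, size=(40, 2)), [[8.0, 8.0]]])
print(db_outliers(X, pct=0.95, dmin=3.0))   # the isolated point should be reported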
Several efficient algorithms for mining distance-based outliers have been developed.
Index-based algorithm:
Given a data set, the index-based algorithm uses multidimensional indexing structures, such as R-trees or k-d trees, to search for neighbors of each object o within radius dmin around that object. Let M be the maximum number of objects within the dmin-neighborhood of an outlier. Therefore, once M+1 neighbors of object o are found, it is clear that o is not an outlier. This algorithm has a worst-case complexity of O(n²k), where n is the number of objects in the data set and k is the dimensionality. The index-based algorithm scales well as k increases. However, this complexity evaluation takes only the search time into account, even though the task of building an index in itself can be computationally intensive.
Nested-loop algorithm:

The nested-loop algorithm has the same computational complexity as the index-based algorithm but avoids index structure construction and tries to minimize the number of I/Os. It divides the memory buffer space into two halves and the data set into several logical blocks. By carefully choosing the order in which blocks are loaded into each half, I/O efficiency can be achieved.
Cell-based algorithm:
To avoid O(n²) computational complexity, a cell-based algorithm was developed for memory-resident data sets. Its complexity is O(c^k + n), where c is a constant depending on the number of cells and k is the dimensionality.

In this method, the data space is partitioned into cells with a side length equal to dmin/(2√k). Each cell has two layers surrounding it. The first layer is one cell thick, while the second is 2√k − 1 cells thick, rounded up to the closest integer. The algorithm counts outliers on a cell-by-cell rather than an object-by-object basis. For a given cell, it accumulates three counts: the number of objects in the cell, in the cell and the first layer together, and in the cell and both layers together. Let's refer to these counts as cell count, cell + 1 layer count, and cell + 2 layers count, respectively.

Let M be the maximum number of objects that can exist in the dmin-neighborhood of an outlier.
An object, o, in the current cell is considered an outlier only if cell + 1 layer count is less than or equal to M. If this condition does not hold, then all of the objects in the cell can be removed from further investigation as they cannot be outliers.
If cell + 2 layers count is less than or equal to M, then all of the objects in the cell are considered outliers. Otherwise, if this number is more than M, then it is possible that some of the objects in the cell may be outliers. To detect these outliers, object-by-object processing is used where, for each object, o, in the cell, objects in the second layer of o are examined. For objects in the cell, only those objects having no more than M points in their dmin-neighborhoods are outliers. The dmin-neighborhood of an object consists of the object's cell, all of its first layer, and some of its second layer.

A variation to the algorithm is linear with respect to n and guarantees that no more than three passes over the data set are required. It can be used for large disk-resident data sets, yet does not scale well for high dimensions.

4.7.3 Density-Based Local Outlier Detection:


Statistical and distance-based outlier detection both depend on the overall or global distribution of the given set of data points, D. However, data are usually not uniformly distributed. These methods encounter difficulties when analyzing data with rather different density distributions.
To define the local outlier factor of an object, we need to introduce the concepts of k-distance, k-distance neighborhood, reachability distance, and local reachability density. These are defined as follows:
These are defined as follows:
The k-distance of an object p is the maximal distance that p gets from its k-nearest neighbors. This distance is denoted as k-distance(p). It is defined as the distance, d(p, o), between p and an object o ∈ D, such that for at least k objects, o′ ∈ D, it holds that d(p, o′) ≤ d(p, o). That is, there are at least k objects in D that are as close as or closer to p than o, and for at most k−1 objects, o′′ ∈ D, it holds that d(p, o′′) < d(p, o).

That is, there are at most k−1 objects that are closer to p than o. You may be wondering at this point how k is determined. The LOF method links to density-based clustering in that it sets k to the parameter MinPts, which specifies the minimum number of points for use in identifying clusters based on density.
Here, MinPts (as k) is used to define the local neighborhood of an object, p.
The k-distance neighborhood of an object p is denoted N_k-distance(p)(p), or Nk(p) for short. By setting k to MinPts, we get NMinPts(p). It contains the MinPts-nearest neighbors of p. That is, it contains every object whose distance is not greater than the MinPts-distance of p.
The reachability distance of an object p with respect to object o (where o is within the MinPts-nearest neighbors of p) is defined as
reach_dist_MinPts(p, o) = max{MinPts-distance(o), d(p, o)}.

Intuitively, if an object p is far away from o, then the reachability distance between the two is simply their actual distance. However, if they are sufficiently close (i.e., where p is within the MinPts-distance neighborhood of o), then the actual distance is replaced by the MinPts-distance of o. This helps to significantly reduce the statistical fluctuations of d(p, o) for all of the p close to o.
The higher the value of MinPts is, the more similar is the reachability distance for objects within the same neighborhood.
Intuitively, the local reachability density of p is the inverse of the average reachability distance based on the MinPts-nearest neighbors of p. It is defined as

lrd_MinPts(p) = |NMinPts(p)| / Σ_{o ∈ NMinPts(p)} reach_dist_MinPts(p, o)

The local outlier factor (LOF) of p captures the degree to which we call p an outlier. It is defined as

LOF_MinPts(p) = ( Σ_{o ∈ NMinPts(p)} lrd_MinPts(o) / lrd_MinPts(p) ) / |NMinPts(p)|

It is the average of the ratio of the local reachability density of p and those of p's MinPts-nearest neighbors. It is easy to see that the lower p's local reachability density is, and the higher the local reachability densities of p's MinPts-nearest neighbors are, the higher LOF(p) is.
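In practice, an LOF computation along these lines is available in scikit-learn; the sketch below is an assumed example (the toy data and n_neighbors value are illustrative, with n_neighbors playing the role of MinPts).

# A minimal LOF sketch: points whose local density is much lower than that of
# their MinPts-nearest neighbors receive a high outlier factor (label -1 below).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(5)
X = np.vstack([rng.normal((0, 0), 0.4, size=(100, 2)),   # dense region
               rng.normal((5, 5), 2.0, size=(100, 2)),   # sparser region
               [[10.0, -10.0]]])                          # an isolated point

lof = LocalOutlierFactor(n_neighbors=20)   # n_neighbors plays the role of MinPts
labels = lof.fit_predict(X)                # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_     # larger values indicate stronger outliers
print("number flagged as outliers:", np.sum(labels == -1))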
4.7.4 Deviation-Based Outlier Detection:
Deviation-based outlier detection does not use statistical tests or distance-based measures to identify exceptional objects. Instead, it identifies outliers by examining the main characteristics of objects in a group. Objects that "deviate" from this description are considered outliers. Hence, in this approach the term deviations is typically used to refer to outliers. In this section, we study two techniques for deviation-based outlier detection. The first sequentially compares objects in a set, while the second employs an OLAP data cube approach.

Sequential Exception Technique:

The sequential exception technique simulates the way in which humans can distinguish unusual objects from among a series of supposedly like objects. It uses implicit redundancy of the data. Given a data set, D, of n objects, it builds a sequence of subsets, {D1, D2, …, Dm}, of these objects with 2 ≤ m ≤ n such that

Dj−1 ⊂ Dj, where Dj ⊆ D
Dissimilarities are assessed between subsets in the sequence. The technique introduces the following key terms.
Exception set:
This is the set of deviations or outliers. It is defined as the smallest subset of objects whose removal results in the greatest reduction of dissimilarity in the residual set.
Dissimilarity function:
This function does not require a metric distance between the objects. It is any function that, if given a set of objects, returns a low value if the objects are similar to one another. The greater the dissimilarity among the objects, the higher the value returned by the function. The dissimilarity of a subset is incrementally computed based on the subset prior to it in the sequence. Given a subset of n numbers, {x1, …, xn}, a possible dissimilarity function is the variance of the numbers in the set, that is,

(1/n) Σ_{i=1..n} (xi − x̄)²

where x̄ is the mean of the n numbers in the set. For character strings, the dissimilarity function may be in the form of a pattern string (e.g., containing wildcard characters) that is used to cover all of the patterns seen so far. The dissimilarity increases when the pattern covering all of the strings in Dj−1 does not cover any string in Dj that is not in Dj−1.
Cardinality function:
This is typically the count of the number of objects in a given set.
Smoothing factor:
This function is computed for each subset in the sequence. It assesses how much the dissimilarity can be reduced by removing the subset from the original set of objects.
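A toy sketch of the sequential exception idea, using the variance dissimilarity above and single-object candidate subsets: the smoothing factor of a candidate is its cardinality times the reduction in dissimilarity its removal achieves, and the exception set is the candidate that maximizes this factor. The data and the one-object-at-a-time simplification are assumptions for the example, not the full technique.

# A toy sequential-exception sketch with variance as the dissimilarity function.
import statistics

def variance(values):
    return statistics.pvariance(values) if len(values) > 1 else 0.0

def exception_set(data):
    base = variance(data)
    best_subset, best_factor = [], 0.0
    # For simplicity, consider only single-object candidate subsets here.
    for i, x in enumerate(data):
        residual = data[:i] + data[i + 1:]
        factor = 1 * (base - variance(residual))   # cardinality * reduction
        if factor > best_factor:
            best_subset, best_factor = [x], factor
    return best_subset

print(exception_set([5.0, 5.1, 4.9, 5.2, 5.0, 42.0]))   # the deviating value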
