
Data Warehousing & Data Mining

Unit 1&2 Notes according to CT-1


Syllabus in March 2024

(This PDF covers 99% of the topics, strictly from the Unit 1 and 2 syllabus of the Data
Warehousing and Data Mining subject, and its content is written so that it can be
reproduced in the exam.)

Made By- Utkarsh M.


Unit 1
Data Warehousing:

A Data Warehouse stores a huge amount of data, typically
collected from multiple heterogeneous sources such as
files, DBMSs, etc.
The goal is to produce statistical results that may help in
decision-making.
For example, a college might want quick answers to different
questions, such as how the placement of CS students has
improved over the last 10 years in terms of salaries, counts, etc.

Need for data warehousing:


An ordinary database can store MBs to GBs of data, and that
too for a specific purpose. For storing data at the TB scale,
storage shifts to the Data Warehouse.
Besides this, a transactional database does not lend itself to
analytics.
To perform analytics effectively, an organization keeps a
central Data Warehouse to closely study its business by
organizing, understanding, and using its historical data for
making strategic decisions and analyzing trends.



Difference b/w DBMS and Data Warehouse (if asked):



Advantages of Data Warehousing:
1. Intelligent Decision-Making - With centralized data in
warehouses, decisions may be made more quickly and
intelligently.
2. Business Intelligence - Provides strong operational insights
through business intelligence.
3. Historical Analysis - Predictions and trend analysis are
made easier by storing past data.
4. Data Quality - Guarantees data quality and consistency for
trustworthy reporting.
5. Scalability - Capable of managing massive data volumes
and expanding to meet changing requirements.

Disadvantages of Data Warehousing:


1. Cost - Building a data warehouse can be expensive,
requiring significant investments in hardware, software, and
personnel.
2. Complexity - Data warehousing can be complex, and
businesses may need to hire specialized personnel.
3. Time-consuming - Building a data warehouse can take a
significant amount of time, requiring businesses to be patient
and committed to the process.
4. Data integration challenges - Data from different sources
can be challenging to integrate, requiring significant effort to
ensure consistency and accuracy.
5. Data security - Data warehousing can pose data security
risks, and businesses must take measures to protect sensitive
data from unauthorized access or breaches.



Components of Data Warehouse:

1. Operational Source –
• An operational source is a data source consisting of
operational data and external data.
• Data can come from relational DBMSs like Informix and
Oracle.
2. Load Manager –
• The Load Manager performs all operations associated
with the extraction and loading of data into the data
warehouse.
• These tasks include simple transformations of data to
prepare it for entry into the warehouse.



3. Warehouse Manager –
• The warehouse manager is responsible for the
warehouse management process.
• The operations performed by the warehouse manager
include the analysis, aggregation, backup and collection
of data, and de-normalization of the data.
4. Query Manager –
• Query Manager performs all the tasks associated with
the management of user queries.
• The complexity of the query manager is determined by
the end-user access operations tool and the features
provided by the database.
5. Detailed Data –
• It is used to store all the detailed data in the database
schema.
• Detailed data is loaded into the data warehouse to
complement the data collected.
6. Summarized Data –
• Summarized Data is a part of the data warehouse that
stores predefined aggregations
• These aggregations are generated by the warehouse
manager.
7. Archive and Backup Data –
• The Detailed and Summarized Data are stored for the
purpose of archiving and backup.
• The data is relocated to storage archives such as
magnetic tapes or optical disks.



8. Metadata –
• Metadata is basically data about data.
• It is used for the extraction and loading process, the
warehouse management process, and the query management process.
9. End User Access Tools –
• End-User Access Tools consist of Analysis, Reporting, and
mining.
• By using end-user access tools users can link with the
warehouse.

Building a Data Warehouse:

Steps for building a data warehouse are:

1. To extract the data from different data sources - For
building a data warehouse, data is extracted from various
data sources and stored in a central storage area.
For extracting the data, Microsoft SQL Server is one of the
common choices.

2. To transform the transactional data - There are various
DBMSs in which companies store their data, such as MS Access,
MS SQL Server, Oracle, Sybase, etc.
Companies also save data in spreadsheets, flat files, mail
systems, etc. Relating the data from all these sources
is done while building a data warehouse.



3. To load the data into the dimensional database - After
building a dimensional model, the data is loaded into the
dimensional database. This process may combine several
columns together, or it may split one field into several
columns. There are two stages at which transformation of the
data can be performed: while loading the data into the
dimensional model, or while extracting the data from its
origins.
4. To purchase a front-end reporting tool - Top-notch
analytical tools are available in the market, provided by
several major vendors. Microsoft has also released its own
cost-effective tool, Data Analyzer.

Data Warehouse Architecture

There are 2 approaches to Data Warehouse Architecture:

(i) Top-down approach:



Components of Top-Down approach:

1. External Sources – An external source is a source from which
data is collected, irrespective of the type of data. Data can be
structured, semi-structured, or unstructured.

2. Stage Area – Since the data extracted from the external
sources does not follow a particular format, it needs to be
validated before being loaded into the data warehouse. For this
purpose, it is recommended to use an ETL tool (a minimal code
sketch of this flow is given after this list).
• E (Extract): Data is extracted from the external data source.
• T (Transform): Data is transformed into the standard
format.
• L (Load): Data is loaded into the data warehouse after
transforming it into the standard format.

3. Data Warehouse – After cleansing, the data is stored in the
data warehouse as the central repository. It actually stores the
metadata, while the actual data gets stored in the data marts.

4. Data Marts – A data mart is also a part of the storage component.
It stores the information of a particular function of an
organization which is handled by a single authority. There can
be as many data marts in an organization as there are business
functions.



5. Data Mining – The practice of analyzing the big data present
in the data warehouse is data mining. It is used to find the hidden
patterns present in the database or the data warehouse
with the help of data mining algorithms.
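To make the Extract/Transform/Load steps in the staging area concrete, here is a
minimal sketch using Python, pandas, and SQLite; the file name placements_raw.csv,
its columns, and the warehouse.db database are hypothetical examples only.

    import sqlite3
    import pandas as pd

    # Extract: read raw data from an external source (hypothetical CSV file).
    raw = pd.read_csv("placements_raw.csv")      # e.g. columns: name, branch, salary, year

    # Transform: validate and standardise the data into a common format.
    raw = raw.dropna(subset=["salary"])          # drop rows with a missing salary
    raw["branch"] = raw["branch"].str.upper().str.strip()
    raw["salary"] = raw["salary"].astype(float)

    # Load: write the cleaned data into the warehouse (SQLite stands in for the warehouse DB).
    conn = sqlite3.connect("warehouse.db")
    raw.to_sql("placements", conn, if_exists="append", index=False)
    conn.close()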

(ii) Bottom-up approach:

1. First, the data is extracted from external sources (same as
happens in the top-down approach).

2. Then, the data goes through the staging area (as explained
above) and is loaded into data marts instead of the data
warehouse. Each data mart addresses a single business area.

3. These data marts are then integrated into the data warehouse.



Data Warehouse schema for Decision Support:

Contents:

1. Data Layout for business access:


• The data warehouse RDBMS typically needs to process
queries that are large, complex, ad hoc, and data-intensive.
• Solving modern business problems such as market
analysis and financial forecasting requires query-centric
database schemas that are array-oriented and
multidimensional in nature.

2. Multidimensional Data Model:


• The multidimensional data model views data as a cube.
• The multidimensional cube is implemented by multidimensional
database technology, which is not relational in nature.



3. Star Schema:
• The star schema is the simplest type of Data Warehouse
schema.
• It is known as star schema as its structure resembles a
star.
• The center of the star has one fact table and a number of
associated dimension tables.
• It is also known as the Star Join Schema and is optimized for
querying large data sets.
• The basic idea of the star schema is that information can be
classified into two groups:
- Facts
- Dimensions
Facts are the core data elements being analyzed, e.g., items
sold.
Dimensions are attributes about the facts, e.g., date of
purchase.

Fact Table:
• A fact table is the primary table in a dimensional
model. A fact table contains:
• Measurements/facts
• Foreign keys to dimension tables
Dimension table:
• A dimension table contains the dimensions of a fact.
• Dimension tables are joined to the fact table via a foreign key.
• Dimension tables are de-normalized tables.
• The dimension attributes are the various columns
in a dimension table.
(A small pandas illustration of a star join is given after the
schema descriptions below.)



4. Snowflake Schema:
• Some dimension tables in the snowflake schema are
normalized.
• The normalization splits up the data into additional
tables.
• Unlike the star schema, the dimension tables in a snowflake
schema are normalized.
For example, the item dimension table of the star schema is
normalized and split into two dimension tables, namely the
item and supplier tables.

5. Fact Constellation Schema:
• A fact constellation has multiple fact tables. It is also
known as a galaxy schema.
• The sales fact table is the same as that in the star schema.
• The shipping fact table also contains two measures,
namely dollars sold and units sold.
• It is also possible to share dimension tables between fact
tables.
For example, the time, item, and location dimension tables are
shared between the sales and shipping fact tables.
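As a small illustration of how a fact table joins to its dimension tables in a star
schema, the following pandas sketch (referenced above) builds toy sales facts and two
dimensions and runs a typical star-join query; all table and column names are
invented for the example.

    import pandas as pd

    # Dimension tables: descriptive attributes about the facts.
    dim_item = pd.DataFrame({"item_key": [1, 2], "item_name": ["Pen", "Notebook"]})
    dim_date = pd.DataFrame({"date_key": [10, 11], "month": ["Jan", "Feb"]})

    # Fact table: measurements plus foreign keys to the dimension tables.
    fact_sales = pd.DataFrame({
        "item_key": [1, 2, 1],
        "date_key": [10, 10, 11],
        "units_sold": [5, 3, 7],
        "dollars_sold": [50.0, 90.0, 70.0],
    })

    # A typical star-join query: total units sold per item name.
    report = (fact_sales
              .merge(dim_item, on="item_key")    # join fact to item dimension
              .merge(dim_date, on="date_key")    # join fact to date dimension
              .groupby("item_name")["units_sold"].sum())
    print(report)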



Data Extraction Tools:

As part of the Extract, Transform, Load (ETL) process, data
extraction involves gathering and retrieving data from a single
source or multiple sources.
The extraction process is often the first step for loading data into
a data warehouse or the cloud for further processing and
analysis.
Data extraction tools are a vital component of data
management.
Using an automated tool enables organizations to efficiently
control and retrieve data from various origin systems into one
central system for future use in single applications and
higher-level analytics.

Data extraction tools efficiently and effectively read various
systems, such as databases, ERPs, and CRMs, and collect the
appropriate data found within each source.
Most tools have the ability to gather any data, whether
structured, semi-structured, or unstructured.
Combined with the ability to extract information from infinite
big data sources, business users can leverage a collection of
sources — such as product databases with real-time e-commerce
applications — to produce a more well-rounded
and informed business intelligence report.



The benefits of data extraction tools include:

1. Scalability - Data extraction software is critical for helping
organizations collect data at scale. Without these tools, users
would have to manually parse through sources to collect this
information.
2. Efficiency - The automation of data extraction tools
contributes to greater efficiency, especially when considering
the time involved in collecting data.
Data extraction software utilizing options for RPA, AI, and ML
considerably hastens the identification and collection of relevant data.
3. Business process management - Data extraction software
leveraging RPA or different aspects of AI can do more than
simply identify and gather relevant data.
These options are also useful for inputting that data into
downstream processes.

4. Control - Data extraction tools are the key to actually
identifying which data is necessary and then gathering that
data from disparate sources.
Organizations understanding this functionality can migrate
data from any number of sources into their target systems,
reducing reliance on data silos and increasing meaningful
interaction with data.



5. Accuracy - Data extraction tools often provide a more
advanced preparation process that lends its hand to
managing complex data streams. This capability combined
with the removal of human error and user bias results in
increased accuracy and high quality data.

6. Usability - Last but not least, the most obvious benefit is
data extraction tools' ease of use. These tools
provide business users with a user interface that is not only
intuitive but also provides a visual view of the data processes
and rules in place.


Metadata

Metadata is data that describes and contextualizes other
data. It provides information about the content, format,
structure, and other characteristics of data, and can be used
to improve the organization, discoverability, and accessibility
of data.
Metadata can be stored in various forms, such as text, XML, or
RDF, and can be organized using metadata standards and
schemas.
There are many metadata standards that have been
developed to facilitate the creation and management of
metadata, such as Dublin Core, schema.org, and the Metadata
Encoding and Transmission Standard (METS).



Metadata can be used in a variety of contexts, such as
libraries, museums, archives, and online platforms.
It can be used to improve the discoverability and ranking of
content in search engines and to provide context and
additional information about search results.
Metadata can also support data preservation by providing
information about the context, provenance, and preservation
needs of data, and can support data visualization by
providing information about the data’s structure and content,
and by enabling the creation of interactive and customizable
visualizations.

Types of Metadata:

1. Descriptive metadata - This type of metadata provides
information about the content, structure, and format of data,
and may include elements such as title, author, subject, and
keywords.
Descriptive metadata helps to identify and describe the
content of data and can be used to improve the
discoverability of data through search engines and other
tools.

2. Administrative metadata - This type of metadata provides
information about the management and technical
characteristics of data, and may include elements such as file
format, size, and creation date.



Administrative metadata helps to manage and maintain data
over time and can be used to support data governance and
preservation.

3. Structural metadata - This type of metadata provides
information about the relationships and organization of data,
and may include elements such as links, tables of contents,
and indices.
Structural metadata helps to organize and connect data and
can be used to facilitate the navigation and discovery of data.

4. Provenance metadata - This type of metadata provides
information about the history and origin of data, and may
include elements such as the creator, date of creation, and
sources of data.
Provenance metadata helps to provide context and credibility
to data and can be used to support data governance and
preservation.

5. Rights metadata - This type of metadata provides
information about the ownership, licensing, and access
controls of data, and may include elements such as copyright,
permissions, and terms of use.
Rights metadata helps to manage and protect the intellectual
property rights of data and can be used to support data
governance and compliance.



6. Educational metadata - This type of metadata provides
information about the educational value and learning
objectives of data, and may include elements such as
learning outcomes, educational levels, and competencies.
Educational metadata can be used to support the discovery
and use of educational resources, and to support the design
and evaluation of learning environments.

Several Examples of Metadata:

1. File metadata: This includes information about a file, such
as its name, size, type, and creation date.
2. Image metadata: This includes information about an image,
such as its resolution, color depth, and camera settings.
3. Music metadata: This includes information about a piece of
music, such as its title, artist, album, and genre.
4. Video metadata: This includes information about a video,
such as its length, resolution, and frame rate.
5. Document metadata: This includes information about a
document, such as its author, title, and creation date.
6. Database metadata: This includes information about a
database, such as its structure, tables, and fields.
7. Web metadata: This includes information about a web page,
such as its title, keywords, and description.



Querying tools and applications in Data Warehousing:

1. SQL-based tools - SQL-based tools are the most common
and widely used data querying tools for data warehouses.
SQL, or Structured Query Language, is a standard language for
manipulating and retrieving data from relational databases.
SQL-based tools allow you to write and execute SQL queries
on your data warehouse, either through a graphical user
interface (GUI) or a command-line interface (CLI).
Some examples of SQL-based tools are Microsoft SQL Server
Management Studio, Oracle SQL Developer, MySQL
Workbench, and pgAdmin.

2. BI tools - BI tools, or business intelligence tools, are data
querying tools that provide more advanced and interactive
features for data analysis and visualization.
BI tools allow you to connect to your data warehouse and
create dashboards, reports, charts, graphs, and other visual
elements to display and explore your data.
Some examples of BI tools are Tableau, Power BI, Qlik Sense,
and Looker.
BI tools are great for creating and sharing data stories,
discovering patterns and trends, and performing self-service
analytics.

3. Notebook tools - Notebook tools are data querying tools
that use a web-based interface to create and run code cells
that can interact with your data warehouse.



Notebook tools allow you to write and execute queries in
various languages, such as SQL, Python, R, or Scala, and
integrate them with other libraries and frameworks for data
manipulation and visualization.
Some examples of notebook tools are Jupyter Notebook,
Databricks, Google Colab, and Zeppelin.
Notebook tools are ideal for data science and machine
learning projects, as they enable you to perform complex and
ad-hoc analysis, experimentation, and modeling on your data.

4. No-code tools - No-code tools are data querying tools that
use a drag-and-drop interface to create and run queries on
your data warehouse without writing any code.
No-code tools allow you to connect to your data warehouse
and select, filter, aggregate, join, and transform your data
using predefined functions and operators.
Some examples of no-code tools are Sigma, Holistics, Chartio,
and Mode.
No-code tools are convenient for non-technical users, as they
simplify and automate the data querying process, and reduce
the need for coding skills.

5. Hybrid tools - Hybrid tools are data querying tools that
combine the features of different types of tools, such as
SQL-based, BI, notebook, and no-code tools.
Hybrid tools allow you to use multiple methods and
languages to query your data warehouse, and switch between
them according to your needs and preferences.



Some examples of hybrid tools are Metabase, Superset,
Redash, and Data Studio.
Hybrid tools are versatile and adaptable, as they offer a range
of options and functionalities for data querying, analysis, and
visualization.

Online Analytical Processing (OLAP):

OLAP stands for On-Line Analytical Processing.
OLAP is a class of software technology that enables
analysts, managers, and executives to gain insight
into information through fast, consistent, interactive access to
a wide variety of possible views of data that has been
transformed from raw information to reflect the real
dimensionality of the enterprise as understood by the clients.

OLAP implements the multidimensional analysis of business
information and supports the capability for complex
estimations, trend analysis, and sophisticated data modeling.
It is rapidly becoming the essential foundation for Intelligent
Solutions covering Business Performance Management,
Planning, Budgeting, Forecasting, Financial Reporting,
Analysis, Simulation Models, Knowledge Discovery, and Data
Warehouse Reporting.
OLAP enables end-clients to perform ad hoc analysis of
data in multiple dimensions, providing the insight and
understanding they require for better decision making.



Since OLAP servers are based on a multidimensional view of
data, we will discuss the OLAP operations on multidimensional
data.

The OLAP operations are listed as follows (a small pandas sketch
of these operations is given after the list):

1. Roll-up:
Roll-up performs aggregation on a data cube in either of the
following ways −
• By climbing up a concept hierarchy for a dimension
• By dimension reduction
• In the example, roll-up is performed by climbing up the concept
hierarchy for the dimension location.
• Initially the concept hierarchy was "street < city <
province < country".
• On rolling up, the data is aggregated by ascending the
location hierarchy from the level of city to the level of
country.
• The data is grouped into countries rather than cities.
• When roll-up is performed, one or more dimensions from
the data cube are removed.

2. Drill-down:
Drill-down is the reverse operation of roll-up. It is performed
in either of the following ways −
• By stepping down a concept hierarchy for a dimension
• By introducing a new dimension
• In the example, drill-down is performed by stepping down the
concept hierarchy for the dimension time.
• Initially the concept hierarchy was "day < month < quarter
< year".
• On drilling down, the time dimension is descended from
the level of quarter to the level of month.
• When drill-down is performed, one or more dimensions
are added to the data cube.
• It navigates the data from less detailed data to highly
detailed data.

3. Slice:
The slice operation selects one particular dimension from a
given cube and provides a new sub-cube.

4. Dice:
Dice selects two or more dimensions from a given cube and
provides a new sub-cube.

5. Pivot:
The pivot operation is also known as rotation.
It rotates the data axes in view in order to provide an
alternative presentation of data.
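The sketch below (referenced in the list above) shows, purely as an illustration in
Python/pandas, how roll-up, drill-down, slice, and dice behave on a toy sales cube;
the dimension values and sales figures are made up.

    import pandas as pd

    # Toy "cube" as a flat table: location and time dimensions plus a sales measure.
    cube = pd.DataFrame({
        "country": ["India", "India", "India", "Canada"],
        "city":    ["Delhi", "Delhi", "Mumbai", "Toronto"],
        "quarter": ["Q1", "Q1", "Q2", "Q1"],
        "month":   ["Jan", "Feb", "Apr", "Jan"],
        "sales":   [100, 120, 90, 80],
    })

    # Roll-up: climb the location hierarchy from city to country (data grouped by country).
    rollup = cube.groupby("country")["sales"].sum()

    # Drill-down: descend the time hierarchy from quarter to month (more detail).
    drilldown = cube.groupby(["quarter", "month"])["sales"].sum()

    # Slice: select one value of one dimension, producing a sub-cube.
    slice_q1 = cube[cube["quarter"] == "Q1"]

    # Dice: select on two or more dimensions at once.
    dice = cube[(cube["quarter"] == "Q1") & (cube["country"] == "India")]

    print(rollup, drilldown, slice_q1, dice, sep="\n\n")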

Types of OLAP Servers:

1. Relational OLAP - ROLAP servers are placed between the
relational back-end server and the client front-end tools. To store
and manage warehouse data, ROLAP uses a relational or
extended-relational DBMS.



2. Multidimensional OLAP - MOLAP uses array-based
multidimensional storage engines for multidimensional views
of data. With multidimensional data stores, the storage
utilization may be low if the data set is sparse.
Therefore, many MOLAP servers use two levels of data storage
representation to handle dense and sparse data sets.

3. Hybrid OLAP - Hybrid OLAP is a combination of both ROLAP
and MOLAP. It offers the higher scalability of ROLAP and the faster
computation of MOLAP.
HOLAP servers allow storing large volumes of detailed
information. The aggregations are stored separately in the
MOLAP store.

4. Specialized SQL Servers - Specialized SQL servers provide
advanced query language and query processing support for
SQL queries over star and snowflake schemas in a read-only
environment.



UNIT – 2

Introduction to Data Mining:

Data mining refers to the analysis of data. It is the computer-supported
process of analyzing huge sets of data that have
either been compiled by computer systems or downloaded
into the computer.
In the data mining process, the computer analyzes the data
and extracts useful information from it.
It looks for hidden patterns within the data set and tries to
predict future behavior.
Data mining is primarily used to discover and indicate
relationships among the data sets.

Data mining aims to enable business organizations to view
business behaviors, trends, and relationships that allow the
business to make data-driven decisions.
It is also known as Knowledge Discovery in Databases (KDD).
Data mining tools utilize AI, statistics, databases, and
machine learning systems to discover the relationships
within the data.



Advantages of Data Mining:

i. Market Analysis - Data mining can predict market behavior,
which helps the business make decisions. For example, it
predicts who is keen to purchase what type of products.

ii. Fraud detection - Data mining methods can help to find
which cellular phone calls, insurance claims, credit, or debit
card purchases are going to be fraudulent.

iii. Financial Market Analysis - Data mining techniques are
widely used to help model financial markets.

iv. Trend Analysis - Analyzing the existing trends in the
marketplace is a strategic benefit because it helps in cost
reduction and in adjusting the manufacturing process as per
market demand.



Data Mining vs. Data Warehouse (if asked):



Data Mining Functionalities:

Data mining functions are used to define the trends or
correlations contained in data mining activities. Data mining
functionalities are used to represent the type of patterns that
have to be discovered in data mining tasks. Data mining
activities can be divided into two categories:

1. Descriptive Data Mining - It aims to understand what is
happening within the data without any previous idea or
labels. The common features of the data set are highlighted
and summarized.
For example, count, average, etc.

2. Predictive Data Mining - It predicts the values of unlabeled
attributes. Using previously available or historical data, data
mining can make predictions about critical business metrics
based on the data's linearity.
For example, predicting the volume of business next quarter
based on performance in the previous quarters over several
years, or judging from the findings of a patient's medical
examinations whether he is suffering from a particular disease.



There are various data mining functionalities, which are as
follows −

1. Data characterization − It is a summarization of the general
characteristics of a target class of data objects.
The data corresponding to the user-specified class is
generally collected by a database query.
The output of data characterization can be presented in
multiple forms.

2. Data discrimination − It is a comparison of the general
characteristics of target class data objects with the general
characteristics of objects from one or a set of contrasting
classes.
The target and contrasting classes are specified by the
user, and the corresponding data objects are fetched through
database queries.

3. Association Analysis − It analyses the set of items that
generally occur together in a transactional dataset.
There are two parameters that are used for determining the
association rules −

• Support, which identifies the common (frequently occurring)
itemsets in the database.

• Confidence, which is the conditional probability that an item
occurs in a transaction when another item occurs.



4. Classification − Classification is the procedure of
discovering a model that represents and distinguishes data
classes or concepts, with the objective of being able to use the
model to predict the class of objects whose class label is
unknown.
The derived model is based on the analysis of a set of
training data (i.e., data objects whose class label is known).

5. Prediction − It is used to predict unavailable data values
or pending trends.
An object can be predicted based on the attribute values of
the object and the attribute values of the classes.
It can be a prediction of missing numerical values or of
increase/decrease trends in time-related information.

6. Clustering − It is similar to classification, but the classes are
not predefined.
The classes are represented by data attributes. It is
unsupervised learning.
The objects are clustered or grouped based on the
principle of maximizing the intraclass similarity and
minimizing the interclass similarity.
(A short scikit-learn sketch illustrating classification and
clustering is given after this list.)

7. Outlier analysis − Outliers are data elements that cannot be
grouped into a given class or cluster.
These are the data objects whose behavior deviates from
the general behavior of the other data objects.



The analysis of this type of data can be essential for mining
knowledge.

8. Evolution analysis − It describes the trends for objects whose
behavior changes over time.
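As a brief illustration of the classification and clustering functionalities mentioned
above, here is a minimal sketch assuming scikit-learn; the tiny data set and its labels
are invented.

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.cluster import KMeans

    # Tiny invented data set: two numeric attributes per object.
    X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
    labels = ["low", "low", "low", "high", "high", "high"]   # known class labels

    # Classification: learn a model from labelled training data,
    # then predict the class of objects whose label is unknown.
    clf = DecisionTreeClassifier().fit(X, labels)
    print(clf.predict([[2, 2], [9, 9]]))        # e.g. ['low' 'high']

    # Clustering: group the same objects without any predefined classes
    # (unsupervised learning); similar objects end up in the same cluster.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)                           # cluster index for each object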

DATA PREPROCESSING:

Data preprocessing is an important process in data mining. In
this process, raw data is converted into an understandable
format and made ready for further analysis. The motive is to
improve data quality and make it fit for specific tasks.

Tasks in Data Preprocessing:



1. Data cleaning:

Data cleaning helps us remove inaccurate, incomplete, and
incorrect data from the dataset. Some techniques used in
data cleaning are listed below (a short pandas sketch at the end
of this subsection illustrates two of them) −

(i) Handling missing values:

This type of scenario occurs when some data is missing.

• Standard values can be used to fill in the missing
values manually, but only for a small dataset.

• The attribute's mean and median values can be used to
replace the missing values for normally and non-normally
distributed data respectively.

• Tuples can be ignored if the dataset is quite large
and many values are missing within a tuple.

• The most appropriate value can be predicted using
regression or decision tree algorithms.



(ii) Noisy Data:
Noisy data is data that cannot be interpreted by the
machine and contains unnecessary, faulty values.
Some ways to handle it are −

• Binning − This method handles noisy data to make it
smooth. The data is divided equally into bins, and then
smoothing methods are applied within each bin.

The methods are smoothing by bin mean (bin
values are replaced by the bin's mean value),
smoothing by bin median (bin values are replaced by the bin's
median value), and
smoothing by bin boundary (the minimum and maximum bin
values are taken, and each value is replaced by the closest boundary value).

• Regression − Regression functions are used to
smooth the data. Regression can be linear
(one independent variable) or multiple
(multiple independent variables).

• Clustering − It is used for grouping similar data
into clusters and for finding outliers.
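A minimal sketch, assuming pandas, of two of the cleaning techniques above: filling
missing values with the attribute's mean, and smoothing noisy values by bin mean;
the sample numbers are made up.

    import pandas as pd

    # Hypothetical attribute with a missing value and some noise.
    values = pd.Series([12.0, 15.0, None, 14.0, 95.0, 13.0, 16.0, 14.5, 15.5])

    # Handling missing values: replace with the attribute's mean (the median could
    # be used instead for non-normally distributed data).
    filled = values.fillna(values.mean())

    # Binning: divide the data equally into 3 bins and smooth by bin mean,
    # i.e. every value in a bin is replaced by that bin's mean.
    bins = pd.qcut(filled, q=3)                       # equal-frequency (equi-depth) bins
    smoothed = filled.groupby(bins).transform("mean")

    print(pd.DataFrame({"original": values, "filled": filled, "smoothed": smoothed}))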



2. Data integration:

The process of combining data from multiple sources
(databases, spreadsheets, text files) into a single dataset.
A single and consistent view of the data is created in this process.
Major problems during data integration are schema
integration (integrating the set of data collected from various
sources), entity identification (identifying the same entities across
different databases), and detecting and resolving data value
conflicts.

3. Data transformation:

In this step, the format or structure of the data is changed in
order to make the data suitable for the mining process. Methods for
data transformation are −

• Normalization − A method of scaling data to represent it in
a specific smaller range (e.g., -1.0 to 1.0); see the sketch
after this list.

• Discretization − It helps reduce the data size by dividing
continuous data into intervals.

• Attribute Selection − To help the mining process, new
attributes are derived from the given attributes.

• Concept Hierarchy Generation − In this, attributes are
changed from a lower level to a higher level in the hierarchy.

• Aggregation − In this, a summary of the data gets stored,
which depends upon the quality and quantity of the data to
make the result more optimal.
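A minimal sketch of the normalization method referred to above (min-max scaling
into the range -1.0 to 1.0), written in plain Python with invented salary values.

    def min_max_normalize(values, new_min=-1.0, new_max=1.0):
        """Rescale values linearly into [new_min, new_max] (min-max normalization)."""
        old_min, old_max = min(values), max(values)
        scale = (new_max - new_min) / (old_max - old_min)
        return [new_min + (v - old_min) * scale for v in values]

    salaries = [30000, 45000, 60000, 90000]        # hypothetical attribute values
    print(min_max_normalize(salaries))             # [-1.0, -0.5, 0.0, 1.0]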

4. Data reduction:

Data reduction increases storage efficiency and reduces data
storage costs while producing almost the same analytical
results.
Analysis becomes harder while working with huge amounts of
data, so reduction is used to deal with that.
Techniques of data reduction are −

• Data Compression - Data is compressed to make analysis
efficient. Lossless compression means there is no loss
of data during compression; lossy compression means
unnecessary information is removed during
compression.

• Numerosity Reduction - The volume of data is reduced,
i.e., only a model of the data is stored instead of the whole
data, which provides a smaller representation of the data
without loss of information.



• Dimensionality Reduction - In this, the number of
attributes or random variables is reduced so as to lower
the dimensionality of the data set. Attributes are combined
without losing their original characteristics.
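As one common way to combine attributes into fewer dimensions, the sketch below
uses principal component analysis (PCA) from scikit-learn; PCA is not prescribed by
the notes, and the small data matrix is invented.

    import numpy as np
    from sklearn.decomposition import PCA

    # Hypothetical data set: 5 records with 3 correlated attributes.
    X = np.array([
        [2.0, 4.1, 1.0],
        [3.0, 6.2, 1.1],
        [4.0, 7.9, 0.9],
        [5.0, 10.1, 1.2],
        [6.0, 12.0, 1.0],
    ])

    # Reduce 3 attributes to 2 combined components (dimensionality reduction).
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                      # (5, 2)
    print(pca.explained_variance_ratio_)        # how much information each component keeps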

Data discretization and concept hierarchy generation:

Data discretization refers to a method of converting a huge
number of data values into smaller sets of values so that the
evaluation and management of the data become easy.
In other words, data discretization is a method of converting
the values of continuous attributes into a finite set of
intervals with minimum data loss.
There are two forms of data discretization: the first is supervised
discretization, and the second is unsupervised discretization.
1. Supervised discretization refers to a method in
which the class information is used.
2. Unsupervised discretization refers to a method that
depends on the way in which the operation proceeds,
i.e., it works on a top-down splitting strategy or a
bottom-up merging strategy.



Now, we can understand this concept with the help of an
example:
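A minimal sketch, assuming pandas: pd.cut performs an unsupervised, equal-width
discretization of a continuous attribute (hypothetical marks) into a small set of
labelled intervals.

    import pandas as pd

    # Hypothetical continuous attribute: students' marks.
    marks = pd.Series([35, 42, 58, 63, 71, 77, 84, 91])

    # Unsupervised (equal-width) discretization into three labelled intervals.
    grades = pd.cut(marks, bins=3, labels=["Low", "Medium", "High"])

    print(pd.DataFrame({"marks": marks, "interval": grades}))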

Techniques of data discretization:

1. Histogram analysis - A histogram is a plot used to
represent the underlying frequency distribution of a
continuous data set. A histogram assists in inspecting the
data distribution, for example, outliers, skewness,
normal distribution, etc.

2. Binning - Binning refers to a data smoothing technique that
helps group a huge number of continuous values into a
smaller number of bins.
This technique can also be used for data discretization and
for the development of concept hierarchies.



3. Cluster Analysis - Cluster analysis is a form of data
discretization.
A clustering algorithm is executed by dividing the values of a
numeric attribute x into clusters, which then serve as the
discretized groups of x.

4. Data discretization using decision tree analysis –
This method performs discretization through decision tree
analysis, in which a top-down splitting technique is used. It is
done through a supervised procedure.
In numeric attribute discretization, first you need to select
the split that has the least entropy, and then you need to
run it through a recursive process.
The recursive process divides the attribute into discretized
disjoint intervals, from top to bottom, using the same
splitting criterion.

5. Data discretization using correlation analysis - Discretizing
data by a linear regression technique, you can get the best
neighboring intervals, and then the large intervals are
combined to develop a larger overlap and form the final
overlapping intervals. It is a supervised procedure.



Concept Hierarchy Generation:

The term hierarchy represents an organizational structure or
mapping in which items are ranked according to their levels
of importance.
In other words, we can say that a concept hierarchy refers to a
sequence of mappings from a set of low-level, specific concepts
to higher-level, more general concepts.
For example, in computer science there are different types of
hierarchical systems: a document placed in a folder in
Windows, at a specific place in the tree structure, is a good
example of a computer hierarchical tree model.

There are two types of hierarchy mapping: top-down mapping
and bottom-up mapping.

Top-down mapping - Top-down mapping generally starts at
the top with some general information and ends at the
bottom with the specialized information.

Bottom-up mapping - Bottom-up mapping generally starts
at the bottom with some specialized information and ends
at the top with the generalized information.



Architecture of a Typical Data Mining System:

1. Data Sources - Databases, the World Wide Web (WWW), and
data warehouses are the data sources.
The data in these sources may be in the form of plain text,
spreadsheets, or other forms of media like photos or videos.
The WWW is one of the biggest sources of data.

2. Different processes - Before passing the data to the
database or data warehouse server, the data must be
cleaned, integrated, and selected.
As the information comes from various sources and in
different formats, it can't be used directly for the data mining
procedure because the data may not be complete and
accurate.
So, the data first needs to be cleaned and unified. Since more
information than needed will be collected from the various data
sources, only the data of interest has to be selected
and passed to the server.

3. Database Server - The database server contains the actual
data ready to be processed. It performs the task of handling
data retrieval as per the request of the user.
4. Data Mining Engine - It is one of the core components of
the data mining architecture that performs all kinds of data
mining techniques like association, classification,
characterization, clustering, prediction, etc.
5. Pattern Evaluation Modules - They are responsible for
finding interesting patterns in the data and sometimes they
also interact with the database servers for producing the
result of the user requests.
6. Graphical User Interface - Since the user cannot fully
understand the complexity of the data mining process, the
graphical user interface helps the user communicate
effectively with the data mining system.
7. Knowledge Base - Knowledge Base is an important part of
the data mining engine that is quite beneficial in guiding the
search for the result patterns. Data mining engines may also
sometimes get inputs from the knowledge base. This
knowledge base may contain data from user experiences. The
objective of the knowledge base is to make the result more
accurate and reliable.



Basic Working of a typical data mining system:

1. It all starts when the user puts up certain data mining
requests; these requests are then sent to the data mining
engine for pattern evaluation.

2. The system tries to find the solution to the query
using the already present database.

3. The metadata extracted is then sent for proper analysis
to the data mining engine, which sometimes interacts
with the pattern evaluation modules to determine the result.

4. This result is then sent to the front end in an easily
understandable manner using a suitable interface.

Classification of Data Mining Systems:

1. Classification Based on the mined Databases - A data
mining system can be classified based on the types of
databases that have been mined.
A database system can be further segmented based on
distinct principles, such as data models, types of data, etc.,
which further assist in classifying a data mining system.

For example, if we want to classify a database based on the
data model, we need to select either relational, transactional,
object-relational or data warehouse mining systems.



2. Classification Based on the type of Knowledge Mined –
A data mining system categorized based on the kind of
knowledge mined may have the following functionalities:
1. Characterization
2. Discrimination
3. Association and Correlation Analysis
4. Classification
5. Prediction
6. Outlier Analysis
7. Evolution Analysis

3. Classification according to the type of techniques utilized -
A data mining system can also be classified based on the type
of techniques that are being incorporated.
This technique involves the degree of user interaction or the
technique of data analysis involved.
For example, machine learning, visualization, pattern
recognition, neural networks, database-oriented or data-
warehouse oriented techniques.

4. Classification according to the application adapted –
This involves domain-specific applications. Data mining
systems classified based on the adapted application are
as follows:
1. Finance
2. Telecommunications
3. DNA
4. Stock Markets
5. E-mail



Association Rule in Data Mining:

Association rule learning is a type of unsupervised learning
technique that checks for the dependency of one data item
on another data item and maps them accordingly so that the
relationship can be exploited profitably.
It tries to find interesting relations or associations
among the variables of a dataset. It is based on different rules
to discover the interesting relations between variables in the
database.

Association rule learning is one of the very important
concepts of machine learning, and it is employed in market
basket analysis, web usage mining, continuous production,
etc.
Market basket analysis is a technique used by
various big retailers to discover the associations between
items.
We can understand it by taking the example of a supermarket:
in a supermarket, all products that are purchased together
are placed together.

For example, if a customer buys bread, he will most likely also
buy butter, eggs, or milk, so these products are stored on the
same shelf or mostly nearby.



Working of association rules
(ONLY if asked or skip to the types):

Association rule learning works on the concept of an If-Then
statement, such as "if A then B".

Here the "If" element is called the antecedent, and the "Then"
statement is called the consequent.
A relationship in which we find an association between two
single items is known as single cardinality.
It is all about creating rules, and as the number of items
increases, the cardinality also increases accordingly.



So, to measure the associations between thousands of data
items, there are several metrics. These metrics are given
below (a short Python sketch computing them follows this
section):
• Support
• Confidence
• Lift
1. Support - Support is the frequency of A, or how frequently
an itemset appears in the dataset. It is defined as the fraction of
the transactions T that contain the itemset X.
For an itemset X and a set of transactions T, it can be
written as:

Support(X) = (Number of transactions in T containing X) / (Total number of transactions in T)

2. Confidence - Confidence indicates how often the rule has
been found to be true, i.e., how often the items X and Y occur
together in the dataset, given that X already occurs.
It is the ratio of the number of transactions that contain both
X and Y to the number of transactions that contain X:

Confidence(X → Y) = Support(X ∪ Y) / Support(X)

3. Lift - Lift is the strength of a rule, which can be defined by the
following formula:

Lift(X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y))



It is the ratio of the observed support measure to the expected
support if X and Y were independent of each other. It has three
possible values:
• Lift = 1: The occurrences of the antecedent and the
consequent are independent of each other.
• Lift > 1: It determines the degree to which the two
itemsets are dependent on each other.
• Lift < 1: It tells us that one item is a substitute for the
other, which means one item has a negative effect on
the other.
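Here is the short Python sketch promised above, computing support, confidence,
and lift for the rule {bread} → {butter} on a made-up list of transactions.

    # Hypothetical transactions (market baskets).
    transactions = [
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"bread", "eggs"},
        {"milk", "eggs"},
        {"bread", "butter", "eggs"},
    ]

    def support(itemset):
        """Fraction of transactions that contain every item in the itemset."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    X, Y = {"bread"}, {"butter"}
    support_xy = support(X | Y)                    # Support(X ∪ Y)
    confidence = support_xy / support(X)           # Confidence(X -> Y)
    lift = support_xy / (support(X) * support(Y))  # Lift(X -> Y)

    print(support_xy, confidence, lift)            # 0.6 0.75 1.25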

Types of Association Rule Learning Algorithm:

Association rule learning can be divided into three
algorithms:

1. Apriori Algorithm - This algorithm uses frequent itemsets to
generate association rules. It is designed to work on
databases that contain transactions. It uses a
breadth-first search and a hash tree to count itemsets
efficiently.
It is mainly used for market basket analysis and helps to
understand the products that can be bought together. It can
also be used in the healthcare field to find drug reactions for
patients.



2. Eclat Algorithm - Eclat algorithm stands for Equivalence
Class Transformation. This algorithm uses a depth-first search
technique to find frequent itemsets in a transaction database.
It performs faster execution than Apriori Algorithm.

3. FP-Growth Algorithm - FP-Growth stands for Frequent
Pattern Growth, and it is an improved version of the Apriori
algorithm. It represents the database in the form of a tree
structure known as a frequent pattern tree (FP-tree). The
purpose of this frequent pattern tree is to extract the most
frequent patterns.

Efficient and Scalable Frequent Item set Mining Methods:

The standard efficient and scalable frequent itemset mining
methods are the algorithms described above: Apriori, Eclat, and
FP-Growth. Apriori keeps the search scalable by pruning candidates
using the property that every subset of a frequent itemset must
itself be frequent, while FP-Growth avoids repeated database scans
by compressing the transactions into an FP-tree and mining it
directly.
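Purely as an illustration (the notes do not prescribe any tool), the sketch below mines
frequent itemsets and rules with the Apriori implementation from the third-party
mlxtend library, assuming it is installed; the transactions are invented.

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori, association_rules

    transactions = [["bread", "butter", "milk"],
                    ["bread", "butter"],
                    ["bread", "eggs"],
                    ["milk", "eggs"],
                    ["bread", "butter", "eggs"]]

    # One-hot encode the transactions into a boolean DataFrame.
    te = TransactionEncoder()
    df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

    # Frequent itemsets with minimum support 0.4, then rules with minimum confidence 0.7.
    frequent = apriori(df, min_support=0.4, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=0.7)

    print(frequent)
    print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])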

Various Types of Association Rules:

There are various types of association rules in data mining:-


• Multi-relational association rules
• Generalized association rules
• Quantitative association rules



1. Multi-relational association rules - Multi-Relational
Association Rules (MRAR) are a class of association rules,
different from the original simple rules, in which each rule
element consists of one entity but several relationships
(they are usually extracted from multi-relational databases).
These relationships represent indirect relationships between
entities.

2. Generalized association rules - Generalized association rule
extraction is a powerful tool for getting a rough idea of the
interesting patterns hidden in data.
However, since patterns are extracted at each level of
abstraction, the mined rule sets may be too large to be used
effectively for decision-making.
Therefore, in order to discover valuable and interesting
knowledge, post-processing steps are often required.
Generalized association rules should have categorical
(nominal or discrete) properties on both the left and right
sides of the rule.

3. Quantitative association rules - Quantitative association
rules are a special type of association rule.
Unlike general association rules, where both the left and right
sides of the rule should be categorical (nominal or discrete)
attributes, at least one attribute (left or right) of a quantitative
association rule must contain a numeric attribute.



Association Mining to Correlation Analysis:

Correlation analysis is a statistical method used to measure
the strength of the linear relationship between two variables
and compute their association.
Correlation analysis calculates the level of change in one
variable due to the change in the other.
A high correlation points to a strong relationship between the
two variables, while a low correlation means that the
variables are weakly related.

There is a positive correlation between two variables when an
increase in one variable leads to an increase in the other.
On the other hand, a negative correlation means that when
one variable increases, the other decreases, and vice-versa.

In terms of the strength of the relationship, the correlation
coefficient's value varies between +1 and -1. A value of ±1
indicates a perfect degree of association between the two
variables.

As the correlation coefficient's value goes towards 0, the
relationship between the two variables becomes weaker. The
coefficient's sign indicates the direction of the relationship; a +
sign indicates a positive relationship, and a - sign indicates a
negative relationship.



Types of Correlation Analysis in Data Mining:

1. Pearson r correlation - Pearson r correlation is the most
widely used correlation statistic to measure the degree of the
relationship between linearly related variables.
For example, in the stock market, if we want to measure how
two stocks are related to each other, the Pearson r correlation
is used to measure the degree of relationship between the two.
The point-biserial correlation is conducted with the Pearson
correlation formula, except that one of the variables is
dichotomous. The following formula is used to calculate the
Pearson r correlation:

rxy = [n Σ(xi·yi) − Σxi·Σyi] / √{[n Σxi² − (Σxi)²] · [n Σyi² − (Σyi)²]}

Where,
rxy = Pearson r correlation coefficient between x and y
n = number of observations
xi = value of x (for the ith observation)
yi = value of y (for the ith observation)



2. Kendall rank correlation - Kendall rank correlation is a
non-parametric test that measures the strength of dependence
between two variables.
Considering two samples, a and b, each of size n, we know
that the total number of pairings of a with b is n(n−1)/2.
The following formula is used to calculate the value of the
Kendall rank correlation:

τ = (Nc − Nd) / [n(n − 1)/2]

Where,
Nc = number of concordant pairs
Nd = number of discordant pairs

3. Spearman rank correlation - Spearman rank correlation is a
non-parametric test that is used to measure the degree of
association between two variables.

This coefficient requires a table of data that displays the raw
data, its ranks, and the difference between the two ranks.
The squared differences between the two ranks can be shown
on a scatter graph, which will indicate whether there is a
positive, negative, or no correlation between the two
variables.
The constraint that this coefficient works under is -1 ≤ ρ ≤ +1,
where a result of 0 would mean that there was no relation
between the data whatsoever.



The following formula is used to calculate the Spearman rank
correlation:

ρ = 1 − [6 Σ di²] / [n(n² − 1)]

Where,
ρ = Spearman rank correlation
di = the difference between the ranks of corresponding variables
n = number of observations
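A minimal sketch, assuming SciPy is installed, computing all three correlation
coefficients described above on made-up paired observations.

    from scipy.stats import pearsonr, spearmanr, kendalltau

    # Hypothetical paired observations (e.g., two stocks' daily returns).
    x = [1.0, 2.1, 2.9, 4.2, 5.1]
    y = [1.2, 1.9, 3.2, 3.9, 5.3]

    r, _ = pearsonr(x, y)          # Pearson r: linear relationship
    rho, _ = spearmanr(x, y)       # Spearman rho: rank-based association
    tau, _ = kendalltau(x, y)      # Kendall tau: concordant vs. discordant pairs

    print(round(r, 3), round(rho, 3), round(tau, 3))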

Constraint-based Association Mining:

Constraint-Based Association Mining is a data mining
technique that allows users to focus on discovering
interesting patterns or rules from a given dataset by defining
specific constraints.
Constraint-based algorithms apply constraints during the
frequent itemset generation step (similar to exhaustive
algorithms).
The most common constraint is the minimum support
threshold, which limits the inclusion of infrequent itemsets.
By setting constraints, the exploration space is significantly
reduced, leading to more efficient mining.



The constraints can include the following:

1. Knowledge type constraints − These define the type of
knowledge to be mined, such as association or correlation.

2. Data constraints − These define the set of task-relevant
data.

3. Dimension/level constraints − These define the desired
dimensions (or attributes) of the data, or the levels of the
concept hierarchies, to be utilized in mining.

4. Interestingness constraints − These define thresholds on
numerical measures of rule interestingness, including
support, confidence, and correlation.

5. Rule constraints − These define the form of the rules to be
mined. Such constraints can be expressed as metarules
(rule templates), as the maximum or minimum number
of predicates that can appear in the rule antecedent or
consequent, or as relationships among attributes,
attribute values, and/or aggregates.

------------------THE END------------------
This marks the end of the Unit 1 & 2 notes for this subject… Hope it helped.
For those who keep asking for notes on DM, helping to make them faster is better
than simply demanding them.

Made By- Utkarsh M.
