Data Warehouse
The marketing department in your company has been concerned about the performance of the
West Coast Region and the sales numbers from the monthly report this month are drastically low.
The marketing Vice President is agitated and wants to get some reports from the IT department to
analyze the performance over the past two years, product by product, and compared to monthly
targets. He wants to make quick strategic decisions to rectify the situation. The CIO wants your
boss to deliver the reports as soon as possible. Your boss runs to you and asks you to stop
everything and work on the reports. There are no regular reports from any system to give the
marketing department what they want. You have to gather the data from multiple applications and
start from scratch. Does this sound familiar?
At one time or another in your career in information technology, you must have been exposed
to situations like this. Sometimes, you may be able to get the information required for such ad hoc
reports from the databases or files of one application. Usually this is not so. You may have to go to
several applications, perhaps running on different platforms in your company environment, to get
the information. What happens next? The marketing department likes the ad hoc reports you have
produced. But now they would like reports in a different form, containing more information that
they did not think of originally. After the second round, they find that …
The Data Warehousing Lifecycle refers to the process of designing, implementing, and
maintaining a data warehouse. A data warehouse is a centralized repository that stores
data from various sources in a structured format, enabling organizations to analyze and
make informed decisions. The lifecycle involves several key stages:
Planning:
● Define Objectives: Clearly define the goals and objectives of the data warehouse.
Understand the business requirements and how the data warehouse will support
decision-making processes.
● Assess Feasibility: Evaluate the technical and financial feasibility of
implementing a data warehouse. Consider factors such as data sources,
technology infrastructure, and organizational readiness.
Requirements Analysis:
● Gather Requirements: Work with stakeholders to identify and document data
requirements. Understand the types of queries and analyses that users will
perform to ensure the data warehouse meets their needs.
● Data Source Analysis: Identify and analyze potential data sources. Assess the
quality and compatibility of data from different systems.
Design:
● Data Model Design: Create a data model that represents the structure of the data
warehouse. This includes defining dimensions, facts, and relationships between
data elements.
● ETL (Extract, Transform, Load) Design: Plan the processes for extracting data
from source systems, transforming it into the desired format, and loading it into
the data warehouse.
● Infrastructure Design: Define the hardware and software infrastructure required
to support the data warehouse. Consider factors such as storage, processing
power, and data integration tools.
Implementation:
● ETL Development: Implement the ETL processes designed during the previous
stage. This involves extracting data from source systems, transforming it, and
loading it into the data warehouse.
● Data Warehouse Construction: Build the data warehouse based on the designed
data model. Populate the warehouse with data from the ETL processes.
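The ETL steps described above can be sketched end to end. The file contents, table, and column names below are hypothetical, and an in-memory SQLite database stands in for the warehouse:

```python
import csv
import io
import sqlite3

# Hypothetical source extract: one region's sales, as it might arrive
# from an operational system (here simulated as an in-memory CSV).
source_csv = io.StringIO(
    "order_id,region,amount\n"
    "1,west,100.50\n"
    "2,WEST,200.00\n"
    "3,east,50.25\n"
)

# Extract: read raw rows from the source.
rows = list(csv.DictReader(source_csv))

# Transform: standardize codes and convert types.
for row in rows:
    row["order_id"] = int(row["order_id"])
    row["region"] = row["region"].lower()
    row["amount"] = float(row["amount"])

# Load: populate the warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (:order_id, :region, :amount)", rows)

total = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'west'"
).fetchone()[0]
print(total)  # 300.5
```

A production ETL job would of course read from real source systems and apply far richer transformations, but the three-phase structure is the same.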
Testing:
● Data Quality Assurance: Perform data quality checks to ensure the accuracy and
completeness of the data in the data warehouse.
● Performance Testing: Test the performance of queries and data retrieval
processes to ensure the data warehouse meets performance requirements.
Deployment:
● Data Warehouse Deployment: Deploy the data warehouse to production, making
it available for users and applications.
● User Training: Train end-users and relevant stakeholders on how to access and
use the data warehouse.
Maintenance and Evolution:
● Monitoring and Optimization: Monitor the performance of the data warehouse
and optimize queries or processes as needed.
● Data Refresh and Updates: Regularly update the data warehouse with new data
from source systems.
● Evolution and Expansion: Adapt the data warehouse to changing business
requirements. Consider expanding the data warehouse to include new data
sources or additional functionality.
Retirement (Optional):
● If the data warehouse becomes obsolete or is replaced by a newer system, plan
for its retirement. Migrate or archive relevant data and inform stakeholders about
the transition.
1.4 Architecture
1. Single-tier architecture: The goal of a single layer is to store as little data as possible by eliminating data redundancy. In practice, single-tier architecture is not frequently employed. A single-tier data warehouse reduces the amount of data stored by removing redundant copies, producing a dense data set.
Although this design style is suitable for eliminating redundancies, it is not right for companies with complex data needs and multiple data streams; multi-tier data warehouse architectures can help in this situation, since they can handle more complicated data streams. In a single-tier design, the bottom tier, or data warehouse server, is typically a relational database system. This architecture is vulnerable because it does not separate analytical and transactional processing: after the middleware interprets them, analysis queries run against the operational data, so queries directly affect transactional workloads.
● Source layer: A data warehouse system makes use of several data sources. The information
may originate from an information system beyond the company’s boundaries or be initially
housed in legacy or internal relational databases.
● Data staging: It entails extracting the data from the sources, cleaning it to remove inconsistencies and fill in any gaps, and integrating it to merge data from several sources into a single standard schema. Extraction, Transformation, and Loading (ETL) tools can reconcile differing source schemas and perform the extraction, transformation, cleaning, validation, and filtering needed to load the data into the warehouse.
● Data warehouse layer: Information is stored in one logically centralized repository: the data warehouse. Users can access the data warehouse directly, but it can also be used to feed data marts for specific departments within the company, which partly replicate the warehouse contents. Information about sources, data staging, users, access procedures, data mart schemas, and so on is stored in metadata repositories.
● Analysis: This layer allows rapid and flexible access to the integrated data to generate reports, analyze data in real time, and simulate hypothetical business scenarios. It should offer user-friendly GUIs, advanced query optimizers, and aggregate navigators.
Source data coming into the data warehouses may be grouped into four broad categories:
Production Data: This type of data comes from the various operational systems of the
enterprise. Based on the data requirements in the data warehouse, we select segments of
data from these operational systems.
Internal Data: In each organization, users keep their "private" spreadsheets, reports,
customer profiles, and sometimes even departmental databases. This is the internal data, part of
which could be useful in a data warehouse.
Archived Data: Operational systems are mainly intended to run the current business. In every
operational system, we periodically take the old data and store it in archived files.
External Data: Most executives depend on information from external sources for a large
percentage of the information they use. They use statistics relating to their industry produced
by external agencies.
After we have extracted data from the various operational systems and external sources, we
have to prepare it for storage in the data warehouse. The extracted data coming from
several different sources needs to be changed, converted, and made ready in a format that is
suitable for querying and analysis.
We will now discuss the three primary functions that take place in the staging area.
1) Data Extraction: This method has to deal with numerous data sources. We have to employ
the appropriate techniques for each data source.
2) Data Transformation: As we know, data for a data warehouse comes from many different
sources. If data extraction for a data warehouse poses big challenges, data transformation
presents even more significant ones. We perform several individual tasks as part of data
transformation.
First, we clean the data extracted from each source. Cleaning may be the correction of
misspellings or may deal with providing default values for missing data elements, or elimination
of duplicates when we bring in the same data from various source systems.
3) Data Loading: Two distinct categories of tasks make up the data loading function. When the structure and construction of the data warehouse are complete and it goes live for the first time, we perform the initial load of information into warehouse storage. The initial load moves high volumes of data and consumes a substantial amount of time; after that, incremental loads keep the warehouse current.
Data storage for the data warehouse is a separate repository. The data repositories of the
operational systems generally include only the current data, structured in highly normalized
form for fast and efficient transaction processing.
Information Delivery Component
The information delivery component enables users to subscribe to data warehouse files and
have them transferred to one or more destinations according to a user-specified scheduling
algorithm.
Metadata Component
Metadata in a data warehouse is similar to the data dictionary or data catalog in a database
management system. In the data dictionary, we keep data about the logical data structures,
the records and addresses, the indexes, and so on.
Data Mart
A data mart includes a subset of corporate-wide data that is of value to a specific group of users, with its scope confined to particular selected subjects. Data in a data warehouse should be fairly current, though not necessarily up to the minute, and developments in the data warehouse industry have made scheduled and incremental data loads more achievable. Data marts are smaller than data warehouses and usually contain data for a single department. The current trend in data warehousing is to develop a data warehouse together with several smaller, related data marts for particular kinds of queries and reports.
Data preprocessing is an important step in the data mining process. It refers to cleaning,
transforming, and integrating data in order to make it ready for analysis; the goal is to
improve the quality of the data and to make it more suitable for the specific data mining
task. Some common steps in data preprocessing include:
Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data,
such as missing values, outliers, and duplicates. Various techniques can be used for data
cleaning, such as imputation, removal, and transformation.
Data Integration: This involves combining data from multiple sources to create a unified dataset.
Data integration can be challenging as it requires handling data with different formats,
structures, and semantics. Techniques such as record linkage and data fusion can be used for
data integration.
Data Transformation: This involves converting the data into a suitable format for analysis.
Common techniques used in data transformation include normalization, standardization, and
discretization. Normalization is used to scale the data to a common range, while standardization
is used to transform the data to have zero mean and unit variance. Discretization is used to
convert continuous data into discrete categories.
Data Reduction: This involves reducing the size of the dataset while preserving the important
information. Data reduction can be achieved through techniques such as feature selection and
feature extraction. Feature selection involves selecting a subset of relevant features from the
dataset, while feature extraction involves transforming the data into a lower-dimensional space
while preserving the important information.
Data Discretization: This involves dividing continuous data into discrete categories or intervals.
Discretization is often used in data mining and machine learning algorithms that require
categorical data. Discretization can be achieved through techniques such as equal width binning,
equal frequency binning, and clustering.
Data Normalization: This involves scaling the data to a common range, such as between 0 and 1
or -1 and 1. Normalization is often used to handle data with different units and scales. Common
normalization techniques include min-max normalization, z-score normalization, and decimal
scaling.
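A minimal sketch of min-max normalization and z-score standardization with the standard library, on a hypothetical sample:

```python
import statistics

values = [10.0, 20.0, 30.0, 40.0, 50.0]

# Min-max normalization: scale to the range [0, 1].
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

# Z-score standardization: zero mean, unit (sample) variance.
mean = statistics.mean(values)
stdev = statistics.stdev(values)
z_scores = [(v - mean) / stdev for v in values]

print(min_max)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```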
Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of the
analysis results. The specific steps involved in data preprocessing may vary depending on the
nature of the data and the analysis goals.
By performing these steps, the data mining process becomes more efficient and the results
become more accurate.
Preprocessing in Data Mining:
Data preprocessing is a data mining technique used to transform raw data into a useful and
efficient format. Exploring and summarizing the data commonly involves the following techniques:
● Summary Statistics: Calculating basic statistics such as mean, median, mode, standard
deviation, minimum, maximum, and quartiles provides a general overview of the
distribution of numerical variables within the dataset.
● Frequency Distribution: Creating frequency tables or histograms allows you to visualize
the distribution of categorical variables and identify the most common categories or
levels within each variable.
● Data Visualization: Techniques such as scatter plots, bar charts, pie charts, box plots, and
heatmaps are used to visually represent the relationships and patterns within the data.
Visualization aids in identifying outliers, clusters, trends, and correlations.
● Correlation Analysis: Computing correlation coefficients (e.g., Pearson correlation)
between pairs of numerical variables helps identify relationships and dependencies
between variables. Correlation analysis is useful for understanding how variables are
related to each other.
● Cross-Tabulation: Cross-tabulating categorical variables allows you to examine the
relationships between different categories and identify any associations or dependencies
between them.
● Data Profiling: Data profiling involves analyzing the structure and quality of the dataset,
including identifying missing values, outliers, data types, and unique values for each
variable. Data profiling helps in understanding the completeness and integrity of the
dataset.
● Cluster Analysis: Cluster analysis techniques such as k-means clustering or hierarchical
clustering group similar data points together based on their characteristics. Cluster
analysis helps identify natural groupings or patterns within the data.
● Dimensionality Reduction: Techniques such as principal component analysis (PCA) or
t-distributed stochastic neighbor embedding (t-SNE) can be used to reduce the
dimensionality of the dataset while preserving important information. Dimensionality
reduction helps in visualizing high-dimensional data and identifying underlying structures.
● Data Summarization Techniques: Techniques such as data discretization, binning, or
summarization by aggregation (e.g., grouping data by time periods) can be used to
reduce the complexity of the dataset while preserving key insights.
● Text Summarization: In cases where the dataset contains textual data, techniques such
as text summarization, keyword extraction, or sentiment analysis can be used to extract
key themes, topics, or sentiments from the text.
2.3 Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is
done; it involves handling missing data, noisy data, etc.
2. Data Transformation:
Data transformation is a crucial step in the data mining process where raw data is manipulated
or modified to prepare it for analysis. This step involves converting data into a format that is
suitable for the chosen data mining technique or algorithm. Data transformation helps improve
the quality of the data, reduces noise, and enhances the performance of data mining algorithms.
Here are some common techniques used in data transformation in data mining:
Data discretization refers to a method of converting a huge number of data values into a smaller set so that the evaluation and management of the data become easier. In other words, data discretization converts the values of a continuous attribute into a finite set of intervals with minimal loss of information. There are two forms of data discretization: supervised and unsupervised. Supervised discretization uses the class information. Unsupervised discretization does not, and is characterized by the way the operation proceeds: either a top-down splitting strategy or a bottom-up merging strategy.
Another example is web analytics, where we gather statistics about website visitors. For example,
all visitors who reach the site from an IP address located in India are shown under the country level India.
Histogram analysis
Binning
Binning refers to a data smoothing technique that groups a large number of continuous
values into a smaller number of bins. It can also be used for data discretization and for the
development of concept hierarchies.
Cluster Analysis
Decision Tree Analysis
Discretization by decision tree analysis uses a top-down splitting technique and is done
through a supervised procedure. For a numeric attribute, first select the split point with the
least entropy, and then apply a recursive process. The recursion divides the values into
discretized disjoint intervals, from top to bottom, using the same splitting criterion.
Correlation Analysis
Discretizing data by a linear regression technique finds the best neighboring intervals, and
then the large intervals are combined to form larger overlapping intervals. It is a supervised
procedure.
Let's understand this concept hierarchy for the dimension location with the help of an example.
A particular city can map with the belonging country. For example, New Delhi can be mapped to India,
and India can be mapped to Asia.
Top-down mapping
Top-down mapping generally starts with the top with some general information and ends with the
bottom to the specialized information.
Bottom-up mapping
Bottom-up mapping generally starts with the bottom with some specialized information and ends with
the top to the generalized information.
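The location hierarchy described above can be represented as simple mappings; the cities and the `generalize` helper are illustrative:

```python
# Concept hierarchy for the location dimension, as plain mappings:
# city -> country -> continent (bottom-up generalization).
city_to_country = {"New Delhi": "India", "Tokyo": "Japan", "Paris": "France"}
country_to_continent = {"India": "Asia", "Japan": "Asia", "France": "Europe"}

def generalize(city):
    """Roll a city up the hierarchy to its country and continent."""
    country = city_to_country[city]
    return country, country_to_continent[country]

print(generalize("New Delhi"))  # ('India', 'Asia')
```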
OLAP implements multidimensional analysis of business information and supports complex calculations, trend analysis, and sophisticated data modeling. It is rapidly becoming the essential foundation for intelligent solutions including business performance management, planning, budgeting, forecasting, financial reporting, analysis, simulation models, knowledge discovery, and data warehouse reporting. OLAP enables end users to perform ad hoc analysis of data in multiple dimensions, providing the insight and understanding they require for better decision making.
Finance and accounting
○ Budgeting
○ Activity-based costing
○ Financial performance analysis
○ Financial modeling
Production
○ Production planning
○ Defect analysis
OLAP cubes have two main purposes. The first is to provide business users with a data model
more intuitive to them than a tabular model. This model is called a Dimensional Model.
Points to Remember −
MOLAP tools process information with consistent response time regardless of the level of
summarization or the calculations selected.
MOLAP tools avoid many of the complexities of creating a relational database to store
data for analysis.
MOLAP tools need the fastest possible performance.
A MOLAP server adopts two levels of storage representation to handle dense and sparse
data sets:
Denser sub-cubes are identified and stored as array structures.
Sparse sub-cubes employ compression technology.
Advantages
MOLAP allows fastest indexing to the pre-computed summarized data.
Helps users connected to a network who need to analyze larger, less-defined data.
Easier to use; therefore MOLAP is suitable for inexperienced users.
Disadvantages
2. Transparency:
It makes the technology, underlying data repository, computing architecture,
and the diverse nature of source data totally transparent to users.
3. Accessibility:
Access should be provided only to the data that is actually needed to perform
the specific analysis, presenting a single, coherent, and consistent view to the
users.
5. Client/Server Architecture:
The system should conform to the principles of client/server architecture for
optimum performance, flexibility, adaptability, and interoperability.
6. Generic Dimensionality:
It should be ensured that every data dimension is equivalent in both structure
and operational capabilities; there should be one logical structure for all dimensions.
8. Multi-user Support:
Support should be provided for end users to work concurrently with either
the same analytical model or to create different models from the same data.
In the FASMI characterization of OLAP methods, the term is derived from the first letters of the
characteristics:
Fast
Analysis
Shared
The system must implement all the security requirements for confidentiality and, if
multiple write access is needed, concurrent update locking at an appropriate level. Not all
applications need users to write data back, but for the increasing number that do, the
system should be able to manage multiple updates in a timely, secure manner.
Multidimensional
This is the basic requirement. OLAP system must provide a multidimensional conceptual view
of the data, including full support for hierarchies, as this is certainly the most logical method to
analyze business and organizations.
Information
The system should be able to hold all the data needed by the applications. Data sparsity
should be handled in an efficient manner.
Multidimensional conceptual view − A user view of the enterprise data is multidimensional. The
conceptual view of OLAP models should be multidimensional. The multidimensional models
can be manipulated more easily and intuitively than in the case of single-dimensional models.
Transparency − A user should be able to get full value from an OLAP engine without being
concerned about the source of the data. The OLAP system’s technology, underlying database
and computing architecture, and the heterogeneity of input data sources should be transparent
to users, to preserve their productivity and proficiency with familiar front-end environments and tools.
It should also be transparent to the user as to whether or not the enterprise data input to the
OLAP tool comes from homogeneous or heterogeneous database environments.
Accessibility − The OLAP user must be able to perform analysis based upon a common
conceptual schema composed of enterprise data in a relational DBMS. The OLAP tool should
map its logical schema to the heterogeneous physical data stores, access the data, and perform
any conversions necessary to present a single, coherent, and consistent user view.
Consistent performance − As the number of dimensions or the size of the database increases,
there should not be any significant degradation in reporting performance. Consistent reporting
performance is essential to maintaining the ease of use and lack of complexity required to
bring OLAP to the end user.
Client-server architecture − Most of the data which require online analytical processing are
stored on mainframe systems and accessed via personal computers. The OLAP servers must
be capable of operating in a client-server environment.
The server component of OLAP tools should be sufficiently intelligent such that various clients
can be connected with minimum effort and integration programming. This server may be
capable of performing the mapping and consolidation between disparate logical and physical
enterprise database schema necessary to affect transparency and to build common
conceptual, logical, and physical schemas.
Generic dimensionality − Each data dimension should be similar in both its architecture and
operational capabilities. Additional operational capabilities can be granted to selected
dimensions but since dimensions are symmetric a given additional function can be granted to
any dimension.
Multi-user support − There can be a need to work concurrently with either the same analytical
model or to create different models from the same enterprise data.
Intuitive data manipulation − Consolidation, drilling down across columns or rows, zooming
out, and other manipulations inherent in the consolidation outlines should be accomplished via
direct action upon the cells of the analytical model, and should require neither the use of a
menu nor multiple trips across the user interface.
ROLAP servers use a relational or extended-relational DBMS to store and manage warehouse
data, and OLAP middleware to provide the missing pieces.
ROLAP servers contain optimization for each DBMS back end, implementation of aggregation
navigation logic, and additional tools and services.
ROLAP systems work primarily from the data that resides in a relational database, where the
base data and dimension tables are stored as relational tables. This model permits the
multidimensional analysis of data.
This technique relies on manipulating the data stored in the relational database to give the
appearance of traditional OLAP's slicing and dicing functionality. In essence, each act of
slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement.
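This WHERE-clause equivalence can be demonstrated directly on relational tables. The star-schema tables and values below are hypothetical, with an in-memory SQLite database standing in for the ROLAP back end:

```python
import sqlite3

# A tiny hypothetical star schema: one fact table, one time dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales (time_id INTEGER, region TEXT, amount REAL);
INSERT INTO dim_time VALUES (1, 2023, 1), (2, 2023, 2), (3, 2024, 1);
INSERT INTO fact_sales VALUES
    (1, 'west', 100), (2, 'west', 150), (3, 'west', 90),
    (1, 'east', 70), (3, 'east', 80);
""")

# "Slice" the cube on year = 2023, then "dice" further on region = 'west':
# each cut is just another predicate in the WHERE clause.
total = conn.execute("""
    SELECT SUM(f.amount)
    FROM fact_sales f JOIN dim_time t ON f.time_id = t.time_id
    WHERE t.year = 2023 AND f.region = 'west'
""").fetchone()[0]
print(total)  # 250.0
```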
In MOLAP, data is stored in multidimensional arrays, and computation is performed on it
relying on the random access capability of the arrays. Array elements are determined by
dimension instances, and the fact data or measured value associated with each cell is stored
in the corresponding array element. Unlike ROLAP, where only records with non-zero facts are
stored, all array elements are defined in MOLAP; as a result, the arrays generally tend to be
sparse, with empty elements occupying a greater part of them. Since both storage and retrieval
costs are important when assessing online performance, MOLAP systems typically include
provisions such as advanced indexing and hashing to locate data while performing queries,
and compression for handling sparse arrays. MOLAP cubes offer fast data retrieval, are
optimal for slicing and dicing, and can perform complex calculations; all calculations are
pre-generated when the cube is created.
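One way to picture the sparse-array storage is a cube kept as a coordinate-to-measure mapping, so that empty cells are simply not stored. The dimensions and figures here are hypothetical:

```python
# Sparse MOLAP-style cube: only non-empty cells are materialized.
# Keys are (year, region, product) coordinates; values are the measure.
cube = {
    ("2023", "west", "laptop"): 120,
    ("2023", "east", "laptop"): 80,
    ("2024", "west", "phone"): 200,
}

def cell(year, region, product):
    # An unstored (empty) cell reads as zero, as in a compressed sparse sub-cube.
    return cube.get((year, region, product), 0)

print(cell("2023", "west", "laptop"))  # 120
print(cell("2024", "east", "laptop"))  # 0  (empty cell, never stored)
```

Real MOLAP servers use dense arrays plus compression rather than a hash map, but the principle (pay nothing for empty cells) is the same.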
TOLAP systems are designed to work transparently with existing RDBMS systems, allowing
users to access OLAP features without needing to transfer data to a separate OLAP system.
This allows for more seamless integration between OLAP and traditional RDBMS systems.
There are some other types of OLAP Systems that are used in analyzing databases. Some of
them are mentioned below.
OLAP vs. OLTP:
1. OLAP stands for Online Analytical Processing; OLTP stands for Online Transaction Processing.
2. OLAP includes software tools that help in analyzing data mainly for business decisions; OLTP helps in managing online database modification.
7. In OLAP, the tables are not normalized; in OLTP, the tables are normalized.
8. OLAP allows mostly read and hardly any write operations; OLTP allows both read and write operations.
9. OLAP involves complex queries; OLTP queries are simple.
● Roll-up: this operation aggregates similar data attributes along a dimension
hierarchy. For example, if the data cube displays the daily income of a
customer, we can use a roll-up operation to find his monthly income.
● Drill-down: this operation is the reverse of the roll-up operation. It allows us to take
particular information and then subdivide it further for finer granularity analysis.
It zooms into more detail. For example, if India is a value in a country column
and we wish to see villages in India, then the drill-down operation splits India into
states, districts, towns, cities, and villages and then displays the required information.
● Dicing: this operation performs a multidimensional cut: it does not cut only one
dimension but can also go to another dimension and cut a certain range of it. As a
result, it produces a subcube out of the whole cube.
For example, the user wants to see the annual salary of Jharkhand state
employees.
● Pivot: this operation is very important from a viewing point of view. It basically
transforms the data cube in terms of view. It doesn’t change the data present in the
data cube. For example, if the user is comparing year versus branch, using the pivot
operation, the user can change the viewpoint and now compare branch versus item
type.
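The roll-up operation above can be sketched by generalizing a day key to its month; the daily figures and branch names are hypothetical:

```python
from collections import defaultdict

# Daily-income cells of a small cube, keyed by (day, branch).
daily_income = {
    ("2023-01-05", "branch_a"): 100,
    ("2023-01-20", "branch_a"): 150,
    ("2023-02-02", "branch_a"): 90,
    ("2023-01-11", "branch_b"): 60,
}

# Roll-up: generalize day -> month along the time hierarchy and aggregate.
monthly = defaultdict(int)
for (day, branch), amount in daily_income.items():
    month = day[:7]  # "YYYY-MM-DD" -> "YYYY-MM"
    monthly[(month, branch)] += amount

print(monthly[("2023-01", "branch_a")])  # 250
```

Drill-down is the inverse: it would require keeping (or re-reading) the finer daily cells that the roll-up summed away.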
UNIT-IV: Data Mining
In general terms, “Mining” is the process of extraction of some valuable material from the
earth e.g. coal mining, diamond mining, etc. In the context of computer science, “Data Mining”
can be referred to as knowledge mining from data, knowledge extraction, data/pattern
analysis, data archaeology, and data dredging. It is basically the process carried out for the
extraction of useful information from a bulk of data or data warehouses. One can see that the
term itself is a little confusing. In the case of coal or diamond mining, the result of the
extraction process is coal or diamond. But in the case of Data Mining, the result of the
extraction process is not data!! Instead, data mining results are the patterns and knowledge
that we gain at the end of the extraction process. In that sense, we can think of Data Mining
as a step in the process of Knowledge Discovery or Knowledge Extraction.
Nowadays, data mining is used in almost all places where a large amount of data is stored
and processed. For example, banks typically use ‘data mining’ to find out their prospective
customers who could be interested in credit cards, personal loans, or insurance as well. Since
banks have the transaction details and detailed profiles of their customers, they analyze all
this data and try to find out patterns that help them predict that certain customers could be
interested in personal loans, etc.
Data mining functionalities are used to represent the types of patterns that have to be
discovered in data mining tasks. Data mining tasks can be classified into two types:
descriptive and predictive. Descriptive mining tasks characterize the general properties of the
data in the database, while predictive mining tasks perform inference on the current data in
order to make predictions.
Data mining is extensively used in many areas or sectors. It is used to predict and characterize
data. But the ultimate objective in Data Mining Functionalities is to observe the various trends
in data mining. There are several data mining functionalities that the organized and scientific
methods offer, such as:
1. Class/Concept Descriptions
A class or concept is defined by a data set or set of features shared by its members. A
class can be a category of items on a shop floor, and a concept can be the abstract idea
by which data are categorized, such as products to be put on clearance sale versus
non-sale products. Two kinds of description are involved here: one that helps with
grouping (characterization) and one that helps with differentiating (discrimination).
2. Mining of Frequent Patterns
Finding data patterns is one of the core functions of data mining. Frequent patterns are
those that occur most often in the data. Several kinds of frequent pattern can be found in
a dataset:
○ Frequent item set: a group of items that are commonly found together, such as milk
and sugar.
○ Frequent substructure: a structural form, such as a tree or graph, that occurs
frequently, possibly combined with item sets or subsequences.
○ Frequent subsequence: a pattern series that occurs regularly, such as buying a phone
followed by a cover.
3. Association Analysis
Association analysis finds the sets of items that generally occur together in a
transactional dataset. It is also known as market basket analysis because of its wide use
in retail sales. Two parameters are used to determine association rules:
○ Support, which identifies how frequently the item set appears in the database.
○ Confidence, which is the conditional probability that an item occurs in a transaction
given that another item occurs.
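As a toy illustration, both measures can be computed by simple counting; the transactions below are hypothetical:

```python
# Hypothetical toy transaction data for computing support and confidence.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"milk", "bread", "jam"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Conditional probability of the consequent given the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"milk", "bread"}))       # 3 of 5 transactions -> 0.6
print(confidence({"milk"}, {"bread"}))  # 3 of the 4 milk transactions -> 0.75
```

With real data the same counts are what an association rule miner computes, just over far more candidate itemsets.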
4. Classification
Classification is a data mining technique that categorizes items in a collection based on
predefined properties. It uses methods such as if-then rules, decision trees, or neural
networks to predict a class, essentially classifying a collection of items. A training set
containing items whose class is known is used to train the system to predict the category
of items in an unknown collection.
5. Prediction
Prediction estimates unavailable data values, such as missing numbers or spending
trends. A value can be anticipated from the attribute values of the object and the attribute
values of the classes. The prediction may concern missing numerical values or increasing
or decreasing trends in time-related data. There are primarily two types of predictions in
data mining: numeric predictions and class predictions.
○ Numeric predictions are made by creating a linear regression model that is based on
historical data. Prediction of numeric values helps businesses ramp up for a future
event that might impact the business positively or negatively.
○ Class predictions are used to fill in missing class information for products using a
training data set where the class for products is known.
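A numeric prediction via linear regression can be sketched as follows; the sales figures are made up for illustration:

```python
import numpy as np

# Hypothetical monthly sales; a least-squares line fitted to past data
# is used to predict the next (unavailable) value.
months = np.array([1, 2, 3, 4, 5], dtype=float)
sales  = np.array([10.0, 12.0, 14.0, 16.0, 18.0])  # perfectly linear, for illustration

slope, intercept = np.polyfit(months, sales, deg=1)  # fit a straight line
next_month = 6.0
prediction = slope * next_month + intercept
print(round(prediction, 2))  # 20.0 for this toy data
```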
6. Cluster Analysis
Cluster analysis groups data objects that are similar to one another without using
predefined class labels; the groups (clusters) are discovered from the data itself.
7. Outlier Analysis
Outlier analysis is important for understanding the quality of data. If there are too many
outliers, you cannot trust the data or the patterns drawn from it. Outlier analysis
determines whether something in the data is out of the ordinary and whether it indicates
a situation the business needs to examine and mitigate. Data objects that the algorithms
cannot group into any class are flagged as outliers.
8. Correlation Analysis
Correlation analysis is a mathematical technique for determining whether and how
strongly two attributes are related to one another. It measures how closely two
numerically measured continuous variables are linked. Researchers can use this type of
analysis to test for possible correlations between the variables in their study.
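A minimal sketch of measuring such a relationship with the Pearson correlation coefficient; the paired values below are hypothetical:

```python
import numpy as np

# Hypothetical paired measurements; np.corrcoef returns the Pearson
# correlation matrix for the two variables.
hours_studied = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
exam_score    = np.array([52.0, 55.0, 61.0, 68.0, 74.0])

r = np.corrcoef(hours_studied, exam_score)[0, 1]
print(round(r, 3))  # close to +1: strong positive linear relationship
```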
Data mining refers to the process of extracting important information from raw data. It
analyzes patterns in huge sets of data with the help of software tools. Since its
development, data mining has been widely adopted by researchers in research and
development.
With data mining, businesses can become more profitable. It has helped not only in
understanding customer demand but also in developing effective strategies to improve
overall business turnover, and in setting clear business objectives for decision making.
Data collection, data warehousing, and computer processing are some of the strongest
pillars of data mining. Data mining uses mathematical algorithms to segment the data
and assess the probability of future events.
To understand the system and meet the desired requirements, data mining can be classified
into the following systems:
For example, if we want to classify a database based on the data model, we need to select
either relational, transactional, object-relational or data warehouse mining systems.
A data mining system categorized on the basis of the kind of knowledge mined may have
the following functionalities:
1. Characterization
2. Discrimination
3. Association and Correlation Analysis
4. Classification
5. Prediction
6. Outlier Analysis
7. Evolution Analysis
A data mining system can also be classified based on the techniques it employs. These
techniques can be distinguished by the degree of user interaction involved or by the
methods of analysis employed.
Data mining systems classified according to the applications adapted are as follows:
1. Finance
2. Telecommunications
3. DNA
4. Stock Markets
5. E-mail
No Coupling
In no coupling, the data mining system uses no facilities of a database or data warehouse
system; it fetches data from a particular source such as a file, processes it with its own
algorithms, and stores the results in a file.
Loose Coupling
In loose coupling, the data mining system uses some facilities of a database or data
warehouse system, fetching data from a repository managed by these systems and
storing the results either in a file or in a designated place in the database.
Semi-Tight Coupling
In semi-tight coupling, the data mining system is linked to a DB or DW system and, in
addition, efficient implementations of a few essential data mining primitives are provided
within the database.
Tight Coupling
A data mining system can be effortlessly combined with a database or data warehouse
system in tight coupling.
The data mining tasks can be classified generally into two types based on what a specific task
tries to achieve. Those two categories are descriptive tasks and predictive tasks. The
descriptive data mining tasks characterize the general properties of data whereas predictive
data mining tasks perform inference on the available data set to predict how a new data set will
behave.
Different Data Mining Tasks
There are a number of data mining tasks such as classification, prediction, time-series
analysis, association, clustering, summarization etc. All these tasks are either predictive data
mining tasks or descriptive data mining tasks. A data mining system can execute one or more
of the above specified tasks as part
of data mining.
a) Classification
Classification derives a model to determine the class of an object based on its attributes.
A collection of records is available, each record with a set of attributes. One of the
attributes is the class attribute, and the goal of the classification task is to assign the
class attribute to a new set of records as accurately as possible.
Classification can be used in direct marketing, that is to reduce marketing costs by targeting a
set of customers who are likely to buy a new product. Using the available data, it is possible to
know which customers purchased similar products and who did not purchase in the past.
Hence, {purchase, don’t purchase} decision forms the class attribute in this case. Once the
class attribute is assigned, demographic and lifestyle information of customers who purchased
similar products can be collected and promotion mails can be sent to them directly.
b) Prediction
Prediction task predicts the possible values of missing or future data. Prediction involves
developing a model based on the available data and this model is used in predicting future
values of a new data set of interest. For example, a model can predict the income of an
employee based on education, experience and other demographic factors like place of stay,
gender etc. Also prediction analysis is used in different areas including medical diagnosis,
fraud detection etc.
d) Association
Association discovers the association or connection among a set of items. Association
identifies the relationships between objects. Association analysis is used for commodity
management, advertising, catalog design, direct marketing, etc. A retailer can identify the
products that customers normally purchase together, or even find the customers who
respond to the promotion of the same kind of products. If a retailer finds that beer and
nappies are mostly bought together, he can put nappies on sale to promote the sale of beer.
e) Clustering
Clustering is used to identify data objects that are similar to one another. The similarity can be
decided based on a number of factors like purchase behavior, responsiveness to certain
actions, geographical locations and so on. For example, an insurance company can cluster its
customers based on age, residence, income etc. This group information will be helpful to
understand the customers better and hence provide better customized services.
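The insurance example above can be sketched with a standard clustering algorithm such as k-means; the customer values below are invented for illustration:

```python
from sklearn.cluster import KMeans
import numpy as np

# Hypothetical customers described by (age, income in $1000s); k-means
# groups similar customers without any predefined class labels.
customers = np.array([
    [25, 30], [27, 32], [24, 28],   # younger, lower income
    [55, 90], [60, 95], [58, 88],   # older, higher income
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)  # two groups: the first three customers vs the last three
```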
f) Summarization
Summarization is the generalization of data. A set of relevant data is summarized,
resulting in a smaller set that gives aggregated information about the data. For example,
the shopping done by a customer can be summarized into total products, total spending,
offers used, etc. Such high-level summarized information can be useful to sales or
customer relationship teams for detailed analysis of customers and purchase behavior.
Data can be summarized at different abstraction levels and from different angles.
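A minimal sketch of this kind of summarization, aggregating hypothetical purchase records per customer:

```python
import pandas as pd

# Hypothetical purchase records; groupby aggregates them into one
# summarized row per customer (count of purchases and total spending).
purchases = pd.DataFrame({
    "customer": ["A", "A", "B", "A", "B"],
    "amount":   [10.0, 25.0, 5.0, 15.0, 20.0],
})

summary = purchases.groupby("customer")["amount"].agg(["count", "sum"])
print(summary)
# customer A: 3 purchases totalling 50.0; customer B: 2 purchases totalling 25.0
```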
Summary
Different data mining tasks are the core of data mining process. Different prediction and
classification data mining tasks actually extract the required information from the available data
sets.
Data mining is not an easy task: the algorithms used can be very complex, and the data is
not always available in one place; it needs to be integrated from various heterogeneous
data sources. These factors also create some issues. Here in this tutorial, we will discuss
the major issues, in particular performance issues.
Performance Issues
There can be performance-related issues such as the following −
Efficiency and scalability of data mining algorithms − In order to effectively extract
information from the huge amount of data in databases, data mining algorithms must
be efficient and scalable.
Parallel, distributed, and incremental mining algorithms − Factors such as the huge size
of databases, the wide distribution of data, and the complexity of data mining methods
motivate the development of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions, which are processed in parallel, and the
results from the partitions are then merged. Incremental algorithms update databases
without mining the data again from scratch.
Handling of relational and complex types of data − The database may contain complex
data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible
for one system to mine all these kinds of data.
Mining information from heterogeneous databases and global information systems −
The data is available at different data sources on a LAN or WAN. These data sources may
be structured, semi-structured, or unstructured. Mining knowledge from them therefore
adds further challenges to data mining.
Unit-V: Association Rule Mining
INTRODUCTION:
1. Frequent item sets are a fundamental concept in association rule mining, a
technique used in data mining to discover relationships between items in a
dataset. The goal of association rule mining is to identify sets of items that
frequently occur together in a dataset and the relationships between them.
2. A frequent item set is a set of items that occur together frequently in a dataset. The
frequency of an item set is measured by the support count, which is the number of
transactions or records in the dataset that contain the item set. For example, if a
dataset contains 100 transactions and the item set {milk, bread} appears in 20 of
those transactions, the support count for {milk, bread} is 20.
3. Association rule mining algorithms, such as Apriori or FP-Growth, are used to find
frequent item sets and generate association rules. These algorithms work by
iteratively generating candidate item sets and pruning those that do not meet the
minimum support threshold. Once the frequent item sets are found, association rules
can be generated by using the concept of confidence, which is the ratio of the
number of transactions that contain the item set and the number of transactions that
contain the antecedent (left-hand side) of the rule.
4. Frequent item sets and association rules can be used for a variety of tasks such as
market basket analysis, cross-selling and recommendation systems. However, it
should be noted that association rule mining can generate a large number of rules,
many of which may be irrelevant or uninteresting. Therefore, it is important to use
appropriate measures such as lift and conviction to evaluate the interestingness of
the generated rules.
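The candidate-generation-and-pruning idea described above can be sketched in a brute-force form (the real Apriori algorithm prunes candidates level by level; the transactions and thresholds here are illustrative):

```python
from itertools import combinations

# Hypothetical transactions and thresholds.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
]
min_support_count = 2
min_confidence = 0.6

# Count every candidate itemset; keep those meeting the minimum support count.
items = sorted(set().union(*transactions))
counts = {}
for size in range(1, len(items) + 1):
    for cand in combinations(items, size):
        c = sum(1 for t in transactions if set(cand) <= t)
        if c >= min_support_count:
            counts[frozenset(cand)] = c

# Generate rules X -> Y from each frequent itemset, scored by confidence:
# confidence = support(X ∪ Y) / support(X).
rules = []
for itemset, c in counts.items():
    if len(itemset) < 2:
        continue
    for k in range(1, len(itemset)):
        for lhs in combinations(itemset, k):
            conf = c / counts[frozenset(lhs)]
            if conf >= min_confidence:
                rules.append((set(lhs), set(itemset) - set(lhs), conf))

print(len(rules), "rules generated")
```

Note how quickly the candidate space grows: this brute-force version enumerates every subset of the item universe, which is exactly why Apriori prunes candidates whose subsets are already infrequent.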
Association mining searches for frequent items in the data set. Frequent pattern mining
typically uncovers interesting associations and correlations between item sets in
transactional and relational databases. In short, frequent pattern mining shows which
items appear together in a transaction or relationship.
Need for Association Mining: Frequent mining generates association rules from a
transactional dataset. If two items X and Y are purchased together frequently, it is good to
place them together in stores or to offer a discount on one item with the purchase of the
other; this can really increase sales. For example, it is likely that a customer who buys
milk and bread also buys butter. The association rule is {milk, bread} ⇒ {butter}, so the
seller can suggest butter to a customer who buys milk and bread.
Important Definitions :
● Support: one of the measures of interestingness, indicating the usefulness and
certainty of a rule. A support of 5% means that 5% of all transactions in the
database follow the rule.
Example On finding Frequent Itemsets – Consider the given dataset with given transactions.
2-frequent itemsets:
{A, B} = 2 // not frequent because support count < minimum support count, so ignore
{A, C} = 3 // not closed due to {A, C, D}
{A, D} = 3 // not closed due to {A, C, D}
{B, C} = 3 // not closed due to {B, C, D}
{B, D} = 4 // closed but not maximal due to {B, C, D}
{C, D} = 4 // closed but not maximal due to {B, C, D}
3-frequent itemsets:
{A, B, C} = 2 // not frequent because support count < minimum support count, so ignore
{A, B, D} = 2 // not frequent because support count < minimum support count, so ignore
{A, C, D} = 3 // maximal frequent
{B, C, D} = 3 // maximal frequent
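The closed/maximal distinctions used in this example can be checked programmatically; the support counts below mirror the frequent itemsets listed above:

```python
# Support counts of the frequent itemsets from the example above.
# A frequent itemset is *closed* if no proper superset has the same
# support count, and *maximal* if no proper superset is frequent at all.
support_counts = {
    frozenset("BD"): 4, frozenset("CD"): 4,
    frozenset("AC"): 3, frozenset("AD"): 3, frozenset("BC"): 3,
    frozenset("ACD"): 3, frozenset("BCD"): 3,
}

def is_closed(itemset):
    return not any(itemset < other and support_counts[other] == support_counts[itemset]
                   for other in support_counts)

def is_maximal(itemset):
    return not any(itemset < other for other in support_counts)

print(is_closed(frozenset("AC")))    # False: {A, C, D} has the same count
print(is_maximal(frozenset("BD")))   # False: {B, C, D} is a frequent superset
print(is_maximal(frozenset("BCD")))  # True
```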
ADVANTAGES OR DISADVANTAGES:
Advantages of using frequent item sets and association rule mining include:
Disadvantages of using frequent item sets and association rule mining include:
1. Large number of generated rules: Association rule mining can generate a large
number of rules, many of which may be irrelevant or uninteresting, which can make
it difficult to identify the most important patterns.
2. Limited in detecting complex relationships: Association rule mining is limited in its
ability to detect complex relationships between items, and it only considers the
co-occurrence of items in the same transaction.
3. Can be computationally expensive: As the number of items and transactions
increases, the number of candidate item sets also increases, which can make the
algorithm computationally expensive.
4. Need to define the minimum support and confidence threshold: The minimum
support and confidence threshold must be set before the association rule mining
process, which can be difficult and requires a good understanding of the data.
In this article, we will discuss concepts of Multilevel Association Rule mining and its algorithms,
applications, and challenges.
Data mining is the process of extracting hidden patterns from large data sets. One of the
fundamental techniques in data mining is association rule mining. To identify relationships
between items in a dataset, Association rule mining is used. These relationships can then be
used to make predictions about future occurrences of those items.
Multilevel Association Rule mining is a technique that extends Association Rule mining to
discover relationships between items at different levels of granularity. Multilevel Association
Rule mining can be classified into two types: multi-dimensional Association Rule and
multi-level Association Rule.
Multi-dimensional Association Rule
This is used to find relationships between items in different dimensions of a dataset. For
example, in a sales dataset, multi-dimensional Association Rule mining can be used to find
relationships between products, regions, and time.
Multi-level Association Rule
This is used to find relationships between items at different levels of granularity. For example,
in a retail dataset, multi-level Association Rule mining can be used to find relationships
between individual items and categories of items.
Multilevel rule mining is important because data at lower levels may not exhibit any
meaningful patterns on its own, yet it can contain valuable insights. The goal is to find such
hidden information within and across levels of abstraction.
There are several algorithms for Multilevel Association Rule mining, including partition-based,
agglomerative, and hybrid approaches.
Partition-based algorithms divide the data into partitions based on some criteria, such as the
level of granularity, and then mine Association Rules within each partition. Agglomerative
algorithms start with the smallest itemsets and then gradually merge them into larger itemsets,
until a set of rules is obtained. Hybrid algorithms combine the strengths of partition-based and
agglomerative approaches.
Multilevel Association Rule mining has different approaches to finding relationships between
items at different levels of granularity. There are three approaches: Uniform Support, Reduced
Support, and Group-based Support. These are explained as follows below in brief.
Uniform Support (using uniform minimum support for all levels)
In this approach, only one minimum support threshold is used for all levels. The approach
is simple but may miss meaningful associations at low levels.
Reduced Support (using reduced minimum support at lower levels)
Here the minimum support threshold is lowered at lower levels to avoid missing important
associations. This approach uses different search techniques, such as level-by-level
independence and level-cross filtering by single item or by k-itemset.
Group-based Support (using item or group-based minimum support)
Here the user or an expert sets the support and confidence thresholds for a specific group
or product category.
For example, if an expert wants to study the purchase patterns of laptops and clothes in the
non-electronic category, a low support threshold can be set for this group to give attention to
these items' purchase patterns.
Market Basket Analysis
Multilevel Association Rule mining helps retailers gain insights into customer buying behavior
and preferences, optimize product placement and pricing, and improve supply chain
management.
Healthcare Management
Multilevel Association Rule mining helps healthcare providers identify patterns in patient
behavior, diagnose diseases, identify high-risk patients, and optimize treatment plans.
Fraud Detection
Multilevel Association Rule mining helps companies identify fraudulent patterns, detect
anomalies, and prevent fraud in various industries such as finance, insurance, and
telecommunications.
Web Usage Mining
Multilevel Association Rule mining helps web-based companies gain insights into user
preferences, optimize website design and layout, and personalize content for individual users
by analyzing data at different levels of abstraction.
Social Network Analysis
Multilevel Association Rule mining helps social network providers identify influential users,
detect communities, and optimize network structure and design by analyzing social network
data at different levels of abstraction.
Multilevel Association Rule mining poses several challenges, including high dimensionality,
large data set size, and scalability issues.
High dimensionality
It is the problem of dealing with data sets that have a large number of attributes.
Large data set size
It is the problem of dealing with data sets that have a large number of records.
Scalability
It is the problem of dealing with data sets that are too large to fit into memory.
Conclusion
Multilevel Association Rule mining is a powerful technique that can be used to identify
relationships between items at different levels of granularity. It is an extension of Association
Rule mining that can discover patterns and trends that would otherwise be missed. Multilevel
Association Rule mining has several applications, including market basket analysis, medical
data analysis, and web usage mining.
However, Multilevel Association Rule mining also poses several challenges, including high
dimensionality, large data set size, and scalability issues. Future research directions in
Multilevel Association Rule mining include developing more efficient algorithms and addressing
these challenges.
In conclusion, Multilevel Association Rule mining is a powerful technique that can be used to
discover relationships between items at different levels of granularity. It has several
applications in various fields, but it also poses several challenges. As data sets continue to
grow in size and complexity, Multilevel Association Rule mining will become an increasingly
important tool for discovering hidden patterns in large data sets.
Unit-VI: Classification and Prediction
There are two forms of data analysis that can be used for extracting models describing
important classes or to predict future data trends. These two forms are as follows −
Classification
Prediction
Classification models predict categorical class labels; and prediction models predict continuous
valued functions. For example, we can build a classification model to categorize bank loan
applications as either safe or risky, or a prediction model to predict the expenditures in dollars
of potential customers on computer equipment given their income and occupation.
What is classification?
Following are the examples of cases where the data analysis task is Classification −
A bank loan officer wants to analyze the data in order to know which customers (loan
applicants) are risky and which are safe.
A marketing manager at a company needs to predict whether a customer with a given
profile will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the categorical
labels. These labels are risky or safe for loan application data and yes or no for marketing
data.
What is prediction?
Following are the examples of cases where the data analysis task is Prediction −
Suppose the marketing manager needs to predict how much a given customer will spend
during a sale at his company. In this example we need to predict a numeric value, so the data
analysis task is an example of numeric prediction. In this case, a model or predictor is
constructed that predicts a continuous-valued function, or ordered value.
Note − Regression analysis is a statistical methodology that is most often used for numeric
prediction.
With the help of the bank loan application that we have discussed above, let us understand
the working of classification. The data classification process includes two steps −
Building the classifier (learning step), in which a classification model is constructed from a
training set whose class labels are known.
Using the classifier (classification step), in which the model, after its accuracy has been
estimated on test data, is used to predict class labels for new data.
Note − Data can also be reduced by some other methods such as wavelet transformation,
binning, histogram analysis, and clustering.
Here are the criteria for comparing methods of classification and prediction −
Accuracy − Accuracy of a classifier refers to its ability to predict the class label
correctly; accuracy of a predictor refers to how well a given predictor can guess the
value of the predicted attribute for new data.
Speed − This refers to the computational cost of generating and using the classifier or
predictor.
Robustness − This refers to the ability of the classifier or predictor to make correct
predictions from noisy data.
Scalability − This refers to the ability to construct the classifier or predictor efficiently,
given a large amount of data.
Interpretability − This refers to the extent to which the classifier or predictor can be
understood.
○ Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems. It
is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
○ In a Decision tree, there are two types of nodes: the Decision Node and the Leaf Node.
Decision nodes are used to make decisions and have multiple branches, whereas Leaf
nodes are the outputs of those decisions and do not contain any further branches.
○ The decisions or the test are performed on the basis of features of the given dataset.
○ It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
○ It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
○ In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
○ A decision tree simply asks a question and, based on the answer (Yes/No), further
splits the tree into subtrees.
○ Below diagram explains the general structure of a decision tree:
Root Node: The root node is where the decision tree starts; it represents the entire dataset,
which then gets divided into subsets.
Leaf Node: Leaf nodes are the final output nodes; the tree cannot be segregated further
after reaching a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are
called the child nodes.
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the
root node of the tree. This algorithm compares the values of root attribute with the record (real
dataset) attribute and, based on the comparison, follows the branch and jumps to the next
node.
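This root-to-leaf classification can be sketched with a decision tree learner; the loan-applicant data and feature choice below are hypothetical:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical loan applicants as (income in $1000s, years employed);
# the tree learns if-then splits from labelled records, then classifies
# a new record by walking from the root node down to a leaf.
X = [[20, 1], [25, 2], [30, 1], [80, 10], [90, 8], [75, 12]]
y = ["risky", "risky", "risky", "safe", "safe", "safe"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.predict([[85, 9]])[0])  # a profile resembling the "safe" group
```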
Bayesian Classification:
In numerous applications, the connection between the attribute set and the class variable is
non-deterministic. In other words, the class label of a test record cannot be predicted with
certainty even though its attribute set is the same as that of some of the training examples.
Such circumstances may arise due to noisy data or the presence of certain confounding
factors that influence classification but are not included in the analysis. For example, consider
the task of predicting whether an individual is at risk of liver illness based on eating habits
and exercise. Although most people who eat healthily and exercise consistently have a lower
probability of liver disease, they may still be at risk due to other factors, such as consumption
of high-calorie street food or alcohol abuse. Determining whether an individual's eating
routine is healthy or their workout efficiency is sufficient is also subject to interpretation,
which in turn may introduce uncertainties into the learning problem.
Bayesian classification uses Bayes' theorem to predict the probability of an event. Bayesian
classifiers are statistical classifiers grounded in Bayesian probability. The theorem expresses
how a degree of belief, expressed as a probability, should be updated in the light of new
evidence.
Bayes' theorem is named after Thomas Bayes, who first used conditional probability to
provide an algorithm that uses evidence to calculate limits on an unknown parameter.
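A numeric sketch of how Bayes' theorem updates a degree of belief, using hypothetical probabilities for a diagnostic test:

```python
# Bayes' theorem: P(disease | positive) = P(positive | disease) * P(disease) / P(positive).
# All probabilities below are hypothetical.
p_disease = 0.01            # prior belief: 1% of people have the disease
p_pos_given_disease = 0.95  # test sensitivity
p_pos_given_healthy = 0.05  # false positive rate

# Total probability of a positive test (law of total probability).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

posterior = p_pos_given_disease * p_disease / p_pos
print(round(posterior, 3))  # ≈ 0.161: belief updated from 1% to about 16%
```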
Rule-based classifiers are another type of classifier which make the class decision
using various "if...else" rules. These rules are easily interpretable, so such
classifiers are generally used to generate descriptive models. The condition used
with "if" is called the antecedent, and the predicted class of each rule is called the
consequent. Properties of rule-based classifiers:
● Rules may be ordered, i.e. the class corresponding to the highest-priority
rule triggered is taken as the final class.
● Alternatively, the rules may remain unordered, and votes are assigned to
each class according to the rules' weights.
(Example: attributes of a mushroom dataset − Class, Cap Shape, Cap Surface, Bruises,
Odour, Stalk Shape, Population, Habitat)
Rules:
Classifying a record: the classification algorithm described below assumes that the rules
are unordered and the classes are weighted.

R <- set of rules generated from the training set
T <- test record
W <- class-name-to-weight mapping, predefined, given as input
F <- class-name-to-vote mapping, generated for each test record

for each rule r in R:
    check if r covers T
    if so, add W of predicted_class to F of predicted_class
end for
output the class with the highest calculated vote in F
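The weighted-vote procedure above can be sketched in Python as follows; the rules, weights, and records are hypothetical:

```python
# A sketch of the unordered, weighted-vote rule classifier.
# Each rule is (antecedent conditions, predicted class); all values are made up.
rules = [
    ({"odour": "foul"}, "poisonous"),
    ({"bruises": True}, "edible"),
    ({"habitat": "woods"}, "edible"),
]
weights = {"poisonous": 3.0, "edible": 1.0}   # W: class name -> weight

def classify(record):
    votes = {}                                 # F: class name -> accumulated vote
    for antecedent, predicted_class in rules:  # for each rule r in R
        covers = all(record.get(k) == v for k, v in antecedent.items())
        if covers:                             # if r covers the record
            votes[predicted_class] = votes.get(predicted_class, 0.0) + weights[predicted_class]
    return max(votes, key=votes.get) if votes else None

print(classify({"odour": "foul", "bruises": True, "habitat": "woods"}))  # "poisonous"
```

Because the rules are unordered, every covering rule contributes its class weight, and the class with the highest total vote wins.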
Introduction to SVM
Support vector machines (SVMs) are powerful yet flexible supervised machine learning
algorithms used for both classification and regression, though they are generally used for
classification problems. SVMs were first introduced in the 1960s and later refined in the
1990s. SVMs have a unique way of implementation compared to other machine learning
algorithms. Lately, they have become extremely popular because of their ability to handle
multiple continuous and categorical variables.
Working of SVM
The main goal of SVM is to divide the dataset into classes by finding a maximum marginal
hyperplane (MMH), which is done in the following two steps −
First, SVM generates hyperplanes iteratively that segregate the classes in the best way.
Then, it chooses the hyperplane that separates the classes correctly with the largest margin.
For implementing SVM in Python − We will start with the standard libraries import as
follows −
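The import block the text refers to does not appear here; a plausible reconstruction of the standard imports used by the examples that follow:

```python
# Standard imports for the SVM examples below (a plausible reconstruction).
import numpy as np                   # numeric arrays
import matplotlib.pyplot as plt      # plotting decision boundaries
from sklearn import svm, datasets    # SVM classifier and the iris dataset
```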
SVM Kernels
In practice, the SVM algorithm is implemented with a kernel that transforms the input data
space into the required form. SVM uses a technique called the kernel trick, in which the kernel
takes a low-dimensional input space and transforms it into a higher-dimensional space. In
simple words, the kernel converts non-separable problems into separable problems by adding
more dimensions. This makes SVM more powerful, flexible, and accurate. The following are
some of the types of kernels used by SVM.
Linear Kernel
It can be used as a dot product between any two observations. The formula of the linear
kernel is as below −
K(x, xi) = sum(x * xi)
From the formula, we can see that the product between two vectors x and xi is the sum of
the multiplication of each pair of input values.
Polynomial Kernel
It is a more generalized form of the linear kernel and can distinguish curved or nonlinear
input spaces. Following is the formula for the polynomial kernel −
K(x, xi) = (1 + sum(x * xi))^d
Here d is the degree of the polynomial, which we need to specify manually in the learning
algorithm.
The RBF kernel, mostly used in SVM classification, maps the input space into an
infinite-dimensional space. The following formula explains it mathematically −
K(x, xi) = exp(-gamma * sum((x − xi)^2))
Here, gamma ranges from 0 to 1. We need to specify it manually in the learning algorithm. A
good default value of gamma is 0.1.
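The three kernel formulas above can be written as plain functions for illustration; the gamma and degree values are just examples, and the polynomial form used is the common (1 + x·xi)^d variant:

```python
import numpy as np

# The three SVM kernel formulas written as plain functions (illustrative values).
def linear_kernel(x, xi):
    return np.dot(x, xi)                      # sum of pairwise products

def polynomial_kernel(x, xi, d=3):
    return (1 + np.dot(x, xi)) ** d           # d = polynomial degree

def rbf_kernel(x, xi, gamma=0.1):
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(xi)) ** 2))

x, xi = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(linear_kernel(x, xi))        # 1*2 + 2*0 = 2.0
print(polynomial_kernel(x, xi))    # (1 + 2)^3 = 27.0
print(round(rbf_kernel(x, xi), 4))
```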
As we implemented SVM for linearly separable data, we can implement it in Python for the
data that is not linearly separable. It can be done by using kernels.
Example
The following is an example for creating an SVM classifier by using kernels. We will be using
iris dataset from scikit-learn −
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

iris = datasets.load_iris()
X = iris.data[:, :2]   # use only the first two features (sepal length and width)
y = iris.target
Next, we will train a classifier and plot the SVM boundaries with the original data as follows −
C = 1.0  # SVM regularization parameter
svc_classifier = svm.SVC(kernel='linear', C=C).fit(X, y)

# Build a mesh over the feature space to draw the decision regions
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = (x_max - x_min) / 100
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
X_plot = np.c_[xx.ravel(), yy.ravel()]

Z = svc_classifier.predict(X_plot)
Z = Z.reshape(xx.shape)
plt.figure(figsize=(15, 5))
plt.subplot(121)
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
Output
For creating an SVM classifier with the RBF kernel, we can change the kernel to rbf as follows −
svc_classifier = svm.SVC(kernel='rbf', gamma='auto', C=C).fit(X, y)
Z = svc_classifier.predict(X_plot)
Z = Z.reshape(xx.shape)
plt.figure(figsize=(15, 5))
plt.subplot(121)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
Output
Data mining is the process of discovering and extracting hidden patterns from different types
of data to support decision making. Associative classification is a common classification
learning method in data mining which applies association rule detection methods to create
classification models.
Association rule learning is a machine learning method for discovering interesting relationships
between variables in large databases. It is designed to detect strong rules in the database
based on some interesting metrics. For any given multi-item transaction, association rules aim
to obtain rules that determine how or why certain items are linked.
Association rules are created by searching for common if-then patterns and using the criteria of support and confidence to identify the most important relationships. Support indicates how frequently an itemset appears in the data, while confidence is the proportion of transactions for which the if-then statement is found to be true. A third criterion, called lift, is often used to compare the observed confidence with the confidence that would be expected if the items were independent; a lift greater than 1 means the items occur together more often than chance would predict. Association rules are computed over itemsets built from two or more items, and usually consist of rules that are well represented by the data.
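The three metrics can be computed directly from a handful of transactions; the toy shopping data and the rule {bread} → {butter} below are illustrative assumptions, not from the text.

```python
# Toy transactions (assumed for illustration).
transactions = [
    {'bread', 'butter', 'milk'},
    {'bread', 'butter'},
    {'bread', 'milk'},
    {'milk'},
    {'bread', 'butter', 'eggs'},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(1 for t in transactions if itemset <= t) / n

antecedent, consequent = {'bread'}, {'butter'}
supp = support(antecedent | consequent)   # P(bread and butter) = 3/5
conf = supp / support(antecedent)         # P(butter | bread) = 0.6/0.8
lift = conf / support(consequent)         # observed vs. expected confidence

print(supp, conf, lift)  # 0.6 0.75 1.25
```

A lift of 1.25 here says butter appears with bread 25% more often than it would if the two items were independent.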
There are different types of data mining techniques that can be used to find out the specific
analysis and result like Classification analysis, Clustering analysis, and multivariate analysis.
Association rules are mainly used to analyze and predict customer behavior.
● In Classification analysis, it is mostly used to answer questions, make decisions, and predict behavior.
● In Clustering analysis, it is mainly used when no assumptions are made about possible relationships in the data.
● In Regression analysis, it is used when we want to predict a continuous dependent value from a set of independent variables.
We set the value of gamma to 'auto', but you can also provide a value between 0 and 1.
SVM classifiers offer great accuracy and work well in high-dimensional spaces. SVM classifiers use only a subset of the training points (the support vectors), and as a result use very little memory. However, they have a high training time, and hence in practice are not suitable for large datasets. Another disadvantage is that SVM classifiers do not work well with overlapping classes.
Lazy learning is a type of machine learning that doesn't process training data until it needs to
make a prediction. Instead of building models during training, lazy learning algorithms wait until
they encounter a new query. This method stores and compares training examples when
making predictions. It's also called instance-based or memory-based learning.
Lazy learning algorithms work by memorizing the training data rather than constructing a
general model. When a new query is received, lazy learning retrieves similar instances from
the training set and uses them to generate a prediction. The similarity between instances is
usually calculated using distance metrics, such as Euclidean distance or cosine similarity.
One of the most popular lazy learning algorithms is the k-nearest neighbors (k-NN) algorithm.
In k-NN, the k closest training instances to the query point are considered, and their class
labels are used to determine the class of the query. Lazy learning methods excel in situations
where the underlying data distribution is complex or where the training data is noisy.
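The lazy-learning idea described above can be sketched in a few lines: no model is built up front, and a prediction is produced only when a query arrives, by comparing it against the stored instances. The toy points and k value below are illustrative assumptions.

```python
import numpy as np

# Stored training instances (lazy learning memorizes these as-is).
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])

def knn_predict(query, k=3):
    # Euclidean distance from the query to every stored instance.
    dists = np.linalg.norm(X_train - query, axis=1)
    # Labels of the k closest instances; majority vote decides the class.
    nearest = y_train[np.argsort(dists)[:k]]
    return np.bincount(nearest).argmax()

print(knn_predict(np.array([1.1, 0.9])))  # 0
print(knn_predict(np.array([5.1, 5.1])))  # 1
```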
What are the Benefits of Lazy Learning?
● Adaptability. Lazy learning algorithms can adapt quickly to new or changing data. Since
the learning process happens at prediction time, they can incorporate new instances
without requiring complete retraining of the model.
● Robustness to outliers. Lazy learning algorithms are less affected by outliers compared
to eager learning methods. Outliers have less influence on predictions because they are
not used during the learning phase.
● Flexibility. When it comes to handling complex data distributions and nonlinear
relationships, lazy learning algorithms are effective. They can capture intricate decision
boundaries by leveraging the information stored in the training instances.
K-Nearest Neighbors (KNN): KNN is a simple and intuitive classification algorithm. It works by
finding the k-nearest data points in the feature space and assigning the class label based on
the majority class among its neighbors.
Gaussian Naive Bayes: Gaussian Naive Bayes is an extension of the Naive Bayes algorithm. It
assumes that the features follow a Gaussian (normal) distribution, and it estimates the
parameters (mean and variance) of each class to make predictions.
Kernel Methods: Kernel methods, such as Kernel Support Vector Machines (SVMs) and Kernel
Logistic Regression, are used for nonlinear classification tasks. They transform the input data
into a higher-dimensional space using a kernel function, allowing them to learn complex
decision boundaries.
Ensemble Methods: Ensemble methods combine multiple base classifiers to improve the
overall classification performance. Examples include AdaBoost, Gradient Boosting Machines
(GBM), Random Forests, and Stacked Generalization (Stacking).
Neural Networks Variants: Besides traditional feedforward neural networks, there are various
neural network architectures specifically designed for classification tasks. These include
Convolutional Neural Networks (CNNs) for image classification, Recurrent Neural Networks
(RNNs) for sequential data, and Transformer models for natural language processing tasks.
Decision Boundary Estimation Methods: Techniques such as Gaussian Processes (GP) and
Generative Adversarial Networks (GANs) can be used for estimating decision boundaries in
classification tasks. GANs, for example, can generate synthetic data points to help better
understand the distribution of classes in the feature space.
Anomaly Detection Methods: While primarily used for anomaly detection, methods like
One-Class SVMs and Isolation Forests can also be adapted for binary classification tasks
where one class is significantly smaller than the other.
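A few of the classifier families described above can be tried side by side with scikit-learn; the dataset, train/test split, and hyperparameters below are illustrative assumptions, not prescribed by the text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# One representative per family: instance-based, probabilistic, ensemble.
models = {
    'k-NN': KNeighborsClassifier(n_neighbors=5),
    'Gaussian NB': GaussianNB(),
    'Random Forest': RandomForestClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```

Exact scores depend on the split, but all three typically classify iris well; the comparison pattern is the point.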
Prediction:
Prediction is a fundamental concept in machine learning and data science, where the goal is to
make informed guesses about unknown or future outcomes based on available data.
Predictive modeling involves building a mathematical model that captures patterns and
relationships in the data, which can then be used to make predictions on new or unseen data
points.
Problem Definition: The first step in prediction is defining the problem you want to solve.
This includes determining what you want to predict (the target variable) and what
features or predictors are available to make those predictions.
Data Collection and Preprocessing: Next, you collect relevant data that contains
information about both the predictors and the target variable. This data may come from
various sources such as databases, APIs, or manual data collection. Once collected,
the data needs to be preprocessed, which may involve tasks like cleaning missing
values, encoding categorical variables, and scaling numerical features.
Feature Engineering: Feature engineering is the process of creating new features or
transforming existing ones to improve the performance of the predictive model. This
may include techniques like feature scaling, dimensionality reduction, or creating
interaction terms between features.
Model Selection: Once the data is prepared, you choose an appropriate machine
learning algorithm or model to train on the data. The choice of model depends on
factors such as the nature of the problem (classification, regression, etc.), the size and
complexity of the data, and the interpretability requirements.
Model Training: In this step, you use the prepared data to train the selected model.
During training, the model learns the underlying patterns and relationships in the data
by adjusting its parameters to minimize a predefined loss function.
Model Evaluation: After training, the model's performance is evaluated using evaluation
metrics appropriate for the problem at hand. For classification tasks, metrics like
accuracy, precision, recall, and F1-score are commonly used, while for regression tasks,
metrics like mean squared error (MSE) or R-squared are used.
Prediction: Once the model is trained and evaluated, it can be used to make predictions
on new or unseen data points. These predictions provide insights into the likely
outcomes or values of the target variable based on the available information.
Model Deployment: Finally, if the model performs well and meets the desired criteria, it
can be deployed into production to make real-time predictions. This involves integrating
the model into the existing software infrastructure and monitoring its performance over
time.
Throughout this process, it's essential to iterate and refine the model based on feedback and
new data to ensure that it continues to make accurate predictions. Additionally, ethical
considerations such as fairness, transparency, and privacy should be taken into account when
making predictions that impact individuals or society.
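The workflow above can be sketched end to end with scikit-learn; the dataset, preprocessing, model choice, and metric here are illustrative assumptions, not the only valid choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data collection: a built-in dataset stands in for real collected data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Preprocessing + model selection: feature scaling, then logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Model training.
model.fit(X_train, y_train)

# Model evaluation on held-out data.
print('accuracy:', accuracy_score(y_test, model.predict(X_test)))

# Prediction on new, unseen points (here: the first two test rows).
print('predictions:', model.predict(X_test[:2]))
```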
Out of the nine data mining models in SQL Server, three of them can be considered classification models. The classification models are Naïve Bayes, Decision Trees, and Neural Network. Though logistic regression is a regression technique, it can be used for a classification problem as well. Since you have four models as a solution for the classification problem, we need to look at which algorithm should be selected. Obviously, you need to select the most accurate data mining model. To evaluate which algorithm to use, an accuracy test should be done.
Let us create four simple models using the Naïve Bayes, Decision Trees, Logistic Regression, and Neural Network algorithms to measure accuracy in data mining.
Unit-VII: Cluster Analysis
INTRODUCTION:
Cluster analysis, also known as clustering, is a method of data mining that groups
similar data points together. The goal of cluster analysis is to divide a dataset into
groups (or clusters) such that the data points within each group are more similar to
each other than to data points in other groups. This process is often used for
exploratory data analysis and can help identify patterns or relationships within the
data that may not be immediately obvious. There are many different algorithms used
for cluster analysis, such as k-means, hierarchical clustering, and density-based
clustering. The choice of algorithm will depend on the specific requirements of the
analysis and the nature of the data being analyzed.
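As a minimal sketch of the k-means algorithm mentioned above, scikit-learn can cluster a handful of toy 2-D points; the data and cluster count are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two visually obvious groups of points (assumed toy data).
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # each group of three shares one label
print(kmeans.cluster_centers_)  # one centroid near each group
```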
In this method, first a cluster is made and then merged with another cluster (the most similar and closest one) to form one larger cluster. This process is repeated until all subjects are in one cluster. This particular method is known as the Agglomerative method. Agglomerative clustering starts with single objects and progressively groups them into clusters.
The divisive method is another kind of Hierarchical method in which clustering starts with the
complete data set and then starts dividing into partitions.
Centroid-based Clustering
Distribution-based Clustering
Density-based Clustering
In this type of clustering, clusters are defined by areas of density that are higher than those in the rest of the data set. Objects in sparse areas are usually required to separate clusters. The objects in these sparse areas are usually noise and border points in the graph. The most popular method in this type of clustering is DBSCAN.
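A minimal DBSCAN sketch with scikit-learn, on assumed toy data: two dense groups plus one far-away point that DBSCAN marks as noise (label -1). The eps and min_samples values are illustrative choices.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.0],   # dense group 1
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.0],   # dense group 2
              [20.0, 20.0]])                        # isolated noise point

# Points within eps of each other (with at least min_samples neighbors)
# form clusters; the isolated point gets the noise label -1.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1 -1]
```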
–Partitioning Methods:
In this method, let us say that “m” partitions are made from the “p” objects of the database. Each partition represents a cluster, where m < p, and m is the number of groups after the classification of objects. There are some requirements which need to be satisfied by this Partitioning Clustering Method: –
There are some points which should be remembered about this type of Partitioning Clustering Method:
1. An initial partitioning is created once we specify the number of partitions (say m).
2. There is a technique called iterative relocation, which means an object may be moved from one group to another to improve the partitioning.
Among the many types of clustering in data mining, in the hierarchical clustering method the given set of data objects is arranged into a hierarchical decomposition. The way this hierarchical decomposition is formed determines the purpose of the classification. There are two approaches for creating the hierarchical decomposition, which are: –
1. Divisive Approach
Another name for the Divisive approach is the top-down approach. At the beginning of this method, all the data objects are kept in one cluster. Smaller clusters are created by repeatedly splitting the groups. The iteration keeps going until the termination condition is met. A split or merge cannot be undone afterwards, which is why this method is not so flexible.
2. Agglomerative Approach
Another name for this approach is the bottom-up approach. All the objects start out as separate groups. The algorithm then keeps merging the closest groups until all the groups are merged into one, or the termination condition is met.
There are two approaches which can be used to improve the Hierarchical Clustering Quality in
Data Mining which are: –
1. One should carefully analyze the linkages of the object at every partitioning of
hierarchical clustering.
2. One can integrate hierarchical agglomeration with other clustering approaches. In this approach, first the objects are grouped into micro-clusters; after grouping the data objects into micro-clusters, macro-clustering is performed on the micro-clusters.
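The bottom-up (agglomerative) approach can be sketched with scikit-learn's AgglomerativeClustering; the toy points and cluster count below are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Each object starts as its own cluster; the closest clusters are merged
# repeatedly until only the requested number of clusters remains.
X = np.array([[1.0, 1.0], [1.2, 1.1],
              [4.0, 4.0], [4.1, 4.2], [4.2, 4.0]])
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
print(labels)  # the two left points share one label, the three right points the other
```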
In this method of clustering in Data Mining, density is the main focus. The notion of density is used as the basis for this clustering method. In this clustering method, a cluster keeps growing as long as the neighborhood of each data point, within a given radius, contains at least a minimum number of points.
In the Grid-Based Clustering Method, a grid is formed over the objects. A grid structure is formed by quantizing the object space into a finite number of cells.
1. Faster processing time: the processing time of this method is much quicker than that of other methods, and thus it can save time.
2. This method depends only on the number of cells into which each dimension of the space is quantized.
In this type of clustering method, a model is hypothesized for every cluster in order to find the data that best fits the model. The density function is clustered to locate the groups in this method.
Clustering of high-dimensional data returns groups of objects as clusters. Performing cluster analysis on high-dimensional data requires grouping similar types of objects together, but the high-dimensional data space is huge and has complex data types and attributes. A major challenge is that we need to find the set of attributes that are relevant in each cluster, because a cluster is defined and characterized based on the attributes present in it. When clustering high-dimensional data, we therefore need to search both for the clusters and for the subspaces in which they exist.
7.3 Outliers in Data Mining
An outlier is a data object that deviates significantly from the rest of the data objects and behaves in a different manner. Outliers can be caused by measurement or execution errors. The analysis of outlier data is referred to as outlier analysis or outlier mining.
An outlier cannot simply be dismissed as noise or error. Instead, outliers are suspected of not being generated by the same mechanism as the rest of the data objects.
Outliers are of three types, namely –
1. Global (or Point) Outliers
2. Collective Outliers
3. Contextual (or Conditional) Outliers
1. Global Outliers
1. Definition: Global outliers are data points that deviate significantly from the overall
distribution of a dataset.
2. Collective Outliers
1. Definition: Collective outliers are groups of data points that collectively deviate significantly
from the overall distribution of a dataset.
3. Contextual Outliers
1. Definition: Contextual outliers are data points that deviate significantly from the expected
behavior within a specific context or subgroup.
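As a minimal sketch of detecting global (point) outliers, one common approach is the z-score test; the data and the cutoff of 2 standard deviations below are illustrative assumptions for this small sample.

```python
import numpy as np

# Toy measurements with one value far from the overall distribution.
data = np.array([10.0, 11.0, 10.5, 9.8, 10.2, 10.7, 9.9, 55.0])

# Z-score: distance from the mean in units of standard deviation.
z = (data - data.mean()) / data.std()

# Flag values more than 2 standard deviations from the mean.
outliers = data[np.abs(z) > 2]
print(outliers)  # [55.]
```

Contextual and collective outliers need more structure (a context variable, or a model of normal group behavior) and cannot be caught by a single global threshold like this.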