The major reason that data mining has attracted a great deal of attention in the
information industry in recent years is the wide availability of huge amounts of data and
the imminent need for turning such data into useful information and knowledge. The
information and knowledge gained can be used for applications ranging from business
management, production control, and market analysis to engineering design and
science exploration.
[Figure: The evolution of database technology (1970s - early 1980s)]
Data mining refers to "extracting" or "mining" knowledge from large amounts of data.
There are many other terms related to data mining, such as knowledge mining,
knowledge extraction, data/pattern analysis, data archaeology, and data dredging.
Many people treat data mining as a synonym for another popularly used term,
"Knowledge Discovery in Databases", or KDD.
Data mining is a technical methodology for detecting information in huge data sets.
The main objective of data mining is to identify patterns, trends, or rules that
explain data behavior in context. The data mining method uses mathematical
analysis to deduce patterns and trends that were not discoverable through the old
methods of data exploration. Data mining is a handy and extremely convenient
methodology when it comes to dealing with huge volumes of data. In this article, we
explore the data mining functionalities that are used to specify the kinds of
patterns to be found in data sets.
Data mining functionalities are used to represent the types of patterns that have to be
discovered in data mining tasks. In general, data mining tasks can be classified into two types:
descriptive and predictive. Descriptive mining tasks characterize the general properties of
the data in the database, while predictive mining tasks perform inference on the current
data in order to make predictions.
There are various data mining functionalities, which are as follows −
Data characterization − It is a summarization of the general characteristics of an object class
of data. The data corresponding to the user-specified class is generally collected by a database
query. The output of data characterization can be presented in multiple forms.
Data discrimination − It is a comparison of the general characteristics of target-class data
objects with the general characteristics of objects from one or a set of contrasting classes. The
target and contrasting classes can be specified by the user, and the corresponding data objects
are fetched through database queries.
Association Analysis − It analyses the set of items that generally occur together in a
transactional dataset. Two parameters are used for determining the association rules
(a small computation is sketched below) −
Support, which identifies how frequently the item set occurs in the database.
Confidence, which is the conditional probability that an item occurs in a transaction
given that another item occurs.
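As a rough illustration, the following Python sketch computes support and confidence for one candidate rule over a toy list of transactions (the items and the rule {milk} => {bread} are invented for this example):

    # Sketch: computing support and confidence for an association rule
    # over a toy transactional dataset (illustrative data only).
    transactions = [
        {"milk", "bread", "butter"},
        {"milk", "bread"},
        {"bread", "eggs"},
        {"milk", "eggs"},
    ]

    antecedent, consequent = {"milk"}, {"bread"}

    n = len(transactions)
    # Transactions containing both sides of the rule.
    both = sum(1 for t in transactions if antecedent | consequent <= t)
    # Transactions containing the antecedent.
    ante = sum(1 for t in transactions if antecedent <= t)

    support = both / n          # P(milk and bread) = 0.50
    confidence = both / ante    # P(bread | milk)   = 0.67
    print(f"support={support:.2f}, confidence={confidence:.2f}")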
Classification − Classification is the process of finding a model that describes and
distinguishes data classes or concepts, for the purpose of being able to use the model to
predict the class of objects whose class label is unknown. The derived model is based
on the analysis of a set of training data (i.e., data objects whose class label is known).
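A minimal sketch of this idea, assuming scikit-learn is available; the tiny feature matrix and labels are invented for illustration:

    # Sketch: deriving a classification model from labeled training data,
    # then predicting the class of an object whose label is unknown.
    from sklearn.tree import DecisionTreeClassifier

    # Training data: objects whose class labels are known (toy values
    # of [age, income] and a hypothetical "buys_computer" label).
    X_train = [[25, 30000], [47, 82000], [35, 61000], [52, 110000]]
    y_train = ["no", "yes", "no", "yes"]

    model = DecisionTreeClassifier().fit(X_train, y_train)

    # Predict the class of a new object whose label is unknown.
    print(model.predict([[40, 75000]]))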
Prediction − It is used to predict unavailable data values or pending trends. An object can
be anticipated based on the attribute values of the object and the attribute values of the classes.
It can be a prediction of missing numerical values or of increase/decrease trends in
time-related data.
Clustering − It is similar to classification, but the classes are not predefined. The classes are
derived from the data attributes. It is unsupervised learning. The objects are clustered or
grouped based on the principle of maximizing the intraclass similarity and minimizing the
interclass similarity.
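As a hedged sketch (again assuming scikit-learn; the 2-D points are invented), clustering groups objects without predefined class labels:

    # Sketch: grouping unlabeled objects so that points in the same
    # cluster are similar and points in different clusters are not.
    from sklearn.cluster import KMeans

    points = [[1, 2], [1, 4], [1, 0],      # toy 2-D objects
              [10, 2], [10, 4], [10, 0]]

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print(kmeans.labels_)   # cluster id assigned to each object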
Outlier analysis − Outliers are data elements that cannot be grouped into a given class or
cluster. These are the data objects whose behaviour differs from the general behaviour
of other data objects. The analysis of this type of data can be essential for mining knowledge.
Evolution analysis − It describes the trends for objects whose behaviour changes over
time.
Classification of DM Systems –
DM task primitives
Data Mining Primitives:
A data mining task can be specified in the form of a data mining query, which is input to the data
mining system. A data mining query is defined in terms of data mining task primitives. These
primitives allow the user to interactively communicate with the data mining system during the
discovery of knowledge.
The data mining task primitives include the following:
Task-relevant data
Kind of knowledge to be mined
Background knowledge
Interestingness measures
Presentation for visualizing the discovered patterns
Task-relevant data
This specifies the portions of the database or the set of data in which the user is interested.
This includes the database attributes or data warehouse dimensions of interest (referred to as the
relevant attributes or dimensions).
The kind of knowledge to be mined
This specifies the data mining functions to be performed, such as characterization, discrimination,
association or correlation analysis, classification, prediction, clustering, outlier analysis, or
evolution analysis.
The background knowledge to be used in the discovery process
Knowledge about the domain is useful for guiding the knowledge discovery process and for
evaluating the patterns found. Concept hierarchies are a popular form of background
knowledge, which allow data to be mined at multiple levels of abstraction.
An example of a concept hierarchy for the attribute (or dimension) age is discussed below.
User beliefs regarding relationships in the data are another form of background knowledge.
The interestingness measures and thresholds for pattern evaluation:
Different kinds of knowledge may have different interestingness measures.
For example, interestingness measures for association rules include support and confidence.
Rules whose support and confidence values are below user-specified thresholds are considered
uninteresting.
The expected representation for visualizing the discovered patterns: This refers to the form in
which discovered patterns are to be displayed, which may include rules, tables, charts, graphs,
decision trees, and cubes.
A data mining query language can be designed to incorporate these primitives, allowing users to
interact with data mining systems flexibly. Having a data mining query language provides a
foundation on which user-friendly graphical interfaces can be built.
Designing a comprehensive data mining language is challenging because data mining covers a
wide spectrum of tasks, from data characterization to evolution analysis, and each task has
different requirements. The design of an effective data mining query language requires a deep
understanding of the power, limitations, and underlying mechanisms of the various kinds of data
mining tasks. Such a language also facilitates a data mining system's communication with other
information systems and its integration with the overall information processing environment.
This specifies the portions of the database or the set of data in which the user is interested. This
includes the database attributes or data warehouse dimensions of interest (the relevant attributes
or dimensions).
In a relational database, the set of task-relevant data can be collected via a relational query
involving operations like selection, projection, join, and aggregation.
The data collection process results in a new data relation called the initial data relation. The
initial data relation can be ordered or grouped according to the conditions specified in the query.
This data retrieval can be thought of as a subtask of the data mining task.
This initial relation may or may not correspond to a physical relation in the database. Since virtual
relations are called views in the field of databases, the set of task-relevant data for data mining is
called a minable view.
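As a hedged sketch of how such a minable view might be materialized (using pandas rather than SQL; the tables and column names are invented):

    # Sketch: collecting task-relevant data via selection, projection,
    # and join, producing an "initial data relation" (minable view).
    import pandas as pd

    customers = pd.DataFrame({"cust_id": [1, 2, 3],
                              "age": [23, 41, 35],
                              "city": ["Pune", "Delhi", "Pune"]})
    orders = pd.DataFrame({"cust_id": [1, 1, 2, 3],
                           "amount": [120.0, 80.0, 300.0, 45.0]})

    minable_view = (
        customers.merge(orders, on="cust_id")        # join
                 .loc[lambda df: df.city == "Pune"]  # selection
                 [["cust_id", "age", "amount"]]      # projection
    )
    print(minable_view)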
This specifies the data mining functions to be performed, such as characterization, discrimination,
association or correlation analysis, classification, prediction, clustering, outlier analysis, or
evolution analysis.
This knowledge about the domain to be mined is useful for guiding the knowledge discovery
process and evaluating the patterns found. Concept hierarchies are a popular form of
background knowledge, which allow data to be mined at multiple levels of abstraction.
An example of a concept hierarchy for the attribute (or dimension) age is shown below. User
beliefs regarding relationships in the data are another form of background knowledge.
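Since the original figure is not reproduced here, a hedged sketch of such a hierarchy follows (the level names and age ranges are illustrative assumptions):

    # Sketch: a concept hierarchy for the attribute "age", mapping raw
    # values up to higher-level concepts for multi-level mining.
    age_hierarchy = {
        "youth":       range(0, 25),    # 0-24
        "middle_aged": range(25, 60),   # 25-59
        "senior":      range(60, 120),  # 60+
    }

    def generalize(age):
        """Replace a raw age value with its higher-level concept."""
        for concept, span in age_hierarchy.items():
            if age in span:
                return concept
        return "unknown"

    print([generalize(a) for a in (13, 30, 72)])
    # ['youth', 'middle_aged', 'senior']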
Different kinds of knowledge may have different interestingness measures. They may be used to
guide the mining process or, after discovery, to evaluate the discovered patterns. For example,
interestingness measures for association rules include support and confidence. Rules whose support
and confidence values are below user-specified thresholds are considered uninteresting.
This refers to the form in which discovered patterns are to be displayed, which may include
rules, tables, cross tabs, charts, graphs, decision trees, cubes, or other visual representations.
Users must be able to specify the forms of presentation to be used for displaying the discovered
patterns. Some representation forms may be better suited than others for particular kinds of
knowledge.
For example, generalized relations and their corresponding cross tabs or pie/bar charts are good
for presenting characteristic descriptions, whereas decision trees are common for classification.
A data mining system should be integrated with a database or data warehouse system so that it
can perform its tasks effectively. A data mining system operates in an environment that requires
it to communicate with other data systems, such as a database system. The possible schemes for
integrating these systems are as follows −
No coupling − No coupling means that a data mining system does not use any function of a
database or data warehouse system. It may retrieve data from a particular source (such as a file
system), process the data using some data mining algorithm, and then store the mining results
in another file.
Such a system, though simple, suffers from several limitations. First, a database system
offers a great deal of flexibility and efficiency in storing, organizing, accessing, and processing
data. Without using a database/data warehouse system, a data mining system may spend a
substantial amount of time finding, collecting, cleaning, and transforming data.
Loose coupling − In loose coupling, the data mining system uses some services of a database or
data warehouse system. The data is fetched from a data repository handled by these systems. Data
mining approaches are used to process the data, and the processed data is then saved either in a
file or in a designated area in a database or data warehouse. Loose coupling is better than no
coupling because it can fetch any portion of the data stored in databases by using query processing
or other system facilities.
Semitight coupling − In semitight coupling, the efficient execution of a few essential data mining
primitives can be supported in the database/data warehouse system. These primitives can include
sorting, indexing, aggregation, histogram analysis, multi-way join, and pre-computation of some
important statistical measures, such as sum, count, max, min, and standard deviation.
Tight coupling − Tight coupling means that a data mining system is smoothly integrated into
the database/data warehouse system. The data mining subsystem is treated as one functional
component of the information system.
Data mining queries and functions are optimized based on mining query analysis, the data
structures, indexing schemes, and query processing methods of the database/data warehouse
system. Tight coupling is highly desirable because it supports the efficient implementation of
data mining functions, high system performance, and an integrated data processing environment.
Issues in DM
KDD Process
KDD- Knowledge Discovery in Databases
The term KDD stands for Knowledge Discovery in Databases. It refers to the broad procedure of
discovering knowledge in data and emphasizes the high-level applications of specific Data Mining
techniques. It is a field of interest to researchers in various fields, including artificial intelligence,
machine learning, pattern recognition, databases, statistics, knowledge acquisition for expert
systems, and data visualization.
The main objective of the KDD process is to extract information from data in the context of large
databases. It does this by using Data Mining algorithms to identify what is deemed knowledge.
The availability and abundance of data today make knowledge discovery and Data Mining a
matter of impressive significance and need. Given the recent growth of the field, it is not
surprising that a wide variety of techniques is now accessible to specialists and experts.
The KDD Process
The knowledge discovery process is iterative and interactive, comprising nine steps. The
process is iterative at each stage, implying that moving back to previous steps might be
required. The process has many imaginative aspects, in the sense that one cannot present a
single formula or a complete scientific categorization of the correct decisions for each step
and application type. Thus, it is necessary to understand the process and the different
requirements and possibilities at each stage.
The process begins with determining the KDD objectives and ends with the implementation of
the discovered knowledge. At that point, the loop is closed, and Active Data Mining starts.
Subsequently, changes would need to be made in the application domain, for example, offering
various features to cell phone users in order to reduce churn. This closes the loop, the impacts
are then measured on the new data repositories, and the KDD process begins again. Following
is a concise description of the nine-step KDD process, beginning with a managerial step:
1. Developing an Understanding of the Application Domain
This is the initial preliminary step. It sets the scene for understanding what should be done
with the various decisions (transformation, algorithms, representation, etc.). The individuals
who are in charge of a KDD venture need to understand and characterize the objectives of the
end-user and the environment in which the knowledge discovery process will occur (including
relevant prior knowledge).
2. Selecting and Creating a Data Set on Which Discovery Will Be Performed
Once the objectives are defined, the data that will be utilized for the knowledge discovery
process should be determined. This includes discovering what data is accessible, obtaining
additional necessary data, and then integrating all the data for knowledge discovery into one
data set, including the attributes that will be considered for the process. This step is important
because Data Mining learns and discovers from the available data, which is the evidence base
for building the models. If some significant attributes are missing, the entire study may be
unsuccessful; in this respect, the more attributes that are considered, the better. On the other
hand, organizing, collecting, and operating advanced data repositories is expensive, so there is
a trade-off with the opportunity for best understanding the phenomena. This trade-off is one
aspect where the interactive and iterative nature of the KDD process takes place: one begins
with the best available data sets and later expands them and observes the impact in terms of
knowledge discovery and modeling.
3. Preprocessing and Cleansing
In this step, data reliability is improved. It incorporates data cleaning, for example, handling
missing values and removing noise or outliers. It might involve complex statistical techniques
or the use of a Data Mining algorithm in this context. For example, when one suspects that a
specific attribute is of insufficient reliability or has many missing values, this attribute could
become the target of a supervised Data Mining algorithm: a prediction model is built for the
attribute, and the missing values can then be predicted. The extent to which one pays attention
to this step depends on many factors. Regardless, studying these aspects is important and is
often revealing in itself with respect to enterprise data systems.
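A minimal sketch of this cleaning step, assuming pandas is available (the column names and values are invented):

    # Sketch: improving data reliability by imputing missing values
    # and capping an obvious outlier in a toy table.
    import pandas as pd

    df = pd.DataFrame({"age": [23, None, 41, 35, None],
                       "income": [40e3, 52e3, 9e9, 61e3, 48e3]})

    # Fill missing ages with the column median.
    df["age"] = df["age"].fillna(df["age"].median())

    # Treat implausibly large incomes as noise and cap them.
    cap = df["income"].quantile(0.95)
    df["income"] = df["income"].clip(upper=cap)
    print(df)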
4. Data Transformation
In this stage, appropriate data for Data Mining is prepared and developed. Techniques here
include dimension reduction (for example, feature selection and extraction, and record
sampling) and attribute transformation (for example, discretization of numerical attributes
and functional transformations). This step can be essential for the success of the entire
KDD project, and it is typically very project-specific. For example, in medical assessments,
the quotient of attributes may often be the most significant factor, rather than each attribute
by itself. In business, we may need to consider effects beyond our control as well as efforts
and transient issues, for example, studying the cumulative effect of advertising. However,
even if we do not use the right transformation at the start, we may obtain a surprising effect
that hints at the transformation needed in the next iteration. Thus, the KDD process reflects
upon itself and leads to an understanding of the transformation required.
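As a hedged illustration of attribute transformation (the quotient feature and column names are invented to echo the medical example above):

    # Sketch: constructing a quotient attribute and discretizing a
    # numerical attribute as part of data transformation.
    import pandas as pd

    df = pd.DataFrame({"weight_kg": [58, 90, 75],
                       "height_m": [1.60, 1.78, 1.82]})

    # Quotient of attributes: often more informative than either alone.
    df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

    # Discretize the new attribute into conceptual levels.
    df["bmi_level"] = pd.cut(df["bmi"], bins=[0, 18.5, 25, 100],
                             labels=["low", "normal", "high"])
    print(df)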
6. Choosing the Data Mining Algorithm
Having the strategy, we now decide on the tactics. This stage includes selecting a specific
method to be used for searching patterns, possibly among multiple inducers. For example,
when considering precision versus understandability, the former is better with neural
networks, while the latter is better with decision trees. For each strategy of meta-learning,
there are several possibilities for how it can be carried out. Meta-learning focuses on
explaining what causes a Data Mining algorithm to succeed or fail on a specific problem.
Thus, this methodology attempts to understand the conditions under which a Data Mining
algorithm is most suitable. Each algorithm has parameters and strategies of learning, such
as ten-fold cross-validation or another division into training and testing sets.
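A hedged sketch of the ten-fold cross-validation mentioned above, assuming scikit-learn (the synthetic data is illustrative):

    # Sketch: comparing two candidate algorithms by ten-fold
    # cross-validation before committing to one.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=300, random_state=0)

    for model in (DecisionTreeClassifier(random_state=0),
                  MLPClassifier(max_iter=2000, random_state=0)):
        scores = cross_val_score(model, X, y, cv=10)
        print(type(model).__name__, scores.mean().round(3))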
7. Employing the Data Mining Algorithm
At last, the implementation of the Data Mining algorithm is reached. In this stage, we may need
to employ the algorithm several times until a satisfying result is obtained, for example, by
tuning the algorithm's control parameters, such as the minimum number of instances in a single
leaf of a decision tree.
8. Evaluation
In this step, we assess and interpret the mined patterns and rules with respect to the objectives
defined in the first step. Here we consider the preprocessing steps with respect to their impact
on the Data Mining algorithm's results, for example, adding a feature in step 4 and repeating
from there. This step focuses on the comprehensibility and utility of the induced model. In this
step, the identified knowledge is also recorded for further use. The last step is the use of, and
overall feedback on, the patterns and discovery results acquired by Data Mining.
9. Using the Discovered Knowledge
Now, we are ready to incorporate the knowledge into another system for further action. The
knowledge becomes active in the sense that we may make changes to the system and measure
the effects. The success of this step determines the effectiveness of the whole KDD process.
There are numerous challenges in this step, such as losing the "laboratory conditions" under
which we have worked. For example, the knowledge was discovered from a certain static
snapshot (usually a set of data), but now the data becomes dynamic. Data structures may change
(certain quantities become unavailable), and the data domain might be modified, for example,
an attribute may take a value that was not anticipated previously.
Data preprocessing is the process of transforming raw data into an understandable format. It
is an important step in data mining, as we cannot work with raw data. The quality of the
data should be checked before applying machine learning or data mining algorithms.
Preprocessing of data mainly involves checking data quality and smoothing noisy data. Noisy
data can be smoothed by the following techniques:
1. Binning:
This method smooths sorted data values by consulting their neighbourhood. The sorted
values are distributed into bins, and each value is replaced by the bin mean, bin median,
or the closest bin boundary (see the worked binning example later in this section).
2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression
used may be linear (having one independent variable) or multiple (having multiple
independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers may then be detected as
values that fall outside the clusters.
2. Data Integration:
This step merges data from multiple sources (databases, data cubes, or files) into a
coherent data store.
3. Data Transformation:
This step is taken in order to transform the data into forms suitable for the mining
process. It involves the following ways:
1. Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0
to 1.0).
Attribute construction (or feature construction), where new attributes are constructed and
added from the given set of attributes to help the mining process.
Min-max normalization performs a linear transformation on the original data. Suppose that min_A
and max_A are the minimum and maximum values of an attribute, A. Min-max normalization maps
a value, v, of A to v' in the range [new_min_A, new_max_A] by computing:
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
In z-score normalization (or zero-mean normalization), the values for an attribute, A, are
normalized based on the mean and standard deviation of A. A value, v, of A is normalized to v'
by computing:
v' = (v - mean_A) / stddev_A
Normalization by decimal scaling normalizes by moving the decimal point of values of attribute
A. The number of decimal points moved depends on the maximum absolute value of A. A value,
v, of A is normalized to v' by computing:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
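A small sketch of these three normalizations in plain Python (the sample values are invented):

    # Sketch: min-max, z-score, and decimal-scaling normalization.
    import statistics

    values = [200, 300, 400, 600, 1000]   # toy attribute values

    def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
        return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

    def z_score(v, mean, stdev):
        return (v - mean) / stdev

    def decimal_scale(v, j):
        return v / 10 ** j

    lo, hi = min(values), max(values)
    mean, stdev = statistics.mean(values), statistics.stdev(values)
    j = len(str(int(max(abs(v) for v in values))))  # here 10^j = 10000

    print([round(min_max(v, lo, hi), 3) for v in values])
    print([round(z_score(v, mean, stdev), 3) for v in values])
    print([decimal_scale(v, j) for v in values])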
2. Attribute Selection:
In this strategy, the attributes relevant to the mining task are selected from the given
set of attributes (and new attributes may be constructed from them, as noted above) to
help the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute by interval levels or
conceptual levels.
4. Data Reduction:
Data mining is a technique that is used to handle huge amounts of data. When working
with a huge volume of data, analysis becomes harder. In order to get rid of this problem,
we use the data reduction technique. It aims to increase storage efficiency and reduce
data storage and analysis costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
The aggregation operation is applied to the data in the construction of a data cube.
2. Attribute Subset Selection:
The highly relevant attributes should be used; the rest can be discarded.
3. Numerosity Reduction:
This enables the storage of a model of the data instead of the whole data, for example,
regression models.
4. Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy or lossless.
If the original data can be retrieved after reconstruction from the compressed data,
such reduction is called lossless reduction; otherwise, it is called lossy reduction. The
two effective methods of dimensionality reduction are wavelet transforms and
PCA (Principal Component Analysis).
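A hedged sketch of PCA-based dimensionality reduction, assuming scikit-learn (the synthetic data is illustrative):

    # Sketch: lossy dimensionality reduction with PCA, projecting
    # 10-dimensional records down to 2 principal components.
    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA

    X, _ = make_classification(n_samples=100, n_features=10,
                               random_state=0)

    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                      # (100, 2)
    print(pca.explained_variance_ratio_.sum())  # variance retained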
Binning method - Example
Given data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Step: 1
Partition into equal-depth [n=4]:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
Step: 2
Smoothing by bin means:
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
Binning method - Example (Cont..)
Given data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Step: 1
Partition into equal-depth [n=4]:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
Step: 2
Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
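A short sketch reproducing this example in Python (rounding bin means to integers, as the example does):

    # Sketch: equal-depth binning with smoothing by bin means and
    # by bin boundaries, reproducing the example above.
    data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
    depth = 4

    bins = [data[i:i + depth] for i in range(0, len(data), depth)]

    # Smoothing by bin means: every value becomes its bin's mean.
    by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

    # Smoothing by bin boundaries: every value snaps to the closer
    # of its bin's min and max.
    by_boundaries = [
        [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
        for b in bins
    ]

    print(by_means)       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
    print(by_boundaries)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]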
Data mining is the process of discovering interesting knowledge from large amounts
of data stored either in databases, data warehouses, or other information repositories.
Based on this view, the architecture of a typical data mining system may have the
following major components:
Database, data warehouse, or other information repository
Database or data warehouse server, responsible for fetching the relevant data and for
data cleaning, data integration, and filtering
Knowledge base
Data mining engine
Pattern evaluation module
User interface
There is a lot of confusion between data mining and data analysis. Data mining functions are used
to define the trends or correlations contained in data mining activities, while data analysis is used
to test statistical models that fit the dataset, for example, the analysis of a marketing campaign.
Data mining uses Machine Learning and mathematical and statistical models to discover patterns
hidden in the data. As noted earlier, data mining activities can be divided into two categories:
descriptive and predictive.
Data Discretization
Top-down Discretization -
The process starts by first finding one or a few points (called split points or cut
points) to split the entire attribute range, and then repeats this recursively on the
resulting intervals.
Bottom-up Discretization -
The process starts by considering all of the continuous values as potential split points,
removes some by merging neighbourhood values to form intervals, and then applies this
recursively to the resulting intervals.
Concept Hierarchies
A concept hierarchy for a numerical attribute defines a discretization of the attribute,
replacing low-level concepts (such as numerical values for age) with higher-level concepts
(such as youth, middle-aged, or senior).
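As a closing hedged sketch, a simple discretization of a numerical attribute into the conceptual levels of such a hierarchy, using pandas (the cut points and labels are illustrative assumptions):

    # Sketch: discretizing the numerical attribute "age" into the
    # conceptual levels of a small concept hierarchy.
    import pandas as pd

    ages = pd.Series([13, 22, 30, 45, 61, 72])

    levels = pd.cut(ages, bins=[0, 24, 59, 120],
                    labels=["youth", "middle_aged", "senior"])
    print(levels.tolist())
    # ['youth', 'youth', 'middle_aged', 'middle_aged', 'senior', 'senior']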