Ch 1- Introduction

Chapter One
Introduction to Data Mining
Shumet Tadesse
Department of Computer Science

College of Informatics
University of Gondar
February 2023
Shumet (Computer Science) Introduction February 2023 1 / 13

Motivation: Why data mining?
Our capacity to generate and collect data has increased
rapidly over the last several decades. Huge amounts of data are available
at our fingertips because of:
− The computerization of business & scientific transactions
− Advances in data collection tools, ranging from scanned texts & image
platforms to satellite remote sensors
− Popular use of WWW as a global information system
Although these developments have made it easier to create, collect, and
store all types of data, they have also resulted in a data explosion
(i.e., information overload).
Data explosion is the problem of an enterprise having huge amounts of
data stored in databases, data warehouses, and other information
repositories, generated by automated data collection tools.
As the volume of data increases, the proportion of it that people can
understand decreases substantially; as the data gets larger,
analyzing it becomes very difficult.
Motivation: Why data mining?
The true value is not in storing the data, but rather in our ability to
extract useful reports and to find interesting trends & correlations to
support decisions and policies made by businesses
− We are drowning in data, but starving for knowledge!
− Too much data & too little knowledge!
To bridge the gap between large volumes of data and the useful
information and knowledge needed for decision making, computerized
methods known as Data Mining (DM) or Knowledge Discovery in
Databases (KDD) have emerged.
Faced with enormous volumes of data, human analysts with no special
tools can no longer make sense of it.
− Data mining can automate the process of finding patterns &
relationships in raw data and the results can be utilized for decision
support.
If we know how to reveal valuable knowledge hidden in raw data, data
might be one of our most valuable assets.
− Data mining is a tool that involves retrospective analysis to extract
diamonds of knowledge from historical data & predict future
outcomes.
What is Data Mining?
Different scholars have given different definitions of DM
According to Berry and Linoff (2000) and Han and Kamber (2006), DM
is the process of extracting or mining knowledge from large amounts
of data in order to discover meaningful patterns and rules
DM is a technology that uses various techniques to discover hidden
information or patterns (i.e., non-trivial, novel, valid, understandable,
and useful) from data in large databases (e.g., a data warehouse)
The term DM is a misnomer, as it does not directly relate to what it
does
Data mining would be better described as knowledge mining from data
rather than data mining
− Anyway, we will use the term with this understanding
− Alternative names
Knowledge discovery from databases (KDD),
knowledge extraction,
data/pattern analysis,
information harvesting,
business intelligence, etc.
What is Data Mining?
Data mining is an interdisciplinary subfield of computer science and
statistics with an overall goal to extract information (with
intelligent methods) from a dataset and transform the
information into a comprehensible structure for further use.
Data Mining vs Machine Learning:
− The two differ from each other, although they have some commonalities
− Regarding their purpose: data mining is used to extract information
from data, while machine learning is used to teach the computer how to
learn from past experience without being explicitly programmed.

Data Mining Functionalities/Goals
Data mining functionalities are used to specify the kinds of patterns to
be found in a data mining task
Generally, data mining tasks can be broadly classified as
− Descriptive (unsupervised)
− Predictive (supervised)
Descriptive data mining tasks characterize the general properties of the
data in a database
Predictive data mining tasks perform inference on the current data in
order to make predictions about the future
− they permit the value of one variable to be predicted from the known
values of other variables

The supervised predictive data mining functionalities include
− Classification
− Regression
− Time series
− Prediction
The unsupervised descriptive data mining functionalities include
− Association rule discovery
− Cluster analysis
− Summarization
− Sequence discovery

Classification
− The DM system learns from examples (the data) how to partition or
classify the data, i.e., it formulates classification rules
− Example: customer database in a bank
Question: Is a new customer applying for a loan a good investment or
not?
Typical rule formulated: if STATUS = married and INCOME > 10000
and HOUSE OWNER = yes then INVESTMENT TYPE = good
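The rule above can be written directly as code. This is only a sketch: the function name, the sample applicants, and the "unknown" label for applicants the rule does not cover are assumptions for illustration, not part of the slide. A real DM system would learn such rules from the data rather than have them typed in by hand.

```python
def classify_loan_applicant(status: str, income: float, house_owner: bool) -> str:
    """Apply the single classification rule from the slide.

    Returns "good" when the rule fires. The slide gives no rule for the
    remaining cases, so they are labeled "unknown" (an assumption).
    """
    if status == "married" and income > 10000 and house_owner:
        return "good"
    return "unknown"


# Hypothetical applicants, invented for illustration only.
applicants = [
    {"status": "married", "income": 15000, "house_owner": True},
    {"status": "single", "income": 20000, "house_owner": True},
    {"status": "married", "income": 8000, "house_owner": True},
]

labels = [
    classify_loan_applicant(a["status"], a["income"], a["house_owner"])
    for a in applicants
]
print(labels)  # ['good', 'unknown', 'unknown']
```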
Cluster Analysis
− Finds groups of objects such that the objects within a group are
similar to each other but different from the objects in other groups
− For example, in a company the classes of items for sale may include
electronic and non-electronic equipment.
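To make the idea of grouping similar objects concrete, here is a minimal sketch using k-means on one-dimensional data. The slide does not name a clustering algorithm, so k-means is an assumption, and the item prices are invented. The function groups values so that each one ends up with the nearest cluster center.

```python
def kmeans_1d(values, k=2, iterations=10):
    """Tiny k-means for 1-D data; returns (centroids, assignments)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted values.
    step = max(1, (len(ordered) - 1) // max(1, k - 1))
    centroids = [float(ordered[min(i * step, len(ordered) - 1)]) for i in range(k)]

    assignments = []
    for _ in range(iterations):
        # Assignment step: each value joins the cluster with the nearest centroid.
        assignments = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(values, assignments) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, assignments


# Hypothetical item prices: three cheap items and three expensive ones.
prices = [12, 15, 14, 480, 510, 495]
centroids, assignments = kmeans_1d(prices, k=2)
print(assignments)  # [0, 0, 0, 1, 1, 1]
```

Unlike classification, no labels are supplied here: the two groups emerge from the similarity of the values alone.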
Association
− Rules that associate one attribute of a relation with another
− Set-oriented approaches are the most efficient means of discovering
such rules
− Example: supermarket database
72% of all the records that contain items A and B also contain item C;
the specific percentage of occurrences, 72%, is the confidence factor of
the rule
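The confidence factor of a rule such as {A, B} => C can be computed by counting records. The sketch below assumes each record is a set of items. The baskets are invented, so the result differs from the 72% in the slide's example.

```python
def confidence(transactions, antecedent, consequent):
    """Fraction of records containing the antecedent that also contain the consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    if not with_antecedent:
        return 0.0  # the rule never applies, so report zero confidence
    with_both = [t for t in with_antecedent if consequent <= set(t)]
    return len(with_both) / len(with_antecedent)


# Hypothetical supermarket baskets, invented for illustration only.
baskets = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "D"},
]

rule_confidence = confidence(baskets, {"A", "B"}, {"C"})
print(f"{rule_confidence:.0%}")  # 75%
```

Four baskets contain both A and B, and three of those also contain C, which gives a confidence of 3/4.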
Relationship between Data Mining & Data Warehousing
Data Warehouse: centralized data repository which can be queried
for business benefit.
DM is the automated process of analyzing large data sets to find
patterns, relationships and trends and ultimately to generate business
insights – which will be used to solve challenges and identify new
opportunities
A data warehouse – where the data from the various sources is
combined and stored – allows data mining to be used throughout the
organization
Data warehousing and data mining are the cornerstones of modern
business decisions.
Data Warehousing makes it possible to
− extract archived operational data
− overcome inconsistencies between different legacy data formats
− integrate data throughout an enterprise regardless of location or format
− incorporate additional or expert information
The KDD/DM Process Model
The term KDD stands for Knowledge Discovery in Databases.

The main objective of the KDD process is to extract information from
data in the context of large databases.
− It does this by using Data Mining algorithms to identify what is
deemed knowledge.
Overview of the steps constituting the KDD process

Steps constituting the KDD process
Choosing and creating a dataset on which discovery will be performed
− This includes finding out what data is available, obtaining the
relevant data, and then integrating all the data for knowledge
discovery into one dataset, including the attributes that will be
considered for the process.
− This step is important because Data Mining learns and discovers
from the available data.
− This is the evidence base for building the models.
If some significant attributes are missing, the entire study may be
unsuccessful. In this respect, the more attributes are considered,
the better.
Preprocessing
− In this step, data reliability is improved.
− It includes data cleaning, for example, handling missing
values and removing noise or outliers.
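A minimal sketch of the two cleaning operations mentioned above. The specific choices are assumptions, not something the slide prescribes: missing values (None) are filled with the median, and a value counts as an outlier when it lies more than a fixed multiple of the median absolute deviation (MAD) from the median. The readings are invented.

```python
from statistics import median


def clean(values, threshold=5.0):
    """Fill missing values with the median, then drop MAD-based outliers."""
    known = [v for v in values if v is not None]
    center = median(known)

    # Handle missing values: replace None with the median of the known values.
    filled = [center if v is None else v for v in values]

    # Remove outliers: keep values within threshold * MAD of the median.
    mad = median(abs(v - center) for v in filled)
    if mad == 0:
        return filled  # no spread to measure against, so keep everything
    return [v for v in filled if abs(v - center) <= threshold * mad]


# Hypothetical sensor readings with one missing value and one outlier.
readings = [21, 23, None, 22, 24, 250, 23]
cleaned = clean(readings)
print(cleaned)  # [21, 23, 23.0, 22, 24, 23]
```

The median is used instead of the mean because a single extreme value such as 250 would pull the mean far away from the typical readings.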

Data Transformation
− In this stage, data suitable for Data Mining is
prepared and developed.
− Techniques here include dimension reduction (for example, feature
selection and extraction, and record sampling).
− This step can be essential for the success of the entire KDD process,
and it is typically very project-specific.
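Two of the techniques named above, feature selection and record sampling, can be sketched as follows. The selection criterion (dropping columns whose variance is zero) is just one simple choice, assumed here for illustration. The helper names and the data are hypothetical.

```python
import random
from statistics import pvariance


def select_features(rows, min_variance=0.0):
    """Keep only the columns whose variance exceeds min_variance."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > min_variance]
    return keep, [[row[i] for i in keep] for row in rows]


def sample_records(rows, n, seed=0):
    """Draw n records at random without replacement (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(rows, n)


# Hypothetical dataset: the first column is constant and carries no information.
rows = [
    [1.0, 5.0, 10.0],
    [1.0, 7.0, 20.0],
    [1.0, 6.0, 30.0],
    [1.0, 8.0, 40.0],
]

kept, reduced = select_features(rows)
subset = sample_records(reduced, n=2)
print(kept)         # [1, 2]
print(len(subset))  # 2
```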
Selecting the Data Mining algorithm
− This stage involves choosing a particular technique to be used for
searching for patterns, which may involve multiple inducers.
− For example, in the trade-off between precision and understandability,
the former is better with neural networks, while the latter is better with
decision trees.
Evaluation
− In this step, we assess and interpret the mined patterns and rules, and
their reliability, with respect to the objective defined in the first step.
− Here we also consider the preprocessing steps with regard to their
impact on the Data Mining algorithm results.
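One simple way to assess a mined classification rule is to compare its predictions with known outcomes. The slide does not name an evaluation measure, so accuracy is an assumption here, and the labels below are invented.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the known outcomes."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one example")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical predictions from a mined rule versus the true outcomes.
predicted = ["good", "bad", "good", "good"]
actual = ["good", "bad", "bad", "good"]

score = accuracy(predicted, actual)
print(score)  # 0.75
```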
Using the discovered knowledge
− At this stage, we are ready to incorporate the knowledge into another
system for further action.
− The knowledge becomes active in the sense that we may make
changes to the system and measure the effects.
− The success of this step determines the effectiveness of the whole
KDD process.
Exercise
What is data mining and what is it used for?
What are the main reasons for DM to attract a great deal of
attention in the information industry in recent years?
Explain the main functionalities of DM.
Explain the main steps in the KDD process.
