
DATA MINING

By Abadir Tahir Mohamed


28/8/2023
HARAMAYA UNIVERSITY
DEPARTMENT OF INFORMATION SCIENCE

Instructor: Tilahun Shiferaw (Asst. Prof.)


TABLE OF CONTENTS

1. Introduction
   1.1. Data mining
   1.2. Data mining & data warehouse
   1.3. Data mining and OLAP
   1.4. Actors in data mining

2. The Business Imperatives
   2.1. Importance and usage of data mining
   2.2. Why do we need data mining tools?
   2.3. What are the critical factors to consider while selecting data mining tools?
   2.4. Data mining tools/software
   2.5. The global data mining business market size
   2.6. High-paying jobs in data mining fields

3. The Technical Imperatives
   3.1. Data mining & machine learning
   3.2. Data mining and statistics
   3.3. Data mining and the web

4. Methodological Considerations
   4.1. SAS: the SEMMA analysis cycle
   4.2. SPSS: the 5 A's process
   4.3. CRISP-DM: the de facto standard for industry

5. The Data Mining Process
   5.1. Business understanding
   5.2. Data understanding
   5.3. Data preparation
   5.4. Modeling
   5.5. Evaluation
   5.6. Deployment

6. Conclusions and directions for further research

7. Summary and references
1 INTRODUCTION
1.1. Data mining
1.2. Data mining & Data warehouse
1.3. Data mining and OLAP
1.4. Actors in Data mining
1.1. DATA MINING
 Data can be a valuable resource for business, government, and
nonprofit organizations, but quantity isn’t what’s important about it. A
greater quantity of data does not guarantee better understanding or
competitive advantage. In fact, used well, a little bit of relevant data
provides more value than any poorly used extremely big database.
 Data mining is the way that ordinary business people use a range of
data analysis techniques to uncover useful information from data and
put that information into practical use. Data miners use tools designed
to help the work go quickly. They don’t fuss over theory and
assumptions. They validate their discoveries by testing. And they
understand that things change, so when the discovery that worked like
a charm yesterday doesn’t hold up today, they adapt.
 The objective of data mining is to identify valid, novel, potentially
useful, and understandable correlations and patterns in existing data.
Conti...
 It involves using various techniques and algorithms to analyze the
data and extract valuable information that can be used for decision-
making, prediction, and optimization in different fields such as
business, healthcare, finance, and more.
 Data mining techniques are employed to examine vast datasets
utilizing statistical analysis, machine learning algorithms, and artificial
intelligence tools. Through this process, businesses can identify hidden
relationships or correlations within the data that might not be apparent
on the surface. By understanding these patterns, organizations can
predict future behavior and trends, optimize marketing strategies,
personalize customer experiences, detect anomalies or frauds within
their operations, and ultimately drive better decision-making.
 The overall goal of data mining is to transform raw information into
actionable knowledge that drives successful outcomes for businesses in
an increasingly data-driven world.
 Data mining is the process of analyzing large amounts of data to find
valuable patterns and insights. Businesses use data mining techniques
to make informed decisions, improve processes, and gain a competitive
advantage.
Conti….

 In the 1980s, the term “data mining” was primarily used by statisticians,
database researchers, and the MIS (management information system)
and business communities. The term Knowledge Discovery in
Databases (KDD) is generally used to refer to the overall process of
discovering useful knowledge from data, where data mining is a
particular step in this process. The additional steps in the KDD
process, such as data preparation, data selection, data cleaning, and
proper interpretation of the results of the data mining step, ensure
that useful knowledge is derived from the data.
 Data mining is an extension of traditional data analysis and statistical
approaches in that it incorporates analytical techniques drawn from a
range of disciplines including, but not limited to,
 Numerical analysis
 Pattern matching and areas of artificial intelligence such as
machine learning,
 Neural networks and genetic algorithms
Data mining approach
 There are two types of data mining approaches:

1. Model building. Building models is similar to conventional exploratory
statistical methods: the objective is to produce an overall summary of a
set of data, identifying and describing the main features of the shape of
its distribution.

 This approach, sometimes called operational, seeks to model
relationships without relying on any underlying theory.
 In model building, a distinction is sometimes made between
empirical and mechanistic models.
 Examples of such models include a cluster analysis partition
of a set of data, a regression model for prediction, and a
tree-based classification rule.
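As a minimal sketch of the model-building approach, the following (hypothetical data, standard library only) fits a one-variable least-squares regression line, one of the model types mentioned above:

```python
# Minimal model-building sketch: fit a one-variable least-squares
# regression line y = a + b*x to a small, made-up dataset.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope b = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    b = cov / var
    a = mean_y - b * mean_x
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]   # roughly y = 2x
a, b = fit_line(xs, ys)
print(round(a, 2), round(b, 2))   # prints 0.09 1.99
```

The fitted line is a compact summary of the whole dataset, which is exactly the point of the model-building approach.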
Data mining approach
2. Pattern detection seeks to identify small departures from
the norm, that is, to detect unusual patterns of behavior.

 It is also known as the substantive or phenomenological
approach, as it is based on theories or mechanisms
underlying the data-generation process.
 This type of data mining is primarily concerned with
operational strategies and is often described as searching
for valuable information among a large amount of data.
 Examples include unusual spending patterns in credit
card usage, and objects with patterns of characteristics
unlike others.
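A minimal sketch of pattern detection in this spirit (hypothetical transaction amounts, standard library only) flags spending that deviates strongly from a customer's norm using a simple z-score rule; real fraud-detection systems use far richer models:

```python
import statistics

def flag_unusual(amounts, threshold=2.0):
    """Return indices of transactions whose z-score exceeds the threshold."""
    mean = statistics.mean(amounts)
    stdev = statistics.stdev(amounts)
    return [i for i, a in enumerate(amounts)
            if abs(a - mean) / stdev > threshold]

# Mostly small everyday purchases, plus one very large outlier.
spend = [12.5, 9.9, 14.2, 11.0, 13.3, 950.0, 10.7, 12.1]
print(flag_unusual(spend))  # prints [5] -- the 950.0 transaction
```

Note how the outlier itself inflates the standard deviation, which is why the threshold here is deliberately low; robust statistics (e.g. median-based rules) handle this better in practice.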
1.2. DATA MINING & DATA WAREHOUSE

 Data mining uses the data warehouse as the source of information for
knowledge discovery in databases (KDD) systems, through an amalgam of
artificial intelligence and statistics-related techniques, to find associations,
sequences, classifications, clusters, and forecasts.
 Figure 1 illustrates this process. As shown, almost all data enter the
warehouse from the operational environment. The data are then "cleaned"
and moved into the warehouse. The data continue to reside in the warehouse
until they reach an age at which one of three actions is taken: the data are
purged; the data, together with other information, are summarized; or the
data are archived.

Figure 1. Operational environment → clean the data → reside in warehouse →
purge / summarize / archive
COMPONENTS OF THE DATA WAREHOUSE
Typically the data warehouse architecture has three components. These three components may
reside on different platforms, or two or three of them may be on the same platform. Regardless of
the platform combination, all three components are required.

1. Data acquisition software (back-end), which extracts data from legacy systems and external
sources, consolidates and summarizes the data, and loads it into the data warehouse.
2. The data warehouse itself, which contains the data and associated database software. It is
often referred to as the "target database."
3. Client (front-end) software, which allows users and applications, such as DSS (decision
support systems) and EIS (executive information systems), to access and analyze data in the
warehouse.
1.3. Data mining and OLAP
 The capability of OLAP (standing for Online Analytical Processing) to provide multiple
and dynamic views of summarized data in a data warehouse sets a solid foundation for
successful data mining. Therefore, data mining and OLAP can be seen as tools that can
be used to complement one another.
 The essential distinction between OLAP and data mining is that OLAP is a data
summarization/aggregation tool, while data mining thrives on detail. Data mining
allows the automated discovery of implicit patterns and interesting knowledge that’s
hiding in large amounts of data.
 Expressions used in OLAP that describe the various functions include:
 rolling up (producing marginals),
 drilling down (going down levels of aggregation, the opposite of rolling up),
 slicing (conditioning on one variable),
 dicing (conditioning on many variables), and
 pivoting (rotating the data axes to provide an alternative presentation of the data).
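These operations can be mimicked on a tiny in-memory fact table (hypothetical sales records, standard library only; the function names are illustrative, not from any OLAP product) to make the vocabulary concrete:

```python
from collections import defaultdict

# Hypothetical fact table: one row per (region, product) with a sales measure.
sales = [
    {"region": "North", "product": "A", "sales": 10},
    {"region": "North", "product": "B", "sales": 5},
    {"region": "South", "product": "A", "sales": 7},
    {"region": "South", "product": "B", "sales": 3},
]

def roll_up(rows, dim):
    """Aggregate sales up to a single dimension (produce marginals)."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[dim]] += r["sales"]
    return dict(totals)

def slice_(rows, dim, value):
    """Condition on one variable."""
    return [r for r in rows if r[dim] == value]

print(roll_up(sales, "region"))       # prints {'North': 15, 'South': 10}
print(slice_(sales, "product", "A"))  # the two product-A rows
```

Dicing is just slicing on several dimensions at once, and pivoting only changes which dimension labels the rows versus the columns of the output.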
 A powerful paradigm that integrates OLAP with data mining technology is OLAM
(Online Analytical Mining) which is sometimes referred to as OLAP mining.
 OLAM systems are particularly important because most data mining tools need to work
on integrated, consistent, and cleaned data, which in turn requires costly data cleaning,
data transformation, and data integration as pre-processing steps.
1.4. Actors in Data mining
Data mining is performed by people. Depending on the scale and scope of the project, multiple individuals may
assume each of the various roles; for example, a large project would likely need several data mining analysts and
data mining engineers. Most projects include:

 The project leader, who has the overall responsibility for planning, coordinating, executing, and deploying the
data mining project.
 The data mining client, who is the business domain expert that requests the project and utilizes the results, but
generally does not possess the technical skills needed to participate in the execution of the more technical phases
of the data mining project such as data preparation and modeling.
 The data mining analyst, who thoroughly understands, from a business perspective, what the client wants to
accomplish and assists in translating those business objectives into technical requirements to be used in the
subsequent development of the data mining model(s).
 The data mining engineer, who develops, interprets and evaluates the data mining model(s) in light of the
business objectives and business success criteria. Data mining engineering is performed in consultation with the
data mining client and the data mining analyst in order to assist in achieving business ends.
 The IT analyst, who provides access to the hardware, software and data needed to complete the data mining
project successfully. It is important to note that data mining is a technology that needs to co-exist harmoniously
with other technologies in the organization. In addition, the data to be mined could be coming from virtually any
existing system, database, or data warehouse in the organization.
2 The Business Imperative
2.1. Importance and usage of Data mining
2.2. Why do we need data mining tools?
2.3. What are the critical factors to consider while
selecting data mining tools?
2.4. Data mining tool/ software
2.5. The global data mining business market size
2.6. High paying jobs in data mining fields
2.1. Importance and usage of Data mining
 Data mining offers value across a broad spectrum of industries
and can be used as a vehicle to increase profits by reducing
costs and/or raising revenue. A few of the common ways in
which data mining can accomplish those objectives are:

 lowering costs at the beginning of the product life cycle,
during research and development;
 determining the proper bounds for statistical process
control methods in automated manufacturing processes;
 eliminating expensive mailings to customers who are
unlikely to respond to an offer during a marketing
campaign;
 facilitating one-to-one marketing and mass customization
opportunities in customer relationship management.
 Overall, data mining increases the efficiency, productivity,
and communication of business operations, thereby
enhancing the decision-making process.
Conti…
Many organizations use data mining to help manage all phases of the customer life cycle,
including acquiring new customers, increasing revenue from existing customers, and
retaining good customers, because it is usually far less expensive to retain a customer than
acquire a new one.

Other industries where data mining can make a contribution include:

● Telecommunications and credit card companies are two of the leaders in
applying data mining to detect fraudulent use of their services.
● Insurance companies and stock exchanges are interested in applying data mining
to reduce fraud.
● Medical applications use data mining to predict the effectiveness of surgical
procedures, medical tests, or medications
● Financial firms use data mining to determine market and industry characteristics
as well as to predict individual company and stock performance
● Retailers make use of data mining to decide which products to stock in particular
stores (and even how to place them within a store), as well as to assess the
effectiveness of promotions and coupons
● Pharmaceutical firms mine large databases of chemical compounds and genetic
material to discover substances that might be candidates for development as agents
for the treatment of disease.
Why do we need Data Mining Tools?
 Data mining plays a crucial role in the analytics of any organization. It generates valuable data that can
be utilized in business intelligence and advanced analytics. The primary advantage of data mining
tools lies in their ability to uncover hidden patterns, trends, and correlations within datasets. This
invaluable knowledge, derived from a combination of traditional data analysis and predictive analytics,
can greatly enhance decision-making and strategic planning within a company. Furthermore, data
mining tools often come equipped with features that facilitate data visualization and support
interfaces with standard database formats.

 Moreover, data mining tools are instrumental in identifying anomalies in models and patterns, thereby
safeguarding your system from potential compromises. With these tools at your disposal, there is no
need to develop complex algorithms from scratch, as they already possess a comprehensive range of
features.

 In summary, data mining tools are indispensable for organizations seeking to unlock the full potential
of their data. By harnessing the power of these tools, businesses can gain valuable insights, improve
decision-making processes, and fortify their systems against potential threats.
What are the Critical Factors to Consider while Selecting Data Mining Tools?

 Data Mining Tools are a critical component of lead enrichment. You can establish patterns based on user behavior and use
them in your marketing campaigns. Let’s understand some of the key factors that you should keep in mind when selecting
the right Data Mining Tool.
1. Hardware, Software, Data and expertise in the field
2. Open Source or Proprietary, Choosing the right tool for data mining can be difficult, with many free options available.
Open source data mining tools are a good choice to begin with because they are continuously updated by a large
community, making them more flexible and efficient. While these tools have similar properties, there are a few key
differences. However, open source tools may not be as secure and well-developed, so businesses often opt for proprietary
tools that offer software, training, and support as a complete package.
3. Data Integrations, Some Data Mining Tools work better with huge datasets, while others work better with smaller ones.
When weighing your alternatives, think about the sorts of data you’ll be dealing with the most. If your data is presently
stored in a variety of systems or formats, your best chance is to locate a solution that can cope with the complexity.
4. Usability, Each Data Mining Tool will have a unique user interface that will make it easier for you to interact with the
work environment and engage with the data. Some Data Mining Tools are more educational in nature, focusing on offering
a general understanding of analytical procedures. Others are tailored to corporate needs, leading users through the process
of resolving a specific issue.
5. Programming Language, Open Source Data Mining Tools are mainly built using Java, but they also support R and
Python scripts. It is important to consider the programming languages that your programmers are familiar with and
whether they will collaborate with non-coders on Data Analysis projects.
Data Mining tool / Software
Selecting the best software for data mining depends on several factors, including the organization's specific requirements,
budget, and expertise. It is important to evaluate the features, scalability, ease of use, and support options of different
software options before making a decision
There are several different software options available for data mining, each with its own strengths and features. Some of
the popular software used for data mining include.

01 IBM SPSS Modeler
02 RapidMiner
03 Python libraries
04 SAS Enterprise Miner
05 KNIME
06 Orange
The Global Data Mining Business Market Size by 2023

94,000,000,000 USD
According to a report by Market Research Future, the global data mining market size is
expected to reach USD 93.87 billion by 2023, with North America being one of the key
regions driving this growth. The United States, in particular, has a strong presence in the data
mining industry due to its advanced technological infrastructure and the presence of
numerous companies specializing in data analytics.
High-paying jobs in the data mining field
In the IT field, there are several high-paying jobs that may involve data mining or analytics as
part of their responsibilities. Some of the most well-paid jobs in the information technology field
include:

- Data Scientist: Data scientists are responsible for analyzing and interpreting complex data to
derive insights and make data-driven decisions. They often use data mining techniques and
machine learning algorithms to extract valuable information from large datasets. Data scientists
are generally among the highest-paid professionals in the IT field due to the demand for their
expertise.
- Data Architect: Data architects design and maintain databases and data systems to ensure
efficient storage and retrieval of information. They work closely with data scientists and analysts
to ensure data is structured and organized for effective data mining and analysis.
- Machine Learning Engineer: Machine learning engineers develop and deploy machine
learning models and algorithms. They often work on building predictive models and optimizing
algorithms for data mining tasks.
- Business Intelligence (BI) Manager: BI managers oversee the implementation and
management of business intelligence systems, which may involve data mining and analytics.
They are responsible for ensuring that data is collected, analyzed, and presented in a meaningful
way to support decision-making processes.
- IT Project Manager: IT project managers oversee the planning, execution, and completion of
IT projects. While data mining may not be their primary responsibility, they may work closely
with data scientists and analysts to ensure successful project implementation.
Annual wage of a Data Scientist

5,053,811.8 BIRR

The median annual wage for data scientists and mathematical science occupations related to
data mining, as of May 2020, according to the United States Bureau of Labor Statistics.
3 TECHNICAL
IMPERATIVES
3.1. Data mining & machine learning
3.2. Data mining and statistics
3.3. Data mining and The web
3.1. Data mining & machine learning
Machine learning is the study of computational methods for improving performance by
mechanizing the acquisition of knowledge from experience. Machine learning aims to
provide increasing levels of automation in the knowledge engineering process, replacing
much time-consuming human activity with automatic techniques that improve accuracy
or efficiency by discovering and exploiting regularities in training data .

Although machine learning algorithms are central to the data mining process, it is
important to note that the process also involves other important steps, including:

 Building and maintaining the database


 Data formatting and cleansing
 Data visualization and summarization
 The use of human expert knowledge to formulate the inputs to the learning
algorithm and to evaluate the empirical regularities it discovers, and
 Determining how to deploy the results
Conti…
 The following are some of the basic learning algorithms used in data mining.

 Neural Networks (NN) are systems designed to imitate the human brain. They
are made up of simulated neurons that are connected to each other, much like the
neurons in our brains. Similar to our brain, the connections between neurons can
change in strength based on the stimulus or output received, allowing the network
to learn.
 Case-Based Reasoning (CBR) is a technology that solves problems by using
past experiences and solutions. It works by identifying similar cases from a set of
stored cases and applying their solutions to new problems. The new problem is also
added to the case base for future reference.
 Genetic Algorithms (GA) are computer procedures
inspired by natural selection and evolution. They use processes such as selection,
reproduction, mutation, and survival of the fittest to find high-quality solutions for
prediction and classification problems. In data mining, GA is employed to generate
hypotheses about relationships between variables by creating association rules or
other internal structures.
Conti…

 Decision Trees (DT) are a type of analysis tool used to make decisions based on
data. They work like a flowchart, where each step represents a test or decision and
leads to different branches. To classify a data item, you start at the root and follow
the path based on the test outcomes until you reach a final decision at a leaf node.
DTs can also be seen as a special type of rule set, organized in a hierarchy.
 Association Rules (AR) are statements that describe the connections between
attributes of a set of entities, allowing us to make predictions about other entities
that share the same attributes. In simpler terms, AR tell us about the relationships
between certain characteristics of data items or between different data sets. An
example of an AR is X1…Xn => Y[C,S], indicating that attributes X1 to Xn can
predict attribute Y with a confidence level of C and a significance level of S.
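A minimal sketch of how the support and confidence of such a rule can be computed (hypothetical market-basket transactions; the names `rule_stats`, `antecedent`, and `consequent` are illustrative, not from any particular library, and "support" here follows the common all-items-in-transaction convention):

```python
# Score an association rule X => Y over a small set of transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def rule_stats(antecedent, consequent, txns):
    """Return (support, confidence) for the rule antecedent => consequent."""
    both = sum(1 for t in txns if antecedent <= t and consequent <= t)
    ante = sum(1 for t in txns if antecedent <= t)
    support = both / len(txns)
    confidence = both / ante if ante else 0.0
    return support, confidence

s, c = rule_stats({"bread"}, {"milk"}, transactions)
print(s, c)  # support 0.5, confidence about 0.67
```

Here {bread} => {milk} holds in 2 of 4 transactions (support 0.5) and in 2 of the 3 transactions containing bread (confidence ≈ 0.67).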
While these so-called first-generation algorithms are widely used, they have significant
limitations. They typically assume the data contains only numeric and textual symbols and do not
contain images. They assume the data was carefully collected into a single database with a
specific data mining task in mind. Furthermore, these algorithms tend to be fully automatic and
therefore fail to allow guidance from knowledgeable users at key stages in the search for data
regularities.
3.2. Data Mining and Statistics
 The disciplines of statistics and data mining both aim to discover structure in data. Their aims overlap so much that some people
regard data mining as a subset of statistics. But that is not a realistic assessment, as data mining also makes use of ideas, tools, and
methods from other areas, particularly database technology and machine learning, and is not heavily concerned with some areas in
which statisticians are interested.
 Some of the commonly used classical statistical analysis techniques are discussed below.
 Descriptive and Visualization Techniques :
 Descriptive Techniques
• Averages and measures of variation
• Counts and percentages, and
• Cross-tabs and simple correlations
 Visualization techniques: visualization is primarily a discovery technique, useful for interpreting large amounts of data
• Histograms
• box plots
• scatter diagrams
• multi-dimensional surface plots
 Cluster Analysis
 Correlation Analysis
 Discriminant Analysis
 Factor Analysis
 Regression Analysis
 Dependency analysis
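A few of the descriptive measures above, computed on a tiny hypothetical sample with the standard library (the Pearson correlation is written out directly so the formula is visible):

```python
import statistics

ages   = [23, 35, 47, 29, 41]
spends = [150, 310, 460, 220, 390]  # hypothetical monthly spend

mean_age = statistics.mean(ages)    # average
sd_age   = statistics.stdev(ages)   # measure of variation

# Pearson correlation between age and spend, computed directly.
mx, my = statistics.mean(ages), statistics.mean(spends)
cov = sum((x - mx) * (y - my) for x, y in zip(ages, spends))
corr = cov / ((sum((x - mx) ** 2 for x in ages) ** 0.5)
              * (sum((y - my) ** 2 for y in spends) ** 0.5))
print(mean_age, round(corr, 3))     # prints 35 0.999
```

Even this toy sample shows why simple descriptive statistics come first in any analysis: a near-perfect correlation like this would immediately suggest a regression model as the next step.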
3.3. Data Mining and the Web
 With the large amount of information available online, the Web is a fertile area for data mining and knowledge discovery. In Web
mining, data can be collected at the
 Server-side
 Client-side
 Proxy servers
 Obtained from an organization’s database (which may contain business data or consolidated web data)
 Each type of data collection differs not only in terms of the location of the data source, but also
 the kinds of data available
 the segment of population from which the data was collected, and its
 Method of implementation
 A meta-analysis of the web mining literature, categorized web mining into three areas of interest based on which part of the web is to
be mined
1. Web Content Mining, describes the discovery of useful information from the web content/data/documents. Essentially, the
web content data consists of the data the web page was designed to convey to the users, including text, image, audio, video,
metadata, and hyperlinks.
2. Web structure mining tries to discover the model underlying the link structures of the Web. Intra-page structure information
includes the arrangement of various HTML or XML tags within a given page, while inter-page structure information is hyper-
links connecting one page to another. This model can be used to categorize web pages and is useful to generate information
such as the similarity and relationships among Web sites.
Conti…
3. Web usage mining (also referred to as click-stream analysis) is the process of applying data mining techniques to the discovery of usage
patterns from Web data, and is targeted towards applications. It tries to make sense of the data generated by the Web surfer’s sessions or
behaviors. While the web content and structure mining use the real or primary data on the web, web usage mining mines the secondary data
derived from the interactions of the users during Web sessions. Web usage data includes the data from web server access logs, browser logs,
user profiles, registration data, user sessions or transactions, cookies, user queries, mouse clicks, and any other data as the result of interaction
with the Web
 Given its application potential, particularly in terms of electronic commerce, interest in web usage mining, increased rapidly in both
the research and practice communities.
 three main tasks are performed in web usage mining; preprocessing, pattern discovery, and pattern analysis
1. Preprocessing consists of converting the usage, content, and structure contained in the various available data sources into the data
abstractions necessary for pattern discovery. It is typically the most difficult task in the web usage mining process due to the
incompleteness of the available data. Some of the typical problems include:
 Single IP address/multiple server sessions
 Multiple IP address/single server sessions
 Multiple IP addresses/single user and
 Multiple agent/single user
2. Pattern discovery The methods and algorithms are similar to those developed for non-Web domains such as statistical analysis,
clustering, and classification, but those methods must take into consideration the different kinds of data abstractions and prior knowledge
available for Web Mining. For example, in association rule discovery, the notion of a transaction for market-basket analysis does not take into
consideration the order in which items are selected. However, in Web Usage Mining, a server session is an ordered sequence of pages requested
by a user.
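The notion of a server session as an ordered sequence of pages can be sketched on toy data (hypothetical, already-parsed log records; real preprocessing must also untangle the IP/session problems listed above):

```python
from collections import defaultdict

# Hypothetical access-log records: (ip, timestamp, page).
log = [
    ("1.2.3.4", 100, "/home"),
    ("5.6.7.8", 102, "/home"),
    ("1.2.3.4", 110, "/products"),
    ("1.2.3.4", 115, "/cart"),
    ("5.6.7.8", 120, "/contact"),
]

def sessions_by_ip(records):
    """Group requests by IP into time-ordered page sequences."""
    by_ip = defaultdict(list)
    for ip, ts, page in sorted(records, key=lambda r: r[1]):
        by_ip[ip].append(page)
    return dict(by_ip)

print(sessions_by_ip(log)["1.2.3.4"])  # prints ['/home', '/products', '/cart']
```

Unlike a market basket, the order in these sequences matters, which is exactly the distinction the paragraph above draws for association rule discovery on Web data.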
Conti…
3. Pattern analysis is the last step in the overall Web Usage mining process. The motivation
behind pattern analysis is to filter out the uninteresting rules or patterns from the dataset
found in the pattern discovery phase. The exact methodology used for analysis is usually
governed by the application for which Web mining is to be done. The most common form
of pattern analysis consists of a knowledge query mechanism such as SQL. Another
method is to load usage data into a data cube to perform OLAP operations. Visualization
techniques, such as graphing patterns or assigning colors to different values, can highlight
patterns. The content and structure information can be used to filter out patterns which
contain pages of a certain use type or content, or pages that match a certain hyperlink
structure.
4
Methodological Considerations
4.1. SAS The SEMMA analysis cycle
4.2. SPSS The 5 A’s Process
4.3. CRISP-DM The De facto standard for industry
4.1. SAS The SEMMA analysis cycle
 SAS developed a data mining analysis cycle known by the acronym SEMMA. This acronym
stands for the five steps of the analyses that are generally a part of a data mining project.
1. Sample: the first step is to create one or more data tables by sampling data from the
data warehouse. Mining a representative sample instead of the entire volume drastically
reduces the processing time required to obtain business information
2. Explore: after sampling the data, the next step is to explore the data visually or
numerically for trends or groupings. Exploration helps to refine the discovery process.
Techniques such as factor analysis, correlation analysis and clustering are often used in
the discovery process
3. Modify: modifying the data refers to creating, selecting, and transforming one or more
variables to focus the model selection process in a particular direction, or to modify the
data for clarity or consistency
4. Model: creating a data model involves using the data mining software to search
automatically for a combination of data that predicts the desired outcome reliably
5. Assess: the last step is to assess the model to determine how well it performs. A common
means of assessing a model is to set aside a portion of the data during the sampling stage.
If the model is valid, it should work for both the reserved sample and for the sample that
was used to develop the model
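The holdout idea in the Assess step can be sketched as follows (hypothetical data; a fixed random seed keeps the split reproducible):

```python
import random

def holdout_split(rows, test_fraction=0.25, seed=42):
    """Reserve a portion of the data at sampling time for later assessment."""
    rows = rows[:]                        # avoid mutating the caller's list
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]         # (training set, reserved sample)

data = list(range(20))                    # stand-in for 20 sampled records
train, test = holdout_split(data)
print(len(train), len(test))              # prints 15 5
```

A model fitted on `train` is then assessed on `test`; if its performance holds up on records it never saw, that is evidence the model is valid rather than overfitted.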
4.2. SPSS The 5 A’s Process
 SPSS originally developed a data mining analysis cycle called the 5 A’s Process. The five
steps in the process are:

1. Assess
2. Access
3. Analyze
4. Act
5. Automate

Figure 2. The 5 A's Process (a cycle: Assess → Access → Analyze → Act → Automate)


4.3. CRISP-DM The De facto standard for industry
 The CRISP-DM project began in mid-1997 and was funded in part by the European
commission. The leading sponsors were: NCR, DaimlerChrysler, Integral Solutions
Limited (ISL) (now a part of SPSS), and OHRA, a Netherlands’ independent insurance
company
 The goal of the project was to define and validate an industry- and tool-neutral data
mining process model which would make the development of large as well as
small data mining projects faster, cheaper, more reliable, and more manageable.
 The project started in July 1997 and was planned to be completed within 18 months.
However, the work of the CRISP-DM received substantial international interest, which
caused the project to put emphasis on disseminating its work. As a result, the project
end date was pushed back, and the project was completed on April 30, 1999. The
CRISP-DM model is illustrated in Figure 3.
 The phases of the CRISP-DM process are
1. Business Understanding: get a clear understanding of the problem you are out to
solve, how it impacts your organization, and your goals for addressing it.
2. Data Understanding: review the data that you have, document it, and identify data
management and data quality issues.
3. Data Preparation: get your data ready to use for modeling.
4. Modeling: use mathematical techniques to identify patterns within your data.
5. Evaluation: review the patterns you have discovered and assess their potential for
business use.
6. Deployment: put your discoveries to work in everyday business.
5
DATA MINING PROCESS
5.1. Business Understanding
5.2. Data Understanding
5.3. Data preparation
5.4. Modeling
5.5. Evaluation
5.6. Deployment
5.1. Business Understanding
1.Determine Business Objectives
2.Assess Situation
3.Determine Data Mining Goals
4.Produce Project Plan

5.2. Data Understanding


1.Collect Initial Data
2.Describe Data
3.Explore Data
4.Verify Data Quality

5.3. Data Preparation


1. Select Data
2. Clean Data
3. Construct Data
4. Integrate Data
5. Format Data
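In practice, the clean/construct/integrate steps often amount to joining tables, imputing gaps, and deriving features. A hedged sketch with invented tables (mean imputation is just one possible cleaning choice):

```python
# Clean, construct, and integrate: join two invented tables on customer id,
# impute a missing age with the mean, and derive an "active" flag.
customers = {1: {"age": 34}, 2: {"age": None}, 3: {"age": 45}}
orders = {1: 3, 2: 7, 3: 0}            # orders placed per customer id

known = [c["age"] for c in customers.values() if c["age"] is not None]
default_age = sum(known) / len(known)  # simple mean imputation -> 39.5

prepared = []
for cid, c in customers.items():
    prepared.append({
        "id": cid,
        "age": c["age"] if c["age"] is not None else default_age,  # clean
        "orders": orders[cid],                                     # integrate
        "active": orders[cid] > 0,                                 # construct
    })
print(prepared)
```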
5.4. Modeling
1. Select Modeling Technique
2. Generate Test Design
3. Build Model
4. Assess Model (e.g., with a lift chart or a confusion matrix)
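The confusion matrix used in model assessment counts correct and incorrect predictions by class. A minimal sketch with invented predicted vs. actual labels (1 = positive, 0 = negative):

```python
# Tally the four cells of a binary confusion matrix and compute accuracy.
# The label vectors are invented for illustration.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
accuracy = (tp + tn) / len(actual)
print(f"TP={tp} FP={fp} FN={fn} TN={tn} accuracy={accuracy:.2f}")
```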
5.5. Evaluation
1. Evaluate Results: Results = ƒ(models, findings)
2. Review Process
3. Determine Next Steps
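One concrete way to evaluate results in business terms is lift: how much better the model's top-scored records respond than the population overall. A sketch with invented (score, responded) pairs:

```python
# Lift of the top-scored half vs. the overall response rate.
# The (model score, responded?) pairs are invented for illustration.
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1),
          (0.4, 0), (0.3, 0), (0.2, 1), (0.1, 0)]

base_rate = sum(r for _, r in scored) / len(scored)      # 4/8 = 0.5
top = sorted(scored, reverse=True)[: len(scored) // 2]   # best-scored half
top_rate = sum(r for _, r in top) / len(top)             # 3/4 = 0.75
lift = top_rate / base_rate
print(f"lift of top half: {lift:.1f}")                   # -> 1.5
```

A lift above 1 means targeting the model's top-scored records beats contacting records at random, which is a result a business sponsor can act on directly.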
5.6. Deployment
1. Plan Deployment
2. Plan Monitoring and Maintenance
3. Produce Final Report
4. Review Project
6
Conclusions and Directions for
further research
6.1. Conclusions
6.2. Directions For Further Research
6. Conclusions & Directions For Further Research
6.1. Conclusions
 Today, most enterprises are actively collecting and storing data in large
databases. Many of them have recognized the potential value of these data
as an information source for making business decisions. The dramatically
increasing demand for better decision support is met by the expanding
availability of knowledge discovery techniques, and data mining is one step at the
core of the knowledge discovery process.
 In this presentation, the focus was on data mining and its purpose of
developing algorithms to extract structure from data. This structure includes
statistical patterns, models, and relationships, which enable us to predict and
anticipate certain events. It is these structures that give data mining its
practical significance.
 Opportunities for further research abound particularly as the Internet
provides businesses with an operational platform for interaction with their
customers around the clock without geographic or physical boundaries.
Therefore, from a strategic perspective, the need to navigate the rapidly
growing universe of digital data will rely heavily on the ability to
effectively manage and mine the raw data.
6.2. Directions For Further Research
 The following is a (naturally incomplete) list of issues that warrant further investigation
in the emerging field of data mining:
1. Privacy: With so much enthusiasm for, and opportunity in, mining data from the Internet,
the serious issue of privacy needs to be handled effectively. Although privacy is not an
issue unique to data mining and the Internet, data mining researchers and practitioners
need to be constantly aware of the implications of tracking and analysis technologies for
privacy. Without properly addressing privacy on the Internet, the abundance of data may
eventually flow much more slowly due to regulations and other corrective or preventive
restrictions.
2. Progress toward the development of a theory: A theory regarding the correspondence
between techniques and the specific problem domains to which they apply is needed.
Questions regarding the relative performance of the various data mining algorithms remain
largely unresolved. With a myriad of algorithms and problem sets to which they are
applied, a systematic investigation of their performance is needed to guide the selection
of a data mining algorithm for a specific case.
3. Extensibility: Different techniques outperform one another for different problems. With
the increasing number of proposed data analysis techniques as well as reported
applications, it appears that any fixed set of algorithms will not be able to cover all
potential problems and tasks. It is therefore important to provide an architecture that
allows for easy synthesis of new methods, and for the adaptation of existing methods with
as little effort as possible.
4. Integration with databases: Most of the cost of data mining is not in the modeling algorithms;
rather it is in data cleaning and preparation, and in data maintenance and management. The
development of a standard application programming interface (API) and the subsequent
integration with a database environment could reduce the costs associated with these tasks. The
issues regarding data cleaning, preparation, maintenance and management are challenges that
face databases, data warehouses, and decision support systems in general.
5. Managing changing data: In many applications, particularly in the business domain, the data
is not stationary, but rather changing and evolving. This changing data may make previously
discovered patterns invalid; as a result, there is a clear need for incremental methods that
are able to update models as the data changes, and for strategies to identify and manage
patterns of temporal change in knowledge bases.
6. Non-standard data types: Today’s databases contain not only standard data such as numbers
and strings, but also large amounts of non-standard and multi-media data, such as free-form
text, audio, image and video data, temporal, spatial, and other data types. These data types
contain special patterns, which cannot be handled well by the standard analysis methods, and
therefore, require special, often domain-specific, methods and algorithms.
7. Pattern Evaluation: Several challenges remain regarding the development of techniques to
assess the interestingness of discovered patterns: a data mining system can uncover
thousands of patterns, but many of them may be uninteresting to a given user, i.e.,
representing common knowledge or lacking novelty. The use of interestingness measures to
guide and constrain the discovery process, as well as to reduce the search space, is an
active area of research.
7
References
1. Data Mining For Dummies®. Published by John Wiley & Sons, Inc., 111 River Street,
Hoboken, NJ 07030-5774, www.wiley.com. Copyright © 2014 by John Wiley & Sons, Inc.
2. Jackson, J. (2002). Data Mining: A Conceptual Overview. Communications of the
Association for Information Systems, 8(March). https://doi.org/10.17705/1cais.00819
THANKS!
Do you have any questions?
Feel free to ask!
waabadir2007@gmail.com
+251 70 322 4481
+251 94 299 4481
Abadir Tahir Mohamed (c/pgp/28/15)
HARAMAYA UNIVERSITY
CEP JIGJIGA CENTER
28/8/2023