Introduction To Data Mining & Business Intelligence


COIS 448: Data Mining & Business Intelligence
Introduction to Data Mining & Business Intelligence
Information Systems Department
Faculty of Computing and Information Technology Rabigh
King Abdulaziz University

Slide adapted from Dr. Arda


Course Details
Course Code: COIS 448
Course Title: Data Mining & Business Intelligence
Lecture Days: Sunday, Tuesday & Thursday
Time: 10:00am to 10:50am
Venue: Virtual Class (Online)
Course Evaluation
• Midterm Exam
• Final Exam
• Assignment
• Quiz
Course Outcome
Outcome Description
1 An ability to analyze a problem, and identify and define the computing requirements appropriate to its solution
2 An ability to design, implement, and evaluate a computer-based system, process, component, or program to meet desired needs
Course Content
No. Description Weeks
1 Introduction 1
2 Data Mining Processes and Knowledge Discovery 1
3 Overview of Data Mining Techniques 1
4 Getting to Know Your Data 1
5 Data Pre-processing 1
6 Market Basket Analysis (Association Analysis) 1
7 Classification Analysis 1
8 Decision Tree Algorithms 1
9 Clustering Analysis 1
10 Regression Algorithm in Data Mining 1
11 Neural Network in Data Mining 1
12 Fuzzy Logic 1
13 Genetic Algorithm 1
14 Data Mining Tools 1
Classroom
• No cell phones
• No Food / Drinks (maybe some drinks)
• Participation is expected!
• Attendance is required
• Announcements are made in class
• If you come to class you are expected to participate!
Why Mine Data? Commercial Viewpoint
• Lots of data is being collected and warehoused
  • Web data, e-commerce
  • Purchases at department/grocery stores
  • Bank/credit card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong
  • Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Why Mine Data? Scientific Viewpoint
• Data collected and stored at enormous speeds (GB/hour)
  • Remote sensors on a satellite
  • Telescopes scanning the skies
  • Microarrays generating gene expression data
  • Scientific simulations generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining may help scientists
  • in classifying and segmenting data
  • in hypothesis formation
Mining Large Data Sets - Motivation
• There is often information “hidden” in the data that is not readily evident
• Human analysts may take weeks to discover useful information
• Much of the data is never analyzed at all
[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts, 1995 to 1999.]
From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”
Why Data Mining?
What is Data Mining?
• Many definitions
  • Non-trivial extraction of implicit, previously unknown and potentially useful information from data
  • Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
What is (not) Data Mining?
What is not Data Mining?
  • Look up a phone number in a phone directory
  • Query a Web search engine for information about “Amazon”
What is Data Mining?
  • Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
  • Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
What is Data Mining
• Data mining is not another hype. Instead, the need for data mining has arisen due to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. Thus, data mining can be viewed as the result of the natural evolution of information technology.
• Data mining is more than a simple transformation of technology developed from databases, statistics, and machine learning. Instead, data mining involves an integration, rather than a simple transformation, of techniques from multiple disciplines such as database technology, statistics, machine learning, high-performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing, and spatial data analysis.
What is Data Mining
• Many people treat data mining as a synonym for another popularly used term, knowledge discovery from data, or KDD, while others view data mining as merely an essential step in the process of knowledge discovery.
The knowledge discovery process is shown in Figure 1.4 as an iterative sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users)
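The seven steps above can be sketched as a chain of small functions. This is only an illustrative toy, not the course's or any textbook's implementation: the two store data sets, the function names, and the "mean spend per item" pattern are all assumptions invented for the example, and each source is cleaned before integration for simplicity.

```python
from statistics import mean

# Two hypothetical data sources (invented for illustration only).
store_a = [
    {"customer": "c1", "item": "milk", "amount": 3.0},
    {"customer": "c2", "item": "bread", "amount": None},  # missing value
    {"customer": "c1", "item": "bread", "amount": 2.0},
]
store_b = [
    {"customer": "c3", "item": "milk", "amount": 3.5},
    {"customer": "c3", "item": "tv", "amount": 400.0},
    {"customer": "c1", "item": "milk", "amount": -1.0},   # inconsistent value
]

def clean(records):
    """Step 1: drop records with a missing or negative amount."""
    return [r for r in records if r["amount"] is not None and r["amount"] >= 0]

def integrate(*sources):
    """Step 2: combine several sources into one list of records."""
    return [r for source in sources for r in source]

def select(records, items):
    """Step 3: keep only the records relevant to the analysis task."""
    return [r for r in records if r["item"] in items]

def transform(records):
    """Step 4: consolidate the amounts per item."""
    grouped = {}
    for r in records:
        grouped.setdefault(r["item"], []).append(r["amount"])
    return grouped

def mine(grouped):
    """Step 5: a deliberately trivial 'pattern': mean spend per item."""
    return {item: mean(amounts) for item, amounts in grouped.items()}

def evaluate(patterns, min_mean):
    """Step 6: keep patterns that pass an interestingness threshold."""
    return {item: m for item, m in patterns.items() if m >= min_mean}

def present(patterns):
    """Step 7: render the surviving patterns as readable lines."""
    return [f"{item}: mean spend {m:.2f}" for item, m in sorted(patterns.items())]

records = integrate(clean(store_a), clean(store_b))
groceries = select(records, {"milk", "bread"})
patterns = mine(transform(groceries))
report = present(evaluate(patterns, min_mean=2.5))
print(report)
```

In a real project each step is far richer, and the process is iterative: results from pattern evaluation often send the analyst back to selection or transformation.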
Data Mining Tasks...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
  • Enormity of data
  • High dimensionality of data
  • Heterogeneous, distributed nature of data
[Figure: Data Mining at the intersection of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems.]
Which Technologies Are Used?
• As a highly application-driven domain, data mining has incorporated many techniques from other domains such as statistics, machine learning, pattern recognition, database and data warehouse systems, information retrieval, visualization, algorithms, high performance computing, and many application domains (Figure 1.11).
Statistics
• Statistics studies the collection, analysis, interpretation or explanation, and presentation of data. Data mining has an inherent connection with statistics.
• A statistical model is a set of mathematical functions that describe the behavior of the objects in a target class in terms of random variables and their associated probability distributions.
• Statistical models are widely used to model data and data classes.
• Statistical methods can also be used to verify data mining results. For example, after a classification or prediction model is mined, the model should be verified by statistical hypothesis testing.
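As one possible illustration of verifying a mined classifier by hypothesis testing, the sketch below runs a one-sided binomial test: the null hypothesis is that the model is only guessing on a two-class problem (accuracy 0.5). The function name, the test-set sizes, and the choice of this particular test are assumptions for the example; the slides do not prescribe a specific test.

```python
from math import comb

def binomial_p_value(correct, total, chance=0.5):
    """One-sided p-value: the probability of getting at least `correct`
    predictions right out of `total` if the true accuracy were `chance`."""
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )

# Hypothetical test-set results for two classifiers, 20 examples each.
p_strong = binomial_p_value(correct=16, total=20)
p_weak = binomial_p_value(correct=11, total=20)

print(f"16/20 correct: p = {p_strong:.4f}")
print(f"11/20 correct: p = {p_weak:.4f}")
```

A small p-value (conventionally below 0.05) suggests the accuracy is unlikely to be due to guessing alone; a large one means the test set gives no evidence that the model has learned anything.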
Machine Learning
• Machine learning investigates how computers can learn (or improve their performance) based on data.
A main research area is for computer programs to automatically learn to recognize complex patterns and make intelligent decisions based on data. For example, a typical machine learning problem is to program a computer so that it can automatically recognize handwritten postal codes on mail after learning from a set of examples.
• Some classic problems in machine learning that are highly related to data mining:
  • Supervised learning
  • Unsupervised learning
  • Semi-supervised learning
Machine Learning
• Supervised learning is basically a synonym for classification.
The supervision in the learning comes from the labeled examples in the training data set. For example, in the postal code recognition problem, a set of handwritten postal code images and their corresponding machine-readable translations are used as the training examples, which supervise the learning of the classification model.
• Unsupervised learning is essentially a synonym for clustering.
The learning process is unsupervised since the input examples are not class labeled. Typically, we may use clustering to discover classes within the data.
• Semi-supervised learning is a class of machine learning techniques that make use of both labeled and unlabeled examples when learning a model.
In one approach, labeled examples are used to learn class models and unlabeled examples are used to refine the boundaries between classes.
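The difference between learning with and without labels can be shown on the same six points. The sketch below is a minimal illustration under stated assumptions: the points, the labels "small" and "large", and the choice of a nearest-centroid classifier and a basic k-means loop are all made up for the example and are not taken from the course material.

```python
from math import dist

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def nearest(point, centers):
    """Index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))

def fit_nearest_centroid(points, labels):
    """Supervised: the labels say which points belong together."""
    by_label = {}
    for p, y in zip(points, labels):
        by_label.setdefault(y, []).append(p)
    return {y: centroid(ps) for y, ps in by_label.items()}

def predict(model, point):
    """Assign the label whose centroid is closest to `point`."""
    names = list(model)
    return names[nearest(point, [model[n] for n in names])]

def k_means(points, k, iterations=10):
    """Unsupervised: no labels; groups are discovered from the points alone.
    The first k points serve as initial centers (an arbitrary choice)."""
    centers = list(points[:k])
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center if a group happens to be empty.
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
    return [nearest(p, centers) for p in points]

points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]
labels = ["small", "large", "small", "large", "small", "large"]

model = fit_nearest_centroid(points, labels)   # uses the labels
predicted = predict(model, (2.0, 1.0))
clusters = k_means(points, k=2)                # never sees the labels

print(predicted)
print(clusters)
```

The classifier returns a named class because it was told the names; k-means only returns anonymous cluster indices, which an analyst must then interpret.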
Database Systems and Data Warehouses
• Database systems research focuses on the creation, maintenance, and use of databases for organizations and end-users.
• Particularly, database systems researchers have established highly recognized principles in data models, query languages, query processing and optimization methods, data storage, and indexing and accessing methods. Database systems are often well known for their high scalability in processing very large, relatively structured data sets.
• Many data mining tasks need to handle large data sets or even real-time, fast streaming data. Therefore, data mining can make good use of scalable database technologies to achieve high efficiency and scalability on large data sets.
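One way a mining task can lean on database technology is to let the database engine do the counting and aggregation rather than loading every row into the program. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the purchases table, its columns, and the sample rows are hypothetical, and a production system would use a much larger, persistent database.

```python
import sqlite3

# In-memory database with a hypothetical purchases table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        ("c1", "milk", 3.0),
        ("c1", "bread", 2.0),
        ("c2", "milk", 3.5),
        ("c2", "tv", 400.0),
        ("c3", "milk", 3.0),
    ],
)
# An index lets the engine locate rows by item without a full scan.
conn.execute("CREATE INDEX idx_purchases_item ON purchases (item)")

# The aggregation runs inside the database; only the summary comes back.
rows = conn.execute(
    "SELECT item, COUNT(*) AS n, SUM(amount) AS total "
    "FROM purchases GROUP BY item ORDER BY n DESC, item"
).fetchall()
conn.close()

for item, n, total in rows:
    print(item, n, total)
```

Item counts like these are the raw material for later steps such as frequent itemset discovery.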
Which Kinds of Applications Are Targeted?
• Where there are data, there are data mining applications
As a highly application-driven discipline, data mining has seen great successes in many applications. It is impossible to enumerate all applications where data mining plays a critical role.
Business Intelligence
• It is critical for businesses to acquire a better understanding of the commercial context of their organization, such as their customers, the market, supply and resources, and competitors. Business intelligence (BI) technologies provide historical, current, and predictive views of business operations. Examples include reporting, online analytical processing, business performance management, competitive intelligence, benchmarking, and predictive analytics.
“How important is business intelligence?” Without data mining, many businesses may not be able to perform effective market analysis, compare customer feedback on similar products, discover the strengths and weaknesses of their competitors, retain highly valuable customers, and make smart business decisions.
Which Kinds of Applications Are Targeted?
• Give an example where data mining is crucial to the success of a business. What data mining functions does this business need, and can they be performed alternatively by data query processing or simple statistical analysis?
• For example, a department store can use data mining to assist with its target marketing mail campaign. Using data mining functions such as association, the store can use the mined strong association rules to determine which products bought by one group of customers are likely to lead to the buying of certain other products.
With this information, the store can then mail marketing materials only to those kinds of customers who exhibit a high likelihood of purchasing additional products.
Data query processing is used for data or information retrieval and does not have the means for finding association rules. Similarly, simple statistical analysis cannot handle large amounts of data such as those of customer records in a department store.
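To make "strong association rules" concrete, the sketch below computes support and confidence for single-item rules over five made-up shopping baskets. The transactions, the thresholds, and the function names are assumptions for illustration; a real store would run a dedicated algorithm such as Apriori over millions of records.

```python
from itertools import permutations

# Hypothetical market baskets (invented for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "diapers", "beer"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

def strong_rules(min_support, min_confidence):
    """All rules a -> b between single items that meet both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(items, 2):
        if (support({a, b}) >= min_support
                and confidence({a}, {b}) >= min_confidence):
            rules.append((a, b))
    return rules

rules = strong_rules(min_support=0.5, min_confidence=0.9)
for a, b in rules:
    print(f"{a} -> {b}")
```

With a rule such as beer -> diapers in hand, the store could mail diaper promotions only to customers whose baskets contain beer but not yet diapers, which is the targeting idea described above.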
Which Kinds of Applications Are Targeted?
• Web Search Engines
  • A Web search engine is a specialized computer server that searches for information on the Web. The search results of a user query are often returned as a list (sometimes called hits). The hits may consist of web pages, images, and other types of files. Some search engines also search and return data available in public databases or open directories.
  • Search engines differ from web directories in that web directories are maintained by human editors whereas search engines operate algorithmically or by a mixture of algorithmic and human input.
