Data Mining: Knowledge Discovery in Databases
Data Mining
Knowledge discovery in databases
By
M. Pranay Teja
Id no. 180030368.
Sec 25.
Data 3 1
What is Data Mining?
• Data mining is a capability to support the
recognition of previously unknown but
potentially useful relationships within large
databases/ data warehouses.
• Aim: find useful patterns in the data.
• Uses statistical, mathematical, artificial
intelligence, and machine-learning techniques
Data Mining Tools
• Data mining tools use statistical or rules-based
methods to identify patterns and create predictive
models.
• Tools look for patterns using a variety of models
– Statistical methods e.g. correlation
– Decision trees
– Case based reasoning
– Neural computing
– Intelligent agents
– Genetic algorithms
Text Mining
• Text Mining – Analyze text documents.
– Find Hidden content
– Group by themes
– Determine relationships between documents
Process of Data Mining/ Knowledge Discovery
• Stages, in order from source data to result:
– Databases
– Data Cleaning and Data Integration
– Task-relevant Data
– Data Mining
– Pattern Evaluation
– Knowledge
What does it let you do?
• Data mining automates the process of
sifting through historical data in order to
discover new information.
• Data Mining techniques enable users to
identify patterns and correlations within a
set of data
• These can then be used as predictive
models that anticipate behaviour or events
based on trends in the data.
Correlation versus Causation
• Correlation
– A statistical relation between two or more
variables such that changes in the value of one
variable are accompanied by changes in the value
of the other
• Causation
– Changes in one variable cause changes in another.
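A minimal sketch of how a correlation can be measured, using the Pearson coefficient. The two monthly series are hypothetical illustration data, not from the slides. A value near 1 only shows that the series move together; it does not show that one causes the other (here both could simply follow the season).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical monthly figures: both rise towards summer.
ice_cream_sales = [20, 25, 40, 60, 80, 95]
sunburn_cases = [2, 3, 5, 8, 11, 13]

r = pearson(ice_cream_sales, sunburn_cases)
print(f"correlation = {r:.3f}")  # strongly positive, yet neither causes the other
```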
What do you need for Data Mining?
• Massive data collection
• Powerful computers
• Data mining algorithms
Five Basic Operations
• Clustering
– Identifies groups of items that share a particular characteristic
• Classification
– Infers the defining characteristics of a certain group
• Association
– Identifies relationships between events that occur at the same
time
• Sequencing
– Identifies relationships between events over time
• Forecasting
– Estimates future values based on patterns within large sets of
data
Clustering
• The process of identifying relationships between
similar records without any preconceived notion
of what that similarity might involve.
• Examples:
– Disease clusters
– Similarities in customers' telephone usage
• Often used as an exploratory exercise before
further data mining using a classification
technique.
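A minimal sketch of clustering on the telephone usage example, using a plain one-dimensional k-means. The usage figures, the choice of three groups and the starting centers are all assumptions made for illustration. The records are not labelled in advance; the groups emerge from the data.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group 1-D values around the given starting centers (plain k-means)."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to its nearest center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical monthly call minutes per customer.
monthly_call_minutes = [12, 15, 18, 20, 210, 230, 250, 900, 950]

centers, clusters = kmeans_1d(monthly_call_minutes, centers=[0.0, 250.0, 1000.0])
for center, members in zip(centers, clusters):
    print(center, members)
```

The resulting groups (light, medium and heavy users in this toy data) could then serve as input to a classification step.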
Classification
• The data mining system learns from examples of
the data how to partition or classify the data,
i.e. it formulates classification rules which
can be used for prediction.
– Example: a bank classifies customers and may
offer them differing levels of service, different
offers and different charges. The same approach
can be used to build loan approval models.
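A minimal sketch of learning a classification rule from examples, in the spirit of the loan approval case. The past loans, the single income attribute and the function names are hypothetical; real tools learn richer rules (for example decision trees over many attributes).

```python
def learn_threshold_rule(examples):
    """Pick the income threshold that misclassifies the fewest examples.

    examples: list of (income, approved) pairs, approved being True or False.
    The learned rule is: approve when income >= threshold.
    """
    candidates = sorted({income for income, _ in examples})
    best_threshold, best_errors = None, len(examples) + 1
    for threshold in candidates:
        errors = sum((income >= threshold) != approved
                     for income, approved in examples)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors

def classify(income, threshold):
    """Apply the learned rule to a new customer."""
    return income >= threshold

# Hypothetical historical decisions: (income in thousands, loan approved?).
past_loans = [(18, False), (22, False), (25, False), (28, True),
              (31, True), (40, True), (55, True)]

threshold, errors = learn_threshold_rule(past_loans)
print(threshold, errors)
print(classify(50, threshold), classify(20, threshold))
```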
Association
• Looks for links between records in a data set
– e.g. items purchased at the same time.
• Patterns can be identified to indicate probabilities
e.g.
• 500,000 transactions
• 20,000 nappies
• 30,000 beer
• 10,000 nappies + beer
– Beer and nappies occur together in 2% of transactions.
– “when people buy beer they buy nappies 1/3 of the
time”
– “when people buy nappies they buy beer 50% of the
time”
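The figures above can be reproduced with a few lines of arithmetic. The terms "support" and "confidence" are standard association-rule vocabulary rather than wording from the slides, and the function names are chosen here for illustration.

```python
def support(count_both, total):
    """Share of all transactions that contain both items."""
    return count_both / total

def confidence(count_both, count_antecedent):
    """Share of transactions with the first item that also contain the second."""
    return count_both / count_antecedent

# Counts taken from the example on the slide.
total_transactions = 500_000
nappies = 20_000
beer = 30_000
nappies_and_beer = 10_000

support_both = support(nappies_and_beer, total_transactions)
conf_beer_to_nappies = confidence(nappies_and_beer, beer)
conf_nappies_to_beer = confidence(nappies_and_beer, nappies)

print(f"beer and nappies together: {support_both:.0%}")
print(f"beer buyers who buy nappies: {conf_beer_to_nappies:.1%}")
print(f"nappy buyers who buy beer: {conf_nappies_to_beer:.0%}")
```

Note that the two confidences differ because the rule is directional: the same 10,000 joint purchases are divided by a different base each time.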
Sequential Analysis
• A form of association used to track
relationships over time.
– E.g. health insurance claims.
– E.g. 10% of customers who bought a tent bought a
backpack within one month.
– Weather patterns e.g. tidal wave in Hawaii follows
hurricane in N. Atlantic x% of the time.
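A minimal sketch of the tent and backpack example: count how many tent buyers went on to buy a backpack within a time window. The purchase records, the 30-day window and the function name are assumptions for illustration.

```python
from datetime import date, timedelta

def follow_on_rate(purchases, first_item, second_item, window_days=30):
    """Share of first_item buyers who bought second_item within window_days.

    purchases: list of (customer_id, item, purchase_date) tuples.
    """
    window = timedelta(days=window_days)
    first_dates = {}  # customer -> earliest purchase date of first_item
    for customer, item, day in purchases:
        if item == first_item:
            if customer not in first_dates or day < first_dates[customer]:
                first_dates[customer] = day
    followers = set()
    for customer, item, day in purchases:
        if item == second_item and customer in first_dates:
            if timedelta(0) <= day - first_dates[customer] <= window:
                followers.add(customer)
    return len(followers) / len(first_dates) if first_dates else 0.0

# Hypothetical purchase history.
purchases = [
    ("c1", "tent", date(2024, 5, 1)),
    ("c1", "backpack", date(2024, 5, 20)),   # within the window
    ("c2", "tent", date(2024, 5, 3)),
    ("c2", "backpack", date(2024, 7, 15)),   # too late to count
    ("c3", "tent", date(2024, 6, 1)),
    ("c4", "backpack", date(2024, 6, 2)),    # never bought a tent
]

rate = follow_on_rate(purchases, "tent", "backpack")
print(f"{rate:.0%} of tent buyers bought a backpack within 30 days")
```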
Forecasting
• Concerns the prediction of continuous variables
e.g. sales, share values, stock market levels, oil
prices etc.
• Often done with regression functions: statistical
methods for examining the relationship between
variables in order to predict a future value.
• Two types:
– Forecasting single continuous value based on
unordered examples. e.g. predict income based on
personal details.
– Predict one or more values based on a sequential
pattern – time series forecasting.
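A minimal sketch of time series forecasting with a regression function: fit a straight line to past values by least squares and extend it one step. The monthly sales figures are hypothetical, and a straight line is only one of many possible models.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical sales for months 1 to 5.
months = [1, 2, 3, 4, 5]
sales = [102, 108, 121, 129, 140]

slope, intercept = fit_line(months, sales)
forecast_month_6 = slope * 6 + intercept
print(f"trend: {slope:.1f} per month, forecast for month 6: {forecast_month_6:.1f}")
```

The same fit_line function covers the first type of forecasting as well, if xs holds a personal attribute (such as years of experience) and ys the value to predict (such as income).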
Data Mining Tools in more detail
• Case-based Reasoning
– Use historical cases to identify patterns.
• Neural Computing
– Examine historical data for pattern recognition e.g.
identify potential customers for a new product.
• Intelligent agents
– Retrieve information from large databases.
• Other tools e.g. decision trees, rule induction,
data visualisation.
Some Key Application Areas
• Data mining is used in many different areas
• Two big areas are:
– Market analysis and management
• Initial data gathered from: credit card transactions,
loyalty cards, discount coupons, customer complaint
calls, lifestyle studies, focus groups
– Fraud detection and management
Examples
• Target marketing
– Find clusters of “model” customers who share the same characteristics: e.g.
interests, income
• Determine customer purchasing patterns over time
• Cross-market analysis: uses associations/correlations between product
sales and makes predictions based on the association information
• Customer profiling:
– What types of customers buy what products
• Identifying customer requirements:
– Identify the best products for different customers; use prediction to find
what factors will attract new customers
Fraud detection and management
• Used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
• Use historical data to build models of fraudulent
behavior and use data mining to help identify
similar instances
• Examples
– auto insurance: detect a group of people who stage
accidents to collect on insurance
– money laundering: detect suspicious money
transactions
– medical insurance: detect professional patients,
rings of doctors and rings of references
Text Mining
• Application of data mining to unstructured or less
structured files.
• Text mining operates with less structured
information and helps organisations to:
– Find hidden content of documents including useful
relationships.
– Relate documents across unnoticed divisions e.g.
customers in two product divisions have the same
characteristics.
– Group documents by themes e.g. all customers who
have similar complaints.
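A minimal sketch of grouping documents by theme, using word overlap (Jaccard similarity) between complaints. The complaint texts, the stopword list, the 0.2 threshold and the greedy grouping strategy are all assumptions for illustration; production text mining tools use far richer language processing.

```python
import re

# Small hypothetical stopword list; real systems use much larger ones.
STOPWORDS = {"the", "a", "my", "is", "was", "and", "to", "for",
             "i", "it", "on", "of", "in"}

def tokens(text):
    """Lower-cased set of words in the text, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def jaccard(a, b):
    """Overlap between two word sets: shared words / all words."""
    return len(a & b) / len(a | b) if a | b else 0.0

def group_by_theme(documents, threshold=0.2):
    """Greedy grouping: join the first group whose first document is similar enough."""
    groups = []
    for doc in documents:
        words = tokens(doc)
        for group in groups:
            if jaccard(words, tokens(group[0])) >= threshold:
                group.append(doc)
                break
        else:
            groups.append([doc])  # no similar group found: start a new theme
    return groups

complaints = [
    "My bill was charged twice this month",
    "Delivery arrived late and the box was damaged",
    "I was charged twice on my bill",
    "The delivery was late again",
]

groups = group_by_theme(complaints)
for group in groups:
    print(group)
```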
Some more example applications by area
• Marketing: predicting which customers will respond to
internet banners or buy a product; segmenting customer
demographics.
• Banking: forecasting bad loans and fraudulent credit card
usage, credit card spending by new customers and which
customers will respond best to new loan offers.
• Retailing and Sales: predicting sales, correct stock levels,
distribution schedules.
• Manufacturing and Production: predicting when to expect
machinery failures, finding key factors that control the
optimisation of manufacturing capacity.
• Brokerage and Securities Trading: Predicting when bond
prices will change, forecasting range of stock fluctuation
for particular issues, determining when to trade stock.
• Insurance: forecasting claim amounts, medical coverage
costs, classifying the most important elements that affect
medical coverage, predicting which customers will buy
new policies.
• Computer Hardware and Software: Predicting drive
failure, forecasting creation time for new chips, predicting
potential security violations.
• Government and Defence: Forecasting cost of moving
military equipment, testing strategies for potential
military engagements, predicting resource consumption.
• Airlines: capturing data on where customers are
flying and the destinations of those who change
carriers mid-journey.
• Healthcare: correlating demographics of patients
with critical illnesses.
• Broadcasting: predicting which programs are best
shown in prime time and how to maximize returns
by inserting advertisements.
• Police: tracking crime patterns, locations, criminal
behaviour and attributes to help crack criminal
cases.
Problems with data mining
• Need clear business objectives and access to the
appropriate data.
• Need the right data.
– Bad data quality can lead to spurious results
• Models are not fail-safe.
• Privacy, property and other legal and ethical
issues.
• Companies must change mode of operation and
maintain the effort (e.g. loyalty programs such as
air miles).
Conclusion
• Data mining is an attractive-sounding
technology which is still evolving.
• The key is that the algorithms discover useful
relationships.
– This is unlike standard research, where
researchers hypothesise correlations and then
search for them.
• There are ethical issues:
– E.g. Criminal profiling.