01 Intro
(3rd ed.)
Chapter 1
Jiawei Han, Micheline Kamber, and Jian Pei
Chapter 1. Introduction
Why Data Mining?
What Is Data Mining?
A Multi-Dimensional View of Data Mining
What Kinds of Data Can Be Mined?
What Kinds of Patterns Can Be Mined?
What Kinds of Technologies Are Used?
What Kinds of Applications Are Targeted?
The amount of information in the world doubles every 20 months, and the size and number of databases are increasing even faster.
1 byte = 8 bits
1 kilobyte (K/KB) = 2^10 bytes = 1,024 bytes
1 megabyte (M/MB) = 2^20 bytes = 1,048,576 bytes
1 gigabyte (G/GB) = 2^30 bytes = 1,073,741,824 bytes
1 terabyte (T/TB) = 2^40 bytes = 1,099,511,627,776 bytes
1 petabyte (P/PB) = 2^50 bytes = 1,125,899,906,842,624 bytes
1 exabyte (E/EB) = 2^60 bytes = 1,152,921,504,606,846,976 bytes
1 zettabyte (Z/ZB) = 2^70 bytes = 1,180,591,620,717,411,303,424 bytes (about 10^21 bytes)
1 yottabyte (Y/YB) = 2^80 bytes = 1,208,925,819,614,629,174,706,176 bytes (about 10^24 bytes)
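The power-of-two sizes listed above follow one rule: the k-th unit is 2^(10k) bytes. The short sketch below recomputes them. The names `UNITS` and `unit_size` are illustrative and not part of the slides, and extending the binary rule to zettabyte and yottabyte is an assumption (those two are also commonly quoted in decimal, as 10^21 and 10^24 bytes).

```python
# Binary (power-of-two) storage units, in increasing order.
UNITS = ["kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]


def unit_size(name: str) -> int:
    """Return the size in bytes of a binary unit: 2 ** (10 * rank)."""
    rank = UNITS.index(name) + 1  # kilobyte has rank 1, megabyte rank 2, ...
    return 2 ** (10 * rank)


for name in UNITS:
    exponent = 10 * (UNITS.index(name) + 1)
    print(f"1 {name} = 2^{exponent} bytes = {unit_size(name):,} bytes")
```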
Information is at the heart of business operations and central to the work of decision makers.
Database management systems gave access to the stored data, but this was only a small part of what could be gained from the data.
Online transaction processing (OLTP) systems are good at putting data into databases quickly, safely, and efficiently, but are not good at delivering meaningful analysis in return.
Analyzing data can provide further knowledge about a business and helps it make decisions efficiently and effectively.
What Is Data Mining?
Data mining: extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
Data mining: a misnomer?
Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Not everything is data mining, for example:
  Simple search and query processing
  (Deductive) expert systems
This is a view from typical database systems and data warehousing communities.
Data mining plays an essential role in the knowledge discovery process.
[Figure: the knowledge discovery process. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
Web mining usually involves:
  Data cleaning
  Data integration from multiple sources
  Warehousing the data
  Data cube construction
  Data selection for data mining
  Data mining
  Presentation of the mining results
  Patterns and knowledge to be used or stored into a knowledge base
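The steps above can be sketched as a chain of small functions. This is a toy illustration under stated assumptions, not a real web mining system: the records, the field names `user` and `page`, and every function name are hypothetical, the "mining" step is just frequency counting, and the warehousing and data cube steps are omitted.

```python
from collections import Counter

# Two hypothetical sources of page-visit records (made-up data).
source_a = [{"user": "u1", "page": "home"}, {"user": "u2", "page": None}]
source_b = [{"user": "u3", "page": "home"}, {"user": "u1", "page": "cart"}]


def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(*sources):
    """Data integration: combine records from multiple sources."""
    return [r for src in sources for r in src]


def select(records, field):
    """Data selection: keep only the task-relevant attribute."""
    return [r[field] for r in records]


def mine(values):
    """Data mining (toy version): count how often each value occurs."""
    return Counter(values)


def present(pattern_counts, top=1):
    """Presentation: report the most frequent values."""
    return pattern_counts.most_common(top)


integrated = integrate(clean(source_a), clean(source_b))
pages = select(integrated, "page")
result = present(mine(pages))
print(result)  # [('home', 2)]
```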
[Figure: data mining in business intelligence, layers from top to bottom]
Decision Making (End User)
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)
Data warehousing
Data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions.
Data warehouse systems are valuable tools in today's competitive, fast-evolving world.
According to William H. Inmon, a leading architect in the construction of data warehouse systems, a data warehouse is
KDD vs. ML/Stat. vs. Business Intelligence
  Depending on the data, applications, and your focus
Business intelligence view
  Business objects vs. data mining tools
  Supply chain example: mining vs. OLAP vs. presentation tools
  Data presentation vs. data exploration
A Multi-Dimensional View of Data Mining
Data to be mined
  Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multimedia, graphs, and social and information networks
Knowledge to be mined (or: data mining functions)
  Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  Descriptive vs. predictive data mining
  Multiple/integrated functions and mining at multiple levels
Techniques utilized
  Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
Applications adapted
  Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
What Kinds of Data Can Be Mined?
Relational database, data warehouse, transactional database (we focus on this category in this course)
A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows).
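As a small illustration of these terms, the sketch below builds one named table with three attributes and a few tuples, using Python's built-in `sqlite3` module. The table name `customer`, its columns, and the rows are made up for the example.

```python
import sqlite3

# An in-memory relational database holding a single named table.
conn = sqlite3.connect(":memory:")

# The table has a unique name and a set of attributes (columns).
conn.execute(
    "CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
)

# Each inserted row is one tuple (record).
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Ann", 34), (2, "Bob", 51), (3, "Cy", 27)],
)

# A relational query selects attributes of the tuples that satisfy a condition.
rows = conn.execute(
    "SELECT name, age FROM customer WHERE age > 30 ORDER BY cust_id"
).fetchall()
print(rows)  # [('Ann', 34), ('Bob', 51)]
conn.close()
```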
Data warehouse
Data warehouses are constructed via a process of data cleaning, data integration, data transformation, and periodic data refreshing.
To facilitate decision making, the data are subject oriented. Ex. major subjects are customer, item, supplier, and activity.
A data warehouse is usually modeled by a multidimensional structure, called a data cube.
Each dimension corresponds to an attribute.
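One way to picture a data cube is as a mapping from combinations of dimension values to an aggregated measure. The sketch below is a simplified illustration, not how a warehouse engine stores cubes: the dimensions `item`, `city`, and `quarter`, the sales figures, and the function name `cuboid` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact records: (item, city, quarter, sales amount).
facts = [
    ("TV", "Chicago", "Q1", 100),
    ("TV", "Chicago", "Q2", 150),
    ("TV", "Toronto", "Q1", 80),
    ("PC", "Chicago", "Q1", 200),
]
DIMENSIONS = ("item", "city", "quarter")


def cuboid(facts, keep):
    """Sum sales grouped by the dimensions in `keep`; all other dimensions are summed out."""
    idx = [DIMENSIONS.index(d) for d in keep]
    cells = defaultdict(int)
    for record in facts:
        key = tuple(record[i] for i in idx)
        cells[key] += record[-1]  # the last field is the measure
    return dict(cells)


by_item_city = cuboid(facts, ("item", "city"))  # quarter summed out
by_item = cuboid(facts, ("item",))              # city and quarter summed out
total = cuboid(facts, ())                       # every dimension summed out

print(by_item_city)
print(by_item)
print(total)
```

Each call keeps fewer dimensions than the one before, which corresponds to moving to a coarser level of summarization over the same facts.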
A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the transaction (such as items purchased in a store).
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structured data, graphs, social networks, and information networks
Spatial data and spatiotemporal data
Multimedia databases
Text databases
The World-Wide Web
What Kinds of Patterns Can Be Mined?
Characterization and discrimination
Frequent patterns, association, and correlation
Classification and regression
Clustering analysis
Outlier analysis
Data characterization is a summarization of the general characteristics or features of a target class of data.
Ex. to study the characteristics of software products whose sales increased by 10% in the last year
Data discrimination is a comparison of the general characteristics or features of a target class with those of one or more comparison classes (often called the contrasting classes).
Ex. to compare the general features of software products whose sales increased by 10% in the last year with those whose sales decreased by at least 30% during the same period
Frequent patterns are patterns that occur frequently in data.
There are many kinds of frequent patterns, including itemsets, subsequences, and substructures.
Frequent itemset: a set of items that frequently appear together in a transactional data set, such as milk and bread
Frequent subsequence: a pattern such as customers tending to purchase first a PC, followed by a digital camera, and then a memory card
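To make "frequently appear together" concrete, the sketch below counts, for a handful of made-up transactions, how many transactions contain each itemset of size one or two, and keeps those that reach a minimum support count. It is a brute-force enumeration for illustration only (not Apriori or any other optimized algorithm); the transaction IDs, items, and the name `frequent_itemsets` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: trans ID -> set of items purchased.
transactions = {
    "T100": {"milk", "bread", "butter"},
    "T200": {"milk", "bread"},
    "T300": {"bread", "eggs"},
    "T400": {"milk", "bread", "eggs"},
}


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return itemsets (as sorted tuples) contained in at least `min_support` transactions."""
    counts = Counter()
    for items in transactions.values():
        for size in range(1, max_size + 1):
            # Sorting gives every itemset a single canonical tuple form.
            for itemset in combinations(sorted(items), size):
                counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


frequent = frequent_itemsets(transactions, min_support=3)
print(frequent)
```

With a minimum support of 3, milk and bread qualify both individually and as a pair, which matches the milk-and-bread example above.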
Classification is the process of finding a model that describes and distinguishes data classes.
The model is based on the analysis of a set of training data.
The model can then be used to predict the class of objects whose class label is unknown.
The model may be represented in various forms.
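A minimal sketch of this train-then-predict cycle follows. The choice of a nearest-centroid model is an assumption made for brevity; the slides do not prescribe a particular model. The feature vectors, the labels `low` and `high`, and the function names are made up.

```python
def train(training_data):
    """Build a model: the mean feature vector (centroid) of each class."""
    sums, counts = {}, {}
    for features, label in training_data:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(model, features):
    """Predict the class whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((c - x) ** 2 for c, x in zip(centroid, features))
    return min(model, key=lambda label: sq_dist(model[label]))


# Hypothetical training set: (feature vector, class label).
training_data = [
    ((25, 20), "low"), ((30, 25), "low"),
    ((45, 80), "high"), ((50, 90), "high"),
]
model = train(training_data)

# Objects whose class label is unknown.
print(predict(model, (28, 30)))
print(predict(model, (48, 70)))
```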
Clustering
Objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.
Clusters of objects are formed so that objects within a cluster have high similarity in comparison to one another, but are dissimilar to objects in other clusters.
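The principle can be illustrated with a tiny k-means loop on one-dimensional values. k-means is used here as one common clustering method, which is an assumption since the slides name no algorithm; the data, the initial centers, and the function name `kmeans_1d` are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on numbers, starting from the given initial centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = kmeans_1d(values, centers=[1.0, 12.0])
print(centers)
print(clusters)
```

Values inside each resulting cluster lie close to one another and far from the values in the other cluster.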
Outlier analysis
A database may contain objects that do not comply with the general behavior or model of the data. These data objects are outliers. Most data mining methods discard outliers as noise or exceptions.
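One simple way to flag objects that do not comply with the general behavior of the data is a z-score test, sketched below. This is only one of many possible approaches and is an assumption here; the readings, the threshold of 2.0, and the name `zscore_outliers` are illustrative.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


readings = [10, 12, 11, 13, 12, 11, 95]  # made-up data with one extreme value
outliers = zscore_outliers(readings)
print(outliers)  # [95]
```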
What Kinds of Technologies Are Used?
[Figure: data mining at the confluence of multiple disciplines: Applications, Visualization, Algorithm, Database Technology, High-Performance Computing]
Tremendous amount of data
  Algorithms must be scalable to handle big data
High dimensionality of data
  Micro-array data may have tens of thousands of dimensions
High complexity of data
  Data streams and sensor data
  Time-series data, temporal data, sequence data
  Structured data, graphs, social and information networks
  Spatial, spatiotemporal, multimedia, text, and Web data
  Software programs, scientific simulations
What Kinds of Applications Are Targeted?
Business Intelligence (BI)
  BI technologies provide historical, current, and predictive views of business operations.
  Data mining is the core of business intelligence.
  OLAP tools rely on data warehousing and multidimensional data mining.
  Classification is the core of predictive analytics in BI.
  Clustering plays a central role in customer relationship management, grouping customers based on their similarities.
Web page analysis: from web page classification and clustering to PageRank & HITS algorithms
Collaborative analysis & recommender systems
Basket data analysis to targeted marketing
Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
Data mining and software engineering
From major dedicated data mining systems/tools (e.g., SAS, MS SQL Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining
Summary
Data mining: discovering interesting patterns and knowledge from massive amounts of data
A natural evolution of science and information technology, in great demand, with wide applications
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Mining can be performed on a variety of data
Data mining functionalities: characterization, discrimination, association, classification, clustering, trend and outlier analysis, etc.
Data mining technologies and applications
Major issues in data mining
Mining Methodology
  Mining various and new kinds of knowledge
  Mining knowledge in multi-dimensional space
  Data mining: an interdisciplinary effort
  Boosting the power of discovery in a networked environment
  Handling noise, uncertainty, and incompleteness of data
  Pattern evaluation and pattern- or constraint-guided mining
User Interaction
  Interactive mining
  Incorporation of background knowledge
  Presentation and visualization of data mining results
KDD conferences
  ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
  SIAM Data Mining Conf. (SDM)
  (IEEE) Int. Conf. on Data Mining (ICDM)
  European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery and Data Mining (ECML-PKDD)
  Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
  Int. Conf. on Web Search and Data Mining (WSDM)
Other related conferences
  DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT
  Web and IR conferences: WWW, SIGIR, WSDM
  ML conferences: ICML, NIPS
  PR conferences: CVPR
Journals
  Data Mining and Knowledge Discovery (DAMI or DMKD)
  IEEE Trans. on Knowledge and Data Eng. (TKDE)
  KDD Explorations
  ACM Trans. on KDD
Data mining and KDD
  Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
  Journals: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
Database systems
  Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
  Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
AI and machine learning
  Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
  Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
Web and IR
  Conferences: SIGIR, WWW, CIKM, etc.
  Journals: WWW: Internet and Web Information Systems
Statistics
  Conferences: Joint Stat. Meeting, etc.
  Journals: Annals of Statistics, etc.
Visualization
  Conference proceedings: CHI, ACM-SIGGraph, etc.
  Journals: IEEE Trans. Visualization and Computer Graphics, etc.