Class 1


Ernestina Menasalvas

Facultad de Informática.
Universidad Politécnica de Madrid
emenasalvas@fi.upm.es
Sources
These slides were prepared using the following sources:
Data Mining Course by Gregory Piatetsky-Shapiro, http://www.kdnuggets.com/dmcourse/index.html
Data Mining by Tan, Steinbach, Kumar
Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems, Jim Gray, Series Editor. Morgan Kaufmann Publishers, August 2000. 550 pages. ISBN 1-55860-489-8. http://www.cs.sfu.ca/~han/DM_Book.html
ECML/PKDD 2004, Pisa. Tutorial on Evaluation in Web Mining. M. Spiliopoulou, B. Berendt, E. Menasalvas
Weka. http://www.cs.waikato.ac.nz/~ml/weka/
Modeling the Internet and the Web. School of Information and Computer Science, University of California, Irvine


Course outline
Introduction
Types of data mining tasks
Preliminary concepts
Classification
Basic approaches
Advanced approaches
Evaluating results
Segmentation
Association
The data mining process: CRISP-DM
Reviewing the life cycle of a data mining project: requirements, preprocessing
Introduction
Trends leading to Data Flood [piatesky05]
More data is generated:
Bank, telecom, other business transactions ...
Scientific data: astronomy, biology, etc.
Web, text, and e-commerce
Data Growth Rate [piatesky05]
Twice as much information was created in 2002 as in 1999 (~30% growth rate)
Other growth rate estimates even higher
Very little data will ever be looked at by a human
Knowledge Discovery is NEEDED to make sense and use of data.
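
As a rough check of the growth figure quoted on this slide, the annual rate implied by a doubling between 1999 and 2002 can be computed directly. This is only a sketch: it assumes steady compound growth over the three years, which the slide does not state, and the variable names are illustrative.

```python
# Doubling between 1999 and 2002 means a factor of 2 over 3 years.
years = 2002 - 1999
growth_factor = 2.0

# The compound annual growth rate r solves (1 + r) ** years == growth_factor.
annual_rate = growth_factor ** (1 / years) - 1

print(f"Implied annual growth rate: {annual_rate:.1%}")
```

Under that assumption the implied rate comes out somewhat below 30%, so the slide's "~30%" is best read as a rounded figure.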

Machine Learning / Data Mining Application areas [piatesky05]
Science
astronomy, bioinformatics, drug discovery, ...
Business
advertising, CRM (Customer Relationship Management), investments, manufacturing, sports/entertainment, telecom, e-Commerce, targeted marketing, health care, ...
Web:
search engines, bots, ...
Government
law enforcement, profiling tax cheaters, anti-terror(?)
Why Mine Data? Commercial Viewpoint
Lots of data is being collected and warehoused
Web data, e-commerce
purchases at department/grocery stores
Bank/Credit Card transactions
Computers have become cheaper and more powerful
Competitive Pressure is Strong
Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Mine Data? Scientific Viewpoint
Data collected and stored at enormous speeds (GB/hour)
remote sensors on a satellite
telescopes scanning the skies
microarrays generating gene expression data
scientific simulations generating terabytes of data
Traditional techniques infeasible for raw data
Data mining may help scientists
in classifying and segmenting data
in Hypothesis Formation
Necessity is the Mother of Invention [piatesky05]
Data explosion problem
Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
We are drowning in data, but starving for knowledge!
Solution: Data warehousing? and data mining
Data warehousing and on-line analytical processing
Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
What is Data Mining?
Many Definitions
Non-trivial extraction of implicit, previously unknown and potentially useful information from data
Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

What is (not) Data Mining?
What is Data Mining?

Certain names are more prevalent in certain US locations (O'Brien, O'Rurke, O'Reilly ... in Boston area)
Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com, ...)
What is not Data Mining?
Look up phone number in phone directory
Query a Web search engine for information about "Amazon"
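
The contrast on this slide can be illustrated with a very small sketch. It sets a direct lookup (not data mining) against a summary over all records that surfaces a regularity nobody queried for explicitly. All records, surnames-to-city pairs, and phone numbers below are hypothetical and invented purely for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy records (surname, city), invented for illustration only.
records = [
    ("O'Brien", "Boston"), ("O'Reilly", "Boston"), ("O'Brien", "Boston"),
    ("Smith", "Boston"), ("Garcia", "Miami"), ("Garcia", "Miami"),
    ("Smith", "Miami"), ("O'Brien", "Miami"),
]

# Not data mining: a direct lookup of a known key in a directory.
phone_directory = {"O'Brien": "555-0100", "Smith": "555-0101"}  # made-up numbers
print(phone_directory["O'Brien"])

# Closer to data mining: scan every record and summarise it to find
# which surname is most prevalent in each location.
by_city = defaultdict(Counter)
for surname, city in records:
    by_city[city][surname] += 1

most_common = {
    city: counts.most_common(1)[0][0] for city, counts in by_city.items()
}
print(most_common)
```

Real data mining works on far larger data and with more robust methods than a frequency count. The sketch only shows the difference between retrieving a stored fact and discovering a pattern.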

Data Mining for Customer Modeling [piatesky05]
Customer Tasks:
attrition prediction
targeted marketing: cross-sell, customer acquisition
credit-risk
fraud detection
Industries
banking, telecom, retail sales, ...
WHY?
Companies of all sizes need to learn from their data in order to build a one-to-one relationship with their customers.
Companies collect data from all of their processes.
The collected data must be analyzed, understood, and turned into actionable information, and this is where Data Mining plays its role
Data Mining provides the intelligence
The Data Warehouse provides the data.
Intelligence makes it possible to search that data for patterns, discover rules, find new ideas to test, and make predictions about the future
We will study the techniques and tools that add intelligence to the data warehouse in order to exploit customer data and get the most out of it
How do they help us?
Which customers will remain loyal?
Which customers are about to leave?
Where should we locate the next branch?
Which products should be promoted to which prospects?
...
The answers to these questions are buried in the data, and Data Mining techniques are needed to find them
Intuitive definition
Data Mining (in this context) is the analysis and exploration, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful (useful) patterns and rules.
The goal is to enable the organization to improve its sales, its marketing campaigns, and its customer support operations through a better understanding of its customers
Why now?
The techniques covered in this course have existed for years, but the convergence of the following factors has brought them back into the spotlight:
Volume of data produced
Data is integrated (data warehouse)
Computing power
Strong competitive pressure
Data mining software
Data Mining
Two major objectives
Prediction
Knowledge discovery
Draws on techniques from 3 fields:
Databases
Statistics
Machine learning
So many things?
Typical problems
Forecasting
Classification
Regression
Time series
Knowledge discovery
Bias detection
Database segmentation
Clustering
Association rules
Reporting
Visualisation
Text Search
What is the course focus?
Related Fields

[Figure: Data Mining and Knowledge Discovery and its related fields: Statistics, Machine Learning, Databases, Visualization]
Data Mining: a process
Knowledge Discovery Definition
Knowledge Discovery in Data is the non-trivial process of identifying
valid
novel
potentially useful
and ultimately understandable patterns in data.
From Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press, 1996
Analyzing the definition: Data?
A collection of objects and their attributes
An attribute is a property or characteristic of an object
Examples: eye color of a person, temperature, etc.
Attribute is also known as variable, field, characteristic, or feature
A collection of attributes describes an object
Object is also known as record, point, case, sample, entity, or instance

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Columns are attributes; rows are objects.
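
The object/attribute view of the table above can be written as a small data structure, with one record per object and one field per attribute. This is a sketch only: the class name `TaxRecord` and its field names are illustrative choices, not part of the slides, while the values are copied from the table.

```python
from dataclasses import dataclass, fields


@dataclass
class TaxRecord:
    """One object (row); each field is one attribute (column)."""
    tid: int
    refund: bool
    marital_status: str
    taxable_income_k: int  # taxable income in thousands
    cheat: bool


# Values copied from the table: Tid, Refund, Marital Status, Taxable Income, Cheat.
rows = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

records = [
    TaxRecord(tid, refund == "Yes", status, income, cheat == "Yes")
    for tid, refund, status, income, cheat in rows
]

attribute_names = [f.name for f in fields(TaxRecord)]
print(f"{len(records)} objects, {len(attribute_names)} attributes: {attribute_names}")
```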
KDD: analyzing the definition
Non-trivial process of identifying patterns that are
valid
novel
potentially useful
and ultimately understandable in the data.
Pattern: any high-level description of the data
The KDD process
Data -> SELECTION -> Target data
Target data -> CLEANING -> Processed data
Processed data -> CODING -> Transformed data
Transformed data -> DATA MINING -> Models
Models -> INTERPRETATION AND EVALUATION -> Knowledge
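
The stages of the KDD process (selection, cleaning, coding, data mining, interpretation and evaluation) can be sketched as a chain of functions, each taking the output of the previous one. This is a minimal illustration, not a real implementation: the records are hypothetical, the function names are invented here, and the "model" is just a per-class mean.

```python
from statistics import mean

# Hypothetical raw records, invented for illustration; None marks a missing value.
raw_data = [
    {"id": 1, "income_k": 125, "notes": "a", "cheat": "No"},
    {"id": 2, "income_k": 100, "notes": "b", "cheat": "No"},
    {"id": 3, "income_k": None, "notes": "c", "cheat": "No"},
    {"id": 4, "income_k": 95, "notes": "d", "cheat": "Yes"},
    {"id": 5, "income_k": 85, "notes": "e", "cheat": "Yes"},
    {"id": 6, "income_k": 60, "notes": "f", "cheat": "No"},
]


def select(data):
    """SELECTION: keep only the attributes relevant to the question."""
    return [{"income_k": r["income_k"], "cheat": r["cheat"]} for r in data]


def clean(data):
    """CLEANING: drop records with missing values."""
    return [r for r in data if r["income_k"] is not None]


def code(data):
    """CODING: turn the Yes/No label into a boolean."""
    return [{"income_k": r["income_k"], "cheat": r["cheat"] == "Yes"} for r in data]


def mine(data):
    """DATA MINING: summarise each class by its mean income (a toy 'model')."""
    return {
        label: mean(r["income_k"] for r in data if r["cheat"] == label)
        for label in (True, False)
    }


def interpret(model):
    """INTERPRETATION AND EVALUATION: state the pattern in readable form."""
    return (f"mean income: cheat={model[True]:.1f}K, "
            f"no cheat={model[False]:.1f}K")


target = select(raw_data)        # target data
processed = clean(target)        # processed data
transformed = code(processed)    # transformed data
model = mine(transformed)        # models
knowledge = interpret(model)     # knowledge
print(knowledge)
```

In practice the process is iterative: what is learned during evaluation often sends the analyst back to an earlier stage.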
The data mining cycle
Identify a problem
Use data mining to transform the data into information
Act on the information
Measure the results
Important
The promise of Data Mining is to find patterns
Simply finding the patterns is not enough
We must be able to understand the patterns, respond to them, and act on them, so as to ultimately turn data into information, information into action, and action into value for the company
Data Mining: summary
Data Mining is a process that must focus on the actions derived from knowledge discovery, not on the discovery mechanism itself.
Although the algorithms are important, the solution is more than a set of techniques and tools.
The techniques must be applied in the right situation to the right data
References
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From Data Mining to Knowledge Discovery: An Overview. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (eds.), Advances in Knowledge Discovery and Data Mining, pp. 1-34. AAAI Press / The MIT Press, Menlo Park, CA, 1996.
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of ACM, 39:58-64, 1996.
G. Piatetsky-Shapiro, U. Fayyad, and P. Smyth. From data mining to knowledge discovery: An overview. In U. M. Fayyad, et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
http://www.cs.sfu.ca/~han
M. J. A. Berry and G. Linoff. Data Mining Techniques. John Wiley, 1997.
P. Adriaans and D. Zantinge. Data Mining. Addison-Wesley, 1996.
Z. Chen. Data Mining and Uncertain Reasoning. John Wiley & Sons, 2001.
