Professional Documents
Culture Documents
Presented By: - Preeti Kudva (106887833) - Kinjal Khandhar (106878039)
- Data Mining: Concepts & Techniques by Jiawei Han and Micheline Kamber.
- Presentation slides of Prof. Anita Wasilewska.
- http://en.wikipedia.org/wiki/Extract,_transform,_load
- Ralph Kimball, Joe Caserta, The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming and Delivering Data.
- Panos Vassiliadis, Alkis Simitsis, Spiros Skiadopoulos, "Conceptual Modeling for ETL Processes".
- http://en.wikipedia.org/wiki/Category:ETL_tools
- http://www.1keydata.com/datawarehousing/tooletl.html
- http://www.bi-bestpractices.com/view-articles/4738
- http://www.computerworld.com/databasetopics/data/story/0,10801,80222,00.html
What is ETL?
ETL in the architecture
General ETL issues
- Extract
- Transformations/cleansing
- Load
ETL example
Extract - Extract relevant data. Transform - Transform data to DW format. - Build DW keys, etc. - Cleansing of data. Load - Load data into DW. - Build aggregates, etc.
https://eprints.kfupm.edu.sa/74341/1/74341.pdf
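At a glance, the three steps can be sketched as a toy in-memory pipeline (illustrative only; the table and field names are invented, not from any real system):

```python
# Minimal ETL sketch: extract relevant fields, transform to DW format, load.
# source_rows and the field names are illustrative assumptions.

def extract(source_rows):
    """Pull only the relevant fields from the source records."""
    return [{"name": r["name"], "amount": r["amount"]} for r in source_rows]

def transform(rows):
    """Cleanse and convert to the DW format: trim names, cast amounts."""
    return [{"name": r["name"].strip().title(), "amount": float(r["amount"])}
            for r in rows]

def load(rows, warehouse):
    """Append the transformed rows to the (in-memory) warehouse table."""
    warehouse.extend(rows)
    return warehouse

warehouse = []
src = [{"name": "  alice  ", "amount": "10.5"}, {"name": "BOB", "amount": "3"}]
load(transform(extract(src)), warehouse)
```

A real pipeline would read from files or an RDBMS and write through a bulk loader, but the extract/transform/load separation is the same.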
Metadata
Data Warehouse
Serve
Data Marts
Data Sources
Data Storage
Front-End Tools
http://infolab.stanford.edu/warehousing/
Non-cooperative sources
- Snapshot sources: provide only a full copy of the source, e.g., files
- Specific sources: each is different, e.g., legacy systems
- Logged sources: write a change log, e.g., DB log
- Queryable sources: provide a query interface, e.g., RDBMS
Cooperative sources
- Replicated sources: publish/subscribe mechanism
- Call-back sources: call external code (ETL) when changes occur
- Internal action sources: only internal actions when changes occur, e.g., DB triggers
https://intranet.cs.aau.dk/fileadmin/user_upload/Education/Courses/2009/DWML/slides/DW4_ETL.pdf
- Create/import data source definitions
- Define stage or work areas
- Validate connectivity
- Preview/analyze sources
- Define extraction scheduling
- Determine extract windows for the source system
- Batch extract (overnight, weekly, monthly)
- Continuous extracts (trigger on source table)
Design Time
Connect to the predefined data sources as scheduled. Get the raw data and save it locally in the workspace DB.
Run Time
Common transformations:
- Convert data into a consistent, standardized form
- Cleanse (automated): synonym substitutions, spelling corrections
- Encode free-form values (e.g., map Male to 1 and Mr to M)
- Merge/purge (join data from multiple sources)
- Aggregate (e.g., rollup)
- Calculate (e.g., sale_amt = qty * price)
- Data type conversion
- Data content audit
- Null value handling (null = do not load)
- Customized transformations (based on user needs)
http://www.bi-bestpractices.com/view-articles/4738
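A few of the transformations listed above, sketched in Python (the mapping tables, field names, and the decision to skip rows with an unmappable gender are all invented for illustration):

```python
# Sketch of common transformations: synonym substitution, encoding
# free-form values, a calculated field, and null handling.

SYNONYMS = {"St.": "Street", "Rd.": "Road"}      # synonym substitution
GENDER_CODE = {"Male": 1, "Female": 2}           # encode free-form values
TITLE_CODE = {"Mr": "M", "Mrs": "F", "Ms": "F"}

def cleanse(record):
    out = dict(record)
    # synonym substitution on the street field
    for abbrev, full in SYNONYMS.items():
        out["street"] = out["street"].replace(abbrev, full)
    # encoding free-form values (e.g., map Male to 1 and Mr to M)
    out["gender"] = GENDER_CODE.get(out["gender"])
    out["title"] = TITLE_CODE.get(out["title"])
    # calculate: sale_amt = qty * price
    out["sale_amt"] = out["qty"] * out["price"]
    # null handling: null = do not load this row
    if out["gender"] is None:
        return None
    return out

rec = {"street": "12 Elm St.", "gender": "Male", "title": "Mr",
       "qty": 3, "price": 2.5}
clean = cleanse(rec)
```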
Normalization/Denormalization
- To the desired DW format. - Depending on source format.
Building keys
- A table maps production keys to surrogate DW keys.
- Correct handling of history is needed - especially for a total reload.
https://intranet.cs.aau.dk/fileadmin/user_upload/Education/Courses/2009/DWML/slides/DW4_ETL.pdf
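The key-mapping table can be sketched as follows (a toy in-memory version; in practice the mapping table lives in the DW itself so the production-key-to-surrogate-key assignment survives a total reload):

```python
# Sketch: map production keys to surrogate DW keys via a lookup table.

class KeyMapper:
    def __init__(self):
        self.mapping = {}   # production key -> surrogate key
        self.next_key = 1

    def surrogate(self, production_key):
        """Return the existing surrogate key, or assign the next one."""
        if production_key not in self.mapping:
            self.mapping[production_key] = self.next_key
            self.next_key += 1
        return self.mapping[production_key]

km = KeyMapper()
# The same production key always resolves to the same surrogate key.
keys = [km.surrogate(pk) for pk in ["CUST-042", "CUST-007", "CUST-042"]]
```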
Why cleansing? Garbage in, garbage out.
- BI does not work on raw data: pre-processing is necessary for BI analysis
- Handle inconsistent data formats: spellings, codings, ...
- Remove unnecessary attributes: production keys, comments, ...
- Replace codes with text for easy understanding: city name instead of ZIP code, e.g., Aalborg Centrum vs. DK-9000
- Combine data from multiple sources with a common key: e.g., customer data from customer address, customer name, ...
Aalborg University 2009 - DWML course
Don't use special values (e.g., 0, -1) in your data - they are hard to understand in query/analysis operations.
Mark facts with a Data Status dimension
- Normal, abnormal, outside bounds, impossible, ...
- Facts can be taken in/out of analyses.
Uniform treatment of NULL
- Use NULLs only for measure values (or use estimates instead?)
- Use a special dimension key (i.e., surrogate key value) for NULL dimension values. E.g., for the time dimension, instead of NULL, use special key values to represent "Date not known" or "Soon to happen". This avoids problems in joins, since NULL is not equal to NULL.
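The special-key idea can be sketched like this (the key values 0 and -1 and the "future" marker are illustrative choices, not a standard):

```python
# Sketch: special surrogate keys instead of NULL for dimension values,
# so joins and grouping never have to compare NULL to NULL.

DATE_NOT_KNOWN = 0    # illustrative key value for "Date not known"
SOON_TO_HAPPEN = -1   # illustrative key value for "Soon to happen"

def time_key_for(date_string, lookup):
    """Resolve a date to its time-dimension key, with special keys for NULLs."""
    if date_string is None:
        return DATE_NOT_KNOWN
    if date_string == "future":
        return SOON_TO_HAPPEN
    return lookup[date_string]

lookup = {"2009-01-15": 101}
keys = [time_key_for(d, lookup) for d in ["2009-01-15", None, "future"]]
```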
Data almost never has decent quality. Data in a DW must be: 1] Precise - DW data must match known numbers, or an explanation is needed.
4] Unique - The same thing is called the same name and has the same key (e.g., customers).
5] Timely - Data is updated frequently enough and the users know when.
Appoint a data quality administrator
- Responsible for data quality.
- Includes manual inspections and corrections!
Source-controlled improvements. Construct programs that check data quality:
- Are totals as expected?
- Do results agree with an alternative source?
- Number of NULL values?
- Specify criteria/filters for aggregation
- Define operators (mostly set/SQL based)
- Map columns using operators/lookups
- Define other transformation rules
- Define mappings and/or add new fields
Design Time
- Transform (cleanse, consolidate, apply business rules, de-normalize/normalize) the extracted data by applying the operators mapped at design time
- Aggregate (create and populate the raw table)
- Create and populate the staging table
Run Time
Goal: fast loading into the end target (DW).
- Loading chunks is much faster than a total load.
- SQL-based update is slow: large overhead (optimization, locking, etc.) for every SQL call. DB load tools are much faster.
- Indexes on tables slow the load a lot: drop indexes and rebuild them after the load. Can be done per index partition.
Parallellization - Dimensions can be loaded concurrently - Fact tables can be loaded concurrently - Partitions can be loaded concurrently
http://en.wikipedia.org/wiki/Extract,_transform,_load
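Two of the tuning ideas above, chunked loading and building the index only after the load, can be sketched with SQLite standing in for the DW (the schema and row values are invented for illustration):

```python
import sqlite3

# Sketch: load in chunks with executemany, build the index after the load.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (time_key INT, item_key INT, units INT)")

rows = [(t, i, t * i) for t in range(1, 101) for i in range(1, 11)]
CHUNK = 200
for start in range(0, len(rows), CHUNK):
    # Each chunk is one bulk insert rather than one SQL call per row.
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                     rows[start:start + CHUNK])
conn.commit()

# Index built after the bulk load, not before, so inserts stay fast.
conn.execute("CREATE INDEX idx_time ON sales_fact (time_key)")

count = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
```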
- Referential integrity and data consistency must be ensured before loading (why? because they won't be checked in the DW again)
- Can be done by the loader
- Can be built and loaded at the same time as the detail data.
Aggregates
- Load without log.
- Sort the load file first.
- Make only simple transformations in the loader.
- Use loader facilities for building aggregates.
http://en.wikipedia.org/wiki/Extract,_transform,_load
Load tuning
Design Time
Design Warehouse. Map Staging Data to fact or dimension table attributes.
Run Time
Publish staging data to the data mart (update dimension tables along with the fact tables).
From big vendors:
- Oracle Warehouse Builder
- IBM DB2 Warehouse Manager
- Microsoft Integration Services
These offer much functionality at a reasonable price:
- Data modeling
- ETL code generation
- Scheduling DW jobs
The best tool does not exist - choose based on your own needs.
http://en.wikipedia.org/wiki/Category:ETL_tools
new_empl.dbf (sample rows; fields lost in extraction are left blank)

NAME              STREET               CITY          STATE  ZIP    DEAR_WHO  TEL_HOME       BIRTH_DATE  HIRE_DATE  INSIDE
Guiles, Makenzie  145 Meadowview Road  South Hadley  MA     01075  Macy      (413)555-6225  19770201    20060703   yes
Forbes, Andrew    12 Yarmouth Drive    Holyoke       MA     01040  Andy      (413)555-8863
                                       Sandwich      MA     02537  Al        (508)555-8974  19861221    20060717   no
Mellon, Philip                                       MA     01321  Phil      (781)555-4289  19700625    20060724   no
Clark, Pamela     44 Mayberry Circle   Bedford       MA     01730  Pam       (860)555-6447
Take the data from the dBase III file and convert it into a more usable format - XML. Extraction can be done using XML converters: select the table, choose the dBase III converter, and it will transfer the data into XML. The result of this extraction will be an XML file similar to this:

<?xml version="1.0" encoding="UTF-8"?>
<table date="20060731" rows="5">
  <row row="1">
    <NAME>Guiles, Makenzie</NAME>
    <STREET>145 Meadowview Road</STREET>
    <CITY>South Hadley</CITY>
    <STATE>MA</STATE>
    <ZIP>01075</ZIP>
    <DEAR_WHO>Macy</DEAR_WHO>
    <TEL_HOME>(413)555-6225</TEL_HOME>
    <BIRTH_DATE>19770201</BIRTH_DATE>
    <HIRE_DATE>20060703</HIRE_DATE>
    <INSIDE>yes</INSIDE>
  </row>
  ...
</table>
Find out the target schema, which can be done using the DB to XML Data Source module. Here we use the Northwind database that comes with standard SQL Server (saved as etltarget.rdbxml).
In a production ETL operation, likely each step would be more complicated, and/or would use different technologies or methods.
1] Convert the dates from CCYYMMDD into CCYY-MM-DD (the "ISO 8601" format) [etl-code-1.xsl]
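The same conversion, outside XSLT, as a quick Python check (assuming well-formed CCYYMMDD input):

```python
# CCYYMMDD -> CCYY-MM-DD (ISO 8601), the conversion step 1 performs in XSLT.
def to_iso8601(ccyymmdd):
    return f"{ccyymmdd[0:4]}-{ccyymmdd[4:6]}-{ccyymmdd[6:8]}"

iso = to_iso8601("19770201")
```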
Loading is just a matter of writing the output of the last XSLT transform step into the etltarget.rdbxml map we built earlier.
- Data Mining: Concepts & Techniques by Jiawei Han and Micheline Kamber.
- http://personalpages.manchester.ac.uk/staff/G.Nenadic/CN3023/Lecture4.pdf
- http://en.wikipedia.org/wiki/Online_Analytical_Processing
- http://www.cs.sfu.ca/~han
- http://en.wikipedia.org/wiki/OLAP_cube#cite_note-OLAPGlossary1995-5
- http://www.fmt.vein.hu/softcom/dw
OLAP OLAP Cube & Multidimensional data OLAP Operations: - Roll up (Drill up) - Drill down (Roll down) - Slice & Dice - Pivot - Other operations Examples
Online analytical processing, or OLAP, is an approach to quickly answer multi-dimensional analytical queries.[http://en.wikipedia.org/wiki/Online_analytical_processing] The typical applications of OLAP are in business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas. The term OLAP was created as a slight modification of the traditional database term OLTP (Online Transaction Processing).[http://en.wikipedia.org/wiki/Online_analytical_processing]
Data warehouse & OLAP tools are based on a multidimensional data model which views data in the form of a data cube. An OLAP (Online Analytical Processing) cube is a data structure that allows fast analysis of data. The OLAP cube consists of numeric facts called measures which are categorized by dimensions. -Dimensions: perspective or entities with respect to which an organization wants to keep records. -Facts: quantities by which we want to analyze relations between dimensions. The cube metadata may be created from a star schema or snowflake schema of tables in a relational database. Measures are derived from the records in the fact table and dimensions are derived from the dimension tables.
Reference: http://en.wikipedia.org/wiki/OLAP_cube#cite_note-OLAPGlossary1995-5
A concept hierarchy defines a sequence of mappings from a set of low level concepts to higher level, more general concepts.
Each of the elements of a dimension could be summarized using a hierarchy. The hierarchy is a series of parent-child relationships, typically where a parent member represents the consolidation of the members which are its children. Parent members can be further aggregated as the children of another parent.
Reference: http://en.wikipedia.org/wiki/OLAP_cube#cite_note-OLAPGlossary1995-5
Star schema example:

Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold (measure)

Dimension tables:
- time: time_key, day, day_of_the_week, month, quarter, year
- item: item_key, item_name, brand, type, supplier_type
- location: location_key, street, city, province_or_state, country
- branch: branch_key, branch_name, branch_type

Reference: http://www.cs.sfu.ca/~han
Example:
Dimensions: Item, Location, Time
Hierarchical summarization paths:
- Item: Type -> Item
- Location: Region
- Time: Year -> Month -> Day

Reference: http://www.cs.sfu.ca/~han
Reference: http://www.fmt.vein.hu/softcom/dw
Performs aggregation on a data cube, either by climbing up the concept hierarchy for a dimension or by dimension reduction.[http://www.cs.sfu.ca/~han] Specific grouping on one dimension, going from a lower level of aggregation to a higher one.
[http://personalpages.manchester.ac.uk/staff/G.Nenadic/CN3023/lecture4.pdf]
Reverse of roll-up.[http://www.cs.sfu.ca/~han] Navigates from less detailed data to more detailed data. Can be realized either by stepping down a concept hierarchy for a dimension or by introducing additional dimensions. Gives a finer-grained view of aggregated data, i.e., going from higher to lower aggregation.
[http://personalpages.manchester.ac.uk/staff/G.Nenadic/CN3023/lecture4.pdf]
Reference: http://www.fmt.vein.hu/softcom/dw
Slice:
- performs a selection on one dimension of the given cube, resulting in a subcube. E.g., slicing the volume of products on the product dimension for product_model = 1996.
Dice:
- performs a selection operation on two or more dimensions. E.g., dicing the central cube based on the following selection criteria: (location = "Montreal" or "Vancouver") and (time = "Q1" or "Q2") and (item = "cell phone" or "pager").
Reference: http://www.fmt.vein.hu/softcom/dw
Pivot rotates the data axes in view in order to provide an alternate presentation of the data, i.e., it selects a different dimension (orientation) for analysis. E.g., a pivot operation where location and item in a 2D slice are rotated. Other examples:
- rotating the axes in a 3D cube
- transforming a 3D cube into a series of 2D planes

[http://personalpages.manchester.ac.uk/staff/G.Nenadic/CN3023/lecture4.pdf]
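A pivot over a tiny fact list can be sketched like this (locations, items and quantities are invented for illustration):

```python
# Sketch of a pivot: the same sales facts presented with either
# dimension on the rows.

sales = [("Montreal", "phone", 5), ("Montreal", "pager", 2),
         ("Vancouver", "phone", 7)]

def pivot(facts, row_dim, col_dim):
    """Group (location, item, qty) facts into {row_value: {col_value: qty}}."""
    table = {}
    for rec in facts:
        dims = {"location": rec[0], "item": rec[1]}
        table.setdefault(dims[row_dim], {})[dims[col_dim]] = rec[2]
    return table

by_location = pivot(sales, "location", "item")   # locations on rows
by_item = pivot(sales, "item", "location")       # pivoted: items on rows
```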
Dimension Tables:
Fact Table:
Sales(Market_ID,Product_ID,Time_ID,Amount)
Reference: http://personalpages.manchester.ac.uk/staff/G.Nenandic/CN3023/lecture4.pdf
SELECT S.Product_ID, M.City, SUM(S.Amount)
INTO City_Sales
FROM Sales S, Market M
WHERE M.Market_ID = S.Market_ID
GROUP BY S.Product_ID, M.City
SELECT T.Product_ID, M.Region, SUM(T.Amount)
FROM City_Sales T, Market M
WHERE T.City = M.City
GROUP BY T.Product_ID, M.Region
Reference: http://personalpages.manchester.ac.uk/staff/G.Nenandic/CN3023/lecture4.pdf
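The two roll-up queries can be run end to end with SQLite standing in for the warehouse (the table contents are invented; SQLite has no SELECT ... INTO, so CREATE TABLE AS is used for the intermediate city-level table):

```python
import sqlite3

# Toy Market/Sales tables, then roll up market -> city -> region.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Market (Market_ID INT, City TEXT, Region TEXT)")
conn.execute("CREATE TABLE Sales (Market_ID INT, Product_ID INT, Amount REAL)")
conn.executemany("INSERT INTO Market VALUES (?, ?, ?)",
                 [(1, "Boston", "East"), (2, "Salem", "East")])
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?)",
                 [(1, 100, 10.0), (1, 100, 5.0), (2, 100, 2.0)])

# Roll up from market to city (materialized, as in the text)...
conn.execute("""CREATE TABLE City_Sales AS
    SELECT S.Product_ID, M.City, SUM(S.Amount) AS Amount
    FROM Sales S, Market M WHERE M.Market_ID = S.Market_ID
    GROUP BY S.Product_ID, M.City""")

# ...then from city to region.
region = conn.execute("""SELECT T.Product_ID, M.Region, SUM(T.Amount)
    FROM City_Sales T, Market M WHERE T.City = M.City
    GROUP BY T.Product_ID, M.Region""").fetchall()
```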
Slicing the data cube in the time dimension (e.g. choosing sales only in week 12)
SELECT S.*
FROM Sales S, Time T
WHERE T.Time_ID = S.Time_ID AND T.Week = 'Week12'
Dicing sales in the time and product dimensions (e.g., total sales per quarter for selected products in week 12)
SELECT S.Product_ID, T.Quarter, SUM(S.Amount)
FROM Sales S, Time T
WHERE T.Time_ID = S.Time_ID
  AND T.Week = 'Week12'
  AND (S.Product_ID = 1002 OR S.Product_ID = 1003)
GROUP BY T.Quarter, S.Product_ID
Reference: http://personalpages.manchester.ac.uk/staff/G.Nenandic/CN3023/lecture4.pdf
drill across:
executes queries involving more than one fact table.
drill through:
makes use of relational SQL facilities to drill through the bottom level of the cube to its back-end relational tables.
Reference: [http://www.cs.sfu.ca/~han]
Qiang Yang
Department of Computer Science, Hong Kong University of Science & Technology, Clearwater Bay, Kowloon, Hong Kong, China
Xindong Wu
Department of Computer Science University of Vermont 33 Colchester Avenue, Burlington, Vermont 05405, USA xwu@cs.uvm.edu
Presented in : ICDM '05 The Fifth IEEE International Conference on Data Mining
Pedro Domingos Charles Elkan Johannes Gehrke Jiawei Han David Heckerman Daniel Keim Jiming Liu
Gregory Piatetsky-Shapiro Vijay V. Raghavan Rajeev Rastogi Salvatore J. Stolfo Alexander Tuzhilin Benjamin W. Wah
What are the 10 most challenging problems in data mining today? Different people have different views, and the answer is a function of time as well. What do the experts think?
- Experts consulted: previous organizers of IEEE ICDM and ACM KDD.
- They were asked to list their 10 problems (requests sent out in Oct 05, replies obtained in Nov 05).
- Replies were edited and presented in this paper, hopefully useful for young researchers.
- Not in any particular order of importance.
The current state of the art of data-mining research is too "ad hoc":
- techniques are designed for individual problems (e.g., classification or clustering)
- no unifying theory
A theoretical framework is required that unifies: Data Mining tasks Clustering Classification Association Rules etc. Data Mining approaches Statistics Machine Learning Database systems etc.
Long-standing problems in statistical research:
- How to avoid spurious correlations? Sometimes related to the problem of mining for deep knowledge.
Example: a strong correlation was found between the timing of TV series featuring a particular star and the occurrence of small market crashes in Hong Kong. Can we conclude that there is a hidden cause behind the correlation?
Scaling up is needed because of the following challenges:
- Classifiers with hundreds of millions or billions of features need to be built for applications like text mining and drug safety analysis. Challenge: how to design classifiers to handle ultra-high-dimensional classification problems.
- Satellite and computer network data comprise extremely large databases (e.g., 100 TB). Data mining technology today is still slow. Challenge: how can data mining technology handle data of this scale?
Data mining should be a continuous, online process, rather than an occasional one-shot process. E.g., analysis of high-speed network traffic for identifying anomalous events. Challenge: how to compute models over streaming data which accommodate changing environments from which the data is drawn (concept drift or environment drift). Incremental mining and effective model updating are required to maintain accurate modeling of the current stream.
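Incremental model updating can be illustrated with the simplest possible streaming model: a running mean that folds in one observation at a time instead of recomputing over the whole stream (a toy sketch, not a concept-drift method):

```python
# Sketch of incremental model updating over a stream: the model is
# updated per record, never rebuilt from scratch.

class StreamingMean:
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        """Incrementally fold one observation into the model."""
        self.n += 1
        self.mean += (x - self.mean) / self.n

model = StreamingMean()
for value in [4.0, 8.0, 6.0]:
    model.update(value)
```

Real streaming miners add windowing or decay so the model tracks the *current* stream when the underlying distribution drifts.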
How to efficiently and accurately cluster, classify and predict trends in sequential and time-series data? Time-series data used for predictions are contaminated by noise. How to do accurate short-term and long-term predictions? Signal-processing techniques introduce lags in the filtered data, which reduces accuracy.
Real time series data obtained from Wireless sensors in Hong Kong UST CS department hallway
An important type of complex knowledge is in the form of graphs. Challenge: more research is required in the field of discovering graphs and structured patterns from large data. Data that are not i.i.d. (independent and identically distributed): many objects are not independent of each other, and are not of a single type. Challenge: data mining systems are required that can soundly mine the rich structure of relations among objects. E.g., interlinked Web pages, social networks, metabolic networks in the cell.
Most organizations' data is in text form and in complex data formats like image, multimedia and Web data. Challenge: how to mine non-relational data.
Integration of data mining and knowledge inference is required. Challenge (the biggest gap): systems are unable to relate the results of mining to the real-world decisions they affect - all they can do is hand the results back to the user. More research on the interestingness of knowledge is needed.
Community and Social Networks: -Linked data between emails, Web pages, blogs, citations, sequences and people
Problems:
- It is critical to have the right characterization of the community to be detected.
- Entities/nodes are distributed; hence, distributed means of identification are desired.
- A snapshot-based dataset may not be able to capture the real picture.
Challenge: to understand
- networks' static structures (e.g., topologies and structures)
- dynamic behavior (e.g., growth factor, robustness, functional efficiency)
Mining in and for computer networks
- Network links are increasing in speed (1-10 Gigabit Ethernet).
- To detect anomalies, fast capture of IP packets at high-speed links and analysis of massive amounts of data are required.
Challenge:
- Highly scalable solutions are required, i.e., good algorithms to (a) detect DoS attacks, (b) trace back to find attackers, and (c) drop packets that belong to attack traffic.
Important in network problems. In a distributed environment (sensor/IP network), distributed probes are placed at locations within the network. Problems: 1] Need to correlate and discover data patterns at the various probes. 2] Communication overhead (amount of data shipped between the various sites). 3] How to mine across multiple heterogeneous data sources. Adversary data mining: adversaries deliberately manipulate the data to sabotage the miners (produce false negatives), e.g., email spam, counter-terrorism, intrusion detection/computer security, click spam, search engine spam, fraud detection, shopbots, file sharing, etc. Multi-agent data mining: agents are often distributed and have proactive and reactive features.
http://www-ai.cs.uni-dortmund.de/auto?self=$ejr31cyc http://www.csc.liv.ac.uk/~ali/wp/MADM.pdf
How to automate the mining process? Issues: 1] 90% of the cost is in pre-processing. 2] Systematic documentation of data cleaning. 3] Combining visual interactive and automatic DM. 4] In exploratory data analysis, the DM goal is undefined. Challenges:
- The composition of data mining operations.
- Data cleaning, with logging capabilities.
- Visualization and mining automation.
Need a methodology to help users avoid many data mining mistakes.
- What are the approaches for multi-step mining queries?
- What is a canonical set of data mining operations?
Sampling
How to ensure users' privacy while their data are being mined? How to do data mining for the protection of security and privacy? Knowledge integrity assessment:
- Data are intentionally modified from their original version in order to misinform the recipients, or for privacy and security.
- Development of measures to evaluate the knowledge integrity of a collection of data, and of knowledge and patterns.
Challenges: 1] Develop efficient algorithms for comparing the knowledge contents of the two (before and after) versions of the data. 2] Develop algorithms for estimating the impact that certain modifications of the data have on the statistical significance of individual patterns obtainable by broad classes of data mining algorithms.
Headlines (Nov 21 2005) Senate Panel Approves Data Security Bill - The Senate Judiciary Committee on Thursday passed legislation designed to protect consumers against data security failures by, among other things, requiring companies to notify consumers when their personal information has been compromised. While several other committees in both the House and Senate have their own versions of data security legislation, S. 1789 breaks new ground by including provisions permitting consumers to access their personal files http://www.cdt.org/privacy/
Data is non-static, constantly changing, e.g., collecting data in 2000, then 2001, then 2002; the problem is to correct the bias. Deal with unbalanced and cost-sensitive data: there is much information on costs and benefits, but no overall model of profit and loss. Data may evolve with a bias introduced by sampling.
ICML 2003 Workshop on Learning from Imbalanced Data Sets
Example (cost-sensitive diagnosis): blood test? pressure? biopsy?
- Each test incurs a cost
- Data extremely unbalanced
- Data change with time
There is still a lack of timely exchange of important topics in the community as a whole. These problems are sampled from a small, albeit important, segment of the community. The list should obviously be a function of time for this dynamic field.