
Data Mining Notes: Final Exam Preparation

Data Mining and Data Warehouse


Databases vs Data Warehousing
A database is a collection of related data that represents some aspect of the real world. A database is designed to record data; it is an application-oriented collection of data and uses Online Transaction Processing (OLTP). Database tables and joins are complicated because they are normalized. ER modeling techniques are used for designing databases.
A data warehouse is an information system that stores historical and cumulative data from single or multiple sources. A data warehouse is designed to analyze data; it is a subject-oriented collection of data and uses Online Analytical Processing (OLAP). Data warehouse tables and joins are simple because they are denormalized. Dimensional modeling techniques are used for designing data warehouses.
Application of database vs data warehouse
A database is used to store and retrieve data, while a data warehouse is used to analyze that data.
What is data mining?
Data mining is also called knowledge discovery in databases (KDD): the extraction of useful patterns from data sources, e.g., databases, texts, the web, images. Patterns must be novel, potentially useful, and understandable.
What is Data?
Data is a collection of records (objects) and their attributes. An attribute is a characteristic of an object, and a collection of attributes describes an object. Examples include transactional data, temporal data, and spatial and spatio-temporal data.
Types of Data
Record data [transactional data], temporal data [time-series data, sequence data], spatial and spatio-temporal data, graph data, unstructured data [Twitter status messages, reviews, news articles], semi-structured data [paper publication data, XML format].
Data Matrix
If data objects have the same fixed set of numeric attributes, then the data objects can be thought
of as points in a multidimensional space, where each dimension represents a distinct attribute.
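This view can be made concrete with a small sketch: each row of the matrix is one object, each column one attribute, and row-to-row distance measures similarity (the attribute names and values below are illustrative only).

```python
import math

# A data matrix: rows are objects, columns are numeric attributes.
# Objects with the same fixed attributes become points in space.
data_matrix = [
    # [height_cm, weight_kg]  (made-up values)
    [170.0, 65.0],
    [172.0, 68.0],
    [190.0, 95.0],
]

# Euclidean distance between rows treats each attribute as a dimension.
d01 = math.dist(data_matrix[0], data_matrix[1])
d02 = math.dist(data_matrix[0], data_matrix[2])
print(d01 < d02)  # True: rows 0 and 1 are closer than rows 0 and 2
```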
Data mining functions
(1) Generalization
Information integration and data warehouse construction. Data cube technology. Multidimensional
concept description: Characterization and discrimination.

(2) Association and Correlation Analysis


Frequent patterns (or frequent item sets). Association, correlation vs. causality. How to mine such
patterns and rules efficiently in large datasets? How to use such patterns for classification,
clustering, and other applications?
(3) Classification
Classification and label prediction. Typical methods: decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression.
(4) Cluster Analysis
Unsupervised learning (i.e., the class label is unknown). Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns. Principle: maximize intra-class similarity and minimize inter-class similarity.
(5) Outlier Analysis
Outlier: a data object that does not comply with the general behavior of the data. Noise or exception? One person's garbage could be another person's treasure. Methods: by-product of clustering or regression analysis. Useful in fraud detection and rare-event analysis.
Data Preprocessing
A well-accepted multidimensional view of data quality:
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility

Major Tasks in Data Preprocessing


Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Why Is Data Dirty?
Incomplete data may come from
“Not applicable” data value when collected. Different considerations between the time when the
data was collected and when it is analyzed. Human/hardware/software problems
Noisy data (incorrect values) may come from
Faulty data collection instruments. Human or computer error at data entry. Errors in data
transmission
Inconsistent data may come from

Different data sources. Functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning

Data integration
Integration of multiple databases, data cubes, or files

Data transformation
Normalization and aggregation
Min-max normalization
Z-score normalization
Normalization by decimal scaling
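The three normalization techniques above can be sketched on one numeric attribute (the sample values are illustrative):

```python
# Min-max, z-score, and decimal-scaling normalization.
values = [200, 300, 400, 600, 1000]

# Min-max: rescale into [0, 1].
lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) for v in values]

# Z-score: center on the mean, scale by the (population) standard deviation.
mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
zscore = [(v - mean) / std for v in values]

# Decimal scaling: divide by 10^j, where j is the smallest integer
# such that max(|v|) / 10^j < 1 (here j is the digit count of max |v|).
j = len(str(max(abs(v) for v in values)))
decimal = [v / 10 ** j for v in values]

print(minmax[0], round(zscore[0], 2), decimal[-1])  # 0.0 -1.06 0.1
```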

Data reduction
Obtains a reduced representation of the data, much smaller in volume, that produces the same or similar analytical results:
Data cube aggregation
Dimensionality reduction — e.g., remove unimportant attributes
Data compression
Numerosity reduction — e.g., fit data into models
Discretization and concept hierarchy generation

Data discretization
Part of data reduction, but of particular importance, especially for numerical data.
Three types of attributes:
Nominal — values from an unordered set, e.g., color, profession
Ordinal — values from an ordered set, e.g., military or academic rank
Continuous — real numbers, e.g., height, temperature
Discretization:
Divide the range of a continuous attribute into intervals. Some classification algorithms only accept categorical attributes; discretization also reduces data size and prepares the data for further analysis.
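A minimal equal-width binning sketch, one common unsupervised way to discretize a continuous attribute (the age values are illustrative):

```python
# Equal-width binning: divide the attribute's range into n_bins
# intervals of equal width and replace each value by its bin index.
def equal_width_bins(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value lands on the upper edge; clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [13, 15, 16, 19, 20, 21, 25, 30, 35, 40, 45, 70]
print(equal_width_bins(ages, 3))  # [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
```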

Correlation Analysis (Categorical Data)
The Χ² (chi-square) test: the larger the Χ² value, the more likely the variables are related.

    Χ² = Σ (Observed − Expected)² / Expected

Example (expected counts shown in parentheses):

                              Play chess    Not play chess    Sum (row)
    Like science fiction      250 (90)      200 (360)         450
    Not like science fiction  50 (210)      1000 (840)        1050
    Sum (col.)                300           1200              1500
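The Χ² value for this contingency table can be computed directly from the observed counts; a minimal sketch:

```python
# Chi-square test of independence for the 2x2 table above.
# Expected count for a cell = row_total * col_total / grand_total.
observed = [[250, 200],   # like science fiction
            [50, 1000]]   # not like science fiction

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

print(round(chi2, 2))  # 507.94 — far above the critical value, so
                       # liking science fiction and playing chess
                       # are correlated in this sample
```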

What Is Frequent Pattern Analysis?


Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs
frequently in a data set. e.g. market-basket analysis.
Support and Confidence

    Support(A ⇒ B) = P(A ∪ B)
    Confidence(A ⇒ B) = P(B|A) = Support(A ∪ B) / Support(A)
                               = Support_count(A ∪ B) / Support_count(A)

Note: in association-rule notation, P(A ∪ B) denotes the probability that a transaction contains both A and B.
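These definitions can be checked on a toy transaction database (the item names below are illustrative):

```python
# Support and confidence for the rule {milk} => {bread}.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "eggs"},
]

A, B = {"milk"}, {"bread"}

count_A = sum(1 for t in transactions if A <= t)          # 3 transactions contain A
count_AB = sum(1 for t in transactions if (A | B) <= t)   # 2 contain both A and B

support = count_AB / len(transactions)    # P(A and B) = 2/4 = 0.5
confidence = count_AB / count_A           # P(B|A)     = 2/3

print(support, round(confidence, 3))
```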
Apriori Algorithm
Apriori is an algorithm for frequent item set mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database.
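A minimal, unoptimized sketch of this level-wise idea (toy transactions; real implementations add hashing and other optimizations):

```python
from itertools import combinations

# Level-wise Apriori: find all itemsets whose support count meets
# min_support, growing candidates one item at a time. The Apriori
# property (every subset of a frequent itemset is frequent) prunes
# candidates with any infrequent (k-1)-subset.
def apriori(transactions, min_support):
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    k = 1
    candidates = [frozenset([i]) for i in items]
    while candidates:
        # Count support of each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets to form (k+1)-candidates, then prune.
        k += 1
        prev = list(level)
        candidates, seen = [], set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and union not in seen:
                    seen.add(union)
                    if all(frozenset(s) in level for s in combinations(union, k - 1)):
                        candidates.append(union)
    return frequent

tx = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
result = apriori(tx, min_support=3)
print(sorted(tuple(sorted(s)) for s in result))
```

With min_support=3, all single items and all pairs are frequent, but {A, B, C} (count 2) is pruned.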
FP-Growth (Frequent Pattern Growth)
The FP-Growth algorithm represents the database in the form of a tree called a frequent pattern tree (FP-tree). This tree structure maintains the associations between the item sets. The database is fragmented using one frequent item at a time; each fragmented part is called a "pattern fragment".
IT Architecture of an Enterprise
The IT architecture of an enterprise depends primarily on three things: the business requirements of the enterprise; the available technology at the time; and the enterprise's accumulated investments from earlier technology generations.
Operational Data Stores
The architectural construct in which collective, integrated operational data is stored is called an Operational Data Store (ODS).
Data Warehouse Vs Operational Data Store

What is Data Warehouse?


Data Warehousing is an architectural construct of information systems that provides users with
current and historical decision support information that is hard to access or present in traditional
operational data stores.
A formal definition of the data warehouse is offered by W.H. Inmon:
“A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in
support of management decisions”
DATA WAREHOUSE: CBA/ROI
Benefits
Improved productivity of analytical staff, as data is readily available
Business improvements from analysis of the warehouse data
Redeployment of staff to more productive work
Costs
Hardware
Software: licensed software, plus additional software for automating the extraction, cleansing, loading, retrieval, and presentation of data
Services: system integrators, trainers, consultants
Internal staff training
Architecture
Generally, a data warehouse adopts a three-tier architecture.
Top-Tier
This tier is the front-end client layer. It holds the query tools, reporting tools, analysis tools, and data mining tools.
Middle Tier
In the middle tier we have the OLAP server, which can be implemented in either of the following ways: by Relational OLAP (ROLAP), an extended relational database management system that maps operations on multidimensional data to standard relational operations; or by the Multidimensional OLAP (MOLAP) model, which directly implements multidimensional data and operations.
Bottom Tier
The bottom tier of the architecture is the data warehouse database server, typically a relational database system. Back-end tools and utilities are used to feed data into the bottom tier; they perform the extract, clean, load, and refresh functions.

Data mart
A data mart contains lightly summarized departmental data and is customized to suit the needs of the particular department that owns the data. A data mart helps to enhance users' response time due to the reduced volume of data, and it provides easy access to frequently requested data. Data marts are simpler to implement than a corporate data warehouse, and the cost of implementing a data mart is certainly lower than that of implementing a full data warehouse.
Types of Data Mart
There are three main types of data marts:
Dependent:
Dependent data marts are created by drawing data from an existing central data warehouse.
Independent:
An independent data mart is created without a central data warehouse, drawing data directly from operational or external sources, or both.
Hybrid:
This type of data mart can take data from both data warehouses and operational systems.

OLTP (online transaction processing)


OLTP is a class of software programs capable of supporting transaction-oriented applications on the Internet ("transaction" here in the sense of computer or database transactions). OLTP applications require high throughput and are insert- or update-intensive in database management. These applications are used concurrently by hundreds of users.

OLAP (Online Analytical Processing)


OLAP performs multidimensional analysis of business data and provides the capability for
complex calculations, trend analysis, and sophisticated data modeling. It is the foundation for
many kinds of business applications for Business Performance Management, Planning, Budgeting,
Forecasting, Financial Reporting, Analysis, Simulation Models, Knowledge Discovery, and Data
Warehouse Reporting. OLAP enables end-users to perform ad hoc analysis of data in multiple
dimensions, thereby providing the insight and understanding they need for better decision making.

Types of OLAP:
ROLAP
ROLAP products enable organizations to leverage their existing investments in RDBMS (relational database management system) software. ROLAP products provide GUIs and generate SQL execution plans that typically remove end users from the SQL-writing process. However, this heavy reliance on processing via SQL statements, including processing for multidimensional analysis, is a drawback.
MOLAP
MOLAP products enable end users to model data in a multidimensional environment, rather than providing a multidimensional view of relational data, as ROLAP products do. Multidimensional databases allow users to add extra dimensions, rather than additional tables as in a relational model. The MOLAP cube structure allows for particularly fast, flexible data modeling and calculations.

HTAP
Hybrid Transactional/Analytical Processing (HTAP) refers to in-memory data systems that do both online transaction processing (OLTP) and online analytical processing (OLAP). HTAP relies on newer and much more powerful, often distributed, processing: sometimes it involves a new hardware "appliance", and it almost always requires a new software platform.
HOLAP
HOLAP is the product of the attempt to incorporate the best features of MOLAP and ROLAP into a single architecture. This kind of tool tries to bridge the technology gap between the two products by enabling access to both multidimensional database (MDDB) and relational database management system (RDBMS) data stores. HOLAP systems store larger quantities of detailed data in relational tables, while the aggregations are stored in precalculated cubes.
Desktop OLAP (DOLAP)
Users can download a section of an OLAP model from another source and work with that dataset locally, on their desktop. Easier to deploy, with a potentially lower cost, but almost by definition it comes with limited functionality in comparison with other OLAP applications.
Web OLAP (WOLAP)
OLAP functionality accessed through a Web browser, without any option for a local install or a local client to access data. It offers a considerably lower investment and enhanced accessibility for connecting to the data. By now, most OLAP products provide an option for Web-only connectivity, while still allowing other client options for more robust data modeling and functionality than a Web client can provide.
Mobile OLAP
OLAP functionality on a wireless or mobile device. This enables users to access and work on OLAP data and applications remotely through their mobile devices.
Spatial OLAP (SOLAP)
The aim of Spatial OLAP (thus, SOLAP) is to integrate the capabilities of both Geographic
Information Systems (GIS) and OLAP into a unified solution, thus facilitating the management of
both spatial and non-spatial data.

Multidimensional Data Model (MDDM)


The dimensional model was developed for implementing data warehouses and data marts. MDDM provides both a mechanism to store data and a way to support business analysis. The two primary components of the dimensional model are dimensions and facts.

Measures, Dimensions & Facts


Measures are numerical values that mathematical functions work on. For example, a sales revenue column is a measure because you can total or average the data. Other examples are cost and quantity.
Dimensions are qualitative and do not total to a sum. For example, sales region, employee, location, and date are dimensions.
Facts are the recorded business events (e.g., individual sales), together with their measures.

Data Warehouse: Schema


A data warehouse schema must support management in making business decisions, handle very large amounts of data (current + historical), and achieve a satisfactory level of efficiency for analytical queries.

Star Schema
The star schema is the simplest data warehouse model. A fact table sits in the central place, surrounded by dimension tables. Most often the fact table holds sales data. Typical dimensions are geography, customer, product, time, and business.
Snowflake Schema
The snowflake schema is a more complex version of the star schema because the tables that describe the dimensions are normalized. This means that each dimension can have several dimensions of its own. This schema is used primarily when we deal with very complex dimensions, and to better reflect the way users think about the data.
StarFlake Schema
A starflake schema is a hybrid structure that contains a mixture of star and snowflake schemas. The most appropriate database schemas use a mixture of denormalized star and normalized snowflake tables.
Constellation Schema
A galaxy schema contains two or more fact tables that share dimension tables. It is also called a fact constellation schema. The schema is viewed as a collection of stars, hence the name galaxy schema.

Cube
A data cube is a structure that enables OLAP to achieve its multidimensional functionality. The data cube is used to represent data along some measure of interest. Data cubes are an easy way to look at the data (they allow us to look at complex data in a simple format). Although called a "cube", it can be 2-dimensional, 3-dimensional, or higher-dimensional.
Important concepts associated with data cubes:
Slicing
The term slice most often refers to a two-dimensional page selected from the cube: the subset of a multidimensional array corresponding to a single value for one or more members of the dimensions not in the subset.
Dicing
An operation related to slicing. In dicing, we define a subcube of the original space by picking specific values or ranges on several dimensions; dicing gives you the smallest available slice.
Rotating or Pivoting
Rotating changes the dimensional orientation of the report built from the cube data. For example, rotating may consist of swapping the rows and columns, moving one of the row dimensions into the column dimension, or swapping an off-spreadsheet dimension with one of the dimensions in the page display.
Roll-up and Drill-down
The two basic hierarchical operations when displaying data at multiple levels of aggregation are the "drill-down" and "roll-up" operations. Drill-down refers to the process of viewing data at a level of increased detail, while roll-up refers to the process of viewing data with decreasing detail.
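The slice and roll-up operations above can be sketched on a toy cube stored as a Python dictionary (the region/product/quarter names and amounts are made up):

```python
# A toy 3-D sales cube keyed by (region, product, quarter).
cube = {
    ("East", "Laptop", "Q1"): 100, ("East", "Laptop", "Q2"): 120,
    ("East", "Phone",  "Q1"): 80,  ("West", "Laptop", "Q1"): 90,
    ("West", "Phone",  "Q1"): 70,  ("West", "Phone",  "Q2"): 60,
}

# Slice: fix one dimension (quarter = "Q1") to get a 2-D page.
q1_slice = {(r, p): v for (r, p, q), v in cube.items() if q == "Q1"}

# Roll-up: aggregate away the product dimension, leaving totals
# per (region, quarter) at a coarser level of detail.
rollup = {}
for (r, p, q), v in cube.items():
    rollup[(r, q)] = rollup.get((r, q), 0) + v

print(q1_slice[("East", "Laptop")])  # 100
print(rollup[("East", "Q1")])        # 180 (Laptop 100 + Phone 80)
```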

Three classifiers
Decision tree
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal
node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node
holds a class label. The topmost node in the tree is the root node.

KNN
K-Nearest Neighbors (KNN) is a standard machine-learning method that has been extended to
large-scale data mining efforts. The idea is that one uses a large amount of training data, where
each data point is characterized by a set of variables.
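A minimal KNN sketch using Euclidean distance and a majority vote over the k closest training points (the training data is made up):

```python
from collections import Counter
import math

# Predict the majority class among the k training points closest
# to the query point.
def knn_predict(train, query, k=3):
    # train: list of (feature_vector, label) pairs
    dists = sorted((math.dist(x, query), label) for x, label in train)
    top_k = [label for _, label in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((4.0, 4.0), "B"), ((4.2, 3.9), "B"), ((3.8, 4.1), "B")]
print(knn_predict(train, (1.1, 0.9), k=3))  # A
```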
Naïve Bayes
The naive Bayes classification algorithm is a probabilistic classifier based on probability models that incorporate strong independence assumptions. These independence assumptions often have no basis in reality; that is why they are considered naive.
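A minimal categorical naive Bayes sketch using relative frequencies (the toy weather data is made up; real implementations add smoothing for unseen values):

```python
from collections import Counter, defaultdict

# P(class) multiplied by P(feature_value | class) for each feature,
# assuming features are independent given the class (the "naive" part).
def nb_train(rows, labels):
    n = len(labels)
    priors = {c: k / n for c, k in Counter(labels).items()}
    cond = defaultdict(Counter)   # (feature_index, class) -> value counts
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, c)][v] += 1
    return priors, cond, Counter(labels)

def nb_predict(model, row):
    priors, cond, class_counts = model
    best, best_p = None, -1.0
    for c, prior in priors.items():
        p = prior
        for i, v in enumerate(row):
            # Relative frequency of value v for feature i within class c.
            p *= cond[(i, c)][v] / class_counts[c]
        if p > best_p:
            best, best_p = c, p
    return best

rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "hot")]
labels = ["no", "yes", "yes", "no"]
model = nb_train(rows, labels)
print(nb_predict(model, ("sunny", "mild")))  # yes
```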

***Good Luck***
