Data Warehouse Notes
Preparations
Data from different sources may violate functional dependencies (e.g., when linked data are modified inconsistently)
Duplicate records also need data cleaning
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Min-max normalization
Z-score normalization
Normalization by decimal scaling
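The three normalization methods above can be sketched in Python (a minimal stdlib-only sketch; the sample values are illustrative):

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Z-score normalization: center on the mean, scale by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    """Decimal scaling: divide by 10**j, where j is the smallest power
    that brings every |value| below 1."""
    j = 0
    while max(abs(v) for v in values) / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]

print(min_max([10, 20, 30]))        # [0.0, 0.5, 1.0]
print(decimal_scaling([-991, 12]))  # [-0.991, 0.012]
```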
Data reduction
Obtains reduced representation in volume but produces the same or similar analytical results
Data cube aggregation:
Dimensionality reduction — e.g., remove unimportant attributes
Data Compression
Numerosity reduction — e.g., fit data into models
Discretization and concept hierarchy generation
Data discretization
Part of data reduction but with particular importance, especially for numerical data
Three types of attributes:
Nominal — values from an unordered set, e.g., color, profession
Ordinal — values from an ordered set, e.g., military or academic rank
Continuous — numeric values, e.g., integers or real numbers
Discretization:
Divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical attributes.
Reduce data size by discretization
Prepare for further analysis
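Dividing the range of a continuous attribute into intervals can be sketched with equal-width binning (a minimal stdlib-only sketch; the age values and bin count are illustrative):

```python
def equal_width_bins(values, k):
    """Map each value to one of k equal-width intervals over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into bin k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]

ages = [3, 7, 18, 25, 33, 41, 62]
print(equal_width_bins(ages, 3))  # [0, 0, 0, 1, 1, 1, 2]
```

After binning, each interval can be replaced by a categorical label (e.g., "young", "middle", "senior"), which is what classification algorithms that only accept categorical attributes need.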
Confidence(A ⇒ B) = P(B|A)
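The formula above can be computed directly from a list of transactions (a minimal sketch; the market-basket transactions are illustrative):

```python
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(a, b):
    """Confidence(A => B) = P(B|A) = support(A ∪ B) / support(A)."""
    return support(a | b) / support(a)

print(confidence({"milk"}, {"bread"}))  # 2/3: bread appears in 2 of the 3 milk baskets
```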
Apriori Algorithm
Apriori is an algorithm for frequent itemset mining and association rule learning over transactional
databases. It proceeds by identifying the frequent individual items in the database and extending
them to larger and larger itemsets as long as those itemsets appear sufficiently often in the
database.
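The level-by-level growth described above can be sketched as follows (a minimal stdlib-only sketch, not a tuned implementation; the transactions and min_support are illustrative):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    # Level 1: candidate itemsets are the individual items.
    level = [frozenset([i]) for i in items]
    k = 1
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: v / n for c, v in counts.items() if v / n >= min_support}
        frequent.update(survivors)
        # Extend: union pairs of surviving k-itemsets into (k+1)-item candidates.
        level = list({a | b for a, b in combinations(survivors, 2)
                      if len(a | b) == k + 1})
        k += 1
    return frequent

txns = [{"milk", "bread"}, {"milk", "bread", "butter"}, {"bread", "butter"}, {"milk"}]
for itemset, sup in apriori(txns, 0.5).items():
    print(sorted(itemset), sup)
```

Note how {milk, butter} is pruned at level 2 (support 0.25), so no level-3 candidate containing it survives.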
FP-Growth (Frequent Pattern Growth)
The FP-Growth algorithm represents the database in the form of a tree called a frequent pattern
tree, or FP-tree. This tree structure maintains the associations between the itemsets. The database
is fragmented using one frequent item at a time; each fragmented part is called a "pattern fragment".
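Building the FP-tree itself can be sketched as follows (a minimal stdlib-only sketch covering only tree construction, not the mining step; the transactions and support threshold are illustrative):

```python
class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, min_support_count):
    """Insert frequency-ordered transactions into a shared-prefix tree."""
    counts = {}
    for t in transactions:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    frequent = {i for i, c in counts.items() if c >= min_support_count}
    root = Node(None)
    for t in transactions:
        node = root
        # Order items by descending global count (ties broken alphabetically)
        # so that transactions sharing frequent items share tree prefixes.
        for item in sorted((i for i in t if i in frequent),
                           key=lambda i: (-counts[i], i)):
            child = node.children.setdefault(item, Node(item))
            child.count += 1
            node = child
    return root

txns = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b"}]
root = build_fp_tree(txns, 2)
print({item: n.count for item, n in root.children.items()})  # {'a': 3, 'b': 1}
```

Three of the four transactions start with the most frequent item "a", so they share a single branch; only the lone {"b"} transaction starts a second branch.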
IT Architecture of an Enterprise
The IT architecture of an enterprise depends primarily on three things: the business requirements
of the enterprise; the technology available at the time; and the enterprise's accumulated
investments from earlier technology generations.
Operational Data Stores
The architectural construct where collectively integrated operational data is stored is called an
Operational Data Store (ODS).
Data Warehouse Vs Operational Data Store
Data Mining Notes Final exams
Preparations
Data mart
A data mart contains lightly summarized departmental data and is customized to suit the needs of
the particular department that owns the data.
• A data mart helps to enhance the user's response time due to the reduction in the volume of data.
• It provides easy access to frequently requested data.
• Data marts are simpler to implement than a corporate data warehouse, and the cost of
implementing a data mart is certainly lower than that of a full data warehouse.
Type of Data Mart
There are three main types of data marts:
Dependent:
Dependent data marts are created by drawing data from an existing central data warehouse.
Independent:
An independent data mart is created without the use of a central data warehouse, drawing data
directly from operational sources, external sources, or both.
Hybrid:
This type of data mart can take data from both data warehouses and operational systems.
Types of OLAP:
ROLAP
ROLAP products enable organizations to leverage their existing investments in RDBMS
(relational database management system) software. • ROLAP products provide GUIs and generate
SQL execution plans that typically remove end-users from the SQL writing process. • However,
this over-reliance on processing via SQL statements— including processing for multidimensional
analysis—is a drawback.
MOLAP
MOLAP products enable end-users to model data in a multidimensional environment, rather than
providing a multidimensional view of relational data, as ROLAP products do. • Multidimensional
databases allow users to add extra dimensions, rather than additional tables, as in a relational
model. • MOLAP cube structure allows for particularly fast, flexible data modeling and
calculations.
HTAP
Hybrid Transactional/Analytical Processing (HTAP) refers to in-memory data systems that do both
online transaction processing (OLTP) and online analytical processing (OLAP). – HTAP relies on
newer and much more powerful, often distributed, processing: sometimes it involves a new
hardware "appliance", and it almost always requires a new software platform.
HOLAP
HOLAP is the product of the attempt to incorporate the best features of MOLAP and ROLAP into
a single architecture. • This kind of tool tries to bridge the technology gap of both products by
enabling access to or use of both multidimensional database (MDDB) and Relational Database
Management System (RDBMS) data stores. • HOLAP systems store larger quantities of detailed
data in relational tables, while the aggregations are stored in precalculated cubes.
Desktop OLAP (DOLAP)
Users can download a section of an OLAP model from another source and work with that dataset
locally, on their desktop. – Easier to deploy and potentially lower in cost, but almost by definition
it comes with limited functionality in comparison with other OLAP applications.
Web OLAP (WOLAP)
The Web browser is the only client; there is no option for a local install or a local client to access
the data. – Considerably lower investment and enhanced accessibility to connect to the data. – In
fact, by now most OLAP products provide an option for Web-only connectivity, while still allowing
other client options for more robust data modeling and other functionality than a Web client can
provide.
Mobile OLAP
OLAP functionalities on a wireless or mobile device. – This enables users to access and work on
OLAP data and applications remotely through the use of their mobile devices.
Spatial OLAP (SOLAP)
The aim of Spatial OLAP (thus, SOLAP) is to integrate the capabilities of both Geographic
Information Systems (GIS) and OLAP into a unified solution, thus facilitating the management of
both spatial and non-spatial data.
Star Schema
The star schema is the simplest data warehouse model. A fact table in the central place is
surrounded by dimension tables. Most often the fact table contains sales data. Typical dimensions
are: geography, customer, product, time, business.
Snowflake Schema
The snowflake schema is a more complex version of the star schema because the tables that
describe the dimensions are normalized. This means that each dimension can have several
dimensions of its own. This schema is used primarily when we deal with very complex
dimensions, and to better reflect how users think about the data.
StarFlake Schema
The starflake schema is a hybrid structure that contains a mixture of star and snowflake schemas.
The most appropriate database schemas use a mixture of denormalized star and normalized
snowflake tables.
Constellation Schema
A galaxy schema contains two or more fact tables that share dimension tables. • It is also called a
fact constellation schema. • The schema is viewed as a collection of stars, hence the name galaxy
schema.
Cube
A data cube is a structure that enables OLAP to achieve multidimensional functionality. The data
cube is used to represent data along some measure of interest. Data cubes are an easy way to look
at the data (they let us look at complex data in a simple format). Although called a "cube", it can
be 2-dimensional, 3-dimensional, or higher-dimensional.
Important concepts associated with data cubes:
Slicing
The term slice most often refers to a two-dimensional page selected from the cube: the subset of a
multidimensional array corresponding to a single value for one or more members of the
dimensions not in the subset.
Dicing
An operation related to slicing. In dicing, we define a sub-cube of the original space by choosing
specific values or ranges on two or more dimensions.
Rotating or Pivoting
Rotating changes the dimensional orientation of the report from the cube data. • For example,
rotating may consist of swapping the rows and columns, moving one of the row dimensions into
the column dimension, or swapping an off-spreadsheet dimension with one of the dimensions in
the page display.
Roll-up and Drill-down
The two basic hierarchical operations when displaying data at multiple levels of aggregation are
the "drill-down" and "roll-up" operations. Drill-down refers to the process of viewing data at a
level of increased detail, while roll-up refers to the process of viewing data with decreasing detail.
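The cube operations above can be sketched on a cube stored as a dict keyed by (region, product, quarter) (a minimal stdlib-only sketch; the sales figures and dimension names are illustrative):

```python
cube = {
    ("EU", "phone", "Q1"): 10, ("EU", "phone", "Q2"): 12,
    ("EU", "tablet", "Q1"): 5,  ("US", "phone", "Q1"): 8,
    ("US", "tablet", "Q2"): 7,
}

def slice_(cube, dim, value):
    """Slice: fix one dimension to a single value (a 2-D page of the cube)."""
    return {k: v for k, v in cube.items() if k[dim] == value}

def dice(cube, allowed):
    """Dice: keep a sub-cube by restricting several dimensions to value sets."""
    return {k: v for k, v in cube.items()
            if all(k[d] in vals for d, vals in allowed.items())}

def roll_up(cube, keep_dims):
    """Roll-up: aggregate away the dimensions not listed in keep_dims."""
    out = {}
    for k, v in cube.items():
        key = tuple(k[d] for d in keep_dims)
        out[key] = out.get(key, 0) + v
    return out

print(slice_(cube, 2, "Q1"))                     # only the Q1 page
print(dice(cube, {0: {"EU"}, 2: {"Q1", "Q2"}}))  # the EU sub-cube
print(roll_up(cube, (0,)))                       # total sales per region
```

Pivoting is then just a presentation choice (which kept dimensions become rows vs. columns), and drill-down is the inverse of roll_up: going back to the more detailed keys.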
Three classifiers
Decision tree
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal
node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node
holds a class label. The topmost node in the tree is the root node.
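The structure just described (attribute test at each internal node, outcomes on branches, class labels at leaves) can be sketched with a hand-built tree; the "play tennis"-style attributes are illustrative, and learning the tree from data is not shown:

```python
tree = ("outlook", {                       # root node: test on "outlook"
    "sunny": ("humidity", {                # internal node: test on "humidity"
        "high":   "no",                    # leaf: class label
        "normal": "yes",
    }),
    "overcast": "yes",
    "rain": ("wind", {
        "strong": "no",
        "weak":   "yes",
    }),
})

def classify(node, record):
    if isinstance(node, str):              # leaf node: return its class label
        return node
    attribute, branches = node             # internal node: follow the branch
    return classify(branches[record[attribute]], record)

print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))  # yes
```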
KNN
K-Nearest Neighbors (KNN) is a standard machine-learning method that has been extended to
large-scale data mining efforts. The idea is that one uses a large amount of training data, where
each data point is characterized by a set of variables.
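The idea can be sketched in a few lines: classify a query point by majority vote among its k closest training points (a minimal stdlib-only sketch; the two-cluster training data are illustrative):

```python
from collections import Counter
from math import dist

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B")]

def knn_predict(point, k=3):
    """Majority vote among the k training points nearest to `point`."""
    nearest = sorted(train, key=lambda xy: dist(xy[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((1.1, 0.9)))  # "A"
print(knn_predict((5.0, 5.0)))  # "B"
```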
Naïve Bayes
The Naive Bayes classification algorithm is a probabilistic classifier. It is based on probability
models that incorporate strong independence assumptions. These independence assumptions often
do not hold in reality, which is why they are considered naive.
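The core computation can be sketched for categorical features: multiply the class prior by per-feature conditional probabilities, treating the features as independent given the class (a minimal stdlib-only sketch without smoothing; the weather-style training data are illustrative):

```python
from collections import Counter

train = [({"outlook": "sunny", "wind": "weak"},   "yes"),
         ({"outlook": "sunny", "wind": "strong"}, "no"),
         ({"outlook": "rain",  "wind": "weak"},   "yes"),
         ({"outlook": "rain",  "wind": "strong"}, "no"),
         ({"outlook": "sunny", "wind": "weak"},   "yes")]

def predict(record):
    """Return the class maximizing P(class) * prod P(feature=value | class)."""
    labels = Counter(label for _, label in train)
    best, best_p = None, -1.0
    for label, n in labels.items():
        p = n / len(train)                       # prior P(class)
        for feat, value in record.items():       # likelihoods P(value | class)
            match = sum(1 for f, l in train if l == label and f[feat] == value)
            p *= match / n
        if p > best_p:
            best, best_p = label, p
    return best

print(predict({"outlook": "sunny", "wind": "weak"}))  # "yes"
```

In practice a smoothing term (e.g., Laplace smoothing) is added so that a single unseen feature value does not zero out the whole product.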
***Good Luck***