3 - Business Analysis in Data Mining - L6 - 7 - 8 - 9 - 10

You might also like

Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 40

Business Analysis in Data

Mining
Lecture-6,7,8,9,10
Dr. Shweta Sharma
School of Computing Information Technology
Manipal University Jaipur
India
Topic Covered
• Data Warehouse Modelling:
• Data Cube – Star Schema, Snow Flake Scheme, Dimensions, Measures &
OLAP operations
• Data Warehouse Implementation:
• Indexing OLAP Data, Efficient Processing of OLAP queries
• OLAP Server Architecture
Data Warehouse Modelling:
• The analysis and documentation of the current processes and
transactions that exist during the software design and development is
known as data modeling.
• The data modeling techniques and tools simplify the complicated
system designs into easier data flows which can be used for re-
engineering.
• It is used to create the logical and physical design of a data
warehouse.
Need for Data Modeling in a Data warehouse
• Collecting the Business Requirements
• Typically, a data warehouse is designed with the data architects and the business users
determining the entities required in the data warehouse and the facts that need to be
recorded. This initial design has many iterations before deciding the final model
• Enhancing Database Performance
• Query performance is a vital feature of a data warehouse. Enormous data volumes are
involved in a data warehouse, so using a data model product for the management of the
metadata and the data used by the BI users is very important
• Offers Source and Target System Documentation
• In the process of designing an ETL system, it is very important to verify the physical and
logical models of the source and target systems respectively
Types of Data Model
• Conceptual Data Model
• Logical Data Model
• Physical Data Model
• Relational Data Model
• Multi Dimensional Data Model
Conceptual Data Model
• A conceptual data model determines the
highest-level relationships among the different
entities.
• It is the primary step in the creation of a data
model in a top-down approach that is an
exact representation of the business
organization
• Conceives the overall structure of the
database and gives information on the
subject-areas
• Consists of entity types and relationships.
The relationship between the subject areas
is represented by symbolic notations
(IDEF1X or IE). Cardinality is a data model
that exhibits the one to one relationship or
many-to-many relationship
• No primary key is stated
• No attribute is specified
Logical Data Model
• Logical data model represents the specific
particulars of the entities, attributes, and
relationships involved in a business. It is the
basis on which a physical model is designed.
• The development of a logical model begins
after the sign-off of the conceptual data
model by the functional team
• A logical model should systematize the
physical design process by defining the data
structures and the relationship between
them
• The primary keys and foreign keys are
established here
• Normalization occurs here
• Represents all the entities and the
relationships between them
Physical Data Model
• Physical data model exhibits the model of the
database that is to be built. It represents the table
structures, column names, column data types,
primary keys, and foreign keys.
• The physical data model is developed after receiving
the acceptance of the logical data model by the
functional team
• Physical data model might be different from the
logical data model due to few physical constraints
• Physical data model differs for different databases.
The data types change for different databases
• Denormalization takes place according to the user
requirements
• The logical model is changed to a physical data
model by implementing the database rules,
referential integrity, supertypes, and subtypes
Relational Data Model
Relational data modeling is used in OLTP systems that are transaction-oriented. The
major characteristics of a relational data model are:
• Relationship among the tables
• All the data is stored in tables and each relation has rows and columns
• The table should have a header and a body. The header is the list of columns in the table and
body consist of the values populated in the table. Tuple is the unique value generated from the
junction of one column and one row
• Usage of keys
• Primary key is the most important key in a table. It is used as a unique identifier. The primary
key is always a not null column
• Foreign key is used to relate to the primary key. They relate the data from one table to another
table and establish a relationship
• Data Redundancy
• The relational data model applies rules to maintain data integrity
• It eliminates data redundancy. The data is not stored repetitively. This helps in maintaining data
consistency and limited data storage
Multi Dimensional Data Model
A multidimensional data model is a logical view of an enterprise that represents the
important entities of a business and the relationship between them. It is not restricted to
a physical database and tables. It’s not represented by E-R diagrams. The main
components are:
• Attributes
• Attributes are the abstract terms devised for easier summarization of data on a report
• They can also be defined as the column headings that are not part of any calculations on a report
• Dimensions
• A dimension is a data set comprising individual, non-overlapping data elements
• They enable end-users to define, group, and filter the data for display and browsing purposes
• Facts
• A fact is a table consisting of columns that are used for numeric purposes to answer the business
questions
• They consist of additive, non-additive, and semi-additive measures
Data Cube
• When data is grouped or combined in
multidimensional matrices then it is
called as Data Cubes. It is also
known as "Multidimensional
databases," "materialized views," and
"OLAP (On-Line Analytical
Processing).“
• For example,  sales schema
• Entity: part, supplier, customer, and
sale-price
• Attributes
• Aggregation
Conti…
• A data cube is created from a subset of attributes
in the database. Specific attributes are chosen to
be measure attributes (whose values are of
interest).
• Other attributes are selected as dimensions or
functional attributes. The measure attributes are
aggregated according to the dimensions.
• Data cubes could be sparse in many cases because
not every cell in each dimension may have
corresponding data in the database.
• Techniques should be developed to handle sparse
cubes efficiently.
Example of Sales Schema
Two types of view of data (schema)
2. Example: 3-Dimensional Cuboids
• Let suppose we would like to view the
sales data with a third dimension.
• For example, suppose we would like
to view the data according to time,
item as well as location for the cities
Chicago, New York, Toronto, and
Vancouver. The measured display in
dollars sold (in thousands). These 3-D
data are shown in the table. The 3-D
data of the table are represented as a
series of 2-D tables.
Conti…
• Conceptually, we may
represent the same data in
the form of 3-D data
cubes, as shown in fig:
• In data warehousing, the
data cubes are n-
dimensional. The cuboid
which holds the lowest
level of summarization is
called a base cuboid.
3. Example:
• The 4-D cuboid in the figure is the base cuboid for the given time, item,
location, and supplier dimensions.
Conti…
• Figure is shown a 4-D data
cube representation of sales
data, according to the
dimensions time, item,
location, and supplier. The
measure displayed is dollars
sold (in thousands).
• The lattice of cuboids forms
a data cube. The figure
shows the lattice of cuboids
creating 4-D data cubes for
the dimension time, item,
location, and supplier. Each
cuboid represents a different
degree of summarization.
What is Multidimensional schema?
• Multidimensional Schema is specially designed to model data
warehouse systems. The schemas are designed to address the unique
needs of very large databases designed for analytical purposes
(OLAP).
• Types of Data Warehouse Schema:
• Star Schema
• Snowflake Schema
What is a Star Schema?
Star Schema in the data
warehouse, in which the center
of the star can have one fact
table and a number of
associated dimension tables. It is
known as star schema as its
structure resembles a star. The
Star Schema data model is the
simplest type of Data
Warehouse schema. It is also
known as Star Join Schema and
is optimized for querying large
data sets. Example of Star Schema Diagram
Conti…
Characteristics of Star Schema:

• Every dimension in a star schema is represented with only a one-dimension


table.
• The dimension table should contain the set of attributes.
• The dimension table is joined to the fact table using a foreign key
• The dimension table are not joined to each other
• Fact table would contain key and measure
• The Star schema is easy to understand and provides optimal disk usage.
• The dimension tables are not normalized. For instance, in the above figure,
Country_ID does not have a Country lookup table as an OLTP design would have.
• The schema is widely supported by BI Tools
What is a Snowflake Schema?
• Snowflake Schema in data
warehouse is a logical
arrangement of tables in a
multidimensional
database such that the ER
diagram resembles a
snowflake shape. A
Snowflake Schema is an
extension of a Star
Schema, and it adds
additional dimensions. The
dimension tables are
normalized which splits
data into additional tables.

Example of Snowflake Schema


Characteristics of Snowflake Schema:
• The main benefit of the snowflake schema it uses smaller disk
space.
• Easier to implement a dimension is added to the Schema
• Due to multiple tables query performance is reduced
• The primary challenge that you will face while using the snowflake
Schema is that you need to perform more maintenance efforts
because of the more lookup tables.
Key Differences
Star Schema Snowflake Schema
Hierarchies for the dimensions are stored in the dimensional
Hierarchies are divided into separate tables.
table.
One fact table surrounded by dimension table which are in turn
It contains a fact table surrounded by dimension tables.
surrounded by dimension table

In a star schema, only single join creates the relationship A snowflake schema requires many joins to fetch the data.
between the fact table and any dimension tables.
Simple DB Design. Very Complex DB Design.
Denormalized Data structure and query also run faster. Normalized Data Structure.
High level of Data redundancy Very low-level data redundancy
Single Dimension table contains aggregated data. Data Split into different Dimension Tables.

Cube processing is faster. Cube processing might be slow because of the complex join.

Offers higher-performing queries using Star Join Query The Snowflake schema is represented by centralized fact table
Optimization. Tables may be connected with multiple
which unlikely connected with multiple dimensions.
dimensions.
OLAP(Online Analytics Processing)
Operations
• Online analytical processing (OLAP) is a technology that organizes large
business databases and supports complex analysis. It can be used to perform
complex analytical queries without negatively affecting transactional systems.
• The databases that a business uses to store all its transactions and records are
called online transaction processing (OLTP) databases.
• OLAP is a widely spread technology belonging to Business Intelligence
processes developed to coordinate and analyze vast amounts of data. OLAP
databases are stored in the form of multidimensional cubes where each cube
comprises the data supposed relevant by a cube administrator. Through
certain OLAP operations, a user is able to obtain a specified view of the cube
and extract requisite information from it.
• So this way it’s possible to get a necessary Pivot Table and Pivot Chart report.
OLAP Operation
• General OLAP operations involve:
• Drill-up,
• Drill-down,
• Pivot, and
• Slice-and-Dice.
• OLAP operations could be based on two OLAP languages:
• SQL and MDX.
Conti…
Drill Up
• Drill-up is an operation to
gather data from the cube
either by ascending a concept
hierarchy for a dimension or
by dimension reduction in
order to receive measures at
a less detailed granularity.
• It provide broader
perspective in compliance
with the concept hierarchy
where a user has to group the
columns and unite the values.
Drill down
• It is carried out either by
descending a concept
hierarchy for a dimension or
by adding a new dimension.
It lets a user deploy highly
detailed data from a less
detailed cube. Consequently,
when the operation is run,
one or more dimensions
from the data cube must be
appended to provide more
information elements.
Slice
• The Slice OLAP operations take one
specific dimension from a cube and
represent a new sub-cube, which
provides information from another
point of view.
• It can create a new sub-cube by
choosing one or more dimensions.
The use of Slice implies the specified
granularity level of the dimension.
Dice
• OLAP Dice emphasizes on two
or more dimensions from a
cube and suggests a new sub-
cube, as well as Slice
operation does. In order to
locate a single value for a
cube, it includes adding values
for each dimension.
Pivot
• This OLAP operation rotates the
axes of a cube to provide an
alternative view of the data
cube. Pivot clusters the data
with other dimensions which
helps analyze the performance
of a company or enterprise.
Conti…
Scoping
• The operation of Scoping restrains the presentation of the database objects to a
specified subset. It will let users receive and update certain data values which
they want. If there is a huge amount of data and a user needs to constrain the
access of information to a specified subset Scoping is mostly conducive.
• Screening
• Screening is conducted to limit the set of data extracted.
• Drill across
• Drill across and Drill through in OLAP are another pair of opposite operations.
The operation Drill across reconciles cells from several data cubes which share
the same scheme.
Conti…
• Drill through
• OLAP Drill through enables to navigate from data at the lower level in a cube to data in the operational systems
whence the cube was ejected. The operation is usually exploited to identify the cause of outlier values in a data
cube.
• Sort
• Sort brings the cube back where the members of a dimension were sorted.
• Add Measure
• Thanks to this OLAP operation one is able to add new measures to a cube.
• Drop Measure
• In contrast to Add Measure, it’s also possible to get rid of a measure from a data cube if it's not necessary.
• Union
• Due to an opportunity of Union, you can unite a number of cubes that have the same scheme but separate
instances.
• Difference
• Difference eliminates the cells in a cube which are owned by another one. These two cubes must possess the same
scheme.
OLAP Servers
• Online Analytical Processing(OLAP) refers to a set of software tools used for
data analysis in order to make business decisions. OLAP provides a platform for
gaining insights from databases retrieved from multiple database systems at
the same time. It is based on a multidimensional data model, which enables
users to extract and view data from various perspectives. A multidimensional
database is used to store OLAP data. Many Business Intelligence (BI)
applications rely on OLAP technology.
• Type of OLAP servers:
The three major types of OLAP servers are as follows:
• ROLAP
• MOLAP 
• HOLAP
Relational OLAP (ROLAP):
• Relational On-Line Analytical Processing (ROLAP) is
primarily used for data stored in a relational database,
where both the base data and dimension tables are
stored as relational tables. ROLAP servers are used to
bridge the gap between the relational back-end server
and the client’s front-end tools. ROLAP servers store
and manage warehouse data using RDBMS, and
OLAP middleware fills in the gaps.
• Benefits:
• It is compatible with data warehouses and OLTP
systems.
• The data size limitation of ROLAP technology is
determined by the underlying RDBMS. As a result,
ROLAP does not limit the amount of data that can be
stored.
• Limitations:
• SQL functionality is constrained.
• It’s difficult to keep aggregate tables up to date.
Multidimensional OLAP (MOLAP):
• Through array-based multidimensional storage engines,
Multidimensional On-Line Analytical Processing (MOLAP)
supports multidimensional views of data. Storage utilization in
multidimensional data stores may be low if the data set is sparse.
• MOLAP stores data on discs in the form of a specialized
multidimensional array structure. It is used for OLAP, which is
based on the arrays’ random access capability. Dimension instances
determine array elements, and the data or measured value
associated with each cell is typically stored in the corresponding
array element. The multidimensional array is typically stored in
MOLAP in a linear allocation based on nested traversal of the axes
in some predetermined order.
• Benefits:
• Suitable for slicing and dicing operations.
• Outperforms ROLAP when data is dense.
• Capable of performing complex calculations.
• Limitations:
• It is difficult to change the dimensions without re-aggregating.
• Since all calculations are performed when the cube is built, a large
amount of data cannot be stored in the cube itself.
Hybrid OLAP (HOLAP):
• ROLAP and MOLAP are combined in Hybrid On-Line
Analytical Processing (HOLAP). HOLAP offers greater
scalability than ROLAP and faster computation than
MOLAP.HOLAP is a hybrid of ROLAP and MOLAP.
HOLAP servers are capable of storing large amounts of
detailed data. On the one hand, HOLAP benefits from
ROLAP’s greater scalability. HOLAP, on the other hand,
makes use of cube technology for faster performance and
summary-type information. Because detailed data is stored
in a relational database, cubes are smaller than MOLAP.
• Benefits:
• HOLAP combines the benefits of MOLAP and ROLAP.
• Provide quick access at all aggregation levels.
• Limitations
• Because it supports both MOLAP and ROLAP servers,
HOLAP architecture is extremely complex.
• There is a greater likelihood of overlap, particularly in
their functionalities.
Other types of OLAP include:
• Web OLAP (WOLAP): WOLAP refers to an OLAP application that can be accessed through
a web browser. WOLAP, in contrast to traditional client/server OLAP applications, is thought
to have a three-tiered architecture consisting of three components: a client, middleware, and a
database server.
• Desktop OLAP (DOLAP): DOLAP is an abbreviation for desktop analytical processing. In
that case, the user can download the data from the source and work with it on their desktop or
laptop. In comparison to other OLAP applications, functionality is limited. It is less expensive.
•  Mobile OLAP (MOLAP): Wireless functionality or mobile devices are examples of
MOLAP. The user is working and accessing data via mobile devices.
•  Spatial OLAP (SOLAP): SOLAP egress combines the capabilities of Geographic
Information Systems (GIS) and OLAP into a single user interface. SOLAP is created because
the data can be alphanumeric, image, or vector. This allows for the quick and easy exploration
of data stored in a spatial database.
Thank You!

You might also like