Adbs Unit IV
PLD
AY 2023-2024
(SEM-V)
UNIT NO 4 : Data Warehousing
UNIT 4 Contents:
Data Warehousing Introduction :
• Data Warehousing is the process of creating data warehouses to store large amounts of data.
• Data Warehousing improves the speed and efficiency of accessing different data sets, making it easier
for company decision-makers to obtain insights that help the business and promote marketing
tactics that set it apart from its competitors.
Introduction to Decision Support Systems :
• Decision support systems (DSS) are interactive, software-based systems intended to help
managers make decisions by accessing large volumes of information generated by the
various related information systems involved in organizational business processes, such as
office automation systems, transaction processing systems, etc.
• These decisions are based on the manager's discretion, instinct, perception and judgment.
For example, investing in a new technology is a non-programmed decision.
Characteristics of a DSS :
Components of a DSS –
Support Tools −
Support tools such as online help, pull-down menus, user interfaces, graphical analysis, and
error-correction mechanisms facilitate the user's interaction with the system.
Creating and maintaining a warehouse –
Some steps that are needed for building any data warehouse are as follows:
Introduction to Data warehouse and OLAP
• A data warehouse is like a big library where we keep a lot of information from different
places, organized so that it is easy to analyze and understand.
• You can therefore make good decisions based on these facts.
• All the information you need is in one place.
• The information is organized so it is easy to find and use.
• It takes information from different places and puts it all together in one place, which makes
it easier to understand.
Functions of Data warehouse
1) Data Consolidation
2) Data Cleaning
3) Data Integration
4) Data Storage
5) Data Transformation
6) Data Analysis
7) Data Reporting
8) Data Mining
9) Performance Optimization
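Several of the functions above (consolidation, cleaning, transformation, storage) can be sketched as a tiny ETL pipeline. This is only an illustration: the source names, field names and figures below are all made up.

```python
def extract():
    # Consolidation: pull records from two hypothetical source systems.
    pos_sales = [{"city": "Delhi", "amount": "1200"}, {"city": "delhi", "amount": None}]
    web_sales = [{"city": "Kolkata", "amount": "800"}]
    return pos_sales + web_sales

def clean(rows):
    # Cleaning: drop records with missing amounts, normalize city names.
    return [
        {"city": r["city"].title(), "amount": float(r["amount"])}
        for r in rows if r["amount"] is not None
    ]

def transform(rows):
    # Transformation: aggregate sales per city for analysis and reporting.
    totals = {}
    for r in rows:
        totals[r["city"]] = totals.get(r["city"], 0.0) + r["amount"]
    return totals

warehouse = transform(clean(extract()))   # Storage: keep the consolidated result
print(warehouse)                          # {'Delhi': 1200.0, 'Kolkata': 800.0}
```

A real warehouse would run such a pipeline with dedicated ETL tooling, but the stages are the same.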
Online Analytical Processing Server (OLAP) :-
OLAP operations :-
These are used to analyze data in an OLAP cube. There are five basic operations:
1) Drill down
This makes the data more detailed by moving down the concept hierarchy or adding a new dimension.
For example, in a cube showing sales data by Quarter, drilling down would show sales data by Month.
2) Roll up
This makes the data less detailed by climbing up the concept hierarchy or reducing dimensions.
For example, in a cube showing sales data by City, rolling up would show sales data by Country.
3) Dice
This selects a sub-cube by choosing two or more dimensions and criteria. For example, in a cube showing
sales data by Location, Time, and Item, dicing could select sales data for Delhi or Kolkata, in Q1 or Q2,
for Cars or Buses.
4) Slice
This selects a single dimension and creates a new sub-cube. For example, in a cube showing sales data by
Location, Time, and Item, slicing by Time would create a new sub-cube showing sales data for Q1.
5) Pivot
This rotates the current view to get a new representation. For example, after slicing by Time, pivoting could
show the same data but with Location and Item as rows instead of columns.
Multidimensional data model :
• The multidimensional data model is a method for ordering data in the database, with good
arrangement and assembly of the contents of the database.
• The multidimensional data model allows users to pose analytical questions associated with
market or business trends, unlike relational databases, which allow users to access data in
the form of queries.
• OLAP (online analytical processing) and data warehousing use multidimensional databases,
which are used to show multiple dimensions of the data to users.
• It represents data in the form of data cubes. Data cubes allow us to model and view the
data from many dimensions and perspectives.
• It is defined by dimensions and facts and is represented by a fact table.
• Facts are numerical measures, and fact tables contain measures of the related
dimension tables or the names of the facts.
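A fact table of numeric measures keyed to dimension tables can be sketched with Python's built-in sqlite3 module. Every table name, column name and figure here is illustrative, not a standard schema.

```python
import sqlite3

# Minimal dimensional model: a fact table of measures plus two dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_item     (item_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_location (loc_id  INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE fact_sales   (item_id INTEGER, loc_id INTEGER, amount REAL);
""")
con.executemany("INSERT INTO dim_item VALUES (?, ?)", [(1, "Car"), (2, "Bus")])
con.executemany("INSERT INTO dim_location VALUES (?, ?)", [(1, "Delhi"), (2, "Kolkata")])
con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, 1, 1200.0), (1, 2, 800.0), (2, 1, 300.0)])

# Viewing the measures along the Item dimension is a join plus group-by.
rows = con.execute("""
    SELECT d.name, SUM(f.amount)
    FROM fact_sales f JOIN dim_item d USING (item_id)
    GROUP BY d.name ORDER BY d.name
""").fetchall()
print(rows)   # [('Bus', 300.0), ('Car', 2000.0)]
```

Each dimension added to the GROUP BY clause views the same facts from another perspective, which is exactly what the data cube formalizes.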
Working on a Multidimensional Data Model :-
Advantages of the Multidimensional Data Model
The following are the advantages of a multidimensional data model :
• A multidimensional data model is easy to handle.
• It is easy to maintain.
• Its performance is better than that of normal databases (e.g. relational
databases).
• It represents data better than traditional databases, because multidimensional
databases are multi-viewed and carry different types of factors.
• It works on complex systems and applications, unlike simple
one-dimensional database systems.
• Its compatibility is a benefit for projects with limited bandwidth for
maintenance staff.
Disadvantages of the Multidimensional Data Model
The following are the disadvantages of a multidimensional data model :
• The multidimensional data model is somewhat complicated, and it
requires professionals to recognize and examine the data in the database.
• When the system caches while working with a multidimensional data model,
the operation of the system is greatly affected.
• Because the model is complicated in nature, the databases are generally
dynamic in design.
• The path to achieving the end product is complicated most of the time.
• Because the multidimensional data model involves complicated systems with
a large number of databases, the system is very insecure when there is a
security breach.
Data Warehouse Architecture :-
Operational System –
An operational system is a term used in data warehousing to refer to a system that
processes the day-to-day transactions of an organization.
Flat Files
A flat file system is a system of files in which transactional data is stored, and every file in
the system must have a different name.
Meta Data
A set of data that defines and gives information about other data.
Metadata is used in the data warehouse for a variety of purposes, including:
• Metadata summarizes necessary information about data, which can make finding and
working with particular instances of data easier. For example, author, date built,
date changed, and file size are examples of very basic document metadata.
• Metadata is used to direct a query to the most appropriate data source.
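Both uses of metadata can be shown with a toy catalogue. The dataset names, attributes and sources below are hypothetical, chosen only to illustrate the idea of routing a query via metadata.

```python
# A toy metadata catalogue: each entry describes a dataset and records
# which system holds it (all names here are invented).
catalog = {
    "sales_2023":  {"author": "etl_job", "size_kb": 420, "source": "warehouse"},
    "orders_live": {"author": "oltp",    "size_kb": 55,  "source": "operational_db"},
}

def route(dataset):
    # Direct a query to the most appropriate data source for this dataset.
    return catalog[dataset]["source"]

print(route("orders_live"))   # operational_db
```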
Lightly and highly summarized data
This area of the data warehouse stores all the predefined lightly and highly
summarized (aggregated) data generated by the warehouse manager.
The goal of the summarized information is to speed up query performance.
The summarized records are updated continuously as new information is loaded
into the warehouse.
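The idea can be sketched in a few lines: the warehouse manager aggregates the detail rows once, later queries read the small summary instead of scanning every row, and the summary is updated as new data loads. The city and monthly figures are invented for illustration.

```python
# Hypothetical monthly detail rows: (city, month, amount).
detail = [("Delhi", m, 100 + m) for m in range(1, 13)]

# Lightly summarized data: pre-aggregate to a yearly total per city.
summary = {}
for city, _month, amount in detail:
    summary[city] = summary.get(city, 0) + amount

# A query for Delhi's yearly sales now reads one record, not twelve.
print(summary["Delhi"])            # 1278

# As new information is loaded, the summary is updated continuously.
summary["Delhi"] += 150            # a newly loaded row
```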
OLAP and data cubes :-
• Online Analytical Processing (OLAP) is a category of software that allows users to analyze
information from multiple database systems at the same time.
• It is a technology that enables analysts to extract and view business data from different points
of view.
• OLAP databases are divided into one or more cubes. The cubes are designed in such a way
that creating and viewing reports become easy.
• At the core of the OLAP concept is the OLAP cube.
• The OLAP cube is a data structure optimized for very quick data analysis.
• The OLAP cube consists of numeric facts called measures which are
categorized by dimensions. The OLAP cube is also called a hypercube.
• Data is loaded into an OLAP server (or OLAP cube), where information is
pre-calculated in advance for further analysis.
Basic analytical operations of OLAP
Drill-down
• In drill-down, data is fragmented into smaller parts.
• It is the opposite of the roll-up process. It can be done via:
• Moving down the concept hierarchy
• Increasing a dimension
4) Pivot
In pivot, you rotate the data axes to provide an alternative presentation of the data.
In the following example, the pivot is based on item types.
Data Preprocessing :-
• Data preprocessing, a component of data preparation, describes any type of
processing performed on raw data to prepare it for another data processing
procedure.
• It has traditionally been an important preliminary step for the data mining process.
• There are several different tools and methods used for preprocessing data,
including the following:
Why is data preprocessing important?
• Real-world data is often messy: a data set may be missing individual fields, contain
manual input errors, or have duplicate data or different names describing the same thing.
• Humans can often identify and rectify these problems in the data they use in
the line of business, but data used to train machine learning or deep learning
algorithms needs to be automatically preprocessed.
Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction
• Data Discretization
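Three of the tasks above can be sketched on a tiny numeric column using only the standard library: cleaning (filling missing values), transformation (min-max scaling) and discretization (binning). The values and the 0.5 bin threshold are arbitrary choices for illustration; real pipelines would use a library such as pandas or scikit-learn.

```python
from statistics import mean

raw = [10.0, None, 30.0, 20.0]        # a column with one missing value

# Data cleaning: replace missing values with the column mean.
present = [x for x in raw if x is not None]
cleaned = [x if x is not None else mean(present) for x in raw]

# Data transformation: min-max scale the values into [0, 1].
lo, hi = min(cleaned), max(cleaned)
scaled = [(x - lo) / (hi - lo) for x in cleaned]

# Data discretization: map continuous values to "low"/"high" bins.
bins = ["low" if x < 0.5 else "high" for x in scaled]
print(cleaned)   # [10.0, 20.0, 30.0, 20.0]
print(bins)      # ['low', 'high', 'high', 'high']
```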
What are the key steps in data preprocessing?
The steps used in data preprocessing include the following:
1. Data profiling.
Data profiling is the process of examining, analyzing and reviewing data to collect
statistics about its quality. It starts with a survey of existing data and its
characteristics.
2. Data cleansing. The aim here is to find the easiest way to rectify quality issues,
such as eliminating bad data, filling in missing data or otherwise ensuring the raw
data is suitable for feature engineering.
3. Data reduction. Raw data sets often include redundant data that arises from
characterizing phenomena in different ways, or data that is not relevant to a
particular ML, AI or analytics task.
4. Data transformation. Here, data scientists think about how different
aspects of the data need to be organized to make the most sense for the
goal. This could include things like structuring unstructured data, combining
salient variables when it makes sense or identifying important ranges to
focus on.
5. Data enrichment. In this step, data scientists apply the various feature
engineering libraries to the data to effect the desired transformations. The
result should be a data set organized to achieve the optimal balance
between the training time for a new model and the required compute.
6. Data validation. At this stage, the data is split into two sets. The first set is
used to train a machine learning or deep learning model. The second set is
the testing data that is used to gauge the accuracy and robustness of the
resulting model.
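The validation split in step 6 can be sketched with the standard library. The 80/20 ratio and the fixed seed are common but arbitrary choices, and the integers stand in for labelled examples.

```python
import random

data = list(range(100))             # stand-in for 100 labelled examples

rng = random.Random(42)             # fixed seed so the split is reproducible
rng.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

print(len(train), len(test))        # 80 20
assert not set(train) & set(test)   # no example appears in both sets
```

The training set fits the model; the held-out test set is only used afterwards to gauge accuracy and robustness.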
Analysis of Data Preprocessing :-
THANK YOU