ADBS Unit IV


MIT School of Computing

Department of Computer Science & Engineering

Third Year Engineering

21BTCS604 - Advanced Databases

PLD

AY 2023-2024
(SEM-V)

UNIT NO. 4: Data Warehousing

UNIT 4 Contents:

Introduction to Decision Support, Data Warehousing, Creating and maintaining a warehouse.
Introduction to Data warehouse and OLAP, Multidimensional data model, Data Warehouse architecture, OLAP and data cubes, Operations on cubes, Data preprocessing and the need for preprocessing.
Data warehousing Concepts, Study of Data preprocessing and the need for preprocessing, Simulating and maintaining a Warehouse, Analysis of Data preprocessing.

Data Warehousing introduction :

• A data warehouse is a centralized storage system that allows for the storing, analyzing, and interpreting of data in order to facilitate better decision-making.
• Transactional systems, relational databases, and other sources provide data to data warehouses on a regular basis.
• There are decision support technologies that help utilize the data available in a data warehouse.
• These technologies help executives use the warehouse quickly and effectively.
• They can gather data, analyze it, and make decisions based on the information present in the warehouse.

• The process of creating data warehouses to store a large amount of data is named Data Warehousing.
• Data Warehousing helps to improve the speed and efficiency of accessing different data sets, and makes it easier for company decision-makers to obtain insights that will help the business and promote marketing tactics that set it apart from its competitors.

Introduction to Decision Support System :

• Decision support systems (DSS) are interactive software-based systems intended to help managers in decision-making by accessing large volumes of information generated from various related information systems involved in organizational business processes, such as office automation systems, transaction processing systems, etc.

• There are two types of decisions:

1) Programmed decisions are basically automated processes and general routine work, where:
a) These decisions have been taken several times.
b) These decisions follow some guidelines or rules.
For example, selecting a reorder level for inventories is a programmed decision.

2) Non-programmed decisions occur in unusual and non-addressed situations, so:
a) It would be a new decision.
b) There will not be any rules to follow.
c) These decisions are made based on the available information.

• These decisions are based on the manager's discretion, instinct, perception and judgment.
For example, investing in a new technology is a non-programmed decision.
Characteristics of a DSS :

• Support for decision-makers in semi-structured and unstructured problems.
• Support for managers at various managerial levels, ranging from top executives to line managers.
• Support for individuals and groups. Less structured problems often require the involvement of several individuals from different departments and organizational levels.
• Support for interdependent or sequential decisions.
• Support for intelligence, design, choice, and implementation.
• Support for a variety of decision processes and styles.
• DSSs are adaptive over time.

Components of a DSS –

Following are the components of the Decision Support System −

Database Management System (DBMS) −
To solve a problem, the necessary data may come from internal or external databases.
In an organization, internal data are generated by systems such as TPS and MIS.
External data come from a variety of sources such as newspapers, online data services, and
databases (financial, marketing, human resources).

Model Management System −
It stores and accesses models that managers use to make decisions.
Such models are used for designing a manufacturing facility, analyzing the financial health of an
organization, forecasting demand for a product or service, etc.

Support Tools −
Support tools such as online help, pull-down menus, user interfaces, graphical analysis, and error
correction mechanisms facilitate the user's interactions with the system.

Creating and maintaining a warehouse –

The steps needed for building a data warehouse are as follows (a minimal sketch follows the list):

To extract the (transactional) data from different data sources:
For building a data warehouse, data is extracted from various data sources and stored in a central storage area.

To transform the transactional data:
Companies store their data in various DBMSs, such as MS Access, MS SQL Server, Oracle, Sybase, etc., and this data must be transformed into a common format for the warehouse.

To load the (transformed) data into the dimensional database:
After building a dimensional model, the data is loaded into the dimensional database. This process may combine several columns together or split one field into several columns.

To purchase a front-end reporting tool:
Top-notch analytical tools are available in the market, provided by several major vendors.
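
As a rough illustration of these steps, here is a minimal Python sketch that extracts data from two hypothetical source files, applies a simple transformation, and loads the result into a SQLite database standing in for the dimensional database. The file names, column names, and table names are assumptions made up for this example; a real warehouse would use dedicated ETL tools.

import sqlite3
import pandas as pd

# 1) Extract transactional data from different sources (file names are assumptions).
orders = pd.read_csv("orders_from_oltp.csv")        # e.g. an export from MS SQL Server
customers = pd.read_csv("customers_from_crm.csv")   # e.g. an export from MS Access

# 2) Transform the transactional data to fit the dimensional model.
orders["order_date"] = pd.to_datetime(orders["order_date"])
orders["quarter"] = orders["order_date"].dt.to_period("Q").astype(str)

# 3) Load the transformed data into the dimensional database.
warehouse = sqlite3.connect("warehouse.db")
orders.to_sql("fact_orders", warehouse, if_exists="replace", index=False)
customers.to_sql("dim_customer", warehouse, if_exists="replace", index=False)
warehouse.close()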

Introduction to Data warehouse and OLAP

• A data warehouse is like a big library where we keep a lot of information from different places. It organizes the information so that it can be analyzed and understood easily.
• So you can make good decisions based on these facts.
• You have all the required information that you need in one place.
• We organize the information so it is easy to find and use.
• It takes information from different places and puts it all together in one place, which makes it easier to understand.

Functions of Data warehouse

A data warehouse is a collection of data that is organized to provide various functions for managing and analyzing data. Some of the important functions of a data warehouse are −

1)Data Consolidation
2)Data Cleaning
3)Data Integration
4)Data Storage
5)Data Transformation
6)Data Analysis
7)Data Reporting
8)Data Mining
9)Performance Optimization

Online Analytical Processing Server (OLAP) :-

• An Online Analytical Processing Server (OLAP) is software that lets users analyze information from many different databases at once.
• It uses a multidimensional data model where users can ask questions based on multiple dimensions at the same time.
• For example, a user could ask for sales data from Delhi in the year 2018.
• OLAP databases are split up into cubes, which are also called hyper-cubes.

OLAP operations :-

These are used to analyze data in an OLAP cube. There are five basic operations:

1)Drill down
This makes the data more detailed by moving down the concept hierarchy or adding a new dimension.
For example, in a cube showing sales data by Quarter, drilling down would show sales data by Month.
2)Roll up
This makes the data less detailed by climbing up the concept hierarchy or reducing dimensions.
For example, in a cube showing sales data by City, rolling up would show sales data by Country.
3)Dice
This selects a sub-cube by choosing two or more dimensions and criteria.
For example, in a cube showing sales data by Location, Time, and Item, dicing could select sales data for Delhi
or Kolkata, in Q1 or Q2, for Cars or Buses.

4)Slice
This selects a single dimension and creates a new sub-cube. For example, in a cube showing sales data by
Location, Time, and Item, slicing by Time would create a new sub-cube showing sales data for Q1.

5)Pivot
This rotates the current view to get a new representation. For example, after slicing by Time, pivoting could show the same data but with Location and Item as rows instead of columns.
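
The five operations can be sketched on a tiny sales table using pandas as a stand-in for an OLAP engine; the column names and values below are illustrative assumptions, not a real cube.

import pandas as pd

sales = pd.DataFrame({
    "location": ["Delhi", "Delhi", "Kolkata", "Kolkata", "Mumbai", "Mumbai"],
    "quarter":  ["Q1", "Q2", "Q1", "Q2", "Q1", "Q3"],
    "month":    ["Jan", "Apr", "Feb", "May", "Mar", "Aug"],
    "item":     ["Car", "Car", "Bus", "Bus", "Car", "Bus"],
    "sales":    [120, 150, 80, 95, 60, 70],
})

# Drill down: move down the Time hierarchy from quarter to month (more detail).
drill_down = sales.groupby(["location", "quarter", "month", "item"])["sales"].sum()

# Roll up: climb the Time hierarchy back to quarter (less detail).
roll_up = sales.groupby(["location", "quarter", "item"])["sales"].sum()

# Dice: select a sub-cube on two or more dimensions and criteria.
dice = sales[sales["location"].isin(["Delhi", "Kolkata"])
             & sales["quarter"].isin(["Q1", "Q2"])
             & sales["item"].isin(["Car", "Bus"])]

# Slice: fix a single dimension (Time = Q1) to create a new sub-cube.
slice_q1 = sales[sales["quarter"] == "Q1"]

# Pivot: rotate the axes; transposing swaps rows and columns of the view.
view = sales.pivot_table(index="quarter", columns=["location", "item"],
                         values="sales", aggfunc="sum")
rotated = view.T  # the same data with Location and Item as rows instead of columns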

Multidimensional data model :

• The Multidimensional Data Model is a method used for ordering data in the database, with good arrangement and assembly of the contents of the database.
• The Multidimensional Data Model allows users to pose analytical questions associated with market or business trends, unlike relational databases which allow users to access data only in the form of queries.
• OLAP (online analytical processing) and data warehousing use multidimensional databases. They are used to show multiple dimensions of the data to users.
• It represents data in the form of data cubes. Data cubes allow the data to be modeled and viewed from many dimensions and perspectives.
• It is defined by dimensions and facts and is represented by a fact table.
• Facts are numerical measures, and fact tables contain the measures of the related dimension tables or the names of the facts.
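
As a concrete illustration of dimensions, facts, and a fact table, the sketch below builds a toy star schema in pandas; every table and column name here is an assumption made up for the example.

import pandas as pd

# Dimension tables hold the descriptive attributes of each dimension.
dim_location = pd.DataFrame({"location_id": [1, 2],
                             "city": ["Delhi", "Kolkata"],
                             "country": ["India", "India"]})
dim_time = pd.DataFrame({"time_id": [1, 2],
                         "quarter": ["Q1", "Q1"],
                         "month": ["Jan", "Feb"]})
dim_item = pd.DataFrame({"item_id": [1, 2],
                         "name": ["Car", "Bus"]})

# The fact table holds the numeric measure (sales) plus foreign keys into the dimensions.
fact_sales = pd.DataFrame({"location_id": [1, 1, 2],
                           "time_id": [1, 2, 1],
                           "item_id": [1, 2, 1],
                           "sales": [120, 60, 80]})

# Joining the fact table to its dimension tables recovers the multidimensional view.
view = (fact_sales
        .merge(dim_location, on="location_id")
        .merge(dim_time, on="time_id")
        .merge(dim_item, on="item_id"))

The numeric measures live only in the fact table, while the dimension tables carry the descriptive attributes (city, quarter, item name) used to slice and aggregate those measures.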

Working on a Multidimensional Data Model :-

1) Assembling data from the client.
2) Grouping different segments of the system.
3) Noticing the different proportions.
4) Preparing the actual-time factors and their respective qualities.
5) Finding the actuality of the factors listed previously and their qualities.
6) Building the schema to place the data, with respect to the information collected from the steps above.

Advantages of Multi Dimensional Data Model
The following are the advantages of a multi-dimensional data model :
• A multi-dimensional data model is easy to handle.
• It is easy to maintain.
• Its performance is better than that of normal databases (e.g. relational databases).
• The representation of data is better than in traditional databases, because multi-dimensional databases are multi-viewed and carry different types of factors.
• It is workable on complex systems and applications, contrary to simple one-dimensional database systems.
• The compatibility of this type of database is an upliftment for projects having lower bandwidth for maintenance staff.

Disadvantages of Multi Dimensional Data Model
The following are the disadvantages of a Multi Dimensional Data Model :
• The multi-dimensional data model is slightly complicated in nature, and it requires professionals to recognize and examine the data in the database.
• When a multi-dimensional data model system crashes, there is a great effect on the working of the system.
• It is complicated in nature, due to which the databases are generally dynamic in design.
• The path to achieving the end product is complicated most of the time.
• As the multi-dimensional data model has complicated systems, it involves a large number of databases, due to which the system is very insecure when there is a security breach.

Data Warehouse Architecture :-

A data warehouse architecture is a method of defining the overall architecture of data communication, processing, and presentation that exists for end-client computing within the enterprise. Each data warehouse is different, but all are characterized by standard vital components.

Operational System-
In data warehousing, an operational system refers to a system that is used to process the day-to-day transactions of an organization.

Flat Files
A Flat file system is a system of files in which transactional data is stored, and every file in
the system must have a different name.

Meta Data
A set of data that defines and gives information about other data.
Metadata is used in a data warehouse for a variety of purposes, including:
Metadata summarizes necessary information about data, which can make finding and working with particular instances of data easier.
For example, author, date built, date changed, and file size are examples of very basic document metadata.
Metadata is used to direct a query to the most appropriate data source.

Lightly and highly summarized data

This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager.
The goal of the summarized information is to speed up query performance.
The summarized records are updated continuously as new information is loaded into the warehouse.

End-User access Tools

The principal purpose of a data warehouse is to provide information to the business managers for strategic decision-making. These customers interact with the warehouse using end-client access tools.

OLAP and data cubes :-

• Online Analytical Processing (OLAP) is a category of software that allows users to analyze
information from multiple database systems at the same time.

• It is a technology that enables analysts to extract and view business data from different points
of view.

• OLAP databases are divided into one or more cubes. The cubes are designed in such a way
that creating and viewing reports become easy.

• OLAP stands for Online Analytical Processing.

• At the core of the OLAP concept is the OLAP Cube.

• The OLAP cube is a data structure optimized for very quick data analysis.

• The OLAP Cube consists of numeric facts called measures which are categorized by dimensions. The OLAP Cube is also called a hypercube.

• A data warehouse would extract information from multiple data sources and formats like text files, Excel sheets, multimedia files, etc.

• The extracted data is cleaned and transformed.

• Data is loaded into an OLAP server (or OLAP cube) where information is
pre-calculated in advance for further analysis.
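
One way to picture this pre-calculation is to materialize the aggregate for every subset of the dimensions, similar in spirit to SQL's GROUP BY CUBE; the toy fact table below is an assumption for illustration, not how any particular OLAP server is implemented.

from itertools import combinations
import pandas as pd

fact = pd.DataFrame({"location": ["Delhi", "Delhi", "Kolkata", "Kolkata"],
                     "quarter":  ["Q1", "Q2", "Q1", "Q2"],
                     "item":     ["Car", "Bus", "Car", "Bus"],
                     "sales":    [120, 90, 80, 95]})

dimensions = ["location", "quarter", "item"]
cube = {}
for r in range(len(dimensions) + 1):
    for dims in combinations(dimensions, r):
        if dims:
            # Aggregate the measure over this combination of dimensions.
            cube[dims] = fact.groupby(list(dims))["sales"].sum()
        else:
            # The empty combination is the grand total (the apex of the cube).
            cube[()] = fact["sales"].sum()

# Analytical questions then become lookups into the pre-computed aggregates,
# e.g. total Car sales regardless of location and time:
total_car_sales = cube[("item",)]["Car"]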

Basic analytical operations of OLAP

Four types of analytical OLAP operations are:
Roll-up
Drill-down
Slice and dice
Pivot (rotate)

1) Roll-up:
Roll-up is also known as “consolidation” or “aggregation.” The Roll-up operation can be performed in 2 ways:
Reducing dimensions
Climbing up the concept hierarchy. A concept hierarchy is a system of grouping things based on their order or level.

2) Drill-down:
• In drill-down, data is fragmented into smaller parts.
• It is the opposite of the roll-up process. It can be done via:
Moving down the concept hierarchy
Increasing a dimension

In the example, Quarter Q1 is drilled down to the months January, February, and March, and the corresponding sales are also registered.
In this example, the dimension Months is added.
3) Slice:
Here, one dimension is selected, and a new sub-cube is created.
The following diagram explains how the slice operation is performed:

The dimension Time is sliced with Q1 as the filter.
A new cube is created altogether.
Dice:
This operation is similar to slice. The difference is that in dice you select 2 or more dimensions, resulting in the creation of a sub-cube.

4) Pivot
In Pivot, you rotate the data axes to provide a substitute presentation of data.
In the following example, the pivot is based on item types.

Data Preprocessing :-

• Data preprocessing, a component of data preparation, describes any type of processing performed on raw data to prepare it for another data processing procedure.

• It has traditionally been an important preliminary step for the data mining process.
• There are several different tools and methods used for preprocessing data,
including the following:

1) Sampling, which selects a representative subset from a large population of data;
2) Transformation, which manipulates raw data to produce a single input;
3) Denoising, which removes noise from data;
4) Imputation, which synthesizes statistically relevant data for missing values;
5) Normalization, which organizes data for more efficient access;
6) Feature extraction, which pulls out a relevant feature subset that is significant in a particular context.
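
The sketch below runs each of these methods on a small made-up DataFrame using pandas and NumPy; the column names and the particular techniques chosen (clipping as denoising, min-max scaling as normalization) are illustrative assumptions.

import numpy as np
import pandas as pd

raw = pd.DataFrame({"age": [25, 32, np.nan, 51, 46, 29],
                    "income": [30000, 42000, 55000, np.nan, 61000, 38000],
                    "clicks": [3, 120, 5, 4, 200, 6]})

# 1) Sampling: select a representative subset from the full data.
sample = raw.sample(frac=0.5, random_state=0)

# 2) Transformation: manipulate raw values to produce a single derived input.
raw["log_clicks"] = np.log1p(raw["clicks"])

# 3) Denoising: damp extreme values (a simple stand-in for noise removal).
raw["clicks"] = raw["clicks"].clip(upper=raw["clicks"].quantile(0.95))

# 4) Imputation: fill missing values with statistically relevant estimates.
raw["age"] = raw["age"].fillna(raw["age"].median())
raw["income"] = raw["income"].fillna(raw["income"].mean())

# 5) Normalization: rescale a feature to a comparable [0, 1] range.
raw["income_norm"] = (raw["income"] - raw["income"].min()) / (raw["income"].max() - raw["income"].min())

# 6) Feature extraction: keep only the feature subset relevant to the task.
features = raw[["age", "income_norm", "log_clicks"]]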

Why is data preprocessing important?

• Virtually any type of data analysis, data science or AI development requires some type of data preprocessing to provide reliable, precise and robust results for enterprise applications.

• Real-world data is messy and is often created, processed and stored by a variety of humans, business processes and applications.

• As a result, a data set may be missing individual fields, contain manual input
errors, or have duplicate data or different names to describe the same thing.

• Humans can often identify and rectify these problems in the data they use in
the line of business, but data used to train machine learning or deep learning
algorithms needs to be automatically preprocessed.

Major Task in Data Preprocessing

• Data Cleaning

• Data Integration

• Data Transformation

• Data Reduction

• Data Discretization

What are the key steps in data preprocessing?
The steps used in data preprocessing include the following:
1. Data profiling.
Data profiling is the process of examining, analyzing and reviewing data to collect
statistics about its quality. It starts with a survey of existing data and its
characteristics.

2. Data cleansing. The aim here is to find the easiest way to rectify quality issues,
such as eliminating bad data, filling in missing data or otherwise ensuring the raw
data is suitable for feature engineering.

3. Data reduction. Raw data sets often include redundant data that arise from characterizing phenomena in different ways, or data that is not relevant to a particular ML, AI or analytics task.

4. Data transformation. Here, data scientists think about how different
aspects of the data need to be organized to make the most sense for the
goal. This could include things like structuring unstructured data, combining
salient variables when it makes sense or identifying important ranges to
focus on.

5. Data enrichment. In this step, data scientists apply the various feature
engineering libraries to the data to effect the desired transformations. The
result should be a data set organized to achieve the optimal balance
between the training time for a new model and the required compute.

6. Data validation. At this stage, the data is split into two sets. The first set is
used to train a machine learning or deep learning model. The second set is
the testing data that is used to gauge the accuracy and robustness of the
resulting model.
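
A compact sketch of steps 1 through 6 on a toy customer table is shown below, assuming pandas and scikit-learn are available; the columns and the specific cleaning choices are illustrative assumptions, not a prescribed pipeline.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"customer_id": [1, 2, 2, 3, 4],
                   "age": [25.0, None, None, 51.0, 46.0],
                   "spend": [200.0, 310.0, 310.0, 150.0, 420.0],
                   "churned": [0, 1, 1, 0, 1]})

# 1) Data profiling: collect statistics about the data's quality and characteristics.
profile = df.describe(include="all")
missing_counts = df.isna().sum()

# 2) Data cleansing: remove duplicate rows and fill in missing values.
df = df.drop_duplicates()
df["age"] = df["age"].fillna(df["age"].median())

# 3) Data reduction: drop data that is not relevant to the task.
df = df.drop(columns=["customer_id"])

# 4) Data transformation: organize the data to suit the goal (here, ensure an integer label).
df["churned"] = df["churned"].astype(int)

# 5) Data enrichment: engineer an additional feature from the existing columns.
df["high_spender"] = (df["spend"] > df["spend"].median()).astype(int)

# 6) Data validation: split into a training set and a testing set.
train, test = train_test_split(df, test_size=0.25, random_state=0)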

Analysis of Data Preprocessing :-

THANK YOU

