Download as pdf or txt
Download as pdf or txt
You are on page 1of 18

RAJALAKSHMI INSTITUTE OF TECHNOLOGY

(An Autonomous Institution)


Kuthambakkam Post, Chennai – 600124

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

2023-24 Even Semester

Sub Code: CCS341


Subject Name: DATA WAREHOUSING
Semester/Year: VI/III

UNIT 1: Introduction to DataWarehouse

(R2021)
RIT CCS341 - DATA WAREHOUSING 1

UNIT-I
INTRODUCTION TO DATA WAREHOUSE
Data warehouse Introduction - Data warehouse components- operational database Vs data
warehouse – Data warehouse Architecture – Three-tier Data Warehouse Architecture -
Autonomous Data Warehouse- Autonomous Data Warehouse Vs Snowflake - Modern Data
Warehouse

Introduction
A Data Warehouse is Built by combining data from multiple diverse sources that support
analytical reporting, structured and unstructured queries, and decision making for the organization,
and Data Warehousing is a step-by-step approach for constructing and using a Data Warehouse.
Many data scientists get their data in raw formats from various sources of data and information. But,
for many data scientists also as business decision-makers, particularly in big enterprises, the main
sources of data and information are corporate data warehouses. A data warehouse holds data from
multiple sources, including internal databases and Software (SaaS) platforms. After the data is
loaded, it often cleansed, transformed, and checked for quality before it is used for analytics
reporting, data science, machine learning, or anything.

What is Data Warehouse?


A Data Warehouse is a collection of software tools that facilitates analysis of a large set of
business data used to help an organization make decisions. A large amount of data in data
warehouses comes from numerous sources such that internal applications like marketing, sales, and
finance; customer-facing apps; and external partner systems, among others. It is a centralized data
repository for analysts that can be queried whenever required for business benefits. A data
warehouse is mainly a data management system that’s designed to enable and support business
intelligence (BI) activities, particularly analytics. Data warehouses are alleged to perform queries,
cleaning, manipulating, transforming and analyzing the data and they also contain large amounts of
historical data.

UNIT 1: INTRODUCTION TO DATA WAREHOUSE


RIT CCS341 - DATA WAREHOUSING 2

What is Data Warehousing?


The process of creating data warehouses to store a large amount of data is named Data
Warehousing. Data Warehousing helps to improve the speed and efficiency of accessing different
data sets and makes it easier for company decision-makers to obtain insights that will help the
business and promoting marketing tactics that set them aside from their competitors. We can say that
it is a blend of technologies and components which aids the strategic use of data and information. The
main goal of data warehousing is to create a hoarded wealth of historical data that can be retrieved and
analyzed to supply helpful insight into the organization’s operations.

Need of Data Warehousing.


Data Warehousing is a progressively essential tool for business intelligence. It allows organizations
to make quality business decisions. The data warehouse benefits by improving data analytics, it also
helps to gain considerable revenue and the strength to compete more strategically in the market. By
efficiently providing systematic, contextual data to the business intelligence tool of an organization,
the data warehouses can find out more practical business strategies.

UNIT 1: INTRODUCTION TO DATA WAREHOUSE


RIT CCS341 - DATA WAREHOUSING 3

1. Business User: Business users or customers need a data warehouse to look at summarized
data from the past. Since these people are coming from a non-technical background also, the
data may berepresented to them in an uncomplicated way.
2. Maintains consistency: Data warehouses are programmed in such a way that they can be
applied in a regular format to all collected data from different sources, which makes it
effortless for company decision-makers to analyze and share data insights with their
colleagues around the globe. By standardizing the data, the risk of error in interpretation is
also reduced and improves overall accuracy.
3. Store historical data: Data Warehouses are also used to store historical data that means, the
time variable data from the past and this input can be used for various purposes.
4. Make strategic decisions: Data warehouses contribute to making better strategic decisions.
Some business strategies may be depending upon the data stored within the data warehouses.
5. High response time: Data warehouse has got to be prepared for somewhat sudden masses
and typeof queries that demands a major degree of flexibility and fast latency.
Characteristics of Data warehouse:
1. Subject Oriented: A data warehouse is often subject-oriented because it delivers may be
achieved on a particular theme which means the data warehousing process is proposed to
handle a particular theme that is more defined. These themes are often sales, distribution,
selling. etc.
2. Time-Variant: When the data is maintained via totally different intervals of time like
weekly, monthly, or annually, etc. It founds numerous time limits that are unit structured

UNIT 1: INTRODUCTION TO DATA WAREHOUSE


RIT CCS341 - DATA WAREHOUSING 4
between the big datasets and are command within the online transaction method (OLTP). The
time limits for the data warehouse are extended than that of operational systems. The data
resided within the data warehouse is predetermined with a particular interval of time and
delivers information from the historical perspective. It contains parts of time directly or
indirectly.
3. Non-volatile: The data residing in the data warehouse is permanent and defined by its
names. It additionally means that the data in the data warehouse is cannot be erased or
deleted or also when new data is inserted into it. In the data warehouse, data is read-only and
can only be refreshed at a particular interval of time. Operations such as delete, update and
insert that is done in a software application over data is lost in the data warehouse
environment. There are only two types of data operations that can be done in the data
warehouse:
 Data Loading
 Data Access
4. Integrated: A data warehouse is created by integrating data from numerous different
sources such that from mainframe computers and a relational database. Additionally, it
should also have reliable naming conventions, formats, and codes. Integration of data
warehouse benefits in the successful analysis of data. Dependability in naming conventions,
column scaling, encoding structure, etc. needs to be confirmed. Integration of data
warehouse handles numerous subject-oriented warehouses.

UNIT 1: INTRODUCTION TO DATA WAREHOUSE


RIT CCS341 - DATA WAREHOUSING 5

Operational Database Data Warehouse

Operational systems are designed to support Data warehousing systems are typically
high-volume transaction processing. designed to support high-volume analytical
processing (i.e., OLAP).

Operational systems are usually concerned Data warehousing systems are usually
with current data. concerned with historical data.

Data within operational systems are mainly Non-volatile, new data may be added
updated regularly according to need. regularly. Once Added rarely changed.

It is designed for real-time business dealing It is designed for analysis of business


and processes. measures by subject area, categories, and
attributes.

It is optimized for a simple set of transactions, It is optimized for extent loads and high,
generally adding or retrieving a single row at a complex, unpredictable queries that access
time per table. many rows per table.

It is optimized for validation of incoming Loaded with consistent, valid information,


information during transactions, uses requires no real-time validation.
validation data tables.

It supports thousands of concurrent clients. It supports a few concurrent clients relative


to OLTP.

Operational systems are widely process- Data warehousing systems are widely
oriented. subject-oriented

Operational systems are usually optimized to Data warehousing systems are usually
perform fast inserts and updates of optimized to perform fast retrievals of
associatively small volumes of data. relatively high volumes of data.

Data In Data Out

Less Number of data accessed. Large Number of data accessed.

Relational databases are created for on-line Data Warehouse designed for on-line
transactional Processing (OLTP) Analytical Processing (OLAP)

UNIT 1: INTRODUCTION TO DATA WAREHOUSE


RIT CCS341 - DATA WAREHOUSING 6

Features OLTP OLAP

Characteristic It is a system which is used to It is a system which is used to manage


manage operational Data. informational Data.

Users Clerks, clients, and information Knowledge workers, including


technology professionals. managers, executives, and analysts.

System OLTP system is a customer- OLAP system is market-oriented,


orientation oriented, transaction, and knowledge workers including managers,
query processing are done by do data analysts executive and analysts.
clerks, clients, and information
technology professionals.

Data contents OLTP system manages current OLAP system manages a large amount
data that too detailed and are of historical data, provides facilitates for
used for decision making. summarization and aggregation, and
stores and manages data at different
levels of granularity. This information
makes the data more comfortable to
use in informed decision making.

Database Size 100 MB-GB 100 GB-TB

Database design OLTP system usually uses an OLAP system typically uses either a star
entity-relationship (ER) data or snowflake model and subject-
model and application- oriented database design.
oriented database design.

View OLTP system focuses primarily OLAP system often spans multiple
on the current data within an versions of a database schema, due to
enterprise or department, the evolutionary process of an
without referring to historical organization. OLAP systems also deal
information or data in different with data that originates from various
organizations. organizations, integrating information
from many data stores.

Volume of data Not very large Because of their large volume, OLAP
data are stored on multiple storage
media.

UNIT 1: INTRODUCTION TO DATA WAREHOUSE


RIT CCS341 - DATA WAREHOUSING 7

Access patterns The access patterns of an OLTP Accesses to OLAP systems are mostly
system subsist mainly of short, read-only methods because of these
atomic transactions. Such a data warehouses stores historical data.
system requires concurrency
control and recovery
techniques.

Access mode Read/write Mostly write

Insert and Short and fast inserts and Periodic long-running batch jobs
Updates updates proposed by end- refresh the data.
users.

Number of Tens Millions


records accessed

Normalization Fully Normalized Partially Normalized

Processing Speed Very Fast It depends on the amount of files


contained, batch data refresh, and
complex query may take many hours,
and query speed can be upgraded by
creating indexes.

Architecture & Components of Data Warehouse:


Data warehouse architecture defines the comprehensive architecture of data processing and
presentation that will be useful for data analysis and decision making within the enterprise and
organization. Each organization has different data warehouses depending upon their need, but all of
them are characterized by some standardcomponents.
Data Warehouse applications are designed to support the user’s data requirements, an example of
this is online analytical processing (OLAP). These include functions such as forecasting, profiling,
summary reporting, and trend analysis.
The architecture of the data warehouse mainly consists of the proper arrangement of its elements, to
build an efficient data warehouse with software and hardware components. The elements and
components may vary based on the requirement of organizations. All of these depend on the
organization’s circumstances.

UNIT 1: INTRODUCTION TO DATA WAREHOUSE


RIT CCS341 - DATA WAREHOUSING 8

1. Source Data Component:


In the Data Warehouse, the source data comes from different places. They are group into four
categories:
 External Data: For data gathering, most of the executives and data analysts rely on
information coming from external sources for a numerous amount of the information they
use. They use statistical features associated with their organization that is brought out by
some external sources and department.
 Internal Data: In every organization, the consumer keeps their “private” spreadsheets,
reports, client profiles, and generally even department databases. This is often the interior
information, a part that might be helpful in every data warehouse.
 Operational System data: Operational systems are principally meant to run the business. In
each operation system, we periodically take the old data and store it in achieved files.
 Flat files: A flat file is nothing but a text database that stores data in a plain text format. Flat
files generally are text files that have all data processing and structure markup removed. A
flat file contains a table with a single record per line.
2. Data Staging:
After the data is extracted from various sources, now it’s time to prepare the data files for storing in
the data warehouse. The extracted data collected from various sources must be transformed and
made ready in a format that is suitable to be saved in the data warehouse for querying and analysis.
The data staging contains three primary functions that take place in this part:

UNIT 1: INTRODUCTION TO DATA WAREHOUSE


RIT CCS341 - DATA WAREHOUSING 9

 Data Extraction: This stage handles various data sources. Data analysts should employ
suitable techniques for every data source.
 Data Transformation: As we all know, information for a knowledge warehouse comes
from many alternative sources. If information extraction for a data warehouse posture huge
challenges, information transformation gifts even important challenges. We tend to perform
many individual tasks as a part of information transformation. First, we tend to clean the info
extracted from every source of data. Standardization of information elements forms an
outsized part of data transformation. Data transformation contains several kinds of
combining items of information from totally different sources. Information transformation
additionally contains purging supply information that’s not helpful and separating
outsourced records into new mixtures. Once the data transformation performs ends, we’ve got
a set of integrated information that’s clean, standardized, and summarized.
 Data Loading: When we complete the structure and construction of the data warehouse and
go live for the first time, we do the initial loading of the data into the data warehouse
storage. The initial load moves high volumes of data consuming a considerable amount of
time.
3. Data Storage in Warehouse:
Data storage for data warehousing is split into multiple repositories. These data repositories contain
structureddata in a very highlynormalized form for fast and efficient processing.
 Metadata: Metadata means data about data i.e. it summarizes basic details regarding data,
creating findings & operating with explicit instances of data. Metadata is generated by an
additional correction or automatically and can contain basic information about data.
 Raw Data: Raw data is a set of data and information that has not yet been processed and
was delivered from a particular data entity to the data supplier and hasn’t been processed
nonetheless by machine or human. This data is gathered out from online sources to deliver

UNIT 1: INTRODUCTION TO DATA WAREHOUSE


RIT CCS341 - DATA WAREHOUSING 10
deep insight into users’ online behavior.
 Summary Data or Data summary: Data summary is an easy term for a brief conclusion of
an enormous theory or a paragraph. This is often one thing where analysts write the code and
in the end, they declare the ultimate end in the form of summarizing data. Data summary is
the most essential thing in data mining and processing.
4. Data Marts:
Data marts are also the part of storage component in a data warehouse. It can store the information
of a specific function of an organization that is handled by a single authority. There may be any
number of data marts in a particular organization depending upon the functions. In short, data marts
contain subsets of the data stored in data warehouses. Now, the users and analysts can use data for
various applications like reporting, analyzing, mining, etc. The data is made available to them
whenever required.

Three Tier Architectures


Data Warehouse Architecture
The Three-Tier Data Warehouse Architecture is the commonly used Data Warehouse design in
order to build a Data Warehouse by including the required Data Warehouse Schema Model, the
required OLAP server type, and the required front-end tools for Reporting or Analysis purposes,
which as the name suggests contains three tiers such as Top tier, Bottom Tier and the Middle Tier
that are procedurally linked with one another from Bottom tier(data sources) through Middle
tier(OLAP servers) to the Top tier(Front-end tools).
Data Warehouse Architecture is the design based on which a Data Warehouse is built, to
accommodate the desired type of Data Warehouse Schema, user interface application and
database management system, for data organization and repository structure. The type of
Architecture is chosen based on the requirement provided by the project team. Three-tier Data
Warehouse Architecture is the commonly used choice, due to its detailing in the structure. The

UNIT 1: INTRODUCTION TO DATA WAREHOUSE


RIT CCS341 - DATA WAREHOUSING 11
three different tiers here are termed as:
 Top-Tier
 Middle-Tier
 Bottom-Tier
Each Tier can have different components based on the prerequisites presented by the decision-
makers of theproject but are o the novelty of their respective tier.

Three-Tier Data Warehouse Architecture


Here is a pictorial representation for the Three-Tier Data Warehouse Architecture
1. Bottom Tier
The Bottom Tier in the three-tier architecture of a data warehouse consists of the Data Repository.
Data Repository is the storage space for the data extracted from various data sources, which
undergoes a series of activities as a part of the ETL process. ETL stands for Extract, Transform and
Load. As a preliminary process, before the data is loaded into the repository, all the data relevant and
required are identified from several sources of the system. These data are then cleaned up, to avoid
repeating or junk data from its current storage units. The next step is to transform all these data into a
single format of storage. The final step of ETL is to Load the data on the repository. Few commonly
used ETL tools are:
Informatica
Microsoft SSIS
Snaplogic
Confluent
Apache Kafka
Alooma
Ab Initio
IBM Infosphere
UNIT 1: INTRODUCTION TO DATA WAREHOUSE
RIT CCS341 - DATA WAREHOUSING 12
The storage type of the repository can be a relational database management system or a
multidimensional database management system. A relational database system can hold simple
relational data, whereas a multidimensional database system can hold data that more than one
dimension. Whenever the Repository includes both relational and multidimensional database
management systems, there exists a metadata unit. As the name suggests, the metadata unit consists of
all the metadata fetched from both the relational database and multidimensional database systems. This
Metadata unit provides incoming data to the next tier, that is, the middle tier. From the user’s
standpoint, the data from the bottom tier can be accessed only with the use of SQL queries. The
complexity of the queries depends on the type of database. Data from the relational database system can
be retrieved using simple queries, whereas the multidimensional database system demands complex
queries with multiple joins and conditional statements.
2. Middle Tier
The Middle tier here is the tier with the OLAP servers. The Data Warehouse can have more than
one OLAP server, and it can have more than one type of OLAP server model as well, which
depends on the volume of the data to be processed and the type of data held in the bottom tier. There
are three types of OLAP server models, such as:
ROLAP
Relational online analytical processing is a model of online analytical processing which
carries out an active multidimensional breakdown of data stored in a relational database,
instead of redesigning a relational database into a multidimensional database.
This is applied when the repository consists of only the relational database system in it.
MOLAP
Multidimensional online analytical processing is another model of online analytical
processing that catalogs and comprises of directories directly on its multidimensional
database system.
This is applied when the repository consists of only the multidimensional database system in
it.
HOLAP
Hybrid online analytical processing is a hybrid of both relational and multidimensional online
analytical processing models.
When the repository contains both the relational database management system and the
multidimensional database management system, HOLAP is the best solution for a smooth
functional flow between the database systems. HOLAP allows storing data in both the

UNIT 1: INTRODUCTION TO DATA WAREHOUSE


RIT CCS341 - DATA WAREHOUSING 13
relational and the multidimensional formats.
The Middle Tier acts as an intermediary component between the top tier and the data repository,
that is, the top tier and the bottom tier respectively. From the user’s standpoint, the middle tier gives
an idea about the conceptual outlook of the database.
3. Top Tier
The Top Tier is a front-end layer, that is, the user interface that allows the user to connect with the
database systems. This user interface is usually a tool or an API call, which is used to fetch the
required data for Reporting, Analysis, and Data Mining purposes. The type of tool depends purely
on the form of outcome expected. It could be a Reporting tool, an Analysis tool, a Query tool or a
Data mining tool.
It is essential that the Top Tier should be uncomplicated in terms of usability. Only user-friendly
tools can give effective outcomes. Even when the bottom tier and middle tier are designed with at
most cautiousness and clarity, if the Top tier is enabled with a bungling front-end tool, then the
whole Data Warehouse Architecture can become an utter failure. This makes the selection of the user
interface/ front-end tool as the Top Tier, which will serve as the face of the Data Warehouse system, a
very significant part of the Three-Tier Data Warehouse Architecture designing process.

Autonomous Data Warehouse

Introduction to Oracle ADW


Within the range of Oracle’s Cloud-based services is Oracle’s Autonomous Data Warehouse. It is a
Cloud Data Warehouse that completely managed all Data Warehouse operations and their
complexity. Wide automation of features such as data securing, configuration, provisioning,
development, scaling, and backups are provided with Oracle Autonomous Data Warehouse.
The process of Data Management and Processing with Oracle Autonomous Data Warehouse can be
illustrated as follows:

UNIT 1: INTRODUCTION TO DATA WAREHOUSE


RIT CCS341 - DATA WAREHOUSING 14

Key Features of Oracle ADW


The Oracle Autonomous Data Warehouse offers several features that are easy to use and remain
compatiblewith various existing tools and applications. These include the following:
The end-to-end management system of Oracle ADW is fully managed and fully tuned to
load large amounts of data and run complex queries without any human moderation
required.
It is a highly elastic service and can be auto-scaled or manually scaled depending upon
how yourdata requirements grow.
The Oracle Autonomous Data Warehouse provides automated index management and
real-timestatistic updates collectively.
It can be configured in advance to incorporate different user types and provide optimized
querying features for multiple workloads.
It is enriched with built-in tools such as Oracle Application Express (APEX), Oracle
REST Data Services (ORDS), and Oracle Database Actions.

UNIT 1: INTRODUCTION TO DATA WAREHOUSE


RIT CCS341 - DATA WAREHOUSING 15
Autonomous Data Warehouse Vs Snowflake
Autonomous Data Warehous Snowflake
 “Load and Go!” No DBA expertise  Requires manual tuning and specialized
required to provision, backup, secure, or DBA expertise for cluster set and
patch storage options
 Runs on Exadata backbone with  Runs on generic AWS or Azure
enterprise class capabilities infrastructure
 99.95% Availability (only 4 hours of  Only 99.9% Availability (up to 9 hours
downtime per year) with fully per year) and requires Premier Support
automated backups and RAC for for HA at additional cost
compute node failures
 Auto applies security patches and  Shared tenancy and no isolation of
upgrades with complete tenant and data operational users from customer
isolation application data
 Single data warehouse – scaling  DBA must setup single or multi-
compute or storage is not constrained by clusters, resulting in over-provisioning
ÿxed blocks and higher costs
 Advanced enterprise grade analytics  Must rely on 3rd parties for BI,
tools built-in with complete Oracle Data Integration and advanced
Cloud integration analytics

Modern Data Warehouse

What is Modern Data Warehouse?


A Modern Data Warehouse is a cloud-based solution that gathers and stores that information.
Organizations can process this data to make intelligent decisions. That’s why various organizations
use a Modern Data Warehouse to improve their finances, human resources, and operations business
processes. Quality cloud- based warehouse departments need this information to make smarter
decisions.

UNIT 1: INTRODUCTION TO DATA WAREHOUSE


RIT CCS341 - DATA WAREHOUSING 16
Modern Data Warehouse Pyramid

There are five different components of a Modern Data Warehouse.

Level 1: Data Acquisition


Data acquisition can come from a variety of sources such as:
 IoT devices
 Social media posts
 YouTube videos
 Website content
 Customer data
 Enterprise Resource Planning
 Legacy data stores
Level 2: Data Engineering
Once you acquired it, you need to upload it into the data warehouse. Data engineering uses pipelines
and ETL (extract, transform, load) tools. Using these different tools, you can upload that information to a
data warehouse similar to a factory. Data engineering is similar to a truck bringing raw materials into
a factory.
Level 3: Data Management Governance
Once the data comes into the factory, you need someone to evaluate the quality of the data. You then
need tosteward that data because security and privacy must be considered.
Data governance helps ensure the quality of the info by stewarding, prepping, and cleaning the data
to ensureit is ready for analysis.

UNIT 1: INTRODUCTION TO DATA WAREHOUSE


RIT CCS341 - DATA WAREHOUSING 17

Level 4: Reporting and Business Intelligence


Once you prep and clean the data, you can start using factory analysis to take that raw material(data)
and turn it into a finished good (business intelligence). For our purposes, we will use Microsoft
Power BIto help you visualize the information by using advanced analytics, KPIs, and workflow
automation. When you are finished, you can see exactly what’s going on with your data.
Level 5: Data Science
Modern Data Warehouse is about more than seeing the information; it’s about using the data to make
smarter decisions. That’s one of the key concepts you should walk away with here today. There are
several different programs to help you leverage the data to your benefit, including:
 AI
 Deep learning
 Machine learning
 Statistical modeling
 Natural language processing (NLP)
Keep in mind that all the algorithms above need data to work successfully. The more data you
provide, the smarter your decisions, and the smarter your results. It’s essential to see if you want to
understand your reports that you leverage AI to get better answers, leading us back to Modern Data
Warehouse. Again, it is more than gathering and storing data. It is about making smart decisions.
Final Thoughts
Modern data warehouse takes the data most corporations have and turns it into actionable
intelligence based on visual stories.

Referential Link: https://youtu.be/KP1rvlrrrwY

UNIT 1: INTRODUCTION TO DATA WAREHOUSE

You might also like