
Pir Mehr Ali Shah

Arid Agriculture University, Rawalpindi


Office of the Controller of Examinations

Final Exam (Theory) / Spring 2021 (Paper Duration 12 hours)

To be filled by Teacher

Course No.: CS-667          Course Title: Introduction to Data Warehousing

Total Marks: 20              Date of Exam: 6/7/2021

Degree: BSCS                 Semester: 8          Section: B

Marks

Q. No.:                1    2    3    4    5    6    7    8    9    10
Marks Obtained / Total Marks:

Marks Obtained: /20

Total Marks in Words:

Name of the teacher who taught the course: Saira Sultana

Signature of teacher / Examiner:

To be filled by Student

Registration No.: 17-Arid-6382 Name: Muhammad Awais Riaz

Answer the following questions.

Please read all questions carefully. All questions are from the lecture discussions; think critically, and good luck.

Q1. (a) How is a data warehouse fundamentally different? Give your justifications. (3)

Answer:

A data warehouse is an information system which stores historical and cumulative data from
single or multiple sources. It is designed for analyzing, reporting on, and integrating transaction
data from different sources.

 A data warehouse eases the analysis and reporting process of an organization. It is also a
single version of truth for the organization's decision-making and forecasting processes.
 A data warehouse helps business users access critical data from different sources all in one
place, so it saves users' time retrieving data from multiple sources.
 It provides consistent information on various cross-functional activities.
 It helps you integrate many sources of data to reduce stress on the production system.
 A data warehouse helps you reduce TAT (total turnaround time) for analysis and reporting.
 You can also access data from the cloud easily.
 A data warehouse allows you to store a large amount of historical data, so you can analyze
different periods and trends to make future predictions.
 It enhances the value of operational business applications and customer relationship
management systems.
 It separates analytics processing from transactional databases, improving the performance
of both systems.
 It exposes data-quality problems in the source systems, whose quality stakeholders and
users may be overestimating.
 A data warehouse provides more accurate reports.

Characteristics of Data Warehouse

A data warehouse is subject-oriented, as it offers information related to a theme rather than the
company's ongoing operations. The data also needs to be stored in the data warehouse in a common
and unanimously acceptable manner. The time horizon of a data warehouse is relatively extensive
compared with that of operational systems. A data warehouse is non-volatile, which means the
previous data is not erased when new information is entered into it.

(b) What does ad hoc query mean? Can we use it in a simple database environment as well? (3)

Answer:

Ad-Hoc Query:

• Does not have a certain predefined database access pattern.

• Queries not known in advance.

• Difficult to write SQL in advance.

An ad hoc query is a query that cannot be determined prior to the moment the query is issued. It is
created in order to get information when the need arises, and it consists of dynamically constructed
SQL, usually built by desktop-resident query tools. An ad hoc query does not reside in the computer
or the database manager but is created dynamically depending on the needs of the data user.

In SQL, an ad hoc query is a loosely typed command/query whose value depends upon some
variable. Each time the command is executed, the result may differ, depending on the value of the
variable. It cannot be predetermined and usually falls under dynamic SQL. An ad hoc query is
short-lived and is created at runtime.

Yes, ad hoc queries can be issued in a simple database environment as well; any database that
accepts dynamically constructed SQL supports them. The difference is that an operational database
is tuned for known, predefined access patterns, so unpredictable ad hoc queries can hurt its
performance, whereas a data warehouse is designed with such queries in mind.
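As a minimal sketch of dynamically constructed SQL in Python with sqlite3 (the `sales` table, its columns, and the `ad_hoc_query` helper are invented for illustration, not from the lecture):

```python
import sqlite3

# Hypothetical sales table; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("North", "tea", 120.0), ("South", "tea", 80.0),
                  ("North", "coffee", 200.0)])

def ad_hoc_query(filters):
    """Build the WHERE clause at runtime from whatever the user asked for.

    The final SQL cannot be known in advance -- it depends on `filters`,
    which is exactly what makes the query 'ad hoc'.
    """
    clauses, params = [], []
    for column, value in filters.items():
        clauses.append(f"{column} = ?")  # columns picked at runtime (a real tool
        params.append(value)             # would validate them against the catalog)
    where = (" WHERE " + " AND ".join(clauses)) if clauses else ""
    sql = ("SELECT region, product, SUM(amount) FROM sales" + where +
           " GROUP BY region, product")
    return conn.execute(sql, params).fetchall()

# Two different user requests -> two different dynamically built statements.
print(ad_hoc_query({"region": "North"}))
print(ad_hoc_query({"product": "tea", "region": "South"}))
```

The same mechanism works against an ordinary operational database; the warehouse difference is only that its schema and indexing are designed to survive such unpredictable queries.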

(c) Explain the DWH architecture in detail, all four layers. (4)

Answer:

The data warehouse architecture can be defined as a structural representation of the concrete
functional arrangement based on which a data warehouse is constructed. It includes all the major
pragmatic components and is typically organized into four refined layers:

the source layer, where all the data from the different sources is situated; the staging layer, where
data undergoes ETL processing; the storage layer, where the processed data is stored for future
use; and the presentation layer, where the front-end tools are employed as per the user's
convenience.

DATA WAREHOUSE ARCHITECTURE:

[Figure: Data warehouse architecture. Top tier: data mining, reporting, and analysis/other tools. Middle tier: OLAP server. Bottom tier: ETL, the data warehouse, and data marts.]

The Data Warehouse Architecture generally comprises three tiers.

 Top Tier
 Middle Tier
 Bottom Tier

Top Tier
 The Top Tier consists of the client-side front end of the architecture.
 The transformed, logic-applied information stored in the Data Warehouse is used
and acquired for business purposes in this tier.
 Several tools for report generation and analysis are present for the
generation of the desired information.
 Data mining, which has become a great trend these days, is done here.

Requirement analysis documents, costing, and all the features that determine a profit-based
business deal are worked out with these tools, which use the Data Warehouse information.

Middle Tier
 The Middle Tier consists of the OLAP servers.
 OLAP stands for Online Analytical Processing.
 OLAP is used to provide information to business analysts and managers.
 As it is located in the middle tier, it rightfully interacts with the information
present in the bottom tier and passes the insights on to the top-tier tools,
which process the available information.
 Mostly relational (ROLAP) or multidimensional (MOLAP) OLAP is used in data
warehouse architecture; a minimal sketch of the relational flavor follows this list.
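As a rough sketch of what a relational OLAP server does underneath (the `fact_sales` star-schema fragment and its columns are invented for this example), a multidimensional question reduces to a grouped aggregate over the bottom-tier tables:

```python
import sqlite3

# Invented miniature fact table: sales by (year, region).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (year INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [(2020, "North", 100), (2020, "South", 150),
     (2021, "North", 120), (2021, "South", 90)])

# One 'slice' of the cube: total sales per (year, region). A ROLAP server
# translates a multidimensional request into SQL like this against the
# bottom tier and hands the result up to the top-tier tools.
cube = conn.execute(
    "SELECT year, region, SUM(amount) AS total "
    "FROM fact_sales GROUP BY year, region").fetchall()
for row in cube:
    print(row)
```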

Bottom Tier
The Bottom Tier mainly consists of the Data Sources, ETL Tool, and Data
Warehouse.

1. Data Sources

The Data Sources consist of the source data that is acquired and provided to the
staging and ETL tools for further processing.

2. ETL Tools

 ETL tools are very important because they help in combining logic, raw
data, and schema into one and load the information into the Data
Warehouse or Data Marts.
 Sometimes, ETL loads the data into the Data Marts first, and the information is
then stored in the Data Warehouse. This approach is known as the bottom-up
approach.
 The approach where ETL loads information into the Data Warehouse directly
is known as the top-down approach.

Data Marts

 A Data Mart is also a storage component, used to store data of a specific
function or part of a company by an individual authority.
 A data mart gathers its information from the Data Warehouse, and hence we
can say a data mart stores a subset of the information in the Data Warehouse.
 Data Marts are flexible and small in size.

3. Data Warehouse

 The Data Warehouse is the central component of the whole Data Warehouse
Architecture.
 It acts as a repository to store information.
 Large amounts of data are stored in the Data Warehouse.
 This information is used by several technologies, like Big Data, which require
analyzing large subsets of information.
 A Data Mart is also a model of a Data Warehouse.

Different Layers of Data Warehouse Architecture


There are four layers which will always be present in a Data
Warehouse Architecture.

1. Data Source Layer


 The Data Source Layer is the layer where the data from the sources is
encountered and subsequently sent to the other layers for the desired
operations.
 The data can be of any type.
 The source data can be a database, a spreadsheet, or any other kind of
text file.
 The source data can be of any format; we cannot expect to get data in
the same format, considering the sources are vastly different.
 In real life, some examples of source data are:
 log files of each specific application or job, or entry records of employees in a
company;
 survey data, stock exchange data, etc.;
 web browser data, and many more.

2. Data Staging Layer


The following steps take place in the Data Staging Layer.

Step #1: Data Extraction

The data received by the Source Layer is fed into the Staging Layer, where the
first process that takes place with the acquired data is extraction.

Step #2: Landing Database

The extracted data is temporarily stored in a landing database, which receives
the data as soon as it is extracted from the sources.

Step #3: Staging Area

The data in the landing database is taken, and several quality checks and staging
operations are performed on it in the staging area.
The structure and schema are also identified, and adjustments are made to
unordered data, thus trying to bring about a commonality among the data
that has been acquired.

Having a place set up for the data just before transformation and changes is an
added advantage that makes the staging process very important.

It makes data processing easier.

Step #4: ETL

ETL stands for Extraction, Transformation, and Load.

ETL Tools are used for the integration and processing of data where logic is
applied to rather raw but somewhat ordered data.

This data is extracted as per the analytical requirements and transformed
into data that is deemed fit to be stored in the Data Warehouse.

After Transformation, the data or rather information is finally loaded into the data
warehouse.

Some examples of ETL tools are Informatica, SSIS, etc.
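A toy end-to-end sketch of the three ETL steps in Python (the CSV layout, the cleansing rule, and the `sales_fact` target table are assumptions made for this example, not the behavior of any real ETL tool):

```python
import csv
import io
import sqlite3

# --- Extract: read raw rows from a source file (simulated here in memory).
raw = io.StringIO("customer_id,amount\n1,100\n2,-5\n3,250\n")
rows = list(csv.DictReader(raw))

# --- Transform: apply cleansing/business logic; here we drop negative
# amounts and convert the text fields to proper types.
clean = [(int(r["customer_id"]), float(r["amount"]))
         for r in rows if float(r["amount"]) >= 0]

# --- Load: write the transformed rows into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales_fact (customer_id INTEGER, amount REAL)")
warehouse.executemany("INSERT INTO sales_fact VALUES (?, ?)", clean)
warehouse.commit()

print(warehouse.execute("SELECT COUNT(*), SUM(amount) FROM sales_fact").fetchone())
# -> (2, 350.0): the negative-amount row was rejected during transformation.
```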

3. Data Storage Layer


The processed data is stored in the Data Warehouse.

This data is cleansed, transformed, and prepared with a definite structure, and
thus provides opportunities for business users to use the data as required by the
business.

Depending upon the approach of the Architecture, the data will be stored in Data
Warehouse as well as Data Marts. Data Marts will be discussed in the later stages.

Some also include an Operational Data Store.

4. Data Presentation Layer


 This is the layer where the users get to interact with the data stored in the data
warehouse.
 Queries and several tools are employed to get different types of
information based on the data.
 The information reaches the user through the graphical representation of
data.
 Reporting tools are used to get business data, and business logic is also
applied to gather several kinds of information.
 Metadata information and system operations and performance are also
maintained and viewed in this layer.

[Figure: The four layers in sequence. Data source (Data 1, Data 2) → data staging (staging and integration) → data storage (summary data, metadata, data mart) → data presentation.]

Question No 2:

(a) What is the performance issue that can result from horizontal splitting? Also mention the
terminology used for the issue.

Answer:

Splitting Tables:

Horizontal splitting breaks a table into multiple tables based upon common column values.
Basically, it is a technique based on divide and conquer to exploit parallelism; the conquering part
of this technique refers to combining the final results.

There are two types of horizontal splitting:

Round robin and random splitting –

This type of splitting is irreversible and is not based on any selection procedure, so a given row
cannot be located without scanning all partitions.

Range and expression splitting –

It can lead to an uneven distribution of data, but it facilitates partition elimination through a
smart optimizer.

GOAL

 Spreading rows to exploit parallelism.
 Grouping data to avoid unnecessary query load in the WHERE clause.

Let's see how this works for hash-based splitting/partitioning. Assuming uniform hashing, hash
splitting supports even data distribution across all partitions in a pre-defined manner.
However, hash-based splitting is not easily reversible, so the split cannot simply be eliminated.

In other words, a particular row will always hash to the same partition (assuming that the hashing
algorithm and the number of partitions have not changed), but a large number of rows will be
"randomly" distributed across the partitions, as long as a well-selected partitioning key is used
and the hashing function is well-behaved. Notice that the "random" assignment of data rows across
partitions makes it nearly impossible to get any kind of meaningful partition elimination. Since data
rows are hash-distributed across all partitions (for load-balancing purposes), there is no practical
way to perform partition elimination unless the query touches a very small number of key values
(e.g., a singleton lookup).
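A small sketch of hash-based splitting (the key format, the CRC32 hash, and the partition count are arbitrary choices for this demonstration):

```python
from collections import Counter
from zlib import crc32

NUM_PARTITIONS = 4

def partition_of(key: str) -> int:
    """Deterministic: the same key always hashes to the same partition."""
    return crc32(key.encode()) % NUM_PARTITIONS

# 100,000 synthetic keys spread "randomly" but evenly across partitions,
# which is exactly the load-balancing property hash splitting provides.
counts = Counter(partition_of(f"customer-{i}") for i in range(100_000))
print(sorted(counts.items()))   # roughly 25,000 rows per partition

# Partition elimination is impractical for range predicates: the keys
# 'customer-10' .. 'customer-19' land on unrelated partitions, so a query
# like "WHERE key BETWEEN ..." must still scan every partition.
print({k: partition_of(f"customer-{k}") for k in range(10, 20)})
```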

[Figure: Four processors P1–P4, each handling one year's partition (1998–2001). The dramatic cancellation of airline reservations after 9/11 concentrates the load on the 2001 partition, resulting in a "hot spot".]

Performance Issue: The "Hot Spot"

The performance issue resulting from horizontal splitting is termed a "hot spot". This situation arises
when one partition holds a much larger number of records than the other partitions. In
simple language, if data is partitioned into 4 parts and one of them holds a large portion of
the data, the processor dealing with that particular portion has to work more than the
others. In a parallel processing environment (which is set up precisely to improve performance by
partitioning the data), the horizontal splitting then fails its purpose, because the overloaded
partition becomes a hot spot; a small simulation of this skew follows below.
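A hedged sketch of the hot spot (the year boundaries and the skewed workload weights are invented to mirror the airline example above):

```python
import random
from collections import Counter

random.seed(0)

def year_partition(year: int) -> str:
    """Range splitting: one partition (processor) per year, P1..P4."""
    return {1998: "P1", 1999: "P2", 2000: "P3", 2001: "P4"}[year]

# Simulated workload: events cluster heavily in 2001 (e.g., mass
# cancellations), so the range-partitioned data is badly skewed.
events = random.choices([1998, 1999, 2000, 2001],
                        weights=[10, 10, 10, 70], k=100_000)
load = Counter(year_partition(y) for y in events)
print(sorted(load.items()))
# P4 carries ~70% of the rows: it becomes the hot spot, and the other
# three processors sit mostly idle while P4 does the bulk of the work.
```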

Advantages: Horizontal Splitting

 Enhanced security of data.
 Reduced I/O overhead.
 Graceful degradation of the database in case of table damage.

b) We have a scenario where we are using the pre-joining technique. (1.5+4+1.5)

 Assume a 1:30 record count ratio between master and detail for a retail
application.
 Assume 5 million members in the master table.
 Assume the size of the primary key in the master table is 20 bytes.
 Assume a 50-byte header for the master table and a 120-byte header for the detail table.

Then

1. How many members are there in the detail table?

2. What will be the storage issue? Justify by calculating the storage with
normalization and denormalization.

3. What will be the net result?

Answer:

1. How many members are there in the detail table?

Answer:

With a 1:30 master-to-detail record count ratio and 5 million members in the master table, the
detail table contains 5 million × 30 = 150 million records.

2. What will be the storage issue? Justify by calculating the storage with
normalization and denormalization.

Answer:

Pre-joining calculations with normalization:

Master table: 5 million rows × 50 bytes = 0.25 GB
Detail table: 150 million rows × 120 bytes = 18 GB
Total space used = 0.25 GB + 18 GB = 18.25 GB

After denormalization:

Each pre-joined row holds the detail record plus the master's non-key columns:
120 + (50 − 20) = 150 bytes.
Total space used = 150 million × 150 bytes = 22.5 GB

3. What will be the net result?

Answer:

The net result is 22.5 − 18.25 = 4.25 GB of additional space, i.e., roughly 23% more than the
normalized raw data tables, which is the price paid for eliminating the master-detail join at
query time.
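As a quick check, a small Python sketch of the same arithmetic (all sizes and the 1:30 ratio are taken directly from the question; decimal gigabytes are assumed):

```python
GB = 10**9  # decimal gigabytes, matching the answer above

masters      = 5_000_000   # rows in master table (given)
ratio        = 30          # 1:30 master-to-detail (given)
pk_bytes     = 20          # primary-key size (given)
master_bytes = 50          # master record size (given)
detail_bytes = 120         # detail record size (given)

details = masters * ratio                                  # 150 million rows
normalized   = masters * master_bytes + details * detail_bytes
denormalized = details * (detail_bytes + master_bytes - pk_bytes)

print(details)                                  # 150000000
print(normalized / GB, denormalized / GB)       # 18.25 vs 22.5 (GB)
print(100 * (denormalized - normalized) / normalized)  # ~23.3% extra space
```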
