17-Arid-6382 (Muhammad Awais Riaz)
To be filled by Teacher
Q. No.:                  1  2  3  4  5  6  7  8  9  10
Marks Obtained / Total Marks:
Marks Obtained: /20
Total Marks in Words:
Name of the teacher who taught the course: Saira Sultana
Signature of teacher / Examiner:
To be filled by Student
Please read all questions carefully. All are from the lecture discussions; think critically, and good luck.
Q1. (a) How is a data warehouse fundamentally different? Give your justifications. (3)
Answer:
A data warehouse is an information system that stores historical and cumulative data from single or multiple sources. It is designed to analyze, report on, and integrate transaction data from different sources.
A data warehouse eases an organization's analysis and reporting process. It also serves as a single version of the truth for the organization's decision-making and forecasting. In particular, a data warehouse:
- Helps business users access critical data from several sources in one place, saving the time otherwise spent retrieving information from multiple sources.
- Provides consistent information on various cross-functional activities.
- Integrates many sources of data, reducing the load on the production systems.
- Reduces the total turnaround time (TAT) for analysis and reporting.
- Allows data to be accessed easily from the cloud.
- Stores large amounts of historical data, so different periods and trends can be analyzed to make future predictions.
- Enhances the value of operational business applications and customer relationship management systems.
- Separates analytical processing from transactional databases, improving the performance of both systems.
- Provides more accurate reports, since stakeholders and users may be overestimating the quality of data in the source systems.
A data warehouse is subject-oriented, as it offers information organized around themes rather than the company's ongoing operations. The data also needs to be stored in the data warehouse in a common and unanimously acceptable manner (integrated). The time horizon of a data warehouse is relatively extensive compared with operational systems (time-variant). Finally, a data warehouse is non-volatile, which means the previous data is not erased when new information is entered into it.
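The non-volatile and time-variant properties can be illustrated with a small sketch; the customer and balance fields are invented for illustration. An operational system overwrites data in place, while a warehouse only appends time-stamped snapshots, so history survives:

```python
import datetime

# Operational store: one current row per customer, updated in place.
operational = {"cust_1": {"balance": 100}}
operational["cust_1"]["balance"] = 120  # the old value is lost

# Warehouse: snapshots are appended with a time stamp; previous
# data is never erased when new information arrives.
warehouse = []
warehouse.append({"cust": "cust_1", "balance": 100,
                  "as_of": datetime.date(2021, 1, 1)})
warehouse.append({"cust": "cust_1", "balance": 120,
                  "as_of": datetime.date(2021, 2, 1)})

# Both periods remain available for trend analysis.
history = [row["balance"] for row in warehouse if row["cust"] == "cust_1"]
print(history)  # [100, 120]
```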
(b) What does ad hoc query mean? Can we use it in a simple database environment as well? (3)
Answer:
Ad-Hoc Query:
An ad hoc query is a query that cannot be determined prior to the moment it is issued. It is created to get information when a need arises, and it consists of dynamically constructed SQL, usually built by desktop-resident query tools. An ad hoc query does not reside in the computer or the database manager but is created dynamically depending on the needs of the data user.
In SQL, an ad hoc query is a loosely typed command whose result depends on some variable; each time the command is executed, the result can differ, depending on the value of that variable. It cannot be predetermined and usually falls under dynamic SQL. An ad hoc query is short-lived and is created at runtime. Yes, ad hoc queries can equally be used in a simple database environment: any database that accepts SQL statements at runtime supports them.
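The idea of a dynamically constructed query can be sketched with Python's standard sqlite3 module, which also shows that ad hoc queries work in a simple database environment; the table and values are made up for illustration:

```python
import sqlite3

# In-memory database standing in for a simple operational system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 100.0), ("south", 250.0), ("north", 75.0)])

def ad_hoc_total(region):
    # The query is built and parameterized at runtime; its result
    # depends on a value not known until the moment it is issued.
    cur = conn.execute("SELECT SUM(amount) FROM sales WHERE region = ?",
                       (region,))
    return cur.fetchone()[0]

print(ad_hoc_total("north"))   # 175.0
print(ad_hoc_total("south"))   # 250.0
```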
Answer:
Data warehouse architecture can be defined as a structural representation of the concrete functional arrangement on which a data warehouse is built, including all of its major components. It is typically organized into four refined layers: the source layer, where all the data from the different sources is situated; the staging layer, where the data undergoes ETL processing; the storage layer, where the processed data is stored for future use; and the presentation layer, where front-end tools are employed for the user's convenience.
Figure: three-tier data warehouse architecture, consisting of a Bottom Tier (data warehouse and data marts), a Middle Tier (OLAP server), and a Top Tier (front-end tools).
Top Tier
The Top Tier consists of the client-side front end of the architecture.
The transformed, logic-applied information stored in the data warehouse is used and acquired for business purposes in this tier.
Several tools for report generation and analysis are present for producing the desired information.
Data mining, which has become a great trend these days, is done here.
Requirement analysis, costing, and all the features that determine a profit-based business deal are worked out using these tools, which draw on the data warehouse information.
Middle Tier
The Middle Tier consists of the OLAP servers.
OLAP stands for Online Analytical Processing.
OLAP is used to provide information to business analysts and managers.
Being located in the middle tier, it interacts with the information present in the Bottom Tier and passes the insights on to the Top Tier tools, which process the available information.
Mostly relational (ROLAP) or multidimensional (MOLAP) OLAP servers are used in data warehouse architecture.
Bottom Tier
The Bottom Tier mainly consists of the data sources, the ETL tools, and the data warehouse itself.
1. Data Sources
The data sources consist of the source data that is acquired and provided to the staging and ETL tools for further processing.
2. ETL Tools
ETL tools are very important because they combine logic, raw data, and schema into one and load the information into the data warehouse or data marts.
Sometimes ETL loads the data into the data marts first, and the information is then stored in the data warehouse; this approach is known as the bottom-up approach.
The approach where ETL loads information into the data warehouse directly is known as the top-down approach.
3. Data Warehouse
The data received by the source layer is fed into the staging layer, where the first process applied to the acquired data is extraction.
The data in the landing database is then taken up, and several quality checks and staging operations are performed on it in the staging area.
The structure and schema are also identified, and adjustments are made to unordered data, so as to bring about commonality among the data that has been acquired.
Having a place set up for the data just before transformation is an added advantage that makes the staging process very important.
ETL tools are used for the integration and processing of the data, where logic is applied to raw but somewhat ordered data.
The data is extracted according to the analytical requirements and transformed into a form deemed fit to be stored in the data warehouse.
After transformation, the data, or rather the information, is finally loaded into the data warehouse.
This data is cleansed, transformed, and prepared with a definite structure, and thus provides opportunities for users to consume it as required by the business.
Depending on the approach of the architecture, the data will be stored in the data warehouse as well as in data marts. Data marts are discussed in the later stages.
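The extract-stage-transform-load flow described above can be sketched as a minimal pipeline; the field names and cleansing rules here are assumptions for illustration:

```python
# Minimal ETL sketch: extract raw rows, stage and cleanse them,
# then load the transformed rows into a "warehouse" list.

raw_source = [
    {"id": "1", "city": " lahore "},
    {"id": "2", "city": "KARACHI"},
    {"id": "2", "city": "KARACHI"},   # duplicate row from the source
]

def extract(source):
    # Extraction: pull rows from the source into the landing/staging area.
    return list(source)

def transform(rows):
    # Staging-area quality checks: normalize case, trim spaces,
    # and drop duplicates to bring commonality to the data.
    seen, cleaned = set(), []
    for row in rows:
        if row["id"] in seen:
            continue
        seen.add(row["id"])
        cleaned.append({"id": int(row["id"]),
                        "city": row["city"].strip().title()})
    return cleaned

def load(rows, warehouse):
    # Loading: append the transformed information to the warehouse.
    warehouse.extend(rows)

warehouse = []
load(transform(extract(raw_source)), warehouse)
print(warehouse)  # [{'id': 1, 'city': 'Lahore'}, {'id': 2, 'city': 'Karachi'}]
```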
Figure: data flows from the sources through staging and integration into the data marts.
Question No 2:
(a) What is the performance issue that results from horizontal splitting? Also mention the terminology used for the issue.
Answer:
Splitting Tables:
Horizontal splitting is basically a divide-and-conquer technique used to exploit parallelism; the conquer part of the technique refers to combining the final results.
Horizontal splitting breaks a table into multiple tables based upon common column values.
It generally leads to an uneven distribution of data, but it can facilitate partition elimination through a smart optimizer.
Let's see how it works for hash-based splitting/partitioning. Assuming uniform hashing, hash splitting supports even data distribution across all partitions in a pre-defined manner. However, hash-based splitting is not easily reversible, so the split cannot simply be undone.
In other words, a particular row will always hash to the same partition (assuming the hashing algorithm and the number of partitions have not changed), but a large number of rows will be "randomly" distributed across the partitions, as long as the partitioning key is well selected and the hashing function is well behaved. Notice that this "random" assignment of data rows across partitions makes it nearly impossible to get any kind of meaningful partition elimination. Since the rows are hash-distributed across all partitions (for load-balancing purposes), there is no practical way to perform partition elimination unless the query touches a very small number of rows (e.g., a singleton lookup by the partitioning key).
Figure: skewed distribution of data across processors P1-P4, resulting in a "hot spot".
The performance issue that results from horizontal splitting is termed a "hot spot". This situation arises when one partition holds a large number of records compared with the other partitions. In simple language, if the data is partitioned into 4 parts and one of them holds a large portion of the data, the processor dealing with that portion has to work more than the others. In a parallel processing environment (which is set up to improve performance by partitioning the data), the horizontal splitting will then fail to deliver, because one partition overloaded with data becomes a hot spot.
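The hot-spot effect can be sketched by splitting on a skewed column value; the 90/10 skew below is an invented example. One partition receives most of the rows, so the processor handling it becomes the bottleneck:

```python
NUM_PARTITIONS = 4

# 90% of the rows share a single value of the splitting column.
rows = ["city_0"] * 900 + [f"city_{i % 3 + 1}" for i in range(100)]

# Split the rows across partitions by the column value.
partitions = [[] for _ in range(NUM_PARTITIONS)]
for value in rows:
    idx = int(value.split("_")[1]) % NUM_PARTITIONS
    partitions[idx].append(value)

sizes = [len(p) for p in partitions]
print(sizes)  # [900, 34, 33, 33] -- partition 0 is the hot spot
```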
Assume a 1:30 record count ratio between the master and detail tables for a retail application.
Assume 5 million members in the master table.
Assume a 20-byte primary key in the master table.
Assume a 50-byte header for the master table and a 120-byte header for the detail table.
Then, after denormalization, the net result is 12.5% additional space required in the raw data table size for the database.
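The base sizes implied by these assumptions can be checked with simple arithmetic; since the exact denormalization scheme (which master columns are replicated into the detail table) is not spelled out here, only the pre-denormalization sizes are computed:

```python
# Sizes implied by the stated assumptions.
master_rows = 5_000_000                 # 5 million members
detail_rows = 30 * master_rows          # 1:30 master-to-detail ratio
master_bytes = master_rows * 50         # 50-byte master records
detail_bytes = detail_rows * 120        # 120-byte detail records

print(detail_rows)                # 150000000 detail rows
print(master_bytes // 10**6)      # 250 (MB of master data)
print(detail_bytes // 10**9)      # 18 (GB of detail data)
```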