SE130336 Test 2


Student Code: SE130336

Student Name: Triệu Minh Huy


Class Name: SE1319

Chapter 8
1/ Question 1: Match the columns

1. operational infrastructure    A. shared-nothing architecture
2. preemptive multitasking       B. provides high concurrency
3. shared disk                   C. single memory address space
4. MPP                           D. operating system feature
5. SMP                           E. vertical parallelism
6. interquery parallelization    F. people, procedures, training
7. intraquery parallelization    G. easy administration
8. NUMA                          H. choice data warehouse platform
9. UNIX-based system             I. optimize for data transformation
10. data staging area            J. data movement option

Your answer 1:
(A ... J)
1 F
2 D
3 J
4 A
5 G
6 B
7 E
8 C
9 H
10 I

2/ Question 2: What are the platform options for the staging area? Compare the options and mention the
advantages and disadvantages.

Your answer 2: …
- Platform options for the staging area: the source data platform, the data storage platform, or a separate platform.
- Compare:
• Advantages:
+ A separate platform can be optimized for complex data transformations and data cleansing.
+ A separate environment is the most conducive to managing the movement of data.
+ A separate platform makes it possible to use specialized tools to manipulate the data in the staging area.
+ Keeping the staging area on the source data platform or the data storage platform reduces data movement
within the system.
• Disadvantages:
+ A separate platform requires data to be moved from the source platform to the staging area and then on to the
data storage platform.
+ The source data platform and the data storage platform are not optimized for complex data transformations.
Chapter 9
3/ Question 3: Why do you think metadata is important in a data warehouse environment? Give a
general explanation in one or two paragraphs.

Your answer 3: …
- Metadata is important because it supports the decision support system: it maps data as it is transformed from
the operational environment to the data warehouse environment, and it relates current detailed data to highly
summarized data.
- Explanation: Metadata is simply defined as data about data, that is, data used to represent other data. For
example, the index of a book serves as metadata for the contents of the book. In other words, metadata is
summarized data that leads us to detailed data.

4/ Question 4: Indicate if true or false


A. The importance of metadata is the same in a data warehouse as it is in an operational system.
B. Metadata is needed by IT for data warehouse administration.
C. Technical metadata is usually less structured than business metadata.
D. Maintaining metadata in a modern data warehouse is just for documentation.
Your answer 4:

(T/F)
A F
B T
C F
D F

Chapter 10
5/ Question 5: Why is the entity-relationship modeling technique not suitable for the data warehouse? How is
dimensional modeling different?

Your answer 5: …
- ER modeling aims to optimize performance for transaction processing. ER models are also hard to query
because of their complexity: many tables must be joined to obtain a result set. Therefore, ER models are not
suitable for high-performance retrieval of data.
- A dimensional model copes better with unexpected changes in user behavior and requirements. Its logical
design can be made independent of expected query patterns, and all dimensions can be thought of as
symmetrically equal entry points into the fact table. A dimensional model is also extensible to new design
decisions and data elements: existing fact and dimension tables can be changed in place without having to reload
data, and end-user query and reporting tools are not affected by the change.

Chapter 11
6/ Question 6: How does a snowflake schema differ from a STAR schema? Name two advantages and
two disadvantages of the snowflake schema.
Your answer 6:

Snowflake schema compared with star schema:
- Ease of maintenance / change:
+ Snowflake schema: No redundancy, so snowflake schemas are easier to maintain and change.
+ Star schema: Has redundant data and hence is less easy to maintain/change.
- Ease of use:
+ Snowflake schema: More complex queries and hence less easy to understand.
+ Star schema: Lower query complexity and easy to understand.
- Query performance:
+ Snowflake schema: More foreign keys and hence longer query execution time (slower).
+ Star schema: Fewer foreign keys and hence shorter query execution time (faster).
- Type of data warehouse:
+ Snowflake schema: Good to use for the data warehouse core to simplify complex relationships (many:many).
+ Star schema: Good for data marts with simple relationships (1:1 or 1:many).
- Joins:
+ Snowflake schema: Higher number of joins.
+ Star schema: Fewer joins.
- Dimension table:
+ Snowflake schema: May have more than one dimension table for each dimension.
+ Star schema: Contains only a single dimension table for each dimension.
- When to use:
+ Snowflake schema: When the dimension table is relatively big in size, snowflaking is better as it reduces space.
+ Star schema: When the dimension table contains fewer rows, we can choose the star schema.
- Normalization / de-normalization:
+ Snowflake schema: Dimension tables are in normalized form but the fact table is in de-normalized form.
+ Star schema: Both dimension and fact tables are in de-normalized form.
- Data model:
+ Snowflake schema: Bottom-up approach.
+ Star schema: Top-down approach.

- Advantages:
+ Better data quality (data is more structured, so data integrity problems are reduced).
+ Less disk space is used than in a de-normalized model.
- Disadvantages:
+ Ease of use: More complex queries and hence less easy to understand.
+ Query performance: More foreign keys and hence longer query execution time (slower).
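
The extra join mentioned in the comparison can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table names (dim_product_star, dim_product_snow, dim_category, fact_sales) and the sample rows are made up for illustration and are not part of the test answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical tables: one denormalized product dimension (star) and the same
# dimension split into product + category tables (snowflake).
cur.executescript("""
CREATE TABLE dim_product_star (product_key INTEGER PRIMARY KEY,
                               product_name TEXT, category_name TEXT);
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product_snow (product_key INTEGER PRIMARY KEY, product_name TEXT,
                               category_key INTEGER REFERENCES dim_category(category_key));
CREATE TABLE fact_sales (product_key INTEGER, amount REAL);

INSERT INTO dim_product_star VALUES (1, 'Vase', 'Home'), (2, 'Lamp', 'Home'), (3, 'Pen', 'Office');
INSERT INTO dim_category VALUES (10, 'Home'), (20, 'Office');
INSERT INTO dim_product_snow VALUES (1, 'Vase', 10), (2, 'Lamp', 10), (3, 'Pen', 20);
INSERT INTO fact_sales VALUES (1, 100.0), (2, 50.0), (3, 20.0), (1, 30.0);
""")

# Star schema: a single join from the fact table to the dimension table.
star_sql = """
SELECT p.category_name, SUM(f.amount)
FROM fact_sales f
JOIN dim_product_star p ON p.product_key = f.product_key
GROUP BY p.category_name ORDER BY p.category_name
"""

# Snowflake schema: the category name lives one table further out, so a second join is needed.
snow_sql = """
SELECT c.category_name, SUM(f.amount)
FROM fact_sales f
JOIN dim_product_snow p ON p.product_key = f.product_key
JOIN dim_category c ON c.category_key = p.category_key
GROUP BY c.category_name ORDER BY c.category_name
"""

star_result = cur.execute(star_sql).fetchall()
snow_result = cur.execute(snow_sql).fetchall()
print(star_result)
print(snow_result)
```

Both queries return the same totals per category. The snowflake version stores each category name once (less redundancy) at the cost of one more join.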

Chapter 12
7/ Question 7: When is a full data refresh preferable to an incremental load? Can you think of an example?

Your answer 7: …
- A full data refresh is preferable to an incremental load when a large share of the data has changed, because
refresh is a much simpler option than update.
- To use the update option, you first have to devise a proper strategy to extract the changes from each data
source, and then determine the best strategy to apply those changes to the data warehouse.
- The refresh option simply involves the periodic replacement of complete data warehouse tables.
Example: When more than 35% of the records have changed, we should do a full refresh instead of an incremental load.
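
The choice between the two options can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, the tables are modeled as plain dictionaries keyed by primary key, and the 0.35 default simply mirrors the 35% figure in the example above and would be tuned in practice.

```python
def choose_load_strategy(changed_rows: int, total_rows: int,
                         refresh_threshold: float = 0.35) -> str:
    """Pick 'full_refresh' when the changed fraction exceeds the threshold."""
    if total_rows <= 0:
        # Nothing is loaded yet, so a complete load is the only option.
        return "full_refresh"
    changed_fraction = changed_rows / total_rows
    return "full_refresh" if changed_fraction > refresh_threshold else "incremental"


def full_refresh(source: dict) -> dict:
    """Replace the whole warehouse table with a fresh copy of the source."""
    return dict(source)


def incremental_load(warehouse: dict, changes: dict) -> dict:
    """Apply only the changed or new rows to a copy of the existing table."""
    updated = dict(warehouse)
    updated.update(changes)
    return updated


warehouse = {1: "Ann", 2: "Bob", 3: "Cy", 4: "Dee"}
changes = {2: "Robert"}  # one changed row out of four = 25%
strategy = choose_load_strategy(len(changes), len(warehouse))
if strategy == "incremental":
    warehouse = incremental_load(warehouse, changes)
print(strategy, warehouse)
```

The refresh path needs no change-detection logic at all, which is the simplicity the answer refers to. The incremental path depends on knowing exactly which rows changed.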

Chapter 13
8/ Question 8: Give examples of four types of data quality problems.

Your answer 8: …
1. Duplicates:
Multiple copies of the same records take a toll on the computation and storage, but may also produce skewed or
incorrect insights when they go undetected. One of the key problems could be human error — someone simply
entering the data multiple times by accident — or it can be an algorithm that has gone wrong.
Example: The same person recorded several times with multiple email addresses.
2. Incomplete Data: Often, because the data has not been entered in the system correctly, or because
certain files have been corrupted, the remaining data has several missing variables.
Example: If an address does not include a zip code at all, the remaining information can be of little value, since
the geographical aspect of it would be hard to determine.
Example of inaccurate data: Product Code: 146, Product Name: Crystal Vase, and Height: 486 inches in the same
record point to some sort of data inaccuracy. The values for product name and height are not compatible. Perhaps
the product code is also incorrect.

3. Inconsistent Formats:
If the data is stored in inconsistent formats, the systems used to analyze or store the information may not
interpret it correctly. For example, if an organization is maintaining the database of their consumers, then the
format for storing basic information should be pre-determined. Name (first name, last name), date of birth
(US/UK style) or phone number (with or without country code) should be saved in the exact same format. It may
take data scientists a considerable amount of time to simply unravel the many versions of data saved.

4. Violation of business rules:
In a payroll system, an obvious business rule is that the days worked in a year plus the vacation days, holidays,
and sick days cannot exceed 365 or 366.
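
Three of the problems above (duplicates, incomplete data, and a business rule violation) can be detected with simple checks. The sketch below is illustrative only: the record layout, field names, and sample values are assumptions, not data from the test.

```python
from collections import Counter

# Hypothetical employee records; every value is made up for illustration.
records = [
    {"id": 1, "name": "Ann Lee", "zip": "10001", "days_worked": 250, "days_off": 30},
    {"id": 2, "name": "Ann Lee", "zip": "10001", "days_worked": 250, "days_off": 30},  # duplicate of id 1
    {"id": 3, "name": "Bob Tran", "zip": None, "days_worked": 240, "days_off": 20},    # missing zip code
    {"id": 4, "name": "Cy Diaz", "zip": "94105", "days_worked": 350, "days_off": 40},  # 390 days in a year
]


def find_duplicates(rows):
    """Return ids of rows that share the same (name, zip) pair with another row."""
    counts = Counter((r["name"], r["zip"]) for r in rows)
    return [r["id"] for r in rows if counts[(r["name"], r["zip"])] > 1]


def find_incomplete(rows, required=("name", "zip")):
    """Return ids of rows where any required field is missing or empty."""
    return [r["id"] for r in rows if any(r.get(f) in (None, "") for f in required)]


def find_rule_violations(rows, max_days=366):
    """Return ids of rows where days worked plus days off exceed a year."""
    return [r["id"] for r in rows if r["days_worked"] + r["days_off"] > max_days]


print(find_duplicates(records))
print(find_incomplete(records))
print(find_rule_violations(records))
```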
Chapter 15
9/ Question 9: What is meant by slice-and-dice? Give an example.

Your answer 9: …
- Slice is the act of picking a rectangular subset of a cube by choosing a single value for one of its dimensions,
creating a new cube with fewer dimensions.
- Dice is the act of producing a subcube by allowing the analyst to pick specific values of multiple dimensions.
- To slice and dice is to break a body of information down into smaller parts or to examine it from different
viewpoints so that you can understand it better.
- In data analysis, the term generally implies a systematic reduction of a body of data into smaller parts or
views that will yield more information. The term is also used to mean the presentation of information in a
variety of different and useful ways.

Example
• Slice and dice
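
As a concrete illustration of the two definitions, the following Python sketch slices and dices a tiny sales cube. The cube contents, the dimension names, and the helper functions slice_cube and dice_cube are all hypothetical.

```python
# A tiny sales cube keyed by (product, region, quarter); the figures are made up.
DIMS = ("product", "region", "quarter")
cube = {
    ("Vase", "North", "Q1"): 10,
    ("Vase", "South", "Q1"): 5,
    ("Vase", "North", "Q2"): 7,
    ("Lamp", "North", "Q1"): 3,
    ("Lamp", "South", "Q2"): 8,
    ("Pen", "South", "Q1"): 4,
}


def slice_cube(cube, dim, value):
    """Fix one dimension to a single value and drop that dimension from the keys."""
    i = DIMS.index(dim)
    return {k[:i] + k[i + 1:]: v for k, v in cube.items() if k[i] == value}


def dice_cube(cube, **allowed):
    """Keep the cells whose coordinates are among the allowed values per dimension."""
    wanted = {DIMS.index(d): set(vals) for d, vals in allowed.items()}
    return {k: v for k, v in cube.items()
            if all(k[i] in vals for i, vals in wanted.items())}


# Slice: only quarter Q1, leaving a two-dimensional (product, region) view.
q1_sales = slice_cube(cube, "quarter", "Q1")

# Dice: a subcube restricted to two products and one region, all three dimensions kept.
north_subcube = dice_cube(cube, product=["Vase", "Lamp"], region=["North"])

print(q1_sales)
print(north_subcube)
```

The slice result has one fewer dimension than the original cube, while the dice result keeps all three dimensions but fewer cells.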

10/ Question 10: Discuss two reasons why feeding data into the OLAP system directly from the source
operational systems is not recommended.
Your answer 10: …
Business users needed to build queries that summarized the data and fed management reports. Such queries
were extremely slow because they usually summarized large amounts of data while sharing the database engine
with everyday operations.
Reason 1: The queries adversely affected the performance of the operational systems.
Reason 2: The slow queries delayed the strategic planning of the enterprise.
The solution was, therefore, to separate the data used for reporting and decision making from the operational
systems. Hence, data warehouses were designed and built to house this kind of data so that it can be used later
in the strategic planning of the enterprise.
