
Q2: Briefly define the following (Choose FOUR): (4 Marks)

1. Data staging: provides a staging area and a set of functions to clean, change, combine, convert, de-duplicate, and prepare source data for storage and use in the DW.

2. DW Refresh:

3. Data granularity:

4. Virtual DW: A virtual warehouse is a set of views over operational databases.
• For efficient query processing, only some of the possible summary views may be materialized.
• It is easy to build but requires excess capacity on operational database servers.

5. Non-Volatile Data: The DW is always a physically separate store of data transformed from the application data found in the operational environment.
• The DW does not require transaction processing, recovery, and concurrency control mechanisms (only two operations are required: initial loading of data and access of data).

Q2: Define FOUR terms out of SIX: (4 Marks)

1. Destructive Merge 2. Drill-Across 3. Enrichment

4. Measurements:
• A multidimensional point in the data cube space can be defined by a set of dimension-value pairs, e.g., time = “Q1”, location = “Vancouver”, item = “computer”.
• A data cube measure is a numeric function that can be evaluated at each point in the data cube space.
• A measure value is computed for a given point by aggregating the data corresponding to the respective dimension-value pairs defining the given point (a short sketch follows this definition).
• Measures can be organized into three categories based on the kind of aggregate functions used: distributive, algebraic, and holistic.
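A minimal Python sketch of how a measure value is computed for one cube point; the fact records and field names are hypothetical:

facts = [
    {"time": "Q1", "location": "Vancouver", "item": "computer", "sales": 1200},
    {"time": "Q1", "location": "Vancouver", "item": "computer", "sales": 800},
    {"time": "Q2", "location": "Toronto", "item": "phone", "sales": 500},
]
point = {"time": "Q1", "location": "Vancouver", "item": "computer"}

# Aggregate the facts whose dimension values match the given point.
matching = [f["sales"] for f in facts
            if all(f[d] == v for d, v in point.items())]
total_sales = sum(matching)              # distributive: partial sums combine into the overall sum
avg_sales = total_sales / len(matching)  # algebraic: derived from two distributive measures (sum, count)
print(total_sales, avg_sales)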
5. Semantic standardization: is another major task; you resolve synonyms and homonyms.
• When two or more terms from different source systems mean the same thing, you resolve the synonyms.
• When a single term means many different things in different source systems, you resolve the homonyms (a brief sketch follows).
• Data transformation involves combining processes, followed by sorting and merging.
• Combining means pulling data from a single source record or related data elements from many source records.
• Primary keys in the DW cannot have built-in meanings.
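A short Python sketch of synonym and homonym resolution; the source systems and field names are hypothetical:

# Synonyms: different terms from different sources that mean the same thing.
synonyms = {"cust_no": "customer_id", "client_id": "customer_id"}
# Homonyms: the same term means different things depending on the source system.
homonyms = {("billing", "balance"): "outstanding_balance",
            ("savings", "balance"): "account_balance"}

def standardize(source, field):
    if (source, field) in homonyms:
        return homonyms[(source, field)]
    return synonyms.get(field, field)

print(standardize("billing", "balance"))  # -> outstanding_balance
print(standardize("orders", "cust_no"))   # -> customer_id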

6. OLAP: stands for On-Line Analytical Processing.
• It uses database tables (fact and dimension tables) to enable multidimensional viewing, analysis, and querying of large amounts of data.
• OLAP applications and tools are those that are designed to ask complex queries of large multidimensional collections of data and provide fast answers to analyze historical data.
• For this reason, OLAP usually accompanies data warehousing.
• OLAP solves the problem of retrieving a very large number of records (gigabytes and terabytes) and summarizing this data into a form of information that can be used by business analysts (sketch below).
• The key driver of OLAP is the multidimensional nature of the business problem.
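As an illustration only, the following Python/pandas sketch (hypothetical data) builds the kind of multidimensional view of a measure that OLAP tools present to analysts:

import pandas as pd

df = pd.DataFrame({
    "time": ["Q1", "Q1", "Q2", "Q2"],
    "location": ["Vancouver", "Toronto", "Vancouver", "Toronto"],
    "sales": [2000, 1500, 1800, 1700],
})

# Cross-tabulate the sales measure over two dimensions (location x time).
cube_view = pd.pivot_table(df, values="sales", index="location",
                           columns="time", aggfunc="sum")
print(cube_view)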

Define the following: (choose FIVE OUT OF SIX): 10 MARKS

[2x5]

1. Data Characterization: describes data in ways that are useful to the miner and begins the process of understanding what is in the data.
• The standard characteristics for a given dataset are: the number of classes, the number of observations, the number of attributes, the number of features with a numeric data type, and the number of features with a symbolic data type (a brief sketch follows).
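A small Python/pandas sketch (hypothetical dataset) that reports these standard characteristics:

import pandas as pd

df = pd.DataFrame({
    "age": [25, 37, 49, 52],
    "income": [30000, 52000, 61000, 58000],
    "class": ["low", "high", "high", "high"],
})

print("observations:", len(df))
print("attributes:", df.shape[1])
print("classes:", df["class"].nunique())
print("numeric features:", df.select_dtypes(include="number").shape[1])
print("symbolic features:", df.select_dtypes(include="object").shape[1])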

2. Hierarchy: A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts.
• Many concept hierarchies are implicit within the database schema.
• Consider a concept hierarchy for the dimension location, which contains street, city, province or state, zip code, and country.
• These attributes are related by a total or partial order among attributes, forming a concept hierarchy such as “street < city < province or state < country” (a short sketch follows).
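A minimal Python sketch of rolling a location value up this hierarchy; the attribute names and values are hypothetical:

hierarchy = ["street", "city", "province_or_state", "country"]
location = {"street": "Main St", "city": "Vancouver",
            "province_or_state": "British Columbia", "country": "Canada"}

def roll_up(loc, level):
    # Keep only the requested level and the more general levels above it.
    idx = hierarchy.index(level)
    return {attr: loc[attr] for attr in hierarchy[idx:]}

print(roll_up(location, "city"))
# -> {'city': 'Vancouver', 'province_or_state': 'British Columbia', 'country': 'Canada'}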
3. HOLAP: HOLAP stands for Hybrid OLAP (MQE: Managed Query Environment).
• HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP.
• For summary-type information, HOLAP leverages cube technology for faster performance.
• It stores only the indexes and aggregations in the multidimensional form, while the rest of the data is stored in the relational database.

4. Data Science: is the study of the generalizable extraction of knowledge from data.

5. Non-volatile

6. Neural Network: (computational model) works similarly to the neurons of the human brain.
– A layer of "input" units is connected to a layer (or, in deep networks, layers) of "hidden" units, which is connected to a layer of "output" units.
– Each neuron takes input(s), performs operation(s), and passes the output to the following neuron (a small sketch follows).
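A minimal Python sketch of the forward pass through such layers; the weights and inputs are hypothetical:

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    # Each neuron: activation(weighted sum of its inputs + bias).
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

inputs = [0.5, 0.2]
hidden = layer(inputs, weights=[[0.4, -0.6], [0.3, 0.8]], biases=[0.1, -0.1])
output = layer(hidden, weights=[[0.7, -0.2]], biases=[0.05])
print(output)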

Define the following: (Choose SIX OUT OF EIGHT): 6 MARKS

1. Ranking

2. Homonym Standardization: When a single term means many different things in different source systems, you resolve the homonyms.

3. Virtual warehouse: A virtual warehouse is a set of views over operational databases.
• For efficient query processing, only some of the possible summary views may be materialized.
• It is easy to build but requires excess capacity on operational database servers.

4. Non-volatile

5. Deduplication: In many companies, the customer files have several records for the same customer. Mostly, the duplicates are the result of creating additional records by mistake.
• In your DW, you want to keep a single record for one customer and link all the duplicates in the source systems to this single record (this process is called deduplication of the customer file; a short sketch follows).
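A minimal Python sketch of this linking step; the matching rule (customer email) and record layout are assumptions for illustration:

source_records = [
    {"src_id": "A-101", "name": "John Smith", "email": "jsmith@example.com"},
    {"src_id": "B-777", "name": "J. Smith", "email": "jsmith@example.com"},
    {"src_id": "A-102", "name": "Mary Jones", "email": "mjones@example.com"},
]

dw_customers = {}  # one surviving record per customer, keyed by the matching rule
links = {}         # source record id -> surviving DW customer key

for rec in source_records:
    key = rec["email"]                 # assumed matching rule
    dw_customers.setdefault(key, rec)  # the first record seen becomes the single DW record
    links[rec["src_id"]] = key         # duplicates are linked to it, not stored again

print(len(dw_customers), links)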

6. Distributive Measures: A measure is distributive if its aggregate function can be computed in a distributed manner.
• Partition the data into subsets and compute the aggregate for each subset.
• The overall measure value is then derived by aggregating the partial results, as with count(), sum(), min(), and max().

7. Job sequencing: determine whether the beginning of one job in an extraction job stream has to wait until the previous job has finished successfully.

8. Business Metadata: It contains data that gives users business-related information about what is stored in the DW, for example privacy level, security level, and business rules.
Q2: Define the following: (choose SIX OUT OF SEVEN): 6 MARKS

[1x6]

1. Deduplication: In many companies, the customer files have several records for the same customer. Mostly, the duplicates are the result of creating additional records by mistake.
• In your DW, you want to keep a single record for one customer and link all the duplicates in the source systems to this single record (this process is called deduplication of the customer file).
• Employee files and, sometimes, product master files have this kind of duplication problem.

2. Data Mart: It contains a subset of corporate-wide data that is of value to a specific group of users (using specific selected subjects).
• Implemented on low-cost departmental servers.
• The data tend to be summarized.
• The implementation cycle is more likely to be measured in weeks.
• Depending on the source of data, data marts can be categorized as independent or dependent.

3. Federated DW: Used in companies with an existing legacy of an assortment of decision-support structures in the form of operational systems, extracted datasets, primitive data marts, and so on (a solution where data may be physically or logically integrated through shared key fields, overall global metadata, distributed queries, and other such methods; there is no one overall DW).

4. Periodic Status: In this category, the value of the attribute is preserved as the status every time a change occurs. At each of these points in time, the status value is stored with reference to the time when the new value became effective. This category also includes events stored with reference to the time when each event occurred.

5. Non-volatile

6. Fact Constellation : Sophisticated applications may require multiple fact tables to share

dimension tables.

• This kind of schema can be viewed as a collection of stars, and hence is

called a galaxy schema or a fact constellation.

7. Data Granularities

Q4: Answer the following (choose THREE OUT OF FOUR): 6 MARKS

[2x3]

1. What is decision tree overfitting, and how to avoid it?

An induced tree may overfit the training data:
– Too many branches, some of which may reflect anomalies due to noise or outliers.
– Poor accuracy for unseen samples.
Two approaches to avoid overfitting (a short sketch follows below):
– Pre-pruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. It is difficult to choose an appropriate threshold.
– Post-pruning: remove branches from a “fully grown” tree to get a sequence of progressively pruned trees. Use a set of data different from the training data to decide which is the “best pruned tree”.
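A short sketch of both approaches using scikit-learn (assumed available); the parameter values are illustrative only:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop splitting early via depth / leaf-size thresholds.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X_train, y_train)
# Post-pruning: grow fully, then prune by cost-complexity (ccp_alpha).
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X_train, y_train)

# Use data not seen during training to choose the better-generalizing tree.
print(pre_pruned.score(X_test, y_test), post_pruned.score(X_test, y_test))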

2. Differentiate between the load modes (constructive and destructive merge)?

Destructive Merge
• In this mode, you apply the incoming data to the target data.
• If the primary key of an incoming record matches with the key of an existing record, update the matching target record.
• If the incoming record is a new record without a match with any existing record, add the incoming record to the target table.
Constructive Merge
• This mode is slightly different from the destructive merge.
• If the primary key of an incoming record matches with the key of an existing record, leave the existing record, add the incoming record, and mark the added record as superseding the old record (a short sketch of both modes follows).
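A minimal Python sketch contrasting the two merge modes; the record layout and key field are hypothetical:

incoming = [{"key": 1, "name": "new"}, {"key": 2, "name": "added"}]

def destructive_merge(rows, incoming):
    index = {r["key"]: r for r in rows}
    for rec in incoming:
        if rec["key"] in index:
            index[rec["key"]].update(rec)  # matching target record is updated in place
        else:
            rows.append(dict(rec))         # unmatched incoming record is added
    return rows

def constructive_merge(rows, incoming):
    for rec in incoming:
        for old in rows:
            if old.get("key") == rec["key"]:
                old["superseded"] = True   # old record is kept but marked as superseded
        rows.append(dict(rec))             # incoming record is always added
    return rows

print(destructive_merge([{"key": 1, "name": "old"}], incoming))
print(constructive_merge([{"key": 1, "name": "old"}], incoming))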
3. What are the key steps for data mining process?

4. What are the difference between operational systems and data warehouse according to
view and access patterns?

View
• An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historic data or data in different organizations.
• An OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization. Because of their huge volume, OLAP data are stored on multiple storage media.
Access patterns
• The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms.
• Accesses to OLAP systems are mostly read-only operations (because most DWs store historic rather than up-to-date information), although many could be complex queries.

B) Compare between (Choose FOUR OUT OF FIVE) 8 MARKS


1. OLTP and OLAP related to users and system orientation.
Users and system orientation.
• An OLTP system is customer-oriented and is used for transaction and query
processing by clerks, clients, and information technology professionals.
• An OLAP system is market-oriented and is used for data analysis by knowledge
workers, including managers, executives, and analysts.
2. Internal and external data sources.
Internal Data: In every organization, users keep their “private” data, such as spreadsheets, documents, customer profiles, and sometimes even departmental databases; the size of the internal data that should be included adds more complexity.
• In every operational system, you periodically take the old data and store it in archived files from old legacy systems.
External Data: Most executives depend on data from external sources for a high percentage of the information they use.
• They use statistics relating to their industry produced by external agencies and national statistical offices (market-share data of competitors, standard values of financial indicators for their business to check on their performance).

3. Capture through Database Triggers and Capture by Comparing Files


Capture through Database Triggers
• Triggers are special stored procedures (programs) that are stored in the database and fired when certain predefined events occur. The output of the trigger is written to a separate file that will be used to extract data for the DW.
• You can create trigger programs for all events for which you need data to be captured (e.g., to capture all changes to the records in the customer table). Applicable only to database applications (a short sketch follows).
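A minimal sketch using Python's sqlite3 module as a stand-in for the operational database; the table and trigger names are hypothetical:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE change_log (cust_id INTEGER, new_name TEXT, changed_at TEXT);

-- Fired on every update; writes the change to a separate table read by the DW extract.
CREATE TRIGGER capture_customer_update AFTER UPDATE ON customer
BEGIN
    INSERT INTO change_log VALUES (NEW.id, NEW.name, datetime('now'));
END;
""")

con.execute("INSERT INTO customer VALUES (1, 'John Smith')")
con.execute("UPDATE customer SET name = 'J. Smith' WHERE id = 1")
print(con.execute("SELECT * FROM change_log").fetchall())  # rows feeding the DW extract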
4. Append and constructive merge loading mode.
Append
• If data already exists in the table, the append process
unconditionally adds the incoming data, preserving the existing
data in the target table.
• When an incoming record is a duplicate of an already existing record, you may define how to handle an incoming duplicate.
• The incoming record may be allowed to be added as a duplicate or
may be rejected during the append process.
Constructive Merge
• This mode is slightly different from the destructive merge.
• If the primary key of an incoming record matches with the key of an
existing record, leave the existing record, add the incoming record,
and mark the added record as superseding the old record.

5. Extraction in DW and operational data sources.


Two major factors differentiate the data extraction in a new operational system
and DW.
• For a DW, you have to extract data from many disparate sources. Next, you
have to extract data on the changes for ongoing incremental loads as well as
for a one-time initial full load.
• For operational systems, all you need is one-time extractions and data
conversions.
Answer the following (choose THREE OUT OF FOUR): 6 MARKS
[2x3]
1. What are the difference between operational systems and data warehouse according to
data contents?
2. Differentiate between the load modes (load and destructive merge)?
Load
• If the target table to be loaded already exists and data exists in the table, the load process wipes out the existing data and applies the data from the incoming file.
• If the table is already empty before loading, the load process simply applies the data from the incoming file.

Destructive Merge
• In this mode, you apply the incoming data to the target data.
• If the primary key of an incoming record matches with the key of an
existing record, update the matching target record.
• If the incoming record is a new record without a match with any
existing record, add the incoming record to the target table.

3. Explain the role of metadata in DW?


Meta Data
• It defines the location and contents of data in the DW, searchable by users
to find definitions or subject areas.

4. Explain the data extraction issue (extraction frequency)?


Extraction frequency: for each data source, establish how frequently the
data extraction must be done: daily, weekly, quarterly, and so on.

Answer TWO of the following: (4 Marks)


A- What is difference(s) between OLAP and OLTP based on database design?
Database design
• An OLTP system usually adopts an entity-relationship (ER) data model and an application-oriented database design.
• An OLAP system typically adopts either a star or a snowflake model and a subject-oriented database design.
B- Explain briefly the differences between centralized DW and independent data
marts?
Centralized DW
• This architecture takes into account the enterprise-level information requirements (no separate data marts). Queries and applications access the normalized data in the central DW.
Independent Data Marts
• This type evolves in companies where the organizational units develop their own data marts for their own specific purposes.
• Although each data mart serves the particular organizational unit, these separate data marts do not provide “a single version of the truth.”
• These data marts are independent of one another and have inconsistent data definitions and standards.

C- Explain briefly the difference between base cuboid and apex cuboid?
The cuboid that holds the lowest level of summarization is called the base cuboid.
• The 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid (a short sketch follows).
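A small Python/pandas sketch of both cuboids over a hypothetical fact table:

import pandas as pd

df = pd.DataFrame({
    "time": ["Q1", "Q1", "Q2"],
    "location": ["Vancouver", "Toronto", "Vancouver"],
    "sales": [100, 200, 150],
})

base_cuboid = df.groupby(["time", "location"])["sales"].sum()  # lowest level: group by all dimensions
apex_cuboid = df["sales"].sum()                                # 0-D: a single grand total
print(base_cuboid)
print(apex_cuboid)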

F: What is the difference between time window and extraction frequency?


Extraction frequency: for each data source, establish how frequently the data extraction must be done: daily, weekly, quarterly, and so on.
Time window: for each data source, denote the time window for the extraction process.
Q3: Justify the following (choose THREE OUT OF FOUR): 6 MARKS
[2x3]
1. Data preprocessing is an essential step in data mining
Noise, missing data, inconsistencies, human errors, and mechanical errors all result in dirty data.
• Data preparation is one of the most important contributors to the success of the project.
– About 60% of the total time for a data mining project should be spent on data preparation.
– Changing the data may affect the discovered knowledge positively or negatively.
2. Top-Down approach takes time and effort to implement.
The approach collects enterprise-wide business requirements to build an enterprise DW with subset data marts.
However, it is expensive, takes a long time to develop, and lacks flexibility due to the difficulty in achieving consistency and consensus for a common data model for the entire organization.
3. Missing values exist in most datasets.

This can occur for several reasons, for example:
– A malfunction of the equipment used to record the data,
– A data collection form to which additional fields were added after some data had been collected,
– Information that could not be obtained, e.g., about a hospital patient.
• Missing values can be handled by discarding instances or by replacing them with the most frequent/average value (a short sketch follows).
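A short Python/pandas sketch of both handling strategies over a hypothetical dataset:

import pandas as pd

df = pd.DataFrame({"age": [25, None, 40], "city": ["Cairo", "Giza", None]})

dropped = df.dropna()  # discard instances that contain missing values
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())        # numeric: replace by the average
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])  # symbolic: replace by the most frequent value
print(dropped)
print(filled)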

4. Standardization is one of the most difficult processes in data transformation.

Standardization of data elements forms a large part of data transformation: you standardize the data types and field lengths for the same data elements retrieved from the various sources.

B) Choose TWO of the following: 4 MARKS


1- Explain how to the load fact and dimension tables? Which of them should be loaded
first? Why?

The key of the fact table is the concatenation of the keys of the dimension tables.
• For this reason, the dimension records are loaded first.
• You then create the concatenated key for the fact table record from the keys of the corresponding dimension records by performing a surrogate key look-up for the fact table (a short sketch follows).
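A minimal Python sketch of the surrogate key look-up; the dimension contents and field names are hypothetical:

# Dimensions are loaded first, so their surrogate keys are available for look-up.
product_dim = {"P-100": 1, "P-200": 2}   # natural key -> surrogate key
date_dim = {"2024-01-05": 10}

incoming_fact = {"product": "P-100", "date": "2024-01-05", "amount": 250.0}

fact_row = {
    "product_key": product_dim[incoming_fact["product"]],  # surrogate key look-up
    "date_key": date_dim[incoming_fact["date"]],
    "amount": incoming_fact["amount"],
}
print(fact_row)  # the fact table key is the concatenation of the dimension keys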

2- What is the best approach to design independent data mart? Why?


The bottom-up approach (implementing independent data marts) provides flexibility, low cost, and rapid return on investment.
• Combine and integrate the data marts to form the DW (this helps us incrementally build the DW by developing and integrating data marts as and when the requirements are clear).
• Advantages: it does not require high initial costs and has a faster implementation time.
• Problems occur when integrating the various disparate data marts into a consistent enterprise DW.
• This approach is more realistic, but the complexity of the integration may become a serious obstacle.
3- What is ROLAP? Explain its advantages briefly.
This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. Slicing and dicing actions are equivalent to adding a WHERE clause to the SQL statement (a short sketch follows below).
• Advantages:
• Can handle large amounts of data.
• Can leverage functionalities inherent in the relational database.
• Disadvantages:
• Performance can be slow.
• Limited by SQL functionalities.
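A minimal sketch using Python's sqlite3 module; the fact table and column names are hypothetical, and the slice is carried by the WHERE clause:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales_fact (time TEXT, location TEXT, amount REAL)")
con.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                [("Q1", "Vancouver", 100.0), ("Q2", "Vancouver", 150.0)])

# Slice: fix time = 'Q1'; dicing would simply add more predicates to the WHERE clause.
query = """SELECT location, SUM(amount)
           FROM sales_fact
           WHERE time = 'Q1'
           GROUP BY location"""
print(con.execute(query).fetchall())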

Q3: Justify the following (choose THREE OUT OF FOUR): 6 MARKS


[2x3]
1. There is no extra overhead in the operational systems in capture through the transaction log process. There is no extra overhead in the operational systems because logging is already part of the transaction processing.
2. Top-Down approach takes time and effort to implement. It collects enterprise-wide business requirements and builds an enterprise DW with subset data marts. However, it is expensive, takes a long time to develop, and lacks flexibility due to the difficulty in achieving consistency and consensus for a common data model for the entire organization.
3. Time is the most important factor in DW. Data are stored to provide information from a historic perspective (e.g., the past 5–10 years).
• The time element is very important in the data sources.
4. Star schema is a better choice for fast query processing.
It has a large central table (fact table) containing the bulk of the data, with no redundancy.