Data Mining and Warehousing

Course Code : CAT 307 WYXZ/MS – 23 / 1673

Sixth Semester B. Tech. (Computer Science and Engineering / AIML) Examination

DATA MINING AND WAREHOUSING

Time : 3 Hours ] [ Max. Marks : 60

Instructions to Candidates :—
(1) All questions carry marks as indicated against them.
(2) Assume suitable data wherever necessary and clearly state your assumptions.

1. (a) An insurance company, with branches all over the country, wants to develop
a data warehouse for effective decision-making about its insurance policies.
There are a number of different types of insurance like Auto insurance,
Home insurance, Industrial insurance, etc. The entire country is categorized
into four regions, namely, North, South, East and West. Each region consists
of a set of states. There may be different types of customers like individuals,
institutions, industries, etc. The data warehouse should record an entry for
each policy issued to each customer along with the premium paid.
With respect to the above use case, answer the following questions. Necessary
assumptions can be made to support your answer :
• Design a star schema for the data warehouse, clearly identifying
the fact table(s), dimension table(s), their attributes and measures,
along with the primary key and foreign key relationships.
• Convert the star schema to a snowflake schema.
• Write an SQL query that displays the region-wise,
insurance-type-wise, year-wise total premium collected from your
schema. 6(CO1)
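
One possible shape for such a schema and query, sketched with Python's built-in sqlite3 module. Every table name, column name and sample row below is an assumption made for illustration; the question leaves the design to the candidate, and other designs are equally valid.

```python
import sqlite3

# Hypothetical star schema: one fact table and four dimension tables.
# All names and all sample rows are assumptions, not given in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_policy_type (type_key INTEGER PRIMARY KEY, type_name TEXT);
CREATE TABLE dim_location    (loc_key  INTEGER PRIMARY KEY, state TEXT, region TEXT);
CREATE TABLE dim_customer    (cust_key INTEGER PRIMARY KEY, cust_name TEXT, cust_type TEXT);
CREATE TABLE dim_time        (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER);
CREATE TABLE fact_policy (
    type_key INTEGER REFERENCES dim_policy_type(type_key),
    loc_key  INTEGER REFERENCES dim_location(loc_key),
    cust_key INTEGER REFERENCES dim_customer(cust_key),
    time_key INTEGER REFERENCES dim_time(time_key),
    premium  REAL                      -- the measure
);
-- Made-up rows, only so that the query returns something.
INSERT INTO dim_policy_type VALUES (1, 'Auto'), (2, 'Home');
INSERT INTO dim_location    VALUES (1, 'Punjab', 'North'), (2, 'Kerala', 'South');
INSERT INTO dim_customer    VALUES (1, 'Customer A', 'individual'), (2, 'Customer B', 'industry');
INSERT INTO dim_time        VALUES (1, 1, 1, 2022), (2, 1, 6, 2023);
INSERT INTO fact_policy     VALUES (1, 1, 1, 1, 100.0), (1, 1, 2, 1, 50.0), (2, 2, 1, 2, 200.0);
""")

# Region-wise, insurance-type-wise, year-wise total premium.
query = """
SELECT l.region, p.type_name, t.year, SUM(f.premium) AS total_premium
FROM fact_policy f
JOIN dim_location    l ON f.loc_key  = l.loc_key
JOIN dim_policy_type p ON f.type_key = p.type_key
JOIN dim_time        t ON f.time_key = t.time_key
GROUP BY l.region, p.type_name, t.year
ORDER BY l.region, p.type_name, t.year
"""
totals = conn.execute(query).fetchall()
for row in totals:
    print(row)
```

In a snowflake version of this sketch, region would move out of dim_location into its own table referenced by a foreign key, and the query would gain one more join.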

(b) Consider a dimension CUSTOMER (cust_key, cust_name, cust_code, acc_status,
marital_status, address, state, zip).
Using your knowledge of the types of dimension tables, which type of dimension
table would you suggest for each of the following scenarios ? Give an explanation for each :
• Correction in customer name.
• Address of customer changes and the application needs to keep
track of current and previous address.
• Acc_status values can be good, late, very late, in arrears or
suspended. The history of the account status of each customer is
to be maintained. The account status of a customer changes
frequently. 4(CO1)

2. (a) Consider the star schema of an automobile data set :

Autos (ModelId, modelname, serialNo, color)
Dealers (DealerId, name, city, state, phone)
Time (TimeId, day, week, month, year)
Sales (ModelId, DealerId, TimeId, QtySold, AmountGenerated)
where the attribute AmountGenerated is intended to be the total price of all automobiles
for the given model, color, date and dealer, while QtySold is the total number
of automobiles in that category.
Answer the following OLAP queries :
• Find total sales generated for model name (Maruti, Honda) and
dealer state (Maharashtra, Gujarat) in September 2017 and October
2017 using ROLLUP across three dimensions : ModelId, DealerId
and TimeId.
• Find total sales generated for model name (Maruti, Honda) and
dealer state (Maharashtra, Gujarat) in September 2017 and October
2017 using CUBE across three dimensions : ModelId, DealerId
and TimeId.
• Comment on the difference in output between the ROLLUP and CUBE
aggregation clauses.
• Find total sales generated for model name (Maruti, Honda) and
dealer state (Maharashtra, Gujarat) in September 2017 and October
2017 using partial ROLLUP across DealerId and TimeId and
group by ModelId.
• Perform aggregation on amount generated. It should get aggregated
by day first, then by all the weeks in each month, and then
across all months in the year.
• Why is the Groupid( ) clause used in OLAP queries ? 7(CO2)
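
The difference between ROLLUP, CUBE and partial ROLLUP comes down to which grouping sets each one produces. The following Python sketch only enumerates those grouping sets for the three dimension keys of the Sales table; the helper names rollup_sets and cube_sets are hypothetical, and the sketch does not execute any SQL.

```python
from itertools import combinations

def rollup_sets(cols):
    """Grouping sets of ROLLUP(cols): every prefix of the column list,
    from the full list down to the empty set (the grand total)."""
    return [tuple(cols[:k]) for k in range(len(cols), -1, -1)]

def cube_sets(cols):
    """Grouping sets of CUBE(cols): every subset of the columns."""
    return [subset
            for k in range(len(cols), -1, -1)
            for subset in combinations(cols, k)]

dims = ["ModelId", "DealerId", "TimeId"]
rollup = rollup_sets(dims)   # n + 1 grouping sets
cube = cube_sets(dims)       # 2 ** n grouping sets

# Partial rollup: GROUP BY ModelId, ROLLUP(DealerId, TimeId).
# ModelId appears in every grouping set, so there is no grand total.
partial = [("ModelId",) + s for s in rollup_sets(["DealerId", "TimeId"])]

for name, sets in (("ROLLUP", rollup), ("CUBE", cube), ("PARTIAL", partial)):
    print(name, sets)
```

With three columns, ROLLUP yields four levels of subtotal, CUBE yields eight, and the partial form yields three, each of them still broken down by ModelId.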

(b) Illustrate various types of metadata used in the data warehouse. 3(CO1)

3. (a) Write the SQL command to create an Index Organized Table Employee with the attributes
cust_no, cust_name and cust_address in tablespace ts_iot as directed :
• cust_no is the primary key of the table.
• PCTTHRESHOLD is 30.
• Specify the OVERFLOW and INCLUDING clauses. Assume cust_name is to
be included in the INCLUDING clause.
Give the meaning of PCTTHRESHOLD and of the INCLUDING and OVERFLOW clauses. 5(CO2)

(b) Consider the following snapshot of the SALES table :

Extract of Sales Data

Address or Rowid      Date         Product      Color    Region   Sale(s)
00001BFE.0012.0111    15-Nov.-00   Dishwasher   White    East     300
00001BFE.0013.0114    15-Nov.-00   Dryer        Almond   West     450
00001BFF.0012.0115    16-Nov.-00   Dishwasher   Almond   West     350
00001BFF.0012.0138    16-Nov.-00   Washer       Black    North    550
00001BFF.0012.0145    17-Nov.-00   Washer       White    South    500
00001BFF.0012.0157    17-Nov.-00   Dryer        White    East     400
00001BFF.0014.0165    17-Nov.-00   Washer       Almond   South    575

Explain how the query "Select the rows from the Sales table where product
is 'Washer' and color is 'Almond' and region is 'East' or 'South'" will
be executed if bitmap indexes are created on the Product, Color and Region
columns. Show the intermediate steps. 5(CO2)
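
The intermediate steps can be traced with a small Python sketch that builds one bit vector per indexed value and combines them with bitwise AND and OR. It assumes the predicate is read as Washer AND Almond AND (East OR South); the helper name bitmap is hypothetical, and a real database would store compressed bitmaps rather than Python lists.

```python
# Rows of the Sales extract, in table order: (product, color, region, sales).
rows = [
    ("Dishwasher", "White",  "East",  300),
    ("Dryer",      "Almond", "West",  450),
    ("Dishwasher", "Almond", "West",  350),
    ("Washer",     "Black",  "North", 550),
    ("Washer",     "White",  "South", 500),
    ("Dryer",      "White",  "East",  400),
    ("Washer",     "Almond", "South", 575),
]
PRODUCT, COLOR, REGION = 0, 1, 2

def bitmap(column, value):
    """One bit per row: 1 where the row holds `value` in `column`."""
    return [1 if row[column] == value else 0 for row in rows]

# Step 1: read the bitmap for each value named in the query.
washer = bitmap(PRODUCT, "Washer")
almond = bitmap(COLOR, "Almond")
east = bitmap(REGION, "East")
south = bitmap(REGION, "South")

# Step 2: OR the two region bitmaps.
east_or_south = [e | s for e, s in zip(east, south)]

# Step 3: AND the product, color and combined region bitmaps.
result = [w & a & r for w, a, r in zip(washer, almond, east_or_south)]

# Step 4: the set bits identify the rows to fetch (1-based positions).
matching_rows = [i + 1 for i, bit in enumerate(result) if bit]
print(washer, almond, east_or_south, result, matching_rows)
```

Only the seventh row (Washer, Almond, South) survives all three conditions, so a single rowid is fetched from the table.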

4. (a) Suppose that the data for analysis include the attribute salary in thousands.
The salary values for the data tuples are :

24, 17, 23, 20, 19, 24, 20, 26, 29, 25, 26, 34, 29, 37, 29, 39,
29, 37, 40, 39, 44, 39, 39, 74, 49, 50, 56
(i) What are the mean, median and mode of the data ?

(ii) What are the range and midrange of the data ?

(iii) Find the first quartile (Q1), third quartile (Q3) and IQR of the data.

(iv) Give the five - number summary of the data.

(v) Show a boxplot of the data. 5(CO3)
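
A minimal sketch of these summary statistics using Python's standard statistics module. Quartile conventions differ between textbooks; this sketch assumes the module's default "exclusive" method, so a hand calculation with another convention may give slightly different Q1 and Q3. The 1.5 x IQR outlier rule used for the boxplot whiskers is likewise an assumption.

```python
import statistics

salary = [24, 17, 23, 20, 19, 24, 20, 26, 29, 25, 26, 34, 29, 37, 29, 39,
          29, 37, 40, 39, 44, 39, 39, 74, 49, 50, 56]
data = sorted(salary)

mean = statistics.mean(data)
median = statistics.median(data)
modes = statistics.multimode(data)        # every value with the highest count

value_range = data[-1] - data[0]
midrange = (data[0] + data[-1]) / 2

# Default method is "exclusive"; other conventions can shift Q1 and Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

five_number = (data[0], q1, median, q3, data[-1])

# Boxplot whiskers under the usual 1.5 * IQR rule (an assumption here).
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print(mean, median, modes, value_range, midrange, q1, q3, iqr, five_number, outliers)
```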

(b) Generate all strong association rules using the Apriori algorithm for the
transaction database shown below, with a minimum support s_min = 3 and
minimum confidence = 60%.

TId   Items
T1    a, d, e
T2    b, c, d
T3    a, c, e
T4    a, c, d, e
T5    a, e
T6    a, c, d
T7    b, c
T8    a, c, d, e
T9    b, c, e
T10   a, d, e
5(CO3)
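
A compact sketch of the level-wise Apriori procedure on this database, in plain Python. It assumes s_min = 3 is an absolute transaction count and that a rule is strong when its confidence is at least 60%; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import combinations

transactions = [set("ade"), set("bcd"), set("ace"), set("acde"), set("ae"),
                set("acd"), set("bc"), set("acde"), set("bce"), set("ade")]
MIN_SUPPORT = 3                      # absolute count (assumed reading of s_min)
MIN_CONFIDENCE = Fraction(60, 100)

def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

# Level 1: frequent single items.
items = sorted(set().union(*transactions))
level = [frozenset([i]) for i in items if support(frozenset([i])) >= MIN_SUPPORT]

frequent = {}                        # itemset -> support count
k = 1
while level:
    for itemset in level:
        frequent[itemset] = support(itemset)
    k += 1
    # Join step: merge frequent (k-1)-itemsets into k-item candidates.
    candidates = {a | b for a in level for b in level if len(a | b) == k}
    # Prune step: every (k-1)-subset of a candidate must itself be frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
    level = [c for c in candidates if support(c) >= MIN_SUPPORT]

# Rule generation: lhs -> rhs for every non-empty proper subset lhs.
rules = []
for itemset, count in frequent.items():
    for r in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), r):
            lhs = frozenset(lhs)
            confidence = Fraction(count, frequent[lhs])
            if confidence >= MIN_CONFIDENCE:
                rhs = itemset - lhs
                rules.append((tuple(sorted(lhs)), tuple(sorted(rhs)), count, confidence))

for lhs, rhs, count, confidence in sorted(rules):
    print(lhs, "->", rhs, "support", count, "confidence", float(confidence))
```

Confidence is kept as an exact fraction so that a rule sitting exactly on the 60% threshold is not lost to floating-point rounding.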

5. (a) Bring out the difference between supervised and unsupervised learning with
an example. 4(CO3)

(b) Apply the Naïve Bayes classification algorithm to predict whether
Rahul (Home owner : Yes, Marital status : Married, Job experience : 3) will
default on his loan.

Home Owner   Marital Status   Job Experience (yrs.)   Defaulted
Yes          Single           3                       No
No           Married          4                       No
No           Single           5                       No
Yes          Married          4                       No
No           Divorced         2                       Yes
No           Married          4                       No
Yes          Divorced         2                       No
No           Married          3                       Yes
No           Married          3                       No
Yes          Single           2                       Yes
6(CO3)
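
The Naïve Bayes computation can be sketched in Python as follows. The sketch assumes job experience is treated as a categorical attribute (exact match on the number of years) and that no Laplace smoothing is applied; a Gaussian treatment of job experience would give different likelihoods.

```python
from fractions import Fraction

# (home_owner, marital_status, job_experience_years, defaulted)
rows = [
    ("Yes", "Single",   3, "No"),
    ("No",  "Married",  4, "No"),
    ("No",  "Single",   5, "No"),
    ("Yes", "Married",  4, "No"),
    ("No",  "Divorced", 2, "Yes"),
    ("No",  "Married",  4, "No"),
    ("Yes", "Divorced", 2, "No"),
    ("No",  "Married",  3, "Yes"),
    ("No",  "Married",  3, "No"),
    ("Yes", "Single",   2, "Yes"),
]
query = ("Yes", "Married", 3)        # Rahul

def score(label):
    """Prior P(label) times the product of P(attribute value | label)."""
    subset = [r for r in rows if r[3] == label]
    result = Fraction(len(subset), len(rows))          # prior
    for i, value in enumerate(query):
        matches = sum(1 for r in subset if r[i] == value)
        result *= Fraction(matches, len(subset))       # conditional likelihood
    return result

scores = {label: score(label) for label in ("Yes", "No")}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Under these assumptions the unnormalised score for Defaulted = No is 12/245 against 1/90 for Defaulted = Yes, so the sketch predicts that Rahul will not default.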

6. (a) Consider the following data points : A1(2, 10), A2(2, 5), A3(8, 4), B1(5, 8),
B2(7, 5), B3(6, 4), C1(1, 2), C2(4, 0). Suppose initially we assign A1, B1
and C1 as the centers of the three clusters, respectively. Apply the K-means algorithm
using Manhattan distance as the distance function and show only the final three
clusters. 6(CO4)
(b) Present the conditions under which density-based clustering is more suitable than
partitioning-based clustering and hierarchical clustering. Give some application
examples to support your argument. 4(CO4)
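
A minimal Python sketch of the iteration for these eight points. It assumes the usual textbook variant in which points are assigned by Manhattan distance but each center is still updated to the coordinate-wise mean of its cluster; a k-medians variant would update with medians instead and may converge differently.

```python
points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4),
    "B1": (5, 8),  "B2": (7, 5), "B3": (6, 4),
    "C1": (1, 2),  "C2": (4, 0),
}
# Initial centers as stated in the question.
centers = [points["A1"], points["B1"], points["C1"]]

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def assign(centers):
    """Put each point into the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for name, p in points.items():
        nearest = min(range(len(centers)), key=lambda i: manhattan(p, centers[i]))
        clusters[nearest].append(name)
    return clusters

clusters = assign(centers)
while True:
    # Update step: mean of the member coordinates (assumed update rule).
    centers = [(sum(points[n][0] for n in c) / len(c),
                sum(points[n][1] for n in c) / len(c)) for c in clusters]
    new_clusters = assign(centers)
    if new_clusters == clusters:     # assignments stable: converged
        break
    clusters = new_clusters

print(clusters)
print(centers)
```

For this data the assignments stop changing after the first update, giving the clusters {A1}, {A3, B1, B2, B3} and {A2, C1, C2}.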
