
Sample Questions

Information Technology

Subject Name: Data Mining and Business Intelligence Semester: VI

Multiple Choice Questions

Choose the correct option for the following questions. All questions
carry equal marks.
1. Which of the following can be considered as the correct process of Data
Mining?
Option A: Infrastructure, Exploration, Analysis, Interpretation, Exploitation
Option B: Exploration, Infrastructure, Analysis, Interpretation, Exploitation
Option C: Exploration, Infrastructure, Interpretation, Analysis, Exploitation
Option D: Exploration, Infrastructure, Analysis, Exploitation, Interpretation

2. Which of the following is an essential process in which intelligent methods are applied to
extract data patterns?
Option A: Warehousing
Option B: Data Mining
Option C: Text Mining
Option D: Data Selection

3. What is KDD in data mining?


Option A: Knowledge Discovery Database
Option B: Knowledge Discovery Data
Option C: Knowledge Data definition
Option D: Knowledge data house

4. What are the functions of Data Mining?


Option A: Association and correlation analysis, classification
Option B: Prediction and characterization
Option C: Cluster analysis and Evolution analysis
Option D: All of the above

5. Which one of the following statements about K-means clustering is incorrect?
Option A: The goal of the k-means clustering is to partition (n) observation into (k)
clusters
Option B: K-means clustering can be defined as the method of quantization
Option C: The nearest neighbor is the same as the K-means
Option D: All of the above

6. Which one of the following can be defined as the data object which does
not comply with the general behavior (or the model of available data)?
Option A: Evaluation Analysis
Option B: Outlier Analysis
Option C: Classification
Option D: Prediction

7. Which one of the following correctly refers to the task of classification?
Option A: A measure of the accuracy, of the classification of a concept that is given by a
certain theory
Option B: The task of assigning a classification to a set of examples
Option C: A subdivision of a set of examples into a number of classes
Option D: None of the above

8. Euclidean distance measure can also be defined as ___________


Option A: The process of finding a solution for a problem simply by enumerating all
possible solutions according to some predefined order and then testing them
Option B: The distance between two points as calculated using the Pythagoras theorem
Option C: A stage of the KDD process in which new data is added to the existing
selection
Option D: All of the above

9. Which of the following is a good alternative to the star schema?


Option A: snow flake schema
Option B: star schema
Option C: star snow flake schema
Option D: fact constellation

10. “Efficiency and scalability of data mining algorithms” issues come under?
Option A: Mining Methodology and User Interaction Issues
Option B: Performance Issues
Option C: Diverse Data Types Issues
Option D: None of the above

11. ________ is the clustering technique which needs the merging approach.
Option A: Naïve Bayes
Option B: Hierarchical
Option C: Partitioned
Option D: All of the above

12. ________ are Data Mining applications?


Option A: Market Basket Analysis.
Option B: Fraud Detection.
Option C: Both A and B
Option D: None of the above

13. The KDD process consists of _______ steps.


Option A: 4
Option B: 9
Option C: 7
Option D: 5
14. Which among the following is a Data Mining Algorithm?
Option A: K-mean Algorithm
Option B: Apriori Algorithm.
Option C: Naive Bayes Algorithm
Option D: All of the above

15. Data mining requires


Option A: Large quantities of operational data stored over a period of time
Option B: Lots of tactical data
Option C: Several tape drives to store archival data
Option D: Large mainframe computers

16. Which of the following is NOT an example of an ordinal attribute?


Option A: Zip codes
Option B: Ordered numbers
Option C: Ascending or descending names
Option D: Military ranks

17. Identify the example of Nominal attribute


Option A: Temperature
Option B: Mass
Option C: Salary
Option D: Gender

18. Which of the following is not a data pre-processing method?


Option A: Data Visualization
Option B: Data Discretization
Option C: Data Cleaning
Option D: Data Reduction

19. A data warehouse


Option A: must import data from transactional systems whenever significant changes
occur in the transactional data
Option B: works on live transactional data to provide up to date and valid results
Option C: takes regular copies of transaction data
Option D: takes preprocessed transaction data and stores in a way that is optimized for
analysis

20. In a snowflake schema, which of the following types of tables is considered?
Option A: Fact
Option B: Dimension
Option C: Both (a) and (b)
Option D: None of the above

21. When you ____ the data, you are aggregating the data to a higher level
Option A: Slice
Option B: Roll Up
Option C: Roll Down
Option D: Drill Down

22. Which type of data storage architecture gives fastest performance?


Option A: ROLAP
Option B: MOLAP
Option C: HOLAP
Option D: DOLAP

23. ______ supports basic OLAP operations, including slice and dice, drill-down, roll-up and
pivoting.
Option A: Information processing
Option B: Analytical processing
Option C: Data mining
Option D: Transaction processing

24. Data mining is______________?


Option A: time variant non-volatile collection of data
Option B: The actual discovery phase of a knowledge
Option C: The stage of selecting the right data
Option D: None of these

25. Business intelligence (BI) is a broad category of application programs which includes
_____________
Option A: Decision support
Option B: Data mining
Option C: OLAP
Option D: All of the mentioned

26. ________ is a performance management tool that recapitulates an organization’s
performance from several standpoints on a single page.
Option A: Balanced Scorecard
Option B: Data Cube
Option C: Dashboard
Option D: All of the mentioned

27. Prediction is_________


Option A: The result of the application of a theory or a rule in a specific case
Option B: One of several possible entries within a database table that is chosen by the
designer as the primary means of accessing the data in the table.
Option C: Discipline in statistics that studies ways to find the most interesting
projections of multi-dimensional spaces.
Option D: None of these

28. Decision support systems (DSS) is _____________


Option A: A family of relational database management systems marketed by IBM
Option B: Interactive systems that enable decision makers to use databases and models
on a computer in order to solve ill-structured problems
Option C: It consists of nodes and branches starting from a single root node. Each node
represents a test, or decision
Option D: None of these

29. Association analysis is used to discover patterns that describe _________________
associated features in the data.
Option A: largely
Option B: fewer
Option C: strongly
Option D: moderately

30. Binary attributes are __________________


Option A: This takes only two values. In general, these values will be 0 and 1 and they
can be coded as one bit
Option B: The natural environment of a certain species
Option C: Systems that can be used without knowledge of internal operations
Option D: None of these
Descriptive Questions

10 marks each
1. Explain role of Business intelligence in any one of following domain: Fraud Detection,
Market Segmentation, retail industry, and telecommunications industry. Explain how
data mining can be helpful in any of these cases.

Ans. Fraud Detection for Telecommunications Industry:

● In the last few years there has been huge expansion in telecommunications industry
with development in reasonable mobile phone technology. The global mobile phone
fraud is increasing with the increase in mobile phone users.
● There are various types of telecom fraud which occurs at different level.
● Common types of frauds are:
○ Subscription fraud
○ Superimposed or "surfing" fraud.
● When fraud occurs while subscribing to a service, it is called subscription fraud. In this
type of fraud, the fraudster hides his/her identity by providing wrong documents with a
false identity. The intention of the fraudster is not to pay the bill amount, so all
transactions with this number will be fraudulent.
● When an unauthorized person uses a service without having any authority, the fraud is
called superimposed fraud. This type of fraud is detected when fake or phantom calls
appear on a bill.
● Superimposed frauds can be carried out in a number of ways, which includes mobile
phone cloning and getting authorization details of calling card.
● These types of frauds occur at the level of individual calls; the fraudulent calls will be
mixed in with authentic ones.
● Subscription fraud can be detected during the billing process, though the aim is to detect
it much before that, since huge costs can run up.
● Superimposed fraud can remain undetected for a long time.
● Telecommunications networks generate vast quantities of data, sometimes on the order
of several gigabytes per day, so that data mining techniques are of particular
importance.
● Various data mining techniques like rule based detection system, statistical
summarization etc are used to detect such telecommunication fraud:

Simple rule-based detection systems use rules like:

○ Calls from geographically distant locations within a short span of time,
○ Overlapping calls at the same time,
○ Long calls or high-value calls.
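The following is a minimal, hypothetical Python sketch of such rule-based checks; the call-record fields (caller, start, end, duration_min, amount) and the thresholds are illustrative assumptions, not part of any particular detection product.

```python
from datetime import datetime

# Hypothetical call records for one subscriber; field names are illustrative only.
calls = [
    {"caller": "A", "start": datetime(2023, 1, 1, 10, 0), "end": datetime(2023, 1, 1, 10, 30),
     "duration_min": 30, "amount": 5.0, "location": "Mumbai"},
    {"caller": "A", "start": datetime(2023, 1, 1, 10, 15), "end": datetime(2023, 1, 1, 10, 45),
     "duration_min": 30, "amount": 4.0, "location": "London"},
]

def flag_overlapping_calls(records):
    """Flag pairs of calls from the same caller that overlap in time."""
    flags = []
    for i, c1 in enumerate(records):
        for c2 in records[i + 1:]:
            if c1["caller"] == c2["caller"] and c1["start"] < c2["end"] and c2["start"] < c1["end"]:
                flags.append((c1["caller"], "overlapping calls", c1["location"], c2["location"]))
    return flags

def flag_long_or_high_value(records, max_minutes=120, max_amount=50.0):
    """Flag unusually long or unusually expensive calls (illustrative thresholds)."""
    return [c for c in records if c["duration_min"] > max_minutes or c["amount"] > max_amount]

print(flag_overlapping_calls(calls))
print(flag_long_or_high_value(calls))
```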

Extra:

Market Segmentation:

● Market segmentation is a process in which a market is divided into smaller sub-markets
called segments. The segments are homogeneous, i.e., they have similar attributes.
Purchasing patterns and trends appear significantly in some segments. A good market
segmentation is one in which segments with significant patterns can arise.
● Using market segmentation the following may be analysed:
○ Analysis of market responsiveness: This is very useful in direct marketing, as the
responsiveness to product offerings can easily be determined.
○ Analysis of market trends: Market trends can be revealed by analysing segment-by-segment
changes in sales revenues. This information is very important for ever-changing markets.

Retail Industry:

● The retail industry is realizing the potential advantages of using data mining.
● The retail industry provides a rich source of data for data mining. Just like other industries
(e.g., banking), the retail industry also collects huge amounts of data over the years and can
find useful pieces of information through various tools.
● In the retail industry, data mining can be used for multidimensional analysis of sales,
customers and products. It can be helpful in customer retention, and the effectiveness of
sales campaigns may be analyzed.
● Following are some examples of how data mining is useful in retail industry.
○ Performing market basket analysis: Finding which items customers buy together.
This information can be helpful in Layout strategies inside the store, stocking and
promotions.
○ Sales forecasting: Retailers can make stocking decisions by exploring time-based
patterns. For example, if a customer buys a certain item today, when is he likely to
buy a complementary item?

Telecommunications Industry:

● One of the first industries to adopt data mining was the telecommunications industry. This
industry maintains data related to phone calls, which contains detailed information
about each phone call.
● The data stored is huge in volume and of high quality; the industry has a very large customer
base and operates in a rapidly changing and highly competitive environment.
● Data mining plays an important role in telecommunication industry in improving their
marketing efforts, identifying fraud and managing their networks.
● Challenging issues in using data mining in this industry include the huge amount of
data, the sequential and temporal aspects of the data, and the prediction of rare events
like fraud or network failures in real time.
● Use of data mining in telecommunications can be viewed as an extension of the use of
expert systems.
2. Explain Star, Snowflake, and Fact Constellation Schema for Multidimensional Database
Ans:
Star Schema:
● A star schema is a type of multidimensional model used for a data warehouse.
● A star schema contains fact tables and dimension tables.
● In this schema fewer foreign-key joins are used.
● This schema forms a star with a fact table and dimension tables.

Snowflake Schema:
● The snowflake schema is also a type of multidimensional model used for a data
warehouse.
● A snowflake schema contains fact tables, dimension tables as well as sub-dimension
tables.
● This schema forms a snowflake with fact tables, dimension tables as well as
sub-dimension tables.

Fact Constellation:
● Fact constellation is a schema for representing a multidimensional model.
● It is a collection of multiple fact tables having some common dimension tables.
● It can be viewed as a collection of several star schemas and hence is also known as a
Galaxy schema.
● It is one of the widely used schemas for data warehouse design and is much more
complex than the star and snowflake schemas.
● For complex systems, we require fact constellations.
● In a typical fact constellation diagram, some dimension tables are shared (common) among
the star schemas, while each fact table belongs to its own star schema.
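As an illustration of the star schema described above, here is a small, hypothetical Python sketch using pandas; the table and column names (fact_sales, dim_product, dim_store) are made up for the example.

```python
import pandas as pd

# A tiny star schema: one fact table keyed to two dimension tables (illustrative data).
dim_product = pd.DataFrame({"product_id": [1, 2], "product_name": ["Pen", "Book"]})
dim_store   = pd.DataFrame({"store_id": [10, 20], "city": ["Pune", "Mumbai"]})
fact_sales  = pd.DataFrame({"product_id": [1, 2, 1], "store_id": [10, 10, 20],
                            "units": [5, 2, 7], "revenue": [50, 200, 70]})

# A typical star-schema query: join the fact table to its dimensions and aggregate.
report = (fact_sales
          .merge(dim_product, on="product_id")
          .merge(dim_store, on="store_id")
          .groupby(["city", "product_name"])["revenue"].sum())
print(report)
```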

3. Explain Data warehouse architecture

● Data warehouse is a Centralised system.


● It is a subject oriented, integrated, time-variant and non-volatile collection of data in
support of management's decision-making process.

Operational Databases:

● An operational database is a term used in data warehousing to refer to a system that is
used to process the day-to-day transactions of an organization.

Flat Files:

● It is a system of files in which transactional data is stored and every file in the system
must have a different name.

External Sources:

● It is a source from where data is collected, irrespective of the type of data.

ETL:

● E (Extract): data is extracted from the external data sources
● T (Transform): data is transformed into the standard format
● L (Load): data is loaded into the data warehouse after transforming it into the standard
format

Meta Data:

● A set of data that defines and gives information about other data.
● It summarizes necessary information about data, which can make finding and working with
particular instances of data more accurate.

Summary Data:

● It is generated by the warehouse manager.


● It updates as new data loads into the warehouse.
● This component can include lightly or highly summarized data.
● Main role is to speed up the query performance.

Raw Data:

● It is the actual data loaded into the repository, which has not been processed.

Data Mart:

● It allows you to have multiple groups within the system by segmenting the data in the
warehouse into categories.

Data Mining:

● The practice of analysing the big data present in data warehouse is data mining.
● It is used to find the hidden patterns that are present in the data warehouse.
4. What is clustering? Explain the K-means clustering algorithm. Suppose the data for
clustering is {2, 4, 10, 12, 3, 20, 11, 25}. Consider k=2; cluster the given data using the above
algorithm.

Ans:

Clustering:

● Clustering is an unsupervised learning problem.


● Clustering is a data mining (machine learning) technique used to place data elements
into related groups without advance knowledge of the group definitions. It is a process
of partitioning data objects into subclasses, which are called clusters.
● A cluster contains data objects which have high intra-cluster similarity and low inter-cluster similarity.

K-means clustering algorithm:

k - number of clusters

n - sample feature vectors X1, X2, ..., Xn

mi - the mean of the vectors in cluster i

Algorithm:

● Assume k < n
● Make initial guesses for the means m1, m2, ..., mk
● Until there are no changes in any mean:

Use the current means to assign each sample to its nearest cluster

for i = 1 to k

Replace mi with the mean of all of the samples assigned to cluster i

end_for

end_until

● The following 3 steps are repeated until no object moves to a different group:
○ Find the centroid coordinates
○ Find the distance of each object to the centroids
○ Group the objects based on the minimum distance
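A minimal Python sketch of the above steps applied to the given data with k = 2 is shown below; taking the first and last points as the initial means is an assumption (other seeds may converge to a different grouping).

```python
# K-means sketch for the data in the question, with k = 2.
data = [2, 4, 10, 12, 3, 20, 11, 25]
means = [data[0], data[-1]]          # initial guesses m1 = 2, m2 = 25 (a choice)

while True:
    # Assign each object to the cluster with the nearest mean.
    clusters = [[] for _ in means]
    for x in data:
        nearest = min(range(len(means)), key=lambda i: abs(x - means[i]))
        clusters[nearest].append(x)
    # Recompute each mean; stop when no mean changes.
    new_means = [sum(c) / len(c) for c in clusters]
    if new_means == means:
        break
    means = new_means

print("clusters:", clusters)   # [[2, 4, 10, 12, 3, 11], [20, 25]]
print("means:   ", means)      # [7.0, 22.5]
```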

5. Explain multilevel association & multidimensional association rules with example.

Ans.

Multilevel Association Rules:


Mulitdimensional Association Rule:

6. Define support and confidence. Also generate association rules. A database has four
transactions. Let minimum support and confidence be 50%.
Ans:
7. Define support and confidence. Also generate association rules. A database has four
transactions. Let minimum support = 2 and confidence be 80%.

Ans:

8. Explain Business Intelligence and decision support system.

Ans:
Business Intelligence:

Decision Support System:


9. Short note on Outlier analysis and describe the methods that can be used for outliers.

Ans:

● Outlier Analysis is a process that involves identifying the anomalous observation in the
dataset.
● Outlier Analysis is a data mining task which is referred to as “outlier mining”.
● It has various applications in fraud detection, such as unusual usage of credit cards or
telecommunication services, healthcare analysis for finding unusual responses to
medical treatments, and also identifying the spending nature of customers in marketing.
10. Explain KDD process using figure.

Ans:
11. Define outlier analysis? Why outlier mining is important? Briefly describe the different
approaches: statistical-based outlier detection, distance-based outlier detection and
deviation- based outlier detection.

Ans:

Outlier Analysis:

● Outlier Analysis is a process that involves identifying the anomalous observation in the
dataset.
● Outlier Analysis is a data mining task which is referred to as “outlier mining”.
● It has various applications in fraud detection, such as unusual usage of credit cards or
telecommunication services, healthcare analysis for finding unusual responses to
medical treatments, and also identifying the spending nature of customers in marketing.

The detection of these potential outliers is sometimes important for the following reasons:

● An outlier can indicate erroneous data; this can be due to human error or some
experimental error. Such data needs to be either deleted from further analysis or
corrected before further analysis.
● An outlier does not always indicate bad data; it may have occurred due to some natural
variation. In such cases, statistical analysis can be used to analyze the outlier obtained.
● Outliers should not be confused with noise. Noise is a random error or variance
measured in a variable. In outlier detection noise should be removed first.

Statistical Based Outlier Analysis:

● The statistical distribution-based approach to outlier detection assumes a distribution or
probability model for the given data set (e.g., a normal or Poisson distribution) and then
identifies outliers with respect to the model using a discordancy test.

Distance Based Outlier Analysis:

● The distance-based outlier detection method consults the neighbourhood of an object,
which is defined by a given radius.
● An object is then considered an outlier if its neighbourhood does not have enough other
points. A distance threshold can be defined as a reasonable neighbourhood of the object.
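A minimal sketch of this idea in Python is given below; the sample data, radius and minimum-neighbour threshold are illustrative assumptions.

```python
# Distance-based outlier detection on 1-D data: an object is flagged if fewer than
# `min_neighbors` other points lie within the given `radius` of it.
def distance_outliers(points, radius=5.0, min_neighbors=2):
    outliers = []
    for i, p in enumerate(points):
        neighbors = sum(1 for j, q in enumerate(points) if j != i and abs(p - q) <= radius)
        if neighbors < min_neighbors:
            outliers.append(p)
    return outliers

print(distance_outliers([10, 11, 12, 13, 14, 15, 50]))   # -> [50]
```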

Deviation Based Outlier Analysis:

● Deviation-based outlier detection does not use statistical tests or distance-based
measures to identify exceptional objects.
● Instead, it identifies outliers by examining the main characteristics of objects in a
group. Objects that “deviate” from this description are considered outliers.
● Hence, in this approach the term “deviation” is typically used to refer to outliers.
12. What is noise? Explain data smoothing methods as a noise removal technique. Divide the
given data into bins of size 3 by equal-frequency bin partitioning, and smooth by bin means, by
bin medians and by bin boundaries. Consider the data: 10, 2, 19, 18, 20, 18, 25, 28, 22

Ans:

Noise:

● A random error or variance in a measured variable is known as noise.


● Noise in the data may be introduced due to:
○ Fault in data collection instruments.
○ Error introduced at data entry by a human or a computer.
○ Data transmission errors.
● Different types of noise in data:
○ Unknown encoding: Gender: E
○ Out of range values: Temperature: 1004, Age: 125
○ Inconsistent entries: DoB: 10-Feb-2003; Age: 30
○ Inconsistent formats: DoB: 11-Feb-1984; Doj: 2/11/2007

Binning Algorithm:

● Used to smooth or handle noisy data.


● First sort the data; the sorted values are then partitioned and stored in the form of bins.
● 3 methods for smoothing the data are:
○ Smoothing by bin means: values in the bin are replaced by the mean value of the bin
○ Smoothing by bin medians: values are replaced by the median value of the bin
○ Smoothing by bin boundaries: the minimum and maximum values of the bin are taken
as the boundaries, and each value is replaced by the closest boundary value.
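A minimal Python sketch applying these three methods to the data in the question (equal-frequency bins of size 3) is given below; resolving ties toward the lower boundary in boundary smoothing is a choice made here, not a fixed rule.

```python
import statistics

data = [10, 2, 19, 18, 20, 18, 25, 28, 22]
size = 3
bins = [sorted(data)[i:i + size] for i in range(0, len(data), size)]

by_means   = [[round(statistics.mean(b), 2)] * len(b) for b in bins]
by_medians = [[statistics.median(b)] * len(b) for b in bins]
# Each value is replaced by the nearer of the bin's min/max; ties go to the lower boundary.
by_bounds  = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print("bins:       ", bins)         # [[2, 10, 18], [18, 19, 20], [22, 25, 28]]
print("by means:   ", by_means)
print("by medians: ", by_medians)
print("by bounds:  ", by_bounds)
```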
13. State the Apriori Property. Generate candidate itemsets, frequent itemsets and
association rules using Apriori algorithm on the following data set with minimum
support count is 2.
Ans:
Apriori Algorithm:
● The Apriori Algorithm solves the frequent item sets problem.
● The algorithm analyzes a data set to determine which combinations of items occur
together frequently. The Apriori algorithm is at the core of various algorithms for data
mining problems. The best known problem is finding the association rules that hold in a
basket - item relation.
● Basic idea:
○ An itemset can only be a large itemset if all its subsets are large itemsets.
○ Frequent itemsets: The sets of items that have minimum support.
○ All the subsets of a frequent itemset must be frequent; e.g., if (PQ) is a frequent
itemset, then (P) and (Q) must also be frequent.
○ Find frequent itemsets iteratively with cardinality 1 to k (k-itemsets). Generate
association rules from the frequent itemsets.
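A minimal Python sketch of the candidate-generation and pruning idea is given below; since the question's transaction table is not reproduced here, the transactions used are hypothetical, with minimum support count 2.

```python
from itertools import combinations

# Hypothetical transactions; minimum support count = 2.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"butter", "milk"},
]
min_sup = 2

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# L1: frequent 1-itemsets
items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
all_frequent = list(frequent)

k = 2
while frequent:
    # Join frequent (k-1)-itemsets, then prune by the Apriori property:
    # every (k-1)-subset of a candidate must itself be frequent.
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
    frequent = [c for c in candidates if support(c) >= min_sup]
    all_frequent.extend(frequent)
    k += 1

for s in all_frequent:
    print(set(s), "support =", support(s))

# Confidence of a rule X -> Y is support(X ∪ Y) / support(X), e.g.:
x, y = frozenset({"bread"}), frozenset({"milk"})
print("conf(bread -> milk) =", support(x | y) / support(x))
```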
14. Consider a data warehouse for a hospital where there are three dimensions:

a) Doctor b) Patient c) Time


Consider two measures
i) Count
ii) Charge where charge is the fee that the doctor charges a patient for a visit.
For the above example create a cube and illustrate the following OLAP
operations.
1) Rollup 2 ) Drill down 3) Slice 4) Dice 5) Pivot.
Ans:
15. Consider the following data
points:13,15,16,16,19,20,20,21,22,22,25,25,25,25,30,33,33,35,35,35,35,36,40,45,46,52
,70.

a) What is the mean of the data? What is the median?

b) What is the mode of data?


c) What is the midrange of the data?
d) Can you find Q1,Q3?
e) Show a boxplot of the data.

Ans:
16. Explain different methods that can be used to evaluate and compare the accuracy of
different classification algorithms?
Ans:
● Accuracy: Accuracy of a classifier refers to its ability to predict the class label correctly;
accuracy of a predictor refers to how well a given predictor can guess the value of the
predicted attribute for new data.
● Speed: This refers to the computational cost in generating and using the classifier or
predictor.
● Robustness: It refers to the ability of the classifier or predictor to make correct predictions
from given noisy data.
● Scalability: Scalability refers to the ability to construct the classifier or predictor
efficiently, given a large amount of data.
● Interpretability: It refers to the extent to which the classifier or predictor can be
understood, i.e., the level of insight it provides.

17. Predict the class for X = {age = youth, income = medium, student = yes, credit_rating = fair}
using Naive Bayes Classification

Id  Age     Income  Student  Credit-rating  Buys computer
1   Young   High    No       Fair           No
2   Young   High    No       Good           No
3   Middle  High    No       Fair           Yes
4   Old     Medium  No       Fair           Yes
5   Old     Low     Yes      Fair           Yes
6   Old     Low     Yes      Good           No
7   Middle  Low     Yes      Good           Yes
8   Young   Medium  No       Fair           No
9   Young   Low     Yes      Fair           Yes
10  Old     Medium  Yes      Fair           Yes
11  Young   Medium  Yes      Good           Yes
12  Middle  Medium  No       Good           Yes
13  Middle  High    Yes      Fair           Yes
14  Old     Medium  No       Good           No
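A minimal Python sketch of the Naive Bayes computation over this table is given below; mapping the query value "youth" to the table's "Young" category is an assumption.

```python
# Naive Bayes over the table above: P(class) * product of P(attribute value | class).
rows = [
    ("Young", "High", "No", "Fair", "No"),    ("Young", "High", "No", "Good", "No"),
    ("Middle", "High", "No", "Fair", "Yes"),  ("Old", "Medium", "No", "Fair", "Yes"),
    ("Old", "Low", "Yes", "Fair", "Yes"),     ("Old", "Low", "Yes", "Good", "No"),
    ("Middle", "Low", "Yes", "Good", "Yes"),  ("Young", "Medium", "No", "Fair", "No"),
    ("Young", "Low", "Yes", "Fair", "Yes"),   ("Old", "Medium", "Yes", "Fair", "Yes"),
    ("Young", "Medium", "Yes", "Good", "Yes"),("Middle", "Medium", "No", "Good", "Yes"),
    ("Middle", "High", "Yes", "Fair", "Yes"), ("Old", "Medium", "No", "Good", "No"),
]
query = ("Young", "Medium", "Yes", "Fair")   # age, income, student, credit rating

scores = {}
for label in ("Yes", "No"):
    class_rows = [r for r in rows if r[-1] == label]
    score = len(class_rows) / len(rows)                    # prior P(class)
    for i, value in enumerate(query):                      # P(attribute | class)
        score *= sum(1 for r in class_rows if r[i] == value) / len(class_rows)
    scores[label] = score

print(scores)                                      # ~{'Yes': 0.028, 'No': 0.007}
print("predicted:", max(scores, key=scores.get))   # -> Yes
```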

18. Short note on DBSCAN clustering algorithm with example.

Ans:
19. Consider the following data points:
11,13,13,15,15,16,19,20,20,20,21,21,22,23,24,30,40,45,45,45,71,72,73,75.

(a) Find Mean, Median and Mode. 


(b) Show a box plot of the data. Clearly indicating the five-number summary.
Ans:
20. Why is Data Preprocessing required? Explain the different steps involved in data
preprocessing.

Ans:

Data Preprocessing:

● The process that involves transformation of data into information through classifying,
sorting, merging, recording, retrieving, transmitting, or reporting is called data
processing. Data processing can be manual or computer based.
● In the business world, data processing refers to processing data so as to enable the
effective functioning of organizations and businesses.

Why it is required:

● Real world data are generally


○ Incomplete: The data is said to be incomplete when certain attributes or
attribute values are missing, or only aggregate data is available.
○ Noisy: When the data contains errors or some outliers it is considered to be
noisy data.
○ Inconsistent: When the data contains differences in codes or names it is
inconsistent data.
● Tasks in data pre-processing:
○ Data cleaning: This process consists of filling of missing values, smoothening
noisy data, identifying and removing any outliers present and resolving
inconsistencies.
○ Data Integration : This refers to Integrating data from multiple sources like
databases, data cubes, or files.
○ Data transformation: Normalization and aggregation.
○ Data reduction: The amount of data is reduced, but the same analytical results are
produced.
○ Data discretization: Part of data reduction, replacing numerical attributes with
nominal ones.

Different steps involved in data preprocessing:

1. Data Cleaning: The data can have many irrelevant and missing parts. To handle this part, data
cleaning is done. It involves handling of missing data, noisy data etc.

● Missing Data: This situation arises when some data is missing in the data. It can be
handled in various ways. Some of them are:
○ Ignore the tuples: This approach is suitable only when the dataset we have is
quite large and multiple values are missing within a tuple.
○ Fill the Missing values: There are various ways to do this task. You can choose to
fill the missing values manually, by attribute mean or the most probable value.
● Noisy Data: Noisy data is meaningless data that can’t be interpreted by machines. It
can be generated due to faulty data collection, data entry errors etc. It can be handled in
the following ways:
○ Binning Method: This method works on sorted data in order to smooth it. The
whole data is divided into segments of equal size and then various methods are
performed to complete the task. Each segment is handled separately. One can
replace all data in a segment by its mean, or boundary values can be used to
complete the task.
○ Regression: Here data can be made smooth by fitting it to a regression
function. The regression used may be linear (having one independent variable) or
multiple (having multiple independent variables).
○ Clustering: This approach groups similar data into clusters. The outliers may go
undetected, or they will fall outside the clusters.

2. Data Transformation: This step is taken in order to transform the data in appropriate forms
suitable for mining process. This involves following ways:

● Normalization: It is done in order to scale the data values into a specified range (-1.0 to
1.0 or 0.0 to 1.0); a minimal sketch is given after this list.
● Attribute Selection: In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
● Discretization: This is done to replace the raw values of numeric attribute by interval
levels or conceptual levels.
● Concept Hierarchy Generation: Here attributes are converted from lower level to higher
level in hierarchy. For Example-The attribute “city” can be converted to “country”.
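A minimal sketch of min-max normalization to the range [0.0, 1.0], using an illustrative list of values:

```python
# Min-max normalization: v' = (v - min) / (max - min), scaled into [0.0, 1.0].
values = [200, 300, 400, 600, 1000]

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)   # [0.0, 0.125, 0.25, 0.5, 1.0]
```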

3. Data Reduction: Data mining is a technique used to handle huge amounts of data; while
working with such volumes of data, analysis becomes harder. In order to get rid of this, we use
data reduction techniques. They aim to increase storage efficiency and reduce data storage and
analysis costs.

The various steps to data reduction are:

● Data Cube Aggregation: Aggregation operation is applied to data for the construction of
the data cube.
● Attribute Subset Selection: The highly relevant attributes should be used; the rest can be
discarded. For performing attribute selection, one can use the level of significance and the
p-value of the attribute; attributes having a p-value greater than the significance level can be
discarded.
● Numerosity Reduction: This enables storing a model of the data instead of the whole data,
for example: regression models.
● Dimensionality Reduction: This reduces the size of data by encoding mechanisms. It can
be lossy or lossless. If, after reconstruction from compressed data, the original data can be
retrieved, such reduction is called lossless reduction; otherwise it is called lossy reduction.
The two effective methods of dimensionality reduction are wavelet transforms and PCA
(Principal Component Analysis).
21. Illustrate any one classification technique for the data set below. Show how we can
classify a new tuple with (Homeowner=Yes; Status=Employed; Income=Average).

Id  Homeowner  Status      Income   Defaulted
1   Yes        Employed    High     No
2   No         Business    Average  No
3   No         Employed    Low      No
4   Yes        Business    High     No
5   No         Unemployed  Average  Yes
6   No         Business    Low      No
7   Yes        Unemployed  High     No
8   No         Employed    Average  Yes
9   No         Business    Low      No
10  No         Employed    Average  Yes

5 marks each
1) Explain why data warehouses are needed for developing business solutions from today’s
perspective. Discuss the role of data marts.

Ans: Importance of Data Warehouse:

● Potential high returns on investment and enhanced business intelligence:
Implementation of a data warehouse requires a huge investment (in lakhs of rupees), but it
helps the organization to take strategic decisions based on past historical data, and the
organization can improve the results of various processes like market segmentation,
inventory management and sales.
● Competitive advantage: As previously unknown and unavailable data is available in the data
warehouse, decision makers can access that data to take decisions and gain a competitive
advantage.
● Saves Time: As the data from multiple sources is available in integrated form, business
users can access data from one place. There is no need to retrieve the data from
multiple sources.
● Better enterprise intelligence: It improves the customer service and productivity.
● High quality data: Data in the data warehouse is cleaned and transformed into the desired
format, so data quality is high.

Data Mart:

● A data mart is oriented to a specific purpose or major data subject that may be
distributed to support business needs. It is a subset of the data resource.
● A data mart is a repository of a business organization's data implemented to answer
very specific questions for a specific group of data consumers such as organizational
divisions of marketing, sales, operations, collections and others. A data mart is typically
established as one dimensional model or star schema which is composed of a fact table
and multi-dimensional table.
● A data mart is a small warehouse which is designed for the department level.
● It is often a way to gain entry and provides an opportunity for learning.
● Major problem: If they differ from department to department, they can be difficult to
integrate enterprise wide.
2) Explain various features of Data Warehouse?

Ans:
3) Discuss the application of data warehousing and data mining

Ans:
● Healthcare:
○ One of the most important sector which utilizes data warehouses is the
Healthcare sector. All of their financial, clinical, and employee records are fed to
warehouses as it helps them to strategize and predict outcomes, track and
analyze their service feedback, generate patient reports, share data with tie-in
insurance companies, medical aid services, etc.
● Government and Education:
○ The federal government utilizes the warehouses for research in compliance,
whereas the state government uses them for services related to human resources,
like recruitment, and accounting, like payroll management. The government uses
data warehouses to maintain and analyze tax records and health policy records with
their respective providers, and the entire criminal law database is connected to the
state’s data warehouse.
○ Universities use warehouses for extracting information used for proposals of
research grants, understanding their student demographics, and human resource
management.
● Finance Industry:
○ Applications are similar to those seen in banking, and mainly revolve around
evaluation and trends of customer expenses, which aids in maximizing the profits
earned by their clients.
● Consumer Goods Industry:
○ They are used for prediction of consumer trends, inventory management, and market
and advertising research. In-depth analysis of sales and production is also carried
out. Apart from these, information is exchanged with business partners and clientele.
● Hospitality Industry:
○ A major proportion of this industry is dominated by hotel and restaurant
services, car rental services, and holiday home services. They utilize warehouse
services to design and evaluate their advertising and promotion campaigns
where they target customers based on their feedback and travel patterns.

● Scientific Analysis: Scientific simulations are generating bulks of data every day. This
includes data collected from nuclear laboratories, data about human psychology, etc.
Data mining techniques are capable of the analysis of these data. Now we can capture
and store more new data faster than we can analyze the old data already accumulated.
● Business Transactions: Every business transaction is memorized for perpetuity. Such
transactions are usually time-related and can be inter-business deals or intra-business
operations.
● Market Basket Analysis: Market Basket Analysis is a technique that gives the careful
study of purchases done by a customer in a supermarket. This concept identifies the
pattern of frequent purchase items by customers.
● Healthcare and Insurance: The pharmaceutical sector can examine its recent sales force
activity and its outcomes to improve the targeting of high-value physicians and figure
out which promotional activities will have the best effect in the upcoming months.
In the insurance sector, data mining can help to predict which customers will buy new
policies, identify behavior patterns of risky customers and identify fraudulent behavior of
customers.
● Financial/Banking Sector: A credit card company can leverage its vast warehouse of
customer transaction data to identify customers most likely to be interested in a new
credit product.
○ Credit card fraud detection.
○ Identify ‘Loyal’ customers.
○ Extraction of information related to customers.
○ Determine credit card spending by customer groups.
4) A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of
data – Justify.
Ans:

5) Give differences between OLAP and OLTP

Ans:
6) Explain various OLAP operations
Ans:
7) Differentiate Fact table vs. Dimension table

Ans:

8) Define the term “data mining”. Discuss the major issues in data mining

Ans:

● Data mining is processing data to identify patterns and establish relationships.


● Data mining is the process of analyzing large amounts of data stored in a data
warehouse for useful information which makes use of artificial intelligence techniques,
neural networks, and advanced statistical tools (such as cluster analysis) to reveal
trends, patterns and relationships, which otherwise may be undetected.

Major issues in Data Mining :


9) In real-world data, tuples with missing values for some attributes are a common
occurrence. Describe various methods for handling this problem.

Ans:

● One method for handling this problem is filling in the missing values manually. This
method handles tuples with missing values by manually filling in each and every empty
data cell. This cannot be considered an efficient approach as it is time-consuming and does
not work for large databases. Another method for handling this problem is to fill in the
missing value with a constant. This method handles tuples with missing values by replacing
all empty data cells with a global constant that does not affect the actual meaning of the
data or any subsequent analysis.
● Another method is to ignore the tuple. This is done when a value (often the class label) is
missing. This method is not effective when the tuple contains several attributes with
missing values.
● Another method is to use the attribute mean for numeric values or the attribute mode for
categorical values, and use this value to replace any missing values. Finally, another
method is to use the most probable value to fill in the missing value.
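A minimal pandas sketch of these methods on a small hypothetical table (column names are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 35], "city": ["Pune", "Mumbai", None, "Pune"]})

dropped   = df.dropna()                                  # ignore tuples with missing values
constant  = df.fillna({"age": -1, "city": "Unknown"})    # fill with a global constant
mean_mode = df.fillna({"age": df["age"].mean(),          # attribute mean for numeric,
                       "city": df["city"].mode()[0]})    # attribute mode for categorical
print(mean_mode)
```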
10) Explain the following data normalization techniques: (i) min-max normalization and (ii)
decimal scaling.

Ans:
11) Describe various methods for handling missing data values.

Ans:
12) What are the limitations of the Apriori approach for mining? Briefly describe the
techniques to improve the efficiency of Apriori algorithm

Ans:

LIMITATIONS OF APRIORI ALGORITHM

The main limitation is the costly waste of time in holding a vast number of candidate sets when
there are many frequent itemsets, low minimum support, or large itemsets.
13) What is market basket analysis? Explain the two measures of rule interestingness: support and
confidence with suitable example.
Ans:
14) Explain measures for finding rule interestingness (support, confidence) with
example.
Ans: refer 13
15) Compare association and classification. Briefly explain associative classification with suitable
example.
Ans:
● Classification rule mining aims to discover a small set of rules in the database that forms an
accurate classifier.
● Association rule mining finds all the rules existing in the database that satisfy some minimum
support and minimum confidence constraints.
● Associative classification (AC) is a mining technique that integrates classification and association
rule mining to perform classification on unseen data instances. AC is one of the effective
classification techniques that applies the generated rules to perform classification
16) What is an attribute selection measure? Explain different attribute selection measures
with example.

Ans:
17) Do feature wise comparison between classification and prediction.
Ans:

18) Explain Linear regression with example.

Ans:

● Linear regression is used to predict the relationship between two variables by applying a
linear equation to observed data.
● There are two types of variable, one variable is called an independent variable, and the
other is a dependent variable.
● Linear regression is commonly used for predictive analysis.
● Used to explain the relationship between one independent variable and one dependent
variable.
● Equation: y = a + b*x

y = dependent variable

a = constant (intercept)

b = regression coefficient (slope)

x = independent variable
● Example: The weight of the person is linearly related to their height. So, this shows a
linear relationship between the height and weight of the person. According to this, as
we increase the height, the weight of the person will also increase. It is not necessary
that one variable is dependent on others, or one causes the other, but there is some
critical relationship between the two variables. In such cases, we use a scatter plot to
simplify the strength of the relationship between the variables. If there is no relation or
linking between the variables then the scatter plot does not indicate any increasing or
decreasing pattern. In such cases, the linear regression design is not beneficial to the
given data.
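A minimal Python sketch of fitting such a line by least squares for the height/weight example; the numbers are illustrative only.

```python
# Least-squares fit of y = a + b*x for illustrative height (cm) / weight (kg) data.
heights = [150, 160, 170, 180, 190]      # independent variable x
weights = [55, 62, 68, 76, 83]           # dependent variable y

n = len(heights)
mean_x = sum(heights) / n
mean_y = sum(weights) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(heights, weights)) / \
    sum((x - mean_x) ** 2 for x in heights)     # regression coefficient (slope)
a = mean_y - b * mean_x                         # constant (intercept)

print(f"y = {a:.2f} + {b:.2f} * x")
print("predicted weight at 175 cm:", a + b * 175)
```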
19) Explain data mining application for fraud detection.

Ans : https://www.heavy.ai/technical-glossary/fraud-detection-and-prevention

20) Discuss applications of data mining in Banking and Finance.

Ans: Refer Q3

21) How does the K-Means clustering method differ from the K-Medoids clustering method?

Ans:

22) How is the FP-tree approach better than the Apriori algorithm? Justify.

Ans:
23) Define information gain, entropy, gini index

Ans:

Information Gain:

● A measure of how much information a feature provides about a class


● It helps to determine the order of attributes in the nodes of decision tree
● Gain = Entropy(parent) − weighted average Entropy(children)

Entropy:

● It is the information theory metric that measures the impurity or uncertainty in a group
of observations. It determines how a decision tree chooses to split data.

Gini Index:

● The Gini index or Gini impurity measures the degree or probability of a particular variable
being wrongly classified when it is randomly chosen.
● It calculates the probability of a specific feature being classified incorrectly when selected
randomly.
● It is calculated by subtracting the sum of squared probabilities of each class from one. It
favors larger partitions and is easy to implement, whereas information gain favors smaller
partitions with distinct values.

● A feature with a lower Gini index is chosen for a split.


● The classic CART algorithm uses the Gini Index for constructing the decision tree.
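A minimal Python sketch computing these three measures is given below; the parent distribution (9 "yes" / 5 "no") and the split are illustrative assumptions.

```python
from math import log2

def entropy(labels):
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * log2(p) for p in probs if p > 0)

def gini(labels):
    total = len(labels)
    return 1 - sum((labels.count(c) / total) ** 2 for c in set(labels))

parent = ["yes"] * 9 + ["no"] * 5                                  # class distribution
children = [["yes"] * 6 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3]    # a hypothetical split

# Information gain = entropy(parent) - weighted average entropy of the children.
gain = entropy(parent) - sum(len(c) / len(parent) * entropy(c) for c in children)
print("entropy:", round(entropy(parent), 3))   # ~0.940
print("gini:   ", round(gini(parent), 3))      # ~0.459
print("gain:   ", round(gain, 3))
```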
