
1. What is true about data mining?


1. Data Mining is defined as the procedure of extracting information from huge sets
of data
2. Data mining also involves other processes such as Data Cleaning, Data
Integration, Data Transformation
3. Data mining is the procedure of mining knowledge from data.
4. All of the above
Answer: All of the above

2. How many categories of functions are involved in Data Mining?


1. 1
2. 2
3. 3
4. 4
Answer: 2

3. The mapping or classification of a class with some predefined group or class is known as?
1. Data Characterization
2. Data Discrimination
3. Data Set
4. Data Sub Structure
Answer: Data Discrimination

4. The analysis performed to uncover interesting statistical correlations between associated-attribute-value pairs is called?
1. Mining of Association
2. Mining of Clusters
3. Mining of Correlations
4. None of the above
Answer: Mining of Correlations

5. ______ may be defined as the data objects that do not comply with the general
behavior or model of the data available.
1. Outlier Analysis
2. Evolution Analysis
3. Prediction
4. Classification
Answer: Outlier Analysis

6. “Efficiency and scalability of data mining algorithms” issues come under?


A. Mining Methodology and User Interaction Issues
B. Performance Issues
C. Diverse Data Types Issues
D. None of the above

Answer: Performance Issues

7. To integrate heterogeneous databases, how many approaches are there in Data
Warehousing?
1. 1
2. 2
3. 3
4. 4
Answer: 2

8. Which of the following is a correct advantage of the Update-Driven Approach in Data Warehousing?
A. This approach provides high performance.
B. The data can be copied, processed, integrated, annotated, summarized and
restructured in the semantic data store in advance.
C. Both A and B
D. None Of the above

Answer: Both A and B

9. What is the use of data cleaning?


A. To remove noisy data
B. To correct inconsistencies in data
C. To apply transformations that correct wrong data
D. All of the above

Answer: All of the above
10. Data Mining System Classification consists of?
A. Database Technology
B. Machine Learning
C. Information Science
D. All of the above

Answer: All of the above

Data mining and warehousing mcq sppu


11. Which of the following is a good alternative to the star schema?
1. snow flake schema
2. star schema
3. star snow flake schema
4. fact constellation
Answer: fact constellation

12. Patterns that can be discovered from a given database are of which type?
1. More than one type
2. Multiple type always
3. One type only
4. No specific type
Answer: More than one type

13. Background knowledge is…


1. It is a form of automatic learning.
2. A neural network that makes use of a hidden layer
3. The additional acquaintance used by a learning algorithm to facilitate the learning
process
4. None of these
Answer: The additional acquaintance used by a learning algorithm to facilitate the learning process

14. Which of the following is true for Classification?


1. subdivision of a set
2. A measure of the accuracy
3. The task of assigning a classification
4. All of these
Answer: subdivision of a set

15. Data mining is?
1. time variant non-volatile collection of data
2. The actual discovery phase of a knowledge discovery process
3. The stage of selecting the right data
4. None of these
Answer: The actual discovery phase of a knowledge discovery process

16. ——- is not a data mining functionality?


A) Clustering and Analysis
B) Selection and interpretation
C) Classification and regression
D) Characterization and Discrimination

Answer: Selection and interpretation

17. Data mining can also be applied to which of the following other forms of data?


a) Data streams & Sequence data
b) Networked data
c) Text & Spatial data
d) All of these

Answer: All of these

18. ——– is the output of KDD


a) Query
b) Useful Information
c) Data
d) information

Answer: Useful Information

19. What is noise?
a) component of a network
b) context of KDD and data mining
c) aspects of a data warehouse
d) None of these

Answer: context of KDD and data mining

20. Firms that are engaged in sentiment mining are analyzing data collected from?
A. social media sites.
B. in-depth interviews.
C. focus groups.
D. experiments.

Answer: social media sites.

21. Which of the following forms of data mining assigns records to one of a
predefined set of classes?
(A). Classification
(B). Clustering
(C). Both A and B
(D). None

Answer: Classification (classification assigns records to predefined classes; clustering discovers the groups itself)

22. The learning which is used to find the hidden pattern in unlabeled data is
called?
(A). Unsupervised learning
(B). Supervised learning
(C). Reinforcement learning

Answer: Unsupervised learning

23. Self-organizing maps are an example of which type of learning?


(A). Reinforcement learning
(B). Supervised learning
(C). Unsupervised learning
(D). Missing data imputation

Answer: Unsupervised learning

24. In the example of predicting the number of babies from the storks’ population size, the total number of babies is the ______.
(A). feature
(B). outcome
(C). attribute
(D). observation

Answer: outcome

25. Which of the following does not belong to data mining?


(A). Knowledge extraction
(B). Data transformation
(C). Data exploration
(D). Data archaeology

Answer: Data archaeology

26. The learning which is used for inferring a model from labeled training data is
called?
(A). Unsupervised learning
(B). Reinforcement learning
(C). Supervised learning
(D). Missing data imputation

Answer: Supervised learning

27. Which of the following is the right approach to Data Mining?


(A). Infrastructure, exploration, analysis, exploitation, interpretation
(B). Infrastructure, exploration, analysis, interpretation, exploitation
(C). Infrastructure, analysis, exploration, interpretation, exploitation
(D). None of these

Answer: Infrastructure, exploration, analysis, interpretation, exploitation

28. Which of the following terms is used as a synonym for data mining?


(A). knowledge discovery in databases
(B). data warehousing
(C). regression analysis
(D). parallel processing in databases

Answer: knowledge discovery in databases

29. ………………….. is an essential process where intelligent methods are applied to extract data patterns
A) Data Warehousing
B) Data Mining
C) Data Base
D) Data Structure

Answer: Data Mining

30. Data mining requires


1. Large quantities of operational data stored over a period of time
2. Lots of tactical data
3. Several tape drives to store archival data
4. Large mainframe computers
Answer: Large quantities of operational data stored over a period of time

31. Data by itself is not useful unless
1. It is massive
2. It is processed to obtain information
3. It is collected as a raw data from diverse sources
4. It is properly stated
Answer: It is processed to obtain information

32. Which of the following is NOT an example of ordinal attributes?


1. Zip codes
2. Ordered numbers
3. Ascending or descending names
4. Military ranks
Answer: Zip codes

33. In asymmetric attribute


1. Order of values is important
2. All values are equals
3. Only non-zero value is important
4. Range of values is important
Answer: Only non-zero value is important
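
For asymmetric attributes, presence (a non-zero value) is what carries information, which is why measures such as the Jaccard coefficient ignore 0-0 matches. A minimal Python sketch (the data and function name are illustrative):

```python
def jaccard_similarity(a, b):
    """Jaccard coefficient for asymmetric binary vectors.

    Counts only positions where at least one vector is 1;
    0-0 matches carry no information and are ignored.
    """
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either if either else 0.0

# The two trailing 0-0 positions do not affect the score:
print(jaccard_similarity([1, 1, 0, 0, 0], [1, 0, 1, 0, 0]))  # 1 shared out of 3 non-zero positions -> 1/3
```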

34. Identify the example of Nominal attribute


1. Temperature
2. Mass
3. Salary
4. Gender
Answer: Gender

35. Which of the following is not a data pre-processing method?


1. Data Visualization
2. Data Discretization
3. Data Cleaning
4. Data Reduction
Answer: Data Visualization

36. Correlation analysis is used for __


1. Handling missing values
2. Identifying redundant attributes
3. Handling different data formats
4. Eliminating noise
Answer: Identifying redundant attributes
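
As a sketch of how correlation analysis flags redundant attributes, the Pearson coefficient between two attributes can be computed directly; a value near ±1 suggests one attribute is redundant (illustrative data, plain Python):

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

temp_c = [10, 15, 20, 25, 30]
temp_f = [50, 59, 68, 77, 86]   # the same attribute in different units
print(pearson_correlation(temp_c, temp_f))  # close to 1.0 -> redundant attribute
```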

37. ______combines data from multiple sources into a coherent store


1. Data Characterization
2. Data Classification
3. Data Integration
4. Data Selection
Answer: Data Integration

38. Which of the following is / are attribute subset selection criterion(s) ?


1. Forward selection
2. Backward elimination
3. Decision tree induction
4. All of the above
Answer: All of the above

39. Data mining can also be applied to other forms such as …………….
i) Data streams
ii) Sequence data
iii) Networked data
iv) Text data
v) Spatial data
A)i, ii, iii and v only
B) ii, iii, iv and v only
C) i, iii, iv and v only
D) All i, ii, iii, iv and v

Answer: All i, ii, iii, iv and v

40. ____ normalization does not handle outliers well


1. Min max
2. Z Score
3. Decimal Scaling
4. None of the above
Answer: Min max
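
A quick sketch of why min-max normalization handles outliers poorly: a single extreme value dominates the (max - min) range and squashes all ordinary values toward 0, while z-score normalization degrades more gracefully (illustrative data):

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: rescales to [new_min, new_max].

    A single extreme outlier dominates (max - min), squashing the rest.
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Z-score normalization: centres on the mean, scales by std dev."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

data = [10, 12, 11, 13, 1000]     # 1000 is an outlier
print(min_max(data))              # the four ordinary values are crushed near 0
print(z_score(data))
```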

41. The full form of KDD is………………
A) Knowledge Database
B) Knowledge Discovery Database
C) Knowledge Data House
D) Knowledge Data Definition

Answer: Knowledge Discovery Database

42. A collection of interesting and useful patterns in a database is called ___
A. knowledge.
B. information.
C. data.
D. algorithm

Answer: knowledge.

43. Data ………………. is the process of finding a model that describes and
distinguishes data classes or concepts.
a) Characterization
b) Mining
c) Clustering
d) Classification
Answer: Classification

44. To remove noise and inconsistent data __ is needed


1. Data Transformation
2. Data Reduction
3. Data Integration
4. Data Cleaning
Answer: Data Cleaning

45. The terms equality and roll up are associated with _


1. OLTP
2. Visualization
3. Data mart
4. Decision Tree
Answer: Data mart

46. An operational system is which of the following?


A. A system that is used to run the business in real time and is based on historical data.
B. A system that is used to run the business in real time and is based on current
data.
C. A system that is used to support decision making and is based on current data.
D. A system that is used to support decision making and is based on historical data.

Answer: A system that is used to run the business in real time and is based on current data.
47. Data warehouse is which of the following?
A. Can be updated by end users.
B. Contains numerous naming conventions and formats.
C. Organized around important subject areas.
D. Contains only current data.

Answer: Organized around important subject areas.

48. Data transformation includes which of the following?


A. A process to change data from a detailed level to a summary level
B. A process to change data from a summary level to a detailed level
C. Joining data from one source into various sources of data
D. Separating data from one source into various sources of data

Answer: A process to change data from a detailed level to a summary level

49. The ……………… allows the selection of the relevant information necessary for
the data warehouse.
A top-down view
B data warehouse view
C data source view
D business query view

Answer: A top-down view

50. Which of the following is not a component of a data warehouse?


A Metadata
B Current detail data
C Lightly summarized data
D Component Key

Answer: Component Key

51. Which of the following is not a kind of data warehouse application?


A Information processing
B Analytical processing
C Data mining
D Transaction processing
Answer: Transaction processing

52. ___ is not associated with data cleaning process.


1. Deduplication
2. Domain consistency
3. Segmentation
4. Disambiguation
Answer: Segmentation

53. Dimensionality refers to
1. Cardinality of key values in a star schema
2. The data that describes the transactions in the fact table
3. The level of detail of data that is held in the fact table
4. The level of detail of data that is held in the dimension table
Answer: The data that describes the transactions in the fact table

54. Expansion for DSS in DW is


1. Decisive Strategic System
2. Data Support System
3. Data Store System
4. Decision Support system
Answer: Decision Support system

55. Data in a data warehouse


1. in a flat file format
2. can be normalised but often is not
3. must be in normalised form to at least 3NF
4. must be in normalised form to at least 2NF
Answer: can be normalised but often is not

56. Friendship structure of users in a social networking site can be considered as an example of ____
1. Record data
2. Ordered data
3. Graph data
4. None of the above
Answer: Graph data
57. A café owner wanted to compare how much revenue he gained from lattes
across different months of the year. What type of variable is ‘month’?
1. Continuous
2. Categorical
3. Discrete
4. Nominal
Answer: Categorical

58. An outlier is a _


1. Description of records in the data
2. Data point which is considered different from other data points
3. Record with missing attributes
4. Duplicate record
Answer: Data point which is considered different from other data points
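
Outliers are often flagged by how far they sit from the rest of the data; a minimal z-score sketch (the threshold and data are illustrative):

```python
def find_outliers(values, threshold=2.0):
    """Flag points whose z-score magnitude exceeds the threshold."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) / std > threshold]

# 95 sits far from the other points, so it is flagged:
print(find_outliers([10, 11, 9, 10, 12, 10, 95]))  # [95]
```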

59. Which of the following operations can be performed on ordinal attributes?


1. Distinctness
2. Order
3. Both of the above
4. None of the above
Answer: Both of the above

60. Height of a person can be considered as an attribute of _____ type?


1. Nominal
2. Ordinal
3. Interval
4. Ratio
Answer: Ratio

61. The cosine similarity measure accounts for _


1. The Euclidian distance between vectors
2. The Manhattan distance between vectors
3. The similarity of documents
4. The dissimilarity of vectors
Answer: The similarity of documents
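
Cosine similarity compares the direction of two vectors rather than their magnitude, which is why it suits document (term-frequency) comparison; a minimal sketch with toy vectors:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

doc1 = [3, 0, 2, 1]   # toy term-frequency vectors
doc2 = [3, 0, 2, 1]
doc3 = [0, 4, 0, 0]
print(cosine_similarity(doc1, doc2))  # same direction -> similarity 1
print(cosine_similarity(doc1, doc3))  # no shared terms -> similarity 0
```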

62. The formula for dissimilarity computation between two objects described by categorical variables is ______, where p is the total number of variables and m denotes the number of matches
1. d(i, j) = (p - m) / p
2. d(i, j) = (p - m) / m
3. d(i, j) = (m - p) / p
4. d(i, j) = (m - p) / m
Answer: d(i, j) = (p - m) / p
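
The formula d(i, j) = (p - m) / p can be checked with a small sketch (illustrative objects):

```python
def categorical_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p, where p is the number of categorical
    attributes and m the number of attributes on which i and j match."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

a = ["red", "small", "metal"]
b = ["red", "large", "metal"]
print(categorical_dissimilarity(a, b))  # 1 mismatch out of 3 attributes -> 1/3
```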

63. The Euclidean and Manhattan distances between the objects P (1, 2, 3) and Q (2, 1, 0) are _
1. 3.32, 4 respectively
2. 3.32, 5 respectively
3. 5, 3.32 respectively
4. 3.30, 3 respectively
Answer: 3.32, 5 respectively
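
The answer can be verified directly with a small sketch computing both distances for P (1, 2, 3) and Q (2, 1, 0):

```python
import math

def euclidean(p, q):
    """Straight-line (L2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block (L1) distance."""
    return sum(abs(a - b) for a, b in zip(p, q))

P, Q = (1, 2, 3), (2, 1, 0)
print(round(euclidean(P, Q), 2))  # sqrt(1 + 1 + 9) = sqrt(11), about 3.32
print(manhattan(P, Q))            # 1 + 1 + 3 = 5
```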

64. The main organisational justification for implementing a data warehouse is to provide
1. ETL from operation systems to strategic systems
2. Large scale transaction processing
3. Storing large volumes of data
4. Decision support
Answer: Decision support

65. A data warehouse


a. must import data from transactional systems whenever significant changes occur in the transactional data
b. works on live transactional data to provide up to date and valid results
c. takes regular copies of transaction data
d. takes preprocessed transaction data and stores it in a way that is optimised for analysis

Answer: takes preprocessed transaction data and stores it in a way that is optimised for analysis

66. Data warehouse contains ________ data that is seldom found in the operational environment
1. informational
2. normalized
3. denormalized
4. summary
Answer: summary
67. In a snowflake schema which of the following types of tables is considered?
1. Fact
2. Dimension
3. Both (a) and (b)
4. None of the above
Answer: Both (a) and (b)

68. Which of the following statements about data warehouse is true?


1. A data warehouse is necessary to all those organisations that are using relational
OLTP
2. A data warehouse is useful to all organisations that currently use OLTP
3. A data warehouse is valuable to the organisations that need to keep an audit trail
of their activities
4. A data warehouse is valuable only if the organisation has an interest in analysing
historical data
Answer: A data warehouse is valuable only if the organisation has an interest in analysing historical data

69. When you ____ the data, you are aggregating the data to a higher level
1. Slice
2. Roll Up
3. Roll Down
4. Drill Down
Answer: Roll Up

70. The process of viewing the cross-tab (Single dimensional) with a fixed value of
one attribute is _
1. Slicing
2. Dicing
3. Pivoting
4. Both Slicing and Dicing
Answer: Slicing
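
Roll-up and slicing can be sketched on a toy fact table with plain dictionaries (the schema and numbers are illustrative, not from the text):

```python
from collections import defaultdict

# toy fact table: (year, quarter, region) -> sales
sales = {
    (2023, "Q1", "East"): 100, (2023, "Q2", "East"): 120,
    (2023, "Q1", "West"): 80,  (2023, "Q2", "West"): 90,
}

def roll_up_to_year(facts):
    """Roll-up: aggregate the quarter level away, leaving (year, region) totals."""
    totals = defaultdict(int)
    for (year, _quarter, region), amount in facts.items():
        totals[(year, region)] += amount
    return dict(totals)

def slice_region(facts, region):
    """Slice: fix one dimension (region) at a single value."""
    return {k: v for k, v in facts.items() if k[2] == region}

print(roll_up_to_year(sales))        # {(2023, 'East'): 220, (2023, 'West'): 170}
print(slice_region(sales, "East"))   # only the East facts remain
```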

71. What do data warehouses support?


1. OLAP
2. OLTP
3. OLAP and OLTP
4. Operational databases
Answer: OLAP

72. A data cube consists of _


1. Dimensional data
2. Multidimensional data
3. No dimensional data
4. 1 dimensional data
Answer: Multidimensional data

73. Which type of data storage architecture gives fastest performance?


1. ROLAP
2. MOLAP
3. HOLAP
4. DOLAP
Answer: MOLAP

74. Dissimilarity can be defined as __


1. How much certain objects differ from each other
2. How much certain objects are similar to each other
3. Dissimilarities are non-negative numbers d(i, j) that are small when i and j are close to each other and that become large when i and j are very different
4. Both (a) and (c)
Answer: Both (a) and (c)

75. ______ supports basic OLAP operations, including slice and dice, drill-down, roll-up and pivoting
1. Information processing
2. Analytical processing
3. Data processing
4. Transaction processing
Answer: Analytical processing

1) Which of the following refers to the problem of finding abstracted patterns (or
structures) in the unlabeled data?

a. Supervised learning
b. Unsupervised learning
c. Hybrid learning
d. Reinforcement learning

Answer: b

Explanation: Unsupervised learning is a type of machine learning algorithm that is generally used to find hidden structure and patterns in unlabeled data.

2) Which one of the following refers to querying the unstructured textual data?

a. Information access
b. Information update
c. Information retrieval
d. Information manipulation

Answer: c

Explanation: Information retrieval refers to querying unstructured textual data. It can also be understood as the activity (or process) of obtaining, from a huge source of information, the items that are relevant to the information required.

3) Which of the following can be considered as the correct process of Data Mining?


a. Infrastructure, Exploration, Analysis, Interpretation, Exploitation


b. Exploration, Infrastructure, Analysis, Interpretation, Exploitation
c. Exploration, Infrastructure, Interpretation, Analysis, Exploitation
d. Exploration, Infrastructure, Analysis, Exploitation, Interpretation

Answer: a

Explanation: The process of data mining contains many sub-processes in a specific order. The correct order in which the sub-processes of data mining execute is Infrastructure, Exploration, Analysis, Interpretation, and Exploitation.

4) Which of the following is an essential process in which the intelligent methods are
applied to extract data patterns?

a. Warehousing
b. Data Mining
c. Text Mining
d. Data Selection

Answer: b

Explanation: Data mining is a type of process in which several intelligent methods are used to extract meaningful patterns from a huge collection (or set) of data.

5) What is KDD in data mining?

a. Knowledge Discovery Database


b. Knowledge Discovery Data
c. Knowledge Data definition
d. Knowledge data house

Answer: a

Explanation: The term KDD, or Knowledge Discovery Database, refers to a broad process of discovering knowledge in data, and it also emphasizes the high-level applications of specific data mining techniques.

6) The adaptive system management refers to:


a. Science of making machines perform tasks that would require intelligence when performed by humans.
b. A computational procedure that takes some values as input and produces
some values as the output.
c. It uses machine learning techniques, in which programs learn from their past
experience and adapt themself to new conditions or situations.
d. All of the above.

Answer: c

Explanation: Generally, adaptive system management refers to the use of machine learning techniques, in which programs learn from their past experience and adapt themselves to new conditions and events.

7) For what purpose, the analysis tools pre-compute the summaries of the huge amount
of data?

a. In order to maintain consistency


b. For authentication
c. For data access
d. To obtain the queries response

Answer: d

Explanation: Whenever a query is fired, its response should be returned as quickly as possible. So, to obtain fast query responses, the analysis tools pre-compute the summaries of the huge amount of data. To understand it in more detail, consider the following example:

Suppose that to get some information about something, you write a keyword in
Google search. Google's analytical tools will then pre-compute large amounts of data
to provide a quick output related to the keywords you have written.
8) What are the functions of Data Mining?

a. Association and correlation analysis, classification


b. Prediction and characterization
c. Cluster analysis and Evolution analysis
d. All of the above

Answer: d

Explanation: In data mining, there are several functionalities used for performing different types of tasks. The common functionalities used in data mining are cluster analysis, prediction, characterization, and evolution analysis; association and correlation analysis and classification are also important functionalities of data mining.

9) In the following given diagram, which type of clustering is used?

a. Hierarchal
b. Naive Bayes
c. Partitional
d. None of the above

Answer: a

Explanation: In the given diagram, hierarchical clustering is used. Hierarchical clustering categorizes data over a variety of scales by building a cluster tree. So the correct answer is A.

10) Which of the following statements is incorrect about hierarchical clustering?

a. Hierarchical clustering is also known as HCA
b. The choice of an appropriate metric can influence the shape of the clusters
c. In general, the splits and merges both are determined in a greedy manner
d. All of the above

Answer: d

Explanation: The statements given in options A, B, and C are all correct facts about hierarchical clustering, so no single one of them is the incorrect statement; hence the intended answer is D.

11) Which one of the following can be considered as the final output of the hierarchal
type of clustering?

a. A tree which displays how close the things are to each other
b. Assignment of each point to clusters
c. Finalize estimation of cluster centroids
d. None of the above

Answer: a

Explanation: The final output of hierarchical clustering is a tree that displays how close the things are to each other; hierarchical clustering is often referred to as the agglomerative approach.
12) Which one of the following statements about the K-means clustering is incorrect?

a. The goal of the k-means clustering is to partition (n) observation into (k)
clusters
b. K-means clustering can be defined as the method of quantization
c. The nearest neighbor is the same as the K-means
d. All of the above

Answer: c

Explanation: K-means clustering and the nearest-neighbor algorithm are unrelated techniques; the nearest neighbor is not the same as K-means.
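
To make the distinction concrete, here is a tiny 1-D k-means sketch: unlike nearest-neighbor classification, it uses no labels and iteratively re-estimates centroids (the data and starting centroids are illustrative):

```python
def kmeans_1d(points, centroids, iterations=10):
    """Tiny 1-D k-means sketch: assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    for _ in range(iterations):
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # recompute each centroid; keep the old one if its cluster is empty
        centroids = [sum(ps) / len(ps) if ps else c
                     for c, ps in clusters.items()]
    return sorted(centroids)

print(kmeans_1d([1, 2, 3, 10, 11, 12], centroids=[0.0, 5.0]))  # -> [2.0, 11.0]
```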

13) Which of the following statements about hierarchical clustering is incorrect?

a. The hierarchal clustering can primarily be used for the aim of exploration
b. The hierarchal clustering should not be primarily used for the aim of
exploration
c. Both A and B
d. None of the above

Answer: b

Explanation: The hierarchical clustering technique can be used for exploration because it is a deterministic clustering technique; therefore, the statement that it should not be primarily used for exploration is the incorrect one.

14) Which one of the clustering techniques requires a merging approach?

a. Partitioned
b. Naïve Bayes
c. Hierarchical
d. Both A and C

Answer: c

Explanation: Hierarchical clustering is one of the most commonly used methods to analyze social network data. In this clustering method, multiple nodes are compared with each other on the basis of their similarities, and several larger groups are formed by merging the nodes or groups of nodes that have similar characteristics.
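
The merging idea can be sketched in a few lines: start with every point in its own cluster and repeatedly merge the closest pair of clusters (1-D toy data, illustrative):

```python
def agglomerative_1d(points, k):
    """Merge-based (agglomerative) clustering sketch for 1-D data:
    start with one cluster per point, then repeatedly merge the two
    closest adjacent clusters (by centroid distance) until k remain."""
    clusters = [[p] for p in sorted(points)]
    while len(clusters) > k:
        centroid = lambda c: sum(c) / len(c)
        # find the adjacent pair of clusters whose centroids are closest
        i = min(range(len(clusters) - 1),
                key=lambda j: centroid(clusters[j + 1]) - centroid(clusters[j]))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters

print(agglomerative_1d([1, 2, 8, 9, 25], k=3))  # [[1, 2], [8, 9], [25]]
```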

15) The self-organizing maps can also be considered as the instance of _________ type of
learning.

a. Supervised learning
b. Unsupervised learning
c. Missing data imputation
d. Both A & C

Answer: b

Explanation: The Self Organizing Map (SOM), or the Self Organizing Feature Map is
a kind of Artificial Neural Network which is trained through unsupervised learning.

16) The following given statement can be considered as an example of _________

Suppose one wants to predict the number of newborns according to the size of
storks' population by performing supervised learning

a. Structural equation modeling


b. Clustering
c. Regression
d. Classification

Answer: c

Explanation: The above-given statement can be considered as an example of regression. Therefore the correct answer is C.

17) In the example predicting the number of newborns, the final number of total
newborns can be considered as the _________

a. Features
b. Observation
c. Attribute
d. Outcome

Answer: d

Explanation: In the example of predicting the total number of newborns, the predicted result is represented by the outcome; therefore, the total number of newborns is addressed by the outcome.

18) Which of the following statements is true about the classification?

a. It is a measure of accuracy
b. It is a subdivision of a set
c. It is the task of assigning a classification
d. None of the above

Answer: b

Explanation: The term "classification" refers to the division of the given data into certain sub-classes or groups according to their similarities, or on the basis of a specific given set of rules.

19) Which of the following statements is correct about data mining?


a. It can be referred to as the procedure of mining knowledge from data
b. Data mining can be defined as the procedure of extracting information from a
set of the data
c. The procedure of data mining also involves several other processes like data
cleaning, data transformation, and data integration
d. All of the above

Answer: d

Explanation: The term data mining can be defined as the process of extracting information from a massive collection of data. In other words, we can also say that data mining is the procedure of mining useful knowledge from a huge set of data.

20) In data mining, how many categories of functions are included?

a. 5
b. 4
c. 2
d. 3

Answer: c

Explanation: There are only two categories of functions included in data mining: descriptive functions and classification/prediction functions. Therefore the correct answer is C.

21) Which of the following can be considered as the classification or mapping of a set or
class with some predefined group or classes?

a. Data set
b. Data Characterization
c. Data Sub Structure
d. Data Discrimination
Answer: d

Explanation: Data discrimination refers to the mapping (or classification) of a class with some predefined groups or classes. So the correct answer is D.

22) The analysis performed to uncover the interesting statistical correlations between associated-attribute-value pairs is known as the _______.

a. Mining of association
b. Mining of correlation
c. Mining of clusters
d. All of the above

Answer: b

Explanation: Mining of correlations refers to the additional analysis performed to uncover interesting statistical correlations between associated-attribute-value pairs.

23) Which one of the following can be defined as the data object which does not comply
with the general behavior (or the model of available data)?

a. Evaluation Analysis
b. Outlier Analysis
c. Classification
d. Prediction

Answer: b

Explanation: An outlier may be defined as a data object that does not comply with the general behavior or with the model of the available data.
24) Which one of the following statements is not correct about the data cleaning?

a. It refers to the process of data cleaning


b. It refers to the transformation of wrong data into correct data
c. It refers to correcting inconsistent data
d. All of the above

Answer: d

Explanation: Data cleaning is a process applied to a data set to remove noise (or noisy data) and inconsistent data. It also involves transformation, in which wrong data is transformed into correct data. In other words, data cleaning is a kind of pre-processing in which the given set of data is prepared for the data warehouse.

25) The classification of the data mining system involves:

a. Database technology
b. Information Science
c. Machine learning
d. All of the above

Answer: d

Explanation: Generally, the classification of a data mining system depends on the following criteria: database technology, machine learning, visualization, information science, and several other disciplines.

26) In order to integrate heterogeneous databases, how many types of approaches are
there in the data warehousing?

a. 3
b. 4
c. 5
d. 2

Answer: d

Explanation: In general, data warehousing consists of data integration, data cleaning, and data consolidation. To integrate heterogeneous databases, there are two approaches: the update-driven approach and the query-driven approach. So the correct answer is D.

27) The issues like efficiency and scalability of data mining algorithms come under _______

a. Performance issues
b. Diverse data type issues
c. Mining methodology and user interaction
d. All of the above

Answer: a

Explanation: In order to extract information effectively from a huge collection of data in databases, the data mining algorithm must be efficient and scalable. Therefore the correct answer is A.

28) Which of the following is the correct advantage of the Update-Driven Approach?

a. This approach provides high performance.


b. The data can be copied, processed, integrated, annotated, summarized and
restructured in the semantic data store in advance.
c. Both A and B
d. None of the above

Answer: c
Explanation: The statements given in both A and B are the advantage of the
Update-Driven Approach in Data Warehousing. So the correct answer is C.

29) Which of the following statements about the query tools is correct?

a. Tools developed to query the database


b. Attributes of a database table that can take only numerical values
c. Both A and B
d. None of the above

Answer: a

Explanation: The query tools are used to query the database. Or we can also say that
these tools are generally used to get only the necessary information from the entire
database.

30) Which one of the following correctly defines the term cluster?

a. Group of similar objects that differ significantly from other objects


b. Symbolic representation of facts or ideas from which information can
potentially be extracted
c. Operations on a database to transform or simplify data in order to prepare it
for a machine-learning algorithm
d. All of the above

Answer: a

Explanation: The term "cluster" refers to a set of similar objects or items that differ significantly from the other available objects. In other words, clustering can be understood as forming groups of objects with similar characteristics from all available objects. Therefore the correct answer is A.

31) Which one of the following refers to the binary attribute?


a. This takes only two values. In general, these values will be 0 and 1, and they
can be coded as one bit
b. The natural environment of a certain species
c. Systems that can be used without knowledge of internal operations
d. All of the above

Hide Answer Workspace

Answer: a

Explanation: In general, a binary attribute takes only two values, 0 and 1, and these
values can be coded as one bit. So the correct answer will be A.

32) Which of the following correctly refers the data selection?

a. A subject-oriented integrated time-variant non-volatile collection of data in


support of management
b. The actual discovery phase of a knowledge discovery process
c. The stage of selecting the right data for a KDD process
d. All of the above

Hide Answer Workspace

Answer: c

Explanation: Data selection can be defined as the stage in which the correct data is
selected for a knowledge discovery (KDD) process. Therefore
the correct answer is C.

33) Which one of the following correctly refers to the task of the classification?

a. A measure of the accuracy, of the classification of a concept that is given by a


certain theory
b. The task of assigning a classification to a set of examples
c. A subdivision of a set of examples into a number of classes
d. None of the above
Hide Answer Workspace

Answer: b

Explanation: Classification refers to the task of assigning each example in a set to one
of a number of predefined classes. Therefore the correct answer is B.

34) Which of the following correctly defines the term "Hybrid"?

a. Approach to the design of learning algorithms that is structured along the


lines of the theory of evolution.
b. Decision support systems that contain an information base filled with the
knowledge of an expert formulated in terms of if-then rules.
c. Combining different types of method or information
d. None of these

Hide Answer Workspace

Answer: c

Explanation: The term "hybrid" refers to merging two objects and forms individual
object that contains features of the combined objects.

35) Which of the following correctly defines the term "Discovery"?

a. It is hidden within a database and can only be recovered if one is given certain
clues (an example is encrypted information).
b. An extremely complex molecule that occurs in human chromosomes and that
carries genetic information in the form of genes.
c. It is a kind of process of extracting implicit, previously unknown and
potentially useful information from data
d. None of the above

Hide Answer Workspace

Answer: c
Explanation: The term "discovery" means to discover something new that has not
yet been discovered. It can also be interpreted as a process of extracting implicit,
previously unknown and potentially useful information from data.

36) The Euclidean distance measure can also be defined as ___________

a. The process of finding a solution for a problem simply by enumerating all


possible solutions according to some predefined order and then testing them
b. The distance between two points as calculated using the Pythagoras theorem
c. A stage of the KDD process in which new data is added to the existing
selection.
d. All of the above

Hide Answer Workspace

Answer: b

Explanation: The Euclidean distance measure can be defined as the distance between
two points, in a plane or in three-dimensional space, given by the length of the
segment connecting them and calculated using the Pythagoras theorem. Therefore the
correct answer is B.
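As a sketch, the Pythagorean calculation generalizes directly to any number of dimensions; the function name and sample points below are illustrative only.

```python
import math

def euclidean(p, q):
    """Length of the segment joining p and q, via the Pythagorean theorem."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# in the plane: the classic 3-4-5 right triangle
d2 = euclidean((0, 0), (3, 4))
# in three-dimensional space
d3 = euclidean((0, 0, 0), (1, 2, 2))
```

The two-dimensional call gives 5.0 and the three-dimensional call gives 3.0.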

37) Which one of the following can be considered as the correct application of the data
mining?

a. Fraud detection
b. Corporate Analysis & Risk management
c. Management and market analysis
d. All of the above

Hide Answer Workspace

Answer: d

Explanation: Data mining is highly useful in a variety of areas such as fraud
detection, corporate analysis and risk management, and market analysis, so the
correct option is D.
38) Which one of the following correctly refers to the class under study in data
characterization?

a. Final class
b. Study class
c. Target class
d. Both A and C

Hide Answer Workspace

Answer: c

Explanation: In data characterization, the class under study is called the target
class; it is the class whose data is being summarized.

39) Which of the following refers to the sequence of pattern that occurs frequently?

a. Frequent sub-sequence
b. Frequent sub-structure
c. Frequent sub-items
d. All of the above

Hide Answer Workspace

Answer: a

Explanation: In data mining, a frequent sub-sequence refers to a sequence of
patterns that occurs frequently, for example, buying a camera followed by a
memory card. So the correct answer will be A.
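A minimal sketch of counting how often such a sub-sequence occurs; the transaction lists are hypothetical and items must appear in order (gaps are allowed).

```python
def contains_subsequence(transaction, pattern):
    """True if the items of `pattern` occur in `transaction` in order (gaps allowed)."""
    it = iter(transaction)
    # membership tests on an iterator consume it, enforcing the order
    return all(item in it for item in pattern)

# hypothetical purchase histories
transactions = [
    ["camera", "case", "memory card"],
    ["camera", "memory card"],
    ["memory card", "camera"],   # wrong order: does not match
]
pattern = ["camera", "memory card"]
support = sum(contains_subsequence(t, pattern) for t in transactions)
```

The pattern is supported by two of the three transactions.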

40) Which one of the following refers to the modeling of regularities or trends for
objects whose behavior changes over time?

a. Prediction
b. Evolution analysis
c. Classification
d. Both A and B
Hide Answer Workspace

Answer: b

Explanation: In general, evolution analysis refers to modeling regularities or trends
for objects whose behavior changes over time.

41) Issues like "handling relational and complex types of data" come under which
of the following categories?

a. Diverse Data Type


b. Mining methodology and user interaction Issues
c. Performance issues
d. All of the above

Hide Answer Workspace

Answer: a

Explanation: A database quite often contains multiple types of data, such as complex
objects, multimedia data, spatial data and temporal data, and no single system can
mine all of these kinds of data. Therefore this type of issue comes under the
Diverse Data Types category. So the correct answer is A.

42) Which of the following is used as the first step in the knowledge discovery
process?

a. Data selection
b. Data cleaning
c. Data transformation
d. Data integration

Hide Answer Workspace

Answer: b

Explanation: Data cleaning is the first step of the knowledge discovery process. So
the correct answer is B.
43) Which of the following refers to the steps of the knowledge discovery process, in
which the several data sources are combined?

a. Data selection
b. Data cleaning
c. Data transformation
d. Data integration

Hide Answer Workspace

Answer: d

Explanation: The "data integration" step of the knowledge discovery process refers
to combining several data sources. Therefore the correct answer is D.
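The idea of combining sources can be sketched with two hypothetical source databases keyed by the same customer id; attributes recorded for the same key are merged into one record.

```python
# hypothetical records from two separate data sources, keyed by customer id
sales_db = {1: {"name": "Ana"}, 2: {"name": "Ben"}}
crm_db = {1: {"city": "Pune"}, 3: {"name": "Cara", "city": "Leeds"}}

integrated = {}
for source in (sales_db, crm_db):
    for key, record in source.items():
        # merge the attributes recorded for the same key across the sources
        integrated.setdefault(key, {}).update(record)
```

Customer 1 ends up with attributes from both sources, while customers 2 and 3 keep the attributes of their single source.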

44) Which of the following can be considered as the drawback of the query-Driven
approach in data warehousing?

a. This approach is expensive for queries that require aggregations


b. This approach is inefficient and very expensive for frequent queries
c. This approach requires a very complex integration and filtering process
d. All of the above

Hide Answer Workspace

Answer: d

Explanation: All statements given in the above question are drawbacks of the query-
driven approach. Therefore the correct answer is D.

45) Which of the following correctly refers to the term "Data Independence"?

a. It means that the programs are not dependent on the logical attributes
b. It refers to that data that is defined separately, not included in the program
c. It means that the programs are not dependent on the physical attributes of
data
d. Both A and C

Hide Answer Workspace

Answer: d

Explanation: The term "Data Independence" means that the programs depend
neither on the physical attributes of data nor on its logical attributes.

46) Which of the following is generally used by the E-R model to represent the weak
entities?

a. Diamond
b. Doubly outlined rectangle
c. Dotted rectangle
d. Both B & C

Hide Answer Workspace

Answer: b

Explanation: Generally, the doubly outlined rectangle is used in the E-R model to
represent weak entities.

47) Which one of the following refers to the Black Box?

a. It can be referred as the system that can be used without the knowledge of
the internal operations
b. It referrers the natural environment of the specific species
c. It takes only two values at most that are 0 and 1
d. All of the above

Hide Answer Workspace

Answer: a

Explanation: A black box refers to a system that can be used without knowledge of
its internal operations. Therefore the correct answer is A.
48) Which one of the following issues must be considered before investing in data
mining?

a. Compatibility
b. Functionality
c. Vendor consideration
d. All of the above

Hide Answer Workspace

Answer: d

Explanation: Common but important issues like functionality, compatibility and
vendor consideration must always be discussed before investing in data mining.
Therefore the correct answer is D.

49) The term "DMQL" stands for _____

a. Data Marts Query Language


b. DBMiner Query Language
c. Data Mining Query Language
d. None of the above

Hide Answer Workspace

Answer: c

Explanation: The term "DMQL" refers to the Data Mining Query Language. Therefore
the correct answer is C.

50) In certain cases where it is not clear what kind of pattern needs to be found,
data mining should _________:

a. Try to perform all possible tasks


b. Perform both predictive and descriptive task
c. It may allow interaction with the user so that he can guide the mining process
d. All of the above

Hide Answer Workspace

Answer: c

Explanation: In some data mining operations it is not clear in advance what kind of
pattern needs to be found, so the user is allowed to guide the mining process,
because a user has a good sense of which type of pattern he wants to find. By
setting up some rules, he can eliminate the discovery of all non-required patterns
and focus the process on finding only the required pattern. Therefore the correct
answer is C.
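User guidance of this kind can be sketched as a constraint the user supplies to the miner; the candidate patterns and the rule below are hypothetical.

```python
def mine_patterns(patterns, is_interesting):
    """Return only the patterns that the user's rule marks as interesting."""
    return [p for p in patterns if is_interesting(p)]

# all candidate patterns the miner could report
candidates = [("bread", "butter"), ("camera", "memory card"), ("pen", "ink")]
# hypothetical user rule: only patterns that mention a camera are wanted
interesting = mine_patterns(candidates, lambda p: "camera" in p)
```

Only the camera-related pattern survives; the non-required patterns are never reported.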

1.

Data scrubbing is which of the following?

A. A process to reject data from the data warehouse and to create the necessary indexes

B. A process to load the data in the data warehouse and to create the necessary indexes

C. A process to upgrade the quality of data after it is moved into a data warehouse

D. A process to upgrade the quality of data before it is moved into a data warehouse


Answer: Option D


2.

The @active data warehouse architecture includes which of the following?

A. At least one data mart


B. Data that can be extracted from numerous internal and external sources

C. Near real-time updates

D. All of the above


Answer: Option D


3.

A goal of data mining includes which of the following?

A. To explain some observed event or condition

B. To confirm that data exists

C. To analyze data for expected relationships

D. To create a new data warehouse


Answer: Option A


4.
An operational system is which of the following?

A. A system that is used to run the business in real time and is based on historical data.

B. A system that is used to run the business in real time and is based on current data.

C. A system that is used to support decision making and is based on current data.

D. A system that is used to support decision making and is based on historical data.


Answer: Option B


5.

A data warehouse is which of the following?

A. Can be updated by end users.

B. Contains numerous naming conventions and formats.

C. Organized around important subject areas.

D. Contains only current data.



Answer: Option C


1. What is true about data mining?


A. Data Mining is defined as the procedure of extracting information from
huge sets of data
B. Data mining also involves other processes such as Data Cleaning, Data
Integration, Data Transformation
C. Data mining is the procedure of mining knowledge from data.
D. All of the above
View Answer
Ans : D

Explanation: Data Mining is defined as extracting information from huge sets of data. In
other words, we can say that data mining is the procedure of mining knowledge from
data, so that the extracted information or knowledge can be used.

2. How many categories of functions involved in Data Mining?


A. 2
B. 3
C. 4
D. 5
View Answer
Ans : A

Explanation: There are two categories of functions involved in Data Mining: 1.
Descriptive, 2. Classification and Prediction

3. The mapping or classification of a class with some predefined group


or class is known as?
A. Data Characterization
B. Data Discrimination
C. Data Set
D. Data Sub Structure
View Answer
Ans : B

Explanation: Data Discrimination: It refers to the mapping or classification of a class
with some predefined group or class.
4. The analysis performed to uncover interesting statistical
correlations between associated-attribute-value pairs is called?
A. Mining of Association
B. Mining of Clusters
C. Mining of Correlations
D. None of the above
View Answer
Ans : C

Explanation: Mining of Correlations: It is a kind of additional analysis performed to
uncover interesting statistical correlations between associated-attribute-value pairs or
between two item sets, to analyze whether they have a positive, negative or no effect
on each other.

5. __________ may be defined as the data objects that do not comply with
the general behavior or model of the data available.
A. Outlier Analysis
B. Evolution Analysis
C. Prediction
D. Classification
View Answer
Ans : A

Explanation: Outlier Analysis : Outliers may be defined as the data objects that do not
comply with the general behavior or model of the data available.
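A minimal sketch of outlier analysis, under the common assumption that points far from the mean (here, more than two standard deviations) do not comply with the model of the data; the sample values are hypothetical.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Flag values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / spread > threshold]

daily_sales = [10, 11, 9, 10, 12, 11, 10, 95]   # 95 does not fit the general behavior
outliers = find_outliers(daily_sales)
```

The value 95 is flagged while the clustered values around 10 are kept.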

6. "Efficiency and scalability of data mining algorithms" issues comes


under?
A. Mining Methodology and User Interaction Issues
B. Performance Issues
C. Diverse Data Types Issues
D. None of the above
View Answer
Ans : B

Explanation: In order to effectively extract information from a huge amount of data in
databases, a data mining algorithm must be efficient and scalable.
7. To integrate heterogeneous databases, how many approaches are
there in Data Warehousing?
A. 2
B. 3
C. 4
D. 5
View Answer
Ans : A

Explanation: Data warehousing involves data cleaning, data integration, and data
consolidations. To integrate heterogeneous databases, we have the following two
approaches : Query Driven Approach, Update Driven Approach

8. Which of the following is correct advantage of Update-Driven


Approach in Data Warehousing?
A. This approach provides high performance.
B. The data can be copied, processed, integrated, annotated, summarized
and restructured in the semantic data store in advance.
C. Both A and B
D. None Of the above
View Answer
Ans : C

Explanation: Both A and B are advantages of the Update-Driven Approach in Data
Warehousing.

9. What is the use of data cleaning?


A. to remove the noisy data
B. correct the inconsistencies in data
C. transformations to correct the wrong data.
D. All of the above
View Answer
Ans : D

Explanation: Data cleaning is a technique that is applied to remove the noisy data and
correct the inconsistencies in data. Data cleaning involves transformations to correct the
wrong data. Data cleaning is performed as a data preprocessing step while preparing
the data for a data warehouse.
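The three operations named above (removing noisy entries, correcting inconsistencies, and transforming wrong values) can be sketched on a hypothetical raw "age" field:

```python
# hypothetical raw "age" field with noise and inconsistencies
raw_ages = [" 25", "31", None, "-4", "forty", "28 "]

cleaned = []
for value in raw_ages:
    if value is None:            # remove missing (noisy) entries
        continue
    value = value.strip()        # correct formatting inconsistencies
    if not value.isdigit():      # drop entries that cannot be valid ages
        continue
    cleaned.append(int(value))   # transform to the proper numeric type
```

After cleaning, only the valid numeric ages remain.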

10. Data Mining System Classification consists of?


A. Database Technology
B. Machine Learning
C. Information Science
D. All of the above
View Answer
Ans : D

Explanation: A data mining system can be classified according to the following criteria :
Database Technology, Statistics, Machine Learning, Information Science, Visualization,
Other Disciplines

11. Which of the following is correct application of data mining?


A. Market Analysis and Management
B. Corporate Analysis & Risk Management
C. Fraud Detection
D. All of the above
View Answer
Ans : D

Explanation: Data mining is highly useful in the following domains : Market Analysis and
Management, Corporate Analysis & Risk Management, Fraud Detection

12. In Data Characterization, class under study is called as?


A. Study Class
B. Initial Class
C. Target Class
D. Final Class
View Answer
Ans : C

Explanation: Data Characterization: This refers to summarizing the data of the class
under study. This class under study is called the Target Class.

13. A sequence of patterns that occur frequently is known as?


A. Frequent Item Set
B. Frequent Subsequence
C. Frequent Sub Structure
D. All of the above
View Answer
Ans : B

Explanation: Frequent Subsequence: A sequence of patterns that occurs frequently,
such as purchasing a camera followed by a memory card.
14. __________ refers to the description and model regularities or trends
for objects whose behavior changes over time.
A. Outlier Analysis
B. Evolution Analysis
C. Prediction
D. Classification
View Answer
Ans : B

Explanation: Evolution Analysis : Evolution analysis refers to the description and model
regularities or trends for objects whose behavior changes over time.

15. Pattern evaluation issue comes under?


A. Mining Methodology and User Interaction Issues
B. Performance Issues
C. Diverse Data Types Issues
D. None of the above
View Answer
Ans : A

Explanation: Pattern evaluation: The patterns discovered should be interesting;
patterns that merely represent common knowledge or lack novelty are not useful.

16. "Handling of relational and complex types of data" issue comes


under?
A. Mining Methodology and User Interaction Issues
B. Performance Issues
C. Diverse Data Types Issues
D. None of the above
View Answer
Ans : C

Explanation: The database may contain complex data objects, multimedia data objects,
spatial data, temporal data etc. It is not possible for one system to mine all these
kinds of data.
17. Which of the following is correct disadvantage of Query-Driven
Approach in Data Warehousing?
A. The Query Driven Approach needs complex integration and filtering
processes.
B. It is very inefficient and very expensive for frequent queries.
C. This approach is expensive for queries that require aggregations.
D. All of the above
View Answer
Ans : D

Explanation: All the statements are disadvantages of the Query-Driven Approach in
Data Warehousing.

18. The first step involved in the knowledge discovery process is?


A. Data Integration
B. Data Selection
C. Data Transformation
D. Data Cleaning
View Answer
Ans : D

Explanation: The first step involved in knowledge discovery is Data Cleaning.

19. In which step of Knowledge Discovery, multiple data sources are


combined?
A. Data Cleaning
B. Data Integration
C. Data Selection
D. Data Transformation
View Answer
Ans : B

Explanation: Data Integration : multiple data sources are combined.

20. DMQL stands for?


A. Data Mining Query Language
B. Dataset Mining Query Language
C. DBMiner Query Language
D. Data Marts Query Language
View Answer
Ans : A

Explanation: The Data Mining Query Language (DMQL) was proposed by Han, Fu,
Wang, et al. for the DBMiner data mining system.

1: Which of the following is applied on a warehouse?


a) write only
b) read only
c) both a & b
d) none of these
Answer - Click Here:
B
2: Data can be stored, retrieved and updated in …
a) SMTOP
b) OLTP
c) FTP
d) OLAP
Answer - Click Here:
B
3: Which of the following is a good alternative to the star schema?
a) snow flake schema
b) star schema
c) star snow flake schema
d) fact constellation
Answer - Click Here:
D
4: Patterns that can be discovered from a given database are of which type?
a) More than one type
b) Multiple type always
c) One type only
d) No specific type
Answer - Click Here:
A
5: Background knowledge is…
a) It is a form of automatic learning.
b) A neural network that makes use of a hidden layer
c) The additional acquaintance used by a learning algorithm to facilitate the learning
process
d) None of these
Answer - Click Here:
C
6: Which of the following is true for Classification?
a) A subdivision of a set
b) A measure of the accuracy
c) The task of assigning a classification
d) All of these
Answer - Click Here:
A
7: Data mining is?
a) time variant non-volatile collection of data
b) The actual discovery phase of a knowledge
c) The stage of selecting the right data
d) None of these
Answer - Click Here:
B
8: ______ is not a data mining functionality?
A) Clustering and Analysis
B) Selection and interpretation
C) Classification and regression
D) Characterization and Discrimination
Answer - Click Here:
B
9: Which of the following can also be applied to other forms?
a) Data streams & Sequence data
b) Networked data
c) Text & Spatial data
d) All of these
Answer - Click Here:
D
10: Which of the following gives the general characteristics or features of a target
class of data?
a) Data selection
b) Data discrimination
c) Data Classification
d) Data Characterization
Answer - Click Here:
D
11: ______ is the output of KDD…
a) Query
b) Useful Information
c) Data
d) information
Answer - Click Here:
B
12: What is noise?
a) component of a network
b) context of KDD and data mining
c) aspects of a data warehouse
d) None of these
Answer - Click Here:
B
Data mining is a tool for allowing users to find the hidden relationships in data.
True/False
Answer - Click Here:
True
Firms that are engaged in sentiment mining are analyzing data collected from?
A. social media sites.
B. in-depth interviews.
C. focus groups.
D. experiments.
E. observations.
Answer - Click Here:
A. social media sites.
Which of the following forms of data mining assigns records to one of a predefined
set of classes?
(A). Classification
(B). Clustering
(C). Both A and B
(D). None
Answer - Click Here:
(A). Classification
What are the two main objectives associated with data mining?
Answer - Click Here:
To find hidden patterns and trends are the two main objectives associated with data
mining.
The learning which is used to find the hidden pattern in unlabeled data is
called?
(A). Unsupervised learning
(B). Supervised learning
(C). Reinforcement learning
Answer : (A). Unsupervised learning
Select the correct statement about the Adaptive system management.
(A). Science of making machines performs tasks that would require intelligence
when performed by humans
(B). Takes some value as input and shows some values as an output.
(C). Both a and b
(D). Gets the benefit from machine learning: the program performs the process of
learning by past experience, and this experience helps it adapt to new problems
Answer: (D). Gets the benefit from machine learning: the program performs the
process of learning by past experience, and this experience helps it adapt to new
problems
Self-organizing maps are an example of which type of learning?
(A). Reinforcement learning
(B). Supervised learning
(C). Unsupervised learning
(D). Missing data imputation
Ans: (C). Unsupervised learning
In the example of predicting the number of babies from the storks' population size,
the total number of babies is the:
(A). feature
(B). outcome
(C). attribute
(D). observation
Answer: (B). outcome
Which of the following does not belong to data mining?
(A). Knowledge extraction
(B). Data transformation
(C). Data exploration
(D). Data archaeology
Answer: (B). Data transformation
The learning which is used for inferring a model from labeled training data is
called?
(A). Unsupervised learning
(B). Reinforcement learning
(C). Supervised learning
Answer : (C). Supervised learning
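Supervised learning can be sketched with a tiny 1-nearest-neighbour classifier over hypothetical labelled training data: the model is inferred entirely from (feature, label) pairs.

```python
def predict(labelled_data, x):
    """1-nearest-neighbour: return the label of the closest training example."""
    _, label = min(labelled_data, key=lambda pair: abs(pair[0] - x))
    return label

# labelled training data: (feature value, class label)
train = [(1.0, "small"), (2.0, "small"), (8.0, "large"), (9.0, "large")]
pred = predict(train, 7.5)
```

The query value 7.5 is closest to the training example 8.0, so it inherits the label "large".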
Which of the following is the right approach to Data Mining?
(A). Infrastructure, exploration, analysis, exploitation, interpretation
(B). Infrastructure, exploration, analysis, interpretation, exploitation
(C). Infrastructure, analysis, exploration, interpretation, exploitation
(D). None of these
Answer: (B). Infrastructure, exploration, analysis, interpretation, exploitation
Which of the following terms is used as a synonym for data mining?
(A). knowledge discovery in databases
(B). data warehousing
(C). regression analysis
(D). parallel processing in databases
Answer: (A)
Which of the following conditions is essential for data mining to work?
1. …………………. is an essential process where intelligent methods are applied to extract data
patterns.
A) Data warehousing
B) Data mining
C) Text mining
D) Data selection

2. Data mining can also be applied to other forms such as …………….


i) Data streams
ii) Sequence data
iii) Networked data
iv) Text data
v) Spatial data

A) i, ii, iii and v only


B) ii, iii, iv and v only
C) i, iii, iv and v only
D) All i, ii, iii, iv and v

3. Which of the following is not a data mining functionality?


A) Characterization and Discrimination
B) Classification and regression
C) Selection and interpretation
D) Clustering and Analysis

4. ……………………….. is a summarization of the general characteristics or features of a


target class of data.
A) Data Characterization
B) Data Classification
C) Data discrimination
D) Data selection

5. ……………………….. is a comparison of the general features of the target class data objects
against the general features of objects from one or multiple contrasting classes.
A) Data Characterization
B) Data Classification
C) Data discrimination
D) Data selection

6. Strategic value of data mining is ………………….


A) cost-sensitive
B) work-sensitive
C) time-sensitive
D) technical-sensitive


7. ……………………….. is the process of finding a model that describes and distinguishes


data classes or concepts.
A) Data Characterization
B) Data Classification
C) Data discrimination
D) Data selection

8. The various aspects of data mining methodologies is/are ……………….


i) Mining various and new kinds of knowledge
ii) Mining knowledge in multidimensional space
iii) Pattern evaluation and pattern or constraint-guided mining.
iv) Handling uncertainty, noise, or incompleteness of data
A) i, ii and iv only
B) ii, iii and iv only
C) i, ii and iii only
D) All i, ii, iii and iv

9. The full form of KDD is ………………


A) Knowledge Database
B) Knowledge Discovery Database
C) Knowledge Data House
D) Knowledge Data Definition

10. The output of KDD is ………….


A) Data
B) Information
C) Query
D) Useful information

ANSWERS:
1. …………………. is an essential process where intelligent methods are applied to extract
data patterns.
B) Data mining

2. Data mining can also be applied to other forms such as …………….


i) Data streams
ii) Sequence data
iii) Networked data
iv) Text data
v) Spatial data
D) All i, ii, iii, iv and v

3. Which of the following is not a data mining functionality?


C) Selection and interpretation
4. ……………………….. is a summarization of the general characteristics or features of a
target class of data.
A) Data Characterization

5. ……………………….. is a comparison of the general features of the target class data objects
against the general features of objects from one or multiple contrasting classes.
C) Data discrimination

6. Strategic value of data mining is ………………….


C) time-sensitive

7. ……………………….. is the process of finding a model that describes and distinguishes


data classes or concepts.
B) Data Classification

8. The various aspects of data mining methodologies is/are ……………….


i) Mining various and new kinds of knowledge
ii) Mining knowledge in multidimensional space
iii) Pattern evaluation and pattern or constraint-guided mining.
iv) Handling uncertainty, noise, or incompleteness of data
D) All i, ii, iii and iv

9. The full form of KDD is ………………


B) Knowledge Discovery Database

10. The output of KDD is ………….


D) Useful information

11. Data modeling technique used for data marts is


(a) Dimensional modeling
(b) ER – model
(c) Extended ER – model
(d) Physical model
(e) Logical model.

12. A warehouse architect is trying to determine what data must be included in the
warehouse. A meeting has been arranged with a business analyst to understand the
data requirements, which of the following should be included in the agenda?
(a) Number of users
(b) Corporate objectives
(c) Database design
(d) Routine reporting
(e) Budget.
13. An OLAP tool provides for
(a) Multidimensional analysis
(b) Roll-up and drill-down
(c) Slicing and dicing
(d) Rotation
(e) Setting up only relations.

14. The Synonym for data mining is


(a) Data warehouse
(b) Knowledge discovery in database
(c) ETL
(d) Business intelligence
(e) OLAP.

15. Which of the following statements is true?


(a) A fact table describes the transactions stored in a DWH
(b) A fact table describes the granularity of data held in a DWH
(c) The fact table of a data warehouse is the main store of descriptions of the
transactions stored in a DWH
(d) The fact table of a data warehouse is the main store of all of the recorded
transactions over time
(e) A fact table maintains the old records of the database.

16. Most common kind of queries in a data warehouse


(a) Inside-out queries
(b) Outside-in queries
(c) Browse queries
(d) Range queries
(e) All (a), (b), (c) and (d) above.

17. Concept description is the basic form of the


(a) Predictive data mining
(b) Descriptive data mining
(c) Data warehouse
(d) Relational data base
(e) Proactive data mining.

18. The apriori property means


(a) If a set cannot pass a test, all of its supersets will fail the same test as well
(b) To improve the efficiency the level-wise generation of frequent item sets
(c) If a set can pass a test, all of its supersets will fail the same test as well
(d) To decrease the efficiency the level-wise generation of frequent item sets
(e) All (a), (b), (c) and (d) above.

19. Which of the following forms the set of data created to support a specific short-lived
business situation?
(a) Personal data marts
(b) Application models
(c) Downstream systems
(d) Disposable data marts
(e) Data mining models.

20. What is/are the different types of Meta data?


I. Administrative.
II. Business.
III. Operational.
(a) Only (I) above
(b) Both (II) and (III) above
(c) Both (I) and (II) above
(d) Both (I) and (III) above
(e) All (I), (II) and (III) above.

Answers

      Ans                               Explanation

11. A Data modeling technique used for data marts is Dimensional modeling.

12. D Routine reporting should be included in the agenda.

13. C An OLAP tool provides for Slicing and dicing.

14. C The synonym for data mining is Knowledge discovery in Database.


15. D The fact table of a data warehouse is the main store of all of the recorded
transactions over time is the correct statement.

16. A The most common kind of queries in a data warehouse is Inside-out queries.

17. B Concept description is the basic form of descriptive data mining.

18. B The apriori property means to improve the efficiency the level-wise generation
of frequent item sets.

19. D Disposable Data Marts form the set of data created to support a specific
short-lived business situation.

20. E The different types of Meta data are Administrative, Business and Operational.
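The apriori property behind question 18 above (if a set cannot pass the frequency test, all of its supersets fail it as well) is what makes level-wise candidate pruning possible; a minimal sketch with hypothetical itemsets:

```python
def prune_candidates(candidates, frequent):
    """Keep a k-itemset candidate only if all of its (k-1)-subsets are frequent."""
    frequent = {frozenset(s) for s in frequent}
    kept = []
    for cand in candidates:
        cand = frozenset(cand)
        # apriori pruning: every subset obtained by dropping one item must be frequent
        if all(cand - {item} in frequent for item in cand):
            kept.append(cand)
    return kept

frequent_pairs = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"b", "d"}]
candidates = [{"a", "b", "c"}, {"a", "b", "d"}]   # {"a", "d"} is not frequent
kept = prune_candidates(candidates, frequent_pairs)
```

The candidate {"a", "b", "d"} is pruned without counting its support, because its subset {"a", "d"} is already known to be infrequent.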
Data scrubbing is which of the following?
A
A process to reject data from the data warehouse and to create the necessary indexes
.

B
A process to load the data in the data warehouse and to create the necessary indexes
.

C
A process to upgrade the quality of data after it is moved into a data warehouse
.

D
A process to upgrade the quality of data before it is moved into a data warehouse
.
Answer: Option D
Explanation:
No answer description available for this question. Let us discuss.
View Answer Discuss in Forum Workspace Report

2.  The @active data warehouse architecture includes which of the following?
A
At least one data mart
.

B
Data that can extracted from numerous internal and external sources
.

C
Near real-time updates
.

D All of the above.


.
Answer: Option D

3.  A goal of data mining includes which of the following?

A. To explain some observed event or condition.
B. To confirm that data exists.
C. To analyze data for expected relationships.
D. To create a new data warehouse.
Answer: Option A

4.  An operational system is which of the following?

A. A system that is used to run the business in real time and is based on historical data.
B. A system that is used to run the business in real time and is based on current data.
C. A system that is used to support decision making and is based on current data.
D. A system that is used to support decision making and is based on historical data.
Answer: Option B

5.  A data warehouse is which of the following?

A. Can be updated by end users.
B. Contains numerous naming conventions and formats.
C. Organized around important subject areas.
D. Contains only current data.
Answer: Option C
6.  A snowflake schema is which of the following types of tables?

A. Fact
B. Dimension
C. Helper
D. All of the above
Answer: Option D

7.  The generic two-level data warehouse architecture includes which of the following?

A. At least one data mart.
B. Data that can be extracted from numerous internal and external sources.
C. Near real-time updates.
D. All of the above.
Answer: Option B

8.  Fact tables are which of the following?

A. Completely denormalized.
B. Partially denormalized.
C. Completely normalized.
D. Partially normalized.
Answer: Option C
9.  Data transformation includes which of the following?

A. A process to change data from a detailed level to a summary level.
B. A process to change data from a summary level to a detailed level.
C. Joining data from one source into various sources of data.
D. Separating data from one source into various sources of data.
Answer: Option A

10.  Reconciled data is which of the following?

A. Data stored in the various operational systems throughout the organization.
B. Current data intended to be the single source for all decision support systems.
C. Data stored in one operational system in the organization.
D. Data that has been selected and formatted for end-user support applications.
Answer: Option B

11.  The load and index is which of the following?

A. A process to reject data from the data warehouse and to create the necessary indexes.
B. A process to load the data in the data warehouse and to create the necessary indexes.
C. A process to upgrade the quality of data after it is moved into a data warehouse.
D. A process to upgrade the quality of data before it is moved into a data warehouse.
Answer: Option B

12.  The extract process is which of the following?

A. Capturing all of the data contained in various operational systems.
B. Capturing a subset of the data contained in various operational systems.
C. Capturing all of the data contained in various decision support systems.
D. Capturing a subset of the data contained in various decision support systems.
Answer: Option B

13.  A star schema has what type of relationship between a dimension and fact table?

A. Many-to-many
B. One-to-one
C. One-to-many
D. All of the above
Answer: Option C

14.  Transient data is which of the following?

A. Data in which changes to existing records cause the previous version of the records to be eliminated.
B. Data in which changes to existing records do not cause the previous version of the records to be eliminated.
C. Data that are never altered or deleted once they have been added.
D. Data that are never deleted once they have been added.
Answer: Option A
15.  A multifield transformation does which of the following?

A. Converts data from one field into multiple fields.
B. Converts data from multiple fields into one field.
C. Converts data from multiple fields into multiple fields.
D. All of the above.
Answer: Option D

1. In a data mining task when it is not clear about what type of patterns
could be interesting, the data mining system should:
a) Perform all possible data mining tasks
b) Handle different granularities of data and patterns
c) Perform both descriptive and predictive tasks
d) Allow interaction with the user to guide the mining process

Answer: (d) Allow interaction with the user to guide the mining process
Users have a good sense of which “direction” of mining may lead to
interesting patterns and the “form” of the patterns or rules they want to find.
They may also have a sense of “conditions” for the rules, which would
eliminate the discovery of certain rules that they know would not be of
interest. Thus, a good heuristic is to have the users specify such intuition or
expectations as constraints to confine the search space. This strategy is
known as constraint-based mining.

 
2. To detect fraudulent usage of credit cards, the following data mining
task should be used:
a) Feature selection
b) Prediction
c) Outlier analysis
d) All of the above

Answer: (c) Outlier analysis


Fraudulent usage of credit cards can be detected using outlier analysis or
outlier detection.
Outlier
A data element that stands out from the rest of the data. The values that
deviate from other observations on data are called outliers. In data
distribution, they are not part of the pattern.  Sometimes referred to as
abnormalities, anomalies, or deviants, outliers can occur by chance in any
given distribution.
Outlier analysis
The analysis used to find unusual patterns in a dataset. There are many
outlier detection algorithms proposed under these broad categories:
statistics-based approaches, distance-based approaches, fuzzy approaches and
kernel functions.

 
 
3. In high dimensional spaces, the distance between data points becomes
meaningless because:
a) It becomes difficult to distinguish between the nearest and
farthest neighbors
b) The nearest neighbor becomes unreachable
c) The data becomes sparse
d) There are many uncorrelated features

Answer: (a) It becomes difficult to distinguish between the nearest and farthest neighbors
Curse of dimensionality
The dimensionality curse phenomenon states that in high dimensional spaces
distances between nearest and farthest points from query points become
almost equal. Therefore, nearest neighbor calculations cannot discriminate
candidate points.
By high dimensional spaces, we are talking about hundreds to thousands of
dimensions for a dense vector (sparse vectors are a completely different topic).
Basically once you get up to high-dimensionality, pairwise distance between
all of your points approaches a constant.

 
 
4. The difference between supervised learning and unsupervised
learning is given by:
a) Unlike unsupervised learning, supervised learning needs
labeled data
b) Unlike unsupervised learning, supervised learning can form new
classes
c) Unlike unsupervised learning, supervised learning can be used
to detect outliers
d) Unlike supervised learning, unsupervised learning can predict
the output class from among the known classes

Answer: (a) Unlike unsupervised learning, supervised learning needs labeled data
Supervised learning: Supervised learning is the machine learning task of
learning a function that maps an input to an output based on example input-
output pairs. It is basically a synonym for classification. The supervision in
the learning comes from the labeled examples in the training data set.
Unsupervised learning: Unsupervised learning is essentially a synonym for
clustering. The learning process is unsupervised since the input examples are
not class labeled. Typically, we may use clustering to discover classes within
the data. The goal of unsupervised learning is to model the hidden patterns in
the given input data in order to learn about the data.

 
 
5. Which of the following is used to find inherent regularities in data?
a) Clustering
b) Frequent pattern analysis
c) Regression analysis
d) Outlier analysis

Answer: (b) Frequent pattern analysis


Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.)
that occurs frequently in a data set. It is an intrinsic and important property
of datasets.
Basket data analysis, cross-marketing, catalog design, sale campaign analysis,
Web log (click stream) analysis, and DNA sequence analysis are some of the
applications of frequent pattern analysis.

1. In non-parametric models
a) There are no parameters
b) The parameters are fixed in advance
c) A type of probability distribution is assumed, then its
parameters are inferred
d) The parameters are flexible

Answer: (d) The parameters are flexible


Non-parametric models differ from parametric models in that the model
structure is not specified a priori but is instead determined from data. The
term non-parametric is not meant to imply that such models completely lack
parameters but that the number and nature of the parameters are flexible and
not fixed in advance.
In non-parametric models, no fixed set of parameters and no probability
distribution is assumed. They have parameters that are flexible.

 
2. The goal of clustering analysis is to:
a) Maximize the inter-cluster similarity
b) Maximize the intra-cluster similarity
c) Maximize the number of clusters
d) Minimize the intra-cluster similarity

Answer: (b) Maximize the intra-cluster similarity


One of the goals of a clustering algorithm is to maximize the intra-cluster
similarity.
A clustering algorithm with small intra-cluster distance (high intra-cluster
similarity) and high inter-cluster distance (low inter-cluster similarity) is
said to be a good clustering algorithm.
Clustering analysis is a technique for grouping similar observations into a
number of clusters based on multiple variables for each individual observed
value. It is an unsupervised classification.
Inter-cluster distance – the distance between two objects from two different
clusters.
Intra-cluster distance – the distance between two objects from the
same cluster.
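The two distances can be illustrated with a short snippet (illustrative only, not part of the original answer; the clusters and points are made up):

```python
import math

cluster1 = [(0, 0), (0, 1)]
cluster2 = [(5, 5), (6, 5)]

# Intra-cluster distance: between two objects from the same cluster
intra = math.dist(cluster1[0], cluster1[1])
# Inter-cluster distance: between two objects from different clusters
inter = math.dist(cluster1[0], cluster2[0])
print(intra, inter)  # intra is small (1.0), inter is large (~7.07)
```

A good clustering keeps `intra` small and `inter` large.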

 
3. In decision tree algorithms, attribute selection measures are used to
a) Reduce the dimensionality
b) Select the splitting criteria which best separate the data
c) Reduce the error rate
d) Rank attributes

Answer: (b) Select the splitting criteria which best separate the data
Attribute selection measures in decision tree algorithms are mainly used to
select the splitting criterion that best separates the given data partition.  
During the induction phase of the decision tree, the attribute selection
measure is determined by choosing the attribute that will best separate the
remaining samples of the nodes partition into individual classes.
The data set is partitioned according to a splitting criterion into subsets. This
procedure is repeated recursively for each subset until each subset contains only
members belonging to the same class or is sufficiently small.
Information gain, Gain ratio and Gini index are the popular attribute
selection measures.
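As an illustrative sketch (the data and function names are made up, not from the text), information gain can be computed as the drop in entropy produced by a split:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, partitions):
    """Entropy of the parent node minus the weighted entropy of its partitions."""
    n = len(parent_labels)
    weighted = sum(len(part) / n * entropy(part) for part in partitions)
    return entropy(parent_labels) - weighted

# A split that separates the classes perfectly yields the maximum gain of 1 bit.
parent = ["yes"] * 4 + ["no"] * 4
print(information_gain(parent, [["yes"] * 4, ["no"] * 4]))  # 1.0
```

The attribute whose split gives the highest gain is chosen as the splitting criterion.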

 
4. Pruning a decision tree always 
a) Increases the error rate
b) Reduces the size of the tree
c) Provides the partitions with lower entropy
d) Reduces classification accuracy

Answer: (b) Reduces the size of the tree


Pruning means simplifying/compressing and optimizing a decision tree by
removing sections of the tree that are uncritical and redundant to classify
instances. It helps in significantly reducing the size of the decision tree.
Decision trees are the most susceptible machine learning algorithm to
overfitting (the undesired induction of noise in the tree). Pruning can
reduce the likelihood of overfitting problem.

 
5. Which of the following classifiers fall in the category of lazy learners:
a) Decision trees
b) Bayesian classifiers
c) k-NN classifiers
d) Rule-based classifiers

Answer: (c) k-NN classifier


k-nearest neighbor (k-NN) classifier is a lazy learner because it doesn’t learn
a discriminative function from the training data but “memorizes” the
training dataset instead.
Lazy learning (e.g., instance-based learning): Simply stores training data (or
only minor processing) and waits until it is given a test tuple. When it does,
classification is conducted based on the most related data in the stored
training data.
Lazy learning is also referred to as “just-in-time learning”.
The other category of classifiers is “eager learners”.
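A minimal k-NN sketch (illustrative only; the data is made up): training merely stores the pairs, and all computation is deferred to query time, which is what makes it lazy:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.
    `train` is a list of (point, label) pairs; no model is built in advance."""
    dist = lambda p, q: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    nearest = sorted(train, key=lambda pl: dist(pl[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# "Training" is nothing more than keeping the labeled points around.
train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_predict(train, (0.5, 0.5)))  # A
```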

1. Which of the following practices can help in handling the overfitting problem?
a) Use of faster processor
b) Increasing the number of training examples
c) Reducing the number of training instances
d) Increasing the model complexity

Answer: (b) Increasing the number of training examples


As we increase the number of training examples, the test error decreases (the
variance of the model decreases), and this results in reduced overfitting.
If our model does not generalize well from our training data to unseen data,
we denote this as overfitting. An overfit model will have extremely low
training error but a high testing error.

 
2. Which of the following statements is INCORRECT about the SVM and
kernels?
a. Kernels map the original dataset into a higher dimensional
space and then find a hyper-plane in the mapped space
b. Kernels map the original dataset into a higher dimensional
space and then find a hyper-plane in the original space
c. Using kernels allows us to obtain non linear decision boundaries
for a classification problem
d. The kernel trick allows us to perform computations in the
original space and enhances speed of SVM learning.

Answer: (b) Kernels map the original dataset into a higher dimensional space and then find a hyper-plane in the original space
SVM transforms the original feature space into a higher-dimensional space
based on a user-defined kernel function and then finds support vectors to
maximize the separation (margin) between two classes in the higher-
dimensional space.

 
3. Dimensionality reduction reduces the data set size by removing
____________.
a) Relevant attributes.
b) Irrelevant attributes.
c) Support vector attributes.
d) Mining attributes

Answer: (b) Irrelevant attributes


We remove those attributes or features that are irrelevant and redundant in
order to reduce the dimension of the feature set.
Dimensionality reduction
Dimensionality reduction, or  dimension reduction, is the transformation of
data from a high-dimensional space into a low-dimensional space so that the
low-dimensional representation retains some meaningful properties of the
original data. [Wikipedia] 
The process of dimensionality reduction is divided into two components,
feature selection and feature extraction. In feature selection, smaller subsets
of features are chosen from a set of many dimensional data to represent the
model by filtering, wrapping or embedding. Feature extraction reduces the
number of dimensions in a dataset in order to model variables and perform
component analysis.

 
4. What is the Hamming distance between the binary vectors a =
0101010001 and b = 0100011001?
a) 2
b) 3
c) 5
d) 10

Answer: (a) 2
For binary data, the Hamming distance is the number of bits that are
different between two binary vectors.
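The calculation for the vectors in this question can be checked with a short snippet (illustrative, not part of the original answer):

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

# The vectors from the question differ at positions 3 and 6.
print(hamming("0101010001", "0100011001"))  # 2
```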

 
5. What is the Jaccard similarity between the binary vectors a =
0111010101 and b = 0100011111?
a) 0.5
b) 1.5
c) 2.5
d) 3

Answer: (a) 0.5
For binary data, the Jaccard similarity is a measure of similarity between two
binary vectors.
Jaccard similarity between binary vectors can be calculated using the
following equation:
Jsim = C11 / (C01 + C10 + C11)
Here, C11 is the count of positions where both vectors are 1, and
C01 and C10 are the counts of positions where the two vectors disagree.
For the given question,
C11 = the number of bit positions with matching 1’s = 4
C10 = the number of bit positions where the first vector (vector a) is 1
and the second vector (vector b) is 0 = 2
C01 = the number of bit positions where the first vector (vector a) is 0
and the second vector (vector b) is 1 = 2
Jsim(a, b) = 4/(2+2+4) = 4/8 = ½ = 0.5
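The C11/C10/C01 counting above can be verified with a short snippet (illustrative, not part of the original answer):

```python
def jaccard(a, b):
    """Jaccard similarity of two equal-length bit strings: C11 / (C01 + C10 + C11)."""
    c11 = sum(x == "1" and y == "1" for x, y in zip(a, b))
    c01 = sum(x == "0" and y == "1" for x, y in zip(a, b))
    c10 = sum(x == "1" and y == "0" for x, y in zip(a, b))
    return c11 / (c01 + c10 + c11)

print(jaccard("0111010101", "0100011111"))  # 0.5
```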

1. Minkowski distance is a function used to find the distance between two
a) Binary vectors
b) Boolean-valued vectors
c) Real-valued vectors
d) Categorical vectors

Answer: (c) Real-valued vectors


Minkowski distance finds the distance between two real-valued vectors. It is a
generalization of the Euclidean and Manhattan distance measures and adds a
parameter, called the “order” or “p“, that allows different distance measures
to be calculated.
Minkowski distance: D(x, y) = (Σ |xi − yi|^p)^(1/p)
If p = 1, this gives L1, the Manhattan distance (set p = 1 in the above
equation).
If p = 2, this gives L2, the Euclidean distance (set p = 2 in the above
equation).
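A short snippet (illustrative, not from the original text) showing how the order p recovers the Manhattan and Euclidean distances:

```python
def minkowski(x, y, p):
    """Order-p Minkowski distance between two real-valued vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (0.0, 0.0), (3.0, 4.0)
print(minkowski(x, y, 1))  # 7.0  (p = 1: Manhattan distance)
print(minkowski(x, y, 2))  # 5.0  (p = 2: Euclidean distance)
```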

 
2. Which of the following distance measure is similar to Simple
Matching Coefficient (SMC)?
a) Euclidean distance
b) Hamming distance
c) Jaccard distance
d) Manhattan distance

Answer: (b) Hamming distance


Hamming distance is the number of bits that are different between two binary
vectors.
The Hamming distance is similar to the SMC in which both methods look at
the whole data and looks for when data points are similar and dissimilar.   The
Hamming distance gives the number of bits that are different whereas the
SMC gives the result of the ratio of how many bits were the same over the
entirety of the sample set.   In a nutshell, Hamming distance reveals how
many were different, SMC reveals how many were same, and therefore one
reveals the inverse information of the other.
SMC = number of matching bits / total number of bits = 1 − (Hamming distance / number of bits)
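The complementary relationship between SMC and Hamming distance can be checked on the vectors from question 4 of the previous set (an illustrative snippet, not part of the original answer):

```python
def smc(a, b):
    """Simple Matching Coefficient: fraction of positions where two bit strings agree."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)

def hamming(a, b):
    """Number of positions where two bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

a, b = "0101010001", "0100011001"
print(hamming(a, b))               # 2
print(smc(a, b))                   # 0.8
print(1 - hamming(a, b) / len(a))  # 0.8 -- one is the complement of the other
```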

 
3. The statement “if an itemset is frequent then all of its subsets must also be
frequent” describes _________ .
a) Unique item property
b) Downward closure property
c) Apriori property
d) Contrast set learning

Answer: (b) Downward closure property and (c) Apriori property


The Apriori property states that if an itemset is frequent then all of its subsets
must also be frequent.
The Apriori algorithm is a classical data mining algorithm used for mining
frequent itemsets and learning relevant association rules over relational
databases.
The Apriori property expresses a monotonic decrease of an evaluation criterion
as a pattern grows.
The downward closure property and the Apriori property are synonyms.

 
4. Prediction differs from classification in which of the following senses?
a) Not requiring a training phase
b) The type of the outcome value
c) Using unlabeled data instead of labeled data
d) Prediction is about determining a class

Answer: (b) The type of the outcome value


The type of outcome values of prediction differs from that of
classification.
Predicting class labels is classification, and predicting values (e.g. using
regression techniques) is prediction.
Classification is the process of identifying the category or class label of the
new observation to which it belongs. Prediction is the process of identifying
the missing or unavailable numerical data for a new observation.

 
5. The statement “if an itemset is infrequent then its superset must also be an
infrequent set” denotes _______.
a) Maximal frequent set.
b) Border set.
c) Upward closure property.
d) Downward closure property.

Answer: (c) Upward closure property


Any subset of a frequent item set must be frequent (downward closure
property) or any superset of an infrequent item set must be infrequent
(Upward closure property). Both are Apriori properties.
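A small sketch (illustrative, with made-up transactions) of how the downward closure property is used: a candidate itemset can only be frequent if every one of its subsets is already frequent.

```python
from itertools import combinations

transactions = [{"milk", "bread"}, {"milk", "bread", "butter"}, {"bread", "butter"}]
min_support = 2

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions)

# Frequent 1-itemsets, then frequent 2-itemsets built from them.
f1 = {frozenset([i]) for t in transactions for i in t if support({i}) >= min_support}
f2 = {frozenset(c) for c in combinations({i for s in f1 for i in s}, 2)
      if support(set(c)) >= min_support}

# Downward closure: every 1-item subset of a frequent 2-itemset is itself frequent.
assert all(frozenset([i]) in f1 for pair in f2 for i in pair)
print(sorted(sorted(p) for p in f2))  # [['bread', 'butter'], ['bread', 'milk']]
```

Conversely, {milk, butter} is infrequent here, so by the upward closure property no superset of it needs to be counted.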

1. Which of the following best describes the sample of data used to provide an
unbiased evaluation of a model fit on the training dataset while tuning model
hyper-parameters?
a) Training dataset
b) Test dataset
c) Validation dataset
d) Holdout dataset

Answer: (c) Validation dataset


Validation dataset is the sample of data used to provide an unbiased
evaluation of a model fit on the training dataset while tuning model
hyper-parameters.
It is usually used for parameter selection and to avoid overfitting. It helps in
tuning the parameters of the model. For example, in neural network, it is used
to choose the number of hidden units.
Validation dataset is different from test dataset.
The validation set is also known as the Development set.

 
2. In which of the following, data are stored, retrieved and updated?
a) OLAP
b) MOLAP
c) HTTP
d) OLTP
 

Answer: (d) OLTP
Online Transaction Processing (OLTP) is a type of data processing in
information systems that typically facilitates transaction-oriented applications.
A system to handle the inventory of a supermarket, a ticket booking system, and
financial transaction systems are some examples of OLTP.
OLAP is Online Analytical Processing system used primarily for data
warehouse environments.

 
3. Data warehouse deals with which type of data that is never found in
the operational environment?
a) Normalized
b) Informal
c) Summarized
d) Denormalized

Answer: (c) Summarized
Data warehouse handles summarized (aggregated) data that are aggregated
from OLTP systems.
A data warehouse is a relational database that is designed for query and
analysis rather than for transaction processing. It usually contains historical
data derived from transaction data.
Data warehouses  are large databases that are specifically designed for OLAP
and business analytics workloads.
As per definition of Ralph Kimball, a data warehouse is “a copy of transaction
data specifically structured for query and analysis.”

 
4. Classification is a data mining task that maps the data into _________ .
a) predefined group
b) real valued prediction variable
c) time series
d) clusters

Answer: (a) predefined group


Classification is a data mining function that assigns items in a collection to
target categories or classes that are predefined. The goal of classification is to
accurately predict the target class for each case in the data. For example, a
classification model could be used to identify loan applicants as low, medium,
or high credit risks.
k-nearest neighbor (knn), naïve bayes and support vector machine (svm) are
few of the classification algorithms.

 
5. Which of the following clustering techniques start with as many
clusters as there are records or observations with each cluster having
only one observation at the starting?
a) Agglomerative clustering
b) Fuzzy clustering
c) Divisive clustering
d) Model-based clustering

Answer: (a) Agglomerative clustering


This is a "bottom-up" approach: each observation starts in its own
cluster, and pairs of clusters are merged as one moves up the
hierarchy.
Agglomerative clustering starts with single object clusters (singletons) and
proceeds by progressively merging the most similar clusters, until a stopping
criterion (which could be a predefined number of groups k) is reached. In
some cases, the procedure ends only when all the clusters are merged into a
single one, which is when one aims at investigating the overall granularity of
the data structure.
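The bottom-up merging described above can be sketched in pure Python (an illustrative single-linkage version with made-up one-dimensional points, not part of the original answer):

```python
def agglomerative(points, k):
    """Bottom-up (single-linkage) clustering sketch: start with one cluster per
    point and repeatedly merge the two closest clusters until k remain."""
    clusters = [[p] for p in points]  # every observation starts in its own cluster
    dist = lambda a, b: abs(a - b)
    while len(clusters) > k:
        # find the pair of clusters with the smallest single-linkage distance
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: min(dist(a, b)
                               for a in clusters[ij[0]] for b in clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)  # merge the pair and drop the second cluster
    return clusters

print(agglomerative([1.0, 1.1, 5.0, 5.2, 9.9], k=3))  # [[1.0, 1.1], [5.0, 5.2], [9.9]]
```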

1. With data mining, the best way to accomplish this is by setting aside some of your
data in a vault to isolate it from the mining process; once the mining is complete, the
results can be tested against the isolated data to confirm the model's _______.
A. Validity
B. Security
C. Integrity
D. None of above
Ans:  A
2. The automated, prospective analyses offered by data mining move beyond the
analyses of past events provided by _______ tools typical of decision support
systems.
A. Introspective
B. Intuitive
C. Reminiscent
D. Retrospective
Ans:   D
 

3. The technique that is used to perform these feats in data mining is called
modeling, and this act of model building is something that people have been doing
for a long time, certainly before the _______ of computers or data mining
technology.
A. Access
B. Advent
C. Ascent
D. Avowal
Ans:   B
 

4. Classification consists of examining the properties of a newly presented
observation and assigning it to a predefined ________.
A. Object
B. Container
C. Subject
D. Class
Ans:   D
 

5. During business hours, most ______ systems should probably not use parallel
execution.
A. OLAP
B. DSS
C. Data Mining
D. OLTP
Ans:   D
 

6. In contrast to statistics, data mining is ______ driven.


A. Assumption
B. Knowledge
C. Human
D. Database
Ans:   B
 

7. Data mining derives its name from the similarities between searching for valuable
business information in a large database, for example, finding linked products in
gigabytes of store scanner data, and mining a mountain for a _________ of valuable
ore.
A. Furrow
B. Streak
C. Trough
D. Vein
Ans:   D
 

8. As opposed to the outcome of classification, estimation deals with __________
valued outcome.
A. Discrete
B. Isolated
C. Continuous
D. Distinct
Ans:  C
 

9. The goal of ideal parallel execution is to completely parallelize those parts of a


computation that are not constrained by data dependencies. The smaller the portion
of the program that must be executed __________, the greater the scalability of the
computation.
A. In Parallel
B. Distributed
C. Sequentially
D. None of above
Ans:   C
 

10. Data mining evolved as a mechanism to cater for the limitations of ________
systems to deal with massive data sets with high dimensionality, new data types,
multiple heterogeneous data resources, etc.
A. OLTP
B. OLAP
C. DSS
D. DWH
Ans:  A
 

11. The goal of ideal parallel execution is to completely parallelize those parts of a
computation that are not constrained by data dependencies. The ______ the portion
of the program that must be executed sequentially, the greater the scalability of the
computation.
A. Larger
B. Smaller
C. Unambiguous
D. Superior
Ans:  B
 
12. The goal of ________ is to look at as few blocks as possible to find the matching
record(s).
A. Indexing
B. Partitioning
C. Joining
D. None of above
Ans:  A
 

13. In the nested-loop join case, if there are ‘M’ rows in the outer table and ‘N’ rows
in the inner table, the time complexity is
A. O(M log N)
B. O(log MN)
C. O(MN)
D. O(M + N)
Ans:  C
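The O(MN) behaviour comes from scanning the whole inner table once for each outer row, as in this illustrative sketch (the tables and column names are made up):

```python
def nested_loop_join(outer, inner, key_outer, key_inner):
    """Naive nested-loop join: for each of the M outer rows, scan all N inner
    rows, giving O(M * N) comparisons."""
    result = []
    for o in outer:          # M iterations
        for i in inner:      # N iterations per outer row
            if o[key_outer] == i[key_inner]:
                result.append({**o, **i})
    return result

orders = [{"cust_id": 1, "item": "pen"}, {"cust_id": 2, "item": "ink"}]
custs = [{"id": 1, "name": "Ali"}, {"id": 2, "name": "Sara"}]
print(nested_loop_join(orders, custs, "cust_id", "id"))  # two joined rows
```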
 

14. Many data warehouse project teams waste enormous amounts of time searching
in vain for a _________.
A. Silver Bullet
B. Golden Bullet
C. Suitable Hardware
D. Compatible Product
Ans:  A
 

15. A dense index, if it fits into memory, costs only ______ disk I/O access to locate a
record by a given key.
A. One
B. Two
C. lg (n)
D. n
Ans:   A
 
16. All data is ________ of something real.
I An Abstraction
II A Representation
Which of the following option is true?
A. I Only
B. II Only
C. Both I & II
D. None of I & II
Ans:  A
 

18. The key idea behind ___________ is to take a big task and break it into subtasks
that can be processed concurrently on a stream of data inputs in multiple,
overlapping stages of execution.
A. Pipeline Parallelism
B. Overlapped Parallelism
C. Massive Parallelism
D. Distributed Parallelism
Ans:  A
 

19. Non-uniform distribution of data across the processors is called ______.
A. Skew in Partition
B. Pipeline Distribution
C. Distributed Distribution
D. Uncontrolled Distribution
Ans:  A
 

20. The goal of ideal parallel execution is to completely parallelize those parts of a
computation that are not constrained by data dependencies. The smaller the portion
of the program that must be executed __________, the greater the scalability of the
computation.
A. None of these
B. Sequentially
C. In Parallel
D. Distributed
Ans:  B
 

21. Data mining is a/an __________ approach, where browsing through data using
data mining techniques may reveal something that might be of interest to the user as
information that was unknown previously.
A. Exploratory
B. Non-Exploratory
C. Computer Science
Ans:  A
 

22. Data mining evolved as a mechanism to cater for the limitations of ________
systems to deal with massive data sets with high dimensionality, new data types,
multiple heterogeneous data resources, etc.
A. OLTP
B. OLAP
C. DSS
D. DWH
Ans:   A
 

23. ________ is the technique in which existing heterogeneous segments are
reshuffled and relocated into homogeneous segments.
A. Clustering
B. Aggregation
C. Segmentation
D. Partitioning
Ans:   A
 

24. To measure or quantify the similarity or dissimilarity, different techniques are
available. Which of the following options represents the names of available techniques?
A. Pearson correlation is the only technique
B. Euclidean distance is the only technique
C. Both Pearson correlation and Euclidean distance
D. None of these
Ans:  C
 

25. For a DWH project, the key requirements are ________ and product experience.
A. Tools
B. Industry
C. Software
D. None of these
Ans:  B
 

26. Pipeline parallelism focuses on increasing throughput of task execution, NOT on
_______ sub-task execution time.
A. Increasing
B. Decreasing
C. Maintaining
D. None of these
Ans:   B
 

27. Focusing on data warehouse delivery only often ends up in _________.
A. Rebuilding
B. Success
C. Good Stable Product
D. None of these
Ans:  D
 

28. Pakistan is one of the five major ________ countries in the world.
A. Cotton-growing
B. Rice-growing
C. Weapon Producing
Ans:  A
 
29. ______ is a process which involves gathering information about a column
through execution of certain queries with the intention to identify erroneous records.
A. Data profiling
B. Data Anomaly Detection
C. Record Duplicate Detection
D. None of these
Ans:  A
 

30. Relational databases allow you to navigate the data in ________ that is
appropriate using the primary, foreign key structure within the data model.
A. Only One Direction
B. Any Direction
C. Two Direction
D. None of these
Ans:  B
 

31. DSS queries do not involve a primary key


A. True
B. False
Ans:  A
 

32. _______ contributes to an under-utilization of valuable and expensive historical
data, and inevitably results in a limited capability to provide decision support and
analysis.
A. The lack of data integration and standardization
B. Missing Data
C. Data Stored in Heterogeneous Sources
Ans:  A
 

33. DTS allows us to connect through any data source or destination that is
supported by ________.
A. OLE DB
B. OLAP
C. OLTP
D. Data Warehouse
Ans:  A
 

34. If some error occurs, execution will be terminated abnormally and all transactions
will be rolled back. In this case, when we access the database, we will find it in the
state that existed before the ________.
A. Execution of package
B. Creation of package
C. Connection of package
Ans:  A
 

35. The need to synchronize data upon update is called


A. Data Manipulation
B. Data Replication
C. Data Coherency
D. Data Imitation
Ans:  C
 

36. Taken jointly, the extract programs or naturally evolving systems formed a spider
web, also known as
A. Distributed Systems Architecture
B. Legacy Systems Architecture
C. Online Systems Architecture
D. Intranet Systems Architecture
Ans:  B
 

37. It is observed that every year the amount of data recorded in an organization
A. Doubles
B. Triples
C. Quadruples
D. Remains same as previous year
Ans:  A
 

38. Pre-computed _______ can solve performance problems


A. Aggregates
B. Facts
C. Dimensions
Ans:  A
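The idea behind pre-computed aggregates can be sketched as follows; the detail rows and the daily grain are hypothetical. The aggregate is computed once, so later queries become a lookup instead of a scan of the detail data.

```python
from collections import defaultdict

# Hypothetical detail rows: (day, product, amount)
sales = [
    ("2024-01-01", "A", 10.0),
    ("2024-01-01", "B", 5.0),
    ("2024-01-02", "A", 7.5),
]

# Pre-compute an aggregate table keyed by day: built once, reused by many queries.
daily_totals = defaultdict(float)
for day, _product, amount in sales:
    daily_totals[day] += amount

# A query against the aggregate is a dictionary lookup, not a full scan.
print(daily_totals["2024-01-01"])  # 15.0
```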
 

39. The degree of similarity between two records, often measured by a numerical
value between _______, usually depends on application characteristics.
A. 0 and 1
B. 0 and 10
C. 0 and 100
D. 0 and 99
Ans:  A
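One common way to get a similarity score between 0 and 1 is the Jaccard coefficient over the token sets of two records; this is only one possible measure, and the sample strings are made up.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity: |A ∩ B| / |A ∪ B|, always in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0  # two empty records are treated as identical
    return len(sa & sb) / len(sa | sb)

# Likely duplicates score close to 1; unrelated records score close to 0.
print(jaccard("John Q Smith", "john smith"))
```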
 

40. The purpose of the House of Quality technique is to reduce ______ types of risk.
A. Two
B. Three
C. Four
D. All
Ans:  A
 

41. NUMA stands for __________


A. Non-uniform Memory Access
B. Non-updateable Memory Architecture
C. New Universal Memory Architecture
Ans:  A
 

42. There are many variants of the traditional nested-loop join. If the index is built as
part of the query plan and subsequently dropped, it is called
A. Naive nested-loop join
B. Index nested-loop join
C. Temporary index nested-loop join
D. None of these
Ans:  C
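The temporary index variant can be sketched in Python: an index on the inner relation is built as part of the "query plan", probed once per outer row, and discarded when the query finishes. The relations and key names below are hypothetical, and a real engine would build a disk-based index rather than a dictionary.

```python
def temp_index_nested_loop_join(outer, inner, key):
    # Build the temporary index on the inner relation (part of the plan).
    index = {}
    for row in inner:
        index.setdefault(row[key], []).append(row)
    # Probe the index once per outer row instead of rescanning `inner`.
    result = []
    for o in outer:
        for i in index.get(o[key], []):
            result.append({**o, **i})
    # `index` goes out of scope here, like the temporary index being dropped.
    return result

customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
orders = [{"id": 1, "total": 30}, {"id": 1, "total": 12}]
print(temp_index_nested_loop_join(customers, orders, "id"))
```

A naive nested-loop join would scan all of `orders` for every customer; the temporary index trades a one-time build cost for cheap probes.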
 

43. Kimball's iterative data warehouse development approach drew on decades
of experience to develop the ______.
A. Business Dimensional Lifecycle
B. Data Warehouse Dimension
C. Business Definition Lifecycle
D. OLAP Dimension
Ans:  A
 

44. During the application specification activity, we also must give consideration to
the organization of the applications.
A. True
B. False
Ans:  A
 

45. The most recent attack is the ________ attack on the cotton crop during 2003-
04, resulting in a loss of nearly 0.5 million bales.
A. Boll Worm
B. Purple Worm
C. Blue Worm
D. Cotton Worm
Ans:  A
 

46. The users of a data warehouse are knowledge workers; in other words, they
are _________ in the organization.
A. Decision maker
B. Manager
C. Database Administrator
D. DWH Analyst
Ans:  A
 

47. _________ breaks a table into multiple tables based upon common column
values.
A. Horizontal splitting
B. Vertical splitting
Ans:  A
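Horizontal splitting can be sketched by routing rows into separate tables based on a common column value; the table, the `region` column, and the row values are hypothetical.

```python
from collections import defaultdict

def split_horizontally(rows, column):
    """Route each row into a separate table keyed by its value in `column`."""
    tables = defaultdict(list)
    for row in rows:
        tables[row[column]].append(row)
    return dict(tables)

sales = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 20},
    {"region": "east", "amount": 5},
]
parts = split_horizontally(sales, "region")
print(sorted(parts))       # ['east', 'west']
print(len(parts["east"]))  # 2
```

Vertical splitting, by contrast, would keep all rows but distribute the columns across tables.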
 

48. _____modeling technique is more appropriate for data warehouses.


A. entity-relationship
B. dimensional
C. physical
D. None of the given
Ans:  B
 

49. Multi-dimensional databases (MDDs) typically use _______ formats to store pre-
summarized cube structures.
A. SQL
B. proprietary file
C. Object oriented
D. Non- proprietary file
Ans:  B
 

50. Data warehousing and on-line analytical processing (OLAP) are _______
elements of decision support system.
A. Unusual
B. Essential
C. Optional
D. None of the given
Ans:  B
 

51. Analytical processing uses ______ , instead of record level access.


A. multi-level aggregates
B. Single-level aggregates
C. Single-level hierarchy
D. None of the Given
Ans:  A
 

52. The divide & conquer cube partitioning approach helps alleviate the ______
limitations of MOLAP implementation.

A. Flexibility
B. Maintainability
C. Security
D. Scalability
Ans:  D
 

53. Data Warehouse provides the best support for analysis while OLAP carries out
the _________ task.
A. Mandatory
B. Whole
C. Analysis
D. Prediction
Ans:  C
 

54. Data Warehouse provides the best support for analysis while OLAP carries out
the _________ task.
A. Mandatory
B. Whole
C. Analysis
D. Prediction
Ans:  C
 

55. Virtual cube is used to query two similar cubes by creating a third “virtual” cube
by a join between two cubes.
A. True
B. False
Ans:  A
 


1)  The problem of finding hidden structure in unlabeled data is called... | Data Mining Mcqs
   A.  Supervised learning
   B.  Unsupervised learning
   C.  Reinforcement learning
Ans: B
 

2)  The task of inferring a model from labeled training data is called
    A.  Unsupervised learning
   B.  Supervised learning
   C.  Reinforcement learning  
Ans: B
 

3)  A telecommunication company wants to segment its customers into distinct
groups in order to send appropriate subscription offers; this is an example of
   A.  Supervised learning
   B.  Data extraction
   C.  Serration
   D.  Unsupervised learning
Ans: D
 

4)  Self-organizing maps are an example of...


   A.  Unsupervised learning  
   B.  Supervised learning
   C.  Reinforcement learning
   D.  Missing data imputation
Ans: A
 

5)  You are given data about seismic activity in Japan, and you want to predict the
magnitude of the next earthquake; this is an example of...
   A.  Supervised learning
   B.  Unsupervised learning
   C.  Serration
   D.  Dimensionality reduction
Ans: A
 

6)  Assume you want to perform supervised learning to predict the number of
newborns according to the size of the storks' population
(http://www.brixtonhealth.com/storksBabies.pdf); it is an example of ...
   A.  Classification
   B.  Regression
   C.  Clustering
   D.  Structural equation modeling
Ans: B
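Because the quantity being predicted is numeric, this is regression. A minimal sketch of least-squares line fitting is below; the data points are made up and lie exactly on y = 2x + 1, so the fitted slope and intercept recover those values.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # exactly 2x + 1
slope, intercept = fit_line(xs, ys)
print(round(slope, 6), round(intercept, 6))  # 2.0 1.0
```

Classification, by contrast, would predict a discrete label rather than a continuous value.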
 

7)  Discriminating between spam and ham e-mails is a classification task, true or
false?
   A.  True
   B.  False
Ans: A
 

8)  In the example of predicting the number of babies based on the storks'
population size, the number of babies is...
   A.  outcome
   B.  feature  
   C.  attribute
   D.  observation
Ans: A
 

9)  It may be better to avoid the ROC curve metric, as it can suffer from the
accuracy paradox.
   A.  True
   B.  False  
Ans: B
 

10)  Which of the following is not involved in data mining?
   A.  Knowledge extraction
   B.  Data archaeology  
   C.  Data exploration
   D.  Data transformation
Ans: D
 

11)  Which is the right approach to Data Mining?
 
   A.  Infrastructure, exploration, analysis, interpretation, exploitation
   B.  Infrastructure, exploration, analysis, exploitation, interpretation
   C.  Infrastructure, analysis, exploration, interpretation, exploitation  
   D.  Infrastructure, analysis, exploration, exploitation, interpretation
Ans: A
 

12)  Which of the following issues is considered before investing in Data Mining?
 A.  Functionality
   B.  Vendor consideration  
   C.  Compatibility
   D.  All of the above
Ans: D
 

13.  Adaptive system management is


A. It uses machine-learning techniques, where a program can learn from past
experience and adapt itself to new situations 
B. Computational procedure that takes some value as input and produces some value
as output. 
C. Science of making machines perform tasks that would require intelligence when
performed by humans 
D. none of these
Ans: A
 

14. A Bayesian classifier is


A. A class of learning algorithm that tries to find an optimum classification of a set of
examples using the probabilistic theory. 
B.  Any mechanism employed by a learning system to constrain the search space of a
hypothesis 
C.  An approach to the design of learning algorithms that is inspired by the fact that
when people encounter new situations, they often explain them by reference to familiar
experiences, adapting the explanations to fit the new situation. 
D. None of these 
Ans: A
 
15. An algorithm is
A. It uses machine-learning techniques, where a program can learn from past
experience and adapt itself to new situations 
B. Computational procedure that takes some value as input and produces some value
as output 
C. Science of making machines perform tasks that would require intelligence when
performed by humans 
D. None of these 
Ans: B
 

16. Bias is


A. A class of learning algorithm that tries to find an optimum classification of a set of
examples using the probabilistic theory 
B. Any mechanism employed by a learning system to constrain the search space of a
hypothesis 
C. An approach to the design of learning algorithms that is inspired by the fact that when
people encounter new situations, they often explain them by reference to familiar
experiences, adapting the explanations to fit the new situation. 
D. None of these 
Ans: B
 

17. Background knowledge refers to


A.  Additional acquaintance used by a learning algorithm to facilitate the learning
process 
B. A neural network that makes use of a hidden layer 
C. It is a form of automatic learning. 
D. None of these 
Ans: A
 

18. Case-based learning is


A. A class of learning algorithm that tries to find an optimum classification of a set of
examples using the probabilistic theory. 
B. Any mechanism employed by a learning system to constrain the search space of a
hypothesis 
C. An approach to the design of learning algorithms that is inspired by the fact that when
people encounter new situations, they often explain them by reference to familiar
experiences, adapting the explanations to fit the new situation. 
D. None of these 
Ans: C
 

19. Classification is


A. A subdivision of a set of examples into a number of classes 
B. A measure of the accuracy, of the classification of a concept that is given by a certain
theory 
C. The task of assigning a classification to a set of examples 
D. None of these 
Ans: A
 

20. A binary attribute is


A. This takes only two values. In general, these values will be 0 and 1, and they can
be coded as one bit 
B. The natural environment of a certain species 
C. Systems that can be used without knowledge of internal operations 
D. None of these 
Ans: A
 

21. Classification accuracy is 


A. A subdivision of a set of examples into a number of classes 
B. Measure of the accuracy, of the classification of a concept that is given by a certain
theory 
C. The task of assigning a classification to a set of examples 
D. None of these 
Ans: B
 

22. A biotope is


A. This takes only two values. In general, these values will be 0 and 1 
and they can be coded as one bit. 
B. The natural environment of a certain species 
C. Systems that can be used without knowledge of internal operations 
D. None of these 
Ans: B
 

23. Cluster is 


A. Group of similar objects that differ significantly from other objects 
B. Operations on a database to transform or simplify data in order to prepare it for a
machine-learning algorithm 
C. Symbolic representation of facts or ideas from which information can potentially be
extracted 
D. None of these 
Ans: A
 

24. Black boxes are 


A. This takes only two values. In general, these values will be 0 and 1 
and they can be coded as one bit. 
B. The natural environment of a certain species 
C. Systems that can be used without knowledge of internal operations 
D. None of these 
Ans: C
 

25. A definition of a concept is ________ if it recognizes all the instances of that concept 


A. Complete 
B. Consistent 
C. Constant 
D. None of these 
Ans: A
 

26. Data mining is 


A. The actual discovery phase of a knowledge discovery process 
B. The stage of selecting the right data for a KDD process 
C. A subject-oriented integrated time variant non-volatile collection of data in support of
management 
D. None of these 
Ans: A
 

27. A definition of a concept is ________ if it classifies no examples as coming within the
concept that are not instances of it 
A. Complete 
B. Consistent 
C. Constant 
D. None of these 
Ans: B
 

28. Data independence means 


A. Data is defined separately and not included in programs 
B. Programs are not dependent on the physical attributes of data. 
C. Programs are not dependent on the logical attributes of data 
D. Both (B) and (C). 
Ans: D
 

29. Which symbol does the E-R model use to represent a weak entity set? 
A. Dotted rectangle 
B. Diamond 
C. Doubly outlined rectangle 
D. None of these 
Ans: C
 

30. SET concept is used in 


A. Network Model 
B. Hierarchical Model 
C. Relational Model 
D. None of these 
Ans: A
 
31. Relational Algebra is 
A. Data Definition Language 
B. Meta Language 
C. Procedural query Language 
D. None of the above 
Ans: C
 

32. Key to represent relationship between tables is called 


A. Primary key 
B. Secondary Key 
C. Foreign Key 
D. None of these 
Ans: C
 

33. ________ produces the relation that has the attributes of R1 and R2 
A. Cartesian product 
B. Difference 
C. Intersection 
D. Product 
Ans: A
 

34. Which of the following are the properties of entities? 


A. Groups 
B. Table 
C. Attributes 
D. Switchboards 
Ans: C
 

35. In a relation 
A. Ordering of rows is immaterial 
B. No two rows are identical 
C. (A) and (B) both are true 
D. None of these 
Ans: C
1. State whether the following statements about the three-tier data warehouse architecture
are True or False.
i) OLAP server is the middle tier of data warehouse architecture.
ii) The bottom tier of data warehouse architecture does not include a metadata repository.
A) i-True, ii-False
B) i-False, ii-True
C) i-True, ii-True
D) i-False, ii-False

2. The … of the data warehouse architecture contains query and reporting tools, analysis
tools, and data mining tools.
A) bottom tier
B) middle tier
C) top tier
D) both B and C

3. Which of the following are the examples of gateways of the bottom tier of data
warehouse architecture.
i) ODBC (Open Database Connection)
ii) OLEDB (Object Linking and Embedding Database)
iii) JDBC (Java Database Connection)
A) i and ii only
B) ii and iii only
C) i and iii only
D) All i, ii and iii
4. Back-end tools and utilities are used to feed data into the … from operational databases
or other external sources.
A) bottom tier
B) middle tier
C) top tier
D) both A and B

5. From the architecture point of view, there are… data warehouse models.
A) two
B) three
C) four
D) five

6. A … contains a subset of corporate-wide data that is of value to a specific group of users.


A) primary warehouse
B) virtual warehouse
C) enterprise warehouse
D) data mart
7. A … is a set of views over operational databases.
A) primary warehouse
B) virtual warehouse
C) enterprise warehouse
D) data mart
8. State whether the following statements about the enterprise warehouse are True or
False.
i) Enterprise warehouse contains details as well as summarized data.
ii) It provides corporate-wide data integration.
A) i-True, ii-False
B) i-False, ii-True
C) i-True, ii-True
D) i-False, ii-False

9. State whether the following statements about the OLTP system are True or False.
i) Clerk, database administrators, and database professionals are the users of the OLTP
system.
ii) It is used on long-term informational requirements.
iii) It has a short and simple transaction.
A) i and ii only
B) ii and iii only
C) i and iii only
D) All i, ii and iii

10. State whether the following statements about the OLAP system are True or False.
i) Knowledge workers such as managers, executive analysts are the users of the OLAP
system.
ii) This system is used in day-to-day operations.
iii) The database size of the OLAP system will be 100GB to TB.
A) i-True, ii-False, iii-True
B) i-False, ii-True, iii-True
C) i-True, ii-True, iii-False
D) i-False, ii-False, iii-True

11. Multidimensional model of a data warehouse can exist in the form of the following
schema.
i) Star Schema
ii) Snowflake Schema
iii) Fact Constellation Schema
A) i and ii only
B) ii and iii only
C) i and iii only
D) All i, ii and iii
12. In the … the dimension tables displayed in a radial pattern around the central fact
table.
A) snowflake schema
B) star schema
C) fact schema
D) fact constellation schema
13. The dimension tables of the … model can be kept in the normalized form to reduce the
redundancies.
A) snowflake schema
B) star schema
C) fact schema
D) fact constellation schema

14. State whether the following statements about the fact constellation schema are True or
False.
i) The fact constellation schema is also called galaxy schema.
ii) The fact constellation schema allows dimension tables to be shared between fact tables.
iii) This kind of schema can be viewed as a collection of snowflakes.
A) i-True, ii-False, iii-True
B) i-False, ii-True, iii-True
C) i-True, ii-True, iii-False
D) i-False, ii-False, iii-True

15. Which of the following are the different OLAP operations performed in the
multidimensional data model.
i) Roll-up
ii) Roll-down
iii) Drill-down
iv) Slice
A) i, ii, and iii only
B) ii, iii, and iv only
C) i, iii, and iv only
D) All i, ii, iii, and iv

16. When … operation is performed, one or more dimensions from the data cube are
removed.
A) roll-up
B) roll-down
C) drill-down
D) drill-up

17. The … operation selects one particular dimension from a given cube and provides a
new subcube.
A) drill
B) dice
C) pivot
D) slice

18. The … operation rotates the data axes in view in order to provide an alternative
presentation of data.
A) drill
B) dice
C) pivot
D) slice
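The OLAP operations asked about in questions 15–18 can be sketched on a toy cube stored as a Python dictionary; the dimensions (year, city, product) and cell values are hypothetical.

```python
from collections import defaultdict

# Toy cube: {(year, city, product): sales}
cube = {
    (2023, "NY", "pen"): 10, (2023, "NY", "ink"): 4,
    (2023, "LA", "pen"): 6,  (2024, "NY", "pen"): 8,
}

# Roll-up: remove the `product` dimension by aggregating over it.
rollup = defaultdict(int)
for (year, city, _product), sales in cube.items():
    rollup[(year, city)] += sales

# Slice: fix one dimension (year == 2023) to obtain a new subcube.
slice_2023 = {k: v for k, v in cube.items() if k[0] == 2023}

print(rollup[(2023, "NY")])  # 14
print(len(slice_2023))       # 3
```

Drill-down is the inverse of roll-up (moving to finer detail), and pivot merely rotates which dimensions appear on which axes of the presentation.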
19. Which of the following are the different types of OLAP servers.
i) Relational OLAP
ii) Multidimensional OLAP
iii) Hybrid OLAP
iv) Specialized SQL Servers
A) i, ii, and iii only
B) ii, iii, and iv only
C) i, iii, and iv only
D) All i, ii, iii, and iv

20. … servers allow storing a large data volume of detailed information.


A) Relational OLAP
B) Multidimensional OLAP
C) Hybrid OLAP
D) Specialized SQL Servers

Answers:
1. A) i-True, ii-False
2. C) top tier
3. D) All i, ii, and iii
4. A) bottom tier
5. B) three
6. D) data mart
7. B) virtual warehouse
8. C) i-True, ii-True
9. C) i and iii only
10. A) i-True, ii-False, iii-True
11. D) All i, ii, and iii
12. B) Star Schema
13. A) snowflake schema
14. C) i-True, ii-True, iii-False
15. C) i, iii, and iv only
16. A) roll-up
17. D) slice
18. C) pivot
19. D) All i, ii, iii, and iv
20. C) Hybrid OLAP

1. Data warehouse architecture is based on …………………..


A) DBMS
B) RDBMS
C) Sybase
D) SQL Server
2. …………………….. supports basic OLAP operations, including slice and dice, drill-down,
roll-up and pivoting.
A) Information processing
B) Analytical processing
C) Data mining
D) Transaction processing

3. The core of the multidimensional model is the ………………….. , which consists of a large set of
facts and a number of dimensions.
A) Multidimensional cube
B) Dimensions cube
C) Data cube
D) Data model

4. The data from the operational environment enter …………………… of data warehouse.
A) Current detail data
B) Older detail data
C) Lightly Summarized data
D) Highly summarized data

5. A data warehouse is ………………….


A) updated by end users.
B) contains numerous naming conventions and formats
C) organized around important subject areas
D) contain only current data

6. Business Intelligence and data warehousing is used for …………..


A) Forecasting
B) Data Mining
C) Analysis of large volumes of product sales data
D) All of the above
7. Data warehouse contains ……………. data that is never found in the operational environment.
A) normalized
B) informational
C) summary
D) denormalized

8. ………………. are responsible for running queries and reports against data warehouse
tables.
A) Hardware
B) Software
C) End users
D) Middle ware

9. The biggest drawback of the level indicator in the classic star schema is that it limits
…………
A) flexibility
B) quantify
C) qualify
D) ability

10. ……………………….. are designed to overcome any limitations placed on the warehouse
by the nature of the relational data model.
A) Operational database
B) Relational database
C) Multidimensional database
D) Data repository

ANSWERS:
1. Data warehouse architecture is based on …………………….
B) RDBMS

2. …………………….. supports basic OLAP operations, including slice and dice, drill-down,
roll-up and pivoting.
B) Analytical processing

3. The core of the multidimensional model is the ………………….. , which consists of a large
set of facts and a number of dimensions.
C) Data cube

4. The data from the operational environment enter …………………… of data warehouse.
A) Current detail data

5. A data warehouse is ………………….


C) organized around important subject areas

6. Business Intelligence and data warehousing is used for …………..


D) All of the above

7. Data warehouse contains ……………. data that is never found in the operational
environment.
C) summary

8. ………………. are responsible for running queries and reports against data warehouse
tables.
C) End users

9. The biggest drawback of the level indicator in the classic star schema is that it limits
…………
A) flexibility

10. ……………………….. are designed to overcome any limitations placed on the warehouse
by the nature of the relational data model.
C) Multidimensional database
1. The full form of OLAP is
A) Online Analytical Processing
B) Online Advanced Processing
C) Online Advanced Preparation
D) Online Analytical Performance

2. ……………………. is a subject-oriented, integrated, time-variant, nonvolatile collection of


data in support of management decisions.
A) Data Mining
B) Data Warehousing
C) Document Mining
D) Text Mining
3. The data is stored, retrieved and updated in ………………..
A) OLAP
B) OLTP
C) SMTP
D) FTP

4. An ……………… system is market-oriented and is used for data analysis by knowledge workers,
including managers, executives, and analysts.
A) OLAP
B) OLTP
C) Both of the above
D) None of the above

5. …………………… is a good alternative to the star schema.


A) Star schema
B) Snowflake schema
C) Fact constellation
D) Star-snowflake schema

6. The ………………………. exposes the information being captured, stored, and managed by
operational systems.
A) top-down view
B) data warehouse view
C) data source view
D) business query view

7. The type of relationship in star schema is ……………


A) many to many
B) one to one
C) one to many
D) many to one

8. The ……………… allows the selection of the relevant information necessary for the data
warehouse.
A) top-down view
B) data warehouse view
C) data source view
D) business query view

9. Which of the following is not a component of a data warehouse?


A) Metadata
B) Current detail data
C) Lightly summarized data
D) Component Key

10. Which of the following is not a kind of data warehouse application?


A) Information processing
B) Analytical processing
C) Data mining
D) Transaction processing

ANSWERS:
1. The full form of OLAP is
A) Online Analytical Processing

2. ……………………. is a subject-oriented, integrated, time-variant, nonvolatile collection of data in


support of management decisions.
B) Data Warehousing

3. The data is stored, retrieved and updated in ………………..


B) OLTP

4. An ……………… system is market-oriented and is used for data analysis by knowledge


workers, including managers, executives, and analysts.
A) OLAP

5. …………………… is a good alternative to the star schema.


C) Fact constellation

6. The ………………………. exposes the information being captured, stored, and managed by
operational systems.
C) data source view

7. The type of relationship in star schema is ……………


C) one to many

8. The ……………… allows the selection of the relevant information necessary for the data
warehouse.
A) top-down view

9. Which of the following is not a component of a data warehouse?


D) Component Key
10. Which of the following is not a kind of data warehouse application?
D) Transaction processing

1. What is Datawarehousing?

A Datawarehouse is a repository of data used for management decision
support systems. It consists of a wide variety of data that reflects
business conditions at a single point in time.

In single sentence, it is repository of integrated information which can be


available for queries and analysis.

2. What is Business Intelligence?

Business Intelligence, also known as DSS (Decision Support System), refers
to the technologies, applications and practices for the collection,
integration and analysis of business-related information or data. It also
helps users gain insight from the data itself.

3. What is Dimension Table?

A dimension table is a table which contains the attributes of the
measurements stored in fact tables. It consists of hierarchies, categories
and logic that can be used to traverse nodes.

4. What is Fact Table?


A fact table contains the measurements of business processes, and it
contains foreign keys for the dimension tables.

Example – If the business process is the manufacturing of bricks, then the
average number of bricks produced by one person or machine is a measure of
the business process.
5. What are the stages of Datawarehousing?

There are four stages of Datawarehousing:


 Offline Operational Database


 Offline Data Warehouse
 Real Time Datawarehouse
 Integrated Datawarehouse

6. What is Data Mining?

Data Mining is the process of analyzing data from different dimensions or
perspectives and summarizing it into useful information. The data can be
queried and retrieved from the database in its own format.

7. What is OLTP?

OLTP is abbreviated as On-Line Transaction Processing, and it is an
application that modifies data whenever it is received and supports a large
number of simultaneous users.

8. What is OLAP?
OLAP is abbreviated as Online Analytical Processing, and it is a system
which collects, manages and processes multi-dimensional data for analysis
and management purposes.

9. What is the difference between OLTP and OLAP?

Following are the differences between OLTP and OLAP:

OLTP                                     OLAP

Data is from the original data source    Data is from various data sources

Simple queries by users                  Complex queries by the system

Normalized small database                De-normalized large database

Fundamental business tasks               Multi-dimensional business tasks

10. What is ODS?

ODS is abbreviated as Operational Data Store and it is a repository of real time


operational data rather than long term trend data.

11. What is the difference between View and Materialized View?

A view is nothing but a virtual table which takes the output of the query and it
can be used in place of tables.

A materialized view is nothing but an indirect access to the table data by


storing the results of a query in a separate schema.
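The distinction can be sketched in Python, with a list standing in for a base table; the doubling "query" and the refresh step are made up for illustration. A view recomputes on every access, while a materialized view serves a stored result until it is refreshed.

```python
base_table = [1, 2, 3]

def view():
    # Virtual: recomputed from the base table on every access.
    return [x * 2 for x in base_table]

materialized_view = [x * 2 for x in base_table]  # stored query result

base_table.append(4)
print(view())             # [2, 4, 6, 8]  -- sees the new data immediately
print(materialized_view)  # [2, 4, 6]     -- stale until refreshed

materialized_view = [x * 2 for x in base_table]  # explicit refresh
print(materialized_view)  # [2, 4, 6, 8]
```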

12. What is ETL?

ETL is abbreviated as Extract, Transform and Load. ETL software is used to
read the data from a specified data source and extract a desired subset of
the data. Next, it transforms the data using rules and lookup tables and
converts it to the desired state.

Then, the load function is used to load the resulting data into the target database.
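A minimal end-to-end ETL sketch is below; the CSV source, the country lookup table, and the list standing in for the warehouse table are all hypothetical.

```python
import csv
import io

# Hypothetical source data: one malformed amount ("oops") to be rejected.
source = io.StringIO("id,country,amount\n1,US,10\n2,DE,oops\n3,US,5\n")
lookup = {"US": "United States", "DE": "Germany"}

# Extract: read rows from the source.
rows = list(csv.DictReader(source))

# Transform: apply the lookup table, cast amounts, reject bad records.
clean = []
for r in rows:
    try:
        amount = float(r["amount"])
    except ValueError:
        continue  # rule: drop records with non-numeric amounts
    clean.append({"id": int(r["id"]),
                  "country": lookup[r["country"]],
                  "amount": amount})

# Load: write the transformed rows into the target "table".
target_table = []
target_table.extend(clean)
print(len(target_table), target_table[0]["country"])
```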

13. What is VLDB?


VLDB is abbreviated as Very Large Database, and its size is more than one
terabyte. These are decision support systems which are used to serve a
large number of users.

14. What is real-time datawarehousing?

Real-time datawarehousing captures the business data whenever it occurs.
When a business activity is completed, its data enters the flow and becomes
available for use instantly.

15. What are Aggregate tables?

Aggregate tables are tables which contain existing warehouse data that has
been grouped to a certain level of dimensions. It is easier to retrieve data
from the aggregate tables than from the original table, which has a larger
number of records.

This table reduces the load in the database server and increases the
performance of the query.

16. What is factless fact tables?

A factless fact table is a fact table which doesn't contain any numeric
fact columns.

17. How can we load the time dimension?

Time dimensions are usually loaded with all possible dates in a year, and
this can be done through a program. Here, 100 years can be represented with
one row per day.
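Such a program can be sketched as follows; the attribute set (quarter, weekday, and so on) is illustrative, and real time dimensions often carry many more pre-computed calendar columns.

```python
from datetime import date, timedelta

def build_time_dimension(start_year, end_year):
    """Generate one dimension row per day for the given range of years."""
    rows, d = [], date(start_year, 1, 1)
    while d.year <= end_year:
        rows.append({"date": d.isoformat(),
                     "year": d.year,
                     "month": d.month,
                     "quarter": (d.month - 1) // 3 + 1,
                     "weekday": d.strftime("%A")})
        d += timedelta(days=1)
    return rows

dim = build_time_dimension(2024, 2024)
print(len(dim))           # 366 (2024 is a leap year)
print(dim[0]["quarter"])  # 1
```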

18. What are Non-additive facts?

Non-additive facts are facts that cannot be summed up over any of the
dimensions present in the fact table. If there are changes in the
dimensions, the same facts can still be useful.

19. What is conformed fact?

A conformed fact is one which can be used across multiple data marts in
combination with multiple fact tables.

20. What is Datamart?


A Datamart is a specialized version of a Datawarehouse, and it contains a
snapshot of operational data that helps business people make decisions based
on the analysis of past trends and experiences. A data mart emphasizes easy
access to relevant information.

21. What is Active Datawarehousing?

An active datawarehouse is a datawarehouse that enables decision makers


within a company or organization to manage customer relationships
effectively and efficiently.

22. What is the difference between Datawarehouse and OLAP?

A Datawarehouse is a place where all the data is stored for analysis, while
OLAP is used for analyzing the data, managing aggregations, and partitioning
information into minor-level detail.

23. What is ER Diagram?

ER diagram is abbreviated as Entity-Relationship diagram which illustrates the


interrelationships between the entities in the database. This diagram shows
the structure of each tables and the links between the tables.

24. What are the key columns in Fact and dimension tables?

Foreign keys of dimension tables are primary keys of entity tables. Foreign
keys of fact tables are the primary keys of the dimension tables.

25. What is SCD?

SCD is defined as slowly changing dimensions, and it applies to the cases


where record changes over time.

26. What are the types of SCD?

There are three types of SCD and they are as follows:

SCD 1 – The new record replaces the original record

SCD 2 – A new record is added to the existing customer dimension table

SCD 3 – The original data is modified to include the new data
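The three types can be illustrated on a hypothetical customer dimension. Type 2 is shown in code below; the docstring notes how types 1 and 3 would differ.

```python
# Hypothetical customer dimension with a current-row flag for SCD type 2.
customer_dim = [{"key": 1, "customer": "Ann", "city": "Boston", "current": True}]

def scd_type2(rows, customer, new_city):
    """Type 2: expire the current row and add a new one, keeping full history.

    (Type 1 would simply overwrite `city` in place; type 3 would keep the
    previous value in an extra column such as `prior_city`.)
    """
    for r in rows:
        if r["customer"] == customer and r["current"]:
            r["current"] = False
    rows.append({"key": len(rows) + 1, "customer": customer,
                 "city": new_city, "current": True})

scd_type2(customer_dim, "Ann", "Denver")
print(len(customer_dim))                                  # 2 rows: old and new
print([r["city"] for r in customer_dim if r["current"]])  # ['Denver']
```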


27. What is BUS Schema?

A BUS schema consists of a suite of conformed dimensions and standardized
definitions of fact tables.

28. What is Star Schema?

A star schema is a way of organizing the tables such that results can be
retrieved from the database quickly in the data warehouse environment.

29. What is Snowflake Schema?

A snowflake schema has a primary dimension table to which one or more
dimension tables can be joined. The primary dimension table is the only
table that can be joined with the fact table.

30. What is a core dimension?

A core dimension is a dimension table which is dedicated to a single fact
table or datamart.

31. What is called data cleaning?

The name itself is self-explanatory: it is the cleaning of orphan records,
data breaching business rules, inconsistent data, and missing information in
a database.

32. What is Metadata?

Metadata is defined as data about the data. The metadata contains


information like number of columns used, fix width and limited width, ordering
of fields and data types of the fields.

33. What are loops in Datawarehousing?

In datawarehousing, loops may exist between the tables. If there is a loop
between the tables, then query generation will take more time and it creates
ambiguity. It is advised to avoid loops between the tables.

34. Whether Dimension table can have numeric value?


Yes, dimension table can have numeric value as they are the descriptive
elements of our business.

35. What is the definition of Cube in Datawarehousing?

Cubes are logical representation of multidimensional data. The edge of the


cube has the dimension members, and the body of the cube contains the data
values.

36. What is called Dimensional Modelling?

Dimensional Modeling is a concept which can be used by data warehouse
designers to build their own datawarehouse. This model can be stored in two
types of tables – Fact and Dimension tables.

Fact table has facts and measurements of the business and dimension table
contains the context of measurements.

37. What are the types of Dimensional Modeling?

There are three types of Dimensional Modeling and they are as follows:

 Conceptual Modeling
 Logical Modeling
 Physical Modeling

38. What is surrogate key?

A surrogate key is a substitute for the natural primary key. It is a unique
identifier for each row that can be used as the primary key of a table.
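A sketch of the idea: a meaning-free, auto-incrementing integer is assigned to each row in place of the natural key (here, an email address). The table and attribute names are hypothetical; real warehouses assign these keys via sequences or identity columns.

```python
import itertools

# Counter standing in for a database sequence.
_next_key = itertools.count(1)

def add_row(table, natural_key, **attrs):
    """Insert a row, assigning the next surrogate key; return that key."""
    row = {"surrogate_key": next(_next_key), "email": natural_key, **attrs}
    table.append(row)
    return row["surrogate_key"]

customers = []
k1 = add_row(customers, "ann@example.com", name="Ann")
k2 = add_row(customers, "bob@example.com", name="Bob")
print(k1, k2)  # 1 2
```

Because the surrogate key carries no business meaning, it stays stable even if the natural key (the email) later changes.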

39. What is the difference between ER Modeling and Dimensional Modeling?

ER modeling has both a logical and a physical model, whereas Dimensional
modeling has only a physical model.

ER Modeling is used for normalizing an OLTP database design, whereas
Dimensional Modeling is used for de-normalizing ROLAP and MOLAP designs.

40. What are the steps to build the datawarehouse?


The following steps should be followed to build a data warehouse:

 Gathering business requirements
 Identifying the necessary sources
 Identifying the facts
 Defining the dimensions
 Defining the attributes
 Redefine the dimensions and attributes if required
 Organize the Attribute hierarchy
 Define Relationships
 Assign unique Identifiers

41. What are the different types of data warehousing?

Following are the different types of data warehousing:

 Enterprise Datawarehousing
 Operational Data Store
 Data Mart

42. What needs to be done while starting the database?

Following need to be done to start the database:

1. Start an Instance
2. Mount the database
3. Open the database

43. What needs to be done when the database is shut down?

The following needs to be done when the database is shut down:

1. Close the database
2. Dismount the database
3. Shutdown the Instance

44. Can we take backup when the database is opened?

Yes, we can take full backup when the database is opened.

45. What is defined as a Partial Backup?

A partial backup is a backup short of a full backup; it can be taken while
the database is open or shut down.

46. What is the goal of Optimizer?

The goal of the Optimizer is to find the most efficient way to execute SQL
statements.

47. What is Execution Plan?

An Execution Plan is the plan the optimizer uses to select the combination of
steps for executing a SQL statement.

48. What are the approaches used by Optimizer during execution plan?

There are two approaches:

1. Rule Based
2. Cost Based

49. What are the tools available for ETL?

Following are the ETL tools available:

Informatica
Data Stage
Oracle Warehouse Builder
Ab Initio
Data Junction

50. What is the difference between metadata and a data dictionary?

Metadata is defined as data about data. A data dictionary, on the other hand,
contains information about the project, graphs, Ab Initio commands, and
server information.
1. Data scrubbing is which of the following?
A. A process to reject data from the data warehouse and to create the necessary indexes
B. A process to load the data in the data warehouse and to create the necessary indexes
C. A process to upgrade the quality of data after it is moved into a data warehouse
D. A process to upgrade the quality of data before it is moved into a data warehouse
Answer: Option D

2. The @active data warehouse architecture includes which of the following?


A. At least one data mart
B. Data that can be extracted from numerous internal and external sources
C. Near real-time updates
D. All of the above

Answer: Option D

3. A goal of data mining includes which of the following?


A.To explain some observed event or condition
B.To confirm that data exists
C.To analyze data for expected relationships
D.To create a new data warehouse
Answer: Option A

4. An operational system is which of the following?


A.A system that is used to run the business in real-time and is based on historical data.
B.A system that is used to run the business in real-time and is based on current data.
C.A system that is used to support decision-making and is based on current data.
D.A system that is used to support decision-making and is based on historical data.

Answer: Option B

5. A data warehouse is which of the following?


A.Can be updated by end-users.
B.Contains numerous naming conventions and formats.
C.Organized around important subject areas.
D.Contains only current data.

Answer: Option C

6. A snowflake schema is which of the following types of tables?


A.Fact
B.Dimension
C.Helper
D.All of the above

Answer: Option D

7. The generic two-level data warehouse architecture includes which of the following?
A.At least one data mart
B.Data that can be extracted from numerous internal and external sources
C.Near real-time updates
D.All of the above

Answer: Option B

8. Fact tables are which of the following?


A.Completely denormalized
B.Partially denormalized
C.Completely normalized
D.Partially normalized

Answer: Option C

9. Data transformation includes which of the following?


A.A process to change data from a detailed level to a summary level
B.A process to change data from a summary level to a detailed level
C.Joining data from one source into various sources of data
D.Separating data from one source into various sources of data

Answer: Option A
10. Reconciled data is which of the following?
A.Data stored in the various operational systems throughout the organization.
B.Current data intended to be the single source for all decision support systems.
C.Data stored in one operational system in the organization.
D.Data that has been selected and formatted for end-user support applications.

Answer: Option B


11. The load and index is which of the following?


A.A process to reject data from the data warehouse and to create the necessary indexes
B.A process to load the data in the data warehouse and to create the necessary indexes
C.A process to upgrade the quality of data after it is moved into a data warehouse
D.A process to upgrade the quality of data before it is moved into a data warehouse

Answer: Option B

12. The extract process is which of the following?


A.Capturing all of the data contained in various operational systems
B.Capturing a subset of the data contained in various operational systems
C.Capturing all of the data contained in various decision support systems
D.Capturing a subset of the data contained in various decision support systems

Answer: Option B

13. A star schema has what type of relationship between a dimension and fact table?
A.Many-to-many
B.One-to-one
C.One-to-many
D.All of the above

Answer: Option C
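The one-to-many relationship in question 13 can be sketched with a minimal (hypothetical) star-schema fragment: each dimension row may be referenced by many fact rows through a foreign key.

```python
import sqlite3

# Hypothetical star-schema fragment: one dimension row, many fact rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE date_dim (date_sk INTEGER PRIMARY KEY, cal_date TEXT);
    CREATE TABLE sales_fact (
        sale_id INTEGER PRIMARY KEY,
        date_sk INTEGER REFERENCES date_dim(date_sk),  -- FK to the dimension
        amount  REAL
    );
    INSERT INTO date_dim VALUES (1, '2024-01-01');
    INSERT INTO sales_fact (date_sk, amount) VALUES (1, 10.0), (1, 25.0), (1, 5.0);
""")
# Three fact rows join to the single dimension row: one-to-many.
n = conn.execute(
    "SELECT COUNT(*) FROM sales_fact f JOIN date_dim d ON f.date_sk = d.date_sk"
).fetchone()[0]
print(n)  # 3
```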

14. Transient data is which of the following?


A.Data in which changes to existing records cause the previous version of the records to be
eliminated
B.Data in which changes to existing records do not cause the previous version of the records
to be eliminated
C.Data that are never altered or deleted once they have been added
D.Data that are never deleted once they have been added

Answer: Option A

15. A multifield transformation does which of the following?


A.Converts data from one field into multiple fields
B.Converts data from multiple fields into one field
C.Converts data from multiple fields into multiple fields
D.All of the above

Answer: Option D

16. A data mart is designed to optimize the performance for well-defined and predictable
uses.
A. True
B. False

Answer: Option A

17. Successful data warehousing requires that a formal program in total quality
management (TQM) be implemented.
A. True
B. False

Answer: Option A

18. Data in operational systems are typically fragmented and inconsistent.


A. True
B. False

Answer: Option A

19. Most operational systems are based on the use of transient data.
A. True
B. False

Answer: Option A

20. Independent data marts are often created because an organization focuses on a
series of short-term business objectives.
A. True
B. False
Answer: Option A


21. Joining is the process of partitioning data according to predefined criteria.


A. True
B. False

Answer: Option B
22. The role of the ETL process is to identify erroneous data and to fix them.
A. True
B. False

Answer: Option B

23. Data in the data warehouse are loaded and refreshed from operational systems.
A. True
B. False

Answer: Option A

24. Star schema is suited to online transaction processing and therefore is generally
used in operational systems, operational data stores, or an EDW.
A. True
B. False

Answer: Option B

25. Periodic data are data that are physically altered once added to the store.
A. True
B. False

Answer: Option B

26. Both status data and event data can be stored in a database.
A. True
B. False

Answer: Option A

27. Static extract is used for ongoing warehouse maintenance.


A. True
B. False

Answer: Option B

28. Data scrubbing can help upgrade data quality; it is not a long-term solution to the
data quality problem.
A. True
B. False

Answer: Option A

29. Every key used to join the fact table with a dimensional table should be a surrogate
key.
A. True
B. False

Answer: Option A
30. Derived data are detailed, current data intended to be the single, authoritative source
for all decision support applications.
A. True
B. False

Answer: Option B

ETL Process and OLAP


Module 2

1. All data in flat file is in this format.


A. Sort
B. ETL
C. Format
D. String

Ans: D

2. It is used to push data into a relational database table. This control will be the
destination for most fact table data flows.
A. Web Scraping
B. Data inspection
C. OLE DB Source
D. OLE DB Destination

Ans: D

3. Logical Data Maps


A. These are used to identify which fields from which sources are going to which destinations.
It allows the ETL developer to identify if there is a need to do a data type change or
aggregation prior to beginning coding of an ETL process.
B. These can be used to flag an entire file-set that is ready for processing by the ETL process.
It contains no meaningful data, but the fact that it exists is the key to the process.
C. Data is pulled from multiple sources to be merged into one or more destinations.
D. It is used to massage data in transit between the source and destination.

Ans: A

4. Data access methods.


A. Pull Method
B. Push and Pull
C. Load in Parallel
D. Union all

Ans: B

5. OLTP
A. Process to move data from a source to destination.
B. Transactional database that is typically attached to an application. This source provides the
benefit of known data types and standardized access methods. This system enforces data
integrity.
C. All data in flat file is in this format.
D. This control can be used to add columns to the stream or make modifications to data
within the stream. Should be used for simple modifications.

Ans: B

6. COBOL
A. Process to move data from a source to destination.
B. The easiest to consume from the ETL standpoint.
C. Two methods to ensure data integrity.
D. Many routines of the Mainframe system are written in this.

Ans: D

7. What ETL Stands for


A. Data inspection
B. Transformation
C. Extract, Transform, Load
D. Data Flow

Ans: C

8. The source system initiates the data transfer for the ETL process. This method is
uncommon in practice, as each system would have to move the data to the ETL process
individually.
A. Custom
B. Automation
C. Pull Method
D. Push Method

Ans: D

9. Sentinel Files
A. These are used to identify which fields from which sources are going to which destinations.
It allows the ETL developer to identify if there is a need to do a data type change or
aggregation prior to beginning coding of an ETL process.
B. These can be used to flag an entire file-set that is ready for processing by the ETL process.
It contains no meaningful data, but the fact that it exists is the key to the process.
C. ETL can be used to automate the movement of data between two locations. This
standardizes the process so that the load is done the same way every run.
D. This is used to create multiple streams within a data flow from a single stream. All records
in the stream are sent down all paths. Typically uses a merge-join to recombine the streams
later in the data flow.

Ans: B

10. Checkpoints
A. Similar to “break up processes”, checkpoints provide markers for what data has been
processed in case an error occurs during the ETL process.
B. Similar to XML’s structured text file.
C. Many routines of the Mainframe system are written in this.
D. It is used to import text files for ETL processing.

Ans: A


11. Mainframe systems use this. This requires a conversion to the more common ASCII
format.
A. ETL
B. XML
C. Sort
D. EBCDIC

Ans: D

12. Ultimate flexibility, unit testing is available, usually poor documentation.


A. ETL
B. Custom
C. OLTP
D. Sort

Ans: B

13. Conditional Split


A. Many routines of the Mainframe system are written in this.
B. Data is pulled from multiple sources to be merged into one or more destinations.
C. It allows multiple streams to be created from a single stream. Only rows that match the
criteria for a given path are sent down that path.
D. This is used to create multiple streams within a data flow from a single stream. All records
in the stream are sent down all paths. Typically uses a merge-join to recombine the streams
later in the data flow.

Ans: C

14. Flat files


A. The easiest to consume from the ETL standpoint.
B. Three components of data flow.
C. Three common usages of ETL.
D. Two methods to ensure data integrity.

Ans: A

15. This is used to create multiple streams within a data flow from a single stream. All
records in the stream are sent down all paths. Typically uses a merge-join to recombine
the streams later in the data flow.
A. OLTP
B. Mainframe
C. EBCDIC
D. Multicast

Ans: D

16. There are little to no benefits to the ETL developer when accessing these types of
systems and many detriments. The ability to access these systems is very limited and
typically FTP of text files is used to facilitate access.
A. Mainframe
B. Union all
C. File Name
D. Multicast

Ans: A

17. Shows the path to the file to be imported.


A. File Name
B. Mainframe
C. Format
D. Union all

Ans: A

18. Wheel is already invented, documented, good support.


A. Format
B. COBOL
C. Tool Suite
D. Flat files

Ans: C

19. Similar to XML’s structured text file.


A. Data Scrubbing
B. EBCDIC
C. String
D. Web Scraping

Ans: D

20. Flat file control


A. Three components of data flow.
B. It is used to import text files for ETL processing.
C. The easiest to consume from the ETL standpoint.
D. Shows the path to the file to be imported.

Ans: B


21. Two methods to ensure data integrity.


A. Sources, Transformation, Destination
B. Data inspection
C. Row Count Inspection, Data Inspection
D. Row Count Inspection

Ans: C

22. Transformation
A. Data is pulled from multiple sources to be merged into one or more destinations.
B. It is used to import text files for ETL processing.
C. Process to move data from a source to destination.
D. It is used to massage data in transit between the source and destination.

Ans: D

23. Three common usages of ETL.


A. Data Scrubbing
B. Sources, Transformation, Destination
C. Merging Data
D. Merging Data, Data Scrubbing, Automation

Ans: D

24. Load in Parallel


A. A value of delimited should be selected for delimited files.
B. Data is pulled from multiple sources to be merged into one or more destinations.
C. This will reduce the run time of the ETL process and reduce the window for hardware failure
to affect the process.
D. This should be checked if column names have been included in the first row of the file.

Ans: C

25. This can be computationally expensive excluding SSD.


A. Hard Drive I/O
B. Mainframe
C. Tool Suite
D. Data Scrubbing

Ans: A

26. A value of delimited should be selected for delimited files.


A. Sort
B. Format
C. String
D. OLTP
Ans: B

27. This should be checked if column names have been included in the first row of the file.
A. Row Count Inspection, Data Inspection
B. Format of the Date
C. Column names in the first data row checkbox
D. Do most work in transformation phase

Ans: C

28. OLAP stands for


a) Online analytical processing
b) Online analysis processing
c) Online transaction processing
d) Online aggregate processing

Answer: a

29. Data that can be modeled as dimension attributes and measure attributes are called
_______ data.
a) Multidimensional
b) Single Dimensional
c) Measured
d) Dimensional

Answer: a

30. The generalization of cross-tab which is represented visually is ____________ which
is also called a data cube.
a) Two dimensional cube
b) Multidimensional cube
c) N-dimensional cube
d) Cuboid

Answer: a


31. The process of viewing the cross-tab (Single dimensional) with a fixed value of one
attribute is
a) Slicing
b) Dicing
c) Pivoting
d) Both Slicing and Dicing

Answer: a
32. The operation of moving from finer-granularity data to a coarser granularity (by
means of aggregation) is called a ________
a) Rollup
b) Drill down
c) Dicing
d) Pivoting

Answer: a

33. In SQL the cross-tabs are created using


a) Slice
b) Dice
c) Pivot
d) All of the mentioned

Answer: a

34.{ (item name, color, clothes size), (item name, color), (item name, clothes size), (color,
clothes size), (item name), (color), (clothes size), () }
This can be achieved by using which of the following ?
a) group by rollup
b) group by cubic
c) group by
d) none of the mentioned

Answer: d
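The set listed in question 34 is exactly the powerset of {item name, color, clothes size}: 2³ = 8 grouping sets, which is what GROUP BY CUBE (the listed option misspells it as "cubic") would produce. A quick sketch of the enumeration:

```python
from itertools import combinations

attrs = ("item_name", "color", "clothes_size")

# GROUP BY CUBE generates one grouping set per subset of the attributes,
# including the empty set (the grand total): 2^n subsets in total.
grouping_sets = [
    subset for r in range(len(attrs), -1, -1) for subset in combinations(attrs, r)
]
print(len(grouping_sets))  # 2**3 = 8
```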

35. What do data warehouses support?


a) OLAP
b) OLTP
c) OLAP and OLTP
d) Operational databases
Answer: a

36. SELECT item_name, color, clothes_size, SUM(quantity)
FROM sales
GROUP BY ROLLUP(item_name, color, clothes_size);
How many groupings are possible in this rollup?
a) 8
b) 4
c) 2
d) 1
Answer: b
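Unlike CUBE, GROUP BY ROLLUP(a, b, c) produces only the prefix groupings — (a, b, c), (a, b), (a), and () — hence 4 grouping sets. Sketched in Python:

```python
attrs = ("item_name", "color", "clothes_size")

# ROLLUP produces one grouping per prefix of the attribute list, plus the
# empty grouping (grand total): n + 1 grouping sets for n attributes.
rollup_sets = [attrs[:k] for k in range(len(attrs), -1, -1)]
print(len(rollup_sets))  # 4
```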

37. Which one of the following is the right syntax for DECODE?
a) DECODE (search, expression, result [, search, result]… [, default])
b) DECODE (expression, result [, search, result]… [, default], search)
c) DECODE (search, result [, search, result]… [, default], expression)
d) DECODE (expression, search, result [, search, result]… [, default])
Answer: d
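Oracle's DECODE compares an expression against each search value in turn and returns the paired result, falling back to the default if no search matches. A hedged Python emulation of that behavior (this is a sketch of the semantics, not Oracle itself):

```python
def decode(expression, *pairs_and_default):
    """Mimic Oracle DECODE(expression, search, result [, search, result]... [, default])."""
    # Walk the (search, result) pairs; an odd trailing argument is the default.
    args = list(pairs_and_default)
    default = args.pop() if len(args) % 2 == 1 else None
    for search, result in zip(args[::2], args[1::2]):
        if search == expression:
            return result
    return default

print(decode("M", "M", "Male", "F", "Female", "Unknown"))  # Male
print(decode("X", "M", "Male", "F", "Female", "Unknown"))  # Unknown
```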
Introduction to Data Mining, Data Exploration and Preprocessing
Module 3

1. Data mining refers to ______


a) Special fields for database
b) Knowledge discovery from large database
c) Knowledge base for the database
d) Collections of attributes

Answer: B

2. An attribute is a ____
a) Normalization of Fields
b) Property of the class
c) Characteristics of the object
d) Summarise value

Answer: C

3. Which are not related to Ratio Attributes?


a) Age Group 10-20, 30-50, 35-45 (in Years)
b) Mass 20-30 kg, 10-15 kg
c) Areas 10-50, 50-100 (in Kilometres)
d) Temperature 10°-20°, 30°-50°, 35°-45°

Answer: D

4. The mean is the ________ of a dataset.


a) Average
b) Middle
c) Central
d) Ordered

Answer: A

5. The number that occurs most often within a set of data called as ______
a) Mean
b) Median
c) Mode
d) Range

Answer: C

6. Find the range for given data 40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50
a) 19
b) 29
c) 35
d) 49

Answer: B
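The answers to questions 4–6 can be checked directly: range is max − min = 55 − 26 = 29, and the mean and mode come from the same data. A quick verification with Python's statistics module:

```python
import statistics

data = [40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50]

data_range = max(data) - min(data)  # 55 - 26 = 29
mean = statistics.mean(data)        # average of the values
mode = statistics.mode(data)        # most frequent value (40 and 50 both occur
                                    # twice; mode returns the first encountered, 40)
print(data_range)  # 29
```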
7. Which of the following is not part of the KDD process?
a) Selection
b) Pre-processing
c) Reduction
d) Summation

Answer: D

8. _______ is the output of KDD Process.


a) Query
b) Useful Information
c) Information
d) Data

Answer: B

9. Data mining turns a large collection of data into _____


a) Database
b) Knowledge
c) Queries
d) Transactions

Answer: B

10. In the KDD process, the step in which data relevant to the analysis task are retrieved
from the database is called _____
a) Data Selection
b) Data Collection
c) Data Warehouse
d) Data Mining

Answer: A


11. In the KDD process, the step in which data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations is called _____
a) Data Selection
b) Data Transformation
c) Data Reduction
d) Data Cleaning

Answer: B

12. What kinds of data can be mined?


a) Database data
b) Data Warehouse data
c) Transactional data
d) All of the above

Answer: D

13. Data selection is _____


a) The actual discovery phase of a knowledge discovery process
b) The stage of selecting the right data for a KDD process
c) A subject-oriented integrated time-variant non-volatile collection of data in support of
management
d) Record oriented classes finding

Answer: B

14. To remove noise and inconsistent data ____ is needed.


a) Data Cleaning
b) Data Transformation
c) Data Reduction
d) Data Integration

Answer: A

15. The step in which multiple data sources are combined is called _____


a) Data Reduction
b) Data Cleaning
c) Data Integration
d) Data Transformation

Answer: C

16. A _____ is a collection of tables, each of which is assigned a unique name which uses
the entity-relationship (ER) data model.
a) Relational database
b) Transactional database
c) Data Warehouse
d) Spatial database

Answer: A

17. Relational data can be accessed by _____ written in a relational query language.
a) Select
b) Queries
c) Operations
d) Like

Answer: B

18. _____ studies the collection, analysis, interpretation or explanation, and
presentation of data.
a) Statistics
b) Visualization
c) Data Mining
d) Clustering

Answer: A

19. ______ investigates how computers can learn (or improve their performance) based
on data.
a) Machine Learning
b) Artificial Intelligence
c) Statistics
d) Visualization

Answer: A

20. _____ is the science of searching for documents or information in documents.


a) Data Mining
b) Information Retrieval
c) Text Mining
d) Web Mining

Answer: B


21. Data often contain _____


a) Target Class
b) Uncertainty
c) Methods
d) Keywords

Answer: B

22. The data mining process should be highly ______


a) On Going
b) Active
c) Interactive
d) Flexible

Answer: C

23. In the real-world multidimensional view of data mining, the major dimensions are data,
knowledge, technologies, and _____
a) Methods
b) Applications
c) Tools
d) Files
Answer: B

24. An _____ is a data field, representing a characteristic or feature of a data object.


a) Method
b) Variable
c) Task
d) Attribute

Answer: D

25. The values of a _____ attribute are symbols or names of things.


a) Ordinal
b) Nominal
c) Ratio
d) Interval

Answer: B

26. “Data about data” is referred to as _____


a) Information
b) Database
c) Metadata
d) File

Answer: C

27. ______ partitions the objects into different groups.


a) Mapping
b) Clustering
c) Classification
d) Prediction

Answer: B

28. In _____, the attribute data are scaled so as to fall within a smaller range, such as
-1.0 to 1.0, or 0.0 to 1.0.
a) Aggregation
b) Binning
c) Clustering
d) Normalization

Answer: D
29. Normalization by ______ normalizes by moving the decimal point of values of
attributes.
a) Z-Score
b) Z-Index
c) Decimal Scaling
d) Min-Max Normalization
Answer: C

30. _______ is a top-down splitting technique based on a specified number of bins.


a) Normalization
b) Binning
c) Clustering
d) Classification

Answer: B
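Questions 28 and 29 can be made concrete with a small sketch (the values below are purely illustrative): min-max normalization rescales into [0.0, 1.0], while decimal scaling divides by a power of 10 large enough to bring every value below 1 in absolute terms.

```python
values = [200, 300, 400, 600, 1000]

# Min-max normalization into [0.0, 1.0]
lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) for v in values]

# Decimal scaling: divide by 10^j where j is the smallest integer such that
# max(|v|) / 10^j < 1; here max is 1000, so j = 4 (taken as the digit count).
j = len(str(max(abs(v) for v in values)))
decimal_scaled = [v / 10**j for v in values]

print(minmax)          # [0.0, 0.125, 0.25, 0.5, 1.0]
print(decimal_scaled)  # [0.02, 0.03, 0.04, 0.06, 0.1]
```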

Classification, Prediction and Clustering


Module 4

1. How many terms are required for building a Bayes model?


a) 1
b) 2
c) 3
d) 4

Answer: c

2. What is needed to make probabilistic systems feasible in the world?


a) Reliability
b) Crucial robustness
c) Feasibility
d) None of the mentioned

Answer: b

3. Where can the Bayes rule be used?


a) Solving queries
b) Increasing complexity
c) Decreasing complexity
d) Answering probabilistic query

Answer: d
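Bayes' rule, P(A|B) = P(B|A)·P(A) / P(B), is how such probabilistic queries are answered. An illustrative computation (the numbers are made up for the example): a diagnostic test with 99% sensitivity and a 5% false-positive rate for a condition with 1% prevalence.

```python
# Hypothetical numbers purely for illustration.
p_disease = 0.01            # prior P(A)
p_pos_given_disease = 0.99  # likelihood P(B|A), sensitivity
p_pos_given_healthy = 0.05  # false-positive rate (1 - specificity)

# Total probability of a positive test, P(B)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' rule: posterior P(A|B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 4))  # 0.1667
```

Despite the strong test, the posterior is only about 17% because the prior is so low, which is exactly the kind of query a Bayesian model answers.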

4. What does the Bayesian network provide?


a) Complete description of the domain
b) Partial description of the domain
c) Complete description of the problem
d) None of the mentioned

Answer: a

5. How can the entries in the full joint probability distribution be calculated?
a) Using variables
b) Using information
c) Both Using variables & information
d) None of the mentioned
Answer: b

6. How can the Bayesian network be used to answer any query?


a) Full distribution
b) Joint distribution
c) Partial distribution
d) All of the mentioned

Answer: b

7. How can the compactness of the Bayesian network be described?


a) Locally structured
b) Fully structured
c) Partial structure
d) All of the mentioned

Answer: a

8. With which is the local structure associated?


a) Hybrid
b) Dependant
c) Linear
d) None of the mentioned

Answer: c

9. Which condition is used to influence a variable directly by all the others?


a) Partially connected
b) Fully connected
c) Local connected
d) None of the mentioned

Answer: b

10. What is the relationship between a node and its predecessors while creating a
Bayesian network?
a) Functionally dependent
b) Dependent
c) Conditionally independent
d) Both Conditionally dependent & Dependent

Answer: c


11. A _________ is a decision support tool that uses a tree-like graph or model of
decisions and their possible consequences, including chance event outcomes, resource
costs, and utility.
a) Decision tree
b) Graphs
c) Trees
d) Neural Networks

Answer: a

12. Decision Tree is a display of an algorithm.


a) True
b) False

Answer: a

13. What is Decision Tree?


a) Flow-Chart
b) Structure in which internal node represents test on an attribute, each branch represents
outcome of test and each leaf node represents class label
c) Flow-Chart & Structure in which internal node represents test on an attribute, each branch
represents outcome of test and each leaf node represents class label
d) None of the mentioned

Answer: c

14. Decision Trees can be used for Classification Tasks.


a) True
b) False

Answer: a

15. Choose from the following that are Decision Tree nodes?
a) Decision Nodes
b) End Nodes
c) Chance Nodes
d) All of the mentioned

Answer: d

16. Decision Nodes are represented by ____________


a) Disks
b) Squares
c) Circles
d) Triangles

Answer: b

17. Chance Nodes are represented by __________


a) Disks
b) Squares
c) Circles
d) Triangles

Answer: c

18. End Nodes are represented by __________


a) Disks
b) Squares
c) Circles
d) Triangles

Answer: d

19. Which of the following are the advantage/s of Decision Trees?


a) Possible Scenarios can be added
b) Use a white box model, If given result is provided by a model
c) Worst, best and expected values can be determined for different scenarios
d) All of the mentioned

Answer: d

20. Which of the following is the valid component of the predictor?


a) data
b) question
c) algorithm
d) all of the mentioned

Answer: d


21. Point out the wrong statement.


a) In Sample Error is also called generalization error
b) Out of Sample Error is the error rate you get on the new dataset
c) In Sample Error is also called resubstitution error
d) All of the mentioned

Answer: a

22. Which of the following is correct order of working?


a) questions->input data ->algorithms
b) questions->evaluation ->algorithms
c) evaluation->input data ->algorithms
d) all of the mentioned

Answer: a
23. Which of the following shows correct relative order of importance?
a) question->features->data->algorithms
b) question->data->features->algorithms
c) algorithms->data->features->question
d) none of the mentioned

Answer: b

24. Point out the correct statement.


a) In Sample Error is the error rate you get on the same dataset used to model a predictor
b) Data have two parts-signal and noise
c) The goal of predictor is to find signal
d) None of the mentioned

Answer: d

25. Which of the following is characteristic of best machine learning method?


a) Fast
b) Accuracy
c) Scalable
d) All of the mentioned

Answer: d

26. True positive means correctly rejected.


a) True
b) False

Answer: b

27. Which of the following trade-off occurs during prediction?


a) Speed vs Accuracy
b) Simplicity vs Accuracy
c) Scalability vs Accuracy
d) None of the mentioned

Answer: d

28. Which of the following expression is true?


a) In sample error < out sample error
b) In sample error > out sample error
c) In sample error = out sample error
d) All of the mentioned

Answer: a

29. Backtesting is a key component of effective trading-system development.


a) True
b) False
Answer: a

30. Which of the following is correct use of cross validation?


a) Selecting variables to include in a model
b) Comparing predictors
c) Selecting parameters in prediction function
d) All of the mentioned

Answer: d
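All three uses in question 30 rest on the same mechanism: the data are split into k folds, and each fold serves once as the held-out test set. A hedged stdlib-only sketch of the index splitting (no shuffling, for clarity):

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation (no shuffling)."""
    # Distribute n indices over k folds as evenly as possible.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

folds = list(kfold_indices(10, 5))
print(len(folds))  # 5 folds; each index is held out exactly once
```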


31. Point out the wrong combination.


a) True negative=correctly rejected
b) False negative=correctly rejected
c) False positive=correctly identified
d) All of the mentioned

Answer: c

32. Which of the following is a common error measure?


a) Sensitivity
b) Median absolute deviation
c) Specificity
d) All of the mentioned

Answer: d

33. Which of the following is not a machine learning algorithm?


a) SVG
b) SVM
c) Random forest
d) None of the mentioned

Answer: a

34. Point out the wrong statement.


a) ROC curve stands for receiver operating characteristic
b) For time series, data must be in chunks
c) Random sampling must be done with replacement
d) None of the mentioned

Answer: d

35. Which of the following is a categorical outcome?


a) RMSE
b) RSquared
c) Accuracy
d) All of the mentioned

Answer: c

36. For k cross-validation, larger k value implies more bias.


a) True
b) False

Answer: b

37. Which of the following method is used for trainControl resampling?


a) repeatedcv
b) svm
c) bag32
d) none of the mentioned

Answer: a

38. Which of the following can be used to create the most common graph types?
a) qplot
b) quickplot
c) plot
d) all of the mentioned

Answer: a

39. For k cross-validation, smaller k value implies less variance.


a) True
b) False

Answer: a

40. Predicting with trees evaluates _____________ within each group of data.
a) equality
b) homogeneity
c) heterogeneity
d) all of the mentioned

Answer: b
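One common homogeneity measure used when splitting tree nodes is Gini impurity, 1 − Σ pᵢ²: it is 0 for a perfectly homogeneous group and grows as the group becomes more mixed. A small sketch (the labels are illustrative):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 0.0 for a perfectly homogeneous group, higher when mixed."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini_impurity(["yes", "yes", "yes", "yes"]))  # 0.0 (pure group)
print(gini_impurity(["yes", "no", "yes", "no"]))    # 0.5 (maximally mixed)
```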


41. Point out the wrong statement.


a) Training and testing data must be processed in different way
b) Test transformation would mostly be imperfect
c) The first goal is statistical and second is data compression in PCA
d) All of the mentioned

Answer: a

42. Which of the following method options is provided by train function for bagging?
a) bagEarth
b) treebag
c) bagFDA
d) all of the mentioned

Answer: d

43. Which of the following is correct with respect to random forest?


a) Random forest are difficult to interpret but often very accurate
b) Random forest are easy to interpret but often very accurate
c) Random forest are difficult to interpret but very less accurate
d) None of the mentioned

Answer: a

44. Point out the correct statement.


a) Prediction with regression is easy to implement
b) Prediction with regression is easy to interpret
c) Prediction with regression performs well when linear model is correct
d) All of the mentioned

Answer: d

45. Which of the following library is used for boosting generalized additive models?
a) gamBoost
b) gbm
c) ada
d) all of the mentioned

Answer: a

46. The principal components are equal to left singular values if you first scale the
variables.
a) True
b) False

Answer: b

47. Which of the following is statistical boosting based on additive logistic regression?
a) gamBoost
b) gbm
c) ada
d) mboost
Answer: a

48. Which of the following is one of the largest boost subclass in boosting?
a) variance boosting
b) gradient boosting
c) mean boosting
d) all of the mentioned

Answer: b

49. PCA is most useful for non-linear type models.


a) True
b) False

Answer: b

50. Which of the following clustering type has characteristic shown in the below figure?

a) Partitional
b) Hierarchical
c) Naive bayes
d) None of the mentioned

Answer: b

51. Point out the correct statement.


a) The choice of an appropriate metric will influence the shape of the clusters
b) Hierarchical clustering is also called HCA
c) In general, the merges and splits are determined in a greedy manner
d) All of the mentioned

Answer: d

52. Which of the following is finally produced by Hierarchical Clustering?


a) final estimate of cluster centroids
b) tree showing how close things are to each other
c) assignment of each point to clusters
d) all of the mentioned
Answer: b

53. Which of the following is required by K-means clustering?


a) defined distance metric
b) number of clusters
c) initial guess as to cluster centroids
d) all of the mentioned

Answer: d
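Q53's three requirements (a distance metric, the number of clusters, and an initial guess for the centroids) are exactly the inputs of a bare-bones k-means loop. A toy pure-Python sketch on 1-D data (the function name and data are illustrative, not from any library):

```python
def kmeans_1d(points, centroids, iters=10):
    """Toy k-means on 1-D data: needs a distance metric (absolute
    difference), the number of clusters (len(centroids)), and an
    initial centroid guess, matching Q53's answer."""
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: recompute each centroid as its cluster mean.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

data = [1.0, 1.2, 0.8, 10.0, 10.4, 9.6]
print(kmeans_1d(data, centroids=[0.0, 5.0]))  # converges near [1.0, 10.0]
```

The loop also shows why k-means is not deterministic in general (Q59/Q61): a different initial guess can converge to different clusters.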

54. Point out the wrong statement.


a) k-means clustering is a method of vector quantization
b) k-means clustering aims to partition n observations into k clusters
c) k-nearest neighbor is the same as k-means
d) none of the mentioned

Answer: c

55. Which of the following combination is incorrect?


a) Continuous – euclidean distance
b) Continuous – correlation similarity
c) Binary – manhattan distance
d) None of the mentioned

Answer: d

56. Hierarchical clustering should be primarily used for exploration.


a) True
b) False

Answer: a

57. Which of the following function is used for k-means clustering?


a) k-means
b) k-mean
c) heatmap
d) none of the mentioned

Answer: a

58. Which of the following clustering requires merging approach?


a) Partitional
b) Hierarchical
c) Naive Bayes
d) None of the mentioned

Answer: b
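Q58's "merging approach" is agglomerative hierarchical clustering: start with every point as its own cluster and repeatedly merge the two closest clusters. A single-linkage sketch in plain Python on 1-D points (all names and data here are illustrative):

```python
def agglomerate(points, k):
    """Merge the two closest clusters (single linkage) until k remain."""
    clusters = [[p] for p in points]          # every point starts alone
    while len(clusters) > k:
        # Find the pair of clusters with the smallest gap between members.
        best = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: min(abs(a - b)
                                      for a in clusters[ij[0]]
                                      for b in clusters[ij[1]]))
        i, j = best
        clusters[i].extend(clusters.pop(j))   # merge cluster j into i
    return [sorted(c) for c in clusters]

print(agglomerate([1, 2, 9, 10, 5], k=2))  # [[1, 2, 5], [9, 10]]
```

Recording the sequence of merges, rather than just the final partition, is what yields the dendrogram that hierarchical clustering finally produces (Q52/Q64).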
59. K-means is not deterministic and it also consists of a number of iterations.
a) True
b) False

Answer: a

60. Hierarchical clustering should be mainly used for exploration.


a) True
b) False

Answer: a

61. K-means clustering consists of a number of iterations and is not deterministic.


a) True
b) False

Answer: a

62. Which is needed by K-means clustering?


a) defined distance metric
b) number of clusters
c) initial guess as to cluster centroids
d) all of these

Answer: d

63. Which function is used for k-means clustering?


a) k-means
b) k-mean
c) heatmap
d) none of the mentioned

Answer: a

64. Which is conclusively produced by Hierarchical Clustering?


a) final estimation of cluster centroids
b) tree showing how nearby things are to each other
c) assignment of each point to clusters
d) all of these

Answer: b

65. Which clustering technique requires a merging approach?


a) Partitional
b) Hierarchical
c) Naive Bayes
d) None of the mentioned

Answer: b

Mining Frequent Patterns and Association Rules


Module 5

1. A collection of one or more items is called as _____


a) Itemset
b) Support
c) Confidence
d) Support Count

Answer: A

2. Frequency of occurrence of an itemset is called as _____


a) Support
b) Confidence
c) Support Count
d) Rules

Answer: C

3. An itemset whose support is greater than or equal to a minimum support threshold is ______


a) Itemset
b) Frequent Itemset
c) Infrequent items
d) Threshold values

Answer: B

4. What does FP growth algorithm do?


a) It mines all frequent patterns through pruning rules with lesser support
b) It mines all frequent patterns through pruning rules with higher support
c) It mines all frequent patterns by constructing a FP tree
d) It mines all frequent patterns by constructing itemsets

Answer: C

5. What techniques can be used to improve the efficiency of apriori algorithm?


a) Hash-based techniques
b) Transaction Increases
c) Sampling
d) Cleaning

Answer: A
6. What do you mean by support(A)?
a) Total number of transactions containing A
b) Total Number of transactions not containing A
c) Number of transactions containing A / Total number of transactions
d) Number of transactions not containing A / Total number of transactions

Answer: C

7. How do you calculate Confidence (A -> B)?


a) Support(A ∪ B) / Support(A)
b) Support(A ∪ B) / Support(B)
c) Support(A ∩ B) / Support(A)
d) Support(A ∩ B) / Support(B)

Answer: A
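Q6 and Q7 can be checked directly: support(A) is the fraction of transactions containing A, and confidence(A → B) = support(A ∪ B) / support(A). A short sketch (the transactions are made up for illustration):

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset (Q6, option c)."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b):
    """confidence(A -> B) = support(A union B) / support(A) (Q7, option a)."""
    return support(a | b) / support(a)

print(support({"bread"}))               # 3 of 4 transactions = 0.75
print(confidence({"bread"}, {"milk"}))  # 0.5 / 0.75 = 2/3
```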

8. Which of the following is the direct application of frequent itemset mining?


a) Social Network Analysis
b) Market Basket Analysis
c) Outlier Detection
d) Intrusion Detection

Answer: B

9. What is not true about FP growth algorithms?


a) It mines frequent itemsets without candidate generation
b) There are chances that FP trees may not fit in the memory
c) FP trees are very expensive to build
d) It expands the original database to build FP trees

Answer: D

10. When do you consider an association rule interesting?


a) If it only satisfies min_support
b) If it only satisfies min_confidence
c) If it satisfies both min_support and min_confidence
d) There are other measures to check so

Answer: C

11. What is the relation between a candidate and frequent itemsets?
a) A candidate itemset is always a frequent itemset
b) A frequent itemset must be a candidate itemset
c) No relation between these two
d) Strong relation with transactions

Answer:B

12. Which of the following is not a frequent pattern mining algorithm?


a) Apriori
b) FP growth
c) Decision trees
d) Eclat

Answer: C

13. Which algorithm requires fewer scans of data?


a) Apriori
b) FP Growth
c) Naive Bayes
d) Decision Trees

Answer: B

14. For the question given below, consider the transaction data:

I1, I2, I3, I4, I5, I6
I7, I2, I3, I4, I5, I6
I1, I8, I4, I5
I1, I9, I10, I4, I6
I10, I2, I4, I11, I5

With support 0.6, find all frequent itemsets.

a) <I1>, <I2>, <I4>, <I5>, <I6>, <I1, I4>, <I2, I4>, <I2, I5>, <I4, I5>, <I4, I6>, <I2, I4, I5>

b) <I2>, <I4>, <I5>, <I2, I4>, <I2, I5>, <I4, I5>, <I2, I4, I5>

c) <I11>, <I4>, <I5>, <I6>, <I1, I4>, <I5, I4>, <I11, I5>, <I4, I6>, <I2, I4, I5>

d) <I1>, <I4>, <I5>, <I6>

Answer: A
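Option (a) for Q14 can be verified by brute force: with 5 transactions and minimum support 0.6, an itemset is frequent when it appears in at least 3 of them. A short enumeration sketch over the question's data:

```python
from itertools import combinations

T = [
    {"I1", "I2", "I3", "I4", "I5", "I6"},
    {"I7", "I2", "I3", "I4", "I5", "I6"},
    {"I1", "I8", "I4", "I5"},
    {"I1", "I9", "I10", "I4", "I6"},
    {"I10", "I2", "I4", "I11", "I5"},
]
items = sorted(set().union(*T))
min_count = 3  # support 0.6 of 5 transactions

# Enumerate every candidate itemset and keep those meeting min_count.
frequent = [set(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if sum(set(c) <= t for t in T) >= min_count]
print(len(frequent))  # 11 itemsets, matching option (a)
```

Brute force is exponential; Apriori prunes the same search using the fact that every subset of a frequent itemset must itself be frequent.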

15. What will happen if support is reduced?


a) Number of frequent itemsets remains the same
b) Some itemsets will add to the current set of frequent itemsets.
c) Some itemsets will become infrequent while others will become frequent
d) Can not say

Answer: B
16. What is association rule mining?
a) Same as frequent itemset mining
b) Finding of strong association rules using frequent itemsets
c) Using association to analyze correlation rules
d) Finding Itemsets for future trends

Answer: B

17. A definition or a concept is ______ if it classifies any examples as coming within the
concept
a) Concurrent
b) Consistent
c) Constant
d) Compete

Answer: B

1. Data scrubbing is which of the following?


A. A process to reject data from the data warehouse and to create the necessary indexes
B. A process to load the data in the data warehouse and to create the necessary indexes
C. A process to upgrade the quality of data after it is moved into a data warehouse
D. A process to upgrade the quality of data before it is moved into a data warehouse
Answer: Option D

2. The @active data warehouse architecture includes which of the following?


A. At least one data mart
B. Data that can be extracted from numerous internal and external sources
C. Near real-time updates
D. All of the above

Answer: Option D

3. A goal of data mining includes which of the following?


A.To explain some observed event or condition
B.To confirm that data exists
C.To analyze data for expected relationships
D.To create a new data warehouse

Answer: Option A

4. An operational system is which of the following?


A.A system that is used to run the business in real-time and is based on historical data.
B.A system that is used to run the business in real-time and is based on current data.
C.A system that is used to support decision-making and is based on current data.
D.A system that is used to support decision-making and is based on historical data.

Answer: Option B

5. A data warehouse is which of the following?


A.Can be updated by end-users.
B.Contains numerous naming conventions and formats.
C.Organized around important subject areas.
D.Contains only current data.

Answer: Option C

6. A snowflake schema is which of the following types of tables?


A.Fact
B.Dimension
C.Helper
D.All of the above

Answer: Option D

7. The generic two-level data warehouse architecture includes which of the following?
A.At least one data mart
B.Data that can be extracted from numerous internal and external sources
C.Near real-time updates
D.All of the above

Answer: Option B

8. Fact tables are which of the following?


A.Completely denormalized
B.Partially denormalized
C.Completely normalized
D.Partially normalized

Answer: Option C

9. Data transformation includes which of the following?


A.A process to change data from a detailed level to a summary level
B.A process to change data from a summary level to a detailed level
C.Joining data from one source into various sources of data
D.Separating data from one source into various sources of data

Answer: Option A

10. Reconciled data is which of the following?


A.Data stored in the various operational systems throughout the organization.
B.Current data intended to be the single source for all decision support systems.
C.Data stored in one operational system in the organization.
D.Data that has been selected and formatted for end-user support applications.

Answer: Option B

11. The load and index is which of the following?
A.A process to reject data from the data warehouse and to create the necessary indexes
B.A process to load the data in the data warehouse and to create the necessary indexes
C.A process to upgrade the quality of data after it is moved into a data warehouse
D.A process to upgrade the quality of data before it is moved into a data warehouse

Answer: Option B

12. The extract process is which of the following?


A.Capturing all of the data contained in various operational systems
B.Capturing a subset of the data contained in various operational systems
C.Capturing all of the data contained in various decision support systems
D.Capturing a subset of the data contained in various decision support systems

Answer: Option B

13. A star schema has what type of relationship between a dimension and fact table?
A.Many-to-many
B.One-to-one
C.One-to-many
D.All of the above

Answer: Option C

14. Transient data is which of the following?


A.Data in which changes to existing records cause the previous version of the records to be
eliminated
B.Data in which changes to existing records do not cause the previous version of the records
to be eliminated
C.Data that are never altered or deleted once they have been added
D.Data that are never deleted once they have been added

Answer: Option A

15. A multifield transformation does which of the following?


A.Converts data from one field into multiple fields
B.Converts data from multiple fields into one field
C.Converts data from multiple fields into multiple fields
D.All of the above

Answer: Option D

16. A data mart is designed to optimize performance for well-defined and predictable
uses.
A. True
B. False

Answer: Option A
17. Successful data warehousing requires that a formal program in total quality
management (TQM) be implemented.
A. True
B. False

Answer: Option A

18. Data in operational systems are typically fragmented and inconsistent.


A. True
B. False

Answer: Option A

19. Most operational systems are based on the use of transient data.
A. True
B. False

Answer: Option A

20. Independent data marts are often created because an organization focuses on a
series of short-term business objectives.
A. True
B. False
Answer: Option A

21. Joining is the process of partitioning data according to predefined criteria.


A. True
B. False

Answer: Option B

22. The role of the ETL process is to identify erroneous data and to fix them.
A. True
B. False

Answer: Option B

23. Data in the data warehouse are loaded and refreshed from operational systems.
A. True
B. False

Answer: Option A

24. Star schema is suited to online transaction processing and therefore is generally
used in operational systems, operational data stores, or an EDW.
A. True
B. False

Answer: Option B

25. Periodic data are data that are physically altered once added to the store.
A. True
B. False

Answer: Option B

26. Both status data and event data can be stored in a database.
A. True
B. False

Answer: Option A

27. Static extract is used for ongoing warehouse maintenance.


A. True
B. False

Answer: Option B

28. Data scrubbing can help upgrade data quality; it is not a long-term solution to the
data quality problem.
A. True
B. False

Answer: Option A

29. Every key used to join the fact table with a dimensional table should be a surrogate
key.
A. True
B. False

Answer: Option A

30. Derived data are detailed, current data intended to be the single, authoritative source
for all decision support applications.
A. True
B. False

Answer: Option B

ETL Process and OLAP


Module 2

1. All data in a flat file is in this format.


A. Sort
B. ETL
C. Format
D. String

Ans: D

2. It is used to push data into a relational database table. This control will be the
destination for most fact table data flows.
A. Web Scraping
B. Data inspection
C. OLE DB Source
D. OLE DB Destination

Ans: D

3. Logical Data Maps


A. These are used to identify which fields from which sources are going to which destinations.
It allows the ETL developer to identify if there is a need to do a data type change or
aggregation prior to beginning coding of an ETL process.
B. These can be used to flag an entire file-set that is ready for processing by the ETL process.
It contains no meaningful data, but the fact that it exists is the key to the process.
C. Data is pulled from multiple sources to be merged into one or more destinations.
D. It is used to massage data in transit between the source and destination.

Ans: A

4. Data access methods.


A. Pull Method
B. Push and Pull
C. Load in Parallel
D. Union all

Ans: B

5. OLTP
A. Process to move data from a source to destination.
B. Transactional database that is typically attached to an application. This source provides the
benefit of known data types and standardized access methods. This system enforces data
integrity.
C. All data in flat file is in this format.
D. This control can be used to add columns to the stream or make modifications to data
within the stream. Should be used for simple modifications.

Ans: B

6. COBOL
A. Process to move data from a source to destination.
B. The easiest to consume from the ETL standpoint.
C. Two methods to ensure data integrity.
D. Many routines of the Mainframe system are written in this.
Ans: D

7. What ETL Stands for


A. Data inspection
B. Transformation
C. Extract, Transform, Load
D. Data Flow

Ans: C

8. The source system initiates the data transfer for the ETL process. This method is
uncommon in practice, as each system would have to move the data to the ETL process
individually.
A. Custom
B. Automation
C. Pull Method
D. Push Method

Ans: D

9. Sentinel Files
A. These are used to identify which fields from which sources are going to which destinations.
It allows the ETL developer to identify if there is a need to do a data type change or
aggregation prior to beginning coding of an ETL process.
B. These can be used to flag an entire file-set that is ready for processing by the ETL process.
It contains no meaningful data, but the fact that it exists is the key to the process.
C. ETL can be used to automate the movement of data between two locations. This
standardizes the process so that the load is done the same way every run.
D. This is used to create multiple streams within a data flow from a single stream. All records
in the stream are sent down all paths. Typically uses a merge-join to recombine the streams
later in the data flow.

Ans: B

10. Checkpoints
A. Similar to “break up processes”, checkpoints provide markers for what data has been
processed in case an error occurs during the ETL process.
B. Similar to XML’s structured text file.
C. Many routines of the Mainframe system are written in this.
D. It is used to import text files for ETL processing.

Ans: A

11. Mainframe systems use this. This requires a conversion to the more common ASCII
format.
A. ETL
B. XML
C. Sort
D. EBCDIC

Ans: D

12. Ultimate flexibility, unit testing is available, usually poor documentation.


A. ETL
B. Custom
C. OLTP
D. Sort

Ans: B

13. Conditional Split


A. Many routines of the Mainframe system are written in this.
B. Data is pulled from multiple sources to be merged into one or more destinations.
C. It allows multiple streams to be created from a single stream. Only rows that match the
criteria for a given path are sent down that path.
D. This is used to create multiple streams within a data flow from a single stream. All records
in the stream are sent down all paths. Typically uses a merge-join to recombine the streams
later in the data flow.

Ans: C

14. Flat files


A. The easiest to consume from the ETL standpoint.
B. Three components of data flow.
C. Three common usages of ETL.
D. Two methods to ensure data integrity.

Ans: A

15. This is used to create multiple streams within a data flow from a single stream. All
records in the stream are sent down all paths. Typically uses a merge-join to recombine
the streams later in the data flow.
A. OLTP
B. Mainframe
C. EBCDIC
D. Multicast

Ans: D

16. There are little to no benefits to the ETL developer when accessing these types of
systems and many detriments. The ability to access these systems is very limited and
typically FTP of text files is used to facilitate access.
A. Mainframe
B. Union all
C. File Name
D. Multicast

Ans: A

17. Shows the path to the file to be imported.


A. File Name
B. Mainframe
C. Format
D. Union all

Ans: A

18. Wheel is already invented, documented, good support.


A. Format
B. COBOL
C. Tool Suite
D. Flat files

Ans: C

19. Similar to XML’s structured text file.


A. Data Scrubbing
B. EBCDIC
C. String
D. Web Scraping

Ans: D

20. Flat file control


A. Three components of data flow.
B. It is used to import text files for ETL processing.
C. The easiest to consume from the ETL standpoint.
D. Shows the path to the file to be imported.

Ans: B

21. Two methods to ensure data integrity.


A. Sources, Transformation, Destination
B. Data inspection
C. Row Count Inspection, Data Inspection
D. Row Count Inspection

Ans: C
22. Transformation
A. Data is pulled from multiple sources to be merged into one or more destinations.
B. It is used to import text files for ETL processing.
C. Process to move data from a source to destination.
D. It is used to massage data in transit between the source and destination.

Ans: D

23. Three common usages of ETL.


A. Data Scrubbing
B. Sources, Transformation, Destination
C. Merging Data
D. Merging Data, Data Scrubbing, Automation

Ans: D

24. Load in Parallel


A. A value of Delimited should be selected for delimited files.
B. Data is pulled from multiple sources to be merged into one or more destinations.
C. This will reduce the run time of ETL process and reduce the window for hardware failure
to affect the process.
D. This should be checked if column names have been included in the first row of the file.

Ans: C

25. This can be computationally expensive excluding SSD.


A. Hard Drive I/O
B. Mainframe
C. Tool Suite
D. Data Scrubbing

Ans: A

26. A value of Delimited should be selected for delimited files.


A. Sort
B. Format
C. String
D. OLTP

Ans: B

27. This should be checked if column names have been included in the first row of the file.
A. Row Count Inspection, Data Inspection
B. Format of the Date
C. Column names in the first data row checkbox
D. Do most work in transformation phase

Ans: C
28. OLAP stands for
a) Online analytical processing
b) Online analysis processing
c) Online transaction processing
d) Online aggregate processing

Answer: a

29. Data that can be modeled as dimension attributes and measure attributes are called
_______ data.
a) Multidimensional
b) Single Dimensional
c) Measured
d) Dimensional

Answer: a

30. The generalization of the cross-tab, which is represented visually, is ____________,
which is also called a data cube.
a) Two dimensional cube
b) Multidimensional cube
c) N-dimensional cube
d) Cuboid

Answer: a

31. The process of viewing the cross-tab (Single dimensional) with a fixed value of one
attribute is
a) Slicing
b) Dicing
c) Pivoting
d) Both Slicing and Dicing

Answer: a
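Q31's slicing (fixing one attribute's value to obtain a lower-dimensional view) is easy to illustrate on a tiny cube stored as a dict keyed by (item, color, size). This is a toy layout, not any OLAP API; all names and numbers are made up:

```python
# A 3-D "cube" of sales quantities keyed by (item, color, size).
cube = {
    ("shirt", "red",  "M"): 10,
    ("shirt", "blue", "M"): 7,
    ("pants", "red",  "L"): 4,
    ("pants", "blue", "L"): 6,
}

def slice_cube(cube, color):
    """Slice: fix one attribute (here color), leaving a 2-D view."""
    return {(item, size): qty
            for (item, c, size), qty in cube.items() if c == color}

print(slice_cube(cube, "red"))  # {('shirt', 'M'): 10, ('pants', 'L'): 4}
```

Dicing would instead restrict a range of values on one or more attributes, keeping all dimensions in the result.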

32. The operation of moving from finer-granularity data to a coarser granularity (by
means of aggregation) is called a ________
a) Rollup
b) Drill down
c) Dicing
d) Pivoting

Answer: a
33. In SQL the cross-tabs are created using
a) Slice
b) Dice
c) Pivot
d) All of the mentioned

Answer: a

34.{ (item name, color, clothes size), (item name, color), (item name, clothes size), (color,
clothes size), (item name), (color), (clothes size), () }
This can be achieved by using which of the following?
a) group by rollup
b) group by cubic
c) group by
d) none of the mentioned

Answer: d

35. What do data warehouses support?


a) OLAP
b) OLTP
c) OLAP and OLTP
d) Operational databases
Answer: a

36. SELECT item_name, color, clothes_size, SUM(quantity)
FROM sales
GROUP BY ROLLUP(item_name, color, clothes_size);
How many groupings are possible in this rollup?
a) 8
b) 4
c) 2
d) 1
Answer: b
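Q36's answer follows from how ROLLUP works: ROLLUP(a, b, c) produces only the prefix groupings (a, b, c), (a, b), (a), and (), four in total, unlike CUBE, which produces all 2^3 = 8. A sketch that enumerates them (the function is illustrative, not a SQL API):

```python
def rollup_groupings(cols):
    """ROLLUP emits one grouping per prefix of the column list,
    from the full column list down to the empty (grand total) grouping."""
    return [tuple(cols[:k]) for k in range(len(cols), -1, -1)]

gs = rollup_groupings(["item_name", "color", "clothes_size"])
print(gs)
print(len(gs))  # 4 groupings, matching answer (b)
```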

37. Which one of the following is the right syntax for DECODE?
a) DECODE (search, expression, result [, search, result]… [, default])
b) DECODE (expression, result [, search, result]… [, default], search)
c) DECODE (search, result [, search, result]… [, default], expression)
d) DECODE (expression, search, result [, search, result]… [, default])
Answer: d

Introduction to Data Mining, Data Exploration and Preprocessing


Module 3

1. Data mining refers to ______


a) Special fields for database
b) Knowledge discovery from large database
c) Knowledge base for the database
d) Collections of attributes
Answer: B

2. An attribute is a ____
a) Normalization of Fields
b) Property of the class
c) Characteristics of the object
d) Summarise value

Answer: C

3. Which are not related to Ratio Attributes?


a) Age Group 10-20, 30-50, 35-45 (in Years)
b) Mass 20-30 kg, 10-15 kg
c) Areas 10-50, 50-100 (in Kilometres)
d) Temperature 10°-20°, 30°-50°, 35°-45°

Answer: D

4. The mean is the ________ of a dataset.


a) Average
b) Middle
c) Central
d) Ordered

Answer: A

5. The number that occurs most often within a set of data called as ______
a) Mean
b) Median
c) Mode
d) Range

Answer: C

6. Find the range for given data 40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50
a) 19
b) 29
c) 35
d) 49

Answer: B
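Q4 through Q6 can be checked with Python's statistics module on Q6's data; the range is max − min = 55 − 26 = 29:

```python
from statistics import mean, median

data = [40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50]

print(mean(data))             # the average of the dataset (Q4)
print(median(data))           # the middle value: 42.5
print(max(data) - min(data))  # range = 55 - 26 = 29 (Q6, option b)
```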

7. Which of the following is not part of the KDD process?
a) Selection
b) Pre-processing
c) Reduction
d) Summation

Answer: D
8. _______ is the output of KDD Process.
a) Query
b) Useful Information
c) Information
d) Data

Answer: B

9. Data mining turns a large collection of data into _____


a) Database
b) Knowledge
c) Queries
d) Transactions

Answer: B

10. In KDD Process, where data relevant to the analysis task are retrieved from the
database means _____
a) Data Selection
b) Data Collection
c) Data Warehouse
d) Data Mining

Answer: A

11. In KDD Process, data are transformed and consolidated into appropriate forms for
mining by performing summary or aggregation operations is called as _____
a) Data Selection
b) Data Transformation
c) Data Reduction
d) Data Cleaning

Answer: B

12. What kinds of data can be mined?


a) Database data
b) Data Warehouse data
c) Transactional data
d) All of the above

Answer:D

13. Data selection is _____


a) The actual discovery phase of a knowledge discovery process
b) The stage of selecting the right data for a KDD process
c) A subject-oriented integrated time-variant non-volatile collection of data in support of
management
d) Record oriented classes finding

Answer: B

14. To remove noise and inconsistent data ____ is needed.


a) Data Cleaning
b) Data Transformation
c) Data Reduction
d) Data Integration

Answer: A

15. Multiple data sources may be combined is called as _____


a) Data Reduction
b) Data Cleaning
c) Data Integration
d) Data Transformation

Answer: C

16. A _____ is a collection of tables, each of which is assigned a unique name which uses
the entity-relationship (ER) data model.
a) Relational database
b) Transactional database
c) Data Warehouse
d) Spatial database

Answer: A

17. Relational data can be accessed by _____ written in a relational query language.
a) Select
b) Queries
c) Operations
d) Like

Answer: B

18. _____ studies the collection, analysis, interpretation or explanation, and presentation of data.
a) Statistics
b) Visualization
c) Data Mining
d) Clustering

Answer: A

19. ______ investigates how computers can learn (or improve their performance) based
on data.
a) Machine Learning
b) Artificial Intelligence
c) Statistics
d) Visualization

Answer: A

20. _____ is the science of searching for documents or information in documents.


a) Data Mining
b) Information Retrieval
c) Text Mining
d) Web Mining

Answer: B

21. Data often contain _____


a) Target Class
b) Uncertainty
c) Methods
d) Keywords

Answer: B

22. The data mining process should be highly ______


a) On Going
b) Active
c) Interactive
d) Flexible

Answer: C

23. In real world multidimensional view of data mining, The major dimensions are data,
knowledge, technologies, and _____
a) Methods
b) Applications
c) Tools
d) Files

Answer: B

24. An _____ is a data field, representing a characteristic or feature of a data object.


a) Method
b) Variable
c) Task
d) Attribute
Answer: D

25. The values of a _____ attribute are symbols or names of things.


a) Ordinal
b) Nominal
c) Ratio
d) Interval

Answer:B

26. “Data about data” is referred to as _____


a) Information
b) Database
c) Metadata
d) File

Answer: C

27. ______ partitions the objects into different groups.


a) Mapping
b) Clustering
c) Classification
d) Prediction

Answer:B

28. In _____, the attribute data are scaled so as to fall within a smaller range, such as
-1.0 to 1.0, or 0.0 to 1.0.
a) Aggregation
b) Binning
c) Clustering
d) Normalization

Answer: D

29. Normalization by ______ normalizes by moving the decimal point of values of attributes.
a) Z-Score
b) Z-Index
c) Decimal Scaling
d) Min-Max Normalization

Answer: C
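Q28 and Q29 describe min-max normalization (rescaling values linearly into a small range such as [0, 1]) and decimal scaling (dividing by a power of 10 so every value falls below 1 in magnitude). Both fit in a few lines; this is a sketch with made-up data, not a library API:

```python
def min_max(values, lo=0.0, hi=1.0):
    """Rescale values linearly into [lo, hi] (Q28: normalization)."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

def decimal_scaling(values):
    """Divide by 10^j, where j is the number of digits in the largest
    absolute value, so every |value| < 1 (Q29: decimal scaling)."""
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

print(min_max([200, 300, 400, 600, 1000]))  # [0.0, 0.125, 0.25, 0.5, 1.0]
print(decimal_scaling([-986, 917]))         # roughly [-0.986, 0.917]
```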

30. _______ is a top-down splitting technique based on a specified number of bins.


a) Normalization
b) Binning
c) Clustering
d) Classification
Answer: B

Classification, Prediction and Clustering


Module 4

1. How many terms are required for building a Bayes model?


a) 1
b) 2
c) 3
d) 4

Answer: c

2. What is needed to make probabilistic systems feasible in the world?


a) Reliability
b) Crucial robustness
c) Feasibility
d) None of the mentioned

Answer: b

3. Where can the Bayes rule be used?


a) Solving queries
b) Increasing complexity
c) Decreasing complexity
d) Answering probabilistic query

Answer: d

4. What does the Bayesian network provide?


a) Complete description of the domain
b) Partial description of the domain
c) Complete description of the problem
d) None of the mentioned

Answer: a

5. How can the entries in the full joint probability distribution be calculated?
a) Using variables
b) Using information
c) Both Using variables & information
d) None of the mentioned

Answer: b

6. How can the Bayesian network be used to answer any query?


a) Full distribution
b) Joint distribution
c) Partial distribution
d) All of the mentioned
Answer: b

7. How can the compactness of the Bayesian network be described?


a) Locally structured
b) Fully structured
c) Partial structure
d) All of the mentioned

Answer: a

8. With which is the local structure associated?


a) Hybrid
b) Dependant
c) Linear
d) None of the mentioned

Answer: c

9. Which condition is used to influence a variable directly by all the others?


a) Partially connected
b) Fully connected
c) Local connected
d) None of the mentioned

Answer: b

10. What is the relationship between a node and its predecessors while creating a
Bayesian network?
a) Functionally dependent
b) Dependant
c) Conditionally independent
d) Both Conditionally dependant & Dependant

Answer: c

11. A _________ is a decision support tool that uses a tree-like graph or model of
decisions and their possible consequences, including chance event outcomes, resource
costs, and utility.
a) Decision tree
b) Graphs
c) Trees
d) Neural Networks

Answer: a
12. Decision Tree is a display of an algorithm.
a) True
b) False

Answer: a

13. What is Decision Tree?


a) Flow-Chart
b) Structure in which internal node represents test on an attribute, each branch represents
outcome of test and each leaf node represents class label
c) Flow-Chart & Structure in which internal node represents test on an attribute, each branch
represents outcome of test and each leaf node represents class label
d) None of the mentioned

Answer: c

14. Decision Trees can be used for Classification Tasks.


a) True
b) False

Answer: a

15. Choose from the following that are Decision Tree nodes?
a) Decision Nodes
b) End Nodes
c) Chance Nodes
d) All of the mentioned

Answer: d

16. Decision Nodes are represented by ____________


a) Disks
b) Squares
c) Circles
d) Triangles

Answer: b

17. Chance Nodes are represented by __________


a) Disks
b) Squares
c) Circles
d) Triangles

Answer: c

18. End Nodes are represented by __________


a) Disks
b) Squares
c) Circles
d) Triangles

Answer: d

19. Which of the following are the advantage/s of Decision Trees?


a) Possible Scenarios can be added
b) Use a white box model, If given result is provided by a model
c) Worst, best and expected values can be determined for different scenarios
d) All of the mentioned

Answer: d

20. Which of the following is the valid component of the predictor?


a) data
b) question
c) algorithm
d) all of the mentioned

Answer: d

21. Point out the wrong statement.


a) In Sample Error is also called generalization error
b) Out of Sample Error is the error rate you get on the new dataset
c) In Sample Error is also called resubstitution error
d) All of the mentioned

Answer: a

22. Which of the following is correct order of working?


a) questions->input data ->algorithms
b) questions->evaluation ->algorithms
c) evaluation->input data ->algorithms
d) all of the mentioned

Answer: a

23. Which of the following shows correct relative order of importance?


a) question->features->data->algorithms
b) question->data->features->algorithms
c) algorithms->data->features->question
d) none of the mentioned

Answer: b
24. Point out the correct statement.
a) In Sample Error is the error rate you get on the same dataset used to model a predictor
b) Data have two parts-signal and noise
c) The goal of predictor is to find signal
d) None of the mentioned

Answer: d

25. Which of the following is characteristic of best machine learning method?


a) Fast
b) Accuracy
c) Scalable
d) All of the mentioned

Answer: d

26. True positive means correctly rejected.


a) True
b) False

Answer: b

27. Which of the following trade-off occurs during prediction?


a) Speed vs Accuracy
b) Simplicity vs Accuracy
c) Scalability vs Accuracy
d) None of the mentioned

Answer: d

28. Which of the following expression is true?


a) In sample error < out sample error
b) In sample error > out sample error
c) In sample error = out sample error
d) All of the mentioned

Answer: a

29. Backtesting is a key component of effective trading-system development.


a) True
b) False

Answer: a

30. Which of the following is correct use of cross validation?


a) Selecting variables to include in a model
b) Comparing predictors
c) Selecting parameters in prediction function
d) All of the mentioned
Answer: d


31. Point out the wrong combination.


a) True negative=correctly rejected
b) False negative=correctly rejected
c) False positive=correctly identified
d) All of the mentioned

Answer: c

32. Which of the following is a common error measure?


a) Sensitivity
b) Median absolute deviation
c) Specificity
d) All of the mentioned

Answer: d
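The measures named in Q32 are easy to state as code. The following is a hedged Python sketch with made-up confusion-matrix counts; the function names are our own for illustration, not from any particular library.

```python
# Toy illustration of Q32's error measures; all numbers are invented.
def sensitivity(tp, fn):
    # true positive rate: correctly identified positives / all actual positives
    return tp / (tp + fn)

def specificity(tn, fp):
    # true negative rate: correctly rejected negatives / all actual negatives
    return tn / (tn + fp)

def median_absolute_deviation(values):
    # median of the absolute deviations from the median
    s = sorted(values)
    n = len(s)
    med = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    dev = sorted(abs(v - med) for v in values)
    return dev[n // 2] if n % 2 else (dev[n // 2 - 1] + dev[n // 2]) / 2

print(sensitivity(tp=40, fn=10))                      # 0.8
print(specificity(tn=45, fp=5))                       # 0.9
print(median_absolute_deviation([1, 2, 3, 4, 100]))   # 1
```

The median absolute deviation is robust to the outlier 100, which is why it appears alongside sensitivity and specificity as a common error measure.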

33. Which of the following is not a machine learning algorithm?


a) SVG
b) SVM
c) Random forest
d) None of the mentioned

Answer: a

34. Point out the wrong statement.


a) ROC curve stands for receiver operating characteristic
b) For time series, data must be in chunks
c) Random sampling must be done with replacement
d) None of the mentioned

Answer: d

35. Which of the following is a categorical outcome?


a) RMSE
b) RSquared
c) Accuracy
d) All of the mentioned

Answer: c

36. For k cross-validation, larger k value implies more bias.


a) True
b) False
Answer: b

37. Which of the following method is used for trainControl resampling?


a) repeatedcv
b) svm
c) bag32
d) none of the mentioned

Answer: a

38. Which of the following can be used to create the most common graph types?
a) qplot
b) quickplot
c) plot
d) all of the mentioned

Answer: a

39. For k cross-validation, smaller k value implies less variance.


a) True
b) False

Answer: a

40. Predicting with trees evaluate _____________ within each group of data.
a) equality
b) homogeneity
c) heterogeneity
d) all of the mentioned

Answer: b


41. Point out the wrong statement.


a) Training and testing data must be processed in different ways
b) Test transformation would mostly be imperfect
c) The first goal is statistical and second is data compression in PCA
d) All of the mentioned

Answer: a

42. Which of the following method options is provided by train function for bagging?
a) bagEarth
b) treebag
c) bagFDA
d) all of the mentioned
Answer: d

43. Which of the following is correct with respect to random forest?


a) Random forest are difficult to interpret but often very accurate
b) Random forest are easy to interpret but often very accurate
c) Random forest are difficult to interpret but very less accurate
d) None of the mentioned

Answer: a

44. Point out the correct statement.


a) Prediction with regression is easy to implement
b) Prediction with regression is easy to interpret
c) Prediction with regression performs well when linear model is correct
d) All of the mentioned

Answer: d

45. Which of the following library is used for boosting generalized additive models?
a) gamBoost
b) gbm
c) ada
d) all of the mentioned

Answer: a

46. The principal components are equal to left singular values if you first scale the
variables.
a) True
b) False

Answer: b

47. Which of the following is statistical boosting based on additive logistic regression?
a) gamBoost
b) gbm
c) ada
d) mboost

Answer: a

48. Which of the following is one of the largest boost subclass in boosting?
a) variance boosting
b) gradient boosting
c) mean boosting
d) all of the mentioned

Answer: b
49. PCA is most useful for non linear type models.
a) True
b) False

Answer: b

50. Which of the following clustering type has characteristic shown in the below figure?

a) Partitional
b) Hierarchical
c) Naive bayes
d) None of the mentioned

Answer: b


51. Point out the correct statement.


a) The choice of an appropriate metric will influence the shape of the clusters
b) Hierarchical clustering is also called HCA
c) In general, the merges and splits are determined in a greedy manner
d) All of the mentioned

Answer: d

52. Which of the following is finally produced by Hierarchical Clustering?


a) final estimate of cluster centroids
b) tree showing how close things are to each other
c) assignment of each point to clusters
d) all of the mentioned

Answer: b

53. Which of the following is required by K-means clustering?


a) defined distance metric
b) number of clusters
c) initial guess as to cluster centroids
d) all of the mentioned

Answer: d
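Question 53's three requirements (a distance metric, the number of clusters, and initial centroid guesses) all show up directly in a toy implementation. The following is a minimal Python sketch, not the R `kmeans` function later questions mention; the sample points are made up.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means sketch: needs a distance metric, the number of
    clusters k, and an initial guess for the centroids (Q53)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial guess: k random points
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))  # squared Euclidean
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: dist(p, centroids[i]))].append(p)
        # update step: each centroid moves to the mean of its group
        for i, g in enumerate(groups):
            if g:
                centroids[i] = tuple(sum(c) / len(g) for c in zip(*g))
    return centroids

pts = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9)]
print(sorted(kmeans(pts, k=2)))  # two centroids near (1.1, 0.9) and (8.1, 7.95)
```

Because the initial centroids are sampled randomly, different seeds can give different results, which is exactly the non-determinism Q59 and Q61 refer to.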
54. Point out the wrong statement.
a) k-means clustering is a method of vector quantization
b) k-means clustering aims to partition n observations into k clusters
c) k-nearest neighbor is same as k-means
d) none of the mentioned

Answer: c

55. Which of the following combination is incorrect?


a) Continuous – euclidean distance
b) Continuous – correlation similarity
c) Binary – manhattan distance
d) None of the mentioned

Answer: d

56. Hierarchical clustering should be primarily used for exploration.


a) True
b) False

Answer: a

57. Which of the following function is used for k-means clustering?


a) k-means
b) k-mean
c) heatmap
d) none of the mentioned

Answer: a

58. Which of the following clustering requires merging approach?


a) Partitional
b) Hierarchical
c) Naive Bayes
d) None of the mentioned

Answer: b

59. K-means is not deterministic and it also consists of a number of iterations.


a) True
b) False

Answer: a

60. Hierarchical clustering should be mainly used for exploration.


a) True
b) False

Answer: a

61. K-means clustering consists of a number of iterations and is not deterministic.


a) True
b) False

Answer: a

62. Which is needed by K-means clustering?


a) defined distance metric
b) number of clusters
c) initial guess as to cluster centroids
d) all of these

Answer: d

63. Which function is used for k-means clustering?


a) k-means
b) k-mean
c) heatmap
d) none of the mentioned

Answer: a

64. Which is conclusively produced by Hierarchical Clustering?


a) final estimation of cluster centroids
b) tree showing how nearby things are to each other
c) assignment of each point to clusters
d) all of these

Answer: b

65. Which clustering technique requires a merging approach?


a) Partitional
b) Hierarchical
c) Naive Bayes
d) None of the mentioned

Answer: b

Mining Frequent Patterns and Association Rules


Module 5

1. A collection of one or more items is called as _____


a) Itemset
b) Support
c) Confidence
d) Support Count

Answer: A

2. Frequency of occurrence of an itemset is called as _____


a) Support
b) Confidence
c) Support Count
d) Rules

Answer: C

3. An itemset whose support is greater than or equal to a minimum support threshold is


______
a) Itemset
b) Frequent Itemset
c) Infrequent items
d) Threshold values

Answer: B

4. What does FP growth algorithm do?


a) It mines all frequent patterns through pruning rules with lesser support
b) It mines all frequent patterns through pruning rules with higher support
c) It mines all frequent patterns by constructing a FP tree
d) It mines all frequent patterns by constructing an itemsets

Answer: C

5. What techniques can be used to improve the efficiency of apriori algorithm?


a) Hash-based techniques
b) Transaction Increases
c) Sampling
d) Cleaning

Answer: A

6. What do you mean by support(A)?


a) Total number of transactions containing A
b) Total Number of transactions not containing A
c) Number of transactions containing A / Total number of transactions
d) Number of transactions not containing A / Total number of transactions

Answer: C

7. How do you calculate Confidence (A -> B)?


a) Support(A ∪ B) / Support(A)
b) Support(A ∪ B) / Support(B)
c) Support(A ∩ B) / Support(A)
d) Support(A ∩ B) / Support(B)

Answer: A
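Questions 6 and 7 can be illustrated in a few lines of Python. The transactions below are invented for the example, and `support(a | b)` computes the fraction of transactions containing both A and B, matching the textbook convention behind answer A.

```python
# Hedged sketch of support and confidence (Q6-Q7); transactions are made up.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    # fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b):
    # Confidence(A -> B) = Support(A and B together) / Support(A)
    return support(a | b) / support(a)

print(support({"milk"}))                # 0.75
print(confidence({"milk"}, {"bread"}))  # 2/3, about 0.667
```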

8. Which of the following is the direct application of frequent itemset mining?


a) Social Network Analysis
b) Market Basket Analysis
c) Outlier Detection
d) Intrusion Detection

Answer: B

9. What is not true about FP growth algorithms?


a) It mines frequent itemsets without candidate generation
b) There are chances that FP trees may not fit in the memory
c) FP trees are very expensive to build
d) It expands the original database to build FP trees

Answer: D

10. When do you consider an association rule interesting?


a)If it only satisfies min_support
b) If it only satisfies min_confidence
c) If it satisfies both min_support and min_confidence
d) There are other measures to check so

Answer: C


11. What is the relation between a candidate and frequent itemsets?


a) A candidate itemset is always a frequent itemset
b) A frequent itemset must be a candidate itemset
c) No relation between these two
d) Strong relation with transactions

Answer:B

12. Which of the following is not a frequent pattern mining algorithm?


a) Apriori
b) FP growth
c) Decision trees
d) Eclat
Answer: C

13. Which algorithm requires fewer scans of data?


a) Apriori
b) FP Growth
c) Naive Bayes
d) Decision Trees

Answer: B

14. For the question given below consider the data Transactions :

I1, I2, I3, I4, I5, I6


I7, I2, I3, I4, I5, I6
I1, I8, I4, I5
I1, I9, I10, I4, I6
I10, I2, I4, I11, I5
With support as 0.6 find all frequent itemsets?

a) <I1>, <I2>, <I4>, <I5>, <I6>, <I1, I4>, <I2, I4>, <I2, I5>, <I4, I5>, <I4, I6>, <I2, I4, I5>

b) <I2>, <I4>, <I5>, <I2, I4>, <I2, I5>, <I4, I5>, <I2, I4, I5>

c) <I11>, <I4>, <I5>, <I6>, <I1, I4>, <I5, I4>, <I11, I5>, <I4, I6>, <I2, I4, I5>

d) <I1>, <I4>, <I5>, <I6>

Answer: A
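Question 14's answer can be verified by brute force, assuming that support 0.6 means an itemset must appear in at least 3 of the 5 transactions:

```python
from itertools import combinations

# Brute-force frequent-itemset check for Q14 (min support 0.6 of 5 transactions).
transactions = [
    {"I1", "I2", "I3", "I4", "I5", "I6"},
    {"I7", "I2", "I3", "I4", "I5", "I6"},
    {"I1", "I8", "I4", "I5"},
    {"I1", "I9", "I10", "I4", "I6"},
    {"I10", "I2", "I4", "I11", "I5"},
]
items = sorted(set().union(*transactions))
min_count = 3  # 0.6 * 5 transactions

frequent = []
for size in range(1, len(items) + 1):
    for cand in combinations(items, size):
        # count transactions containing every item of the candidate
        if sum(set(cand) <= t for t in transactions) >= min_count:
            frequent.append(cand)

print(frequent)
```

This enumerates exactly the eleven itemsets listed in option A: five singletons, five pairs, and the triple <I2, I4, I5>.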

15. What will happen if support is reduced?


a) Number of frequent itemsets remains the same
b) Some itemsets will add to the current set of frequent itemsets.
c) Some itemsets will become infrequent while others will become frequent
d) Can not say

Answer: B

16. What is association rule mining?


a) Same as frequent itemset mining
b) Finding of strong association rules using frequent itemsets
c) Using association to analyze correlation rules
d) Finding Itemsets for future trends

Answer: B

17. A definition or a concept is ______ if it classifies any examples as coming within the
concept
a) Concurrent
b) Consistent
c) Constant
d) Compete

Answer: B

1. Data scrubbing is which of the following?


A. A process to reject data from the data warehouse and to create the necessary indexes
B. A process to load the data in the data warehouse and to create the necessary indexes
C. A process to upgrade the quality of data after it is moved into a data warehouse
D. A process to upgrade the quality of data before it is moved into a data warehouse
Answer: Option D

2. The @active data warehouse architecture includes which of the following?


A. At least one data mart
B. Data that can be extracted from numerous internal and external sources
C. Near real-time updates
D. All of the above

Answer: Option D

3. A goal of data mining includes which of the following?


A.To explain some observed event or condition
B.To confirm that data exists
C.To analyze data for expected relationships
D.To create a new data warehouse

Answer: Option A

4. An operational system is which of the following?


A.A system that is used to run the business in real-time and is based on historical data.
B.A system that is used to run the business in real-time and is based on current data.
C.A system that is used to support decision-making and is based on current data.
D.A system that is used to support decision-making and is based on historical data.

Answer: Option B

5. A data warehouse is which of the following?


A.Can be updated by end-users.
B.Contains numerous naming conventions and formats.
C.Organized around important subject areas.
D.Contains only current data.

Answer: Option C

6. A snowflake schema is which of the following types of tables?


A.Fact
B.Dimension
C.Helper
D.All of the above
Answer: Option D

7. The generic two-level data warehouse architecture includes which of the following?
A.At least one data mart
B.Data that can be extracted from numerous internal and external sources
C.Near real-time updates
D.All of the above

Answer: Option B

8. Fact tables are which of the following?


A.Completely denormalized
B.Partially denormalized
C.Completely normalized
D.Partially normalized

Answer: Option C

9. Data transformation includes which of the following?


A.A process to change data from a detailed level to a summary level
B.A process to change data from a summary level to a detailed level
C.Joining data from one source into various sources of data
D.Separating data from one source into various sources of data

Answer: Option A

10. Reconciled data is which of the following?


A.Data stored in the various operational systems throughout the organization.
B.Current data intended to be the single source for all decision support systems.
C.Data stored in one operational system in the organization.
D.Data that has been selected and formatted for end-user support applications.

Answer: Option B


11. The load and index is which of the following?


A.A process to reject data from the data warehouse and to create the necessary indexes
B.A process to load the data in the data warehouse and to create the necessary indexes
C.A process to upgrade the quality of data after it is moved into a data warehouse
D.A process to upgrade the quality of data before it is moved into a data warehouse

Answer: Option B

12. The extract process is which of the following?


A.Capturing all of the data contained in various operational systems
B.Capturing a subset of the data contained in various operational systems
C.Capturing all of the data contained in various decision support systems
D.Capturing a subset of the data contained in various decision support systems

Answer: Option B

13. A star schema has what type of relationship between a dimension and fact table?
A.Many-to-many
B.One-to-one
C.One-to-many
D.All of the above

Answer: Option C

14. Transient data is which of the following?


A.Data in which changes to existing records cause the previous version of the records to be
eliminated
B.Data in which changes to existing records do not cause the previous version of the records
to be eliminated
C.Data that are never altered or deleted once they have been added
D.Data that are never deleted once they have been added

Answer: Option A

15. A multifield transformation does which of the following?


A.Converts data from one field into multiple fields
B.Converts data from multiple fields into one field
C.Converts data from multiple fields into multiple fields
D.All of the above

Answer: Option D

16. A data mart is designed to optimize the performance for well-defined and predictable
uses.
A. True
B. False

Answer: Option A

17. Successful data warehousing requires that a formal program in total quality
management (TQM) be implemented.
A. True
B. False

Answer: Option A

18. Data in operational systems are typically fragmented and inconsistent.


A. True
B. False

Answer: Option A
19. Most operational systems are based on the use of transient data.
A. True
B. False

Answer: Option A

20. Independent data marts are often created because an organization focuses on a
series of short-term business objectives.
A. True
B. False
Answer: Option A


21. Joining is the process of partitioning data according to predefined criteria.


A. True
B. False

Answer: Option B

22. The role of the ETL process is to identify erroneous data and to fix them.
A. True
B. False

Answer: Option B

23. Data in the data warehouse are loaded and refreshed from operational systems.
A. True
B. False

Answer: Option A

24. Star schema is suited to online transaction processing and therefore is generally
used in operational systems, operational data stores, or an EDW.
A. True
B. False

Answer: Option B

25. Periodic data are data that are physically altered once added to the store.
A. True
B. False

Answer: Option B
26. Both status data and event data can be stored in a database.
A. True
B. False

Answer: Option A

27. Static extract is used for ongoing warehouse maintenance.


A. True
B. False

Answer: Option B

28. Data scrubbing can help upgrade data quality; it is not a long-term solution to the
data quality problem.
A. True
B. False

Answer: Option A

29. Every key used to join the fact table with a dimensional table should be a surrogate
key.
A. True
B. False

Answer: Option A

30. Derived data are detailed, current data intended to be the single, authoritative source
for all decision support applications.
A. True
B. False

Answer: Option B

ETL Process and OLAP


Module 2

1. All data in flat file is in this format.


A. Sort
B. ETL
C. Format
D. String

Ans: D

2. It is used to push data into a relation database table. This control will be the
destination for most fact table data flows.
A. Web Scraping
B. Data inspection
C. OLE DB Source
D. OLE DB Destination
Ans: D

3. Logical Data Maps


A. These are used to identify which fields from which sources are going to which destinations.
It allows the ETL developer to identify if there is a need to do a data type change or
aggregation prior to beginning coding of an ETL process.
B. These can be used to flag an entire file-set that is ready for processing by the ETL process.
It contains no meaningful data, but the fact that it exists is the key to the process.
C. Data is pulled from multiple sources to be merged into one or more destinations.
D. It is used to massage data in transit between the source and destination.

Ans: A

4. Data access methods.


A. Pull Method
B. Push and Pull
C. Load in Parallel
D. Union all

Ans: B

5. OLTP
A. Process to move data from a source to destination.
B. Transactional database that is typically attached to an application. This source provides the
benefit of known data types and standardized access methods. This system enforces data
integrity.
C. All data in flat file is in this format.
D. This control can be used to add columns to the stream or make modifications to data
within the stream. Should be used for simple modifications.

Ans: B

6. COBOL
A. Process to move data from a source to destination.
B. The easiest to consume from the ETL standpoint.
C. Two methods to ensure data integrity.
D. Many routines of the Mainframe system are written in this.

Ans: D

7. What ETL Stands for


A. Data inspection
B. Transformation
C. Extract, Transform, Load
D. Data Flow

Ans: C

8. The source system initiates the data transfer for the ETL process. This method is
uncommon in practice, as each system would have to move the data to the ETL process
individually.
A. Custom
B. Automation
C. Pull Method
D. Push Method

Ans: D

9. Sentinel Files
A. These are used to identify which fields from which sources are going to with destinations.
It allows the ETL developer to identify if there is a need to do a data type change or
aggregation prior to beginning coding of an ETL process.
B. These can be used to flag an entire file-set that is ready for processing by the ETL process.
It contains no meaningful data, but the fact that it exists is the key to the process.
C. ETL can be used to automate the movement of data between two locations. This
standardizes the process so that the load is done the same way every run.
D. This is used to create multiple streams within a data flow from a single stream. All records
in the stream are sent down all paths. Typically uses a merge-join to recombine the streams
later in the data flow.

Ans: B

10. Checkpoints
A. Similar to “break up processes”, checkpoints provide markers for what data has been
processed in case an error occurs during the ETL process.
B. Similar to XML’s structured text file.
C. Many routines of the Mainframe system are written in this.
D. It is used to import text files for ETL processing.

Ans: A


11. Mainframe systems use this. This requires a conversion to the more common ASCII
format.
A. ETL
B. XML
C. Sort
D. EBCDIC

Ans: D

12. Ultimate flexibility, unit testing is available, usually poor documentation.


A. ETL
B. Custom
C. OLTP
D. Sort
Ans: B

13. Conditional Split


A. Many routines of the Mainframe system are written in this.
B. Data is pulled from multiple sources to be merged into one or more destinations.
C. It allows multiple streams to be created from a single stream. Only rows that match the
criteria for a given path are sent down that path.
D. This is used to create multiple streams within a data flow from a single stream. All records
in the stream are sent down all paths. Typically uses a merge-join to recombine the streams
later in the data flow.

Ans: C

14. Flat files


A. The easiest to consume from the ETL standpoint.
B. Three components of data flow.
C. Three common usages of ETL.
D. Two methods to ensure data integrity.

Ans: A

15. This is used to create multiple streams within a data flow from a single stream. All
records in the stream are sent down all paths. Typically uses a merge-join to recombine
the streams later in the data flow.
A. OLTP
B. Mainframe
C. EBCDIC
D. Multicast

Ans: D

16. There are little to no benefits to the ETL developer when accessing these types of
systems and many detriments. The ability to access these systems is very limited and
typically FTP of text files is used to facilitate access.
A. Mainframe
B. Union all
C. File Name
D. Multicast

Ans: A

17. Shows the path to the file to be imported.


A. File Name
B. Mainframe
C. Format
D. Union all

Ans: A
18. Wheel is already invented, documented, good support.
A. Format
B. COBOL
C. Tool Suite
D. Flat files

Ans: C

19. Similar to XML’s structured text file.


A. Data Scrubbing
B. EBCDIC
C. String
D. Web Scraping

Ans: D

20. Flat file control


A. Three components of data flow.
B. It is used to import text files for ETL processing.
C. The easiest to consume from the ETL standpoint.
D. Shows the path to the file to be imported.

Ans: B


21. Two methods to ensure data integrity.


A. Sources, Transformation, Destination
B. Data inspection
C. Row Count Inspection, Data Inspection
D. Row Count Inspection

Ans: C

22. Transformation
A. Data is pulled from multiple sources to be merged into one or more destinations.
B. It is used to import text files for ETL processing.
C. Process to move data from a source to destination.
D. It is used to massage data in transit between the source and destination.

Ans: D

23. Three common usages of ETL.


A. Data Scrubbing
B. Sources, Transformation, Destination
C. Merging Data
D. Merging Data, Data Scrubbing, Automation
Ans: D

24. Load in Parallel


A. A value of delimited should be selected for delimited files.
B. Data is pulled from multiple sources to be merged into one or more destinations.
C. This will reduce the run time of ETL process and reduce the window for hardware failure
to affect the process.
D. This should be checked if column names have been included in the first row of the file.

Ans: C

25. This can be computationally expensive excluding SSD.


A. Hard Drive I/O
B. Mainframe
C. Tool Suite
D. Data Scrubbing

Ans: A

26. A value of delimited should be selected for delimited files.


A. Sort
B. Format
C. String
D. OLTP

Ans: B

27. This should be checked if column names have been included in the first row of the file.
A. Row Count Inspection, Data Inspection
B. Format of the Date
C. Column names in the first data row checkbox
D. Do most work in transformation phase

Ans: C

28. OLAP stands for


a) Online analytical processing
b) Online analysis processing
c) Online transaction processing
d) Online aggregate processing

Answer: a

29. Data that can be modeled as dimension attributes and measure attributes are called
_______ data.
a) Multidimensional
b) Single Dimensional
c) Measured
d) Dimensional
Answer: a

30. The generalization of cross-tab which is represented visually is ____________ which


is also called as data cube.
a) Two dimensional cube
b) Multidimensional cube
c) N-dimensional cube
d) Cuboid

Answer: a


31. The process of viewing the cross-tab (Single dimensional) with a fixed value of one
attribute is
a) Slicing
b) Dicing
c) Pivoting
d) Both Slicing and Dicing

Answer: a

32. The operation of moving from finer-granularity data to a coarser granularity (by
means of aggregation) is called a ________
a) Rollup
b) Drill down
c) Dicing
d) Pivoting

Answer: a

33. In SQL the cross-tabs are created using


a) Slice
b) Dice
c) Pivot
d) All of the mentioned

Answer: a

34.{ (item name, color, clothes size), (item name, color), (item name, clothes size), (color,
clothes size), (item name), (color), (clothes size), () }
This can be achieved by using which of the following ?
a) group by rollup
b) group by cubic
c) group by
d) none of the mentioned
Answer: d

35. What do data warehouses support?


a) OLAP
b) OLTP
c) OLAP and OLTP
d) Operational databases
Answer: a

36.SELECT item name, color, clothes SIZE, SUM(quantity)


FROM sales
GROUP BY rollup(item name, color, clothes SIZE);
How many grouping is possible in this rollup?
a) 8
b) 4
c) 2
d) 1
Answer: b
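The rollup in Q36 produces one grouping per prefix of the column list plus the grand total, i.e. n+1 groupings for n columns, which is why three columns give 4. A small Python illustration of the groupings (column names taken from the query above; this generates them rather than running SQL):

```python
# rollup(c1, ..., cn) groups by each prefix of the column list, from the full
# list down to the empty grand total: n + 1 groupings in all (Q36).
def rollup_groupings(columns):
    return [tuple(columns[:i]) for i in range(len(columns), -1, -1)]

print(rollup_groupings(["item_name", "color", "clothes_size"]))
# 4 groupings: (item_name, color, clothes_size), (item_name, color),
# (item_name,), and () for the grand total
```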

37. Which one of the following is the right syntax for DECODE?
a) DECODE (search, expression, result [, search, result]… [, default])
b) DECODE (expression, result [, search, result]… [, default], search)
c) DECODE (search, result [, search, result]… [, default], expression)
d) DECODE (expression, search, result [, search, result]… [, default])
Answer: d
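Oracle's DECODE compares the expression against each search value in turn and returns the matching result, falling back to the default when nothing matches. A rough Python analogue (our own sketch, not Oracle code) makes the argument order in option d concrete:

```python
# Rough analogue of Oracle's DECODE(expr, search, result, ..., default) (Q37).
def decode(expr, *args):
    # args: search1, result1, search2, result2, ..., [default]
    pairs, default = (args[:-1], args[-1]) if len(args) % 2 else (args, None)
    for i in range(0, len(pairs), 2):
        if expr == pairs[i]:
            return pairs[i + 1]
    return default

print(decode("M", "M", "Male", "F", "Female", "Unknown"))  # Male
print(decode("X", "M", "Male", "F", "Female", "Unknown"))  # Unknown
```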

Introduction to Data Mining, Data Exploration and Preprocessing


Module 3

1. Data mining refers to ______


a) Special fields for database
b) Knowledge discovery from large database
c) Knowledge base for the database
d) Collections of attributes

Answer: B

2. An attribute is a ____
a) Normalization of Fields
b) Property of the class
c) Characteristics of the object
d) Summarise value

Answer: C

3. Which are not related to Ratio Attributes?


a) Age Group 10-20, 30-50, 35-45 (in Years)
b) Mass 20-30 kg, 10-15 kg
c) Areas 10-50, 50-100 (in Kilometres)
d) Temperature 10°-20°, 30°-50°, 35°-45°
Answer: D

4. The mean is the ________ of a dataset.


a) Average
b) Middle
c) Central
d) Ordered

Answer: A

5. The number that occurs most often within a set of data called as ______
a) Mean
b) Median
c) Mode
d) Range

Answer: C

6. Find the range for given data 40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50
a) 19
b) 29
c) 35
d) 49

Answer: B
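Q6's arithmetic can be checked in one line: the range is the maximum minus the minimum, 55 − 26 = 29.

```python
# Q6 check: the range of a dataset is max minus min.
data = [40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50]
print(max(data) - min(data))  # 29
```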

7. Which are not the part of the KDD process from the following
a) Selection
b) Pre-processing
c) Reduction
d) Summation

Answer: D

8. _______ is the output of KDD Process.


a) Query
b) Useful Information
c) Information
d) Data

Answer: B

9. Data mining turns a large collection of data into _____


a) Database
b) Knowledge
c) Queries
d) Transactions

Answer: B
10. In KDD Process, where data relevant to the analysis task are retrieved from the
database means _____
a) Data Selection
b) Data Collection
c) Data Warehouse
d) Data Mining

Answer: A


11. In KDD Process, data are transformed and consolidated into appropriate forms for
mining by performing summary or aggregation operations is called as _____
a) Data Selection
b) Data Transformation
c) Data Reduction
d) Data Cleaning

Answer: B

12. What kinds of data can be mined?


a) Database data
b) Data Warehouse data
c) Transactional data
d) All of the above

Answer:D

13. Data selection is _____


a) The actual discovery phase of a knowledge discovery process
b) The stage of selecting the right data for a KDD process
c) A subject-oriented integrated time-variant non-volatile collection of data in support of
management
d) Record oriented classes finding

Answer: B

14. To remove noise and inconsistent data ____ is needed.


a) Data Cleaning
b) Data Transformation
c) Data Reduction
d) Data Integration

Answer: A

15. Multiple data sources may be combined is called as _____


a) Data Reduction
b) Data Cleaning
c) Data Integration
d) Data Transformation

Answer: C

16. A _____ is a collection of tables, each of which is assigned a unique name which uses
the entity-relationship (ER) data model.
a) Relational database
b) Transactional database
c) Data Warehouse
d) Spatial database

Answer: A

17. Relational data can be accessed by _____ written in a relational query language.
a) Select
b) Queries
c) Operations
d) Like

Answer: B

18. _____ studies the collection, analysis, interpretation or explanation, and


presentation of data.
a) Statistics
b) Visualization
c) Data Mining
d) Clustering

Answer: A

19. ______ investigates how computers can learn (or improve their performance) based
on data.
a) Machine Learning
b) Artificial Intelligence
c) Statistics
d) Visualization

Answer: A

20. _____ is the science of searching for documents or information in documents.


a) Data Mining
b) Information Retrieval
c) Text Mining
d) Web Mining

Answer: B


21. Data often contain _____


a) Target Class
b) Uncertainty
c) Methods
d) Keywords

Answer: B

22. The data mining process should be highly ______


a) On Going
b) Active
c) Interactive
d) Flexible

Answer: C

23. In the real-world multidimensional view of data mining, the major dimensions are data,
knowledge, technologies, and _____
a) Methods
b) Applications
c) Tools
d) Files

Answer: B

24. An _____ is a data field, representing a characteristic or feature of a data object.


a) Method
b) Variable
c) Task
d) Attribute

Answer: D

25. The values of a _____ attribute are symbols or names of things.


a) Ordinal
b) Nominal
c) Ratio
d) Interval

Answer:B

26. “Data about data” is referred to as _____


a) Information
b) Database
c) Metadata
d) File
Answer: C

27. ______ partitions the objects into different groups.


a) Mapping
b) Clustering
c) Classification
d) Prediction

Answer:B

28. In _____, the attribute data are scaled so as to fall within a smaller range, such as
-1.0 to 1.0, or 0.0 to 1.0.
a) Aggregation
b) Binning
c) Clustering
d) Normalization

Answer: D

29. Normalization by ______ normalizes by moving the decimal point of values of attributes.
a) Z-Score
b) Z-Index
c) Decimal Scaling
d) Min-Max Normalization

Answer: C

30. _______ is a top-down splitting technique based on a specified number of bins.


a) Normalization
b) Binning
c) Clustering
d) Classification

Answer: B
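The three preprocessing techniques behind Q28–Q30 — normalization to a smaller range, decimal scaling, and top-down equal-width binning — can be sketched in a few lines of Python. This is an illustrative sketch on invented data, not code from the quiz source:

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Q28: scale attribute values into a smaller range such as [0.0, 1.0]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def decimal_scaling(values):
    """Q29: normalize by moving the decimal point -- divide by 10**j,
    where j is the smallest integer making every |value| < 1."""
    j = 0
    while any(abs(v) / 10 ** j >= 1 for v in values):
        j += 1
    return [v / 10 ** j for v in values]

def equal_width_bins(values, n_bins):
    """Q30: top-down split of the value range into a specified number of bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

data = [200, 300, 400, 600, 1000]   # invented attribute values
print(min_max(data))                # [0.0, 0.125, 0.25, 0.5, 1.0]
print(decimal_scaling(data))        # [0.02, 0.03, 0.04, 0.06, 0.1]
print(equal_width_bins(data, 4))    # [0, 0, 1, 2, 3]
```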

Classification, Prediction and Clustering


Module 4

1. How many terms are required for building a bayes model?


a) 1
b) 2
c) 3
d) 4

Answer: c

2. What is needed to make probabilistic systems feasible in the world?


a) Reliability
b) Crucial robustness
c) Feasibility
d) None of the mentioned

Answer: b

3. Where can the Bayes rule be used?


a) Solving queries
b) Increasing complexity
c) Decreasing complexity
d) Answering probabilistic query

Answer: d
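Q3's point — that Bayes' rule lets a system answer a probabilistic query P(H|E) from quantities that are easier to specify — can be sketched directly; the probabilities below are invented for illustration:

```python
def bayes(p_e_given_h, p_h, p_e_given_not_h):
    """Bayes' rule: P(H|E) = P(E|H) * P(H) / P(E), with P(E) expanded
    by the law of total probability."""
    p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
    return p_e_given_h * p_h / p_e

# Probabilistic query: P(disease | positive test), with an invented
# 90%-sensitive test, 10% false-positive rate, and 1% prior.
posterior = bayes(p_e_given_h=0.9, p_h=0.01, p_e_given_not_h=0.1)
print(round(posterior, 4))  # 0.0833
```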

4. What does the Bayesian network provide?


a) Complete description of the domain
b) Partial description of the domain
c) Complete description of the problem
d) None of the mentioned

Answer: a

5. How can the entries in the full joint probability distribution be calculated?
a) Using variables
b) Using information
c) Both Using variables & information
d) None of the mentioned

Answer: b

6. How can the Bayesian network be used to answer any query?


a) Full distribution
b) Joint distribution
c) Partial distribution
d) All of the mentioned

Answer: b

7. How can the compactness of the Bayesian network be described?


a) Locally structured
b) Fully structured
c) Partial structure
d) All of the mentioned

Answer: a

8. With what is the local structure associated?


a) Hybrid
b) Dependent
c) Linear
d) None of the mentioned
Answer: c

9. Which condition is used to influence a variable directly by all the others?


a) Partially connected
b) Fully connected
c) Local connected
d) None of the mentioned

Answer: b

10. What is the relationship between a node and its predecessors while creating a
Bayesian network?
a) Functionally dependent
b) Dependent
c) Conditionally independent
d) Both conditionally dependent & dependent

Answer: c


11. A _________ is a decision support tool that uses a tree-like graph or model of
decisions and their possible consequences, including chance event outcomes, resource
costs, and utility.
a) Decision tree
b) Graphs
c) Trees
d) Neural Networks

Answer: a

12. Decision Tree is a display of an algorithm.


a) True
b) False

Answer: a

13. What is Decision Tree?


a) Flow-Chart
b) Structure in which internal node represents test on an attribute, each branch represents
outcome of test and each leaf node represents class label
c) Flow-Chart & Structure in which internal node represents test on an attribute, each branch
represents outcome of test and each leaf node represents class label
d) None of the mentioned

Answer: c
14. Decision Trees can be used for Classification Tasks.
a) True
b) False

Answer: a

15. Which of the following are Decision Tree nodes?
a) Decision Nodes
b) End Nodes
c) Chance Nodes
d) All of the mentioned

Answer: d

16. Decision Nodes are represented by ____________


a) Disks
b) Squares
c) Circles
d) Triangles

Answer: b

17. Chance Nodes are represented by __________


a) Disks
b) Squares
c) Circles
d) Triangles

Answer: c

18. End Nodes are represented by __________


a) Disks
b) Squares
c) Circles
d) Triangles

Answer: d

19. Which of the following are the advantage/s of Decision Trees?


a) Possible Scenarios can be added
b) Use a white box model, if a given result is provided by a model
c) Worst, best and expected values can be determined for different scenarios
d) All of the mentioned

Answer: d

20. Which of the following is the valid component of the predictor?


a) data
b) question
c) algorithm
d) all of the mentioned

Answer: d


21. Point out the wrong statement.


a) In Sample Error is also called generalization error
b) Out of Sample Error is the error rate you get on the new dataset
c) In Sample Error is also called resubstitution error
d) All of the mentioned

Answer: a

22. Which of the following is the correct order of working?


a) questions->input data ->algorithms
b) questions->evaluation ->algorithms
c) evaluation->input data ->algorithms
d) all of the mentioned

Answer: a

23. Which of the following shows the correct relative order of importance?


a) question->features->data->algorithms
b) question->data->features->algorithms
c) algorithms->data->features->question
d) none of the mentioned

Answer: b

24. Point out the correct statement.


a) In Sample Error is the error rate you get on the same dataset used to model a predictor
b) Data have two parts-signal and noise
c) The goal of predictor is to find signal
d) None of the mentioned

Answer: d

25. Which of the following is a characteristic of the best machine learning method?


a) Fast
b) Accuracy
c) Scalable
d) All of the mentioned

Answer: d
26. True positive means correctly rejected.
a) True
b) False

Answer: b

27. Which of the following trade-off occurs during prediction?


a) Speed vs Accuracy
b) Simplicity vs Accuracy
c) Scalability vs Accuracy
d) None of the mentioned

Answer: d

28. Which of the following expression is true?


a) In sample error < out sample error
b) In sample error > out sample error
c) In sample error = out sample error
d) All of the mentioned

Answer: a

29. Backtesting is a key component of effective trading-system development.


a) True
b) False

Answer: a

30. Which of the following is correct use of cross validation?


a) Selecting variables to include in a model
b) Comparing predictors
c) Selecting parameters in prediction function
d) All of the mentioned

Answer: d


31. Point out the wrong combination.


a) True negative=correctly rejected
b) False negative=correctly rejected
c) False positive=correctly identified
d) All of the mentioned

Answer: c
32. Which of the following is a common error measure?
a) Sensitivity
b) Median absolute deviation
c) Specificity
d) All of the mentioned

Answer: d
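Q26, Q31, and Q32 all turn on confusion-matrix vocabulary: true positive = correctly identified, true negative = correctly rejected, false positive = incorrectly identified, false negative = incorrectly rejected, with sensitivity and specificity derived from the counts. A small sketch on invented labels:

```python
def confusion(actual, predicted):
    """Count the four confusion-matrix cells for binary labels."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))  # correctly identified
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))  # correctly rejected
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # incorrectly identified
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # incorrectly rejected
    return tp, tn, fp, fn

actual    = [1, 1, 0, 0, 1, 0, 1, 0]   # invented ground truth
predicted = [1, 0, 0, 1, 1, 0, 1, 0]   # invented predictions
tp, tn, fp, fn = confusion(actual, predicted)
sensitivity = tp / (tp + fn)           # true positive rate
specificity = tn / (tn + fp)           # true negative rate
print(tp, tn, fp, fn)                  # 3 3 1 1
print(sensitivity, specificity)        # 0.75 0.75
```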

33. Which of the following is not a machine learning algorithm?


a) SVG
b) SVM
c) Random forest
d) None of the mentioned

Answer: a

34. Point out the wrong statement.


a) ROC curve stands for receiver operating characteristic
b) For time series, data must be in chunks
c) Random sampling must be done with replacement
d) None of the mentioned

Answer: d

35. Which of the following is a categorical outcome?


a) RMSE
b) RSquared
c) Accuracy
d) All of the mentioned

Answer: c

36. For k-fold cross-validation, a larger k value implies more bias.


a) True
b) False

Answer: b

37. Which of the following method is used for trainControl resampling?


a) repeatedcv
b) svm
c) bag32
d) none of the mentioned

Answer: a

38. Which of the following can be used to create the most common graph types?
a) qplot
b) quickplot
c) plot
d) all of the mentioned

Answer: a

39. For k-fold cross-validation, a smaller k value implies less variance.


a) True
b) False

Answer: a

40. Predicting with trees evaluates _____________ within each group of data.
a) equality
b) homogeneity
c) heterogeneity
d) all of the mentioned

Answer: b
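Q40 says tree prediction evaluates homogeneity within each group. One common homogeneity measure (an assumption here — the quiz does not name one) is Gini impurity, which is 0 for a perfectly homogeneous group:

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

print(gini(["a", "a", "a", "a"]))  # 0.0 -> perfectly homogeneous group
print(gini(["a", "a", "b", "b"]))  # 0.5 -> maximally mixed for two classes
```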


41. Point out the wrong statement.


a) Training and testing data must be processed in different way
b) Test transformation would mostly be imperfect
c) The first goal is statistical and second is data compression in PCA
d) All of the mentioned

Answer: a

42. Which of the following method options is provided by train function for bagging?
a) bagEarth
b) treebag
c) bagFDA
d) all of the mentioned

Answer: d

43. Which of the following is correct with respect to random forest?


a) Random forests are difficult to interpret but often very accurate
b) Random forests are easy to interpret but often very accurate
c) Random forests are difficult to interpret but much less accurate
d) None of the mentioned

Answer: a

44. Point out the correct statement.


a) Prediction with regression is easy to implement
b) Prediction with regression is easy to interpret
c) Prediction with regression performs well when linear model is correct
d) All of the mentioned

Answer: d

45. Which of the following library is used for boosting generalized additive models?
a) gamBoost
b) gbm
c) ada
d) all of the mentioned

Answer: a

46. The principal components are equal to left singular values if you first scale the
variables.
a) True
b) False

Answer: b

47. Which of the following is statistical boosting based on additive logistic regression?
a) gamBoost
b) gbm
c) ada
d) mboost

Answer: a

48. Which of the following is one of the largest boost subclass in boosting?
a) variance boosting
b) gradient boosting
c) mean boosting
d) all of the mentioned

Answer: b

49. PCA is most useful for non linear type models.


a) True
b) False

Answer: b
50. Which of the following clustering types has the characteristic shown in the figure below?

a) Partitional
b) Hierarchical
c) Naive bayes
d) None of the mentioned

Answer: b


51. Point out the correct statement.


a) The choice of an appropriate metric will influence the shape of the clusters
b) Hierarchical clustering is also called HCA
c) In general, the merges and splits are determined in a greedy manner
d) All of the mentioned

Answer: d

52. Which of the following is finally produced by Hierarchical Clustering?


a) final estimate of cluster centroids
b) tree showing how close things are to each other
c) assignment of each point to clusters
d) all of the mentioned

Answer: b

53. Which of the following is required by K-means clustering?


a) defined distance metric
b) number of clusters
c) initial guess as to cluster centroids
d) all of the mentioned

Answer: d

54. Point out the wrong statement.


a) k-means clustering is a method of vector quantization
b) k-means clustering aims to partition n observations into k clusters
c) k-nearest neighbor is same as k-means
d) none of the mentioned
Answer: c

55. Which of the following combination is incorrect?


a) Continuous – euclidean distance
b) Continuous – correlation similarity
c) Binary – manhattan distance
d) None of the mentioned

Answer: d

56. Hierarchical clustering should be primarily used for exploration.


a) True
b) False

Answer: a

57. Which of the following function is used for k-means clustering?


a) k-means
b) k-mean
c) heatmap
d) none of the mentioned

Answer: a
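Q53, Q57, and Q59 fit together: k-means needs a distance metric, a cluster count, and an initial centroid guess, then iterates assignment and update steps until convergence (R exposes this as the built-in kmeans() function). A minimal one-dimensional sketch on invented points:

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Toy 1-D k-means: absolute distance, iterate until assignments stabilize."""
    groups = {}
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        groups = {i: [] for i in range(len(centroids))}
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            groups[nearest].append(p)
        # Update step: each centroid becomes the mean of its group.
        new = [sum(g) / len(g) if g else centroids[i] for i, g in groups.items()]
        if new == centroids:   # converged
            break
        centroids = new
    return centroids, groups

centroids, groups = kmeans_1d([1.0, 2.0, 9.0, 10.0], centroids=[0.0, 5.0])
print(centroids)  # [1.5, 9.5]
print(groups)     # {0: [1.0, 2.0], 1: [9.0, 10.0]}
```

Different initial guesses can converge to different partitions, which is exactly why Q59 calls k-means non-deterministic.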

58. Which of the following clustering requires merging approach?


a) Partitional
b) Hierarchical
c) Naive Bayes
d) None of the mentioned

Answer: b

59. K-means is not deterministic and also consists of a number of iterations.


a) True
b) False

Answer: a

60. Hierarchical clustering should be mainly used for exploration.


a) True
b) False

Answer: a

61. K-means clustering consists of a number of iterations and is not deterministic.
a) True
b) False

Answer: a

62. Which is needed by K-means clustering?


a) defined distance metric
b) number of clusters
c) initial guess as to cluster centroids
d) all of these

Answer: d

63. Which function is used for k-means clustering?


a) k-means
b) k-mean
c) heatmap
d) none of the mentioned

Answer: a

64. Which is conclusively produced by Hierarchical Clustering?


a) final estimation of cluster centroids
b) tree showing how nearby things are to each other
c) assignment of each point to clusters
d) all of these

Answer: b

65. Which clustering technique requires a merging approach?


a) Partitional
b) Hierarchical
c) Naive Bayes
d) None of the mentioned

Answer: b

Mining Frequent Patterns and Association Rules


Module 5

1. A collection of one or more items is called as _____


a) Itemset
b) Support
c) Confidence
d) Support Count

Answer: A
2. Frequency of occurrence of an itemset is called as _____
a) Support
b) Confidence
c) Support Count
d) Rules

Answer: C

3. An itemset whose support is greater than or equal to a minimum support threshold is ______
a) Itemset
b) Frequent Itemset
c) Infrequent items
d) Threshold values

Answer: B

4. What does FP growth algorithm do?


a) It mines all frequent patterns through pruning rules with lesser support
b) It mines all frequent patterns through pruning rules with higher support
c) It mines all frequent patterns by constructing a FP tree
d) It mines all frequent patterns by constructing itemsets

Answer: C

5. What techniques can be used to improve the efficiency of apriori algorithm?


a) Hash-based techniques
b) Transaction Increases
c) Sampling
d) Cleaning

Answer: A

6. What do you mean by support(A)?


a) Total number of transactions containing A
b) Total Number of transactions not containing A
c) Number of transactions containing A / Total number of transactions
d) Number of transactions not containing A / Total number of transactions

Answer: C

7. How do you calculate Confidence (A -> B)?


a) Support(A ∪ B) / Support(A)
b) Support(A ∪ B) / Support(B)
c) Support(A ∩ B) / Support(A)
d) Support(A ∩ B) / Support(B)
Answer: A
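The definitions in Q6 and Q7 are easy to check by hand: support(A) is the fraction of transactions containing A, and confidence(A -> B) divides the support of the combined itemset by support(A). A sketch on an invented market-basket dataset:

```python
# Invented transaction database for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Q6: transactions containing the itemset / total transactions."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b):
    """Q7: support of the combined itemset divided by support(a)."""
    return support(a | b) / support(a)

print(support({"bread"}))               # 0.75
print(confidence({"bread"}, {"milk"}))  # 0.5 / 0.75 = 0.666...
```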

8. Which of the following is the direct application of frequent itemset mining?


a) Social Network Analysis
b) Market Basket Analysis
c) Outlier Detection
d) Intrusion Detection

Answer: B

9. What is not true about FP growth algorithms?


a) It mines frequent itemsets without candidate generation
b) There are chances that FP trees may not fit in the memory
c) FP trees are very expensive to build
d) It expands the original database to build FP trees

Answer: D

10. When do you consider an association rule interesting?


a) If it only satisfies min_support
b) If it only satisfies min_confidence
c) If it satisfies both min_support and min_confidence
d) There are other measures to check so

Answer: C


11. What is the relation between a candidate and frequent itemsets?


a) A candidate itemset is always a frequent itemset
b) A frequent itemset must be a candidate itemset
c) No relation between these two
d) Strong relation with transactions

Answer: B

12. Which of the following is not a frequent pattern mining algorithm?


a) Apriori
b) FP growth
c) Decision trees
d) Eclat

Answer: C

13. Which algorithm requires fewer scans of data?


a) Apriori
b) FP Growth
c) Naive Bayes
d) Decision Trees

Answer: B

14. For the question given below, consider the transaction data:

I1, I2, I3, I4, I5, I6
I7, I2, I3, I4, I5, I6
I1, I8, I4, I5
I1, I9, I10, I4, I6
I10, I2, I4, I11, I5

With support 0.6, find all frequent itemsets.

a) <I1>, <I2>, <I4>, <I5>, <I6>, <I1, I4>, <I2, I4>, <I2, I5>, <I4, I5>, <I4, I6>, <I2, I4, I5>

b) <I2>, <I4>, <I5>, <I2, I4>, <I2, I5>, <I4, I5>, <I2, I4, I5>

c) <I11>, <I4>, <I5>, <I6>, <I1, I4>, <I5, I4>, <I11, I5>, <I4, I6>, <I2, I4, I5>

d) <I1>, <I4>, <I5>, <I6>

Answer: A
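Answer (a) can be verified by brute force: enumerate every candidate itemset over the five transactions in Q14 and keep those appearing in at least 0.6 × 5 = 3 transactions. A sketch:

```python
from itertools import combinations

# The five transactions listed in Q14.
transactions = [
    {"I1", "I2", "I3", "I4", "I5", "I6"},
    {"I7", "I2", "I3", "I4", "I5", "I6"},
    {"I1", "I8", "I4", "I5"},
    {"I1", "I9", "I10", "I4", "I6"},
    {"I10", "I2", "I4", "I11", "I5"},
]

items = sorted(set().union(*transactions))
min_count = 3   # support threshold 0.6 over 5 transactions

frequent = []
for size in range(1, len(items) + 1):
    found = False
    for cand in combinations(items, size):
        if sum(set(cand) <= t for t in transactions) >= min_count:
            frequent.append(frozenset(cand))
            found = True
    if not found:   # Apriori property: no larger itemset can be frequent
        break

print(len(frequent))  # 11 itemsets, matching option (a)
```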

15. What will happen if support is reduced?


a) Number of frequent itemsets remains the same
b) Some itemsets will be added to the current set of frequent itemsets
c) Some itemsets will become infrequent while others will become frequent
d) Cannot say

Answer: B

16. What is association rule mining?


a) Same as frequent itemset mining
b) Finding of strong association rules using frequent itemsets
c) Using association to analyze correlation rules
d) Finding Itemsets for future trends

Answer: B

17. A definition or a concept is ______ if it classifies any examples as coming within the
concept
a) Concurrent
b) Consistent
c) Constant
d) Compete

Answer: B
1. Which of the following features usually applies to data in a data warehouse?
A.Data are often deleted
B.Most applications consist of transactions
C.Data are rarely deleted
D.Relatively few records are processed by applications
Ans: c

2. Which of the following statement is true?


A.The data warehouse consists of data marts and operational data
B.The data warehouse is used as a source for the operational data
C.The operational data are used as a source for the data warehouse
D.All of the above
Ans: c

3. The following is true of three-tier data warehouses:


A.Once created, the data marts will keep on being updated from the data warehouse at
periodic times

B.Once created, the data marts will directly receive their new data from the operational
databases

C.The data marts are different groups of tables in the data warehouse
D.A data mart becomes a data warehouse when it reaches a critical size
Ans: a

4. The following technology is not well-suited for data mining:


A.Expert system technology
B.Data visualization
C.Technology limited to specific data types such as numeric data types
D.Parallel architecture
Ans: c

5. What is true of the multidimensional model?


A.It typically requires less disk storage
B.It typically requires more disk storage
C.Typical business queries requiring aggregate functions take more time
D.Increasing the size of a dimension is difficult
Ans: b

6. The value at the intersection of the row labeled “India” and the column
“Savings” in Table2 should be:
A.800,000
B.300,000
C.200,000
D.300,000
Ans: a

7. We want to add the following capabilities to Table2: show the data for 3 age groups
(20-39, 40-60, over 60), 3 revenue groups (less than $10,000, $10,000-$30,000, over
$30,000) and add a new type of account: Money market. The total number of measures
will be:
A.4
B.More than 100
C.Between 10 and 30 (boundaries included)
D.Between 40 and 60 (boundaries included)
Ans: b

8. We want to add the following capability to Table2: for each type of account in each
region, also show the dollar amount besides the number of customers. This adds to
Table2:
A.Another dimension
B.Other column(s)
C.Other row(s)
D.Another measure for each cell
Ans: d

9. The most common source of change data in refreshing a data warehouse is:
A.Queryable change data
B.Cooperative change data
C.Logged change data
D.Snapshot change data
Ans: d

10. Which of the following statements is not true about refreshing a data warehouse:
A.It is a process of managing timing differences between the updating of data sources and the
related data warehouse objects

B.Updates to dimension tables may occur at different times than the fact table
C.The data warehouse administrator has more control over the load time lag than the valid
time lag
D.None of the above
Ans: d

11. A data warehouse is which of the following?


A. Can be updated by end users.
B. Contains numerous naming conventions and formats.
C. Organized around important subject areas.
D. Contains only current data.
Ans: C

12. An operational system is which of the following?


A. A system that is used to run the business in real time and is based on historical data.
B. A system that is used to run the business in real time and is based on current data.
C. A system that is used to support decision making and is based on current data.
D. A system that is used to support decision making and is based on historical data.
Ans: B
13. The generic two-level data warehouse architecture includes which of the following?
A. At least one data mart
B. Data that can extracted from numerous internal and external sources
C. Near real-time updates
D. All of the above.
Ans: B

14. The active data warehouse architecture includes which of the following?
A. At least one data mart
B. Data that can extracted from numerous internal and external sources
C. Near real-time updates
D. All of the above.
Ans: D

15. Reconciled data is which of the following?


A. Data stored in the various operational systems throughout the organization.
B. Current data intended to be the single source for all decision support systems.
C. Data stored in one operational system in the organization.
D. Data that has been selected and formatted for end-user support applications.
Ans: B

16. Transient data is which of the following?


A. Data in which changes to existing records cause the previous version of the records to be
eliminated
B. Data in which changes to existing records do not cause the previous version of the records
to be eliminated
C. Data that are never altered or deleted once they have been added
D. Data that are never deleted once they have been added
Ans: A

17. The extract process is which of the following?


A. Capturing all of the data contained in various operational systems
B. Capturing a subset of the data contained in various operational systems
C. Capturing all of the data contained in various decision support systems
D. Capturing a subset of the data contained in various decision support systems
Ans: B

18. Data scrubbing is which of the following?


A. A process to reject data from the data warehouse and to create the necessary indexes
B. A process to load the data in the data warehouse and to create the necessary indexes
C. A process to upgrade the quality of data after it is moved into a data warehouse
D. A process to upgrade the quality of data before it is moved into a data warehouse
Ans: D

19. The load and index is which of the following?


A. A process to reject data from the data warehouse and to create the necessary indexes
B. A process to load the data in the data warehouse and to create the necessary indexes
C. A process to upgrade the quality of data after it is moved into a data warehouse
D. A process to upgrade the quality of data before it is moved into a data warehouse
Ans: B
20. Data transformation includes which of the following?
A. A process to change data from a detailed level to a summary level
B. A process to change data from a summary level to a detailed level
C. Joining data from one source into various sources of data
D. Separating data from one source into various sources of data
Ans: A

21. A multifield transformation does which of the following?


A. Converts data from one field into multiple fields
B. Converts data from multiple fields into one field
C. Converts data from multiple fields into multiple fields
D. All of the above
Ans: D

22. A star schema has what type of relationship between a dimension and fact table?
A. Many-to-many
B. One-to-one
C. One-to-many
D. All of the above.
Ans: C

23. Fact tables are which of the following?


A. Completely denormalized
B. Partially denormalized
C. Completely normalized
D. Partially normalized
Ans: C

24. A snowflake schema is which of the following types of tables?


A. Fact
B. Dimension
C. Helper
D. All of the above
Ans: D

25. A goal of data mining includes which of the following?


A. To explain some observed event or condition
B. To confirm that data exists
C. To analyze data for expected relationships
D. To create a new data warehouse
Ans: A

26. Which of the following statements does not apply to relational databases?
A. Relational databases are simple to understand
B. Tables are one of the basic components of relational databases
C. Relational databases have a strong procedural orientation
D. Relational databases have a strong mathematical foundation
Ans: C
27. In the relational database terminology, a table is synonymous with:
A. A column
B. A row
C. An attribute
D. A relation
Ans: D

28. A null value indicates:


A. A numeric value with value 0
B. The absence of a value
C. A very small value
D. An erroneous value
Ans: B

29. When the referential integrity rule is enforced, which one is usually not a valid
action in response to the deletion of a row that contains a primary key value referenced
elsewhere?
A. Do not allow the deletion
B. Accept the deletion without any other action
C. Delete the related rows
D. Set the foreign keys of related rows to null
Ans: B

30. When an equi-join is performed on a table of N rows and a table of M rows, the
resulting table has the following number of rows:
A. M
B. N
C. The smaller of M or N
D. A number in the range 0 to M*N
Ans: D

1. The problem of finding hidden structure in unlabeled data is called


A. Supervised learning
B. Unsupervised learning
C. Reinforcement learning
Ans: B

2. Task of inferring a model from labeled training data is called


A. Unsupervised learning
B. Supervised learning
C. Reinforcement learning
Ans: B

3. Some telecommunication company wants to segment their customers into distinct
groups in order to send appropriate subscription offers; this is an example of
A. Supervised learning
B. Data extraction
C. Seriation
D. Unsupervised learning
Ans: D
4. Self-organizing maps are an example of
A. Unsupervised learning
B. Supervised learning
C. Reinforcement learning
D. Missing data imputation
Ans: A

5. You are given data about seismic activity in Japan, and you want to predict the
magnitude of the next earthquake; this is an example of
A. Supervised learning
B. Unsupervised learning
C. Seriation
D. Dimensionality reduction
Ans: A

6. Assume you want to perform supervised learning and to predict the number of
newborns according to the size of the storks' population
(http://www.brixtonhealth.com/storksBabies.pdf), it is an example of
A. Classification
B. Regression
C. Clustering
D. Structural equation modeling
Ans: B

7. Discriminating between spam and ham e-mails is a classification task, true or false?
A. True
B. False
Ans: A

8. In the example of predicting the number of babies based on storks' population size, the
number of babies is
A. outcome
B. feature
C. attribute
D. observation
Ans: A

9. It may be better to avoid the metric of ROC curve as it can suffer from accuracy
paradox.
A. True
B. False
Ans: B

10. Which of the following is not involved in data mining?


A. Knowledge extraction
B. Data archaeology
C. Data exploration
D. Data transformation
Ans: D
DATA MINING Questions

11. Which is the right approach to Data Mining?

A. Infrastructure, exploration, analysis, interpretation, exploitation
B. Infrastructure, exploration, analysis, exploitation, interpretation
C. Infrastructure, analysis, exploration, interpretation, exploitation
D. Infrastructure, analysis, exploration, exploitation, interpretation
Ans: A

12. Which of the following issue is considered before investing in Data Mining?
A. Functionality
B. Vendor consideration
C. Compatibility
D. All of the above
Ans: D

13. Adaptive system management is


A. It uses machine-learning techniques: here programs can learn from past experience and
adapt themselves to new situations
B. Computational procedure that takes some value as input and produces some value as
output.
C. Science of making machines perform tasks that would require intelligence when
performed by humans
D. none of these
Ans: A

14. Bayesian classifiers is


A. A class of learning algorithm that tries to find an optimum classification of a set of
examples using the probabilistic theory.
B. Any mechanism employed by a learning system to constrain the search space of a
hypothesis
C. An approach to the design of learning algorithms that is inspired by the fact that when
people encounter new situations, they often explain them by reference to familiar
experiences, adapting the explanations to fit the new situation.
D. None of these
Ans: A

15. Algorithm is
A. It uses machine-learning techniques: here programs can learn from past experience and
adapt themselves to new situations
B. Computational procedure that takes some value as input and produces some value as
output
C. Science of making machines perform tasks that would require intelligence when
performed by humans
D. None of these
Ans: B

16. Bias is
A.A class of learning algorithm that tries to find an optimum classification of a set of
examples using the probabilistic theory
B. Any mechanism employed by a learning system to constrain the search space of a
hypothesis
C. An approach to the design of learning algorithms that is inspired by the fact that when
people encounter new situations, they often explain them by reference to familiar
experiences, adapting the explanations to fit the new situation.
D. None of these
Ans: B

17. Background knowledge refers to


A. Additional acquaintance used by a learning algorithm to facilitate the learning process
B. A neural network that makes use of a hidden layer
C. It is a form of automatic learning.
D. None of these
Ans: A

18. Case-based learning is


A. A class of learning algorithm that tries to find an optimum classification of a set of
examples using the probabilistic theory.
B. Any mechanism employed by a learning system to constrain the search space of a
hypothesis
C. An approach to the design of learning algorithms that is inspired by the fact that when
people encounter new situations, they often explain them by reference to familiar
experiences, adapting the explanations to fit the new situation.
D. None of these
Ans: C

19. Classification is
A. A subdivision of a set of examples into a number of classes
B. A measure of the accuracy, of the classification of a concept that is given by a certain
theory
C. The task of assigning a classification to a set of examples
D. None of these
Ans: A

20. Binary attributes are


A. These take only two values. In general, these values will be 0 and 1, and they can be coded
as one bit
B. The natural environment of a certain species
C. Systems that can be used without knowledge of internal operations
D. None of these
Ans: A
21. Classification accuracy is
A. A subdivision of a set of examples into a number of classes
B. Measure of the accuracy, of the classification of a concept that is given by a certain theory
C. The task of assigning a classification to a set of examples
D. None of these
Ans: B

22. A biotope is


A. This takes only two values. In general, these values will be 0 and 1
and they can be coded as one bit.
B. The natural environment of a certain species
C. Systems that can be used without knowledge of internal operations
D. None of these
Ans: B

23. Cluster is
A. Group of similar objects that differ significantly from other objects
B. Operations on a database to transform or simplify data in order to prepare it for a machine-
learning algorithm
C. Symbolic representation of facts or ideas from which information can potentially be
extracted
D. None of these
Ans: A

24. Black boxes are


A. This takes only two values. In general, these values will be 0 and 1
and they can be coded as one bit.
B. The natural environment of a certain species
C. Systems that can be used without knowledge of internal operations
D. None of these
Ans: C

25. A definition of a concept is ______ if it recognizes all the instances of that concept


A. Complete
B. Consistent
C. Constant
D. None of these
Ans: A

26. Data mining is


A. The actual discovery phase of a knowledge discovery process
B. The stage of selecting the right data for a KDD process
C. A subject-oriented integrated time variant non-volatile collection of data in support of
management
D. None of these
Ans: A

27. A definition or a concept is ______ if it classifies any examples as coming within the concept
A. Complete
B. Consistent
C. Constant
D. None of these
Ans: B

28. Data independence means


A. Data is defined separately and not included in programs
B. Programs are not dependent on the physical attributes of data.
C. Programs are not dependent on the logical attributes of data
D. Both (B) and (C).
Ans: D

29. E-R model uses this symbol to represent weak entity set?
A. Dotted rectangle
B. Diamond
C. Doubly outlined rectangle
D. None of these
Ans: C

30. SET concept is used in


A. Network Model
B. Hierarchical Model
C. Relational Model
D. None of these
Ans: A

31. Relational Algebra is


A. Data Definition Language
B. Meta Language
C. Procedural query Language
D. None of the above
Ans: C

32. Key to represent relationship between tables is called


A. Primary key
B. Secondary Key
C. Foreign Key
D. None of these
Ans: C
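A small sketch of question 32's answer, using Python's built-in sqlite3 module (the `dept`/`emp` tables are made up for illustration): a foreign key in one table references the primary key of another, and the database rejects rows that break that relationship.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE emp (id INTEGER PRIMARY KEY, dept_id INTEGER REFERENCES dept(id))")

conn.execute("INSERT INTO dept VALUES (1, 'sales')")
conn.execute("INSERT INTO emp VALUES (10, 1)")  # OK: dept 1 exists

try:
    conn.execute("INSERT INTO emp VALUES (11, 99)")  # no dept 99 -> rejected
    violated = False
except sqlite3.IntegrityError:
    violated = True

print(violated)  # → True
```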

33. ________ produces the relation that has the attributes of both R1 and R2


A. Cartesian product
B. Difference
C. Intersection
D. Product
Ans: A
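The Cartesian product in question 33 can be sketched with the standard library's `itertools.product` (the relations below are illustrative): every row of R1 is concatenated with every row of R2, so the result carries the attributes of both.

```python
from itertools import product

# Two small relations, represented as lists of tuples (rows).
r1 = [("a1",), ("a2",)]            # R1 with one attribute
r2 = [("b1",), ("b2",), ("b3",)]   # R2 with one attribute

# Cartesian product R1 x R2: 2 rows x 3 rows = 6 rows, each with R1's and R2's attributes.
cartesian = [row1 + row2 for row1, row2 in product(r1, r2)]

print(len(cartesian))   # → 6
print(cartesian[0])     # → ('a1', 'b1')
```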

34. Which of the following are the properties of entities?


A. Groups
B. Table
C. Attributes
D. Switchboards
Ans: C

35. In a relation
A. Ordering of rows is immaterial
B. No two rows are identical
C. (A) and (B) both are true
D. None of these
Ans: C
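Both properties from question 35 fall out of modeling a relation as a *set* of tuples (the rows below are made up): row order is immaterial, and a duplicate row collapses into one.

```python
# A relation as a set of tuples: ordering does not matter, duplicates cannot occur.
rows_a = {(1, "alice"), (2, "bob")}
rows_b = {(2, "bob"), (1, "alice"), (1, "alice")}  # reordered, with a repeated row

print(rows_a == rows_b)  # → True: the same relation despite ordering/duplication
print(len(rows_b))       # → 2: the duplicate row collapsed
```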

36. Inductive logic programming is


A. A class of learning algorithms that try to derive a Prolog program from examples
B. A table with n independent attributes can be seen as an n-dimensional space
C. A prediction made using an extremely simple method, such as always predicting the same
output
D. None of these
Ans: A

37. Machine learning is


A. An algorithm that can learn
B. A sub-discipline of computer science that deals with the design and implementation
of learning algorithms
C. An approach that abstracts from the actual strategy of an individual algorithm and can
therefore be applied to any other form of machine learning.
D. None of these
Ans: B

38. Projection pursuit is


A. The result of the application of a theory or a rule in a specific case
B. One of several possible keys within a database table that is chosen by the designer as the
primary means of accessing the data in the table.
C. Discipline in statistics that studies ways to find the most interesting projections of
multi-dimensional spaces
D. None of these
Ans: C

39. Node is
A. A component of a network
B. In the context of KDD and data mining, this refers to random errors in a database table.
C. One of the defining aspects of a data warehouse
D. None of these
Ans: A

40. Statistical significance is


A. The science of collecting, organizing, and interpreting numerical facts
B. Measure of the probability that a certain hypothesis is incorrect given certain
observations.
C. One of the defining aspects of a data warehouse, which is specially built around all the
existing applications of the operational data
D. None of these
Ans: B
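As a hedged illustration of question 40's definition (the coin-flip numbers are hypothetical), a one-sided p-value measures how probable an observation at least this extreme would be if the hypothesis (here, a fair coin) were true; a small value casts doubt on that hypothesis.

```python
from math import comb

# Probability of seeing >= 58 heads in 100 flips of a fair coin,
# computed exactly from the binomial distribution with p = 1/2.
n, k = 100, 58
p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
print(p_value)  # roughly 0.07: not strong evidence against fairness
```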

41. Multi-dimensional knowledge is


A. A class of learning algorithms that try to derive a Prolog program from examples
B. A table with n independent attributes can be seen as an n-dimensional space
C. A prediction made using an extremely simple method, such as always predicting the same
output.
D. None of these
Ans: B

42. Noise is
A. A component of a network
B. In the context of KDD and data mining, this refers to random errors in a database table.
C. One of the defining aspects of a data warehouse
D. None of these
Ans: B

43. Query tools are


A. A reference to the speed of an algorithm, which is quadratically dependent on the size of
the data
B. Attributes of a database table that can take only numerical values.
C. Tools designed to query a database.
D. None of these
Ans: C

44. Operational database is


A. A measure of the desired maximal complexity of data mining algorithms
B. A database containing volatile data used for the daily operation of an organization
C. Relational database management system
D. None of these
Ans: B

45. Prediction is
A. The result of the application of a theory or a rule in a specific case
B. One of several possible keys within a database table that is chosen by the designer as the
primary means of accessing the data in the table.
C. Discipline in statistics that studies ways to find the most interesting projections of multi-
dimensional spaces.
D. None of these
Ans: A
