
UNIT-1

1. What is Machine learning?

a) The autonomous acquisition of knowledge through the use of computer programs

b) The autonomous acquisition of knowledge through the use of manual programs

c) The selective acquisition of knowledge through the use of computer programs

d) The selective acquisition of knowledge through the use of manual programs

Answer: a

Explanation: Machine learning is the autonomous acquisition of knowledge through the use of
computer programs.

2. Which of the following is not a factor that affects the performance of a learner system?

a) Representation scheme used

b) Training scenario

c) Type of feedback

d) Good data structures

Answer: d

Explanation: Good data structures are not among the factors that affect the performance of a learner
system.

3. Which of the following is not a learning method?

a) Memorization

b) Analogy

c) Deduction

d) Introduction
Answer: d

Explanation: Introduction is not a learning method.

4. In language understanding, which of the following is not a level of knowledge?

a) Phonological

b) Syntactic

c) Empirical

d) Logical

Answer: c

Explanation: In language understanding, the levels of knowledge do not include empirical
knowledge.

5. Which of the following is not a category in a model of language?

a) Language units

b) Role structure of units

c) System constraints

d) Structural units

Answer: d

Explanation: The categories in a model of language do not include structural units.

6. What is a top-down parser?

a) Begins by hypothesizing a sentence (the symbol S) and successively predicting lower level
constituents until individual preterminal symbols are written

b) Begins by hypothesizing a sentence (the symbol S) and successively predicting upper level
constituents until individual preterminal symbols are written

c) Begins by hypothesizing lower level constituents and successively predicting a sentence (the symbol
S)
d) Begins by hypothesizing upper level constituents and successively predicting a sentence (the symbol
S)

Answer: a

Explanation: A top-down parser begins by hypothesizing a sentence (the symbol S) and successively
predicting lower level constituents until individual preterminal symbols are written.

7. Which of the following is not a Horn clause?

a) p

b) ¬p ∨ q

c) p → q

d) p → ¬q

Answer: d

Explanation: p → ¬q is not a Horn clause.

8. The action ‘STACK(A, B)’ of a robot arm specifies to _______________

a) Place block B on Block A

b) Place blocks A, B on the table in that order

c) Place blocks B, A on the table in that order

d) Place block A on block B

Answer: d

Explanation: The action ‘STACK(A, B)’ of a robot arm specifies to place block A on block B.

9. Choose the options that are correct regarding machine learning (ML) and artificial intelligence (AI):

(A) ML is an alternate way of programming intelligent machines.

(B) ML and AI have very different goals.


(C) ML is a set of techniques that turns a dataset into software.

(D) AI is software that can emulate the human mind.

Answer: (A), (C), (D)

10. Which of the following sentences is FALSE regarding regression?

(A) It relates inputs to outputs.

(B) It is used for prediction.

(C) It may be used for interpretation.

(D) It discovers causal relationships.

Answer: (D)

11. Which of the following is a widely used and effective machine learning algorithm based on the
idea of bagging?

a) Decision Tree

b) Regression

c) Classification

d) Random Forest

Answer: d
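The idea of bagging behind random forests — fit each model on a bootstrap resample of the training data, then aggregate by majority vote — can be sketched in plain Python. This is only an illustration: the 1-nearest-neighbour "model" stands in for a decision tree, and all names and data are invented for the example.

```python
import random
from collections import Counter

def bagging_predict(train, test_point, n_models=5, seed=0):
    """Illustrative bagging: fit each 'model' on a bootstrap resample of the
    training data, then aggregate the models' predictions by majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        # Bootstrap: sample the training set with replacement.
        sample = [rng.choice(train) for _ in train]
        # A 1-nearest-neighbour rule stands in for a decision tree here.
        nearest = min(sample, key=lambda xy: abs(xy[0] - test_point))
        votes.append(nearest[1])
    # Majority vote, as a random forest does for classification.
    return Counter(votes).most_common(1)[0][0]

train = [(1.0, "a"), (1.1, "a"), (1.2, "a"), (5.0, "b"), (5.1, "b"), (5.2, "b")]
```

A real random forest additionally samples a random subset of features at each tree split, which this sketch omits.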

12. What is Machine learning?

a) The autonomous acquisition of knowledge through the use of computer programs

b) The autonomous acquisition of knowledge through the use of manual programs

c) The selective acquisition of knowledge through the use of computer programs

d) The selective acquisition of knowledge through the use of manual programs

Ans: a

13. Which of the following is not a factor that affects the performance of a learner system?

a) Representation scheme used

b) Training scenario
c) Type of feedback

d) Good data structures

Ans: d

14. Which of the following is not a learning method?

a) Memorization

b) Analogy

c) Deduction

d) Introduction

Ans: d

15. In language understanding, which of the following is not a level of knowledge?

a) Phonological

b) Syntactic

c) Empirical

d) Logical

Ans: c

16. Which of the following is not a category in a model of language?

a) Language units

b) Role structure of units

c) System constraints

d) Structural units

Ans: d
17. What is a top-down parser?

a) Begins by hypothesizing a sentence (the symbol S) and successively predicting lower level
constituents until individual preterminal symbols are written

b) Begins by hypothesizing a sentence (the symbol S) and successively predicting upper level
constituents until individual preterminal symbols are written

c) Begins by hypothesizing lower level constituents and successively predicting a sentence (the symbol
S)

d) Begins by hypothesizing upper level constituents and successively predicting a sentence (the symbol
S)
Answer: a

18. Which of the following is not a Horn clause?

a) p

b) ¬p ∨ q

c) p → q

d) p → ¬q

Answer: d

19. The action ‘STACK(A, B)’ of a robot arm specifies to _______________

a) Place block B on Block A

b) Place blocks A, B on the table in that order

c) Place blocks B, A on the table in that order

d) Place block A on block B

Answer: d
20. Choose the options that are correct regarding machine learning (ML) and artificial intelligence (AI):

(A) ML is an alternate way of programming intelligent machines.

(B) ML and AI have very different goals.

(C) ML is a set of techniques that turns a dataset into software.

(D) AI is software that can emulate the human mind.

Answer: (A), (C), (D)

21. Which of the following sentences is FALSE regarding regression?

(A) It relates inputs to outputs.

(B) It is used for prediction.

(C) It may be used for interpretation.

(D) It discovers causal relationships.

Answer: (D)

22. Which of the following is a widely used and effective machine learning algorithm based on the
idea of bagging?

a) Decision Tree

b) Regression

c) Classification

d) Random Forest

Answer: (D)

23. To find the minimum or the maximum of a function, we set the gradient to zero because:

A) The value of the gradient at extrema of a function is always zero

B) Depends on the type of problem


C) Both A and B

D) None of the above

Answer: (A)
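A small worked example may help here: for a smooth function, the gradient vanishes exactly at an extremum, which is why setting it to zero locates the minimum. A sketch, with the function chosen arbitrarily for illustration:

```python
def f(x):
    # A convex function with its minimum at x = 3.
    return (x - 3) ** 2 + 1

def grad_f(x):
    # Analytic derivative: f'(x) = 2(x - 3); it vanishes exactly at the minimum.
    return 2 * (x - 3)

# Setting the gradient to zero: 2(x - 3) = 0  =>  x = 3.
x_star = 3.0
```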

24. The most widely used metrics and tools to assess a classification model are:

A) Confusion matrix

B) Cost-sensitive accuracy

C) Area under the ROC curve

D) All of the above

Answer: (D)
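As a sketch of the first of these tools: a confusion matrix simply tallies (actual, predicted) label pairs. A minimal pure-Python version, with illustrative label names:

```python
def confusion_matrix(y_true, y_pred, labels=("neg", "pos")):
    """Count (actual, predicted) pairs into a nested dict: matrix[actual][predicted]."""
    matrix = {a: {p: 0 for p in labels} for a in labels}
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

y_true = ["pos", "pos", "neg", "neg", "pos"]
y_pred = ["pos", "neg", "neg", "pos", "pos"]
cm = confusion_matrix(y_true, y_pred)
```

The diagonal entries (`cm["pos"]["pos"]`, `cm["neg"]["neg"]`) are the correct predictions; the off-diagonal entries are the false negatives and false positives.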

25. Which of the following is a good test dataset characteristic?

A) Large enough to yield meaningful results

B) Is representative of the dataset as a whole

C) Both A and B

D) None of the above

Answer: (C)

26) Which of the following is a disadvantage of decision trees?

A) Factor analysis

B) Decision trees are robust to outliers

C) Decision trees are prone to be overfit

D) None of the above

Answer (C)
27) How do you handle missing or corrupted data in a dataset?

A) Drop missing rows or columns

B) Replace missing values with mean/median/mode

C) Assign a unique category to missing values

D) All of the above


Answer (D)
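Option B, mean imputation, can be sketched in a few lines of Python; the helper name is invented for this example, and `None` is used as the missing-value marker:

```python
def impute_mean(values, missing=None):
    """Replace missing entries (marked with None here) with the mean of the
    observed entries."""
    observed = [v for v in values if v is not missing]
    mean = sum(observed) / len(observed)
    return [mean if v is missing else v for v in values]

filled = impute_mean([1.0, None, 3.0, None, 5.0])
```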

28) What is the purpose of performing cross-validation?

A) To assess the predictive performance of the models

B) To judge how the trained model performs outside the sample on test data

C) Both A and B
Answer (C)
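The mechanics of k-fold cross-validation — hold each fold out once as test data and train on the rest — can be sketched as follows (the helper name is invented for illustration):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds; each fold is held out once as the
    test set while the remaining folds form the training set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((sorted(train), test))
    return splits

splits = k_fold_indices(6, 3)
```

Each index appears in exactly one test fold, so every example is used for out-of-sample evaluation exactly once.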

29) Why is second order differencing in time series needed?

A) To make the series stationary

B) To find the maxima or minima at the local point

C) Both A and B

D) None of the above


Answer (C)

30) When performing regression or classification, which of the following is the correct way to
preprocess the data?

A) Normalize the data → PCA → training

B) PCA → normalize PCA output → training

C) Normalize the data → PCA → normalize PCA output → training

D) None of the above


Answer (A).
31) Which of the following is an example of feature extraction?

A) Constructing bag of words vector from an email

B) Applying a PCA projection to large high-dimensional data

C) Removing stopwords in a sentence

D) All of the above

Answer (D)
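Options A and C together can be sketched in a few lines: tokenize, drop stopwords, and count the remaining words into a bag-of-words vector. The stopword list here is a toy one, chosen only for the example:

```python
from collections import Counter

STOPWORDS = {"the", "a", "is", "to"}  # a toy stopword list for illustration

def bag_of_words(text):
    """Lowercase, drop stopwords, and count the remaining tokens:
    a simple feature extractor for text."""
    tokens = [w for w in text.lower().split() if w not in STOPWORDS]
    return Counter(tokens)

vec = bag_of_words("The offer is a great offer")
```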

32. Which combines inductive methods with the power of first-order representations?

a) Inductive programming

b) Logic programming

c) Inductive logic programming

d) Lisp programming

Answer: c

Explanation: Inductive logic programming (ILP) combines inductive methods with the power of first-order representations.

33. How many reasons are available for the popularity of ILP?

a) 1

b) 2

c) 3

d) 4

Answer: c

Explanation: The three reasons for the popularity of ILP are that it offers a rigorous approach to the
knowledge-based inductive learning problem, it provides complete algorithms for inducing first-order
theories, and it produces hypotheses that humans can read.

34. Which cannot be represented by a set of attributes?


a) Program

b) Three-dimensional configuration of a protein molecule

c) Agents

d) None of the mentioned

Answer: b

Explanation: Because the configuration inherently refers to relationships between objects.

35. Which is an appropriate language for describing the relationships?

a) First-order logic

b) Propositional logic

c) ILP

d) None of the mentioned

Answer: a

36. Which produces hypotheses that are easy to read for humans?

a) ILP

b) Artificial intelligence

c) Propositional logic

d) First-order logic

Answer: a

Explanation: ILP produces hypotheses in first-order logic that are easy for humans to read, so its
results can participate in the scientific cycle of experimentation.

37. What needs to be satisfied in inductive logic programming?


a) Constraint

b) Entailment constraint

c) Both Constraint & Entailment constraint

d) None of the mentioned

Answer: b

Explanation: The objective of an ILP is to come up with a set of sentences for the hypothesis such that
the entailment constraint is satisfied.

38. How many literals are available in top-down inductive learning methods?

a) 1

b) 2

c) 3

d) 4

Answer: c

39. Which inverts a complete resolution strategy?

a) Inverse resolution

b) Resolution

c) Trilogy

d) None of the mentioned

Answer: a

Explanation: Because it is a complete algorithm for learning first-order theories.

40. Which method can’t be used for expressing relational knowledge?

a) Literal system
b) Variable-based system

c) Attribute-based system

d) None of the mentioned

Answer: c

Explanation: ILP methods can learn relational knowledge that is not expressible in an attribute-based
system.

41. Which approach is used for refining a very general rule through ILP?

a) Top-down approach

b) Bottom-up approach

c) Both Top-down & Bottom-up approach

d) None of the mentioned

Answer: a

42. The characteristic of a computer system capable of thinking, reasoning and learning is known
as

a. machine intelligence

b. human intelligence

c. artificial intelligence

d. virtual intelligence

Answer: (c).

43. What is the term used for describing the judgmental or common-sense part of problem solving?

a. Heuristic
b. Critical

c. Value based

d. Analytical

Answer: (a).

44. Which kind of planning consists of successive representations of different levels of a plan?

a. hierarchical planning

b. non-hierarchical planning

c. project planning

d. All of the above

Answer: (a).

45. What was originally called the "imitation game" by its creator?

a. The Turing Test

b. LISP

c. The Logic Theorist

d. Cybernetics

Answer: (a).

46. An AI technique that allows computers to understand associations and relationships between
objects and events is called:

a. heuristic processing
b. cognitive science

c. relative symbolism

d. pattern matching

Answer: (d).

47. The field that investigates the mechanics of human intelligence is:

a. history

b. cognitive science

c. psychology

d. sociology

Answer: (b).

48. What is the name of the computer program that simulates the thought processes of human
beings?

a. Human logic

b. Expert reason

c. Expert system

d. Personal information

Answer: (c).

49. What is the name of the computer program that contains the distilled knowledge of an expert?

a. Data base management system

b. Management information System


c. Expert system

d. Artificial intelligence

Answer: (c).

50. High-resolution, bit-mapped displays are useful for displaying:

a. clearer characters

b. graphics

c. more characters

d. All of the above

Answer: (d).
UNIT – 1

1. What is termed a Well-Posed Learning Problem?


(a) If a unique solution to that problem exists
(b) If a solution exists but is not unique
(c) If a unique solution exists and the solution depends on some experience
(d) None of the above

2. What are the main requirements for a learning problem?


(a) Class of tasks T, Experience E and Performance Measure P
(b) Class C, Performance M,
(c) Only Experience E
(d) Only Performance measure P

3. What type of datasets are used in Supervised Learning


(a) Labeled Datasets
(b) Unlabeled Datasets
(c) Both of the Above
(d) None of the above

4. Regarding bias which of the following statements is true? (Here ‘high’ and ‘low’ are relative to
the ideal model.)
(a) Models which overfit have a high bias
(b) Models which overfit have a low bias
(c) Models which underfit have a high variance
(d) None of the above

5. A feature F1 can take the values A, B, C, D, E and F, and represents the grade of students from a
college. Which of the following statements is true in this case?
(a) Feature F1 is an example of nominal variable
(b) Feature F1 is an example of ordinal variable
(c) It doesn’t belong to any of the above category
(d) Both of these

6. Suppose your model is overfitting. Which of the following is NOT a valid way to try and reduce
the overfitting?
(a) Increase the amount of training data
(b) Improve the optimization algorithm being used for error minimization
(c) Decrease the model complexity
(d) Reduce the noise in the training data
7. Which of the following statement(s) is / are true for Gradient Descent (GD) and Stochastic Gradient
Descent (SGD)?
a) In GD and SGD, you update a set of parameters in an iterative manner to minimize the error
function
b) In SGD, you have to run through all the samples in your training set for a single update of a parameter
in each iteration
c) In GD, you either use the entire data or a subset of training data to update a parameter in each iteration
d) None of the above

8. Which of the following hyperparameter(s), when increased, may cause random forest to overfit
the data?
a) Number of Trees
b) Depth of Tree
c) Learning Rate
d) None of the above

9. What kind of learning algorithm is suitable for "future stock prices or currency exchange rates"?
a) Recognizing Anomalies
b) Prediction
c) Generating Patterns
d) Recognition Patterns

10. Adding a non-important feature to a linear regression model may result in


a) Increase in R value
b) Decrease in R value
c) Both Increase and decrease in R value
d) None of the above

11. The type of Training Experience available can have a significant impact on the success or failure
of the learner.
a) The above statement is TRUE
b) The above statement is FALSE
c) Cannot say
d) None of the above

12. “Problem of searching through a predefined space of potential hypotheses for the hypothesis that
best fits the training examples” is termed as :
a) Target Learning
b) Concept Learning
c) Unsupervised Learning
d) Supervised Learning

13. If you ask your friend to bring you a pizza, this is an example of:
a) Specific Hypothesis
b) General Hypothesis
c) Both of the above
d) None of the above

14. Imagine, you are solving a classification problem with highly imbalanced class. The majority
class is observed 99% of times in the training data. Your model has 99% accuracy after taking the
predictions on test data. Which of the following is true in such a case?
(1) Accuracy metric is not a good idea for imbalanced class problems
(2) Accuracy metric is a good idea for imbalanced class problems
(3) Precision and recall metrics are good for imbalanced class problems
(4) Precision and recall metrics aren’t good for imbalanced class problem

a) Option 1 and 3
b) Option 1 and 4
c) Option 2 and 4
d) Only Option 4

15. Consider the two hypotheses below. Which of the following hypotheses is more general?

H1= {Sunny,?, ? , Strong, ?,?}


H2= {Sunny, ?, ?,?,?,?}

a) H1 is more general than H2


b) H2 is more general than H1
c) Only H1 is more general
d) Only H2 is more general

16. What is the basic condition for the FIND-S Algorithm?


a) It deals with most specific hypothesis and considers only positive examples
b) It deals with most general hypothesis and considers only negative examples
c) It deals with most specific hypothesis and considers only positive examples
d) None of the above

17. In the FIND-S Algorithm, the search space moves:


a) From most specific to more general
b) From most general to most specific
c) Can be in either way
d) None of the above
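The specific-to-general movement of FIND-S can be sketched directly. The attribute encoding ("0" for "matches nothing", "?" for "matches anything") and the example data below are invented for illustration:

```python
def find_s(examples):
    """FIND-S: start from the most specific hypothesis and generalize it just
    enough to cover each positive example; negative examples are ignored."""
    n = len(examples[0][0])
    h = ["0"] * n  # "0" = matches nothing (the most specific hypothesis)
    for attrs, label in examples:
        if label != "yes":
            continue  # FIND-S ignores negative examples
        for i, value in enumerate(attrs):
            if h[i] == "0":
                h[i] = value  # first positive example: copy its values
            elif h[i] != value:
                h[i] = "?"    # conflicting value: generalize to "any value"
    return h

examples = [
    (["sunny", "warm", "normal"], "yes"),
    (["sunny", "warm", "high"], "yes"),
    (["rainy", "cold", "high"], "no"),
]
h = find_s(examples)
```

After the two positive examples, the hypothesis has generalized the third attribute to "?" while the negative example leaves it untouched.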

18. What is the key idea behind the CANDIDATE ELIMINATION ALGORITHM?


a) Output a set of all hypothesis that are consistent with the training example
b) Output a set of general hypothesis
c) Output a set of Specific hypothesis
d) None of the above
19. When is the hypothesis h consistent with a training example in the candidate elimination
algorithm?
a) h(x)=c(x)
b) h(x) ≠ c(x)
c) Both of the Above
d) None of the above

20. The intermediate space between the general and specific hypotheses in the candidate elimination
algorithm is known as:
a) Version Space
b) Candidate Space
c) Risk Space
d) None of the above

21. Candidate Elimination Algorithm works well with both positive and negative examples.
a) True
b) False
c) Only Positive Examples
d) Only Negative Examples

22. For a positive example in the Candidate Elimination Algorithm, what is the general trend?
a) We tend to make specific hypothesis more general
b) We tend to make general hypothesis more specific
c) Both of the above
d) None of the above

23. The obvious solution for assuring that the target concept is in hypothesis space H is:
a) It is capable of representing every possible subset of the instances X
b) It is capable of representing every set of instances X
c) Both of the above
d) None of the above

24. The set of all subsets of a set X is known as:


a) Power set of X
b) Set of X
c) Super set of X
d) None of the above

25. The number of distinct subsets that can be defined over a set X containing |X| elements is:
a) 2^|X|
b) 2|X|
c) 4|X|
d) None of the above
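The 2^|X| count can be checked by enumerating the power set with the standard library; a quick sketch:

```python
from itertools import chain, combinations

def power_set(xs):
    """All subsets of xs, from the empty set up to xs itself."""
    return list(chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1)))

subsets = power_set(["a", "b", "c"])  # a 3-element set has 2^3 = 8 subsets
```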

26. What is a Target Variable?


a) Variable whose values are to be predicted by other variables
b) Variable whose values are known to user
c) A variable that has categories
d) None of the above

27. For what purpose Training data is used?


a) To form the learned Hypothesis
b) For testing the Learned hypothesis
c) Can be both
d) None of the above

28. What happens when a model learns the detail and noise in the training data to the extent that it
negatively impacts the performance of the model on new data?
a) Underfitting of the model
b) Overfitting of the model
c) Variance of a model
d) None of the above

29. For what problems ANN or Neural networks are suitable to use?
a) Problems in which training data corresponds to noisy and complex sensor data
b) Problems in which training data is a labeled dataset
c) Both of the above
d) None of the above

30. What output is released by a Perceptron in a neural network when the result is greater than some
threshold value?
a) 1
b) -1
c) 0
d) None of the above
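A perceptron's thresholded output can be sketched in a few lines; the weights and threshold below are arbitrary illustrative values:

```python
def perceptron_output(weights, inputs, threshold=0.0):
    """A perceptron unit: weighted sum of the inputs, then a hard threshold.
    Output is 1 when the sum exceeds the threshold, -1 otherwise."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else -1

out = perceptron_output([0.5, -0.5], [2.0, 1.0])  # weighted sum 0.5 > 0, so output 1
```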

31. If the training examples are not linearly separable which rule will design a best fit approximation
for target concept?
a) Perceptron Rule
b) Delta Rule
c) Both of the above
d) None of the above
32. The Hypothesis search space for Back propagation algorithm is consisting of;
a) Continuous Representations
b) Discrete Representations
c) Both of the above
d) None of the above

33. The error of a hypothesis with respect to some sample S of instances drawn from X, defined as
the fraction of S that it misclassifies, is known as:
a) True error
b) Sample error
c) Mean Square error
d) None of the above

34. Sampling error increases with an increase in sample size.


a) False
b) True
c) Can’t Say
d) None of the above

35. A statement made about a population for testing purpose is called?


a) Statistic
b) Hypothesis
c) Level of Significance
d) Test-Statistic

36. A statement whose validity is tested on the basis of a sample is called?


a) Null Hypothesis
b) Statistical Hypothesis
c) Simple Hypothesis
d) Composite Hypothesis

37. What are the two main properties of a Random Variable?


a) Error and Bias
b) Mean and Variance
c) Mean and Bias
d) None of the above

38. What is Machine Learning (ML)?


a) The autonomous acquisition of knowledge through the use of manual programs
b) The selective acquisition of knowledge through the use of computer programs
c) The selective acquisition of knowledge through the use of manual programs
d) The autonomous acquisition of knowledge through the use of computer programs

39. Who is the father of Machine Learning (ML)?


a) Geoffrey Chaucer
b) Geoffrey Hill
c) Geoffrey Everest Hinton

40. Choose the correct option regarding machine learning (ML) and artificial intelligence (AI)
a) ML is a set of techniques that turns a dataset into a software
b) AI is a software that can emulate the human mind
c) ML is an alternate way of programming intelligent machines
d) All of the above

41. Which of the following is not among the factors that affect the performance of the learner system?
a) Good data structures
b) Representation scheme used
c) Training scenario
d) Type of feedback

42. In general, to have a well-defined learning problem, we must identify which of the following?
a) The class of tasks
b) The measure of performance to be improved
c) The source of experience
d) All of the above

43. Successful applications of ML


a) Learning to recognize spoken words
b) Learning to drive an autonomous vehicle
c) Learning to classify new astronomical structures
d) Learning to play world-class backgammon
e) All of the above

44. Which of the following is not a learning method?
a) Analogy
b) Introduction
c) Memorization
d) Deduction
45. In language understanding, which of the following is not a level of knowledge?
a) Empirical
b) Logical
c) Phonological
d) Syntactic

46. Designing a machine learning approach involves:


a) Choosing the type of training experience
b) Choosing the target function to be learned
c) Choosing a representation for the target function
d) Choosing a function approximation algorithm
e) All of the above

47. Concept learning infers a ______-valued function from training examples of its input and output.
a) Decimal
b) Hexadecimal
c) Boolean
d) All of the above

48. Which of the following is not a supervised learning method?


a) Naive Bayesian
b) PCA
c) Linear Regression
d) Decision Tree

49. What is Machine Learning?

(i) Artificial Intelligence
(ii) Deep Learning
(iii) Data Statistics
a. Only (i)
b. Both (i) & (ii)
c. All
d. None

50. What kind of learning algorithm is suitable for "facial identities or facial expressions"?
a) Prediction
b) Recognition Patterns
c) Generating Patterns
d) Recognizing Anomalies

51. Which of the following is not a type of learning?


a) Unsupervised Learning
b) Supervised Learning
c) Semi-unsupervised Learning
d) Reinforcement Learning

52. Real-Time decisions, Game AI, Learning Tasks, Skill Acquisition, and Robot Navigation are
applications of which of the following
a) Supervised Learning: Classification
b) Reinforcement Learning
c) Unsupervised Learning: Clustering
d) Unsupervised Learning: Regression

53. Targeted marketing, Recommended Systems, and Customer Segmentation are applications in
which of the following
a) Supervised Learning: Classification
b) Unsupervised Learning: Clustering
c) Unsupervised Learning: Regression
d) Reinforcement Learning

54. Fraud Detection, Image Classification, Diagnostic, and Customer Retention are applications in
which of the following
a) Unsupervised Learning: Regression
b) Supervised Learning: Classification
c) Unsupervised Learning: Clustering
d) Reinforcement Learning

55. Which of the following is not a numerical function among the various function representations of
Machine Learning?
a) Neural Network
b) Support Vector Machines
c) Case-based
d) Linear Regression
56. The FIND-S Algorithm starts from the most specific hypothesis and generalizes it by considering only
a) Negative
b) Positive
c) Negative or Positive
d) None of the above

57. FIND-S algorithm ignores


a) Negative
b) Positive
c) Both
d) None of the above

58. The Candidate-Elimination Algorithm represents the:


a) Solution Space
b) Version Space
c) Elimination Space
d) All of the above

59. Inductive learning is based on the knowledge that if something happens a lot it is likely to be
generally true.
a) True
b) False

60. Inductive learning takes examples and generalizes rather than starting with _______ knowledge.
a) Inductive
b) Existing
c) Deductive
d) None of these

61. A drawback of FIND-S is that it assumes consistency within the training set.
a) True
b) False
62. The Candidate-Elimination Algorithm
a) The key idea in the Candidate-Elimination algorithm is to output a description of the set of all
hypotheses consistent with the training examples
b) The Candidate-Elimination algorithm computes the description of this set without explicitly
enumerating all of its members
c) This is accomplished by using the more-general-than partial ordering and maintaining a
compact representation of the set of consistent hypotheses
d) All of these

63. Concept learning is basically acquiring the definition of a general category from given sample
positive and negative training examples of the category.
a) TRUE
b) FALSE

64. The hypothesis h1 is more-general-than hypothesis h2 (h1 > h2) if and only if h1 ≥ h2 is true
and h2 ≥ h1 is false. We also say h2 is more-specific-than h1.
a) The statement is true
b) The statement is false
c) We cannot say
d) None of these

65. The List-Then-Eliminate Algorithm


a) The List-Then-Eliminate algorithm initializes the version space to contain all hypotheses
in H, then eliminates any hypothesis found inconsistent with any training example
b) The List-Then-Eliminate algorithm does not initialize the version space
c) None of these
UNIT – 2
1. What is the objective of backpropagation algorithm?

a) to develop learning algorithm for multilayer feedforward neural network

b) to develop learning algorithm for single layer feedforward neural network

c) to develop learning algorithm for multilayer feedforward neural network, so that network can be
trained to capture the mapping implicitly

d) none of the mentioned

Answer: c

Explanation: The objective of the backpropagation algorithm is to develop a learning algorithm for a
multilayer feedforward neural network, so that the network can be trained to capture the mapping
implicitly.

2. The backpropagation law is also known as generalized delta rule, is it true?

a) yes

b) no

Answer: a

Explanation: Because it fulfils the basic condition of delta rule.

3. What is true regarding backpropagation rule?

a) it is also called generalized delta rule

b) error in output is propagated backwards only to determine weight updates

c) there is no feedback of signal at any stage

d) all of the mentioned

Answer: d

Explanation: These all statements defines backpropagation algorithm.


4. There is feedback in final stage of backpropagation algorithm?

a) yes

b) no

Answer: b

Explanation: No feedback is involved at any stage as it is a feedforward neural network.

5. What is true regarding backpropagation rule?

a) it is a feedback neural network

b) actual output is determined by computing the outputs of units for each hidden layer

c) hidden layers output is not all important, they are only meant for supporting input and output
layers

d) none of the mentioned

Answer: b

Explanation: In backpropagation rule, actual output is determined by computing the outputs of units
for each hidden layer.

6. What is meant by generalized in the statement “backpropagation is a generalized delta rule”?

a) because delta rule can be extended to hidden layer units

b) because delta is applied to only input and output layers, thus making it more simple and
generalized

c) it has no significance

d) none of the mentioned

Answer: a

Explanation: The term generalized is used because delta rule could be extended to hidden layer units.

7. What are the general limitations of the backpropagation rule?

a) local minima problem

b) slow convergence

c) scaling
d) all of the mentioned

Answer: d

Explanation: These all are limitations of backpropagation algorithm in general.

8. What are the general tasks that are performed with backpropagation algorithm?

a) pattern mapping

b) function approximation

c) prediction

d) all of the mentioned

Answer: d

Explanation: These all are the tasks that can be performed with backpropagation algorithm in
general.

9. Is backpropagation learning based on gradient descent along the error surface?

a) yes

b) no

c) cannot be said

d) it depends on gradient descent but not error surface

Answer: a

Explanation: Weight adjustment is proportional to negative gradient of error with respect to weight.

10. How can the learning process be stopped in the backpropagation rule?

a) there is convergence involved

b) no heuristic criteria exist

c) on basis of average gradient value

d) none of the mentioned

Answer: c
Explanation: If average gradient value falls below a preset threshold value, the process may be
stopped.

11. A _________ is a decision support tool that uses a tree-like graph or model of decisions and
their possible consequences, including chance event outcomes, resource costs, and utility.

a) Decision tree

b) Graphs

c) Trees

d) Neural Networks

Answer: a

Explanation: Refer the definition of Decision tree.

12. Decision Tree is a display of an algorithm.

a) True

b) False

Answer: a

13. What is Decision Tree?

a) Flow-Chart

b) Structure in which internal node represents test on an attribute, each branch represents outcome
of test and each leaf node represents class label

c) Flow-Chart & Structure in which internal node represents test on an attribute, each branch
represents outcome of test and each leaf node represents class label

d) None of the mentioned

Answer: c

14. Decision Trees can be used for Classification Tasks.


a) True

b) False

Answer: a

15. Which of the following are Decision Tree nodes?

a) Decision Nodes

b) End Nodes

c) Chance Nodes

d) All of the mentioned

Answer: d

16. Decision Nodes are represented by ____________

a) Disks

b) Squares

c) Circles

d) Triangles

Answer: b

17. Chance Nodes are represented by __________

a) Disks

b) Squares

c) Circles

d) Triangles

Answer: c
18. End Nodes are represented by __________

a) Disks

b) Squares

c) Circles

d) Triangles

Answer: d

19. Which of the following are the advantage/s of Decision Trees?

a) Possible Scenarios can be added

b) Use a white box model, If given result is provided by a model

c) Worst, best and expected values can be determined for different scenarios

d) All of the mentioned

Answer: d

20. Which of the following statement(s) is / are true for Gradient Descent (GD) and Stochastic
Gradient Descent (SGD)?

In GD and SGD, you update a set of parameters in an iterative manner to minimize the error function.
In SGD, you have to run through all the samples in your training set for a single update of a
parameter in each iteration.
In GD, you either use the entire data or a subset of training data to update a parameter in each
iteration.

a) Only 1

b) Only 2

c) Only 3

d) 1 and 2

e) 2 and 3
f) 1,2 and 3

Answer: a

Explanation: In SGD, each iteration uses a batch that generally contains a random sample of the
data, whereas in GD each iteration uses all of the training observations.
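The contrast between the two update rules can be sketched on a one-dimensional least-squares objective. The data and learning rate are toy values chosen for the example; with this objective, batch GD drives w toward the data mean:

```python
import random

def gd_step(w, data, lr=0.1):
    """Batch GD: one update uses the gradient averaged over the ENTIRE dataset."""
    grad = sum(2 * (w - x) for x in data) / len(data)  # d/dw of mean((w - x)^2)
    return w - lr * grad

def sgd_step(w, data, rng, lr=0.1):
    """SGD: one update uses the gradient of a SINGLE randomly sampled example."""
    x = rng.choice(data)
    return w - lr * 2 * (w - x)

data = [1.0, 2.0, 3.0]  # minimizing mean((w - x)^2) drives w toward mean(data) = 2.0

w = 0.0
for _ in range(200):
    w = gd_step(w, data)  # deterministic: converges to 2.0

w_sgd, rng = 0.0, random.Random(0)
for _ in range(2000):
    w_sgd = sgd_step(w_sgd, data, rng)  # noisy: hovers around 2.0
```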

21. Below are the 8 actual values of target variable in the train file.

[0,0,0,1,1,1,1,1]

What is the entropy of the target variable?

a) -(5/8 log(5/8) + 3/8 log(3/8))

b) 5/8 log(5/8) + 3/8 log(3/8)

c) 3/8 log(5/8) + 5/8 log(3/8)

d) 5/8 log(3/8) – 3/8 log(5/8)

Answer: a
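
The value behind option a) can be checked numerically. A minimal sketch in Python (the entropy helper is illustrative and uses log base 2):

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

# -(5/8 * log2(5/8) + 3/8 * log2(3/8))
print(round(entropy([0, 0, 0, 1, 1, 1, 1, 1]), 3))  # 0.954
```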

22. A 3-input neuron is trained to output a zero when the input is 110 and a one when the input is
111. After generalization, the output will be zero when and only when the input is?
a) 000 or 110 or 011 or 101
b) 010 or 100 or 110 or 101
c) 000 or 010 or 110 or 100
d) 100 or 111 or 101 or 001

Answer: c

23. What is perceptron?


a) a single layer feed-forward neural network with pre-processing
b) an auto-associative neural network
c) a double layer auto-associative neural network
d) a neural network that contains feedback

Answer: a
Explanation: The perceptron is a single layer feed-forward neural network. It is not an auto-
associative network because it has no feedback and is not a multiple layer neural network because
the pre-processing stage is not made of neurons.
24. What is an auto-associative network?
a) a neural network that contains no loops
b) a neural network that contains feedback
c) a neural network that has only one loop
d) a single layer feed-forward neural network with pre-processing

Answer: b
Explanation: An auto-associative network is equivalent to a neural network that contains feedback.
The number of feedback paths(loops) does not have to be one.

25. A 4-input neuron has weights 1, 2, 3 and 4. The transfer function is linear with the constant of
proportionality being equal to 2. The inputs are 4, 10, 5 and 20 respectively. What will be the
output?
a) 238
b) 76
c) 119
d) 123

Answer: a
Explanation: The output is found by multiplying the weights with their respective inputs, summing the
results, and scaling the sum by the constant of proportionality of the linear transfer function. Therefore:
Output = 2 * (1*4 + 2*10 + 3*5 + 4*20) = 238.
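
The same arithmetic, as a short sketch in Python:

```python
weights = [1, 2, 3, 4]
inputs = [4, 10, 5, 20]
k = 2  # constant of proportionality of the linear transfer function
output = k * sum(w * x for w, x in zip(weights, inputs))
print(output)  # 238
```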

26. Which of the following is true?


(i) On average, neural networks have higher computational rates than conventional computers.
(ii) Neural networks learn by example.
(iii) Neural networks mimic the way the human brain works.
a) All of the mentioned are true
b) (ii) and (iii) are true
c) (i), (ii) and (iii) are true

Answer: a
Explanation: Neural networks have higher computational rates than conventional computers because
a lot of the operation is done in parallel. That is not the case when the neural network is simulated on
a computer. The idea behind neural nets is based on the way the human brain works. Neural nets
cannot be programmed, they can only learn by examples.

27. Which of the following is true for neural networks?


(i) The training time depends on the size of the network.
(ii) Neural networks can be simulated on a conventional computer.
(iii) Artificial neurons are identical in operation to biological ones.
a) All of the mentioned
b) (ii) is true
c) (i) and (ii) are true
d) None of the mentioned

Answer: c
Explanation: The training time depends on the size of the network; as the number of neurons grows,
the number of possible ‘states’ increases. Neural networks can be simulated on a
conventional computer, but the main advantage of neural networks – parallel execution – is lost.
Artificial neurons are not identical in operation to the biological ones.

28. What are the advantages of neural networks over conventional computers?
(i) They have the ability to learn by example
(ii) They are more fault tolerant
(iii)They are more suited for real time operation due to their high ‘computational’ rates
a) (i) and (ii) are true
b) (i) and (iii) are true
c) Only (i)
d) All of the mentioned

Answer: d
Explanation: Neural networks learn by example. They are more fault tolerant because they are always
able to respond and small changes in input do not normally cause a change in output. Because of
their parallel architecture, high computational rates are achieved.

29. Which of the following is true?


Single layer associative neural networks do not have the ability to:
(i) perform pattern recognition
(ii) find the parity of a picture
(iii)determine whether two or more shapes in a picture are connected or not
a) (ii) and (iii) are true
b) (ii) is true
c) All of the mentioned
d) None of the mentioned

Answer: a
Explanation: Pattern recognition is what single layer neural networks are best at but they don’t have
the ability to find the parity of a picture or to determine whether two shapes are connected or not.

30. Which is true for neural networks?


a) It has set of nodes and connections
b) Each node computes its weighted input
c) Node could be in excited state or non-excited state
d) All of the mentioned
Answer: d
Explanation: All mentioned are the characteristics of neural network.

31. What is Neuro software?


a) A software used to analyze neurons
b) It is powerful and easy neural network
c) Designed to aid experts in real world
d) It is software used by Neurosurgeon

Answer: b

32. Why is the XOR problem exceptionally interesting to neural network researchers?
a) Because it can be expressed in a way that allows you to use a neural network
b) Because it is a complex binary operation that cannot be solved using neural networks
c) Because it can be solved by a single layer perceptron
d) Because it is the simplest linearly inseparable problem that exists.

Answer: d

33. What is back propagation?


a) It is another name given to the curvy function in the perceptron
b) It is the transmission of error back through the network to adjust the inputs
c) It is the transmission of error back through the network to allow weights to be adjusted so that the
network can learn
d) None of the mentioned

Answer: c
Explanation: Back propagation is the transmission of error back through the network to allow weights
to be adjusted so that the network can learn.

34. Why are linearly separable problems of interest of neural network researchers?
a) Because they are the only class of problem that network can solve successfully
b) Because they are the only class of problem that Perceptron can solve successfully
c) Because they are the only mathematical functions that are continuous
d) Because they are the only mathematical functions you can draw

Answer: b
Explanation: Linearly separable problems of interest of neural network researchers because they are
the only class of problem that Perceptron can solve successfully.

35. Which of the following is not the promise of artificial neural network?
a) It can explain result
b) It can survive the failure of some nodes
c) It has inherent parallelism
d) It can handle noise

Answer: a
Explanation: The artificial Neural Network (ANN) cannot explain result.

36. Neural Networks are complex ______________ with many parameters.


a) Linear Functions
b) Nonlinear Functions
c) Discrete Functions
d) Exponential Functions

Answer: b
Explanation: Neural networks are complex nonlinear functions with many parameters.

37. A perceptron adds up all the weighted inputs it receives, and if it exceeds a certain value, it
outputs a 1, otherwise it just outputs a 0.
a) True
b) False
c) Sometimes – it can also output intermediate values as well
d) Can’t say

Answer: a
Explanation: Yes the perceptron works like that.

38. What is the name of the function in the following statement “A perceptron adds up all the
weighted inputs it receives, and if it exceeds a certain value, it outputs a 1, otherwise it just
outputs a 0”?
a) Step function
b) Heaviside function
c) Logistic function
d) Perceptron function

Answer: b
Explanation: Also known as the step function, so option a) is also correct. It is a hard thresholding
function, either on or off with no in-between.
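
The hard-thresholding behaviour described above can be sketched directly (the weights and threshold below are illustrative choices, not from the source):

```python
def perceptron(inputs, weights, threshold):
    """Outputs 1 if the weighted sum exceeds the threshold, else 0 (hard step, no in-between)."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s > threshold else 0

# With weights 0.6, 0.6 and threshold 1.0 this unit computes logical AND:
print(perceptron([1, 1], [0.6, 0.6], 1.0))  # 1
print(perceptron([1, 0], [0.6, 0.6], 1.0))  # 0
```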

39. Having multiple perceptrons can actually solve the XOR problem satisfactorily: this is because
each perceptron can partition off a linear part of the space itself, and they can then combine their
results.
a) True – this works always, and these multiple perceptrons learn to classify even complex problems
b) False – perceptrons are mathematically incapable of solving linearly inseparable functions, no
matter what you do
c) True – perceptrons can do this but are unable to learn to do it – they have to be explicitly hand-
coded
d) False – just having a single perceptron is enough
Answer: c

40. The network that involves backward links from output to the input and hidden layers is called
_________
a) Self organizing maps
b) Perceptrons
c) Recurrent neural network
d) Multi layered perceptron

Answer: c
Explanation: RNN (Recurrent neural network) topology involves backward links from output to the
input and hidden layers.

41. Which of the following is an application of NN (Neural Network)?


a) Sales forecasting
b) Data validation
c) Risk management
d) All of the mentioned

Answer: d
Explanation: All mentioned options are applications of Neural Network.

42. Internal nodes of a decision tree correspond to:


a) Attributes
b) Classes
c) Data instances
d) None of the above
Answer: a

43. Leaf nodes of a decision tree correspond to:


a) Attributes
b) Classes
c) Data instances
d) None of the above
Answer: b
44. Which of the following criteria is not used to decide which attribute to split next in a decision
tree:

a) Gini index

b) Information gain

c) Entropy

d) Scatter

Answer: d

45. Which of the following is a valid logical rule for the decision tree below?

a) IF Business Appointment = No & Temp above 70 = No THEN Decision = wear slacks

b) IF Business Appointment = Yes & Temp above 70 = Yes THEN Decision = wear shorts

c) IF Temp above 70 = No THEN Decision = wear shorts

d) IF Business Appointment= No & Temp above 70 = No THEN Decision = wear jeans

Answer: d

46. A decision tree is pruned in order to:

a) improve classification accuracy on training set

b) improve generalization performance

c) reduce dimensionality of the data

d) make the tree balanced

Answer: b

47. For questions 47 (a) to 47 (e), consider the following small data table for two classes of woods.
Using

information gain, construct a decision tree to classify the data set. Answer the following question
for the resulting tree.

Example      Density   Grain   Hardness   Class
Example #1   Heavy     Small   Hard       Oak
Example #2   Heavy     Large   Hard       Oak
Example #3   Heavy     Small   Hard       Oak
Example #4   Light     Large   Soft       Oak
Example #5   Light     Large   Hard       Pine
Example #6   Heavy     Small   Soft       Pine
Example #7   Heavy     Large   Soft       Pine
Example #8   Heavy     Small   Soft       Pine

47.(a)Which attribute would information gain choose as the root of the tree?

a) Density

b) Grain

c) Hardness

d) None of the above


Answer: c
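
That Hardness maximizes information gain can be verified numerically. A minimal sketch in Python, with the data table hard-coded (the helper names are illustrative):

```python
import math

DATA = [  # (Density, Grain, Hardness, Class) rows from the table above
    ("Heavy", "Small", "Hard", "Oak"), ("Heavy", "Large", "Hard", "Oak"),
    ("Heavy", "Small", "Hard", "Oak"), ("Light", "Large", "Soft", "Oak"),
    ("Light", "Large", "Hard", "Pine"), ("Heavy", "Small", "Soft", "Pine"),
    ("Heavy", "Large", "Soft", "Pine"), ("Heavy", "Small", "Soft", "Pine"),
]

def class_entropy(rows):
    """Entropy of the class label (last field) over a list of rows."""
    n = len(rows)
    counts = {}
    for r in rows:
        counts[r[-1]] = counts.get(r[-1], 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def info_gain(rows, idx):
    """Gain(S, A) = Entropy(S) - sum over values v of (|Sv|/|S|) * Entropy(Sv)."""
    total = class_entropy(rows)
    for v in set(r[idx] for r in rows):
        sub = [r for r in rows if r[idx] == v]
        total -= len(sub) / len(rows) * class_entropy(sub)
    return total

for name, idx in [("Density", 0), ("Grain", 1), ("Hardness", 2)]:
    print(name, round(info_gain(DATA, idx), 3))
# Density and Grain give gain 0; Hardness gives about 0.189, so it becomes the root.
```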

47.(b) What class does the tree infer for the example {Density=Light, Grain=Small,
Hardness=Hard}?
a) Oak
b) Pine
c) The example cannot be classified
d) Both classes are equally likely

Answers: b
47.(c) What class does the tree infer for the example {Density=Light, Grain=Small, Hardness=Soft}?
a) Oak
b) Pine
c) The example cannot be classified
d) Both classes are equally likely
Answer: a

47.(d) What class does the tree infer for the example {Density=Heavy, Grain=Small,
Hardness=Soft}?
a) Oak
b) Pine
c) The example cannot be classified
d) Both classes are equally likely
Answer: b

47.(e) What class does the tree infer for the example {Density=Heavy, Grain=Small,
Hardness=Hard}?
a) Oak
b) Pine
c) The example cannot be classified
d) Both classes are equally likely
Answer: a

48. A perceptron consists of:


a) one neuron
b) two neurons
c) three neurons
d) four neurons
Answer: a
Explanation: A perceptron consists of a single neuron.

49. A perceptron can correctly classify instances into two classes where the classes are:
a) Overlapping
b) Linearly separable
c) Non-linearly separable
d) None of the above

Answer: b
Explanation: Perceptron is a linear classifier.
50. The logic function that cannot be implemented by a perceptron having two inputs is?
a) AND
b) OR
c) NOR
d) XOR

Answer: d
Explanation: XOR is not linearly separable.
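
Questions 32, 39, and 50 all circle the same fact: a single perceptron cannot compute XOR, but a hand-wired two-layer combination of perceptrons can. A minimal sketch in Python (the weights and thresholds below are one hand-coded choice, not learned):

```python
def step(weighted_sum, threshold):
    """Hard-threshold unit: 1 if the weighted sum exceeds the threshold, else 0."""
    return 1 if weighted_sum > threshold else 0

def xor(a, b):
    h_or = step(0.6 * a + 0.6 * b, 0.5)          # hidden unit 1: OR
    h_nand = step(-0.6 * a - 0.6 * b, -1.0)      # hidden unit 2: NAND
    return step(0.6 * h_or + 0.6 * h_nand, 1.0)  # output unit: AND of the two

print([xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```
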
UNIT - 2
1. Which of the following hyperparameter(s), when increased, may cause random forest to overfit the data?

a) Number of Trees
b) Depth of Tree
c) Learning Rate
d) None of the above

2. What best describes the Decision Tree Learning Algorithm?


a) A method for approximating discrete-valued target functions, in which the learned function is
represented by a decision tree.
b) A method for approximating continuous-valued target functions, in which the learned function is
represented by a decision tree.
c) Both of the Above
d) None of the above
3. What is a Target Variable?
a) Variable whose values are to be predicted by other variables
b) Variable whose values are known to user
c) A variable that has categories
d) None of the above

4. What is the order of sorting in a decision tree?


a) From root node to leaf node
b) From leaf to leaf
c) From root to root
d) None of the above

5. What does a node in a decision tree represent?


a) Attributes
b) Instances
c) Both of above
d) None of above

6. How are instances represented in a decision tree?


a) Attribute Value pairs
b) Values only
c) Both of above
d) None of above
7. The ID3 algorithm constructs the decision tree in which order?
a) Top to bottom approach
b) Bottom to top approach
c) Both of the above
d) None of the above
8. What is the central choice for the ID3 algorithm?
a) Selecting which attribute to test at each node
b) Selecting which attribute to train at each node
c) Both of the above
d) None of the above

9. Which property measures how well a given attribute separates the training examples according to
their target classification?
a) Information Gain
b) Entropy
c) Gini Index
d) None of the above

10. The information gain, Gain(S, A), of an attribute A relative to a collection of examples S is defined
as:

a) Gain(S, A) = Entropy(S) − Σ over v ∈ Values(A) of (|Sv| / |S|) · Entropy(Sv)

b) [formula not reproduced in the source]

c) [formula not reproduced in the source]

d) None of the above

11. Does ID3 perform a complete hypothesis search?


a) True
b) False
c) Can’t say
d) None of the above

12. What strategy is used by ID3 Algorithm?


a) Information Gain Heuristic and Hill Climbing
b) Only Hill Climbing
c) Only Information Gain
d) None of the above

13. The strategy where we keep on growing the decision tree while keeping an eye on overfitting is
termed:
a) Pre pruning
b) Post pruning
c) Middle pruning
d) None of the above

14. What are the advantages of converting decision trees to rules before pruning?

a) It allows to distinguish among different contexts


b) It removes distinction between attribute set

c) Both of the above

d) None of the above

15. Which of the following best describes the formula for Split Information?

a) SplitInformation(S, A) = − Σ over i = 1..c of (|Si| / |S|) · log2(|Si| / |S|)

b) [formula not reproduced in the source]

c) [formula not reproduced in the source]

d) None of the above

16. The Gain ratio is defined in terms of :

a) Entropy

b) Information Gain

c) Split Information

d) Gain Measure and Split Information

17. Gain ratio is represented by which of the following equations?

a) GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

b) [formula not reproduced in the source]

c) [formula not reproduced in the source]

d) None of the above
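
Split information and gain ratio from questions 15–17 can be sketched together. A minimal sketch in Python (the gain value 0.189 and the 4/4 split sizes below are hypothetical inputs, purely for illustration):

```python
import math

def split_information(subset_sizes):
    """SplitInformation(S, A) = -sum(|Si|/|S| * log2(|Si|/|S|)) over the partitions induced by A."""
    n = sum(subset_sizes)
    return -sum(s / n * math.log2(s / n) for s in subset_sizes if s)

def gain_ratio(gain, subset_sizes):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    return gain / split_information(subset_sizes)

print(round(split_information([4, 4]), 3))   # 1.0 for an even two-way split of 8 examples
print(round(gain_ratio(0.189, [4, 4]), 3))   # 0.189
```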

18. For which problems are ANNs (neural networks) suitable?

a) Problems in which training data corresponds to noisy and complex sensor data

b) Problems in which training data is a labeled dataset

c) Both of the above

d) None of the above

19. What output does a perceptron in a neural network release when the result is greater than some
threshold value?
a) 1

b) -1

c) 0

d) None of the above

20. If the training examples are not linearly separable, which rule will produce a best-fit
approximation of the target concept?

a) Perceptron Rule

b) Delta Rule

c) Both of the above

d) None of the above

21. What is the idea behind Stochastic Gradient rule?

a) To approximate this gradient descent search by updating weights incrementally, following the
calculation of the error for each individual example.

b) Only to approximate the gradient descent

c) Both of the above

d) None of the above

22. Which rule is used to minimize the squared error between network output values and the target
values for this output?

a) Delta rule

b) Gradient descent rule

c) Back propagation rule

d) None of the above

23. The Hypothesis search space for Back propagation algorithm is consisting of;

a) Continuous Representations

b) Discrete Representations

c) Both of the above

d) None of the above


24. Which of the following is a valid production rule for the decision tree below?

a. IF Business Appointment = No & Temp above 70 = No


THEN Decision = wear slacks
b. IF Business Appointment = Yes & Temp above 70 = Yes
THEN Decision = wear shorts
c. IF Temp above 70 = No
THEN Decision = wear shorts
d. IF Business Appointment= No & Temp above 70 = No
THEN Decision = wear jeans
25. A structure designed to store data for decision support.
a. operational database
b. flat file
c. decision tree
d. data warehouse

25. Assume that we have a dataset containing information about 200 individuals. One
hundred of these individuals have purchased life insurance. A supervised data mining
session has discovered the following rule:

IF age < 30 & credit card insurance = yes


THEN life insurance = yes
Rule Accuracy: 70%
Rule Coverage: 63%

How many individuals in the class life insurance = no have credit card insurance and are less than 30
years old?
a. 63
b. 70
c. 30
d. 27
26. Which statement is true about neural network and linear regression models?
a. Both models require input attributes to be numeric.
b. Both models require numeric attributes to range between 0 and 1.
c. The output of both models is a categorical attribute value.
d. Both techniques build models whose output is determined by a linear sum of weighted input
attribute values.
e. More than one of a,b,c or d is true.

27. Unlike traditional production rules, association rules


a. allow the same variable to be an input attribute in one rule and an output attribute in another rule.
b. allow more than one input attribute in a single rule.
c. require input attributes to take on numeric values.
d. require each rule to have exactly one categorical output attribute.
28. A data mining algorithm is unstable if
a. test set accuracy depends on the ordering of test set instances.
b. the algorithm builds models unable to classify outliers.
c. the algorithm is highly sensitive to small changes in the training data.
d. test set accuracy depends on the choice of input attributes.

29. Which statement is true about the decision tree attribute selection process described in
your book?
a. A categorical attribute may appear in a tree node several times but a numeric attribute may appear
at most once.
b. A numeric attribute may appear in several tree nodes but a categorical attribute may appear at
most once.
c. Both numeric and categorical attributes may appear in several tree nodes.
d. Numeric and categorical attributes may appear in at most one tree node.

30. Given a rule of the form IF X THEN Y, rule confidence is defined as the conditional
probability that
a. Y is true when X is known to be true.
b. X is true when Y is known to be true.
c. Y is false when X is known to be false.
d. X is false when Y is known to be false.

31. Association rule support is defined as


a. the percentage of instances that contain the antecedent conditional items listed in the association
rule.
b. the percentage of instances that contain the consequent conditions listed in the association rule.
c. the percentage of instances that contain all items listed in the association rule.
d. the percentage of instances in the database that contain at least one of the antecedent conditional
items listed in the association rule.

Use these tables to answer questions 32 and 33.


Single Item Sets Number of Items
Magazine Promo = Yes 7
Watch Promo = No 6
Life Ins Promo = Yes 5
Life Ins Promo = No 5
Card Insurance = No 8
Sex = Male 6

Two Item Sets                                          Number of Items
Magazine Promo = Yes & Watch Promo = No 4
Magazine Promo = Yes & Life Ins Promo = Yes 5
Magazine Promo = Yes & Card Insurance = No 5
Watch Promo = No & Card Insurance = No 5

32. One two-item set rule that can be generated from the tables above is:

If Magazine Promo = Yes Then Life Ins promo = Yes

The confidence for this rule is:

a. 5/7
b. 5 / 12
c. 7 / 12
d. 1
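
Rule confidence here is just the ratio of the two-item-set count to the single-item-set count. A minimal sketch in Python, with the counts read off the tables above:

```python
def confidence(count_antecedent_and_consequent, count_antecedent):
    """Confidence of IF X THEN Y = count(X and Y) / count(X)."""
    return count_antecedent_and_consequent / count_antecedent

# IF Magazine Promo = Yes THEN Life Ins Promo = Yes:
#   count(Magazine Promo = Yes & Life Ins Promo = Yes) = 5
#   count(Magazine Promo = Yes) = 7
print(round(confidence(5, 7), 3))  # 0.714, i.e. 5/7
```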

33. Based on the two-item set table, which of the following is not a possible two-item set
rule?
a. IF Life Ins Promo = Yes THEN Magazine Promo = Yes
b. IF Watch Promo = No THEN Magazine Promo = Yes
c. IF Card Insurance = No THEN Magazine Promo = Yes
d. IF Life Ins Promo = No THEN Card Insurance = No

34. A feed-forward neural network is said to be fully connected when


a. all nodes are connected to each other.
b. all nodes at the same layer are connected to each other.
c. all nodes at one layer are connected to all nodes in the next higher layer.
d. all hidden layer nodes are connected to all output layer nodes.

36. The values input into a feed-forward neural network


a. may be categorical or numeric.
b. must be either all categorical or all numeric but not both.
c. must be numeric.
d. must be categorical.

37. Neural network training is accomplished by repeatedly passing the training data through the
network while
a. individual network weights are modified.
b. training instance attribute values are modified.
c. the ordering of the training instances is modified.
d. individual network nodes have the coefficients on their corresponding functional parameters
modified.

38. Genetic learning can be used to train a feed-forward network. This is accomplished by having each
population element represent one possible
a. network configuration of nodes and links.
b. set of training data to be fed through the network.
c. set of network output values.
d. set of network connection weights.

39. With a Kohonen network, the output layer node that wins an input instance is rewarded by having
a. a higher probability of winning the next training instance to be presented.
b. its connection weights modified to more closely match those of the input instance.
c. its connection weights modified to more closely match those of its neighbors.
d. neighboring connection weights modified to become less similar to its own connection weights.

40. A two-layered neural network used for unsupervised clustering.


a. backpropagation network
b. Kohonen network
c. perceptron network
d. agglomerative network

41. This neural network explanation technique is used to determine the relative importance of
individual input attributes.
a. sensitivity analysis
b. average member technique
c. mean squared error analysis
d. absolute average technique

42. Which one of the following is not a major strength of the neural network approach?
a. Neural networks work well with datasets containing noisy data.
b. Neural networks can be used for both supervised learning and unsupervised clustering.
c. Neural network learning algorithms are guaranteed to converge to an optimal solution.
d. Neural networks can be used for applications that require a time element to be included in the data.

43. During backpropagation training, the purpose of the delta rule is to make weight adjustments so as
to
a. minimize the number of times the training data must pass through the network.
b. minimize the number of times the test data must pass through the network.
c. minimize the sum of absolute differences between computed and actual outputs.
d. minimize the sum of squared error differences between computed and actual output.

44. Epochs represent the total number of


a. input layer nodes.
b. passes of the training data through the network.
c. network nodes.
d. passes of the test data through the network.
45. Two classes each of which is represented by the same pair of numeric attributes are linearly
separable if
a. at least one of the pairs of attributes shows a curvilinear relationship between the classes.
b. at least one of the pairs of attributes shows a high positive correlation between the classes.
c. at least one of the pairs of attributes shows a high positive correlation between the classes.
d. a straight line partitions the instances of the two classes.

46. The test set accuracy of a backpropagation neural network can often be improved by
a. increasing the number of epochs used to train the network.
b. decreasing the number of hidden layer nodes.
c. increasing the learning rate.
d. decreasing the number of hidden layers.

47. This type of supervised network architecture does not contain a hidden layer.
a. backpropagation
b. perceptron
c. self-organizing map
d. genetic

48. The total delta measures the total absolute change in network connection weights for each pass of
the training data through a neural network. This value is most often used to determine the
convergence of a
a. perceptron network.
b. feed-forward network.
c. backpropagation network.
d. self-organizing network.
49. What strategies can help reduce overfitting in decision trees?
(i) Enforce a maximum depth for the tree
(ii) Enforce a minimum number of samples in leaf nodes
(iii) Pruning
(iv) Make sure each leaf node is one pure class
A. All
B. (i), (ii) and (iii)
C. (i), (iii), (iv)
D. None
Correct option is B

50. Which of the following is a widely used and effective machine learning algorithm
based on the idea of bagging?
A. Decision Tree
B. Random Forest
C. Regression
D. Classification
Correct option is B

51. To find the minimum or the maximum of a function, we set the gradient to zero
because which of the following
A. Depends on the type of problem
B. The value of the gradient at extrema of a function is always zero
C. Both (A) and (B)
D. None of these
Correct option is B
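
Option B can be illustrated numerically: gradient descent on f(x) = (x − 3)² drives x toward the extremum, where the gradient is zero. A minimal sketch in Python:

```python
def grad(x):
    """Derivative of f(x) = (x - 3)**2, which is zero exactly at the minimum x = 3."""
    return 2 * (x - 3)

x = 0.0
for _ in range(200):      # plain gradient descent with a fixed learning rate
    x -= 0.1 * grad(x)
print(round(x, 3))        # converges to 3.0, where grad(x) is ~0
```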

52. Which of the following is a disadvantage of decision trees?


A. Decision trees are prone to be overfit
B. Decision trees are robust to outliers
C. Factor analysis
D. None of the above
Correct option is A

53. What is perceptron?


A. A single layer feed-forward neural network with pre-processing
B. A neural network that contains feedback
C. A double layer auto-associative neural network
D. An auto-associative neural network
Correct option is A

54. Which of the following is true for neural networks?


(i) The training time depends on the size of the network.
(ii) Neural networks can be simulated on a conventional computer.
(iii) Artificial neurons are identical in operation to biological ones.
A. All
B. Only (ii)
C. (i) and (ii)
D. None
Correct option is C

55. What are the advantages of neural networks over conventional computers?
(i) They have the ability to learn by example
(ii) They are more fault tolerant
(iii) They are more suited for real-time operation due to their high ‘computational’ rates
A. (i) and (ii)
B. (i) and (iii)
C. Only (i)
D. All
E. None
Correct option is D

56. What is Neuro software?


A. It is software used by Neurosurgeon
B. Designed to aid experts in real world
C. It is powerful and easy neural network
D. A software used to analyze neurons
Correct option is C

57. Which is true for neural networks?


A. Each node computes its weighted input
B. Node could be in excited state or non-excited state
C. It has set of nodes and connections
D. All of the above
Correct option is D

58. What is the objective of backpropagation algorithm?


A. To develop learning algorithm for multilayer feedforward neural network, so
that network can be trained to capture the mapping implicitly
B. To develop learning algorithm for multilayer feedforward neural network
C. To develop learning algorithm for single layer feedforward neural network
D. All of the above
Correct option is A
59. Which of the following is true?
Single layer associative neural networks do not have the ability to:

(i) Perform pattern recognition
(ii) Find the parity of a picture
(iii) Determine whether two or more shapes in a picture are connected or not
A. (ii) and (iii)
B. Only (ii)
C. All
D. None
Correct option is A

60. The backpropagation law is also known as generalized delta rule


A. True
B. False
Correct option is A

61. Which of the following is true?


(i) On average, neural networks have higher computational rates than conventional computers.
(ii) Neural networks learn by example.
(iii) Neural networks mimic the way the human brain works.
A. All
B. (ii) and (iii)
C. (i), (ii) and (iii)
D. None
Correct option is A

62. What is true regarding backpropagation rule?


A. Error in output is propagated backwards only to determine weight updates
B. There is no feedback of signal at any stage
C. It is also called generalized delta rule
D. All of the above
Correct option is D

63. There is feedback in final stage of backpropagation


A. True
B. False
Correct option is B
64. An auto-associative network is
A. A neural network that has only one loop
B. A neural network that contains feedback
C. A single layer feed-forward neural network with pre-processing
D. A neural network that contains no loops
Correct option is B

65. A 3-input neuron has weights 1, 4 and 3. The transfer function is linear with the
constant of proportionality being equal to 3. The inputs are 4, 8 and 5 respectively.
What will be the output?
A. 139
B. 153
C. 162
D. 160
Correct option is B

66. What of the following is true regarding backpropagation rule?


A. Hidden layers output is not all important, they are only meant for supporting
input and output layers
B. Actual output is determined by computing the outputs of units for each
hidden layer
C. It is a feedback neural network
D. None of the above
Correct option is B

67. What is back propagation?


A. It is another name given to the curvy function in the perceptron
B. It is the transmission of error back through the network to allow weights to be
adjusted so that the network can learn
C. It is another name given to the curvy function in the perceptron
D. None of the above
Correct option is B

68. The general limitations of back propagation rule is/are


A. Scaling
B. Slow convergence
C. Local minima problem
D. All of the above
Correct option is D
69. What is the meaning of generalized in statement “backpropagation is a generalized
delta rule” ?
A. Because delta is applied to only input and output layers, thus making it more
simple and generalized
B. It has no significance
C. Because delta rule can be extended to hidden layer units
D. None of the above
Correct option is C

70. Neural Networks are complex ________ functions with many parameters.


A. Linear
B. Nonlinear
C. Discrete
D. Exponential
Correct option is B

71. The general tasks that are performed with backpropagation algorithm
A. Pattern mapping
B. Prediction
C. Function approximation
D. All of the above
Correct option is D

72. Backpropagation learning is based on gradient descent along the error surface.
A. True
B. False
Correct option is A

73. In backpropagation rule, how to stop the learning process?


A. No heuristic criteria exist
B. On basis of average gradient value
C. There is convergence involved
D. None of these
Correct option is B

74. Applications of NN (Neural Network)


A. Risk management
B. Data validation
C. Sales forecasting
D. All of the above
Correct option is D
75. The network that involves backward links from output to the input and hidden layers
is known as
A. Recurrent neural network
B. Self organizing maps
C. Perceptrons
D. Single layered perceptron
Correct option is A

76. Decision Tree is a display of an Algorithm?


A. True
B. False
Correct option is A

77. Which of the following is/are the decision tree nodes?


A. End Nodes
B. Decision Nodes
C. Chance Nodes
D. All of the above
Correct option is D

78. End Nodes are represented by which of the following


A. Disks
B. Triangles
C. Circles
D. Squares
Correct option is B

79. Decision Nodes are represented by which of the following


A. Disks
B. Triangles
C. Circles
D. Squares
Correct option is D

80. Chance Nodes are represented by which of the following


A. Disks
B. Triangles
C. Circles
D. Squares
Correct option is C
81. Advantage of Decision Trees
A. Possible Scenarios can be added
B. Use a white box model, if given result is provided by a model
C. Worst, best and expected values can be determined for different scenarios
D. All of the above
Correct option is D
UNIT-3
1. How many terms are required for building a Bayes model?
a) 1
b) 2
c) 3
d) 4

Answer: c

Explanation: The three required terms are a conditional probability and two unconditional probabilities.
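
How the three terms combine is just Bayes' rule, P(A|B) = P(B|A) · P(A) / P(B). A minimal sketch in Python with hypothetical numbers (1% prevalence, 90% sensitivity, 5% false-positive rate; none of these figures come from the source):

```python
p_a = 0.01            # P(A): the first unconditional prior
p_b_given_a = 0.90    # P(B|A): the conditional probability
p_b_given_not_a = 0.05
# P(B) by total probability -- the second unconditional term
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
posterior = p_b_given_a * p_a / p_b   # Bayes' rule: P(A|B)
print(round(posterior, 3))  # about 0.154
```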

2. What is needed to make probabilistic systems feasible in the world?


a) Reliability
b) Crucial robustness
c) Feasibility
d) None of the mentioned

Answer: b

Explanation: Model-based knowledge provides the crucial robustness needed to make probabilistic systems feasible in the real world.

3. Where can Bayes' rule be used?


a) Solving queries
b) Increasing complexity
c) Decreasing complexity
d) Answering probabilistic query

Answer: d

Explanation: Bayes' rule can be used to answer probabilistic queries conditioned on one piece of
evidence.

4. What does the Bayesian network provide?


a) Complete description of the domain
b) Partial description of the domain
c) Complete description of the problem
d) None of the mentioned

Answer: a
Explanation: A Bayesian network provides a complete description of the domain.
5. How can the entries in the full joint probability distribution be calculated?
a) Using variables
b) Using information
c) Both Using variables & information
d) None of the mentioned

Answer: b

Explanation: Every entry in the full joint probability distribution can be calculated from the information
in the network.

6. How can the Bayesian network be used to answer any query?


a) Full distribution
b) Joint distribution
c) Partial distribution
d) All of the mentioned

Answer: b

Explanation: If a bayesian network is a representation of the joint distribution, then it can solve any
query, by summing all the relevant joint entries.

7. How can the compactness of the Bayesian network be described?


a) Locally structured
b) Fully structured
c) Partial structure
d) All of the mentioned

Answer: a

Explanation: The compactness of the bayesian network is an example of a very general property of a
locally structured system.

8. With what is the local structure associated?


a) Hybrid
b) Dependant
c) Linear
d) None of the mentioned

Answer: c

Explanation: Local structure is usually associated with linear rather than exponential growth in
complexity.
9. Which condition allows a variable to be influenced directly by all the others?
a) Partially connected
b) Fully connected
c) Local connected
d) None of the mentioned

Answer: b

10. What is the relationship between a node and its predecessors while creating a Bayesian network?
a) Functionally dependent
b) Dependant
c) Conditionally independent
d) Both Conditionally dependant & Dependant

Answer: c

Explanation: The semantics used to derive a method for constructing Bayesian networks lead to the
consequence that a node can be conditionally independent of its predecessors.

11. Which of the following statements about Naive Bayes is incorrect?


a) Attributes are equally important.
b) Attributes are statistically dependent on one another given the class value.
c) Attributes are statistically independent of one another given the class value.
d) Attributes can be nominal or numeric
e) All of the above

Answer: b

12. The Naive Bayes Classifier is a _____ in probability.


a) Technique.
b) Process.
c) Classification.
d) None of these answers are correct.

Answer: c

13. Suppose we would like to convert a nominal attribute X with 4 values to a data table with only
binary variables. How many new attributes are needed?
a) 1
b) 2
c) 4
d) 8
e) 16

Answer: C
14. In a medical application domain, suppose we build a classifier for patient screening (True means
patient has cancer). Suppose that the confusion matrix is from testing the classifier on some test data.

                 Predicted
                 TRUE    FALSE
Actual  TRUE      TP      FN
        FALSE     FP      TN

Which of the following situations would you like your classifier to have?

A. FP >> FN
B. FN >> FP
C. FN = FP × TP
D. TN >> FP
E. FN × TP >> FP × TN
F. All of the above

Answer: A (because, when FN is small, we can guarantee that true cancer patients are not diagnosed
as non-patients.)

15. Consider discretizing a continuous attribute whose values are listed below:
3, 4, 5, 10, 20, 32, 43, 44, 46, 52, 59, 61

Which of the following number of bins is not possible for using equidepth bins?
A. 2
B. 3
C. 4
D. 5
E. 6
F. All of the above

Answer: D

16. Consider discretizing a continuous attribute whose values are listed below:
3, 4, 5, 10, 21, 32, 43, 44, 46, 52, 59, 67

Using equal-width partitioning and four bins, how many values are there in the first bin (the bin with
small values)?
A. 1
B. 2
C. 3
D. 4
E. 5
Answer: D (because the first bin is between 3 and 19, in which there are 4 items: 3, 4, 5, 10)
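The equal-width partitioning in question 16 can be checked with a short sketch. It uses the boundaries assumed in the explanation: width (67 - 3)/4 = 16, so the first bin covers [3, 19).

```python
# Equal-width partitioning of the listed values into four bins.
vals = [3, 4, 5, 10, 21, 32, 43, 44, 46, 52, 59, 67]
k = 4
width = (max(vals) - min(vals)) / k          # (67 - 3) / 4 = 16

def bin_index(v):
    # Bin i covers [min + i*width, min + (i+1)*width)
    i = int((v - min(vals)) / width)
    return min(i, k - 1)                     # the maximum goes into the last bin

first_bin = [v for v in vals if bin_index(v) == 0]
print(first_bin)  # [3, 4, 5, 10] -> 4 values
```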
17. High entropy means that the partitions in classification are
A. pure
B. not pure
C. useful
D. useless
E. None of the above

Answer: B

18. A machine learning problem involves four attributes plus a class. The attributes have 3, 2, 2, and 2
possible values each. The class has 3 possible values. How many possible different examples are there?

A. 3
B. 6
C. 12
D. 24
E. 48
F. 72

Answer: F

19. Which of the following is not supervised learning?


A. PCA
B. Clustering
C. Decision Tree
D. Linear Regression
E. Naive Bayesian
F. None of the above

Answer: A or B (PCA and clustering are both unsupervised learning methods)

20. Which of the following statements about Naive Bayes is incorrect?


A. Attributes are equally important.
B. Attributes are statistically dependent on one another given the class value.
C. Attributes are statistically independent of one another given the class value.
D. Attributes can be nominal or numeric
E. All of the above

Answer: B

21. What are the axes of an ROC curve?


A. Vertical axis: % of true negatives; Horizontal axis: % of false negatives
B. Vertical axis: % of true positives; Horizontal axis: % of false positives
C. Vertical axis: % of false negatives; Horizontal axis: % of false positives
D. Vertical axis: % of false positives; Horizontal axis: % of true negatives

Answer: B
22. Suppose that there are a total of 50 data mining related documents in a library of 200 documents.
Suppose that a search engine retrieves 10 documents after a user enters ‘data mining’ as a query, of
which 5 are data mining related documents. What are the precision and recall

A. (50%, 10%)
B. (60%, 20%)
C. (70%, 30%)
D. (60%, 30%)

Answer: A. (50%, 10%)
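The precision/recall arithmetic in question 22 is easy to verify directly from the counts given in the question:

```python
# 200 documents, 50 relevant, 10 retrieved of which 5 are relevant.
relevant_retrieved = 5
retrieved = 10
relevant_total = 50

precision = relevant_retrieved / retrieved        # 5/10 = 50%
recall = relevant_retrieved / relevant_total      # 5/50 = 10%
print(precision, recall)  # 0.5 0.1
```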

23. Three companies A, B and C supply 25%, 35% and 40% of the notebooks to a school. Past
experience shows that 5%, 4% and 2% of the notebooks produced by these companies are defective.
If a notebook was found to be defective, what is the probability that the notebook was supplied by A?

a) 44⁄69
b) 25⁄69
c) 13⁄24
d) 11⁄24

Answer: b

Explanation: Let A, B and C be the events that notebooks are provided by A, B and C respectively.
Let D be the event that notebooks are defective
Then,
P(A) = 0.25, P(B) = 0.35, P(C) = 0.4
P(D|A) = 0.05, P(D|B) = 0.04, P(D|C) = 0.02
P(A│D) = (P(D│A) * P(A))/(P(D│A) * P(A) + P(D│B) * P(B) + P(D│C) * P(C))
= 0.0125/0.0345 ≈ 0.3623 = 25⁄69.
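The Bayes' rule computation from question 23 can be reproduced as a short sketch using the priors and defect rates stated in the question:

```python
# Bayes' rule for the notebook problem: P(A | defective).
priors = {"A": 0.25, "B": 0.35, "C": 0.40}
defect_rate = {"A": 0.05, "B": 0.04, "C": 0.02}

evidence = sum(priors[s] * defect_rate[s] for s in priors)  # P(defective) = 0.0345
posterior_A = priors["A"] * defect_rate["A"] / evidence
print(round(posterior_A, 4))  # 0.3623, i.e. 25/69
```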

24. A box of cartridges contains 30 cartridges, of which 6 are defective. If 3 of the cartridges are
removed from the box in succession without replacement, what is the probability that all the 3
cartridges are defective?
a) (6∗5∗4)⁄(30∗30∗30)
b) (6∗5∗4)⁄(30∗29∗28)
c) (6∗5∗3)⁄(30∗29∗28)
d) (6∗6∗6)⁄(30∗30∗30)

Answer: b

Explanation: Let A be the event that the first cartridge is defective. Let B be the event that the second
cartridge is defective. Let C be the event that the third cartridge is defective. Then probability that all
3 cartridges are defective is P(A ∩ B ∩ C)
Hence,
P(A ∩ B ∩ C) = P(A) * P(B|A) * P(C | A ∩ B)
= (6⁄30) * (5⁄29) * (4⁄28)
= (6 * 5 * 4)⁄(30 * 29 * 28).

25. Two boxes containing candies are placed on a table. The boxes are labelled B1 and B2. Box B1
contains 7 cinnamon candies and 4 ginger candies. Box B2 contains 3 cinnamon candies and 8 pepper
candies. The boxes are arranged so that the probability of selecting box B1 is 1⁄3 and the probability
of selecting box B2 is 2⁄3. Suresh is blindfolded and asked to select a candy. He will win a colour TV if
he selects a cinnamon candy. What is the probability that Suresh will win the TV (that is, he will select
a cinnamon candy)?
a) 7⁄33
b) 6⁄33
c) 13⁄33
d) 20⁄33

Answer: c

Explanation: Let A be the event of drawing a cinnamon candy.


Let B1 be the event of selecting box B1.
Let B2 be the event of selecting box B2.
Then, P(B1) =1⁄3 and P(B2) = 2⁄3
P(A) = P(A ∩ B1) + P(A ∩ B2)
= P(A|B1) * P(B1) + P(A|B2) * P(B2)
= (7⁄11) * (1⁄3) + (3⁄11) * (2⁄3)
= 13⁄33.

26. Two boxes containing candies are placed on a table. The boxes are labelled B1 and B2. Box B1
contains 7 cinnamon candies and 4 ginger candies. Box B2 contains 3 cinnamon candies and 8 pepper
candies. The boxes are arranged so that the probability of selecting box B1 is 1⁄3 and the probability
of selecting box B2 is 2⁄3. Suresh is blindfolded and asked to select a candy. He will win a colour TV if
he selects a cinnamon candy. If he wins a colour TV, what is the probability that the candy was from
the first box?
a) 7⁄13
b) 13⁄7
c) 7⁄33
d) 6⁄33

Answer: a

Explanation: Let A be the event of drawing a cinnamon candy.


Let B1 be the event of selecting box B1.
Let B2 be the event of selecting box B2.
Then, P(B1) = 1⁄3 and P(B2) = 2⁄3
Given that Suresh won the TV, the probability that the cinnamon candy was selected from B1 is
P(B1|A) = (P(A|B1) * P(B1))/(P(A│B1) * P(B1) + P(A│B2) * P(B2))
= ((7⁄11) * (1⁄3))/((7⁄11) * (1⁄3) + (3⁄11) * (2⁄3))
= 7⁄13.
27. Suppose box A contains 4 red and 5 blue coins and box B contains 6 red and 3 blue coins. A coin is
chosen at random from the box A and placed in box B. Finally, a coin is chosen at random from among
those now in box B. What is the probability a blue coin was transferred from box A to box B given that
the coin chosen from box B is red?
a) 15⁄29
b) 14⁄29
c) 1⁄2
d) 7⁄10

Answer: a

Explanation: Let E represent the event of moving a blue coin from box A to box B. We want to find the
probability that a blue coin was moved from box A to box B given that the coin chosen from B was
red. The probability of choosing a red coin from box A is 4⁄9 and the probability of choosing a
blue coin from box A is P(E) = 5⁄9. If a red coin was moved from box A to box B, then box B has 7 red
coins and 3 blue coins. Thus the probability of choosing a red coin from box B is 7⁄10. Similarly, if a
blue coin was moved from box A to box B, then the probability of choosing a red coin from box B is
6⁄10.

Hence, the probability that a blue coin was transferred from box A to box B given that the coin chosen
from box B is red is given by

P(E|R) = (P(R|E) * P(E))/P(R)
= ((6⁄10) * (5⁄9))/((7⁄10) * (4⁄9) + (6⁄10) * (5⁄9))
= 15⁄29.

28. An urn B1 contains 2 white and 3 black chips and another urn B2 contains 3 white and 4 black
chips. One urn is selected at random and a chip is drawn from it. If the chip drawn is found black, find
the probability that the urn chosen was B1.

a) 4⁄7
b) 3⁄7
c) 20⁄41
d) 21⁄41

Answer: d

Explanation: Let E1, E2 denote the events of selecting urns B1 and B2 respectively.
Then P(E1) = P(E2) = 1⁄2
Let B denote the event that the chip chosen from the selected urn is black.
Then we have to find P(E1|B).
By hypothesis P(B /E1) = 3⁄5
and P(B /E2) = 4⁄7
By Bayes theorem P(E1 /B) = (P(E1)*P(B│E1))/((P(E1) * P(B│E1)+P(E2) * P(B│E2)) )
= ((1/2) * (3/5))/((1/2) * (3/5)+(1/2)*(4/7) ) = 21/41.
29. At a certain university, 4% of men are over 6 feet tall and 1% of women are over 6 feet tall. The
total student population is divided in the ratio 3:2 in favour of women. If a student is selected at
random from among all those over six feet tall, what is the probability that the student is a woman?

a) 2⁄5
b) 3⁄5
c) 3⁄11
d) 1⁄100

Answer: c

Explanation: Let M be the event that student is male and F be the event that the student is female.
Let T be the event that student is taller than 6 ft.
P(M) = 2⁄5 P(F) = 3⁄5 P(T|M) = 4⁄100 P(T|F) = 1⁄100
P(F│T) = (P(T│F) * P(F))/(P(T│F) * P(F) + P(T│M) * P(M))
= ((1/100) * (3/5))/((1/100) * (3/5) + (4/100) * (2/5))
≈ 0.2727
= 3⁄11.

30. Naina receives emails that consists of 18% spam of those emails. The spam filter is 93% reliable
i.e., 93% of the mails it marks as spam are actually a spam and 93% of spam mails are correctly labelled
as spam. If a mail marked spam by her spam filter, determine the probability that it is really a spam.

a) 50%
b) 84%
c) 39%
d) 63%

Answer: a

Explanation: 18% of emails are spam and 82% are not spam. By Bayes' theorem, the probability that a
mail marked spam is really a spam = (probability of being spam and being detected as spam)/(probability
of being detected as spam) = (0.18 * 0.82)/((0.18 * 0.82) + (0.18 * 0.82)) = 0.5 or 50%.

31. A meeting has 12 employees. Given that 8 of the employees are women, find the probability that
all the employees are women?
a) 11⁄23
b) 12⁄35
c) 2⁄9
d) 1⁄8
Answer: d
Explanation: Assume that each employee is equally likely to be a man or a woman. By using
Bayes' theorem: let B be the event that 8 of the employees are women and let A be
the event that all the employees are women. We want to find P(A|B) = (P(B|A) * P(A))/P(B).
P(B|A) = 1, P(A) = 1⁄12 and P(B) = 8⁄12. So, P(A|B) = (1 * (1⁄12))/(8⁄12) = 1⁄8.

32. A cupboard A has 4 red carpets and 4 blue carpets and a cupboard B has 3 red carpets and 5 blue
carpets. A cupboard is selected and a carpet is chosen from the selected cupboard
such that each carpet in the cupboard is equally likely to be chosen. Cupboards A and B are selected
with probabilities 1⁄5 and 3⁄5 respectively. Given that a carpet selected in the above process is a blue
carpet, find the probability that it came from the cupboard B.
a) 2⁄5
b) 15⁄19
c) 31⁄73
d) 4⁄9

Answer: b

Explanation: The probability of selecting a blue carpet = (1⁄5) * (4⁄8) + (3⁄5) * (5⁄8) = 4⁄40 + 15⁄40
= 19⁄40. The probability of selecting a blue carpet from cupboard B = (3⁄5) * (5⁄8) = 15⁄40. Given that
a carpet selected in the above process is a blue carpet, the probability that it came from the cupboard
B is (15⁄40)/(19⁄40) = 15⁄19.

33. Mangoes numbered 1 through 18 are placed in a bag for delivery. Four mangoes are drawn out of
the bag without replacement. Find the probability such that all the mangoes have even numbers on
them?
a) 43.7%
b) 34%
c) 6.8%
d) 9.3%

Answer: c

Explanation: The events are not independent. There are 10 even-numbered mangoes among the 18, so
the probability that the first mango drawn is even is 10⁄18. Given that the first one was even, there are
only 9 even-numbered mangoes left among the remaining 17, so the probability for the second is 9⁄17.
Similarly, the probability for the third is 8⁄16 and for the fourth is 7⁄15. So the probability that all 4
mangoes are even-numbered is (10⁄18) * (9⁄17) * (8⁄16) * (7⁄15) ≈ 0.068 or 6.8%.
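The sequential probability in question 33 can be verified exactly with rational arithmetic:

```python
from fractions import Fraction

# Drawing four mangoes without replacement from 18 (10 are even-numbered).
p = Fraction(10, 18) * Fraction(9, 17) * Fraction(8, 16) * Fraction(7, 15)
print(p, float(p))  # 7/102, roughly 0.0686, i.e. about 6.8%
```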

34. A family has two children. Given that one of the children is a girl and that she was born on a
Monday, what is the probability that both children are girls?
a) 13⁄27
b) 23⁄54
c) 12⁄19
d) 43⁄58

Answer: a

Explanation: We let Y be the event that the family has one child who is a girl born on a Monday and X
be the event that both children are girls, and apply Bayes' Theorem. Given that there are 7 days of
the week, there are 49 possible combinations for the days of the week the two girls were born on
and 13 of these have a girl who was born on a Monday, so P(Y|X) = 13⁄49. P(X) remains unchanged at
1⁄4. To calculate P(Y), there are 14^2 = 196 possible ways to select the gender and the day of the week
the child was born on. There are 13^2 = 169 ways which do not have a girl born on Monday and
196 − 169 = 27 which do, so P(Y) = 27⁄196. This gives us P(X|Y) = ((13⁄49) * (1⁄4))/(27⁄196) = 13⁄27.
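The 13⁄27 result in question 34 can be confirmed by brute-force enumeration of the 14 × 14 = 196 equally likely (sex, weekday) combinations for the two children:

```python
from fractions import Fraction
from itertools import product

# Each child is one of 14 equally likely (sex, weekday) outcomes; weekday 0 = Monday.
kids = list(product(["B", "G"], range(7)))
both_girls = 0
at_least_one_girl_monday = 0
for c1, c2 in product(kids, kids):
    if c1 == ("G", 0) or c2 == ("G", 0):      # at least one girl born on a Monday
        at_least_one_girl_monday += 1
        if c1[0] == "G" and c2[0] == "G":     # both children are girls
            both_girls += 1

print(Fraction(both_girls, at_least_one_girl_monday))  # 13/27
```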

35. A jar contains 8 marbles, of which 4 are red and 4 are blue. Find the probability of
getting a red marble on the second draw given the first one was red too.
a) 4⁄13
b) 2⁄11
c) 3⁄7
d) 8⁄15
Answer: c

Explanation: Let A be the event of getting a red marble in the first draw and B the event of getting a
red marble in the second draw. P(A) = 4⁄8 and P(A and B) = (4⁄8) * (3⁄7) = 3⁄14. P(B|A) =
P(A and B)/P(A) = (3⁄14)/(1⁄2) = 3⁄7.

36. Maximum a posteriori classifier is also known as:


A. Decision tree classifier
B. Bayes classifier
C. Gaussian classifier
D. Maximum margin classifier

Ans: B

Explanation: Maximum a Posteriori or MAP for short is a Bayesian-based approach to estimating a


distribution and model parameters that best explain an observed dataset.

37. If we are provided with an infinite sized training set which of the following classifier will have the
lowest error probability?
A. Decision tree
B. K- nearest neighbor classifier
C. Bayes classifier
D. Support vector machine

Ans: C
Explanation: Bayes classifier has lowest error probability when trained with infinite sized training set.
38. Let A be an example, and C be a class. The probability P(C|A) is known as:
A. Apriori probability
B. Aposteriori probability
C. Class conditional probability
D. None of the above

Ans: B
Explanation: conditional probability P(C|A) is known as aposteriori probability.

39. Let A be an example, and C be a class. The probability P(C) is known as:
A. Apriori probability
B. Aposteriori probability
C. Class conditional probability
D. None of the above

Ans: A
Explanation: Apriori probability is a probability that is deduced from formal reasoning. In other words,
apriori probability is derived from logically examining an event. Class probability P(C) is apriori
probability.

40. A bank classifies its customer into two classes “fraud” and “normal” based on their instalment
payment behaviour. We know that the probability of a customer being fraud is P(fraud) = 0.20,
the probability of customer defaulting instalment payment is P(default) = 0.40, and the probability
that a fraud customer defaults in installment payment is P(default|fraud) = 0.80. What is the
probability of a customer who defaults in payment being a fraud?
A. 0.80
B. 0.60
C. 0.40
D. 0.20

Ans: C
Explanation: We have to find P(fraud|defaults).
By Bayes’ Rule: P(fraud|default) = (P(default|fraud) * P(fraud))/P(default) = (0.80 * 0.20)/0.40 = 0.40

41. Consider two binary attributes X and Y. We know that the attributes are independent and
Probability P(X=1) = 0.6, and P(Y=0) = 0.4. What is the probability that both X and Y have the value 1?
A. 0.06
B. 0.16
C. 0.26
D. 0.36

Ans: D
Explanation: P(X=1)=0.6 P(Y=0)=0.4 P(Y=1)=1-0.4=0.6
P(X=1, Y=1) = P(X=1)*P(Y=1) = 0.6*0.6 = 0.36 (Since, X and Y are independent)
42. Consider a binary classification problem with two classes C1 and C2. Class labels of ten other
training set instances sorted in increasing order of their distance to an instance x is as follows: {C1, C2,
C1, C2, C2, C2, C1, C2, C1, C2}. How will a K=7 nearest neighbor classifier classify x?

A. There will be a tie


B. C1
C. C2
D. Not enough information to classify

Ans: C
Explanation: closest 7 neighbours are C1, C2, C1, C2, C2, C2, C1. In this C1 has 3 occurrences and C2
has 4 occurrences, therefore, by majority voting X will be classified as C2.
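The majority vote in question 42 can be sketched directly from the sorted label list:

```python
from collections import Counter

# Labels of training instances sorted by increasing distance to x.
sorted_labels = ["C1", "C2", "C1", "C2", "C2", "C2", "C1", "C2", "C1", "C2"]
k = 7
vote = Counter(sorted_labels[:k]).most_common(1)[0][0]  # majority among 7 nearest
print(vote)  # C2 (4 votes vs 3)
```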

43. Given the following training set for classification problem into two classes “fraud” and “normal”.
There are two attributes A1 and A2 taking values 0 or 1. What is the estimated apriori probability
P(fraud)of the class fraud?

A1 A2 Class
1 0 fraud
1 1 fraud
1 1 fraud
1 0 normal
1 1 fraud
0 0 normal
0 0 normal
0 0 normal
1 1 normal
1 0 normal

A. 0.2
B. 0.4
C. 0.6
D. 0.8

Ans: B
Explanation: P(fraud) = 4/10 = 0.4, since 4 out of 10 training instances are fraud cases.

44. Given the following training set for classification problem into two classes “fraud” and “normal”.
There are two attributes A1 and A2 taking values 0 or 1. What is the estimated class conditional
probability P(A1=1, A2=1|fraud)?
A1 A2 Class
1 0 fraud
1 1 fraud

1 1 fraud
1 0 normal
1 1 fraud
0 0 normal
0 0 normal
0 0 normal
1 1 normal
1 0 normal

A. 0.25
B. 0.50
C. 0.75
D. 1.00
Ans: C
Explanation: Among the 4 fraud instances, 3 have A1=1 and A2=1, so P(A1=1, A2=1|fraud) = 3/4
= 0.75

45. Given the following training set for classification problem into two classes “fraud” and “normal”.
There are two attributes A1 and A2 taking values 0 or 1. The Bayes classifier classifies the instance
(A1=1, A2=1) into class?

A1 A2 Class
1 0 fraud
1 1 fraud
1 1 fraud
1 0 normal
1 1 fraud
0 0 normal
0 0 normal
0 0 normal
1 1 normal
1 0 normal

A. fraud
B. normal
C. there will be a tie
D. not enough information to classify

Ans: A
Explanation: P(fraud| A1=1,A2=1) = 0.75
P(Normal| A1=1,A2=1) = 0.25
P(fraud| A1=1,A2=1) > P(Normal| A1=1,A2=1) therefore classified as fraud.
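The counts behind questions 43-45 can be reproduced directly from the training table:

```python
from collections import Counter

# The training table from questions 43-45: (A1, A2, class).
rows = [(1, 0, "fraud"), (1, 1, "fraud"), (1, 1, "fraud"), (1, 0, "normal"),
        (1, 1, "fraud"), (0, 0, "normal"), (0, 0, "normal"), (0, 0, "normal"),
        (1, 1, "normal"), (1, 0, "normal")]

match = [c for a1, a2, c in rows if a1 == 1 and a2 == 1]  # rows with A1=1, A2=1
posterior = Counter(match)
p_fraud = posterior["fraud"] / len(match)    # 3/4 = 0.75
label = posterior.most_common(1)[0][0]       # Bayes classification of (1, 1)
print(p_fraud, label)  # 0.75 fraud
```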

46. In which of the following cases will K-means clustering fail to give good results? 1) Data points with
outliers 2) Data points with different densities 3) Data points with nonconvex shapes
a. 1 and 2
b. 2 and 3
c. 1, 2, and 3
d. 1 and 3

Answer: c

47. Which of the following is a reasonable way to select the number of principal components "k"?
a. Choose k to be the smallest value so that at least 99% of the variance is retained.
b. Choose k to be 99% of m (k = 0.99*m, rounded to the nearest integer).
c. Choose k to be the largest value so that 99% of the variance is retained.
d. Use the elbow method

Answer: a

48. You run gradient descent for 15 iterations with a=0.3 and compute J(theta) after each iteration.
You find that the value of J(Theta) decreases quickly and then levels off. Based on this, which of the
following conclusions seems most plausible?
a. Rather than using the current value of a, use a larger value of a (say a=1.0)
b. Rather than using the current value of a, use a smaller value of a (say a=0.1)
c. a=0.3 is an effective choice of learning rate
d. None of the above

Answer: c

49. What is a sentence parser typically used for?


a. It is used to parse sentences to check if they are utf-8 compliant.
b. It is used to parse sentences to derive their most likely syntax tree structures.
c. It is used to parse sentences to assign POS tags to all tokens.
d. It is used to check if sentences can be parsed into meaningful tokens.

Answer: b

50. Suppose you have trained a logistic regression classifier and it outputs for a new example x a
prediction hθ(x) = 0.2. This means
a. Our estimate for P(y=1 | x) is 0.8
b. Our estimate for P(y=0 | x) is 0.8
c. Our estimate for P(y=1 | x) is 0.2
d. Our estimate for P(y=0 | x) is 0.2

Answer: b
UNIT-3

1. Bayesian networks make use of _____ to answer queries.


(a)Full distribution
(b)Joint distribution
(c)Partial distribution
(d)All of the mentioned
Sol-(b)

2. Bayesian networks are:


(a)Locally structured
(b)Fully structured
(c)Partially structured
(d) All of the mentioned
Sol-(a)

3.Tool used to compute conditional probability in Bayes model is?


(a)Histogram
(b)Pivot Table
(c)Lift curve
(d)Conditional Formatting
Sol-(b)
4. For the Naive Bayes model, which of the following statements is true?
(a)It is not suitable for classification tasks.
(b)It requires reasonable accuracy in rank ordering of probability values to
classify a new observation.
(c)The denominator in the Naive Bayes formula (probability computation) impacts
the rank ordering of probability values.
(d)All of the above
Sol-(b)

5.Which of the following is incorrect?


(a)The Bayes model is more suitable for classification tasks rather than prediction
tasks.
(b)When the number of predictors is large, the Exact Bayes model is more suitable.
(c)It is preferable to use categorical predictors for computing the Bayes model.
(d)Numerical predictors are converted into categorical predictors through binning.
Sol-(b)

6.Which technique use the method of finding training partition records that have
the exact predictor values as the new observation?
(a)Naïve Bayes
(b)Complete Bayes
(c)Multiple Linear Regression
(d)None of the above
Sol-(b)
7.Bayes rule is used to
(a)Solve queries
(b)Increase complexity of a query
(c)Decrease complexity of a query
(d)Answer probabilistic queries
Sol-(d)

8. Bayesian networks allow compact specification of


(a)Belief
(b)Propositional Logic statements
(c)Joint probability distributions
(d)Conditional independence
Sol-(c)

9. What is the naive assumption in a naive Bayes classifier?


(a)All the features of a class are conditionally independent of each other given
the class value.
(b)All the features of a class are dependent on each other.
(c)All the classes are dependent on each other.
(d)The most probable feature of a class is the most important feature to be considered
for classification.
Sol-(a)

10. Why do we restrict the hypothesis space in machine learning?


(a)Makes searching easier
(b)Avoid overfit
(c)Both (a) and(b)
(d) None of them
Sol-(c)

11. Consider whether the following statements are True or False:


(i)When the hypothesis space is richer, overfitting is more likely
(ii)When the feature space is larger, overfitting is more likely
(a)True, False
(b)True, True
(c)False, True
(d)False, False
Sol-(b)

12. The VC dimension of hypothesis space H1 is more than the VC dimension of
hypothesis space H2. Which of the following can be concluded?
(a)The number of examples required for learning a hypothesis in H1 is larger than
the number of examples required for H2.
(b)The number of examples required for learning a hypothesis in H1 is smaller
than the number of examples required for H2.
(c)No relation to the number of samples required for PAC learning.
Sol-(a)
13.How many terms are required for building a bayes model?
(a)1
(b)2
(c)3
(d)4
Sol-(c)

14.What is needed to make probabilistic systems feasible in the world?


(a)Reliability
(b)Crucial Robustness
(c)Feasibility
(d)None of the mentioned
Sol-(b)

15.What does the Bayesian Network Provides?


(a)Complete description of the domain.
(b)Partial description of the domain.
(c)Complete description of the problem.
(d)None of the mentioned.
Sol-(a)

16. How can the entries in the full joint probability distribution be calculated?
(a)Using variables
(b)Using information
(c)Both using variables and information
(d)None of the mentioned
Answer-(b)

17. How can the compactness of the Bayesian network be described?


(a)Locally Structured
(b)Fully Structured
(c)Partial structure
(d)All of the mentioned
Sol-(a)

18. With what is the local structure associated?


(a)Hybrid
(b)Dependent
(c)Linear
(d)None of the mentioned
Sol-(c)

19. Which condition allows a variable to be influenced directly by all the others?


(a)Partially connected
(b)Fully connected
(c)Local Connected
(d)None of the mentioned
Sol-(b)
20.Which of the following statements about the Naïve Bayes is incorrect?
(a)Attributes are equally important.
(b)Attributes are statistically dependent on one another given the class value.
(c)Attributes are statistically independent of one another given the class value.
(d)Attributes can be nominal or numerical
(e) All of the above
Sol-(b)

21.The Naïve Bayes Classifier is a ____in probability.


(a)Technique
(b)Process
(c)Classification
(d)None of the above
Sol-(c)

22.Suppose we would like to convert a nominal attribute X with 4 values to a data


table with only binary variables. How many new attributes are needed?
(a)1
(b)2
(c) 4
(d)8
(e)16
Sol-(c)
23. How large is the hypothesis space when we have n Boolean attributes?
(a)|H| = 3^n
(b)|H| = 2^n
(c)|H| = 1^n
(d)|H| = 4^n
Sol-(a)

24. The computational complexity of classes of learning problems depends on which of
the following?
(a)The size or complexity of the hypothesis space considered by the learner
(b)The accuracy to which the target concept must be approximated
(c)The probability that the learner will output a successful hypothesis
(d)All of these
Sol-(d)

25. What area of CLT tells "How many mistakes we will make before finding a good
hypothesis"?
(a)Sample Complexity
(b)Computational Complexity
(c)Mistake Bound
(d)None of these
Sol-(c)
26. What area of CLT tells "How many examples we need to find a good
hypothesis"?
(a)Sample complexity
(b) Computational Complexity
(c)Mistake Bound
(d)None of these
Sol-(a)

27. The quality of the result in locally weighted regression (LWR) depends on


(a)Choice of the function
(b)Choice of the kernel function K
(c)Choice of the hypothesis space H
(d)All of these
Sol-(d)

28.Which of the following is correct about the Naïve Bayes?


(a)Assumes that all the features in a dataset are independent
(b)Assumes that all the features in a dataset are equally important
(c)Both
(d)All of the above
Sol-(c)
29.Naive Bayes algorithm is a ____Learning algorithm.
(a)Supervised
(b)Reinforcement
(c)Unsupervised
(d)None of these
Sol-(a)

30.The benefit of Naïve Bayes:-


(a)Naïve Bayes is one of the fast and easy ML algorithms to predict a class of
datasets.
(b)It is the most popular choice for text classification problems
(c)It can be used for binary as well as Multi-Class Classifications
(d)All of the above
Sol-(d)
UNIT-4
1) [True or False] k-NN algorithm does more computation on test time rather than train time.
A) TRUE
B) FALSE
Solution: A
Explanation: The training phase of the algorithm consists only of storing the feature vectors and class labels
of the training samples.
In the testing phase, a test point is classified by assigning the label that is most frequent among the k
training samples nearest to that query point, hence the higher computation.

2) In the image below, which would be the best value for k assuming that the algorithm you are using is k-
Nearest Neighbour.

A) 3
B) 10
C) 20
D) 50

Solution: B

Explanation: Validation error is the least when the value of k is 10. So it is best to use this value of k

3) Which of the following distance metric cannot be used in k-NN?


A) Manhattan
B) Minkowski
C) Tanimoto
D) Jaccard
E) Mahalanobis
F) All can be used
Solution: F
Explanation: All of these distance metrics can be used as a distance metric for k-NN.

4) Which of the following option is true about k-NN algorithm?


A) It can be used for classification
B) It can be used for regression
C) It can be used in both classification and regression
Solution: C
Explanation: We can also use k-NN for regression problems. In this case the prediction can be based on the
mean or the median of the k-most similar instances.

5) Which of the following statement is true about k-NN algorithm?


1 k-NN performs much better if all of the data have the same scale
2 k-NN works well with a small number of input variables (p), but struggles when the number
of inputs is very large
3 k-NN makes no assumptions about the functional form of the problem being solved
A) 1 and 2
B) 1 and 3
C) Only 1
D) All of the above
Solution: D
Explanation: The above-mentioned statements are assumptions of kNN algorithm

6) Which of the following machine learning algorithm can be used for imputing missing values of both
categorical and continuous variables?
A) K-NN
B) Linear Regression
C) Logistic Regression
Solution: A
Explanation: k-NN algorithm can be used for imputing missing value of both categorical and continuous
variables.

7) Which of the following is true about Manhattan distance?


A) It can be used for continuous variables
B) It can be used for categorical variables
C) It can be used for categorical as well as continuous
D) None of these
Solution: A
Explanation: Manhattan Distance is designed for calculating the distance between real valued features.

8) Which of the following distance measure do we use in case of categorical variables in k-NN?
1 Hamming Distance
2 Euclidean Distance
3 Manhattan Distance
A) 1
B) 2
C) 3
D) 1 and 2
E) 2 and 3
F) 1,2 and 3
Solution: A
Explanation: Both Euclidean and Manhattan distances are used in case of continuous variables, whereas
hamming distance is used in case of categorical variable.
9) Which of the following will be Euclidean Distance between the two data point A(1,3) and B(2,3)?
A) 1
B) 2
C) 4
D) 8
Solution: A
Explanation: sqrt ((1-2)^2 + (3-3)^2) = sqrt(1^2 + 0^2) = 1

10) Which of the following will be Manhattan Distance between the two data point A(1,3) and B(2,3)?
A) 1
B) 2
C) 4
D) 8
Solution: A
Explanation: |1-2| + |3-3| = 1 + 0 = 1
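The two distance metrics used in questions 9 and 10 can be written as small helper functions:

```python
import math

def euclidean(a, b):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences (no square root).
    return sum(abs(x - y) for x, y in zip(a, b))

print(euclidean((1, 3), (2, 3)), manhattan((1, 3), (2, 3)))  # 1.0 1
```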
Context: 11-12
Suppose, you have given the following data where x and y are the 2 input variables and Class is the
dependent variable.

Below is a scatter plot which shows the above data in 2D space.

11) Suppose, you want to predict the class of new data point x=1 and y=1 using Euclidean distance in 3-NN.
In which class this data point belongs to?
A) + Class
B) – Class
C) Can’t say
D) None of these
Solution: A
Explanation: All three nearest points are of the + class, so this point will be classified as + class.

12) In the previous question, suppose you now want to use 7-NN instead of 3-NN. Which class will the point
x=1, y=1 belong to?
A) + Class
B) – Class
C) Can’t say
Solution: B
Explanation: Now this point will be classified as – class, because there are 4 – class points and 3 + class
points among its nearest neighbours.
Context 13-14:
Suppose you are given the following 2-class data, where “+” represents the positive class and “–”
represents the negative class.

13) Which of the following values of k in k-NN would give the least leave-one-out cross-validation error?
A) 3
B) 5
C) Both have same
D) None of these
Solution: B
Explanation: 5-NN will have the least leave-one-out cross-validation error.

14) Which of the following would be the leave-one-out cross-validation accuracy for k=5?
A) 2/14
B) 4/14
C) 6/14
D) 8/14
E) None of the above
Solution: E
Explanation: In 5-NN we get a leave-one-out cross-validation accuracy of 10/14, which is not among the options.
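The leave-one-out procedure behind questions 13 and 14 can be sketched in plain Python (the data below is a made-up toy set, not the one from the figure):

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    # train is a list of ((x, y), label); majority vote among the k closest points
    neighbours = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

def loocv_accuracy(data, k):
    # leave each point out in turn and predict it from the remaining points
    hits = sum(
        knn_predict(data[:i] + data[i + 1:], point, k) == label
        for i, (point, label) in enumerate(data)
    )
    return hits / len(data)

toy = [((0, 0), "-"), ((0, 1), "-"), ((1, 0), "-"),
       ((5, 5), "+"), ((5, 6), "+"), ((6, 5), "+")]
print(loocv_accuracy(toy, k=3))  # 1.0 on this well-separated toy set
```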

15) Which of the following will be true about k in k-NN in terms of bias?
A) When you increase k, the bias increases
B) When you decrease k, the bias increases
C) Can’t say
D) None of these
Solution: A
Explanation: A large k means a simpler model, and a simpler model is generally considered to have high bias.

16) Which of the following will be true about k in k-NN in terms of variance?
A) When you increase k, the variance increases
B) When you decrease k, the variance increases
C) Can’t say
D) None of these
Solution: B
Explanation: A simpler model (large k) is considered to have less variance, so decreasing k increases variance.

17) The two distances generally used in the k-NN algorithm (Euclidean distance and Manhattan distance) are
shown below for two points A(x1, y1) and B(x2, y2). Your task is to identify each distance from the two
graphs. Which of the following options is true about the graphs below?

A) Left is Manhattan distance and right is Euclidean distance

B) Left is Euclidean distance and right is Manhattan distance
C) Neither left nor right is Manhattan distance
D) Neither left nor right is Euclidean distance
Solution: B
Explanation: The left graph is the graphical depiction of how Euclidean distance works, whereas the right
one is of Manhattan distance.

18) When you find noise in data which of the following option would you consider in k-NN?
A) I will increase the value of k
B) I will decrease the value of k
C) Noise cannot be dependent on value of k
D) None of these
Solution: A
Explanation: To be more certain of the classifications you make, you can try increasing the value of k.

19) In k-NN it is very likely to overfit due to the curse of dimensionality. Which of the following option would
you consider to handle such problem?
1 Dimensionality Reduction
2 Feature selection
A) 1
B) 2
C) 1 and 2
D) None of these
Solution: C
Explanation: In such case you can use either dimensionality reduction algorithm or the feature selection
algorithm

20) Two statements are given below. Which of the statement(s) is/are true?
1 k-NN is a memory-based approach, meaning that the classifier immediately adapts as we collect
new training data.
2 The computational complexity for classifying new samples grows linearly with the number
of samples in the training dataset in the worst-case scenario.
A) 1
B) 2
C) 1 and 2
D) None of these
Solution: C

21) Suppose you are given the following images (1 left, 2 middle and 3 right). Now your task is to compare
the value of k in k-NN used in each image, where k1 is for the 1st, k2 for the 2nd and k3 for the 3rd figure.

A) k1 > k2> k3
B) k1<k2
C) k1 = k2 = k3
D) None of these
Solution: D
Explanation: Value of k is highest in k3, whereas in k1 it is lowest

22) Which of the following values of k in the following graph would give the least leave-one-out
cross-validation accuracy?

A) 1
B) 2
C) 3
D) 5
Solution: B
Explanation: If you keep the value of k as 2, it gives the lowest cross validation accuracy. You can try this
out yourself.

23) A company has built a k-NN classifier that gets 100% accuracy on training data. When they deployed this
model on the client side, it was found that the model is not at all accurate. Which of the following might
have gone wrong?
Note: The model was deployed successfully and no technical issues were found at the client side, apart from
the model performance
A) It is probably an overfitted model
B) It is probably a underfitted model
C) Can’t say
D) None of these
Solution: A
Explanation: An overfitted model seems to perform well on training data, but it is not generalized enough
to give the same results on new data.

24) Given the following 2 statements, find which of these options is/are true in the case of k-NN.
1 In case of very large value of k, we may include points from other classes into the
neighborhood.
2 In case of too small value of k the algorithm is very sensitive to noise
A) 1
B) 2
C) 1 and 2
D) None of these
Solution: C
Explanation: Both options are true and self-explanatory.

25) Which of the following statements is true for k-NN classifiers?


A) The classification accuracy is better with larger values of k
B) The decision boundary is smoother with smaller values of k
C) The decision boundary is linear
D) k-NN does not require an explicit training step
Solution: D
Option A: This is not always true. You have to ensure that the value of k is neither too high nor too low.
Option B: This statement is not true. The decision boundary can be a bit jagged
Option C: Same as option B
Option D: This statement is true

26) True-False: It is possible to construct a 2-NN classifier by using the 1-NN classifier.
A) TRUE
B) FALSE
Solution: A
Explanation: You can implement a 2-NN classifier by ensembling 1-NN classifiers

27) In k-NN what will happen when you increase/decrease the value of k?
A) The boundary becomes smoother with increasing value of K
B) The boundary becomes smoother with decreasing value of K
C) Smoothness of boundary doesn’t dependent on value of K
D) None of these
Solution: A
Explanation: The decision boundary would become smoother by increasing the value of K

28) Following are two statements given for the k-NN algorithm. Which of the statement(s)
is/are true?
1 We can choose optimal value of k with the help of cross validation
2 Euclidean distance treats each feature as equally important
A) 1
B) 2
C) 1 and 2
D) None of these
Solution: C
Explanation: Both the statements are true

29) What would be the time taken by 1-NN if there are N (very large) observations in the test data?
A) N*D
B) N*D*2
C) (N*D)/2
D) None of these
Solution: A
Explanation: For each query, 1-NN computes the distance to all stored observations, so with N observations
of dimension D the cost is proportional to N*D.

30) What would be the relation between the time taken by 1-NN, 2-NN and 3-NN?
A) 1-NN >2-NN >3-NN
B) 1-NN < 2-NN < 3-NN
C) 1-NN ~ 2-NN ~ 3-NN
D) None of these
Solution: C
Explanation: The time taken by k-NN is dominated by computing distances to all training points, which is
the same for any value of k.

31. A good clustering is one having-


A. Low inter-cluster distance and low intra-cluster distance
B. Low inter-cluster distance and high intra-cluster distance
C. High inter-cluster distance and low intra-cluster distance
D. High inter-cluster distance and high intra-cluster distance
Answer: C
Explanation: A good clustering technique produces high-quality clusters in which intra-cluster similarity
is high (i.e. intra-cluster distance is low) and inter-cluster similarity is low (i.e. inter-cluster
distance is high).
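The intra- vs inter-cluster criterion can be made concrete with a toy example (the cluster data and helper names are illustrative):

```python
import math
from itertools import combinations, product

def mean_intra(points):
    # average pairwise distance within one cluster
    pairs = list(combinations(points, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def mean_inter(c1, c2):
    # average distance between points of different clusters
    pairs = list(product(c1, c2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

cluster_a = [(0, 0), (0, 1), (1, 0)]
cluster_b = [(9, 9), (9, 10), (10, 9)]

intra = (mean_intra(cluster_a) + mean_intra(cluster_b)) / 2
inter = mean_inter(cluster_a, cluster_b)
print(intra < inter)  # True: tight clusters that sit far apart
```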

32. Which of the following is an exploratory data mining technique?


A. Classification
B. Clustering
C. Regression
D. None of the above
Answer: B
Explanation: Clustering is an exploratory data mining technique.
33. Which of the following is a hierarchical clustering algorithm?
A. Single linkage clustering
B. K-means clustering
C. DBSCAN
D. None of the above
Answer: A
Explanation: Single-linkage clustering is one of several methods of hierarchical clustering. It is based on
grouping clusters in bottom-up fashion (agglomerative clustering), at each step combining the two clusters
that contain the closest pair of elements not yet belonging to the same cluster.
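The cluster-distance rule used by single linkage (and, for contrast, complete linkage) can be sketched directly:

```python
import math

def single_linkage(c1, c2):
    # cluster distance = distance of the closest pair across the two clusters
    return min(math.dist(a, b) for a in c1 for b in c2)

def complete_linkage(c1, c2):
    # cluster distance = distance of the farthest pair across the two clusters
    return max(math.dist(a, b) for a in c1 for b in c2)

c1 = [(0, 0), (0, 2)]
c2 = [(0, 3), (0, 10)]
print(single_linkage(c1, c2))    # 1.0  (pair (0,2)-(0,3))
print(complete_linkage(c1, c2))  # 10.0 (pair (0,0)-(0,10))
```

Agglomerative clustering repeatedly merges the two clusters with the smallest such distance.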

34. Which of the following clustering algorithm uses a dendrogram?


A. Complete linkage clustering
B. k-means clustering
C. DBSCAN
D. None of the above
Answer: A
Explanation: A dendrogram is a diagram representing a tree. In hierarchical clustering, it illustrates the
arrangement of the clusters produced by the corresponding analyses. Complete-linkage clustering is one of
several methods of hierarchical clustering.

35. Which of the following clustering algorithms uses a minimal spanning tree?
A. Complete linkage clustering
B. Single linkage clustering
C. Average linkage clustering
D. DBSCAN
Answer: B
Explanation: The naive algorithm for single-linkage clustering has time complexity O(n³); a more efficient
approach computes the clustering from a minimum spanning tree of the data.
36. Which of the following is a widely used and effective machine learning algorithm based on the idea of
bagging?
a. Decision Tree
b. Regression
c. Classification
d. Random Forest
Answer: d

37. To find the minimum or the maximum of a function, we set the gradient to zero because:
a. The value of the gradient at extrema of a function is always zero
b. Depends on the type of problem
c. Both A and B
d. None of the above
Answer: a

38. The most widely used metrics and tools to assess a classification model are:
a. Confusion matrix
b. Cost-sensitive accuracy
c. Area under the ROC curve
d. All of the above
Answer: d
39. Which of the following is a good test dataset characteristic?
a. Large enough to yield meaningful results
b. Is representative of the dataset as a whole
c. Both A and B
d. None of the above
Answer: c

40. Which of the following is a disadvantage of decision trees?


a. Factor analysis
b. Decision trees are robust to outliers
c. Decision trees are prone to be overfit
d. None of the above
Answer: c

41. How do you handle missing or corrupted data in a dataset?


a. Drop missing rows or columns
b. Replace missing values with mean/median/mode
c. Assign a unique category to missing values
d. All of the above
Answer: d
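The three imputation strategies in option d can be sketched with the standard library (the data is made up; None marks a missing value):

```python
from statistics import mean, mode

raw = [4.0, None, 6.0, 8.0, None, 6.0]
observed = [v for v in raw if v is not None]

# replace missing entries with the mean (or median/mode) of the observed ones
mean_filled = [v if v is not None else mean(observed) for v in raw]
mode_filled = [v if v is not None else mode(observed) for v in raw]
print(mean_filled)  # [4.0, 6.0, 6.0, 8.0, 6.0, 6.0]
```

Dropping rows is simply `[v for v in raw if v is not None]`; assigning a unique "missing" category is the usual choice for categorical features.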

42. What is the purpose of performing cross-validation?


a. To assess the predictive performance of the models
b. To judge how the trained model performs outside the sample on test data
c. Both A and B
Answer: c

43. Why is second order differencing in time series needed?


a. To remove stationarity
b. To find the maxima or minima at the local point
c. Both A and B
d. None of the above
Answer: c

44. When performing regression or classification, which of the following is the correct way to preprocess the
data?
a. Normalize the data → PCA → training
b. PCA → normalize PCA output → training
c. Normalize the data → PCA → normalize PCA output → training
d. None of the above
Answer: a

45. Which of the following is an example of feature extraction?


a. Constructing bag of words vector from an email
b. Applying PCA projects to a large high-dimensional data
c. Removing stop words in a sentence
d. All of the above
Answer: d

46. What is pca.components_ in sklearn?


a. Set of all eigen vectors for the projection space
b. Matrix of principal components
c. Result of the multiplication matrix
d. None of the above options
Answer: a

47. Which of the following is true about Naive Bayes?


a. Assumes that all the features in a dataset are equally important
b. Assumes that all the features in a dataset are independent
c. Both A and B
d. None of the above options
Answer: c

48. Which of the following statements about regularization is not correct?


a. Using too large a value of lambda can cause your hypothesis to underfit the data.
b. Using too large a value of lambda can cause your hypothesis to overfit the data.
c. Using a very large value of lambda cannot hurt the performance of your hypothesis.
d. None of the above
Answer: d

49. How can you prevent a clustering algorithm from getting stuck in bad local optima?
a. Set the same seed value for each run
b. Use multiple random initializations
c. Both A and B
d. None of the above
Answer: b

50. Which of the following techniques can be used for normalization in text mining?
a. Stemming
b. Lemmatization
c. Stop Word Removal
d. Both A and B
Answer: d
UNIT – 4

1. What is the assumption in K- Nearest Neighbor?


a) All instances correspond to a dimensional space
b) All instances correspond to points in n-dimensional space
c) Similar points form a group or cluster
d) None of the above

2. What is the value of the target function in nearest neighbor learning?

a) Variable with discrete value


b) Variable with real value

c) Variable with either a discrete value or real value

d) None of the above

3. The k-NN algorithm does more computation at test time than at train time.

a) True
b) False

c) Both of the above


d) None of the above

4. Which of the following option is true about k-NN algorithm?

a) It can only be used for classification problems


b) It can only be used for regression problems
c) It can be used for both Classification and regression problems

d) None of the above

5. Which of the following statement is true about k-NN algorithm?

1. k-NN performs much better if all of the data have the same scale
2. k-NN works well with a small number of input variables (p), but struggles when the number of
inputs is very large

3. k-NN makes no assumptions about the functional form of the problem being solved
a) Only 1 is true
b) Both 1 and 3 are true

c) Only 2

d) All of the above

6. What properties are common between K – nearest Neighbor and Locally Weighted Regression?

1. They are lazy learning methods

2. They classify new query instances by analyzing similar instances while ignoring instances that are
very different from the query.

3. They represent instances as real-valued points in an n-dimensional Euclidean space

a) Only 1

b) Only 1 and 3

c) Only 2 and 3
d) All of the above

7. Which of the following are considered as Lazy Learning Algorithms?


1. K Nearest Neighbor
2. Locally Weighted Regression (LWR)

3. Case Based Reasoning


a) Only 2
b) Only 1 and 2

c) Only 3

d) All the three are Lazy Learning Algorithms.

8. What is the difference between lazy Learning and Eager Learning methods?
a) Lazy methods may consider the query instance x, when deciding how to generalize beyond
the training data D.

b) Lazy methods will not consider the query instance x, when deciding how to generalize beyond the
training data D.
c) Lazy Methods will only consider the training data
9. Which of the following will be Euclidean Distance between the two data point A(1,3) and B(2,3)?

a) 1

b) 2

c) 4
d) 8

10. Which of the following will be true about k in k-NN in terms of Bias?
a) When you increase k, the bias increases
b) When you decrease k, the bias increases
c) It can either increase or decrease
d) None of the above

11. Which of the following will be true about k in k-NN in terms of variance?
a) When you increase k, the variance increases
b) When you decrease k, the variance increases
c) It can either increase or decrease
d) None of the above

12. Computational complexity of classes of learning problems depends on which of the following?
a) The size or complexity of the hypothesis space considered by learner
b) The accuracy to which the target concept must be approximated
c) The probability that the learner will output a successful hypothesis
d) All of these

13. The instance-based learner is a


a) Lazy-learner
b) Eager learner
c) Can’t say
Correct option is A
14. When to consider nearest neighbour algorithms?
a) Instances map to points in R^n
b) Not more than 20 attributes per instance
c) Lots of training data
d) None of these
e) A, B & C

15. What are the advantages of Nearest neighbour alogo?


a) Training is very fast
b) Can learn complex target functions
c) Don’t lose information
d) All of these

16. What are the difficulties with k-nearest neighbour algo?


a) Calculate the distance of the test case from all training cases
b) Curse of dimensionality
c) Both A & B
d) None of these

17. What if the target function is real valued in kNN algo?


a) Calculate the mean of the k nearest neighbours
b) Calculate the SD of the k nearest neighbour
c) None of these

18. What is/are true about Distance-weighted KNN?


a) The weight of the neighbour is considered
b) The distance of the neighbour is considered
c) Both A & B
d) None of these

19. What is/are advantage(s) of Distance-weighted k-NN over k-NN?


a) Robust to noisy training data
b) Quite effective when a sufficient large set of training data is provided
c) Both A & B
d) None of these
20. What is/are advantage(s) of Locally Weighted Regression?
a) Pointwise approximation of complex target function
b) Earlier data has no influence on the new ones
c) Both A & B
d) None of these

21. The quality of the result depends on (LWR)


a) Choice of the function
b) Choice of the kernel function K
c) Choice of the hypothesis space H
d) All of these

22. In the k-NN algorithm, given a set of training examples and a value of k < size of the training set
(n), the algorithm predicts the class of a test example to be the:
a) Least frequent class among the classes of the k closest training examples
b) Most frequent class among the classes of the k closest training examples
c) Class of the closest training example
d) Most frequent class among the classes of the k farthest training examples

23. Which of the following will be true about k in k-NN in terms of variance
a) When you increase k, the variance increases
b) When you decrease k, the variance increases
c) Can’t say
d) None of these

24. Which of the following option is true about k-NN algorithm?


a) It can be used for classification
b) It can be used for regression
c) It can be used for both classification and regression
25. In k-NN it is very likely to overfit due to the curse of dimensionality. Which of the following
options would you consider to handle such a problem?
1. Dimensionality Reduction
2. Feature selection
a) 1
b) 2
c) 1 and 2
d) None of these

26. When you find noise in data which of the following option would you consider in k- NN
a) I will increase the value of k
b) I will decrease the value of k
c) Noise cannot be dependent on the value of k
d) None of these

27. Which of the following will be true about k in k-NN in terms of Bias?
a) When you increase k, the bias increases
b) When you decrease k, the bias increases
c) Can’t say
d) None of these

28. What is used to mitigate overfitting in a test set?


a) Overfitting set
b) Training set
c) Validation dataset
d) Evaluation set

29. A radial basis function is a


a) Activation function
b) Weight
c) Learning rate
d) none
30. Mistake Bound is
a) How many training examples are needed for learner to converge to a successful hypothesis.
b) How much computational effort is needed for a learner to converge to a successful hypothesis
c) How many training examples will the learner misclassify before converging to a
successful hypothesis
d) None of these

31. In K-Nearest Neighbor it is very likely to overfit due to the curse of dimensionality. Which of
the following option would you consider to handle such problem?
1. Dimensionality Reduction
2. Feature selection
a) 1
b) 2
c) 1 and 2
d) None of these
UNIT - 5
1) Which of the following statement is true in following case?

A) Feature F1 is an example of nominal variable.


B) Feature F1 is an example of ordinal variable.
C) It doesn’t belong to any of the above categories.
D) Both of these

Solution: (B)

Explanation: Ordinal variables are variables which have some order in their categories. For
example, grade A should be considered a higher grade than grade B.

2) Which of the following is an example of a deterministic algorithm?

A) PCA

B) K-Means

C) None of the above

Solution: (A)

Explanation: A deterministic algorithm is one whose output does not change on different runs.
PCA gives the same result on every run, but k-means does not, because of its random initialization.

3) [True or False] The Pearson correlation between two variables is zero, but their values can
still be related to each other.

A) TRUE

B) FALSE

Solution: (A)

Explanation: Consider Y = X². The two variables are not merely associated; one is a function of the
other, yet the Pearson correlation between them is 0.
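The Y = X² counterexample is easy to check numerically (the pearson helper below is a plain implementation of the usual formula):

```python
from statistics import mean

def pearson(xs, ys):
    # Pearson r = covariance / (std_x * std_y), computed directly
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]  # y is completely determined by x
print(pearson(x, y))     # 0.0 -- no *linear* association
```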

4) Which of the following statement(s) is / are true for Gradient Decent (GD) and Stochastic
Gradient Decent (SGD)?

1. In GD and SGD, you update a set of parameters in an iterative manner to minimize the error
function.
2. In SGD, you have to run through all the samples in your training set for a single update of a
parameter in each iteration.
3. In GD, you either use the entire data or a subset of training data to update a parameter in each
iteration.

A) Only 1

B) Only 2

C) Only 3

D) 1 and 2

E) 2 and 3

F) 1,2 and 3

Solution: (A)

Explanation: In SGD, each iteration generally uses a batch containing a random sample of the data,
whereas in GD each iteration uses all of the training observations.

5) Which of the following hyper parameter(s), when increased may cause random forest to over
fit the data?

(1) Number of Trees


(2) Depth of Tree
(3) Learning Rate

A) Only 1

B) Only 2

C) Only 3

D) 1 and 2

E) 2 and 3

F) 1,2 and 3

Solution: (B)

Explanation: Usually, increasing the depth of a tree causes overfitting. Learning rate is not a
hyperparameter in random forest, and increasing the number of trees does not cause overfitting.

6) Imagine, you are working with “Analytics Vidhya” and you want to develop a machine
learning algorithm which predicts the number of views on the articles.
Your analysis is based on features like author name, number of articles written by the same
author on Analytics Vidhya in past and a few other features. Which of the following evaluation
metric would you choose in that case?

(1) Mean Square Error


(2) Accuracy
(3) F1 Score

A) Only 1

B) Only 2

C) Only 3

D) 1 and 3

E) 2 and 3

F) 1 and 2

Solution:(A)

Explanation: The number of views of an article is a continuous target variable, so this falls under
regression. Mean squared error is therefore the appropriate evaluation metric.

7) Given below are three images (1,2,3). Which of the following option is correct for these
images?
A) 1 is tanh, 2 is ReLU and 3 is SIGMOID activation functions.

B) 1 is SIGMOID, 2 is ReLU and 3 is tanh activation functions.

C) 1 is ReLU, 2 is tanh and 3 is SIGMOID activation functions.

D) 1 is tanh, 2 is SIGMOID and 3 is ReLU activation functions.

Solution: (D)

Explanation: The range of the SIGMOID function is [0,1] and the range of the tanh function is [-1,1],
while ReLU outputs zero for negative inputs. So option D is the right answer.

8) Below are the 8 actual values of target variable in the train file.

[0, 0, 0, 1, 1, 1, 1, 1]

What is the entropy of the target variable?


A) -(5/8 log(5/8) + 3/8 log(3/8))

B) 5/8 log(5/8) + 3/8 log(3/8)

C) 3/8 log(5/8) + 5/8 log(3/8)

D) 5/8 log(3/8) – 3/8 log(5/8)

Solution: (A)

The formula for entropy is -Σ p·log(p) over the classes. Here p(1) = 5/8 and p(0) = 3/8, giving
-(5/8 log(5/8) + 3/8 log(3/8)). So the answer is A.
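The computation can be checked in a few lines (entropy here uses log base 2, the usual convention for bits):

```python
import math

def entropy(labels):
    # -sum(p * log2(p)) over the class proportions
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

y = [0, 0, 0, 1, 1, 1, 1, 1]
print(round(entropy(y), 4))  # 0.9544
```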

9) Let’s say, you are working with categorical feature(s) and you have not looked at the
distribution of the categorical variable in the test data.

You want to apply one hot encoding (OHE) on the categorical feature(s). What challenges you
may face if you have applied OHE on a categorical variable of train dataset?

A) All categories of categorical variable are not present in the test dataset.

B) Frequency distribution of categories is different in train as compared to the test dataset.

C) Train and Test always have same distribution.

D) Both A and B

E) None of these

Solution: (D)

Both are true. OHE will fail to encode categories that are present in the test set but not in the
train set, which is one of the main challenges of applying OHE. The challenge in option B is also
real: you need to be more careful with OHE if the frequency distribution differs between train and
test.

10) Skip gram model is one of the best models used in Word2vec algorithm for words
embedding. Which one of the following models depict the skip gram model?
A) A

B) B

C) Both A and B

D) None of these

Solution: (B)

Both models (model 1 and model 2) are used in the Word2vec algorithm. Model 1 represents a CBOW
model, whereas model 2 represents the Skip-gram model.

11) Let’s say, you are using activation function X in hidden layers of neural network. At a
particular neuron for any given input, you get the output as “-0.0001”. Which of the following
activation function could X represent?

A) ReLU

B) tanh

C) SIGMOID

D) None of these

Solution: (B)

The function is tanh, because its output range is (-1, 1), so it can produce small negative values such as -0.0001.

12) [True or False] LogLoss evaluation metric can have negative values.

A) TRUE
B) FALSE

Solution: (B)

Log loss cannot have negative values: each term is -log(p) for a probability 0 < p ≤ 1, which is always ≥ 0.
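A quick sketch shows why: every summand of the log loss is -log(p) for some probability p in (0, 1], which is non-negative (the example values are arbitrary):

```python
import math

def log_loss(y_true, y_pred):
    # average binary cross-entropy; each summand is >= 0
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_pred)) / len(y_true)

loss = log_loss([1, 0, 1], [0.9, 0.2, 0.6])
print(loss >= 0)  # True -- log loss can never be negative
```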

13) Which of the following statements is/are true about “Type-1” and “Type-2” errors?

Type1 is known as false positive and Type2 is known as false negative.


Type1 is known as false negative and Type2 is known as false positive.
Type1 error occurs when we reject a null hypothesis when it is actually true.

A) Only 1

B) Only 2

C) Only 3

D) 1 and 2

E) 1 and 3

F) 2 and 3

Solution: (E)

In statistical hypothesis testing, a type I error is the incorrect rejection of a true null hypothesis
(a “false positive”), while a type II error is incorrectly retaining a false null hypothesis (a “false
negative”).

14) Which of the following is/are one of the important step(s) to pre-process the text in NLP
based projects?

1. Stemming
2. Stop word removal
3. Object Standardization

A) 1 and 2

B) 1 and 3

C) 2 and 3

D) 1,2 and 3

Solution: (D)

Stemming is a rudimentary rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s”, etc.)
from a word.

Stop words are words that are not relevant to the context of the data, for example is/am/are.

Object standardization is also one of the good ways to pre-process the text.

15) Suppose you want to project high dimensional data into lower dimensions. The two most
famous dimensionality reduction algorithms used here are PCA and t-SNE. Let’s say you have
applied both algorithms respectively on data “X” and you got the datasets “X_projected_PCA”
, “X_projected_tSNE”.

Which of the following statements is true for “X_projected_PCA” & “X_projected_tSNE” ?

A) X_projected_PCA will have interpretation in the nearest neighbour space.

B) X_projected_tSNE will have interpretation in the nearest neighbour space.

C) Both will have interpretation in the nearest neighbour space.

D) None of them will have interpretation in the nearest neighbour space.

Solution: (B)

The t-SNE algorithm considers nearest-neighbour points while reducing the dimensionality of the
data. So, after using t-SNE, the reduced dimensions retain an interpretation in nearest-neighbour
space. This is not the case for PCA.

Context: 16-17

Given below are three scatter plots for two features (Image 1, 2 & 3 from left to right).

16) In the above images, which of the following is/are example of multi-collinear features?

A) Features in Image 1

B) Features in Image 2

C) Features in Image 3

D) Features in Image 1 & 2

E) Features in Image 2 & 3


F) Features in Image 3 & 1

Solution: (D)

In Image 1 the features have a high positive correlation, whereas in Image 2 they have a high
negative correlation, so in both images the pair of features is an example of multicollinear
features.

17) In previous question, suppose you have identified multi-collinear features. Which of the
following action(s) would you perform next?

1. Remove both collinear variables.
2. Instead of removing both variables, we can remove only one variable.
3. Removing correlated variables might lead to loss of information. In order to retain those
variables, we can use penalized regression models like ridge or lasso regression.

A) Only 1

B)Only 2

C) Only 3

D) Either 1 or 3

E) Either 2 or 3

Solution: (E)

You cannot remove both features, because removing both would lose all of their information. You
should either remove only one feature or use a regularization algorithm such as L1 or L2.

18) Adding a non-important feature to a linear regression model may result in.

1. Increase in R-square
2. Decrease in R-square

A) Only 1 is correct

B) Only 2 is correct

C) Either 1 or 2
D) None of these

Solution: (A)

After adding a feature to the feature space, whether that feature is important or not, the
R-squared always increases.

19) Suppose, you are given three variables X, Y and Z. The Pearson correlation coefficients for
(X, Y), (Y, Z) and (X, Z) are C1, C2 & C3 respectively.

Now, you have added 2 to all values of X (i.e. new values become X+2), subtracted 2 from all
values of Y (i.e. new values are Y-2) and Z remains the same. The new coefficients for (X,Y),
(Y,Z) and (X,Z) are given by D1, D2 & D3 respectively. How do the values of D1, D2 & D3
relate to C1, C2 & C3?

A) D1= C1, D2 < C2, D3 > C3

B) D1 = C1, D2 > C2, D3 > C3

C) D1 = C1, D2 > C2, D3 < C3

D) D1 = C1, D2 < C2, D3 < C3

E) D1 = C1, D2 = C2, D3 = C3

F) Cannot be determined

Solution: (E)

Correlation between the features won’t change if you add or subtract a value in the features.

20) Imagine, you are solving a classification problems with highly imbalanced class. The
majority class is observed 99% of times in the training data.

Your model has 99% accuracy after taking the predictions on test data. Which of the following
is true in such a case?

1. Accuracy metric is not a good idea for imbalanced class problems.
2. Accuracy metric is a good idea for imbalanced class problems.
3. Precision and recall metrics are good for imbalanced class problems.
4. Precision and recall metrics aren’t good for imbalanced class problems.

A) 1 and 3

B) 1 and 4

C) 2 and 3
D) 2 and 4

Solution: (A)

Accuracy is misleading for highly imbalanced classes, since always predicting the majority class
already achieves 99%; precision and recall better describe performance on the minority class.

21) In ensemble learning, you aggregate the predictions for weak learners, so that an ensemble
of these models will give a better prediction than prediction of individual models.

Which of the following statements is / are true for weak learners used in ensemble model?

1. They don’t usually overfit.
2. They have high bias, so they cannot solve complex learning problems.
3. They usually overfit.

A) 1 and 2

B) 1 and 3

C) 2 and 3

D) Only 1

E) Only 2

F) None of the above

Solution: (A)

Weak learners focus on particular parts of a problem. They usually don’t overfit, which means weak
learners have low variance and high bias.

22) Which of the following options is/are true for K-fold cross-validation?

1. Increase in K will result in higher time required to cross validate the result.
2. Higher values of K will result in higher confidence on the cross-validation result as compared
to lower values of K.
3. If K=N, then it is called Leave-one-out cross validation, where N is the number of observations.

A) 1 and 2

B) 2 and 3

C) 1 and 3

D) 1,2 and 3

Solution: (D)
A larger k means less bias toward overestimating the true expected error (as training folds will be
closer to the total dataset) and higher running time (as you approach the limit case of
leave-one-out CV). We also need to consider the variance across the k folds' accuracies when
selecting k.

Question Context 23-24

Cross-validation is an important step in machine learning for hyper parameter tuning. Let’s say
you are tuning a hyper-parameter “max_depth” for GBM by selecting it from 10 different depth
values (values are greater than 2) for tree based model using 5-fold cross validation.

The time taken to train the algorithm (on a model with max_depth 2) on 4 folds is 10 seconds, and
the prediction on the remaining 1 fold takes 2 seconds.

Note: Ignore hardware dependencies from the equation.

23) Which of the following option is true for overall execution time for 5-fold cross validation
with 10 different values of “max_depth”?

A) Less than 100 seconds

B) 100 – 300 seconds

C) 300 – 600 seconds

D) More than or equal to 600 seconds

Solution: (D)

Each iteration for depth “2” in 5-fold cross-validation will take 10 seconds for training and 2
seconds for testing. So 5 folds will take 12*5 = 60 seconds. Since we are searching over 10 depth
values, the algorithm would take 60*10 = 600 seconds. But training and testing a model at a depth
greater than 2 will take more time than at depth “2”, so the overall time would be greater than
600 seconds.
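The arithmetic in this explanation can be checked with a tiny helper (a hypothetical function, just to make the lower bound explicit):

```python
def total_cv_time(train_secs, test_secs, n_folds, n_candidates):
    """Lower bound on grid-search time: every candidate value is
    trained and evaluated once per fold."""
    return (train_secs + test_secs) * n_folds * n_candidates

# 10 s training + 2 s prediction per fold, 5 folds, 10 depth values.
# This is a lower bound: deeper trees take longer than depth 2.
print(total_cv_time(10, 2, 5, 10))  # 600
```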

24) In the previous question, suppose you train the same algorithm to tune 2 hyperparameters,
say “max_depth” and “learning_rate”.

You want to select the right value of “max_depth” (from the given 10 depth values) and of the
learning rate (from the given 5 different learning rates). In such a case, which of the following
will represent the overall time?

A) 1000 – 1500 seconds

B) 1500 – 3000 seconds

C) More than or equal to 3000 seconds

D) None of these

Solution: (D)

Same as question number 23.

25) Given below is a scenario for training error TE and Validation error VE for a machine
learning algorithm M1. You want to choose a hyperparameter (H) based on TE and VE.

H   TE    VE
1   105   90
2   200   85
3   250   96
4   105   85
5   300   100

Which value of H will you choose based on the above table?

A) 1

B) 2

C) 3

D) 4

E) 5

Solution: (D)

Looking at the table, H = 4 (option D) gives both the lowest training error (105) and the lowest validation error (85), so it is the best choice.

26) What would you do in PCA to get the same projection as SVD?

A) Transform data to zero mean

B) Transform data to zero median

C) Not possible

D) None of these
Solution: (A)

When the data has a zero mean vector, PCA will have the same projections as SVD; otherwise you
have to centre the data first before taking the SVD.
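This equivalence is easy to verify numerically with NumPy (a small sketch on synthetic data; the principal directions agree with the right singular vectors up to sign):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) + 5.0          # data with a non-zero mean

Xc = X - X.mean(axis=0)                     # centre the data: now zero mean

# PCA via the covariance matrix of the centred data
cov = Xc.T @ Xc / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
pcs = eigvecs[:, ::-1]                      # principal components, descending variance

# SVD of the centred data matrix gives the same directions
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Directions agree up to sign, so the projections match up to sign
proj_pca = Xc @ pcs
proj_svd = Xc @ Vt.T
assert np.allclose(np.abs(proj_pca), np.abs(proj_svd))
```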

Question Context 27-28

Assume there is a black box algorithm, which takes training data with multiple observations (t1,
t2, t3,…….. tn) and a new observation (q1). The black box outputs the nearest neighbor of q1
(say ti) and its corresponding class label ci.

You can also think of this black box algorithm as being the same as 1-NN (1-nearest neighbor).

27) It is possible to construct a k-NN classification algorithm based on this black box alone.

Note: Where n (number of training observations) is very large compared to k.

A) TRUE

B) FALSE

Solution: (A)

In the first step, you pass an observation (q1) to the black box algorithm, and it returns the
nearest observation and its class.

In the second step, you throw the returned nearest observation out of the training data and again
input the observation (q1). The black box algorithm will again return the nearest observation and
its class.

You need to repeat this procedure k times.
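The procedure described above can be sketched in Python; the 1-D data, the `one_nn` oracle, and the distance metric are illustrative assumptions:

```python
from collections import Counter

def one_nn(train, query):
    """The 1-NN 'black box': returns (index, label) of the nearest point."""
    idx = min(range(len(train)), key=lambda i: abs(train[i][0] - query))
    return idx, train[idx][1]

def knn_via_black_box(train, query, k):
    """k-NN built only from repeated 1-NN calls: query, remove the
    returned neighbour, and repeat k times; classify by majority vote."""
    pool = list(train)
    labels = []
    for _ in range(k):
        idx, label = one_nn(pool, query)
        labels.append(label)
        pool.pop(idx)            # throw the neighbour out and ask again
    return Counter(labels).most_common(1)[0][0]

# 1-D toy data: (feature, label)
train = [(0.0, 'a'), (0.4, 'a'), (0.6, 'b'), (5.0, 'b'), (6.0, 'b')]
print(knn_via_black_box(train, 0.5, 3))  # 'a' (neighbours 0.4, 0.6, 0.0 -> a, b, a)
```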

28) Instead of using 1-NN black box we want to use the j-NN (j>1) algorithm as black box.
Which of the following option is correct for finding k-NN using j-NN?

1. J must be a proper factor of k
2. J > k
3. Not possible

A) 1

B) 2

C) 3

Solution: (A)

Same as question number 27


29) Suppose you are given 7 Scatter plots 1-7 (left to right) and you want to compare Pearson
correlation coefficients between variables of each scatterplot.

Which of the following is in the right order?

1. 1<2<3<4
2. 1>2>3>4
3. 7<6<5<4
4. 7>6>5>4

A) 1 and 3

B) 2 and 3

C) 1 and 4

D) 2 and 4

Solution: (B)

From image 1 to 4, the correlation is decreasing in absolute value. From image 4 to 7, the
magnitude of the correlation is increasing, but the values are negative (for example, 0, -0.3, -0.7, -0.99).

30) You can evaluate the performance of a binary classification problem using different
metrics such as accuracy, log-loss, and F-score. Let’s say you are using log-loss as the
evaluation metric.

Which of the following option is / are true for interpretation of log-loss as an evaluation metric?

1. If a classifier is confident about an incorrect classification, then log-loss will penalise it heavily.
2. If, for a particular observation, the classifier assigns a very small probability to the correct class, then the corresponding contribution to the log-loss will be very large.
3. The lower the log-loss, the better the model.

A) 1 and 3

B) 2 and 3

C) 1 and 2

D) 1,2 and 3
Solution: (D)

Options are self-explanatory.
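A small Python sketch makes points 1 and 2 concrete (the probabilities below are made up for illustration):

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log-loss: mean of -[y*log(p) + (1-y)*log(1-p)]."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# A confident wrong prediction is penalised far more heavily
# than a mildly wrong one.
print(log_loss([1], [0.6]))    # ~0.51
print(log_loss([1], [0.01]))   # ~4.61 (tiny probability on the true class)
```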

Question Context 31-32

Below are five samples given in the dataset.

Note: Visual distance between the points in the image represents the actual distance.

31) Which of the following is leave-one-out cross-validation accuracy for 3-NN (3-nearest
neighbor)?

A) 0

B) 0.4

C) 0.8

D) 1

Solution: (C)

In leave-one-out cross-validation, we select (n-1) observations for training and 1 observation
for validation. Consider each point as a validation point and then find the 3 points nearest to
it. If you repeat this procedure for all points, you will get the correct classification for all
the positive-class points in the figure above, but the negative-class points will be
misclassified. Hence you will get 80% accuracy.

32) Which of the following value of K will have least leave-one-out cross validation accuracy?

A) 1NN

B) 3NN

C) 4NN

D) All have same leave one out error

Solution: (A)
Each point will always be misclassified in 1-NN, which means that you will get 0%
accuracy.

33) Suppose you are given the below data and you want to apply a logistic regression model for
classifying it in two given classes.

You are using logistic regression with L1 regularization.

Where C is the regularization parameter and w1 & w2 are the coefficients of x1 and x2.

Which of the following option is correct when you increase the value of C from zero to a very
large value?

A) First w2 becomes zero and then w1 becomes zero

B) First w1 becomes zero and then w2 becomes zero

C) Both becomes zero at the same time

D) Both cannot be zero even after very large value of C

Solution: (B)

By looking at the image, we see that classification can be done efficiently using just x2, so w1
will become zero first. As the regularization parameter increases further, w2 will also move
closer and closer to zero.

34) Suppose we have a dataset which can be trained with 100% accuracy with help of a decision
tree of depth 6. Now consider the points below and choose the option based on these points.

Note: All other hyper parameters are same and other factors are not affected.

1. Depth 4 will have high bias and low variance
2. Depth 4 will have low bias and low variance

A) Only 1

B) Only 2

C) Both 1 and 2

D) None of the above

Solution: (A)

A decision tree of depth 4 fit to such data will most likely underfit it. In the case of
underfitting, you will have high bias and low variance.
35) Which of the following options can be used to reach the global minimum in the k-means algorithm?

1. Try to run the algorithm for different centroid initializations
2. Adjust the number of iterations
3. Find out the optimal number of clusters

A) 2 and 3

B) 1 and 3

C) 1 and 2

D) All of the above

Solution: (D)

All of these options can be tuned to help find the global minimum.
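Option 1, running k-means from several centroid initializations and keeping the best run, can be sketched with NumPy (a minimal Lloyd's-algorithm illustration, not a production implementation):

```python
import numpy as np

def kmeans(X, k, n_iters=50, rng=None):
    """One run of Lloyd's algorithm from a random centroid initialisation.
    Returns (centroids, inertia); inertia is the within-cluster sum of
    squared distances, the objective k-means tries to minimise."""
    rng = rng if rng is not None else np.random.default_rng()
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # leave empty clusters where they are
                centroids[j] = X[labels == j].mean(axis=0)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, float((dists.min(axis=1) ** 2).sum())

def best_of_restarts(X, k, n_restarts=10, seed=0):
    """Re-run k-means from several random initialisations and keep the
    solution with the lowest inertia: a common way to avoid bad local minima."""
    rng = np.random.default_rng(seed)
    runs = [kmeans(X, k, rng=rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[1])
```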

36) Imagine you are working on a project which is a binary classification problem. You trained
a model on the training dataset and got the confusion matrix below on the validation dataset.

Based on the above confusion matrix, choose which option(s) below will give you correct
predictions?

1. Accuracy is ~0.91
2. Misclassification rate is ~0.91
3. False positive rate is ~0.95
4. True positive rate is ~0.95

A) 1 and 3

B) 2 and 4

C) 1 and 4

D) 2 and 3
Solution: (C)

The accuracy (correct classification rate) is (50+100)/165, which is nearly equal to 0.91.

The true positive rate is how often you predict the positive class correctly, so the true
positive rate would be 100/105 = 0.95, also known as “Sensitivity” or “Recall”.
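These rates can be computed directly from the four confusion-matrix cells; the counts below (TP=100, FN=5, TN=50, FP=10) are the ones implied by the explanation, since the matrix image itself is not reproduced here:

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard rates from a binary confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "true_positive_rate": tp / (tp + fn),   # sensitivity / recall
        "false_positive_rate": fp / (fp + tn),
    }

# Counts implied by the explanation above:
m = binary_metrics(tp=100, fn=5, fp=10, tn=50)
print(round(m["accuracy"], 2), round(m["true_positive_rate"], 2))  # 0.91 0.95
```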

37) For which of the following hyperparameters is a higher value better for the decision tree
algorithm?

1. Number of samples used for a split
2. Depth of tree
3. Samples per leaf

A)1 and 2

B) 2 and 3

C) 1 and 3

D) 1, 2 and 3

E) Can’t say

Solution: (E)

For all three hyperparameters, increasing the value does not necessarily improve performance.
For example, with a very high value of depth of tree, the resulting tree may overfit the data and
would not generalize well. On the other hand, with a very low value, the tree may underfit the
data. So we can’t say for sure that “higher is better”.

Context 38-39

Imagine you have a 28 * 28 image and you run a 3 * 3 convolution on it with an input depth of 3
and an output depth of 8.

Note: Stride is 1 and you are using same padding.

38) What is the dimension of the output feature map when using the given parameters?

A) 28 width, 28 height and 8 depth

B) 13 width, 13 height and 8 depth

C) 28 width, 13 height and 8 depth

D) 13 width, 28 height and 8 depth

Solution: (A)

With same padding and stride 1, the output spatial size equals the input size, hence 28 x 28 with
8 output channels. In general, the formula for calculating output size is

output size = (N – F + 2P)/S + 1

where N is the input size, F is the filter size, P is the padding and S is the stride; for same
padding with an odd filter, P = (F – 1)/2.
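The formula, with a padding term covering both the 'valid' and 'same' cases, can be written as a small helper (an illustrative sketch for odd filter sizes):

```python
def conv_output_size(n, f, stride=1, padding="same"):
    """Spatial output size of a convolution.

    'valid': no padding, output = (n - f)//stride + 1
    'same' : pad p = (f - 1)//2 on each side (odd f), so that with
             stride 1 the output size equals the input size.
    """
    p = 0 if padding == "valid" else (f - 1) // 2
    return (n - f + 2 * p) // stride + 1

print(conv_output_size(28, 3, stride=1, padding="same"))   # 28
print(conv_output_size(28, 3, stride=1, padding="valid"))  # 26
print(conv_output_size(28, 3, stride=2, padding="valid"))  # 13
```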

39) What are the dimensions of the output feature map when using the following parameters?

A) 28 width, 28 height and 8 depth

B) 13 width, 13 height and 8 depth

C) 28 width, 13 height and 8 depth

D) 13 width, 28 height and 8 depth

Solution: (B)

Same as above.

40) Suppose we were plotting visualizations for different values of C (penalty parameter) in the
SVM algorithm. Due to some reason, we forgot to tag the C values with the visualizations. In that
case, which of the following options best explains the C values for the images below (1, 2, 3 from
left to right, so the C values are C1 for image 1, C2 for image 2 and C3 for image 3) in the case of an rbf kernel?

A) C1 = C2 = C3

B) C1 > C2 > C3

C) C1 < C2 < C3

D) None of these

Solution: (C)

A larger penalty parameter C punishes misclassified training points more heavily, so the rbf decision boundary becomes increasingly complex and fits the training data more tightly from image 1 to image 3. Hence C1 < C2 < C3.


41. Which of the following is not one of the different learning methods?

a) Memorization

b) Analogy

c) Deduction

d) Introduction

Answer: d

Explanation: Different learning methods include memorization, analogy and deduction.

42. Which of the following is the model used for learning?

a) Decision trees

b) Neural networks

c) Propositional and FOL rules


d) All of the mentioned

Answer: d

Explanation: Decision trees, neural networks, propositional rules and FOL rules are all models
of learning.

43. Automated vehicle is an example of ______

a) Supervised learning

b) Unsupervised learning

c) Active learning

d) Reinforcement learning

Answer: a

Explanation: In an automated vehicle, a set of vision inputs and corresponding actions is
available to the learner, hence it is an example of supervised learning.

44. Which of the following is an example of active learning?

a) News Recommender system

b) Dust cleaning machine

c) Automated vehicle

d) None of the mentioned

Answer: a

Explanation: In active learning, not only is a teacher available, but the learner can also ask for
suitable perception-action pair examples to improve performance.

45. In which of the following learning the teacher returns reward and punishment to learner?

a) Active learning

b) Reinforcement learning

c) Supervised learning

d) Unsupervised learning
Answer: b

Explanation: Reinforcement learning is the type of learning in which the teacher returns a reward
or punishment to the learner.

46. Decision trees are appropriate for the problems where ___________

a) Attributes are both numeric and nominal

b) Target function takes on a discrete number of values.

c) Data may have errors

d) All of the mentioned

Answer: d

Explanation: Decision trees can be used in all the conditions stated.

47. Which of the following is not an application of learning?

a) Data mining

b) WWW

c) Speech recognition

d) None of the mentioned

Answer: d

Explanation: All mentioned options are applications of learning.

48. Which of the following is the component of learning system?

a) Goal

b) Model

c) Learning rules

d) All of the mentioned

Answer: d

Explanation: Goal, model, learning rules and experience are the components of learning system.

49. Which of the following is also called as exploratory learning?


a) Supervised learning

b) Active learning

c) Unsupervised learning

d) Reinforcement learning

Answer: c

Explanation: In unsupervised learning, no teacher is available, hence it is also called
exploratory learning.
Unit - 5
1. The most significant phase in a genetic algorithm is
A. Crossover
B. Mutation
C. Selection
D. Fitness function
Correct option is A

2. The crossover operator produces two new offspring from


A. Two parent strings, by copying selected bits from each parent
B. One parent strings, by copying selected bits from selected parent
C. Two parent strings, by copying selected bits from one parent
D. None of these
Correct option is A

3. The evolution over time of the population within a GA can be mathematically characterized
using the concept of
A. Schema
B. Crossover
C. Don't care
D. Fitness function
Correct option is A

4. In a genetic algorithm, the process of selecting parents which mate and recombine to create
offspring for the next generation is known as:
A. Tournament selection
B. Rank selection
C. Fitness sharing
D. Parent selection
Correct option is D

5. Crossover operations are performed in genetic programming by replacing
A. A randomly chosen subtree of one parent program with a subtree
from the other parent program
B. A randomly chosen root node tree of one parent program with a subtree
from the other parent program
C. A randomly chosen root node tree of one parent program with a root
node tree from the other parent program
D. None of these
Correct option is A
6. ______ emphasizes learning feedback that evaluates the learner's performance without
providing standards of correctness in the form of behavioural targets.
A. Reinforcement learning
B. Supervised learning
C. None of these
Correct option is A

7. Features of Reinforcement Learning:
A. A set of problems rather than a set of techniques
B. RL is training by rewards and punishments
C. RL is learning from trial and error with the environment
D. All of these
Correct option is D

8. Which type of feedback is used by RL?
A. Purely instructive feedback
B. Purely evaluative feedback
C. Both A & B
D. None of these
Correct option is B

9. What is/are the problem solving methods for RL?


A. Dynamic programming
B. Monte Carlo Methods
C. Temporal-difference learning
D. All of these
Correct option is D

10. Which among the following is not a necessary feature of a reinforcement


learning solution to a learning problem?
A. exploration versus exploitation dilemma
B. trial and error approach to learning
C. learning based on rewards
D. representation of the problem as a Markov Decision Process
Correct option is D

11. Which of the following sentences is FALSE regarding reinforcement learning?
A. It relates inputs to outputs
B. It is used for prediction
C. It may be used for interpretation
D. It discovers causal relationships

Correct option is D

12. Consider the following modification to the tic-tac-toe game: at the end of the game, a coin
is tossed and the agent wins if a head appears, regardless of whatever has happened in the game.
Can reinforcement learning be used to learn an optimal policy for playing tic-tac-toe in this
case?
A. Yes
B. No
Correct option is B

13. Suppose the reinforcement learning player was greedy, that is, it always played the move that
brought it to the position that it rated the best. Might it learn to play better, or worse, than
a non-greedy player?
A. Worse
B. Better
Correct option is B

14. A chess agent trained using Reinforcement Learning can be trained by playing against a copy
of the same agent.
A. True
B. False
Correct option is A

15. A model that learns based on the rewards it received for its previous actions is
known as:
A. Supervised learning
B. Unsupervised learning
C. Reinforcement learning
D. Concept learning
Correct option is C

16. A genetic learning operation that creates new population elements by combining parts of
two or more existing elements.
a. selection
b. crossover
c. mutation
d. absorption
17. An evolutionary approach to data mining.
a. backpropagation learning
b. genetic learning
c. decision tree learning
d. linear regression

18. The computational complexity as well as the explanation offered by a genetic algorithm is largely
determined by the
a. fitness function
b. techniques used for crossover and mutation
c. training data
d. population of elements

19. This approach is best when we are interested in finding all possible interactions among a set of
attributes.
a. decision tree
b. association rules
c. K-Means algorithm
d. genetic learning
20. Genetic learning can be used to train a feed-forward network. This is accomplished by
having each population element represent one possible
a. network configuration of nodes and links.
b. set of training data to be fed through the network.
c. set of network output values.
d. set of network connection weights.

21. How does a Genetic Algorithm perform its search?

a) Search from general hypotheses to specific
b) Search from specific hypotheses to general
c) Generate successor hypotheses by repeatedly mutating and recombining parts of the best currently
known hypotheses
d) None of the above
22. Which hypothesis is considered the best hypothesis in a Genetic Algorithm?
a) A hypothesis that is specific in nature
b) A hypothesis that is general in nature
c) A hypothesis that optimizes a predefined numerical measure for the problem at hand, called the
hypothesis fitness
23. How are offspring created in two-point crossover in a Genetic Algorithm?
a) By combining bits sampled uniformly from the two parents
b) By randomly choosing n bits for transformation
c) Offspring are created by substituting intermediate segments of one parent into the middle of the
second parent string
d) None of the above
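Option (c), substituting the intermediate segment of one parent into the other, can be sketched in Python for bit strings (the two cut points are chosen at random; the helper is illustrative):

```python
import random

def two_point_crossover(parent_a, parent_b, rng=None):
    """Two-point crossover on equal-length bit strings: the segment of
    parent_a between two cut points is swapped with the corresponding
    segment of parent_b, producing two offspring."""
    rng = rng or random.Random()
    i, j = sorted(rng.sample(range(1, len(parent_a)), 2))
    child_1 = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_2 = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_1, child_2

rng = random.Random(42)
print(two_point_crossover("11111111", "00000000", rng))
```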

25. What does Reinforcement Learning involve?


a) Only an Agent
b) Only the Environment
c) Both Agent and Environment
d) None of the above

26. Which algorithm falls under Reinforcement Learning?


a) Decision Tree Learning
b) Q- Learning Algorithm
c) Both of the above
d) None of the above
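As an illustration of option (b), here is a minimal tabular Q-learning sketch on a toy 1-D corridor environment (the environment and the hyperparameters are made up for the example):

```python
import random

def q_learning(n_states=5, episodes=300, alpha=0.5, gamma=0.9, eps=0.3, seed=0):
    """Tabular Q-learning on a 1-D corridor: the agent starts in state 0,
    action 0 moves left, action 1 moves right, and reaching the last
    state ends the episode with reward 1."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # epsilon-greedy action selection
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                a = 0 if Q[s][0] >= Q[s][1] else 1
            s_next = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s_next == n_states - 1 else 0.0
            # Q-learning update: bootstrap from the best next-state value
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next
    return Q

Q = q_learning()
policy = ["right" if Q[s][1] > Q[s][0] else "left" for s in range(4)]
print(policy)
```

After training, the greedy policy should move right in every non-terminal state, since that is the shortest path to the reward.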
