UNIT-1: 1. What Is Machine Learning?
Answer: a
Explanation: Machine learning is the autonomous acquisition of knowledge through the use of
computer programs.
2. Which of the following is not a factor that affects the performance of a learner system?
b) Training scenario
c) Type of feedback
Answer: d
Explanation: Good data structures are not among the factors that affect the performance of a learner system.
a) Memorization
b) Analogy
c) Deduction
d) Introduction
Answer: d
a) Phonological
b) Syntactic
c) Empirical
d) Logical
Answer: c
Explanation: In language understanding, the levels of knowledge do not include empirical knowledge.
a) Language units
c) System constraints
d) Structural units
Answer: d
Explanation: A model of language consists of categories, which do not include structural units.
a) Begins by hypothesizing a sentence (the symbol S) and successively predicting lower level
constituents until individual preterminal symbols are written
b) Begins by hypothesizing a sentence (the symbol S) and successively predicting upper level
constituents until individual preterminal symbols are written
c) Begins by hypothesizing lower level constituents and successively predicting a sentence (the symbol
S)
d) Begins by hypothesizing upper level constituents and successively predicting a sentence (the symbol
S)
Answer: a
Explanation: A top-down parser begins by hypothesizing a sentence (the symbol S) and successively
predicting lower level constituents until individual preterminal symbols are written.
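The top-down prediction process described in this explanation can be sketched with a tiny recursive-descent parser (the grammar, lexicon, and sentence below are invented for illustration):

```python
# Minimal top-down parser sketch for a toy grammar. Parsing starts from the
# sentence symbol S and predicts lower-level constituents until preterminal
# symbols (here, word categories) are matched against the input.

GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "Noun"]],
    "VP": [["Verb", "NP"]],
}
LEXICON = {"the": "Det", "dog": "Noun", "cat": "Noun", "saw": "Verb"}

def parse(symbol, tokens, pos):
    """Try to expand `symbol` starting at tokens[pos]; return new pos or None."""
    if symbol in GRAMMAR:                       # nonterminal: predict downward
        for production in GRAMMAR[symbol]:
            p = pos
            for child in production:
                p = parse(child, tokens, p)
                if p is None:
                    break
            else:
                return p
        return None
    # preterminal: match the category of the next word
    if pos < len(tokens) and LEXICON.get(tokens[pos]) == symbol:
        return pos + 1
    return None

tokens = "the dog saw the cat".split()
print(parse("S", tokens, 0) == len(tokens))     # True: full sentence derived from S
```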
a) p
b) Øp V q
c) p → q
d) p → Øq
Answer: d
Answer: d
Explanation: The action ‘STACK(A,B)’ of a robot arm specifies placing block A on block B.
9. Choose the options that are correct regarding machine learning (ML) and artificial intelligence (AI).
Answer: (D)
11. Which of the following is a widely used and effective machine learning algorithm based on the
idea of bagging?
a) Decision Tree
b) Regression
c) Classification
d) Random Forest
Answer: d
Ans: a
13. Which of the following is not a factor that affects the performance of a learner system?
b) Training scenario
c) Type of feedback
Ans: d
a) Memorization
b) Analogy
c) Deduction
d) Introduction
Ans: d
15. In language understanding, which of the following is not a level of knowledge?
a) Phonological
b) Syntactic
c) Empirical
d) Logical
Ans: c
16. A model of language consists of categories; which of the following is not one of them?
a) Language units
c) System constraints
d) Structural units
Ans: d
17. What is a top-down parser?
a) Begins by hypothesizing a sentence (the symbol S) and successively predicting lower level
constituents until individual preterminal symbols are written
b) Begins by hypothesizing a sentence (the symbol S) and successively predicting upper level
constituents until individual preterminal symbols are written
c) Begins by hypothesizing lower level constituents and successively predicting a sentence (the symbol
S)
d) Begins by hypothesizing upper level constituents and successively predicting a sentence (the symbol
S)
Answer: a
a) p
b) Øp V q
c) p → q
d) p → Øq
Answer: d
Answer: d
20. Choose the options that are correct regarding machine learning (ML) and artificial intelligence (AI).
Answer: (D)
22. Which of the following is a widely used and effective machine learning algorithm based on the
idea of bagging?
a) Decision Tree
b) Regression
c) Classification
d) Random Forest
Answer: (D)
23. To find the minimum or the maximum of a function, we set the gradient to zero because:
Answer: (A)
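The reasoning behind answer (A), namely that the gradient vanishes at a minimum or maximum, can be checked numerically on an invented quadratic:

```python
# Numeric illustration (the function f is an invented example): for
# f(x) = (x - 3)**2 + 1 the derivative f'(x) = 2(x - 3) vanishes exactly
# at the minimizer x = 3, which is why gradient-based methods look for
# points where the gradient is zero.

def f(x):
    return (x - 3) ** 2 + 1

def df(x):                      # analytic derivative of f
    return 2 * (x - 3)

x_star = 3.0
print(df(x_star))               # 0.0: the gradient vanishes at the minimum
print(all(f(x_star) <= f(x_star + h) for h in (-1.0, -0.1, 0.1, 1.0)))  # True
```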
24. The most widely used metrics and tools to assess a classification model are:
A) Confusion matrix
B) Cost-sensitive accuracy
Answer: (D)
C) Both A and B
Answer: (C)
A) Factor analysis
Answer (C)
27) How do you handle missing or corrupted data in a dataset?
B) To judge how the trained model performs outside the sample on test data
C) Both A and B
Answer (C)
A) To remove stationarity
C) Both A and B
30) When performing regression or classification, which of the following is the correct way to
preprocess the data?
Answer (D)
32. Which combines inductive methods with the power of first-order representations?
a) Inductive programming
b) Logic programming
d) Lisp programming
Answer: c
Explanation: Inductive logic programming (ILP) combines inductive methods with the power of first-order representations.
33. How many reasons are available for the popularity of ILP?
a) 1
b) 2
c) 3
d) 4
Answer: c
Explanation: The three reasons for the popularity of ILP are general knowledge, complete algorithms, and hypotheses.
c) Agents
Answer: b
a) First-order logic
b) Propositional logic
c) ILP
Answer: a
36. Which produces hypotheses that are easy to read for humans?
a) ILP
b) Artificial intelligence
c) Propositional logic
d) First-order logic
Answer: a
Explanation: ILP can participate in the scientific cycle of experimentation, so it can produce hypotheses with a flexible, human-readable structure.
b) Entailment constraint
Answer: b
Explanation: The objective of an ILP is to come up with a set of sentences for the hypothesis such that
the entailment constraint is satisfied.
38. How many literals are available in top-down inductive learning methods?
a) 1
b) 2
c) 3
d) 4
Answer: c
a) Inverse resolution
b) Resolution
c) Trilogy
Answer: a
a) Literal system
b) Variable-based system
c) Attribute-based system
Answer: c
Explanation: ILP methods can learn relational knowledge that is not expressible in attribute-based systems.
41. Which approach is used for refining a very general rule through ILP?
a) Top-down approach
b) Bottom-up approach
Answer: a
42. The characteristic of a computer system capable of thinking, reasoning and learning is known as:
a. machine intelligence
b. human intelligence
c. artificial intelligence
d. virtual intelligence
Answer: (c).
43. What is the term used for describing the judgmental or common-sense part of problem solving?
a. Heuristic
b. Critical
c. Value based
d. Analytical
Answer: (a).
44. Which kind of planning consists of successive representations of different levels of a plan?
a. hierarchical planning
b. non-hierarchical planning
c. project planning
Answer: (a).
45. What was originally called the "imitation game" by its creator?
b. LISP
d. Cybernetics
Answer: (a).
46. An AI technique that allows computers to understand associations and relationships between
objects and events is called:
a. heuristic processing
b. cognitive science
c. relative symbolism
d. pattern matching
Answer: (d).
47. The field that investigates the mechanics of human intelligence is:
a. history
b. cognitive science
c. psychology
d. sociology
Answer: (b).
48. What is the name of the computer program that simulates the thought processes of human
beings?
a. Human logic
b. Expert reason
c. Expert system
d. Personal information
Answer: (c).
49. What is the name of the computer program that contains the distilled knowledge of an expert?
d. Artificial intelligence
Answer: (c).
a. clearer characters
b. graphics
c. more characters
Answer: (d).
UNIT – 1
4. Regarding bias, which of the following statements is true? (Here ‘high’ and ‘low’ are relative to the ideal model.)
(a) Models which overfit have a high bias
(b) Models which overfit have a low bias
(c) Models which underfit have a high variance
(d) None of the above
5. A feature F1 can take the values A, B, C, D, E, and F, and represents the grade of students from a college. Which of the following statements is true in this case?
(a) Feature F1 is an example of nominal variable
(b) Feature F1 is an example of ordinal variable
(c) It doesn’t belong to any of the above category
(d) Both of these
6. Suppose your model is overfitting. Which of the following is NOT a valid way to try and reduce
the overfitting?
(a) Increase the amount of training data
(b) Improve the optimization algorithm being used for error minimization
(c) Decrease the model complexity
(d) Reduce the noise in the training data
7. Which of the following statement(s) is / are true for Gradient Descent (GD) and Stochastic Gradient Descent (SGD)?
a) In GD and SGD, you update a set of parameters in an iterative manner to minimize the error
function
b) In SGD, you have to run through all the samples in your training set for a single update of a parameter
in each iteration
c) In GD, you either use the entire data or a subset of training data to update a parameter in each iteration
d) None of the above
8. Which of the following hyperparameter(s), when increased, may cause the random forest to overfit the data?
a) Number of Trees
b) Depth of Tree
c) Learning Rate
d) None of the above
9. What kind of learning algorithm is used for “Future stock prices or currency exchange rates”?
a) Recognizing Anomalies
b) Prediction
c) Generating Patterns
d) Recognition Patterns
11. The type of Training Experience available can have a significant impact on the success or failure of the learner.
a) The above statement is TRUE
b) The above statement is FALSE
c) Cannot say
d) None of the above
12. “Problem of searching through a predefined space of potential hypotheses for the hypothesis that best fits the training examples” is termed as:
a) Target Learning
b) Concept Learning
c) Unsupervised Learning
d) Supervised Learning
13. If you ask your friend to bring you a pizza, this is an example of:
a) Specific Hypothesis
b) General Hypothesis
c) Both of the above
d) None of the above
14. Imagine, you are solving a classification problem with highly imbalanced class. The majority
class is observed 99% of times in the training data. Your model has 99% accuracy after taking the
predictions on test data. Which of the following is true in such a case?
(1) Accuracy metric is not a good idea for imbalanced class problems
(2) Accuracy metric is a good idea for imbalanced class problems
(3) Precision and recall metrics are good for imbalanced class problems
(4) Precision and recall metrics aren’t good for imbalanced class problem
a) Option 1 and 3
b) Option 1 and 4
c) Option 2 and 4
d) Only Option 4
15. Consider two hypotheses as follows. Which of the following hypotheses is more general?
20. The intermediate space between the General and Specific hypotheses in the Candidate Elimination algorithm is known as:
a) Version Space
b) Candidate Space
c) Risk Space
d) None of the above
21. Candidate Elimination Algorithm works well with both positive and negative examples.
a) True
b) False
c) Only Positive Examples
d) Only Negative Examples
22. For a positive example in the Candidate Elimination algorithm, what is the general trend?
a) We tend to make specific hypothesis more general
b) We tend to make general hypothesis more specific
c) Both of the above
d) None of the above
23. The obvious solution for assuring that the target concept is in the hypothesis space H is to choose H such that:
a) It is capable of representing every possible subset of the instances X
b) It is capable of representing every set of instances X
c) Both of the above
d) None of the above
25. The number of distinct subsets that can be defined over a set X containing |X| elements is:
a) 2^|X|
b) 2x
c) 4x
d) None of the above
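The answer to Q25, that a set with |X| elements has 2^|X| distinct subsets, can be verified by enumeration (the concrete three-element set below is an arbitrary example):

```python
# Verify that a set with |X| elements has 2**|X| distinct subsets by
# enumerating all of them with itertools.
from itertools import combinations

X = {"a", "b", "c"}
subsets = [frozenset(c) for r in range(len(X) + 1)
           for c in combinations(X, r)]

print(len(subsets))                 # 8
print(len(subsets) == 2 ** len(X))  # True
```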
28. What happens when a model learns the detail and noise in the training data to the extent that it
negatively impacts the performance of the model on new data?
a) Underfitting of the model
b) Overfitting of the model
c) Variance of a model
d) None of the above
29. For what problems are ANN or Neural networks suitable to use?
a) Problems in which training data corresponds to noisy and complex sensor data
b) Problems in which training data is a labeled dataset
c) Both of the above
d) None of the above
30. What is the output released by a Perceptron in a Neural network when the result is greater than some threshold value?
a) 1
b) -1
c) 0
d) None of the above
31. If the training examples are not linearly separable, which rule will provide a best-fit approximation to the target concept?
a) Perceptron Rule
b) Delta Rule
c) Both of the above
d) None of the above
32. The hypothesis search space for the Backpropagation algorithm consists of:
a) Continuous Representations
b) Discrete Representations
c) Both of the above
d) None of the above
33. The ________ error of a hypothesis with respect to some sample S of instances drawn from X is the fraction of S that it misclassifies.
a) True error
b) Sample error
c) Mean Square error
d) None of the above
40. Choose the correct option regarding machine learning (ML) and artificial intelligence (AI)
a) ML is a set of techniques that turns a dataset into a software
b) AI is a software that can emulate the human mind
c) ML is an alternate way of programming intelligent machines
d) All of the above
41. Which of the following is not a factor that affects the performance of the learner system?
a) Good data structures
b) Representation scheme used
c) Training scenario
d) Type of feedback
42. In general, to have a well-defined learning problem, we must identify which of the following?
a) The class of tasks
b) The measure of performance to be improved
c) The source of experience
d) All of the above
44. Which of the following is not one of the different learning methods?
a) Analogy
b) Introduction
c) Memorization
d) Deduction
45. In language understanding, which of the following is not a level of knowledge?
a) Empirical
b) Logical
c) Phonological
d) Syntactic
47. Concept learning infers a ________-valued function from training examples of its input and output.
a) Decimal
b) Hexadecimal
c) Boolean
d) All of the above
50. What kind of learning algorithm is used for “Facial identities or facial expressions”?
a) Prediction
b) Recognition Patterns
c) Generating Patterns
d) Recognizing Anomalies
52. Real-Time Decisions, Game AI, Learning Tasks, Skill Acquisition, and Robot Navigation are applications of which of the following?
a) Supervised Learning: Classification
b) Reinforcement Learning
c) Unsupervised Learning: Clustering
d) Unsupervised Learning: Regression
53. Targeted Marketing, Recommender Systems, and Customer Segmentation are applications of which of the following?
a) Supervised Learning: Classification
b) Unsupervised Learning: Clustering
c) Unsupervised Learning: Regression
d) Reinforcement Learning
54. Fraud Detection, Image Classification, Diagnostics, and Customer Retention are applications of which of the following?
a) Unsupervised Learning: Regression
b) Supervised Learning: Classification
c) Unsupervised Learning: Clustering
d) Reinforcement Learning
55. Which of the following is not a numerical function in the various function representations of Machine Learning?
a) Neural Network
b) Support Vector Machines
c) Case-based
d) Linear Regression
56. The FIND-S Algorithm starts from the most specific hypothesis and generalizes it by considering only ________ examples.
a) Negative
b) Positive
c) Negative or Positive
d) None of the above
59. Inductive learning is based on the knowledge that if something happens a lot, it is likely to be generally true.
a) True
b) False
60. Inductive learning takes examples and generalizes rather than starting with ________ knowledge.
a) Inductive
b) Existing
c) Deductive
d) None of these
61. A drawback of FIND-S is that it assumes consistency within the training set.
a) True
b) False
62. The Candidate-Elimination Algorithm
a) The key idea in the Candidate-Elimination algorithm is to output a description of the set of all hypotheses consistent with the training examples
b) The Candidate-Elimination algorithm computes the description of this set without explicitly enumerating all of its members
c) This is accomplished by using the more-general-than partial ordering and maintaining a compact representation of the set of consistent hypotheses
d) All of these
63. Concept learning is basically acquiring the definition of a general category from given sample positive and negative training examples of the category.
a) TRUE
b) FALSE
64. The hypothesis h1 is more-general-than hypothesis h2 ( h1 > h2) if and only if h1≥h2 is true
and h2≥h1 is false. We also say h2 is more-specific-than h1
a) The statement is true
b) The statement is false
c) We cannot say
d) None of these
c) to develop learning algorithm for multilayer feedforward neural network, so that network can be
trained to capture the mapping implicitly
Answer: c
a) yes
b) no
Answer: a
Answer: d
a) yes
b) no
Answer: b
b) actual output is determined by computing the outputs of units for each hidden layer
c) hidden layers output is not all important, they are only meant for supporting input and output
layers
Answer: b
Explanation: In backpropagation rule, actual output is determined by computing the outputs of units
for each hidden layer.
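The layer-by-layer forward computation described in this explanation can be sketched minimally (the weights, input values, and the sigmoid choice are illustrative assumptions, not taken from the question):

```python
# Minimal forward-pass sketch: the network's actual output is obtained by
# computing each hidden layer's outputs in turn, then feeding them to the
# output layer. All weights and the input below are invented illustration values.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights):
    """One fully connected layer: `weights` is a list of per-unit weight rows."""
    return [sigmoid(sum(w * i for w, i in zip(row, inputs))) for row in weights]

x = [1.0, 0.5]
W_hidden = [[0.4, -0.2], [0.3, 0.8]]   # 2 hidden units
W_output = [[1.0, -1.0]]               # 1 output unit

hidden = layer(x, W_hidden)            # hidden-layer outputs are computed first...
output = layer(hidden, W_output)       # ...then the output layer uses them
print(len(hidden), len(output))        # 2 1
```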
b) because delta is applied to only input and output layers, thus making it more simple and
generalized
c) it has no significance
Answer: a
Explanation: The term generalized is used because delta rule could be extended to hidden layer units.
b) slow convergence
c) scaling
d) all of the mentioned
Answer: d
8. What are the general tasks that are performed with backpropagation algorithm?
a) pattern mapping
b) function approximation
c) prediction
Answer: d
Explanation: These all are the tasks that can be performed with backpropagation algorithm in
general.
a) yes
b) no
c) cannot be said
Answer: a
Explanation: Weight adjustment is proportional to negative gradient of error with respect to weight.
Answer: c
Explanation: If average gradient value falls below a preset threshold value, the process may be
stopped.
11. A _________ is a decision support tool that uses a tree-like graph or model of decisions and
their possible consequences, including chance event outcomes, resource costs, and utility.
a) Decision tree
b) Graphs
c) Trees
d) Neural Networks
Answer: a
a) True
b) False
Answer: a
a) Flow-Chart
b) Structure in which internal node represents test on an attribute, each branch represents outcome
of test and each leaf node represents class label
c) Flow-Chart & Structure in which internal node represents test on an attribute, each branch
represents outcome of test and each leaf node represents class label
Answer: c
b) False
Answer: a
15. Which of the following are Decision Tree nodes?
a) Decision Nodes
b) End Nodes
c) Chance Nodes
Answer: d
a) Disks
b) Squares
c) Circles
d) Triangles
Answer: b
a) Disks
b) Squares
c) Circles
d) Triangles
Answer: c
18. End Nodes are represented by __________
a) Disks
b) Squares
c) Circles
d) Triangles
Answer: d
c) Worst, best and expected values can be determined for different scenarios
Answer: d
20. Which of the following statement(s) is / are true for Gradient Descent (GD) and Stochastic Gradient Descent (SGD)?
1. In GD and SGD, you update a set of parameters in an iterative manner to minimize the error function.
2. In SGD, you have to run through all the samples in your training set for a single update of a parameter in each iteration.
3. In GD, you either use the entire data or a subset of training data to update a parameter in each iteration.
a) Only 1
b) Only 2
c) Only 3
d) 1 and 2
e) 2 and 3
f) 1,2 and 3
Answer: a
Explanation: In SGD, each iteration uses a batch that generally contains a random sample of the data, whereas in GD each iteration uses all of the training observations.
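The distinction in this explanation can be sketched on a toy one-parameter least-squares problem (the data, learning rate, and iteration counts are invented for illustration):

```python
# Sketch of the GD vs SGD distinction: gradient descent (GD) uses every
# training observation for each parameter update, while stochastic gradient
# descent (SGD) updates from a single randomly chosen sample.
import random

data = [(x, 2.0 * x) for x in range(1, 6)]   # y = 2x, so the true weight is 2
lr = 0.02

def grad(w, batch):
    """Gradient of mean squared error for y_hat = w * x over `batch`."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w_gd = 0.0
for _ in range(200):                          # GD: full dataset every step
    w_gd -= lr * grad(w_gd, data)

random.seed(0)
w_sgd = 0.0
for _ in range(200):                          # SGD: one random sample per step
    w_sgd -= lr * grad(w_sgd, [random.choice(data)])

print(round(w_gd, 2), round(w_sgd, 2))        # both approach the true weight 2.0
```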
21. Below are the 8 actual values of the target variable in the train file.
[0,0,0,1,1,1,1,1]
Answer: a
22. A 3-input neuron is trained to output a zero when the input is 110 and a one when the input is
111. After generalization, the output will be zero when and only when the input is?
a) 000 or 110 or 011 or 101
b) 010 or 100 or 110 or 101
c) 000 or 010 or 110 or 100
d) 100 or 111 or 101 or 001
Answer: c
Answer: a
Explanation: The perceptron is a single layer feed-forward neural network. It is not an auto-
associative network because it has no feedback and is not a multiple layer neural network because
the pre-processing stage is not made of neurons.
24. What is an auto-associative network?
a) a neural network that contains no loops
b) a neural network that contains feedback
c) a neural network that has only one loop
d) a single layer feed-forward neural network with pre-processing
Answer: b
Explanation: An auto-associative network is equivalent to a neural network that contains feedback.
The number of feedback paths(loops) does not have to be one.
25. A 4-input neuron has weights 1, 2, 3 and 4. The transfer function is linear with the constant of
proportionality being equal to 2. The inputs are 4, 10, 5 and 20 respectively. What will be the
output?
a) 238
b) 76
c) 119
d) 123
Answer: a
Explanation: The output is found by multiplying the weights by their respective inputs, summing the results, and applying the linear transfer function (scaling by the constant of proportionality). Therefore:
Output = 2 * (1*4 + 2*10 + 3*5 + 4*20) = 238.
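The worked computation above, expressed directly in code (values taken from the question):

```python
# Linear unit from Q25: proportionality constant 2, weights (1, 2, 3, 4)
# and inputs (4, 10, 5, 20).

weights = [1, 2, 3, 4]
inputs = [4, 10, 5, 20]
k = 2                                     # constant of proportionality

weighted_sum = sum(w * x for w, x in zip(weights, inputs))
output = k * weighted_sum
print(output)                             # 238
```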
Answer: a
Explanation: Neural networks have higher computational rates than conventional computers because
a lot of the operation is done in parallel. That is not the case when the neural network is simulated on
a computer. The idea behind neural nets is based on the way the human brain works. Neural nets
cannot be programmed, they can only learn by examples.
Answer: c
Explanation: The training time depends on the size of the network; the greater the number of neurons, the greater the number of possible ‘states’. Neural networks can be simulated on a
conventional computer but the main advantage of neural networks – parallel execution – is lost.
Artificial neurons are not identical in operation to the biological ones.
28. What are the advantages of neural networks over conventional computers?
(i) They have the ability to learn by example
(ii) They are more fault tolerant
(iii)They are more suited for real time operation due to their high ‘computational’ rates
a) (i) and (ii) are true
b) (i) and (iii) are true
c) Only (i)
d) All of the mentioned
Answer: d
Explanation: Neural networks learn by example. They are more fault tolerant because they are always
able to respond and small changes in input do not normally cause a change in output. Because of
their parallel architecture, high computational rates are achieved.
Answer: a
Explanation: Pattern recognition is what single layer neural networks are best at but they don’t have
the ability to find the parity of a picture or to determine whether two shapes are connected or not.
Answer: b
32. Why is the XOR problem exceptionally interesting to neural network researchers?
a) Because it can be expressed in a way that allows you to use a neural network
b) Because it is a complex binary operation that cannot be solved using neural networks
c) Because it can be solved by a single layer perceptron
d) Because it is the simplest linearly inseparable problem that exists.
Answer: d
Answer: c
Explanation: Back propagation is the transmission of error back through the network to allow weights
to be adjusted so that the network can learn.
34. Why are linearly separable problems of interest to neural network researchers?
a) Because they are the only class of problem that network can solve successfully
b) Because they are the only class of problem that Perceptron can solve successfully
c) Because they are the only mathematical functions that are continuous
d) Because they are the only mathematical functions you can draw
Answer: b
Explanation: Linearly separable problems are of interest to neural network researchers because they are the only class of problem that a Perceptron can solve successfully.
35. Which of the following is not the promise of artificial neural network?
a) It can explain result
b) It can survive the failure of some nodes
c) It has inherent parallelism
d) It can handle noise
Answer: a
Explanation: An artificial neural network (ANN) cannot explain its results.
Answer: a
Explanation: Neural networks are complex linear functions with many parameters.
37. A perceptron adds up all the weighted inputs it receives, and if it exceeds a certain value, it
outputs a 1, otherwise it just outputs a 0.
a) True
b) False
c) Sometimes – it can also output intermediate values as well
d) Can’t say
Answer: a
Explanation: Yes the perceptron works like that.
38. What is the name of the function in the following statement “A perceptron adds up all the
weighted inputs it receives, and if it exceeds a certain value, it outputs a 1, otherwise it just
outputs a 0”?
a) Step function
b) Heaviside function
c) Logistic function
d) Perceptron function
Answer: b
Explanation: Also known as the step function, so option (a) is also right. It is a hard thresholding function, either on or off with no in-between.
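The hard-thresholding behaviour described here can be sketched as follows (the weights and threshold are invented; as a usage example they realise an AND gate):

```python
# A perceptron sums its weighted inputs and applies a Heaviside (step)
# function, so the output is 0 or 1 with no in-between.

def heaviside(z, threshold=0.0):
    return 1 if z > threshold else 0

def perceptron(inputs, weights, threshold):
    return heaviside(sum(w * x for w, x in zip(weights, inputs)), threshold)

# AND gate realised with one perceptron: it fires only when both inputs are 1.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", perceptron([a, b], [1.0, 1.0], 1.5))
```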
39. Having multiple perceptrons can actually solve the XOR problem satisfactorily: this is because
each perceptron can partition off a linear part of the space itself, and they can then combine their
results.
a) True – this works always, and these multiple perceptrons learn to classify even complex problems
b) False – perceptrons are mathematically incapable of solving linearly inseparable functions, no
matter what you do
c) True – perceptrons can do this but are unable to learn to do it – they have to be explicitly hand-
coded
d) False – just having a single perceptron is enough
Answer: c
40. The network that involves backward links from output to the input and hidden layers is called
_________
a) Self organizing maps
b) Perceptrons
c) Recurrent neural network
d) Multi layered perceptron
Answer: c
Explanation: RNN (Recurrent neural network) topology involves backward links from output to the
input and hidden layers.
Answer: d
Explanation: All mentioned options are applications of Neural Network.
a) Gini index
b) Information gain
c) Entropy
d) Scatter
Answer: d
45. Which of the following is a valid logical rule for the decision tree below?
b) IF Business Appointment = Yes & Temp above 70 = Yes THEN Decision = wear shorts
Answer: d
Answer: b
47. For questions 47(a) to 47(e), consider the following small data table for two classes of woods. Using information gain, construct a decision tree to classify the data set. Answer the following questions for the resulting tree.
47.(a)Which attribute would information gain choose as the root of the tree?
a) Density
b) Grain
c) Hardness
47.(b) What class does the tree infer for the example {Density=Light, Grain=Small,
Hardness=Hard}?
a) Oak
b) Pine
c) The example cannot be classified
d) Both classes are equally likely
Answer: b
47.(c) What class does the tree infer for the example {Density=Light, Grain=Small, Hardness=Soft}?
a) Oak
b) Pine
c) The example cannot be classified
d) Both classes are equally likely
Answer: a
47.(d) What class does the tree infer for the example {Density=Heavy, Grain=Small,
Hardness=Soft}?
a) Oak
b) Pine
c) The example cannot be classified
d) Both classes are equally likely
Answer: b
47.(e) What class does the tree infer for the example {Density=Heavy, Grain=Small,
Hardness=Hard}?
a) Oak
b) Pine
c) The example cannot be classified
d) Both classes are equally likely
Answer: a
49. A perceptron can correctly classify instances into two classes where the classes are:
a) Overlapping
b) Linearly separable
c) Non-linearly separable
d) None of the above
Answer: b
Explanation: Perceptron is a linear classifier.
50. The logic function that cannot be implemented by a perceptron having two inputs is?
a) AND
b) OR
c) NOR
d) XOR
Answer: d
Explanation: XOR is not linearly separable.
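The claim that XOR is not linearly separable can be checked empirically: the perceptron learning rule finds weights for AND but never fits XOR (the learning rate, epoch count, and initial weights below are invented illustration choices):

```python
# Perceptron learning rule on two Boolean functions: it converges on the
# linearly separable AND function but cannot fit XOR with any weights.

def train_perceptron(samples, epochs=50, lr=0.1):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, target in samples:
            out = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            err = target - out
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return lambda x: 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_model = train_perceptron(AND)
xor_model = train_perceptron(XOR)
print(all(and_model(x) == t for x, t in AND))      # True: AND is separable
print(all(xor_model(x) == t for x, t in XOR))      # False: XOR is not
```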
UNIT - 2
1. Which of the following hyperparameter(s), when increased, may cause the random forest to overfit the data?
a) Number of Trees
b) Depth of Tree
c) Learning Rate
d) None of the above
9. Which property measures how well a given attribute separates the training examples according to their target classification?
a) Information Gain
b) Entropy
c) Gini Index
d) None of the above
10. The information gain, Gain(S, A), of an attribute A relative to a collection of examples S is defined as:
a)
b)
c)
d) None of the above
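For reference, the standard definition being asked about is Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v). A minimal sketch computing it on an invented toy dataset:

```python
# Information gain of attribute A over example set S: the entropy of S
# minus the weighted entropy of each partition S_v induced by A's values.
# The tiny dataset below is an invented illustration.
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, label="label"):
    """Gain(S, A): entropy of S minus the weighted entropy of each partition."""
    labels = [r[label] for r in rows]
    gain = entropy(labels)
    for value in {r[attr] for r in rows}:
        subset = [r[label] for r in rows if r[attr] == value]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain

S = [{"wind": "weak", "label": "yes"}, {"wind": "weak", "label": "yes"},
     {"wind": "strong", "label": "no"}, {"wind": "strong", "label": "yes"}]
print(round(information_gain(S, "wind"), 3))
```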
13. The strategy where we keep designing the decision tree while keeping an eye on overfitting is termed as:
a) Pre pruning
b) Post pruning
c) Middle pruning
d) None of the above
14. What are the advantages of converting decision trees to rules before pruning?
15. Which of the following best describes the formula for Split Information?
b)
a) Entropy
b) Information Gain
c) Split Information
a)
18. For what problems are ANN or Neural networks suitable to use?
a) Problems in which training data corresponds to noisy and complex sensor data
19. What is the output released by a Perceptron in a Neural network when the result is greater than some threshold value?
a) 1
b) -1
c) 0
20. If the training examples are not linearly separable, which rule will provide a best-fit approximation to the target concept?
a) Perceptron Rule
b) Delta Rule
a) To approximate this gradient descent search by updating weights incrementally, following the
calculation of the error for each individual example.
22. Which rule is used to minimize the squared error between network output values and the target
values for this output?
a) Delta rule
23. The hypothesis search space for the Backpropagation algorithm consists of:
a) Continuous Representations
b) Discrete Representations
25. Assume that we have a dataset containing information about 200 individuals. One
hundred of these individuals have purchased life insurance. A supervised data mining
session has discovered the following rule:
How many individuals in the class life insurance = no have credit card insurance and are less than 30 years old?
a. 63
b. 70
c. 30
d. 27
26. Which statement is true about neural network and linear regression models?
a. Both models require input attributes to be numeric.
b. Both models require numeric attributes to range between 0 and 1.
c. The output of both models is a categorical attribute value.
d. Both techniques build models whose output is determined by a linear sum of weighted input
attribute values.
e. More than one of a,b,c or d is true.
29. Which statement is true about the decision tree attribute selection process described in
your book?
a. A categorical attribute may appear in a tree node several times but a numeric attribute may appear
at most once.
b. A numeric attribute may appear in several tree nodes but a categorical attribute may appear at
most once.
c. Both numeric and categorical attributes may appear in several tree nodes.
d. Numeric and categorical attributes may appear in at most one tree node.
30. Given a rule of the form IF X THEN Y, rule confidence is defined as the conditional
probability that
a. Y is true when X is known to be true.
b. X is true when Y is known to be true.
c. Y is false when X is known to be false.
d. X is false when Y is known to be false.
32. One two-item set rule that can be generated from the tables above is:
a. 5/7
b. 5 / 12
c. 7 / 12
d. 1
33. Based on the two-item set table, which of the following is not a possible two-item set
rule?
a. IF Life Ins Promo = Yes THEN Magazine Promo = Yes
b. IF Watch Promo = No THEN Magazine Promo = Yes
c. IF Card Insurance = No THEN Magazine Promo = Yes
d. IF Life Ins Promo = No THEN Card Insurance = No
37. Neural network training is accomplished by repeatedly passing the training data through the
network while
a. individual network weights are modified.
b. training instance attribute values are modified.
c. the ordering of the training instances is modified.
d. individual network nodes have the coefficients on their corresponding functional parameters
modified.
38. Genetic learning can be used to train a feed-forward network. This is accomplished by having each
population element represent one possible
a. network configuration of nodes and links.
b. set of training data to be fed through the network.
c. set of network output values.
d. set of network connection weights.
39. With a Kohonen network, the output layer node that wins an input instance is rewarded by having
a. a higher probability of winning the next training instance to be presented.
b. its connection weights modified to more closely match those of the input instance.
c. its connection weights modified to more closely match those of its neighbors.
d. neighboring connection weights modified to become less similar to its own connection weights.
41. This neural network explanation technique is used to determine the relative importance of
individual input attributes.
a. sensitivity analysis
b. average member technique
c. mean squared error analysis
d. absolute average technique
42. Which one of the following is not a major strength of the neural network approach?
a. Neural networks work well with datasets containing noisy data.
b. Neural networks can be used for both supervised learning and unsupervised clustering.
c. Neural network learning algorithms are guaranteed to converge to an optimal solution.
d. Neural networks can be used for applications that require a time element to be included in the data.
43. During backpropagation training, the purpose of the delta rule is to make weight adjustments so as
to
a. minimize the number of times the training data must pass through the network.
b. minimize the number of times the test data must pass through the network.
c. minimize the sum of absolute differences between computed and actual outputs.
d. minimize the sum of squared error differences between computed and actual output.
46. The test set accuracy of a backpropagation neural network can often be improved by
a. increasing the number of epochs used to train the network.
b. decreasing the number of hidden layer nodes.
c. increasing the learning rate.
d. decreasing the number of hidden layers.
47. This type of supervised network architecture does not contain a hidden layer.
a. backpropagation
b. perceptron
c. self-organizing map
d. genetic
48. The total delta measures the total absolute change in network connection weights for each pass of
the training data through a neural network. This value is most often used to determine the
convergence of a
a. perceptron network.
b. feed-forward network.
c. backpropagation network.
d. self-organizing network.
49. What strategies can help reduce overfitting in decision trees?
(i) Enforce a maximum depth for the tree
(ii) Enforce a minimum number of samples in leaf nodes
(iii) Pruning
(iv) Make sure each leaf node is one pure class
A. All
B. (i), (ii) and (iii)
C. (i), (iii), (iv)
D. None
Correct option is B
50. Which of the following is a widely used and effective machine learning algorithm
based on the idea of bagging?
A. Decision Tree
B. Random Forest
C. Regression
D. Classification
Correct option is B
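As a rough illustration of bagging (the idea behind Random Forest), the sketch below bootstrap-samples a toy dataset, trains a trivial threshold "stump" on each sample, and aggregates predictions by majority vote. The dataset and the stump learner are invented for illustration, not taken from any library:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw a sample of the same size with replacement (the 'bagging' step)."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Toy base learner: threshold at the mean of x, majority label on each side."""
    thresh = sum(x for x, _ in sample) / len(sample)
    left = [y for x, y in sample if x <= thresh]
    right = [y for x, y in sample if x > thresh]
    left_lbl = Counter(left).most_common(1)[0][0] if left else 0
    right_lbl = Counter(right).most_common(1)[0][0] if right else left_lbl
    return lambda x: left_lbl if x <= thresh else right_lbl

def bagged_predict(models, x):
    """Aggregate by majority vote, as a Random Forest does with its trees."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

rng = random.Random(0)
data = [(1, 0), (2, 0), (3, 0), (7, 1), (8, 1), (9, 1)]  # (x, label) toy data
models = [train_stump(bootstrap_sample(data, rng)) for _ in range(25)]
print(bagged_predict(models, 2), bagged_predict(models, 8))
```

Each stump alone is weak, but the vote over many bootstrap-trained stumps is more stable, which is the point of bagging.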
51. To find the minimum or the maximum of a function, we set the gradient to zero because:
A. Depends on the type of problem
B. The value of the gradient at extrema of a function is always zero
C. Both (A) and (B)
D. None of these
Correct option is B
55. What are the advantages of neural networks over conventional computers?
(i) They have the ability to learn by example
(ii) They are more fault tolerant
(iii) They are more suited for real-time operation due to their high computational rates
A. (i) and (ii)
B. (i) and (iii)
C. Only (i)
D. All
E. None
Correct option is D
65. A 3-input neuron has weights 1, 4 and 3. The transfer function is linear with the
constant of proportionality being equal to 3. The inputs are 4, 8 and 5 respectively.
What will be the output?
A. 139
B. 153
C. 162
D. 160
Correct option is B
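A one-line check of the arithmetic behind this answer:

```python
# Weighted sum: 1*4 + 4*8 + 3*5 = 51; linear transfer with constant 3 gives 3*51 = 153.
weights = [1, 4, 3]
inputs = [4, 8, 5]
k = 3  # constant of proportionality of the linear transfer function
net = sum(w * x for w, x in zip(weights, inputs))
output = k * net
print(output)  # 153
```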
71. The general tasks that are performed with backpropagation algorithm
A. Pattern mapping
B. Prediction
C. Function approximation
D. All of the above
Correct option is D
72. Backpropagation learning is based on gradient descent along the error surface.
A. True
B. False
Correct option is A
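As a hedged sketch of what questions 43 and 72 describe, the delta rule below performs stochastic gradient descent on the squared error of a single linear unit. The training data (generated from t = 2x + 1), learning rate, and epoch count are illustrative assumptions:

```python
# Delta rule for one linear unit: w <- w + eta * (target - output) * x,
# which is gradient descent on the squared-error surface E = 1/2 * (t - o)^2.
def train(samples, eta=0.05, epochs=500):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, t in samples:
            o = w * x + b          # computed output
            err = t - o            # difference between actual and computed output
            w += eta * err * x     # delta-rule weight update
            b += eta * err         # bias update
    return w, b

# Hypothetical data generated by t = 2x + 1
samples = [(0, 1), (1, 3), (2, 5), (3, 7)]
w, b = train(samples)
print(round(w, 2), round(b, 2))  # converges near w = 2, b = 1
```

Repeated passes shrink the squared error, which is exactly the behavior the delta rule is designed to produce.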
Answer: c
Explanation: The three required terms are a conditional probability and two unconditional probabilities.
Answer: d
Explanation: Bayes rule can be used to answer the probabilistic queries conditioned on one piece of
evidence
Answer: a
Explanation: A Bayesian network provides a complete description of the domain.
5. How can the entries in the full joint probability distribution be calculated?
a) Using variables
b) Using information
c) Both Using variables & information
d) None of the mentioned
Answer: b
Explanation: Every entry in the full joint probability distribution can be calculated from the information
in the network.
Answer: b
Explanation: If a Bayesian network is a representation of the joint distribution, then it can solve any
query, by summing all the relevant joint entries.
Answer: a
Explanation: The compactness of the Bayesian network is an example of a very general property of a
locally structured system.
Answer: c
Explanation: Local structure is usually associated with linear rather than exponential growth in
complexity.
9. Which condition is used to influence a variable directly by all the others?
a) Partially connected
b) Fully connected
c) Local connected
d) None of the mentioned
Answer: b
10. What is the consequence between a node and its predecessors while creating a Bayesian network?
a) Functionally dependent
b) Dependent
c) Conditionally independent
d) Both conditionally dependent & dependent
Answer: c
Explanation: The semantics used to derive a method for constructing Bayesian networks lead to the
consequence that a node can be conditionally independent of its predecessors.
13. Suppose we would like to convert a nominal attribute X with 4 values to a data table with only
binary variables. How many new attributes are needed?
a) 1
b) 2
c) 4
d) 8
e) 16
Answer: C
14. In a medical application domain, suppose we build a classifier for patient screening (True means
patient has cancer). Suppose that the confusion matrix is from testing the classifier on some test data.
                 Predicted
               TRUE    FALSE
Actual  TRUE    TP      FN
        FALSE   FP      TN
Which of the following situations would you like your classifier to have?
A. FP >> FN
B. FN >> FP
C. FN = FP × TP
D. TN >> FP
E. FN × TP >> FP × TN
F. All of the above
Answer: A (when FN is small, we can ensure that true cancer patients are rarely misdiagnosed as
non-patients.)
15. Consider discretizing a continuous attribute whose values are listed below:
3, 4, 5, 10, 20, 32, 43, 44, 46, 52, 59, 61
Which of the following number of bins is not possible for using equidepth bins?
A. 2
B. 3
C. 4
D. 5
E. 6
F. All of the above
Answer: D
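Under the strict reading that equi-depth (equal-frequency) bins must hold the same number of values, the feasible bin counts for these 12 values are exactly the divisors of 12, which a short check confirms:

```python
# Equi-depth binning puts the same number of values in each bin,
# so with 12 values the bin count must divide 12 evenly.
values = [3, 4, 5, 10, 20, 32, 43, 44, 46, 52, 59, 61]
feasible = [k for k in (2, 3, 4, 5, 6) if len(values) % k == 0]
print(feasible)  # [2, 3, 4, 6] -- 5 is not possible
```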
16. Consider discretizing a continuous attribute whose values are listed below:
3, 4, 5, 10, 21, 32, 43, 44, 46, 52, 59, 67
Using equal-width partitioning and four bins, how many values are there in the first bin (the bin with
small values)?
A. 1
B. 2
C. 3
D. 4
E. 5
Answer: D (the first bin spans 3 to 19, and contains 4 items: 3, 4, 5, 10)
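A small sketch of the equal-width computation (bin width = (67 − 3)/4 = 16, so the first bin is [3, 19)):

```python
# Equal-width partitioning: width = (max - min) / number_of_bins.
values = [3, 4, 5, 10, 21, 32, 43, 44, 46, 52, 59, 67]
bins = 4
width = (max(values) - min(values)) / bins   # (67 - 3) / 4 = 16.0
first_bin = [v for v in values if v < min(values) + width]
print(first_bin)  # [3, 4, 5, 10]
```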
17. High entropy means that the partitions in classification are
A. pure
B. not pure
C. useful
D. useless
E. None of the above
Answer: B
18. A machine learning problem involves four attributes plus a class. The attributes have 3, 2, 2, and 2
possible values each. The class has 3 possible values. How many possible different examples are there?
A. 3
B. 6
C. 12
D. 24
E. 48
F. 72
Answer: F
22. Suppose that there are a total of 50 data mining related documents in a library of 200 documents.
Suppose that a search engine retrieves 10 documents after a user enters ‘data mining’ as a query, of
which 5 are data mining related documents. What are the precision and recall?
A. (50%, 10%)
B. (60%, 20%)
C. (70%, 30%)
D. (60%, 30%)
Answer: A
Explanation: Precision = 5/10 = 50% and recall = 5/50 = 10%.
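A quick check of these definitions in Python (numbers from the question; variable names are mine):

```python
# Precision = relevant retrieved / total retrieved
# Recall    = relevant retrieved / total relevant in the collection
total_relevant = 50     # data mining documents in the library
retrieved = 10          # documents returned for the query
relevant_retrieved = 5  # of which are actually data mining related

precision = relevant_retrieved / retrieved    # 5/10 = 0.5
recall = relevant_retrieved / total_relevant  # 5/50 = 0.1
print(precision, recall)
```

This matches option A: (50%, 10%).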
23. Three companies A, B and C supply 25%, 35% and 40% of the notebooks to a school. Past
experience shows that 5%, 4% and 2% of the notebooks produced by these companies are defective.
If a notebook was found to be defective, what is the probability that the notebook was supplied by A?
a) 44⁄69
b) 25⁄69
c) 13⁄24
d) 11⁄24
Answer: b
Explanation: Let A, B and C be the events that notebooks are provided by A, B and C respectively.
Let D be the event that notebooks are defective
Then,
P(A) = 0.25, P(B) = 0.35, P(C) = 0.4
P(D|A) = 0.05, P(D|B) = 0.04, P(D|C) = 0.02
P(A│D) = (P(D│A) * P(A))/(P(D│A) * P(A) + P(D│B) * P(B) + P(D│C) * P(C))
= 0.0125/0.0345 = 25⁄69 ≈ 0.3623
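The same Bayes' rule computation, done with exact fractions as a sanity check (the dictionary layout is just illustrative):

```python
from fractions import Fraction

# P(A|D) = P(D|A)P(A) / sum_i P(D|i)P(i), with the numbers from the question.
priors = {"A": Fraction(25, 100), "B": Fraction(35, 100), "C": Fraction(40, 100)}
defect = {"A": Fraction(5, 100), "B": Fraction(4, 100), "C": Fraction(2, 100)}
evidence = sum(defect[s] * priors[s] for s in priors)  # P(D) = 0.0345
posterior_A = defect["A"] * priors["A"] / evidence
print(posterior_A)         # 25/69
print(float(posterior_A))  # about 0.3623
```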
24. A box of cartridges contains 30 cartridges, of which 6 are defective. If 3 of the cartridges are
removed from the box in succession without replacement, what is the probability that all the 3
cartridges are defective?
a) (6∗5∗4)(30∗30∗30)
b) (6∗5∗4)(30∗29∗28)
c) (6∗5∗3)(30∗29∗28)
d) (6∗6∗6)(30∗30∗30)
Answer: b
Explanation: Let A be the event that the first cartridge is defective. Let B be the event that the second
cartridge is defective. Let C be the event that the third cartridge is defective. Then probability that all
3 cartridges are defective is P(A ∩ B ∩ C)
Hence,
P(A ∩ B ∩ C) = P(A) * P(B|A) * P(C | A ∩ B)
= (6⁄30) * (5⁄29) * (4⁄28)
= (6 * 5 * 4)⁄(30 * 29 * 28).
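The sequential product can be cross-checked against the equivalent counting argument C(6,3)/C(30,3) (Python 3.8+ for math.comb):

```python
from fractions import Fraction
from math import comb

# Sequential product P(A)P(B|A)P(C|A and B) vs. the counting identity C(6,3)/C(30,3).
p_sequential = Fraction(6, 30) * Fraction(5, 29) * Fraction(4, 28)
p_counting = Fraction(comb(6, 3), comb(30, 3))
print(p_sequential, p_sequential == p_counting)  # 1/203 True
```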
25. Two boxes containing candies are placed on a table. The boxes are labelled B1 and B2. Box B1
contains 7 cinnamon candies and 4 ginger candies. Box B2 contains 3 cinnamon candies and 8 pepper
candies. The boxes are arranged so that the probability of selecting box B1 is 1⁄3 and the probability
of selecting box B2 is 2⁄3. Suresh is blindfolded and asked to select a candy. He will win a colour TV if
he selects a cinnamon candy. What is the probability that Suresh will win the TV (that is, he will select
a cinnamon candy)?
a) 7⁄33
b) 6⁄33
c) 13⁄33
d) 20⁄33
Answer: c
26. Two boxes containing candies are placed on a table. The boxes are labelled B1 and B2. Box B1
contains 7 cinnamon candies and 4 ginger candies. Box B2 contains 3 cinnamon candies and 8 pepper
candies. The boxes are arranged so that the probability of selecting box B1 is 1⁄3 and the probability
of selecting box B2 is 2⁄3. Suresh is blindfolded and asked to select a candy. He will win a colour TV if
he selects a cinnamon candy. If he wins a colour TV, what is the probability that the candy was from
the first box?
a) 7⁄13
b) 13⁄7
c) 7⁄33
d) 6⁄33
Answer: a
Answer: a
Explanation: Let E represent the event of moving a blue coin from box A to box B. We want the
probability that a blue coin was moved from box A to box B given that the coin chosen from B was
red. The probability of choosing a red coin from box A is P(R) = 4⁄9 and the probability of choosing a
blue coin from box A is P(B) = 5⁄9. If a red coin was moved from box A to box B, then box B has 7 red
coins and 3 blue coins, so the probability of choosing a red coin from box B is 7⁄10. Similarly, if a
blue coin was moved from box A to box B, then the probability of choosing a red coin from box B is
6⁄10.
Hence, the probability that a blue coin was transferred from box A to box B given that the coin chosen
from box B is red is given by
P(E|R) = P(R|E) * P(E)/P(R)
= ((6⁄10) * (5⁄9))/((7⁄10) * (4⁄9) + (6⁄10) * (5⁄9))
= 15⁄29.
28. An urn B1 contains 2 white and 3 black chips and another urn B2 contains 3 white and 4 black
chips. One urn is selected at random and a chip is drawn from it. If the chip drawn is found black, find
the probability that the urn chosen was B1.
a) 4⁄7
b) 3⁄7
c) 20⁄41
d) 21⁄41
Answer: d
Explanation: Let E1, E2 denote the events of selecting urns B1 and B2 respectively.
Then P(E1) = P(E2) = 1⁄2
Let B denote the event that the chip chosen from the selected urn is black.
Then we have to find P(E1|B).
By hypothesis P(B /E1) = 3⁄5
and P(B /E2) = 4⁄7
By Bayes theorem P(E1 /B) = (P(E1)*P(B│E1))/((P(E1) * P(B│E1)+P(E2) * P(B│E2)) )
= ((1/2) * (3/5))/((1/2) * (3/5)+(1/2)*(4/7) ) = 21/41.
29. At a certain university, 4% of men are over 6 feet tall and 1% of women are over 6 feet tall. The
total student population is divided in the ratio 3:2 in favour of women. If a student is selected at
random from among all those over six feet tall, what is the probability that the student is a woman?
a) 2⁄5
b) 3⁄5
c) 3⁄11
d) 1⁄100
Answer: c
Explanation: Let M be the event that student is male and F be the event that the student is female.
Let T be the event that student is taller than 6 ft.
P(M) = 2⁄5 P(F) = 3⁄5 P(T|M) = 4⁄100 P(T|F) = 1⁄100
P(F│T) = (P(T│F) * P(F))/(P(T│F) * P(F) + P(T│M) * P(M))
= ((1/100) * (3/5))/((1/100) * (3/5) + (4/100) * (2/5))
= 3⁄11 ≈ 0.2727
30. Naina receives emails that consists of 18% spam of those emails. The spam filter is 93% reliable
i.e., 93% of the mails it marks as spam are actually a spam and 93% of spam mails are correctly labelled
as spam. If a mail marked spam by her spam filter, determine the probability that it is really a spam.
a) 50%
b) 84%
c) 39%
d) 63%
Answer: a
Explanation: 18% of emails are spam and 82% are not spam. By Bayes' theorem, the probability that a mail
marked spam is really a spam = (probability of being spam and being detected as spam)/(probability
of being detected as spam) = (0.18 * 0.82)/((0.18 * 0.82) + (0.18 * 0.82)) = 0.5 or 50%.
31. A meeting has 12 employees. Given that 8 of the employees are women, find the probability that
all the employees are women?
a) 11⁄23
b) 12⁄35
c) 2⁄9
d) 1⁄8
Answer: d
Explanation: Assume that the probability of an employee being a man or a woman is 1⁄2. By using
Bayes' theorem: let B be the event that 8 of the employees are women and let A be
the event that all employees are women. We want to find P(A|B) = P(B|A) * P(A)/P(B). P(B|A) = 1, P(A)
= 1⁄12 and P(B) = 8⁄12. So, P(A|B) = (1 * (1⁄12))/(8⁄12) = 1⁄8.
32. A cupboard A has 4 red carpets and 4 blue carpets and a cupboard B has 3 red carpets and 5 blue
carpets. A cupboard is selected and a carpet is chosen from the selected cupboard
such that each carpet in the cupboard is equally likely to be chosen. Cupboards A and B can be selected
with probability 1⁄5 and 3⁄5 respectively. Given that a carpet selected in the above process is a blue carpet, find
the probability that it came from the cupboard B.
a) 2⁄5
b) 15⁄19
c) 31⁄73
d) 4⁄9
Answer: b
Explanation: P(B|blue) = ((5⁄8) * (3⁄5))/((4⁄8) * (1⁄5) + (5⁄8) * (3⁄5)) = (15⁄40)/(19⁄40) = 15⁄19.
33. Mangoes numbered 1 through 18 are placed in a bag for delivery. Four mangoes are drawn out of
the bag without replacement. Find the probability that all the mangoes have even numbers on
them?
a) 43.7%
b) 34%
c) 6.8%
d) 9.3%
Answer: c
Explanation: The draws are not independent. 10 of the 18 mangoes are even-numbered, so the
probability that the first one is even is 10⁄18. For the second mango, given that the first one was even,
there are only 9 even-numbered mangoes that could be drawn from a total of 17,
so the probability is 9⁄17. For the third mango, since the first two are both even, there are 8 even-
numbered mangoes that could be drawn from a total of 16 remaining, so the probability is
8⁄16, and for the fourth mango the probability is 7⁄15. So the probability that all 4 mangoes are even
numbered is (10⁄18) * (9⁄17) * (8⁄16) * (7⁄15) = 0.068 or 6.8%.
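The product of fractions (10/18 of the 18 mangoes are even, then 9/17, 8/16, 7/15) can be verified exactly with a quick sketch:

```python
from fractions import Fraction

# P(all four drawn mangoes are even) with 10 even mangoes among 18, no replacement.
p = Fraction(10, 18) * Fraction(9, 17) * Fraction(8, 16) * Fraction(7, 15)
print(p, float(p))  # 7/102, about 0.0686, i.e. roughly 6.8%
```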
34. A family has two children. Given that one of the children is a girl and that she was born on a
Monday, what is the probability that both children are girls?
a) 13⁄27
b) 23⁄54
c) 12⁄19
d) 43⁄58
Answer: a
Explanation: Let Y be the event that the family has a child who is a girl born on a Monday and let X
be the event that both children are girls, and apply Bayes' theorem. There are 7 days of
the week, so there are 49 possible combinations for the days of the week two girls could be born on,
and 13 of these include a girl born on a Monday, so P(Y|X) = 13⁄49. P(X) = 1⁄4.
To calculate P(Y), there are 14^2 = 196 possible ways to select the gender and the day of the week
of birth for each child. There are 13^2 = 169 ways which do not have a girl born on a Monday, leaving
196 – 169 = 27 which do, so P(Y) = 27⁄196. This gives P(X|Y) = ((13⁄49) * (1⁄4))/(27⁄196) = 13⁄27.
35. A jar contains 8 marbles, of which 4 are red and 4 are blue. Find the probability of
getting a red marble on the second draw given that the first one was red too.
a) 4⁄13
b) 2⁄11
c) 3⁄7
d) 8⁄15
Answer: c
Explanation: Let A = getting a red marble in the first draw and B = getting a red marble in
the second draw. P(A) = 4⁄8 and P(A and B) = (4⁄8) * (3⁄7) = 3⁄14, so P(B|A) =
P(A and B)/P(A) = (3⁄14)/(1⁄2) = 3⁄7.
37. If we are provided with an infinite sized training set which of the following classifier will have the
lowest error probability?
A. Decision tree
B. K- nearest neighbor classifier
C. Bayes classifier
D. Support vector machine
Ans: C
Explanation: The Bayes classifier has the lowest error probability when trained with an infinite-sized training set.
38. Let A be an example, and C be a class. The probability P(C|A) is known as:
A. Apriori probability
B. Aposteriori probability
C. Class conditional probability
D. None of the above
Ans: B
Explanation: conditional probability P(C|A) is known as aposteriori probability.
39. Let A be an example, and C be a class. The probability P(C) is known as:
A. Apriori probability
B. Aposteriori probability
C. Class conditional probability
D. None of the above
Ans: A
Explanation: Apriori probability is a probability that is deduced from formal reasoning. In other words,
apriori probability is derived from logically examining an event. Class probability P(C) is apriori
probability.
40. A bank classifies its customer into two classes “fraud” and “normal” based on their instalment
payment behaviour. We know that the probability of a customer being fraud is P(fraud) = 0.20,
the probability of customer defaulting instalment payment is P(default) = 0.40, and the probability
that a fraud customer defaults in installment payment is P(default|fraud) = 0.80. What is the
probability of a customer who defaults in payment being a fraud?
A. 0.80
B. 0.60
C. 0.40
D. 0.20
Ans: C
Explanation: We have to find P(fraud|default).
By Bayes' Rule: P(fraud|default) = (P(default|fraud) * P(fraud))/P(default) = (0.80 * 0.20)/0.40 = 0.40
41. Consider two binary attributes X and Y. We know that the attributes are independent and
Probability P(X=1) = 0.6, and P(Y=0) = 0.4. What is the probability that both X and Y have values 1?
A. 0.06
B. 0.16
C. 0.26
D. 0.36
Ans: D
Explanation: P(X=1)=0.6 P(Y=0)=0.4 P(Y=1)=1-0.4=0.6
P(X=1, Y=1) = P(X=1)*P(Y=1) = 0.6*0.6 = 0.36 (Since, X and Y are independent)
42. Consider a binary classification problem with two classes C1 and C2. Class labels of ten other
training set instances sorted in increasing order of their distance to an instance x is as follows: {C1, C2,
C1, C2, C2, C2, C1, C2, C1, C2}. How will a K=7 nearest neighbor classifier classify x?
Ans: C
Explanation: The closest 7 neighbours are C1, C2, C1, C2, C2, C2, C1. Among these, C1 has 3 occurrences
and C2 has 4, so by majority voting x is classified as C2.
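The majority vote over the 7 nearest labels can be sketched with collections.Counter:

```python
from collections import Counter

# Labels of training instances sorted by distance to x; take the 7 nearest and vote.
sorted_labels = ["C1", "C2", "C1", "C2", "C2", "C2", "C1", "C2", "C1", "C2"]
k = 7
votes = Counter(sorted_labels[:k])
print(votes)                       # C2 gets 4 votes, C1 gets 3
print(votes.most_common(1)[0][0])  # C2
```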
43. Given the following training set for classification problem into two classes “fraud” and “normal”.
There are two attributes A1 and A2 taking values 0 or 1. What is the estimated apriori probability
P(fraud)of the class fraud?
A1 A2 Class
1 0 fraud
1 1 fraud
1 1 fraud
1 0 normal
1 1 fraud
0 0 normal
0 0 normal
0 0 normal
1 1 normal
1 0 normal
A. 0.2
B. 0.4
C. 0.6
D. 0.8
Ans: B
Explanation: P(fraud) = 4/10 = 0.4, since 4 out of 10 instances are fraud cases.
44. Given the following training set for classification problem into two classes “fraud” and “normal”.
There are two attributes A1 and A2 taking values 0 or 1. What is the estimated class conditional
probability P(A1=1, A2=1|fraud)?
A1 A2 Class
1 0 fraud
1 1 fraud
1 1 fraud
1 0 normal
1 1 fraud
0 0 normal
0 0 normal
0 0 normal
1 1 normal
1 0 normal
A. 0.25
B. 0.50
C. 0.75
D. 1.00
Ans: C
Explanation: P(A1=1, A2=1|fraud) = (P(fraud|A1=1, A2=1) * P(A1=1, A2=1))/P(fraud) = ((3/4) * 0.4)/0.4
= 0.75; equivalently, 3 of the 4 fraud instances have A1=1 and A2=1.
45. Given the following training set for classification problem into two classes “fraud” and “normal”.
There are two attributes A1 and A2 taking values 0 or 1. The Bayes classifier classifies the instance
(A1=1, A2=1) into class?
A1 A2 Class
1 0 fraud
1 1 fraud
1 1 fraud
1 0 normal
1 1 fraud
0 0 normal
0 0 normal
0 0 normal
1 1 normal
1 0 normal
A. fraud
B. normal
C. there will be a tie
D. not enough information to classify
Ans: A
Explanation: P(fraud| A1=1,A2=1) = 0.75
P(Normal| A1=1,A2=1) = 0.25
P(fraud| A1=1,A2=1) > P(Normal| A1=1,A2=1) therefore classified as fraud.
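Questions 43-45 can be reproduced by direct counting over the ten training rows; a minimal sketch (the row encoding is mine):

```python
from fractions import Fraction

# Estimate P(class | A1=1, A2=1) by direct counting over the 10 training rows.
rows = [(1, 0, "fraud"), (1, 1, "fraud"), (1, 1, "fraud"), (1, 0, "normal"),
        (1, 1, "fraud"), (0, 0, "normal"), (0, 0, "normal"), (0, 0, "normal"),
        (1, 1, "normal"), (1, 0, "normal")]
match = [c for a1, a2, c in rows if a1 == 1 and a2 == 1]  # rows with A1=1, A2=1
p_fraud = Fraction(match.count("fraud"), len(match))
print(p_fraud)  # 3/4, so (A1=1, A2=1) is classified as fraud
```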
46. In which of the following cases will K-means clustering fail to give good results? 1) Data points with
outliers 2) Data points with different densities 3) Data points with nonconvex shapes
a. 1 and 2
b. 2 and 3
c. 1, 2, and 3
d. 1 and 3
Answer: c
47. Which of the following is a reasonable way to select the number of principal components "k"?
a. Choose k to be the smallest value so that at least 99% of the variance is retained.
b. Choose k to be 99% of m (k = 0.99*m, rounded to the nearest integer).
c. Choose k to be the largest value so that 99% of the variance is retained.
d. Use the elbow method
Answer: a
48. You run gradient descent for 15 iterations with a=0.3 and compute J(theta) after each iteration.
You find that the value of J(Theta) decreases quickly and then levels off. Based on this, which of the
following conclusions seems most plausible?
a. Rather than using the current value of a, use a larger value of a (say a=1.0)
b. Rather than using the current value of a, use a smaller value of a (say a=0.1)
c. a=0.3 is an effective choice of learning rate
d. None of the above
Answer: c
50. Suppose you have trained a logistic regression classifier and it outputs for a new example x a
prediction ho(x) = 0.2. This means
a. Our estimate for P(y=1 | x) is 0.8
b. Our estimate for P(y=0 | x) is 0.8
c. Our estimate for P(y=1 | x) is 0.5
d. Our estimate for P(y=0 | x) is 0.2
Answer: b
Explanation: ho(x) is the estimate of P(y=1 | x), so P(y=1 | x) = 0.2 and P(y=0 | x) = 1 - 0.2 = 0.8.
UNIT – 3
6. Which technique uses the method of finding training partition records that have
exactly the same predictor values as the new observation?
(a) Naïve Bayes
(b) Complete Bayes
(c) Multiple Linear Regression
(d) None of the above
Sol-(b)
7.Bayes rule is used to
(a)Solve queries
(b)Increase complexity of a query
(c)Decrease complexity of a query
(d)Answer probabilistic queries
Sol-(d)
16. How can the entries in the full joint probability distribution be calculated?
(a)Using variables
(b)Using information
(c)Both using variables and information
(d)None of the mentioned
Answer-(b)
25. What area of CLT tells "How many mistakes will we make before finding a good
hypothesis"?
(a)Sample Complexity
(b)Computational Complexity
(c)Mistake Bound
(d)None of these
Sol-(c)
26. What area of CLT tells "How many examples do we need to find a good
hypothesis"?
(a)Sample complexity
(b) Computational Complexity
(c)Mistake Bound
(d)None of these
Sol-(a)
2) In the image below, which would be the best value for k assuming that the algorithm you are using is k-
Nearest Neighbour.
A) 3
B) 10
C) 20
D) 50
Solution: B
Explanation: Validation error is the least when the value of k is 10, so it is best to use this value of k.
6) Which of the following machine learning algorithms can be used for imputing missing values of both
categorical and continuous variables?
A) K-NN
B) Linear Regression
C) Logistic Regression
Solution: A
Explanation: The k-NN algorithm can be used for imputing missing values of both categorical and
continuous variables.
8) Which of the following distance measure do we use in case of categorical variables in k-NN?
1 Hamming Distance
2 Euclidean Distance
3 Manhattan Distance
A) 1
B) 2
C) 3
D) 1 and 2
E) 2 and 3
F) 1,2 and 3
Solution: A
Explanation: Both Euclidean and Manhattan distances are used in the case of continuous variables, whereas
Hamming distance is used in the case of categorical variables.
9) Which of the following will be Euclidean Distance between the two data point A(1,3) and B(2,3)?
A) 1
B) 2
C) 4
D) 8
Solution: A
Explanation: sqrt ((1-2)^2 + (3-3)^2) = sqrt(1^2 + 0^2) = 1
10) Which of the following will be Manhattan Distance between the two data point A(1,3) and B(2,3)?
A) 1
B) 2
C) 4
D) 8
Solution: A
Explanation: |1-2| + |3-3| = 1 + 0 = 1 (Manhattan distance is the sum of absolute differences; no square root is taken)
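Both distance measures from questions 9 and 10 as small Python functions (a sketch, not tied to any library):

```python
from math import sqrt

# Euclidean: square root of summed squared differences.
def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Manhattan: sum of absolute differences, no square root.
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

A, B = (1, 3), (2, 3)
print(euclidean(A, B), manhattan(A, B))  # 1.0 1
```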
Context: 11-12
Suppose, you have given the following data where x and y are the 2 input variables and Class is the
dependent variable.
11) Suppose, you want to predict the class of new data point x=1 and y=1 using Euclidean distance in 3-NN.
To which class does this data point belong?
A) + Class
B) – Class
C) Can’t say
D) None of these
Solution: A
Explanation: All three nearest points are of the + class, so this point will be classified as + class.
12) In the previous question, you now want to use 7-NN instead of 3-NN. To which class will the point x=1
and y=1 belong?
A) + Class
B) – Class
C) Can’t say
Solution: B
Explanation: Now this point will be classified as – class, because there are 4 – class and 3 + class points
in the nearest circle.
Context 13-14:
Suppose you have been given the following 2-class data where “+” represents a positive class and “–”
represents a negative class.
13) Which of the following values of k in k-NN would minimize the leave-one-out cross-validation error?
A) 3
B) 5
C) Both have same
D) None of these
Solution: B
Explanation: 5-NN will have least leave one out cross validation error.
14) Which of the following would be the leave-one-out cross-validation accuracy for k=5?
A) 2/14
B) 4/14
C) 6/14
D) 8/14
E) None of the above
Solution: E
Explanation: With 5-NN the leave-one-out cross-validation accuracy is 10/14, which is not among the options.
15) Which of the following will be true about k in k-NN in terms of bias?
A) When you increase k, the bias will increase
B) When you decrease k, the bias will increase
C) Can’t say
D) None of these
Solution: A
Explanation: A large k means a simpler model, and a simpler model is generally considered to have high bias.
16) Which of the following will be true about k in k-NN in terms of variance?
A) When you increase k, the variance will increase
B) When you decrease k, the variance will increase
C) Can’t say
D) None of these
Solution: B
Explanation: A simpler model (large k) is considered a lower-variance model.
17) The following two distances (Euclidean distance and Manhattan distance), which we generally use in the
k-NN algorithm, are given to you. These distances are between two points A(x1,y1) and B(x2,y2).
Your task is to tag both distances by looking at the following two graphs. Which of the following options is true
about the graph below?
18) When you find noise in data which of the following option would you consider in k-NN?
A) I will increase the value of k
B) I will decrease the value of k
C) Noise cannot be dependent on value of k
D) None of these
Solution: A
Explanation: To be more confident of the classifications you make, you can try increasing the value of k.
19) In k-NN it is very likely to overfit due to the curse of dimensionality. Which of the following option would
you consider to handle such problem?
1 Dimensionality Reduction
2 Feature selection
A) 1
B) 2
C) 1 and 2
D) None of these
Solution: C
Explanation: In such cases you can use either a dimensionality reduction algorithm or a feature selection
algorithm.
20) Two statements are given below. Which of them is/are true?
1 k-NN is a memory-based approach, meaning that the classifier immediately adapts as we collect
new training data.
2 The computational complexity for classifying new samples grows linearly with the number
of samples in the training dataset in the worst-case scenario.
A) 1
B) 2
C) 1 and 2
D) None of these
Solution: C
21) Suppose you have been given the following images (1 left, 2 middle and 3 right). Now your task is to find
out the value of k in k-NN in each image, where k1 is for the 1st, k2 for the 2nd and k3 for the 3rd figure.
A) k1 > k2 > k3
B) k1 < k2
C) k1 = k2 = k3
D) None of these
Solution: D
Explanation: The value of k is highest in k3 and lowest in k1.
22) Which of the following values of k in the following graph would give the least leave-one-out cross-validation
accuracy?
A) 1
B) 2
C) 3
D) 5
Solution: B
Explanation: If you keep the value of k as 2, it gives the lowest cross validation accuracy. You can try this
out yourself.
23) A company has built a kNN classifier that gets 100% accuracy on training data. When they deployed
this model on the client side, it was found that the model is not at all accurate. Which of the following
might have gone wrong?
Note: Model has successfully deployed and no technical issues are found at client side except the model
performance
A) It is probably an overfitted model
B) It is probably an underfitted model
C) Can’t say
D) None of these
Solution: A
Explanation: An overfitted model seems to perform well on training data, but it is not generalized
enough to give the same results on new data.
24) You are given the following 2 statements. Find which of these options is/are true in the case of k-NN.
1 In case of very large value of k, we may include points from other classes into the
neighborhood.
2 In case of too small value of k the algorithm is very sensitive to noise
A) 1
B) 2
C) 1 and 2
D) None of these
Solution: C
Explanation: Both the options are true and are self-explanatory.
26) True-False: It is possible to construct a 2-NN classifier by using the 1-NN classifier?
A) TRUE
B) FALSE
Solution: A
Explanation: You can implement a 2-NN classifier by ensembling 1-NN classifiers
27) In k-NN what will happen when you increase/decrease the value of k?
A) The boundary becomes smoother with increasing value of K
B) The boundary becomes smoother with decreasing value of K
C) Smoothness of boundary doesn’t dependent on value of K
D) None of these
Solution: A
Explanation: The decision boundary would become smoother by increasing the value of K
28) The following two statements are given for the k-NN algorithm. Which of the statement(s)
is/are true?
1 We can choose optimal value of k with the help of cross validation
2 Euclidean distance treats each feature as equally important
A) 1
B) 2
C) 1 and 2
D) None of these
Solution: C
Explanation: Both the statements are true
29) What would be the time taken by 1-NN if there are N(Very large) observations in test data?
A) N*D
B) N*D*2
C) (N*D)/2
D) None of these
Solution: A
Explanation: For each test query, 1-NN computes the distance to all training points across D dimensions, so the cost is of order N*D.
30) What would be the relation between the time taken by 1-NN,2-NN,3-NN.
A) 1-NN >2-NN >3-NN
B) 1-NN < 2-NN < 3-NN
C) 1-NN ~ 2-NN ~ 3-NN
D) None of these
Solution: C
Explanation: The time taken is essentially the same for any value of k, since computing the distances to all points dominates.
35. Which of the following clustering algorithms uses a minimal spanning tree?
A. Complete linkage clustering
B. Single linkage clustering
C. Average linkage clustering
D. DBSCAN
Answer: B
Explanation: Single-linkage clustering can be computed from a minimal spanning tree of the data; the
naive algorithm has time complexity O(n³).
36. Which of the following is a widely used and effective machine learning algorithm based on the idea of
bagging?
a. Decision Tree
b. Regression
c. Classification
d. Random Forest
Answer: d
37. To find the minimum or the maximum of a function, we set the gradient to zero because:
a. The value of the gradient at extrema of a function is always zero
b. Depends on the type of problem
c. Both A and B
d. None of the above
Answer: a
38. The most widely used metrics and tools to assess a classification model are:
a. Confusion matrix
b. Cost-sensitive accuracy
c. Area under the ROC curve
d. All of the above
Answer: d
39. Which of the following is a good test dataset characteristic?
a. Large enough to yield meaningful results
b. Is representative of the dataset as a whole
c. Both A and B
d. None of the above
Answer: c
44. When performing regression or classification, which of the following is the correct way to preprocess the
data?
a. Normalize the data → PCA → training
b. PCA → normalize PCA output → training
c. Normalize the data → PCA → normalize PCA output → training
d. None of the above
Answer: a
49. How can you prevent a clustering algorithm from getting stuck in bad local optima?
a. Set the same seed value for each run
b. Use multiple random initializations
c. Both A and B
d. None of the above
Answer: b
50. Which of the following techniques can be used for normalization in text mining?
a. Stemming
b. Lemmatization
c. Stop Word Removal
d. Both A and B
Answer: d
UNIT – 4
3. K-NN algorithm does more computation on test time rather than train time?
a) True
b) False
1. k-NN performs much better if all of the data have the same scale
2. k-NN works well with a small number of input variables (p), but struggles when the number of
inputs is very large
3. k-NN makes no assumptions about the functional form of the problem being solved
a) Only 1 is true
b) Both 1 and 3 are true
c) Only 2
6. What properties are common between K – nearest Neighbor and Locally Weighted Regression?
2. They classify new query instances by analyzing similar instances while ignoring instances that are
very different from the query.
a) Only 1
b) Only 1 and 3
c) Only 2 and 3
d) All of the above
8. What is the difference between lazy Learning and Eager Learning methods?
a) Lazy methods may consider the query instance x, when deciding how to generalize beyond
the training data D.
b) Lazy methods will not consider the query instance x, when deciding how to generalize beyond the
training data D.
c) Lazy Methods will only consider the training data
9. Which of the following will be the Euclidean distance between the two data points A(1,3) and B(2,3)?
a) 1
b) 2
c) 4
d) 8
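A quick check of the arithmetic behind this question: the points differ only in the first coordinate, so the Euclidean distance is 1.

```python
import math

A, B = (1, 3), (2, 3)
# sqrt((2 - 1)**2 + (3 - 3)**2) = 1
dist = math.dist(A, B)
```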
10. Which of the following will be true about k in k-NN in terms of Bias?
a) When you increase k, the bias will increase
b) When you decrease k, the bias will increase
c) It can either increase or decrease
d) None of the above
11. Which of the following will be true about k in k-NN in terms of variance?
a) When you increase k, the variance will increase
b) When you decrease k, the variance will increase
c) It can either increase or decrease
d) None of the above
12. Computational complexity of classes of learning problems depends on which of the following?
a) The size or complexity of the hypothesis space considered by learner
b) The accuracy to which the target concept must be approximated
c) The probability that the learner will output a successful hypothesis
d) All of these
22. In the k-NN algorithm, given a set of training examples and a value of k < size of the training set (n), the algorithm predicts the class of a test example to be the:
a) Least frequent class among the classes of the k closest training examples
b) Most frequent class among the classes of the k closest training examples
c) Class of the closest training example
d) Most frequent class among the classes of the k farthest training examples.
23. Which of the following will be true about k in k-NN in terms of variance
a) When you increase k, the variance will increase
b) When you decrease k, the variance will increase
c) Can’t say
d) None of these
26. When you find noise in the data, which of the following options would you consider in k-NN?
a) I will increase the value of k
b) I will decrease the value of k
c) Noise cannot be dependent on the value of k
d) None of these
27. Which of the following will be true about k in k-NN in terms of Bias?
a) When you increase k, the bias will increase
b) When you decrease k, the bias will increase
c) Can’t say
d) None of these
31. In k-Nearest Neighbor it is very likely to overfit due to the curse of dimensionality. Which of the following options would you consider to handle such a problem?
1. Dimensionality Reduction
2. Feature selection
a) 1
b) 2
c) 1 and 2
d) None of these
UNIT - 5
1) Which of the following statements is true in the following case?
Solution: (B)
Explanation: Ordinal variables are variables which have some order in their categories. For example, grade A should be considered a higher grade than grade B.
A) PCA
B) K-Means
Solution: (A)
Explanation: A deterministic algorithm is one whose output does not change on different runs. PCA would give the same result if run again, but k-means would not.
3) [True or False] A Pearson correlation between two variables is zero, but their values can still be related to each other.
A) TRUE
B) FALSE
Solution: (A)
Explanation: Consider Y = X². They are not only associated; one is a function of the other, and the Pearson correlation between them is 0.
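A quick numerical check of the explanation, assuming NumPy is available: with X symmetric around zero and Y = X², the Pearson correlation is 0 even though Y is completely determined by X.

```python
import numpy as np

X = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
Y = X ** 2                      # perfectly (non-linearly) related to X
r = np.corrcoef(X, Y)[0, 1]     # Pearson correlation coefficient
```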
4) Which of the following statement(s) is / are true for Gradient Decent (GD) and Stochastic
Gradient Decent (SGD)?
1. In GD and SGD, you update a set of parameters in an iterative manner to minimize the error function.
2. In SGD, you have to run through all the samples in your training set for a single update of a parameter in each iteration.
3. In GD, you either use the entire data or a subset of training data to update a parameter in each iteration.
A) Only 1
B) Only 2
C) Only 3
D) 1 and 2
E) 2 and 3
F) 1,2 and 3
Solution: (A)
Explanation: In SGD, each iteration uses a batch that generally contains a random sample of the data, but in GD each iteration uses all of the training observations.
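A minimal sketch of the difference, fitting y = w·x on toy data with plain NumPy (the learning rate, data and iteration count are arbitrary choices): GD computes the gradient over all observations per update, while SGD uses a single random sample.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)   # true slope is 3

def gd_step(w, lr=0.1):
    grad = np.mean(2 * (w * x - y) * x)   # gradient over ALL observations
    return w - lr * grad

def sgd_step(w, lr=0.1):
    i = rng.integers(len(x))              # ONE randomly chosen sample
    grad = 2 * (w * x[i] - y[i]) * x[i]
    return w - lr * grad

w_gd = w_sgd = 0.0
for _ in range(200):
    w_gd = gd_step(w_gd)
    w_sgd = sgd_step(w_sgd)
```

Both estimates approach the true slope; the SGD trajectory is noisier because each step sees only one observation.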
5) Which of the following hyper parameter(s), when increased may cause random forest to over
fit the data?
A) Only 1
B) Only 2
C) Only 3
D) 1 and 2
E) 2 and 3
F) 1,2 and 3
Solution: (B)
Explanation: Usually, if we increase the depth of the tree it will cause overfitting. Learning rate is not a hyperparameter in random forest. An increase in the number of trees will cause underfitting.
6) Imagine, you are working with “Analytics Vidhya” and you want to develop a machine
learning algorithm which predicts the number of views on the articles.
Your analysis is based on features like author name, number of articles written by the same
author on Analytics Vidhya in past and a few other features. Which of the following evaluation
metric would you choose in that case?
A) Only 1
B) Only 2
C) Only 3
D) 1 and 3
E) 2 and 3
F) 1 and 2
Solution:(A)
Explanation: The number of views of articles is a continuous target variable, which falls under a regression problem. So, mean squared error would be used as the evaluation metric.
7) Given below are three images (1,2,3). Which of the following option is correct for these
images?
A) 1 is tanh, 2 is ReLU and 3 is SIGMOID activation functions.
Solution: (D)
Explanation: The range of the SIGMOID function is [0,1]. The range of the tanh function is [-1,1]. So, option D is the right answer.
8) Below are the 8 actual values of target variable in the train file.
[0,0,0,1,1,1,1,1]
Solution: (A)
So the answer is A.
9) Let’s say, you are working with categorical feature(s) and you have not looked at the
distribution of the categorical variable in the test data.
You want to apply one hot encoding (OHE) on the categorical feature(s). What challenges may you face if you have applied OHE on a categorical variable of the train dataset?
A) All categories of categorical variable are not present in the test dataset.
D) Both A and B
E) None of these
Solution: (D)
Both are true. OHE will fail to encode categories which are present in the test set but not in the train set, so it could be one of the main challenges while applying OHE. The challenge given in option B is also true: you need to be more careful while applying OHE if the frequency distribution is not the same in train and test.
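A toy illustration of the first challenge (the variable names and categories are made up): the encoding is learned from the train column, so a category seen only in the test data gets no column of its own.

```python
train_cats = ["red", "green", "red", "blue"]
test_cats = ["green", "purple"]            # "purple" never appears in train

categories = sorted(set(train_cats))       # encoding learned on train only

def one_hot(value, categories):
    # An unseen category encodes as all zeros here; a real pipeline has to
    # decide explicitly how to handle it.
    return [1 if value == c else 0 for c in categories]

encoded_test = [one_hot(v, categories) for v in test_cats]
```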
10) The Skip-gram model is one of the best models used in the Word2vec algorithm for word embeddings. Which one of the following models depicts the skip-gram model?
A) A
B) B
C) Both A and B
D) None of these
Solution: (B)
Both models (model 1 and model 2) are used in the Word2vec algorithm. Model 1 represents a CBOW model, whereas Model 2 represents the Skip-gram model.
11) Let’s say, you are using activation function X in hidden layers of neural network. At a
particular neuron for any given input, you get the output as “-0.0001”. Which of the following
activation function could X represent?
A) ReLU
B) tanh
C) SIGMOID
D) None of these
Solution: (B)
The function is tanh because this function's output range is (-1, 1).
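A quick check of why the output -0.0001 rules out the other options: ReLU is never negative and the sigmoid lies in (0, 1), while tanh ranges over (-1, 1).

```python
import math

x = -0.0001                            # a small negative pre-activation

relu_out = max(0.0, x)                 # ReLU: never negative
sigmoid_out = 1 / (1 + math.exp(-x))   # sigmoid: always in (0, 1)
tanh_out = math.tanh(x)                # tanh: can be slightly negative
```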
12) [True or False] LogLoss evaluation metric can have negative values.
A) TRUE
B) FALSE
Solution: (B)
13) Which of the following statements is/are true about “Type-1” and “Type-2” errors?
A) Only 1
B) Only 2
C) Only 3
D) 1 and 2
E) 1 and 3
F) 2 and 3
Solution: (E)
In statistical hypothesis testing, a type I error is the incorrect rejection of a true null hypothesis
(a “false positive”), while a type II error is incorrectly retaining a false null hypothesis (a “false
negative”).
14) Which of the following is/are one of the important step(s) to pre-process the text in NLP
based projects?
1. Stemming
2. Stop word removal
3. Object Standardization
A) 1 and 2
B) 1 and 3
C) 2 and 3
D) 1,2 and 3
Solution: (D)
Stemming is a rudimentary rule-based process of stripping suffixes ("ing", "ly", "es", "s", etc.) from a word.
Stop words are words which are not relevant to the context of the data, for example is/am/are.
Object standardization is also one of the good ways to pre-process the text.
15) Suppose you want to project high dimensional data into lower dimensions. The two most
famous dimensionality reduction algorithms used here are PCA and t-SNE. Let’s say you have
applied both algorithms respectively on data “X” and you got the datasets “X_projected_PCA”
, “X_projected_tSNE”.
Solution: (B)
The t-SNE algorithm considers nearest-neighbour points while reducing the dimensionality of the data. So, after using t-SNE, we can expect the reduced dimensions to retain an interpretation in nearest-neighbour space. In the case of PCA this does not hold.
Context: 16-17
Given below are three scatter plots for two features (Image 1, 2 & 3 from left to right).
16) In the above images, which of the following is/are example of multi-collinear features?
A) Features in Image 1
B) Features in Image 2
C) Features in Image 3
Solution: (D)
In Image 1 the features have a high positive correlation, whereas in Image 2 they have a high negative correlation, so in both images the pairs of features are examples of multicollinear features.
17) In previous question, suppose you have identified multi-collinear features. Which of the
following action(s) would you perform next?
A) Only 1
B)Only 2
C) Only 3
D) Either 1 or 3
E) Either 2 or 3
Solution: (E)
You cannot remove both features, because after removing both you would lose all of the information. So you should either remove only one feature, or use a regularization algorithm such as L1 or L2.
18) Adding a non-important feature to a linear regression model may result in:
1. Increase in R-square
2. Decrease in R-square
A) Only 1 is correct
B) Only 2 is correct
C) Either 1 or 2
D) None of these
Solution: (A)
After adding a feature to the feature space, whether that feature is important or unimportant, the R-squared always increases.
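A numerical check with plain NumPy (toy data): the training R-squared of an OLS fit cannot decrease when a feature is added, even a pure-noise one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)
noise_feature = rng.normal(size=n)           # unrelated to y

def r_squared(X, y):
    X = np.column_stack([np.ones(len(y)), X])    # add intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

r2_base = r_squared(x.reshape(-1, 1), y)
r2_more = r_squared(np.column_stack([x, noise_feature]), y)
```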
19) Suppose, you are given three variables X, Y and Z. The Pearson correlation coefficients for
(X, Y), (Y, Z) and (X, Z) are C1, C2 & C3 respectively.
Now, you have added 2 to all values of X (i.e. new values become X+2), subtracted 2 from all
values of Y (i.e. new values are Y-2) and Z remains the same. The new coefficients for (X,Y),
(Y,Z) and (X,Z) are given by D1, D2 & D3 respectively. How do the values of D1, D2 & D3
relate to C1, C2 & C3?
E) D1 = C1, D2 = C2, D3 = C3
F) Cannot be determined
Solution: (E)
Correlation between the features won't change if you add or subtract a constant from the features.
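A quick numerical check, assuming NumPy is available: adding a constant to X and subtracting one from Y leaves the Pearson correlation unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=30)
Y = 0.5 * X + rng.normal(size=30)

c1 = np.corrcoef(X, Y)[0, 1]          # original correlation
d1 = np.corrcoef(X + 2, Y - 2)[0, 1]  # after shifting both variables
```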
20) Imagine, you are solving a classification problems with highly imbalanced class. The
majority class is observed 99% of times in the training data.
Your model has 99% accuracy after taking the predictions on test data. Which of the following
is true in such a case?
A) 1 and 3
B) 1 and 4
C) 2 and 3
D) 2 and 4
Solution: (A)
21) In ensemble learning, you aggregate the predictions of weak learners, so that an ensemble of these models gives a better prediction than the individual models.
Which of the following statements is / are true for weak learners used in ensemble model?
A) 1 and 2
B) 1 and 3
C) 2 and 3
D) Only 1
E) Only 2
Solution: (A)
Weak learners are sure about a particular part of a problem. So they usually don't overfit, which means that weak learners have low variance and high bias.
22) Which of the following options is/are true for K-fold cross-validation?
1. An increase in K will result in a higher time required to cross-validate the result.
2. Higher values of K will result in higher confidence in the cross-validation result as compared to a lower value of K.
3. If K=N, then it is called leave-one-out cross-validation, where N is the number of observations.
A) 1 and 2
B) 2 and 3
C) 1 and 3
D) 1,2 and 3
Solution: (D)
A larger K means less bias towards overestimating the true expected error (as the training folds will be closer to the total dataset) and higher running time (as you are getting closer to the limit case: leave-one-out CV). We also need to consider the variance between the K folds' accuracies when selecting K.
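Statement 3 is easy to see from the fold indices themselves; a small pure-Python sketch (the helper name is ours):

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous validation folds."""
    folds, start = [], 0
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

n = 6
folds3 = kfold_indices(n, 3)  # ordinary 3-fold CV: 2 points per fold
loo = kfold_indices(n, n)     # K = N: each fold holds out exactly one point
```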
Cross-validation is an important step in machine learning for hyperparameter tuning. Let's say you are tuning the hyperparameter "max_depth" for a GBM by selecting it from 10 different depth values (values are greater than 2) for a tree-based model using 5-fold cross-validation.
The time taken by the algorithm to train on 4 folds (for a model with max_depth 2) is 10 seconds, and prediction on the remaining fold takes 2 seconds.
23) Which of the following option is true for overall execution time for 5-fold cross validation
with 10 different values of “max_depth”?
D) Can't estimate
Solution: (D)
Each iteration for depth 2 in 5-fold cross-validation takes 10 seconds for training and 2 seconds for testing, so 5 folds take 12*5 = 60 seconds. Since we are searching over 10 depth values, the algorithm would take 60*10 = 600 seconds. But training and testing a model at a depth greater than 2 takes more time than at depth 2, so the overall time would be greater than 600 seconds.
24) In the previous question, suppose you train the same algorithm for tuning 2 hyperparameters, say "max_depth" and "learning_rate".
You want to select the right value against “max_depth” (from given 10 depth values) and
learning rate (from given 5 different learning rates). In such cases, which of the following will
represent the overall time?
A) 1000-1500 second
B) 1500-3000 Second
C) More than or equal to 3000 Second
D) None of these
Solution: (D)
25) Given below is a scenario for training error TE and Validation error VE for a machine
learning algorithm M1. You want to choose a hyperparameter (H) based on TE and VE.
H TE VE
1 105 90
2 200 85
3 250 96
4 105 85
5 300 100
A) 1
B) 2
C) 3
D) 4
E) 5
Solution: (D)
26) What would you do in PCA to get the same projection as SVD?
C) Not possible
D) None of these
Solution: (A)
When the data has a zero mean vector, PCA will have the same projections as SVD; otherwise you have to centre the data first before taking the SVD.
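A numerical check with plain NumPy: after centring the data, projecting it onto the right singular vectors (the PCA scores) equals U·S from the SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3)) + 5.0       # non-zero mean on purpose

Xc = X - X.mean(axis=0)                  # centre the data first
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

pca_projection = Xc @ Vt.T               # PCA scores of the centred data
svd_projection = U * S                   # equivalent projection from the SVD
```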
Assume there is a black box algorithm, which takes training data with multiple observations (t1,
t2, t3,…….. tn) and a new observation (q1). The black box outputs the nearest neighbor of q1
(say ti) and its corresponding class label ci.
You can also think of this black box algorithm as the same as 1-NN (1-nearest neighbor).
27) It is possible to construct a k-NN classification algorithm based on this black box alone.
A) TRUE
B) FALSE
Solution: (A)
In the first step, you pass an observation (q1) to the black box algorithm, which returns the nearest observation and its class.
In the second step, you throw out that nearest observation from the train data and again input the observation (q1). The black box algorithm will again return the nearest observation and its class.
28) Instead of using 1-NN black box we want to use the j-NN (j>1) algorithm as black box.
Which of the following option is correct for finding k-NN using j-NN?
A) 1
B) 2
C) 3
Solution: (A)
1 < 2 < 3 < 4
1 > 2 > 3 > 4
7 < 6 < 5 < 4
7 > 6 > 5 > 4
A) 1 and 3
B) 2 and 3
C) 1 and 4
D) 2 and 4
Solution: (B)
From image 1 to 4 the correlation is decreasing in absolute value, while from image 4 to 7 the correlation magnitude is increasing but the values are negative (for example, 0, -0.3, -0.7, -0.99).
30) You can evaluate the performance of a binary classification problem using different
metrics such as accuracy, log-loss, F-Score. Let’s say, you are using the log-loss function as
evaluation metric.
Which of the following option is / are true for interpretation of log-loss as an evaluation metric?
1. If a classifier is confident about an incorrect classification, then log-loss will penalise it heavily.
2. If, for a particular observation, the classifier assigns a very small probability to the correct class, then the corresponding contribution to the log-loss will be very large.
3. The lower the log-loss, the better the model.
A) 1 and 3
B) 2 and 3
C) 1 and 2
D) 1,2 and 3
Solution: (D)
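All three statements can be seen from the formula itself; a small pure-Python sketch of binary log-loss:

```python
import math

def log_loss(y_true, p):
    """Binary log-loss for one observation; p = predicted P(class 1)."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

confident_wrong = log_loss(1, 0.01)   # true class 1, model says only 1%
confident_right = log_loss(1, 0.99)   # true class 1, model says 99%
```

The confident wrong prediction incurs a loss of about 4.6 against roughly 0.01 for the confident correct one, and the loss is never negative.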
Question 31-32
Note: Visual distance between the points in the image represents the actual distance.
31) Which of the following is leave-one-out cross-validation accuracy for 3-NN (3-nearest
neighbor)?
A) 0
B) 0.4
C) 0.8
D) 1
Solution: (C)
In leave-one-out cross-validation, we select (n-1) observations for training and 1 observation for validation. Consider each point as a validation point and find the 3 points nearest to it. If you repeat this procedure for all points, you will classify every positive-class point in the figure correctly, but the negative-class points will be misclassified. Hence you will get 80% accuracy.
32) Which of the following value of K will have least leave-one-out cross validation accuracy?
A) 1NN
B) 3NN
C) 4NN
Solution: (A)
Each point will always be misclassified in 1-NN, which means that you will get 0% accuracy.
33) Suppose you are given the below data and you want to apply a logistic regression model for
classifying it in two given classes.
Where C is the regularization parameter and w1 & w2 are the coefficients of x1 and x2.
Which of the following options is correct when you increase the value of C from zero to a very large value?
Solution: (B)
By looking at the image, we see that we can classify efficiently using x2 alone, so w1 will become 0 first. As the regularization parameter increases further, w2 will come closer and closer to 0.
34) Suppose we have a dataset which can be trained with 100% accuracy with help of a decision
tree of depth 6. Now consider the points below and choose the option based on these points.
Note: All other hyper parameters are same and other factors are not affected.
A) Only 1
B) Only 2
C) Both 1 and 2
Solution: (A)
If you fit a decision tree of depth 4 to such data, it is more likely to underfit. In the case of underfitting, you will have high bias and low variance.
35) Which of the following options can be used to get global minima in k-Means Algorithm?
A) 2 and 3
B) 1 and 3
C) 1 and 2
D) All of the above
Solution: (D)
36) Imagine you are working on a project which is a binary classification problem. You trained
a model on training dataset and get the below confusion matrix on validation dataset.
Based on the above confusion matrix, choose which option(s) below will give you correct
predictions?
1. Accuracy is ~0.91
2. Misclassification rate is ~0.91
3. False positive rate is ~0.95
4. True positive rate is ~0.95
A) 1 and 3
B) 2 and 4
C) 1 and 4
D) 2 and 3
Solution: (C)
The true positive rate measures how often you predict the positive class correctly; here it would be 100/105 ≈ 0.95. It is also known as "Sensitivity" or "Recall".
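The arithmetic from the explanation (only TP = 100 and FN = 5 can be read off the 100/105; the rest of the confusion matrix is not needed for this metric):

```python
TP, FN = 100, 5                  # values taken from the 100/105 above
tpr = TP / (TP + FN)             # true positive rate = sensitivity = recall
```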
37) For which of the following hyperparameters, higher value is better for decision tree
algorithm?
A)1 and 2
B) 2 and 3
C) 1 and 3
D) 1, 2 and 3
E) Can’t say
Solution: (E)
For all three hyperparameters, it is not necessarily true that increasing the value improves performance. For example, with a very high tree depth the resulting tree may overfit the data and not generalize well; on the other hand, with a very low value the tree may underfit. So we can't say for sure that "higher is better".
Context 38-39
Imagine you have a 28 * 28 image and you run a 3 * 3 convolution on it with an input depth of 3 and an output depth of 8.
38) What is the dimension of the output feature map when you are using the given parameters?
Solution: (A)
39) What is the dimension of the output feature map when you are using the following parameters?
Solution: (B)
Same as above.
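For reference, the standard output-size formula is out = (W - F + 2P)/S + 1. With the 28×28 input, a 3×3 kernel, and assumed padding 0 and stride 1 (the question's parameter lists were lost), the output feature map would be 26×26×8:

```python
def conv_out(w, f, p=0, s=1):
    """Output side length of a convolution: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1

side = conv_out(28, 3, p=0, s=1)   # 28 - 3 + 1 = 26
shape = (side, side, 8)            # output depth 8 from the question
```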
40) Suppose, we were plotting the visualization for different values of C (Penalty parameter) in
SVM algorithm. Due to some reason, we forgot to tag the C values with visualizations. In that
case, which of the following option best explains the C values for the images below (1,2,3 left
to right, so C values are C1 for image1, C2 for image2 and C3 for image3 ) in case of rbf kernel.
A) C1 = C2 = C3
B) C1 > C2 > C3
C) C1 < C2 < C3
D) None of these
Solution: (C)
40. Factors which affect the performance of a learner system do not include?
b) Training scenario
c) Type of feedback
Answer: d
Explanation: Factors which affect the performance of a learner system do not include good data structures.
41. Which of the following does not include different learning methods?
a) Memorization
b) Analogy
c) Deduction
d) Introduction
Answer: d
a) Decision trees
b) Neural networks
Answer: d
Explanation: Decision trees, neural networks, propositional rules and FOL rules are all models of learning.
a) Supervised learning
b) Unsupervised learning
c) Active learning
d) Reinforcement learning
Answer: a
Explanation: In an automated vehicle, a set of vision inputs and corresponding actions are available to the learner; hence it is an example of supervised learning.
c) Automated vehicle
Answer: a
Explanation: In active learning, not only is the teacher available, but the learner can ask for suitable perception-action pair examples to improve performance.
45. In which of the following learning the teacher returns reward and punishment to learner?
a) Active learning
b) Reinforcement learning
c) Supervised learning
d) Unsupervised learning
Answer: b
Explanation: Reinforcement learning is the type of learning in which the teacher returns reward or punishment to the learner.
46. Decision trees are appropriate for the problems where ___________
Answer: d
a) Data mining
b) WWW
c) Speech recognition
Answer: d
a) Goal
b) Model
c) Learning rules
Answer: d
Explanation: Goal, model, learning rules and experience are the components of a learning system.
b) Active learning
c) Unsupervised learning
d) Reinforcement learning
Answer: c
Correct option is D
13. Suppose the reinforcement learning player was greedy, that is, it always
played the move that brought it to the position that it rated the best. Might it
learn to play better, or worse, than a non-greedy player?
A. Worse
B. Better
Correct option is B
15. A model that can learn based on the rewards it received for its previous actions is known as:
A. Supervised learning
B. Unsupervised learning
C. Reinforcement learning
D. Concept learning
Correct option is C
16. A genetic learning operation that creates new population elements by combining parts of
two or more existing elements.
a. selection
b. crossover
c. mutation
d. absorption
17. An evolutionary approach to data mining.
a. backpropagation learning
b. genetic learning
c. decision tree learning
d. linear regression
18. The computational complexity as well as the explanation offered by a genetic algorithm is largely
determined by the
a. fitness function
b. techniques used for crossover and mutation
c. training data
d. population of elements
19. This approach is best when we are interested in finding all possible interactions among a set of
attributes.
a. decision tree
b. association rules
c. K-Means algorithm
d. genetic learning
20. Genetic learning can be used to train a feed-forward network. This is accomplished by
having each population element represent one possible
a. network configuration of nodes and links.
b. set of training data to be fed through the network.
c. set of network output values.
d. set of network connection weights.
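Toy versions of the genetic operations behind questions 16-20 (the function names and bit-string representation are our own sketch): crossover creates a new population element by combining parts of two existing elements, and mutation randomly flips a gene.

```python
import random

random.seed(0)

def crossover(parent_a, parent_b):
    """Single-point crossover: child = prefix of A + suffix of B."""
    point = random.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]

def mutate(element, rate=0.1):
    """Flip each gene independently with the given probability."""
    return [1 - g if random.random() < rate else g for g in element]

child = crossover([0, 0, 0, 0], [1, 1, 1, 1])
mutated = mutate(child)
```

In a full genetic algorithm these operators would be driven by a fitness function that selects which elements reproduce.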