MFDS - Test 1 Problems
Question 1:
The data scientists at an FMCG (Fast Moving Consumer Goods) company are trying to come up with a sweet
peptide that can make their juice sweeter without adding as many calories. Given the structure of the
peptide, they are looking for a model that can predict whether the peptide is sweet. The following variables
were used to characterize the peptides.
Option1:
6) is a ratio variable
Option2:
5) is a ratio variable
Option3:
1) and 2) are ordinal variables
Option4:
3) is an ordinal variable
Correct Answer:
6) is a ratio variable
2. Temperature in Celsius is not a ratio variable; it is an interval variable. For instance, 20 °C is not twice
the temperature of 10 °C, but it is higher than 10 °C.
3. Variable 1) is not ordinal; it is nominal, because the values 0 and 1 have no numerical
significance and do not indicate quantity in any way. Variable 2) is ordinal.
4. Variable 3) is not ordinal; it is nominal, because the values 0 and 1 have no numerical
significance and do not indicate quantity in any way.
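The interval-versus-ratio point in explanation 2 can be checked numerically. The sketch below uses a hypothetical helper, `celsius_to_kelvin`, to move both readings onto the kelvin scale, which has a true zero, and compares the two ratios. It is only an illustration of the argument, not part of the original answer key.

```python
def celsius_to_kelvin(temp_c: float) -> float:
    """Convert a Celsius reading to kelvin, whose zero is absolute zero."""
    return temp_c + 273.15


# Ratio computed naively on the Celsius readings (an interval scale).
celsius_ratio = 20.0 / 10.0

# Ratio of the same two temperatures on the kelvin scale (a ratio scale).
kelvin_ratio = celsius_to_kelvin(20.0) / celsius_to_kelvin(10.0)

print(f"Celsius ratio: {celsius_ratio:.3f}")
print(f"Kelvin ratio:  {kelvin_ratio:.3f}")
```

The Celsius ratio is 2, while the kelvin ratio is only about 1.035, so "twice as hot" is not a meaningful statement on the Celsius scale. The difference of 10 degrees is the same on both scales, which is what makes Celsius an interval variable.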
Question 2:
Which of the given algorithms will be able to solve the following problem: glass transition temperature
prediction of a polyhydroxyalkanoate?
Option1:
Decision Trees
Option2:
Quadratic Discriminant Analysis
Option3:
Naive Bayes Algorithm
Option4:
K-Means Clustering
Correct Answer:
Decision Trees
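One reading of this answer is that the glass transition temperature is a continuous quantity, so the task is function approximation, and a decision tree can be fitted to a continuous target. The sketch below is a depth-1 regression tree (a "stump") written from scratch. The function names and the toy numbers are invented for illustration and are not real polymer measurements.

```python
from statistics import mean


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree on one feature.

    Tries every midpoint between consecutive distinct x values and keeps
    the split with the smallest total squared error around the leaf means.
    """
    points = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(points)):
        if points[i][0] == points[i - 1][0]:
            continue  # no threshold fits between identical x values
        threshold = (points[i - 1][0] + points[i][0]) / 2
        left = [y for x, y in points if x <= threshold]
        right = [y for x, y in points if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct x values to split")
    return best[1], best[2], best[3]


def predict_stump(stump, x):
    """Return the mean target of the leaf that x falls into."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


# Toy data (hypothetical): one numeric descriptor and a continuous target.
feature = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
target = [5.0, 6.0, 7.0, 40.0, 41.0, 42.0]

stump = fit_stump(feature, target)
print(stump)
print(predict_stump(stump, 2.5), predict_stump(stump, 11.0))
```

The prediction is a number rather than a class label, which is what the problem needs. A full decision tree would apply the same split search recursively inside each leaf.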
Question 3:
What supervised technique can you use to approach the following problem?
A bank wants to decide whether a person will default on their loan, using the data collected over loan
activities in the past five years.
Option1:
Not applicable
Option2:
Not enough information
Option3:
Function Approximation
Option4:
Classification
Correct Answer:
Classification
Question 4:
Option1:
Given historical data of temperature and rainfall, find the coldest year when average temperature was at
least 20 degrees Celsius.
Option2:
A chips company collects data on the weight of its chip packets leaving the line. Due to the speed of the
assembly line, only a fraction of the packets can be weighed. Estimate, on average, how different a
packet is from the mentioned value.
Option3:
Given the pattern of vehicles moving on a bridge, and the maximum weight it can handle, find the
expected amount of time it would take to collapse.
Option4:
Given employee salary, performance and reputation data, how likely is it that an employee will not leave the
job for the next 2 years?
Answer:
Given employee salary, performance and reputation data, how likely is it that an employee will not leave the
job for the next 2 years?
Question 5:
A college student wants to predict if a new movie will be a hit or a flop (H/F). A movie is a commercial
hit if it makes more money in revenue than its production budget, and a flop otherwise. She notes
down budget (B), ticket cost (C), average rating by critics (R) and also the total revenue of all movies
shown in the local theater in the past five years.
The student has to identify a movie as H/F, given its B, C and R values. As a friend, you want to help out,
so you decide to phrase it as a Machine Learning problem. Which of the following are possible
approaches?
Option1:
Function approximation with 'H'/'F' as the data matrix, and Total Revenue as the output.
Option2:
Classification with B C R values as the data matrix, and Total Revenue as the output.
Option3:
Classification with 'H'/'F' as the data matrix, and Total Revenue as the output.
Option4:
Function approximation with B C R values as the data matrix, and Total Revenue as the output.
Answer:
Function approximation with B C R values as the data matrix, and Total Revenue as the output.
Here, the problem of identifying a movie as H/F can be approached by building a classification model
directly, or by estimating the Total Revenue using a regression model and comparing Total Revenue against
the Budget B. Therefore, both classification and function approximation solutions are viable for this
problem. In both cases, the input data for the model will consist of Budget, Ticket Cost and Rating (B, C, R
values), which will form the data or feature matrix. In the case of function approximation, the output is Total
Revenue, and in the case of classification, the output is H/F. Option 4 is the only option among the four that
contains one of the potential solutions.
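The two routes described above can be sketched side by side. The example below is a minimal illustration under stated assumptions: the four movies are invented toy rows, and a nearest-neighbour lookup stands in for whatever classifier or regressor would actually be used. Both routes take the same B, C, R feature matrix and differ only in the output they learn from.

```python
import math

# Toy movies, invented for illustration:
# (budget B, ticket cost C, critic rating R, total revenue)
movies = [
    (10.0, 8.0, 9.0, 30.0),
    (50.0, 12.0, 4.0, 20.0),
    (20.0, 10.0, 7.0, 45.0),
    (80.0, 15.0, 5.0, 60.0),
]

# The feature matrix is the same for both routes.
feature_matrix = [(b, c, r) for b, c, r, _ in movies]

# Output for function approximation: total revenue (continuous).
revenues = [revenue for *_, revenue in movies]

# Output for classification: hit if revenue exceeds budget, flop otherwise.
labels = ["H" if revenue > b else "F" for b, _, _, revenue in movies]


def nearest_index(rows, query):
    """Index of the row closest to query in Euclidean distance."""
    return min(range(len(rows)), key=lambda i: math.dist(rows[i], query))


def classify_directly(query):
    """Route 1: predict the H/F label itself."""
    return labels[nearest_index(feature_matrix, query)]


def classify_via_revenue(query):
    """Route 2: estimate total revenue, then compare it with the budget."""
    predicted_revenue = revenues[nearest_index(feature_matrix, query)]
    budget = query[0]
    return "H" if predicted_revenue > budget else "F"


new_movie = (12.0, 9.0, 8.0)  # hypothetical B, C, R values
print(classify_directly(new_movie), classify_via_revenue(new_movie))
```

In route 2 the H/F decision is not learned at all. It is derived afterwards from the estimated revenue and the known budget, which is why function approximation with Total Revenue as the output is a valid formulation.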
Question 6:
Which of the below procedures would you expect to follow when solving a typical machine learning
problem?
Option1:
Collecting labels for training a classifier through linear algebra and matrix operations.
Option2:
Evaluating a model by comparing its dataset with another model’s feature matrix.
Option3:
Converting experimental data into a ML model to construct the feature matrix.
Option4:
Deciding which classifier algorithm to use for the given dataset.
Answer:
Deciding which classifier algorithm to use for the given dataset.
1. Collection of labels for training cannot be done through linear algebra and matrix operations alone. If
this were the case, we would not need to train a classifier in the first place. Generally, labels are
collected through performing experiments or manually assigned.
2. Model evaluation is not done by comparing datasets and feature matrices. Rather, it is done by
applying the model to predict on a dataset which was not used in the training process. For
comparing two models’ performances, they should both be trained on the same dataset and
evaluated on a different dataset.
3. Experimental data is converted to a feature matrix before training the ML model. Therefore, the ML
model is not generated from raw data before construction of the feature matrix.
4. Deciding the classifier algorithm is an important step in any classification task. In general, multiple
algorithms are tried, and the best performing one is chosen to be the classifier for deployment.
Hence, this option is the correct answer.
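Points 2 and 4 above can be illustrated together: several candidate classifiers are trained on the same training set, scored on data that was not used for training, and the best one is kept. The sketch below assumes toy one-dimensional data and two deliberately simple candidates, a majority-class baseline and a 1-nearest-neighbour rule. All names and numbers are hypothetical.

```python
from collections import Counter

# Toy labelled data, invented for illustration: (feature value, label).
train = [(1.0, "A"), (1.5, "A"), (2.0, "A"), (8.0, "B"), (9.0, "B")]
test = [(1.2, "A"), (8.5, "B"), (7.5, "B"), (2.2, "A")]  # held out


def fit_majority(rows):
    """Baseline: always predict the most frequent training label."""
    most_common_label = Counter(label for _, label in rows).most_common(1)[0][0]
    return lambda x: most_common_label


def fit_nearest_neighbour(rows):
    """1-NN: predict the label of the closest training point."""
    def predict(x):
        _, label = min(rows, key=lambda row: abs(row[0] - x))
        return label
    return predict


def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)


# Both candidates see the same training data...
candidates = {
    "majority": fit_majority(train),
    "1-NN": fit_nearest_neighbour(train),
}

# ...and are compared on the same held-out data.
scores = {name: accuracy(model, test) for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
print(scores, best_name)
```

Scoring on `train` instead of `test` would reward memorization, which is why the comparison uses data the models have not seen.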
Question 7:
State whether each of these statements is true or false in the context of data science.
1. Complex, state-of-the-art data science algorithms need not outperform simple models -
True (complex models can run into problems of overfitting or poor usability on simpler tasks or small
data)
2. Machine Learning problems always involve training ML models and subsequently applying these
models to solve the problem statement -
False (model-free ML techniques exist, such as KNN)
3. When choosing between a simple and a complex ML algorithm, model prediction performance after
training is the only factor to consider -
False (Other factors such as scalability, interpretability, computational time needed also need to be
considered)
4. Machine Learning models, after training, are able to make accurate predictions, but cannot give
insights as to why they are so effective at doing so -
False (Many models are able to give us some insight into what drives their predictions, or
in the case of deep learning, what type of features they learn)
6. Function approximation in ML problems is not always done through estimating the coefficients of a
parametric model -
True (for example, KNN approximates a function without estimating any coefficients)
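The KNN remark in statements 2 and 6 can be made concrete. The sketch below is a minimal k-nearest-neighbour regressor on one feature: there is no training step and no coefficient is estimated, and the stored data itself is used at prediction time. The function name and the toy data are assumptions made for illustration.

```python
def knn_predict(train_x, train_y, query, k=3):
    """Average the targets of the k training points closest to query."""
    by_distance = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - query))
    neighbours = by_distance[:k]
    return sum(y for _, y in neighbours) / len(neighbours)


# Toy data, invented for illustration: samples of y = x ** 2.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [0.0, 1.0, 4.0, 9.0, 16.0]

# No fit step: predictions are computed directly from the stored samples.
estimate_k3 = knn_predict(train_x, train_y, 2.1, k=3)
estimate_k1 = knn_predict(train_x, train_y, 2.1, k=1)
print(estimate_k3, estimate_k1)
```

With `k=3` the three nearest points are x = 2, 3 and 1, so the estimate is the mean of 4, 9 and 1. The only quantity to choose is `k`, which is a setting of the method rather than a coefficient of a parametric model.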
Question 8:
We wish to build a machine that can play Go (Chinese board game) against an opponent. A group of
people came up with three possible ways to do so:
Method 1: Learn using a dataset of historic matches curated by experts that contains a list of moves
made and their respective impact on winning chances as per expert opinion.
Method 2: Learn by actively playing against opponents and analyzing the results as the game
progresses.
Method 3: Learn by watching several Go games and analyzing the moves made.
Option1:
Method 3 is Reinforcement learning
Option2:
Method 3 is Supervised learning
Option3:
Method 2 is Unsupervised learning
Option4:
Method 1 is Unsupervised learning
Answer:
Method 2 is Unsupervised learning
Method 1 requires pre-existing knowledge in the form of an expertly curated dataset. This type of learning is
called supervised learning.
Method 2 requires the machine to interact with the environment (by playing against others) in order to
improve its own performance in the same task. This type of learning is called reinforcement learning.
Additionally, no pre-existing knowledge is used to train the machine’s decisions, and therefore it may also
be considered an unsupervised learning process.
Method 3 also does not require any pre-existing knowledge, and so comes under unsupervised learning.
However, as the machine is not actively interacting with the environment, it cannot be considered
reinforcement learning.
Therefore, Option 3 is the only statement that is correct among the four. Hence it is the answer.
Question 9:
A small bank has data about its customers, and they wish to predict whether a customer will default or
not in the future. Given the data:
Option1:
We can use clustering followed by regression to solve this problem
Option2:
Column 6 contains nominal data
Option3:
We can use clustering followed by classification
Option4:
The no. of samples in the data is 6
Answer:
We can use a classification algorithm to solve this problem
END