MFDS - Test 1 Problems

CH5019: Test-1 Problems

Question 1:

The data scientists at an FMCG (Fast Moving Consumer Goods) company are trying to come up with a sweet
peptide that can make their juice sweeter without adding as many calories. Given the structure of the
peptide, they are looking for a model that can predict whether the peptide is sweet. The following variables
were used to characterize the peptides.

1) Whether the peptide is soluble (0) or insoluble (1)
2) Safety level of the peptide for consumption: 3 for highly safe, 2 for moderately safe, 1 for unsafe
3) The presence of a given amino acid bigram in the sequence (e.g., if G (Glycine) and V (Valine) are
amino acids, VG is a bigram; coded 0/1 based on its presence or absence in the sequence)
4) pH of the peptide solution
5) Characteristic temperature of protein folding in degrees Celsius
6) Chemical charge
Identify the correct statement from the following.

Option1:
6) is a ratio variable

Option2:
5) is a ratio variable

Option3:
1) and 2) are ordinal variables

Option4:
3) is an ordinal variable

Correct Answer:
6) is a ratio variable

What concept we test:


Students should be able to distinguish between variable data types based on the context given. Looking at
the options:
1. Chemical charge has a true zero (no net charge), and a charge of +2 is twice as much as a charge of
+1. Ratios are therefore meaningful, which makes it a ratio variable, even though it can take both
positive and negative values.

2. Temperature in Celsius is an interval variable, not a ratio variable. For instance, 20 °C is higher than
10 °C, but it is not twice the temperature of 10 °C, because 0 °C is an arbitrary zero.

3. Variable 1) is nominal, not ordinal. The values 0 and 1 are labels with no numerical significance;
they do not indicate quantity or order. Variable 2) is ordinal.

4. Variable 3) is nominal, not ordinal, for the same reason: the values 0 and 1 are labels that do not
indicate quantity or order.
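The interval-versus-ratio distinction for temperature can be checked numerically. The short sketch below converts the two Celsius readings to kelvin, a scale with a true zero, and compares differences and ratios. The function name `celsius_to_kelvin` is chosen here for illustration only.

```python
def celsius_to_kelvin(t_celsius):
    """Convert a Celsius reading to kelvin, a scale with a true zero."""
    return t_celsius + 273.15

# Ratio of the raw Celsius readings versus ratio on the kelvin scale
ratio_celsius = 20.0 / 10.0
ratio_kelvin = celsius_to_kelvin(20.0) / celsius_to_kelvin(10.0)

# Differences survive the change of zero point; ratios do not
diff_celsius = 20.0 - 10.0
diff_kelvin = celsius_to_kelvin(20.0) - celsius_to_kelvin(10.0)

print(ratio_celsius, round(ratio_kelvin, 3))
print(diff_celsius, round(diff_kelvin, 3))
```

The difference is the same on both scales, while the apparent factor of 2 in Celsius shrinks to about 1.035 in kelvin, which is why Celsius readings support differences but not ratios.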
Question 2:

Which of the given algorithms can be used to solve the following problem: prediction of the glass
transition temperature of a polyhydroxyalkanoate?

Option1:
Decision Trees

Option2:
Quadratic Discriminant Analysis

Option3:
Naive Bayes Algorithm

Option4:
K-Means Clustering

Correct Answer:
Decision Trees

What concept we test:


Students should have a broad idea of where a particular algorithm can be applied.
The problem given requires the prediction of glass transition temperature, which is a continuous quantity.
It is implied that this transition temperature is influenced by other physical properties of the polymer.
Therefore, a relation between the transition temperature and other properties of the molecule can be
determined using any regression-based algorithm. Quadratic Discriminant Analysis and Naive Bayes are
classification methods, and K-Means is a clustering method. Decision Trees is the only option in the set
that can be applied to regression.
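To illustrate how a tree can output a continuous value, here is a minimal sketch of a depth-1 regression tree (a "stump") in plain Python. It picks the single threshold that minimizes the squared error and predicts the mean target in each leaf. The feature (side-chain length) and all numbers are made up for illustration and are not taken from the question; the sketch also assumes at least two distinct feature values.

```python
def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one threshold, one mean per leaf."""
    best = None
    pairs = sorted(zip(xs, ys))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        # Sum of squared errors if each leaf predicts its own mean
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean

def predict_stump(stump, x):
    """Return the leaf mean on the side of the threshold where x falls."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

# Made-up data: side-chain length (carbons) vs. glass transition temperature (deg C)
side_chain_length = [1, 2, 3, 5, 6, 7]
tg_celsius = [4.0, 2.0, 0.0, -30.0, -34.0, -38.0]

stump = fit_stump(side_chain_length, tg_celsius)
print(stump)
print(predict_stump(stump, 2), predict_stump(stump, 6))
```

A full decision tree repeats this split recursively inside each leaf; the prediction remains a numeric value, which is what makes the method usable for regression.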
Question 3:

What supervised technique can you use to approach the following problem?
A Bank wants to decide whether a person will default on their loan, using the data collected over loan
activities in the past five years.

Option1:
Not applicable

Option2:
Not enough information

Option3:
Function Approximation

Option4:
Classification

Correct Answer:
Classification

What concept we test:


Before choosing between classification and regression, we should first check whether there is enough labelled
data for supervised learning, or whether unsupervised learning is more suitable.
It is a reasonable assumption that the data collected over loan activities in the past five years will
contain usable information about the bank's clients, including whether they defaulted on their loans. This
gives enough information to solve the problem using supervised classification techniques.
Question 4:

Which of the following problems should be approached using Machine Learning?

Option1:
Given historical data of temperature and rainfall, find the coldest year in which the average temperature
was at least 20 degrees Celsius.

Option2:
A chips company collects data on the weight of chips packets leaving its assembly line. Due to the speed of
the line, only a fraction of the packets can be weighed. Estimate how much, on average, a packet's weight
differs from the stated value.

Option3:
Given the pattern of vehicles moving on a bridge, and the maximum weight it can handle, find the
expected amount of time it would take to collapse.

Option4:
Given employee salary, performance and reputation data, how likely is it that an employee will not leave the
job for the next 2 years?

Answer:
Given employee salary, performance and reputation data, how likely is it that an employee will not leave the
job for the next 2 years?

What concept we test:


ML uses statistics, but ML is not statistics.
Machine learning algorithms are used to find, or ‘learn’, hidden relationships in data that cannot be
derived directly from first principles. Looking at the options:
1. The required data can be retrieved via a query on the given dataset. ML algorithms are unnecessary
here.
2. The average deviation of a packet's weight from the stated value is a statistic that can be
estimated directly from the sampled packets, using the formula for mean absolute deviation. Thus,
ML algorithms are not needed here.
3. The pattern of vehicles moving on a bridge will provide a probability distribution of the load carried
by the bridge at any given time. The next step is to find the expected time before this load exceeds
the maximum capacity, which can be arrived at analytically or using numerical methods. In either
case, ML algorithms are not needed.
4. The likelihood of an employee leaving is closely linked with data records of past employees, the
qualities of the employee in question, and many other unseen factors. No precise mathematical
model may be arrived at to perfectly explain all the intricacies involved in this decision process.
However, a suitable ML model trained on the available data could provide a prediction for the same,
so that the company may take action accordingly.
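As a rough illustration of why options 1 and 2 need no learning step, both can be computed directly. The sketch below uses made-up records; the variable names and values are assumptions for illustration only.

```python
# Hypothetical records: (year, average temperature in deg C)
yearly_avg_temp = [(2015, 21.4), (2016, 19.2), (2017, 20.0), (2018, 22.1), (2019, 20.6)]

# Option 1: a filter followed by a minimum, i.e. a plain query
eligible = [(temp, year) for year, temp in yearly_avg_temp if temp >= 20.0]
coldest_temp, coldest_year = min(eligible)

# Option 2: mean absolute deviation of the sampled weights from the stated weight
stated_weight_g = 50.0
sampled_weights_g = [49.0, 51.0, 50.5, 48.5, 50.0]
mean_abs_deviation = sum(abs(w - stated_weight_g) for w in sampled_weights_g) / len(sampled_weights_g)

print(coldest_year, coldest_temp)
print(mean_abs_deviation)
```

Neither computation involves fitting a model: the first is a lookup, and the second is a summary statistic of the sample.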
Question 5:

A college student wants to predict if a new movie will be a hit or a flop (H/F). A movie is a commercial
hit if it makes more money in revenue than its production budget, and a flop otherwise. She notes
down budget (B), ticket cost (C), average rating by critics (R) and also the total revenue of all movies
shown in the local theater in the past five years.

The student has to identify a movie as H/F, given its B, C and R values. As a friend, you want to help out,
so you decide to phrase it as a Machine Learning problem. Which of the following are possible
approaches?

Option1:
Function approximation with 'H'/'F' as the data matrix, and Total Revenue as the output.

Option2:
Classification with B C R values as the data matrix, and Total Revenue as the output.

Option3:
Classification with 'H'/'F' as the data matrix, and Total Revenue as the output.

Option4:
Function approximation with B C R values as the data matrix, and Total Revenue as the output.

Answer:
Function approximation with B C R values as the data matrix, and Total Revenue as the output.

What concept we test:


Students should be able to readily map a real-world problem to an ML problem, and identify the key data
components: the feature matrix and the output matrix.

Here, the problem of identifying a movie as H/F can be approached by building a classification model
directly, or by estimating the Total Revenue using a regression model and comparing Total Revenue against
the Budget B. Therefore, both classification and function approximation solutions are viable for this
problem. In both cases, the input data for the model will consist of Budget, Ticket Cost and Rating (B, C, R
values) which will form the data or feature matrix. In case of function approximation, the output is Total
Revenue, and in case of classification, the output is H/F. Option 4 is the only option among the four that
contains one of the potential solutions.
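The two formulations can be laid out side by side. In the sketch below the records are invented, and the helper `label_from_predicted_revenue` is a hypothetical name; it only shows how a regression output would be turned into H/F using the rule stated in the question.

```python
# Hypothetical records: budget B, ticket cost C, critic rating R, total revenue
movies = [
    {"B": 100.0, "C": 10.0, "R": 8.1, "revenue": 250.0},
    {"B": 80.0, "C": 12.0, "R": 4.3, "revenue": 35.0},
    {"B": 20.0, "C": 8.0, "R": 6.9, "revenue": 20.0},
]

# Feature (data) matrix shared by both formulations: one row of B, C, R per movie
X = [[m["B"], m["C"], m["R"]] for m in movies]

# Function approximation: the output is the total revenue
y_regression = [m["revenue"] for m in movies]

# Classification: the output is H/F, derived from revenue versus budget
y_classification = ["H" if m["revenue"] > m["B"] else "F" for m in movies]

def label_from_predicted_revenue(predicted_revenue, budget):
    """Turn a regression output into H/F: a hit only if revenue exceeds budget."""
    return "H" if predicted_revenue > budget else "F"

print(X)
print(y_regression)
print(y_classification)
```

In both cases the feature matrix is the same; only the output column changes.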
Question 6:

Which of the below procedures would you expect to follow when solving a typical machine learning
problem?

Option1:
Collecting labels for training a classifier through linear algebra and matrix operations.

Option2:
Evaluating a model by comparing its dataset with another model’s feature matrix.

Option3:
Converting experimental data into a ML model to construct the feature matrix.

Option4:
Deciding which classifier algorithm to use for the given dataset.

Answer:
Deciding which classifier algorithm to use for the given dataset.

What concept we test:


Students should be clear on the high-level overview of an ML process, and the significance of each step of
the process. Every procedure followed has its own significance.

In the given question, looking at the options:

1. Collection of labels for training cannot be done through linear algebra and matrix operations alone. If
this were the case, we would not need to train a classifier in the first place. Generally, labels are
collected by performing experiments or are assigned manually.
2. Model evaluation is not done by comparing datasets and feature matrices. Rather, it is done by
applying the model to predict on a dataset which was not used in the training process. For
comparing two models’ performances, they should both be trained on the same dataset and
evaluated on a different dataset.
3. Experimental data is converted to a feature matrix before training the ML model. Therefore, the ML
model is not generated from raw data before construction of the feature matrix.
4. Deciding the classifier algorithm is an important step in any classification task. In general, multiple
algorithms are tried, and the best performing one is chosen to be the classifier for deployment.
Hence, this option is the correct answer.
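Points 2 and 4 can be sketched together: two candidate classifiers are trained on the same data, scored on data not used for training, and the better one is kept. The data and both toy classifiers below are assumptions made for illustration, not methods prescribed by the question.

```python
# Hypothetical labelled data: (feature value, class label)
train = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 0)]
held_out = [(1.5, 0), (2.5, 0), (6.5, 1), (7.5, 1)]

def fit_majority(data):
    """Baseline: always predict the most frequent training label."""
    labels = [label for _, label in data]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def fit_threshold(data):
    """Predict 1 above the midpoint of the two class means.

    Assumes class 1 tends to have larger feature values than class 0.
    """
    xs0 = [x for x, label in data if label == 0]
    xs1 = [x for x, label in data if label == 1]
    midpoint = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: 1 if x > midpoint else 0

def accuracy(model, data):
    """Fraction of examples in data that the model labels correctly."""
    return sum(model(x) == label for x, label in data) / len(data)

# Both candidates are trained on the same data and scored on the same unseen data
candidates = {"majority": fit_majority(train), "threshold": fit_threshold(train)}
scores = {name: accuracy(model, held_out) for name, model in candidates.items()}
best_name = max(scores, key=scores.get)

print(scores)
print(best_name)
```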
Question 7:

State whether the following statements are true or false in the context of data science.

1. Complex and State of the Art Data Science Algorithms need not outperform simple models -
True (complex models can run into problems of overfitting or usability for simpler tasks or small
data)

2. Machine Learning problems always involve training ML models and subsequently applying these
models to solve the problem statement -
False (techniques with no explicit training phase exist, such as KNN)

3. When choosing between a simple and a complex ML algorithm, model prediction performance after
training is the only factor to consider -
False (Other factors such as scalability, interpretability, computational time needed also need to be
considered)

4. Machine Learning models, after training, are able to make accurate predictions, but cannot give
insights as to why they are so effective at doing so-
False (A lot of models are indeed able to give us some insights into what drives their predictions, or
in the case of DL, what type of features they learn)

5. Domain Knowledge is useful while solving problems using data science -
True (Domain knowledge significantly helps us in choosing the right models and inductive biases)

6. Function approximation in ML problems is not always done through estimating the coefficients of a
parametric model -
True (for example, KNN predicts directly from stored training data without estimating coefficients)
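As a minimal sketch of the KNN point in statements 2 and 6, the function below approximates a function by averaging the targets of the nearest stored training points. No coefficients are estimated and there is no training step. The data and the name `knn_predict` are illustrative assumptions.

```python
def knn_predict(train_x, train_y, query, k=3):
    """Average the targets of the k training points nearest to the query."""
    by_distance = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query))
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Made-up samples of an unknown function (here y = x squared)
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]

# The three nearest neighbours of 2.2 are x = 2, 3 and 1
prediction = knn_predict(train_x, train_y, query=2.2, k=3)
print(round(prediction, 3))
```

The "model" here is just the stored data plus a distance rule, in contrast to a parametric model whose coefficients are fitted once and then reused.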
Question 8:

We wish to build a machine that can play Go (Chinese board game) against an opponent. A group of
people came up with three possible ways to do so:

Method 1: Learn using a dataset of historic matches curated by experts that contains a list of moves
made and their respective impact on winning chances as per expert opinion.
Method 2: Learn by actively playing against opponents and analyzing the results as the game
progresses.
Method 3: Learn by watching several Go games and analyzing the moves made.

Option1:
Method 3 is Reinforcement learning

Option2:
Method 3 is Supervised learning

Option3:
Method 2 is Unsupervised learning

Option4:
Method 1 is Unsupervised learning

Answer:
Method 2 is Unsupervised learning

What concept we test:


There are multiple approaches to solve a single ML problem. Approaches differ in their type of data
collected and the type of interaction with the environment. Unsupervised Learning, Supervised Learning
and Reinforcement Learning are three such approaches.

Method 1 requires pre-existing knowledge in the form of an expertly curated dataset. This type of learning is
called supervised learning.

Method 2 requires the machine to interact with the environment (by playing against others) in order to
improve its own performance in the same task. This type of learning is called reinforcement learning.
Additionally, no pre-existing knowledge is used to train the machine’s decisions, and therefore it may also
be considered an unsupervised learning process.

Method 3 also does not require any pre-existing knowledge, and so comes under unsupervised learning.
However, as the machine is not actively interacting with the environment, it cannot be considered
reinforcement learning.

Therefore, Option 3 is the only correct statement among the four, and hence it is the answer.
Question 9:

A small bank has data about its customers, and they wish to predict whether a customer will default or
not in the future. Given the data:

Option1:
We can use clustering followed by regression to solve this problem

Option2:
Column 6 contains nominal data

Option3:
We can use clustering followed by classification

Option4:
The no. of samples in the data is 6

Answer:
We can use clustering followed by classification

What concept we test:


Just by observing the matrix entries, many inferences regarding the data and problem can be made.

In the given options:


1. Here the task is to predict whether a customer will default or not. In this case, classification methods
are to be preferred rather than function approximation methods.
2. Column 6 takes discrete values in {10, 20, 30}. We cannot say it is interval data just yet, but we can
infer that any entry with the value 30 for example would be higher in some sense than an entry with
the value of 10 for this column. Hence, it can be considered ordinal data, and not nominal data.
3. Classification algorithms are suited to solve the problem statement. Since there is no label explicitly
specified, clustering may be used to group the data points into two categories. Based on further
inspection, the two categories may be given appropriate labels. A classification model can then be
trained. Hence, this is the most suitable answer.
4. The number of samples corresponds to the number of rows which is 12, and not 6.
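The clustering-then-classification route described in point 3 can be sketched as follows. The bank's data table is not reproduced here, so the single feature, its values, and the mapping from clusters to labels are all hypothetical. The clustering is a basic two-cluster k-means in one dimension and assumes at least two distinct values.

```python
def two_means_1d(values, iterations=10):
    """Split 1-D values into two clusters with a basic k-means loop (k = 2)."""
    low, high = min(values), max(values)  # deterministic initial centres
    for _ in range(iterations):
        assignments = [0 if abs(v - low) <= abs(v - high) else 1 for v in values]
        group0 = [v for v, a in zip(values, assignments) if a == 0]
        group1 = [v for v, a in zip(values, assignments) if a == 1]
        low = sum(group0) / len(group0)
        high = sum(group1) / len(group1)
    return assignments, (low, high)

# Hypothetical unlabelled feature: fraction of past payments that were missed
missed_fraction = [0.05, 0.10, 0.08, 0.70, 0.65, 0.80]

# Step 1: clustering groups the customers without using any labels
assignments, (low_centre, high_centre) = two_means_1d(missed_fraction)

# Step 2: after inspection, each cluster is given a label (an assumed mapping)
cluster_labels = {0: "no default", 1: "default"}
labels = [cluster_labels[a] for a in assignments]

# Step 3: a classifier is trained on the now-labelled data; here it is a
# simple threshold halfway between the two class means
class_means = {}
for name in cluster_labels.values():
    members = [v for v, lab in zip(missed_fraction, labels) if lab == name]
    class_means[name] = sum(members) / len(members)
threshold = (class_means["no default"] + class_means["default"]) / 2

def classify(value):
    """Label a new customer using the threshold learned in step 3."""
    return "default" if value > threshold else "no default"

print(labels)
print(classify(0.5), classify(0.2))
```

The labels produced in step 2 are only as good as the inspection that assigns them, which is why the explanation calls for further inspection before training.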

END
