Data Science Interview Q's - V
Hi there, thanks for the continued support of my previous articles. Today we
will continue from our previous article, “Data Science Interview Q’s — IV” PART-IV,
with more of the commonly asked essential questions that interviewers use to gauge
root-level knowledge of DS, rather than going for fancy advanced questions.
The decision boundary is a line (or, more generally, a surface) that separates the
target variable into different classes. A decision boundary can be either linear or
nonlinear. In the case of a logistic regression model, the decision boundary is a
straight line.
The logistic regression model has the form log-odds = α + β1X1 + β2X2 + … + βkXk,
which is linear in the features, so the boundary where the log-odds equal zero is a
straight line. Logistic regression is only suitable in cases where a straight line is
able to separate the different classes. If a straight line cannot do this, then
nonlinear algorithms should be used to achieve better results.
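As a quick illustration, here is a minimal sketch (assuming scikit-learn and a synthetic two-feature dataset) that fits a logistic regression and prints the equation of its linear decision boundary:

```python
# Minimal sketch: fit logistic regression on synthetic data and inspect
# the linear decision boundary (assumes scikit-learn is available).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_informative=2, random_state=42)
clf = LogisticRegression().fit(X, y)

# The boundary is the set of points where the log-odds equal zero:
# alpha + b1*x1 + b2*x2 = 0, i.e. a straight line in feature space.
alpha = clf.intercept_[0]
b1, b2 = clf.coef_[0]
print(f"Boundary: {alpha:.2f} + {b1:.2f}*x1 + {b2:.2f}*x2 = 0")
```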
The likelihood function is the joint probability of observing the data. For example,
let’s assume that a coin is tossed 100 times and we want to know the probability of
getting 60 heads from the tosses. This example follows the binomial distribution
formula.
p = the probability of heads in a single coin toss
n = 100 (the number of coin tosses)
x = 60 (the number of heads — successes)
n − x = 40 (the number of tails)
Pr(X = 60 | n = 100, p)
The likelihood function is the probability of observing 60 heads in a trial of 100
coin tosses, where the probability of heads in each coin toss is p. Here the
coin-toss result follows a binomial distribution.
This can be reframed as follows:
Pr(X = 60 | n = 100, p) = c × p^60 × (1 − p)^(100−60) = c × p^60 × (1 − p)^40
c = a constant (the binomial coefficient C(100, 60), which does not depend on p)
p = the unknown parameter
The likelihood function gives the probability of observing the results using unknown
parameters.
The MLE chooses the set of unknown parameters (the estimator) that maximises the
likelihood function. The standard method of finding the MLE is calculus: set the
derivative of the (log-)likelihood function with respect to the unknown parameter
to zero, and solving it will give the MLE. For a binomial model this is easy (here
it gives the intuitive estimate p̂ = x/n = 60/100 = 0.6), but for a logistic model
the calculations are complex, so computer programs are used for deriving the MLE
for logistic models.
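As a cross-check, here is a minimal numeric sketch (assuming SciPy) that evaluates the binomial likelihood of the coin-toss example over a grid of candidate values of p and picks the maximiser:

```python
# Minimal sketch: maximise the binomial likelihood over a grid of p.
import numpy as np
from scipy.stats import binom

n, x = 100, 60
p_grid = np.linspace(0.01, 0.99, 999)
likelihood = binom.pmf(x, n, p_grid)   # Pr(X=60 | n=100, p) for each candidate p
p_hat = p_grid[np.argmax(likelihood)]
print(f"MLE of p: {p_hat:.2f}")        # ~0.60, matching the analytic result x/n
```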
(Here’s another approach to answering the question.)
MLE is a statistical approach to estimating the parameters of a mathematical model.
MLE and ordinary least squares (OLS) estimation give the same results for linear
regression if the errors of the dependent variable are assumed to be normally
distributed. MLE does not assume anything about the distribution of the
independent variables.
4. What are the different methods of MLE and when is each method
preferred?
In the case of logistic regression, there are two approaches to MLE: the conditional
and the unconditional method. These are algorithms that use different likelihood
functions. The unconditional formula employs the joint probability of positives
(for example, churn) and negatives (for example, non-churn). The conditional
formula is the ratio of the probability of the observed data to the probability
of all possible configurations.
The unconditional method is preferred when the number of parameters is low
compared to the number of instances. If the number of parameters is high relative
to the number of instances, then conditional MLE is preferred. Statisticians
suggest using conditional MLE when in doubt, as it tends to give less biased
results.
7. Why can’t we use Mean Square Error (MSE) as a cost function for
logistic regression?
In logistic regression, the sigmoid applies a nonlinear transformation to obtain
probabilities. Plugging that transformation into MSE yields a non-convex cost
function with local minima, so gradient descent is not guaranteed to find the
global minimum. Instead, log loss (cross-entropy), which is convex for logistic
regression, is used as the cost function.
8. Why is accuracy not a good measure for classification problems?
Accuracy is not a good measure for classification problems because it gives equal
importance to false positives and false negatives. However, this may not be the
case in most business problems. For example, in the case of cancer prediction,
declaring a cancer benign is far more serious than wrongly informing the patient
that he is suffering from cancer. Accuracy gives equal importance to both cases
and cannot differentiate between them.
9. What are the true positive rate (TPR), true negative rate (TNR),
false-positive rate (FPR), and false-negative rate (FNR)?
TPR is the proportion of actual positives that are correctly predicted as positive.
In simple words, it is the frequency of correctly predicted positive labels.
TPR = TP / (TP + FN)
TNR is the proportion of actual negatives that are correctly predicted as negative.
It is the frequency of correctly predicted negative labels.
TNR = TN / (TN + FP)
FPR is the proportion of actual negatives that are incorrectly predicted as
positive. It is the frequency of incorrectly predicted positive labels.
FPR = FP / (FP + TN)
FNR is the proportion of actual positives that are incorrectly predicted as
negative. It is the frequency of incorrectly predicted negative labels.
FNR = FN / (FN + TP)
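Here is a minimal sketch (assuming scikit-learn and toy labels, used purely for illustration) that derives all four rates from a confusion matrix:

```python
# Minimal sketch: compute TPR, TNR, FPR, FNR from a confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # toy ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # toy predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TPR:", tp / (tp + fn))  # sensitivity / recall
print("TNR:", tn / (tn + fp))  # specificity
print("FPR:", fp / (fp + tn))  # 1 - TNR
print("FNR:", fn / (fn + tp))  # 1 - TPR
```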
The F-measure is the harmonic mean of precision and recall. In some cases, there
is a trade-off between precision and recall; in such cases, the F-measure will
drop. It is high only when both precision and recall are high. Depending on the
business case at hand and the goal of the data analysis, an appropriate metric
should be selected.
F-measure = 2 × (Precision × Recall) / (Precision + Recall)
Lift is the improvement in model performance (increase in true positive rate)
compared to random performance. Random performance means that if 50% of the
instances are targeted, the model is expected to detect 50% of the positives. If
a model performs better than random, its lift is greater than 1.
In a lift curve, lift is plotted on the Y-axis and the percentage of the population
(sorted by predicted score in descending order) on the X-axis. At a given
percentage of the target population, a model with a higher lift is preferred.
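A minimal sketch of the idea (with hypothetical labels and scores, assuming NumPy): lift at the top decile is the positive rate in the targeted slice divided by the overall positive rate:

```python
# Minimal sketch: lift at the top 10% of the score-ranked population.
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)               # toy labels
scores = y_true * 0.3 + rng.random(1000) * 0.7  # toy scores correlated with labels

order = np.argsort(scores)[::-1]       # sort population by score, descending
top10 = order[: len(order) // 10]      # target the top 10% of the population
lift = y_true[top10].mean() / y_true.mean()
print(f"Lift at 10%: {lift:.2f}")      # > 1 means better than random targeting
```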
Logistic regression will find a linear boundary if one exists, but it will shift
that boundary to accommodate outliers. An SVM, by contrast, is insensitive to
individual samples: there is no major shift of the linear boundary to accommodate
an outlier. SVM also comes with built-in complexity controls that take care of
overfitting, which is not true in the case of logistic regression.
11. How will you deal with the multiclass classification problem
using logistic regression?
The best-known method of dealing with multiclass classification using logistic
regression is the one-vs-all (one-vs-rest) approach. Under this approach, the
number of models trained equals the number of classes. The models work in a
specific way: the first model classifies a data point as belonging to class 1 or
some other class; the second model classifies it into class 2 or some other class;
and so on. This way, each data point is checked against all the classes, and the
class whose model produces the highest score is chosen.
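A minimal sketch (assuming scikit-learn; the iris dataset stands in for a real multiclass problem) of the one-vs-rest approach:

```python
# Minimal sketch: one-vs-all logistic regression on a 3-class dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_))   # 3 binary models, one per class
print(ovr.predict(X[:5]))     # each point is scored by all 3 models
```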
12. Explain the use of ROC curves and the AUC of an ROC Curve.
An ROC curve plots the true positive rate (TPR) against the false-positive rate
(FPR) at every possible classification threshold. The AUC (area under the ROC
curve) summarises how well the model separates the classes across all thresholds:
an AUC of 1 indicates perfect separation, while 0.5 is no better than random
guessing.
Regularisation adds a penalty on large coefficients to the cost function in order
to control overfitting. Two common techniques are:
· L1 or LASSO regularisation: the absolute values of the coefficients are added to
the cost function, i.e. the penalty term is λ × Σ|βj|. This regularisation
technique gives sparse results, which leads to feature selection as well.
· L2 or Ridge regularisation: the squares of the coefficients are added to the
cost function, i.e. the penalty term is λ × Σβj².
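A minimal sketch contrasting the two penalties (assuming scikit-learn; note that the C parameter there is the inverse of λ):

```python
# Minimal sketch: L1 tends to zero out coefficients (sparse feature
# selection), L2 only shrinks them. Note: C in scikit-learn = 1/λ.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2 = LogisticRegression(penalty="l2", C=0.1).fit(X, y)
print("L1 zero coefficients:", (l1.coef_ == 0).sum())
print("L2 zero coefficients:", (l2.coef_ == 0).sum())
```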
Selecting the regularisation parameter is a tricky business. If the value of λ is
too high, it will shrink the regression coefficients β towards extremely small
values, which will lead to the model underfitting (high bias, low variance). On
the other hand, if the value of λ is 0 or very small, the model will tend to
overfit the training data (low bias, high variance).
There is no single correct way to select the value of λ. One approach is to take a
sub-sample of the data and run the algorithm multiple times with different values
of λ; the analyst then has to decide how much variance can be tolerated. Once
satisfied with the variance, that value of λ can be chosen for the full dataset.
Note, however, that the λ selected this way is optimal for that subset, not
necessarily for the entire training data.
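In practice, λ is often chosen by cross-validation instead. A minimal sketch of that variant (assuming scikit-learn, where Ridge calls the parameter alpha):

```python
# Minimal sketch: pick λ (alpha) by 5-fold cross-validation with RidgeCV.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=0)
model = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)
print("Chosen λ (alpha):", model.alpha_)
```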
One can use linear regression for time series analysis, but the results are not
promising, so it is generally not advisable to do so. The reasons are:
1. Time series data is mostly used for predicting the future, and linear
regression seldom gives good results for future prediction, as it is not meant for
extrapolation.
2. Time series data usually has patterns, such as spikes during peak hours or
festive seasons, which would most likely be treated as outliers in a linear
regression analysis.
17. What value is the sum of the residuals of a linear regression close to? Justify.
The sum of the residuals of a linear regression (with an intercept term) is 0.
Linear regression assumes that the errors (residuals) have a mean of 0, and
fitting the intercept by ordinary least squares forces the residuals to sum to
exactly zero.
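A minimal numeric check of this claim (assuming scikit-learn and synthetic data):

```python
# Minimal sketch: with an intercept, OLS residuals sum to (numerically) zero.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3 + rng.normal(0, 0.1, 100)

model = LinearRegression().fit(X, y)   # fits an intercept by default
residuals = y - model.predict(X)
print(residuals.sum())                 # ~0 up to floating-point error
```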
18. You run your regression on different subsets of your data, and in each subset,
the beta value for a certain variable varies wildly. What could be the issue here?
This case implies that the dataset is heterogeneous. So, to overcome this problem,
the dataset should be clustered into different subsets, and then separate models
should be built for each cluster. Another way to deal with this problem is to use non-
parametric models, such as decision trees, which can deal with heterogeneous data
quite efficiently.
19. Your linear regression doesn’t run and communicates that there is an infinite
number of best estimates for the regression coefficients. What could be wrong?
The likely culprit is perfect multicollinearity: two or more predictors are exact
linear combinations of each other, so the design matrix is not of full rank and
the OLS solution is not unique. Removing or combining the redundant variables
resolves the problem.
20. What do you mean by adjusted R2? How is it different from R2?
Adjusted R², just like R², indicates how closely the points lie around the
regression line, that is, how well the model fits the training data.
One drawback of R² is that it always increases with the addition of a new feature,
whether the new feature is useful or not. Adjusted R² overcomes this drawback: it
is computed as 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of
observations and p the number of predictors, and its value increases only if the
newly added feature plays a significant role in the model.
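A minimal sketch of the computation (assuming scikit-learn for the plain R²):

```python
# Minimal sketch: adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=100, n_features=5, noise=20, random_state=0)
r2 = LinearRegression().fit(X, y).score(X, y)
n, p = X.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"R2={r2:.3f}, adjusted R2={adj_r2:.3f}")  # adjusted value is never higher
```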
The residual-vs-fitted-value plot is used to check whether the predicted values
and the residuals are correlated. If the residuals are scattered symmetrically
around zero with a constant variance across the range of fitted values, the model
is working fine; otherwise, there is some issue with the model.
The most common problem found when training a model over a large range of a
dataset is heteroscedasticity (explained in the answer below). The presence of
heteroscedasticity can easily be seen by plotting the residual-vs-fitted-value
curve.
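A minimal sketch of such a plot (assuming matplotlib and scikit-learn; the synthetic noise deliberately grows with X, so the plot shows the tell-tale funnel shape):

```python
# Minimal sketch: residual-vs-fitted plot; a funnel shape suggests
# heteroscedasticity.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((200, 1)) * 10
y = 3 * X[:, 0] + rng.normal(0, X[:, 0], 200)  # noise grows with X

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
plt.scatter(fitted, y - fitted, s=10)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```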
22. What is heteroscedasticity? What are the consequences, and how can you
overcome it?
Heteroscedasticity means that the variance of the errors is not constant across
observations. The coefficient estimates remain unbiased, but their standard
errors, and hence the significance tests built on them, become unreliable. Two
common remedies are:
1. Transforming the variables: for example, taking the logarithm of the dependent
variable often stabilises the variance.
2. Using weighted linear regression: here, the OLS method is applied to weighted
values of X and Y. One way is to attach weights inversely related to the variance
of the dependent variable.
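A minimal sketch of weighted least squares (assuming statsmodels; the 1/x weighting below is just an illustrative variance proxy):

```python
# Minimal sketch: WLS down-weights high-variance observations.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.random(200) * 10 + 1
y = 3 * x + rng.normal(0, x, 200)          # variance grows with x

X = sm.add_constant(x)
wls = sm.WLS(y, X, weights=1.0 / x).fit()  # weights ~ inverse of a variance proxy
print(wls.params)                          # intercept and slope estimates
```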
Hypothesis testing can be carried out in linear regression for the following purposes:
1. To check whether a predictor is significant for the prediction of the target
variable. Two common methods for this are the t-test on the coefficient (via its
p-value) and checking whether the coefficient’s confidence interval contains zero.
4. To check whether the calculated regression coefficients are good estimators of
the actual coefficients.
Before fitting the model, one must be well aware of the data: the trends,
distribution, skewness, etc. of the variables. Graphs such as histograms, box
plots, and dot plots can be used to observe the distribution of the variables.
Apart from this, one must also analyse the relationship between the dependent and
independent variables, which can be done with scatter plots (in the case of
univariate problems), rotating plots, dynamic plots, etc.
The generalised linear model (GLM) is an extension of the ordinary linear
regression model. GLM is more flexible in terms of the residuals and can be used
where linear regression does not seem appropriate: it allows the distribution of
the residuals to be other than normal. It generalises linear regression by
connecting the linear model to the target variable through a link function. Model
estimation is done using maximum likelihood estimation.
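A minimal sketch of a GLM (assuming statsmodels and synthetic data): a binomial family with its default logit link reproduces logistic regression, fitted by maximum likelihood:

```python
# Minimal sketch: GLM with a binomial family and logit link.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.random((200, 2)))
y = (X @ [0.5, 2.0, -1.5] + rng.normal(0, 0.5, 200) > 1).astype(float)

glm = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(glm.params)  # coefficients on the log-odds scale
```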
26. You will see two statements listed below. Read both of them carefully and then
choose the correct option. The question is: choose the statements which are true
about bagging trees.
1. In a bagging tree, the individual trees are not at all dependent on each other.
2. To improve the overall performance of the model, the aggregate is taken over
weak learners. This method is known as bagging trees.
Ans. The correct answer is C, because for a bagging tree both of these statements
are true. In bagging trees, or bootstrap aggregation, the main goal of the
algorithm is to reduce the variance of the decision trees. The mechanism for
creating a bagging tree is to draw, with replacement, a number of subsets from the
training sample.
Each of these smaller subsets of data is then used to train a separate decision
tree. Since the data fed into each tree is unique, the likelihood of any tree
having an impact on another is very low. The results from all these trees are
collected and aggregated to produce the output. Thus, the second statement is also
true.
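A minimal sketch of bagging (assuming scikit-learn; BaggingClassifier uses a decision tree as its default base learner):

```python
# Minimal sketch: each tree trains on a bootstrap sample drawn with
# replacement; predictions are aggregated across trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, random_state=0)
bag = BaggingClassifier(n_estimators=50, bootstrap=True,
                        random_state=0).fit(X, y)
print(bag.score(X, y))
```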
27. You will see two statements listed below. Read both of them carefully and then
choose the correct option. The question is: choose the statements which are true
about boosting trees.
1. The individual trees in a boosted tree are independent of each other.
2. The weak learners’ performance is all collected and aggregated to improve the
boosted tree’s overall performance.
Ans. If you understand how the boosting of trees is done, you will be able to tell
the correct statement from the false one. A boosted tree is created when many weak
learners are connected in series; each tree in the sequence has one sole aim: to
reduce the error made by its predecessor.
Since the trees are connected in this fashion, they cannot be independent of each
other, which renders the first statement false. The second statement is true,
mainly because aggregating the weak learners is precisely the method applied in a
boosted tree to improve the overall performance of the model. The correct option
is B: only statement number two is TRUE, and statement number one is FALSE.
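A minimal sketch of boosting (assuming scikit-learn): trees are grown sequentially, each fitted to the errors left by its predecessors:

```python
# Minimal sketch: gradient boosting adds trees in series to correct the
# errors of the current ensemble.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=0)
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=0).fit(X, y)
print(gbm.score(X, y))
```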
28. You will see four statements listed below. Read all of them carefully and then
choose the correct option. The question is: choose the statements which are true
about random forests and the gradient boosting ensemble method.
1. Both random forest and gradient boosting ensemble methods can be used to
perform classification.
2. Random forests can be used to perform classification tasks, whereas gradient
boosting can only perform regression.
4. Both random forest and gradient boosting ensemble methods can be used to
perform regression.
Ans. Both random forest and gradient boosting can perform classification as well
as regression, so statements one and four are true and statement two is false.
29. You will see four statements listed below. Read all of them carefully and then
choose the correct option. The question is: consider a random forest of trees;
what will be true about each or any of the trees in the random forest?
1. Each tree that constitutes the random forest is built on a subset of all the
features.
3. Each of the trees in a random forest is built on a subset of all the
observations present.
4. Each of the trees in a random forest is built on the full observation set.
Ans. The generation of random forests is based on the concept of bagging. To build
a random forest, a subset is taken of both the observations and the features; each
such subset is fed into an individual decision tree, and the values from all the
decision trees are then aggregated to make the final decision. The only correct
statements are therefore one and three.
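A minimal sketch (assuming scikit-learn): bootstrap=True resamples the observations for each tree, and max_features restricts the features considered at each split, matching statements one and three:

```python
# Minimal sketch: a random forest subsamples observations (bootstrap) and
# features (max_features, applied at each split).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=100, bootstrap=True,
                            max_features="sqrt", random_state=0).fit(X, y)
print(rf.score(X, y))
```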
30. You will see four statements listed below. Read all of them carefully and then
choose the correct option. The question is: select the correct statements about
the hyperparameter known as “max_depth” of the gradient boosting algorithm.
3. If we increase the value of this hyperparameter, the chances of the model
overfitting the data increase.
4. If we increase the value of this hyperparameter, the chances of the model
underfitting the data increase.
Ans. The hyperparameter max_depth controls how deeply gradient boosting models the
data presented to it. If you keep increasing the value of this hyperparameter,
the model is bound to overfit, so statement number three is correct. Moreover,
given the same scores on the validation data, we generally prefer the model with
the lower depth. So, statements number one and three are correct.
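A minimal sketch of the overfitting effect (assuming scikit-learn): as max_depth grows, the gap between the training and validation scores widens:

```python
# Minimal sketch: sweep max_depth and watch the train/validation gap grow.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_informative=5, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

for depth in (1, 3, 6, 10):
    gbm = GradientBoostingClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(depth, gbm.score(X_tr, y_tr), gbm.score(X_va, y_va))
```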
31. You will see four methods listed below. Read all of them carefully and then
choose the correct option. The question is: which of the following methods do not
have a learning rate as one of their tunable hyperparameters?
1. Extra Trees
2. AdaBoost
3. Random Forest
4. Gradient boosting
Ans. Only Extra Trees and Random Forest do not have a learning rate among their
tunable hyperparameters, so the correct choices are methods one and three.
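This is easy to verify directly (assuming scikit-learn): the boosting classes expose a learning_rate parameter, while the bagging-style forests do not:

```python
# Minimal sketch: check which ensembles expose a learning_rate parameter.
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)

for cls in (ExtraTreesClassifier, AdaBoostClassifier,
            RandomForestClassifier, GradientBoostingClassifier):
    print(cls.__name__, "learning_rate" in cls().get_params())
```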
32. Choose the option that is true.
1. Only the random forest algorithm can handle real values by making them
discrete.
2. Only the gradient boosting algorithm can handle real values by making them
discrete.
3. Both random forest and gradient boosting can handle real values by making them
discrete.
Ans. Both of these algorithms are capable ones: they can easily handle features
that contain real values. So the answer is C.
33. Choose one option from the list below. The question is: choose the algorithm
which is not an ensemble learning algorithm.
1. Gradient boosting
2. AdaBoost
3. Extra Trees
4. Random Forest
5. Decision Trees
Ans. The answer is 5, Decision Trees: a single decision tree is a standalone
learner, whereas the other four methods all combine many trees into an ensemble.
34. You will see two statements listed below. Read both of them carefully and then
choose the correct option. The question is: which of the following is true in the
paradigm of ensemble learning?
1. Having a larger number of trees in the ensemble is beneficial.
2. You will still be able to interpret what is happening even after you implement
the random forest algorithm.
Ans. Since any ensemble learning method is based on coupling a large number of
decision trees (each of which, on its own, is a weak learner), it is generally
beneficial to have more trees in your ensemble. However, the random forest
algorithm is like a black box: you will not know what is happening inside the
model, so you are bound to lose most of the interpretability after you apply it.
The correct answer is therefore A, because statement number one is the only true
one.
35. Answer only TRUE or FALSE. Does the bagging algorithm work best for models
which have high variance and low bias?
Ans. True. Bagging is indeed most favourable for high-variance, low-bias models.
36. You will see two statements listed below. Read both of them carefully and then
choose the correct option. The question is: choose the right ideas for gradient
boosting trees.
1. In every stage of boosting, the algorithm introduces another tree to compensate
for the current model’s issues.
2. The gradient descent algorithm is used to minimise the loss function.
Ans. The answer to this question is C, meaning both statements are TRUE. The first
statement describes exactly how the boosting algorithm works: the new trees
introduced into the model augment the performance of the existing ensemble. The
second is also correct: gradient descent is the algorithm applied to reduce the
loss function.
37. In the gradient boosting algorithm, which of the statements below is correct
about the learning rate?
4. The learning rate you set should be high, but not extremely high.
Ans. The learning rate should be low, but not very low, so statement four is false
and the answer is option C.
I hope you found these questions useful for your career. Credit goes to upGrad,
from which I was able to gather this set of interview questions for you.
Next, we will walk through more advanced topics of data science, such as comparing
two machine learning models.
Thanks again for your time. If you enjoyed this short article, there are tons of
topics on advanced analytics, data science, and machine learning available in my
Medium repo.
https://medium.com/@bobrupakroy