
1. Describe K-means clustering algorithm.

 
 
Ans. K-Means is an unsupervised learning algorithm. K-Means ​clustering is the task of 
grouping a set of data points in such a way that data points in the same group (called a 
cluster) are closer to each other than to those in other groups (clusters). 
 
Steps for K-Means clustering: - 
1. In the dataset provided, consider the variables required for clustering 
2. Randomly Initialize the cluster centroid 
3. Calculate Euclidean distance between each observation and initial cluster centroids 
4. Based on Euclidean distance each observation is assigned to one of the clusters - 
based on minimum distance. 
5. Update cluster centroid by taking the mean of the variables in a cluster 
6. Repeat the process till convergence is achieved i.e. there is no further change in the 
cluster centroids.  
Example​: -   Segmenting individuals into different clusters based on their height and weight 
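A minimal sketch of these steps with scikit-learn (assuming it is installed; the height/weight values below are made up to match the example above):

# Sketch only: K-means on hypothetical height/weight data, assuming scikit-learn.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = np.array([[160, 55], [170, 65], [180, 80], [155, 50], [175, 75]])  # [height_cm, weight_kg]
X_scaled = StandardScaler().fit_transform(X)   # standardise so both variables contribute equally

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
print(kmeans.labels_)           # cluster assignment of each observation
print(kmeans.cluster_centers_)  # final centroids after convergence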
 
Additional Information​:- 
Link for various distance measures:- 
http://www.sthda.com/english/articles/26-clustering-basics/86-clustering-distance-measure
s-essentials/ 
 
Steps and pseudo code for K-means clustering:- 
http://mnemstudio.org/clustering-k-means-example-1.htm 
 
 
2.  What are the important considerations in K-means clustering? 
Ans. a) Scale of measurements influences Euclidean Distance, so variable 
standardisation becomes necessary 
b) Outlier treatment is necessary depending on the problem statement 
                        
c) K-Means clustering results may depend on the choice of initial centroids - called cluster seeds
d) The number of clusters to be created is an input to the algorithm and it 
impacts the clusters getting created 
 
3. How is the number of clusters identified in K-means clustering?
Ans Elbow method​:- 
a) Compute clustering algorithm (e.g., k-means clustering) for different values of k. For 
instance, by varying k from 1 to 10 clusters. 
b) For each k, calculate the total within-cluster sum of square (wss). 
c) Plot the curve of wss according to the number of clusters k. 
d) The  location of a bend (knee) in the plot is generally considered as an indicator of the 
appropriate number of clusters. 
 
Average silhouette Method 
A. Compute  clustering  algorithm  (e.g.,  k-means  clustering)  for  different  values  of  k.  For 
instance, by varying k from 1 to 10 clusters. 
B. For each k, calculate the average silhouette of observations (avg.sil).
C. Plot the curve of avg.sil according to the number of clusters k.
D. The location of the maximum is considered as the appropriate number of clusters.
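A minimal sketch of both methods with scikit-learn (X here is a synthetic feature matrix standing in for real data; the silhouette needs at least k = 2):

# Sketch only: elbow (WSS) and average silhouette for k = 2..10, assuming scikit-learn and matplotlib.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(2, 11)
wss, sil = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)                      # total within-cluster sum of squares
    sil.append(silhouette_score(X, km.labels_))  # average silhouette width

plt.plot(ks, wss, marker="o"); plt.xlabel("k"); plt.ylabel("WSS")  # look for the bend (elbow)
plt.show()
print("k suggested by silhouette:", ks[sil.index(max(sil))])       # k with the maximum average silhouette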

4.  What is an ensemble method? 

Ans. An ensemble is a collection of multiple models (usually supervised) which is used to 
obtain better predictive performance than any of the individual models.  
 
 
The two most common types of ensembles are bagging and boosting.  
 
Bagging stands for bootstrapped aggregation, and RF is the most common implementation 
of bagging.  
 
-- If asked to explain what is random forest: 
 
An RF is basically a collection of multiple decision trees, say 500. To make a decision, a 
majority vote of the 500 trees is taken for each data point. 
 
To build an RF, we take bootstrapped samples, i.e. do sampling with replacement (e.g. take 
random 40% data points from training data n times, and use them to build n trees). This 
ensures that each decision tree is trained using different training sets, and is evaluated 
using data points which were not in the 40% points - this is called out of bag error or OOB 
since the evaluation is done on points not used for training. If each tree is still performing 
well (as measured by OOB error), the entire forest is likely performing well, i.e if the 
average OOB error is low, we can be confident that the model (RF) will not overfit.  
 
Also, each node in the tree is built using only a subset of the features. This is because if all 
the features are available for all the nodes, the top nodes in each tree (the important ones) 
will almost always contain the most important variables, and all the trees will look similar. This is 
not desirable because we want to ensure diversity in the ensemble model (the entire 
forest) - if all trees are similar, there is no point taking a majority vote. But if the trees are 
different, then the majority vote is unlikely to be a result of overfitting, since even if some 
trees are overfitting, others will likely not overfit. 
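A minimal sketch of this idea with scikit-learn's RandomForestClassifier (a built-in toy dataset stands in for real data):

# Sketch only: random forest with out-of-bag evaluation, assuming scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=500,      # 500 bootstrapped trees, majority vote at prediction time
    max_features="sqrt",   # each split considers only a random subset of features
    oob_score=True,        # evaluate each tree on points left out of its bootstrap sample
    random_state=0,
).fit(X, y)

print("OOB accuracy:", rf.oob_score_)  # high OOB accuracy suggests the forest generalises well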
 
-- If asked about boosting: Read below 
 
5.  Difference between Bagging and Boosting? 
Ans. Bagging and Boosting are called "meta-algorithms": approaches which combine 
several machine learning models into one predictive model. Their purpose is to 
decrease the variance (bagging) or the bias (boosting), and thereby improve the 
prediction outcomes. Each approach consists of two steps: 
1. Producing a distribution of simple ML models on subsets of the original data. 
2. Combining the distribution into one "aggregated" model. 

 
Here is a short description of the two methods: 

Bagging (stands for Bootstrap Aggregating) - It is a way to decrease the variance of a 
prediction by generating additional training sets from the original dataset, sampling with 
repetition (with replacement) to produce multisets of the same cardinality/size as the original data. 
Increasing the size of the training set in this way cannot improve the model's predictive force, 
but it decreases the variance, narrowly tuning the prediction to the expected outcome. 
 
Boosting - It is a two-step approach, where one first uses subsets of the original data to 
produce a series of averagely performing models and then "boosts" their performance by 
combining them together using a particular cost function (for example, majority vote). Unlike bagging, 
in classical boosting the subset creation is not random and depends upon the 
performance of the previous models: every new subset contains the elements that were 
(likely to be) misclassified by previous models. 
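A minimal sketch contrasting the two with scikit-learn, using decision trees as base learners and a synthetic stand-in dataset:

# Sketch only: a bagged ensemble versus a boosted one, assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

# Bagging: trees trained independently on bootstrap samples, votes averaged (variance reduction).
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

# Boosting: weak learners trained sequentially, each focusing on previously misclassified points (bias reduction).
boosting = AdaBoostClassifier(n_estimators=100, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())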
 
5. What is a cost function in Machine learning? 
Ans A cost function is a measure of how wrong the model is in terms of its ability to 
estimate the relationship between X and y. This cost function (you may also see this 
referred to as loss or error.) can be estimated by iteratively running the model to compare 
estimated predictions against “ground truth” — the known values of y.The objective of a ML 
model, therefore, is to find parameters, weights or a structure that minimises the cost 
function. 
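As a small illustration, two common cost functions computed by hand (assuming NumPy; the numbers are made up):

# Sketch only: mean squared error (regression) and log loss (classification), assuming NumPy.
import numpy as np

y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.8, 5.4, 2.0])
mse = np.mean((y_true - y_pred) ** 2)   # average squared difference from ground truth

y_cls = np.array([1, 0, 1])
p_pred = np.array([0.9, 0.2, 0.6])
log_loss = -np.mean(y_cls * np.log(p_pred) + (1 - y_cls) * np.log(1 - p_pred))

print(mse, log_loss)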
 
 
5. What are parametric and nonparametric machine learning algorithms 
Ans Assumptions can greatly simplify the learning process but can also limit what can be 
learned. Algorithms that simplify the function to a known form are called parametric 
machine learning algorithms. Algorithms that do not make strong assumptions about the 
form of the mapping function are called nonparametric machine learning algorithms. By 
not making assumptions, they are free to learn any functional form from the training data.  
 
6. Give examples of Parametric and Non-Parametric machine learning 
algorithms 
Ans Examples of Parametric machine learning algorithms: - 
● Logistic Regression 
● Linear Discriminant Analysis 
● Naive Bayes 
● Simple Neural Networks 
Examples of Nonparametric machine learning algorithms: - 
● k-Nearest Neighbors 
● Decision Trees like CART and C4.5 
● Support Vector Machines 
 
7. What are the advantages and disadvantages of parametric and nonparametric 
machine learning algorithms 
Ans Benefits of Parametric Machine Learning Algorithms are​: - 
a) Simpler​: These methods are easier to understand and interpret results. 
b) Speed​: Parametric models are very fast to learn from data. 
c) Less Data​: They do not require as much training data and can work well even if the 
fit to the data is not perfect. 
Limitations of Parametric Machine Learning Algorithms are​: - 
a) Constrained​: As the model is simplistic there will be inherent bias in the model. The 
variance of these models will be low.  
b) Limited Complexity​: The methods are more suited to simpler problems. 
c) Poor Fit​: In practice the methods are unlikely to match the underlying mapping 
function. 
 
Benefits of Nonparametric Machine Learning Algorithms​: 
a) Flexibility​: Capable of fitting many functional forms. 
b) Power​: No assumptions (or weak assumptions) about the underlying function. 
c) Performance​: Can result in higher performance models for prediction. 
Limitations of Nonparametric Machine Learning Algorithms​: 
a) More data​: Require a lot more training data to estimate the mapping function. 
b) Slower:​ A lot slower to train as they often have far more parameters to train. 
c) Overfitting​: More of a risk to overfit the training data and it is harder to explain why 
specific predictions are made. 
 
8. Explain Irreducible, bias and variance error 
Ans. The prediction error for any machine learning algorithm can be broken down into 
three parts: - 
Bias Error 
Variance Error 
Irreducible Error 
 
Irreducible Error​ It cannot be reduced regardless of what algorithm is used. It is the error 
introduced from the chosen framing of the problem and may be caused by factors like 
unknown variables that influence the mapping of the input variables to the output variable. 
Bias Error Bias are the simplifying assumptions made by a model to make the target 
function easier to learn. Here the model is not able to come up with a proper mapping 
function from the inputs to the output. 
Generally, parametric algorithms have a high bias making them fast to learn and easier to 
understand but generally less flexible. In turn, they have lower predictive performance on 
complex problems that fail to meet the simplifying assumptions of the algorithms bias. 
Low Bias: Suggests less assumptions about the form of the target function. 
High-Bias: Suggests more assumptions about the form of the target function (for example, 
fitting a logistic regression model to a complex image classification problem). 
Examples of low-bias machine learning algorithms include: Decision Trees, k-Nearest 
Neighbors and Support Vector Machines. 
Examples of high-bias machine learning algorithms include: Linear Regression, Linear 
Discriminant Analysis and Logistic Regression. 
Variance Error​ Variance is the amount by which estimate of the target function will 
change if different training data was used. The target function is estimated from the 
training data by a machine learning algorithm, so we should expect the algorithm to have 
some variance. Ideally, it should not change too much from one training dataset to the 
next, meaning that the algorithm is good at picking out the hidden underlying mapping 
between the inputs and the output variables. Machine learning algorithms that have a high 
variance are strongly influenced by the noise in the training data. Generally, nonparametric 
machine learning algorithms that have a lot of flexibility have a high variance. For example, 
decision trees have a high variance, that is even higher if the trees are not pruned before 
use. 
Examples of low-variance machine learning algorithms​: Linear Regression, Linear 
Discriminant Analysis and Logistic Regression. 
Examples of high-variance machine learning algorithms​: Decision Trees, k-Nearest 
Neighbors and Support Vector Machines. 
 
9.  Explain Bias-Variance Trade-Off 
Ans. The goal of any supervised machine learning algorithm is to achieve low bias and 
low variance. In turn the algorithm should achieve good prediction performance. For 
example, take a dataset in which the relationship between predictor and target variable is 
not exactly linear. A straight line will have high bias but low variance but a polynomial of 
degree 10 will have low bias and high variance.  
 
The parameterization of machine learning algorithms is often a battle to balance out bias 
and variance. Below are two examples of configuring the bias-variance trade-off for specific 
algorithms: 
a) The k-nearest neighbors algorithm has low bias and high variance, but the 
trade-off can be changed by increasing the value of k which increases the number of 
neighbors that contribute to the prediction and in turn increases the bias of the model. 
b) The support vector machine algorithm has low bias and high variance, but the 
trade-off can be changed by increasing the C parameter that influences the number of 
violations of the margin allowed in the training data which increases the bias but 
decreases the variance. 
There is no escaping the relationship between bias and variance in machine learning. 
a) Increasing the bias will decrease the variance. 
b) Increasing the variance will decrease the bias. 
There is a trade-off at play between these two concerns and the algorithms you choose and 
the way you choose to configure them are finding different balances in this trade-off for 
your problem. We cannot calculate the real bias and variance error terms because we do 
not know the actual underlying target function. Nevertheless, as a framework, bias and 
variance provide the tools to understand the behavior of machine learning algorithms in 
the pursuit of predictive performance. 
 
10.  What is overfitting and underfitting 
Ans Overfitting refers to a model that models the training data too well. To put it in a 
different way, it memorises the data instead of extracting generalisable patterns from it. 
Overfitting happens when a model learns the detail and noise in the training data to the 
extent that it negatively impacts the performance of the model on new data. This means 
that the noise or random fluctuations in the training data is picked up and learned as 
concepts by the model. The problem is that these concepts do not apply to new data and 
negatively impact the model’s ability to generalize. Overfitting is more likely with 
nonparametric and nonlinear models that have more flexibility when learning a target 
function. As such, many nonparametric machine learning algorithms also include 
parameters or techniques to limit and constrain how much detail the model learns. 
Underfitting refers to a model that can neither model the training data nor 
generalize to new data. An underfit machine learning model is not a suitable model and will 
be obvious as it will have poor performance on the training data. Underfitting is often not 
discussed as it is easy to detect given a good performance metric. The remedy is to move 
on and try alternate, more complex machine learning algorithms. Nevertheless, it does 
provide a good contrast to the problem of overfitting. ​Ideally, a model is to be selected at 
the sweet spot between underfitting and overfitting. 
 
11. Explain regularization and how it helps to address the overfitting problem. 
Ans Regularization is a method used to address overfitting when there are a large number of 
features in the dataset. A model that is too simple will predict poorly (underfit), while a 
complex model may not perform well on test data due to overfitting. We need to choose 
the right model in between the simple and the complex one. Regularization helps to choose the 
preferred model complexity, so that the model is better at predicting. Regularization is 
nothing but adding a penalty term to the objective function and controlling the model 
complexity using that penalty term. Regularization is achieved by tuning models such that the 
model has significantly lower variance while compromising a little on bias, so that the overall 
model is much more generalized. 
 
Additional information about regularization:- 
https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a 
http://enhancedatascience.com/2017/07/04/machine-learning-explained-regularization/ 
 
 
12. What is the difference between L1 and L2 regularization? 
Ans A regression model that uses the L1 regularization technique is called Lasso Regression, 
and a model which uses L2 is called Ridge regression. The key difference between the two is 
the penalty term. Ridge regression adds the "squared magnitude" of the coefficients as a 
penalty term to the loss function. If too much weight is given to the regularization term 
(the squared coefficients) it will lead to under-fitting, so the weight needs to be tuned 
appropriately. Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the 
"absolute value of magnitude" of the coefficients as a penalty term to the loss function. The 
key practical difference between these techniques is that Lasso shrinks the coefficients of the 
less important features to zero, thus removing some features altogether. So it works well for 
feature selection when we have a huge number of features. Traditional methods like 
cross-validation and stepwise regression to handle overfitting and perform feature selection 
work well with a small set of features, but these regularization techniques are a great 
alternative when we are dealing with a large set of features. 
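A minimal sketch with scikit-learn showing how Lasso drives some coefficients exactly to zero while Ridge only shrinks them (the data is synthetic):

# Sketch only: L2 (Ridge) versus L1 (Lasso) penalties, assuming scikit-learn; alpha is the penalty weight.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=20, n_informative=5, noise=10, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2: squared magnitude of coefficients in the penalty
lasso = Lasso(alpha=1.0).fit(X, y)  # L1: absolute magnitude of coefficients in the penalty

print("non-zero ridge coefficients:", np.sum(ridge.coef_ != 0))  # typically all 20
print("non-zero lasso coefficients:", np.sum(lasso.coef_ != 0))  # typically far fewer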
 
13.  Define precision, recall, and F-measure? 
Ans Recall is also known as sensitivity or the true positive rate. Recall is the number of true 
positives we predicted divided by the total number of elements that in reality are positive. 
Recall is a measure of completeness. High recall means that our model classified most or all 
of the possible positive elements as positive. A recall score of 1.0 means that every item from 
that class was labeled as belonging to that class. However, having just the recall score, you 
do not know how many other items were incorrectly labeled (i.e. did your model just say 
everything is of the positive class). 
Recall = True Positives / (True Positives + False Negatives) 

Precision is also called the positive predictive value. It is the number of correct positives your 
model predicts compared to the total number of positives it predicts. Precision is a measure 
of exactness, quality, or accuracy. High precision means that most or all of the positive 
results you predicted are correct. A precision score of 1.0 means that every item labeled 
positive does indeed belong to the positive class. 
Precision = True Positives / (True Positives + False Positives) 
F-Measure  Precision and recall are often used together because they complement each 
other in how they describe the effectiveness of a model. The F-measure is a score that 
combines these two as the weighted harmonic mean of precision and recall. 
F-Measure = 2 * (Precision * Recall) / (Precision + Recall) 
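These can be computed directly from the formulas above or via scikit-learn; a small sketch with made-up binary labels:

# Sketch only: precision, recall and F1, assuming scikit-learn; y_true and y_pred are made up.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1:       ", f1_score(y_true, y_pred))         # harmonic mean of the two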
 
14. Describe Gaussian Mixture model 
Ans Gaussian mixture models are probabilistic models for representing normally 
distributed subpopulations within an overall population. Mixture models in general don't 
require knowing which subpopulation a data point belongs to, allowing the model to learn 
the subpopulations automatically. Since subpopulation assignment is not known, this 
constitutes a form of unsupervised learning. For example, in modeling human height data, 
height is typically modeled as a normal distribution for each gender with a mean of 
approximately 5'10" for males and 5'5" for females. Given only the height data and not the 
gender assignments for each data point, the distribution of all heights would follow the 
sum of two scaled (different variance) and shifted (different mean) normal distributions. A 
model making this assumption is an example of a Gaussian mixture model (GMM), though 
in general, a GMM may have more than two components. GMMs have been used for 
feature extraction from speech data, and have also been used extensively in object tracking 
of multiple objects, where the number of mixture components and their means predict 
object locations at each frame in a video sequence. 
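A minimal sketch of fitting a two-component GMM with scikit-learn (the height data here is simulated, not real measurements):

# Sketch only: two simulated height subpopulations recovered by a GMM, assuming scikit-learn.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
heights = np.concatenate([rng.normal(70, 3.0, 500),    # roughly the 5'10" group, in inches
                          rng.normal(65, 2.5, 500)])   # roughly the 5'5" group
X = heights.reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print("means:  ", gmm.means_.ravel())  # estimated subpopulation means
print("weights:", gmm.weights_)        # estimated mixing proportions
labels = gmm.predict(X)                # most likely component for each point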
 
15.  How do you make sure a particular set of data follows a particular distribution 
(say gaussian)? 
Ans The following are some of the methods to find out the distribution of the data:- 
a) Histogram - Plot a histogram of the data and see if the shape of the histogram 
resembles any of the known and commonly used statistical distributions. 
b) Distribution test - Distribution tests are hypothesis tests that determine whether 
the sample data were drawn from a population that follows a hypothesized 
distribution. Like any statistical hypothesis test, distribution tests have a null 
hypothesis and an alternative hypothesis. 
c) Probability plots - Probability plots can be used to determine whether data follow 
a particular distribution. If your data follow the straight line on the graph, the 
distribution fits the data. This is a good visual test. Informally, this process is called 
the "fat pencil" test. If all the data points line up within the area of a fat pencil laid 
over the center straight line, you can conclude that your data follow the distribution. 
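A minimal sketch of (a)-(c) with NumPy, SciPy and matplotlib (the sample here is simulated normal data):

# Sketch only: histogram, a normality test, and a probability (Q-Q) plot.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.random.default_rng(0).normal(loc=10, scale=2, size=200)

plt.hist(x, bins=20)                      # (a) histogram: does the shape look Gaussian?
plt.show()

stat, p = stats.shapiro(x)                # (b) distribution (normality) test
print("Shapiro-Wilk p-value:", p)         # a large p-value gives no evidence against normality

stats.probplot(x, dist="norm", plot=plt)  # (c) probability plot against the normal distribution
plt.show()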
 
Additional information:- 
https://www.statmethods.net/advgraphs/probability.html 
http://www.instantr.com/2012/11/28/creating-a-normal-probability-plot/ 
 
 
16. What is the chain rule of probability 
Ans The chain rule of probability is a theorem that allows one to calculate joint 
distribution of random variables using conditional probabilities. Given the joint probability 
distribution: 
P(A,B) 
We can use the definition of conditional probability to factor the joint probability as follows: 
P(A,B) = P(A|B)*P(B) 
 
 
17. What is the difference between discriminative and generative models? 
Ans Discriminative models learn the boundaries between the data. Generative models 
model the distribution of individual classes. Discriminative models do not offer clear 
representations of relations between features and classes in the dataset. Instead of using 
resources to fully model each class, they focus on richly modeling the boundary between 
classes. They classify points, without providing a model of how the points are actually 
generated. Generative algorithms make some kind of structural assumptions about the 
model, while discriminative algorithms make fewer assumptions. For example, the Naive Bayes 
algorithm makes the assumption of conditional independence between features of the 
dataset while logistic regression does not make such assumptions. Generative models 
provide a model of how the data was generated. In general, discriminative models perform 
better than generative models and work better for larger datasets, but they might tend to 
overfit on smaller datasets. Generative models often outperform discriminative models on 
smaller datasets because their generative assumptions place some structure on the model 
that prevent overfitting. For more information regarding the difference, refer this 
stackoverflow answer:- 
https://stackoverflow.com/questions/879432/what-is-the-difference-between-a-generative-
and-discriminative-algorithm 
 
18. What is the difference between bernoulli and binomial distribution? 
Ans A Bernoulli trial is a random experiment with only two possible outcomes, with 
probabilities p and q where p + q = 1. The binomial distribution describes the number of 
successes in a sequence of independent Bernoulli trials. For example, if we consider tossing 
a coin, there are two possible outcomes, 'head' or 'tail'. If we are interested in the head 
falling, the probability of success is 1/2, which can be denoted as P(success) = 1/2, and the 
probability of failure is 1/2. The binomial distribution is the sum of independent and 
identically distributed Bernoulli trials. The binomial distribution is denoted by the notation 
b(k;n,p); b(k;n,p) = C(n,k) p^k q^(n-k), where C(n,k) is known as the binomial coefficient. The 
binomial coefficient C(n,k) can be calculated using the formula n!/(k!(n-k)!). For example, if 
an instant lottery with 25% winning tickets is sold among 10 people, the probability that 
exactly one of them buys a winning ticket is 
b(1;10,0.25) = C(10,1)(0.25)(0.75)^9 ≈ 10 × 0.25 × 0.075 ≈ 0.188 
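The lottery example can be checked with SciPy's distributions (a small sketch):

# Sketch only: P(exactly 1 winner among 10 buyers, p = 0.25), assuming SciPy.
from scipy.stats import binom, bernoulli

print(binom.pmf(k=1, n=10, p=0.25))  # ~0.188, i.e. C(10,1) * 0.25 * 0.75**9
print(bernoulli.rvs(p=0.5, size=5))  # five independent Bernoulli (coin-toss) trials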
 
19. Optimization function used in logistic regression? 
Ans In logistic regression, a cost function called ‘cross-entropy’ is used. This is also called 
log loss. The cost for a single observation is: 

Cost(hθ(x), y) = −[ y · log(hθ(x)) + (1 − y) · log(1 − hθ(x)) ] 

An intuitive way to understand this equation is as follows:- 

y = actual label (it takes 0 for the negative class and 1 for the positive class). 
hθ(x) = predicted probability given by logistic regression. 

If the actual label of a particular data point is zero, the cost reduces to −log(1 − hθ(x)); if the 
actual label is one, the cost reduces to −log(hθ(x)). 

This means that if the actual label of a data point is zero and its predicted probability is close 
to one, the cost of the logistic function will be very high. Similarly, if the actual label of a data 
point and its predicted probability agree, the cost will be close to zero. So, we need to find an 
estimate (β^) such that the cost function is minimised. The logistic cost function is a convex 
function, so we don't need to worry about local minima. But it is not possible to find the 
global minimum using a closed-form solution as in linear regression, because the sigmoid 
function is nonlinear. 
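A small numeric illustration of this cost (assuming NumPy; the probabilities are made up):

# Sketch only: cross-entropy cost for single observations, assuming NumPy.
import numpy as np

def logistic_cost(y, p):
    # -[y*log(p) + (1-y)*log(1-p)], the per-observation cross-entropy
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(logistic_cost(0, 0.01))  # y=0, low predicted probability  -> cost near 0
print(logistic_cost(0, 0.99))  # y=0, high predicted probability -> very large cost
print(logistic_cost(1, 0.99))  # y=1, high predicted probability -> cost near 0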

20. How do you select variables from a large dataset 

Ans Filter methods - They are generally used as a preprocessing step. The 
selection of features is independent of any machine learning algorithm. Instead, features 
are selected on the basis of their scores in various statistical tests for their correlation with 
the outcome variable. 

Wrapper methods​ In this method we try to use a subset of features and train a model 
using them. The problem is essentially reduced to a search problem. These methods are 
usually computationally very expensive. Some common examples of wrapper methods are 
forward feature selection, backward feature elimination, recursive feature elimination, etc. 

Embedded methods - Embedded methods combine the qualities of filter and wrapper 
methods. They are implemented by algorithms that have their own built-in feature selection 
methods. Some of the most popular examples of these methods are LASSO and RIDGE 
regression, which have inbuilt penalization functions to reduce overfitting. 

The main differences between the filter and wrapper methods for feature selection 
are​: 
● Filter  methods  measure  the  relevance  of  features  by  their  correlation  with 
dependent  variable  while  wrapper  methods  measure  the  usefulness  of  a  subset  of 
feature by actually training a model on it. 
● Filter  methods  are  much  faster  compared  to  wrapper  methods  as  they  do  not 
involve  training  the  models.  On  the  other  hand,  wrapper  methods  are 
computationally very expensive as well. 
● Filter  methods  use  statistical  methods  for  evaluation  of  a  subset  of  features  while 
wrapper methods use cross validation. 
● Filter  methods  might  fail  to  find  the  best  subset  of  features  in  many  occasions  but 
wrapper methods can always provide the best subset of features. 
● Using  the  subset  of  features  from  the  wrapper  methods  make  the  model  more 
prone  to  overfitting  as  compared  to  using  subset  of  features  from  the  filter 
methods. 
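A minimal sketch of one method from each family with scikit-learn (the dataset is a synthetic stand-in, and the chosen estimators are just illustrative):

# Sketch only: filter, wrapper and embedded feature selection on a toy dataset, assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression, LassoCV

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Filter: rank features by a univariate statistical test, independent of any model.
filt = SelectKBest(score_func=f_classif, k=5).fit(X, y)

# Wrapper: recursive feature elimination repeatedly trains a model and drops the weakest feature.
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

# Embedded: Lasso's built-in L1 penalty shrinks unhelpful coefficients to exactly zero.
emb = LassoCV(cv=5).fit(X, y)

print("filter keeps:  ", filt.get_support(indices=True))
print("wrapper keeps: ", wrap.get_support(indices=True))
print("lasso non-zero:", (emb.coef_ != 0).nonzero()[0])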
 
Additional Information:- 
https://www.analyticsvidhya.com/blog/2016/12/introduction-to-feature-selection-methods-
with-an-example-or-how-to-select-the-right-variables/ 
 
 
21. Difference between linear and logistic regression 
Ans The difference between linear and logistic regression is as follows:- 
 
● Target variable: in linear regression it is continuous or numeric; in logistic regression it is 
categorical. 
● Estimation: linear regression is based on Least Squares Estimation - the regression 
coefficients are chosen so that they minimize the sum of the squared distances of each 
observed response to its fitted value. Logistic regression is based on Maximum Likelihood 
Estimation - the coefficients are chosen so that they maximize the probability of Y given X 
(the likelihood). 
● Equation: linear regression fits y = β0 + β1x1 + ... + βnxn + ε, while logistic regression fits 
p = 1 / (1 + e^-(β0 + β1x1 + ... + βnxn)). 
● Curve: linear regression aims at finding the best-fitting straight line, also called the 
regression line. In logistic regression, changing the coefficients changes both the direction 
and the steepness of the logistic function - positive slopes result in an S-shaped curve and 
negative slopes result in a Z-shaped curve. 
● Error term: linear regression requires the error term to be normally distributed; logistic 
regression does not. 
● Residuals: linear regression assumes that residuals are approximately equal for all 
predicted values of the dependent variable; logistic regression does not need residuals to be 
equal for each level of the predicted values. 
● Interpretation: in linear regression, a coefficient is how much the dependent variable is 
expected to increase/decrease with a unit increase in the independent variable, keeping all 
other independent variables constant. In logistic regression, it is the change in the log-odds 
(and hence the odds ratio) of the outcome for a one-unit change in X, with the other 
variables in the model held constant. 
● Distribution of the dependent variable: linear regression assumes a normal or Gaussian 
distribution; logistic regression assumes a binomial distribution. 
● Link function: linear regression uses the identity link function of the Gaussian family; 
logistic regression uses the logit function of the binomial family. 
 
 

22. State four assumptions of a linear regression model 


Ans a. The residuals are independent 
b. The residuals are normally distributed 
c. The residuals have a mean of 0 at all values of X 
d. The relationship between the independent and dependent variables is linear 
e. It requires all variables to be multivariate normal 
f. There is little or no multicollinearity in the data 
 
23. What is a Generalized linear model? 
Ans In the case of general linear models, the distribution of residuals is assumed to be 
Gaussian. If that is not the case, the relationship between Y and the model parameters is no 
longer linear. But if the distribution of residuals is one from the exponential family, such as 
the binomial, Poisson, negative binomial, or gamma distributions, there exists some function 
of the mean of Y which has a linear relationship with the model parameters. This function is 
called the link function. For example, a binomial residual can use a logit or a probit link 
function, and a Poisson residual uses a log link function. In short, the general linear model is 
the special case with a Gaussian error distribution and an identity link, while the generalized 
linear model allows any exponential-family error distribution together with an appropriate 
link function. 
 
24. In which cases you would use Generalized linear models? 
Ans. The cases in which Generalized linear models can be used are as follows:- 
(a) If the response variable is categorical 
(b) When the distribution of residuals is non-normal or non-Gaussian 
(c) If the distribution of the residuals is from the exponential family, such as the gamma distribution 
 
25. Why is a naïve Bayes model called naïve? 
Ans Naïve Bayes machine learning algorithm is considered Naïve because the 
assumptions the algorithm makes are virtually impossible to find in real-life data. 
Conditional probability is calculated as a pure product of individual probabilities of 
components. This means that the algorithm assumes the presence or absence of a specific 
feature of a class is not related to the presence or absence of any other feature (absolute 
independence of features), given the class variable. For instance, a fruit may be considered 
to be a banana if it is yellow, long and about 5 inches in length. However, if these features 
depend on each other or are based on the existence of other features, a naïve Bayes 
classifier will assume all these properties to contribute independently to the probability 
that this fruit is a banana. The assumption that all features in a given dataset are equally 
important and independent rarely holds in a real-world scenario. 
 
26. How do you handle class imbalance in a dataset. Explain what could you do at the 
data level and at the model level? 
Ans a) Collecting more data 
b) Changing the performance metric (it has been observed in many cases that accuracy 
does not work well with imbalanced datasets). 
c) Resampling the dataset. Add samples of data that is under represented. This is 
known as oversampling. Delete samples of data that is over represented. This is 
known as undersampling. 
d) Generating synthetic samples - to randomly sample the attributes from instances 
in the minority class. 
e) Spot checking of different algorithms 
f) Penalizing the models - Penalized classification imposes an additional cost on the 
model for making classification mistakes on the minority class during training. These 
penalties can bias the model to pay more attention to the minority class. 
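A minimal sketch of points (c) and (f) with scikit-learn and NumPy (the data is synthetic; the random oversampling here is done by hand, and synthetic sampling such as SMOTE from point (d) would need the separate imbalanced-learn package):

# Sketch only: naive oversampling and a class-weighted (penalised) model on an imbalanced dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# (c) Resampling: oversample the minority class by duplicating its rows.
minority = np.where(y == 1)[0]
extra = np.random.default_rng(0).choice(minority, size=len(minority) * 5, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

# (f) Penalised model: weight mistakes on the minority class more heavily during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)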
 
Additional information:- 
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machin
e-learning-dataset/ 
 
 
27. What is the difference between R2 and adj-R2?  
Ans One major difference between R-squared and the adjusted R-squared is that 
R-squared supposes that every independent variable in the model explains the variation in 
the dependent variable. R-squared cannot verify whether the coefficient estimates and 
predictions are biased. It also does not show whether a regression model is adequate: it can 
show a low R-squared figure for a good model, or a high R-squared figure for a model that 
doesn't fit the data. 
 
The adjusted R-squared compares the descriptive power of regression models that include 
diverse numbers of predictors. Every predictor added to a model increases R-squared and 
never decreases it. Thus, a model with more terms may seem to have a better fit just for 
the fact that it has more terms, while the adjusted R-squared compensates for the addition 
of variables and only increases if the new term enhances the model above what would be 
obtained by chance, and decreases when a predictor enhances the model less than 
what is predicted by chance. In an overfitting condition, an incorrectly high value of 
R-squared, which leads to a decreased ability to predict, is obtained. This is not the case 
with the adjusted R-squared. 
 
 
28. Why is "having" used when "where" is already used ?  
 
Both the HAVING and WHERE clauses are similar in nature and have the functionality of 
filtering the data, with a slight difference: the WHERE clause filters on individual rows, 
while the HAVING clause filters on aggregated values, i.e. it applies only to groups as a 
whole. Now, why is the HAVING clause needed when WHERE already exists? Consider an 
example: you are booking a movie ticket from a specific theatre chain, e.g. IMAX (as you 
have a specific taste), but due to a tight budget you want to list ONLY the IMAX theatres which 
have an average movie ticket price below Rs. 300, or you want to list only the IMAXs whose 
number of shows is greater than 5 per day. 
Here “where” clause will eliminate the theatres that are not from “IMAX” before calculating 
the average price of the movie ticket. To implement the filter on average ticket price we’ll 
need the HAVING clause as here we need both “GROUPING” and “SUMMARIZING”, like 
count(*) or avg(). Also when using having, you must also have a “Group By” clause in the 
query. 
 
https://docs.microsoft.com/en-us/sql/ssms/visual-db-tools/use-having-and-where-clauses-i
n-the-same-query-visual-database-tools 
http://www.java2s.com/Code/Oracle/Select-Query/UsingtheWHEREGROUPBYandHAVINGCl
ausesTogether.htm 
 
29. What are window functions? 
A window function performs a calculation on a "set of rows" and returns a single 
aggregated value for each row. This is similar to the kind of calculation that 
can be done with an aggregate function, but with a slight difference. 
When we use aggregate functions with the GROUP BY clause, we "lose" the individual rows. 
We can't mix attributes from an individual row with the results of an aggregate function; 
the function is performed on the rows as an entire group. But unlike regular aggregate 
functions, use of a window function does not cause rows to become grouped into a single 
output row - the rows retain their separate identities. We can generate a result set with 
some attributes of an individual row together with the results of the window function. This 
makes windowing one of the coolest features of SQL. 
For example, if you want to compare each player's auction value in the IPL to the average 
auction amount spent on each player of his team, in the same table: 
SELECT team_name, player_name, auction_amt, avg(auction_amt) OVER (PARTITION BY 
team_name) FROM auction_table; 
 
Refer PostgreSQL documentation: 
https://www.postgresql.org/docs/9.1/static/tutorial-window.html 
https://community.modeanalytics.com/sql/tutorial/sql-window-functions/ 
 
30. Difference between logistic and linear regression.Is logistic regression a linear 
model? Why or why not? 
 
Linear regression : This algorithm’s principle is to find a linear relation within your data. 
Once the linear relation is found, predicting a new value is done with respect to this 
relation.Linear regression is used when the desired output is required to take a continuous 
value based on whatever input/dataset is given to the algorithm.  
Let us say, you ask a child in fifth grade to arrange people in his class by increasing order of 
weight, without asking them their weights! What do you think the child will do? He / she 
would likely look (visually analyze) at the height and build of people and arrange them 
using a combination of these visible parameters. This is linear regression in real life! The 
child has actually figured out that height and build are correlated with weight by a 
relationship that looks like a linear equation. 
 
Logistic Regression: Don't go by its name! It is a classification algorithm, not a regression 
algorithm. Let's say your friend gives you a puzzle to solve. There are only 2 outcome 
scenarios – either you solve it or you don’t.  
 
Linear regression is used when the output is continuous in nature based on corresponding 
input, it can have any one of an infinite number of possible values. Consider the weather 
forecasting problem where you want to predict the tomorrow’s temperature, % humidity 
etc. 
Now suppose your problem was to not predict the average temperature or % humidity, but 
what type of day it will be (eg., sunny, cloudy, stormy, rainy etc). This problem will give an 
output belonging to a certain set of values predefined, hence it is basically classifying your 
output into categories. Classification problems can be either binary (yes/no, 0/1, like you either 
solve the problem or not) or multiclass (like the problem described above). Logistic 
regression is used in classifying problems of machine learning. 
https://www.analyticsvidhya.com/blog/2017/09/common-machine-learning-algorithms/ 
For mathematical POV: 
https://medium.com/deep-math-machine-learning-ai/chapter-1-complete-linear-regression
-with-math-25b2639dde23 
https://medium.com/deep-math-machine-learning-ai/chapter-2-0-logistic-regression-with-
math-e9cbb3ec6077 
 
Logistic regression is called a generalized linear model not because the estimated 
probability of the response event is linear, but because the logit of the estimated 
probability of the response is a linear function of the parameters. 

 
https://www.slideshare.net/SatishGupta4/ihcc-logistic-regression 
Linear means linear (degree =1 ) in betas (the coefficients) but not in x's (the independent 
variables). 
https://stats.stackexchange.com/questions/88603/why-is-logistic-regression-a-linear-model 
 
   
 
31. What is p value, how to read it, how to calculate it. 
 
Imagine that India, at full strength (all top players in their best form), got into a head-to-head 
match with Zimbabwe, but it turned out that India lost the game. 

Fans were stunned. And frustrated. And angry. 

The reasoning goes like this: if India had played as usual, it would have been highly 
unlikely to be defeated. But the team lost the game! So fans had every reason to cast doubt on 
the team's fair play. (Some might pull out the allegation of match fixing.) 

To put it another way, the reasoning goes like this: 
We have a hypothesis that India rocks as usual. If the hypothesis had been true, the 
probability of India losing would have been very small, say, less than 5%. But India lost the 
game. 
So this unlikelihood was considered as evidence against the team's fair play. 
You may say the p-value is a measurement of the weirdness of your observations according to 
your current beliefs - the smaller, the weirder. You believe in lots of things; for some things, 
reality reconfirms your beliefs by giving you expected outcomes, but for others, 
reality challenges you by throwing weird, unexpected outcomes at you, and at some 
point you can't deny it anymore, so you start to realize that what you once believed may be 
wrong. 
 
The P value, or calculated probability, is the probability of finding the observed, or more 
extreme, results when the null hypothesis (H0) of a study question is true – the definition of 
‘extreme’ depends on how the hypothesis is being tested.The p-value is defined as the 
probability, under the null hypothesis H, of obtaining a result equal to or more extreme 
than what was actually observed.  
Calculation of P-value: 
Look up your test statistic on the appropriate distribution - in this case, on the standard 
normal (Z-) distribution, using a Z-table or statistical software. Then: 
 
 
Pr(X >= x|H) for right tail event 
Pr(X <= x|H) for left tail event 
2min(Pr(X >= x|H),Pr(X <= x|H)) for double tail event 
 
Relationship between significance level and p-value​. The relationship is: the p-value is the 
smallest significance level at which the null hypothesis would be rejected 
Reject the null hypothesis if P is "small". 
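A small sketch of the calculation for a Z test statistic (assuming SciPy; the z value is made up):

# Sketch only: p-values from a standard normal test statistic, assuming SciPy.
from scipy.stats import norm

z = 2.1                            # hypothetical test statistic
p_right = 1 - norm.cdf(z)          # right-tail event, Pr(X >= z | H0)
p_left = norm.cdf(z)               # left-tail event,  Pr(X <= z | H0)
p_two = 2 * min(p_right, p_left)   # two-tailed p-value

print(p_right, p_two)              # reject H0 if the p-value < chosen significance level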
 
https://en.wikipedia.org/wiki/P-value 
http://www.perfendo.org/docs/BayesProbability/twelvePvaluemisconceptions.pdf 
https://www.students4bestevidence.net/p-value-in-plain-english-2/ 
https://www.quora.com/What-is-a-p-value-explained-in-layman%E2%80%99s-terms 
 
To know how to do Hypothesis testing: 
http://people.cas.uab.edu/~mpogwizd/ma180-fall-2014/HypothesisTesting.pdf 
 
 
 
 
 
32. What is the difference between median and mean, when to chose what?  
The average (mean) is the sum of a set of numbers divided by the count of numbers in the 
data set: mean = (x1 + x2 + ... + xn) / n. 

The median is the middle number in the data set, which can be determined by sorting the 
numbers in order and picking the middle value (or the average of the two middle values if 
the count is even). 
 
Now which one to choose depends on the data distribution and the purpose. 
The mean is the average: if you want to find the per capita income, i.e. "What is the average 
income of the country?", you would use the mean. In terms of accuracy, for a bell-shaped 
population distribution the mean is the more accurate measure. 
The median is more of a central measure, where you draw a middle line: "What is the income 
of the typical person?" On that basis you would identify people below the poverty line. In 
terms of accuracy, for a heavy-tailed distribution, or when the data are skewed in one 
direction or the other (which is the usual case), the median is the more accurate measure. 
The forte of the median comes when you want to handle outliers. 
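A tiny illustration of the outlier point (assuming NumPy; the incomes are made up):

# Sketch only: one extreme income drags the mean but barely moves the median.
import numpy as np

incomes = np.array([30, 32, 35, 38, 40, 1000])  # in thousands; 1000 is an outlier
print("mean:  ", incomes.mean())                 # ~195.8, pulled up by the outlier
print("median:", np.median(incomes))             # 36.5, robust to the outlier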
 
https://learnandteachstatistics.wordpress.com/2013/04/29/median/ 
https://math.stackexchange.com/questions/2304710/mean-vs-median-when-to-use 
 
 
33. What is VIF? 
A variance inflation factor(VIF) detects multicollinearity (a predictor/independent variable 
can be linearly predicted from the others with a significant accuracy) in regression analysis. 
Multicollinearity is when there’s correlation between predictors (i.e. independent variables) 
in a model;  
its presence can adversely affect your regression results. The VIF estimates how much the 
variance of a regression coefficient is inflated due to existing multicollinearity in the model. 
 
VIFs are calculated by taking a predictor and regressing it against every other predictor in 
the model. This gives you the R-squared value, which can then be plugged into the VIF 
formula for predictor i (e.g. x1 or x2): 

VIF_i = 1 / (1 − R_i^2) 

where R_i^2 is the coefficient of determination (the proportion of variance explained) 
obtained by regressing predictor i on the other predictors. 
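A minimal sketch with statsmodels (the predictor columns are simulated, and x3 is deliberately built to be collinear with x1):

# Sketch only: VIFs via statsmodels; assumes statsmodels, pandas and NumPy.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["x3"] = 2 * df["x1"] + rng.normal(scale=0.1, size=200)  # nearly a linear copy of x1

X = sm.add_constant(df)  # include an intercept column before computing VIFs
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, variance_inflation_factor(X.values, i))  # large values for x1 and x3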
https://en.wikipedia.org/wiki/Coefficient_of_determination 
https://onlinecourses.science.psu.edu/stat501/node/347 
https://en.wikipedia.org/wiki/Variance_inflation_factor 

34. What is BIC and AIC? 


 
Bayesian information criterion (BIC) or Schwarz criterion (also SBC, SBIC) is a basis or 
criterion for model selection among a set of models; Ideally, the model with the lowest BIC 
is preferred. 
Whereas the Akaike information criterion (AIC) gives an idea of the relative quality of 
statistical models for a given set of data. 
Given a collection of models for the data, AIC estimates the quality of each model 
relative to each of the other models. 
When fitting models, it is possible to increase the likelihood by adding parameters, but 
doing so may result in overfitting. Both BIC and AIC attempt to resolve this problem by 
introducing a penalty term for the number of parameters in the model 
https://stats.stackexchange.com/questions/577/is-there-any-reason-to-prefer-the-aic-or-bic
-over-the-other 
https://methodology.psu.edu/AIC-vs-BIC 
 
AIC = -2*ln(likelihood) + 2*k, 
and 
 
BIC = -2*ln(likelihood) + ln(N)*k, 
where: 
 
k = model degrees of freedom 
N = number of observations 
https://www.quora.com/What-is-the-difference-between-an-AIC-information-criteron-and-a
-BIC-information-criterion 
 
35. What is t-test? 
A t-test is commonly used to determine whether the mean of a population significantly 
differs from a specific value (called the hypothesized mean) or from the mean of another 
population. 
A t-test is a procedure that boils all of your sample data down to one value, the t-value. 

The t-value is a signal-to-noise ratio: 

Signal: It is the difference between the sample mean and the hypothesized (null) mean. 

Let's consider a hypothesized mean wait time for an Ola cab of 5 minutes. If your random 
sample had a mean wait time of 5.1 minutes, the signal is 5.1 − 5 = 0.1 minutes. 
The difference is relatively small, so the signal in the numerator is weak. 

However, if the riders in your random sample had a mean wait time of 8 minutes, the 
difference is much larger: 8 − 5 = 3 minutes. So the signal is stronger. 

The denominator is the noise: a measure of variability known as the standard error of the 
mean. Putting the two together, for a one-sample test: 

t = (x̄ − μ0) / (s / √n) 

where x̄ is the sample mean, μ0 the hypothesized mean, s the sample standard deviation 
and n the sample size. 
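A quick sketch of the one-sample test with SciPy (the wait times are simulated):

# Sketch only: one-sample t-test against a hypothesised mean of 5 minutes, assuming SciPy.
import numpy as np
from scipy import stats

waits = np.random.default_rng(0).normal(loc=5.5, scale=1.5, size=30)  # simulated wait times
t_stat, p_value = stats.ttest_1samp(waits, popmean=5)
print(t_stat, p_value)  # a small p-value suggests the mean wait time differs from 5 minutes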

 
http://blog.minitab.com/blog/adventures-in-statistics-2/understanding-t-tests-t-values-and-
t-distributions 
http://blog.minitab.com/blog/adventures-in-statistics-2/understanding-t-tests%3A-1-sample
%2C-2-sample%2C-and-paired-t-tests 
 
 
36. How do you know if the clusters generated are good? 
To measure the quality of clustering results, there are two kinds of validity indices: external 
indices and internal indices. 
 
An external index is a measure of agreement between two partitions where the first 
partition is the a priori known clustering structure, and the second results from the 
clustering procedure (Dudoit et al., 2002). 
 
Internal indices are used to measure the goodness of a clustering structure without 
external information (Tseng et al., 2005). 
 
For external indices, we evaluate the results of a clustering algorithm based on a known 
cluster structure of a data set (or cluster labels). 
 
For internal indices, we evaluate the results using quantities and features inherent in the 
data set. The optimal number of clusters is usually determined based on an internal validity 
index. 
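A minimal sketch of one internal and one external index with scikit-learn (ground-truth labels exist here only because the data is simulated):

# Sketch only: silhouette (internal) and adjusted Rand index (external), assuming scikit-learn.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, true_labels = make_blobs(n_samples=300, centers=3, random_state=0)
pred_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("silhouette (internal):  ", silhouette_score(X, pred_labels))          # needs no labels
print("adjusted Rand (external):", adjusted_rand_score(true_labels, pred_labels))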
http://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-
optimal-number-of-clusters-3-must-know-methods/ 
https://link.springer.com/article/10.1007/s40595-016-0086-9 
If your unsupervised learning method is probabilistic, another option is to evaluate some 
probability measure (log-likelihood, perplexity, etc.) on held-out data. The motivation here is 
that if your unsupervised learning method assigns high probability to similar data that 
wasn't used to fit the parameters, then it has probably done a good job of capturing the 
distribution of interest. 
 
http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation 
https://stats.stackexchange.com/questions/21807/evaluation-measure-of-clustering-withou
t-having-truth-labels 
https://www.analyticsvidhya.com/blog/2013/11/getting-clustering-right-part-ii/ 
 
 
 
37. What is ANOVA? 
ANOVA is used when you want to check the variation among and between the considered 
groups; basically, you are testing whether there is a difference between the groups or not. 
For example: students of different coaching institutes are taking the IIT JEE, and we want to 
see which coaching institute is giving better results. Or a researcher conducts a study to 
investigate the effect of 3 different teaching methods on the reading ability of school 
children. 
T-test vs ANOVA 
ANOVA is very similar to the t-test, just for more than two groups. When comparing only two 
groups (A and B), you test the difference between the two groups with a Student t-test. So 
when comparing three groups (A, B, and C) it's natural to think of testing each of the three 
possible two-group comparisons (A – B, A – C, and B – C) with a t-test. 
But running an exhaustive set of two-group t-tests can be risky, because as the number of 
groups goes up, the number of two-group comparisons goes up even more, inflating the 
chance of a false positive. 
So here ANOVA comes to the rescue. 

 
The key quantity is the sample variance, s² = Σ(xᵢ − x̄)² / (n − 1), 
where n − 1 is the degrees of freedom (DF), the summation is called the sum of squares (SS), 
the result is called the mean square (MS), and the squared terms are deviations from the 
sample mean. ANOVA compares the mean square between groups to the mean square 
within groups via the F ratio. 
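A one-way ANOVA sketch with SciPy (scores for three hypothetical teaching methods are simulated):

# Sketch only: one-way ANOVA across three groups, assuming SciPy; the scores are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
method_a = rng.normal(70, 5, 30)
method_b = rng.normal(75, 5, 30)
method_c = rng.normal(72, 5, 30)

f_stat, p_value = stats.f_oneway(method_a, method_b, method_c)
print(f_stat, p_value)  # a small p-value suggests at least one group mean differs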
 
 
https://www.edanzediting.com/blogs/statistics-anova-explained 
http://www.dummies.com/education/science/biology/the-basic-idea-of-an-analysis-of-varia
nce-anova/ 
It's always good to practice ANOVA. A few problems to help understand it: 
http://rstudio-pubs-static.s3.amazonaws.com/228015_d8d0ddab79664707890681a9a75cf1
6d.html 
 
 
38. Basic of Neural Networks ? 
https://towardsdatascience.com/a-gentle-introduction-to-neural-networks-series-part-1-2b
90b87795bc 
Basic of Convolutional neural networks (CNNs) and recurrent neural networks 
(RNNs).  
https://medium.com/machine-learning-for-humans/neural-networks-deep-learning-cdad8a
eae49b 
   
 
39. What is KS statistics? 
The Kolmogorov-Smirnov test is basically a test of goodness of fit. It compares the empirical 
cumulative distribution function of a variable with a "specified distribution". 
Suppose that we have an i.i.d. sample X1, . . . , Xn with some unknown distribution "D" and 
we would like to test the hypothesis that "D" is equal to a particular distribution "D0". In its 
two-sample form, the KS-test tries to determine whether two datasets differ significantly. The 
advantage of the KS-test is that it is agnostic to the distribution of the sample considered. 
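A minimal sketch of the one-sample and two-sample KS tests with SciPy (the data is simulated):

# Sketch only: one-sample test against N(0,1) and a two-sample comparison, assuming SciPy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.uniform(-2, 2, size=200)

print(stats.kstest(x, "norm"))  # is x drawn from a standard normal distribution?
print(stats.ks_2samp(x, y))     # do x and y come from the same distribution?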
 
Programming POV: ​http://daithiocrualaoich.github.io/kolmogorov_smirnov/ 
Mathematics POV: 
https://ocw.mit.edu/courses/mathematics/18-443-statistics-for-applications-fall-2006/lectur
e-notes/lecture14.pdf 
 
40. What is cross-Validation? Why is it important ? What are the different methods 
of cross validation? 
Ans There is always a need to validate the stability of a machine learning model. There is no 
assurance that the model created will work well on unseen data. We need some kind of 
assurance that the model has got most of the patterns from the data correct, and that it is 
not picking up too much of the noise - in other words, that it is low on bias and variance. 
 
Validation 
This process of deciding whether the numerical results quantifying hypothesized 
relationships between variables, are acceptable as descriptions of the data, is known as 
validation. Generally, an error estimation for the model is made after training, better 
known as evaluation of residuals. In this process, a numerical estimate of the difference in 
predicted and original responses is done, also called the training error. However, this only 
gives us an idea about how well our model does on data used to train it. Now its possible 
that the model is underfitting or overfitting the data. So, the problem with this evaluation 
technique is that it does not give an indication of how well the learner will generalize to an 
independent/ unseen data set. Getting this idea about our model is known as ​Cross 
Validation. 
 
Holdout Method 
Now a basic remedy for this involves removing a part of the training data and using it to get 
predictions from the model trained on rest of the data. The error estimation then tells how 
our model is doing on unseen data or the validation set. This is a simple kind of cross 
validation technique, also known as the holdout method. Although this method doesn’t 
take any overhead to compute and is better than traditional validation, it still suffers from 
issues of high variance. This is because it is not certain which data points will end up in the 
validation set and the result might be entirely different for different sets. 
K-Fold Cross Validation 
As there is never enough data to train your model, removing a part of it for validation 
poses a problem of underfitting. By reducing the training data, we risk losing important 
patterns/ trends in data set, which in turn increases error induced by bias. So, what we 
require is a method that provides ample data for training the model and also leaves ample 
data for validation. K Fold cross validation does exactly that. 
In K-Fold cross validation, the data is divided into k subsets. Now the holdout method is 
repeated k times, such that each time, one of the k subsets is used as the test set/ 
validation set and the other k−1 subsets are put together to form a training set. The error 
estimation is averaged over all k trials to get the total effectiveness of our model. As can be 
seen, every data point gets to be in a validation set exactly once, and gets to be in a training 
set k−1 times. This significantly reduces bias, as we are using most of the data for fitting, and 
also significantly reduces variance, as most of the data is also being used in the validation set. 
Interchanging the training and test sets also adds to the effectiveness of this method. As a 
general rule and empirical evidence, K = 5 or 10 is generally preferred, but nothing’s fixed 
and it can take any value. 
Stratified K-Fold Cross Validation 
In some cases, there may be a large imbalance in the response variables. For example, in 
dataset concerning price of houses, there might be large number of houses having high 
price. Or in case of classification, there might be several times more negative samples than 
positive samples. For such problems, a slight variation in the K Fold cross validation 
technique is made, such that each fold contains approximately the same percentage of 
samples of each target class as the complete set, or in case of prediction problems, the 
mean response value is approximately equal in all the folds. This variation is also known as 
Stratified K Fold. 
Above explained validation techniques are also referred to as Non-exhaustive cross 
validation methods. These do not compute all ways of splitting the original sample, i.e. you 
just have to decide how many subsets need to be made. 
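A minimal sketch of K-fold and stratified K-fold with scikit-learn (the built-in dataset is a stand-in for real data):

# Sketch only: 5-fold and stratified 5-fold cross validation, assuming scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # preserves class proportions per fold

print("k-fold:    ", cross_val_score(model, X, y, cv=kf).mean())
print("stratified:", cross_val_score(model, X, y, cv=skf).mean())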
 
 
 
  
 
 
 
