
1. Describe K-means clustering algorithm.

 
 
Ans. K-Means is an unsupervised learning algorithm. K-Means ​clustering is the task of 
grouping a set of data points in such a way that data points in the same group (called a 
cluster) are closer to each other than to those in other groups (clusters). 
 
Steps for K-Means clustering: - 
1. In the dataset provided, consider the variables required for clustering 
2. Randomly Initialize the cluster centroid 
3. Calculate Euclidean distance between each observation and initial cluster centroids 
4. Based on Euclidean distance each observation is assigned to one of the clusters - 
based on minimum distance. 
5. Update cluster centroid by taking the mean of the variables in a cluster 
6. Repeat the process till convergence is achieved i.e. there is no further change in the 
cluster centroids.  
Example​: -   Segmenting individuals into different clusters based on their height and weight 
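A minimal sketch of these steps with scikit-learn (assuming it is installed; the height/weight values below are made up to match the example above):

# Sketch only: K-means on hypothetical height/weight data, assuming scikit-learn.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = np.array([[160, 55], [170, 65], [180, 80], [155, 50], [175, 75]])  # [height_cm, weight_kg]
X_scaled = StandardScaler().fit_transform(X)   # standardise so both variables contribute equally

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
print(kmeans.labels_)           # cluster assignment of each observation
print(kmeans.cluster_centers_)  # final centroids after convergence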
 
Additional Information​:- 
Link for various distance measures:- 
http://www.sthda.com/english/articles/26-clustering-basics/86-clustering-distance-measure
s-essentials/ 
 
Steps and pseudo code for K-means clustering:- 
http://mnemstudio.org/clustering-k-means-example-1.htm 
 
 
2.  What are the important considerations in K-means clustering? 
Ans. a) Scale of measurements influences Euclidean Distance, so variable 
standardisation becomes necessary 
b) Outlier treatment is necessary depending on the problem statement 
                        
c) K-Means clustering results may depend on the choice of initial centroids - called cluster seeds
d) The number of clusters to be created is an input to the algorithm and it 
impacts the clusters getting created 
 
3. How is the number of clusters identified in K-means clustering?
Ans Elbow method​:- 
a) Compute clustering algorithm (e.g., k-means clustering) for different values of k. For 
instance, by varying k from 1 to 10 clusters. 
b) For each k, calculate the total within-cluster sum of square (wss). 
c) Plot the curve of wss according to the number of clusters k. 
d) The  location of a bend (knee) in the plot is generally considered as an indicator of the 
appropriate number of clusters. 
 
Average silhouette Method 
A. Compute  clustering  algorithm  (e.g.,  k-means  clustering)  for  different  values  of  k.  For 
instance, by varying k from 1 to 10 clusters. 
B. For each k, calculate the average silhouette of observations (avg.sil).
C. Plot the curve of avg.sil according to the number of clusters k.
D. The location of the maximum is considered as the appropriate number of clusters.
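A minimal sketch of both methods with scikit-learn (X here is a synthetic feature matrix standing in for real data; the silhouette needs at least k = 2):

# Sketch only: elbow (WSS) and average silhouette for k = 2..10, assuming scikit-learn and matplotlib.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(2, 11)
wss, sil = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)                      # total within-cluster sum of squares
    sil.append(silhouette_score(X, km.labels_))  # average silhouette width

plt.plot(ks, wss, marker="o"); plt.xlabel("k"); plt.ylabel("WSS")  # look for the bend (elbow)
plt.show()
print("k suggested by silhouette:", ks[sil.index(max(sil))])       # k with the maximum average silhouette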

4.  What is an ensemble method? 

Ans. An ensemble is a collection of multiple models (usually supervised) which is used to 
obtain better predictive performance than any of the individual models.  
 
 
The two most common types of ensembles are bagging and boosting.  
 
Bagging stands for bootstrapped aggregation, and RF is the most common implementation 
of bagging.  
 
-- If asked to explain what is random forest: 
 
An RF is basically a collection of multiple decision trees, say 500. To make a decision, a 
majority vote of the 500 trees is taken for each data point. 
 
To build an RF, we take bootstrapped samples, i.e. do sampling with replacement (e.g. take 
random 40% data points from training data n times, and use them to build n trees). This 
ensures that each decision tree is trained using different training sets, and is evaluated 
using data points which were not in the 40% points - this is called out of bag error or OOB 
since the evaluation is done on points not used for training. If each tree is still performing 
well (as measured by OOB error), the entire forest is likely performing well, i.e if the 
average OOB error is low, we can be confident that the model (RF) will not overfit.  
 
Also, each node in the tree is built using only a subset of the features. This is because if all 
the features are available for all the nodes, the top nodes in each tree (the important ones) 
will almost always contain the most important variables, and all the trees will look similar. This is 
not desirable because we want to ensure diversity in the ensemble model (the entire 
forest) - if all trees are similar, there is no point taking a majority vote. But if the trees are 
different, then the majority vote is unlikely to be a result of overfitting, since even if some 
trees are overfitting, others will likely not overfit. 
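A minimal sketch of this idea with scikit-learn's RandomForestClassifier (a built-in toy dataset stands in for real data):

# Sketch only: random forest with out-of-bag evaluation, assuming scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=500,      # 500 bootstrapped trees, majority vote at prediction time
    max_features="sqrt",   # each split considers only a random subset of features
    oob_score=True,        # evaluate each tree on points left out of its bootstrap sample
    random_state=0,
).fit(X, y)

print("OOB accuracy:", rf.oob_score_)  # high OOB accuracy suggests the forest generalises well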
 
-- If asked about boosting: Read below 
 
5.  Difference between Bagging and Boosting? 
Ans. Bagging and Boosting are called "meta-algorithms": approaches which combine 
several machine learning models into one predictive model. Their purpose is to 
decrease the variance (bagging) or the bias (boosting), and thereby improve the 
prediction outcomes. Each approach consists of two steps: 
1. Producing a distribution of simple ML models on subsets of the original data. 
2. Combining the distribution into one "aggregated" model. 

 
Here is a short description of the two methods: 

Bagging (stands for Bootstrap Aggregating) - It is a way to decrease the variance of a 
prediction by generating additional training sets from the original dataset, sampling with 
repetition (with replacement) to produce multisets of the same cardinality/size as the original data. 
Increasing the size of the training set in this way cannot improve the model's predictive force, 
but it decreases the variance, narrowly tuning the prediction to the expected outcome. 
 
Boosting - It is a two-step approach, where one first uses subsets of the original data to 
produce a series of averagely performing models and then "boosts" their performance by 
combining them together using a particular cost function (for example, majority vote). Unlike bagging, 
in classical boosting the subset creation is not random and depends upon the 
performance of the previous models: every new subset contains the elements that were 
(likely to be) misclassified by previous models. 
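A minimal sketch contrasting the two with scikit-learn, using decision trees as base learners and a synthetic stand-in dataset:

# Sketch only: a bagged ensemble versus a boosted one, assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

# Bagging: trees trained independently on bootstrap samples, votes averaged (variance reduction).
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

# Boosting: weak learners trained sequentially, each focusing on previously misclassified points (bias reduction).
boosting = AdaBoostClassifier(n_estimators=100, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())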
 
5. What is a cost function in Machine learning? 
Ans A cost function is a measure of how wrong the model is in terms of its ability to 
estimate the relationship between X and y. This cost function (you may also see this 
referred to as loss or error.) can be estimated by iteratively running the model to compare 
estimated predictions against “ground truth” — the known values of y.The objective of a ML 
model, therefore, is to find parameters, weights or a structure that minimises the cost 
function. 
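As a small illustration, two common cost functions computed by hand (assuming NumPy; the numbers are made up):

# Sketch only: mean squared error (regression) and log loss (classification), assuming NumPy.
import numpy as np

y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.8, 5.4, 2.0])
mse = np.mean((y_true - y_pred) ** 2)   # average squared difference from ground truth

y_cls = np.array([1, 0, 1])
p_pred = np.array([0.9, 0.2, 0.6])
log_loss = -np.mean(y_cls * np.log(p_pred) + (1 - y_cls) * np.log(1 - p_pred))

print(mse, log_loss)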
 
 
5. What are parametric and nonparametric machine learning algorithms 
Ans Assumptions can greatly simplify the learning process but can also limit what can be 
learned. Algorithms that simplify the function to a known form are called parametric 
machine learning algorithms. Algorithms that do not make strong assumptions about the 
form of the mapping function are called nonparametric machine learning algorithms. By 
not making assumptions, they are free to learn any functional form from the training data.  
 
6. Give examples of Parametric and Non-Parametric machine learning 
algorithms 
Ans Examples of Parametric machine learning algorithms: - 
● Logistic Regression 
● Linear Discriminant Analysis 
● Naive Bayes 
● Simple Neural Networks 
Examples of Nonparametric machine learning algorithms: - 
● k-Nearest Neighbors 
● Decision Trees like CART and C4.5 
● Support Vector Machines 
 
7. What are the advantages and disadvantages of parametric and nonparametric 
machine learning algorithms 
Ans Benefits of Parametric Machine Learning Algorithms are​: - 
a) Simpler​: These methods are easier to understand and interpret results. 
b) Speed​: Parametric models are very fast to learn from data. 
c) Less Data​: They do not require as much training data and can work well even if the 
fit to the data is not perfect. 
Limitations of Parametric Machine Learning Algorithms are​: - 
a) Constrained​: As the model is simplistic there will be inherent bias in the model. The 
variance of these models will be low.  
b) Limited Complexity​: The methods are more suited to simpler problems. 
c) Poor Fit​: In practice the methods are unlikely to match the underlying mapping 
function. 
 
Benefits of Nonparametric Machine Learning Algorithms​: 
a) Flexibility​: Capable of fitting many functional forms. 
b) Power​: No assumptions (or weak assumptions) about the underlying function. 
c) Performance​: Can result in higher performance models for prediction. 
Limitations of Nonparametric Machine Learning Algorithms​: 
a) More data​: Require a lot more training data to estimate the mapping function. 
b) Slower:​ A lot slower to train as they often have far more parameters to train. 
c) Overfitting​: More of a risk to overfit the training data and it is harder to explain why 
specific predictions are made. 
 
8. Explain Irreducible, bias and variance error 
Ans. The prediction error for any machine learning algorithm can be broken down into 
three parts: - 
Bias Error 
Variance Error 
Irreducible Error 
 
Irreducible Error​ It cannot be reduced regardless of what algorithm is used. It is the error 
introduced from the chosen framing of the problem and may be caused by factors like 
unknown variables that influence the mapping of the input variables to the output variable. 
Bias Error Bias are the simplifying assumptions made by a model to make the target 
function easier to learn. Here the model is not able to come up with a proper mapping 
function from the inputs to the output. 
Generally, parametric algorithms have a high bias making them fast to learn and easier to 
understand but generally less flexible. In turn, they have lower predictive performance on 
complex problems that fail to meet the simplifying assumptions of the algorithms bias. 
Low Bias: Suggests less assumptions about the form of the target function. 
High-Bias: Suggests more assumptions about the form of the target function (for example, 
fitting a logistic regression model to a complex image classification problem). 
Examples of low-bias machine learning algorithms include: Decision Trees, k-Nearest 
Neighbors and Support Vector Machines. 
Examples of high-bias machine learning algorithms include: Linear Regression, Linear 
Discriminant Analysis and Logistic Regression. 
Variance Error​ Variance is the amount by which estimate of the target function will 
change if different training data was used. The target function is estimated from the 
training data by a machine learning algorithm, so we should expect the algorithm to have 
some variance. Ideally, it should not change too much from one training dataset to the 
next, meaning that the algorithm is good at picking out the hidden underlying mapping 
between the inputs and the output variables. Machine learning algorithms that have a high 
variance are strongly influenced by the noise in the training data. Generally, nonparametric 
machine learning algorithms that have a lot of flexibility have a high variance. For example, 
decision trees have a high variance, that is even higher if the trees are not pruned before 
use. 
Examples of low-variance machine learning algorithms​: Linear Regression, Linear 
Discriminant Analysis and Logistic Regression. 
Examples of high-variance machine learning algorithms​: Decision Trees, k-Nearest 
Neighbors and Support Vector Machines. 
 
9.  Explain Bias-Variance Trade-Off 
Ans. The goal of any supervised machine learning algorithm is to achieve low bias and 
low variance. In turn the algorithm should achieve good prediction performance. For 
example, take a dataset in which the relationship between predictor and target variable is 
not exactly linear. A straight line will have high bias but low variance but a polynomial of 
degree 10 will have low bias and high variance.  
 
The parameterization of machine learning algorithms is often a battle to balance out bias 
and variance. Below are two examples of configuring the bias-variance trade-off for specific 
algorithms: 
a) The k-nearest neighbors algorithm has low bias and high variance, but the 
trade-off can be changed by increasing the value of k which increases the number of 
neighbors that contribute to the prediction and in turn increases the bias of the model. 
b) The support vector machine algorithm has low bias and high variance, but the 
trade-off can be changed by increasing the C parameter that influences the number of 
violations of the margin allowed in the training data which increases the bias but 
decreases the variance. 
There is no escaping the relationship between bias and variance in machine learning. 
a) Increasing the bias will decrease the variance. 
b) Increasing the variance will decrease the bias. 
There is a trade-off at play between these two concerns and the algorithms you choose and 
the way you choose to configure them are finding different balances in this trade-off for 
your problem. We cannot calculate the real bias and variance error terms because we do 
not know the actual underlying target function. Nevertheless, as a framework, bias and 
variance provide the tools to understand the behavior of machine learning algorithms in 
the pursuit of predictive performance. 
 
10.  What is overfitting and underfitting 
Ans Overfitting refers to a model that models the training data too well. To put it in a 
different way, it memorises the data instead of extracting generalisable patterns from it. 
Overfitting happens when a model learns the detail and noise in the training data to the 
extent that it negatively impacts the performance of the model on new data. This means 
that the noise or random fluctuations in the training data is picked up and learned as 
concepts by the model. The problem is that these concepts do not apply to new data and 
negatively impact the model’s ability to generalize. Overfitting is more likely with 
nonparametric and nonlinear models that have more flexibility when learning a target 
function. As such, many nonparametric machine learning algorithms also include 
parameters or techniques to limit and constrain how much detail the model learns. 
Underfitting refers to a model that can neither model the training data nor 
generalize to new data. An underfit machine learning model is not a suitable model and will 
be obvious as it will have poor performance on the training data. Underfitting is often not 
discussed as it is easy to detect given a good performance metric. The remedy is to move 
on and try alternate, more complex machine learning algorithms. Nevertheless, it does 
provide a good contrast to the problem of overfitting. ​Ideally, a model is to be selected at 
the sweet spot between underfitting and overfitting. 
 
11. Explain regularization and how it helps to address the overfitting problem. 
Ans Regularization is a method used to address overfitting when there are a large number of 
features in the dataset. A model that is too simple will predict poorly (underfit), while a 
complex model may not perform well on test data due to overfitting. We need to choose 
the right model in between the simple and the complex one. Regularization helps to choose the 
preferred model complexity, so that the model is better at predicting. Regularization is 
nothing but adding a penalty term to the objective function and controlling the model 
complexity using that penalty term. Regularization is achieved by tuning models such that the 
model has significantly lower variance while compromising a little on bias, so that the overall 
model is much more generalized. 
 
Additional information about regularization:- 
https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a 
http://enhancedatascience.com/2017/07/04/machine-learning-explained-regularization/ 
 
 
12. What is the difference between L1 and L2 regularization? 
Ans A regression model that uses the L1 regularization technique is called Lasso Regression, 
and a model which uses L2 is called Ridge regression. The key difference between the two is 
the penalty term. Ridge regression adds the "squared magnitude" of the coefficients as a 
penalty term to the loss function. If too much weight is given to the regularization term 
(the squared coefficients) it will lead to under-fitting, so the weight needs to be tuned 
appropriately. Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the 
"absolute value of magnitude" of the coefficients as a penalty term to the loss function. The 
key practical difference between these techniques is that Lasso shrinks the coefficients of the 
less important features to zero, thus removing some features altogether. So it works well for 
feature selection when we have a huge number of features. Traditional methods like 
cross-validation and stepwise regression to handle overfitting and perform feature selection 
work well with a small set of features, but these regularization techniques are a great 
alternative when we are dealing with a large set of features. 
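A minimal sketch with scikit-learn showing how Lasso drives some coefficients exactly to zero while Ridge only shrinks them (the data is synthetic):

# Sketch only: L2 (Ridge) versus L1 (Lasso) penalties, assuming scikit-learn; alpha is the penalty weight.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=20, n_informative=5, noise=10, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2: squared magnitude of coefficients in the penalty
lasso = Lasso(alpha=1.0).fit(X, y)  # L1: absolute magnitude of coefficients in the penalty

print("non-zero ridge coefficients:", np.sum(ridge.coef_ != 0))  # typically all 20
print("non-zero lasso coefficients:", np.sum(lasso.coef_ != 0))  # typically far fewer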
 
13.  Define precision, recall, and F-measure? 
Ans Recall is also known as sensitivity or the true positive rate. Recall is the number of true 
positives we predicted divided by the total number of elements that in reality are positive. 
Recall is a measure of completeness. High recall means that our model classified most or all 
of the possible positive elements as positive. A recall score of 1.0 means that every item from 
that class was labeled as belonging to that class. However, having just the recall score, you 
do not know how many other items were incorrectly labeled (i.e. did your model just say 
everything is of the positive class). 
Recall = True Positives / (True Positives + False Negatives) 

Precision is also called the positive predictive value. It is the number of correct positives your 
model predicts compared to the total number of positives it predicts. Precision is a measure 
of exactness, quality, or accuracy. High precision means that most or all of the positive 
results you predicted are correct. A precision score of 1.0 means that every item labeled 
positive does indeed belong to the positive class. 
Precision = True Positives / (True Positives + False Positives) 
F-Measure  Precision and recall are often used together because they complement each 
other in how they describe the effectiveness of a model. The F-measure is a score that 
combines these two as the weighted harmonic mean of precision and recall. 
F-Measure = 2 * (Precision * Recall) / (Precision + Recall) 
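These can be computed directly from the formulas above or via scikit-learn; a small sketch with made-up binary labels:

# Sketch only: precision, recall and F1, assuming scikit-learn; y_true and y_pred are made up.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1:       ", f1_score(y_true, y_pred))         # harmonic mean of the two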
 
14. Describe Gaussian Mixture model 
Ans Gaussian mixture models are probabilistic models for representing normally 
distributed subpopulations within an overall population. Mixture models in general don't 
require knowing which subpopulation a data point belongs to, allowing the model to learn 
the subpopulations automatically. Since subpopulation assignment is not known, this 
constitutes a form of unsupervised learning. For example, in modeling human height data, 
height is typically modeled as a normal distribution for each gender with a mean of 
approximately 5'10" for males and 5'5" for females. Given only the height data and not the 
gender assignments for each data point, the distribution of all heights would follow the 
sum of two scaled (different variance) and shifted (different mean) normal distributions. A 
model making this assumption is an example of a Gaussian mixture model (GMM), though 
in general, a GMM may have more than two components. GMMs have been used for 
feature extraction from speech data, and have also been used extensively in object tracking 
of multiple objects, where the number of mixture components and their means predict 
object locations at each frame in a video sequence. 
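A minimal sketch of fitting a two-component GMM with scikit-learn (the height data here is simulated, not real measurements):

# Sketch only: two simulated height subpopulations recovered by a GMM, assuming scikit-learn.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
heights = np.concatenate([rng.normal(70, 3.0, 500),    # roughly the 5'10" group, in inches
                          rng.normal(65, 2.5, 500)])   # roughly the 5'5" group
X = heights.reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print("means:  ", gmm.means_.ravel())  # estimated subpopulation means
print("weights:", gmm.weights_)        # estimated mixing proportions
labels = gmm.predict(X)                # most likely component for each point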
 
15.  How do you make sure a particular set of data follows a particular distribution 
(say gaussian)? 
Ans The following are some of the methods to find out the distribution of the data:- 
a) Histogram - Plot a histogram of the data and see if the shape of the histogram 
resembles any of the known and commonly used statistical distributions. 
b) Distribution test - Distribution tests are hypothesis tests that determine whether 
the sample data were drawn from a population that follows a hypothesized 
distribution. Like any statistical hypothesis test, distribution tests have a null 
hypothesis and an alternative hypothesis. 
c) Probability plots - Probability plots can be used to determine whether data follow 
a particular distribution. If your data follow the straight line on the graph, the 
distribution fits the data. This is a good visual test. Informally, this process is called 
the "fat pencil" test. If all the data points line up within the area of a fat pencil laid 
over the center straight line, you can conclude that your data follow the distribution. 
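A minimal sketch of (a)-(c) with NumPy, SciPy and matplotlib (the sample here is simulated normal data):

# Sketch only: histogram, a normality test, and a probability (Q-Q) plot.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.random.default_rng(0).normal(loc=10, scale=2, size=200)

plt.hist(x, bins=20)                      # (a) histogram: does the shape look Gaussian?
plt.show()

stat, p = stats.shapiro(x)                # (b) distribution (normality) test
print("Shapiro-Wilk p-value:", p)         # a large p-value gives no evidence against normality

stats.probplot(x, dist="norm", plot=plt)  # (c) probability plot against the normal distribution
plt.show()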
 
Additional information:- 
https://www.statmethods.net/advgraphs/probability.html 
http://www.instantr.com/2012/11/28/creating-a-normal-probability-plot/ 
 
 
16. What is the chain rule of probability 
Ans The chain rule of probability is a theorem that allows one to calculate joint 
distribution of random variables using conditional probabilities. Given the joint probability 
distribution: 
P(A,B) 
We can use the definition of conditional probability to factor the joint probability as follows: 
P(A,B) = P(A|B)*P(B) 
 
 
17. What is the difference between discriminative and generative models? 
Ans Discriminative models learn the boundaries between the data. Generative models 
model the distribution of individual classes. Discriminative models do not offer clear 
representations of relations between features and classes in the dataset. Instead of using 
resources to fully model each class, they focus on richly modeling the boundary between 
classes. They classify points, without providing a model of how the points are actually 
generated. Generative algorithms make some kind of structural assumptions about the 
model, while discriminative algorithms make fewer assumptions. For example, the Naive Bayes 
algorithm makes the assumption of conditional independence between features of the 
dataset while logistic regression does not make such assumptions. Generative models 
provide a model of how the data was generated. In general, discriminative models perform 
better than generative models and work better for larger datasets, but they might tend to 
overfit on smaller datasets. Generative models often outperform discriminative models on 
smaller datasets because their generative assumptions place some structure on the model 
that prevent overfitting. For more information regarding the difference, refer this 
stackoverflow answer:- 
https://stackoverflow.com/questions/879432/what-is-the-difference-between-a-generative-
and-discriminative-algorithm 
 
18. What is the difference between bernoulli and binomial distribution? 
Ans A Bernoulli trial is a random experiment with only two possible outcomes, with 
probabilities p and q where p + q = 1. The binomial distribution describes the number of 
successes in a sequence of independent Bernoulli trials. For example, if we consider tossing 
a coin, there are two possible outcomes, 'head' or 'tail'. If we are interested in the head 
falling, the probability of success is 1/2, which can be denoted as P(success) = 1/2, and the 
probability of failure is 1/2. The binomial distribution is the sum of independent and 
identically distributed Bernoulli trials. The binomial distribution is denoted by the notation 
b(k;n,p); b(k;n,p) = C(n,k) p^k q^(n-k), where C(n,k) is known as the binomial coefficient. The 
binomial coefficient C(n,k) can be calculated using the formula n!/(k!(n-k)!). For example, if 
an instant lottery with 25% winning tickets is sold among 10 people, the probability that 
exactly one of them buys a winning ticket is 
b(1;10,0.25) = C(10,1)(0.25)(0.75)^9 ≈ 10 × 0.25 × 0.075 ≈ 0.188 
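The lottery example can be checked with SciPy's distributions (a small sketch):

# Sketch only: P(exactly 1 winner among 10 buyers, p = 0.25), assuming SciPy.
from scipy.stats import binom, bernoulli

print(binom.pmf(k=1, n=10, p=0.25))  # ~0.188, i.e. C(10,1) * 0.25 * 0.75**9
print(bernoulli.rvs(p=0.5, size=5))  # five independent Bernoulli (coin-toss) trials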
 
19. Optimization function used in logistic regression? 
Ans In logistic regression, a cost function called ‘cross-entropy’ is used. This is also called 
log loss. The cost for a single observation is: 

Cost(hθ(x), y) = −[ y · log(hθ(x)) + (1 − y) · log(1 − hθ(x)) ] 

An intuitive way to understand this equation is as follows:- 

y = actual label (it takes 0 for the negative class and 1 for the positive class). 
hθ(x) = predicted probability given by logistic regression. 

If the actual label of a particular data point is zero, the cost reduces to −log(1 − hθ(x)); if the 
actual label is one, the cost reduces to −log(hθ(x)). 

This means that if the actual label of a data point is zero and its predicted probability is close 
to one, the cost of the logistic function will be very high. Similarly, if the actual label of a data 
point and its predicted probability agree, the cost will be close to zero. So, we need to find an 
estimate (β^) such that the cost function is minimised. The logistic cost function is a convex 
function, so we don't need to worry about local minima. But it is not possible to find the 
global minimum using a closed-form solution as in linear regression, because the sigmoid 
function is nonlinear. 
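A small numeric illustration of this cost (assuming NumPy; the probabilities are made up):

# Sketch only: cross-entropy cost for single observations, assuming NumPy.
import numpy as np

def logistic_cost(y, p):
    # -[y*log(p) + (1-y)*log(1-p)], the per-observation cross-entropy
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(logistic_cost(0, 0.01))  # y=0, low predicted probability  -> cost near 0
print(logistic_cost(0, 0.99))  # y=0, high predicted probability -> very large cost
print(logistic_cost(1, 0.99))  # y=1, high predicted probability -> cost near 0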

20. How do you select variables from a large dataset 

Ans Filter methods - They are generally used as a preprocessing step. The 
selection of features is independent of any machine learning algorithm. Instead, features 
are selected on the basis of their scores in various statistical tests for their correlation with 
the outcome variable. 

Wrapper methods​ In this method we try to use a subset of features and train a model 
using them. The problem is essentially reduced to a search problem. These methods are 
usually computationally very expensive. Some common examples of wrapper methods are 
forward feature selection, backward feature elimination, recursive feature elimination, etc. 

Embedded methods - Embedded methods combine the qualities of filter and wrapper 
methods. They are implemented by algorithms that have their own built-in feature selection 
methods. Some of the most popular examples of these methods are LASSO and RIDGE 
regression, which have inbuilt penalization functions to reduce overfitting. 

The main differences between the filter and wrapper methods for feature selection 
are​: 
● Filter  methods  measure  the  relevance  of  features  by  their  correlation  with 
dependent  variable  while  wrapper  methods  measure  the  usefulness  of  a  subset  of 
feature by actually training a model on it. 
● Filter  methods  are  much  faster  compared  to  wrapper  methods  as  they  do  not 
involve  training  the  models.  On  the  other  hand,  wrapper  methods  are 
computationally very expensive as well. 
● Filter  methods  use  statistical  methods  for  evaluation  of  a  subset  of  features  while 
wrapper methods use cross validation. 
● Filter  methods  might  fail  to  find  the  best  subset  of  features  in  many  occasions  but 
wrapper methods can always provide the best subset of features. 
● Using  the  subset  of  features  from  the  wrapper  methods  make  the  model  more 
prone  to  overfitting  as  compared  to  using  subset  of  features  from  the  filter 
methods. 
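A minimal sketch of one method from each family with scikit-learn (the dataset is a synthetic stand-in, and the chosen estimators are just illustrative):

# Sketch only: filter, wrapper and embedded feature selection on a toy dataset, assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression, LassoCV

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Filter: rank features by a univariate statistical test, independent of any model.
filt = SelectKBest(score_func=f_classif, k=5).fit(X, y)

# Wrapper: recursive feature elimination repeatedly trains a model and drops the weakest feature.
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

# Embedded: Lasso's built-in L1 penalty shrinks unhelpful coefficients to exactly zero.
emb = LassoCV(cv=5).fit(X, y)

print("filter keeps:  ", filt.get_support(indices=True))
print("wrapper keeps: ", wrap.get_support(indices=True))
print("lasso non-zero:", (emb.coef_ != 0).nonzero()[0])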
 
Additional Information:- 
https://www.analyticsvidhya.com/blog/2016/12/introduction-to-feature-selection-methods-
with-an-example-or-how-to-select-the-right-variables/ 
 
 
21. Difference between linear and logistic regression 
Ans The difference between linear and logistic regression is as follows:- 
 
● Target variable: in linear regression it is continuous or numeric; in logistic regression it is 
categorical. 
● Estimation: linear regression is based on Least Squares Estimation - the regression 
coefficients are chosen so that they minimize the sum of the squared distances of each 
observed response to its fitted value. Logistic regression is based on Maximum Likelihood 
Estimation - the coefficients are chosen so that they maximize the probability of Y given X 
(the likelihood). 
● Equation: linear regression fits y = β0 + β1x1 + ... + βnxn + ε, while logistic regression fits 
p = 1 / (1 + e^-(β0 + β1x1 + ... + βnxn)). 
● Curve: linear regression aims at finding the best-fitting straight line, also called the 
regression line. In logistic regression, changing the coefficients changes both the direction 
and the steepness of the logistic function - positive slopes result in an S-shaped curve and 
negative slopes result in a Z-shaped curve. 
● Error term: linear regression requires the error term to be normally distributed; logistic 
regression does not. 
● Residuals: linear regression assumes that residuals are approximately equal for all 
predicted values of the dependent variable; logistic regression does not need residuals to be 
equal for each level of the predicted values. 
● Interpretation: in linear regression, a coefficient is how much the dependent variable is 
expected to increase/decrease with a unit increase in the independent variable, keeping all 
other independent variables constant. In logistic regression, it is the change in the log-odds 
(and hence the odds ratio) of the outcome for a one-unit change in X, with the other 
variables in the model held constant. 
● Distribution of the dependent variable: linear regression assumes a normal or Gaussian 
distribution; logistic regression assumes a binomial distribution. 
● Link function: linear regression uses the identity link function of the Gaussian family; 
logistic regression uses the logit function of the binomial family. 
 
 

22. State four assumptions of a linear regression model 


Ans a. The residuals are independent 
b. The residuals are normally distributed 
c. The residuals have a mean of 0 at all values of X 
d. The relationship between the independent and dependent variables is linear 
e. It requires all variables to be multivariate normal 
f. There is little or no multicollinearity in the data 
 
23. What is a Generalized linear model? 
Ans In the case of general linear models, the distribution of residuals is assumed to be 
Gaussian. If that is not the case, the relationship between Y and the model parameters is no 
longer linear. But if the distribution of residuals is one from the exponential family, such as 
the binomial, Poisson, negative binomial, or gamma distributions, there exists some function 
of the mean of Y which has a linear relationship with the model parameters. This function is 
called the link function. For example, a binomial residual can use a logit or a probit link 
function, and a Poisson residual uses a log link function. In short, the general linear model is 
the special case with a Gaussian error distribution and an identity link, while the generalized 
linear model allows any exponential-family error distribution together with an appropriate 
link function. 
 
24. In which cases you would use Generalized linear models? 
Ans. The cases in which Generalized linear models can be used are as follows:- 
(a) If the response variable is categorical 
(b) When the distribution of residuals is non-normal or non-Gaussian 
(c) If the distribution of the residuals is from the exponential family, such as the gamma distribution 
 
25. Why is a naïve Bayes model called naïve? 
Ans Naïve Bayes machine learning algorithm is considered Naïve because the 
assumptions the algorithm makes are virtually impossible to find in real-life data. 
Conditional probability is calculated as a pure product of individual probabilities of 
components. This means that the algorithm assumes the presence or absence of a specific 
feature of a class is not related to the presence or absence of any other feature (absolute 
independence of features), given the class variable. For instance, a fruit may be considered 
to be a banana if it is yellow, long and about 5 inches in length. However, if these features 
depend on each other or are based on the existence of other features, a naïve Bayes 
classifier will assume all these properties to contribute independently to the probability 
that this fruit is a banana. The assumption that all features in a given dataset are equally 
important and independent rarely holds in a real-world scenario. 
 
26. How do you handle class imbalance in a dataset. Explain what could you do at the 
data level and at the model level? 
Ans a) Collecting more data 
b) Changing the performance metric (it has been observed in many cases that accuracy 
does not work well with imbalanced datasets). 
c) Resampling the dataset. Add samples of data that is under represented. This is 
known as oversampling. Delete samples of data that is over represented. This is 
known as undersampling. 
d) Generating synthetic samples - to randomly sample the attributes from instances 
in the minority class. 
e) Spot checking of different algorithms 
f) Penalizing the models - Penalized classification imposes an additional cost on the 
model for making classification mistakes on the minority class during training. These 
penalties can bias the model to pay more attention to the minority class. 
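A minimal sketch of points (c) and (f) with scikit-learn and NumPy (the data is synthetic; the random oversampling here is done by hand, and synthetic sampling such as SMOTE from point (d) would need the separate imbalanced-learn package):

# Sketch only: naive oversampling and a class-weighted (penalised) model on an imbalanced dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# (c) Resampling: oversample the minority class by duplicating its rows.
minority = np.where(y == 1)[0]
extra = np.random.default_rng(0).choice(minority, size=len(minority) * 5, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

# (f) Penalised model: weight mistakes on the minority class more heavily during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)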
 
Additional information:- 
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machin
e-learning-dataset/ 
 
 
27. What is the difference between R2 and adj-R2?  
Ans One major difference between R-squared and the adjusted R-squared is that 
R-squared supposes that every independent variable in the model explains the variation in 
the dependent variable. R-squared cannot verify whether the coefficient estimates and 
predictions are biased. It also does not show whether a regression model is adequate: it can 
show a low R-squared figure for a good model, or a high R-squared figure for a model that 
doesn't fit the data. 
 
The adjusted R-squared compares the descriptive power of regression models that include 
diverse numbers of predictors. Every predictor added to a model increases R-squared and 
never decreases it. Thus, a model with more terms may seem to have a better fit just for 
the fact that it has more terms, while the adjusted R-squared compensates for the addition 
of variables and only increases if the new term enhances the model above what would be 
obtained by chance, and decreases when a predictor enhances the model less than 
what is predicted by chance. In an overfitting condition, an incorrectly high value of 
R-squared, which leads to a decreased ability to predict, is obtained. This is not the case 
with the adjusted R-squared. 
 
 
28. Why is "having" used when "where" is already used ?  
 
Both the HAVING and WHERE clauses are similar in nature and have the functionality of 
filtering the data, with a slight difference: the WHERE clause filters on individual rows, 
while the HAVING clause filters on aggregated values, i.e. it applies only to groups as a 
whole. Now, why is the HAVING clause needed when WHERE already exists? Consider an 
example: you are booking a movie ticket from a specific theatre chain, e.g. IMAX (as you 
have a specific taste), but due to a tight budget you want to list ONLY the IMAX theatres which 
have an average movie ticket price below Rs. 300, or you want to list only the IMAXs whose 
number of shows is greater than 5 per day. 
Here “where” clause will eliminate the theatres that are not from “IMAX” before calculating 
the average price of the movie ticket. To implement the filter on average ticket price we’ll 
need the HAVING clause as here we need both “GROUPING” and “SUMMARIZING”, like 
count(*) or avg(). Also when using having, you must also have a “Group By” clause in the 
query. 
 
https://docs.microsoft.com/en-us/sql/ssms/visual-db-tools/use-having-and-where-clauses-i
n-the-same-query-visual-database-tools 
http://www.java2s.com/Code/Oracle/Select-Query/UsingtheWHEREGROUPBYandHAVINGCl
ausesTogether.htm 
 
29. What are window functions? 
A window function performs a calculation on a "set of rows" and returns a single 
aggregated value for each row. This is similar to the kind of calculation that 
can be done with an aggregate function, but with a slight difference. 
When we use aggregate functions with the GROUP BY clause, we "lose" the individual rows. 
We can't mix attributes from an individual row with the results of an aggregate function; 
the function is performed on the rows as an entire group. But unlike regular aggregate 
functions, use of a window function does not cause rows to become grouped into a single 
output row - the rows retain their separate identities. We can generate a result set with 
some attributes of an individual row together with the results of the window function. This 
makes windowing one of the coolest features of SQL. 
For example, if you want to compare each player's auction value in the IPL to the average 
auction amount spent on each player of his team, in the same table: 
SELECT team_name, player_name, auction_amt, avg(auction_amt) OVER (PARTITION BY 
team_name) FROM auction_table; 
 
Refer PostgreSQL documentation: 
https://www.postgresql.org/docs/9.1/static/tutorial-window.html 
https://community.modeanalytics.com/sql/tutorial/sql-window-functions/ 
 
30. Difference between logistic and linear regression.Is logistic regression a linear 
model? Why or why not? 
 
Linear regression : This algorithm’s principle is to find a linear relation within your data. 
Once the linear relation is found, predicting a new value is done with respect to this 
relation.Linear regression is used when the desired output is required to take a continuous 
value based on whatever input/dataset is given to the algorithm.  
Let us say, you ask a child in fifth grade to arrange people in his class by increasing order of 
weight, without asking them their weights! What do you think the child will do? He / she 
would likely look (visually analyze) at the height and build of people and arrange them 
using a combination of these visible parameters. This is linear regression in real life! The 
child has actually figured out that height and build are correlated with weight by a 
relationship that looks like a linear equation. 
 
Logistic Regression: Don't go by its name! It is a classification algorithm, not a regression 
algorithm. Let's say your friend gives you a puzzle to solve. There are only 2 outcome 
scenarios – either you solve it or you don’t.  
 
Linear regression is used when the output is continuous in nature based on corresponding 
input, it can have any one of an infinite number of possible values. Consider the weather 
forecasting problem where you want to predict the tomorrow’s temperature, % humidity 
etc. 
Now suppose your problem was to not predict the average temperature or % humidity, but 
what type of day it will be (eg., sunny, cloudy, stormy, rainy etc). This problem will give an 
output belonging to a certain set of values predefined, hence it is basically classifying your 
output into categories. Classification problems can be either binary (yes/no, 0/1, like you either 
solve the problem or not) or multiclass (like the problem described above). Logistic 
regression is used in classifying problems of machine learning. 
https://www.analyticsvidhya.com/blog/2017/09/common-machine-learning-algorithms/ 
For mathematical POV: 
https://medium.com/deep-math-machine-learning-ai/chapter-1-complete-linear-regression
-with-math-25b2639dde23 
https://medium.com/deep-math-machine-learning-ai/chapter-2-0-logistic-regression-with-
math-e9cbb3ec6077 
 
Logistic regression is called a generalized linear model not because the estimated 
probability of the response event is linear, but because the logit of the estimated 
probability of the response is a linear function of the parameters. 

 
https://www.slideshare.net/SatishGupta4/ihcc-logistic-regression 
Linear means linear (degree =1 ) in betas (the coefficients) but not in x's (the independent 
variables). 
https://stats.stackexchange.com/questions/88603/why-is-logistic-regression-a-linear-model 
 
   
 
31. What is p value, how to read it, how to calculate it. 
 
Imagine that India, at full strength (all top players in their best form), got into a head-to-head 
match with Zimbabwe, but it turned out that India lost the game. 

Fans were stunned. And frustrated. And angry. 

The reasoning goes like this: if India had played as usual, it would have been highly 
unlikely to be defeated. But the team lost the game! So fans had every reason to cast doubt on 
the team's fair play. (Some might pull out the allegation of match fixing.) 

To put it another way, the reasoning goes like this: 
We have a hypothesis that India rocks as usual. If the hypothesis had been true, the 
probability of India losing would have been very small, say, less than 5%. But India lost the 
game. 
So this unlikelihood was considered as evidence against the team's fair play. 
You may say the p-value is a measurement of the weirdness of your observations according to 
your current beliefs - the smaller, the weirder. You believe in lots of things; for some things, 
reality reconfirms your beliefs by giving you expected outcomes, but for others, 
reality challenges you by throwing weird, unexpected outcomes at you, and at some 
point you can't deny it anymore, so you start to realize that what you once believed may be 
wrong. 
 
The P value, or calculated probability, is the probability of finding the observed, or more 
extreme, results when the null hypothesis (H0) of a study question is true – the definition of 
‘extreme’ depends on how the hypothesis is being tested.The p-value is defined as the 
probability, under the null hypothesis H, of obtaining a result equal to or more extreme 
than what was actually observed.  
Calculation of P-value: 
Look up your test statistic on the appropriate distribution - in this case, on the standard 
normal (Z-) distribution, using a Z-table or statistical software. Then: 
 
 
Pr(X >= x|H) for right tail event 
Pr(X <= x|H) for left tail event 
2min(Pr(X >= x|H),Pr(X <= x|H)) for double tail event 
 
Relationship between significance level and p-value​. The relationship is: the p-value is the 
smallest significance level at which the null hypothesis would be rejected 
Reject the null hypothesis if P is "small". 
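A small sketch of the calculation for a Z test statistic (assuming SciPy; the z value is made up):

# Sketch only: p-values from a standard normal test statistic, assuming SciPy.
from scipy.stats import norm

z = 2.1                            # hypothetical test statistic
p_right = 1 - norm.cdf(z)          # right-tail event, Pr(X >= z | H0)
p_left = norm.cdf(z)               # left-tail event,  Pr(X <= z | H0)
p_two = 2 * min(p_right, p_left)   # two-tailed p-value

print(p_right, p_two)              # reject H0 if the p-value < chosen significance level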
 
https://en.wikipedia.org/wiki/P-value 
http://www.perfendo.org/docs/BayesProbability/twelvePvaluemisconceptions.pdf 
https://www.students4bestevidence.net/p-value-in-plain-english-2/ 
https://www.quora.com/What-is-a-p-value-explained-in-layman%E2%80%99s-terms 
 
To know how to do Hypothesis testing: 
http://people.cas.uab.edu/~mpogwizd/ma180-fall-2014/HypothesisTesting.pdf 
 
 
 
 
 
32. What is the difference between median and mean, when to chose what?  
The average (mean) is the sum of a set of numbers divided by the count of numbers in the 
data set: mean = (x1 + x2 + ... + xn) / n. 

The median is the middle number in the data set, which can be determined by sorting the 
numbers in order and picking the middle value (or the average of the two middle values if 
the count is even). 
 
Now which one to choose depends on the data distribution and the purpose. 
The mean is the average: if you want to find the per capita income, i.e. "What is the average 
income of the country?", you would use the mean. In terms of accuracy, for a bell-shaped 
population distribution the mean is the more accurate measure. 
The median is more of a central measure, where you draw a middle line: "What is the income 
of the typical person?" On that basis you would identify people below the poverty line. In 
terms of accuracy, for a heavy-tailed distribution, or when the data are skewed in one 
direction or the other (which is the usual case), the median is the more accurate measure. 
The forte of the median comes when you want to handle outliers. 
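A tiny illustration of the outlier point (assuming NumPy; the incomes are made up):

# Sketch only: one extreme income drags the mean but barely moves the median.
import numpy as np

incomes = np.array([30, 32, 35, 38, 40, 1000])  # in thousands; 1000 is an outlier
print("mean:  ", incomes.mean())                 # ~195.8, pulled up by the outlier
print("median:", np.median(incomes))             # 36.5, robust to the outlier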
 
https://learnandteachstatistics.wordpress.com/2013/04/29/median/ 
https://math.stackexchange.com/questions/2304710/mean-vs-median-when-to-use 
 
 
33. What is VIF? 
A variance inflation factor(VIF) detects multicollinearity (a predictor/independent variable 
can be linearly predicted from the others with a significant accuracy) in regression analysis. 
Multicollinearity is when there’s correlation between predictors (i.e. independent variables) 
in a model;  
its presence can adversely affect your regression results. The VIF estimates how much the 
variance of a regression coefficient is inflated due to existing multicollinearity in the model. 
 
VIFs are calculated by taking a predictor and regressing it against every other predictor in 
the model. This gives you the R-squared value, which can then be plugged into the VIF 
formula for predictor i (e.g. x1 or x2): 

VIF_i = 1 / (1 − R_i^2) 

where R_i^2 is the coefficient of determination (the proportion of variance explained) 
obtained by regressing predictor i on the other predictors. 
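A minimal sketch with statsmodels (the predictor columns are simulated, and x3 is deliberately built to be collinear with x1):

# Sketch only: VIFs via statsmodels; assumes statsmodels, pandas and NumPy.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["x3"] = 2 * df["x1"] + rng.normal(scale=0.1, size=200)  # nearly a linear copy of x1

X = sm.add_constant(df)  # include an intercept column before computing VIFs
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, variance_inflation_factor(X.values, i))  # large values for x1 and x3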
https://en.wikipedia.org/wiki/Coefficient_of_determination 
https://onlinecourses.science.psu.edu/stat501/node/347 
https://en.wikipedia.org/wiki/Variance_inflation_factor 

34. What is BIC and AIC? 


 
Bayesian information criterion (BIC) or Schwarz criterion (also SBC, SBIC) is a basis or 
criterion for model selection among a set of models; Ideally, the model with the lowest BIC 
is preferred. 
Whereas the Akaike information criterion (AIC) gives an idea of the relative quality of 
statistical models for a given set of data. 
Given a collection of models for the data, AIC estimates the quality of each model 
relative to each of the other models. 
When fitting models, it is possible to increase the likelihood by adding parameters, but 
doing so may result in overfitting. Both BIC and AIC attempt to resolve this problem by 
introducing a penalty term for the number of parameters in the model 
https://stats.stackexchange.com/questions/577/is-there-any-reason-to-prefer-the-aic-or-bic
-over-the-other 
https://methodology.psu.edu/AIC-vs-BIC 
 
AIC = -2*ln(likelihood) + 2*k, 
and 
 
BIC = -2*ln(likelihood) + ln(N)*k, 
where: 
 
k = model degrees of freedom 
N = number of observations 
https://www.quora.com/What-is-the-difference-between-an-AIC-information-criteron-and-a
-BIC-information-criterion 
 
35. What is t-test? 
A t-test is commonly used to determine whether the mean of a population significantly 
differs from a specific value (called the hypothesized mean) or from the mean of another 
population. 
A t-test is a procedure that boils all of your sample data down to one value, the t-value. 

The t-value is a signal-to-noise ratio: 

Signal: It is the difference between the sample mean and the hypothesized (null) mean. 

Let's consider a hypothesized mean wait time for an Ola cab of 5 minutes. If your random 
sample had a mean wait time of 5.1 minutes, the signal is 5.1 − 5 = 0.1 minutes. 
The difference is relatively small, so the signal in the numerator is weak. 

However, if the riders in your random sample had a mean wait time of 8 minutes, the 
difference is much larger: 8 − 5 = 3 minutes. So the signal is stronger. 

The denominator is the noise: a measure of variability known as the standard error of the 
mean. Putting the two together, for a one-sample test: 

t = (x̄ − μ0) / (s / √n) 

where x̄ is the sample mean, μ0 the hypothesized mean, s the sample standard deviation 
and n the sample size. 
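A quick sketch of the one-sample test with SciPy (the wait times are simulated):

# Sketch only: one-sample t-test against a hypothesised mean of 5 minutes, assuming SciPy.
import numpy as np
from scipy import stats

waits = np.random.default_rng(0).normal(loc=5.5, scale=1.5, size=30)  # simulated wait times
t_stat, p_value = stats.ttest_1samp(waits, popmean=5)
print(t_stat, p_value)  # a small p-value suggests the mean wait time differs from 5 minutes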

 
http://blog.minitab.com/blog/adventures-in-statistics-2/understanding-t-tests-t-values-and-
t-distributions 
http://blog.minitab.com/blog/adventures-in-statistics-2/understanding-t-tests%3A-1-sample
%2C-2-sample%2C-and-paired-t-tests 
 
 
36. How do you know if the clusters generated are good? 
To measure the quality of clustering results, there are two kinds of validity indices: external 
indices and internal indices. 
 
An external index is a measure of agreement between two partitions where the first 
partition is the a priori known clustering structure, and the second results from the 
clustering procedure (Dudoit et al., 2002). 
 
Internal indices are used to measure the goodness of a clustering structure without 
external information (Tseng et al., 2005). 
 
For external indices, we evaluate the results of a clustering algorithm based on a known 
cluster structure of a data set (or cluster labels). 
 
For internal indices, we evaluate the results using quantities and features inherent in the 
data set. The optimal number of clusters is usually determined based on an internal validity 
index. 
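A minimal sketch of one internal and one external index with scikit-learn (ground-truth labels exist here only because the data is simulated):

# Sketch only: silhouette (internal) and adjusted Rand index (external), assuming scikit-learn.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, true_labels = make_blobs(n_samples=300, centers=3, random_state=0)
pred_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("silhouette (internal):  ", silhouette_score(X, pred_labels))          # needs no labels
print("adjusted Rand (external):", adjusted_rand_score(true_labels, pred_labels))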
http://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-
optimal-number-of-clusters-3-must-know-methods/ 
https://link.springer.com/article/10.1007/s40595-016-0086-9 
If your unsupervised learning method is probabilistic, another option is to evaluate some 
probability measure (log-likelihood, perplexity, etc.) on held-out data. The motivation here is 
that if your unsupervised learning method assigns high probability to similar data that 
wasn't used to fit the parameters, then it has probably done a good job of capturing the 
distribution of interest. 
 
http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation 
https://stats.stackexchange.com/questions/21807/evaluation-measure-of-clustering-withou
t-having-truth-labels 
https://www.analyticsvidhya.com/blog/2013/11/getting-clustering-right-part-ii/ 
 
 
 
37. What is ANOVA? 
ANOVA is used when you want to check the variation among and between the considered 
groups; basically, you are testing whether there is a difference between the groups or not. 
For example: students of different coaching institutes are taking the IIT JEE, and we want to 
see which coaching institute is giving better results. Or a researcher conducts a study to 
investigate the effect of 3 different teaching methods on the reading ability of school 
children. 
T-test vs ANOVA 
ANOVA is very similar to the t-test, just for more than two groups. When comparing only two 
groups (A and B), you test the difference between the two groups with a Student t-test. So 
when comparing three groups (A, B, and C) it's natural to think of testing each of the three 
possible two-group comparisons (A – B, A – C, and B – C) with a t-test. 
But running an exhaustive set of two-group t-tests can be risky, because as the number of 
groups goes up, the number of two-group comparisons goes up even more, inflating the 
chance of a false positive. 
So here ANOVA comes to the rescue. 

 
The key quantity is the sample variance, s² = Σ(xᵢ − x̄)² / (n − 1), 
where n − 1 is the degrees of freedom (DF), the summation is called the sum of squares (SS), 
the result is called the mean square (MS), and the squared terms are deviations from the 
sample mean. ANOVA compares the mean square between groups to the mean square 
within groups via the F ratio. 
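A one-way ANOVA sketch with SciPy (scores for three hypothetical teaching methods are simulated):

# Sketch only: one-way ANOVA across three groups, assuming SciPy; the scores are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
method_a = rng.normal(70, 5, 30)
method_b = rng.normal(75, 5, 30)
method_c = rng.normal(72, 5, 30)

f_stat, p_value = stats.f_oneway(method_a, method_b, method_c)
print(f_stat, p_value)  # a small p-value suggests at least one group mean differs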
 
 
https://www.edanzediting.com/blogs/statistics-anova-explained 
http://www.dummies.com/education/science/biology/the-basic-idea-of-an-analysis-of-varia
nce-anova/ 
It's always good to practice ANOVA. A few problems to help understand it: 
http://rstudio-pubs-static.s3.amazonaws.com/228015_d8d0ddab79664707890681a9a75cf1
6d.html 
 
 
38. Basic of Neural Networks ? 
https://towardsdatascience.com/a-gentle-introduction-to-neural-networks-series-part-1-2b
90b87795bc 
Basic of Convolutional neural networks (CNNs) and recurrent neural networks 
(RNNs).  
https://medium.com/machine-learning-for-humans/neural-networks-deep-learning-cdad8a
eae49b 
   
 
39. What is KS statistics? 
The Kolmogorov-Smirnov test is basically a test of goodness of fit. It compares the empirical 
cumulative distribution function of a variable with a "specified distribution". 
Suppose that we have an i.i.d. sample X1, . . . , Xn with some unknown distribution "D" and 
we would like to test the hypothesis that "D" is equal to a particular distribution "D0". In its 
two-sample form, the KS-test tries to determine whether two datasets differ significantly. The 
advantage of the KS-test is that it is agnostic to the distribution of the sample considered. 
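A minimal sketch of the one-sample and two-sample KS tests with SciPy (the data is simulated):

# Sketch only: one-sample test against N(0,1) and a two-sample comparison, assuming SciPy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.uniform(-2, 2, size=200)

print(stats.kstest(x, "norm"))  # is x drawn from a standard normal distribution?
print(stats.ks_2samp(x, y))     # do x and y come from the same distribution?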
 
Programming POV: ​http://daithiocrualaoich.github.io/kolmogorov_smirnov/ 
Mathematics POV: 
https://ocw.mit.edu/courses/mathematics/18-443-statistics-for-applications-fall-2006/lectur
e-notes/lecture14.pdf 
 
40. What is cross-Validation? Why is it important ? What are the different methods 
of cross validation? 
Ans There is always a need to validate the stability of a machine learning model. There is no 
assurance that the model created will work well on unseen data. We need some kind of 
assurance that the model has got most of the patterns from the data correct, and that it is 
not picking up too much of the noise - in other words, that it is low on bias and variance. 
 
Validation 
This process of deciding whether the numerical results quantifying hypothesized 
relationships between variables, are acceptable as descriptions of the data, is known as 
validation. Generally, an error estimation for the model is made after training, better 
known as evaluation of residuals. In this process, a numerical estimate of the difference in 
predicted and original responses is done, also called the training error. However, this only 
gives us an idea about how well our model does on data used to train it. Now its possible 
that the model is underfitting or overfitting the data. So, the problem with this evaluation 
technique is that it does not give an indication of how well the learner will generalize to an 
independent/ unseen data set. Getting this idea about our model is known as ​Cross 
Validation. 
 
Holdout Method 
Now a basic remedy for this involves removing a part of the training data and using it to get 
predictions from the model trained on rest of the data. The error estimation then tells how 
our model is doing on unseen data or the validation set. This is a simple kind of cross 
validation technique, also known as the holdout method. Although this method doesn’t 
take any overhead to compute and is better than traditional validation, it still suffers from 
issues of high variance. This is because it is not certain which data points will end up in the 
validation set and the result might be entirely different for different sets. 
K-Fold Cross Validation 
As there is never enough data to train your model, removing a part of it for validation 
poses a problem of underfitting. By reducing the training data, we risk losing important 
patterns/ trends in data set, which in turn increases error induced by bias. So, what we 
require is a method that provides ample data for training the model and also leaves ample 
data for validation. K Fold cross validation does exactly that. 
In K-Fold cross validation, the data is divided into k subsets. Now the holdout method is 
repeated k times, such that each time, one of the k subsets is used as the test set/ 
validation set and the other k−1 subsets are put together to form a training set. The error 
estimation is averaged over all k trials to get the total effectiveness of our model. As can be 
seen, every data point gets to be in a validation set exactly once, and gets to be in a training 
set k−1 times. This significantly reduces bias, as we are using most of the data for fitting, and 
also significantly reduces variance, as most of the data is also being used in the validation set. 
Interchanging the training and test sets also adds to the effectiveness of this method. As a 
general rule and empirical evidence, K = 5 or 10 is generally preferred, but nothing’s fixed 
and it can take any value. 
Stratified K-Fold Cross Validation 
In some cases, there may be a large imbalance in the response variables. For example, in 
dataset concerning price of houses, there might be large number of houses having high 
price. Or in case of classification, there might be several times more negative samples than 
positive samples. For such problems, a slight variation in the K Fold cross validation 
technique is made, such that each fold contains approximately the same percentage of 
samples of each target class as the complete set, or in case of prediction problems, the 
mean response value is approximately equal in all the folds. This variation is also known as 
Stratified K Fold. 
Above explained validation techniques are also referred to as Non-exhaustive cross 
validation methods. These do not compute all ways of splitting the original sample, i.e. you 
just have to decide how many subsets need to be made. 
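A minimal sketch of K-fold and stratified K-fold with scikit-learn (the built-in dataset is a stand-in for real data):

# Sketch only: 5-fold and stratified 5-fold cross validation, assuming scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # preserves class proportions per fold

print("k-fold:    ", cross_val_score(model, X, y, cv=kf).mean())
print("stratified:", cross_val_score(model, X, y, cv=skf).mean())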
 
 
 
  
 
 
 
