
Data Science Interview Q’s — I

A walkthrough of the essentials of data science interviews.

Hi there, thanks for the continuous support of my previous articles. Today we
will go through the essential questions commonly asked by interviewers to test
root-level knowledge of data science, rather than fancy advanced questions.

“If your foundation pillar is strong, you can build anything.”

1. What are the feature selection methods used to select the right
variables?
There are three main families of feature selection methods: filter, wrapper, and intrinsic methods.

Filter methods: Linear Discriminant Analysis, ANOVA, Chi-square

Wrapper methods: Forward Selection, Backward Elimination, Recursive Feature Elimination

Intrinsic methods: tree-based models.
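
As an illustration, here is a minimal sketch of a wrapper-style method (Recursive Feature Elimination) using scikit-learn; the built-in breast cancer dataset and the choice of 10 features are only placeholders for your own data.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Recursive Feature Elimination: repeatedly fit the estimator and drop the weakest feature
selector = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=10)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # 1 = selected; larger numbers were eliminated earlier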

2. You are given a dataset consisting of variables with more than
30% missing values. How will you deal with them?

Answer: We can simply remove the rows with missing values. It is the
quickest way, but it comes at the cost of throwing away valuable information.

For smaller datasets, we can substitute missing values with the mean, median, or mode, or
even use models such as linear models or XGBoost to predict the missing values.
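
A minimal sketch of both options with pandas and scikit-learn, assuming a small toy DataFrame in place of your real data:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "salary": [50000, 62000, np.nan, 58000]})

# Option 1: drop rows with missing values (quick, but throws away information)
df_dropped = df.dropna()

# Option 2: substitute missing values with the column median
imputer = SimpleImputer(strategy="median")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)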

3. What is dimensionality reduction and what are its benefits?

Dimensionality reduction refers to the process of converting a dataset with a vast number of
dimensions into one with fewer dimensions (fields) while conveying similar information concisely.

This reduction helps in compressing data and reducing storage space. It also reduces
computation time, as fewer dimensions lead to less computing. It removes redundant
features; for example, there is no point in storing the same value in different units (meters and
inches).
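
For example, here is a minimal sketch of dimensionality reduction with PCA in scikit-learn; the iris dataset and the choice of 2 components are only for illustration.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)                      # keep only 2 dimensions
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                         # (150, 2)
print(pca.explained_variance_ratio_)           # share of variance kept per component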

4. For the given points, how will you calculate the Euclidean
distance in Python?

plot1 = [1,3]

plot2 = [2,5]

The Euclidean distance can be calculated as follows:

from math import sqrt
euclidean_distance = sqrt((plot1[0]-plot2[0])**2 + (plot1[1]-plot2[1])**2)
print(euclidean_distance)   # sqrt(5) ≈ 2.236

5. How can you select K for K-means?

We use the elbow method to select K for K-means clustering. The idea of the elbow
method is to run K-means on the dataset for a range of values of K and, for each K, compute
the within-cluster sum of squares (WSS), defined as the sum of the squared distances
between each member of a cluster and its centroid. The K at which the WSS curve bends (the
"elbow") is a good choice, because adding more clusters beyond that point yields little improvement.

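A minimal sketch of the elbow method with scikit-learn, using synthetic blob data as a stand-in for a real dataset; KMeans exposes the WSS as the inertia_ attribute.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Print WSS (inertia_) for each candidate K; the "elbow" is where the drop flattens out
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, km.inertia_)
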
6. What is the significance of the p-value?

A p-value smaller than or equal to 0.05 indicates strong evidence against the null
hypothesis, so we reject the null hypothesis (H0) in favour of the alternative
hypothesis (H1).

A p-value greater than 0.05 indicates weak evidence against the null hypothesis,
so we fail to reject the null hypothesis.
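
As a small illustration, here is a sketch of reading a p-value from a two-sample t-test in SciPy; the two groups are made-up numbers.

from scipy import stats

group_a = [5.1, 4.9, 5.3, 5.0, 5.2]
group_b = [5.8, 6.0, 5.9, 6.1, 5.7]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(p_value)   # p <= 0.05: reject H0; p > 0.05: fail to reject H0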

7. You are given a dataset on cancer detection. You have to build a
classification model and achieved an accuracy of 96%. Why shouldn't
you be happy with your model's performance? What can you do
about it?

Cancer detection results in imbalanced data. On an imbalanced dataset, accuracy
should not be used as the measure of performance. It is important to focus on the
remaining 4%, which represents the patients who were wrongly diagnosed. Early
diagnosis is crucial when it comes to cancer detection and can greatly improve a
patient’s prognosis.

Hence, to evaluate model performance we should use Sensitivity (True Positive Rate),
Specificity (True Negative Rate), and the F-measure to determine the class-wise
performance of the classifier.
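
A minimal sketch of computing sensitivity, specificity, and the F-measure from a confusion matrix with scikit-learn; y_true and y_pred are made-up labels standing in for real model output.

from sklearn.metrics import confusion_matrix, f1_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true positive rate (recall)
specificity = tn / (tn + fp)   # true negative rate
print(sensitivity, specificity, f1_score(y_true, y_pred))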

8. You have run the association rules algorithm on your dataset, and
the two rules {banana, apple} => {grape} and {apple,orange}=> {grape}
have been found to be relevant. What else must be true?

Choose the right answer.

1. {banana, apple, grape, orange} must be a frequent itemset
2. {banana, apple} => {orange} must be a relevant rule
3. {grape} => {banana, apple} must be a relevant rule
4. {grape, apple} must be a frequent itemset

The answer is 4: {grape, apple} must be a frequent itemset, because every subset of a
frequent itemset must itself be frequent.

9. Your organization has a website where visitors randomly receive
one of two coupons. It is also possible that visitors to the website
will not receive a coupon. You have been asked to determine if
offering a coupon to website visitors has any impact on their
purchase decisions. Which analysis method should you use?

1. One-way ANOVA
2. K-means Clustering
3. Association Rules
4. Student's t-Test

Answer: 1. One-way ANOVA
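
A minimal sketch of a one-way ANOVA with SciPy; the purchase amounts for the three hypothetical groups (coupon A, coupon B, no coupon) are invented for illustration.

from scipy import stats

coupon_a = [22.1, 25.3, 19.8, 24.0, 23.5]
coupon_b = [27.4, 26.1, 29.0, 25.8, 28.2]
no_coupon = [20.0, 18.7, 21.5, 19.9, 20.8]

f_stat, p_value = stats.f_oneway(coupon_a, coupon_b, no_coupon)
print(f_stat, p_value)   # a small p-value suggests the group means differ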

10. What are feature vectors?

A feature vector is an n-dimensional vector of numerical features that represents an
object. In machine learning, feature vectors are used to represent the numeric or
symbolic characteristics (called features) of an object in a mathematical way that is
easy to analyze.

11. What is root cause analysis?

Root cause analysis was initially developed to analyze industrial accidents but is now
widely used in other areas. It is a problem-solving technique used for isolating the
root cause of faults or problems. A factor is called a root cause if its removal from
the problem-fault sequence prevents the final undesirable event from recurring.

12. Do gradient descent methods always converge to similar points?

They do not, because in some cases they reach a local minimum or local optimum
point rather than the global optimum. This is governed by the data and
the starting conditions.

13. What is the goal of A/B Testing?

A/B testing is statistical hypothesis testing for a randomized experiment with two variants,
A and B. The objective of A/B testing is to detect whether changes to a web page
maximise or increase the outcome of a strategy.

14. What are the drawbacks of the linear model?

• the assumption of linearity of the errors
• it can’t be used for count outcomes or binary outcomes
• there are overfitting problems that it can’t solve

15. What is the law of large numbers?

It is a theorem that describes the result of performing the same experiment a large
number of times. This theorem forms the basis of frequency-style thinking. It states that
the sample mean, sample variance, and sample standard deviation converge to the quantities
they are trying to estimate as the sample size grows.

16. What are confounding variables?

These are extraneous variables in a statistical model that correlate, directly or
inversely, with both the dependent and the independent variable. The estimate fails
to account for the confounding factor.

17. What is a star schema?

It is a traditional database schema with a central fact table. Satellite tables map IDs to
physical names or descriptions and can be connected to the central fact table using
the ID fields; these tables are known as lookup tables and are principally useful in
real-time applications, as they save a lot of memory. Sometimes, star schemas
involve several layers of summarization to recover information faster.

18. What are eigenvalues and eigenvectors?

Eigenvectors are the directions along which a particular linear transformation acts by
flipping, compressing, or stretching; they are key to understanding linear transformations.

Eigenvalues are the factors by which the transformation scales along those directions. In data
analysis, we usually calculate the eigenvectors and eigenvalues of a correlation or covariance matrix.
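
A minimal sketch with NumPy, computing the eigenvalues and eigenvectors of a covariance matrix built from randomly generated data.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3))             # 100 samples, 3 features

cov = np.cov(data, rowvar=False)             # 3x3 covariance matrix
eigenvalues, eigenvectors = np.linalg.eig(cov)

print(eigenvalues)    # how much the transformation stretches along each direction
print(eigenvectors)   # the directions (columns) that are only scaled, not rotated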

19. Why is resampling done?

Resampling is done in any of these cases.

• Estimating the accuracy of sample statistics by using subsets of accessible data, or drawing
randomly with replacement from a set of data points.
• Substituting labels on data points when performing significance tests.
• Validating models by using random subsets (bootstrapping, cross-validation); a minimal
bootstrap sketch follows below.
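
A minimal bootstrap sketch with NumPy (resampling with replacement) on synthetic data, estimating a confidence interval for the sample mean.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10, scale=2, size=200)

# Draw 1000 bootstrap samples (with replacement) and record each sample mean
boot_means = [rng.choice(data, size=len(data), replace=True).mean() for _ in range(1000)]
print(np.percentile(boot_means, [2.5, 97.5]))   # bootstrap 95% confidence interval for the mean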

20. What is selection bias?

Selection bias is a problematic situation in which error is introduced due to a non-random
population sample.

21. What are the types of biases that can occur during sampling?

1. Selection bias
2. Undercoverage bias
3. Survivorship bias

22. What is survivorship bias?

Survivorship bias is the logical error of focusing on the entities that survived some
process and casually overlooking those that did not because of their lack of
prominence. This can lead to wrong conclusions in numerous ways.

23. If you have 4GB of RAM on your machine and you want to train
your model on a 10GB dataset, how would you go about this problem?
Have you ever faced this kind of problem in your machine
learning/data science experience so far?

For neural networks: training in batches (feeding NumPy arrays batch by batch) will work,
since only one batch needs to be held in memory at a time.

For traditional ML: models that support incremental learning can be used, for example
scikit-learn's SGDClassifier for linear problems (and online or approximate SVM variants for
non-linear ones), trained with the help of gradient descent. All we need to do is call the
model.partial_fit() function and pass the data batch by batch.
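
A minimal sketch of out-of-core training with scikit-learn's SGDClassifier, assuming a hypothetical CSV file "big_dataset.csv" with a "label" column that is read in chunks with pandas.

import pandas as pd
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = [0, 1]                      # all classes must be declared for partial_fit

for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
    X_batch = chunk.drop(columns=["label"]).values
    y_batch = chunk["label"].values
    model.partial_fit(X_batch, y_batch, classes=classes)   # train on one batch at a time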
24. What is selection bias?

Selection bias is the bias introduced by the selection of individuals, groups, or data
for analysis in such a way that proper randomization is not achieved, thereby
ensuring that the sample obtained is not representative of the population intended
to be analyzed. It is sometimes referred to as the selection effect. The phrase
“selection bias” most often refers to the distortion of a statistical analysis resulting
from the method of collecting samples. If the selection bias is not taken into account,
then some conclusions of the study may not be accurate.

25. Explain regularization and why it is useful.

Regularization is the process of adding a tuning parameter to a model to induce
smoothness and prevent overfitting. This is most often done by adding a penalty on the
weight vector, typically a constant multiple of its L1 norm (Lasso) or L2 norm (Ridge). The
model predictions should then minimize the regularized loss function calculated
on the training set.

26. What is TF/IDF vectorization?

tf-idf is short for term frequency-inverse document frequency. It is a numerical statistic
intended to reflect how important a word is to a document in a collection or
corpus. It is often used as a weighting factor in information retrieval and text mining.
The tf-idf value increases proportionally to the number of times a word appears in
the document, but is offset by the frequency of the word in the corpus, which helps
to adjust for the fact that some words appear more frequently in general.
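
A minimal sketch of tf-idf vectorization with scikit-learn's TfidfVectorizer on a toy corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)   # sparse matrix: documents x terms
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))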

27. What is Box Cox Transformation?

A Box-Cox transformation is a way to transform a non-normal dependent variable
into a normal shape.
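
A minimal sketch of a Box-Cox transformation with SciPy on skewed, strictly positive synthetic data (Box-Cox requires positive values):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=1000)   # right-skewed, strictly positive

transformed, fitted_lambda = stats.boxcox(skewed)
print(fitted_lambda)   # the lambda chosen by maximum likelihood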

28. Is it possible to capture the correlation between a continuous and a
categorical variable?

Yes, we can use the analysis of covariance (ANCOVA) technique to capture the association
between continuous and categorical variables.

29. Would treating a categorical variable as a continuous variable
result in a better predictive model?

Yes, but only when the variable is ordinal in nature; in that case, treating the categorical
variable as continuous can result in a better predictive model.

30. What is Power Analysis?


Power analysis is an integral part of experimental design. It helps you to
determine the sample size required to detect an effect of a given size from a cause
with a specific level of assurance. It also allows you to deploy a particular probability
within a sample size constraint.

31. What are different ranking algorithms?

Traditional ML algorithms solve a prediction problem (classification or regression) on
a single instance at a time. For example, if you are doing spam detection on email, you will
look at all the features associated with that email and classify it as spam or not. The
aim of traditional ML is to come up with a class (spam or not spam) or a single
numerical score for that instance.

Ranking algorithms like learning to rank (LTR) solve a ranking problem on a list of items. The aim of
LTR is to come up with an optimal ordering of those items. As such, LTR doesn't care
much about the exact score that each item gets, but cares more about the relative
ordering among all the items. RankNet, LambdaRank, and LambdaMART are all LTR
algorithms developed by Chris Burges and his colleagues at Microsoft Research.

1. RankNet — the cost function of RankNet aims to minimize the number of inversions in
the ranking. RankNet optimizes the cost function using Stochastic Gradient Descent.
2. LambdaRank — Burges et al. found that during the RankNet training procedure you don't need
the costs themselves, only the gradients of the cost with respect to the model score. You can think
of these gradients as little arrows attached to each document in the ranked list, indicating
the direction we’d like those documents to move. Further, they found that scaling the
gradients by the change in NDCG obtained by swapping each pair of documents gave good
results. The core idea of LambdaRank is to use this new cost function for training a RankNet.
On the experimental dataset, this shows both speed and accuracy improvements over the original
RankNet.
3. LambdaMART — LambdaMART combines LambdaRank and MART (Multiple Additive Regression
Trees). While MART uses gradient boosted decision trees for prediction tasks, LambdaMART
uses gradient boosted decision trees with a cost function derived from LambdaRank for
solving a ranking task. On the experimental dataset, LambdaMART has shown better results
than LambdaRank and the original RankNet.

32. Assumptions of Linear Regression.

1. The relationship between X and y must be linear.
2. The features must be independent of each other (no multicollinearity).
3. Homoscedasticity — the variance of the errors must be constant across different input
values.
4. The distribution of Y along X should be the normal distribution (normally distributed errors).

33. What are Type 1 and Type 2 errors? In which scenarios are
Type 1 and Type 2 errors significant?

Rejection of a true null hypothesis is known as a Type 1 error. In simple words, false
positives are Type 1 errors.
Not rejecting a false null hypothesis is known as a Type 2 error. False negatives
are Type 2 errors.

A Type 1 error is significant where it is important to correctly identify negatives. For
example, if a man who is not suffering from a particular disease is marked as positive
for that infection, the medications given to him might damage his organs. A Type 2
error is significant in cases where it is important to correctly identify positives. For
example, an alarm has to be raised in case of a burglary in a bank, but if the system
identifies it as a false case, the alarm won't be raised on time, resulting in a heavy loss.

34. What are the conditions for overfitting and underfitting?

In overfitting, the model performs well on the training data but fails to generalize to new
data. In underfitting, the model is too simple and not able to
identify the correct relationship. The corresponding bias and variance conditions are as follows.

Overfitting — low bias and high variance result in an overfitted model. Decision trees are
more prone to overfitting.

Underfitting — high bias and low variance. Such a model doesn’t perform well on test
data either. For example, linear regression is more prone to underfitting.

35. What do you mean by Normalisation? What is the difference between
Normalisation and Standardization?

Normalisation is a process of bringing the features into a similar range so that the model
can perform well and does not get inclined towards any particular feature. For
example, suppose we have a dataset with multiple features, where one feature is age,
in the range 18–60, and another feature is salary, ranging from 20,000–2,000,000.
The values differ greatly: age is a two-digit integer while salary sits in a range
significantly higher than age. So, to bring the features into a comparable range, we need
Normalisation.

Both Normalisation and Standardization are methods of feature conversion.
However, the methods differ in the conversion they apply. After Normalisation the data
is scaled to the range 0–1, while after Standardization the data is scaled so that its
mean comes out to be 0 (and its standard deviation 1).
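
A minimal sketch contrasting the two with scikit-learn, using made-up age/salary values like those in the example above:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[18, 20000], [35, 500000], [60, 2000000]], dtype=float)

print(MinMaxScaler().fit_transform(X))    # Normalisation: each feature squeezed into [0, 1]
print(StandardScaler().fit_transform(X))  # Standardization: each feature has mean 0, std 1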

36. What do you mean by Regularisation? What are L1 and L2
Regularisation?

Regularisation is a method to improve a model that is overfitted by introducing
extra terms in the loss function. This helps the model perform better on
unseen data.

There are two types of Regularisation:

L1 Regularisation — In L1 we add lambda times the absolute values of the weights to the loss
function. Here, the feature weights are penalized on the basis of their absolute value.

L2 Regularisation — In L2 we add lambda times the squared weights to the loss
function. Here, the feature weights are penalized on the basis of their squared values.
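
A minimal sketch of both penalties with scikit-learn; here alpha plays the role of lambda, and the diabetes dataset is just a stand-in.

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge

X, y = load_diabetes(return_X_y=True)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1: some coefficients shrink exactly to zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: coefficients shrink but rarely reach zero
print(lasso.coef_)
print(ridge.coef_)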

37. Explain the Naive Bayes Classifier and the principle on which
it works.

The Naive Bayes Classifier is a probabilistic model. It works on the principle of
Bayes' Theorem, with the "naive" assumption that the features are independent of each other.
The accuracy of Naive Bayes can sometimes be improved by combining it with kernel
functions (for example, kernel density estimates for continuous features).

Bayes' Theorem — this theorem describes conditional probability: the probability of
occurrence of Event A given that Event B has already occurred, written P(A|B).
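
A minimal sketch of a Gaussian Naive Bayes classifier in scikit-learn; the iris dataset is used purely for illustration.

# Naive Bayes applies Bayes' theorem, P(A|B) = P(B|A) * P(A) / P(B),
# with the "naive" assumption that the features are independent of each other.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = GaussianNB().fit(X_train, y_train)
print(clf.score(X_test, y_test))   # accuracy on held-out data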

38. Explain the DBSCAN clustering technique. In what ways is DBSCAN
better than K-Means clustering?

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised
approach that splits the vectors into different groups based on a minimum distance and the
number of points lying within that range. In DBSCAN clustering we have two significant
parameters:

Epsilon — the minimum radius or distance between two data points for them to be tagged
as belonging to the same cluster.

Min Sample Points — the minimum number of samples that should fall within that range for
the group to be identified as one cluster.

The DBSCAN clustering technique has a few advantages over other clustering algorithms:

1. In DBSCAN we do not need to provide a fixed number of clusters; as many clusters are
formed as the distribution of the data points warrants. In K-means, by contrast, we need
to provide the number of clusters we want to split our data into.

2. In DBSCAN we also get a noise cluster, which helps us identify outliers. This sometimes
also acts as a significant input for tuning the hyperparameters of a model accordingly.
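
A minimal sketch of DBSCAN with scikit-learn on synthetic two-moon data, showing the two key parameters (eps and min_samples) and the noise label -1 assigned to outliers; the parameter values are illustrative only.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(np.unique(labels))   # cluster ids; -1 marks points treated as noise/outliers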

The article is already getting too long, so we will walk through a Linear Regression and
clustering questionnaire in the next article, Part II. That will surprise you!

Thanks again, for your time, if you enjoyed this short article there are tons of topics in
advanced analytics, data science, and machine learning available in my medium repo.
https://medium.com/@bobrupakroy
Some of my alternative internet presences Facebook, Instagram, Udemy, Blogger,
Issuu, Slideshare, Scribd and more.

Also available on Quora @ https://www.quora.com/profile/Rupak-Bob-Roy

Let me know if you need anything. Talk Soon.
