
Data Science Interview Q’s — I

A walkthrough of the essentials of data science interviews.

Hi there, thanks for the continuous support of my previous articles. Today we
will go through the essential questions commonly asked by interviewers to test
root-level knowledge of data science, rather than fancy advanced questions.

“If your foundation pillar is strong, you can build anything.”

1. What are the feature selection methods used to select the right
variables?
There are three main families of feature selection methods: filter, wrapper, and intrinsic methods.

Filter methods: Linear Discriminant Analysis, ANOVA, Chi-square

Wrapper methods: Forward Selection, Backward Elimination, Recursive Feature Elimination

Intrinsic methods: tree-based models.
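
As an illustration, here is a minimal sketch of a wrapper-style method (Recursive Feature Elimination) using scikit-learn; the built-in breast cancer dataset and the choice of 10 features are only placeholders for your own data.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Recursive Feature Elimination: repeatedly fit the estimator and drop the weakest feature
selector = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=10)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # 1 = selected; larger numbers were eliminated earlier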

2. You are given a dataset consisting of variables with more than
30% missing values. How will you deal with them?

Answer: We can simply remove the rows with missing values. It is the
quickest way, but it comes at the cost of throwing away valuable information.

For smaller datasets, we can substitute missing values with the mean, median, or mode, or
even use models such as linear models or XGBoost to predict the missing values.
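
A minimal sketch of both options with pandas and scikit-learn, assuming a small toy DataFrame in place of your real data:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "salary": [50000, 62000, np.nan, 58000]})

# Option 1: drop rows with missing values (quick, but throws away information)
df_dropped = df.dropna()

# Option 2: substitute missing values with the column median
imputer = SimpleImputer(strategy="median")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)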

3. What is dimensionality reduction and what are its benefits?

Dimensionality reduction refers to the process of converting a dataset with a vast number of
dimensions into one with fewer dimensions (fields) while conveying similar information concisely.

This reduction helps in compressing data and reducing storage space. It also reduces
computation time, as fewer dimensions lead to less computing. It removes redundant
features; for example, there is no point in storing the same value in different units (meters and
inches).
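
For example, here is a minimal sketch of dimensionality reduction with PCA in scikit-learn; the iris dataset and the choice of 2 components are only for illustration.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)                      # keep only 2 dimensions
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                         # (150, 2)
print(pca.explained_variance_ratio_)           # share of variance kept per component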

4. For the given points, how will you calculate the Euclidean
distance in Python?

plot1 = [1,3]

plot2 = [2,5]

The Euclidean distance can be calculated as follows:

from math import sqrt
euclidean_distance = sqrt((plot1[0]-plot2[0])**2 + (plot1[1]-plot2[1])**2)
print(euclidean_distance)   # sqrt(5) ≈ 2.236

5. How can you select K for K-means?

We use the elbow method to select K for K-means clustering. The idea of the elbow
method is to run K-means on the dataset for a range of values of K and, for each K, compute
the within-cluster sum of squares (WSS), defined as the sum of the squared distances
between each member of a cluster and its centroid. The K at which the WSS curve bends (the
"elbow") is a good choice, because adding more clusters beyond that point yields little improvement.

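A minimal sketch of the elbow method with scikit-learn, using synthetic blob data as a stand-in for a real dataset; KMeans exposes the WSS as the inertia_ attribute.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Print WSS (inertia_) for each candidate K; the "elbow" is where the drop flattens out
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, km.inertia_)
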
6. What is the significance of the p-value?

A p-value smaller than or equal to 0.05 indicates strong evidence against the null
hypothesis, so we reject the null hypothesis (H0) in favour of the alternative
hypothesis (H1).

A p-value greater than 0.05 indicates weak evidence against the null hypothesis,
so we fail to reject the null hypothesis.
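
As a small illustration, here is a sketch of reading a p-value from a two-sample t-test in SciPy; the two groups are made-up numbers.

from scipy import stats

group_a = [5.1, 4.9, 5.3, 5.0, 5.2]
group_b = [5.8, 6.0, 5.9, 6.1, 5.7]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(p_value)   # p <= 0.05: reject H0; p > 0.05: fail to reject H0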

7. You are given a dataset on cancer detection. You have to build a
classification model and achieved an accuracy of 96%. Why shouldn't
you be happy with your model's performance? What can you do
about it?

Cancer detection results in imbalanced data. On an imbalanced dataset, accuracy
should not be used as the measure of performance. It is important to focus on the
remaining 4%, which represents the patients who were wrongly diagnosed. Early
diagnosis is crucial when it comes to cancer detection and can greatly improve a
patient’s prognosis.

Hence, to evaluate model performance we should use Sensitivity (True Positive Rate),
Specificity (True Negative Rate), and the F-measure to determine the class-wise
performance of the classifier.
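
A minimal sketch of computing sensitivity, specificity, and the F-measure from a confusion matrix with scikit-learn; y_true and y_pred are made-up labels standing in for real model output.

from sklearn.metrics import confusion_matrix, f1_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true positive rate (recall)
specificity = tn / (tn + fp)   # true negative rate
print(sensitivity, specificity, f1_score(y_true, y_pred))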

8. You have run the association rules algorithm on your dataset, and
the two rules {banana, apple} => {grape} and {apple,orange}=> {grape}
have been found to be relevant. What else must be true?

Choose the right answer.

1. {banana, apple, grape, orange} must be a frequent itemset
2. {banana, apple} => {orange} must be a relevant rule
3. {grape} => {banana, apple} must be a relevant rule
4. {grape, apple} must be a frequent itemset

The answer is 4: {grape, apple} must be a frequent itemset, because every subset of a
frequent itemset must itself be frequent.

9. Your organization has a website where visitors randomly receive
one of two coupons. It is also possible that visitors to the website
will not receive a coupon. You have been asked to determine if
offering a coupon to website visitors has any impact on their
purchase decisions. Which analysis method should you use?

1. One-way ANOVA
2. K-means Clustering
3. Association Rules
4. Student's t-Test

Answer: 1. One-way ANOVA
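
A minimal sketch of a one-way ANOVA with SciPy; the purchase amounts for the three hypothetical groups (coupon A, coupon B, no coupon) are invented for illustration.

from scipy import stats

coupon_a = [22.1, 25.3, 19.8, 24.0, 23.5]
coupon_b = [27.4, 26.1, 29.0, 25.8, 28.2]
no_coupon = [20.0, 18.7, 21.5, 19.9, 20.8]

f_stat, p_value = stats.f_oneway(coupon_a, coupon_b, no_coupon)
print(f_stat, p_value)   # a small p-value suggests the group means differ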

10. What are feature vectors?

A feature vector is an n-dimensional vector of numerical features that represents an
object. In machine learning, feature vectors are used to represent the numeric or
symbolic characteristics (called features) of an object in a mathematical way that is
easy to analyze.

11. What is root cause analysis?

Root cause analysis was initially developed to analyze industrial accidents but is now
widely used in other areas. It is a problem-solving technique used for isolating the
root cause of faults or problems. A factor is called a root cause if its removal from
the problem-fault sequence prevents the final undesirable event from recurring.

12. Do gradient descent methods always converge to similar points?

They do not, because in some cases they reach a local minimum or local optimum
point rather than the global optimum. This is governed by the data and
the starting conditions.

13. What is the goal of A/B Testing?

A/B testing is statistical hypothesis testing for a randomized experiment with two variants,
A and B. The objective of A/B testing is to detect whether changes to a web page
maximise or increase the outcome of a strategy.

14. What are the drawbacks of the linear model?

• the assumption of linearity of the errors
• it can’t be used for count outcomes or binary outcomes
• there are overfitting problems that it can’t solve

15. What is the law of large numbers?

It is a theorem that describes the result of performing the same experiment a large
number of times. This theorem forms the basis of frequency-style thinking. It states that
the sample mean, sample variance, and sample standard deviation converge to the quantities
they are trying to estimate as the sample size grows.

16. What are confounding variables?

These are extraneous variables in a statistical model that correlate, directly or
inversely, with both the dependent and the independent variable. The estimate fails
to account for the confounding factor.

17. What is a star schema?

It is a traditional database schema with a central fact table. Satellite tables map IDs to
physical names or descriptions and can be connected to the central fact table using
the ID fields; these tables are known as lookup tables and are principally useful in
real-time applications, as they save a lot of memory. Sometimes, star schemas
involve several layers of summarization to recover information faster.

18. What are eigenvalues and eigenvectors?

Eigenvectors are the directions along which a particular linear transformation acts by
flipping, compressing, or stretching; they are key to understanding linear transformations.

Eigenvalues are the factors by which the transformation scales along those directions. In data
analysis, we usually calculate the eigenvectors and eigenvalues of a correlation or covariance matrix.
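
A minimal sketch with NumPy, computing the eigenvalues and eigenvectors of a covariance matrix built from randomly generated data.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3))             # 100 samples, 3 features

cov = np.cov(data, rowvar=False)             # 3x3 covariance matrix
eigenvalues, eigenvectors = np.linalg.eig(cov)

print(eigenvalues)    # how much the transformation stretches along each direction
print(eigenvectors)   # the directions (columns) that are only scaled, not rotated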

19. Why is resampling done?

Resampling is done in any of these cases.

• Estimating the accuracy of sample statistics by using subsets of accessible data, or drawing
randomly with replacement from a set of data points.
• Substituting labels on data points when performing significance tests.
• Validating models by using random subsets (bootstrapping, cross-validation); a minimal
bootstrap sketch follows below.
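
A minimal bootstrap sketch with NumPy (resampling with replacement) on synthetic data, estimating a confidence interval for the sample mean.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10, scale=2, size=200)

# Draw 1000 bootstrap samples (with replacement) and record each sample mean
boot_means = [rng.choice(data, size=len(data), replace=True).mean() for _ in range(1000)]
print(np.percentile(boot_means, [2.5, 97.5]))   # bootstrap 95% confidence interval for the mean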

20. What is selection bias?

Selection bias is a problematic situation in which error is introduced due to a non-random
population sample.

21. What are the types of biases that can occur during sampling?

1. Selection bias
2. Undercoverage bias
3. Survivorship bias

22. What is survivorship bias?

Survivorship bias is the logical error of focusing on the entities that survived some
process and casually overlooking those that did not because of their lack of
prominence. This can lead to wrong conclusions in numerous ways.

23. If you have 4GB of RAM on your machine and you want to train
your model on a 10GB dataset, how would you go about this problem?
Have you ever faced this kind of problem in your machine
learning/data science experience so far?

For neural networks: training in batches (feeding NumPy arrays batch by batch) will work,
since only one batch needs to be held in memory at a time.

For traditional ML: models that support incremental learning can be used, for example
scikit-learn's SGDClassifier for linear problems (and online or approximate SVM variants for
non-linear ones), trained with the help of gradient descent. All we need to do is call the
model.partial_fit() function and pass the data batch by batch.
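
A minimal sketch of out-of-core training with scikit-learn's SGDClassifier, assuming a hypothetical CSV file "big_dataset.csv" with a "label" column that is read in chunks with pandas.

import pandas as pd
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = [0, 1]                      # all classes must be declared for partial_fit

for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
    X_batch = chunk.drop(columns=["label"]).values
    y_batch = chunk["label"].values
    model.partial_fit(X_batch, y_batch, classes=classes)   # train on one batch at a time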
24. What is selection bias?

Selection bias is the bias introduced by the selection of individuals, groups, or data
for analysis in such a way that proper randomization is not achieved, thereby
ensuring that the sample obtained is not representative of the population intended
to be analyzed. It is sometimes referred to as the selection effect. The phrase
“selection bias” most often refers to the distortion of a statistical analysis resulting
from the method of collecting samples. If the selection bias is not taken into account,
then some conclusions of the study may not be accurate.

25. Explain regularization and why it is useful.

Regularization is the process of adding a tuning parameter to a model to induce
smoothness and prevent overfitting. This is most often done by adding a penalty on the
weight vector, typically a constant multiple of its L1 norm (Lasso) or L2 norm (Ridge). The
model predictions should then minimize the regularized loss function calculated
on the training set.

26. What is TF/IDF vectorization?

tf-idf is short for term frequency-inverse document frequency. It is a numerical statistic
intended to reflect how important a word is to a document in a collection or
corpus. It is often used as a weighting factor in information retrieval and text mining.
The tf-idf value increases proportionally to the number of times a word appears in
the document, but is offset by the frequency of the word in the corpus, which helps
to adjust for the fact that some words appear more frequently in general.
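
A minimal sketch of tf-idf vectorization with scikit-learn's TfidfVectorizer on a toy corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)   # sparse matrix: documents x terms
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))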

27. What is Box Cox Transformation?

A Box-Cox transformation is a way to transform a non-normal dependent variable
into a normal shape.
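
A minimal sketch of a Box-Cox transformation with SciPy on skewed, strictly positive synthetic data (Box-Cox requires positive values):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=1000)   # right-skewed, strictly positive

transformed, fitted_lambda = stats.boxcox(skewed)
print(fitted_lambda)   # the lambda chosen by maximum likelihood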

28. Is it possible to capture the correlation between a continuous and a
categorical variable?

Yes, we can use the analysis of covariance (ANCOVA) technique to capture the association
between continuous and categorical variables.

29. Would treating a categorical variable as a continuous variable
result in a better predictive model?

Yes, but only when the variable is ordinal in nature; in that case, treating the categorical
variable as continuous can result in a better predictive model.

30. What is Power Analysis?


Power analysis is an integral part of experimental design. It helps you to
determine the sample size required to detect an effect of a given size from a cause
with a specific level of assurance. It also allows you to deploy a particular probability
within a sample size constraint.

31. What are different ranking algorithms?

Traditional ML algorithms solve a prediction problem (classification or regression) on
a single instance at a time. For example, if you are doing spam detection on email, you will
look at all the features associated with that email and classify it as spam or not. The
aim of traditional ML is to come up with a class (spam or not spam) or a single
numerical score for that instance.

Ranking algorithms like learning to rank (LTR) solve a ranking problem on a list of items. The aim of
LTR is to come up with an optimal ordering of those items. As such, LTR doesn't care
much about the exact score that each item gets, but cares more about the relative
ordering among all the items. RankNet, LambdaRank, and LambdaMART are all LTR
algorithms developed by Chris Burges and his colleagues at Microsoft Research.

1. RankNet — the cost function of RankNet aims to minimize the number of inversions in
the ranking. RankNet optimizes the cost function using Stochastic Gradient Descent.
2. LambdaRank — Burges et al. found that during the RankNet training procedure you don't need
the costs themselves, only the gradients of the cost with respect to the model score. You can think
of these gradients as little arrows attached to each document in the ranked list, indicating
the direction we’d like those documents to move. Further, they found that scaling the
gradients by the change in NDCG obtained by swapping each pair of documents gave good
results. The core idea of LambdaRank is to use this new cost function for training a RankNet.
On the experimental dataset, this shows both speed and accuracy improvements over the original
RankNet.
3. LambdaMART — LambdaMART combines LambdaRank and MART (Multiple Additive Regression
Trees). While MART uses gradient boosted decision trees for prediction tasks, LambdaMART
uses gradient boosted decision trees with a cost function derived from LambdaRank for
solving a ranking task. On the experimental dataset, LambdaMART has shown better results
than LambdaRank and the original RankNet.

32. Assumptions of Linear Regression.

1. The relationship between X and y must be linear.
2. The features must be independent of each other (no multicollinearity).
3. Homoscedasticity — the variance of the errors must be constant across different input
values.
4. The distribution of Y along X should be the normal distribution (normally distributed errors).

33. What are Type 1 and Type 2 errors? In which scenarios are
Type 1 and Type 2 errors significant?

Rejection of a true null hypothesis is known as a Type 1 error. In simple words, false
positives are Type 1 errors.
Not rejecting a false null hypothesis is known as a Type 2 error. False negatives
are Type 2 errors.

A Type 1 error is significant where it is important to correctly identify negatives. For
example, if a man who is not suffering from a particular disease is marked as positive
for that infection, the medications given to him might damage his organs. A Type 2
error is significant in cases where it is important to correctly identify positives. For
example, an alarm has to be raised in case of a burglary in a bank, but if the system
identifies it as a false case, the alarm won't be raised on time, resulting in a heavy loss.

34. What are the conditions for overfitting and underfitting?

In overfitting, the model performs well on the training data but fails to generalize to new
data. In underfitting, the model is too simple and not able to
identify the correct relationship. The corresponding bias and variance conditions are as follows.

Overfitting — low bias and high variance result in an overfitted model. Decision trees are
more prone to overfitting.

Underfitting — high bias and low variance. Such a model doesn’t perform well on test
data either. For example, linear regression is more prone to underfitting.

35. What do you mean by Normalisation? What is the difference between
Normalisation and Standardization?

Normalisation is a process of bringing the features into a similar range so that the model
can perform well and does not get inclined towards any particular feature. For
example, suppose we have a dataset with multiple features, where one feature is age,
in the range 18–60, and another feature is salary, ranging from 20,000–2,000,000.
The values differ greatly: age is a two-digit integer while salary sits in a range
significantly higher than age. So, to bring the features into a comparable range, we need
Normalisation.

Both Normalisation and Standardization are methods of feature conversion.
However, the methods differ in the conversion they apply. After Normalisation the data
is scaled to the range 0–1, while after Standardization the data is scaled so that its
mean comes out to be 0 (and its standard deviation 1).
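
A minimal sketch contrasting the two with scikit-learn, using made-up age/salary values like those in the example above:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[18, 20000], [35, 500000], [60, 2000000]], dtype=float)

print(MinMaxScaler().fit_transform(X))    # Normalisation: each feature squeezed into [0, 1]
print(StandardScaler().fit_transform(X))  # Standardization: each feature has mean 0, std 1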

36. What do you mean by Regularisation? What are L1 and L2
Regularisation?

Regularisation is a method to improve a model that is overfitted by introducing
extra terms in the loss function. This helps the model perform better on
unseen data.

There are two types of Regularisation:

L1 Regularisation — In L1 we add lambda times the absolute values of the weights to the loss
function. Here, the feature weights are penalized on the basis of their absolute value.

L2 Regularisation — In L2 we add lambda times the squared weights to the loss
function. Here, the feature weights are penalized on the basis of their squared values.
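
A minimal sketch of both penalties with scikit-learn; here alpha plays the role of lambda, and the diabetes dataset is just a stand-in.

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge

X, y = load_diabetes(return_X_y=True)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1: some coefficients shrink exactly to zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: coefficients shrink but rarely reach zero
print(lasso.coef_)
print(ridge.coef_)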

37. Explain the Naive Bayes Classifier and the principle on which
it works.

The Naive Bayes Classifier is a probabilistic model. It works on the principle of
Bayes' Theorem, with the "naive" assumption that the features are independent of each other.
The accuracy of Naive Bayes can sometimes be improved by combining it with kernel
functions (for example, kernel density estimates for continuous features).

Bayes' Theorem — this theorem describes conditional probability: the probability of
occurrence of Event A given that Event B has already occurred, written P(A|B).
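
A minimal sketch of a Gaussian Naive Bayes classifier in scikit-learn; the iris dataset is used purely for illustration.

# Naive Bayes applies Bayes' theorem, P(A|B) = P(B|A) * P(A) / P(B),
# with the "naive" assumption that the features are independent of each other.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = GaussianNB().fit(X_train, y_train)
print(clf.score(X_test, y_test))   # accuracy on held-out data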

38. Explain the DBSCAN clustering technique. In what ways is DBSCAN
better than K-Means clustering?

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised
approach that splits the vectors into different groups based on a minimum distance and the
number of points lying within that range. In DBSCAN clustering we have two significant
parameters:

Epsilon — the minimum radius or distance between two data points for them to be tagged
as belonging to the same cluster.

Min Sample Points — the minimum number of samples that should fall within that range for
the group to be identified as one cluster.

The DBSCAN clustering technique has a few advantages over other clustering algorithms:

1. In DBSCAN we do not need to provide a fixed number of clusters; as many clusters are
formed as the distribution of the data points warrants. In K-means, by contrast, we need
to provide the number of clusters we want to split our data into.

2. In DBSCAN we also get a noise cluster, which helps us identify outliers. This sometimes
also acts as a significant input for tuning the hyperparameters of a model accordingly.
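
A minimal sketch of DBSCAN with scikit-learn on synthetic two-moon data, showing the two key parameters (eps and min_samples) and the noise label -1 assigned to outliers; the parameter values are illustrative only.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(np.unique(labels))   # cluster ids; -1 marks points treated as noise/outliers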

The article is already getting too long, so we will walk through a Linear Regression and
clustering questionnaire in the next article, Part II. That will surprise you!

Thanks again, for your time, if you enjoyed this short article there are tons of topics in
advanced analytics, data science, and machine learning available in my medium repo.
https://medium.com/@bobrupakroy
Some of my alternative internet presences Facebook, Instagram, Udemy, Blogger,
Issuu, Slideshare, Scribd and more.

Also available on Quora @ https://www.quora.com/profile/Rupak-Bob-Roy

Let me know if you need anything. Talk Soon.
