Data Science Interview Q's - I
Hey there, thanks for the continued support of my previous articles. Today we will go through essential questions commonly asked by interviewers to probe root-level knowledge of data science, rather than fancy advanced questions.
1. What are the feature selection methods used to select the right
variables?
There are two main families of feature selection methods: filter methods, which score each feature independently (e.g., by variance or correlation with the target), and wrapper methods, which search over feature subsets using the performance of a model trained on them.
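As a quick illustration, here is a minimal filter-method sketch in plain Python; the feature names, data, and zero-variance threshold are all hypothetical:

```python
# Filter-method sketch: drop features whose variance falls below a
# threshold (a simple univariate filter on hypothetical data).
def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

features = {
    "age":  [25, 32, 47, 51],
    "flag": [1, 1, 1, 1],   # constant feature -> zero variance
}
# Keep only features carrying any variation at all
selected = [name for name, xs in features.items() if variance(xs) > 0.0]
print(selected)  # ['age']
```

scikit-learn's VarianceThreshold implements the same idea at scale.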
2. How can you handle missing data values?
Answer: We can simply remove the rows with missing values. It is the quickest way, but it comes at the cost of throwing away valuable information.
For smaller datasets, we can substitute missing values with the mean, median, or mode, or even use models such as linear regression or XGBoost to predict the missing values.
3. What is dimensionality reduction, and what are its benefits?
Dimensionality reduction refers to the process of converting a dataset with many dimensions (fields) into one with fewer dimensions that conveys similar information concisely.
This reduction helps compress the data and reduce storage space. It also cuts computation time, as fewer dimensions mean less computing. It removes redundant features as well; for example, there is no point in storing the same value in two different units (meters and inches).
4. For the given points, how will you calculate the Euclidean distance in Python?
plot1 = [1, 3]
plot2 = [2, 5]
# sqrt((2-1)**2 + (5-3)**2) = sqrt(5) ≈ 2.236
euclidean_distance = ((plot2[0] - plot1[0]) ** 2 + (plot2[1] - plot1[1]) ** 2) ** 0.5
5. How can you select K for K-means clustering?
We use the elbow method to select K for K-means clustering. The idea of the elbow method is to run K-means on the dataset for a range of values of K and, for each K, compute the within-cluster sum of squares (WSS): the sum of the squared distances between each member of a cluster and its centroid. The K at which WSS stops dropping sharply (the "elbow" of the curve) is a good choice.
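The WSS computation described above can be sketched in plain Python; the cluster points and centroid below are hypothetical toy values:

```python
# Within-cluster sum of squares (WSS) for one cluster: the sum of
# squared distances from each member point to the cluster centroid.
def cluster_wss(points, centroid):
    return sum(
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for point in points
    )

# Toy cluster with centroid (2.0, 4.0)
points = [(1, 3), (2, 5), (3, 4)]
centroid = (2.0, 4.0)
print(cluster_wss(points, centroid))  # (1+1) + (0+1) + (1+0) = 4.0
```

For the elbow method, this quantity is summed over all clusters and plotted against K.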
6. What is the significance of the p-value?
A p-value smaller than or equal to 0.05 indicates strong evidence against the null hypothesis, so we reject the null hypothesis (H0) in favour of the alternative hypothesis (H1).
A p-value greater than 0.05 indicates weak evidence against the null hypothesis, so we fail to reject it.
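As a toy illustration of where such p-values come from, here is the textbook two-sided p-value for a z-statistic under a standard normal distribution (not tied to any particular test in this article):

```python
import math

# Two-sided p-value for a z-statistic under the standard normal:
# p = erfc(|z| / sqrt(2)) = 2 * (1 - Phi(|z|)).
def p_value(z):
    return math.erfc(abs(z) / math.sqrt(2))

print(round(p_value(1.96), 3))  # 0.05  -> borderline significant
print(round(p_value(0.5), 3))   # 0.617 -> fail to reject H0
```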
7. How should you evaluate model performance on an imbalanced dataset?
Accuracy alone is misleading when the classes are imbalanced. Hence, to evaluate model performance we should use sensitivity (true positive rate) and specificity (true negative rate), and the F-measure to determine the class-wise performance of the classifier.
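A minimal sketch of these two metrics from raw confusion-matrix counts; the counts themselves are made up for illustration:

```python
# Sensitivity (true positive rate) and specificity (true negative
# rate) from confusion-matrix counts: tp, fn, tn, fp.
def sensitivity(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

# Hypothetical counts: 40 true positives, 10 false negatives,
# 45 true negatives, 5 false positives.
print(sensitivity(40, 10))  # 0.8
print(specificity(45, 5))   # 0.9
```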
8. You have run the association rules algorithm on your dataset, and
the two rules {banana, apple} => {grape} and {apple,orange}=> {grape}
have been found to be relevant. What else must be true?
1. One-way ANOVA
2. K-means clustering
3. Association rules
4. Student's t-test
Answer: 1. One-way ANOVA
Root cause analysis was initially developed to analyze industrial accidents but is now widely used in other areas. It is a problem-solving technique for isolating the root cause of faults or problems: a factor is called a root cause if its removal from the problem-fault sequence prevents the final undesirable event from recurring.
They do not, because in some cases they reach a local minimum or a local optimum rather than the global optimum. This is governed by the data and the starting conditions.
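A small sketch of this behaviour: plain gradient descent on a function with two minima ends up in different minima depending on the starting point. The function and learning rate below are chosen purely for illustration:

```python
# Gradient descent on f(x) = x**4 - 3*x**2 + x, which has two minima.
# Depending on the starting point, plain gradient descent settles
# into different minima -- it does not always reach the global one.
def grad(x):
    return 4 * x**3 - 6 * x + 1  # derivative of f

def descend(x, lr=0.01, steps=1000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

left = descend(-2.0)   # settles near the left minimum (~ -1.30)
right = descend(2.0)   # settles near the right minimum (~ 1.13)
print(left, right)
```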
This is statistical hypothesis testing for randomized experiments with two variants, A and B. The objective of A/B testing is to detect whether a change to a web page improves the outcome of a strategy.
It is a theorem that describes the result of performing the same experiment a large number of times. This theorem forms the basis of frequency-style thinking. It states that the sample mean, sample variance, and sample standard deviation converge to the quantities they are trying to estimate.
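A toy simulation of this convergence; the seed and sample size are arbitrary:

```python
import random

# Law of large numbers sketch: the sample mean of many fair coin
# flips (0 or 1, each with probability 0.5) converges to the true
# mean of 0.5 as the number of flips grows.
random.seed(42)
flips = [random.randint(0, 1) for _ in range(100_000)]
sample_mean = sum(flips) / len(flips)
print(sample_mean)  # very close to 0.5
```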
Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing, or stretching; the corresponding eigenvalues give the factor by which the transformation stretches or compresses along each of those directions.
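For a 2x2 matrix, the eigenvalues can be computed directly from the characteristic polynomial; a small sketch, assuming real eigenvalues:

```python
import math

# Eigenvalues of a 2x2 matrix [[a, b], [c, d]] from its
# characteristic polynomial: lam**2 - (a+d)*lam + (a*d - b*c) = 0.
def eigenvalues_2x2(a, b, c, d):
    trace, det = a + d, a * d - b * c
    disc = math.sqrt(trace**2 - 4 * det)  # assumes real eigenvalues
    return (trace + disc) / 2, (trace - disc) / 2

# Diagonal matrix [[2, 0], [0, 3]]: stretches x by 2 and y by 3,
# so its eigenvalues are exactly those stretch factors.
print(eigenvalues_2x2(2, 0, 0, 3))  # (3.0, 2.0)
```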
Resampling is done in any of these cases:
1. Estimating the accuracy of sample statistics by using subsets of accessible data, or drawing randomly with replacement from a set of data points.
2. Substituting labels on data points when performing significance tests.
3. Validating models by using random subsets (bootstrapping, cross-validation).
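A minimal bootstrap sketch of the first case, drawing with replacement from a small hypothetical dataset:

```python
import random

# Bootstrap sketch: estimate the sampling distribution of the mean
# by repeatedly resampling the data with replacement.
random.seed(0)
data = [2, 4, 4, 5, 7, 9]           # hypothetical sample
boot_means = [
    sum(random.choices(data, k=len(data))) / len(data)
    for _ in range(1000)
]
avg_boot_mean = sum(boot_means) / len(boot_means)
print(avg_boot_mean)  # close to the sample mean of ~5.17
```

The spread of `boot_means` estimates the standard error of the mean without any distributional assumption.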
21. What are the types of biases that can occur during sampling?
Survivorship bias is the logical error of focusing on the things that survived some process while overlooking those that did not because of their lack of prominence. This can lead to wrong conclusions in numerous ways.
23. If you have 4 GB of RAM in your machine and you want to train
your model on a 10 GB dataset, how would you go about this problem?
Have you ever faced this kind of problem in your machine
learning/data science experience so far?
For neural networks: train in batches, loading each batch as a NumPy array.
For traditional ML algorithms: scikit-learn estimators such as SGDClassifier (linear models trained with stochastic gradient descent) support incremental learning. All we need to do is call the model's partial_fit() method and pass one batch of data at a time.
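The batch-by-batch idea can be sketched without any library at all; here a tiny linear model is updated one batch at a time in the spirit of partial_fit(), so the full dataset never needs to sit in memory at once. The data stream and learning rate are hypothetical:

```python
# Incremental-learning sketch: a linear model y = w*x + b updated
# batch by batch with stochastic gradient descent.
def sgd_batch_step(w, b, batch, lr=0.05):
    for x, y in batch:
        err = (w * x + b) - y     # prediction error for this sample
        w -= lr * err * x
        b -= lr * err
    return w, b

# Hypothetical stream of 4 batches drawn (noise-free) from y = 2x + 1
batches = [[(x / 10, 2 * (x / 10) + 1) for x in range(i, i + 5)]
           for i in range(0, 20, 5)]

w, b = 0.0, 0.0
for _ in range(300):              # epochs over the simulated stream
    for batch in batches:
        w, b = sgd_batch_step(w, b, batch)
print(round(w, 2), round(b, 2))   # approaches 2.0 and 1.0
```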
24. What is selection bias?
Selection bias is the bias introduced by selecting individuals, groups, or data for analysis in such a way that proper randomization is not achieved, so the sample obtained is not representative of the population intended to be analyzed. It is sometimes referred to as the selection effect. The phrase "selection bias" most often refers to the distortion of a statistical analysis resulting from the method of collecting samples. If selection bias is not taken into account, some conclusions of the study may not be accurate.
Yes, we can use the analysis of covariance (ANCOVA) technique to capture the association between continuous and categorical variables.
Yes, but a categorical variable should be treated as continuous only when the variable is ordinal in nature; in that case it can give a better predictive model.
Ranking algorithms such as learning to rank (LTR) solve a ranking problem over a list of items. The aim of LTR is to come up with an optimal ordering of those items; as such, LTR does not care much about the exact score each item gets, but cares about the relative ordering among all the items. RankNet, LambdaRank, and LambdaMART are all LTR algorithms developed by Chris Burges and his colleagues at Microsoft Research.
33. What are Type 1 and Type 2 errors? In which scenarios are
Type 1 and Type 2 errors significant?
Rejecting a true null hypothesis is known as a Type 1 error; in simple words, a false positive is a Type 1 error.
Failing to reject a false null hypothesis is known as a Type 2 error; a false negative is a Type 2 error.
A Type 1 error is significant when a false alarm is costly (for example, telling a healthy patient they are ill), while a Type 2 error is significant when missing a real effect is costly (for example, telling an ill patient they are healthy).
In overfitting, the model performs well on the training data but fails to generalize to any new data. In underfitting, the model is too simple to identify the correct relationship. The corresponding bias and variance conditions are:
Overfitting — low bias and high variance result in an overfitted model. Decision trees are more prone to overfitting.
Underfitting — high bias and low variance; such a model does not perform well even on the test data. For example, linear regression is more prone to underfitting.
L1 regularisation — in L1 we add lambda times the absolute values of the weights to the loss function, so the feature weights are penalized on the basis of their absolute value.
L2 regularisation — in L2 we add lambda times the squared weights to the loss function, so the feature weights are penalized on the basis of their squared values.
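The two penalty terms can be sketched directly; lambda and the weights below are hypothetical:

```python
# L1 and L2 penalty terms added to a loss function
# (lam is the regularization strength lambda).
def l1_penalty(weights, lam):
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    return lam * sum(w * w for w in weights)

weights = [0.5, -1.5, 2.0]
print(l1_penalty(weights, 0.1))  # 0.1 * (0.5 + 1.5 + 2.0) = 0.4
print(l2_penalty(weights, 0.1))  # 0.1 * (0.25 + 2.25 + 4.0) = 0.65
```

Because L1 penalizes absolute values, it tends to drive some weights exactly to zero (sparse models), while L2 shrinks all weights smoothly.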
The Naive Bayes classifier algorithm is a probabilistic model that works on the Bayes' theorem principle. Its accuracy on continuous features can often be improved by combining it with kernel density estimation instead of assuming a fixed distribution for each feature.
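Bayes' theorem itself is a one-liner; a sketch with made-up toy probabilities:

```python
# Bayes' theorem: P(class | feature) =
#     P(feature | class) * P(class) / P(feature)
def posterior(likelihood, prior, evidence):
    return likelihood * prior / evidence

# Hypothetical spam-filter numbers: P("offer" | spam) = 0.6,
# P(spam) = 0.4, P("offer") = 0.3.
print(posterior(0.6, 0.4, 0.3))  # P(spam | "offer") = 0.8
```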
Epsilon — the radius of the neighbourhood: the maximum distance between two data points for them to be considered neighbours in the same cluster.
Min sample points — the minimum number of samples that must fall within that radius for the region to be identified as a cluster.
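These two parameters can be illustrated with a core-point check, the basic building block of DBSCAN; the data and thresholds are hypothetical, and this is only a sketch, not the full algorithm:

```python
import math

# DBSCAN core-point check: a point is a core point if at least
# min_samples points (including itself) lie within distance eps.
def is_core_point(point, data, eps, min_samples):
    neighbors = sum(1 for q in data if math.dist(point, q) <= eps)
    return neighbors >= min_samples

data = [(0, 0), (0.5, 0), (0, 0.5), (5, 5)]
print(is_core_point((0, 0), data, eps=1.0, min_samples=3))  # True
print(is_core_point((5, 5), data, eps=1.0, min_samples=3))  # False -> noise
```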
The DBSCAN clustering technique has a few advantages over other clustering algorithms:
2. DBSCAN also identifies a noise cluster, which helps us find outliers. This can also act as a significant input when tuning the hyperparameters of a model accordingly.
Thanks again for your time. If you enjoyed this short article, there are tons of topics in advanced analytics, data science, and machine learning available in my Medium repo.
https://medium.com/@bobrupakroy
Some of my alternative internet presences: Facebook, Instagram, Udemy, Blogger, Issuu, Slideshare, Scribd, and more.