Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 24

Interview ML

Statistics
Basic Definitions
Mean is also known as average of all the numbers in the data set.

Median is mid value in this ordered data set.


Mode is the number which occur most often in the data set.
Variance is the numerical values that describe the variability of the observations
from its arithmetic mean and denoted by sigma-squared(σ2 ). Variance measure
how far individuals in the group are spread out, in the set of data from the mean.
Standard deviation is a measure of dispersion of observation within
dataset relative to their mean. It is square root of the variance and
denoted by Sigma (σ). Standard deviation is expressed in the same unit
as the values in the dataset so it measure how much observations of
the data set differs from its mean.
Data Types
• Discrete Data: as the name suggests, can take only specified values.
For example, when you roll a die, the possible outcomes are 1, 2, 3, 4,
5 or 6 and not 1.5 or 2.45.
• Continuous Data can take any value within a given range. The range
may be finite or infinite. For example, A girl’s weight or height, the
length of the road. The weight of a girl can be any value from 54 kgs,
or 54.5 kgs, or 54.5436kgs.
Types of Distributions
A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0
(failure), and a single trial. So the random variable X which has a Bernoulli distribution
can take value 1 with the probability of success, say p, and the value 0 with the
probability of failure, say q or 1-p.
In Uniform distribution, the probabilities of getting each outcomes is
equally likely. For e.g., when you roll a fair die, the outcomes are 1 to 6.
Unlike Bernoulli Distribution, all the 6 number of possible outcomes of
a uniform distribution are equally likely. Uniform distribution is called
rectangular distribution.
Binomial Distribution is a distribution where only two outcomes are possible, such
as success or failure, gain or loss, win or lose and where the probability of success
and failure is same for all the trials.
The properties of a Binomial Distribution are
1. Each trial is independent.
2. There are only two possible outcomes in a trial- either a success or a failure.
3. A total number of n identical trials are conducted.
4. The probability of success and failure is same for all trials. (Trials are identical.)
Normal distribution represents the behaviour of most of the situations in the
universe.
Normal distribution has the following characteristics:
1. The mean, median and mode of the distribution coincide.
2. The curve of the distribution is bell-shaped and symmetrical about the line x=μ.
3. The total area under the curve is 1.
4. Exactly half of the values are to the left of the centre and the other half to the
right.
Poisson Distribution is applicable in situations where events occur at random points of time and
space wherein our interest lies only in the number of occurrences of the event.
A distribution is called Poisson distribution when the following assumptions are valid:
1. Any successful event should not influence the outcome of another successful event.
2. The probability of success over a short interval must equal the probability of success over a
longer interval.
3. The probability of success in an interval approaches zero as the interval becomes smaller.

• λ is the rate at which an event occurs,


• t is the length of a time interval,
• And X is the number of events in that time interval.
Here, X is called a Poisson Random Variable and the probability distribution of X is called Poisson
distribution.
Let µ denote the mean number of events in an interval of length t. Then, µ = λ*t.
The exponential distribution is one of the widely used continuous distributions. It is
often used to model the time elapsed between events.
Relationship between Distributions
Data
Science
Decision Tree
Decision tree uses the tree representation to solve the problem in which each leaf node
corresponds to a class label and attributes are represented on the internal node of the tree.
We can represent any Boolean function on discrete attributes using the decision tree.
In Decision Tree the major challenge is to identification of the attribute for the root node in
each level. This process is known as attribute selection.
We have two popular attribute selection measures:
1. Information Gain: measure of this change in entropy.
2. Gini Index: metric to measure how often a randomly chosen element would be incorrectly
identified.

Entropy is the measure of uncertainty of a random variable, it characterizes the impurity of an


arbitrary collection of examples. The higher the entropy more the information content.
Random Forest
Random Forest is a classifier that contains a number of decision trees on various
subsets of the given dataset and takes the average to improve the predictive
accuracy of that dataset.
It is based on the concept of ensemble learning, which is a process of combining
multiple classifiers to solve a complex problem and to improve the performance of
the model.
The greater number of trees in the forest leads to higher accuracy and prevents the
problem of overfitting.
Linear Regression

Linear regression algorithm shows a linear relationship between a dependent (y) and one
or more independent (y) variables, hence called as linear regression. Since linear regression
shows the linear relationship, which means it finds how the value of the dependent
variable is changing according to the value of the independent variable.
Linear regression can be further divided into two types of the algorithm:
• Simple Linear Regression: If a single independent variable is used to predict the value of
a numerical dependent variable, then such a Linear Regression algorithm is called Simple
Linear Regression.
• Multiple Linear regression: If more than one independent variable is used to predict the
value of a numerical dependent variable, then such a Linear Regression algorithm is
called Multiple Linear Regression.
Mean Squared Error (MSE) cost function is the average of squared error occurred
between the predicted values and actual values.
Gradient descent is used to minimize the MSE by calculating the gradient of the
cost function. A regression model uses gradient descent to update the coefficients
of the line by reducing the cost function. It is done by a random selection of values
of coefficient and then iteratively update the values to reach the minimum cost
function.

R-squared is a statistical method that determines the goodness of fit. It measures


the strength of the relationship between the dependent and independent variables
on a scale of 0-100%. The high value of R-square determines the less difference
between the predicted values and actual values and hence represents a good
model. It is also called a coefficient of determination, or coefficient of multiple
determination for multiple regression.
There are various ways to build a model in Machine Learning, which are:
1. All-in
2. Backward Elimination
3. Forward Selection
4. Bidirectional Elimination
5. Score Comparison
Logistic Regression
Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the
categorical dependent variable using a given set of independent variables. The final
values given by logistic regression are the probabilistic values which lie between 0
and 1.
On the basis of the categories, Logistic Regression can be classified into three
types:
• Binomial: In binomial Logistic regression, there can be only two possible types of
the dependent variables, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
• Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered
types of dependent variables, such as "low", "Medium", or "High".

You might also like