DS Theory

One-sample t-test is used to determine whether a population mean is significantly
different from a hypothesised value.
For example, let's say we want to test whether the average age of passengers on the
Titanic was 30. We can use a one-sample t-test to determine whether the mean age
of the sample is significantly different from 30.
The output will be the t-statistic and the p-value. If the p-value is less than the chosen
significance level (usually 0.05), we reject the null hypothesis and conclude that the
mean age is significantly different from 30.
Two-sample t-test is used to determine whether two independent samples have

different means.
For example, let's say we want to test whether the average age of male and female
passengers on the Titanic was significantly different. We can use a two-sample t-test
to determine whether the mean age of male and female passengers is significantly
different.
mean age of male and female passengers is significantly different.
Paired-sample t-test is used to determine whether two related samples have

different means.
For example, let's say we want to test whether there is a significant difference
between the age of passengers before and after the Titanic disaster. We can use a
paired-sample t-test to determine whether the mean age of passengers before and
after the disaster is significantly different.
mean age of passengers before and after the disaster is significantly different.
ANOVA
Analysis of variance (ANOVA) is used to determine whether there are significant
differences between the means of two or more groups.
For example, let's say we want to test whether there is a significant difference
between the average age of passengers in the three classes on the Titanic. We can
use ANOVA to determine whether
PCA (Principal Component Analysis) is a statistical technique used to reduce the
dimensionality of a dataset while retaining as much of the variation as possible
The idea behind PCA is to transform the original data into a new coordinate system
where the first coordinate (principal component) corresponds to the direction of
maximum variation in the data, the second coordinate corresponds to the direction
of the second maximum variation, and so on.
A decision tree classifier is a type of machine learning algorithm used for both
classification and regression tasks.
To build a decision tree classifier, the algorithm recursively partitions the training
data into subsets based on the values of the input features. At each node, the
algorithm selects the attribute that best separates the training data into the purest
subsets
K-means is a clustering algorithm used for unsupervised machine learning tasks. It

is a simple and effective method for grouping similar data points together in a given
dataset.
The algorithm works by first randomly selecting K centroids from the dataset, where
K is the number of clusters desired. Then, for each data point, the algorithm
computes the distance between the point and each of the K centroids and assigns
the point to the closest centroid. This creates K clusters.
Next, the algorithm updates the centroids of each cluster by calculating the mean of
all the points assigned to that cluster.
Regression is a type of machine learning and statistical analysis used to model the
relationship between a dependent variable (also called the response variable or
target variable) and one or more independent variables (also called predictors,
features or input variables).
The goal of regression is to predict the value of the dependent variable based on the
values of the independent variables.
Linear regression is a type of regression analysis that models the relationship

between a dependent variable and one or more independent variables as a linear
function.
Multiple regression is a type of regression analysis that models the relationship

between a dependent variable and two or more independent variables as a linear
function.
The main difference between linear regression and multiple regression is the number
of independent variables used to model the relationship with the dependent variable.
Logistic regression is a type of regression analysis used to model the relationship

between a dependent variable and one or more independent variables. However,
unlike linear regression, logistic regression is used to predict binary or categorical
outcomes rather than continuous outcomes.
In other words, logistic regression is used when the dependent variable is

categorical or dichotomous in nature, such as "yes" or "no", "success" or "failure", or
"0" or "1".
In statistics, the null hypothesis is a statement or assumption that there is no

significant difference or relationship between two or more variables or populations. It
is often denoted as H0 and is the starting point for hypothesis testing.
The null hypothesis is usually the opposite of the research hypothesis or alternative
hypothesis (denoted as Ha), which is the statement or assumption that there is a
significant difference or relationship between the variables or populations. The
research hypothesis is what the researcher is trying to prove or establish.
typically, the significance level is set to 0.05 or 0.01. If the p-value (the probability of
observing the data given the null hypothesis is true) is less than the significance
level, the null hypothesis is rejected in favor of the alternative hypothesis.
A confusion matrix is a table used to evaluate the performance of a classification

model by comparing the actual labels of a dataset to the predicted labels of the
model.
True positive (TP): The number of instances that are correctly classified as positive.
False positive (FP): The number of instances that are incorrectly classified as
positive.
True negative (TN): The number of instances that are correctly classified as negative.
False negative (FN): The number of instances that are incorrectly classified as
negative.
Variance is a measure of the variability or spread of a set of data points around the
mean
Mean square error (MSE) is a measure of the average squared difference between
the predicted values and the actual values
Standard deviation is a measure of the amount of variation or dispersion of a set of
data points around the mean

DS Theory

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

DS Theory

Uploaded by

Copyright:

Available Formats

One-sample t-test is used to determine whether a population mean is significantly

different from a hypothesised value.

Two-sample t-test is used to determine whether two independent samples have

Paired-sample t-test is used to determine whether two related samples have

K-means is a clustering algorithm used for unsupervised machine learning tasks. It

Linear regression is a type of regression analysis that models the relationship

Multiple regression is a type of regression analysis that models the relationship

Logistic regression is a type of regression analysis used to model the relationship

In other words, logistic regression is used when the dependent variable is

In statistics, the null hypothesis is a statement or assumption that there is no

A confusion matrix is a table used to evaluate the performance of a classification

You might also like