DS Theory
For example, let's say we want to test whether the average age of passengers on the
Titanic was 30. We can use a one-sample t-test to determine whether the mean age
of the sample is significantly different from 30.
The output will be the t-statistic and the p-value. If the p-value is less than the chosen
significance level (usually 0.05), we reject the null hypothesis and conclude that the
mean age is significantly different from 30.
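A minimal sketch of the one-sample test described above, assuming SciPy is installed. The ages list holds made-up illustrative values, not the real Titanic data, and the variable names are hypothetical.

```python
from scipy import stats

# Made-up ages for illustration only (NOT the real Titanic data).
ages = [22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55]

# Null hypothesis: the population mean age is 30.
t_stat, p_value = stats.ttest_1samp(ages, popmean=30)

alpha = 0.05  # chosen significance level
reject_null = bool(p_value < alpha)

print(f"t-statistic = {t_stat:.3f}, p-value = {p_value:.3f}")
print("Reject the null hypothesis" if reject_null else "Fail to reject the null hypothesis")
```

With real data, missing ages would need to be dropped first, since the test cannot handle missing values by default.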
For example, let's say we want to test whether the average age of male and female
passengers on the Titanic was significantly different. We can use a two-sample t-test
to determine whether the mean age of male and female passengers is significantly
different.
The output will be the t-statistic and the p-value. If the p-value is less than the chosen
significance level (usually 0.05), we reject the null hypothesis and conclude that the
mean age of male and female passengers is significantly different.
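The two-sample comparison can be sketched as follows, again assuming SciPy. Both age lists are made-up illustrative values, and the choice of Welch's variant (equal_var=False), which does not assume equal variances in the two groups, is an assumption rather than something the text specifies.

```python
from scipy import stats

# Made-up ages for illustration only (NOT the real Titanic data).
male_ages = [22, 35, 54, 2, 20, 39, 28, 34, 40, 66]
female_ages = [38, 26, 35, 27, 14, 4, 58, 14, 55, 31]

# Null hypothesis: both groups have the same mean age.
# equal_var=False selects Welch's t-test; use equal_var=True for Student's t-test.
t_stat, p_value = stats.ttest_ind(male_ages, female_ages, equal_var=False)

alpha = 0.05
reject_null = bool(p_value < alpha)

print(f"t-statistic = {t_stat:.3f}, p-value = {p_value:.3f}")
print("Reject the null hypothesis" if reject_null else "Fail to reject the null hypothesis")
```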
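For example, let's say we want to test whether there is a significant difference
between the age of passengers before and after the Titanic disaster. We can use a
paired-sample t-test to determine whether the mean age of passengers before and
after the disaster is significantly different. A paired test requires two measurements
on the same individuals, so this scenario is hypothetical: the standard Titanic dataset
records only one age per passenger.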
The output will be the t-statistic and the p-value. If the p-value is less than the chosen
significance level (usually 0.05), we reject the null hypothesis and conclude that the
mean age of passengers before and after the disaster is significantly different.
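A minimal sketch of a paired-sample test, assuming SciPy. The before and after lists are hypothetical measurements taken on the same eight subjects; they are made up for illustration and do not come from any dataset.

```python
from scipy import stats

# Hypothetical paired measurements: position i in both lists is the same subject.
before = [72, 75, 68, 80, 77, 70, 74, 69]
after = [70, 72, 69, 75, 74, 68, 71, 68]

# Null hypothesis: the mean of the per-subject differences is zero.
t_stat, p_value = stats.ttest_rel(before, after)

alpha = 0.05
reject_null = bool(p_value < alpha)

print(f"t-statistic = {t_stat:.3f}, p-value = {p_value:.4f}")
print("Reject the null hypothesis" if reject_null else "Fail to reject the null hypothesis")
```

The two lists must have the same length and the same ordering of subjects, because the test works on the pairwise differences.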
ANOVA
Analysis of variance (ANOVA) is used to determine whether there are significant
differences between the means of two or more groups.
For example, let's say we want to test whether there is a significant difference
between the average age of passengers in the three classes on the Titanic. We can
use ANOVA to determine whether the mean age differs significantly among the three classes.
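A one-way ANOVA across three groups could look like the sketch below, assuming SciPy. The three age lists are made-up illustrative values standing in for the three passenger classes.

```python
from scipy import stats

# Made-up ages per passenger class, for illustration only.
first_class = [38, 35, 54, 58, 49, 45, 62, 40]
second_class = [27, 14, 34, 29, 36, 42, 30, 25]
third_class = [22, 26, 2, 4, 20, 18, 24, 16]

# Null hypothesis: all three groups share the same mean age.
f_stat, p_value = stats.f_oneway(first_class, second_class, third_class)

alpha = 0.05
reject_null = bool(p_value < alpha)

print(f"F-statistic = {f_stat:.3f}, p-value = {p_value:.6f}")
print("Reject the null hypothesis" if reject_null else "Fail to reject the null hypothesis")
```

A significant result only says that at least one group mean differs; it does not say which one.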
PCA (Principal Component Analysis) is a statistical technique used to reduce the
dimensionality of a dataset while retaining as much of the variation as possible.
The idea behind PCA is to transform the original data into a new coordinate system
where the first coordinate (principal component) corresponds to the direction of
maximum variation in the data, the second coordinate corresponds to the direction
of the second maximum variation, and so on.
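The coordinate change can be sketched with NumPy by taking the eigenvectors of the covariance matrix. This is one common way to compute PCA, shown here on synthetic data; the variable names are hypothetical.

```python
import numpy as np

# Synthetic 2-D data with two strongly correlated features.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 0.5 * x + rng.normal(scale=0.1, size=200)])

# 1. Center each feature at zero.
X_centered = X - X.mean(axis=0)

# 2. Eigen-decompose the covariance matrix (eigh returns ascending eigenvalues).
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Sort components by decreasing variance.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project the data onto the principal components.
scores = X_centered @ eigvecs
explained_ratio = eigvals / eigvals.sum()

# Keep only the first component to reduce 2 dimensions to 1.
X_reduced = scores[:, :1]
print("Explained variance ratio:", explained_ratio)
```

The variance of each column of scores equals the corresponding eigenvalue, which is why sorting by eigenvalue orders the components by how much variation they capture.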
A decision tree is a type of supervised machine learning algorithm used for both
classification and regression tasks. A decision tree classifier is the variant that
predicts class labels.
To build a decision tree classifier, the algorithm recursively partitions the training
data into subsets based on the values of the input features. At each node, the
algorithm selects the attribute that best separates the training data into the purest
subsets, as measured by a criterion such as Gini impurity or entropy.
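The split selection at a single node can be sketched in plain Python. This example assumes Gini impurity as the purity measure and binary threshold splits; the helper names gini and best_split, the rows, and the labels are all hypothetical and made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure subset, larger for mixed subsets."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_impurity) of the purest split."""
    best = None
    n = len(rows)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must send rows to both sides
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or impurity < best[2]:
                best = (feature, threshold, impurity)
    return best

# Made-up rows of [age, fare] with made-up binary labels.
rows = [[22, 7.3], [38, 71.3], [26, 7.9], [35, 53.1], [54, 8.1], [4, 60.0]]
labels = [0, 1, 0, 1, 0, 1]

feature, threshold, impurity = best_split(rows, labels)
print(f"Split on feature {feature} at <= {threshold} (weighted Gini = {impurity:.3f})")
```

A full tree builder would apply best_split recursively to the left and right subsets until they are pure or a stopping rule is reached.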
Regression is a type of machine learning and statistical analysis used to model the
relationship between a dependent variable (also called the response variable or
target variable) and one or more independent variables (also called predictors,
features or input variables).
The goal of regression is to predict the value of the dependent variable based on the
values of the independent variables.
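As an illustration, the simplest case (one independent variable, a straight-line fit by ordinary least squares) can be sketched in plain Python. The helper name fit_line and the data points are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: x is the independent variable, y the dependent variable.
x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(x, y)

# Predict the dependent variable for a new value of the independent variable.
prediction = intercept + slope * 6
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, prediction at x=6: {prediction:.2f}")
```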
The null hypothesis (denoted as H0) is usually the opposite of the research hypothesis
or alternative hypothesis (denoted as Ha), which is the statement or assumption that
there is a significant difference or relationship between the variables or populations.
The research hypothesis is what the researcher is trying to establish.
Typically, the significance level is set to 0.05 or 0.01. If the p-value (the probability of
observing data at least as extreme as the observed data, assuming the null hypothesis is
true) is less than the significance level, the null hypothesis is rejected in favor of the
alternative hypothesis.
True positive (TP): The number of instances that are correctly classified as positive.
False positive (FP): The number of instances that are incorrectly classified as
positive.
True negative (TN): The number of instances that are correctly classified as negative.
False negative (FN): The number of instances that are incorrectly classified as
negative.
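The four counts can be computed directly from true and predicted labels, as in this small sketch. The helper name confusion_counts and the label lists are hypothetical, and the label 1 is assumed to mean "positive".

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN and FN for a binary classification result."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, fp, tn, fn

# Made-up labels for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")
```

The four counts always add up to the total number of instances.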
Variance is a measure of the variability or spread of a set of data points around the
mean. It is the average of the squared deviations from the mean.
Mean square error (MSE) is a measure of the average squared difference between
the predicted values and the actual values.
Standard deviation is a measure of the amount of variation or dispersion of a set of
data points around the mean. It is the square root of the variance.
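The three measures above can be computed with Python's standard library, as in this sketch. The numbers are made up, and the population forms (dividing by n) are an assumption; use statistics.variance and statistics.stdev for the sample forms that divide by n - 1.

```python
import statistics

# Made-up data points.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)
variance = statistics.pvariance(data)  # average squared deviation from the mean
std_dev = statistics.pstdev(data)      # square root of the variance

# Made-up actual and predicted values for the MSE.
actual = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]
mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

print(f"mean = {mean}, variance = {variance}, standard deviation = {std_dev}")
print(f"MSE = {mse}")
```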