Deep Jain QA
Statistics is a branch of science dealing with collecting, organizing, summarizing, analyzing, and making decisions from data.
There are two types of statistics:
Descriptive
Inferential
Limitations in Statistics
In systematic sampling every member of the population is listed with a number, but instead of randomly generating
numbers, individuals are chosen at regular intervals.
Stratified sampling involves dividing the population into subpopulations that may differ in important ways. It allows you to draw more precise conclusions by ensuring that every subgroup is properly represented in the sample.
Cluster sampling also involves dividing the population into subgroups, but each subgroup should have similar
characteristics to the whole sample. Instead of sampling individuals from each subgroup, you randomly select entire
subgroups. This method is good for dealing with large and dispersed populations.
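As a minimal sketch (the function names and toy population are illustrative, not from any particular library), systematic and stratified sampling as described above could look like:

```python
import random

def systematic_sample(population, step):
    """Pick every `step`-th member, starting from a random offset."""
    start = random.randrange(step)
    return population[start::step]

def stratified_sample(strata, per_stratum):
    """Randomly sample `per_stratum` members from each subgroup."""
    sample = []
    for members in strata.values():
        sample.extend(random.sample(members, per_stratum))
    return sample

people = list(range(100))                      # toy population: IDs 0..99
print(systematic_sample(people, 10))           # 10 members at regular intervals
strata = {"urban": list(range(60)), "rural": list(range(60, 100))}
print(stratified_sample(strata, 5))            # 5 members from each stratum
```

Note that the stratified version guarantees representation of every subgroup, which is exactly why it yields more precise conclusions than simple random sampling when subgroups differ.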
Efficiency: It saves time and reduces costs by allowing researchers to study a subset of the population instead of the
entire group.
Manageability: Handling a smaller, representative sample makes data collection, processing, and analysis more
practical and less resource-intensive.
Accuracy: Proper sampling techniques, such as random sampling, ensure that the sample represents the population
well, reducing bias and increasing the reliability of results.
Detailed Analysis: Sampling makes it feasible to conduct in-depth analyses, which would be impractical with a larger
population.
Hypothesis Testing: It enables researchers to test hypotheses and make inferences about the population based on
sample data, supporting decision-making with known levels of confidence.
Consistent - The larger the sample size, the more accurate the estimate.
Unbiased - The expectation of the estimator over the various samples equals the corresponding population parameter. For example, the sample mean is an unbiased estimator of the population mean.
Most Efficient, also known as Best Unbiased - Of all the consistent, unbiased estimators, the most efficient is the one possessing the smallest variance (a measure of the amount of dispersion away from the estimate). In simple words, this estimator varies least from sample to sample, and which estimator is most efficient generally depends on the particular distribution of the population. For example, the mean is more efficient than the median (the middle value) for the normal distribution, but not for more "skewed" (asymmetrical) distributions.
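A quick simulation illustrates the efficiency claim above: drawing repeated samples from a standard normal distribution, the sample mean varies less across samples than the sample median (the function name and sample sizes here are illustrative).

```python
import random
import statistics

def sampling_variance(estimator, n=30, trials=2000, seed=42):
    """Variance of an estimator across repeated samples from N(0, 1)."""
    rng = random.Random(seed)
    estimates = [estimator([rng.gauss(0, 1) for _ in range(n)])
                 for _ in range(trials)]
    return statistics.variance(estimates)

var_mean = sampling_variance(statistics.mean)
var_median = sampling_variance(statistics.median)
print(f"variance of sample mean:   {var_mean:.4f}")
print(f"variance of sample median: {var_median:.4f}")
```

For normal data the variance of the median is about 1.57 times that of the mean, so the simulation should report a noticeably smaller number for the mean.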
Secondary data - data that has been collected and analyzed by some agency for its own use, and is later used by a different agency.
Precautions
(i) Suitable Purpose of Investigation- The investigator must ensure that the data are suitable for the purpose of the
enquiry.
(ii) Inadequate Data- Adequacy of the data is to be judged in the light of the requirements of the survey as well as the
geographical area covered by the available data.
(iii) Definition of Units- The investigator must ensure that the definitions of units are the same as in the earlier
investigation.
(iv) Degree of Accuracy- The investigator should keep in mind the degree of accuracy maintained by each investigator.
(v) Time and Condition of Collection of Facts- It should be ascertained before making use of available data, to which
period and conditions the data were collected.
7. Difference between the Critical Region and the Region of Acceptance
Aspect | Critical Region | Region of Acceptance
Definition | Values leading to rejection of the null hypothesis. | Values leading to acceptance of the null hypothesis.
Decision | Reject null hypothesis. | Do not reject null hypothesis.
Significance Level | Determined by α (e.g., 0.05). | Complement of the critical region.
Probability | Probability of Type I error (α). | Region where Type II error (β) is considered.
Location | Tails of the distribution. | Central part of the distribution.
8. Method of moments
The moments in the "Method of Moments" refer to the statistical properties (sample mean and sample variance) of a
distribution, and the method utilizes these moments to estimate the parameters of the distribution.
Let f(x; θ1, θ2, …, θk) be the p.d.f. of the population and let x1, x2, …, xn be a random sample taken from the population.
In the method of moments we find the first k moments of the population and equate them to the corresponding moments of the sample to obtain k equations.
The values of θ1, θ2, …, θk obtained as the solutions of these equations are taken as their estimates.
In short, the method of moments involves equating sample moments with theoretical moments.
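For a normal distribution with parameters μ and σ, equating the first two population moments (E[X] = μ, E[X²] = σ² + μ²) with the sample moments gives closed-form estimates. A sketch (the function name is illustrative):

```python
import math

def method_of_moments_normal(xs):
    """Estimate (mu, sigma) of a normal population by equating the
    first two population moments with the sample moments."""
    n = len(xs)
    m1 = sum(xs) / n                       # first sample moment -> E[X] = mu
    m2 = sum(x * x for x in xs) / n        # second raw sample moment -> E[X^2]
    mu_hat = m1
    sigma_hat = math.sqrt(m2 - m1 * m1)    # since Var = E[X^2] - (E[X])^2
    return mu_hat, sigma_hat

mu, sigma = method_of_moments_normal([4.0, 6.0, 5.0, 7.0, 3.0])
print(mu, sigma)  # mu = 5.0, sigma = sqrt(2)
```

Here the two equations in the two unknowns μ and σ solve directly; with more parameters the resulting system of k equations may need to be solved numerically.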
9. Multiple regression
Multiple regression analysis is a statistical technique that analyzes the relationship between two or more variables and uses the information to estimate the value of the dependent variable. In multiple regression, the objective is to develop a model that relates a dependent variable y to more than one independent variable.
The multiple regression equation is given by y = a + b1x1 + b2x2 + … + bkxk
Multiple regression analysis permits us to control explicitly for the many other factors that
concurrently influence the dependent variable. The objective of regression analysis is to model
the relationship between a dependent variable and one or more independent variables. Let k
represent the number of independent variables, denoted by x1, x2, x3, …, xk. Such an equation is useful
for the prediction of the value of y when the values of the x's are known.
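The coefficients a, b1, …, bk are found by least squares. A self-contained sketch (the function name and toy data are illustrative) that solves the normal equations (XᵀX)b = Xᵀy with Gaussian elimination:

```python
def fit_multiple_regression(X, y):
    """Least-squares fit of y = a + b1*x1 + ... + bk*xk via the normal
    equations (X'X)b = X'y, solved with Gaussian elimination."""
    rows = [[1.0] + list(x) for x in X]          # prepend intercept column
    k = len(rows[0])
    # Build the augmented normal-equation system [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Forward elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k + 1):
                A[r][c] -= f * A[col][c]
    # Back substitution.
    coeffs = [0.0] * k
    for i in reversed(range(k)):
        coeffs[i] = (A[i][k] - sum(A[i][j] * coeffs[j]
                                   for j in range(i + 1, k))) / A[i][i]
    return coeffs  # [a, b1, ..., bk]

# Toy data generated exactly from y = 1 + 2*x1 + 3*x2, so the fit
# should recover the coefficients [1, 2, 3].
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X]
print(fit_multiple_regression(X, y))
```

In practice a library routine (e.g., a least-squares solver) would be used instead of hand-rolled elimination; the sketch only makes the normal-equations idea concrete.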
Stepwise regression is a step-by-step process that begins by developing a regression model
with a single predictor variable and adds or deletes predictor variables one step at a time.
Stepwise multiple regression is a method of determining a regression equation that begins with
a single independent variable and adds independent variables one by one. This approach
is also known as the forward selection method because we begin with no
independent variables and add one independent variable to the regression equation at each
iteration. There is another method, called backward elimination, which begins with
the entire set of variables and eliminates one independent variable at each iteration.
Multicollinearity is a term reserved to describe the case when the inter-correlation of predictor
variables is high. Its symptoms include:
High correlation between pairs of predictor variables.
Magnitudes or signs of regression coefficients that do not make good physical sense.
Non-significant regression coefficients on important predictors.
Extreme sensitivity of the magnitude or sign of regression coefficients to the insertion or deletion of a predictor variable.
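A simple first check for multicollinearity is the pairwise Pearson correlation between predictors; values near ±1 are a warning sign. A sketch with illustrative toy data where one predictor is nearly a linear function of another:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two predictor variables."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# x2 is almost exactly 2*x1, so these predictors are collinear.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.0, 8.1, 9.9, 12.2]
print(f"r = {pearson_r(x1, x2):.3f}")   # close to 1 -> multicollinearity warning
```

Pairwise correlation only catches collinearity between pairs; variance inflation factors (VIF) are the usual tool when a predictor is a combination of several others.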
10. Neyman-Pearson Lemma
The Neyman-Pearson Lemma is a way to find out if the hypothesis test you are using is the one with the greatest
statistical power.
The power of a hypothesis test is the probability that the test correctly rejects the null hypothesis when the alternative
hypothesis is true.
The goal is to maximize this power, so that the null hypothesis is rejected as often as possible when the
alternative is true.
The lemma basically tells us that good hypothesis tests are likelihood ratio tests.
The Neyman-Pearson Lemma applies to a simple hypothesis test. A "simple" hypothesis test is one where the
unknown parameters are specified as single values.
The Neyman-Pearson Lemma is a statistical principle that helps to make optimal decisions when we have to choose
between two hypotheses.
In simpler words, it's a way of figuring out the best way to decide between two options when we don't know which
one is true.
The lemma suggests that we should base our decision on the likelihood ratio of the two hypotheses. Specifically, we
should choose the hypothesis with the higher likelihood, as it is most likely to be true.
However, there's a catch: we can only make this decision if we specify the level of significance we are willing to
accept. This means we have to decide beforehand how likely we are willing to be wrong (i.e., reject a true hypothesis or fail
to reject a false hypothesis) when making our decision.
So, in summary, the Neyman-Pearson Lemma helps us to make an optimal decision between two hypotheses by
considering the likelihood ratio, but we also need to specify our level of significance to make a decision.
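As a concrete sketch of a likelihood ratio for two simple hypotheses (the function name and data are illustrative): for normal data with known σ, testing H0: μ = μ0 against H1: μ = μ1, the Neyman-Pearson test rejects H0 when L(μ1)/L(μ0) exceeds a threshold chosen to give the desired significance level.

```python
import math

def likelihood_ratio(xs, mu0, mu1, sigma=1.0):
    """Likelihood ratio L(mu1)/L(mu0) for normal data with known sigma.
    The Neyman-Pearson most powerful test rejects H0 when this ratio
    exceeds a threshold fixed by the chosen significance level."""
    def log_lik(mu):
        return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
                   - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)
    return math.exp(log_lik(mu1) - log_lik(mu0))

data = [1.2, 0.8, 1.5, 0.9, 1.1]        # data that looks closer to mu = 1
lr = likelihood_ratio(data, mu0=0.0, mu1=1.0)
print(f"likelihood ratio = {lr:.2f}")   # much greater than 1 -> favours H1
```

For this normal case the log-ratio reduces to Σx − n/2, so large sample means favour H1, matching the intuition that the test rejects H0 in the tail of the distribution.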
11. What is hypothesis testing? Explain the Z-test for a single mean and the Z-test for the difference of means.
A hypothesis is defined as a formal statement, which gives the explanation about the relationship between two or
more variables of a specified population.
Assuming that a particular hypothesis is true, if we find that results observed in a random sample differ markedly from
those expected, we say that the observed differences are significant and we reject the hypothesis.
Procedures that enable us to decide whether to accept or reject a hypothesis are called tests of hypothesis, tests of
significance, or decision rules.
The Z-test for a single mean is a statistical test used to determine whether the mean of a sample differs significantly
from a known or hypothesized population mean. This test is typically used when the population variance is known
and the sample size is large (generally n > 30).
Here’s a step-by-step explanation of how the Z-test for a single mean is conducted:
1. Formulate the Hypotheses
Null Hypothesis (H0): The sample mean is equal to the population mean. H0: μ = μ0
Alternative Hypothesis (H1): The sample mean is not equal to the population mean (two-tailed test), or it is
greater than or less than the population mean (one-tailed test). H1: μ ≠ μ0 (two-tailed),
H1: μ > μ0 or H1: μ < μ0 (one-tailed)
2. Determine the Level of Significance
Choose the significance level (α), which is the probability of rejecting the null hypothesis when it is true.
Common choices are 0.05, 0.01, and 0.10.
3. Calculate the Test Statistic
The test statistic for the Z-test is Z = (x̄ − μ0) / (σ / √n), where x̄ is the sample mean, μ0 is the hypothesized population mean, σ is the known population standard deviation, and n is the sample size.
4. Make a Decision
Compare the calculated Z with the critical value for the chosen significance level (e.g., ±1.96 for a two-tailed test at α = 0.05), and reject H0 if Z falls in the critical region.
The Z-test for the difference of means follows the same steps for two independent samples, using the statistic Z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2) to test whether the two population means differ.
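The single-mean Z-test steps above can be sketched as follows (the function name and sample data are illustrative; `statistics.NormalDist` supplies the standard normal CDF for the p-value):

```python
import math
from statistics import NormalDist

def z_test_single_mean(xs, mu0, sigma, alpha=0.05):
    """Two-tailed Z-test for a single mean with known population sigma."""
    n = len(xs)
    xbar = sum(xs) / n
    z = (xbar - mu0) / (sigma / math.sqrt(n))        # test statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # two-tailed p-value
    return z, p_value, p_value < alpha               # True -> reject H0

# Sample whose mean lies well above the hypothesized mean of 50
# (population sigma assumed known to be 2.0).
sample = [52.1, 53.4, 51.8, 54.0, 52.9, 53.1, 52.4, 53.7]
z, p, reject = z_test_single_mean(sample, mu0=50.0, sigma=2.0)
print(f"z = {z:.2f}, p = {p:.4f}, reject H0: {reject}")
```

With this sample the statistic is far out in the right tail, so the two-tailed test rejects H0 at α = 0.05.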
13. Explain MAE and MAPE for checking the performance of a regression model.
The mean absolute error (MAE) is the simplest regression error metric to understand. We’ll calculate the residual for
every data point, taking only the absolute value of each so that negative and positive residuals do not cancel out.
The MAE is also the most intuitive of the metrics since we’re just looking at the absolute difference between the data
and the model’s predictions. Because we use the absolute value of the residual, the MAE does not indicate
underperformance or overperformance of the model.
The mean absolute percentage error (MAPE) is the percentage equivalent of MAE. The equation looks just like that
of MAE, but with adjustments to convert everything into percentages. The MAPE measures how far the model's predictions
are off from their corresponding actual values, on average, in percentage terms.
This formula helps us understand one of the important caveats when using MAPE. In order to calculate this metric,
we need to divide the difference by the actual value. This means that if you have actual values close to or at 0 then
your MAPE score will either receive a division by 0 error, or be extremely large. Therefore, it is advised to not use
MAPE when you have actual values close to 0.
The MAPE is a commonly used measure in machine learning because of how easy it is to interpret. The lower the
value for MAPE, the better the machine learning model is at predicting values. Inversely, the higher the value for
MAPE, the worse the model is at predicting values.
For example, if we calculate a MAPE value of 20% for a given machine learning model, then the average difference
between the predicted value and the actual value is 20%. Note that MAPE will favor models that under-forecast rather than
over-forecast.
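Both metrics are a few lines of code (the function names and toy values are illustrative):

```python
def mae(actual, predicted):
    """Mean absolute error: the average absolute residual."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error; actual values must be nonzero,
    since each residual is divided by the actual value."""
    return 100 * sum(abs((a - p) / a)
                     for a, p in zip(actual, predicted)) / len(actual)

actual = [100.0, 200.0, 150.0, 120.0]
predicted = [110.0, 190.0, 140.0, 132.0]
print(f"MAE  = {mae(actual, predicted):.2f}")    # 10.50
print(f"MAPE = {mape(actual, predicted):.2f}%")  # 7.92%
```

Note how the same absolute error of 10 contributes 10% to MAPE when the actual value is 100 but only 5% when it is 200, which is the scale sensitivity (and the division-by-zero caveat) discussed above.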