MSBT 109 _UNIT 1
M.Sc. Biotechnology
(MSBT-109- DTU)
Fundamentals of probability, Database Management, Cluster analysis, Microarray technique, Modelling Methods
Biostatistics is the application of statistical techniques to scientific research in health-related fields, including
medicine, biology, and public health.
Descriptive statistics focus on describing the visible characteristics of a dataset. A dataset is a collection of
responses or observations from a sample or an entire population. Descriptive statistics cover all of the procedures that can be
used to organise, summarise, display, and categorise data collected for a certain experiment or event. They
include tabulation, graphical presentation, measures of central tendency, etc.
Inferential statistics focus on making predictions or generalizations about a larger dataset, based on a
sample of those data. Inferential statistics include the z test, t test, analysis of variance, etc.
In short, statistics is about summarizing data and answering questions based on data.
Descriptive statistics
Problem Solving
Use in Clinical Trials
Use in Manufacturing of Pharmaceuticals
Use in Quality Control of Pharmaceuticals
Use in Research and Development of Drugs and Technology
Use in Anatomy and Physiology
Use in Pharmacology and Medicine
Use in Community Medicine and Public Health
Use in Health and Vital statistics
Use in Biotechnology, Bioinformatics and Computational Biotechnology
Variables and Constants
A constant is a characteristic that is fixed across conditions. For example: fingerprint, country in which you
were born, genotype.
A variable is a characteristic that changes across conditions. For example: salary, age, marital status, respiratory
rate, blood type, etc. Data are the values we get when we measure or observe a variable.
How do we measure things?
Qualitatively: By putting them into categories
Quantitatively: By using numbers
Variables
Nominal Variable: A typical characteristic of this variable is that it does not have any units of measurement, and the ordering
of the categories is completely arbitrary. Example: Blood group types - Nominal categorical variable
Ordinal Variable: These data also do not have any units of measurement, like nominal variables, but the ordering of the
categories is not arbitrary as it was with nominal variables. Example: Scoring of patients - Ordinal
Ordinal data are not real numbers. They cannot be placed on the number line. As ordinal data are not real
numbers, it is not appropriate to apply any of the rules of basic arithmetic to these data.
Continuous Variable: With metric variables, proper measurement is possible and therefore these variables produce data that are real
numbers, and can be placed on the number line.
Metric continuous variables can be properly measured and have units of measurement. Example: Birth weight
(g), blood pressure (mmHg) - Metric continuous
Discrete Variable: The data produced are real numbers, and are invariably integers (i.e. whole numbers). They can be placed on the
number line, and have the same interval and ratio properties as continuous metric data.
Metric discrete variables can be properly counted and have units of measurement: ‘numbers of things’.
Example: Number of deaths, number of pressure sores, number of angina attacks - Metric discrete
Continuous metric data usually come from measuring, while discrete metric data usually come from counting.
Identify type of variable:
Eye Color
Gender
Places in a Race
Cloth sizes
Frequency distribution tables present data in a relatively compact form, ready to use, but certain information may be lost. The
data can be reduced to a manageable form using frequency tables. A table can have one or all of the following parameters, depending on the
type of data.
1. Frequency tells you how often something happened, i.e. the number of times that any particular value occurs in the data.
2. Relative frequency is the frequency converted into a percentage of the total number of observations.
3. Cumulative frequency is the running total of frequencies, obtained by adding the frequency of observations at each
level to the frequencies of the preceding level(s).
4. Cumulative relative frequency is the cumulative frequency converted into a percentage of the total number of observations.
Frequency tables can show either categorical variables (sometimes called qualitative variables) or quantitative variables
(numeric values).
Frequency distribution in nominal data
Table: Shows that a higher percentage of nurses than of doctors work in rural areas, but that, overall, a greater proportion of staff
works in urban areas (67%).
Table: Relative frequency table showing the percentage of students in each blood group
It makes no sense to calculate cumulative frequency for nominal data, because the category order is arbitrary.
Cumulative frequency can, however, be calculated for ordinal data.
Construct a frequency table for the data on marks obtained by 20 students in their math exam.
20,43,74,89,75,60,31,43,37,36,50,38,21,99,93,45,64,92,38,60
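One possible way to work this exercise is sketched below. The class width of 10 is an assumption (the exercise does not fix one), and the column names are only illustrative. The sketch builds the four quantities defined earlier: frequency, relative frequency, cumulative frequency and cumulative relative frequency.

```python
from collections import Counter

marks = [20, 43, 74, 89, 75, 60, 31, 43, 37, 36,
         50, 38, 21, 99, 93, 45, 64, 92, 38, 60]

width = 10  # assumed class width; the exercise does not specify one

# Map every mark to the lower limit of its class (20-29, 30-39, ...) and count.
counts = Counter((m // width) * width for m in marks)

n = len(marks)
table = []       # rows: (class label, frequency, relative %, cumulative f, cumulative %)
cumulative = 0
for lower in range(min(counts), max(counts) + width, width):
    f = counts.get(lower, 0)
    cumulative += f
    table.append((f"{lower}-{lower + width - 1}", f,
                  100 * f / n, cumulative, 100 * cumulative / n))

print(f"{'Class':<8}{'f':>4}{'Rel %':>8}{'Cum f':>7}{'Cum %':>8}")
for label, f, rel, cum, cum_rel in table:
    print(f"{label:<8}{f:>4}{rel:>8.1f}{cum:>7}{cum_rel:>8.1f}")
```

A different class width (for example 20) would give a coarser table with the same totals.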
1. Pie chart is a diagram in which the frequencies of the groups are shown in a circle. Each
segment (slice) of a pie chart should be proportional to the frequency of the category it
represents. A disadvantage of a pie chart is that it can only represent one variable. It can lose
clarity if it is used to represent more than four or five categories.
Figure: Pie chart of percentage blood group of pharmacy students
2. Bar chart: It is a chart with frequency on the vertical axis and category on the horizontal axis.
The simple bar chart is appropriate if only one variable is to be shown.
All the bars should be of the same width, and there should be equal spaces between bars.
These spaces emphasise the categorical nature of the data.
Figure: Simple bar chart of blood group of pharmacy students
3. Clustered bar chart: If there is more than one group, we can use the clustered bar chart.
There are two ways of presenting a clustered bar chart. It lets us compare the relative sizes of
the groups within each category.
Figure: Clustered bar chart of blood group of 95 pharmacy students by sex
4. Stacked bar charts: The bars are now stacked on top of each other. Stacked bar charts
are appropriate if we want to compare the total number of subjects in each group but
not so good if we want to compare category sizes between groups.
Figure: A stacked bar chart of blood group of 95 pharmacy students by sex
5. Pictograms are similar to bar charts. They present
the same type of information, but the bars are
replaced with a proportional number of icons. This
type of presentation for descriptive statistics dates
back to the beginning of civilization when pictorial
images were used to record numbers of people,
animals or objects.
Figure: Population of different districts of Western Maharashtra.
Each diagram indicates one lakh population.
6. Line chart is similar to a bar chart except that thin lines, instead of thicker bars, are
used to represent the frequency associated with each level of the discrete variable
7. Point plots are identical to line charts, however, instead of a line, a number of points or
dots equivalent to the frequency are stacked vertically for each value of the horizontal
axis. Also referred to as dot diagram, point plots are useful for small data sets.
9. A frequency polygon can be constructed by placing a dot at the midpoint (class mark) of
each class interval in the histogram and then connecting these dots with straight lines.
- The frequency polygon gives a better conception of the shape of the distribution. The
class interval midpoint for a section in a histogram is calculated as follows: Midpoint =
(highest + lowest point)/2
- The frequency polygon is then created by listing the midpoints (class marks) on the x
axis, frequencies on the y-axis, and drawing lines to connect the midpoints for each
interval.
Figure: Histogram of the grouped weight in kg
By using an ogive we can locate any percentile that divides the series into parts.
Quartiles: These are three points located on the entire range of the variable. By using the quartiles we can
calculate the semi-interquartile range and the interquartile range (Q3 − Q1).
- Q1, the lower quartile, has 25% of observations to its left and 75% of observations to its right.
- Q2 is the median, i.e., 50% of values lie on either side.
- Q3, the upper quartile, has 75% of observations to its left and 25% of observations to its right.
Quintiles: These divide the distribution into 5 equal parts. So the 20th
percentile, or 1st quintile, has 20% of observations to its
left and 80% to its right.
Deciles: These divide the distribution into 10 equal parts. The first decile
(10th percentile) has 10% of values to its left and 90% of values to
its right. The 5th decile is the median and has 50% of values on either
side.
CLASS-III
23 Aug 2023
11. Box-whisker plot: A plot that displays a great deal of information about a continuous variable
is the box-and-whisker plot. It shows the bulk of the data as a rectangular box in which the upper
and lower lines represent the third quartile (75% of observations below Q3) and first quartile (25%
of observations below Q1), respectively. The second quartile (50% of the observations below this
point) is depicted as a horizontal line through the box. Vertical lines (whiskers) extend from the top
and bottom lines of the box to an upper and lower adjacent value.
12. The stem-and-leaf plot is a visual representation for continuous data and contains features
common to both the frequency distribution and dot diagrams. Digits, instead of bars are used to
illustrate the spread and shape of the distribution. Each piece of data is divided into "leading" and
"trailing" digits.
All the leading digits are sorted from lowest to highest and listed to the left of a vertical line. These
digits become the stem. The trailing digits are then written in the appropriate location to the right
of the vertical line. These become the leaves.
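The construction described above can be sketched in a few lines. The data are the 20 exam marks from the frequency-table exercise earlier in these notes; treating the tens digit as the stem and the units digit as the leaf is an assumption that suits two-digit values.

```python
from collections import defaultdict

marks = [20, 43, 74, 89, 75, 60, 31, 43, 37, 36,
         50, 38, 21, 99, 93, 45, 64, 92, 38, 60]

# Leading digit (tens) becomes the stem, trailing digit (units) the leaf.
stems = defaultdict(list)
for m in sorted(marks):
    stems[m // 10].append(m % 10)

# Stems are listed from lowest to highest, left of the vertical line.
for stem in range(min(stems), max(stems) + 1):
    leaves = "".join(str(d) for d in stems.get(stem, []))
    print(f"{stem} | {leaves}")
```

Because the leaves are sorted within each stem, the plot doubles as an ordered listing of the data.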
13. A scatter diagram is an extremely useful presentation for showing the relationship between two
continuous variables. The two dimensional plot has both horizontal and vertical axes which cover
the ranges of the two variables. Plotted data points represent paired observations for both the x and
y variable. These types of plots are valuable for correlation and regression inferential tests.
1. The values are fairly evenly spread throughout their possible range. This is a uniform distribution.
2. The values are concentrated towards the bottom of the range, with progressively fewer values towards the top of the range.
This is a right or positively skewed distribution.
3. The values are concentrated towards the top of the range, with progressively fewer values towards the bottom of the range.
This is a left or negatively skewed distribution.
4. The values are clumped together around one particular value, with progressively fewer values both below and above this value.
This is a symmetric or mound-shaped distribution.
There is one particular symmetric bell-shaped distribution, known as the Normal distribution. Many human clinical features are distributed normally.
5. The values are clumped around two or more particular values. This is a bimodal or multimodal distribution.
It is bell shaped curve
It is symmetrical in distribution; observations on either side of the mean are equal in
number.
Its maximum height is at the mean.
The mean, median and mode coincide in a normal distribution (mean = median = mode).
Skewness of the curve is zero.
It is asymptotic, in that the tails never touch the baseline.
Total area under the curve is one; for the standard normal curve, the standard deviation is also one.
It has two curves. Central part is convex and when it comes down, it becomes
concave on both sides.
Skewness measures asymmetry around the mean. The parameter is best interpreted as
relative to the normal distribution (whose skewness equals to zero).
The interpretation of the skewness is:
Skewness > 0: asymmetric tail extending towards values above the mean (right or positive skew)
Skewness < 0: asymmetric tail extending towards values below the mean (left or negative skew)
Skewed data are usually analysed using non-parametric tests, while normally distributed
data are analysed using parametric tests.
Kurtosis is a property associated with a frequency distribution and refers to the shape
of the distribution of values regarding its relative flatness and peaked-ness. Compared
with normal distribution, the interpretation of the kurtosis is:
Kurtosis > 0 peaked relative to Normal distribution
Kurtosis < 0 flat relative to Normal distribution
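As an illustration of the sign conventions above, the sketch below computes moment-based skewness and excess kurtosis. The formulas used (third and fourth central moments scaled by the second) are one common definition; statistical packages may apply small-sample corrections. Both data sets are hypothetical.

```python
from statistics import fmean

def central_moment(values, k):
    """k-th central moment: mean of (x - mean)**k."""
    mean = fmean(values)
    return fmean([(x - mean) ** k for x in values])

def skewness(values):
    """Moment coefficient of skewness, m3 / m2**1.5 (0 for symmetric data)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """m4 / m2**2 - 3, so that the Normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]                 # hypothetical, evenly spread
right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]    # hypothetical, one large value

print(skewness(symmetric))        # zero: no asymmetry
print(skewness(right_tailed))     # positive: tail towards high values
print(excess_kurtosis(symmetric)) # negative: flatter than Normal
```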
Measure of Central Tendency
In any research, enormous data is collected and, to describe it meaningfully, one needs to summarise the same.
The bulkiness of the data can be reduced by organising it into a frequency table or histogram.
Frequency distribution organises the heap of data into a few meaningful categories.
Collected data can also be summarised as a single index/value, which represents the entire data.
These measures may also help in the comparison of data.
The mean, median and mode are the three commonly used measures of central tendency.
A measure of central tendency is defined as “the statistical measure that identifies a single value as representative of an entire distribution”.
It aims to provide an accurate description of the entire data.
It is the single value that is most typical/representative of the collected data.
Median is the value which occupies the middle position when all the observations are arranged in an ascending/descending order. It divides the
frequency distribution exactly into two halves. 50% of observations in a distribution have scores at or below the median. Hence median is the 50th
percentile. Median is also known as ‘positional average’.
Mode is defined as the value that occurs most frequently in the data. Some data sets do not have a mode because each value occurs only once. On the
other hand, some data sets can have more than one mode. This happens when the data set has two or more values of equal frequency which is greater than
that of any other value. Mode is rarely used as a summary statistic except to describe a bimodal distribution. In a bimodal distribution, the taller peak is
called the major mode and the shorter one is the minor mode.
1. The range is the distance from the smallest value to the largest. Not affected by skewness, but is sensitive to the addition or
removal of an outlier value. Range = Lowest value to Highest value.
2. The interquartile range describes the middle 50% of values when ordered from lowest to highest. To find the interquartile
range (IQR), first find the median (middle value) of the lower and upper half of the data. These values are quartile 1 (Q1)
and quartile 3 (Q3). The IQR is the difference between Q3 and Q1. It is not affected either by outliers or skewness, but it
does not use all of the information in the data set since it ignores the bottom and top quarter of values.
3. Standard deviation is a measure which shows how much variation (spread or dispersion) from the mean exists. It indicates a “typical” deviation
from the mean. It is the most widely used measure of dispersion and is based on all values. It is the square root of the mean of the squared deviations from the arithmetic
mean. Variance and standard deviation are related: the standard deviation is the square root of the variance of the given
data set.
4. The mean deviation is defined as a statistical measure that is used to calculate the
average deviation from the mean value of the given data set. It uses absolute
values instead of squares to circumvent the issue of negative differences between
the data points and their means.
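The four measures of dispersion listed above can be computed side by side. The sample below is hypothetical, chosen so the results are round numbers. Q1 and Q3 are taken as the medians of the lower and upper halves, as described in item 2; other quartile conventions give slightly different values.

```python
from statistics import fmean, median, pstdev, stdev

data = sorted([2, 4, 4, 4, 5, 5, 7, 9])   # hypothetical sample
n = len(data)
mean = fmean(data)

# 1. Range: distance from the smallest to the largest value.
value_range = data[-1] - data[0]

# 2. Interquartile range: medians of the lower and upper halves
#    (the middle value is left out of both halves when n is odd).
half = n // 2
q1 = median(data[:half])
q3 = median(data[n - half:])
iqr = q3 - q1

# 3. Standard deviation: dividing by n (population) or by n - 1 (sample).
population_sd = pstdev(data)
sample_sd = stdev(data)

# 4. Mean deviation: average absolute distance from the mean.
mean_deviation = fmean([abs(x - mean) for x in data])

print(value_range, iqr, population_sd, round(sample_sd, 3), mean_deviation)
```

The n − 1 divisor is the one used when a sample variance is reported as a sum of squares divided by n − 1.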
Population Parameters versus Sample Statistics
The aim of biostatistics is to analyze samples in order to make inferences about the population from which the samples were drawn.
s² = 472.10/9 = 52.46; s = √52.46 = 7.24
Questions
35, 50,50,50,56,60,60,75,250
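The notes do not state what is asked about this data set. Assuming the task is to compare the measures of central tendency, a short sketch is:

```python
from statistics import mean, median, multimode

values = [35, 50, 50, 50, 56, 60, 60, 75, 250]

avg = mean(values)          # arithmetic mean, pulled upwards by the value 250
mid = median(values)        # middle (5th) value of the 9 ordered observations
modes = multimode(values)   # most frequent value(s)

print(round(avg, 2), mid, modes)
```

The single extreme value 250 raises the mean well above the median, which is why the median is often preferred for skewed data.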
P(A AND B) = P(A ∩ B)
P(A OR B) = P(A ∪ B)
Playing with cards
2. A card is drawn at random from a well-shuffled pack of cards numbered 1 to 20. Find the probability of
(i) getting a number less than 7
(ii) getting a number divisible by 3.
3. A card is drawn at random from a pack of 52 playing cards. Find the probability that the card drawn is
(i) a king
(ii) neither a queen nor a jack.
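Both card questions can be checked by counting favourable outcomes over equally likely outcomes. The sketch below does this with exact fractions; the variable names and the way the deck is encoded are illustrative only.

```python
from fractions import Fraction

# Question 2: cards numbered 1 to 20, each equally likely.
numbered = range(1, 21)
p_less_than_7 = Fraction(sum(1 for c in numbered if c < 7), len(numbered))
p_divisible_by_3 = Fraction(sum(1 for c in numbered if c % 3 == 0), len(numbered))

# Question 3: a standard pack of 52 playing cards (13 ranks x 4 suits).
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(rank, suit) for rank in ranks for suit in suits]

p_king = Fraction(sum(1 for rank, _ in deck if rank == "K"), len(deck))
p_not_queen_or_jack = Fraction(
    sum(1 for rank, _ in deck if rank not in ("Q", "J")), len(deck))

print(p_less_than_7, p_divisible_by_3, p_king, p_not_queen_or_jack)
```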
Independent events: Two or more events are said to be independent when the outcome of one event is not affected
by other outcomes, as they do not affect each other.
E.g. If a coin is tossed twice, the result of the second throw is not affected by the result of the first throw.
Dependent events: In these events the occurrence or non-occurrence of one event in any one trial affects the
probability of other events in other trials.
Equally likely events: Events are said to be equally likely when one does not occur more often than the others.
Probability Distributions
Experiment, Outcome, and Sample Space
An experiment is a process that, when performed, results in one and only one of many observations. These observations are called the
outcomes of the experiment. The collection of all outcomes for an experiment is called a sample space.
Examples of Experiments, Outcomes, and Sample Spaces
Tree diagram: each outcome is represented by a branch of the tree.
• Tree diagrams help us understand probability concepts by presenting them visually
A simple event is also called an elementary event, and a compound event is also called a composite event.
Simple Event An event that includes one and only one of the (final) outcomes for an experiment is called a simple event and is
usually denoted by Ei.
Reconsider previous example: on selecting two workers from a company and observing whether the worker selected each time is a
man or a woman.
Each of the final four outcomes (MM, MW, WM, and WW) for this experiment is a simple event.
These four events can be denoted by E1, E2, E3, and E4, respectively.
Thus, E1 = {MM}, E2 = {MW}, E3 = {WM}, and E4 = {WW}
Compound Event A compound event is a collection of more than one outcome for an experiment.
Reconsider the same example on selecting two workers from a company and observing whether the worker selected each time is a man
or a woman. Let A be the event that at most one man is selected. Is event A a simple or a compound event?
Let L denote the event that a student likes ice tea and N denote the event that a student does not like ice tea.
(a) This experiment has four outcomes:
LL = Both students like ice tea
LN = The first student likes ice tea but the second student does not
NL = The first student does not like ice tea but the second student does
NN = Both students do not like ice tea
(b) (i) The event both students like ice tea will occur if LL happens.
Thus, Both students like ice tea = {LL}
Since this event includes only one of the four outcomes, it is a simple event.
(ii) The event at most one student likes ice tea will occur if one or none of the two students likes ice tea.
Thus, At most one student likes ice tea = {LN, NL, NN}
Since this event includes three outcomes, it is a compound event.
(iii) The event at least one student likes ice tea will occur if one or two of the two students like ice tea.
Thus, At least one student likes ice tea = {LN, NL, LL}
Since this event includes three outcomes, it is a compound event.
(iv) The event neither student likes ice tea will occur if neither of the two students likes ice tea, which will include the event NN.
Thus, Neither student likes ice tea = {NN}
Since this event includes one outcome, it is a simple event.
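The ice tea example can be reproduced by enumerating the sample space and filtering it. The helper names `event` and `kind` are illustrative, not part of the notes.

```python
from itertools import product

# Sample space for two students, each either L (likes) or N (does not like).
sample_space = ["".join(outcome) for outcome in product("LN", repeat=2)]

def event(predicate):
    """Collect the outcomes of the sample space that satisfy the predicate."""
    return [o for o in sample_space if predicate(o)]

def kind(outcomes):
    """A simple event has exactly one outcome; otherwise it is compound."""
    return "simple" if len(outcomes) == 1 else "compound"

both = event(lambda o: o.count("L") == 2)
at_most_one = event(lambda o: o.count("L") <= 1)
at_least_one = event(lambda o: o.count("L") >= 1)
neither = event(lambda o: o.count("L") == 0)

for name, ev in [("both", both), ("at most one", at_most_one),
                 ("at least one", at_least_one), ("neither", neither)]:
    print(name, ev, kind(ev))
```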
Probability is a numerical measure of the likelihood that a specific event will occur. The probability that a simple event Ei will occur is
denoted by P(Ei), and the probability that a compound event A will occur is denoted by P(A).
2. The sum of the probabilities of all simple events (or final outcomes) for an experiment, denoted by ΣP(Ei), is always 1.
For an experiment with outcomes E1, E2, E3, . . . ,
ΣP(Ei) = P(E1) + P(E2) + P(E3) + . . . = 1.0
Classical Probability: Equally Likely Outcomes Two or more outcomes that have the same probability of occurrence are said to be
equally likely outcomes.
Marginal probability is the probability of a single event without consideration of any other event.
Example: 15 employees in this group possess two characteristics: “male” and “in favor of paying high salaries to CEOs.”
Suppose one employee is selected at random from these 100 employees. If only one characteristic is considered at a time, the
employee selected can be a male, a female, in favor, or against. The probability of each of these four characteristics or events is
called marginal probability.
Now suppose that one employee is selected at random from these 100 employees. Furthermore, assume it is
known that this (selected) employee is a male. In other words, the event that the employee selected is a male
has already occurred. Given that this selected employee is a male, he can be in favor or against. What is the
probability that the employee selected is in favor of paying high salaries to CEOs?
Mutually Exclusive Events : Events that cannot occur together are said to be mutually exclusive events.
Consider the following events for one roll of a die:
A = an even number is observed = {2, 4, 6}
B = an odd number is observed = {1, 3, 5}
C = a number less than 5 is observed = {1, 2, 3, 4}
Are events A and B mutually exclusive? Are events A and C mutually exclusive?
Figures: Mutually exclusive events A and B; mutually nonexclusive events A and C.
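Treating each event as a set of outcomes, two events are mutually exclusive exactly when their sets share no outcome. A minimal sketch for the die example (the function name is illustrative):

```python
A = {2, 4, 6}       # an even number is observed
B = {1, 3, 5}       # an odd number is observed
C = {1, 2, 3, 4}    # a number less than 5 is observed

def mutually_exclusive(x, y):
    """True when the two events have no outcome in common."""
    return x.isdisjoint(y)

print(mutually_exclusive(A, B))   # A and B cannot occur together
print(mutually_exclusive(A, C))   # A and C share the outcomes in A & C
print(A & C)
```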
Independent Events Two events are said to be independent if the occurrence of one event does not affect the probability of the
occurrence of the other event. In other words, A and B are independent events if either P(A ∣ B) = P(A) or P(B ∣ A) = P(B)
Complementary Events: The complement of event A, denoted by Ā and read as “A bar” or “A complement,” is the event that
includes all the outcomes for an experiment that are not in A.
Events A and Ā are complements of each other.
Because two complementary events, taken together, include all the outcomes for an experiment and
because the sum of the probabilities of all outcomes is 1, it follows that
P(A) + P(Ā) = 1.0
The symbol ! (read as factorial) is used to denote factorials. The value of the factorial of a number is obtained by
multiplying all the integers from that number down to 1. For example, 7! is read as “seven factorial” and is evaluated by
multiplying all the integers from 7 to 1.
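The definition translates directly into a loop. The hand-written function below mirrors the description and is compared with the standard library's `math.factorial`.

```python
import math

def factorial(n):
    """Multiply all the integers from n down to 1."""
    result = 1
    for k in range(n, 0, -1):
        result *= k
    return result

seven_factorial = factorial(7)   # 7 x 6 x 5 x 4 x 3 x 2 x 1
print(seven_factorial, seven_factorial == math.factorial(7))
```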
Discrete Random Variable A random variable that assumes countable values is called a discrete random variable.
examples of discrete random variables
• The number of houses in a certain block
• The number of customers who visit a bank during any given hour
• The number of complaints received at the office of an airline on a given day
Continuous Random Variable A random variable that can assume any value contained in one or more intervals is called
a continuous random variable.
Note: Success does not mean favorable or desirable outcome and a failure does not refer to an unfavorable or undesirable outcome.
The outcome to which the question refers is usually called a success; the outcome to which it does not refer is called a failure.
Example: Seventy five percent of students at a college with a large student population use the social media site Instagram. Three
students are randomly selected from this college. What is the probability that exactly two of these three students use Instagram?
n = total number of trials = number of students selected = 3
x = number of successes = number of students in three who use Instagram = 2
p = probability of success = probability that a student uses Instagram = .75
n − x = number of failures = number of students not using Instagram = 3 − 2 = 1
q = probability of failure = probability that a student does not use Instagram = 1 − .75 = .25
The probability of two successes is denoted by P(x = 2) or simply by P(2). Substituting all of the values in the binomial formula, we obtain
P(2) = 3C2 (.75)² (.25)¹ = (3)(.5625)(.25) = .4219
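The substitution for the Instagram example can be sketched with the standard binomial formula P(x) = nCx · pˣ · qⁿ⁻ˣ. The function name `binomial_pmf` is illustrative.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)

n, p = 3, 0.75                  # three students, 75% use Instagram
p_two = binomial_pmf(2, n, p)   # exactly two of the three use it

# The probabilities of 0, 1, 2 and 3 successes together cover every outcome.
total = sum(binomial_pmf(x, n, p) for x in range(n + 1))
print(round(p_two, 4), total)
```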
Mean and Standard Deviation of the Binomial Distribution
The value of the mean is what we expect to obtain, on average, per repetition of the experiment. In this example, if
we select many samples of 50 U.S. adults each, we expect that each sample will contain an average of 11.4 adults,
with a standard deviation of 2.9666, who will not have a religious affiliation.
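The figures quoted above follow from the standard formulas μ = np and σ = √(npq). The value p = .228 is not stated in these notes; it is inferred here from the quoted mean (11.4 / 50), so treat it as an assumption.

```python
from math import sqrt

n = 50          # adults per sample
p = 11.4 / n    # assumed success probability, inferred from the quoted mean
q = 1 - p

mean = n * p                # expected number of successes per sample
sd = sqrt(n * p * q)        # standard deviation of the number of successes

print(round(mean, 1), round(sd, 4))
```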
The Hypergeometric Probability Distribution
If the trials are not independent, we cannot apply the binomial probability distribution to find the probability of x
successes in n trials. In such cases we replace the binomial probability distribution by the hypergeometric probability
distribution
Thus, the probability that 3 of the 4 parts sold are good and 1 is defective is .4506.
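The hypergeometric probability is P(x) = (rCx · (N−r)C(n−x)) / NCn, where N is the population size, r the number of successes in the population and n the sample size. The population figures are not given in the text above; the sketch assumes 25 parts of which 20 are good and 5 defective, because those values reproduce the quoted result.

```python
from math import comb

def hypergeometric_pmf(x, N, r, n):
    """P(x successes in a sample of n drawn without replacement),
    from a population of N items containing r successes."""
    return comb(r, x) * comb(N - r, n - x) / comb(N, n)

# Assumed population: 25 parts, 20 good and 5 defective (not stated in the notes).
p_three_good = hypergeometric_pmf(x=3, N=25, r=20, n=4)
print(round(p_three_good, 4))
```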
The Poisson Probability Distribution:
If the average number of occurrences for a given interval is known, then by using the Poisson probability distribution, we can compute the
probability of a certain number of occurrences, x, in that interval.
Conditions to Apply the Poisson Probability Distribution: The following three conditions must be satisfied to apply the Poisson
probability distribution.
1. x is a discrete random variable.
2. The occurrences are random.
3. The occurrences are independent.
The following examples also qualify for the application of the Poisson probability distribution.
1. The number of accidents that occur on a given highway during a 1-week period
2. The number of customers entering a grocery store during a 1-hour interval
3. The number of television sets sold at a department store during a given week
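For a known average number of occurrences λ per interval, the Poisson probability of x occurrences is P(x) = λˣ e^(−λ) / x!. The sketch below uses a hypothetical average of 2 accidents per week; the value is for illustration only.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Probability of exactly x occurrences when the average is lam."""
    return lam ** x * exp(-lam) / factorial(x)

lam = 2.0   # hypothetical average number of accidents per week

p_none = poisson_pmf(0, lam)
p_at_most_two = sum(poisson_pmf(x, lam) for x in range(3))
print(round(p_none, 4), round(p_at_most_two, 4))
```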
Figures: Total area under a normal curve; a normal curve is symmetric about the mean; areas of the normal curve beyond μ ± 3σ.
The standard normal distribution is a special case of the normal distribution. For the standard normal distribution, the
value of the mean is equal to zero and the value of the standard deviation is equal to 1.
z Values or z Scores The units marked on the horizontal axis of the standard normal curve are denoted by z and are called
the z values or z scores. A specific value of z gives the distance between the mean and the point represented by z in
terms of the standard deviation. For example, a point with a value of z = 2 is two standard deviations to the right of the
mean. Similarly, a point with a value of z = −2 is two standard deviations to the left of the mean.
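A z score is obtained as z = (x − μ) / σ, and areas under the standard normal curve can be read from its cumulative distribution function. The sketch uses Python's `statistics.NormalDist`; the values x = 130, μ = 100 and σ = 15 are hypothetical.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Distance of x from the mean, in units of the standard deviation."""
    return (x - mu) / sigma

standard_normal = NormalDist()   # mean 0, standard deviation 1

def area_within(k):
    """Area under the standard normal curve between -k and +k."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

z = z_score(130, mu=100, sigma=15)   # hypothetical values
print(z)
for k in (1, 2, 3):
    print(k, round(area_within(k), 4), round(1 - area_within(k), 4))
```

The last column is the area beyond μ ± kσ, which becomes very small for k = 3.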
CLASS-VII
6th September 2023
Hypothesis Testing
Hypothesis testing and its application in biostatistics
The assignment of value(s) to a population parameter based on a value of the corresponding sample statistic is called
estimation.
Hypothesis testing
Hypothesis testing begins with an assumption called a hypothesis, that we make about a population parameter.
Hypothesis testing is a statistical method used to determine if there is enough evidence in sample data to draw conclusions about a
population.
Testing a hypothesis (or test of significance) is generally a process of testing the significance of a statement about parameters of the population, on the
basis of a sample.
Purpose: to help the researcher reach a conclusion regarding the population by examining a sample from that population.
Hypothesis: It is a statement about one or more populations, which deals with the parameters of the population about which the statement is
made. Thus, via hypothesis testing, we can decide whether or not such statements are compatible with the available data.
Statistical Hypothesis testing
Hypothesis is stated in such a way that it may be
evaluated by appropriate statistical techniques.
Steps in statistical hypothesis testing
1. Test statistics
2. Decision Rule
3. Significance Levels
4a. Statistical decision
Then we state that the interval $2630 to $3310 is likely to contain the population mean, μ, and that the mean housing
expenditure per month for all households in the United States is between $2630 and $3310.
The value $2630 is called the lower limit of the interval, and $3310 is called the upper limit of the interval.
The number we add to and subtract from the point estimate is called the margin of error or the maximum error of the
estimate.
Estimation of a Population Mean: σ Known
Confidence interval for the population mean μ when the population standard deviation σ is known.
Next, we substitute all the values in the confidence interval formula for μ.
The 90% confidence interval for μ is x̅ ± zσx̅ = 145 ± 1.65(7.00) = 145 ± 11.55 = (145 − 11.55) to (145 + 11.55) =
$133.45 to $156.55
Thus, we are 90% confident that the mean price of all such college textbooks is between $133.45 and $156.55.
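The interval above can be reproduced in a few lines. The point estimate 145, the standard error 7.00 and the table value z = 1.65 are taken from the worked example; the function name is illustrative. The exact z value from `NormalDist` is shown for comparison, since tables are usually rounded.

```python
from statistics import NormalDist

def confidence_interval(x_bar, standard_error, z):
    """Point estimate plus and minus the margin of error z * standard_error."""
    margin = z * standard_error
    return x_bar - margin, x_bar + margin

lower, upper = confidence_interval(145, 7.00, 1.65)
print(round(lower, 2), round(upper, 2))

# Unrounded z for 90% confidence: 5% of the area lies in each tail.
z_exact = NormalDist().inv_cdf(0.95)
print(round(z_exact, 4))
```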
CLASS-VIII
13th September 2023
Statistical Tests
Parametric and Non parametric test
Parametric test: Assume the data is of sufficient “Quality”
The results can be misleading if the assumptions are wrong.
Quality is defined in terms of certain properties of data.
Assumptions:
• Random Independent samples
• Interval or ratio level measurement
• Normally distributed
• No outliers
• Homogeneity of variance
• Sample size larger than minimum of many non parametric test.
More power to detect a difference that truly exists.
Tests:
1. Large sample test
• Z test
2. Small sample test
• T test
• Independent/ unpaired T test
• Paired T test
• ANOVA (Analysis of Variance)
• One Way ANOVA
• Two Way ANOVA
3. Pearson Correlation
Non-parametric test: distribution free test. Can be used when data is not of sufficient quality to satisfy the assumptions of
parametric test.
Used for Skewed data
Median is used
Distribution can be checked with:
• Histogram
• Kolmogorov Smirnov and Shapiro Wilk test
Tests:
1. Mann Whitney U test/Wilcoxon rank sum test (Independent)
2. Wilcoxon Signed rank test (Dependent)
3. Kruskal Wallis Test (more than 2 groups)
4. Fisher Exact Test
5. Chi Square test
6. Spearman Correlation
Z test
• Use for testing the mean of a population versus a standard
OR
• a statistical test to determine whether two population means are different when the variances are known
and the sample size is large (n>=30)
• can be performed on one sample, two samples, or on proportions for hypothesis testing.
• A z-statistic, or z-score, is a number representing the result from the z-test.
Example: 1. Comparing the average salaries of men versus women after M.Sc. degree.
2. Comparing the fraction defectives from two production lines.
One-Sample Z Test
A one-sample z test is used to check if there is a difference between the sample mean and the population
mean when the population standard deviation is known.
Formula: z = (x̅ − μ) / (σ / √n)
x̅ is the sample mean,
μ is the population mean,
σ is the population standard deviation,
n is the sample size.
Left Tailed Test:
Null Hypothesis: H0: μ=μ0
Alternate Hypothesis: H1 : μ<μ0
Decision Criteria: If the z statistic < z critical value then reject the null hypothesis.
Right Tailed Test:
Null Hypothesis: H0: μ=μ0
Alternate Hypothesis: H1 : μ>μ0
Decision Criteria: If the z statistic > z critical value then reject the null hypothesis.
Two Tailed Test:
Null Hypothesis: H0: μ=μ0
Alternate Hypothesis: H1 : μ≠μ0
Decision Criteria: If the absolute value of the z statistic > z critical value then reject the null hypothesis.
Example: A doctor claims that a particular hospital contains more than 100 diabetes patients with a sugar level of more
than 234.
To verify the claim, a random test was conducted on 90 diabetes patients. The test resulted in a mean blood sugar level of
279. In addition, the test resulted in a standard deviation of 18.
Here, we set the significance level at 22.50
Z-test have three main steps:
1.Identifying null and alternate hypotheses.
2.Measuring the statistical significance.
3.Comparing the z score with the significance level. Based on the comparison, the null hypothesis is either accepted or
rejected.
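The three steps can be sketched for a right-tailed one-sample z test. The numbers come from the diabetes example, with two assumptions made here for illustration: 234 is read as the hypothesised mean μ0, and the significance level is taken as α = 0.05. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z(x_bar, mu0, sigma, n):
    """z = (sample mean - hypothesised mean) / (sigma / sqrt(n))."""
    return (x_bar - mu0) / (sigma / sqrt(n))

# Step 1: H0: mu = 234, H1: mu > 234 (assumed reading of the example).
# Step 2: compute the test statistic from the sample.
z = one_sample_z(x_bar=279, mu0=234, sigma=18, n=90)

# Step 3: compare with the critical value for the chosen alpha.
alpha = 0.05                                   # assumed significance level
z_critical = NormalDist().inv_cdf(1 - alpha)   # right-tailed critical value
reject_null = z > z_critical

print(round(z, 2), round(z_critical, 3), reject_null)
```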
One Proportion Z Test: A one proportion z test is used when there are two groups and compares the value of an observed
proportion to a theoretical one.
p is the observed value of the proportion, p0 is the theoretical
proportion value and n is the sample size.
Two Proportion Z Test: A two proportion z test is conducted on two proportions to check if they are the same or not.
6. Decision
The observed z value (−1.51) lies between −1.96 and +1.96, so its absolute value is less than z critical (1.96). Therefore, there is no reason to reject H0.
The evidence would suggest that both drugs have the same effect on mental concentration.
Example: A company wants to improve the quality of products by reducing defects and monitoring the efficiency of assembly lines. In
assembly line A, there were 18 defects reported out of 200 samples while in line B, 25 defects out of 600 samples were noted. Is there a
difference in the procedures at a 0.05 alpha level?
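Here p1 = 18/200 = 0.09, p2 = 25/600 ≈ 0.0417, and the pooled proportion is p = (18 + 25)/(200 + 600) ≈ 0.0538.
z = (p1 − p2) / √(p(1 − p)(1/n1 + 1/n2)) = 2.62
As 2.62 > 1.96, the null hypothesis is rejected and it is concluded that there is a significant difference between the two lines.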
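The assembly-line calculation can be checked with a short Python sketch. It assumes the pooled-proportion form of the two proportion z statistic, which reproduces the reported value of 2.62. The function name `two_proportion_z` is hypothetical.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Two proportion z statistic using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error

# Line A: 18 defects in 200 samples; line B: 25 defects in 600 samples.
z = two_proportion_z(18, 200, 25, 600)

z_critical = 1.96  # two-tailed critical value at alpha = 0.05
reject_null = abs(z) > z_critical

print(f"z = {z:.2f}, reject H0: {reject_null}")
```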
Example (paired t test): the same subjects are tested before a treatment, say for high blood sugar levels, and tested again after treatment with a blood-sugar-lowering medication.
Unpaired/independent t test: examines the averages/means of two independent or unrelated groups to see if there is a
statistically significant difference between them.
Since 2.74 > 2.28, we can reject the null hypothesis that there is no difference between the means.
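As an illustration of how such a t value arises, the sketch below computes an unpaired t statistic with a pooled variance. The two groups are hypothetical data, not the data behind the 2.74 result above. The critical value 2.306 is the standard two-tailed t-table entry for 8 degrees of freedom at the 0.05 level. The function name `unpaired_t` is an assumption.

```python
import math
from statistics import mean, variance

def unpaired_t(group_a, group_b):
    """Return (t, df) for an independent two-sample t test with pooled variance."""
    n1, n2 = len(group_a), len(group_b)
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * variance(group_a) + (n2 - 1) * variance(group_b)) / df
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(group_a) - mean(group_b)) / standard_error, df

# Hypothetical measurements for two unrelated groups.
group_a = [5, 7, 8, 6, 9]
group_b = [3, 4, 6, 5, 2]

t, df = unpaired_t(group_a, group_b)
t_critical = 2.306  # t table, two-tailed, alpha = 0.05, df = 8
reject_null = abs(t) > t_critical

print(f"t = {t:.2f}, df = {df}, reject H0: {reject_null}")
```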
In t test, the degree of freedom is given by _______.
a. N − 1 b. N1 − N2 + 2 c. N1 + N2 − 2 d. N1 + N2 + 2
The null hypothesis (H0) of ANOVA is that there is no difference among group means.
The alternative hypothesis (Ha) is that at least one group differs significantly from the overall mean of the dependent variable.
If any of the group means is significantly different from the overall mean, then the null hypothesis is rejected.
ANOVA uses the F test for statistical significance.
The F test compares the variance between the group means with the variance within the groups: F = (variance between groups) / (variance within groups). If the variance within groups is smaller than the variance between groups, the F test will find a higher F value, and therefore a higher likelihood that the difference observed is real and not due to chance.
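The ratio of between-group to within-group variance can be sketched for a one-way ANOVA as follows. The three groups are hypothetical data and `one_way_anova_f` is an illustrative name. Turning the F value into a decision would still require an F table or a statistics package.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the F statistic: mean square between / mean square within."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    k, n_total = len(groups), len(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical measurements for three groups.
groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]
f_value = one_way_anova_f(groups)
print(f"F = {f_value:.2f}")
```

Here the group means are far apart relative to the spread inside each group, so F is large.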
Two way ANOVA: used to determine the effect of two nominal/categorical predictor/independent variables on a continuous
outcome/dependent variable.
Both of your independent variables should be categorical. If one of your independent variables is categorical and one is quantitative,
use an ANCOVA instead.
ANOVA tests for significance using the F test for statistical significance.
Example: You are researching which type of fertilizer and planting density produces the greatest crop yield in a field experiment. You
assign different plots in a field to a combination of fertilizer type (1, 2, or 3) and planting density (1=low density, 2=high density), and
measure the final crop yield in bushels per acre at harvest time.
A bushel is a unit of volume that can be used to measure the amount of a crop that has been harvested. A bushel is equal to 8 imperial gallons, or 36.37 litres.
You can use a two-way ANOVA to find out if fertilizer type and planting density have an effect on average crop yield.
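ANOVA formula: F = MS(effect) / MS(error), where each mean square (MS) is a sum of squares divided by its degrees of freedom.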
Correlation: The relationship between two metric continuous variables is called correlation.
Correlation tests check whether variables are related without hypothesizing a cause-and-effect relationship.
Pearson Correlation Coefficient
It is a common way of measuring linear correlation. The coefficient is a number between -1 and 1 that indicates the strength and direction of the relationship between two variables. A positive coefficient means that as one variable changes, the other tends to change in the same direction; a negative coefficient means that the two variables tend to change in opposite directions.
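A minimal sketch of the Pearson coefficient, computed directly from its definition (covariation of the two variables divided by the product of their spreads). The data and the function name `pearson_r` are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(f"r = {r:.3f}")  # positive: y tends to increase with x
```

A perfectly increasing linear relationship gives r = 1 and a perfectly decreasing one gives r = -1.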
2. Wilcoxon signed rank test: Used to test whether or not the difference between two paired population medians is zero.
Variables can be either metric or ordinal.
Distributions may be of any shape, but the differences should be distributed symmetrically.
This is the non-parametric equivalent of the matched-pairs t test.
3. Kruskal-Wallis test: Used to test whether the medians of three or more independent groups are the same.
Variables can be either ordinal or metric.
Distributions may be of any shape, but all need to be similar.
This non-parametric test is an extension of the ANOVA.
4. Chi-squared test: Used to test whether the proportions across a number of categories of two or more independent groups are the same.
Variables must be categorical.
The chi-squared test is also a test of the independence of the two variables.
5. Fisher’s Exact test: Used to test whether the proportions in two categories of two independent groups are the same.
Variables must be categorical.
This test is an alternative to the 2 × 2 chi-squared test, when cell sizes are too small.
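As an illustration of the chi-squared test from the list above, the sketch below computes the statistic for a contingency table of counts. The 2 × 2 table is hypothetical, no continuity correction is applied, and `chi_squared` is an illustrative name. The critical value 3.841 is the standard chi-squared table entry for 1 degree of freedom at the 0.05 level.

```python
def chi_squared(table):
    """Return (statistic, df) for a contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df

# Hypothetical counts: rows are two groups, columns are two categories.
table = [[20, 30], [30, 20]]

statistic, df = chi_squared(table)
chi_critical = 3.841  # chi-squared table, alpha = 0.05, df = 1
reject_null = statistic > chi_critical

print(f"chi-squared = {statistic:.2f}, df = {df}, reject H0: {reject_null}")
```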
Choosing a nonparametric test
Statistical tests in hypothesis testing
Practice questions
1. If metric data is skewed, the choice of statistical test should be__________ .
a. parametric tests b. non parametric tests c. one sample t test d. z test
3. The statistical test used to test the difference in medians of three independent groups is _____.
a. One way ANOVA b. Kruskal-Wallis test c. Wilcoxon signed rank test d. Friedman rank test
4. The test for knowing association between two metric variable is _________.
a. Pearson correlation b. Spearman correlation c. Phi correlation d. None of above
5. The following test is used to test whether or not the difference between two population means is zero.
a. t test b. Paired t test c. Mann Whitney U test d. One sample t test
6. One sample z test is used to test a single sample when the ______ is known.
a. population variance b. sample variance c. a and b d. none of above
7. If we are testing one sample and sample size is more than 30, which of the following test can be used?
a. two sample z test b. one sample t test c. one sample z test d. two sample t test
8. If the calculated value of z lies in the critical region then null hypothesis is_____.
a. rejected b. accepted
Practice questions
1. If a distribution is skewed to the left, then it is __________.
a. Negatively skewed
b. Positively skewed
c. Symmetrically skewed
d. Symmetrical
2. If a test was generally very easy, except for a few students who had very low scores, then the distribution of scores would be _____.
a. positively skewed
b. negatively skewed
c. not skewed at all
d. Normal
3. The range of a sample gives an indication of the ________.
a. way in which the values cluster about a particular point
b. number of observations bearing the same value
c. maximum variation in the sample
d. degree to which the mean value differs from its expected value.
4. Which range characterizes the interquartile range?
a. From 5th percentile to 95th percentile
b. From 10th percentile to 90th percentile
c. From 25th percentile to 75th percentile
d. From 1 standard deviation below the mean to 1 standard deviation above the mean
5. The measure of dispersion most commonly used in conjunction with the mean is the:
a. interquartile range
b. range
c. standard deviation
d. Variance
6. The probability of an intersection P(A∩B) is easily determined by using ______.
a. Addition theorem
b. Multiplication theorem
c. Bayes' theorem
d. Addition and subtraction theorem
7. The probability of an event that is certain to happen is equal to _______.
a. 100
b. 1
c. 10
d. 0
8. About 99% of the observations lie within _______ standard deviations either side of the mean.
a. 1 b. 2 c. 3 d. 4
CLASS-IX
15th September 2023
DBMS-Part I
Unit 2
• Database concept
• Database Management System
• 2 tier and 3 tier structure
• 3 level Architecture of DBMS
• Keys: Candidate, Primary and foreign