MSBT 109 _UNIT 1


Biostatistics and Computer Application

M.Sc. Biotechnology
(MSBT-109- DTU)

Unit I: Fundamental of probability; Hypothesis testing
• Central limit theorem - normal distribution
• Discrete & continuous variables
• Confidence interval & levels of significance & critical region
• Null & alternate hypothesis
• Statistical test: p-value
 Student t test
 Z test – Fisher test
 ANOVA - one way or two way
• Application of statistical tests in research studies

Unit II: Database Management; Introduction to programming language
• Types of databases
• Data retrieval & browsing
• Databases like PubMed, NCBI, UCSC, EMBL browsing examples
• Basics of programming language
• Application of language in biostatistics

Unit III: Cluster analysis
• Phylogeny clustering
• Pattern recognition methods
• Sequence analysis
• Protein family classification

Unit IV: Microarray technique; Spectroscopy for protein structure
• RNA expression studies
• Data normalization
• Differential expression of genes - threshold p-values
• Introduction to X-ray crystallography and NMR spectroscopy
• Protein structure database & visualization

Unit V: Modelling Methods
• Homology modelling
• Protein structure prediction
• Parameters: force field, Gibbs energy, Monte Carlo, energy minimization and simulation/docking
CLASS-I
16 Aug 2023
What is Statistics
Statistics is the area of applied math that deals with the collection, organization, analysis, interpretation, and
presentation of data.

Biostatistics is the application of statistical techniques to scientific research in health-related fields, including
medicine, biology, and public health.

Two main branches of Statistics:


• Descriptive Statistics
• Inferential Statistics

Descriptive statistics focus on describing the visible characteristics of a dataset. A data set is a collection of
responses or observations from a sample or entire population. They represent all of the procedures that can be
used to organise, summarise, display, and categorise data collected for a certain experiment or event. It
includes tabulation, graphical presentation, measures of central tendency, etc

Inferential statistics focus on making predictions or generalizations about a larger dataset, based on a
sample of those data. The inferential statistics include z test, t test, analysis of variance, etc.

In short, statistics is about summarizing data and answering questions based on data.

A measurement or datum is a single observation. Data (plural) are a collection of scores.


Descriptive Statistics

Descriptive statistics

• Describe the features of populations and/or samples


• Organize and present data in a purely factual way

• Present final results visually, using tables, charts, or graphs

• Draw conclusions based on known data

• Use measures like central tendency, distribution, and variance


Descriptive Statistics

Descriptive Statistics are used to summarize, organize and simplify data


Inferential Statistics
Inferential statistics

• Use samples to make generalizations about larger populations

• Help us to make estimates and predict future outcomes

• Present final results in the form of probabilities

• Draw conclusions that go beyond the available data

• Use techniques like hypothesis testing, confidence intervals, and regression and correlation analysis
Population and Samples
• The population may be defined as the entire number of observations that constitute a particular
group. Samples are generally a relatively small group of observations that have been taken from a
defined population.
• Parameters are characteristics of populations, while statistics are characteristics of samples: summary measures computed on observed sample values.
• Parameters or population values are usually represented by Greek symbols (e.g. μ) and sample statistics are denoted by Roman letters (e.g. x̄).
Population and Samples
Ques: what’s the mean diastolic blood pressure of the given population?

Sampling error: The discrepancy between the sample statistic and the true population parameter it is estimating.

To reduce the sampling error:
 Sufficiently large sample size
 Random selection (unbiased sample collection)
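
The effect of sample size on sampling error can be sketched with a small simulation. The population below is hypothetical (randomly generated diastolic blood pressures, not data from the slides), and `sampling_error` is an illustrative helper name.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: diastolic blood pressure (mmHg) of 10,000 people.
population = [random.randint(60, 100) for _ in range(10_000)]
mu = statistics.mean(population)  # population parameter


def sampling_error(pop, n):
    """Sample mean minus population mean for one simple random sample of size n."""
    sample = random.sample(pop, n)  # random selection without replacement
    return statistics.mean(sample) - statistics.mean(pop)


# Larger samples tend to give smaller errors, although any single draw can vary.
for n in (10, 100, 1000):
    print(n, round(sampling_error(population, n), 2))
```

When the sample is the whole population, the sample statistic equals the parameter and the sampling error is zero.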
Applications and Uses of Biostatistics

 Problem Solving
 Use in Clinical Trials
 Use in Manufacturing of Pharmaceuticals
 Use in Quality Control of Pharmaceuticals
 Use in Research and Development of Drugs and Technology
 Use in Anatomy and Physiology
 Use in Pharmacology and Medicine
 Use in Community Medicine and Public Health
 Use in Health and Vital statistics
 Use in Biotechnology, Bioinformatics and Computational Biotechnology
Variables and Constants
 A constant is a characteristic that is fixed across conditions. For example: fingerprint, country in which you
were born, genotype.
 A variable is something that changes across conditions. For example salary, age, marital status, respiratory
rate, blood type etc. Data are the values we get when we measure or observe a variable.
 How do we measure things?
Qualitatively: By putting them into categories
Quantitatively: By using numbers
Variables
Nominal variable: Typical characteristics of this variable are that it does not have any units of measurement, and the ordering of the categories is completely arbitrary. Example: Blood group types - nominal categorical variable.

Ordinal variable: These data also do not have any units of measurement, like nominal variables, but the ordering of the categories is not arbitrary as it was with nominal variables. Example: Scoring of patients - ordinal.
Ordinal data are not real numbers. They cannot be placed on the number line. As ordinal data are not real numbers, it is not appropriate to apply any of the rules of basic arithmetic to this sort of data.

Continuous variable: With metric variables, proper measurement is possible and therefore these variables produce data that are real numbers, and can be placed on the number line.
Metric continuous variables can be properly measured and have units of measurement. Example: Birth weight (g), blood pressure (mmHg) - metric continuous.

Discrete variable: The data produced are real numbers, and are invariably integers (i.e. whole numbers). They can be placed on the number line, and have the same interval and ratio properties as continuous metric data.
Metric discrete variables can be properly counted and have units of measurement - ‘numbers of things’. Example: Number of deaths, number of pressure sores, number of angina attacks - discrete variables.

Continuous metric data usually come from measuring, while discrete metric data usually come from counting.
Identify type of variable:

1. Dosage form - tablet / capsule / ointment
2. Bioavailability measurements (Cmax, Tmax, AUC)
3. Age (in years)
4. Hypertension - mild, moderate and severe
5. Smoking history (cigarettes per day)
6. Test- pass or fail criteria
7. Weight
8. Male vs female subjects
9. Scores for patient responses to treatment
Summary
Fundamental of Biostatistics- Basic concepts and terminologies
1. Biostatistics is a branch of statistics applied to _______ science whereby collection, classification, summarising, analysis and interpretation of data is done.
a. pharmaceutical b. medicinal c. biological d. chemical
2. Descriptive statistics represents all of the procedures that can be used to _______.
a. organise, summarise b. display and categorise c. a & b d. none of the above
3. The population is _______.
a. the entire number of observations that constitutes a particular group b. the entire number of samples that constitutes a particular group c. a & b d. none of the above
4. Categorical variables can be divided into _______.
a. nominal & continuous b. nominal & discrete c. discrete & continuous d. nominal & ordinal
5. Metric data can be divided into _______.
a. nominal & continuous b. nominal & discrete c. discrete & continuous d. nominal & ordinal
6. The goal of _______ is to focus on summarizing and explaining a specific set of data.
a. inferential statistics b. descriptive statistics c. none of the above d. all of the above
7. Metric variables can be properly _______.
a. measured b. counted c. a & b d. none of the above
8. In categorical variables, values are in _______ categories.
a. arbitrary & counted b. arbitrary & measured c. arbitrary & ordered d. counted & measured
9. Bioavailability measurement is a _______ variable.
a. metric continuous b. metric discrete c. categorical continuous d. categorical ordinal
10. Scores for patient responses to treatment is a _______ variable.
a. metric continuous b. metric discrete c. categorical continuous d. categorical ordinal
Identify type of variable:

• Eye color
• Gender
• Places in a race
• Cloth sizes
• Olympics medals
• Ethnicity


CLASS-II
18 Aug 2023
Learning objectives

01 Frequency distribution
02 Graphical representations
03 Shape of data distribution
04 Central tendency and dispersion
Fundamental of Biostatistics- Data Representation and Distribution
 Whenever data are collected for a project, they are usually in ‘raw’ form and not organised.
 Descriptive statistics deals with sorting this raw data by putting it into a table, presenting it in an appropriate chart, or summarising it numerically.
 An important consideration in sorting the raw data is the type of variable concerned. The data from some variables are best described with a table, some with a chart, and some with both. However, a numeric summary is more appropriate for some types of variable.
 Tabulation is the first step before the data is used for analysis or interpretation.

 Frequency distribution tables present data in a relatively compact form, ready to use, but certain information may be lost. The data can be reduced to a manageable form using frequency tables. A table can have one or all of the following parameters, depending on the type of data.
1. Frequency tells you how often something happened, i.e. the number of times that any particular value occurs in the data.
2. Relative frequency is the frequency converted into a percentage of the total number of observations.
3. Cumulative frequency is the running total of frequencies, obtained by adding the frequency of observations at each level to the frequencies of the preceding level(s).
4. Cumulative relative frequency is the cumulative frequency converted into a percentage of the total number of observations.
Frequency tables can show either categorical variables (sometimes called qualitative variables) or quantitative variables (numeric values).
 Frequency distribution in nominal data

Relative frequency table showing the percentage of students in each blood group

The frequency table shows that a higher percentage of nurses than of doctors work in rural areas, but that, overall, a greater proportion of staff works in urban areas (67%).
 It makes no sense to calculate cumulative frequency for nominal data, because the category order is arbitrary. It can, however, be calculated for ordinal data.

The cumulative and relative cumulative frequency distributions for data
 The most useful approach with metric continuous data is to group them first, and then construct a frequency
distribution of the grouped data.
It requires the following major components: class, class limits, boundaries and intervals. Class marks and open-ended groups can also be used.
• Class: (range of data points) ÷ (range of classes) eg 20 ÷ 5 = 4
• Class size: Difference between the lower and upper class-limits.
• Class limit: Represent the smallest and largest data values that can belong to each class.
• Class interval: Difference between the upper and lower class limit
• Classmark : Average of the upper and lower limits of a class.
 Grouped frequency table:
Exclusive class interval: the upper limit of one class is the same as the lower limit of the succeeding class. Note that in continuous cases, an observation equal to a shared limit is always included in the class where it is the lower limit. For example, if a student has scored 5 marks in the test, the marks would be included in the class interval 5-10 and not 0-5.
Inclusive class interval: the upper limit of a class is not repeated as the lower limit of the succeeding class (e.g. 0-9, 10-19). When a frequency distribution is analysed, the inclusive class interval has to be converted to an exclusive class interval. With whole-number limits, this can be done by subtracting 0.5 from the lower class limit and adding 0.5 to the upper class limit.
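
The inclusive-to-exclusive conversion can be sketched as below. The class limits are hypothetical, and the function name `to_exclusive` and its `gap` parameter (the distance between one class's upper limit and the next class's lower limit, assumed to be 1) are illustrative choices.

```python
def to_exclusive(inclusive_classes, gap=1):
    """Convert inclusive class limits to exclusive class boundaries.

    Half of the gap between adjacent classes is subtracted from each lower
    limit and added to each upper limit (0.5 when limits are whole numbers).
    """
    half = gap / 2
    return [(lower - half, upper + half) for lower, upper in inclusive_classes]


inclusive = [(0, 9), (10, 19), (20, 29)]  # hypothetical inclusive classes
boundaries = to_exclusive(inclusive)
print(boundaries)  # [(-0.5, 9.5), (9.5, 19.5), (19.5, 29.5)]
```

After conversion, the upper boundary of each class coincides with the lower boundary of the next, as required for an exclusive class interval.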

The relative, cumulative and relative cumulative frequency distribution table

The frequency distribution table for metric discrete data


Practice question

Construct a frequency table for the data on marks obtained by 20 students in their math exam.

20,43,74,89,75,60,31,43,37,36,50,38,21,99,93,45,64,92,38,60

Class Freq Relative freq Cumulative freq Relative Cumulative freq


20-40
40-60
60-80
80-100
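
One way to check a hand-built table for this practice question is a short script. This is a sketch that assumes exclusive class intervals (a mark equal to a shared limit goes into the class where it is the lower limit); `frequency_table` is an illustrative helper name.

```python
marks = [20, 43, 74, 89, 75, 60, 31, 43, 37, 36,
         50, 38, 21, 99, 93, 45, 64, 92, 38, 60]
classes = [(20, 40), (40, 60), (60, 80), (80, 100)]


def frequency_table(data, class_limits):
    """Return rows of (class, freq, relative %, cumulative, cumulative %)."""
    n = len(data)
    rows = []
    cumulative = 0
    for lower, upper in class_limits:
        # Exclusive interval: lower limit included, upper limit excluded.
        freq = sum(lower <= x < upper for x in data)
        cumulative += freq
        rows.append((f"{lower}-{upper}", freq, 100 * freq / n,
                     cumulative, 100 * cumulative / n))
    return rows


table = frequency_table(marks, classes)
for label, freq, rel, cum, rel_cum in table:
    print(f"{label:>7} {freq:>4} {rel:>6.1f} {cum:>4} {rel_cum:>6.1f}")
```

Under this convention a mark of exactly 100 would fall outside the last class, so the final class would need to be closed at its upper end if such a mark occurred.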
 The Graphical Presentation of Data
 Presenting data with an appropriate chart is a good way of describing them effectively. The appropriate chart depends primarily on the type of data, as well as on what particular features of it we are looking for.

1. Pie chart is a diagram in which the frequencies of the groups are shown in a circle. Each segment (slice) of a pie chart should be proportional to the frequency of the category it represents. A disadvantage of a pie chart is that it can only represent one variable. It can lose clarity if it is used to represent more than four or five categories.

Pie chart of percentage blood group of pharmacy students

2. Bar chart: It is a chart with frequency on the vertical axis and category on the horizontal axis. The simple bar chart is appropriate if only one variable is to be shown. All the bars should be of the same width, and there should be equal spaces between bars. These spaces emphasise the categorical nature of the data.

Simple bar chart of blood group of pharmacy students

3. Clustered bar chart: If there is more than one group, we can use the clustered bar chart. There are two ways of presenting a clustered bar chart. It compares the relative sizes of the groups within each category.

Clustered bar chart of blood group of 95 pharmacy students by sex

4. Stacked bar charts: The bars are now stacked on top of each other. Stacked bar charts are appropriate if we want to compare the total number of subjects in each group but not so good if we want to compare category sizes between groups.

A stacked bar chart of blood group of 95 pharmacy students by sex

5. Pictograms are similar to bar charts. They present the same type of information, but the bars are replaced with a proportional number of icons. This type of presentation for descriptive statistics dates back to the beginning of civilization when pictorial images were used to record numbers of people, animals or objects.

Population of different districts of Western Maharashtra. Each diagram indicates one lakh population.
6. Line chart is similar to a bar chart except that thin lines, instead of thicker bars, are
used to represent the frequency associated with each level of the discrete variable

7. Point plots are identical to line charts, however, instead of a line, a number of points or
dots equivalent to the frequency are stacked vertically for each value of the horizontal
axis. Also referred to as dot diagram, point plots are useful for small data sets.

Line and Point Plot of assay result of 30 amoxicillin tablets


8. A histogram looks like a bar chart but without any gaps between adjacent bars. This
emphasises the continuous nature of the underlying variable.
- If the groups in the frequency table are all of the same width, then the bars in the
histogram will also be of the same width.
- One limitation of the histogram is that it can represent only one variable at a time (like
the pie chart), and this can make comparisons between two histograms difficult.

Histogram of the grouped weight in kg

9. A frequency polygon can be constructed by placing a dot at the midpoint (class mark) for each class interval in the histogram and then connecting these dots by straight lines.
- This frequency polygon gives a better conception of the shape of the distribution. The class interval midpoint for a section in a histogram is calculated as follows: Midpoint = (highest + lowest point)/2
- The frequency polygon is then created by listing the midpoints (class marks) on the x
axis, frequencies on the y-axis, and drawing lines to connect the midpoints for each
interval.

A Frequency Polygon of the grouped weight in kg


10. Relative Cumulative Frequency Curve (Ogive) is a graph of the cumulative relative frequency distribution.
 So, to draw Ogive we should convert ordinary frequency distribution into relative cumulative frequency.
 With continuous metric data, there is assumed to be a smooth continuum of values, so we can chart relative cumulative frequency with a correspondingly smooth curve, or ogive.

Ogive of the grouped weight of 60 final year students

 By using the ogive we can locate any percentile that will divide the series into parts.
 Quartiles: These are three points on the range of the variable that divide the distribution into four equal parts. By using these quartiles we can calculate the inter quartile range (Q3 - Q1) and the semi inter quartile range.
- Q1 or lower quartile will have 25% of observations falling on its left and 75% of observations on its right side.
- Q2 is the median, i.e. 50% of values lie on either side.
- Q3 is the upper quartile and will have 75% of observations falling on its left side and 25% of observations on its right side.
 Quintiles: These divide the distribution into 5 equal parts. So, the 20th percentile or 1st quintile will have 20% of observations falling to its left and 80% to its right.
 Deciles: These divide the distribution into 10 equal parts. The first decile (10th percentile) will have 10% of values to its left and 90% of values to its right. The 5th decile is the median and has 50% of values on either side.
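
Quartiles, quintiles and deciles can also be computed directly rather than read off an ogive. The sketch below uses Python's standard `statistics.quantiles` with its default method; the data set (the values 1 to 99) is hypothetical and chosen so that the cut points come out as round numbers.

```python
import statistics

scores = list(range(1, 100))  # hypothetical data: the 99 values 1, 2, ..., 99

quartiles = statistics.quantiles(scores, n=4)   # 3 cut points, 4 equal parts
quintiles = statistics.quantiles(scores, n=5)   # 4 cut points, 5 equal parts
deciles = statistics.quantiles(scores, n=10)    # 9 cut points, 10 equal parts

q1, q2, q3 = quartiles
iqr = q3 - q1        # inter quartile range
semi_iqr = iqr / 2   # semi inter quartile range

print(quartiles, iqr, semi_iqr)
print(quintiles)
print(deciles)
```

Different textbooks and software packages interpolate quantiles slightly differently, so values for small data sets may not match a hand calculation exactly.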
CLASS-III
23 Aug 2023
11. Box-whisker plot: The box-and-whisker plot displays a great deal of information about a continuous variable. It shows the bulk of the data as a rectangular box in which the upper and lower lines represent the third quartile (75% of observations below Q3) and first quartile (25% of observations below Q1), respectively. The second quartile (50% of the observations below this point) is depicted as a horizontal line through the box. Vertical lines (whiskers) extend from the top and bottom lines of the box to an upper and lower adjacent value.

12. The stem-and-leaf plot is a visual representation for continuous data and contains features
common to both the frequency distribution and dot diagrams. Digits, instead of bars are used to
illustrate the spread and shape of the distribution. Each piece of data is divided into "leading" and
"trailing" digits.

All the leading digits are sorted from lowest to highest and listed to the left of a vertical line. These
digits become the stem. The trailing digits are then written in the appropriate location to the right
of the vertical line. These become the leaves.
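
The construction described above can be sketched for two-digit data, taking the tens digit as the stem and the units digit as the leaf. The weights are hypothetical and `stem_and_leaf` is an illustrative helper name.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group two-digit values by leading digit (stem) with sorted trailing digits (leaves)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)  # leading digit -> list of trailing digits
    return dict(stems)


weights = [52, 47, 61, 58, 49, 63, 55, 47, 70, 66]  # hypothetical weights in kg
plot = stem_and_leaf(weights)
for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```

For these values the printed rows are `4 | 7 7 9`, `5 | 2 5 8`, `6 | 1 3 6` and `7 | 0`, which shows the shape of the distribution while keeping every original value readable.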
13. A scatter diagram is an extremely useful presentation for showing the relationship between two
continuous variables. The two dimensional plot has both horizontal and vertical axes which cover
the ranges of the two variables. Plotted data points represent paired observations for both the x and
y variable. These types of plots are valuable for correlation and regression inferential tests.

14. Time series plot: If the data collected are from measurements made at regular intervals of time (minutes, weeks, years, etc.), we can present the data with a time series chart. Usually these charts are used with metric data, but may also be appropriate for ordinal data. Time is always plotted on the horizontal axis, and data values on the vertical axis.
Fundamental of Biostatistics- Data Distribution
The choice of the most appropriate procedures for summarising and analysing data will not only depend on the type of variable but also on the shape of the
distribution.

1. The values are fairly evenly spread throughout their possible range. This is a uniform distribution.
2. The values are concentrated towards the bottom of the range, with progressively fewer values towards the top of the range. This is a right or positively skewed distribution.
3. The values are concentrated towards the top of the range, with progressively fewer values towards the bottom of the range. This is a left or negatively skewed distribution.
4. The values are clumped together around one particular value, with progressively fewer values both below and above this value. This is a symmetric or mound-shaped distribution.
5. The values are clumped around two or more particular values. This is a bimodal or multimodal distribution.

 There is one particular symmetric bell-shaped distribution, known as the Normal distribution. Many human clinical features are distributed normally.
 It is a bell-shaped curve.
 It is symmetrical in distribution; the observations on either side of the mean are equal in number.
 Its maximum height is at the mean.
 The mean, median and mode coincide (mean = median = mode).
 Skewness of the curve is zero.
 It is asymptotic, in that the tails never touch the baseline.
 Total area under the curve is one; for the standard normal distribution the standard deviation is also one.
 It has two points of inflection. The central part is convex and, as it comes down, it becomes concave on both sides.

Skewness measures asymmetry around the mean. The parameter is best interpreted relative to the normal distribution (whose skewness equals zero).
The interpretation of the skewness is:
 Skewness > 0: asymmetric, with the longer tail extending towards values above the mean
 Skewness < 0: asymmetric, with the longer tail extending towards values below the mean
Markedly skewed data are generally analysed using non-parametric tests, while normally distributed data are analysed using parametric tests.

Kurtosis is a property associated with a frequency distribution and refers to the shape of the distribution of values regarding its relative flatness and peakedness. Compared with the normal distribution (taking its kurtosis as zero, i.e. excess kurtosis), the interpretation of the kurtosis is:
 Kurtosis > 0: peaked relative to the Normal distribution
 Kurtosis < 0: flat relative to the Normal distribution
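
The sign conventions above can be checked numerically. The sketch below assumes the simple moment-based definitions (skewness = m3 / m2^1.5, excess kurtosis = m4 / m2^2 - 3, with moments divided by n); statistical packages often apply bias corrections, so their values may differ slightly. The data sets are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment, dividing by n (population form)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n


def skewness(values):
    """Moment coefficient of skewness: zero for symmetric data."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Kurtosis minus 3, so that a normal distribution scores zero."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]        # hypothetical, evenly spread
right_skewed = [1, 1, 1, 1, 10]    # hypothetical, long tail above the mean
left_skewed = [1, 10, 10, 10, 10]  # hypothetical, long tail below the mean

print(skewness(symmetric), skewness(right_skewed), skewness(left_skewed))
print(excess_kurtosis(symmetric))
```

The evenly spread data give a skewness of zero and a negative excess kurtosis (flatter than normal), while the two lopsided data sets give positive and negative skewness respectively.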
Measure of Central Tendency

 In any research, enormous data is collected and, to describe it meaningfully, one needs to summarise the same.
 The bulkiness of the data can be reduced by organising it into a frequency table or histogram.
 Frequency distribution organises the heap of data into a few meaningful categories.
 Collected data can also be summarised as a single index/value, which represents the entire data.
 These measures may also help in the comparison of data.

Measure of central tendency:
 Defined as “the statistical measure that identifies a single value as representative of an entire distribution”.
 It aims to provide an accurate description of the entire data.
 It is the single value that is most typical/representative of the collected data.
 The mean, median and mode are the three commonly used measures of central tendency.

Mean is generally considered the best measure of central tendency and the most frequently used.
However, there are some situations where the other measures of central tendency are preferred.
Measure of Central Tendency- Types of Averages
 Arithmetic mean (or, simply, “mean”) is nothing but the average. It is computed by adding all the values in the data set divided by the number of
observations in it.

 Median is the value which occupies the middle position when all the observations are arranged in an ascending/descending order. It divides the
frequency distribution exactly into two halves. 50% of observations in a distribution have scores at or below the median. Hence median is the 50th
percentile. Median is also known as ‘positional average’.

 Mode is defined as the value that occurs most frequently in the data. Some data sets do not have a mode because each value occurs only once. On the
other hand, some data sets can have more than one mode. This happens when the data set has two or more values of equal frequency which is greater than
that of any other value. Mode is rarely used as a summary statistic except to describe a bimodal distribution. In a bimodal distribution, the taller peak is
called the major mode and the shorter one is the minor mode.

Median is preferred to mean when:


 There are few extreme scores in the distribution.
 Some scores have undetermined values.
 There is an open ended distribution.
 Data are measured in an ordinal scale.
 Mode is the preferred measure when data are measured in a nominal scale.
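
The three averages can be computed with Python's standard `statistics` module. This is a minimal sketch; the scores are hypothetical and chosen to be bimodal, and `central_tendency` is an illustrative helper name.

```python
import statistics


def central_tendency(values):
    """Return the mean, median and all modes of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # multimode returns a list, so bimodal data show both modes
        "modes": statistics.multimode(values),
    }


scores = [2, 3, 3, 5, 7, 7, 9]  # hypothetical scores with two modes
summary = central_tendency(scores)
print(summary)
```

Note that when every value occurs only once, `statistics.multimode` returns all the values, which corresponds to the case described above where the data set has no single mode.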
Measuring dispersion or variations associated with Central Tendency

1. The range is the distance from the smallest value to the largest. It is not affected by skewness, but is sensitive to the addition or removal of an outlier value. Range = highest value - lowest value.

2. The interquartile range describes the middle 50% of values when ordered from lowest to highest. To find the interquartile range (IQR), first find the median (middle value) of the lower and upper half of the data. These values are quartile 1 (Q1) and quartile 3 (Q3). The IQR is the difference between Q3 and Q1. It is not affected either by outliers or skewness, but it does not use all of the information in the data set since it ignores the bottom and top quarter of values.

3. Standard deviation is a measure which shows how much variation (spread or dispersion) from the mean exists. It indicates a “typical” deviation from the mean. It is the most widely used measure of dispersion and is based on all values. It is the square root of the mean of the squared deviations from the arithmetic mean. Variance and standard deviation are related: the standard deviation is the square root of the variance of the given data set.

4. The mean deviation is defined as a statistical measure that is used to calculate the
average deviation from the mean value of the given data set. It uses absolute
values instead of squares to circumvent the issue of negative differences between
the data points and their means.
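
The measures of dispersion defined above can be computed from first principles. The sketch below uses a hypothetical data set and reports the variance both ways: dividing by n (the "mean of the squared deviations" in the definition above) and dividing by n - 1 (the usual sample estimate). The function name `dispersion` is illustrative.

```python
import math


def dispersion(values):
    """Range, mean deviation, variance and standard deviation from their definitions."""
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]
    sum_sq = sum(d * d for d in deviations)  # sum of squared deviations
    return {
        "range": max(values) - min(values),
        "mean_deviation": sum(abs(d) for d in deviations) / n,
        "population_variance": sum_sq / n,
        "population_sd": math.sqrt(sum_sq / n),
        "sample_variance": sum_sq / (n - 1),
        "sample_sd": math.sqrt(sum_sq / (n - 1)),
    }


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations
result = dispersion(data)
print(result)
```

For these values the mean is 5, the range is 7, the mean deviation is 1.5 and the population standard deviation is 2.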
Population Parameters versus Sample Statistics
The aim of biostatistics is to analyze samples in order to make inferences about the population from which the samples were drawn.
s² = 472.10/9 = 52.46; s = √52.46 = 7.24
Questions

Q: Salary (in K) of students after PG

35, 50,50,50,56,60,60,75,250

Mean= ~76.2 and SD = ~ 62.3

Median=56 and IQR=17.5

Which statistics should be used to describe the distribution below?

a. Mean and IQR

b. Mean and Standard Deviation

c. Median and IQR

d. Median and Standard Deviation
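
The summary figures quoted in this question can be reproduced with the standard `statistics` module. In this sketch, `describe` is an illustrative helper; the quoted SD of about 62.3 matches the population formula (`statistics.pstdev`, dividing by n), whereas the sample formula would give about 66. The second call drops the extreme value of 250 to show how each statistic reacts.

```python
import statistics

salaries = [35, 50, 50, 50, 56, 60, 60, 75, 250]  # salary (in K) from the question


def describe(values):
    """Mean, population SD, median and IQR of a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "sd": statistics.pstdev(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }


with_outlier = describe(salaries)
without_outlier = describe(salaries[:-1])  # same data without the value 250

print(with_outlier)
print(without_outlier)
```

Removing the single extreme salary moves the mean from about 76.2 to 54.5, while the median only moves from 56 to 53.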


CLASS-IV
25 Aug 2023
Probability and its fundamental concepts related to biostatistics

 Probability is the likelihood, or chance, of occurrence of a particular event.
 Experiment refers to an act which can be repeated under some given conditions. Those experiments whose results depend on chance are called random experiments. Outcomes of any experiment are called EVENTS.
 Mutually exclusive (incompatible) events: two events that cannot happen simultaneously in a single trial. Basically, occurrence of any one event precludes the occurrence of the other events.

P(A AND B) = P(A ∩ B)
P(A OR B) = P(A ∪ B)
Playing with cards

How many cards are there in a deck?


1. A card is drawn from a well shuffled pack of 52
cards. Find the probability of:
(i) ‘2’ of spades
(ii) a jack
(iii) a king of red colour
(iv) a card of diamond
(v) a king or a queen
(vi) a non-face card
(vii) a black face card
(viii) a black card
(ix) a non-ace
(x) non-face card of black colour
(xi) neither a spade nor a jack
(xii) neither a heart nor a red king

2. A card is drawn at random from a well-shuffled pack of cards numbered 1 to 20. Find the probability of
(i) getting a number less than 7
(ii) getting a number divisible by 3.

3. A card is drawn at random from a pack of 52 playing cards. Find the probability that the card drawn is
(i) a king
(ii) neither a queen nor a jack.
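
Answers to card questions like these can be checked by listing the deck and counting favourable cards. The sketch below assumes a standard 52-card deck and works through only a few of the events; `probability` and the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["spades", "hearts", "diamonds", "clubs"]
deck = list(product(RANKS, SUITS))  # 52 (rank, suit) pairs


def probability(event):
    """P(event) = favourable outcomes / total outcomes, for equally likely cards."""
    favourable = sum(1 for card in deck if event(card))
    return Fraction(favourable, len(deck))


p_jack = probability(lambda c: c[0] == "J")
p_spade = probability(lambda c: c[1] == "spades")
p_jack_and_spade = probability(lambda c: c[0] == "J" and c[1] == "spades")
p_jack_or_spade = probability(lambda c: c[0] == "J" or c[1] == "spades")
p_king_or_queen = probability(lambda c: c[0] in ("K", "Q"))
p_neither_spade_nor_jack = probability(lambda c: c[0] != "J" and c[1] != "spades")

print(p_jack, p_king_or_queen, p_jack_or_spade, p_neither_spade_nor_jack)
```

Because a jack and a spade are not mutually exclusive (the jack of spades is both), P(jack OR spade) equals P(jack) + P(spade) - P(jack AND spade), whereas for the mutually exclusive events king and queen the probabilities simply add.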
 Independent events: Two or more events are said to be independent when the outcome of one event is not affected by the other outcomes, as they do not affect each other.
 E.g. if a coin is tossed twice, the result of the second throw is not affected by the result of the first throw.
 Dependent events: In these events, occurrence or non-occurrence of one event in any one trial affects the probability of other events in other trials.
 Equally likely events: Events are said to be equally likely when one does not occur more often than the others.

 Single event: we consider the probability of happening or not happening of single events. It deals with elementary events.
 Compound event: we consider the joint occurrence of two or more events. It deals with composite events.
 Exhaustive events: Those events whose totality includes all possible outcomes of a random experiment or sample space.
 Complementary event: Let A & B be two events that are both mutually exclusive and exhaustive. Then A is the complementary event of B (and vice versa).

P(A AND B) = P(A ∩ B)
P(A OR B) = P(A ∪ B)
CLASS-V
30 Aug 2023

None of the students attended the class
CLASS-VI
1st September 2023

Probability Distributions
Experiment, Outcome, and Sample Space
An experiment is a process that, when performed, results in one and only one of many observations. These observations are called the
outcomes of the experiment. The collection of all outcomes for an experiment is called a sample space.

Examples of Experiments, Outcomes, and Sample Spaces

Tree diagram: each outcome is represented by a branch of the tree.
• Tree diagrams help us understand probability concepts by presenting them visually

 Tree diagram for the experiment of tossing a coin once
Sample space = {H, T}, where H = Head and T = Tail

Tree diagram for one toss of a coin.

 Suppose we randomly select two workers from a company and observe whether the worker selected each time is a man or a woman.
Sample space = {MM, MW, WM, WW}
Simple and Compound Events

An event is a collection of one or more of the outcomes of an experiment.

A simple event is also called an elementary event, and a compound event is also called a composite event.

Simple Event An event that includes one and only one of the (final) outcomes for an experiment is called a simple event and is
usually denoted by Ei.

Reconsider previous example: on selecting two workers from a company and observing whether the worker selected each time is a
man or a woman.
Each of the final four outcomes (MM, MW, WM, and WW) for this experiment is a simple event.
These four events can be denoted by E1, E2, E3, and E4, respectively.
Thus, E1 = {MM}, E2 = {MW}, E3 = {WM}, and E4 = {WW}

Compound Event A compound event is a collection of more than one outcome for an experiment.
Reconsider the same example on selecting two workers from a company and observing whether the worker selected each time is a man
or a woman. Let A be the event that at most one man is selected. Is event A a simple or a compound event?

Here at most one man means one or no man is selected.
Thus, event A will occur if either no man or one man is selected.
Hence, the event A is given by
A = at most one man is selected = {MW, WM, WW}
Because event A contains more than one outcome, it is a compound event.
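
The sample space and event in this example can be enumerated programmatically. This is a small sketch in which each outcome is written as a two-letter string; the variable names are illustrative.

```python
from itertools import product

# Each of the two selected workers is either a man (M) or a woman (W).
sample_space = ["".join(pair) for pair in product("MW", repeat=2)]

# Event A: at most one man is selected (no man or exactly one man).
event_a = [outcome for outcome in sample_space if outcome.count("M") <= 1]

# A simple event has exactly one outcome; a compound event has more than one.
is_compound = len(event_a) > 1

print(sample_space)  # ['MM', 'MW', 'WM', 'WW']
print(event_a)       # ['MW', 'WM', 'WW']
print(is_compound)   # True
```

The same pattern, with `repeat` set to the number of selections, lists the sample space for longer experiments of this kind.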
Q: In a group of college students, some like ice tea and others do not. There is no student in this
group who is indifferent or has no opinion. Two students are randomly selected from this group.
(a) How many outcomes are possible? List all the possible outcomes.
(b) Consider the following events. List all the outcomes included in each of these events.
Mention whether each of these events is a simple or a compound event.
(i) Both students like ice tea.
(ii) At most one student likes ice tea.
(iii) At least one student likes ice tea.
(iv) Neither student likes ice tea.

Let L denote the event that a student likes ice tea and N denote the event that a student does not like ice tea.
(a) This experiment has four outcomes:
LL = Both students like ice tea
LN = The first student likes ice tea but the second student does not
NL = The first student does not like ice tea but the second student does
NN = Both students do not like ice tea

(b) (i) The event both students like ice tea will occur if LL happens.
Thus, Both students like ice tea = {LL}
Since this event includes only one of the four outcomes, it is a simple event.
(ii) The event at most one student likes ice tea will occur if one or none of the two students likes ice tea.
Thus, At most one student likes ice tea = {LN, NL, NN}
Since this event includes three outcomes, it is a compound event.
(iii) The event at least one student likes ice tea will occur if one or two of the two students like ice tea.
Thus, At least one student likes ice tea = {LN, NL, LL}
Since this event includes three outcomes, it is a compound event.
(iv) The event neither student likes ice tea will occur if neither of the two students likes ice tea, which will include the event NN.
Thus, Neither student likes ice tea = {NN}
Since this event includes one outcome, it is a simple event.
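The ice tea exercise can also be checked by listing the sample space programmatically. The following is a minimal sketch, assuming outcomes are written as two-letter strings of L and N; the helper name classify is introduced here only for illustration.

```python
from itertools import product

# Each of the two students either likes ice tea (L) or does not (N).
outcomes = ["".join(pair) for pair in product("LN", repeat=2)]

def classify(event):
    """An event with exactly one outcome is simple; with more, it is compound."""
    return "simple" if len(event) == 1 else "compound"

both = [o for o in outcomes if o.count("L") == 2]
at_most_one = [o for o in outcomes if o.count("L") <= 1]
at_least_one = [o for o in outcomes if o.count("L") >= 1]
neither = [o for o in outcomes if o.count("L") == 0]

for name, event in [("both", both), ("at most one", at_most_one),
                    ("at least one", at_least_one), ("neither", neither)]:
    print(name, event, classify(event))
```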
Probability is a numerical measure of the likelihood that a specific event will occur. The probability that a simple event Ei will occur is
denoted by P(Ei), and the probability that a compound event A will occur is denoted by P(A).

Two Properties of Probability


1. The probability of an event always lies in the range 0 to 1.
Whether it is a simple or a compound event, the probability of an event is never less than 0 or greater than 1.
0 ≤ P(Ei) ≤ 1
0 ≤ P(A) ≤ 1

2. The sum of the probabilities of all simple events (or final outcomes) for an experiment, denoted by ΣP(Ei), is always 1.
For an experiment with outcomes E1, E2, E3, . . . ,
ΣP(Ei) = P(E1) + P(E2) + P(E3) + . . . = 1.0

Classical Probability: Equally Likely Outcomes Two or more outcomes that have the same probability of occurrence are said to be
equally likely outcomes.
Marginal probability is the probability of a single event without consideration of any other event.
Example: 15 employees in this group possess two characteristics: “male” and “in favor of paying high salaries to CEOs.”

Suppose one employee is selected at random from these 100 employees. If only one characteristic is considered at a time, the
employee selected can be a male, a female, in favor, or against. The probability of each of these four characteristics or events is
called marginal probability.

The four marginal probabilities are calculated as follows:


P(male) = Number of males/ Total number of employees = 60/100 = .60
As we can observe, the probability that a male will be selected is
obtained by dividing the total of the row labeled “Male” (60) by the
grand total (100). Similarly,
P(female) = 40/100 = .40
P(in favor) = 19/100 = .19
P(against) = 81/100 = .81
Conditional probability is the probability that an event will occur given that another event has already occurred. If A
and B are two events, then the conditional probability of A given B is written as P(A ∣ B)
and read as “the probability of A given that B has already occurred.”

Now suppose that one employee is selected at random from these 100 employees. Furthermore, assume it is
known that this (selected) employee is a male. In other words, the event that the employee selected is a male
has already occurred. Given that this selected employee is a male, he can be in favor or against. What is the
probability that the employee selected is in favor of paying high salaries to CEOs?
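The conditional probability asked for here can be worked out from the counts in the example. This is a sketch only: the text gives the totals (60 males, 40 females, 19 in favor, 81 against) and the 15 employees who are male and in favor, so the remaining cell counts below are derived by subtraction and are an assumption of this sketch.

```python
# Counts for the 100 employees. Only the row/column totals and the 15
# "male and in favor" come from the text; other cells are derived by subtraction.
counts = {
    ("male", "in favor"): 15,
    ("male", "against"): 60 - 15,
    ("female", "in favor"): 19 - 15,
    ("female", "against"): 40 - (19 - 15),
}

total = sum(counts.values())
males = sum(n for (sex, _), n in counts.items() if sex == "male")

# Marginal probability: one characteristic considered on its own.
p_male = males / total

# Conditional probability: restrict attention to the males only.
p_in_favor_given_male = counts[("male", "in favor")] / males

print(p_male, p_in_favor_given_male)
```

Restricting the denominator to the 60 males is what distinguishes P(in favor | male) from the marginal probability P(in favor) = 19/100.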
Mutually Exclusive Events : Events that cannot occur together are said to be mutually exclusive events.
Consider the following events for one roll of a die:
A = an even number is observed = {2, 4, 6}
B = an odd number is observed = {1, 3, 5}
C = a number less than 5 is observed = {1, 2, 3, 4}
Are events A and B mutually exclusive? Are events A and C mutually exclusive?
Mutually exclusive events A and B.
Mutually nonexclusive events A and C.
Independent Events Two events are said to be independent if the occurrence of one event does not affect the probability of the
occurrence of the other event. In other words, A and B are independent events if either P(A ∣ B) = P(A) or P(B ∣ A) = P(B)
Complementary Events: The complement of event A, denoted by Ā and read as “A bar” or “A complement,” is the event that
includes all the outcomes for an experiment that are not in A.
Events A and Ā are complements of each other.
Because two complementary events, taken together, include all the outcomes for an experiment and
because the sum of the probabilities of all outcomes is 1, it is obvious that
P(A) + P(Ā) = 1.0

Question: Of the 500 adults, 325 drink coffee with sugar. Suppose one adult is randomly selected from these 500 adults, and let A be the event that this adult drinks coffee with sugar. What is the complement of event A? What are the probabilities of the two events?
Theorem of probability

Joint Probability: The probability of the intersection of two events is called their joint probability.
It is written as P(A and B).
Union of Events : Let A and B be two events defined in a sample space. The union of events A and B is the collection of
all outcomes that belong either to A or to B or to both A and B and is
denoted by (A or B); (A ∪ B)
Counting Rule, Factorials
Counting Rule: to Find Total Outcomes If an experiment consists of three steps, and if the first step can result in m
outcomes, the second step in n outcomes, and the third step in k outcomes, then
Total outcomes for the experiment = m · n · k
Example: Consider three tosses of a coin. How many total outcomes does this experiment have?
Solution: This experiment consists of tossing a coin three times, so it has three steps. Each step has two outcomes: a head and a tail. Thus,
Total outcomes for three tosses of a coin = 2 × 2 × 2 = 8
The eight outcomes for this experiment are HHH, HHT, HTH, HTT, THH, THT, TTH, and TTT.

The symbol ! (read as factorial) is used to denote factorials. The value of the factorial of a number is obtained by
multiplying all the integers from that number to 1. For example, 7! is read as “seven factorial” and is evaluated by
multiplying all the integers from 7 to 1.

Example: Evaluate 7!.


Solution To evaluate 7!, we multiply all the
integers from 7 to 1.
7!= 7 ⋅ 6 ⋅ 5 ⋅ 4 ⋅ 3 ⋅ 2 ⋅ 1 = 5040
Thus, the value of 7! is 5040.
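The counting rule and the factorial can both be verified with the Python standard library. The sketch below lists the outcomes of three coin tosses with itertools.product and evaluates 7! with math.factorial; the variable names are illustrative.

```python
import math
from itertools import product

# Counting rule: three steps with two outcomes each give 2 * 2 * 2 outcomes.
total_outcomes = 2 * 2 * 2

# Enumerate the outcomes explicitly to confirm the count.
tosses = ["".join(t) for t in product("HT", repeat=3)]

# Factorial: product of all integers from 7 down to 1.
seven_factorial = math.factorial(7)

print(total_outcomes, tosses, seven_factorial)
```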
Discrete Random Variables and Their Probability Distributions
Random Variable: A random variable is a variable whose value is determined by the outcome of a random experiment.

Discrete Random Variable A random variable that assumes countable values is called a discrete random variable.
Examples of discrete random variables
• The number of houses in a certain block
• The number of customers who visit a bank during any given hour
• The number of complaints received at the office of an airline on a given day

Continuous Random Variable A random variable that can assume any value contained in one or more intervals is called
a continuous random variable.

Examples of continuous random variables


• The length of a room
• The time taken to commute from home to work
• The weight of a letter
• The price of a house
Probability Distribution of a Discrete Random Variable:
 Binomial Probability Distribution
 Poisson Probability Distribution
 Hypergeometric Probability Distribution

The Binomial Probability Distribution

• It is one of the most widely used discrete probability distributions.
• The random variable x must be a discrete dichotomous random variable.
• Each repetition of a binomial experiment is called a trial or a Bernoulli trial.
• For example, if an experiment is defined as one toss of a coin and this experiment is repeated 10 times, then
each repetition (toss) is called a trial. Consequently, there are 10 total trials for this experiment.
An experiment that satisfies the following four conditions is called a binomial experiment.
1. There are n identical trials. In other words, the given experiment is repeated n times, where n is a positive integer. All of these
repetitions are performed under identical conditions.
2. Each trial has two and only two outcomes. These outcomes are usually called a success and a failure, respectively. In case there are
more than two outcomes for an experiment, we can combine outcomes into two events and then apply binomial probability
distribution.
3. The probability of success is denoted by p and that of failure by q, and p + q = 1. The probabilities p and q remain constant for each
trial.
4. The trials are independent. In other words, the outcome of one trial does not affect the outcome of another trial.

Note: Success does not mean favorable or desirable outcome and a failure does not refer to an unfavorable or undesirable outcome.
The outcome to which the question refers is usually called a success; the outcome to which it does not refer is called a failure.
Example: Seventy five percent of students at a college with a large student population use the social media site Instagram. Three
students are randomly selected from this college. What is the probability that exactly two of these three students use Instagram?
n = total number of trials = number of students selected = 3
x = number of successes = number of students in three who use Instagram = 2
p = probability of success = probability that a student uses Instagram = .75
n − x = number of failures = number of students not using Instagram = 3 − 2 = 1
q = probability of failure = probability that a student does not use Instagram = 1 − .75 = .25
The probability of two successes is denoted by P(x = 2) or simply by P(2). Substituting all of the values in the binomial formula
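The substitution can be sketched in code using the standard binomial formula P(x) = nCx · p^x · q^(n − x), where nCx is the number of ways to choose x successes out of n trials. The function name binomial_pmf is introduced here for illustration.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)

# Instagram example: n = 3 students, x = 2 successes, p = .75
p_two = binomial_pmf(2, 3, 0.75)
print(round(p_two, 4))
```

Here 3C2 = 3, so the probability is 3 × (.75)² × (.25), which rounds to .4219.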
Mean and Standard Deviation of the Binomial Distribution

U.S. Adults with No Religious Affiliation


Example: According to a Pew Research Center survey released on May 12, 2015, 22.8% of U.S. adults do not have a
religious affiliation. Assume that this result is true for the current population of U.S. adults. A sample of 50 U.S. adults is
randomly selected. Let x be the number of adults in this sample who do not have a religious affiliation. Find the mean
and standard deviation of the probability distribution of x.
Sol: This is a binomial experiment with a total of 50 trials. Each trial has two outcomes: (1) the selected adult does not
have a religious affiliation, (2) the selected adult has a religious affiliation or does not want to answer the question.
The probabilities p and q for these two outcomes are .228 and .772, respectively.
n = 50, p = .228 and q = .772
Mean and standard deviation of the binomial distribution: μ = np = 50(.228) = 11.4
σ = √(npq) = √((50)(.228)(.772)) = 2.9666

The value of the mean is what we expect to obtain, on average, per repetition of the experiment. In this example, if
we select many samples of 50 U.S. adults each, we expect that each sample will contain an average of 11.4 adults,
with a standard deviation of 2.9666, who will not have a religious affiliation.
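The same mean and standard deviation can be reproduced with a few lines of code. This is a minimal sketch of μ = np and σ = √(npq) using the values from the example.

```python
from math import sqrt

n, p = 50, 0.228      # sample size and probability of success from the example
q = 1 - p             # probability of failure

mean = n * p                  # mu = np
std_dev = sqrt(n * p * q)     # sigma = sqrt(npq)

print(round(mean, 1), round(std_dev, 4))
```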
The Hypergeometric Probability Distribution

If the trials are not independent, we cannot apply the binomial probability distribution to find the probability of x
successes in n trials. In such cases we replace the binomial probability distribution by the hypergeometric probability
distribution

Defective Auto Parts in a Shipment


Example: Brown Manufacturing makes auto parts that are sold to auto dealers. Last week the company shipped 25 auto
parts to a dealer. Later, it found out that 5 of those parts were defective. By the time the company manager contacted
the dealer, 4 auto parts from that shipment had already been sold. What is the probability that 3 of those 4 parts were
good parts and 1 was defective?
Solution Let a good part be called a success and a defective part be called a failure. From the given information,
N = total number of elements (auto parts) in the population = 25
r = number of successes (good parts) in the population = 20
N − r = number of failures (defective parts) in the population = 5
n = number of trials (sample size) = 4
x = number of successes in four trials = 3
n − x = number of failures in four trials = 1
Using the hypergeometric formula, we calculate the required probability as follows:

Thus, the probability that 3 of the 4 parts sold are good and 1 is defective is .4506.
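The calculation can be sketched with the standard hypergeometric formula P(x) = (rCx · (N − r)C(n − x)) / NCn, using the symbols defined above. The function name hypergeom_pmf is illustrative.

```python
from math import comb

def hypergeom_pmf(x, N, r, n):
    """Probability of x successes in n draws without replacement
    from a population of N elements containing r successes."""
    return comb(r, x) * comb(N - r, n - x) / comb(N, n)

# Auto parts example: N = 25, r = 20 good parts, n = 4 sold, x = 3 good
p_three_good = hypergeom_pmf(3, 25, 20, 4)
print(round(p_three_good, 4))
```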
The Poisson Probability Distribution:
If the average number of occurrences for a given interval is known, then by using the Poisson probability distribution, we can compute the
probability of a certain number of occurrences, x, in that interval.
Conditions to Apply the Poisson Probability Distribution: The following three conditions must be satisfied to apply the Poisson
probability distribution.
1. x is a discrete random variable.
2. The occurrences are random.
3. The occurrences are independent.
The following examples also qualify for the application of the Poisson probability distribution.
1. The number of accidents that occur on a given highway during a 1-week period
2. The number of customers entering a grocery store during a 1-hour interval
3. The number of television sets sold at a department store during a given week

Mean and Standard Deviation of the Poisson Probability Distribution


For the Poisson probability distribution, the mean and variance both are equal to λ, and the standard
deviation is equal to √λ. That is, for the Poisson probability distribution,
μ = λ, σ² = λ, and σ = √λ
Washing Machine Breakdowns
A washing machine in a laundromat breaks down an average of three times per month. Using the Poisson
probability distribution formula, find the probability that during the next month this machine will have
(a) exactly two breakdowns
(b) at most one breakdown
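A sketch of the washing machine calculation follows, using the standard Poisson formula P(x) = λ^x · e^(−λ) / x! with λ = 3 breakdowns per month. The function name poisson_pmf is illustrative.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Probability of exactly x occurrences when the mean number is lam."""
    return lam**x * exp(-lam) / factorial(x)

lam = 3  # average breakdowns per month

# (a) exactly two breakdowns
p_exactly_two = poisson_pmf(2, lam)

# (b) at most one breakdown = P(0) + P(1)
p_at_most_one = poisson_pmf(0, lam) + poisson_pmf(1, lam)

print(round(p_exactly_two, 4), round(p_at_most_one, 4))
```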
Continuous Random Variables and the Normal Distribution

Equation of the normal distribution is
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))

Total area under a normal curve.
A normal curve is symmetric about the mean.
Areas of the normal curve beyond μ ± 3σ.
The standard normal distribution is a special case of the normal distribution. For the standard normal distribution, the
value of the mean is equal to zero and the value of the standard deviation is equal to 1.
z Values or z Scores The units marked on the horizontal axis of the standard normal curve are denoted by z and are called
the z values or z scores. A specific value of z gives the distance between the mean and the point represented by z in
terms of the standard deviation. For example, a point with a value of z = 2 is two standard deviations to the right of the
mean. Similarly, a point with a value of z = −2 is two standard deviations to the left of the mean.
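The z score and the areas under the standard normal curve can be explored with NormalDist from the Python standard library (Python 3.8 or later). This is a sketch; the helper z_score and the sample values passed to it are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)   # standard normal distribution

def z_score(x, mu, sigma):
    """Distance of x from the mean, in units of the standard deviation."""
    return (x - mu) / sigma

# Hypothetical value: x = 70 from a population with mean 50 and sd 10
z = z_score(70, 50, 10)

# Area under the curve within three standard deviations of the mean
within_3_sd = std_normal.cdf(3) - std_normal.cdf(-3)
beyond_3_sd = 1 - within_3_sd

print(z, round(within_3_sd, 4), round(beyond_3_sd, 4))
```

The area beyond μ ± 3σ is very small, which is why the normal curve is usually drawn only between μ − 3σ and μ + 3σ.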
CLASS-VII
6th September 2023

Hypothesis Testing
Hypothesis testing and its application in biostatistics

• Hypothesis testing deals with statistical inference.
• Statistical inference is the branch of statistics concerned with using probability concepts to deal with uncertainty in decision making.
• Basically, we use sample statistics to draw an inference about a population parameter.
• Statistical inference deals with two different classes: hypothesis testing and estimation (point or interval estimates).

• The assignment of value(s) to a population parameter based on a value of the corresponding sample statistic is called
estimation.
Hypothesis testing

• Hypothesis testing begins with an assumption, called a hypothesis, that we make about a population parameter.
• Hypothesis testing is a statistical method used to determine whether there is enough evidence in sample data to draw conclusions about a
population.
• Testing a hypothesis (or test of significance) is a process of testing the significance of a claim about the parameters of the population, on the
basis of a sample.
• Purpose: to help the researcher reach a conclusion about the population by examining a sample from that population.
• Hypothesis: a statement about one or more populations that deals with the parameters of those populations. Via hypothesis testing, we can decide
whether or not such statements are compatible with the available data.
Statistical Hypothesis testing
Hypothesis is stated in such a way that it may be
evaluated by appropriate statistical techniques.
Steps in statistical hypothesis testing
1. Test statistic
2. Decision rule
3. Significance levels
4a. Statistical decision
4b. Statistical decision based on p value


Significance levels and errors
Statistical error based on significance levels
Two-tailed test: rejection region on both sides
One-tailed test: rejection region on one side
Reduce the rejection area (use a smaller significance level) to lower the risk of a type 1 error
Estimate

The estimation procedure involves the following steps.


1. Select a sample.
2. Collect the required information from the members of the sample.
3. Calculate the value of the sample statistic.
4. Assign value(s) to the corresponding population parameter.

Point estimate of a population parameter = Value of the corresponding sample statistic


Interval estimation

Then we state that the interval $2630 to $3310 is likely to contain the population mean, μ, and that the mean housing
expenditure per month for all households in the United States is between $2630 and $3310.

The value $2630 is called the lower limit of the interval, and $3310 is called the upper limit of the interval.

The number we add to and subtract from the point estimate is called the margin of error or the maximum error of the
estimate.
Estimation of a Population Mean: σ Known

Confidence interval for the population mean μ when the population standard deviation σ is known:
x̄ ± zσx̄, where σx̄ = σ/√n

The value zσx̄ in the confidence interval formula is called the margin of error and is denoted by E.
A publishing company has just published a new college textbook. Before the company decides the price at which to sell this
textbook, it wants to know the average price of all such textbooks in the market. The research department at the company
took a random sample of 25 comparable textbooks and collected information on their prices. This information produced a
mean price of $145 for this sample. It is known that the standard deviation of the prices of all such textbooks is $35 and the
population distribution of such prices is approximately normal.
(a) What is the point estimate of the mean price of all such college textbooks?
(b) Construct a 90% confidence interval for the mean price of all such college textbooks.
n = 25, x̄ = $145, and σ = $35
(a) The point estimate of the mean price of all such college textbooks is $145; that is, Point estimate of μ = x̄ = $145

(b) The confidence level is 90%, or .90. The standard deviation of the sample mean is σx̄ = σ/√n = 35/√25 = $7.00, and the z value for a 90% confidence level is approximately 1.65.

Next, we substitute all the values in the confidence interval formula for μ.
The 90% confidence interval for μ is x̄ ± zσx̄ = 145 ± 1.65(7.00) = 145 ± 11.55 = (145 − 11.55) to (145 + 11.55) =
$133.45 to $156.55

Thus, we are 90% confident that the mean price of all such college textbooks is between $133.45 and $156.55.
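The textbook price interval can be reproduced step by step. The sketch below uses the rounded z value of 1.65 from the solution; a more precise z value would change the limits slightly.

```python
from math import sqrt

n, x_bar, sigma = 25, 145.0, 35.0   # sample size, sample mean, population sd
z = 1.65                            # z value used in the solution for 90% confidence

std_error = sigma / sqrt(n)         # standard deviation of the sample mean
margin_of_error = z * std_error     # E, the margin of error

lower = x_bar - margin_of_error
upper = x_bar + margin_of_error

print(round(lower, 2), round(upper, 2))
```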
CLASS-VIII
13th September 2023

Statistical Tests
Parametric and Non parametric test
Parametric test: Assumes the data is of sufficient “quality”.
• The results can be misleading if the assumptions are wrong.
• Quality is defined in terms of certain properties of data.
Assumptions:
• Random independent samples
• Interval or ratio level measurement
• Normally distributed
• No outliers
• Homogeneity of variance
• Sample size larger than the minimum of many non parametric tests
Tests:
More power to detect a difference that truly exists.
1. Large sample test
• Z test
2. Small sample test
• T test
  • Independent/unpaired T test
  • Paired T test
• ANOVA (Analysis of Variance)
  • One Way ANOVA
  • Two Way ANOVA
3. Pearson Correlation

Non-parametric test: Distribution free test; can be used when data is not of sufficient quality to satisfy the assumptions of a parametric test.
• Used for skewed data
• Median is used
Distribution can be checked with:
• Histogram
• Kolmogorov Smirnov and Shapiro Wilk test
Tests:
1. Mann Whitney U test/Wilcoxon rank sum test (Independent)
2. Wilcoxon Signed rank test (Dependent)
3. Kruskal Wallis Test (more than 2 groups)
4. Fisher Exact Test
5. Chi Square test
6. Spearman Correlation
Z test
• Use for testing the mean of a population versus a standard
OR
• a statistical test to determine whether two population means are different when the variances are known
and the sample size is large (n>=30)
• can be performed on one sample, two samples, or on proportions for hypothesis testing.
• A z-statistic, or z-score, is a number representing the result from the z-test.

Example: 1. Comparing the average salaries of men versus women after M.Sc. degree.
2. Comparing the fraction defectives from two production lines.

One-Sample Z Test
A one-sample z test is used to check if there is a difference between the sample mean and the population
mean when the population standard deviation is known.
Formula: z = (x̄ − μ) / (σ/√n)
x̄ is the sample mean,
μ is the population mean,
σ is the population standard deviation,
n is the sample size.

Left Tailed Test:
Null Hypothesis: H0: μ = μ0
Alternate Hypothesis: H1: μ < μ0
Decision Criteria: If the z statistic < z critical value then reject the null hypothesis.

Right Tailed Test:
Null Hypothesis: H0: μ = μ0
Alternate Hypothesis: H1: μ > μ0
Decision Criteria: If the z statistic > z critical value then reject the null hypothesis.

Two Tailed Test:
Null Hypothesis: H0: μ = μ0
Alternate Hypothesis: H1: μ ≠ μ0
Decision Criteria: If |z statistic| > z critical value then reject the null hypothesis.
Example: A doctor claims that a particular hospital contains more than 100 diabetes patients with a sugar level of more
than 234.
To verify the claim, a random test was conducted on 90 diabetes patients. The test resulted in a mean blood sugar level of
279. In addition, the test resulted in a standard deviation of 18.
Here, the threshold (critical value) for the z statistic is set at 22.50.
A z-test has three main steps:
1. Identifying the null and alternate hypotheses.
2. Measuring the statistical significance (computing the z statistic).
3. Comparing the z score with the critical value. Based on the comparison, the null hypothesis is either rejected or
not rejected.

• Thus, the null hypothesis, H0: µ = 234
• The alternative hypothesis, Ha: µ > 234
• Now we substitute the given values
Z = (279 − 234) / (18/√90)
Z = 45 / (18/9.487)
Z = 45 / 1.897
Z = 23.72

Finally, the z-score (23.72) is compared with the critical value.
22.50 < 23.72; the null hypothesis is rejected, which supports the doctor’s claim.
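The one-sample z statistic for this example can be computed directly from z = (x̄ − μ0) / (σ/√n). The sketch below also looks up a right-tailed critical value from the standard normal distribution; the choice of α = 0.05 is an assumption made for illustration and is not part of the example above.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z(x_bar, mu0, sigma, n):
    """z statistic for a sample mean against a hypothesised mean mu0."""
    return (x_bar - mu0) / (sigma / sqrt(n))

# Values from the example: sample mean 279, hypothesised mean 234, sd 18, n = 90
z = one_sample_z(279, 234, 18, 90)

# Assumed significance level for illustration only (right-tailed test)
alpha = 0.05
z_critical = NormalDist().inv_cdf(1 - alpha)

reject_h0 = z > z_critical
print(round(z, 2), round(z_critical, 3), reject_h0)
```

A critical value comes from the chosen significance level and the standard normal distribution, so it is the z statistic, not α itself, that is compared with it.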
Two Sample Z Test: compares the means of two samples.
• The null hypothesis is given as H0: μ1 = μ2
z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2)
x̄1 and σ1² are the sample mean and population variance, respectively, for the first sample of size n1.
x̄2 and σ2² are the sample mean and population variance, respectively, for the second sample of size n2.

One Proportion Z Test: A one proportion z test is used when there are two groups and compares the value of an observed
proportion to a theoretical one.
z = (p − p0) / √(p0(1 − p0)/n)
p is the observed value of the proportion, p0 is the theoretical
proportion value and n is the sample size.

Two Proportion Z Test: A two proportion z test is conducted on two proportions to check if they are the same or not.
z = (p1 − p2) / √(p(1 − p)(1/n1 + 1/n2)), where p = (x1 + x2) / (n1 + n2)
p1 is the proportion of sample 1 with sample size n1 and x1 number of successes.
p2 is the proportion of sample 2 with sample size n2 and x2 number of successes.
Example: To determine whether two drugs affected human mental concentration equally, 50 students were given one drug and 50 others
the second drug. All the students were then given an examination to measure their mental concentration index. The mean scores for the
two groups were 65 and 70, and the respective standard deviations were 15 and 18. Is there sufficient evidence to suggest that the drugs
affected mental concentration differently?

1. State the null and alternate hypothesis


The null hypothesis states that there is no difference between the effect of two drugs on mental concentration.
The alternative hypothesis states that there is a difference between the effect of two drugs on mental concentration.
2. State the level of significance
It is assumed that the level of significance is 0.05
3. State the number of tails: this should be determined before data collection.
As the effects of the two drugs on mental concentration may differ in either direction, we have to use a two tailed test.
4. Select the appropriate statistical test
Two tailed z test for two independent samples. As each sample size is more than 30, the z test is the most suitable statistical method.
5. Perform the statistical analysis
Find the critical value of the z statistic: for a two tailed test at the 0.05 level of significance, the critical value is ±1.96.
Calculate the z statistic:
1. Calculate the mean and standard deviation for each population.
Then z = (65 − 70) / √(15²/50 + 18²/50) = −5 / 3.31 = −1.51

6. Decision
The observed z value (−1.51) lies between −1.96 and +1.96, so it does not fall in the rejection region. Therefore, there is no reason to reject H0.
The evidence is not sufficient to conclude that the two drugs affect mental concentration differently.
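The drug example can be checked with a short calculation of the two-sample z statistic, z = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2). This is a sketch in which the sample standard deviations stand in for the population values, which is the usual large-sample approximation; the function name two_sample_z is illustrative.

```python
from math import sqrt

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """z statistic for the difference between two independent sample means."""
    std_error = sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (mean1 - mean2) / std_error

# Drug example: means 65 and 70, standard deviations 15 and 18, 50 students each
z = two_sample_z(65, 15, 50, 70, 18, 50)

z_critical = 1.96                 # two-tailed test at the 0.05 level
reject_h0 = abs(z) > z_critical   # reject only if z falls in either tail

print(round(z, 2), reject_h0)
```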
Example: A company wants to improve the quality of products by reducing defects and monitoring the efficiency of assembly lines. In
assembly line A, there were 18 defects reported out of 200 samples while in line B, 25 defects out of 600 samples were noted. Is there a
difference in the procedures at a 0.05 alpha level?

Solution: This is an example of a two-tailed two proportion z test.


H0: The two proportions are the same.
H1: The two proportions are not the same.
As this is a two-tailed test the alpha level needs to be divided by 2 to get 0.025.
Using this, the critical value from the z table is 1.96.
n1 = 200, n2 = 600
p1 = 18 / 200 = 0.09
p2 = 25 / 600 = 0.0417
p = (18 + 25) / (200 + 600) = 0.0538
z = (p1 − p2) / √(p(1 − p)(1/n1 + 1/n2)) = 2.62

As 2.62 > 1.96, the null hypothesis is rejected and it is concluded that there is a significant difference between the two
lines.

Decision: Reject the null hypothesis
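The assembly line result can be reproduced with the pooled two-proportion formula z = (p1 − p2) / √(p(1 − p)(1/n1 + 1/n2)), where p is the pooled proportion of defects. A minimal sketch, with an illustrative function name:

```python
from math import sqrt

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for the difference between two sample proportions,
    using the pooled proportion under H0: p1 = p2."""
    p1, p2 = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    std_error = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / std_error

# Line A: 18 defects in 200 samples; line B: 25 defects in 600 samples
z = two_proportion_z(18, 200, 25, 600)
reject_h0 = abs(z) > 1.96   # two-tailed test at alpha = 0.05

print(round(z, 2), reject_h0)
```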


T test:
Properties of t distribution
• It has mean 0.
• It has variance greater than 1.
• It is a bell shaped distribution, symmetrical about the mean.
Assumptions for T test:
• Sample must consist of random, independent observations.
• Population standard deviation is not known.
• The population is normally distributed.
Uses of T test:
 To compare the difference of mean between two groups
Types:
Paired T test: compares the means of two measurements taken from the same individual, object, or related units.
"paired" measurements: A measurement taken at two different times (e.g., pre-test and post-test score with an intervention
administered between the two time points)

Example: subjects are tested prior to a treatment, say for high sugar levels, and the same subjects are tested again after
treatment with a blood-sugar-lowering medication.

Unpaired/ independent T test: examines the averages/means of two independent or unrelated groups to see if there is a
statistically significant difference between them.

Example: 1. Compare the heights of girls and boys.
2. Compare two stress reduction interventions, where one group practiced mindfulness meditation while the other group learned
progressive muscle relaxation.
Example: An instructor gives students an exam and the next day gives students a different exam on the same material. The
instructor wants to know if the two exams are equally difficult. We calculate the difference in exam scores for each
student. We test if the mean difference is zero or not.
Step 8: Compare the t statistic value with the critical value.
2.74 > 2.28
So we can reject the null hypothesis that there is no difference between means.
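The paired t statistic is the mean of the per-student differences divided by its standard error, t = d̄ / (s_d/√n), with n − 1 degrees of freedom. The sketch below illustrates the calculation only: the exam scores are hypothetical and do not reproduce the 2.74 obtained above.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(first, second):
    """t statistic and degrees of freedom for paired measurements."""
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    std_error = stdev(diffs) / sqrt(n)   # stdev uses the n - 1 denominator
    return mean(diffs) / std_error, n - 1

# Hypothetical scores for five students on the two exams
exam1 = [60, 62, 65, 70, 71]
exam2 = [61, 64, 68, 74, 76]

t_stat, df = paired_t(exam1, exam2)
print(round(t_stat, 4), df)
```

The resulting t value would then be compared with the critical value from a t table for the chosen significance level and df degrees of freedom.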
In t test, the degree of freedom is given by _______.
a. N − 1 b. N1 − N2 + 2 c. N1 + N2 − 2 d. N1 + N2 + 2

A critical region defined by z statistic is _______ region for null hypothesis.
a. acceptance b. rejection c. no relation d. can’t say
Analysis of Variance (ANOVA)
A statistical formula used to compare variances across the means (or average) of more than two groups.
Compare multiple groups at a time.
Assumptions of ANOVA
• Independence of observations: data should be collected using a statistically valid sampling method, and there should be no hidden relationships
among observations.
• Normally-distributed response variable: The values of the dependent variable follow a normal distribution.
• Homogeneity of variance (Homoscedasticity): The variation within each group being compared is similar for every group.
Types:
One way ANOVA: compares three or more different groups when data are categorized in one way.
When to use a one way ANOVA: one categorical independent variable and one quantitative dependent variable. The independent
variable should have at least three levels (i.e. at least three different groups or categories).
Example:
• Your independent variable is social media use, and you assign groups to low, medium, and high levels of social media use to find out
if there is a difference in hours of sleep per night.
• Your independent variable is brand of soda, and you collect data on Coke, Pepsi, Sprite, and Fanta to find out if there is a difference
in the price per 100ml.
• Compare cortisol levels in MDD patients across 3 severity groups (mild, moderate, severe).
• Compare the effect of three different types of fertilizer mixtures (1, 2, 3) on crop yield.

 The null hypothesis (H0) of ANOVA is that there is no difference among group means.
 The alternative hypothesis (Ha) is that at least one group differs significantly from the overall mean of the dependent variable.
 If any of the group means is significantly different from the overall mean, then the null hypothesis is rejected.
 ANOVA uses the F test for statistical significance.
 The F test compares the variance in each group mean from the overall group variance. If the variance within groups is smaller than
the variance between groups, the F test will find a higher F value, and therefore a higher likelihood that the difference observed is
real and not due to chance.
Two way ANOVA: used to determine the effect of two nominal/categorical predictor/independent variables on a continuous
outcome/dependent variable.
 Both of your independent variables should be categorical. If one of your independent variables is categorical and one is quantitative,
use an ANCOVA instead.

ANOVA tests for significance using the F test for statistical significance.

Example: You are researching which type of fertilizer and planting density produces the greatest crop yield in a field experiment. You
assign different plots in a field to a combination of fertilizer type (1, 2, or 3) and planting density (1=low density, 2=high density), and
measure the final crop yield in bushels per acre at harvest time.
A bushel is a unit of volume that can be used to measure the amount of a crop that has been harvested. A bushel is equal to 8 imperial gallons, or 36.37 metric
liters in volume.
You can use a two-way ANOVA to find out if fertilizer type and planting density have an effect on average crop yield.
ANOVA formula:
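For a one-way ANOVA, the F statistic is the ratio of the mean square between groups to the mean square within groups: F = MSB / MSW, where MSB = SSB / (k − 1) and MSW = SSW / (N − k) for k groups and N observations in total. The sketch below is an illustration with hypothetical crop yields, not data from the course.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA on a list of groups of observations."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n_total = len(groups), len(all_values)

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical yields for fertilizer types 1, 2 and 3
yields = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_value = one_way_anova_f(yields)
print(f_value)
```

A larger F value means the group means differ more than the variation within groups would suggest; the value is compared with a critical F for (k − 1, N − k) degrees of freedom.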
Correlation: The relationship between two metric continuous variables is called correlation.
Correlation tests check whether variables are related without hypothesizing a cause-and-effect relationship.
Pearson Correlation Coefficient
It is a common way of measuring linear correlation. The coefficient is a number between −1 and 1 that describes the strength and
direction of the relationship between two variables. A positive coefficient means the two variables change in the same direction; a
negative coefficient means they change in opposite directions.

Q: Samples of drug products are stored in their original containers under normal conditions and sampled periodically to analyse the content of the medication. Determine the correlation coefficient.
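The Pearson coefficient can be computed as r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²). The sketch below shows the calculation only; the storage times and potency values are hypothetical and chosen to be perfectly linear so the result is easy to verify by hand.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: storage time in months and drug content in percent
months = [0, 6, 12, 18, 24]
potency = [100, 98, 96, 94, 92]

r = pearson_r(months, potency)
print(r)
```

Because the hypothetical content falls by the same amount every six months, the points lie exactly on a descending line and r is −1; real stability data would give a value somewhere between −1 and 0.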

Interpreting the Pearson correlation coefficient


Regression tests
Regression tests look for cause-and-effect relationships. They can be used to estimate the effect of one or more continuous variables on
another variable.
Non Parametric test: don’t make as many assumptions about the data, and are useful when one or more of the common statistical
assumptions are violated.
1. Mann-Whitney U test: Used to test whether or not the difference between two independent population medians is zero.
• Variables can be either metric or ordinal.
• No requirement as to shape of the distributions.
• This is the non-parametric equivalent of the two-sample t test.
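A minimal sketch of the U statistic, assuming the usual rank-sum formulation with tied values sharing their average rank. The function names and sample values are hypothetical, and the sketch stops at U: the result would still be compared with a critical value or converted to a p-value.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    """Smaller of the two U statistics for independent samples a and b."""
    ranks = average_ranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:len(a)])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    u_b = len(a) * len(b) - u_a
    return min(u_a, u_b)

# Hypothetical measurements from two independent groups
group_a = [3, 5, 8, 9]
group_b = [10, 12, 14, 15, 16]
u = mann_whitney_u(group_a, group_b)
print("U =", u)
```

A small U means the two samples barely overlap in rank, which is evidence against equal medians.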

2. Wilcoxon signed rank test: Used to test whether or not the difference between two paired population medians is zero.
• Variables can be either metric or ordinal.
• Distributions may be of any shape, but the differences should be distributed symmetrically.
• This is the non-parametric equivalent of the matched-pairs t test.
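The signed rank statistic can be sketched as follows, assuming the common convention of dropping zero differences and ranking the absolute differences. The function names and the before/after values are hypothetical.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def wilcoxon_w(before, after):
    """Smaller of the positive and negative rank sums for paired samples."""
    # Paired differences, with zero differences dropped
    diffs = [b - a for a, b in zip(before, after) if b != a]
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_plus, w_minus)

# Hypothetical paired measurements on the same five subjects
before = [10, 12, 9, 14, 11]
after = [12, 15, 8, 18, 16]
w = wilcoxon_w(before, after)
print("W =", w)
```

As with U, the statistic W is then compared with a tabulated critical value for the number of non-zero pairs.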

3. Kruskal-Wallis test: Used to test whether the medians of three or more independent groups are the same.
• Variables can be either ordinal or metric.
• Distributions may be of any shape, but all need to be similar.
• This is the non-parametric equivalent of the one-way ANOVA.
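A minimal sketch of the H statistic, assuming the standard rank-based formula without a correction for ties. The function names and the three groups of values are hypothetical.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic (no tie correction) for independent groups."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted = 0.0
    start = 0
    for g in groups:
        # Rank sum of this group within the pooled ranking
        rank_sum = sum(ranks[start:start + len(g)])
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)

# Hypothetical measurements from three independent groups
h = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(f"H = {h:.2f}")
```

H is compared with a chi-squared distribution with (number of groups - 1) degrees of freedom.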

4. Chi-squared test: Used to test whether the proportions across a number of categories of two or more independent groups are the same.
• Variables must be categorical.
• The chi-squared test is also a test of the independence of the two variables.
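The statistic can be sketched as the sum of (observed - expected)^2 / expected over all cells of a contingency table, where each expected count is row total × column total / grand total. The function name and the 2 × 2 counts below are hypothetical.

```python
def chi_squared(table):
    """Chi-squared statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical counts: rows are two groups, columns are two outcome categories
stat, dof = chi_squared([[10, 20], [20, 10]])
print(f"chi-squared = {stat:.3f}, df = {dof}")
```

The statistic is then compared with the critical value of the chi-squared distribution for the given degrees of freedom.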

5. Fisher’s Exact test: Used to test whether the proportions in two categories of two independent groups are the same.
• Variables must be categorical.
• This test is an alternative to the 2 × 2 chi-squared test, when cell sizes are too small.
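For a 2 × 2 table the exact p-value can be sketched from the hypergeometric distribution. The version below assumes the common two-sided rule of summing the probabilities of all tables (with the same margins) that are no more likely than the observed one; the function name and counts are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact p-value for a 2 x 2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# Hypothetical small-sample counts
p = fisher_exact_two_sided([[3, 1], [1, 3]])
print(f"p = {p:.4f}")
```

Because the probabilities are computed exactly, no minimum expected cell count is needed, which is why the test suits small samples.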
Choosing a nonparametric test
Statistical tests in hypothesis testing
Practice questions
1. If metric data is skewed, the choice of statistical test should be __________.
a. parametric tests b. non parametric tests c. one sample t test d. z test

2. If the data is paired but skewed, the following test is used:
a. t test b. Paired t test c. Mann Whitney U test d. Wilcoxon signed rank test

3. The statistical test used to test the difference in medians of three independent groups is _____.
a. One way ANOVA b. Kruskal-Wallis test c. Wilcoxon signed rank test d. Friedman rank test

4. The test for knowing association between two metric variables is _________.
a. Pearson correlation b. Spearman correlation c. Phi correlation d. None of above

5. The following test is used to test whether or not the difference between two population means is zero.
a. t test b. Paired t test c. Mann Whitney U test d. One sample t test

6. One sample z test is used to test a single sample when the ______ is known.
a. population variance b. sample variance c. a and b d. none of above

7. If we are testing one sample and sample size is more than 30, which of the following tests can be used?
a. two sample z test b. one sample t test c. one sample z test d. two sample t test

8. If the calculated value of z lies in the critical region then null hypothesis is _____.
a. rejected b. accepted
Practice questions

1. If a distribution is skewed to the left, then it is __________.
a. Negatively skewed
b. Positively skewed
c. Symmetrically skewed
d. Symmetrical

2. If a test was generally very easy, except for a few students who had very low scores, then the distribution of scores would be _____.
a. positively skewed
b. negatively skewed
c. not skewed at all
d. Normal

3. The range of a sample gives an indication of the ________.
a. way in which the values cluster about a particular point
b. number of observations bearing the same value
c. maximum variation in the sample
d. degree to which the mean value differs from its expected value

4. Which range characterizes the interquartile range?
a. From 5th percentile to 95th percentile
b. From 10th percentile to 90th percentile
c. From 25th percentile to 75th percentile
d. From 1 standard deviation below the mean to 1 standard deviation above the mean

5. The measure of dispersion most commonly used in conjunction with the mean is the:
a. interquartile range
b. range
c. standard deviation
d. Variance

6. The probability of an intersection p(A∩B) is easily determined by using ______.
a. Addition theorem
b. Multiplication theorem
c. Bayes' theorem
d. Addition and subtracting theorem

7. The probability of an event that is certain to happen is equal to _______.
a. 100
b. 1
c. 10
d. 0

8. About 99% of the observations lie within _______ standard deviations either side of the mean.
a. 1 b. 2 c. 3 d. 4
CLASS-IX
15th September 2023

DBMS-Part I
Unit 2
• Database concept
• Database Management System
• 2 tier and 3 tier structure
• 3 level Architecture of DBMS
• Keys: Candidate, Primary and foreign
