BRM Unit-1
INTRODUCTION TO STATISTICS
Overview: Statistics is the branch of mathematics that deals with the collection, analysis,
interpretation, presentation, and organization of data. It plays a crucial role in various fields,
providing tools for making informed decisions and drawing meaningful conclusions from
data.
Key Concepts:
1. Descriptive Statistics:
o Measures of Central Tendency: Mean, Median, Mode.
o Measures of Dispersion: Range, Variance, Standard Deviation.
o Percentiles and Quartiles: Understanding data distribution.
2. Probability:
o Basic Probability Concepts: Sample space, events, probability rules.
o Conditional Probability: Probability of an event given another event.
3. Random Variables and Probability Distributions:
o Discrete and Continuous Random Variables: Definition and examples.
o Probability Mass Function (PMF): Probability distribution for discrete
variables.
o Probability Density Function (PDF): Probability distribution for continuous
variables.
o Cumulative Distribution Function (CDF): Probability of a random variable
being less than or equal to a specific value.
4. Sampling and Sampling Distributions:
o Sampling Techniques: Random sampling, stratified sampling, cluster
sampling.
o Sampling Distributions: Understanding the distribution of sample statistics.
5. Inferential Statistics:
o Hypothesis Testing: Formulating and testing hypotheses about population
parameters.
o Confidence Intervals: Estimating the range within which a population
parameter is likely to fall.
o Regression Analysis: Analyzing the relationship between variables.
Applications:
Challenges:
INTRODUCTION TO BIOSTATISTICS
Key Concepts:
1. Descriptive Biostatistics:
o Measures of Central Tendency: Mean, Median, Mode for biological and health
data.
o Measures of Dispersion: Variance, Standard Deviation in the context of life
sciences.
2. Biostatistical Methods for Data Analysis:
o Probability and Distribution: Application of probability concepts to biological
data.
o Hypothesis Testing: Formulating and testing hypotheses in biomedical
research.
o Regression Analysis: Analyzing relationships between variables in a biological
context.
3. Clinical Trials and Experimental Design:
o Randomized Controlled Trials (RCTs): Principles and designs.
o Observational Studies: Understanding cohort studies and case-control studies.
4. Epidemiological Methods:
o Incidence and Prevalence: Measurement and interpretation in epidemiological
studies.
o Risk Factors: Identifying and analyzing risk factors for diseases.
5. Survival Analysis:
o Kaplan-Meier Curve: Estimating the survival function.
o Cox Proportional Hazards Model: Analyzing time-to-event data.
Applications:
Challenges:
Key Concepts:
Applications:
Challenges:
Start with a set of raw data that you want to analyze. For example, consider the following
dataset representing the scores of students in a class:
72, 85, 90, 78, 85, 92, 78, 90, 85, 72, 78, 92, 92, 85, 72
Calculate the range of the data by subtracting the minimum value from the maximum value:
In our example:
Range=92−72=20
Divide the range into intervals (bins). The number of intervals depends on the dataset size. A
common choice is 5-15 intervals. In our example, let's choose intervals of width 5:
Construct a table with columns for class intervals and frequencies. Start by listing the
intervals:
Count the number of data points falling into each interval. Update the frequency column
accordingly:
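The counting step can be sketched in Python. This is a minimal illustration, not the original table: it assumes the last two scores read 85 and 72, that the first class starts at 70, and that classes are inclusive (70-74, 75-79, and so on). The names `frequency_table`, `START`, and `WIDTH` are hypothetical.

```python
from collections import Counter

# Scores from the example; reading the final entries as 85 and 72 is an assumption.
scores = [72, 85, 90, 78, 85, 92, 78, 90, 85, 72, 78, 92, 92, 85, 72]

WIDTH = 5   # class width chosen in the text
START = 70  # assumed lower limit of the first class


def frequency_table(data, start, width):
    """Return (lower, upper, count) rows for inclusive classes lower..upper."""
    # Map each value to the index of the class it falls into, then count.
    counts = Counter((x - start) // width for x in data)
    n_classes = (max(data) - start) // width + 1
    return [
        (start + k * width, start + (k + 1) * width - 1, counts.get(k, 0))
        for k in range(n_classes)
    ]


table = frequency_table(scores, START, WIDTH)
for lower, upper, freq in table:
    print(f"{lower}-{upper}: {freq}")
```

Under these assumptions the frequencies add up to 15, the number of scores, which is a quick check that no value was missed or counted twice.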
In statistics, the mean is a measure of central tendency that represents the average of a set of
values. It is commonly used to describe the center of a distribution and is calculated by
summing all values and dividing by the total number of observations.
Types of Mean:
Properties of Mean:
1. Sensitive to Outliers:
o The mean is influenced by extreme values and may not be representative of
the central tendency in the presence of outliers.
2. Balancing Property:
o The sum of deviations of individual values from the mean is always zero.
3. Unique Mean:
o For a given dataset, there is only one unique mean.
Example:
Mean=(12+15+18+20+22)/5=87/5=17.4
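The same arithmetic, together with the balancing property listed above, can be checked with a short Python sketch (variable names are illustrative):

```python
values = [12, 15, 18, 20, 22]

# Arithmetic mean: sum of the values divided by the number of observations.
mean = sum(values) / len(values)

# Balancing property: deviations from the mean sum to zero
# (up to floating-point rounding).
deviations = [v - mean for v in values]

print(mean)
print(abs(sum(deviations)) < 1e-9)
```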
Applications:
1. Descriptive Statistics:
o Mean is used to describe the central tendency of a dataset.
2. Comparisons:
o It is used to compare different groups or populations.
3. Financial Analysis:
o Mean is widely used in financial analysis to calculate average returns or
prices.
Limitations:
1. Influenced by Outliers:
o Outliers can significantly distort the mean.
2. Not Robust:
o With skewed data the mean is pulled toward the long tail, so it may not represent a
typical value.
To calculate the mean (average) of a frequency distribution, you can use the following
formula:
Mean=∑(Xi⋅fi)/∑fi
where:
Xi = midpoint of the i-th class interval,
fi = frequency of the i-th class interval.
Let's assume you have a frequency distribution table for pharmaceutical sales with intervals
and frequencies:
Interval Frequency
0−9 5
10−19 12
20−29 20
30−39 18
40−49 15
50−59 10
60−69 5
1. Calculate Midpoints:
o Midpoint of 0-9: (0+9)/2=4.5
o Midpoint of 10-19: (10+19)/2=14.5
o Midpoint of 20-29: (20+29)/2=24.5
o ...
o Midpoint of 60-69: (60+69)/2=64.5
2. Calculate Xi⋅fi for each interval:
o 4.5⋅5=22.5
o 14.5⋅12=174
o 24.5⋅20=490
o ...
o 64.5⋅5=322.5
3. Calculate ∑(Xi⋅fi): 22.5+174+490+…+322.5
4. Calculate ∑fi: 5+12+20+…+5
5. Calculate the Mean: Mean=∑(Xi⋅fi)/∑fi
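The five steps can be carried through for the whole table with a few lines of Python. This is a sketch using the interval limits and frequencies listed above; the variable names are illustrative.

```python
# Class limits and frequencies from the pharmaceutical sales table.
intervals = [(0, 9), (10, 19), (20, 29), (30, 39), (40, 49), (50, 59), (60, 69)]
frequencies = [5, 12, 20, 18, 15, 10, 5]

# Step 1: midpoint of each interval.
midpoints = [(lo + hi) / 2 for lo, hi in intervals]

# Steps 2-3: sum of midpoint * frequency.
total_fx = sum(x * f for x, f in zip(midpoints, frequencies))

# Step 4: total frequency.
total_f = sum(frequencies)

# Step 5: grouped mean.
grouped_mean = total_fx / total_f

print(total_fx, total_f, round(grouped_mean, 2))
```

For this table the sums are 2842.5 and 85, so the mean is about 33.44.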
Median:
Definition:
The median is a measure of central tendency that represents the middle value of a dataset
when it is ordered. It divides the dataset into two equal halves.
Calculation:
For an odd number of observations:
1. Arrange Data:
o Order the dataset in ascending or descending order.
2. Identify the Middle Value:
o The middle value is the one at the center when the data is arranged.
Median=Middle Value
For an even number of observations:
1. Arrange Data:
o Order the dataset in ascending or descending order.
2. Calculate Average of Middle Values:
o Find the two middle values and calculate their average.
Median=(Sum of the Two Middle Values)/2
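Both cases, a single middle value or the average of two middle values, fit in one small function. The sketch below uses made-up numbers purely for illustration; `median` is a hypothetical helper, and Python's standard `statistics.median` does the same job.

```python
def median(data):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical datasets, not taken from the notes.
odd_example = median([7, 3, 9, 5, 1])   # ordered: 1, 3, 5, 7, 9
even_example = median([7, 3, 9, 5])     # ordered: 3, 5, 7, 9

print(odd_example, even_example)
```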
Properties:
Example:
Advantages:
Resistant to outliers.
Suitable for ordinal and interval data.
Disadvantages:
Applications:
Problem:
Solution:
To find the median, we need to arrange the data in ascending order and then determine the
middle value. However, in this case, since we have a frequency distribution, we also need to
consider the cumulative frequencies.
Now, looking at the cumulative frequencies, the median class is the first interval whose
cumulative frequency is greater than or equal to N/2=50. This is the interval (40,50].
Now, we can use the following formula to find the median in a grouped frequency
distribution:
Median=L+((N/2−F)/f)⋅w
Where:
L=40 (lower boundary of the median class),
N/2=50 (half the total frequency),
F=30 (cumulative frequency of the class before),
f=20 (frequency of the median class),
w=10 (width of the interval).
Median=40+((50−30)/20)⋅10
Median=40+10
Median=50
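As a check on the substitution, the grouped-median interpolation can be written as a small Python function. This is a sketch: the function name is hypothetical, and the total frequency N=100 is inferred from the threshold of 50 used in the solution.

```python
def grouped_median(L, N, F, f, w):
    """Interpolated median: L + ((N/2 - F) / f) * w.

    L: lower boundary of the median class
    N: total frequency
    F: cumulative frequency of the class before the median class
    f: frequency of the median class
    w: class width
    """
    return L + ((N / 2 - F) / f) * w


# Values from the solution; N = 100 is an inferred assumption.
median_value = grouped_median(L=40, N=100, F=30, f=20, w=10)
print(median_value)
```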
Mode Definition:
Mode: The mode of a dataset is the value or values that appear most frequently.
Types of Modes:
The modes are 44 and 88 because they occur more frequently than the other values.
Applicability: The mode is useful for both numerical and categorical data.
Notation: Denoted by Mo, or M.
Robustness: It's less sensitive to extreme values than the mean.
Uses of Mode:
Caution:
Where:
Problem:
Calculate the modal class and the mode of the pain reduction for patients in this study.
Solution:
The mode is the value or values that occur most frequently in the dataset. In a grouped
frequency distribution, the mode is often associated with the modal class, which is the class
interval with the highest frequency.
Let's calculate the modal class. The highest frequency is associated with the class interval
9−11. Therefore, the modal class is 9−11.
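To estimate the mode more precisely within the modal class, we can use the following
formula:
Mode=L+((f1−f0)/(2f1−f0−f2))⋅w
Where:
L = lower boundary of the modal class (here 9),
f1 = frequency of the modal class,
f0 = frequency of the class before the modal class,
f2 = frequency of the class after the modal class,
w = width of the class interval.
Mode=9+0.857
Mode≈9.857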
Therefore, the modal class is 9−11 and the estimated mode of the pain reduction for patients
in this study is approximately 9.857 on the pain scale.
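The standard interpolation for the mode of grouped data is Mode = L + ((f1 − f0)/(2f1 − f0 − f2))⋅w, and it can be sketched in Python as below. The study's frequency table is not reproduced in these notes, so the frequencies and class width used here are hypothetical, picked only so that the arithmetic lands near the 9.857 quoted above. They are not the study's data.

```python
def grouped_mode(L, f1, f0, f2, w):
    """Interpolated mode: L + ((f1 - f0) / (2*f1 - f0 - f2)) * w.

    L: lower boundary of the modal class
    f1: frequency of the modal class
    f0: frequency of the class before it
    f2: frequency of the class after it
    w: class width
    """
    return L + ((f1 - f0) / (2 * f1 - f0 - f2)) * w


# HYPOTHETICAL frequencies and width, for illustration only.
mode_estimate = grouped_mode(L=9, f1=12, f0=10, f2=7, w=3)
print(round(mode_estimate, 3))
```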
Measures of dispersion are statistical metrics that describe how spread out or dispersed a set
of data points is. They provide important insights into the variability or diversity within a
dataset, complementing measures of central tendency like the mean or median. Here are the
key measures of dispersion:
Range
Definition:
The range is a measure of dispersion that represents the difference between the maximum and
minimum values in a dataset. It provides a simple but informative indicator of the spread or
variability of the data.
Formula:
Range=Maximum Value−Minimum Value
Key Points:
1. Simplicity: The range is easy to calculate and understand, making it a quick measure
of variability.
2. Sensitivity to Outliers: The range is sensitive to extreme values or outliers in the
dataset. A single very high or very low value can significantly impact the range.
3. Does Not Consider Distribution: While the range gives an idea of the spread, it does
not provide information about how the values are distributed within that range.
Example:
Consider the following dataset representing the daily temperatures in degrees Celsius for a
week: 15,18,20,23,14,30,25. The range is calculated as follows:
Range=Maximum Value−Minimum Value
Range=30−14=16
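In Python the same calculation is a single subtraction. The sketch below also appends a made-up outlier to show the sensitivity noted in the key points (the value 45 is hypothetical).

```python
temperatures = [15, 18, 20, 23, 14, 30, 25]

# Range = maximum value - minimum value.
data_range = max(temperatures) - min(temperatures)
print(data_range)

# One extreme reading changes the range substantially (45 is a made-up value).
with_outlier = temperatures + [45]
range_with_outlier = max(with_outlier) - min(with_outlier)
print(range_with_outlier)
```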
Limitations:
Problem:
A pharmaceutical company is conducting a study to analyze the efficacy of a new pain relief
medication. The study measures the reduction in pain intensity (on a scale of 0 to 10) after
patients take the medication for a certain period. The data is grouped into class intervals, and
the frequency distribution is as follows:
Calculate the range of the pain reduction for patients in this study.
Solution:
The range is calculated as the difference between the maximum and minimum values. In a
frequency distribution, we use the boundaries of the extreme classes to determine the
maximum and minimum values.
This means that the observed pain reduction spans a range of 10 units, providing an
indication of the variability in the effectiveness of the medication across different patients.
Standard Deviation
Definition:
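The standard deviation is calculated as the square root of the variance. There are two
formulas, one for a population (σ) and one for a sample (s):
σ=√(∑(Xi−μ)²/N)
s=√(∑(Xi−X̄)²/(n−1))
Where:
Xi = each individual value,
μ = population mean, N = population size,
X̄ = sample mean, n = sample size.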
Key Points:
1. Sensitivity to Outliers:
o Like variance, standard deviation is sensitive to extreme values in the dataset.
2. Interpretability:
o Standard deviation is in the original units of the data, making it more
interpretable than variance.
3. Comparison:
o Allows for the comparison of variability across datasets with different means.
Problem:
A pharmaceutical company is conducting a clinical trial to evaluate the effect of a new drug
on blood pressure. The systolic blood pressure (in mmHg) of a sample of participants is
measured before and after the treatment. The raw data for the systolic blood pressure change
is as follows:
{−2,1,0,−1,3,2,−1,1,0,−2}
Calculate the standard deviation of the systolic blood pressure change for the participants in
this clinical trial.
Solution:
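To calculate the standard deviation for a sample, you can use the following formula:
s=√(∑(Xi−X̄)²/(n−1))
Here n=10 and X̄=(−2+1+0−1+3+2−1+1+0−2)/10=0.1. The squared values sum to 25, so
∑(Xi−X̄)²=25−10⋅(0.1)²=24.9 and s=√(24.9/9)≈1.66.
Therefore, the standard deviation of the systolic blood pressure change for the participants in
this clinical trial is approximately 1.66 mmHg.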
This value indicates the spread or variability in the systolic blood pressure change within the
sample.
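The sample standard deviation for this dataset can be computed step by step in Python. This is a minimal sketch using the n − 1 divisor, cross-checked against the standard library's `statistics.stdev`. It evaluates to about 1.66 mmHg, because the sample mean is 0.1 rather than exactly 0.

```python
import math
import statistics

changes = [-2, 1, 0, -1, 3, 2, -1, 1, 0, -2]

n = len(changes)
sample_mean = sum(changes) / n

# Sum of squared deviations from the sample mean.
squared_deviations = sum((x - sample_mean) ** 2 for x in changes)

# Sample standard deviation: divide by n - 1, then take the square root.
sample_sd = math.sqrt(squared_deviations / (n - 1))

print(round(sample_mean, 2), round(squared_deviations, 2), round(sample_sd, 2))
print(round(statistics.stdev(changes), 2))  # cross-check with the standard library
```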
Problem:
Solution:
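To calculate the standard deviation for data grouped into class intervals, we need to use a
slightly modified formula. The formula for the standard deviation of grouped data (sample
form) is:
s=√(∑fi⋅(Xi−X̄)²/(∑fi−1))
Where:
Xi = midpoint of the i-th class interval,
fi = frequency of the i-th class interval,
X̄ = ∑(Xi⋅fi)/∑fi, the mean of the grouped data.
For a population, divide by ∑fi instead of ∑fi−1.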
This involves a series of calculations based on the provided class intervals and frequencies.
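The calculations follow the usual grouped-data approach: use class midpoints, weight each squared deviation by its frequency, and divide by the total frequency (minus one for a sample). Because the class intervals for this problem are not reproduced here, the sketch below borrows the pharmaceutical sales table from the grouped-mean example purely as stand-in data. The function name and the `sample` flag are hypothetical.

```python
import math

# Stand-in data: the pharmaceutical sales table from the grouped-mean example.
intervals = [(0, 9), (10, 19), (20, 29), (30, 39), (40, 49), (50, 59), (60, 69)]
frequencies = [5, 12, 20, 18, 15, 10, 5]


def grouped_sd(intervals, frequencies, sample=True):
    """Standard deviation of grouped data, using class midpoints."""
    midpoints = [(lo + hi) / 2 for lo, hi in intervals]
    n = sum(frequencies)
    mean = sum(x * f for x, f in zip(midpoints, frequencies)) / n
    # Frequency-weighted sum of squared deviations from the grouped mean.
    ss = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, frequencies))
    return math.sqrt(ss / (n - 1 if sample else n))


sample_sd_grouped = grouped_sd(intervals, frequencies, sample=True)
population_sd_grouped = grouped_sd(intervals, frequencies, sample=False)
print(round(sample_sd_grouped, 2), round(population_sd_grouped, 2))
```

The result is an approximation, since every observation in a class is treated as if it sat at the class midpoint.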
Correlation
Definition:
Correlation measures the statistical association or relationship between two or more variables.
It quantifies how changes in one variable are related to changes in another. The strength and
direction of this relationship are assessed through correlation coefficients.
Key Concepts:
1. Correlation Coefficient:
o The correlation coefficient (r) is a numerical measure of the strength and
direction of a linear relationship between two variables.
o It ranges from -1 to 1.
r=1 implies a perfect positive linear relationship.
r=−1 implies a perfect negative linear relationship.
r=0 implies no linear relationship.
2. Scatter Plots:
o A scatter plot visually represents the relationship between two variables.
Points on the plot indicate individual data pairs.
3. Positive vs. Negative Correlation:
o Positive correlation: As one variable increases, the other tends to increase.
o Negative correlation: As one variable increases, the other tends to decrease.
4. Strength of Correlation:
o The closer r is to 1 or -1, the stronger the correlation.
o Values near 0 suggest a weak or no linear relationship.
Types of Correlation:
Applications:
1. Research:
o Used in various fields like psychology, economics, biology to explore
relationships between variables.
2. Finance:
o Examining the correlation between different financial assets.
3. Medicine:
o Studying the correlation between lifestyle factors and health outcomes.
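Karl Pearson's coefficient of correlation (r) is calculated as:
r=∑(Xi−X̄)(Yi−Ȳ)/√(∑(Xi−X̄)²⋅∑(Yi−Ȳ)²)
Where:
Xi, Yi = the paired observations of the two variables,
X̄, Ȳ = the means of X and Y.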
Key Concepts:
1. Range of Values:
o Pearson’s r ranges from -1 to 1.
o r=1 indicates a perfect positive linear relationship.
o r=−1 indicates a perfect negative linear relationship.
o r=0 suggests no linear relationship.
2. Interpretation:
o The sign of r indicates the direction of the relationship.
o The magnitude of r indicates the strength of the relationship.
3. Assumptions:
o Assumes a linear relationship between variables.
o Sensitive to outliers.
1. Compute Means:
o Calculate the means of the two variables (X̄ and Ȳ).
2. Compute Differences:
o Find the differences between each data point and the mean for both variables.
3. Summation:
o Sum the products and squares of the differences.
4. Plug into Formula:
o Use the formula to calculate r.
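The four steps above translate directly into a short Python function. This sketch uses the deviation form of Pearson's formula, r = ∑(Xi − X̄)(Yi − Ȳ)/√(∑(Xi − X̄)²⋅∑(Yi − Ȳ)²). The function name and the sample data are hypothetical and chosen only for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient from paired observations."""
    n = len(xs)
    # Step 1: means of the two variables.
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Step 2: differences from the means.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # Step 3: sums of products and of squares of the differences.
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_xx = sum(a * a for a in dx)
    sum_yy = sum(b * b for b in dy)
    # Step 4: plug into the formula.
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Hypothetical paired data, not taken from the problems in these notes.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 3))
```

The function assumes both lists have the same length and that neither variable is constant, since a constant variable would make the denominator zero.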
Applications:
1. Research Studies:
o Used in various scientific studies to analyze relationships between variables.
2. Economics:
o Examining the correlation between economic indicators.
3. Health Sciences:
o Studying correlations between lifestyle factors and health outcomes.
Problem
Calculate Karl Pearson’s coefficient of correlation to assess the strength and direction of the
relationship between the dosage of the drug and the improvement in the health condition.
Solution:
Interpretation:
The calculated r value is approximately 0.87. This indicates a strong positive linear
relationship between the dosage of the drug and the improvement in the health condition. As
the dosage increases, there is a tendency for the health condition to improve.
Problem
A pharmaceutical company is conducting a study to analyze the relationship between the time
a patient spends exercising (in hours per week) and the reduction in cholesterol levels
(measured in mg/dL) after taking a new medication. The company collected data from a
sample of patients, and the dataset is as follows:
Calculate Karl Pearson’s coefficient of correlation to assess the strength and direction of the
relationship between exercise time and cholesterol reduction.
Solution:
Interpretation:
The calculated r value is approximately 0.73. This indicates a fairly strong, though not
perfect, positive linear relationship between the time spent exercising and the reduction in
cholesterol levels: as exercise time increases, the reduction in cholesterol tends to be
greater.
The interpretation of correlation values is subjective and may vary based on the context and
field of study. In this case, a correlation of 0.73 indicates a substantial positive association
between exercise and cholesterol reduction.
Multiple Correlation
Definition: Multiple correlation is an extension of the concept of correlation to three or more
variables. It measures the strength of the linear relationship between one variable (the
dependent variable) and two or more predictor variables taken together.
Where:
Key Concepts:
1. Partial Correlation:
o Partial correlation measures the relationship between two variables while
controlling for the effect of one or more additional variables.
o In multiple correlation, each Ryxi is a partial correlation coefficient.
2. Interpretation:
o The multiple correlation coefficient (R) ranges from 0 to 1; its square (R²) represents
the proportion of variance in the dependent variable that is accounted for by the predictors.
3. Use in Regression:
o Multiple correlation is often used in the context of multiple linear regression
analysis, where it helps assess the overall predictive power of a set of predictor
variables.
Advantages:
Limitations:
1. Assumption of Linearity:
o Like simple correlation, multiple correlation assumes a linear relationship
between variables.
2. Sensitivity to Outliers:
o Sensitive to outliers, especially when there are influential observations.
Applications:
1. Predictive Modeling:
o Commonly used in predictive modeling, such as predicting sales based on
advertising spend, pricing, and other factors.
2. Psychological Studies:
o Used in psychological studies to analyze the combined effects of various
factors on a psychological outcome.
Problem
A pharmaceutical company is developing a new drug, and they want to predict its
effectiveness based on three factors: dosage (in mg), patient's age, and the duration of
treatment (in weeks). They collect data from a sample of patients and measure the drug
effectiveness on a scale from 1 to 100.
Solution
To calculate the multiple correlation coefficient (R), we need to follow these steps:
where:
Dosage (mg) Age (years) Duration (weeks) Effectiveness
20 35 4 72
30 45 6 85
40 50 8 92
50 40 5 78
60 55 7 88
Interpretation:
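The calculated R value is approximately 0.99 (R² ≈ 0.98). This indicates a very strong linear
relationship between the drug's effectiveness and the combination of dosage, patient's age,
and duration of treatment. R is never negative, so it shows how well the three predictors
together account for effectiveness, not the direction of each predictor's effect. With only
five patients and three predictors, a high value is easy to obtain and should be read with
caution.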
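One way to obtain R is to fit a least-squares regression of effectiveness on the three predictors and take the square root of the explained share of variance. The sketch below does this in plain Python. It assumes the four columns of the data are dosage, age, duration, and effectiveness, in that order. The helper names `solve` and `multiple_r` are hypothetical, and this is not necessarily the calculation the original solution used. Under that column assumption the result is about 0.99.

```python
import math

# Assumed column order: dosage (mg), age (years), duration (weeks), effectiveness.
rows = [
    (20, 35, 4, 72),
    (30, 45, 6, 85),
    (40, 50, 8, 92),
    (50, 40, 5, 78),
    (60, 55, 7, 88),
]


def solve(a, b):
    """Solve a small linear system a.x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def multiple_r(rows):
    """Multiple correlation of the last column with all preceding columns."""
    n = len(rows)
    k = len(rows[0]) - 1  # number of predictors
    means = [sum(r[j] for r in rows) / n for j in range(k + 1)]
    centered = [[r[j] - means[j] for j in range(k + 1)] for r in rows]
    # Sums of squares and cross-products of the centered variables.
    sxx = [[sum(c[i] * c[j] for c in centered) for j in range(k)] for i in range(k)]
    sxy = [sum(c[i] * c[k] for c in centered) for i in range(k)]
    syy = sum(c[k] ** 2 for c in centered)
    coefs = solve(sxx, sxy)  # least-squares slopes
    explained = sum(b * s for b, s in zip(coefs, sxy))
    return math.sqrt(explained / syy)  # R = sqrt(R squared)


R = multiple_r(rows)
print(round(R, 2), round(R ** 2, 2))
```

Because R can never be smaller than the strongest simple correlation between the outcome and any single predictor, comparing it with those pairwise values is a useful sanity check.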