NAY Eset Assignment - biostatistics final


AKSUM UNIVERSITY

CHS AND CSH SCHOOL OF PUBLIC HEALTH


DEPARTMENT OF EPIDEMIOLOGY AND
BIOSTATISTICS

Biostatistics Individual Assignment

Prepared by: Eset Gebru

June, 2024

Axum, Tigray, Ethiopia


Question 1) Using the ungrouped data given below, find the answers both manually and using SPSS
after entering the data into SPSS.

The following ungrouped data array gives the ages at which individuals stopped smoking.

30 34 35 37 37 38 38 38 38 39 39 40 40 42 42

43 43 43 43 43 43 44 44 44 44 44 44 44 45 45

45 46 46 46 46 46 46 47 47 47 47 47 47 48 48

48 48 48 48 48 49 49 49 49 49 49 49 50 50 50

50 50 50 50 50 51 51 51 51 52 52 52 52 52 52

53 53 53 53 53 53 53 53 53 53 53 53 53 53 53

53 53 54 54 54 54 54 54 54 54 54 54 54 55 55

55 56 56 56 56 56 56 57 57 57 57 57 57 57 58

58 59 59 59 59 59 59 60 60 60 60 61 61 61 61

61 61 61 61 61 61 61 62 62 62 62 62 62 62 63

63 64 64 64 64 64 64 65 65 66 66 66 66 66 66

67 68 68 68 69 69 69 70 71 71 71 71 71 71 71

72 73 75 76 77 78 78 78 82

From the data given above, calculate each of the following manually.

1.1) Mean of the ungrouped data

Answer:-

The mean is the sum of all observations divided by the number of observations. Summing all the
observations written in the table gives 10,401.

Dividing this sum by the number of observations (N=189) gives

Mean = 10,401/189

= 55.03

1.2) Median of the ungrouped data

Answer:-

To find the median, the data must be arranged in ascending order.

Since the data is already arranged in ascending order, and the number of observations (N=189) is
odd, we take the middle value, which is the (N+1)/2 = 95th observation (the one between the 94th
and 96th observations), giving us 54.

1.3) Mode of the ungrouped data

Answer:-

Mode is the observation that appears most frequently. We have to find the frequency of each
observation, which first requires tallying the table.

The observation 53 appears 17 times. All other observations appear fewer than 17 times. Thus,
the observation 53 is the modal value of the ungrouped data.
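
The three measures of central tendency above can be cross-checked with a short Python sketch. It is only an illustration and not part of the assignment's SPSS workflow; the names AGE_COUNTS and ages are hypothetical, and the data are stored as value-to-frequency pairs tallied from the array given in Question 1.

```python
import statistics

# Ages at which smoking stopped, as value -> frequency (tallied from the data array).
AGE_COUNTS = {
    30: 1, 34: 1, 35: 1, 37: 2, 38: 4, 39: 2, 40: 2, 42: 2, 43: 6, 44: 7,
    45: 3, 46: 6, 47: 6, 48: 7, 49: 7, 50: 8, 51: 4, 52: 6, 53: 17, 54: 11,
    55: 3, 56: 6, 57: 7, 58: 2, 59: 6, 60: 4, 61: 11, 62: 7, 63: 2, 64: 6,
    65: 2, 66: 6, 67: 1, 68: 3, 69: 3, 70: 1, 71: 7, 72: 1, 73: 1, 75: 1,
    76: 1, 77: 1, 78: 3, 82: 1,
}
# Expand the tally back into the sorted list of individual observations.
ages = sorted(age for age, count in AGE_COUNTS.items() for _ in range(count))

n = len(ages)                      # number of observations
total = sum(ages)                  # sum of all observations
mean = total / n                   # arithmetic mean
median = statistics.median(ages)   # middle (95th) value because n is odd
mode = statistics.mode(ages)       # most frequent value

print(n, total, round(mean, 2), median, mode)  # 189 10401 55.03 54 53
```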

1.4) Minimum and maximum value of the ungrouped data

Answer:-

As can be seen from the ascending arranged observation table, the minimum value which
appears first is 30, and the maximum value which appears at the end is 82.

1.5) Range of the ungrouped data

Answer:-

Range is the difference between the maximum and minimum values. In our case, the difference
between the maximum value which is 82, and the minimum value which is 30, gives the Range
which is 52.

1.6) IQR of the ungrouped data

Answer:-

The IQR describes the middle 50% of values when ordered from lowest to highest. To find the
interquartile range (IQR), first find the median (middle value) of the lower and upper half of the
data. These values are quartile 1 (Q1) and quartile 3 (Q3). The IQR is the difference between Q3
and Q1.

IQR=Q3-Q1

In order to get the IQR, first divide the data into two halves at the median. The median, as
calculated above, is the 95th observation, which is 54. Setting it aside leaves two datasets of 94
observations each, and we identify the median of each of the two datasets.

The median of the first dataset (the average of the 47th and 48th observations) is (48+48)/2=48

The median of the second dataset (the average of the 142nd and 143rd observations) is
(61+62)/2=61.5

The medians of the first and second datasets are Q1 and Q3, which are 48 and 61.5 respectively.
The IQR becomes

IQR=Q3-Q1

IQR =61.5-48

IQR =13.5

Dividing the IQR by 2 gives the semi-interquartile range (quartile deviation), 13.5/2 = 6.75.
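
The median-of-halves procedure described above can be sketched in Python as follows. This is a minimal illustration under the assumption that the overall median is excluded from both halves; the names AGE_COUNTS, lower_half and upper_half are hypothetical. Other quartile conventions (including the one SPSS uses) can give slightly different values. The sketch prints both Q3 - Q1 and half of that difference, which is the semi-interquartile range.

```python
import statistics

# Ages at which smoking stopped, as value -> frequency (tallied from the data array).
AGE_COUNTS = {
    30: 1, 34: 1, 35: 1, 37: 2, 38: 4, 39: 2, 40: 2, 42: 2, 43: 6, 44: 7,
    45: 3, 46: 6, 47: 6, 48: 7, 49: 7, 50: 8, 51: 4, 52: 6, 53: 17, 54: 11,
    55: 3, 56: 6, 57: 7, 58: 2, 59: 6, 60: 4, 61: 11, 62: 7, 63: 2, 64: 6,
    65: 2, 66: 6, 67: 1, 68: 3, 69: 3, 70: 1, 71: 7, 72: 1, 73: 1, 75: 1,
    76: 1, 77: 1, 78: 3, 82: 1,
}
ages = sorted(age for age, count in AGE_COUNTS.items() for _ in range(count))
n = len(ages)

# With n odd, drop the middle observation and split the rest into two halves.
lower_half = ages[: n // 2]          # observations 1..94
upper_half = ages[(n + 1) // 2 :]    # observations 96..189

q1 = statistics.median(lower_half)   # first quartile
q3 = statistics.median(upper_half)   # third quartile
iqr = q3 - q1                        # interquartile range
semi_iqr = iqr / 2                   # semi-interquartile range (quartile deviation)

print(q1, q3, iqr, semi_iqr)
```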

1.7) 90th percentile of the ungrouped data

Answer:-

The kth percentile is the value below which about k percent of the ordered observations fall.
The 90th percentile is therefore the value below which about 90% of the ordered observations
fall.

Using the position formula k(n+1)/100, with k=90 and n=189:

Position of P90=90(189+1)/100

=90(190)/100

=17100/100

=171

The 171st observation in the ordered data is 69, so

P90=69
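
The position rule k(n+1)/100 used above can be sketched in Python. This is an illustration only: the function name percentile_by_position is hypothetical, and the linear interpolation for non-integer positions is an added assumption that the manual calculation did not need, since the position here is a whole number.

```python
# Ages at which smoking stopped, as value -> frequency (tallied from the data array).
AGE_COUNTS = {
    30: 1, 34: 1, 35: 1, 37: 2, 38: 4, 39: 2, 40: 2, 42: 2, 43: 6, 44: 7,
    45: 3, 46: 6, 47: 6, 48: 7, 49: 7, 50: 8, 51: 4, 52: 6, 53: 17, 54: 11,
    55: 3, 56: 6, 57: 7, 58: 2, 59: 6, 60: 4, 61: 11, 62: 7, 63: 2, 64: 6,
    65: 2, 66: 6, 67: 1, 68: 3, 69: 3, 70: 1, 71: 7, 72: 1, 73: 1, 75: 1,
    76: 1, 77: 1, 78: 3, 82: 1,
}
ages = sorted(age for age, count in AGE_COUNTS.items() for _ in range(count))


def percentile_by_position(sorted_values, k):
    """Return the k-th percentile using the 1-based position k(n+1)/100."""
    n = len(sorted_values)
    position = k * (n + 1) / 100          # 1-based position in the ordered data
    whole = int(position)                 # observation at or just below the position
    fraction = position - whole           # how far to move towards the next observation
    if whole < 1:
        return sorted_values[0]
    if whole >= n:
        return sorted_values[-1]
    low = sorted_values[whole - 1]
    high = sorted_values[whole]
    return low + fraction * (high - low)  # assumed linear interpolation


position_90 = 90 * (len(ages) + 1) / 100
p90 = percentile_by_position(ages, 90)
print(position_90, p90)
```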

1.8) Variance of the ungrouped data

The sum of the squared deviations of the 189 observations from the mean is 18,479.81. Dividing
this by the number of observations (N=189) gives

σ²=18,479.81/189

σ²=97.78

Thus, the population variance is 97.78

1.9) Standard deviation of the ungrouped data

The standard deviation, S.D., is just the square root of the variance.

Thus, taking the square root of the population variance which is 97.78 gives

SD=9.89
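
As a cross-check, the variance and standard deviation can be computed with Python's standard statistics module. This is a sketch, not part of the assignment: it shows both the population form (dividing by N, as in the manual calculation) and the sample form (dividing by N-1, which is what most statistical packages report by default). The variable names are hypothetical.

```python
import statistics

# Ages at which smoking stopped, as value -> frequency (tallied from the data array).
AGE_COUNTS = {
    30: 1, 34: 1, 35: 1, 37: 2, 38: 4, 39: 2, 40: 2, 42: 2, 43: 6, 44: 7,
    45: 3, 46: 6, 47: 6, 48: 7, 49: 7, 50: 8, 51: 4, 52: 6, 53: 17, 54: 11,
    55: 3, 56: 6, 57: 7, 58: 2, 59: 6, 60: 4, 61: 11, 62: 7, 63: 2, 64: 6,
    65: 2, 66: 6, 67: 1, 68: 3, 69: 3, 70: 1, 71: 7, 72: 1, 73: 1, 75: 1,
    76: 1, 77: 1, 78: 3, 82: 1,
}
ages = sorted(age for age, count in AGE_COUNTS.items() for _ in range(count))

mean = statistics.fmean(ages)
sum_sq_dev = sum((x - mean) ** 2 for x in ages)  # sum of squared deviations

pop_var = statistics.pvariance(ages)   # divides by N
pop_sd = statistics.pstdev(ages)
samp_var = statistics.variance(ages)   # divides by N - 1
samp_sd = statistics.stdev(ages)

print(round(sum_sq_dev, 2))
print(round(pop_var, 2), round(pop_sd, 2))
print(round(samp_var, 3), round(samp_sd, 2))
```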

1.10) Standard error of the ungrouped data

The standard error represents the standard deviation of the sampling distribution of the sample
mean. It tells us how much the sample mean is likely to vary from the population mean.

To calculate the standard error, we'll follow these steps:

Calculate the sample mean:

Sample mean = 55.03

Calculate the sample standard deviation:

Sample standard deviation = 9.89

Calculate the standard error:

Standard error = Sample standard deviation / √(Sample size)

Standard error = 9.89 / √189

Standard error = 9.89/13.75

Standard error=0.72

1.11) 95% Confidence interval of the population age at which smoking stops

The mean is 55.03.

Z=1.96
S=9.89
N=189
Confidence interval 95% = 55.03+/-1.96(9.89/sqrt 189)

(a) =55.03+1.96(9.89/sqrt 189)

=55.03+1.96(9.89/13.75)

= 55.03+1.41

=56.44

(b) =55.03-1.96(9.89/sqrt 189)

=55.03-1.96(9.89/13.75)

= 55.03-1.41

= 53.62

So, the confidence interval is between 53.62 and 56.44

(53.62, 56.44)
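
The standard error and the z-based confidence interval above can be reproduced with the following Python sketch. It assumes, as the manual calculation does, the standard deviation obtained by dividing by N (9.89) and a normal critical value of about 1.96; the variable names are hypothetical.

```python
import math
import statistics
from statistics import NormalDist

# Ages at which smoking stopped, as value -> frequency (tallied from the data array).
AGE_COUNTS = {
    30: 1, 34: 1, 35: 1, 37: 2, 38: 4, 39: 2, 40: 2, 42: 2, 43: 6, 44: 7,
    45: 3, 46: 6, 47: 6, 48: 7, 49: 7, 50: 8, 51: 4, 52: 6, 53: 17, 54: 11,
    55: 3, 56: 6, 57: 7, 58: 2, 59: 6, 60: 4, 61: 11, 62: 7, 63: 2, 64: 6,
    65: 2, 66: 6, 67: 1, 68: 3, 69: 3, 70: 1, 71: 7, 72: 1, 73: 1, 75: 1,
    76: 1, 77: 1, 78: 3, 82: 1,
}
ages = sorted(age for age, count in AGE_COUNTS.items() for _ in range(count))

n = len(ages)
mean = statistics.fmean(ages)
sd = statistics.pstdev(ages)                 # assumption: divide-by-N form, as in the manual work
se = sd / math.sqrt(n)                       # standard error of the mean

confidence = 0.95
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided critical value, about 1.96
margin = z * se
lower, upper = mean - margin, mean + margin

print(round(se, 2), round(lower, 2), round(upper, 2))  # 0.72 53.62 56.44
```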

1.12) Put your statistical interpretation for all of 1.1-1.11

1.12.1) Mean of the ungrouped data

Thus, the mean (average) value, which is one measure of central tendency, of the ungrouped data
is 55.03.

1.12.2) Median of the ungrouped data

Thus, the median (middle) value when arranged in ascending order, which is one measure of
central tendency, of the ungrouped data is 54.

1.12.3) Mode of the ungrouped data

Thus, the mode (most frequently appearing) value, which is one measure of central tendency, of
the ungrouped data is 53.

1.12.4) Minimum and Maximum of the ungrouped data

Thus, the Minimum (lowest) value and the maximum (highest) values of the ungrouped data are
30 and 82 respectively.

1.12.5) Range of the ungrouped data

Thus, the difference between the maximum which is 82 and the minimum which is 30 is 52.

1.12.6) IQR of the ungrouped data

Thus, the IQR (Interquartile Range), which is the difference between the third quartile (the
point that divides the upper half of the dataset into two) and the first quartile (the point that
divides the lower half of the dataset into two), is 61.5 - 48 = 13.5. The middle 50% of the ages
span 13.5 years.

1.12.7) 90th percentile of the ungrouped data

Thus, about 90% of the individuals stopped smoking at or before age 69, and about 10% stopped at
a later age.

1.12.8) Variance of the ungrouped data

Thus, the average squared deviation of the data points from the mean is 97.78 (in squared years).

1.12.9) Standard Deviation of the ungrouped data

Thus, the data points typically deviate from the mean of 55.03 by about 9.89 years.

1.12.10) Standard Error (SE) of the ungrouped data

The standard error is smaller than the sample standard deviation, indicating that the sample mean
is a more precise estimate of the population mean compared to individual data points.

With a sample size of 189, we can expect the sample mean to be within approximately 0.72
points of the population mean 68% of the time (based on the 68-95-99.7 rule).

If we were to take multiple samples of size 189 from the population, the sample means would be
distributed around the population mean with a standard deviation of 0.72.

1.12.11) 95% Confidence Interval (CI)

We are 95% confident that the population mean age at which smoking stops lies between 53.62 and
56.44, that is, within (53.62, 56.44). If the sampling were repeated many times, about 95% of the
intervals constructed this way would contain the true population mean.

Question 2. Using the same data as above, after entering the data directly into SPSS, show the
steps in your descriptive analysis

Answer:-

USING SPSS

Here, I use SPSS software to answer questions 2.1 – 2.12.

The procedure is:-

Analyze → Descriptive Statistics → Frequencies → Select the variable to be analyzed, which in
our case is VARIABLES=SmokeStopAge → Go to Statistics → Under the types of analyses to be
conducted (Percentile Values, Central Tendency, Dispersion, and Distribution), check Quartiles,
Percentiles, Mean, Median, Mode, Standard Deviation, Variance, Range, Minimum, Maximum,
Standard Error Mean, Skewness, and Kurtosis as needed → OK.

Then, the following result appears:

Then, the respective answers are:

2.1) Mean of the ungrouped data

The mean of the ungrouped data is 55.03.

2.2) Median of the ungrouped data

The Median of the ungrouped data is 54.00.

2.3) Mode of the ungrouped data

The Mode of the ungrouped data is 53.00.

2.4) Minimum and maximum value of the ungrouped data

The minimum of the ungrouped data is 30, and the maximum of the ungrouped data is 82.

2.5) Range of the ungrouped data

The Range of the ungrouped data is 52.00.

2.6) Box plot of the ungrouped data

2.7) Is the above data normally distributed? If no, explain why.

Yes, the data is normally distributed.

2.8) What statistical measure do you take?

There are several statistical measures that can be used to check the normality of a dataset. Here
are some of the most common ones:

1) Skewness and Kurtosis:

• Skewness measures the asymmetry of the distribution. A normal distribution has a skewness of 0.

• Kurtosis measures the "peakedness" of the distribution. A normal distribution has a kurtosis
of 3 (equivalently, an excess kurtosis of 0, which is the form SPSS reports).

• Values of skewness and kurtosis that differ markedly from 0 and 3, respectively, can indicate
a departure from normality.

2) Shapiro-Wilk Test:

• The Shapiro-Wilk test is a formal statistical test for normality.

• It compares the sample data to a normally distributed set of values with the same mean and
standard deviation.

• The test statistic (W) ranges from 0 to 1, with values closer to 1 indicating a normal
distribution.

• A p-value less than the chosen significance level (e.g., 0.05) suggests that the data is not
normally distributed.

3) Kolmogorov-Smirnov Test:

• The Kolmogorov-Smirnov (K-S) test is another formal statistical test for normality.

• It compares the cumulative distribution function (CDF) of the sample data to the CDF of a
normal distribution.

• The test statistic (D) represents the maximum difference between the two CDFs.

• A p-value less than the chosen significance level indicates a departure from normality.

4) Anderson-Darling Test:

• The Anderson-Darling test is a variation of the K-S test that is more sensitive to deviations
from normality in the tails of the distribution.

• The test statistic (A^2) measures the difference between the sample data and a normal
distribution.

• A p-value less than the chosen significance level suggests that the data is not normally
distributed.

5) Normal Probability Plot (Q-Q Plot):

• The normal probability plot, or Q-Q plot, compares the quantiles of the sample data to the
quantiles of a normal distribution.

• If the data is normally distributed, the points on the Q-Q plot should align closely to a
straight line.

• Deviations from the straight line indicate departures from normality, such as skewness or
kurtosis.

When checking the normality of a dataset, it's generally a good idea to use a combination of
these statistical measures, as well as visual inspections like histograms and Q-Q plots. This
provides a more comprehensive assessment of the normality of the data, which is important for
many statistical analyses and modeling techniques.

The choice of which specific normality test to use may depend on factors such as the sample
size, the expected distribution, and the specific research question or analytical needs.
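
To illustrate the first measure in the list above, the sketch below computes moment-based skewness and excess kurtosis in plain Python. The function names are hypothetical, and these are the simple population-moment formulas; SPSS applies small-sample adjustments, so its reported values may differ slightly from what this prints.

```python
# Ages at which smoking stopped, as value -> frequency (tallied from the data array).
AGE_COUNTS = {
    30: 1, 34: 1, 35: 1, 37: 2, 38: 4, 39: 2, 40: 2, 42: 2, 43: 6, 44: 7,
    45: 3, 46: 6, 47: 6, 48: 7, 49: 7, 50: 8, 51: 4, 52: 6, 53: 17, 54: 11,
    55: 3, 56: 6, 57: 7, 58: 2, 59: 6, 60: 4, 61: 11, 62: 7, 63: 2, 64: 6,
    65: 2, 66: 6, 67: 1, 68: 3, 69: 3, 70: 1, 71: 7, 72: 1, 73: 1, 75: 1,
    76: 1, 77: 1, 78: 3, 82: 1,
}
ages = sorted(age for age, count in AGE_COUNTS.items() for _ in range(count))


def central_moment(values, order):
    """Average of (x - mean) raised to the given power."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** order for x in values) / len(values)


def moment_skewness(values):
    """Third central moment scaled by the variance to the power 3/2 (0 for symmetric data)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth central moment scaled by the squared variance, minus 3 (0 for a normal curve)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


# Values near 0 for both measures are consistent with an approximately normal shape.
print(round(moment_skewness(ages), 3), round(excess_kurtosis(ages), 3))
```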

2.9) Variance of the ungrouped data

The Variance of the ungrouped data is 98.297

2.10) Standard deviation of the ungrouped data

The Standard Deviation of the ungrouped data is 9.91 (the square root of the variance, 98.297).

2.11) Standard error of the ungrouped data

The Standard Error (SE) of the ungrouped data is 0.721.

2.12) 95% Confidence interval of the population age at which smoking stops and interpret it

The 95% Confidence Interval (CI) of the population mean age at which smoking stops is between
53.62 and 56.44.

We are 95% confident that the population mean age at which smoking stops lies within the interval
(53.62, 56.44). If the sampling were repeated many times, about 95% of such intervals would
contain the true population mean.

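Overall, the manually calculated results closely match the SPSS results. The small differences in
the variance (97.78 versus 98.297) and the standard deviation arise because the manual
calculation divides by N = 189, whereas SPSS divides by N - 1 = 188.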

Question 3. After converting the above ungrouped data into a grouped data array, present the
data in the following data presentation styles

A frequency distribution table containing

The number of classes (K) is found with Sturges' rule, and the class width (W) follows from the
largest (L) and smallest (S) observations:

K=1+3.322(log n)

W=(L-S)/K

In our case, the number of classes (K) can be computed using Sturges' rule as:
K=1+3.322 log(189)
K=1+3.322*2.28
K=1+7.57
K=8.57
K=9 (rounded up)
Then, the width of each class can be calculated:
W=(Largest observation – Smallest observation)/K
W=(82-30)/9
W=52/9
W=5.78
Thus, the width is rounded up to 6.
Based on these, we get the grouped dataset as follows:
S.No.  Class limit  Class boundary  Class mark  Frequency  RF (%)  CF (%)
1      30-35        29.5-35.5       32.5        3          1.59    1.59
2      36-41        35.5-41.5       38.5        10         5.29    6.88
3      42-47        41.5-47.5       44.5        30         15.87   22.75
4      48-53        47.5-53.5       50.5        49         25.93   48.68
5      54-59        53.5-59.5       56.5        35         18.52   67.20
6      60-65        59.5-65.5       62.5        32         16.93   84.13
7      66-71        65.5-71.5       68.5        21         11.11   95.24
8      72-77        71.5-77.5       74.5        5          2.65    97.88
9      78-82        77.5-82.5       80          4          2.12    100.00
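
The grouping steps above (Sturges' rule, class width, and the frequency columns) can be sketched in Python as follows. This is an illustration with hypothetical variable names. It assumes every class has the same width of 6, so the last class is generated as 78-83 rather than the 78-82 shown in the table; the frequencies are unaffected because the largest observation is 82.

```python
import math

# Ages at which smoking stopped, as value -> frequency (tallied from the data array).
AGE_COUNTS = {
    30: 1, 34: 1, 35: 1, 37: 2, 38: 4, 39: 2, 40: 2, 42: 2, 43: 6, 44: 7,
    45: 3, 46: 6, 47: 6, 48: 7, 49: 7, 50: 8, 51: 4, 52: 6, 53: 17, 54: 11,
    55: 3, 56: 6, 57: 7, 58: 2, 59: 6, 60: 4, 61: 11, 62: 7, 63: 2, 64: 6,
    65: 2, 66: 6, 67: 1, 68: 3, 69: 3, 70: 1, 71: 7, 72: 1, 73: 1, 75: 1,
    76: 1, 77: 1, 78: 3, 82: 1,
}
ages = sorted(age for age, count in AGE_COUNTS.items() for _ in range(count))
n = len(ages)

# Sturges' rule for the number of classes, rounded up to a whole number.
k = math.ceil(1 + 3.322 * math.log10(n))
# Class width: range divided by the number of classes, rounded up.
width = math.ceil((max(ages) - min(ages)) / k)

lower_limits = [min(ages) + i * width for i in range(k)]
frequencies = [
    sum(1 for age in ages if low <= age <= low + width - 1) for low in lower_limits
]

# Relative and cumulative relative frequencies, in percent.
relative = [round(100 * f / n, 2) for f in frequencies]
running = 0
cumulative = []
for f in frequencies:
    running += f
    cumulative.append(round(100 * running / n, 2))

for low, f, rf, cf in zip(lower_limits, frequencies, relative, cumulative):
    print(f"{low}-{low + width - 1}", f, rf, cf)
```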
3.1) A Frequency
Based on the above calculation, the Frequency is demonstrated in the following table.

S.No.  Class limit  Class boundary  Class mark  Frequency
1      30-35        29.5-35.5       32.5        3
2      36-41        35.5-41.5       38.5        10
3      42-47        41.5-47.5       44.5        30
4      48-53        47.5-53.5       50.5        49
5      54-59        53.5-59.5       56.5        35
6      60-65        59.5-65.5       62.5        32
7      66-71        65.5-71.5       68.5        21
8      72-77        71.5-77.5       74.5        5
9      78-82        77.5-82.5       80          4

3.2) A relative frequency distribution
The Relative Frequency (RF) in % is shown in the last column of the following table:

S.No.  Class limit  Class boundary  Class mark  Frequency  RF (%)
1      30-35        29.5-35.5       32.5        3          1.59
2      36-41        35.5-41.5       38.5        10         5.29
3      42-47        41.5-47.5       44.5        30         15.87
4      48-53        47.5-53.5       50.5        49         25.93
5      54-59        53.5-59.5       56.5        35         18.52
6      60-65        59.5-65.5       62.5        32         16.93
7      66-71        65.5-71.5       68.5        21         11.11
8      72-77        71.5-77.5       74.5        5          2.65
9      78-82        77.5-82.5       80          4          2.12

3.3) A cumulative relative frequency distribution

The Cumulative Relative Frequency (CF) in % is shown in the last column of the following table.

S.No.  Class limit  Class boundary  Class mark  Frequency  RF (%)  CF (%)
1      30-35        29.5-35.5       32.5        3          1.59    1.59
2      36-41        35.5-41.5       38.5        10         5.29    6.88
3      42-47        41.5-47.5       44.5        30         15.87   22.75
4      48-53        47.5-53.5       50.5        49         25.93   48.68
5      54-59        53.5-59.5       56.5        35         18.52   67.20
6      60-65        59.5-65.5       62.5        32         16.93   84.13
7      66-71        65.5-71.5       68.5        21         11.11   95.24
8      72-77        71.5-77.5       74.5        5          2.65    97.88
9      78-82        77.5-82.5       80          4          2.12    100.00

3.4) A histogram

Answer:-

3.5) The direction of skewness: negative, positive, or none

Answer:-

As can be seen from the Histogram above, the data is positively skewed.

3.6) Is the data normally distributed well for further analysis?

Answer:-

Yes, the data is normally distributed, and further analysis can be done.

3.7) Categorize the continuous data into categorical data using SPSS with four categories: 30-39,
40-49, 50-60, and above 60

Answer:-

Here is the continuous data categorized into four categories in SPSS.

4. Kanjanarat et al. (A-11) estimate the rate of preventable adverse drug events (ADE) in
hospitals to be 35.2 percent. Preventable ADEs typically result from inappropriate care or
medication errors, which include errors of commission and errors of omission. Suppose that
10 hospital patients experiencing an ADE are chosen at random. Let p = 0.352, and calculate the
probability that:

(a) Exactly seven of those drug events were preventable

Answer:-

The binomial formula is: P(X = x) = (n choose x) * p^x * (1-p)^(n-x), with n = 10 and p = 0.352.

Plugging in the values:
P(X = 7) = (10 choose 7) * (0.352)^7 * (1-0.352)^(10-7)
P(X = 7) = 120 * 0.000670 * 0.2721 = 0.0219 or 2.19%

(b) More than half of those drug events were preventable

Answer:-

More than half of 10 events means 6, 7, 8, 9, or 10 events were preventable.

P(X > 5) = P(X = 6) + P(X = 7) + P(X = 8) + P(X = 9) + P(X = 10)

P(X > 5) = 0.0704 + 0.0219 + 0.0045 + 0.0005 + 0.00003 = 0.0973 or 9.73%

(c) None of those drug events were preventable

Answer:-

This is the probability that 0 events were preventable.

P(X = 0) = (10 choose 0) * (0.352)^0 * (1-0.352)^(10-0)

P(X = 0) = 1 * 1 * (0.648)^10 = 0.0131 or 1.31%

(d) Between three and six inclusive were preventable

Answer:-

This is the sum of the probabilities for 3, 4, 5, and 6 preventable events.


P(3 ≤ X ≤ 6) = P(X = 3) + P(X = 4) + P(X = 5) + P(X = 6)
P(3 ≤ X ≤ 6) = 0.2511 + 0.2387 + 0.1556 + 0.0704 = 0.7158 or 71.58%
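
The four binomial probabilities in parts (a) to (d) can be checked with a few lines of Python. This is a minimal sketch using n = 10 and p = 0.352 as in the worked answers; the function name binom_pmf and the variable names are hypothetical.

```python
from math import comb


def binom_pmf(x, n, p):
    """Binomial probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n_patients = 10
p_preventable = 0.352

# (a) exactly seven preventable events
p_exactly_7 = binom_pmf(7, n_patients, p_preventable)
# (b) more than half, i.e. 6, 7, 8, 9 or 10
p_more_than_half = sum(binom_pmf(x, n_patients, p_preventable) for x in range(6, 11))
# (c) none preventable
p_none = binom_pmf(0, n_patients, p_preventable)
# (d) between three and six inclusive
p_3_to_6 = sum(binom_pmf(x, n_patients, p_preventable) for x in range(3, 7))

for label, value in [("a", p_exactly_7), ("b", p_more_than_half),
                     ("c", p_none), ("d", p_3_to_6)]:
    print(label, round(value, 4))
```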

5. The goal of a study by Klingler et al (A-2) was to determine how symptom recognition and
perception influence clinical presentation as a function of race. They characterized symptoms
and care-seeking behavior in African-American patients with chest pain seen in the
Emergency Department. One of the presenting vital signs was systolic blood pressure. Among
157 African-American men, the mean systolic blood pressure was 146 mm Hg with a standard
deviation of 27. The investigator may want to conclude that the mean systolic blood pressure for
a population of African-American men is greater than 140 mm Hg.

Does the researcher provide sufficient evidence to conclude that the population mean is greater
than 140 mm Hg? Develop both the null and alternative hypotheses.

a) Calculate the test statistic

Answer:-

Null Hypothesis (H0): The population mean systolic blood pressure for African-American men is
less than or equal to 140 mmHg.

H0: μ ≤ 140 mmHg

Alternative Hypothesis (H1): The population mean systolic blood pressure for African-American
men is greater than 140 mmHg.

H1: μ > 140 mmHg

Given information:

Sample size (n) = 157 African-American men

Sample mean (x̄) = 146 mmHg

Sample standard deviation (s) = 27 mmHg

Using the one-sample z-test formula, with μ0 = 140 mmHg as the hypothesized mean:

z = (x̄ - μ0) / (s / √n)

z = (146 - 140) / (27 / √157)

z = 6 / (27 / 12.53)

z = 6 / 2.155

z = 2.78
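
The test statistic can be reproduced, and compared with the one-tailed critical value, using the following Python sketch. It assumes a large-sample z-test with the sample standard deviation standing in for the population value, as in the calculation above; the variable names are hypothetical, and the p-value is an extra that the question does not ask for.

```python
import math
from statistics import NormalDist

sample_size = 157        # African-American men in the sample
sample_mean = 146.0      # mean systolic blood pressure, mm Hg
sample_sd = 27.0         # standard deviation, mm Hg
hypothesized_mean = 140.0
alpha = 0.05             # one-tailed significance level

standard_error = sample_sd / math.sqrt(sample_size)
z = (sample_mean - hypothesized_mean) / standard_error

critical_value = NormalDist().inv_cdf(1 - alpha)   # upper-tail cutoff, about 1.645
p_value = 1 - NormalDist().cdf(z)                  # upper-tail probability
reject_null = z > critical_value

print(round(z, 2), round(critical_value, 3), reject_null)  # 2.78 1.645 True
```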

b) Do you think that the systolic blood pressure of the population is greater than 140 mm Hg?

Answer:-

Yes, the calculated z-statistic of 2.78 is greater than the critical value of 1.645 for a one-tailed
test at α = 0.05. This means there is sufficient evidence to conclude that the population mean
systolic blood pressure for African-American men is greater than 140 mmHg.

c) What is your decision: reject or fail to reject the null hypothesis (one-tailed, at α = 0.05)?

Answer:-

At a significance level of α = 0.05 (one-tailed test), the critical value is 1.645.

Since the calculated z-statistic of 2.78 is greater than the critical value of 1.645, we can reject the
null hypothesis.

Therefore, we can conclude that the population mean systolic blood pressure for African-
American men is greater than 140 mmHg, at a significance level of α = 0.05.
