Unit 3: Counts, Percentages, and Proportions; Quantiles, Percentiles, and the Summary

Counts
For a categorical variable, computing the mean makes no sense, but it is sometimes useful to
count the number of observations that fall within each category. These counts, or frequencies,
represent the most elementary summary statistic of categorical data. Getting these factor-level
counts is as straightforward as this:

R> table(chickwts$feed)
casein horsebean linseed meatmeal soybean sunflower
12 10 12 11 14 12

Proportions
You can gather more information from these counts by identifying the proportion of
observations that fall into each category. This will give you comparable measures across
multiple data sets. Proportions represent the fraction of observations in each category, usually
expressed as a decimal (floating-point) number between 0 and 1 (inclusive). To calculate
proportions, you only need to modify the previous count function by dividing the count (or
frequency) by the overall sample size.

R> table(chickwts$feed)/nrow(chickwts)

casein horsebean linseed meatmeal soybean sunflower

0.1690141 0.1408451 0.1690141 0.1549296 0.1971831 0.1690141
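As an aside, base R also provides the prop.table function, which performs this count-to-proportion conversion in a single step; it is simply an alternative to the manual division shown above:

```r
# prop.table() divides every cell of a table of counts by the grand total,
# reproducing the result of table(chickwts$feed)/nrow(chickwts)
prop.table(table(chickwts$feed))
```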

Of course, you needn’t do everything associated with counts via table. A simple sum of an
appropriate logical flag vector can be just as useful—recall that TRUEs are automatically
treated as 1 and FALSEs as 0 in any arithmetic treatment of logical structures in R. Such a
sum will provide you with the desired frequency, but to get a proportion, you still need to
divide by the total sample size. Furthermore, this is actually equivalent to finding the mean of
a logical flag vector. For example, to find the proportion of chicks fed soybean, note that the
following two calculations give identical results of around 0.197:
R> sum(chickwts$feed=="soybean")/nrow(chickwts)

[1] 0.1971831

R> mean(chickwts$feed=="soybean")

[1] 0.1971831

You can also use this approach to calculate the proportion of entities in combined groups,
achieved easily through logical operators. The proportion of chicks fed either soybean or
horsebean is as follows:

R> mean(chickwts$feed=="soybean"|chickwts$feed=="horsebean")

[1] 0.3380282
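A tidier way to express the same combined-group condition, rather than chaining | comparisons, is the %in% operator, which tests each element for membership in a vector of categories:

```r
# %in% returns TRUE wherever feed matches any of the listed levels;
# the mean of that logical vector is again the combined proportion
mean(chickwts$feed %in% c("soybean", "horsebean"))
# [1] 0.3380282
```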

The last function to note is round, which rounds numeric output to a certain number of
decimal places. You need only supply to round your numeric vector (or matrix or any other
appropriate data structure), along with the number of decimal places (as the argument digits)
to which you want your figures rounded.

R> round(table(chickwts$feed)/nrow(chickwts),digits=3)

casein horsebean linseed meatmeal soybean sunflower

0.169 0.141 0.169 0.155 0.197 0.169

This provides output that’s easier to read at a glance. If you set digits=0 (the default), output is
rounded to the nearest integer.
Percentages
A proportion and a percentage represent the same thing. The only difference is the scale; the
percentage is merely the proportion multiplied by 100. The percentage of chicks on a soybean
diet is therefore approximately 19.7 percent.

R> round(mean(chickwts$feed=="soybean")*100,1)

[1] 19.7

Since proportions always lie in the interval [0, 1], percentages always lie within [0, 100].
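The same scaling works on a whole table of proportions at once; multiplying by 100 and rounding to one decimal place gives a tidy table of percentages:

```r
# proportions * 100 = percentages; round to 1 decimal place for readability
round(table(chickwts$feed)/nrow(chickwts)*100, digits=1)
```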

Quantiles, Percentiles, and the Summary

A quantile is a value computed from a collection of numeric measurements that indicates an


observation’s rank when compared to all the other present observations. For example, the
median is itself a quantile—it gives you a value below which half of the measurements lie—
it’s the 0.5th quantile. Alternatively, quantiles can be expressed as a percentile—this is
identical but on a “percent scale” of 0 to 100. In other words, the pth quantile is equivalent to
the (100 × p)th percentile. The median, therefore, is the 50th percentile.

There are a number of different algorithms that can be used to compute quantiles and
percentiles. They all work by sorting the observations from smallest to largest and using some
form of weighted average to find the numeric value that corresponds to p, so results may vary
slightly between statistical software packages.

Obtaining quantiles and percentiles in R is done with the quantile function. Using the eight
observations stored as the vector xdata, the 0.8th quantile (or 80th percentile) is confirmed as
3.6:

R> xdata <- c(2,4.4,3,3,2,2.2,2,4)

R> quantile(xdata, prob=0.8)

80%

3.6
As you can see, quantile takes the data vector of interest as its first argument, followed by a
numeric value supplied to prob, giving the quantile of interest. In fact, prob can take a numeric
vector of quantile values. This is convenient when multiple quantiles are desired.

R> quantile(xdata, prob=c(0,0.25,0.5,0.75,1))

0% 25% 50% 75% 100%

2.00 2.00 2.60 3.25 4.40

Here, you’ve used quantile to obtain what’s called the five-number summary of xdata,
comprising the 0th percentile (the minimum), the 25th percentile, the 50th percentile, the
75th percentile, and the 100th percentile (the maximum). The 0.25th quantile is referred to as
the first or lower quartile, and the 0.75th quantile as the third or upper quartile.
Also note that the 0.5th quantile of xdata is equivalent to the median; the median is the second
quartile, and the maximum value is the fourth quartile.
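R exposes the competing quantile algorithms mentioned earlier through quantile's type argument (types 1 through 9; type=7, an interpolation scheme, is the default). As a quick illustration with the same xdata vector, the default and the simple inverse-ECDF method (type=1) can disagree:

```r
xdata <- c(2, 4.4, 3, 3, 2, 2.2, 2, 4)

# type=7 (the default): linear interpolation between order statistics
quantile(xdata, prob=0.8, type=7)   # 3.6, as before

# type=1: inverse of the empirical CDF, no interpolation
quantile(xdata, prob=0.8, type=1)   # 4
```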
There are ways to obtain the five-number summary other than using quantile; when applied
to a numeric vector, the summary function also provides these statistics, along with the
mean, automatically.

R> summary(xdata)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.000 2.600 2.825 3.250 4.400
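Base R also offers fivenum, which computes Tukey's five-number summary directly. Note that its "hinges" follow a slightly different rule from quantile's default algorithm, so the upper hinge here (3.5) differs from the 3.25 reported by quantile above; this is exactly the sort of algorithmic variation mentioned earlier:

```r
xdata <- c(2, 4.4, 3, 3, 2, 2.2, 2, 4)

# Tukey's five numbers: min, lower hinge, median, upper hinge, max
fivenum(xdata)
# [1] 2.0 2.0 2.6 3.5 4.4
```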

To look at some examples using real data, let’s compute the lower and upper quartiles of the
weights of the chicks in the chickwts data frame.

R> quantile(chickwts$weight,prob=c(0.25,0.75))

25% 75%

204.5 323.5

This indicates that 25 percent of the weights lie at or below 204.5 grams and that 75 percent
of the weights lie at or below 323.5 grams.
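The distance between these two quartiles, the interquartile range, is another common measure of spread; R computes it with the IQR function, which by default uses the same quantile algorithm as above:

```r
# IQR = 75th percentile - 25th percentile = 323.5 - 204.5
IQR(chickwts$weight)
# [1] 119
```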
Let’s also compute the five-number summary (along with the mean) of the magnitude of the
seismic events off the coast of Fiji that occurred at a depth of less than 400 km, using the
quakes data frame.

R> summary(quakes$mag[quakes$depth<400])

Min. 1st Qu. Median Mean 3rd Qu. Max.

4.00 4.40 4.60 4.67 4.90 6.40

This begins to highlight how useful quantiles are for interpreting the distribution of numeric
measurements. From these results, you can see that most of the magnitudes of events at a
depth of less than 400 km lie around 4.6, the median, and the first and third quartiles are just
4.4 and 4.9, respectively. But you can also see that the maximum value is much further away
from the upper quartile than the minimum is from the lower quartile, suggesting a skewed
distribution, one that stretches more positively (in other words, to the right) from its center
than negatively (in other words, to the left). This notion is also supported by the fact that the
mean is greater than the median—the mean is being “dragged upward” by the larger values.
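You can confirm this mean-versus-median comparison directly in R; the gap between the two is a quick numeric check for right skew (the helper name fiji below is just illustrative):

```r
fiji <- quakes$mag[quakes$depth < 400]  # magnitudes of events shallower than 400 km

# a mean above the median suggests a right-skewed distribution
mean(fiji) > median(fiji)
# [1] TRUE
```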
