Statistics of Sleep in Mammalian Species
Introduction:
The data set contains values pertaining to sleep in sixty-two different mammalian species. The
data set includes the following for the mammalian species: body weight in kilograms, brain
weight in kilograms, number of hours per day of slow wave sleeping, number of hours per day of
paradoxical sleep, the total time of sleep per day, the maximum life span in years, gestation time
in days, predation index, sleep exposure index, and overall danger index. The predation index
can have values between one and five, with a value of one denoting that the mammal is least
likely to be preyed upon and a value of five denoting that the mammal is most likely to be preyed
upon. The sleep exposure index can have values between one and five, with a value of one
denoting that the mammal sleeps in a well-protected area and a value of five denoting the
mammal sleeps in an exposed area. The overall danger index can have values between one and
five, with a value of one denoting that the mammal is in a low amount of danger from other
animals, and a value of five denoting that the mammal is in a high amount of danger from other
animals.
The response variable of interest is the total hours of sleep per day for mammalian species. All
other variables are used as predictor variables to model the total hours of sleep per day. Sleep
plays a significant role in the lives of mammalian species. Though sleep is regarded as an
absolutely necessary function, the reason why mammals need sleep is not well understood.
Currently, there are four theories of why mammals sleep: the inactivity theory, the energy
conservation theory, the restorative theory, and the brain plasticity theory.[1] This data set applies
directly to the inactivity theory. The inactivity theory suggests that inactivity at night is an
adaptation that served a survival function by keeping organisms out of harm's way at times when
ACMS30600 Final Project Francisco Huizar
they would be particularly vulnerable.[1] Therefore, animals that were able to be still and quiet
during hours of vulnerability would have a higher chance of survival compared to animals that
remained active during hours of vulnerability. Animals that remained inactive would then not be
subject to predation, and through the process of natural selection, the behavioral strategy of
inactivity evolved into the function of sleeping.[1] Through data analysis on the provided data set,
a deeper understanding of factors playing a role in the inactivity theory can be observed. Using
Variable      Description
BodyWt        Body weight (kg)
BrainWt       Brain weight (kg)
NonDreaming   Slow wave sleep (hrs/day)
Dreaming      Paradoxical sleep (hrs/day)
TotalSleep    Total hours of sleep (hrs/day)
LifeSpan      Maximum life span (years)
Gestation     Gestation time (days)
Predation     Predation index (1-5)
              1 := Least likely to be preyed upon
              5 := Most likely to be preyed upon
Exposure      Sleep exposure index (1-5)
              1 := Least exposed sleeping environment
              5 := Most exposed sleeping environment
Danger        Overall danger index (1-5)
              1 := Least danger from other animals
              5 := Most danger from other animals
Table 1. Description of the data set.
Data:
The data was obtained from a research article observing the ecological and constitutional
correlates involved in sleep in mammals.[2] The data was downloaded as a text file from a
website. However, the original electronic data file provided on the website was obtained from the
Carnegie Mellon University statistics library database.[3] Unfortunately, there are several missing
values in the data due to gaps in knowledge during the time that the observations were made.
Therefore, the data analysis observations made from modeling this data set should be approached
with caution and further evidence or current data would be needed to fortify the results of this
model.
Regression Analysis:
data analysis. Because there are more than three explanatory variables, scatterplots were
created for the most relevant variables. Linear regression lines were applied to the
The first scatterplot shows the relationship between the total hours of sleep per day in
mammals and the overall danger index (Figure 1). This explanatory variable was
chosen because the overall danger index takes into account the predation
index and the exposure index; thus, this scatterplot serves as a proxy for the predation
index, exposure index, and danger index variables and their correlation with the total hours
of sleep. The scatterplot shows clearly that mammals
with a lower score on the danger index sleep for greater amounts of time.
This observation makes sense because a mammal that is less likely to be in danger
would have more time available during the day to dedicate to sleep.
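The regression line drawn through such a scatterplot can be sketched as follows. This is a minimal illustration in Python, not the code used for this report; the function name and the sample values are hypothetical and are not taken from the sleep data set.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) of the ordinary least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical illustration values, NOT taken from the sleep data set.
danger = [1, 2, 3, 4, 5]
total_sleep = [14.0, 12.0, 10.0, 8.0, 6.0]

intercept, slope = least_squares_line(danger, total_sleep)
print(f"TotalSleep = {intercept:.2f} + ({slope:.2f}) * Danger")
```

A negative slope in such a fit corresponds to the pattern described above: mammals with a higher danger index tend to sleep less.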
The second scatterplot shows the relationship between the total hours of sleep per day in
mammals and the maximum life span of the mammal (Figure 2). This explanatory variable was
chosen because, according to the inactivity theory, sleep is crucial to an animal's survival.
Therefore, an animal that sleeps longer would be expected to have a longer maximum life span. Through
observation of this scatterplot, it is clear that as the maximum life span of
a mammal increases, the total hours of sleep per day decrease. This would seemingly
contradict the inactivity theory; however, the inactivity theory only applies during periods
of time when the animal is most vulnerable. Essentially, an animal that is inactive only
during periods of time when it is most vulnerable will survive more often.
Sleeping throughout the entire day would render an animal subject to predation because it
is unable to protect itself from predators if it sleeps for a larger part of the day. Thus, the
The third scatterplot shows the relationship between the total hours of sleep per day in
mammals and the gestation time of the mammalian species (Figure 3). This
explanatory variable was chosen because pregnant mothers are expected to require
observation of this scatterplot, it is clear that as the number of days
required for gestation in mammalian species increases, the total hours of sleep per day decrease.
This observation makes sense because mammals that require fewer days for gestation
would be required to sleep more to allow the offspring to develop faster and be birthed
sooner.
The correlations among the explanatory variables were then computed in order to check whether
multicollinearity is present amongst the variables (Table 2). The most highly correlated
explanatory variables are body weight and brain weight with a correlation of 0.93416384,
followed by the correlation of 0.9160424 between predation index and danger index, and
finally the correlation of 0.7872031 between exposure index and danger index. The
correlations observed do make sense because brain weight is a part of body weight, and
the overall danger index value is evaluated taking into account the exposure index and
factors (VIF) were calculated. The VIF for body weight is 25.83416 and the VIF for
overall danger index is 25.64432. These VIFs are greater than the rule-of-thumb
this, a principal components model will be used in the linear regression analysis because
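The multicollinearity check described above can be sketched in Python as follows. This is an illustrative sketch, not the code used for this report. It assumes the standard identity that the VIF of predictor j equals the j-th diagonal entry of the inverse of the predictors' correlation matrix; all function names and sample values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Pairwise Pearson correlations between predictor columns."""
    return [[pearson(a, b) for b in columns] for a in columns]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vifs(columns):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]


# Hypothetical predictor columns, NOT taken from the sleep data set.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
print(correlation_matrix([x1, x2]))
print(vifs([x1, x2]))
```

With only two predictors the VIF reduces to 1 / (1 - r^2), so a pairwise correlation above 0.9, like those reported here, already implies a VIF above 5.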
Figure 1. Scatterplot of Total Sleep vs. Overall Danger Index in mammalian species (y-axis: total hours of sleep/day; x-axis: overall danger index, 1 to 5)
Figure 2. Scatterplot of Total Sleep vs. Maximum Life Span in mammalian species (y-axis: total hours of sleep/day; x-axis: 0 to 100)
Figure 3. Scatterplot of Total Sleep vs. Gestation Time in mammalian species (y-axis: total hours of sleep/day)
In order to optimize variable selection, the principal components data reduction method
was applied to the model. The principal components method was used because the
predictor variables had been shown to exhibit multicollinearity. The principal component
analysis is shown in Table 3 and the corresponding scree plot is shown in Figure 4.
Through observation of the scree plot, the most significant elbow occurs at PC3.
However, there is still a notable reduction in variance when moving from PC3 to PC4.
Thus, the optimal number of PCs to include is four. However, due to missing values, the
optimal model with this principal component analysis cannot be constructed directly because
the variable lengths would be different. In order to account for this, a new vector of
TotalSleep was created and indexed such that its values aligned with the X-scaling of the
PC analysis. The R² for this optimal model is 0.9903 and the adjusted R² (Ra²) is 0.9893, both of which
are only slightly lower than the values of the saturated model.
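Two steps in the paragraph above lend themselves to a short sketch: reading the scree plot as cumulative proportions of variance, and re-indexing the response so that it lines up with the rows that have no missing predictor values. The Python below is an illustration under stated assumptions, not the code used for this report; the function names and all numbers are hypothetical.

```python
def variance_explained(variances):
    """Cumulative proportion of total variance for components in order."""
    total = sum(variances)
    cumulative, running = [], 0.0
    for v in variances:
        running += v
        cumulative.append(running / total)
    return cumulative


def complete_cases(rows):
    """Indices of rows that contain no missing (None) values."""
    return [i for i, row in enumerate(rows) if all(v is not None for v in row)]


# Hypothetical component variances, NOT the values behind Figure 4.
print(variance_explained([3.0, 2.0, 1.0]))

# Hypothetical predictor rows with missing entries marked as None.
predictors = [[1.0, 6.6], [None, 4.5], [3.4, 14.0], [0.9, None]]
total_sleep = [3.3, 8.3, 12.5, 16.5]

# Keep only the responses whose predictor rows are complete, so the
# response vector has the same length as the component scores.
keep = complete_cases(predictors)
new_total_sleep = [total_sleep[i] for i in keep]
print(keep, new_total_sleep)
```

Filtering the response with the same index list used for the predictors is what keeps the two aligned; dropping missing values from each variable separately would not.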
To test for influential observations and outliers in this new model, standardized residuals
and Cook's distance values were calculated. In this model, the 22nd observation proved to
be an outlier with a standardized residual of -3.01249654. The 50th percentile of the
F-distribution with the four predictor variables and forty-two observations from the
newTotalSleep response variable is 0.8863808. Any observation with a Cook's distance
greater than this value is said to be influential. The 22nd observation proved to be an
influential value as well, with a Cook's distance of 0.9390591. Thus, the model
needs to be refit without the influential observation. The summary of this final model is
displayed in Figure 5.
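The influence check above can be sketched as follows. This is a minimal Python illustration, not the code used for this report. It assumes the usual formula for Cook's distance in terms of the raw residual, the leverage, the mean squared error, and the number of estimated parameters. The threshold 0.8863808 and the distance 0.9390591 are the values reported above; every other number is hypothetical.

```python
def cooks_distance(residual, leverage, mse, n_params):
    """Cook's distance: D = e^2 / (p * MSE) * h / (1 - h)^2."""
    return (residual ** 2 / (n_params * mse)) * leverage / (1.0 - leverage) ** 2


def flag_influential(distances, threshold):
    """Zero-based indices of observations whose distance exceeds the threshold."""
    return [i for i, d in enumerate(distances) if d > threshold]


# Hypothetical residual, leverage, and MSE for a single observation.
print(cooks_distance(residual=1.0, leverage=0.5, mse=1.0, n_params=2))

# Threshold reported in the text (median of the F-distribution).
threshold = 0.8863808
# Only 0.9390591 comes from the report; the other distances are hypothetical.
distances = [0.01, 0.9390591, 0.2]
print(flag_influential(distances, threshold))
```

The indices returned are zero-based positions in the list passed in, so they would need to be mapped back to observation numbers before refitting.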
The final model used for the data set is the principal component model with the 22nd
observation removed. This is the best model for the data set because the principal
component analysis removed the multicollinearity that was present amongst the predictor
variables in the original saturated model. Furthermore, this final model is free of any
influential observations and free of outliers that would potentially skew a prediction made
by the model. This model is shown to have an excellent fit with an R² of 0.9926 and an adjusted R² (Ra²)
of 0.9918 (Figure 5).
A plot of the residuals versus the fitted values is shown in Figure 6. This plot shows no
observable pattern between the residuals and the fitted values: the points
are scattered randomly around zero. A normal
plot of the residuals is shown in Figure 7. Through observation of this plot, there are no
major departures from linearity; therefore, the residuals can be considered approximately normal. Further
investigation led to the plotting of the histogram of the residuals (Figure 8). This
histogram does show slight right-skewness, which is faintly visible in the light tailing
present in Figure 7. However, the skew is not major enough to require that a different model
be constructed.
In order to validate the model, the bootstrap validation method was used. The
bootstrap evaluation R² was calculated to be 0.9905545 from 100 bootstrap samples. This
evaluation R² is slightly lower than the final model R² of 0.9926. If the final model were
overfit, each bootstrap model created would perform more poorly on the original data set, and
the evaluation R² would be lower than the final model R². This is the case for the bootstrap
analysis run, which indicates some overfitting. However, the percent difference between the
final model R² and the evaluation R² is 0.21%; thus, the overfit of the final model is
minute and would not warrant the creation of a new model.
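The bootstrap validation described above can be sketched as follows. This Python illustration is not the code used for this report: it uses a one-predictor least-squares line in place of the principal component model, and the data, seeds, and function names are hypothetical. The percent difference is assumed to be taken relative to the final model R².

```python
import random


def fit_line(x, y):
    """Ordinary least-squares (intercept, slope) for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def r_squared(x, y, intercept, slope):
    """R^2 of a given line evaluated on the data (x, y)."""
    my = sum(y) / len(y)
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    sst = sum((b - my) ** 2 for b in y)
    return 1.0 - sse / sst


def bootstrap_evaluation_r2(x, y, n_boot=100, seed=0):
    """Fit on each bootstrap resample, score on the ORIGINAL data, average."""
    rng = random.Random(seed)
    n = len(x)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [x[i] for i in idx]
        by = [y[i] for i in idx]
        if len(set(bx)) < 2:  # skip degenerate resamples with one x value
            continue
        a, b = fit_line(bx, by)
        scores.append(r_squared(x, y, a, b))
    return sum(scores) / len(scores)


def percent_difference(final_r2, evaluation_r2):
    """Percent drop from the final model R^2 to the evaluation R^2."""
    return 100.0 * (final_r2 - evaluation_r2) / final_r2


# Hypothetical data, NOT taken from the sleep data set.
noise = random.Random(1)
x = list(range(1, 21))
y = [2.0 * xi + 1.0 + noise.gauss(0.0, 1.5) for xi in x]

final_r2 = r_squared(x, y, *fit_line(x, y))
evaluation_r2 = bootstrap_evaluation_r2(x, y, n_boot=100)
print(final_r2, evaluation_r2)

# Values reported in the text.
print(round(percent_difference(0.9926, 0.9905545), 2))
```

Because the least-squares fit on the original data minimizes the squared error on that data, a model fit to a resample can never score a higher R² there, so the evaluation R² is always at or below the final R²; the size of the gap is what signals overfitting.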
Figure 4. Scree plot of the principal component analysis (y-axis: variances; x-axis: principal components 1 to 9)
Figure 6. Plot of residuals versus fitted values (y-axis: Residuals; x-axis: Fitted Values)
Figure 7. Normal plot of the residuals (y-axis: Sample Quantiles; x-axis: Theoretical Quantiles)
Figure 8. Histogram of the residuals (y-axis: Frequency; x-axis: residuals)
Results:
The confidence intervals for the fitted values are displayed in Table 4 and the prediction intervals
for the fitted values are displayed in Table 5. The confidence intervals for the slope parameters
are displayed in Figure 9. Table 4 depicts the 95% confidence intervals for the mean total sleep
in mammals for the principal component values. This table is useful for observing the expected
total hours of sleep of any mammalian species. For example, the first fitted value of the
confidence interval is 8.553641. There is 95% confidence that the true mean of total
hours of sleep required by mammalian species with these principal component values lies in the interval [8.298168, 8.809114].
Table 5 depicts the 95% prediction intervals for individual mammalian species with given
principal component values. Using the fifth predicted value of 5.285691, there is 95% confidence
that an individual mammalian species with these values will require a total number of hours of sleep in the
interval [4.383291, 6.188091]. Figure 9 depicts the 95% confidence intervals for the slope
parameters of the final model. Using the first slope parameter of -2.07377, there is 95%
confidence that the true slope parameter lies in the interval [-2.144123, -2.003424]. These
inferences are important to the research question, regarding total hours of sleep required for
mammalian species, because the confidence intervals are likely to contain the true values of the
population parameters, while the model only provides sample estimates of those parameters. Due to
randomness of samples, the true value of the population parameters will fall between confidence
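The difference between the confidence intervals for the mean and the prediction intervals for an individual species can be sketched as follows. This Python illustration is not the code used for this report. It assumes the standard linear-model formulas and takes the t critical value as an input rather than computing it; all names and numbers are hypothetical and are not the values in Tables 4 and 5.

```python
import math


def mean_confidence_interval(fit, se_fit, t_crit):
    """Interval for the MEAN response at given predictor values."""
    half_width = t_crit * se_fit
    return fit - half_width, fit + half_width


def prediction_interval(fit, se_fit, residual_sd, t_crit):
    """Interval for an INDIVIDUAL response: adds the residual variance."""
    half_width = t_crit * math.sqrt(se_fit ** 2 + residual_sd ** 2)
    return fit - half_width, fit + half_width


# Hypothetical inputs, NOT taken from the report's tables.
fit, se_fit, residual_sd, t_crit = 8.0, 0.3, 0.4, 2.0

print(mean_confidence_interval(fit, se_fit, t_crit))
print(prediction_interval(fit, se_fit, residual_sd, t_crit))
```

The prediction interval is always the wider of the two because it accounts for the scatter of individual species around the mean in addition to the uncertainty in the estimated mean itself.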
Conclusion:
The principal component model satisfactorily describes the total hours of sleep for mammalian
species. This is because the model has a high R² value of 0.9926 and shows very little overfitting
under bootstrap validation. Furthermore, the outliers and influential observations of the
original model were removed. Lastly, the final model removes multicollinearity because the
model was constructed using the principal components method. Using this model, there seems to
be some support for the inactivity theory because of the goodness of fit; however, there may be
other factors that are not being considered. For example, as a counterargument to the inactivity
theory, an animal that remains awake and conscious is sure to have a greater survival rate due to
its ability to react to an attacking predator. Further data will need to be collected to
Though the model showed a good fit, it can be improved. One limitation is that there were
missing data values for some mammalian species; this was partly accounted for
by re-indexing the final model so that mammals with missing values were excluded.
Furthermore, the data set was collected in 1976; thus, there is sure to be more data available on
the topic to further add to the completeness of the model. A possible predictor variable that
would improve the fit of the model is one that
evaluates the number of predators in the given species' environment. This would play a large
role in the inactivity theory because if an environment has more predators, the rate of survival of
an animal lower on the food chain would decrease significantly. Further questions regarding the
model would be: How would data taken today differ from the data taken in 1976? There has
interactions between species in the past 40 years. Thus, would new data show similar trends to
References
http://healthysleep.med.harvard.edu/healthy/matters/benefits-of-sleep/why-do-we-sleep.
doi:10.1126/science.982039.
http://www.statsci.org/data/general/sleep.html.