Predicting Location in Drug Overdose Related Deaths
Holly Stephens & Jonathan Cabrera
Introduction
We chose the Accidental Drug Related Deaths dataset for our final project. Because news stories about the opioid crisis have become so
prevalent in recent years, we were naturally drawn to overdose deaths involving opioids. After some exploration, we decided
to predict the likelihood of an overdose death occurring in a major city.
Data
The Accidental Drug Related Deaths dataset comes from a government site and covers accidental deaths associated with drug
overdose in Connecticut from 2012 to 2018. It was collected by the Office of the Chief Medical Examiner and draws on the
toxicity report, death certificate, and scene investigation for each overdose death. It should be noted that Connecticut is also home to the
headquarters of Purdue Pharma L.P., a company considered by many to be one of the drivers of the opioid crisis.
## [1] 41
nrow(dat)
## [1] 5105
mean(na.omit(dat$Age))
## [1] 41.96492
summary(dat$Sex)
summary(dat$Race)
summary(dat$DeathCity)
summary(dat$MannerofDeath)
# getting a visual
par(mfrow=c(1,3))
hist(na.omit(dat$Age), main="individuals by age", xlab="age", col="lightblue")
plot(dat$Sex, main="individuals by sex", xlab="sex", col="lightblue")
plot(dat$DeathCity, main="individuals by death city", xlab="death city", col="lightblue")
This gives us a clearer picture of the demographic background of a typical individual in our dataset: a white man in his early 40s.
We can also observe that the reported overdose deaths cluster around 5 cities.
length(unique(dat$COD))
## [1] 3193
To home in on opioid involvement in a given overdose, let’s instead focus on the substance attributes, which indicate whether a certain drug was
detected in an individual by the Medical Examiner. We will consider an opioid to be any drug listed as an opiate or narcotic on the Addictions and
Recovery site.
# adding an opioid involvement attribute to our dataframe, which will be set to a value of "Yes"
# if a value of "Y" is found in any of the following substance attributes
dat$opioidInvolved = factor(ifelse(
  (dat$Heroin == "Y" | dat$Fentanyl == "Y" | dat$FentanylAnalogue == "Y" |
   dat$Oxycodone == "Y" | dat$Oxymorphone == "Y" | dat$Hydrocodone == "Y" |
   dat$Methadone == "Y" | dat$Tramad == "Y" | dat$Morphine_NotHeroin == "Y" |
   dat$Hydromorphone == "Y"),
  "Yes", "No"))
A breakdown of how many entries in our data involved some kind of opioid:
summary(dat$opioidInvolved)
## No Yes
## 616 4489
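The same kind of flag can be built without a long chain of `|` comparisons. The following is only a sketch on a made-up data frame (`toy` and `opioid_cols` are hypothetical names, and only three of the substance columns are included for brevity); it marks a row as opioid-involved when at least one of the listed columns equals "Y" and treats missing values as not detected.

```r
# Toy data standing in for `dat`; the values are invented for illustration.
toy <- data.frame(
  Heroin    = c("Y", "",  "",  ""),
  Fentanyl  = c("",  "Y", "",  ""),
  Oxycodone = c("",  "",  "",  "Y"),
  Cocaine   = c("Y", "",  "Y", ""),
  stringsAsFactors = FALSE
)

# Columns treated as opioids (a shortened, hypothetical list).
opioid_cols <- c("Heroin", "Fentanyl", "Oxycodone")

# Comparing a data frame to "Y" gives a logical matrix, one column per substance.
is_y <- toy[opioid_cols] == "Y"
is_y[is.na(is_y)] <- FALSE  # treat missing values as "not detected"

# A row involves an opioid if any opioid column was flagged.
toy$opioidInvolved <- factor(ifelse(rowSums(is_y) > 0, "Yes", "No"),
                             levels = c("No", "Yes"))
summary(toy$opioidInvolved)
```

Keeping the column names in one vector makes it easier to add or drop a substance later without editing the condition itself.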
As we can see, opioids played a role in about 88% of the overdose deaths reported in our dataset (4489 of 5105). Because this percentage is so high,
it doesn’t seem like it would be valuable to use this as our target variable.
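The share can be checked directly from the counts printed above. This short sketch re-enters the two counts by hand (the vector name `counts` is ours) and converts them to percentages.

```r
# Counts copied from the summary(dat$opioidInvolved) output above.
counts <- c(No = 616, Yes = 4489)

# Convert counts to proportions of the total.
share <- prop.table(counts)
round(100 * share, 1)
```

A target this unbalanced would let a model score well simply by always predicting "Yes", which is why it makes a poor choice of target.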
Since white is the most frequently occurring race in our set of opioid overdose deaths, let’s add a binary value for this attribute as well.
# adding an isWhite attribute to our dataframe, which will be set to a value of "Yes"
# if a value of "White" is found in the race attribute
dat$isWhite = factor(ifelse(
  (dat$Race == "White"), "Yes", "No"))
We saw earlier in our data exploration that there were 7 entries missing a binary sex value, the majority of which had a blank value for the
attribute. Looking at the other attribute values in those rows, we can see that much of the other information is blank as well.
# counting the blank values in each column for the rows that had an undetermined sex attribute
apply(dat[dat$Sex != 'Male' & dat$Sex != 'Female',],2,function(x) sum(x == ''))
## ID Date DateType
## 0 1 1
## Age Sex Race
## NA 6 3
## ResidenceCity ResidenceCounty ResidenceState
## 3 3 3
## DeathCity DeathCounty Location
## 3 3 3
## LocationifOther DescriptionofInjury InjuryPlace
## 7 2 1
## InjuryCity InjuryCounty InjuryState
## 2 2 5
## COD OtherSignifican Heroin
## 0 5 6
## Cocaine Fentanyl FentanylAnalogue
## 5 2 6
## Oxycodone Oxymorphone Ethanol
## 6 7 5
## Hydrocodone Benzodiazepine Methadone
## 6 5 6
## Amphet Tramad Morphine_NotHeroin
## 7 6 7
## Hydromorphone Other OpiateNOS
## 6 7 7
## AnyOpioid MannerofDeath DeathCityGeo
## 3 1 0
## ResidenceCityGeo InjuryCityGeo opioidInvolved
## 0 0 0
## isWhite
## 0
# for later comparison, a dataframe containing only the rows of overdoses with no opioid involvement
datNOp = dat[dat$opioidInvolved == "No",]
What is the most frequently occurring manner of death in opioid-related overdoses? Does it differ from non-opioid-related deaths?
We can see no difference between these subsets of our data: nearly all entries in our dataset are accidental overdoses.
Are there any spikes in opioid overdoses depending on the time of year?
##
## 01 02 03 04 05 06 07 08 09 10 11 12
## 336 354 373 346 370 387 394 337 369 405 415 403
We can see the number of opioid related deaths in each month is distributed rather evenly, with a slight increase in the later months of a given
year.
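One way to put the "rather evenly" impression on firmer footing is a chi-squared goodness-of-fit test against equal monthly counts. This is a sketch of a check we did not run in the original analysis; the counts are re-entered by hand from the table above, and treating every month as equally likely ignores differences in month length.

```r
# Monthly opioid-related death counts copied from the table above.
monthly <- c("01" = 336, "02" = 354, "03" = 373, "04" = 346,
             "05" = 370, "06" = 387, "07" = 394, "08" = 337,
             "09" = 369, "10" = 405, "11" = 415, "12" = 403)

# With a single vector, chisq.test() tests against equal expected counts.
res <- chisq.test(monthly)
res$statistic  # chi-squared statistic
res$parameter  # degrees of freedom (number of months minus 1)
res$p.value    # small values suggest the months are not equally likely
```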
We saw previously that our typical overdose victim was a middle-aged white man. Here we explore other possible correlations, such as location. A
number of our overdose deaths occurred in large cities. There is a clear cutoff point after about the top 6 cities, where the death count drops
below half of the maximum.
## No Yes
## 3232 1873
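We can see here that about 37% of the overdose deaths in our dataset (1873 of 5105) occurred in these top 5 cities. Because the two classes are
far less lopsided than they were for opioid involvement, this seems like a more interesting candidate for our target variable.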
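The code that built this target is not shown in the report, so the following is only a sketch of one way it could be done, on made-up city labels. The names `toy`, `n_top`, and `top_cities` are hypothetical; the column name `MajorCity` matches the model formula reported later.

```r
# Toy death-city column; the city labels and counts are invented.
toy <- data.frame(
  DeathCity = c(rep("A", 6), rep("B", 5), rep("C", 4), rep("D", 2), "E", "F"),
  stringsAsFactors = FALSE
)

n_top <- 3  # the report uses the top 5 cities; 3 keeps the toy example small

# Rank cities by number of deaths and keep the names of the largest ones.
city_counts <- sort(table(toy$DeathCity), decreasing = TRUE)
top_cities <- names(city_counts)[seq_len(n_top)]

# Flag each death as occurring in a top city or not.
toy$MajorCity <- factor(ifelse(toy$DeathCity %in% top_cities, "Yes", "No"),
                        levels = c("No", "Yes"))
table(toy$MajorCity)
```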
library(rpart)
library(rpart.plot)
library(maptree)
set.seed(123)
Here we are using Age, MannerofDeath, Sex, and isWhite as our predictors. We determined these to be the most useful predictors in our dataset
because all other factor attributes have far too many levels. The idea of using the classification tree is that it will aid us in discerning which
attributes are most relevant to predicting the location of an overdose death.
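The split and fitting code is not shown here, so the following is a sketch on simulated data rather than our actual code. It assumes a 75/25 train/test split (the reported training size of 3828 is about 75% of the 5105 rows) and reuses the name `tr_dat` from the model call below; `sim`, `te_dat`, and the simulated probabilities are made up for illustration.

```r
library(rpart)
set.seed(123)

# Simulated stand-in for the real data; all values are invented.
n <- 400
sim <- data.frame(
  Age     = round(runif(n, 18, 70)),
  Sex     = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  isWhite = factor(sample(c("No", "Yes"), n, replace = TRUE, prob = c(0.25, 0.75)))
)
# Made-up relationship: non-white individuals are more likely to be in a major city.
p_city <- ifelse(sim$isWhite == "No", 0.65, 0.30)
sim$MajorCity <- factor(ifelse(runif(n) < p_city, "Yes", "No"), levels = c("No", "Yes"))

# Assumed 75/25 split into training and test rows.
tr_rows <- sample(seq_len(n), size = floor(0.75 * n))
tr_dat <- sim[tr_rows, ]
te_dat <- sim[-tr_rows, ]

# Fit a classification tree on the training rows.
fit <- rpart(MajorCity ~ Age + isWhite + Sex, data = tr_dat, method = "class")

# Predict classes for the held-out rows and compare with the true labels.
predicted <- predict(fit, newdata = te_dat, type = "class")
actual <- te_dat$MajorCity
table(predicted, actual)
mean(actual == predicted)
```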
This is an unexpected result: even though white individuals account for most of the drug overdose deaths reported in our dataset, they are less
likely than non-white individuals to have died in one of our top cities.
## Call:
## rpart(formula = MajorCity ~ Age + MannerofDeath + isWhite + Sex,
## data = tr_dat, method = "class")
## n= 3828
##
## CP nsplit rel error xerror xstd
## 1 0.1777778 0 1.0000000 1.0000000 0.02134509
## 2 0.0100000 1 0.8222222 0.8222222 0.02031748
##
## Variable importance
## isWhite
## 100
##
## Node number 1: 3828 observations, complexity param=0.1777778
## predicted class=No expected loss=0.3644201 P(node) =1
## class counts: 2433 1395
## probabilities: 0.636 0.364
## left son=2 (3000 obs) right son=3 (828 obs)
## Primary splits:
## isWhite splits as RL, improve=172.0408000, (0 missing)
## Age < 30.5 to the left, improve= 9.5951030, (3 missing)
## MannerofDeath splits as RRLL-R, improve= 1.9867830, (0 missing)
## Sex splits as LLRL, improve= 0.2077175, (0 missing)
## Surrogate splits:
## Sex splits as RLLL, agree=0.784, adj=0.001, (0 split)
##
## Node number 2: 3000 observations
## predicted class=No expected loss=0.2856667 P(node) =0.7836991
## class counts: 2143 857
## probabilities: 0.714 0.286
##
## Node number 3: 828 observations
## predicted class=Yes expected loss=0.3502415 P(node) =0.2163009
## class counts: 290 538
## probabilities: 0.350 0.650
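The `rel error` column in the summary can be reproduced from the class counts printed for each node. The arithmetic below re-enters those counts by hand; the variable names are ours.

```r
# Root node: 3828 training rows, 1395 of them in the "Yes" class.
n_root <- 3828
root_err <- 1395 / n_root          # error from always predicting "No"

# After the isWhite split:
#   node 2 predicts "No"  and misclassifies its 857 "Yes" rows
#   node 3 predicts "Yes" and misclassifies its 290 "No" rows
split_err <- (857 + 290) / n_root

# rpart reports error relative to the root node error.
rel_err <- split_err / root_err
c(root_err = root_err, split_err = split_err, rel_err = rel_err)
```

The single split therefore lowers the training misclassification rate from about 36.4% to about 30.0%, which is the 0.822 relative error shown in the CP table.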
Interestingly enough, we see age appear in the tree run on our test data where it did not for our training data. It would seem that race and age are
the best predictors for determining the location of an overdose death. Our typical individual overdosing in our locations of interest is a non-white
person under the age of 29.
## actual
## predicted No Yes
## No 693 311
## Yes 106 167
mean(actual == predicted)
## [1] 0.6734534
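To put this accuracy in context, it helps to compare it with a baseline that always predicts the majority class. The sketch below rebuilds the confusion matrix by hand from the counts above (the object name `cm` is ours) and derives the accuracy, the baseline, and the share of actual major-city deaths that the tree caught.

```r
# Confusion matrix copied from the output above (rows: predicted, columns: actual).
cm <- matrix(c(693, 106, 311, 167), nrow = 2,
             dimnames = list(predicted = c("No", "Yes"),
                             actual    = c("No", "Yes")))

# Share of correct predictions (the diagonal of the matrix).
accuracy <- sum(diag(cm)) / sum(cm)

# Accuracy of always predicting the most common actual class.
baseline <- max(colSums(cm)) / sum(cm)

# Share of actual "Yes" cases that were predicted "Yes".
sensitivity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])

round(c(accuracy = accuracy, baseline = baseline, sensitivity = sensitivity), 3)
```

The tree beats the always-"No" baseline (roughly 0.63) by only a few points, and it identifies only about 35% of the deaths that actually occurred in a major city.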
Conclusions
We learned that white men in their 40s are the most likely overdose victims. Additionally, we saw that dense city populations correlated with higher
concentrations of opioid-related overdose deaths. Contrary to what these initial findings suggest, however, we found that white individuals, the group
most represented among overdose deaths, were not the group most likely to die in the cities with the most overdose deaths.