Predicting Location in Drug Overdose Related Deaths

Holly Stephens & Jonathan Cabrera
Introduction
We chose the Accidental Drug Related Deaths dataset for our final project. Because news coverage of the opioid crisis has become so widespread
in recent years, we were naturally drawn to overdose deaths involving opioids. After some exploration, we decided to predict the likelihood of
an overdose death occurring in a major city.
Data
The Accidental Drug Related Deaths dataset comes from a government site and covers accidental deaths associated with drug overdose in
Connecticut from 2012 to 2018. It was collected by the Office of the Chief Medical Examiner and draws on the toxicology report, death
certificate, and scene investigation for each overdose death. It is worth noting that Connecticut is also home to the headquarters of
Purdue Pharma L.P., a company considered by many to be one of the drivers of the opioid crisis.
Initial Data Exploration

A first look at our dataset:
# how large is our data?

ncol(dat)
## [1] 41
nrow(dat)
## [1] 5105
# what types of attributes exist in our dataframe?

unique(sapply(dat,class))
## [1] "factor" "integer"
Exploring the demographic distribution:
mean(na.omit(dat$Age))
## [1] 41.96492
summary(dat$Sex)
##          Female    Male Unknown
##       6    1325    3773       1
summary(dat$Race)
##                    Asian Indian           Asian, Other
##              13              14                     18
##           Black         Chinese               Hawaiian
##             433               2                      1
## Hispanic, Black Hispanic, White Native American, Other
##              24             561                      1
##           Other         Unknown                  White
##              11              23                   4004
summary(dat$DeathCity)
## HARTFORD NEW HAVEN WATERBURY BRIDGEPORT

## 563 374 368 341
## NEW BRITAIN MERIDEN BRISTOL NORWICH
## 227 145 144 144
## NEW LONDON DANBURY TORRINGTON MANCHESTER
## 137 131 114 102
## MIDDLETOWN ENFIELD STAMFORD EAST HARTFORD
## 100 81 76 71
## MILFORD NORWALK WEST HAVEN DERBY
## 70 69 69 65
## STRATFORD SOUTHINGTON WILLIMANTIC VERNON
## 55 53 51 48
## HAMDEN NEW MILFORD GROTON NAUGATUCK
## 46 44 42 42
## WALLINGFORD BRANFORD WEST HARTFORD WETHERSFIELD
## 36 35 31 30
## PUTNAM SHELTON EAST HAVEN NEWINGTON
## 28 28 27 26
## WINDSOR ANSONIA PLAINVILLE GREENWICH
## 24 23 23 22
## FARMINGTON WATERTOWN GLASTONBURY WINDSOR LOCKS
## 21 21 20 20
## CLINTON BERLIN PLAINFIELD TERRYVILLE
## 19 18 18 18
## EAST WINDSOR MARLBOROUGH NORTH HAVEN BETHEL
## 17 16 16 15
## EAST HAMPTON SEYMOUR JEWETT CITY SHARON
## 15 15 14 14
## STAFFORD SPRINGS TRUMBULL FAIRFIELD PORTLAND
## 14 14 13 13
## TOLLAND WINSTED MOOSUP ROCKY HILL
## 12 12 11 11
## UNCASVILLE WESTBROOK BLOOMFIELD CANTON
## 11 11 10 10
## COLCHESTER LEDYARD NORTH CANAAN OLD SAYBROOK
## 10 10 10 10
## WATERFORD COVENTRY CROMWELL LEBANON
## 10 9 9 9
## MADISON MANSFIELD NEWTOWN BURLINGTON
## 9 9 9 8
## CHESHIRE COLUMBIA DEEP RIVER ELLINGTON
## 8 8 8 8
## ESSEX KILLINGLY MIDDLEBURY NIANTIC
## 8 8 8 8
## SUFFIELD WINDHAM DANIELSON HEBRON
## 8 8 7 7
## MONROE SIMSBURY WINCHESTER ASHFORD
## 7 7 7 6
## DAYVILLE GUILFORD NORTH BRANFORD (Other)
## 6 6 6 330
summary(dat$MannerofDeath)
##          accident Accident ACCIDENT  Natural  Pending
##       10       13     5066        1        1       14
# getting a visual
par(mfrow=c(1,3))
hist(na.omit(dat$Age), main="individuals by age", xlab="age", col="lightblue")
plot(dat$Sex, main="individuals by sex", xlab="sex", col="lightblue")
plot(dat$DeathCity, main="individuals by death city", xlab="death city", col="lightblue")
This gives us a clearer picture of the typical individual in our dataset: a white man in his early 40s (the mean age is about 42, and 3773 of
the 5105 individuals are male and 4004 are white). We can also see that the reported overdose deaths cluster around 5 cities.
Data Cleaning and Preprocessing

Looking at the number of unique values in the cause of death attribute, we can see that it is far too granular to use directly.
length(unique(dat$COD))
## [1] 3193
To home in on opioid involvement in a given overdose, let’s instead focus on the substance attributes, which indicate whether a certain drug was
detected in an individual by the Medical Examiner. We will consider an opioid to be any drug listed as an opiate or narcotic on the Addictions and
Recovery site.
# adding an opioid involvement attribute to our dataframe, set to "Yes" if a value of "Y"
# is found in any of the following substance attributes
dat$opioidInvolved = factor(ifelse(
  (dat$Heroin == "Y" | dat$Fentanyl == "Y" | dat$FentanylAnalogue == "Y" |
   dat$Oxycodone == "Y" | dat$Oxymorphone == "Y" | dat$Hydrocodone == "Y" |
   dat$Methadone == "Y" | dat$Tramad == "Y" | dat$Morphine_NotHeroin == "Y" |
   dat$Hydromorphone == "Y"),
  "Yes", "No"))
A breakdown of how many entries in our data involved some kind of opioid:
summary(dat$opioidInvolved)
## No Yes
## 616 4489
As we can see, opioids played a role in about 88% of the overdose deaths reported in our dataset (4489 of 5105). Because the classes are so
imbalanced, opioid involvement would not make a very informative target variable.
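The opioid flag and the class shares can also be computed more compactly. The following is a minimal sketch on a made-up four-row data frame; the `toy` rows and the shortened `opioidCols` list are assumptions for illustration only, while the counts 616 and 4489 are copied from the summary above.

```r
# toy stand-in for dat: substance columns hold "Y" or "" as in the real data
toy = data.frame(
  Heroin    = c("Y", "", "", ""),
  Fentanyl  = c("", "", "Y", "Y"),
  Methadone = c("", "", "", ""),
  Cocaine   = c("", "Y", "Y", ""),
  stringsAsFactors = TRUE
)

# columns treated as opioids (shortened list for the toy example)
opioidCols = c("Heroin", "Fentanyl", "Methadone")

# a row is opioid-involved if at least one opioid column equals "Y"
anyOpioid = rowSums(toy[, opioidCols] == "Y") > 0
toy$opioidInvolved = factor(ifelse(anyOpioid, "Yes", "No"))

# class shares for the toy data, and for the real counts reported above
toyShare = prop.table(table(toy$opioidInvolved))
realShare = prop.table(c(No = 616, Yes = 4489))
round(100 * realShare, 1)
```

Listing the columns once in a vector makes it easier to add or remove a substance later than editing a long chain of `|` comparisons.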
Since white is the most frequently occurring race in our set of opioid overdose deaths, let’s add a binary value for this attribute as well.
# adding an isWhite attribute to our dataframe, set to "Yes" if the race attribute is "White"
dat$isWhite = factor(ifelse(dat$Race == "White", "Yes", "No"))
We saw earlier in our data exploration that 7 entries were missing a binary sex value: 6 were blank and 1 was marked Unknown. Looking at the
other attribute values in those rows, we can see that much of the other information is blank as well.
# for the rows with an undetermined sex attribute, count the blank values in each column
apply(dat[dat$Sex != 'Male' & dat$Sex != 'Female',],2,function(x) sum(x == ''))
## ID Date DateType
## 0 1 1
## Age Sex Race
## NA 6 3
## ResidenceCity ResidenceCounty ResidenceState
## 3 3 3
## DeathCity DeathCounty Location
## 3 3 3
## LocationifOther DescriptionofInjury InjuryPlace
## 7 2 1
## InjuryCity InjuryCounty InjuryState
## 2 2 5
## COD OtherSignifican Heroin
## 0 5 6
## Cocaine Fentanyl FentanylAnalogue
## 5 2 6
## Oxycodone Oxymorphone Ethanol
## 6 7 5
## Hydrocodone Benzodiazepine Methadone
## 6 5 6
## Amphet Tramad Morphine_NotHeroin
## 7 6 7
## Hydromorphone Other OpiateNOS
## 6 7 7
## AnyOpioid MannerofDeath DeathCityGeo
## 3 1 0
## ResidenceCityGeo InjuryCityGeo opioidInvolved
## 0 0 0
## isWhite
## 0
Data Exploration and Visualization

Diving deeper into our dataset, we wanted to keep exploring how demographics relate to the nature of overdose deaths.
# a dataframe containing only the rows with opioid involvement

datOp = dat[dat$opioidInvolved == "Yes",]
# for later comparison, a dataframe containing only the rows of overdoses with no opioid involvement
datNOp = dat[dat$opioidInvolved == "No",]
What is the most frequently occurring manner of death among opioid related overdoses? Does it differ from non-opioid related deaths?
# plotting manner of death for opioid and non-opioid overdoses side by side

par(mfrow=c(1,2))
plot(datOp$MannerofDeath, main="manner of death opioids", xlab="MOD", col="lightblue")
plot(datNOp$MannerofDeath, main="manner of death no opioids", xlab="MOD", col="lightblue")
We can see no difference between these subsets of our data: nearly all entries in our dataset are accidental overdoses.
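The earlier summary of MannerofDeath showed three spellings of "accident" plus some blank entries, which R treats as separate factor levels. A minimal sketch of one way to collapse them is shown here on a toy factor; the name `modClean` and the "unknown" label for blanks are assumptions, not part of the original analysis.

```r
# toy factor with the spellings seen in summary(dat$MannerofDeath)
mod = factor(c("accident", "Accident", "ACCIDENT", "Natural", "Pending", ""))

# lower-case everything so the three spellings of "accident" merge,
# and give blank entries an explicit label
modClean = tolower(as.character(mod))
modClean[modClean == ""] = "unknown"
modClean = factor(modClean)

table(modClean)
```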
Are there any spikes in opioid overdoses depending on the time of year?
# getting a count of opioid overdose deaths by month of year

datOp["Month"] = apply(datOp["Date"], 1, function(x) substr(x, 1, 2))
table(datOp$Month)
##
## 01 02 03 04 05 06 07 08 09 10 11 12
## 336 354 373 346 370 387 394 337 369 405 415 403
We can see the number of opioid related deaths in each month is distributed rather evenly, with a slight increase in the later months of a given
year.
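As a rough check on how even the monthly distribution is, the counts above can be fed to a chi-squared goodness-of-fit test. This is only a sketch: it assumes every month is equally likely and ignores that months differ in length, so the result should be read as indicative rather than conclusive.

```r
# monthly counts copied from table(datOp$Month) above
monthCounts = c(336, 354, 373, 346, 370, 387, 394, 337, 369, 405, 415, 403)
names(monthCounts) = sprintf("%02d", 1:12)

# goodness-of-fit test against equal probability for each month
monthTest = chisq.test(monthCounts)
monthTest$statistic
monthTest$p.value

# share of deaths falling in the last three months of the year
lateShare = sum(monthCounts[c("10", "11", "12")]) / sum(monthCounts)
lateShare
```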
We saw previously that our typical overdose victim was a middle-aged white man. Here we explore other possible correlations, such as location.
Many of our overdose deaths occurred in large cities, and there is a clear cutoff after the top 5: the death count falls from 227 in New Britain
to 145 in Meriden, about a quarter of Hartford’s 563.
topCities = c("HARTFORD", "NEW HAVEN","WATERBURY", "BRIDGEPORT", "NEW BRITAIN")

dat$MajorCity = factor(ifelse(dat$DeathCity %in% topCities,
"Yes", "No"))
summary(dat$MajorCity)
## No Yes
## 3232 1873
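We can see here that about 37% of the overdose deaths in our dataset (1873 of 5105) occurred in just these top 5 cities. Note that this count
covers all deaths in dat, not only the opioid related ones. This seems like a more interesting candidate for our target variable.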
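The choice of cutoff and the share of deaths in the top cities can be checked directly from the city counts. The sketch below copies the seven largest counts from the summary of DeathCity; the names `cityCounts` and `topShare` are introduced here for illustration.

```r
# death counts for the seven largest cities, copied from summary(dat$DeathCity)
cityCounts = c(HARTFORD = 563, "NEW HAVEN" = 374, WATERBURY = 368,
               BRIDGEPORT = 341, "NEW BRITAIN" = 227, MERIDEN = 145,
               BRISTOL = 144)

# each city's count relative to the largest city
round(cityCounts / max(cityCounts), 2)

# share of all 5105 deaths that fall in the five cities in topCities
topCities = c("HARTFORD", "NEW HAVEN", "WATERBURY", "BRIDGEPORT", "NEW BRITAIN")
topShare = sum(cityCounts[topCities]) / 5105
round(topShare, 3)
```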
Building a classification tree

Our target variable is the MajorCity attribute we created in the previous section, which specifies whether or not an overdose death occurred in
one of the cities with the highest death tolls. We use a classification tree to show us which other attributes might influence this outcome. We
start by splitting the data into training and test sets. The objective is to find predictors that minimize our classification error, and the
functions we’ll use will help us determine what those predictors are.
library(rpart)
library(rpart.plot)
library(maptree)
## Loading required package: cluster
# for split_data function

source("https://raw.githubusercontent.com/grbruns/cst383/master/lin-regr-util.R")
set.seed(123)
splits = split_data(dat, frac=c(3,1))

tr_dat = splits[[1]]
te_dat = splits[[2]]
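The split_data function comes from the sourced utility file and is not shown here. As a rough illustration of what a 3:1 random split looks like in base R (an assumption about its behavior, not a copy of its implementation; `toy`, `toyTrain`, and `toyTest` are made-up names):

```r
set.seed(123)

# toy data frame standing in for dat
toy = data.frame(id = 1:100, x = rnorm(100))

# randomly assign three quarters of the row indices to training
nTrain = floor(0.75 * nrow(toy))
trainIdx = sample(nrow(toy), nTrain)

# the remaining quarter becomes the test set
toyTrain = toy[trainIdx, ]
toyTest  = toy[-trainIdx, ]

c(train = nrow(toyTrain), test = nrow(toyTest))
```

A split like this is consistent with the sizes reported later in the document, where the training set has 3828 of the 5105 rows.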
Here we are using Age, MannerofDeath, Sex, and isWhite as our predictors. We chose these because all other factor attributes have far too many
levels to be useful. The idea of using the classification tree is that it will help us discern which attributes are most relevant to predicting
the location of an overdose death.
# building our tree with the training data

tr_fit = rpart(MajorCity ~ Age + MannerofDeath + isWhite + Sex , data=tr_dat, method="class")
prp(tr_fit, extra=106, varlen=-10,
main="classification tree for location of overdose",
box.col=c("palegreen", "pink")[tr_fit$frame$yval])
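This is an unexpected result: even though white individuals make up the large majority of the overdose deaths in our dataset, the tree predicts a
major-city death only for non-white individuals. White individuals are less likely, not more likely, to have died in one of our top cities.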
# examining our fitted tree

summary(tr_fit)
## Call:
## rpart(formula = MajorCity ~ Age + MannerofDeath + isWhite + Sex,
## data = tr_dat, method = "class")
## n= 3828
##
## CP nsplit rel error xerror xstd
## 1 0.1777778 0 1.0000000 1.0000000 0.02134509
## 2 0.0100000 1 0.8222222 0.8222222 0.02031748
##
## Variable importance
## isWhite
## 100
##
## Node number 1: 3828 observations, complexity param=0.1777778
## predicted class=No expected loss=0.3644201 P(node) =1
## class counts: 2433 1395
## probabilities: 0.636 0.364
## left son=2 (3000 obs) right son=3 (828 obs)
## Primary splits:
## isWhite splits as RL, improve=172.0408000, (0 missing)
## Age < 30.5 to the left, improve= 9.5951030, (3 missing)
## MannerofDeath splits as RRLL-R, improve= 1.9867830, (0 missing)
## Sex splits as LLRL, improve= 0.2077175, (0 missing)
## Surrogate splits:
## Sex splits as RLLL, agree=0.784, adj=0.001, (0 split)
##
## Node number 2: 3000 observations
## predicted class=No expected loss=0.2856667 P(node) =0.7836991
##
## Node number 3: 828 observations
## predicted class=Yes expected loss=0.3502415 P(node) =0.2163009
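The node summary above can be turned into counts to see how the two groups differ. This is a small worked calculation using only the numbers printed by summary(tr_fit); the variable names are introduced here for illustration. Since "isWhite splits as RL" sends the level "Yes" to the left, node 2 holds the white individuals and node 3 the non-white individuals.

```r
# values copied from the summary(tr_fit) output above
nWhite = 3000;    lossWhite = 0.2856667     # node 2, predicted class "No"
nNonWhite = 828;  lossNonWhite = 0.3502415  # node 3, predicted class "Yes"

# node 2 predicts "No", so its misclassified rows are the major-city deaths
yesWhite = round(nWhite * lossWhite)

# node 3 predicts "Yes", so its correctly classified rows are the major-city deaths
yesNonWhite = nNonWhite - round(nNonWhite * lossNonWhite)

# proportion of each group that died in a major city
c(white = yesWhite / nWhite, nonWhite = yesNonWhite / nNonWhite)

# sanity check against the root node's class counts (1395 "Yes")
yesWhite + yesNonWhite
```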
Classifying test data

We now fit a second tree, this time on our test data, to see whether the same structure appears. Note that this compares tree structure only; the
tree built on the training data is evaluated against the test data in the next section.
# fitting a second tree on the test data for comparison

te_fit = rpart(MajorCity ~ Age + MannerofDeath + isWhite + Sex , data=te_dat, method="class")
prp(te_fit, extra=106, varlen=-10,
main="classification tree for location of overdose",
box.col=c("palegreen", "pink")[te_fit$frame$yval])
Interestingly enough, age appears in the tree fit on our test data where it did not for our training data. It would seem that race and age are
the best predictors of the location of an overdose death, and that our typical individual overdosing in our locations of interest is a non-white
person under the age of 29. Because the test set is much smaller than the training set, this age split should be treated with caution.
Assessing the model

#creating our confusion matrix
predicted = predict(tr_fit, te_dat, type="class")
actual = te_dat$MajorCity
conf_mtx = table(predicted, actual)
conf_mtx
## actual
## predicted No Yes
## No 693 311
## Yes 106 167
mean(actual == predicted)
## [1] 0.6734534
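Accuracy alone can be misleading when one class dominates, so it helps to compare it with a baseline and to look at per-class measures. The sketch below rebuilds the confusion matrix from the printed output and derives a few common metrics; the variable names are introduced here for illustration.

```r
# confusion matrix copied from the output above (rows = predicted, cols = actual)
conf = matrix(c(693, 106, 311, 167), nrow = 2,
              dimnames = list(predicted = c("No", "Yes"),
                              actual    = c("No", "Yes")))

# overall fraction of correct predictions
accuracy = sum(diag(conf)) / sum(conf)

# accuracy of always predicting the majority class "No"
baseline = sum(conf[, "No"]) / sum(conf)

# of the actual major-city deaths, the fraction the tree identifies
sensitivity = conf["Yes", "Yes"] / sum(conf[, "Yes"])

# of the predicted major-city deaths, the fraction that are correct
precision = conf["Yes", "Yes"] / sum(conf["Yes", ])

round(c(accuracy = accuracy, baseline = baseline,
        sensitivity = sensitivity, precision = precision), 3)
```

Comparing accuracy with the majority-class baseline shows how much the tree adds beyond simply predicting "No" for every case.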
Conclusions
We learned that white men in their 40s are the most likely overdose victims in our dataset. Additionally, we saw that overdose deaths are
concentrated in a handful of large cities. Contrary to what these findings might suggest, however, we found that the race most likely to
overdose is not more likely to overdose in the most likely locations of overdose.
