Over the past 15 years, deaths from prescription opiates have quadrupled, as has the amount of opiates prescribed. This massive increase in prescription rates has occurred while the levels of pain experienced by Americans have remained largely unchanged. Intuitively, it follows that unnecessary prescriptions potentially play a significant role in the increase in opioid overdoses. An effective strategy for identifying instances of overprescribing is therefore a potentially life-saving endeavor.
To that end, the goal of this experiment is to demonstrate that predictive analytics with machine learning can be used to estimate the likelihood that a given doctor is a significant prescriber of opiates. I’ll do some cleaning of the data, then build a predictive model using a gradient boosted classification tree ensemble that predicts with better than 80% accuracy whether an arbitrary entry is a significant prescriber of opioids. I’ll also do some analysis and visualization of my results combined with those pulled from other sources.
Here is a visualization of the distribution of fatal overdose rate across the country. By the end of this post, you’ll see exactly how such an image can be built.
Disclaimer: I am absolutely not suggesting that doctors who prescribe opiates are culpable for overdoses. These are drugs with true medical value when used appropriately. The idea is rather that a systematic way of identifying sources may reveal trends in particular practices, fields, or regions of the country that could be used effectively to combat the problem.
Building a Dataset
The primary source of the raw data I will be pulling from is cms.gov. Part of Obamacare requires more visibility into government aid programs, so this dataset is a compilation of drug prescriptions made to patients claiming coverage under Medicare Part D. It contains approximately 24 million entries in .csv format. Each row contains information about one doctor and one drug, so each doctor occurs multiple times (called long format). We want to pivot this data to wide format so that each row represents a single doctor and contains information on every drug.
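The heart of that pivot is a single dcast call from data.table. Here’s a minimal sketch, where the file name and the column names (npi, drug_name, total_claim_count) are illustrative assumptions about the raw CMS export, not the exact build script:
library(data.table)
# Read only the columns we need from the ~24M row long-format file
# (file and column names here are assumptions for illustration)
raw <- fread("cms_partd_2014.csv", select=c("npi","drug_name","total_claim_count"))
# Pivot long -> wide: one row per prescriber, one column per drug,
# summing claim counts and filling never-prescribed drugs with 0
wide <- dcast(raw, npi ~ drug_name, value.var="total_claim_count",
              fun.aggregate=sum, fill=0)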
Those of you who have used Excel on large datasets will know very well that anything close to a million rows in Excel is painfully slow, so if you have ever wondered how to manage such a file, I’ve included the R code I used to build the dataset here.
This kind of work is pretty code-dense and not exactly riveting, so I won’t include the gory details here. Let’s just skip to the fun part. The final trimmed result is a dataset containing 25,000 unique doctors and the numbers of prescriptions written for 250 drugs in the year 2014. Specifically, the data consists of the following characteristics for each prescriber:
- NPI - unique National Provider Identifier number
- Gender - (M/F)
- State - U.S. State by abbreviation
- Credentials - set of initials indicative of medical degree
- Specialty - description of type of medicinal practice
- A long list of drugs with numeric values indicating the total number of prescriptions written for the year by that individual
- Opioid.Prescriber - a boolean label indicating whether or not that individual prescribed opiate drugs more than 10 times in the year
Data Cleaning
Load the relevant libraries and read the data
library(dplyr)
library(magrittr)
library(ggplot2)
library(maps)
library(data.table)
library(lme4)
library(caret)
library(dummies) # provides dummy(), used below to one-hot encode factors
df <- data.frame(fread("prescriber-info.csv",nrows=-1))
First, we have to remove all of the information about opiate prescriptions from the data because that would be cheating. Conveniently, the same source that provided the raw data used to build this dataset also includes a list of all opiate drugs exactly as they are named in the data.
opioids <- read.csv("opioids.csv",skip=2)
opioids <- as.character(opioids[,2]) # second column contains the names of the opiates
opioids <- gsub(" |-",".",opioids) # replace spaces and hyphens with periods to match the dataset
df <- df[, !names(df) %in% opioids]
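A quick sanity check that the leak is actually gone:
# Stop if any opioid column survived the removal
stopifnot(!any(opioids %in% names(df)))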
Convert the character columns to factors. I’m sure there is a nicer way to code this, but it’s late and this works for now; one tidier dplyr alternative appears just below.
char_cols <- c("NPI",names(df)[vapply(df,is.character,TRUE)])
df[,char_cols] <- lapply(df[,char_cols],as.factor)
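For what it’s worth, assuming dplyr 0.5 or later for mutate_if, the same conversion fits in one pipe:
# Equivalent: convert every character column, plus the numeric NPI, to factor
df %<>%
  mutate_if(is.character, as.factor) %>%
  mutate(NPI=as.factor(NPI))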
Now let’s clean up the factor variables
str(df[,1:5])
## 'data.frame': 25000 obs. of 5 variables:
## $ NPI : Factor w/ 25000 levels "1003002320","1003004771",..: 18016 6106 10728 16695 16924 13862 10913 10107 679 20644 ...
## $ Gender : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 1 ...
## $ State : Factor w/ 57 levels "AA","AE","AK",..: 48 4 38 6 37 42 34 42 48 54 ...
## $ Credentials: Factor w/ 888 levels "","(DMD)","A.N.P.",..: 244 507 403 507 403 314 507 839 697 507 ...
## $ Specialty : Factor w/ 109 levels "Addiction Medicine",..: 19 29 28 42 36 29 25 63 66 42 ...
Looks like the IDs are unique and two gender levels are expected, but there are a few too many states and way too many credentials.
df %>%
group_by(State) %>%
dplyr::summarise(state.counts = n()) %>%
arrange(state.counts)
## # A tibble: 57 x 2
## State state.counts
## <fctr> <int>
## 1 AA 1
## 2 AE 2
## 3 GU 2
## 4 ZZ 2
## 5 VI 3
## 6 WY 38
## 7 AK 39
## 8 VT 65
## 9 ND 66
## 10 MT 77
## # ... with 47 more rows
This site indicates that some of these are legitimate but rare labels, like Guam (GU), while others appear to be typos. Let’s create an “other” category for the rarely occurring abbreviations.
rare.abbrev <- df %>%
group_by(State) %>%
dplyr::summarise(state.counts = n()) %>%
arrange(state.counts) %>%
filter(state.counts < 10) %>%
select(State)
# Insert a new level into the factor, then remove the old names
levels(df$State) <- c(levels(df$State),"other")
df$State[df$State %in% rare.abbrev$State] <- "other"
df$State <- droplevels(df$State)
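As an aside, a recent version of the forcats package can do this lumping in a single call:
# Same result with forcats: lump levels with fewer than 10 rows into "other"
df$State <- forcats::fct_lump_min(df$State, min=10, other_level="other")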
As a quick sanity check to make sure we still don’t have any bogus states, let’s pull that table from the web and make sure our abbreviations are valid
library(htmltab)
state.abbrevs <- data.frame(htmltab("http://www.infoplease.com/ipa/A0110468.html",which=2))
state.abbrevs <- state.abbrevs$Postal.Code
if (all(levels(df$State)[levels(df$State)!="other"] %in% state.abbrevs)) {
  print("All abbreviations are valid")
} else {
  print("Uh oh..")
}
Looking ahead, I’m going to be interested in the variable importance state-by-state, so instead of using a single factor I’ll convert to dummy variables so each can be scored separately.
df <- cbind(df[names(df)!="State"],dummy(df$State))
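If you’d rather not depend on the dummies package, base R’s model.matrix builds equivalent indicator columns (a drop-in alternative to the line above; the column names will differ slightly):
# One 0/1 column per state; the "- 1" drops the intercept so no level is omitted
df <- cbind(df[names(df)!="State"], model.matrix(~ State - 1, data=df))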
Let’s look at Credentials now.
df %>%
group_by(Credentials) %>%
dplyr::summarise(credential.counts = n()) %>%
arrange(credential.counts) %>%
data.frame() %>%
head(n=25)
## Credentials credential.counts
## 1 (DMD) 1
## 2 A.P.R.N. N.P. 1
## 3 A.P.R.N., B.C. 1
## 4 A.P.R.N., W.H.N.P. 1
## 5 A.R.N.P. , PMHNP-BC 1
## 6 A.R.N.P.-B.C. 1
## 7 A.T.,C. , C.S.C.S. 1
## 8 ACNP-BC, CCNS 1
## 9 ACNP-BC, MSN 1
## 10 ACNP/FNP 1
## 11 ACNS-BC 1
## 12 ADULT NP 1
## 13 AGACNP 1
## 14 AGNP 1
## 15 AGPCNP-BC 1
## 16 AHCNS 1
## 17 ANNA TAVAKKOLI M.D. 1
## 18 ANP B-C 1
## 19 ANP-BC, GNP 1
## 20 ANP-FNP-BC 1
## 21 ANP,BC 1
## 22 ANP. 1
## 23 ANP/CNP 1
## 24 ANRP 1
## 25 APN FNP-C 1
The credentials are quite the mess. Titles are inconsistently entered with periods, spaces, hyphens, etc., and many people have multiple credentials. While it’s easy to make use of regular expressions to clean them, I won’t bother, because I suspect most of the predictive information contained in Credentials will be captured by Specialty.
# 7 years of college down the drain
df %<>%
select(-Credentials)
On that note, let’s look at Specialty.
df %>%
group_by(Specialty) %>%
dplyr::summarise(specialty.counts = n()) %>%
arrange(desc(specialty.counts)) %>%
data.frame() %>%
head(n=25)
##                                                          Specialty specialty.counts
## 1                                                Internal Medicine              3194
## 2                                                  Family Practice              2975
## 3                                                          Dentist              2800
## 4                                               Nurse Practitioner              2512
## 5                                              Physician Assistant              1839
## 6                                               Emergency Medicine              1087
## 7                                                       Psychiatry               691
## 8                                                       Cardiology               688
## 9                                            Obstetrics/Gynecology               615
## 10                                              Orthopedic Surgery               575
## 11                                                       Optometry               571
## 12 Student in an Organized Health Care Education/Training Program               547
## 13                                                   Ophthalmology               519
## 14                                                 General Surgery               487
## 15                                                Gastroenterology               399
## 16                                                       Neurology               371
## 17                                                        Podiatry               369
## 18                                                     Dermatology               344
## 19                                                         Urology               328
## 20                                          Psychiatry & Neurology               266
## 21                                               Pulmonary Disease               262
## 22                                                  Otolaryngology               260
## 23                                                General Practice               247
## 24                                                      Nephrology               236
## 25                                             Hematology/Oncology               218
There are a ton of ways to attack this feature. Since there does not appear to be too much name clashing/redundancy, I’ll use a couple of keywords to collapse some specialties together, and I’ll also lump the rarely occurring ones into an “other” category like before.
# Get the common specialties, we won't change those
common.specialties <- df %>%
group_by(Specialty) %>%
dplyr::summarise(specialty.counts = n()) %>%
arrange(desc(specialty.counts)) %>%
filter(specialty.counts > 50) %>%
select(Specialty)
common.specialties <- levels(droplevels(common.specialties$Specialty))
# Default to "other", then fill in. I'll make special levels for surgeons and collapse any category containing the word pain
new.specialties <- factor(x=rep("other",nrow(df)),levels=c(common.specialties,"Surgeon","other","Pain.Management"))
new.specialties[df$Specialty %in% common.specialties] <- df$Specialty[df$Specialty %in% common.specialties]
new.specialties[grepl("surg",df$Specialty,ignore.case=TRUE)] <- "Surgeon"
new.specialties[grepl("pain",df$Specialty,ignore.case=TRUE)] <- "Pain.Management"
new.specialties <- droplevels(new.specialties)
df$Specialty <- new.specialties
df %>%
group_by(Specialty) %>%
dplyr::summarise(specialty.counts = n()) %>%
arrange(desc(specialty.counts)) %>%
data.frame() %>%
head(n=25)
##                                                          Specialty specialty.counts
## 1                                                Internal Medicine              3194
## 2                                                  Family Practice              2975
## 3                                                          Dentist              2800
## 4                                               Nurse Practitioner              2512
## 5                                              Physician Assistant              1839
## 6                                                          Surgeon              1689
## 7                                               Emergency Medicine              1087
## 8                                                       Psychiatry               691
## 9                                                       Cardiology               688
## 10                                           Obstetrics/Gynecology               615
## 11                                                       Optometry               571
## 12 Student in an Organized Health Care Education/Training Program               547
## 13                                                   Ophthalmology               519
## 14                                                           other               454
## 15                                                Gastroenterology               399
## 16                                                       Neurology               371
## 17                                                        Podiatry               369
## 18                                                     Dermatology               344
## 19                                                         Urology               328
## 20                                          Psychiatry & Neurology               266
## 21                                               Pulmonary Disease               262
## 22                                                  Otolaryngology               260
## 23                                                General Practice               247
## 24                                                      Nephrology               236
## 25                                             Hematology/Oncology               218
Looks good. Like we did with State, let’s make it a dummy so we can score the importance by specialty.
df <- df[!is.na(df$Specialty),]
df <- cbind(df[,names(df)!="Specialty"],dummy(df$Specialty))
Because this data is a subset of a larger dataset, it is possible that some drugs here weren’t prescribed in any of the remaining rows (i.e. the instances where they were prescribed got left behind). This would result in columns of all zeros, which is bad: multicollinear features already make the feature matrix rank-deficient and therefore non-invertible, and all-zero features are a special kind of useless on top of that, carrying no information at all. Let’s remove them.
# Keep every non-numeric column, and keep numeric columns only if their sum is nonzero
df <- df[vapply(df,function(x) if (is.numeric(x)){sum(x)>0}else{TRUE},FUN.VALUE=TRUE)]
I used vapply because sapply’s return type is unpredictable; vapply makes you declare the type and length of the output up front, so it fails loudly instead of silently simplifying.
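caret also ships nearZeroVar for the more general case of near-constant columns. It is more aggressive than the pure-zero filter above, so whether to apply it is a judgment call:
# Indices of (near-)zero-variance columns, per caret's default thresholds
nzv <- nearZeroVar(df)
# df <- df[,-nzv] # uncomment to drop them as well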
Model Building
I’ll use a gradient boosted classification tree ensemble with gbm and caret. My reasoning is that with so many different features in the form of individual drugs, it’s extremely likely that some of them are highly correlated (e.g. blood pressure medication, heart medication, aspirin). Something like an L1-regularized logistic regression followed by feature trimming would also work, but one of the nice things about this type of tree model is that it doesn’t care about multicollinear features.
First, split the data.
train_fraction <- 0.8
train_ind <- sample(nrow(df),round(train_fraction*nrow(df)))
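One caveat: sample() gives an unstratified and, without a seed, unrepeatable split. caret’s createDataPartition is a drop-in alternative that stratifies on the label (the seed value here is arbitrary):
set.seed(42) # any fixed seed makes the split reproducible
train_ind <- createDataPartition(df$Opioid.Prescriber, p=train_fraction, list=FALSE)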
Remove the non-features, and convert the target label to something non-numeric (caret requires valid factor level names when computing class probabilities).
df %<>% select(-NPI)
df$Opioid.Prescriber <- as.factor(ifelse(df$Opioid.Prescriber==1,"yes","no"))
train_set <- df[train_ind,]
test_set <- df[-train_ind,]
Now we build the model. I’ll use 5-fold cross-validation to optimize hyperparameters for the boosted tree ensemble. A downside of trees is that they take a while to train (though they predict quickly). A production model would call for an exhaustive hyperparameter search, but to keep the running time of this script reasonable I will search over only a few parameter values. I’ll run this and go grab some coffee.
objControl <- trainControl(method='cv', number=5, returnResamp='none',
                           summaryFunction=twoClassSummary, classProbs=TRUE,
                           verboseIter=TRUE, savePredictions=TRUE)
model <- train(train_set %>% select(-Opioid.Prescriber),train_set$Opioid.Prescriber,
method='gbm',
metric = "ROC",
# metric = "Accuracy",
trControl=objControl,
tuneGrid=data.frame(n.trees=c(50,150,200),
interaction.depth=c(3),
shrinkage=c(0.1),
n.minobsinnode=c(10)))
## + Fold1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=200
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.2427 nan 0.1000 0.0567
## 2 1.1495 nan 0.1000 0.0466
## 3 1.0717 nan 0.1000 0.0389
## 4 1.0058 nan 0.1000 0.0325
## 5 0.9489 nan 0.1000 0.0286
## 6 0.9012 nan 0.1000 0.0237
## 7 0.8592 nan 0.1000 0.0208
## 8 0.8234 nan 0.1000 0.0178
## 9 0.7920 nan 0.1000 0.0153
## 10 0.7646 nan 0.1000 0.0135
## 20 0.6128 nan 0.1000 0.0043
## 40 0.5327 nan 0.1000 0.0006
## 60 0.5071 nan 0.1000 0.0004
## 80 0.4907 nan 0.1000 0.0001
## 100 0.4782 nan 0.1000 0.0002
## 120 0.4669 nan 0.1000 0.0001
## 140 0.4599 nan 0.1000 0.0000
## 160 0.4522 nan 0.1000 -0.0000
## 180 0.4468 nan 0.1000 0.0001
## 200 0.4405 nan 0.1000 0.0000
##
## - Fold1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=200
## ... (near-identical traces for Folds 2-5 omitted) ...
## Aggregating results
## Selecting tuning parameters
## Fitting n.trees = 200, interaction.depth = 3, shrinkage = 0.1, n.minobsinnode = 10 on full training set
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.2444 nan 0.1000 0.0567
## 2 1.1511 nan 0.1000 0.0464
## 3 1.0737 nan 0.1000 0.0387
## 4 1.0077 nan 0.1000 0.0327
## 5 0.9520 nan 0.1000 0.0276
## 6 0.9038 nan 0.1000 0.0237
## 7 0.8630 nan 0.1000 0.0206
## 8 0.8273 nan 0.1000 0.0176
## 9 0.7966 nan 0.1000 0.0153
## 10 0.7689 nan 0.1000 0.0137
## 20 0.6179 nan 0.1000 0.0042
## 40 0.5393 nan 0.1000 0.0017
## 60 0.5108 nan 0.1000 0.0003
## 80 0.4923 nan 0.1000 0.0002
## 100 0.4826 nan 0.1000 0.0001
## 120 0.4728 nan 0.1000 0.0000
## 140 0.4656 nan 0.1000 0.0000
## 160 0.4589 nan 0.1000 0.0001
## 180 0.4533 nan 0.1000 0.0002
## 200 0.4494 nan 0.1000 -0.0000
predictions <- predict(model,test_set%>% select(-Opioid.Prescriber),type="raw")
confusionMatrix(predictions,test_set$Opioid.Prescriber,positive="yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 1951 417
## yes 70 2562
##
## Accuracy : 0.9026
## 95% CI : (0.894, 0.9107)
## No Information Rate : 0.5958
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8032
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8600
## Specificity : 0.9654
## Pos Pred Value : 0.9734
## Neg Pred Value : 0.8239
## Prevalence : 0.5958
## Detection Rate : 0.5124
## Detection Prevalence : 0.5264
## Balanced Accuracy : 0.9127
##
## 'Positive' Class : yes
##
Looks like we did a good job; roughly 90% accuracy is not bad for a first attempt. If our goal is to flag significant sources of opioid prescriptions so that some government agency can do investigative work, then we would likely care more about precision (Pos Pred Value in this package) than accuracy. The reason is that a false positive potentially sends people on a wild goose chase, wasting money and time.
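If precision is the priority, the simplest lever is the decision threshold: predict class probabilities and flag a prescriber only when the model is quite confident. A sketch, where the 0.8 cutoff is an arbitrary illustration rather than a tuned value:
# Class probabilities (available because classProbs=TRUE in trainControl)
probs <- predict(model, test_set %>% select(-Opioid.Prescriber), type="prob")
# Flag "yes" only when the model clears a stricter cutoff
strict.preds <- factor(ifelse(probs$yes > 0.8, "yes", "no"), levels=c("no","yes"))
confusionMatrix(strict.preds, test_set$Opioid.Prescriber, positive="yes")
Raising the cutoff trades sensitivity for precision, which is the right trade when false positives are expensive to chase down.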
The other nice thing about trees is the ability to view the importance of each feature in the decision-making process. For a boosted ensemble like this one, importance reflects how much the splits on each feature improve the loss function, summed across all of the trees: if splits on a feature keep buying you big improvements, it must be pretty important. Let’s extract the importance from our model with varImp and make a bar plot.
importance <- as.data.frame(varImp(model)[1])
importance <- cbind(row.names(importance), Importance=importance)
row.names(importance) <-NULL
names(importance) <- c("Feature","Importance")
importance %>% arrange(desc(Importance)) %>%
mutate(Feature=factor(Feature,levels=as.character(Feature))) %>%
slice(1:15) %>%
ggplot() + geom_bar(aes(x=Feature,y=(Importance)),stat="identity",fill="blue") +
theme(axis.text.x=element_text(angle=45,vjust = 1,hjust=1),axis.ticks.x = element_blank()) +ylab("Importance") +ggtitle("Feature Importance for Detecting Opioid Prescription")
Our best feature was a drug called gabapentin, which, unsurprisingly, is used to treat nerve pain caused by things like shingles. It’s not an opiate, but it likely gets prescribed at the same time as opiates. The Surgeon feature we engineered has done quite well. The other top drugs treat various conditions including inflammation, cancer, and arthritis, among others.
It probably doesn’t surprise anybody that surgeons commonly prescribe pain medication, or that the people taking cancer drugs are also taking pain pills. But with a very large feature set it becomes entirely intractable for a human to make maximal use of the available information. I would expect a human analyzing this data would immediately classify candidates based upon their profession, but machine learning has enabled us to break it down further based upon the specific drug prescription trends doctor by doctor. Truly a powerful tool.
Summary/Analysis
We succeeded in building a pretty good opiate-prescription detection algorithm based primarily on a prescriber’s specialty and prescription rates of non-opiate drugs. This could be used to combat overdoses in a number of ways. Specialties that show up with high importance in the model can reveal medical fields that could benefit in the long run from curriculum shifts toward awareness of the potential dangers. Science evolves rapidly, and this sort of dynamic shift in the “state-of-the-art” opinion is quite common. It’s also possible that individual drugs that are strong detectors could pinpoint a sub-specialty of prescribers worth a closer look, or even reveal hidden avenues of illegitimate drug access.
Now, as promised, here is the code used to generate the figure at the beginning of the post.
all_states <- map_data("state")
od <- read.csv("overdoses.csv",stringsAsFactors = FALSE)
od$State <- as.factor(od$State)
od$Population <- as.numeric(gsub(",","",od$Population))
od$Deaths <- as.numeric(gsub(",","",od$Deaths))
od %>%
  mutate(state.lower=tolower(State), Population=as.numeric(Population)) %>%
  merge(all_states,by.x="state.lower",by.y="region") %>%
  select(-subregion,-order) %>%
  ggplot() +
  geom_map(map=all_states, aes(x=long, y=lat, map_id=state.lower, fill=Deaths/Population*1e6)) +
  ggtitle("U.S. Opiate Overdose Death Rate") +
  geom_text(data=data.frame(state.center,od$Abbrev), aes(x=x, y=y, label=od.Abbrev), size=3) +
  scale_fill_continuous(low='gray85', high='darkred',
                        guide=guide_colorbar(ticks=FALSE, barheight=1, barwidth=10,
                                             title.vjust=.8, values=c(0.2,0.3)),
                        name="Deaths per Million") +
  theme(axis.text=element_blank(), axis.title=element_blank(), axis.ticks=element_blank(),
        legend.position="bottom", plot.title=element_text(size=20))
As you can see, it’s a serious problem. Maybe we as data scientists have the tools to make a difference.