This package aims to parse out identifiable drug names given a corpus of text. By corpus of text, we assume that the data has already been imported into R.
Throughout this vignette, we will employ a sample dataset -
drug_df
- that is intended to represent data collected from
a clinical trial. The dataset contains 3 variables and 500
observations.
dim(drug_df)
[1] 500 3
str(drug_df)
tibble [500 × 3] (S3: tbl_df/tbl/data.frame)
$ textdrug: chr [1:500] "Remeron" "Remeron" "Soma" "Ectacy" ...
$ sex : chr [1:500] "male" "female" "female" "female" ...
$ race : chr [1:500] "black" "ai/an" "ai/an" "hn/pi" ...
class(drug_df)
[1] "tbl_df" "tbl" "data.frame"
Note: drug_df
is a simulated dataset and does not
reflect any true clinical observations.
parse()
The parse()
function is intended to extract identifiable
drug names from a corpus of text such as, clinical trial data, social
media, survey or interview transcription. parse()
takes in
one argument, the vector that contains the strings to be parsed.
Here is an example of some problematic records in the
drug_df
dataset that warrants the use of
parse()
messy_data <- drug_df %>%
# select records that have problematic characters
filter(str_detect(textdrug, ",|;|and|\\/|=|\\(")) %>%
distinct(textdrug)
knitr::kable(messy_data)
textdrug |
---|
Bup/Nx |
Bup/Nx. |
bup/nx |
Percocets and Vicodin |
Barbiturate (doesn’t know which) |
heroin - “few days on, few days off” |
heroin- “a few days on, few days off |
Ambien = 2 pills |
Ambien “a bunch” = 2 pills |
promethazine (25mg), clonidine (0.1mg) |
As you can see there are so many extraneous/problematic characters, multiple drugs in one record and several variations of the same drug (i.e. “bup/nx”). We assume that the user is solely interested in the drugs themselves, not information such as dosage and units.
This messy data is exactly what parse()
was designed
for.
drug_names <- DOPE::parse(messy_data$textdrug)
drug_names
[1] "bup/nx" "bup/nx" "bup/nx" "percocets" "vicodin"
[6] "barbiturate" "heroin" "ambien" "ambien" "promethazine"
[11] "clonidine"
attr(,"na.action")
[1] 8 9
attr(,"class")
[1] "omit"
Notice parse()
cleans up the capitalization and
punctuation of ‘bup/nx’. parse()
has special code to clean
up cases of ‘bup/nx’ and also ‘speedball’. It also finds the distinction
of the final row “promethazine (25mg), clonidine” and separates them.
See the tidytext package.1
The resulting vector can then be passed on to the
lookup_*
functions to identify whether the input drug is a
class, category or a synonym for other drugs in the same category.
lookup_*
lookup()
This function relies on a comprehensive lookup table
lookup_df
. This dataframe contains 3 variables:
class
= Highest level classification e.g. “stimulant”
or “narcotic (opoid)”category
= More specific level of classification
e.g. “heroin”, “opium” or “marijuana”synonym
= Common name or street slang for specific drug
name e.g. “china”, “dope” or “weed”These names were based on terms used by the DEA.2
dim(lookup_df)
[1] 4766 3
str(lookup_df)
'data.frame': 4766 obs. of 3 variables:
$ class : chr "hallucinogen" "hallucinogen" "hallucinogen" "hallucinogen" ...
$ category: chr "2cb" "2cb" "2cb" "2cb" ...
$ synonym : chr "banana split" "bdmpea" "bromo" "mft" ...
class(lookup_df)
[1] "data.frame"
The purpose of this function is to return any possible matches to the
lookup_df, which is a comprehensive dataframe consisting of all drug
classes, categories and synonyms. It serves as a source or helper
function to many of the other more specific function in the package. The
idea is to match any possible columns with a the single word, a list of
separate words or a vector passed as an argument. The dataframe returned
will consist of the lookup_df match as well as the
original_word
that was the source of the match.
Here is an example of a common search done using
lookup
.
original_word | class | category | synonym |
---|---|---|---|
bup/nx | treatment drug | treatment drug | bup/nx |
percocets | NA | NA | NA |
vicodin | narcotic (opioid) | codeine combinations, non-injectable | vicodin |
barbiturate | depressant | barbiturate | fiorina |
barbiturate | depressant | barbiturate | nembutal |
barbiturate | depressant | barbiturate | pentothal |
barbiturate | depressant | barbiturate | seconal |
heroin | heroin | heroin | a-bomb |
heroin | heroin | heroin | a-bomb (mixed with marijuana) |
heroin | heroin | heroin | achivia |
heroin | heroin | heroin | adormidera |
heroin | heroin | heroin | aip |
heroin | heroin | heroin | al capone |
heroin | heroin | heroin | antifreeze |
heroin | heroin | heroin | aries |
You can see that the dataframe returned could be vast in its matches (heroin returns another few hundred matches alone), and that the other more specific functions, below, might be of more use depending on one’s needs.
compress_lookup()
This function takes in one argument: the table returned from a search
using the lookup
function. The purpose of this function is
to narrow down the results to a more specific dataframe consisting of
only relevant values, such as class
and/or
category
depending on the user’s selection.
compress_lookup
returns, by default,
original_word
, class
and
category
.
If a researcher wanted to determine the main classes of drugs being
used by the patients of a clinical study, they might pass a large vector
of substances from clinical notes taken in a study to the
lookup
function, then filter them down to only return the
datafram of classes relevant to their needs.
Here is an example of a common search done using
compress_lookup
.
filtered_df <- compress_lookup(results)
head(filtered_df)
original_word class category
1 bup/nx treatment drug treatment drug
2 percocets <NA> <NA>
3 vicodin narcotic (opioid) codeine combinations, non-injectable
4 barbiturate depressant barbiturate
5 heroin heroin heroin
6 ambien Unknown Unknown
The resulting dataframe is a short list of only the relevant information needed.
lookup_syn()
The purpose of this function is to find all possible synonyms of,
primarily, a slang/street name of a commonly abused drug. Though
searching for a drug class or category with lookup()
will
also return common synonms, this function makes searching specifically
for synonyms explicit by taking just one argument:
drug_name
. The function will then determine the category of
the slang term (drug_name) and return all synonyms that share that
category.
Here is an example of a common search done using
lookup_syn
.
The resulting dataframe contains a moderate list of terms that are
synonyms of the drug_name
given as determined by sources
such as the DEA, FDA and other publicly available resources.