---
title: "Introduction to parse and the lookup_ functions"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to parse and the lookup_ functions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "  ",
  message = FALSE,
  warning = FALSE
)
```

```{r setup}
library(DOPE)
library(dplyr)
library(stringr)
```

This package aims to parse out identifiable drug names given a corpus of text. By corpus of text, we assume that the data has already been imported into R.

## Data: drug_df

------

Throughout this vignette, we will employ a sample dataset - `drug_df` - that is intended to represent data collected from a clinical trial. The dataset contains 3 variables and 500 observations. 

```{r data, echo=TRUE}
dim(drug_df)

str(drug_df)

class(drug_df)
```
Note: `drug_df` is a simulated dataset and does not reflect any true clinical observations. 

## `parse()`

------

The `parse()` function is intended to extract identifiable drug names from a corpus of text such as, clinical trial data, social media, survey or interview transcription. `parse()` takes in one argument, the vector that contains the strings to be parsed.

Here is an example of some problematic records in the `drug_df` dataset that warrants the use of `parse()`

```{r messy}

messy_data <- drug_df %>% 
  # select records that have problematic characters
  filter(str_detect(textdrug, ",|;|and|\\/|=|\\(")) %>% 
  distinct(textdrug)

knitr::kable(messy_data)

```

As you can see there are so many extraneous/problematic characters, multiple drugs in one record and several variations of the same drug (i.e. "bup/nx"). We assume that the user is solely interested in the drugs themselves, not information such as dosage and units. 

This messy data is exactly what `parse()` was designed for. 

```{r}
drug_names <- DOPE::parse(messy_data$textdrug)

drug_names
```

Notice `parse()` cleans up the capitalization and punctuation of 'bup/nx'. `parse()` has special code to clean up cases of 'bup/nx' and also 'speedball'. It also finds the distinction of the final row "promethazine (25mg), clonidine" and separates them. See the tidytext package.^[https://juliasilge.github.io/tidytext/]

The resulting vector can then be passed on to the `lookup_*` functions to identify whether the input drug is a class, category or a synonym for other drugs in the same category. 

## `lookup_*`

------

### `lookup()`

This function relies on a comprehensive lookup table `lookup_df`. This dataframe contains 3 variables:  

- `class` =  Highest level classification e.g. "stimulant" or "narcotic (opoid)"
- `category` =  More specific level of classification e.g. "heroin", "opium" or "marijuana"
- `synonym` =  Common name or street slang for specific drug name e.g. "china", "dope" or "weed"

These names were based on terms used by the DEA.^[https://www.dea.gov/sites/default/files/2020-04/Drugs%20of%20Abuse%202020-Web%20Version-508%20compliant-4-24-20_0.pdf]

```{r lookup_df, echo=TRUE}
dim(lookup_df)

str(lookup_df)

class(lookup_df)
```

The purpose of this function is to return any possible matches to the lookup_df, which is a comprehensive dataframe consisting of all drug classes, categories and synonyms. It serves as a source or helper function to many of the other more specific function in the package. The idea is to match any possible columns with a the single word, a list of separate words or a vector passed as an argument. The dataframe returned will consist of the lookup_df match as well as the `original_word` that was the source of the match. 

Here is an example of a common search done using `lookup`. 

```{r}
results <- lookup(unique(drug_names))
head(results, 15) %>%
  knitr::kable()
```

You can see that the dataframe returned could be vast in its matches (heroin returns another few hundred matches alone), and that the other more specific functions, below, might be of more use depending on one's needs. 

------

### `compress_lookup()`

This function takes in one argument: the table returned from a search using the `lookup` function. The purpose of this function is to narrow down the results to a more specific dataframe consisting of only relevant values, such as `class` and/or `category` depending on the user's selection. `compress_lookup` returns, by default, `original_word`, `class` and `category`. 


If a researcher wanted to determine the main classes of drugs being used by the patients of a clinical study, they might pass a large vector of substances from clinical notes taken in a study to the `lookup` function, then filter them down to only return the datafram of classes relevant to their needs.

Here is an example of a common search done using `compress_lookup`. 

```{r}
filtered_df <- compress_lookup(results)
head(filtered_df)
```

The resulting dataframe is a short list of only the relevant information needed. 

------

### `lookup_syn()`

The purpose of this function is to find all possible synonyms of, primarily, a slang/street name of a commonly abused drug. Though searching for a drug class or category with `lookup()` will also return common synonms, this function makes searching specifically for synonyms explicit by taking just one argument: `drug_name`. The function will then determine the category of the slang term (drug_name) and return all synonyms that share that category. 

Here is an example of a common search done using `lookup_syn`. 

```{r}
results <- lookup_syn("shrooms")
head(results)
```

The resulting dataframe contains a moderate list of terms that are synonyms of the `drug_name` given as determined by sources such as the DEA, FDA and other publicly available resources.