The purpose of this guide is to accompany the technical CLARK documentation with related concepts and a clinical example.
There are two versions of CLARK. CLARK v1 is the original and allows you to build classification algorithms using notes. CLARK v2 builds on this by adding the ability to incorporate structured data, such as demographics, labs, vitals, and medications, into your algorithm. This conceptual guide uses images of CLARK v1, but these concepts still apply to CLARK v2.
For instructions on how to use CLARK, visit the technical documentation for the version of CLARK you are working with: CLARK v1 or CLARK v2.
Introduction
Who can use CLARK?
Selecting a Research Question
Computable Phenotypes
Clinical Notes
A Classification Example
The MIMIC-III Database
Feature Selection
Literature Review
Regular Expressions
Feature Selection Example
Evaluating Algorithm Performance
Types of Classification
Confidence thresholds
Clinical Example: Algorithm Comparison
Algorithm Performance Outside CLARK
Sensitivity and Specificity
Area under the ROC Curve
Training Corpus Considerations
CLARK is a machine-learning classification software created by NC TraCS and CoVar Applied Technologies to enable computable phenotyping in unstructured data. Computable phenotypes (CP) are algorithms that search the electronic health record (EHR) for patients with specific clinical features based on diagnostic or procedural codes, lab values, clinical notes, and other information to define cohorts1. In addition to the structured EHR data, unstructured clinical notes hold a wealth of information that is applicable to identifying patient groups. Analyzing data from clinical notes traditionally has required time-consuming, manual chart reviews. With CLARK, once the effort to create a phenotype is completed, the phenotype can be replicated across other sets of notes.
CLARK enables and simplifies text mining for classification through the use of regular expressions, letting researchers efficiently gather features (values, words and phrases) found in clinical notes. These features include symptoms, details about a patient’s experiences, image interpretations, pathology, and other information. Using features found in clinical notes to train classification algorithms in CLARK reflects the reproducible, transferrable nature of computable phenotyping. CLARK’s natural-language processing (NLP) component also decreases the time spent on manual review of patient records for classification.
CLARK is free and open to the public. It is designed for clinicians or their research teams to train algorithms. Some expertise in the context of the classification problem is necessary; however, experience with machine learning and NLP is not needed. Using CLARK requires some familiarity with, or willingness to learn, regular expressions. Basic statistical knowledge is helpful for interpreting classification results and model performance, which is explained in the technical documentation.
CLARK can help to identify patient cohorts for a study when inclusion/exclusion criteria are framed as a classification problem, or it can be used to investigate the distribution of characteristics in a population.
A suitable question for CLARK may take the form: “Of this patient population, which individuals have this characteristic?” or “What is the distribution of these characteristics within this patient population?” The former is focused on classifying individuals, and the latter is focused on a general picture of the population. These situations may include differentiating between disease subtypes or the presence/absence of disease. Groups of interest should be discrete and well-defined.
Each machine-learning algorithm included in CLARK is a classification tool, so CLARK is not designed to predict continuous endpoints or arbitrarily defined groups. For example, whether or not heart failure patients are readmitted within 30 days of discharge is difficult to predict from clinical notes, because the content of a note may not differ much between patients readmitted at 29 versus 31 days: the two patient groups are not inherently distinct.
Past projects using CLARK included identifying patients with non-alcoholic fatty liver disease, symptomatic/asymptomatic uterine fibroids, and primary ciliary dyskinesia. The analysis example in this guide aims to identify a group of patients who have sleep apnea.
CP definitions are reproducible and transferrable, and they invite collaboration. A valid phenotype is robust and meets desired levels of sensitivity and specificity2. The Phenotype Knowledge Base (PheKB) is a phenotype library dedicated to hosting existing algorithms and creating space to share working algorithms.
Computable phenotyping may not always be the goal of a CLARK analysis; however, the two efforts can be related. Searching a CP library may be part of a literature review before using CLARK, and contributing to a CP library may be an option after a CLARK analysis. CPs for many conditions have been defined primarily with structured data, and CLARK’s NLP abilities may contribute to the richness and transferability of a computable phenotype. Visit The NIH Collaboratory for more about CPs.
Free-text clinical notes contain rich patient information not always captured in structured EHR data. Analysis in CLARK requires two sets (corpora) of clinical notes: a labeled training set with known classifications (the “gold standard”) and an evaluation set of subjects that have not yet been classified. Because patients are the unit being classified, a patient’s notes should appear in only one set, and each patient should carry only one label. The source and size of each set of notes depend on the condition of interest. The training set may come from existing labeled data, or users can manually label a subset of the selected population through chart review. For example, to identify a cohort of patients with sleep apnea, the population should be composed of patients who may have sleep apnea, and the training set should include patient notes from several groups, including the condition of interest.
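As a quick illustration of this rule, the base R sketch below checks two hypothetical corpus files; the file names and the SUBJECT_ID and LABEL columns are assumptions made for this example, not CLARK’s required input format.

```r
# Sanity checks on hypothetical corpus files: each training patient carries
# exactly one label, and no patient appears in both corpora.
train      <- read.csv("training_notes.csv",   stringsAsFactors = FALSE)
evaluation <- read.csv("evaluation_notes.csv", stringsAsFactors = FALSE)

labels_per_patient <- tapply(train$LABEL, train$SUBJECT_ID,
                             function(x) length(unique(x)))
stopifnot(all(labels_per_patient == 1))                       # one label each

stopifnot(!any(train$SUBJECT_ID %in% evaluation$SUBJECT_ID))  # disjoint patients
```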
Note: clinical notes contain identifiable protected health information (PHI). Notes and related CLARK files should be handled securely and saved in a location approved by your institution for storing PHI.
To illustrate a simplified clinical example of a CLARK analysis, consider this situation: researchers must recruit people with sleep apnea for a study of a new therapy. To do this, we are tasked with identifying patients with sleep apnea at several area hospitals based on the content of their clinical notes (some of which might not yet have diagnoses attached). To identify a cohort for the study, an algorithm must first be built using a labeled training corpus of clinical notes. The scope of this example covers feature selection, algorithm selection, and algorithm evaluation on the training data.
This example uses clinical notes from the MIMIC-III database as a training corpus. MIMIC is a database of clinical encounters developed by MIT and is available upon request. It contains de-identified healthcare data for approximately 60,000 hospital ICU admissions, ranging from demographics to imaging reports. You may create an account with physionet.org and request access to MIMIC here. The “gold standard” set of notes was created using this R script and the NOTEEVENTS.csv, DIAGNOSES_ICD.csv, and D_ICD_DIAGNOSES.csv datasets from MIMIC.
The source datasets are large and may be difficult to view as text files or in Excel. However, a computer that can run CLARK should be able to read the NOTEEVENTS.csv file into R within a few minutes. For brevity, this example uses a subset of patients and diagnoses from the MIMIC notes.
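For orientation only, the base R sketch below shows one way a labeled training corpus could be assembled from these MIMIC tables. It is not the linked R script; the ICD-9 code prefixes are illustrative stand-ins rather than a validated phenotype definition, and the output format is not necessarily what CLARK expects.

```r
# Illustrative sketch: label MIMIC-III patients by diagnosis code and attach
# their notes. Column names follow the MIMIC-III CSV headers.
notes <- read.csv("NOTEEVENTS.csv",    stringsAsFactors = FALSE)
dx    <- read.csv("DIAGNOSES_ICD.csv", stringsAsFactors = FALSE)

# Assign one label per patient from illustrative ICD-9 code prefixes
label_for <- function(codes) {
  codes <- codes[!is.na(codes)]
  if (any(grepl("^32723", codes))) return("sleep_apnea")   # 327.23 (illustrative)
  if (any(grepl("^2765",  codes))) return("dehydration")   # 276.5x (illustrative)
  if (any(grepl("^73300", codes))) return("osteoporosis")  # 733.00 (illustrative)
  NA_character_
}
labels <- aggregate(ICD9_CODE ~ SUBJECT_ID, data = dx, FUN = label_for)
names(labels)[2] <- "LABEL"
labels <- labels[!is.na(labels$LABEL), ]

# Join each labeled patient's note text to the label and write it out
corpus <- merge(notes[, c("SUBJECT_ID", "TEXT")], labels, by = "SUBJECT_ID")
write.csv(corpus, "gold_standard_corpus.csv", row.names = FALSE)
```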
Using CLARK requires no NLP expertise, allowing researchers to save time and focus on selecting the features in clinical notes that guide classification algorithms. Features can be words, phrases, or numeric values that are likely to be associated with a particular group. Some features may be strongly associated with more than one possible group, and models may perform better when those features are excluded; this is part of why feature and model selection can and should be an iterative process. CLARK’s algorithms self-train by cross-validation, and each algorithm considers every feature selected by the user. Figure 1 displays the “Algorithm Setup” pane where features are loaded or created and selected in CLARK. In the sleep apnea example, age may be a confounding factor, as older age may correlate with sleep apnea, osteoporosis, and dehydration.
A “gold standard” set of notes is used to train a model with features that distinguish sleep apnea patients from others admitted to the hospitals. For simplicity, this example tries to distinguish sleep apnea patients from a population that also includes patients admitted with dehydration and osteoporosis. These are real clinical notes that differ greatly in content from the “notes” in the animal corpus example. It will likely take several iterations of feature selection to separate patients into these disease groups.
A literature review may help the user with feature selection before analysis. The goal of feature selection is to distinguish groups from one another while, ideally, fostering cohesion within a group. Since computable phenotypes are defined as shareable and reproducible algorithms, an existing CP may help guide feature selection or be built upon after analysis in CLARK. Features can also be found by manually reviewing the text of the “gold standard” notes. For example, a researcher could use this article, in addition to content in the labeled clinical notes and their own clinical expertise, to identify features to include in the model.
NLP with regular expressions (regex) is best suited to matching words and phrases in a body of text. Integer and decimal values can be captured with more complicated regex but may be redundant with values found elsewhere in the EHR. Within CLARK, regex are stored in a library and can be selected or suppressed across analysis iterations.
Regular expressions that separate patients into different groups can be created within CLARK and/or uploaded from a file. This sleep apnea example uses this regex library and these active regex as features for each classification algorithm. These files populate the “Algorithm Setup” pane in CLARK, as in Figure 1. Note that the regular expressions are all defined in the regex library and referenced in the active regex file. Details about how to use the Regex Library and Active Regular Expressions sections in CLARK are included in the technical documentation.
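To make the idea of a regex feature concrete, here is a small base R illustration. The patterns and the note snippet are invented for this guide (they are not the contents of the linked regex library), and CLARK performs its own matching internally; this only shows what a feature “hit” looks like.

```r
# Invented feature regexes applied to an invented snippet of note text.
features <- c(
  sleep_apnea  = "(obstructive )?sleep apnea|\\bOSA\\b|\\bCPAP\\b",
  dehydration  = "dehydrat(ed|ion)|volume depletion|poor (oral|po) intake",
  osteoporosis = "osteoporo(sis|tic)|\\bDEXA\\b|fragility fracture"
)
note <- "Pt with h/o OSA on CPAP at home, admitted with poor PO intake."

# Which feature groups have at least one match in this note?
sapply(features, grepl, x = note, ignore.case = TRUE, perl = TRUE)
#>  sleep_apnea  dehydration osteoporosis
#>         TRUE         TRUE        FALSE
```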
CLARK includes four machine-learning algorithms: Decision Tree, Linear Support Vector Machine (summarized results in Figure 2), Random Forest (summarized results in Figure 3), and Gaussian Naive Bayes (summarized results in Figure 4). These are explained in detail in the technical documentation.
Before selecting an algorithm to classify unlabeled data, consider setting a confidence level (or specificity threshold) below which human review will be conducted to check CLARK’s classifications. In the context of study recruitment, this threshold can be shifted based on recruitment priorities. CLARK reports a confidence level for each individual classification, which can guide which subjects to review manually.
As described in the technical documentation, algorithm performance can be evaluated on CLARK’s Explore page. After an algorithm is selected, the classification results can be sorted by label or by confidence level. To compare three different algorithms, all using the same features, we will look at the distribution of correct classifications at a given confidence level in Figures 2-4.
At first glance, none of the algorithms appears to have performed well: most subjects were misclassified despite careful feature selection. However, keep CLARK’s confidence levels in mind. Subsetting the results to cases where CLARK is at least 50% confident in its classification, the proportion of correct labels within the subset increases for all three algorithms.
Subsetting by confidence level does decrease the number of subjects whose results are displayed. Because some clinical notes were sparsely populated or contained information irrelevant to the diagnosis, they did not match any of the features defined by the regex. Generally, we can expect notes without features useful to the algorithm to have low confidence levels associated with their classification.
In this specific example, we may consider Gaussian Naive Bayes (GNB) to be the “best” performing model of the three: at the same >50% confidence threshold, GNB correctly labeled the most subjects. Note that algorithm performance depends on the data itself, the selected features and regular expressions, and the chosen model; with a different combination of features or a different set of clinical notes, GNB might produce different results and another algorithm might prove favorable. Moving forward, users could continue the feature selection process or apply this algorithm to the unlabeled evaluation set of clinical notes. With the evaluation set, choose a confidence level below which patients will be manually reviewed for sleep apnea diagnoses. Remember that CLARK classifies patients in the evaluation corpus using only the labels from the training data, so each patient can only be labeled as dehydration, osteoporosis, or sleep apnea.
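One way to reproduce this comparison outside CLARK is to work from the classification results exported on the Explore page (described in the next section). The base R sketch below assumes hypothetical file names and columns (true_label, predicted_label, and confidence on a 0-1 scale), keeps only classifications made with at least 50% confidence, and counts the remaining rows as candidates for manual review.

```r
# Compare algorithms on the labeled corpus at a chosen confidence threshold.
# File and column names are placeholders for CLARK's exported results.
compare_at_confidence <- function(file, threshold = 0.5) {
  res  <- read.csv(file, stringsAsFactors = FALSE)
  keep <- res$confidence >= threshold
  c(n_kept   = sum(keep),                                              # subjects retained
    accuracy = mean(res$true_label[keep] == res$predicted_label[keep]),
    n_review = sum(!keep))                                             # flagged for manual review
}

sapply(c(svm = "svm_results.csv",
         rf  = "rf_results.csv",
         gnb = "gnb_results.csv"),
       compare_at_confidence)
```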
A data set of resulting classifications and associated confidence levels for each patient in the training or evaluation corpus can be exported from CLARK on the Explore page after running an algorithm. This contains a row for each subject, their true and predicted label, and associated confidence levels. This information can be used to analyze sensitivity and specificity, if desired.
True positive rate (sensitivity) = fraction of subjects known to have a characteristic who are correctly classified (by the algorithm) as having it. A highly sensitive test captures more true positives; this means that of all subjects who truly have a condition, many or most are correctly classified as such.
False positive rate (1-specificity) = fraction of subjects known to be without a characteristic who are incorrectly classified (by the algorithm) as having it; equivalently, one minus the fraction correctly classified as not having it. A highly specific test captures more true negatives, correctly classifying more subjects without a certain condition.
Note that, for a corpus involving more than one possible label, sensitivity and specificity evaluations should take a one-vs-all approach. For example, the AUROC for classification as a sleep apnea patient would be found by grouping all non-sleep-apnea patients together as “not sleep apnea” before calculating TPR and FPR.
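Using the same hypothetical export, a one-vs-all sensitivity and specificity for the sleep apnea label could be computed as below; the column names are assumptions, not CLARK’s actual headers.

```r
# One-vs-all sensitivity/specificity with "sleep apnea" as the positive class.
results <- read.csv("gnb_results.csv", stringsAsFactors = FALSE)

truth <- results$true_label      == "sleep apnea"   # known sleep apnea patients
pred  <- results$predicted_label == "sleep apnea"   # classified as sleep apnea

sensitivity <- sum(pred & truth)   / sum(truth)     # TP / (TP + FN)
specificity <- sum(!pred & !truth) / sum(!truth)    # TN / (TN + FP)

c(sensitivity = sensitivity,
  specificity = specificity,
  false_positive_rate = 1 - specificity)
```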
Ideally, a classification results in high sensitivity and specificity, though there is often a trade-off between the two. With high sensitivity, low specificity is still possible. This means that, while a high number of subjects with a condition are correctly labeled, “false positives” may also be included in that group. With high specificity, low sensitivity is still possible, allowing for “true negatives” to be captured well at the expense of also including “false negatives”. Users can set thresholds with these cases in mind. To help prioritize sensitivity or specificity, consider the question: “Is it preferable to capture all true positives and include some false positives, or to only capture true positives, at the risk of missing some subjects?” Of course, CLARK’s labels are not binding in a study’s context, and human review is wise at some confidence level.
The Area under the Receiver Operating Characteristic Curve (AUROC) is generally used to evaluate the accuracy of a test that reveals a particular characteristic in subjects. A ROC curve is built by plotting the true positive rate (y-axis) against the false positive rate (x-axis); the AUROC is the area under this curve. In the context of CLARK, the test is the classification model, and the characteristic is the true label for a group of subjects. If several labels are possible for subjects, calculate sensitivity and specificity in a one-vs-all approach for each classification group.
True positive rate is also known as sensitivity, recall, or probability of detection. This is the proportion of people who truly belong in a classification that are correctly identified by the model.
True negative rate is also known as specificity. This is the proportion of people who truly do not belong in a classification that are correctly identified by the model. False positive rate = 1 - True negative rate.
The yellow line represents a test that is highly sensitive even where the false positive rate is very low, meaning the true negative and true positive rates are high at the same point. This indicates an accurate model and a test with a large area under the curve. In contrast, the blue line represents an AUC of only 0.5, which equates the accuracy of that test to a coin flip; an AUC close to 0.5 indicates that the model is not useful for classification.
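If the export also includes the confidence CLARK assigned to the sleep apnea label for each patient, a one-vs-all ROC curve and AUC can be sketched in base R as below. The apnea_conf column is an assumption added for this example, and the dashed diagonal is the AUC = 0.5 “coin flip” reference.

```r
# One-vs-all ROC curve and trapezoidal AUC for the sleep apnea label.
# Reuses `results` from the previous sketch; apnea_conf is hypothetical.
truth <- results$true_label == "sleep apnea"
score <- results$apnea_conf

ord <- order(score, decreasing = TRUE)
tpr <- c(0, cumsum(truth[ord])  / sum(truth))    # sensitivity at each cutoff
fpr <- c(0, cumsum(!truth[ord]) / sum(!truth))   # 1 - specificity at each cutoff

plot(fpr, tpr, type = "l",
     xlab = "False positive rate (1 - specificity)",
     ylab = "True positive rate (sensitivity)")
abline(0, 1, lty = 2)                            # AUC = 0.5 reference line

auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc
```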
The evaluation set has been classified. Now what?
1. Mo H, Thompson WK, Rasmussen LV, et al. Desiderata for computable representations of electronic health records-driven phenotype algorithms. J Am Med Inform Assoc. 2015;22(6):1220-1230. doi:10.1093/jamia/ocv112
2. Kirby JC, Speltz P, Rasmussen LV, Basford M, Gottesman O, Peissig PL, Pacheco JA, Tromp G, Pathak J, Carrell DS, Ellis SB, Lingren T, Thompson WK, Savova G, Haines J, Roden DM, Harris PA, Denny JC. PheKB: a catalog and workflow for creating electronic phenotype algorithms for transportability. J Am Med Inform Assoc. 2016 Mar 28. PMID: 27026615
3. Tape T. Interpreting Diagnostic Tests. http://gim.unmc.edu/dxtests/Default.htm