The purpose of this guide is to accompany the technical CLARK documentation with related concepts and a clinical example.
There are two versions of CLARK. CLARK v1 is the original and allows you to build classification algorithms using notes. CLARK v2 builds on this by adding the ability to incorporate structured data, such as demographics, labs, vitals, and medications, into your algorithm. This conceptual guide uses images of CLARK v1, but these concepts still apply to CLARK v2.
For instructions on how to use CLARK, visit the technical documentation for the version of CLARK you are working with: CLARK v1 or CLARK v2.
Introduction
Who can use CLARK?
Selecting a Research Question
Computable Phenotypes
Clinical Notes
A Classification Example
The MIMIC-III Database
Feature Selection
Literature Review
Regular Expressions
Feature Selection Example
Evaluating Algorithm Performance
Types of Classification
Confidence thresholds
Clinical Example: Algorithm Comparison
Algorithm Performance Outside CLARK
Sensitivity and Specificity
Area under the ROC Curve
Training Corpus Considerations
CLARK is a machine-learning classification software created by NC TraCS and CoVar Applied Technologies to enable computable phenotyping in unstructured data. Computable phenotypes (CP) are algorithms that search the electronic health record (EHR) for patients with specific clinical features based on diagnostic or procedural codes, lab values, clinical notes, and other information to define cohorts1. In addition to the structured EHR data, unstructured clinical notes hold a wealth of information that is applicable to identifying patient groups. Analyzing data from clinical notes traditionally has required time-consuming, manual chart reviews. With CLARK, once the effort to create a phenotype is completed, the phenotype can be replicated across other sets of notes.
CLARK enables and simplifies text mining for classification through the use of regular expressions, letting researchers efficiently gather features (values, words and phrases) found in clinical notes. These features include symptoms, details about a patient’s experiences, image interpretations, pathology, and other information. Using features found in clinical notes to train classification algorithms in CLARK reflects the reproducible, transferrable nature of computable phenotyping. CLARK’s natural-language processing (NLP) component also decreases the time spent on manual review of patient records for classification.
CLARK is free and open to the public. It is designed for clinicians or their research teams to train algorithms. Some expertise in the context of the classification problem is necessary; however, experience with machine learning and NLP is not needed. Using CLARK requires some familiarity with, or willingness to learn, regular expressions. Basic statistical knowledge is helpful for interpreting classification results and model performance, which is explained in the technical documentation.
CLARK can help to identify patient cohorts for a study when inclusion/exclusion criteria are framed as a classification problem, or it can be used to investigate the distribution of characteristics in a population.
A suitable question for CLARK may take the form: “Of this patient population, which individuals have this characteristic?” or “What is the distribution of these characteristics within this patient population?” The former is focused on classifying individuals, and the latter is focused on a general picture of the population. These situations may include differentiating between disease subtypes or the presence/absence of disease. Groups of interest should be discrete and well-defined.
Each machine-learning algorithm included in CLARK is a classification tool, so CLARK is not designed to predict continuous endpoints or arbitrarily defined groups. For example, whether or not heart failure patients are readmitted within 30 days of discharge is difficult to predict from clinical notes, because the content of a note may not differ much between patients readmitted at 29 versus 31 days: the two patient groups are not inherently distinct.
Past projects using CLARK included identifying patients with non-alcoholic fatty liver disease, symptomatic/asymptomatic uterine fibroids, and primary ciliary dyskinesia. The analysis example in this guide aims to identify a group of patients who have sleep apnea.
CP definitions are reproducible and transferrable, and they invite collaboration. A valid phenotype is robust and meets desired levels of sensitivity and specificity2. The Phenotype Knowledge Base (PheKB) is a phenotype library dedicated to hosting existing algorithms and creating space to share working algorithms.
Computable phenotyping may not always be the goal of a CLARK analysis; however, the two efforts can be related. Searching a CP library may be part of a literature review before using CLARK, and contributing to a CP library may be an option after a CLARK analysis. CPs for many conditions have been defined primarily with structured data, and CLARK’s NLP abilities may contribute to the richness and transferability of a computable phenotype. Visit The NIH Collaboratory for more about CPs.
Free-text clinical notes contain rich patient information not always captured in structured EHR data. Analysis in CLARK requires two sets (corpora) of clinical notes: a labeled training set with known classifications (the “gold standard”) and an evaluation set of subjects that have not yet been classified. Because patients are the unit being classified, a patient’s notes should appear in only one set, and each patient should carry only one label. The source and size of each set of notes depend on the condition of interest. The training set may come from existing labeled data, or users can manually label a subset of the selected population through chart review. For example, to identify a cohort of patients with sleep apnea, the population should be composed of patients who may have sleep apnea, and the training set should include patient notes from several groups, including the condition of interest.
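As a quick illustration of this rule, the base R sketch below checks two hypothetical corpus files; the file names and the SUBJECT_ID and LABEL columns are assumptions made for this example, not CLARK’s required input format.

```r
# Sanity checks on hypothetical corpus files: each training patient carries
# exactly one label, and no patient appears in both corpora.
train      <- read.csv("training_notes.csv",   stringsAsFactors = FALSE)
evaluation <- read.csv("evaluation_notes.csv", stringsAsFactors = FALSE)

labels_per_patient <- tapply(train$LABEL, train$SUBJECT_ID,
                             function(x) length(unique(x)))
stopifnot(all(labels_per_patient == 1))                       # one label each

stopifnot(!any(train$SUBJECT_ID %in% evaluation$SUBJECT_ID))  # disjoint patients
```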
Note: clinical notes contain identifiable protected health information (PHI). Notes and related CLARK files should be handled securely and saved in a location approved by your institution for storing PHI.
To illustrate a simplified clinical example of a CLARK analysis, consider this situation: researchers must recruit people with sleep apnea for a study of a new therapy. To do this, we are tasked with identifying patients with sleep apnea at several area hospitals based on the content of their clinical notes (some of which might not yet have diagnoses attached). To identify a cohort for the study, an algorithm must first be built using a labeled training corpus of clinical notes. The scope of this example covers feature selection, algorithm selection, and algorithm evaluation on the training data.
This example uses clinical notes from the MIMIC-III database as a training corpus. MIMIC is a database of clinical encounters developed by MIT and is available upon request. It contains de-identified healthcare data for approximately 60,000 hospital ICU admissions, ranging from demographics to imaging reports. You may create an account with physionet.org and request access to MIMIC here. The “gold standard” set of notes was created using this R script and the NOTEEVENTS.csv, DIAGNOSES_ICD.csv, and D_ICD_DIAGNOSES.csv datasets from MIMIC.
The source datasets are large and may be difficult to view as text files or in Excel. However, a computer that can run CLARK should be able to read the NOTEEVENTS.csv file into R within a few minutes. For brevity, this example uses a subset of patients and diagnoses from the MIMIC notes.
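For orientation only, the base R sketch below shows one way a labeled training corpus could be assembled from these MIMIC tables. It is not the linked R script; the ICD-9 code prefixes are illustrative stand-ins rather than a validated phenotype definition, and the output format is not necessarily what CLARK expects.

```r
# Illustrative sketch: label MIMIC-III patients by diagnosis code and attach
# their notes. Column names follow the MIMIC-III CSV headers.
notes <- read.csv("NOTEEVENTS.csv",    stringsAsFactors = FALSE)
dx    <- read.csv("DIAGNOSES_ICD.csv", stringsAsFactors = FALSE)

# Assign one label per patient from illustrative ICD-9 code prefixes
label_for <- function(codes) {
  codes <- codes[!is.na(codes)]
  if (any(grepl("^32723", codes))) return("sleep_apnea")   # 327.23 (illustrative)
  if (any(grepl("^2765",  codes))) return("dehydration")   # 276.5x (illustrative)
  if (any(grepl("^73300", codes))) return("osteoporosis")  # 733.00 (illustrative)
  NA_character_
}
labels <- aggregate(ICD9_CODE ~ SUBJECT_ID, data = dx, FUN = label_for)
names(labels)[2] <- "LABEL"
labels <- labels[!is.na(labels$LABEL), ]

# Join each labeled patient's note text to the label and write it out
corpus <- merge(notes[, c("SUBJECT_ID", "TEXT")], labels, by = "SUBJECT_ID")
write.csv(corpus, "gold_standard_corpus.csv", row.names = FALSE)
```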
Using CLARK requires no NLP expertise, allowing researchers to save time and focus on selecting the features in clinical notes that guide classification algorithms. Features can be words, phrases, or numeric values that are likely to be associated with a particular group. Some features may be strongly associated with more than one possible group, and models may perform better when those features are excluded; this is part of why feature and model selection can and should be an iterative process. CLARK’s algorithms self-train by cross-validation, and each algorithm considers every feature selected by the user. Figure 1 displays the “Algorithm Setup” pane where features are loaded or created and selected in CLARK. In the sleep apnea example, age may be a confounding factor, as older age may correlate with sleep apnea, osteoporosis, and dehydration.
A “gold standard” set of notes is used to train a model with features that distinguish sleep apnea patients from others admitted to the hospitals. For simplicity, this example tries to distinguish sleep apnea patients from a population that also includes patients admitted with dehydration and osteoporosis. These are real clinical notes that differ greatly in content from the “notes” in the animal corpus example. It will likely take several iterations of feature selection to separate patients into these disease groups.
A literature review may help the user with feature selection before analysis. The goal of feature selection is to distinguish groups from one another while, ideally, fostering cohesion within a group. Since computable phenotypes are defined as shareable and reproducible algorithms, an existing CP may help guide feature selection or be built upon after analysis in CLARK. Features can also be found by manually reviewing the text of the “gold standard” notes. For example, a researcher could use this article, in addition to content in the labeled clinical notes and their own clinical expertise, to identify features to include in the model.
NLP with regular expressions (regex) is best suited to matching words and phrases in a body of text. Integer and decimal values can be captured with more complicated regex but may be redundant with values found elsewhere in the EHR. Within CLARK, regex are stored in a library and can be selected or suppressed across analysis iterations.
Regular expressions that separate patients into different groups can be created within CLARK and/or uploaded from a file. This sleep apnea example uses this regex library and these active regex as features for each classification algorithm. These files populate the “Algorithm Setup” pane in CLARK, as in Figure 1. Note that the regular expressions are all defined in the regex library and referenced in the active regex file. Details about how to use the Regex Library and Active Regular Expressions sections in CLARK are included in the technical documentation.
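To make the idea of a regex feature concrete, here is a small base R illustration. The patterns and the note snippet are invented for this guide (they are not the contents of the linked regex library), and CLARK performs its own matching internally; this only shows what a feature “hit” looks like.

```r
# Invented feature regexes applied to an invented snippet of note text.
features <- c(
  sleep_apnea  = "(obstructive )?sleep apnea|\\bOSA\\b|\\bCPAP\\b",
  dehydration  = "dehydrat(ed|ion)|volume depletion|poor (oral|po) intake",
  osteoporosis = "osteoporo(sis|tic)|\\bDEXA\\b|fragility fracture"
)
note <- "Pt with h/o OSA on CPAP at home, admitted with poor PO intake."

# Which feature groups have at least one match in this note?
sapply(features, grepl, x = note, ignore.case = TRUE, perl = TRUE)
#>  sleep_apnea  dehydration osteoporosis
#>         TRUE         TRUE        FALSE
```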
CLARK includes four machine-learning algorithms: Decision Tree, Linear Support Vector Machine (summarized results in Figure 2), Random Forest (summarized results in Figure 3), and Gaussian Naive Bayes (summarized results in Figure 4). These are explained in detail in the technical documentation.
Before selecting an algorithm to classify unlabeled data, consider setting a confidence level (or specificity threshold) below which human review will be conducted to check CLARK’s classifications. In the context of study recruitment, this threshold can be shifted based on recruitment priorities. CLARK reports a confidence level for each individual classification, which can guide which subjects to review manually.
As described in the technical documentation, algorithm performance can be evaluated on CLARK’s Explore page. After an algorithm is selected, the classification results can be sorted by label or by confidence level. To compare three different algorithms, all using the same features, we will look at the distribution of correct classifications at a given confidence level in Figures 2-4.
At first glance, none of the algorithms appears to have performed well: most subjects were misclassified despite careful feature selection. However, keep CLARK’s confidence levels in mind. Subsetting the results to cases where CLARK is at least 50% confident in its classification, the proportion of correct labels within the subset increases for all three algorithms.
Subsetting by confidence level does decrease the number of subjects whose results are displayed. Because some clinical notes were sparsely populated or contained information irrelevant to the diagnosis, they did not match any of the features defined by the regex. Generally, we can expect notes without features useful to the algorithm to have low confidence levels associated with their classification.
In this specific example, we may consider Gaussian Naive Bayes (GNB) to be the “best” performing model of the three: at the same >50% confidence threshold, GNB correctly labeled the most subjects. Note that algorithm performance depends on the data itself, the selected features and regular expressions, and the chosen model; with a different combination of features or a different set of clinical notes, GNB might produce different results and another algorithm might prove favorable. Moving forward, users could continue the feature selection process or apply this algorithm to the unlabeled evaluation set of clinical notes. With the evaluation set, choose a confidence level below which patients will be manually reviewed for sleep apnea diagnoses. Remember that CLARK classifies patients in the evaluation corpus using only the labels from the training data, so each patient can only be labeled as dehydration, osteoporosis, or sleep apnea.
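One way to reproduce this comparison outside CLARK is to work from the classification results exported on the Explore page (described in the next section). The base R sketch below assumes hypothetical file names and columns (true_label, predicted_label, and confidence on a 0-1 scale), keeps only classifications made with at least 50% confidence, and counts the remaining rows as candidates for manual review.

```r
# Compare algorithms on the labeled corpus at a chosen confidence threshold.
# File and column names are placeholders for CLARK's exported results.
compare_at_confidence <- function(file, threshold = 0.5) {
  res  <- read.csv(file, stringsAsFactors = FALSE)
  keep <- res$confidence >= threshold
  c(n_kept   = sum(keep),                                              # subjects retained
    accuracy = mean(res$true_label[keep] == res$predicted_label[keep]),
    n_review = sum(!keep))                                             # flagged for manual review
}

sapply(c(svm = "svm_results.csv",
         rf  = "rf_results.csv",
         gnb = "gnb_results.csv"),
       compare_at_confidence)
```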
A data set of resulting classifications and associated confidence levels for each patient in the training or evaluation corpus can be exported from CLARK on the Explore page after running an algorithm. This contains a row for each subject, their true and predicted label, and associated confidence levels. This information can be used to analyze sensitivity and specificity, if desired.
True positive rate (sensitivity) = fraction of subjects known to have a characteristic who are correctly classified (by the algorithm) as having it. A highly sensitive test captures more true positives; this means that of all subjects who truly have a condition, many or most are correctly classified as such.
False positive rate (1-specificity) = fraction of subjects known to be without a characteristic who are incorrectly classified (by the algorithm) as having it; equivalently, one minus the fraction correctly classified as not having it. A highly specific test captures more true negatives, correctly classifying more subjects without a certain condition.
Note that, for a corpus involving more than one possible label, sensitivity and specificity evaluations should take a one-vs-all approach. For example, the AUROC for classification as a sleep apnea patient would be found by grouping all non-sleep-apnea patients together as “not sleep apnea” before calculating TPR and FPR.
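Using the same hypothetical export, a one-vs-all sensitivity and specificity for the sleep apnea label could be computed as below; the column names are assumptions, not CLARK’s actual headers.

```r
# One-vs-all sensitivity/specificity with "sleep apnea" as the positive class.
results <- read.csv("gnb_results.csv", stringsAsFactors = FALSE)

truth <- results$true_label      == "sleep apnea"   # known sleep apnea patients
pred  <- results$predicted_label == "sleep apnea"   # classified as sleep apnea

sensitivity <- sum(pred & truth)   / sum(truth)     # TP / (TP + FN)
specificity <- sum(!pred & !truth) / sum(!truth)    # TN / (TN + FP)

c(sensitivity = sensitivity,
  specificity = specificity,
  false_positive_rate = 1 - specificity)
```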
Ideally, a classification results in high sensitivity and specificity, though there is often a trade-off between the two. With high sensitivity, low specificity is still possible. This means that, while a high number of subjects with a condition are correctly labeled, “false positives” may also be included in that group. With high specificity, low sensitivity is still possible, allowing for “true negatives” to be captured well at the expense of also including “false negatives”. Users can set thresholds with these cases in mind. To help prioritize sensitivity or specificity, consider the question: “Is it preferable to capture all true positives and include some false positives, or to only capture true positives, at the risk of missing some subjects?” Of course, CLARK’s labels are not binding in a study’s context, and human review is wise at some confidence level.
The Area under the Receiver Operating Characteristic Curve (AUROC) is generally used to evaluate the accuracy of a test that reveals a particular characteristic in subjects. A ROC curve is built by plotting the true positive rate (y-axis) against the false positive rate (x-axis); the AUROC is the area under this curve. In the context of CLARK, the test is the classification model, and the characteristic is the true label for a group of subjects. If several labels are possible for subjects, calculate sensitivity and specificity in a one-vs-all approach for each classification group.
True positive rate is also known as sensitivity, recall, or probability of detection. This is the proportion of people who truly belong in a classification that are correctly identified by the model.
True negative rate is also known as specificity. This is the proportion of people who truly do not belong in a classification that are correctly identified by the model. False positive rate = 1 - True negative rate.
The yellow line represents a test that is highly sensitive even where the false positive rate is very low, meaning the true negative and true positive rates are high at the same point. This indicates an accurate model and a test with a large area under the curve. In contrast, the blue line represents an AUC of only 0.5, which equates the accuracy of that test to a coin flip; an AUC close to 0.5 indicates that the model is not useful for classification.
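If the export also includes the confidence CLARK assigned to the sleep apnea label for each patient, a one-vs-all ROC curve and AUC can be sketched in base R as below. The apnea_conf column is an assumption added for this example, and the dashed diagonal is the AUC = 0.5 “coin flip” reference.

```r
# One-vs-all ROC curve and trapezoidal AUC for the sleep apnea label.
# Reuses `results` from the previous sketch; apnea_conf is hypothetical.
truth <- results$true_label == "sleep apnea"
score <- results$apnea_conf

ord <- order(score, decreasing = TRUE)
tpr <- c(0, cumsum(truth[ord])  / sum(truth))    # sensitivity at each cutoff
fpr <- c(0, cumsum(!truth[ord]) / sum(!truth))   # 1 - specificity at each cutoff

plot(fpr, tpr, type = "l",
     xlab = "False positive rate (1 - specificity)",
     ylab = "True positive rate (sensitivity)")
abline(0, 1, lty = 2)                            # AUC = 0.5 reference line

auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc
```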
The evaluation set has been classified. Now what?
1. Mo H, Thompson WK, Rasmussen LV, et al. Desiderata for computable representations of electronic health records-driven phenotype algorithms. J Am Med Inform Assoc. 2015;22(6):1220-1230. doi:10.1093/jamia/ocv112
2. Kirby JC, Speltz P, Rasmussen LV, Basford M, Gottesman O, Peissig PL, Pacheco JA, Tromp G, Pathak J, Carrell DS, Ellis SB, Lingren T, Thompson WK, Savova G, Haines J, Roden DM, Harris PA, Denny JC. PheKB: a catalog and workflow for creating electronic phenotype algorithms for transportability. J Am Med Inform Assoc. 2016 Mar 28. PMID: 27026615
3. Tape T. Interpreting Diagnostic Tests. http://gim.unmc.edu/dxtests/Default.htm