The purpose of this guide is to accompany the technical CLARK documentation with related concepts and a clinical example.

There are two versions of CLARK. CLARK v1 is the original and allows you to build classification algorithms using notes. CLARK v2 builds on this by adding the ability to incorporate structured data, such as demographics, labs, vitals, and medications, into your algorithm. This conceptual guide uses images of CLARK v1, but these concepts still apply to CLARK v2.

For instructions on how to use CLARK, visit the technical documentation for the version of CLARK you are working with: CLARK v1 or CLARK v2.

Table of Contents

      Who can use CLARK?
      Selecting a Research Question
      Computable Phenotypes
      Clinical Notes

A Classification Example
      The MIMIC-III Database
      Feature Selection
      Literature Review
      Regular Expressions
      Feature Selection Example

Evaluating Algorithm Performance
      Types of Classification
      Confidence thresholds
      Clinical Example: Algorithm Comparison

Algorithm Performance Outside CLARK
      Sensitivity and Specificity
      Area under the ROC Curve
      Training Corpus Considerations

Results in Context



CLARK is machine-learning classification software created by NC TraCS and CoVar Applied Technologies to enable computable phenotyping in unstructured data. Computable phenotypes (CPs) are algorithms that search the electronic health record (EHR) for patients with specific clinical features, based on diagnostic or procedural codes, lab values, clinical notes, and other information, to define cohorts [1]. In addition to the structured EHR data, unstructured clinical notes hold a wealth of information that can help identify patient groups. Analyzing data from clinical notes has traditionally required time-consuming manual chart reviews. With CLARK, once a phenotype has been created, it can be replicated across other sets of notes.

CLARK enables and simplifies text mining for classification through the use of regular expressions, letting researchers efficiently gather features (values, words, and phrases) found in clinical notes. These features include symptoms, details about a patient’s experiences, image interpretations, pathology, and other information. Using features found in clinical notes to train classification algorithms in CLARK reflects the reproducible, transferable nature of computable phenotyping. CLARK’s natural-language processing (NLP) component also decreases the time spent on manual review of patient records for classification.
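
The idea of gathering features with regular expressions can be sketched in a few lines of Python. This is only an illustration of the concept, not CLARK's internal implementation: the feature names, the patterns, and the sample note below are all hypothetical, and the `feature_counts` helper is an assumption made for this example.

```python
import re

# Hypothetical feature definitions (name -> regular expression).
# These patterns are illustrative only and are not taken from CLARK.
FEATURES = {
    "cpap": re.compile(r"\bc-?pap\b", re.IGNORECASE),
    "apnea": re.compile(r"\bapno?ea\b", re.IGNORECASE),
    "snoring": re.compile(r"\bsnor(?:e|es|ing)\b", re.IGNORECASE),
}


def feature_counts(note_text):
    """Count how many times each feature's pattern matches in one note."""
    return {name: len(rx.findall(note_text)) for name, rx in FEATURES.items()}


# A made-up note fragment, not real patient data.
note = "Pt reports loud snoring. Uses CPAP nightly for obstructive sleep apnea."
counts = feature_counts(note)
print(counts)
```

Each note is reduced to a small set of counts, one per feature, which is the kind of numeric summary a classification algorithm can learn from.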

Essentially, CLARK reads a set of clinical notes from patients of known groups to train a classification algorithm based on words or values in these notes. Then, the algorithm is transferred to classify additional patients based on the content of their clinical notes.

Who can use CLARK?

CLARK is free and open to the public. It is designed for clinicians or their research teams to train algorithms. Some expertise around the context of the classification problem is necessary; however, experience with machine learning and NLP is not needed. Using CLARK requires some familiarity with, or willingness to learn, regular expressions. Basic statistical knowledge is helpful for interpreting classification results and model performance, which are explained in the technical documentation.

Selecting a Research Question

CLARK can help to identify patient cohorts for a study when inclusion/exclusion criteria are framed as a classification problem, or it can be used to investigate the distribution of characteristics in a population.

  • A suitable question for CLARK may take the form: “Of this patient population, which individuals have this characteristic?” or “What is the distribution of these characteristics within this patient population?” The former is focused on classifying individuals, and the latter is focused on a general picture of the population. These situations may include differentiating between disease subtypes or the presence/absence of disease. Groups of interest should be discrete and well-defined.

  • Each machine-learning algorithm included in CLARK is a classification tool, so CLARK is not designed to predict continuous endpoints or arbitrarily defined groups. For example, whether or not heart failure patients are readmitted within 30 days of discharge is difficult to predict based on clinical notes. This is because the content of a clinical note may not differ greatly between patients readmitted at 29 days and those readmitted at 31 days; the two patient groups are not inherently distinct.

Past projects using CLARK included identifying patients with non-alcoholic fatty liver disease, symptomatic/asymptomatic uterine fibroids, and primary ciliary dyskinesia. The analysis example in this guide aims to identify a group of patients who have sleep apnea.

Computable Phenotypes

CP definitions are reproducible and transferable, and they invite collaboration. A valid phenotype is robust and meets desired levels of sensitivity and specificity [2]. The Phenotype Knowledge Base (PheKB) is a phenotype library that hosts existing algorithms and provides space to share working algorithms.

Computable phenotyping may not always be the goal of a CLARK analysis; however, the two efforts can be related. Searching a CP library may be part of a literature review before using CLARK, and contributing to a CP library may be an option after a CLARK analysis. CPs for many conditions have been defined primarily with structured data, and CLARK’s NLP abilities may contribute to the richness and transferability of a computable phenotype. Visit the NIH Collaboratory for more about CPs.

Clinical Notes

Free-text clinical notes contain rich patient information not always captured in structured EHR data. Analysis in CLARK requires two sets (corpora) of clinical notes: the labeled training set with known classifications (“gold standard”), and the evaluation set with subjects that have not been classified. A patient should have notes in only one set, since patients are the unit being classified. This also means that each patient should have only one label. The source and size of each set of notes differ depending on the condition of interest. The training set may come from existing labeled data, or users can manually label a subset of the selected population through chart review. For example, to identify a cohort of patients with sleep apnea, the population should consist of patients who may have sleep apnea; a training set should include patient notes from several groups, including the condition of interest.

Note: clinical notes contain identified protected health information (PHI). Notes and related CLARK files should be treated securely and saved in a location approved at your institution for storing PHI.  
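
The two corpus rules described above (one label per patient, and no patient in both sets) can be checked before loading notes into CLARK. The following is a minimal sketch under stated assumptions: the tuple layouts, the `check_corpora` helper, and the patient IDs are hypothetical and do not reflect CLARK's input format, which is described in the technical documentation.

```python
from collections import defaultdict


def check_corpora(training, evaluation):
    """Report patients that break the one-label / one-set rules.

    training:   iterable of (patient_id, label, note_text)   (assumed layout)
    evaluation: iterable of (patient_id, note_text)           (assumed layout)
    """
    labels = defaultdict(set)
    for patient_id, label, _text in training:
        labels[patient_id].add(label)

    # Patients whose training notes carry more than one distinct label.
    multiple_labels = sorted(p for p, seen in labels.items() if len(seen) > 1)

    # Patients who appear in both the training and the evaluation set.
    eval_ids = {patient_id for patient_id, _text in evaluation}
    in_both_sets = sorted(set(labels) & eval_ids)

    return {"multiple_labels": multiple_labels, "in_both_sets": in_both_sets}


# Made-up records for illustration only (no real patient data).
training = [
    ("P001", "sleep_apnea", "note a"),
    ("P001", "sleep_apnea", "note b"),
    ("P002", "dehydration", "note c"),
    ("P002", "osteoporosis", "note d"),
]
evaluation = [("P001", "note e"), ("P003", "note f")]

problems = check_corpora(training, evaluation)
print(problems)
```

An empty list under both keys would indicate that the two sets satisfy the rules.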

A Classification Example

To illustrate a simplified CLARK analysis, consider this situation: researchers must recruit people with sleep apnea for a study to test a new therapy. To do this, we are tasked with identifying patients with sleep apnea from some hospitals in the area based on the content of their clinical notes (some of which might not have diagnoses attached yet). To identify a cohort for the study, an algorithm must first be built using a labeled training corpus of clinical notes. The scope of this example covers feature selection, algorithm selection, and algorithm evaluation on the training data.

The MIMIC-III Database

This example uses clinical notes from the MIMIC-III database as a training corpus. MIMIC is a database of clinical encounters developed by MIT and is available after requesting access. It contains de-identified healthcare data for approximately 60,000 hospital ICU admissions, ranging from demographics to imaging reports. You may create an account and request access to MIMIC through PhysioNet, which hosts the database. The “gold standard” set of notes was created using an R script and the NOTEEVENTS.csv, DIAGNOSES_ICD.csv, and D_ICD_DIAGNOSES.csv datasets from MIMIC.

The source datasets are large and may be difficult to view as text files or in Excel. However, a computer that can run CLARK should be able to read the NOTEEVENTS.csv file into R within a few minutes. For brevity, this example uses a subset of patients and diagnoses from MIMIC notes.
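
For readers who prefer Python to R, the general approach of labeling notes from diagnosis codes can be sketched as follows. This is not the R script referenced above and makes several assumptions: the column names (`SUBJECT_ID`, `ICD9_CODE`, `TEXT`) and the ICD-9 codes in `CODE_TO_LABEL` should be verified against your copy of MIMIC-III and D_ICD_DIAGNOSES.csv, and the rule of keeping only patients with exactly one group of interest is a choice made for this illustration. The sample rows are made up; with real data, the functions would be given open file handles instead of `io.StringIO` objects.

```python
import csv
import io

# Assumed ICD-9 codes, written without decimal points; verify before use.
CODE_TO_LABEL = {
    "32723": "sleep_apnea",
    "27651": "dehydration",
    "73300": "osteoporosis",
}


def label_patients(diagnoses_file):
    """Map SUBJECT_ID -> label for patients with exactly one group of interest."""
    found = {}
    for row in csv.DictReader(diagnoses_file):
        label = CODE_TO_LABEL.get(row["ICD9_CODE"])
        if label:
            found.setdefault(row["SUBJECT_ID"], set()).add(label)
    return {sid: next(iter(seen)) for sid, seen in found.items() if len(seen) == 1}


def labeled_notes(notes_file, labels):
    """Stream notes one row at a time, yielding (SUBJECT_ID, label, TEXT)."""
    for row in csv.DictReader(notes_file):
        label = labels.get(row["SUBJECT_ID"])
        if label:
            yield row["SUBJECT_ID"], label, row["TEXT"]


# Tiny made-up stand-ins for DIAGNOSES_ICD.csv and NOTEEVENTS.csv.
diagnoses = io.StringIO(
    "SUBJECT_ID,HADM_ID,ICD9_CODE\n1,10,32723\n2,20,27651\n2,21,73300\n3,30,4019\n"
)
notes = io.StringIO(
    'SUBJECT_ID,TEXT\n1,"Uses CPAP, tolerating well."\n2,Dry mucous membranes.\n3,Hypertension.\n'
)

labels = label_patients(diagnoses)
gold_standard = list(labeled_notes(notes, labels))
print(gold_standard)
```

Reading the notes row by row avoids holding the whole file in memory, which may help on machines where opening the full dataset is slow.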

Feature Selection

Using CLARK requires no NLP expertise, which saves time and lets researchers focus on selecting the features in clinical notes that guide classification algorithms. Features can be words, phrases, or numeric values that are likely to be associated with a particular group. Figure 1 displays the “Algorithm Setup” pane, where features are loaded or created and selected in CLARK. Some features may be strongly associated with more than one possible group, and the models may perform better when these features are excluded. CLARK’s algorithms self-train by cross-validation, and each algorithm considers every feature selected by the user. This is part of why feature and model selection can and should be an iterative process. In the sleep apnea example, age may be a possible confounding factor, as older age may correlate with sleep apnea, osteoporosis, and dehydration.
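
One way to spot a feature that is associated with more than one group is to compare how often it appears in the notes of each group. The sketch below is an assumption-laden illustration rather than a CLARK function: `match_rate_by_group`, the two patterns, and the toy notes are all hypothetical.

```python
import re
from collections import defaultdict


def match_rate_by_group(pattern, labeled_notes):
    """Fraction of notes in each group where the pattern matches at least once."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits = defaultdict(int)
    totals = defaultdict(int)
    for label, text in labeled_notes:
        totals[label] += 1
        if rx.search(text):
            hits[label] += 1
    return {label: hits[label] / totals[label] for label in totals}


# Made-up notes for illustration only (no real patient data).
toy_notes = [
    ("sleep_apnea", "78 yo male on CPAP at night."),
    ("sleep_apnea", "Snoring noted; CPAP titration planned."),
    ("osteoporosis", "82 yo female with vertebral compression fracture."),
    ("dehydration", "80 yo male, poor oral intake, dry mucous membranes."),
]

cpap_rates = match_rate_by_group(r"\bcpap\b", toy_notes)
older_age_rates = match_rate_by_group(r"\b[7-9]\d\s*yo\b", toy_notes)
print(cpap_rates)
print(older_age_rates)
```

In this toy data the CPAP pattern appears only in the sleep apnea notes, while the older-age pattern appears in every group, which is the kind of feature that may be worth excluding in a later iteration.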

A “gold standard” set of notes is used to train a model with features that distinguish sleep apnea patients from others admitted to the hospitals. For simplicity in this example, we will try to distinguish sleep apnea patients from a population that also includes those admitted with dehydration and osteoporosis. These are real clinical notes that differ greatly in content from the “notes” in the animal corpus example. Expect to need several iterations of feature selection to classify patients with the different diseases in this example.