CLARK v2 is machine-learning classification software created by NC TraCS and CoVar Applied Technologies to enable computable phenotyping in structured and unstructured data. CLARK’s user-friendly interface makes natural language processing (NLP) an accessible option for searching free-text clinical notes in addition to structured electronic health record (EHR) data (e.g., demographics, lab tests, medication orders).

This page includes user instructions and technical documentation for CLARK v2.

For instructions on CLARK v1, go here.

For a conceptual guide to CLARK including research applications and interpretation of results, go here.

Both CLARK v1 and v2 are free and available for download here.

Table of Contents

Getting Started
      System Requirements
      CLARK: Basic Steps
      Loading and Saving Progress

Key Concepts
   Required Data
   Free Text and Structured Data

   Regular Expressions
      Basic Regular Expressions
      Section Break
      Clinical Examples

Training Data
      Loading the Training Data

   Algorithm Setup
      Unstructured data
      Library of Regular Expressions
      Expressions in Use
   Viewing Notes
      Patients and Notes
      Structured Data

      Algorithm Steps-Training Data
      Algorithm Steps-Evaluation Data
      Machine Learning Classifiers

      Distribution by Labels
      Filtered Records
      Evaluation data Results
      Exporting Results

Technical Appendix
      FHIR details
      Algorithms in detail
      General Troubleshooting

Getting Started

System Requirements

CLARK runs best on Windows machines with 16 GB of RAM and does not require special infrastructure to operate. Processing may take longer with 8 GB of RAM.


CLARK can be downloaded after creating a free account with NC TraCS. Follow the instructions under the “Sign In” menu or click here to create an account. Unzip the downloaded files with an application such as WinZip or 7-Zip, double-click CLARK Installer 2.0.exe, and read README.txt and the license agreement. CLARK opens automatically once the installation finishes.

CLARK: Basic Steps

A typical CLARK analysis follows these steps:

  1. Form a classification question, then identify groups and group-defining features of interest (to be used in the classification algorithm) through a literature review.

  2. Select a training (“gold standard”) set and evaluation (“unlabeled”) set of patients including EHR data and clinical notes, each organized in the FHIR format. Load the training set into CLARK on the Load Data page.

  3. Format features to use in the algorithm as regular expressions that match text in the body of clinical notes.

  4. From the structured (EHR) data, choose demographic, lab, vitals, or other features to include in the classification.

  5. Using cross-validation, iteratively train and assess algorithms and combinations of features until satisfied with CLARK’s performance on the labeled training data.

  6. Apply the algorithm of choice to the unlabeled evaluation data set.

  7. Review, interpret, and use classification results in context.
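
The cross-validation idea in step 5 can be sketched in a few lines of Python. This is an illustration of the general technique only, not CLARK's internal implementation; the function name `k_fold_splits` and the choice of three folds are assumptions made for the example. Each labeled patient is held out exactly once while the remaining patients are used for training.

```python
from typing import Iterator, List, Tuple


def k_fold_splits(n_patients: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, held_out_indices) for each of k folds."""
    if not 2 <= k <= n_patients:
        raise ValueError("k must be between 2 and the number of patients")
    indices = list(range(n_patients))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_patients // k + (1 if i < n_patients % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, held_out
        start += size


# Hypothetical example: 10 labeled patients split into 3 folds.
folds = list(k_fold_splits(n_patients=10, k=3))
```

In practice the patient order is usually shuffled before splitting, and the folds are often stratified so that each one contains every label.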

Loading and Saving Progress

A previously saved CLARK session may be loaded by selecting Load Session on the Load Data page or the welcome screen. To save a session, click the Save icon in the bottom-left sidebar of CLARK. Saving a session preserves the features selected for the algorithm, the EHR data and notes in use, and the current algorithm results. To keep the results of several algorithms for one set of data, save a separate session for each algorithm. Sessions are saved as .json files containing regular expressions and file paths to the FHIR directory in use. Moving the data after saving a CLARK session will cause errors, because the saved file paths will no longer point to the data.

WARNING: When using CLARK to explore clinical data, be mindful that the saved session will contain identified protected health information (PHI). The session should be saved in a location approved at your institution for storing PHI.

Key Concepts

Required Data

Clinical notes contain free-text patient data ranging from family history to current symptoms and imaging results. Structured data from the EHR may include lab tests, vitals, medications, etc. In the context of CLARK, a data set is defined as a collection of clinical notes and structured domains from the EHR for a sample of patients. CLARK requires two different sets of data for a complete classification: the “gold standard” training set with known patient classifications (labels), and the unlabeled “evaluation” set containing patients distinct from those in the training set. Machine learning algorithms are trained first using the gold standard data, then applied to the unlabeled evaluation set for classification. For use in CLARK, these notes and other data sources must be converted to a specific format that renders a whole set usable and searchable.
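
Because the evaluation set must contain patients distinct from those in the training set, it can help to check for overlap before loading either set. The snippet below is a minimal sketch of such a check; the patient IDs and the function name `overlapping_patients` are hypothetical.

```python
def overlapping_patients(training_ids, evaluation_ids):
    """Return patient IDs that appear in both sets, sorted for stable output."""
    return sorted(set(training_ids) & set(evaluation_ids))


# Hypothetical patient IDs for illustration only.
training_ids = ["p001", "p002", "p003"]
evaluation_ids = ["p003", "p004"]

overlap = overlapping_patients(training_ids, evaluation_ids)
if overlap:
    print("Patients in both sets:", overlap)
```

Any ID reported here should be removed from one of the two sets before analysis.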

Creating Gold Standards
Researchers can build a gold standard set through manual chart review for a subset of the patient population, leaving the rest of the population unlabeled for the evaluation data set. Each patient should have one distinct label, and patients of the same group should have identical labels. For example, CLARK treats “diabetic”, “diabetes”, and “T1D” as three different groups; because these terms indicate the same characteristic, they should share a common label in the data.
Gold standard labels provide CLARK’s classification options for the unlabeled data, so all possible groups should be represented in the training data. The size of a gold standard set depends on several factors: analysis goals, rarity of condition of interest, population size, and desired confidence level for classifications.
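
One way to enforce a single label per group is to normalize chart-review labels before building the Labels files. The following is a sketch under stated assumptions: the synonym map, the canonical label names, and the patient IDs are all hypothetical, and this step happens outside CLARK.

```python
from collections import Counter

# Hypothetical synonym map: every raw chart-review label maps to one canonical group.
CANONICAL_LABELS = {
    "diabetic": "diabetes",
    "diabetes": "diabetes",
    "t1d": "diabetes",
    "no diabetes": "no_diabetes",
    "non-diabetic": "no_diabetes",
}


def normalize_label(raw_label: str) -> str:
    """Map a raw label to its canonical group, failing loudly on unknown labels."""
    key = raw_label.strip().lower()
    if key not in CANONICAL_LABELS:
        raise ValueError(f"Unmapped label: {raw_label!r}")
    return CANONICAL_LABELS[key]


# Hypothetical chart-review output: patient ID -> label as typed by the reviewer.
raw_labels = {"p001": "Diabetic", "p002": "T1D", "p003": "non-diabetic", "p004": "diabetes "}

labels = {pid: normalize_label(raw) for pid, raw in raw_labels.items()}
group_counts = Counter(labels.values())  # confirms every group is represented
```

Raising an error on an unmapped label, rather than passing it through, keeps a misspelled label from silently becoming its own group.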


Figure 2: Folder structure for training data.

FHIR stands for Fast Healthcare Interoperability Resources. It is a health data exchange standard developed and maintained by HL7. FHIR-compliant files contain references to the HL7 site and ensure controlled terminology for medications, demographics, symptoms, indications, etc.

  • CLARK requires .json files organized and formatted according to FHIR version 4.0, containing unstructured (clinical notes) and structured (other domains) EHR data.
  • The FHIR .json files should be organized as shown in Figure 2. All files are housed in a parent folder (called “training data” in this example). This folder should include:
    • Patients folder: contains a list of patients and their demographics. If a patient is not included in this folder, their other data will not be loaded into CLARK.
    • Labels folder: contains a list of patients with their classifications (e.g., sleep apnea).
    • Labs folder: contains a FHIR-formatted .json file for labs data, with corresponding patient IDs.
    • Medications folder: contains a FHIR-formatted .json file for medications data, with corresponding patient IDs.
    • Vitals folder: contains FHIR-formatted .json files for vitals, with corresponding patient IDs.
    • Notes folder: contains a .json file with unstructured clinical notes for each patient.
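
Before loading a parent folder, the layout described above can be checked with a short script. This is a sketch only: the subfolder names are taken from the list above, but the exact spelling CLARK expects on disk is an assumption, and CLARK may not require every domain folder. The example builds a temporary folder so it can run anywhere.

```python
import tempfile
from pathlib import Path

# Assumed subfolder names, based on the list above; actual names on disk may differ.
EXPECTED_SUBFOLDERS = ["Patients", "Labels", "Labs", "Medications", "Vitals", "Notes"]


def missing_subfolders(parent: Path) -> list:
    """Return the expected subfolder names that are absent from the parent folder."""
    return [name for name in EXPECTED_SUBFOLDERS if not (parent / name).is_dir()]


# Demonstration with a temporary parent folder that has only three subfolders.
with tempfile.TemporaryDirectory() as tmp:
    parent = Path(tmp) / "training data"
    for name in ["Patients", "Labels", "Notes"]:
        (parent / name).mkdir(parents=True)
    missing = missing_subfolders(parent)

print("Missing subfolders:", missing)
```

For a real data set, replace the temporary folder with the path to the parent folder that will be selected in CLARK.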

From the CLARK home screen, select the parent folder (e.g., test data) to load into CLARK. CLARK then loads all the patient and domain data for analysis. Only one parent folder may be loaded at a time.


Figure 3: Sample entry in a vitals file.
  • All data files must be saved as .json files in which a single patient’s data and entries are enclosed in curly braces { } and fields are separated by commas. For clarity, each field in this example is on a separate line; this is not necessary in practice, as long as entries are properly separated with punctuation.

  • When first extracted from a database, clinical notes and structured data vary in format across institutions. CLARK therefore requires that training and evaluation data sets comply with FHIR standards.

  • The id field in Figure 3 distinguishes patients from each other and is used to link data from different files to the same patient.

  • The fullUrl and system fields and their sub-fields reference HL7 to employ controlled terminology.

  • FHIR standards are international. Race and ethnicity use the US Core FHIR extension.

  • Notes must be saved as a .json file and align with the FHIR DocumentReference resource, with one record for each note.
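
To make the points above concrete, the sketch below builds one structured entry and one note as FHIR R4 resources and shows how both link back to the same patient. It is an illustration under assumptions, not a copy of Figure 3: the IDs, values, and note text are invented, the note body is stored as a base64-encoded attachment (one option FHIR allows), and whether CLARK expects these resources wrapped in a Bundle entry with a fullUrl is not shown here.

```python
import base64
import json

note_text = "Pt reports loud snoring. Sleep apnea suspected."  # invented example text

# A vitals record as a FHIR R4 Observation; "system" points to a controlled terminology.
vitals_observation = {
    "resourceType": "Observation",
    "id": "obs-001",
    "status": "final",
    "code": {"coding": [{"system": "http://loinc.org", "code": "8867-4", "display": "Heart rate"}]},
    "subject": {"reference": "Patient/p001"},
    "effectiveDateTime": "2020-01-15T09:30:00Z",
    "valueQuantity": {"value": 72, "unit": "beats/minute",
                      "system": "http://unitsofmeasure.org", "code": "/min"},
}

# A clinical note as a FHIR R4 DocumentReference: one record per note.
note_document = {
    "resourceType": "DocumentReference",
    "id": "note-001",
    "status": "current",
    "subject": {"reference": "Patient/p001"},
    "date": "2020-01-15T09:30:00Z",
    "content": [{"attachment": {
        "contentType": "text/plain",
        "data": base64.b64encode(note_text.encode("utf-8")).decode("ascii"),
    }}],
}


def patient_id(resource: dict) -> str:
    """Extract the patient ID from a 'Patient/<id>' subject reference."""
    return resource["subject"]["reference"].split("/", 1)[1]


def note_body(document: dict) -> str:
    """Decode the base64 attachment of a DocumentReference back into text."""
    return base64.b64decode(document["content"][0]["attachment"]["data"]).decode("utf-8")


serialized = json.dumps([vitals_observation, note_document], indent=2)
```

Both resources resolve to patient `p001` through their subject reference, which is how records from different files are tied to one patient.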

How to get FHIR-formatted data
Major EHR systems allow data to be exported in FHIR format. Additionally, CAMP FHIR is a tool created by TraCS that can create FHIR-formatted data.

Free Text and Structured Data

Clinical notes are the only unstructured data loaded into CLARK, and they are where CLARK uses NLP to search for matched features. Regular expressions are used only to define features in clinical notes, not in structured EHR data such as labs and vital signs. A patient may have several notes or none; notes are linked to the patient by patient ID and distinguished from one another by date/time. Before starting analysis, a patient’s structured and unstructured data, if available, can be viewed in CLARK.
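
The relationship between patients, notes, and regular expressions can be sketched as follows. This is a simplified illustration, not CLARK's implementation: the note records are flattened into plain dictionaries with hypothetical field names, and the pattern and note text are invented. How CLARK converts matches into algorithm features is not shown.

```python
import re
from collections import defaultdict

# Hypothetical, flattened note records: patient ID, timestamp, and free text.
notes = [
    {"patient_id": "p001", "date": "2020-03-01T10:00:00", "text": "Pt reports snoring; sleep apnea suspected."},
    {"patient_id": "p001", "date": "2020-01-15T09:30:00", "text": "Routine visit. No complaints."},
    {"patient_id": "p002", "date": "2020-02-02T14:00:00", "text": "History of obstructive SLEEP  APNEA, on CPAP."},
]

# Case-insensitive pattern that tolerates variable whitespace between the words.
pattern = re.compile(r"sleep\s+apnea", re.IGNORECASE)

# Link notes to patients by patient ID.
notes_by_patient = defaultdict(list)
for note in notes:
    notes_by_patient[note["patient_id"]].append(note)


def matches_per_note(patient_id: str) -> list:
    """Return (date, match_count) pairs for a patient's notes, oldest first."""
    ordered = sorted(notes_by_patient.get(patient_id, []), key=lambda n: n["date"])
    return [(n["date"], len(pattern.findall(n["text"]))) for n in ordered]


for pid in ["p001", "p002", "p003"]:  # p003 has no notes at all
    print(pid, matches_per_note(pid))
```

A patient without notes simply produces an empty list, which mirrors the case where only structured data is available for that patient.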

Regular Expressions