CLARK v2 is machine-learning classification software created by NC TraCS and CoVar Applied Technologies to enable computable phenotyping in structured and unstructured data. CLARK’s user-friendly interface makes natural language processing (NLP) an accessible option for searching free-text clinical notes in addition to structured EHR data (e.g., demographics, lab tests, medication orders).
This page includes user instructions and technical documentation for CLARK v2.
For instructions on CLARK v1, go here.
For a conceptual guide to CLARK including research applications and interpretation of results, go here.
Both CLARK v1 and v2 are free and available for download here.
Getting Started
System Requirements
Installation
CLARK: Basic Steps
Navigation
Loading and Saving Progress
Key Concepts
Required Data
FHIR
Formatting
Free Text and Structured Data
Regular Expressions
Basic Regular Expressions
Section Break
Clinical Examples
Training Data
Loading the Training Data
Troubleshooting
Setup
Algorithm Setup
Unstructured data
Library of Regular Expressions
Expressions in Use
Sectioning
Viewing Notes
Patients and Notes
Coverage
Structured Data
Algorithm
Algorithm Steps-Training Data
Algorithm Steps-Evaluation Data
Machine Learning Classifiers
Cross-Validation
Explore
Distribution by Labels
Confidence
Filtered Records
Evaluation data Results
Exporting Results
Technical Appendix
FHIR details
Cross-validation
Algorithms in detail
General Troubleshooting
CLARK runs best on Windows machines with 16 GB of RAM and does not require special infrastructure to operate. Processing may take longer with 8 GB of RAM.
CLARK can be downloaded from tracs.unc.edu after creating a free account with NC TraCS. Follow the instructions under the “Sign In” menu or click here to create an account. Unzip the downloaded files with an application such as WinZip or 7-Zip. Then double-click CLARK Installer 2.0.exe, and read README.txt and the license agreement. CLARK opens automatically once the installation finishes.
Generally, CLARK can be used in the following steps:
Form a classification question, then identify groups and group-defining features of interest (to be used in the classification algorithm) through a literature review.
Select a training (“gold standard”) set and evaluation (“unlabeled”) set of patients including EHR data and clinical notes, each organized in the FHIR format. Load the training set into CLARK on the Load Data page.
Format features to use in the algorithm as regular expressions that match text in the body of clinical notes.
From the structured (EHR) data, choose demographic, lab, vitals, or other features to include in the classification.
Using cross-validation, iteratively train and assess algorithms and combinations of features until satisfied with CLARK’s performance on the labeled training data.
Transfer algorithm of choice to the unlabeled evaluation data set.
Review, interpret, and use classification results in context.
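The cross-validation step above can be illustrated with a minimal sketch in plain Python. This is not CLARK's implementation: the fold count, the random seed, the example labels, and the placeholder majority-label "classifier" are all assumptions chosen to show how labeled patients are split into folds, with each fold held out once for assessment.

```python
import random
from collections import Counter


def k_fold_indices(n_items, k, seed=0):
    """Shuffle indices 0..n_items-1 and deal them into k folds of near-equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def majority_label(labels):
    """Placeholder classifier (assumption): always predict the most common training label."""
    return Counter(labels).most_common(1)[0][0]


def cross_validate(labels, k=5, seed=0):
    """Return the held-out accuracy for each of the k folds."""
    folds = k_fold_indices(len(labels), k, seed)
    accuracies = []
    for held_out in folds:
        held_out_set = set(held_out)
        # Train on every patient that is not in the held-out fold.
        train_labels = [lab for i, lab in enumerate(labels) if i not in held_out_set]
        prediction = majority_label(train_labels)
        correct = sum(1 for i in held_out if labels[i] == prediction)
        accuracies.append(correct / len(held_out))
    return accuracies


# Hypothetical gold standard labels for 20 patients.
labels = ["diabetes"] * 12 + ["no_diabetes"] * 8
folds = k_fold_indices(len(labels), 5)
scores = cross_validate(labels, k=5)
print(scores)
```

Each patient is held out exactly once, so the per-fold accuracies together describe performance on the whole labeled set without scoring the algorithm on data it was trained on.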
A previously saved CLARK session may be loaded by selecting Load Session on the Load Data page or the welcome screen. Sessions may be saved by clicking the Save icon in the bottom-left sidebar of CLARK. Saving a session preserves features selected for the algorithm, EHR data and notes in use, and current algorithm results. To save the results of several algorithms for one set of data, save a session for each algorithm. Sessions are saved as .json files containing regular expressions and file paths to the FHIR directory in use. Because a session stores these file paths, moving the data after saving the session will cause errors.
WARNING: When using CLARK to explore clinical data, be mindful that the saved session will contain identified protected health information (PHI). The session should be saved in a location approved at your institution for storing PHI.
Clinical notes contain free-text patient data ranging from family history to current symptoms and imaging results. Structured data from the EHR may include lab tests, vitals, medications, etc. In the context of CLARK, a data set is defined as a collection of clinical notes and structured domains from the EHR for a sample of patients. CLARK requires two different sets of data for a complete classification: the “gold standard” training set with known patient classifications (labels), and the unlabeled “evaluation” set containing patients distinct from those in the training set. Machine learning algorithms are trained first using the gold standard data, then applied to the unlabeled evaluation set for classification. For use in CLARK, these notes and other data sources must be converted to a specific format that renders a whole set usable and searchable.
Creating Gold Standards
Researchers can build a gold standard set through manual chart review for a subset of the patient population, leaving the rest of the population unlabeled for the evaluation data set. Each patient should have one distinct label, and patients of the same group should have identical labels. For example, CLARK recognizes “diabetic”, “diabetes”, and “T1D” as different groups; however, these terms indicate the same characteristic and should share a common label in the data.
Gold standard labels provide CLARK’s classification options for the unlabeled data, so all possible groups should be represented in the training data. The size of a gold standard set depends on several factors: analysis goals, rarity of condition of interest, population size, and desired confidence level for classifications.
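One way to enforce a common label before loading the gold standard is to normalize the chart-review labels with a lookup table. The sketch below is an illustration only: the synonym map, the patient IDs, and the label names are hypothetical, and CLARK itself does not perform this step.

```python
from collections import Counter

# Hypothetical synonym map: every variant on the left collapses to one label.
LABEL_SYNONYMS = {
    "diabetic": "diabetes",
    "diabetes": "diabetes",
    "t1d": "diabetes",
    "control": "no_diabetes",
    "non-diabetic": "no_diabetes",
}


def normalize_label(raw_label):
    """Map a raw chart-review label to its common label, ignoring case and padding."""
    key = raw_label.strip().lower()
    if key not in LABEL_SYNONYMS:
        # Fail loudly so an unexpected label never becomes its own group.
        raise ValueError(f"Unmapped label: {raw_label!r}")
    return LABEL_SYNONYMS[key]


# Hypothetical chart-review output: patient ID -> label as typed by the reviewer.
chart_review = {
    "p001": "Diabetic",
    "p002": "T1D",
    "p003": "control",
    "p004": " diabetes ",
}

normalized = {pid: normalize_label(label) for pid, label in chart_review.items()}
label_counts = Counter(normalized.values())
print(label_counts)
```

Counting the normalized labels also gives a quick check that every group of interest is represented in the training data.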
FHIR stands for Fast Healthcare Interoperability Resources. It is a health data exchange standard developed and maintained by HL7. FHIR-compliant files contain references to the HL7 site and ensure controlled terminology for medications, demographics, symptoms, indications, etc.
From the CLARK home screen, select the parent folder (e.g., test data) to load into CLARK. This allows CLARK to load all the patient and domain data for analysis. Only one parent folder may be loaded at a time.
Each data file must be saved as a .json file in which a single patient’s data and entries are enclosed in { } brackets, and fields are separated by commas. For clarity in this example, each field is on a separate line; this is not necessary in practice, as long as entries are properly separated with punctuation.
When first extracted from a database, clinical notes and structured data vary in format across institutions. CLARK therefore requires that training and evaluation data sets comply with FHIR standards.
The id field in figure 3 distinguishes patients from each other and is used to link data from different files to the same patient.
The fullUrl and system fields, and their sub-fields, reference HL7 to employ controlled terminology.
FHIR standards are international. Race and ethnicity use the US Core FHIR extension.
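To make the fields described above concrete, here is a minimal sketch of a FHIR Bundle holding one Patient resource, parsed with Python's standard json module. The patient ID, the fullUrl value, and the demographic values are placeholders, and the exact file layout CLARK expects may differ. The race extension URL and the coding system are the ones published in the US Core implementation guide.

```python
import json

US_CORE_RACE_URL = "http://hl7.org/fhir/us/core/StructureDefinition/us-core-race"

# Placeholder data: the ID, fullUrl, and demographics are invented for illustration.
bundle_text = r"""
{
  "resourceType": "Bundle",
  "type": "collection",
  "entry": [
    {
      "fullUrl": "http://example.org/fhir/Patient/example-001",
      "resource": {
        "resourceType": "Patient",
        "id": "example-001",
        "gender": "female",
        "birthDate": "1970-05-01",
        "extension": [
          {
            "url": "http://hl7.org/fhir/us/core/StructureDefinition/us-core-race",
            "extension": [
              {
                "url": "ombCategory",
                "valueCoding": {
                  "system": "urn:oid:2.16.840.1.113883.6.238",
                  "code": "2106-3",
                  "display": "White"
                }
              },
              {"url": "text", "valueString": "White"}
            ]
          }
        ]
      }
    }
  ]
}
"""


def patients_by_id(bundle):
    """Index the Patient resources in a Bundle by their id field."""
    patients = {}
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") == "Patient":
            patients[resource["id"]] = resource
    return patients


def race_display(patient):
    """Return the display text of the US Core race category, or None if absent."""
    for ext in patient.get("extension", []):
        if ext.get("url") == US_CORE_RACE_URL:
            for sub in ext.get("extension", []):
                if sub.get("url") == "ombCategory":
                    return sub["valueCoding"].get("display")
    return None


bundle = json.loads(bundle_text)
patients = patients_by_id(bundle)
print(list(patients), race_display(patients["example-001"]))
```

Parsing a file this way before loading it into CLARK is a quick check that the JSON is well formed and that every patient carries an id that other files can link to.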
Notes must be saved as a .json file and align with the FHIR DocumentReference resource, with one record for each note.
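As a rough illustration, one note record might look like the DocumentReference built below. The FHIR specification stores inline attachment data as base64, so the sketch encodes the note text and decodes it again. The note ID, patient reference, date, and note text are invented, and whether CLARK reads the note body from this exact field is an assumption.

```python
import base64
import json

# Invented note text for illustration only.
note_text = "Patient reports polyuria. A1c 8.2%. Assessment: type 2 diabetes."

doc_ref = {
    "resourceType": "DocumentReference",
    "id": "note-001",
    "status": "current",
    "subject": {"reference": "Patient/example-001"},
    "date": "2020-01-15T09:30:00Z",
    "content": [
        {
            "attachment": {
                "contentType": "text/plain",
                # FHIR attachment data is base64-encoded.
                "data": base64.b64encode(note_text.encode("utf-8")).decode("ascii"),
            }
        }
    ],
}


def note_patient_id(record):
    """Extract the patient ID from a reference such as 'Patient/example-001'."""
    return record["subject"]["reference"].split("/")[-1]


def note_body(record):
    """Decode the text of the first attachment in a DocumentReference."""
    encoded = record["content"][0]["attachment"]["data"]
    return base64.b64decode(encoded).decode("utf-8")


# Round-trip through JSON text, as the record would be stored in a .json file.
stored = json.loads(json.dumps(doc_ref))
print(note_patient_id(stored), stored["date"], note_body(stored))
```

The subject reference ties the note to a patient, and the date distinguishes one note from another for the same patient.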
How to get FHIR-formatted data
Major EHR systems allow data to be exported in FHIR format. Additionally, CAMP FHIR is a tool created by TraCS that can create FHIR-formatted data.
Clinical notes are the only unstructured data loaded into CLARK, and this is where CLARK uses NLP to search for matched features. Regular expressions are used only to define features in clinical notes, not in structured EHR data such as labs and vital signs. A patient may have several notes or none; notes are linked to the patient by patient ID and distinguished from one another by date/time. Before starting analysis, a patient’s structured and unstructured data, if available, can be viewed in CLARK.
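The idea of defining features as regular expressions over note text can be sketched with Python's re module. The feature names, patterns, and notes below are hypothetical, and CLARK's regular expression flavor and the way it turns matches into features may differ; this sketch simply counts matches per patient.

```python
import re

# Hypothetical features, each defined by one regular expression.
FEATURES = {
    "diabetes_mention": re.compile(r"\bdiabet(es|ic)\b", re.IGNORECASE),
    "a1c_value": re.compile(r"\b(hba1c|a1c)\s*(of|:|=)?\s*\d+(\.\d+)?\s*%?", re.IGNORECASE),
    "insulin": re.compile(r"\binsulin\b", re.IGNORECASE),
}

# Invented notes: a patient may have several notes, linked by patient ID.
notes = [
    {"patient_id": "p001", "date": "2020-01-15",
     "text": "Hx of type 2 diabetes. A1c 8.2% today. Started insulin."},
    {"patient_id": "p001", "date": "2020-03-02",
     "text": "Diabetic foot exam normal."},
    {"patient_id": "p002", "date": "2020-02-10",
     "text": "No acute complaints. Vitals stable."},
]


def count_matches(notes, features):
    """Count matches of each feature across all of a patient's notes."""
    counts = {}
    for note in notes:
        row = counts.setdefault(note["patient_id"], {name: 0 for name in features})
        for name, pattern in features.items():
            row[name] += sum(1 for _ in pattern.finditer(note["text"]))
    return counts


counts = count_matches(notes, FEATURES)
for patient_id, row in counts.items():
    print(patient_id, row)
```

A patient with no notes never appears in the result, so a real pipeline would need to decide how to represent such patients, for example with a row of zeros.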