CLARK v2 is machine-learning classification software created by NC TraCS and CoVar Applied Technologies to enable computable phenotyping in structured and unstructured data. CLARK’s user-friendly interface makes natural language processing (NLP) an accessible option for searching free-text clinical notes in addition to structured EHR data (e.g., demographics, lab tests, medication orders).

This page includes user instructions and technical documentation for CLARK v2.

For instructions on CLARK v1, go here.

For a conceptual guide to CLARK including research applications and interpretation of results, go here.

Both CLARK v1 and v2 are free and available for download here.

Table of Contents

Getting Started
      System Requirements
      Installation
      CLARK: Basic Steps
      Navigation
      Loading and Saving Progress

Key Concepts
   Required Data
   FHIR
      Formatting
   Free Text and Structured Data

   Regular Expressions
      Basic Regular Expressions
      Section Break
      Clinical Examples

Training Data
      Loading the Training Data
      Troubleshooting

Setup
   Algorithm Setup
      Unstructured data
      Library of Regular Expressions
      Expressions in Use
      Sectioning
   Viewing Notes
      Patients and Notes
      Coverage
      Structured Data

Algorithm
      Algorithm Steps - Training Data
      Algorithm Steps - Evaluation Data
      Machine Learning Classifiers
      Cross-Validation

Explore
      Distribution by Labels
      Confidence
      Filtered Records
      Evaluation Data Results
      Exporting Results

Technical Appendix
      FHIR Details
      Cross-validation
      Algorithms in detail
      General Troubleshooting

Getting Started

System Requirements

CLARK runs best on Windows machines with 16 GB of RAM and does not require special infrastructure to operate. Processing may take longer with 8 GB of RAM.

Installation

CLARK can be downloaded from tracs.unc.edu after creating a free account with NC TraCS. Follow the instructions under the “Sign In” menu or click here to create an account. The downloaded files must be unzipped using an application such as WinZip or 7-Zip. Double-click CLARK Installer 2.0.exe, and review README.txt and the license agreement. CLARK opens automatically once the installation finishes.


CLARK: Basic Steps

Generally, CLARK can be used in the following steps:

  1. Form a classification question, then identify groups and group-defining features of interest (to be used in the classification algorithm) through a literature review.

  2. Select a training (“gold standard”) set and evaluation (“unlabeled”) set of patients including EHR data and clinical notes, each organized in the FHIR format. Load the training set into CLARK on the Load Data page.

  3. Format features to use in the algorithm as regular expressions that match text in the body of clinical notes.

  4. From the structured (EHR) data, choose demographic, lab, vitals, or other features to include in the classification.

  5. Using cross-validation, iteratively train and assess algorithms and combinations of features until satisfied with CLARK’s performance on the labeled training data.

  6. Transfer the algorithm of choice to the unlabeled evaluation data set.

  7. Review, interpret, and use classification results in context.


Loading and Saving Progress

A previously saved CLARK session may be loaded by selecting Load Session on the Load Data page or the welcome screen. Sessions may be saved by clicking the Save icon in the bottom-left sidebar of CLARK. Saving a session preserves features selected for the algorithm, EHR data and notes in use, and current algorithm results. To save the results of several algorithms for one set of data, save a session for each algorithm. Sessions are saved as .json files containing regular expressions and filepaths to the FHIR directory in use. Moving the data after saving a CLARK session will cause errors.

WARNING: When using CLARK to explore clinical data, be mindful that the saved session will contain identified protected health information (PHI). The session should be saved in a location approved at your institution for storing PHI.


Key Concepts

Required Data

Clinical notes contain free-text patient data ranging from family history to current symptoms and imaging results. Structured data from the EHR may include lab tests, vitals, medications, etc. In the context of CLARK, a data set is defined as a collection of clinical notes and structured domains from the EHR for a sample of patients. CLARK requires two different sets of data for a complete classification: the “gold standard” training set with known patient classifications (labels), and the unlabeled “evaluation” set containing patients distinct from those in the training set. Machine learning algorithms are trained first using the gold standard data, then applied to the unlabeled evaluation set for classification. For use in CLARK, these notes and other data sources must be converted to a specific format that renders a whole set usable and searchable.

Creating Gold Standards
Researchers can build a gold standard set through manual chart review for a subset of the patient population, leaving the rest of the population unlabeled for the evaluation data set. Each patient should have one distinct label, and patients of the same group should have identical labels. For example, CLARK recognizes “diabetic”, “diabetes”, and “T1D” as different groups; however, these terms indicate the same characteristic and should share a common label in the data.
Gold standard labels provide CLARK’s classification options for the unlabeled data, so all possible groups should be represented in the training data. The size of a gold standard set depends on several factors: analysis goals, rarity of condition of interest, population size, and desired confidence level for classifications.

FHIR

Figure 2: Folder structure for training data.

FHIR stands for Fast Healthcare Interoperability Resources. It is a health data exchange standard developed and maintained by HL7. FHIR-compliant files contain references to the HL7 site and ensure controlled terminology for medications, demographics, symptoms, indications, etc.

  • CLARK requires FHIR (version 4.0)-organized and formatted .json files containing unstructured (clinical notes) and structured (other domains) EHR data.
  • The FHIR .json files should be organized as shown in figure 2. All files are housed in a parent folder (called training data in this example). This folder should include:
    • Patients folder: contains a list of patients and their demographics. If a patient is not included in this folder, their other data will not be uploaded to CLARK.
    • Labels folder: contains a list of patients with their classifications (e.g., sleep apnea).
    • Labs: contains a FHIR-formatted .json file for labs data, with corresponding patient IDs.
    • Medications: contains a FHIR-formatted .json file for medications data, with corresponding patient IDs.
    • Vitals: contains FHIR-formatted .json files for vitals data, with corresponding patient IDs.
    • Notes: contains a .json file with unstructured clinical notes for each patient.

From the CLARK home screen, select the parent folder (e.g., training data) to load into CLARK. This allows CLARK to load all the patient and domain data for analysis. Only one parent folder may be uploaded at a time.
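As a quick sanity check before loading, the parent folder can be verified to contain the expected subfolders. A minimal sketch in Python, assuming the folder names shown in figure 2 (adjust the path and names to your own data set):

    import os

    # Hypothetical parent folder path; replace with your own FHIR data directory.
    parent = "training data"

    # Subfolders CLARK expects, per the folder structure described above (figure 2).
    expected = ["Patients", "Labels", "Labs", "Medications", "Vitals", "Notes"]

    for name in expected:
        path = os.path.join(parent, name)
        status = "found" if os.path.isdir(path) else "MISSING"
        print(f"{name:12s} {status}")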

Formatting

Figure 3: Sample entry in a vitals file.
  • All data files must be saved as .json files in which a single patient’s data and entries are enclosed in { } brackets and fields are separated by commas. For clarity in this example, each field is on a separate line; this is not necessary in practice, as long as entries are properly separated with punctuation.

  • When first extracted from a database, clinical notes and structured data vary in format across institutions. CLARK therefore requires that training and testing data sets comply with FHIR standards.

  • The id field in figure 3 distinguishes patients from each other and is used to link data from different files to the same patient.

  • The fullUrl and system fields and sub-fields reference HL7 to employ controlled terminology (an illustrative example follows this list).

  • FHIR standards are international. Race and ethnicity use the US Core FHIR extension.

  • Notes must be saved as a .json file and align with the FHIR document reference resource, with one record for each note.
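To make the structure concrete, the sketch below builds a minimal, illustrative FHIR-style Bundle containing a single vitals Observation and prints it as .json. The specific identifiers, LOINC code, and values are assumptions for illustration only; actual files must follow the FHIR 4.0 layout shown in figure 3.

    import json

    # Illustrative only: a minimal Bundle with one Observation entry.
    # Real vitals files should match the FHIR 4.0 layout shown in figure 3.
    bundle = {
        "resourceType": "Bundle",
        "type": "collection",
        "entry": [
            {
                "fullUrl": "urn:uuid:example-observation-1",  # hypothetical identifier
                "resource": {
                    "resourceType": "Observation",
                    "id": "example-observation-1",
                    "subject": {"reference": "Patient/12345"},  # links the entry to a patient id
                    "code": {
                        "coding": [
                            {
                                "system": "http://loinc.org",  # controlled terminology reference
                                "code": "8867-4",
                                "display": "Heart rate"
                            }
                        ]
                    },
                    "valueQuantity": {"value": 72, "unit": "beats/minute"}
                }
            }
        ]
    }

    print(json.dumps(bundle, indent=2))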


How to get FHIR-formatted data
Major EHR systems allow data to be exported in FHIR format. Additionally, CAMP FHIR is a tool created by TraCS that can convert source data into FHIR format.


Free Text and Structured Data

Clinical notes are the only unstructured data uploaded to CLARK, and this is where CLARK uses NLP to search for matched features. Regular expressions will only be used to define features in clinical notes, not structured EHR data such as labs and vital signs. A patient may have several notes or none, and they are linked by patient ID and distinguished by date/time. Before starting analysis, a patient’s structured and unstructured data, if available, can be viewed in CLARK.


Regular Expressions

Figure 4 (hover to animate): View notes and structured data in CLARK.

Basic Regular Expressions

A regular expression (regex) is a string of letters or numbers that uses additional special characters to define a search pattern in a body of text. The purpose of using regular expressions in CLARK is to robustly identify features (words, values, or phrases) in clinical notes that distinguish patient groups of interest from one another.

A single regular expression can match several words. For example, “\bfly?(ies|ight|ing|)\b” matches “fly”, “flight”, “flying”, and “flies” by offering alternate suffixes to “fly”, and the “?” qualifier makes “y” optional in the prefix.

Writing regular expressions does not require any text-mining expertise and can be learned through online tutorials. Regex101.com provides a useful interface to practice using regular expressions, and RexEgg.com includes an in-depth tutorial to help develop more complicated expressions. Note that CLARK uses python “flavored” regex. This is generally the same as other regex flavors, and more information can be found here.
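Because CLARK uses Python-flavored regular expressions, candidate expressions can be tested with Python’s re module before they are added to CLARK. A minimal sketch using the “fly” example above:

    import re

    pattern = re.compile(r"\bfly?(ies|ight|ing|)\b")

    text = "The fly took flight; it was flying with other flies."
    print([m.group(0) for m in pattern.finditer(text)])
    # ['fly', 'flight', 'flying', 'flies']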


General Regular Expression Examples

  • “(?i)” makes the search case-insensitive. Example: “(?i)heart” matches “HEART”, “Heart”, “heart”.
  • “\b” marks a word boundary and is used to enclose a whole word or phrase. Example: “\bhigh BP\b” matches “high BP” but will not match “high” alone.
  • “[abc]” matches a single character a, b, or c. Example: “arm[sy]” matches “arms”, “army”.
  • “\d” matches a single digit 0-9. Example: “\d\d\d” matches “456” or “333”.
  • “\D” matches anything that is not a digit. Example: “\D\D\d” matches “AB1”.
  • “.” matches any character except a line break. Example: “...” matches “abc” or “a c”.
  • “*” matches the preceding item zero or more times. Example: “a*h*” matches “ah” or “aaaaahh”.
  • “\” escapes special characters: some characters have special functions in regex, and to search for the literal character as text, precede it with a backslash. Example: “1\.5\+1\?” matches “1.5+1?”.

Note: In .json files, each time a backslash is used, it should be duplicated. For example, the word boundary “\b” should be entered as “\\b” in a .json file. The backslash does not need to be duplicated when writing expressions within CLARK.
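The doubled backslash is ordinary .json escaping, which can be seen by serializing a regular expression from Python. A small sketch:

    import json

    # The regex as written inside CLARK (single backslash for the word boundary).
    regex_in_clark = r"\bhigh BP\b"

    # json.dumps escapes each backslash, which is how the regex must appear in a .json file.
    print(json.dumps(regex_in_clark))   # "\\bhigh BP\\b"

    # Reading the .json string back recovers the original single-backslash regex.
    print(json.loads(json.dumps(regex_in_clark)) == regex_in_clark)   # True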


Section Break


Figure 5: Section Break tab within CLARK

Once notes are loaded into CLARK, Sections can be identified in the Sections tab on the Unstructured Data section of the Setup page.

  • The Section Breaker box accepts section-defining regular expressions that can separate sections of a clinical note. Sections can then be named and, if the user chooses, excluded from analysis.

  • Sections are particularly useful when the user wants to omit from the analysis sections that may contain irrelevant information, such as medical history or previous prescriptions. Sectioning works best when clinical notes are all similar in structure and organization within a data set.


Instructions for including or excluding specified sections can be found here.


Clinical Examples

There are endless possible regular expressions that can match words and phrases in clinical notes. Below is a sample of features that could be used to classify some common conditions.

Clinical Regular Expression Examples

  • Obesity: “(?i)\b(?<!not )obes(ity|e)\b” matches “Obese”, “obesity”.
  • Tobacco: “(?i)\b(?<!non-)(smok(ing|er|es))|(tobacco)|(cigar[est]{0,5})\b” matches “tobacco”, “smoking”, “smoker”, “cigars”, “cigarettes”.
  • Coughing: “(?i)\b(cough[ings]{0,3})|(wheez[inges]{0,3})\b” matches “cough”, “coughs”, “wheezing”, “wheezes”.
  • High Blood Pressure: “(?i)\b(high\s(bp|blood pressure))|(hypertension)\b” matches “high BP”, “hypertension”, “high blood pressure”.
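The Obesity expression above uses a negative lookbehind, (?<!not ), so that explicitly negated mentions are not matched. A quick check in Python, using hypothetical note fragments:

    import re

    obesity = re.compile(r"(?i)\b(?<!not )obes(ity|e)\b")

    # Hypothetical note fragments.
    print(bool(obesity.search("Patient is obese with a BMI of 41.")))   # True
    print(bool(obesity.search("History of obesity, now resolved.")))    # True
    print(bool(obesity.search("Patient is not obese.")))                # False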

Training Data

A CLARK analysis begins on the Setup page. Here, the user selects the set of notes and structured data domains to be analyzed. Regular expressions saved externally may also be uploaded here, in addition to those added within the CLARK environment.

What is a training data set? The training data set consists of “gold standard” data: a set of subjects’ clinical notes and structured data with their true labels included. Labels, or groups, are defined in the context of the classification question. This set is used to build and assess an algorithm before attempting to classify patients of unknown groups. Training an algorithm is an iterative process, taking several rounds of feature selection and model performance evaluation. The example used in this documentation consists of only two groups: those with or without sleep apnea. However, CLARK can classify into multiple groups. Patients from each possible group should be included in the training data set.

Loading the Training Data

The training data can be loaded on CLARK’s home screen by selecting “Load Data.” A file explorer pops up; users may then navigate to the location of the (FHIR-compliant) training data and double-click the appropriate folder to load it. A green check appears in the left sidebar once the clinical notes and EHR data are successfully loaded.


Troubleshooting

If there are issues with the data uploaded, CLARK notifies users with a warning icon in the bottom-left sidebar. A data set may still be used in analysis if there are errors, but they should be investigated. The two types of errors are “Files Errors” and “Linking Errors”. In each error report, errors associated with each domain are contained in square brackets. To download and view the error reports, click on the warning icon in the bottom-left.

Reading and Understanding Error Reports

While the reports may seem overwhelming and text-heavy at first, they are easy to parse once you know what to look for. In the File Errors report, each file name is listed, followed by square brackets; any errors for that domain are contained within them. Empty square brackets ([ ]) mean there are no errors for that file. In the Linking Errors report, errors are not grouped by file, but like errors are grouped together.

“Files Errors” pertain to the formatting of the .json files containing clinical notes or other domains. Incorrect coding or separation of fields may cause issues preventing some information or patients from being included in analysis. FHIR 4.0 formatting examples can be helpful in resolving these issues.

  • The warning, “Observation id x includes a comparator. This is not supported,” indicates a “<” or “>” has been used to describe values in the lab domain. This is a reminder that CLARK reads “<” or “>” as “=”. Therefore, “< 5.0 mg/dL” will be read as “= 5.0 mg/dL”.

“Linking Errors” include issues with linking patients’ data across domains and clinical notes. If a patient ID is missing from some domains of a data set, errors will be produced. Ensure that patient IDs are accurate and each patient has all expected information populated in the data set.

  • If a patient has data in other domains, but their ID is not listed in the patients domain, then their other data will not be loaded into CLARK. This is noted with the warning “Discarding observation id x due to no patient with id x.”

  • The warning “Mismatch in system/code pair display” indicates a problem with LOINC codes. To address this, ensure that the lab.json file has a one-to-one mapping of LOINC codes to display names. CLARK can still be used with this warning present.


Setup

Figure 6 (hover to animate): Choose and define features in Setup

The Setup page, with feature-selection window depicted in figure 6, allows users to define/upload regular expressions, select structured data features to include in the algorithm, and view individual patients’ data.

Algorithm Setup

Unstructured data

Regular expressions identify words or phrases whose presence in clinical notes can contribute to the classification algorithm. These are managed in the Unstructured Data section of the Setup page in CLARK.

Library of Regular Expressions

The Library stores regular expressions, which may be added to a given classification algorithm. Regexes may be added to the Library from a .json-formatted file (see figure 8) or directly via the CLARK interface (see figure 7). Regular expressions stored in the Library will only be used in an algorithm if they are added under the Expressions tab (see section below: Expressions in Use).

Users can save regular expressions in the Library by clicking the floppy disk icon within the Library pane. Saved Libraries can be loaded into CLARK by clicking the arrow icon within the Library pane.

Libraries allow users to re-use commonly used regular expressions and to share them with others.


Figure 7: RegEx Library in “Setup.”

Figure 8: Regexes .json file that may be imported to Library or Expressions
Expressions in Use

The Expressions tab is used to add, define, delete, and export active regex for use in the algorithm. Deleting a feature from the Expressions tab ensures it will no longer be used by the algorithm or highlighted in clinical notes. Deleting a regex from the Expressions tab does not affect those in the Library tab.


There are three ways to add regex to Active Regular Expressions. These methods can be used individually or in combination.


Figure 9: reference the regex library

1. Load a .json file containing regular expressions. To load a .json file into the Library of regex or the Expressions in use, click the arrow icon at the bottom of the window within either tab. An “Active RegEx.json” file has the same structure as a file uploaded to the Library: the feature name and regular expression are specified in the “name” and “regex” fields, respectively (see the sketch after this list). Note: Loading a .json file will overwrite any existing regular expressions.

2. Enter regex directly into the CLARK interface. As in the regex Library, use the “+” button to add a row in the Expressions tab. Enter the name of the feature in the name field, and enter the regular expression in the Reg Exp field below.

3. Reference a regex from the Library. A regex that is already defined in the Library can be called from the Expressions tab for use in the algorithm. To do so, use the “+” button to add a row in Expressions and enter a name in the name field. In the Reg Exp field, enter “#name” instead of a regex, where name references the desired feature in the Library. See the example in figure 9.
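As a rough illustration of such a file, the sketch below writes a set of entries, each with a “name” and a “regex” field, to a .json file from Python. The exact top-level layout of the file is an assumption here; compare with figure 8 or a file exported from CLARK before loading.

    import json

    # Assumed layout: a list of feature entries, each with "name" and "regex" fields.
    # Verify against figure 8 / an exported file from CLARK before relying on this shape.
    expressions = [
        {"name": "Sleep",   "regex": "(?i)\\bsleep\\b"},
        {"name": "CPAP",    "regex": "(?i)\\bcpap\\b"},
        {"name": "Obesity", "regex": "(?i)\\b(?<!not )obes(ity|e)\\b"},
    ]

    # In the written file each backslash appears doubled (e.g. \\b), per the .json escaping rule above.
    with open("Active RegEx.json", "w") as f:
        json.dump(expressions, f, indent=2)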

Differences between the Expressions and Library Panes:
All regex in the list of Expressions are considered “active” and will be used in the classification algorithm. Those listed in the Library are not automatically included in the classification algorithm; the Library can be thought of as a reference area. To exclude an active regular expression from analysis, it must be deleted from the Expressions pane; this includes references to the Library (e.g., “#feature”).


Sectioning

Entire patient records or standalone clinical notes cannot be excluded once loaded into CLARK. However, sections of notes can be excluded. This is done using regular expressions in the Sections tab under Unstructured Data. Sectioning is optional and can be completed before or after adding features to the regular expressions library. Sections, if included, are separated by headers in the free-text clinical notes within the data set. This portion of the documentation explains how to define sections and make CLARK recognize and select sections. By default, all sections of a note are used in the algorithm.

  • Section headers may be formatted in the same way so that they are easily identified in a note. Section headers mark the beginning (and therefore the end) of a section, and their structure is defined in the Section Breaker field. For example, if sections of a note are defined by 1-2 words of 2-20 letters followed by a colon, the regex in Section Breaker could be “(?i)([a-z]{2,20})?(\s)?([a-z]{2,20})?:”. This matches headers such as “Procedures:” or “Family History:”. Section breaks might also include a line break or tab character; viewing clinical notes can help in choosing a Section Break regex (a sketch illustrating the sectioning logic follows this list).

  • Defining a Section Break does not automatically invoke sectioning. To identify a named section, add a row in the Sections tab using the “+” button. In the Regex Editor, enter a regex that matches the header of the section in the Reg Exp line, enclosing it in word boundaries (“\b” and “\b”), underneath the name of the header. The “Ignore” box can be selected if this section should be excluded from analysis. For example, “\b(?i)procedures\b” allows the “Procedures” section to be extracted from other sections in the note. CLARK recognizes the end of a named section when the next term matching the Section Break regex appears in the note.

  • Select Ignore to omit a section from analysis, meaning the terms found by regex in this section are not counted in the classification. When a section is ignored, the coverage values of features found in that section change.

  • Sections excluded from the training data set will also be excluded from notes in the evaluation data. Similarly to features, section-defining regular expressions may be saved and uploaded as .json files using the save icon in the same tab under Unstructured Data.
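The sketch below illustrates the sectioning logic outside of CLARK, using a slightly tightened variant of the Section Breaker example above and a hypothetical note. It is an illustration of the idea, not CLARK’s internal implementation.

    import re

    # Hypothetical note text.
    note = ("Chief Complaint: daytime sleepiness and loud snoring. "
            "Family History: father with hypertension. "
            "Procedures: none.")

    # Section headers: one or two words of 2-20 letters followed by a colon
    # (an assumption; adjust the Section Breaker regex to your notes).
    section_breaker = re.compile(r"(?i)\b[a-z]{2,20}(\s[a-z]{2,20})?:")

    # Locate headers, then slice the note into (header, body) pairs.
    headers = list(section_breaker.finditer(note))
    sections = {}
    for i, h in enumerate(headers):
        start = h.end()
        end = headers[i + 1].start() if i + 1 < len(headers) else len(note)
        sections[h.group(0).rstrip(":")] = note[start:end].strip()

    # "Ignore" a section by dropping it before searching for features.
    ignored = {"Family History"}
    kept_text = " ".join(body for name, body in sections.items() if name not in ignored)
    print(kept_text)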

When to Use Sectioning
If users expect certain information is irrelevant to classification, sectioning can come in handy. For example, general family history might not reveal anything new about a condition among all the other information provided in clinical notes. Excluding sections of clinical notes can also speed up the algorithm when there may be a large volume of notes or patients to process.
Sectioning is not always appropriate. It is best used in cases in which all included notes have a predictable format and structure, and the user should have a thorough understanding of these patterns.


Viewing Notes

Figure 10 (hover to animate): Navigate to patient notes.

Patients and Notes

The right-hand window (displayed in figure 10) in the Setup page has a list of IDs and basic patient info. This includes the number of entries from each domain associated with each patient. Notes can be viewed by clicking on a patient’s row, and the upper-left menu icon in the pop-up window can be used to navigate to notes. If a patient has multiple notes, they will display as separate note records.

  • The patient list can be filtered by any of the following patient-specific attributes: ID, birthdate, gender, and marital status.
  • In this example, patients have either positive or negative diagnoses for sleep apnea. Phrases including “sleep” and “CPAP” will likely help distinguish the patient as positive for sleep apnea, since one can expect these to be common features in someone with sleep apnea.
  • “Alcoholism” can be found in some positive patients’ notes, but it is a common enough condition to be found in negative patients’ notes as well.
  • While some words or phrases are strongly associated with a certain group, simultaneous associations with other groups may muddle the classification. To make decisions about features, researchers benefit from clinical expertise, literature reviews, and iterative algorithm testing in CLARK.
Figure 11 (hover to animate): Highlighted features in notes.


Coverage

Found next to regex listed in the Expressions tab, coverage describes how commonly each regex-defined feature is found in clinical notes among all patients. The coverage value is the proportion of subjects with that feature in their clinical notes. This is visualized with color-coded highlighting of clinical notes (see figure 11) that corresponds to the list of features included in the algorithm.

  • CLARK searches for all features listed in Expressions and updates each feature’s coverage value as matching terms are found in clinical notes (a minimal sketch of the calculation follows this list).

  • Coverage may decrease for some features if sections containing them are ignored from analysis.
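In other words, a feature’s coverage is the share of patients with at least one match in any of their notes. A minimal sketch of that calculation, using hypothetical notes:

    import re

    # Hypothetical notes, keyed by patient id.
    notes_by_patient = {
        "p1": ["Uses CPAP nightly.", "Reports better sleep."],
        "p2": ["No snoring reported."],
        "p3": ["CPAP titration scheduled."],
    }

    cpap = re.compile(r"(?i)\bcpap\b")

    # Coverage: proportion of patients with the feature in at least one note.
    n_with_feature = sum(
        any(cpap.search(note) for note in notes) for notes in notes_by_patient.values()
    )
    coverage = n_with_feature / len(notes_by_patient)
    print(f"CPAP coverage: {coverage:.2f}")   # 0.67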

Structured Data

In addition to features defined by regex in clinical notes, data from EHR-sourced domains contribute to the classification algorithm. To be included in the algorithm, structured data features must be specified on the Setup page under Structured Data. CLARK supports demographics, lab results, vital signs, and medications. Each tab under Structured Data lists available features, and a check mark will appear next to those included in analysis.

Patient Demographics
Figure 12: Select patient demographics.
  • Figure 12 displays how age may be included in the classification algorithm: as continuous (numeric) or binned. If binned, age is treated as a categorical variable; if numeric, it is treated as continuous. Patient date of birth is included in the EHR data, so age must be calculated by selecting a date relevant to the analysis.
  • Demographics such as gender, race, ethnicity, and marital status are simply included or excluded as categorical variables. Their values are not listed in the Setup page, but values should conform to FHIR-controlled terminology.
  • Specific values cannot be selected from demographics to be used in analysis. However, users may wish to include a dichotomized version of a variable, such as married/unmarried, in analysis. The dichotomized values must be coded before the domain is uploaded to CLARK.
Labs and Vitals
Figure 13: Vitals in CLARK.
  • Labs and vital signs are listed separately in Structured Data but are added to the algorithm in the same way. Each variable, or potential feature, can be added to the classification algorithm by selecting an aggregation method (a sketch of these aggregations follows the Meds description below):
    • Max: the maximum recorded value of this variable from each patient will be included.
    • Min: the minimum recorded value from each patient will be included.
    • Newest: The value from each patient’s most recent visit will be included.
    • Oldest: The value from each patient’s first recorded visit in the data set will be included.
Meds

Medications can be included as either “boolean” or “count” values in the classification. If “boolean” is selected, then patients will be split into those who have ever received the medication of interest and those who never have. If “count” is selected, then the number of times a patient was prescribed a selected medication will be factored into the algorithm.
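The aggregation options above amount to simple reductions over each patient’s records. A minimal sketch with hypothetical values, for illustration only:

    from datetime import date

    # Hypothetical lab records for one patient: (date, value).
    a1c = [(date(2020, 1, 5), 6.1), (date(2021, 3, 2), 7.4), (date(2022, 6, 9), 6.8)]

    print(max(v for _, v in a1c))              # Max:    7.4
    print(min(v for _, v in a1c))              # Min:    6.1
    print(max(a1c, key=lambda r: r[0])[1])     # Newest: 6.8
    print(min(a1c, key=lambda r: r[0])[1])     # Oldest: 6.1

    # Hypothetical medication orders for the same patient.
    metformin_orders = [date(2020, 1, 5), date(2021, 3, 2)]
    print(len(metformin_orders) > 0)           # Boolean: True (ever prescribed)
    print(len(metformin_orders))               # Count:   2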

Algorithm

Before classification algorithms can be selected and tested on the training data, regular expressions and structured data features must be selected in the previous page of CLARK. Then, continue to the Algorithm page by selecting “Continue” in the bottom-right corner of the Setup page.

Algorithm Steps - Training Data

  1. Under “Classifier” select an algorithm from the Algorithm drop-down menu. Options include Linear SVM, Gaussian Naive Bayes, Decision Tree, and Random Forest.
  2. Select “Cross-Validation” as an evaluation method when using the training corpus (training data) to train a classification algorithm.
  3. Select a cross-validation method from the drop-down menu: “random” or “stratified”.
  4. Select number of folds.
  5. Click “Explore Results” once the above information is complete. Users may then wish to return to Setup and adjust the algorithm as needed.

Algorithm Steps - Evaluation Data

Note: Before loading the evaluation data into CLARK, explore results from the gold standard (training) data. Once satisfied with a model’s performance on the training set, it is time to apply the algorithm to an evaluation data set.

  1. Change the Evaluation Method. Select “Evaluation corpus” to apply an algorithm to the unlabeled clinical notes.
  2. The Load Test Data button prompts a file explorer window where the user selects unlabeled patient data. This should be a folder in the same FHIR format as the training data. As on the Setup page, the user may browse through the list of unlabeled subjects’ data and view highlighted features in their clinical notes.
  3. Once the evaluation data set is successfully uploaded, select the desired algorithm from the drop-down list under “Classifier”.
  4. Select “Explore” in the bottom-right corner of the page to run the algorithm and view results.


Machine Learning Classifiers

CLARK employs classification algorithms created by SciKit Learn, a resource for machine learning in Python. A detailed description of each algorithm can be found in the technical appendix and at scikit-learn.org. A short sketch of the corresponding estimators follows the list below.

  • The Linear SVM algorithm uses features (found in structured data and notes) to ‘draw’ lines between data points to separate them into classes. In CLARK, patients are grouped based on their true labels in the training (gold standard) data, and their features are then associated with these groupings for use with the evaluation data.

  • The Gaussian Naive Bayes classifier uses the probability of each feature belonging to a group to predict the most likely label for each patient in the evaluation data set. This algorithm assumes that features are independent of each other: that the presence of one feature does not imply the presence or absence of another.

  • The Decision Tree algorithm classifies patients iteratively into subgroups based on the presence of features. The first “split” in a decision tree uses the most polarizing feature to distinguish into subgroups, then each subgroup is split by the most distinguishing feature, and so on until each subject is labeled. For example, a polarizing feature when classifying lung disease patients would be “smoking”. A decision tree would split the training data into two groups: those with matches to the “smoking” regular expression and those without.

  • The Random Forest algorithm employs the use of many decision trees to classify items into groups. Random samples are taken from the population, and subjects within each sample go through a decision tree with a random sample of features. The same subject appears in several different decision trees, and the algorithm selects its label with the most “votes” from all the decision trees. Decision trees within a random forest do not split samples based on the most polarizing feature, but rather a random feature at each split. This creates diversity between the trees and decreases dependence on a single important feature.
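For reference, these four options correspond to standard SciKit Learn estimators. A minimal sketch of how they might be instantiated; apart from the linear kernel and probability estimates noted for the SVM (see the technical appendix), the parameters shown here are library defaults and are an assumption rather than CLARK’s exact settings:

    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier

    classifiers = {
        "Linear SVM": SVC(kernel="linear", probability=True),  # linear kernel, probability estimates
        "Gaussian Naive Bayes": GaussianNB(),
        "Decision Tree": DecisionTreeClassifier(),
        "Random Forest": RandomForestClassifier(),
    }

    # Each estimator exposes the same interface:
    # clf.fit(X_train, y_train); clf.predict(X_new); clf.predict_proba(X_new)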


Cross-validation

K-fold cross-validation is a method of evaluating a classification model’s predicted performance on new data by simulating unlabeled data. In CLARK, this step is important to complete before loading the unlabeled evaluation data. Select a number of folds, k, to specify the number of subsets the training data set is split into.
Image from SciKit Learn

In k-fold cross-validation, one “test fold” is set aside, and the other k-1 “folds” are used to fit a model, which is then evaluated for classification accuracy on the test fold. Each of the k subsets serves as the test fold exactly once and contributes to model fitting in the other k-1 rounds. For example, a training data set with 300 subjects, using 10-fold cross-validation, would have 30 subjects in each fold. The user can select stratified cross-validation, which preserves the proportions of labels among subjects in each fold, or random, which simply fills each subset with a random sample of subjects.
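The “stratified” and “random” options correspond conceptually to SciKit Learn’s StratifiedKFold and KFold splitters (the exact splitter CLARK uses internally is an assumption here). A small sketch of 10-fold stratified cross-validation on placeholder data:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.ensemble import RandomForestClassifier

    # Placeholder data: 300 subjects, 5 features, binary labels (replace with real feature vectors).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 5))
    y = rng.integers(0, 2, size=300)

    # "Stratified": label proportions preserved in each of the k folds ("random" would use KFold instead).
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
    print(scores.mean())   # average accuracy across the 10 test folds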

Overfitting
The purpose of cross-validation is to help avoid overfitting a model to the training data. Overfitting can be caused by using too many features, or using training data that is “too clean”. If each feature in the training data maps to only one possible label, the algorithm cannot learn how to classify in more complicated situations. In other words, the model memorizes the training data and loses robustness to new data. Cross-validation simulates new data with the test fold; that is, CLARK temporarily blinds itself from the true labels of some subjects in the training data set. This ensures that the selected algorithm does not rely too heavily on a few influential features, and it self-checks its classifications on the test “unlabeled” data.


Explore

The Explore page is accessible after running cross-validation on a training set, or after an algorithm is transferred to the evaluation data. Trying various algorithms and cross-validation methods with training data helps in selecting a model to classify subjects in the unlabeled evaluation data set. The Explore page contains information about classification accuracy, CLARK’s confidence in its labels, and the distribution of labels among the study population. This page is useful for evaluating algorithm performance and choosing a confidence range to use with the unlabeled evaluation data.

Distribution by Labels

Figure 14 depicts how interactive graphics can be used to explore results. In this example, patients labeled as “1” are positive for the condition of interest, sleep apnea, and those labeled as “0” are negative. CLARK can handle classification problems with more than two possible groups.

  • Distribution by True Class Labels: This plot groups subjects by their true (known) labels, and is the only plot that will not be affected by changes to the algorithm made in Setup. Select a bar to view how CLARK classified patients belonging to one group. For example, after clicking the “1” bar, the other figures change to display how many truly positive patients were classified as positive or (falsely) negative.
  • Distribution by Classifier Labels: This plot groups subjects by how CLARK’s algorithm classified them. Selecting a bar on this plot can answer questions such as, “Of the patients that CLARK classified as being positive for sleep apnea, how many truly have the condition?”
  • Classification Accuracy: This circle chart displays the proportion of correctly and incorrectly classified subjects, based on their true labels and CLARK’s classification. The “Correct” and “Misclassified” areas will shift if a selection is made in the other plots. The user may also select a category on this plot to explore how “Correct” or “Misclassified” subjects are distributed.
  • Max Classifier Confidence Distribution: For each subject, CLARK computes a value demonstrating its confidence that the subject belongs to each group. Max Classifier Confidence is the greatest confidence value per subject, and corresponds to CLARK’s label for them. This plot displays the frequency and range of confidence values associated with CLARK’s classifications. Ideally, the higher ranges of confidence correspond to a greater proportion of correctly classified subjects in the training data set.
  • True Classifier Confidence Distribution: This plot displays the frequency and range of confidence values for each subject associated with the true label, which is not always the label selected by CLARK. Users may select a range of confidence levels to filter other plots on the page and help evaluate CLARK’s performance.

Figure 14 (hover to animate): Select a label to see the distribution of classifications computed in CLARK.


Similarly, the user may select the group labeled as “1” from the True Class Confidence Distribution to see how many patients in the training set truly are positive or negative. This investigation is helpful in both directions. If, for example, 50 patients in the training data are truly positive for an indication, and 50 are classified as positive, some may still be mislabeled.

Confidence

Generally, the Classification Accuracy chart indicates a better performance when it is mostly green (“Correct”). However, users should not expect or try to achieve a perfectly performing algorithm on the training data, as this may indicate overfitting and a loss of robustness. Since some misclassification is expected, the True Class Confidence Distribution plot will likely have a wider range and some lower confidence values compared to the Max Classifier Confidence Distribution plot.

Figure 15 (hover to animate): Select a range of confidence levels associated with classifications.


CLARK is meant to supplement human effort and help to decrease time spent on manual chart reviews. However, it may be wise to determine a confidence level at which patient notes and EHR data will be reviewed manually. The confidence distribution plots, in conjunction with the bar plots, can help guide users in selecting a “confidence threshold” to dictate which classifications to review in the unlabeled data. For example, say most patients in the training data are correctly classified at > 0.6 max classifier confidence. Then, the user may wish to manually review patient data for those with max classifier confidence values < 0.6.
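If results are exported (see Exporting Results), such a threshold can also be applied outside CLARK. A minimal sketch with pandas, assuming the export contains a “Max Conf” column as described under Filtered Records (the file name and column name are assumptions; check the exported file):

    import pandas as pd

    # Hypothetical path to an exported Filtered Records file.
    results = pd.read_csv("filtered_records.csv")

    # Flag low-confidence classifications (here, Max Conf below 0.6) for manual chart review.
    needs_review = results[results["Max Conf"] < 0.6]
    print(f"{len(needs_review)} of {len(results)} subjects flagged for manual review")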

When is it time to use the Evaluation data?
After several rounds of training, users may gain a sense of what level of performance is relatively ‘good’ based on confidence estimates and the proportion of correctly classified subjects in the training data. Given a thorough literature review, researchers may pre-specify the proportion of correctly classified subjects and corresponding confidence levels necessary to move from training to evaluation.

Filtered Records

The Filtered Records dataset, depicted in figure 16, lists the individual subjects who fall within the label or confidence range selected in the graphics above. If no confidence ranges or classifications are selected from the above section, all subjects are listed. Clicking on a cell in any subject’s row brings up a window that displays the highlighted features in their clinical notes; this may help indicate why CLARK grouped them correctly or not. Clicking on any column header toggles sorting in ascending/descending order.

Navigating Filtered Records
Where Misclassified=“No,” the True Label and Classifier Label columns will have the same value. Max Conf contains the maximum confidence level CLARK computed for a subject’s label. Each label’s column contains CLARK’s confidence level that a given subject belongs in that group, and the largest confidence value in a row will populate the Max Conf column. This corresponds to the label in Classifier Label.

Figure 16: Dataset of individual filtered results

For example, CLARK was 57% confident that patient 84330712 was positive for sleep apnea (labeled as “1”), and 43% confident that they are negative for sleep apnea (labeled as “0”), so they were misclassified. Finding misclassified subjects at various confidence levels can reveal if features or the algorithm need to be updated. When updating a model or list of included features, be careful of overfitting. CLARK is not intended to perfectly classify the training set.

Evaluation Data Results

CLARK’s Explore page for evaluation data results is different from that of the training data set. Since the true labels are unknown, the circle chart and “Distribution by True Class Labels” chart are not included.
Figure 17: Evaluation data results with true labels unknown
Similarly to the training data results, the user can navigate the evaluation data results by classifier label or by confidence distribution. However, because their true labels are unknown, there is no indication in the “Filtered Records” dataset of which subjects were misclassified. At this point, manually checking the classification through EHR and note review of some subjects may be helpful. The algorithm can still be adjusted and re-run without starting over from the training data set.
Figure 18: Evaluation data confidence distribution

Shown in figure 18, CLARK is at least 75% confident in classifications for 18/61 subjects whose true labels are unknown, but this does not necessarily mean that they are all correctly classified. On the other hand, relatively low confidence levels may warrant a manual review or algorithm adjustment.

Exporting Results

By clicking “Export Data” in the upper-left corner of the CLARK environment, the dataset shown in Filtered Records can be saved locally as a .csv file. These files can be opened in Excel or other software for more detailed evaluations of the classification model’s performance. Users may wish to calculate sensitivity and specificity or construct ROC curves using results from the training data.
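As one example of such a calculation, sensitivity and specificity can be computed from an exported training-data result file. A sketch assuming columns named “True Label” and “Classifier Label” with 1 indicating a positive label (the file name and column names are assumptions; adjust them to match the export):

    import pandas as pd

    results = pd.read_csv("training_filtered_records.csv")   # hypothetical exported file

    y_true = results["True Label"]
    y_pred = results["Classifier Label"]

    tp = ((y_true == 1) & (y_pred == 1)).sum()
    tn = ((y_true == 0) & (y_pred == 0)).sum()
    fp = ((y_true == 0) & (y_pred == 1)).sum()
    fn = ((y_true == 1) & (y_pred == 0)).sum()

    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    print(f"Sensitivity: {sensitivity:.2f}  Specificity: {specificity:.2f}")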

WARNING: When using CLARK to explore clinical data, be mindful that exported results will contain identified protected health information (PHI). Exported files should be saved in a location approved at your institution for storing PHI.


Technical Appendix

FHIR Details

The HL7 website includes extensive details on FHIR as well as example files. Example FHIR 4.0 files can be found here. The JSON format option is compatible with CLARK, and a folder of properly organized JSON files can be used for exploration of CLARK’s functionality.

Cross-validation

CLARK offers stratified and random cross-validation to use on the training corpus and help prevent overfitting.

  • With stratified cross-validation, the proportion of samples from each class are preserved in each fold. For example, if 30% of patients in the training set are labeled as “benign”, then roughly 30% of patients in each of k folds will also be “benign”. The purpose of stratification is for the training sets to mimic the population, and for the mean response value to be similar in each fold.
  • Random cross-validation simply randomly divides the training population into k folds. In a large data set with many subjects in each class, stratified and random cross-validation would result in similarly distributed classes in each of the k folds. In a small population, or one with some very rare classes, the random method may not consistently capture the best features for rarer classes. This is because some folds may not include certain rare classes at all.

CLARK Under the Hood
Before employing a chosen machine learning classifier, CLARK transforms clinical notes into a multi-dimensional feature vector based on the number of user-defined features (recall that a regular expression can match several different words). The number of matches for each regular expression are calculated for each sentence, then the vector of match counts is summed across all sentences in a note. The vectors are then summarized by patient by calculating the mean feature vector across all of the patient’s notes (this is why metadata fields in clinical notes are important). Once the features have been summarized for each patient, the selected algorithm uses this patient-level information to train a model.
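A stripped-down sketch of that transformation, to make the steps concrete; this illustrates the description above rather than CLARK’s actual code, and the sentence splitting here is deliberately naive:

    import re
    import numpy as np

    # User-defined features (regular expressions), in a fixed order.
    features = [re.compile(r"(?i)\bcpap\b"), re.compile(r"(?i)\bsleep(y|iness)?\b")]

    def note_vector(note):
        """Sum per-sentence regex match counts over all sentences in one note."""
        sentences = re.split(r"(?<=[.!?])\s+", note)   # naive sentence split, for illustration
        counts = np.zeros(len(features))
        for sentence in sentences:
            counts += [len(f.findall(sentence)) for f in features]
        return counts

    def patient_vector(notes):
        """Mean feature vector across all of a patient's notes."""
        return np.mean([note_vector(n) for n in notes], axis=0)

    # Hypothetical patient with two notes.
    print(patient_vector(["Sleepiness reported. CPAP prescribed.", "Tolerating CPAP. Sleep improved."]))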

Algorithms in detail

Decision Trees

Decision trees classify subjects based on the presence of certain features. Classification begins at the ‘root node’ containing the whole population, and points are classified stepwise by features. The algorithm makes a distinction at a ‘split’ for each feature. The first split employs the most polarizing feature of the sample, and further splits are then the next-most-distinguishing features of the resulting sub-groups at ‘child nodes’.

Image from iu.edu

The goal of each split is to create subsets as different from each other as possible, with the subjects in each resulting subgroup as similar as possible. The ‘leaf’ at the end of a decision tree run is the resulting classification, with each point in the population classified in a group. The decision tree includes every subject and every feature specified in the algorithm. In cross-validation, the subsamples’ trees include every feature but not every subject.

The probabilities returned by SciKit Learn decision trees are the number of training observations of a given class in a leaf divided by the total number of observations captured in that leaf during model training.

Decision trees are susceptible to overfitting when too many features are specified in the training set. To help avoid overfitting, consider using a subset of specified features in CLARK. Maximum depth limits how many successive splits the tree may make. CLARK does not specify a max depth; by default, nodes are expanded until the leaves are pure or until all leaves contain fewer than the minimum number of samples required to split.

Random Forests

The random forest is an ensemble method using several decision trees in classification. The decision trees created in a random forest differ in a few ways from the decision trees described above. Rather than using the most polarizing features at each split, the algorithm selects a subset of features at random for each tree. This creates diversity between the trees. Additionally, random bootstrap samples of the training population are used in the random forest, allowing the same patient to appear in several different trees. Bootstrap aggregation also contributes to model diversity.

Image from global software support

As each subject appears in several different decision trees, one may be assigned more than one classification in the entire forest. The model’s decision for a subject is the classification with the most ‘votes’ across the trees of the forest. The appeal of a random forest is the uncorrelated nature of the individual trees within. Bootstrap aggregation and random feature selection contribute to this and ensure that the trees do not err in the same direction. For classifications to be accurately predicted, the selected features (words and values found by regex) do need to hold some predictive power. Similarly to the decision tree algorithm from SciKit Learn, the random forest in CLARK uses default attributes for maximum depth.

Linear SVM

The Linear Support Vector Machine (SVM) defines a hyperplane in n-dimensional space to separate data points (in the context of CLARK, patients) into different groups. The SVM algorithm ‘draws’ an (n-1)-dimensional hyperplane between points with the maximum distance between groups in the training set; in two dimensions, for example, this boundary is a 1-dimensional line separating two groups. CLARK uses SciKit Learn’s SVM classifier with a linear kernel and probability estimates enabled.

Image from stanford.edu

The linear SVM algorithm used in CLARK employs nominal features: the words and values found by regex in clinical notes. Dummy variables are created to represent each feature, indicating the presence or absence of a word or value matching a user-defined regex in a patient’s notes. These features determine where a patient lies in the feature space; each patient in the training set carries a known label, and the algorithm attempts to separate the labels based on the presence of features in clinical notes.

After training, the algorithm applies the hyperplane boundaries defined on the training set to the evaluation corpus, so unlabeled patients are classified by the support vectors based on where they lie in n-dimensional space, where n is the number of features and a patient’s position is dictated by the presence of those features. Model parameters specify the priorities of the SVM: there is a trade-off between perfectly partitioning the points in the training set and the smoothness of the decision boundary.



Image from Research Gate

Gaussian Naive Bayes

The Gaussian Naive Bayes classifier assumes that feature values within each class follow a normal (Gaussian) distribution and that features are independent of one another; this strong independence assumption is the reason for “Naive” in the algorithm’s name. Given a true label, the algorithm calculates the probability of each feature being associated with that label. The algorithm then infers the most likely classification for each subject by applying Bayes’ theorem (hence the name of the algorithm) to the probabilities learned from the training set.

For example, the probability of an unlabeled animal being a bird, given that it has wings, is derived from the probability that an animal known to be a bird has wings mentioned in its “clinical notes”. The goal is to find the best classification given data, or the most likely patient label given the features found in clinical notes.
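With made-up numbers, the arithmetic looks like this (all probabilities below are hypothetical and serve only to show Bayes’ theorem in action):

    # Hypothetical probabilities, for illustration only.
    p_wings_given_bird = 0.95   # P(wings | bird): birds almost always have wings noted
    p_bird = 0.30               # P(bird): prior proportion of birds in the population
    p_wings = 0.40              # P(wings): overall proportion of animals with wings noted

    # Bayes' theorem: P(bird | wings) = P(wings | bird) * P(bird) / P(wings)
    p_bird_given_wings = p_wings_given_bird * p_bird / p_wings
    print(round(p_bird_given_wings, 2))   # 0.71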

Naive Bayes is robust to several possible classifications (more than two labels to distinguish) in a population. CLARK employs the default settings of Gaussian Naive Bayes from SciKit Learn.

General Troubleshooting

Loading .json files

CLARK will notify users of issues with training/evaluation data sets or regular expression files that have been uploaded. First, be sure that all files going into CLARK are saved as .json and comply with FHIR 4.0 standards. If the files still are not being loaded properly, it is most likely a formatting issue.

  • Files should be properly delimited
  • Special characters in free-text clinical notes may need special attention. Clinical notes should not contain any floating “\” characters unless preceding a quotation mark.
  • Other special characters in the training corpus, such as “\023”, can also inhibit correct parsing of the clinical notes, preventing CLARK from loading the set of notes. Since the clinical notes are saved in a .json file, the file can be opened with a text editor such as Notepad, and find/replace can be helpful for debugging issues with free-text notes; a quick parsing check is sketched below.
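A quick way to confirm whether a file parses at all is to run it through Python’s json module, which reports the line and column of the first problem. A minimal sketch (the file path is a placeholder):

    import json

    path = "Notes/notes.json"   # placeholder path to the file that fails to load

    try:
        with open(path, encoding="utf-8") as f:
            json.load(f)
        print("File parses as valid JSON.")
    except json.JSONDecodeError as e:
        # Reports the first offending position, e.g. a stray backslash or control character.
        print(f"Parse error at line {e.lineno}, column {e.colno}: {e.msg}")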
Regular Expressions

If features are not being highlighted in CLARK as expected, there is most likely a problem with some regular expressions.

  • Test and debug feature-defining regex using RegEx101.
  • Always include the double-backslash to act as a single backslash in .json files containing regex. For example, the word boundary regex “\b” should be entered as “\\b” in the .json file to function as “\b” in CLARK.
  • Verify that active regex are correctly referencing the regex library, and that new regular expressions are saved either in the CLARK session or as .json files.
Algorithm Training

In small sample sizes, or when some labels are rare in the training data, consider the following:

  • Consider the size of folds in cross-validation. If a higher value of k is selected, the training and test folds will be smaller. This occurrence can skew the distribution of rarer labels, and lead to them being ignored as an option for classification. If a certain classification occurs only a few times in the training set, CLARK is less likely to build a strong association with features compared to other classifications.
  • CLARK can only classify subjects in the evaluation data using choices presented in the training data set.
Exporting Results
  • The “Filtered Records” dataset from the Explore page can be exported locally. Check the formatting of the Excel output against what appears in CLARK. In some versions of Excel or some operating systems, column headers may be shifted to the left. Simply rename the affected columns based on “Filtered Records” before summarizing results or conducting a secondary analysis.