CLARK v1 is a machine-learning classification software created by NC TraCS and CoVar Applied Technologies to enable computable phenotyping in unstructured data. CLARK’s user-friendly interface makes natural language processing (NLP) an accessible option for searching free-text clinical notes. This page includes user instructions and technical documentation for CLARK v1.

For instructions on CLARK v2, go here.

For a conceptual guide to CLARK including research applications and interpretation of results, go here.

Both CLARK v1 and v2 are free and available for download here.

Table of Contents

Getting Started
      System Requirements
      Installation
      CLARK: Basic Steps
      Navigation
      Loading and Saving Progress

Key Concepts
   Clinical Notes
      Formatting
      Metadata
      Free Text

   Regular Expressions
      Basic Regular Expressions
      Section Break
      Clinical Examples

Training Corpus
      Loading the Training Corpus
      Troubleshooting

Features
   Algorithm Setup
      Training Corpus
      Regular Expressions Library
      Active Regular Expressions
      Sectioning
   Notes
      Patients and Notes
      Note with Additional Markup
      Using the Notes Viewer

Algorithm
      Algorithm Steps-Training Corpus
      Algorithm Steps-Evaluation Corpus
      Machine Learning Classifiers
      Cross-Validation

Explore
      Distribution by Labels
      Confidence
      Filtered Records
      Evaluation Corpus Results
      Exporting Results
      Sensitivity and Specificity

Technical Appendix
      Cross-validation
      Algorithms in detail
      General Troubleshooting

Getting Started

System Requirements

CLARK runs best on Windows machines with 16 GB of RAM and does not require special infrastructure to operate. Processing may take longer with 8 GB of RAM.

Installation

CLARK can be downloaded from tracs.unc.edu after creating a free account with NC TraCS. Follow the instructions under the “Sign In” menu or click here to create an account. Un-zip the downloaded files using an application such as WinZip or 7-Zip, then double-click CLARK Installer 1.0.3.exe and read README.txt and the license agreement. CLARK opens automatically once the installation finishes.
Figure 1 (hover to animate): Load a set and view individual notes in CLARK to get started.


Using the Animal Dataset
CLARK comes with a few example files to practice with: AnimalCorpus_V2.json, AnimalCorpus_asMRN.json, AnimalKeywords.json, and AnimalExpressions.json. The AnimalCorpus_V2.json and AnimalCorpus_asMRN.json files mimic labeled clinical notes, where each animal represents a patient. In AnimalCorpus_asMRN.json, IDs are represented with medical record numbers (MRNs) instead of animal names, and more metadata fields are included. Both files can be used in CLARK and contain the same “notes.” This documentation uses examples from AnimalCorpus_V2.json. AnimalKeywords.json contains input for the Regular Expressions Library, and AnimalExpressions.json can be uploaded to the Active Regular Expressions tab in CLARK.


CLARK: Basic Steps

Generally, CLARK can be used in the following steps:

  1. Form a classification question, then identify groups and group-defining features of interest through a literature review.

  2. Select a training (“gold standard”) corpus and evaluation (unlabeled) corpus of clinical notes. Gather, process, and load clinical notes into CLARK on the Training Corpus page.

  3. Format features (words, phrases, or values) as regular expressions to match text in the body of clinical notes.

  4. Iteratively train and assess algorithms and combinations of features until satisfied with CLARK’s performance on the labeled training data.

  5. Transfer algorithm of choice to the unlabeled evaluation corpus.

  6. Review, interpret, and use classification results in context.


Loading and Saving Progress

Click More at the top of the CLARK environment to reveal a drop-down menu with options for saving and loading work.
Figure 2: Re-start, load, or save work under “More”.

Saving a session preserves regular expressions in use, clinical notes in use, and current algorithm results. Sessions are saved as .zip files. Previously saved sessions can be loaded from the More menu or on the home screen. There is no need to un-zip a saved session to load it into CLARK again. The contents of the .zip file can be viewed locally, but the corpora and regex files are compressed into a single .json. Regular expressions created/updated in CLARK can be saved individually from the Training Corpus page. To save the results of several algorithms for one set of clinical notes, save the sessions separately.

When using CLARK to explore clinical data, be mindful that the saved session will contain identified protected health information (PHI). The session should be saved in a location approved at your institution for storing PHI.


Key Concepts

Clinical Notes

Clinical notes contain free-text patient data ranging from family history to current symptoms and imaging results. A set of clinical notes is called a corpus. CLARK requires two different corpora for a complete classification: the “gold standard” training set with known patient classifications (labels), and the unlabeled “evaluation” set. Machine learning algorithms are trained first using the gold standard data, then applied to the unlabeled evaluation set for classification. For use in CLARK, these notes must be converted to a specific format that renders a whole set usable and searchable.

Creating Gold Standards
Researchers can build a gold standard corpus through manual chart review for a subset of the patient population, leaving the rest of the population unlabeled for the evaluation corpus. Each patient should have one distinct label, and patients of the same group should have identical labels. For example, CLARK recognizes “diabetic”, “diabetes”, and “T1D” as different groups; however, these terms indicate the same characteristic and should share a common label in the corpus.
Gold standard labels provide CLARK’s classification options for the evaluation corpus, so all possible groups should be represented in the training data. The size of a gold standard set depends on several factors: analysis goals, rarity of condition of interest, population size, and desired confidence level for classifications.

Formatting

When first extracted from a database, clinical notes vary in format across institutions. A corpus must be saved as a .json file in which each individual note is enclosed in { } brackets, the whole corpus is enclosed in [ ] brackets, and fields are separated by commas.
Figure 3: Sample formatting for one entry in a corpus.

For clarity in this example, each field is on a separate line; this is not necessary in practice, as long as entries are properly separated with punctuation. The name and content of each field are enclosed in quotation marks, and a colon separates the field name from its content.
All fields other than “note” will populate CLARK as structured metadata. Here is a sample script for formatting clinical notes in R.
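The linked sample script is in R; as a rough Python equivalent for illustration only, the sketch below converts a CSV export into a CLARK-ready corpus, assuming hypothetical column names that match the metadata fields described in the next section:

```python
# Hypothetical sketch: convert a CSV export of notes into a CLARK-ready .json corpus.
# Column names (MRN, noteID, noteDate, label, note) are assumptions based on the
# metadata fields described in this documentation; adjust them to your own export.
import csv
import json

def csv_to_clark_corpus(csv_path, json_path):
    entries = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            entries.append({
                "MRN": row["MRN"],
                "noteID": row["noteID"],
                "noteDate": row["noteDate"],    # e.g. "2016-11-13 19:45:00"
                "label": row.get("label", ""),  # leave blank for an evaluation corpus
                "note": row["note"],            # free-text body of the clinical note
            })
    with open(json_path, "w", encoding="utf-8") as f:
        # The whole corpus is a [ ] list; each note is a { } object.
        json.dump(entries, f, indent=2)

csv_to_clark_corpus("notes_export.csv", "corpus.json")
```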


Metadata

The Metadata header for notes in CLARK includes limited structured data for each subject in a corpus. Metadata is visible in the Note with additional markup pane. MRN, noteID, noteDate, and label are required. Other metadata fields, such as gender and age, are optional, and some notes may have these extra metadata while others do not. The Animal Corpus example includes the following fields:

Field | Description | Example | Required?
noteDate | Date and time of the clinical note; distinguishes a patient's several notes from each other. | "2016-11-13 19:45:00" | Yes
note | The free-text body of the clinical note. | (free text) | Yes
noteID | Distinct note ID; there may be several per MRN. | "1" | Yes
label | Classification label for a patient. Each distinct label in the training corpus is a possible group for patients in the evaluation corpus. | "mammal" | Yes*
MRN | Patient medical record number or other identifier. | "Aardvark" | Yes
noteCount | Counter of notes within a patient. | "1" | No
Figure 4: Metadata section visible in CLARK
  • The label* field should always be included in both corpora, but left blank for the evaluation corpus.
  • A CLARK-ready corpus includes each note as a separate entry, and notes require correctly populated metadata to be linked together.
  • The MRN connects several notes under the same patient so that, during classification, matched features are attributed to the correct subject.
  • The noteID field should be distinct for each note, especially if there are several notes per patient. Each note should inherently have a different date/time associated, but CLARK uses noteID to separate them.


Free Text

The “note” field is the only non-metadata field in a corpus, and this is where CLARK uses NLP to search for matched features. Text in the note fields may be just that: a block of text. However, sections from clinical notes can be preserved through syntax in .json files. Since notes can be viewed within CLARK, sectioning may be worthwhile for readability. The following note contains sections that CLARK treats as distinct; because the headers are specified in the note and recognized by regular expressions, individual sections can optionally be excluded.

Figure 5: Free-text clinical note entry including section breaks in a .json file
Figure 6: The entry from Figure 5 in CLARK.


  • Figure 5 is a simplified entry of a “corpus.json” with sections defined in the “note” field. The “\##” expression in the .json file causes a header to display in CLARK. “\##” also allows regular expressions to match a section header for optional exclusion, and “\n” creates a line break after the name of the header.
  • The number of “#” only affects the size of the header, and each header is recognized by CLARK and the built-in section break regex as the same. A clinical note can contain any number of sections.
  • Figure 6 displays the note “as is” under the “Note with Additional Markup” section of the CLARK environment.
  • To see the sections and “#” highlighted, navigate to the Sectioning tab in the note-viewing pane.
  • Section headers will be different sizes depending on the number of “#” in the sequence before the header in the clinical notes.
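As a rough illustration of how such an entry might be built and written to a .json file (the section names and note content here are invented; see Figure 5 for the exact conventions):

```python
# Hypothetical note entry with two sections, loosely modeled on Figure 5.
# Section names and content are invented; see Figure 5 for the exact conventions.
import json

entry = {
    "MRN": "Grasshopper",
    "noteID": "1",
    "noteDate": "2016-11-13 19:45:00",
    "label": "insect",
    # "##Description" marks a section header; the newline ends the header line
    # and appears as "\n" once the entry is written to the .json file.
    "note": "##Description\nGreen exoskeleton, long antennae. ##Habitat\nFound in tall grass.",
}

print(json.dumps(entry, indent=2))
```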



Regular Expressions

Basic Regular Expressions

A regular expression (regex) is a string of letters or numbers that uses additional special characters to define a search pattern in a body of text. The purpose of using regular expressions in CLARK is to robustly identify features (words, values, or phrases) in clinical notes that distinguish patient groups of interest from one another.

  • A single regular expression can match several words. For example, “\bfly?(ies|ight|ing|)\b” matches “fly”, “flight”, “flying”, and “flies” by offering alternate suffixes to “fly”, and the “?” qualifier makes “y” optional in the prefix.

Writing regular expressions does not require any text-mining expertise and can be learned through online tutorials. Regex101.com provides a useful interface to practice using regular expressions, and RexEgg.com includes an in-depth tutorial to help develop more complicated expressions. Note that CLARK uses Python-flavored regex, which is generally the same as other regex flavors; more information can be found here.
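Because CLARK uses Python-flavored regex, expressions can also be tested directly with Python's re module. In this small sketch of the example above, the group is written as non-capturing ((?:...)) only so that the full matched word is returned:

```python
# Quick check of the example expression above using Python's re module.
import re

pattern = re.compile(r"\bfly?(?:ies|ight|ing|)\b")
text = "The fly took flight; flying with other flies."

print([m.group(0) for m in pattern.finditer(text)])
# ['fly', 'flight', 'flying', 'flies']
```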

General Regular Expression Examples

RegEx | Purpose | Example | Matches
"(?i)" | Case-insensitive search | "(?i)heart" | "HEART", "Heart", "heart"
"\b" | Word boundary; a pair encloses a whole word or phrase | "\bhigh BP\b" | "high BP" (will not match "high" alone)
"[abc]" | Any single character a, b, or c | "arm[sy]" | "arms", "army"
"\d" | Any single digit 0-9 | "\d\d\d" | "456", "333"
"\D" | Anything that is not a digit | "\D\D\d" | "AB1"
"." | Any character except a line break | "..." | "abc", "a c"
"*" | The preceding item, zero or more times | "a*h*" | "ah", "aaaaahh"
"\" | Escapes a character that has a special function in regex so it is matched literally | "1\.5\+1\?" | "1.5+1?"

More examples of regular expressions can be found in the AnimalKeywords.json and AnimalExpressions.json files that come with the Animal Corpus example. Note that in the .json files, the backslash should be duplicated any time it is used. For example, the word boundary “\b” should be entered as “\\b” in a .json file.
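A small illustration of the doubling rule; "fish" is just an example pattern:

```python
# Why the backslash is doubled in .json files: a regex such as \bfish\b must appear
# as \\bfish\\b on disk so that it loads back as \bfish\b.
import json

regex = r"\bfish\b"                   # the expression as CLARK should see it
on_disk = json.dumps({"expr": regex})
print(on_disk)                        # {"expr": "\\bfish\\b"}  <- doubled on disk
print(json.loads(on_disk)["expr"])    # \bfish\b               <- restored when loaded
```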


Section Break


Figure 7: Section Break tab within CLARK

Once notes are loaded into CLARK, sections are identified using the “Section Break Regex”, found under the Section Definitions tab on the Training Corpus page. This regular expression, which the user can edit, recognizes any word functioning as a header: altogether, it matches any number of consecutive “#” characters, followed by any run of consecutive characters that are not a space.

  • # tells CLARK that a header is present, and #+ allows CLARK to match any number of consecutive #.
  • The [ ] brackets define a character class, in this case containing ^ and a space. The ^ is translated as “not”, so [^ ] means “not a space”.
  • The * qualifier repeats the character class zero or more times, matching as many characters as possible.
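Putting the pieces together, the pattern can be tested in Python (a sketch assuming the built-in expression amounts to "#+[^ ]*"; CLARK's actual expression may differ slightly, for example in how it treats line breaks):

```python
# Sketch of the section-break pattern described above, assuming it amounts to "#+[^ ]*":
# one or more "#" characters followed by a run of non-space characters (the header name).
import re

section_break = re.compile(r"#+[^ ]*")
note = "##Description The grasshopper has a green exoskeleton. ##Habitat Tall grass."

print(section_break.findall(note))
# ['##Description', '##Habitat']
```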

Instructions for including or excluding specified sections can be found here.


Clinical Examples

There are endless possible regular expressions that can match words and phrases in clinical notes. Below is a sample of features that could be used to classify some common conditions.

Clinical Regular Expression Examples

Name | RegEx | Matches
Obesity | (?i)\b(?<!not )obes(ity|e)\b | "obese", "obesity"
Tobacco | (?i)\b(?<!non-)(smok(ing|er|es))|(tobacco)|(cigar[est]{0,5})\b | "tobacco", "smoking", "smoker", "cigars", "cigarettes"
Coughing | (?i)\b(cough[ings]{0,3})|(wheez[inges]{0,3})\b | "cough", "coughs", "wheezing", "wheezes"
High Blood Pressure | (?i)\b(high\s(bp|blood pressure))|(hypertension)\b | "high BP", "hypertension", "high blood pressure"
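As a quick sanity check, any of these expressions can be tested with Python's re module; a sketch using the Obesity expression from the table:

```python
# Checking the Obesity expression from the table above with Python's re module.
import re

obesity = re.compile(r"(?i)\b(?<!not )obes(ity|e)\b")

for text in ["Patient is obese.", "History of OBESITY.", "Patient is not obese."]:
    print(text, "->", bool(obesity.search(text)))
# Patient is obese. -> True
# History of OBESITY. -> True
# Patient is not obese. -> False   (the (?<!not ) look-behind rejects a preceding "not ")
```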

Training Corpus

A CLARK analysis begins on the Training Corpus page. Here, the user loads data (notes and metadata) to be analyzed along with regular expressions.

What is a training corpus? The training corpus consists of “gold standard” data: a set of subjects’ clinical notes with their true labels included. Labels, or groups, are defined in context of the classification question. This set is used to build and assess an algorithm before attempting to classify patients of unknown groups. Training an algorithm is an iterative process, taking several rounds of feature selection and model performance evaluation. The Animal Corpus example that comes with CLARK includes gold standard data for animals of 5 classes: mammal, bird, fish, insect, and reptile. The features selected for this classification (defined in the example regular expressions files) help to distinguish the 5 groups.

Loading the Training Corpus

The training corpus can be loaded on CLARK’s home screen by selecting “Load a Corpus.” A file explorer pops up, then users may navigate to the location of the training corpus and double-click on a .json file to upload it. A green check appears once the notes are successfully loaded, and then they appear in the “Note with Additional Markup” pane.

Troubleshooting

If the training corpus or regular expressions files are not correctly formatted, CLARK will present this error: “Failed to Load Regular Expression Library”. To fix it, pay close attention to the structure of the example animal corpus materials and check that the following criteria in the .json files are satisfied:

  • Each expected field is populated. For example, “expr”: “regex” and “name”: “name” for active regular expressions, and the metadata fields for clinical notes.
  • The first character in the corpus and active regular expressions files should be an opening bracket "[" and the last should be a closing bracket "]". The regular expressions library file can begin with "{".
  • Clinical notes are enclosed with “{ }”.
  • Fields within notes, and entire notes, are separated by commas.
  • The last entry in a .json file should be followed by the closing bracket, not a comma.
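A short script can surface most of these issues before the corpus is loaded into CLARK. This is only a sketch; the required field names follow the metadata table earlier in this documentation:

```python
# Minimal pre-flight check of a corpus .json against the criteria listed above.
import json

REQUIRED = {"MRN", "noteID", "noteDate", "note", "label"}

def check_corpus(path):
    with open(path, encoding="utf-8") as f:
        corpus = json.load(f)   # fails loudly on stray commas, brackets, or bad escapes
    assert isinstance(corpus, list), "Corpus must be a [ ] list of { } note entries"
    for i, entry in enumerate(corpus):
        missing = REQUIRED - set(entry)
        if missing:
            print(f"Entry {i} is missing fields: {sorted(missing)}")

check_corpus("AnimalCorpus_V2.json")
```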

Features

The Features page allows users to define/upload regular expressions, define sections, and explore notes.

Figure 8: The features page in CLARK


Algorithm Setup

New Training Corpus

Here, users can upload a different training corpus without restarting a CLARK session from the home screen. As on the home screen, a file explorer window opens for selecting a corpus .json file.

Regular Expressions Library

The RegEx Library stores regular expressions to be used in different classifications. These may be included or suppressed throughout the feature selection process. Regexes may be loaded from a file or added directly in the CLARK interface. CLARK accepts a .json file, and its structure differs slightly from that of files uploaded to Active Regular Expressions.


Figure 10: RegEx Library in “Algorithm Setup.”       

Figure 9: RegEx Library input as .json
  • Features saved in the RegEx Library are not automatically used in the classification algorithm. They must be called in Active Regular Expressions, either in the .json file uploaded to CLARK or directly in the program.
  • When specified in the .json file, the example in Figure 9 populates CLARK as shown in Figure 10.
  • The “+” icon adds a new feature to the RegEx Library, which can be saved locally using the floppy disk icon (the default format is CLARK-compatible .json).
  • The up and down arrows move a feature within the list; they do not affect whether a feature is included in the algorithm.
  • To delete a feature, click on its row, then use the “X” icon to the right of the “+” icon.


Active Regular Expressions

Similarly to entries in the RegEx Library, Active RegEx can be added, deleted, edited, arranged, and exported locally. The “X” icon deletes an active feature, and it will no longer be used by the algorithm or highlighted in clinical notes. If a deleted active feature calls on a feature in the RegEx Library, it still appears in the Library. There are three ways a user can add regular expressions to Active Regular Expressions.

1. Load a .json file containing regular expressions
To load a .json file into Active Regular Expressions, click on the folder icon in the Active Regular Expressions tab under Algorithm Setup. An “Active RegEx.json” file can include new regular expressions or reference the RegEx Library. The regular expression and the name must be on different lines, and each must be specified with “expr” or “name”. The “expr” line includes the regular expression with special characters, and the “name” line includes the label for that feature.

Figure 11: Active RegEx tab in CLARK
   Figure 12: Active RegEx input as .json


2. Enter regex directly into CLARK interface

  • Use the “+” button to add a row under the Active Regular Expressions.
  • Double-click on the empty field under “NAME” to name a regex, and enter the regular expression syntax under “REG.EXP”.
  • The “COMPILED” field will automatically populate if the regex syntax is valid. If not, the row will highlight in red until the regular expression syntax is fixed.

3. Add from Regular Expression Library

  • To include a feature from the RegEx Library, it can either be referenced in the “Active Regex.json” file, or a reference can be created in the Active RegEx tab.
  • In the example from figures 11-12, the “parent” regular expression is defined in the AnimalKeywords.json file, and called to be used in analysis by “#parent” in the AnimalExpressions.json file. When calling a regex from the Library, do not duplicate the regular expression.
  • If other regex are defined in the ‘regex library.json’ file, but not in the ‘active regex.json’ file upon loading into CLARK, create a reference under Active RegEx. The reference should simply copy the “name” from the library, and include “#name” in the Reg.Exp field under Active Regex.
  • Active regular expressions created or deleted within CLARK are not automatically saved back to the .json file; these changes are preserved when the CLARK session is saved.

Differences between Active RegEx and the RegEx Library:
The .json file uploaded to the Library has one line per feature. To include these features in analysis, the Active Regex file must reference the desired features by name. To exclude an active regular expression from analysis, it must be deleted from the Active RegEx pane. To exclude a feature that is defined in the Library, simply do not call it with “#feature” in the Active RegEx pane; it can remain in the Library pane.
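As a rough illustration of this relationship (not CLARK's exact file schema; see Figures 9 and 12 and the AnimalKeywords.json / AnimalExpressions.json example files for the real layout), the library can be thought of as named patterns and the active list as entries that either define a pattern inline or reference a library entry by "#name":

```python
# Rough illustration of the library/active relationship described above.
# This is NOT CLARK's exact file schema; patterns here are invented.

# Library: named patterns available for reuse (one feature per line in the .json file).
library = {
    "parent": r"\bparent(s|ing|)\b",
    "feathers": r"\bfeather(s|ed|)\b",
}

# Active regular expressions: what the algorithm actually uses.
active = [
    {"name": "wings", "expr": r"\bwings?\b"},   # defined inline
    {"name": "parent", "expr": "#parent"},      # reference to the library entry by #name
]

# Resolving references before use (the library pattern is not duplicated in the active file).
resolved = [
    {"name": a["name"],
     "expr": library[a["expr"][1:]] if a["expr"].startswith("#") else a["expr"]}
    for a in active
]
print(resolved)
```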


Sectioning

Entire subjects or standalone clinical notes cannot be excluded once loaded into CLARK. However, sections of a clinical note can be excluded across the corpus. This is done using regular expressions in the Section Definitions tab. Sections, if included, are separated by headers in the free-text clinical notes within the corpus. They should already be defined in the corpus.json file before uploading to CLARK. This portion of the documentation explains how to make CLARK recognize and select sections. Instructions for defining sections in clinical notes can be found here.

By default, all sections of a note are searched for features to be used in classification. To match a section with a regex so that it can be included or excluded, add a row to the Section Definitions tab using the “+” button, then type the exact header of the section in the REG. EXP box between word boundaries (“\b” and “\b”).
Figure 13. Features found by regex are highlighted in notes.

The section specified will be highlighted in red if not in use, or blue if in use. To toggle section usage, double-click on the box under IN USE and select from the drop-down list. Sections excluded from the training corpus will also be excluded from the evaluation corpus.

When to Use Sectioning
If users expect certain information is irrelevant to classification, sectioning can come in handy. For example, conservation efforts for animals may not tell us anything new about how they are classified. In clinical notes, general family history might not reveal anything new about a condition among all the other information provided. Excluding sections of clinical notes can also speed up the algorithm when there is a large volume of notes or patients to process.


Notes

Patients and Notes

The pane under Training Corpus includes a list of ID numbers (or MRNs) and clinical notes for each patient by date. Scrolling through and selecting a patient ID and note date changes the note in the next pane, under Note with additional markup. By default, the note associated with the earliest date shows up when a patient is selected.
Figure 14 (hover to animate). Features are highlighted in notes.

Note with Additional Markup

This pane includes three tabs, allowing users to explore notes with different views:

  • Note: View the selected clinical note “as is.”
  • Features: View notes with features (terms found with regular expressions) highlighted in different colors.
  • Sectioning: View notes with sections and inclusion status of sections highlighted (optional).

Using the Notes Viewer

  • Under the “Features” selection, a quick visual check shows that the grasshopper has decent coverage and several different features in its “clinical note”.
  • For example, “chitin”, “thorax”, and “antennae” will likely help distinguish the grasshopper as an insect, since one can expect these to be uncommon features in the other animal groups.
  • “Swim” and “wings” are found in the grasshopper’s notes, but also associated with birds or fish.
  • While some words or phrases are strongly associated with a certain group, simultaneous associations with other groups may muddle the classification.
  • To make decisions about features, researchers benefit from clinical expertise and literature reviews.


Algorithm

Algorithm Steps - Training Corpus

Note: Before classification algorithms can be selected and tested on the training corpus, the Active Regular Expressions section in the Features tab must be populated.

  1. Under “Configure a Classifier and Evaluation Method,” select an algorithm from the Algorithm drop-down menu. Options include Linear SVM, Gaussian Naive Bayes, Decision Tree, and Random Forest.
  2. Select an Evaluation Method. Select “Cross-Validation” when using the training corpus to train a classification algorithm.
  3. Select a cross-validation method from the drop-down menu: “random” or “stratified”.
  4. Select number of folds.
  5. Click “Cross-validate,” then “Explore Results.”
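For orientation, the steps above correspond roughly to the following SciKit Learn workflow, assuming a patient-by-feature matrix X of regex match counts and gold-standard labels y (CLARK builds these internally; see “CLARK Under the Hood” in the Technical Appendix). The data here are placeholders only:

```python
# Rough SciKit Learn sketch of the training-corpus workflow above; placeholder data only.
import numpy as np
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((60, 8))                          # 60 "patients", 8 regex-derived features
y = np.repeat(["mammal", "bird", "fish"], 20)    # placeholder gold-standard labels

algorithms = {
    "Linear SVM": SVC(kernel="linear", probability=True),
    "Gaussian Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
}

cv = StratifiedKFold(n_splits=5)                 # "stratified" method, 5 folds
for name, clf in algorithms.items():
    scores = cross_val_score(clf, X, y, cv=cv)
    print(f"{name}: mean cross-validated accuracy {scores.mean():.2f}")
```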

Algorithm Steps - Evaluation Corpus

Note: Before loading the evaluation corpus into CLARK, explore results from the gold standard, or training, data. Once satisfied with a model’s performance on the training set, it’s time to apply the algorithm to an evaluation corpus.

  1. Change the Evaluation Method. Select “Evaluation corpus” to apply an algorithm to the unlabeled clinical notes.
  2. The Load Evaluation Corpus button prompts a file explorer window where the user selects unlabeled patient data. This should be in the same .json format as the training corpus. As on the Features page, the user may browse through patient notes and highlighted features in unlabeled subjects’ clinical notes.
  3. Once the evaluation corpus is successfully uploaded, click Train Algorithm to train the selected algorithm on the training corpus. Then, click “Run Algorithm” and navigate to the Explore Results page.
    Figure 15: Steps to process the Evaluation Corpus


Machine Learning Classifiers

CLARK employs classification algorithms created by SciKit Learn, a resource for machine learning in Python. A detailed description of each algorithm can be found in the technical appendix and at scikit-learn.org.

  • The Linear SVM algorithm uses features to ‘draw’ lines between data points to separate them into classes. In CLARK, patients in the training corpus are grouped by their true labels, and their features are associated with these groupings for use on the evaluation corpus.

  • The Gaussian Naive Bayes classifier uses the probability of each feature belonging to a group to predict the most likely label for each patient in the evaluation corpus. This algorithm assumes that features are independent of each other: that the presence of one feature does not imply the presence or absence of another.

  • The Decision Tree algorithm classifies patients iteratively into subgroups based on the presence of features. The first “split” in a decision tree uses the most polarizing feature to divide subjects into subgroups; each subgroup is then split by its most distinguishing feature, and so on until each subject is labeled. For example, a polarizing feature when classifying animals would be “feathers”. A decision tree would split the animal corpus into two groups: those with matches to the “feathers” regular expression and those without.

  • The Random Forest algorithm employs the use of many decision trees to classify items into groups. Random samples are taken from the population, and subjects within each sample go through a decision tree with a random sample of features. The same subject appears in several different decision trees, and the algorithm selects its label with the most “votes” from all the decision trees. Decision trees within a random forest do not split samples based on the most polarizing feature, but rather a random feature at each split. This creates diversity between the trees and decreases dependence on a single important feature.


Cross-validation

K-fold cross-validation is a method of estimating a classification model’s performance on new data by simulating unlabeled data. In CLARK, this step is important to complete before loading the evaluation corpus. Select a number of folds, k, to specify the number of subsets the training dataset is split into.
Image from SciKit Learn

In k-fold cross-validation, one “test fold” is set aside, and the other k-1 “folds” are used to fit a model, which is then evaluated for classification accuracy on the test fold. Each of the k subsets is temporarily blinded and treated as the test sample exactly once, and is used for model fitting in the other k-1 rounds. For example, a training corpus with 300 subjects, using 10-fold cross-validation, would have 30 subjects in each fold. The user can select stratified cross-validation, which preserves the proportions of labels among subjects in each fold, or random, which simply fills each subset with a random sample of subjects.

Overfitting
The purpose of cross-validation is to help avoid overfitting a model to the training data. Overfitting can be caused by using too many features, or using training data that is “too clean”. If each feature in the training data maps to only one possible label, the algorithm cannot learn how to classify in more complicated situations. In other words, the model memorizes the training data and loses robustness to new data. Cross-validation simulates new data with the test fold; that is, CLARK temporarily blinds itself from the true labels of some subjects in the training corpus. This ensures that the selected algorithm does not rely too heavily on a few influential features, and it self-checks its classifications on the test “unlabeled” data.


Explore

The Explore page is accessible after running cross-validation on a training set, or after an algorithm is transferred to the evaluation corpus. Trying several different algorithms and cross-validation methods in the training corpus will help in selecting a model to classify subjects in the unlabeled evaluation corpus. The explore page contains information about classification accuracy, CLARK’s confidence in its labels, and the distribution of labels among the study population.

Distribution by Labels

The horizontal bar charts, from left to right, display the distribution of patients by their true classification labels and by CLARK’s predictions. The circle chart shows the proportion of correctly classified and misclassified patients. These graphics are interactive. For example, clicking on “bird” in the True Class Labels chart updates the circle chart and the bar chart of CLARK’s predictions, displaying how CLARK labeled the animals that were truly birds and how many were misclassified.
Figure 16 (hover to animate): Select a label to see the distribution of classifications computed in CLARK.


The user may also select the group labeled as birds from the chart on the right to see how many truly are birds or of another classification. This investigation is helpful in both directions. If 45 animals in the training corpus are truly birds, and 45 are classified as birds, some may still be mislabeled.

Confidence

When an algorithm is run on a (training or evaluation) corpus, CLARK records its confidence levels for each classification of a subject. Results can also be explored by highlighting confidence ranges in the vertical bar charts below the label distributions. Ideally, the higher ranges of confidence correspond to a greater proportion of correctly classified subjects in the training corpus. The “Classification Accuracy” graphics shift in distribution as different confidence ranges are highlighted.

Figure 17 (hover to animate): Select a range of confidence levels associated with classifications.


When interpreting results grouped by confidence, it is important to remember that CLARK should not perfectly label each subject in the training corpus; CLARK is meant to supplement human effort and help to decrease time spent on manual chart reviews. However, it may be wise to determine a confidence level at which patient notes will be reviewed manually.

When is it time to use the Evaluation Corpus?
After several rounds of training, users may gain a sense of which performance is relatively ‘good’ based on confidence estimates and the proportion of correctly classified subjects in the training data. Given a thorough literature review, researchers may pre-specify the proportion of correctly classified subjects and corresponding confidence levels necessary to move from training to evaluation.

Filtered Records

The Filtered Records dataset lists individual subjects who fall into the label or confidence range selected in the graphics above. If no confidence ranges or classifications are selected, all subjects are listed. Clicking on a cell in any subject’s row brings up a window that displays the highlighted features in their clinical notes; this may help indicate why CLARK grouped them correctly or not. Clicking on any column header toggles sorting in ascending/descending order.

Navigating Filtered Records
Where Misclassified=“No,” the True Label and Classifier Label columns will have the same value. Max Conf contains the maximum confidence level CLARK computed for a subject’s label. Each label’s column contains CLARK’s confidence level that a given subject belongs in that group, and the largest confidence value in a row will populate the Max Conf column. This corresponds to the label in Classifier Label.

Figure 18: Dataset of individual filtered results

For example, CLARK was 20% confident that the finch is a fish, and 80% confident that the finch is a bird, so it was correctly classified as a bird. However, CLARK was 78.89% confident that an anglerfish is a mammal, so it was misclassified. Finding misclassified subjects at different confidence levels can reveal if features or the algorithm need to be updated.
When updating a model or list of included features, be careful of overfitting. CLARK is not intended to perfectly classify the training set.

Evaluation Corpus Results

CLARK’s Explore page for evaluation corpus results is different from that of the training corpus. Since the true classifier labels are unknown, the circle chart and “Distribution by True Class Labels” chart are not included.
Figure 19: Evaluation corpus results with true labels unknown
Similarly to the training corpus results, the user can navigate the evaluation corpus results by classifier label or by confidence distribution. However, there is no indication in the “Filtered Records” dataset of which subjects were misclassified. At this point, manually checking the classification of some subjects may be helpful. The algorithm can still be adjusted and re-run without starting over from the training corpus.
Figure 20: Evaluation corpus confidence distribution

In this example, CLARK has a confidence of less than 50% for 29/114 unlabeled subjects. This does not necessarily mean that they are all misclassified, but relatively low confidence levels may warrant a manual review or algorithm adjustment.

Exporting Results

By clicking “Export Data” in the upper-left corner of the CLARK environment, the dataset shown in Filtered Records can be saved locally as a .csv. These files can be opened in Excel or other software for a more detailed evaluation of the classification model’s performance.

When using CLARK to explore clinical data, be mindful that exported results, like saved sessions, will contain identified protected health information (PHI). These files should be saved in a location approved at your institution for storing PHI.


Technical Appendix

Cross-validation

CLARK offers stratified and random cross-validation to use on the training corpus and help prevent overfitting.

  • With stratified cross-validation, the proportion of samples from each class is preserved in each fold. For example, if 30% of patients in the training set are labeled as “benign”, then roughly 30% of patients in each of the k folds will also be “benign”. The purpose of stratification is for the training sets to mimic the population, and for the mean response value to be similar in each fold.
  • Random cross-validation simply randomly divides the training population into k folds. In a large data set with many subjects in each class, stratified and random cross-validation would result in similarly distributed classes in each of the k folds. In a small population, or one with some very rare classes, the random method may not consistently capture the best features for rarer classes. This is because some folds may not include certain rare classes at all.
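A small sketch contrasting the two strategies, using SciKit Learn's KFold and StratifiedKFold splitters (whether CLARK uses exactly these classes internally is an assumption):

```python
# Contrast of random vs. stratified splitting using SciKit Learn splitters.
import numpy as np
from collections import Counter
from sklearn.model_selection import KFold, StratifiedKFold

y = np.array(["benign"] * 30 + ["malignant"] * 70)   # 30% / 70% label mix, as in the example above

splitters = [("random", KFold(n_splits=5, shuffle=True, random_state=0)),
             ("stratified", StratifiedKFold(n_splits=5, shuffle=True, random_state=0))]

for name, splitter in splitters:
    print(name)
    for _, test_idx in splitter.split(y.reshape(-1, 1), y):
        # Stratified folds keep roughly 6 "benign" / 14 "malignant" in every test fold;
        # random folds may drift from that mix, which matters for rare classes.
        print("  test fold:", Counter(y[test_idx]))
```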

CLARK Under the Hood
Before employing a chosen machine learning classifier, CLARK transforms clinical notes into a feature vector with one dimension per user-defined feature (recall that a regular expression can match several different words). The number of matches for each regular expression is counted for each sentence, then the vector of match counts is summed across all sentences in a note. The vectors are then summarized by patient by calculating the mean feature vector across all of the patient’s notes (this is why metadata fields in clinical notes are important). Once the features have been summarized for each patient, the selected algorithm uses this patient-level information to train a model.
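A simplified sketch of that transformation; the sentence splitting and example regexes here are illustrative placeholders, not CLARK's actual implementation:

```python
# Sketch: count regex matches per sentence, sum within each note, then average the
# note vectors across a patient's notes. Sentence splitting here is a crude placeholder.
import re
import numpy as np

features = {"wings": r"\bwings?\b", "feathers": r"\bfeather(s|ed|)\b"}   # illustrative regexes

def note_vector(note_text):
    sentences = re.split(r"[.!?]", note_text)
    counts = np.zeros(len(features))
    for sentence in sentences:
        for j, pattern in enumerate(features.values()):
            counts[j] += len(re.findall(pattern, sentence, flags=re.IGNORECASE))
    return counts   # one vector of match counts per note

def patient_vector(notes):
    return np.mean([note_vector(n) for n in notes], axis=0)   # mean across the patient's notes

print(patient_vector(["Feathers cover the wings.",
                      "The wings are used for flight. Wings fold over feathers."]))
# [1.5 1. ]  -> average match counts for "wings" and "feathers" across the two notes
```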

Algorithms in detail

Decision Trees

Decision trees classify subjects based on the presence of certain features. Classification begins at the ‘root node’ containing the whole population, and points are classified stepwise by features. The algorithm makes a distinction at a ‘split’ for each feature. The first split employs the most polarizing feature of the sample, and further splits are then the next-most-distinguishing features of the resulting sub-groups at ‘child nodes’.

Image from iu.edu

The goal of each split is to create subsets as different from each other as possible, with the subjects in each resulting subgroup as similar as possible. The ‘leaf’ at the end of a decision tree run is the resulting classification, with each point in the population classified in a group. The decision tree includes every subject and every feature specified in the algorithm. In cross-validation, the subsamples’ trees include every feature but not every subject.

The probabilities returned by SciKit Learn decision trees are the number of training observations of a given class in a leaf divided by the total number of observations captured in that leaf during model training.

Decision trees are susceptible to overfitting when too many features are specified in the training set. To help mitigate this in CLARK, consider reducing dimensionality by selecting subsets of RegEx-specified features. Maximum depth limits how far the tree is allowed to grow. CLARK does not specify a max depth; by default, nodes are expanded until the leaves are either purely separated or all leaves contain fewer than the minimum number of samples required to split.

Random Forests

The random forest is an ensemble method using several decision trees in classification. The decision trees created in a random forest differ in a few ways from the decision trees described above. Rather than using the most polarizing features at each split, the algorithm selects a subset of features at random for each tree. This creates diversity between the trees. Additionally, random bootstrap samples of the training population are used in the random forest, allowing the same patient to appear in several different trees. Bootstrap aggregation also contributes to model diversity.

Image from global software support

As each subject appears in several different decision trees, one may be assigned more than one classification in the entire forest. The model’s decision for a subject’s classification is the leaf in which the subject appears most often, or the classification with the most ‘votes’ from the forest. The appeal of a random forest is the uncorrelated nature of the individual trees within. Bootstrap aggregation and random feature selection contribute to this and ensure that the trees do not err in the same direction. For classifications to be accurately predicted, the selected features (words and values found by regex) do need to hold some predictive power. Similarly to the decision tree algorithm from SciKit Learn, the random forest in CLARK uses default attributes for maximum depth.

Linear SVM

The Linear Support Vector Machine (SVM) defines a hyperplane in n-dimensional space to separate data points (in the context of CLARK, patients) into different groups. The SVM algorithm ‘draws’ an (n-1)-dimensional hyperplane between points with the maximum distance between groups in the training set. For example, in two dimensions, a one-dimensional line is drawn to distinguish between two groups. CLARK uses SciKit Learn’s SVM classifier with a linear kernel and probability estimates enabled.

Image from stanford.edu

The linear SVM algorithm used in CLARK employs nominal features: the words and values found by regex in clinical notes. Dummy variables are created to represent each feature, indicating the presence or absence of a match to a user-defined regex in a patient’s notes. These features determine where a patient lies in the feature space, and the algorithm attempts to group training patients according to their known labels based on the presence of features in their clinical notes.

After training, the algorithm applies the hyperplane boundaries defined on the training set to the evaluation corpus: unlabeled patients are classified based on where they lie in n-dimensional space, where n is the number of features and a patient’s position is dictated by the presence of those features. Model parameters set the priorities of the SVM; there is a trade-off between perfectly partitioning the points in the training set and the smoothness of the boundary.



Image from Research Gate
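A minimal sketch of this train-then-apply flow with the linear-kernel, probability-enabled classifier mentioned above; the feature profiles and labels are invented toy data:

```python
# Train on labeled toy patient-by-feature vectors, then classify an "unlabeled" patient.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = np.vstack([rng.random((10, 2)) + [2, 0],   # "bird"-like feature profiles
                     rng.random((10, 2)) + [0, 2]])  # "fish"-like feature profiles
y_train = np.array(["bird"] * 10 + ["fish"] * 10)

clf = SVC(kernel="linear", probability=True)         # linear kernel, probability estimates enabled
clf.fit(X_train, y_train)

X_eval = np.array([[2.3, 0.4]])                      # an "unlabeled" patient from the evaluation corpus
print(clf.predict(X_eval))                           # predicted label, e.g. ['bird']
print(clf.predict_proba(X_eval))                     # per-class confidence levels
```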

Gaussian Naive Bayes

The Gaussian Naive Bayes classifier assumes that the values of each feature follow a normal (Gaussian) distribution within each class and, as described above, that features are independent of one another; this independence assumption is the reason for “Naive” in the algorithm’s name. Given a true label, the algorithm calculates the probability of each feature being associated with that label. It then infers subject classifications by applying Bayes’ theorem to the training set (hence the rest of the algorithm’s name).

For example, the probability of an unlabeled animal being a bird, given that it has wings, is derived from the probability that an animal known to be a bird has wings mentioned in its “clinical notes”. The goal is to find the best classification given data, or the most likely patient label given the features found in clinical notes.

Naive Bayes is robust to several possible classifications (more than two labels to distinguish) in a population. CLARK employs the default settings of Gaussian Naive Bayes from SciKit Learn.

General Troubleshooting

Loading .json files

Occasionally, CLARK will throw an error upon uploading corpora or regular expression files. First, be sure that all files going into CLARK are saved as .json. If the files still are not being loaded properly, it is most likely a formatting issue.

  • Files should be properly delimited.
  • Special characters must be properly managed in free-text clinical notes. Clinical notes should not contain any floating “\” characters unless preceding a quotation mark.
  • Any sequence of “#” characters creates a section in the note, with the first word after “#” being the section header.
  • Other special characters in the training corpus, such as “\023”, can also inhibit correct parsing of the clinical notes, preventing CLARK from loading the corpus. Since the corpus is saved as a .json file, it can be opened with Notepad or another text editor; find/replace can be helpful for debugging issues in free-text notes.
Regular Expressions

If features are not being highlighted in CLARK as expected, there is most likely a problem with some regular expressions.

  • Test and debug using RegEx101
  • Be sure to include the double-backslash to act as a single backslash in .json files containing regex. For example, the word boundary regex “\b” should be entered as “\\b” in the .json file to function as “\b” in CLARK.
  • Verify that active regex are correctly referencing the regex library, and that new regular expressions are saved either in the CLARK session or as .json files.
Algorithm Training

When the training corpus is small, or some labels within it are rare:

  • Consider size of folds in cross-validation. If a higher value of k is selected, the training and test folds will be smaller. This occurrence can skew the distribution of rarer labels, and lead to them being ignored as an option for classification. If a certain classification occurs only a few times in the training set, CLARK is less likely to build a strong association with features compared to other classifications.
  • CLARK can only classify subjects in the evaluation corpus using choices presented in the training corpus. For example, if an animal is truly meant to be labeled as a marsupial, CLARK could not classify it as such using the given animal corpus because “marsupial” is not one of the labels in the training data.
Exporting Results
  • The “Filtered Records” dataset from the Explore page can be exported locally. Check the formatting of the Excel output against what appears in CLARK. In some versions of Excel or some operating systems, column headers may be shifted to the left. Simply rename the affected columns based on “Filtered Records” before conducting a summary of results or secondary analysis.